I am currently trying to implement K-fold cross-validation for classification using sklearn in Python. I understand the basic concept behind K-fold cross-validation. However, I don't understand what `cross_val_score` is, what it does, and what role the CV iteration plays in producing the array of scores we get. Below is an example from the official documentation page of sklearn.

**Example 1**

```python
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_score

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
lasso = linear_model.Lasso()
print(cross_val_score(lasso, X, y, cv=3))
```

***Output***

```
[0.33150734 0.08022311 0.03531764]
```

Looking at Example 1, the output is an array of 3 values. I know that when we use `KFold`, `n_splits` is the parameter that sets the number of folds. So what does `cv` do in this example?

**My Code**

```python
from sklearn.model_selection import KFold, train_test_split, cross_val_score
from sklearn.naive_bayes import BernoulliNB

kf = KFold(n_splits=4, shuffle=False)
print('Get_n_splits', kf.get_n_splits(X), '\n\n')
for train_index, test_index in kf.split(X):
    print('TRAIN:', train_index, 'TEST:', test_index)
    x_train, x_test = df.iloc[train_index], df.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
print('\n\n')

# use train_test_split to split into training and testing data
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# fit / train the model using the training data
clf = BernoulliNB()
model = clf.fit(x_train, y_train)
y_predicted = clf.predict(x_test)
scores = cross_val_score(model, df, y, cv=4)
print('\n\n')
print('Bernoulli Naive Bayes Classification Cross-validated Scores:', scores)
print('\n\n')
```

Looking at My Code, I am using 4-fold cross-validation for a Bernoulli Naive Bayes classifier, with `cv=4` in the scoring call as below:

```python
scores = cross_val_score(model, df, y, cv=4)
```

The above line gives me an array of 4 values. However, if I change it to `cv=8` as below:

```python
scores = cross_val_score(model, df, y, cv=8)
```

then an array of 8 values is generated as output. So again, what does `cv` do here?

I did read the documentation over and over again and searched numerous websites but since I am a newbie, I really don't understand what cv does and how the scores are generated.

Any and all help would be really appreciated.

Thanks in advance


Best Answer


In K-fold cross-validation, the following procedure is followed:

  1. The model is trained using K-1 of the folds as training data
  2. The resulting model is validated on the remaining fold

This process is repeated K times, and a performance measure such as accuracy is computed at each step.
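The two steps above can be sketched by hand with `KFold`. This is a minimal illustration, not the questioner's exact code: the dataset is synthetic (`make_classification`), and `BernoulliNB` is used only because it appears in the question.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

# synthetic, illustrative dataset
X, y = make_classification(n_samples=100, random_state=0)

kf = KFold(n_splits=4)
scores = []
for train_index, test_index in kf.split(X):
    # step 1: train on the K-1 training folds
    model = BernoulliNB().fit(X[train_index], y[train_index])
    # step 2: validate on the held-out fold
    scores.append(accuracy_score(y[test_index], model.predict(X[test_index])))

print(scores)  # 4 accuracies, one per fold
```

`cross_val_score(model, X, y, cv=4)` automates exactly this loop, which is why it returns one score per fold.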

The diagram below, taken from the Cross-validation module of the scikit-learn documentation, gives a clear picture of this process.

[Image: K-fold cross-validation diagram from the scikit-learn documentation]

```python
>>> from sklearn.model_selection import cross_val_score
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = cross_val_score(clf, iris.data, iris.target, cv=5)
>>> scores
array([0.96..., 1. ..., 0.96..., 0.96..., 1. ])
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)
```

Here a single mean score is calculated from the five per-fold scores. By default, the score computed at each CV iteration is the `score` method of the estimator (for classifiers, that is accuracy).
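To see what `cv` controls: an integer `cv=k` is shorthand for a k-way splitter (for classifiers, scikit-learn stratifies the folds by default), and you can equivalently pass a `KFold` object. A short sketch, with the self-contained setup (imports, `load_iris`) filled in around the documentation snippet:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

iris = load_iris()
clf = SVC(kernel='linear', C=1)

# integer cv: 5 stratified folds for a classifier -> array of 5 scores
scores_int = cross_val_score(clf, iris.data, iris.target, cv=5)
# explicit splitter: 5 unshuffled folds -> also an array of 5 scores
scores_kf = cross_val_score(clf, iris.data, iris.target, cv=KFold(n_splits=5))

print(len(scores_int), len(scores_kf))  # 5 5
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_int.mean(), scores_int.std() * 2))
```

The two arrays may contain different values (stratified vs. plain splits), but each has exactly `n_splits` entries.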

The links below were helpful:

  1. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score

  2. https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

Cross-validation involves splitting the data into a training set and a validation set. The model is trained on the training set and then evaluated on the validation set, and this is repeated so that a different subset serves as the validation set each time. The final evaluation score is the average of the scores from the individual cross-validation folds.
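This directly answers the `cv=4` vs `cv=8` observation from the question: `cv=k` splits the data into k folds, runs k train/validate rounds, and returns one score per round. A small sketch with synthetic data (not the questioner's `df`) and `BernoulliNB` as in the question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# synthetic, illustrative dataset
X, y = make_classification(n_samples=200, random_state=0)
clf = BernoulliNB()

for k in (4, 8):
    scores = cross_val_score(clf, X, y, cv=k)
    # cv=k -> k train/validate rounds -> an array of k scores;
    # the usual single summary number is their mean
    print("cv=%d -> %d scores, mean=%.3f" % (k, len(scores), scores.mean()))
```

So changing `cv` from 4 to 8 does not change what is being measured, only how many folds the data is divided into, and hence how many scores come back.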