I have this kind of data (columns):

| year-month | client_id | Y | X1.. Xn |

Where Y indicates whether client client_id purchased the product in a given year-month, and the X columns are the explanatory variables. I have two years of monthly data, and I have done the split correctly with TimeSeriesSplit() as given in this answer. The problem now is that I want to run GridSearchCV() on that split, trying different models (RF, XGBoostClassifier(), LightGBM(), etc.) with different hyperparameters, but I can't figure out a way to use GridSearchCV() with the split I have.
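For reference, the split was produced along these lines (a minimal sketch; df and n_splits=5 here are placeholders, not my actual setup):

    from sklearn.model_selection import TimeSeriesSplit

    # df is assumed to be sorted chronologically by year-month
    tscv = TimeSeriesSplit(n_splits=5)
    for train_idx, test_idx in tscv.split(df):
        # each fold trains on an expanding window of earlier months
        # and tests on the block of months that immediately follows
        print(train_idx.shape, test_idx.shape)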

Any suggestions?


Best Answer


Assuming you have a splits DataFrame based on this question, first save the indices for each fold into an array of (train, test) tuples, i.e.:

    [(train_indices, test_indices),  # 1st fold
     (train_indices, test_indices)]  # 2nd fold, etc.

The following code will do this:

    import numpy as np

    custom_cv = []
    for FOLD_train, FOLD_test in zip(splits['train'], splits['test']):
        custom_cv.append((np.array(FOLD_train.index.values.tolist()),
                          np.array(FOLD_test.index.values.tolist())))
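Note that the cv parameter of GridSearchCV() accepts any iterable of (train_indices, test_indices) pairs, which is exactly what custom_cv is. As a quick sanity check (a sketch; the printed format is just illustrative):

    # each element of custom_cv is a (train_indices, test_indices) pair of integer arrays
    for i, (train_idx, test_idx) in enumerate(custom_cv):
        print(f"fold {i}: {len(train_idx)} train rows, {len(test_idx)} test rows")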

Then you can use GridSearchCV() in the following manner:

Here we create a dictionary with the classifiers and another dictionary with a parameter grid for each one.

This is just a sample; make sure to limit the search space when testing:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier  # the task is classification, so XGBClassifier rather than XGBRegressor

    dict_classifiers = {
        "Random Forest": RandomForestClassifier(),
        "Gradient Boosting Classifier": GradientBoostingClassifier(),
        "Linear SVM": SVC(),
        "XGB": XGBClassifier(),
        "Logistic Regression": LogisticRegression(),
        "Nearest Neighbors": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(),
    }

    params = {
        "Random Forest": {"max_depth": range(5, 30, 5),
                          "min_samples_leaf": range(1, 30, 2),
                          "n_estimators": range(100, 2000, 200)},
        "Gradient Boosting Classifier": {"learning_rate": [0.001, 0.01, 0.1],
                                         "n_estimators": range(1000, 3000, 200)},
        "Linear SVM": {"kernel": ["rbf", "poly"],
                       "gamma": ["auto", "scale"],
                       "degree": range(1, 6, 1)},
        "XGB": {"min_child_weight": [1, 5, 10],
                "gamma": [0.5, 1, 1.5, 2, 5],
                "subsample": [0.6, 0.8, 1.0],
                "colsample_bytree": [0.6, 0.8, 1.0],
                "max_depth": [3, 4, 5],
                "n_estimators": [300, 600],
                "learning_rate": [0.001, 0.01, 0.1]},
        "Logistic Regression": {"penalty": ["none", "l2"],
                                "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]},
        "Nearest Neighbors": {"n_neighbors": [3, 5, 11, 19],
                              "weights": ["uniform", "distance"],
                              "metric": ["euclidean", "manhattan"]},
        "Decision Tree": {"criterion": ["gini", "entropy"],
                          "max_depth": np.arange(3, 15)},
    }

    for classifier_name in dict_classifiers.keys() & params:
        print("training: ", classifier_name)
        gridSearch = GridSearchCV(estimator=dict_classifiers[classifier_name],
                                  param_grid=params[classifier_name],
                                  cv=custom_cv)
        gridSearch.fit(df[['X']].to_numpy(),             # should have shape (n_samples, n_features)
                       df[['Y']].to_numpy().reshape(-1)) # should be an array with shape (n_samples,)
        print(gridSearch.best_score_, gridSearch.best_params_)
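Once a search finishes, GridSearchCV (with the default refit=True) refits the best parameter combination on all the data passed to fit(), so the winner can be used for prediction right away. A minimal sketch; X_new is a placeholder for unseen feature rows, not something defined above:

    # best_estimator_ holds the winning model, already refit on the full training data
    best_model = gridSearch.best_estimator_
    predictions = best_model.predict(X_new)  # X_new: placeholder, shape (n_samples, n_features)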

Replace ['X'] with df.columns[pd.Series(df.columns).str.startswith('X')] in gridSearch.fit() if you want to pass in all columns whose names start with 'X' (e.g., 'X1', 'X2', ...) as the training features.
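Put together, that fit call would look something like this (a sketch, assuming the feature columns are literally named 'X1' ... 'Xn'):

    import pandas as pd

    # boolean mask over the column index: keep every column whose name starts with 'X'
    feature_cols = df.columns[pd.Series(df.columns).str.startswith('X')]
    gridSearch.fit(df[feature_cols].to_numpy(),  # shape (n_samples, n_features)
                   df['Y'].to_numpy())           # shape (n_samples,)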