Adding Dropping Column instance into a Pipeline

Question

In general, we call df.drop('column_name', axis=1) to remove a column from a DataFrame. I want to add this step as a transformer in a Pipeline.

Example:

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler(with_mean=False)),
])

How can I do it?

Best Answer

You can wrap your Pipeline in a ColumnTransformer, which lets you select the columns that are processed by the pipeline. Columns that are not selected are dropped by default:

import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

col_to_exclude = 'A'
df = pd.DataFrame({'A': [0] * 10, 'B': [1] * 10, 'C': [2] * 10})

numerical_transformer = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(with_mean=False),
)

# Select every column whose name does not start with col_to_exclude;
# unselected columns are dropped by default (remainder='drop').
transform = make_column_transformer(
    (numerical_transformer, make_column_selector(pattern=f'^(?!{col_to_exclude})')),
)
transform.fit_transform(df)

NOTE: Here I use a regex pattern (a negative lookahead) to exclude column A.
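To make the regex behaviour concrete, here is a minimal sketch that calls make_column_selector directly on a DataFrame to list the columns it picks. The extra column 'AB' is a hypothetical addition, included only to show that a lookahead without a `$` anchor also rejects names that merely start with the excluded prefix, whereas an anchored one rejects only the exact name.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical columns; 'AB' is added only to illustrate prefix matching.
df = pd.DataFrame({'A': [0] * 3, 'AB': [1] * 3, 'B': [2] * 3})

# Unanchored lookahead: rejects every column whose name starts with 'A'.
loose = make_column_selector(pattern=r'^(?!A)')

# Anchored lookahead: rejects only the column named exactly 'A'.
strict = make_column_selector(pattern=r'^(?!A$)')

loose_cols = loose(df)    # a selector is callable and returns column names
strict_cols = strict(df)

print(loose_cols)
print(strict_cols)
```

If column names can share a prefix, the anchored form is probably the safer choice.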

The simplest way is to use the special transformer value 'drop' in sklearn.compose.ColumnTransformer, together with remainder='passthrough' so that the other columns are kept:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Specify columns to drop
columns_to_drop = ['feature1', 'feature3']

# Create a ColumnTransformer that drops the listed columns;
# remainder='passthrough' keeps all other columns (the default would drop them too)
preprocessor = ColumnTransformer(
    transformers=[('column_dropper', 'drop', columns_to_drop)],
    remainder='passthrough',
)

pipeline = Pipeline(steps=[('preprocessing', preprocessor)])

# Transform the DataFrame using the pipeline
# (df is assumed to be an existing DataFrame that contains the columns above)
transformed_data = pipeline.fit_transform(df)
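A small sketch of how the remainder parameter changes the result. The data values are hypothetical; only the column names come from the snippet above. ColumnTransformer defaults to remainder='drop', so unlisted columns are discarded unless remainder='passthrough' is set.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer

# Hypothetical sample data; only the column names come from the text above.
df = pd.DataFrame({
    'feature1': [1, 2, 3],
    'feature2': [10, 20, 30],
    'feature3': [0.1, 0.2, 0.3],
})
columns_to_drop = ['feature1', 'feature3']

# Default remainder='drop': unlisted columns are discarded as well.
drop_all = ColumnTransformer([('column_dropper', 'drop', columns_to_drop)])

# remainder='passthrough': unlisted columns are kept unchanged.
keep_rest = ColumnTransformer(
    [('column_dropper', 'drop', columns_to_drop)],
    remainder='passthrough',
)

dropped = drop_all.fit_transform(df)
kept = keep_rest.fit_transform(df)

print(dropped.shape)  # no columns survive
print(kept.shape)     # only feature2 survives
```

Note that the result is a NumPy array by default, so the column labels are not carried over.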

Accepted Answer

You can write a custom transformer like this:

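from sklearn.base import BaseEstimator, TransformerMixin

class columnDropperTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; dropping columns is stateless
        return self

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)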

And use it in a pipeline:

import pandas as pd
from sklearn.pipeline import Pipeline

# sample dataframe
df = pd.DataFrame({
    "col_1": ["a", "b", "c", "d"],
    "col_2": ["e", "f", "g", "h"],
    "col_3": [1, 2, 3, 4],
    "col_4": [5, 6, 7, 8],
})

# your pipeline
pipeline = Pipeline([
    ("columnDropper", columnDropperTransformer(['col_2', 'col_3'])),
])

# apply the pipeline to the dataframe
pipeline.fit_transform(df)

Output:

  col_1  col_4
0     a      5
1     b      6
2     c      7
3     d      8
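To connect this back to the question, here is a minimal sketch that places a column-dropping step in front of the imputer and scaler from the question, all in one Pipeline. The class name ColumnDropper and the sample values (including the missing entry) are assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Hypothetical helper: drops the given columns from a DataFrame."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)


# Hypothetical sample data with one missing value in col_3.
df = pd.DataFrame({
    "col_1": ["a", "b", "c", "d"],
    "col_3": [1.0, None, 3.0, 4.0],
    "col_4": [5.0, 6.0, 7.0, 8.0],
})

pipeline = Pipeline(steps=[
    ("dropper", ColumnDropper(["col_1"])),         # remove the text column first
    ("imputer", SimpleImputer(strategy="mean")),   # then fill missing values
    ("scaler", StandardScaler(with_mean=False)),   # then scale to unit variance
])

result = pipeline.fit_transform(df)
print(result.shape)
```

The dropper has to come first here, because SimpleImputer with strategy='mean' cannot handle the string column.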
