Normally we call df.drop('column_name', axis=1) to remove a column from a DataFrame. I want to add this step as a transformer in a Pipeline.
Example:
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler(with_mean=False)),
])
How can I do it?
Best Answer
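You can write a custom transformer like this:
from sklearn.base import BaseEstimator, TransformerMixin

# Inheriting from the scikit-learn base classes provides get_params/set_params
# and fit_transform, so the step can be cloned and used anywhere in a Pipeline.
class columnDropperTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; dropping columns is stateless.
        return self

    def transform(self, X, y=None):
        return X.drop(self.columns, axis=1)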
And use it in a pipeline:
import pandas as pd
from sklearn.pipeline import Pipeline

# sample dataframe
df = pd.DataFrame({
    "col_1": ["a", "b", "c", "d"],
    "col_2": ["e", "f", "g", "h"],
    "col_3": [1, 2, 3, 4],
    "col_4": [5, 6, 7, 8],
})

# your pipeline
pipeline = Pipeline([
    ("columnDropper", columnDropperTransformer(['col_2', 'col_3'])),
])

# apply the pipeline to the dataframe
pipeline.fit_transform(df)
Output:
  col_1  col_4
0     a      5
1     b      6
2     c      7
3     d      8
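To tie this back to the pipeline in the question, the dropper can sit in front of the imputer and scaler as the first step. The following is only a minimal sketch: the class name ColumnDropper and the sample frame (including the missing value) are made up for illustration, and it assumes the columns that remain after dropping are numeric.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnDropper(BaseEstimator, TransformerMixin):
    """Drop the given columns from a DataFrame (illustrative name)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data.
        return self

    def transform(self, X):
        return X.drop(columns=self.columns)


# Made-up sample data: one text column and two numeric columns.
df = pd.DataFrame({
    "col_1": ["a", "b", "c", "d"],
    "col_3": [1.0, np.nan, 3.0, 4.0],
    "col_4": [5.0, 6.0, 7.0, 8.0],
})

pipeline = Pipeline(steps=[
    ("dropper", ColumnDropper(["col_1"])),        # remove the text column first
    ("imputer", SimpleImputer(strategy="mean")),  # fill the missing value
    ("scaler", StandardScaler(with_mean=False)),  # scale to unit variance
])

result = pipeline.fit_transform(df)
print(result.shape)
```

Because the imputer and scaler return NumPy arrays by default, the result here is an array with one column per remaining feature, not a DataFrame.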
You can wrap your Pipeline in a ColumnTransformer, which lets you select the columns that are processed by the pipeline, as follows:
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

col_to_exclude = 'A'
df = pd.DataFrame({'A': [0]*10, 'B': [1]*10, 'C': [2]*10})

numerical_transformer = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler(with_mean=False),
)

# Only the columns matched by the selector reach the pipeline;
# the rest are dropped (remainder='drop' is the default).
transform = make_column_transformer(
    (numerical_transformer, make_column_selector(pattern=f'^(?!{col_to_exclude})')),
)
transform.fit_transform(df)
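NOTE: I am using a regex pattern here to exclude column A. Because the negative lookahead is anchored only at the start of the name, it excludes every column whose name begins with "A", not just A itself.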
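To see which columns such a pattern keeps, the selector can be called directly on a DataFrame. This is a small sketch with a made-up frame; the extra column AB and the stricter pattern ending in `$` are assumptions added for illustration, not part of the answer above.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Made-up frame with two columns whose names start with "A".
df = pd.DataFrame({"A": [0] * 3, "AB": [1] * 3, "B": [2] * 3, "C": [3] * 3})

# Pattern from the answer: keep names that do NOT start with "A".
prefix_selector = make_column_selector(pattern="^(?!A)")
selected = prefix_selector(df)

# Stricter variant: exclude only the column named exactly "A".
exact_selector = make_column_selector(pattern="^(?!A$)")
exact = exact_selector(df)

print(selected)  # ['B', 'C']
print(exact)     # ['AB', 'B', 'C']
```

If several exact names need to be excluded, passing an explicit list of column names to the ColumnTransformer may be easier to read than a regex.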
The simplest way is to use the special transformer value 'drop' in sklearn.compose.ColumnTransformer:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Sample DataFrame (illustrative)
df = pd.DataFrame({'feature1': [1, 2], 'feature2': [3, 4], 'feature3': [5, 6]})

# Specify columns to drop
columns_to_drop = ['feature1', 'feature3']

# Create a pipeline with ColumnTransformer to drop columns.
# remainder='passthrough' keeps every column not listed; with the default
# remainder='drop' all columns would be removed.
preprocessor = ColumnTransformer(
    transformers=[('column_dropper', 'drop', columns_to_drop)],
    remainder='passthrough',
)
pipeline = Pipeline(steps=[('preprocessing', preprocessor)])

# Transform the DataFrame using the pipeline
transformed_data = pipeline.fit_transform(df)