An Easy Way for Data Preprocessing — Sklearn-Pandas

Gal Hever
4 min read · Nov 30, 2019

A Hands-On Guide for Sklearn-Pandas in Python

Introduction

This blog post will help you preprocess your data in just a few minutes using the Sklearn-Pandas package. When you first get a new raw dataset, you need to work hard to get it into the shape the model needs. This is usually a long and exhausting procedure (e.g. imputing missing values, handling categorical and numerical features), and Sklearn-Pandas can save much of that effort.

Let’s start with an example. I’ll use the “Movies Dataset” from Kaggle that includes 45K movies that were rated by 270K users. You can download the dataset from here.

Let’s Code!

First, let’s install and import the main packages that will be used and get the data:

pip install sklearn-pandas

# Note: CategoricalImputer is part of sklearn-pandas 1.x and was removed in 2.0.
from sklearn_pandas import DataFrameMapper, gen_features, CategoricalImputer
import sklearn.preprocessing
import pandas as pd
import numpy as np
movies = pd.read_csv('../Data/movies_metadata.csv')
ratings = pd.read_csv('../Data/ratings.csv')

Let’s take a look at the data.

movies.rename(columns={'id': 'movieId'}, inplace=True)
movies.info()
movies.isna().sum()
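
To make the inspection step concrete, here is a minimal sketch on a tiny made-up frame. The toy table and its values are assumptions for illustration and are not taken from the Kaggle files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies table (values are made up for illustration).
toy = pd.DataFrame({
    "movieId": ["862", "8844", "15602"],
    "budget": ["30000000", "65000000", "0"],
    "runtime": [81.0, np.nan, 101.0],
    "status": ["Released", None, "Released"],
})

# dtypes show which columns pandas read as text and which as numbers.
print(toy.dtypes)

# Number of missing values per column.
missing = toy.isna().sum()
print(missing)
```

Here "movieId" and "budget" look numeric but were read as text, which is the same kind of mismatch the real dataset shows.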

We can see that there are categorical and numerical features, but a few of the numerical features were read as strings. I'll fix the data types so they make sense.

# Keep digit-only strings; replace malformed values with 0.
movies['movieId'] = movies['movieId'].apply(lambda x: x if x.isdigit() else 0)
movies['budget'] = movies['budget'].apply(lambda x: x if x.isdigit() else 0)
# Unparseable dates become NaT instead of raising an error.
movies['release_date'] = pd.to_datetime(movies['release_date'], errors="coerce")
movies['movieId'] = movies['movieId'].astype('int64')
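
The effect of this cleanup is easier to see on a few rows. The sketch below applies the same calls to a small made-up frame; the malformed values are invented for illustration.

```python
import pandas as pd

# Made-up rows; the second row mimics a malformed record.
toy = pd.DataFrame({
    "movieId": ["862", "1997-08-20", "8844"],
    "budget": ["30000000", "/poster.jpg", "0"],
    "release_date": ["1995-10-30", "not a date", "1995-12-15"],
})

# Keep digit-only strings, replace anything else with 0.
for col in ["movieId", "budget"]:
    toy[col] = toy[col].apply(lambda x: x if x.isdigit() else 0)

toy["movieId"] = toy["movieId"].astype("int64")

# errors="coerce" turns unparseable dates into NaT instead of raising.
toy["release_date"] = pd.to_datetime(toy["release_date"], errors="coerce")

print(toy["movieId"].tolist())  # [862, 0, 8844]
print(toy["budget"].tolist())   # ['30000000', 0, '0']
print(toy["release_date"].isna().tolist())  # [False, True, False]
```

Note that "budget" is only cleaned, not cast, so it still holds strings and pandas does not yet treat it as a numeric column.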

For our example, we will use just a few of the features that will help us to understand the main concept of this package. Let’s drop the irrelevant features and start working with the package.

movies = movies.drop(['overview', 'homepage', 'original_title', 'imdb_id',
                      'belongs_to_collection', 'genres', 'poster_path',
                      'production_companies', 'production_countries',
                      'spoken_languages', 'tagline'], axis=1)

Now we will separate the features into four groups, each of which will be treated differently.

col_cat_list = list(movies.select_dtypes(exclude=np.number))
col_num_list = list(movies.select_dtypes(include=np.number))
col_date = ['release_date']
col_none = ['movieId']
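
As a quick illustration of how select_dtypes splits the columns, here is a sketch on a made-up four-column frame (the data is an assumption, chosen to mirror the column types in the article).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "movieId": [862, 8844],
    "title": ["Toy Story", "Jumanji"],
    "runtime": [81.0, 104.0],
    "release_date": pd.to_datetime(["1995-10-30", "1995-12-15"]),
})

# list(DataFrame) returns the column names.
non_numeric = list(toy.select_dtypes(exclude=np.number))
numeric = list(toy.select_dtypes(include=np.number))

print(non_numeric)  # ['title', 'release_date']
print(numeric)      # ['movieId', 'runtime']
```

The date column lands in the non-numeric list and the id lands in the numeric list, which is why both need to be taken out by hand before the groups are final.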

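The “budget” and “popularity” columns hold numbers but are still stored as strings, so let’s move them to the numerical list. We also take “release_date” and “movieId” out of the generic lists, since they have their own groups.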

num_cols = ['budget', 'popularity']
for col in num_cols:
    col_cat_list.remove(col)
col_num_list.extend(num_cols)
col_cat_list.remove('release_date')
col_num_list.remove('movieId')

And wrap each column name in its own list, so that the mapper passes each column to its transformers as a 2-dimensional array:

col_categorical = [ [x] for x in col_cat_list ]
col_numerical = [ [x] for x in col_num_list ]
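
The reason for the list-of-lists form is the shape of the data each transformer receives: in sklearn-pandas a plain string selector yields a 1-dimensional array, while a one-element list yields a 2-dimensional array with one column. The sketch below shows the same distinction with plain pandas indexing on a made-up column.

```python
import pandas as pd

toy = pd.DataFrame({"runtime": [81.0, 104.0, 101.0]})

one_d = toy["runtime"].to_numpy()    # like the selector 'runtime'
two_d = toy[["runtime"]].to_numpy()  # like the selector ['runtime']

print(one_d.shape)  # (3,)
print(two_d.shape)  # (3, 1)
```

Transformers such as StandardScaler expect the 2-dimensional form, while col_date is kept as a plain string so that DateEncoder receives a 1-dimensional column.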

Now the features are grouped by type and we can start using the package.

First, to deal with the “datetime” feature, we will use the transformer below, which splits the date into three columns: year, month and day.

from sklearn.base import TransformerMixin

class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = pd.Series(X)
        dt = X.dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)
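
To see what this transformer returns, here is a small sketch that runs it on two made-up dates, one of which is invalid. The class is repeated so the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import TransformerMixin


class DateEncoder(TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = pd.Series(X).dt
        return pd.concat([dt.year, dt.month, dt.day], axis=1)


# The second value cannot be parsed and becomes NaT.
dates = pd.to_datetime(pd.Series(["1995-10-30", "not a date"]), errors="coerce")

encoder = DateEncoder()
parts = encoder.fit(dates).transform(dates)

print(parts.shape)             # (2, 3)
print(parts.iloc[0].tolist())  # [1995.0, 10.0, 30.0]
```

Rows whose date is NaT come out as NaN in all three columns, so missing dates stay missing after this step.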

The next step is to define the transformers for each of the groups:

# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22.
from sklearn.impute import SimpleImputer

classes_categorical = [CategoricalImputer, sklearn.preprocessing.LabelEncoder]
classes_numerical = [{'class': SimpleImputer, 'strategy': 'median'}, sklearn.preprocessing.StandardScaler]
classes_dates = [DateEncoder]
classes_none = [None]
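
As a rough illustration of what these chains do to a single column, the sketch below runs the numerical chain and the label-encoding step on made-up values. It uses SimpleImputer from sklearn.impute, the class that took over from sklearn.preprocessing.Imputer in newer scikit-learn releases; the sample values are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Numerical chain: fill gaps with the median, then standardize.
runtime = np.array([[80.0], [100.0], [np.nan], [120.0]])  # 2-D, one column
imputed = SimpleImputer(strategy="median").fit_transform(runtime)
scaled = StandardScaler().fit_transform(imputed)

print(imputed.ravel().tolist())  # [80.0, 100.0, 100.0, 120.0]

# Categorical chain, last step: map each label to an integer code.
status = ["Released", "Rumored", "Released"]
codes = LabelEncoder().fit_transform(status)

print(codes.tolist())  # [0, 1, 0]
```

After scaling, the column has mean 0 and unit standard deviation, and the label codes follow the sorted order of the distinct labels.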

We will use “gen_features” to pair every column in a group with its own instances of that group’s transformers.

feature_def = gen_features(
    columns=col_categorical,
    classes=classes_categorical,
)
feature_def_numerical = gen_features(
    columns=col_numerical,
    classes=classes_numerical,
)
feature_def_date = gen_features(
    columns=col_date,
    classes=classes_dates,
)
feature_def_none = gen_features(
    columns=col_none,
    classes=classes_none,
)
feature_def.extend(feature_def_date)
feature_def.extend(feature_def_numerical)
feature_def.extend(feature_def_none)
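
To get a feel for what this expansion produces, here is a simplified sketch. The function expand_features is a hypothetical stand-in written for this post, not the library's implementation, and the real gen_features handles more options; the point is only that every column gets fresh transformer instances built from the class list.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def expand_features(columns, classes):
    """Hypothetical, simplified stand-in for gen_features (illustration only)."""
    feature_defs = []
    for column in columns:
        transformers = []
        for definition in classes:
            if definition is None:
                continue  # None means: leave the column unchanged
            if isinstance(definition, dict):
                params = dict(definition)  # copy so the input is not mutated
                cls = params.pop("class")
                transformers.append(cls(**params))
            else:
                transformers.append(definition())
        feature_defs.append((column, transformers or None))
    return feature_defs


defs = expand_features(
    columns=[["budget"], ["runtime"]],
    classes=[{"class": SimpleImputer, "strategy": "median"}, StandardScaler],
)
for column, transformers in defs:
    print(column, transformers)

passthrough = expand_features(columns=["movieId"], classes=[None])
print(passthrough)  # [('movieId', None)]
```

Because each column has its own instances, the median and scaling parameters learned for one column never leak into another.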

We are almost done! The last step is to use the “mapper” to apply the functions that we defined on the groups as below:

mapper = DataFrameMapper(feature_def , df_out = True)
new_df_movies = mapper.fit_transform(movies)
new_df_movies.rename(columns={'release_date_0': 'year', 'release_date_1': 'month', 'release_date_2':'day'}, inplace=True)

And here we are done! The final dataset will be ready to enter the model.

End Notes

Sklearn-Pandas is a package that helps to preprocess the raw data before entering the model. It can save you time and can make this step much easier.

The completed code for this tutorial can be found on GitHub.

If you also wish to know how to generate new features automatically, you can continue to the next part of this blog post, which covers Automated Feature Engineering.
