Featuretools — A game-changer for your predictions

Gal Hever
Nov 30, 2019

Automated Feature Engineering

Introduction

This blog post introduces a simple tool that can improve your predictions in just a few minutes. It can also save you a lot of time and effort on model tuning, simply by extracting more information from the data you already have.

What is Feature Engineering?

Feature engineering is the process of creating new features from the features you have already collected. Sometimes feeding the model only the original features is not enough, and some extra work at this stage can be crucial to your predictions. There are two ways to create new features: transformation and aggregation.

Transformation

Transformation creates new features by applying operations to one or more features within a single table. For example, addition, subtraction, multiplication and division are all transformations that create new features from one or more columns in your table.
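
As a rough illustration of the idea, here is a minimal pandas sketch of two transformations. The table and its column names (budget, revenue) are made up for this example and are not part of the dataset used later in the post.

```python
import pandas as pd

# Hypothetical single table; the column names are invented for illustration.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "budget": [10.0, 20.0, 30.0],
    "revenue": [15.0, 10.0, 90.0],
})

# Transformations: each new value depends only on values in the same row.
movies["budget + revenue"] = movies["budget"] + movies["revenue"]
movies["revenue / budget"] = movies["revenue"] / movies["budget"]

print(movies)
```

Each new column is computed row by row, so no other table is needed.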

Aggregation

Aggregation creates new features by summarizing one feature across related rows in multiple tables. For example, the average, sum, maximum and minimum are all aggregations: given a parent table and a child table with many rows per parent, each one collapses the child rows that belong to one parent into a single value.
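
The same idea can be sketched by hand in pandas. This is only an illustration under assumed data: the two small tables below are invented, with several ratings rows per movie.

```python
import pandas as pd

# Hypothetical parent table (one row per movie) and child table (many rows per movie).
movies = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

# Aggregations: collapse all ratings of one movie into a single value each.
per_movie = (
    ratings.groupby("movieId")["rating"]
    .agg(["mean", "max", "count"])
    .add_prefix("rating_")
    .reset_index()
)

# Attach the aggregated values to the parent table as new features.
movies_with_stats = movies.merge(per_movie, on="movieId", how="left")
print(movies_with_stats)
```

Unlike a transformation, each new value here depends on rows from a second table.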

Why do I need to use Feature Engineering?

If your independent features can’t explain the response variable, it will be almost impossible for your model to predict the target. Sometimes you have the right information, but in a different form. In such cases, you will need to invest more time in feature engineering.

What is Featuretools?

Featuretools is a Python package that helps you find new features automatically. Extracting new features by hand can be frustrating and time-consuming, and this package can save you much of that time.

Before writing any piece of code, let’s understand the major components that this package includes.

  • Entities
  • Deep Feature Synthesis (DFS)
  • Feature primitives

Entity

An Entity is a table that contains a set of features. You can think of it as a simple Pandas DataFrame. A collection of Entities is called an EntitySet. Each Entity must have an index column that serves as a unique key for each row.

Deep Feature Synthesis (DFS)

DFS is a method for automatically generating new features from a relational dataset.

Feature primitives

A feature primitive is an operation, such as an aggregation or a transformation, that is performed on one or more features.

How does it work in practice?

Let’s look at a short example to understand these concepts. First, you will need to install and import the featuretools package.

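# Install package (run in a terminal)
# Note: the calls used in this post, such as entity_from_dataframe, belong to Featuretools releases before 1.0
pip install featuretools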
# Import package
import featuretools as ft

If you want to find out which aggregation and transformation functions are included in the package, you can use the function below, which returns some details about each of them.

ft.list_primitives()

For practice, I used a toy dataset from Kaggle that you can also download from here. In my previous blog post I worked on the preprocessing of this dataset, and in this post I will continue from the same point.

With the following Python code you can create your first empty EntitySet. I chose to call it ‘movies_entitySet’, but you can use your own informative name. You can then add a new Entity to it.

# initialize entityset
es = ft.EntitySet(id = 'movies_entitySet')

Next, let’s create the first new entity and add it to “movies_entitySet” EntitySet.

# mark the categorical features (col_cat_list and new_df_movies are assumed to come from the preprocessing step)
variable_types = {cat: ft.variable_types.Categorical for cat in col_cat_list}
# initialize entity
es.entity_from_dataframe(entity_id = 'movies_entity_id', dataframe = new_df_movies, make_index = False, index = 'movieId', variable_types = variable_types)

It’s important to define the feature types before applying transformations and aggregations; you can do this with “variable_types”. It is also essential that each entity has a unique key. If there is no unique key, you can create one with “make_index”.
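
If you are not sure whether a column can serve as the unique key, a quick pandas check can help before you build the entity. The sketch below uses made-up tables and a hypothetical helper, can_be_index. It also shows one way to add a surrogate key by hand, which is roughly the role “make_index” plays; it is not how Featuretools implements it.

```python
import pandas as pd

# Hypothetical tables for illustration only.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 2], "rating": [4.0, 5.0, 3.0]})


def can_be_index(df, column):
    """Return True if the column has no missing values and no duplicates."""
    return bool(df[column].notna().all() and df[column].is_unique)


print(can_be_index(movies, "movieId"))   # True: one row per movie
print(can_be_index(ratings, "movieId"))  # False: a movie can have many ratings

# No natural key in ratings, so add a surrogate key by hand.
ratings_with_key = ratings.reset_index(drop=True)
ratings_with_key.insert(0, "rating_id", list(range(len(ratings_with_key))))
print(can_be_index(ratings_with_key, "rating_id"))  # True
```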

To display the data of the defined entity, you can use the code below:

es['movies_entity_id'].df

You can also check the data types of the features in your entity with the next line of code.

es['movies_entity_id']

Transformation

Let’s start with an example of the transformation method. First, we define “movies_entity_id” as the target entity and use ‘add’ and ‘multiply’ as the transformation primitives.

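# primitive names can differ between releases (for example 'add_numeric' and 'multiply_numeric'); check ft.list_primitives()
movies_new_features, feature_defs = ft.dfs(entityset = es,
                                          target_entity = 'movies_entity_id',
                                          trans_primitives = ['add', 'multiply'])
movies_new_features.head()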

In the example above, “add” and “multiply” are called transform primitives because they take the features of a single row (one movie) and transform them into new values.

These transformations apply only to numeric features, so it is important to define the data type of each feature beforehand. The operation generates new features from all possible pairwise sums and products of the numeric features.
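
To get a feeling for how quickly the number of features grows, here is a small sketch that builds every pairwise sum and product by hand. The table and column names are invented, and this only mimics the idea; it is not the code Featuretools runs internally.

```python
from itertools import combinations

import pandas as pd

# Hypothetical table: three numeric features, one key and one text column.
movies = pd.DataFrame({
    "movieId": [1, 2],
    "year": [1995, 2001],
    "runtime": [90, 120],
    "budget": [10.0, 20.0],
    "title": ["A", "B"],
})

# Only numeric, non-key columns take part in the arithmetic.
numeric_cols = movies.drop(columns=["movieId"]).select_dtypes(include="number").columns

new_features = pd.DataFrame(index=movies.index)
for left, right in combinations(numeric_cols, 2):
    new_features[f"{left} + {right}"] = movies[left] + movies[right]
    new_features[f"{left} * {right}"] = movies[left] * movies[right]

# 3 numeric columns give 3 pairs, and 2 operations per pair give 6 new features.
print(new_features.columns.tolist())
```

With n numeric columns there are n(n-1)/2 pairs, so the output grows quadratically. This is one reason to prune the generated features afterwards.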

Aggregation

For the next feature generator we need more than one entity, so let’s define a new entity called “ratings_entity_id” and add it to our collection. This time we don’t have a unique key that identifies each row, so we will create one with “make_index”.

es.entity_from_dataframe(entity_id = 'ratings_entity_id', dataframe = ratings,
                         make_index = True, index = 'rating_id',
                         variable_types = {'userId': ft.variable_types.Categorical, 'timestamp': ft.variable_types.Categorical})

Now, we need to create a relationship between the entities and to define the foreign key between them. In our case, the foreign key will be “movieId”.

# parent variable first (movies), then the child variable that references it (ratings)
movies_ratings_relation = ft.Relationship(es['movies_entity_id']['movieId'], es['ratings_entity_id']['movieId'])
es = es.add_relationship(movies_ratings_relation)
es
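
Before adding a relationship like this, it can be worth checking that every key in the child table actually exists in the parent table. The following pandas sketch does that on made-up data. It is an optional sanity check of my own, not a step that Featuretools requires.

```python
import pandas as pd

# Hypothetical tables; movieId 9 in ratings has no matching movie.
movies = pd.DataFrame({"movieId": [1, 2, 3]})
ratings = pd.DataFrame({
    "movieId": [1, 1, 2, 9],
    "rating": [4.0, 5.0, 3.0, 1.0],
})

# Child rows whose foreign key does not appear in the parent table.
orphans = ratings[~ratings["movieId"].isin(movies["movieId"])]
print(orphans)

# Keep only the ratings that can be linked to a movie.
linked_ratings = ratings[ratings["movieId"].isin(movies["movieId"])]
```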

The last step is to define the methods that will summarize the movie ratings. We have two options. The first is to create a specific generator with chosen statistical methods such as max, min, mean, count, sum, etc.

# each primitive name must be its own string in the list
df, features = ft.dfs(entityset = es, target_entity = 'movies_entity_id', agg_primitives = ['mean', 'max'])

In the example above, “mean” and “max” are aggregation primitives: each computes a single value from the many ratings related to one movie.

The other option is to create a general generator that applies all of the default aggregations to the numeric features, as in the example below.

df, features = ft.dfs(entityset = es, target_entity = 'movies_entity_id')
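
Conceptually, the general generator loops over every aggregation and every numeric column of the related table. The sketch below imitates that loop in plain pandas on invented data. The list of aggregations and the feature names are chosen only for illustration and do not claim to match the package’s defaults or its naming.

```python
import pandas as pd

# Hypothetical parent and child tables.
movies = pd.DataFrame({"movieId": [1, 2]})
ratings = pd.DataFrame({
    "movieId": [1, 1, 2],
    "userId": [7, 8, 7],
    "rating": [4.0, 2.0, 5.0],
})

# Illustrative set of aggregations (an assumption, not the library defaults).
agg_primitives = ["mean", "max", "min", "sum"]

# Numeric child columns, skipping identifier columns that should not be averaged.
key_columns = {"movieId", "userId"}
numeric_cols = [
    col for col in ratings.select_dtypes(include="number").columns
    if col not in key_columns
]

# Apply every aggregation to every numeric column, one movie at a time.
generated = movies.set_index("movieId")
grouped = ratings.groupby("movieId")
for col in numeric_cols:
    for prim in agg_primitives:
        generated[f"{prim.upper()}(ratings.{col})"] = grouped[col].agg(prim)

print(generated)
```

Note that userId is numeric in the raw data but is excluded as an identifier, which mirrors why the post declares it as categorical.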

That’s it! Now you have tons of new features that you can evaluate with your model, and you can use dimensionality reduction methods to keep only the features that are relevant to your predictions.
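
One simple way to start pruning is a filter that drops constant columns and one column from each highly correlated pair. This is a plain feature-selection sketch on made-up numbers, not a full dimensionality reduction technique and not necessarily what was used for this dataset. The 0.95 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical generated feature matrix.
features = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "a_times_2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with "a"
    "constant": [5.0, 5.0, 5.0, 5.0],   # carries no information
    "b": [4.0, 1.0, 3.0, 2.0],
})

# 1. Drop columns that contain a single unique value.
features = features.loc[:, features.nunique() > 1]

# 2. Drop the later column of every highly correlated pair.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = features.drop(columns=to_drop)

print(selected.columns.tolist())
```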

End Notes

You can use this brute-force method when you don’t know much about your data and want to explore new features that might be useful for your final predictions. It’s a fast and easy way to generate many candidate features in just a few seconds!

The complete code for this tutorial can be found on GitHub.
