[PYTHON] Notes on how to use featuretools

What is featuretools

If you have domain knowledge when doing machine learning, you can improve accuracy by designing appropriate features and giving them to the model. Even without domain knowledge, you can add, subtract, or aggregate existing columns and hope to hit on a useful feature by chance. Because this approach tries every possible combination exhaustively, it is apparently called brute force feature engineering.

featuretools is a Python library that semi-automates feature creation, which is tedious to do by hand. Very convenient.

featuretools official tutorial https://docs.featuretools.com/en/stable/

This article follows the code from this blog post: https://blog.amedama.jp/entry/featuretools-brute-force-feature-engineering

1. Install

Install with pip

terminal


pip install featuretools

You can also integrate with other libraries by installing add-ons. https://docs.featuretools.com/en/stable/install.html

2. Deep Feature Synthesis

Given multiple DataFrames, features are created by aggregating, computing statistics, and applying arithmetic operations between features. Deep Feature Synthesis takes care of these tasks in a sensible way, and the function that does it is featuretools.dfs(). https://docs.featuretools.com/en/stable/getting_started/afe.html

To make this automation work well, you need to specify data types in more detail than pandas.DataFrame does. For example, for time alone there are Datetime, DateOfBirth, DatetimeTimeIndex, NumericTimeIndex, and so on, which makes it harder for inappropriate combinations to be generated. https://docs.featuretools.com/en/stable/getting_started/variables.html

3. In case of one entity

featuretools calls each input dataset an entity. You will often bring data in as a pandas.DataFrame; in that case, one pandas.DataFrame is one entity.

3-1. trans_primitives only

trans_primitives specifies the calculations to perform between features.

Create a DataFrame to use

python


import pandas as pd
data = {'name': ['a', 'b', 'c'],
        'x': [1, 2, 3],
        'y': [2, 4, 6],
        'z': [3, 6, 9],}
df = pd.DataFrame(data)
df


Create an EntitySet

First, create an empty featuretools.EntitySet. An EntitySet is an object that holds the entities, the relationships between them, and what is to be processed; here only the id is given. The id can be omitted, but below it is set to id='example'.

python


import featuretools as ft
es = ft.EntitySet(id='example')
es

Add an entity to the EntitySet

Below, the df created earlier is registered so that it can be referred to by the name 'locations'. index= specifies the column to use as the index; if omitted, the first column of the DataFrame is treated as the index.

python


es = es.entity_from_dataframe(entity_id='locations',
                              dataframe=df,
                              index='name')
es

output


Entityset: example
  Entities:
    locations [Rows: 3, Columns: 4]
  Relationships:
    No relationships

An entity registered in the EntitySet can be accessed as follows.

python


es['locations']

output


Entity: locations
  Variables:
    name (dtype: index)
    x (dtype: numeric)
    y (dtype: numeric)
    z (dtype: numeric)
  Shape:
    (Rows: 3, Columns: 4)

python


es['locations'].df


Run dfs

Now that we have an EntitySet, all that remains is to pass it to ft.dfs() to create the features. target_entity specifies the main entity, trans_primitives specifies the calculations applied to combinations of features, and agg_primitives specifies the calculations used for aggregation.

The available primitives are listed at https://primitives.featurelabs.com/

Below, add_numeric tells dfs to add the sum of each pair of features, and subtract_numeric tells it to add the difference of each pair.

python


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='locations',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=[],
                                      max_depth=1)
feature_matrix

In addition to the original x, y, and z, the sums x + y, x + z, y + z and the differences x - y, x - z, y - z have been added.
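To make explicit what add_numeric and subtract_numeric compute on this data, here is a minimal plain-Python sketch that builds the same six columns by hand. It only illustrates the arithmetic and is not how featuretools implements it; the helper name pairwise_features is made up for this example.

```python
from itertools import combinations

# Same numeric columns as the df above, held as plain lists.
columns = {'x': [1, 2, 3], 'y': [2, 4, 6], 'z': [3, 6, 9]}

def pairwise_features(cols):
    """Hypothetical helper: build 'a + b' and 'a - b' for every column pair."""
    features = {}
    for a, b in combinations(cols, 2):
        features[f'{a} + {b}'] = [u + v for u, v in zip(cols[a], cols[b])]
        features[f'{a} - {b}'] = [u - v for u, v in zip(cols[a], cols[b])]
    return features

features = pairwise_features(columns)
for name, values in features.items():
    print(name, values)
```

Each unordered pair of columns is used once per primitive, which is why three columns give three sums and three differences.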

3-2. agg_primitives only

agg_primitives specifies the calculations used to create aggregates.

Create a DataFrame to use

python


data = {'item_id': [1, 2, 3, 4, 5],
        'name': ['apple', 'broccoli', 'cabbage', 'dorian', 'eggplant'],
        'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
        'price': [100, 200, 300, 4000, 500]}
item_df = pd.DataFrame(data)
item_df

This gives a DataFrame whose category column (fruit or vegetable) can be used for aggregation.

Create an EntitySet

The steps are the same as before up to adding the entity.

python


es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es

Add a relationship here to use for aggregation.

The following creates a new entity called category from the items entity, using the category column as its index.

python


es = es.normalize_entity(base_entity_id='items',
                         new_entity_id='category',
                         index='category')
es

output


Entityset: example
  Entities:
    items [Rows: 5, Columns: 4]
    category [Rows: 2, Columns: 1]
  Relationships:
    items.category -> category.category

As for what this does: first, items is left as it is.

python


es['items'].df


The category entity, on the other hand, is indexed by the category column specified above, so it looks like this.

python


es['category'].df


Run dfs

Try specifying count, sum, and mean for agg_primitives.

python


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=[],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix

Since the only column that can be aggregated is items.price, the per-category COUNT(), category.MEAN(), and category.SUM() features are calculated, and a DataFrame with three additional columns is created.
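The aggregation described above can be sketched in plain Python as follows. This is only an illustration of the computation (group the prices by category, aggregate, then attach the result to every item in that category), not featuretools' implementation; the variable names are made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Same rows as item_df above: (item_id, category, price).
items = [
    (1, 'fruit', 100),
    (2, 'vegetable', 200),
    (3, 'vegetable', 300),
    (4, 'fruit', 4000),
    (5, 'vegetable', 500),
]

# Group the prices by category, mirroring the items -> category relationship.
prices_by_category = defaultdict(list)
for _, category, price in items:
    prices_by_category[category].append(price)

# One aggregate row per category: count, sum, and mean of price.
aggregates = {
    category: {'count': len(prices), 'sum': sum(prices), 'mean': mean(prices)}
    for category, prices in prices_by_category.items()
}

# Every item receives the aggregates of its own category as extra columns.
for item_id, category, price in items:
    print(item_id, category, price, aggregates[category])
```

Note that the aggregates are the same for every item in a category, so the new columns only distinguish fruit from vegetable here.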

4. In case of two entities

Create a DataFrame

python


data = {'item_id': [1, 2, 3],
        'name': ['apple', 'banana', 'cherry'],
        'price': [100, 200, 300]}
item_df = pd.DataFrame(data)
item_df


python


from datetime import datetime
data = {'transaction_id': [10, 20, 30, 40],
        'time': [
            datetime(2016, 1, 2, 3, 4, 5),
            datetime(2017, 2, 3, 4, 5, 6),
            datetime(2018, 3, 4, 5, 6, 7),
            datetime(2019, 4, 5, 6, 7, 8),
        ],
        'item_id': [1, 2, 3, 1],
        'amount': [1, 2, 3, 4]}
tx_df = pd.DataFrame(data)
tx_df


Create an EntitySet

Add the entities as before.

python


es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es = es.entity_from_dataframe(entity_id='transactions',
                              dataframe=tx_df,
                              index='transaction_id',
                              time_index='time')
es

Create a relationship that connects the two entities. They are joined on the item_id column of items and the item_id column of transactions.

python


relationship = ft.Relationship(es['items']['item_id'], es['transactions']['item_id'])
es = es.add_relationship(relationship)
es

output


Entityset: example
  Entities:
    items [Rows: 3, Columns: 3]
    transactions [Rows: 4, Columns: 4]
  Relationships:
    transactions.item_id -> items.item_id

Run dfs

python


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix

Listing only the column names gives the following. Both the aggregation and the calculations between features are performed.

output


['name',
 'price',
 'COUNT(transactions)',
 'MEAN(transactions.amount)',
 'SUM(transactions.amount)',
 'COUNT(transactions) + MEAN(transactions.amount)',
 'COUNT(transactions) + SUM(transactions.amount)',
 'COUNT(transactions) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount)',
 'MEAN(transactions.amount) + price',
 'price + SUM(transactions.amount)',
 'COUNT(transactions) - MEAN(transactions.amount)',
 'COUNT(transactions) - SUM(transactions.amount)',
 'COUNT(transactions) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount)',
 'MEAN(transactions.amount) - price',
 'price - SUM(transactions.amount)']
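As a check on what these columns contain, here is a plain-Python sketch that computes a few of them by hand from the same data. It is an illustration under the assumption that the column names mean what they say (aggregate the transactions of each item, then add or subtract the results); it does not call featuretools.

```python
from statistics import mean

# (transaction_id, item_id, amount), matching tx_df above.
transactions = [(10, 1, 1), (20, 2, 2), (30, 3, 3), (40, 1, 4)]
# item_id -> price, matching item_df above.
prices = {1: 100, 2: 200, 3: 300}

rows = {}
for item_id, price in prices.items():
    # Child rows of this item, found through the item_id relationship.
    amounts = [amt for _, tx_item, amt in transactions if tx_item == item_id]
    count, total, avg = len(amounts), sum(amounts), mean(amounts)
    rows[item_id] = {
        'price': price,
        # Aggregation features.
        'COUNT(transactions)': count,
        'MEAN(transactions.amount)': avg,
        'SUM(transactions.amount)': total,
        # Transform features built on top of the aggregates.
        'COUNT(transactions) + price': count + price,
        'price - SUM(transactions.amount)': price - total,
    }

for item_id, row in rows.items():
    print(item_id, row)
```

For item 1, which has two transactions with amounts 1 and 4, this gives a count of 2, a mean of 2.5, and a sum of 5.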

How max_depth works

Let's see what happens as max_depth is gradually increased in the code above.

max_depth=1


['name',
 'price',
 'COUNT(transactions)',
 'MEAN(transactions.amount)',
 'SUM(transactions.amount)']

max_depth=1 → 2 (features added)


['COUNT(transactions) + MEAN(transactions.amount)',
 'COUNT(transactions) + SUM(transactions.amount)',
 'COUNT(transactions) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount)',
 'MEAN(transactions.amount) + price',
 'price + SUM(transactions.amount)',
 'COUNT(transactions) - MEAN(transactions.amount)',
 'COUNT(transactions) - SUM(transactions.amount)',
 'COUNT(transactions) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount)',
 'MEAN(transactions.amount) - price',
 'price - SUM(transactions.amount)']

max_depth=2 → 3 (features added)


['MEAN(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price)',
 'SUM(transactions.amount - items.price)']

max_depth=3 → 4 (features added)


['COUNT(transactions) + MEAN(transactions.amount + items.price)',
 'COUNT(transactions) + MEAN(transactions.amount - items.price)',
 'COUNT(transactions) + SUM(transactions.amount + items.price)',
 'COUNT(transactions) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + MEAN(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + MEAN(transactions.amount)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount)',
 'MEAN(transactions.amount + items.price) + price',
 'MEAN(transactions.amount - items.price) + MEAN(transactions.amount)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount)',
 'MEAN(transactions.amount - items.price) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount) + SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) + SUM(transactions.amount)',
 'SUM(transactions.amount - items.price) + SUM(transactions.amount)',
 'price + SUM(transactions.amount + items.price)',
 'price + SUM(transactions.amount - items.price)',
 'COUNT(transactions) - MEAN(transactions.amount + items.price)',
 'COUNT(transactions) - MEAN(transactions.amount - items.price)',
 'COUNT(transactions) - SUM(transactions.amount + items.price)',
 'COUNT(transactions) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - MEAN(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - MEAN(transactions.amount)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount)',
 'MEAN(transactions.amount + items.price) - price',
 'MEAN(transactions.amount - items.price) - MEAN(transactions.amount)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount)',
 'MEAN(transactions.amount - items.price) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount) - SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) - SUM(transactions.amount)',
 'SUM(transactions.amount - items.price) - SUM(transactions.amount)',
 'price - SUM(transactions.amount + items.price)',
 'price - SUM(transactions.amount - items.price)']
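The sizes of the lists added at max_depth=2 and max_depth=4 can be accounted for with a little arithmetic, assuming each unordered pair of numeric features is used once per primitive and pairs that already exist are not generated again. The helper name new_pair_features is made up for this sketch.

```python
from math import comb

def new_pair_features(n_numeric, n_already_paired, n_primitives=2):
    """Hypothetical helper: count pairwise features not generated before."""
    new_pairs = comb(n_numeric, 2) - comb(n_already_paired, 2)
    return new_pairs * n_primitives

# max_depth=2: price, COUNT, MEAN, SUM are available, none paired yet.
depth2 = new_pair_features(4, 0)
# max_depth=4: the same four plus the four aggregates added at max_depth=3.
depth4 = new_pair_features(8, 4)
print(depth2, depth4)  # 12 44
```

These match the 12 columns in the max_depth=2 list and the 44 columns in the max_depth=4 list, which shows how quickly the number of features grows as numeric columns are added.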

max_depth=4 → 5 (features added)


[]

max_depth=5 → 6 (features added)


[]

Judging from these results, it seems to be designed to apply agg_primitives, trans_primitives, agg_primitives, and trans_primitives in that order and then stop.

Custom primitives

It seems you can also define your own primitives and use them in the calculation. https://docs.featuretools.com/en/stable/getting_started/primitives.html#simple-custom-primitives
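To see how the deeper features stack, the following plain-Python sketch computes one feature from the max_depth=3 list and one from the max_depth=4 list by hand. The depth labels in the comments simply follow the lists above; this is an illustration of the arithmetic, not of how featuretools traverses the relationships internally.

```python
from statistics import mean

# (transaction_id, item_id, amount) and item_id -> price, as in the DataFrames above.
transactions = [(10, 1, 1), (20, 2, 2), (30, 3, 3), (40, 1, 4)]
prices = {1: 100, 2: 200, 3: 300}

features = {}
for item_id, price in prices.items():
    amounts = [amt for _, tx_item, amt in transactions if tx_item == item_id]

    # First seen at max_depth=1: aggregate the child rows.
    count = len(amounts)

    # First seen at max_depth=3: look up items.price for each transaction,
    # apply a transform (amount + items.price), then aggregate the result.
    shifted = [amt + price for amt in amounts]
    mean_shifted = mean(shifted)

    # First seen at max_depth=4: one more transform on top of that aggregate.
    features[item_id] = {
        'COUNT(transactions)': count,
        'MEAN(transactions.amount + items.price)': mean_shifted,
        'SUM(transactions.amount + items.price)': sum(shifted),
        'COUNT(transactions) + MEAN(transactions.amount + items.price)':
            count + mean_shifted,
    }

for item_id, row in features.items():
    print(item_id, row)
```

For item 1 the transformed amounts are 101 and 104, so the mean is 102.5, the sum is 205, and the final max_depth=4 feature is 2 + 102.5 = 104.5.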

Summary

Convenient!!!
