In [ ]:
import featuretools as ft
import pandas as pd
import numpy as np
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
es
If you want to view the variables (columns) and their types for the "transactions" entity, you can do the following:
In [ ]:
es['transactions'].variables
If you want to view the underlying Dataframe, you can do the following:
In [ ]:
es['transactions'].df.head()
The function normalize_entity creates a new entity and a relationship from the unique values of an existing entity. It takes two similar arguments:
additional_variables removes variables from the base entity and moves them to the new entity.
copy_variables keeps the given variables in the base entity, but also copies them to the new entity.
In [ ]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time")
es = es.entity_from_dataframe(entity_id="products",
dataframe=products_df,
index="product_id")
new_relationship = ft.Relationship(es["products"]["product_id"], es["transactions"]["product_id"])
es = es.add_relationship(new_relationship)
Before we normalize to create a new entity, let's look at the base entity.
In [ ]:
es['transactions'].df.head()
Notice the columns session_id, session_start, join_date, device, customer_id, and zip_code.
In [ ]:
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index="session_id",
make_time_index="session_start",
additional_variables=["session_start", "join_date"],
copy_variables=["device", "customer_id", "zip_code"])
Above, we normalized the columns to create a new entity. With additional_variables, the columns ['session_start', 'join_date'] were removed from the transactions entity and moved to the new sessions entity. With copy_variables, the columns ['device', 'customer_id', 'zip_code'] were kept in the transactions entity but also copied to the new sessions entity. Let's see this in the actual EntitySet.
In [ ]:
es['transactions'].df.head()
Notice above how ['device', 'customer_id', 'zip_code'] are still in the transactions entity, while ['session_start', 'join_date'] are not. Either way, they all now appear in the sessions entity, as seen below.
In [ ]:
es['sessions'].df.head()
In [ ]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time")
es.plot()
Notice how the variable type of session_id is Numeric, and the variable type of session_start is Datetime.
Now, let's normalize the transactions entity to create a new entity.
In [ ]:
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index="session_id",
make_time_index="session_start",
additional_variables=["session_start"])
es.plot()
The type of session_id is now Id in the transactions entity, and Index in the new entity, sessions. This is the case because when we normalize the entity, we create a new relationship between the transactions and sessions entities. There is a one-to-many relationship between the parent entity, sessions, and the child entity, transactions.
Therefore, session_id has type Id in transactions because it represents an Index in another entity. There would be a similar effect if we added another entity using entity_from_dataframe and add_relationship.
In addition, when we created the new entity, we specified a time_index, which was the variable (column) session_start. This changed the type of session_start to datetime_time_index in the new sessions entity because it now represents a time index.
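As a quick, hedged sketch (not part of the original walkthrough), you can confirm these types by indexing into the entities of the EntitySet built above; each lookup returns a Variable object whose repr shows its type:
print(es['transactions']['session_id'])   # Id in the child entity
print(es['sessions']['session_id'])       # Index in the parent entity
print(es['sessions']['session_start'])    # the new datetime time index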
You might want to create features that are conditioned on multiple values before they are calculated. This requires the use of interesting_values. However, since we are trying to create a feature with multiple conditions, we will need to modify the Dataframe before we create the EntitySet.
Let's look at how you might accomplish this.
First, let's create our Dataframes.
In [ ]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
products_df = data["products"]
In [ ]:
transactions_df.head()
In [ ]:
products_df.head()
Now, let's modify our transactions Dataframe to create the additional column that represents multiple conditions for our feature.
In [ ]:
transactions_df['product_id_device'] = transactions_df['product_id'].astype(str) + ' and ' + transactions_df['device']
Here, we created a new column called product_id_device, which simply combines the product_id column and the device column.
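As a quick sanity check (a sketch, not part of the original flow), we can view the new column next to its source columns:
transactions_df[['product_id', 'device', 'product_id_device']].head()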
Now let's create our EntitySet.
In [ ]:
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time",
variable_types={"product_id": ft.variable_types.Categorical,
"product_id_device": ft.variable_types.Categorical,
"zip_code": ft.variable_types.ZIPCode})
es = es.entity_from_dataframe(entity_id="products",
dataframe=products_df,
index="product_id")
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index="session_id",
additional_variables=["device", "product_id_device", "customer_id"])
es = es.normalize_entity(base_entity_id="sessions",
new_entity_id="customers",
index="customer_id")
es
Now, we are ready to add our interesting values.
First, let's view our options for what the interesting values could be.
In [ ]:
interesting_values = transactions_df['product_id_device'].unique().tolist()
interesting_values
If you wanted to, you could pick a subset of these, and the where features created would only use those conditions (see the sketch below). In our example, we will use all the possible interesting values.
Here, we set all of these values as our interesting values for this specific entity and variable. If we wanted to, we could make interesting values in the same way for more than one variable, but we will just stick with this one for this example.
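For reference, here is a hedged sketch of how such a subset could be built; the filter on 'tablet' is purely hypothetical, and the next cell assigns the full list instead:
# Hypothetical subset: keep only the conditions that involve a tablet
tablet_values = [value for value in interesting_values if 'tablet' in value]
tablet_values[0:5]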
In [ ]:
es['sessions']['product_id_device'].interesting_values = interesting_values
Now we can run DFS.
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=["count"],
where_primitives=["count"],
trans_primitives=[])
feature_matrix.head()
To better understand the where clause features, let's examine one of them.
The feature COUNT(sessions WHERE product_id_device = 5 and tablet) tells us how many sessions the customer had in which product_id 5 was purchased on a tablet. Notice how the feature depends on multiple conditions (product_id = 5 & device = tablet).
In [ ]:
feature_matrix[["COUNT(sessions WHERE product_id_device = 5 and tablet)"]]
You may have created your EntitySet and then applied DFS to create features. However, you may be puzzled as to why no aggregation features were created.
Let's look at a simple example.
In [ ]:
data = ft.demo.load_mock_customer()
transactions_df = data["transactions"].merge(data["sessions"]).merge(data["customers"])
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id")
es
Notice how we only have 1 entity in our EntitySet. If we try to create aggregation features on this EntitySet, it will not be possible because DFS needs 2 entities to generate aggregation features.
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="transactions")
feature_defs
None of the above features are aggregation features. To fix this issue, you can add another entity to your EntitySet.
Solution #1 - You can add a new entity if you have additional data.
In [ ]:
products_df = data["products"]
es = es.entity_from_dataframe(entity_id="products",
dataframe=products_df,
index="product_id")
es
Notice how we now have an additional entity in our EntitySet, called products.
Solution #2 - You can normalize an existing entity.
In [ ]:
es = es.normalize_entity(base_entity_id="transactions",
new_entity_id="sessions",
index="session_id",
make_time_index="session_start",
additional_variables=["device", "customer_id", "zip_code", "session_start", "join_date"])
es
Notice how we now have an additional entity in our EntitySet, called sessions. Here, the normalization created a relationship between transactions and sessions. However, we could have specified a relationship between transactions and products if we had only used Solution #1.
Now, we can generate aggregation features.
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es, target_entity="transactions")
feature_defs[:-10]
A few of the aggregation features are:
<Feature: sessions.SUM(transactions.amount)>
<Feature: sessions.STD(transactions.amount)>
<Feature: sessions.MAX(transactions.amount)>
<Feature: sessions.SKEW(transactions.amount)>
<Feature: sessions.MIN(transactions.amount)>
<Feature: sessions.MEAN(transactions.amount)>
<Feature: sessions.COUNT(transactions)>
One issue you may encounter while running ft.dfs is slow performance. While the default settings in Featuretools are generally well optimized for calculating features, you may want to speed up performance when you are calculating a large number of features.
One quick way to speed up performance is to adjust the n_jobs setting of ft.dfs or ft.calculate_feature_matrix.
# setting n_jobs to -1 will use all cores
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
n_jobs=-1)
feature_matrix, feature_defs = ft.calculate_feature_matrix(entityset=es,
features=feature_defs,
n_jobs=-1)
For more ways to speed up performance, please visit:
When using DFS to generate features, you may wish to exclude certain variables or features. There are multiple ways to do this:
Use the ignore_variables parameter to specify variables in an entity that should not be used to create features. It is a dictionary mapping an entity id to a list of variable names to ignore.
Use drop_contains to drop features that contain any of the strings listed in this parameter.
Use drop_exact to drop features that exactly match any of the strings listed in this parameter.
Here is an example of using all three parameters:
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
ignore_variables={
"transactions": ["amount"],
"customers": ["age", "gender", "date_of_birth"]
}, # ignore these variables
drop_contains=["customers.SUM("], # drop features that contain these strings
drop_exact=["STD(transactions.quanity)"]) # drop features that exactly match
When using DFS to generate features, you may wish to use only certain features or entities for specific primitives. This can be done through the primitive_options parameter. The primitive_options parameter is a dictionary or list of dictionaries that maps a primitive or a tuple of primitives to a dictionary containing options for the primitive(s). This parameter can also be a list of option dictionaries if the primitive takes multiple inputs; each dictionary supplies options for its respective input column. There are multiple ways to control how primitives get applied through these options:
Use ignore_entities to specify entities that should not be used to create features for that primitive. It is a list of entity ids to ignore.
Use include_entities to specify the only entities to be used to create features for that primitive. It is a list of entity ids to include.
Use ignore_variables to specify variables in an entity that should not be used to create features for that primitive. It is a dictionary mapping an entity id to a list of variable names to ignore.
Use include_variables to specify the only variables in an entity that should be used to create features for that primitive. It is a dictionary mapping an entity id to a list of variable names to include.
You can also use primitive_options to specify which entities or variables you wish to use as groupbys for groupby transform primitives:
Use ignore_groupby_entities to specify entities that should not be used to get groupbys for that primitive. It is a list of entity ids to ignore.
Use include_groupby_entities to specify the only entities that should be used to get groupbys for that primitive. It is a list of entity ids to include.
Use ignore_groupby_variables to specify variables in an entity that should not be used as groupbys for that primitive. It is a dictionary mapping an entity id to a list of variable names to ignore.
Use include_groupby_variables to specify the only variables in an entity that should be used as groupbys for that primitive. It is a dictionary mapping an entity id to a list of variable names to include.
Here is an example of using some of these options:
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
primitive_options={"mode": {"ignore_entities": ["sessions"],
"include_variables": {"products": ["brand"],
"transactions": ["product_id"]}},
# For mode, ignore the "sessions" entity and only include "brand" in the
# "products" entity and "product_id" in the "transactions" entity
("count", "mean"): {"include_entities": ["sessions", "transactions"]}
# For count and mean, only include the entities "sessions" and "transactions"
})
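The example above does not use the groupby options, so here is a hedged sketch of how they could look. The choice of cum_sum as the groupby transform primitive, and of "product_id" in the "transactions" entity as the only allowed groupby, is purely for illustration:
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity="transactions",
                                      agg_primitives=[],
                                      trans_primitives=[],
                                      groupby_trans_primitives=["cum_sum"],
                                      primitive_options={
                                          # For cum_sum, only group by "product_id" in the "transactions" entity
                                          "cum_sum": {"include_groupby_variables": {"transactions": ["product_id"]}}
                                      })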
For more examples of specifying options for DFS, please visit:
You may encounter a situation where you wish to make predictions using only a certain amount of historical data. You can accomplish this using the training_window parameter in ft.dfs. When you use training_window, Featuretools will use the historical data between the cutoff_time and cutoff_time - training_window.
In order to make the calculation, Featuretools will check the time in the time_index column of the target_entity.
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
es['customers'].time_index
Our target_entity has a time_index, which is needed for the training_window calculation. Here, we are creating a cutoff time dataframe so that we can have a unique training window for each customer.
In [ ]:
cutoff_times = pd.DataFrame()
cutoff_times['customer_id'] = [1, 2, 3, 1]
cutoff_times['time'] = pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00', '2014-1-1 06:00', '2014-1-1 08:00'])
cutoff_times['label'] = [True, True, False, True]
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True,
training_window="1 hour")
feature_matrix.head()
Above, we ran DFS with a training_window argument of 1 hour to create features that only used customer data collected in the last hour (before each cutoff time we provided).
In [ ]:
transactions_df = ft.demo.load_mock_customer(return_single_table=True)
es = ft.EntitySet(id="customer_data")
es = es.entity_from_dataframe(entity_id="transactions",
dataframe=transactions_df,
index="transaction_id",
time_index="transaction_time")
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="transactions",
trans_primitives=['time_since', 'day', 'is_weekend',
'cum_min', 'minute',
'num_words', 'weekday', 'cum_count',
'percentile', 'year', 'week',
'cum_mean'])
Before we examine the output, let's look at our original single table.
In [ ]:
transactions_df.head()
Now we can look at the transformations that Featuretools was able to apply to this single entity (table) to create a feature matrix.
In [ ]:
feature_matrix.head()
Yes. AutoNormalize, another open source library produced by Feature Labs, automates table normalization and integrates with Featuretools. To install it, run:
python -m pip install featuretools[autonormalize]
A normalized EntitySet will help Featuretools to generate more features. For example:
In [ ]:
from featuretools.autonormalize import autonormalize as an
es = an.normalize_entity(es)
es.plot()
As you can see, AutoNormalize creates a relational EntitySet. Below, we run DFS on the EntitySet, and you can see all the features created; take note of the aggregated features.
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="transaction_id",
trans_primitives=[])
feature_matrix.head()
One concern you might have when using DFS is label leakage. You want to make sure that labels in your data aren't used incorrectly to create features and the feature matrix.
Featuretools is particularly focused on helping users avoid label leakage.
There are two ways to prevent label leakage, depending on whether or not your data has timestamps.
In the case where you do not have timestamps, you can create one EntitySet using only the training data and then run ft.dfs. This will create a feature matrix using only the training data, but also return a list of feature definitions. Next, you can create an EntitySet using the test data and recalculate the same features by calling ft.calculate_feature_matrix with the list of feature definitions from before.
Here is what that flow would look like:
First, let's create our training data.
In [ ]:
train_data = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"age": [40, 50, 10, 20, 30],
"gender": ["m", "f", "m", "f", "f"],
"signup_date": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min'),
"labels": [True, False, True, False, True]})
train_data.head()
Now, we can create an entityset for our training data.
In [ ]:
es_train_data = ft.EntitySet(id="customer_train_data")
es_train_data = es_train_data.entity_from_dataframe(entity_id="customers",
dataframe=train_data,
index="customer_id")
es_train_data
Next, we are ready to create our features, and feature matrix for the training data.
In [ ]:
feature_matrix_train, feature_defs = ft.dfs(entityset=es_train_data,
target_entity="customers")
feature_matrix_train
We will also encode our feature matrix to make the features compatible with machine learning algorithms.
In [ ]:
feature_matrix_train_enc, features_enc = ft.encode_features(feature_matrix_train, feature_defs)
feature_matrix_train_enc.head()
Notice how the whole feature matrix only includes numeric values now.
Now we can use the feature definitions to calculate our feature matrix for the test data, and avoid label leakage.
In [ ]:
test_train = pd.DataFrame({"customer_id": [6, 7, 8, 9, 10],
"age": [20, 25, 55, 22, 35],
"gender": ["f", "m", "m", "m", "m"],
"signup_date": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
# let's add a NaN label column to the test Dataframe
test_train['labels'] = np.nan
es_test_data = ft.EntitySet(id="customer_test_data")
es_test_data = es_test_data.entity_from_dataframe(entity_id="customers",
dataframe=test_train,
index="customer_id",
time_index="signup_date")
# Use the feature definitions from earlier
feature_matrix_enc_test = ft.calculate_feature_matrix(features=features_enc,
entityset=es_test_data)
feature_matrix_enc_test.head()
Note: Disregard the difference between the False/True above, and 0/1 in the earlier feature matrix. A simple casting would address this difference.
If your data has timestamps, the best way to prevent label leakage is to use a list of cutoff times, which specify the last point in time at which data is allowed to be used for each row in the resulting feature matrix. To use cutoff times, you need to set a time index for each time-sensitive entity in your entity set.
Tip: Even if your data doesn't have timestamps, you could add a column with dummy timestamps that Featuretools can use as a time index.
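A minimal sketch of that tip (the DataFrame and column names here are hypothetical):
df_no_timestamps = pd.DataFrame({"customer_id": [1, 2, 3]})
df_no_timestamps["dummy_time"] = pd.Timestamp("2020-01-01")  # constant placeholder timestamp
# dummy_time can then be passed as time_index= when calling entity_from_dataframe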
When you call ft.dfs, you can provide a Dataframe of cutoff times like this:
In [ ]:
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
cutoff_times.head()
In [ ]:
train_test_data = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"age": [20, 25, 55, 22, 35],
"gender": ["f", "m", "m", "m", "m"],
"signup_date": pd.date_range('2010-01-01 01:41:50', periods=5, freq='25min')})
es_train_test_data = ft.EntitySet(id="customer_train_test_data")
es_train_test_data = es_train_test_data.entity_from_dataframe(entity_id="customers",
dataframe=train_test_data,
index="customer_id",
time_index="signup_date")
feature_matrix_train_test, features = ft.dfs(entityset=es_train_test_data,
target_entity="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True)
feature_matrix_train_test.head()
Above, we have created a feature matrix that uses cutoff times to avoid label leakage. We could also encode this feature matrix using ft.encode_features.
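For example, a sketch of that extra encoding step, reusing the objects created above:
feature_matrix_train_test_enc, features_enc = ft.encode_features(feature_matrix_train_test, features)
feature_matrix_train_test_enc.head()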
There are 2 ways to pass primitives to DFS: the primitive object, or a string of the primitive name.
We will use the Transform primitive called TimeSincePrevious to illustrate the differences.
First, let's use the string of primitive name.
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=[],
trans_primitives=["time_since_previous"])
feature_matrix
Now, let's use the primitive object.
In [ ]:
from featuretools.primitives import TimeSincePrevious
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=[],
trans_primitives=[TimeSincePrevious])
feature_matrix
As we can see above, the feature matrix is the same.
However, if we need to modify controllable parameters in the primitive, we should use the primitive object. For instance, let's make TimeSincePrevious return units of hours (the default is in seconds).
In [ ]:
from featuretools.primitives import TimeSincePrevious
time_since_previous_in_hours = TimeSincePrevious(unit='hours')
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=[],
trans_primitives=[time_since_previous_in_hours])
feature_matrix
You may wish to select a subset of your features based on some attributes.
Let's say you wanted to select the features that have the string amount in their names. You can check for this by using the get_name function on the feature definitions.
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
feature_defs = ft.dfs(entityset=es,
target_entity="customers",
features_only=True)
features_with_amount = []
for x in feature_defs:
if 'amount' in x.get_name():
features_with_amount.append(x)
features_with_amount[0:5]
You might also want to only select features that are aggregation features.
In [ ]:
from featuretools import AggregationFeature
features_only_aggregations = []
for x in feature_defs:
if type(x) == AggregationFeature:
features_only_aggregations.append(x)
features_only_aggregations[0:5]
Also, you might only want to select features that are calculated at a certain depth. You can do this by using the get_depth function.
In [ ]:
features_only_depth_2 = []
for x in feature_defs:
if x.get_depth() == 2:
features_only_depth_2.append(x)
features_only_depth_2[0:5]
Finally, you might only want features that return a certain type. You can do this by checking the variable_type attribute.
In [ ]:
from featuretools.variable_types import Numeric
features_only_numeric = []
for x in feature_defs:
if x.variable_type == Numeric:
features_only_numeric.append(x)
features_only_numeric[0:5]
Once you have your specific feature list, you can use ft.calculate_feature_matrix to generate a feature matrix for only those features.
For our example, let's use the features with the string amount in their names.
In [ ]:
feature_matrix = ft.calculate_feature_matrix(entityset=es,
features=features_with_amount) # change to your specific feature list
feature_matrix.head()
Above, notice how all the column names for our feature matrix contain the string amount.
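A quick check (a sketch) that this holds:
all('amount' in column for column in feature_matrix.columns)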
Sometimes, you might want to create features that are conditioned on a second value before they are calculated. This extra filter is called a "where clause". You can create these features using the interesting_values of a variable.
If you have categorical columns in your EntitySet, you can then use add_interesting_values. This function will find interesting values for your categorical variables, which can then be used to generate "where" clauses.
First, let's create our EntitySet.
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
es
Now we can add the interesting values for the categorical variables.
In [ ]:
es.add_interesting_values()
Now we can run DFS with the where_primitives argument to define which primitives to apply with where clauses. In this case, let's use the primitive count.
In [ ]:
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=["count"],
where_primitives=["count"],
trans_primitives=[])
feature_matrix.head()
We have now created some useful features. One example of a useful feature is COUNT(sessions WHERE device = tablet). This feature tells us how many sessions a customer completed on a tablet.
In [ ]:
feature_matrix[["COUNT(sessions WHERE device = tablet)"]]
You might be curious to know the difference between the primitive groups. Let's review the differences between transform, groupby transform, and aggregation primitives.
First, let's create a simple EntitySet.
In [ ]:
import pandas as pd
import featuretools as ft
df = pd.DataFrame({
"id": [1, 2, 3, 4, 5, 6],
"time_index": pd.date_range("1/1/2019", periods=6, freq="D"),
"group": ["a", "b", "a", "c", "a", "b"],
"val": [5, 1, 10, 20, 6, 23],
})
es = ft.EntitySet()
es = es.entity_from_dataframe(entity_id="observations",
dataframe=df,
index="id",
time_index="time_index")
es = es.normalize_entity(base_entity_id="observations",
new_entity_id="groups",
index="group")
es.plot()
After calling normalize_entity, the variable "group" has the type Id because it identifies another entity. Alternatively, the type could be set using the variable_types parameter when we first call es.entity_from_dataframe().
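Here is a hedged sketch of that alternative; it builds a separate EntitySet (es2) so the one used below is left untouched, and it should produce the same Id type for "group" as the normalize_entity approach:
es2 = ft.EntitySet()
es2 = es2.entity_from_dataframe(entity_id="observations",
                                dataframe=df,
                                index="id",
                                time_index="time_index",
                                variable_types={"group": ft.variable_types.Id})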
The cum_sum primitive calculates the running sum in a list of numbers.
In [ ]:
from featuretools.primitives import CumSum
cum_sum = CumSum()
cum_sum([1, 2, 3, 4, 5]).tolist()
If we apply it using the trans_primitives argument, it will be calculated over the entire observations entity, like this:
In [ ]:
feature_matrix, feature_defs = ft.dfs(target_entity="observations",
entityset=es,
agg_primitives=[],
trans_primitives=["cum_sum"],
groupby_trans_primitives=[])
feature_matrix
If we instead pass cum_sum through the groupby_trans_primitives argument, the running sum is calculated separately within each group (grouped by the "group" variable):
In [ ]:
feature_matrix, feature_defs = ft.dfs(target_entity="observations",
entityset=es,
agg_primitives=[],
trans_primitives=[],
groupby_trans_primitives=["cum_sum"])
feature_matrix
If we use sum as a regular aggregation primitive instead, a single total is calculated per group and then joined back onto each observation:
In [ ]:
feature_matrix, feature_defs = ft.dfs(target_entity="observations",
entityset=es,
agg_primitives=["sum"],
trans_primitives=[],
cutoff_time_in_index=True,
groupby_trans_primitives=[])
feature_matrix
If we set the cutoff time of each row to be the time index, then use sum as an aggregation primitive, the result is the same as cum_sum. (Though the order is different in the displayed dataframe).
In [ ]:
cutoff_time = df[["id", "time_index"]]
cutoff_time
In [ ]:
feature_matrix, feature_defs = ft.dfs(target_entity="observations",
entityset=es,
agg_primitives=["sum"],
trans_primitives=[],
groupby_trans_primitives=[],
cutoff_time_in_index=True,
cutoff_time=cutoff_time)
feature_matrix
You can list every primitive available in Featuretools, along with a short description of each, by calling ft.list_primitives:
In [ ]:
df_primitives = ft.list_primitives()
df_primitives.head()
In [ ]:
df_primitives.tail()
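As a sketch, you can filter this table down to one kind of primitive; this assumes the returned DataFrame has a "type" column distinguishing aggregation and transform primitives, as it does in recent Featuretools releases:
df_primitives[df_primitives['type'] == 'aggregation'].head()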
There are a few primitives in Featuretools that make time-based calculations. These include TimeSince, TimeSincePrevious, TimeSinceLast, and TimeSinceFirst.
You can change the units from the default of seconds to any valid time unit by doing the following:
In [ ]:
from featuretools.primitives import TimeSince, TimeSincePrevious, TimeSinceLast, TimeSinceFirst
time_since = TimeSince(unit="minutes")
time_since_previous = TimeSincePrevious(unit="hours")
time_since_last = TimeSinceLast(unit="days")
time_since_first = TimeSinceFirst(unit="years")
es = ft.demo.load_mock_customer(return_entityset=True)
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=[time_since_last, time_since_first],
trans_primitives=[time_since, time_since_previous])
Above, we changed the units as follows: TimeSince to minutes, TimeSincePrevious to hours, TimeSinceLast to days, and TimeSinceFirst to years. Now we can see that our feature matrix contains multiple features where the units for the TimeSince primitives are changed.
In [ ]:
feature_matrix.head()
There are now features where the time unit differs from the default of seconds, such as TIME_SINCE_LAST(sessions.session_start, unit=days) and TIME_SINCE_FIRST(sessions.session_start, unit=years).
You might be wondering how to properly use your train & test data with Featuretools and sklearn's train_test_split. There are a few things you must do to ensure accuracy with this workflow.
Let's imagine we have a Dataframe for our train data, with the labels.
In [ ]:
train_data = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"age": [20, 25, 55, 22, 35],
"gender": ["f", "m", "m", "m", "m"],
"signup_date": pd.date_range('2010-01-01 01:41:50', periods=5, freq='25min'),
"labels": [False, True, True, False, False]})
train_data.head()
Now we can create our EntitySet for the train data, and create our features. To prevent label leakage, we will use cutoff times (see the earlier question).
In [ ]:
es_train_data = ft.EntitySet(id="customer_data")
es_train_data = es_train_data.entity_from_dataframe(entity_id="customers",
dataframe=train_data,
index="customer_id")
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
feature_matrix_train, features = ft.dfs(entityset=es_train_data,
target_entity="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True)
feature_matrix_train.head()
We will also encode our feature matrix to make it compatible with machine learning algorithms.
In [ ]:
feature_matrix_train_enc, feature_enc = ft.encode_features(feature_matrix_train, features)
feature_matrix_train_enc.head()
In [ ]:
from sklearn.model_selection import train_test_split
X = feature_matrix_train_enc.drop(['labels'], axis=1)
y = feature_matrix_train_enc['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
Now you can use the encoded feature matrix with sklearn's train_test_split. This will allow you to train your model, and tune your parameters.
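As a hedged sketch of that next step, any scikit-learn estimator would work; a random forest is used here purely as an example and is not prescribed by Featuretools:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)   # example model
clf.fit(X_train, y_train)
clf.score(X_test, y_test)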
You might be wondering what happens when categorical variables are encoded across your training and testing data, and, in particular, what happens if the train data has a categorical value that is not present in the testing data.
Let's explore a simple example to see what happens during the encoding process.
In [ ]:
train_data = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"product_purchased": ["coke zero", "car", "toothpaste", "coke zero", "car"]})
es_train = ft.EntitySet(id="customer_data")
es_train = es_train.entity_from_dataframe(entity_id="customers",
dataframe=train_data,
index="customer_id")
feature_matrix_train, features = ft.dfs(entityset=es_train,
target_entity='customers')
feature_matrix_train
We will use ft.encode_features to properly encode the product_purchased column.
In [ ]:
feature_matrix_train_encoded, features_encoded = ft.encode_features(feature_matrix_train,
features)
feature_matrix_train_encoded.head()
Now let's imagine we have some test data that doesn't have one of the categorical values from the train data (toothpaste). Also, the test data has a value that wasn't present in the train data (water).
In [ ]:
test_data = pd.DataFrame({"customer_id": [6, 7, 8, 9, 10],
"product_purchased": ["coke zero", "car", "coke zero", "coke zero", "water"]})
es_test = ft.EntitySet(id="customer_data")
es_test = es_test.entity_from_dataframe(entity_id="customers",
dataframe=test_data,
index="customer_id")
feature_matrix_test = ft.calculate_feature_matrix(entityset=es_test,
features=features_encoded)
feature_matrix_test.head()
As seen above, we were able to successfully handle the encoding and deal with both complications: a value from the train data (toothpaste) that is missing from the test data, and a new value (water) that appears only in the test data.
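A quick check (a sketch): the encoded test matrix should end up with exactly the same columns as the encoded training matrix, which is what lets a model trained on one consume the other:
set(feature_matrix_test.columns) == set(feature_matrix_train_encoded.columns)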
In [ ]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 4],
'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df
Notice how the id column has a duplicate index value of 4. If you try to create an entity with this Dataframe, you will run into the following error.
es = ft.EntitySet(id="product_data")
es = es.entity_from_dataframe(entity_id="products",
dataframe=product_df,
index="id")
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-63-a6e02ba6fa47> in <module>
2 es = es.entity_from_dataframe(entity_id="products",
3 dataframe=product_df,
----> 4 index="id")
~/featuretools/featuretools/entityset/entityset.py in entity_from_dataframe(self, entity_id, dataframe, index, variable_types, make_index, time_index, secondary_time_index, already_sorted)
486 secondary_time_index=secondary_time_index,
487 already_sorted=already_sorted,
--> 488 make_index=make_index)
489 self.entity_dict[entity.id] = entity
490 self.reset_data_description()
~/featuretools/featuretools/entityset/entity.py in __init__(self, id, df, entityset, variable_types, index, time_index, secondary_time_index, last_time_index, already_sorted, make_index, verbose)
79
80 self.df = df[[v.id for v in self.variables]]
---> 81 self.set_index(index)
82
83 self.time_index = None
~/featuretools/featuretools/entityset/entity.py in set_index(self, variable_id, unique)
450 self.df.index.name = None
451 if unique:
--> 452 assert self.df.index.is_unique, "Index is not unique on dataframe (Entity {})".format(self.id)
453
454 self.convert_variable_type(variable_id, vtypes.Index, convert_data=False)
AssertionError: Index is not unique on dataframe (Entity products)
To fix the above error, you can apply one of the following solutions:
Solution #1 - You can create a unique index on your Dataframe.
In [ ]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})
product_df
Notice how we now have a unique index column called id.
In [ ]:
es = es.entity_from_dataframe(entity_id="products",
dataframe=product_df,
index="id")
es
As seen above, we can now create our entity for our EntitySet without an error by creating a unique index in our Dataframe.
Solution #2 - Set make_index to True in your call to entity_from_dataframe to create a new index on that data.
make_index creates a unique integer index for each row based on the row's position relative to all the other rows.
In [ ]:
product_df = pd.DataFrame({'id': [1, 2, 3, 4, 4],
'rating': [3.5, 4.0, 4.5, 1.5, 5.0]})
es = ft.EntitySet(id="product_data")
es = es.entity_from_dataframe(entity_id="products",
dataframe=product_df,
index="product_id",
make_index=True)
es['products'].df
As seen above, we created our entity for our EntitySet without an error by using the make_index argument.
If you are using a training window and you haven't set a last_time_index for your entity, you will get this warning.
The training window in Featuretools limits the amount of past data that can be used while calculating a particular feature vector.
You can add the last_time_index to all entities automatically by calling your_entityset.add_last_time_indexes() after you create your EntitySet. This will remove the warning.
In [ ]:
es = ft.demo.load_mock_customer(return_entityset=True)
es.add_last_time_indexes()
Now we can run DFS without getting the warning.
In [ ]:
cutoff_times = pd.DataFrame()
cutoff_times['customer_id'] = [1, 2, 3, 1]
cutoff_times['time'] = pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00', '2014-1-1 06:00', '2014-1-1 08:00'])
cutoff_times['label'] = [True, True, False, True]
feature_matrix, feature_defs = ft.dfs(entityset=es,
target_entity="customers",
cutoff_time=cutoff_times,
cutoff_time_in_index=True,
training_window="1 hour")
time_index is when the instance was first known. last_time_index is when the instance appears for the last time.
Google Colab, by default, has Featuretools 0.4.1 installed. You may run into issues following our newest guides or latest documentation while using an older version of Featuretools. Therefore, we suggest you upgrade to the latest featuretools version by doing the following in your notebook in Google Colab:
!pip install -U featuretools
You may need to restart the runtime by going to Runtime -> Restart Runtime. You can check your Featuretools version by doing the following:
import featuretools as ft
print(ft.__version__)
You should see a version greater than 0.4.1.