Working with user-activity data

In this notebook, you will learn how to:

Work with user-activity data as a TimeSeries

  • Import data from a csv.
  • Explore the data.
  • Convert the data into TimeSeries.
  • Perform a few simple aggregations to explore the user-activity data.

Let's get started!


In [1]:
import graphlab as gl
import datetime
gl.canvas.set_target('ipynb') # make sure plots appear inline


A newer version of GraphLab Create (v2.0.1) is available! Your current version is v2.0.

You can use pip to upgrade the graphlab-create package. For more information see https://turi.com/products/create/upgrade.
This commercial license of GraphLab Create is assigned to engr@turi.com.
[INFO] graphlab.cython.cy_server: GraphLab Create v2.0 started. Logging: /tmp/graphlab_server_1468133247.log

Loading data from a flat- ile

Other common sources of data include

  • SQL tables
  • Spark RDD
  • Pandas DataFrames
  • Numpy
  • Network file systems such as HDFS, S3

In [2]:
interactions = gl.SFrame.read_csv("data/online_retail_interactions.csv")


Finished parsing file /Users/srikris/workspace/tutorials/dss-2016/churn_prediction/data/online_retail_interactions.csv
Parsing completed. Parsed 100 lines in 1.4943 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str,int,str,float,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/srikris/workspace/tutorials/dss-2016/churn_prediction/data/online_retail_interactions.csv
Parsing completed. Parsed 532618 lines in 1.21511 secs.

In [3]:
interactions


Out[3]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
536365 85123A WHITE HANGING HEART
T-LIGHT HOLDER ...
6 12/1/10 8:26 2.55 17850
536365 71053 WHITE METAL LANTERN 6 12/1/10 8:26 3.39 17850
536365 84406B CREAM CUPID HEARTS COAT
HANGER ...
8 12/1/10 8:26 2.75 17850
536365 84029G KNITTED UNION FLAG HOT
WATER BOTTLE ...
6 12/1/10 8:26 3.39 17850
536365 84029E RED WOOLLY HOTTIE WHITE
HEART. ...
6 12/1/10 8:26 3.39 17850
536365 22752 SET 7 BABUSHKA NESTING
BOXES ...
2 12/1/10 8:26 7.65 17850
536365 21730 GLASS STAR FROSTED
T-LIGHT HOLDER ...
6 12/1/10 8:26 4.25 17850
536366 22633 HAND WARMER UNION JACK 6 12/1/10 8:28 1.85 17850
536366 22632 HAND WARMER RED POLKA DOT 6 12/1/10 8:28 1.85 17850
536367 84879 ASSORTED COLOUR BIRD
ORNAMENT ...
32 12/1/10 8:34 1.69 13047
[532618 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring the data


In [4]:
interactions['Quantity'].show()



In [5]:
interactions["CustomerID"]


Out[5]:
dtype: int
Rows: 532618
[17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 13047, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 12583, 13748, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 15100, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 17850, 15291, 15291, 17850, 17850, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, 14688, ... ]

In [6]:
# Make sure all ID columns are interpreted as string
interactions["InvoiceNo"] = interactions["InvoiceNo"].astype(str)
interactions["CustomerID"] = interactions["CustomerID"].astype(str)

In [7]:
interactions["CustomerID"].show()


What does an active user look like?


In [8]:
active_user = interactions[interactions["CustomerID"] == '17841']
active_user


Out[8]:
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID
536557 21495 SKULLS AND CROSSBONES
WRAP ...
25 12/1/10 14:41 0.42 17841
536557 46000R POLYESTER FILLER PAD
45x30cm ...
2 12/1/10 14:41 1.45 17841
536557 46000S POLYESTER FILLER PAD
40x40cm ...
1 12/1/10 14:41 1.45 17841
536557 22731 3D CHRISTMAS STAMPS
STICKERS ...
1 12/1/10 14:41 1.25 17841
536557 21258 VICTORIAN SEWING BOX
LARGE ...
1 12/1/10 14:41 12.75 17841
536557 21041 RED RETROSPOT OVEN GLOVE
DOUBLE ...
1 12/1/10 14:41 2.95 17841
536557 84920 PINK FLOWER FABRIC PONY 1 12/1/10 14:41 3.75 17841
536557 22173 METAL 4 HOOK HANGER
FRENCH CHATEAU ...
1 12/1/10 14:41 2.95 17841
536557 22953 BIRTHDAY PARTY CORDON
BARRIER TAPE ...
1 12/1/10 14:41 1.25 17841
536557 84508A CAMOUFLAGE DESIGN TEDDY 1 12/1/10 14:41 2.55 17841
[? rows x 7 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.


In [9]:
active_user.show(x = 'InvoiceDate', y = 'Quantity', view = 'Bar Chart')


What does an inactive user look like?


In [10]:
inactive_user = interactions[interactions["CustomerID"] == '14388']
inactive_user.show(x = 'InvoiceDate', y = 'Quantity', view = 'Bar Chart')


Import & explore user meta-data


In [11]:
users = gl.SFrame.read_csv("data/online_retail_users.csv")


Finished parsing file /Users/srikris/workspace/tutorials/dss-2016/churn_prediction/data/online_retail_users.csv
Parsing completed. Parsed 100 lines in 0.034386 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Finished parsing file /Users/srikris/workspace/tutorials/dss-2016/churn_prediction/data/online_retail_users.csv
Parsing completed. Parsed 4340 lines in 0.014897 secs.

In [12]:
users['CustomerID'] = users['CustomerID'].astype(str)
users


Out[12]:
CustomerID Country user_type
13050 United Kingdom extra-heavy
14515 United Kingdom extra-heavy
16257 United Kingdom moderate
17885 United Kingdom extra-heavy
13560 United Kingdom heavy
15863 United Kingdom heavy
14406 United Kingdom moderate
13518 United Kingdom heavy
14388 United Kingdom extra-heavy
16200 United Kingdom extra-heavy
[4340 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [13]:
users['user_type'].show()



In [18]:
users.save('data/users.sf')

2. Converting activity-data into a TimeSeries


In [19]:
interactions['InvoiceDate']


Out[19]:
dtype: str
Rows: 532618
['12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:26', '12/1/10 8:28', '12/1/10 8:28', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:34', '12/1/10 8:35', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 8:45', '12/1/10 9:00', '12/1/10 9:01', '12/1/10 9:01', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:02', '12/1/10 9:09', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:32', '12/1/10 9:34', '12/1/10 9:34', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', '12/1/10 9:37', ... ]

Convert string to datetime


In [22]:
def str_to_datetime(x):
    import dateutil
    from dateutil import parser
    return dateutil.parser.parse(x)

In [23]:
str_to_datetime('12/1/10 14:41')


Out[23]:
datetime.datetime(2010, 12, 1, 14, 41)

In [24]:
interactions['InvoiceDate'] = interactions['InvoiceDate'].apply(str_to_datetime)
interactions['InvoiceDate']


Out[24]:
dtype: datetime
Rows: 532618
[datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 26), datetime.datetime(2010, 12, 1, 8, 28), datetime.datetime(2010, 12, 1, 8, 28), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 34), datetime.datetime(2010, 12, 1, 8, 35), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 8, 45), datetime.datetime(2010, 12, 1, 9, 0), datetime.datetime(2010, 12, 1, 9, 1), datetime.datetime(2010, 12, 1, 9, 1), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 2), datetime.datetime(2010, 12, 1, 9, 9), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 32), datetime.datetime(2010, 12, 1, 9, 34), datetime.datetime(2010, 12, 1, 9, 34), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), datetime.datetime(2010, 12, 1, 9, 37), ... ]

Convert SFrame to a TimeSeries


In [25]:
interactions_ts = gl.TimeSeries(interactions, 'InvoiceDate')

In [26]:
interactions_ts


Out[26]:
InvoiceDate InvoiceNo StockCode Description Quantity UnitPrice CustomerID
2010-12-01 08:26:00 536365 85123A WHITE HANGING HEART
T-LIGHT HOLDER ...
6 2.55 17850
2010-12-01 08:26:00 536365 71053 WHITE METAL LANTERN 6 3.39 17850
2010-12-01 08:26:00 536365 84406B CREAM CUPID HEARTS COAT
HANGER ...
8 2.75 17850
2010-12-01 08:26:00 536365 84029G KNITTED UNION FLAG HOT
WATER BOTTLE ...
6 3.39 17850
2010-12-01 08:26:00 536365 84029E RED WOOLLY HOTTIE WHITE
HEART. ...
6 3.39 17850
2010-12-01 08:26:00 536365 22752 SET 7 BABUSHKA NESTING
BOXES ...
2 7.65 17850
2010-12-01 08:26:00 536365 21730 GLASS STAR FROSTED
T-LIGHT HOLDER ...
6 4.25 17850
2010-12-01 08:28:00 536366 22633 HAND WARMER UNION JACK 6 1.85 17850
2010-12-01 08:28:00 536366 22632 HAND WARMER RED POLKA DOT 6 1.85 17850
2010-12-01 08:34:00 536367 84879 ASSORTED COLOUR BIRD
ORNAMENT ...
32 1.69 13047
[532618 rows x 7 columns]
Note: Only the head of the TimeSeries is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


In [28]:
interactions_ts.save('data/user_activity_data.ts/')