Note: This notebook uses GraphLab Create 1.7.
An SFrame is a tabular data structure. If you are familiar with R or the pandas Python package, SFrames behave similarly to the dataframes available in those frameworks. An SFrame acts like a table consisting of zero or more columns; each column has its own datatype, and every column must contain the same number of entries as the others. There are two things that make SFrames very different from other dataframes: they are backed by disk rather than memory, so they scale to datasets larger than your machine's RAM, and their rows cannot be edited in place. We will see both of these points in action later in this tutorial.
This tutorial shows you how to import data into an SFrame, do some basic data cleaning/exploration, and save your work for later. If you are someone who likes to learn these things by reading comprehensive documentation instead of tutorials, you can visit our API Reference first. If not, read on!
First we will get set up with import statements for this tutorial.
In [1]:
import graphlab as gl
Reading a csv file from an S3 bucket is just one way to import your data into an SFrame. The read_csv function gives you lots of control over where to read your data from and how to parse it, which you can read about here. The column_type_hints option is important to highlight though. By default, SFrame tries to infer the types of the values it is parsing and usually does well, but providing a hint for the type of a column ensures it is parsed the way you intend. If type inference fails on a particular column, SFrame will simply interpret that column as a string. Here, only the year column is of type int, while the rest are strings.
(The csv file of song metadata comes from the Million Song Dataset. This data set was used for a Kaggle challenge and includes data from The Echo Nest, SecondHandSongs, musiXmatch, and Last.fm.)
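The hint-with-fallback behavior described above can be sketched in plain Python. This is only a toy illustration of the idea, not GraphLab's actual parser (which streams the file and infers types per column):

```python
# Toy illustration (not GraphLab's parser): try the hinted type for every
# value, and fall back to treating the whole column as strings on failure.
def parse_column(values, type_hint=str):
    parsed = []
    for v in values:
        try:
            parsed.append(type_hint(v))
        except ValueError:
            # Conversion failed somewhere: interpret the column as strings
            return [str(v) for v in values]
    return parsed

years = parse_column(['1969', '1988', '0'], type_hint=int)   # -> ints
titles = parse_column(['Imagine', '21'], type_hint=str)      # -> strings
```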
In [2]:
# In order to interact with S3 we need to set our AWS credentials.
# You can use your own credentials or use the ones below.
gl.aws.set_credentials('AKIAJMHKEZGY6YP24BXA', 'vf/miz2Zx7V7VkCai9ZeJR45ZSimqu6/W7qdRLmN')
# The below will download a 78 MB file.
song_sf = gl.SFrame.read_csv('https://static.turi.com/datasets/millionsong/song_data.csv',
column_type_hints = {'year' : int})
In [3]:
song_sf.num_rows()
Out[3]:
If the csv file we want to read does not have a header, we can still provide column_type_hints, but with GraphLab's default column names. Below is the code that would accomplish this, but I have commented it out because I don't want to affect the dataset we work with in the rest of this tutorial.
In [4]:
#song_sf = gl.SFrame.read_csv('https://static.turi.com/datasets/millionsong/song_data.csv', header=False,
# column_type_hints = {'X5' : int})
#song_sf.head(1)
Before we start playing with this data, I want to highlight that you can save and load an SFrame for later use. This is great if you don't want to re-download a file from S3 a bunch of times, or re-parse a large csv file. Here's how to save to your current directory:
In [5]:
song_sf.save('orig_song_data')
That save operation takes some time because it copies the files SFrame uses to the given location (in this case, an auto-created directory called 'orig_song_data'). The load operation, however, is instantaneous. This is one of the perks of using a disk-backed dataframe.
In [6]:
song_sf = gl.load_sframe('orig_song_data')
I can issue several commands to see that we are working with a fairly tame dataset. After all, we only have five columns.
In [7]:
song_sf.head(5)
Out[7]:
In [8]:
song_sf.tail(5)
Out[8]:
In [9]:
song_sf.num_rows(), len(song_sf)
Out[9]:
In [10]:
song_sf.num_cols()
Out[10]:
In [11]:
song_sf.column_names()
Out[11]:
In [12]:
song_sf.column_types()
Out[12]:
Alright, I want a little more out of this SFrame. I want to add a few columns. Let's say I care about the number of words in each song's title, what I've rated the song, and how old I was when each song came out.
In [13]:
year_i_was_born = 1988
# Count the number of words in each song title and add the word count as a new feature
song_sf['title_length'] = song_sf['title'].apply(lambda x: len(x.split()))
# Compute how old I was when this song came out
song_sf.add_column(song_sf.select_column('year').apply(lambda x: x - year_i_was_born),
'how_old_was_i')
# Add a 0 rating for every song
song_sf['my_rating'] = 0
song_sf.head(5)
Out[13]:
Clearly songs with a '0' year are a problem, but we'll cover that later.
A few things to note from the snippet above: bracket assignment (as with 'title_length' and 'my_rating') and add_column/select_column (as with 'how_old_was_i') are equivalent ways to add a column, and assigning a scalar fills an entire column with that value, which also makes it easy to overwrite a column:
In [14]:
song_sf['my_rating'] = 1
song_sf.head(5)
Out[14]:
We can also add several columns at a time:
In [15]:
song_sf[['dumb_col','dumb_col2']] = [song_sf['title_length'],song_sf['my_rating']]
song_sf.head(5)
Out[15]:
But maybe that was a dumb idea. Let's get rid of those. Before I do that, I'll show you how to rename and swap column ordering. Why not?
In [16]:
song_sf.rename({'dumb_col2' : 'another_dumb_col'})
song_sf.swap_columns('dumb_col', 'another_dumb_col')
del song_sf['dumb_col']
del song_sf['another_dumb_col']
song_sf.head(5)
Out[16]:
Still with me? Notice that the column types for the transformed columns are correct.
In [17]:
song_sf.column_types()
Out[17]:
Hold on though, I think I'd actually like the rating to be a float.
In [18]:
song_sf['my_rating'] = song_sf['my_rating'].astype(float)
song_sf.column_types()
Out[18]:
To create even more interesting feature columns, you may want to apply a function using multiple (or all) columns. When you apply a function to an SFrame (instead of just an SArray like I did earlier), the input to the function is a dictionary where the keys are your column names. Here I'd like to know what combination of song title, album title, and artist name mentions the word 'love' the most:
In [19]:
song_sf['love_count'] = song_sf[['release', 'title', 'artist_name']].apply(
lambda row: sum(x.lower().split().count('love') for x in row.values()))
song_sf.topk('love_count').head(5)
Out[19]:
We can see from these examples that adding and deleting columns is a simple task for an SFrame. This is because an SFrame is essentially a keeper of references to columns (SArrays), so adding and deleting columns is a very cheap operation. However, the fact that SFrames store their data on disk brings some important limitations when thinking about editing an SFrame.
Sequential access is king on disk, and this is very useful to remember when working with SFrames: inspecting a specific row performs quite poorly, and writing to a specific row is not possible. However, you'll find that you can still accomplish nearly everything you would do with a more classic dataframe using transform and filter operations, while reaping the huge benefit of working with SFrames that are larger than your machine's memory. So let's learn about filtering!
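Since rows can't be edited in place, edits are expressed as transformations that build new columns and filters that build new SFrames. The mask-then-filter pattern can be sketched in plain Python (illustrative only, not the SFrame API):

```python
# Plain-Python sketch of the mask-then-filter pattern.
years = [1969, 0, 1988, 0, 2004]
titles = ['a', 'b', 'c', 'd', 'e']

mask = [y > 0 for y in years]  # analogous to: song_sf['year'] > 0
valid_years = [y for y, keep in zip(years, mask) if keep]
valid_titles = [t for t, keep in zip(titles, mask) if keep]
# mask         -> [True, False, True, False, True]
# valid_years  -> [1969, 1988, 2004]
```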
I think I want to take care of those invalid year entries now. I don't really know how many there are, so I'll find out, as the answer to that may change what I do.
In [20]:
year_count = song_sf.groupby('year', gl.aggregate.COUNT)
print year_count.head()
print "Number of unique years: " + str(len(year_count))
print "Number of invalid years: "
year_count.topk('year', reverse=True, k=1)
Out[20]:
Yikes, that's almost half of my dataset. Maybe I don't want to just get rid of that data. SFrames support missing values, which are represented as 'None'. We will transform the appropriate values to missing here:
In [21]:
song_sf['year'] = song_sf['year'].apply(lambda x: None if x == 0 else x)
song_sf.head(5)
Out[21]:
To show that normal operations work on columns with missing values, we will do the 'how_old_was_i' transformation again.
In [22]:
song_sf['how_old_was_i'] = song_sf['year'].apply(lambda x: x - year_i_was_born)
song_sf.head(5)
Out[22]:
However, if I actually did want to filter out these missing values, that is easy too.
In [23]:
song_sf_valid_years = song_sf[song_sf['year'] > 0]
print "Length of trimmed data: " + str(len(song_sf_valid_years))
song_sf_valid_years.head(5)
Out[23]:
What I'm showing off here is that we can filter an SFrame by an SArray: only the rows that correspond to True entries in the given SArray make it through the filter. This happens whenever an SArray is given as the index of an SFrame. Furthermore, we can create a new SArray from an existing one by using any of the comparison operators. The filter above did both of these things at once, but you can do them in isolation as well. Here is the SArray that results from running the '> 0' comparison on its own:
In [24]:
tmp = song_sf['year'] > 0
tmp
Out[24]:
Keep in mind that the SArray must be the same length as the SFrame in order to filter. This also works with more complicated, chained filters using logical operators. Here's a list of songs that came out while I was in high school by a couple of my favorite bands from that period of my life:
In [25]:
my_fav_hs_songs = song_sf[((song_sf['artist_name'] == 'Relient K')
| (song_sf['artist_name'] == 'Streetlight Manifesto'))
& (song_sf['how_old_was_i'] >= 14) & (song_sf['how_old_was_i'] <= 18)]
my_fav_hs_songs
Out[25]:
That's not all of them, but that's a pretty decent selection for a dataset of a million songs. Notice that I had to use the bitwise operators instead of the 'and'/'or' keywords. Python does not allow overloading of the logical operators, so remember to use the bitwise ones.
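The reason is that Python's 'and'/'or' keywords truth-test their operands and offer no hook for a class to overload, while '&' and '|' dispatch to the __and__ and __or__ methods. A minimal mask class shows the idea (illustrative only, not the actual SArray implementation):

```python
# Minimal mask class: & and | can be made element-wise, 'and'/'or' cannot.
class Mask(object):
    def __init__(self, bits):
        self.bits = list(bits)
    def __and__(self, other):
        return Mask(a and b for a, b in zip(self.bits, other.bits))
    def __or__(self, other):
        return Mask(a or b for a, b in zip(self.bits, other.bits))

left = Mask([True, True, False])
right = Mask([True, False, False])
both = (left & right).bits    # element-wise: [True, False, False]
either = (left | right).bits  # element-wise: [True, True, False]
# 'left and right' would just truth-test the two objects, which is why
# SArray comparisons must be combined with & and |.
```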
The descriptive statistics below are operations on an SArray; they cannot be performed on an entire SFrame.
In [26]:
# Look at lots of descriptive statistics of title_length
print "mean: " + str(song_sf['title_length'].mean())
print "std: " + str(song_sf['title_length'].std())
print "var: " + str(song_sf['title_length'].var())
print "min: " + str(song_sf['title_length'].min())
print "max: " + str(song_sf['title_length'].max())
print "sum: " + str(song_sf['title_length'].sum())
print "number of non-zero entries: " + str(song_sf['title_length'].nnz())
We can accomplish essentially the same thing by getting a sketch_summary on this column. This will give the exact values of the descriptive statistics I asked for above, and then give approximate values of some other useful stuff like quantiles and counts of unique values. These values are approximate because performing the real operation on a dataset that is larger than your memory size could exhaust your memory or take too long to compute. Each operation has well-defined bounds on how wrong the answer will be, which are listed in our API Reference.
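As a toy illustration of why approximation helps on large data, one can estimate a quantile from a random sample instead of sorting the whole column. Note this is only a sketch of the general idea; GraphLab's sketch_summary uses proper streaming sketch data structures with well-defined error bounds, not simple sampling:

```python
import random

# Toy sampling-based quantile estimate (illustrative only).
def approx_quantile(values, q, sample_size=1000, seed=1):
    rng = random.Random(seed)
    if len(values) > sample_size:
        values = rng.sample(values, sample_size)  # work on a small sample
    ordered = sorted(values)
    return ordered[int(q * (len(ordered) - 1))]

data = list(range(100000))
est = approx_quantile(data, 0.5)  # close to the true median of 49999-50000
```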
In [27]:
approx_sketch = song_sf['title_length'].sketch_summary()
print approx_sketch
Saving the return value from sketch_summary gives you a graphlab.Sketch object, which can be queried further (details here). Here, I can drill deeper into those quantiles:
In [28]:
print approx_sketch.quantile(.25)
print approx_sketch.quantile(.75)
print approx_sketch.quantile(.993)
print approx_sketch.quantile(.995)
print approx_sketch.quantile(.997)
But wow...47 words?!? I gotta see what that song is.
In [29]:
top10_titles = song_sf.topk('title_length')
top10_titles
Out[29]:
In [30]:
top10_titles['title'][0]
Out[30]:
Makes sense...looks like a song with several movements. I'm somewhat curious about the titles with no words too.
In [31]:
song_sf.topk('title_length', k=5, reverse=True)
Out[31]:
Here are a couple boolean operations too, with which I can prove that there were, in fact, songs before I was born. Just not all of them.
In [32]:
before_i_was_born = song_sf['how_old_was_i'] < 0
before_i_was_born.all(), before_i_was_born.any()
Out[32]:
Let's try some deeper analysis, like: which albums have the most songs?
In [33]:
song_sf.groupby(['artist_name', 'release'], {'num_songs_in_album' : gl.aggregate.COUNT}).topk('num_songs_in_album')
Out[33]:
Our groupby function only supports aggregation after grouping. The aggregation functions you can use are listed here.
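A groupby with a COUNT aggregation can be sketched in plain Python with a dictionary keyed on the grouping columns (illustrative only, not the SFrame implementation, which runs out-of-core):

```python
# Plain-Python sketch of groupby + COUNT over dictionary rows.
def groupby_count(rows, key_cols):
    counts = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)  # one key per distinct group
        counts[key] = counts.get(key, 0) + 1
    return counts

songs = [
    {'artist_name': 'Relient K', 'release': 'Mmhmm'},
    {'artist_name': 'Relient K', 'release': 'Mmhmm'},
    {'artist_name': 'Streetlight Manifesto', 'release': 'Everything Goes Numb'},
]
album_counts = groupby_count(songs, ['artist_name', 'release'])
# album_counts[('Relient K', 'Mmhmm')] -> 2
```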
You can only go so far in analyzing this data though. We might want to match this data with user information, like how many times a certain person played one of these songs. For that, we need the join function, but first we need to read this data in as an SFrame:
In [34]:
usage_data = gl.SFrame.read_csv("https://static.turi.com/datasets/millionsong/10000.txt", header=False, delimiter='\t', column_type_hints={'X3':int})
usage_data.rename({'X1':'user_id', 'X2':'song_id', 'X3':'listen_count'})
Out[34]:
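The join we are about to perform can be sketched in plain Python as a hash join on 'song_id': index one side by the key, then look up each row of the other side. This is illustrative only; the hypothetical rows below are made up, and the SFrame join is implemented out-of-core:

```python
# Plain-Python sketch of an inner join on a key column (a simple hash join).
def inner_join(left_rows, right_rows, key):
    index = {}
    for r in right_rows:                      # build a hash index on the key
        index.setdefault(r[key], []).append(r)
    joined = []
    for l in left_rows:                       # probe the index for matches
        for r in index.get(l[key], []):
            merged = dict(l)
            merged.update(r)
            joined.append(merged)
    return joined

songs = [{'song_id': 'S1', 'title': 'Keasbey Nights'}]
plays = [{'song_id': 'S1', 'user_id': 'evan', 'listen_count': 4},
         {'song_id': 'S2', 'user_id': 'bob', 'listen_count': 1}]
rows = inner_join(songs, plays, 'song_id')  # only the matching S1 row survives
```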
I could just join the listen data with the song data, but maybe I'll do something a bit more interesting. Let's find out how many users from this dataset have listened to any one of those songs from my high school times, compared to the total number of users. First we need the total number of users:
In [35]:
num_users = len(usage_data['user_id'].unique())
print num_users
In [36]:
fav_hs_listen_data = my_fav_hs_songs.join(usage_data, 'song_id')
num_fav_hs_users = len(fav_hs_listen_data['user_id'].unique())
print num_fav_hs_users
print float(num_fav_hs_users) / float(num_users)
That's really small. Those other people don't know what they're missing. Maybe the small proportion is simply because I'm only using a list of 42 songs. For kicks, what is the most popular song of that set of songs?
In [37]:
most_popular = fav_hs_listen_data.groupby(['song_id'], {'total_listens':gl.aggregate.SUM('listen_count'),
'num_unique_users':gl.aggregate.COUNT('user_id')})
most_popular.join(song_sf, 'song_id').topk('total_listens',k=20)
Out[37]:
...and only 5 even got listens, but "Keasbey Nights" wins from this small subset. Now, suppose I was a cheater and wanted to make this look a little better? I'll pretend I am so you can see 'append' in action.
In [38]:
me = gl.SFrame({'user_id':['evan'],'song_id':['SOSFAVU12A6D4FDC6A'],'listen_count':[4000]})
usage_data = usage_data.append(me)
fav_hs_listen_data = my_fav_hs_songs.join(usage_data, 'song_id')
most_popular = fav_hs_listen_data.groupby(['song_id'], {'total_listens':gl.aggregate.SUM('listen_count'),
'num_unique_users':gl.aggregate.COUNT('user_id')})
most_popular.join(song_sf, 'song_id').topk('total_listens',k=20)
Out[38]:
We're almost done with the tour of features. For easy splitting into training and test sets, we have the random_split function:
In [39]:
# Randomly split data rows into two subsets
first_set, second_set = song_sf.random_split(0.8, seed = 1)
first_set.num_rows(), second_set.num_rows()
Out[39]:
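Under the hood, a seeded random split can be sketched in plain Python: flip a weighted coin per row, with the seed making the split reproducible. This is illustrative only; GraphLab's actual implementation differs:

```python
import random

# Plain-Python sketch of a seeded random split (illustrative only).
def random_split(rows, fraction, seed=1):
    rng = random.Random(seed)  # fixed seed -> reproducible split
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

first, second = random_split(list(range(1000)), 0.8, seed=1)
# Sizes are only approximately 800/200; the exact counts vary with the seed.
```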
If you want to split on a predicate though, you'll have to do that manually.
In [40]:
songs_before = song_sf[song_sf['how_old_was_i'] < 0]
songs_after = song_sf[song_sf['how_old_was_i'] >= 0]
songs_before.num_rows(), songs_after.num_rows()
Out[40]:
We can also get a random sample of the dataset.
In [41]:
pct37 = song_sf.sample(.37)
pct37.num_rows()
Out[41]:
SArrays support lots of mathematical operations. They can be performed with a scalar
In [42]:
sa = gl.SArray([1,2,3])
sa2 = sa * 2
print sa2
...or they can be performed element-wise with another SArray.
In [43]:
add = sa + sa2
div = sa / sa2
print add
print div
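The element-wise semantics above can be sketched in plain Python with list comprehensions and zip (illustrative only; real SArrays do this lazily and out-of-core):

```python
# Plain-Python sketch of scalar and element-wise SArray arithmetic.
sa = [1, 2, 3]
sa2 = [x * 2 for x in sa]                      # scalar multiply: [2, 4, 6]
add = [a + b for a, b in zip(sa, sa2)]         # element-wise add: [3, 6, 9]
div = [float(a) / b for a, b in zip(sa, sa2)]  # element-wise divide: [0.5, 0.5, 0.5]
```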
You can also iterate over SArrays and SFrames. When iterating over an SFrame, the returned element is a Python dictionary.
In [44]:
for i in song_sf:
    if i['title_length'] >= 45:
        print "Whoa that's long!"
I think I'm done exploring this dataset, but I'd like to save it for later. There are a couple of ways I can do this. I can save it to a csv:
In [45]:
song_sf.save('new_song_data.csv', format='csv')
Or I can just save it as an SFrame as I showed earlier.
In [46]:
song_sf.save('new_song_data_sframe')
And of course, we can do all of this on S3. Note that if you download this notebook and run it, you won't be able to save to our datasets bucket. Simply set your AWS credentials and uncomment the code below (replacing our S3 bucket with yours) to see this in action.
In [47]:
# In order to save to S3, you will need to use your own bucket and your own credentials.
# You can set your AWS credentials using the below function:
# graphlab.aws.set_credentials(<access_key_id>, <secret_access_key>)
#song_sf.save('https://static.turi.com/datasets/my_sframes/new_song_sframe') # S3://<bucket-name>/<file-path>
Now to load an SFrame back, we use the handy 'load_sframe' function like before. This takes the name of the SFrame's top-level directory:
In [48]:
# The below will download about 78 MB.
#hello_again = gl.load_sframe('https://static.turi.com/datasets/my_sframes/new_song_sframe')
SArrays can be saved in a similar fashion. That's it! Our API reference covers every function associated with SFrames and SArrays in detail.