Python for data analysis

A fun-filled tutorial on data scraping, munging, analysis, and visualization with Pandas

But seriously, why Python?

  • An interpreted language with clean, readable, concise code
  • Multiplatform, and supports multiple programming paradigms
  • An integrated pipeline for all phases of data collection, processing, and analysis
  • An extensive standard library, plus massive repositories of packages for just about anything you can think of (with easy-to-use tools for managing them)
  • Open source, with a highly active development community

And don't forget the Jupyter (née IPython) notebook

  • Code and results in an integrated interface
  • Easily save and share ad-hoc analyses
  • Export in a variety of formats...
  • ...even as slides (see the example command below)!
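
For example, a notebook can be exported to reveal.js slides from the command line (the notebook filename here is just a placeholder):

jupyter nbconvert --to slides talk.ipynb   # on older IPython installs: ipython nbconvert --to slides talk.ipynb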

Jupyter homepage: https://jupyter.org/

Let's get some data!


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from IPython.core.display import display

In [3]:
data = pd.read_table('mylistening.txt',header=None,
    names=['user_id','item_id','artist_id','timestamp'],parse_dates=['timestamp'])

But there are lots of other ways to get data into Python...

Read from a database


In [4]:
# import MySQLdb
# db=MySQLdb.connect(host='rdc04.uits.iu.edu',port=XXXX,user=XXXX,passwd=XXX,db='analysis_lastfm')
# cursor = db.cursor()
# cursor.execute("SELECT column_a, column_b, column_c FROM some_table")
# result = cursor.fetchall()
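
Pandas can also read query results straight into a DataFrame. A minimal sketch via SQLAlchemy (the connection string and table name are placeholders, so it's left commented out like the cell above):

# from sqlalchemy import create_engine
# engine = create_engine('mysql://USER:PASSWORD@HOST/analysis_lastfm')
# frame = pd.read_sql('SELECT column_a, column_b, column_c FROM some_table', engine)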

Scrape data from the web!

  • Roll your own with urllib2, BeautifulSoup, lxml, etc. (see the sketch below)
  • Lots of purpose-built libraries for pulling data from specific APIs (e.g. PyLast, musicbrainzngs, ...)
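
A minimal roll-your-own sketch, assuming BeautifulSoup 4 is installed (the URL is just a placeholder):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com/some_page').read()   # placeholder URL
soup = BeautifulSoup(html)
# collect the text of every link on the page
link_texts = [a.get_text() for a in soup.find_all('a')]
print link_texts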

Use Mechanical Turk!

  • Boto (see the sketch below)
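
A rough sketch with the (old-style) boto library; the credentials are placeholders, and the host shown is the MTurk sandbox endpoint:

from boto.mturk.connection import MTurkConnection

mtc = MTurkConnection(aws_access_key_id='XXXX',
                      aws_secret_access_key='XXXX',
                      host='mechanicalturk.sandbox.amazonaws.com')   # sandbox, not the live marketplace
print mtc.get_account_balance()   # quick check that the connection works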

You have your dataframe. Now what?

Basic operations:


In [5]:
# preview the data: 
data.head()


Out[5]:
user_id item_id artist_id timestamp
0 5759068 41 1 2007-11-05 13:33:28
1 5759068 41 1 2009-03-29 20:18:48
2 5759068 41 1 2009-06-04 23:35:14
3 5759068 41 1 2010-02-06 03:51:20
4 5759068 41 1 2010-03-24 21:02:33

In [6]:
data.tail()


Out[6]:
user_id item_id artist_id timestamp
58828 5759068 39064966 39064958 2010-03-29 18:07:03
58829 5759068 39064990 39064991 2007-12-31 14:00:14
58830 5759068 39065008 39065006 2012-11-24 23:56:02
58831 5759068 39065013 39065014 2007-07-25 19:24:11
58832 5759068 39065015 39065014 2007-08-09 04:01:52

In [7]:
# simple stats:
data.describe()


Out[7]:
user_id item_id artist_id
count 58833 58833.000000 58833.000000
mean 5759068 2250941.841738 191385.196165
std 0 6646153.662761 2126089.930947
min 5759068 15.000000 1.000000
25% 5759068 87753.000000 1211.000000
50% 5759068 239366.000000 2965.000000
75% 5759068 1120718.000000 13973.000000
max 5759068 39065044.000000 39065014.000000

In [8]:
# sorting
data = data.sort('timestamp',ascending=True) # in newer pandas: sort_values
data.head()


Out[8]:
user_id item_id artist_id timestamp
10061 5759068 3623 405 2007-02-24 22:54:28
42885 5759068 74220 11337 2007-02-24 22:59:07
34972 5759068 64735 4536 2007-02-24 23:01:54
52132 5759068 2021815 42105 2007-02-25 01:55:39
30434 5759068 39064541 3014 2007-02-25 02:00:36

Leveraging Numpy


In [9]:
# "vanilla" python

x = range(10)
print x

print x*3
print x+10


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-20fae026fa5d> in <module>()
      5 
      6 print x*3
----> 7 print x+10

TypeError: can only concatenate list (not "int") to list

In [10]:
new1 = [i*3 for i in x]
print new1

new2 = [i+10 for i in x]
new2


[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
Out[10]:
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [11]:
x_arr = np.array(x)
display(x_arr)

display(x_arr*3)
display(x_arr+10)


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [12]:
print np.sin(x_arr)
print np.exp(x_arr)

print "Mean of array -> %s" % x_arr.mean()
print "Index of highest value in array -> %s" % x_arr.argmax()
print "Sum of array -> %s" % x_arr.sum()
print "Product of array -> %s" % x_arr.prod()


[ 0.          0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427
 -0.2794155   0.6569866   0.98935825  0.41211849]
[  1.00000000e+00   2.71828183e+00   7.38905610e+00   2.00855369e+01
   5.45981500e+01   1.48413159e+02   4.03428793e+02   1.09663316e+03
   2.98095799e+03   8.10308393e+03]
Mean of array -> 4.5
Index of highest value in array -> 9
Sum of array -> 45
Product of array -> 0

In [13]:
data.mean(axis=0) # axis=0: aggregate over rows (one value per column)
### note that these only operate on *numeric* columns


Out[13]:
user_id      5759068.000000
item_id      2250941.841738
artist_id     191385.196165
dtype: float64

In [14]:
print data.sum(axis=1).head(15) # axis=1: aggregate over columns (one value per row)


10061     5763096
42885     5844625
34972     5828339
52132     7822988
30434    44826623
19986     5778037
15221     6262042
37625     5913179
10294     5866792
31556     5852659
13577     5849756
27241     5809060
12008     6089487
48280     6051698
48996     5831800
dtype: int64

Selecting and grouping data, and playing with columns...


In [15]:
# get a column:
data['artist_id'].head(10)


Out[15]:
10061      405
42885    11337
34972     4536
52132    42105
30434     3014
19986     1996
15221     1328
37625     7432
10294      405
31556     3875
Name: artist_id, dtype: int64

In [16]:
# get rows matching a condition
data[data['artist_id']==405].head(10)


Out[16]:
user_id item_id artist_id timestamp
10061 5759068 3623 405 2007-02-24 22:54:28
10294 5759068 107319 405 2007-02-25 02:12:30
10429 5759068 168925 405 2007-02-25 04:41:51
10382 5759068 129362 405 2007-02-26 06:12:02
10184 5759068 49058 405 2007-02-28 04:00:49
10173 5759068 47925 405 2007-02-28 05:48:00
10514 5759068 337929 405 2007-03-01 06:02:02
10167 5759068 47901 405 2007-03-02 07:27:57
10224 5759068 91961 405 2007-03-02 08:32:49
10062 5759068 3623 405 2007-03-02 08:36:12

In [17]:
# SQL-style joins and other operations
artists = pd.read_table('tmp') # artist_id -> artist name lookup table
artists.head()


Out[17]:
artist_id artist
0 1 slipknot
1 12 %c3%9cnloco
2 14 muse
3 18 earshot
4 35 drowning+pool

In [18]:
data = data.merge(artists,on='artist_id',how='left')
data.head()


Out[18]:
user_id item_id artist_id timestamp artist
0 5759068 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age
1 5759068 74220 11337 2007-02-24 22:59:07 lupe+fiasco
2 5759068 64735 4536 2007-02-24 23:01:54 black+eyed+peas
3 5759068 2021815 42105 2007-02-25 01:55:39 aesop+rock
4 5759068 39064541 3014 2007-02-25 02:00:36 at+the+drive-in

In [19]:
# Drop column 
data = data.drop('user_id',axis=1)

In [20]:
# Group and aggregate!
data.groupby('artist').count().tail(10)


Out[20]:
item_id artist_id timestamp
artist
zero+7 8 8 8
zero+down 6 6 6
ziggy+marley 1 1 1
ziggy+marley+&+the+melody+makers 4 4 4
zlad! 6 6 6
zo%c3%a9 9 9 9
zoe 2 2 2
zomboy 3 3 3
zox 1 1 1
zurdok 5 5 5

In [21]:
data['artist'].value_counts().head(10)


Out[21]:
sigur+r%c3%b3s              2370
radiohead                   1750
the+mars+volta              1477
nine+inch+nails             1417
mogwai                      1221
nick+cave+&+warren+ellis    1159
beirut                      1125
tool                        1055
muse                         888
ratatat                      839
dtype: int64

In [22]:
data.groupby(['artist','item_id']).count().tail(10)


Out[22]:
artist_id timestamp
artist item_id
zo%c3%a9 1691584 1 1
2983328 2 2
37276281 1 1
39065043 2 2
zoe 39065044 2 2
zomboy 380522 1 1
855510 1 1
1354925 1 1
zox 4123632 1 1
zurdok 1351103 5 5

In [23]:
def group_by_first_letter(artist_name):
    return artist_name[0] # first letter of names

data.set_index('artist').groupby(group_by_first_letter).count().head(10)


Out[23]:
item_id artist_id timestamp
% 818 818 818
. 232 232 232
1 13 13 13
2 51 51 51
3 16 16 16
5 3 3 3
6 528 528 528
a 4624 4624 4624
b 2898 2898 2898
c 2412 2412 2412

Let's try some more realistic examples...

How much do I listen each month?


In [24]:
time_indexed = data.set_index('timestamp')
time_indexed.resample('M',how='count')['artist'].plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10478e050>

In [25]:
fig,axes = plt.subplots(1,2,figsize=(12,4))
time_indexed.resample('A',how='count')['artist'].plot(ax=axes[0])
time_indexed.resample('W',how='count')['artist'].plot(ax=axes[1])


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b1ec550>

Can we break it down by artist?


In [26]:
monthly_listens_by_artist = data.groupby(['artist','timestamp']).count()['item_id'] # per-timestamp counts for now; resampled to months further down
monthly_listens_by_artist.tail(25)


Out[26]:
artist    timestamp          
zlad!     2007-07-31 00:43:48    1
          2007-10-26 16:49:01    1
          2008-03-02 13:05:52    1
          2008-03-12 17:31:00    1
          2008-03-14 09:08:54    1
zo%c3%a9  2008-01-02 17:30:35    1
          2008-01-02 17:35:21    1
          2008-01-02 17:40:11    1
          2008-02-18 00:08:55    1
          2008-07-14 19:19:28    1
          2008-08-28 05:52:51    1
          2009-02-13 00:11:48    1
          2009-03-05 05:49:31    1
          2009-10-13 08:50:00    1
zoe       2008-07-08 05:18:54    1
          2008-08-20 01:34:02    1
zomboy    2012-06-04 22:55:37    1
          2012-06-08 23:32:59    1
          2012-12-04 23:34:50    1
zox       2009-02-10 00:00:15    1
zurdok    2008-01-19 23:00:32    1
          2008-01-30 13:15:51    1
          2008-02-03 14:50:28    1
          2008-04-28 07:41:57    1
          2009-06-12 20:13:36    1
Name: item_id, dtype: int64

In [27]:
monthly_listens_by_artist = monthly_listens_by_artist.unstack()
monthly_listens_by_artist.tail(10)


Out[27]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
artist
zero+7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zero+down NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ziggy+marley NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ziggy+marley+&+the+melody+makers NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zlad! NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zo%c3%a9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zoe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zomboy NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zox NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zurdok NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows × 58668 columns


In [28]:
monthly_listens_by_artist.sum(axis=1).tail(10)


Out[28]:
artist
zero+7                              8
zero+down                           6
ziggy+marley                        1
ziggy+marley+&+the+melody+makers    4
zlad!                               6
zo%c3%a9                            9
zoe                                 2
zomboy                              3
zox                                 1
zurdok                              5
dtype: float64

In [29]:
top_artists = data['artist'].value_counts()[:5]
top_artists


Out[29]:
sigur+r%c3%b3s     2370
radiohead          1750
the+mars+volta     1477
nine+inch+nails    1417
mogwai             1221
dtype: int64

In [30]:
monthly_listens_by_artist.reindex(top_artists.index)


Out[30]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
sigur+r%c3%b3s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1 1 NaN NaN NaN NaN NaN NaN NaN NaN
radiohead NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
the+mars+volta NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nine+inch+nails NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mogwai NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 58668 columns


In [31]:
monthly_listens_by_artist.reindex(top_artists.index).fillna(0)


Out[31]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
sigur+r%c3%b3s 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0
radiohead 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
the+mars+volta 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
nine+inch+nails 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
mogwai 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 58668 columns


In [32]:
monthly_listens_by_artist.reindex(top_artists.index).fillna(0).T.head(10)


Out[32]:
sigur+r%c3%b3s radiohead the+mars+volta nine+inch+nails mogwai
timestamp
2007-02-24 22:54:28 0 0 0 0 0
2007-02-24 22:59:07 0 0 0 0 0
2007-02-24 23:01:54 0 0 0 0 0
2007-02-25 01:55:39 0 0 0 0 0
2007-02-25 02:00:36 0 0 0 0 0
2007-02-25 02:03:25 0 0 0 0 0
2007-02-25 02:05:00 0 0 0 0 0
2007-02-25 02:09:16 0 0 0 0 0
2007-02-25 02:12:30 0 0 0 0 0
2007-02-25 02:15:22 0 0 0 0 0

In [33]:
to_plot = monthly_listens_by_artist.reindex(top_artists.index).fillna(0).T.resample('M',how='sum')
to_plot.head()


Out[33]:
sigur+r%c3%b3s radiohead the+mars+volta nine+inch+nails mogwai
timestamp
2007-02-28 1 3 16 0 2
2007-03-31 0 22 2 1 13
2007-04-30 1 48 2 24 11
2007-05-31 2 14 1 61 25
2007-06-30 1 43 3 6 1

In [34]:
to_plot.plot()


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a6b54d0>

In [35]:
to_plot.cumsum().plot()


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x14cc51850>

In [36]:
# everything from the previous cells, chained into one expression
(data.groupby(['artist','timestamp']).count()['item_id']
     .unstack()
     .reindex(top_artists.index)
     .fillna(0)
     .T
     .resample('M',how='sum')
     .cumsum()
     .plot())


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x14d1109d0>

But does Python do stats?


In [37]:
import scipy.stats as stats
import statsmodels

In [38]:
data['subject'] = np.random.random_integers(0,19,len(data))
data['measurement'] = np.random.random(len(data))
data['condition'] = data['subject'].apply(lambda x: 'a' if x%2==0 else 'b')
data.head()


Out[38]:
item_id artist_id timestamp artist subject measurement condition
0 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age 18 0.200160 a
1 74220 11337 2007-02-24 22:59:07 lupe+fiasco 16 0.958234 a
2 64735 4536 2007-02-24 23:01:54 black+eyed+peas 9 0.671806 b
3 2021815 42105 2007-02-25 01:55:39 aesop+rock 15 0.753624 b
4 39064541 3014 2007-02-25 02:00:36 at+the+drive-in 8 0.658518 a

In [39]:
# A simple group comparison
grpa =  data[data['condition']=='a']['measurement']
grpb =  data[data['condition']=='b']['measurement']
print 'Group A'
print  grpa.describe()
print 'Group B'
print  grpb.describe()
stats.ttest_ind(grpa,grpb)


Group A
count    29476.000000
mean         0.500758
std          0.289482
min          0.000049
25%          0.249881
50%          0.501190
75%          0.752262
max          0.999988
Name: measurement, dtype: float64
Group B
count    29357.000000
mean         0.498694
std          0.288415
min          0.000005
25%          0.247028
50%          0.496558
75%          0.746978
max          0.999995
Name: measurement, dtype: float64
Out[39]:
Ttest_indResult(statistic=0.86629475221585306, pvalue=0.38633207626664212)

In [40]:
data['some_IV'] = np.random.choice(['x','y','z'],len(data))
data.head()


Out[40]:
item_id artist_id timestamp artist subject measurement condition some_IV
0 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age 18 0.200160 a z
1 74220 11337 2007-02-24 22:59:07 lupe+fiasco 16 0.958234 a y
2 64735 4536 2007-02-24 23:01:54 black+eyed+peas 9 0.671806 b z
3 2021815 42105 2007-02-25 01:55:39 aesop+rock 15 0.753624 b z
4 39064541 3014 2007-02-25 02:00:36 at+the+drive-in 8 0.658518 a x

In [41]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model=ols('measurement ~ C(condition) + C(some_IV) + C(condition):C(some_IV)', data=data).fit() 
print anova_lm(model)


                            df       sum_sq   mean_sq         F    PR(>F)
C(condition)                 1     0.062658  0.062658  0.750507  0.386319
C(some_IV)                   2     0.046413  0.023207  0.277963  0.757326
C(condition):C(some_IV)      2     0.554453  0.277226  3.320561  0.036139
Residual                 58827  4911.338700  0.083488       NaN       NaN

Visualization

Matplotlib and pandas


In [42]:
from matplotlib import pyplot as plt
%matplotlib inline

example = pd.DataFrame({'a':np.random.random(10),'b':np.random.random(10),'c':np.random.random(10)})
example.head()


Out[42]:
a b c
0 0.010652 0.474971 0.219931
1 0.076075 0.597393 0.904103
2 0.153507 0.429889 0.591455
3 0.674325 0.589722 0.930124
4 0.398727 0.536206 0.251662

In [43]:
plt.plot(example['a'])


Out[43]:
[<matplotlib.lines.Line2D at 0x10e4a7450>]

In [44]:
fig,ax = plt.subplots(1,1)
for column in ['a','b','c']:
    ax.plot(example[column])



In [45]:
fig,ax = plt.subplots(1,1)
for column in ['a','b','c']:
    ax.plot(example[column],label='column '+column)
ax.legend()


Out[45]:
<matplotlib.legend.Legend at 0x110ec5610>

In [46]:
example.plot()


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x110efe490>

In [47]:
fig,axes = plt.subplots(3,3,figsize=(16,12))
ax_iter = axes.flat
example.plot(kind='bar',ax=ax_iter.next())
example.plot(kind='barh',ax=ax_iter.next())
example.plot(kind='hist',ax=ax_iter.next())
example.plot(kind='box',ax=ax_iter.next())
example.plot(kind='density',ax=ax_iter.next())
example.plot(kind='area',ax=ax_iter.next())
example.plot(kind='scatter',x='a',y='b',ax=ax_iter.next())
example.plot(kind='pie',y='a',ax=ax_iter.next())
example.plot(kind='hexbin',x='a',y='b',gridsize=20,ax=ax_iter.next())


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x111489b90>

In [48]:
example = data[['condition','some_IV','measurement']]
example.head()


Out[48]:
condition some_IV measurement
0 a z 0.200160
1 a y 0.958234
2 b z 0.671806
3 b z 0.753624
4 a x 0.658518

In [49]:
grouped = example.groupby(['condition','some_IV'])
avg = grouped.mean()
avg


Out[49]:
measurement
condition some_IV
a x 0.501778
y 0.504918
z 0.495540
b x 0.497304
y 0.496837
z 0.501893

In [50]:
avg.unstack().plot(kind='bar')
avg.unstack()['measurement'].plot(kind='bar')


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x14cdc6e10>

In [51]:
SE = grouped.apply(lambda x: np.std(x)/np.sqrt(len(x))) # standard error of the mean for each cell
SE


Out[51]:
measurement
condition some_IV
a x 0.002921
y 0.002898
z 0.002942
b x 0.002914
y 0.002922
z 0.002910

In [52]:
avg.unstack()['measurement'].plot(kind='bar',yerr=10*SE.unstack()['measurement']) # note: the error bars show 10*SE


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x14ceef890>

In [53]:
fig,ax = plt.subplots(1,1,figsize=(8,6))
avg.unstack()['measurement'].plot(kind='bar',yerr=10*SE.unstack()['measurement'],ax=ax,
                                  legend=None,color=['cyan','blue','purple'],alpha=0.6)
ax.legend(bbox_to_anchor=(1.05,1))
ax.set_ylabel('This is my dependent variable',fontdict={'fontname':'Comic Sans MS','fontsize':24})
ax.set_title('Check out my plot!',fontdict={'fontname':'Arial','fontsize':32})
plt.xticks([0,1],['Group A','Group B'],rotation=45)
ax.set_xlabel('Condition',fontsize=18)
ax.grid()

ax.annotate("*",(0.155,0.53),fontsize=16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)



In [57]:
plot_data = (data.groupby(['artist','timestamp']).count()['item_id']
                 .unstack()
                 .reindex(top_artists.index[:3])
                 .fillna(0)
                 .T
                 .resample('M',how='sum'))
total_monthly_playcounts = time_indexed.resample('M',how='count')['item_id']

fig,axes = plt.subplots(1,2,figsize = (18,6))
ax_iter = axes.flat
ax = ax_iter.next()
plot_data.divide(total_monthly_playcounts,axis=0).plot(ax=ax,grid=True,legend=False,lw=2,colormap='rainbow')
ax.set_ylabel('Monthly proportion of listening')

ax = ax_iter.next()
plot_data.cumsum().divide(total_monthly_playcounts.cumsum(),axis=0).plot(ax=ax,grid=True,legend=False,lw=2,colormap='rainbow')
ax.set_ylabel('Cumulative monthly proportion of listening')

leg = ax.legend()
from urllib import unquote_plus
text = leg.get_texts()
for t in text:
    t.set_text(unquote_plus(t.get_text().encode('ascii')).decode('utf8').title())
    
for ax in axes:
    ax.set_axis_bgcolor('#DADADA')
    #ax.set_ylim(0,0.22)

fig.suptitle('Month-by-month and cumulative listening, top 3 artists',fontsize=20)


Out[57]:
<matplotlib.text.Text at 0x10c23c2d0>

Digging deeper

This was just the tip of the iceberg

  • There are tons of specialized packages to learn about and work with
  • And of course, you can always build your own!
  • Working with bigger data? Try GraphLab (or maybe even Apache Spark)

Want to get started using these tools for your research?

Feel free to ask me for help (follow-up presentation next semester?)

Check out the resources on the next slide.

Resources