Python for data analysis

A fun-filled tutorial on data scraping, munging, analysis, and visualization with Pandas

But seriously, why Python?

  • An interpreted language with clean, readable, concise code
  • Multiplatform, and supports multiple programming paradigms
  • An integrated pipeline for all phases of data collection, processing, and analysis
  • An extensive standard library, plus massive repositories of packages for just about anything you can think of (with easy-to-use tools for managing them)
  • Open source, with a highly active development community

And don't forget the Jupyter (née IPython) notebook

  • Code and results in an integrated interface
  • Easily save and share ad-hoc analyses
  • Export in a variety of formats...
  • ...even as slides (see the example command below)!
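
For example, a notebook can be exported to reveal.js slides from the command line (the notebook filename here is just a placeholder):

jupyter nbconvert --to slides talk.ipynb   # on older IPython installs: ipython nbconvert --to slides talk.ipynb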

Jupyter homepage: https://jupyter.org/

Let's get some data!


In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
from IPython.core.display import display

In [3]:
data = pd.read_table('mylistening.txt',header=None,
    names=['user_id','item_id','artist_id','timestamp'],parse_dates=['timestamp'])

But there are lots of other ways to get data into Python...

Read from a database


In [4]:
# import MySQLdb
# db=MySQLdb.connect(host='rdc04.uits.iu.edu',port=XXXX,user=XXXX,passwd=XXX,db='analysis_lastfm')
# cursor = db.cursor()
# cursor.execute("SELECT column_a, column_b, column_c FROM some_table")
# result = cursor.fetchall()
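
Pandas can also read query results straight into a DataFrame. A minimal sketch via SQLAlchemy (the connection string and table name are placeholders, so it's left commented out like the cell above):

# from sqlalchemy import create_engine
# engine = create_engine('mysql://USER:PASSWORD@HOST/analysis_lastfm')
# frame = pd.read_sql('SELECT column_a, column_b, column_c FROM some_table', engine)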

Scrape data from the web!

  • Roll your own with urllib2, BeautifulSoup, lxml, etc. (see the sketch below)
  • Lots of purpose-built libraries for pulling data from specific APIs (e.g. PyLast, musicbrainzngs, ...)
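
A minimal roll-your-own sketch, assuming BeautifulSoup 4 is installed (the URL is just a placeholder):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://example.com/some_page').read()   # placeholder URL
soup = BeautifulSoup(html)
# collect the text of every link on the page
link_texts = [a.get_text() for a in soup.find_all('a')]
print link_texts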

Use Mechanical Turk!

  • Boto (see the sketch below)
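
A rough sketch with the (old-style) boto library; the credentials are placeholders, and the host shown is the MTurk sandbox endpoint:

from boto.mturk.connection import MTurkConnection

mtc = MTurkConnection(aws_access_key_id='XXXX',
                      aws_secret_access_key='XXXX',
                      host='mechanicalturk.sandbox.amazonaws.com')   # sandbox, not the live marketplace
print mtc.get_account_balance()   # quick check that the connection works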

You have your dataframe. Now what?

Basic operations:


In [5]:
# preview the data: 
data.head()


Out[5]:
user_id item_id artist_id timestamp
0 5759068 41 1 2007-11-05 13:33:28
1 5759068 41 1 2009-03-29 20:18:48
2 5759068 41 1 2009-06-04 23:35:14
3 5759068 41 1 2010-02-06 03:51:20
4 5759068 41 1 2010-03-24 21:02:33

In [6]:
data.tail()


Out[6]:
user_id item_id artist_id timestamp
58828 5759068 39064966 39064958 2010-03-29 18:07:03
58829 5759068 39064990 39064991 2007-12-31 14:00:14
58830 5759068 39065008 39065006 2012-11-24 23:56:02
58831 5759068 39065013 39065014 2007-07-25 19:24:11
58832 5759068 39065015 39065014 2007-08-09 04:01:52

In [7]:
# simple stats:
data.describe()


Out[7]:
user_id item_id artist_id
count 58833 58833.000000 58833.000000
mean 5759068 2250941.841738 191385.196165
std 0 6646153.662761 2126089.930947
min 5759068 15.000000 1.000000
25% 5759068 87753.000000 1211.000000
50% 5759068 239366.000000 2965.000000
75% 5759068 1120718.000000 13973.000000
max 5759068 39065044.000000 39065014.000000

In [8]:
# sorting
data = data.sort('timestamp',ascending=True) # in newer pandas: sort_values
data.head()


Out[8]:
user_id item_id artist_id timestamp
10061 5759068 3623 405 2007-02-24 22:54:28
42885 5759068 74220 11337 2007-02-24 22:59:07
34972 5759068 64735 4536 2007-02-24 23:01:54
52132 5759068 2021815 42105 2007-02-25 01:55:39
30434 5759068 39064541 3014 2007-02-25 02:00:36

Leveraging Numpy


In [9]:
# "vanilla" python

x = range(10)
print x

print x*3
print x+10


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-20fae026fa5d> in <module>()
      5 
      6 print x*3
----> 7 print x+10

TypeError: can only concatenate list (not "int") to list

In [10]:
new1 = [i*3 for i in x]
print new1

new2 = [i+10 for i in x]
new2


[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
Out[10]:
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

In [11]:
x_arr = np.array(x)
display(x_arr)

display(x_arr*3)
display(x_arr+10)


array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27])
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [12]:
print np.sin(x_arr)
print np.exp(x_arr)

print "Mean of array -> %s" % x_arr.mean()
print "Index of highest value in array -> %s" % x_arr.argmax()
print "Sum of array -> %s" % x_arr.sum()
print "Product of array -> %s" % x_arr.prod()


[ 0.          0.84147098  0.90929743  0.14112001 -0.7568025  -0.95892427
 -0.2794155   0.6569866   0.98935825  0.41211849]
[  1.00000000e+00   2.71828183e+00   7.38905610e+00   2.00855369e+01
   5.45981500e+01   1.48413159e+02   4.03428793e+02   1.09663316e+03
   2.98095799e+03   8.10308393e+03]
Mean of array -> 4.5
Index of highest value in array -> 9
Sum of array -> 45
Product of array -> 0

In [13]:
data.mean(axis=0) # axis=0: aggregate over rows (one value per column)
### note that these only operate on *numeric* columns


Out[13]:
user_id      5759068.000000
item_id      2250941.841738
artist_id     191385.196165
dtype: float64

In [14]:
print data.sum(axis=1).head(15) # axis=1: aggregate over columns (one value per row)


10061     5763096
42885     5844625
34972     5828339
52132     7822988
30434    44826623
19986     5778037
15221     6262042
37625     5913179
10294     5866792
31556     5852659
13577     5849756
27241     5809060
12008     6089487
48280     6051698
48996     5831800
dtype: int64

Selecting and grouping data, and playing with columns...


In [15]:
# get a column:
data['artist_id'].head(10)


Out[15]:
10061      405
42885    11337
34972     4536
52132    42105
30434     3014
19986     1996
15221     1328
37625     7432
10294      405
31556     3875
Name: artist_id, dtype: int64

In [16]:
# get rows matching a condition
data[data['artist_id']==405].head(10)


Out[16]:
user_id item_id artist_id timestamp
10061 5759068 3623 405 2007-02-24 22:54:28
10294 5759068 107319 405 2007-02-25 02:12:30
10429 5759068 168925 405 2007-02-25 04:41:51
10382 5759068 129362 405 2007-02-26 06:12:02
10184 5759068 49058 405 2007-02-28 04:00:49
10173 5759068 47925 405 2007-02-28 05:48:00
10514 5759068 337929 405 2007-03-01 06:02:02
10167 5759068 47901 405 2007-03-02 07:27:57
10224 5759068 91961 405 2007-03-02 08:32:49
10062 5759068 3623 405 2007-03-02 08:36:12

In [17]:
# SQL-style joins and other operations
artists = pd.read_table('tmp') # artist_id -> artist name lookup table
artists.head()


Out[17]:
artist_id artist
0 1 slipknot
1 12 %c3%9cnloco
2 14 muse
3 18 earshot
4 35 drowning+pool

In [18]:
data = data.merge(artists,on='artist_id',how='left')
data.head()


Out[18]:
user_id item_id artist_id timestamp artist
0 5759068 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age
1 5759068 74220 11337 2007-02-24 22:59:07 lupe+fiasco
2 5759068 64735 4536 2007-02-24 23:01:54 black+eyed+peas
3 5759068 2021815 42105 2007-02-25 01:55:39 aesop+rock
4 5759068 39064541 3014 2007-02-25 02:00:36 at+the+drive-in

In [19]:
# Drop column 
data = data.drop('user_id',axis=1)

In [20]:
# Group and aggregate!
data.groupby('artist').count().tail(10)


Out[20]:
item_id artist_id timestamp
artist
zero+7 8 8 8
zero+down 6 6 6
ziggy+marley 1 1 1
ziggy+marley+&+the+melody+makers 4 4 4
zlad! 6 6 6
zo%c3%a9 9 9 9
zoe 2 2 2
zomboy 3 3 3
zox 1 1 1
zurdok 5 5 5

In [21]:
data['artist'].value_counts().head(10)


Out[21]:
sigur+r%c3%b3s              2370
radiohead                   1750
the+mars+volta              1477
nine+inch+nails             1417
mogwai                      1221
nick+cave+&+warren+ellis    1159
beirut                      1125
tool                        1055
muse                         888
ratatat                      839
dtype: int64

In [22]:
data.groupby(['artist','item_id']).count().tail(10)


Out[22]:
artist_id timestamp
artist item_id
zo%c3%a9 1691584 1 1
2983328 2 2
37276281 1 1
39065043 2 2
zoe 39065044 2 2
zomboy 380522 1 1
855510 1 1
1354925 1 1
zox 4123632 1 1
zurdok 1351103 5 5

In [23]:
def group_by_first_letter(artist_name):
    return artist_name[0] # first letter of names

data.set_index('artist').groupby(group_by_first_letter).count().head(10)


Out[23]:
item_id artist_id timestamp
% 818 818 818
. 232 232 232
1 13 13 13
2 51 51 51
3 16 16 16
5 3 3 3
6 528 528 528
a 4624 4624 4624
b 2898 2898 2898
c 2412 2412 2412

Let's try some more realistic examples...

How much do I listen each month?


In [24]:
time_indexed = data.set_index('timestamp')
time_indexed.resample('M',how='count')['artist'].plot()


Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x10478e050>

In [25]:
fig,axes = plt.subplots(1,2,figsize=(12,4))
time_indexed.resample('A',how='count')['artist'].plot(ax=axes[0])
time_indexed.resample('W',how='count')['artist'].plot(ax=axes[1])


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b1ec550>

Can we break it down by artist?


In [26]:
monthly_listens_by_artist = data.groupby(['artist','timestamp']).count()['item_id'] # per-timestamp counts for now; resampled to months further down
monthly_listens_by_artist.tail(25)


Out[26]:
artist    timestamp          
zlad!     2007-07-31 00:43:48    1
          2007-10-26 16:49:01    1
          2008-03-02 13:05:52    1
          2008-03-12 17:31:00    1
          2008-03-14 09:08:54    1
zo%c3%a9  2008-01-02 17:30:35    1
          2008-01-02 17:35:21    1
          2008-01-02 17:40:11    1
          2008-02-18 00:08:55    1
          2008-07-14 19:19:28    1
          2008-08-28 05:52:51    1
          2009-02-13 00:11:48    1
          2009-03-05 05:49:31    1
          2009-10-13 08:50:00    1
zoe       2008-07-08 05:18:54    1
          2008-08-20 01:34:02    1
zomboy    2012-06-04 22:55:37    1
          2012-06-08 23:32:59    1
          2012-12-04 23:34:50    1
zox       2009-02-10 00:00:15    1
zurdok    2008-01-19 23:00:32    1
          2008-01-30 13:15:51    1
          2008-02-03 14:50:28    1
          2008-04-28 07:41:57    1
          2009-06-12 20:13:36    1
Name: item_id, dtype: int64

In [27]:
monthly_listens_by_artist = monthly_listens_by_artist.unstack()
monthly_listens_by_artist.tail(10)


Out[27]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
artist
zero+7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zero+down NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ziggy+marley NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
ziggy+marley+&+the+melody+makers NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zlad! NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zo%c3%a9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zoe NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zomboy NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zox NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
zurdok NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

10 rows × 58668 columns


In [28]:
monthly_listens_by_artist.sum(axis=1).tail(10)


Out[28]:
artist
zero+7                              8
zero+down                           6
ziggy+marley                        1
ziggy+marley+&+the+melody+makers    4
zlad!                               6
zo%c3%a9                            9
zoe                                 2
zomboy                              3
zox                                 1
zurdok                              5
dtype: float64

In [29]:
top_artists = data['artist'].value_counts()[:5]
top_artists


Out[29]:
sigur+r%c3%b3s     2370
radiohead          1750
the+mars+volta     1477
nine+inch+nails    1417
mogwai             1221
dtype: int64

In [30]:
monthly_listens_by_artist.reindex(top_artists.index)


Out[30]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
sigur+r%c3%b3s NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1 1 NaN NaN NaN NaN NaN NaN NaN NaN
radiohead NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
the+mars+volta NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
nine+inch+nails NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mogwai NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 58668 columns


In [31]:
monthly_listens_by_artist.reindex(top_artists.index).fillna(0)


Out[31]:
timestamp 2007-02-24 22:54:28 2007-02-24 22:59:07 2007-02-24 23:01:54 2007-02-25 01:55:39 2007-02-25 02:00:36 2007-02-25 02:03:25 2007-02-25 02:05:00 2007-02-25 02:09:16 2007-02-25 02:12:30 2007-02-25 02:15:22 ... 2012-12-28 17:12:03 2012-12-28 17:19:03 2012-12-29 21:55:51 2012-12-29 21:59:43 2012-12-29 22:03:41 2012-12-29 22:06:38 2012-12-29 22:09:57 2012-12-29 22:13:33 2012-12-29 22:16:42 2012-12-29 22:20:38
sigur+r%c3%b3s 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 0 0 0 0 0
radiohead 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
the+mars+volta 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
nine+inch+nails 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
mogwai 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 58668 columns


In [32]:
monthly_listens_by_artist.reindex(top_artists.index).fillna(0).T.head(10)


Out[32]:
sigur+r%c3%b3s radiohead the+mars+volta nine+inch+nails mogwai
timestamp
2007-02-24 22:54:28 0 0 0 0 0
2007-02-24 22:59:07 0 0 0 0 0
2007-02-24 23:01:54 0 0 0 0 0
2007-02-25 01:55:39 0 0 0 0 0
2007-02-25 02:00:36 0 0 0 0 0
2007-02-25 02:03:25 0 0 0 0 0
2007-02-25 02:05:00 0 0 0 0 0
2007-02-25 02:09:16 0 0 0 0 0
2007-02-25 02:12:30 0 0 0 0 0
2007-02-25 02:15:22 0 0 0 0 0

In [33]:
to_plot = monthly_listens_by_artist.reindex(top_artists.index).fillna(0).T.resample('M',how='sum')
to_plot.head()


Out[33]:
sigur+r%c3%b3s radiohead the+mars+volta nine+inch+nails mogwai
timestamp
2007-02-28 1 3 16 0 2
2007-03-31 0 22 2 1 13
2007-04-30 1 48 2 24 11
2007-05-31 2 14 1 61 25
2007-06-30 1 43 3 6 1

In [34]:
to_plot.plot()


Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x10a6b54d0>

In [35]:
to_plot.cumsum().plot()


Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x14cc51850>

In [36]:
# everything from the previous cells, chained into one expression
(data.groupby(['artist','timestamp']).count()['item_id']
     .unstack()
     .reindex(top_artists.index)
     .fillna(0)
     .T
     .resample('M',how='sum')
     .cumsum()
     .plot())


Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x14d1109d0>

But does Python do stats?


In [37]:
import scipy.stats as stats
import statsmodels

In [38]:
data['subject'] = np.random.random_integers(0,19,len(data))
data['measurement'] = np.random.random(len(data))
data['condition'] = data['subject'].apply(lambda x: 'a' if x%2==0 else 'b')
data.head()


Out[38]:
item_id artist_id timestamp artist subject measurement condition
0 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age 18 0.200160 a
1 74220 11337 2007-02-24 22:59:07 lupe+fiasco 16 0.958234 a
2 64735 4536 2007-02-24 23:01:54 black+eyed+peas 9 0.671806 b
3 2021815 42105 2007-02-25 01:55:39 aesop+rock 15 0.753624 b
4 39064541 3014 2007-02-25 02:00:36 at+the+drive-in 8 0.658518 a

In [39]:
# A simple group comparison
grpa =  data[data['condition']=='a']['measurement']
grpb =  data[data['condition']=='b']['measurement']
print 'Group A'
print  grpa.describe()
print 'Group B'
print  grpb.describe()
stats.ttest_ind(grpa,grpb)


Group A
count    29476.000000
mean         0.500758
std          0.289482
min          0.000049
25%          0.249881
50%          0.501190
75%          0.752262
max          0.999988
Name: measurement, dtype: float64
Group B
count    29357.000000
mean         0.498694
std          0.288415
min          0.000005
25%          0.247028
50%          0.496558
75%          0.746978
max          0.999995
Name: measurement, dtype: float64
Out[39]:
Ttest_indResult(statistic=0.86629475221585306, pvalue=0.38633207626664212)

In [40]:
data['some_IV'] = np.random.choice(['x','y','z'],len(data))
data.head()


Out[40]:
item_id artist_id timestamp artist subject measurement condition some_IV
0 3623 405 2007-02-24 22:54:28 queens+of+the+stone+age 18 0.200160 a z
1 74220 11337 2007-02-24 22:59:07 lupe+fiasco 16 0.958234 a y
2 64735 4536 2007-02-24 23:01:54 black+eyed+peas 9 0.671806 b z
3 2021815 42105 2007-02-25 01:55:39 aesop+rock 15 0.753624 b z
4 39064541 3014 2007-02-25 02:00:36 at+the+drive-in 8 0.658518 a x

In [41]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

model=ols('measurement ~ C(condition) + C(some_IV) + C(condition):C(some_IV)', data=data).fit() 
print anova_lm(model)


                            df       sum_sq   mean_sq         F    PR(>F)
C(condition)                 1     0.062658  0.062658  0.750507  0.386319
C(some_IV)                   2     0.046413  0.023207  0.277963  0.757326
C(condition):C(some_IV)      2     0.554453  0.277226  3.320561  0.036139
Residual                 58827  4911.338700  0.083488       NaN       NaN

Visualization

Matplotlib and pandas


In [42]:
from matplotlib import pyplot as plt
%matplotlib inline

example = pd.DataFrame({'a':np.random.random(10),'b':np.random.random(10),'c':np.random.random(10)})
example.head()


Out[42]:
a b c
0 0.010652 0.474971 0.219931
1 0.076075 0.597393 0.904103
2 0.153507 0.429889 0.591455
3 0.674325 0.589722 0.930124
4 0.398727 0.536206 0.251662

In [43]:
plt.plot(example['a'])


Out[43]:
[<matplotlib.lines.Line2D at 0x10e4a7450>]

In [44]:
fig,ax = plt.subplots(1,1)
for column in ['a','b','c']:
    ax.plot(example[column])



In [45]:
fig,ax = plt.subplots(1,1)
for column in ['a','b','c']:
    ax.plot(example[column],label='column '+column)
ax.legend()


Out[45]:
<matplotlib.legend.Legend at 0x110ec5610>

In [46]:
example.plot()


Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x110efe490>

In [47]:
fig,axes = plt.subplots(3,3,figsize=(16,12))
ax_iter = axes.flat
example.plot(kind='bar',ax=ax_iter.next())
example.plot(kind='barh',ax=ax_iter.next())
example.plot(kind='hist',ax=ax_iter.next())
example.plot(kind='box',ax=ax_iter.next())
example.plot(kind='density',ax=ax_iter.next())
example.plot(kind='area',ax=ax_iter.next())
example.plot(kind='scatter',x='a',y='b',ax=ax_iter.next())
example.plot(kind='pie',y='a',ax=ax_iter.next())
example.plot(kind='hexbin',x='a',y='b',gridsize=20,ax=ax_iter.next())


Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x111489b90>

In [48]:
example = data[['condition','some_IV','measurement']]
example.head()


Out[48]:
condition some_IV measurement
0 a z 0.200160
1 a y 0.958234
2 b z 0.671806
3 b z 0.753624
4 a x 0.658518

In [49]:
grouped = example.groupby(['condition','some_IV'])
avg = grouped.mean()
avg


Out[49]:
measurement
condition some_IV
a x 0.501778
y 0.504918
z 0.495540
b x 0.497304
y 0.496837
z 0.501893

In [50]:
avg.unstack().plot(kind='bar')
avg.unstack()['measurement'].plot(kind='bar')


Out[50]:
<matplotlib.axes._subplots.AxesSubplot at 0x14cdc6e10>

In [51]:
SE = grouped.apply(lambda x: np.std(x)/np.sqrt(len(x))) # standard error of the mean for each cell
SE


Out[51]:
measurement
condition some_IV
a x 0.002921
y 0.002898
z 0.002942
b x 0.002914
y 0.002922
z 0.002910

In [52]:
avg.unstack()['measurement'].plot(kind='bar',yerr=10*SE.unstack()['measurement']) # note: the error bars show 10*SE


Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x14ceef890>

In [53]:
fig,ax = plt.subplots(1,1,figsize=(8,6))
avg.unstack()['measurement'].plot(kind='bar',yerr=10*SE.unstack()['measurement'],ax=ax,
                                  legend=None,color=['cyan','blue','purple'],alpha=0.6)
ax.legend(bbox_to_anchor=(1.05,1))
ax.set_ylabel('This is my dependent variable',fontdict={'fontname':'Comic Sans MS','fontsize':24})
ax.set_title('Check out my plot!',fontdict={'fontname':'Arial','fontsize':32})
plt.xticks([0,1],['Group A','Group B'],rotation=45)
ax.set_xlabel('Condition',fontsize=18)
ax.grid()

ax.annotate("*",(0.155,0.53),fontsize=16)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)



In [57]:
plot_data = (data.groupby(['artist','timestamp']).count()['item_id']
                 .unstack()
                 .reindex(top_artists.index[:3])
                 .fillna(0)
                 .T
                 .resample('M',how='sum'))
total_monthly_playcounts = time_indexed.resample('M',how='count')['item_id']

fig,axes = plt.subplots(1,2,figsize = (18,6))
ax_iter = axes.flat
ax = ax_iter.next()
plot_data.divide(total_monthly_playcounts,axis=0).plot(ax=ax,grid=True,legend=False,lw=2,colormap='rainbow')
ax.set_ylabel('Monthly proportion of listening')

ax = ax_iter.next()
plot_data.cumsum().divide(total_monthly_playcounts.cumsum(),axis=0).plot(ax=ax,grid=True,legend=False,lw=2,colormap='rainbow')
ax.set_ylabel('Cumulative monthly proportion of listening')

leg = ax.legend()
from urllib import unquote_plus
text = leg.get_texts()
for t in text:
    t.set_text(unquote_plus(t.get_text().encode('ascii')).decode('utf8').title())
    
for ax in axes:
    ax.set_axis_bgcolor('#DADADA')
    #ax.set_ylim(0,0.22)

fig.suptitle('Month-by-month and cumulative listening, top 3 artists',fontsize=20)


Out[57]:
<matplotlib.text.Text at 0x10c23c2d0>

Digging deeper

This was just the tip of the iceberg

  • There are tons of specialized packages to learn about and work with
  • And of course, you can always build your own!
  • Working with bigger data? Try GraphLab (or maybe even Apache Spark)

Want to get started using these tools for your research?

Feel free to ask me for help (follow-up presentation next semester?)

Check out the resources on the next slide.

Resources