Computing in Context sub history

Lecture one

Number munging

This is iPython.

It is swell.

It is Python in a browser.

Pure CS types do not love it.

We hackish types adore!

Download Anaconda (especially if you are on Windows).


In [1]:
#This is a comment
#This is all blackboxed for now--DON'T worry about it
# Render our plots inline
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

pd.set_option('display.mpl_style', 'default') # Make the graphs a bit prettier
plt.rcParams['figure.figsize'] = (15, 5)

Our first data format

Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
1,1,2013-10-29,28-303,MIA,,CHI,W (+12),1,38:01,5,11,.455,0,1,.000,7,9,.778,0,6,6,8,1,0,2,0,17,16.9,+8
2,2,2013-10-30,28-304,MIA,@,PHI,L (-4),1,36:38,9,17,.529,4,7,.571,3,4,.750,0,4,4,13,0,0,4,3,25,21.4,-8
3,3,2013-11-01,28-306,MIA,@,BRK,L (-1),1,42:14,11,19,.579,1,2,.500,3,5,.600,1,6,7,6,2,1,5,2,26,19.9,-3
4,4,2013-11-03,28-308,MIA,,WAS,W (+10),1,34:41,9,14,.643,3,5,.600,4,5,.800,0,3,3,5,1,0,6,2,25,17.0,+16
5,5,2013-11-05,28-310,MIA,@,TOR,W (+9),1,36:01,13,20,.650,1,3,.333,8,8,1.000,2,6,8,8,0,1,1,2,35,33.9,+3

In [ ]:
#looks much nicer on a wide screen!

Comma-separated values (CSV) files

LeBron James' first five games of the 2013-2014 NBA season


In [3]:
import csv
import urllib

url = "https://gist.githubusercontent.com/aparrish/cb1672e98057ea2ab7a1/raw/13166792e0e8436221ef85d2a655f1965c400f75/lebron_james.csv"
stats = list(csv.reader(urllib.urlopen(url)))
#example courtesy of the great Allison Parrish!
#What different things do urllib.urlopen(url) then csv.reader() and then list() do?
#(This is Python 2. In Python 3 the function is urllib.request.urlopen, and it returns
# bytes that must be decoded to text before csv.reader can split them.)

In [4]:
stats[0]


Out[4]:
['Rk',
 'G',
 'Date',
 'Age',
 'Tm',
 '',
 'Opp',
 '',
 'GS',
 'MP',
 'FG',
 'FGA',
 'FG%',
 '3P',
 '3PA',
 '3P%',
 'FT',
 'FTA',
 'FT%',
 'ORB',
 'DRB',
 'TRB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 'GmSc',
 '+/-']

In [5]:
len(stats)


Out[5]:
78

In [6]:
stats[74][0]


Out[6]:
'77'

You can compose indexes! This is the 0th item of the 74th list (counting both from zero).
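To see the three steps separately, here is a minimal sketch that works on text held in memory instead of a URL, so it runs without a network connection. It is written for Python 3, and io.StringIO stands in for the file-like object that urlopen returns; the names raw_text, sample_stats, and pts_column are made up for illustration. The data is the header plus the first two games shown earlier.

```python
import csv
import io

# Header and first two rows of the sample data shown earlier.
raw_text = """Rk,G,Date,Age,Tm,,Opp,,GS,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc,+/-
1,1,2013-10-29,28-303,MIA,,CHI,W (+12),1,38:01,5,11,.455,0,1,.000,7,9,.778,0,6,6,8,1,0,2,0,17,16.9,+8
2,2,2013-10-30,28-304,MIA,@,PHI,L (-4),1,36:38,9,17,.529,4,7,.571,3,4,.750,0,4,4,13,0,0,4,3,25,21.4,-8
"""

file_like = io.StringIO(raw_text)   # something we can read line by line
reader = csv.reader(file_like)      # splits each line into fields as it is read
sample_stats = list(reader)         # reads every row into a list of lists

header = sample_stats[0]
pts_column = header.index("PTS")    # find the position of the PTS column by name

# Composed index: row 1 (the first game), then the PTS field within that row.
first_game_points = sample_stats[1][pts_column]
print(first_game_points)            # every field is still a string at this point
```

Note that csv.reader does no type conversion: the points come back as the string '17', not the number 17.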

BUT I'm not going to torture you with this lower-level analysis (for now).

Pandas: the first-line Python tool for exploratory data analysis

  • rich data structures
  • powerful ways to slice, dice, reformat, fix, and eliminate data
    • today: a taste of what it can do
  • rich queries, like those of a database

dataframes

The pandas library gives us a powerful overlay that lets us work with matrices while always keeping their row and column names: a spreadsheet on speed. It lets us work directly with the DataFrame data type, which keeps track of values and their names for us. And it lets us perform many operations on slices of a dataframe without writing for loops and the like. That is more convenient to write and usually faster to run, because the looping happens inside the library's compiled code.
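As a minimal sketch of that idea, the cell below builds a tiny dataframe by hand and divides one column by another without a loop. The three rows borrow the nevents and ndays_act values from rows 5, 7, and 8 of the course data shown later; the names small, events_per_day, and busy are made up for illustration.

```python
import pandas as pd

# A hand-built dataframe; the column names travel with the data.
small = pd.DataFrame(
    {
        "course_id": [
            "HarvardX/PH207x/2012_Fall",
            "HarvardX/CB22x/2013_Spring",
            "HarvardX/CB22x/2013_Spring",
        ],
        "nevents": [502, 42, 70],
        "ndays_act": [16, 6, 3],
    }
)

# One expression operates on every row at once: no for loop needed.
events_per_day = small["nevents"] / small["ndays_act"]
print(events_per_day)

# A boolean condition selects rows, much like a database query.
busy = small[small["nevents"] > 50]
print(busy["course_id"].tolist())
```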


In [1]:
import pandas as pd #we've already done this above; repeated as a reminder that you'll need it

In [8]:
#Let's start with yet another way to read csv files, this time from `pandas`
import os
directory=("/Users/mljones/repositories/comp_in_context_trial/")
os.chdir(directory)

Now we read a big CSV file using a pandas function called pd.read_csv().


In [9]:
df=pd.read_csv('HMXPC_13.csv', sep=",")

In [10]:
df


Out[10]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
0 HarvardX/CB22x/2013_Spring MHxPC130442623 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 2013-11-17 NaN 9 NaN NaN 0 NaN 1
1 HarvardX/CS50x/2012 MHxPC130442623 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 9 NaN 1 0 NaN 1
2 HarvardX/CB22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 2013-11-17 NaN 16 NaN NaN 0 NaN 1
3 HarvardX/CS50x/2012 MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-09-17 NaN NaN 16 NaN NaN 0 NaN 1
4 HarvardX/ER22x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2012-12-19 NaN NaN 16 NaN NaN 0 NaN 1
5 HarvardX/PH207x/2012_Fall MHxPC130275857 1 1 1 0 United States NaN NaN NaN 0 2012-09-17 2013-05-23 502 16 50 12 0 NaN NaN
6 HarvardX/PH278x/2013_Spring MHxPC130275857 1 0 0 0 United States NaN NaN NaN 0 2013-02-08 NaN NaN 16 NaN NaN 0 NaN 1
7 HarvardX/CB22x/2013_Spring MHxPC130539455 1 1 0 0 France NaN NaN NaN 0 2013-01-01 2013-05-14 42 6 NaN 3 0 NaN NaN
8 HarvardX/CB22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-18 2013-03-17 70 3 NaN 3 0 NaN NaN
9 HarvardX/CS50x/2012 MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2012-10-20 NaN NaN 12 NaN 3 0 NaN 1
10 HarvardX/ER22x/2013_Spring MHxPC130088379 1 1 0 0 United States NaN NaN NaN 0 2013-02-23 2013-06-14 17 2 NaN 2 0 NaN NaN
11 HarvardX/ER22x/2013_Spring MHxPC130198098 1 1 0 0 United States NaN NaN NaN 0 2013-06-17 2013-06-17 32 1 NaN 3 0 NaN NaN
12 HarvardX/CB22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0.07 2013-01-24 2013-08-03 175 9 NaN 7 0 NaN NaN
13 HarvardX/CS50x/2012 MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-06-27 NaN NaN 2 NaN 2 0 NaN 1
14 HarvardX/ER22x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-12-19 2013-08-17 78 5 NaN 4 0 NaN NaN
15 HarvardX/PH207x/2012_Fall MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2012-07-26 2013-01-16 75 14 5 2 0 NaN NaN
16 HarvardX/PH278x/2013_Spring MHxPC130024894 1 1 0 0 United States NaN NaN NaN 0 2013-07-30 2013-08-27 11 2 2 1 0 NaN NaN
17 HarvardX/CS50x/2012 MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-15 NaN NaN 11 NaN 1 0 NaN 1
18 HarvardX/PH207x/2012_Fall MHxPC130080986 1 1 0 0 United States NaN NaN NaN 0 2012-10-25 2012-12-04 56 11 1 2 1 NaN NaN
19 HarvardX/CS50x/2012 MHxPC130063375 1 1 0 0 Unknown/Other NaN NaN NaN 0 2012-10-19 NaN NaN NaN NaN 1 0 NaN 1
20 HarvardX/CS50x/2012 MHxPC130094371 1 1 0 0 United States NaN NaN NaN 0 2013-03-03 2013-03-03 7 1 NaN 2 0 NaN NaN
21 HarvardX/CS50x/2012 MHxPC130229084 1 1 0 0 Mexico NaN NaN NaN 0 2012-10-15 NaN NaN NaN NaN 1 0 NaN 1
22 HarvardX/CS50x/2012 MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-10-24 NaN NaN 2 NaN 1 0 NaN 1
23 HarvardX/ER22x/2013_Spring MHxPC130300925 1 1 0 0 United States NaN NaN NaN 0 2012-12-20 2013-05-18 15 2 NaN 2 0 NaN NaN
24 HarvardX/CS50x/2012 MHxPC130417650 1 1 0 0 Australia NaN NaN NaN 0 2012-10-29 2013-03-04 1 1 NaN 2 0 NaN NaN
25 HarvardX/CS50x/2012 MHxPC130506580 1 0 0 0 United States NaN NaN NaN 0 2012-09-04 NaN NaN NaN NaN NaN 0 NaN NaN
26 HarvardX/CS50x/2012 MHxPC130298257 1 0 0 0 United States NaN NaN NaN 0 2012-09-05 NaN NaN NaN NaN 3 0 NaN 1
27 HarvardX/CS50x/2012 MHxPC130500569 1 1 0 0 United States NaN NaN NaN 0 2012-10-22 2013-03-30 6 1 NaN 5 0 NaN NaN
28 HarvardX/CS50x/2012 MHxPC130466479 1 1 0 0 Unknown/Other NaN NaN NaN 0 2013-01-07 NaN NaN NaN NaN 1 0 NaN 1
29 HarvardX/CB22x/2013_Spring MHxPC130340959 1 1 0 0 United States NaN NaN NaN 0.05 2013-02-11 2013-04-06 285 8 NaN 4 0 NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
641108 MITx/6.002x/2013_Spring MHxPC130140735 1 1 0 0 United States Bachelor's 1991 m NaN 2013-09-07 2013-09-07 59 1 5 3 0 NaN NaN
641109 MITx/6.00x/2013_Spring MHxPC130493130 1 0 0 0 United Kingdom Master's 1977 m NaN 2013-09-07 NaN NaN NaN NaN 2 0 NaN 1
641110 MITx/6.00x/2013_Spring MHxPC130400592 1 1 0 0 Other Europe Secondary 1992 m NaN 2013-09-07 2013-09-07 395 1 51 4 0 NaN NaN
641111 MITx/6.00x/2013_Spring MHxPC130109892 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 49 1 14 2 0 NaN NaN
641112 MITx/14.73x/2013_Spring MHxPC130183007 1 0 0 0 India Master's 1985 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641113 MITx/8.MReV/2013_Summer MHxPC130261281 1 1 0 0 India Secondary 1994 m 0 2013-09-07 2013-09-07 8 1 NaN 1 0 NaN NaN
641114 MITx/6.00x/2013_Spring MHxPC130481990 1 1 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 22 1 5 1 0 NaN NaN
641115 MITx/6.00x/2013_Spring MHxPC130528581 1 0 0 0 United States Bachelor's 1990 f NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641116 MITx/14.73x/2013_Spring MHxPC130555418 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641117 MITx/6.002x/2013_Spring MHxPC130408810 1 0 0 0 India Secondary 1993 m NaN 2013-09-07 2013-09-07 2 1 NaN 3 0 NaN NaN
641118 MITx/6.00x/2013_Spring MHxPC130040184 1 0 0 0 United States Secondary 1991 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641119 MITx/6.002x/2013_Spring MHxPC130566049 1 0 0 0 Other Europe Master's 1982 m NaN 2013-09-07 2013-09-07 2 1 NaN 2 0 NaN NaN
641120 MITx/8.MReV/2013_Summer MHxPC130374105 1 1 0 0 India Bachelor's 1992 m 0 2013-09-07 2013-09-07 49 1 NaN 1 0 NaN NaN
641121 MITx/6.00x/2013_Spring MHxPC130282999 1 0 0 0 Other Europe Master's 1979 m NaN 2013-09-07 NaN NaN NaN NaN 7 0 NaN 1
641122 MITx/8.MReV/2013_Summer MHxPC130556398 1 0 0 0 India Bachelor's 1985 m 0 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641123 MITx/6.00x/2013_Spring MHxPC130573334 1 0 0 0 Spain Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641124 MITx/6.00x/2013_Spring MHxPC130505931 1 1 0 0 India Secondary 1995 m NaN 2013-09-07 2013-09-07 59 1 NaN 2 0 NaN NaN
641125 MITx/6.002x/2013_Spring MHxPC130280976 1 0 0 0 United States Bachelor's NaN m NaN 2013-09-07 2013-09-07 2 1 NaN NaN 0 NaN NaN
641126 MITx/6.00x/2013_Spring MHxPC130137331 1 1 0 0 United States Secondary 1992 m NaN 2013-09-07 2013-09-07 251 1 77 4 0 NaN NaN
641127 MITx/6.002x/2013_Spring MHxPC130271624 1 0 0 0 India Bachelor's 1989 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641128 MITx/14.73x/2013_Spring MHxPC130256541 1 1 0 0 United States Master's 1982 m NaN 2013-09-07 2013-09-07 51 1 1 1 0 NaN NaN
641129 MITx/6.00x/2013_Spring MHxPC130021638 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 NaN NaN NaN NaN NaN 0 NaN NaN
641130 MITx/14.73x/2013_Spring MHxPC130591057 1 0 0 0 Canada Bachelor's NaN f NaN 2013-09-07 2013-09-07 6 1 NaN NaN 0 NaN NaN
641131 MITx/8.02x/2013_Spring MHxPC130226305 1 0 0 0 Unknown/Other Bachelor's 1988 m NaN 2013-09-07 2013-09-07 11 1 NaN 2 0 NaN NaN
641132 MITx/6.002x/2013_Spring MHxPC130030805 1 1 0 0 Pakistan Master's 1989 m NaN 2013-09-07 2013-09-07 29 1 NaN 1 0 NaN NaN
641133 MITx/6.00x/2013_Spring MHxPC130184108 1 1 0 0 Canada Bachelor's 1991 m NaN 2013-09-07 2013-09-07 97 1 4 2 0 NaN NaN
641134 MITx/6.00x/2013_Spring MHxPC130359782 1 0 0 0 Other Europe Bachelor's 1991 f NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641135 MITx/6.002x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 1 1 NaN NaN 0 NaN NaN
641136 MITx/6.00x/2013_Spring MHxPC130098513 1 1 0 0 United States Doctorate 1979 m NaN 2013-09-07 2013-09-07 74 1 14 1 0 NaN NaN
641137 MITx/8.02x/2013_Spring MHxPC130098513 1 0 0 0 United States Doctorate 1979 m NaN 2013-09-07 NaN NaN 1 NaN NaN 0 NaN 1

641138 rows × 20 columns

Note at the bottom that the display tells us how many rows and columns we're dealing with.

As a general rule, indexing a pandas dataframe with a single name selects a column, using the syntax you'll know from dicts, as in df["course_id"].


In [11]:
df["course_id"]


Out[11]:
0      HarvardX/CB22x/2013_Spring
1             HarvardX/CS50x/2012
2      HarvardX/CB22x/2013_Spring
3             HarvardX/CS50x/2012
4      HarvardX/ER22x/2013_Spring
5       HarvardX/PH207x/2012_Fall
6     HarvardX/PH278x/2013_Spring
7      HarvardX/CB22x/2013_Spring
8      HarvardX/CB22x/2013_Spring
9             HarvardX/CS50x/2012
10     HarvardX/ER22x/2013_Spring
11     HarvardX/ER22x/2013_Spring
12     HarvardX/CB22x/2013_Spring
13            HarvardX/CS50x/2012
14     HarvardX/ER22x/2013_Spring
...
641123     MITx/6.00x/2013_Spring
641124     MITx/6.00x/2013_Spring
641125    MITx/6.002x/2013_Spring
641126     MITx/6.00x/2013_Spring
641127    MITx/6.002x/2013_Spring
641128    MITx/14.73x/2013_Spring
641129     MITx/6.00x/2013_Spring
641130    MITx/14.73x/2013_Spring
641131     MITx/8.02x/2013_Spring
641132    MITx/6.002x/2013_Spring
641133     MITx/6.00x/2013_Spring
641134     MITx/6.00x/2013_Spring
641135    MITx/6.002x/2013_Spring
641136     MITx/6.00x/2013_Spring
641137     MITx/8.02x/2013_Spring
Name: course_id, Length: 641138, dtype: object

In [12]:
df["course_id"][3340:3350] #pick out a list of values from ONE column


Out[12]:
3340            HarvardX/CS50x/2012
3341     HarvardX/ER22x/2013_Spring
3342    HarvardX/PH278x/2013_Spring
3343            HarvardX/CS50x/2012
3344            HarvardX/CS50x/2012
3345     HarvardX/ER22x/2013_Spring
3346            HarvardX/CS50x/2012
3347     HarvardX/CB22x/2013_Spring
3348            HarvardX/CS50x/2012
3349            HarvardX/CS50x/2012
Name: course_id, dtype: object

Instead of (column, row), we use name_of_dataframe[column name][slice of rows]: the column name first, then the rows.
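Here is a small sketch of the column-then-rows pattern on a made-up dataframe (the name demo and its values are illustrative only). It also shows .iloc and .loc, which pandas provides for naming the rows and the column in a single step.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "course_id": ["CS50x", "ER22x", "PH278x", "CS50x", "CB22x"],
        "viewed": [1, 1, 0, 1, 1],
        "nchapters": [2, 2, 0, 5, 4],
    }
)

# Column first, then a slice of rows (the end of the slice is excluded).
by_column_then_rows = demo["course_id"][1:3]

# The same selection in one step, by position: rows 1 and 2, column 0.
by_position = demo.iloc[1:3, 0]

# And by label: .loc slices include the end label, so 1:2 means rows 1 and 2.
by_label = demo.loc[1:2, "course_id"]

print(by_column_then_rows.tolist())
```

All three expressions pick out the same two values; the one-step forms are generally the safer habit once you start assigning to a selection rather than just reading it.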


In [13]:
df[3340:3350] # SLICE a list of ROWS


Out[13]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
3340 HarvardX/CS50x/2012 MHxPC130386705 1 1 0 0 Russian Federation NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN 2 0 NaN 1
3341 HarvardX/ER22x/2013_Spring MHxPC130432757 1 1 0 0 United States NaN NaN NaN NaN 2013-09-05 2013-09-05 16 1 NaN 2 0 NaN NaN
3342 HarvardX/PH278x/2013_Spring MHxPC130432757 1 0 0 0 United States NaN NaN NaN NaN 2012-12-25 NaN NaN 1 NaN NaN 0 NaN 1
3343 HarvardX/CS50x/2012 MHxPC130382204 1 1 0 0 Ukraine NaN NaN NaN 0 2012-11-30 NaN NaN NaN NaN 5 0 NaN 1
3344 HarvardX/CS50x/2012 MHxPC130142047 1 1 0 0 Spain NaN NaN NaN 0.0 2013-07-12 2013-07-12 8 1 NaN 1 0 NaN NaN
3345 HarvardX/ER22x/2013_Spring MHxPC130191600 1 0 0 0 India NaN NaN NaN 0 2012-12-23 NaN NaN NaN NaN NaN 0 NaN NaN
3346 HarvardX/CS50x/2012 MHxPC130079233 1 0 0 0 United States NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN NaN 0 NaN NaN
3347 HarvardX/CB22x/2013_Spring MHxPC130277592 1 1 0 0 United States NaN NaN NaN 0.04 2013-01-23 2013-04-04 333 8 NaN 4 0 NaN NaN
3348 HarvardX/CS50x/2012 MHxPC130429812 1 1 0 0 United Kingdom NaN NaN NaN 0 2012-08-18 NaN NaN NaN NaN 2 0 NaN 1
3349 HarvardX/CS50x/2012 MHxPC130503405 1 0 0 0 Russian Federation NaN NaN NaN 0.0 2012-09-03 NaN NaN NaN NaN NaN 0 NaN NaN

In [14]:
#This was _not_ in class   PREPARE FOR TERRIBLE ERROR!
#THIS DOESN'T WORK
df[3340]


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-174c9ecc62ef> in <module>()
      1 #This was _not_ in class
      2 #THIS DOESN'T WORK
----> 3 df[3340]

/Users/mljones/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1778             return self._getitem_multilevel(key)
   1779         else:
-> 1780             return self._getitem_column(key)
   1781 
   1782     def _getitem_column(self, key):

/Users/mljones/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1785         # get column
   1786         if self.columns.is_unique:
-> 1787             return self._get_item_cache(key)
   1788 
   1789         # duplicate columns & possible reduce dimensionaility

/Users/mljones/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1066         res = cache.get(item)
   1067         if res is None:
-> 1068             values = self._data.get(item)
   1069             res = self._box_item_values(item, values)
   1070             cache[item] = res

/Users/mljones/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item, fastpath)
   2847 
   2848             if not isnull(item):
-> 2849                 loc = self.items.get_loc(item)
   2850             else:
   2851                 indexer = np.arange(len(self.items))[isnull(self.items)]

/Users/mljones/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/core/index.pyc in get_loc(self, key)
   1400         loc : int if unique index, possibly slice or mask if not
   1401         """
-> 1402         return self._engine.get_loc(_values_from_object(key))
   1403 
   1404     def get_value(self, series, key):

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3812)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3692)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12299)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12250)()

KeyError: 3340

In [15]:
#That's icky.
#to pick out one row by its index label use `.loc`
#(this notebook originally used `.ix`, which pandas 1.0 removed)
df.loc[3340]


Out[15]:
course_id            HarvardX/CS50x/2012
userid_DI                 MHxPC130386705
registered                             1
viewed                                 1
explored                               0
certified                              0
final_cc_cname_DI     Russian Federation
LoE_DI                               NaN
YoB                                  NaN
gender                               NaN
grade                                  0
start_time_DI                 2012-08-17
last_event_DI                        NaN
nevents                              NaN
ndays_act                            NaN
nplay_video                          NaN
nchapters                              2
nforum_posts                           0
roles                                NaN
incomplete_flag                        1
Name: 3340, dtype: object

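Why? A good question. Given a single value, df[...] looks for a column with that name, and there is no column named 3340; only a slice such as df[3340:3350] is treated as a selection of rows. Now try passing a list of just one row: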


In [17]:
df.loc[[3340]] #a one-item list gives back a one-row dataframe (`.loc` replaces the removed `.ix`)


Out[17]:
course_id userid_DI registered viewed explored certified final_cc_cname_DI LoE_DI YoB gender grade start_time_DI last_event_DI nevents ndays_act nplay_video nchapters nforum_posts roles incomplete_flag
3340 HarvardX/CS50x/2012 MHxPC130386705 1 1 0 0 Russian Federation NaN NaN NaN 0 2012-08-17 NaN NaN NaN NaN 2 0 NaN 1
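
The three behaviours seen above can be reproduced on a small made-up dataframe (tiny and its values are hypothetical). The sketch uses .loc, which in current pandas replaces the older .ix accessor: a slice selects rows, a bare number is treated as a column name and fails, and .loc returns a Series for a single label but a dataframe for a list of labels.

```python
import pandas as pd

tiny = pd.DataFrame(
    {"course_id": ["CS50x", "ER22x", "PH278x"], "nchapters": [2, 2, 0]}
)

rows = tiny[1:3]                  # a slice selects rows: a dataframe with 2 rows

try:
    tiny[1]                       # a bare number is looked up as a column name
    lookup_failed = False
except KeyError:
    lookup_failed = True          # there is no column called 1

one_row_series = tiny.loc[1]      # single label: a Series of that row's values
one_row_frame = tiny.loc[[1]]     # list of labels: a dataframe with one row

print(type(one_row_series).__name__, type(one_row_frame).__name__)
```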

We can pick out a column by its name and then take a slice of its rows.


In [19]:
df['final_cc_cname_DI'][100:110]


Out[19]:
100         United States
101         United States
102         United States
103         United States
104    Russian Federation
105    Russian Federation
106         United States
107         United States
108         United States
109         United States
Name: final_cc_cname_DI, dtype: object

In [18]:
df.dtypes


Out[18]:
course_id             object
userid_DI             object
registered             int64
viewed                 int64
explored               int64
certified              int64
final_cc_cname_DI     object
LoE_DI                object
YoB                  float64
gender                object
grade                 object
start_time_DI         object
last_event_DI         object
nevents              float64
ndays_act            float64
nplay_video          float64
nchapters            float64
nforum_posts           int64
roles                float64
incomplete_flag      float64
dtype: object

When reading a CSV, pandas parses each column and tries to infer what type of data it holds. It's good but not infallible.

  • Pandas is particularly good with dates: you simply tell it which columns to parse as dates. Let's refine our reading of the CSV to parse the dates.

In [19]:
df=pd.read_csv('HMXPC_13.csv', sep="," , parse_dates=['start_time_DI', 'last_event_DI'])

Note that we pass a list of column names so that more than one column is parsed as dates.
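
To see what parse_dates changes, here is a minimal sketch that reads the same made-up CSV text twice, once without and once with parse_dates. The three rows and the names csv_text, plain, parsed, and active_span are invented for illustration; only the column names come from the course data.

```python
import io
import pandas as pd

# Three made-up rows using a few of the course-data column names.
csv_text = """userid_DI,start_time_DI,last_event_DI,nevents
A,2012-12-19,2013-11-17,9
B,2012-10-15,,
C,2013-02-08,2013-05-23,502
"""

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["start_time_DI", "last_event_DI"],
)

print(plain.dtypes)    # the date columns are left as text
print(parsed.dtypes)   # the date columns become datetime columns

# Once parsed, dates support arithmetic; a missing date gives a missing result.
active_span = parsed["last_event_DI"] - parsed["start_time_DI"]
print(active_span)
```

The nevents column also shows the inference at work: because one value is missing, pandas stores the column as floating point rather than integers.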


In [27]:
df["start_time_DI"]


Out[27]:
0     2012-12-19
1     2012-10-15
2     2013-02-08
3     2012-09-17
4     2012-12-19
5     2012-09-17
6     2013-02-08
7     2013-01-01
8     2013-02-18
9     2012-10-20
10    2013-02-23
11    2013-06-17
12    2013-01-24
13    2013-06-27
14    2012-12-19
...
641123    2013-09-07
641124    2013-09-07
641125    2013-09-07
641126    2013-09-07
641127    2013-09-07
641128    2013-09-07
641129    2013-09-07
641130    2013-09-07
641131    2013-09-07
641132    2013-09-07
641133    2013-09-07
641134    2013-09-07
641135    2013-09-07
641136    2013-09-07
641137    2013-09-07
Name: start_time_DI, Length: 641138, dtype: object

Now we can count how many registrations started on each date.


In [28]:
startdates=df['start_time_DI'].value_counts()
# Exercise for the reader: how might you do this without using the `.value_counts()` method?

In [29]:
startdates


Out[29]:
2012-08-17    10165
2013-01-23     8368
2012-10-15     6766
2012-08-16     6369
2012-12-20     5858
2013-02-14     5810
2012-12-21     5809
2012-08-18     5531
2012-08-13     5247
2013-03-03     5053
2012-10-16     4639
2012-07-24     4635
2013-02-15     4436
2013-01-22     4263
2012-08-20     4107
...
2013-07-15    396
2013-07-18    390
2013-07-10    386
2013-07-20    378
2013-07-09    374
2013-07-08    365
2013-07-04    357
2013-07-12    334
2013-07-05    307
2013-07-14    279
2013-07-13    275
2013-07-06    274
2013-07-07    273
2012-07-23      5
2013-09-08      1
Length: 413, dtype: int64
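
The output above is ordered by count, not by date, because value_counts() sorts from most to least frequent. A minimal sketch with made-up dates shows that ordering, and how sort_index() puts the same counts into date order, which may be what you want before drawing a line plot over time. The names starts, counts, and by_date are illustrative.

```python
import pandas as pd

# Made-up start dates for six registrations.
starts = pd.Series(
    pd.to_datetime(
        [
            "2012-08-17", "2012-10-15", "2012-08-17",
            "2012-07-24", "2012-08-17", "2012-10-15",
        ]
    )
)

counts = starts.value_counts()   # one entry per distinct date, biggest count first
by_date = counts.sort_index()    # same counts, reordered from earliest date to latest

print(counts)
print(by_date)
```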

In [25]:
startdates.plot()


Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x111156b90>

In [26]:
startdates.plot(title="I can't it's not butter.")


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d190ed0>

What are


In [28]:
startdates.plot(kind="bar")


Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d184810>

In [29]:
#Ok, let's consider how many times different people played a video
df["nplay_video"].dropna().plot()


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x112200f50>
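
dropna() discards the missing values (NaN) before plotting. A small sketch with made-up play counts shows what it keeps; the surviving values retain their original row labels, so a plot of the result is drawn against those row labels. The names plays and played are illustrative.

```python
import numpy as np
import pandas as pd

plays = pd.Series([np.nan, 50, np.nan, 5, 2], name="nplay_video")

played = plays.dropna()            # keeps only the rows with a recorded value

print(len(plays), len(played))     # 5 3
print(played.index.tolist())       # original row labels survive: [1, 3, 4]
```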
