I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but I have since switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and numpy. I am not sure I will do this for every week/quiz of the class; we will see.

You can find the questions and solutions for the quiz (quiz2.pdf), the data (ss06hid.csv, ss06pid.csv) and the metadata on the GitHub page.


In [26]:
import numpy as np
import pandas as pd

Question 2


In [141]:
import urllib.request
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')

In [142]:
lines = []
for i in range(150):
    lines.append(f.readline())

In [143]:
len(lines[1]), len(lines[44]), len(lines[121])


Out[143]:
(920, 7, 26)

Python keeps the '\n' on each line, so -2
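A small sketch of the line-terminator issue, using an in-memory byte stream as a hypothetical stand-in for the HTTP response (the three lines below are made up, not the real page). It shows that readline() returns the terminator along with the line, and that stripping it before calling len() gives the count of visible characters only.

```python
import io

# Hypothetical stand-in for the object returned by urlopen():
# a bytes stream with three lines and mixed line endings.
fake_response = io.BytesIO(b"<html>\r\n<body>\n</html>")

raw_lines = [fake_response.readline() for _ in range(3)]

# Lengths as readline() returns them, terminators included.
raw_lengths = [len(line) for line in raw_lines]

# Strip only trailing '\r' and '\n' bytes, then count what is left.
stripped_lengths = [len(line.rstrip(b"\r\n")) for line in raw_lines]

print(raw_lengths, stripped_lengths)
```

Stripping first avoids having to remember how many characters to subtract, which depends on whether the page uses '\n' or '\r\n'.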

Question 3


In [2]:
housing = pd.read_csv('ss06hid.csv')

In [6]:
housing


Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6496 entries, 0 to 6495
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)

In [10]:
len(housing[housing['VAL'] >= 24])


Out[10]:
53

Question 4

The column holds too much information: family type and employment status are two variables packed into one.

Question 5


In [19]:
len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])


Out[19]:
148

In [20]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])


Out[20]:
386

In [21]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])


Out[21]:
49
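The three counts above could also be read off a single cross-tabulation. This is a minimal sketch on a toy frame (the rows are invented; only the column names BDS and RMS come from the data set), assuming pd.crosstab is an acceptable alternative to the three boolean filters.

```python
import pandas as pd

# Toy data standing in for the BDS (bedrooms) and RMS (rooms) columns.
toy = pd.DataFrame({
    "BDS": [3, 3, 2, 2, 2, 1],
    "RMS": [4, 4, 5, 7, 5, 3],
})

# One table with the count of every (BDS, RMS) combination.
counts = pd.crosstab(toy["BDS"], toy["RMS"])

# Look up individual combinations by label: row = BDS, column = RMS.
three_bed_four_room = counts.loc[3, 4]
two_bed_five_room = counts.loc[2, 5]
two_bed_seven_room = counts.loc[2, 7]

print(counts)
```

On the real data the same call would be pd.crosstab(housing['BDS'], housing['RMS']).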

Question 6


In [60]:
agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
type(agricultureLogical)


Out[60]:
pandas.core.series.Series

In [66]:
np.where(agricultureLogical)


Out[66]:
(array([ 124,  237,  261,  469,  554,  567,  607,  642,  786,  807,  823,
        848,  951,  954, 1032, 1264, 1274, 1314, 1387, 1606, 1628, 1650,
       1855, 1918, 2100, 2193, 2402, 2442, 2538, 2579, 2654, 2679, 2739,
       2837, 2964, 3130, 3132, 3162, 3290, 3369, 3401, 3584, 3651, 3851,
       3861, 3911, 4022, 4044, 4106, 4112, 4116, 4184, 4197, 4309, 4342,
       4353, 4447, 4452, 4460, 4717, 4816, 4834, 4909, 5139, 5198, 5235,
       5325, 5416, 5530, 5573, 5893, 6032, 6043, 6088, 6274, 6375, 6419]),)
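One detail worth checking is what happens to missing values in the comparison. The sketch below uses a toy frame (invented rows, real column names) to show that a comparison against NaN evaluates to False in pandas, so those rows are simply left out of the positions. np.flatnonzero is used here as an alternative to np.where that returns a flat array instead of a tuple.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 3 each have a missing value in one column.
toy = pd.DataFrame({
    "ACR": [3.0, np.nan, 1.0, 3.0],
    "AGS": [6.0, 6.0, 6.0, np.nan],
})

# Comparisons with NaN are False, so the mask has no missing entries.
mask = (toy["ACR"] >= 3) & (toy["AGS"] >= 6)

# Positions where the mask is True, as a flat integer array.
positions = np.flatnonzero(mask.to_numpy())

print(mask.tolist(), positions)
```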

Question 7


In [67]:
q7subsetDataFrame = housing[agricultureLogical]

In [68]:
q7subsetDataFrame


Out[68]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 124 to 6419
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)

In [71]:
l1 = len(q7subsetDataFrame)
l1


Out[71]:
77

In [72]:
l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2


Out[72]:
69

In [73]:
l1 - l2


Out[73]:
8
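The same missing-value count can be obtained in one step. A minimal sketch on a toy column (values invented, column name from the data set), assuming isna().sum() is an acceptable shortcut for subtracting the two lengths:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values.
toy = pd.DataFrame({"MRGX": [1.0, np.nan, 2.0, np.nan, 3.0]})

# Length difference, as done above.
by_difference = len(toy) - len(toy["MRGX"].dropna())

# Direct count: isna() marks missing entries, sum() counts the True values.
missing = int(toy["MRGX"].isna().sum())

print(by_difference, missing)
```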

Question 8


In [180]:
splits = []
for col in housing.columns:
    splits.append(col.split("wgtp"))

In [182]:
splits[122]


Out[182]:
['', '15']
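To see why the result starts with an empty string, here is a small sketch on a hand-picked list of column names (a tiny subset chosen for illustration, not the full 188 columns). str.split returns the pieces on both sides of the separator, so a name that begins with "wgtp" yields an empty first piece, and a name without "wgtp" comes back unchanged as a one-element list.

```python
# A few column names for illustration; the real frame has 188 of them.
column_names = ["RT", "SERIALNO", "wgtp15", "wgtp80"]

# Split every name on the literal string "wgtp".
splits = [name.split("wgtp") for name in column_names]

for name, parts in zip(column_names, splits):
    print(name, parts)
```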

Question 9


In [82]:
housing['YBL'].quantile(0)


Out[82]:
-1.0

In [83]:
housing['YBL'].quantile(1)


Out[83]:
25.0

Something is wrong: YBL is 'When structure first built' and should only take the coded values below, so -1 and 25 are not valid:

  • b .N/A (GQ)
  • 1 .2005 or later
  • 2 .2000 to 2004
  • 3 .1990 to 1999
  • 4 .1980 to 1989
  • 5 .1970 to 1979
  • 6 .1960 to 1969
  • 7 .1950 to 1959
  • 8 .1940 to 1949
  • 9 .1939 or earlier
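A quick way to list the offending values is to compare the column against the codes in the list above. This is a sketch on a toy series (values invented), assuming the blank N/A code is read by pandas as NaN and should be ignored rather than flagged.

```python
import numpy as np
import pandas as pd

# Toy stand-in for housing['YBL']; NaN plays the role of the blank N/A code.
toy_ybl = pd.Series([1, 5, 9, -1, 25, np.nan])

# Codes 1 through 9 are the only documented non-missing values.
valid_codes = list(range(1, 10))

# Drop missing entries first, then keep whatever is not a documented code.
observed = toy_ybl.dropna()
invalid = observed[~observed.isin(valid_codes)]

print(invalid.tolist())
```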

Question 10


In [84]:
populations = pd.read_csv('ss06pid.csv')

In [88]:
pd.merge(populations, housing, on='SERIALNO', how='outer')


Out[88]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15451 entries, 0 to 15450
Columns: 426 entries, RT_x to wgtp80
dtypes: float64(333), int64(89), object(4)
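The RT_x name in the output comes from how merge handles columns that exist in both frames. The sketch below uses two toy frames (invented rows; SERIALNO and RT are real column names) to show the default _x/_y suffixes and how an outer join keeps keys found in only one side.

```python
import pandas as pd

# Two people in household 1, one person in household 2.
toy_population = pd.DataFrame({"SERIALNO": [1, 1, 2], "RT": ["P", "P", "P"]})

# Households 1, 2 and 3; household 3 has no person records.
toy_housing = pd.DataFrame({"SERIALNO": [1, 2, 3], "RT": ["H", "H", "H"]})

# Outer join on the shared key; the overlapping RT column gets suffixes.
merged = pd.merge(toy_population, toy_housing, on="SERIALNO", how="outer")

print(merged)
```

Household 3 still appears in the result, with a missing RT_x, because how='outer' keeps keys from both sides.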

Conclusion

I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has spent years consolidating itself as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language fighting on all fronts, which can be good or bad; let's hope it is good.

FYI, I almost didn't get question 8 xD.