I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but since then I have switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and NumPy. I am not sure I will do this for every week's quiz; we will see.

You can find the questions and solutions (quiz2.pdf), the data (ss06hid.csv, ss06pid.csv) and the metadata on the GitHub page.



In [26]:

    
import numpy as np
import pandas as pd

Question 2



In [141]:

    
import urllib.request
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')



In [142]:

    
lines = []
for i in range(150):
    lines.append(f.readline())



In [143]:

    
len(lines[1]), len(lines[44]), len(lines[121])









    Out[143]:





(920, 7, 26)

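readline() keeps the line ending on each line (presumably the two-character '\r\n' here), so I subtract 2 from each length, which gives 918, 5 and 24.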
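
A minimal sketch of why the raw lengths come out too long: `readline()` returns each line with its terminator still attached. The `io.BytesIO` stream below is a made-up stand-in for the HTTP response (nothing is downloaded), and the `\r\n` endings are an assumption about the real page.

```python
import io

# Hypothetical stand-in for the response object: two lines ending in '\r\n'.
fake_response = io.BytesIO(b"<html>\r\n<body>\r\n")

raw = [fake_response.readline() for _ in range(2)]

# Raw lengths include the terminator bytes.
raw_lengths = [len(line) for line in raw]

# Stripping '\r' and '\n' from the right leaves only the content.
clean_lengths = [len(line.rstrip(b"\r\n")) for line in raw]

print(raw_lengths, clean_lengths)
```

Stripping the terminator explicitly avoids having to remember how many characters to subtract.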

Question 3



In [2]:

    
housing = pd.read_csv('ss06hid.csv')



In [6]:

    
housing









    Out[6]:





<class 'pandas.core.frame.DataFrame'>
Int64Index: 6496 entries, 0 to 6495
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)



In [10]:

    
len(housing[housing['VAL'] >= 24])









    Out[10]:





53
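
As a side note, the same count can be taken straight from the boolean mask without building the filtered frame. This is only a sketch on made-up `VAL` codes, not the real data:

```python
import numpy as np
import pandas as pd

# Made-up VAL codes, including a missing value.
toy = pd.DataFrame({"VAL": [24, 10, np.nan, 24, 3]})

mask = toy["VAL"] >= 24          # NaN >= 24 evaluates to False

count_by_len = len(toy[mask])    # the approach used in the cell above
count_by_sum = int(mask.sum())   # each True counts as 1

print(count_by_len, count_by_sum)
```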

Question 4

The column holds too much information: family type and employment status.
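
One way such a column could be split into two variables is with a lookup table. Everything in this sketch is hypothetical: the column names, the codes and their meanings are invented for illustration and are not taken from the codebook.

```python
import pandas as pd

# Hypothetical combined codes; NOT the real codebook values.
combined = pd.DataFrame({"household": [1, 2, 3], "combined_code": [1, 2, 1]})

# Hypothetical lookup table: one row per code, one column per variable.
lookup = pd.DataFrame({
    "combined_code": [1, 2],
    "family_type": ["married couple", "married couple"],
    "employment": ["both in labor force", "one in labor force"],
})

# A left merge attaches the two separate variables to every household.
tidy = combined.merge(lookup, on="combined_code", how="left")
print(tidy)
```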

Question 5



In [19]:

    
len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])









    Out[19]:





148



In [20]:

    
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])









    Out[20]:





386



In [21]:

    
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])









    Out[21]:





49
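
The three counts above could also be read from a single cross-tabulation. A small sketch with made-up `BDS` and `RMS` values (`pd.crosstab` ignores rows with missing values by default):

```python
import pandas as pd

# Made-up bedroom (BDS) and room (RMS) counts.
toy = pd.DataFrame({
    "BDS": [3, 3, 2, 2, 2, 2],
    "RMS": [4, 4, 5, 7, 5, 5],
})

# Rows are BDS values, columns are RMS values, cells are frequencies.
table = pd.crosstab(toy["BDS"], toy["RMS"])
print(table)

# Individual cells are looked up by (BDS, RMS) label.
n_3bed_4room = table.loc[3, 4]
n_2bed_5room = table.loc[2, 5]
n_2bed_7room = table.loc[2, 7]
```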

Question 6



In [60]:

    
agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
type(agricultureLogical)









    Out[60]:





pandas.core.series.Series



In [66]:

    
np.where(agricultureLogical)









    Out[66]:





(array([ 124,  237,  261,  469,  554,  567,  607,  642,  786,  807,  823,
        848,  951,  954, 1032, 1264, 1274, 1314, 1387, 1606, 1628, 1650,
       1855, 1918, 2100, 2193, 2402, 2442, 2538, 2579, 2654, 2679, 2739,
       2837, 2964, 3130, 3132, 3162, 3290, 3369, 3401, 3584, 3651, 3851,
       3861, 3911, 4022, 4044, 4106, 4112, 4116, 4184, 4197, 4309, 4342,
       4353, 4447, 4452, 4460, 4717, 4816, 4834, 4909, 5139, 5198, 5235,
       5325, 5416, 5530, 5573, 5893, 6032, 6043, 6088, 6274, 6375, 6419]),)
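
A short sketch of how the logical vector behaves with missing values, on made-up data: comparisons against NaN are False, so those rows never show up in the positions. `np.flatnonzero` is used here as an alternative to `np.where` that returns a flat array instead of a tuple.

```python
import numpy as np
import pandas as pd

# Made-up ACR and AGS codes with some missing values.
toy = pd.DataFrame({
    "ACR": [3, 1, np.nan, 3, 3],
    "AGS": [6, 6, 6, np.nan, 6],
})

# NaN >= 3 and NaN >= 6 are both False, so rows 2 and 3 drop out.
logical = (toy["ACR"] >= 3) & (toy["AGS"] >= 6)

# Positions of the True entries as a plain 1-D array.
positions = np.flatnonzero(logical.values)
print(positions)
```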

Question 7



In [67]:

    
q7subsetDataFrame = housing[agricultureLogical]



In [68]:

    
q7subsetDataFrame









    Out[68]:





<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 124 to 6419
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)



In [71]:

    
l1 = len(q7subsetDataFrame)
l1









    Out[71]:





77



In [72]:

    
l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2









    Out[72]:





69



In [73]:

    
l1 - l2









    Out[73]:





8
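
The missing values can also be counted directly instead of by subtraction. A sketch on a made-up `MRGX` column, assuming a pandas version that provides `isna()` (older versions spell it `isnull()`):

```python
import numpy as np
import pandas as pd

# Made-up MRGX values with two missing entries.
subset = pd.DataFrame({"MRGX": [1.0, np.nan, 2.0, np.nan, 3.0]})

# Direct count: isna() marks missing entries, sum() counts the True values.
n_missing = int(subset["MRGX"].isna().sum())

# The subtraction used in the cells above gives the same number.
n_by_difference = len(subset) - len(subset["MRGX"].dropna())

print(n_missing, n_by_difference)
```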

Question 8



In [180]:

    
splits = []
for col in housing.columns:
    splits.append(col.split("wgtp"))



In [182]:

    
splits[122]









    Out[182]:





['', '15']
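
To see where the empty string comes from, here is a sketch on a small hand-picked list of column names (not the full 188 columns): `str.split` returns an empty first element when the separator sits at the very start of the string, and a one-element list when the separator is absent.

```python
# A small, hand-picked sample of column names for illustration.
columns = ["RT", "SERIALNO", "wgtp15", "wgtp80"]

# Same split as in the loop above, written as a list comprehension.
splits = [name.split("wgtp") for name in columns]

for name, parts in zip(columns, splits):
    print(name, parts)
```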

Question 9



In [82]:

    
housing['YBL'].quantile(0)









    Out[82]:





-1.0



In [83]:

    
housing['YBL'].quantile(1)









    Out[83]:





25.0

Something is wrong, because YBL is 'When structure first built' and the metadata only lists these values:

b .N/A (GQ)
1 .2005 or later
2 .2000 to 2004
3 .1990 to 1999
4 .1980 to 1989
5 .1970 to 1979
6 .1960 to 1969
7 .1950 to 1959
8 .1940 to 1949
9 .1939 or earlier
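
A quick way to flag values that fall outside the listed codes is `isin`. This is only a sketch: the series below is made up and simply reuses the two suspicious extremes found above.

```python
import pandas as pd

# Codes 1..9 from the list above.
valid_codes = list(range(1, 10))

# Made-up YBL values, including the two suspicious extremes.
ybl = pd.Series([1, 5, 9, -1, 25, 3])

# Keep only the values that are not in the list of valid codes.
unexpected = ybl[~ybl.isin(valid_codes)]
print(unexpected.tolist())
```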

Question 10



In [84]:

    
populations = pd.read_csv('ss06pid.csv')



In [88]:

    
pd.merge(populations, housing, on='SERIALNO', how='outer')









    Out[88]:





<class 'pandas.core.frame.DataFrame'>
Int64Index: 15451 entries, 0 to 15450
Columns: 426 entries, RT_x to wgtp80
dtypes: float64(333), int64(89), object(4)
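
To make the outer merge easier to inspect, here is a sketch on two tiny made-up tables. Columns that exist in both tables (like `RT`) get the `_x` / `_y` suffixes, and the optional `indicator=True` argument adds a `_merge` column saying which side each row came from.

```python
import pandas as pd

# Made-up person and housing records sharing a SERIALNO key.
people = pd.DataFrame({"SERIALNO": [1, 1, 2], "RT": ["P", "P", "P"]})
homes = pd.DataFrame({"SERIALNO": [1, 3], "RT": ["H", "H"]})

# Outer merge keeps keys found in either table.
merged = pd.merge(people, homes, on="SERIALNO", how="outer", indicator=True)
print(merged)

# How many rows matched on both sides, or only on one.
print(merged["_merge"].value_counts())
```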

Conclusion

I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has spent years consolidating itself as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language that is fighting on all fronts, which can be good or bad; let's hope it is good.

FYI, I almost didn't get question 8 xD.