I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but since then I have switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and NumPy. I'm not sure if I will do this for each week's quiz; we will see.
You can find the questions and solutions (quiz2.pdf), the data (ss06hid.csv, ss06pid.csv) and the metadata on the GitHub page.
In [26]:
import numpy as np
import pandas as pd
In [141]:
import urllib.request
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')
In [142]:
lines = []
for i in range(150):
    lines.append(f.readline())
In [143]:
len(lines[1]), len(lines[44]), len(lines[121])
Out[143]:
Python keeps the trailing '\n' on each line, so -2.
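To see what the line ending does to the count, here is a small sketch on made-up strings (not the real page content). It only shows that `len` counts the line ending too: one character for '\n', two for '\r\n'. Which ending the real page uses is something I did not verify.

```python
# Made-up lines for illustration; the real ones come from f.readline().
unix_line = "<html>\n"
windows_line = "<html>\r\n"

# len() counts every character, including the line ending.
print(len(unix_line))     # 7
print(len(windows_line))  # 8

# Stripping the line ending first gives the length of the visible text.
stripped_lengths = [len(s.rstrip("\r\n")) for s in (unix_line, windows_line)]
print(stripped_lengths)   # [6, 6]
```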
In [2]:
housing = pd.read_csv('ss06hid.csv')
In [6]:
housing
Out[6]:
In [10]:
len(housing[housing['VAL'] >= 24])
Out[10]:
The column holds too much information: family type and employment status.
In [19]:
len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])
Out[19]:
In [20]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])
Out[20]:
In [21]:
len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])
Out[21]:
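The three counts above all use the same pattern: build a boolean mask, filter, then take `len`. A minimal sketch of that pattern on a toy DataFrame (the values are made up, only the column names BDS and RMS come from the housing file):

```python
import pandas as pd

# Toy data, not the real survey file.
toy = pd.DataFrame({"BDS": [3, 2, 3, 2], "RMS": [4, 5, 4, 7]})

# Each comparison gives a boolean Series; & combines them element-wise.
# The parentheses are needed because & binds tighter than ==.
mask = (toy["BDS"] == 3) & (toy["RMS"] == 4)

count_by_len = len(toy[mask])   # filter the rows, then count them
count_by_sum = int(mask.sum())  # True counts as 1, so the sum is the count
print(count_by_len, count_by_sum)
```

Summing the mask gives the same number without building the filtered DataFrame, which is handy when only the count is needed.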
In [60]:
agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
Out[60]:
In [66]:
np.where(agricultureLogical)
Out[66]:
In [67]:
q7subsetDataFrame = housing[agricultureLogical]
In [68]:
q7subsetDataFrame
Out[68]:
In [71]:
l1 = len(q7subsetDataFrame)
l1
Out[71]:
In [72]:
l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2
Out[72]:
In [73]:
l1 - l2
Out[73]:
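The difference `l1 - l2` is the number of missing values in MRGX. A minimal sketch on a toy column (made-up values) showing that `isna().sum()` gives the same number directly:

```python
import numpy as np
import pandas as pd

# Toy column with two missing values, standing in for q7subsetDataFrame['MRGX'].
toy = pd.DataFrame({"MRGX": [1.0, np.nan, 2.0, np.nan, 3.0]})

total = len(toy)                     # all rows
present = len(toy["MRGX"].dropna())  # rows that have a value
missing_by_diff = total - present

# isna() marks missing entries with True; summing them counts the missing rows.
missing_direct = int(toy["MRGX"].isna().sum())
print(missing_by_diff, missing_direct)
```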
In [180]:
splits = []
for col in housing.columns:
    splits.append(col.split("wgtp"))
In [182]:
splits[122]
Out[182]:
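To make clear what `split("wgtp")` returns, here is a small sketch on a few column names. SERIALNO is a real column used in the merge in this post; the other two names are only illustrative and I am assuming the weight columns look like them.

```python
# Illustrative column names; only SERIALNO is taken from this post.
columns = ["SERIALNO", "wgtp15", "wgtp"]

# split() cuts the string at every occurrence of the separator.
# A name that starts with the separator yields an empty string first.
splits = [col.split("wgtp") for col in columns]
print(splits)  # [['SERIALNO'], ['', '15'], ['', '']]
```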
In [82]:
housing['YBL'].quantile(0)
Out[82]:
In [83]:
housing['YBL'].quantile(1)
Out[83]:
Something is wrong, because YBL is 'When structure first built'.
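`quantile(0)` and `quantile(1)` are just the minimum and maximum of the column, so a strange value at either end is easy to spot. A minimal sketch on a toy Series, assuming the column holds category codes rather than years (the values here, including -1, are made up):

```python
import pandas as pd

# Toy coded column; -1 stands in for a hypothetical invalid code.
codes = pd.Series([-1, 1, 3, 5, 25])

lowest = codes.quantile(0)   # the 0th quantile is the minimum
highest = codes.quantile(1)  # the 1st quantile is the maximum
print(lowest, highest)

# Counting each distinct value shows which codes actually occur.
print(codes.value_counts().sort_index())
```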
In [84]:
populations = pd.read_csv('ss06pid.csv')
In [88]:
pd.merge(populations, housing, on='SERIALNO', how='outer')
Out[88]:
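To see what an outer merge on SERIALNO does, here is a minimal sketch on two toy frames. Only the key column name comes from the real files; AGE, ROOMS and all values are made up.

```python
import pandas as pd

# Toy frames sharing the key column SERIALNO.
people = pd.DataFrame({"SERIALNO": [1, 1, 2], "AGE": [30, 5, 41]})
homes = pd.DataFrame({"SERIALNO": [1, 3], "ROOMS": [4, 6]})

# how='outer' keeps keys from both sides and fills the gaps with NaN.
# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(people, homes, on="SERIALNO", how="outer", indicator=True)
print(merged)
print(merged.shape)
```

Key 1 appears twice in `people`, so it produces two rows; keys 2 and 3 each appear on one side only and get NaN in the columns from the other side.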
I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has been consolidated for years as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language that is fighting on all fronts, which can be good or bad; let's hope it is good.
FYI, I almost didn't get question 8 xD.