I am taking the Data Analysis class on Coursera. I wanted to keep learning R, since I previously took Computing for Data Analysis, but since then I have switched to Python. I think I love Python too much, so I did this week's quiz in Python using pandas and NumPy. I'm not sure whether I will do this for every quiz in the class; we will see.

You can find the questions and solutions for quiz2.pdf, the data (ss06hid.csv and ss06pid.csv) and the metadata on the GitHub page.

``````

In [26]:

import numpy as np
import pandas as pd

``````

## Question 2

``````

In [141]:

import urllib.request  # urlopen lives in the urllib.request submodule
f = urllib.request.urlopen('http://simplystatistics.tumblr.com/')

``````
``````

In [142]:

lines = []
for i in range(150):
    lines.append(f.readline())  # assumed loop body (missing in the original)

``````
``````

In [143]:

len(lines[1]), len(lines[44]), len(lines[121])

``````
``````

Out[143]:

(920, 7, 26)

``````

Python adds a `'\n'` on each line, so subtract 2 from each length (`-2`).
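Instead of subtracting by hand, the line ending can be stripped before measuring. This is only a minimal sketch: `sample_lines` is made up for illustration and stands in for the lines read from the page (which arrive as `bytes` when read from `urlopen`).

```python
# Hypothetical sample lines, standing in for lines read from the page.
sample_lines = [b"<html>\r\n", b"<head>\n", b"no newline at end"]

# rstrip(b"\r\n") removes any trailing carriage returns and newlines,
# so the length no longer includes the line ending.
lengths = [len(line.rstrip(b"\r\n")) for line in sample_lines]
print(lengths)
```

This way the count does not depend on whether a line ends in `'\n'`, `'\r\n'`, or nothing at all.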

## Question 3

``````

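In [2]:

# Assumed load step (cell body missing in the original); file name from the data list above.
housing = pd.read_csv('ss06hid.csv')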

``````
``````

In [6]:

housing

``````
``````

Out[6]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6496 entries, 0 to 6495
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)

``````
``````

In [10]:

len(housing[housing['VAL'] >= 24])

``````
``````

Out[10]:

53

``````
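The same kind of count can also be taken straight from the boolean mask, without building the filtered frame. A small sketch on a made-up frame (`toy` and its values are assumptions, not the survey data):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; VAL mimics a coded value column with one missing entry.
toy = pd.DataFrame({"VAL": [24, 10, np.nan, 24, 3]})

# Comparisons against NaN evaluate to False, so missing values are not counted.
mask = toy["VAL"] >= 24
count = int(mask.sum())  # each True counts as 1
print(count)
```

Both approaches give the same number; summing the mask just skips the intermediate DataFrame.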

## Question 4

The column holds too much information: it combines family type and employment status.
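One way to picture the fix is to split such a combined code into one column per variable. The sketch below uses made-up codes and labels (not the real codebook values) and a hypothetical `lookup` table:

```python
import pandas as pd

# Made-up combined codes, not the real codebook values.
toy = pd.DataFrame({"combined": [1, 2, 3, 1]})

# Hypothetical lookup: each combined code maps to one value per variable.
lookup = pd.DataFrame(
    {
        "combined": [1, 2, 3],
        "family_type": ["married couple", "married couple", "single parent"],
        "employment": ["both working", "one working", "working"],
    }
)

# A left merge keeps every original row and adds one column per variable.
tidy = toy.merge(lookup, on="combined", how="left")
print(tidy)
```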

## Question 5

``````

In [19]:

len(housing[(housing['BDS'] == 3) & (housing['RMS'] == 4)])

``````
``````

Out[19]:

148

``````
``````

In [20]:

len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 5)])

``````
``````

Out[20]:

386

``````
``````

In [21]:

len(housing[(housing['BDS'] == 2) & (housing['RMS'] == 7)])

``````
``````

Out[21]:

49

``````
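Rather than filtering once per combination, `pd.crosstab` can count every bedroom/room pair in one call. A sketch on a made-up frame (the values in `toy` are assumptions):

```python
import pandas as pd

# Made-up bedroom (BDS) and room (RMS) counts.
toy = pd.DataFrame({"BDS": [3, 2, 2, 3, 2], "RMS": [4, 5, 5, 4, 7]})

# Rows are BDS values, columns are RMS values, cells are row counts.
table = pd.crosstab(toy["BDS"], toy["RMS"])
print(table)

# Look up a single combination: 2 bedrooms and 5 rooms.
print(table.loc[2, 5])
```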

## Question 6

``````

In [60]:

agricultureLogical = (housing['ACR'] >= 3) & (housing['AGS'] >= 6)
type(agricultureLogical)

``````
``````

Out[60]:

pandas.core.series.Series

``````
``````

In [66]:

np.where(agricultureLogical)

``````
``````

Out[66]:

(array([ 124,  237,  261,  469,  554,  567,  607,  642,  786,  807,  823,
848,  951,  954, 1032, 1264, 1274, 1314, 1387, 1606, 1628, 1650,
1855, 1918, 2100, 2193, 2402, 2442, 2538, 2579, 2654, 2679, 2739,
2837, 2964, 3130, 3132, 3162, 3290, 3369, 3401, 3584, 3651, 3851,
3861, 3911, 4022, 4044, 4106, 4112, 4116, 4184, 4197, 4309, 4342,
4353, 4447, 4452, 4460, 4717, 4816, 4834, 4909, 5139, 5198, 5235,
5325, 5416, 5530, 5573, 5893, 6032, 6043, 6088, 6274, 6375, 6419]),)

``````
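Note that these positions are zero-based, while R's `which()` counts from 1, so add 1 when comparing against an R answer. A small sketch on made-up data (`toy` is an assumption):

```python
import numpy as np
import pandas as pd

# Made-up ACR/AGS codes, including one missing value.
toy = pd.DataFrame({"ACR": [3, 1, 3, np.nan, 5], "AGS": [6, 6, 2, 6, 6]})
mask = (toy["ACR"] >= 3) & (toy["AGS"] >= 6)

# Zero-based integer positions of the True entries.
positions = np.flatnonzero(mask.to_numpy())

# One-based positions, comparable to what R's which() reports.
r_style = positions + 1
print(positions.tolist(), r_style.tolist())
```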

## Question 7

``````

In [67]:

q7subsetDataFrame = housing[agricultureLogical]

``````
``````

In [68]:

q7subsetDataFrame

``````
``````

Out[68]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 77 entries, 124 to 6419
Columns: 188 entries, RT to wgtp80
dtypes: float64(97), int64(90), object(1)

``````
``````

In [71]:

l1 = len(q7subsetDataFrame)
l1

``````
``````

Out[71]:

77

``````
``````

In [72]:

l2 = len(q7subsetDataFrame['MRGX'].dropna())
l2

``````
``````

Out[72]:

69

``````
``````

In [73]:

l1 - l2

``````
``````

Out[73]:

8

``````
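The missing values can also be counted directly with `isna().sum()`. A sketch on a made-up column (the values are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up column with two missing values.
toy = pd.DataFrame({"MRGX": [1.0, np.nan, 2.0, np.nan, 3.0]})

# isna() marks missing entries as True; sum() counts them.
missing = int(toy["MRGX"].isna().sum())

# Same result as the length difference used above.
by_difference = len(toy) - len(toy["MRGX"].dropna())
print(missing, by_difference)
```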

## Question 8

``````

In [180]:

splits = []
for col in housing.columns:
    # split each column name on the literal string "wgtp"
    splits.append(col.split("wgtp"))

``````
``````

In [182]:

splits[122]

``````
``````

Out[182]:

['', '15']

``````
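The empty string appears because the separator sits at the very start of the name, so there is nothing before it. A sketch with a few column names (only a made-up subset, on an empty frame):

```python
import pandas as pd

# Empty frame with a handful of column names for illustration.
toy = pd.DataFrame(columns=["RT", "SERIALNO", "wgtp15", "wgtp80"])

# Names without "wgtp" come back as a one-element list; names that start
# with "wgtp" give an empty string followed by the remainder.
splits = [col.split("wgtp") for col in toy.columns]
print(splits)
```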

## Question 9

``````

In [82]:

housing['YBL'].quantile(0)

``````
``````

Out[82]:

-1.0

``````
``````

In [83]:

housing['YBL'].quantile(1)

``````
``````

Out[83]:

25.0

``````

Something is wrong here: YBL is 'When structure first built', a coded variable whose valid values are listed below, so -1 and 25 should not appear.

• b .N/A (GQ)
• 1 .2005 or later
• 2 .2000 to 2004
• 3 .1990 to 1999
• 4 .1980 to 1989
• 5 .1970 to 1979
• 6 .1960 to 1969
• 7 .1950 to 1959
• 8 .1940 to 1949
• 9 .1939 or earlier
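A quick way to see which values fall outside the codebook is to filter on the valid codes. This is a sketch on a made-up series (`toy` is an assumption, not the real column):

```python
import pandas as pd

# Made-up YBL values, including two that are not valid codes.
toy = pd.Series([1, 5, 9, -1, 25, 3], name="YBL")

# Codes 1 to 9 from the list above.
valid_codes = list(range(1, 10))

# Keep only the entries that are not valid codes.
out_of_range = toy[~toy.isin(valid_codes)]
print(out_of_range.tolist())

# Minimum and maximum in a single call.
extremes = toy.quantile([0, 1]).tolist()
print(extremes)
```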

## Question 10

``````

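In [84]:

# Assumed load step (cell body missing in the original); file name from the data list above.
populations = pd.read_csv('ss06pid.csv')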

``````
``````

In [88]:

pd.merge(populations, housing, on='SERIALNO', how='outer')

``````
``````

Out[88]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15451 entries, 0 to 15450
Columns: 426 entries, RT_x to wgtp80
dtypes: float64(333), int64(89), object(4)

``````
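To see what the outer merge does, here is a sketch on two tiny made-up frames (`people` and `homes` are assumptions). Columns that share a name but are not the merge key get the `_x` and `_y` suffixes, which is where `RT_x` comes from:

```python
import pandas as pd

# Made-up person and housing records sharing a SERIALNO key.
people = pd.DataFrame({"SERIALNO": [1, 1, 2], "RT": ["P", "P", "P"]})
homes = pd.DataFrame({"SERIALNO": [1, 3], "RT": ["H", "H"]})

# how="outer" keeps keys found in either frame and fills gaps with NaN.
merged = pd.merge(people, homes, on="SERIALNO", how="outer")
print(merged)
```

SERIALNO 2 has no housing record and SERIALNO 3 has no person record, so each ends up with one missing `RT` value.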

## Conclusion

I believe Python is catching up in data analysis with tools such as pandas and scikit-learn. The problem is that it is only catching up, while R has had years to consolidate itself as the tool for doing data analysis. Still, I believe Python is the future because of its integration with other technologies, such as the web with Django. Python is a language that is fighting on all fronts, which can be good or bad; let's hope it is good.

FYI, I almost didn't get question 8 xD.