2. Refine the Data

"Data is messy"

  • Missing: e.g. check for missing or incomplete data
  • Quality: e.g. check for duplicates, accuracy, unusual values
  • Parse: e.g. extract year and month from a date
  • Convert: e.g. free text to a coded value
  • Derive: e.g. gender from title
  • Calculate: e.g. percentages, proportions
  • Remove: e.g. remove redundant data
  • Merge: e.g. first name and surname into a full name
  • Aggregate: e.g. roll up by year, cluster by area
  • Filter: e.g. exclude based on location
  • Sample: e.g. extract a representative subset of the data
  • Summarise: e.g. show summary statistics like the mean

In [1]:
# Load the libraries
import numpy as np
import pandas as pd

In [2]:
# Load the data again!
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])

2.1 Missing Data

By “missing” data we simply mean null, or “not present for whatever reason”. Let's see if we can find the missing data in our data set, whether it was missed during collection or never existed in the first place.
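Besides counting non-null values with count(), a common complementary check is isnull().sum(), which counts the missing values per column. A minimal sketch on a toy frame (not the price dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# Non-null counts per column (what df.count() reports)
print(toy.count())         # a: 2, b: 3

# Complementary view: number of missing values per column
print(toy.isnull().sum())  # a: 1, b: 0
```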


In [3]:
# Let's count the non-null values in each column to spot missing data
df.count()


Out[3]:
State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      12342
LowQN     22899
date      22899
dtype: int64

In [4]:
df['LowQ'].isnull()


Out[4]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
22869     True
22870     True
22871     True
22872     True
22873     True
22874     True
22875     True
22876     True
22877     True
22878     True
22879     True
22880     True
22881     True
22882     True
22883     True
22884     True
22885     True
22886     True
22887     True
22888     True
22889     True
22890     True
22891     True
22892     True
22893     True
22894     True
22895     True
22896     True
22897     True
22898     True
Name: LowQ, dtype: bool

In [5]:
# Look at the bottom rows, which have NaN values in LowQ
df.tail()


Out[5]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
22894 Virginia 364.98 3513 293.12 3079 NaN 284 2014-12-31
22895 Washington 233.05 3337 189.92 3562 NaN 160 2014-12-31
22896 West Virginia 359.35 551 224.03 545 NaN 60 2014-12-31
22897 Wisconsin 350.52 2244 272.71 2221 NaN 167 2014-12-31
22898 Wyoming 322.27 131 351.86 197 NaN 12 2014-12-31

Pandas represents missing values as NaN

What can we do with missing values?

  • Drop these rows / columns? Use .dropna(how='any')
  • Fill with a dummy value? Use .fillna(value=dummy)
  • Impute the cell with the most recent value? Use .fillna(method='ffill')
  • Interpolate the amount in a linear fashion? Use .interpolate()

We pass inplace=True so the dataframe itself is modified, rather than a modified copy being returned
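The four options above can be sketched on a tiny Series (toy data, not the price dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

dropped = s.dropna()         # drop the missing rows        -> [1.0, 4.0]
filled = s.fillna(value=0)   # fill with a dummy value      -> [1.0, 0.0, 0.0, 4.0]
ffilled = s.ffill()          # forward fill (equivalent to fillna(method='ffill'))
                             #                              -> [1.0, 1.0, 1.0, 4.0]
interp = s.interpolate()     # linear interpolation         -> [1.0, 2.0, 3.0, 4.0]
```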


In [6]:
# Let's sort this data frame by State and date
# (df.sort was removed in newer pandas in favour of sort_values)
df.sort_values(by=['State', 'date'], inplace=True)

In [7]:
# Let's fill each missing value with the last available value (forward fill)
df.fillna(method = "ffill", inplace=True)

In [8]:
df.count()


Out[8]:
State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      22899
LowQN     22899
date      22899
dtype: int64

Exercise

Fill the missing values with a backward fill.


In [ ]:

Fill the missing values with the mean for the column.


In [ ]:

2.2 Quality of the Data

Let's check for completeness.

Do we have data on every date for all 51 states?


In [10]:
df["State"].value_counts()


Out[10]:
Utah                    449
Washington              449
Massachusetts           449
Missouri                449
Connecticut             449
Vermont                 449
Pennsylvania            449
Georgia                 449
California              449
North Carolina          449
Florida                 449
Oregon                  449
Alabama                 449
Indiana                 449
Kentucky                449
Hawaii                  449
South Carolina          449
Virginia                449
Oklahoma                449
Texas                   449
Illinois                449
Nebraska                449
Ohio                    449
Arizona                 449
Rhode Island            449
Louisiana               449
Wisconsin               449
New York                449
Arkansas                449
Montana                 449
New Hampshire           449
North Dakota            449
New Mexico              449
Delaware                449
Iowa                    449
New Jersey              449
Kansas                  449
Idaho                   449
Colorado                449
Maryland                449
District of Columbia    449
Michigan                449
South Dakota            449
West Virginia           449
Minnesota               449
Wyoming                 449
Mississippi             449
Tennessee               449
Nevada                  449
Maine                   449
Alaska                  449
dtype: int64

Let's check whether the dates are all continuous


In [58]:
df1 = df[df.State=='California'].copy()
df2 = df[df.State=='California'].copy()

In [59]:
df1.shape


Out[59]:
(449, 8)

In [60]:
print("Earliest Date:", df1.date.min())
print("Latest Date:", df1.date.max())
print("Number of days in between them:", df1.date.max() - df1.date.min())


Earliest Date: 2013-12-27 00:00:00 
Latest Date: 2015-06-11 00:00:00 
Number of days in between them: 531 days 00:00:00

In [61]:
df1.groupby(['date']).size()


Out[61]:
date
2013-12-27    1
2013-12-28    1
2013-12-29    1
2013-12-30    1
2013-12-31    1
2014-01-01    1
2014-01-02    1
2014-01-03    1
2014-01-04    1
2014-01-05    1
2014-01-06    1
2014-01-07    1
2014-01-08    1
2014-01-09    1
2014-01-10    1
2014-01-11    1
2014-01-12    1
2014-01-13    1
2014-01-14    1
2014-01-15    1
2014-01-16    1
2014-01-17    1
2014-01-18    1
2014-01-19    1
2014-01-20    1
2014-01-21    1
2014-01-22    1
2014-01-23    1
2014-01-24    1
2014-01-25    1
             ..
2015-05-13    1
2015-05-14    1
2015-05-15    1
2015-05-16    1
2015-05-17    1
2015-05-18    1
2015-05-19    1
2015-05-20    1
2015-05-21    1
2015-05-22    1
2015-05-23    1
2015-05-24    1
2015-05-25    1
2015-05-26    1
2015-05-27    1
2015-05-28    1
2015-05-29    1
2015-05-30    1
2015-05-31    1
2015-06-01    1
2015-06-02    1
2015-06-03    1
2015-06-04    1
2015-06-05    1
2015-06-06    1
2015-06-07    1
2015-06-08    1
2015-06-09    1
2015-06-10    1
2015-06-11    1
dtype: int64

In [62]:
df1.set_index("date", inplace=True)
df2.set_index("date", inplace=True)

In [63]:
idx = pd.date_range(df1.index.min(), df1.index.max())

In [64]:
idx


Out[64]:
DatetimeIndex(['2013-12-27', '2013-12-28', '2013-12-29', '2013-12-30',
               '2013-12-31', '2014-01-01', '2014-01-02', '2014-01-03',
               '2014-01-04', '2014-01-05', 
               ...
               '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-05',
               '2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11'],
              dtype='datetime64[ns]', length=532, freq='D', tz=None)

In [65]:
df1 = df1.reindex(idx, fill_value=0)

In [66]:
df1.shape


Out[66]:
(532, 7)

Exercise Show the list of dates that were missing. Hint: leverage df2; both df1 and df2 have date as their index.


In [ ]:

2.3 Parse the Data

Let's see if we can get the year, month, week and weekday from the date. Pandas has good built-in functionality for time series data via the DatetimeIndex constructor


In [68]:
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['week'] = pd.DatetimeIndex(df['date']).week
df['weekday'] = pd.DatetimeIndex(df['date']).weekday
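Since the column is already datetime64 (thanks to parse_dates), the .dt accessor is an equivalent way to pull out the same components. A minimal sketch on toy dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-01-01", "2015-06-11"]))

print(dates.dt.year)     # 2014, 2015
print(dates.dt.month)    # 1, 6
print(dates.dt.weekday)  # 2 (Wednesday), 3 (Thursday)
# Note: in newer pandas, week numbers come from dates.dt.isocalendar().week
```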

In [69]:
df.dtypes


Out[69]:
State              object
HighQ             float64
HighQN              int64
MedQ              float64
MedQN               int64
LowQ              float64
LowQN               int64
date       datetime64[ns]
year                int32
month               int32
week                int32
weekday             int32
dtype: object

In [70]:
df.tail()


Out[70]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
4997 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-07 2015 6 23 6
5762 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-08 2015 6 24 0
6527 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-09 2015 6 24 1
7343 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-10 2015 6 24 2
8159 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-11 2015 6 24 3

In [71]:
df['year'].value_counts()


Out[71]:
2014    18564
2015     4080
2013      255
dtype: int64

In [72]:
df['month'].value_counts()


Out[72]:
1     3162
5     2703
2     2244
6     2091
12    1836
10    1581
7     1581
3     1581
11    1530
9     1530
8     1530
4     1530
dtype: int64

In [73]:
df["weekday"].value_counts()


Out[73]:
6    3315
0    3315
4    3264
3    3264
2    3264
1    3264
5    3213
dtype: int64

2.4 Aggregate the Data

To aggregate, we typically use “group by”, which involves the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure
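The three steps can be sketched on a toy frame (hypothetical states and prices, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["CA", "CA", "WY", "WY"],
    "price": [240.0, 250.0, 320.0, 340.0],
})

# Split by state, apply mean to each group, combine into one frame
means = toy.groupby("state", as_index=False).mean()
print(means)
#   state  price
# 0    CA  245.0
# 1    WY  330.0
```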

In [74]:
df_mean = df.groupby("State", as_index=False).mean()

In [75]:
df_mean.shape


Out[75]:
(51, 11)

In [76]:
df_mean.head()


Out[76]:
State HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday
0 Alabama 339.561849 1379.414254 204.606169 1270.351893 145.978508 161.149220 2014.167038 5.953229 23.812918 2.995546
1 Alaska 291.482004 321.244989 262.046392 407.917595 394.653964 32.334076 2014.167038 5.953229 23.812918 2.995546
2 Arizona 300.667483 2392.465479 209.365345 2137.414254 188.500134 279.006682 2014.167038 5.953229 23.812918 2.995546
3 Arkansas 348.056147 751.988864 190.414655 724.683742 126.771269 135.902004 2014.167038 5.953229 23.812918 2.995546
4 California 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038 5.953229 23.812918 2.995546

Pivot Table


In [77]:
df.head()


Out[77]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20094 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-27 2013 12 52 4
20859 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-28 2013 12 52 5
21573 Alabama 339.75 1036 198.26 929 149.49 123 2013-12-29 2013 12 52 6
22287 Alabama 339.75 1036 198.81 930 149.49 123 2013-12-30 2013 12 1 0
22797 Alabama 339.42 1040 198.68 932 149.49 123 2013-12-31 2013 12 1 1

In [78]:
pd.pivot_table(df, values='HighQ', index=['State'], columns=["weekday"] )


Out[78]:
weekday 0 1 2 3 4 5 6
State
Alabama 339.556308 339.577656 339.559375 339.552500 339.577969 339.568254 339.541385
Alaska 291.463231 291.384687 291.390000 291.597500 291.506406 291.596032 291.438923
Arizona 300.612000 300.653906 300.643750 300.655000 300.751562 300.715873 300.642308
Arkansas 347.930462 348.011250 348.017656 347.934844 348.278281 348.235238 347.991077
California 245.348923 245.364375 245.359219 245.342813 245.434219 245.425556 245.359231
Colorado 238.948462 238.885938 238.889844 238.977187 238.885937 238.897619 238.944769
Connecticut 341.624615 341.682812 341.633750 341.616406 341.832656 341.819683 341.652308
Delaware 366.712769 366.748750 366.705937 366.659375 366.962344 366.937143 366.750615
District of Columbia 348.205385 348.055000 348.132656 348.145625 348.213594 348.277778 348.212462
Florida 302.548923 302.577031 302.529688 302.510781 302.638437 302.648095 302.541231
Georgia 326.588308 326.635937 326.594219 326.579687 326.805000 326.784762 326.633077
Hawaii 309.940615 309.895312 309.855156 309.818750 309.934062 310.033016 309.956462
Idaho 275.642615 275.627656 275.609844 275.653437 275.671250 275.654444 275.616923
Illinois 355.406769 355.460313 355.442031 355.427188 355.580625 355.502540 355.426000
Indiana 333.147077 333.184531 333.187344 333.145781 333.257031 333.253333 333.164769
Iowa 367.059692 367.086094 367.071563 367.055625 367.175781 367.173175 367.051077
Kansas 350.760615 350.785469 350.765938 350.778438 350.928906 350.903492 350.813846
Kentucky 338.906615 338.857812 338.846719 338.874062 338.796719 338.799048 338.902769
Louisiana 366.743385 366.816562 366.761094 366.733437 367.016406 366.960159 366.800615
Maine 315.272615 315.392812 315.384844 315.330156 315.570469 315.523333 315.262308
Maryland 370.884308 370.970469 370.912187 370.899219 371.160937 371.162063 370.912462
Massachusetts 348.942462 348.995625 348.968750 348.945625 349.156562 349.100952 348.974615
Michigan 292.022462 292.080312 292.044531 292.004844 292.229844 292.199841 292.055077
Minnesota 358.104154 358.228437 358.188594 358.052969 358.507969 358.471587 358.156308
Mississippi 337.535385 337.479062 337.495469 337.532656 337.615937 337.660635 337.597231
Missouri 359.098923 359.278594 359.273594 359.264844 359.418438 359.434921 359.174000
Montana 267.832615 267.860625 267.713125 267.700937 267.930781 267.934444 267.850923
Nebraska 341.441692 341.522187 341.490000 341.479375 341.768438 341.713968 341.504923
Nevada 267.794769 267.827969 267.823594 267.811875 267.866875 267.824286 267.792308
New Hampshire 359.469846 359.543906 359.519844 359.362656 359.595469 359.512222 359.331231
New Jersey 349.260769 349.313594 349.299219 349.272969 349.435469 349.396667 349.276000
New Mexico 289.536769 289.511719 289.499531 289.470156 289.670937 289.685714 289.565231
New York 346.877692 346.907031 346.866406 346.831875 347.027344 346.995079 346.886154
North Carolina 349.413846 349.476719 349.435625 349.406563 349.585156 349.565873 349.444308
North Dakota 398.349692 398.576562 398.677656 398.550000 399.090469 398.785079 398.658769
Ohio 334.526615 334.546562 334.536406 334.532969 334.615938 334.592063 334.533385
Oklahoma 361.417077 361.645625 361.615313 361.597656 361.753437 361.598730 361.389846
Oregon 208.461385 208.474844 208.454063 208.422344 208.604688 208.579048 208.479385
Pennsylvania 366.067231 366.102969 366.094375 366.060469 366.253125 366.222857 366.081846
Rhode Island 323.733077 323.828281 323.801094 323.756094 323.998594 323.951746 323.803538
South Carolina 341.612154 341.686094 341.656250 341.641094 341.837500 341.803651 341.617846
South Dakota 375.626615 375.828125 375.830937 375.837500 375.946875 376.010159 375.657692
Tennessee 352.678000 352.730469 352.711094 352.707344 352.871250 352.868730 352.707846
Texas 337.034154 337.069687 337.037969 337.023750 337.267656 337.235714 337.071231
Utah 289.552308 289.597500 289.586094 289.567344 289.634687 289.671905 289.564923
Vermont 374.172462 374.250312 374.204844 374.181250 374.395156 374.335556 374.216615
Virginia 368.084308 368.120781 368.112813 368.087969 368.264375 368.245238 368.116462
Washington 233.381231 233.377969 233.361875 233.363594 233.422500 233.426032 233.382308
West Virginia 357.408000 357.428906 357.368125 357.330156 357.395938 357.424444 357.406462
Wisconsin 352.981077 353.050313 353.040469 353.000938 353.160000 353.141905 352.996308
Wyoming 338.837538 339.090625 339.045781 338.932969 339.260469 339.072698 338.910769

Exercise Get a cross-tabulation showing, for each state and each month, the prices for each weekday, as in the output below.


In [79]:



Out[79]:
weekday 0 1 2 3 4 5 6
State month
Alabama 1 338.018750 337.968750 338.152222 338.042000 338.020000 337.876667 338.090000
2 338.144286 338.576667 338.561667 338.630000 338.690000 338.535000 338.027143
3 340.758000 340.737500 340.857500 340.747500 340.665000 340.722000 340.750000
4 341.000000 340.930000 340.962000 341.167500 341.105000 341.127500 340.970000
5 339.645714 339.595714 339.535714 339.713750 339.685000 339.628750 339.360000
6 339.844286 339.711667 339.755000 339.733333 340.284000 340.210000 340.235000
7 340.630000 340.650000 340.586000 340.498000 340.537500 340.580000 340.545000
8 340.552500 340.507500 340.470000 340.440000 340.526000 340.525000 340.592000
9 340.434000 340.434000 340.430000 340.517500 340.455000 340.447500 340.382500
10 339.027500 338.820000 339.032000 339.048000 339.026000 339.005000 339.067500
11 339.422500 339.370000 339.360000 339.280000 339.197500 339.262000 339.350000
12 339.180000 339.085000 338.882000 339.177500 339.230000 339.192000 339.134000
Alaska 1 296.395000 296.257500 295.400000 296.391000 296.365000 297.210000 296.402500
2 292.368571 291.665000 291.706667 291.706667 291.431667 292.160000 292.424286
3 286.508000 286.417500 286.417500 286.417500 286.160000 286.284000 286.518000
4 288.577500 288.380000 288.186000 288.260000 288.260000 288.310000 288.092500
5 289.794286 289.687143 289.611429 289.386250 289.487500 289.311250 290.205000
6 289.178571 289.381667 289.380000 289.386667 288.578000 288.704000 288.268333
7 287.992500 288.152000 288.492000 288.316000 288.310000 288.280000 288.545000
8 289.442500 289.442500 289.470000 289.470000 289.244000 289.782500 289.618000
9 291.356000 291.356000 291.342500 291.365000 291.365000 291.365000 291.365000
10 291.535000 291.535000 291.510000 291.364000 291.530000 291.710000 291.747500
11 290.967500 290.967500 290.967500 290.967500 291.052500 290.914000 290.700000
12 298.530000 298.596667 300.566000 303.090000 300.184000 300.182000 300.182000
Arizona 1 300.628750 300.573750 300.885556 300.643000 300.635000 300.286667 300.636250
2 300.507143 301.148333 301.195000 301.123333 301.108333 301.058333 300.518571
3 302.794000 302.857500 302.877500 302.860000 302.825000 302.862000 302.806000
4 301.522500 301.706000 301.632000 301.637500 301.680000 301.537500 301.522500
5 299.955714 299.952857 299.924286 300.307500 300.270000 300.236250 299.550000
6 300.270000 300.210000 300.216667 300.213333 300.928000 300.974000 301.183333
... ... ... ... ... ... ... ... ...
West Virginia 7 354.302500 354.268000 354.078000 353.986000 353.955000 353.980000 354.125000
8 354.385000 354.452500 354.452500 354.480000 354.428000 354.337500 354.366000
9 355.302000 355.352000 355.325000 355.425000 355.530000 355.530000 355.467500
10 354.615000 354.655000 354.816000 354.816000 354.716000 354.580000 354.512500
11 357.317500 357.315000 357.272500 357.360000 357.477500 357.046000 357.536000
12 359.270000 359.333333 358.916000 358.862500 359.370000 359.430000 359.398000
Wisconsin 1 354.043750 354.036250 354.462222 354.029000 353.958000 353.483333 354.077500
2 353.748571 354.648333 354.675000 354.711667 354.625000 354.680000 353.748571
3 356.736000 356.827500 356.727500 356.697500 356.697500 356.766000 356.732000
4 355.545000 355.646000 355.600000 355.630000 355.612500 355.565000 355.590000
5 350.808571 350.774286 350.755714 351.315000 351.298750 351.288750 350.093750
6 351.417143 351.041667 351.015000 351.018333 352.176000 352.116000 352.513333
7 353.337500 353.386000 353.360000 353.322000 353.305000 353.392500 353.422500
8 353.027500 353.045000 353.047500 352.940000 352.990000 352.935000 352.966000
9 352.508000 352.502000 352.545000 352.380000 352.380000 352.427500 352.430000
10 352.237500 352.220000 352.216000 352.204000 352.180000 352.212500 352.200000
11 351.332500 351.362500 351.405000 351.325000 351.277500 351.374000 351.242000
12 351.910000 351.980000 350.844000 350.837500 352.140000 352.158000 352.160000
Wyoming 1 337.210000 337.210000 339.286667 337.397000 337.210000 335.551111 337.210000
2 340.128571 343.103333 343.103333 342.720000 342.820000 343.090000 340.117143
3 354.232000 354.242500 354.242500 354.242500 354.242500 354.430000 354.640000
4 353.935000 353.968000 353.968000 354.972500 354.972500 353.935000 353.935000
5 335.088571 335.198571 335.071429 336.931250 336.931250 336.931250 332.307500
6 340.477143 338.593333 338.593333 338.740000 343.744000 343.850000 344.995000
7 350.922500 351.094000 351.046000 350.786000 351.060000 351.060000 351.060000
8 344.870000 344.870000 344.870000 344.870000 341.978000 340.050000 341.978000
9 328.902000 328.902000 328.992500 328.992500 328.992500 328.525000 328.525000
10 329.297500 329.297500 329.140000 328.616000 328.616000 328.510000 329.297500
11 326.605000 326.605000 326.757500 326.757500 326.757500 327.214000 327.214000
12 329.785000 329.611667 323.890000 324.295000 330.482000 330.368000 330.368000

612 rows × 7 columns

2.5 Derive the Data

Let's load the demographic dataset and create a new column for the share of others in the population


In [81]:
df_demo = pd.read_csv("data/Demographics_State.csv")

In [82]:
df_demo.head()


Out[82]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 4799277 67 26 1 4 23680 501 38.1
1 alaska 720316 63 3 5 6 32651 978 33.6
2 arizona 6479703 57 4 3 30 25358 747 36.3
3 arkansas 2933369 74 15 1 7 22170 480 37.5
4 california 37659181 40 6 13 38 29527 1119 35.4

In [83]:
df_demo["percent_other"] = 100 - df_demo["percent_white"] - df_demo["percent_black"] - df_demo["percent_asian"] - df_demo["percent_hispanic"]

In [84]:
df_demo.head()


Out[84]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age percent_other
0 alabama 4799277 67 26 1 4 23680 501 38.1 2
1 alaska 720316 63 3 5 6 32651 978 33.6 23
2 arizona 6479703 57 4 3 30 25358 747 36.3 6
3 arkansas 2933369 74 15 1 7 22170 480 37.5 3
4 california 37659181 40 6 13 38 29527 1119 35.4 3

Exercise Express median rent as a proportion of California's median rent (compute it as a new column).


In [ ]:

2.6 Merge the Data

Let's merge the demographic dataset with the price dataset


In [23]:
# Let us change the column name region to State
df_demo = df_demo.rename(columns={'region':'State'})

In [24]:
df_demo.head()


Out[24]:
State total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age percent_other
0 alabama 4799277 67 26 1 4 23680 501 38.1 2
1 alaska 720316 63 3 5 6 32651 978 33.6 23
2 arizona 6479703 57 4 3 30 25358 747 36.3 6
3 arkansas 2933369 74 15 1 7 22170 480 37.5 3
4 california 37659181 40 6 13 38 29527 1119 35.4 3

In [26]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')
df_merge.head()


Out[26]:
State HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age percent_other

What happened? Why is there no data in the dataframe?
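An inner merge keeps only rows whose keys match exactly, and "Alabama" and "alabama" are different strings, so nothing joins. A minimal sketch of the failure and the fix (toy data):

```python
import pandas as pd

left = pd.DataFrame({"State": ["Alabama"], "price": [339.6]})
right = pd.DataFrame({"State": ["alabama"], "rent": [501]})

# Case mismatch: the inner join finds no common keys
empty = pd.merge(left, right, how="inner", on="State")
print(empty.shape[0])   # 0

# Normalise the key, then the join succeeds
left["State"] = left["State"].str.lower()
merged = pd.merge(left, right, how="inner", on="State")
print(merged.shape[0])  # 1
```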


In [27]:
# Change the State in df_mean to lower case
df_mean['State'] = df_mean['State'].str.lower()

In [28]:
df_mean.head()


Out[28]:
State HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday
0 alabama 339.561849 1379.414254 204.606169 1270.351893 145.978508 161.149220 2014.167038 5.953229 23.812918 2.995546
1 alaska 291.482004 321.244989 262.046392 407.917595 394.653964 32.334076 2014.167038 5.953229 23.812918 2.995546
2 arizona 300.667483 2392.465479 209.365345 2137.414254 188.500134 279.006682 2014.167038 5.953229 23.812918 2.995546
3 arkansas 348.056147 751.988864 190.414655 724.683742 126.771269 135.902004 2014.167038 5.953229 23.812918 2.995546
4 california 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038 5.953229 23.812918 2.995546

In [29]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')

In [30]:
df_merge.head()


Out[30]:
State HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age percent_other
0 alabama 339.561849 1379.414254 204.606169 1270.351893 145.978508 161.149220 2014.167038 5.953229 23.812918 2.995546 4799277 67 26 1 4 23680 501 38.1 2
1 alaska 291.482004 321.244989 262.046392 407.917595 394.653964 32.334076 2014.167038 5.953229 23.812918 2.995546 720316 63 3 5 6 32651 978 33.6 23
2 arizona 300.667483 2392.465479 209.365345 2137.414254 188.500134 279.006682 2014.167038 5.953229 23.812918 2.995546 6479703 57 4 3 30 25358 747 36.3 6
3 arkansas 348.056147 751.988864 190.414655 724.683742 126.771269 135.902004 2014.167038 5.953229 23.812918 2.995546 2933369 74 15 1 7 22170 480 37.5 3
4 california 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038 5.953229 23.812918 2.995546 37659181 40 6 13 38 29527 1119 35.4 3

2.7 Filter the Data

Let's start by filtering the data

  • by location
  • by year
  • by location and year

In [31]:
# Filter data for location California
df_cal = df[df["State"] == "California"]

In [32]:
df_cal.shape


Out[32]:
(449, 12)

In [33]:
# Filter data for year
df_2014 = df[df["year"] == 2014]

In [34]:
df_2014.shape


Out[34]:
(18564, 12)

In [35]:
df_cal_2014 = df[(df["year"] == 2014) & (df["State"] == "California")]

In [36]:
df_cal_2014.shape


Out[36]:
(364, 12)

In [37]:
df_cal_2014.head()


Out[37]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
4 California 248.78 12096 193.56 12812 192.92 778 2014-01-01 2014 1 1 2
769 California 248.67 12125 193.56 12836 192.80 779 2014-01-02 2014 1 1 3
1483 California 248.67 12141 193.57 12853 192.67 782 2014-01-03 2014 1 1 4
2248 California 248.65 12155 193.59 12884 192.67 782 2014-01-04 2014 1 1 5
3013 California 248.68 12176 193.63 12902 192.67 782 2014-01-05 2014 1 1 6

Exercise Find the percentage of Hispanic population for the state with the largest white population (use df_demo).


In [ ]:

2.8 Summarise the Data

We can use the describe function to get summary statistics for each column in the data frame


In [38]:
df.describe()


Out[38]:
HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday
count 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000 22899.000000
mean 329.759854 2274.743657 247.618306 2183.737805 203.624092 202.804489 2014.167038 5.953229 23.812918 2.995546
std 41.173167 2641.936586 44.276015 2789.902626 101.484265 220.531987 0.401765 3.553055 15.426018 2.005599
min 202.020000 93.000000 144.850000 134.000000 63.700000 11.000000 2013.000000 1.000000 1.000000 0.000000
25% 303.780000 597.000000 215.775000 548.000000 145.810000 51.000000 2014.000000 3.000000 9.000000 1.000000
50% 342.310000 1420.000000 245.800000 1320.000000 185.780000 139.000000 2014.000000 6.000000 22.000000 3.000000
75% 356.550000 2958.000000 274.155000 2673.000000 222.940000 263.000000 2014.000000 9.000000 37.000000 5.000000
max 415.700000 18492.000000 379.000000 22027.000000 734.650000 1287.000000 2015.000000 12.000000 52.000000 6.000000

We can also use convenience functions like sum(), count(), and mean() to calculate these


In [39]:
df.HighQ.mean()


Out[39]:
329.75985414210226

In [40]:
# Let's compute the mean the hard way: sum divided by count
df.HighQ.sum()


Out[40]:
7551170.8999999994

In [41]:
df.HighQ.count()


Out[41]:
22899

In [42]:
df.HighQ.sum()/df.HighQ.count()


Out[42]:
329.75985414210226

In [43]:
df.HighQ.median()


Out[43]:
342.31

2.9 Sample the Data


In [46]:
?df.sample

In [51]:
df_ca_sample = df[df.State=='California'].sample(n = 50, replace = True, random_state=123)

In [57]:
df_ca_sample.duplicated()


Out[57]:
20914    False
10051    False
11428    False
2503     False
11275    False
9235     False
14590    False
8725     False
21118    False
15967    False
21322    False
7399     False
1024     False
13927    False
1075     False
8572     False
7042     False
20200    False
17599    False
6583     False
12499    False
6991     False
1432     False
20047    False
5104     False
10357    False
1024      True
16987    False
3217     False
21475    False
2911     False
21577    False
2197     False
2350     False
11428     True
15304    False
8572      True
5716     False
15508    False
7960     False
17956    False
3268     False
8521     False
8419     False
18364    False
15814    False
12244    False
20608    False
17497    False
9949     False
dtype: bool

In [58]:
df_ca_sample.loc[8572]


Out[58]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
8572 California 248.26 12753 192.89 13618 191 818 2014-02-12 2014 2 7 2
8572 California 248.26 12753 192.89 13618 191 818 2014-02-12 2014 2 7 2

2.10 Quirks in Pandas


In [69]:
df_ca_sample.iat[0, 0] = "Cal"

In [71]:
df_ca_sample.head()


Out[71]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 Cal 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4

But beware when you “copy” by plain assignment: both names refer to the same dataframe, so changes made through one show up in the other


In [74]:
df_ca_sample2 = df_ca_sample

In [77]:
df_ca_sample2.iat[0, 0] = "CA"
df_ca_sample2.head()


Out[77]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 CA 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4

In [78]:
df_ca_sample.head()


Out[78]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 CA 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4

Fix the issue with an explicit copy


In [79]:
df_ca_sample3 = df_ca_sample2.copy()

In [80]:
df_ca_sample3.head()


Out[80]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 CA 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4

In [81]:
df_ca_sample3.iat[0, 0] = "CALIFORNIA"
df_ca_sample3.head()


Out[81]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 CALIFORNIA 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4

In [82]:
df_ca_sample2.head()


Out[82]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20914 CA 244.05 16474 189.42 19085 188.60 1093 2014-12-28 2014 12 52 6
10051 California 243.60 16664 189.19 19416 188.60 1109 2015-01-14 2015 1 3 2
11428 California 244.75 15893 190.43 18113 188.60 1042 2014-11-15 2014 11 46 5
2503 California 246.53 13435 192.83 14590 191.62 866 2014-04-04 2014 4 14 4
11275 California 245.27 15108 191.60 16881 187.94 989 2014-08-15 2014 8 33 4
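In short: plain assignment binds a second name to the same dataframe, while .copy() creates an independent one. A minimal sketch:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = a          # same object, not a copy
c = a.copy()   # independent copy

b.iat[0, 0] = 99
print(a.iat[0, 0])   # 99 -- the change through b is visible via a
print(c.iat[0, 0])   # 1  -- the copy is unaffected
```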

In [ ]: