2. Refine the Data

"Data is messy"

  • Missing e.g. check for missing or incomplete data
  • Quality e.g. check for duplicates, accuracy, unusual data
  • Parse e.g. extract year and month from date
  • Convert e.g. free text to coded value
  • Derive e.g. gender from title
  • Calculate e.g. percentages, proportions
  • Remove e.g. remove redundant data
  • Merge e.g. first name and surname into full name
  • Aggregate e.g. roll up by year, cluster by area
  • Filter e.g. exclude based on location
  • Sample e.g. extract a representative sample
  • Summary e.g. show summary stats like mean

In [1]:
# Load the libraries
import numpy as np
import pandas as pd

In [2]:
# Load the data again!
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])

2.1 Missing Data

By “missing” data we simply mean null, or “not present for whatever reason”. Let's see if we can find the missing data in our data set, whether because it exists but was not collected or because it never existed.


In [3]:
# Let's start with a count to see how much data is missing
df.count()


Out[3]:
State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      12342
LowQN     22899
date      22899
dtype: int64

In [4]:
df['LowQ'].isnull()


Out[4]:
0        False
1        False
2        False
3        False
4        False
5        False
6        False
7        False
8        False
9        False
10       False
11       False
12       False
13       False
14       False
15       False
16       False
17       False
18       False
19       False
20       False
21       False
22       False
23       False
24       False
25       False
26       False
27       False
28       False
29       False
         ...  
22869     True
22870     True
22871     True
22872     True
22873     True
22874     True
22875     True
22876     True
22877     True
22878     True
22879     True
22880     True
22881     True
22882     True
22883     True
22884     True
22885     True
22886     True
22887     True
22888     True
22889     True
22890     True
22891     True
22892     True
22893     True
22894     True
22895     True
22896     True
22897     True
22898     True
Name: LowQ, dtype: bool

In [5]:
# We can see the bottom rows which have NaN values
df.tail()


Out[5]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
22894 Virginia 364.98 3513 293.12 3079 NaN 284 2014-12-31
22895 Washington 233.05 3337 189.92 3562 NaN 160 2014-12-31
22896 West Virginia 359.35 551 224.03 545 NaN 60 2014-12-31
22897 Wisconsin 350.52 2244 272.71 2221 NaN 167 2014-12-31
22898 Wyoming 322.27 131 351.86 197 NaN 12 2014-12-31

Pandas represents missing values as NaN.

What can we do with missing values?

  • Drop these rows / columns? Use .dropna(how='any')
  • Fill with a dummy value? Use .fillna(value=dummy)
  • Impute the cell with the most recent value? Use .fillna(method='ffill')
  • Interpolate the amount in a linear fashion? Use .interpolate()

We use the inplace=True argument to modify the dataframe itself, rather than making and returning a changed copy.
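Each of these options can be sketched on a toy Series (synthetic data, not the price dataset; `.ffill()` is the modern spelling of `fillna(method='ffill')`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()        # remove rows with NaN
dummy = s.fillna(value=0)   # replace NaN with a dummy value
ffilled = s.ffill()         # carry the most recent value forward
interp = s.interpolate()    # linear interpolation between known values
```

Note that dropping loses rows, while the fill strategies keep the index intact; which to use depends on why the data is missing.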


In [6]:
# Let's sort this data frame by State and date
df.sort_values(by=['State','date'], inplace=True)

In [7]:
# Let's fill the missing values with the last available value
df.fillna(method="ffill", inplace=True)

In [8]:
df.count()


Out[8]:
State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      22899
LowQN     22899
date      22899
dtype: int64

Exercise

Fill the missing values with a backward fill.


In [ ]:

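A possible approach, sketched on a toy Series rather than the price data (`.bfill()` is equivalent to `fillna(method='bfill')`):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])
bfilled = s.bfill()   # each NaN takes the next available value
```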
Fill the missing values with the mean for the column.


In [ ]:
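One way to do a mean fill, again on a toy Series: pass the column mean (computed over the non-missing values) to `fillna`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
mean_filled = s.fillna(s.mean())   # mean of the non-missing values is 2.0
```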

2.2 Quality of the Data

Let's check for completeness.

Say, do we have data on each date for all 51 states?


In [9]:
df["State"].value_counts()


Out[9]:
Utah                    449
Washington              449
Massachusetts           449
Missouri                449
Connecticut             449
Vermont                 449
Pennsylvania            449
Georgia                 449
California              449
North Carolina          449
Florida                 449
Oregon                  449
Alabama                 449
Indiana                 449
Kentucky                449
Hawaii                  449
South Carolina          449
Virginia                449
Oklahoma                449
Texas                   449
Illinois                449
Nebraska                449
Ohio                    449
Arizona                 449
Rhode Island            449
Louisiana               449
Wisconsin               449
New York                449
Arkansas                449
Montana                 449
New Hampshire           449
North Dakota            449
New Mexico              449
Delaware                449
Iowa                    449
New Jersey              449
Kansas                  449
Idaho                   449
Colorado                449
Maryland                449
District of Columbia    449
Michigan                449
South Dakota            449
West Virginia           449
Minnesota               449
Wyoming                 449
Mississippi             449
Tennessee               449
Nevada                  449
Maine                   449
Alaska                  449
dtype: int64

Let's check the dates and see whether they are continuous.


In [10]:
df1 = df[df.State=='California'].copy()
df2 = df[df.State=='California'].copy()

In [11]:
df1.shape


Out[11]:
(449, 8)

In [12]:
print "Earliest Date:",df1.date.min(), "\n", "Latest Date:",df1.date.max(), "\n",  \
"Number of days in between them:", df1.date.max() - df1.date.min()


Earliest Date: 2013-12-27 00:00:00 
Latest Date: 2015-06-11 00:00:00 
Number of days in between them: 531 days 00:00:00

In [13]:
df1.groupby(['date']).size()


Out[13]:
date
2013-12-27    1
2013-12-28    1
2013-12-29    1
2013-12-30    1
2013-12-31    1
2014-01-01    1
2014-01-02    1
2014-01-03    1
2014-01-04    1
2014-01-05    1
2014-01-06    1
2014-01-07    1
2014-01-08    1
2014-01-09    1
2014-01-10    1
2014-01-11    1
2014-01-12    1
2014-01-13    1
2014-01-14    1
2014-01-15    1
2014-01-16    1
2014-01-17    1
2014-01-18    1
2014-01-19    1
2014-01-20    1
2014-01-21    1
2014-01-22    1
2014-01-23    1
2014-01-24    1
2014-01-25    1
             ..
2015-05-13    1
2015-05-14    1
2015-05-15    1
2015-05-16    1
2015-05-17    1
2015-05-18    1
2015-05-19    1
2015-05-20    1
2015-05-21    1
2015-05-22    1
2015-05-23    1
2015-05-24    1
2015-05-25    1
2015-05-26    1
2015-05-27    1
2015-05-28    1
2015-05-29    1
2015-05-30    1
2015-05-31    1
2015-06-01    1
2015-06-02    1
2015-06-03    1
2015-06-04    1
2015-06-05    1
2015-06-06    1
2015-06-07    1
2015-06-08    1
2015-06-09    1
2015-06-10    1
2015-06-11    1
dtype: int64

In [14]:
df1.set_index("date", inplace=True)
df2.set_index("date", inplace=True)

In [15]:
idx = pd.date_range(df1.index.min(), df1.index.max())

In [16]:
idx


Out[16]:
DatetimeIndex(['2013-12-27', '2013-12-28', '2013-12-29', '2013-12-30',
               '2013-12-31', '2014-01-01', '2014-01-02', '2014-01-03',
               '2014-01-04', '2014-01-05', 
               ...
               '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-05',
               '2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11'],
              dtype='datetime64[ns]', length=532, freq='D', tz=None)

In [17]:
df1 = df1.reindex(idx, fill_value=0)

In [18]:
df1.shape


Out[18]:
(532, 7)

Exercise Show the list of dates that were missing. Hint: leverage df2; both df1 and df2 have date as their index.


In [ ]:
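One way to find the gaps is `DatetimeIndex.difference`: build the full daily range and subtract the dates actually observed. A minimal sketch on a synthetic index (standing in for df2's date index) with one day missing:

```python
import pandas as pd

# Synthetic stand-in for df2's date index, with 2014-01-03 missing
observed = pd.DatetimeIndex(["2014-01-01", "2014-01-02",
                             "2014-01-04", "2014-01-05"])
full = pd.date_range(observed.min(), observed.max())

missing = full.difference(observed)
```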

2.3 Parse the Data

Let's see if we can get the year, month, week and weekday from the date. Pandas has good built-in functionality for time-series data via the DatetimeIndex method.


In [19]:
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
df['week'] = pd.DatetimeIndex(df['date']).week
df['weekday'] = pd.DatetimeIndex(df['date']).weekday

In [20]:
df.dtypes


Out[20]:
State              object
HighQ             float64
HighQN              int64
MedQ              float64
MedQN               int64
LowQ              float64
LowQN               int64
date       datetime64[ns]
year                int32
month               int32
week                int32
weekday             int32
dtype: object

In [21]:
df.tail()


Out[21]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
4997 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-07 2015 6 23 6
5762 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-08 2015 6 24 0
6527 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-09 2015 6 24 1
7343 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-10 2015 6 24 2
8159 Wyoming 313.72 148 317.38 226 161.3 13 2015-06-11 2015 6 24 3

In [22]:
df['year'].value_counts()


Out[22]:
2014    18564
2015     4080
2013      255
dtype: int64

In [23]:
df['month'].value_counts()


Out[23]:
1     3162
5     2703
2     2244
6     2091
12    1836
10    1581
7     1581
3     1581
11    1530
9     1530
8     1530
4     1530
dtype: int64

In [24]:
df["weekday"].value_counts()


Out[24]:
6    3315
0    3315
4    3264
3    3264
2    3264
1    3264
5    3213
dtype: int64

2.4 Aggregate the Data

To aggregate, we typically use the “group by” operation, which involves the following steps:

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

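The three steps can be seen on a toy frame (synthetic values, not the price data): groupby splits the rows by State, mean is applied to each group, and the results are combined back into one frame.

```python
import pandas as pd

toy = pd.DataFrame({"State": ["CA", "CA", "NY"],
                    "HighQ": [240.0, 250.0, 345.0]})

# split by State, apply mean to each group, combine into one frame
means = toy.groupby("State", as_index=False).mean()
```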
In [25]:
df_mean = df.groupby("State", as_index=False).mean()

In [26]:
df_mean.shape


Out[26]:
(51, 11)

In [27]:
df_mean.head()


Out[27]:
State HighQ HighQN MedQ MedQN LowQ LowQN year month week weekday
0 Alabama 339.561849 1379.414254 204.606169 1270.351893 145.978508 161.149220 2014.167038 5.953229 23.812918 2.995546
1 Alaska 291.482004 321.244989 262.046392 407.917595 394.653964 32.334076 2014.167038 5.953229 23.812918 2.995546
2 Arizona 300.667483 2392.465479 209.365345 2137.414254 188.500134 279.006682 2014.167038 5.953229 23.812918 2.995546
3 Arkansas 348.056147 751.988864 190.414655 724.683742 126.771269 135.902004 2014.167038 5.953229 23.812918 2.995546
4 California 245.376125 14947.073497 191.268909 16769.821826 189.783586 976.298441 2014.167038 5.953229 23.812918 2.995546

Pivot Table


In [28]:
df.head()


Out[28]:
State HighQ HighQN MedQ MedQN LowQ LowQN date year month week weekday
20094 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-27 2013 12 52 4
20859 Alabama 339.65 1033 198.04 926 147.15 122 2013-12-28 2013 12 52 5
21573 Alabama 339.75 1036 198.26 929 149.49 123 2013-12-29 2013 12 52 6
22287 Alabama 339.75 1036 198.81 930 149.49 123 2013-12-30 2013 12 1 0
22797 Alabama 339.42 1040 198.68 932 149.49 123 2013-12-31 2013 12 1 1

In [29]:
pd.pivot_table(df, values='HighQ', index=['State'], columns=["weekday"] )


Out[29]:
weekday 0 1 2 3 4 5 6
State
Alabama 339.556308 339.577656 339.559375 339.552500 339.577969 339.568254 339.541385
Alaska 291.463231 291.384687 291.390000 291.597500 291.506406 291.596032 291.438923
Arizona 300.612000 300.653906 300.643750 300.655000 300.751562 300.715873 300.642308
Arkansas 347.930462 348.011250 348.017656 347.934844 348.278281 348.235238 347.991077
California 245.348923 245.364375 245.359219 245.342813 245.434219 245.425556 245.359231
Colorado 238.948462 238.885938 238.889844 238.977187 238.885937 238.897619 238.944769
Connecticut 341.624615 341.682812 341.633750 341.616406 341.832656 341.819683 341.652308
Delaware 366.712769 366.748750 366.705937 366.659375 366.962344 366.937143 366.750615
District of Columbia 348.205385 348.055000 348.132656 348.145625 348.213594 348.277778 348.212462
Florida 302.548923 302.577031 302.529688 302.510781 302.638437 302.648095 302.541231
Georgia 326.588308 326.635937 326.594219 326.579687 326.805000 326.784762 326.633077
Hawaii 309.940615 309.895312 309.855156 309.818750 309.934062 310.033016 309.956462
Idaho 275.642615 275.627656 275.609844 275.653437 275.671250 275.654444 275.616923
Illinois 355.406769 355.460313 355.442031 355.427188 355.580625 355.502540 355.426000
Indiana 333.147077 333.184531 333.187344 333.145781 333.257031 333.253333 333.164769
Iowa 367.059692 367.086094 367.071563 367.055625 367.175781 367.173175 367.051077
Kansas 350.760615 350.785469 350.765938 350.778438 350.928906 350.903492 350.813846
Kentucky 338.906615 338.857812 338.846719 338.874062 338.796719 338.799048 338.902769
Louisiana 366.743385 366.816562 366.761094 366.733437 367.016406 366.960159 366.800615
Maine 315.272615 315.392812 315.384844 315.330156 315.570469 315.523333 315.262308
Maryland 370.884308 370.970469 370.912187 370.899219 371.160937 371.162063 370.912462
Massachusetts 348.942462 348.995625 348.968750 348.945625 349.156562 349.100952 348.974615
Michigan 292.022462 292.080312 292.044531 292.004844 292.229844 292.199841 292.055077
Minnesota 358.104154 358.228437 358.188594 358.052969 358.507969 358.471587 358.156308
Mississippi 337.535385 337.479062 337.495469 337.532656 337.615937 337.660635 337.597231
Missouri 359.098923 359.278594 359.273594 359.264844 359.418438 359.434921 359.174000
Montana 267.832615 267.860625 267.713125 267.700937 267.930781 267.934444 267.850923
Nebraska 341.441692 341.522187 341.490000 341.479375 341.768438 341.713968 341.504923
Nevada 267.794769 267.827969 267.823594 267.811875 267.866875 267.824286 267.792308
New Hampshire 359.469846 359.543906 359.519844 359.362656 359.595469 359.512222 359.331231
New Jersey 349.260769 349.313594 349.299219 349.272969 349.435469 349.396667 349.276000
New Mexico 289.536769 289.511719 289.499531 289.470156 289.670937 289.685714 289.565231
New York 346.877692 346.907031 346.866406 346.831875 347.027344 346.995079 346.886154
North Carolina 349.413846 349.476719 349.435625 349.406563 349.585156 349.565873 349.444308
North Dakota 398.349692 398.576562 398.677656 398.550000 399.090469 398.785079 398.658769
Ohio 334.526615 334.546562 334.536406 334.532969 334.615938 334.592063 334.533385
Oklahoma 361.417077 361.645625 361.615313 361.597656 361.753437 361.598730 361.389846
Oregon 208.461385 208.474844 208.454063 208.422344 208.604688 208.579048 208.479385
Pennsylvania 366.067231 366.102969 366.094375 366.060469 366.253125 366.222857 366.081846
Rhode Island 323.733077 323.828281 323.801094 323.756094 323.998594 323.951746 323.803538
South Carolina 341.612154 341.686094 341.656250 341.641094 341.837500 341.803651 341.617846
South Dakota 375.626615 375.828125 375.830937 375.837500 375.946875 376.010159 375.657692
Tennessee 352.678000 352.730469 352.711094 352.707344 352.871250 352.868730 352.707846
Texas 337.034154 337.069687 337.037969 337.023750 337.267656 337.235714 337.071231
Utah 289.552308 289.597500 289.586094 289.567344 289.634687 289.671905 289.564923
Vermont 374.172462 374.250312 374.204844 374.181250 374.395156 374.335556 374.216615
Virginia 368.084308 368.120781 368.112813 368.087969 368.264375 368.245238 368.116462
Washington 233.381231 233.377969 233.361875 233.363594 233.422500 233.426032 233.382308
West Virginia 357.408000 357.428906 357.368125 357.330156 357.395938 357.424444 357.406462
Wisconsin 352.981077 353.050313 353.040469 353.000938 353.160000 353.141905 352.996308
Wyoming 338.837538 339.090625 339.045781 338.932969 339.260469 339.072698 338.910769

Exercise Get a cross tabulation: for each state and each month, get the prices for each weekday as shown in the output.


In [ ]:
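One approach is to pass more than one column to `index` in `pd.pivot_table`, giving a row per (State, month) pair. A minimal sketch on synthetic prices (the real call would use df and values='HighQ' as above):

```python
import pandas as pd

toy = pd.DataFrame({"State":   ["California"] * 4,
                    "month":   [1, 1, 2, 2],
                    "weekday": [0, 1, 0, 1],
                    "HighQ":   [240.0, 242.0, 244.0, 246.0]})

# rows: (State, month) pairs; columns: weekday; cells: mean HighQ
pt = pd.pivot_table(toy, values="HighQ",
                    index=["State", "month"], columns=["weekday"])
```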

2.5 Derive the Data

Let's load the demographic dataset and create a new column for the remaining ("other") share of the population.


In [30]:
df_demo = pd.read_csv("data/Demographics_State.csv")

In [31]:
df_demo.head()


Out[31]:
region total_population percent_white percent_black percent_asian percent_hispanic per_capita_income median_rent median_age
0 alabama 4799277 67 26 1 4 23680 501 38.1
1 alaska 720316 63 3 5 6 32651 978 33.6
2 arizona 6479703 57 4 3 30 25358 747 36.3
3 arkansas 2933369 74 15 1 7 22170 480 37.5
4 california 37659181 40 6 13 38 29527 1119 35.4

In [32]:
df_demo["percent_other"] = 100 - df_demo["percent_white"] - df_demo["percent_black"] - df_demo["percent_asian"] - df_demo["percent_hispanic"]

In [ ]:
df_demo.head()

Exercise Express median rent as a proportion of California's median rent (compute it as a new column).


In [ ]:
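A possible approach, sketched on a two-row synthetic frame (only the columns the exercise needs): look up California's median rent, then divide the whole column by it.

```python
import pandas as pd

demo = pd.DataFrame({"region": ["alabama", "california"],
                     "median_rent": [501, 1119]})

# scalar lookup of California's rent, then a vectorised division
ca_rent = demo.loc[demo["region"] == "california", "median_rent"].iloc[0]
demo["rent_vs_ca"] = demo["median_rent"] / ca_rent
```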

2.6 Merge the Data

Let's merge the demographic dataset with the price dataset.


In [ ]:
# Let us change the column name region to State
df_demo = df_demo.rename(columns={'region':'State'})

In [ ]:
df_demo.head()

In [ ]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')
df_merge.head()

What happened? Why is there no data in the dataframe? The State values are capitalized in df_mean but lower case in df_demo, so the inner join finds no matching keys.


In [ ]:
# Change the State in df_mean to lower case
df_mean['State'] = df_mean['State'].str.lower()

In [ ]:
df_mean.head()

In [ ]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')

In [ ]:
df_merge.head()

2.7 Filter the Data

Let's start by filtering the data

  • by location
  • by year
  • by location & year

In [ ]:
# Filter data for location California
df_cal = df[df["State"] == "California"]

In [ ]:
df_cal.shape

In [ ]:
# Filter data for year
df_2014 = df[df["year"] == 2014]

In [ ]:
df_2014.shape

In [ ]:
df_cal_2014 = df[(df["year"] == 2014) & (df["State"] == "California")]

In [ ]:
df_cal_2014.shape

In [ ]:
df_cal_2014.head()

Exercise Find the % of Hispanic population for the state with the largest white population (use df_demo).


In [ ]:
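One way is `idxmax`, which returns the row label of the largest value in a column; that label can then be fed to `.loc`. A minimal sketch on a synthetic three-state frame:

```python
import pandas as pd

demo = pd.DataFrame({"region": ["alabama", "california", "vermont"],
                     "percent_white": [67, 40, 94],
                     "percent_hispanic": [4, 38, 2]})

# idxmax gives the row label of the largest white share
whitest = demo.loc[demo["percent_white"].idxmax()]
answer = whitest["percent_hispanic"]
```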

2.8 Summarise the Data

We can use the describe function to get the summary stats for each column in the data frame


In [ ]:
df.describe()

We can also use convenience functions like sum(), count(), and mean() to calculate these.


In [ ]:
df.HighQ.mean()

In [ ]:
# Lets do this the hard way
df.HighQ.sum()

In [ ]:
df.HighQ.count()

In [ ]:
df.HighQ.sum()/df.HighQ.count()

In [ ]:
df.HighQ.median()

2.9 Sample the Data


In [ ]:
?df.sample

In [ ]:
df_ca_sample = df[df.State=='California'].sample(n = 50, replace = True, random_state=123)

In [ ]:
df_ca_sample.duplicated()

In [ ]:
df_ca_sample.loc[8572]

2.10 Quirks in Pandas


In [ ]:
df_ca_sample.iat[0, 0] = "Cal"

In [ ]:
df_ca_sample.head()

But beware: plain assignment does not copy the data. Both names refer to the same dataframe, so a change made through one is visible through the other.
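The aliasing behaviour is easiest to see on a tiny standalone frame: assignment shares the object, while `.copy()` produces an independent one.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]})
b = a          # plain assignment: b is the SAME object as a
c = a.copy()   # explicit copy: c is independent

b.iat[0, 0] = 99   # mutating b also changes a, but not c
```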


In [ ]:
df_ca_sample2 = df_ca_sample

In [ ]:
df_ca_sample2.iat[0, 0] = "CA"
df_ca_sample2.head()

In [ ]:
df_ca_sample.head()

Fix the issue by taking an explicit copy


In [ ]:
df_ca_sample3 = df_ca_sample2.copy()

In [ ]:
df_ca_sample3.head()

In [ ]:
df_ca_sample3.iat[0, 0] = "CALIFORNIA"
df_ca_sample3.head()

In [ ]:
df_ca_sample2.head()

In [ ]: