1. Acquire the Data

"Data is the new oil"

Ways to acquire data (typical data source)

  • Download from an internal system
  • Obtained from client, or other 3rd party
  • Extracted from a web-based API
  • Scraped from a website
  • Extracted from a PDF file
  • Gathered manually and recorded

Data Formats

  • Flat files (e.g. csv)
  • Excel files
  • Database (e.g. MySQL)
  • JSON
  • HDFS (Hadoop)

Two Datasets

  • Price of Weed in US
  • Demographic data by US State

1.1 - Crowdsource the Price of Weed dataset

The Price of Weed website - http://www.priceofweed.com/

Crowdsources the price paid by people on the street to get weed. Self Reported.

  • Location is auto detected or can be choosen
  • Quality is classified in three categories
    • High
    • Medium
    • Low
  • Price by weight
    • an ounce
    • a half ounce
    • a quarter
    • an eighth
    • 10 grams
    • 5 grams
    • 1 gram
  • Strain (though not showed in the dataset)

Reported at individual transaction level

Here is a sample data set from United States - http://www.priceofweed.com/prices/United-States.html

See note - Averages are corrected for outliers based on standard deviation from the mean.

1.2 Scrape the data

Frank Bi from The Verge wrote a script to scrape the data daily. The daily prices are available on github at https://github.com/frankbi/price-of-weed

Here is sample data from one day - 23rd July 2015 - https://github.com/frankbi/price-of-weed/blob/master/data/weedprices23072015.csv

1.3 Combine the data

All the csv files for each day were combined into one large csv. Done by YHAT.

http://blog.yhathq.com/posts/7-funny-datasets.html

1.4 Key Questions / Assumptions

Data is an abstraction of the reality.

  • What assumptions have been in this entire data collections process?
  • Are we aware of the assumptions in this process?
  • How to ensure that the data is accurate or representative for the question we are trying to answer?

1.5 Loading the Data


In [1]:
# Load the libraries
import pandas as pd
import numpy as np

In [2]:
# Load the dataset
df = pd.read_csv("data/Weed_Price.csv")

In [3]:
# Shape of the dateset - rows & columns
df.shape


Out[3]:
(22899, 8)

In [4]:
# Check for type of each variable
df.dtypes


Out[4]:
State      object
HighQ     float64
HighQN      int64
MedQ      float64
MedQN       int64
LowQ      float64
LowQN       int64
date       object
dtype: object

In [5]:
# Lets load this again with date as date type
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])

In [6]:
# Now check for type for each row
df.dtypes


Out[6]:
State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object

In [7]:
# Get the names of all columns
df.columns


Out[7]:
Index([u'State', u'HighQ', u'HighQN', u'MedQ', u'MedQN', u'LowQ', u'LowQN',
       u'date'],
      dtype='object')

In [8]:
# Get the index of all rows
df.index


Out[8]:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9, 
            ...
            22889, 22890, 22891, 22892, 22893, 22894, 22895, 22896, 22897,
            22898],
           dtype='int64', length=22899)

1.6 Viewing the Data


In [9]:
# Can we see some sample rows - the top 5 rows
df.head()


Out[9]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
0 Alabama 339.06 1042 198.64 933 149.49 123 2014-01-01
1 Alaska 288.75 252 260.60 297 388.58 26 2014-01-01
2 Arizona 303.31 1941 209.35 1625 189.45 222 2014-01-01
3 Arkansas 361.85 576 185.62 544 125.87 112 2014-01-01
4 California 248.78 12096 193.56 12812 192.92 778 2014-01-01

In [10]:
# Can we see some sample rows - the bottom 5 rows
df.tail()


Out[10]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
22894 Virginia 364.98 3513 293.12 3079 NaN 284 2014-12-31
22895 Washington 233.05 3337 189.92 3562 NaN 160 2014-12-31
22896 West Virginia 359.35 551 224.03 545 NaN 60 2014-12-31
22897 Wisconsin 350.52 2244 272.71 2221 NaN 167 2014-12-31
22898 Wyoming 322.27 131 351.86 197 NaN 12 2014-12-31

In [11]:
# Get specific rows
df[20:25]


Out[11]:
State HighQ HighQN MedQ MedQN LowQ LowQN date
20 Montana 279.73 585 257.98 258 734.65 22 2014-01-01
21 Nebraska 352.64 508 247.03 510 142.99 62 2014-01-01
22 Nevada 269.61 899 218.67 928 231.08 61 2014-01-01
23 New Hampshire 366.47 422 289.81 441 561.05 38 2014-01-01
24 New Jersey 354.91 2244 294.39 2799 225.62 185 2014-01-01

In [12]:
# Can we access a specific columns
df["State"]


Out[12]:
0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871           New Hampshire
22872              New Jersey
22873              New Mexico
22874                New York
22875          North Carolina
22876            North Dakota
22877                    Ohio
22878                Oklahoma
22879                  Oregon
22880                Maryland
22881           Massachusetts
22882                Michigan
22883               Minnesota
22884             Mississippi
22885                Missouri
22886            Pennsylvania
22887            Rhode Island
22888          South Carolina
22889            South Dakota
22890               Tennessee
22891                   Texas
22892                    Utah
22893                 Vermont
22894                Virginia
22895              Washington
22896           West Virginia
22897               Wisconsin
22898                 Wyoming
Name: State, dtype: object

In [13]:
# Using the dot notation
df.State


Out[13]:
0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871           New Hampshire
22872              New Jersey
22873              New Mexico
22874                New York
22875          North Carolina
22876            North Dakota
22877                    Ohio
22878                Oklahoma
22879                  Oregon
22880                Maryland
22881           Massachusetts
22882                Michigan
22883               Minnesota
22884             Mississippi
22885                Missouri
22886            Pennsylvania
22887            Rhode Island
22888          South Carolina
22889            South Dakota
22890               Tennessee
22891                   Texas
22892                    Utah
22893                 Vermont
22894                Virginia
22895              Washington
22896           West Virginia
22897               Wisconsin
22898                 Wyoming
Name: State, dtype: object

In [14]:
# Selecting specific column and rows
df[0:5]["State"]


Out[14]:
0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object

In [15]:
# Works both ways
df["State"][0:5]


Out[15]:
0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object

In [16]:
#Getting unique values of State
pd.unique(df['State'])


Out[16]:
array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Montana',
       'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
       'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
       'Oregon', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Pennsylvania', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
       'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin',
       'Wyoming'], dtype=object)

1.7 Slicing columns using pandas


In [17]:
df.index


Out[17]:
Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9, 
            ...
            22889, 22890, 22891, 22892, 22893, 22894, 22895, 22896, 22897,
            22898],
           dtype='int64', length=22899)

In [18]:
df.loc[0]


Out[18]:
State                 Alabama
HighQ                  339.06
HighQN                   1042
MedQ                   198.64
MedQN                     933
LowQ                   149.49
LowQN                     123
date      2014-01-01 00:00:00
Name: 0, dtype: object

In [19]:
df.iloc[0,0]


Out[19]:
'Alabama'

In [20]:
df.ix[0,0]


Out[20]:
'Alabama'

Exercise

1) Load the Demographics_State.csv dataset


In [ ]:

2) Show the five first rows of the dataset


In [ ]:

3) Select the column with the State name in the data frame


In [ ]:

4) Get help


In [ ]:

5) Change index to date


In [ ]:

6) Get all the data for 2nd January 2014


In [ ]:

Thinking in Vectors

Difference between loops and vectors


In [21]:
#Find weighted average price with respective weights of 0.6, 0.4 for HighQ and MedQ

In [22]:
#Python approach. Loop over all rows. 
#For each row, multiply the respective columns by those weights. 
#Add the output to an array

In [23]:
#It is easy to convert pandas series to numpy array.
highq_np = np.array(df.HighQ)
medq_np = np.array(df.MedQ)

In [ ]:
#Standard pythonic code

def find_weighted_price():
    global weighted_price
    weighted_price = []
    
    for i in range(df.shape[0]):
        weighted_price.append(0.6*highq_np[i]*0.4*highq_np[i])

#print the weighted price
find_weighted_price()
print weighted_price

Exercise: Find the running time of the above program


In [ ]:


In [ ]:
#Vectorized Code
weighted_price_vec = 0.6*highq_np + 0.4*medq_np

Exercise: Time the above vectorized code. Do you see any improvements?


In [ ]: