1. Acquire the Data

"Data is the new oil"

Ways to acquire data (typical data source)

Download from an internal system
Obtained from client, or other 3rd party
Extracted from a web-based API
Scraped from a website
Extracted from a PDF file
Gathered manually and recorded

Data Formats

Flat files (e.g. csv)
Excel files
Database (e.g. MySQL)
JSON
HDFS (Hadoop)

Two Datasets

Price of Weed in US
Demographic data by US State

1.1 - Crowdsource the Price of Weed dataset

The Price of Weed website - http://www.priceofweed.com/

Crowdsources the price paid by people on the street to get weed. Self Reported.

Location is auto detected or can be choosen
Quality is classified in three categories
- High
- Medium
- Low
Price by weight
- an ounce
- a half ounce
- a quarter
- an eighth
- 10 grams
- 5 grams
- 1 gram
Strain (though not showed in the dataset)

Reported at individual transaction level

Here is a sample data set from United States - http://www.priceofweed.com/prices/United-States.html

See note - Averages are corrected for outliers based on standard deviation from the mean.

1.2 Scrape the data

Frank Bi from The Verge wrote a script to scrape the data daily. The daily prices are available on github at https://github.com/frankbi/price-of-weed

Here is sample data from one day - 23rd July 2015 - https://github.com/frankbi/price-of-weed/blob/master/data/weedprices23072015.csv

1.3 Combine the data

All the csv files for each day were combined into one large csv. Done by YHAT.

http://blog.yhathq.com/posts/7-funny-datasets.html

1.4 Key Questions / Assumptions

Data is an abstraction of the reality.

What assumptions have been in this entire data collections process?
Are we aware of the assumptions in this process?
How to ensure that the data is accurate or representative for the question we are trying to answer?

1.5 Loading the Data



In [1]:

    
# Load the libraries
import pandas as pd
import numpy as np



In [2]:

    
# Load the dataset
df = pd.read_csv("data/Weed_Price.csv")



In [3]:

    
# Shape of the dateset - rows & columns
df.shape









    Out[3]:





(22899, 8)



In [4]:

    
# Check for type of each variable
df.dtypes









    Out[4]:





State      object
HighQ     float64
HighQN      int64
MedQ      float64
MedQN       int64
LowQ      float64
LowQN       int64
date       object
dtype: object



In [5]:

    
# Lets load this again with date as date type
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])



In [6]:

    
# Now check for type for each row
df.dtypes









    Out[6]:





State             object
HighQ            float64
HighQN             int64
MedQ             float64
MedQN              int64
LowQ             float64
LowQN              int64
date      datetime64[ns]
dtype: object



In [7]:

    
# Get the names of all columns
df.columns









    Out[7]:





Index([u'State', u'HighQ', u'HighQN', u'MedQ', u'MedQN', u'LowQ', u'LowQN',
       u'date'],
      dtype='object')



In [8]:

    
# Get the index of all rows
df.index









    Out[8]:





Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9, 
            ...
            22889, 22890, 22891, 22892, 22893, 22894, 22895, 22896, 22897,
            22898],
           dtype='int64', length=22899)

1.6 Viewing the Data



In [9]:

    
# Can we see some sample rows - the top 5 rows
df.head()



In [10]:

    
# Can we see some sample rows - the bottom 5 rows
df.tail()









    Out[10]:






  
    
      
      State
      HighQ
      HighQN
      MedQ
      MedQN
      LowQ
      LowQN
      date
    
  
  
    
      22894
      Virginia
      364.98
      3513
      293.12
      3079
      NaN
      284
      2014-12-31
    
    
      22895
      Washington
      233.05
      3337
      189.92
      3562
      NaN
      160
      2014-12-31
    
    
      22896
      West Virginia
      359.35
      551
      224.03
      545
      NaN
      60
      2014-12-31
    
    
      22897
      Wisconsin
      350.52
      2244
      272.71
      2221
      NaN
      167
      2014-12-31
    
    
      22898
      Wyoming
      322.27
      131
      351.86
      197
      NaN
      12
      2014-12-31



In [11]:

    
# Get specific rows
df[20:25]









    Out[11]:






  
    
      
      State
      HighQ
      HighQN
      MedQ
      MedQN
      LowQ
      LowQN
      date
    
  
  
    
      20
      Montana
      279.73
      585
      257.98
      258
      734.65
      22
      2014-01-01
    
    
      21
      Nebraska
      352.64
      508
      247.03
      510
      142.99
      62
      2014-01-01
    
    
      22
      Nevada
      269.61
      899
      218.67
      928
      231.08
      61
      2014-01-01
    
    
      23
      New Hampshire
      366.47
      422
      289.81
      441
      561.05
      38
      2014-01-01
    
    
      24
      New Jersey
      354.91
      2244
      294.39
      2799
      225.62
      185
      2014-01-01



In [12]:

    
# Can we access a specific columns
df["State"]









    Out[12]:





0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871           New Hampshire
22872              New Jersey
22873              New Mexico
22874                New York
22875          North Carolina
22876            North Dakota
22877                    Ohio
22878                Oklahoma
22879                  Oregon
22880                Maryland
22881           Massachusetts
22882                Michigan
22883               Minnesota
22884             Mississippi
22885                Missouri
22886            Pennsylvania
22887            Rhode Island
22888          South Carolina
22889            South Dakota
22890               Tennessee
22891                   Texas
22892                    Utah
22893                 Vermont
22894                Virginia
22895              Washington
22896           West Virginia
22897               Wisconsin
22898                 Wyoming
Name: State, dtype: object



In [13]:

    
# Using the dot notation
df.State









    Out[13]:





0                     Alabama
1                      Alaska
2                     Arizona
3                    Arkansas
4                  California
5                    Colorado
6                 Connecticut
7                    Delaware
8        District of Columbia
9                     Florida
10                    Georgia
11                     Hawaii
12                      Idaho
13                   Illinois
14                    Indiana
15                       Iowa
16                     Kansas
17                   Kentucky
18                  Louisiana
19                      Maine
20                    Montana
21                   Nebraska
22                     Nevada
23              New Hampshire
24                 New Jersey
25                 New Mexico
26                   New York
27             North Carolina
28               North Dakota
29                       Ohio
                 ...         
22869                Nebraska
22870                  Nevada
22871           New Hampshire
22872              New Jersey
22873              New Mexico
22874                New York
22875          North Carolina
22876            North Dakota
22877                    Ohio
22878                Oklahoma
22879                  Oregon
22880                Maryland
22881           Massachusetts
22882                Michigan
22883               Minnesota
22884             Mississippi
22885                Missouri
22886            Pennsylvania
22887            Rhode Island
22888          South Carolina
22889            South Dakota
22890               Tennessee
22891                   Texas
22892                    Utah
22893                 Vermont
22894                Virginia
22895              Washington
22896           West Virginia
22897               Wisconsin
22898                 Wyoming
Name: State, dtype: object



In [14]:

    
# Selecting specific column and rows
df[0:5]["State"]









    Out[14]:





0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object



In [15]:

    
# Works both ways
df["State"][0:5]









    Out[15]:





0       Alabama
1        Alaska
2       Arizona
3      Arkansas
4    California
Name: State, dtype: object



In [16]:

    
#Getting unique values of State
pd.unique(df['State'])









    Out[16]:





array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Montana',
       'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
       'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma',
       'Oregon', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
       'Mississippi', 'Missouri', 'Pennsylvania', 'Rhode Island',
       'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah',
       'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin',
       'Wyoming'], dtype=object)

1.7 Slicing columns using pandas



In [17]:

    
df.index









    Out[17]:





Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9, 
            ...
            22889, 22890, 22891, 22892, 22893, 22894, 22895, 22896, 22897,
            22898],
           dtype='int64', length=22899)



In [18]:

    
df.loc[0]









    Out[18]:





State                 Alabama
HighQ                  339.06
HighQN                   1042
MedQ                   198.64
MedQN                     933
LowQ                   149.49
LowQN                     123
date      2014-01-01 00:00:00
Name: 0, dtype: object



In [19]:

    
df.iloc[0,0]









    Out[19]:





'Alabama'



In [20]:

    
df.ix[0,0]









    Out[20]:





'Alabama'

Exercise

1) Load the Demographics_State.csv dataset



In [ ]:

2) Show the five first rows of the dataset



In [ ]:

3) Select the column with the State name in the data frame



In [ ]:

4) Get help



In [ ]:

5) Change index to date



In [ ]:

6) Get all the data for 2nd January 2014



In [ ]:

Thinking in Vectors

Difference between loops and vectors



In [21]:

    
#Find weighted average price with respective weights of 0.6, 0.4 for HighQ and MedQ



In [22]:

    
#Python approach. Loop over all rows. 
#For each row, multiply the respective columns by those weights. 
#Add the output to an array



In [23]:

    
#It is easy to convert pandas series to numpy array.
highq_np = np.array(df.HighQ)
medq_np = np.array(df.MedQ)



In [ ]:

    
#Standard pythonic code

def find_weighted_price():
    global weighted_price
    weighted_price = []
    
    for i in range(df.shape[0]):
        weighted_price.append(0.6*highq_np[i]*0.4*highq_np[i])

#print the weighted price
find_weighted_price()
print weighted_price

Exercise: Find the running time of the above program



In [ ]:



In [ ]:

    
#Vectorized Code
weighted_price_vec = 0.6*highq_np + 0.4*medq_np

Exercise: Time the above vectorized code. Do you see any improvements?



In [ ]:

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date
0	Alabama	339.06	1042	198.64	933	149.49	123	2014-01-01
1	Alaska	288.75	252	260.60	297	388.58	26	2014-01-01
2	Arizona	303.31	1941	209.35	1625	189.45	222	2014-01-01
3	Arkansas	361.85	576	185.62	544	125.87	112	2014-01-01
4	California	248.78	12096	193.56	12812	192.92	778	2014-01-01

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date
22894	Virginia	364.98	3513	293.12	3079	NaN	284	2014-12-31
22895	Washington	233.05	3337	189.92	3562	NaN	160	2014-12-31
22896	West Virginia	359.35	551	224.03	545	NaN	60	2014-12-31
22897	Wisconsin	350.52	2244	272.71	2221	NaN	167	2014-12-31
22898	Wyoming	322.27	131	351.86	197	NaN	12	2014-12-31

	State	HighQ	HighQN	MedQ	MedQN	LowQ	LowQN	date
20	Montana	279.73	585	257.98	258	734.65	22	2014-01-01
21	Nebraska	352.64	508	247.03	510	142.99	62	2014-01-01
22	Nevada	269.61	899	218.67	928	231.08	61	2014-01-01
23	New Hampshire	366.47	422	289.81	441	561.05	38	2014-01-01
24	New Jersey	354.91	2244	294.39	2799	225.62	185	2014-01-01