"Data is the new oil"
Ways to acquire data (typical data sources)
Data Formats
Two Datasets
The Price of Weed website - http://www.priceofweed.com/
Crowdsources the prices people report paying for weed on the street. Self-reported.
Reported at individual transaction level
Here is a sample data set from United States - http://www.priceofweed.com/prices/United-States.html
See note - Averages are corrected for outliers based on standard deviation from the mean.
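The site does not document the exact correction rule, only that it is "based on standard deviation from the mean". One plausible sketch (the one-standard-deviation cutoff and the sample prices are our assumptions, not the site's published method):

```python
import numpy as np

# hypothetical reported prices for one state; 900 is an obvious outlier
prices = np.array([230.0, 245.0, 250.0, 240.0, 900.0])

mu, sigma = prices.mean(), prices.std()
# keep only values within one standard deviation of the mean
# (the cutoff of one sigma is our assumption)
kept = prices[np.abs(prices - mu) <= sigma]
corrected_average = kept.mean()
```

With the outlier dropped, the corrected average sits near the cluster of typical reports instead of being pulled upward by the single extreme value.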
Frank Bi from The Verge wrote a script to scrape the data daily. The daily prices are available on github at https://github.com/frankbi/price-of-weed
Here is sample data from one day - 23rd July 2015 - https://github.com/frankbi/price-of-weed/blob/master/data/weedprices23072015.csv
All the daily CSV files were combined into one large CSV by yhat.
http://blog.yhathq.com/posts/7-funny-datasets.html
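The combining step can be sketched with `glob` and `pandas.concat`; the file pattern below mirrors the daily filenames in the GitHub repo, but the exact layout is an assumption:

```python
import glob
import pandas as pd

def combine_daily_files(pattern):
    """Concatenate every daily CSV matching `pattern` into one DataFrame."""
    files = sorted(glob.glob(pattern))
    return pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# e.g. combine_daily_files("data/weedprices*.csv").to_csv("data/Weed_Price.csv", index=False)
```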
Data is an abstraction of reality.
In [1]:
# Load the libraries
import pandas as pd
import numpy as np
In [2]:
# Load the dataset
df = pd.read_csv("data/Weed_Price.csv")
In [3]:
# Shape of the dataset - rows & columns
df.shape
Out[3]:
In [4]:
# Check for type of each variable
df.dtypes
Out[4]:
In [5]:
# Let's load this again, parsing the last column as a date
df = pd.read_csv("data/Weed_Price.csv", parse_dates=[-1])
In [6]:
# Now check the type of each column
df.dtypes
Out[6]:
In [7]:
# Get the names of all columns
df.columns
Out[7]:
In [8]:
# Get the index of all rows
df.index
Out[8]:
In [9]:
# Can we see some sample rows - the top 5 rows
df.head()
Out[9]:
In [10]:
# Can we see some sample rows - the bottom 5 rows
df.tail()
Out[10]:
In [11]:
# Get specific rows
df[20:25]
Out[11]:
In [12]:
# Can we access a specific column?
df["State"]
Out[12]:
In [13]:
# Using the dot notation
df.State
Out[13]:
In [14]:
# Selecting a specific column and rows
df[0:5]["State"]
Out[14]:
In [15]:
# Works both ways
df["State"][0:5]
Out[15]:
In [16]:
#Getting unique values of State
pd.unique(df['State'])
Out[16]:
In [17]:
df.index
Out[17]:
In [18]:
df.loc[0]
Out[18]:
In [19]:
df.iloc[0,0]
Out[19]:
In [20]:
# .ix is deprecated (removed in newer pandas) - use .loc (labels) or .iloc (positions)
df.iloc[0, 0]
Out[20]:
In [ ]:
2) Show the first five rows of the dataset
In [ ]:
3) Select the column with the State name in the data frame
In [ ]:
4) Get help
In [ ]:
5) Change index to date
In [ ]:
6) Get all the data for 2nd January 2014
In [ ]:
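One possible approach to exercises 5 and 6, sketched on a toy frame (the real `df` comes from `Weed_Price.csv`; the toy values below are made up for illustration):

```python
import pandas as pd

# toy stand-in for df, with the same kind of columns
toy = pd.DataFrame({
    "State": ["Alabama", "Alaska", "Alabama"],
    "HighQ": [339.06, 288.75, 339.65],
    "date": pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-02"]),
})

# 5) make the date column the index
toy_by_date = toy.set_index("date")

# 6) select all rows for 2nd January 2014 using the date as a label
jan2 = toy_by_date.loc["2014-01-02"]
```

With a `DatetimeIndex`, `.loc` accepts a date string as a label, so selecting one day needs no explicit boolean mask.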
In [21]:
#Find weighted average price with respective weights of 0.6, 0.4 for HighQ and MedQ
In [22]:
#Python approach. Loop over all rows.
#For each row, multiply the respective columns by those weights.
#Add the output to an array
In [23]:
# It is easy to convert a pandas Series to a numpy array.
highq_np = np.array(df.HighQ)
medq_np = np.array(df.MedQ)
In [ ]:
#Standard pythonic code
def find_weighted_price():
    global weighted_price
    weighted_price = []
    for i in range(df.shape[0]):
        weighted_price.append(0.6*highq_np[i] + 0.4*medq_np[i])

#print the weighted price
find_weighted_price()
print(weighted_price)
Exercise: Find the running time of the above program
In [ ]:
In [ ]:
#Vectorized Code
weighted_price_vec = 0.6*highq_np + 0.4*medq_np
Exercise: Time the above vectorized code. Do you see any improvements?
In [ ]:
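One way to do both timing exercises, sketched with the standard-library `timeit` module on synthetic arrays (the array size and price ranges are made up; in the notebook you would time the real `highq_np` and `medq_np`):

```python
import timeit
import numpy as np

# synthetic stand-ins for the real price arrays
highq_np = np.random.uniform(200, 400, size=20000)
medq_np = np.random.uniform(100, 300, size=20000)

def loop_version():
    out = []
    for i in range(len(highq_np)):
        out.append(0.6 * highq_np[i] + 0.4 * medq_np[i])
    return out

def vector_version():
    return 0.6 * highq_np + 0.4 * medq_np

loop_time = timeit.timeit(loop_version, number=10)
vec_time = timeit.timeit(vector_version, number=10)
print("loop: %.4fs  vectorized: %.4fs" % (loop_time, vec_time))
```

The vectorized version should be dramatically faster, since NumPy performs the arithmetic in compiled code over whole arrays instead of indexing one Python object at a time. In IPython, `%timeit` on each expression gives the same comparison with less ceremony.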