Import Blaze.
In [1]:
import blaze as bl
Let's create a simple NumPy array: we first load NumPy and then create a matrix with two rows and three columns.
In [2]:
import numpy as np
simpleArray = np.array([
    [1, 2, 3],
    [4, 5, 6]
])
Now that we have an array we can abstract it with Blaze's Data structure.
In [3]:
simpleData_np = bl.Data(simpleArray)
To peek inside the structure, you can either use the .peek() method
In [4]:
simpleData_np.peek()
Out[4]:
or use the .head(...) method, which will be familiar to those of you versed in pandas' syntax.
In [5]:
simpleData_np.head(1)
Out[5]:
If you want to retrieve the first row, you can use indexing.
In [6]:
simpleData_np[0]
Out[6]:
If you want to retrieve columns you have to transpose your DataShape.
In [7]:
simpleData_np.T[0]
Out[7]:
Let's specify names of our fields.
In [8]:
simpleData_np = bl.Data(simpleArray, fields=['a', 'b', 'c'])
You can now retrieve the data simply by referring to a column by its name; let's retrieve column 'b'.
In [9]:
simpleData_np['b']
Out[9]:
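To confirm that the field names were applied, you can inspect the resulting schema (a small additional check, not part of the original cells).
In [ ]:
# the dshape now lists the fields 'a', 'b' and 'c' together with their types
simpleData_np.dshape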
Blaze works equally well with pandas DataFrames. Start by importing pandas.
In [10]:
import pandas as pd
Next, we create a DataFrame.
In [11]:
simpleDf = pd.DataFrame([
    [1, 2, 3],
    [4, 5, 6]
], columns=['a', 'b', 'c'])
and transform it into a DataShape.
In [12]:
simpleData_df = bl.Data(simpleDf)
You can retrieve data in the same manner as with the DataShape created from the NumPy array.
In [13]:
simpleData_df['a']
Out[13]:
DataShapes can be created directly from a CSV file.
In [3]:
import odo
traffic = bl.Data('../Data/TrafficViolations.csv')
To save the data into multiple archives (one for each year of traffic violations), use the following loop.
In [8]:
for year in traffic.Stop_year.distinct().sort():
    odo.odo(traffic[traffic.Stop_year == year],
            '../Data/Years/TrafficViolations_{0}.csv.gz'.format(year))
If you do not know the names of the columns in a dataset, you can get them from the DataShape.
In [15]:
print(traffic.fields)
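If you also need the column types, not just their names, you can print the full schema; this uses the same .dshape attribute that we reuse later when defining symbols (a small additional example, not part of the original cells).
In [ ]:
# print the full schema: column names together with their types
print(traffic.dshape)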
Blaze can also read directly from GZipped archives.
In [16]:
traffic_gz = bl.Data('../Data/TrafficViolations.csv.gz')
To confirm that we get exactly the same data, let's retrieve the first two records from each structure.
In [17]:
traffic.head(2)
Out[17]:
In [18]:
traffic_gz.head(2)
Out[18]:
To read from multiple files, you can use the asterisk (*) wildcard.
In [19]:
traffic_multiple = bl.Data(
    '../Data/Years/TrafficViolations_*.csv.gz')
traffic_multiple.head(2)
Out[19]:
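To confirm that the wildcard picked up all the yearly archives, you could, for example, list the distinct stop years in the combined DataShape (an additional check, not part of the original cells).
In [ ]:
# list the years found across all the globbed archives
traffic_multiple.Stop_year.distinct().sort()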
In order to save the traffic data for the year 2013, you can call odo like this:
In [20]:
odo.odo(traffic[traffic.Stop_year == 2013],
        '../Data/Years/TrafficViolations_2013.csv.gz')
Out[20]:
Blaze can also read directly from relational databases; here we connect to the traffic table in a local PostgreSQL database.
In [47]:
traffic_psql = bl.Data(
    'postgresql://{0}:{1}@localhost:5432/drabast::traffic'
    .format('drabast', 'pAck7!B0ok')
)
traffic_psql.head(2)
Out[47]:
We will output the traffic violations that involved cars manufactured in 2016 to both PostgreSQL and SQLite databases. We will use odo to manage the transfers.
In [25]:
traffic_2016 = traffic_psql[traffic_psql['Year'] == 2016]
# uncomment the two lines below to drop the target tables before re-running
# odo.drop('sqlite:///traffic_local.sqlite::traffic2016')
# odo.drop('postgresql://drabast:pAck7!B0ok@localhost:5432/drabast::traffic_2016')
odo.odo(traffic_2016, 'sqlite:///traffic_local.sqlite::traffic2016')
odo.odo(traffic_2016, 'postgresql://drabast:pAck7!B0ok@localhost:5432/drabast::traffic_2016')
Out[25]:
Reading data from the SQLite database should be trivial by now.
In [26]:
traffic_sqlt = bl.Data('sqlite:///traffic_local.sqlite::traffic2016')
traffic_sqlt.head(2)
Out[26]:
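To double-check that the transfer succeeded, you could also count the records in the newly created table (again, an additional check rather than part of the original recipe).
In [ ]:
# count the records read back from the SQLite table
traffic_sqlt.Year.count()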
Reading from a MongoDB collection works in exactly the same way.
In [27]:
traffic_mongo = bl.Data('mongodb://localhost:27017/packt::traffic')
traffic_mongo.head(2)
Out[27]:
There are two ways of accessing columns: you can get a single column at a time by accessing it as if it were a DataShape attribute
In [28]:
traffic.Year.head(2)
Out[28]:
or by indexing; this allows you to select more than one column at a time
In [29]:
(traffic[['Location', 'Year', 'Accident', 'Fatal', 'Alcohol']]
    .head(2))
Out[29]:
If we could not reflect the schema from an already existing object, we would have to specify the schema manually.
In [30]:
schema_example = bl.symbol('schema_example',
                           '{id: int, name: string}')
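To see how such a manually defined symbol is used, here is a minimal sketch; the people symbol, its tabular var * dshape, and the small list of tuples backing it are illustrative assumptions, not part of the traffic dataset.
In [ ]:
# a hypothetical symbol describing tabular data (var * = a variable number of rows)
people = bl.symbol('people', 'var * {id: int, name: string}')
# any data matching the schema can back the symbol, e.g. a plain list of tuples
people_data = [(1, 'Alice'), (2, 'Bob'), (3, 'Carol')]
# evaluate a selection and projection against that data
bl.compute(people[people['id'] > 1]['name'], people_data)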
Since we already have an existing dataset, traffic, we can reuse its schema by calling traffic.dshape and specify our transformations directly on the resulting symbol.
In [31]:
traffic_s = bl.symbol('traffic', traffic.dshape)
traffic_2013 = traffic_s[traffic_s['Stop_year'] == 2013][
    ['Stop_year', 'Arrest_Type', 'Color', 'Charge']
]
To see how this works, let's read our dataset into a pandas DataFrame.
In [32]:
traffic_pd = pd.read_csv('../Data/TrafficViolations.csv')
You can now pass the DataFrame, together with the traffic_2013 expression, to Blaze's compute(...) function to perform the computation.
In [33]:
bl.compute(traffic_2013, traffic_pd).head(2)
Out[33]:
You can also pass a list of lists or a list of NumPy arrays.
In [51]:
bl.compute(traffic_2013, traffic_pd.values)[0:2]
Out[51]:
You can check which years are present in the dataset by getting all the distinct values of the Stop_year column.
In [35]:
traffic['Stop_year'].distinct().sort()
Out[35]:
We can subtract 2000 from the Stop_year column, as we do not need that level of detail.
In [36]:
traffic['Stop_year'].head(2) - 2000
Out[36]:
If you want to log-transform the Stop_year column, you can do it as follows:
In [37]:
bl.log(traffic['Stop_year']).head(2)
Out[37]:
You can also calculate simple aggregations; for example, the maximum stop year:
In [38]:
traffic['Stop_year'].max()
Out[38]:
You can calculate the age of the car (in years) at the time when the violation occurred:
In [9]:
traffic = bl.transform(traffic,
                       Age_of_car=traffic.Stop_year - traffic.Year)
traffic.head(2)[['Stop_year', 'Year', 'Age_of_car']]
Out[9]:
To calculate the average age of the cars involved in fatal traffic violations, and to count the number of occurrences, you can use the by operation.
In [40]:
bl.by(traffic['Fatal'],
      Fatal_AvgAge=traffic.Age_of_car.mean(),
      Fatal_Count=traffic.Age_of_car.count()
)
Out[40]:
To demonstrate joins, let's first create two subsets of the traffic data: one with the violation type and one with the seat belts indicator, both carrying the date and time columns.
In [41]:
violation = traffic[
    ['Stop_month', 'Stop_day', 'Stop_year',
     'Stop_hr', 'Stop_min', 'Stop_sec', 'Violation_Type']]
belts = traffic[
    ['Stop_month', 'Stop_day', 'Stop_year',
     'Stop_hr', 'Stop_min', 'Stop_sec', 'Belts']]
Now, we join the two objects on the six date and time columns.
In [42]:
violation_belts = bl.join(violation, belts,
                          ['Stop_month', 'Stop_day', 'Stop_year',
                           'Stop_hr', 'Stop_min', 'Stop_sec'])
Once we have the full dataset in place, let's check how many traffic violations involved belts and what sort of punishment was issued to the driver.
In [43]:
bl.by(violation_belts[['Violation_Type', 'Belts']],
      Violation_count=violation_belts.Belts.count()
).sort('Violation_count', ascending=False)
Out[43]:
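If you would rather continue working on the aggregated result in pandas, the expression can be materialized as a DataFrame with odo, since it is bound to concrete data; this is a small sketch rather than part of the original recipe, and it assumes odo and pandas are still imported.
In [ ]:
# compute the aggregation and convert the result into a pandas DataFrame
summary = bl.by(violation_belts[['Violation_Type', 'Belts']],
                Violation_count=violation_belts.Belts.count()
                ).sort('Violation_count', ascending=False)
summary_df = odo.odo(summary, pd.DataFrame)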