Hybrid data representation using Blaze

Import the Blaze.


In [1]:
import blaze as bl

Abstract data

Working with NumPy array

Let's create a simple NumPy array: we first load NumPy and then create a matrix with two rows and three columns.


In [2]:
import numpy as np

simpleArray = np.array([
        [1,2,3],
        [4,5,6]
    ])

Now that we have an array we can abstract it with Blaze's Data structure.


In [3]:
simpleData_np = bl.Data(simpleArray)

In order to peek inside the structure you can either use .peek() method


In [4]:
simpleData_np.peek()


Out[4]:
array([[1, 2, 3],
       [4, 5, 6]])

or use (familiar to those of you versed in pandas' syntax) the .head(...) method


In [5]:
simpleData_np.head(1)


Out[5]:
array([[1, 2, 3]])

If you want to retrieve the first column you can use indexing.


In [6]:
simpleData_np[0]


Out[6]:
None
0 1
1 2
2 3

If you want to retrieve columns you have to transpose your DataShape.


In [7]:
simpleData_np.T[0]


Out[7]:
None
0 1
1 4

Let's specify names of our fields.


In [8]:
simpleData_np = bl.Data(simpleArray, fields=['a', 'b', 'c'])

You can now retrieve the data simply by calling the column by it's name; let's retrieve column 'b'


In [9]:
simpleData_np['b']


Out[9]:
b
0 2
1 5

Working with pandas DataFrame

Start by importing pandas.


In [10]:
import pandas as pd

Next, we create a DataFrame.


In [11]:
simpleDf = pd.DataFrame([
        [1,2,3],
        [4,5,6]
    ], columns=['a','b','c'])

and transform it into a DataShape.


In [12]:
simpleData_df = bl.Data(simpleDf)

You can retrieve data in the same manner as with the DataShape created from the NumPy array.


In [13]:
simpleData_df['a']


Out[13]:
a
0 1
1 4

Working with files

DataShapes can be created directly from a CSV file.


In [3]:
import odo

traffic = bl.Data('../Data/TrafficViolations.csv')

To save the data into multiple archives (for each year of traffic violation) use this.


In [8]:
for year in traffic.Stop_year.distinct().sort():
    odo.odo(traffic[traffic.Stop_year == year], 
        '../Data/Years/TrafficViolations_{0}.csv.gz'\
        .format(year))

If you do not know the names of columns in any dataset, you can get these from the DataShape.


In [15]:
print(traffic.fields)


['Stop_month', 'Stop_day', 'Stop_year', 'Stop_hr', 'Stop_min', 'Stop_sec', 'Agency', 'SubAgency', 'Description', 'Location', 'Latitude', 'Longitude', 'Accident', 'Belts', 'Personal_Injury', 'Property_Damage', 'Fatal', 'Commercial_License', 'HAZMAT', 'Commercial_Vehicle', 'Alcohol', 'Work_Zone', 'State', 'VehicleType', 'Year', 'Make', 'Model', 'Color', 'Violation_Type', 'Charge', 'Article', 'Contributed_To_Accident', 'Race', 'Gender', 'Driver_City', 'Driver_State', 'DL_State', 'Arrest_Type', 'Geolocation']

Blaze can also read directly from a GZipped archives.


In [16]:
traffic_gz = bl.Data('../Data/TrafficViolations.csv.gz')

To compare we get exactly the same data let's retrieve the first two records from each structure.


In [17]:
traffic.head(2)


Out[17]:
Stop_month Stop_day Stop_year Stop_hr Stop_min Stop_sec Agency SubAgency Description Location Latitude Longitude Accident Belts Personal_Injury Property_Damage Fatal Commercial_License HAZMAT Commercial_Vehicle Alcohol Work_Zone State VehicleType Year Make Model Color Violation_Type Charge Article Contributed_To_Accident Race Gender Driver_City Driver_State DL_State Arrest_Type Geolocation
0 9 30 2014 23 51 0 MCP 1st district, Rockville DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED ... PARK RD AT HUNGERFORD DR NaN NaN No No No No No No No No No No MD 02 - Automobile 2014.0 FORD MUSTANG BLACK Citation 21-202(h1) Transportation Article No BLACK M ROCKVILLE MD MD A - Marked Patrol NaN
1 3 31 2015 23 59 0 MCP 2nd district, Bethesda HEADLIGHTS (*) CONNECTICUT AT METROPOLITAN AVE NaN NaN No No No No No No No No No No MD 02 - Automobile 2003.0 HONDA 2S BLUE ESERO 55* NaN No HISPANIC M SILVER SPRING MD MD A - Marked Patrol NaN

In [18]:
traffic_gz.head(2)


Out[18]:
Stop_month Stop_day Stop_year Stop_hr Stop_min Stop_sec Agency SubAgency Description Location Latitude Longitude Accident Belts Personal_Injury Property_Damage Fatal Commercial_License HAZMAT Commercial_Vehicle Alcohol Work_Zone State VehicleType Year Make Model Color Violation_Type Charge Article Contributed_To_Accident Race Gender Driver_City Driver_State DL_State Arrest_Type Geolocation
0 9 30 2014 23 51 0 MCP 1st district, Rockville DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED ... PARK RD AT HUNGERFORD DR NaN NaN No No No No No No No No No No MD 02 - Automobile 2014.0 FORD MUSTANG BLACK Citation 21-202(h1) Transportation Article No BLACK M ROCKVILLE MD MD A - Marked Patrol NaN
1 3 31 2015 23 59 0 MCP 2nd district, Bethesda HEADLIGHTS (*) CONNECTICUT AT METROPOLITAN AVE NaN NaN No No No No No No No No No No MD 02 - Automobile 2003.0 HONDA 2S BLUE ESERO 55* NaN No HISPANIC M SILVER SPRING MD MD A - Marked Patrol NaN

To read from multiple files you can use the asterisk *.


In [19]:
traffic_multiple = bl.Data(
    '../Data/Years/TrafficViolations_*.csv.gz')
traffic_multiple.head(2)


Out[19]:
Stop_month Stop_day Stop_year Stop_hr Stop_min Stop_sec Agency SubAgency Description Location Latitude Longitude Accident Belts Personal_Injury Property_Damage Fatal Commercial_License HAZMAT Commercial_Vehicle Alcohol Work_Zone State VehicleType Year Make Model Color Violation_Type Charge Article Contributed_To_Accident Race Gender Driver_City Driver_State DL_State Arrest_Type Geolocation Age_of_car
0 3 29 2013 17 34 0 MCP 1st district, Rockville FAILURE TO DISPLAY REGISTRATION CARD UPON DEMA... WOOTTON PKWAY AND DARNSTOWN ROAD 39.092008 -77.19444 No No No No No No No No No No MD 02 - Automobile 2003.0 NISSAN 350Z SILVER Citation 13-409(b) Transportation Article No HISPANIC M MONTGOMERY VILLAGE MD MD A - Marked Patrol (39.0920083333333, -77.19444) 10.0
1 8 12 2013 8 41 0 MCP 5th district, Germantown DRIVER CHANGING LANES WHEN UNSAFE RIDGE RD / OBSERVATION DR NaN NaN No No No No No No No No No No MD 02 - Automobile 2003.0 GMC YUKON BLACK Citation 21-309(b) Transportation Article No BLACK M DAMASCUS MD MD B - Unmarked Patrol NaN 10.0

In order to save traffic data for year 2013 you can call odo like this


In [20]:
odo.odo(traffic[traffic.Stop_year == 2013], 
    '../Data/Years/TrafficViolations_2013.csv.gz')


Out[20]:
<odo.backends.csv.CSV at 0x1247e8f60>

Working with databases

Interacting with relational databases

Let's read the data from the PostGRE SQL database now. The URI for accessing PostGRE SQL database has the following syntax postgresql://<user_name>:<password>@<server>:<port>/<database>::<table>.


In [47]:
traffic_psql = bl.Data(
    'postgresql://{0}:{1}@localhost:5432/drabast::traffic'\
    .format('drabast', 'pAck7!B0ok')
)
traffic_psql.head(2)


Out[47]:
index Stop_month Stop_day Stop_year Stop_hr Stop_min Stop_sec Agency SubAgency Description Location Latitude Longitude Accident Belts Personal_Injury Property_Damage Fatal Commercial_License HAZMAT Commercial_Vehicle Alcohol Work_Zone State VehicleType Year Make Model Color Violation_Type Charge Article Contributed_To_Accident Race Gender Driver_City Driver_State DL_State Arrest_Type Geolocation
0 0 9 30 2014 23 51 0 MCP 1st district, Rockville DRIVER FAILURE TO STOP AT STEADY CIRCULAR RED ... PARK RD AT HUNGERFORD DR None None No No No No No No No No No No MD 02 - Automobile 2014.0 FORD MUSTANG BLACK Citation 21-202(h1) Transportation Article No BLACK M ROCKVILLE MD MD A - Marked Patrol None
1 1 3 31 2015 23 59 0 MCP 2nd district, Bethesda HEADLIGHTS (*) CONNECTICUT AT METROPOLITAN AVE None None No No No No No No No No No No MD 02 - Automobile 2003.0 HONDA 2S BLUE ESERO 55* None No HISPANIC M SILVER SPRING MD MD A - Marked Patrol None

We will output traffic violations that involved cars manufactured in 2016 to both, PostGRE SQL and SQLite databases. We will use odo to manage the transfers.


In [25]:
traffic_2016 = traffic_psql[traffic_psql['Year'] == 2016]

# odo.drop('sqlite:///traffic_local.sqlite::traffic2016')
# odo.drop('postgresql://drabast:pAck7!B0ok@localhost:5432/drabast::traffic_2016')
odo.odo(traffic_2016, 'sqlite:///traffic_local.sqlite::traffic2016')
odo.odo(traffic_2016, 'postgresql://drabast:pAck7!B0ok@localhost:5432/drabast::traffic_2016')


Out[25]:
Table('traffic_2016', MetaData(bind=Engine(postgresql://drabast:***@localhost:5432/drabast)), Column('index', BigInteger(), table=<traffic_2016>), Column('Stop_month', BigInteger(), table=<traffic_2016>), Column('Stop_day', BigInteger(), table=<traffic_2016>), Column('Stop_year', BigInteger(), table=<traffic_2016>), Column('Stop_hr', BigInteger(), table=<traffic_2016>), Column('Stop_min', BigInteger(), table=<traffic_2016>), Column('Stop_sec', BigInteger(), table=<traffic_2016>), Column('Agency', Text(), table=<traffic_2016>), Column('SubAgency', Text(), table=<traffic_2016>), Column('Description', Text(), table=<traffic_2016>), Column('Location', Text(), table=<traffic_2016>), Column('Latitude', FLOAT(), table=<traffic_2016>), Column('Longitude', FLOAT(), table=<traffic_2016>), Column('Accident', Text(), table=<traffic_2016>), Column('Belts', Text(), table=<traffic_2016>), Column('Personal_Injury', Text(), table=<traffic_2016>), Column('Property_Damage', Text(), table=<traffic_2016>), Column('Fatal', Text(), table=<traffic_2016>), Column('Commercial_License', Text(), table=<traffic_2016>), Column('HAZMAT', Text(), table=<traffic_2016>), Column('Commercial_Vehicle', Text(), table=<traffic_2016>), Column('Alcohol', Text(), table=<traffic_2016>), Column('Work_Zone', Text(), table=<traffic_2016>), Column('State', Text(), table=<traffic_2016>), Column('VehicleType', Text(), table=<traffic_2016>), Column('Year', FLOAT(), table=<traffic_2016>), Column('Make', Text(), table=<traffic_2016>), Column('Model', Text(), table=<traffic_2016>), Column('Color', Text(), table=<traffic_2016>), Column('Violation_Type', Text(), table=<traffic_2016>), Column('Charge', Text(), table=<traffic_2016>), Column('Article', Text(), table=<traffic_2016>), Column('Contributed_To_Accident', Text(), table=<traffic_2016>), Column('Race', Text(), table=<traffic_2016>), Column('Gender', Text(), table=<traffic_2016>), Column('Driver_City', Text(), table=<traffic_2016>), Column('Driver_State', Text(), table=<traffic_2016>), Column('DL_State', Text(), table=<traffic_2016>), Column('Arrest_Type', Text(), table=<traffic_2016>), Column('Geolocation', Text(), table=<traffic_2016>), schema=None)

Reading data from the SQLite database should be trivial by now.


In [26]:
traffic_sqlt = bl.Data('sqlite:///traffic_local.sqlite::traffic2016')
traffic_sqlt.head(2)


Out[26]:
index Stop_month Stop_day Stop_year Stop_hr Stop_min Stop_sec Agency SubAgency Description Location Latitude Longitude Accident Belts Personal_Injury Property_Damage Fatal Commercial_License HAZMAT Commercial_Vehicle Alcohol Work_Zone State VehicleType Year Make Model Color Violation_Type Charge Article Contributed_To_Accident Race Gender Driver_City Driver_State DL_State Arrest_Type Geolocation
0 2692 4 2 2015 8 31 0 MCP 2nd district, Bethesda DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC... POOKS HILL RD AT LINDEN AV 39.065777 -77.159442 No No No No No No No No No No MD 02 - Automobile 2016.0 AUDI 4S SILVER Warning 21-201(a1) Transportation Article No WHITE M NORTH BETHESDA MD MD A - Marked Patrol (39.0657766666667, -77.1594416666667)
1 8862 4 11 2015 19 27 0 MCP 2nd district, Bethesda FAILURE TO STOP AT STOP SIGN NORFOLK AVE / DEL RAY AVE 38.989567 -77.098130 No No No No No No No No No No MD 02 - Automobile 2016.0 AUDI A7 BLACK Warning 21-707(a) Transportation Article No WHITE M BETHESDA MD MD A - Marked Patrol (38.9895666666667, -77.09813)

Interacting with MongoDB database

Reading from MongoDB is very similar to reading from a PostGRE SQL or SQLite databases.


In [27]:
traffic_mongo = bl.Data('mongodb://localhost:27017/packt::traffic')
traffic_mongo.head(2)


Out[27]:
Accident Agency Alcohol Arrest_Type Article Belts Charge Color Commercial_License Commercial_Vehicle Contributed_To_Accident DL_State Description Driver_City Driver_State Fatal Gender Geolocation HAZMAT Latitude Location Longitude Make Model Personal_Injury Property_Damage Race State Stop_day Stop_hr Stop_min Stop_month Stop_sec Stop_year SubAgency VehicleType Violation_Type Work_Zone Year
0 No MCP No A - Marked Patrol Transportation Article No 21-201(a1) RED No No No MD DRIVER FAILURE TO OBEY PROPERLY PLACED TRAFFIC... TAKOMA PARK MD No M (39.00438825, -77.0061869666667) No 39.004388 FLOWER AVE AT WAYNE AVE -77.006187 VOLK 4S No No WHITE MD 16 21 46 7 0 2013 3rd district, Silver Spring 02 - Automobile Citation No 2011.0
1 No MCP No A - Marked Patrol Transportation Article No 13-401(h) BLACK No No No MD DRIVING VEHICLE ON HIGHWAY WITH SUSPENDED REGI... GERMANTOWN MD No M (39.0821319333333, -77.1463659333333) No 39.082132 355 / VEIRS MILL -77.146366 CADILAC DEVILLE No No BLACK MD 13 17 55 1 0 2013 1st district, Rockville 02 - Automobile Citation No 2004.0

Data operations

Accessing columns

There are two ways of accessing columns: you can get a single column at a time by accessing them as if they were a DataShape attribute


In [28]:
traffic.Year.head(2)


Out[28]:
Year
0 2014.0
1 2003.0

or indexing; this allows to select more than one column at a time


In [29]:
(traffic[['Location', 'Year', 'Accident', 'Fatal', 'Alcohol']]
    .head(2))


Out[29]:
Location Year Accident Fatal Alcohol
0 PARK RD AT HUNGERFORD DR 2014.0 No No No
1 CONNECTICUT AT METROPOLITAN AVE 2003.0 No No No

Symbolic transformations

If we could not reflect the schema from an already existing object, we would have to specify the schema manually.


In [30]:
schema_example = bl.symbol('schema_exampl', 
                           '{id: int, name: string}')

Since we already have an existing dataset traffic, we can reuse the schema by calling traffic.dshape and specify our transformations directly on it.


In [31]:
traffic_s = bl.symbol('traffic', traffic.dshape)
traffic_2013 = traffic_s[traffic_s['Stop_year'] == 2013][
    ['Stop_year', 'Arrest_Type','Color', 'Charge']
]

To present how this works, let's read our dataset into pandas' DataFrame.


In [32]:
traffic_pd = pd.read_csv('../Data/TrafficViolations.csv')

You can now pass the dataset directly to traffic_2013 object and perform the computation using the .compute(...) method of Blaze.


In [33]:
bl.compute(traffic_2013, traffic_pd).head(2)


Out[33]:
Stop_year Arrest_Type Color Charge
73 2013 A - Marked Patrol SILVER 13-409(b)
215 2013 B - Unmarked Patrol BLACK 21-309(b)

You can also pass a list of lists or a list of NumPy arrays.


In [51]:
bl.compute(traffic_2013, traffic_pd.values)[0:2]


Out[51]:
array([[2013, 'A - Marked Patrol', 'SILVER', '13-409(b)'],
       [2013, 'B - Unmarked Patrol', 'BLACK', '21-309(b)']], dtype=object)

Operations on columns

You can check that by getting all the distinct values for the Stop_year column.


In [35]:
traffic['Stop_year'].distinct().sort()


Out[35]:
Stop_year
2 2013
0 2014
1 2015
3 2016

We can subtract 2000 from the Stop_year column as we do not need a greater detail.


In [36]:
traffic['Stop_year'].head(2) - 2000


Out[36]:
Stop_year
0 14
1 15

If you wanted to log-transform the Stop_year you need to


In [37]:
bl.log(traffic['Stop_year']).head(2)


Out[37]:
Stop_year
0 7.607878
1 7.608374

Reducing data

Some reduction methods are also available, like .mean() (that calculates the average), .std (that calculates standard deviation) or .max() (that returns the maximum from the list).


In [38]:
traffic['Stop_year'].max()


Out[38]:
2016

You can calculate the age of the car (in years) at the time when the violation occured


In [9]:
traffic = bl.transform(traffic, 
             Age_of_car = traffic.Stop_year - traffic.Year)

traffic.head(2)[['Stop_year', 'Year', 'Age_of_car']]


Out[9]:
Stop_year Year Age_of_car
0 2014 2014.0 0.0
1 2015 2003.0 12.0

and to calculate the average age of the car involved in a fatal trafic violation and count the number of occurences you can use the by operation.


In [40]:
bl.by(traffic['Fatal'], 
      Fatal_AvgAge=traffic.Age_of_car.mean(),
      Fatal_Count =traffic.Age_of_car.count()
)


Out[40]:
Fatal Fatal_AvgAge Fatal_Count
0 No 9.580998 404418
1 Yes 8.798246 116

Joins

We first select all the traffic violations by violation type (the violation object) and the traffic violations involving belts (the belts object).


In [41]:
violation = traffic[
    ['Stop_month','Stop_day','Stop_year',
     'Stop_hr','Stop_min','Stop_sec','Violation_Type']]

belts = traffic[
    ['Stop_month','Stop_day','Stop_year',
     'Stop_hr','Stop_min','Stop_sec','Belts']]

Now, we join the two objects on the six date and time columns.


In [42]:
violation_belts = bl.join(violation, belts, 
      ['Stop_month','Stop_day','Stop_year',
       'Stop_hr','Stop_min','Stop_sec'])

Once we have the full dataset in place, let's check how many traffic violations involved belts and what sort of punishment was issued to the driver.


In [43]:
bl.by(violation_belts[['Violation_Type', 'Belts']],
      Violation_count=violation_belts.Belts.count()
).sort('Violation_count', ascending=False)


Out[43]:
Violation_Type Belts Violation_count
0 Citation No 989728
5 Warning No 439490
2 ESERO No 56447
1 Citation Yes 35596
6 Warning Yes 12245
3 ESERO Yes 1327
4 SERO No 3