Data Ingestion, Wrangling, ETL

  • 80% of Data Science is data wrangling.
  • Python's library ecosystem is the first reason to use it!
  • Pandas: if you learn one thing today, learn this!

Everything has a Python API

Almost every major internet service offers a Python API or client library.


Many Domain Specific Libraries


Everything integrates nicely in notebooks, which can easily be turned into slides.

In [5]:
# Example 1:
# do something fun with the weather API

Data Wrangling with Python and Pandas (tutorial)

Introduction: http://pandas.pydata.org/pandas-docs/stable/10min.html

Tutorial on data wrangling:


In [3]:
# Explore the Montreal 2012 weather data used in the tutorial
%matplotlib inline
import pandas as pd
import matplotlib
# Montreal weather, hourly observations
weather_url = "https://raw.githubusercontent.com/jvns/pandas-cookbook/master/data/weather_2012.csv"

# parse_dates expects a list of column names (or a boolean), not a bare string
weather_2012_final = pd.read_csv(weather_url, parse_dates=['Date/Time'], index_col='Date/Time')
weather_2012_final['Temp (C)'].plot(figsize=(15, 6))

<matplotlib.axes._subplots.AxesSubplot at 0x106d1aa50>

In [18]:
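# Median temperature for hours reported as cloudy vs. snowy
print(weather_2012_final[weather_2012_final['Weather'] == 'Cloudy']['Temp (C)'].median())
print(weather_2012_final[weather_2012_final['Weather'] == 'Snow']['Temp (C)'].median())
# to_hdf needs a key; compression is chosen with complib/complevel (requires PyTables)
weather_2012_final.to_hdf('ciao.h5', key='weather', complib='blosc', complevel=9)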
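
Filtering one category at a time works, but groupby computes the same statistic for every category in a single pass. A minimal sketch on made-up data (the values below are illustrative and are not taken from the Montreal file; only the column names match it):

```python
import pandas as pd

# Hypothetical sample with the same column names as the weather data
sample = pd.DataFrame({
    "Weather": ["Cloudy", "Snow", "Cloudy", "Snow", "Clear"],
    "Temp (C)": [4.0, -6.0, 8.0, -2.0, 12.0],
})

# One median per weather category, instead of one boolean filter per category
median_by_weather = sample.groupby("Weather")["Temp (C)"].median()
print(median_by_weather)
```

The result is a Series indexed by weather category, so median_by_weather["Snow"] gives the same number as the boolean-mask version for that category.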


Why is my code slow?

  • Look under the hood: memory hierarchies.
  • Python is magic, and magic isn't free: how built-in types are implemented, and efficiency considerations.
  • Profiling and monitoring.
  • If all else fails: go parallel.
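
Before optimizing anything, measure. One way to do the profiling step is with the standard library's cProfile and pstats modules. This is only a sketch: square, slow_sum_of_squares and profile_call are names made up for illustration.

```python
import cProfile
import io
import pstats


def square(x):
    return x * x


def slow_sum_of_squares(n):
    # Deliberately naive: one Python-level function call per element
    total = 0
    for i in range(n):
        total += square(i)
    return total


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # show the top 5 entries
    return result, buffer.getvalue()


result, report = profile_call(slow_sum_of_squares, 100000)
print(report)
```

The report lists call counts and cumulative time per function, which usually shows where the per-call overhead of pure Python is being paid.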

In [24]:
# Naive pure-Python search: index of the point closest to `position`

def closest(position, positions):
    x0, y0 = position
    dbest, ibest = None, None
    for i, (x, y) in enumerate(positions):
        d = (x - x0) ** 2 + (y - y0) ** 2
        if dbest is None or d < dbest:
            dbest, ibest = d, i
    return ibest

In [26]:
import random
# 10 million random points in the unit square (range replaces Python 2's xrange)
positions = [(random.random(), random.random()) for _ in range(10000000)]

In [27]:
%timeit closest((.5, .5), positions)

1 loops, best of 3: 9.08 s per loop

In [37]:
import numpy as np
positions = np.random.rand(10000000, 2)

In [38]:
x, y = positions[:,0], positions[:,1]

In [39]:
distances = (x - .5) ** 2 + (y - .5) ** 2

In [40]:
%timeit exec(In[39])

1 loops, best of 3: 208 ms per loop
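
The timed cell only computes the squared distances; to return the same answer as closest() you still need the index of the smallest one. A minimal sketch, assuming positions is an (n, 2) NumPy array (closest_vectorized is a hypothetical name, not part of the original notebook):

```python
import numpy as np


def closest_vectorized(position, positions):
    """Return the row index of the point in `positions` nearest to `position`."""
    deltas = positions - np.asarray(position)   # broadcasts (n, 2) - (2,)
    squared = (deltas ** 2).sum(axis=1)         # squared distance per row
    return int(np.argmin(squared))              # first index of the minimum


rng = np.random.default_rng(0)
points = rng.random((1000, 2))
index = closest_vectorized((0.5, 0.5), points)
print(index, points[index])
```

Like the loop version, np.argmin returns the first index when several points are tied for the minimum.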

Memory, cores, I/O

  • Latency: Register, Cache, RAM, Disk (SSD/HDD), network
  • Out of core vs distributed
  • Embarrassingly parallel problems (shell/python parallel)
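
"Out of core" means the data never has to fit in RAM at once: you stream it in chunks and keep only a small running result. A sketch of the closest-point search in that style, using only the standard library (point_stream, chunks and closest_out_of_core are hypothetical names, and the random generator stands in for a file too large to load):

```python
import random
from itertools import islice


def point_stream(n, seed=0):
    """Yield n random points one at a time (stand-in for reading a huge file)."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.random(), rng.random()


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def closest_out_of_core(position, stream, chunk_size=10000):
    """Index of the point nearest to `position`, holding one chunk in memory."""
    x0, y0 = position
    best_distance, best_index = None, None
    offset = 0  # global index of the first point in the current chunk
    for chunk in chunks(stream, chunk_size):
        local_index, local_distance = min(
            ((i, (x - x0) ** 2 + (y - y0) ** 2) for i, (x, y) in enumerate(chunk)),
            key=lambda pair: pair[1],
        )
        if best_distance is None or local_distance < best_distance:
            best_distance, best_index = local_distance, offset + local_index
        offset += len(chunk)
    return best_index


print(closest_out_of_core((0.5, 0.5), point_stream(50000, seed=1)))
```

Each chunk is processed independently and only the per-chunk minimum is combined, which is also what makes this problem embarrassingly parallel: the chunks could be handed to separate cores or machines.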

In [8]:
from IPython.display import Image


In [2]:
# Example: Run some parallel code
from ipyparallel import Client
client = Client(profile='mycluster')
%px print("Hello from the cluster engines!")

How to deal with big data?

Network analysis with NetworkX

Intro and examples here

In [41]:
%matplotlib inline
import networkx as nx
import matplotlib.pyplot as plt
from IPython.display import Image
n = 10
m = 20
rgraph1 = nx.gnm_random_graph(n,m)
print("Nodes: ", list(rgraph1.nodes()))
print("Edges: ", list(rgraph1.edges()))

Nodes:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Edges:  [(0, 1), (0, 4), (0, 5), (1, 5), (2, 8), (2, 9), (2, 3), (2, 5), (3, 5), (3, 6), (4, 9), (4, 6), (5, 6), (5, 7), (5, 8), (5, 9), (6, 8), (6, 7), (7, 8), (8, 9)]

In [42]:
if nx.is_connected(rgraph1):
    print("Graph is connected")
else:
    print("Graph is not connected")

Graph is connected

In [43]:
print("Diameter of graph is ", nx.diameter(rgraph1))

Diameter of graph is  2

In [44]:
# Spanning trees rooted at node 0, built by depth-first and breadth-first search
T = nx.dfs_tree(rgraph1, 0)
print("DFS Tree edges : ", list(T.edges()))

T = nx.bfs_tree(rgraph1, 0)
print("BFS Tree edges : ", list(T.edges()))

DFS Tree edges :  [(0, 1), (1, 5), (2, 8), (4, 6), (5, 2), (6, 3), (6, 7), (8, 9), (9, 4)]
BFS Tree edges :  [(0, 1), (0, 4), (0, 5), (4, 9), (4, 6), (5, 8), (5, 2), (5, 3), (5, 7)]
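
To see what the BFS tree and the diameter actually compute, here is a pure-Python sketch run on the edge list printed above. It is not NetworkX's implementation; build_adjacency, bfs_tree_edges and diameter are illustrative names, and visiting neighbours in sorted order is an assumption of this sketch.

```python
from collections import deque

# Edge list copied from the random-graph output above
EDGES = [(0, 1), (0, 4), (0, 5), (1, 5), (2, 8), (2, 9), (2, 3), (2, 5),
         (3, 5), (3, 6), (4, 9), (4, 6), (5, 6), (5, 7), (5, 8), (5, 9),
         (6, 8), (6, 7), (7, 8), (8, 9)]


def build_adjacency(edges):
    """Undirected graph as a dict: node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def bfs_tree_edges(adjacency, source):
    """Breadth-first search from source; return (tree edges, depth per node)."""
    depth = {source: 0}
    tree_edges = []
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency[node]):
            if neighbour not in depth:          # first time we reach it
                depth[neighbour] = depth[node] + 1
                tree_edges.append((node, neighbour))
                queue.append(neighbour)
    return tree_edges, depth


def diameter(adjacency):
    """Longest shortest path between any two nodes (graph must be connected)."""
    return max(max(bfs_tree_edges(adjacency, node)[1].values())
               for node in adjacency)


adjacency = build_adjacency(EDGES)
tree_edges, depth = bfs_tree_edges(adjacency, 0)
print("BFS tree edges:", tree_edges)
print("Diameter:", diameter(adjacency))
```

On this graph the sketch finds the same nine BFS tree edges as the output above (possibly listed in a different order) and a diameter of 2.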

Galleries and miniproject


Extend the analysis provided here:


  1. Which city has the most other cities within a 10-mile radius?
  2. How many cities have no other city within 10 miles? Where are they mostly located?
  3. What is the distribution of the number of cities within a 10-mile radius of a city? How does it change when you vary the radius using interact()?
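
A possible starting point for the questions above: a great-circle (haversine) distance in miles and a brute-force neighbour count. This is a sketch under assumptions: the city names and coordinates are invented, the Earth radius is the usual mean value of about 3958.8 miles, city names are assumed unique, and the pairwise loop is O(n²), so a real city list may need a spatial index.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius (approximation)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def neighbour_counts(cities, radius_miles=10.0):
    """Map each city name to the number of other cities within radius_miles."""
    counts = {name: 0 for name, _, _ in cities}
    for i, (name_a, lat_a, lon_a) in enumerate(cities):
        for name_b, lat_b, lon_b in cities[i + 1:]:
            if haversine_miles(lat_a, lon_a, lat_b, lon_b) <= radius_miles:
                counts[name_a] += 1
                counts[name_b] += 1
    return counts


# Hypothetical cities: (name, latitude, longitude)
cities = [("A", 45.00, -73.00), ("B", 45.05, -73.00),
          ("C", 45.10, -73.00), ("D", 46.00, -73.00)]
counts = neighbour_counts(cities, radius_miles=10.0)
print(counts)
print(max(counts, key=counts.get))
```

Because radius_miles is a plain keyword argument, neighbour_counts can be wrapped in a small function and passed to interact() to explore how the distribution changes with the radius.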

