Hello World!

Unattributed images in the presentation are the author's own creation. To use them, check the attributions here: https://github.com/sara-02/khalaq

Data Wrangling with Python Pandas

An Introduction for newbies

- [Sarah Masud](https://github.com/sara-02)

How is Data Wrangling different from Machine Learning?

JUNK IN ==> ML Model ==> JUNK OUT

Data preparation is the key to success in ML!

Data analysis is part and parcel of ML, and the MOST important part.

Why do we need pandas? Why not our dear Excel?

Fundamental Data Types in Pandas

  • Series
  • Dataframes (focus of this presentation)

DataFrames are 2-D labeled arrays with indexing on both rows and columns
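
A minimal sketch of the two types, using made-up names and numbers (`heights`, `people` and their values are illustrative, not from the wine data):

```python
import pandas as pd

# A Series is a 1-D labeled array: values plus an index.
heights = pd.Series([1.62, 1.75, 1.80], index=["ana", "ben", "cy"], name="height_m")

# A DataFrame is a 2-D labeled table: each column is a Series sharing the row index.
people = pd.DataFrame(
    {"height_m": [1.62, 1.75, 1.80], "age": [31, 25, 40]},
    index=["ana", "ben", "cy"],
)

print(heights["ben"])            # look up by row label
print(people.loc["ben", "age"])  # look up by row label and column name
print(type(people["age"]))       # selecting one column gives back a Series
```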


In [70]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
#so that we can view the graphs inside the notebook

In [71]:
df = pd.read_csv("wine.csv") 
df.head(3)


Out[71]:
   Unnamed: 0    country  alcohol  deaths  heart      liver
0           1  Australia      2.5     785    211  15.300000
1           2    Austria      3.9     863    167  45.599998
2           3   Belg/Lux      2.9     883    131  20.700001

Hmm.. DFs look similar to SQL Tables, don't they?

  • **Similarity**: The arrangement of data in a tabular format
  • **Similarity**: Ability to perform JOIN operations
  • **Dissimilarity**: Pandas is not a language
  • **Dissimilarity**: You don't use Pandas as a datastore/backend
  • **Dissimilarity**: No need to define a schema in Pandas
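
To illustrate the "no schema" point, here is a small sketch. The CSV text is a two-row stand-in typed inline (an assumption for the demo, not the real `wine.csv`), and `sample` is a hypothetical name:

```python
import io
import pandas as pd

# Inline CSV text standing in for a file on disk (illustrative only).
csv_text = "country,alcohol,deaths\nAustralia,2.5,785\nAustria,3.9,863\n"

# No CREATE TABLE step: pandas infers a dtype for every column while reading.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)  # one inferred dtype per column
```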

Advantage of Pandas over SQL for data analysis

  • Performs automatic data alignment.
  • Performs faster subsetting.
  • Real power: load data from various backend sources, making the analysis backend-agnostic.
  • Real power: store the result in any, and as many, datasources as needed.
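
A rough sketch of what automatic data alignment means in practice. The two Series (`q1`, `q2`) and their fruit labels are invented for the example:

```python
import pandas as pd

# Two Series whose labels are in a different order and only partly overlap.
q1 = pd.Series({"apples": 10, "pears": 4})
q2 = pd.Series({"pears": 6, "apples": 1, "plums": 3})

# Arithmetic matches values by label, not by position.
# Labels missing on one side ("plums") come out as NaN.
total = q1 + q2

# add() with fill_value treats a missing label as 0 instead.
total_filled = q1.add(q2, fill_value=0)

print(total)
print(total_filled)
```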

Basic Stats on Loaded Data

Getting to know how the data looks


In [51]:
df.head()


Out[51]:
   Unnamed: 0    country  alcohol  deaths  heart      liver
0           1  Australia      2.5     785    211  15.300000
1           2    Austria      3.9     863    167  45.599998
2           3   Belg/Lux      2.9     883    131  20.700001
3           4     Canada      2.4     793    191  16.400000
4           5    Denmark      2.9     971    220  23.900000

In [52]:
df.tail()


Out[52]:
    Unnamed: 0       country  alcohol  deaths  heart      liver
16          17        Sweden      1.6     743    207  11.200000
17          18   Switzerland      5.8     693    115  20.299999
18          19            UK      1.3     941    285  10.300000
19          20            US      1.2     926    199  22.100000
20          21  West Germany      2.7     861    172  36.700001

In [73]:
df['deaths'].count()


Out[73]:
21

In [54]:
df['deaths'].min()


Out[54]:
680

In [55]:
df['deaths'].max()


Out[55]:
1000

In [56]:
df['deaths'].mean()


Out[56]:
830.04761904761904

In [58]:
df['deaths'].describe()


Out[58]:
count      21.000000
mean      830.047619
std        96.518639
min       680.000000
25%       751.000000
50%       806.000000
75%       916.000000
max      1000.000000
Name: deaths, dtype: float64

In [59]:
df['deaths'].plot(kind='box')


Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb10ccc8710>

Subsetting DataSets

Why do we need subsets?

  • We want to screen out anomalous, partial datapoints.
  • We want to divide our huge data into small chunks and apply different analysis on each set.
  • Divide data into train set and test set.
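
The reasons above could look roughly like this in code. The `readings` frame, the threshold of 100, and the 75/25 split are all assumptions made up for the sketch:

```python
import numpy as np
import pandas as pd

# Toy data: one missing value (partial row) and one suspiciously large reading.
readings = pd.DataFrame({
    "sensor": ["a", "b", "c", "d", "e", "f"],
    "value": [10.0, 11.5, np.nan, 9.8, 250.0, 10.4],
})

# Screen out partial datapoints, then anomalous ones.
complete = readings.dropna(subset=["value"])
clean = complete[complete["value"] < 100]  # 100 is an arbitrary cut-off for this demo

# Divide the clean data into a train set and a test set.
train = clean.sample(frac=0.75, random_state=0)  # random 75% of the rows
test = clean.drop(train.index)                   # whatever is left

print(len(clean), len(train), len(test))
```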

Ways of subsetting in Pandas:

  • By Column names
  • By row labels
  • By row-column index
  • Combination of both labels and columns

In [75]:
num = range(1,6)
mul2 = [x*2 for x in num]
mul3 = [x*3 for x in num]
mul4 = [x*4 for x in num]
mul35 = [x*35 for x in num]
data = [num, mul2, mul3, mul4, mul35]
df1 = pd.DataFrame(data, index=['v', 'w', 'x', 'y', 'z'], columns=['A', 'B','C','D', 'E'])

In [76]:
df1


Out[76]:
    A   B    C    D    E
v   1   2    3    4    5
w   2   4    6    8   10
x   3   6    9   12   15
y   4   8   12   16   20
z  35  70  105  140  175

In [77]:
#### Only Column
df1[['A']]


Out[77]:
    A
v   1
w   2
x   3
y   4
z  35

In [69]:
#### Only Row
df1.loc[['v']]


Out[69]:
   A  B  C  D  E
v  1  2  3  4  5

In [64]:
df1.loc[['v','w'],['A','B']] # rows and columns


Out[64]:
   A  B
v  1  2
w  2  4

In [65]:
df1.iloc[0:2, 0:2] # using integer positions (0-based, end-exclusive)


Out[65]:
   A  B
v  1  2
w  2  4

Merging Data Frames

How to merge?

  • concat()
  • merge()
  • append() (deprecated in pandas 1.4 and removed in 2.0; use concat() instead)
  • Vertical merge
  • Horizontal merge
  • Inner JOIN merge
  • Outer JOIN merge
  • Add new Keys while merging
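
A small sketch of the options listed above, on two invented frames (`left`, `right`) that share a `key` column. Treat it as one possible illustration, not the only way to do it:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Vertical merge: stack rows on top of each other.
vertical = pd.concat([left, left], ignore_index=True)

# Horizontal merge: place frames side by side, aligned on the row index.
horizontal = pd.concat([left, right], axis=1)

# Add new keys while merging: an outer index level records each row's origin.
keyed = pd.concat([left, right], keys=["left", "right"])

# Inner JOIN keeps keys present in both frames; outer JOIN keeps all keys.
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```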

Real Power: Merge from different datasources!
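
As a hedged illustration of merging across sources, the sketch below reads one frame from CSV text and another from JSON text, then joins them. Both strings are inline stand-ins (assumptions for the demo); in practice they might be files, URLs, or database queries:

```python
import io
import pandas as pd

# Two stand-in "sources" holding different columns for the same countries.
csv_source = "country,alcohol\nAustralia,2.5\nAustria,3.9\n"
json_source = '[{"country": "Australia", "deaths": 785}, {"country": "Austria", "deaths": 863}]'

from_csv = pd.read_csv(io.StringIO(csv_source))
from_json = pd.read_json(io.StringIO(json_source), orient="records")

# Once loaded, both are plain DataFrames, so the merge does not care where they came from.
combined = from_csv.merge(from_json, on="country")

print(combined)
```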

Inline Visualization

Visualize as you analyse.


In [78]:
df.groupby('country')['deaths'].mean().plot(kind='bar')


Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb108981e10>

THANK YOU
&
Q/A