Hello World!

Unattributed images in this presentation are the author's own creation. To use them, check the attributions here: https://github.com/sara-02/khalaq

Data Wrangling with Python Pandas

An Introduction for newbies

- [Sarah Masud](https://github.com/sara-02)

How is Data Wrangling different from Machine Learning?

JUNK IN ==> ML Model ==> JUNK OUT

Data preparation is the key to success in ML!

Data analysis is part and parcel of ML, and the MOST important part.

Why do we need pandas? Why not our dear Excel?

Fundamental Data Types in Pandas

Series
DataFrames (focus of this presentation)

DataFrames are 2-D labeled arrays with indexing on both rows and columns
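To make the two data types concrete, here is a minimal sketch that builds one of each. The variable names `alcohol` and `toy` are hypothetical; the values are the first three rows of the wine data used later in this notebook.

```python
import pandas as pd

# A Series is a 1-D labeled array: one column of values plus an index.
alcohol = pd.Series(
    [2.5, 3.9, 2.9],
    index=["Australia", "Austria", "Belg/Lux"],
    name="alcohol",
)

# A DataFrame is a 2-D labeled table: each column is a Series
# and all columns share the same row index.
toy = pd.DataFrame(
    {"alcohol": [2.5, 3.9, 2.9], "deaths": [785, 863, 883]},
    index=["Australia", "Austria", "Belg/Lux"],
)

print(alcohol.ndim, toy.ndim)        # number of dimensions of each object
print(type(toy["deaths"]).__name__)  # selecting one column gives back a Series
print(toy.loc["Austria", "deaths"])  # look up a cell by row label and column name
```

Selecting a single column of a DataFrame returns a Series, which is why most Series methods carry over to DataFrame columns.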



In [70]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
#so that we can view the graphs inside the notebook



In [71]:

    
df = pd.read_csv("wine.csv") 
df.head(3)

Hmm.. DFs look similar to SQL Tables, don't they?

**Similarity**: Data is arranged in a tabular format
**Similarity**: Ability to perform JOIN operations
**Dissimilarity**: Pandas is not a language
**Dissimilarity**: You don't use Pandas as a datastore/backend
**Dissimilarity**: No need to define a schema in Pandas
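A small sketch of the comparison, assuming a toy frame named `wine` (a hypothetical name) built from a few rows of the wine data shown in this notebook. It shows that no schema is declared up front and how a simple SQL query might be expressed in pandas.

```python
import pandas as pd

# A few rows copied from the wine data shown in this notebook.
wine = pd.DataFrame({
    "country": ["Australia", "Austria", "UK", "US"],
    "alcohol": [2.5, 3.9, 1.3, 1.2],
    "deaths": [785, 863, 941, 926],
})

# No schema was declared: pandas inferred a dtype for every column.
print(wine.dtypes)

# Rough equivalent of: SELECT country, deaths FROM wine WHERE deaths > 900
high = wine.loc[wine["deaths"] > 900, ["country", "deaths"]]
print(high)
```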

Advantages of Pandas over SQL for data analysis

Performs automatic data alignment.
Performs faster subsetting.
Real Power: Load data from various backend sources, making the analysis backend-agnostic.
Real Power: Store the result in any, and as many, datasources as needed.
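The points above can be sketched as follows. This is an illustration under assumptions, not the only way to do it: the table name `wine`, the inline CSV text, and the in-memory SQLite database are all stand-ins for real datasources.

```python
import io
import sqlite3

import pandas as pd

# Automatic alignment: values are matched on index labels, not on position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])
total = a + b  # labels present in only one Series give NaN

# Backend-agnostic I/O: read CSV text, store it in SQLite, read it back.
csv_text = "country,deaths\nUK,941\nUS,926\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
from_csv.to_sql("wine", conn, index=False)  # "wine" is a hypothetical table name
from_sql = pd.read_sql("SELECT * FROM wine", conn)
conn.close()

print(total)
print(from_sql)
```

The analysis code that follows a load step does not need to know whether the frame came from a CSV file or a database.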

Basic Stats on Loaded Data

Getting to know how the data looks



In [51]:

    
df.head()



In [52]:

    
df.tail()









    Out[52]:

	Unnamed: 0	country	alcohol	deaths	heart	liver
16	17	Sweden	1.6	743	207	11.200000
17	18	Switzerland	5.8	693	115	20.299999
18	19	UK	1.3	941	285	10.300000
19	20	US	1.2	926	199	22.100000
20	21	West Germany	2.7	861	172	36.700001



In [73]:

    
df['deaths'].count()









    Out[73]:





21



In [54]:

    
df['deaths'].min()









    Out[54]:





680



In [55]:

    
df['deaths'].max()









    Out[55]:





1000



In [56]:

    
df['deaths'].mean()









    Out[56]:





830.04761904761904



In [58]:

    
df['deaths'].describe()









    Out[58]:





count      21.000000
mean      830.047619
std        96.518639
min       680.000000
25%       751.000000
50%       806.000000
75%       916.000000
max      1000.000000
Name: deaths, dtype: float64



In [59]:

    
df['deaths'].plot(kind='box')









    Out[59]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fb10ccc8710>

Subsetting DataSets

Why do we need subsets?

We want to screen out anomalous, partial datapoints.
We want to divide our huge data into small chunks and apply different analysis on each set.
Divide data into train set and test set.
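The three motivations above might look like this in code. It is a minimal sketch on toy data: the frame `toy` is hypothetical, and one value has been replaced by NaN purely to stand in for a partial datapoint.

```python
import numpy as np
import pandas as pd

# Toy data; the NaN stands in for a partial (incomplete) datapoint.
toy = pd.DataFrame({
    "country": ["Australia", "Austria", "Canada", "Denmark", "UK"],
    "deaths": [785, 863, np.nan, 971, 941],
})

# Screen out partial rows.
clean = toy.dropna()

# Divide the data into chunks with a boolean mask.
high = clean[clean["deaths"] > 900]
low = clean[clean["deaths"] <= 900]

# Train/test split: sample 75% of the rows for training, keep the rest for testing.
train = clean.sample(frac=0.75, random_state=0)  # random_state makes the sample repeatable
test = clean.drop(train.index)

print(len(train), len(test))
```

Dropping the sampled index labels from the cleaned frame guarantees that the train and test sets do not overlap.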

Ways of subsetting in Pandas:

By column names
By row labels (loc)
By row and column integer positions (iloc)
By a combination of row labels and column names (loc)



In [75]:

    
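# Build five rows: the numbers 1-5 and their multiples of 2, 3, 4 and 5
num = list(range(1, 6))  # list() so this also works on Python 3
mul2 = [x*2 for x in num]
mul3 = [x*3 for x in num]
mul4 = [x*4 for x in num]
mul5 = [x*5 for x in num]
data = [num, mul2, mul3, mul4, mul5]
# Rows are labeled v-z and columns A-E
df1 = pd.DataFrame(data, index=['v', 'w', 'x', 'y', 'z'], columns=['A', 'B','C','D', 'E'])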



In [76]:

    
df1



In [77]:

    
#### Only Column
df1[['A']]



In [69]:

    
#### Only Row
df1.loc[['v']]



In [64]:

    
df1.loc[['v','w'],['A','B']] # rows and columns



In [65]:

    
df1.iloc[0:2, 0:2] # Using integer positions: first two rows, first two columns

Merging Data Frames

How to merge?

concat()
merge()
append() (deprecated in pandas 1.4 and removed in 2.0; use concat() instead)

Vertically merge
Horizontal merge
Inner JOIN merge
Outer JOIN merge
Add new Keys while merging
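The merge styles listed above can be sketched with `concat()` and `merge()`. The frames `left` and `right` are hypothetical toy inputs using a few values from the wine data, and the key labels `"first"` and `"second"` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"country": ["UK", "US"], "deaths": [941, 926]})
right = pd.DataFrame({"country": ["US", "Sweden"], "alcohol": [1.2, 1.6]})

# Vertical merge: stack rows; `keys` adds an outer index level naming each source.
stacked = pd.concat([left, left], keys=["first", "second"])

# Horizontal merge: place frames side by side, aligned on the row index.
side_by_side = pd.concat([left, right], axis=1)

# Inner JOIN: keep only countries present in both frames.
inner = pd.merge(left, right, on="country", how="inner")

# Outer JOIN: keep every country; missing values become NaN.
outer = pd.merge(left, right, on="country", how="outer")

print(inner)
print(outer)
```

Note that the horizontal `concat()` aligns on row position here, so the two `country` columns do not line up by value; `merge()` is the tool for matching rows on a key.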

Real Power: Merge from different datasources!

Inline Visualization

Visualize as you analyse.



In [78]:

    
df.groupby('country')['deaths'].mean().plot(kind='bar')









    Out[78]:





<matplotlib.axes._subplots.AxesSubplot at 0x7fb108981e10>

Output of In [71], df.head(3):

	Unnamed: 0	country	alcohol	deaths	heart	liver
0	1	Australia	2.5	785	211	15.300000
1	2	Austria	3.9	863	167	45.599998
2	3	Belg/Lux	2.9	883	131	20.700001

Output of In [51], df.head():

	Unnamed: 0	country	alcohol	deaths	heart	liver
0	1	Australia	2.5	785	211	15.300000
1	2	Austria	3.9	863	167	45.599998
2	3	Belg/Lux	2.9	883	131	20.700001
3	4	Canada	2.4	793	191	16.400000
4	5	Denmark	2.9	971	220	23.900000


Useful Links:

THANK YOU & Q/A

