DataFrames

We have already met a Python object called a DataFrame. DataFrames are not built into Python: we first saw one when we imported data using pandas_datareader. DataFrames come from a popular library called pandas, which provides them together with the tools for working with them. pandas_datareader is built on pandas, so it fetches stock data and stores it in a DataFrame.

To import the pandas library, we run the following command in Jupyter Notebook, as with any other library (this assumes pandas is already installed, which is the case if you used Anaconda):

import pandas

As you know, whenever you call a function that lives inside a library, you need to prefix it with the library's name. To save typing, we can use the "as" syntax to give the library a short name:

import pandas as pd

We covered the head(), tail() and describe() functions that are available for DataFrames, but there are many more to come.
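As a quick refresher, the three functions can be sketched on a tiny DataFrame. The variable names are illustrative assumptions, and the values are simply the first five rows of the training data shown later in this section:

```python
import pandas as pd

# First five rows of the training data displayed later in this section.
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})

first_two = sample.head(2)   # first 2 rows
last_two = sample.tail(2)    # last 2 rows
summary = sample.describe()  # count, mean, std, min, quartiles and max per column

print(first_two)
print(last_two)
print(summary)
```

Called without an argument, head() and tail() return five rows.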

DataFrame functions covered

corr() - creates the correlation matrix for the DataFrame's variables and returns it as a DataFrame
cov() - creates the variance-covariance matrix for the DataFrame's variables and returns it as a DataFrame
mean() - calculates the mean of each column of the DataFrame
mode() - calculates the mode of each column of the DataFrame
median() - calculates the median of each column of the DataFrame

Please note that the functions above can also be applied to individual columns of the DataFrame.
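A minimal sketch of applying these functions to individual columns could look as follows. The small DataFrame and the variable names are assumptions made for illustration; the values are the first five rows of the training data used in this section:

```python
import pandas as pd

# First five rows of the training data, with only two columns kept for brevity.
sample = pd.DataFrame({
    "Age": [41, 47, 33, 29, 47],
    "Income": [19, 100, 57, 19, 253],
})

age_mean = sample["Age"].mean()            # mean of a single column (one number)
income_median = sample["Income"].median()  # median of a single column
age_income_corr = sample["Age"].corr(sample["Income"])  # correlation of one pair of columns
subset_means = sample[["Age", "Income"]].mean()  # one mean per selected column

print(age_mean, income_median, age_income_corr)
print(subset_means)
```

Selecting one column with single square brackets gives a Series, so the result is a single number; selecting several columns with double square brackets gives a smaller DataFrame, so the result has one entry per column.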

pandas functions covered

read_csv() - reads a CSV file from your computer into a DataFrame
to_csv() - saves a DataFrame to a CSV file
read_excel() - reads an Excel file from your computer into a DataFrame
to_excel() - saves a DataFrame to an Excel file
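The save and load functions mirror each other, which a short round trip can illustrate. This is only a sketch: the file name sample.csv and the tiny DataFrame are assumptions, and a temporary folder is used so that no real files are touched:

```python
import os
import tempfile

import pandas as pd

sample = pd.DataFrame({"ID": [1, 2, 3], "Age": [41, 47, 33]})

# Work inside a temporary folder that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")  # hypothetical file name
    sample.to_csv(path, index=False)           # index=False: do not write the row labels
    loaded = pd.read_csv(path)                 # read the same file back into a DataFrame

print(loaded)
```

read_excel() and to_excel() follow the same pattern, although they typically need an additional Excel engine package (for example openpyxl for .xlsx files) to be installed.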



In [1]:

    
import pandas as pd



In [3]:

    
data = pd.read_csv('Training.csv')



In [4]:

    
data.head()









    Out[4]:

   ID  Age  Education  Years_Employed  Income
0   1   41          1               6      19
1   2   47          0              26     100
2   3   33          1              10      57
3   4   29          1               4      19
4   5   47          0              31     253



In [5]:

    
data.corr()









    Out[5]:

                       ID        Age  Education  Years_Employed     Income
ID               1.000000   0.041387  -0.087519       -0.002026  -0.044262
Age              0.041387   1.000000  -0.001689        0.547497   0.489733
Education       -0.087519  -0.001689   1.000000       -0.165918   0.215851
Years_Employed  -0.002026   0.547497  -0.165918        1.000000   0.687851
Income          -0.044262   0.489733   0.215851        0.687851   1.000000



In [6]:

    
data.cov()









    Out[6]:

                          ID         Age   Education  Years_Employed       Income
ID              20875.000000   47.849699  -11.696393       -2.049098  -221.607214
Age                47.849699   64.032321   -0.012505       30.662381   135.798573
Education         -11.696393   -0.012505    0.855611       -1.074128     6.918758
Years_Employed     -2.049098   30.662381   -1.074128       48.983323   166.822421
Income           -221.607214  135.798573    6.918758      166.822421  1200.803703
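The correlation and covariance matrices are two views of the same information: a correlation is a covariance divided by the product of the two standard deviations, and the diagonal of the covariance matrix holds the variances. As a rough check, the sketch below recomputes the Age/Income correlation from the covariance values printed above (the variable names are illustrative):

```python
import math

# Values copied from the covariance matrix output above.
cov_age_income = 135.798573  # cov(Age, Income)
var_age = 64.032321          # cov(Age, Age), i.e. the variance of Age
var_income = 1200.803703     # cov(Income, Income), i.e. the variance of Income

# corr(X, Y) = cov(X, Y) / (std(X) * std(Y))
corr_age_income = cov_age_income / math.sqrt(var_age * var_income)
print(corr_age_income)
```

The result should agree, up to rounding, with the Age/Income entry of about 0.4897 in the correlation matrix.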



In [7]:

    
data.mean()









    Out[7]:





ID                250.500
Age                35.088
Education           0.710
Years_Employed      8.738
Income             46.148
dtype: float64



In [8]:

    
data.mode()









    Out[8]:

    ID  Age  Education  Years_Employed  Income
0  NaN   29          0               0      21



In [9]:

    
data.median()









    Out[9]:





ID                250.5
Age                35.0
Education           0.0
Years_Employed      7.0
Income             34.0
dtype: float64



In [10]:

    
my_corr = data.corr()



In [11]:

    
my_corr.to_csv("correlation.csv")
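When a matrix like this is saved, to_csv() writes the row labels as the first column of the file. One way to load it back with the labels restored is to pass index_col=0 to read_csv(). The sketch below assumes a small stand-in DataFrame and a temporary folder rather than the real Training.csv:

```python
import os
import tempfile

import pandas as pd

# Stand-in data: two columns from the first five rows of the training data.
sample = pd.DataFrame({
    "Age": [41, 47, 33, 29, 47],
    "Income": [19, 100, 57, 19, 253],
})
my_corr = sample.corr()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "correlation.csv")
    my_corr.to_csv(path)                       # row labels are written as the first column
    restored = pd.read_csv(path, index_col=0)  # use that first column as the row labels again

print(restored)
```

Without index_col=0, the labels would come back as an ordinary unnamed column instead of the row index.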
