DataFrames

We have already encountered a Python object called a DataFrame. DataFrames are not built into Python: they first appeared when we imported data using pandas_datareader. They come from a popular library called pandas, which provides the DataFrame object and the tools for working with it. pandas_datareader is built on top of pandas, which is why it returns the stock data it downloads as a DataFrame.

To import the pandas library, we run the following command in Jupyter Notebook, as with any other library (this assumes pandas is already installed, which is the case if you used Anaconda):

import pandas

As you know, whenever you call a function that resides inside a library, you need to prefix it with the library's name. To save typing, we can use the "as" syntax to give the library a short name, as follows:

import pandas as pd
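
A minimal sketch of what the alias changes: both names refer to the same library, so only the prefix you type differs. The small DataFrame here is made up for illustration.

```python
import pandas
import pandas as pd

# Both names point at the same module object.
same_module = pandas is pd

# Creating a DataFrame with the full library name ...
df_long = pandas.DataFrame({"Age": [41, 47, 33]})
# ... and with the short alias gives the same result.
df_short = pd.DataFrame({"Age": [41, 47, 33]})

print(same_module)
print(df_long.equals(df_short))
```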

We have covered the head(), tail() and describe() functions that are available for DataFrames, but there are many more.
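
As a quick reminder, here is a small sketch of those three functions. It assumes a hand-built DataFrame holding only the five rows that data.head() displays later in this section, so the numbers will differ from those of the full Training.csv file.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows shown by data.head().
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})

first_rows = df.head(2)   # first 2 rows (default is 5)
last_rows = df.tail(2)    # last 2 rows (default is 5)
summary = df.describe()   # count, mean, std, min, quartiles, max per column

print(first_rows)
print(last_rows)
print(summary)
```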

DataFrame functions covered

  • corr() - creates the correlation matrix for the DataFrame variables and returns it as a DataFrame
  • cov() - creates the variance-covariance matrix for the DataFrame variables and returns it as a DataFrame
  • mean() - calculates the mean of each column of the DataFrame
  • mode() - calculates the mode of each column of the DataFrame
  • median() - calculates the median of each column of the DataFrame

Please note that the functions above can also be applied to individual columns of the DataFrame.
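
A minimal sketch of the column-wise use, assuming a hand-built DataFrame that holds only the first five rows of Training.csv (so the results differ from those of the full file):

```python
import pandas as pd

# Stand-in for the full dataset: the five rows shown by data.head().
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})

# Selecting one column gives a Series, which has the same functions.
income = df["Income"]
income_mean = income.mean()      # 448 / 5 = 89.6
income_median = income.median()  # middle of 19, 19, 57, 100, 253 is 57
income_mode = income.mode()      # 19 appears twice; mode() returns a Series

# Correlation between two individual columns is a single number.
age_income_corr = df["Age"].corr(df["Income"])

# A list of column names selects several columns at once.
subset_means = df[["Age", "Income"]].mean()

print(income_mean, income_median, income_mode.tolist())
print(age_income_corr)
print(subset_means)
```

Note that mode() returns a Series rather than a single number, because a column can have more than one most frequent value.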

Pandas functions covered

  • read_csv() - reads a CSV file from your computer into a DataFrame
  • to_csv() - saves a DataFrame to a CSV file (called on the DataFrame itself)
  • read_excel() - reads an Excel file from your computer into a DataFrame
  • to_excel() - saves a DataFrame to an Excel file (called on the DataFrame itself)
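
The sketch below shows a save-and-reload round trip with the CSV pair. The file name example.csv and the temporary folder are assumptions made for illustration, and the DataFrame holds only the first five rows of Training.csv.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the full dataset: the five rows shown by data.head().
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})

# Work in a temporary folder so no existing file is overwritten.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")  # hypothetical file name

    # index=False leaves the row labels (0, 1, 2, ...) out of the file.
    df.to_csv(path, index=False)

    # Read the file back into a new DataFrame.
    restored = pd.read_csv(path)

print(restored)

# read_excel() and to_excel() follow the same pattern, but they need an
# Excel engine such as openpyxl to be installed alongside pandas.
```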

In [1]:
import pandas as pd

In [3]:
data = pd.read_csv('Training.csv')

In [4]:
data.head()


Out[4]:
   ID  Age  Education  Years_Employed  Income
0   1   41          1               6      19
1   2   47          0              26     100
2   3   33          1              10      57
3   4   29          1               4      19
4   5   47          0              31     253

In [5]:
data.corr()


Out[5]:
                       ID        Age  Education  Years_Employed     Income
ID               1.000000   0.041387  -0.087519       -0.002026  -0.044262
Age              0.041387   1.000000  -0.001689        0.547497   0.489733
Education       -0.087519  -0.001689   1.000000       -0.165918   0.215851
Years_Employed  -0.002026   0.547497  -0.165918        1.000000   0.687851
Income          -0.044262   0.489733   0.215851        0.687851   1.000000

In [6]:
data.cov()


Out[6]:
                          ID         Age   Education  Years_Employed       Income
ID              20875.000000   47.849699  -11.696393       -2.049098  -221.607214
Age                47.849699   64.032321   -0.012505       30.662381   135.798573
Education         -11.696393   -0.012505    0.855611       -1.074128     6.918758
Years_Employed     -2.049098   30.662381   -1.074128       48.983323   166.822421
Income           -221.607214  135.798573    6.918758      166.822421  1200.803703
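
The two matrices are linked: each correlation equals the covariance divided by the product of the two standard deviations, and the diagonal of the covariance matrix holds the variances. The sketch below checks this relationship in two ways: first with the Age and Income figures copied from the output above, then on a hand-built DataFrame that holds only the first five rows of Training.csv.

```python
import math

import numpy as np
import pandas as pd

# 1) Using the figures printed by data.cov() above.
cov_age_income = 135.798573
var_age = 64.032321        # diagonal entry for Age
var_income = 1200.803703   # diagonal entry for Income
age_income_corr = cov_age_income / math.sqrt(var_age * var_income)
print(round(age_income_corr, 6))  # compare with the Age/Income entry of data.corr()

# 2) The same relationship for every pair of columns at once.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})
cov = df.cov()
std = df.std()

# np.outer builds the table of std_i * std_j for every pair of columns.
manual_corr = cov / np.outer(std, std)

print(np.allclose(manual_corr.values, df.corr().values))
```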

In [7]:
data.mean()


Out[7]:
ID                250.500
Age                35.088
Education           0.710
Years_Employed      8.738
Income             46.148
dtype: float64

In [8]:
data.mode()


Out[8]:
    ID  Age  Education  Years_Employed  Income
0  NaN   29          0               0      21

In [9]:
data.median()


Out[9]:
ID                250.5
Age                35.0
Education           0.0
Years_Employed      7.0
Income             34.0
dtype: float64

In [10]:
my_corr = data.corr()

In [11]:
my_corr.to_csv("correlation.csv")
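
To check the saved file, it can be read back with read_csv(). The sketch below assumes a hand-built DataFrame with only the first five rows of Training.csv and writes to a temporary folder so that no existing file is overwritten. Because to_csv() also writes the row labels as the first column, index_col=0 tells read_csv() to use that column as the index again.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the full dataset: the five rows shown by data.head().
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Age": [41, 47, 33, 29, 47],
    "Education": [1, 0, 1, 1, 0],
    "Years_Employed": [6, 26, 10, 4, 31],
    "Income": [19, 100, 57, 19, 253],
})
my_corr = df.corr()

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "correlation.csv")
    my_corr.to_csv(path)

    # Without index_col=0 the variable names would load as an
    # ordinary unnamed column instead of as row labels.
    loaded = pd.read_csv(path, index_col=0)

print(loaded)
```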