

Overview:

As data scientists, we often work with large datasets that contain values of different types.

We've seen how 2D NumPy arrays let us compute on data efficiently, but their one downside is that all elements must be of the same type.

This is where the Pandas package comes in. So what's in Pandas?

  • High-level data manipulation.
  • The concept of "DataFrame" objects.

    • Data is stored in such DataFrames.
  • More specifically, they are tables,

    • with rows representing "observations",

    • and columns representing "variables".

    • Each row has a unique label, and so does each column.

    • Columns can have different types.

  • We typically don't build DataFrames manually.

    • We convert .csv (comma-separated values) files to DataFrames.

    • We do this by importing the pandas package:

      import pandas as pd (again, pd is an "alias").

    • Now we can use a function that comes with pandas:

      read_csv(<path to .csv file>)
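As a minimal sketch of the points above, the snippet below reads a tiny CSV and checks that each column keeps its own type. The CSV text, its column names, and its values are made up for illustration, and io.StringIO stands in for a real file path so the snippet runs without any file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV content (illustrative values, not a real dataset).
csv_text = """country,population,in_brics
Brazil,200.4,True
Russia,143.5,True
India,1252.0,True
"""

# read_csv accepts a path or a file-like object; StringIO plays the file here.
df = pd.read_csv(io.StringIO(csv_text))

print(df)         # a table: one row per observation, one column per variable
print(df.dtypes)  # unlike a NumPy array, each column has its own type
```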


Example:

We will use the pandas package to read the "BRICS dataset" into Python. Let's see what the DataFrame looks like:


In [7]:
# import the pandas package
import pandas as pd

# load in the dataset and save it to brics var.
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv")

brics

# we can make the table look better by adding the parameter index_col=0
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv", index_col=0)

brics  # notice how the default integer row index is now replaced by the Country column.


Out[7]:
                    Original Principal Amount  Effective Date (Most Recent)
Country
China                         $48185147142.81                           430
South Africa                   $4052800000.00                            14
Russian Federation            $14451100000.00                            84
Brazil                        $59782439627.00                           448
India                         $55556320000.00                           295

One of the most useful features of pandas is the ease with which we can select rows and columns in different ways. Here's how we do it:

  • To access columns, there are two different ways:

    1. data_set_var[ "column-name" ]
    2. < data_set_var >.< column-name > (works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute)
  • We can add columns too:

    <data_set_var>["new-column-name"] = < list of values >
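The column syntax above can be sketched as follows. The DataFrame here is a hypothetical stand-in (made-up labels and values, not the BRICS file), so treat it as an illustration only.

```python
import pandas as pd

# Hypothetical stand-in data, built by hand only for this illustration.
df = pd.DataFrame(
    {
        "capital": ["Brasilia", "Moscow", "New Delhi"],
        "population": [200.4, 143.5, 1252.0],
    },
    index=["BR", "RU", "IN"],
)

by_brackets = df["population"]   # works for any column name
by_attribute = df.population     # works only for identifier-like names

# Adding a column: supply one value per row, in row order.
df["rank"] = [2, 3, 1]

print(by_brackets)
print(df)
```

Square brackets are the safer default, since a column name containing spaces (such as "Original Principal Amount") cannot be reached with the dot form.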


In [9]:
# Add a new column
brics["on_earth"] = [ True, True, True, True, True ]

# Display the DataFrame
brics


Out[9]:
                    Original Principal Amount  Effective Date (Most Recent)  on_earth
Country
China                         $48185147142.81                           430      True
South Africa                   $4052800000.00                            14      True
Russian Federation            $14451100000.00                            84      True
Brazil                        $59782439627.00                           448      True
India                         $55556320000.00                           295      True

In [1]:
# Manipulating Columns
"""Columns can be manipulated using arithmetic operations
on other columns"""


Out[1]:
'Columns can be manipulated using arithmetic operations\non other columns'
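A minimal sketch of such column arithmetic, assuming a hand-built DataFrame with made-up population and area figures (the column names and numbers are illustrative, not taken from the BRICS file):

```python
import pandas as pd

# Hypothetical figures: population in millions, area in million km^2.
df = pd.DataFrame(
    {
        "population": [200.4, 143.5, 1252.0],
        "area": [8.516, 17.10, 3.286],
    },
    index=["BR", "RU", "IN"],
)

# The division is applied row by row; the result is stored as a new column.
df["density"] = df["population"] / df["area"]

print(df)
```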

Accessing Rows:


Syntax: dataframe.loc[ <"row name"> ]
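As a rough illustration of this syntax, the sketch below uses a small hand-built DataFrame (hypothetical labels and values). A single label returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data for illustration.
df = pd.DataFrame(
    {
        "capital": ["Brasilia", "Moscow", "New Delhi"],
        "population": [200.4, 143.5, 1252.0],
    },
    index=["BR", "RU", "IN"],
)

one_row = df.loc["RU"]            # single label: the row comes back as a Series
two_rows = df.loc[["RU", "IN"]]   # list of labels: a DataFrame with those rows

print(one_row)
print(two_rows)
```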



Element access


To get just one element of the table, we can specify both the row label and the column label with loc.

Syntax:

  1. dataframe.loc[ <"row-name">, <"column-name"> ]

  2. dataframe[ <"column-name"> ].loc[ <"row-name"> ]

  3. dataframe.loc[ <"row-name"> ][ <"column-name"> ]
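The three forms can be compared side by side in a short sketch. The DataFrame below is a hypothetical stand-in with made-up labels and values.

```python
import pandas as pd

# Hypothetical stand-in data for illustration.
df = pd.DataFrame(
    {
        "capital": ["Brasilia", "Moscow", "New Delhi"],
        "population": [200.4, 143.5, 1252.0],
    },
    index=["BR", "RU", "IN"],
)

a = df.loc["IN", "capital"]     # form 1: row and column label in one loc call
b = df["capital"].loc["IN"]     # form 2: pick the column first, then the row
c = df.loc["IN"]["capital"]     # form 3: pick the row first, then the column

print(a, b, c)
```

All three return the same element here. The first form does the lookup in a single step, which is generally the clearest choice.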

Lab:


Objective:

  • Practice importing data into Python as a Pandas DataFrame.

  • Practice accessing rows and columns.


Lab content:



CSV to DataFrame1


Preface:

The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data, where you can label the rows and the columns.

In the exercises that follow, you will be working with vehicle data in different countries. Each observation corresponds to a country, and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on. This data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.

To import CSV data into Python as a Pandas DataFrame, you can use read_csv().

Instructions:

  • To import CSV files, you still need the pandas package: import it as pd.

  • Use pd.read_csv() to import cars.csv data as a DataFrame. Store this dataframe as cars.

  • Print out cars. Does everything look OK?


In [ ]:
"""
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")

# Print out cars
print(cars)
"""

CSV to DataFrame2


Preface:

We have a slight problem: the row labels were imported as another column, one that has no name.

To fix this, we are going to pass the argument index_col = 0 to read_csv(). This argument specifies which column in the CSV file should be used as the row labels.
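The effect can be sketched without cars.csv itself. The CSV text below is a made-up stand-in whose first column has an empty header, and io.StringIO replaces the file path; the values are illustrative only.

```python
import io

import pandas as pd

# Hypothetical CSV whose first column (the row labels) has no header name.
csv_text = """,country,drives_right
JAP,Japan,False
AUS,Australia,False
EG,Egypt,True
"""

# Without index_col, the labels become an ordinary column named "Unnamed: 0".
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row labels instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(without_index)
print(with_index)
```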

Instructions:

  1. Run the code with Submit Answer and confirm that the first column should actually be used as row labels.

  2. Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels.

  3. Has the printout of cars improved now?




In [3]:
"""
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv", index_col=0)

# Print out cars
print(cars)
"""


Out[3]:
'\n# Import pandas as pd\nimport pandas as pd\n\n# Import the cars.csv data: cars\ncars = pd.read_csv("cars.csv", index_col=0)\n\n# Print out cars\nprint(cars)\n'

Square Brackets


Preface

Selecting columns can be done in two ways.

  1. variable_containing_CSV_file['column-name']

  2. variable_containing_CSV_file[['column-name']]

The former gives a Pandas Series, whereas the latter gives a Pandas DataFrame.
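The difference is easy to check with type(). The sketch below assumes a small hand-built stand-in for the cars data (the values are illustrative, not the contents of cars.csv).

```python
import pandas as pd

# Hypothetical stand-in for the cars data.
cars = pd.DataFrame(
    {
        "country": ["Japan", "Australia", "Egypt"],
        "drives_right": [False, False, True],
    },
    index=["JAP", "AUS", "EG"],
)

as_series = cars["country"]     # single brackets: one-dimensional Series
as_frame = cars[["country"]]    # double brackets: one-column DataFrame

print(type(as_series))
print(type(as_frame))
```

The inner brackets in the second form are an ordinary Python list of column names, which is why the same syntax extends to several columns at once.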

Instructions:

  • Use single square brackets to print out the country column of cars as a Pandas Series.

  • Use double square brackets to print out the country column of cars as a Pandas DataFrame. Do this by putting country in two square brackets this time.


In [ ]:
"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out country column as Pandas Series
print( cars['country'])

# Print out country column as Pandas DataFrame
print( cars[['country']])
"""

Loc1


With loc we can do practically any data selection operation on DataFrames you can think of.

loc is label-based, which means that you have to specify rows and columns by their row and column labels.

Instructions:

  • Use loc to select the observation corresponding to Japan as a Series. The label of this row is JAP. Make sure to print the resulting Series.
  • Use loc to select the observations for Australia and Egypt as a DataFrame.

In [ ]:
"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out observation for Japan
print( cars.loc['JAP'] )

# Print out observations for Australia and Egypt
print( cars.loc[ ['AUS', 'EG'] ])
"""

Loc2


loc also allows us to select both rows and columns from a DataFrame at the same time.

Instructions:

  • Print out the drives_right value of the row corresponding to Morocco (its row label is MOR)
  • Print out a sub-DataFrame, containing the observations for Russia and Morocco and the columns country and drives_right.
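One possible sketch of these two selections, using a hand-built stand-in for the cars data. The row label MOR comes from the instructions; the label RU for Russia, the cars_per_cap column, and all values are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the cars data; "RU" and "cars_per_cap" are assumed.
cars = pd.DataFrame(
    {
        "cars_per_cap": [200, 70, 45],
        "country": ["Russia", "Morocco", "Egypt"],
        "drives_right": [True, True, True],
    },
    index=["RU", "MOR", "EG"],
)

# Single element: row label first, then column label.
morocco_drives_right = cars.loc["MOR", "drives_right"]
print(morocco_drives_right)

# Sub-DataFrame: a list of row labels and a list of column labels.
sub = cars.loc[["RU", "MOR"], ["country", "drives_right"]]
print(sub)
```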