As data scientists, we have to work with "big" data of different types.
We've seen how 2D NumPy arrays let us compute on data much more efficiently, but their one downside is that all elements must be of the same type.
This is where the Pandas package comes in. So what's in Pandas?
The concept of "Data Frame" objects.
More specifically, they are tables,
with "rows" representing "observations" and
"columns" representing "variables".
Each row has a unique label, and the same goes for columns.
Columns can have different types.
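Although data frames are rarely built by hand, a tiny hand-made one makes the points above concrete. This is only a sketch: the column names, row labels and values are made up for illustration and are not part of any dataset used in these notes.

```python
import pandas as pd

# Hypothetical data, invented purely for illustration
data = {
    "name": ["alpha", "beta", "gamma"],   # text
    "count": [3, 5, 8],                   # integers
    "ratio": [0.5, 1.25, 2.0],            # floats
    "active": [True, False, True],        # booleans
}

# Each row gets a unique label through the index argument
df = pd.DataFrame(data, index=["a", "b", "c"])

print(df)         # the table itself
print(df.dtypes)  # one type per column, and they differ between columns
```

Unlike a 2D NumPy array, each column here keeps its own type.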
We typically don't make data frames manually.
Instead, we import data from external files, such as .csv (comma-separated values) files.
We do this by first importing the pandas package:
import pandas as pd
Again, pd is an "alias".
Now we can use a built-in function that comes packaged with pandas, called:
read_csv(<path to .csv file>)
We will use the pandas package to read the "brics dataset" into Python. Let's see what the data frames look like:
In :# import the pandas package
import pandas as pd

# load in the dataset and save it to the brics variable
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv")
brics

# we can make the table look better by adding the parameter index_col = 0
brics = pd.read_csv("C:/Users/pySag/Documents/GitHub/Computer-Science/Courses/DAT-208x/Datasets/BRICS_cummulative.csv", index_col=0)
brics  # notice how the default integer row indexes are now replaced by the first column
                    Original Principal Amount  Effective Date (Most Recent)
Country
China                         $48185147142.81                           430
South Africa                   $4052800000.00                            14
Russian Federation            $14451100000.00                            84
Brazil                        $59782439627.00                           448
India                         $55556320000.00                           295
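The effect of index_col can be tried without any file on disk. The sketch below assumes a small, made-up CSV held in a string and wraps it in io.StringIO so that read_csv can treat it like a file; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real .csv file
csv_text = """label,amount,days
x,10.5,3
y,20.0,7
z,5.25,1
"""

# Without index_col, pandas adds its own integer row labels (0, 1, 2)
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column of the file becomes the row labels
labelled = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(labelled)
```

In the second data frame, "label" is no longer an ordinary column; its values are the row labels.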
One of the most useful features of pandas is the ease with which we can select rows and columns in different ways. Here's how we do it:
To access the columns, there are several ways we can do it, including:
<data_set_var>[ "column-name" ]
<data_set_var>.<column-name>
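Both forms of column access can be compared on a small example. This is a sketch on a made-up data frame; the names df, amount and days are hypothetical.

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({"amount": [10.5, 20.0], "days": [3, 7]}, index=["x", "y"])

by_brackets = df["amount"]   # square-bracket access
by_attribute = df.amount     # attribute (dot) access

# Both forms return the same column
print(by_brackets.equals(by_attribute))
```

Attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method, so square brackets are the safer general choice.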
We can add columns too, say, to rank them:
<data_set_var>[ "new-column-name" ] = <list of values>
In :# Add a new column
brics["on_earth"] = [True, True, True, True, True]

# Print them
brics
                    Original Principal Amount  Effective Date (Most Recent)  on_earth
Country
China                         $48185147142.81                           430      True
South Africa                   $4052800000.00                            14      True
Russian Federation            $14451100000.00                            84      True
Brazil                        $59782439627.00                           448      True
India                         $55556320000.00                           295      True
In :# Manipulating Columns
"""Columns can be manipulated using arithmetic operations
on other columns"""
Out:'Columns can be manipulated using arithmetic operations\non other columns'
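As a sketch of what such column arithmetic can look like, the example below derives a new column from two existing ones. The data frame and the column names (amount, days, amount_per_day) are made up for illustration.

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame(
    {"amount": [100.0, 250.0, 40.0], "days": [4, 10, 8]},
    index=["x", "y", "z"],
)

# Arithmetic between columns is applied row by row
df["amount_per_day"] = df["amount"] / df["days"]

print(df)
```

The division is carried out for every row at once, so no explicit loop is needed.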
To access rows, use loc with the row label:
dataframe.loc[ "row-name" ]
To get just one element in the table, we can specify both the row and the column label in the selection:
dataframe.loc[ "row-name", "column-name" ]
dataframe[ "column-name" ].loc[ "row-name" ]
dataframe.loc[ "row-name" ][ "column-name" ]
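The three forms above can be checked against each other on a small example. This is a sketch on a made-up data frame; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data frame for illustration
df = pd.DataFrame({"amount": [10.5, 20.0], "days": [3, 7]}, index=["x", "y"])

a = df.loc["y", "amount"]    # row label and column label in one loc call
b = df["amount"].loc["y"]    # pick the column first, then the row
c = df.loc["y"]["amount"]    # pick the row first, then the column

print(a, b, c)
```

All three reach the same cell. The first form does it in a single lookup, while the other two first build an intermediate Series.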
The DataFrame is one of Pandas' most important data structures. It's basically a way to store tabular data, where you can label the rows and the columns.
In the exercises that follow, you will be working with vehicle data in different countries. Each observation corresponds to a country, and the columns give information about the number of vehicles per capita, whether people drive left or right, and so on. This data is available in a CSV file, named cars.csv. It is available in your current working directory, so the path to the file is simply 'cars.csv'.
To import CSV data into Python as a Pandas DataFrame, you can use read_csv().
To import CSV files, you still need the pandas package: import it as pd.
Use pd.read_csv() to import cars.csv data as a DataFrame. Store this dataframe as cars.
Print out cars. Does everything look OK?
In [ ]:"""
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")

# Print out cars
print(cars)
"""
We have a slight problem: the row labels are imported as another column that has no name.
To fix this issue, we are going to pass the argument index_col = 0 to read_csv(). It specifies which column in the CSV file should be used as the row labels.
Run the code and confirm that the first column should actually be used as row labels.
Specify the index_col argument inside pd.read_csv(): set it to 0, so that the first column is used as row labels.
Has the printout of cars improved now?
In :"""
# Import pandas as pd
import pandas as pd

# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv", index_col=0)

# Print out cars
print(cars)
"""
Out:'\n# Import pandas as pd\nimport pandas as pd\n\n# Import the cars.csv data: cars\ncars = pd.read_csv("cars.csv", index_col=0)\n\n# Print out cars\nprint(cars)\n'
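Selecting columns can be done in two ways: with single square brackets, as in cars['country'], or with double square brackets, as in cars[['country']].
The former gives a pandas Series, whereas the latter gives a pandas DataFrame.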
Use single square brackets to print out the country column of cars as a Pandas Series.
Use double square brackets to print out the country column of cars as a Pandas DataFrame. Do this by putting country in two square brackets this time.
In [ ]:"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out country column as Pandas Series
print(cars['country'])

# Print out country column as Pandas DataFrame
print(cars[['country']])
"""
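The difference between single and double square brackets can be verified with type(). Since cars.csv is not reproduced in these notes, the sketch below assumes a small made-up data frame; its labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the cars data
df = pd.DataFrame(
    {"country": ["A-land", "B-land"], "value": [1, 2]},
    index=["r1", "r2"],
)

as_series = df["country"]     # single brackets: one column as a Series
as_frame = df[["country"]]    # double brackets: a one-column DataFrame

print(type(as_series))
print(type(as_frame))
```

The inner pair of brackets is just a Python list of column names, which is why the result stays a DataFrame even with a single name in it.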
With loc we can do practically any data selection operation on DataFrames you can think of.
loc is label-based, which means that you have to specify rows and columns based on their row and column labels.
Use loc to select the observation corresponding to Japan as a Series. The label of this row is JAP. Make sure to print the resulting Series.
Use loc to select the observations for Australia and Egypt as a DataFrame.
In [ ]:"""
# Import cars data
import pandas as pd
cars = pd.read_csv('cars.csv', index_col = 0)

# Print out observation for Japan
print(cars.loc['JAP'])

# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
"""
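The same pattern can be run without the cars.csv file. This sketch assumes a made-up data frame with hypothetical row labels r1 to r3; it only illustrates that a single label gives a Series and a list of labels gives a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data with labelled rows
df = pd.DataFrame(
    {"value": [1, 2, 3], "flag": [True, False, True]},
    index=["r1", "r2", "r3"],
)

one_row = df.loc["r2"]             # a single label: the row comes back as a Series
two_rows = df.loc[["r1", "r3"]]    # a list of labels: the rows come back as a DataFrame

print(one_row)
print(two_rows)
```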
loc also allows us to select both rows and columns from a DataFrame.
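A minimal sketch of combined row and column selection with loc, again on a made-up data frame (all labels and values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in data with labelled rows
df = pd.DataFrame(
    {"value": [1, 2, 3], "flag": [True, False, True]},
    index=["r1", "r2", "r3"],
)

# One row label and one column label: a single value
single_value = df.loc["r2", "flag"]

# Lists of row and column labels: a smaller DataFrame
sub_frame = df.loc[["r1", "r3"], ["value"]]

print(single_value)
print(sub_frame)
```

The row selection always comes first inside the brackets, followed by the column selection after the comma.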
Print out the drives_right value of the row corresponding to Morocco (its row label is