Data Analysis with Pandas Dataframe

Pandas is a popular library for manipulating vectors, tables, and time series. We will frequently use Pandas data structures instead of the built-in Python data structures, as they provide much richer functionality. Pandas is also fast, which makes working with large datasets easier. See the official pandas website at http://pandas.pydata.org/

Pandas provides two main data structures:

  • the Series, which represents a single column of data, similar to a Python list. The Series is the most fundamental data structure in Pandas.
  • the DataFrame, which represents multiple Series that share one index

Older versions also offered the Panel, which represented multiple data frames; it was removed in pandas 0.25.
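
As a minimal sketch of how a Series and a DataFrame relate, the cell below builds both from made-up values. The column names and numbers are illustrative only and are not taken from the GLAD dataset.

```python
import pandas as pd

# A Series: one labelled column of values (values here are made up).
speeds = pd.Series([0.12, 0.30, 0.25], name='speed')

# A DataFrame: several columns that share one row index.
frame = pd.DataFrame({
    'speed': [0.12, 0.30, 0.25],
    'depth': [1.0, 1.0, 1.5],
})

# Selecting one column of a DataFrame gives back a Series.
column = frame['speed']
print(type(column).__name__)
print(frame.shape)
```

Selecting a column with square brackets is the pattern used throughout the rest of this notebook.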

Today we will mainly work with the DataFrame.


In [ ]:
import pandas as pd

Data I/O

Data cleaning...


In [ ]:
glad = pd.read_csv('./GLAD_15min_filtered_S1_41days_sample.csv')
glad_orig = glad.copy()  # assumed: untouched copy of the loaded table, used in later cells

In [ ]:
glad

In [ ]:
glad.shape

First and last five rows.


In [ ]:
glad_orig.head()

In [ ]:
glad_orig.tail()

Add or delete columns, and write the data to a .csv file, each with a single command.


In [ ]:
import numpy as np
np.zeros(240000)

In [ ]:
glad['temperature'] = np.zeros(len(glad))  # one zero per row, whatever the table length

In [ ]:
glad.head()

In [ ]:
del glad['vel_Error']

In [ ]:
del glad['Pos_Error']

In [ ]:
glad.head()

In [ ]:
glad.to_csv('./test.csv')

In [ ]:
glad.to_csv?

In [ ]:
glad.to_csv('./test_without_index.csv', index = False)
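
The effect of index = False can be seen without touching the disk. The sketch below uses a small made-up table as a stand-in for the drifter data and writes to in-memory buffers instead of files; the column values are illustrative.

```python
import io
import numpy as np
import pandas as pd

# Toy table standing in for the drifter data (values are made up).
toy = pd.DataFrame({'ID': ['A', 'B'], 'Latitude': [28.0, 28.5]})

# Add a column filled with zeros, sized from the table itself.
toy['temperature'] = np.zeros(len(toy))

# Delete a column; drop() returns a new table and leaves `toy` unchanged.
trimmed = toy.drop(columns=['Latitude'])

# Write to in-memory buffers to compare the two outputs.
with_index = io.StringIO()
without_index = io.StringIO()
trimmed.to_csv(with_index)
trimmed.to_csv(without_index, index=False)

print(with_index.getvalue())
print(without_index.getvalue())
```

With the default settings the row index is written as an unnamed first column, so reading the file back adds an extra column unless the index is dropped at write time or restored at read time.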

Indexing and Slicing

.iloc[ ] : indexing by integer position

.loc[ ] : indexing by index label
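
The difference between the two indexers is easiest to see on a table whose row labels are not integers. This is a small sketch with made-up values; the labels 'a', 'b', 'c' are hypothetical.

```python
import pandas as pd

# Toy table with string row labels (values are made up).
toy = pd.DataFrame(
    {'Latitude': [28.0, 28.5, 29.0], 'Longitude': [-88.0, -87.5, -87.0]},
    index=['a', 'b', 'c'],
)

first_by_position = toy.iloc[0]    # first row, whatever its label is
first_by_label = toy.loc['a']      # row whose label is 'a'

# Position slices exclude the end point; label slices include it.
by_position = toy.iloc[0:2]        # rows 'a' and 'b'
by_label = toy.loc['a':'b']        # rows 'a' and 'b'

print(first_by_position.equals(first_by_label))
print(by_position.equals(by_label))
```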


In [ ]:
glad.iloc[0]

The .iloc indexer also accepts a slice or an array of positions.


In [ ]:
glad_orig.iloc[:10]

Access the underlying data as a NumPy array using .values.


In [ ]:
glad_orig.iloc[0].values

In this case, indexing by position may not be practical. Instead, we can designate the 'ID' column, which labels each row with its drifter, as the index. Picking a column to serve as the index is a common operation. When indexing the data frame with .loc, designate both the rows and the columns explicitly, even if one of them is just a colon (':').


In [ ]:
glad_id = glad.set_index('ID')

In [ ]:
glad_id.head()

In [ ]:
glad_id.loc['CARTHE_021']
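
The same pattern can be tried on a small made-up table. In this sketch the drifter IDs 'D1' and 'D2' and all coordinates are hypothetical; the point is that an index label may repeat, and .loc then returns every row carrying that label.

```python
import pandas as pd

# Toy table: two hypothetical drifters with two position fixes each.
toy = pd.DataFrame({
    'ID': ['D1', 'D1', 'D2', 'D2'],
    'Latitude': [28.0, 28.1, 29.0, 29.2],
    'Longitude': [-88.0, -88.1, -87.0, -87.3],
})

# set_index returns a new table; `toy` itself is left unchanged.
toy_id = toy.set_index('ID')

# A repeated label selects every row that carries it.
d1_rows = toy_id.loc['D1', :]            # all columns for drifter D1
d1_lat = toy_id.loc['D1', 'Latitude']    # one column for drifter D1

print(d1_rows.shape)
print(d1_lat.tolist())
```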

Use .values to access the data stored in the dataframe.


In [ ]:
lat = glad_id.loc['CARTHE_021', 'Latitude'].values
lat

In [ ]:
lon = glad_id.loc['CARTHE_021', 'Longitude'].values
lon

Plotting with matplotlib and cartopy


In [ ]:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

In [ ]:
plt.figure(figsize = (6, 8))
min_lat, max_lat = 23, 30.5
min_lon, max_lon = -91.5, -85
ax = plt.axes(projection = ccrs.PlateCarree())
ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
ax.coastlines(resolution = '50m', color = 'black')
ax.gridlines(crs = ccrs.PlateCarree(), draw_labels = True, color = 'grey')
ax.plot(lon, lat)

How can we plot every drifter trajectory in one figure, also known as a spaghetti plot?

Grouping Data Frames

To aggregate the data of each drifter, we can use the groupby method and specify which column to group by. In this case, we group by 'ID'.


In [ ]:
drifter_grouped = glad.groupby('ID')
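
As a rough illustration of what a grouped object offers, the sketch below groups a made-up table by 'ID', loops over the groups, and computes a per-group mean. The IDs and latitudes are hypothetical.

```python
import pandas as pd

# Toy table with two hypothetical drifters (values are made up).
toy = pd.DataFrame({
    'ID': ['D1', 'D2', 'D1', 'D2'],
    'Latitude': [28.0, 29.0, 28.2, 29.4],
})

grouped = toy.groupby('ID')

# Iterating over a grouped object yields (group name, sub-table) pairs.
rows_per_id = {}
for drifter_id, rows in grouped:
    rows_per_id[drifter_id] = len(rows)

# Aggregations are applied group by group.
mean_lat = grouped['Latitude'].mean()

print(rows_per_id)
print(mean_lat)
```

Looping over the grouped object directly is an alternative to looking up each ID by label, and it hands back the rows of each drifter as a ready-made sub-table.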

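A dictionary is a collection of key-value pairs that is changeable and indexed by key (since Python 3.7 it also preserves insertion order). Each value can be of a different type, such as a number, string, or list. The .groups attribute of a grouped data frame is such a dictionary: it maps each group name to the row labels belonging to that group.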


In [ ]:
drifter_grouped.groups

The group names are the keys of this dictionary.


In [ ]:
drifter_grouped.groups.keys()

You can access an item of a dictionary by giving its key inside square brackets.


In [ ]:
drifter_grouped.groups['CARTHE_021']
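
For readers new to dictionaries, here is a plain-Python sketch of the same operations. The keys and row numbers are hypothetical and only mimic the shape of the .groups mapping.

```python
# A plain dictionary mapping hypothetical drifter IDs to row numbers.
rows_by_id = {'D1': [0, 2], 'D2': [1, 3]}

print(list(rows_by_id.keys()))
print(rows_by_id['D1'])

# Values may be of different types, and entries can be added or changed.
rows_by_id['D3'] = 'no data'

# items() yields (key, value) pairs in insertion order.
for drifter_id, rows in rows_by_id.items():
    print(drifter_id, rows)
```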

Iterate over the keys of the dictionary above to access the coordinates of each drifter.


In [ ]:
drifter_ids = drifter_grouped.groups.keys()

In [ ]:
for drifter_id in drifter_ids:
    print(drifter_id)

In [ ]:
glad_id.head()

In [ ]:
plt.figure(figsize = (6, 8))
min_lat, max_lat = 23, 30.5
min_lon, max_lon = -91.5, -85
ax = plt.axes(projection = ccrs.PlateCarree())
ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
ax.coastlines(resolution = '50m', color = 'black')
ax.gridlines(crs = ccrs.PlateCarree(), draw_labels = True, color = 'grey')
for drifter_id in drifter_ids:
    lon = glad_id.loc[drifter_id, 'Longitude'].values
    lat = glad_id.loc[drifter_id, 'Latitude'].values
    ax.plot(lon, lat)

Select data in a certain time period.

Set the date as index.


In [ ]:
glad_date = glad_orig.set_index('Date')

In [ ]:
glad_date.head()

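Check the type of the "Date" index. Because the file was read without date parsing, it is expected to be a plain Index of strings rather than a DatetimeIndex.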


In [ ]:
glad_date.index

pd.date_range returns a DatetimeIndex of evenly spaced dates, one per day by default.


In [ ]:
date_range = pd.date_range(start = '2012-07-22', end = '2012-08-05')  # keep it for the next cell
date_range

In [ ]:
glad_date.loc[date_range,:]

Use the .strftime() method to convert the DatetimeIndex to an Index of date strings, so that the labels have the same form as the entries of the "Date" index.


In [ ]:
first_day, last_day = '2012-07-22', '2012-08-05'  # same period as in the date_range call above
date_range = pd.date_range(start = first_day, end = last_day).strftime("%Y-%m-%d")

In [ ]:
date_range

In [ ]:
glad_selected = glad_date.loc[date_range,:]
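
The date selection can be checked on a small made-up table. This sketch assumes the dates are stored as 'YYYY-MM-DD' strings, which may differ from the actual format in the GLAD file. It also shows one possible alternative: converting the index with pd.to_datetime and slicing by date strings.

```python
import pandas as pd

# Toy table whose dates are stored as strings, as they would be after
# read_csv without date parsing (values are made up).
toy = pd.DataFrame(
    {'Latitude': [28.0, 28.1, 28.2, 28.3]},
    index=pd.Index(['2012-07-21', '2012-07-22', '2012-07-23', '2012-07-24'],
                   name='Date'),
)

# date_range yields Timestamps; strftime turns them into matching strings.
days = pd.date_range(start='2012-07-22', end='2012-07-23')
day_labels = days.strftime('%Y-%m-%d')
selected = toy.loc[day_labels, :]

# Alternative: convert the index to datetimes, then slice by date strings.
toy_dt = toy.copy()
toy_dt.index = pd.to_datetime(toy_dt.index)
sliced = toy_dt.loc['2012-07-22':'2012-07-23']

print(selected['Latitude'].tolist())
print(sliced['Latitude'].tolist())
```

Selecting with a list of labels raises a KeyError in recent pandas versions if any label is absent from the index, whereas a slice on a sorted DatetimeIndex simply returns whatever falls inside the range.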

In [ ]: