Data Analysis with Pandas DataFrames

Pandas is a popular library for manipulating vectors, tables, and time series. We will frequently use Pandas data structures instead of the built-in Python data structures, as they provide much richer functionality. Pandas is also fast, because most of its operations run as vectorized array code rather than Python loops, which makes working with large datasets easier. Check out the official pandas website at http://pandas.pydata.org/

Pandas has historically provided three data structures:

the series, which represents a single column of data, similar to a Python list. The series is the most fundamental data structure in Pandas.
the data frame, which represents a table made of multiple series
the panel, which represented multiple data frames (deprecated and then removed in pandas 0.25, so it is not available in current versions)

Today we will mainly work with data frames.
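
As a minimal sketch of how a series and a data frame relate, the example below builds both by hand. The column names and values are made up for illustration and are not taken from the GLAD file.

```python
import pandas as pd

# A Series is a single labeled column of values.
lat = pd.Series([28.1, 28.2, 28.3], name='Latitude')

# A DataFrame is a table whose columns are Series sharing one index.
df = pd.DataFrame({
    'Latitude': [28.1, 28.2, 28.3],
    'Longitude': [-88.5, -88.4, -88.3],
})

# Selecting one column of a DataFrame gives back a Series.
col = df['Latitude']
print(type(col).__name__)
print(df.shape)
```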



In [ ]:

    
import pandas as pd

Data I/O

Data cleaning...



In [ ]:

    
glad_orig = pd.read_csv('./GLAD_15min_filtered_S1_41days_sample.csv')
glad = glad_orig.copy()  # working copy; glad_orig keeps the data as loaded



In [ ]:

    
glad



In [ ]:

    
glad.shape

First and last five rows.



In [ ]:

    
glad_orig.head()



In [ ]:

    
glad_orig.tail()

Columns can be added or deleted with a single statement, and the data can be written to a .csv file with a single call.



In [ ]:

    
import numpy as np
np.zeros(240000)



In [ ]:

    
glad['temperature'] = np.zeros(len(glad))  # one value per row



In [ ]:

    
glad.head()



In [ ]:

    
del glad['vel_Error']



In [ ]:

    
del glad['Pos_Error']



In [ ]:

    
glad.head()
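
The same add-and-delete steps can be sketched on a small made-up table. This is only an illustration under assumed column names: it sizes the new column from the table itself, and it uses drop, which returns a new data frame instead of modifying the original the way del does.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row table standing in for the drifter data.
df = pd.DataFrame({'ID': ['A', 'B'], 'Pos_Error': [1.0, 2.0]})

# New column sized to the table rather than to a hard-coded length.
df['temperature'] = np.zeros(len(df))

# drop returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=['Pos_Error'])
print(list(trimmed.columns))
```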



In [ ]:

    
glad.to_csv('./test.csv')



In [ ]:

    
glad.to_csv?



In [ ]:

    
glad.to_csv('./test_without_index.csv', index = False)
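
To see what index = False changes, the sketch below writes a small made-up table to a string instead of a file (to_csv returns the text when no path is given) and reads it back. The table contents are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row table standing in for the drifter data.
df = pd.DataFrame({'ID': ['A', 'B'], 'Latitude': [28.1, 28.2]})

with_index = df.to_csv()                # default: row labels are written
without_index = df.to_csv(index=False)  # row labels are left out

print(with_index.splitlines()[0])       # header starts with an empty field
print(without_index.splitlines()[0])

# Reading the first version back yields an extra 'Unnamed: 0' column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```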

Indexing and Slicing

.iloc[ ] : indexing by integer position (0, 1, 2, ...)

.loc[ ] : indexing by label (the values of the index and the column names)
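
The difference is easiest to see when the row labels are not 0, 1, 2. The following is a small sketch on a made-up table, not on the GLAD data.

```python
import pandas as pd

# Hypothetical frame with non-default integer labels, so that
# position and label clearly differ.
df = pd.DataFrame({'value': [10, 20, 30]}, index=[100, 200, 300])

first_by_position = df.iloc[0]    # row at position 0
first_by_label = df.loc[100]      # row whose label is 100

# Slices differ too: .iloc excludes the end position,
# while .loc includes the end label.
by_position = df.iloc[0:1]        # position 0 only
by_label = df.loc[100:200]        # labels 100 and 200
```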



In [ ]:

    
glad.iloc[0]

.iloc also accepts a slice or an array of positions.



In [ ]:

    
glad_orig.iloc[:10]

Use .values to get the underlying data as a NumPy array.



In [ ]:

    
glad_orig.iloc[0].values

With this many rows, indexing by position may not be practical. Instead, we can designate the 'ID' column as the index, so that rows can be looked up by drifter ID. Picking a column to serve as the index is a common operation. When indexing the data frame with .loc, explicitly give both the row and the column selection, even if one of them is just a colon (':').



In [ ]:

    
glad_id = glad.set_index('ID')



In [ ]:

    
glad_id.head()



In [ ]:

    
glad_id.loc['CARTHE_021']

Use .values to access the data stored in the dataframe.



In [ ]:

    
lat = glad_id.loc['CARTHE_021', 'Latitude'].values
lat



In [ ]:

    
lon = glad_id.loc['CARTHE_021', 'Longitude'].values
lon
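
One caveat, sketched on made-up data: .loc with a single label returns a Series only when that label occurs on several rows. If a drifter had just one row, the same expression would return a scalar, for which .values is not meaningful. Passing the label inside a list always returns a Series. The example also uses .to_numpy(), which gives the same array as .values.

```python
import pandas as pd

# Hypothetical data: drifter 'A' has two rows, drifter 'B' only one.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Latitude': [28.1, 28.2, 27.5],
}).set_index('ID')

lat_a = df.loc['A', 'Latitude']           # Series, because 'A' repeats
lat_b = df.loc['B', 'Latitude']           # scalar, because 'B' occurs once

# A list of labels always returns a Series, even for a single row.
lat_b_series = df.loc[['B'], 'Latitude']

arr_a = lat_a.to_numpy()
arr_b = lat_b_series.to_numpy()
```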

Plotting with matplotlib and cartopy



In [ ]:

    
import matplotlib.pyplot as plt
import cartopy.crs as ccrs



In [ ]:

    
plt.figure(figsize = (6, 8))
min_lat, max_lat = 23, 30.5
min_lon, max_lon = -91.5, -85
ax = plt.axes(projection = ccrs.PlateCarree())
ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
ax.coastlines(resolution = '50m', color = 'black')
ax.gridlines(crs = ccrs.PlateCarree(), draw_labels = True, color = 'grey')
ax.plot(lon, lat)

How can we plot every drifter trajectory, also known as a spaghetti plot?

Grouping Data Frames

To aggregate the data of each drifter, we can use the groupby method and specify which column to group by. In this case, we group by 'ID'.



In [ ]:

    
drifter_grouped = glad.groupby('ID')
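
As a minimal sketch of what a grouped object can do, the example below counts rows and averages a column per group. The table is made up for illustration.

```python
import pandas as pd

# Hypothetical table: two rows for drifter 'A', one for drifter 'B'.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Latitude': [28.0, 29.0, 27.5],
})

grouped = df.groupby('ID')

n_rows = grouped.size()                 # number of rows per drifter
mean_lat = grouped['Latitude'].mean()   # mean latitude per drifter
```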

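The .groups attribute of the grouped object is a dictionary. A dictionary is a collection of key-value pairs that is changeable and indexed by key (since Python 3.7 it also preserves insertion order). The values can be of any type, such as numbers, strings, or lists.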



In [ ]:

    
drifter_grouped.groups

The keys of this dictionary are the group names, here the drifter IDs.



In [ ]:

    
drifter_grouped.groups.keys()

You can access an item of a dictionary by giving its key inside square brackets.



In [ ]:

    
drifter_grouped.groups['CARTHE_021']
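
On a small made-up table, the structure of .groups is easy to inspect: each key is a group name and each value holds the row labels that belong to that group.

```python
import pandas as pd

# Hypothetical table with default row labels 0, 1, 2.
df = pd.DataFrame({'ID': ['A', 'A', 'B'], 'Latitude': [28.0, 28.1, 27.5]})
groups = df.groupby('ID').groups

drifter_ids = list(groups.keys())   # the group names
rows_of_a = list(groups['A'])       # row labels belonging to drifter 'A'
```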

Iterate over the keys of the dictionary above to access the coordinates of each drifter.



In [ ]:

    
drifter_ids = drifter_grouped.groups.keys()



In [ ]:

    
for drifter_id in drifter_ids:
    print(drifter_id)
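
As an alternative sketch, a grouped object can be iterated directly, which yields each group name together with the rows of that group, so no separate lookup by ID is needed. The table and the tracks dictionary below are made up for illustration.

```python
import pandas as pd

# Hypothetical table with two drifters.
df = pd.DataFrame({
    'ID': ['A', 'A', 'B'],
    'Longitude': [-88.0, -87.9, -90.0],
    'Latitude': [28.0, 28.1, 27.5],
})

tracks = {}
for drifter_id, group in df.groupby('ID'):
    # group is a DataFrame holding only this drifter's rows
    tracks[drifter_id] = (group['Longitude'].to_numpy(),
                          group['Latitude'].to_numpy())
```

Each entry of tracks then holds the longitude and latitude arrays that a call such as ax.plot(lon, lat) expects.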



In [ ]:

    
glad_id.head()



In [ ]:

    
plt.figure(figsize = (6, 8))
min_lat, max_lat = 23, 30.5
min_lon, max_lon = -91.5, -85
ax = plt.axes(projection = ccrs.PlateCarree())
ax.set_extent([min_lon, max_lon, min_lat, max_lat], ccrs.PlateCarree())
ax.coastlines(resolution = '50m', color = 'black')
ax.gridlines(crs = ccrs.PlateCarree(), draw_labels = True, color = 'grey')
for drifter_id in drifter_ids:
    lon = glad_id.loc[drifter_id, 'Longitude'].values
    lat = glad_id.loc[drifter_id, 'Latitude'].values
    ax.plot(lon, lat)

Select data in a certain time period.

Set the date as index.



In [ ]:

    
glad_date = glad_orig.set_index('Date')



In [ ]:

    
glad_date.head()

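Inspect the "Date" index. Because read_csv was called without parse_dates, the dates were read as strings, so this is a plain Index of date strings rather than a DatetimeIndex.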



In [ ]:

    
glad_date.index

pd.date_range returns a DatetimeIndex covering the given period, with one entry per day by default.



In [ ]:

    
date_range = pd.date_range(start = '2012-07-22', end = '2012-08-05')
date_range



In [ ]:

    
glad_date.loc[date_range,:]

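If the "Date" labels are strings, the Timestamp values of a DatetimeIndex will not match them. Use the .strftime() method to convert the DatetimeIndex to an Index of date strings.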



In [ ]:

    
first_day, last_day = '2012-07-22', '2012-08-05'
date_range = pd.date_range(start = first_day, end = last_day).strftime("%Y-%m-%d")



In [ ]:

    
date_range
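
The conversion can be checked on a short three-day range, chosen here only for illustration:

```python
import pandas as pd

days = pd.date_range(start='2012-07-22', end='2012-07-24')
labels = days.strftime('%Y-%m-%d')

print(type(days).__name__)     # DatetimeIndex
print(type(labels).__name__)   # Index
print(list(labels))            # ['2012-07-22', '2012-07-23', '2012-07-24']
```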



In [ ]:

    
glad_selected = glad_date.loc[date_range,:]
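
Another option, sketched here on made-up data, is to parse the date strings into timestamps first. With a sorted DatetimeIndex, a period can be selected with a label slice, and not every day of the period has to be present in the data. This is an alternative to the approach above, not what the notebook itself does.

```python
import pandas as pd

# Hypothetical table with dates stored as strings, as read from a CSV.
df = pd.DataFrame({
    'Date': ['2012-07-21', '2012-07-22', '2012-08-05', '2012-08-06'],
    'Latitude': [28.0, 28.1, 28.2, 28.3],
})

# Parse the strings into timestamps, then index and sort by date.
df['Date'] = pd.to_datetime(df['Date'])
by_date = df.set_index('Date').sort_index()

# Label-based slices on a DatetimeIndex include both end points.
selected = by_date.loc['2012-07-22':'2012-08-05']
```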


