What is Pandas?

Pandas is a package designed to easily ingest, manipulate, write out, and plot tabular data through its DataFrame and Series objects.

Why use Pandas?

  • Easily handles missing data
  • Efficient reading of non-uniform data tables
  • Integrated label-based slicing/indexing/subsetting operations
  • Intelligent merging and joining
  • Simple column addition or removal
  • Mixed type DataFrame objects
  • Efficient grouping
  • Time-series functionality

This makes it great for dealing with observational data such as Mesonet data.

For reference see http://pandas.pydata.org/
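As a quick taste of several of the bullets above, here is a tiny self-contained sketch (toy values, not real mesonet output):

```python
import numpy as np
import pandas as pd

# A toy DataFrame illustrating mixed types, missing data,
# and label-based operations.
df = pd.DataFrame({'station': ['REES', 'CHIL', 'REES', 'CHIL'],
                   'temp_c': [24.1, np.nan, 25.3, 23.8],
                   'wind_ms': [5.2, 4.8, np.nan, 6.1]})

print(df.dropna())                   # drop rows containing missing values
print(df['temp_c'].mean())           # NaN-aware mean: missing values skipped
print(df.groupby('station').mean())  # efficient grouping by label
```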


In [ ]:
%pylab inline

In [ ]:
import pandas as pd
import datetime as dt

First things first

We need to download some mesonet data from the West Texas Mesonet SFTP site. We can do this using the mesonet_pull.py script after adding a username and password.

We also need to figure out how to read the data from these files into arrays in a logical and efficient manner.

Needed packages

  • Paramiko
  • Pandas v0.13
  • Basemap

In [ ]:
# Check to make sure you have each of these, if not install them
# Pandas v0.13 will likely need to be installed regardless of Python distribution,
# so a quick check of the installed version number is below
import paramiko
from mpl_toolkits.basemap import Basemap
pd.__version__

Paramiko https://github.com/paramiko/paramiko

Pandas from Source https://github.com/pydata/pandas

For the two above:

  • git clone [web address]
  • navigate to the new directory
  • python setup.py install (you may need to run this as sudo or su, depending on your permissions setup)

Basemap http://matplotlib.org/basemap/users/installing.html

For Basemap, follow the installation instructions on that page


In [ ]:
# If you installed a new version of pandas you will need to reload the module 
# or restart the kernel.
pd = reload(pd)

In [ ]:
%run mesonet_sftp.py

In [ ]:
!head raw_data/REES1206.txt

Pandas has a built-in CSV reader that generates a DataFrame, with column headers assumed to be in the first row and rows indexed by integer position


In [ ]:
filename = 'raw_data/REES1206.txt'
raw = pd.read_csv(filename)

In [ ]:
raw.head(5)

However, the column headers in this case are actually part of the dataset, and indexing by date and time (currently in columns 1-2) would be much more useful.

Pandas does include nice integration of datetime parsers, but the default parsing assumes the datetime information is in a single column, so here we combine the day and time columns with a custom parser


In [ ]:
def parse(day, hrmin):
    # day is the day of year; hrmin is the time as an HHMM integer.
    # The final timedelta shifts all times forward by six hours.
    return (dt.datetime(2012, 1, 1, int(hrmin)//100, int(hrmin)%100)
              + dt.timedelta(int(day)-1) + dt.timedelta(hours=6))

In [ ]:
parsed = pd.read_csv(filename, sep=',', parse_dates=[[1,2]], 
            date_parser=parse, header=None)

In [ ]:
parsed.head(5)

The raw mesonet files contain up to three different data arrays intermixed, labeled in column '0' as 1 (meteorological), 2 (agricultural), and 3 (Reese-specific). 'groupby' operations make pulling out just one array simple and efficient.


In [ ]:
groups = parsed.groupby(0)
met_array = parsed.ix[groups.groups.get(1)]

In [ ]:
met_array.head(5)

In [ ]:
columns = ['Time', 'Array ID', 'Station ID', '10 m Scalar Wind Speed',
                   '10 m Vector Wind Speed', '10 m Wind Direction',
                   '10 m Wind Direction Std', '10 m Wind Speed Std', 
                   '10 m Gust Wind Speed', '1.5 m Temperature', 
                   '9 m Temperature', '2 m Temperature', 
                   '1.5 m Relative Humidity', 'Station Pressure', 'Rainfall', 
                   'Dewpoint', '2 m Wind Speed', 'Solar Radiation'] 
met_array.columns = columns
met_array['Station Pressure'] = met_array['Station Pressure']+600

In [ ]:
met_array.head()

In [ ]:
met_array = met_array.set_index('Time')

In [ ]:
print met_array.ix[dt.datetime(2012,6,1,6,20)]['9 m Temperature']
print met_array.ix[4, 9]

In [ ]:
met_array['9 m Temperature'].plot()

In [ ]:
met_array['9 m Temperature'][dt.datetime(2012, 6, 4):dt.datetime(2012, 6, 5)].plot()
met_array['2 m Temperature'][dt.datetime(2012, 6, 4):dt.datetime(2012, 6, 5)].plot()

We can ingest the agricultural data in a similar manner, remembering that since this array is of a different size, some columns are completely filled with NaNs and can be removed


In [ ]:
groups = parsed.groupby(0)
agr_array = parsed.ix[groups.groups.get(2)]
agr_array = agr_array.dropna(axis=1, how='all')
columns = ['Time', 'Array ID', 'Station ID', 
                   '5 cm Natural Soil Temperature', 
                   '10 cm Natural Soil Temperature', 
                   '20 cm Natural Soil Temperature', 
                   '5 cm Bare Soil Temperature', 
                   '20 cm Bare Soil Temperature', 
                   '5 cm Water Content', '20 cm Water Content', 
                   '60 cm Water Content', '75 cm Water Content', 
                   'Leaf Wetness', 'Battery Voltage', 'Program Signature']
agr_array.columns = columns
agr_array = agr_array.set_index('Time')

In [ ]:
agr_array.head()

We can also wrap these steps in a class describing mesonet array objects


In [ ]:
!cat parse_mesonet.py

Using classes makes automating the reading of mesonet data simpler

Take a look at the meso_meteogram.py script for creating a simple meteogram using the MesoArrays class
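parse_mesonet.py is not reproduced here; the following is only a rough sketch of the kind of class it could contain (the names and structure are illustrative assumptions, not the actual script):

```python
import pandas as pd

class MesoArray(object):
    """Hypothetical wrapper around one parsed mesonet file; the real
    implementation lives in parse_mesonet.py."""

    def __init__(self, frame):
        # frame: a DataFrame of parsed records with an 'Array ID' column
        self.frame = frame

    def met(self):
        # Array ID 1 marks the meteorological records
        return self.frame[self.frame['Array ID'] == 1]

    def agr(self):
        # Array ID 2 marks the agricultural records; drop all-NaN columns
        return self.frame[self.frame['Array ID'] == 2].dropna(axis=1, how='all')

# Tiny demonstration with toy records
demo = pd.DataFrame({'Array ID': [1, 2, 1], 'value': [0.5, 0.7, 0.9]})
m = MesoArray(demo)
print(len(m.met()), len(m.agr()))
```

Wrapping the parsing this way means each script only instantiates one object per file instead of repeating the groupby logic.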

Adding Data

Pandas also makes it relatively easy to add columns, such as location information, and to join different DataFrame objects intelligently
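For the joining part, a sketch with pd.merge and made-up station metadata (the cells below instead attach the coordinates by direct column assignment):

```python
import pandas as pd

# Toy observations and station metadata (illustrative values only)
obs = pd.DataFrame({'Station ID': [101, 102, 101],
                    'Temp': [24.5, 23.1, 25.0]})
locs = pd.DataFrame({'Station ID': [101, 102],
                     'Lat': [33.6, 34.4],
                     'Lon': [-102.0, -100.2]})

# Left join: every observation keeps its row and gains Lat/Lon columns
merged = obs.merge(locs, on='Station ID', how='left')
print(merged)
```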


In [ ]:
filename = 'locations.txt'
locations = pd.read_csv(filename, sep='\t')

In [ ]:
locations

In [ ]:
starttime = dt.datetime(2012,6,15,0)
endtime   = starttime + dt.timedelta(hours=24)
met_array.ix[starttime]['Station ID']

In [ ]:
tag = str(int(met_array.ix[starttime]['Station ID']))
met_array['Lat'] = (np.ones(len(met_array.index)) * 
                   locations[locations['Logger ID'] == tag]
                   ['Lat-decimal'].values)
met_array['Lon'] = (np.ones(len(met_array.index)) * 
                   locations[locations['Logger ID'] == tag]
                   ['Long.-decimal'].values)

In [ ]:
met_array

In [ ]:
import mesonet_calculations
array1 = mesonet_calculations.meso_operations('raw_data/REES1206.txt', starttime, endtime, locations)
array2 = mesonet_calculations.meso_operations('raw_data/CHIL1206.txt', starttime, endtime, locations)
new_ar = pd.concat([array1, array2], axis=0)

In [ ]:
new_ar = new_ar.set_index('Station ID', append=True)
new_ar.ix[starttime].T

This functionality is very useful for creating new arrays

An example of using this for creating surface plots from the mesonet data files is in meso_surface.py

These new arrays can also be easily saved using Pandas functions such as to_csv.


In [ ]:
new_ar.to_csv('two_arrays.csv')

In [ ]:
!head two_arrays.csv

Pandas can also handle missing data within the DataFrame, or be used to fill NaNs so the DataFrame can be passed to more complicated functions. A DataFrame with missing data can also be transformed into a numpy masked array.


In [ ]:
filename = 'miss1206.txt'

In [ ]:
missing = mesonet_calculations.meso_operations(filename, dt.datetime(2012,6,1,6,0), dt.datetime(2012,6,1,12), locations)

In [ ]:
missing.head().T

In [ ]:
missing['10 m Scalar Wind Speed'].plot()

In [ ]:
missing.ffill().head().T
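The masked-array route mentioned above can be sketched like this (a toy series, not the mesonet file):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Option 1: fill the gap forward, as in the cell above
filled = s.ffill()

# Option 2: build a numpy masked array, masking wherever the data are NaN
masked = np.ma.masked_invalid(s.values)
print(masked.mean())  # statistics use only the unmasked entries
```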

Some other tools

To quickly examine some of the other tools available, we will go back to the full Childress dataset from which we pulled an example earlier.


In [ ]:
met_array = mesonet_calculations.meso_operations('raw_data/CHIL1206.txt',
                                              dt.datetime(2012,6,1,6), 
                                              dt.datetime(2012,7,1,5,55), locations)

In [ ]:
met_array['Rainfall'].cumsum().plot()

In [ ]:
times = dt.datetime(2012,6,1,18)+dt.timedelta(days=1)*np.arange(0,30)
noons = met_array.ix[times]

In [ ]:
noons['2 m Temperature'].hist(bins=30)

In [ ]:
print noons['2 m Temperature'].mean()
print noons['2 m Temperature'].var()
print noons['2 m Temperature'].skew()
print noons['2 m Temperature'].kurt()

In [ ]:
noons['2 m Temperature'].describe()

In [ ]:
noons['2 m Temperature'].plot()
pd.rolling_mean(noons['2 m Temperature'], 5, center=True).plot()

But be careful and make sure the operations act as expected

(which is a good practice anyway)


In [ ]:
noons['2 m Temperature'].plot()
pd.rolling_window(noons['2 m Temperature'], 5, 'bartlett', center=True).plot()
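A note for anyone running this against a current pandas release: pd.rolling_mean and pd.rolling_window were later removed, and the same smoothing is now written with the Series.rolling method, e.g.:

```python
import pandas as pd

s = pd.Series([20.0, 22.0, 21.0, 25.0, 24.0, 23.0, 26.0])

# Centered 5-point moving average: the modern spelling of
# pd.rolling_mean(s, 5, center=True)
flat = s.rolling(5, center=True).mean()
print(flat)

# Weighted windows (the old pd.rolling_window) become
# s.rolling(5, center=True, win_type='bartlett').mean(), which needs scipy.
```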

Parallel programming

Setting up independent programs to run in parallel is easy with Python. Check out auto_pull.py for an example, which is used each night to copy a day's worth of data from each WTLMA station to the server in parallel.
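auto_pull.py itself is not shown here; as a minimal sketch of the pattern, with placeholder station names and the transfer replaced by a stub (the real work is I/O-bound, so a thread pool is one cheap way to overlap the transfers; fully independent scripts can equally be launched with subprocess):

```python
from concurrent.futures import ThreadPoolExecutor

def pull_station(name):
    # Stub standing in for the real per-station SFTP transfer
    return 'pulled ' + name

stations = ['REES', 'CHIL', 'LBBW']  # placeholder station list

# Run the transfers concurrently and collect the results in order
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(pull_station, stations))

print(results)
```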