Pandas is a package designed to make it easy to ingest, manipulate, write out, and plot tabular data through its DataFrame and Series objects.
For reference see http://pandas.pydata.org/
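As a quick, self-contained taste of those two objects (with made-up numbers, not mesonet data):

```python
import pandas as pd

# A Series is a labeled 1-D array; a DataFrame is a table of aligned Series.
temps = pd.Series([21.5, 22.0, 20.8], name='2 m Temperature')
df = pd.DataFrame({'2 m Temperature': [21.5, 22.0, 20.8],
                   'Rainfall': [0.0, 0.2, 0.0]})
print(df.describe())  # per-column summary statistics
```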
In [ ]:
%pylab inline
In [ ]:
import pandas as pd
import datetime as dt
We need to import some mesonet data from the WTX Mesonet SFTP server. We can do this using the mesonet_sftp.py script after adding a username and password.
We also need to figure out how to read the data from the files into arrays in a logical and efficient manner.
In [ ]:
# Check to make sure you have each of these; if not, install them.
# Pandas v0.12 will likely need to be installed regardless of Python
# distribution, so a quick check of the installed version number is below.
import paramiko
from mpl_toolkits.basemap import Basemap
pd.__version__
Paramiko: https://github.com/paramiko/paramiko
Pandas from source: https://github.com/pydata/pandas
Basemap: http://matplotlib.org/basemap/users/installing.html
For Paramiko and Pandas, install from the linked repositories; for Basemap, follow the installation instructions on its webpage.
In [ ]:
# If you installed a new version of pandas you will need to reload the module
# or restart the kernel.
pd = reload(pd)
In [ ]:
%run mesonet_sftp.py
In [ ]:
!head raw_data/REES1206.txt
Pandas has a built-in CSV reader, 'read_csv', which returns a DataFrame with the column headers assumed to be the first row and an index given by the row number.
In [ ]:
filename = 'raw_data/REES1206.txt'
raw = pd.read_csv(filename)
In [ ]:
raw.head(5)
However, in this case the first row is actually data, not column headers, and indexing by date and time (currently spread across columns 1-2) would be much more useful.
Pandas does include nice integration of datetime parsers, but the default parser expects a standard datetime format, so we supply our own to combine the day-of-year and HHMM columns.
In [ ]:
def parse(day, hrmin):
    # day is the day of year and hrmin is the time as HHMM;
    # the 6-hour shift converts local standard time to UTC
    return (dt.datetime(2012, 1, 1, int(hrmin)/100, int(hrmin)%100)
            + dt.timedelta(int(day)-1) + dt.timedelta(hours=6))
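A quick sanity check of the parser (a self-contained copy of the function above, with the integer division made explicit so it behaves the same under Python 3):

```python
import datetime as dt

def parse(day, hrmin):
    # day-of-year and HHMM integers -> UTC datetime (local + 6 h)
    return (dt.datetime(2012, 1, 1, int(hrmin)//100, int(hrmin)%100)
            + dt.timedelta(int(day)-1) + dt.timedelta(hours=6))

# day 153 of 2012 (a leap year) is June 1; 1200 local becomes 18:00 UTC
print(parse(153, 1200))  # 2012-06-01 18:00:00
```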
In [ ]:
parsed = pd.read_csv(filename, sep=',', parse_dates=[[1,2]],
date_parser=parse, header=None)
In [ ]:
parsed.head(5)
The raw mesonet files contain up to three different data arrays intermixed and labeled in column '0' as 1 (meteorological), 2 (agricultural), and 3 (Reese-specific). 'groupby' operations make pulling out just one array simple and efficient.
In [ ]:
groups = parsed.groupby(0)
met_array = parsed.ix[groups.groups.get(1)]
In [ ]:
met_array.head(5)
In [ ]:
columns = ['Time', 'Array ID', 'Station ID', '10 m Scalar Wind Speed',
'10 m Vector Wind Speed', '10 m Wind Direction',
'10 m Wind Direction Std', '10 m Wind Speed Std',
'10 m Gust Wind Speed', '1.5 m Temperature',
'9 m Temperature', '2 m Temperature',
'1.5 m Relative Humidity', 'Station Pressure', 'Rainfall',
'Dewpoint', '2 m Wind Speed', 'Solar Radiation']
met_array.columns = columns
met_array['Station Pressure'] = met_array['Station Pressure']+600
In [ ]:
met_array.head()
In [ ]:
met_array = met_array.set_index('Time')
In [ ]:
print met_array.ix[dt.datetime(2012,6,1,6,20)]['9 m Temperature']
print met_array.ix[4, 9]
In [ ]:
met_array['9 m Temperature'].plot()
In [ ]:
met_array['9 m Temperature'][dt.datetime(2012, 6, 4):dt.datetime(2012, 6, 5)].plot()
met_array['2 m Temperature'][dt.datetime(2012, 6, 4):dt.datetime(2012, 6, 5)].plot()
We can ingest the agricultural data in a similar manner, remembering that since this array has a different shape, some columns are entirely filled with NaN's and can be removed.
In [ ]:
groups = parsed.groupby(0)
agr_array = parsed.ix[groups.groups.get(2)]
agr_array = agr_array.dropna(axis=1, how='all')
columns = ['Time', 'Array ID', 'Station ID',
'5 cm Natural Soil Temperature',
'10 cm Natural Soil Temperature',
'20 cm Natural Soil Temperature',
'5 cm Bare Soil Temperature',
'20 cm Bare Soil Temperature',
'5 cm Water Content', '20 cm Water Content',
'60 cm Water Content', '75 cm Water Content',
'Leaf Wetness', 'Battery Voltage', 'Program Signature']
agr_array.columns = columns
agr_array = agr_array.set_index('Time')
In [ ]:
agr_array.head()
In [ ]:
!cat parse_mesonet.py
In [ ]:
filename = 'locations.txt'
locations = pd.read_csv(filename, sep='\t')
In [ ]:
locations
In [ ]:
starttime = dt.datetime(2012,6,15,0)
endtime = starttime + dt.timedelta(hours=24)
In [ ]:
met_array.ix[starttime]['Station ID']
In [ ]:
tag = str(int(met_array.ix[starttime]['Station ID']))
met_array['Lat'] = (np.ones(len(met_array.index)) *
locations[locations['Logger ID'] == tag]
['Lat-decimal'].values)
met_array['Lon'] = (np.ones(len(met_array.index)) *
locations[locations['Logger ID'] == tag]
['Long.-decimal'].values)
In [ ]:
met_array
In [ ]:
import mesonet_calculations
array1 = mesonet_calculations.meso_operations('raw_data/REES1206.txt', starttime, endtime, locations)
array2 = mesonet_calculations.meso_operations('raw_data/CHIL1206.txt', starttime, endtime, locations)
new_ar = pd.concat([array1, array2], axis=0)
In [ ]:
new_ar = new_ar.set_index('Station ID', append=True)
new_ar.ix[starttime].T
In [ ]:
new_ar.to_csv('two_arrays.csv')
In [ ]:
!head two_arrays.csv
Pandas can also handle missing data within the DataFrame, or fill NaN's so the DataFrame can be used in more complicated functions. A DataFrame with missing data can also be transformed into a numpy masked array.
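As a small sketch of those two options (using a toy frame here rather than the mesonet data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
filled = df.fillna(0.0)  # replace NaN's before passing to NaN-intolerant functions
masked = np.ma.masked_invalid(df['a'].values)  # NaN's become masked entries
print(masked.mean())  # statistics skip the masked entries
```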
In [ ]:
filename = 'miss1206.txt'
In [ ]:
missing = mesonet_calculations.meso_operations(filename, dt.datetime(2012,6,1,6,0), dt.datetime(2012,6,1,12), locations)
In [ ]:
missing.head().T
In [ ]:
missing['10 m Scalar Wind Speed'].plot()
In [ ]:
missing.ffill().head().T
In [ ]:
met_array = mesonet_calculations.meso_operations('raw_data/CHIL1206.txt',
dt.datetime(2012,6,1,6),
dt.datetime(2012,7,1,5,55), locations)
In [ ]:
met_array['Rainfall'].cumsum().plot()
In [ ]:
# 18 UTC is local noon; sample once per day for 30 days
times = dt.datetime(2012,6,1,18) + dt.timedelta(days=1)*np.arange(0,30)
noons = met_array.ix[times]
In [ ]:
noons['2 m Temperature'].hist(bins=30)
In [ ]:
print noons['2 m Temperature'].mean()
print noons['2 m Temperature'].var()
print noons['2 m Temperature'].skew()
print noons['2 m Temperature'].kurt()
In [ ]:
noons['2 m Temperature'].describe()
In [ ]:
noons['2 m Temperature'].plot()
pd.rolling_mean(noons['2 m Temperature'], 5, center=True).plot()
In [ ]:
noons['2 m Temperature'].plot()
pd.rolling_window(noons['2 m Temperature'], 5, 'bartlett', center=True).plot()