Data Wrangling the Pandas

  • The are two datasets in CSV format, both are from weather station 'USC00116760' in Petersburg, IL
  • Data ranges from 2015-01-01 to 2015-06-29
  • 'Temp_116760.csv' stores temperture data, the index is day-of-year.
  • 'Prcp_116760.csv' stores Precipation data, the index is date-time.
  • Now how can we read the data such that they appear like the followings?

Tip: Pandas will always try to align index

Tip: try to bring up the docsting of Pandas.read_csv

Tip: use Pandas.concat to join DataFrame together


In [1]:
df_temp = pd.read_csv('Temp_116760.csv', skiprows=1, index_col=0)

In [2]:
df_temp.tail()


Out[2]:
TMAX TMIN
176 278.0 172.0
177 261.0 194.0
178 211.0 150.0
179 256.0 156.0
180 233.0 156.0

In [3]:
df_prcp = pd.read_csv('Prcp_116760.csv', index_col=0)
df_prcp.index = pd.to_datetime(df_prcp.index)

In [4]:
df_prcp.head()


Out[4]:
PRCP SNOW
2015-01-01 0.0 NaN
2015-01-02 0.0 NaN
2015-01-03 33.0 NaN
2015-01-04 140.0 NaN
2015-01-05 0.0 NaN

In [5]:
# and I want the index to be of date-time, rather than just strings
df_prcp.index.dtype


Out[5]:
dtype('<M8[ns]')
  • Try pandas.concat

In [6]:
pd.concat((df_prcp, 
           pd.DataFrame(data=df_temp.values,
                        index=df_prcp.index,
                        columns=df_temp.columns)),  
          axis=1).head()


Out[6]:
PRCP SNOW TMAX TMIN
2015-01-01 0.0 NaN -50.0 -150.0
2015-01-02 0.0 NaN 6.0 -94.0
2015-01-03 33.0 NaN 33.0 -33.0
2015-01-04 140.0 NaN 22.0 0.0
2015-01-05 0.0 NaN 0.0 -156.0
  • Try pandas.merge
  • Why merge might be the better apporach?

In [7]:
pd.merge(left=df_prcp,
         right=df_temp,
         left_on=df_prcp.index.dayofyear,
         right_index=True,
         how='left').head()


Out[7]:
PRCP SNOW TMAX TMIN
2015-01-01 0.0 NaN -50.0 -150.0
2015-01-02 0.0 NaN 6.0 -94.0
2015-01-03 33.0 NaN 33.0 -33.0
2015-01-04 140.0 NaN 22.0 0.0
2015-01-05 0.0 NaN 0.0 -156.0

Using pivot_table to summarize data

  • How many snow days and non-snow days are there for each month?
  • Can you generate the following result, say, with the merged data?
  • Dose the result make sense to you
  • If not, why it dosen't and how to fix it?

In [8]:
df = pd.merge(left=df_prcp,
              right=df_temp,
              left_on=df_prcp.index.dayofyear,
              right_index=True,
              how='left')
df.pivot_table(values='TMAX',
               index=df.index.month,
               columns=df.SNOW.isnull(),
               aggfunc='count')


Out[8]:
SNOW False True
1 3.0 16.0
2 6.0 12.0
3 1.0 22.0
4 NaN 24.0
5 NaN 25.0
6 NaN 26.0

Generate the CORRECT summary table for snowy days

  • It can be done with just 3 method calls
  • TIP: lookup the pandas.DataFrame.update() method.
  • TIP: lookup the pandas.date_range() method.

In [9]:
df3 = pd.DataFrame(index=pd.date_range('2015-01-01','2015-06-30'),
                   columns=df.columns)
df3.update(df)
df3.pivot_table(values='TMAX',
                index=df3.index.month,
                columns=np.where(df3.isnull().all(1),
                                 'Missing',
                                 df3.SNOW.isnull()),
                aggfunc=len)


Out[9]:
False Missing True
1 3.0 12.0 16.0
2 6.0 10.0 12.0
3 1.0 8.0 22.0
4 NaN 6.0 24.0
5 NaN 6.0 25.0
6 NaN 4.0 26.0

To map and to reduce: use groupby object and other methods.

  • Let's make a new column, put True for the days that we have snow and False