Data Wrangling with Pandas

  • The are two datasets in CSV format, both are from weather station 'USC00116760' in Petersburg, IL
  • Data ranges from 2015-01-01 to 2015-06-29
  • 'Temp_116760.csv' stores temperture data, the index is day-of-year.
  • 'Prcp_116760.csv' stores precipation data, the index is date-time.
  • Now how can we read the data such that they appear like the followings?
  • Tip: Pandas will always try to align index
  • Tip: try to bring up the docsting of Pandas.read_csv
  • Tip: use Pandas.concat to join DataFrame together

In [1]:
# How to read the 'Temp_116760.csv' file?

In [2]:
df_temp.tail()


Out[2]:
TMAX TMIN
176 278.0 172.0
177 261.0 194.0
178 211.0 150.0
179 256.0 156.0
180 233.0 156.0

In [3]:
# How to read the 'Prcp_116760.csv' file and make its index datetime dtype?

In [4]:
df_prcp.head()


Out[4]:
PRCP SNOW
2015-01-01 0.0 NaN
2015-01-02 0.0 NaN
2015-01-03 33.0 NaN
2015-01-04 140.0 NaN
2015-01-05 0.0 NaN

In [5]:
# and I want the index to be of date-time, rather than just strings
df_prcp.index.dtype


Out[5]:
dtype('<M8[ns]')
  • Try pandas.concat

In [6]:
# How to use concat to make a combined dataframe?


Out[6]:
PRCP SNOW TMAX TMIN
2015-01-01 0.0 NaN -50.0 -150.0
2015-01-02 0.0 NaN 6.0 -94.0
2015-01-03 33.0 NaN 33.0 -33.0
2015-01-04 140.0 NaN 22.0 0.0
2015-01-05 0.0 NaN 0.0 -156.0
  • Try pandas.merge
  • Why merge might be the better apporach?

In [7]:
# How to use merge to make a combined dataframe?


Out[7]:
PRCP SNOW TMAX TMIN
2015-01-01 0.0 NaN -50.0 -150.0
2015-01-02 0.0 NaN 6.0 -94.0
2015-01-03 33.0 NaN 33.0 -33.0
2015-01-04 140.0 NaN 22.0 0.0
2015-01-05 0.0 NaN 0.0 -156.0

Using pivot_table to summarize data

  • How many snow days and non-snow days are there for each month?
  • Can you generate the following result, say, with the merged data?
  • Dose the result make sense to you
  • If not, why it dosen't and how to fix it?

In [8]:
# Assuming your now have the combined dataframe from above, named df
# How to make the following summary table using pivot_table?


Out[8]:
SNOW False True
1 3.0 16.0
2 6.0 12.0
3 1.0 22.0
4 NaN 24.0
5 NaN 25.0
6 NaN 26.0

Generate the CORRECT summary table for snowy days

  • It can be done with just 3 method calls
  • TIP: lookup the pandas.DataFrame.update() method.
  • TIP: lookup the pandas.date_range() method.

In [ ]:
# How should the correct summary table look like?
# How to make it with just 3 method calls?