The USGS calculates various types of statistics for its data and provides these values through a web service. You can access this service through the stats function.
Learn more about the USGS Statistics Service.
There are three types of report that you can request using the StatReportType parameter: 'daily', 'monthly', and 'annual'.
You can request multiple sites by separating them with commas, like this: '01541200,01542500'
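If your site numbers are stored in a Python list, the comma-separated string can be built with a short helper. This is only a sketch: join_site_ids is a hypothetical name invented for this illustration, not part of hydrofunctions. It assumes the IDs are already strings, since an integer would lose the leading zero.

```python
def join_site_ids(site_ids):
    """Return a comma-separated string of site IDs (hypothetical helper)."""
    # Strip stray whitespace from each ID. The IDs should already be strings
    # such as '01541200', because an integer would drop the leading zero.
    cleaned = [str(site).strip() for site in site_ids]
    return ",".join(cleaned)


sites = join_site_ids(["01541200", "01542500"])
print(sites)  # 01541200,01542500
```

The resulting string can then be used as the site argument in a request.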
The USGS Statistics Service allows you to specify a wide array of additional parameters in your request. You can provide these parameters as keyword arguments, like in this example:
hf.stats('01452500', parameterCD='00060')
This will only request statistics for discharge, which is specified with the '00060' parameter code.
The default behavior for the USGS Statistics Service is to provide statistics for every parameter that is collected at a site. This can make for a long table that you will have to filter by the parameter that you want, like this:
my_stat_dataframe.loc[my_stat_dataframe['parameter_cd'] == '00060']
Alternatively, you can just request the parameter that you are interested in, rather than all of the parameters. To limit your request, provide the parameterCD keyword argument, like this:
hf.stats('01452500', parameterCD='00060')
You can request more than one parameter by listing every parameter code that you are interested in, separated by a comma:
parameterCD='00060,00065'
The default behavior for the USGS Statistics Service is to calculate annual statistics using calendar years. Unfortunately, for many places in the US, this will split the wet season in half. Since discharge data tends to be autocorrelated, you are more likely to get a large flood in January 2020 if you had a large flood in December 2019. To fix this, hydrologists often use 'Water Years', which split the year during the more or less dry season, on October 1st. To calculate annual statistics using the water year, provide the statYearType='water' argument, like this:
hf.stats('01452500','annual', statYearType='water')
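To make the water-year idea concrete, here is a small sketch that assigns a date to its water year. It assumes the usual USGS convention that a water year runs from October 1st through September 30th and is named for the calendar year in which it ends. The water_year function is a hypothetical helper written for this illustration; it is not part of hydrofunctions.

```python
from datetime import date


def water_year(day):
    """Return the water year containing `day` (hypothetical helper).

    Assumes the water year starts on October 1st and is labeled with the
    calendar year in which it ends.
    """
    return day.year + 1 if day.month >= 10 else day.year


# A December 2019 flood and a January 2020 flood share a water year,
# even though they fall in different calendar years.
print(water_year(date(2019, 12, 15)))  # 2020
print(water_year(date(2020, 1, 15)))   # 2020
print(water_year(date(2019, 9, 30)))   # 2019
```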
The default behavior for the USGS Statistics Service is to not calculate statistics for months or years if there are -ANY- missing values. In other words, in an annual report, every year reported will be based on 365 or 366 (leap year) values. You can override this behavior by providing the missingData='on' parameter. This will calculate the statistics as long as there is at least one measurement. You can then decide whether or not to use a statistic by looking at the count_nu column to see how many values were used to generate it.
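One way to judge a count_nu value is to compare it against the number of days in the year. The sketch below is only an illustration under stated assumptions: it assumes calendar-year statistics with one value per day, fraction_of_year is a hypothetical helper, the 90% threshold is an arbitrary choice, and the counts in the loop are illustrative inputs rather than values returned by the service.

```python
import calendar


def fraction_of_year(count_nu, year):
    """Return the share of days in calendar `year` covered by `count_nu` values."""
    days_in_year = 366 if calendar.isleap(year) else 365
    return count_nu / days_in_year


# Hypothetical threshold: flag any year with values for under 90% of its days.
MIN_FRACTION = 0.9

# Illustrative (count_nu, year) pairs, not real service output.
for count, year in [(365, 2019), (41, 2010)]:
    fraction = fraction_of_year(count, year)
    verdict = "keep" if fraction >= MIN_FRACTION else "review"
    print(f"{year}: {count} values, {fraction:.0%} of the year -> {verdict}")
```

The right threshold depends on your analysis; the point is only that count_nu gives you what you need to make that call yourself.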
The USGS accompanies every dataset with a header that explains the data. Hydrofunctions will automatically display this header along with the data. To access just one item, use either the .header or .df attribute.
test = hf.stats('01542500')
test # Print the header & dataframe
test.header # print just the header
test.df # print just the dataframe.
The first step, as always, is to import hydrofunctions.
In [1]:
import hydrofunctions as hf
print(hf.__version__)
To get started, let's request some data from Karthus, PA to see what typically gets collected there.
In [2]:
may_2019 = hf.NWIS('01542500', 'dv', '2019-05-01', '2019-06-01')
may_2019
Out[2]:
This site has collected discharge data since 1960, but other parameters, such as water temperature ('00010'), have only been collected since 2010. Unfortunately, in 2010, only 41 days of water temperature measurements were collected. By setting the missingData argument to 'on', we can ask the USGS to report averages for incomplete years. Now it is up to you to decide if 41 values is an adequate number!
In [3]:
annual_stats = hf.stats('01542500', 'annual', missingData='on')
# Use annual_stats.header to access just the header, or .df for just the dataframe.
# If you don't specify, both will be provided.
annual_stats
Out[3]:
The monthly report provides the mean value for each parameter for every month since 1960, when data collection began at this site.
Since this site collects lots of parameters, we can limit our display of the dataframe by filtering everything out except the discharge parameter ('00060').
In [4]:
monthly_stats = hf.stats('01542500', 'monthly')
monthly_stats.df.loc[monthly_stats.df['parameter_cd']=='00060']
Out[4]:
The daily statistics report is different from the monthly and annual reports in that it aggregates multiple years together from across the entire period of record. So in the following example, in row 0, the report provides statistics for January 1st by calculating the mean of every January 1st from 1961 ('begin_yr') to 2019 ('end_yr').
Note that there are 366 rows: one for each of the 365 days in a regular year, plus February 29th for leap years.
In [5]:
daily_stats = hf.stats('01542500', 'daily', parameterCd='00060')
daily_stats.df
Out[5]:
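The 366-row count mentioned above can be checked with a short sketch that uses only the standard library. It does not contact the USGS service; it simply lists every month and day combination in a leap year, which is the set of calendar days a daily report has to cover.

```python
from datetime import date, timedelta

# Use any leap year as a template so that February 29th is included.
LEAP_YEAR = 2020

day = date(LEAP_YEAR, 1, 1)
month_day_pairs = []
while day.year == LEAP_YEAR:
    month_day_pairs.append((day.month, day.day))
    day += timedelta(days=1)

print(len(month_day_pairs))        # 366
print((2, 29) in month_day_pairs)  # True
```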