In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
The first step is to read our data into a pandas DataFrame. For an intro to using pandas, I would highly suggest looking through this 10 minute guide to pandas.
In [2]:
df = pd.read_csv('npr_articles.csv')
We can now check out what our data consists of by using the .head() method on our DataFrame. By default, this will show the top 5 rows.
In [3]:
df.head()
Out[3]:
One of the first steps you should take is to get an overview of what kind of data we have by running the .info() method. Please see the documentation for more info (no pun intended).
In [4]:
df.info()
We can see that the column date_published
is being interpreted as an object
and not a datetime. Let's change that by using the pandas.to_datetime()
function.
In [5]:
df['date_published'] = pd.to_datetime(df['date_published'])
In [6]:
df.info()
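As an aside, we could have handled this conversion at read time by passing the parse_dates argument to read_csv. A quick sketch, assuming the same CSV file:
# Parse the date column while reading the CSV instead of converting afterwards
df = pd.read_csv('npr_articles.csv', parse_dates=['date_published'])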
In [7]:
# Let's create a mask for all rows that have a non-null value
mask = df['author'].notnull()
# When the data was saved to a CSV, these lists were converted into strings; we can convert
# them back like so
from ast import literal_eval
df.loc[mask, 'author'] = df.loc[mask, 'author'].map(literal_eval)
# Initialize column with NaN's and then fill in the respective values
df['num_authors'] = np.nan
df.loc[mask, 'num_authors'] = df.loc[mask, 'author'].map(len)
We can now take a look at the summary statistics of any numeric columns by running the .describe()
method.
In [8]:
df.describe()
Out[8]:
In [9]:
df.head()
Out[9]:
Let's say we wanted to get the number of unique authors that are represented in this DataFrame. We could potentially use df['author'].nunique(), but we are going to run into an error because each row contains a list, which isn't hashable.
Instead we could loop through each value and update a set like so:
In [10]:
# Create a set to hold our authors
authors = set()
for lst in df.loc[mask, 'author']:
    # For every row, update the authors set with those contained in that row
    authors.update(lst)
# Print out the total authors seen
print(len(authors))
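As an aside, newer versions of pandas (0.25+) have a Series.explode() method that turns each list element into its own row, which would make this a one-liner. A quick sketch, assuming a recent enough pandas:
# Explode the author lists into one row per author, then count distinct names
print(df.loc[mask, 'author'].explode().nunique())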
If we also wanted the number of times a particular author was involved in writing an article, we could leverage the power of Counter from the collections module. Refer to the documentation for more information.
In [11]:
from collections import Counter
authors = df.loc[mask, 'author'].map(Counter).sum()
In [12]:
authors
Out[12]:
In [13]:
authors.most_common()
Out[13]:
In [14]:
authors['Ari Shapiro']
Out[14]:
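If we'd rather work with these counts as a pandas Series (for sorting, slicing, or plotting), a Counter is dict-like, so it can be passed straight to the Series constructor. A quick sketch:
# Build a Series from the Counter and sort by number of articles
author_counts = pd.Series(authors).sort_values(ascending=False)
author_counts.head(10)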
Let's say we wanted to now subset down to the articles which Ari Shapiro worked on. There are a variety of ways we could do this, but I will demo one possible avenue.
In [15]:
# Because some rows have NaN's in them, we need to get clever with how we
# create our mask
mask = df['author'].map(lambda x: 'Ari Shapiro' in x if isinstance(x, list)
                        else False)
df.loc[mask, 'headline']
Out[15]:
In [16]:
# Here is another way we could achieve this
mask = df.loc[df['author'].notnull(), 'author'].map(lambda x: 'Ari Shapiro' in x)
df.loc[df['author'].notnull()].loc[mask, 'headline']
Out[16]:
In [17]:
df['section'].value_counts(dropna=False)[:5]
Out[17]:
When we first were looking at our DataFrame, you may have noticed that there are quite a few rows missing author information. Maybe we have a hypothesis that certain sections systematically weren't attaching author information. Let's dive deeper to try and prove/disprove this hypothesis...
In [18]:
# Let's create a new column that indicates whether the author attribute was null or not
# This helps with the groupby below
df['author_null'] = df['author'].isnull()
# Get the mean amount of nulls for each section and sort descending
# NOTE: 1.0 indicates ALL Nulls
df.groupby('section')['author_null'].mean().sort_values(ascending=False)
Out[18]:
As we can see, there are clearly sections that consistently do not attach author information, as well as many that are hit or miss with it.
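Before reading too much into those proportions, it can help to see how many articles each section has, since a tiny section with a single missing author will float to the top. A quick sketch using .agg:
# Show the null rate alongside the number of articles in each section
df.groupby('section')['author_null'].agg(['mean', 'count']).sort_values('mean', ascending=False)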
In [19]:
# Create a pandas Series with 1's as the values and the date as the index
s = pd.Series(1, index=df['date_published'])
In [20]:
s[:10]
Out[20]:
Below we see how we could use the resample function to find the number of articles published per day.
NOTE: Our DataFrame/Series must have a DatetimeIndex for this to work!
In [21]:
# Let's resample that Series and sum the values to find the number of articles by Day
s.resample('D').sum()
Out[21]:
There are, of course, many different offset aliases for passing to resample. For more options see this page.
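For example, we could just as easily count articles per month or per hour with a different alias. A quick sketch using the same Series:
# 'M' buckets by calendar month-end, 'H' buckets by hour
s.resample('M').sum()
s.resample('H').sum()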
In [22]:
plt.plot(s.resample('D').sum())
plt.title('Article Count By Day')
plt.ylabel('Number of Articles')
plt.xlabel('Date')
locs, labels = plt.xticks()
plt.setp(labels, rotation=-45);
In [23]:
plt.plot(s.resample('W').sum())
plt.title('Article Count By Week')
plt.ylabel('Number of Articles')
plt.xlabel('Date')
locs, labels = plt.xticks()
plt.setp(labels, rotation=-45);
In [24]:
df['hour_published'] = df['date_published'].dt.hour
We were able to run the above command because that particular column now contains datetime objects. From there we can use the .dt accessor to extract any aspect of the datetime (e.g. .dt.hour, .dt.second, .dt.month, .dt.quarter).
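For instance, we could pull out the day of the week in exactly the same way. A quick sketch (dayofweek runs from 0 = Monday to 6 = Sunday):
# Count articles by day of the week
df['date_published'].dt.dayofweek.value_counts().sort_index()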
In [25]:
df['hour_published'].hist()
plt.ylabel('Number of Articles Published')
plt.xlabel('Hour Published (24Hr)');
By default, the .hist
method is going to plot 10 bins. Let's up that to 24 bins so we have a bin for each hour in the day...
In [26]:
# Let's force the plot to split into 24 bins, one for each hour
df['hour_published'].hist(bins=24)
plt.ylabel('Number of Articles Published')
plt.xlabel('Hour Published (24Hr)');
In [27]:
# Let's extract the relative frequency rather than the raw counts
df['hour_published'].hist(bins=24, density=True, alpha=0.75)
plt.ylabel('Freq. of Articles Published')
plt.xlabel('Hour Published (24Hr)');
In [28]:
# We can also grab this information without plotting it using .value_counts
df['hour_published'].value_counts()
Out[28]:
In [29]:
df['hour_published'].value_counts(normalize=True)
Out[29]:
In [30]:
# Or we could leave them in the order of a day
df['hour_published'].value_counts().sort_index()
Out[30]:
In [31]:
mask = ((df['date_published'] >= '2016-12-24 10:00:00') &
        (df['date_published'] <= '2016-12-24 14:00:00'))
In [32]:
df.loc[mask, :]
Out[32]:
In [33]:
# Or we could set our index to the date and do it that way...
df2 = df.set_index('date_published')
df2.loc['2016-12-24 10:00:00': '2016-12-24 14:00:00', :]
Out[33]:
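One caveat worth noting: label-based slicing like this works best when the DatetimeIndex is sorted, so if the articles aren't already in chronological order it's safest to sort first. A quick sketch:
# Sort the DatetimeIndex so the label slice behaves predictably
df2 = df.set_index('date_published').sort_index()
df2.loc['2016-12-24 10:00:00': '2016-12-24 14:00:00', :]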
In [34]:
df['num_words'] = df['article_text'].map(lambda x: len(x.split()))
In [35]:
df['num_words'].describe()
Out[35]:
Let's create a histogram of the length of different articles...
In [36]:
df['num_words'].hist(bins=20, alpha=0.75)
plt.ylabel('Number of Articles Published')
plt.xlabel('Length of Article');
Clearly there are some outliers in this data. Let's subset what we are plotting to cut out the top 2% of articles in terms of article length and see what the resulting histogram looks like...
Refer to the numpy
percentile function for more information.
In [37]:
cutoff = np.percentile(df['num_words'], 98)
df.loc[df['num_words'] <= cutoff, 'num_words'].hist(bins=20, alpha=0.75)
plt.ylabel('Number of Articles Published')
plt.xlabel('Length of Article');
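As an aside, pandas has an equivalent .quantile method, so we could compute the same cutoff without reaching for numpy. A quick sketch:
# The 98th percentile of article length, computed with pandas
cutoff = df['num_words'].quantile(0.98)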
We can also use standard string functions by using the .str
functionality in pandas
. Take a look at this page for more information.
In [38]:
df.loc[df['headline'].str.contains('Obama'), 'headline'].head()
Out[38]:
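If we wanted the match to be case-insensitive (and to guard against any missing headlines), .str.contains accepts case and na arguments. A quick sketch:
# Case-insensitive match, treating missing headlines as non-matches
df.loc[df['headline'].str.contains('obama', case=False, na=False), 'headline'].head()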
In [39]:
# Let's subset to just the 10 most popular sections
top_sections = df['section'].value_counts()[:10].index
df_sub = df.loc[df['section'].isin(top_sections), :]
# We are now grouping by the section and extracting the mean hour that articles were published
df_sub.groupby('section')['hour_published'].mean()
Out[39]:
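Since the result is just a Series, we could also sort it and plot it directly. A quick sketch:
# Sort the mean publish hour by section and plot it as a horizontal bar chart
df_sub.groupby('section')['hour_published'].mean().sort_values().plot(kind='barh')
plt.xlabel('Mean Hour Published (24Hr)');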