In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('ggplot')

Read in Our Data

The first step is to read our data into a pandas DataFrame. For an intro to using pandas, I would highly suggest looking through this 10 minute guide to pandas.


In [2]:
df = pd.read_csv('npr_articles.csv')

We can now check out what our data consists of by using the .head() method on our DataFrame. By default, this shows the first 5 rows.


In [3]:
df.head()


Out[3]:
article_text author date_published headline section url processed_text
0 Birdsong is music to human ears. It has inspir... ['Barbara J. King'] 2016-12-01 09:30:00 What Do Birds Hear When They Sing Beautiful So... 13.7: Cosmos And Culture http://www.npr.org/sections/13.7/2016/12/01/50... birdsong music human ear inspire famous compos...
1 Two months after Colombian voters narrowly rej... ['Mark Katkov'] 2016-12-01 11:57:00 Colombia's Congress Ratifies Second Peace Deal... The Two-Way http://www.npr.org/sections/thetwo-way/2016/12... two_months colombian voter narrowly reject pea...
2 On a hillside overlooking the steppes of north... ['Rob Schmitz'] 2016-12-01 12:50:00 Amid Economic Crisis, Mongolians Risk Their Li... Parallels http://www.npr.org/sections/parallels/2016/12/... hillside overlook steppe northeastern mongolia...
3 When I last visited Damascus in 2008, the hist... ['Peter Kenyon'] 2016-12-01 12:50:00 Returning To Damascus, A City Changed By War Parallels http://www.npr.org/sections/parallels/2016/12/... visit damascus historic old_city district west...
4 Boston's official 2016 Christmas tree, like ot... ['Edgar B. Herwick III'] 2016-12-01 13:16:00 Boston's Christmas Tree Tradition Rooted In A ... Around the Nation http://www.npr.org/2016/12/01/503907535/boston... boston official christmas tree come thank gift...

One of the first steps you should take is to get an overview of what kind of data we have by running the .info() method. Please see the documentation for more info (no pun intended).


In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2275 entries, 0 to 2274
Data columns (total 7 columns):
article_text      2275 non-null object
author            1708 non-null object
date_published    2275 non-null object
headline          2275 non-null object
section           2267 non-null object
url               2275 non-null object
processed_text    2275 non-null object
dtypes: object(7)
memory usage: 124.5+ KB

We can see that the column date_published is being interpreted as an object and not a datetime. Let's change that by using the pandas.to_datetime() function.


In [5]:
df['date_published'] = pd.to_datetime(df['date_published'])
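
As an aside, the same conversion can be requested at load time through the parse_dates argument of pd.read_csv. The sketch below uses a made-up two-row CSV held in memory (csv_text and sample are hypothetical stand-ins, not the real npr_articles.csv), so treat it as an illustration rather than a drop-in replacement for the cells above.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for npr_articles.csv
csv_text = (
    "headline,date_published\n"
    "First story,2016-12-01 09:30:00\n"
    "Second story,2016-12-01 11:57:00\n"
)

# parse_dates converts the named columns to datetimes while the file is read
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['date_published'])

print(sample.dtypes)
```

Whether to parse at load time or afterwards is mostly a matter of taste; pd.to_datetime() makes it easier to pass extra options such as an explicit format.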

In [6]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2275 entries, 0 to 2274
Data columns (total 7 columns):
article_text      2275 non-null object
author            1708 non-null object
date_published    2275 non-null datetime64[ns]
headline          2275 non-null object
section           2267 non-null object
url               2275 non-null object
processed_text    2275 non-null object
dtypes: datetime64[ns](1), object(6)
memory usage: 124.5+ KB

Number of Authors

Let's say we wanted to add in another column that contains the number of authors that worked on a particular article. We could do this like so:


In [7]:
# Let's create a mask for all rows that have a non-null value
mask = df['author'].notnull()

# When the data was saved to a CSV, the author lists were written out as strings.
# literal_eval safely parses each string back into a Python list.
from ast import literal_eval
df.loc[mask, 'author'] = df.loc[mask, 'author'].map(literal_eval)

# Initialize column with NaN's and then fill in the respective values
df['num_authors'] = np.nan
df.loc[mask, 'num_authors'] = df.loc[mask, 'author'].map(len)
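
A possible variation, sketched on a toy frame: Series.str.len() also returns the length of list elements and leaves missing rows as NaN, so the mask is not strictly needed for the count. The frame sample and the helper parse_authors are hypothetical names used only for this illustration, and the cast to the nullable 'Int64' dtype is optional (it keeps the counts as integers instead of floats).

```python
from ast import literal_eval

import pandas as pd

# Hypothetical stand-in for the 'author' column as it comes out of the CSV
sample = pd.DataFrame({'author': ["['A B']", "['C D', 'E F']", None]})


def parse_authors(value):
    """Parse "['A', 'B']" back into a list; pass missing values through."""
    return literal_eval(value) if isinstance(value, str) else value


sample['author'] = sample['author'].map(parse_authors)

# .str.len() works on lists as well as strings and yields NaN for missing rows;
# 'Int64' is the nullable integer dtype, so missing counts become <NA>
sample['num_authors'] = sample['author'].str.len().astype('Int64')

print(sample)
```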

We can now take a look at the summary statistics of any numeric columns by running the .describe() method.


In [8]:
df.describe()


Out[8]:
num_authors
count 1708.000000
mean 1.055035
std 0.285180
min 1.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 6.000000

In [9]:
df.head()


Out[9]:
article_text author date_published headline section url processed_text num_authors
0 Birdsong is music to human ears. It has inspir... [Barbara J. King] 2016-12-01 09:30:00 What Do Birds Hear When They Sing Beautiful So... 13.7: Cosmos And Culture http://www.npr.org/sections/13.7/2016/12/01/50... birdsong music human ear inspire famous compos... 1.0
1 Two months after Colombian voters narrowly rej... [Mark Katkov] 2016-12-01 11:57:00 Colombia's Congress Ratifies Second Peace Deal... The Two-Way http://www.npr.org/sections/thetwo-way/2016/12... two_months colombian voter narrowly reject pea... 1.0
2 On a hillside overlooking the steppes of north... [Rob Schmitz] 2016-12-01 12:50:00 Amid Economic Crisis, Mongolians Risk Their Li... Parallels http://www.npr.org/sections/parallels/2016/12/... hillside overlook steppe northeastern mongolia... 1.0
3 When I last visited Damascus in 2008, the hist... [Peter Kenyon] 2016-12-01 12:50:00 Returning To Damascus, A City Changed By War Parallels http://www.npr.org/sections/parallels/2016/12/... visit damascus historic old_city district west... 1.0
4 Boston's official 2016 Christmas tree, like ot... [Edgar B. Herwick III] 2016-12-01 13:16:00 Boston's Christmas Tree Tradition Rooted In A ... Around the Nation http://www.npr.org/2016/12/01/503907535/boston... boston official christmas tree come thank gift... 1.0

Number of Unique Authors

Let's say we wanted to get the number of unique authors represented in this DataFrame. We might be tempted to use df['author'].nunique(), but that raises an error because each row contains a list, and lists aren't hashable.

Instead we could loop through each value and extend a set like so:


In [10]:
# Create a set to hold our authors
authors = set()
for lst in df.loc[mask, 'author']:
    # For every row, update the authors set with those contained in that row
    authors.update(lst)
# Print out the total authors seen
print(len(authors))


451
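
Another option, assuming a pandas version that provides Series.explode() (0.25 or later): explode the lists so that every author gets a row of their own, then call nunique() on the result. The toy frame sample below is hypothetical and only stands in for the real data.

```python
import pandas as pd

# Hypothetical stand-in: two articles with author lists, one without an author
sample = pd.DataFrame({'author': [['A B'], ['C D', 'A B'], None]})

# explode() gives each list element its own row; missing values pass through
exploded = sample['author'].explode()

# nunique() ignores missing values by default
n_unique = exploded.nunique()

print(n_unique)
```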

If we also wanted the number of times each author was involved in writing an article, we could use Counter from the collections module. Refer to the documentation for more information.


In [11]:
from collections import Counter
authors = df.loc[mask, 'author'].map(Counter).sum()
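
Summing one Counter per row works, but it builds many small intermediate objects. A sketch of an alternative is to flatten the lists with itertools.chain.from_iterable and count once; this is likely cheaper on large frames, though that has not been measured here. The Series sample is a hypothetical stand-in for df.loc[mask, 'author'].

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical stand-in for df.loc[mask, 'author'] (no missing values)
sample = pd.Series([['A B'], ['C D', 'A B'], ['A B']])

# Flatten all the author lists into one stream of names and count them once
counts = Counter(chain.from_iterable(sample))

print(counts.most_common(1))
```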

In [12]:
authors


Out[12]:
Counter({'Aarti Shahani': 3,
         'Adam Cole': 2,
         'Adam Frank': 5,
         'Adrian Florido': 3,
         'Agerenesh Ashagre': 1,
         'Ailsa Chang': 3,
         'Aki Peritz': 2,
         'Alan Greenblatt': 1,
         'Alan Yu': 5,
         'Alejandra Maria Salazar': 1,
         'Alex Ariff': 1,
         'Alex Cohen': 1,
         'Alex Zaragoza': 1,
         'Alexandra Olgin': 1,
         'Alexandria Lee': 4,
         'Alexi Horowitz-Ghazi': 1,
         'Alice Fordham': 4,
         'Alina Selyukh': 8,
         'Alison Fensterstock': 2,
         'Alison Kodjak': 9,
         'Alison Meuse': 2,
         'Alison Richards': 1,
         'Allison Aubrey': 5,
         'Alva Noë': 5,
         'Alyssa Edes': 1,
         'Amal El-Mohtar': 2,
         'Ammad Omar': 5,
         'Amy E. Robertson': 1,
         'Amy Sisk': 1,
         'An-Li Herring': 2,
         'Anastasia Tsioulcas': 2,
         'Andrew Lapin': 6,
         'Andrew Limbong': 2,
         'Angus Chen': 4,
         'Ann Finkbeiner': 1,
         'Ann Powers': 7,
         'Anna Gorman': 2,
         'Anna Marketti': 3,
         'Anna Mazarakis': 1,
         'Annalisa Quinn': 1,
         'Annie Ropeik': 1,
         'Annie Tomlinson': 1,
         'Anthony Kuhn': 4,
         'Antonia Cereijido': 1,
         'Anya Kamenetz': 6,
         'Anya Sacharow': 1,
         'Ari Shapiro': 3,
         'Ariel Zambelich': 3,
         'Arnie Seipel': 4,
         'Ashley Westerman ': 2,
         'Asma Khalid': 2,
         'Audie Cornish': 1,
         'Avie Schneider': 2,
         'Barbara Campbell': 1,
         'Barbara J. King': 6,
         'Barry Walters': 2,
         'Becky Harlan': 1,
         'Ben Allen': 2,
         'Ben Fishel': 2,
         'Ben Schein': 1,
         'Benjamin Naddaff-Hafrey': 2,
         'Beth Novey': 1,
         'Bill Chappell': 37,
         'Bob Boilen': 22,
         'Bob Mondello': 4,
         'Bobby Carter': 3,
         'Bonny Wolf': 1,
         'Brakkton Booker': 10,
         'Bram Sable-Smith, KBIA and Side Effects Public Media': 1,
         'Brandie Jefferson': 1,
         'Bret Stetka': 2,
         'Brian Mann': 1,
         'Brian Naylor': 20,
         'Briana Younger': 1,
         'Bruce Warren': 4,
         'Cam Robert': 11,
         'Camila Domonoske': 88,
         'Carmen Heredia Rodriguez': 1,
         'Carrie Johnson': 7,
         'Carrie Jung': 1,
         'Carrie Kahn': 1,
         'Cassi Alexandra': 2,
         'Cheryl Corley': 4,
         'Chhavi Sachdev': 2,
         'Chris Arnold': 3,
         'Chris Benderev': 1,
         'Chris Klimek': 4,
         'Christopher Dean Hopkins': 5,
         'Christopher Joyce': 5,
         'Clare Leschin-Hoar': 1,
         'Clarissa Wei': 1,
         'Claudio Sanchez': 4,
         'Colin Dwyer': 41,
         'Cory Turner': 6,
         'Craig LeMoult': 1,
         'Dan Carsen': 1,
         'Dan Charles': 9,
         'Daniel Estrin': 4,
         'Daniel Zwerdling': 3,
         'Danielle Kurtzleben ': 15,
         'Danielle Preiss': 1,
         'Danny Hajek': 1,
         'David Bianculli': 2,
         'David Bodanis': 2,
         'David Dye': 23,
         'David Edelstein': 1,
         'David Folkenflik': 6,
         'David Greene': 3,
         'David Grinspoon': 1,
         'David Schaper': 2,
         'David Sedaris': 1,
         'David Welna': 2,
         'Debbie Elliott': 2,
         'Deborah Amos': 2,
         'Deborah Shaar': 1,
         'Deena Prichep': 2,
         'Deepak Singh': 1,
         'Dennis Ross': 1,
         'Diane Cole': 2,
         'Domenico Montanaro': 9,
         'Don Gonyea': 5,
         'Doreen McCallister': 8,
         'Doug Mosurock': 1,
         'Dustin DeSoto': 1,
         'Edgar B. Herwick III': 1,
         'Edward Mabaya': 1,
         'Eleanor Beardsley': 6,
         'Eleanor Klibanoff': 1,
         'Elena See': 1,
         'Elise Hu': 6,
         'Elissa Nadworny ': 6,
         'Elizabeth Blair': 2,
         'Elizabeth Grossman': 1,
         'Elizabeth Jensen': 3,
         'Ella Taylor': 5,
         'Ellen Wu': 1,
         'Emily Bazar': 1,
         'Emily Feng': 1,
         'Emily Siner': 1,
         'Emma Bowman': 4,
         'Eric Deggans': 5,
         'Eric Westervelt': 5,
         'Eric Whitney': 1,
         'Erin Ross': 5,
         'Esther Landhuis': 1,
         'Etelka Lehoczky': 2,
         'Eyder Peralta': 5,
         'Felix Contreras': 11,
         'Francis Davis': 2,
         'Frank Langfitt': 3,
         'Frank Morris': 1,
         'Frannie Kelley': 2,
         'Fred Bever': 1,
         'Fred Schulte': 1,
         'Gabrielle Emanuel': 2,
         'Gene Demby': 3,
         'Genevieve Valentine': 2,
         'Geoff Brumfiel': 5,
         'Geoff Nunberg': 1,
         'Glen Weldon': 29,
         'Grace Hood': 1,
         'Greg Allen': 5,
         'Greg Myre': 6,
         'Ha-Hoa Hamano': 1,
         'Hansi Lo Wang': 3,
         'Heller McAlpin': 1,
         'Howard Berkes': 3,
         'Hugo Rojo': 1,
         'Iman Smith': 1,
         'Jackie Northam': 8,
         'Jacklyn Kim': 1,
         'Jackson Sinnenberg': 1,
         'James Doubek': 4,
         'Jane Arraf': 3,
         'Jane Greenhalgh': 1,
         'Jason Beaubien': 7,
         'Jason Bentley': 8,
         'Jason Heller': 2,
         'Jason Sheehan': 3,
         'Jason Slotkin': 15,
         'Jay Price': 1,
         'Jean Zimmerman': 2,
         'Jeff Brady': 3,
         'Jeff Koehler': 1,
         'Jeff Lunden': 5,
         'Jennifer Ludden': 1,
         'Jennifer Schmidt': 1,
         'Jenny Gold': 2,
         'Jerad Walker': 1,
         'Jessica Deahl': 1,
         'Jessica Diaz-Hurtado': 5,
         'Jessica Leigh Hester': 1,
         'Jessica Taylor': 26,
         'Jewly Hight': 2,
         'Jill Neimark': 2,
         'Jim Allen': 1,
         'Jim Kane': 1,
         'Jim Zarroli': 11,
         'Joanna Kakissis': 1,
         'Joanne Silberner': 2,
         'Jodi Helmer': 1,
         'Joe Palca': 4,
         'Joe Wertz': 1,
         'Joel Rose': 6,
         'Johhny Kauffman': 1,
         'John Burnett': 4,
         'John Henning Schumann': 2,
         'John Powers': 1,
         'John U. Bacon': 1,
         'John Ydstie': 2,
         'Jon Hamilton': 5,
         'Jonathan Baer': 1,
         'Jordan Rau': 1,
         'Josh Jackson': 1,
         'Joy Ho': 1,
         'Joy Lanzendorfer': 1,
         'Juan Vidal': 1,
         'Juli Fraga': 1,
         'Julie Appleby': 2,
         'Julie McCarthy': 4,
         'Julie Rovner': 3,
         'Kait Bolongaro': 1,
         'Kara Lofton': 1,
         'Karen Grigsby Bates': 3,
         'Karen Shakerdge': 1,
         'Kat Chow': 2,
         'Kat Lonsdorf': 3,
         'Katherine Hobson': 4,
         'Katie Orr': 1,
         'Katie Simon': 1,
         'Ken Tucker': 1,
         'Kerry Klein': 1,
         'Kiana Fitzgerald': 2,
         'Kim Kankiewicz': 1,
         'Kirk Carapezza': 1,
         'Kirk Siegler': 3,
         'Korva Coleman': 18,
         'Kristen Hartke': 3,
         'Kristina Johnson': 1,
         'Larry Kaplow': 1,
         'Lars Gotrich': 20,
         'Laura Oliver': 1,
         'Laura Roman': 3,
         'Laura Snapes': 1,
         'Laura Sydell': 1,
         'Laura Wagner': 15,
         'Laurel Wamsley': 8,
         'Lauren Frayer': 6,
         'Lauren Migaki': 1,
         'Lauren Ober': 1,
         'Lauren Sommer': 1,
         'Leah Donnella': 2,
         'Lee Hale': 2,
         'Leigh Paterson': 2,
         'Lela Nargi': 1,
         'Lily Meyer': 1,
         'Linda Fahey': 1,
         'Linda Holmes': 13,
         'Linda Poon': 1,
         'Linda Wertheimer': 1,
         'Lindsey Smith': 1,
         'Liz Szabo': 1,
         'Lora Smith': 1,
         'Lucian Kim': 5,
         'Luke Runyon': 1,
         'Luke Vander Ploeg': 1,
         'Lynn Neary': 4,
         'Maanvi Singh': 4,
         'Madeline K. Sofia': 2,
         'Maggie Penman': 28,
         'Malaka Gharib': 8,
         'Mandalit del Barco': 2,
         'Maquita Peters': 4,
         'Mara Liasson': 3,
         'Marc Masters': 3,
         'Marc Silver': 4,
         'Marcelo Gleiser': 5,
         'Margarita Gokun Silver': 1,
         'Maria Godoy': 1,
         'Maria Hinojosa': 1,
         'Maria Hollenhorst': 1,
         'Maria Sherman': 1,
         'Marilyn Geewax': 2,
         'Marina Lopes': 1,
         'Marisa Arbona-Ruiz': 1,
         'Marisa Peñaloza': 1,
         'Marissa Higgins': 1,
         'Marissa Lorusso': 3,
         'Mark Daley': 1,
         'Mark H. Kim': 1,
         'Mark Jenkins': 6,
         'Mark Katkov': 8,
         'Mark Mobley': 1,
         'Marlon Bishop': 1,
         'Martha Ann Overland': 1,
         'Martin Kaste': 4,
         'Martina Guzman': 1,
         'Mary Agnes Carey': 1,
         'Mary Louise Kelly': 5,
         'Maureen Corrigan': 2,
         'Maureen McCollum': 1,
         'Maya Rodale': 1,
         'Mayra Linares': 3,
         'Meg Anderson': 6,
         'Megan Buerger': 3,
         'Megan Kamerick': 1,
         'Meredith Rizzo': 1,
         'Merrit Kennedy': 61,
         'Michael Czaplinski': 1,
         'Michael Oreskes': 1,
         'Michael Schaub': 2,
         'Michael Tomsic': 1,
         'Michaeleen Doucleff': 4,
         'Michel Martin': 1,
         'Michele Kelemen': 4,
         'Michelle Andrews': 5,
         'Michelle Mercer': 2,
         'Mike Katzif': 2,
         'Miles Bryan': 1,
         'Mina Tavakoli': 1,
         'Molly Solomon': 1,
         'NPR Ed': 1,
         'NPR Staff': 105,
         'NPR/TED Staff': 9,
         'Nancy Pearl': 1,
         'Natasha Haverty': 2,
         'Nathan Rott': 5,
         'Neda Ulaby': 5,
         'Nell Greenfieldboyce': 10,
         'Nick Evans': 1,
         'Nicky Ouellet': 2,
         'Nicole Cohen': 1,
         'Nicole Jankowski': 1,
         'Nina Totenberg': 6,
         'Noel King': 1,
         'Nurith Aizenman': 7,
         'Ofeibea Quist-Arcton': 1,
         'Oliver Wang': 1,
         'Otis Hart': 4,
         'Pam Fessler': 3,
         'Parth Shah': 2,
         'Patricia Murphy': 1,
         'Patrick Madden': 1,
         'Patti Neighmond': 3,
         'Pauline Bartolone': 1,
         'Peter Granitz': 1,
         'Peter Kenyon': 6,
         'Peter Overby': 4,
         'Petra Mayer': 1,
         'Philip Ewing': 5,
         'Philip Galewitz': 1,
         'Piotr Orlov': 5,
         'Quil Lawrence': 1,
         'Rachel Bluth': 1,
         'Rachel Horn': 5,
         'Rachel Martin': 3,
         'Rachel Waldholz': 1,
         'Rae Ellen Bichell': 8,
         'Rebecca Hersher': 60,
         'Rebecca Sananes': 1,
         'Reid Frazier': 1,
         'Renee Klahr': 2,
         'Rhaina Cohen': 1,
         'Rhitu Chaterjee': 1,
         'Rhitu Chatterjee': 1,
         'Richard Gonzales': 8,
         'Richard Harris': 7,
         'Riley Beggin': 1,
         'Rob Schmitz': 9,
         'Rob Stein': 6,
         'Robin Hilton': 18,
         'Robin Marantz Henig': 1,
         'Robyn Park': 4,
         'Ron Elving': 5,
         'Rose Friedman': 1,
         'Rowan Moore-Gerety': 1,
         'Russell Lewis': 1,
         'Ryan Kailath': 1,
         'Ryan Kellman': 1,
         'Sam Harnett': 1,
         'Sam Sanders': 3,
         'Sami Yenigun': 2,
         'Sandhya Dirks': 1,
         'Sandip Roy': 1,
         'Sarah Hepola': 1,
         'Sarah McCammon': 1,
         'Sarah-Anne Henning Schumann': 1,
         'Scott Detrow': 12,
         'Scott Horsley': 13,
         'Scott Simon': 6,
         'Scott Tobias': 6,
         'Seth Herald': 1,
         'Shankar Vedantam': 3,
         'Sherrel Stewart': 2,
         'Simon Rentner': 1,
         'Sonari Glinton': 1,
         'Sonia Narang': 1,
         'Soraya Sarhaddi Nelson': 3,
         'Stephan Bisaha': 1,
         'Stephanie Martin Taylor': 1,
         "Stephanie O'Neill": 1,
         'Stephen Nessen': 1,
         'Stephen Thompson': 12,
         'Steve Carmody': 1,
         'Steve Inskeep': 6,
         'Steven Findlay': 1,
         'Stina Sieg': 1,
         'Sujata Gupta': 2,
         'Suraya Mohamed': 5,
         'Susan Brink': 5,
         'Susan Davis': 9,
         'Susan Jaffe': 1,
         'Susan Stamberg': 2,
         'Sydney Lupkin': 1,
         'Sylvia Poggioli': 6,
         'Talia Schlanger': 12,
         'Tamar Charney': 2,
         'Tamara Keith': 11,
         'Tania Lombrozo': 5,
         'Tanya Basu': 1,
         'Tara Boyle': 1,
         'Tasneem Raja': 1,
         'Taunya English': 1,
         'Ted Robbins': 1,
         'Tegan Wendland': 1,
         'The Associated Press': 1,
         'Thomas Hjelm': 1,
         'Tim Greiving': 1,
         'Timmhotep Aku': 1,
         'Todd Bookman': 1,
         'Tom Bowman': 2,
         'Tom Cole': 2,
         'Tom Gjelten': 5,
         'Tom Goldman': 1,
         'Tom Huizenga': 4,
         'Tom Moon': 1,
         'Tom Vitale': 1,
         'Tori Whitley': 1,
         'Tove K. Danovich': 5,
         'Tunde Wey': 1,
         'Vera Zakem': 1,
         'Vicky Hallett': 3,
         'Vincent Ialenti': 1,
         'Wade Goodwyn': 2,
         'Wendy Rigby': 1,
         'Will Shortz': 8,
         'William Dobson': 1,
         'Wynne Davis': 8,
         'Yasmeen Khan': 1,
         'Yuki Noguchi': 5,
         'emma bowman': 1})

In [13]:
authors.most_common()


Out[13]:
[('NPR Staff', 105),
 ('Camila Domonoske', 88),
 ('Merrit Kennedy', 61),
 ('Rebecca Hersher', 60),
 ('Colin Dwyer', 41),
 ('Bill Chappell', 37),
 ('Glen Weldon', 29),
 ('Maggie Penman', 28),
 ('Jessica Taylor', 26),
 ('David Dye', 23),
 ('Bob Boilen', 22),
 ('Brian Naylor', 20),
 ('Lars Gotrich', 20),
 ('Robin Hilton', 18),
 ('Korva Coleman', 18),
 ('Danielle Kurtzleben ', 15),
 ('Laura Wagner', 15),
 ('Jason Slotkin', 15),
 ('Scott Horsley', 13),
 ('Linda Holmes', 13),
 ('Stephen Thompson', 12),
 ('Scott Detrow', 12),
 ('Talia Schlanger', 12),
 ('Felix Contreras', 11),
 ('Cam Robert', 11),
 ('Tamara Keith', 11),
 ('Jim Zarroli', 11),
 ('Nell Greenfieldboyce', 10),
 ('Brakkton Booker', 10),
 ('Alison Kodjak', 9),
 ('NPR/TED Staff', 9),
 ('Domenico Montanaro', 9),
 ('Dan Charles', 9),
 ('Rob Schmitz', 9),
 ('Susan Davis', 9),
 ('Jason Bentley', 8),
 ('Rae Ellen Bichell', 8),
 ('Wynne Davis', 8),
 ('Richard Gonzales', 8),
 ('Laurel Wamsley', 8),
 ('Mark Katkov', 8),
 ('Doreen McCallister', 8),
 ('Alina Selyukh', 8),
 ('Jackie Northam', 8),
 ('Will Shortz', 8),
 ('Malaka Gharib', 8),
 ('Ann Powers', 7),
 ('Carrie Johnson', 7),
 ('Richard Harris', 7),
 ('Jason Beaubien', 7),
 ('Nurith Aizenman', 7),
 ('Scott Simon', 6),
 ('Peter Kenyon', 6),
 ('Steve Inskeep', 6),
 ('David Folkenflik', 6),
 ('Greg Myre', 6),
 ('Eleanor Beardsley', 6),
 ('Meg Anderson', 6),
 ('Anya Kamenetz', 6),
 ('Lauren Frayer', 6),
 ('Sylvia Poggioli', 6),
 ('Nina Totenberg', 6),
 ('Andrew Lapin', 6),
 ('Cory Turner', 6),
 ('Elise Hu', 6),
 ('Barbara J. King', 6),
 ('Rob Stein', 6),
 ('Scott Tobias', 6),
 ('Elissa Nadworny ', 6),
 ('Joel Rose', 6),
 ('Mark Jenkins', 6),
 ('Jon Hamilton', 5),
 ('Ella Taylor', 5),
 ('Christopher Joyce', 5),
 ('Neda Ulaby', 5),
 ('Christopher Dean Hopkins', 5),
 ('Eric Deggans', 5),
 ('Suraya Mohamed', 5),
 ('Mary Louise Kelly', 5),
 ('Lucian Kim', 5),
 ('Alan Yu', 5),
 ('Philip Ewing', 5),
 ('Piotr Orlov', 5),
 ('Jessica Diaz-Hurtado', 5),
 ('Erin Ross', 5),
 ('Eric Westervelt', 5),
 ('Rachel Horn', 5),
 ('Tania Lombrozo', 5),
 ('Yuki Noguchi', 5),
 ('Marcelo Gleiser', 5),
 ('Geoff Brumfiel', 5),
 ('Susan Brink', 5),
 ('Jeff Lunden', 5),
 ('Greg Allen', 5),
 ('Alva Noë', 5),
 ('Michelle Andrews', 5),
 ('Tom Gjelten', 5),
 ('Ron Elving', 5),
 ('Nathan Rott', 5),
 ('Allison Aubrey', 5),
 ('Tove K. Danovich', 5),
 ('Adam Frank', 5),
 ('Ammad Omar', 5),
 ('Don Gonyea', 5),
 ('Eyder Peralta', 5),
 ('Bob Mondello', 4),
 ('Michele Kelemen', 4),
 ('Claudio Sanchez', 4),
 ('Katherine Hobson', 4),
 ('Martin Kaste', 4),
 ('Emma Bowman', 4),
 ('Maanvi Singh', 4),
 ('Anthony Kuhn', 4),
 ('Cheryl Corley', 4),
 ('Maquita Peters', 4),
 ('Peter Overby', 4),
 ('Bruce Warren', 4),
 ('Alexandria Lee', 4),
 ('Michaeleen Doucleff', 4),
 ('Alice Fordham', 4),
 ('Daniel Estrin', 4),
 ('Tom Huizenga', 4),
 ('Angus Chen', 4),
 ('Chris Klimek', 4),
 ('James Doubek', 4),
 ('John Burnett', 4),
 ('Otis Hart', 4),
 ('Joe Palca', 4),
 ('Julie McCarthy', 4),
 ('Arnie Seipel', 4),
 ('Robyn Park', 4),
 ('Marc Silver', 4),
 ('Lynn Neary', 4),
 ('Jason Sheehan', 3),
 ('Jeff Brady', 3),
 ('Kat Lonsdorf', 3),
 ('Mayra Linares', 3),
 ('Laura Roman', 3),
 ('Aarti Shahani', 3),
 ('Hansi Lo Wang', 3),
 ('Jane Arraf', 3),
 ('Elizabeth Jensen', 3),
 ('Ailsa Chang', 3),
 ('Soraya Sarhaddi Nelson', 3),
 ('Pam Fessler', 3),
 ('Ariel Zambelich', 3),
 ('Julie Rovner', 3),
 ('Megan Buerger', 3),
 ('Shankar Vedantam', 3),
 ('Ari Shapiro', 3),
 ('Frank Langfitt', 3),
 ('Anna Marketti', 3),
 ('Marc Masters', 3),
 ('Mara Liasson', 3),
 ('Vicky Hallett', 3),
 ('Marissa Lorusso', 3),
 ('Howard Berkes', 3),
 ('Adrian Florido', 3),
 ('Kirk Siegler', 3),
 ('Sam Sanders', 3),
 ('David Greene', 3),
 ('Daniel Zwerdling', 3),
 ('Chris Arnold', 3),
 ('Bobby Carter', 3),
 ('Kristen Hartke', 3),
 ('Rachel Martin', 3),
 ('Gene Demby', 3),
 ('Karen Grigsby Bates', 3),
 ('Patti Neighmond', 3),
 ('Lee Hale', 2),
 ('Sami Yenigun', 2),
 ('Adam Cole', 2),
 ('Amal El-Mohtar', 2),
 ('Maureen Corrigan', 2),
 ('Tom Bowman', 2),
 ('Parth Shah', 2),
 ('Francis Davis', 2),
 ('Kiana Fitzgerald', 2),
 ('Genevieve Valentine', 2),
 ('Leah Donnella', 2),
 ('David Bodanis', 2),
 ('Tamar Charney', 2),
 ('David Schaper', 2),
 ('Leigh Paterson', 2),
 ('Cassi Alexandra', 2),
 ('Elizabeth Blair', 2),
 ('Deena Prichep', 2),
 ('Ben Fishel', 2),
 ('Asma Khalid', 2),
 ('Jason Heller', 2),
 ('David Bianculli', 2),
 ('Avie Schneider', 2),
 ('Benjamin Naddaff-Hafrey', 2),
 ('Madeline K. Sofia', 2),
 ('Gabrielle Emanuel', 2),
 ('Jenny Gold', 2),
 ('Chhavi Sachdev', 2),
 ('John Henning Schumann', 2),
 ('Joanne Silberner', 2),
 ('Jill Neimark', 2),
 ('Michelle Mercer', 2),
 ('Alison Fensterstock', 2),
 ('Diane Cole', 2),
 ('Kat Chow', 2),
 ('Renee Klahr', 2),
 ('David Welna', 2),
 ('Alison Meuse', 2),
 ('Bret Stetka', 2),
 ('Tom Cole', 2),
 ('Anastasia Tsioulcas', 2),
 ('Sherrel Stewart', 2),
 ('Deborah Amos', 2),
 ('Susan Stamberg', 2),
 ('Barry Walters', 2),
 ('Jean Zimmerman', 2),
 ('Sujata Gupta', 2),
 ('Nicky Ouellet', 2),
 ('Wade Goodwyn', 2),
 ('Jewly Hight', 2),
 ('Etelka Lehoczky', 2),
 ('Aki Peritz', 2),
 ('Marilyn Geewax', 2),
 ('John Ydstie', 2),
 ('An-Li Herring', 2),
 ('Andrew Limbong', 2),
 ('Julie Appleby', 2),
 ('Debbie Elliott', 2),
 ('Ben Allen', 2),
 ('Frannie Kelley', 2),
 ('Mike Katzif', 2),
 ('Natasha Haverty', 2),
 ('Michael Schaub', 2),
 ('Ashley Westerman ', 2),
 ('Mandalit del Barco', 2),
 ('Anna Gorman', 2),
 ('Ryan Kellman', 1),
 ('Jessica Leigh Hester', 1),
 ('Lauren Sommer', 1),
 ('Bram Sable-Smith, KBIA and Side Effects Public Media', 1),
 ('Karen Shakerdge', 1),
 ('Taunya English', 1),
 ('Eleanor Klibanoff', 1),
 ('Clare Leschin-Hoar', 1),
 ('Craig LeMoult', 1),
 ('Nicole Cohen', 1),
 ('Russell Lewis', 1),
 ('Miles Bryan', 1),
 ('Megan Kamerick', 1),
 ('Rhaina Cohen', 1),
 ('Luke Vander Ploeg', 1),
 ('Laura Sydell', 1),
 ('Jennifer Schmidt', 1),
 ('Steven Findlay', 1),
 ('Jay Price', 1),
 ('Agerenesh Ashagre', 1),
 ('David Sedaris', 1),
 ('Maria Hinojosa', 1),
 ('Alyssa Edes', 1),
 ('Joy Ho', 1),
 ('Reid Frazier', 1),
 ('Emily Feng', 1),
 ('Simon Rentner', 1),
 ('Jessica Deahl', 1),
 ('Ann Finkbeiner', 1),
 ('Vincent Ialenti', 1),
 ('Marlon Bishop', 1),
 ('Petra Mayer', 1),
 ('Philip Galewitz', 1),
 ('Jim Kane', 1),
 ('Emily Bazar', 1),
 ('Sonia Narang', 1),
 ('Tanya Basu', 1),
 ('Elena See', 1),
 ('Esther Landhuis', 1),
 ('Antonia Cereijido', 1),
 ('Stina Sieg', 1),
 ('Danielle Preiss', 1),
 ('Marissa Higgins', 1),
 ('Marisa Peñaloza', 1),
 ('Maureen McCollum', 1),
 ('Rebecca Sananes', 1),
 ('Geoff Nunberg', 1),
 ('Mina Tavakoli', 1),
 ('Carmen Heredia Rodriguez', 1),
 ('Rachel Bluth', 1),
 ('Todd Bookman', 1),
 ('Barbara Campbell', 1),
 ('Beth Novey', 1),
 ('Rowan Moore-Gerety', 1),
 ('Maria Godoy', 1),
 ('Alison Richards', 1),
 ('Kait Bolongaro', 1),
 ('John Powers', 1),
 ('Elizabeth Grossman', 1),
 ('Maria Hollenhorst', 1),
 ('Alexandra Olgin', 1),
 ('Clarissa Wei', 1),
 ('Linda Poon', 1),
 ('David Grinspoon', 1),
 ('David Edelstein', 1),
 ('Sarah Hepola', 1),
 ('Joe Wertz', 1),
 ('Sandhya Dirks', 1),
 ('Laura Oliver', 1),
 ('Sonari Glinton', 1),
 ('Alex Zaragoza', 1),
 ('Frank Morris', 1),
 ('Jim Allen', 1),
 ('Edward Mabaya', 1),
 ('Lela Nargi', 1),
 ('Ben Schein', 1),
 ('Doug Mosurock', 1),
 ('Briana Younger', 1),
 ('Molly Solomon', 1),
 ('Ellen Wu', 1),
 ('Amy E. Robertson', 1),
 ('Katie Orr', 1),
 ('Alex Ariff', 1),
 ('Ken Tucker', 1),
 ('Martina Guzman', 1),
 ('Kara Lofton', 1),
 ('Annalisa Quinn', 1),
 ('Tunde Wey', 1),
 ('Sarah-Anne Henning Schumann', 1),
 ('Luke Runyon', 1),
 ('Lora Smith', 1),
 ('Jennifer Ludden', 1),
 ('Juli Fraga', 1),
 ('Michael Tomsic', 1),
 ('Maya Rodale', 1),
 ('Annie Ropeik', 1),
 ('Johhny Kauffman', 1),
 ('Tom Moon', 1),
 ('Noel King', 1),
 ('Anya Sacharow', 1),
 ('Maria Sherman', 1),
 ('Tasneem Raja', 1),
 ('Tom Vitale', 1),
 ('Jacklyn Kim', 1),
 ('Michel Martin', 1),
 ('Robin Marantz Henig', 1),
 ('Lauren Migaki', 1),
 ('Kirk Carapezza', 1),
 ('John U. Bacon', 1),
 ('Jeff Koehler', 1),
 ('Tara Boyle', 1),
 ('Dan Carsen', 1),
 ('Joanna Kakissis', 1),
 ('Annie Tomlinson', 1),
 ('The Associated Press', 1),
 ('William Dobson', 1),
 ('Ted Robbins', 1),
 ('Laura Snapes', 1),
 ('Pauline Bartolone', 1),
 ('Ryan Kailath', 1),
 ('Jane Greenhalgh', 1),
 ('Bonny Wolf', 1),
 ('Tim Greiving', 1),
 ('Lily Meyer', 1),
 ('Stephen Nessen', 1),
 ('Ofeibea Quist-Arcton', 1),
 ('Quil Lawrence', 1),
 ('Juan Vidal', 1),
 ('Nicole Jankowski', 1),
 ('Linda Fahey', 1),
 ('Yasmeen Khan', 1),
 ('Marina Lopes', 1),
 ('Amy Sisk', 1),
 ('Mark H. Kim', 1),
 ('Fred Bever', 1),
 ('Riley Beggin', 1),
 ('Mark Daley', 1),
 ('Sam Harnett', 1),
 ('Wendy Rigby', 1),
 ('Ha-Hoa Hamano', 1),
 ('Thomas Hjelm', 1),
 ('Becky Harlan', 1),
 ('Dustin DeSoto', 1),
 ('Steve Carmody', 1),
 ('emma bowman', 1),
 ('Michael Czaplinski', 1),
 ('Eric Whitney', 1),
 ('Liz Szabo', 1),
 ('Fred Schulte', 1),
 ('Susan Jaffe', 1),
 ('Sandip Roy', 1),
 ('Edgar B. Herwick III', 1),
 ('Tom Goldman', 1),
 ('Audie Cornish', 1),
 ('Jordan Rau', 1),
 ('Timmhotep Aku', 1),
 ('Jodi Helmer', 1),
 ('Nancy Pearl', 1),
 ('Sarah McCammon', 1),
 ('Margarita Gokun Silver', 1),
 ('Alan Greenblatt', 1),
 ('Kim Kankiewicz', 1),
 ('Tori Whitley', 1),
 ('Rose Friedman', 1),
 ('Marisa Arbona-Ruiz', 1),
 ('Jerad Walker', 1),
 ('Jonathan Baer', 1),
 ('Rhitu Chaterjee', 1),
 ('Oliver Wang', 1),
 ('Josh Jackson', 1),
 ('Rhitu Chatterjee', 1),
 ('Seth Herald', 1),
 ('Deepak Singh', 1),
 ('Meredith Rizzo', 1),
 ('Vera Zakem', 1),
 ('Linda Wertheimer', 1),
 ('Patricia Murphy', 1),
 ('NPR Ed', 1),
 ('Anna Mazarakis', 1),
 ('Carrie Kahn', 1),
 ('Alexi Horowitz-Ghazi', 1),
 ('Chris Benderev', 1),
 ('Joy Lanzendorfer', 1),
 ("Stephanie O'Neill", 1),
 ('Mary Agnes Carey', 1),
 ('Alejandra Maria Salazar', 1),
 ('Martha Ann Overland', 1),
 ('Alex Cohen', 1),
 ('Sydney Lupkin', 1),
 ('Lauren Ober', 1),
 ('Nick Evans', 1),
 ('Carrie Jung', 1),
 ('Peter Granitz', 1),
 ('Brandie Jefferson', 1),
 ('Patrick Madden', 1),
 ('Mark Mobley', 1),
 ('Jackson Sinnenberg', 1),
 ('Emily Siner', 1),
 ('Kristina Johnson', 1),
 ('Rachel Waldholz', 1),
 ('Danny Hajek', 1),
 ('Lindsey Smith', 1),
 ('Michael Oreskes', 1),
 ('Katie Simon', 1),
 ('Kerry Klein', 1),
 ('Brian Mann', 1),
 ('Hugo Rojo', 1),
 ('Stephanie Martin Taylor', 1),
 ('Tegan Wendland', 1),
 ('Larry Kaplow', 1),
 ('Stephan Bisaha', 1),
 ('Dennis Ross', 1),
 ('Iman Smith', 1),
 ('Heller McAlpin', 1),
 ('Grace Hood', 1),
 ('Deborah Shaar', 1)]
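
The output above contains a few near-duplicate keys, for example 'Emma Bowman' next to 'emma bowman', and names with a trailing space such as 'Ashley Westerman '. One possible cleanup, sketched here on a hand-made Counter (raw, merged and spelling are hypothetical names), is to merge counts on a whitespace-trimmed, case-folded key while keeping the most frequent spelling for display. It would not catch genuine misspellings such as 'Rhitu Chaterjee'.

```python
from collections import Counter

# Hypothetical excerpt of the author counts shown above
raw = Counter({'Emma Bowman': 4, 'emma bowman': 1, 'Ashley Westerman ': 2})

merged = Counter()
spelling = {}  # normalized key -> spelling used for display

# most_common() yields the most frequent spelling first, so that one is kept
for name, n in raw.most_common():
    cleaned = ' '.join(name.split())  # trim and collapse whitespace
    key = cleaned.casefold()          # ignore differences in case
    spelling.setdefault(key, cleaned)
    merged[spelling[key]] += n

print(merged)
```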

In [14]:
authors['Ari Shapiro']


Out[14]:
3

Let's say we now wanted to subset down to the articles that Ari Shapiro worked on. There are a variety of ways we could do this, but I will demo one possible avenue.


In [15]:
# Because some rows have NaN's in them, we need to get clever with how we
# create our mask
mask = df['author'].map(lambda x: 'Ari Shapiro' in x if isinstance(x, list)
                                   else False)

df.loc[mask, 'headline']


Out[15]:
1268    Encore: Solange's 'A Seat At The Table' Honors...
1692    In Toledo, Syrian Refugees Are Welcomed Amid A...
1873    As A Syrian Refugee In Toledo Pines For His Fa...
Name: headline, dtype: object

In [16]:
# Here is another way we could achieve this
mask = df.loc[df['author'].notnull(), 'author'].map(lambda x: 'Ari Shapiro' in x)

df.loc[df['author'].notnull()].loc[mask, 'headline']


Out[16]:
1268    Encore: Solange's 'A Seat At The Table' Honors...
1692    In Toledo, Syrian Refugees Are Welcomed Amid A...
1873    As A Syrian Refugee In Toledo Pines For His Fa...
Name: headline, dtype: object
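
If this kind of lookup is needed for several authors, the membership test can be pulled out into a small helper. The sketch below is one way to do that; written_by and the toy frame sample are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
sample = pd.DataFrame({
    'headline': ['h0', 'h1', 'h2'],
    'author': [['Ari Shapiro', 'X Y'], None, ['Z W']],
})


def written_by(authors, name):
    """Return True if `authors` is a list that contains `name`."""
    return isinstance(authors, list) and name in authors


# Rows without an author list simply evaluate to False
mask_ari = sample['author'].map(lambda authors: written_by(authors, 'Ari Shapiro'))

print(sample.loc[mask_ari, 'headline'])
```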

Let's find the 5 most popular sections (as judged by the number of articles published within each section).


In [17]:
df['section'].value_counts(dropna=False)[:5]


Out[17]:
The Two-Way            479
Here And Now           398
Politics               181
Shots - Health News    103
Parallels               88
Name: section, dtype: int64
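
Because value_counts() already sorts in descending order, .head(n) is an equivalent and slightly more explicit way to take the top rows, and normalize=True reports shares instead of raw counts. A small sketch on made-up section labels (the Series sections is hypothetical):

```python
import pandas as pd

# Hypothetical section labels, including one missing value
sections = pd.Series(['A', 'A', 'B', None, 'A', 'B', 'C'])

# Top 2 sections by article count; dropna=False keeps missing sections in the tally
top = sections.value_counts(dropna=False).head(2)

# The same ranking expressed as a share of all articles
share = sections.value_counts(normalize=True, dropna=False).head(2)

print(top)
print(share)
```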

When we first looked at our DataFrame, you may have noticed that quite a few rows are missing author information. Maybe we have a hypothesis that certain sections systematically didn't attach author information. Let's dive deeper to try to confirm or reject this hypothesis...
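
A minimal sketch of how such a check might look on a toy frame: group by section, take the mean of a boolean "author is missing" flag to get the share of missing authors, and report the article count alongside it so that sections with only one or two articles are easy to spot. The frame sample and the column names share_missing and n_articles are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame
sample = pd.DataFrame({
    'section': ['Europe', 'Books', 'Books', 'Books', 'Books'],
    'author': [None, ['A'], None, ['B'], ['C']],
})

# True where the author is missing; the mean of a boolean column is a fraction
sample['author_null'] = sample['author'].isnull()

summary = (
    sample.groupby('section')['author_null']
    .agg(share_missing='mean', n_articles='count')
    .sort_values('share_missing', ascending=False)
)

print(summary)
```

A share of 1.0 backed by a single article says much less than a share of 1.0 backed by several hundred, which is why the count is worth keeping next to the mean.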


In [18]:
# Let's create a new column that indicates whether the author attribute was null or not
# This helps with the groupby below
df['author_null'] = df['author'].isnull()

# Get the fraction of articles with a null author in each section, sorted descending
# NOTE: 1.0 means every article in that section is missing its author
df.groupby('section')['author_null'].mean().sort_values(ascending=False)


Out[18]:
section
Europe                                            1.000000
The Thistle & Shamrock                            1.000000
From Scratch                                      1.000000
Games & Humor                                     1.000000
Dear Sugars                                       1.000000
Here And Now                                      1.000000
Joe's Big Idea                                    1.000000
Marian McPartland's Piano Jazz                    1.000000
Metropolis                                        1.000000
Ask Me Another                                    1.000000
Mountain Stage                                    1.000000
Analysis                                          1.000000
Movies                                            1.000000
Wait Wait...Don't Tell Me!                        1.000000
On Point                                          1.000000
Fresh Air Weekend                                 1.000000
Planet Money                                      0.866667
Food                                              0.750000
Youth Radio                                       0.500000
National Security                                 0.500000
Energy                                            0.500000
Theater                                           0.500000
Movie Interviews                                  0.444444
Remembrances                                      0.333333
Television                                        0.250000
Books                                             0.250000
Environment                                       0.200000
World                                             0.200000
NPR Ombudsman                                     0.200000
U.S.                                              0.187500
                                                    ...   
Movie Reviews                                     0.000000
Parallels                                         0.000000
Music                                             0.000000
Music News                                        0.000000
Media                                             0.000000
Music Reviews                                     0.000000
Music Videos                                      0.000000
My Big Break                                      0.000000
NPR Ed                                            0.000000
NPR Extra                                         0.000000
NPR News Nuggets                                  0.000000
Recipes                                           0.000000
Religion                                          0.000000
News                                              0.000000
History                                           0.000000
From Our Listeners                                0.000000
Social Entrepreneurs: Taking On World Problems    0.000000
Front Row                                         0.000000
Simon Says                                        0.000000
Goats and Soda                                    0.000000
Hidden Brain                                      0.000000
Holiday Music                                     0.000000
Latitudes                                         0.000000
Interviews                                        0.000000
Jazz Night In America                             0.000000
Jazz Night In America Videos                      0.000000
Jazz Night In America: The Radio Program          0.000000
Research News                                     0.000000
Latin America                                     0.000000
13.7: Cosmos And Culture                          0.000000
Name: author_null, dtype: float64

As we can see, there are clearly sections that are consistently not attaching author information as well as many that are hit or miss with the author information.
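One caveat: a section with a single article and no author also scores 1.0. As a rough sketch (on invented data, with a hypothetical `summary` name), we could report the number of nulls and the number of articles next to the mean so the proportions can be read in context.

```python
import numpy as np
import pandas as pd

# Invented data: section 'A' never has an author, 'B' sometimes, 'C' always
toy = pd.DataFrame({
    'section': ['A', 'A', 'B', 'B', 'B', 'C'],
    'author': [np.nan, np.nan, 'x', np.nan, 'y', 'z'],
})
toy['author_null'] = toy['author'].isnull()

# The mean of a boolean column is the fraction of True values;
# 'sum' is the number of nulls and 'count' is the number of articles
summary = (toy.groupby('section')['author_null']
              .agg(['mean', 'sum', 'count'])
              .sort_values('mean', ascending=False))
print(summary)
```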

Article Count by Time

Let's make plots showing the number of articles published by day and by week.


In [19]:
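# Create a pandas Series with 1's as the values and the date as the index
# (a scalar is broadcast across the index; a one-element list raises a
# length-mismatch error in recent pandas versions)
s = pd.Series(1, index=df['date_published'])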

In [20]:
s[:10]


Out[20]:
date_published
2016-12-01 09:30:00    1
2016-12-01 11:57:00    1
2016-12-01 12:50:00    1
2016-12-01 12:50:00    1
2016-12-01 13:16:00    1
2016-12-01 13:48:00    1
2016-12-01 14:00:00    1
2016-12-01 14:14:00    1
2016-12-01 15:00:00    1
2016-12-01 15:29:00    1
dtype: int64

Below we see how we could use the resample function to find the number of articles published per day.

NOTE: Our DataFrame/Series must have a DatetimeIndex for this to work!


In [21]:
# Let's resample that Series and sum the values to find the number of articles by Day
s.resample('D').sum()


Out[21]:
date_published
2016-12-01    43
2016-12-02    63
2016-12-03    11
2016-12-04    13
2016-12-05    52
2016-12-06    53
2016-12-07    62
2016-12-08    65
2016-12-09    83
2016-12-10    20
2016-12-11    17
2016-12-12    59
2016-12-13    57
2016-12-14    76
2016-12-15    68
2016-12-16    92
2016-12-17    33
2016-12-18    17
2016-12-19    49
2016-12-20    75
2016-12-21    62
2016-12-22    61
2016-12-23    74
2016-12-24    22
2016-12-25    19
2016-12-26    25
2016-12-27    43
2016-12-28    55
2016-12-29    49
2016-12-30    61
2016-12-31    25
2017-01-01    13
2017-01-02    27
2017-01-03    88
2017-01-04    59
2017-01-05    81
2017-01-06    60
2017-01-07    19
2017-01-08    15
2017-01-09    61
2017-01-10    58
2017-01-11    71
2017-01-12    82
2017-01-13    78
2017-01-14    30
2017-01-15    22
2017-01-16     7
Freq: D, dtype: int64

There are, of course, many different offset aliases that can be passed to resample. For more options see this page.
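As a small illustration on made-up timestamps (not the NPR data), the sketch below builds a Series with a DatetimeIndex via `pd.to_datetime` and resamples it with the `'D'` (calendar day) and `'W'` (week ending Sunday) aliases.

```python
import pandas as pd

# Made-up publication times; pd.to_datetime on a list returns a DatetimeIndex
stamps = pd.to_datetime([
    '2016-12-01 09:30', '2016-12-01 11:57', '2016-12-02 08:00',
    '2016-12-05 10:15', '2016-12-12 14:00',
])

# One "1" per article, so summing within a bin counts the articles
counts = pd.Series(1, index=stamps)

per_day = counts.resample('D').sum()    # days with no articles show up as 0
per_week = counts.resample('W').sum()   # each bin is labeled by its Sunday

print(per_day)
print(per_week)
```

Note that resampling by day fills in the days without any articles, which is handy when plotting a continuous time axis.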


In [22]:
plt.plot(s.resample('D').sum())
plt.title('Article Count By Day')
plt.ylabel('Number of Articles')
plt.xlabel('Date')
locs, labels = plt.xticks()
plt.setp(labels, rotation=-45);



In [23]:
plt.plot(s.resample('W').sum())
plt.title('Article Count By Week')
plt.ylabel('Number of Articles')
plt.xlabel('Date')
locs, labels = plt.xticks()
plt.setp(labels, rotation=-45);


To see what time of day articles tend to be published, let's extract the hour when each article was published and create a histogram.


In [24]:
df['hour_published'] = df['date_published'].dt.hour

We were able to run the above command because that particular column holds datetime values. From there we can use the .dt accessor to extract any component of the datetime (e.g. .dt.hour, .dt.second, .dt.month, .dt.quarter).
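Here is a minimal sketch of the .dt accessor on a two-row toy DataFrame (the values are invented). It assumes the column starts out as strings, so it is converted with `pd.to_datetime` first.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_published': ['2016-12-01 09:30:00', '2017-01-16 23:05:00'],
})

# .dt only works on datetime columns, so convert the strings first
toy['date_published'] = pd.to_datetime(toy['date_published'])

toy['hour'] = toy['date_published'].dt.hour
toy['month'] = toy['date_published'].dt.month
toy['quarter'] = toy['date_published'].dt.quarter
toy['weekday'] = toy['date_published'].dt.day_name()
print(toy)
```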


In [25]:
df['hour_published'].hist()
plt.ylabel('Number of Articles Published')
plt.xlabel('Hour Published (24Hr)');


By default, the .hist method is going to plot 10 bins. Let's up that to 24 bins so we have a bin for each hour in the day...


In [26]:
# Let's force the plot to split into 24 bins, one for each hour
df['hour_published'].hist(bins=24)
plt.ylabel('Number of Articles Published')
plt.xlabel('Hour Published (24Hr)');



In [27]:
# Let's plot a normalized histogram rather than the raw counts
# (newer matplotlib versions use `density` in place of the removed `normed`)
df['hour_published'].hist(bins=24, density=True, alpha=0.75)
plt.ylabel('Freq. of Articles Published')
plt.xlabel('Hour Published (24Hr)');
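A normalized histogram scales the bars so that their total area is 1, which means the bar heights equal relative frequencies only when each bin is exactly 1 unit wide. With `bins=24` over hours running from 0 to 23, each bin is 23/24 wide, so the heights are slightly inflated. The sketch below (on invented hours, using `np.histogram` rather than the plotting call above) shows that explicit one-hour bin edges make the two agree.

```python
import numpy as np

# Invented publication hours
hours = np.array([0, 0, 12, 19, 19, 19, 20, 23])

# Explicit edges 0, 1, ..., 24 give 24 bins that are each one hour wide
edges = np.arange(25)

counts, _ = np.histogram(hours, bins=edges)
rel_freq = counts / counts.sum()

# With unit-width bins, density and relative frequency coincide
density, _ = np.histogram(hours, bins=edges, density=True)

print(rel_freq[19], density[19])
```

The same `edges` array could be passed as `bins=edges` to the `.hist` call if exact relative frequencies are wanted on the plot.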



In [28]:
# We can also grab this information without plotting it using .value_counts
df['hour_published'].value_counts()


Out[28]:
19    280
20    263
21    217
15    189
16    177
17    173
22    152
14    138
18    129
23    120
12    113
13    109
0      60
1      48
3      23
11     18
2      18
4      15
5       9
7       7
6       7
9       4
10      4
8       2
Name: hour_published, dtype: int64

In [29]:
df['hour_published'].value_counts(normalize=True)


Out[29]:
19    0.123077
20    0.115604
21    0.095385
15    0.083077
16    0.077802
17    0.076044
22    0.066813
14    0.060659
18    0.056703
23    0.052747
12    0.049670
13    0.047912
0     0.026374
1     0.021099
3     0.010110
11    0.007912
2     0.007912
4     0.006593
5     0.003956
7     0.003077
6     0.003077
9     0.001758
10    0.001758
8     0.000879
Name: hour_published, dtype: float64

In [30]:
# Or we could leave them in the order of a day
df['hour_published'].value_counts().sort_index()


Out[30]:
0      60
1      48
2      18
3      23
4      15
5       9
6       7
7       7
8       2
9       4
10      4
11     18
12    113
13    109
14    138
15    189
16    177
17    173
18    129
19    280
20    263
21    217
22    152
23    120
Name: hour_published, dtype: int64

Selecting Particular Dates

Let's select articles which were published between 10 am and 2 pm on December 24th, 2016. There are a couple of ways we could do this, but let's start by making a mask.


In [31]:
mask = ((df['date_published'] >= '2016-12-24 10:00:00') &
        (df['date_published'] <= '2016-12-24 14:00:00'))

In [32]:
df.loc[mask, :]


Out[32]:
article_text author date_published headline section url processed_text num_authors author_null hour_published
1211 Graeme Wood may be known as a journalist, but ... [Dennis Ross] 2016-12-24 12:00:00 'The Way Of The Strangers' Explores The Pull O... NaN http://www.npr.org/2016/12/24/506763820/the-wa... graeme_wood know journalist fool student islam... 1.0 False 12
1212 "São Paulo is the graveyard of samba." So clai... [Marina Lopes] 2016-12-24 12:00:00 In Gritty Sao Paulo, Samba Reinvents Itself Wi... Parallels http://www.npr.org/sections/parallels/2016/12/... são paulo graveyard samba claim late brazilian... 1.0 False 12
1213 Subscription box services generally are boomin... [Wynne Davis] 2016-12-24 12:00:00 You've Got Mail: Book Boxes Offer Novels And N... NPR Ed http://www.npr.org/sections/ed/2016/12/24/4958... subscription box service generally boom fee co... 1.0 False 12
1214 Editor's note: This story was originally publi... [Bonny Wolf] 2016-12-24 12:40:00 Beyond Latkes: 8 Nights Of Fried Delights From... The Salt http://www.npr.org/sections/thesalt/2016/12/24... editor note story originally publish oil the_e... 1.0 False 12
1215 The Affordable Care Act is on the chopping blo... [Scott Horsley] 2016-12-24 13:53:00 White House Sharpens Its Case For Obamacare, A... Politics http://www.npr.org/2016/12/24/506338057/white-... the_affordable_care_act chopping block likely ... 1.0 False 13

In [33]:
# Or we could set the date as our index and slice on it directly...
df2 = df.set_index('date_published')
df2.loc['2016-12-24 10:00:00': '2016-12-24 14:00:00', :]


Out[33]:
article_text author headline section url processed_text num_authors author_null hour_published
date_published
2016-12-24 12:00:00 Graeme Wood may be known as a journalist, but ... [Dennis Ross] 'The Way Of The Strangers' Explores The Pull O... NaN http://www.npr.org/2016/12/24/506763820/the-wa... graeme_wood know journalist fool student islam... 1.0 False 12
2016-12-24 12:00:00 "São Paulo is the graveyard of samba." So clai... [Marina Lopes] In Gritty Sao Paulo, Samba Reinvents Itself Wi... Parallels http://www.npr.org/sections/parallels/2016/12/... são paulo graveyard samba claim late brazilian... 1.0 False 12
2016-12-24 12:00:00 Subscription box services generally are boomin... [Wynne Davis] You've Got Mail: Book Boxes Offer Novels And N... NPR Ed http://www.npr.org/sections/ed/2016/12/24/4958... subscription box service generally boom fee co... 1.0 False 12
2016-12-24 12:40:00 Editor's note: This story was originally publi... [Bonny Wolf] Beyond Latkes: 8 Nights Of Fried Delights From... The Salt http://www.npr.org/sections/thesalt/2016/12/24... editor note story originally publish oil the_e... 1.0 False 12
2016-12-24 13:53:00 The Affordable Care Act is on the chopping blo... [Scott Horsley] White House Sharpens Its Case For Obamacare, A... Politics http://www.npr.org/2016/12/24/506338057/white-... the_affordable_care_act chopping block likely ... 1.0 False 13
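Both approaches include the endpoints of the range. The following sketch checks that on a toy DataFrame (invented rows), and also shows `Series.between`, which is a more compact way to write the two-sided mask.

```python
import pandas as pd

toy = pd.DataFrame({
    'date_published': pd.to_datetime([
        '2016-12-24 09:59:00', '2016-12-24 10:00:00',
        '2016-12-24 12:40:00', '2016-12-24 14:00:00',
        '2016-12-24 14:01:00',
    ]),
    'headline': ['a', 'b', 'c', 'd', 'e'],
})

start = pd.Timestamp('2016-12-24 10:00:00')
end = pd.Timestamp('2016-12-24 14:00:00')

# Option 1: boolean mask; between() is inclusive on both ends by default
by_mask = toy.loc[toy['date_published'].between(start, end),
                  'headline'].tolist()

# Option 2: label-based slice on a sorted DatetimeIndex (also inclusive)
by_index = (toy.set_index('date_published')
               .sort_index()
               .loc[start:end, 'headline']
               .tolist())

print(by_mask, by_index)
```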

Length of Articles (# Words)

Maybe we are interested in looking at the distribution of how long our articles are...


In [34]:
df['num_words'] = df['article_text'].map(lambda x: len(x.split()))

In [35]:
df['num_words'].describe()


Out[35]:
count    2275.000000
mean      541.320440
std       502.855345
min         8.000000
25%       160.000000
50%       463.000000
75%       753.000000
max      8227.000000
Name: num_words, dtype: float64

Let's create a histogram of the length of different articles...


In [36]:
df['num_words'].hist(bins=20, alpha=0.75)
plt.ylabel('Number of Articles Published')
plt.xlabel('Length of Article');


Clearly there are some outliers in this data. Let's subset what we are plotting to cut out the top 2% of articles in terms of article length and see what the resulting histogram looks like...

Refer to the numpy percentile function for more information.


In [37]:
cutoff = np.percentile(df['num_words'], 98)

df.loc[df['num_words'] <= cutoff, 'num_words'].hist(bins=20, alpha=0.75)
plt.ylabel('Number of Articles Published')
plt.xlabel('Length of Article');
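To see what the cutoff does without a plot, here is a small sketch on a made-up Series of word counts from 1 to 100. It also shows `Series.quantile`, the pandas counterpart of `np.percentile` (note that it takes a fraction between 0 and 1 rather than a percentage).

```python
import numpy as np
import pandas as pd

# Made-up word counts: 1, 2, ..., 100
num_words = pd.Series(list(range(1, 101)))

cutoff_np = np.percentile(num_words, 98)   # takes a percentage (0-100)
cutoff_pd = num_words.quantile(0.98)       # takes a fraction (0-1)

# Keep only the rows at or below the 98th percentile
trimmed = num_words[num_words <= cutoff_np]

print(cutoff_np, cutoff_pd, len(trimmed))
```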


Only rows that contain 'Obama' in the Headline

We can also use standard string functions by using the .str functionality in pandas. Take a look at this page for more information.


In [38]:
df.loc[df['headline'].str.contains('Obama'), 'headline'].head()


Out[38]:
27     Obama Administration Appeals Judge's Ruling To...
28     Obama Administration Appeals Judge's Ruling To...
56     Only 26 Percent Of Americans Support Full Repe...
125    For The Holidays, The Obamas Open Up The White...
207    A Closer Look At Obama's Counterterrorism Stra...
Name: headline, dtype: object
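Two options of `.str.contains` are worth knowing about: `na=False` treats missing headlines as non-matches (otherwise the mask would contain NaN), and `case=False` ignores capitalization. A quick sketch on invented headlines:

```python
import numpy as np
import pandas as pd

# Invented headlines, one of which is missing
headlines = pd.Series([
    'Obama Speaks', 'The Obamas Open Up', np.nan,
    'obama era ends', 'Other news',
])

# Case-sensitive match; missing values count as "no match"
exact = headlines.str.contains('Obama', na=False)

# Case-insensitive match
any_case = headlines.str.contains('obama', case=False, na=False)

print(headlines[exact].tolist())
print(headlines[any_case].tolist())
```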

Looking at Average Hour Published by Section

Maybe we have a hypothesis that different sections will vary in the time of day that they are publishing. We could try and get a sense for this like so:


In [39]:
# Let's subset to just the 10 most popular sections
top_sections = df['section'].value_counts()[:10].index
df_sub = df.loc[df['section'].isin(top_sections), :]

# We are now grouping by the section and extracting the mean hour that articles were published
df_sub.groupby('section')['hour_published'].mean()


Out[39]:
section
All Songs Considered    15.410714
Around the Nation       14.869565
Goats and Soda          16.803030
Here And Now            19.143216
Monkey See              15.547619
Parallels               15.931818
Politics                14.309392
Shots - Health News     17.038835
The Salt                17.013158
The Two-Way             16.265136
Name: hour_published, dtype: float64
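The mean hour alone can be misleading: a few late-night articles pull it upward, and hours wrap around at midnight. As a hedged sketch on invented data (the `summary` name is hypothetical), reporting the median and the article count next to the mean gives a fuller picture.

```python
import pandas as pd

# Invented data: 'Politics' has one late-night outlier
toy = pd.DataFrame({
    'section': ['Politics', 'Politics', 'Politics', 'The Salt', 'The Salt'],
    'hour_published': [9, 10, 23, 17, 19],
})

# Several aggregations at once: one column per statistic
summary = toy.groupby('section')['hour_published'].agg(
    ['mean', 'median', 'count'])
print(summary)
```

Here the single 23:00 article moves the Politics mean to 14 while the median stays at 10.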