If we view the front page of each newspaper as an MxN matrix, we can assign each pixel an intensity P based on the font size of the character at that location. Averaging these intensities across all newspapers then gives the "average importance" of each location on a front page, for all newspapers of that size.
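As a minimal sketch of that averaging idea (with made-up numbers, not the real data), the "average importance" is just an element-wise mean over a stack of per-paper matrices:
import numpy as np
# Two hypothetical 3x4 "front pages"; each cell holds the font size at that pixel.
pages = [
    np.array([[36, 36, 12, 12], [12, 12, 12, 12], [8, 8, 8, 8]]),
    np.array([[24, 24, 24, 12], [12, 12, 8, 8], [8, 8, 8, 8]]),
]
avg_importance = np.mean(pages, axis=0)  # element-wise mean across papers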
In [1]:
import pandas as pd
df = pd.read_sql_table('frontpage_texts', 'postgres:///frontpages')
In [2]:
df.head()
Out[2]:
First, let's discover how many unique layouts there are for newspapers, and whether there is only one size per newspaper.
In [3]:
len(df.groupby(['page_width', 'page_height']).indices)
Out[3]:
There may be subtle sub-pixel differences that can safely be ignored. Let's truncate all values to whole pixels.
In [4]:
df['page_width_round'] = df['page_width'].apply(int)
df['page_height_round'] = df['page_height'].apply(int)
len(df.groupby(['page_width_round', 'page_height_round']).indices)
Out[4]:
What about rounding down to the nearest 10 pixels?
In [5]:
df['page_width_round_10'] = df['page_width'].apply(lambda w: int(w/10)*10)
df['page_height_round_10'] = df['page_height'].apply(lambda h: int(h/10)*10)
print('''Number of unique dimensions: {}
Top dimensions:
{}'''.format(
    len(df.groupby(['page_width_round_10', 'page_height_round_10']).slug.nunique()),
    df.groupby(['page_width_round_10', 'page_height_round_10']).slug.nunique().sort_values(ascending=False)[:10]
))
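As an aside, the same floor-to-10 rounding can be written without apply, using pandas' vectorized integer division (a sketch, assuming both columns are numeric):
df['page_width_round_10'] = (df['page_width'] // 10).astype(int) * 10
df['page_height_round_10'] = (df['page_height'] // 10).astype(int) * 10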
Perhaps it's good enough to grab the 34 newspapers in the 790x1580 category for now.
What are they?
In [6]:
newspapers = pd.read_sql_table('newspapers', 'postgres:///frontpages')
In [7]:
WIDTH = 790
HEIGHT = 1580
df_at_size = df[(df.page_width_round_10 == WIDTH) & (df.page_height_round_10 == HEIGHT)]
print('Number of days for which we have data for each newspaper')
pd.merge(newspapers, df_at_size.groupby('slug').date.nunique().reset_index(), on='slug').sort_values('date', ascending=False)
Out[7]:
We're interested in the average character size at every pixel on these newspapers. Let's go about constructing the matrix for one of them.
In [8]:
one_paper = df_at_size[(df_at_size.slug=='NY_RDC') & (df_at_size.date == df_at_size.date.max())]
print('''The Rochester Democrat and Chronicle has {} entries in the database across {} days.
On the latest day, it has {} text fields.
'''.format(
    df_at_size[df_at_size.slug == 'NY_RDC'].shape[0],
    df_at_size[df_at_size.slug == 'NY_RDC'].date.nunique(),
    one_paper.shape[0]
))
In [9]:
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle
plt.figure(figsize=(WIDTH/200, HEIGHT/200))
currentAxis = plt.gca()
for i, row in one_paper.iterrows():
    left = row.bbox_left / row.page_width
    right = row.bbox_right / row.page_width
    top = row.bbox_top / row.page_height
    bottom = row.bbox_bottom / row.page_height
    currentAxis.add_patch(Rectangle((left, bottom), right-left, top-bottom, alpha=0.5))
plt.suptitle('Layout of detected text boxes on a single front page')
plt.show()
In [41]:
import numpy as np
def make_intensity_grid(paper, height=HEIGHT, width=WIDTH, verbose=False):
    # Fill each text box with its average character area; where boxes
    # overlap, later boxes simply overwrite earlier values.
    intensity_grid = np.zeros((height, width))
    for i, row in paper.iterrows():
        left = int(row.bbox_left)
        right = int(row.bbox_right)
        top = int(row.bbox_top)
        bottom = int(row.bbox_bottom)
        if np.count_nonzero(intensity_grid[bottom:top, left:right]) > 0:
            if verbose:
                print('Warning: overlapping bounding box with', bottom, top, left, right)
        intensity_grid[bottom:top, left:right] = row.avg_character_area
    return intensity_grid
def plot_intensity(intensity, title, scale=100):
    height, width = intensity.shape
    # figsize takes (width, height) in inches
    fig = plt.figure(figsize=(width/scale, height/scale))
    cmap = plt.get_cmap('YlOrRd')
    cmap.set_under(color='white')  # cells that were never written stay white
    fig.suptitle(title)
    plt.imshow(intensity, cmap=cmap, extent=[0, width, 0, height], origin='lower', vmin=0.1)
    plt.close()
    return fig
intensity_grid = make_intensity_grid(one_paper)
plot_intensity(intensity_grid, 'Intensity map of a front page')
Out[41]:
In [11]:
intensities = []
for i, ((date, slug), paper) in enumerate(df_at_size.groupby(['date', 'slug'])):
    if i % 50 == 0:
        print('.', end='')
    intensities.append(make_intensity_grid(paper))
In [12]:
avg_intensity = sum(intensities) / len(intensities)
plot_intensity(avg_intensity, 'Average intensity of all {} x {} newspapers'.format(HEIGHT, WIDTH))
Out[12]:
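Since all the grids at this size share the same shape, NumPy can also do this averaging in a single call:
avg_intensity = np.mean(intensities, axis=0)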
In [13]:
df['aspect_ratio'] = df['page_width_round_10'] / df['page_height_round_10']
print('''Out of {} newspapers, there are {} unique aspect ratios.
The top ones are:
{}'''.format(
    df.slug.nunique(),
    df.groupby('slug').aspect_ratio.first().nunique(),
    df.groupby('slug').aspect_ratio.first().value_counts().head(5)
))
The top ratios all look awfully close to 0.5. Let's round to the tenths place.
In [14]:
df['aspect_ratio'] = np.round(df['page_width_round_10'] / df['page_height_round_10'], decimals=1)
print('''This time, there are {} unique aspect ratios.
Top ones:
{}'''.format(
    df.groupby('slug').aspect_ratio.first().nunique(),
    df.groupby('slug').aspect_ratio.first().value_counts()
))
Over half of the newspapers have roughly a 1:2 aspect ratio. Let's scale them to a common size, pushing any rounding error toward the right and bottom margins.
In [15]:
smallest_width = df[df.aspect_ratio == 0.5].page_width_round_10.min()
smallest_height = df[df.aspect_ratio == 0.5].page_height_round_10.min()
print('''The easiest way would be to scale down to the smallest dimensions.
{} x {}'''.format(
    smallest_width,
    smallest_height
))
In [16]:
# Note: scipy.misc.imresize was removed in SciPy 1.3; see the alternative sketch below.
from scipy.misc import imresize
intensities = []
for i, ((date, slug), paper) in enumerate(df[df.aspect_ratio == 0.5].groupby(['date', 'slug'])):
    if i % 50 == 0:
        print('.', end='')
    intensities.append(imresize(make_intensity_grid(paper), (smallest_height, smallest_width)))
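Since imresize is gone from modern SciPy, the same resampling step could be done with skimage.transform.resize; this is a minimal sketch under that assumption (note that imresize also rescaled values to 0-255 uint8, which resize with preserve_range does not):
from skimage.transform import resize
def resize_grid(grid, height, width):
    # Resample to (height, width) while keeping the original intensity scale.
    return resize(grid, (height, width), preserve_range=True, anti_aliasing=True)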
In [17]:
count = len(intensities)
avg_intensity = sum([x / count for x in intensities])
In [18]:
plot_intensity(avg_intensity, 'Average front page of {} newspapers'.format(len(intensities)))
Out[18]:
In [36]:
newspapers[newspapers.slug == 'NY_NYT'].head()
Out[36]:
In [82]:
def newspaper_for_slug(slug):
    return newspapers[newspapers.slug == slug].title.iloc[0]

def slug_for_newspaper(title):
    return newspapers[newspapers.title == title].slug.iloc[0]
In [135]:
def avg_frontpage_for(newspaper_title='', random=False, paper=df):
    if newspaper_title:
        slug = slug_for_newspaper(newspaper_title)
        if slug not in paper.slug.unique():
            return 'No data'
    elif random:
        slug = paper.sample(1).slug.iloc[0]
        newspaper_title = newspaper_for_slug(slug)
    else:
        raise ValueError('Need newspaper_title or random=True')
    newspaper = paper[paper.slug == slug]
    width = newspaper.iloc[0].page_width_round
    height = newspaper.iloc[0].page_height_round
    intensities = []
    # day_paper avoids shadowing the `paper` argument
    for (date, day_slug), day_paper in newspaper.groupby(['date', 'slug']):
        intensities.append(make_intensity_grid(day_paper, height=height, width=width))
    avg_intensity = sum([x / len(intensities) for x in intensities])
    return plot_intensity(avg_intensity, 'Average front page of {} ({} days)'.format(newspaper_title, newspaper.date.nunique()))
avg_frontpage_for('The Denver Post')
Out[135]:
In [88]:
avg_frontpage_for('The Washington Post')
Out[88]:
Here we see an issue: the newspaper's title appears in the same spot every day, so it takes up a disproportionate share of the average.
In [89]:
avg_frontpage_for(random=True)
Out[89]:
In [99]:
avg_frontpage_for(random=True)
Out[99]:
In [102]:
df[df.slug == slug_for_newspaper('Marietta Daily Journal')].text.value_counts().head()
Out[102]:
In [126]:
text_counts = df.groupby(['slug']).text.value_counts()
duplicate_text = text_counts[text_counts > 1].reset_index(name='count').drop('count', axis=1)
print('Detected {} rows of duplicate text'.format(duplicate_text.shape[0]))
In [132]:
from collections import defaultdict
duplicate_text_dict = defaultdict(set)
_ = duplicate_text.apply(lambda row: duplicate_text_dict[row.slug].add(row.text), axis=1)
In [142]:
df_clean = df[df.apply(lambda row: row.text not in duplicate_text_dict[row.slug], axis=1)]
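The row-wise apply above is easy to read but slow on a large frame. A vectorized anti-join on (slug, text) gives the same result; this is a sketch, assuming duplicate_text still holds the unique (slug, text) pairs found above:
merged = df.merge(duplicate_text, on=['slug', 'text'], how='left', indicator=True)
df_clean = df[(merged['_merge'] == 'left_only').values]  # keep rows with no duplicate-text match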
In [143]:
avg_frontpage_for('The Hamilton Spectator', paper=df_clean)
Out[143]:
In [144]:
avg_frontpage_for('The Washington Post', paper=df_clean)
Out[144]:
In [157]:
avg_frontpage_for(random=True, paper=df_clean)
Out[157]:
In [145]:
intensities = []
for i, ((date, slug), paper) in enumerate(df_clean[df_clean.aspect_ratio == 0.5].groupby(['date', 'slug'])):
    if i % 50 == 0:
        print('.', end='')
    intensities.append(imresize(make_intensity_grid(paper), (smallest_height, smallest_width)))
count = len(intensities)
avg_intensity = sum([x / count for x in intensities])
plot_intensity(avg_intensity, 'Average front page of newspapers')
Out[145]: