Matteo Renzi mentions across Italian and Portuguese Wikipedia

Remark: Since the notebook contains interactive plots, open this link to read it correctly.


In [1]:
import plotly
import pandas as pd  # used explicitly below by pd.merge
from pageviews import *
from wiki_parser import *
import plotly.tools as tls
from helpers_parser import *
from across_languages import *
plotly.tools.set_credentials_file(username='crimenghini', api_key='***')

1. Find articles

The first goal is to find all the Italian and Portuguese Wikipedia articles that mention Matteo Renzi. To do so, we use the WikiHandler class, which goes through the raw data and stores the title and the text of every element of the corpora that mentions the Italian (almost) ex-Prime Minister.

In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code can take into account more than two languages. The README file contains the information related to the collection of the data.


In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
# Xml file
xml_files = ['itwiki-20161120-pages-articles-multistream.xml', 
             'ptwiki-20161201-pages-articles-multistream.xml']

A quick peek at a snippet of the XML shows that the elements we are interested in are under the child page, which identifies an article. From each page we want to get the contents of title and text.

Given the large size of the XML, we opted for a parser that registers callbacks for the events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since for the purpose of our analysis we are not going to analyze the text itself.
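
To make the callback idea concrete, here is a minimal sketch of an event-driven (SAX) parser built on Python's standard xml.sax module. It is not the WikiHandler class itself: the MentionHandler name, its attributes and the tiny in-memory sample are assumptions made for illustration only.

```python
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Hypothetical handler: keep (title, text) of each <page> that contains `target`."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.current = None   # name of the tag whose characters we are collecting
        self.title = []
        self.text = []
        self.matches = []

    def startElement(self, name, attrs):
        if name == 'page':                 # a new article starts: reset the buffers
            self.title, self.text = [], []
        if name in ('title', 'text'):
            self.current = name

    def characters(self, content):
        # May be called several times for the same element, so we accumulate
        if self.current == 'title':
            self.title.append(content)
        elif self.current == 'text':
            self.text.append(content)

    def endElement(self, name):
        if name in ('title', 'text'):
            self.current = None
        elif name == 'page':               # the article is complete: filter it
            text = ''.join(self.text)
            if self.target in text:
                self.matches.append({'title': ''.join(self.title), 'text': text})


# Toy document standing in for the real dump (invented content)
sample = (b"<mediawiki>"
          b"<page><title>Governo Renzi</title>"
          b"<revision><text>Guidato da Matteo Renzi.</text></revision></page>"
          b"<page><title>Roma</title>"
          b"<revision><text>Capitale d'Italia.</text></revision></page>"
          b"</mediawiki>")

handler = MentionHandler('Matteo Renzi')
xml.sax.parseString(sample, handler)
matched_titles = [m['title'] for m in handler.matches]
print(matched_titles)
```

Since the parser only keeps the buffers of the current page in memory, the same approach should scale to a multi-gigabyte dump by calling xml.sax.parse on the file path instead of parseString.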

Hence, we parse the Italian corpus using the parse_articles function stored in the wiki_parser library; it basically activates the parser.


In [4]:
# Parse Italian corpus
parse_articles('ita', path + xml_files[0], 'Matteo Renzi')

Then we move on to the Portuguese one.


In [5]:
# Parse Portuguese corpus
parse_articles('port', path + xml_files[1], 'Matteo Renzi')

The articles are filtered according to whether they mention Matteo Renzi. Those in Italian are stored in a .json file in which each line corresponds to a page (title, text); the same holds for the articles in Portuguese. The two corpora are automatically stored in the folder Corpus.

                              {"title": "title_1", "text": "text_1"}
                                                    ...
                                                    ...
                              {"title": "title_n", "text": "text_n"}
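
A file with this layout can be read back one line at a time. The sketch below is only an illustration of the format: read_titles is a hypothetical helper (not the article_df_from_json function used later), and the in-memory sample stands in for a file such as Corpus/wiki_ita_Matteo_Renzi.json.

```python
import io
import json


def read_titles(lines):
    """Return the titles found in an iterable of JSON lines (one page per line)."""
    titles = []
    for line in lines:
        line = line.strip()
        if line:                            # skip empty lines
            titles.append(json.loads(line)['title'])
    return titles


# In-memory stand-in for the corpus file (placeholder content)
sample = io.StringIO(
    '{"title": "title_1", "text": "text_1"}\n'
    '{"title": "title_2", "text": "text_2"}\n'
)
titles = read_titles(sample)
print(titles)
```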

2. Rank articles according to November pageviews

Once the data has been filtered, we proceed with a simple analysis of the pageviews. In particular, using the article_df_from_json function, all the article titles are extracted from the corpus and then stored in a DataFrame.


In [3]:
# Get the df for the Italian articles
df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')

# Get the df for the Portuguese articles
df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')

Take a look at the obtained DataFrame.


In [4]:
df_it_titles.sample(5)


Out[4]:
     Title
101  Centro-sinistra
413  Viadotto Italia
223  TG5 Prima Pagina
403  Carcere di Santo Stefano
498  Referendum costituzionale del 2016 in Italia

Next, we extract the number of monthly page views of each article in the languages of interest (i.e. it and pt) from the page views file (see Additional data in the README). To filter the file we use the filter_pageviews_file function, which returns a dictionary of dictionaries with the following structure (for our example):

                            {'it':{'Title_1':'No. pageviews',
                                           ...
                                   'Title_n':'No. pageviews'},
                             'pt':{'Title_1':'No. pageviews',
                                           ...
                                   'Title_k':'No. pageviews'}}
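
As a rough sketch of how such a nested dictionary can be built, the snippet below assumes that every line of the page views file has the form "<project> <title> <count>", with Wikipedia projects coded as "it.z", "pt.z" and so on, and with underscores instead of spaces in titles. Both the line format and the filter_pageviews name are assumptions; the actual filter_pageviews_file function may work differently.

```python
def filter_pageviews(lines, languages):
    """Build {language: {title: pageviews}} keeping only the requested languages.

    Assumes lines of the form '<project> <title> <count>' (an assumption,
    not a documented property of the file used in this notebook).
    """
    wanted = {lang + '.z': lang for lang in languages}   # assumed project codes
    result = {lang: {} for lang in languages}
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[0] not in wanted:
            continue                                     # other project or malformed line
        project, title, count = parts
        result[wanted[project]][title.replace('_', ' ')] = int(count)
    return result


# Invented toy lines, for illustration only
sample = [
    'it.z Matteo_Renzi 100',
    'pt.z Matteo_Renzi 10',
    'en.z Matteo_Renzi 50',
]
views = filter_pageviews(sample, ['pt', 'it'])
print(views)
```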

In [5]:
# Page views file
pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'

# Filter the page view file
articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])

We then perform a right join between the two DataFrames, the one obtained from the pageviews and the one obtained from the corpus. It turns out that, for both Italian and Portuguese, some of the articles that mention Matteo Renzi were not viewed in November. The define_ranked_df function is stored in the pageviews library.


In [6]:
# Define the Italian ranked article df according to the number of page views
ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)
# Show the df head
ranked_df_ita.head(10)


Over the whole number of articles in the corpus  39  have not been visited during the considered period.
Out[6]:
     Title                         Pageviews
409  Marco Travaglio               19795.0
146  Pif (conduttore televisivo)   19557.0
474  Partito Democratico (Italia)  11653.0
154  Vittorio Sgarbi               11324.0
226  Malala Yousafzai              9452.0
433  Jobs Act                      8908.0
274  Enrico Letta                  7791.0
343  Startup (economia)            7698.0
312  Marianna Madia                7608.0
155  Nuovo Centro Congressi        6894.0

In [7]:
# Define the Portuguese ranked article df according to the number of page views
ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)
# Show the df
ranked_df_port.head(10)


Over the whole number of articles in the corpus  4  have not been visited during the considered period.
Out[7]:
    Title                                              Pageviews
33  Partido Democrático (Itália)                       567.0
31  Lista de chefes de Estado e de governo atuais      410.0
4   G7                                                 259.0
19  Federica Mogherini                                 215.0
7   G20                                                185.0
23  Centro-esquerda                                    141.0
8   Privatização                                       93.0
15  Lista de chefes de Estado e de governo por dat...  83.0
21  9.ª reunião de cúpula do G20                       81.0
17  10.ª reunião de cúpula do G20                      70.0

Having a quick glance at the two top 10, we notice:

  • The number of page views for the Italian articles that mention Matteo Renzi is considerably higher than for those written in Portuguese.
  • The only article that is present in both top 10s is Partito Democratico (Italia).
  • The pages seem to differ in content: the Portuguese ones are mostly related to international politics, whereas the Italian ones refer to politics, journalists and public figures.

3. Make comparisons

We now move on to explore the preprocessed data, trying to figure out something interesting.

  • We take a look at the number of mentions received in each article. In this context, Matteo Renzi may receive more than one mention just because of the presence of references. For instance, on this page, if you search for Matteo Renzi you will find 2 mentions, but one of them just refers to the first. For the moment we do not address this issue.

The DataFrames below, obtained using the article_mentions function in this library, show the number of mentions that Matteo Renzi has received in each article, for both the Italian and Portuguese corpora. The DataFrames are sorted by the number of mentions, so that we get the pages where Matteo Renzi is most "popular".
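
A minimal sketch of the counting step, assuming that a mention is simply a non-overlapping occurrence of the exact string in the article text. The count_mentions helper and the toy pages are hypothetical; article_mentions may be implemented differently.

```python
def count_mentions(text, target):
    """Count the non-overlapping occurrences of `target` in `text`."""
    return text.count(target)


# Invented toy pages, for illustration only
pages = [
    {'title': 'B', 'text': 'Un solo riferimento a Matteo Renzi.'},
    {'title': 'A', 'text': 'Matteo Renzi ha detto... Matteo Renzi ha risposto...'},
]

# (title, number of mentions) pairs, sorted by decreasing number of mentions
mentions = sorted(
    ((p['title'], count_mentions(p['text'], 'Matteo Renzi')) for p in pages),
    key=lambda row: row[1],
    reverse=True,
)
print(mentions)
```

Note that a plain string count also includes occurrences inside references, which is exactly the issue left open in the bullet above.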


In [8]:
# Italian df of mentions per page
df_it_mentions = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')

# Sort the df by the number of mentions and see the top 5
df_it_mentions = df_it_mentions.sort_values('Number of mentions', ascending = False)

# Show results
df_it_mentions.head(5)


Out[8]:
     Title                                    Number of mentions
37   Matteo Renzi                             62
424  Governo Renzi                            30
195  Partito Democratico (Italia)             17
492  Riforma costituzionale Renzi-Boschi      12
214  Storia del Partito Democratico (Italia)  12

In [9]:
# Portuguese df of mentions per page
df_pt_mentions = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')

# Sort the df by the number of mentions and see the top 5
df_pt_mentions = df_pt_mentions.sort_values('Number of mentions', ascending = False)

# Show results
df_pt_mentions.head(5)


Out[9]:
    Title                                             Number of mentions
20  Matteo Renzi                                      11
7   Partido Democrático (Itália)                      6
9   Itália                                            3
0   Lista de primeiros-ministros da Itália            2
16  Lista de viagens presidenciais de Dilma Rousseff  2

Comparing the two DataFrames, we immediately notice that the maximum number of mentions that Matteo Renzi receives differs greatly between the Italian and Portuguese articles (62 versus 11). In the Portuguese corpus there are only two articles that have more than 5 mentions. Thus, it can be interesting to visualize the distribution of the mentions for both the IT and PT corpora.

The distributions are represented using boxplots. They show that, for both languages, 75% of the articles contain no more than 3 mentions of the Italian Prime Minister. For the Portuguese corpus two outliers stand out, Matteo Renzi (11 mentions) and Partido Democrático (Itália) (6 mentions), whereas for the Italian corpus the number of outliers is larger and the maximum number of mentions is found in Matteo Renzi (62 mentions). Moreover, zooming in on the boxes, we observe that both distributions are concentrated on the left (number of mentions equal to 1), i.e. they are right-skewed.


In [10]:
#boxplot_mentions(df_pt_mentions, df_it_mentions, 'PT', 'IT', 'Number of mentions')
tls.embed("https://plot.ly/~crimenghini/20")


Out[10]:

In this direction, one aspect that can be considered is the following:

Define how important Matteo Renzi is in the articles that mention him. This requires defining the concept of importance. Intuitively, the higher the number of mentions, the greater the importance of our subject in the article. Moreover, it may be useful to weight the number of mentions by the number of words in the article: $$I_{string} = \frac{M}{|D|}$$ where I is the importance of the string, M is the number of mentions and |D| is the number of words in the document. In this way, if an article cites Renzi only once but consists of just a few lines, the string of interest will turn out to be more significant.
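
The ratio above can be sketched in a few lines. Here the number of words |D| is approximated by splitting the text on whitespace, which is an assumption (the definition above does not say how words are counted), and the two toy texts are invented.

```python
def importance(text, target):
    """I = M / |D|: mentions of `target` divided by the number of words in `text`."""
    n_words = len(text.split())        # |D|, approximated by whitespace tokens
    if n_words == 0:
        return 0.0
    return text.count(target) / n_words


short_page = 'Ospite della puntata: Matteo Renzi'     # 5 words, 1 mention
long_page = 'Matteo Renzi ' + 'parola ' * 98           # 100 words, 1 mention

i_short = importance(short_page, 'Matteo Renzi')
i_long = importance(long_page, 'Matteo Renzi')
print(i_short, i_long)
```

With a single mention each, the short page gets a score twenty times higher than the long one, which is the behaviour described above.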

Moreover, another aspect should be considered, especially when there is only one mention:

  • The string (i.e. Matteo Renzi) is a pointer to its main page (i.e. Matteo Renzi -> Matteo Renzi). If the pointer is present, we can assume that the figure is more important than on a page where there is no hyperlink.
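
Since the stored text has not been preprocessed, a pointer should still appear in it as a wikitext link, i.e. [[Matteo Renzi]] or [[Matteo Renzi|label]]. The sketch below checks for that pattern; links_to is a hypothetical helper and it ignores redirects and templates.

```python
import re


def links_to(text, target):
    """True if the wikitext contains [[target]] or [[target|label]]."""
    pattern = r'\[\[\s*' + re.escape(target) + r'\s*(\|[^\]]*)?\]\]'
    return re.search(pattern, text) is not None


print(links_to('Il governo di [[Matteo Renzi]].', 'Matteo Renzi'))
print(links_to('Ospite: Matteo Renzi.', 'Matteo Renzi'))
```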

Another thing that can be visualized is the relationship between the Number of mentions and the Pageviews. To do that, we first merge the pageviews and mentions DataFrames.


In [11]:
# Merge pageviews and mentions DataFrames for IT
df_it_mension_pageview = pd.merge(df_it_mentions, ranked_df_ita, on=['Title'])

# Show it
df_it_mension_pageview.sample(5)


Out[11]:
     Title                                      Number of mentions  Pageviews
255  Faccia a faccia (programma televisivo)     1                   483.0
457  Fausto Brizzi                              1                   2161.0
20   Ivan Scalfarotto                           4                   1077.0
32   Elezioni amministrative italiane del 2009  4                   723.0
472  Anonymous                                  1                   6.0

In [12]:
# Merge pageviews and mentions DataFrames for PT
df_pt_mension_pageview = pd.merge(df_pt_mentions, ranked_df_port, on=['Title'])

# Show it
df_pt_mension_pageview.sample(5)


Out[12]:
    Title                                   Number of mentions  Pageviews
14  42.ª reunião de cúpula do G7            2                   33.0
12  Maria Elena Boschi                      2                   5.0
3   Lista de primeiros-ministros da Itália  2                   12.0
19  G20                                     1                   185.0
20  Lista de líderes do G20                 1                   8.0

A scatterplot is used to see how each article is positioned with respect to these two variables. The plot shows:

  • IT: when the number of mentions is equal to 1, the number of page views is spread between 0 and ~20k. As the number of mentions increases, the number of page views falls in a narrower range.
  • PT: the same pattern is observed for the Portuguese articles.

In [13]:
# def scatter_plot(df_it_mension_pageview, df_pt_mension_pageview, 'Number of mentions', 'Pageviews', 'Italian', 'Portuguese')
tls.embed('https://plot.ly/~crimenghini/36')


Out[13]:

Regarding these two features, another direction worth exploring is the following:

Consider how the number of pageviews of an article changes when the number of Matteo Renzi citations increases from one revision to the next. In particular, the importance (I) is redefined as: $$I = \sum_{t = 1}^{T} \frac{(p_t-p_{t-1}) \times m_t}{|D_t|}$$ where t indexes the sequential revisions of the article, p_t is the number of page views at revision t, m_t is the number of mentions and |D_t| is the number of words at that revision.
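
A small sketch of how this sum could be evaluated, assuming that p_0 (the page views before the first revision considered) is available. The revision_importance name and the toy numbers are invented for illustration.

```python
def revision_importance(pageviews, mentions, lengths):
    """Sum over t = 1..T of (p_t - p_{t-1}) * m_t / |D_t|.

    pageviews[0] is p_0; mentions[t] and lengths[t] describe revision t,
    so index 0 of those two lists is unused.
    """
    total = 0.0
    for t in range(1, len(pageviews)):
        total += (pageviews[t] - pageviews[t - 1]) * mentions[t] / lengths[t]
    return total


# Invented toy values
p = [100, 150, 130]      # p_0, p_1, p_2
m = [None, 2, 4]         # m_1, m_2
d = [None, 100, 200]     # |D_1|, |D_2|

# t = 1: (150 - 100) * 2 / 100 =  1.0
# t = 2: (130 - 150) * 4 / 200 = -0.4
score = revision_importance(p, m, d)
print(score)
```

A drop in page views between two revisions contributes negatively, so the score can be negative even when the number of mentions grows.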

Next, we look for articles that exist in both languages and mention Matteo Renzi. To do so, we send a request for each Portuguese Wikipedia page that cites Renzi, then we parse the HTML source to extract, where available, the title of the corresponding IT article. Precisely, the requests are sent for each title of the language that has fewer articles matching Matteo Renzi. The function get_matches is stored in this library.
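
The HTML-parsing step can be sketched with the standard library alone. The snippet assumes that the interlanguage link is an <a> tag carrying the class interlanguage-link-target and an hreflang attribute, and that the article title is the part of the URL after /wiki/. These are assumptions about the page markup, not a description of get_matches, and the HTML fragment is a hand-written stand-in for a downloaded page.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class InterlanguageLinkParser(HTMLParser):
    """Hypothetical parser: find the title targeted by the link for language `lang`."""

    def __init__(self, lang):
        super().__init__()
        self.lang = lang
        self.title = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if (tag == 'a' and self.title is None
                and 'interlanguage-link-target' in classes
                and attrs.get('hreflang') == self.lang):
            # The part of the path after /wiki/ is the URL-encoded title
            path = urlparse(attrs.get('href') or '').path
            self.title = unquote(path.split('/wiki/', 1)[-1]).replace('_', ' ')


# Hand-written fragment standing in for the source of a PT page
html = ('<li class="interlanguage-link interwiki-it">'
        '<a href="https://it.wikipedia.org/wiki/Partito_Democratico_(Italia)" '
        'hreflang="it" class="interlanguage-link-target">Italiano</a></li>')

parser = InterlanguageLinkParser('it')
parser.feed(html)
matched_title = parser.title
print(matched_title)
```

When no link for the requested language is present, title stays None, which corresponds to the "where available" case above.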


In [14]:
# Build the matches between common articles
dict_italian = get_matches(df_pt_titles, 'it')
# Create the inverted one
inverted_dict = {v : k for k, v in dict_italian.items()}

In [15]:
print ('The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are: ', len(dict_italian), 
       '. The number of PT articles that have not been matched is: ', len(df_pt_titles)-len(dict_italian), '.')


The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are:  31 . The number of PT articles that have not been matched is:  11 .

We proceed to create a DataFrame that contains the information related to those articles.

  • We extract the titles of all involved articles (both IT and PT).

In [16]:
# From the dictionary get the titles of both languages
italian_titles = list(dict_italian.values())
portugues_titles = list(dict_italian.keys())

Before going further, we check whether all the matched IT articles mention Matteo Renzi. To do so, we run a query on the DataFrame that stores all the IT articles that cite Renzi.


In [17]:
# Run the query
match_with_mention = df_it_titles.query('Title in @italian_titles')

# Get the number
print ('There are ', len(portugues_titles)-len(match_with_mention), 'IT articles that do not mention Matteo Renzi.')


There are  10 IT articles that do not mention Matteo Renzi.

In [18]:
# Re-define the list of IT articles according to the aforementioned "issue"
it_titles_with_mention = list(match_with_mention.Title)

  • The dictionaries that match the PT and IT titles are re-defined taking into account the fact that some IT articles do not mention Renzi.

In [19]:
# Re-define the two dictionaries 
dict_italian_mentions = {k:v for k,v in dict_italian.items() if v in it_titles_with_mention}
# Define the inverted one (built from the filtered dictionary, not from dict_italian)
inverted_dict_italian_mentions = {v : k for k, v in dict_italian_mentions.items()}

# Create the list of titles for PT articles according to the IT that don't mention Renzi
pt_titles_with_mention = list(dict_italian_mentions.keys())

Then, we create a single DataFrame which contains the mentions in the IT and PT articles for each pair of matched articles.


In [20]:
# Create df for IT mentions
df_match_it_mentions = df_it_mentions.query('Title in @it_titles_with_mention').sort_values('Number of mentions', ascending = False)

# Create df for PT mentions
df_match_pt_mentions = df_pt_mentions.query('Title in @pt_titles_with_mention').sort_values('Number of mentions', ascending = False)

  • Add a column containing the matches to join the two dfs.

In [21]:
# Create new column
new_column_it = ['/'.join([k]+[v]) for i in df_match_it_mentions.Title for k,v in dict_italian_mentions.items()  if i == v]
new_column_pt = ['/'.join([k]+[v]) for i in df_match_pt_mentions.Title for k,v in dict_italian_mentions.items()  if i == k]

# Add the new column to the two dataframes
df_match_it_mentions['Matches'] = new_column_it
df_match_pt_mentions['Matches'] = new_column_pt

  • Perform the join on the Matches and plot the results.

In [22]:
# Join the two dfs on the correspondence tuples
matches_mention = pd.merge(df_match_it_mentions, df_match_pt_mentions, on = 'Matches', suffixes = ('_IT','_PT'))

# Show result
matches_mention.head()


Out[22]:
   Title_IT                                           Number of mentions_IT  Matches                                            Title_PT                                Number of mentions_PT
0  Matteo Renzi                                       62                     Matteo Renzi/Matteo Renzi                          Matteo Renzi                            11
1  Partito Democratico (Italia)                       17                     Partido Democrático (Itália)/Partito Democrati...  Partido Democrático (Itália)            6
2  Maria Elena Boschi                                 6                      Maria Elena Boschi/Maria Elena Boschi              Maria Elena Boschi                      2
3  Federica Mogherini                                 4                      Federica Mogherini/Federica Mogherini              Federica Mogherini                      1
4  Presidenti del Consiglio dei ministri della Re...  3                      Lista de primeiros-ministros da Itália/Preside...  Lista de primeiros-ministros da Itália  2

In [23]:
# bar_plot(df, 'Matches', 'Number of mentions_IT', 'Number of mentions_PT', 'IT', 'PT', 'Compare IT and PT mentions', 
# 'Article','No. mentions', 'color-bar-prova')
tls.embed('https://plot.ly/~crimenghini/38')


Out[23]:

From the plot:

  • Among this group of articles, the two that mention Matteo Renzi most often are the same in both languages.

This kind of analysis suggests the following question:

Given articles in different languages that correspond to each other, if we are interested in measuring the proximity of these articles, one element that may be considered is the number of common mentions. It is likely that the need to mention someone or something derives from the fact that the two articles discuss the same topics, which need to refer to the same thing.

The same procedure is repeated for the page views.

  • Check whether some articles have not been visited.

In [24]:
# Run the query
match_with_pageviews_it = ranked_df_ita.query('Title in @italian_titles')
match_with_pageviews_pt = ranked_df_port.query('Title in @portugues_titles')
# Get the number
print ('There are ', len(portugues_titles)-len(match_with_pageviews_it), 'IT articles that have not been visited.')
print ('There are ', len(portugues_titles)-len(match_with_pageviews_pt), 'PT articles that have not been visited.')


There are  12 IT articles that have not been visited.
There are  3 PT articles that have not been visited.

In [25]:
# Define the lists of articles that have been viewed
it_titles_with_pageviews = list(match_with_pageviews_it.Title)
pt_titles_with_pageviews = list(match_with_pageviews_pt.Title)

  • Define the matching dictionaries, keeping only the articles that have been viewed.

In [26]:
# Re-define the two dictionaries according to this evidence
dict_italian_pageviews = {k:v for k,v in dict_italian.items() if v in it_titles_with_pageviews}

# PT 
dict_pt_pageviews = {v : k for k, v in dict_italian.items() if k in pt_titles_with_pageviews}

In [27]:
# Create df for IT pageviews
df_match_it_pageviews = ranked_df_ita.query('Title in @it_titles_with_pageviews').sort_values('Pageviews', ascending = False)

# Create df for PT pageviews
df_match_pt_pageviews = ranked_df_port.query('Title in @pt_titles_with_pageviews').sort_values('Pageviews', ascending = False)

  • Add a new variable to allow the join.

In [28]:
# Create new column
new_column_it = ['/'.join([k]+[v]) for i in df_match_it_pageviews.Title for k,v in dict_italian_pageviews.items()  if i == v]
new_column_pt = ['/'.join([v]+[k]) for i in df_match_pt_pageviews.Title for k,v in dict_pt_pageviews.items()  if i == v]

# Add the new column to the two dataframes
df_match_it_pageviews['Matches'] = new_column_it
df_match_pt_pageviews['Matches'] = new_column_pt

In [29]:
df_match_it_pageviews.head()


Out[29]:
     Title                         Pageviews  Matches
474  Partito Democratico (Italia)  11653.0    Partido Democrático (Itália)/Partito Democrati...
274  Enrico Letta                  7791.0     Enrico Letta/Enrico Letta
312  Marianna Madia                7608.0     Marianna Madia/Marianna Madia
442  G20 (paesi industrializzati)  2545.0     G20/G20 (paesi industrializzati)
318  Giuliano Poletti              2021.0     Giuliano Poletti/Giuliano Poletti

In [30]:
df_match_pt_pageviews.head()


Out[30]:
    Title                                          Pageviews  Matches
33  Partido Democrático (Itália)                   567.0      Partido Democrático (Itália)/Partito Democrati...
31  Lista de chefes de Estado e de governo atuais  410.0      Lista de chefes de Estado e de governo atuais/...
4   G7                                             259.0      G7/G7
19  Federica Mogherini                             215.0      Federica Mogherini/Federica Mogherini
7   G20                                            185.0      G20/G20 (paesi industrializzati)

  • Join the two DataFrames with a right join, so that we also see the PT articles whose IT counterpart has not been viewed.

In [31]:
# Join the two dfs on the correspondence tuples
matches_pageviews = pd.merge(df_match_it_pageviews, df_match_pt_pageviews, how = 'right',on = 'Matches', suffixes = ('_IT','_PT'))
matches_pageviews.fillna(0, inplace =True)
# Show result
matches_pageviews.head()


Out[31]:
   Title_IT                      Pageviews_IT  Matches                                            Title_PT                      Pageviews_PT
0  Partito Democratico (Italia)  11653.0       Partido Democrático (Itália)/Partito Democrati...  Partido Democrático (Itália)  567.0
1  Enrico Letta                  7791.0        Enrico Letta/Enrico Letta                          Enrico Letta                  9.0
2  Marianna Madia                7608.0        Marianna Madia/Marianna Madia                      Marianna Madia                6.0
3  G20 (paesi industrializzati)  2545.0        G20/G20 (paesi industrializzati)                   G20                           185.0
4  Giuliano Poletti              2021.0        Giuliano Poletti/Giuliano Poletti                  Giuliano Poletti              7.0
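
The effect of the right join is easier to see on a toy example with one PT article that has no viewed IT counterpart. The two small DataFrames below are built by hand for illustration (the G7 row on the IT side is deliberately missing); unlike the cell above, the sketch fills only the numeric column, so that the missing IT title stays visibly missing.

```python
import pandas as pd

# Hand-built toy frames: the IT side has no row for G7
df_it = pd.DataFrame({'Title': ['G20 (paesi industrializzati)'],
                      'Pageviews': [2545.0],
                      'Matches': ['G20/G20 (paesi industrializzati)']})
df_pt = pd.DataFrame({'Title': ['G20', 'G7'],
                      'Pageviews': [185.0, 259.0],
                      'Matches': ['G20/G20 (paesi industrializzati)', 'G7/G7']})

# A right join keeps every row of df_pt, matched or not
merged = pd.merge(df_it, df_pt, how='right', on='Matches', suffixes=('_IT', '_PT'))

# Unmatched IT pageviews become 0; Title_IT is left as NaN on purpose
merged = merged.fillna({'Pageviews_IT': 0})
print(merged[['Matches', 'Pageviews_IT', 'Pageviews_PT']])
```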

We use a bar plot to visualize the results.


In [32]:
# bar_plot(df, 'Matches', 'Pageviews_IT', 'Pageviews_PT', 'IT', 'PT', 'Compare IT and PT pageviews', 'Article',
# 'No. pageviews', 'color-bar-pvs')
tls.embed('https://plot.ly/~crimenghini/40')


Out[32]:

From the plot:

  • The page with the most visits is the same in both languages.
  • In general, it seems that the PT pages that mention Matteo Renzi are related to general topics and political figures on the international stage.

It could be interesting:

To present the same plot using the relative frequencies of the visits, to see the importance of each page with respect to the list of articles (that mention Renzi) in that language.

To study the relationships between the articles that mention Renzi, in particular whether they are connected and point to each other. This may be used to define the importance of Matteo Renzi in an article (e.g. if Matteo Renzi is mentioned on the page of a TV show just because he has been a guest, and the page turns out not to be connected to other articles, it is possible to assume that Renzi is not the main topic of the article). I'm not totally sure it can be done, since moving from one article to another (even if they talk about extremely different topics) does not need many hops.
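
For the first of these two ideas, the relative frequencies can be sketched as each page's share of the total page views of its own list. The relative_frequencies helper is hypothetical, and the example uses only three of the PT page view counts shown earlier, so the resulting shares are illustrative rather than the real ones.

```python
def relative_frequencies(pageviews):
    """Map each title to its share of the total page views of the list."""
    total = sum(pageviews.values())
    if total == 0:
        return {title: 0.0 for title in pageviews}
    return {title: views / total for title, views in pageviews.items()}


# Three PT counts taken from the ranking above (the full list is longer)
pt_views = {'Partido Democrático (Itália)': 567.0, 'G7': 259.0, 'G20': 185.0}
pt_shares = relative_frequencies(pt_views)

for title, share in pt_shares.items():
    print(title, round(share, 3))
```

Because the shares of each language sum to 1, the IT and PT bars become comparable even though the absolute numbers of page views differ by orders of magnitude.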