Remark: Since interactive plots are present, open this link to read the Notebook correctly.
In [1]:
import plotly
from pageviews import *
from wiki_parser import *
import plotly.tools as tls
from helpers_parser import *
from across_languages import *
plotly.tools.set_credentials_file(username='crimenghini', api_key='***')
The first goal is to find all the Italian and Portuguese Wikipedia articles that mention Matteo Renzi. To do so, we use the WikiHandler class, which goes through the raw data and stores the title and the text of each element of the corpora that mentions the Italian (almost) ex-Prime Minister.
In this example we focus on the sets of articles written in Italian and in Portuguese, collected up to 20th November 2016 and 1st December 2016 respectively. In general, the code can handle more than two languages. The README file contains the information about how the data were collected.
In [2]:
# Define the path of the corpora
path = '/Users/cristinamenghini/Downloads/'
# Xml file
xml_files = ['itwiki-20161120-pages-articles-multistream.xml',
'ptwiki-20161201-pages-articles-multistream.xml']
After a quick peek at a snippet of the XML, we see that the elements we are interested in are under the child page, which identifies an article. From each page we want the contents of title and text.
Because of the large size of the XML, we opted for a parser that registers callbacks for events of interest and then lets the parser proceed through the document. The text of the articles has not been preprocessed, since our analysis does not look at the text itself.
Hence, we parse the Italian corpus using the parse_articles function stored in the wiki_parser library, which essentially runs the parser.
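The callback-based parsing described above can be illustrated with the standard library's `xml.sax`. This is only a minimal sketch: the `MentionHandler` class below is hypothetical and is not the WikiHandler used in this notebook, and the tiny XML sample is made up to mimic the page/title/text layout.

```python
import xml.sax


class MentionHandler(xml.sax.ContentHandler):
    """Collect title and text of every <page> whose text contains `mention`."""

    def __init__(self, mention):
        super().__init__()
        self.mention = mention
        self.current_tag = None  # tag whose characters are being buffered
        self.title_parts = []
        self.text_parts = []
        self.matches = []

    def startElement(self, name, attrs):
        if name == 'page':
            # A new article starts: reset the buffers
            self.title_parts, self.text_parts = [], []
        elif name in ('title', 'text'):
            self.current_tag = name

    def characters(self, content):
        # The parser may deliver one text node in several chunks
        if self.current_tag == 'title':
            self.title_parts.append(content)
        elif self.current_tag == 'text':
            self.text_parts.append(content)

    def endElement(self, name):
        if name in ('title', 'text'):
            self.current_tag = None
        elif name == 'page':
            text = ''.join(self.text_parts)
            if self.mention in text:
                self.matches.append({'title': ''.join(self.title_parts), 'text': text})


# Made-up sample that mimics the structure of the dump
sample = """<mediawiki>
  <page><title>Matteo Renzi</title><revision><text>Matteo Renzi is a politician.</text></revision></page>
  <page><title>Pizza</title><revision><text>A dish from Naples.</text></revision></page>
</mediawiki>"""

handler = MentionHandler('Matteo Renzi')
xml.sax.parseString(sample.encode('utf-8'), handler)
print([match['title'] for match in handler.matches])
```

Because the handler only keeps the buffers of the page currently being read, memory use stays small even when the dump itself is very large.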
In [4]:
# Parse italian corpus
parse_articles('ita', path + xml_files[0], 'Matteo Renzi')
Then we move on to the Portuguese one.
In [5]:
# Parse portuguese corpus
parse_articles('port', path + xml_files[1], 'Matteo Renzi')
The articles are filtered according to whether they mention Matteo Renzi. Those in Italian are stored in a .json file in which each line corresponds to a page (title, text). The same holds for the articles in Portuguese. The two corpora are automatically stored in the folder Corpus.
{"title": "title_1", "text": "text_1"}
...
...
{"title": "title_n", "text": "text_n"}
Once the data have been filtered, we proceed with a simple analysis of the pageviews. First, the article_df_from_json function extracts all the article titles from the corpus and stores them in a DataFrame.
In [3]:
# Get the df for the Italian articles
df_it_titles = article_df_from_json('Corpus/wiki_ita_Matteo_Renzi.json')
# Get the df for the Portuguese articles
df_pt_titles = article_df_from_json('Corpus/wiki_port_Matteo_Renzi.json')
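As an illustration of the one-JSON-object-per-line layout of the corpus files, the titles could be read back roughly as follows. The function `titles_from_json_lines` is a hypothetical stand-in, not the actual article_df_from_json, and the two sample lines are placeholders.

```python
import io
import json


def titles_from_json_lines(lines):
    """Return the title stored in every non-empty JSON line."""
    titles = []
    for line in lines:
        line = line.strip()
        if line:
            titles.append(json.loads(line)['title'])
    return titles


# Placeholder corpus with the same layout as the files in the Corpus folder
corpus = io.StringIO(
    '{"title": "title_1", "text": "text_1"}\n'
    '{"title": "title_2", "text": "text_2"}\n'
)
titles = titles_from_json_lines(corpus)
print(titles)
```

The resulting list could then be wrapped in a DataFrame with a single Title column, which is the shape the later cells appear to rely on.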
Take a look at the resulting DataFrame.
In [4]:
df_it_titles.sample(5)
Out[4]:
Next, we extract the number of monthly page views for each article in the languages of interest (i.e. it and pt) from the page views file (see Additional data in the README). To filter the file we use the filter_pageviews_file function, which returns a dictionary of dictionaries with the following structure (for our example), where 'No pageviews' stands for the number of page views:
{'it':{'Title_1':'No pageviews',
...
'Title_n':'No pageviews'},
'pt':{'Title_1':'No pageviews',
...
'Title_k':'No pageviews'}}
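A nested dictionary of this shape could be built along the following lines. This is a sketch under stated assumptions: each line of the file is assumed to hold three whitespace-separated fields (project code, title, monthly total), and the project code of a Wikipedia edition is assumed to be the language code followed by `.z`. The function `filter_pageviews` is hypothetical, not the notebook's filter_pageviews_file, and the counts are made up.

```python
import io


def filter_pageviews(lines, languages):
    """Build {language: {title: pageviews}} for the requested languages."""
    # Assumption: '<lang>.z' is the project code of the <lang> Wikipedia
    wanted = {language + '.z': language for language in languages}
    views = {language: {} for language in languages}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip lines that do not follow the assumed layout
        project, title, count = fields
        if project in wanted:
            views[wanted[project]][title] = int(count)
    return views


# Made-up lines in the assumed layout
sample = io.StringIO(
    'it.z Matteo_Renzi 1200\n'
    'pt.z Matteo_Renzi 300\n'
    'en.z Matteo_Renzi 5000\n'
)
pageviews = filter_pageviews(sample, ['pt', 'it'])
print(pageviews)
```

Reading the file line by line keeps memory use proportional to the number of retained titles rather than to the size of the file.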
In [5]:
# Page views file
pageviews_file = 'pagecounts-2016-11-views-ge-5-totals'
# Filter the page view file
articles_pageviews = filter_pageviews_file(path + pageviews_file, ['pt','it'])
Then a right join is performed between the two DataFrames, namely the one obtained from the pageviews and the one obtained from the corpus. It turns out that, for both Italian and Portuguese, some articles that mention Matteo Renzi were not viewed in November. The define_ranked_df function is stored in the pageviews library.
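The effect of the right join can be seen on a toy example with `pandas.merge`. The frames and values below are made up, and the column names Title and Pageviews are assumed from the cells of this notebook; this is not the body of define_ranked_df.

```python
import pandas as pd

# Made-up values, for illustration only
df_views = pd.DataFrame({'Title': ['A', 'B'], 'Pageviews': [120, 45]})
df_titles = pd.DataFrame({'Title': ['A', 'B', 'C']})

# A right join keeps every row of the right-hand frame (the corpus titles)
ranked = pd.merge(df_views, df_titles, on='Title', how='right')
ranked = ranked.sort_values('Pageviews', ascending=False, na_position='last')
ranked = ranked.reset_index(drop=True)

# Titles that never appear in the pageviews frame end up with a missing count
not_viewed = ranked.loc[ranked['Pageviews'].isna(), 'Title'].tolist()
print(not_viewed)
```

With an inner join the article C would silently disappear, which is why the right join is the natural choice when the corpus titles are the reference set.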
In [6]:
# Define the italian ranked article df according to the number of page views
ranked_df_ita = define_ranked_df(articles_pageviews, 'it', df_it_titles)
# Show the df head
ranked_df_ita.head(10)
Out[6]:
In [7]:
# Define the Portuguese ranked article df according to the number of page views
ranked_df_port = define_ranked_df(articles_pageviews, 'pt', df_pt_titles)
# Show the df
ranked_df_port.head(10)
Out[7]:
Having a quick glance at the two top 10, we notice:
Partito Democratico (Italia).
We now move on to explore the preprocessed data and look for something interesting.
The DataFrame below, obtained using the article_mentions function in this library, shows the number of mentions that Matteo Renzi receives in each article, for both the Italian and the Portuguese corpora. The DataFrames are sorted by the number of mentions, so that the pages where Matteo Renzi is most "popular" come first.
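Counting the mentions per article can be sketched with plain substring counting. The function `count_mentions` is hypothetical and only assumes the title/text JSON-lines layout described earlier; the actual article_mentions may count differently (for example, it might handle case or inflected forms).

```python
import json


def count_mentions(json_lines, mention):
    """Return (title, number of mentions) pairs, most mentions first."""
    counts = []
    for line in json_lines:
        article = json.loads(line)
        counts.append((article['title'], article['text'].count(mention)))
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up articles in the JSON-lines layout
lines = [
    json.dumps({'title': 'A', 'text': 'Matteo Renzi met X.'}),
    json.dumps({'title': 'B', 'text': 'Matteo Renzi spoke. Later Matteo Renzi left.'}),
]
ranking = count_mentions(lines, 'Matteo Renzi')
print(ranking)
```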
In [8]:
# Italian df of mentions per page
df_it_mentions = article_mentions('Corpus/wiki_ita_Matteo_Renzi.json', 'Matteo Renzi')
# Sort the df by the number of mentions and see the top 5
df_it_mentions = df_it_mentions.sort_values('Number of mentions', ascending = False)
# Show results
df_it_mentions.head(5)
Out[8]:
In [9]:
# Portuguese df of mentions per page
df_pt_mentions = article_mentions('Corpus/wiki_port_Matteo_Renzi.json', 'Matteo Renzi')
# Sort the df by the number of mentions and see the top 5
df_pt_mentions = df_pt_mentions.sort_values('Number of mentions', ascending = False)
# Show results
df_pt_mentions.head(5)
Out[9]:
Comparing the two DataFrames, we immediately notice that the maximum number of mentions that Matteo Renzi receives is very different for Italian and Portuguese articles. In the Portuguese corpus only two articles have more than 5 mentions. Thus, it can be interesting to visualize the distribution of the mentions for both the IT and PT corpora.
The distributions are represented using boxplots. They show that, for both languages, 75% of the articles contain no more than 3 mentions of the Italian premier. In the Portuguese corpus two outliers stand out, corresponding to Matteo Renzi (11 mentions) and Partido Democrático (Itália) (6 mentions), whereas in the Italian corpus there are more outliers and the maximum number of mentions is found in Matteo Renzi (62 mentions). Moreover, zooming in on the boxes, we observe that both distributions are concentrated at the low end (number of mentions equal to 1), with a long tail towards higher values, i.e. they are right-skewed.
In [10]:
#boxplot_mentions(df_pt_mentions, df_it_mentions, 'PT', 'IT', 'Number of mentions')
tls.embed("https://plot.ly/~crimenghini/20")
Out[10]:
In this direction, one aspect that can be considered is the following:
Define how important Matteo Renzi is in the articles that mention him. This requires defining the concept of importance. Intuitively, the higher the number of mentions, the greater the importance of our object in the article. Moreover, it may be useful to weight the number of mentions by the number of words in the article: $$I_{string} = \frac{M}{|D|}$$ where I is the importance, M is the number of mentions and |D| is the number of words in the document D. In this way, if an article cites Renzi only once but consists of just a few lines, the string of interest will turn out to be more significant.
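A small sketch of this length-normalized importance, assuming that words are simply whitespace-separated tokens and that mentions are exact substring matches. The function name `importance` and the two sample texts are made up for illustration.

```python
def importance(text, mention):
    """I = M / |D|: number of mentions divided by the number of words."""
    n_words = len(text.split())
    if n_words == 0:
        return 0.0
    return text.count(mention) / n_words


# Two made-up articles with one mention each but very different lengths
short_article = 'Matteo Renzi visited the town.'
long_article = short_article + ' ' + 'Nothing else happened that day. ' * 10

print(importance(short_article, 'Matteo Renzi'))
print(importance(long_article, 'Matteo Renzi'))
```

The single mention weighs more in the short article (5 words) than in the long one (55 words), which is the behaviour the definition is meant to capture.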
Moreover, another aspect should be considered, especially when there is only one mention:
Another thing that can be visualized is the relationship between the Number of mentions and the Pageviews. To do that, we first merge the pageviews and mentions DataFrames.
In [11]:
# Merge pageviews and mentions DataFrames for IT
df_it_mension_pageview = pd.merge(df_it_mentions, ranked_df_ita, on=['Title'])
# Show it
df_it_mension_pageview.sample(5)
Out[11]:
In [12]:
# Merge pageviews and mentions DataFrames for PT
df_pt_mension_pageview = pd.merge(df_pt_mentions, ranked_df_port, on=['Title'])
# Show it
df_pt_mension_pageview.sample(5)
Out[12]:
A scatterplot is used to get how an article is positioned according to these two variables. The plot shows:
In [13]:
# def scatter_plot(df_it_mension_pageview, df_pt_mension_pageview, 'Number of mentions', 'Pageviews', 'Italian', 'Portuguese')
tls.embed('https://plot.ly/~crimenghini/36')
Out[13]:
Regarding these two features, another direction worth exploring is the following:
Consider how the number of pageviews of an article changes when the number of Matteo Renzi citations increases from one revision to the next. In particular, the importance (I) is redefined as: $$I = \sum_{t = 1}^{T} \frac{(p_t-p_{t-1}) \times m_t}{|D_t|}$$ where t indexes the sequential revisions of the article, p_t is the number of page views at time t, m_t is the number of mentions at time t and |D_t| is the number of words in the article at time t.
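The revision-based importance can be computed directly from three aligned sequences. In this sketch the lists are indexed by revision, with index 0 acting as the baseline p_0 that the sum needs; this baseline convention and all the numbers are assumptions made for illustration.

```python
def revision_importance(pageviews, mentions, lengths):
    """Sum over t >= 1 of (p_t - p_{t-1}) * m_t / |D_t|.

    The three lists are indexed by revision; index 0 is the baseline.
    """
    total = 0.0
    for t in range(1, len(pageviews)):
        total += (pageviews[t] - pageviews[t - 1]) * mentions[t] / lengths[t]
    return total


# Made-up values: a baseline followed by three revisions
p = [100, 150, 140, 200]   # page views
m = [0, 1, 1, 3]           # mentions
d = [500, 500, 520, 600]   # words in the article
score = revision_importance(p, m, d)
print(score)
```

Note that a revision after which the page views drop contributes a negative term, so the score can decrease even when the number of mentions grows.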
Next, we look for articles that exist in both languages and mention Matteo Renzi. To do so, we send a request for each Portuguese Wikipedia page that cites Renzi, then we parse the HTML source to extract, where available, the title of the corresponding IT article. More precisely, the requests are sent for the titles of the language that has fewer articles matching Matteo Renzi. The function get_matches is stored in this library.
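The HTML-parsing step could look roughly like the sketch below, which works on a hard-coded HTML fragment rather than on a live request. It assumes that the link to the other language edition is an anchor carrying the class `interlanguage-link-target` and an `hreflang` attribute; the class `InterlanguageLinkParser` is hypothetical and get_matches may extract the title in a different way.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class InterlanguageLinkParser(HTMLParser):
    """Find the first interlanguage link pointing to `target_lang`."""

    def __init__(self, target_lang):
        super().__init__()
        self.target_lang = target_lang
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.title is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        # Assumed markup: class 'interlanguage-link-target' plus hreflang
        if 'interlanguage-link-target' in classes and attrs.get('hreflang') == self.target_lang:
            # The article title is the last path segment of the link
            path = urlparse(attrs.get('href') or '').path
            self.title = unquote(path.rsplit('/', 1)[-1]).replace('_', ' ')


# Made-up fragment in the assumed markup
html = (
    '<li class="interlanguage-link interwiki-it">'
    '<a href="https://it.wikipedia.org/wiki/Partito_Democratico_(Italia)" '
    'hreflang="it" class="interlanguage-link-target">Italiano</a></li>'
)
parser = InterlanguageLinkParser('it')
parser.feed(html)
print(parser.title)
```

When no link to the target language is present, `title` stays `None`, which corresponds to a PT article without an IT match.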
In [14]:
# Build the common articles matches
dict_italian = get_matches(df_pt_titles, 'it')
# Create the inverted one
inverted_dict = {v : k for k, v in dict_italian.items()}
In [15]:
print ('The Portuguese articles that mention Matteo Renzi and correspond to an Italian article are: ', len(dict_italian),
'. The number of PT articles that have not been matched is: ', len(df_pt_titles)-len(dict_italian), '.')
We proceed to create a DataFrame that contains the information related to those articles.
In [16]:
# From the dictionary get the titles of both languages
italian_titles = list(dict_italian.values())
portugues_titles = list(dict_italian.keys())
Before going further, we check whether all the matched IT articles mention Matteo Renzi. To do so, we run a query on the DataFrame that stores all the IT articles that cite Renzi.
In [17]:
# Run the query
match_with_mention = df_it_titles.query('Title in @italian_titles')
# Get the number
print ('There are ', len(portugues_titles)-len(match_with_mention), 'IT articles that do not mention Matteo Renzi.')
In [18]:
# Re-define the list of IT articles according to the aforementioned "issue"
it_titles_with_mention = list(match_with_mention.Title)
In [19]:
# Re-define the two dictionaries
dict_italian_mentions = {k:v for k,v in dict_italian.items() if v in it_titles_with_mention}
# Define the inverted dictionary from the filtered matches
inverted_dict_italian_mentions = {v : k for k, v in dict_italian_mentions.items()}
# Create the list of PT titles whose matched IT article mentions Renzi
pt_titles_with_mention = list(dict_italian_mentions.keys())
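The filtering and inversion of the match dictionary can be checked on toy data. The titles below are placeholders, and the sketch assumes that each IT title is matched by at most one PT title, otherwise the inversion would silently drop pairs.

```python
# Hypothetical PT -> IT title matches
pt_to_it = {'Pt A': 'It A', 'Pt B': 'It B', 'Pt C': 'It C'}
# Hypothetical IT titles that mention the string of interest
it_with_mention = {'It A', 'It C'}

# Keep only the pairs whose IT title mentions the string
filtered = {pt: it for pt, it in pt_to_it.items() if it in it_with_mention}
# Invert the filtered mapping (safe only if the IT titles are unique)
inverted = {it: pt for pt, it in filtered.items()}

# If two PT titles pointed to the same IT title, this check would fail
assert len(inverted) == len(filtered)
print(filtered)
print(inverted)
```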
Then, we create a single DataFrame which contains the mentions in the IT and PT articles for each pair of matched articles.
In [20]:
# Create df for IT mentions
df_match_it_mentions = df_it_mentions.query('Title in @it_titles_with_mention').sort_values('Number of mentions', ascending = False)
# Create df for PT mentions
df_match_pt_mentions = df_pt_mentions.query('Title in @pt_titles_with_mention').sort_values('Number of mentions', ascending = False)
In [21]:
# Create new column
new_column_it = ['/'.join([k]+[v]) for i in df_match_it_mentions.Title for k,v in dict_italian_mentions.items() if i == v]
new_column_pt = ['/'.join([k]+[v]) for i in df_match_pt_mentions.Title for k,v in dict_italian_mentions.items() if i == k]
# Add the new column to the two dataframes
df_match_it_mentions['Matches'] = new_column_it
df_match_pt_mentions['Matches'] = new_column_pt
We then join the two DataFrames on Matches and plot the results.
In [22]:
# Join the two dfs on the correspondence tuples
matches_mention = pd.merge(df_match_it_mentions, df_match_pt_mentions, on = 'Matches', suffixes = ('_IT','_PT'))
# Show result
matches_mention.head()
Out[22]:
In [23]:
# bar_plot(df, 'Matches', 'Number of mentions_IT', 'Number of mentions_PT', 'IT', 'PT', 'Compare IT and PT mentions',
# 'Article','No. mentions', 'color-bar-prova')
tls.embed('https://plot.ly/~crimenghini/38')
Out[23]:
From the plot:
This kind of analysis raises the following question:
Given articles in different languages that correspond to each other, if we are interested in measuring the proximity of these articles, one element that may be considered is the number of common mentions. It is likely that the need to mention someone or something arises because the two articles discuss the same topics and therefore need to refer to the same things.
The same procedure is repeated for the page views.
In [24]:
# Run the query
match_with_pageviews_it = ranked_df_ita.query('Title in @italian_titles')
match_with_pageviews_pt = ranked_df_port.query('Title in @portugues_titles')
# Get the number
print ('There are ', len(portugues_titles)-len(match_with_pageviews_it), 'IT articles that have not been visited.')
print ('There are ', len(portugues_titles)-len(match_with_pageviews_pt), 'PT articles that have not been visited.')
In [25]:
# Define list of articles that have been visualized
it_titles_with_pageviews = list(match_with_pageviews_it.Title)
pt_titles_with_pageviews = list(match_with_pageviews_pt.Title)
In [26]:
# Re-define the two dictionaries according to this evidence
dict_italian_pageviews = {k:v for k,v in dict_italian.items() if v in it_titles_with_pageviews}
# PT
dict_pt_pageviews = {v : k for k, v in dict_italian.items() if k in pt_titles_with_pageviews}
In [27]:
# Create df for IT pageviews
df_match_it_pageviews = ranked_df_ita.query('Title in @it_titles_with_pageviews').sort_values('Pageviews', ascending = False)
# Create df for PT pageviews
df_match_pt_pageviews = ranked_df_port.query('Title in @pt_titles_with_pageviews').sort_values('Pageviews', ascending = False)
In [28]:
# Create new column
new_column_it = ['/'.join([k]+[v]) for i in df_match_it_pageviews.Title for k,v in dict_italian_pageviews.items() if i == v]
new_column_pt = ['/'.join([v]+[k]) for i in df_match_pt_pageviews.Title for k,v in dict_pt_pageviews.items() if i == v]
# Add the new column to the two dataframes
df_match_it_pageviews['Matches'] = new_column_it
df_match_pt_pageviews['Matches'] = new_column_pt
In [29]:
df_match_it_pageviews.head()
Out[29]:
In [30]:
df_match_pt_pageviews.head()
Out[30]:
We merge the two DataFrames with a right join, so that we also see the PT articles whose IT counterpart has not been viewed.
In [31]:
# Join the two dfs on the correspondence tuples
matches_pageviews = pd.merge(df_match_it_pageviews, df_match_pt_pageviews, how = 'right',on = 'Matches', suffixes = ('_IT','_PT'))
matches_pageviews.fillna(0, inplace =True)
# Show result
matches_pageviews.head()
Out[31]:
We use a bar plot to visualize the results.
In [32]:
# bar_plot(df, 'Matches', 'Pageviews_IT', 'Pageviews_PT', 'IT', 'PT', 'Compare IT and PT pageviews', 'Article',
# 'No. pageviews', 'color-bar-pvs')
tls.embed('https://plot.ly/~crimenghini/40')
Out[32]:
From the plot:
It could be interesting:
To present the same plot using the relative frequencies of the visits, to see the importance of each page with respect to the list of articles (that mention Renzi) in that language.
To study the relationships between the articles that mention Renzi, in particular whether they are connected and point to each other. This may be used to define the importance of Matteo Renzi in an article: for example, if Matteo Renzi is mentioned on the page of a TV show (just because he has been a guest) and the page is not connected to other articles, it is possible to assume that Renzi is not the main topic of the article. I'm not totally sure it can be done, since moving from one article to another (even if they talk about extremely different topics) does not take many hops.
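For the first idea, the relative frequencies could be obtained by dividing each article's pageviews by the total over the articles of the same language. The function `relative_frequencies` and the counts below are made up for illustration.

```python
def relative_frequencies(pageviews):
    """Map each title to its share of the total pageviews."""
    total = sum(pageviews.values())
    if total == 0:
        return {title: 0.0 for title in pageviews}
    return {title: views / total for title, views in pageviews.items()}


# Made-up monthly pageviews for matched articles in the two languages
it_views = {'Matteo Renzi': 9000, 'Partito Democratico (Italia)': 1000}
pt_views = {'Matteo Renzi': 300, 'Partido Democrático (Itália)': 100}

it_share = relative_frequencies(it_views)
pt_share = relative_frequencies(pt_views)
print(it_share)
print(pt_share)
```

Shares make the two languages comparable even though the absolute traffic of the two Wikipedia editions is very different.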