To view the notebook, please use this nbviewer link
Most of the code is implemented in helper functions defined in separate files, to keep this notebook as clean as possible.
Summary of what each file does:
The full GDELT Events (2.0) dataset represents 200k files, 100 GB uncompressed.
After cleaning and keeping only the information we need, we are left with a single file of 6.2 GB uncompressed.
In [28]:
import pandas as pd
import datetime
import numpy as np
This pipeline sums up all the data processing done before visualization. This includes data acquisition, data cleaning, and data augmentation.
Everything below, up to the section "Post milestone 2 work", was done before Milestone 3 and hence does not reflect 100% of what is present in the final work, but we keep it for completeness.
In [2]:
# might be better to import the code and not a file (to show what we've done)
from fetch_gdelt_data import *
from clean_data import clean_df
Below, we specify the date interval of the data to load into a DataFrame, downloading it first if we do not already have it locally.
In [3]:
start = datetime.date(2015, 3, 1)
end = datetime.date(2015, 4, 1)
To load and download the data, a single function call is needed. We can specify whether we want the translingual version or the English-only one.
In [4]:
test_df = fetch_df(start, end, translingual=True, v='2')
We only keep the information about the event type and location, the source URL and number of mentions, and the Goldstein scale and average tone of the event. We drop every event with missing entries and add a column containing the ISO 3166-1 alpha-3 code of the country where the event happens.
In [5]:
selected_df = clean_df(test_df)
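For reference, here is a minimal sketch of what the cleaning step roughly does, assuming the raw GDELT 2.0 column names; the FIPS-to-ISO mapping below is hypothetical and partial, and the actual implementation lives in clean_data.py.

def clean_df_sketch(df):
    # Keep only the columns we care about (raw GDELT 2.0 event schema names).
    kept = ['Day', 'EventCode', 'ActionGeo_CountryCode', 'ActionGeo_Lat', 'ActionGeo_Long',
            'SOURCEURL', 'NumMentions', 'GoldsteinScale', 'AvgTone']
    df = df[kept].dropna().copy()
    # GDELT stores FIPS country codes; map them to ISO 3166-1 alpha-3.
    fips_to_iso = {'US': 'USA', 'SZ': 'CHE', 'FR': 'FRA'}  # hypothetical, partial mapping
    df['Country_Code'] = df['ActionGeo_CountryCode'].map(fips_to_iso)
    return df.dropna(subset=['Country_Code'])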
To show how we can visualize the data, we will use folium and plotly in what follows.
In [6]:
import json
import branca
import folium
from folium.plugins import HeatMap
from fetch_location import get_country_from_point, get_mapping
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
We load the geojson that will be used to aggregate the data by country and display it in a choropleth.
In [7]:
world_geo_path = '../data/locations/countries.geo.json'
world_json_data = json.load(open(world_geo_path, encoding="UTF-8"))
The data contains a code for each country, from which we can easily aggregate the events together. However, the corresponding country name is not always consistent, as it sometimes contains details at the city/state level.
For the reason above, it is not easy to know which country code corresponds to which polygon of the geojson. The easiest solution we found was to test, using the longitude and latitude, in which polygon the event happened, and to build a mapping country_code -> polygon_name.
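A rough sketch of how such a mapping can be built is shown below, assuming shapely is available; the actual logic lives in fetch_location.get_mapping.

from shapely.geometry import shape, Point

def build_code_to_polygon_mapping(events_df, geojson):
    # Convert every geojson feature to a shapely polygon, keyed by its display name.
    polygons = {feat['properties']['name']: shape(feat['geometry'])
                for feat in geojson['features']}
    mapping = {}
    # For each country code, take the coordinates of one of its events and
    # find the polygon that contains that point.
    for code, group in events_df.groupby('Country_Code'):
        lat, lon = group[['ActionGeo_Lat', 'ActionGeo_Long']].iloc[0]
        point = Point(lon, lat)  # shapely points are (x=longitude, y=latitude)
        for name, poly in polygons.items():
            if poly.contains(point):
                mapping[code] = name
                break
    return mapping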
Below we compute the different metrics multiplied by their "importance" (the number of mentions); this could be done differently.
In [8]:
selected_df.loc[:,'pondered_GoldsteinScale'] = selected_df.loc[:,'GoldsteinScale'] * selected_df.loc[:,'NumMentions']
selected_df.loc[:,'pondered_AvgTone'] = selected_df.loc[:,'AvgTone'] * selected_df.loc[:,'NumMentions']
Once we have the mapping, we can give a score to each country based on a chosen metric (average tone of the news toward the event, Goldstein scale, etc.), and map the index of each country (country_code) to the name of the polygon that represents it.
In [9]:
chosen_metric = 'pondered_GoldsteinScale'
In [10]:
scores = selected_df.groupby('Country_Code')[chosen_metric].agg('mean')
rate_min = min(scores)
rate_max = max(scores)
# color scale from min rate to max rate
color_scale = branca.colormap.linear.RdYlGn.scale(rate_min, rate_max)
color_scale = color_scale.to_step(n=8)
def style_function(country):
if country['id'] in scores.index.values:
# country is in the dataframe
score = scores.loc[country['id']].mean()
return {
'fillOpacity': 0.8,
'weight': 0.5,
'color': 'black',
'fillColor': color_scale(score)
}
else:
# country is not in the dataframe, hence we put its color as black
return {
'fillOpacity': 0.2,
'weight': 0.2,
'color': 'black',
'fillColor': 'black'
}
def highlight_function(i):
return {
'weight': 2,
'fillOpacity': .2
}
world_map = folium.Map([25, 0], tiles='', zoom_start=2)
g = folium.GeoJson(world_json_data, style_function=style_function, highlight_function=highlight_function).add_to(world_map)
color_scale.caption = ' '.join(chosen_metric.split('_'))
color_scale.add_to(world_map)
del style_function
del highlight_function
world_map
Out[10]:
Here we plotted the pondered (i.e., weighted) Goldstein scale for each country over the specified time interval. We multiplied the Goldstein scale of each event by the number of times its news source was mentioned during the first 15 minutes, to give more weight to more important events.
This can be interpreted as follows: reddish countries have had events that shook their stability, while greener countries have had more events that were beneficial to their stability.
The function below will be useful later on, when we need to select events conditionally on one of their features.
In [11]:
def select_events(df, feature, selector):
'''Example of use : select_events(selected_df, 'EventCode', lambda x: x[:2] == '08')'''
return df[df[feature].apply(selector)]
We can use the function select_events created earlier to fine-tune the events we want to plot, and show them on a heatmap. Using the feature EventCode, we will show every event that involves a threat toward someone.
In [24]:
selector = lambda x: x[:3] == '130'
selected_events = select_events(selected_df, 'EventCode', selector)
In [25]:
positions = selected_events.groupby(['ActionGeo_Lat', 'ActionGeo_Long'])['Day'].agg(['count'])
In [26]:
points = np.zeros((len(positions.values), 3))
for i in range(len(positions.values)):
points[i][0] = positions.index[i][0]
points[i][1] = positions.index[i][1]
points[i][2] = positions.values[i]
In [27]:
m_h = folium.Map([25, 0], tiles='stamentoner', zoom_start=2)
HeatMap(points.tolist()).add_to(m_h)
m_h
Out[27]:
Here we can see that there is a higher density of events involving threats in Europe compared to North America.
However, heatmaps can be a bit ineffective when looking at the map at a small scale. In the future, we will rather use bubble maps.
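As a preview, a bubble map can be built with plotly's Scattergeo trace, sizing each marker by the event count at that location; this is only a minimal sketch reusing the positions aggregation above.

from plotly.graph_objs import Scattergeo

bubble_trace = Scattergeo(
    lat=[lat for lat, lon in positions.index],
    lon=[lon for lat, lon in positions.index],
    mode='markers',
    marker={
        # Marker area proportional to the number of events at that location.
        'size': (positions['count'] / positions['count'].max() * 50).values,
        'sizemode': 'area',
    },
)
iplot({'data': [bubble_trace], 'layout': {'title': 'Threat-related events (bubble map)'}})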
We saw during the data exploration that most of the sources come with a URL. From this, we tried to determine in which country the news was written. Given a country X, the goal was to answer the following questions: "Which countries write about the events happening in X?" and "Where do the events reported by the news written in X happen?".
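One simple heuristic for this (not necessarily the one implemented in our scraping code) is to map the top-level domain of the source URL to a country; generic domains such as .com stay unresolved. The mapping below is hypothetical and partial, for illustration only.

from urllib.parse import urlparse

# Hypothetical, partial TLD -> country mapping used only for illustration.
TLD_TO_COUNTRY = {'ch': 'Switzerland', 'fr': 'France', 'de': 'Germany', 'uk': 'United Kingdom'}

def source_country_from_url(url):
    host = urlparse(url).netloc
    tld = host.rsplit('.', 1)[-1].lower()
    return TLD_TO_COUNTRY.get(tld, 'NOENTRY')  # NOENTRY marks unresolved sources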
To answer those questions, we will use a Sankey diagram.
Here, we will focus on Switzerland. We keep the data where the source of the event comes from Switzerland and group by the country where the event happens (right_t). We also keep the data where the event happens in Switzerland and group by the country the news comes from (left_t).
In [16]:
# Groupby and then aggregate for the left part and the right part of the Sankey diagram
left_t = selected_df[selected_df['Country_Name'] == 'Switzerland'][['Country_Source', 'Day']].groupby('Country_Source').agg(['count'])
right_t = selected_df[selected_df['Country_Source'] == 'Switzerland'][['Country_Name', 'Day']].groupby('Country_Name').agg(['count'])
From those groupbys, we keep only the top ten countries in each group.
In [17]:
left = left_t['Day']['count'].sort_values(ascending=False)[:10].reset_index()
right = right_t['Day']['count'].sort_values(ascending=False)[:10].reset_index()
We then merge the two groups together so that we can plot them in a Sankey diagram.
In [18]:
# Merge the two parts (in a rather clumsy way) so that plotly can be used
data = right.reset_index()
data['Source'] = 'Switzerland '
data['Target'] = data['Country_Name'] + ' '
data['Value'] = data['count']
data['Label'] = data['Country_Name']
data2 = left.reset_index()
data2['Source'] = data2['Country_Source']
data2['Target'] = 'Switzerland '
data2['Value'] = data2['count']
data2['Label'] = data2['Country_Source']
l = np.concatenate([data['Source'].values, data2['Source'].values, data2['Target'].values, data['Target'].values], axis=0)
d = dict([(y,x) for x,y in enumerate(sorted(set(l)))])
data3 = pd.concat([data, data2])[['Source', 'Target', 'Value', 'Label']]
data3['Target'] = data3['Target'].map(d)
data3['Source'] = data3['Source'].map(d)
In [19]:
trace1 = {
"domain": {
"x": [0, 1],
"y": [0, 1]
},
"link": {
"label": data3['Label'].values,
"source": data3['Source'].values,
"target": data3['Target'].values,
"value": data3['Value'].values
},
"node": {"label": list(sorted(set(l)))},
"type": "sankey"
}
layout = dict(
title = "Sankey Diagram",
height = 1000,
width = 1000,
font = dict(
size = 10
),
)
fig = dict(data=[trace1], layout=layout)
iplot(fig)
Here we have, on the left, the countries whose news talk about events happening in Switzerland. NOENTRY corresponds to websites whose location we do not know yet. We are still scraping the web for more data and should be done in a week or two. The first country on the left (if we do not count NOENTRY) is Switzerland, which is not surprising.
On the right side, we have the countries where the events our newspapers/websites write about happen. Here too, Switzerland is the most frequent. The following ones will most likely be countries with conflicts and neighboring countries.
Over a larger time span, we could see more trends, select only a subset of the data to show interest depending on the event type, and so on.
Although we had many powerful functions to fetch and clean the data, we were still unable to fetch, clean, and store all the data GDELT 2.0 provides in an efficient manner.
To solve this problem, we mostly put together all the pieces of code we already had. We also added a cleaning step that converts country names to their ISO 3166-1 alpha-3 code. Finally, we updated to a second version our code that maps a source to a country (the country in which the news was written). We now achieve an accuracy of 77%, that is, 77% of the news get a source country assigned; the remaining 23% are mostly international websites or websites not considered online newspapers, and those unassigned news are discarded for this analysis.
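As an illustration, the country-name-to-ISO conversion can be sketched with pycountry (which we also use later in this notebook); this is a simplified version and the real cleaning code may handle more edge cases.

import pycountry as pyc

def name_to_alpha3(name):
    # Look up the ISO 3166-1 alpha-3 code for a country name;
    # return None for names pycountry cannot resolve.
    try:
        return pyc.countries.lookup(name).alpha_3
    except LookupError:
        return None

name_to_alpha3('Switzerland')  # 'CHE'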
Now, loading the entirety of the data requires a single command. The original data files amount to more than 100 GB; our final data file is 6.2 GB.
WARNING: the first time the command is executed, it takes a very long time to download and clean all the data from GDELT. At the end, the function creates the above-mentioned 6.2 GB file, and as long as that file is present in the data folder, loading takes only a few seconds.
In [ ]:
from load_data import load_data
df = load_data()
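Internally, load_data follows a download-once, cache-to-disk pattern along the lines of the sketch below; the file path matches the one used later in this notebook, while fetch_and_clean_everything is a hypothetical helper standing in for the real download-and-clean pipeline.

import os
import pandas as pd

def load_data_sketch(path='../data/final_data.csv'):
    # If the cleaned file is already there, loading only takes a few seconds.
    if os.path.exists(path):
        return pd.read_csv(path, encoding='utf-8')
    # Otherwise, download and clean everything from GDELT (very slow),
    # then cache the result so that subsequent calls are fast.
    df = fetch_and_clean_everything()  # hypothetical helper
    df.to_csv(path, index=False, encoding='utf-8')
    return df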
First, let's note that we dropped the idea of using folium and are only using plot.ly now.
We also created a website hosted on GitHub Pages to write the data story and display everything in a nice way. Thankfully, plot.ly allows us to easily export the visualizations to JavaScript so they can be included in the website.
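The export itself is a one-liner with plotly's offline API; it returns an HTML div that can be pasted into the GitHub Pages site, which is exactly what the commented print calls further below do.

import plotly

def figure_to_div(fig):
    # Return an HTML <div> string embedding the figure;
    # plotly.js itself is loaded separately on the website.
    return plotly.offline.plot(fig, include_plotlyjs=False, output_type='div')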
In [170]:
import viz
import datetime
import numpy as np
import pandas as pd
import pycountry as pyc
from sklearn import cluster
from sklearn import metrics
from sklearn import linear_model
from tqdm import tqdm_notebook as tqdm
from imp import reload
from matplotlib import pyplot as plt
from viz import world_map_figure
import plotly
import plotly.plotly as py
from plotly.graph_objs import Choropleth, Bar, Scatter
from plotly.offline import init_notebook_mode, iplot
from IPython.display import display, HTML
In [2]:
init_notebook_mode(connected=True)
In [3]:
most_common_countries = pd.read_csv('../data/viz/most_common_countries.csv', header=None, names=['Country', 'Count'])
In [4]:
show_top = 10
In [5]:
most_common_countries['Percentage'] = most_common_countries['Count'] / most_common_countries['Count'].sum()
In [433]:
most_common_countries_data_plot = [Bar(
x=most_common_countries['Country'][:show_top],
y=most_common_countries['Percentage'][:show_top]
)]
most_common_countries_fig = {
'data': most_common_countries_data_plot,
'layout': {
'title': 'Most Common Countries',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'xaxis': {
'title': 'Country',
},
'yaxis': {
'title': 'Proportion of total countries',
}
}
}
iplot(most_common_countries_fig)
In [524]:
#print(plotly.offline.plot(most_common_countries_fig, include_plotlyjs=False, output_type='div'))
In [8]:
most_unknown_websites = pd.read_csv('../data/viz/most_unknown_websites.csv', header=None, names=['URL', 'Count'])
In [9]:
show_top = 10
In [10]:
most_unknown_websites['Percentage'] = most_unknown_websites['Count'] / most_unknown_websites['Count'].sum()
In [502]:
most_unknown_websites_data_plot = [Bar(
x=most_unknown_websites['URL'][:show_top],
y=most_unknown_websites['Percentage'][:show_top],
)]
most_unknown_websites_fig = {
'data': most_unknown_websites_data_plot,
'layout': {
'title': 'Most Frequent Unknown Websites',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'xaxis': {
'title': 'Website',
},
'yaxis': {
'title': 'Proportion of total websites',
}
}
}
iplot(most_unknown_websites_fig)
In [525]:
#print(plotly.offline.plot(most_unknown_websites_fig, include_plotlyjs=False, output_type='div'))
In [13]:
def select_events(df, feature, selector):
'''Example of use : select_events(selected_df, 'EventCode', lambda x: x[:2] == '08')'''
return df[df[feature].apply(selector)]
In [14]:
all_cca = [c.alpha_3 for c in pyc.countries]
all_cca_set = set(all_cca)
In [15]:
start_date = datetime.datetime(2015, 3, 1)
end_date = datetime.datetime(2017, 12, 1)
n_months = (end_date - start_date).days * 12 // 365
dates = []
for i in range(n_months):
index = start_date.month - 1 + i
month = index % 12 + 1
year = start_date.year + index // 12
date = "{}_{:02d}".format(year, month)
dates.append(date)
dates_set = set(dates)
In [16]:
df = pd.read_csv('../data/final_data.csv', encoding='utf-8')
In [17]:
df = select_events(df, 'Target_CountryCode', lambda x: x in all_cca_set)
In [18]:
df = select_events(df, 'Source_CountryCode', lambda x: x in all_cca_set)
In [19]:
df['Year_Month'] = df['Day'].apply(str).apply(lambda x: x[:4] + '_' + x[4:6])
In [20]:
df = select_events(df, 'Year_Month', lambda x: x in dates_set)
In [22]:
reload(viz)
Out[22]:
In [275]:
colorscale_perso = [[0.0, 'rgb(165,0,38)'], [0.1111111111111111, 'rgb(215,48,39)'], [0.2222222222222222, 'rgb(244,109,67)'], [0.3333333333333333, 'rgb(253,174,97)'], [0.4444444444444444, 'rgb(254,224,144)'], [0.5555555555555556, 'rgb(224,243,248)'], [0.6666666666666666, 'rgb(171,217,233)'], [0.7777777777777778, 'rgb(116,173,209)'], [0.8888888888888888, 'rgb(69,117,180)'], [1.0, 'rgb(49,54,149)']]
colorscale_perso1 = [[0.0, '0066CC'], [1, 'FFFFFF']]#, [0.2, 'F9DBBD'], [0.3, 'rgb(253,174,97)'], [0.4, 'rgb(254,224,144)'], [0.5, 'rgb(224,243,248)'], [0.6, 'rgb(171,217,233)'], [0.7, 'rgb(116,173,209)'], [0.8, 'rgb(69,117,180)'], [0.9, 'rgb(49,54,149)'], [1.0, 'rgb(49,54,149)']]
default_colorscale = [[0, "rgb(5, 10, 172)"], [0.35, "rgb(40, 60, 190)"], [0.5, "rgb(70, 100, 245)"],
                      [0.6, "rgb(90, 120, 245)"], [0.7, "rgb(106, 137, 247)"], [1, "rgb(220, 220, 220)"]]
In [24]:
# Pivot on target countries and take the median AvgTone for each month
df_tone_target = pd.pivot_table(df, values='AvgTone', index=['Target_CountryCode'], columns=['Year_Month'], aggfunc=np.median)
In [277]:
zmin_tone_target = df_tone_target.min().max()
zmax_tone_target = df_tone_target.max().min()
zmin_tone_target, zmax_tone_target
Out[277]:
In [290]:
figure_tone_target = world_map_figure(title='Average Tone Evolution',
title_colorscale='Median<br>Average Tone',
frames_title=dates,
df=df_tone_target.dropna().reset_index(),
locations_col='Target_CountryCode',
txt_fn=lambda code: pyc.countries.get(alpha_3=code).name,
zmin=zmin_tone_target,
zmax=zmax_tone_target,
colorscale=colorscale_perso1)
iplot(figure_tone_target, validate=False)
In [526]:
#print(plotly.offline.plot(figure_tone_target, include_plotlyjs=False, output_type='div'))
In [283]:
df['GoldsteinScalePondered'] = df['GoldsteinScale'] * df['NumMentions']
In [358]:
# Pivot on target countries and take the mean GoldsteinScale for each month
df_gs_target = pd.pivot_table(df, values='GoldsteinScale', index=['Target_CountryCode'], columns=['Year_Month'], aggfunc=np.mean)
In [359]:
zmin_gs_target = df_gs_target.min().median()
zmax_gs_target = df_gs_target.max().median()
zmin_gs_target, zmax_gs_target
Out[359]:
In [361]:
figure_gs_target = world_map_figure(title='Goldstein Scale Evolution',
title_colorscale='Median Pondered <br> Goldstein Scale',
frames_title=dates,
df=df_gs_target.dropna().reset_index(),
locations_col='Target_CountryCode',
txt_fn=lambda code: pyc.countries.get(alpha_3=code).name,
zmin=zmin_gs_target,
zmax=zmax_gs_target,
colorscale=colorscale_perso1)
iplot(figure_gs_target, validate=False)
In [527]:
#print(plotly.offline.plot(figure_gs_target, include_plotlyjs=False, output_type='div'))
In [35]:
%telepyth 'REALLY DONE 2'
Out[35]:
In [292]:
def tone_focus_on(df, code):
df_target = select_events(df, 'Target_CountryCode', lambda x: x == code)
return pd.pivot_table(df_target, values='AvgTone', index=['Source_CountryCode'], columns=['Year_Month'], aggfunc=np.mean).dropna()
In [293]:
df_tone_usa = tone_focus_on(df, 'USA')
In [297]:
zmin_tone_usa = df_tone_usa.min().max()
zmax_tone_usa = df_tone_usa.max().min()
zmin_tone_usa, zmax_tone_usa
Out[297]:
In [298]:
reload(viz)
Out[298]:
In [299]:
figure_tone_us = world_map_figure(title='AvgTone toward/against USA - Evolution',
title_colorscale='Median<br>AvgTone',
frames_title=dates,
df=df_tone_usa.reset_index(),
locations_col='Source_CountryCode',
txt_fn=lambda code: pyc.countries.get(alpha_3=code).name,
zmin=zmin_tone_usa,
zmax=zmax_tone_usa,
colorscale=colorscale_perso1)
iplot(figure_tone_us, validate=False)
In [528]:
#print(plotly.offline.plot(figure_tone_us, include_plotlyjs=False, output_type='div'))
In [337]:
def approx(y, degree=6):
X = np.arange(len(y))
X = X.reshape(X.shape + (1,))
X = np.concatenate([X ** i for i in range(degree)], axis=1)
#X -= X.mean()
#X /= X.std()
smoothing_model = linear_model.Lasso()
smoothing_model.fit(X, y)
return smoothing_model.predict(X), smoothing_model.coef_
In [323]:
def colors(c):
if c == 'France':
return 'rgb(22, 96, 167)'
elif c == 'Switzerland':
return 'rgb(205, 12, 24)'
else:
return 'rgb(0, 0, 0)'
In [324]:
def build_trace(x, y, name, polynomial_approx, mode='lines'):
if polynomial_approx:
y, _ = approx(y)
return Scatter(
x = x,
y = y,
mode = mode,
name = name + (' (approx)' if polynomial_approx else ''),
line = {
'color': colors(name)
}
)
In [493]:
def traces_to_fig(title, traces, xaxis='', yaxis=''):
return {
'data': traces,
'layout': {
'title': title,
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'xaxis': {
'title': xaxis,
},
'yaxis': {
'title': yaxis,
}
},
}
In [500]:
def trends_to_fig(df, countries, title, xaxis, yaxis, centered=False, polynomial_approx=False):
y = df.copy()
if centered:
y -= y.mean()
    traces = [build_trace([d.replace('_', '/') for d in dates], y.loc[c], pyc.countries.get(alpha_3=c).name, poly) for c in countries for poly in set([False, polynomial_approx])]
fig = traces_to_fig(title, traces, xaxis, yaxis)
return fig
In [501]:
some_countries = ['CHE', 'FRA', 'MEX']
title = 'Trends in the Average Tone used to relate the events happening in the USA (common trend removed, with approximation)'
xaxis = 'Date'
yaxis = 'Average Tone'
fig_trends = trends_to_fig(df_tone_usa, some_countries, title, xaxis, yaxis, centered=True, polynomial_approx=True)
iplot(fig_trends)
In [529]:
#print(plotly.offline.plot(fig_trends, include_plotlyjs=False, output_type='div'))
In [80]:
def df_to_weights(df, countries):
dict_ = {}
bias = df.mean()
bias -= bias.mean()
bias /= bias.std()
for code, y in zip(df.index, df.values):
if code in countries:
y -= y.mean()
y /= y.std()
y -= bias
_, weights = approx(y, degree=4)
dict_[code] = weights
return dict_
In [400]:
def many_df_to_weights(df_list):
all_weights = {}
all_countries = set()
for df in df_list:
if len(all_countries) == 0:
all_countries = set(df.index)
else:
all_countries = all_countries & set(df.index)
for df in df_list:
new_weights = df_to_weights(df, all_countries)
for code in new_weights:
if code in all_countries:
w = new_weights[code]
if code in all_weights:
all_weights[code] = np.append(all_weights[code], w)
else:
all_weights[code] = w
codes = np.array(list(all_weights.keys()))
weights = np.array(list(all_weights.values()))
return codes, weights
In [401]:
def many_df_to_labels(df_list):
codes, weights = many_df_to_weights(df_list)
clusters = cluster.SpectralClustering(n_clusters=2)
labels = clusters.fit(weights).labels_
return codes, labels
In [403]:
all_clusters_df = None
for c in ['USA', 'CHN', 'FRA']:
codes, cluster_idx = many_df_to_labels([tone_focus_on(df, c)])
clusters_df = pd.DataFrame(cluster_idx, index=codes, columns=[c])
if all_clusters_df is None:
all_clusters_df = clusters_df
else:
all_clusters_df = pd.concat([all_clusters_df, clusters_df], axis=1, join='inner')
In [404]:
all_clusters_df.shape
Out[404]:
In [422]:
reload(viz)
Out[422]:
In [522]:
cs = [
[0, 'rgb(255, 0, 0)'],
[0.1, 'rgb(255, 0, 0)'],
[0.1, 'rgb(0, 0, 255)'],
[1.0, 'rgb(0, 0, 255)']
]
map_clusters = world_map_figure(title='Spectral Clustering',
title_colorscale='colobar',
frames_title=['CHN', 'FRA', 'USA'],
df=all_clusters_df.reset_index(),
locations_col='index',
txt_fn=lambda code: pyc.countries.get(alpha_3=code).name,
zmin=0,
zmax=1,
colorscale=cs,
showscale=False,
)
iplot(map_clusters, validate=False)
In [530]:
#print(plotly.offline.plot(map_clusters, include_plotlyjs=False, output_type='div'))
In [504]:
codes, cluster_idx = many_df_to_labels([tone_focus_on(df, c) for c in ['USA', 'RUS', 'FRA', 'UKR']])
clusters_df_grouped = pd.DataFrame(cluster_idx, index=codes, columns=['USA RUS FRA UKR'])
In [519]:
map_clusters_grouped = world_map_figure(title='Spectral Clustering<br>(aggregated with USA RUS FRA UKR)',
title_colorscale='colobar',
frames_title=['USA RUS FRA UKR'],
df=clusters_df_grouped.reset_index(),
locations_col='index',
txt_fn=lambda code: pyc.countries.get(alpha_3=code).name,
zmin=0,
zmax=1,
colorscale=cs,
showscale=False,
)
iplot(map_clusters_grouped, validate=False)
In [531]:
#print(plotly.offline.plot(map_clusters_grouped, include_plotlyjs=False, output_type='div'))
In [92]:
def get_silhouette(X, i):
clusters = cluster.SpectralClustering(n_clusters=i)
clusters.fit(X)
labels = clusters.labels_
return metrics.silhouette_score(X, labels, metric='euclidean')
In [186]:
source_and_target_countries = list(set(df['Source_CountryCode'].values) & set(df['Target_CountryCode'].values))
In [206]:
all_weights = []
for code in tqdm(source_and_target_countries):
df_tone = tone_focus_on(df, code)
weights = many_df_to_weights([df_tone])[1]
all_weights.append(weights)
#print(weights.shape)
#if len(weights) > 5:
# sil = np.array([get_silhouette(weights, i) for i in range(2, 6)])
# all_silhouettes.append(sil)
In [418]:
all_silhouettes = []
for w, c in tqdm(list(zip(all_weights, source_and_target_countries))):
if len(w) > 100:
sil = np.array([get_silhouette(w, i) for i in range(2, 6)])
all_silhouettes.append((c, sil))
In [517]:
x = list(range(2, 9))
df_tone = tone_focus_on(df, 'AUS')
y = np.array([get_silhouette(many_df_to_weights([df_tone])[1], i) for i in x])
del df_tone
In [518]:
traces = [Scatter(x=x, y=y)]
fig = traces_to_fig('Silhouette', traces, 'Number of clusters', 'Silhouette score')
iplot(fig)
In [532]:
#print(plotly.offline.plot(fig, include_plotlyjs=False, output_type='div'))