Simon #metoo step 3


In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

We use the seaborn package to plot how topics are distributed over time in tweets.

We are dealing with categorical data, where the category in this case is the point in time.

For each point in time, we treat any topic with a probability above a threshold X as "existing".

We need a long-format dataframe with one categorical column (the time) and one value column (the topic number).
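As a minimal sketch of that target shape, the example below thresholds a small, made-up table of topic probabilities and reshapes it to one row per (day, topic) pair. The dates, the probabilities, the threshold value and the names `wide`, `long_df` and `strong` are assumptions for illustration only; they are not taken from topicmodel.csv.

```python
import pandas as pd

# Hypothetical toy data: rows are days, columns are topic numbers,
# values are topic probabilities (not the real topic model output).
wide = pd.DataFrame(
    {"0": [0.50, 0.005], "1": [0.002, 0.30], "2": [0.02, 0.001]},
    index=pd.to_datetime(["2017-10-16", "2017-10-17"]),
)
wide.index.name = "day"

threshold = 0.01  # assumed value for X

# Reshape from wide to long: one row per (day, topic) pair.
long_df = wide.reset_index().melt(
    id_vars="day", var_name="topic", value_name="probability"
)

# Keep only the topics that count as "existing" at each point in time.
strong = long_df.loc[long_df["probability"] > threshold, ["day", "topic"]]
strong = strong.sort_values(["day", "topic"]).reset_index(drop=True)

print(strong)
```

The result has exactly the two columns the plots need: a categorical time column and a topic column.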


In [ ]:
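# DataFrame.from_csv has been removed from pandas; read_csv is the replacement
df = pd.read_csv("topicmodel.csv", index_col=None)

# The first, unnamed CSV column holds the date; use it as the index
df.rename(columns={'Unnamed: 0': 'day'}, inplace=True)
df = df.set_index('day')

# Convert the index to datetime and sort chronologically
df.index = pd.to_datetime(df.index)
df = df.sort_index()

df.head()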

Write every (day, topic) pair whose probability exceeds the threshold to a new CSV file, then load that file as a long-format dataframe.


In [ ]:
# Write one "day,topic" line for every topic above the threshold
with open("strongtopics.csv", "w") as outfile:
    outfile.write("day,topic\n")
    for day, row in df.iterrows():
        for k, v in row.to_dict().items():
            if v > 0.01:
                outfile.write(str(day) + "," + str(k) + "\n")

# The with-block has closed (and flushed) the file before it is read back
df2 = pd.read_csv("strongtopics.csv")
df2 = df2.sort_values(by="day")

Now we have a dataframe listing, for each point in time, every topic with a probability above 0.01.


In [ ]:
df2.head()

A SET OF TOPICS IS TALKED ABOUT THROUGHOUT, WHILE OTHERS APPEAR MORE RANDOMLY


In [ ]:
%matplotlib inline

plt.figure(figsize=(10,10))

sns.swarmplot(x="day", y="topic", data=df2)

plt.savefig('test.pdf')
plt.show()

THE TOPIC COMPLEXITY INCREASES OVER TIME: THE HASHTAG LOSES FOCUS


In [ ]:
df3=df2.groupby('day').count().reset_index()
df3.columns=['day','strong topics']
df3 = df3.sort_values(by="day")
df3 = df3.reset_index(drop=True)
df3.head()

In [ ]:
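# regplot needs a numeric x-axis, so replace the date strings
# with their ordinal position (0, 1, 2, ...) in the sorted frame
df3['day'] = df3.index
df3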

In [ ]:
df3.to_csv("df3.csv")

In [ ]:
plt.figure(figsize=(10,10))
sns.regplot(x="day", y="strong topics", data=df3)

plt.savefig('test2.pdf')
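To put a number on the trend that the regression plot shows, one option is to fit a straight line to the per-day counts directly. The sketch below uses `numpy.polyfit` as a stand-in for the least-squares line that `regplot` draws by default. The data and the names `pairs`, `counts` and `day_number` are hypothetical and only illustrate the calculation.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (day, topic) pair.
pairs = pd.DataFrame({
    "day": ["d1", "d1", "d2", "d2", "d2", "d3", "d3", "d3", "d3"],
    "topic": ["0", "1", "0", "1", "2", "0", "1", "2", "3"],
})

# Count how many strong topics each day has.
counts = pairs.groupby("day").size().reset_index(name="strong topics")

# Use the ordinal position of each day as a numeric x value.
counts["day_number"] = np.arange(len(counts))

# Degree-1 least-squares fit: slope is the change in strong topics per day.
slope, intercept = np.polyfit(
    counts["day_number"].to_numpy(),
    counts["strong topics"].to_numpy(),
    deg=1,
)

print(f"slope: {slope:.2f} strong topics per day, intercept: {intercept:.2f}")
```

A positive slope would be consistent with the reading that the number of strong topics grows over time; on the real data, the size of the slope relative to the scatter matters as much as its sign.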