In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
We use the seaborn package to plot how topics are distributed over time in tweets.
We are dealing with categorical data, where the category in this case is the point in time.
For each point in time, we treat any topic with a probability above a threshold X (0.01 in the code below) as "existing".
We need a dataframe with one categorical column (time) and one value column (the topic number).
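As a rough illustration of the target shape, the sketch below reshapes a small made-up table (one row per day, one column per topic) into the long "day, topic" form and keeps only topics above the threshold. The frame `wide`, its topic names, and its probabilities are invented for the example; only the threshold of 0.01 is taken from the code in this notebook.

```python
import pandas as pd

# Toy stand-in for the topic model output: one row per day, one column per topic.
# Topic names and probabilities are made up for illustration.
wide = pd.DataFrame(
    {"topic0": [0.50, 0.005], "topic1": [0.002, 0.40], "topic2": [0.20, 0.30]},
    index=pd.to_datetime(["2017-01-01", "2017-01-02"]),
)
wide.index.name = "day"

X = 0.01  # threshold above which a topic counts as "existing"

# Wide -> long: one row per (day, topic) pair with its probability.
long_df = wide.reset_index().melt(
    id_vars="day", var_name="topic", value_name="probability"
)

# Keep only the pairs whose probability exceeds the threshold.
strong = (
    long_df[long_df["probability"] > X]
    .sort_values(["day", "topic"])
    .reset_index(drop=True)
)
print(strong[["day", "topic"]])
```

This in-memory reshaping is an alternative to writing an intermediate CSV file; the result has the same two columns.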
In [ ]:
# DataFrame.from_csv was removed from pandas; read_csv is its replacement
df = pd.read_csv("topicmodel.csv")
df.rename(columns={'Unnamed: 0': 'day'}, inplace=True)
df = df.set_index('day')
# Convert the index to datetimes, then sort chronologically
df.index = pd.to_datetime(df.index)
df = df.sort_index()
df.head()
Convert each row of the imported dataframe to a dict and write every (day, topic) pair above the threshold to a CSV file.
In [ ]:
# Write one "day,topic" line for every topic whose probability exceeds the threshold
with open("strongtopics.csv", "w") as outfile:
    outfile.write("day,topic\n")
    for day, row in df.iterrows():
        for k, v in row.to_dict().items():
            if v > 0.01:
                outfile.write(str(day) + "," + k + "\n")
# The with block has closed (and flushed) the file before it is read back
df2 = pd.read_csv("strongtopics.csv")
df2 = df2.sort_values(by="day")
Now we have a dataframe listing, for each point in time, every topic whose probability exceeds the 0.01 threshold.
In [ ]:
df2.head()
In [ ]:
%matplotlib inline
plt.figure(figsize=(10,10))
sns.swarmplot(x="day", y="topic", data=df2)
plt.savefig('test.pdf')
plt.show()
In [ ]:
df3 = df2.groupby('day').count().reset_index()
df3.columns = ['day', 'strong topics']
df3 = df3.sort_values(by="day")
df3 = df3.reset_index(drop=True)
df3.head()
In [ ]:
# Replace the dates with their row position so that regplot gets a numeric x axis
df3['day'] = df3.index
df3
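Using the row position as the x value treats consecutive rows as equally spaced, even when days are missing from the data. A possible alternative, sketched here on made-up counts and assuming the `day` column has been parsed with `pd.to_datetime`, is to use the number of days elapsed since the first observation. The frame `counts` and the column names `position` and `elapsed_days` are hypothetical.

```python
import pandas as pd

# Hypothetical counts with a gap between the second and third day.
counts = pd.DataFrame({
    "day": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"]),
    "strong topics": [3, 4, 6],
})

# Row position: consecutive integers regardless of gaps in the dates.
counts["position"] = counts.index

# Days elapsed since the first observation: gaps in the dates are preserved.
counts["elapsed_days"] = (counts["day"] - counts["day"].min()).dt.days

print(counts)
```

Either column is numeric and can be passed as `x` to a regression plot; they differ only when the dates are not contiguous.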
In [ ]:
df3.to_csv("df3.csv")
In [ ]:
plt.figure(figsize=(10,10))
sns.regplot(x="day", y="strong topics", data=df3)
plt.savefig('test2.pdf')
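`sns.regplot` draws a fitted line but returns only the matplotlib axes, not the coefficients. If the slope itself is of interest, one option is to fit a first-degree polynomial with NumPy. The sketch below uses made-up values rather than the contents of `df3`.

```python
import numpy as np

# Made-up data lying exactly on y = 2x + 3, standing in for
# the "day" and "strong topics" columns.
x = np.array([0, 1, 2, 3, 4], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)

# Least-squares fit of a straight line; coefficients come highest degree first.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With the real data, `x` and `y` would be the two columns of `df3` converted to NumPy arrays.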