Have you spent some time on Twitter lately? So strange that politics is the main topic in so many tweets these days... Can we do something about it? YES!
I follow 124 tweeps at the time of this writing. You can see a list of them here. 200 of the most recent tweets for each account have been collected using the Twitter API. Here is what each row of our dataset contains:
Only the text will be useful for us here. Feel free to use the other fields as well!
Make sure you have installed and loaded the following libraries:
In [4]:
library(tm)
library(dplyr)
library(RColorBrewer)
library(wordcloud)
library(ggplot2)
library(SnowballC)
library(topicmodels)
library(LDAvis)
Let's setup R and make our results reproducible.
In [5]:
seed <- 42
set.seed(seed)
theme_set(theme_minimal())
options(warn=-1)
In [6]:
df <- read.csv("data/tweets.csv", header = T, stringsAsFactors = F)
Since our dataset is pretty well sorted (by username) we would love to randomize it a bit and tell R that created_at
is an actual date field:
In [7]:
df <- df[sample(nrow(df)),]
rownames(df) <- 1:nrow(df)
In [8]:
df$created_at <- as.Date(df$created_at)
In [9]:
dim(df)
In [10]:
tweets <- iconv(df$text, to = "ASCII", sub = " ")
train_tweets <- tweets[1:(nrow(df) - 10)]
test_tweets <- tweets[(nrow(df) - 9):nrow(df)]
In [11]:
create_corpus <- function(data) {
data <- tolower(data)
data <- gsub("rt", " ", data) # Remove retweets
data <- gsub("@\\w+", " ", data) # Remove usernames
data <- gsub("http.+ |http.+$", " ", data) # Remove links
data <- gsub("[[:punct:]]", " ", data) # Remove punctuation
data <- gsub("[ |\t]{2,}", " ", data) # Remove tabs
data <- gsub("amp", " ", data) # Remove "&"
data <- gsub("^ ", "", data) # Leading blanks
data <- gsub(" $", "", data) # Lagging blanks
data <- gsub(" +", " ", data) # General white spaces
data <- unique(data)
VCorpus(VectorSource(data))
}
In [12]:
train_corpus <- create_corpus(train_tweets)
Our newly created corpus will be fed to a Document-Term matrix.
In [13]:
dtm_train <- DocumentTermMatrix(train_corpus, control = list(
stemming = TRUE, removeNumbers = TRUE,
removePunctuation = TRUE, stopwords = c(stopwords("en"), stopwords("SMART")),
wordLengths = c(3, 15)))
dtm_train <- dtm_train[, !grepl("http", dtm_train$dimnames$Terms)]
Remove zero row entries
In [14]:
row_totals <- apply(dtm_train , 1, sum)
dtm_train <- dtm_train[row_totals > 0, ]
In [15]:
tdm_train <- TermDocumentMatrix(train_corpus, control = list(
stemming = TRUE, removeNumbers = TRUE,
removePunctuation = TRUE, stopwords = c(stopwords("en"), stopwords("SMART")),
wordLengths = c(3, 15)))
In [16]:
term_freq <- rowSums(as.matrix(tdm_train))
term_freq <- subset(term_freq, term_freq >= 300)
freq_df <- data.frame(term = names(term_freq), freq = term_freq)
In [17]:
ggplot(freq_df, aes(x=reorder(term, freq), y=freq)) +
geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip()
Want to make a guess what kind of tweeps I follow most? What about major topics?
Why not make a wordcloud using the same data?
In [18]:
m <- as.matrix(tdm_train)
word.freq <- sort(rowSums(m), decreasing = T)
In [19]:
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 200,
random.order = F, colors=brewer.pal(8, "Dark2"))
We can even find associations in our corpus. Let's see what is most associated to the word data
?
In [20]:
findAssocs(tdm_train, "data", 0.07)
So it's data science, data scientist, and big data? Looks reasonable enough!
We will use the package topicmodels
to train Latent Dirichlet Allocation (LDA) model using our tweet corpus. One good explanation of what exactly LDA is can be found on Quora. Concretely (get it?), we will use Gibbs sampling to find 10 (randomly picked number) topics.
In [21]:
burnin <- 4000
iter <- 2000
thin <- 500
lda <- LDA(dtm_train, k = 10, method = "Gibbs",
control = list(burnin = burnin, thin = thin, iter = iter, seed = seed))
Now that we trained our model, which words best describe each topic?
In [22]:
terms(lda, 10)
In [23]:
test_corpus <- create_corpus(test_tweets)
Let's create Document-Term matrix using the test corpus and use our training matrix terms as a dictionary.
In [24]:
dtm_test <- DocumentTermMatrix(test_corpus, control = list(
stemming = TRUE, removeNumbers = TRUE,
removePunctuation = TRUE, stopwords = c(stopwords("en"), stopwords("SMART")),
dictionary=Terms(dtm_train), wordLengths = c(3, 15)))
dtm_test <- dtm_test[, !grepl("http", dtm_test$dimnames$Terms)]
row_totals <- apply(dtm_test , 1, sum)
dtm_test <- dtm_test[row_totals> 0, ]
Finally, assign topic probabilities to each tweet in our test dataset.
In [25]:
lda_posterior <- posterior(lda, dtm_test)
In [26]:
lda_posterior$topics
Now, let's pick the most probable topic for each tweet:
In [27]:
test_topics <- apply(lda_posterior$topics, 1, which.max)
In [28]:
test_topics
Ready to have a look at the actual tweets?
In [29]:
for(i in seq_along(test_tweets)) {
cat(paste(paste(i, test_tweets[i], sep = ": "), "\n\n"))
}
Again, the top 5 terms for each topic:
In [30]:
apply(terms(lda, 5), MARGIN = 2, paste, collapse = ", ")
What do you think about the results? Remember that every tweet is limited to 140 characters, thus our documents are rather small (even tiny considering preprocessing).
Let's visualize our LDA model using the sweet LDAvis
package.
In [31]:
topicmodels2LDAvis <- function(x, ...){
post <- topicmodels::posterior(x)
if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
mat <- x@wordassignments
LDAvis::createJSON(
phi = post[["terms"]],
theta = post[["topics"]],
vocab = colnames(post[["terms"]]),
doc.length = slam::row_sums(mat, na.rm = TRUE),
term.frequency = slam::col_sums(mat, na.rm = TRUE)
)
}
In [32]:
json <- lda %>% topicmodels2LDAvis()
LDAvis::serVis(json, out.dir = 'twitter_lda_vis', open.browser = FALSE)
It looks like 3 (2, 4, 8)
of the topics are pretty well separated from any other topic. We have 5 topics that are pretty close to each other on the bottom right. Might it be a good idea to try a different number of topics then?
At the end of the day, our model looks pretty useful. Let's not forget that the data is pretty fresh and real (yes - I do not follow many tweeps, mostly those that are interested in math, machine learning and biotech/bioinformatics).
One could easily imagine using the trained model for making personalized recommendations of tweets based on preselected topics. Why not recommending new tweeps, too?
The LDA is very general and can be applied to any set of documents. Why not try it on papers, news or Facebook posts?