Word correlation across tweets about a given hashtag

We want to determine which words are most strongly correlated across tweets about a given hashtag, using the widyr package.

Cleuton Sampaio


In [1]:
library(twitteR)
library(ROAuth)
library(httr)
library(plyr)
library(stringr)
library(tidytext)
library(readr)
library(dplyr)
library(widyr)


Attaching package: ‘plyr’

The following object is masked from ‘package:twitteR’:

    id


Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:twitteR’:

    id, location

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Let's fetch the tweets. Replace the parameters below with your api_key, api_secret, access_token, and access_token_secret, in that order. You can obtain them at https://apps.twitter.com


In [2]:
setup_twitter_oauth('1', '2', '3', '4')


[1] "Using direct authentication"
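Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the four values are stored in environment variables. The variable names and the helper `read_twitter_credentials` are hypothetical and not part of twitteR:

```r
# Hypothetical helper: reads the four Twitter credentials from
# environment variables (the names below are assumptions).
read_twitter_credentials <- function() {
  keys <- c("TWITTER_API_KEY", "TWITTER_API_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET")
  values <- Sys.getenv(keys)          # unset variables come back as ""
  absent <- keys[values == ""]
  if (length(absent) > 0) {
    stop("Missing environment variables: ", paste(absent, collapse = ", "))
  }
  unname(values)
}

# Usage (requires twitteR and real credentials):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds[1], creds[2], creds[3], creds[4])
```

The values are returned in the same order that `setup_twitter_oauth` is called with above.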

Let's fetch the tweets tagged with the hashtag #brazil. What did foreigners say about Brazil this week?


In [3]:
tweets_nc <- searchTwitter('#brazil', lang='en', n = 1000)

Let's convert them to a data frame and add an identifier to each row:


In [4]:
df <- twListToDF(tweets_nc)
nrow(df)


1000

In [5]:
df$sessao <- seq.int(nrow(df))
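The row identifier matters because it is what later tells widyr which words belong to the same tweet. A small sketch on a made-up data frame (`toy_df` is hypothetical), using `seq_len`, which gives the same result as `seq.int(nrow(df))` for a non-empty data frame but also behaves sensibly when there are zero rows:

```r
# Hypothetical toy data standing in for the tweets data frame
toy_df <- data.frame(
  text = c("first tweet", "second tweet", "third tweet"),
  stringsAsFactors = FALSE
)

# One integer identifier per row (per tweet)
toy_df$sessao <- seq_len(nrow(toy_df))

# With zero rows, seq_len(0) is an empty vector,
# whereas seq.int(0) counts down from 1 to 0.
empty_ids <- seq_len(0)
```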

Now let's convert the data to tidy format, but first we need to load the list of English stopwords:


In [6]:
stopwords <- read_csv('stopwords.txt', col_names = 'word')


Parsed with column specification:
cols(
  word = col_character()
)

In [7]:
head(stopwords)


word
i
me
my
myself
we
our

In [8]:
# One row per word (token) per tweet, with English stopwords removed
tidy_tweets <- df %>%
    unnest_tokens(word, text) %>%
    anti_join(stopwords, by = "word")
head(tidy_tweets)


favorited | favoriteCount | replyToSN | created | truncated | replyToSID | id | replyToUID | statusSource | screenName | retweetCount | isRetweet | retweeted | longitude | latitude | sessao | word
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | rt
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | jpamm08
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | day
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | 5
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | finished
FALSE | 0 | NA | 2018-03-13 22:43:30 | FALSE | NA | 973691046671548416 | NA | <a href="https://www.github.com/arun4033622" rel="nofollow">HeyFellowHuman</a> | 100DaysOfXBot | 1 | TRUE | FALSE | NA | NA | 1 | html5
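To see what the tidy step does without the full tweet data frame, here is a minimal sketch on two made-up tweets. The objects `toy`, `toy_stop` and `toy_tidy` and their contents are assumptions for illustration only:

```r
library(dplyr)
library(tidytext)

# Two hypothetical tweets, each with its row identifier
toy <- tibble(
  sessao = 1:2,
  text = c("The cute tanager is a bird", "Brazil and the World Cup")
)

# A tiny hypothetical stopword list with the same column name as above
toy_stop <- tibble(word = c("the", "is", "a", "and"))

# unnest_tokens() splits text into lowercase words, one per row;
# anti_join() then drops every row whose word is in the stopword list
toy_tidy <- toy %>%
    unnest_tokens(word, text) %>%
    anti_join(toy_stop, by = "word")
```

Every other column, here `sessao`, is repeated on each word row, which is why the output above shows the same tweet metadata on every line.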

In [9]:
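# Keep only words with more than 20 occurrences, then compute the pairwise
# correlation of words based on the tweets (sessao) in which they appear
correlacao <- tidy_tweets %>%
    group_by(word) %>%
    filter(n() > 20) %>%
    pairwise_cor(word, sessao, sort = TRUE)
correlacao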


item1          item2          correlation
birds          tanager         1.0000000
cute           tanager         1.0000000
bird           tanager         1.0000000
animal         tanager         1.0000000
tanager        birds           1.0000000
cute           birds           1.0000000
bird           birds           1.0000000
animal         birds           1.0000000
tanager        cute            1.0000000
birds          cute            1.0000000
bird           cute            1.0000000
animal         cute            1.0000000
tanager        bird            1.0000000
birds          bird            1.0000000
cute           bird            1.0000000
animal         bird            1.0000000
tanager        animal          1.0000000
birds          animal          1.0000000
cute           animal          1.0000000
bird           animal          1.0000000
belize         australia       1.0000000
australia      belize          1.0000000
necked         red             1.0000000
red            necked          1.0000000
various        riddim          1.0000000
riddim         various         1.0000000
t.co           https           0.9885566
https          t.co            0.9885566
red            enjoynature     0.9808029
necked         enjoynature     0.9808029
...            ...             ...
worldcup       t.co           -0.1864093
t.co           worldcup       -0.1864093
brazilian      brazil         -0.1872501
brazil         brazilian      -0.1872501
rolling        t.co           -0.1874513
t.co           rolling        -0.1874513
worldcup       https          -0.1895331
https          worldcup       -0.1895331
rolling        https          -0.1905025
https          rolling        -0.1905025
shastareports  t.co           -0.1930690
t.co           shastareports  -0.1930690
shastareports  https          -0.1961128
https          shastareports  -0.1961128
new            t.co           -0.1977564
t.co           new            -0.1977564
new            https          -0.2022343
https          new            -0.2022343
albums         t.co           -0.2114793
t.co           albums         -0.2114793
albums         https          -0.2148998
https          albums         -0.2148998
startup        rt             -0.2220036
rt             startup        -0.2220036
3              brazil         -0.2415221
brazil         3              -0.2415221
https          rt             -0.5809906
rt             https          -0.5809906
t.co           rt             -0.5888964
rt             t.co           -0.5888964
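To interpret these numbers, it helps to reproduce the idea by hand. The sketch below uses base R only and assumes that, with no value column given, the correlation is taken over presence/absence indicators of each word in each tweet (the phi coefficient). The toy data are made up:

```r
# Hypothetical tidy data: one row per (tweet, word)
toy <- data.frame(
  sessao = c(1, 1, 1, 2, 2, 3, 3, 4),
  word   = c("bird", "tanager", "brazil", "bird", "tanager",
             "worldcup", "brazil", "worldcup"),
  stringsAsFactors = FALSE
)

# Tweet-by-word count matrix, turned into 0/1 presence indicators
counts   <- unclass(table(toy$sessao, toy$word))
presence <- (counts > 0) * 1

# Pearson correlation between the indicator columns of each word pair
phi <- cor(presence)
```

Here "bird" and "tanager" occur in exactly the same tweets, so their correlation is 1, while "bird" and "worldcup" never share a tweet and split the tweets between them, giving -1. In the real output, the many pairs at exactly 1.0000000 most likely come from one tweet that was retweeted many times, so those words always appear together.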

In [10]:
set.seed(42)
library(ggplot2)

In [11]:
library(igraph)
library(ggraph)


Attaching package: ‘igraph’

The following objects are masked from ‘package:dplyr’:

    as_data_frame, groups, union

The following objects are masked from ‘package:stats’:

    decompose, spectrum

The following object is masked from ‘package:base’:

    union


In [15]:
# Plot the word pairs with correlation above 0.70 as a network graph
correlacao %>%
    filter(correlation > .70) %>%
    graph_from_data_frame() %>%
    ggraph(layout = 'fr') + 
    geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) + 
    geom_node_point(color = 'lightblue', size = 5) + 
    geom_node_text(aes(label = name), repel = TRUE) + 
    theme_void()
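The correlation table lists every pair twice, once in each direction (for example birds/tanager and tanager/birds), so the graph receives two edges per pair. A minimal base R sketch of keeping a single row per pair, using four rows copied from the output above (the data frame names `pairs_df` and `unique_pairs` are hypothetical):

```r
# Four rows taken from the correlation output: two pairs, both directions
pairs_df <- data.frame(
  item1 = c("birds", "tanager", "t.co", "https"),
  item2 = c("tanager", "birds", "https", "t.co"),
  correlation = c(1.0000000, 1.0000000, 0.9885566, 0.9885566),
  stringsAsFactors = FALSE
)

# Keep only the direction where item1 sorts before item2,
# which leaves exactly one row per unordered pair
unique_pairs <- pairs_df[pairs_df$item1 < pairs_df$item2, ]
```

The same filter can be applied to `correlacao` before `graph_from_data_frame()`. The widyr pairwise functions are also documented as accepting an `upper` argument for this purpose, which is worth checking in the version you have installed.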


