The internet is one of the richest datasets available to anybody interested in data analysis and research. Social media websites are of particular interest, since they are a treasure trove of information. The good news is that some of this information is free to use, and it is very easy to do so with R.
First of all, I must mention that there are numerous packages and tutorials on the web for interacting with social media data through R. The Voson Lab have some tutorials and an R package called SocialMediaLab. Pablo Barberá wrote a prize-winning paper on politics and Twitter, and he has a (somewhat dated) workshop online. Many resources can be found from even a quick search on Google.
In this post, I’m going to explain how to access Twitter through R, and how to run some analyses on the data we can scrape from the site. Twitter requires that you have an account and then create an app on their app page. This is quite straightforward: all you need to do is follow the instructions on the site.
Once you have set up your app on Twitter, you will be able to generate a set of user keys and tokens. We will use these in R to access the Twitter API. In the example R code below, replace the _________ placeholders with your personal Twitter codes.
key <- "____________"
key_secret <- "__________"
token <- "_______________"
token_secret <- "_____________"
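Typing the codes directly into a script makes them easy to leak when the script is shared. As a minimal sketch of one alternative, the values can be read from environment variables. The helper `read_credential()` and the variable names `TWITTER_KEY` and so on are hypothetical choices for this illustration, not something Twitter or the twitteR package requires.

```r
# Hypothetical helper: read one credential from an environment variable
# and stop with a clear message if it has not been set.
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# For illustration only: set placeholder values for this session.
# In practice these would be defined outside the script, for example
# in your ~/.Renviron file.
Sys.setenv(TWITTER_KEY = "my-key",
           TWITTER_KEY_SECRET = "my-key-secret",
           TWITTER_TOKEN = "my-token",
           TWITTER_TOKEN_SECRET = "my-token-secret")

key          <- read_credential("TWITTER_KEY")
key_secret   <- read_credential("TWITTER_KEY_SECRET")
token        <- read_credential("TWITTER_TOKEN")
token_secret <- read_credential("TWITTER_TOKEN_SECRET")
```

The four objects end up with the same names as in the code above, so the rest of the post works unchanged.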
We will use the following packages. If you don’t have them, install them with the install.packages() function (or click on the “Install” button in RStudio). Then we set up authorization with Twitter, which will ask you if you want to store a file on your computer with these Twitter credentials for future use with R, which is useful. To make the plot that we use at the end, you’ll need the newest version of ggplot2, which is still in development.
devtools::install_github("hadley/ggplot2")
library(ggplot2)
library(igraph)
library(ggraph)
library(dplyr)
library(twitteR)
library(ROAuth)
library(quanteda)
library(stringi)
library(RColorBrewer)
library(tidytext)
library(widyr)
set.seed(1234)
setup_twitter_oauth(key, key_secret, token, token_secret)
## [1] "Using direct authentication"
Now that we have authorization, we are free to scrape some data! Two individuals that are in the news at the moment are the US presidential candidates Donald Trump and Hillary Clinton, who are both active on Twitter. Let’s scrape a maximum of 1000 tweets for each one. Note that searchTwitter() runs a search, so the results are tweets that mention each account (including retweets), not only tweets written by the candidates themselves. The twitteR package has a getText method for the class of objects that are returned from this search.
# Search for tweets mentioning each account, pull out the text,
# convert it to plain ASCII, and build a quanteda corpus.
trump <- searchTwitter('@realDonaldTrump', n=1000) %>%
  lapply(function(x) x$getText()) %>%
  lapply(function(x) stri_trans_general(x, "Latin-ASCII")) %>%
  unlist() %>%
  corpus()
clinton <- searchTwitter('@HillaryClinton', n=1000) %>%
  lapply(function(x) x$getText()) %>%
  lapply(function(x) stri_trans_general(x, "Latin-ASCII")) %>%
  unlist() %>%
  corpus()
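To see the shape of this pipeline without calling the Twitter API, here is a rough offline sketch in base R. The mock tweets are plain lists with a `getText()` function, which is only an assumption standing in for the objects that searchTwitter() returns, and `make_mock_tweet()` is a made-up helper. The cleaning step uses `iconv()` to drop non-ASCII characters, which is a cruder substitute for the transliteration done by stri_trans_general().

```r
# Mock "tweets": plain lists with a getText() function, standing in for
# the status objects returned by searchTwitter() (an assumption made so
# that the example runs offline).
make_mock_tweet <- function(text) list(getText = function() text)
mock_tweets <- list(
  make_mock_tweet("RT @someone: Big rally tonight!"),
  make_mock_tweet("Caf\u00e9 open now"),
  make_mock_tweet("Emails show \u2026 more to come")
)

# vapply() extracts one string per tweet and returns a character vector
# directly, so no unlist() step is needed.
texts <- vapply(mock_tweets, function(x) x$getText(), character(1))

# Base R substitute for stri_trans_general(): drop (rather than
# transliterate) every character that cannot be represented in ASCII.
texts_ascii <- iconv(texts, from = "UTF-8", to = "ASCII", sub = "")

print(texts_ascii)
```

The resulting character vector is the kind of input that corpus() receives at the end of the pipeline above.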
These corpus objects are then easy to summarize. One useful feature is the kwic function, which gives us the keyword in context. For example, we can see in what contexts Clinton is mentioned in the tweets collected for Trump’s account, and vice versa:
k <- kwic(trump, "clinton", 3)
head(k)
## contextPre keyword
## [text4, 8] Journalists shower Hillary [ Clinton
## [text10, 11] - Emails Show [ Clinton
## [text22, 20] them to Hillary [ Clinton
## [text45, 11] - Emails Show [ Clinton
## [text55, 11] - Emails Show [ Clinton
## [text58, 17] Way' Of [ Clinton
## contextPost
## [text4, 8] ] with campaign cash
## [text10, 11] ] Campaign Organized Potential
## [text22, 20] ] @realDonaldTrump here is
## [text45, 11] ] Campaign Organized Potential
## [text55, 11] ] Campaign Organized Potential
## [text58, 17] ] Email Investigation:
h <- kwic(clinton, "trump", 3)
head(h)
## contextPre keyword
## [text1, 4] RT@HillaryClinton: [ Trump
## [text4, 8] donate every time [ Trump
## [text9, 4] RT@HillaryClinton: [ Trump
## [text23, 19] ever WE NEED [ Trump
## [text33, 9] case for Donald [ Trump
## [text52, 4] RT@HillaryClinton: [ Trump
## contextPost
## [text1, 4] ] reportedly asked this
## [text4, 8] ] tweets something offensive
## [text9, 4] ] reportedly asked this
## [text23, 19] ] to clean ho
## [text33, 9] ] vote!#Never
## [text52, 4] ] reportedly asked this
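The idea behind keyword in context is simple enough to sketch in a few lines of base R. The function below is a toy illustration and not quanteda's implementation: `simple_kwic()` and the example tweets are made up, and the whitespace tokenizer is far cruder than a real one.

```r
# Toy keyword-in-context function in base R. This is NOT quanteda's
# implementation, only an illustration of the idea.
simple_kwic <- function(texts, keyword, window = 3) {
  rows <- list()
  for (i in seq_along(texts)) {
    # Split on whitespace; a real tokenizer handles punctuation better.
    tokens <- strsplit(texts[i], "\\s+")[[1]]
    hits <- which(tolower(tokens) == tolower(keyword))
    for (pos in hits) {
      before <- tokens[seq_len(pos - 1)]
      after  <- tokens[-seq_len(pos)]
      rows[[length(rows) + 1]] <- data.frame(
        doc     = i,
        pre     = paste(tail(before, window), collapse = " "),
        keyword = tokens[pos],
        post    = paste(head(after, window), collapse = " "),
        stringsAsFactors = FALSE
      )
    }
  }
  if (length(rows) == 0) {
    # No matches: return an empty table with the same columns.
    return(data.frame(doc = integer(0), pre = character(0),
                      keyword = character(0), post = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, rows)
}

toy_tweets <- c("Emails show Clinton campaign organized events",
                "Nothing to see here",
                "Big crowd for Hillary Clinton in Ohio today")
result <- simple_kwic(toy_tweets, "clinton", window = 3)
print(result)
```

Each row holds the document index, up to three words before the keyword, the keyword itself, and up to three words after it, which mirrors the layout of the kwic output above.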
Perhaps the most visually impressive way of summarizing this type of data is a wordcloud. It’s easy to make, but we first need to clean up the data and transform it into a document-feature matrix. What we’re doing here is removing unnecessary elements from the text, including the “@” Twitter account names, and some other Twitter-specific elements, such as “rt” and “t.co”.
trump <- dfm(trump, toLower = TRUE, removeNumbers = TRUE,
             removePunct = TRUE, removeSeparators = TRUE,
             stem = TRUE, language = "english",
             removeTwitter = TRUE,
             ignoredFeatures = c("rt", "https", "t.co", "h", "s",
                                 "wqqpjxfb", "75ollud4si",
                                 "t", "ht", "realdonaldtrump",
                                 stopwords()))
clinton <- dfm(clinton, toLower = TRUE, removeNumbers = TRUE,
               removePunct = TRUE, removeSeparators = TRUE,
               stem = TRUE, language = "english",
               removeTwitter = TRUE,
               ignoredFeatures = c("rt", "https", "t.co",
                                   "t", "hillaryclinton", stopwords()))
Now we’re ready to plot some wordclouds!
plot(trump, max.words = 100,
     colors = brewer.pal(6, "Dark2"), scale = c(4, .5),
     random.order = FALSE, random.color = TRUE)
plot(clinton, max.words = 100,
     colors = brewer.pal(6, "Dark2"), scale = c(4, .5),
     random.order = FALSE, random.color = TRUE)
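The document-feature matrix behind these wordclouds is essentially a table of word counts, with one row per document and one column per word. A minimal base R sketch on made-up text may make that concrete. The toy documents and the tiny `ignored` list are assumptions for illustration, and the tokenization is much simpler than what dfm() does.

```r
toy_docs <- c(doc1 = "Make America great again",
              doc2 = "Great rally great crowd",
              doc3 = "Stronger together")

# Lower-case and split each document into word tokens.
tokens <- strsplit(tolower(toy_docs), "[^a-z]+")

# Drop a few unwanted features (a tiny, made-up ignore list).
ignored <- c("again", "")
tokens <- lapply(tokens, function(x) x[!x %in% ignored])

# Rows are documents, columns are features, cells are counts.
features <- sort(unique(unlist(tokens)))
toy_dfm <- t(vapply(tokens,
                    function(x) as.integer(table(factor(x, levels = features))),
                    integer(length(features))))
colnames(toy_dfm) <- features
print(toy_dfm)
```

A wordcloud then just sizes each word by its column total in a matrix like this one.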
We can also tidy up these objects with the tidytext package and plot some network graphs with the ggraph and igraph packages. You may need to install these packages if you do not already have them.
# install.packages('devtools')
# devtools::install_github('thomasp85/ggforce')
# devtools::install_github('thomasp85/ggraph')
# devtools::install_github('dgrtwo/widyr')
library(igraph)
library(ggraph)
library(tidytext)
library(widyr)
tidy_trump <- tidy(trump)
tidy_clinton <- tidy(clinton)
tidy_trump %>%
  pairwise_count(term, document, sort = TRUE) %>%
  filter(n >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_node_point(color = "#87CEFA", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  theme_void()
tidy_clinton %>%
  pairwise_count(term, document, sort = TRUE) %>%
  filter(n >= 10) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_node_point(color = "#87CEFA", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  theme_void()
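The edges in these network graphs come from counting how often two words appear in the same tweet. As a rough sketch of that counting step, here is a base R version on made-up data. It is not how widyr does it internally: it uses a matrix cross-product, lists each unordered pair once, and uses a threshold of 2 instead of 10 to suit the toy data.

```r
# A tiny tidy table: one row per (document, term) pair, standing in for
# the output of tidy() on a document-feature matrix.
toy_tidy <- data.frame(
  document = c("t1", "t1", "t1", "t2", "t2", "t3", "t3"),
  term     = c("email", "clinton", "campaign",
               "email", "clinton",
               "clinton", "rally"),
  stringsAsFactors = FALSE
)

# Binary document-by-term incidence matrix (1 = term appears in document).
counts <- unclass(table(toy_tidy$document, toy_tidy$term))
incidence <- 1 * (counts > 0)

# crossprod(X) is t(X) %*% X: cell [a, b] counts the documents that
# contain both term a and term b.
cooccur <- crossprod(incidence)
diag(cooccur) <- 0  # a term paired with itself is not of interest

# Keep each unordered pair once (upper triangle) if it co-occurs in at
# least 2 documents.
pairs <- which(cooccur >= 2 & upper.tri(cooccur), arr.ind = TRUE)
edges <- data.frame(item1 = rownames(cooccur)[pairs[, "row"]],
                    item2 = colnames(cooccur)[pairs[, "col"]],
                    n     = cooccur[pairs],
                    stringsAsFactors = FALSE)
print(edges)
```

A table of this shape, with two node columns followed by a count, is what graph_from_data_frame() turns into a graph.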
Another useful thing we can do with these data is sentiment analysis. First, you will need to download a suitable dictionary, or lexicon (one can be downloaded from here), and place it where the code below can find it: the paths expect a folder called opinion-lexicon-English one level above your working directory. Then we can load the word lists into R:
# Read each word list into a character vector, skipping the header
# lines that start with ';'.
nice <- scan('../opinion-lexicon-English/positive-words.txt',
             what='character', comment.char=';')
not_nice <- scan('../opinion-lexicon-English/negative-words.txt',
                 what='character', comment.char=';')
Next, we need a quick function to calculate a sentiment score based on how often each candidate’s words match the “nice” and “not nice” dictionaries:
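sentiments <- function(words, nice_text, not_nice_text){
  # Position of each word in the lexicons (NA if absent)
  positive <- match(words, nice_text)
  negative <- match(words, not_nice_text)
  # TRUE where the word was found
  positive <- !is.na(positive)
  negative <- !is.na(negative)
  # Score = number of positive matches minus number of negative matches
  score <- sum(positive) - sum(negative)
  return(score)
}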
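A small worked example shows what the score means. The word lists below are made up for illustration, and `score_words()` is a hypothetical compact version of the same rule written with `%in%`, so that the example can run without the downloaded lexicon.

```r
# Tiny made-up lexicons; the real ones contain thousands of words.
toy_nice     <- c("great", "win", "love")
toy_not_nice <- c("crooked", "sad", "failing")

# Same scoring rule as sentiments() above, written with %in%:
# +1 for each word in the positive list, -1 for each in the negative list.
score_words <- function(words, nice_text, not_nice_text) {
  sum(words %in% nice_text) - sum(words %in% not_nice_text)
}

toy_words <- c("great", "rally", "sad", "crooked", "media", "failing")
toy_score <- score_words(toy_words, toy_nice, toy_not_nice)
print(toy_score)  # 1 positive match - 3 negative matches = -2
```

Words found in neither list, such as "rally" and "media" here, do not affect the score. The result is a raw difference in counts, so it also depends on how many words are scored.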
Using this, we can see a clear difference between Trump and Clinton:
sentiments(tidy_trump$term, nice, not_nice)
## [1] -115
sentiments(tidy_clinton$term, nice, not_nice)
## [1] -63
Let’s make a dataframe of all of this and plot it.
sentiments_df <- data_frame(
  Score = c(sentiments(tidy_trump$term, nice, not_nice),
            sentiments(tidy_clinton$term, nice, not_nice)),
  Candidate = c("Trump", "Hillary"))
ggplot(sentiments_df) +
  geom_col(aes(y = Score, x = Candidate, fill = Score)) +
  theme_bw()
Both scores are negative, but the tweets collected for poor old Donald are certainly the more negative of the two.
This is a short example of how you can access social media data using R. There are also examples of how to do the same with Facebook, and there is a larger vignette here on how to use the quanteda package for the analysis of text.