Using rtweet to Create a tidyverse Twitterbot

A while back, I was inspired by this Twitter exchange to create a bot that would tweet out tidyverse related material.

Last week I finally had enough time to sit and put some work in to get this idea up and running! This post will walk through how I created the @tidyversetweets Twitter bot using the great rtweet package by Mike Kearney.

Step 1: Get the data

Before we can tweet out anything, we need to know what we are going to tweet. As of the writing of this post, @tidyversetweets will tweet out new tidyverse questions on StackOverflow, and new discussions from certain topics on the RStudio Community. Let’s start by pulling in data from the StackOverflow API. This is made easy using the stackr package from StackOverflow data scientist David Robinson.

First, we’ll define a couple of functions to easily query questions, given a tag. We can use the safely function from the purrr package to handle errors (e.g., no questions for a given tag are found).

library(tidyverse)
library(stackr)

safe_stack_questions <- safely(stack_questions)
query_tag <- function(tag) {
  query <- safe_stack_questions(pagesize = 100, tagged = tag)
  return(query)
}
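
To make the role of safely a little more concrete, here is a minimal sketch using base R's log function in place of stack_questions (so it runs without hitting the API). The name safe_log is only for illustration. A function wrapped with safely always returns a list with a result element and an error element, and exactly one of them is NULL.

```r
library(purrr)

# Wrap a function so that errors are captured instead of stopping the script
safe_log <- safely(log)

ok  <- safe_log(10)   # succeeds: result is filled in, error is NULL
bad <- safe_log("a")  # fails: result is NULL, error holds the condition

is.null(ok$error)
is.null(bad$result)
conditionMessage(bad$error)
```

This is why the later steps can pull out the result element from every query without worrying about a single failed tag stopping the whole run.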

We can then define the tags that we want to look up on StackOverflow. For now, I’ve listed all the packages that are included in the official list of tidyverse packages.

tidyverse <- c("tidyverse", "ggplot2", "dplyr", "tidyr", "readr", "purrr",
  "tibble", "readxl", "haven", "jsonlite", "xml2", "httr", "rvest", "DBI;r",
  "stringr", "lubridate", "forcats", "hms", "blob;r", "rlang", "magrittr",
  "glue", "recipes", "rsample", "modelr")

Finally, we can use the map function (also from the purrr package) to query all of the defined tags, and pull out the results to a data frame using map_dfr. We then do a little bit of clean up to correct some character encoding issues, remove duplicate questions that were collected under multiple tags (e.g., a question that was tagged tidyverse and dplyr), and arrange by the time the question was posted.

tidy_so <- map(tidyverse, query_tag) %>%
  map_dfr(~(.$result %>% as_tibble())) %>%
  select(title, creation_date, link) %>%
  mutate(
    title = str_replace_all(title, "&#39;", "'"),
    title = str_replace_all(title, "&quot;", '"')
  ) %>%
  distinct() %>%
  arrange(creation_date)

tidy_so
#> # A tibble: 1,500 x 3
#>                                                                          title
#>                                                                          <chr>
#>  1             Create a matrix of scatterplots (pairs() equivalent) in ggplot2
#>  2                                              Convert character to Date in R
#>  3                                        ggplot: remove lines at ribbon edges
#>  4           extract hours and seconds from POSIXct for plotting purposes in R
#>  5 R : readBin and writeBin ( for storing/retrieving MySQL BLOBs or LONGBLOBs 
#>  6                            regex multiple pattern with singular replacement
#>  7 Reshaping multiple sets of measurement columns (wide format) into single co
#>  8                        Plot a 24 hour cycle monthly for multiple variables?
#>  9                                      R lubridate converting seconds to date
#> 10                                          How to get a date from day of year
#> # ... with 1,490 more rows, and 2 more variables: creation_date <dttm>,
#> #   link <chr>
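
As a small illustration of the clean up step, here is a sketch on a made-up tibble (the titles, dates, and example.com links are all invented). It applies the same entity replacement, de-duplication, and sorting as the pipeline above, just without the API call.

```r
library(dplyr)
library(stringr)
library(tibble)

# Invented data: the first question appears twice, as if returned under two tags
toy_so <- tibble(
  title = c("Why doesn&#39;t filter() work?",
            "Why doesn&#39;t filter() work?",
            "Using &quot;glue&quot; in a pipe"),
  creation_date = as.POSIXct(c("2017-12-01 10:00:00",
                               "2017-12-01 10:00:00",
                               "2017-12-01 09:00:00"), tz = "UTC"),
  link = c("https://example.com/q/1",
           "https://example.com/q/1",
           "https://example.com/q/2")
)

toy_clean <- toy_so %>%
  mutate(
    title = str_replace_all(title, "&#39;", "'"),   # HTML entity for '
    title = str_replace_all(title, "&quot;", '"')   # HTML entity for "
  ) %>%
  distinct() %>%              # drop the row collected under a second tag
  arrange(creation_date)      # oldest first

toy_clean
```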

The next step is to pull new topics that have been posted to the RStudio Community site. To do this, we can use the feedeR package from Andrew Collier. This package is great for dealing with RSS feeds in R. As we did when querying StackOverflow, we’ll first define a function to query the RSS feed we’re interested in. This function uses the glue package to construct the RSS feed URL from the category that is supplied.

library(feedeR)
library(glue)

query_community <- function(category) {
  query <- feed.extract(glue("https://community.rstudio.com/c/{category}.rss"))
  return(query)
}
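
For reference, this is roughly what the URL construction looks like on its own. It is just a sketch of the glue call, with no request made to the site, and the categories and feed_urls names are only for illustration.

```r
library(glue)

# Category slugs used in this post
categories <- c("tidyverse", "teaching")

# glue() is vectorised, so a single call builds one URL per category
feed_urls <- glue("https://community.rstudio.com/c/{categories}.rss")

feed_urls
```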

For now, only the tidyverse and teaching categories are queried for @tidyversetweets. This is because other categories generally have topics that are not directly related to the tidyverse, which is the focus of this bot. Once the query_community function is defined, the process for scraping the information is very similar to that used for StackOverflow. There is, however, one additional layer of complexity for the RSS data. In the raw data, a new entry is created for each comment that is made within the category. For the Twitterbot, we only want to tweet out new topics, meaning that we only want the first entry for each topic. Therefore, we can group by the topic, and select only the entry with the earliest creation date within each topic.

rstudio <- c("tidyverse", "teaching")

tidy_rc <- map(rstudio, query_community) %>%
  map_dfr(~(.$items %>% as_tibble())) %>%
  select(title, creation_date = date, link) %>%
  group_by(title) %>%
  top_n(n = -1, wt = creation_date) %>%
  arrange(creation_date)

tidy_rc
#> # A tibble: 38 x 3
#> # Groups:   title [38]
#>                                                                      title
#>                                                                      <chr>
#>  1 Teaching: install/load packages individually, or use tidyverse package?
#>  2          Connecting with other instructors using Tidyverse / r4ds book?
#>  3                  Building foundational skills for programming beginners
#>  4                                                   Learning how to learn
#>  5   What to call a data rectangle: dataset / data frame / tibble / other?
#>  6                                    Basic API response for teaching demo
#>  7                    I guess that's why they call it the (teaching) blues
#>  8                                             About the Teaching category
#>  9                                     Using Git/GitHub for final projects
#> 10                                          Teaching classroom environment
#> # ... with 28 more rows, and 2 more variables: creation_date <dttm>,
#> #   link <chr>
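
Here is a toy version of the "first entry per topic" logic, using invented feed entries (the titles, dates, and example.com links are made up). Each topic has several entries, and only the earliest one should survive. I've added an ungroup() at the end of this sketch, which the pipeline above doesn't use, just so the result prints as a plain tibble.

```r
library(dplyr)
library(tibble)

# Invented feed: one row per post, so each topic shows up more than once
toy_feed <- tibble(
  title = c("Topic A", "Topic A", "Topic B", "Topic A", "Topic B"),
  creation_date = as.POSIXct(c("2017-12-01 08:00:00",
                               "2017-12-01 08:30:00",
                               "2017-12-01 09:00:00",
                               "2017-12-01 09:15:00",
                               "2017-12-01 10:00:00"), tz = "UTC"),
  link = paste0("https://example.com/t/", 1:5)
)

first_posts <- toy_feed %>%
  group_by(title) %>%
  top_n(n = -1, wt = creation_date) %>%  # negative n keeps the smallest value
  ungroup() %>%
  arrange(creation_date)

first_posts
```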

Now that we have all of the data, it’s time to tweet!

Step 2: Tweet!

We don’t want to tweet every question and topic that has ever been submitted, just the new ones. I have set up @tidyversetweets to run every five minutes, so we can filter the questions and topics to only those that were posted in the five minutes before the bot runs.

library(lubridate)

cur_time <- Sys.time()

all_update <- bind_rows(tidy_so, tidy_rc) %>%
  arrange(creation_date) %>%
  filter(creation_date > cur_time - dminutes(5))
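
To see the five minute window in action without waiting on real data, here is a sketch with a fixed "current" time and two invented posts. The fixed timestamp and the toy_posts tibble are assumptions for the example only.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Pretend the bot is running at exactly this moment
cur_time <- ymd_hms("2017-12-11 12:00:00", tz = "UTC")

# Invented posts: one from 12 minutes ago, one from 2 minutes ago
toy_posts <- tibble(
  title = c("Old question", "Fresh question"),
  creation_date = cur_time - dminutes(c(12, 2)),
  link = c("https://example.com/q/1", "https://example.com/q/2")
)

# Keep only what was posted within the last five minutes
recent <- toy_posts %>%
  filter(creation_date > cur_time - dminutes(5))

recent
```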

Finally, we cycle through the questions and topics to tweet. The tweet text is always the title of the question or topic, followed by the #tidyverse and #rstats hashtags, and then the link to the original post. To make sure we don’t exceed Twitter’s character limit, we first check whether the title is longer than 250 characters, and if it is, truncate it at a word boundary and add an ellipsis. Yay for the higher limit!

We can then compose the tweet’s text using the glue package. Finally, the rtweet package makes it super easy to tweet from R. rtweet makes the connection to Twitter seamless for querying, streaming, and analyzing tweets. It does take a little bit of extra setup in order to post tweets, but the package has great documentation for how to do that. And that’s it! Once rtweet is set up, we’re ready to send out our tweets of new questions!

library(rtweet)

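pwalk(.l = all_update, .f = function(title, creation_date, link) {
  # Shorten any title that is longer than 250 characters
  if (nchar(title) > 250) {
    # Positions of every space in the title
    trunc_points <- str_locate_all(title, " ") %>%
      .[[1]] %>%
      .[,1]
    # Cut just before the last space that still leaves room for "..."
    trunc <- max(trunc_points[which(trunc_points < 247)]) - 1
    title <- paste0(str_sub(title, start = 1, end = trunc), "...")
  }

  # Compose the tweet: title, hashtags, then the link
  tweet_text <- glue("{title} #tidyverse #rstats {link}")
  post_tweet(tweet_text)
})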
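
If you want to try the truncation logic without posting anything, here is a sketch that pulls it out into a standalone helper. The truncate_title function and its max_chars argument are hypothetical names I'm using for illustration, not part of the bot's code, and the fallback for titles with no spaces is an addition of mine. A small max_chars makes the behavior easy to see.

```r
library(stringr)
library(glue)

# Hypothetical helper: shorten a title to at most max_chars characters,
# cutting at the last space that leaves room for a trailing "..."
truncate_title <- function(title, max_chars = 250) {
  if (nchar(title) <= max_chars) {
    return(title)
  }
  spaces <- str_locate_all(title, " ")[[1]][, "start"]
  usable <- spaces[spaces <= max_chars - 2]
  # Fall back to a hard cut if the title has no usable space
  cut_at <- if (length(usable) > 0) max(usable) - 1 else max_chars - 3
  paste0(str_sub(title, start = 1, end = cut_at), "...")
}

short_title <- truncate_title("one two three four", max_chars = 12)

# Compose the tweet text the same way as above (the link is made up)
link <- "https://example.com/q/1"
tweet_text <- glue("{short_title} #tidyverse #rstats {link}")
tweet_text
```

Keeping the text composition separate from post_tweet like this makes it easy to check what would be sent before anything actually goes out.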

Ultimately, I’d like to set up @tidyversetweets to run off of webhooks and update instantaneously, rather than relying on a script to run every five minutes. But for now, this does a decent enough job, and was an excellent opportunity to demonstrate how easy it is to interact with Twitter using rtweet. If you have any questions or suggestions for improvement, feel free to leave a comment here, reach out to me on Twitter (@jakethomp), or file an issue on GitHub.

Session info

devtools::session_info()
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2017-12-11                  
#> 
#>  package    * version    date       source                            
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.0)                    
#>  backports    1.1.1      2017-09-25 CRAN (R 3.4.2)                    
#>  base       * 3.4.3      2017-12-07 local                             
#>  bindr        0.1        2016-11-13 cran (@0.1)                       
#>  bindrcpp   * 0.2        2017-06-17 cran (@0.2)                       
#>  bitops       1.0-6      2013-08-17 CRAN (R 3.4.0)                    
#>  blogdown     0.3        2017-11-13 CRAN (R 3.4.2)                    
#>  bookdown     0.5        2017-08-20 CRAN (R 3.4.1)                    
#>  broom        0.4.3      2017-11-20 CRAN (R 3.4.3)                    
#>  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.0)                    
#>  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                    
#>  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.0)                    
#>  compiler     3.4.3      2017-12-07 local                             
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.4.1)                    
#>  curl         3.0        2017-10-06 CRAN (R 3.4.2)                    
#>  datasets   * 3.4.3      2017-12-07 local                             
#>  devtools     1.13.4     2017-11-09 CRAN (R 3.4.2)                    
#>  digest       0.6.12     2017-01-27 CRAN (R 3.4.0)                    
#>  dplyr      * 0.7.4.9000 2017-11-28 Github (tidyverse/dplyr@fc66342)  
#>  evaluate     0.10.1     2017-06-24 cran (@0.10.1)                    
#>  feedeR     * 0.0.7      2017-11-30 Github (DataWookie/feedeR@3106e5d)
#>  forcats    * 0.2.0      2017-01-23 CRAN (R 3.4.0)                    
#>  foreign      0.8-69     2017-06-22 CRAN (R 3.4.3)                    
#>  ggplot2    * 2.2.1.9000 2017-11-17 Github (tidyverse/ggplot2@582acfe)
#>  glue       * 1.2.0      2017-10-29 CRAN (R 3.4.2)                    
#>  graphics   * 3.4.3      2017-12-07 local                             
#>  grDevices  * 3.4.3      2017-12-07 local                             
#>  grid         3.4.3      2017-12-07 local                             
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.4.0)                    
#>  haven        1.1.0      2017-07-09 CRAN (R 3.4.1)                    
#>  hms          0.4.0      2017-11-23 CRAN (R 3.4.3)                    
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                    
#>  httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                    
#>  jsonlite     1.5        2017-06-01 cran (@1.5)                       
#>  knitr        1.17       2017-08-10 cran (@1.17)                      
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.4.3)                    
#>  lazyeval     0.2.1      2017-10-29 cran (@0.2.1)                     
#>  lubridate  * 1.7.1      2017-11-03 CRAN (R 3.4.2)                    
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                    
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                    
#>  methods    * 3.4.3      2017-12-07 local                             
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.0)                    
#>  modelr       0.1.1      2017-07-24 cran (@0.1.1)                     
#>  munsell      0.4.3      2016-02-13 CRAN (R 3.4.0)                    
#>  nlme         3.1-131    2017-02-06 CRAN (R 3.4.0)                    
#>  openssl      0.9.9      2017-11-10 CRAN (R 3.4.2)                    
#>  parallel     3.4.3      2017-12-07 local                             
#>  pkgconfig    2.0.1      2017-03-21 cran (@2.0.1)                     
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.0)                    
#>  psych        1.7.8      2017-09-09 CRAN (R 3.4.3)                    
#>  purrr      * 0.2.4      2017-10-18 CRAN (R 3.4.2)                    
#>  R6           2.2.2      2017-06-17 cran (@2.2.2)                     
#>  Rcpp         0.12.14    2017-11-23 cran (@0.12.14)                   
#>  RCurl        1.95-4.8   2016-03-01 CRAN (R 3.4.0)                    
#>  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.0)                    
#>  readxl       1.0.0      2017-04-18 CRAN (R 3.4.0)                    
#>  reshape2     1.4.2      2016-10-22 CRAN (R 3.4.0)                    
#>  rlang        0.1.4      2017-11-05 CRAN (R 3.4.2)                    
#>  rmarkdown    1.8        2017-11-17 Github (rstudio/rmarkdown@6757c4a)
#>  rprojroot    1.2        2017-01-16 CRAN (R 3.4.0)                    
#>  rstudioapi   0.7        2017-09-07 CRAN (R 3.4.1)                    
#>  rtweet     * 0.6.0      2017-11-16 CRAN (R 3.4.2)                    
#>  rvest        0.3.2      2016-06-17 CRAN (R 3.4.0)                    
#>  scales       0.5.0.9000 2017-08-30 Github (hadley/scales@d767915)    
#>  stackr     * 0.0.0.9000 2017-11-28 Github (dgrtwo/stackr@3708582)    
#>  stats      * 3.4.3      2017-12-07 local                             
#>  stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                    
#>  stringr    * 1.2.0      2017-02-18 CRAN (R 3.4.0)                    
#>  tibble     * 1.3.4      2017-08-22 cran (@1.3.4)                     
#>  tidyr      * 0.7.2      2017-10-16 CRAN (R 3.4.2)                    
#>  tidyselect   0.2.3      2017-11-06 CRAN (R 3.4.2)                    
#>  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                    
#>  tools        3.4.3      2017-12-07 local                             
#>  utils      * 3.4.3      2017-12-07 local                             
#>  withr        2.1.0.9000 2017-11-17 Github (jimhester/withr@daf5a8c)  
#>  XML          3.98-1.9   2017-06-19 CRAN (R 3.4.1)                    
#>  xml2         1.1.1      2017-01-24 CRAN (R 3.4.0)                    
#>  yaml         2.1.15     2017-12-01 CRAN (R 3.4.3)
