How to write a custom gatherer/texthandler for RCurl

The RCurl package for R allows you to download files from the internet. This is very useful if you want to make your analysis completely reproducible even though it relies on files that have to be downloaded, or if the source data can change over time.

Using the getURL() function you can download any file by its URL. What makes this even more useful is that you can give it a vector of URLs and it will download them in parallel (by performing asynchronous downloads), returning the results in a vector of the same length as the input.
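
For example, here is a minimal sketch of the vectorised call, repeating a single placeholder URL; the result is a character vector with one element per URL:

library(RCurl)

# Two asynchronous downloads; pages is a character vector of length 2
urls <- rep("https://missingreadme.wordpress.com/", 2)
pages <- getURL(urls)
nchar(pages)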

But what if the file to be downloaded is large and you're only interested in a small part of it? That's where gatherers come in. If you provide a gatherer (an S3-style object) to the write parameter of the getURL() function, its update() method will be called whenever data is received from one of the input URLs. That's a great feature, but unfortunately it's not well documented.

Solution

1. Gatherer Objects

Below is a constructor for a very simple gatherer that just returns the contents of the downloaded file as a single string. This could be adapted to do more complex things, as sketched after the usage example below.

SimpleGatherer <- function(txt=character()){
  # SimpleGatherer is a constructor function that returns a new
  # gatherer object. txt is the variable that stores the contents of the
  # downloaded file
  #
  # This code is heavily based on basicTextGatherer() from RCurl

  reset <- function(){
    # reset() is called when the gatherer is first created. It initialises txt
    txt <<- character()
  }
  update <- function(str){
    # update() is called whenever new data is downloaded. Here it appends it
    # to txt
    txt <<- paste(txt, str, sep="")
  }
  value <- function(){
    # value() is called when we want the result
    return(txt)
  }

  # gatherer is the new gatherer object that we want to return
  gatherer <- list(reset=reset, update=update, value=value)
  class(gatherer) <- c("RCurlTextHandler", "RCurlCallbackFunction")
  gatherer$reset()

  return(gatherer)
}

Then call getURL() with the write argument set to an instance of your new gatherer, e.g.:

getURL("https://missingreadme.wordpress.com/", write=SimpleGatherer())

2. Asynchronous Gatherer Objects

In order to use a custom gatherer for parallel/asynchronous downloads, you must wrap one gatherer object per URL in an object of class MultiTextGatherer, like this:

SimpleAsyncGatherer <- function(n){
  # n is the number of URLs you're downloading
  gatherer <- lapply(seq_len(n), function(i) SimpleGatherer())
  class(gatherer) <- "MultiTextGatherer"
  return(gatherer)
}

Note that if you don’t specify a custom gatherer using write, getURL automatically creates an asynchronous gatherer for you using the default gatherer object (basicTextGatherer()).

Use this gatherer as you would before, but now with multiple URLs:

getURL(rep("https://missingreadme.wordpress.com/", 5), write=SimpleAsyncGatherer(5))