Orchestrated web scraping
The main goal of Kuber is to help with massively parallel computations. It leverages Kubernetes and Docker to create a container that automatically runs orchestrated tasks in parallel via expansion. If you already use Google Cloud Platform, Kuber can also automatically create clusters, run computations, and track their progress with Google's Cloud SDK.
If you've never heard of container orchestration, persistent cloud storage, or parallel computing, this tutorial might feel like a bit too much. You don't need to be an expert in these subjects, but it helps to know what these terms mean.
This tutorial will guide you through creating your first Kuber task. Before starting, make sure your environment meets all the requirements described in the "Getting started" vignette.
The task itself
Kuber's main advantage over most parallelisation packages (like parallel or future/furrr) is that it automatically creates a computing cluster that runs your task via container orchestration. This can be very useful for, e.g., web scraping because (1) each node has a different IP, (2) saving scraped HTMLs is easy with Google Cloud Storage, and (3) the process can be picked up or put down at any point.
In this tutorial, the function to be parallelized is the following:
# Scrape a character vector of URLs
scrape_urls <- function(urls) {
  # Create a directory
  dir <- fs::dir_create("scraped")
  # Iterate over URLs
  paths <- c()
  for (url in urls) {
    path <- paste0(dir, "/", stringr::str_remove_all(url, "[^a-z]"), ".html")
    paths <- append(paths, path)
    httr::GET(url, httr::write_disk(path, overwrite = TRUE))
  }
  return(paths)
}
Simple enough: this function takes a character vector of URLs, scrapes them, and saves the resulting HTMLs in a local directory.
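If you want a quick sanity check before involving the cloud at all, you can run the function locally. The URLs below are arbitrary examples (not part of the task) and assume an internet connection:
# Quick local test of scrape_urls() with arbitrary example URLs
test_paths <- scrape_urls(c("https://www.r-project.org", "https://cran.r-project.org"))
# test_paths holds the local paths of the saved HTML files,
# e.g. "scraped/httpswwwrprojectorg.html"
test_paths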
Creating the cluster
Now on to Kuber. If everything was installed correctly, you should be able to create a simple cluster with the following command:
library(kuber)
kub_create_cluster("toy-cluster", machine_type = "f1-micro")
#> ✔ Creating cluster
Head over to the Kubernetes console to see if everything worked. Don't worry if you get a bunch of warnings; most of them are about the SDK's version.
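If you prefer the terminal to the web console, the Cloud SDK itself can confirm that the cluster is up. This is a plain `gcloud` call, not part of Kuber, and assumes the SDK is installed and authenticated:
# List the clusters visible to the active gcloud configuration
system("gcloud container clusters list")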
Creating the task
The most important function in Kuber is probably the next one. It creates a directory on your local machine that describes the parallel computation and its cluster, bucket, image, and service account. To run the command below, only `toy-key.json` (the service account key downloaded in the "Getting started" vignette) must already exist at the indicated location; the rest is all created for you.
kub_create_task("~/toy-dir", "toy-cluster", "toy-bucket", "toy-image", "~/toy-key.json")
#> ✔ Fetching cluster information
#> ✔ Fetching bucket information
#> ✔ Creating bucket
#> ● Edit `~/toy-dir/exec.R`
#> ● Create `~/toy-dir/list.rds` with usable parameters
#> ● Run `kub_push_task("~/toy-dir")`
Editing exec.R and list.rds
The directory created by `kub_create_task()` has some files that are explored in detail in that function's documentation, but the two most important are `exec.R` and `list.rds`. The first contains the R file to be executed by the Docker image, while the latter has every object that every node needs for its own `exec.R`.
Starting from `exec.R`, the file is already populated with a simple template:
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
# Arguments
idx <- as.numeric(args[1])
bucket <- as.character(args[2])
# Use this function to save your results
save_path <- function(path) {
  # Copy the file/folder to the bucket, then remove it from the node's disk
  system(paste0("gsutil cp -r ", path, " gs://", bucket, "/", gsub("/.+", "", path)))
  do.call(file.remove, list(list.files(path, full.names = TRUE)))
  return(path)
}
# Get object passed in list[[idx]]
obj <- readRDS("list.rds")[[idx]]
###########################
## INSERT YOUR CODE HERE ##
###########################
As you can see, it is an Rscript that takes two arguments: an index and the name of a GCS bucket. The next chunk describes a function to be used when saving results; it sends the file/folder in `path` to the specified bucket and then deletes it from the node's disk. Lastly, the script reads `list.rds` and selects the object at index `idx`.
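If you want to step through the template interactively before deploying anything, you can mimic the arguments that the cluster would pass. This is just a local convenience, with placeholder values, and assumes your working directory is `~/toy-dir`:
# Simulate the command-line arguments a node would receive
args <- c("1", "toy-bucket")  # placeholder values, not real cluster input
idx <- as.numeric(args[1])
bucket <- as.character(args[2])
# With a local list.rds in place, this is the first node's chunk of work
obj <- readRDS("list.rds")[[idx]]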
Now it is time to add `scrape_urls()` to the file. There aren't any changes in the function itself, only in how the resulting files are handled. Here is the final version of `exec.R`:
#!/usr/bin/env Rscript
args <- commandArgs(trailingOnly = TRUE)
# Arguments
idx <- as.numeric(args[1])
bucket <- as.character(args[2])
# Use this function to save your results
save_path <- function(path) {
  # Copy the file/folder to the bucket, then remove it from the node's disk
  system(paste0("gsutil cp -r ", path, " gs://", bucket, "/", gsub("/.+", "", path)))
  do.call(file.remove, list(list.files(path, full.names = TRUE)))
  return(path)
}
# Get object passed in list[[idx]]
obj <- readRDS("list.rds")[[idx]]
# Scrape a character vector of URLs
scrape_urls <- function(urls) {
  # Create a directory
  dir <- fs::dir_create("scraped")
  # Iterate over URLs
  paths <- c()
  for (url in urls) {
    path <- paste0(dir, "/", stringr::str_remove_all(url, "[^a-z]"), ".html")
    paths <- append(paths, path)
    httr::GET(url, httr::write_disk(path, overwrite = TRUE))
  }
  return(paths)
}
# Run the scraper
paths <- scrape_urls(obj)
# Save HTMLs in GCS
for (path in paths) {
  save_path(path)
}
As you might have guessed from the calls above, `obj` contains the URLs to be scraped. This makes sense because, as described earlier, `list.rds` has every object that every node needs for its own `exec.R`; in this case, every node needs a character vector of URLs to be scraped, and `idx` is simply the ID of each node (so that no two nodes scrape the same URLs). That's it.
Now the only thing left is creating `list.rds`, that is, the list of URLs broken into one chunk per node. Since, in this toy example, `toy-cluster` was created with the default number of nodes (3), `list.rds` will be a list with 3 elements. The following commands should be run on your local machine:
# URLs to be scraped, chunked by nodes
url_list <- list(
  c("google.com", "duckduckgo.com"),
  c("wikipedia.org"),
  c("facebook.com", "twitter.com", "instagram.com")
)
# Overwrite sample list.rds with list of URLs
readr::write_rds(url_list, "~/toy-dir/list.rds")
With this `list.rds`, the first node will scrape search engines, the second will scrape Wikipedia, and the third will scrape social media.
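In a real task with many more URLs you probably won't want to build the chunks by hand. Here is one way to split a long vector automatically, assuming the same 3-node cluster (the URLs are the same toy examples as above):
# Split a long vector of URLs into one chunk per node
n_nodes <- 3
urls <- c(
  "google.com", "duckduckgo.com", "wikipedia.org",
  "facebook.com", "twitter.com", "instagram.com"
)
# split() returns a named list; exec.R indexes it by position, which still works
url_list <- split(urls, rep_len(seq_len(n_nodes), length(urls)))
readr::write_rds(url_list, "~/toy-dir/list.rds")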
Pushing and running the task
Last but not least, the task must be pushed to Google Container Registry (GCR), which is where Kuber's Docker images will live. This guarantees version control for all tasks and allows them to be run from another computer, but it may take a while to run the first time you create a task.
kub_push_task("~/toy-dir")
#> ✔ Building image
#> ✔ Authenticating
#> ✔ Pushing image
#> ✔ Removing old jobs
#> ✔ Creating new jobs
If everything up to here worked, the last mandatory command is running the task:
kub_run_task("~/toy-dir")
#> ✔ Authenticating
#> ✔ Setting cluster context
#> ✔ Creating jobs
#> ● Run `kub_list_pods()` to follow up on the pods
Checking up on the task
There are two main ways to check the progress of a task: listing the currently active pods and listing the files uploaded to the bucket. The weird strings in the names of the processes are unique identifiers generated by Kuber to track those pods.
kub_list_pods("~/toy-dir")
#> ✔ Setting cluster context
#> ✔ Fetching pods
#>                          NAME READY  STATUS RESTARTS AGE
#> 1 process-mkewsr-item-1-8kpg7   1/1 Running        0  1m
#> 2 process-mkewsr-item-2-cph8z   1/1 Running        0  1m
#> 3 process-mkewsr-item-3-kpn5f   1/1 Running        0  1m
If your pods' statuses denote something bad, you might need to debug your `exec.R` file. This is absolutely normal, and it can take multiple attempts until your task is running correctly. If you need help debugging your task, take a look at the "Debugging exec.R" vignette.
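A quick way to see what went wrong inside a specific pod is to read its logs with `kubectl` directly, using a pod name from the listing above (this assumes `kubectl` is installed and already pointed at the cluster):
# Print the logs of one pod to help diagnose a failing exec.R
system("kubectl logs process-mkewsr-item-1-8kpg7")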
The command below lists every file in a bucket. You can also specify a folder inside the bucket and whether the listing should be done recursively. Here it's possible to see that every download finished correctly.
kub_list_bucket("~/toy-dir", folder = "scraped")
#> ✔ Listing content
#> [1] "googlecom.html"     "duckduckgocom.html" "wikipediaorg.html"
#> [4] "facebookcom.html"   "twittercom.html"    "instagramcom.html"