RSelenium, will work for free

Siqi Zhu
May 12, 2021 · 4 min read

Are you looking for someone to take over repetitive web data scraping tasks? Consider RSelenium as your new assistant, for free!

"Who is RSelenium?" you may ask. RSelenium, the R bindings that allow automation of repetitive web-based tasks, such as finding and copying data, using the Selenium WebDriver, was introduced to me by a talented colleague of mine. With very basic knowledge of R, and even less knowledge of HTML, I found RSelenium a bit standoffish at first sight. I was assured that things would get easier once I got to know RSelenium a little better. So there it began, my journey with RSelenium. A year later, I can now kick back at my desk while my assistant executes the commands; in the meantime, there is ever more to discover.

Compared to some of the vendor solutions (e.g., UiPath) out there, RSelenium does require a basic knowledge of R. If R is also somewhat new to you, you could be like me — using RSelenium as a way to build up R knowledge, which, at the end of the day, would be doubly rewarding.

Envy no more — this short read can get you started with the bread and butter, plus many tips and tricks to follow as your interest grows. The sample script below can be broken down into 6 steps and downloads the Covid-19 vaccine data for Ontario (TL;DR):

1. Install and load the RSelenium package:
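This step is nothing more than the usual package setup (install once, then load in every session):

```r
# Install the package once, then load it at the start of each session
install.packages("RSelenium")
library(RSelenium)
```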

2. Set up a Selenium server and browser:
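A minimal sketch of this step is below; the chromever value and port are illustrative (they are the ones that match my setup, as explained next), so substitute your own:

```r
# Start a Selenium server and open a Chrome browser session.
# chromever must match the ChromeDriver that pairs with your installed Chrome.
rD <- rsDriver(browser = "chrome", chromever = "73.0.3683.68", port = 4444L)
remDr <- rD$client   # the client object used to drive the browser
```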

Chrome is the browser of choice for this example; the version specified with chromever is for ChromeDriver, and must match the version of Chrome installed on your computer. In this case, I'm running Chrome Version 73.0.3683.86, for which the matching ChromeDriver can be found (and downloaded) from this link (see the side note below for an alternative method). If you are unsure whether the correct ChromeDriver version has been downloaded, you can check by calling binman::list_versions("chromedriver") in R.

A side note for the initial setup (ignore this if you don't have a browser/driver mismatch issue due to a restricted browser version): since my work computer was running an older version of Chrome that required ChromeDriver version 73.0.3683.68, a quick way to obtain it, along with all the ChromeDrivers up to the latest available version, is to locate the "history" setting in the chromedriver file in the wdman folder (check the image below for the path). The default value of "history" is 3, i.e. the last 3 versions of ChromeDriver sourced by RSelenium. In my case, there were 24 versions of ChromeDriver going back to the correct one; by changing the history to 24, this trick removes the need to manually download and install ChromeDriver every time our IT department upgrades Chrome on our computers.

Change “history” to download more versions of ChromeDriver
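If you would rather find that file from within R than dig through the library folder, something along these lines should work (the "yaml" sub-folder and file name are what I see in my wdman installation; verify the path on yours):

```r
# Locate the chromedriver spec file that ships with the wdman package
yml_path <- system.file("yaml", "chromedriver.yml", package = "wdman")
yml_path            # print the full path so you can find the file
# Open the file and change `history: 3` to the depth you need, e.g. `history: 24`
file.edit(yml_path)
```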

3. Navigate to the web page:
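The navigation itself is a single call. The URL below is my best recollection of the Ontario Covid-19 vaccine data page, so double-check it before running:

```r
# Point the browser at the page hosting the vaccine data
remDr$navigate("https://data.ontario.ca/dataset/covid-19-vaccine-data-in-ontario")
Sys.sleep(5)   # give the page a moment to load (see the note below)
```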

It's common to see slow or delayed responses from websites, especially compared to how fast R scripts are executed. The mismatch can cause RSelenium to fall out of sync with the web page. A quick and easy way around the problem is to delay execution using Sys.sleep(5), which suspends R for 5 seconds. A more sophisticated and reliable way is to programmatically check whether the page has loaded completely.

Adding 5-sec delays to accommodate the time it takes to set up the Selenium server and browser (first), and to load the web page (second)
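For the more reliable route, one option is to poll the browser until the document reports that it has finished loading. A sketch (the readyState check is standard JavaScript; the loop bounds are arbitrary):

```r
# Keep checking (up to ~30 seconds) until the page reports it is fully loaded
for (i in 1:30) {
  state <- remDr$executeScript("return document.readyState;")[[1]]
  if (identical(state, "complete")) break
  Sys.sleep(1)
}
```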

4. Obtain information about the desired web element:

XPath, or XML Path Language, is the "information" used to identify the desired web element: in this case, the Download button for the Covid-19 vaccine data. It can be easily obtained by inspecting the HTML structure of the web page.

How to locate the xpath of the desired web element
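Once you have copied the XPath from the browser's inspector, pass it to findElement(). The XPath string below is only a placeholder; use the one you copied for the actual Download button:

```r
# Locate the Download button via its XPath (placeholder value shown here)
download_btn <- remDr$findElement(using = "xpath",
                                  value = "//a[contains(text(), 'Download')]")
```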

5. Once such information is assigned to an R variable, the click action can be simulated by the method clickElement():

Apply click action to desired web elements in RSelenium
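With the element stored in a variable, the click itself is a one-liner:

```r
# Simulate a mouse click on the Download button
download_btn$clickElement()
```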

6. Now that the file has been successfully downloaded, close the server and browser:
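Closing down takes two calls, one for the browser session and one for the server:

```r
# Close the browser window and shut down the Selenium server
remDr$close()
rD$server$stop()
```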

The complete code looks like this:

A basic RSelenium script for web scraping
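In case the embedded gist does not render for you, here is the whole thing in one place, stitched together from the steps above (the chromever, URL, and XPath are the illustrative values used earlier; swap in your own):

```r
# Step 1: install (once) and load RSelenium
install.packages("RSelenium")
library(RSelenium)

# Step 2: start a Selenium server and a Chrome browser
rD <- rsDriver(browser = "chrome", chromever = "73.0.3683.68", port = 4444L)
remDr <- rD$client
Sys.sleep(5)   # allow time for the server and browser to spin up

# Step 3: navigate to the Ontario Covid-19 vaccine data page
remDr$navigate("https://data.ontario.ca/dataset/covid-19-vaccine-data-in-ontario")
Sys.sleep(5)   # allow time for the page to load

# Step 4: locate the Download button by its XPath (placeholder value)
download_btn <- remDr$findElement(using = "xpath",
                                  value = "//a[contains(text(), 'Download')]")

# Step 5: click it to download the file
download_btn$clickElement()
Sys.sleep(5)   # allow time for the download to finish

# Step 6: close the browser and stop the server
remDr$close()
rD$server$stop()
```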

