Web Scraping in Ruby using Nokogiri
Web scraping is an incredibly powerful tool that allows software developers to extract data from websites. While web scraping is possible in many different programming languages, in this blog I will show you how to web scrape in Ruby using the Nokogiri gem.
What is Web Scraping?
Web scraping is the process of downloading the HTML or XHTML of a webpage, then extracting data from it. There are many different ways a web scraping system fetches website data. Some systems rely on a bot or a web crawler that automatically parses website data. Other systems utilize DOM parsing and even computer vision. Regardless of which system is used to extract a website’s data, once the user has access to this data, they are free to do whatever they want with it. For example, in a recent project, I used web scraping to fetch a list of course names from my local community college’s course catalog. Web scraping is also commonly used for data mining, contact scraping, and online price change monitoring.
Nokogiri
Nokogiri is a Ruby gem that allows us to web scrape. The gem is named after the nokogiri (鋸), a type of Japanese saw used in woodworking and Japanese carpentry. Since the creator of the Ruby language, Yukihiro Matsumoto, is Japanese, naming the gem after a Japanese tool may be an allusion to him. In conjunction with the Ruby module Open-URI, we can easily extract data from a website.
Using Nokogiri
To get started, run `gem install nokogiri` in your terminal to install the Nokogiri gem if it isn't already installed. Since I like anime, I'll be making a short program to extract the titles of the most popular anime from this website: https://anilist.co/search/anime/popular
We’re going to need to require Nokogiri and Open-URI so that our program has access to these web scraping tools:
```ruby
require "nokogiri"
require "open-uri"
```
Next, we’re going to use Open-URI to grab the HTML from the URL, then use the `Nokogiri::HTML` method to convert that HTML into data we can use. We’ll save that data into a variable `doc` and see what it looks like:
```ruby
doc = Nokogiri::HTML(URI.open("https://anilist.co/search/anime/popular/"))
puts doc
```
Wow, that’s a lot of stuff, and it’s quite messy (and even messier if you viewed it in IRB). While the above looks like plain HTML, it is actually a massive collection of nested nodes called a node set, which Nokogiri built when it parsed the page. A node set acts as a collection, which allows us to iterate over elements and use brackets or dot notation to access different parts of it. This is where the real magic happens.
*An Important Note*
Since each webpage is different, the way you scrape a page is very specific to the content of that page! In addition, if the webpage were to have some kind of update, the code used to scrape a page will most likely break as well. The following steps will only apply to this specific webpage as of the day I am writing this blog (April 27, 2021).
The best way to get exactly what we want is to inspect the actual webpage in your browser. Since we want anime titles, let’s find the HTML elements that contain these titles:
Doing some digging, we can see that all the anime titles are stored in `<a>` tags with a convenient class name of `title`. Luckily, Nokogiri has an easy way for us to get this specific data: the `.css` method. The `.css` method retrieves specific pieces of data using CSS selectors. It can be called on the `doc` variable we set earlier, and it takes the CSS selector we want as an argument. In our case, we want the `.title` selector. Let’s save the result into a variable `titles` and see what it returns:
```ruby
titles = doc.css(".title")
puts titles
```
Nice! We have isolated all the HTML elements that contain the anime titles. Now we need to extract the text from these elements. Remember how I said that if we viewed this in IRB, it would be extremely messy? That remains true, but viewing this output in IRB gives us some insight into how to extract the text:
Here, we can actually see the node set structure that was mentioned earlier, and that it acts like an array. Each element is an XML element with a lot of information attached to it. Since the node set acts like an array, we can use bracket notation to grab its first element:
At the very end of this, we can see something that looks like `Nokogiri::XML::Text`. As you may have guessed, this node contains the text of the element! To extract it, all we need to do is call the `.text` method on the first element:
To get rid of the `\n` and any extra whitespace surrounding the text, we can use `.strip`:
We have successfully extracted a single anime title! If we want all the anime titles, we just need to do this for every title. Remember that node sets act like arrays, so we can call methods such as `.each`, `.filter`, and `.map`. Let’s use `.map` to select all the anime titles and put them in an array:
```ruby
result = titles.map { |t| t.text.strip }
p result
```
Let’s see the result!
And there we go, we have an array of all the most popular anime titles, web scraped from a webpage!
Here is the full code:
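Putting it all together, the complete program assembled from the snippets above looks like this (it fetches the live page, so the output depends on the site’s markup at the time you run it):

```ruby
require "nokogiri"
require "open-uri"

# Fetch the page and parse it into a Nokogiri node set
doc = Nokogiri::HTML(URI.open("https://anilist.co/search/anime/popular/"))

# Grab every element with the "title" class and clean up the text
titles = doc.css(".title")
result = titles.map { |t| t.text.strip }
p result
```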
Other Nokogiri Methods of Note
As you saw in the example, getting the data we wanted was quite simple. However, every webpage is different, and some webpages may not have their data stored conveniently like this. Thankfully, Nokogiri has other methods that we can call on each XML Element to help with this.
.name
The `.name` method returns the name of the XML element, which is usually the element’s tag name.
.attributes
The `.attributes` method returns a hash containing all the HTML attributes associated with that element.
.children
The `.children` method returns all the nodes nested inside the element. This can be particularly useful if the elements you are looking for do not have a class name.
Conclusion
Using Nokogiri to web scrape data from a website is an incredibly powerful and useful tool to have in your programmer’s toolbox. Learning how to web scrape allows you to automate data collection that would be tedious and time-consuming to do manually!
Hope this helps!