Web Scraping in Ruby using Nokogiri

Web scraping is a powerful technique that lets software developers extract specific data from websites. While web scraping is possible in many different programming languages, in this post I will show you how to scrape the web in Ruby using the Nokogiri gem.

A nokogiri, a Japanese saw

What is Web Scraping

Nokogiri

Using Nokogiri

We’re going to need to require Nokogiri and Open-URI so that our program has access to these web scraping tools:

require "nokogiri"
require "open-uri"

Next, we’re going to use Open-URI to grab the HTML from the URL, then use the Nokogiri::HTML method to convert the HTML into data we can use. We’ll save that data into a variable doc, and see what it looks like:

doc = Nokogiri::HTML(URI.open("https://anilist.co/search/anime/popular/"))
puts doc

Wow, that’s a lot of stuff, and it’s quite messy (and even messier if you viewed this in IRB). While the above looks like plain HTML, doc is actually a Nokogiri document object: a tree of nested nodes that Nokogiri built when it parsed the page. Searching that tree returns a node set, which acts as a collection, so we can iterate over its elements and use brackets or dot notation to access different parts of it. This is where the real magic happens.

*An Important Note*

The best way to get exactly what we want is to inspect the actual webpage in your browser. Since we want anime titles, let’s find the HTML elements that contain these titles:

Doing some digging, we can see that each anime title is stored in an <a> tag with a convenient class name of title. Luckily, Nokogiri has an easy way for us to get this specific data: the .css method. The .css method retrieves specific pieces of data using CSS selectors. It can be called on the doc variable we set earlier, and it takes the CSS selector we want as its argument. In our case, we want the .title selector. Let’s save the result into a variable titles, and see what it returns:

titles = doc.css(".title")
puts titles

Nice! We have isolated all the HTML elements that contain the anime titles. Now we need to extract the text from these elements. Remember how I said that if we viewed this in IRB, it would be extremely messy? Although that remains true, viewing this output in IRB gives us some insight into how to extract the text:

Here, we can actually see the node set structure that was mentioned earlier, and that it acts like an array. Each element is a Nokogiri::XML::Element, and each one has a lot of information attached to it. Since the node set acts like an array, we can use bracket notation to grab its first element:

At the very end of this, we can see something that looks like Nokogiri::XML::Text. As you might have guessed, this node contains the text of the element! To extract it, all we need to do is call the .text method on the first element:

To get rid of the \n or any extra whitespace surrounding the text, we can use .strip:

We have successfully extracted a single anime title! If we want all the anime titles, we just need to do this for every element in titles. Remember that node sets act like arrays, so we can call methods such as .each, .select, and .map. Let’s use .map to extract all the anime titles and put them in an array:

result = titles.map { |t| t.text.strip }
p result

Let’s see the result!

And there we go, we have an array of all the most popular anime titles, web scraped from a webpage!

Here is the full code:

Other Nokogiri Methods of Note

.name

.attributes

.children

Conclusion

Hope this helps!

Thanks for reading!
Source: K-On!

Software engineering student at Flatiron School. Loves all things gaming, photography, and anime.