Scratching the Surface of Data Scraping
I recently made my first foray into data scraping while creating my first Ruby application: a simple CLI that lets the user look up any element of the periodic table and view information about it. To gather the information the app would display, I needed to pull data from an external source, in this case the periodic table website webelements.com. To do this I used the Ruby gems Nokogiri and OpenURI to scrape the data from the site. Along the way I encountered two main problems, and solving them taught me quite a lot about how scraping works and how it can be used to acquire data.
The first difficulty I experienced was choosing the right website to scrape. I began with a quick web search for periodic table websites and narrowed the results down to the ones whose formats seemed best suited to scraping. I realized quickly that charts with embedded links on each element square wouldn't suit my purposes, as they often rendered each entry as an image, making it impossible to extract the element's name and number. This led me to search out sites that included a list of the elements, each linking to an informational page about that element. Of these the most comprehensive seemed to be pubchem.com, a site commonly used by chemistry students everywhere to reference elements and chemical compounds. However, upon attempting to scrape the site, every CSS selector I tried would give back an empty result, as demonstrated below.
page.css('span.btn-text.uppercase') #=> []
I realized then that the pubchem.com website was loading the majority of its content with JavaScript in order to enable more interactive features for visitors, which made scraping more or less impossible, since scraping depends largely on static HTML. At this point I ruled out pubchem.com as a scrapable site and looked at a few others before settling on webelements.com. Webelements.com was well suited to scraping: it ran on relatively simple HTML and contained a nicely formatted list of elements from which to gather data, each of which also linked to its own page with further information. After working out which classes and containers my desired information was nested under, I was able to easily generate an array of hashes where each hash contained the information for one element, like so:
[{:name=>"Hydrogen", :symbol=>"H", :number=>"1", :link=>"http://www.webelements.com/hydrogen/"},
{:name=>"Helium", :symbol=>"He", :number=>"2", :link=>"http://www.webelements.com/helium/"},
{:name=>"Lithium", :symbol=>"Li", :number=>"3", :link=>"http://www.webelements.com/lithium/"} ... ]
I was then able to easily create an element object from each hash by iterating over the array I had generated. Each element instance initializes with all the properties included in the hash: a name, atomic number, atomic symbol, and the URL of its information page on webelements.com.
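That object-creation step might look something like this. The `Element` class below is a minimal sketch of what the post describes, not the app's actual code:

```ruby
# Minimal sketch of an Element class that mass-assigns its attributes
# from a hash of scraped data; names here are assumptions.
class Element
  attr_accessor :name, :symbol, :number, :link

  @@all = []

  def initialize(attributes)
    # Call the matching setter for each key in the hash.
    attributes.each { |key, value| send("#{key}=", value) }
  end

  def self.create_from_collection(array_of_hashes)
    @@all = array_of_hashes.map { |hash| new(hash) }
  end

  def self.all
    @@all
  end
end

hydrogen = Element.new(name: "Hydrogen", symbol: "H", number: "1",
                       link: "http://www.webelements.com/hydrogen/")
```

The `send("#{key}=", value)` trick keeps the initializer agnostic about which keys the scraper produces, so adding a new scraped attribute only requires a new `attr_accessor`.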
The second difficulty I encountered was gathering information from the URLs of the individual elements' information pages. It was a two-fold problem. The first part was that some of the information came in chunks: certain areas of the page listed whole sections of information under the same CSS tag, with no way to differentiate between them. As a result, I wound up writing the following code to parse the lumps of data into single items that I could feed into my element instances as attributes one at a time, by splitting each chunk into an array and then cleaning up the elements of that array with the `strip` method.
properties = page.css('ul.ul_facts_table').text.split(/\n/).map{|item| item.strip}
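On a sample chunk of text, that pipeline behaves like this. The string below is invented for illustration and stands in for the single blob that `page.css('ul.ul_facts_table').text` returns:

```ruby
# Stand-in for page.css('ul.ul_facts_table').text: one string in which
# each fact sits on its own line, padded with stray whitespace.
raw = "  Atomic number: 1  \n  Relative atomic mass (Ar): 1.008  \n  Group: 1  \n"

# Split the blob on newlines, then strip each entry clean.
properties = raw.split(/\n/).map { |item| item.strip }
# properties => ["Atomic number: 1", "Relative atomic mass (Ar): 1.008", "Group: 1"]
```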
The second part of the problem was that several of the elements had little or no information listed, partly due to a lack of updating on the part of the webelements.com creators, and partly because some information is as yet undiscovered, since not all elements are completely understood. To avoid raising an error when asking for information about these elements, each element still needed to have these properties, but the properties had to return a default statement when there was no information to generate them with. To do this I implemented a series of if statements like this:
if properties.drop(1).find { |item| item.include? "mass" }
  properties_hash[:mass] = properties.drop(1).find { |item| item.include? "mass" }.gsub(/(Relative atomic mass)|\(|[Ar]|\)|\:/, "").strip
else
  properties_hash[:mass] = "N/A"
end
Each check tests whether there is information for a property and, when information is missing, supplies a default, in this case the string "N/A" for "not applicable". At that point everything finally functioned with no unexpected errors.
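The same guard pattern can be condensed into a small helper, shown here against sample data. The helper name and regexes are illustrative, not the app's exact code:

```ruby
# Hypothetical helper: find the first property line containing a keyword,
# strip its label away with the supplied regex, or fall back to "N/A".
def extract_property(properties, keyword, label_pattern)
  match = properties.find { |item| item.include?(keyword) }
  match ? match.gsub(label_pattern, "").strip : "N/A"
end

# Sample scraped lines; note there is no boiling point entry at all.
properties = ["Relative atomic mass (Ar): 1.008", "Group: 1"]

mass    = extract_property(properties, "mass", /Relative atomic mass \(Ar\):/)
boiling = extract_property(properties, "boiling", /Boiling point:/)
# mass    => "1.008"
# boiling => "N/A"
```

Folding the find-gsub-strip-or-default logic into one method avoids repeating the same if/else block once per property.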
In retrospect, if I had known from the beginning what to look for in a website, I could have avoided many of these problems. Having completed this project, I have learned both what to look for, and what to avoid, in a website I want to scrape, as well as how to work around certain issues, all of which will be extremely helpful in the future.