Web scraping is a common technique for collecting large amounts of data from publicly available websites.
There are various ways to do it, but a few methods are used far more often than the rest.
Many people use Selenium to navigate programmatically through web pages and pull data from them. Selenium is helpful, especially for scraping websites with lots of dynamically loaded components. Because these components load only after the page has loaded its static features, other techniques often fail to fetch them.
But programming a Selenium driver can be overkill for other websites. If a site renders its data as static pages, there are simpler ways to fetch large amounts of data in no time.
We will cover these simpler techniques first, then discuss ways to speed up Selenium-based scripts.
Before we move on, here’s something you should be aware of: not all websites welcome bots. You can avoid legal trouble by reaching out to the site administrators before you do any scraping.
A quick check is to study the site’s `/robots.txt` file. If you see something like the following, you can be sure the site does not want you to scrape it.
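For example, a blanket rule like this one forbids all crawlers from fetching any path on the site:

```
User-agent: *
Disallow: /
```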
But not seeing this or seeing `Allow: /` doesn’t mean the site grants you permission to scrape. Check their terms-of-service page and reach out to the admin.
1. The fastest way to scrape websites using Python.
This is by far the simplest of all the techniques for scraping websites.
Yes, the widespread Python data manipulation library, Pandas, can do web scraping too. The `read_html` method will fetch all the data tables on the web page and return them in a list.
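Here is a minimal sketch. An inline HTML string stands in for a real page; with a live site you would pass the URL straight to `read_html`:

```python
import io

import pandas as pd

# Inline HTML standing in for a real page. In practice you can pass a URL
# directly, e.g. pd.read_html("https://example.com/stats") (hypothetical URL).
html = """
<table>
  <tr><th>country</th><th>population</th></tr>
  <tr><td>Freedonia</td><td>10</td></tr>
  <tr><td>Sylvania</td><td>20</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(df)
```

Note that recent pandas versions expect literal HTML to be wrapped in a file-like object such as `io.StringIO`, which is why it appears here.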
There are a couple of valuable hacks to do this more efficiently.
Filter tables using HTML attributes.
Sometimes it’s difficult to pick out the one table we need from all the HTML tables on a page. We can use the `attrs` parameter to filter them. You’re lucky if the table has an `id` attribute, because it lets you pinpoint the needle in the haystack. Even then, `read_html` still returns a list.
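A short sketch of attribute filtering, again using an inline HTML string with two tables (the `id` values are made up for the example):

```python
import io

import pandas as pd

# Two tables on one page; we only want the one with id="prices".
html = """
<table id="prices">
  <tr><th>item</th><th>price</th></tr>
  <tr><td>apple</td><td>3</td></tr>
</table>
<table id="stock">
  <tr><th>item</th><th>qty</th></tr>
  <tr><td>apple</td><td>7</td></tr>
</table>
"""

# attrs keeps only the tables whose HTML attributes match.
tables = pd.read_html(io.StringIO(html), attrs={"id": "prices"})
df = tables[0]  # still a list, even when only one table matches
```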
Remove commas that separate thousands.
Commas improve the readability of large numbers, but they are an extra burden to clean up before parsing values into numeric data types. In Pandas, you can set the `thousands` parameter to the separator character, and the values arrive already in numeric format.
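For instance, with comma-grouped numbers (note that in recent pandas versions `","` is already the default `thousands` value for `read_html`, so this mainly matters for other separators):

```python
import io

import pandas as pd

html = """
<table>
  <tr><th>city</th><th>population</th></tr>
  <tr><td>Metropolis</td><td>1,234,567</td></tr>
  <tr><td>Smallville</td><td>89,000</td></tr>
</table>
"""

# thousands="," strips the separators so the column parses as integers.
df = pd.read_html(io.StringIO(html), thousands=",")[0]
print(df.dtypes)
```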
Parse date fields.
Parsing date fields can be a daunting task, because date formats vary across the world. A clever technique when using Pandas is the `parse_dates` parameter. Set `parse_dates=True`, and Pandas will try to parse date-like fields if they follow a recognizable date format.
You can do more with this parameter. For example, instead of `True`, you can pass a list of the column indexes (or names) that should be parsed.
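A sketch of the column-list form, parsing only the first column as dates:

```python
import io

import pandas as pd

html = """
<table>
  <tr><th>day</th><th>visits</th></tr>
  <tr><td>2021-01-01</td><td>10</td></tr>
  <tr><td>2021-01-02</td><td>12</td></tr>
</table>
"""

# parse_dates=[0] asks pandas to parse the first column as dates;
# the "day" column comes back as datetime64 instead of strings.
df = pd.read_html(io.StringIO(html), parse_dates=[0])[0]
print(df.dtypes)
```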
When to use pd.read_html to scrape web pages?
The `read_html` method looks for `<table>` tags in the web page, so it only works if the page has them.
Websites don’t always present their data in scrape-friendly tables; often it is formatted as a list view. Product detail pages may not have `<table>` tags even when the product list page does. Thus, `read_html` isn’t a great option for scraping most product detail pages.
Our following approach is helpful in such a situation.
2. Scraping websites using Requests and Beautiful Soup.
In a way, this method isn’t very different from scraping with read_html. Both methods use a similar approach to fetch and parse HTML pages.
Yet, this manual approach gives you more flexibility in element selection. While `read_html` is limited to `<table>` tags, this approach lets you target any element on the page.
Requests is a Python library for interacting with web resources; with it, we can send any type of web request (GET, POST, DELETE, etc.). Beautiful Soup is a popular Python library for parsing HTML pages.
Some Python distributions bundle the requests package, but neither library is part of the standard library. You can install both packages using the following command.
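Assuming pip is available (the Beautiful Soup package is published under the name `beautifulsoup4`):

```shell
pip install requests beautifulsoup4
```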
The following code fetches the web page from a URL and parses it to a more manageable format. This page is a blog list page. The script will pick the titles of each post in the blog list and print them on the screen.
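A minimal sketch of such a script. The URL and the `post-title` CSS class are hypothetical; substitute the real selector for your target blog:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/blog"  # hypothetical blog-list page


def post_titles(html):
    """Return the text of every post-title heading on the page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumes each post title is an <h2 class="post-title"> element.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="post-title")]


def fetch_titles(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return post_titles(response.text)


# Parsing demo on a static snippet; fetch_titles(URL) does the real fetch.
sample = '<h2 class="post-title">First post</h2><h2 class="post-title">Second post</h2>'
print(post_titles(sample))
```

Splitting the fetch from the parse keeps the parsing logic testable without hitting the network.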
The above code is pretty straightforward. As you can see, we use `find_all` to locate elements in the DOM. It accepts extra arguments such as `class_` and `id` to narrow the search further.
Both of the approaches discussed above are excellent for most websites. Yet, modern websites load much of their content dynamically, after the static content. If we use the `read_html` method or the `requests` module to fetch such pages, we only get the static content, not the data we want. Both techniques also work best when each page has a unique URL, and modern websites sometimes don’t use different URLs for pagination. Finally, on some websites the data isn’t visible up front; the user must perform certain actions before the correct page appears.
3. Speed up development time for Selenium-based web scraping.
Selenium is the most widespread tool used for web scraping, although it was originally built to automate software testing.
Here’s the same scrape we did in our last example. This time we are using a Selenium web driver.
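A minimal sketch of that script. The URL and the `post-title` selector are hypothetical, and running it requires Chrome plus a matching driver:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com/blog"  # hypothetical blog-list page

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    # Assumes post titles are <h2 class="post-title"> elements.
    for element in driver.find_elements(By.CSS_SELECTOR, "h2.post-title"):
        print(element.text)
finally:
    driver.quit()
```

The `try`/`finally` ensures the browser process is shut down even if the scrape fails midway.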
For this script to work, you need Selenium installed, and the Chrome (or Firefox) web driver must be on the executable path.
You can install Selenium using the command below. A Chrome web driver can be downloaded from the official ChromeDriver site; recent Selenium releases (4.6 and later) can also fetch a matching driver for you automatically.
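Assuming pip:

```shell
pip install selenium
```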
We could also combine Selenium with Beautiful Soup. Selenium handles all the page interaction and navigation, while Beautiful Soup handles the data parsing. That way, we get the best of both worlds.
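One way to sketch that split of responsibilities (URL and selector are hypothetical; the Selenium import is deferred so the parsing helper also works on its own):

```python
from bs4 import BeautifulSoup

URL = "https://example.com/blog"  # hypothetical blog-list page


def titles_from_html(html):
    """Data parsing: Beautiful Soup's job."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumes post titles are <h2 class="post-title"> elements.
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="post-title")]


def scrape(url):
    """Page interaction and navigation: Selenium's job."""
    # Imported here so titles_from_html stays usable without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        html = driver.page_source  # hand the rendered HTML to Beautiful Soup
    finally:
        driver.quit()
    return titles_from_html(html)
```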
This is slightly more complex than the two techniques we discussed earlier. Every time you run the script, it spins up a new Selenium browser and performs the actions you scripted, in sequence.
But there’s a quicker way that doesn’t require you to write any code, or at least not nearly as much as you would write by hand.
The hack is to use a test automation tool. Test automation tools provide a friendly interface for orchestrating actions in a web browser. Under the hood, they, too, run a Selenium browser to test your application. Some tools also let you download the recorded actions as a script.
I suggest using Testproject.io. (I have no affiliation with the company behind this tool.)