
3 Techniques to Scrape Any Websites Using Python

Web scraping is a prevalent technique to accumulate large amounts of data from publicly available websites.

There are various techniques for doing this, yet people use a few of them far more often than the others.

Many people use Selenium to navigate programmatically through web pages and pull data from them. Selenium is helpful, especially for scraping websites with lots of dynamically loaded components. Because these components load only after the page has loaded its static features, other techniques often fail to fetch them.

But programming a Selenium driver can be overkill for other websites. If the website renders its data as static pages, there are easier ways to fetch tons of data in no time.

We will cover those techniques first and then discuss ways to speed up Selenium scripts.

Before we move on, here’s something you should be aware of. Not all websites welcome bot behaviors. You can avoid legal issues by reaching out to the site administrators before you do any scraping.

A quick check is to study the site’s `/robots.txt` file. If you see something like the following, you can say for sure that you CAN’T scrape it.

User-Agent: *
Disallow: /
Bash

 But not seeing this or seeing `Allow: /` doesn’t mean the site grants you permission to scrape. Check their terms-of-service page and reach out to the admin.
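If you prefer to check this programmatically, Python’s built-in urllib.robotparser module can read a robots.txt file and tell you whether a given path may be fetched. Here’s a minimal sketch; the example.com URLs are placeholders for whatever site you plan to scrape.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder URL
rp.read()

# can_fetch(user_agent, url) returns True if the rules allow fetching that URL
print(rp.can_fetch("*", "https://example.com/some/page"))
Python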

1. The fastest way to scrape websites using Python.

Among all the techniques, this is the stupidly simple one.

import pandas as pd
dfs = pd.read_html("<URL TO SCRAPE>")
df = dfs[0]
Python

Yes! Pandas, the widely used Python data manipulation library, can do web scraping too. The `read_html` method fetches all the data tables on the web page and returns them in a list.

There are a couple of valuable hacks to do this more efficiently.

Filter tables using HTML attributes.

Sometimes it’s difficult to pick out the one table we need from all the available HTML tables. We can use the attrs parameter to filter them. You’re lucky if the table has an id attribute, because it lets you pinpoint the needle in the haystack. Even so, read_html still returns a list.

dfs = pd.read_html("URL", attrs = {'id': 'TABLE_ID'})
Python

Remove commas that separate thousands.

Commas improve the readability of large numbers. Yet, they’re an extra burden to clean up before parsing the values into numeric data types. In Pandas, you can set the thousands parameter to the separator character, and you’ll get the values back in numeric format.

dfs = pd.read_html("URL TO SCRAPE", thousands=",")
Python

Parsing dates

Parsing date fields can be a daunting task because people use many different date formats across the world. A clever technique when using Pandas is the parse_dates parameter.
You can set parse_dates=True. Then, Pandas will try to parse all date-like fields if they follow a recognizable date format.
You can do a lot more with this parameter. For example, you could pass a list of column indexes to parse instead of True.

Read more about parsing dates in Pandas.

dfs = pd.read_html("URL_TO_SCRAPE", parse_date=True)
dfs = pd.read_html("URL_TO_SCRAPE", parse_date=[1,2,3])
Python

When to use pd.read_html to scrape web pages?

The read_html method looks for <table> tags in the web page. Hence, this method only works if your website has them.
It’s not very common for websites to always have their data in a scrape-friendly table. More often, it is formatted as a list view. Product detail pages may not have <table> tags even if the product list page does. Thus, read_html isn’t a great option for scraping most product detail pages.
Our following approach is helpful in such a situation.

2. Scraping websites using requests and Beautiful Soup.

In a way, this method isn’t very different from scraping with read_html. Both use a similar approach to fetch and parse HTML pages.
Yet, this manual approach gives you more flexibility in element selection. While read_html is limited to <table> tags, this approach lets you target whichever elements you need.

Requests is a Python library for interacting with web resources; we can use it to send any kind of web request (GET, POST, DELETE, etc.). Beautiful Soup is a popular Python library for parsing HTML pages, and similar parsers exist for R and many other languages.

Some Python distributions (such as Anaconda) already ship with the requests package. Either way, you can install both of these packages using the following command.

pip install requests beautifulsoup4

# If you're using Poetry
poetry add requests beautifulsoup4
Bash

The following code fetches the web page from a URL and parses it into a more manageable format. The page is a blog list page; the script picks out the title of each post and prints it on the screen.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.the-analytics.club")
soup = BeautifulSoup(page.text, 'html.parser')

posts = soup.find_all('div', class_='post-card')

for post in posts:
    title = post.find(class_='card-title')
    print(title.text)


>> How to Embed Serverless Jupyter Notebooks on Any Website
>> Debug Python Scripts Like a Pro
>> Pandas Replace: The Faster and Better Approach to Change Values of a Column.
>> How to Convert a List to a String in Python?
>> This Tiny Python Package Creates Huge Augmented Datasets
>> How to Find the Index of an Element in a List in Python?
>> How to Run Python Tests on Every Commit Using GitHub Actions
>> Data Dilemma: Too Much Data Can Be a Problem for Businesses Rather Than Helpful
>> How to Run SQL Queries on Pandas Data Frames?
>> Use Pipe Operations in Python for More Readable and Faster Coding
Python

The above code is pretty straightforward. As you can see, we use find and find_all to locate elements in the DOM. They accept extra parameters such as class_, id, and attrs to narrow the search further.
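For instance, here’s a minimal sketch of how these parameters narrow a search, reusing the soup object from the snippet above (the id and data-role values are hypothetical placeholders):

# Reuses the `soup` object from the previous example.
table = soup.find('table', id='prices')                  # match a hypothetical id
cards = soup.find_all('div', class_='post-card')         # match by CSS class
links = soup.find_all('a', attrs={'data-role': 'link'})  # match an arbitrary attribute
Python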

Read more about the options available in Beautiful Soup.

Both of the approaches discussed above are excellent for most websites. Yet, modern websites often load their content dynamically after the static page. If we use requests or the read_html method to fetch such pages, we only get the static content, not the data we want.

Also, the requests module and read_html method both work well where each page has a unique URL. Again, modern websites sometimes don’t have different URLs for pagination.

Also, on some websites, the data is not visible upfront. The user must perform certain actions before seeing the correct page.

3. Speed up development time for Selenium-based web scraping.

Selenium is a widely used tool for web scraping, although it was originally built to automate software testing.

Here’s the same scrape we did in our last example. This time we are using a Selenium web driver.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

driver.get("https://www.the-analytics.club")

blog_posts = driver.find_elements(By.CLASS_NAME, 'post-card')

for blog in blog_posts:
    title = blog.find_element(By.CLASS_NAME, 'card-title')
    
    print(title.text)
Python

For this script to work, you need Selenium installed, and the Chrome (or Firefox) web driver must be on the executable path.

You can install Selenium using the below command. Also, please use this link to download a Chrome web driver.

pip install selenium
Bash

We could also combine Selenium with Beautiful Soup: Selenium handles all the page interaction and navigation, while Beautiful Soup handles the data parsing. That way, we bring in the best of both worlds.

from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# webdriver-manager (pip install webdriver-manager) downloads a matching ChromeDriver
driver = webdriver.Chrome(ChromeDriverManager().install())

driver.get("https://www.the-analytics.club")

soup = BeautifulSoup(driver.page_source, 'html.parser')

post_titles = soup.find_all('a', class_='card-title')

for title in post_titles:
    print(title.get_text())
Python

This is slightly more complex than the two other techniques we discussed earlier. Every time you run the script, it spins up a new Selenium browser and takes actions you instruct in the same sequence.
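For example, here’s a minimal sketch of scripting such actions with Selenium’s explicit waits; the 'load-more' button class is a hypothetical placeholder, since the actual page may expose different controls.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.the-analytics.club")

# Wait for the dynamically loaded post cards to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'post-card'))
)

# Click a hypothetical "load more" button, then wait for new cards to render
driver.find_element(By.CLASS_NAME, 'load-more').click()
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'post-card'))
)
Python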

But there’s a quicker way to do this that doesn’t require you to write any code, or at least not as much code as you’d write by hand.

The hack is to use a test automation tool. Test automation tools provide a friendly interface to orchestrate actions on a web browser. Under the hood, they, too, run a Selenium browser to test your application. Some tools also allow you to download the recorded steps as a script.

I suggest using Testproject.io. (I have no affiliation with the company behind this tool.)
