
Introduction to Web Scraping

What is Web Scraping?

Web scraping is the process of extracting data from websites.

DIFFERENT LIBRARIES/FRAMEWORKS FOR SCRAPING:

Scrapy:- If you are dealing with complex scraping operations that require high speed and low resource consumption, Scrapy is a great choice.

Beautiful Soup:- If you're new to programming and want to work on web scraping projects, you should go for Beautiful Soup. It is easy to learn, and you will be able to perform scraping operations quickly, up to a certain level of complexity.

Selenium:- When you are dealing with JavaScript-heavy web applications and need browser automation that handles AJAX/PJAX requests, Selenium is a great choice.

CHALLENGES WHILE SCRAPING DATA:

Pattern Changes:

Problem: Websites change their UI every now and then. Scrapers usually need frequent modification to keep up with these changes; otherwise they return incomplete data or crash.
Solution: Write test cases for your parsing and extraction logic and run them regularly. You can also use a continuous integration (CI) tool to catch failures early.
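
As a rough illustration, a regression test for extraction logic might look like the sketch below. The parse_title() helper and the sample_page.html fixture are hypothetical placeholders, not part of any specific project; the idea is simply that a markup change makes the test fail in CI instead of silently corrupting your data.

# test_parsing.py - a minimal sketch of a regression test for extraction logic.
# parse_title() and sample_page.html are hypothetical placeholders.
from scrapy import Selector

def parse_title(html):
    # The extraction logic under test.
    return Selector(text=html).css('title::text').get()

def test_title_extraction():
    with open('sample_page.html', encoding='utf-8') as f:
        html = f.read()
    # Fails as soon as the site's markup stops matching the selector.
    assert parse_title(html) == 'Quotes to Scrape'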

Anti-Scraping Technologies:

Problem: Some websites use anti-scraping technologies, LinkedIn for instance. If you keep hitting a particular website from the same IP address, there is a high chance the target website will block that IP address.
Solution: Proxy services with rotating IP addresses help in this regard. Proxy servers mask your IP address and can improve crawling speed. Scraping frameworks like Scrapy provide easy integration with several rotating-proxy services.
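
For illustration, here is a minimal sketch of how a Scrapy spider can route requests through rotating proxies by setting the proxy key in request.meta, which Scrapy's built-in HttpProxyMiddleware honours. The proxy addresses are placeholders; a real list would come from your proxy service.

import random
import scrapy

# Placeholder proxy addresses; replace with the ones your proxy service provides.
PROXIES = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8000',
]

class ProxiedSpider(scrapy.Spider):
    name = "proxied"

    def start_requests(self):
        for url in ['http://quotes.toscrape.com/page/1/']:
            # HttpProxyMiddleware routes the request through the chosen proxy.
            yield scrapy.Request(url, meta={'proxy': random.choice(PROXIES)})

    def parse(self, response):
        self.log('Fetched %s' % response.url)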

JavaScript-based Dynamic Content:

Problem: Websites that rely heavily on JavaScript and AJAX to render content make data extraction difficult. Scrapy and similar frameworks/libraries only parse what they find in the downloaded HTML document; content produced by JavaScript or AJAX calls at runtime simply isn't there to scrape.
Solution: Render the page in a headless browser such as headless Chrome, which allows running Chrome in a server environment. Another alternative is to use Selenium for JavaScript-heavy pages.
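
As a small illustration (assuming Selenium and a matching ChromeDriver are installed), the page can be rendered in headless Chrome and the resulting HTML handed to a Scrapy selector. The /js/ URL is the JavaScript-rendered version of the quotes demo site used later in this post.

# Render a JavaScript-heavy page in headless Chrome, then parse the final HTML.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from scrapy import Selector

options = Options()
options.add_argument('--headless')              # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('http://quotes.toscrape.com/js/')    # content here is built by JavaScript
html = driver.page_source                       # HTML after the scripts have executed
driver.quit()

print(Selector(text=html).css('span.text::text').getall())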

Quality of Data:

Problem: Records that do not meet quality guidelines affect the overall integrity of the data. Ensuring data quality while crawling is hard because the checks need to run in real time, and faulty data can cause serious problems downstream.
Solution: Write test cases. Verify that whatever your spiders extract is correct and that they are not scraping malformed or wrongly structured data.
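
As a simple sketch, such a check can be a plain function plus a test; the field names below follow the quotes example used later in this post.

# A minimal data-quality check for a scraped item (sketch).
REQUIRED_FIELDS = ('text', 'author', 'tags')

def is_valid_quote(item):
    # Every required field must be present and non-empty, and tags must be a list.
    if any(not item.get(field) for field in REQUIRED_FIELDS):
        return False
    return isinstance(item['tags'], list)

def test_valid_quote():
    item = {'text': 'A quote', 'author': 'Albert Einstein', 'tags': ['change']}
    assert is_valid_quote(item)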

Captchas:

Problem: Captchas serve a great purpose in keeping spam away, but they also pose a real challenge for web crawling bots. When a captcha appears on a page you need to scrape, a basic web scraping setup will fail because it cannot get past that barrier.
Solution: For this, you need middleware that can take the captcha, solve it, and return the response.

Maintaining Deployment:

Problem: If you're scraping millions of websites, you can imagine the size of the codebase, and simply running and managing that many spiders becomes hard.
Solution: You can Dockerize your spiders and run them at regular intervals.


Scraping Guidelines/Best Practices:

Robots.txt file: Robots.txt is a text file webmasters create to instruct robots (typically search engine robots) how to crawl and index pages on their website, so it generally contains instructions for crawlers. Robots.txt should be the first thing you check when planning to scrape a website: most sites define rules there describing how bots/spiders should interact with them.
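
Python's standard library can check these rules before you crawl; here is a small sketch using urllib.robotparser against the demo site used later in this post. In Scrapy, the same behaviour can be enabled globally with the ROBOTSTXT_OBEY setting.

# Check whether a URL may be crawled according to the site's robots.txt.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://quotes.toscrape.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://quotes.toscrape.com/page/1/'))  # True if allowed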

Do not hit the servers too frequently: Web servers are not fail-proof. Any web server will slow down or crash if the load on it exceeds the limit it can handle. Sending too many requests too frequently can bring the website's server down or make the site too slow to load.
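
In Scrapy, this is usually handled with throttling settings in settings.py; the values below are illustrative examples, not recommendations for any particular site.

# settings.py (excerpt) - throttle requests so the target server is not overloaded.
DOWNLOAD_DELAY = 2                   # wait roughly 2 seconds between requests
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server response times
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # limit parallel requests per domain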

User Agent Rotation: The User-Agent header in a request identifies which browser is being used, its version, and the operating system. Every request made from a web browser contains a User-Agent header, and using the same User-Agent for every request makes a bot easy to detect. Rotating the User-Agent is the best solution for this.
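
A minimal sketch of such rotation as a Scrapy downloader middleware is shown below; the User-Agent strings are shortened examples, and the middleware would still need to be enabled in DOWNLOADER_MIDDLEWARES in settings.py.

import random

# Shortened example User-Agent strings; use full, realistic ones in practice.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36',
]

class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Pick a different User-Agent for each outgoing request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)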

Do not follow the same crawling pattern: Humans browsing a site rarely follow a fixed pattern, but programmed bots usually follow very specific logic, and sites with intelligent anti-crawling mechanisms can easily detect such spiders. Vary your crawl pattern, for example by adding random delays between requests.

Scrapy Vs. BeautifulSoup

In this section, you will get an overview of one of the most popular web scraping tools, BeautifulSoup, and a comparison with Scrapy, Python's most widely used scraping framework.

Functionality:

Scrapy: Scrapy is a complete package for downloading web pages, processing them, and saving the results to files and databases.
BeautifulSoup: BeautifulSoup is an HTML and XML parser and requires additional libraries such as requests or urllib2 to open URLs and store the results.

Learning Curve:

Scrapy: Scrapy is a powerhouse for web scraping and offers many ways to scrape a web page. It takes more time to learn and understand how Scrapy works, but once mastered, it makes building web crawlers easy, and you can run them with a single command.
BeautifulSoup: BeautifulSoup is relatively easy for programming newcomers to understand and gets smaller tasks done in no time.

Speed and Load:

Scrapy: Scrapy gets big jobs done very easily. It can crawl a group of URLs in no more than a minute, depending on the size of the group, and does so very smoothly.
BeautifulSoup: BeautifulSoup handles simple scraping jobs efficiently, but it is slower than Scrapy.

Extending Functionality:

Scrapy: Scrapy provides item pipelines that let you post-process the scraped data, for example validating it, dropping bad records, or saving it to a database (see the sketch after this comparison).
BeautifulSoup: BeautifulSoup is good for smaller jobs, but if you require customization such as proxies, cookie management, or data pipelines, Scrapy is the better option.
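
As a rough sketch of such a pipeline (field names follow the quotes example later in this post; the class would also need to be registered in ITEM_PIPELINES in settings.py):

# pipelines.py (sketch) - drop scraped quotes that are missing required fields.
from scrapy.exceptions import DropItem

class ValidateQuotePipeline:
    def process_item(self, item, spider):
        if not item.get('text') or not item.get('author'):
            raise DropItem('Missing text or author in %r' % item)
        return item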

For this blog, we are going to focus on the Scrapy framework, as it covers more use cases in real-world scraping problems.

Scrapy: Scrapy is a fast high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages.

Key Features of Scrapy are:

  1. Scrapy has built-in support for extracting data from HTML sources using XPath and CSS expressions.
  2. It is portable: written in Python, it runs on Linux, Windows, and macOS.
  3. It is easily extensible.
  4. It is faster than many other scraping libraries, especially for large crawls.
  5. It consumes less memory and CPU.
  6. It helps build robust and flexible applications with a rich set of features.
  7. It has excellent community support, although the official documentation is not very beginner-friendly.

I have done many web scraping projects at Excellence Technologies using the Scrapy framework.

Let's start with the Scrapy framework.

Before we start installing Scrapy, make sure you have Python and pip set up on your system.

Using Pip: Just run this simple command.

pip install Scrapy

We'll assume that Scrapy is now installed on your system. If you still get an error, you can follow the official installation guide.

To start, we will walk through these tasks:

  1. Creating a new Scrapy project.
  2. Writing a spider to crawl a site and extract data.

First, we will create a project using this command.

scrapy startproject tutorial

This will create a tutorial directory. Next, move into the tutorial/spiders directory and create a file named quotes_spider.py:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique name that identifies this spider within the project

    def start_requests(self):
        # Initial requests the spider will start crawling from.
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Called with the downloaded response of each request;
        # here it simply saves the page body to a local HTML file.
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name: identifies the spider. It must be unique within a project; you can't use the same name for two different spiders.
parse(): the method that will be called to handle the response downloaded for each of the requests made.
yield: works like return here, except that the spider can yield many requests (and later, items) one after another.

Now we will run our first spider. Go to the project's top-level directory and run this command with the spider's name:

scrapy crawl quotes

This command runs the spider named quotes that we just added. It sends requests to the listed URLs, and you will see log output for each request and response.

Now, check the files in the current directory. You should notice that two new files have been created: quotes-1.html and quotes-2.html, with the content for the respective URLs, as our parse method instructs.

This is the basic spider we discussed above. Now that we have a basic idea of how the Scrapy framework works, let's discuss some important fundamentals.

Extracting data: We can use CSS selectors and XPath selectors to extract data from webpages.
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell:

scrapy shell 'http://quotes.toscrape.com/page/1/'

Now we will see some examples of extracting data with a selector using Scrapy shell.

CSS selector: the syntax is response.css('<selector>')

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap XML/HTML elements and allow you to run further queries to refine the selection or extract the data.

To extract the text from the title above, you can use this:

>>> response.css('title::text').getall()
['Quotes to Scrape']

There are two things to note here. First, we've added ::text to the CSS query, meaning we want to select only the text directly inside the <title> element. If we don't specify ::text, we get the full title element, including its tags:

>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']

The other thing is that the result of calling .getall() is a list: a selector may return more than one result, so we extract them all. When you know you just want the first result, as in this case, you can use this:

>>> response.css('title::text').get()
'Quotes to Scrape'

Also, we can use this alternative.

>>> response.css('title::text')[0].get()
'Quotes to Scrape'

XPath: We can also extract data from webpages using XPath.

Now we will do the same thing using XPath that we have already done with CSS selectors. We are assuming that our website's HTML code is similar to the code below.

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

So, by looking at the HTML code of that page, let’s construct an XPath for selecting the text inside the title tag:

>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]

We got the title with this XPath, but that is the selector, not the plain text. Let's extract the proper text from the title tag using the same .get() and .getall() methods we used with the CSS selector.

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'

As you can see, we got only the text from the title tag using the XPath selector.
Note that the .xpath() and .css() methods always return a SelectorList instance, which is a list of new selectors.

If you want to extract only the first matched element, you can call .get() on the selector:

>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '

We can also check whether a tag has no data, or provide a default value when no data is there for a given selector.

>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True
>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'

A default return value can be provided as an argument, to be used instead of None.

Now we're going to get the base URL from the example HTML code, in three equivalent ways:

>>> response.xpath('//base/@href').get()
'http://example.com/'

>>> response.css('base::attr(href)').get()
'http://example.com/'

>>> response.css('base').attrib['href']
'http://example.com/'

We can see that we got the link using all three methods, so we can use either CSS or XPath selectors to extract data from tags.

Besides CSS, Scrapy selectors also support XPath expressions. XPath expressions are very powerful and are the foundation of Scrapy selectors; in fact, CSS selectors are converted to XPath under the hood.

Now, let's extract the text, author, and tags from 'http://quotes.toscrape.com'. First, create a quote object in the shell by selecting the first quote element, then query it:

>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'

Given that the tags are a list of strings, we can use the .getall() method to get them all:

>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Having figured out how to extract each bit, we can now iterate over all the quote elements and put them together into Python dictionaries, as shown below.
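
In the shell, that iteration looks roughly like this, reusing the same selectors as above; each pass through the loop builds one dictionary per quote.

>>> for quote in response.css("div.quote"):
...     item = {
...         'text': quote.css("span.text::text").get(),
...         'author': quote.css("small.author::text").get(),
...         'tags': quote.css("div.tags a.tag::text").getall(),
...     }
...     print(item)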

Storing the scraped data:- We can use a simple command to store the data in JSON format.

scrapy crawl quotes -o quotes.json

That will generate a quotes.json file containing all scraped items, serialized in JSON.
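
For larger crawls, the JSON Lines format (one JSON object per line) is often more convenient, since the output can be processed record by record; Scrapy picks the format from the file extension.

scrapy crawl quotes -o quotes.jl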

Following links: We can scrape links from webpages using ::attr().

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gets the anchor element, but we want the href attribute. For this purpose, Scrapy supports a CSS extension that lets you select the attribute contents, like this:

>>> response.css('li.next a::attr(href)').get()
'/page/2/'

There is also an attrib property available:

>>> response.css('li.next a').attrib['href']
'/page/2/'

Now let's see our spider modified to recursively follow the link to the next page, extracting data from each page.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

After extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using urljoin() (since the link can be relative), and yields a new request to the next page.

Request and Response Follow: As a shortcut for creating Request objects, you can use response.follow (and, in newer Scrapy versions, response.follow_all).

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

For <a> elements there is a further shortcut: response.follow uses their href attribute automatically, so the code can be shortened:

for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

There are many more things to explore in the Scrapy framework; this is only a basic idea of how Scrapy works and how it compares to the alternatives.

We can also say that Scrapy is a crawler, while Beautiful Soup is a parsing library.

You could say Beautiful Soup has a narrower scope than Scrapy. In other words, with Beautiful Soup you provide a specific URL, and it helps you get the data from that page. With Scrapy, you can give a start URL and it will go on crawling and extracting data without your having to provide every single URL explicitly.

That is a basic explanation of what web scraping is, along with information about some of its libraries and frameworks.
