
Implementing Web Scraping In Python Using Scrapy

  • By Allan Watts
  • May 7, 2019

 

What Is Scrapy

Scrapy is an application framework for crawling websites and extracting structured data from them. In this post we explore Scrapy by implementing web scraping in Python in a small project.
This blog covers the following topics:

  1. How To Install Scrapy
  2. Create A Scrapy Project
  3. Export Scraped Data As CSV

At the time of writing, Scrapy runs on Python 2.7 and on Python 3.4 or above. If you’re using Anaconda, you can install the package from the conda-forge channel, which provides packages for Linux, Windows and OS X.

How To Install Scrapy:

You can install Scrapy using conda or, if you’re familiar with installing Python packages, you can install Scrapy and its dependencies from PyPI with pip.

Install Scrapy Using Anaconda

conda install -c conda-forge scrapy

 

Install Scrapy Using PyPI

pip install Scrapy

 

Install Scrapy On Ubuntu 14.04 Or Above

If you install Scrapy on Ubuntu 14.04 or above, you first need to install these dependencies:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

 

Install Scrapy On Python 3

If you want to install Scrapy on Python 3, you’ll also need Python 3 development headers:

sudo apt-get install python3 python3-dev

 
Inside a virtualenv, you can install Scrapy with pip:

pip install scrapy
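
To confirm that the installation worked, a small check like the one below can help. This is only a sketch: it assumes Python 3.8 or later (for importlib.metadata), and installed_version is a hypothetical helper name, not part of Scrapy.

```python
import importlib.util
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it cannot be imported."""
    if importlib.util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # Importable, but without distribution metadata (e.g. a standard library module).
        return "unknown"


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not installed in this environment")
    else:
        print("Scrapy version:", version)
```

If the script reports that Scrapy is not installed, check that the virtualenv you installed into is the one that is currently active.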

 

Create A Scrapy Project

Before we start scraping, we need to create a Scrapy project. Switch to the directory where you want the project to live and run:

scrapy startproject project_name

 
This will create the following directory structure:

project_name/
    scrapy.cfg            # deploy configuration file
    project_name/         # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

 
The two most important entries to consider are:
settings.py – This file holds all the settings for your project.
spiders/ – This folder stores all the custom spiders used in the project.
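
To give an idea of what settings.py holds, here is a short illustrative excerpt. The setting names are real Scrapy settings, but the values are assumptions made for this sketch (project_name is the placeholder used above, and the delay is arbitrary); the file generated for your project will differ.

```python
# settings.py (excerpt): illustrative values, not the exact generated file
BOT_NAME = "project_name"

# Where Scrapy looks for spiders, and where new spiders are created
SPIDER_MODULES = ["project_name.spiders"]
NEWSPIDER_MODULE = "project_name.spiders"

# Respect the rules in each site's robots.txt
ROBOTSTXT_OBEY = True

# Seconds to wait between requests to the same site (assumed value)
DOWNLOAD_DELAY = 1
```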


 

Create A Scrapy Spider :

Spiders are classes that you define and that Scrapy uses to scrape information from a website (or a group of websites).
Here’s the code for a spider that scrapes famous quotes from the website http://quotes.toscrape.com, following the pagination. Save it in a Python file inside the spiders/ directory of your project:

import scrapy


class QuotesSpider(scrapy.Spider):
    # Unique name used to run the spider with "scrapy crawl quotes"
    name = 'quotes'
    # The crawl starts from these URLs
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Yield one dict per quote found on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.xpath('span/small/text()').get(),
            }

        # Follow the "next" pagination link, if there is one
        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

 
The spider subclasses scrapy.Spider and defines some attributes and methods:
name: identifies the spider. It must be unique within the project, so two spiders cannot share the same name.
start_requests(): must return an iterable of requests for the spider to begin crawling from; further requests are generated successively from these initial ones. The spider above sets start_urls instead, a shortcut from which Scrapy’s default implementation builds the initial requests.
parse(): called to handle the response downloaded for each request. Its response parameter is an instance of TextResponse that holds the page content.
The parse() method usually also parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests (Request) from them.

How To Run Spider From Scrapy

To make your spider work, go to the project’s top level directory and run:

scrapy crawl quotes

 
This command runs the spider named quotes and produces output similar to the following:

... (omitted for brevity)
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)

 


 

Export Scraped Data As CSV :

We can read the scraped data on the command line, but it is usually better to export it to a file in a format such as CSV, JSON, or XML. This saves a lot of time, and the exported file can be imported into other programs, for example a spreadsheet application such as Excel. To make this easier, Scrapy provides a built-in feature called feed exports, which lets you export the scraped items in various formats.
To do that, just add the following lines to the settings.py file:

#Export as CSV Feed
FEED_FORMAT = "csv"
FEED_URI = "your csv name.csv"
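
On Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favour of a single FEEDS setting. A roughly equivalent configuration might look like the sketch below, where quotes.csv is a hypothetical output file name.

```python
# settings.py (excerpt) for Scrapy 2.1 or later
# Maps each output file to its export options; "quotes.csv" is an assumed name
FEEDS = {
    "quotes.csv": {"format": "csv"},
}
```

For a one-off export you can also skip the settings change and pass the output file on the command line, for example scrapy crawl quotes -o quotes.csv.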

 
That’s all! The next time you run the spider, the scraped data is exported as CSV. We now know how to implement web scraping using Scrapy.
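
Once the CSV file exists, it can be read back in any program. The sketch below uses only Python's standard csv module. It assumes the file is named quotes.csv and has the text and author columns produced by the quotes spider; load_quotes is a hypothetical helper name.

```python
import csv


def load_quotes(csv_path):
    """Read the exported feed and return one dict per scraped quote."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # "quotes.csv" is an assumed file name; use the one from your settings
    for row in load_quotes("quotes.csv"):
        print(row["author"], "-", row["text"])
```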
