Scraping Images using Scrapy

Jul 30, 2023

Imagine you’re an avid collector of online images (it may not be a thing, but bear with me). One fine evening, you stumble across a beautiful website containing all the images you’d dream of having in your collection. Clicking “save as” on every single one of them isn’t exactly efficient, is it? Well, here’s the automated alternative.

Web scraping is a method of parsing a website’s HTML code to extract the content you need. The concept has plenty of applications in our everyday lives. One very well-known example is search engines: they use scrapers/crawlers that go through the billions of pages out there to find updated content on different webpages and provide you with better search results.

There are a bunch of different libraries out there for web scraping, and each has its own pros and cons. However, I personally recommend using Scrapy. It’s very sophisticated and fast. Although it has a rather steep learning curve, once you understand what’s happening behind the scenes, you will be equipped to build efficient, fast and powerful web scrapers.

Curious already?

Installation

To install Scrapy, all you need is Python installed.

If you don’t have Python, head over to its official website here and get it installed (make sure to get Python 3).

Now open up your command prompt or bash and run,

pip3 install scrapy
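If you want to confirm the install worked, you can check the version (the exact output will depend on your setup):

scrapy version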

You could take a quick peek at Scrapy’s official website and documentation here.

We also need another package, Pillow, which Scrapy’s images pipeline uses to process the downloaded images. You can run,

pip3 install pillow

Creating a Scrapy Project

Hop onto your terminal or command-prompt and type,

scrapy startproject example

Once you run it, you’ll see something like this:

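Roughly, and depending on your Scrapy version and install paths, the output looks like this:

New Scrapy project 'example', using template directory '...', created in:
    /path/to/example

You can start your first spider with:
    cd example
    scrapy genspider example example.com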

As the prompt says, go ahead and initiate a spider. This is done by,

cd example
scrapy genspider image_scraper www.example.com

We first switch into the directory where the project lives and create a new spider using the genspider command.

I know you’re wondering why you’re made to go through so many cumbersome steps to create a simple project. You’re just downloading a set of images after all. This project may be simple; however, Scrapy was designed to handle the complexity of advanced projects, like having multiple spiders/crawlers scraping in parallel.

Now that you have successfully created a spider, you can go ahead and open the folder in your preferred text editor or IDE.

If you use Visual Studio Code, you can just run:

code .

Configurations

Assuming you’ve opened the project in a text editor, here is what your project directory looks like.

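Roughly, a fresh project contains the following (file names may vary slightly between Scrapy versions):

example/
├── scrapy.cfg              # deploy configuration
└── example/                # the project’s Python module
    ├── __init__.py
    ├── items.py            # Item definitions
    ├── middlewares.py
    ├── pipelines.py        # item pipelines
    ├── settings.py         # project settings
    └── spiders/
        ├── __init__.py
        └── image_scraper.py   # the spider we just generated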

short detour

Before jumping into the code: if you haven’t installed a Python extension for your editor, I’d suggest you go ahead and do so.


The extension comes with Python IntelliSense, linting, formatting and more.

Now, coming back to our problem at hand

Go ahead and open up the spiders subdirectory and click on image_scraper.py. It should look like,

# /spiders/image_scraper.py
import scrapy

class ImageScraperSpider(scrapy.Spider):
    name = 'image_scraper'
    allowed_domains = ['www.example.com']
    start_urls = ['http://www.example.com']

    def parse(self, response):
        pass

In this article, I will be scraping quotefancy.com, but you’re free to choose any website and modify the code to adapt to it.

Now, change the allowed_domains and start_urls variables as follows,

# /spiders/image_scraper.py
import scrapy

class ImageScraperSpider(scrapy.Spider):
    name = 'image_scraper'
    allowed_domains = ['quotefancy.com']
    start_urls = ['https://quotefancy.com/motivational-quotes']

    def parse(self, response):
        pass

The next step is to create an Item. Items are the containers used to collect the data that has been scraped. To define Items, we need to edit the items.py file under the example (the project name) directory. This is how it looks:

# /items.py

import scrapy

class ExampleItem(scrapy.Item):
    pass

Just replace that class with this,

class ImageScraperItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

Let me explain what I’ve done here. I have created a custom class called ImageScraperItem that has 2 fields: image_urls to hold the URLs and images to hold information about the downloaded images. These particular field names matter, because they are the defaults that Scrapy’s ImagesPipeline looks for.

From the official documentation,

Field objects are used to specify metadata for each field. The main goal of Field objects is to provide a way to define all field metadata in one place.
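As a purely illustrative example (not needed for this project), a field can also carry metadata such as a serializer:

# Hypothetical field with metadata, for illustration only
last_updated = scrapy.Field(serializer=str)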

Since our project involves downloading images, there are a few settings we must add to the settings.py file.

Go on and open it up,

BOT_NAME = 'example'

SPIDER_MODULES = ['example.spiders']

NEWSPIDER_MODULE = 'example.spiders'

#USER_AGENT = 'example (+http://www.yourdomain.com)'

ROBOTSTXT_OBEY = True

It might contain a lot of comments. We can ignore that for now.

We just need to add a couple of lines of code.

ITEM_PIPELINES = {'example.pipelines.ExamplePipeline': 1}

IMAGES_STORE = 'downloads'

The IMAGES_STORE setting tells the scraper where to save the downloaded images. If you specify a full path, they will be downloaded there. If you simply specify a name, as in our case, they will be downloaded into a folder with that name in the current working directory; if the folder doesn’t exist, it will be created.
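If you’d rather keep the images somewhere specific, you could point IMAGES_STORE at an absolute path instead (the path below is just an illustration):

# Hypothetical absolute path; adjust it to your own machine
IMAGES_STORE = '/home/you/Pictures/scraped_quotes'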

The ITEM_PIPELINES setting specifies where the pipeline lives. In our case, ExamplePipeline lies in the pipelines.py file in the example directory, hence example.pipelines.ExamplePipeline. The number assigned to it (1 here) is the order in which pipelines run when you have more than one; lower values run first.
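To illustrate the ordering, a project with a second (hypothetical) pipeline could look like this:

# SomeOtherPipeline is hypothetical; lower numbers run first
ITEM_PIPELINES = {
    'example.pipelines.ExamplePipeline': 1,
    'example.pipelines.SomeOtherPipeline': 300,
}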

Open up the pipelines.py file and copy the below code into it,

# /pipelines.py
from scrapy.pipelines.images import ImagesPipeline

class ExamplePipeline(ImagesPipeline):

    # Name each downloaded file after the last segment of its URL
    # instead of the default hashed file name
    def file_path(self, request, response=None, info=None, *, item=None):
        return request.url.split('/')[-1]
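To make that concrete: by default, the ImagesPipeline saves files under hashed names (roughly full/<sha1-of-url>.jpg), while this override keeps the original file name from the URL. With a made-up URL for illustration:

# Hypothetical URL, purely for illustration
# request.url = 'https://quotefancy.com/media/some-motivational-quote.jpg'
# request.url.split('/')[-1]  ->  'some-motivational-quote.jpg'
# so the image ends up at downloads/some-motivational-quote.jpg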

From the official documentation,

The Images Pipeline has a few extra functions for processing images:

  • Convert all downloaded images to a common format (JPG) and mode (RGB)
  • Thumbnail generation
  • Check images width/height to make sure they meet a minimum constraint
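If you want to experiment with those extras, they are controlled through settings such as the following (optional; the values here are illustrative and not needed for this project):

# Optional ImagesPipeline settings (illustrative values)
IMAGES_THUMBS = {'small': (50, 50), 'big': (270, 270)}
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110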

Ready, Set, Go

We’re all set now. We just need to open up the image_scraper.py file and start working.

The code that scrapes content has to be in the parse function. You might see one already defined in the image_scraper.py file,

def parse(self, response):
    pass

We can remove the pass keyword and start working.

finally!

I’ll give you the whole code here and then explain it by breaking it down into smaller parts.

# /spiders/image_scraper.py
import scrapy

from example.items import ImageScraperItem


class ImageScraperSpider(scrapy.Spider):
    name = 'image_scraper'
    allowed_domains = ['quotefancy.com']
    start_urls = ['https://quotefancy.com/motivational-quotes']
    base_link = 'https://quotefancy.com/motivational-quotes'
    max_pages = 1

    def parse(self, response):
        obj = ImageScraperItem()

        if response.status == 200:
            # The src attribute only holds the first image's URL
            rel_img_urls = response.css('img').xpath('@src').getall()

            # The data-original attribute holds all the other images
            rel_secondary_urls = response.css('img').xpath('@data-original').getall()
            rel_img_urls.extend(rel_secondary_urls)

            # Finding the number of pages
            number_of_pages = response.xpath('//a[@class="loadmore page-number"]/text()').getall()

            obj['image_urls'] = self.url_join(rel_img_urls, response)
            yield obj

            # If number_of_pages has length 1, there is only one extra page
            if len(number_of_pages) == 1:
                self.max_pages = number_of_pages[0]
            else:
                # finding the max
                number_of_pages = [int(x) for x in number_of_pages]
                self.max_pages = str(max(number_of_pages))

        # updating link
        next_page = self.base_link + '/page/' + str(self.max_pages)

        # callback for the next page
        yield scrapy.Request(next_page, callback=self.parse)

    # converting relative to absolute URLs
    def url_join(self, rel_img_urls, response):
        urls = [response.urljoin(x) for x in rel_img_urls]
        return urls

NOTE: To avoid confusion and pointless indentation errors, do not copy the code that’s broken down into parts below. It’s there purely for explanatory purposes.

Firstly, we’re creating an Item object. Then we check whether response.status is equal to 200 (more on status codes here). If it isn’t, it means there was some error retrieving the website’s HTML code, without which we cannot proceed.

#This query only returns the first image
rel_img_urls = response.css('img').xpath('@src').getall()

#This returns all other images
rel_secondary_urls = response.css('img').xpath('@data-original').getall()

rel_img_urls.extend(rel_secondary_urls)

This bit of code queries out all the links in the src attribute of img tags.

In our problem, using the src attribute only returns the first image’s URL. To solve this, we also take another attribute, data-original, which holds the links of all the other images. The getall() function returns all the results found on the page, whereas the get() function returns only the first result it encounters.
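As a toy illustration of the difference (the values are made up):

# response.css('img').xpath('@data-original').get()
#   -> 'image-1.jpg'                                   (first match only)
# response.css('img').xpath('@data-original').getall()
#   -> ['image-1.jpg', 'image-2.jpg', 'image-3.jpg']   (every match)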

obj['image_urls'] = self.url_join(rel_img_urls, response)

def url_join(self, rel_img_urls, response):
    urls = [response.urljoin(x) for x in rel_img_urls]
    return urls

The above code is to convert all the links from relative to absolute URLs. This is done by the urljoin() method of the response object.

An absolute URL contains all the information necessary to locate a resource. A relative URL locates a resource relative to an absolute (base) URL. In effect, the complete URL of the target is obtained by resolving the relative URL against the base URL.
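For example, with our start URL as the base (the image path below is made up):

# Hypothetical relative URL, for illustration only
# response.urljoin('/media/wallpaper/some-quote.jpg')
# -> 'https://quotefancy.com/media/wallpaper/some-quote.jpg'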

After this step, we can simply do,

yield obj

This will begin to download all the images into the directory specified.
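Once the pipeline has processed the item, it also fills in the images field with details about each download, roughly like this (the values are made up):

# [{'url': 'https://quotefancy.com/media/some-quote.jpg',
#   'path': 'some-quote.jpg',
#   'checksum': '3a1b2c...'}]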

Following this, we need to check whether there are more pages and, if there are, update the link so we can scrape the next page.

number_of_pages = response.xpath('//a[@class="loadmore page-number"]/text()').getall()

obj['image_urls'] = self.url_join(rel_img_urls, response)

yield obj

#If the number_of_pages length is 1, then it means that there is only one page extra
if len(number_of_pages) == 1:
    self.max_pages = number_of_pages[0]
else:
    #finding the max
    number_of_pages = [int(x) for x in number_of_pages]
    self.max_pages = str(max(number_of_pages))

# updating link
next_page = self.base_link + '/page/' + str(self.max_pages)

# callback for the next page
yield scrapy.Request(next_page, callback=self.parse)

The first line queries out the number of sub-pages. In our case, there is only one extra page.

We check whether the number of pages is equal to 1. If yes, we only have one extra page and can go ahead and update the link. Otherwise, we find the maximum value in the list and use that to update the link.

Now we yield a new request for the next page, with the parse method as its callback.

The code above might not work with other websites as-is; you would have to tweak the selectors a bit to make it work properly.

That’s something for you to think about and work on!

That’s it!

We’re done with the coding part!

Run!

Hop on to the terminal, navigate to the directory of the project and run

scrapy crawl image_scraper

If you have a good internet connection, it should take about 10 seconds for all the images to download.


I’m flattered that you ended up reading till here.

Conclusion

Hopefully this guide showed you exactly how to get started with Scrapy. You can find the full source code here.

References

Wikipedia, Scrapy.org

PS: This was my first story ever on medium — go easy on me :)