Adding type information to exported Scrapy items

By default, Scrapy won’t include any type information when using feed exports to serialize scraped items. As a result, when exporting multiple types of items at once, we can’t later tell apart the different concepts the items represent. Consider the following module:

import scrapy

class AnimalItem(scrapy.Item):
    name = scrapy.Field()

class CatItem(AnimalItem):
    pass

class DogItem(AnimalItem):
    pass
In the above example, the application apparently needs to discern between Cats and Dogs. Otherwise, sub-classing AnimalItem wouldn’t make a lot of sense, since neither CatItem nor DogItem explicitly adds anything to the base class. When exporting these items to, say, a .jsonl feed, you’d get something along these lines:

# cats'n'dogs.jsonl
{"name": "Garfield"}
{"name": "Lassie"}
{"name": "Flipper"}

Besides the apparent problem that somehow we managed to scrape not only cats and dogs but at least one dolphin as well, we have lost the ability to easily make a distinction between different kinds or types of animals.

There are multiple places in Scrapy’s architecture where you could tackle this problem. For example, you could write a custom item pipeline that checks the type of each processed item and adds a corresponding _type field to it. Another solution (which I happen to find more elegant) is to create such a field inside the AnimalItem class and have it automatically populated for each of its subclasses:

class AnimalItem(scrapy.Item):
    name = scrapy.Field()
    _type = scrapy.Field()

    def __init__(self, *args, **kwargs):
        # Derive the type name from the class name: 'CatItem' → 'cat'
        kwargs['_type'] = self.__class__.__name__.replace('Item', '').lower()
        super().__init__(*args, **kwargs)
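If you want to play with the naming logic in isolation, the same base-class trick can be sketched with a plain dict standing in for scrapy.Item (this is just an illustration — the real classes subclass scrapy.Item as shown above):

```python
class AnimalItem(dict):
    """Plain-dict stand-in for scrapy.Item, used only to demonstrate
    how the _type value is derived from the subclass name."""

    def __init__(self, *args, **kwargs):
        # 'CatItem' -> 'cat', 'DogItem' -> 'dog', etc.
        kwargs['_type'] = type(self).__name__.replace('Item', '').lower()
        super().__init__(*args, **kwargs)

class CatItem(AnimalItem):
    pass

class DogItem(AnimalItem):
    pass

print(CatItem(name='Garfield'))  # {'name': 'Garfield', '_type': 'cat'}
```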

There you go. From now on, whenever we need to export our animals, it will be easy to figure out what kind of animal we are dealing with:

# cats'n'dogs'n'dolphins.jsonl
{"name": "Garfield", "_type": "cat"}
{"name": "Lassie", "_type": "dog"}
{"name": "Flipper", "_type": "dolphin"}
