
How to build a crawler for edu-sharing (alternative method)

This chapter of our GitHub Wiki covers the topic of building a crawler that works with edu-sharing. If you haven't done so yet, please take a look at the fantastic Scrapy Tutorial first to get your bearings on how to use this Python framework for web-scraping. Within the Scrapy-framework each individual crawler is called a scrapy.Spider (with CrawlSpider being a commonly used subclass), which is why we're using these terms interchangeably in our Wiki.


Many roads lead to Rome, and this saying holds true for building web-crawlers as well: in converter/spiders you'll find a Python file called sample_spider.py that acts as a rough starting point to help new users build their own spider implementations on top of our LomBase-class (see: LOM - Learning Object Metadata). Since the sample_spider makes heavy use of (multi-)inheritance and method overriding, we thought it would be a good idea to also present an alternative approach that might be easier to read and understand for new Scrapy users. Please take a look at our sample_spider_alternative.py.

Before we get started, please keep the Scrapy documentation on ItemLoaders handy. To grasp what's happening with the data that our crawler scrapes from websites, the items.py file in the converter/-folder gives a good overview of which Item- and ItemLoader-classes we're working with.

ItemLoaders are the conveyor-belts for our data-model

Our Scrapy web-crawler needs to fulfill two conditions to be considered a success:

  1. it should navigate through a website and gather the desired meta-data from a source
  2. it should prepare a BaseItem within its parse()-method that holds all the relevant meta-data for one specific URI/URL (for further processing in our ETL-procedure) and yield it
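
To make these two conditions more concrete, here's a minimal, hypothetical sketch of such a spider. The start URL, the XPath expression and the parse_material()-method are made up for illustration; the BaseItemLoader-class is assumed to be importable from converter/items.py (please check that file and sample_spider_alternative.py for the exact names used in this repository):

import scrapy

from converter.items import BaseItemLoader  # assumption: the loader classes live in converter/items.py


class MySampleSpider(scrapy.Spider):
    # hypothetical spider name and start URL, for illustration only
    name = "my_sample_spider"
    start_urls = ["https://example.org/materials/overview"]

    def parse(self, response):
        # condition 1: navigate through the website and find the individual materials
        for link in response.xpath('//a[@class="material-link"]/@href').getall():
            yield response.follow(link, callback=self.parse_material)

    def parse_material(self, response):
        # condition 2: gather the meta-data of one specific URL into a BaseItem and yield it
        base = BaseItemLoader()
        base.add_value("sourceId", response.url)
        base.add_value("hash", response.url)  # placeholder; see the discussion of sourceId and hash below
        # ... fill the remaining (nested) fields here ...
        yield base.load_item()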

The BaseItem-class is our building-block

We're taking a peek at the BaseItem-class definition in converter/items.py:

class BaseItem(Item):
    sourceId = Field()
    uuid = Field()
    "explicit uuid of the target element, please only set this if you actually know the uuid of the internal document"
    hash = Field()
    collection = Field(output_processor=JoinMultivalues())
    "id of collections this entry should be placed into"
    type = Field()
    origin = Field()
    "in case it was fetched from a referatorium, the real origin name may be included here"
    response = Field(serializer=ResponseItem)
    ranking = Field()
    fulltext = Field()
    thumbnail = Field()
    lastModified = Field()
    lom = Field(serializer=LomBaseItem)
    valuespaces = Field(serializer=ValuespaceItem)
    permissions = Field(serializer=PermissionItem)
    "permissions (access rights) for this entry"
    license = Field(serializer=LicenseItem)
    publisher = Field()
    # editorial notes
    notes = Field()
    binary = Field()

Some of these fields are mandatory while others are optional (see: required metadata).

If we build a crawler by inheriting from both CrawlSpider and LomBase, we would normally have to override the getId()- and getHash()-methods with our own implementations. Since we're building our BaseItem "by hand", our crawler will have to provide values for sourceId and hash itself.
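
Here is a small, hypothetical sketch of how these two values could be provided by hand; the concrete choices (using the URL as sourceId and hashing a last-modified date) are just one plausible convention, not a fixed rule of our data-model:

import hashlib

from converter.items import BaseItemLoader  # assumption: see converter/items.py for the exact class name

base = BaseItemLoader()
# sourceId: a stable identifier for the source document, e.g. its URL or an internal ID
base.add_value("sourceId", "https://example.org/materials/42")
# hash: should change whenever the source document changes, so that updates can be detected later on
last_modified = "2021-10-22T12:00:00Z"  # hypothetical value scraped from the page
base.add_value("hash", hashlib.sha1(last_modified.encode("utf-8")).hexdigest())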

You might have noticed that some fields make use of a serializer:

  • LomBaseItem
  • LicenseItem
  • ValuespaceItem
  • PermissionItem
  • ResponseItem

Of course we would like to fill as many of these items as possible with (relevant) meta-data, but the most important field to fill is lom (a LomBaseItem) because it holds several more categories within its structure, as can be seen in items.py (a small example of one of these categories follows after the list):

  • LomGeneralItem
  • LomEducationalItem
  • LomClassificationItem
  • LomLifecycleItem
  • LomTechnicalItem
  • LomAgeRangeItem
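
As a small example of one of these categories: the general part typically carries the title, description and keywords of a learning object. The class and field names below are assumptions based on converter/items.py (please double-check the exact spellings there), and the values are made up:

from converter.items import LomGeneralItemloader  # assumption: exact class name as defined in converter/items.py

general = LomGeneralItemloader()
# hypothetical values, as they might be scraped from a learning object's web page
general.add_value("title", "Introduction to Photosynthesis")
general.add_value("description", "A short, interactive lesson about photosynthesis.")
general.add_value("keyword", ["biology", "photosynthesis"])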

How do these items fit into each other?

Mermaid Diagram of BaseItem Hierarchy

As you can see in the (simplified) diagram, our desired end result, the BaseItem, is built by nesting items within each other. The sample_spider_alternative.py should help you to understand how nesting these items works with the specific ItemLoader-classes. The basic gist is:

  1. Create an ItemLoader object
  2. Populate its fields by using its .add_value(key, value)- or .replace_value(key, value)-methods
  3. Once you're done, create the scrapy.Item itself by calling the .load_item()-method
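
Putting these three steps together, the nesting could look roughly like the following sketch. It assumes that the loader classes are named BaseItemLoader, LomBaseItemloader and LomGeneralItemloader (as in converter/items.py), and every value shown is purely hypothetical:

from converter.items import BaseItemLoader, LomBaseItemloader, LomGeneralItemloader

# 1) create the ItemLoader objects
base = BaseItemLoader()
lom = LomBaseItemloader()
general = LomGeneralItemloader()

# 2) populate their fields
base.add_value("sourceId", "https://example.org/materials/42")  # hypothetical
base.add_value("hash", "2021-10-22T12:00:00Z")                  # hypothetical change marker
general.add_value("title", "Introduction to Photosynthesis")

# 3) load the inner items first, then nest them into the outer loaders
lom.add_value("general", general.load_item())
base.add_value("lom", lom.load_item())

# inside a spider's parse()-method, the finished item would then be yielded:
# yield base.load_item()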

(last update: 2021-10-22)