
Spiders (Crawler)


A Spider is a Python class that runs and crawls data from your source. Crawling is usually done by iterating over the contents of a sitemap or an API result that provides a list of identifiers or entries.

Before you start writing a new spider from scratch, please check which spiders we already offer.

Please note that when overriding any of these base classes, you can of course still provide individual processing for specific attributes or data.

Quick start

All spiders are located in etl/converter/spiders. You may copy our sample_spider.py file to get started.

Make sure to change the class name and filename, as well as the name attribute inside the file.

You can then run your spider by calling scrapy crawl chosen_name_spider inside the etl folder.

Please refer to the README for the basic setup instructions and make sure that you have started all the necessary docker containers first.

Required information

You need to create some class properties so that the crawler is recognized correctly inside the search. Please also check out the SampleSpider class.

name: The internal name of your spider (how it is called and identified). Please postfix it with _spider.

friendlyName: The readable / real name of your source. This is how it will be displayed in the Frontend.

url: The base URL of your source. This will be linked in the Frontend when someone clicks on the source name.

version: The version of your spider. Please include this variable when overriding getHash(). This makes sure that as soon as the version changes, all data will be re-crawled to adapt to changes in your spider.

View SampleSpider class
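
Put together, a minimal sketch of these properties could look like this (the base class and all values here are placeholders, not taken from the project):

from scrapy.spiders import Spider

class ExampleSpider(Spider):
    name = "example_spider"          # internal name, postfixed with _spider
    friendlyName = "Example Source"  # shown in the Frontend
    url = "https://example.org"      # linked in the Frontend
    version = "0.1.0"                # bump to trigger a full re-crawl via getHash()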

Integrated spiders in this project

OAI-PMH (LOM)

If your system provides an OAI API, you can use our OAIBase class for it. You simply override baseUrl, metadataPrefix and set.

Learn more about OAI-PMH

View OAIBase class
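
As a rough sketch, a subclass could look like this; the import path and all values are assumptions, and only baseUrl, metadataPrefix and set are the attributes named above:

from converter.spiders.oai_base import OAIBase  # import path is an assumption

class MyOAISpider(OAIBase):
    name = "my_oai_spider"
    friendlyName = "My OAI Source"
    url = "https://repository.example.org"
    version = "0.1.0"
    # the three attributes to override, with placeholder values:
    baseUrl = "https://repository.example.org/oai/provider"
    metadataPrefix = "lom"
    set = "default"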

LRMI

If your site embeds LRMI data in its pages and provides a sitemap, you can make use of our LRMIBase class to get started.

Learn more about LRMI

View LRMIBase class
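
For context, LRMI metadata is usually embedded in the page markup as schema.org JSON-LD. The LRMIBase class is meant to take care of this for you; purely as an illustration, such data could be read with plain Scrapy selectors like this (a sketch, not the LRMIBase implementation):

import json

def extract_lrmi(response):
    # LRMI metadata typically lives in JSON-LD script tags
    for script in response.xpath('//script[@type="application/ld+json"]/text()').getall():
        try:
            yield json.loads(script)
        except json.JSONDecodeError:
            continue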

RSS

If you provide data via RSS, we offer two approaches:

If you only have a single RSS feed, we recommend using the RSSBase class. It provides basic handling for the main RSS metadata. Additional metadata should be added by your class (either statically or by crawling additional data from your webpage or a separate endpoint); see the sketch below.

If your source consists of multiple RSS feeds, take a look at the RSSListBase class. You have to provide a CSV file with all your RSS URLs and can also include several default metadata values for each feed, as described in the class file.

View RSSBase class

View RSSListBase class
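
As a hedged sketch, a single-feed spider based on RSSBase might look like this; the import path and the assumption that the feed URL goes into start_urls are ours, so check the class file for the exact hooks:

from converter.spiders.rss_base import RSSBase  # import path is an assumption

class MyFeedSpider(RSSBase):
    name = "my_feed_spider"
    friendlyName = "My Feed"
    url = "https://example.org"
    version = "0.1.0"
    start_urls = ["https://example.org/feed.xml"]  # the RSS feed to crawl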

Spiders provided by Scrapy

If none of the above spiders matches your API, there's no need to worry! Since we use Scrapy as our framework, you can use any of its included base spiders to collect your data. You'll find the list here.

We strongly recommend that you also inherit from the LomBase class so you already have all basic callbacks for the individual metadata groups. Furthermore, if your source makes use of JSON, you may also find the JSONBase class helpful. A sample override could look like this:

from scrapy.spiders import CrawlSpider
from converter.spiders.lom_base import LomBase  # adjust the import path to your project layout

class MySpider(CrawlSpider, LomBase):
    name = 'my_spider'
    start_urls = ['https://edu-sharing.com']
    # more data comes here

    def parse(self, response):
        # hand the response over to the LomBase metadata callbacks
        return LomBase.parse(self, response)

For more details, please take a look at the sample_spider class.

View SampleSpider class

Import arguments

To make debugging easier, the spiders accept several arguments for specific behaviour.

Import a specific uuid

scrapy crawl <spider_name> -a uuid=<specific_uuid>

For this to work, your spider must correctly call hasChanged and return None if it returns false. Call this method as soon as possible. If your spider does not support the hasChanged method, you cannot use this feature.
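
A hedged sketch of that pattern as a method on your spider (the exact signature of hasChanged may differ; check LomBase):

    def parse(self, response):
        # skip unchanged items early so importing a single uuid works as expected
        if not self.hasChanged(response):
            return None
        return LomBase.parse(self, response)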

Delete all previous imports (cleanrun)

scrapy crawl <spider_name> -a cleanrun=true

Please note that this will not update Elasticsearch automatically; you need to re-sync the index with the database manually in case elements have been deleted.