scheduled-web-scraper

This simple project automatically creates backups of specific web pages on a regular schedule.

Installation

The Python script does not run repeatedly on its own, so it has to be scheduled as a cron job. Of course, cron jobs only run while the machine is switched on, so setting this up on a server is recommended.

Automatically (Linux)

Run the setup.sh script. You may have to make the file executable first with chmod +x setup.sh.
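
Assuming the script is called from the project root, the two commands would be:

chmod +x setup.sh
./setup.sh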

Manually (Linux)

Since Git can't track empty folders and I don't want to work with a .gitkeep file, you need to manually create a folder called "archive" in the project directory with mkdir archive.

Open the crontab file with crontab -e and create a new cron job by adding a line following the scheme (minute hour day month day-of-week command-to-execute). The time fields can be chosen freely; however, the command must call the web scraper script, so make sure you specify the correct path to the Python file.

The new line could look like one of the following examples:

  • Run the web scraper at the top of every hour:
    0 * * * * /usr/stupid-web-scraper/main.py
  • Run the web scraper every 15 minutes from 8 am to 6 pm, Monday to Friday:
    */15 8-18 * * 1-5 /usr/stupid-web-scraper/main.py

See wiki.ubuntuusers.de/Cron for more information.
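
If main.py is not executable or has no shebang line, cron cannot call it directly. In that case the Python interpreter can be invoked explicitly in the cron entry; the interpreter path, project path and log file below are only placeholders and have to be adapted to your system:

*/15 8-18 * * 1-5 /usr/bin/python3 /path/to/scheduled-web-scraper/main.py >> /path/to/scheduled-web-scraper/cron.log 2>&1

Redirecting stdout and stderr to a log file makes it much easier to see why a scheduled run failed.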

Save the file and make sure cron is actually running with service cron status. If it is not, start the service with sudo service cron start.
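
To confirm that the new entry was saved, you can list the current user's cron jobs:

crontab -l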

Usage

Store the links to all web pages to be called in the url_list.csv file. All other settings can be adjusted in the config.py file.
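
The exact layout of url_list.csv is not described here; assuming one URL per row, the file could look like this (the addresses are just placeholders):

https://example.org
https://example.com/news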

Start the cron service

service cron start


Stop the cron service

service cron stop

Roadmap

  • Core functionality to scrape multiple web pages
  • Add Logging
  • Set configs in a separate file
  • Set limit for saving web pages
  • Add a random delay when retrieving the web pages
  • Adjustable path to the backup (archive) folder
  • Implementation for different operating systems
    • Linux (using cron job)
    • Windows (using scheduler)
    • Using Docker
  • Parallelize the web requests
  • Send regular summaries as emails
