Spider

A spider class that crawls pages starting from a URL, queues the links on each page that match rules you define, and then crawls those pages in turn.

This is heavily inspired by the Ruby gem spider: https://github.com/johnnagro/spider

Installation

Add the dependency to your shard.yml:

dependencies:
  spider:
    github: confact/spider.cr

Run shards install

Usage

require "spider"

Then configure the spider like this:

Spider.start("https://google.com") do |s|
  s.amount_workers = 30
  s.every_page_urls = ->(url : URI) {
    # Dots are escaped so they match a literal "." only.
    if /^https:\/\/news\.ycombinator\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
    if /^https:\/\/indiehackers\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
  }

  s.every_page = ->(data : Lexbor::Parser, url : URI) {
    # run either the whole data process here or move it to another class and call it here,
    # we give you the Lexbor::Parser instance directly so you can use it freely,
    # and the url to route to correct processing depending on url.
  }
end

This runs the spider. The call blocks, so any code placed after it does not run while the spider is running.
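
The comment in every_page suggests routing to different processing depending on the URL. A minimal sketch of that idea, using only the standard library, could look like this. The handler_for function and the symbol names are hypothetical and not part of spider.cr.

```crystal
require "uri"

# Hypothetical helper (not part of spider.cr): picks a handler name
# based on the host of the page that was just fetched.
def handler_for(url : URI) : Symbol
  case url.host
  when "news.ycombinator.com" then :hacker_news
  when "indiehackers.com"     then :indie_hackers
  else                             :unknown
  end
end

puts handler_for(URI.parse("https://news.ycombinator.com/item?id=1"))
```

Inside every_page you could call such a helper with the url argument and pass the Lexbor::Parser instance to whichever class handles that site.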

Configuration

prefix_url

If you fetch pages through a proxy API, set its URL here.

Such a service usually takes the URL you want to visit as a query parameter, so the prefix should end with that parameter.

For example:

s.prefix_url = "https://app.scrapingbee.com/api/v1/?api_key={api_key}&render_js=true&url="
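
How the spider joins prefix_url with the page URL is not described here. The sketch below shows one plausible way to do it, assuming the page URL is percent-encoded and appended to the prefix. The proxied_url function and the proxy.example address are made up for illustration and are not spider.cr's actual implementation.

```crystal
require "uri"

# Hypothetical helper, not spider.cr's actual implementation.
# Assumes the prefix ends with the query parameter that takes the target URL.
def proxied_url(prefix_url : String, target : URI) : String
  # Percent-encode the target so its own "?" and "&" do not
  # get mixed up with the proxy's query string.
  prefix_url + URI.encode_www_form(target.to_s)
end

puts proxied_url("https://proxy.example/api?render_js=true&url=",
                 URI.parse("https://example.com/page?id=1"))
```

If your proxy service expects the target URL unencoded, or in a different parameter, adjust the prefix accordingly.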

Storage of visited urls and queue urls

Visited URLs and queued URLs are currently stored in text files, and this is hardcoded. We plan to support other storage backends.

Ideas for future storage:

Redis
Memcached
Database
some custom API
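
As a rough illustration of what text-file storage of visited URLs can look like, here is a small sketch that keeps one URL per line. The FileVisitedStore class and its file layout are hypothetical; the files spider.cr actually writes may be organized differently.

```crystal
# Hypothetical sketch, not spider.cr's actual storage code.
# Keeps visited URLs in memory and appends new ones to a text file,
# one URL per line, so a later run can pick up where this one stopped.
class FileVisitedStore
  def initialize(@path : String)
    @seen = Set(String).new
    # Load URLs recorded by a previous run, if the file exists.
    File.each_line(@path) { |line| @seen << line } if File.exists?(@path)
  end

  def visited?(url : String) : Bool
    @seen.includes?(url)
  end

  def mark(url : String) : Nil
    # add? returns false when the URL was already present.
    return unless @seen.add?(url)
    File.open(@path, "a") { |file| file.puts url }
  end
end

store = FileVisitedStore.new(File.tempname("visited", ".txt"))
store.mark("https://example.com/")
puts store.visited?("https://example.com/")
```

A Redis or database backend could expose the same two methods, which would let the storage be swapped without touching the crawl loop.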

Todo:

The spider works and performs well on some production systems, but several things could be better:

Failure handling: a custom way to handle failures in the start block.
More storage options, and a way to select one in the start block.
The spider keeps running even after it is done, because the while check does not seem to work fully.
robots.txt support, to respect websites' wishes.

Contributing

Contributions are welcome, for example on the concurrency support, which I am new to.

Fork it (https://github.com/confact/spider.cr/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Contributors

Håkan Nylén - creator and maintainer

License

MIT License

Spider.cr v0.1.2