Spider.cr v0.1.2
Spider
A spider class that crawls pages starting from a URL, picks the links on each page that match rules you define, and then crawls those pages in turn.
This is heavily inspired by the Ruby gem spider: https://github.com/johnnagro/spider
Installation
- Add the dependency to your shard.yml:

  dependencies:
    spider:
      github: confact/spider.cr

- Run shards install
Usage
require "spider"
And then set up the spider config like this:
Spider.start("https://google.com") do |s|
  s.amount_workers = 30
  # Called for every URL found on a page; queue only the ones you want to visit.
  # The dots in the host names are escaped so they match literally.
  s.every_page_urls = ->(url : URI) {
    if /^https:\/\/news\.ycombinator\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
    if /^https:\/\/indiehackers\.com\/.*/ =~ url.to_s
      s.add_link_to_visit(url)
    end
  }
  s.every_page = ->(data : Lexbor::Parser, url : URI) {
    # Run the whole data processing here, or move it to another class and call it here.
    # You get the Lexbor::Parser instance directly so you can use it freely,
    # and the url so you can route to the correct processing for each page.
  }
end
This runs the spider. The call blocks, so any code below it does not run until the spider stops.
Configuration
prefix_url
If you use a proxy API, you can set it here.
It is usually a URL that takes the address you want to fetch as a query parameter.
For example:
s.prefix_url = "https://app.scrapingbee.com/api/v1/?api_key={api_key}&render_js=true&url="
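To illustrate how such a prefix and a target address fit together, here is a minimal sketch. The helper `prefixed_url` and the host `proxy.example` are hypothetical and not part of spider.cr, and whether spider.cr itself percent-encodes the target before appending it is not stated here. The sketch assumes encoding is wanted, so that the target's own `?` and `&` stay inside the `url=` parameter.

```crystal
require "uri"

# Hypothetical helper, not part of spider.cr: joins a proxy prefix and a target URL.
def prefixed_url(prefix : String, target : URI) : String
  # Percent-encode the target so its query string is not mistaken
  # for extra parameters of the proxy request.
  prefix + URI.encode_www_form(target.to_s)
end

puts prefixed_url("https://proxy.example/api?url=", URI.parse("https://example.com/a?b=1&c=2"))
```

If the proxy expects the raw, unencoded address instead, plain string concatenation is enough.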
Storage of visited urls and queue urls
We plan to support different ways to store the visited urls and queue urls. Right now it is hardcoded to use text files.
Ideas for future storage:
- Redis
- Memcached
- Database
- some custom API
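One way the backends listed above could share a common shape is sketched below. Everything here is an assumption for illustration: `UrlStore` and `FileUrlStore` are hypothetical names, spider.cr does not expose this interface today, and its actual text file format may differ from the one-URL-per-line layout used in the sketch.

```crystal
# Hypothetical interface; a Redis or database backend would implement the same methods.
abstract class UrlStore
  abstract def visited?(url : String) : Bool
  abstract def mark_visited(url : String) : Nil
end

# Text file backend: one visited URL per line (assumed format).
class FileUrlStore < UrlStore
  def initialize(@path : String)
    @seen = Set(String).new
    # Load URLs recorded by an earlier run, if the file already exists.
    File.each_line(@path) { |line| @seen << line } if File.exists?(@path)
  end

  def visited?(url : String) : Bool
    @seen.includes?(url)
  end

  def mark_visited(url : String) : Nil
    return if @seen.includes?(url)
    @seen << url
    # Append so the record survives a restart.
    File.open(@path, "a") { |f| f.puts url }
  end
end

store = FileUrlStore.new(File.tempname("visited", ".txt"))
store.mark_visited("https://example.com/")
puts store.visited?("https://example.com/")
```

Keeping the in-memory set next to the file makes lookups cheap while the file remains the durable copy.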
Todo:
This works and is running well on some production systems, but some things could be better:
- Failure handling, with a custom way to handle failures in the start block.
- More storage options, and a way to choose one in the start block.
- It keeps running even after it is done, because the while check does not seem to work fully.
- robots.txt support, to respect websites' wishes.
Contributing
Contributions are welcome, for example on the concurrency support, as I am new to that.
- Fork it (https://github.com/confact/spider.cr/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- Håkan Nylén - creator and maintainer
- November 15, 2021
MIT License