crawler v0.1.1

web crawler


Highly customizable web crawler

  • fast : native code built on libcurl
  • simple : just edit a config file
  • flexible : extract data with CSS selectors and regular expressions
  • safe : detects infinite loops
  • traceable : all transferred HTTP data is stored

Installation

A static binary is available for x86_64 Linux.
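
If you just want the prebuilt binary, a download along these lines should work (the release asset name and URL below are assumptions; check the project's GitHub releases page for the actual file):

$ wget https://github.com/maiha/crawler/releases/download/v0.1.1/crawler   # hypothetical asset path
$ chmod +x crawler
$ ./crawler config generate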

Usage

  1. edit config
  2. receive html
  3. extract data
1. edit config

For first-time setup, the config generate subcommand can create a template for you.

$ crawler config generate
$ vi .crawlrc
[extract.json]
title = "css:div.r h3"
name  = ["css:p.name", "strip:"]

[crawl]
url      = "https://www.google.com/search?q=crystal"
next     = "css:a.pn"
html     = "css:div.rc"
page_max = 2

This means

  • Visits the initial URL (config:crawl.url)
  • Extracts the HTML part matched by config:crawl.html and stores it locally
  • Follows the next URL (config:crawl.next) as long as the page count does not exceed config:crawl.page_max
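
Since fetching is done natively with libcurl, the very first request the crawler issues is roughly the same as a manual fetch (a sketch for orientation only; the crawler decides its own headers and storage):

$ curl -sL 'https://www.google.com/search?q=crystal' > page1.html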
2. receive html
$ crawler recv html
3. extract data
$ crawler extract json
$ crawler pb list json -f val
{"title":"Crystal"}
...
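
The output is one JSON object per line, so it pipes cleanly into standard tools. For example, assuming jq is installed, you can pull out just the titles:

$ crawler pb list json -f val | jq -r '.title'
Crystal
...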

Development

For a dynamic binary:

$ make dynamic

For a static binary (requires libcurl.a):

$ make static
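
If you don't already have a static libcurl, one way to produce libcurl.a is to build curl from source with shared libraries disabled (a sketch; the version number is only an example and the configure flags vary with the curl release and the TLS backend you want):

$ wget https://curl.se/download/curl-7.88.1.tar.gz
$ tar xf curl-7.88.1.tar.gz && cd curl-7.88.1
$ ./configure --disable-shared --enable-static --with-openssl
$ make
# the archive is left in lib/.libs/libcurl.a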

Contributing

  1. Fork it (https://github.com/maiha/crawler/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors

  • maiha - creator and maintainer
License

MIT License
