crawler v0.1.1
Highly customizable web crawler
- fast: native code with libcurl
- simple: just edit a config file
- flexible: extracts data via CSS selectors and regular expressions
- safe: detects infinite loops
- traceable: all transferred HTTP data is stored
Installation
A static binary is available for x86_64 Linux.
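If the binary is published on the project's GitHub releases page (an assumption; check https://github.com/maiha/crawler/releases for the actual asset name and URL), installation might look like:
$ wget -O crawler https://github.com/maiha/crawler/releases/latest/download/crawler
$ chmod +x crawler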
Usage
- edit config
- receive html
- extract data
1. edit config
For first-time setup, config generate may help you.
$ crawler config generate
$ vi .crawlrc
[extract.json]
title = "css:div.r h3"
name = ["css:p.name", "strip:"]
[crawl]
url = "https://www.google.com/search?q=crystal"
next = "css:a.pn"
html = "css:div.rc"
page_max = 2
This means:
- Visit the initial url (config: crawl.url)
- Extract the html part (config: crawl.html) and store it locally
- Follow the next url (config: crawl.next) until page_max (config: crawl.page_max) pages have been visited
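As a further illustration, here is a hypothetical .crawlrc for a fictitious blog. The site, selectors, and field names are made up; only the sections ([crawl], [extract.json]) and the css:/strip: prefixes are taken from the example above.
[extract.json]
title = "css:article h1"
date = ["css:span.date", "strip:"]
[crawl]
url = "https://example.com/blog"
next = "css:a.next"
html = "css:article"
page_max = 5
This would visit https://example.com/blog, store each article fragment locally, and follow the a.next link for up to 5 pages.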
2. receive html
$ crawler recv html
3. extract data
$ crawler extract json
$ crawler pb list json -f val
{"title":"Crystal"}
...
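Each output line is a standalone JSON object, so it can be piped into standard tools. For example, with jq (a separate utility, not part of crawler):
$ crawler pb list json -f val | jq -r '.title'
Crystal
...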
Development
- crystal 0.33.0
For a dynamic binary:
$ make dynamic
For a static binary (requires libcurl.a in the current directory):
$ make static
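One way to provide libcurl.a, assuming a Debian/Ubuntu build host (package name and library path vary by distribution):
$ sudo apt-get install libcurl4-openssl-dev
$ cp /usr/lib/x86_64-linux-gnu/libcurl.a .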
Contributing
- Fork it (https://github.com/maiha/crawler/fork)
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Add some feature')
- Push to the branch (git push origin my-new-feature)
- Create a new Pull Request
Contributors
- maiha - creator and maintainer
License
MIT License