crawler v0.1.1

web crawler

Highly customizable web crawler

  • fast : Native code built on libcurl
  • simple : Just edit a config file
  • flexible : Extract data with CSS selectors and regular expressions
  • safe : Detects infinite loops
  • traceable : All transferred HTTP data is stored

Installation

A static binary is available for x86_64 Linux.

Usage

  1. edit config
  2. receive html
  3. extract data
1. edit config

For first-time setup, config generate can create a starting config for you.

$ crawler config generate
$ vi .crawlrc
[extract.json]
title = "css:div.r h3"
name  = ["css:p.name", "strip:"]

[crawl]
url      = "https://www.google.com/search?q=crystal"
next     = "css:a.pn"
html     = "css:div.rc"
page_max = 2

This means the crawler:

  • Visits the initial URL (config:crawl.url)
  • Extracts the HTML parts matching config:crawl.html and stores them locally
  • Follows the next URL found by config:crawl.next, as long as the number of visited pages does not exceed config:crawl.page_max (see the sketch below)
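
In code, that loop looks roughly like the following Crystal sketch (an illustration only, not the actual implementation; the CSS selectors from the config are approximated with a regex here, and relative links are not resolved):

require "http/client"

url      = "https://www.google.com/search?q=crystal"  # crawl.url
page_max = 2                                          # crawl.page_max
page     = 1

while url && page <= page_max
  body = HTTP::Client.get(url).body
  File.write("page#{page}.html", body)  # store the received html locally
  # follow crawl.next ("css:a.pn"), crudely matched with a regex here
  m    = body.match(/<a[^>]*class="pn"[^>]*href="([^"]+)"/)
  url  = m && m[1]
  page += 1
end
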
2. receive html
$ crawler recv html
3. extract data
$ crawler extract json
$ crawler pb list json -f val
{"title":"Crystal"}
...
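
Each record is printed as one JSON object per line, so standard JSON tooling can post-process the output. For example, assuming jq is installed:

$ crawler pb list json -f val | jq -r '.title'
Crystal
...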

Development

For a dynamic binary:

$ make dynamic

For a static binary (this needs libcurl.a):

$ make static
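
If libcurl.a is not already available on your system, one way to obtain it is to build curl from source as a static library (a sketch only; run inside an unpacked curl source tree, and adjust options such as the TLS backend for your environment):

$ ./configure --disable-shared --enable-static --with-openssl
$ make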

Contributing

  1. Fork it (https://github.com/maiha/crawler/fork)
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

Contributors

  • maiha - creator and maintainer
License

MIT License
