acorn
Acorn 
🚧 Under Construction 👷
A state machine compiler with no runtime dependency. Define a grammar using a subset of regular expression notation, then compile it into a blazing-fast state machine. Acorn supports lexers or custom string-based state machines.
Installation
Add this to your application's shard.yml:
dependencies:
acorn:
github: "rmosolgo/acorn"
Usage
-
Define the grammar in a build file:
# ./build/my_lexer.cr require "acorn" class MyLexer < Acorn::Lexer # optional, rename the generated class: # name("Namespace::MyLexer") token :letter "a-z" token :number "0-9" generate "./src/my_lexer.cr" end -
Generate the lexer:
crystal run ./build/my_lexer.cr -
Use the compiled lexer:
require "./src/my_lexer.cr" MyLexer.scan(input) # => Array(Tuple(Symbol, String))
Regular Expressions
Tokens are defined with a small regular expression language:
| Feature | Example |
|---|---|
| Character | a, 1, ❤️ |
| Sequence | ab, 123 |
| Alternation | a|b |
(ab)|c |
|
. |
|
| One of | [abc] |
[^abc] |
|
| Escape | \[, \. |
| Unicode character range | a-z, 0-9 |
| Zero-or-more | a* |
| One-or-more | a+ |
| Zero-or-one | a? |
| Specific number | a{3} |
| Between numbers | a{3,4} |
| At least | a{3,} |
Build Step
An Acorn module is a Crystal program that generates code. To get a lexer, you have to run the Acorn module. Then, your main program should use the generated code.
For example, if you define a lexer:
# build/my_lexer.cr
class MyLexer < Acorn::Lexer
# ...
generate("./app/my_lexer.cr")
end
You should run the file with Crystal to generate the specified file:
crystal run build/my_lexer.cr
Then, your main program should require the generated file:
# my_app.cr
require "app/my_lexer"
MyLexer.scan(input) # => Array(Tuple(Symbol, String))
The generated code has no dependency on Acorn, so you only need this library during development.
Tokens
Acorn returns an of array tokens. Each token is a tuple with:
Symbol: the name of this token in the lexer definitionString: the segment of input which matched the pattern: line number and column number where the token began{Int32, Int32}: line number and column number where the token ended{Int32, Int32}
Line numbers and column numbers are 1-indexed, so the first character in the input is 1:1.
Custom Machines
Acorn lexers are actually a special case of state machine. You can specify a custom machine, too.
class MyMachine < Acorn::Machineto bring in the macrosalias Accumulator = ...to specify the data that will be modified during the process- It will be initialized with
.new - It will be returned from
.scan
- It will be initialized with
- use
action :name, "pattern" { |acc, str, ts, te| ... }to define patternsaccis an instance of yourAccumulatorstris the original inputtsis the index instrwhere this token beganteis the index instrwhere this token ended (if the match is one character long,ts == te)
- optionally,
name("MyNamespace::MachineName")to rename the generated Crystal class
Development
- rebuild fixtures with
crystal run spec/prepare.cr crystal spec
Goals & Non-Goals
Goals:
- Great runtime performance
- Linear time complexity
- Reasonable code size (
.crfile and memory usage) - No runtime dependency
- Plain-Crystal API (source files are
.cr, not a special format) - Easy to write a lexer, possible to write something else
Non-goals:
- Great compile-time performance (choose simplicity over performance)
- Fancy regexp features
TODO
- Finish regexp language (
(...),.,[^...])- A move on
ais also a move on:any, how is that handled?
- A move on
- Better line/col awareness:
- Add line/col to tokens
- Add line/col to errors
License
LGPLv3
acorn
- 13
- 2
- 0
- 0
- 0
- over 8 years ago
- March 28, 2017
GNU Lesser General Public License v3.0
Fri, 07 Nov 2025 14:15:56 GMT