tokenizer v1.1.1

Simplified binary stream tokenization for Crystal

Crystal Lang Tokenizer

A tool for buffering and tokenizing streaming inputs.

Overview

Consider a binary protocol such as the one used by the Harman BSS DSP.

It uses 0x03 to indicate the end of a message.

require "socket"
require "tokenizer"

# Connect to the device
connection = TCPSocket.new("10.10.10.10", 1023)
connection.tcp_nodelay = true

# Messages terminate with 0x03, so we are looking for this byte
token_buffer = Tokenizer.new(Bytes.new(1, 0x03))

while !connection.closed?
    raw_data = Bytes.new(512)
    bytes_read = connection.read(raw_data)
    break if bytes_read == 0 # Connection was closed

    token_buffer.extract(raw_data[0, bytes_read]).each do |message|
        # Process messages here, messages are of type Bytes

        # If the data was a string, it's simple to convert
        # (assuming we want to ignore the start and stop bytes)
        message = String.new(message[1, message.size - 2])

        # Do something with the message
        process message
    end
end

Supported tokenization strategies

  • Message Length - e.g. all messages are 12 bytes in size
  • Delimiter - e.g. all messages end with [0x03, 0x00]
  • Abstract - e.g. the message header determines the message length

Message Length

Messages are a fixed length, optionally starting with some indicator bytes.

# Message length 4 bytes, including the indicator bytes
buffer = Tokenizer.new(4, "GO")

# So a string like "GO12, GO56, G" has 2 complete messages
# "GO12" and "GO56"

messages = buffer.extract("GO12, GO56, G") # => [Bytes, Bytes]
messages = messages.map { |bytes| String.new(bytes) }
messages # => ["GO12", "GO56"]

The example above uses strings; however, you would typically use this strategy with binary data that can't be represented as strings.
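To make the fixed-length idea concrete for binary data, here is a minimal standalone sketch that uses only the Crystal standard library. The `split_fixed` helper and the sample bytes are hypothetical and are not part of the tokenizer shard; the sketch assumes that bytes preceding an indicator are skipped, as the "GO12, GO56, G" example suggests.

```crystal
# Hypothetical helper, NOT the shard's API: splits a buffer into fixed-length
# messages that start with the given indicator bytes.
# Assumes indicator.size <= length.
def split_fixed(buffer : Bytes, length : Int32, indicator : Bytes) : Tuple(Array(Bytes), Bytes)
  messages = [] of Bytes
  pos = 0

  while pos + length <= buffer.size
    if buffer[pos, indicator.size] == indicator
      # Indicator found and a whole message is buffered
      messages << buffer[pos, length]
      pos += length
    else
      # Assumption: skip bytes until the next indicator
      pos += 1
    end
  end

  # Whatever is left over waits for more data
  {messages, buffer + pos}
end

data = Bytes[0x02, 0x10, 0xFF, 0x00, 0x02, 0x10, 0x80, 0x7F, 0x02]
messages, remainder = split_fixed(data, 4, Bytes[0x02, 0x10])

messages.size # => 2
remainder     # => Bytes[2]
```

The trailing 0x02 stays in the remainder because only one of the four bytes of the next message has arrived.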

Delimiter

Messages are variable length, but one or more bytes mark the end of each message.

# Messages end with \n
buffer = Tokenizer.new("\n")

# So a string like "Hello.\nHow are you?\nWha" has 2 complete messages
# "Hello.\n" and "How are you?\n"

messages = buffer.extract("Hello.\nHow are you?\nWha")
messages = messages.map { |bytes| String.new(bytes) }
messages # => ["Hello.\n", "How are you?\n"]
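With streaming input, a message is often split across two reads. The following standalone sketch illustrates how a delimiter buffer can hold on to the incomplete tail between calls. `DelimiterSketch` is a hypothetical class written for this illustration with the standard library only; the shard's internals may well differ.

```crystal
# Hypothetical illustration only, NOT the tokenizer shard's implementation.
class DelimiterSketch
  def initialize(@delimiter : Bytes)
    raise ArgumentError.new("delimiter must not be empty") if @delimiter.empty?
    @buffer = IO::Memory.new
  end

  # Appends a chunk and returns every complete message, delimiter included.
  def extract(chunk : Bytes) : Array(Bytes)
    @buffer.write(chunk)
    data = @buffer.to_slice
    messages = [] of Bytes
    start = 0
    pos = 0

    while pos + @delimiter.size <= data.size
      if data[pos, @delimiter.size] == @delimiter
        finish = pos + @delimiter.size
        messages << data[start, finish - start].dup
        start = finish
        pos = finish
      else
        pos += 1
      end
    end

    # Keep the unfinished tail for the next call
    remainder = data[start, data.size - start].dup
    @buffer = IO::Memory.new
    @buffer.write(remainder)
    messages
  end
end

sketch = DelimiterSketch.new("\n".to_slice)
first = sketch.extract("Hello.\nHow a".to_slice).map { |bytes| String.new(bytes) }
second = sketch.extract("re you?\nWha".to_slice).map { |bytes| String.new(bytes) }

first  # => ["Hello.\n"]
second # => ["How are you?\n"]
```

The second call completes "How are you?\n" from the tail left by the first call, while "Wha" remains buffered.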

Abstract

Messages are split by some arbitrary logic, for example:

  • A header specifies the length of a message
  • or a successful CRC check indicates the message end

A callback is used for the application to define when a complete message has been received.

# A message header indicates the length of the message
buffer = Tokenizer.new do |io|
    bytes = io.peek # for demonstration purposes
    string = io.gets_to_end

    string[0].to_i + 1
end

# So a string like "7welcome2to5hu" has 2 complete messages
# "7welcome" and "2to"

messages = buffer.extract("7welcome2to5hu")
messages = messages.map { |bytes| String.new(bytes) }
messages # => ["7welcome", "2to"]

  • The block is expected to return the number of bytes in the next message
  • Returning anything <= 0 means the message is not complete
  • You can return the message size even if the message has not completely buffered. (i.e. if the header is completely buffered)
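The rules above can be sketched for a binary header. The example below assumes a hypothetical framing in which a 2-byte big-endian header holds the payload length, and assumes the callback receives an `IO::Memory` positioned at the start of the next message. Neither assumption comes from the shard's documentation, so the logic is written as a standalone method that could be adapted into the block.

```crystal
# Hypothetical framing: a 2-byte big-endian header holds the payload length.
HEADER_SIZE = 2

# Logic you might adapt for the Tokenizer.new block (assumes an IO::Memory
# positioned at the start of the next message).
def next_message_size(io : IO::Memory) : Int32
  available = io.size - io.pos

  # Header not fully buffered yet: a value <= 0 signals "not complete"
  return -1 if available < HEADER_SIZE

  payload_size = io.read_bytes(UInt16, IO::ByteFormat::BigEndian)

  # Total size = header + payload, returned even if the payload
  # has not been completely buffered yet
  HEADER_SIZE + payload_size.to_i32
end

complete = IO::Memory.new(Bytes[0x00, 0x03, 0x41, 0x42, 0x43])
partial = IO::Memory.new(Bytes[0x00])

next_message_size(complete) # => 5
next_message_size(partial)  # => -1
```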
License

MIT License
