Logarithm: Anomaly Detection Agent

Logarithm is a self-learning diagnostics agent for GNU/Linux systems that uses machine learning to detect anomalies in system logs in real time.

Built with Crystal for performance and reliability, it trains an autoencoder on normal log patterns to identify potential issues, security threats, or unusual behavior.
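
The usual autoencoder recipe applies: each log line is vectorized with TF-IDF, the model reconstructs the vector, and lines whose reconstruction error exceeds the detection threshold are flagged. A minimal Crystal sketch of that scoring step (the function name and toy values are illustrative assumptions, not Logarithm's internals):

# Score one vector by how badly the model reconstructs it (RMS error).
def reconstruction_error(input : Array(Float64), output : Array(Float64)) : Float64
  sum = input.zip(output).sum { |a, b| (a - b) ** 2 }
  Math.sqrt(sum / input.size)
end

threshold     = 0.8                  # plays the role of the -T flag
vector        = [0.0, 1.0, 0.0, 0.0] # TF-IDF features of a log line
reconstructed = [0.9, 0.1, 0.8, 0.7] # a poor reconstruction of that vector
puts "anomaly!" if reconstruction_error(vector, reconstructed) > threshold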

Features

  • Multi-source Monitoring: Simultaneous ingestion from systemd journal and syslog files
  • Real-time Detection: Continuous anomaly detection with configurable thresholds
  • Machine Learning: TF-IDF vectorization and autoencoder-based unsupervised learning
  • Memory-Efficient Training: Chunked storage system prevents memory accumulation during training on large datasets (100K+ logs)
  • Incremental Retraining: Flexible retraining modes to adapt to evolving log patterns
  • Security: AES-256 encryption, audit logging, and input validation
  • Resilience: Retry logic, circuit breakers, and comprehensive error handling
  • CLI Tools: Simple command-line interface for training and monitoring

Quick Start

Prerequisites: Crystal 1.17.1+, GNU Make, and the systemd development libraries (for journald support)

git clone https://gitlab.com/renich/logarithm.git
cd logarithm
make release

Train the model (defaults to 24 hours on the default log sources):

bin/logarithm train

Train with custom duration:

bin/logarithm train -t '1h' -j

Monitor for anomalies:

bin/logarithm monitor

Monitor with custom threshold:

bin/logarithm monitor -T 0.8 /var/log/syslog

Monitor from stdin:

tail -f /var/log/app.log | bin/logarithm monitor -i

Command Line Interface

Logarithm provides comprehensive CLI options for both training and monitoring operations. All flags support both long and short forms.

Training Command

The train command supports extensive customization of the training process:

bin/logarithm train [flags...] [paths...]

Training Flags

Flag                      Short  Description                           Example
--time                    -t     Training duration                     -t '30m', -t '2h', -t '90s'
--vocab-size              -V     TF-IDF vocabulary size                -V 500
--batch-size              -B     Training batch size                   -B 5000
--max-batches             -M     Maximum training batches              -M 10
--threshold               -T     Anomaly detection threshold           -T 0.8
--retrain-mode            -m     Retraining mode                       -m incremental, -m full, -m hybrid
--expand-vocab            -e     Expand vocabulary with new terms
--rollback                -b     Roll back to previous model version
--journald                -j     Include systemd journal
--recursive               -r     Recursively monitor directories
--since                   -s     Start journal from specific time      -s '1 hour ago'
--data-dir                -D     Model storage directory               -D /custom/path
--log-level               -L     Log level                             -L DEBUG, -L INFO
--encryption-key          -K     Encryption key for models             -K 'your-key'
--encryption-fingerprint  -P     Encryption key fingerprint            -P 'sha256-hash'
--config                  -c     Configuration file path               -c config.yml
--verbose                 -v     Enable verbose logging

Training Examples

Basic training with custom duration:

bin/logarithm train -t '1h' -j

Advanced training with custom parameters:

bin/logarithm train -t '45m' -V 300 -B 2000 -M 8 -T 0.75 -L INFO -v

Full retraining with encryption:

bin/logarithm train --retrain-mode full -K 'my-secret-key' -t '2h'

Training with specific log sources:

bin/logarithm train /var/log/syslog /var/log/auth.log --recursive

Monitoring Command

The monitor command provides real-time anomaly detection with flexible output options:

bin/logarithm monitor [flags...] [paths...]

Monitoring Flags

Flag                      Short  Description                        Example
--stdin                   -i     Read logs from stdin
--output-format           -f     Output format                      -f json, -f csv, -f human
--filter                  -F     Filter logs containing text        -F "ERROR"
--threshold               -T     Anomaly detection threshold        -T 0.8
--journald                -j     Include systemd journal
--recursive               -r     Recursively monitor directories
--since                   -s     Start journal from specific time   -s '1 hour ago'
--data-dir                -D     Model directory                    -D /custom/path
--log-level               -L     Log level                          -L DEBUG
--encryption-key          -K     Encryption key for models          -K 'your-key'
--encryption-fingerprint  -P     Encryption key fingerprint         -P 'sha256-hash'
--config                  -c     Configuration file path            -c config.yml
--verbose                 -v     Enable verbose logging

Monitoring Examples

Basic monitoring:

bin/logarithm monitor /var/log/syslog

Monitor with JSON output and filtering:

bin/logarithm monitor -f json -F "WARNING" -T 0.7 /var/log/app.log

Read from stdin with custom threshold:

tail -f /var/log/app.log | bin/logarithm monitor -i -T 0.9

Monitor multiple sources with journal:

bin/logarithm monitor -j /var/log/syslog /var/log/auth.log

CSV output for data analysis:

bin/logarithm monitor -f csv -T 0.8 /var/log/*.log > anomalies.csv

Model Management

Logarithm also provides model management commands:

bin/logarithm model [subcommand] [flags...]

Available subcommands:

  • info - Display detailed model information
  • list - List available model versions
  • delete - Remove specific model versions

Examples:

# Show model information
bin/logarithm model info

# List all model versions
bin/logarithm model list

# Delete old model versions
bin/logarithm model delete --older-than 30d

Troubleshooting

Common Issues

"No trained models found" error:

# Train models first
bin/logarithm train -t '1h'

# Or specify custom model directory
bin/logarithm monitor -D /path/to/models

Permission denied on log files:

# Run with appropriate permissions or use sudo
sudo bin/logarithm monitor /var/log/syslog

# Or monitor specific accessible files
bin/logarithm monitor ~/logs/app.log

High memory usage during training:

# Use memory-efficient chunked training (recommended for large datasets)
bin/logarithm train -t '2h'  # Automatically uses chunked storage for 100K+ logs

# Or reduce batch size and vocabulary for smaller datasets
bin/logarithm train -B 1000 -V 200 -t '30m'

Slow anomaly detection:

# Adjust threshold for fewer false positives
bin/logarithm monitor -T 0.9

# Use filtering to reduce log volume
bin/logarithm monitor -F "ERROR|WARNING" /var/log/app.log

Performance Tuning

For large log volumes:

  • Use smaller batch sizes: -B 1000
  • Reduce vocabulary size: -V 200
  • Limit training batches: -M 5

For real-time monitoring:

  • Use filtering: -F "important-pattern"
  • Adjust threshold: -T 0.8
  • Use JSON output for integration: -f json
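
Because the -f json output above is the natural integration point, here is a hypothetical downstream consumer in Crystal; the "score" and "line" field names are assumptions for illustration, not Logarithm's documented output schema:

require "json"

# Read one JSON event per line from stdin and escalate strong anomalies.
STDIN.each_line do |raw|
  event = JSON.parse(raw)
  score = event["score"]?.try(&.as_f?) || 0.0
  puts "high-severity anomaly: #{event["line"]?}" if score > 0.9
end

Compiled with crystal build, it could terminate a pipeline such as bin/logarithm monitor -f json /var/log/app.log | ./json_alert.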

Advanced Training Options

Logarithm supports flexible retraining strategies to adapt to evolving log patterns:

Incremental retraining (default, loads existing models and trains on new logs):

bin/logarithm train --retrain-mode incremental

Full retraining (ignores existing models, starts fresh training):

bin/logarithm train --retrain-mode full

Hybrid retraining (loads models but forces vocabulary expansion):

bin/logarithm train --retrain-mode hybrid

Expand vocabulary (add new terms to existing vectorizer vocabulary):

bin/logarithm train --expand-vocab

Rollback to previous model version (revert to backup models):

bin/logarithm train --rollback

Memory-Efficient Training

Logarithm v0.9.0 introduces a memory-efficient chunked training system that prevents memory accumulation when processing large datasets. This system automatically activates for datasets with 100K+ logs and provides the following benefits:

  • Bounded Memory Usage: Memory usage remains constant regardless of dataset size
  • Automatic Chunking: Large log files are automatically split into manageable chunks (~900KB each)
  • Temporary Storage: Chunks are stored in temporary files, not memory
  • Seamless Integration: No changes required to existing training commands

How It Works

When training on large datasets, Logarithm:

  1. Collects logs in memory until reaching the chunk threshold
  2. Writes chunks to temporary files in /tmp/logarithm_training-*
  3. Processes chunks sequentially during training
  4. Cleans up temporary files automatically after training
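
A minimal Crystal sketch of those four steps (the names, sizes, and reading logs from stdin are illustrative assumptions, not Logarithm's actual implementation):

require "file_utils"

CHUNK_BYTES = 900 * 1024 # ~900 KB per chunk, as described above

dir = File.join(Dir.tempdir, "logarithm_training-#{Process.pid}")
Dir.mkdir_p(dir)
chunk_paths = [] of String
buffer = IO::Memory.new

# 1. Collect logs in memory until the chunk threshold is reached.
STDIN.each_line do |line|
  buffer << line << '\n'
  next if buffer.bytesize < CHUNK_BYTES
  # 2. Spill the buffer to a temporary chunk file (final partial chunk omitted).
  path = File.join(dir, "chunk_#{chunk_paths.size}.log")
  File.write(path, buffer.to_s)
  chunk_paths << path
  buffer.clear
end

# 3. Process chunks sequentially; a real trainer would vectorize each line here.
chunk_paths.each do |path|
  File.each_line(path) { |log_line| }
end

# 4. Clean up temporary files once training finishes.
FileUtils.rm_rf(dir)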

Memory-Efficient Training Examples

Training on large log files (automatic chunking):

# Processes 1M logs with bounded memory usage
bin/logarithm train -t '2h' /var/log/large_app.log

# Multiple large files with chunked processing
bin/logarithm train -t '4h' /var/log/app.log /var/log/access.log

Monitoring chunked training progress:

# Check chunk files during training
watch 'ls -la /tmp/logarithm_training-*/chunk_*.log | wc -l'

# Monitor memory usage (should remain stable)
watch 'ps aux | grep logarithm | grep -v grep | awk "{print \$4\"% \"\$6\"KB\"}"'

Performance Characteristics

  • Memory Usage: ~25-50MB regardless of dataset size (vs. GBs without chunking)
  • Chunk Size: ~900KB per chunk for optimal I/O performance
  • Processing Rate: 1000-5000 logs/second depending on hardware
  • Storage Overhead: Minimal (chunks are cleaned up automatically)

Configuration

Logarithm supports configuration via YAML files, environment variables, and command-line flags. Settings are applied in this order of precedence:

  1. Command-line flags (highest priority) - Override all other settings
  2. Environment variables - Override config file and defaults
  3. Configuration file - Override built-in defaults
  4. Built-in defaults (lowest priority)
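
A small Crystal illustration of that resolution order (resolve_threshold and its parameters are hypothetical names, not Logarithm's API):

# First non-nil source wins: CLI flag, then environment, then file, then default.
def resolve_threshold(cli_flag : Float64?, file_value : Float64?, default : Float64) : Float64
  env = ENV["LOGARITHM_THRESHOLD"]?.try(&.to_f?)
  cli_flag || env || file_value || default
end

puts resolve_threshold(nil, 0.85, 0.8) # => 0.85 (config file beats the default)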

Configuration File

Example config file (config.yml):

data_dir: ~/.local/share/logarithm
threshold: 0.85
duration: 48h
vocab_size: 100
batch_size: 10000
max_batches: 5

Environment Variables

Logarithm respects the following environment variables:

Variable                  Description                  Example
LOGARITHM_DATA_DIR        Model storage directory      /custom/path
LOGARITHM_LOG_LEVEL       Logging verbosity            DEBUG, INFO, WARN
LOGARITHM_THRESHOLD       Anomaly detection threshold  0.8
LOGARITHM_ENCRYPTION_KEY  Encryption key for models    your-secret-key
LOGARITHM_CONFIG          Path to config file          /path/to/config.yml

CLI Flag Precedence

All command-line flags override their corresponding configuration settings:

# This will use threshold 0.9 regardless of config file or env var settings
bin/logarithm monitor -T 0.9 /var/log/syslog

# This will use custom data directory regardless of LOGARITHM_DATA_DIR
bin/logarithm train -D /tmp/models -t '1h'

Configuration Examples

Using config file:

bin/logarithm train -c /etc/logarithm/config.yml

Using environment variables:

export LOGARITHM_DATA_DIR=/var/lib/logarithm
export LOGARITHM_THRESHOLD=0.8
bin/logarithm monitor

Mixed configuration (CLI overrides all):

LOGARITHM_THRESHOLD=0.7 bin/logarithm monitor -T 0.9 -D /tmp/models
# Result: threshold=0.9, data_dir=/tmp/models

License

This project is licensed under the GNU General Public License v3.0 or later; see the LICENSE file for details.

Authors

Acknowledgments

Development and Testing

Fakelogs Tool

Logarithm includes a fakelogs tool for generating synthetic log data for testing:

# Build the fakelogs tool
make fakelogs

# Generate syslog-style logs
./bin/fakelogs --template syslog --count 1000 > test_logs.log

# Generate JSON logs with anomalies
./bin/fakelogs --template json --anomalies 50 > test_data.json

Testing

Run the full test suite:

make test

Run integration tests:

make -C integration test

Building

Build development version:

make build

Build optimized release:

make release

Documentation

  • API Documentation: Generated from source code using make docs (uses README.rst)
  • User Guide: See USER_GUIDE.rst for practical usage examples
  • Fakelogs Guide: See integration/README.md for testing tool documentation