LLM Algorithm Testing Repository

This repository contains algorithm implementations and tests from various Large Language Models (LLMs) to evaluate their problem-solving capabilities, code quality, and understanding of complex concepts.

Test Overview

The primary test challenges an LLM to create a Bitcoin price prediction algorithm that embodies the principle of "wisdom", defined here as being as simple as possible while remaining as accurate as possible.

Test Components

#. Initial Code Analysis: LLMs are given a Crystal language implementation with bugs and asked to fix it
#. Algorithm Improvement: Enhance the prediction algorithm while maintaining simplicity
#. Wisdom Demonstration: Explain why the approach demonstrates wisdom in algorithm design
#. Multi-Horizon Predictions: Generate predictions for 1 hour, 1 day, 1 week, 1 month, 1 year, and 3 years
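
The multi-horizon requirement can be sketched as follows. This is a minimal illustration written in Python rather than Crystal, and it is not taken from any implementation in this repository. The `HORIZONS_HOURS` table, the function names, and the month and year lengths (30 and 365 days) are all assumptions. It uses a driftless random walk: the point forecast equals the last price, and the uncertainty band widens with the square root of the horizon.

```python
import math

# Hypothetical horizon table: hours per horizon.
# Assumes a 30-day month and a 365-day year.
HORIZONS_HOURS = {
    "1 hour": 1,
    "1 day": 24,
    "1 week": 24 * 7,
    "1 month": 24 * 30,
    "1 year": 24 * 365,
    "3 years": 24 * 365 * 3,
}


def random_walk_forecast(last_price, hourly_vol, hours, z=1.96):
    """Return (low, point, high) for a driftless random walk in log-price.

    hourly_vol is the standard deviation of hourly log returns, which the
    caller must estimate. z=1.96 gives a roughly 95% band if log returns
    are normal and independent (a strong assumption for Bitcoin).
    """
    spread = z * hourly_vol * math.sqrt(hours)
    low = last_price * math.exp(-spread)
    high = last_price * math.exp(spread)
    return low, last_price, high


def forecast_all(last_price, hourly_vol):
    """Produce one (low, point, high) tuple per horizon in HORIZONS_HOURS."""
    return {
        name: random_walk_forecast(last_price, hourly_vol, hours)
        for name, hours in HORIZONS_HOURS.items()
    }


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not real market data.
    for name, (low, point, high) in forecast_all(100_000.0, 0.005).items():
        print(f"{name:>8}: {low:,.0f} .. {point:,.0f} .. {high:,.0f}")
```

A baseline like this makes the cost of long horizons visible: the 3-year band is far wider than the 1-hour band, so a more elaborate algorithm is only worth its complexity if it narrows the range without losing accuracy.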

Evaluation Criteria

LLMs are evaluated on:

Code Quality: Clean, readable, maintainable code
Bug Fixing: Ability to identify and fix existing issues
Algorithm Design: Balance between simplicity and effectiveness
Market Understanding: Incorporation of real-world market behaviors
Wisdom Principles: Demonstration of knowing what to include vs. what to exclude
Documentation: Clear explanation of approach and rationale

Test Results

Each LLM's implementation is stored in its own directory containing:

The complete algorithm implementation (predictor.cr)
Comprehensive documentation explaining the approach (README.rst)
Evidence of wisdom in algorithm design
Performance characteristics and predictions

Reviews and Rankings

The reviews/ directory contains comprehensive evaluations of all implementations:

gemini.rst - Gemini's evaluation of all LLM implementations
grok.rst - Grok's evaluation of all LLM implementations
instructions.rst - Guidelines for reviewing implementations

Key Findings from Reviews:

Top Performers: Kimi (Bayesian intervals), Gemini (rigorous validation), DeepSeek (Bitcoin fundamentals)
Critical Wisdom: Epistemic humility (probabilistic ranges) outperforms false precision
Bitcoin-Specific: Understanding Bitcoin's network properties beats generic technical analysis
Validation Matters: Walk-forward backtesting is essential for demonstrating effectiveness
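
Walk-forward backtesting of probabilistic ranges, as named in the findings above, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the procedure the reviewers used. The names `naive_interval` and `walk_forward`, the rolling-window design, and the two metrics are hypothetical. At each step the forecaster sees only past prices, predicts `horizon` steps ahead, and is scored on absolute error and on whether the realized price fell inside the predicted range.

```python
import math
import statistics


def naive_interval(history, horizon, z=1.96):
    """Hypothetical baseline: last price with a random-walk band.

    Volatility is estimated from the log returns in `history` only,
    so `history` needs at least two prices.
    """
    returns = [math.log(b / a) for a, b in zip(history, history[1:])]
    vol = statistics.pstdev(returns)
    spread = z * vol * math.sqrt(horizon)
    last = history[-1]
    return last * math.exp(-spread), last, last * math.exp(spread)


def walk_forward(prices, forecaster, window, horizon):
    """Score `forecaster` at every origin t using only prices[t-window:t].

    Returns (mean absolute error, fraction of actuals inside the interval).
    """
    errors = []
    hits = 0
    for t in range(window, len(prices) - horizon + 1):
        low, point, high = forecaster(prices[t - window:t], horizon)
        # The last price seen is prices[t-1]; the target is `horizon` steps later.
        actual = prices[t + horizon - 1]
        errors.append(abs(actual - point))
        if low <= actual <= high:
            hits += 1
    if not errors:
        raise ValueError("not enough data for the given window and horizon")
    return statistics.fmean(errors), hits / len(errors)
```

Because the forecaster never sees data from after the forecast origin, the scores are free of look-ahead bias. A coverage figure far below the nominal level (about 95% for `z=1.96`) would suggest the ranges are overconfident, which is the "false precision" the reviews warn against.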

Wisdom Definition

For the purpose of this test, "wisdom" in algorithm design is defined as:

#. Simplicity: Using the minimum complexity necessary to achieve the goal #. Effectiveness: Producing meaningful, useful results #. Balance: Knowing what features to include and what to exclude #. Reality: Incorporating real-world constraints and behaviors #. Humility: Acknowledging limitations and uncertainties

Why This Test Matters

This test evaluates several key LLM capabilities:

Technical Proficiency: Code debugging and implementation skills

Domain Knowledge: Understanding of financial markets and algorithmic trading

Conceptual Thinking: Ability to grasp abstract concepts like "wisdom"

Communication: Clear documentation and explanation of complex ideas

Pragmatism: Balancing theoretical perfection with practical utility

Market Awareness: Incorporating real-world market behaviors and technical analysis

Future Tests

Additional test scenarios may include:

Different asset classes (stocks, commodities, forex)

Alternative programming languages

More complex prediction scenarios

Real-time data integration

Performance optimization challenges

Extended historical data analysis (beyond 365 days)

Contributing

This repository serves as a benchmark for LLM capabilities. Each implementation represents a different approach to the same fundamental problem, providing valuable insights into how different models reason about complex, ambiguous tasks.

The goal is not to find the "perfect" algorithm, but to understand how LLMs approach problems that require both technical skill and conceptual wisdom.

Repository

predictor

Owner

renich

December 11, 2025


Languages

Crystal 100.0%