mcp-server v1.0.0
mnemodoc-server
A Crystal MCP server that indexes project documentation via Ollama embeddings and exposes hybrid search (semantic + keyword) to MCP clients.
Why
Loading the full documentation into context at the start of every Claude Code session is expensive. This server lets Claude fetch only the relevant passages on demand, which cuts the tokens spent on documentation.
Features
- Section-aware chunking — splits Markdown at ##/### boundaries, not arbitrary token counts
- Hybrid search — semantic (Ollama embeddings) + in-memory keyword matching fused with RRF
- Local & private — embeddings via Ollama (native or Docker), no data sent externally
- Two transports — stdio (Claude Code) and HTTP (Cursor, other MCP clients)
- Static binary — single executable, no runtime dependencies
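To illustrate what section-aware chunking means in practice, here is a minimal Python sketch that splits a Markdown string at ##/### headings and records each chunk's heading and parent heading. It is not the server's Crystal implementation: the `Chunk` class and `chunk_markdown` function are hypothetical names, and details such as headings inside fenced code blocks are deliberately ignored.

```python
import re
from dataclasses import dataclass

# Matches only level-2 and level-3 ATX headings ("## Title", "### Title").
HEADING = re.compile(r"^(#{2,3})\s+(.*)$")


@dataclass
class Chunk:
    heading: str         # heading that opens this chunk ("" for a preamble)
    parent_heading: str  # enclosing "##" heading for a "###" chunk, else ""
    content: str         # body text below the heading


def chunk_markdown(text: str) -> list:
    """Split Markdown into one chunk per ##/### section (illustrative only)."""
    chunks = []
    heading, parent, current_h2 = "", "", ""
    buf = []

    def flush():
        # Emit the buffered body under the heading that was active for it.
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(heading, parent, body))
        buf.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match is None:
            buf.append(line)
            continue
        flush()
        level, title = len(match.group(1)), match.group(2).strip()
        if level == 2:
            current_h2, heading, parent = title, title, ""
        else:
            heading, parent = title, current_h2
    flush()
    return chunks
```

Because a chunk always ends at the next heading, each returned passage stays within one section instead of straddling two unrelated ones.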
Quick start
1. Start Ollama
docker run -d --name ollama -p 11434:11434 ollama/ollama
docker exec ollama ollama pull nomic-embed-text
2. Install mnemodoc-server
# macOS
brew install mnemodoc/tap/mnemodoc-server
# Linux — download the binary for your architecture from the releases page:
# https://github.com/mnemodoc/mcp-server/releases
3. Create a config in your project
# Download the example config
curl -fsSL https://raw.githubusercontent.com/mnemodoc/mcp-server/master/.mnemodoc.example.yml \
-o .mnemodoc.yml
# Then edit .mnemodoc.yml to set your doc paths
4. Index your docs and test (optional — serve auto-indexes on startup)
mnemodoc-server index doc/ --config .mnemodoc.yml
mnemodoc-server search "how to persist a model" --config .mnemodoc.yml
5. Add to your MCP client
Claude Code (~/.claude/settings.json) — stdio transport, no network exposure:
{
"mcpServers": {
"doc": {
"command": "mnemodoc-server",
"args": ["serve", "--config", "/path/to/project/.mnemodoc.yml"]
}
}
}
Cursor (.cursor/mcp.json) — HTTP transport, start the server first:
mnemodoc-server serve --sse --config /path/to/project/.mnemodoc.yml
{
"mcpServers": {
"doc": {
"url": "http://localhost:8765/mcp"
}
}
}
CLI
mnemodoc-server serve [--config .mnemodoc.yml] # Claude Code (stdio, default)
mnemodoc-server serve --sse [--port 8765] [--host 127.0.0.1] # Cursor / other clients
mnemodoc-server index <path> # Index a file or directory
mnemodoc-server search "<query>" [--mode hybrid|semantic|keyword] [--top 5] # Test search from terminal
mnemodoc-server status # Index stats
mnemodoc-server delete <path> # Remove from index
mnemodoc-server info # Version info
MCP tools
| Tool | Required args | Optional args | Returns |
|---|---|---|---|
| query_documents | query (string) | top_k (int), mode (hybrid\|semantic\|keyword) | chunks with file, heading, parent_heading, content, score; total_candidates, query_time_ms, mode |
| ingest_path | path (string) | — | indexed, skipped, pruned counts |
| list_files | — | — | list of indexed files with metadata |
| delete_file | path (string) | — | confirmation |
| status | — | — | version, chunk_count, file_count, model, search_mode, db_path |
query_documents optional args override the config values for that request only.
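As a rough illustration of a per-request override, the Python sketch below builds a tools/call payload for query_documents. It uses the same JSON-RPC envelope as the ingest_path request shown in this README. The helper name `build_query_request` is hypothetical, and the sketch only constructs the payload; how you send it depends on your MCP client.

```python
import json
from typing import Optional

VALID_MODES = {"hybrid", "semantic", "keyword"}


def build_query_request(query: str, top_k: Optional[int] = None,
                        mode: Optional[str] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC tools/call payload for query_documents (sketch)."""
    if mode is not None and mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
    arguments = {"query": query}
    # Optional args are only sent when set, so the config values apply otherwise.
    if top_k is not None:
        arguments["top_k"] = top_k
    if mode is not None:
        arguments["mode"] = mode
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "query_documents", "arguments": arguments},
    }


if __name__ == "__main__":
    payload = build_query_request("how to persist a model", top_k=3, mode="keyword")
    print(json.dumps(payload, indent=2))
```

Leaving out top_k and mode keeps the request minimal and lets the server fall back to whatever the config file specifies.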
Behaviour notes
Auto-indexing on startup — serve automatically re-indexes all paths from the config in the background. The server is immediately responsive; indexing happens concurrently. Files whose mtime hasn't changed since the last run are skipped, so restarts are cheap.
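The mtime-based skip can be pictured with a small Python sketch. It assumes the index remembers one modification time per file. The function name `plan_ingest` and the reading of "pruned" as "indexed earlier but no longer on disk" are assumptions for illustration, not a description of the server's internals.

```python
def plan_ingest(on_disk: dict, indexed: dict) -> tuple:
    """Decide what a re-index pass has to do (illustrative sketch).

    on_disk: path -> current mtime of every file found under the config paths
    indexed: path -> mtime recorded when the file was last indexed
    """
    # New files and files whose mtime differs must be (re-)embedded.
    to_index = sorted(p for p, m in on_disk.items() if indexed.get(p) != m)
    # Unchanged files cost nothing: no chunking, no embedding calls.
    skipped = sorted(p for p, m in on_disk.items() if indexed.get(p) == m)
    # Assumption: entries for files that disappeared are dropped from the index.
    pruned = sorted(p for p in indexed if p not in on_disk)
    return to_index, skipped, pruned
```

On a restart where nothing changed, everything lands in the skipped list, which is why the startup pass is cheap.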
Config paths resolve relative to the config file — doc/claude/ in .mnemodoc.yml is resolved relative to the directory that contains the config file, not the process working directory. Move the config file and the paths move with it.
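The resolution rule can be sketched in a few lines of Python. The function name `resolve_doc_path` and the example locations are hypothetical, and leaving absolute paths untouched is an assumption rather than documented behaviour.

```python
from pathlib import PurePosixPath


def resolve_doc_path(config_file: str, doc_path: str) -> str:
    """Resolve a doc path from the config relative to the config file (sketch)."""
    path = PurePosixPath(doc_path)
    if path.is_absolute():
        # Assumption: absolute paths are used as written.
        return str(path)
    # The base is the directory holding the config, not the working directory.
    return str(PurePosixPath(config_file).parent / path)


if __name__ == "__main__":
    print(resolve_doc_path("/srv/app/.mnemodoc.yml", "doc/claude/"))
```

The working directory never appears in the calculation, so starting the server from a different directory does not change which files are indexed.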
Model mismatch — if you change ollama.model in the config, re-index before querying. Vectors produced by different models are not comparable (the dimensions and the embedding spaces differ), so queries against a stale index silently score near-zero instead of failing. query_documents adds a warning field to the response when it detects a mismatch.
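The Python sketch below shows why mixing models is a problem and one way a mismatch could be detected. Cosine similarity is only defined for vectors of equal length, so the sketch refuses mismatched dimensions outright. The `model_mismatch_warning` helper assumes the index stores the model name it was built with; both function names and the warning text are hypothetical, not the server's actual output.

```python
import math
from typing import Optional


def cosine(a: list, b: list) -> float:
    """Cosine similarity; rejects vectors of different dimensions."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def model_mismatch_warning(indexed_model: str, configured_model: str) -> Optional[str]:
    """Return a warning string when the index was built with another model."""
    if indexed_model == configured_model:
        return None
    return (f"index was built with '{indexed_model}' but the config uses "
            f"'{configured_model}'; re-index before trusting the scores")
```

Even when two models happen to share a dimension, their vectors live in unrelated spaces, so checking the model name is the more reliable signal.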
Streaming ingest — MCP clients that support progress reporting can send Accept: text/event-stream with a tools/call ingest_path request. The server streams notifications/progress events per file indexed, followed by the final result frame. Include _meta.progressToken in the request arguments to receive progress notifications:
{
"jsonrpc": "2.0",
"id": 1,
"method": "tools/call",
"params": {
"name": "ingest_path",
"arguments": {
"path": "/your/docs",
"_meta": { "progressToken": "my-token" }
}
}
}
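On the client side, the streamed response could be consumed along these lines. This Python sketch parses text/event-stream lines into JSON-RPC messages and separates notifications/progress events from the final result frame. The function names are hypothetical, and the sketch assumes every `data:` payload is a complete JSON-RPC message.

```python
import json


def iter_sse_messages(lines):
    """Yield one parsed JSON message per server-sent event (sketch)."""
    data = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield json.loads("\n".join(data))
                data = []
            continue
        if line.startswith(":"):
            continue  # SSE comment, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            # The format allows one optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
    if data:
        yield json.loads("\n".join(data))


def split_progress(messages, request_id):
    """Separate progress notifications from the final response frame."""
    progress, result = [], None
    for msg in messages:
        if msg.get("method") == "notifications/progress":
            progress.append(msg.get("params", {}))
        elif msg.get("id") == request_id:
            result = msg
    return progress, result
```

A client would feed the HTTP response body line by line into `iter_sse_messages` and update its progress display for each notification until the frame carrying the request id arrives.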
Deployment
systemd
To run as a systemd service (SSE mode), create /etc/systemd/system/mnemodoc-server.service:
[Unit]
Description=mnemodoc-server
After=network.target
[Service]
Type=notify
ExecStart=/usr/local/bin/mnemodoc-server serve --sse --config /path/to/.mnemodoc.yml
Restart=on-failure
WatchdogSec=30
[Install]
WantedBy=multi-user.target
systemctl daemon-reload
systemctl enable --now mnemodoc-server
The server sends READY=1 once the startup index pass completes and STOPPING=1 on SIGTERM. Log rotation via SIGUSR1 is supported for use with logrotate.
The HTTP transport also exposes GET /health — a lightweight liveness probe that returns 200 OK. Use it in ExecStartPost healthchecks or load balancer probes.
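A liveness check against that endpoint might look like the Python sketch below. The helper name `is_healthy` is hypothetical; the default base URL assumes the host and port shown in the CLI section (127.0.0.1 and 8765).

```python
import urllib.request


def is_healthy(base_url: str = "http://127.0.0.1:8765", timeout: float = 2.0) -> bool:
    """Return True when GET /health answers 200, False on any failure (sketch)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


if __name__ == "__main__":
    # A non-zero exit code lets shell scripts and probes react to failure.
    raise SystemExit(0 if is_healthy() else 1)
```

Treating every error as "unhealthy" keeps the probe simple: a refused connection, a timeout, and a non-200 status all lead to the same outcome.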
Development
Requires: Crystal, mise, Ollama (native or Docker).
mise dev:ollama # start Ollama (macOS native, Metal GPU) + pull model
mise dev:deps # install dependencies
mise dev:spec # run tests
mise dev:check # build + lint + test
See CLAUDE.md for full development guide.
Alternatives
| Project | Language | Vector store | Embeddings | Chunking |
|---|---|---|---|---|
| qpd-v/mcp-ragdocs | TypeScript | Qdrant | Ollama / OpenAI | Fixed tokens |
| sanderkooger/mcp-server-ragdocs | TypeScript | Qdrant | Ollama / OpenAI | Fixed tokens |
| Zackriya-Solutions/MCP-Markdown-RAG | Python | Milvus | Local | Fixed tokens |
| Daniel-Barta/mcp-rag-server | Python | In-memory | OpenAI | Fixed tokens |
Why mnemodoc-server differs:
- Zero runtime dependencies — static binary, no Node, no Python, no external vector database
- SQLite only — no Qdrant, no Milvus to run alongside
- Section-aware chunking — splits at ##/### boundaries instead of arbitrary token counts, preserving Markdown structure
- Hybrid search — semantic + keyword fused with RRF, with a recency bias option
- Ollama only — intentionally local-first; no OpenAI key required or supported
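For readers unfamiliar with RRF (reciprocal rank fusion), the Python sketch below shows the general technique: each result list contributes 1 / (k + rank) per document, and the sums decide the fused order. The function name `rrf_fuse` is hypothetical, k = 60 is the constant commonly used in the literature rather than a value taken from this project, and the recency bias option is not modelled.

```python
def rrf_fuse(rankings: list, k: int = 60, top: int = 5) -> list:
    """Fuse several ranked lists of ids with reciprocal rank fusion (sketch)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters, so raw scores from the semantic and
            # keyword searches never need to be put on a common scale.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top]


if __name__ == "__main__":
    semantic = ["a", "b", "c"]
    keyword = ["b", "d", "a"]
    print(rrf_fuse([semantic, keyword]))
```

Documents that appear near the top of both lists win over documents that rank first in only one, which is the behaviour hybrid search is after.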
Contributing
Contributions welcome. See CLAUDE.md for the full development guide.
Scaling optimizations are deferred until the index grows past ~50–100k chunks (below that, semantic search is sub-millisecond and the cost/complexity isn't worth it). Known areas for improvement:
- ANN index — replace the linear cosine scan (O(N·D) per query) with an approximate nearest-neighbour index (sqlite-vec or HNSW) for O(log N) lookups.
- Lazy chunk content in cache — the in-memory cache currently keeps each chunk's full text; scoring only needs the embedding. Cache embeddings + metadata and fetch content from SQLite only for the returned top-K results.
- Partial top-K selection — Semantic#search fully sorts all candidates (O(N log N)); a bounded top-K heap would be O(N log k).
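The partial top-K idea can be sketched in Python with a bounded min-heap. This illustrates the technique only, not the planned Crystal code; `top_k` is a hypothetical name, and candidates are assumed to arrive as (score, chunk_id) pairs.

```python
import heapq


def top_k(scored, k: int) -> list:
    """Keep the k best (score, chunk_id) pairs in O(N log k) time (sketch)."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest candidate kept so far
    for score, chunk_id in scored:
        if len(heap) < k:
            heapq.heappush(heap, (score, chunk_id))
        elif score > heap[0][0]:
            # Evict the current weakest entry in a single O(log k) step.
            heapq.heapreplace(heap, (score, chunk_id))
    # Only the k survivors are sorted, best first.
    return sorted(heap, reverse=True)
```

The heap never holds more than k entries, so memory stays constant and most candidates are rejected with a single comparison against the current weakest score.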
License
MIT