
little-coder

A Claude Code-inspired CLI coding agent, heavily optimized for small models that run on any modern consumer laptop.

little-coder takes the architecture of a cloud-powered coding assistant and makes it work with 5–25 GB local models served via Ollama or llama.cpp, through skill-augmented tool use, domain-knowledge injection, workspace-aware context discovery, a Write-vs-Edit tool invariant, and a thinking-budget system that prevents reasoning models from hanging while preserving their partial insights.
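The thinking-budget idea can be reduced to a few lines. The sketch below uses hypothetical names (`cap_thinking` is not the repo's API): when the reasoning phase exceeds its token budget, cut it off but carry the partial chain forward instead of discarding it.

```python
def cap_thinking(reasoning_tokens, budget):
    """Sketch of the thinking-budget idea (hypothetical API, not the repo's):
    stop a reasoning model that would otherwise think forever, but keep the
    partial chain so the answer phase can still build on it."""
    if len(reasoning_tokens) <= budget:
        return reasoning_tokens, False          # under budget: pass through
    capped = reasoning_tokens[:budget] + ["[thinking budget reached]"]
    return capped, True                         # over budget: truncated, flagged
```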

Headline result: ollama/qwen3.5 (9.7B, 6.6 GB) + little-coder scores a 45.56% mean (across two full runs) on the full 225-exercise Aider Polyglot benchmark, running on a consumer laptop with no network calls. On the public leaderboard, that places it above gpt-4.5-preview (44.9%) and gpt-oss-120b high (41.8%). A matched-model vanilla Aider baseline reaches 19.11%.

The full narrative — motivation, design, methodology, results, leaderboard comparison, integrity audit, and limitations — is in the white paper at https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent. This README is the quick tour: what it looks like, how to run it, and how the repo is laid out. For anything about why the design is the way it is or what the numbers mean, read the paper.


What it looks like

little-coder startup banner

Every time you're about to type, the status line shows how much context you've burned and projects how many more messages you can send before a new session is recommended. The 70% (yellow) and 85% (red) zones match the thresholds that drive automatic compaction:

context usage and session counter
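Numerically, the zone coloring and message projection amount to something like the following sketch (helper names are made up, not the repo's API; only the 70%/85% thresholds come from the text above):

```python
def context_zone(used_tokens, window_tokens):
    """Map context usage onto the status-line zones (70% yellow, 85% red)."""
    frac = used_tokens / window_tokens
    if frac >= 0.85:
        return "red"       # past the compaction threshold
    if frac >= 0.70:
        return "yellow"
    return "green"

def messages_remaining(used_tokens, window_tokens, avg_message_tokens):
    """Project how many average-sized messages fit before the 85% mark."""
    budget = int(window_tokens * 0.85) - used_tokens
    return max(0, budget // avg_message_tokens)
```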

When you ask little-coder to implement something, the agent uses the workspace-awareness skill to discover any spec file (.docs/instructions.md, AGENTS.md, CLAUDE.md, README.md), reads the stub, and then Edits it in place. When it instead tries to Write over an existing file, the tool-level guard refuses and hands the agent the exact Edit recipe for the same path:

tool-use flow

Write guard firing and the Edit recovery
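The guard's behavior can be sketched like this (illustrative names and return shape, not the repo's exact tool API): Write may only create files; an existing path gets a refusal plus an Edit recipe instead of being clobbered.

```python
import os

def write_tool(path, content):
    """Sketch of the Write-vs-Edit invariant (hypothetical API): Write only
    creates new files.  On an existing path it refuses and returns an Edit
    recipe for the same path rather than overwriting the file."""
    if os.path.exists(path):
        return {
            "ok": False,
            "error": f"{path} exists; use Edit for in-place changes",
            "recovery": {"tool": "Edit", "path": path},
        }
    with open(path, "w") as f:
        f.write(content)
    return {"ok": True, "path": path}
```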

All four screenshots are real Rich-rendered SVG exports regenerated from a local generator script — they update in sync with the codebase.


Quick start

Option A — Ollama (simplest)

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model
ollama pull qwen3.5

# 3. Clone + install little-coder
git clone https://github.com/itayinbarr/little-coder.git
cd little-coder
pip install -e .

# 4. Run
python little_coder.py
# Then in the REPL:  /model ollama/qwen3.5

Option B — llama.cpp (fastest; supports MoE models like Qwen3.6-35B-A3B)

# 1. Build llama.cpp with CUDA (sm_XXX matches your GPU; Blackwell = 120)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DLLAMA_CURL=ON
cmake --build build --config Release -j

# 2. Fetch a GGUF (example: Qwen3.6-35B-A3B Q4_K_M, 22 GB)
pip install -U "huggingface_hub[cli]"
hf download unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
   --local-dir ~/models

# 3. Serve it (MoE trick: keep experts in RAM, attention on GPU)
build/bin/llama-server -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
   --host 127.0.0.1 --port 8888 --jinja \
   -c 16384 -ngl 99 --n-cpu-moe 999 --flash-attn on

# 4. Point little-coder at it
cd /path/to/little-coder && pip install -e .
python little_coder.py --model llamacpp/qwen3.6-35b-a3b
# or for the 9B backup:  --model llamacpp/qwen3.5-9b

Set LLAMACPP_BASE_URL=http://localhost:8888/v1 if you run the server on a different host or port.
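A provider might resolve that base URL roughly like this (a sketch; the actual lookup lives in providers.py and may differ):

```python
import os

def llamacpp_base_url():
    # LLAMACPP_BASE_URL overrides the default local llama-server address
    return os.environ.get("LLAMACPP_BASE_URL", "http://localhost:8888/v1")
```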


Supported models

via llama.cpp (new in v0.0.3)

  • Qwen3.6-35B-A3B (22 GB, Q4_K_M): Sparse MoE, 35B total / 3B active — runs at ~38 tok/s on an 8 GB laptop GPU with --n-cpu-moe 999. Passes tasks that Qwen3.5 9B fails (e.g. book-store) on the first attempt.
  • Qwen3.5-9B (5.3 GB, Q4_K_M): Dense 9.7B, the same model used for the v0.0.2 headline benchmark.

via Ollama

  • Qwen3.5 (default, 6.6 GB): 9.7B, thinking + tools; the model behind the headline 45.56%
  • Gemma4:e4b (9.6 GB): 8B, vision + audio capable
  • Qwen3:8b (5.2 GB): 8.2B, thinking + tools
  • Gemma3:4b (3.3 GB): 4B, 8K context, needs all optimizations
  • Llama 3.2:3b (~2 GB): 3B, tight context
  • Phi4-mini (~3 GB): 16K context
  • Any cloud model (Claude, GPT-4, Gemini): small-model optimizations auto-disabled
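The "auto-disabled" behavior implies per-model gating. A hypothetical sketch of how such profiles might be keyed (the real per-model profiles live in local/config.py; the context sizes and dict shape here are invented for illustration):

```python
# Hypothetical profile table; real profiles live in local/config.py
PROFILES = {
    "ollama/qwen3.5":   {"context": 32768, "small_model_opts": True},
    "anthropic/claude": {"context": 200000, "small_model_opts": False},
}

def profile_for(model):
    """Look up a model profile; unknown local models default to the
    small-model path, cloud models to the vanilla path."""
    if model in PROFILES:
        return PROFILES[model]
    is_local = model.startswith(("ollama/", "llamacpp/"))
    return {"context": 8192, "small_model_opts": is_local}
```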

CLI reference

python little_coder.py [options]
  --model MODEL        Set the model (e.g. ollama/qwen3.5)
  --permission-mode    auto | accept-all | manual | plan

Key slash commands

Command Description
/model <name> Switch model
/context Show current context usage + message projection
/compact Summarize old messages to free up context
/commit Review and commit changes
/review Code review with structured feedback
/skills List available skills
/memory View persistent memories
/voice Voice input mode
/help Full command reference

Repo layout

little_coder.py          # REPL, slash commands, rendering
agent.py                 # Core agent loop with small-model adaptations
providers.py             # Multi-provider streaming (Ollama, llama.cpp, Anthropic, OpenAI-compat)
tools.py                 # 8 core tools + Write-vs-Edit invariant
tool_registry.py         # Tool registration and dispatch
context.py               # System prompt builder (base + skills + knowledge)
config.py                # Configuration management
compaction.py            # Context window management
workspace.py             # Workspace introspection helpers
memory.py                # Persistent file-based memory

local/                   # Small-model preprocessing pipeline
├── config.py            # Per-model profiles (context, tokens, budgets)
├── skill_augment.py     # Tool-skill selection and injection
├── knowledge_augment.py # Domain-knowledge selection and injection
├── context_manager.py   # Prompt compression and message pruning
├── quality.py           # Empty / hallucinated / looped response detection
├── output_parser.py     # Text-based tool-call extraction + JSON repair
└── deliberate.py        # Parallel reasoning branches

skill/
├── tools/               # Tool usage guidance (8 files)
├── knowledge/           # Algorithm + domain reference (13 files)
├── loader.py            # Skill file parser
├── executor.py          # Skill execution (inline/fork)
└── builtin.py           # Built-in slash skills

benchmarks/
├── aider_polyglot.py              # Multi-language benchmark harness
├── polyglot_status.py             # Status dashboard for running benchmarks
├── smoke_test_langs.sh            # Reference-solution smoke test per language
└── results_full_polyglot*.json    # Per-exercise results from full runs
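As a taste of the local/ pipeline, the text-based tool-call extraction and JSON repair that output_parser.py is responsible for might look roughly like this (a sketch under assumed behavior, not the actual implementation): grab the first {...} span in the model's output and fix the JSON mistakes small models make most often.

```python
import json
import re

def extract_tool_call(text):
    """Sketch of text-based tool-call extraction with JSON repair (the real
    logic in local/output_parser.py is more thorough): find the first {...}
    span, then repair trailing commas and single-quoted keys if needed."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    raw = match.group(0)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = re.sub(r",\s*([}\]])", r"\1", raw)  # drop trailing commas
        repaired = repaired.replace("'", '"')          # single -> double quotes
        try:
            return json.loads(repaired)
        except json.JSONDecodeError:
            return None
```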

Further reading

  • docs/whitepaper.md — the white paper. Motivation, design philosophy (intern, not senior engineer), methodology, full results, leaderboard comparison, integrity audit, limitations. Start here.
  • docs/benchmark-reproduction.md — two-run reproduction report with per-language statistics, tool-use analysis, intervention metrics, and the runner-degradation investigation.
  • docs/benchmark-baseline-aider.md — vanilla Aider + Qwen3.5 baseline (19.1%) for scaffold-ablation comparison.
  • docs/architecture.md — deep internals for contributors: module dependency graph, tool registry API, skill loader structure.

Citation

If you reference little-coder or its Aider Polyglot result in academic work, please cite the white paper:

@misc{inbar2026littlecoder,
  title        = {little-coder: A Coding Agent Optimized for Small Local Language Models},
  subtitle     = {Architectural Adaptation Lets a 9.7B Model Outperform Frontier Models on Aider Polyglot},
  author       = {Inbar, Itay},
  year         = {2026},
  month        = apr,
  howpublished = {\url{https://github.com/itayinbarr/little-coder/blob/main/docs/whitepaper.md}},
  note         = {White paper}
}

Plain-text form:

Inbar, I. (2026). little-coder: A Coding Agent Optimized for Small Local Language Models. White paper. https://github.com/itayinbarr/little-coder/blob/main/docs/whitepaper.md


Attribution

little-coder is a derivative work based on CheetahClaws / ClawSpring by SafeRL-Lab, licensed under Apache 2.0. The upstream project provided the foundational agent architecture, tool system, multi-provider support, and REPL interface.

little-coder adds significant new systems for small-model optimization: skill-augmented tool use, domain-knowledge injection, workspace awareness, thinking-budget enforcement with reasoning reuse, the Write-vs-Edit tool invariant, model-specific profiles for Qwen3.5 and Gemma4, and a full multi-language benchmark harness.


License

Apache 2.0 — see LICENSE for details.

