little-coder

A Claude Code-inspired CLI coding agent, heavily optimized for small models that run on any modern consumer laptop.

little-coder takes the architecture of a cloud-powered coding assistant and makes it work with 5–25 GB local models served via Ollama or llama.cpp, through skill-augmented tool use, domain-knowledge injection, workspace-aware context discovery, a Write-vs-Edit tool invariant, and a thinking-budget system that prevents reasoning models from hanging while preserving their partial insights.
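The thinking-budget idea mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: the chunk format `(kind, text)`, the whitespace-based token estimate, and the names `run_with_thinking_budget`, `BudgetedResult`, and `retry_prompt` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class BudgetedResult:
    reasoning: str         # reasoning collected so far (possibly partial)
    answer: str            # answer text; empty if the budget ran out first
    budget_exceeded: bool  # True when the stream was cut off early


def run_with_thinking_budget(
    chunks: Iterable[Tuple[str, str]], max_thinking_tokens: int
) -> BudgetedResult:
    """Consume (kind, text) chunks, stopping once reasoning exceeds the budget.

    Assumption: kind is "thinking" or "answer", and tokens are approximated
    by whitespace-separated words rather than a real tokenizer.
    """
    reasoning, answer = [], []
    spent = 0
    for kind, text in chunks:
        if kind == "thinking":
            spent += len(text.split())
            reasoning.append(text)
            if spent > max_thinking_tokens:
                # Stop reading the stream, but keep what was reasoned so far.
                return BudgetedResult("".join(reasoning), "", True)
        else:
            answer.append(text)
    return BudgetedResult("".join(reasoning), "".join(answer), False)


def retry_prompt(partial_reasoning: str) -> str:
    """Build a follow-up prompt that reuses the truncated reasoning.

    The wording is hypothetical.
    """
    return (
        "Your earlier reasoning was cut off. Notes so far:\n"
        f"{partial_reasoning}\n"
        "Answer now without further deliberation."
    )
```

In this sketch the caller checks `budget_exceeded` and, if it is set, sends `retry_prompt(result.reasoning)` as the next turn, so the model does not start from scratch.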

Headline result: ollama/qwen3.5 (9.7B, 6.6 GB) + little-coder scores a mean of 45.56% across two full runs of the 225-exercise Aider Polyglot benchmark, running on a consumer laptop with no network calls. On the public leaderboard, that sits above gpt-4.5-preview (44.9%) and gpt-oss-120b high (41.8%). A vanilla Aider baseline with the same model reaches 19.11%.

The full narrative (motivation, design, methodology, results, leaderboard comparison, integrity audit, and limitations) is in the white paper at https://itayinbarr.substack.com/p/honey-i-shrunk-the-coding-agent. This README is the quick tour: what it looks like, how to run it, and how the repo is laid out. For why the design is the way it is or what the numbers mean, read the paper.

What it looks like

Before each prompt, the status line shows how much of the context window you have used and projects how many more messages you can send before a new session is recommended. The color zones start at 70% (yellow) and 85% (red), matching the threshold that triggers automatic compaction.
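The status-line arithmetic described above can be sketched as follows. Only the 70% and 85% thresholds come from this README. The projection formula (tokens left before the red zone divided by the average tokens per turn so far) and the names `context_status` and `ContextStatus` are assumptions, not the project's actual code.

```python
from dataclasses import dataclass

YELLOW_AT = 0.70  # from the README: yellow zone starts at 70%
RED_AT = 0.85     # from the README: red zone starts at 85%


@dataclass
class ContextStatus:
    used_fraction: float
    zone: str
    messages_left: int


def context_status(used_tokens: int, window_tokens: int, turns_so_far: int) -> ContextStatus:
    """Classify context usage and project how many more messages fit.

    Assumption: the projection divides the tokens left before the red
    threshold by the average tokens consumed per turn so far. With no
    turns yet there is nothing to average, so the projection is 0.
    """
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    used_fraction = used_tokens / window_tokens
    if used_fraction >= RED_AT:
        zone = "red"
    elif used_fraction >= YELLOW_AT:
        zone = "yellow"
    else:
        zone = "green"
    avg_per_turn = used_tokens / turns_so_far if turns_so_far > 0 else 0
    budget_left = max(0, int(window_tokens * RED_AT) - used_tokens)
    messages_left = int(budget_left // avg_per_turn) if avg_per_turn > 0 else 0
    return ContextStatus(used_fraction, zone, messages_left)
```

For example, with a 16384-token window, 8000 tokens used over 4 turns lands in the green zone with about 2 messages of headroom under these assumptions.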

When you ask little-coder to implement something, the agent uses the workspace-awareness skill to discover any spec file (.docs/instructions.md, AGENTS.md, CLAUDE.md, README.md), reads the stub, and then Edits it in place. On the occasions it tries to Write over an existing file, the tool-level guard refuses and hands the agent the exact Edit recipe for the same path.
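The two behaviors above, spec discovery and the Write-vs-Edit guard, can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `tools.py` or `workspace.py`. The candidate file names come from this README, while the lookup order, the result dictionary shape, the `Edit(...)` hint wording, and the function names `discover_spec` and `guarded_write` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Candidate spec files named in the README. Checking them in this order
# is an assumption.
SPEC_CANDIDATES = (".docs/instructions.md", "AGENTS.md", "CLAUDE.md", "README.md")


def discover_spec(workspace: Path) -> Optional[Path]:
    """Return the first candidate spec file that exists in the workspace."""
    for name in SPEC_CANDIDATES:
        candidate = workspace / name
        if candidate.is_file():
            return candidate
    return None


def guarded_write(path: Path, content: str) -> dict:
    """Create a new file, but refuse to overwrite an existing one.

    On refusal, the result carries a hint describing the Edit call to make
    instead. The result shape and hint format are hypothetical.
    """
    if path.exists():
        return {
            "ok": False,
            "error": f"{path} already exists; Write is only for new files.",
            "hint": (
                f'Use Edit(path="{path}", old_string=<exact text to replace>, '
                "new_string=<new text>)"
            ),
        }
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return {"ok": True, "path": str(path)}
```

Returning the refusal as data rather than raising an exception lets the agent loop pass the hint straight back to the model as the tool result.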

All four screenshots are real Rich-rendered SVG exports regenerated from a local generator script — they update in sync with the codebase.

Quick start

Option A — Ollama (simplest)

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Pull a model
ollama pull qwen3.5

# 3. Clone + install little-coder
git clone https://github.com/itayinbarr/little-coder.git
cd little-coder
pip install -e .

# 4. Run
python little_coder.py
# Then in the REPL:  /model ollama/qwen3.5

Option B — llama.cpp (fastest, supports MoE models like Qwen3.6-35B-A3B)

# 1. Build llama.cpp with CUDA (sm_XXX matches your GPU; Blackwell = 120)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DLLAMA_CURL=ON
cmake --build build --config Release -j

# 2. Fetch a GGUF (example: Qwen3.6-35B-A3B Q4_K_M, 22 GB)
pip install -U "huggingface_hub[cli]"
hf download unsloth/Qwen3.6-35B-A3B-GGUF Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
   --local-dir ~/models

# 3. Serve it (MoE trick: keep experts in RAM, attention on GPU)
build/bin/llama-server -m ~/models/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
   --host 127.0.0.1 --port 8888 --jinja \
   -c 16384 -ngl 99 --n-cpu-moe 999 --flash-attn on

# 4. Point little-coder at it
cd /path/to/little-coder && pip install -e .
python little_coder.py --model llamacpp/qwen3.6-35b-a3b
# or for the 9B backup:  --model llamacpp/qwen3.5-9b

If you run the server on a different host or port, set LLAMACPP_BASE_URL to its OpenAI-compatible endpoint, in the same form as http://localhost:8888/v1.
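To check that the address in LLAMACPP_BASE_URL is reachable before starting little-coder, a small script along these lines may help. It is a sketch: the fallback default mirrors the llama-server command above and is an assumption about little-coder's behavior, and the helper names `resolve_base_url` and `list_models` are hypothetical. The `GET /v1/models` route is part of llama-server's OpenAI-compatible API.

```python
import json
import os
import urllib.request
from typing import List, Mapping, Optional

# Assumption: mirrors the host and port of the llama-server command above.
DEFAULT_BASE_URL = "http://127.0.0.1:8888/v1"


def resolve_base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured base URL without a trailing slash."""
    env = os.environ if env is None else env
    return env.get("LLAMACPP_BASE_URL", DEFAULT_BASE_URL).rstrip("/")


def list_models(base_url: str, timeout: float = 5.0) -> List[str]:
    """Ask the server which models it serves via GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload.get("data", [])]
```

Calling `list_models(resolve_base_url())` with the server running should return the loaded model's id. A connection failure surfaces as `urllib.error.URLError`, a subclass of `OSError`.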

Supported models

via llama.cpp (new in v0.0.3)

Model	Size	Notes
Qwen3.6-35B-A3B	22 GB (Q4_K_M)	Sparse MoE, 35B total / 3B active — runs at ~38 tok/s on an 8 GB laptop GPU with `--n-cpu-moe 999`. Passes tasks that Qwen3.5 9B fails (e.g. `book-store`) on the first attempt.
Qwen3.5-9B	5.3 GB (Q4_K_M)	Dense 9.7B, same model used for the v0.0.2 headline benchmark.

via Ollama

Model	Size	Notes
Qwen3.5 (default)	6.6 GB	9.7B, thinking + tools, the model the headline 45.56% is from
Gemma4:e4b	9.6 GB	8B, vision + audio capable
Qwen3:8b	5.2 GB	8.2B, thinking + tools
Gemma3:4b	3.3 GB	4B, 8K context, needs all optimizations
Llama 3.2:3b	~2 GB	3B, tight context
Phi4-mini	~3 GB	16K context
Any cloud model	—	Claude, GPT-4, Gemini — small-model optimizations auto-disabled

CLI reference

python little_coder.py [options]
  --model MODEL        Set the model (e.g. ollama/qwen3.5)
  --permission-mode    auto | accept-all | manual | plan

Key slash commands

Command	Description
`/model <name>`	Switch model
`/context`	Show current context usage + message projection
`/compact`	Summarize old messages to free up context
`/commit`	Review and commit changes
`/review`	Code review with structured feedback
`/skills`	List available skills
`/memory`	View persistent memories
`/voice`	Voice input mode
`/help`	Full command reference

Repo layout

little_coder.py          # REPL, slash commands, rendering
agent.py                 # Core agent loop with small-model adaptations
providers.py             # Multi-provider streaming (Ollama, llama.cpp, Anthropic, OpenAI-compat)
tools.py                 # 8 core tools + Write-vs-Edit invariant
tool_registry.py         # Tool registration and dispatch
context.py               # System prompt builder (base + skills + knowledge)
config.py                # Configuration management
compaction.py            # Context window management
workspace.py             # Workspace introspection helpers
memory.py                # Persistent file-based memory

local/                   # Small-model preprocessing pipeline
├── config.py            # Per-model profiles (context, tokens, budgets)
├── skill_augment.py     # Tool-skill selection and injection
├── knowledge_augment.py # Domain-knowledge selection and injection
├── context_manager.py   # Prompt compression and message pruning
├── quality.py           # Empty / hallucinated / looped response detection
├── output_parser.py     # Text-based tool-call extraction + JSON repair
└── deliberate.py        # Parallel reasoning branches

skill/
├── tools/               # Tool usage guidance (8 files)
├── knowledge/           # Algorithm + domain reference (13 files)
├── loader.py            # Skill file parser
├── executor.py          # Skill execution (inline/fork)
└── builtin.py           # Built-in slash skills

benchmarks/
├── aider_polyglot.py              # Multi-language benchmark harness
├── polyglot_status.py             # Status dashboard for running benchmarks
├── smoke_test_langs.sh            # Reference-solution smoke test per language
└── results_full_polyglot*.json    # Per-exercise results from full runs

Citation

If you reference little-coder or its Aider Polyglot result in academic work, please cite the white paper:

@misc{inbar2026littlecoder,
  title        = {little-coder: A Coding Agent Optimized for Small Local Language Models},
  subtitle     = {Architectural Adaptation Lets a 9.7B Model Outperform Frontier Models on Aider Polyglot},
  author       = {Inbar, Itay},
  year         = {2026},
  month        = apr,
  howpublished = {\url{https://github.com/itayinbarr/little-coder/blob/main/docs/whitepaper.md}},
  note         = {White paper}
}

Plain-text form:

Inbar, I. (2026). little-coder: A Coding Agent Optimized for Small Local Language Models. White paper. https://github.com/itayinbarr/little-coder/blob/main/docs/whitepaper.md

Attribution

little-coder is a derivative work based on CheetahClaws / ClawSpring by SafeRL-Lab, licensed under Apache 2.0. The upstream project provided the foundational agent architecture, tool system, multi-provider support, and REPL interface.

little-coder adds significant new systems for small-model optimization: skill-augmented tool use, domain-knowledge injection, workspace awareness, thinking-budget enforcement with reasoning reuse, the Write-vs-Edit tool invariant, model-specific profiles for Qwen3.5 and Gemma4, and a full multi-language benchmark harness.

License

Apache 2.0 — see LICENSE for details.

Release history

Version	Changes	Urgency	Date
v1.8.2	### Fixed - Minimal user `models.json` entries no longer crash startup with `Cannot read properties of undefined (reading 'input')` ([#36](https://github.com/itayinbarr/little-coder/issues/36)). The shipped `models.json` declares every field — `id`, `name`, `reasoning`, `input`, `contextWindow`, `maxTokens`, `cost` — but a user override that omitted e.g. `name`/`maxTokens`/`cost` was passed through unchanged to pi's registry, which then exploded deep in `applyModelOverride` when it tried to	High	5/30/2026
v1.8.1	### Fixed - `glob` no longer exhausts memory on a recursive search from a huge root. The tool capped matches at 500 but never bounded the walk: run from a home directory (or any tree with macOS `Library`, caches, or `node_modules`), `fs.glob` recursively descended everything and its internal traversal state grew until the Node process ran out of heap — a host-memory crash (`Ineffective mark-compacts near heap limit`), entirely distinct from the model's context window (the read-gua	High	5/23/2026
v1.4.3	Follow-up to v1.4.2: clean up two cosmetic regressions that the @earendil-works scope migration surfaced. ### Fixed - Pi's `What's New` block no longer appears inside little-coder's TUI after a version bump. Root cause: pi's interactive mode reads its own bundled `CHANGELOG.md` on startup and renders every entry strictly newer than the `lastChangelogVersion` field in `~/.pi/agent/settings.json` (`interactive-mode.js:getChangelogForDisplay`). v1.4.2 jumped the bundled pi from 0.68.1 to 0.75	High	5/19/2026
v1.2.0	Issue-cleanup release that also ships built-in LM Studio support. Closes [#17](https://github.com/itayinbarr/little-coder/issues/17) (Windows), [#19](https://github.com/itayinbarr/little-coder/issues/19) (phantom Agent tool), [#21](https://github.com/itayinbarr/little-coder/issues/21) (skill param mismatch). ### Added - Built-in `lmstudio/local-model` provider. [LM Studio](https://lmstudio.ai/) exposes an OpenAI-compatible server on `http://127.0.0.1:1234/v1` by default, and previously the	High	5/13/2026
v1.1.0	Issue-cleanup release. Three small features and one bug fix, driven by GitHub issues #12 / #13 / #15 / #16. ### Added - `models.json` is now the canonical provider registration. ([#13](https://github.com/itayinbarr/little-coder/issues/13)) Previously `.pi/extensions/llama-cpp-provider/index.ts` hardcoded the model list and `models.json` was decorative; editing it had no effect. Now the extension loads providers and models from `models.json` at startup and registers them dynamically. **Us	High	5/3/2026
v1.0.3	README and install.sh now lead with `little-coder --model llamacpp/qwen3.6-35b-a3b` as the canonical example. That's the configuration little-coder is tuned for: small local model + custom scaffolding. Cloud models (Anthropic, OpenAI) move into the secondary list. No code changes — purely a docs change. ## Update ``` npm install -g little-coder@1.0.3 ``` (Or wait for the in-launcher prompt next time you run `little-coder`.)	High	4/28/2026
main@2026-04-23	Latest activity on main branch	High	4/23/2026
v0.0.4	Latest release: v0.0.4	High	4/21/2026
main@2026-04-21	Latest activity on main branch	High	4/21/2026
v0.0.3	Tag v0.0.3	High	4/21/2026
