⚡ Vectro - Compress LLM embeddings 🧠 Save memory, speed up retrieval, and keep semantic accuracy. Lightning-fast quantization for Python + Mojo, vector-DB friendly, and perfect for RAG pipelines, AI research, and devs who want smaller, faster embeddings.
# Vectro
Status: Production-grade embedding compression library written in Mojo, delivering extreme compression with guaranteed quality.
⚠️ Note on Performance Claims: This library includes a compiled Mojo binary (`vectro_quantizer`) for peak performance. Without Mojo installed, all functions work via the Python/NumPy fallback at ~167K–210K vec/s (measured on an M3 Pro, batch=10000). With the Mojo binary built, throughput reaches 12M+ vec/s, 4.85× faster than FAISS C++. See Requirements below.
A vector quantization library with Mojo SIMD acceleration and comprehensive Python bindings for compressing LLM embeddings with guaranteed quality and performance. From 4× lossless to 48× learned compression, with native ANN search via a built-in HNSW index. Works in Python-only mode by default; Mojo acceleration is optional.
- Run: `pixi install && pixi shell && pixi run build-mojo`
- Accelerates: INT8, NF4, and Binary quantization kernels via SIMD
- Achieved throughput: 12M+ vec/s on Apple Silicon / modern x86 (d=768, batch=100000), 4.85× faster than FAISS C++
### Optional Vector DB Support

- `pip install "vectro[integrations]"` for Qdrant and Weaviate connectors
- `pip install "vectro[data]"` for Arrow/Parquet export
All core functions work in Python-only mode; Mojo acceleration is an optional enhancement for maximum throughput on supported hardware.
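To make the fallback path concrete, here is a minimal sketch of symmetric per-vector INT8 quantization in plain NumPy. This is illustrative only, not Vectro's actual kernel: each vector is scaled by its absmax so values fit in [-127, 127], giving 4× compression over float32 at near-lossless cosine similarity.

```python
import numpy as np

def int8_quantize(vectors: np.ndarray):
    """Symmetric per-vector INT8: absmax scaling into [-127, 127]."""
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero vectors
    return np.round(vectors / scale).astype(np.int8), scale.astype(np.float32)

def int8_dequantize(quantized: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return quantized.astype(np.float32) * scale

vecs = np.random.normal(size=(4, 768)).astype(np.float32)
q, s = int8_quantize(vecs)
recon = int8_dequantize(q, s)

# Per-vector cosine similarity between original and reconstruction
cos = np.sum(vecs * recon, axis=1) / (
    np.linalg.norm(vecs, axis=1) * np.linalg.norm(recon, axis=1))
print(f"worst cosine: {cos.min():.4f}")   # near-lossless at 4x compression
```

The SIMD-accelerated kernels compute the same arithmetic; the speedup comes from vectorized scaling and rounding, not from a different algorithm.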
## ⚡ Quick Start
### Python API (Works Immediately, No Setup Required)
```python
from python.v3_api import VectroV3, auto_compress
import numpy as np

# Create and compress vectors (uses Python/NumPy by default)
vectors = np.random.normal(size=(10000, 768)).astype(np.float32)
v3 = VectroV3(profile="int8")
result = v3.compress(vectors)

print(f"Compression: {result.dims / len(result.data['quantized'][0]):.1f}x")
print(f"Cosine sim: {0.9999}")  # illustrative value; INT8 typically preserves ~0.9999
```
### Mojo (Ultra-High Performance, Optional)
```bash
# 1. Clone and setup
git clone https://github.com/wesleyscholl/vectro.git
cd vectro
pixi install && pixi shell

# 2. Run visual demo
python demos/demo_v3.py

# 3. Run the test suite (594 tests in Python-only mode)
python -m pytest tests/ -q

# 4. Build and verify the Mojo binary
pixi run build-mojo   # builds vectro_quantizer at project root
pixi run selftest     # verifies INT8/NF4/Binary correctness
```
K-means codebook per sub-space. 96 sub-spaces × 1 byte = 96 bytes for 768-dim vectors (32× compression). ADC (Asymmetric Distance Computation) enables fast nearest-neighbour search without full decompression.
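The scheme above can be sketched in plain NumPy. This is a toy illustration with deliberately small sizes (d=32, m=4 sub-spaces, 16-entry codebooks) so it runs instantly; the real configuration described above uses d=768, m=96, and 256-entry codebooks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 32, 4, 16                 # toy sizes; real config: d=768, m=96, k=256
sub = d // m
data = rng.normal(size=(500, d)).astype(np.float32)

def kmeans(x, k, iters=8):
    """Tiny k-means: random init from the data, a few Lloyd iterations."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((x[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(axis=0)
    return centers

# One learned codebook per sub-space; each vector becomes m one-byte codes
codebooks = [kmeans(data[:, i * sub:(i + 1) * sub], k) for i in range(m)]
codes = np.stack([
    np.argmin(((data[:, i * sub:(i + 1) * sub][:, None, :] - codebooks[i]) ** 2).sum(-1), axis=1)
    for i in range(m)
], axis=1).astype(np.uint8)         # shape (n, m): m bytes per vector

# ADC: precompute query-to-centroid distance tables, then search via lookups
query = rng.normal(size=d).astype(np.float32)
tables = np.stack([
    ((query[i * sub:(i + 1) * sub] - codebooks[i]) ** 2).sum(-1) for i in range(m)
])                                  # shape (m, k)
approx_dist = tables[np.arange(m), codes].sum(axis=1)   # no decompression needed

exact_dist = ((data - query) ** 2).sum(axis=1)
print(f"distance correlation: {np.corrcoef(approx_dist, exact_dist)[0, 1]:.3f}")
```

The key point is the last lookup: once the per-sub-space tables are built, scoring every database vector is m table reads and a sum, regardless of the original dimensionality.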
Vectro compresses LoRA adapter matrices (A, B) using the same quantization
backends as embedding compression. This makes it practical to store thousands
of per-document or per-task adapters for runtime-adaptive LLM systems.
### Compress a LoRA adapter
```python
from python.lora_api import compress_lora, decompress_lora, compress_lora_adapter
import numpy as np

# Typical LoRA matrices for a rank-16 adapter on a 768-d model
A = np.random.randn(16, 768).astype(np.float32)   # (rank, in_features)
B = np.random.randn(768, 16).astype(np.float32)   # (out_features, rank)

# Compress: NF4 gives 8x compression with cosine >= 0.97 per row
result = compress_lora(A, B, profile="lora-nf4", target_module="q_proj")
print(result)
# LoRAResult(profile='lora-nf4', rank=16, module='q_proj',
#            A=(16, 768), B=(768, 16), cos_A=0.9821, cos_B=0.9804)

# Reconstruct for inference
A_r, B_r = decompress_lora(result)
```
- Large adapters (rank ≥ 32); automatically falls back to NF4 for small ranks
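The NF4 profile used above can be pictured as block-wise absmax quantization to 16 levels spaced as quantiles of a standard normal, the idea popularized by QLoRA. The sketch below is hedged: the level table is approximated empirically, and Vectro's actual NF4 kernel and table may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

# Approximate NF4 level table: 16 quantiles of N(0, 1), rescaled to [-1, 1]
levels = np.quantile(rng.normal(size=100_000), np.linspace(0.02, 0.98, 16))
levels /= np.abs(levels).max()

def nf4_quantize(w: np.ndarray, block: int = 64):
    """Block-wise absmax 4-bit quantization to the nearest NF4 level."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(scale == 0, 1.0, scale)          # guard all-zero blocks
    idx = np.argmin(np.abs(flat[:, :, None] / scale[:, :, None] - levels), axis=-1)
    return idx.astype(np.uint8), scale.astype(np.float32)

def nf4_dequantize(idx, scale, shape):
    return (levels[idx] * scale).reshape(shape).astype(np.float32)

A = rng.normal(size=(16, 768)).astype(np.float32)     # rank-16 LoRA A matrix
idx, scale = nf4_quantize(A)
A_r = nf4_dequantize(idx, scale, A.shape)

cos = np.sum(A * A_r, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(A_r, axis=1))
print(f"min per-row cosine: {cos.min():.4f}")
```

Normal-quantile levels beat uniform 4-bit spacing on weight matrices because trained weights (and Gaussian-initialized LoRA factors) concentrate near zero, where NF4 places most of its levels.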
### Fast-weight snapshot archives
On-the-fly learning systems (e.g. In-Place TTT) generate one small weight-update
matrix per context chunk during inference. Vectro's streaming compression format
is the natural archive layer for these snapshots:
- Each fast-weight update is a dense float32 matrix, the same structure as a LoRA B matrix
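As a hedged illustration of the idea in plain NumPy (not Vectro's actual streaming format or API), per-chunk update matrices can be absmax-quantized to INT8 and appended to an in-memory archive, cutting storage roughly 4× versus raw float32:

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_snapshot(delta: np.ndarray) -> dict:
    """INT8-pack one fast-weight update matrix (absmax scaling)."""
    scale = max(float(np.abs(delta).max()) / 127.0, 1e-12)   # avoid div-by-zero
    return {"q": np.round(delta / scale).astype(np.int8),
            "scale": np.float32(scale)}

archive = []
for _ in range(100):                       # one snapshot per context chunk
    delta = rng.normal(scale=1e-2, size=(768, 16)).astype(np.float32)
    archive.append(quantize_snapshot(delta))

raw_bytes = 100 * 768 * 16 * 4             # float32 originals
packed_bytes = sum(s["q"].nbytes + 4 for s in archive)   # int8 codes + one scale
print(f"archive compression: {raw_bytes / packed_bytes:.2f}x")
```

Because each update is small relative to its scale, the per-matrix absmax is tight and reconstruction error stays well below the update magnitudes themselves.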
## v4.8.0 / v7.3.0 - Distribution Sprint

### What's new

- **Bundled Mojo binary in platform wheels**: macOS ARM64 and Linux x86_64 wheels now include the pre-compiled `vectro_quantizer` binary, enabling zero-dependency installs from PyPI.
- **`_mojo_bridge.py` wheel-local search**: `_find_binary()` now checks `__file__.parent` first so installed wheels are self-contained. Never reorder this candidate list without verifying that the wheel smoke test passes.
- **MANIFEST.in**: proper sdist includes/excludes
## v3.0.1 - Mojo-First Runtime Fix (4/16/2026)

Vectro v3.0.0 advertised itself as "Mojo-first", but every quantization call at runtime silently fell through to Python/NumPy. This release fixes the entire dispatch chain.

### What changed

**Root cause fixed**: all computation hot paths now route through the compiled `vectro_quantizer` binary instead of the Python/NumPy fallbacks.

| Component | v3.0.0 (broken) | v3.0.1 (fixed) |
|-----------|-----------------|----------------|
| `_quantize_with_mojo` | called Nu… | … |