
vectro


Description

⚡💾 Vectro — Compress LLM embeddings 🧠🚀 Save memory, speed up retrieval, and keep semantic accuracy 🎯✨ Lightning-fast quantization for Python + Mojo, vector DB friendly 🗄️, and perfect for RAG pipelines, AI research, and devs who want smaller, faster embeddings 📊💡

README

โš ๏ธ Note on Performance Claims: This library includes a compiled Mojo binary (vectro_quantizer) for peak performance. Without Mojo installed, all functions work via Python/NumPy fallback at ~167Kโ€“210K vec/s (measured on M3 Pro, batch=10000). With the Mojo binary built, throughput reaches 12M+ vec/s โ€” 4.85ร— faster than FAISS C++. See Requirements below.

⚡ INT8 · NF4 · PQ-96 · Binary · HNSW · RQ · AutoQuantize · VQZ

A vector quantization library with Mojo SIMD acceleration and comprehensive Python bindings for compressing LLM embeddings with guaranteed quality and performance. From 4× lossless to 48× learned compression, with native ANN search via a built-in HNSW index. Works in Python-only mode by default—Mojo acceleration is optional.

Requirements • Quick Start • Python API • v3 Features • Benchmarks • Vector DBs • Docs


Vectro v3 demo


โš ๏ธ Requirements

Python-Only Mode (Works Everywhere)

  • Python 3.10+
  • NumPy
  • For INT8 throughput benefits: squish_quant Rust extension (auto-installed, optional)
  • Achieved throughput: ~167K–210K vec/s on Apple Silicon / modern x86 (d=768, batch=10000, measured)

Mojo-Accelerated Mode (Optional, for 12M+ vec/s)

  • Requires: pixi (available at modular.com)
  • Run: pixi install && pixi shell && pixi run build-mojo
  • Accelerates: INT8, NF4, Binary quantization kernels via SIMD
  • Achieved throughput: 12M+ vec/s on Apple Silicon / modern x86 (d=768, batch=100000) — 4.85× faster than FAISS C++

Optional Vector DB Support

  • pip install "vectro[integrations]" for Qdrant, Weaviate connectors
  • pip install "vectro[data]" for Arrow/Parquet export

All core functions work in Python-only mode. Mojo acceleration is an optional enhancement for maximum throughput on supported hardware.


⚡ Quick Start

Python API (Works Immediately, No Setup Required)

from python.v3_api import VectroV3, auto_compress
import numpy as np

# Create and compress vectors (uses Python/NumPy by default)
vectors = np.random.normal(size=(10000, 768)).astype(np.float32)
v3 = VectroV3(profile="int8")
result = v3.compress(vectors)

restored = v3.decompress(result)
cos = float(np.mean(
    np.sum(vectors * restored, axis=1)
    / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(restored, axis=1))
))
print(f"Compression: {4 * result.dims / len(result.data['quantized'][0]):.1f}x")  # FP32 (4 bytes/dim) vs INT8 (1 byte/dim)
print(f"Cosine sim: {cos:.4f}")  # INT8 typically >= 0.9999

Mojo (Ultra-High Performance - Optional)

# 1. Clone and setup
git clone https://github.com/wesleyscholl/vectro.git
cd vectro
pixi install && pixi shell

# 2. Run visual demo
python demos/demo_v3.py

# 3. Run the test suite (594 tests in Python-only mode)
python -m pytest tests/ -q

# 4. Build and verify the Mojo binary
pixi run build-mojo   # builds vectro_quantizer at project root
pixi run selftest     # verifies INT8/NF4/Binary correctness

Python API (Easy Integration)

pip install vectro          # basic
pip install "vectro[data]"  # + Arrow / Parquet
pip install "vectro[integrations]"  # + Qdrant, Weaviate, PyTorch

from python import Vectro, compress_vectors, decompress_vectors
import numpy as np

vectors = np.random.randn(1000, 768).astype(np.float32)

# One-liner INT8 compression (4× ratio, cosine_sim >= 0.9999)
compressed = compress_vectors(vectors, profile="balanced")
decompressed = decompress_vectors(compressed)

# Full quality analytics
vectro = Vectro()
result, quality = vectro.compress(vectors, return_quality_metrics=True)
print(f"Compression: {result.compression_ratio:.2f}x")
print(f"Cosine sim:  {quality.mean_cosine_similarity:.5f}")
print(f"Grade:       {quality.quality_grade()}")

v3.0.0 New APIs

from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress

# --- Product Quantization: 32x compression ---
codebook = PQCodebook.train(training_vectors, n_subspaces=96)
v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors)          # 96 bytes per 768-dim vector
restored = v3.decompress(result)       # cosine_sim >= 0.95

# --- Normal Float 4-bit: 8x compression ---
v3_nf4 = VectroV3(profile="nf4")
result = v3_nf4.compress(vectors)      # cosine_sim >= 0.985

# --- Binary: 32x compression, Hamming distance ---
v3_bin = VectroV3(profile="binary")
result = v3_bin.compress(unit_normed_vectors)

# --- Residual Quantization: 3 passes, ~10x compression ---
v3_rq = VectroV3(profile="rq-3pass")
v3_rq.train_rq(training_vectors, n_subspaces=96)
result = v3_rq.compress(vectors)       # cosine_sim >= 0.98

# --- Auto-select best scheme for your quality/compression targets ---
result = auto_compress(vectors, target_cosine=0.97, target_compression=8.0)

# --- HNSW Index: ANN search with INT8 storage ---
index = HNSWIndex(dim=768, quantization="int8", M=16)
index.add_batch(vectors, ids=list(range(len(vectors))))
results = index.search(query, top_k=10)   # recall@10 >= 0.97

# --- VQZ storage (local or cloud) ---
v3.save(result, "embeddings.vqz")
v3.save(result, "s3://my-bucket/embeddings.vqz")   # requires fsspec[s3]
loaded = v3.load("embeddings.vqz")

๐Ÿ Python API

v3.0.0: All prior v2 capabilities plus seven new v3 modules.

Core Classes

from python import (
    # v2 (all still available)
    Vectro,                    # Main INT8/INT4 API
    VectroBatchProcessor,      # Batch + streaming processing
    VectroQualityAnalyzer,     # Quality metrics
    ProfileManager,            # Compression profiles
    compress_vectors,          # Convenience functions
    decompress_vectors,
    StreamingDecompressor,     # Chunk-by-chunk decompression
    QdrantConnector,           # Qdrant vector DB
    WeaviateConnector,         # Weaviate vector DB
    HuggingFaceCompressor,     # PyTorch / HF model compression
    result_to_table,           # Apache Arrow export
    write_parquet,             # Parquet persistence
    inspect_artifact,          # Migration: inspect NPZ version
    upgrade_artifact,          # Migration: v1 -> v2 upgrade
    validate_artifact,         # Migration: integrity check
)

# v3 additions
from python.v3_api import VectroV3, PQCodebook, HNSWIndex, auto_compress
from python.nf4_api import quantize_nf4, dequantize_nf4, quantize_mixed
from python.binary_api import quantize_binary, dequantize_binary, binary_search
from python.rq_api import ResidualQuantizer
from python.codebook_api import Codebook
from python.auto_quantize_api import auto_quantize
from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend

Profiles

| Profile  | Precision | Compression | Cosine Sim | Notes |
|----------|-----------|-------------|------------|-------|
| fast     | INT8      | 4x          | >= 0.9999  | Max throughput |
| balanced | INT8      | 4x          | >= 0.9999  | Default |
| quality  | INT8      | 4x          | >= 0.9999  | Tighter range |
| ultra    | INT4      | 8x          | >= 0.92    | Now GA in v3 |
| binary   | 1-bit     | 32x         | ~0.80 cosine* | Hamming + rerank |

*binary: direct cosine similarity ~0.80 on d=768; recall@10 ≥ 0.95 when combined with INT8 re-ranking
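The footnote's two-stage pipeline can be sketched in plain NumPy: a Hamming-distance shortlist over packed 1-bit codes, then a cosine re-rank on INT8-dequantized candidates. The function name, argument shapes, and shortlist size below are illustrative assumptions, not Vectro's API:

```python
import numpy as np

def binary_rerank_search(query, db_bits, db_int8, db_scales, top_k=10, shortlist=100):
    """Stage 1: Hamming shortlist over packed sign bits (XOR + popcount).
    Stage 2: cosine re-rank of the shortlist on INT8-dequantized vectors."""
    q_bits = np.packbits(query > 0)                    # (ceil(d/8),)
    xor = np.bitwise_xor(db_bits, q_bits)              # (n, ceil(d/8))
    hamming = np.unpackbits(xor, axis=1).sum(axis=1)   # popcount per row
    cand = np.argsort(hamming)[:shortlist]
    # Dequantize only the shortlist, then score exactly
    approx = db_int8[cand].astype(np.float32) * db_scales[cand, None]
    sims = approx @ query / (
        np.linalg.norm(approx, axis=1) * np.linalg.norm(query) + 1e-12
    )
    return cand[np.argsort(-sims)[:top_k]]
```

The cheap Hamming pass prunes nearly all candidates, and the INT8 re-rank restores ranking quality on the survivors.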

Quality Analysis

from python import VectroQualityAnalyzer

analyzer = VectroQualityAnalyzer()
quality = analyzer.evaluate_quality(original, decompressed)

print(f"Cosine similarity: {quality.mean_cosine_similarity:.5f}")
print(f"MAE:               {quality.mean_absolute_error:.6f}")
print(f"Quality grade:     {quality.quality_grade()}")
print(f"Passes 0.99:       {quality.passes_quality_threshold(0.99)}")

Batch Processing

from python import VectroBatchProcessor

processor = VectroBatchProcessor()
results = processor.quantize_streaming(million_vectors, chunk_size=10_000)
bench = processor.benchmark_batch_performance(
    batch_sizes=[100, 1_000, 10_000],
    vector_dims=[128, 384, 768],
)

File I/O

# Legacy NPZ format (v1/v2)
vectro.save_compressed(result, "embeddings.npz")
loaded = vectro.load_compressed("embeddings.npz")

# v3 VQZ format — ZSTD-compressed, checksummed, cloud-ready
from python.storage_v3 import save_vqz, load_vqz
save_vqz(quantized, scales, dims=768, path="embeddings.vqz", compression="zstd")
data = load_vqz("embeddings.vqz")

# Cloud backends (requires pip install fsspec[s3])
from python.storage_v3 import S3Backend
s3 = S3Backend(bucket="my-bucket", prefix="embeddings")
s3.save_vqz(quantized, scales, dims=768, remote_name="prod.vqz")

🧮 v3 Quantization Modes

INT8 — Lossless Foundation (Phase 0–1)

Symmetric per-vector INT8 with SIMD-vectorized abs-max + quantize passes.

v3 = VectroV3(profile="int8")
result = v3.compress(vectors)    # cosine_sim >= 0.9999, 4x compression
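For reference, the symmetric abs-max scheme can be written out in a few lines of NumPy. This is an illustrative re-implementation of the idea, not Vectro's SIMD kernel:

```python
import numpy as np

def int8_quantize(vectors):
    """Symmetric per-vector INT8: one abs-max-derived scale per row."""
    scales = np.abs(vectors).max(axis=1) / 127.0   # abs-max pass
    scales = np.where(scales == 0, 1.0, scales)    # guard all-zero rows
    q = np.clip(np.round(vectors / scales[:, None]), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def int8_dequantize(q, scales):
    return q.astype(np.float32) * scales[:, None]
```

A per-row scale bounds the rounding error at half a quantization step of that row's own range, which is why cosine similarity stays above 0.9999 on typical embeddings.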

NF4 — Normal Float 4-bit (Phase 2)

16 quantization levels at the quantiles of N(0,1) — 20% lower reconstruction error vs linear INT4 for normally-distributed transformer embeddings.

v3 = VectroV3(profile="nf4")
result = v3.compress(vectors)    # cosine_sim >= 0.985, 8x compression

# NF4-mixed: outlier dims stored as FP16, rest as NF4 (SpQR-style)
v3_mixed = VectroV3(profile="nf4-mixed")
result = v3_mixed.compress(vectors)   # cosine_sim >= 0.990, ~7.5x compression
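The construction can be illustrated directly: place 16 levels at the quantiles of N(0,1), rescale them to [-1, 1], and snap each abs-max-normalized component to its nearest level. The exact level table Vectro ships may differ; this is a sketch of the principle:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at the quantiles of the standard normal, rescaled to [-1, 1]
_nd = NormalDist()
NF4_LEVELS = np.array([_nd.inv_cdf((i + 0.5) / 16) for i in range(16)])
NF4_LEVELS /= np.abs(NF4_LEVELS).max()

def nf4_quantize(vectors):
    """Per-vector abs-max normalization, then nearest-of-16-levels codes."""
    scales = np.abs(vectors).max(axis=1, keepdims=True)
    normed = vectors / np.where(scales == 0, 1.0, scales)
    codes = np.abs(normed[..., None] - NF4_LEVELS).argmin(axis=-1).astype(np.uint8)
    return codes, scales

def nf4_dequantize(codes, scales):
    return NF4_LEVELS[codes] * scales
```

Because the levels are dense where normally-distributed components concentrate, the per-component error is smaller than with evenly spaced INT4 levels.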

Product Quantization (Phase 3)

K-means codebook per sub-space. 96 sub-spaces x 1 byte = 96 bytes for 768-dim vectors (32x compression). ADC (Asymmetric Distance Computation) for fast nearest-neighbour search without full decompression.

# Train codebook on representative sample
codebook = PQCodebook.train(training_vectors, n_subspaces=96, n_centroids=256)
codebook.save("codebook.vqz")

v3 = VectroV3(profile="pq-96", codebook=codebook)
result = v3.compress(vectors)    # cosine_sim >= 0.95, 32x compression

codebook48 = PQCodebook.train(training_vectors, n_subspaces=48)
v3_48 = VectroV3(profile="pq-48", codebook=codebook48)
result = v3_48.compress(vectors)  # ~16x compression
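The mechanics behind the codebook and ADC can be sketched with a toy NumPy trainer. The naive k-means and the small sub-space counts here are for illustration only and do not reflect PQCodebook's internals:

```python
import numpy as np

def train_pq(train, n_subspaces=4, n_centroids=16, iters=10, seed=0):
    """Toy PQ trainer: independent k-means per sub-space.
    Returns codebooks of shape (n_subspaces, n_centroids, sub_dim)."""
    rng = np.random.default_rng(seed)
    books = []
    for sub in np.split(train, n_subspaces, axis=1):
        cent = sub[rng.choice(len(sub), n_centroids, replace=False)]
        for _ in range(iters):
            assign = np.linalg.norm(sub[:, None] - cent[None], axis=2).argmin(axis=1)
            for k in range(n_centroids):
                members = sub[assign == k]
                if len(members):
                    cent[k] = members.mean(axis=0)
        books.append(cent)
    return np.stack(books)

def pq_encode(vectors, books):
    """One byte (centroid id) per sub-space and vector."""
    subs = np.split(vectors, len(books), axis=1)
    return np.stack(
        [np.linalg.norm(s[:, None] - b[None], axis=2).argmin(axis=1)
         for s, b in zip(subs, books)],
        axis=1,
    ).astype(np.uint8)

def adc_distances(query, codes, books):
    """ADC: one (n_centroids,) lookup table per sub-space, summed over
    the stored codes -- the database is never decompressed."""
    q_subs = np.split(query, len(books))
    tables = np.stack([((b - q[None]) ** 2).sum(axis=1)
                       for q, b in zip(q_subs, books)])       # (m, k)
    return tables[np.arange(len(books)), codes].sum(axis=1)   # (n,)
```

ADC turns each query into a handful of small lookup tables, so scoring a database code costs m table lookups instead of a full reconstruction.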

Binary Quantization (Phase 4)

sign(v) -> 1 bit, 8 dims packed per byte. Compatible with Matryoshka models. XOR+POPCOUNT Hamming distance is 25x faster than float dot product.

from python.binary_api import quantize_binary, matryoshka_encode

# Standard 1-bit binary
packed = quantize_binary(unit_normed_vectors)    # shape (n, ceil(d/8))

# Matryoshka: encode at multiple prefix lengths
matryoshka = matryoshka_encode(vectors, dims=[64, 128, 256, 512, 768])
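As a sketch of the prefix idea (an assumption about the construction, not matryoshka_encode's actual implementation): a sign code built from the first d dimensions is literally a byte prefix of the full code, so short codes can pre-filter before longer codes confirm:

```python
import numpy as np

def matryoshka_binary(vectors, dims=(64, 128, 256)):
    """1-bit sign codes at several Matryoshka prefix lengths;
    each shorter code is a byte prefix of every longer one."""
    return {d: np.packbits(vectors[:, :d] > 0, axis=1) for d in dims}
```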

HNSW Index (Phase 5)

Native ANN search with INT8-quantized internal storage. 38x memory reduction vs float32 (80 bytes vs 3072 per vector at d=768, M=16).

from python.v3_api import HNSWIndex

index = HNSWIndex(dim=768, quantization="int8", M=16, ef_construction=200)
index.add_batch(vectors)
indices, distances = index.search(query, top_k=10, ef=64)

# Persistence
index.save("hnsw.vqz")
index2 = HNSWIndex.load("hnsw.vqz")

GPU Acceleration (Phase 6)

Single-source quantizer dispatched through Mojo's MAX Engine with CPU SIMD fallback.

from python.gpu_api import gpu_available, gpu_device_info, quantize_int8_batch

if gpu_available():
    info = gpu_device_info()   # {"backend": "max_engine", "simd_width": 8, ...}
    result = quantize_int8_batch(vectors)

Learned Quantization (Phase 7)

Three data-adaptive methods for task-specific compression.

# Residual Quantization:  3-pass PQ, cosine_sim >= 0.98 at 10x compression
from python.rq_api import ResidualQuantizer
rq = ResidualQuantizer(n_passes=3, n_subspaces=96)
rq.train(training_vectors)
codes = rq.encode(vectors)
restored = rq.decode(codes)

# Autoencoder Codebook: 48x compression at cosine_sim >= 0.97
from python.codebook_api import Codebook
cb = Codebook(target_dim=64, hidden=128)
cb.train(training_vectors, epochs=50)
cb.save("codebook.pkl")
int8_codes = cb.encode(new_vectors)    # shape (n, 64)

# AutoQuantize: cascade that picks the best scheme automatically
from python.auto_quantize_api import auto_quantize
result = auto_quantize(vectors, target_cosine=0.97, target_compression=8.0)
# returns {"strategy": "nf4", "cosine_sim": 0.987, "compression": 8.1, ...}

VQZ Storage + Cloud (Phase 8)

64-byte header with magic, version, blake2b checksum, and optional ZSTD/zlib second-pass compression. Combined: INT8 (4x) x ZSTD (~1.6x) ~= 6.4x vs FP32.

from python.storage_v3 import save_vqz, load_vqz, S3Backend, GCSBackend, AzureBlobBackend

# Local
save_vqz(quantized, scales, dims=768, path="out.vqz", compression="zstd", level=3)
data = load_vqz("out.vqz")   # verifies checksum automatically

# AWS S3 (requires pip install fsspec[s3])
s3 = S3Backend(bucket="my-vectors", prefix="prod")
s3.save_vqz(quantized, scales, dims=768, remote_name="batch1.vqz")

# Google Cloud Storage
gcs = GCSBackend(bucket="my-vectors")
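The container described above can be sketched with the standard library alone. This is an illustrative stand-in: zlib instead of ZSTD, an invented magic string, and a guessed field order, so it is not the real VQZ byte layout:

```python
import hashlib
import struct
import zlib
import numpy as np

MAGIC = b"VQZ1"  # invented magic for this sketch

def save_container(path, quantized, scales, dims):
    """64-byte header (magic, version, flags, dims, count, blake2b digest)
    followed by a zlib-compressed INT8 + scales payload."""
    payload = zlib.compress(quantized.tobytes() + scales.astype(np.float32).tobytes(), 6)
    digest = hashlib.blake2b(payload, digest_size=32).digest()
    header = struct.pack("<4sHHQQ32s8x", MAGIC, 3, 0, dims, len(quantized), digest)
    with open(path, "wb") as f:
        f.write(header + payload)

def load_container(path):
    with open(path, "rb") as f:
        header, payload = f.read(64), f.read()
    magic, ver, _flags, dims, n, digest = struct.unpack("<4sHHQQ32s8x", header)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if hashlib.blake2b(payload, digest_size=32).digest() != digest:
        raise ValueError("checksum mismatch")
    raw = zlib.decompress(payload)
    q = np.frombuffer(raw[: n * dims], dtype=np.int8).reshape(n, dims)
    scales = np.frombuffer(raw[n * dims:], dtype=np.float32)
    return q, scales
```

Verifying the digest before decompressing is what lets a loader fail fast on truncated or corrupted files.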

🔌 LLM Adapter Storage

Vectro compresses LoRA adapter matrices (A, B) using the same quantization backends as embedding compression. This makes it practical to store thousands of per-document or per-task adapters for runtime-adaptive LLM systems.

Compress a LoRA adapter

from python.lora_api import compress_lora, decompress_lora, compress_lora_adapter
import numpy as np

# Typical LoRA matrices for a rank-16 adapter on a 768-d model
A = np.random.randn(16, 768).astype(np.float32)   # (rank, in_features)
B = np.random.randn(768, 16).astype(np.float32)   # (out_features, rank)

# Compress — NF4 gives 8× compression with cosine ≥ 0.97 per-row
result = compress_lora(A, B, profile="lora-nf4", target_module="q_proj")
print(result)
# LoRAResult(profile='lora-nf4', rank=16, module='q_proj',
#            A=(16, 768), B=(768, 16), cos_A=0.9821, cos_B=0.9804)

# Reconstruct for inference
A_r, B_r = decompress_lora(result)

Compress a full adapter (all target modules)

adapter = {
    "q_proj": (A_q, B_q),
    "v_proj": (A_v, B_v),
    "k_proj": (A_k, B_k),
}
compressed = compress_lora_adapter(adapter, profile="lora-nf4")
# Returns: Dict[str, LoRAResult] — one entry per module

Profiles and compression ratios

| Profile   | Compression | Cosine (per row) | Best for |
|-----------|-------------|------------------|----------|
| lora-int8 | 4×          | ≥ 0.99           | High-fidelity fine-tuning adapters |
| lora-nf4  | 8×          | ≥ 0.97           | General adapters; recommended default |
| lora-rq   | 16–32×      | ≥ 0.85           | Large adapters (rank ≥ 32); auto-falls back to NF4 for small rank |

Fast-weight snapshot archives

On-the-fly learning systems (e.g. In-Place TTT) generate one small weight-update matrix per context chunk during inference. Vectro's streaming compression format is the natural archive layer for these snapshots:

  • Each fast-weight update is a dense float32 matrix — the same structure as a LoRA B matrix
  • compress_lora(fast_weight, identity, profile="lora-nf4") reduces snapshot size 8×
  • Over a long inference session, NF4 compression makes storing hundreds of checkpoint snapshots tractable without growing unbounded RAM usage

🔗 Vector Database Integrations

| Connector                 | Store | Search | Notes |
|---------------------------|-------|--------|-------|
| InMemoryVectorDBConnector | ✅    | ✅     | Zero-dependency testing |
| QdrantConnector           | ✅    | ✅     | REST/gRPC |
| WeaviateConnector         | ✅    | ✅     | Weaviate v4 |
| MilvusConnector           | ✅    | ✅     | MilvusClient payload-centric |
| ChromaConnector           | ✅    | ✅     | base64 quantized + JSON scales |
| PineconeConnector         | ✅    | ✅     | Managed cloud, list[int] metadata |

from python.integrations import QdrantConnector

conn = QdrantConnector(url="http://localhost:6333", collection="docs")
conn.store_batch(vectors, metadata={"source": "wiki"})
results = conn.search(query_vec, top_k=10)

See docs/integrations.md for full configuration.


🔄 Migration Guide (v1/v2 to v3)

Artifacts saved with Vectro < 2.0 use NPZ format version 1.

from python.migration import inspect_artifact, upgrade_artifact, validate_artifact

info = inspect_artifact("old.npz")          # {"format_version": 1, ...}
upgrade_artifact("old.npz", "new.npz")
result = validate_artifact("new.npz")       # {"valid": True}
Or via the vectro CLI:

vectro inspect old.npz
vectro upgrade old.npz new.npz --dry-run
vectro validate new.npz

See docs/migration-guide.md for the complete guide.


📦 What's Included

┌───────────────────────────────────────────────────────────────────┐
│                    Vectro v3.0.0 Package Contents                 │
├───────────────────────────────────────────────────────────────────┤
│  📚 14 Production Mojo Modules    SIMD + GPU + HNSW + Storage     │
│  🐍 25+ Python Modules            Full v3 API surface             │
│  ✅ 594 Tests (Python-only mode)  All phases verified             │
│  📖 5 Documentation Guides        Migration · API · Benchmarks    │
│  ⚡ SIMD Vectorized               vectorize[_kernel, SIMD_WIDTH]  │
│  🔢 7 Quantization Modes          INT8/NF4/PQ/Binary/RQ/AE/Auto   │
│  🔍 Native HNSW                   Built-in ANN search index       │
│  🏎️  GPU Support                  MAX Engine + CPU SIMD fallback  │
│  📦 VQZ Format                    ZSTD-compressed, checksummed    │
│  ☁️  Cloud Storage                S3 · GCS · Azure Blob           │
│  🔌 Vector DB Connectors          Qdrant · Weaviate · in-memory   │
│  🔄 Migration Tooling             v1/v2 → v3 upgrade w/ dry-run   │
│  🖥️  CLI                          vectro compress / inspect / …   │
└───────────────────────────────────────────────────────────────────┘

✅ Performance Benchmarks

โš ๏ธ Measurement Notes

  • Python throughput below assumes squish_quant Rust extension is available (auto-installed, optional)
  • Without it: ~167K–210K vec/s for INT8 (measured on M3 Pro, d=768/100, batch=10000)
  • Mojo binary numbers require the compiled vectro_quantizer — see docs/benchmarking-guide.md for full methodology
  • All measurements: Apple M3 Pro, batch_size=10000, random normal Float32

Throughput (Apple M3 Pro)

╔══════════════════════════════════════════════════════════════════╗
║                    v3.7.0 Performance Metrics                    ║
╠══════════════════════════════════════════════════════════════════╣
║                                                                  ║
║  INT8 Python layer:    ~167K–210K vec/s  ████████░               ║
║  INT8 Mojo SIMD:       12M+ vec/s (4.85×FAISS)  █████████████████║
║  NF4 quantize:         >= 2M vec/s       ██████████████████░     ║
║  Binary quantize:      >= 20M vec/s      █████████████████████   ║
║  Hamming scan:         >= 50M vec/s      █████████████████████   ║
║  HNSW (10k×128d,M=16): 628 QPS, R@10=0.895  ████░                ║
║  VQZ save/load:        >= 2 GB/s         █████████████████████   ║
║                                                                  ║
╚══════════════════════════════════════════════════════════════════╝

Compression Ratio Table (d=768)

| Mode            | Bits/dim | Ratio | Cosine Sim    | Best For |
|-----------------|----------|-------|---------------|----------|
| FP32 (baseline) | 32       | 1x    | 1.000         | Ground truth |
| INT8            | 8        | 4x    | >= 0.9999     | Default, zero quality loss |
| INT4 (GA in v3) | 4        | 8x    | >= 0.92       | Storage, RAM-constrained |
| NF4             | 4        | 8x    | >= 0.985      | Transformer embeddings |
| NF4-Mixed       | ~4.2     | 7.5x  | >= 0.990      | Outlier-heavy data |
| INT8 + ZSTD     | —        | 6–8x  | >= 0.9999     | Disk/cloud storage |
| PQ-96           | 1        | 32x   | >= 0.95       | Bulk ANN storage |
| Binary          | 1        | 32x   | ~0.80 cosine* | Hamming + rerank |
| RQ x3           | 3        | 10.7x | >= 0.98       | High-quality compression |
| Autoencoder 64D | ~1.3     | 48x   | >= 0.97       | Learned, model-specific |

*recall@10 ≥ 0.95 with INT8 re-rank; direct cosine similarity is ~0.80 at d=768

INT8 Throughput by Dimension (Mojo-accelerated)

┌─────────────┬───────────────┬─────────┬─────────────┬─────────┐
│  Dimension  │  Throughput   │ Latency │ Compression │ Savings │
├─────────────┼───────────────┼─────────┼─────────────┼─────────┤
│    128D     │  1.04M vec/s  │ 0.96 ms │    3.88x    │  74.2%  │
│    384D     │   950K vec/s  │ 1.05 ms │    3.96x    │  74.7%  │
│    768D     │   890K vec/s  │ 1.12 ms │    3.98x    │  74.9%  │
└─────────────┴───────────────┴─────────┴─────────────┴─────────┘

Release History

| Version | Summary | Urgency | Date |
|---------|---------|---------|------|
| v4.8.0 | Distribution Sprint (v4.8.0 / v7.3.0): macOS ARM64 and Linux x86_64 wheels now bundle the pre-compiled `vectro_quantizer` binary for zero-dependency PyPI installs; `_find_binary()` in `_mojo_bridge.py` checks `__file__.parent` first so installed wheels are self-contained; MANIFEST.in fixed for proper sdist includes/excludes. | High | 4/16/2026 |
| v3.0.1 | Mojo-First Runtime Fix: v3.0.0 advertised itself as "Mojo-first" but every quantization call silently fell back to Python/NumPy; all computation hot paths now route through the compiled `vectro_quantizer` binary. | Low | 3/11/2026 |
| v3.0.0 | Extreme Compression Without Loss: 9 quantization algorithms, HNSW ANN index, GPU/MAX Engine, VQZ storage, cloud backends, 445 tests. | Low | 3/11/2026 |
| v1.0.0 | Initial release preparation: 100% test coverage (39/39 tests passing), zero compiler warnings, documentation, demo scripts, and CHANGELOG complete. | Low | 10/30/2025 |


Similar Packages

- AIMAXXING (main@2026-04-20): Your Very Own Agent: The Ultimate, Complete Edition
- server-nexe (v1.0.2-beta): Local AI server with persistent memory, RAG, and multi-backend inference (MLX / llama.cpp / Ollama). Runs entirely on your machine — zero data sent to external services.
- TV-Show-Recommender-AI (main@2026-04-21): 🤖 Recommend TV shows by matching favorites, averaging embeddings, and finding similar titles using fuzzy search and vector similarity.
- reasonkit-mem (main@2026-04-21): 🚀 Build memory and retrieval infrastructure for ReasonKit, enhancing data management and access for your applications with ease and efficiency.
- eywa (main@2026-04-21): 🧠 Capture and manage your team's knowledge effortlessly with Eywa, ensuring no valuable memory is ever lost.