SeekStorm

SeekStorm: sub-millisecond, native vector & lexical search - in-process library & multi-tenancy server, in Rust.

Development started in 2015, in production since 2020, Rust port in 2023, open sourced in 2024, work in progress.

SeekStorm is open source, licensed under the Apache License 2.0.

Blog Posts: SeekStorm is now Open Source and SeekStorm gets Faceted search, Geo proximity search, Result sorting

SeekStorm high-performance search library

Hybrid search

Internally, SeekStorm uses two separate, first-class, native index architectures for vector search and keyword search. Two native cores, not just a retrofit, add-on layer.
SeekStorm doesn’t try to make one index do everything. It runs two native search engines and lets the query planner decide how to combine them.
Two native index architectures under one roof:
- Lexical search: an inverted index optimized for lexical relevance,
- Vector search: an ANN index optimized for vector similarity.
Both are first-class engines, integrated at the query planner level.
- Query planner with multiple QueryModes and FusionTypes
- Per query choice of lexical search, vector search, or hybrid search.
Separate internal index, storage layouts, indexing, search, scoring, top-k candidates - unified query planner and result fusion (Reciprocal Rank Fusion - RRF).
But the user is fully shielded from the complexity, as if it were only a single index.
Enables pure lexical, pure vector or hybrid search (exhaustive, not only re-ranking of preliminary candidates).
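
The result fusion step named above, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. This is not SeekStorm's internal implementation or API: the function name `reciprocal_rank_fusion`, the `u64` document IDs, and the smoothing constant `k` (60 is a value commonly used in the literature) are assumptions made for illustration only.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.
/// Hypothetical helper, not part of the SeekStorm API.
fn reciprocal_rank_fusion(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (position, &doc_id) in ranking.iter().enumerate() {
            // RRF uses 1-based ranks: each list contributes 1 / (k + rank).
            let rank = position as f64 + 1.0;
            *scores.entry(doc_id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties are broken by document ID for determinism.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Because only ranks are used, the lexical and vector scores never have to be normalized to a common scale; a document that appears near the top of both lists ends up ahead of one that is ranked highly by only one engine.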

Architecture

Fast sharded indexing: 35K docs/sec = 3 billion docs/day on a laptop.
Fast sharded search: 7x faster query latency, 17x faster tail latency (P99) for lexical search.
Billion-scale index
Index either in RAM or memory mapped files
Cross-platform (Windows, Linux, MacOS)
SIMD (Single Instruction, Multiple Data) hardware acceleration support,
both for x86-64 (AMD64 and Intel 64) and AArch64 (ARM, Apple Silicon).
Single-machine scalability: serving thousands of concurrent queries with low latency from a single commodity server without needing clusters or proprietary hardware accelerators.
100% human 😎 craftsmanship - No AI 🤖 was forced into vibe coding/AI slop.

Vector Features

Multi-Vector indexing: both from multiple fields and from multiple chunks per field.
Integrated inference: Generate and index embeddings from any text document field.
Alternatively, import and index externally generated embeddings.
Multiple vector precisions: F32, I8.
Multiple similarity measures: Cosine similarity, Dot product, Euclidean distance.
Scalar Quantization (SQ).
Chunking that respects sentence boundaries and Unicode segmentation for multilingual text.
K-Medoid clustering: PAM (Partition Around Medoids) with actual data points as centers.
Sharded and leveled IVF index.
Approximate Nearest Neighbor Search (ANNS) in a leveled IVF index.
All field filters are directly active during vector search, not just as a post-search filtering step.
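
To make the IVF idea concrete, here is a minimal, single-level sketch of how an `nprobe` parameter trades recall for latency. It is an illustration under simplifying assumptions (no sharding, no levels, no quantization, squared Euclidean distance only); the `Cluster` struct and the `ivf_search` function are hypothetical and not part of the SeekStorm API.

```rust
/// Squared Euclidean distance between two vectors of equal length.
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// One cluster of a toy IVF index: a center plus the vectors assigned to it.
/// Hypothetical type, for illustration only.
struct Cluster {
    center: Vec<f32>,
    members: Vec<(u64, Vec<f32>)>, // (document ID, vector)
}

/// Scan only the `nprobe` clusters whose centers are closest to the query,
/// then return the `k` nearest members as (document ID, squared distance).
fn ivf_search(clusters: &[Cluster], query: &[f32], nprobe: usize, k: usize) -> Vec<(u64, f32)> {
    // Rank clusters by the distance from the query to their center.
    let mut order: Vec<usize> = (0..clusters.len()).collect();
    order.sort_by(|&a, &b| {
        squared_distance(&clusters[a].center, query)
            .total_cmp(&squared_distance(&clusters[b].center, query))
    });

    // Compute exact distances only for members of the probed clusters.
    let mut candidates: Vec<(u64, f32)> = order
        .iter()
        .take(nprobe)
        .flat_map(|&c| clusters[c].members.iter())
        .map(|(id, v)| (*id, squared_distance(v, query)))
        .collect();

    candidates.sort_by(|a, b| a.1.total_cmp(&b.1));
    candidates.truncate(k);
    candidates
}
```

With a small `nprobe`, a true nearest neighbor that sits just across a cluster boundary can be missed; raising `nprobe` scans more clusters and recovers it at the cost of more distance computations.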

Lexical Features

BM25F and BM25F_Proximity ranking
6 tokenizers, including Chinese word segmentation.
Stemming for 38 languages.
Optional stopword lists, custom and predefined, for smaller indices and faster search.
Frequent word lists, custom and predefined, for faster phrase search by N-gram indexing.
Inverted index
Roaring-bitmap posting list compression.
N-gram indexing
Block-max WAND and Maxscore acceleration
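
For readers unfamiliar with the ranking function behind these features, the following is a minimal sketch of plain single-field BM25. It is not SeekStorm's BM25F or BM25F_Proximity implementation (BM25F additionally combines multiple weighted fields); the function names and the parameter values `k1 = 1.2` and `b = 0.75` used in examples are common defaults and assumptions here.

```rust
/// Inverse document frequency in its non-negative form:
/// ln(1 + (N - n + 0.5) / (n + 0.5)).
fn idf(doc_count: f64, docs_with_term: f64) -> f64 {
    (1.0 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln()
}

/// BM25 contribution of a single query term to one document's score.
/// Hypothetical helper, not part of the SeekStorm API.
fn bm25_term_score(
    term_frequency: f64,
    doc_length: f64,
    avg_doc_length: f64,
    doc_count: f64,
    docs_with_term: f64,
    k1: f64,
    b: f64,
) -> f64 {
    // Documents longer than average are penalized, shorter ones are boosted.
    let length_norm = 1.0 - b + b * doc_length / avg_doc_length;
    idf(doc_count, docs_with_term) * term_frequency * (k1 + 1.0)
        / (term_frequency + k1 * length_norm)
}
```

The score of a document for a multi-term query is the sum of these per-term contributions. Rare terms weigh more than frequent ones, and repeated occurrences of a term saturate instead of growing linearly.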

General Features

True real-time search, both for vector search and lexical search, with negligible performance impact
Incremental indexing
Unlimited field number, field length & index size
Compressed document store: ZStandard
Field filtering
Faceted search: Counting & filtering of String & Numeric range facets (with Histogram/Bucket & Min/Max aggregation)
Result sorting by any field, ascending or descending, multiple fields combined by "tie-breaking".
Geo proximity search, filtering and sorting.
Iterator to iterate through all documents of an index, in both directions, e.g., for index export, conversion, analytics and inspection.
Search with empty query, but query facets, facet filter, and result sort parameters, ascending and descending.
Typo tolerance / Fuzzy queries / Query spelling correction: return results if the query contains spelling errors.
Typo-tolerant Query Auto-Completion (QAC) and Instant search.
KWIC snippets, highlighting
One-way and multi-way synonyms
Language independent

Field types

U8..U64
I8..I64
F32, F64
Timestamp
Bool
String16, String32
StringSet16, StringSet32
Text (Multi-vector: automatically generated embeddings for each text field)
Point
Json
Binary (embedded images, audio, video, pdf)
Vector (externally generated embeddings)

Query types

OR disjunction union
AND conjunction intersection
"" phrase
- NOT
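
Conceptually, these query types map to set operations on sorted posting lists of document IDs. The sketch below shows union, intersection, and NOT on plain sorted `u32` slices. It is a simplified illustration, not SeekStorm's compressed posting-list code, and the function names `unite`, `intersect`, and `subtract` are hypothetical.

```rust
use std::cmp::Ordering;

/// AND: document IDs present in both sorted posting lists.
fn intersect(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
        }
    }
    out
}

/// OR: document IDs present in at least one of the sorted posting lists.
fn unite(a: &[u32], b: &[u32]) -> Vec<u32> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Equal => { out.push(a[i]); i += 1; j += 1; }
            Ordering::Less => { out.push(a[i]); i += 1; }
            Ordering::Greater => { out.push(b[j]); j += 1; }
        }
    }
    // One list is exhausted; append whatever remains of the other.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}

/// NOT: document IDs in `a` that do not occur in `b`.
fn subtract(a: &[u32], b: &[u32]) -> Vec<u32> {
    let mut j = 0;
    let mut out = Vec::new();
    for &id in a {
        // Advance the cursor in `b` until it reaches or passes `id`.
        while j < b.len() && b[j] < id { j += 1; }
        if j == b.len() || b[j] != id { out.push(id); }
    }
    out
}
```

All three run in time linear in the combined list lengths because both inputs are sorted. A phrase query additionally needs term positions within each document, which this sketch leaves out.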

Result types

TopK
Count
TopKCount

SeekStorm multi-tenancy search server

Index and search via RESTful API with CORS.
Ingest local data files in CSV, JSON, Newline-delimited JSON (ndjson), and Concatenated JSON formats via console command.
Ingest local PDF files via console command (single file or all files in a directory).
Multi-tenancy index management.
API-key management.
Embedded web server and web UI to search and display results from any index without coding.
Web UI with query auto correction, query auto-completion, instant search, keyword highlighting, histogram, date filter, faceting, result sorting, document preview (as demo, for testing, as template).
Code first OpenAPI generated REST API documentation
Cross-platform: runs on Linux, Windows, and macOS (other OS untested).
Docker file and container image at Docker Hub

Why SeekStorm?

Twin-core native vector & keyword search
Two separate, first-class, native index architectures for vector search and keyword search under one roof.
A query planner with 8 dedicated QueryModes and FusionTypes automatically decides how to combine the results for maximum query understanding.

Performance
Lower latency, higher throughput, lower cost & energy consumption, esp. for multi-field and concurrent queries.
Low tail latencies ensure a smooth user experience and prevent loss of customers and revenue.
While some rely on proprietary hardware accelerators (FPGA/ASIC) or clusters to improve performance,
SeekStorm achieves a similar boost algorithmically on a single commodity server.

Consistency
No unpredictable query latency during and after large-volume indexing as SeekStorm doesn't require resource-intensive segment merges.
Stable latencies - no cold start costs due to just-in-time compilation, no unpredictable garbage collection delays.

Scaling
Maintains low latency, high throughput, and low RAM consumption even for billion-scale indices.
Unlimited field number, field length & index size.

Relevance
Term proximity ranking provides more relevant results compared to BM25.

Real-time
True real-time search, as opposed to NRT: every indexed document is immediately searchable, even before and during commit.

Benchmarks

Lexical Search

the who: vanilla BM25 ranking vs. SeekStorm proximity ranking

Methodology
Comparing different open-source search engine libraries (BM25 lexical search) using the open-source search_benchmark_game developed by Tantivy and Jason Wolfe.

Benefits

using a proven open-source benchmark used by other search libraries for comparability
adapters written mostly by search library authors themselves for maximum authenticity and faithfulness
results can be replicated by everybody on their own infrastructure
detailed results per query, per query type and per result type to investigate optimization potential

Detailed benchmark results https://seekstorm.github.io/search-benchmark-game/

Benchmark code repository https://github.com/SeekStorm/search-benchmark-game/

See our blog posts for more detailed information: SeekStorm is now Open Source and SeekStorm gets Faceted search, Geo proximity search, Result sorting

Vector search

1 million vectors, 128 dimensions, f32 precision
nprobe=16 -> recall@10=95%, average latency=188 microseconds
nprobe=33 -> recall@10=99%, average latency=302 microseconds
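
For reference, recall@10 compares the approximate top 10 with the exact top 10 obtained by exhaustive search. The following is a minimal sketch of that metric for a single query; it is not the benchmark code itself, and the function name `recall_at_k` is an assumption. A reported figure such as 95% would be the mean over all benchmark queries.

```rust
use std::collections::HashSet;

/// recall@k for one query: the fraction of the exact k nearest neighbors
/// that also appear among the first k approximate results.
/// Hypothetical helper, not taken from the SeekStorm benchmark code.
fn recall_at_k(exact: &[u64], approximate: &[u64], k: usize) -> f64 {
    let truth: HashSet<u64> = exact.iter().take(k).copied().collect();
    if truth.is_empty() {
        return 0.0;
    }
    let hits = approximate
        .iter()
        .take(k)
        .filter(|id| truth.contains(*id))
        .count();
    hits as f64 / truth.len() as f64
}
```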

SIFT1M dataset

Benchmark code

Benchmark vector search vs. lexical search (Wikipedia)

There are benchmarks of vector search engines, and benchmarks of lexical search engines.
But seeing the latency of lexical search and vector search stacked up against each other might offer some unique insight.

English Wikipedia: 5 million documents, 16 million vectors
Lexical: 2 fields, top10, BM25, average latency 305 microseconds
Vector: 2 fields, nprobe=68 -> recall@10=95%, average latency 2,700 microseconds
Vector: 2 fields, nprobe=200 -> recall@10=99%, average latency 6,370 microseconds
Using Model2Vec from MinishLab: PotionBase2M, chunks: 1000 byte

We are using the English Wikipedia data (5 million entries) and queries (300 intersection queries) derived from the AOL query dataset, both from Tantivy’s search-benchmark-game.

Why latency matters

Search speed might be good enough for a single search. Below 10 ms, people can no longer perceive the latency, and search latency might be small compared to internet network latency.
But search engine performance still matters when used in a server or service for many concurrent users and requests for maximum scaling, throughput, low processor load, and cost.
With performant search technology, you can serve many concurrent users at low latency with fewer servers, less cost, less energy consumption, and a lower carbon footprint.
It also ensures low latency even for complex and challenging queries: instant search, fuzzy search, faceted search, and union/intersection/phrase of very frequent terms.
Local search performance matters, e.g. when many local queries are spawned for reranking, fallback/refinement queries, fuzzy search, data mining or RAG before the response is transferred back over the network.
Besides average latencies, we also need to reduce tail latencies, which are often overlooked but can cause loss of customers, revenue, and a bad user experience.
It is always advisable to engineer your search infrastructure with enough performance headroom to keep those tail latencies in check, even during periods of high concurrent load.
Also, even if a human user might not notice the latency, it still might make a big difference in autonomous stock markets, defense applications or RAG which requires multiple queries.
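
The difference between average and tail latency can be made concrete with a small sketch. It computes a percentile with the nearest-rank method over latency samples in microseconds; the function name `percentile` and the sample values in the note below are made up for illustration and are not measurements.

```rust
/// Nearest-rank percentile of latency samples (e.g. p = 99.0 for P99).
/// Returns None for an empty sample set or a percentile outside 0..=100.
/// Hypothetical helper for illustration.
fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest-rank method: rank = ceil(p / 100 * n), with a minimum rank of 1.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

For example, with 98 queries at 300 microseconds and 2 queries at 9,000 microseconds, the median is still 300 microseconds while P99 is 9,000 microseconds. This is why an average or median alone can hide the slow requests that users actually notice.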

Keyword search remains a core building block in the advent of vector search and LLMs

Despite what the hype-cycles https://www.bitecode.dev/p/hype-cycles want you to believe, keyword search is not dead, just as NoSQL wasn't the death of SQL.

You should maintain a toolbox, and choose the best tool for your task at hand. https://seekstorm.com/blog/vector-search-vs-keyword-search1/

Keyword search is just a filter for a set of documents, returning those in which certain keywords occur, usually combined with a ranking metric like BM25. This very basic, core functionality is very challenging to implement at scale with low latency. Because the functionality is so basic, there is an unlimited number of application fields. It is a component, to be used together with other components. There are use cases that can be solved better today with vector search and LLMs, but for many more, keyword search is still the best solution. Keyword search is exact, lossless, and very fast, with better scaling, better latency, lower cost and energy consumption. Vector search works with semantic similarity, returning results within a given proximity and probability.

Why hybrid search?

Because lexical search and vector search complement each other. We can significantly improve result quality with hybrid search by combining their strengths while compensating for their shortcomings.

Lexical search is fast, precise, exact, and language independent - but unable to deal with meaning and semantic similarity.
Vector search understands similarities - but is language dependent, can't deal with new or rare terms it wasn't trained for, and is slower and more expensive.

Keyword search (lexical search)

If you search for exact results like proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection) then keyword search is your friend. Vector search, on the other hand, will bury the exact result that you are looking for among a myriad of results that are only somehow semantically related. At the same time, if you don’t know the exact terms, or you are interested in a broader topic, meaning or synonym, no matter what exact terms are used, then keyword search will fail you.

- works with text data only
- unable to capture context, meaning and semantic similarity
- low recall for semantic meaning
+ perfect recall for exact keyword match 
+ perfect precision (for exact keyword match)
+ high query speed and throughput (for large document numbers)
+ high indexing speed (for large document numbers)
+ incremental indexing fully supported
+ smaller index size
+ lower infrastructure cost per document and per query, lower energy consumption
+ good scalability (for large document numbers)
+ perfect for exact keyword and phrase search, no false positives
+ perfect explainability
+ efficient and lossless for exact keyword and phrase search
+ works with new vocabulary out of the box
+ works with any language out of the box
+ works perfectly with long-tail vocabulary out of the box
+ works perfectly with any rare language or domain-specific vocabulary out of the box
+ RAG (Retrieval-augmented generation) based on keyword search offers unrestricted real-time capabilities.

Vector search

Vector search is perfect if you don’t know the exact query terms, or you are interested in a broader topic, meaning or synonym, no matter what exact query terms are used. But if you are looking for exact terms, e.g. proper names, numbers, license plates, domain names, and phrases (e.g. plagiarism detection), then you should always use keyword search. Vector search will instead bury the exact result that you are looking for among a myriad of results that are only somehow related. It has good recall, but low precision and higher latency. It is prone to false positives, e.g., in plagiarism detection, as exact words and word order get lost.

Vector search enables you to search not only for similar text, but for everything that can be transformed into a vector: text, images (face recognition, fingerprints), audio, enabling you to do magic things like "queen - woman + man = king."
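
The "queen - woman + man = king" arithmetic can be reproduced with a toy example. The two-dimensional vectors used in the note below are hand-made for illustration (one axis for "royalty", one for "gender") and do not come from any real embedding model; the function names `cosine_similarity`, `analogy`, and `nearest` are assumptions as well.

```rust
/// Cosine similarity of two vectors of equal length (0.0 if either is all zeros).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Element-wise a - b + c, e.g. queen - woman + man.
fn analogy(a: &[f32], b: &[f32], c: &[f32]) -> Vec<f32> {
    a.iter().zip(b).zip(c).map(|((x, y), z)| x - y + z).collect()
}

/// The vocabulary word whose vector has the highest cosine similarity to `target`.
fn nearest<'a>(vocabulary: &[(&'a str, Vec<f32>)], target: &[f32]) -> Option<&'a str> {
    vocabulary
        .iter()
        .max_by(|a, b| {
            cosine_similarity(&a.1, target).total_cmp(&cosine_similarity(&b.1, target))
        })
        .map(|(word, _)| *word)
}
```

With the toy vectors king = [1, 1], queen = [1, -1], man = [0, 1] and woman = [0, -1], `analogy(queen, woman, man)` yields [1, 1], and `nearest` over this vocabulary returns "king". Real embeddings have hundreds of dimensions and the match is only approximate, which is exactly why results are ranked by similarity rather than matched exactly.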

+ works with any data that can be transformed to a vector: text, image, audio ...
+ able to capture context, meaning, and semantic similarity
+ high recall for semantic meaning (90%)
- lower recall for exact keyword match (for Approximate Similarity Search)
- lower precision (for exact keyword match)
- lower query speed and throughput (for large document numbers)
- lower indexing speed (for large document numbers)
- incremental indexing is expensive and requires rebuilding the entire index periodically, which is extremely time-consuming and resource intensive.
- larger index size
- higher infrastructure cost per document and per query, higher energy consumption
- limited scalability (for large document numbers)
- unsuitable for exact keyword and phrase search, many false positives
- low explainability makes it difficult to spot manipulations, bias and root cause of retrieval/ranking problems
- inefficient and lossy for exact keyword and phrase search
- Additional effort and cost to create embeddings and keep them updated for every language and domain. Even if the number of indexed documents is small, the embeddings still have to be created from a large corpus beforehand.
- Limited real-time capability due to limited recency of embeddings
- works only with vocabulary known at the time of embedding creation
- works only with the languages of the corpus from which the embeddings have been derived
- works only with long-tail vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived
- works only with rare language or domain-specific vocabulary that was sufficiently represented in the corpus from which the embeddings have been derived
- RAG (Retrieval-augmented generation) based on vector search offers only limited real-time capabilities, as it can't process new vocabulary that arrived after the embedding generation

Vector search is not a replacement for keyword search, but a complementary addition - best to be used within a hybrid solution where the strengths of both approaches are combined. Keyword search is not outdated, but time-proven.

Why Rust

We have (partially) ported the SeekStorm codebase from C# to Rust

Factor 2..4x performance gain vs. C# (latency and throughput)
No slow first run (no cold start costs due to just-in-time compilation)
Stable latencies (no garbage collection delays)
Less memory consumption (no ramping up until the next garbage collection)
No framework dependencies (CLR or JVM virtual machines)
Ahead-of-time instead of just-in-time compilation
Memory safe language https://www.whitehouse.gov/oncd/briefing-room/2024/02/26/press-release-technical-report/

Rust is great for performance-critical applications 🚀 that deal with big data and/or many concurrent users. Fast algorithms will shine even more with a performance-conscious programming language 🙂

Architecture

see ARCHITECTURE.md

Building

cargo build --release

⚠ WARNING: make sure to set the MASTER_KEY_SECRET environment variable to a secret, otherwise your generated API keys will be compromised.

Documentation

https://docs.rs/seekstorm

Build documentation

cargo doc --no-deps

Access documentation locally

SeekStorm\target\doc\seekstorm\index.html
SeekStorm\target\doc\seekstorm_server\index.html

Feature Flags

zh (default): Enables TokenizerType::UnicodeAlphanumericZH, which implements Chinese word segmentation to segment continuous Chinese text into tokens for indexing and search.
pdf (default): Enables PDF ingestion via pdfium crate.
vb: vb (verbose) adds additional properties to the Result struct:
- field_id
- chunk_id
- level_id
- shard_id
- cluster_id
- cluster_score
- vector_score
- lexical_score
- source: ResultSource (Lexical/Vector/Hybrid)

You can disable the SeekStorm default features by using default-features = false in the Cargo.toml of your application.
This can be useful to reduce the size of your application or if there are dependency version conflicts.

[dependencies]
seekstorm = { version = "0.12.19", default-features = false }

Usage of the library

Lexical search

Add required crates to your project

cargo add seekstorm
cargo add tokio
cargo add serde_json

Use an asynchronous Rust runtime

use std::error::Error;

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error + Send + Sync>> {
    // your SeekStorm code here

    Ok(())
}

create schema (from JSON)

use seekstorm::index::SchemaField;

let schema_json = r#"
[{"field":"title","field_type":"Text","store":false,"index_lexical":false,"dictionary_source":true,"completion_source":true},
{"field":"body","field_type":"Text","store":true,"index_lexical":true},
{"field":"url","field_type":"Text","store":false,"index_lexical":false}]"#;
let schema:Vec<SchemaField>=serde_json::from_str(schema_json).unwrap();

create schema (from SchemaField)

use seekstorm::index::{SchemaField,FieldType};

let schema= vec![
    SchemaField::new("title".to_owned(), false, false,false, FieldType::Text, false,false, 1.0,true,true),
    SchemaField::new("body".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),
    SchemaField::new("url".to_owned(), false, false,false, FieldType::Text,false,false,1.0,false,false),
];

create index

# tokio_test::block_on(async {

use std::path::Path;
use seekstorm::index::{IndexMetaObject, Clustering, LexicalSimilarity,TokenizerType,StopwordType,FrequentwordType,AccessType,StemmerType,NgramSet,SchemaField,FieldType,SpellingCorrection,QueryCompletion,DocumentCompression,create_index};
use seekstorm::vector::Inference;
use seekstorm::vector_similarity::VectorSimilarity;

let index_path=Path::new("C:/index/");

let schema= vec![
    SchemaField::new("title".to_owned(), false, false,false, FieldType::Text, false,false, 1.0,true,true),
    SchemaField::new("body".to_owned(),true,true,false,FieldType::Text,false,true,1.0,false,false),
    SchemaField::new("url".to_owned(), false, false, false,FieldType::Text,false,false,1.0,false,false),
];

let meta = IndexMetaObject {
    id: 0,
    name: "test_index".into(),
    lexical_similarity: LexicalSimilarity::Bm25f,
    tokenizer: TokenizerType::UnicodeAlphanumeric,
    stemmer: StemmerType::None,
    stop_words: StopwordType::None,
    frequent_words: FrequentwordType::English,
    ngram_indexing: NgramSet::NgramFF as u8,
    document_compression: DocumentCompression::Snappy,
    access_type: AccessType::Mmap,
    spelling_correction: Some(SpellingCorrection { max_dictionary_edit_distance: 1, term_length_threshold: Some([2,8].into()),count_threshold: 20,max_dictionary_entries:500_000 }),
    query_completion: Some(QueryCompletion{max_completion_entries:10_000_000}),
    clustering: Clustering::None,
    inference: Inference::None,
};

let segment_number_bits1=11;
let index_arc=create_index(index_path,meta,&schema,&Vec::new(),segment_number_bits1,false,None).await.unwrap();

# });

open index (alternatively to create index)

# tokio_test::block_on(async {

use std::path::Path;
use seekstorm::index::open_index;

let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap(); 

# });

index documents (from JSON)

# tokio_test::block_on(async {

use std::path::Path;
use seekstorm::index::{open_index, IndexDocuments};

let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap(); 

let documents_json = r#"
[{"title":"title1 test","body":"body1","url":"url1"},
{"title":"title2","body":"body2 test","url":"url2"},
{"title":"title3 test","body":"body3 test","url":"url3"}]"#;
let documents_vec=serde_json::from_str(documents_json).unwrap();

index_arc.index_documents(documents_vec).await; 

# });

index document (from Document)

# tokio_test::block_on(async {

use seekstorm::index::{FileType, Document, IndexDocument, open_index};
use std::path::Path;
use serde_json::Value;

let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap(); 

let document= Document::from([
    ("title".to_string(), Value::String("title4 test".to_string())),
    ("body".to_string(), Value::String("body4 test".to_string())),
    ("url".to_string(), Value::String("url4".to_string())),
]);

index_arc.index_document(document,FileType::None).await;

# });

commit documents

# tokio_test::block_on(async {

use seekstorm::commit::Commit;
use seekstorm::index::open_index;
use std::path::Path;

let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap(); 

index_arc.commit().await;

# });

search index

# tokio_test::block_on(async {

use seekstorm::search::{Search, SearchMode, QueryType, ResultType, QueryRewriting};
use seekstorm::index::open_index;
use std::path::Path;

let index_path=Path::new("C:/index/");
let mut index_arc=open_index(index_path,false).await.unwrap(); 

let query="test".to_string();
let query_vector=None;
let search_mode=SearchMode::Lexical;
let enable_empty_query=false;
let offset=0;
let length=10;
let query_type=QueryType::Intersection; 
let result_type=ResultType::TopkCount;
let include_uncommitted=false;
let field_filter=Vec::new();
let query_facets=Vec::new();
let facet_filter=Vec::new();
let result_sort=Vec::new();
let query_rewriting= QueryRewriting::SearchRewrite { distance: 1, term_length_threshold: Some([2,8].into()), correct:Some(2),complete: Some(3), length: Some(5) };
let result_object = index_arc.search(query, query_vector, query_type, search_mode, enable_empty_query, offset, length, result_type,include_uncommitted,field_filter,query_facets,facet_filter,result_sort,query_rewriting).await;

// ### display results

use seekstorm::highlighter::{Highlight, highlighter};
use std::collections::HashSet;

let highlights:Vec<Highlight>= vec![
    Highlight {
        field: "body

Release History

Version	Changes	Urgency	Date
v3.2.1	### Fixed - delete_index fixed, both library method and REST API endpoint. - Intermittent indexing after commits of incomplete levels (< 65_536 documents) fixed. - In open_index now the shard_number is restored correctly. Previously it was always set to the number of physical cores, not respecting when it was previously set to a lower number via force_shard_number. ### Improved - Incomplete levels with less than 10_000 vectors per shard are now also clustered, resulting in improved qu	High	6/4/2026
v3.2.0	### Added - Vector search now with NEON SIMD acceleration for the AArch64 target (Apple Silicon, AWS Graviton). - The REST API info endpoint /api/v1/live now returns information on whether SIMD is enabled. - Server info card - SIMD entry: AVX2/Neon/None - Web server (UI, REST API) entry: now has a clickable URL to the embedded web UI. ### Fixed - Vector search performance regression fixed: #[inline(always)] - Compile error for the AArch64 target (since v3.0.0) fixed. Fixes	High	5/13/2026
v3.1.3	### Added - For Euclidean, ScalarQuantizationI8, new non-affine quantization methods were added. - For Euclidean, ScalarQuantizationI8, new non-affine similarity functions were added. ### Improved - For Euclidean, ScalarQuantizationI8, it is now automatically selected between affine and non-affine quantization (displayed in index info card), depending on the data. Always best recall.	High	5/5/2026
v3.1.0	### Added - [TurboQuant](https://en.wikipedia.org/wiki/TurboQuant) (TQ) quantization for vector search added. TurboQuant reduces the quantization error (rounding errors with different directions might change the ratio between dimensions) with the following benefits: - requires no training in contrast to Product Quantization (PQ), - provides better recall compared to Product Quantization (PQ), - allows lower nprobe with improved query latency in ANN search, - allows higher v	High	4/29/2026
v3.0.2	### Added - New vector search example (external vectors and vector query via REST API, Euclidean with I8 quantization) added to test_api.rest. ### Fixed - I8 quantization for Euclidean distance fixed for range below 1.0.	High	4/23/2026
v3.0.1	### Changed - VectorHeader.zero_point changed from i32 to i16 - VectorHeader.padding: u16 removed ### Fixed - index: removed duplicate use std::sync::LazyLock;	High	4/22/2026
v3.0.0	### 🔥 SeekStorm v3.0.0 adds vector search and hybrid search: #### SeekStorm uses two separate, first-class, native index architectures, under one roof. - Lexical search: sharded and leveled inverted index. - Vector search: sharded and leveled IVF index for ANN or exhaustive search. - Shared document store, shared document ID space. - Both first-class engines are integrated at the query planner level. - Query planner with QueryModes (Lexical, Vector, Hybrid…) an	High	4/19/2026
v2.3.2	- stemmer crate renamed	Low	3/9/2026
v2.3.1	- Stemmers updated to Snowball 3.0.0 - Added stemmer language support for - Armenian - Basque - Catalan - Czech - DutchPorter - Esperanto - Estonian - Hindi - Indonesian - Irish - Lithuanian - Lovins - Nepali - Persian - Polish - Porter - Serbian - Sesotho - Ukrainian - Yiddish	Low	3/9/2026
v2.3.0	### Added - Added `FieldType::Binary` for storing binary data in base64 format. This field type will not be tokenized and indexed. For embedding binary data, e.g. images, audio, video, pdf, … in JSON or CSV documents. A self-contained alternative to storing URLs to external resources. Using the [Data URI scheme](https://en.wikipedia.org/wiki/Data_URI_scheme) in JavaScript you can create an Image object and put the base64 as its src, including the data:image... part like this:	Low	2/9/2026
v2.2.1	### Fixed - Fixed update document(s) REST API endpoint document array detection fixed. There was an issue when the document itself contained a `[`-char. - Fixed issue #57. `index_document`/`index_posting` caused an exception after a previously committed incomplete level due to a wrong posting list `CompressionType` deserialization.	Low	2/6/2026
v2.2.0	### Added - Multiple `document_compression` methods: `None`, `Snappy`, `Lz4`, `Zstd`. Faster search (200% median, 110% mean) and 50% faster indexing with Snappy, compared to Zstandard, if documents are stored and loaded from the document store. Now you have control over the best balance of index size, indexing speed, and query latency for your use case. Some search benchmarks measure only pure search performance, but **real-world usage almost always includes	Low	2/1/2026
v2.1.0	### Added - `search` now supports an empty query: similar to an iterator across all indexed documents, but all search parameters are supported, apart from query and field_filter: - result_type: ResultType, - include_uncommitted: bool, - query_facets: Vec<QueryFacet>, - facet_filter: Vec<FacetFilter>, - result_sort: Vec<ResultSort>, For search with empty query, if no sort field is specified, then the search results are sorted by `_id` in `descending` order p	Low	1/28/2026
v2.0.0	### Added - Document ID iterator API `get_docid`, both for SeekStorm library and server. Allows iterating through all document IDs (and, with get_document, through all documents) of the whole index, in both directions. Allows sequentially retrieving all documents even from large collections without collecting them to size-limited RAM first, e.g., for index export and inspection. Ensures without invoking search, that only valid document IDs are returned, even though document ID a	Low	1/22/2026
v1.2.5	### Added - [Early query completion expansion](https://seekstorm.com/blog/query-auto-completion-(QAC)/#sliding-window-completion-expansion): if a query with >=2 terms returns less than max_completion_entries, but a completion with 3 terms is returned, it is expanded with more query terms. Previously, only the last incomplete term of a query was completed, now the completion is expanded early to look one more term ahead. The intended full query is reached earlier, saving even more time.	Low	1/17/2026
v1.2.4	### Added - New `SpellingCorrection.count_threshold`: The minimum frequency count per index for dictionary words to be eligible for spelling correction can now be set by the user for more control over the dictionary generation. If count_threshold is too high, some correct words might be missed from the dictionary and deemed misspelled, if count_threshold too low, some misspelled words from the corpus might be considered correct and added to the dictionary. Dictionary terms eligible	Low	1/15/2026
v1.2.3	### Changed - Updated [OpenAPI definition](https://github.com/SeekStorm/SeekStorm/tree/main/src/seekstorm_server/openapi) files.	Low	1/11/2026
v1.2.2	### Changed - Updated [OpenAPI definition](https://github.com/SeekStorm/SeekStorm/tree/main/src/seekstorm_server/openapi) files.	Low	1/11/2026
v1.2.1	### Added - Completion of spelling corrected query. ### Changed - Highlighting of completions in dropdown reversed. Now the part the user didn't type will be highlighted, while the part they typed remains plain. - Font size of input and completion dropdown are now identical.	Low	1/11/2026
v1.2.0	### Added - Typo-tolerant Query Auto-Completion (QAC) and Instant Search: see [blog post](https://seekstorm.com/blog/query-auto-completion-(QAC)/). - The completions are automatically derived in real-time from indexed documents, not from a query log: - works, even if no query log is available, especially for domain-specific, newly created indices or few users. - works for new or domain-specific terms. - allows out-of-the-box domain specific suggestions - prevents incons	Low	1/9/2026
v1.1.3	### Added - hello endpoint added to seekstorm server: http://127.0.0.1/api/v1/hello -> returns "SeekStorm server 1.1.3" ### Fixed - seekstorm server commandline disabled if no terminal/tti (docker parameter -ti) detected. Fixes #39 .	Low	12/3/2025
v1.1.2	### Fixed - Normalization/folding of ligatures and roman numerals fixed.	Low	11/30/2025
v1.1.1	### Fixed - If TokenizerType::UnicodeAlphanumericFolded is selected, then diacritics, ligatures, and accents in the query string are now folded prior to spelling correction. - Examples for query spelling correction added to README.md (in create_index and search). - Examples for specifying Boolean queries via query operators and query type have been added to README.md. - Query operators added to library documentation. - Limit the size of the min-heap per shard to s=cmp::min(offset+length,	Low	11/30/2025
v1.1.0	### Added - Added query spelling correction / typo-tolerant search / fuzzy queries by integrating [SymSpell](https://github.com/wolfgarbe/symspell_rs), both to SeekStorm library and SeekStorm server REST API. - New `create_index`, `IndexMetaObject` parameter property: `spelling_correction: Option<SpellingCorrection>`: - enables automatic incremental creation of the SymSpell dictionary during the indexing of documents. - New `search` parameter: `query_rewriting: QueryRewriting`: En	Low	11/27/2025
v1.0.0	### Improved - 4..6x faster indexing speed with sharded index. - 3x shorter query latency with sharded index. - Faster index loading. - Faster clear_index. - Benchmarks updated. ### Added - index.indexed_doc_count() - index.committed_doc_count() - index.uncommitted_doc_count() - SchemaField now has a `longest` property. This allows annotating (manually setting) the longest field in the schema. Otherwise the longest field will be automatically detected in the first index_document. Sett	Low	10/22/2025
v0.14.1	Faster tokenizer and indexing.	Low	10/14/2025
v0.14.0	### Improved - Maximum cardinality of distinct string facet values increased from 65_535 (16 bit) to 4_294_967_295 (32 bit). - FieldType::String32 and FieldType::StringSet32 added, that allow a cardinality of 4_294_967_295 (32 bit) distinct string facet values, while FieldType::String and FieldType::StringSet were renamed to FieldType::String16 and FieldType::StringSet16 that allow only a cardinality of 65_535 (16 bit) distinct string facet values, but are space-saving. - Query	Low	9/14/2025
v0.13.3	hash32 fixed for platforms without aes or sse2.	Low	8/27/2025
v0.13.2	- rustdocflags added in config.toml and cargo.toml	Low	8/25/2025
v0.13.1	- Faster and complete topk results for union queries > 8 terms by using MAXSCORE. - Required target_features for using gxhash fixed.	Low	8/23/2025
v0.13.0	### Added - N-gram indexing: N-grams are indexed in addition to single terms, for faster phrase search, at the cost of higher index size. - N-grams not as parts of terms, but as combination of consecutive terms. See [NGRAM_SEARCH.md](https://github.com/SeekStorm/SeekStorm/blob/main/NGRAM_SEARCH.md). - N-Gram indexing improves phrase query latency on average by factor 2.14 (114%), maximum tail latency by factor 7.51 (651%), and some phrase queries up to **3 orders of magnit	Low	8/8/2025
v0.12.27	- very rare position compression bug fixed.	Low	5/15/2025
v0.12.26	- Put winapi crate behind conditional compilation #[cfg(target_os = "windows")]	Low	5/13/2025
v0.12.25	- Faster index_document, commit, clear_index: Increased SEGMENT_KEY_CAPACITY prevents HashMap resizing during indexing. vector/hashmap reuse instead of reinitialization. - Intersection between RLE-RLE and RLE-Bitmap compressed posting lists fixed.	Low	5/13/2025
v0.12.24	- Fixes an 85% performance drop (Windows 11 24H2/Intel hybrid CPUs only) caused by a faulty Windows 11 24H2 update that changed the task scheduler behavior to under-utilize the P-cores in favor of the E-cores of Intel hybrid CPUs. This is a workaround until Microsoft fixes the issue in a future update. The fix solves the issue for the SeekStorm server; if you embed the SeekStorm library into your own code, you have to apply the fix as well. See blog post for details: https://seekstorm.com	Low	5/2/2025
v0.12.23	- Ingestion of files in [CSV](https://en.wikipedia.org/wiki/Comma-separated_values), SSV, TSV, PSV format with `ingest_csv()` method and seekstorm_server command line `ingest`: configurable header, delimiter char, quoting, number of skipped documents, number of indexed documents. - `stop_words` parameter (predefined languages and custom) added to create_index IndexMetaObject: Stop words are not indexed for compact index and faster queries. - `frequent_words` parameter (predefined lang	Low	4/28/2025
v0.12.22	- Problem fixed where an intersection in a very small index didn't return results (all_terms_frequent). - Early termination fixed in single_blockid: did not guarantee most relevant results for filtered single term queries with result type Topk.	Low	4/16/2025
v0.12.21	- Stemming for 18 languages added: new property `IndexMetaObject.stemmer: StemmerType` - Allows specifying the markup tags to insert before and after each highlighted term. Default is "<b>" "</b>" (by @DanLLC). - Fixed read_f32() in utils.rs - See [CHANGELOG.md](https://github.com/SeekStorm/SeekStorm/blob/main/CHANGELOG.md) for details.	Low	3/27/2025
v0.12.20	- PDF ingestion via `pdfium` dependency moved behind a new `pdf` feature flag which is enabled by default. You can disable the SeekStorm default features by using `seekstorm = { version = "0.12.19", default-features = false }` in the cargo.toml of your application. This can be useful to reduce the size of your application or if there are dependency version conflicts. - feature flags documented in README.md	Low	3/5/2025
v0.12.19	- Backward compatibility to indexes created prior to v0.12.18 restored.	Low	3/4/2025
v0.12.18	- Fixes intersection_vector16 for target_arch != "x86_64". - Fixes issue where multiple indices per API key were not correctly reloaded after server restart (IndexMetaObject.id #[serde(skip)] removed). - Fixes issue #39 with commandline() in server.rs for docker environment without -ti parameter (run interactively with a tty session). - Updated to Rust edition 2024. - Changed serde_json::from_str(&value.to_string()).unwrap_or(value.to_string()).to_string() -> serde_json::from_value::<String>	Low	3/3/2025
v0.12.17	- Fixed issue in clear_index.	Low	2/15/2025
v0.12.16	- Basic tests added (issue #33): cargo test - New method current_doc_count() returns the number of indexed documents minus deleted documents. - Fixed issue #32 in clear_index.	Low	2/14/2025
v0.12.15	- Fixes issue #34 - refactoring of http_server (by @gabriel-v)	Low	2/12/2025
v0.12.14	- Fixed issue #36 panic at realtime search - Fixed a possible issue in clear_index	Low	2/10/2025
v0.12.13	- Intersection speed for ResultType::Count improved. - Fixes issue #31 for queries with query parameter length=0 and ResultType::TopK or ResultType::TopkCount. - If you specify length=0, resultType::TopkCount will automatically be downgraded to resultType::Count and return only the number of results, without returning the results themselves. - If you don't specify the length in the REST API, a default of 10 will be used.	Low	2/9/2025
v0.12.12	- Fixes an issue in clear_index that prevented the facet.json file from being created in commit after clear_index, causing problems after reloading the index. Fixes issue #27.	Low	2/7/2025
v0.12.11	clear_index fixed. Fixes issue [#26](https://github.com/SeekStorm/SeekStorm/issues/26) .	Low	2/3/2025
v0.12.10	- Fixed indexing postings with more than 8_192 positions. - Fixed an issue with Chinese word segmentation, where a hyphen within a string was interpreted as a NOT ('-') operator in front of one of the resulting segmented words. - Fixed an issue with NOT query terms that are RLE compressed. - Fixed an issue for union > 8 terms with custom result sorting.	Low	2/2/2025
v0.12.9	- Automatic resize of postings_buffer in index_posting. - Fixed a subtract with overflow exception when real time search was enabled. - Fixed exception if > 10 query terms. - Fixed stack overflow in some long union queries. - Fixed endless loop while intersecting RLE-compressed posting lists. - Updated rand from v0.8.5 to v0.9.0.	Low	1/29/2025
v0.12.8	Removed unsafe std::slice::from_raw_parts (cast arrays of different element types) which caused unaligned data exceptions. Fixes issue #20 .	Low	1/24/2025
v0.12.7	Endless loop while intersecting multiple RLE-compressed posting lists fixed. Fixes issue https://github.com/SeekStorm/SeekStorm/issues/21 .	Low	1/19/2025
v0.12.6	Exception while intersecting multiple RLE-compressed posting lists fixed. Fixes issue https://github.com/SeekStorm/SeekStorm/issues/21 .	Low	1/16/2025
v0.12.5	Endless loop while intersecting multiple RLE-compressed posting lists fixed. Fixes issue #21 .	Low	1/13/2025
v0.12.4	- Fixed a subtract with overflow exception that occurred in debug mode when committing, and the previous commit was < 64k documents. Fixes issue #22 . - Changed cast_byte_ushort_slice and cast_byte_ulong_slice to either take mutable references and return mutable ones or take immutable references and return immutable ones. - Added index_file and docstore_file flush in commit.	Low	1/12/2025
v0.12.3	- Docker file and container added (#17) - https://hub.docker.com/r/wolfgarbe/seekstorm_server - `docker run -ti -p "8000:80" wolfgarbe/seekstorm_server:v0.12.3` - Added a server welcome web page with instructions on how to create an API key and index. - Exception handling if docker is run without the -ti parameter (run interactively with a tty session). - See [CHANGELOG.md](https://github.com/SeekStorm/SeekStorm/blob/main/CHANGELOG.md) for details.	Low	12/21/2024
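
The v0.13.0 entry above describes N-grams built from consecutive terms rather than from parts of a term. The following is a minimal sketch of that idea only, not SeekStorm's implementation: the function name `term_ngrams` is hypothetical, and joining terms with a space is an assumption made for illustration.

```rust
/// Hypothetical helper (not part of the SeekStorm API): builds term-level
/// N-grams by joining every run of `n` consecutive terms with a single space.
fn term_ngrams(terms: &[&str], n: usize) -> Vec<String> {
    // `windows(0)` would panic, and fewer than `n` terms yield no N-gram.
    if n == 0 || terms.len() < n {
        return Vec::new();
    }
    terms
        .windows(n) // sliding window over consecutive terms
        .map(|window| window.join(" "))
        .collect()
}
```

For the terms `new`, `york`, `city` and `n = 2`, this yields the two bigrams `new york` and `york city`. Indexing such combinations as additional keys is what trades a larger index for faster phrase search, as the entry notes.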
