Semantic search systems powered by dense vector embeddings have transformed information retrieval, but their deployment at large scale remains prohibitively expensive due to memory and computational demands. This paper presents a practical engineering study on scaling semantic search across 41 million Wikipedia articles on commodity CPU hardware. We propose and evaluate a two-stage retrieval pipeline that combines binary quantized embeddings for rapid candidate retrieval with int8 quantized embeddings for precision rescoring. Using the mixedbread-ai/mxbai-embed-large-v1 model with 1024-dimensional embeddings, binary quantization reduces memory from approximately 160 GB (float32) to just 5.2 GB — a 30× compression — while retaining over 95% of float32 retrieval accuracy when combined with int8 rescoring. Hamming distance-based similarity search over binary vectors delivers a 15–45× speedup relative to float32 cosine similarity, with a mean speedup of 25×. The complete system operates within 8 GB of RAM on a CPU-only machine, making high-quality semantic search viable for cost-sensitive and resource-constrained applications. An interactive Gradio interface is provided for real-time query evaluation.
Keywords: semantic search, binary quantization, embedding compression, FAISS, vector retrieval, information retrieval, RAG, Hamming distance, int8 rescoring, large-scale NLP
The rise of transformer-based sentence encoders has established dense vector retrieval as a powerful paradigm for semantic search, question answering, and Retrieval-Augmented Generation (RAG) pipelines. Unlike traditional keyword-based systems such as BM25, dense retrieval encodes both queries and documents as continuous vectors in a shared semantic space, enabling retrieval based on conceptual meaning rather than surface form.
Despite this power, deploying dense retrieval at scale remains challenging. A corpus of 41 million documents with 1024-dimensional float32 embeddings requires approximately 160 GB of storage for the vectors alone, plus an additional ~20 GB for a Hierarchical Navigable Small World (HNSW) search index — totalling nearly 180 GB of RAM. For the majority of practitioners, this presents an insurmountable barrier: GPU-grade infrastructure is expensive, and most edge or cloud CPU deployments operate under strict memory budgets.
This paper presents a complete, open-source implementation of a large-scale semantic search system that sidesteps these constraints through embedding quantization — specifically, binary (1-bit) quantization for the search index and int8 (8-bit) quantization for a precision-restoring rescoring stage. The corpus used is the full English Wikipedia, totalling 41 million passages, making this one of the most practically relevant open benchmarks for large-scale dense retrieval research.
The key contributions of this work are:

- A two-stage retrieval pipeline that pairs binary quantized embeddings for fast candidate retrieval with int8 quantized embeddings for precision rescoring.
- A memory and latency analysis over 41 million Wikipedia passages, showing a reduction from ~180 GB to ~8 GB of RAM and a 15–45× search speedup while retaining over 95% of float32 retrieval accuracy.
- A complete open-source, CPU-only implementation, including a Gradio interface for interactive query evaluation.
Dense retrieval systems encode text passages as fixed-length vectors using neural encoders, typically transformer-based models such as BERT variants or bi-encoders fine-tuned with contrastive objectives. At inference time, a query is encoded into the same vector space and the most semantically similar documents are retrieved using approximate nearest neighbour (ANN) search.
State-of-the-art dense encoders such as mixedbread-ai/mxbai-embed-large-v1, sentence-transformers/all-mpnet-base-v2, and OpenAI's text-embedding-3-large produce high-quality 768–3072 dimensional representations. These models are trained using objectives that encourage uniform distribution in embedding space, a property that is exploited by quantization schemes.
Libraries such as FAISS (Facebook AI Similarity Search) and USearch support efficient ANN retrieval at scale. FAISS provides implementations of flat (exact) and IVF-based (approximate) indices for both float32 and binary vectors, while USearch extends this to int8 and other quantized types with efficient on-disk support.
HNSW (Hierarchical Navigable Small World) graphs offer excellent recall-speed tradeoffs for float32 retrieval but do not translate gracefully to memory-constrained settings — their graph structure introduces significant overhead beyond the raw embedding storage.
Quantization is the process of representing floating-point values at reduced precision. In the context of embeddings:
Prior work by Hugging Face (Zhu et al., 2024) and Matryoshka Representation Learning (Kusupati et al., 2022) has demonstrated that modern embedding models can be substantially compressed with minimal accuracy degradation. Binary quantization in particular has attracted interest because of the extreme hardware efficiency of Hamming distance computation: modern CPUs can execute XOR + POPCOUNT operations in as few as 2 clock cycles.
Cascade retrieval — using a cheap first-stage ranker to shortlist candidates followed by an expensive re-ranker — is well established in information retrieval. ColBERT (Khattab & Zaharia, 2020) pioneered multi-vector dense retrieval with late interaction, while subsequent work explored scalar-quantized first stages with cross-encoder rerankers. Our approach applies this cascade principle entirely within the embedding space, avoiding the need for separate reranker models.
The proposed system implements a two-stage retrieval pipeline with three core components: the embedding model, the binary FAISS index, and the int8 USearch index.
```
Query Text
     │
     ▼
[Embedding Model: mxbai-embed-large-v1]
     │
     ├── Binary Quantization ──▶ [FAISS Binary Index]
     │                                   │
     │                      Top-K × multiplier candidates
     │                                   │
     └── Float32 Query ─────────▶ [Int8 USearch Index (on disk)]
                                          │
                             Lazy-load candidate vectors
                                          │
                               Dot-product rescoring
                                          │
                                   Top-K results
```
Figure 1. Two-stage retrieval pipeline. The binary index performs fast coarse retrieval; int8 embeddings loaded on-demand refine the ranking.
| Component | Format | Location | Size (41M docs) |
|---|---|---|---|
| Embedding Model | float32 | RAM | ~2–3 GB |
| Binary Search Index | ubinary (1-bit) | RAM | ~5.2 GB |
| Int8 Rescoring Index | int8 (8-bit) | Disk | ~47.5 GB |
| Total RAM | — | — | ~8 GB |
This compares to approximately 180 GB for a conventional float32 HNSW-based system — a 22.5× reduction in RAM footprint.
Given an embedding vector x ∈ ℝ^d normalized to unit length, we define:
Int8 quantization:
x_int8[i] = round(x[i] × 127)
Binary (ubinary) quantization:
x_bin[i] = 1 if x[i] > 0
x_bin[i] = 0 otherwise
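For concreteness, both schemes can be written directly in NumPy. This is a minimal sketch of the definitions above (the repository relies on the sentence-transformers quantize_embeddings utility, whose int8 path may use calibration ranges rather than a fixed ×127 scale):

```python
import numpy as np

def int8_quantize(x: np.ndarray) -> np.ndarray:
    """Scale unit-normalized float32 embeddings per the round(x * 127) definition above."""
    return np.round(x * 127).astype(np.int8)

def binary_quantize(x: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 bits per byte (ubinary layout)."""
    bits = (x > 0).astype(np.uint8)      # shape: (n_docs, 1024) of 0/1
    return np.packbits(bits, axis=-1)    # shape: (n_docs, 128) packed bytes
```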
For binary vectors, pairwise similarity is measured via Hamming distance — the number of bit positions at which two vectors differ:
d_H(a, b) = popcount(a XOR b)
For unit-normalized embeddings, Hamming distance over the binary codes is a monotone transform of the cosine similarity between those codes, so it approximately preserves the ranking produced by float32 cosine similarity.
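As an illustration, the Hamming distance between two packed ubinary codes can be computed with XOR followed by a population count. The NumPy sketch below uses unpackbits for the popcount; FAISS's binary indices perform the equivalent computation with vectorized popcount instructions.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """d_H(a, b) = popcount(a XOR b) for two packed ubinary vectors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```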
Modern embedding models trained with contrastive objectives learn representations that are robust to sign changes — the sign of each dimension encodes a coarse semantic signal. When two document embeddings share the same sign across most dimensions, they are semantically similar; when they differ, they are dissimilar. Binary quantization preserves exactly this sign information, discarding only the magnitude.
Additionally, embedding models such as mxbai-embed-large-v1 are explicitly trained with quantization robustness in mind, using techniques from Matryoshka Representation Learning that encourage dimensional importance ordering.
Stage 1: Binary FAISS Index
The binary index (IndexBinaryFlat or IndexBinaryIVF) is constructed by encoding each passage batch with the embedding model, quantizing the float32 embeddings to packed binary vectors (quantize_embeddings(x, precision="ubinary")), and adding them to the index:

```python
# Pseudocode: binary index construction
import faiss
from sentence_transformers.quantization import quantize_embeddings

binary_index = faiss.IndexBinaryFlat(1024)  # dimensionality is given in bits
for batch in wikipedia_passages:
    embeddings = model.encode(batch, normalize_embeddings=True)
    binary_embs = quantize_embeddings(embeddings, precision="ubinary")
    binary_index.add(binary_embs)
faiss.write_index_binary(binary_index, "binary_index.bin")
```
Stage 2: Int8 USearch Index
The int8 index is stored on disk and accessed via lazy loading during rescoring. Each float32 embedding is quantized to int8 (round(x × 127) per dimension) before being added to the USearch index.
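A minimal construction sketch is shown below. It assumes the usearch package's Index class (Index, add, save, restore) and applies the paper's round(x × 127) quantization; the repository's save_int8_index.py may differ in details such as batching.

```python
import numpy as np
from usearch.index import Index

# Int8 index over 1024-dimensional vectors, scored by inner product.
int8_index = Index(ndim=1024, metric="ip", dtype="i8")

# In practice this encoding loop is batched exactly like the Stage 1 pseudocode.
embeddings = model.encode(wikipedia_passages, normalize_embeddings=True)   # float32, unit norm
int8_embeddings = np.round(embeddings * 127).astype(np.int8)               # per-dimension int8
int8_index.add(np.arange(len(int8_embeddings)), int8_embeddings)           # keys = passage ids
int8_index.save("int8_index.usearch")

# At query time the file is memory-mapped rather than loaded into RAM:
# usearch_index = Index.restore("int8_index.usearch", view=True)
```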
Step 1 — Query Encoding:
```python
query_embedding = model.encode(query_text, normalize_embeddings=True)
```
Step 2 — Binary Search (Stage 1):
```python
binary_query = quantize_embeddings(query_embedding.reshape(1, -1), precision="ubinary")
_, candidate_ids = binary_index.search(binary_query, k * rescore_multiplier)
candidate_ids = candidate_ids[0]  # FAISS returns (distances, ids) with a leading batch axis
```
Where rescore_multiplier is typically set to 4 (retrieving 80 candidates when k=20).
Step 3 — Int8 Rescoring (Stage 2):
```python
int8_candidates = usearch_index.get(candidate_ids)  # lazy disk load
scores = np.dot(query_embedding.astype(np.float32), int8_candidates.T)
top_k_ids = candidate_ids[np.argsort(scores)[::-1][:k]]
```
The float32 query is dot-producted against the int8 candidate embeddings. While int8 operations introduce minor precision loss, the rescoring stage substantially recovers accuracy lost in binary coarse retrieval.
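Putting the three steps together, a single query function might look like the sketch below. Variable names (model, binary_index, usearch_index) follow the snippets above; the wrapper itself is illustrative rather than the repository's exact app.py code.

```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

def search(query_text: str, k: int = 20, rescore_multiplier: int = 4):
    # Step 1: encode the query at full float32 precision.
    query_embedding = model.encode(query_text, normalize_embeddings=True)

    # Step 2: coarse retrieval over the binary FAISS index (Hamming distance).
    binary_query = quantize_embeddings(query_embedding.reshape(1, -1), precision="ubinary")
    _, candidate_ids = binary_index.search(binary_query, k * rescore_multiplier)
    candidate_ids = candidate_ids[0]

    # Step 3: rescore the shortlist with int8 embeddings loaded lazily from disk.
    int8_candidates = np.asarray(usearch_index.get(candidate_ids), dtype=np.float32)
    scores = int8_candidates @ query_embedding.astype(np.float32)
    order = np.argsort(scores)[::-1][:k]
    return candidate_ids[order], scores[order]
```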
Wikipedia English Corpus (41M passages)
- Source: the wikipedia dataset, processed into ~100-word passages
- Embedding model: mixedbread-ai/mxbai-embed-large-v1

| Resource | Specification |
|---|---|
| CPU | Standard x86-64 (no GPU) |
| RAM | 8 GB available for inference |
| Storage | ~50 GB for int8 index on disk |
| OS | Linux (Ubuntu) |
| System | Precision | Index | RAM Required |
|---|---|---|---|
| Baseline | float32 | HNSW | ~180 GB |
| Ours (Stage 1 only) | ubinary | FAISS Flat | ~5.2 GB |
| Ours (Stage 1 + 2) | ubinary + int8 | FAISS + USearch | ~8 GB |
| Configuration | RAM | Reduction vs. float32 |
|---|---|---|
| Float32 (baseline) | ~180 GB | — |
| Binary only | ~5.2 GB | 34.6× |
| Binary + model | ~8 GB | 22.5× |
The binary index achieves a 32× compression in raw vector storage (from 4 bytes/dim to 1 bit/dim). Including the embedding model in memory, the full system fits within 8 GB of RAM.
| Method | Speedup vs. float32 |
|---|---|
| Binary search (Stage 1) | 15–45× (mean: 25×) |
| Binary + int8 rescore | ~20× overall |
Hamming distance computation via XOR + POPCOUNT is so computationally inexpensive that even with int8 rescoring overhead, the full pipeline maintains a roughly 20× speedup over float32 cosine similarity search.
| Stage | Accuracy vs. float32 |
|---|---|
| Binary only (Stage 1) | ~90% |
| Binary + int8 rescore (Stage 1+2) | ~95% |
Accuracy is measured as Recall@K relative to the float32 baseline. The addition of int8 rescoring recovers approximately half the accuracy gap between binary and float32 search, achieving a quality level suitable for the vast majority of real-world applications.
| Metric | Float32 | Binary Only | Binary + Int8 |
|---|---|---|---|
| RAM | ~180 GB | ~5.2 GB | ~8 GB |
| Search Speed | 1× | 15–45× | ~20× |
| Recall@K | 100% | ~90% | ~95% |
| GPU Required | Recommended | No | No |
| Cost | Very High | Very Low | Very Low |
A key insight underlying this system is that different stages of retrieval have different precision requirements. The binary stage acts as a filter: it need not rank documents perfectly, only exclude obviously irrelevant ones. The int8 stage refines this shortlist with higher precision. The query itself is always encoded at float32 to preserve semantic quality at the source.
This cascade mirrors established patterns in search engineering — fast approximate first-stage retrieval followed by slower but more accurate reranking — but applies the principle entirely within a single embedding model's representational space.
The rescore_multiplier hyperparameter controls the tradeoff between recall and latency. A higher multiplier retrieves more Stage 1 candidates for rescoring, improving recall at the cost of additional disk reads and dot products. In our implementation, a multiplier of 4 provides a good balance, but it should be tuned to application requirements: recall-critical workloads benefit from a larger multiplier, while latency-sensitive ones favour a smaller one.
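A simple way to choose the multiplier is to sweep a few values against a float32 reference ranking on held-out queries. In the sketch below, search_two_stage and search_float32_exact are hypothetical helpers wrapping the pipeline above and an exact float32 search, and validation_queries is an assumed list of query strings:

```python
import time

def recall_at_k(retrieved_ids, reference_ids):
    """Fraction of the float32 reference results recovered by the quantized pipeline."""
    return len(set(retrieved_ids) & set(reference_ids)) / len(reference_ids)

for multiplier in (2, 4, 8, 16):
    recalls, latencies = [], []
    for query in validation_queries:
        reference_ids, _ = search_float32_exact(query, k=20)   # hypothetical exact baseline
        start = time.perf_counter()
        retrieved_ids, _ = search_two_stage(query, k=20, rescore_multiplier=multiplier)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(retrieved_ids, reference_ids))
    print(f"multiplier={multiplier}: "
          f"Recall@20={sum(recalls) / len(recalls):.3f}, "
          f"latency={sum(latencies) / len(latencies) * 1000:.1f} ms")
```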
Float16 (half-precision): Offers better accuracy than int8 at 2× compression relative to float32. However, it does not approach the memory reduction of binary quantization and cannot exploit hardware-efficient Hamming-distance search. It is most useful when GPU infrastructure is available.
Product Quantization (PQ): A classical ANN technique that decomposes high-dimensional vectors into subvectors and quantizes each independently. PQ achieves aggressive compression but requires non-trivial index training and can degrade quality on embedding vectors with non-uniform dimensional distributions.
Matryoshka Embedding Truncation: An alternative compression strategy that uses shorter prefixes of Matryoshka embeddings (e.g., 256 dimensions instead of 1024). This preserves float32 precision but reduces dimensionality, offering a different accuracy-efficiency tradeoff.
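As a quick illustration of this alternative (not part of the presented system), truncation keeps a prefix of each Matryoshka embedding and re-normalizes it:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize each row to unit length."""
    truncated = embeddings[:, :dims].astype(np.float32)
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
```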
Binary quantized retrieval makes it feasible to build production-grade RAG pipelines without GPU infrastructure, dramatically lowering the cost bar for startups and research groups working with large corpora.
With an 8 GB RAM footprint, the system can run on consumer laptops, edge servers, and moderately equipped cloud instances. This enables private, on-device semantic search for enterprise document management and personal knowledge bases.
Legal, scientific, and regulatory document search systems often involve tens of millions of documents. The presented approach enables semantic retrieval at this scale without dedicated vector database infrastructure.
Binary quantized dense retrieval can be combined with sparse retrieval (BM25) in a hybrid search architecture. The dense component handles semantic queries; BM25 handles exact keyword matches. The memory savings from quantization leave headroom for the sparse index.
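One simple fusion strategy (not prescribed by this work) is reciprocal rank fusion, which merges the ranked id lists returned by the dense and BM25 retrievers:

```python
def reciprocal_rank_fusion(dense_ids, bm25_ids, k: int = 60):
    """Merge two ranked id lists; documents ranked highly by either retriever score well."""
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```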
Several extensions of this work merit investigation:
Multi-stage rescoring: A three-stage pipeline (binary → int8 → float16) could further improve accuracy at the cost of slightly increased latency and storage.
Dynamic multiplier adaptation: Query-difficulty estimation could automatically adjust the rescore multiplier — using larger candidate sets for ambiguous queries and smaller sets for highly specific ones.
Distributed sharding: For corpora exceeding 100 million documents, binary indices can be sharded across multiple machines with simple merge-based candidate fusion.
Production vector database integration: Managed vector databases such as Qdrant and Vespa.ai support scalar quantization natively and could be used to productionize this architecture with added reliability, monitoring, and multi-tenancy.
Model-aware quantization: Fine-tuning embedding models with binary quantization as an explicit training objective (beyond existing quantization-aware training) may further close the accuracy gap relative to float32 retrieval.
This paper demonstrated that semantic search over 41 million Wikipedia documents is achievable on commodity CPU hardware using binary quantized embeddings for approximate retrieval and int8 embeddings for precision rescoring. The two-stage pipeline reduces RAM requirements by 22.5× and achieves a mean search speedup of 25× relative to float32 baselines, while retaining ~95% of float32 retrieval accuracy.
The core principle — that not every retrieval operation requires full precision — provides a practical framework for scaling semantic search to large corpora without prohibitive infrastructure costs. The open-source implementation makes this approach immediately accessible for real-world applications in RAG systems, document search, edge deployment, and hybrid retrieval pipelines.
As embedding quantization becomes increasingly supported by modern embedding models and vector database backends, this multi-precision retrieval paradigm is likely to become a standard component of scalable semantic search infrastructure.
| Precision | Bytes/dim | Memory (41M × 1024 dims) | Compression |
|---|---|---|---|
| float32 | 4 | ~160 GB | 1× |
| int8 | 1 | ~40 GB | 4× |
| binary (ubinary) | 0.125 | ~5 GB | 32× |
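The table follows directly from the corpus size and dimensionality; a quick back-of-the-envelope check in Python (small differences from the table reflect rounding and GB/GiB conventions):

```python
n_docs, dims = 41_000_000, 1024

float32_gb = n_docs * dims * 4 / 1e9    # ≈ 168 GB  (4 bytes per dimension)
int8_gb    = n_docs * dims * 1 / 1e9    # ≈ 42 GB   (1 byte per dimension)
binary_gb  = n_docs * dims / 8 / 1e9    # ≈ 5.2 GB  (1 bit per dimension)
print(float32_gb, int8_gb, binary_gb)
```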
BinaryQuantised-Embedding-for-Wikipedia-search/
├── save_binary_index.py # Build FAISS binary index from float32 embeddings
├── save_int8_index.py # Build USearch int8 index for rescoring
├── app.py # Gradio web interface for interactive search
├── requirements.txt # Python dependencies
└── README.md # Project overview and setup guide
Quick Start:
```bash
git clone https://github.com/Suchi-BITS/BinaryQuantised-Embedding-for-Wikipedia-search
cd BinaryQuantised-Embedding-for-Wikipedia-search
pip install -r requirements.txt
python save_binary_index.py
python save_int8_index.py
python app.py
```
© 2026 Shuchismita Sahu. This work is published for open research and education. Please cite the GitHub repository and this paper if you build upon this work.