Semantic search systems powered by dense vector embeddings have transformed information retrieval, but their deployment at large scale remains prohibitively expensive due to memory and computational demands. This paper presents a practical engineering study on scaling semantic search across 41 million Wikipedia articles on commodity CPU hardware. We propose and evaluate a two-stage retrieval pipeline that combines binary quantized embeddings for rapid candidate retrieval with int8 quantized embeddings for precision rescoring. Using the mixedbread-ai/mxbai-embed-large-v1 model with 1024-dimensional embeddings, binary quantization reduces memory from approximately 160 GB (float32) to just 5.2 GB — a 30× compression — while retaining over 95% of float32 retrieval accuracy when combined with int8 rescoring. Hamming distance-based similarity search over binary vectors delivers a 15–45× speedup relative to float32 cosine similarity, with a mean speedup of 25×. The complete system operates within 8 GB of RAM on a CPU-only machine, making high-quality semantic search viable for cost-sensitive and resource-constrained applications. An interactive Gradio interface is provided for real-time query evaluation.
Keywords: semantic search, binary quantization, embedding compression, FAISS, vector retrieval, information retrieval, RAG, Hamming distance, int8 rescoring, large-scale NLP
The rise of transformer-based sentence encoders has established dense vector retrieval as a powerful paradigm for semantic search, question answering, and Retrieval-Augmented Generation (RAG) pipelines. Unlike traditional keyword-based systems such as BM25, dense retrieval encodes both queries and documents as continuous vectors in a shared semantic space, enabling retrieval based on conceptual meaning rather than surface form.
Despite this power, deploying dense retrieval at scale remains challenging. A corpus of 41 million documents with 1024-dimensional float32 embeddings requires approximately 160 GB of storage for the vectors alone, plus an additional ~20 GB for a Hierarchical Navigable Small World (HNSW) search index — totalling nearly 180 GB of RAM. For the majority of practitioners, this presents an insurmountable barrier: GPU-grade infrastructure is expensive, and most edge or cloud CPU deployments operate under strict memory budgets.
This paper presents a complete, open-source implementation of a large-scale semantic search system that sidesteps these constraints through embedding quantization — specifically, binary (1-bit) quantization for the search index and int8 (8-bit) quantization for a precision-restoring rescoring stage. The corpus used is the full English Wikipedia, totalling 41 million passages, making this one of the most practically relevant open benchmarks for large-scale dense retrieval research.
The key contributions of this work are:

- A two-stage retrieval pipeline that pairs binary quantized embeddings for fast candidate retrieval with int8 quantized embeddings for precision rescoring.
- A memory and latency analysis over 41 million Wikipedia passages, showing a reduction from ~180 GB to ~8 GB of RAM and a 15–45× search speedup while retaining over 95% of float32 retrieval accuracy.
- A complete open-source, CPU-only implementation, including a Gradio interface for interactive query evaluation.
Dense retrieval systems encode text passages as fixed-length vectors using neural encoders, typically transformer-based models such as BERT variants or bi-encoders fine-tuned with contrastive objectives. At inference time, a query is encoded into the same vector space and the most semantically similar documents are retrieved using approximate nearest neighbour (ANN) search.
State-of-the-art dense encoders such as mixedbread-ai/mxbai-embed-large-v1, sentence-transformers/all-mpnet-base-v2, and OpenAI's text-embedding-3-large produce high-quality 768–3072 dimensional representations. These models are trained using objectives that encourage uniform distribution in embedding space, a property that is exploited by quantization schemes.
Libraries such as FAISS (Facebook AI Similarity Search) and USearch support efficient ANN retrieval at scale. FAISS provides implementations of flat (exact) and IVF-based (approximate) indices for both float32 and binary vectors, while USearch extends this to int8 and other quantized types with efficient on-disk support.
HNSW (Hierarchical Navigable Small World) graphs offer excellent recall-speed tradeoffs for float32 retrieval but do not translate gracefully to memory-constrained settings — their graph structure introduces significant overhead beyond the raw embedding storage.
Quantization is the process of representing floating-point values at reduced precision. In the context of embeddings:
Prior work by Hugging Face (Zhu et al., 2024) and Matryoshka Representation Learning (Kusupati et al., 2022) has demonstrated that modern embedding models can be substantially compressed with minimal accuracy degradation. Binary quantization in particular has attracted interest because of the extreme hardware efficiency of Hamming distance computation: modern CPUs can execute XOR + POPCOUNT operations in as few as 2 clock cycles.
Cascade retrieval — using a cheap first-stage ranker to shortlist candidates followed by an expensive re-ranker — is well established in information retrieval. ColBERT (Khattab & Zaharia, 2020) pioneered multi-vector dense retrieval with late interaction, while subsequent work explored scalar-quantized first stages with cross-encoder rerankers. Our approach applies this cascade principle entirely within the embedding space, avoiding the need for separate reranker models.
The proposed system implements a two-stage retrieval pipeline with three core components: the embedding model, the binary FAISS index, and the int8 USearch index.
```
Query Text
     │
     ▼
[Embedding Model: mxbai-embed-large-v1]
     │
     ├── Binary Quantization ──▶ [FAISS Binary Index]
     │                                   │
     │                      Top-K × multiplier candidates
     │                                   │
     └── Float32 Query ─────────▶ [Int8 USearch Index (on disk)]
                                          │
                             Lazy-load candidate vectors
                                          │
                               Dot-product rescoring
                                          │
                                   Top-K results
```
Figure 1. Two-stage retrieval pipeline. The binary index performs fast coarse retrieval; int8 embeddings loaded on-demand refine the ranking.
| Component | Format | Location | Size (41M docs) |
|---|---|---|---|
| Embedding Model | float32 | RAM | ~2–3 GB |
| Binary Search Index | ubinary (1-bit) | RAM | ~5.2 GB |
| Int8 Rescoring Index | int8 (8-bit) | Disk | ~47.5 GB |
| Total RAM | — | — | ~8 GB |
This compares to approximately 180 GB for a conventional float32 HNSW-based system — a 22.5× reduction in RAM footprint.
Given an embedding vector x ∈ ℝ^d normalized to unit length, we define:
Int8 quantization:
x_int8[i] = round(x[i] × 127)
Binary (ubinary) quantization:
x_bin[i] = 1 if x[i] > 0
x_bin[i] = 0 otherwise
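For concreteness, both schemes can be written directly in NumPy. This is a minimal sketch of the definitions above (the repository relies on the sentence-transformers quantize_embeddings utility, whose int8 path may use calibration ranges rather than a fixed ×127 scale):

```python
import numpy as np

def int8_quantize(x: np.ndarray) -> np.ndarray:
    """Scale unit-normalized float32 embeddings per the round(x * 127) definition above."""
    return np.round(x * 127).astype(np.int8)

def binary_quantize(x: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 bits per byte (ubinary layout)."""
    bits = (x > 0).astype(np.uint8)      # shape: (n_docs, 1024) of 0/1
    return np.packbits(bits, axis=-1)    # shape: (n_docs, 128) packed bytes
```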
For binary vectors, pairwise similarity is measured via Hamming distance — the number of bit positions at which two vectors differ:
d_H(a, b) = popcount(a XOR b)
For unit-normalized embeddings, Hamming distance over the binary codes is a monotone transform of the cosine similarity between those codes, so it approximately preserves the ranking produced by float32 cosine similarity.
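As an illustration, the Hamming distance between two packed ubinary codes can be computed with XOR followed by a population count. The NumPy sketch below uses unpackbits for the popcount; FAISS's binary indices perform the equivalent computation with vectorized popcount instructions.

```python
import numpy as np

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """d_H(a, b) = popcount(a XOR b) for two packed ubinary vectors (uint8 arrays)."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())
```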
Modern embedding models trained with contrastive objectives learn representations that are robust to sign changes — the sign of each dimension encodes a coarse semantic signal. When two document embeddings share the same sign across most dimensions, they are semantically similar; when they differ, they are dissimilar. Binary quantization preserves exactly this sign information, discarding only the magnitude.
Additionally, embedding models such as mxbai-embed-large-v1 are explicitly trained with quantization robustness in mind, using techniques from Matryoshka Representation Learning that encourage dimensional importance ordering.
Stage 1: Binary FAISS Index
The binary index (IndexBinaryFlat or IndexBinaryIVF) is constructed by encoding each passage batch with the embedding model, quantizing the float32 embeddings to packed binary vectors (quantize_embeddings(x, precision="ubinary")), and adding them to the index:

```python
# Pseudocode: binary index construction
import faiss
from sentence_transformers.quantization import quantize_embeddings

binary_index = faiss.IndexBinaryFlat(1024)  # dimensionality is given in bits
for batch in wikipedia_passages:
    embeddings = model.encode(batch, normalize_embeddings=True)
    binary_embs = quantize_embeddings(embeddings, precision="ubinary")
    binary_index.add(binary_embs)
faiss.write_index_binary(binary_index, "binary_index.bin")
```
Stage 2: Int8 USearch Index
The int8 index is stored on disk and accessed via lazy loading during rescoring. Each float32 embedding is quantized to int8 (round(x × 127) per dimension) before being added to the USearch index.
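A minimal construction sketch is shown below. It assumes the usearch package's Index class (Index, add, save, restore) and applies the paper's round(x × 127) quantization; the repository's save_int8_index.py may differ in details such as batching.

```python
import numpy as np
from usearch.index import Index

# Int8 index over 1024-dimensional vectors, scored by inner product.
int8_index = Index(ndim=1024, metric="ip", dtype="i8")

# In practice this encoding loop is batched exactly like the Stage 1 pseudocode.
embeddings = model.encode(wikipedia_passages, normalize_embeddings=True)   # float32, unit norm
int8_embeddings = np.round(embeddings * 127).astype(np.int8)               # per-dimension int8
int8_index.add(np.arange(len(int8_embeddings)), int8_embeddings)           # keys = passage ids
int8_index.save("int8_index.usearch")

# At query time the file is memory-mapped rather than loaded into RAM:
# usearch_index = Index.restore("int8_index.usearch", view=True)
```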
Step 1 — Query Encoding:
```python
query_embedding = model.encode(query_text, normalize_embeddings=True)
```
Step 2 — Binary Search (Stage 1):
```python
binary_query = quantize_embeddings(query_embedding.reshape(1, -1), precision="ubinary")
_, candidate_ids = binary_index.search(binary_query, k * rescore_multiplier)
candidate_ids = candidate_ids[0]  # FAISS returns (distances, ids) with a leading batch axis
```
Where rescore_multiplier is typically set to 4 (retrieving 80 candidates when k=20).
Step 3 — Int8 Rescoring (Stage 2):
```python
int8_candidates = usearch_index.get(candidate_ids)  # lazy disk load
scores = np.dot(query_embedding.astype(np.float32), int8_candidates.T)
top_k_ids = candidate_ids[np.argsort(scores)[::-1][:k]]
```
The float32 query is dot-producted against the int8 candidate embeddings. While int8 operations introduce minor precision loss, the rescoring stage substantially recovers accuracy lost in binary coarse retrieval.
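Putting the three steps together, a single query function might look like the sketch below. Variable names (model, binary_index, usearch_index) follow the snippets above; the wrapper itself is illustrative rather than the repository's exact app.py code.

```python
import numpy as np
from sentence_transformers.quantization import quantize_embeddings

def search(query_text: str, k: int = 20, rescore_multiplier: int = 4):
    # Step 1: encode the query at full float32 precision.
    query_embedding = model.encode(query_text, normalize_embeddings=True)

    # Step 2: coarse retrieval over the binary FAISS index (Hamming distance).
    binary_query = quantize_embeddings(query_embedding.reshape(1, -1), precision="ubinary")
    _, candidate_ids = binary_index.search(binary_query, k * rescore_multiplier)
    candidate_ids = candidate_ids[0]

    # Step 3: rescore the shortlist with int8 embeddings loaded lazily from disk.
    int8_candidates = np.asarray(usearch_index.get(candidate_ids), dtype=np.float32)
    scores = int8_candidates @ query_embedding.astype(np.float32)
    order = np.argsort(scores)[::-1][:k]
    return candidate_ids[order], scores[order]
```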
Wikipedia English Corpus (41M passages)
- Source: the wikipedia dataset, processed into ~100-word passages
- Embedding model: mixedbread-ai/mxbai-embed-large-v1

| Resource | Specification |
|---|---|
| CPU | Standard x86-64 (no GPU) |
| RAM | 8 GB available for inference |
| Storage | ~50 GB for int8 index on disk |
| OS | Linux (Ubuntu) |
| System | Precision | Index | RAM Required |
|---|---|---|---|
| Baseline | float32 | HNSW | ~180 GB |
| Ours (Stage 1 only) | ubinary | FAISS Flat | ~5.2 GB |
| Ours (Stage 1 + 2) | ubinary + int8 | FAISS + USearch | ~8 GB |
| Configuration | RAM | Reduction vs. float32 |
|---|---|---|
| Float32 (baseline) | ~180 GB | — |
| Binary only | ~5.2 GB | 34.6× |
| Binary + model | ~8 GB | 22.5× |
The binary index achieves a 32× compression in raw vector storage (from 4 bytes/dim to 1 bit/dim). Including the embedding model in memory, the full system fits within 8 GB of RAM.
| Method | Speedup vs. float32 |
|---|---|
| Binary search (Stage 1) | 15–45× (mean: 25×) |
| Binary + int8 rescore | ~20× overall |
Hamming distance computation via XOR + POPCOUNT is so computationally inexpensive that even with int8 rescoring overhead, the full pipeline maintains a roughly 20× speedup over float32 cosine similarity search.
| Stage | Accuracy vs. float32 |
|---|---|
| Binary only (Stage 1) | ~90% |
| Binary + int8 rescore (Stage 1+2) | ~95% |
Accuracy is measured as Recall@K relative to the float32 baseline. The addition of int8 rescoring recovers approximately half the accuracy gap between binary and float32 search, achieving a quality level suitable for the vast majority of real-world applications.
| Metric | Float32 | Binary Only | Binary + Int8 |
|---|---|---|---|
| RAM | ~180 GB | ~5.2 GB | ~8 GB |
| Search Speed | 1× | 15–45× | ~20× |
| Recall@K | 100% | ~90% | ~95% |
| GPU Required | Recommended | No | No |
| Cost | Very High | Very Low | Very Low |
A key insight underlying this system is that different stages of retrieval have different precision requirements. The binary stage acts as a filter: it need not rank documents perfectly, only exclude obviously irrelevant ones. The int8 stage refines this shortlist with higher precision. The query itself is always encoded at float32 to preserve semantic quality at the source.
This cascade mirrors established patterns in search engineering — fast approximate first-stage retrieval followed by slower but more accurate reranking — but applies the principle entirely within a single embedding model's representational space.
The rescore_multiplier hyperparameter controls the tradeoff between recall and latency. A higher multiplier retrieves more Stage 1 candidates for rescoring, improving recall at the cost of additional disk reads and dot products. In our implementation, a multiplier of 4 provides a good balance, but it should be tuned to application requirements: recall-critical workloads benefit from a larger multiplier, while latency-sensitive ones favour a smaller one.
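A simple way to choose the multiplier is to sweep a few values against a float32 reference ranking on held-out queries. In the sketch below, search_two_stage and search_float32_exact are hypothetical helpers wrapping the pipeline above and an exact float32 search, and validation_queries is an assumed list of query strings:

```python
import time

def recall_at_k(retrieved_ids, reference_ids):
    """Fraction of the float32 reference results recovered by the quantized pipeline."""
    return len(set(retrieved_ids) & set(reference_ids)) / len(reference_ids)

for multiplier in (2, 4, 8, 16):
    recalls, latencies = [], []
    for query in validation_queries:
        reference_ids, _ = search_float32_exact(query, k=20)   # hypothetical exact baseline
        start = time.perf_counter()
        retrieved_ids, _ = search_two_stage(query, k=20, rescore_multiplier=multiplier)
        latencies.append(time.perf_counter() - start)
        recalls.append(recall_at_k(retrieved_ids, reference_ids))
    print(f"multiplier={multiplier}: "
          f"Recall@20={sum(recalls) / len(recalls):.3f}, "
          f"latency={sum(latencies) / len(latencies) * 1000:.1f} ms")
```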
Float16 (half-precision): Offers better accuracy than int8 at 2× compression relative to float32. However, it does not approach the memory reduction of binary quantization and cannot exploit hardware-efficient Hamming-distance search. It is most useful when GPU infrastructure is available.
Product Quantization (PQ): A classical ANN technique that decomposes high-dimensional vectors into subvectors and quantizes each independently. PQ achieves aggressive compression but requires non-trivial index training and can degrade quality on embedding vectors with non-uniform dimensional distributions.
Matryoshka Embedding Truncation: An alternative compression strategy that uses shorter prefixes of Matryoshka embeddings (e.g., 256 dimensions instead of 1024). This preserves float32 precision but reduces dimensionality, offering a different accuracy-efficiency tradeoff.
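As a quick illustration of this alternative (not part of the presented system), truncation keeps a prefix of each Matryoshka embedding and re-normalizes it:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize each row to unit length."""
    truncated = embeddings[:, :dims].astype(np.float32)
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
```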
Binary quantized retrieval makes it feasible to build production-grade RAG pipelines without GPU infrastructure, dramatically lowering the cost bar for startups and research groups working with large corpora.
With an 8 GB RAM footprint, the system can run on consumer laptops, edge servers, and moderately equipped cloud instances. This enables private, on-device semantic search for enterprise document management and personal knowledge bases.
Legal, scientific, and regulatory document search systems often involve tens of millions of documents. The presented approach enables semantic retrieval at this scale without dedicated vector database infrastructure.
Binary quantized dense retrieval can be combined with sparse retrieval (BM25) in a hybrid search architecture. The dense component handles semantic queries; BM25 handles exact keyword matches. The memory savings from quantization leave headroom for the sparse index.
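One simple fusion strategy (not prescribed by this work) is reciprocal rank fusion, which merges the ranked id lists returned by the dense and BM25 retrievers:

```python
def reciprocal_rank_fusion(dense_ids, bm25_ids, k: int = 60):
    """Merge two ranked id lists; documents ranked highly by either retriever score well."""
    scores = {}
    for ranking in (dense_ids, bm25_ids):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```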
Several extensions of this work merit investigation:
Multi-stage rescoring: A three-stage pipeline (binary → int8 → float16) could further improve accuracy at the cost of slightly increased latency and storage.
Dynamic multiplier adaptation: Query-difficulty estimation could automatically adjust the rescore multiplier — using larger candidate sets for ambiguous queries and smaller sets for highly specific ones.
Distributed sharding: For corpora exceeding 100 million documents, binary indices can be sharded across multiple machines with simple merge-based candidate fusion.
Production vector database integration: Managed vector databases such as Qdrant and Vespa.ai support scalar quantization natively and could be used to productionize this architecture with added reliability, monitoring, and multi-tenancy.
Model-aware quantization: Fine-tuning embedding models with binary quantization as an explicit training objective (beyond existing quantization-aware training) may further close the accuracy gap relative to float32 retrieval.
This paper demonstrated that semantic search over 41 million Wikipedia documents is achievable on commodity CPU hardware using binary quantized embeddings for approximate retrieval and int8 embeddings for precision rescoring. The two-stage pipeline reduces RAM requirements by 22.5× and achieves a mean search speedup of 25× relative to float32 baselines, while retaining ~95% of float32 retrieval accuracy.
The core principle — that not every retrieval operation requires full precision — provides a practical framework for scaling semantic search to large corpora without prohibitive infrastructure costs. The open-source implementation makes this approach immediately accessible for real-world applications in RAG systems, document search, edge deployment, and hybrid retrieval pipelines.
As embedding quantization becomes increasingly supported by modern embedding models and vector database backends, this multi-precision retrieval paradigm is likely to become a standard component of scalable semantic search infrastructure.
| Precision | Bytes/dim | Memory (41M × 1024 dims) | Compression |
|---|---|---|---|
| float32 | 4 | ~160 GB | 1× |
| int8 | 1 | ~40 GB | 4× |
| binary (ubinary) | 0.125 | ~5 GB | 32× |
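The table follows directly from the corpus size and dimensionality; a quick back-of-the-envelope check in Python (small differences from the table reflect rounding and GB/GiB conventions):

```python
n_docs, dims = 41_000_000, 1024

float32_gb = n_docs * dims * 4 / 1e9    # ≈ 168 GB  (4 bytes per dimension)
int8_gb    = n_docs * dims * 1 / 1e9    # ≈ 42 GB   (1 byte per dimension)
binary_gb  = n_docs * dims / 8 / 1e9    # ≈ 5.2 GB  (1 bit per dimension)
print(float32_gb, int8_gb, binary_gb)
```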
BinaryQuantised-Embedding-for-Wikipedia-search/
├── save_binary_index.py # Build FAISS binary index from float32 embeddings
├── save_int8_index.py # Build USearch int8 index for rescoring
├── app.py # Gradio web interface for interactive search
├── requirements.txt # Python dependencies
└── README.md # Project overview and setup guide
Quick Start:
```bash
git clone https://github.com/Suchi-BITS/BinaryQuantised-Embedding-for-Wikipedia-search
cd BinaryQuantised-Embedding-for-Wikipedia-search
pip install -r requirements.txt
python save_binary_index.py
python save_int8_index.py
python app.py
```
© 2026 Shuchismita Sahu. This work is published for open research and education. Please cite the GitHub repository and this paper if you build upon this work.