Traditional keyword-based search often fails to capture the true intent behind a user's query — especially when working with complex, unstructured text like books, documents, or user notes. LibroVault-Python addresses this limitation by implementing a modular, high-performance semantic search API built with FastAPI and Sentence-Transformers.
Designed for seamless integration as a microservice, it exposes two core endpoints:
- POST /api/embedding/make — generates dense vector embeddings from raw text
- POST /api/embedding/compare — compares a query embedding against stored vectors and returns the most semantically similar items

The system makes it easy to switch between models: all-MiniLM-L6-v2 for English and paraphrase-multilingual-MiniLM-L12-v2 for other languages. Internally, the architecture is split into three layers — configuration/model core, service logic, and API orchestration — enabling a clean separation of concerns and easy extensibility.
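As an illustration of how a client might call these endpoints, here is a sketch of the request bodies; the exact field names are assumptions for demonstration, not taken from the project's schema:

```python
import json

# Hypothetical body for POST /api/embedding/make (field names are
# illustrative; check the project's actual request schema).
make_request = {"text": "A story about friendship and loss"}

# Hypothetical body for POST /api/embedding/compare.
compare_request = {
    "text": "books about grief",        # query to embed and compare
    "embedding_files": ["books.json"],  # stored embedding files to search
    "top_k": 5,                         # how many matches to return
}

payload = json.dumps(compare_request)
print(payload)
```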
The project achieves low-latency performance (~50ms per embedding, ~10ms for vector search across thousands of items), with normalization and vector operations powered by NumPy and Hugging Face’s semantic_search. Embeddings are loaded and compared dynamically from JSON files, but the design allows for future integration with vector databases (e.g., FAISS, Qdrant) for scalable deployment.
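The vector math behind this is compact. Here is a minimal NumPy sketch of L2 normalization and cosine ranking, which is what semantic_search does at its core (the real helper adds batching and efficient top-K selection):

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)  # avoid division by zero

# Toy corpus of 3 "embeddings" (real MiniLM vectors are 384-dimensional)
corpus = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]))
query = l2_normalize(np.array([[0.9, 0.1]]))

# Cosine similarity of the query against every stored vector
scores = corpus @ query.T            # shape: (3, 1)
ranking = np.argsort(-scores[:, 0])  # indices from most to least similar
print(ranking)
```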
LibroVault-Python is ideal for any application requiring context-aware search — from virtual libraries and academic archives to document assistants and recommendation engines.
The architecture of LibroVault-Python is designed around clarity, modularity, and extensibility. It follows a three-layered structure:
Core (Configuration & Model Initialization)
- config.py loads environment variables and defines normalization behavior for each supported model.
- model.py instantiates a single SentenceTransformer object based on MODEL_NAME, prints model load info, and configures optional L2 normalization.
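A minimal sketch of this environment-driven setup: MODEL_NAME matches the variable named above, while the per-model normalization table is an assumption for illustration, not copied from the project.

```python
import os

# Model is chosen at startup from the environment (.env), not hard-coded.
MODEL_NAME = os.getenv("MODEL_NAME", "all-MiniLM-L6-v2")

# Per-model normalization behavior: with L2-normalized outputs, cosine
# similarity reduces to a plain dot product. (Illustrative mapping only.)
NORMALIZE_BY_MODEL = {
    "all-MiniLM-L6-v2": True,
    "paraphrase-multilingual-MiniLM-L12-v2": True,
}
NORMALIZE = NORMALIZE_BY_MODEL.get(MODEL_NAME, False)

print(f"Loading model: {MODEL_NAME} (normalize={NORMALIZE})")
```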
Embedding Generation

- generate_embedding(text: str) in embedding_service.py encodes input text into a dense vector using the preloaded model.
- The /api/embedding/make endpoint receives raw text, delegates processing to the service layer, and returns a vector of floats.
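A runnable sketch of this service function, with a stub standing in for the real SentenceTransformer so the example works without downloading model weights (the 384-dimension size matches MiniLM models; the stub's internals are made up):

```python
import numpy as np

class StubModel:
    """Stands in for SentenceTransformer; encode() returns a dense vector."""
    def encode(self, text: str, normalize_embeddings: bool = True) -> np.ndarray:
        # Deterministic fake embedding derived from the text, for demo only.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        vec = rng.standard_normal(384)  # MiniLM models emit 384-dim vectors
        return vec / np.linalg.norm(vec) if normalize_embeddings else vec

model = StubModel()  # the real code would load SentenceTransformer(MODEL_NAME)

def generate_embedding(text: str) -> list[float]:
    """Encode text into a dense vector; the endpoint returns this as JSON."""
    return model.encode(text).tolist()

vector = generate_embedding("semantic search for virtual libraries")
print(len(vector))  # 384
```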
Semantic Comparison

- The /api/embedding/compare endpoint accepts a query text and paths to embedding files.
- search_service.py uses Hugging Face’s semantic_search function on NumPy arrays to return the top-K most similar vectors.
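The comparison flow (load stored vectors from JSON, rank by cosine similarity, return the top-K) can be sketched end to end; the JSON layout and field names here are assumptions about the storage format:

```python
import json
import os
import tempfile
import numpy as np

# Write a toy embedding file in an assumed layout: one record per item.
records = [
    {"book_id": 1, "page_number": 10, "embedding": [1.0, 0.0]},
    {"book_id": 2, "page_number": 3,  "embedding": [0.0, 1.0]},
    {"book_id": 3, "page_number": 7,  "embedding": [0.7, 0.7]},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(records, f)
    path = f.name

def compare(query_vec: list[float], path: str, top_k: int = 2) -> list[dict]:
    """Rank stored embeddings against the query by cosine similarity."""
    with open(path) as f:
        items = json.load(f)
    corpus = np.array([it["embedding"] for it in items], dtype=float)
    corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
    q = np.asarray(query_vec, dtype=float)
    q /= np.linalg.norm(q)
    scores = corpus @ q
    best = np.argsort(-scores)[:top_k]
    return [{**items[i], "score": float(scores[i])} for i in best]

results = compare([0.9, 0.1], path)
print([r["book_id"] for r in results])  # [1, 3]
os.unlink(path)  # clean up the demo file
```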
Best Practices

- Model selection and other settings live in .env, enabling runtime changes with no code edits.
- File loading is wrapped in try/except blocks to gracefully handle missing or malformed data.

This methodology ensures that LibroVault-Python remains easy to test, adapt, and scale as project requirements evolve.
LibroVault-Python delivers low-latency, high-accuracy semantic search suitable for real-world NLP applications. Performance testing has shown:
- ~50 ms per embedding and ~10 ms per vector search across thousands of items with all-MiniLM-L6-v2

Key advantages demonstrated:
- Models can be swapped via .env, enabling experimentation and multilingual support without changing code

An example response from /api/embedding/compare includes book_id, page_number, file path, and cosine similarity scores — allowing developers to rank and retrieve documents based on conceptual relevance, not just keyword overlap.
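Putting those fields together, a response might look like the following (values and exact key names are hypothetical, shaped after the fields listed above):

```python
import json

# Hypothetical /api/embedding/compare response; field names mirror the
# ones described above, values are invented for illustration.
response = {
    "results": [
        {"book_id": 42, "page_number": 17,
         "file": "embeddings/book_42.json", "score": 0.87},
        {"book_id": 42, "page_number": 18,
         "file": "embeddings/book_42.json", "score": 0.81},
    ]
}
print(json.dumps(response, indent=2))
```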
If you're interested in exploring the project further, you're welcome to contribute or review the source code on GitHub:
🔗 https://github.com/Hugobsan/LibroVault-Python
You can also see the system in action through the user-facing interface available at:
🌐 https://librovault.hugobsan.tech
Note: the interface is available only in Portuguese 🇧🇷
To request demo access, feel free to send an email to hbsantos36@gmail.com, and I’ll provide a login and password — no cost involved.