Semantica is a Python-based web application designed to provide intelligent book recommendations. It operates on the premise that nuanced understanding of book content, derived from titles and descriptions, leads to more relevant suggestions than traditional keyword-based methods. Semantica leverages sentence embeddings generated by state-of-the-art transformer models and calculates cosine similarity to identify semantically similar books. The system features an offline embedding generation process and a Flask-based web interface for users to select a book and receive tailored recommendations.
In an era of vast digital libraries, finding the next great read can be overwhelming. Traditional recommendation systems often rely on explicit user ratings, purchase history, or simple keyword matching, which can miss subtle connections between books or suffer from cold-start problems. Semantica addresses this by employing semantic search techniques. By converting book titles and descriptions into dense vector representations (embeddings) using the sentence-transformers
library, we capture the underlying meaning and context. These embeddings allow us to compute a similarity matrix using cosine similarity, forming the backbone of our recommendation engine. The Flask framework then serves these recommendations through an intuitive web interface. This article details the architecture of Semantica, from data preparation and embedding generation to the core recommendation logic and its deployment as a web service.
Recommendation systems have evolved significantly. Early approaches centered on:

- **Collaborative filtering:** recommending based on explicit user ratings and purchase history.
- **Keyword matching:** content-based filtering that compares explicit tags or surface-level terms.
The advent of powerful Natural Language Processing (NLP) models, particularly transformers, has unlocked a new dimension for content-based filtering: semantic similarity. Instead of just matching explicit tags, we can now compare the meaning embedded within textual descriptions. Semantica embraces this modern approach, utilizing pre-trained sentence embedding models like 'all-MiniLM-L6-v2' to understand book content at a deeper level.
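Concretely, "semantic similarity" here boils down to cosine similarity between embedding vectors. A minimal sketch with toy vectors (real models like 'all-MiniLM-L6-v2' produce 384-dimensional embeddings; the 4-dimensional values below are made up purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — two war-themed books and one cookbook
civil_war_novel = [0.9, 0.1, 0.8, 0.2]
history_of_war  = [0.8, 0.2, 0.9, 0.1]
cookbook        = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(civil_war_novel, history_of_war))  # close to 1.0
print(cosine_similarity(civil_war_novel, cookbook))        # much lower
```

Books whose texts express related meanings end up with nearby vectors, so their cosine similarity approaches 1.0 even when they share no keywords.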
## Architecture Overview

- **Data preparation (`prod_dataset.csv`):** Load the dataset and fill missing descriptions with `fillna('')` before combining each title and description into a single text.
- **Offline embedding generation (`generate_embeddings.py`):** Uses `sentence-transformers` (specifically 'all-MiniLM-L6-v2') to convert combined texts into high-dimensional vectors, saved to `prod_dataset_combined_embeddings.csv`. This pre-computation is crucial for efficient runtime performance of the recommendation app, as embedding generation can be time-consuming.
- **`convert_embedding` function (`app.py`):** Parses the stringified embeddings stored in the CSV back into usable NumPy arrays.
- **`cosine_similarity` matrix (`app.py`):** This matrix is the core of the recommendation engine, pre-calculating similarity between all book pairs.
- **`get_similar_books` function:** Looks up rows of the pre-computed matrix, taking a book index and `top_n` as input.
- **Web routes (`app.py`):**
  - `/` (index route): Renders `index.html`, passing a list of book titles and their corresponding DataFrame indices for easy selection in a dropdown.
  - `/recommend` (recommendation route): Receives `book_id` (the DataFrame index) as a GET parameter, validates `book_id`, and calls `get_similar_books` to fetch recommendations.

### The Dataset (`prod_dataset.csv`)

The dataset contains book information, with 'Title' and 'Description' being key for our semantic analysis. A two-record sample:
```csv
Title,Authors,Description,Category,Publisher,Publish Date,Price
In the Bedroom,"By Dubus, Andre","The seven stories collected here–including “Killings,” the basis for Todd Field’s award-winning film In the Bedroom–showcase legendary writer Andre Dubus’s sheer narrative mastery in a book of quietly staggering emotional power. A father in mourning contemplates the unthinkable as the only way to allay his grief. A boy must learn to care for his younger brother when their mother leaves the family. A young woman who has never lacked lovers despairs of ever finding love itself, and then makes an accidental discovery that brings her real joy. Culled from Dubus’s treasured collections Selected Stories and Dancing After Hours, these beautiful stories of people at pivotal moments in their lives are some of the most bewitching and profound in American fiction."," Fiction , Media Tie-In",Vintage,"Friday, February 1, 2002",Price Starting at $5.29
Captain Kate,"By Reeder, Carolyn","While the Civil War rages all around them, Kate enlists the aid of her stepbrother Seth to ferry coal from Cumberland, Maryland, to Georgetown for the good of her family. Reprint."," Juvenile Fiction , Historical , United States , General",Avon Camelot,"Saturday, January 1, 2000",Price Starting at $5.29
```
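To confirm that the 'Title' and 'Description' columns parse as expected and can be combined into one text per book, here is a quick sketch on an abbreviated two-row sample (the shortened descriptions are illustrative, not the real data):

```python
import io

import pandas as pd

# Abbreviated sample in the same shape as prod_dataset.csv
sample = io.StringIO(
    'Title,Authors,Description\n'
    'In the Bedroom,"By Dubus, Andre","Seven stories of quiet emotional power."\n'
    'Captain Kate,"By Reeder, Carolyn","A Civil War story of family and coal barges."\n'
)
df = pd.read_csv(sample)
df["Description"] = df["Description"].fillna("")

# Same Title + ". " + Description combination the embedding script uses
combined = (df["Title"] + ". " + df["Description"]).tolist()
print(combined[0])  # In the Bedroom. Seven stories of quiet emotional power.
```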
### Embedding Generation (`generate_embeddings.py`)

This script processes the raw dataset, combines titles and descriptions, generates embeddings using 'all-MiniLM-L6-v2', and saves them to a new CSV.
```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load the original dataset
df = pd.read_csv('prod_dataset.csv')
df['Description'] = df['Description'].fillna('')

# Combine Title and Description
combined_texts = (df['Title'] + ". " + df['Description']).tolist()

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(combined_texts, show_progress_bar=True)

# Save the new CSV file.
# Convert each embedding (a NumPy array) to a string representation.
df['Combined_Embedding'] = [str(list(e)) for e in embeddings]
df.to_csv('prod_dataset_combined_embeddings.csv', index=False)
print("Combined embeddings saved to prod_dataset_combined_embeddings.csv")
```
### The `convert_embedding` Function (`app.py`)

This utility is vital for converting the string representation of embeddings (stored in the CSV) back into usable NumPy arrays. It handles various formatting quirks that can arise from stringifying complex Python objects.
```python
import ast
import re

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from flask import Flask, request, render_template, jsonify


def convert_embedding(embedding_str):
    """
    Convert a string representation of an embedding into a NumPy array.
    Removes any "np.float32(...)" wrappers and "array(...)" wrappers
    before converting.
    """
    if not isinstance(embedding_str, str) or not embedding_str.strip():
        return None
    # Remove np.float32 wrappers using regex.
    if "np.float32(" in embedding_str:
        embedding_str = re.sub(r'np\.float32\(([^)]+)\)', r'\1', embedding_str)
    # Remove "array(" wrapper if present.
    if embedding_str.startswith("array(") and embedding_str.endswith(")"):
        embedding_str = embedding_str[len("array("):-1]
    try:
        return np.array(ast.literal_eval(embedding_str))
    except Exception:
        try:
            # Handle cases like '[1.0 2.0 ...]' or '1.0, 2.0, ...'
            s = embedding_str.strip().strip("[]")
            parts = s.split(",") if "," in s else s.split()
            floats = [float(x.strip()) for x in parts if x.strip()]
            return np.array(floats)
        except Exception:
            return None  # Final fallback
```
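To see why the `np.float32` cleanup is needed: under NumPy 2.x, `str()` on a list of float32 scalars can emit wrappers that `ast.literal_eval` refuses to parse, because they are function calls rather than literals. A small illustration (the raw string is a hypothetical CSV cell):

```python
import ast
import re

import numpy as np

# Hypothetical CSV cell as str(list(...)) might write it under NumPy 2.x
raw = "[np.float32(0.12), np.float32(-0.5), np.float32(0.33)]"

# ast.literal_eval(raw) would raise ValueError here, so strip the wrappers first.
cleaned = re.sub(r'np\.float32\(([^)]+)\)', r'\1', raw)
vec = np.array(ast.literal_eval(cleaned))
print(vec.tolist())  # [0.12, -0.5, 0.33]
```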
The application loads the embeddings, converts them, and then computes the similarity matrix.
```python
# In app.py
df = pd.read_csv('prod_dataset_combined_embeddings.csv')
df['Description'] = df['Description'].fillna('')  # ensure consistency if needed later
df['Combined_Embedding'] = df['Combined_Embedding'].apply(
    lambda x: convert_embedding(x) if isinstance(x, str) and x.strip() != "" else None
)
df = df[df['Combined_Embedding'].notnull()]  # Crucial: filter out invalid embeddings
df.reset_index(drop=True, inplace=True)      # Reset index for consistent indexing

# Stack embeddings and compute the cosine similarity matrix.
embeddings = np.stack(df['Combined_Embedding'].values)
similarity_matrix = cosine_similarity(embeddings)

# Prepare a list of books (id and title) for the HTML template.
# This list uses the new index after filtering and resetting.
books_list = [{"id": i, "title": title} for i, title in enumerate(df["Title"].tolist())]
```
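The resulting matrix is square: entry (i, j) is the similarity between books i and j. `cosine_similarity` is equivalent to L2-normalizing each row and taking a matrix product, as this NumPy-only sketch with random stand-in embeddings shows:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(5, 8))  # 5 "books", 8-dim toy embeddings

# Cosine similarity = dot products of L2-normalized rows
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = normed @ normed.T

print(similarity_matrix.shape)  # (5, 5)
# The diagonal is all 1.0: every book is maximally similar to itself,
# which is why the lookup below skips the top-ranked entry.
```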
After filtering rows with `None` embeddings, it's important to `reset_index`. This ensures that the indices used in `books_list` and received from the frontend correctly map to rows in the filtered `df` and the `similarity_matrix`.
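Why this matters is easiest to see on a toy DataFrame (column values here are illustrative): after boolean filtering, the index keeps gaps, so positional ids coming from the dropdown would point at the wrong rows until the index is reset.

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B", "C", "D"],
                   "Combined_Embedding": [[0.1], None, [0.3], [0.4]]})

filtered = df[df["Combined_Embedding"].notnull()]
print(list(filtered.index))   # [0, 2, 3] -- gaps: position 1 is now label 2 ("C")

filtered = filtered.reset_index(drop=True)
print(list(filtered.index))   # [0, 1, 2] -- positions and labels agree again
```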
### The `get_similar_books` Function

This function retrieves the top N similar books for a given book index.
```python
def get_similar_books(book_index, top_n=5):
    """
    Given a book index (for the filtered df/similarity_matrix), return a
    list of tuples (index, similarity) for the top_n similar books.
    """
    sim_scores = list(enumerate(similarity_matrix[book_index]))
    # Sort by similarity, then drop the first entry: after sorting it is the
    # query book itself (self-similarity of 1.0). Keep the next top_n results.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n + 1]
    return sim_scores  # list of (index_in_filtered_df, score)
```
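Exercising the same logic on a hand-made 4×4 similarity matrix (values hypothetical) shows why the slice starts at 1: the top-ranked entry is always the query book itself.

```python
# Hand-made symmetric similarity matrix for 4 books (hypothetical values)
similarity_matrix = [
    [1.00, 0.82, 0.31, 0.65],
    [0.82, 1.00, 0.27, 0.54],
    [0.31, 0.27, 1.00, 0.12],
    [0.65, 0.54, 0.12, 1.00],
]

def get_similar_books(book_index, top_n=2):
    sim_scores = list(enumerate(similarity_matrix[book_index]))
    # Drop the first entry after sorting: it is the book itself (score 1.0).
    return sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n + 1]

print(get_similar_books(0))  # [(1, 0.82), (3, 0.65)]
```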
### The Recommendation Route (`/recommend`)

This Flask route handles requests for recommendations, calls the core logic, and returns JSON data.
```python
app = Flask(__name__)


@app.route("/")
def index():
    return render_template("index.html", books=books_list)


@app.route("/recommend", methods=["GET"])
def recommend():
    book_id = request.args.get("book_id", type=int)  # index from books_list
    if book_id is None or book_id < 0 or book_id >= len(df):
        return jsonify({"error": "Invalid book id"}), 400

    # book_id directly corresponds to the row index in the filtered df and
    # similarity_matrix, because books_list was created from the filtered df
    # using enumerate.
    recs_from_similarity = get_similar_books(book_id, top_n=5)

    recommendations = []
    for rec_idx, score in recs_from_similarity:
        recommendations.append({
            "id": rec_idx,  # index in the *filtered* df
            "title": df.loc[rec_idx, 'Title'],
            "similarity": round(score, 4)
        })

    selected_book_title = df.loc[book_id, 'Title']
    return jsonify({"selected": selected_book_title, "recommendations": recommendations})


if __name__ == "__main__":
    app.run(debug=True)
```
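The validation path can be checked without a browser using Flask's built-in test client. This self-contained sketch substitutes a toy title list for the real DataFrame and omits the similarity lookup:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
BOOKS = ["In the Bedroom", "Captain Kate", "Dark Places"]  # stand-in for df["Title"]


@app.route("/recommend", methods=["GET"])
def recommend():
    book_id = request.args.get("book_id", type=int)
    if book_id is None or book_id < 0 or book_id >= len(BOOKS):
        return jsonify({"error": "Invalid book id"}), 400
    # The real app would call get_similar_books here; return an empty list.
    return jsonify({"selected": BOOKS[book_id], "recommendations": []})


client = app.test_client()
print(client.get("/recommend?book_id=99").status_code)            # 400
print(client.get("/recommend?book_id=0").get_json()["selected"])  # In the Bedroom
```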
### The Frontend (`templates/index.html`)

The `index.html` template provides a dropdown for book selection; JavaScript fetches and displays recommendations.
```html
<!-- templates/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Semantic Book Recommender</title>
    <style>
        body { font-family: sans-serif; margin: 20px; }
        #results { margin-top: 20px; }
        ul { list-style-type: none; padding-left: 0; }
        li { margin-bottom: 5px; padding: 5px; border: 1px solid #eee; }
    </style>
</head>
<body>
    <h1>Select a Book</h1>
    <select id="book_select">
        <option value="">--Select a Book--</option>
        {% for book in books %}
        <option value="{{ book.id }}">{{ book.title }}</option>
        {% endfor %}
    </select>
    <button onclick="getRecommendations()">Get Recommendations</button>

    <div id="results">
        <!-- Recommendations will be displayed here -->
    </div>

    <script>
        async function getRecommendations() {
            const bookId = document.getElementById("book_select").value;
            const resultsDiv = document.getElementById("results");
            resultsDiv.innerHTML = "<p>Loading...</p>";

            if (bookId === "") {
                resultsDiv.innerHTML = "<p>Please select a book.</p>";
                return;
            }

            try {
                const response = await fetch(`/recommend?book_id=${bookId}`);
                if (!response.ok) {
                    const errorData = await response.json();
                    resultsDiv.innerHTML = `<p>Error: ${errorData.error || response.statusText}</p>`;
                    return;
                }
                const data = await response.json();

                let html = `<h2>Selected: ${data.selected}</h2><h3>Recommendations:</h3><ul>`;
                if (data.recommendations && data.recommendations.length > 0) {
                    data.recommendations.forEach(rec => {
                        html += `<li>${rec.title} (Similarity: ${rec.similarity})</li>`;
                    });
                } else {
                    html += "<li>No recommendations found.</li>";
                }
                html += "</ul>";
                resultsDiv.innerHTML = html;
            } catch (error) {
                resultsDiv.innerHTML = `<p>An error occurred: ${error.message}</p>`;
                console.error("Fetch error:", error);
            }
        }
    </script>
</body>
</html>
```
An example JSON response from `/recommend` (comments added for annotation):

```json
{
  "selected": "In the Bedroom",
  "recommendations": [
    {
      "id": 12,          // Index in the filtered DataFrame
      "title": "The Melting Pot and Other Subversive Stories",
      "similarity": 0.8567
    },
    {
      "id": 36,          // Index in the filtered DataFrame
      "title": "Dark Places",
      "similarity": 0.8234
    }
    // ... more recommendations
  ]
}
```
The project structure and data flow:

```mermaid
graph TD
    A[book-recommender-app/] --> B[app.py]
    A --> C[generate_embeddings.py]
    A --> D[requirements.txt]
    A --> E[prod_dataset.csv]
    A --> F[prod_dataset_combined_embeddings.csv]
    A --> G[templates/]
    A --> H[README.md]
    A --> I[Dockerfile]
    G --> J[index.html]

    %% Data Flow Relationships
    E -->|reads| C
    C -->|generates| F
    F -->|loads into memory| B
    B -->|serves| J
    J -->|makes requests to| B

    %% Dependency Relationships
    D -->|specifies dependencies for| B
    D -->|specifies dependencies for| C

    %% Deployment Relationships
    I -->|containerizes| B
    I -->|includes| F
    I -->|includes| G

    %% Documentation
    H -->|documents| A

    %% Styling for different file types
    classDef pythonFile fill:#3776ab,stroke:#fff,color:#fff
    classDef dataFile fill:#28a745,stroke:#fff,color:#fff
    classDef templateFile fill:#e83e8c,stroke:#fff,color:#fff
    classDef configFile fill:#fd7e14,stroke:#fff,color:#fff
    classDef docFile fill:#6c757d,stroke:#fff,color:#fff
    classDef folderStyle fill:#007bff,stroke:#fff,color:#fff

    class B,C pythonFile
    class E,F dataFile
    class J templateFile
    class D,I configFile
    class H docFile
    class A,G folderStyle

    %% Add labels for key relationships
    C -.->|"1. Processes original data"| E
    C -.->|"2. Creates embeddings file"| F
    B -.->|"3. Loads embeddings at startup"| F
    B -.->|"4. Serves web interface"| J
    J -.->|"5. Ajax calls for recommendations"| B
```
### Key Takeaways

- Uses the `all-MiniLM-L6-v2` model for fast and effective sentence embedding.
- Provides robust parsing (`convert_embedding`) for various string representations of embeddings stored in CSVs.
- Keeps the offline embedding generation script (`generate_embeddings.py`) distinct from the online recommendation serving app (`app.py`).
- Tech stack: Flask, pandas, NumPy, scikit-learn (`cosine_similarity`), Sentence-Transformers.

Semantica currently provides a solid foundation for semantic book recommendations. Future enhancements could include:
- Migrating to a database backend (e.g., PostgreSQL with `pg_vector` for native vector storage and querying) for better scalability and data management.

Semantica demonstrates the power of modern NLP techniques in building practical and intelligent applications. We encourage you to explore the Semantica code and experiment with its capabilities.