Semantica is a Python-based web application designed to provide intelligent book recommendations. It operates on the premise that nuanced understanding of book content, derived from titles and descriptions, leads to more relevant suggestions than traditional keyword-based methods. Semantica leverages sentence embeddings generated by state-of-the-art transformer models and calculates cosine similarity to identify semantically similar books. The system features an offline embedding generation process and a Flask-based web interface for users to select a book and receive tailored recommendations.
In an era of vast digital libraries, finding the next great read can be overwhelming. Traditional recommendation systems often rely on explicit user ratings, purchase history, or simple keyword matching, which can miss subtle connections between books or suffer from cold-start problems. Semantica addresses this by employing semantic search techniques. By converting book titles and descriptions into dense vector representations (embeddings) using the sentence-transformers
library, we capture the underlying meaning and context. These embeddings allow us to compute a similarity matrix using cosine similarity, forming the backbone of our recommendation engine. The Flask framework then serves these recommendations through an intuitive web interface. This article details the architecture of Semantica, from data preparation and embedding generation to the core recommendation logic and its deployment as a web service.
Recommendation systems have evolved significantly. Early approaches centered on:

- **Collaborative filtering:** recommending based on explicit user ratings and purchase history.
- **Keyword matching:** content-based filtering that compares explicit tags or surface-level terms.
The advent of powerful Natural Language Processing (NLP) models, particularly transformers, has unlocked a new dimension for content-based filtering: semantic similarity. Instead of just matching explicit tags, we can now compare the meaning embedded within textual descriptions. Semantica embraces this modern approach, utilizing pre-trained sentence embedding models like 'all-MiniLM-L6-v2' to understand book content at a deeper level.
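Concretely, "semantic similarity" here boils down to cosine similarity between embedding vectors. A minimal sketch with toy vectors (real models like 'all-MiniLM-L6-v2' produce 384-dimensional embeddings; the 4-dimensional values below are made up purely for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — two war-themed books and one cookbook
civil_war_novel = [0.9, 0.1, 0.8, 0.2]
history_of_war  = [0.8, 0.2, 0.9, 0.1]
cookbook        = [0.1, 0.9, 0.0, 0.7]

print(cosine_similarity(civil_war_novel, history_of_war))  # close to 1.0
print(cosine_similarity(civil_war_novel, cookbook))        # much lower
```

Books whose texts express related meanings end up with nearby vectors, so their cosine similarity approaches 1.0 even when they share no keywords.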
## Architecture Overview

- **Data preparation (`prod_dataset.csv`):** Load the dataset and fill missing descriptions with `fillna('')` before combining each title and description into a single text.
- **Offline embedding generation (`generate_embeddings.py`):** Uses `sentence-transformers` (specifically 'all-MiniLM-L6-v2') to convert combined texts into high-dimensional vectors, saved to `prod_dataset_combined_embeddings.csv`. This pre-computation is crucial for efficient runtime performance of the recommendation app, as embedding generation can be time-consuming.
- **`convert_embedding` function (`app.py`):** Parses the stringified embeddings stored in the CSV back into usable NumPy arrays.
- **`cosine_similarity` matrix (`app.py`):** This matrix is the core of the recommendation engine, pre-calculating similarity between all book pairs.
- **`get_similar_books` function:** Looks up rows of the pre-computed matrix, taking a book index and `top_n` as input.
- **Web routes (`app.py`):**
  - `/` (index route): Renders `index.html`, passing a list of book titles and their corresponding DataFrame indices for easy selection in a dropdown.
  - `/recommend` (recommendation route): Receives `book_id` (the DataFrame index) as a GET parameter, validates `book_id`, and calls `get_similar_books` to fetch recommendations.

### The Dataset (`prod_dataset.csv`)

The dataset contains book information, with 'Title' and 'Description' being key for our semantic analysis. A two-record sample:
```csv
Title,Authors,Description,Category,Publisher,Publish Date,Price
In the Bedroom,"By Dubus, Andre","The seven stories collected here–including “Killings,” the basis for Todd Field’s award-winning film In the Bedroom–showcase legendary writer Andre Dubus’s sheer narrative mastery in a book of quietly staggering emotional power. A father in mourning contemplates the unthinkable as the only way to allay his grief. A boy must learn to care for his younger brother when their mother leaves the family. A young woman who has never lacked lovers despairs of ever finding love itself, and then makes an accidental discovery that brings her real joy. Culled from Dubus’s treasured collections Selected Stories and Dancing After Hours, these beautiful stories of people at pivotal moments in their lives are some of the most bewitching and profound in American fiction."," Fiction , Media Tie-In",Vintage,"Friday, February 1, 2002",Price Starting at $5.29
Captain Kate,"By Reeder, Carolyn","While the Civil War rages all around them, Kate enlists the aid of her stepbrother Seth to ferry coal from Cumberland, Maryland, to Georgetown for the good of her family. Reprint."," Juvenile Fiction , Historical , United States , General",Avon Camelot,"Saturday, January 1, 2000",Price Starting at $5.29
```
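To confirm that the 'Title' and 'Description' columns parse as expected and can be combined into one text per book, here is a quick sketch on an abbreviated two-row sample (the shortened descriptions are illustrative, not the real data):

```python
import io

import pandas as pd

# Abbreviated sample in the same shape as prod_dataset.csv
sample = io.StringIO(
    'Title,Authors,Description\n'
    'In the Bedroom,"By Dubus, Andre","Seven stories of quiet emotional power."\n'
    'Captain Kate,"By Reeder, Carolyn","A Civil War story of family and coal barges."\n'
)
df = pd.read_csv(sample)
df["Description"] = df["Description"].fillna("")

# Same Title + ". " + Description combination the embedding script uses
combined = (df["Title"] + ". " + df["Description"]).tolist()
print(combined[0])  # In the Bedroom. Seven stories of quiet emotional power.
```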
### Embedding Generation (`generate_embeddings.py`)

This script processes the raw dataset, combines titles and descriptions, generates embeddings using 'all-MiniLM-L6-v2', and saves them to a new CSV.
```python
import pandas as pd
from sentence_transformers import SentenceTransformer

# Load the original dataset
df = pd.read_csv('prod_dataset.csv')
df['Description'] = df['Description'].fillna('')

# Combine Title and Description
combined_texts = (df['Title'] + ". " + df['Description']).tolist()

# Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(combined_texts, show_progress_bar=True)

# Save the new CSV file.
# Convert each embedding (a NumPy array) to a string representation.
df['Combined_Embedding'] = [str(list(e)) for e in embeddings]
df.to_csv('prod_dataset_combined_embeddings.csv', index=False)
print("Combined embeddings saved to prod_dataset_combined_embeddings.csv")
```
### The `convert_embedding` Function (`app.py`)

This utility is vital for converting the string representation of embeddings (stored in the CSV) back into usable NumPy arrays. It handles various formatting quirks that can arise from stringifying complex Python objects.
```python
import ast
import re

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from flask import Flask, request, render_template, jsonify


def convert_embedding(embedding_str):
    """
    Convert a string representation of an embedding into a NumPy array.
    Removes any "np.float32(...)" wrappers and "array(...)" wrappers
    before converting.
    """
    if not isinstance(embedding_str, str) or not embedding_str.strip():
        return None
    # Remove np.float32 wrappers using regex.
    if "np.float32(" in embedding_str:
        embedding_str = re.sub(r'np\.float32\(([^)]+)\)', r'\1', embedding_str)
    # Remove "array(" wrapper if present.
    if embedding_str.startswith("array(") and embedding_str.endswith(")"):
        embedding_str = embedding_str[len("array("):-1]
    try:
        return np.array(ast.literal_eval(embedding_str))
    except Exception:
        try:
            # Handle cases like '[1.0 2.0 ...]' or '1.0, 2.0, ...'
            s = embedding_str.strip().strip("[]")
            parts = s.split(",") if "," in s else s.split()
            floats = [float(x.strip()) for x in parts if x.strip()]
            return np.array(floats)
        except Exception:
            return None  # Final fallback
```
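To see why the `np.float32` cleanup is needed: under NumPy 2.x, `str()` on a list of float32 scalars can emit wrappers that `ast.literal_eval` refuses to parse, because they are function calls rather than literals. A small illustration (the raw string is a hypothetical CSV cell):

```python
import ast
import re

import numpy as np

# Hypothetical CSV cell as str(list(...)) might write it under NumPy 2.x
raw = "[np.float32(0.12), np.float32(-0.5), np.float32(0.33)]"

# ast.literal_eval(raw) would raise ValueError here, so strip the wrappers first.
cleaned = re.sub(r'np\.float32\(([^)]+)\)', r'\1', raw)
vec = np.array(ast.literal_eval(cleaned))
print(vec.tolist())  # [0.12, -0.5, 0.33]
```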
The application loads the embeddings, converts them, and then computes the similarity matrix.
```python
# In app.py
df = pd.read_csv('prod_dataset_combined_embeddings.csv')
df['Description'] = df['Description'].fillna('')  # ensure consistency if needed later
df['Combined_Embedding'] = df['Combined_Embedding'].apply(
    lambda x: convert_embedding(x) if isinstance(x, str) and x.strip() != "" else None
)
df = df[df['Combined_Embedding'].notnull()]  # Crucial: filter out invalid embeddings
df.reset_index(drop=True, inplace=True)      # Reset index for consistent indexing

# Stack embeddings and compute the cosine similarity matrix.
embeddings = np.stack(df['Combined_Embedding'].values)
similarity_matrix = cosine_similarity(embeddings)

# Prepare a list of books (id and title) for the HTML template.
# This list uses the new index after filtering and resetting.
books_list = [{"id": i, "title": title} for i, title in enumerate(df["Title"].tolist())]
```
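The resulting matrix is square: entry (i, j) is the similarity between books i and j. `cosine_similarity` is equivalent to L2-normalizing each row and taking a matrix product, as this NumPy-only sketch with random stand-in embeddings shows:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(5, 8))  # 5 "books", 8-dim toy embeddings

# Cosine similarity = dot products of L2-normalized rows
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarity_matrix = normed @ normed.T

print(similarity_matrix.shape)  # (5, 5)
# The diagonal is all 1.0: every book is maximally similar to itself,
# which is why the lookup below skips the top-ranked entry.
```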
After filtering rows with `None` embeddings, it's important to `reset_index`. This ensures that the indices used in `books_list` and received from the frontend correctly map to rows in the filtered `df` and the `similarity_matrix`.
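Why this matters is easiest to see on a toy DataFrame (column values here are illustrative): after boolean filtering, the index keeps gaps, so positional ids coming from the dropdown would point at the wrong rows until the index is reset.

```python
import pandas as pd

df = pd.DataFrame({"Title": ["A", "B", "C", "D"],
                   "Combined_Embedding": [[0.1], None, [0.3], [0.4]]})

filtered = df[df["Combined_Embedding"].notnull()]
print(list(filtered.index))   # [0, 2, 3] -- gaps: position 1 is now label 2 ("C")

filtered = filtered.reset_index(drop=True)
print(list(filtered.index))   # [0, 1, 2] -- positions and labels agree again
```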
### The `get_similar_books` Function

This function retrieves the top N similar books for a given book index.
```python
def get_similar_books(book_index, top_n=5):
    """
    Given a book index (for the filtered df/similarity_matrix), return a
    list of tuples (index, similarity) for the top_n similar books.
    """
    sim_scores = list(enumerate(similarity_matrix[book_index]))
    # Sort by similarity, then drop the first entry: after sorting it is the
    # query book itself (self-similarity of 1.0). Keep the next top_n results.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n + 1]
    return sim_scores  # list of (index_in_filtered_df, score)
```
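Exercising the same logic on a hand-made 4×4 similarity matrix (values hypothetical) shows why the slice starts at 1: the top-ranked entry is always the query book itself.

```python
# Hand-made symmetric similarity matrix for 4 books (hypothetical values)
similarity_matrix = [
    [1.00, 0.82, 0.31, 0.65],
    [0.82, 1.00, 0.27, 0.54],
    [0.31, 0.27, 1.00, 0.12],
    [0.65, 0.54, 0.12, 1.00],
]

def get_similar_books(book_index, top_n=2):
    sim_scores = list(enumerate(similarity_matrix[book_index]))
    # Drop the first entry after sorting: it is the book itself (score 1.0).
    return sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:top_n + 1]

print(get_similar_books(0))  # [(1, 0.82), (3, 0.65)]
```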
### The Recommendation Route (`/recommend`)

This Flask route handles requests for recommendations, calls the core logic, and returns JSON data.
```python
app = Flask(__name__)


@app.route("/")
def index():
    return render_template("index.html", books=books_list)


@app.route("/recommend", methods=["GET"])
def recommend():
    book_id = request.args.get("book_id", type=int)  # index from books_list
    if book_id is None or book_id < 0 or book_id >= len(df):
        return jsonify({"error": "Invalid book id"}), 400

    # book_id directly corresponds to the row index in the filtered df and
    # similarity_matrix, because books_list was created from the filtered df
    # using enumerate.
    recs_from_similarity = get_similar_books(book_id, top_n=5)

    recommendations = []
    for rec_idx, score in recs_from_similarity:
        recommendations.append({
            "id": rec_idx,  # index in the *filtered* df
            "title": df.loc[rec_idx, 'Title'],
            "similarity": round(score, 4)
        })

    selected_book_title = df.loc[book_id, 'Title']
    return jsonify({"selected": selected_book_title, "recommendations": recommendations})


if __name__ == "__main__":
    app.run(debug=True)
```
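The validation path can be checked without a browser using Flask's built-in test client. This self-contained sketch substitutes a toy title list for the real DataFrame and omits the similarity lookup:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
BOOKS = ["In the Bedroom", "Captain Kate", "Dark Places"]  # stand-in for df["Title"]


@app.route("/recommend", methods=["GET"])
def recommend():
    book_id = request.args.get("book_id", type=int)
    if book_id is None or book_id < 0 or book_id >= len(BOOKS):
        return jsonify({"error": "Invalid book id"}), 400
    # The real app would call get_similar_books here; return an empty list.
    return jsonify({"selected": BOOKS[book_id], "recommendations": []})


client = app.test_client()
print(client.get("/recommend?book_id=99").status_code)            # 400
print(client.get("/recommend?book_id=0").get_json()["selected"])  # In the Bedroom
```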
### The Frontend (`templates/index.html`)

The `index.html` template provides a dropdown for book selection; JavaScript fetches and displays recommendations.
```html
<!-- templates/index.html -->
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Semantic Book Recommender</title>
    <style>
        body { font-family: sans-serif; margin: 20px; }
        #results { margin-top: 20px; }
        ul { list-style-type: none; padding-left: 0; }
        li { margin-bottom: 5px; padding: 5px; border: 1px solid #eee; }
    </style>
</head>
<body>
    <h1>Select a Book</h1>
    <select id="book_select">
        <option value="">--Select a Book--</option>
        {% for book in books %}
        <option value="{{ book.id }}">{{ book.title }}</option>
        {% endfor %}
    </select>
    <button onclick="getRecommendations()">Get Recommendations</button>

    <div id="results">
        <!-- Recommendations will be displayed here -->
    </div>

    <script>
        async function getRecommendations() {
            const bookId = document.getElementById("book_select").value;
            const resultsDiv = document.getElementById("results");
            resultsDiv.innerHTML = "<p>Loading...</p>";

            if (bookId === "") {
                resultsDiv.innerHTML = "<p>Please select a book.</p>";
                return;
            }

            try {
                const response = await fetch(`/recommend?book_id=${bookId}`);
                if (!response.ok) {
                    const errorData = await response.json();
                    resultsDiv.innerHTML = `<p>Error: ${errorData.error || response.statusText}</p>`;
                    return;
                }
                const data = await response.json();

                let html = `<h2>Selected: ${data.selected}</h2><h3>Recommendations:</h3><ul>`;
                if (data.recommendations && data.recommendations.length > 0) {
                    data.recommendations.forEach(rec => {
                        html += `<li>${rec.title} (Similarity: ${rec.similarity})</li>`;
                    });
                } else {
                    html += "<li>No recommendations found.</li>";
                }
                html += "</ul>";
                resultsDiv.innerHTML = html;
            } catch (error) {
                resultsDiv.innerHTML = `<p>An error occurred: ${error.message}</p>`;
                console.error("Fetch error:", error);
            }
        }
    </script>
</body>
</html>
```
An example JSON response from `/recommend` (comments added for annotation):

```json
{
  "selected": "In the Bedroom",
  "recommendations": [
    {
      "id": 12,          // Index in the filtered DataFrame
      "title": "The Melting Pot and Other Subversive Stories",
      "similarity": 0.8567
    },
    {
      "id": 36,          // Index in the filtered DataFrame
      "title": "Dark Places",
      "similarity": 0.8234
    }
    // ... more recommendations
  ]
}
```
The project structure and data flow:

```mermaid
graph TD
    A[book-recommender-app/] --> B[app.py]
    A --> C[generate_embeddings.py]
    A --> D[requirements.txt]
    A --> E[prod_dataset.csv]
    A --> F[prod_dataset_combined_embeddings.csv]
    A --> G[templates/]
    A --> H[README.md]
    A --> I[Dockerfile]
    G --> J[index.html]

    %% Data Flow Relationships
    E -->|reads| C
    C -->|generates| F
    F -->|loads into memory| B
    B -->|serves| J
    J -->|makes requests to| B

    %% Dependency Relationships
    D -->|specifies dependencies for| B
    D -->|specifies dependencies for| C

    %% Deployment Relationships
    I -->|containerizes| B
    I -->|includes| F
    I -->|includes| G

    %% Documentation
    H -->|documents| A

    %% Styling for different file types
    classDef pythonFile fill:#3776ab,stroke:#fff,color:#fff
    classDef dataFile fill:#28a745,stroke:#fff,color:#fff
    classDef templateFile fill:#e83e8c,stroke:#fff,color:#fff
    classDef configFile fill:#fd7e14,stroke:#fff,color:#fff
    classDef docFile fill:#6c757d,stroke:#fff,color:#fff
    classDef folderStyle fill:#007bff,stroke:#fff,color:#fff

    class B,C pythonFile
    class E,F dataFile
    class J templateFile
    class D,I configFile
    class H docFile
    class A,G folderStyle

    %% Add labels for key relationships
    C -.->|"1. Processes original data"| E
    C -.->|"2. Creates embeddings file"| F
    B -.->|"3. Loads embeddings at startup"| F
    B -.->|"4. Serves web interface"| J
    J -.->|"5. Ajax calls for recommendations"| B
```
### Key Takeaways

- Uses the `all-MiniLM-L6-v2` model for fast and effective sentence embedding.
- Provides robust parsing (`convert_embedding`) for various string representations of embeddings stored in CSVs.
- Keeps the offline embedding generation script (`generate_embeddings.py`) distinct from the online recommendation serving app (`app.py`).
- Tech stack: Flask, pandas, NumPy, scikit-learn (`cosine_similarity`), Sentence-Transformers.

Semantica currently provides a solid foundation for semantic book recommendations. Future enhancements could include:
- Migrating to a database backend (e.g., PostgreSQL with `pg_vector` for native vector storage and querying) for better scalability and data management.

Semantica demonstrates the power of modern NLP techniques in building practical and intelligent applications. We encourage you to explore the Semantica code and experiment with its capabilities.