Azure AI Search is an information retrieval service that uses AI to improve the relevance and accuracy of results.
Azure AI Search allows you to index information using vectors, which means it can search not only by keywords but also by concepts and relationships between terms. This is especially useful for Retrieval-Augmented Generation (RAG) applications.
A vector embedding is a numerical representation of an object, such as text or an image, that captures its semantic features. In the context of Azure AI Search, embeddings allow the search engine to better understand the meaning and relationship between different terms.
Example:
In this example, the vectors for "Dog" and "Animal" are close in the vector space, indicating that they are semantically related. This allows Azure AI Search to find relevant results even if the exact words do not match.
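To make the idea concrete, here is a minimal sketch that uses hand-made three-dimensional vectors. The numbers are invented for illustration only (real embeddings come from a model and have hundreds or thousands of dimensions), and the names toy_vectors and rank_by_similarity are hypothetical.

```python
import math

# Invented 3-dimensional "embeddings"; illustrative assumptions, not model output.
toy_vectors = {
    "Dog": [0.9, 0.8, 0.1],
    "Animal": [0.8, 0.9, 0.2],
    "Car": [0.1, 0.2, 0.9],
}


def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm


def rank_by_similarity(query_word, vectors):
    """Return the other words ordered from most to least similar to query_word."""
    query_vec = vectors[query_word]
    scored = [
        (word, cosine_similarity(query_vec, vec))
        for word, vec in vectors.items()
        if word != query_word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_by_similarity("Dog", toy_vectors)
for word, score in ranking:
    print(f"Dog vs {word}: {score:.3f}")
```

With these made-up values, "Animal" ranks above "Car" for the query "Dog", which is the behavior a vector index relies on when it returns semantically related results.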
import os

import dotenv
import openai

dotenv.load_dotenv()

openai_client = openai.OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)
MODEL_NAME = "text-embedding-3-small"

content_input = "Resume: Lionel Messi. Argentine footballer, considered one of the best football players of all time."
embeddings_response = openai_client.embeddings.create(
    model=MODEL_NAME,
    input=content_input,
)
embedding = embeddings_response.data[0].embedding

print(len(embedding))
print(embedding)
1536
[0.01875680685043335, -0.01207759603857994, -0.014209441840648651,...]
To compare two vectors, you can use a metric such as cosine similarity. It is the cosine of the angle between the vectors, so it reflects how closely they point in the same direction regardless of their magnitude. Values close to 1 indicate similar meaning, and the corresponding "cosine distance" is 1 minus this value.
For example, suppose we want to compare the embedding of the previously generated text with that of a query phrase such as "Diego Zumárraga Mera." We can calculate the similarity between the two vectors returned by the embedding model.
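The dependence on direction rather than magnitude can be checked with a few two-dimensional vectors. This is a small sketch that assumes the common convention that cosine distance equals 1 minus cosine similarity; the function names are illustrative.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between v1 and v2."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm


def cosine_distance(v1, v2):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(v1, v2)


# Same direction, different length: similarity stays at (approximately) 1.
scaled = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0, so distance is 1.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 2.0])
# Opposite direction: similarity is -1, so distance is 2.
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(f"scaled: {scaled:.4f}")
print(f"orthogonal: {orthogonal:.4f}")
print(f"opposite: {opposite:.4f}")
print(f"distance (orthogonal): {cosine_distance([1.0, 0.0], [0.0, 2.0]):.4f}")
```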
First, generate the embeddings for both phrases:
content_input = "Resume: Lionel Messi. Argentine footballer, considered one of the best football players of all time."
embeddings_response = openai_client.embeddings.create(
    model=MODEL_NAME,
    input=content_input,
)
content_embedding = embeddings_response.data[0].embedding
print(content_embedding)

embeddings_response = openai_client.embeddings.create(
    model=MODEL_NAME,
    input="Diego Zumárraga Mera",
)
query_embedding1 = embeddings_response.data[0].embedding
print(query_embedding1)
Then, calculate the cosine similarity between the two vectors:
def cosine_similarity(v1, v2):
    # Dot product of the two vectors.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    # Product of the two Euclidean norms.
    magnitude = (sum(a**2 for a in v1) * sum(a**2 for a in v2)) ** 0.5
    return dot_product / magnitude

# Compare the two vectors
similarity = cosine_similarity(query_embedding1, content_embedding)
print(f"Similarity: {similarity:.4f}")
Similarity: 0.1799
As we can see, the similarity between the two vectors is approximately 0.1799, indicating that they are not very semantically similar.
Now let's change the query phrase to "Summarize the resume of Diego Zumárraga Mera" and recalculate the similarity:
embeddings_response = openai_client.embeddings.create(
    model=MODEL_NAME,
    input="Summarize the resume of Diego Zumárraga Mera",
)
query_embedding2 = embeddings_response.data[0].embedding
print(query_embedding2)

# Compare the two vectors
similarity = cosine_similarity(query_embedding2, content_embedding)
print(f"Similarity: {similarity:.4f}")
Similarity: 0.3429
Full code in the notebook: VectorEmbeddings.ipynb
The similarity almost doubled (from 0.1799 to 0.3429), even though the query asks about a different person than the one described in the text. The most likely cause is the shared word "resume," which pulls the query toward any text that contains it.
If we build a RAG solution that indexes and searches documents such as resumes, it may struggle to find a specific resume when the query includes common words like "Resume." Azure AI Search may then return documents that contain that word but are not necessarily relevant to the query.
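The effect of a shared boilerplate word can be reproduced without calling any embedding model. The sketch below swaps in a much simpler technique, bag-of-words count vectors, purely as an analogy. It is not how text-embedding-3-small works, and the stopword list and helper names are assumptions made for this illustration.

```python
import math
import re
from collections import Counter

# Assumed minimal stopword list so that only content words are compared.
STOPWORDS = {"the", "of", "one", "all"}


def bag_of_words(text):
    """Lowercased word counts without stopwords; a crude stand-in for an embedding."""
    words = re.findall(r"\w+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_similarity(c1, c2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(
        sum(v * v for v in c2.values())
    )
    return dot / norm if norm else 0.0


document = (
    "Resume: Lionel Messi. Argentine footballer, considered one of the "
    "best football players of all time."
)
plain_query = "Diego Zumárraga Mera"
resume_query = "Summarize the resume of Diego Zumárraga Mera"

doc_vec = bag_of_words(document)
plain_score = cosine_similarity(bag_of_words(plain_query), doc_vec)
resume_score = cosine_similarity(bag_of_words(resume_query), doc_vec)
shared_words = set(bag_of_words(resume_query)) & set(doc_vec)

print(f"Without 'resume': {plain_score:.4f}")
print(f"With 'resume':    {resume_score:.4f}")
print(f"Shared words: {shared_words}")
```

In this toy setting the only word the second query shares with the document is "resume," yet it is enough to raise the score above zero for a document about a different person.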
To improve the relevance of results, we can configure semantic search in Azure AI Search to take into account other index features such as the document title or additional keywords.
When creating an index in Azure AI Search, we can create two fields to use as title and keywords. These fields can then be used in the semanticSearch configuration to improve the relevance of results.
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)


def create_index_definition(name: str) -> SearchIndex:
    """
    Returns an Azure AI Search index with the given name.
    The index includes a vector search with the default HNSW algorithm
    """
    # The fields we want to index. The "embedding" field is a vector field that will
    # be used for vector search.
    fields = [
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", type=SearchFieldDataType.String,
                        searchable=True, filterable=True, facetable=True),
        SearchableField(name="content", type=SearchFieldDataType.String, searchable=True),
        SearchableField(name="keywords", type=SearchFieldDataType.String,
                        searchable=True, filterable=True, facetable=True),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            # Size of the vector created by the text-embedding-3-small model.
            vector_search_dimensions=1536,
            vector_search_profile_name="myHnswProfile",
        ),
    ]

    # The "content" field should be prioritized for semantic ranking; the title
    # and keywords fields give the ranker additional context.
    semantic_config = SemanticConfiguration(
        name="default",
        prioritized_fields=SemanticPrioritizedFields(
            content_fields=[SemanticField(field_name="content")],
            title_field=SemanticField(field_name="title"),
            keywords_fields=[SemanticField(field_name="keywords")],
        ),
    )

    # For vector search, we want to use the HNSW (Hierarchical Navigable Small World)
    # algorithm (a type of approximate nearest neighbor search algorithm) with cosine
    # distance.
    vector_search = VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="myHnsw")],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
            )
        ],
    )

    # Create the semantic settings with the configuration
    semantic_search = SemanticSearch(configurations=[semantic_config])

    # Create the search index.
    index = SearchIndex(
        name=name,
        fields=fields,
        semantic_search=semantic_search,
        vector_search=vector_search,
    )
    return index
When indexing a document, we must make sure to include the title and keywords fields. In these fields, we should avoid common, repetitive phrases such as "Resume" or "User Manual" and instead include keywords that are specific to the document's content.
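One way to keep boilerplate out of the keywords field is to filter it before indexing. The helper below is a hypothetical sketch: GENERIC_TERMS and build_keywords are not part of the Azure SDK, and the list of generic terms is an assumption you would tune for your own corpus.

```python
import re

# Assumed list of generic single-word terms to exclude from keywords.
GENERIC_TERMS = {"resume", "cv", "user", "manual", "document"}


def build_keywords(title: str) -> str:
    """Return a comma-separated keyword string without generic terms."""
    words = re.findall(r"\w+", title)
    specific = [w for w in words if w.lower() not in GENERIC_TERMS]
    return ", ".join(specific)


keywords = build_keywords("Resume: Diego Zumárraga Mera")
print(keywords)  # prints: Diego, Zumárraga, Mera
```

Filtering word by word is deliberately crude: it would also drop "user" from a title where that word is meaningful, so a phrase-level filter may be a better fit for some collections.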
import os
from typing import Any, Dict, List

import openai


def gen_index_document() -> List[Dict[str, Any]]:
    openai_client = openai.OpenAI(
        base_url="https://models.inference.ai.azure.com",
        api_key=os.environ["GITHUB_TOKEN"],
    )
    MODEL_NAME = "text-embedding-3-small"

    content_input = [
        {
            "Title": "Lionel Messi",
            "Content": "Resume: Lionel Messi. Argentine footballer, considered one of the best football players of all time.",
        },
        {
            "Title": "Diego Zumárraga Mera",
            "Content": "Resume: Diego Zumárraga Mera. Software engineer with experience in web and mobile application development, passionate about artificial intelligence and machine learning.",
        },
    ]

    items = []
    for ix, item in enumerate(content_input):
        content = item["Content"]
        print(f"Processing item {ix}: {content}")

        # Generate the embedding for the document content.
        embeddings_response = openai_client.embeddings.create(
            model=MODEL_NAME,
            input=content,
        )
        embedding = embeddings_response.data[0].embedding
        print(len(embedding))
        print(embedding)

        items.append({
            "id": f"doc-{ix}",
            "title": item["Title"],
            "content": content,
            "keywords": item["Title"].replace(" ", ", "),  # split the title into keywords
            "embedding": embedding,
        })
    return items
Full code in the notebook: AzureAISearchIndex.ipynb
To test semantic search, we can search for the phrase "Summarize the resume of Diego Zumárraga Mera" and see how Azure AI Search can return relevant results.
from azure.identity import DefaultAzureCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryType,
    VectorizedQuery,
)

query_input = "Summarize the resume of Diego Zumárraga Mera"
embeddings_response = openai_client.embeddings.create(
    model=MODEL_NAME,
    input=query_input,
)
query_embedding = embeddings_response.data[0].embedding
print(query_embedding)

item = {"query": query_input, "embedding": query_embedding}

search_client = SearchClient(
    endpoint=os.environ["AZURE_AI_SEARCH_ENDPOINT"],
    index_name="my-index",
    credential=DefaultAzureCredential(),
)

# Vector part of the query: nearest neighbors of the query embedding.
vector_query = VectorizedQuery(
    vector=item["embedding"],
    k_nearest_neighbors=3,
    fields="embedding",
)

# Hybrid query: keyword text plus vector search, re-ranked by the semantic configuration.
result = search_client.search(
    search_text=item["query"],
    vector_queries=[vector_query],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="default",
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=2,
)
Now let's see the best search result:
answers = result.get_answers()
if answers:
    print("Answers:")
    for answer in answers:
        print(f"Answer: {answer.text}")
        print(f"Confidence: {answer.score}")
Answers:
Answer: Resume: Diego Zumárraga Mera. Software engineer with experience in web and mobile application development, passionate about artificial intelligence and machine learning.
Confidence: 0.9860000014305115
Full code in the notebook: AzureAISearchQuery.ipynb
As we can see, Azure AI Search has returned a relevant result for the query, even though the query includes the common phrase "Resume". This is because we have configured semantic search to take into account the document's title and keywords, which improves the relevance of the results.