# SnapCaption: Explore it [here!📸](https://snap-frontend-mamadoudkaba-gmailcom-team-polar-bears.vercel.app/?_vercel_share=oAUGL1kXtPswa2Cig0nY9TIFRGYdTGrL) ![Screenshot 2024-11-28 024502](https://github.com/user-attachments/assets/a4a3634b-9ba9-4fca-85ea-671d7cfe0a68) **Description**: This interface displays an uploaded image and its automatically generated caption, demonstrating SnapCaption's core functionality. The photo was sourced from [World Cities Culture Forum: Montreal](https://worldcitiescultureforum.com/city/montreal). ---
# Abstract SnapCaption is an AI-powered platform that combines computer vision and natural language processing (NLP) to automate the generation, refinement, and storage of descriptive captions for images. At its core, SnapCaption leverages Azure Cognitive Services, a state-of-the-art computer vision model, to analyze images and extract semantic information. This includes object detection, scene understanding, and relationship identification, laying the foundation for meaningful captions. The extracted information is then refined using OpenAI GPT-4o, transforming raw data into polished, cohesive descriptions. The system integrates Azure Blob Storage for image handling, Cosmos DB for scalable metadata storage, and a FastAPI backend for efficient processing. The frontend, built with Next.js, provides an intuitive interface for user interactions. Deployed on Google Cloud Run and Vercel, SnapCaption ensures scalability, responsiveness, and reliability. This publication explores the technical architecture, workflow, evaluation metrics, and innovative features of SnapCaption, emphasizing its computer vision capabilities and practical application in real-world scenarios. ---
# **1. Project Scope and Objectives** Manual captioning of images is time-consuming, inconsistent, and prone to human error. SnapCaption aims to address this challenge by offering a **cloud-integrated AI solution** that generates, refines, and stores descriptive image captions automatically. The main objectives of the project include: - **Automating Image Analysis and Interpretation:** Use **Azure Cognitive Services** to extract semantic features from images, including object detection, scene classification, and relationship mapping. - **Caption Generation and Refinement:** Leverage **OpenAI GPT-4o** to transform raw image descriptions into meaningful, human-like captions. - **Ensuring Scalability:** Utilize **Azure Blob Storage**, **Cosmos DB**, and **Google Cloud Run** to handle large-scale image processing and storage efficiently. - **User-Friendly Interface:** Provide an intuitive **Next.js frontend** for seamless user interactions. - **Cloud-Native Deployment:** Deploy services on **Google Cloud Run** (backend) and **Vercel** (frontend) for scalability and low latency. These objectives ensure SnapCaption delivers a robust, **vision-first automated captioning pipeline** capable of addressing real-world challenges. ---
# **2. Problem Statement** Manual image captioning often fails to scale in modern applications due to its reliance on **human intervention**, **high time investment**, and **inconsistent results**. Additionally, platforms requiring automated captioning for **content management**, **search optimization**, and **accessibility improvements** face significant limitations without reliable computer vision systems. SnapCaption addresses these limitations by utilizing **Azure Cognitive Services** for **advanced image analysis**. The system leverages computer vision algorithms to identify **objects**, **scenes**, and **contextual relationships** within images. This visual understanding is then translated into meaningful text descriptions, refined through **OpenAI GPT-4o** for enhanced clarity and coherence. By combining **computer vision** and **natural language generation**, SnapCaption ensures accurate, scalable, and meaningful image metadata creation. ---
# **3. Methodology**
### **3.1 Technical Stack** SnapCaption integrates a comprehensive technology stack, with **computer vision at its core**, to enable seamless image captioning: - **Frontend:** Developed using **Next.js** with **Tailwind CSS** for responsive UI and **Axios** for efficient API communication. - **Backend:** Built on **FastAPI**, providing high-performance API routing and OpenAPI documentation. - **Computer Vision Engine:** - **Azure Cognitive Services:** Performs **object detection**, **scene recognition**, and **contextual analysis** on uploaded images. - **AI Caption Refinement:** - **OpenAI GPT-4o:** Refines raw captions into human-readable, meaningful text. - **Cloud Infrastructure:** - **Azure Blob Storage:** Stores uploaded images. - **Cosmos DB:** Manages caption metadata and refined captions. - **Deployment:** - **Google Cloud Run:** Backend deployment with autoscaling. - **Vercel:** Frontend hosting optimized for edge delivery. - **CI/CD Pipeline:** Managed with **GitHub Actions** and containerized via **Docker**. This stack ensures that SnapCaption is **scalable, efficient, and optimized for computer vision workflows**.
### **3.2 System Workflow** ![sequence diagram-2024-11-28-062717](https://github.com/user-attachments/assets/ed890db0-28c2-45f4-bba6-40d76cb09ff9) **Description**: The sequence diagram outlining how The SnapCaption workflow integrates **frontend interactions, backend API calls, and cloud services** to deliver a seamless image captioning experience. **Workflow Steps:** 1. **Image Upload:** The user uploads an image through the frontend interface. 2. **Storage in Azure Blob:** The image is sent to the **FastAPI backend**, which securely uploads it to **Azure Blob Storage**. 3. **Image Analysis:** The image URL is processed by **Azure Cognitive Services**, generating initial captions through **computer vision analysis**. 4. **Caption Refinement:** The generated captions are passed to **OpenAI GPT-4o** for refinement into a polished, cohesive caption. 5. **Metadata Storage:** The refined captions and associated metadata are stored in **Cosmos DB** for future retrieval. 6. **Caption Retrieval:** Users can fetch refined captions using a unique **image ID** via dedicated API endpoints. 7. **Response to Frontend:** The processed data (refined captions, metadata, etc.) is returned to the frontend for display. This workflow ensures **scalability, accuracy, and efficiency**, with seamless integration between the frontend, backend, and cloud services.
### **3.3 Backend Implementation**
### **SwaggerHub Interface : Explore it [here](https://snapcaption-backend2-336921101433.us-central1.run.app/docs).** ![Screenshot 2024-11-28 025117](https://github.com/user-attachments/assets/ef26352b-401b-4ea0-a398-aff0f0e964c3) **Description**: The SwaggerHub interface lists SnapCaption's API endpoints, used for uploading images, generating captions, and retrieving data. The **FastAPI backend** serves as the **central hub** for all interactions, from **image uploads** to **caption generation**, **refinement**, and **metadata storage**. It integrates with **Azure Cognitive Services** for image analysis, **OpenAI GPT-4o** for text refinement, and **Cosmos DB** for metadata management. Below are detailed descriptions of the key backend endpoints and their functionality. ### **1. Image Upload Endpoint** ```python @router.post("/upload_image") async def upload_image(image: UploadFile = File(...)): image_id = str(uuid.uuid4()) blob_client = blob_service_client.get_blob_client(container=container_name, blob=image_id) image_data = await image.read() blob_client.upload_blob( image_data, overwrite=True, content_settings=ContentSettings(image.content_type), ) return {"image_id": image_id, "message": "Image uploaded successfully"} ``` **Explanation:** - **Unique Image ID:** A **UUID** is generated for each image upload to avoid filename conflicts. - **Blob Storage Integration:** The image is securely uploaded to **Azure Blob Storage**, ensuring scalable storage and efficient retrieval. - **Response:** The API returns a confirmation message along with the unique `image_id` for tracking. ### **2. Generate Sentence-Based Captions** ```python @router.get("/get_caption") async def get_caption(image_url: str): headers = {"Ocp-Apim-Subscription-Key": cv_key, "Content-Type": "application/json"} data = {"url": image_url} response = requests.post(cv_endpoint, headers=headers, json=data) return response.json() ``` **Explanation:** - **Image Analysis:** The image URL is sent to **Azure Cognitive Services** for **visual analysis**. - **Object and Scene Recognition:** The vision model detects **objects**, **scenes**, and **contextual elements** in the image. - **Sentence Generation:** The API generates multiple descriptive sentences based on the detected visual features. - **Response:** The generated sentences are returned in JSON format. ### **3. Refine Captions with OpenAI GPT-4o** ```python @router.post("/generate_caption") async def generate_caption(captions: list[str]): bullet_points = "\n".join(f"{caption}" for caption in captions) prompt = f"The following captions were generated:\n{bullet_points}\nRefine them into one cohesive caption." headers = {"Content-Type": "application/json", "api-key": openai_key} response = requests.post(openai_endpoint, headers=headers, json={ "messages": [ {"role": "system", "content": "You are an assistant that refines image captions."}, {"role": "user", "content": prompt} ], "temperature": 0.2, "top_p": 0.5, "max_tokens": 200 }) return {"refined_caption": response.json()["choices"][0]["message"]["content"].strip()} ``` **Explanation:** - **Input Captions:** The raw sentences generated by the computer vision model are sent to **GPT-4o**. - **Refinement Process:** GPT-4o processes the sentences, identifies their core meaning, and generates a **polished, cohesive caption**. - **Response:** The refined caption is returned to the client. ### **4. Store Refined Captions** ```python @router.post("/store_caption") async def store_caption(payload: dict = Body(...)): refined_caption = payload.get("refined_caption") document = { "id": str(uuid.uuid4()), "refined_caption": refined_caption, "sentences": refined_caption.split("."), } container.create_item(body=document) return {"message": "Caption stored successfully."} ``` **Explanation:** - **Metadata Management:** Each refined caption is stored with its **UUID** and **individual sentence metadata**. - **Database Integration:** The metadata is securely saved in **Azure Cosmos DB**, ensuring fast and scalable retrieval. - **Response:** The API confirms that the caption and metadata were successfully stored. ### **5. Retrieve Stored Captions** ```python @router.get("/get_stored_caption") async def get_stored_caption(caption_id: str): query = f"SELECT * FROM c WHERE c.id = '{caption_id}'" results = container.query_items( query=query, enable_cross_partition_query=True ) return list(results) ``` **Explanation:** - **Query Execution:** The API queries Cosmos DB using a **caption ID** to fetch metadata and captions. - **Efficient Search:** Cross-partition querying ensures scalability and efficiency. - **Response:** Returns the metadata, refined caption, and related sentences in JSON format.
### **3.4 Caption History and Refinement** ![Screenshot 2024-11-28 024650](https://github.com/user-attachments/assets/1066c19a-2860-4683-9570-f9791932814b) **Description**: Displays stored captions and metadata for different images. This table reflects how SnapCaption organizes and manages data. **Process Summary:** 1. **Azure Cognitive Services** generates sentences through **image analysis** using computer vision algorithms. 2. **OpenAI GPT-4o** refines these sentences into a single cohesive caption. 3. Refined captions and metadata are securely stored in **Azure Cosmos DB** for future retrieval. This pipeline integrates **computer vision** and **natural language processing** seamlessly to produce meaningful image captions. ---
## **4. Deployment** The deployment of **SnapCaption** was designed to ensure **scalability, reliability, and optimal performance** across its infrastructure. The backend, built using **FastAPI**, is containerized with **Docker** and hosted on **Google Cloud Run**, a serverless container execution environment. This setup enables the backend services to **automatically scale** based on incoming API traffic, maintaining high availability and responsiveness even under heavy loads. Secrets such as API keys and connection strings are securely managed through **Google Secret Manager**, ensuring compliance with best practices in cloud security. For the frontend, SnapCaption uses **Vercel**, an optimized deployment platform for **Next.js applications**. Vercel’s global edge network ensures low-latency delivery of the user interface, enhancing accessibility and responsiveness regardless of the user's geographic location. To maintain efficiency and reliability in the development pipeline, **GitHub Actions** is used for continuous integration and delivery (**CI/CD**). This pipeline automates building, testing, and deploying both frontend and backend services, reducing the risk of errors and ensuring consistent updates. Together, these deployment strategies guarantee that SnapCaption remains **scalable, secure, and easy to maintain** in production environments. ---
## **5. Evaluation and Results**
### **5.1 Evaluation Metrics** SnapCaption was evaluated based on **four key performance metrics** to ensure effectiveness and scalability: - **Latency:** The time taken for image upload, caption generation, and metadata retrieval. - **Accuracy:** The relevance and clarity of generated captions compared to human benchmarks. - **Scalability:** The system’s ability to handle concurrent requests without performance degradation. - **Availability:** The uptime percentage of backend and frontend services over extended testing periods.
### **5.2 Results Overview** | **Metric** | **Observed Value** | |-----------------|--------------------------| | **Upload Time**| Avg. 1.2 seconds | | **Caption Gen. Time** | Avg. 2.5 seconds | | **API Latency**| Avg. 300 ms | | **System Uptime** | 99.8% |
### **5.3 Key Observations** - SnapCaption maintained consistent performance even under **100+ concurrent API requests**. - Captions generated by **Azure Cognitive Services** and refined by **OpenAI GPT-4o** demonstrated **high contextual accuracy and readability**. - **CI/CD pipelines** reduced deployment time and errors significantly, enabling a smooth development workflow. - Backend scalability ensured the system adapted well to varying workloads without latency spikes. These results validate SnapCaption’s effectiveness as a **scalable and accurate image captioning platform**, highlighting its strengths in **computer vision-driven analysis** and **AI-based text refinement**. ---
## **6. Conclusion** SnapCaption successfully demonstrates the integration of **computer vision** and **AI technologies** to streamline image captioning workflows. By leveraging **Azure Cognitive Services** for image analysis and **OpenAI GPT-4o** for refined textual generation, SnapCaption automates the creation of **high-quality image captions** with minimal human intervention. Its **FastAPI backend** and **Next.js frontend**, deployed on **Google Cloud Run** and **Vercel**, provide a **scalable, responsive, and reliable platform** for handling large volumes of image data. **CI/CD pipelines** ensure smooth updates and reduce deployment overhead, while **SwaggerHub documentation** fosters developer collaboration. ### **Future Directions:** - **Real-Time Video Captioning:** Extend SnapCaption's capabilities to process live video streams. - **Multilingual Captioning:** Enable caption generation in multiple languages to support global accessibility. - **Third-Party Integration:** Seamlessly integrate SnapCaption with widely used **Content Management Systems (CMS)**. SnapCaption serves as a testament to the transformative potential of combining **computer vision** and **AI-driven natural language processing**. It not only addresses existing challenges in image metadata generation but also sets the stage for broader applications in digital content management. ---
## **7. Source Code and References** The complete source code for SnapCaption is publicly available on [GitHub](https://github.com/mdkaba/Snapcaption) 1. **Docker.** [*Containerize an Application: Getting Started Workshop.*](https://docs.docker.com/get-started/workshop/02_our_app/ ) 2024. 2. **Google Cloud.** [*Deploying to Cloud Run.*](https://cloud.google.com/run/docs/deploying) 2024. 3. **Vercel.** [*Deploying to Vercel.*]( https://vercel.com/docs/deployments/overview) 2024. 4. **Azure Cognitive Services Documentation.** [*Image Analysis API Reference.*](https://learn.microsoft.com/en-us/azure/ai-services/computer-vision/how-to/call-analyze-image) 2024. 5. **OpenAI API Documentation.** [*GPT-4o Integration Guide.*](https://platform.openai.com/docs/models/gpt-4o) 2024. 6. **Stack Overflow.** [*No Next.js version could be detected in your project. Make sure 'next' is installed.*](https://stackoverflow.com/questions/74297132/no-next-js-version-could-be-detected-in-your-project-make-sure-next-is-inst) 2022.
These resources served as foundational guides during the implementation and deployment phases.