Created By: Fareed Khan
GitHub Project Link: AI Vision Dataset Builder
When real-world data is limited or expensive to collect, training custom object detection models has always been a headache - you either need to spend countless hours manually labeling data or deal with the hassle of licensing existing datasets. Even when you find usable data, it often doesn't quite match what you need for your specific research or business case. But here's where things get interesting: the recent boom in AI vision models and image generators has opened up a whole new way to tackle this problem. Instead of the traditional manual grind, we can now use Large Language Models (LLMs) and vision models to create synthetic training data based on our needs. Since there are so many LLMs and image generation models out there now, both open-source and commercial, developers have incredible flexibility in choosing what works best for their specific use case.
This project shows how we can generate custom synthetic datasets for training object detection models, helping developers spend less time on the tedious parts of data preparation and more time on actually solving problems. It's about making the whole process more efficient and accessible.
Our system works like a pipeline, with each step building on the previous one to create a complete training dataset. Here's a high-level overview of how it works:
First up is our Base Prompt Template - this is where you describe your object and where you might find it. Think of it like giving basic instructions about what you want to detect.
Next, these instructions get handed off to Text Generation LLMs (like OpenAI, Gemini, or Hugging Face models). These AI models take your basic template and get creative with it, generating lots of different scenarios for your object.
To make sure we have enough training data, we multiply these prompts (labeled x1, x2, x3... xn) by N. This gives us a huge variety of different scenarios to work with.
These detailed descriptions then go to Image Generation Models like Stability.ai and DALL-E. These models turn all those text descriptions into actual images we can use for training.
We can validate the quality of these synthetic images using a pre-trained vision model from Hugging Face or any other source.
Once we have our synthetic images, tools like Grounded-SAM automatically label them, adding bounding boxes around the objects we want to detect.
Then we validate the quality of these annotations using a pre-trained vision model, which checks whether the bounding boxes are correctly placed.
The end result? A complete training dataset where each image comes with its matching bounding box annotations, ready to train an object detection model.
The best part is how flexible this setup is - you can mix and match different LLMs and image generators based on what works best for your specific needs, whether that's budget, speed, or quality.
Make sure Python is installed on your system; you can download it from the official Python website. Once Python is installed, clone this repository and navigate to the project directory:
git clone https://github.com/FareedKhan-dev/ai-vision-dataset-builder.git
cd ai-vision-dataset-builder
You can then install the required packages using the following command:
pip install numpy pandas matplotlib opencv-python Pillow torch diffusers autodistill autodistill-grounding-dino openai autodistill-yolov8 roboflow
You can also install the required packages using the requirements.txt file:
pip install -r requirements.txt
Next, we import the required libraries:
import os                              # For interacting with the operating system
import math                            # For mathematical operations
import io                              # For file input and output operations
import ast                             # For parsing and evaluating Python expressions
import base64                          # For base64 encoding and decoding
from io import BytesIO                 # For reading and writing files in memory

import numpy as np                     # For numerical operations
import pandas as pd                    # For data manipulation and analysis
import matplotlib.pyplot as plt        # For plotting and visualizations
import cv2                             # OpenCV library for computer vision tasks
from PIL import ImageDraw              # For image processing and drawing graphics

import torch                                           # PyTorch for deep learning
from diffusers import StableDiffusionPipeline          # For text-to-image generation with Stable Diffusion
from autodistill.detection import CaptionOntology      # For labeling/annotation tasks in object detection
from autodistill_grounding_dino import GroundingDINO   # For grounding and detection tasks
from openai import OpenAI                              # OpenAI API for AI chat
The first step in our pipeline is to create a base prompt template. This is where you describe the object you want to detect and where you might find it. Here's an example of a base prompt template for detecting bears in different environments:
# Define the important objects that must be present in each generated prompt.
important_objects = "brown bear"  # For multiple objects, separate them with commas, e.g., "different kinds of bear, bottles, ... etc."

# Specify the number of prompts to generate.
number_of_prompts = 50  # Define the number of prompts to generate for the image generation task.

# Provide a brief description of the kind of images you want the prompts to depict.
description_of_prompt = "brown bear in different environments"  # Describe the scenario or context for the image generation.

# Generate a formatted instruction set to produce image generation prompts.
# This formatted string will help in creating detailed and diverse prompts for the computer vision model.
base_prompt = f'''
# List of Important Objects:
# The objects listed here must be included in every generated prompt.
Important Objects that must be present in each prompt:
{important_objects}
# Input Details:
# The task is to generate a specific number of prompts related to the description provided.
Input:
Generate {number_of_prompts} realistic prompts related to {description_of_prompt} for image generation.
# Instructions for Prompt Generation:
# - Each prompt should depict real-life behaviors and scenarios involving the objects.
# - All important objects should be included in every prompt.
# - Ensure that the objects are captured at varying distances from the camera:
# - From very close-up shots to objects in the far background.
# - The prompts should be diverse and detailed to cover a wide range of use cases.
# Output Format:
# - The output should be a Python list containing all the generated prompts as strings.
# - Each prompt should be enclosed in quotation marks and separated by commas within the list.
Output:
Return a Python list containing these prompts as strings for later use in training a computer vision model.
[prompt1, prompt2, ...]
'''

# Print the formatted instruction set for generating prompts.
print(base_prompt)
# List of Important Objects:
# The objects listed here must be included in every generated prompt.
Important Objects that must be present in each prompt:
brown bear
# Input Details:
# The task is to generate a specific number of prompts related to the description provided.
Input:
Generate 50 realistic prompts related to brown bear in different environments for image generation.
# Instructions for Prompt Generation:
# - Each prompt should depict real-life behaviors and scenarios involving the objects.
# - All important objects should be included in every prompt.
# - Ensure that the objects are captured at varying distances from the camera:
# - From very close-up shots to objects in the far background.
# - The prompts should be diverse and detailed to cover a wide range of use cases.
# Output Format:
# - The output should be a Python list containing all the generated prompts as strings.
# - Each prompt should be enclosed in quotation marks and separated by commas within the list.
Output:
Return a Python list containing these prompts as strings for later use in training a computer vision model.
[prompt1, prompt2, ...]
Next, we expand our base prompt template using a text generation model. This gives us a wide variety of different scenarios for our object detection model. There are two main ways to do this:
To use OpenAI's API, you need to sign up for an API key; you can find more information on how to do this in the OpenAI documentation. Once you have your API key, you can use the following code to expand your prompt:
# Initialize the OpenAI API client with your API key.
openai_chat = OpenAI(
    api_key="YOUR_OPENAI_API_KEY"
)

# Generate prompts for image generation using the OpenAI API.
completion = openai_chat.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "system", "content": base_prompt + "\n\n your response: [prompt1, prompt2, ...] and do not say anything else and I will be using ast.literal_eval to convert the string to a list"}]
)

# Extract the generated prompts from the API response.
response = completion.choices[0].message.content
Since the output of the API is a string, we need to convert it into a list of strings.
# Extract the part of the string that contains the variable definition
variable_definition = response.strip()

# Fix the formatting issue by ensuring the string is a valid Python list
if variable_definition.endswith(","):
    variable_definition = variable_definition[:-1] + "]"

# Use ast.literal_eval to safely evaluate the variable definition
prompts = ast.literal_eval(variable_definition)

# Print the first few prompts to verify the output
print(prompts[0:5])
['A brown bear in a dense forest, standing behind a thick tree trunk with leaves and branches covering it from head to paw, looking directly at the camera.', 'A brown bear walking alone in a vast, open tundra, with mountains visible in the far background, and a few birds flying overhead.', "A close-up shot of a brown bear's face, focusing on its eyes and nostrils as it smells the air, with the background blurred.", 'A brown bear standing at the edge of a serene lake, reflecting the beauty of the surrounding landscape in the calm water.', 'A brown bear roaming freely in a meadow filled with wildflowers of various colors, under a clear blue sky with a few white clouds.']
If you prefer using web interfaces, you can use tools like ChatGPT, Gemini, Claude, or others. These tools provide a user-friendly interface where you can paste the base prompt template, and then copy the expanded list of prompts from the response.
Once we have our expanded prompts, we can further multiply them to get a larger dataset. This is done by simply repeating the prompts multiple times. Here's an example of how you can multiply your prompts:
# Increase the number of prompts by doubling the existing prompts
prompts = prompts * 2

# Shuffle the prompts to ensure randomness
import random
random.shuffle(prompts)

# Print the total number of prompts after doubling and shuffling
len(prompts)
100
Now that we have our expanded prompts, we can use image generation models to turn these text descriptions into actual images. There are several image generation models available, such as DALL-E, Stability.ai, and others. Here's an example of how you can use Stable Diffusion to generate images based on your prompts:
# Defining the model name and device to use for the Stable Diffusion pipeline.
model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

# Load the Stable Diffusion pipeline with the specified model and device.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to(device)
Now that we have loaded the image generation model, we can use it to generate images based on our prompts. Here's how you can generate images from your prompts:
# Extract the first 10 prompts for generating images (sample data).
sample_prompts = prompts[:10]

# Generate images based on the sample prompts using the Stable Diffusion pipeline.
images = pipe(sample_prompts).images
In the above code, we sampled 10 prompts from our expanded prompts and generated images based on those prompts to quickly visualize the results.
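When you move from this small sample to the full prompt list, passing every prompt to the pipeline in one call can exhaust GPU memory. Below is a minimal sketch of how you might process the prompts in smaller batches; it reuses the prompts list and pipe object defined above, and the batch size of 5 is an illustrative value rather than a recommendation.

# Generate images for all prompts in small batches to keep GPU memory usage manageable.
batch_size = 5  # illustrative value; tune it to your hardware
all_images = []
for start in range(0, len(prompts), batch_size):
    batch = prompts[start:start + batch_size]   # Take the next slice of prompts
    all_images.extend(pipe(batch).images)       # Generate and collect the images for this batch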
Let's create a dataframe to store the generated images and their corresponding prompts for easy visualization and reference.
# Store the prompts and the corresponding generated images in a DataFrame.
synthetic_data = pd.DataFrame({'Prompt': sample_prompts, 'Image': images})

# Display the synthetic data containing prompts and the corresponding generated images.
synthetic_data
|   | Prompt | Image |
|---|---|---|
| 0 | A brown bear in a dense forest, standing behin... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 1 | A brown bear walking alone in a vast, open tun... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 2 | A close-up shot of a brown bear's face, focusi... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 3 | A brown bear standing at the edge of a serene ... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 4 | A brown bear roaming freely in a meadow filled... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 5 | A brown bear crossing a shallow stream, steppi... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 6 | A brown bear in a forest, climbing a tree, wit... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 7 | A brown bear on top of a hill, looking out ove... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 8 | A brown bear in a snowy landscape, trudging th... | <PIL.Image.Image image mode=RGB size=512x512 a... |
| 9 | A brown bear at a beach, walking along the sho... | <PIL.Image.Image image mode=RGB size=512x512 a... |
Let's visualize the generated images and see how they look:
# Define a function to display the images generated from the prompts.
def display_images(dataframe):
    # Extract the images from the DataFrame
    images = dataframe['Image']

    # Set up the grid
    max_images_per_row = 5
    rows = (len(images) + max_images_per_row - 1) // max_images_per_row
    columns = min(len(images), max_images_per_row)

    # Create a figure and axis
    fig, axes = plt.subplots(rows, columns, figsize=(columns * 5, rows * 5))

    # Flatten axes so they can be indexed uniformly (handles the single-row and single-axis cases)
    if rows == 1:
        axes = [axes] if columns == 1 else list(axes)
    else:
        axes = axes.flatten()

    # Display each image
    for i, image in enumerate(images):
        axes[i].imshow(image)
        axes[i].axis('off')

    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis('off')

    # Show the grid of images
    plt.show()

# Display the synthetic images generated from the prompts.
display_images(synthetic_data)
Some of the generated images may not clearly show our target object. We can improve our base_prompt to avoid such off-target images, but when generating tons of images we cannot manually check whether every image actually contains the object; in our case, for example, the model can easily produce a noisy image of a bear. Since we are working with a small sample we could check these manually, but we can also use a pre-trained vision model to perform the validation and recreate the images that contain such errors.
We can use vision models provided by Hugging Face, OpenAI, or any other pre-trained vision model provider. We will use Qwen/Qwen2-VL-72B-Instruct, an open-source vision model, through an API from Nebius.ai to validate the quality of the generated images.
The first step is to create a validation prompt that checks whether a generated image contains the object we are looking for.
# Define the prompt for the image validation task.
validation_prompt = "Analyze the provided image and determine if it depicts a real bear, which is an animal, excluding any other types of objects or representations. Respond strictly with 'True' for yes or 'False' for no."
Then we can encode the images and pass them to the vision model along with the validation prompt to check if the generated images contain the object we are looking for.
# Initialize the OpenAI API client with your Nebius API key.
openai_chat = OpenAI(
    base_url="https://api.studio.nebius.ai/v1/",
    api_key="YOUR_NEBIUS_API_KEY"
)

# Define a function to validate images using the OpenAI API.
def validate_images(validation_prompt, images):
    # Initialize an empty list to store the validation results
    bools = []

    # Function to encode a PIL image to base64
    def encode_image_pil(image):
        buffer = BytesIO()
        image.save(buffer, format="JPEG")  # Save the image to the buffer in JPEG format
        buffer.seek(0)                     # Rewind the buffer to the beginning
        return base64.b64encode(buffer.read()).decode("utf-8")  # Convert to base64

    # Iterate through your images
    for image in images:
        # Convert the PIL image to base64
        base64_image = encode_image_pil(image)

        # Prepare the API payload
        response = openai_chat.chat.completions.create(
            model="Qwen/Qwen2-VL-72B-Instruct",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": validation_prompt},
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
                    ],
                }
            ],
        )

        # Append the response to the list
        bools.append(response.choices[0].message.content.replace('.', '').replace('\n', ''))

    # Convert the list of strings to a list of booleans
    bools = [ast.literal_eval(item) for item in bools]
    return bools

# Validate images and add the results as a new column in the dataframe
synthetic_data['bear_class'] = validate_images(validation_prompt, synthetic_data['Image'])
Let's see how our updated dataframe looks after validation:
synthetic_data
|   | Prompt | Image | bear_class |
|---|---|---|---|
| 0 | A brown bear in a dense forest, standing behin... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 1 | A brown bear walking alone in a vast, open tun... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 2 | A close-up shot of a brown bear's face, focusi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 3 | A brown bear standing at the edge of a serene ... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 4 | A brown bear roaming freely in a meadow filled... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 5 | A brown bear crossing a shallow stream, steppi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 6 | A brown bear in a forest, climbing a tree, wit... | <PIL.Image.Image image mode=RGB size=512x512 a... | False |
| 7 | A brown bear on top of a hill, looking out ove... | <PIL.Image.Image image mode=RGB size=512x512 a... | False |
| 8 | A brown bear in a snowy landscape, trudging th... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 9 | A brown bear at a beach, walking along the sho... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
Two images were flagged as noisy, so we regenerate them and validate them again to make sure they contain the object we are looking for.
# Define a function to regenerate and validate images based on the validation results.
def regenerate_and_validate_images(dataframe, validation_prompt, pipe):
    # Get rows where bear_class is False (i.e., the image does not depict a bear)
    # and run the image generation process again for those rows
    rows_to_regenerate = dataframe[dataframe['bear_class'] == False]

    # Extract indices and prompts separately
    indices_to_regenerate = rows_to_regenerate.index
    prompts_to_regenerate = rows_to_regenerate['Prompt'].tolist()

    # Generate images based on the prompts that need to be regenerated.
    images_to_regenerate = pipe(prompts_to_regenerate).images

    # Iterate over the indices and the newly generated images
    for idx, img in zip(indices_to_regenerate, images_to_regenerate):
        dataframe.at[idx, 'Image'] = img

    # Validate only the rows that were regenerated
    dataframe.loc[indices_to_regenerate, 'bear_class'] = validate_images(
        validation_prompt, dataframe.loc[indices_to_regenerate, 'Image']
    )

    return dataframe

# Call the function to regenerate and validate images
synthetic_data = regenerate_and_validate_images(synthetic_data, validation_prompt, pipe)
Let's print the dataframe to see if the images were recreated and validated correctly this time:
synthetic_data
|   | Prompt | Image | bear_class |
|---|---|---|---|
| 0 | A brown bear in a dense forest, standing behin... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 1 | A brown bear walking alone in a vast, open tun... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 2 | A close-up shot of a brown bear's face, focusi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 3 | A brown bear standing at the edge of a serene ... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 4 | A brown bear roaming freely in a meadow filled... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 5 | A brown bear crossing a shallow stream, steppi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 6 | A brown bear in a forest, climbing a tree, wit... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 7 | A brown bear on top of a hill, looking out ove... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 8 | A brown bear in a snowy landscape, trudging th... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
| 9 | A brown bear at a beach, walking along the sho... | <PIL.Image.Image image mode=RGB size=512x512 a... | True |
Our dataframe no longer contains any noisy images. Let's visualize the generated images to confirm that they all contain the object we are looking for:
# Display the synthetic images after regenerating and validating the images.
display_images(synthetic_data)
Now that we have successfully validated our generated images, we can move on to the next step in our pipeline - labeling the images with bounding boxes.
Once we have our synthetic images, we need to label them with bounding boxes around the objects we want to detect. There are several tools available that can help automate this process, such as Grounded-SAM, Grounding DINO, and others. We will use Grounding DINO to label our images with bounding boxes; you can choose any other tool as per your requirements.
The first step is to define the ontology of the captions that we want to detect. For example, in our case we want to detect bears, so we define the ontology as follows:
# Defining the CaptionOntology for the object "bear" in the generated images.
ontology = CaptionOntology(
    {
        "bear": "bear"  # Define the ontology for the object "bear"
    }
)
After that, we initialize the Grounding DINO model with the above ontology:
# Initialize the GroundingDINO model with the defined ontology.
base_model = GroundingDINO(ontology=ontology)
Grounding DINO requires the input to be a directory containing images, so we will save the images temporarily to a directory and then pass that directory to the model for labeling.
# Create a temporary directory for saving images
temp_dir = "temp_images"
os.makedirs(temp_dir, exist_ok=True)

# Save the images to the temporary directory
for idx, img in enumerate(synthetic_data['Image']):
    file_path = os.path.join(temp_dir, f"image_{idx}.jpg")
    img.save(file_path)  # Save the PIL image
Now we can label the images in the specified directory using the GroundingDINO model:
# Label the images using the GroundingDINO model
base_model.label(
    temp_dir,                        # Directory containing the images to label
    extension=".jpg",
    output_folder="labeled_images"
)

# Optional: Clean up the temporary directory after labeling (if desired)
import shutil
shutil.rmtree(temp_dir)
Once we run the above code, the images will be labeled with bounding boxes around the objects we want to detect. A new folder will be created based on the output_folder parameter, and the labeled images will be saved there.
Our labeled_images folder will contain two subfolders, train and valid, and one file, data.yaml. The train folder contains the training images with bounding boxes, the valid folder contains the validation images with bounding boxes, and the data.yaml file contains the metadata about the labeled images.
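As a quick sanity check, you can inspect this output programmatically. The sketch below assumes the default autodistill output layout (images and labels subfolders under train and valid) and that PyYAML is available in your environment; adjust the paths if your version writes a different structure.

# Inspect the labeled dataset produced in the previous step.
import yaml  # PyYAML; install with `pip install pyyaml` if it is not already present

output_dir = "labeled_images"

# Count the images in each split (assumed folder layout)
for split in ("train", "valid"):
    images_dir = os.path.join(output_dir, split, "images")
    if os.path.exists(images_dir):
        print(split, "images:", len(os.listdir(images_dir)))

# Print the dataset metadata (class names, train/valid paths, etc.)
with open(os.path.join(output_dir, "data.yaml")) as f:
    print(yaml.safe_load(f))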
It is useful to include the labels in the dataframe for easy reference, so let's update the dataframe with them.
# Paths to the train and valid label directories
train_labels_dir = "labeled_images/train/labels"
valid_labels_dir = "labeled_images/valid/labels"

# Function to map labels to the dataframe
def map_labels_to_dataframe(df, train_labels_dir, valid_labels_dir):
    # Combine all label paths
    label_paths = {}
    for label_dir in [train_labels_dir, valid_labels_dir]:
        if os.path.exists(label_dir):
            label_files = sorted(os.listdir(label_dir))  # Ensure order matches `image_{index}`
            for label_file in label_files:
                if label_file.endswith(".txt"):
                    label_index = int(label_file.split('_')[1].split('.')[0])  # Extract index from filename
                    label_paths[label_index] = os.path.join(label_dir, label_file)

    # Read labels and map them to the dataframe
    labels = []
    for idx in range(len(df)):
        label_path = label_paths.get(idx, None)
        if label_path and os.path.exists(label_path):
            with open(label_path, "r") as f:
                label_content = f.read().strip()  # Read the bounding box info
            labels.append(label_content)
        else:
            labels.append("")  # No label found for this index

    # Assign labels to the dataframe
    df['Labels'] = labels
    return df

# Map labels to the dataframe
synthetic_data = map_labels_to_dataframe(synthetic_data, train_labels_dir, valid_labels_dir)

# Optional: Clean up the labeled_images directory after mapping the labels (if desired)
import shutil
shutil.rmtree("labeled_images")
Let's visualize the labeled images to see how the bounding boxes look:
# Helper function to draw bounding boxes on an image
def draw_bboxes(image, bboxes):
    # Convert PIL Image to NumPy array (if needed)
    if isinstance(image, np.ndarray) is False:
        image = np.array(image)
    h, w = image.shape[:2]

    # Draw each bounding box
    for _, xc, yc, ww, hh in bboxes:
        x1, y1, x2, y2 = [int(v) for v in [(xc-ww/2)*w, (yc-hh/2)*h, (xc+ww/2)*w, (yc+hh/2)*h]]
        cv2.rectangle(image, (x1, y1), (x2, y2), (255, 0, 0), 3)  # Draw the bounding box
        cv2.putText(image, 'Bear', (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 0, 0), 2)
    return image

# Helper function to parse labels from a string
def parse_labels(label_str):
    if not label_str:
        return []
    return [tuple(map(float, line.split())) for line in label_str.split('\n')]

# Grid layout for displaying images
grid_cols = math.ceil(math.sqrt(len(synthetic_data)))
grid_rows = math.ceil(len(synthetic_data) / grid_cols)

fig, axes = plt.subplots(grid_rows, grid_cols, figsize=(15, 10))
axes = axes.flatten() if grid_rows * grid_cols > 1 else [axes]

# Plot each image with its bounding boxes
for i, (image, label_str) in enumerate(zip(synthetic_data['Image'], synthetic_data['Labels'])):
    # Parse labels and draw bounding boxes
    bboxes = parse_labels(label_str)
    image_with_boxes = draw_bboxes(image.copy(), bboxes)

    # Display the image
    axes[i].imshow(image_with_boxes)
    axes[i].axis('off')
    axes[i].set_title(f"Image {i}")

# Hide unused axes
for ax in axes[len(synthetic_data):]:
    ax.axis('off')

plt.tight_layout()
plt.show()
Similar to validating the generated images, we can validate the labeled images using a pre-trained vision model. We will use the same vision model, Qwen/Qwen2-VL-72B-Instruct, to validate the labeled images, checking whether the bounding boxes are correctly placed around the objects we want to detect.
The first step is to create a validation prompt that checks whether the bounding box is correctly placed around the object we are looking for.
# Define the validation prompt for the labeled images.
validation_labeled_prompt = "Evaluate the provided image and its associated bounding box. Determine if the bounding box correctly and fully encloses the entire bear of interest without cutting off any part of it, leaving excessive empty space, or including irrelevant areas. Respond strictly with 'True' if the bounding box is correct or 'False' if it is not."
Next, we draw the bounding boxes on the images and validate them using the same vision model from Nebius.ai.
# Function to parse labels
def parse_labels(label_str):
    if not label_str:
        return []
    return [tuple(map(float, line.split())) for line in label_str.split('\n')]

# Function to draw bounding boxes on a copy of a PIL image
def draw_bboxes_on_pil(image, bboxes):
    image_copy = image.copy()
    draw = ImageDraw.Draw(image_copy)
    width, height = image_copy.size

    for _, xc, yc, ww, hh in bboxes:
        x1 = (xc - ww / 2) * width
        y1 = (yc - hh / 2) * height
        x2 = (xc + ww / 2) * width
        y2 = (yc + hh / 2) * height
        draw.rectangle([x1, y1, x2, y2], outline="blue", width=7)
        draw.text((x1, y1 - 10), "Bear", fill="blue")
    return image_copy

# Create a new list to store images with bounding boxes
images_with_bboxes = [
    draw_bboxes_on_pil(row['Image'], parse_labels(row['Labels']))
    for _, row in synthetic_data.iterrows()
]

# Apply the validate_images function to the images with bounding boxes
synthetic_data['correct_label'] = validate_images(validation_labeled_prompt, images_with_bboxes)
Let's see if the validation is successful:
synthetic_data
|   | Prompt | Image | bear_class | Labels | correct_label |
|---|---|---|---|---|---|
| 0 | A brown bear in a dense forest, standing behin... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.57378 0.51804 0.56246 0.50930 | True |
| 1 | A brown bear walking alone in a vast, open tun... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.38505 0.63658 0.42610 0.28150 | True |
| 2 | A close-up shot of a brown bear's face, focusi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.53580 0.50004 0.92833 1.00000 | True |
| 3 | A brown bear standing at the edge of a serene ... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.32516 0.47152 0.28826 0.28555 | True |
| 4 | A brown bear roaming freely in a meadow filled... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.42838 0.57926 0.85668 0.40411 | True |
| 5 | A brown bear crossing a shallow stream, steppi... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.25346 0.37694 0.50663 0.36794 | True |
| 6 | A brown bear in a forest, climbing a tree, wit... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.43852 0.55988 0.67820 0.87656 | True |
| 7 | A brown bear on top of a hill, looking out ove... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.61097 0.72983 0.77816 0.47296 | True |
| 8 | A brown bear in a snowy landscape, trudging th... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.66967 0.55373 0.66067 0.89216 | True |
| 9 | A brown bear at a beach, walking along the sho... | <PIL.Image.Image image mode=RGB size=512x512 a... | True | 0 0.71455 0.69845 0.38040 0.24842 | True |
Grounding DINO has successfully labeled the images with bounding boxes around the objects we want to detect, i.e., bears, and the validation prompt has confirmed that the labels are correct. If the validation fails for any image, we can recreate the bounding boxes using different parameters for Grounding DINO or other methods such as Grounded-SAM.
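For completeness, here is a hedged sketch of how such a re-labeling pass could look. It assumes your installed version of autodistill-grounding-dino accepts box_threshold and text_threshold arguments (check your version's documentation); the threshold values and the retry_images / relabeled_images folder names are purely illustrative.

# Re-label only the images whose bounding boxes failed validation, using a
# GroundingDINO model configured with a lower (illustrative) box threshold.
failed_rows = synthetic_data[synthetic_data['correct_label'] == False]

if len(failed_rows) > 0:
    # Save the failing images to a separate folder
    retry_dir = "retry_images"
    os.makedirs(retry_dir, exist_ok=True)
    for idx, img in zip(failed_rows.index, failed_rows['Image']):
        img.save(os.path.join(retry_dir, f"image_{idx}.jpg"))

    # Assumed parameters: box_threshold / text_threshold may differ across versions
    retry_model = GroundingDINO(ontology=ontology, box_threshold=0.3, text_threshold=0.25)
    retry_model.label(retry_dir, extension=".jpg", output_folder="relabeled_images")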
Now that we have our training dataset ready, we can preprocess the data to make it suitable for training an object detection model. We will save our training dataframe to a CSV file which includes the images, labels, and some metadata.
# Function to convert an image to a base64 string
def pil_image_to_base64(img):
    buffered = io.BytesIO()
    img.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode('utf-8')
    return img_str

# Convert images to base64 and store in the dataframe
synthetic_data['Image'] = synthetic_data['Image'].apply(pil_image_to_base64)

# Save the DataFrame to CSV
synthetic_data.to_csv('train.csv', index=False)
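When you later want to use train.csv, the base64 strings can be decoded back into PIL images. A minimal sketch (assuming the CSV was written exactly as above) could look like this:

# Reload the saved dataset and decode the base64 strings back into PIL images.
from PIL import Image

reloaded = pd.read_csv('train.csv')
reloaded['Image'] = reloaded['Image'].apply(
    lambda s: Image.open(io.BytesIO(base64.b64decode(s)))  # base64 -> bytes -> PIL image
)

# Quick check that prompts and labels survived the round trip
print(reloaded[['Prompt', 'Labels']].head())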
In this notebook, we have demonstrated how to generate synthetic images using text prompts, validate the generated images, label the images with bounding boxes, validate the labeled images, and preprocess the data for training an object detection model. This pipeline can be used to generate large datasets for training object detection models when real-world data is limited or expensive to collect.
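If you also want to close the loop and train a detector on the generated dataset, one option is to point a YOLOv8 target model at the data.yaml produced by Grounding DINO. This is a hedged sketch: it assumes the autodistill-yolov8 package installed earlier, that you skipped the optional cleanup step so the labeled_images folder still exists, and you should verify the exact API against your installed version; the epoch count is illustrative.

# Sketch: train a YOLOv8 model on the auto-labeled dataset.
from autodistill_yolov8 import YOLOv8

target_model = YOLOv8("yolov8n.pt")                        # Small pretrained YOLOv8 checkpoint
target_model.train("labeled_images/data.yaml", epochs=50)  # Illustrative number of epochs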
Thanks for reading!