We present an exploratory system that leverages natural language generation and image synthesis for a two-step process of image reconstruction via captioning. In the first step, images captured using Sony's Camera Remote API are processed by a Python script or Android application that obtains a descriptive caption from an OpenAI language model (GPT-4o-mini). In the second step, the caption is provided to a text-to-image model (DALL-E 3) to generate a new image. The resulting output is conceptually related to the original while demonstrating how language-based representation can shape the final visual. Although not intended as a high-fidelity reproduction tool, our system offers insights into bridging real-time camera inputs with text-driven generative pipelines.
We used Sony's Camera Remote API to capture images from a Sony DSC-QX10, a lens-style camera with no built-in viewfinder. We connect to the camera from either a computer or an Android phone over Wi-Fi Direct and then listen for photo capture events.
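As a rough sketch of this capture step, the snippet below triggers the shutter and downloads the resulting photo over the Camera Remote API's JSON-RPC interface. The endpoint address is an assumption (in practice it is discovered via SSDP after joining the camera's Wi-Fi Direct network), and our actual script listens for capture events rather than triggering the shutter itself.

```python
import requests

# Assumed JSON-RPC endpoint after joining the QX10's Wi-Fi Direct network and
# discovering the camera service via SSDP (the address/port vary by camera).
CAMERA_URL = "http://10.0.0.1:10000/sony/camera"

def call_camera(method: str, params=None):
    """POST a Camera Remote API JSON-RPC request and return the parsed response."""
    payload = {"method": method, "params": params or [], "id": 1, "version": "1.0"}
    return requests.post(CAMERA_URL, json=payload, timeout=30).json()

# Put the camera into remote-shooting mode, then trigger the shutter.
call_camera("startRecMode")
result = call_camera("actTakePicture")

# actTakePicture returns the URL of the postview image; download it and hand
# it to the captioning stage. The response layout may need adjustment per model.
photo_url = result["result"][0][0]
with open("capture.jpg", "wb") as f:
    f.write(requests.get(photo_url).content)
```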
Each captured image was fed to GPT-4o-mini to produce descriptive text in the style of a caption. The prompt was engineered to emphasize the salient features of the photograph (e.g., objects and their composition, background context, and general mood), with particular attention paid to human figures.
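The captioning call itself is a single chat-completions request with the photo attached as a base64 data URL. The sketch below uses the OpenAI Python SDK; the prompt wording shown is illustrative of the emphasis described above rather than our exact production prompt.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Illustrative captioning prompt emphasizing composition, mood, and human figures.
CAPTION_PROMPT = (
    "Describe this photograph in detail so that it could be recreated. "
    "Emphasize the objects and their composition, background context, and "
    "general mood, with particular attention to any human figures, their "
    "facial expressions, and body language."
)

def caption_image(image_path: str) -> str:
    """Send the captured photo to gpt-4o-mini and return its descriptive caption."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": CAPTION_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```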
The generated captions were then delivered to DALL-E 3 through an API call. Because the textual description encodes the conceptual attributes of the source image, the output attempts to preserve the key themes and features described. We also prepend a leading instruction to each caption so that the image is generated in the style of a photograph (e.g., not as a cartoon or drawing).
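The reconstruction step then reduces to one images.generate call against DALL-E 3. Again, the prefix text below is illustrative of the leading instruction we prepend, not a verbatim copy of it.

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Illustrative leading instruction steering the output toward a photographic style.
STYLE_PREFIX = (
    "Generate a realistic photograph, not a cartoon, drawing, or illustration, "
    "matching this description: "
)

def reconstruct_image(caption: str) -> str:
    """Ask dall-e-3 to synthesize a photographic image from the caption."""
    response = client.images.generate(
        model="dall-e-3",
        prompt=STYLE_PREFIX + caption,
        size="1024x1024",
        n=1,
    )
    # By default the API returns a URL to the hosted result image.
    return response.data[0].url
```

Chaining caption_image and reconstruct_image completes one pass through the pipeline for each captured frame; very long captions may need truncation to stay within the image model's prompt-length limit.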
Source Image:
Output Image:
Generated Caption used for Image Reconstruction:
The image features a dog, likely a Boston Terrier, resting on a blanket. The dog has a distinctive black and white coat, with a white patch on its face that contrasts with the darker fur. Its ears are large and upright, adding to its alert appearance. The dog's expression is curious and slightly inquisitive, gazing toward the camera with its big, round eyes. The eyes are dark and expressive, conveying a sense of personality. Its mouth is closed, which gives it a calm demeanor. In terms of pose, the dog is lying down, with one foreleg visible, slightly extended. The body language suggests relaxation, yet the attentive position of the ears indicates that it is aware of its surroundings. The background features a plain wall, while the dog is resting on a colorful blanket with a soft, patterned design featuring blue and green elements. The image has a warm, cozy feel, emphasizing the intimate setting. The focus appears somewhat soft, but the subject (the dog) remains the central point of interest.
The image generation stage (DALL-E 3) is the primary bottleneck in the pipeline, as it struggles to represent all of the details provided in the captions. The captioning model (GPT-4o-mini), meanwhile, generates highly detailed and information-rich descriptions. This disparity suggests that the cheaper GPT-4o-mini is a cost-effective choice for captioning: even its output contains more detail than DALL-E 3 can fully utilize, so a larger captioning model would add cost without improving the reconstruction.
The prompt was refined to address observed shortcomings, ensuring the captioning stage emphasizes details critical to visual resemblance, such as composition, lighting/mood, and human-centric features like facial expressions and body language. While these aspects are not consistently prioritized by the LLM by default, incorporating them explicitly in the prompt ensures their inclusion. This significantly enhances the perceived relevance of the generated image to the expectations set by the original photo.
The captioning model we used (GPT-4o-mini) does not infer or assign ethnicity, while the image generation model (DALL-E 3) introduces ethnic characteristics at random. This inconsistency may warrant further exploration, particularly for applications sensitive to accurate demographic representation.