Creating and updating pixel art character sprites with many frames spanning different animations and poses takes time and can quickly become repetitive. However, that work can be partially automated, allowing artists to focus on more creative tasks. In this work, we concentrate on creating pixel art character sprites in a target pose from images of the character facing the other three directions. We present a novel approach to character generation by framing the problem as a missing data imputation task. Our proposed generative adversarial network receives the images of a character in all available domains and produces the image of the missing pose. We evaluated our approach in scenarios with one, two, and three missing images, achieving results similar to or better than the state of the art when more images are available. We also evaluated the impact of the proposed changes to the base architecture.
Creating characters for pixel art games is a time-consuming process in game development that often involves a lot of back-and-forth adjustments. Artists meticulously design each pixel, but even small changes can require updating many images, especially for characters that move and face different directions. While artists are skilled at this, some aspects of the process can be repetitive and tedious, such as creating special effects or ensuring consistency across all character poses.
To address these challenges, researchers are exploring the use of artificial intelligence (AI) to help automate parts of the character creation process. This research often involves using AI to generate new images based on existing ones, such as creating a character facing a different direction from an existing image. Our approach focuses on using all available images of a character to predict a missing pose, rather than relying on a single source image. By using more information about the character, the model can create more accurate, higher-quality results, streamlining the character design process and allowing artists to focus on more creative aspects of their work.
Recent research in pixel art character generation has focused on improving the quality and versatility of AI-generated images. Serpa and Rodrigues (2022) significantly enhanced their previous work by reframing the problem as a semantic segmentation task. This approach, combined with architectural modifications like dense connections and deep supervision, resulted in more accurate and detailed character sprites, particularly in challenging poses.
Furthermore, Coutinho and Chaimowicz (2024) demonstrated the importance of large and diverse datasets for training robust AI models. By compiling a dataset of 14,000 paired images of characters in different poses, they achieved significant improvements in image quality, especially when evaluated on more artistically cohesive datasets. Additionally, they introduced a post-processing step to quantize the generated images to the color palette of the input image, further enhancing the visual fidelity of the results.
We propose an architecture based on CollaGAN to impute images of pixel art characters in a missing pose (target domain). To facilitate understanding, let us consider that there are four domains, one for each pose (back, left, front, and right), and that we want to generate a character's image in one of them (the target) from its images in the other three.
Our generator has one encoder branch to process the input from each domain, a single decoder branch with concatenated skip connections, and outputs an image in the missing domain. The discriminator distinguishes images as real or fake, as well as determines their domain through an auxiliary classifier output.
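For illustration, the sketch below lays out this generator/discriminator structure in Keras: one encoder per domain, a single decoder that concatenates skip connections from all encoders, and a discriminator with real/fake and domain-classification heads. The number of layers, filter counts, and image dimensions are placeholders and do not correspond to the released model:

```python
import tensorflow as tf
from tensorflow.keras import layers

DOMAINS, SIZE, CHANNELS = 4, 64, 4  # placeholder dimensions, for illustration only

def build_generator():
    inputs = [layers.Input((SIZE, SIZE, CHANNELS)) for _ in range(DOMAINS)]
    skips, bottlenecks = [], []
    # one encoder branch per input domain
    for x in inputs:
        d1 = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(x)
        d2 = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(d1)
        d3 = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(d2)
        skips.append((d1, d2))
        bottlenecks.append(d3)
    # single decoder with skip connections concatenated from all encoder branches
    x = layers.Concatenate()(bottlenecks)
    x = layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x] + [skip[1] for skip in skips])
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x] + [skip[0] for skip in skips])
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(x)
    output = layers.Conv2D(CHANNELS, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, output)

def build_discriminator():
    image = layers.Input((SIZE, SIZE, CHANNELS))
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(image)
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 4, strides=2, padding="same", activation="relu")(x)
    real_fake = layers.Conv2D(1, 3, padding="same")(x)                  # real vs. fake (patch output)
    domain = layers.Dense(DOMAINS)(layers.GlobalAveragePooling2D()(x))  # auxiliary domain classifier
    return tf.keras.Model(image, [real_fake, domain])
```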
Compared to the original CollaGAN architecture, we proposed the following modifications:

1. **Increased capacity**: more trainable parameters in the networks.
2. **Forward replacer**: during the backward (cycle) step of training, only the original target is replaced by the image generated in the forward step.
3. **Conservative input dropout**: a batch selection strategy that sometimes omits input domains during training, favoring keeping more of them.
While change (1) is straightforward, changes (2) and (3) require some explanation. To avoid going too deep into the training procedure here, we invite the reader to read the paper or watch a presentation. Later in this article, we present an ablation study showing how each modification improved the resulting images.
The model was trained on the pixel art characters dataset for 240,000 generator update steps with minibatches of 4 examples.
We evaluated the generated images using the FID (Fréchet Inception Distance) and MAE (Mean Absolute Error) metrics, for which lower values are better.
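For reference, the sketch below shows how MAE can be computed between a generated and a target image. The exact evaluation protocol, including the range in which errors are measured, is detailed in the paper; here we assume errors in [0, 1]:

```python
import tensorflow as tf

def mean_absolute_error(target, generated):
    # model outputs lie in [-1, 1]; rescale to [0, 1] before averaging absolute differences
    target = (target + 1.0) / 2.0
    generated = (generated + 1.0) / 2.0
    return tf.reduce_mean(tf.abs(target - generated))
```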
Next, we present our main results with both quantitative and qualitative analyses. We also invite you to try the model using our interactive generator.
The figure above shows a different character example in each row; the columns depict the source and target images, followed by the images generated by the Pix2Pix, StarGAN, and our MDIGAN (CollaGAN-3) models. As the baseline models take only a single image as input, there are three possible outputs for each character.
The quality of the generated images varies depending on the model and the target pose.
Color Usage: While the generated images generally use colors in appropriate locations, they often exhibit a wider range of tones than the original pixel art style, which typically relies on a limited color palette. This issue can be partially addressed by quantizing the colors after generation, a technique used in previous research (Coutinho and Chaimowicz, 2024); a simple nearest-color sketch appears after these observations.
Shape Accuracy: How faithfully the models reproduce the character's shape depends mostly on the target pose:
Easier directions: All models generally perform well when generating poses that involve simple transformations, such as flipping the character horizontally (e.g., from facing left to right). This is likely due to the relative ease of learning this specific transformation.
Harder directions: When generating more complex poses, such as a character facing backward, some models, including ours, may exhibit artifacts, such as faint remnants of facial features in unexpected areas (see the maid in the first row).
Overall Quality: The quality of the images generated by our proposed MDIGAN model is comparable to or better than that of the baseline models. Notably, the model achieves these results with significantly fewer trainable parameters, indicating a more efficient use of resources.
Our MDIGAN model has 22% fewer trainable parameters than StarGAN and 70% fewer than Pix2Pix.
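As mentioned in the Color Usage note, a simple post-processing option is to snap every generated pixel to the nearest color of the input character's palette. The sketch below illustrates that idea; it is not the exact procedure of Coutinho and Chaimowicz (2024), and the helper name is hypothetical:

```python
import numpy as np

def quantize_to_palette(generated, palette):
    """Map each pixel of `generated` (H, W, C) to the closest color in `palette` (N, C)."""
    pixels = generated.reshape(-1, 1, generated.shape[-1]).astype(np.float32)
    colors = palette.reshape(1, -1, palette.shape[-1]).astype(np.float32)
    nearest = np.argmin(np.linalg.norm(pixels - colors, axis=-1), axis=1)
    return palette[nearest].reshape(generated.shape)

# the palette can be extracted from one of the source images, e.g.:
# palette = np.unique(source_image.reshape(-1, source_image.shape[-1]), axis=0)
```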
Even though we propose the model to impute a single missing domain from three input images, we also evaluate it in scenarios where it receives only two (CollaGAN-2) or a single image (CollaGAN-1). The metrics' values are averaged over all targets and all available source combinations for each scenario (i.e., CollaGAN-3, 2, and 1).
The following table compares the proposed model in those situations. We can observe that both FID and MAE metrics progressively improve as the number of available domains increases, with CollaGAN-2 still having better MAE than Pix2Pix and StarGAN.
| Model/Sources | Average FID | Average MAE |
|---|---|---|
| Pix2Pix | 4.091 | 0.05273 |
| StarGAN | 2.288 | 0.06577 |
| CollaGAN-1 | 8.393 | 0.06449 |
| CollaGAN-2 | 4.277 | 0.05035 |
| CollaGAN-3 🏆 | 1.508 | 0.04078 |
The values of both metrics have been averaged among all possible input/output combinations for each model.
The figure below shows example generations of the model when it receives 3 inputs (CollaGAN-3), 2 inputs (CollaGAN-2) and only 1 (CollaGAN-1).
We evaluated the impact of different batch selection strategies for presenting examples to the model during training: should it always see the three available domains, or should some of them sometimes be omitted?
We investigated four approaches: never dropping inputs (none), the original CollaGAN strategy, curriculum learning, and our proposed conservative strategy.
The original approach has an equal chance of presenting three, two, or a single image at each training step. The curriculum learning approach starts training with the easier task (using three images) and progressively makes it harder (down to a single input) until halfway through training, then randomly chooses how many domains to drop for the second half. Lastly, the conservative approach also randomly selects the number of images to drop, but with higher probabilities of keeping more of them: 60% with 3 images, 30% with 2, and 10% with a single image.
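The sketch below illustrates how each strategy could choose the number of inputs to drop at a training step; the curriculum schedule shown is a simplification of ours, and `progress` denotes the fraction of training completed:

```python
import random

def num_inputs_to_drop(strategy, progress):
    """Return how many of the 3 available input domains to drop at this training step."""
    if strategy == "none":
        return 0                                # always present all 3 available images
    if strategy == "original":
        return random.choice([0, 1, 2])         # equal chance of keeping 3, 2, or 1 image
    if strategy == "curriculum":
        if progress < 0.5:                      # first half: progressively harder
            return min(int(progress * 6), 2)    # drop 0, then 1, then 2 inputs
        return random.choice([0, 1, 2])         # second half: random
    if strategy == "conservative":
        # favor keeping more images: 60% keep all 3, 30% keep 2, 10% keep only 1
        return random.choices([0, 1, 2], weights=[0.6, 0.3, 0.1])[0]
    raise ValueError(f"unknown strategy: {strategy}")
```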
The following table presents the results. Using any input dropout yields better results than always showing all domains (none). Compared to the original and curriculum learning strategies, our proposed conservative tactic has better FID and MAE metrics on the average of the three scenarios.
| Sources | FID: None | FID: Original | FID: Curric. | FID: Conserv. 🏆 | MAE: None | MAE: Original | MAE: Curric. | MAE: Conserv. 🏆 |
|---|---|---|---|---|---|---|---|---|
| CollaGAN-3 | 4.816 | 1.911 | 2.160 | 1.508 | 0.04523 | 0.04277 | 0.04222 | 0.04078 |
| CollaGAN-2 | 19.050 | 6.835 | 9.233 | 4.277 | 0.08003 | 0.05053 | 0.07389 | 0.05035 |
| CollaGAN-1 | 32.676 | 11.162 | 20.303 | 8.393 | 0.12820 | 0.06243 | 0.12232 | 0.06449 |
| Average | 18.847 | 6.636 | 10.566 | 4.726 | 0.08449 | 0.05191 | 0.07948 | 0.05187 |
To understand the impact of our changes to the original CollaGAN architecture, we trained and evaluated models that progressively added each modification. The following table shows the FID and MAE values of the generated images averaged over all domains and among the scenarios of the model receiving three, two, and one input domains. The rows show the results of each modification cumulatively: the first one is the original CollaGAN model without any of our proposed changes, the second introduces the first modification, the third uses two changes, and the last includes all three (our final model).
| Modification | Average FID | FID Improv. | Average MAE | MAE Improv. |
|---|---|---|---|---|
| Original | 8.866 | --- | 0.06069 | --- |
| + Increased capacity | 11.078 | -24.95% | 0.05666 | 6.64% |
| + Forward Replacer | 6.636 | 25.15% | 0.05191 | 14.47% |
| + Conservative Inp. Drop. | 4.726 | 46.70% | 0.05187 | 14.53% |
The original model had 6,565,712 trainable parameters; with the increased capacity, it has 104,887,616. That change alone improved MAE but worsened FID. The replacement strategy of substituting only the original target with the image generated in the forward step improves both metrics. Lastly, training with the proposed conservative input dropout further enhances the results, yielding FID and MAE values that are 46.7% and 14.53% better than the original architecture's.
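To make the replacement tactic more concrete, the sketch below shows one way to assemble the input set of a backward (cycle) step: the forward-generated image occupies only the original target's slot, and the domain being reconstructed is zeroed out. This is a simplified illustration with hypothetical tensor names, not the actual training code:

```python
import tensorflow as tf

def backward_step_inputs(real_images, forward_generated, target_index, reconstruct_index):
    """Assemble the generator inputs for one backward (cycle) reconstruction.

    real_images: [domains, H, W, C] tensor with the character in every pose (hypothetical layout).
    """
    images = list(tf.unstack(real_images, axis=0))
    # the forward-generated image replaces only the original target's slot...
    images[target_index] = forward_generated
    # ...while the domain that this backward step must reconstruct is erased
    images[reconstruct_index] = tf.zeros_like(images[reconstruct_index])
    return tf.stack(images, axis=0)
```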
We posed the task of generating pixel art characters as a missing data imputation problem and approached it using a deep generative model. It is based on the CollaGAN architecture, from which we proposed changes involving a capacity increase, a conservative input dropout strategy, and a different replacement tactic during the backward step of the training procedure. The experiments showed that all of the changes contributed to achieving better results.
Compared to the baseline models, our approach produces images of similar or better quality when using three domains as input. The model can still produce plausible images in scenarios with fewer available images, but with increasingly lower quality.
If you are interested in using our architecture, trained model, or dataset, read on.
We trained the models using the PAC Dataset, which consists of 14k pixel art character sprites facing 4 directions: back, left, front, and right.
It can be used for non-commercial purposes only and was assembled from the Liberated Pixel Cup characters and RPG Maker RTPs (available online).
We used TensorFlow 2.10.1 with Python 3.9 to create and train the model, and the repository is available on GitHub. The weights can be downloaded from Hugging Face at fegemo/mdigan-characters and used in Python or JavaScript (with TensorFlow.js).
We tested the code with Python 3.9 and the specific versions of TensorFlow and NumPy pinned below.
```python
# requires Python 3.9
!pip install numpy==1.24.4 tensorflow==2.10.1 matplotlib notebook patool Pillow
```
The model is hosted on GitHub and requires Git LFS to be downloaded. We also download some images to test the model.
print("Downloading the model from Github...") !git lfs install !git clone --depth=1 --branch=main https://github.com/fegemo/mdigan-characters-model/ weights print("Extracting some images...") import patoolib patoolib.extract_archive("./weights/images.zip", outdir=".")
Load the model from the `weights` folder.
```python
import tensorflow as tf

model = tf.keras.models.load_model("weights", compile=False)
```
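If you want to check the parameter counts discussed earlier, summarize the loaded model:

```python
# prints the layers and the total/trainable parameter counts
model.summary()
```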
We load the images of the characters "dealer", "merchant", and "cleric" from the `images` folder. Each character has four images: one for each direction (back, left, front, right).
```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

character_names = ["dealer", "merchant", "cleric"]
character_images = [
    [
        np.array(Image.open(f"images/{name}/back.png")),
        np.array(Image.open(f"images/{name}/left.png")),
        np.array(Image.open(f"images/{name}/front.png")),
        np.array(Image.open(f"images/{name}/right.png"))
    ]
    for name in character_names
]

# stack into a [batch, domains, size, size, channels] tensor and scale pixels to [-1, 1]
character_images = tf.constant(character_images)
character_images = tf.cast(character_images, tf.float32) / 127.5 - 1
```
Create images by calling `model([input_images, missing_indices])`, where `input_images` is a tensor with shape `[batch, domains, size, size, channels]` with one of the images erased (replaced by zeros) and `missing_indices` is a tensor with the indices of the missing images (shape `[batch]`).
```python
from image_utils import plot_input_and_output_images

poses = ["back", "left", "front", "right"]

# erase one of the images (the missing pose) by replacing it with zeros
def erase_missing_image(images, indices):
    np_images = images.numpy()
    for i, index in enumerate(indices):
        np_images[i, index] = 0
    return tf.constant(np_images)

# each character is missing a different pose: dealer=back, merchant=left, cleric=front
missing_indices = tf.constant(range(len(character_names)))
input_images = erase_missing_image(character_images, missing_indices)
target_images = tf.gather(character_images, missing_indices, axis=1, batch_dims=1)

# generate the missing images and show inputs, targets, and outputs side by side
generated_images = model([input_images, missing_indices])
fig = plot_input_and_output_images(input_images, target_images, generated_images)
fig.patch.set_alpha(0.0)
plt.show()
```
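The same model can also be exercised in the reduced-input scenarios discussed earlier. The sketch below mimics CollaGAN-2 by zeroing one extra source image besides the target, reusing the tensors defined above; whether this matches the exact evaluation protocol of the paper is an assumption:

```python
# simulate the CollaGAN-2 scenario: besides the target, erase one extra source image
def erase_extra_sources(images, extra_indices):
    np_images = images.numpy()
    for i, extra in enumerate(extra_indices):
        np_images[i, extra] = 0
    return tf.constant(np_images)

# erase the pose right after the missing one (wrapping around the 4 domains)
extra_indices = (missing_indices.numpy() + 1) % len(poses)
two_source_inputs = erase_extra_sources(input_images, extra_indices)

generated_two_sources = model([two_source_inputs, missing_indices])
fig = plot_input_and_output_images(two_source_inputs, target_images, generated_two_sources)
plt.show()
```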