Creating and updating pixel art character sprites with many frames spanning different animations and poses takes time and can quickly become repetitive. However, that work can be partially automated, allowing artists to focus on more creative tasks. In this work, we concentrate on creating pixel art character sprites in a target pose from images of the character facing the other three directions. We present a novel approach to character generation by framing the problem as a missing data imputation task. Our proposed generative adversarial network receives the images of a character in all available domains and produces the image of the missing pose. We evaluated our approach in scenarios with one, two, and three missing images, achieving results similar to or better than the state of the art when more images are available. We also evaluated the impact of the proposed changes to the base architecture.
Creating characters for pixel art games is a time-consuming process in game development that often involves a lot of back-and-forth adjustments. Artists meticulously design each pixel, but even small changes can require updating many images, especially for characters that move and face different directions. While artists are skilled at this, some aspects of the process can be repetitive and tedious, such as creating special effects or ensuring consistency across all character poses.
To address these challenges, researchers are exploring the use of artificial intelligence (AI) to help automate parts of the character creation process. This research often involves using AI to generate new images based on existing ones, such as creating a character facing a different direction from an existing image. Our approach focuses on using all available images of a character to predict a missing pose, rather than relying on a single source image. By using more information about the character, the model can create more accurate, higher-quality results, streamlining the character design process and allowing artists to focus on more creative aspects of their work.
Recent research in pixel art character generation has focused on improving the quality and versatility of AI-generated images. Serpa and Rodrigues (2022) significantly enhanced their previous work by reframing the problem as a semantic segmentation task. This approach, combined with architectural modifications like dense connections and deep supervision, resulted in more accurate and detailed character sprites, particularly in challenging poses.
Furthermore, Coutinho and Chaimowicz (2024) demonstrated the importance of large and diverse datasets for training robust AI models. By compiling a dataset of 14,000 paired images of characters in different poses, they achieved significant improvements in image quality, especially when evaluated on more artistically cohesive datasets. Additionally, they introduced a post-processing step to quantize the generated images to the color palette of the input image, further enhancing the visual fidelity of the results.
We propose an architecture based on CollaGAN to impute images of pixel art characters in a missing pose (target domain). To facilitate understanding, let us consider that there are four domains, one for each pose (back, left, front, and right), and that we want to generate a character's image in one of them (the target) from its images in the other three.
Our generator has one encoder branch to process the input from each domain, a single decoder branch with concatenated skip connections, and outputs an image in the missing domain. The discriminator distinguishes images as real or fake, as well as determines their domain through an auxiliary classifier output.
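For illustration, the sketch below lays out this generator/discriminator structure in Keras: one encoder per domain, a single decoder that concatenates skip connections from all encoders, and a discriminator with real/fake and domain-classification heads. The number of layers, filter counts, and image dimensions are placeholders and do not correspond to the released model:

```python
import tensorflow as tf
from tensorflow.keras import layers

DOMAINS, SIZE, CHANNELS = 4, 64, 4  # placeholder dimensions, for illustration only

def build_generator():
    inputs = [layers.Input((SIZE, SIZE, CHANNELS)) for _ in range(DOMAINS)]
    skips, bottlenecks = [], []
    # one encoder branch per input domain
    for x in inputs:
        d1 = layers.Conv2D(32, 4, strides=2, padding="same", activation="relu")(x)
        d2 = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(d1)
        d3 = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(d2)
        skips.append((d1, d2))
        bottlenecks.append(d3)
    # single decoder with skip connections concatenated from all encoder branches
    x = layers.Concatenate()(bottlenecks)
    x = layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x] + [skip[1] for skip in skips])
    x = layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Concatenate()([x] + [skip[0] for skip in skips])
    x = layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu")(x)
    output = layers.Conv2D(CHANNELS, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, output)

def build_discriminator():
    image = layers.Input((SIZE, SIZE, CHANNELS))
    x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(image)
    x = layers.Conv2D(128, 4, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(256, 4, strides=2, padding="same", activation="relu")(x)
    real_fake = layers.Conv2D(1, 3, padding="same")(x)                  # real vs. fake (patch output)
    domain = layers.Dense(DOMAINS)(layers.GlobalAveragePooling2D()(x))  # auxiliary domain classifier
    return tf.keras.Model(image, [real_fake, domain])
```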
Compared to the original CollaGAN architecture, we proposed the following modifications:

1. **Increased capacity**: more trainable parameters in the networks.
2. **Forward replacer**: during the backward (cycle) step of training, only the original target is replaced by the image generated in the forward step.
3. **Conservative input dropout**: a batch selection strategy that sometimes omits input domains during training, favoring keeping more of them.
While change (1) is straightforward, changes (2) and (3) require some explanation. To avoid going too deep into the training procedure here, we invite the reader to read the paper or watch a presentation. Later in this article, we present an ablation study showing how each modification improved the resulting images.
The model was trained on the pixel art characters dataset for 240,000 generator update steps with minibatches of 4 examples.
We evaluated the generated images using the FID (Fréchet Inception Distance) and MAE (Mean Absolute Error) metrics, for which lower values are better.
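For reference, the sketch below shows how MAE can be computed between a generated and a target image. The exact evaluation protocol, including the range in which errors are measured, is detailed in the paper; here we assume errors in [0, 1]:

```python
import tensorflow as tf

def mean_absolute_error(target, generated):
    # model outputs lie in [-1, 1]; rescale to [0, 1] before averaging absolute differences
    target = (target + 1.0) / 2.0
    generated = (generated + 1.0) / 2.0
    return tf.reduce_mean(tf.abs(target - generated))
```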
Next, we present our main results with both quantitative and qualitative analyses. We also invite you to try the model using our interactive generator.
The figure above shows a different character example in each row; the columns depict the source and target images, followed by the images generated by the Pix2Pix, StarGAN, and our MDIGAN (CollaGAN-3) models. As the baseline models take only a single image as input, there are three possible outputs for each character.
The quality of the generated images varies depending on the model and the target pose.
Color Usage: While the generated images generally use colors in appropriate locations, they often exhibit a wider range of tones than the original pixel art style, which typically relies on a limited color palette. This issue can be partially addressed by quantizing the colors after generation, a technique used in previous research (Coutinho and Chaimowicz, 2024); a simple nearest-color sketch appears after these observations.
Shape Accuracy: How faithfully the models reproduce the character's shape depends mostly on the target pose:
Easier directions: All models generally perform well when generating poses that involve simple transformations, such as flipping the character horizontally (e.g., from facing left to right). This is likely due to the relative ease of learning this specific transformation.
Harder directions: When generating more complex poses, such as a character facing backward, some models, including ours, may exhibit artifacts, such as faint remnants of facial features in unexpected areas (see the maid in the first row).
Overall Quality: The quality of the images generated by our proposed MDIGAN model is comparable to or better than that of the baseline models. Notably, the model achieves these results with significantly fewer trainable parameters, indicating a more efficient use of resources.
Our MDIGAN model has 22% fewer trainable parameters than StarGAN and 70% fewer than Pix2Pix.
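As mentioned in the Color Usage note, a simple post-processing option is to snap every generated pixel to the nearest color of the input character's palette. The sketch below illustrates that idea; it is not the exact procedure of Coutinho and Chaimowicz (2024), and the helper name is hypothetical:

```python
import numpy as np

def quantize_to_palette(generated, palette):
    """Map each pixel of `generated` (H, W, C) to the closest color in `palette` (N, C)."""
    pixels = generated.reshape(-1, 1, generated.shape[-1]).astype(np.float32)
    colors = palette.reshape(1, -1, palette.shape[-1]).astype(np.float32)
    nearest = np.argmin(np.linalg.norm(pixels - colors, axis=-1), axis=1)
    return palette[nearest].reshape(generated.shape)

# the palette can be extracted from one of the source images, e.g.:
# palette = np.unique(source_image.reshape(-1, source_image.shape[-1]), axis=0)
```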
Even though we propose the model to impute a single missing domain from three input images, we also evaluate it in scenarios where it receives only two (CollaGAN-2) or a single image (CollaGAN-1). The metrics' values are averaged over all targets and all available source combinations for each scenario (i.e., CollaGAN-3, 2, and 1).
The following table compares the proposed model in those situations. We can observe that both FID and MAE metrics progressively improve as the number of available domains increases, with CollaGAN-2 still having better MAE than Pix2Pix and StarGAN.
| Model/Sources | Average FID | Average MAE |
|---|---|---|
| Pix2Pix | 4.091 | 0.05273 |
| StarGAN | 2.288 | 0.06577 |
| CollaGAN-1 | 8.393 | 0.06449 |
| CollaGAN-2 | 4.277 | 0.05035 |
| CollaGAN-3 🏆 | 1.508 | 0.04078 |
The values of both metrics have been averaged among all possible input/output combinations for each model.
The figure below shows example generations of the model when it receives 3 inputs (CollaGAN-3), 2 inputs (CollaGAN-2) and only 1 (CollaGAN-1).
We evaluated the impact of different batch selection strategies for presenting examples to the model during training: should it always see the three available domains, or should some of them sometimes be omitted?
We investigated four approaches: never dropping inputs (none), the original CollaGAN strategy, curriculum learning, and our proposed conservative strategy.
The original approach has an equal chance of presenting three, two, or a single image at each training step. The curriculum learning approach starts training with the easier task (using three images) and progressively makes it harder (down to a single input) until halfway through training, then randomly chooses how many domains to drop for the second half. Lastly, the conservative approach also randomly selects the number of images to drop, but with higher probabilities of keeping more of them: 60% with 3 images, 30% with 2, and 10% with a single image.
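The sketch below illustrates how each strategy could choose the number of inputs to drop at a training step; the curriculum schedule shown is a simplification of ours, and `progress` denotes the fraction of training completed:

```python
import random

def num_inputs_to_drop(strategy, progress):
    """Return how many of the 3 available input domains to drop at this training step."""
    if strategy == "none":
        return 0                                # always present all 3 available images
    if strategy == "original":
        return random.choice([0, 1, 2])         # equal chance of keeping 3, 2, or 1 image
    if strategy == "curriculum":
        if progress < 0.5:                      # first half: progressively harder
            return min(int(progress * 6), 2)    # drop 0, then 1, then 2 inputs
        return random.choice([0, 1, 2])         # second half: random
    if strategy == "conservative":
        # favor keeping more images: 60% keep all 3, 30% keep 2, 10% keep only 1
        return random.choices([0, 1, 2], weights=[0.6, 0.3, 0.1])[0]
    raise ValueError(f"unknown strategy: {strategy}")
```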
The following table presents the results. Using any input dropout yields better results than always showing all domains (none). Compared to the original and curriculum learning strategies, our proposed conservative tactic has better FID and MAE metrics on the average of the three scenarios.
| Sources | FID: None | FID: Original | FID: Curric. | FID: Conserv. 🏆 | MAE: None | MAE: Original | MAE: Curric. | MAE: Conserv. 🏆 |
|---|---|---|---|---|---|---|---|---|
| CollaGAN-3 | 4.816 | 1.911 | 2.160 | 1.508 | 0.04523 | 0.04277 | 0.04222 | 0.04078 |
| CollaGAN-2 | 19.050 | 6.835 | 9.233 | 4.277 | 0.08003 | 0.05053 | 0.07389 | 0.05035 |
| CollaGAN-1 | 32.676 | 11.162 | 20.303 | 8.393 | 0.12820 | 0.06243 | 0.12232 | 0.06449 |
| Average | 18.847 | 6.636 | 10.566 | 4.726 | 0.08449 | 0.05191 | 0.07948 | 0.05187 |
To understand the impact of our changes to the original CollaGAN architecture, we trained and evaluated models that progressively added each modification. The following table shows the FID and MAE values of the generated images averaged over all domains and among the scenarios of the model receiving three, two, and one input domains. The rows show the results of each modification cumulatively: the first one is the original CollaGAN model without any of our proposed changes, the second introduces the first modification, the third uses two changes, and the last includes all three (our final model).
| Modification | Average FID | FID Improv. | Average MAE | MAE Improv. |
|---|---|---|---|---|
| Original | 8.866 | --- | 0.06069 | --- |
| + Increased capacity | 11.078 | -24.95% | 0.05666 | 6.64% |
| + Forward Replacer | 6.636 | 25.15% | 0.05191 | 14.47% |
| + Conservative Inp. Drop. | 4.726 | 46.70% | 0.05187 | 14.53% |
The original model had 6,565,712 trainable parameters; with the increased capacity, it has 104,887,616. That change alone improved MAE but worsened FID. The replacement strategy of substituting only the original target with the image generated in the forward step improves both metrics. Lastly, training with the proposed conservative input dropout further enhances the results, yielding FID and MAE values that are 46.7% and 14.53% better than the original architecture's.
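To make the replacement tactic more concrete, the sketch below shows one way to assemble the input set of a backward (cycle) step: the forward-generated image occupies only the original target's slot, and the domain being reconstructed is zeroed out. This is a simplified illustration with hypothetical tensor names, not the actual training code:

```python
import tensorflow as tf

def backward_step_inputs(real_images, forward_generated, target_index, reconstruct_index):
    """Assemble the generator inputs for one backward (cycle) reconstruction.

    real_images: [domains, H, W, C] tensor with the character in every pose (hypothetical layout).
    """
    images = list(tf.unstack(real_images, axis=0))
    # the forward-generated image replaces only the original target's slot...
    images[target_index] = forward_generated
    # ...while the domain that this backward step must reconstruct is erased
    images[reconstruct_index] = tf.zeros_like(images[reconstruct_index])
    return tf.stack(images, axis=0)
```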
We posed the task of generating pixel art characters as a missing data imputation problem and approached it using a deep generative model. It is based on the CollaGAN architecture, from which we proposed changes involving a capacity increase, a conservative input dropout strategy, and a different replacement tactic during the backward step of the training procedure. The experiments showed that all of the changes contributed to achieving better results.
Compared to the baseline models, our approach produces images of similar or better quality when using three domains as input. The model can still produce plausible images in scenarios with fewer available images, but with increasingly lower quality.
If you are interested in using our architecture, trained model, or dataset, read on.
We trained the models using the PAC Dataset, which consists of 14k pixel art character sprites facing 4 directions: back, left, front, and right.
It can be used for non-commercial purposes only and was assembled from the Liberated Pixel Cup characters and RPG Maker RTPs (available online).
We used TensorFlow 2.10.1 with Python 3.9 to create and train the model, and the repository is available on GitHub. The weights can be downloaded from Hugging Face at fegemo/mdigan-characters and used in Python or JavaScript (with TensorFlow.js).
We tested the code with Python 3.9 and the specific versions of TensorFlow and NumPy pinned below.
```python
# requires Python 3.9
!pip install numpy==1.24.4 tensorflow==2.10.1 matplotlib notebook patool Pillow
```
The model is hosted on GitHub and requires Git LFS to be downloaded. We also download some images to test the model.
print("Downloading the model from Github...") !git lfs install !git clone --depth=1 --branch=main https://github.com/fegemo/mdigan-characters-model/ weights print("Extracting some images...") import patoolib patoolib.extract_archive("./weights/images.zip", outdir=".")
Load the model from the `weights` folder.
```python
import tensorflow as tf

model = tf.keras.models.load_model("weights", compile=False)
```
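If you want to check the parameter counts discussed earlier, summarize the loaded model:

```python
# prints the layers and the total/trainable parameter counts
model.summary()
```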
We load the images of the characters "dealer", "merchant", and "cleric" from the `images` folder. Each character has four images: one for each direction (back, left, front, right).
```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

character_names = ["dealer", "merchant", "cleric"]
character_images = [
    [
        np.array(Image.open(f"images/{name}/back.png")),
        np.array(Image.open(f"images/{name}/left.png")),
        np.array(Image.open(f"images/{name}/front.png")),
        np.array(Image.open(f"images/{name}/right.png"))
    ]
    for name in character_names
]

# stack into a [batch, domains, size, size, channels] tensor and scale pixels to [-1, 1]
character_images = tf.constant(character_images)
character_images = tf.cast(character_images, tf.float32) / 127.5 - 1
```
Create images by calling `model([input_images, missing_indices])`, where `input_images` is a tensor with shape `[batch, domains, size, size, channels]` with one of the images erased (replaced by zeros) and `missing_indices` is a tensor with the indices of the missing images (shape `[batch]`).
```python
from image_utils import plot_input_and_output_images

poses = ["back", "left", "front", "right"]

# erase one of the images (the missing pose) by replacing it with zeros
def erase_missing_image(images, indices):
    np_images = images.numpy()
    for i, index in enumerate(indices):
        np_images[i, index] = 0
    return tf.constant(np_images)

# each character is missing a different pose: dealer=back, merchant=left, cleric=front
missing_indices = tf.constant(range(len(character_names)))
input_images = erase_missing_image(character_images, missing_indices)
target_images = tf.gather(character_images, missing_indices, axis=1, batch_dims=1)

# generate the missing images and show inputs, targets, and outputs side by side
generated_images = model([input_images, missing_indices])
fig = plot_input_and_output_images(input_images, target_images, generated_images)
fig.patch.set_alpha(0.0)
plt.show()
```
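The same model can also be exercised in the reduced-input scenarios discussed earlier. The sketch below mimics CollaGAN-2 by zeroing one extra source image besides the target, reusing the tensors defined above; whether this matches the exact evaluation protocol of the paper is an assumption:

```python
# simulate the CollaGAN-2 scenario: besides the target, erase one extra source image
def erase_extra_sources(images, extra_indices):
    np_images = images.numpy()
    for i, extra in enumerate(extra_indices):
        np_images[i, extra] = 0
    return tf.constant(np_images)

# erase the pose right after the missing one (wrapping around the 4 domains)
extra_indices = (missing_indices.numpy() + 1) % len(poses)
two_source_inputs = erase_extra_sources(input_images, extra_indices)

generated_two_sources = model([two_source_inputs, missing_indices])
fig = plot_input_and_output_images(two_source_inputs, target_images, generated_two_sources)
plt.show()
```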