Can off-the-shelf VAEs reconstruct 1024×1024 sprite sheets with 100% accuracy?


Short answer: no, because VAEs are lossy compressors by design. It is trivial to build a truly lossless compressor, but the resulting compression ratio won't be very good.
Today I want to experiment with Flux.1's VAE, ae.safetensors, to see how well it handles sprite sheets without any scaling.
Sprite Sheet
We are going to use sprite sheets from sprite-resources.com today.
ComfyUI Workflow Setup
Now, let’s design our ComfyUI workflow.
1. Load ae.safetensors from flux.1-dev (or flux.1-schnell; they ship the same VAE).
2. Crop the source image to 1024×1024 (the default resolution for Flux.1).
3. Encode the image to produce a latent image.
4. Decode the latent image to create the reconstructed image.
5. Crop both the source image and the reconstructed image into 128×128 tiles.
6. Stitch them together to see them side by side.
7. Scale them up using nearest-neighbor sampling so the details are easy to see.
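Outside ComfyUI, the crop-stitch-zoom part of this workflow can be sketched with Pillow. This is only a sketch: `side_by_side` and its defaults are names I made up for illustration, not nodes from the actual workflow.

```python
from PIL import Image

def side_by_side(src, recon, tile=128, box=(0, 0), scale=8):
    """Crop matching tiles from the source and the reconstruction,
    stitch them side by side, and upscale with nearest-neighbor
    sampling so individual pixels stay sharp."""
    x, y = box
    a = src.crop((x, y, x + tile, y + tile))
    b = recon.crop((x, y, x + tile, y + tile))
    out = Image.new("RGB", (tile * 2, tile))
    out.paste(a, (0, 0))
    out.paste(b, (tile, 0))
    # NEAREST keeps hard pixel edges instead of smoothing them away
    return out.resize((tile * 2 * scale, tile * scale), Image.NEAREST)
```

Nearest-neighbor is the important choice here: bilinear or bicubic resampling would blur exactly the pixel edges we are trying to inspect.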
Results
We vary the scale factor to see how the reconstruction behaves at different zoom levels.
At 1x, the reconstructed image is slightly blurry and has trouble capturing high frequency details like the eyes and other facial features.
At 2x, we can still see some blurry artifacts.
At 4x, the two are visually equivalent.
At 8x, there is no visible difference from 4x.
It seems like 4x is the sweet spot if you are not planning to use any post-processor. However, 4x means we are spending 4×4=16 pixels to describe just one pixel's worth of information, which is quite wasteful.
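The trade-off is simple arithmetic, assuming a fixed 1024×1024 working resolution:

```python
# Effective pixel-art canvas at each zoom factor, assuming we keep
# Flux.1's default 1024x1024 working resolution fixed.
BASE = 1024
for scale in (1, 2, 4, 8):
    canvas = BASE // scale
    print(f"{scale}x zoom -> {canvas}x{canvas} effective canvas, "
          f"{scale * scale} rendered pixels per art pixel")
```

At 4x that gives a 256×256 effective canvas at a 16:1 pixel cost, which is the limitation discussed below.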
Conclusion
The VAE used in Flux.1 models is lossy and won't be able to reproduce pixel art at the 1x zoom level. To remedy this, we can work at 4x zoom. However, that limits the canvas to 256×256 if we are using the default 1024×1024 resolution of Flux.1 models.
Potential Fixes
Here are some of the potential fixes that Pixel came up with.
Option 1: Use 4x Scale
As described above, we can use 4x scale at the cost of limited canvas space. This is the easiest approach.
Option 2: Fine-tune VAE
Alternatively, we can fine-tune the VAE itself. However, fine-tuning a VAE is not as straightforward, because we do not want to break the existing latent mappings. We may investigate this further in a later post.
Option 3: Palette Snap Post‑Decode
Convert the reconstructed RGB image back to the original sprite's palette via nearest-color matching (or median-cut plus remap). This often recovers the crisp look even when subpixel blur exists in RGB. However, it requires the original source image or its palette to begin with, so it won't work in a pure text-to-image use case.
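A minimal nearest-color version of this snap can be written with NumPy. This is a sketch (median-cut is omitted, and `snap_to_palette` is a name I made up), assuming the palette is already known:

```python
import numpy as np

def snap_to_palette(img, palette):
    """Map every pixel of img (H, W, 3, uint8) to its nearest color
    in palette (N, 3) by squared Euclidean distance in RGB."""
    px = img.astype(np.int32)
    pal = np.asarray(palette, dtype=np.int32)
    # (H, W, N): squared distance from each pixel to each palette entry
    d = ((px[:, :, None, :] - pal[None, None, :, :]) ** 2).sum(-1)
    # Replace each pixel with its closest palette color
    return pal[d.argmin(-1)].astype(np.uint8)
```

Distance in raw RGB is the simplest choice; a perceptual space such as CIELAB would match colors more faithfully at the cost of a conversion step.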
Option 4: Patchwise Encode at Sprite Resolution (128×128)
Instead of encoding the whole sheet, encode each 128×128 tile separately. Many VAEs smear details across large receptive fields; isolating tiles can reduce cross-tile bleed and may sharpen structure. It is cheap to test.
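The tiling bookkeeping for this experiment can be sketched with NumPy. The per-tile encode/decode itself is left out, and `to_tiles`/`from_tiles` are hypothetical helper names, not part of any existing node:

```python
import numpy as np

def to_tiles(sheet, tile=128):
    """Split an (H, W, C) sheet into (n, tile, tile, C) patches,
    row-major, so each patch can be encoded in isolation."""
    h, w, c = sheet.shape
    assert h % tile == 0 and w % tile == 0, "sheet must divide evenly"
    t = sheet.reshape(h // tile, tile, w // tile, tile, c)
    return t.transpose(0, 2, 1, 3, 4).reshape(-1, tile, tile, c)

def from_tiles(tiles, h, w):
    """Inverse of to_tiles: reassemble patches into an (H, W, C) sheet."""
    n, tile, _, c = tiles.shape
    t = tiles.reshape(h // tile, w // tile, tile, tile, c)
    return t.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```

A 1024×1024 sheet yields 8×8=64 tiles; each would be run through the VAE independently, then reassembled with `from_tiles` for comparison against the whole-sheet reconstruction.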
Option 5: Use Pixel Art Post Processor Nodes
There are some custom nodes in ComfyUI that do quantization for pixel-art images. Most of them focus on downsampling scaled pixel art back to 1x size and quantizing it to make it palettized.
Update: I tried some custom nodes, but I wasn't able to get good results. Here are some anecdotal results from the ComfyUI-Pixelate node (original, 1x zoom reconstruction, post-processed with the custom node). The post-processed image remains blurry and even loses more detail.
Next Steps
I don't know yet, but I think we will need to use 4x scale for now, and perhaps limit each sprite to 64×128 so we can fit at least 4×2=8 sprites into one image.
— SpritedDev