How to fine-tune Flux.1 LoRA in Python 3.12


This is a short walkthrough of fine-tuning a LoRA for the Flux.1 model. We are using kohya-ss/sd-scripts as a starting point.

1. Environment Setup

Prerequisites:

  • GPU with ≥24GB VRAM

I’m using:

  • Nvidia RTX Pro 6000

  • Python 3.12

First clone the repo.

git clone git@github.com:kohya-ss/sd-scripts.git
cd sd-scripts

We should switch to the sd3 branch, since the Flux LoRA training script only lives there.

git checkout sd3

Let’s set up a venv for the scripts for better environment isolation.

python3.12 -m venv venv
source venv/bin/activate

Install the CUDA version of PyTorch, following the official PyTorch installation instructions.

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
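
Optionally, here’s a quick sanity check (my own helper, not part of sd-scripts) that the CUDA build is active and the GPU has enough VRAM. Run it inside the venv:

# check_torch.py - verify the CUDA build of PyTorch and available VRAM
import torch

assert torch.cuda.is_available(), "CUDA build of PyTorch is not active"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}")
print(f"VRAM: {props.total_memory / 1024**3:.1f} GiB")  # we want >= 24 GiB
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")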

Now, let’s install the rest of the dependencies.

pip install -r requirements.txt

We should be ready to go with the environment.

⚠️ Caveat 1: At the time of writing, opencv-python in requirements.txt was pinned to 4.8.1.78, which is incompatible with numpy 2.x. I had to change the following line in requirements.txt and reinstall the dependencies.

- numpy<=2.0
+ numpy<2.0

Then, run the following to reinstall numpy.

pip uninstall -y numpy
pip install -r requirements.txt
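
To confirm the pin took effect and opencv-python imports cleanly against it, a quick verification snippet (again, just my own check):

# check_numpy_cv2.py - confirm numpy 1.x and opencv-python import together
import numpy as np
import cv2

assert np.__version__.startswith("1."), f"expected numpy 1.x, got {np.__version__}"
print("numpy", np.__version__, "/ opencv", cv2.__version__)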

⚠️ Caveat 2: The bitsandbytes version in requirements.txt was causing an ImportError: No bitsandbytes error, so I had to update it to bitsandbytes==0.46.1.

- bitsandbytes==0.44.0
+ bitsandbytes==0.46.1

Then, run:

pip uninstall -y bitsandbytes
pip install -r requirements.txt
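
A minimal import check confirms the upgrade fixed the error (recent bitsandbytes versions also ship a self-diagnostic you can run with python -m bitsandbytes):

# check_bnb.py - confirm bitsandbytes imports without the ImportError
import bitsandbytes as bnb

print("bitsandbytes", bnb.__version__)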

⚠️ Caveat 3: At the time of writing, the captioning-related dependencies are commented out in requirements.txt. You will need them for Section 3.2, so uncomment them (I also had to bump timm).

# for BLIP captioning
- # requests==2.28.2
- # timm==0.6.12
- # fairscale==0.4.13
+ requests==2.28.2
+ timm==1.0.17         # 0.6.12 was causing type issues, so had to upgrade.
+ fairscale==0.4.13

Then:

pip install -r requirements.txt
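
And a quick import check for the captioning dependencies (a verification sketch of my own):

# check_captioning_deps.py - confirm the BLIP captioning deps are importable
import requests
import timm
import fairscale

print("requests", requests.__version__)
print("timm", timm.__version__)
print("fairscale", fairscale.__version__)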

2. Downloading the Models

Download the Flux.1 model (flux1-dev.safetensors) and the autoencoder (ae.safetensors) into the project root directory.

Download the text encoders (clip_l.safetensors and t5xxl_fp16.safetensors) into an sd3 directory (you will have to create this directory inside the project root).

Note: If you already have these files downloaded somewhere, it may be easier to simply change the arguments in Section 4.


3. Prepare Dataset

3.1 Add Images

Now, let’s create a data/lora folder and place some training images in it.
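
It can help to eyeball what’s in the folder before training. Here’s a small optional Pillow snippet (my own helper) that lists each image and its dimensions:

# list_images.py - optional: list training images and their dimensions
from pathlib import Path
from PIL import Image

for p in sorted(Path("data/lora").iterdir()):
    if p.suffix.lower() in {".png", ".jpg", ".jpeg", ".webp"}:
        with Image.open(p) as im:
            print(f"{p.name}: {im.size[0]}x{im.size[1]}")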

3.2 Generate Captions

sd-scripts includes useful captioning utilities. You can use those to generate captions for the images.

To generate the captions:

python finetune/make_captions.py data/lora --caption_extension .txt

This will generate a .txt caption file next to each image. You can edit these captions as you see fit.

3.3 Metadata

Finally, we need to create a metadata.json file. If you generated captions using 3.2 above, you should have a .txt caption file next to each image. We will now run a script to combine those into a metadata.json file.

python finetune/merge_captions_to_metadata.py data/lora data/lora/metadata.json --caption_extension .txt

This will generate a metadata.json file containing the mapping of images to captions.
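
To spot-check what was written (the exact entry structure may vary by script version, so this just prints whatever mapping is there):

# inspect_metadata.py - spot-check the image-to-caption mapping
import json

with open("data/lora/metadata.json") as f:
    meta = json.load(f)

print(f"{len(meta)} entries")
for key, entry in list(meta.items())[:3]:
    print(key, "->", entry)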

3.4 Toml File

Create and save a dataset_1024_bs2.toml file in the project root directory. The schema is documented in the sd-scripts repo’s dataset config docs.

[general]
shuffle_caption = true
caption_extension = '.txt'
keep_tokens = 1
flip_aug = false
color_aug = false
keep_tokens_separator = "|||"
caption_tag_dropout_rate = 0

[[datasets]]
resolution = [1024, 1024]
batch_size = 2
keep_tokens_separator = "|||"

  [[datasets.subsets]]
  image_dir = './data/lora'
  metadata_file = './data/lora/metadata.json'
  num_repeats = 1
  caption_prefix = "[SPRITE 128x128 GAME ASSET] |||"
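
Since we’re on Python 3.12, the standard library’s tomllib can catch typos in this config before training starts. A quick parse check:

# validate_toml.py - parse the dataset config with the stdlib TOML parser
import tomllib

with open("dataset_1024_bs2.toml", "rb") as f:
    cfg = tomllib.load(f)

subset = cfg["datasets"][0]["subsets"][0]
print("resolution:", cfg["datasets"][0]["resolution"])
print("image_dir:", subset["image_dir"])
print("metadata_file:", subset["metadata_file"])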

4. Train LoRA

Now, as instructed in the README, let’s run Hugging Face’s accelerate command.

accelerate launch  --mixed_precision bf16 --num_cpu_threads_per_process 1 \
flux_train_network.py \
--pretrained_model_name_or_path flux1-dev.safetensors --clip_l sd3/clip_l.safetensors \
--t5xxl sd3/t5xxl_fp16.safetensors --ae ae.safetensors --cache_latents_to_disk \
--save_model_as safetensors --sdpa --persistent_data_loader_workers \
--max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 \
--save_precision bf16 --network_module networks.lora_flux --network_dim 64 \
--network_train_unet_only --optimizer_type adamw8bit --learning_rate 2e-4 \
--cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base \
--highvram --max_train_epochs 50 --save_every_n_epochs 1 \
--dataset_config dataset_1024_bs2.toml --output_dir path/to/output/dir \
--output_name flux-lora-name --timestep_sampling shift --discrete_flow_shift 3.1582 \
--model_prediction_type raw --guidance_scale 1.0

After a few minutes, you should see a .safetensors file generated.
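
Since --save_every_n_epochs 1 is set, you’ll get one checkpoint per epoch. A small helper to list them (adjust the path to match your --output_dir):

# list_checkpoints.py - list the LoRA checkpoints written during training
from pathlib import Path

out_dir = Path("path/to/output/dir")  # same value as --output_dir
for ckpt in sorted(out_dir.glob("*.safetensors")):
    print(f"{ckpt.name}  ({ckpt.stat().st_size / 1024**2:.0f} MB)")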


5. Inference with LoRA

To generate images with the LoRA applied, you can run the following command.

python flux_minimal_inference.py \
--ckpt flux1-dev.safetensors \
--clip_l sd3/clip_l.safetensors \
--t5xxl sd3/t5xxl_fp16.safetensors \
--ae ae.safetensors \
--dtype bf16 \
--prompt "[SPRITE 128x128 GAME ASSET] chubby boy scout standing full body" \
--out path/to/output/dir \
--seed 1 \
--flux_dtype fp8 \
--offload \
--lora 'path/to/output/dir/flux-lora-name.safetensors;2.0'

If you want to check the result without the LoRA applied, run the same command without the final --lora argument.

6. Results

The quality actually degrades in my example, but it does produce near-perfect 128×128 grid alignment.

Without LoRA: The produced image is pixelated but it is not aligned with 128×128 grid. ❌

With LoRA: The produced image is near-perfectly aligned with 128×128 grid. ✅
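
If you’d rather quantify the grid alignment than eyeball it, here’s a rough heuristic of my own (not part of sd-scripts, and the filenames are hypothetical): snap the image to the 128×128 logical grid by box-downsampling and nearest-upsampling, then measure how far the original deviates from the snapped version. Lower means better alignment.

# grid_alignment.py - rough heuristic: deviation from a 128x128 logical grid
import numpy as np
from PIL import Image

def grid_deviation(path: str, logical: int = 128) -> float:
    im = Image.open(path).convert("RGB")
    # snap to the logical grid: box-downsample, then nearest-upsample back
    snapped = im.resize((logical, logical), Image.Resampling.BOX)
    snapped = snapped.resize(im.size, Image.Resampling.NEAREST)
    a = np.asarray(im, dtype=np.float32)
    b = np.asarray(snapped, dtype=np.float32)
    return float(np.abs(a - b).mean())  # 0.0 = perfectly grid-aligned

print(grid_deviation("with_lora.png"))     # hypothetical output filenames
print(grid_deviation("without_lora.png"))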


7. Conclusion

Even fine-tuned on just three samples, the LoRA immediately shows that grid-alignment improvement. Go ahead: add more data, bump those hyperparams, and watch it really pop.

— Sprited Dev
