
How to Train a Custom LoRA for Consistent Face Generation

by Priya Nair

Why off-the-shelf models drift on identity

Base diffusion models were trained to generate plausible faces in general — not to reproduce any specific person's face with consistency. When you prompt a base SDXL model for the same character across multiple generations, the face will shift subtly each time: the jaw softens, the eyes widen slightly, the nose changes proportion. These are not bugs in the model; they are the natural consequence of sampling from a broad learned distribution rather than a narrow, identity-anchored one.

A LoRA — Low-Rank Adaptation — solves this by fine-tuning a small set of learned weights on top of the base model using a curated dataset of your target face. The base model's broad knowledge is preserved, but a new layer of identity-specific information is added that anchors the generation to the face you have trained on. The result is a model that generates your target face reliably across different styles.

Building a training dataset that actually works

The quality of your LoRA is determined almost entirely by the quality of your training dataset. Twenty to thirty images is typically sufficient for a face LoRA, but the selection and curation of those images matters far more than the quantity. You need variety across lighting conditions — direct sunlight, overcast, indoor artificial — across facial expressions, and across camera angles. A dataset of thirty images all taken from the same angle in the same lighting will generate excellent results in that specific configuration and unreliable results everywhere else.

Crop every image tightly to the face, and ensure consistent resolution across the dataset — 512×512 or 768×768 works well for most training pipelines. Images with motion blur, heavy compression artefacts, or strong stylistic filters should be excluded; the model will learn from whatever noise is present in the data, and blurry or artefacted training images produce blurry, artefacted outputs.
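The cropping and resizing step is easy to script. Here is a minimal sketch using Pillow; `center_crop_box` and `prepare_image` are illustrative helper names, and the 512×512 target matches the resolutions mentioned above:

```python
TARGET = 512  # training resolution; 768 also works for most pipelines

def center_crop_box(width, height):
    """Return the largest centred square crop box as (left, top, right, bottom)."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

def prepare_image(path, out_path):
    """Centre-crop an image to a square and resize to the training resolution."""
    from PIL import Image  # Pillow; assumed available in your environment
    img = Image.open(path).convert("RGB")
    img = img.crop(center_crop_box(*img.size))
    img = img.resize((TARGET, TARGET), Image.LANCZOS)
    img.save(out_path, quality=95)
```

In a real pipeline you would crop around a detected face box rather than the image centre, but the centred version keeps the sketch self-contained.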

Captioning your dataset for precision control

Each training image needs a text caption that describes everything in the image except the face itself. This is counterintuitive at first — why caption everything except the thing you are training on? The answer is that the caption teaches the model which aspects of the image are context and which are the identity signal. If your captions include descriptions of the face, the model cannot cleanly separate face identity from context such as lighting and setting during training.

A well-formed caption for a face LoRA image might read: 'outdoor portrait, shallow depth of field, warm colour grade'. The face — the thing the LoRA is learning — is conspicuously absent from this description. Most automated captioning tools using BLIP or LLaVA will include face descriptions by default; you will need to post-process the captions to remove them before training begins.
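That post-processing step can be automated with a simple blocklist filter. A sketch, assuming comma-separated captions in the style shown above; the `FACE_TERMS` set is illustrative and should be extended to match your captioner's vocabulary:

```python
import re

# Words that describe the face itself; this blocklist is illustrative,
# not exhaustive — extend it to fit what BLIP or LLaVA emits for your data.
FACE_TERMS = {"face", "eyes", "nose", "mouth", "lips", "jaw", "chin",
              "beard", "smile", "smiling", "eyebrows", "cheeks", "hair"}

def clean_caption(caption):
    """Drop comma-separated caption fragments that mention the face."""
    parts = [p.strip() for p in caption.split(",")]
    kept = [p for p in parts
            if not any(w in FACE_TERMS for w in re.findall(r"[a-z]+", p.lower()))]
    return ", ".join(kept)
```

For example, `clean_caption("outdoor portrait, brown eyes, warm colour grade")` drops the fragment describing the eyes and keeps the context.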

Training parameters and what they control

For SDXL LoRA training, the key parameters are learning rate, training steps, and network rank. Learning rate controls how aggressively the model updates toward the training data — too high and the LoRA overfits, producing a face that looks correct but refuses to adapt to different styles or lighting; too low and the identity fails to transfer reliably. A learning rate of 1e-4 with a cosine scheduler is a reliable starting point for face LoRA training runs.

Network rank determines the capacity of the LoRA — how many parameters it has available to represent the identity. Rank 16 to 32 is appropriate for most face LoRAs; higher ranks capture more detail but are more prone to overfitting on small datasets. Training steps typically range from 1,000 to 2,000 for a 25-image dataset at rank 16. Beyond 2,000 steps on a small dataset, the LoRA typically begins to overfit and generation quality outside the training distribution degrades.
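The ranges above can be collected into a starting configuration. The dictionary below is a sketch, not a trainer's actual flag set — exact parameter names vary between training frameworks — and the sanity check simply encodes the overfitting thresholds discussed in this section:

```python
# Illustrative hyperparameters for an SDXL face LoRA run on ~25 images;
# values mirror the recommendations above, names are generic placeholders.
config = {
    "learning_rate": 1e-4,     # too high overfits; too low fails to transfer
    "lr_scheduler": "cosine",  # cosine decay is a reliable default
    "network_rank": 16,        # 16-32 for faces; higher risks overfitting
    "max_train_steps": 1500,   # 1,000-2,000 for a 25-image dataset at rank 16
    "resolution": 512,
}

def sanity_check(cfg):
    """Flag settings outside the ranges recommended for small face datasets."""
    warnings = []
    if cfg["learning_rate"] > 5e-4:
        warnings.append("learning rate likely too high: expect overfitting")
    if cfg["network_rank"] > 32:
        warnings.append("rank above 32 is prone to overfitting on small sets")
    if cfg["max_train_steps"] > 2000:
        warnings.append("past 2,000 steps a small dataset usually overfits")
    return warnings
```

Running `sanity_check(config)` on the defaults returns no warnings; bumping `max_train_steps` to 3,000 would flag the overfitting risk.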

Testing and refining your trained LoRA

Once training completes, systematic evaluation matters as much as the training itself. Generate a standardised test set: the same five to ten prompts covering different lighting setups, style descriptors, and scene contexts. Score each output on identity preservation, naturalness, and adaptability to the style context. A well-trained LoRA should maintain clear identity across all test prompts without making every generated image look like a flat reproduction of the training data.
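A standardised test set is just a fixed identity trigger crossed with varied contexts. A minimal sketch — the trigger token "ohwx person" and the context strings are placeholders, not values from any particular pipeline:

```python
TRIGGER = "ohwx person"  # placeholder: replace with your LoRA's trigger token

# Contexts chosen to vary lighting, style, and scene, per the evaluation
# criteria above; extend this list to match your own use cases.
CONTEXTS = [
    "studio portrait, soft key light",
    "outdoor candid, golden hour backlight",
    "black and white film photo, harsh shadows",
    "watercolour illustration",
    "low-light indoor scene, neon signage",
]

def build_test_prompts(trigger=TRIGGER, contexts=CONTEXTS):
    """Pair the fixed identity trigger with each evaluation context."""
    return [f"{trigger}, {ctx}" for ctx in contexts]
```

Keeping the prompt list fixed between training runs is what makes comparisons between LoRA checkpoints meaningful.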

The most common failure mode is over-presence — the LoRA weight is set too high during inference, causing the face to overpower lighting and style cues and look artificially pasted into the scene. Most pipelines expose a LoRA strength parameter at inference time; values between 0.6 and 0.85 produce the best balance between identity fidelity and natural scene integration for face LoRAs. Treat this parameter as part of your workflow calibration, not a fixed value.
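Calibrating that parameter is easiest as a sweep across the recommended band. A sketch assuming a diffusers-style pipeline, where recent versions expose LoRA strength through the `scale` key of `cross_attention_kwargs` (verify against the version you are running); `run_sweep` is a hypothetical helper:

```python
def strength_sweep(lo=0.6, hi=0.85, steps=6):
    """Evenly spaced LoRA strengths across the recommended 0.6-0.85 band."""
    return [round(lo + i * (hi - lo) / (steps - 1), 3) for i in range(steps)]

def run_sweep(pipe, prompt, strengths):
    """Generate one image per strength so the best trade-off can be picked.

    Assumes `pipe` is a loaded diffusers pipeline with LoRA weights attached
    (e.g. via pipe.load_lora_weights); not runnable without a GPU pipeline.
    """
    images = {}
    for s in strengths:
        out = pipe(prompt, cross_attention_kwargs={"scale": s})
        images[s] = out.images[0]
    return images
```

`strength_sweep()` yields `[0.6, 0.65, 0.7, 0.75, 0.8, 0.85]`; inspecting the six outputs side by side usually makes the over-presence threshold obvious.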
