A Non-Technical Guide to Fine-Tuning

Improving model performance with fine-tuning

Jul 1, 2024

Useful For

Fine-tuning can be used to improve the following model capabilities:

  • Personality training — shape the model's behavior, voice, and tone

  • Guardrails — prevent illegal and toxic outputs, protect users' PII, and prevent doxxing

  • Alignment — reduce bias across gender and race; follow defined rules and principles

Cost and Latency

Running a fine-tuned open-source model can deliver up to 4x more throughput at 6x lower cost.
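As a rough illustration of what those multipliers mean, the arithmetic below applies them to a baseline API price and throughput. The baseline numbers are hypothetical placeholders, not real quotes:

```python
# Hypothetical illustration of the 4x throughput / 6x cost multipliers.
# The baseline price and throughput below are made-up placeholders.
api_cost_per_mtok = 30.0   # $ per million tokens (hypothetical baseline)
api_tokens_per_sec = 50    # tokens/second (hypothetical baseline)

ft_cost_per_mtok = api_cost_per_mtok / 6    # 6x lower cost
ft_tokens_per_sec = api_tokens_per_sec * 4  # 4x more throughput

print(f"fine-tuned: ${ft_cost_per_mtok:.2f}/Mtok at {ft_tokens_per_sec} tok/s")
```

At these placeholder numbers, the fine-tuned model serves tokens at $5.00 per million instead of $30.00, while generating four times as fast.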

How does it work?

  1. Fine-tune on a dataset that represents a wide range of domain queries with ideal expected outputs.

  2. Evaluate the model. The evaluation dataset is carefully chosen to have no overlap with the training dataset from step #1.

  3. Iterate. Based on the eval results, improve the fine-tuning dataset. Iteratively run fine-tuning and eval to improve the model.

Identify key improvements with experimentation and hypothesis testing.
Use small datasets and quick, cheap fine-tuning + eval runs to identify the dataset that best improves the model's performance.
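The train → evaluate → iterate loop above can be sketched in a few lines of Python. Here `fine_tune` and `evaluate` are hypothetical stand-ins for a real training and eval harness, mocked so the loop runs end to end:

```python
# Minimal sketch of the train -> evaluate -> iterate loop.
# fine_tune() and evaluate() are hypothetical placeholders, mocked here.

def fine_tune(dataset):
    """Pretend to fine-tune: the 'model' just remembers its training rows."""
    return {"training_rows": list(dataset)}

def evaluate(model, eval_set):
    """Score = fraction of eval topics the model's training data covers.
    The eval rows deliberately do not overlap the training rows."""
    covered = sum(
        1 for q in eval_set
        if any(q["topic"] == r["topic"] for r in model["training_rows"])
    )
    return covered / len(eval_set)

# Start small and high-quality; add rows only when evals fall short.
train_set = [{"topic": "refunds"}, {"topic": "shipping"}]
eval_set = [{"topic": "refunds"}, {"topic": "billing"}]  # no row overlap
candidate_rows = [{"topic": "billing"}, {"topic": "returns"}]

target, score = 0.9, 0.0
while score < target and candidate_rows:
    model = fine_tune(train_set)          # step 1: fine-tune
    score = evaluate(model, eval_set)     # step 2: evaluate
    if score < target:
        train_set.append(candidate_rows.pop(0))  # step 3: improve the dataset

print(f"final eval score: {score:.2f}")
```

In this toy run the first eval scores 0.5, one candidate row is added, and the second eval passes the target, mirroring the iterative improvement described above.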


Frequently Asked Questions

  1. How much data is needed?
    An iterative training-and-evaluation process identifies the best dataset. Representing a diverse set of use cases with high-quality responses is key: get quality right on a small dataset before adding more rows. Dataset size generally ranges from 10 to 25,000 rows, depending on the task.

  2. Can the model only answer queries present in the fine-tuning dataset?
    No. The model shows improved performance on queries related to, but not present in, the dataset. This is measured with an evaluation dataset containing only queries that do not appear in the training dataset.

  3. Does the model simply repeat responses from the fine-tuning dataset?
    No. The model learns behavior from the dataset and performs better across a wide range of queries along several dimensions, including reasoning, style, tone, bias reduction, and other alignment goals.

  4. Is the model response 100% correct after fine-tuning?
    No. LLMs are probabilistic, so no model is 100% accurate. However, a fine-tuned smaller model can match or exceed the accuracy of the latest models on the target task, at a fraction of the price.

  5. What is the cost of fine-tuning a model? How does it compare to running GPT-4 or GPT-3.5?
    Fine-tuning typically costs less than $500 for a full run. At scale, inference costs a fraction of what GPT models do. Tools like vLLM, combined with loading multiple fine-tuned models on the same GPU, can reduce costs and improve hardware utilization by over 10x at scale.
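One common way to share a GPU across fine-tuned models is vLLM's multi-LoRA serving: a single base model stays resident on the GPU while lightweight fine-tuned adapters are swapped in per request. A minimal sketch, with illustrative (not real) model and adapter paths:

```shell
# Serve one base model plus several fine-tuned LoRA adapters on one GPU.
# The model name and adapter paths below are illustrative placeholders.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/adapters/support sales-bot=/adapters/sales
```

Each adapter is then addressable by name (e.g. `support-bot`) through the OpenAI-compatible API, so many fine-tuned variants share the cost of a single deployed base model.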