How LLMs are trained: from data to dialogue

You can think of training as building three layers of ability: first continuing text, then following instructions, then giving the answers people actually prefer.

Pretraining: Learning language structure, factual associations, and common patterns from massive amounts of text.

Instruction tuning: Teaching the model that users may ask it to answer, summarize, rewrite, classify, or reason.

Preference alignment: Using human preference signals to tune helpfulness, style, and safety boundaries.

Pretraining: learning the language world

A common pretraining objective is next-token prediction: given the tokens so far, predict the one that comes next. The task looks simple, but with enough data and model capacity, predicting well pushes the model to pick up grammar, factual knowledge, reasoning patterns, and domain conventions.
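
To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. It assumes the model has already produced one list of logits per position; the names softmax and next_token_loss are illustrative and do not come from any particular library. Real training code computes the same quantity on batched tensors.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the model's scores for the token that
    follows position t (hypothetical model output, supplied by the caller).
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]  # the label is simply the next token
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# A model that knows nothing (uniform scores over 4 tokens) pays log(4) per token.
uniform_loss = next_token_loss([[0.0] * 4] * 3, [0, 1, 2])
# A model that puts most of its probability on the right next token pays much less.
confident_loss = next_token_loss(
    [[0.0, 10.0, 0.0, 0.0], [0.0, 0.0, 10.0, 0.0], [0.0] * 4], [0, 1, 2]
)
```

The labels are the input text shifted by one position, which is why pretraining needs no human annotation.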

Instruction tuning: from continuation to assistant

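A model that has only been pretrained behaves like a text completer: given a question, it may continue with more questions instead of answering. Instruction tuning shows it worked examples of Q&A, summarization, classification, code, and reasoning so that it learns to carry out the requested task rather than merely continue the text.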
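
As a rough illustration of what one instruction-tuning example might look like, the sketch below builds a single training sequence from an instruction and its response. Everything specific here is an assumption: the "Instruction:/Response:" template, the whitespace tokenizer, and the function name build_sft_example are all made up for the example. Computing the loss only on the response tokens is a common choice, not the only one.

```python
def build_sft_example(instruction, response):
    """Return (tokens, loss_mask) for one instruction-tuning example.

    Assumptions for illustration only: a made-up prompt template and a
    whitespace "tokenizer" standing in for a real subword tokenizer.
    """
    prompt = f"Instruction: {instruction}\nResponse:"
    prompt_tokens = prompt.split()
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this position in the loss, 1 = train on it.
    # Masking the prompt means the model is trained to produce the answer,
    # not to reproduce the user's request.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example("Summarize the text.", "A short summary.")
```

The training objective is still next-token prediction; what changes is the data, which now pairs a request with the response the model should produce.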

Preference alignment: useful answers, not just plausible ones

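The same question can have many plausible answers. Preference alignment uses comparisons between candidate responses to teach the model which ones are clearer, safer, and more useful. RLHF (reinforcement learning from human feedback) and DPO (direct preference optimization) are two representative methods.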
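
For a feel of how a preference signal becomes a training loss, here is a minimal sketch of the DPO loss for a single preference pair. It assumes the caller has already computed the summed log-probability of each response under the model being trained (the policy) and under a frozen reference model; the function name, argument names, and the beta value of 0.1 are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the total log-probability of a full response.
    beta controls how strongly the policy is pushed away from the reference.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Policy identical to the reference: no preference learned yet, loss is log(2).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
# Policy has shifted toward the rejected response: higher loss.
worse = dpo_loss(-6.0, -4.0, -5.0, -5.0)
```

Unlike RLHF, which first fits a separate reward model and then optimizes against it with reinforcement learning, this loss is applied to the preference pairs directly.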

One-sentence takeaway

LLM training is not a single magic step. It is a sequence from language modeling, to task following, to alignment with human preferences.