What's pretraining?
the training that comes before the training
KIDDING! it’s the biggest phase of LLM training. like when GPT-4 famously cost $100M, most of that was on pretraining. these days a lot more post-training is mixed in, but pretraining is still very large
in my mind (prob oversimplified), making an LLM is:
- Get training data
- Clean it
- Make the model try to predict the next token based on preceding tokens. Reward when right. Repeat
- Tune the model to human preferences
What part of that would pretraining be?
that’s actually pretty good. the “predict next token” step is pretraining. preferences is post-training. data is kind of an ever-present problem throughout the process
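if it helps, here’s a minimal sketch of that “predict next token, reward when right” loop in PyTorch. the tiny model and random tokens are just placeholders for illustration, not how the big labs actually do it:

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 32

# toy "language model": embedding -> LSTM -> logits over the vocab
# (real pretraining uses a transformer, but the loop looks the same)
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)  # logits for the next token at each position

model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # pretend this batch came from the cleaned training data
    batch = torch.randint(0, vocab_size, (8, seq_len))
    inputs, targets = batch[:, :-1], batch[:, 1:]  # predict token t+1 from tokens up to t
    logits = model(inputs)
    # "reward when right" is really: penalize (cross-entropy) when wrong, then repeat
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

pretraining is basically that loop run over trillions of tokens; post-training then tunes the result toward preferences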
I knew what post-training meant but not pretraining 😅 on data, I guess model makers mostly reuse what they already collected?
on data, I asked because even models with a cutoff date supposedly in 2025 (like Gemini 2.5 in January) will often default to 2024 or even earlier knowledge. so maybe this is because most of the training data is from before 2023?