The video "Deep Dive into LLMs like ChatGPT" by Andrej Karpathy (3.5 hours) is one of the most insightful tutorials on Large Language Models.
I learned a lot about LLMs by watching and studying it.
I watched it twice. The first time, I paid attention but didn’t try to understand everything.
The second time was a much slower process. I paused the video every time Andrej explained a concept worth remembering. Each time, I wrote a question and an answer.
I tried to reuse Andrej’s explanations as much as possible, but sometimes they were too verbose, so I had to condense them into a few lines. This was an incredible learning exercise, though not a quick one.
By the end of the video, I had written 63 Q&As, which I polished using ChatGPT, but only to fix grammar and spelling.
If you’ve watched Deep Dive into LLMs like ChatGPT (and you should), use these Q&As to check what you’ve learned about LLMs.
Pre-Training
1. What are the three stages to train a Large Language Model (LLM) like ChatGPT?
- Pre-training: Learning general language patterns from large amounts of text
- Post-training: Supervised Fine-Tuning (SFT)
- Reinforcement Learning (RL), including RLHF (Reinforcement Learning from Human Feedback)
2. What is the primary source of data used to pre-train LLMs?
The primary source of data is text scraped from the web.
Common Crawl is one of the major sources of data crawled from the web.
Other sources include books, academic papers, and articles.
3. What is Common Crawl?
Common Crawl is a nonprofit organization that regularly crawls the web and makes petabytes of web data freely available to the public.
4. Is raw web-scraped data suitable for training as it is?
No, the raw data must be filtered in many ways.
Raw data is noisy and full of duplicate content, low-quality text, and irrelevant information. Before training, it requires heavy filtering.
5. What kinds of filters and cleaning must be applied to raw data for LLM training?
Step 1: URL filtering.
This involves filtering out URLs and domains we do not want in our dataset: malware, pornographic content, racist material, and more.
Step 2: Text extraction.
Web pages fetched by crawlers are raw HTML. This step removes HTML tags, scripts, and CSS, keeping only the text.
Step 3: Language filtering.
Select only pages in the languages you want the model to handle. If we are not interested in creating a model that can chat in Italian, we can filter out all Italian pages.
Other steps: There are various minor steps. One worth mentioning is PII (Personally Identifiable Information) removal.
6. What is tokenization, and why is it a critical step in training LLMs?
Tokenization changes the representation of text: raw text is converted into a sequence of symbols (tokens) from a fixed vocabulary.
The neural network is trained on those sequences, so this representation is what the model actually sees.
7. Why is tokenizing text into sequences using only a few symbols a bad idea?
The sequence length a neural network can process is a very finite and precious resource, and we do not want long sequences made of very few symbols.
A vocabulary size of just two symbols (0 and 1), or even 256 symbols, is too small.
In production language models, we must go beyond 256 symbols.
This is done by running what is called the Byte Pair Encoding (BPE) algorithm.
8. How does the Byte Pair Encoding algorithm work?
It works by looking for consecutive bytes that occur very frequently.
For example, if the sequence 116 followed by 32 occurs often, we group this pair into a new symbol with ID 256 and replace every occurrence of the pair 116–32 with this new symbol.
We then iterate this algorithm as many times as we wish. Each time we mint a new symbol, the sequence length decreases and the vocabulary size increases. This process of converting raw text into these symbols (usually called tokens) is called tokenization.
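To make the merge step concrete, here is a minimal, illustrative BPE sketch in Python. It is not a production tokenizer; the starting vocabulary of 256 byte values and the first new symbol ID of 256 simply follow the example above.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Find the pair of consecutive symbols that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with the new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes (a vocabulary of 256 possible symbols)...
ids = list("the cat sat on the mat".encode("utf-8"))

# ...then repeatedly mint a new symbol for the most frequent pair.
vocab_size = 256
for _ in range(3):
    pair = most_frequent_pair(ids)
    ids = merge(ids, pair, vocab_size)   # sequence gets shorter
    vocab_size += 1                      # vocabulary gets larger

print(ids, "vocabulary size:", vocab_size)
```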
9. What is an LLM’s vocabulary size, and why does it matter?
The vocabulary size is the total number of possible tokens.
If the vocabulary is too small, the sequence representing a text becomes enormous.
Shorter sequences are preferable, but not too short, as that would lead to an overly large vocabulary. A good vocabulary size turns out to be around 100,000 possible symbols. For example, GPT-4 uses 100,277 tokens.
10. What is TikTokenizer?
It is a helpful web application that shows how a text is tokenized.
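If you prefer to poke at tokenization from code instead of the web app, OpenAI's tiktoken library gives a similar view (this assumes tiktoken is installed, e.g. with pip install tiktoken):

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models (~100k possible tokens).
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Hello world, this is tokenization!")
print(tokens)                              # the token IDs
print([enc.decode([t]) for t in tokens])   # the text chunk behind each ID
```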
11. What does Andrej K. mean by "windows of tokens"?
They are random sequences of tokens extracted from a large corpus of text.
12. What is a good size for token windows?
Andrej says that 8,000 tokens is a good maximum length, and the minimum size is 0.
This means sequences can be anywhere between 0 and 8,000 tokens long.
According to him, 4,000 or 16,000 tokens work fine as the maximum length too.
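A minimal sketch of what drawing such windows could look like, with a stand-in list of token IDs in place of a real tokenized corpus (illustrative only; real training pipelines shuffle and pack data much more carefully):

```python
import random

def sample_window(token_ids, max_len=8000):
    """Draw one random training window of up to max_len tokens from a token stream."""
    length = random.randint(1, max_len)
    start = random.randint(0, max(0, len(token_ids) - length))
    return token_ids[start:start + length]

corpus_tokens = list(range(1_000_000))   # stand-in for a tokenized corpus
window = sample_window(corpus_tokens)
print(len(window), window[:10])
```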
13. What is the neural network of an LLM trained for?
It is trained to predict the next token in a sequence of tokens.
The goal is to train the model to learn the statistical relationships that describe how tokens follow one another.
14. What are the input and output of the neural network?
The input is a sequence of tokens, and the output is a prediction of what comes next.
Since the vocabulary contains around 100,000 possible tokens, the neural network produces exactly that many numbers. Each number represents the probability of a token being the next one in the sequence.
In short, it is making probabilistic guesses about what comes next.
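A toy sketch of that output side, assuming the 100,277-token vocabulary mentioned earlier: the network emits one raw score (logit) per token, and a softmax turns those scores into probabilities that sum to 1.

```python
import numpy as np

vocab_size = 100_277                  # GPT-4's vocabulary size
logits = np.random.randn(vocab_size)  # stand-in for the network's raw output scores

# Softmax: convert raw scores into a probability for every possible next token.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs.shape, round(probs.sum(), 6))      # (100277,) 1.0
print("most likely next token id:", probs.argmax())
```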
15. How does pre-training work?
Pre-training is about learning all the parameters (weights) of the neural network by feeding it random token sequences extracted from the data and adjusting the weights so that its predictions match the actual next tokens.
Given the huge amount of data involved, pre-training a model can take months and cost hundreds of millions of dollars.
16. Why is LLM output described as stochastic?
Because the output can change each time you run inference on the same input sequence.
The model does not repeat verbatim what it was trained on. Instead it produces responses based on probabilities.
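A tiny illustration of why this happens: the next token is sampled from the probability distribution, not always taken as the single most likely one, so two runs on the same input can diverge (toy 5-token vocabulary, made-up probabilities).

```python
import numpy as np

rng = np.random.default_rng()

# Toy distribution over a 5-token vocabulary (a real model has ~100k entries).
probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])

# Sampling the next token twice from the same distribution can give different
# results -- this is what makes the output stochastic.
print(rng.choice(len(probs), p=probs))
print(rng.choice(len(probs), p=probs))
```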
17. What does inference refer to?
Inference is the process of using a trained model to predict the next tokens for a given prompt.
18. What happens when the model is not trained?
An untrained model has randomly initialized weights, so it produces random tokens (nonsensical text).
19. What has driven NVIDIA’s stock price to such a high level?
Pre-training massive models takes months and requires powerful GPUs. Since every major tech company needs these GPUs for their models, demand has surged, pushing NVIDIA's stock price up sharply.
20. What is a base model?
A base model is the result of the pre-training stage (the first stage).
21. How does Andrej K. describe base models?
A base model is a powerful text autocomplete system that creates a remix of the internet.
As Andrej K. said, "It dreams internet pages."
22. What are some web applications for running models like LLaMA 3?
The company Hyperbolic provides a Web App to run models like LLaMA 3 (and many other models): app.hyperbolic.ai
Another good web service is Together.ai.
23. Can you get useful results from a base model?
Yes you can, but you must prompt the model smartly.
The billions of parameters store lots of knowledge about the world. You can elicit that knowledge with a prompt that is likely to be found on a web page.
For example:
"Here is my top 10 list of landmarks to see in Paris:"
On the internet, there are many web pages that suggest Paris landmarks, so the recollection of the landmarks will be plausible.
24. Do the parameters store information in a lossless way?
No. The model stores the knowledge from the documents probabilistically, so it is a kind of lossy compression.
When information is recollected via inference, content that appears very frequently on the internet has a higher chance of being remembered correctly than content that appears rarely.
So you cannot fully trust the output, since the knowledge is not stored explicitly in the parameters.
It is more a probabilistic recollection of the internet.
25. What is a few-shot prompt?
It is a prompt that contains some examples before asking a question.
The model can infer a task from the examples and apply that task to new inputs.
Example of few-shot prompt: "butterfly: farfalla, ocean: oceano, whisper: sussurro, mountain: montagna, thunder: fulmine, gentle: gentile, freedom: libertà, umbrella: ombrello, cinnamon: cannella, moonlight: chiar di luna, teacher:"
Thanks to the examples, the model will infer the Italian translation for the word teacher: insegnante.
This capability is called in-context learning.
26. Is it possible to use a base model as an assistant?
Yes, but you must provide a few-shot prompt of a dialog between human and assistant:
- human: ...
- assistant: ...
- human: ...
- assistant: ...
- human: ...
- assistant: ...
- ...
That said, to create a more reliable assistant, the model must be fine-tuned.
Post-Training
27. What is the goal of post-training?
To create a useful assistant that answers users' questions.
Pre-training gives the user a powerful autocomplete. Post-training turns that into an assistant that actually tries to help the user.
28. What is the data input used to post-train the model?
To train the model to behave like an assistant, we need many thousands of human–assistant conversations.
These conversations are created by humans, often called labelers.
29. Pre-training or post-training: which one is more computationally expensive?
The pre-training stage. It can take months and cost hundreds of millions of dollars. The major cost comes from renting data centers capable of training on huge amounts of data.
Post-training takes only a few hours, which makes it much cheaper.
30. How do we tokenize conversations into token sequences?
We use the same vocabulary of tokens used in pre-training, plus a few extra special tokens added during post-training.
These special tokens are used to tag the human–assistant conversation.
For example:
<|im_start|>user<|im_sep|> What is 2 + 2? <|im_end|> <|im_start|>assistant<|im_sep|> 2 + 2 equals 4. <|im_end|>
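As a rough illustration, a conversation could be flattened into that single stream of text before being tokenized, along the lines of the sketch below. The exact special tokens differ between model families, so the strings here are just the ones from the example above.

```python
def render_chat(messages):
    """Flatten a conversation into one string using the special chat tokens
    shown above (other model families use different special tokens)."""
    return "".join(
        f"<|im_start|>{role}<|im_sep|>{text}<|im_end|>" for role, text in messages
    )

conversation = [
    ("user", "What is 2 + 2?"),
    ("assistant", "2 + 2 equals 4."),
]
print(render_chat(conversation))
```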
31. What are three important principles contained in the labeling instructions given to human labelers at OpenAI?
- Helpful
- Truthful
- Harmless
These are only some of the principles contained in the policy manual that labelers need to study to write good answers.
LLM "Psycology"
32. What is the meaning of "hallucination"?
It is when the model does not have enough knowledge stored in its parameters, but it still generates a response, which is a "best guess" (in terms of probability).
Since the guess is not based on actual knowledge, it is often false and sometimes absurd.
33. What are two possible ways to mitigate hallucinations?
These are two techniques:
Post-training with "I don’t know" examples: A simple technique is to post-train the model on questions for which it does not know the answer, and explicitly teach it to respond with "I don’t know" (or a similar phrase) instead of guessing.
Web search tool: Another approach mirrors human behavior: searching for information when the answer is unknown.
Modern LLMs can use web search tools to get useful information and add it to the context window. The model then answers the question using this new information, which greatly improves reliability. Using these techniques, LLM providers have reduced hallucinations in their models.
34. How does Andrej describe the context window and the knowledge in the parameters?
The knowledge in the parameters offers a vague recollection (e.g. of something you read one month ago).
The knowledge in the tokens of the context window is like working memory (e.g. recent experiences that are fresh in our mind).
35. Do LLMs have knowledge of self?
Andrej says that asking questions like "Who are you?" or "Who built you?" is nonsensical.
The model follows the statistical regularities of its training set. Old models reply to these kinds of questions with plausible but wrong answers (hallucinations).
Newer ones are often trained to answer these questions and avoid hallucinations, but that does not make them self-aware.
36. What is the meaning of "models need tokens to think"?
With that sentence, Andrej states that LLMs don’t think silently; their "thinking" happens by generating tokens step by step.
An LLM is trained to predict the next token, so any reasoning must be expressed as a sequence of tokens.
In the video, he asks the model to solve this math problem:
"Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of apples?"
There are two possible answers:
- Only the answer:
$3
- Answer with reasoning tokens:
2 oranges cost $4. 13 − 4 = 9. 9 / 3 = 3.
For the second answer, the model writes the intermediate steps.
Those steps are the reasoning process (the "thinking"). The model uses tokens as a form of working memory to reason through the problem. In this case, the answer is much more likely to be correct.
37. What is a more reliable way to ask ChatGPT to solve math problems?
Just add "Use code" at the end of the question, and the model will generate code that solves the problem (usually Python) and run it to get the response.
38. Are LLMs good at counting? For example, "How many dots are in this string?"
No, LLMs often make mistakes when counting characters or words.
In this case, adding "Use code" to the prompt will request the LLM to write and run Python code. The response is much more reliable, and you can even check the code’s accuracy.
39. Are models good at spelling?
No, because models do not see characters. They see tokens.
For example, if you ask the model to print every third character of a word, the model will probably fail. If you ask it to "Use code", you will get a correct response.
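For both of the questions above (counting characters and picking every third character), the code the model writes and runs boils down to something like this sketch:

```python
# Counting characters is exact in code, with no token confusion.
text = "." * 17                # a string of 17 dots
print(text.count("."))         # 17

# Picking every third character is a simple slice.
word = "ubiquitous"
print(word[::3])               # 'uqts'
```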
Reinforcement Learning
40. Andrej uses the school textbook analogy to introduce Reinforcement Learning for LLMs. Can you tell which are the three classes of information in the textbook?
In a textbook, you can find the expositions, the problems and solutions, and the practice problems sections:
The expositions: this is the knowledge base, the explanation of ideas and concepts.
The problems and solutions: these are sections in the book in which the expert shows how to solve specific problems.
The practice problems: these are critical for learning. Students use them to practice; the final answers are usually at the end of each chapter in the textbook, but the steps to reach those answers are not given.
41. How does the textbook analogy map onto an LLM?
The expositions: pre-training stage. The model reads huge amounts of text and learns the statistical correlations between tokens.
The problems & solutions: post-training stage. Supervised fine-tuning, in which the model is trained on thousands of questions (prompts) and ideal solutions and answers provided by human experts.
Practice problems: reinforcement learning.
42. What is a company that publicly shared its Reinforcement Learning approach?
DeepSeek released a paper in which they talked publicly about their approach to RL in their LLMs and the improvements they obtained.
43. In the RL stage, is the model trained using questions and correct answers?
No, and that is the important distinction between Reinforcement Learning and Supervised Fine-Tuning.
The correct answers are not used to train the model. In the RL stage, the model generates the solutions and the final answers.
The correct answers are used only to check the correctness of the generated answers. A positive or negative reward is given to the model based on the comparison between the model's answer and the correct answer.
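A minimal sketch of that verifiable reward signal, assuming the final answer can be extracted from each generated solution and compared to the known correct one (real pipelines are more nuanced):

```python
def reward(model_answer: str, correct_answer: str) -> int:
    """Give +1 if the final answer matches the known correct one, else 0."""
    return 1 if model_answer.strip() == correct_answer.strip() else 0

# The model generates many candidate solutions; only the matching ones are rewarded.
candidates = ["$3", "$4", "$3"]
print([reward(c, "$3") for c in candidates])   # [1, 0, 1]
```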
44. What are models trained with RL usually called?
They are usually called thinking or reasoning models.
45. What is the best use of a thinking model?
To solve problems that require reasoning, like math and coding.
46. In which cases is it overkill to use a thinking model?
For factual questions, where no reasoning is necessary.
It is wasteful to use a thinking model because it requires more tokens and more computation.
47. Why, in the context of the game Go, does Reinforcement Learning get better results than Supervised Learning?
Supervised learning is based on training a model on matches played by human experts. In this way, the model can be as good as the best players, but it cannot go beyond that.
With RL, the system plays against itself.
It plays millions of matches, and only the winning ones are rewarded.
In this way, human performance is not a limit. In fact, Google DeepMind's AlphaGo system used RL so effectively that it beat top Go players like Lee Sedol.
48. What are the kinds of problems that have verifiable domains?
These are problems in which all candidate solutions are easy to score against a correct answer.
The scoring and the reward can be done automatically, without human intervention.
For example, in math problems it is easy to check if the final number is correct.
Logic games like chess and Go are also examples, in which it is possible to verify whether certain moves will end with a win or a loss.
49. What are the kinds of problems that have unverifiable domains?
These are problems where the correctness and quality of the response are subjective and hard to measure.
For example: "Write a joke about pelicans". Machines are bad at understanding humor, so only humans can score this kind of question.
50. What is the meaning of RLHF?
It is Reinforcement Learning from Human Feedback. RLHF is a form of RL that requires input from humans.
For example, humans rank or compare different answers based on their quality, providing preference data that helps train the model.
51. How does RLHF work in practice for unverifiable tasks?
LLM engineers create and train a separate reward model neural network to imitate human preferences.
This reward model is then used to score responses generated by the LLM, and reinforcement learning is applied to encourage higher-scoring outputs.
Example:
Prompt: "Write a joke about pelicans" (asked 5 times). The LLM produces five different responses: a, b, c, d, e.
The reward model assigns scores to these responses and ranks them from best to worst, approximating human preferences.
Reinforcement learning then nudges the model to tell jokes more like the higher-ranked ones, which are potentially funnier.
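A rough sketch of that ranking step, where reward_model_score is a hypothetical stand-in for the separate reward model (here just a placeholder heuristic so the snippet runs; in reality the score comes from a trained neural network):

```python
def reward_model_score(prompt: str, response: str) -> float:
    """Placeholder for the reward model's scalar 'how good is this?' score."""
    return float(len(response))   # toy heuristic, NOT how a real reward model scores

def rank_responses(prompt, responses):
    """Rank candidate responses from best to worst by reward-model score."""
    return sorted(responses, key=lambda r: reward_model_score(prompt, r), reverse=True)

prompt = "Write a joke about pelicans"
candidates = ["joke a", "a longer joke b", "joke c!", "an even longer joke d", "joke e"]
for resp in rank_responses(prompt, candidates):
    print(reward_model_score(prompt, resp), resp)

# Reinforcement learning then nudges the LLM toward the kinds of responses
# that the reward model scores highest.
```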
52. What is the discriminator-generator gap?
It is much easier to tell whether something is good than to generate it.
You can often spot a bad explanation immediately, but generating a good explanation is much harder. That asymmetry is the discriminator–generator gap.
53. Can you run RLHF as long as you want to improve an LLM indefinitely?
No, after a certain number of iterations, usually a few hundred updates, the LLM starts degrading.
The reason is that the LLM starts finding answers that trick the reward model, getting very high scores for nonsensical responses. In other words, the reward function is gameable, and LLMs are very good at that game: they discover inputs that are rated as excellent even though they are nonsensical to real humans.
So RLHF works, but in a limited way. You cannot run it for too long.
The solution is to stop RLHF before the model deteriorates.
54. What are adversarial examples in RLHF?
They are nonsensical responses that the LLM learns to generate because they trick the reward model into giving them very high scores.
The model exploits flaws in the reward model to maximize its score rather than actual quality.
55. What is the main difference between RL in a verifiable domain and RLHF in an unverifiable domain?
You can run RL for extended periods in a verifiable domain and still discover better solutions.
The game of Go is a good example in which RL applies well. DeepMind trained a model so well that it eventually beat the best Go player. RLHF is not the kind of RL that you can run for extended periods. At a certain point, the model starts generating bad responses that are scored highly by the reward model (a problem known as reward model overoptimization).
56. What does Andrej mean by the "Swiss cheese model"?
Andrej uses the Swiss cheese metaphor to describe LLM capabilities.
They work really well for certain things, but they fail in other cases, and they do so almost at random, like the holes in Swiss cheese.
An example of a shortcoming that happened with early models of ChatGPT is:
"What is bigger, 9.11 or 9.9?"ChatGPT used to answer "9.11", which is of course wrong.
Recent models have fixed this problem.
57. Should you fully trust LLM responses?
No, you should not. Models are not infallible. They can hallucinate and fail in different ways (see the Swiss cheese model), but they are still powerful and useful tools.
Use them for a first draft, for inspiration, to summarize, and for many other tasks, but do not fully trust them.
Be responsible for the work you create using LLMs.
58. What is a multimodal model?
It is a model that can process not only text, but also audio, images, and video.
Those different media can be tokenized in a similar way to text, so multimodal models are not technically very different from text-only LLMs.
59. What are LLM agents?
Agents are systems built around LLMs that use tools to perform tasks and report progress to humans.
They can run for minutes or hours to complete longer jobs. Since models are not infallible, they benefit from human supervision, especially for critical tasks.
60. What is the biggest limitation of LLMs regarding learning?
The capacity to learn new things.
LLMs ingest all their knowledge during the pre-training and post-training stages. After that, the models do not have the capacity to change their parameters, which means they cannot learn new things.
You can use in-context learning and give the model examples in the prompt (aka few-shot prompting), but this is not real learning since the parameters do not change.
Also, the context window is a finite and precious resource, especially when running multimodal tasks, so its use is limited.
This is an open issue, and there is currently a lot of research to address it.
61. What is LMArena?
LMArena (also known as Chatbot Arena) is an LLM leaderboard that ranks top models based on human comparisons.
Two models are shown the same prompt, and humans compare their responses without knowing which model produced which answer.
62. In what way is the model DeepSeek-R1 different from Gemini or ChatGPT?
DeepSeek-R1 has an MIT license and an open-weights release, so anyone can download and use it and freely host their own version of DeepSeek.
By contrast, Gemini, ChatGPT, and Claude have proprietary licenses. It was surprising that a model as powerful as DeepSeek-R1 was released with open weights. Hopefully, more companies will follow DeepSeek's example.
63. What is LM Studio?
LM Studio is an application to run LLMs on your computer.
You probably cannot run top models locally, like DeepSeek-V3 with 671B parameters (you'd need hundreds of gigabytes of RAM and powerful GPUs), but fortunately there are smaller versions available, such as distilled or quantized models.
You can run these smaller models on a powerful MacBook Pro or Linux box (64–128 GB RAM). To run models more easily, you can use lower precision (quantization).
Conclusion
LLMs are formidable tools, and the time you spend learning how to leverage them is totally worth it.
I created these Q&As as personal notes, but I hope you found them inspiring and helpful.
If you tried to answer the questions before revealing the answers, congratulations! You've just strengthened the neuron connections about LLMs in your brain! (Yes, that's actually how learning works!)