Developing and Training LLMs From Scratch

Learn the full lifecycle of building large language models (LLMs) from the ground up. Explore model architecture design, pre-training, fine-tuning, RLHF, and deployment techniques. Discover the skills and hardware needed to work with LLMs, and how to choose between prompt engineering, RAG, and fine-tuning for your use case. Stay up-to-date with cutting-edge research like LoRA, Mixture of Experts, and Direct Preference Optimization. Understand the implications of LLMs for community-driven platforms like Stack Overflow.
Tags: GenAI, LLMs

Author: Hugo Bowne-Anderson and Sebastian Raschka

Published: June 18, 2024

In a recent episode of the Vanishing Gradients podcast, host Hugo Bowne-Anderson spoke with Sebastian Raschka, an AI researcher and educator, about the full lifecycle of large language models (LLMs). The conversation covers a wide range of topics related to building, training, fine-tuning, and deploying LLMs. Read on for a summary of what they covered, and watch the full episode below. We’ve also embedded short clips in the relevant sections of this post.


Note: we used Claude Opus to help write this post, based on the podcast transcript.

If you’re interested in getting hands-on, you can find a reproducible run-down of Sebastian live-coding GPT-2 fine-tuning here.

LLM Lifecycle

The LLM lifecycle may seem like a big, intimidating term, but it can be broken down into several key steps:

  1. Coding the model architecture: Choose a base architecture (e.g., GPT, Llama, Phi) and define unique properties like:
    • Vocabulary size
    • Embedding size
    • Number of attention heads and transformer blocks
    • Activation functions
    • These choices determine the size of the LLM, the data it needs, and the compute requirements for training and deployment (see the configuration sketch after this list).
  2. Pre-training: Train the model on a large corpus of text data, which can be either:
    • General (e.g., for models like Llama, Gemma, and Phi)
    • Domain-specific (e.g., finance-focused data for BloombergGPT)
    • The tradeoff is whether the expense of custom pre-training, which can cost hundreds of thousands to millions of dollars, is worth it for the specific use case versus fine-tuning one of the general models.
  3. Fine-tuning: Adapt the pre-trained model for a specific task or domain. Options include:
    • Instruction fine-tuning: Make the model better at following instructions without changing its underlying knowledge
    • Task-specific fine-tuning: Train the model for a downstream task like spam classification
    • Fine-tuning is usually not as effective as pre-training for instilling entirely new knowledge into the model. For example, fine-tuning an English LLM on Spanish data may improve its Spanish performance if there was already some Spanish data during pre-training, but it likely won’t be as good as pre-training from scratch on Spanish.
  4. Retrieval Augmented Generation (RAG): If the goal is to have the model retrieve information from documents rather than generating the full response from scratch, RAG can be used on top of a pre-trained LLM. This is useful when you want the model to draw on external knowledge to answer questions or summarize information.
  5. Deployment: Deploy the model to production and monitor its performance, making updates as needed. This may involve serving the model through an API, integrating it into a user-facing application, and setting up monitoring and logging to track its behavior over time.
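
To make the architecture step above concrete, here is a minimal sketch of the kind of configuration object such a model is built from, with values loosely modeled on a GPT-2-“small”-sized model; the names, defaults, and back-of-the-envelope parameter count are illustrative assumptions rather than code from the episode:

```python
from dataclasses import dataclass

import torch.nn as nn


@dataclass
class GPTConfig:
    vocab_size: int = 50257     # tokens the tokenizer can produce
    context_length: int = 1024  # maximum sequence length
    emb_dim: int = 768          # embedding size
    n_heads: int = 12           # attention heads per transformer block
    n_layers: int = 12          # number of transformer blocks
    activation: type = nn.GELU  # activation used in the feed-forward layers


def approx_params(cfg: GPTConfig) -> int:
    """Back-of-the-envelope parameter count: embeddings plus transformer blocks."""
    embeddings = (cfg.vocab_size + cfg.context_length) * cfg.emb_dim
    # Per block: attention projections (~4 * d^2) + feed-forward (~8 * d^2),
    # ignoring biases and layer norms.
    blocks = cfg.n_layers * 12 * cfg.emb_dim**2
    return embeddings + blocks


print(f"~{approx_params(GPTConfig()) / 1e6:.0f}M parameters")  # roughly GPT-2 small (~124M)
```

Changing any of these numbers ripples through the whole lifecycle: a larger vocabulary or embedding size means more parameters, which in turn means more training data and more GPU memory at deployment time.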

Depending on the use case, not all of these steps may be necessary. For example, a simple personal assistant could potentially just use an off-the-shelf model like Llama 3 without any custom fine-tuning, running locally or deployed with a basic UI.

RLHF and Other Add-Ons

Beyond the core LLM lifecycle, there are some interesting extensions and add-ons to consider. One hot area of research is reinforcement learning from human feedback (RLHF), which aims to align LLMs more closely with human preferences.

The RLHF process typically involves two main stages:

  1. Supervised Instruction Fine-Tuning:
    • Instruction fine-tune the pre-trained model (using next-token prediction on instruction data) on a large dataset (e.g., 50K examples)
    • Goal is to make the model better at following instructions and generating relevant outputs
  2. Reward Modeling and Policy Optimization:
    • Collect a smaller dataset of human preferences, where each example includes:
      • Two possible model outputs for the same prompt
      • A human annotation indicating which output is preferred
    • Train a reward model to predict which output humans prefer, based on this dataset (see the loss sketch after this list)
    • Optimize the main LLM (the “policy”) to maximize the reward predicted by the reward model
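
To make the reward-modeling stage concrete, here is a minimal sketch of the pairwise loss commonly used to train reward models on preference data; the tensor names and toy values are illustrative assumptions, not code from the episode:

```python
import torch
import torch.nn.functional as F


def reward_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push the scalar reward of the human-preferred
    response above the reward of the rejected response for the same prompt."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy usage: scalar rewards the reward model assigned to each response in a batch.
chosen = torch.tensor([1.2, 0.3, 0.8])      # rewards for the preferred responses
rejected = torch.tensor([0.7, 0.5, -0.1])   # rewards for the rejected responses
print(reward_loss(chosen, rejected))
```

Once trained, this reward model scores candidate outputs during the policy-optimization step, where the LLM is updated (typically with PPO) to produce responses the reward model rates highly.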

Sebastian illustrates the impact of RLHF with an example: if a model that has been instruction-tuned and aligned with RLHF uses the word “delve” noticeably more often than its base, pre-trained-only counterpart, it suggests that the human raters preferred outputs containing “delve” during the RLHF process.

Another interesting direction mentioned in the conversation is Continued Pre-training, advocated by Jeremy Howard and others. The idea is to continue pre-training the model on a smaller, task-specific dataset, rather than switching to supervised fine-tuning. This may help the model acquire new knowledge more effectively compared to standard fine-tuning.

Skills You Need to Work with LLMs

Working with LLMs requires a foundation in deep learning, and a lot of the core concepts carry over directly:

  • Familiarity with PyTorch (or another deep learning framework)
  • Understanding of loss functions, optimizers, and evaluation metrics
  • Comfort with training loops, including concepts like cross-entropy loss, learning rate schedules, gradient descent, and backpropagation

Sebastian notes that you can think of an LLM as a big PyTorch model – if you’re comfortable with the core deep learning concepts, you’re already well on your way to working with LLMs.
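
As a concrete illustration of how familiar those pieces are, here is a minimal next-token-prediction training step in plain PyTorch; the stand-in model and hyperparameters are illustrative assumptions, not code from the episode:

```python
import torch
import torch.nn.functional as F


def training_step(model, optimizer, input_ids):
    """One step of next-token prediction: cross-entropy loss, backprop,
    and an optimizer update -- the same loop as any other PyTorch model."""
    inputs, targets = input_ids[:, :-1], input_ids[:, 1:]  # shift targets by one token
    logits = model(inputs)                                 # (batch, seq, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy usage with a stand-in "language model" (embedding + linear head).
vocab_size = 100
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32), torch.nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
batch = torch.randint(0, vocab_size, (4, 16))  # (batch, seq) of token IDs
print(training_step(model, optimizer, batch))
```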

However, the scale of LLMs introduces some additional challenges:

  • Multi-GPU and multi-node training: To train LLMs efficiently, you’ll need to use techniques like model parallelism, pipeline parallelism, and sharded data parallelism across multiple GPUs and multiple nodes (depending on the model size).
  • Low-precision and mixed-precision training: Using lower-precision formats like bfloat16 can help reduce memory requirements and speed up training.
  • Checkpointing: Saving model checkpoints periodically is crucial to avoid losing progress if a long-running training job fails.

Engineering best practices become even more important when working with LLMs. Tools like PyTorch Lightning and Fabric can help abstract away some of the complexity of distributed training, but it’s still valuable to understand what’s going on under the hood.
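
As a rough illustration of the low-precision and checkpointing points above, here is what a bfloat16 training step with periodic checkpoint saving can look like in plain PyTorch; the checkpoint path, interval, and loop structure are illustrative assumptions (a framework like Lightning or Fabric would wrap the distributed parts around this):

```python
import torch
import torch.nn.functional as F


def train(model, optimizer, dataloader, device="cuda", ckpt_every=1000):
    model.to(device)
    for step, (input_ids, targets) in enumerate(dataloader):
        input_ids, targets = input_ids.to(device), targets.to(device)
        # bfloat16 autocast: lower memory use and faster matmuls on recent GPUs.
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            logits = model(input_ids)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Periodic checkpointing so a failed multi-day run can resume instead of restarting.
        if step % ckpt_every == 0:
            torch.save({"model": model.state_dict(),
                        "optimizer": optimizer.state_dict(),
                        "step": step}, "checkpoint.pt")
```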

Hardware / Resources to Work with LLMs

Training LLMs requires significant compute resources beyond a typical laptop or desktop. Sebastian outlines a few common options:

  • Google Colab: Offers free GPU access for prototyping small models, but with strict runtime limits. Not suitable for larger-scale training.
  • Cloud platforms (AWS, GCP, Azure): Allow you to rent powerful GPU instances on-demand, but require careful environment setup and cost management. You’ll need to set up your environment from scratch each time, which can be time-consuming.
  • Managed platforms (e.g., Lightning AI): Provide streamlined development environments with easy access to pre-configured compute resources. The price is similar to AWS, but it can save a lot of engineering time. Lightning AI also offers persistent storage, so you don’t need to constantly re-install your dependencies.

When choosing a hardware setup, key considerations include the amount of GPU RAM available (often the main bottleneck for LLMs), support for multi-GPU parallelism techniques like tensor parallelism and sharded data parallelism, and fast interconnects between devices.
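
For multi-GPU training specifically, one common starting point is PyTorch’s built-in FSDP (fully sharded data parallelism). The sketch below shows the basic wrapping pattern; `build_model()` is a placeholder for your own model constructor, and the launch command and hyperparameters are illustrative assumptions:

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launched with something like: torchrun --nproc_per_node=8 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)

model = build_model()                        # placeholder: returns your LLM as a plain nn.Module
model = FSDP(model, device_id=local_rank)    # shards parameters, gradients, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# ...from here, the training loop looks much like the single-GPU version.
```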

Sebastian emphasizes the importance of starting small and debugging your code on a local machine before scaling up. He likes to prototype using compact models like GPT-2 or smaller versions of Pythia, which can run on a single CPU or GPU and share many architectural similarities with larger LLMs. In fact, he was able to train a spam classifier using GPT-2 on his MacBook Air in just 6 minutes (or 30 seconds on a GPU)!

Interestingly, many of the most popular LLM architectures (Llama, Mistral, Phi, etc.) are derived from the same basic building blocks as GPT. So getting hands-on experience with GPT-2, even on a small scale, can teach you a lot about how LLMs work under the hood. Playing with pre-trained weights is also a great way to validate your code, since the model will only generate sensible outputs if you’ve implemented everything correctly.
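
One practical way to run that kind of sanity check is to load the published GPT-2 weights and confirm they produce coherent text through your pipeline. The snippet below uses the Hugging Face transformers library as one convenient way to fetch the weights (an illustrative choice; the prompt and generation settings are arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The 124M-parameter GPT-2 checkpoint is small enough to run on a laptop CPU.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Every effort moves you", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(output_ids[0]))
# If your own GPT-2 implementation loads these same weights and produces similarly
# coherent text, that is strong evidence the architecture code is correct.
```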

Prompt Engineering, RAG, or Fine-Tuning?

When working with LLMs, there are several approaches to getting the desired outputs for your specific use case. Sebastian breaks down the key considerations for choosing between prompt engineering, retrieval augmented generation (RAG), and fine-tuning.

  • Prompt engineering: The simplest approach, where you craft prompts to elicit the desired outputs from the model.
    • Works well when the model has the necessary knowledge and just needs guidance.
    • Limited by the model’s pre-existing knowledge and ability to follow instructions.
  • Retrieval Augmented Generation (RAG): Enhances the model’s knowledge by retrieving relevant documents from a corpus based on the user’s prompt.
    • Useful when you have a large corpus of relevant data that wasn’t included in the model’s original training.
    • Requires a well-organized corpus and infrastructure for efficient retrieval (see the minimal retrieval sketch after this list).
  • Fine-tuning: Trains the model on a specific dataset to deeply internalize new knowledge or adapt to a specific task.
    • Most powerful approach, but also the most computationally intensive and time-consuming.
    • Requires significant compute resources and high-quality data.
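
To make the RAG option concrete, here is a minimal retrieval sketch: embed a small corpus, find the documents most similar to the user’s question, and prepend them to the prompt. The embedding model, example documents, and the final `ask_llm` call are illustrative assumptions, not a specific stack recommended in the episode:

```python
from sentence_transformers import SentenceTransformer, util

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, convert_to_tensor=True)


def retrieve(question: str, top_k: int = 1) -> list[str]:
    """Return the documents most similar to the question by cosine similarity."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    return [documents[i] for i in scores.topk(top_k).indices.tolist()]


question = "How long do I have to return an item?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# The assembled prompt would then be sent to whichever LLM you are using,
# e.g. answer = ask_llm(prompt)  # ask_llm is a hypothetical helper
```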

So how do you choose the right approach for your use case? Sebastian recommends the following:

  1. Start with prompt engineering and gradually work your way up to more advanced techniques as needed.
  2. If prompt engineering isn’t sufficient, try incorporating RAG to expand the model’s knowledge base.
  3. If you find yourself using RAG frequently for the same task, or if the model is still struggling, consider investing in fine-tuning.

Of course, the feasibility of each approach also depends on your resources and constraints. Prompt engineering is cheap and easy, RAG requires a well-organized corpus, and fine-tuning is the most resource-intensive.

The key is to experiment and iterate:

  • Start with the simplest approach that might work, and gradually add complexity as needed.
  • Don’t be afraid to combine techniques (e.g., using prompt engineering to improve RAG queries or fine-tuning a model and then using prompt engineering to guide its outputs).

As you gain more experience working with LLMs, you’ll develop a better intuition for which approaches are likely to work best in different scenarios. But as a general rule, Sebastian recommends starting simple and only moving to more advanced techniques when the benefits clearly outweigh the costs.

LLM Research Techniques: LoRA, DPO, and more

Among the many exciting research ideas in the LLM space, Sebastian highlights a few of his favorites:

  • LoRA (Low-Rank Adaptation): An efficient fine-tuning technique that approximates full parameter updates with smaller updates to a low-rank decomposition of the weights (see the sketch after this list). LoRA has been around for a couple of years but remains Sebastian’s go-to for fine-tuning due to its strong performance and efficiency. He’s also excited about recent variants like QLoRA and DoRA.
  • Mixture of Experts (MoE): An approach to conditional computation where different subsets of the model parameters are activated depending on the input. This can make the model more efficient by avoiding unnecessary computation. Sebastian notes that MoE is used in some recent state-of-the-art models like Mixtral.
  • Multi-token prediction: A technique for speeding up LLM inference by predicting multiple tokens at once using separate output heads. Sebastian describes this as a clever “hack” that achieves a 4x inference speedup without any fundamentally new math, showing the power of creatively combining existing building blocks.
  • Direct Preference Optimization (DPO): An alternative to RLHF that removes the need to train a separate reward model, instead optimizing the policy (i.e., the LLM) directly so that its outputs match human preferences (a minimal sketch of the loss appears at the end of this section). DPO is simpler than RLHF, which may explain why many top-performing models on leaderboards use it. However, a recent paper suggests that RLHF with PPO still outperforms DPO, albeit with more complexity.
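
To show the core idea behind LoRA, here is a minimal sketch of a LoRA-augmented linear layer: the pre-trained weight stays frozen, and only a small low-rank update BA (scaled by alpha/rank) is trained. The rank, scaling, and initialization below are illustrative defaults, not values from the episode:

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        self.linear.requires_grad_(False)  # freeze the pre-trained weight (and bias)
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))  # zero init: no change at start
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.scaling * (x @ self.A.T @ self.B.T)


# Toy usage: wrap one 768x768 projection; only A and B receive gradients.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # ~12K trainable parameters instead of ~590K
```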

Looking ahead, Sebastian is excited to see more research combining multiple techniques, like Llama 3, which uses both RLHF and DPO. He also mentions DoRA (an extension of LoRA), Kahneman-Tversky Optimization (KTO, which learns from simple binary feedback rather than paired preferences), and other promising ideas. While the field is moving quickly, some of these core techniques seem to be standing the test of time so far.
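
For reference, here is a minimal sketch of the DPO loss mentioned above: it compares how much more probability the policy assigns to the preferred response than a frozen reference model does, with `beta` controlling how strongly preferences are enforced. The tensor names and toy numbers are illustrative assumptions:

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Direct Preference Optimization: increase the policy's relative log-probability
    of the preferred response versus the rejected one, measured against a frozen
    reference model, with no separate reward model."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


# Toy usage: summed log-probabilities of each full response under policy and reference.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss)
```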

Stack Overflow and OpenAI

The recent collaboration between Stack Overflow and OpenAI raises interesting questions about the future of community-driven knowledge sharing in the age of LLMs.

On one hand, there’s a risk that if everyone starts relying on LLMs like ChatGPT to answer their coding questions, they’ll stop contributing to resources like Stack Overflow. This could create a vicious cycle in which the lack of new content degrades the quality of the LLMs themselves, since they rely heavily on sites like Stack Overflow for training data.

At the same time, LLMs are particularly well-suited for answering common, repeated questions that make up a significant fraction of Stack Overflow traffic. If these questions can be reliably answered by an LLM, it may free up human experts to focus on more novel, challenging problems. In this future, Stack Overflow could become a site focused on cutting-edge content and high-quality discussions, while LLMs handle the long tail of simpler queries.

Sebastian emphasizes that the success of Stack Overflow and other community-driven resources in the age of LLMs will depend on their ability to adapt and provide unique value that can’t be easily replicated by models. This could mean doubling down on moderation and quality control, investing in new content formats and interaction models, or exploring hybrid human-AI collaboration tools.

In the end, LLMs are only as good as the data they’re trained on – which comes from the collective knowledge contributions of humans. Finding ways to sustain and encourage those contributions, even as LLMs become more prevalent, will be a key challenge and opportunity going forward.

Be sure to check out Sebastian’s book “Build a Large Language Model (From Scratch)” and his other tutorials to dive deeper into the technical details of the LLM lifecycle!

If you enjoyed this, you can follow Sebastian on Twitter here and Hugo here. You can also subscribe to Hugo’s fortnightly Vanishing Gradients Newsletter here.