Finding an optimal set of hyperparameters is essential for efficient and effective training of Large Language Models (LLMs).
The key LLM hyperparameters control the model size, the learning rate and its schedule, the learning behavior, and the token generation process.
Due to their computational demands, traditional methods for optimizing hyperparameters, such as grid search, are impractical for LLMs.
Advanced hyperparameter optimization strategies, like population-based training, Bayesian optimization, and adaptive LoRA, promise to balance computational effort and outcome.
The rise of large language models (LLMs) is bringing advances in text generation and contextual understanding. Hyperparameters control the size of LLMs, their training process, and how they generate outputs.
An optimal combination of hyperparameters is fundamental to efficiently pre-training and fine-tuning LLMs. Since LLM training is computationally intensive, exhaustive experimentation is not viable. This rules out traditional machine-learning hyperparameter optimization (HPO) methods that rely on systematically exploring the hyperparameter space by training many models with slightly different configurations.
When configuring models and training processes, LLM developers rely on a thorough understanding of each hyperparameter’s influence, insights from fundamental research, and empirical evidence gained from training state-of-the-art foundation models. Methods for estimating optimal hyperparameter values with limited compute budgets and adapting hyperparameters throughout the training process can help pre-training and fine-tuning.
After reading this article, you’ll be able to answer the following questions:
- What key hyperparameters should be considered when developing, training, and applying LLMs?
- How does each hyperparameter influence the LLM, and which trade-offs do we need to be aware of?
- How can we select an optimal combination of hyperparameters in our scenario without fully training multiple model variants?
- What advanced hyperparameter optimization techniques are available for LLMs, and when can we apply them?
LLM hyperparameters
A hyperparameter is a configuration value that controls the behavior of a machine-learning model during the training or inference process. Unlike model parameters (the weights), which are learned directly from the training data, hyperparameters are defined by the model developers. A hyperparameter can be constant or adjusted dynamically according to predefined rules or schedules.
Model size
In the case of LLMs, we often work with pre-trained models, where the activation functions, internal architecture of layers or blocks, and their connections—all examples of hyperparameters—are fixed. If our pre-trained LLM of choice is available in different sizes, the model size is the only hyperparameter affecting the model’s makeup we can actively control.
The size of an LLM refers to the total number of parameters it contains, which influences the model’s capacity to understand and generate complex language patterns. Hyperparameters set and tuned during pre-training influence the total size of an LLM.
One hyperparameter influencing a model’s size is its depth, corresponding to the total number of layers stacked sequentially. Each additional layer in an LLM adds more parameters, such as the weights for the self-attention mechanism and feed-forward layers in a transformer block.
Another hyperparameter influencing an LLM’s size is its hidden size, which refers to the dimensionality of the token embeddings and the internal representations within each layer. The hidden size determines how richly the model can encode information about each input token and how effectively it can process complex language patterns. A larger hidden size means each token is represented in a higher-dimensional space, allowing the model to capture more detailed semantic and syntactic nuances.
Further, the number of parallel attention heads in each transformer block influences the size of the LLM. Multiple heads allow the model to focus on different input aspects simultaneously. Through multi-query and grouped-query attention, we can reduce the number of necessary parameters.
Finally, the vocabulary size and context window (maximum sequence length) also impact the model’s size. They determine the language diversity a model can handle and the context length it can maintain, respectively.
These hyperparameters are set before training begins and cannot be changed later; together, they determine the model size. For example, GPT-3 has 96 layers, a hidden size of 12,288, 96 attention heads, a vocabulary of 50,257 tokens, and a context window of 2,048 tokens, resulting in a total of 175 billion parameters.
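To make the relationship between these hyperparameters and the parameter count concrete, here is a rough back-of-the-envelope sketch in Python. It assumes a standard GPT-style block layout (four attention projections plus a feed-forward network with 4x expansion) and ignores biases and layer norms, so it yields an approximation rather than an exact count.

```python
# Back-of-the-envelope parameter count for a GPT-style decoder-only transformer.
# Ignores biases, layer norms, and other small contributions.

def estimate_params(num_layers: int, hidden_size: int,
                    vocab_size: int, context_window: int) -> int:
    attention = 4 * hidden_size**2          # Q, K, V, and output projections
    feed_forward = 2 * 4 * hidden_size**2   # up- and down-projection with 4x expansion
    per_layer = attention + feed_forward    # roughly 12 * hidden_size^2 per layer
    embeddings = vocab_size * hidden_size + context_window * hidden_size
    return num_layers * per_layer + embeddings

# GPT-3-like configuration from the text
print(f"{estimate_params(96, 12288, 50257, 2048) / 1e9:.1f}B parameters")  # ~174.6B
```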
Learning rate
The learning rate (LR) is a critical hyperparameter in training LLMs. Optimizing it is essential for efficient learning, stable convergence, and good generalization to unseen data.
The learning rate determines how much model weights are changed during each update. A high learning rate helps speed up the training process but increases the risk of instability and overfitting. A low learning rate increases stability and tends to benefit generalization but leads to slow training.
In the case of LLMs, the learning rate is typically not constant but varies as training progresses. This variation is governed by a learning rate schedule (LRS). The schedule is usually tied to the number of tokens seen—either directly, or indirectly through the number of samples, steps, or epochs. At a high level, it contains phases of a rising, constant, and decreasing learning rate.
How does the learning rate affect training duration and quality?
Following theoretical work by Stanford researcher Kaiyue Wen and colleagues published in December 2024, we can think of LLM training as progressing along a loss landscape that looks like a river valley. They hypothesize that the existence and overall direction of the river are due to the facts and knowledge an LLM learns, which are reflected as highly deterministic and, therefore, easy-to-predict tokens. The valley slopes arise from flexibility and ambiguity inherent to language, i.e., hard-to-predict tokens.

In this picture, the training goal is to reach the river mouth, at which point we should be as close to the bottom of the valley as possible. The first crucial insight is that it does not matter whether we stay at the bottom of the valley until then. Thus, if we can make faster progress down the river by bouncing back and forth between points high up the loss valley’s slopes, we can do this without affecting the final outcome.
Thus, we should aim to use a high learning rate—resulting in large steps towards the loss minimum but leading to wildly fluctuating loss values—for as long as possible. Towards the end of the training, the learning rate should be decreased to a very low value. This will slow down progress towards the river mouth but reduce the oscillations to a point where we constantly stay at the valley’s bottom, i.e., the local loss minimum.
However, all of this is only going to work if we are already in a sufficiently deep loss river valley. When training is first starting, a high learning rate will lead to undirected jumps across the loss landscape. To avoid this, learning rate schedules for LLMs start with a small learning rate and slowly ramp it up to its maximum value. This is called the warmup phase.
Cosine schedule
The cosine schedule (also known as cosine decay or cosine annealing) implements this approach by starting with a linear warmup phase that brings the learning rate to its maximum value, followed by a slow decay following the cosine function:
LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π t/T))
Here, LRmin and LRmax are the minimum and maximum learning rates, t is the training step, and T is the total number of training steps. The advantage of this schedule is that it stays close to the peak learning rate for a long time, and the final decay is gradual. It’s also easy to implement, as it depends on just three hyperparameters (LRmax, LRmin, and T) linked by the cosine function.
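As a minimal sketch, the schedule can be written as a plain Python function that combines the linear warmup described above with the cosine decay formula; the function and argument names are illustrative, not taken from any particular library.

```python
import math

def cosine_schedule(step: int, total_steps: int, warmup_steps: int,
                    lr_max: float, lr_min: float) -> float:
    """Linear warmup to lr_max, then cosine decay to lr_min over the remaining steps."""
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```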
Cosine schedules have been highly popular for pretraining LLMs. For example, a cosine schedule was used for BLOOM, a 176-billion-parameter multilingual model developed by the BigScience Research Workshop and released in 2022. In an initial warmup phase, the learning rate was ramped to a peak of 6 × 10⁻⁵ over 375 million tokens. Afterward, it was lowered to 10% of this value with cosine decay over 410 billion tokens and remained at this value. The implementation and detailed description are publicly accessible in BLOOM’s GitHub repository.
For pre-training their Llama 3 405B model, Meta used a slightly more involved variant of the cosine schedule. In the first stage, a warmup phase of up to 8,000 steps brought the learning rate to a maximum of 8 × 10⁻⁵. Subsequently, the learning rate decreased to 8 × 10⁻⁷ over 1.2 million steps with a cosine decay. The second stage focused on training the LLM up to its final context length of 128,000 tokens; in the third stage, the learning rate linearly decreased to 0 over 40 million tokens. Supervised fine-tuning was conducted over about 9,000 steps with a learning rate of 10⁻⁵.
A major disadvantage of the cosine schedule is that the total number of training steps has to be known beforehand. When training large foundation models, the total compute budget is typically set, and the optimal number of training tokens can be estimated. However, when fine-tuning or experimenting, it would be preferable to base the decision on when to end training on the model’s performance.
Warmup-stable-decay schedule
The warmup-stable-decay (WSD) schedule is a simple protocol introduced by Shengding Hu and colleagues at Tsinghua University in 2024. It starts with a linear warmup to the maximum learning rate, keeps the learning rate constant for the majority of the training, and ramps it down at the end.
Through experiments, they found that a decay phase that makes up 10% of the total length is sufficient. They also demonstrated that a WSD schedule leads to a lower loss than a cosine schedule. According to Wen and colleagues at Stanford, this can readily be understood in the river valley picture. In the WSD schedule, the learning rate stays at a high value longer than in the cosine schedule. Hence, we make it further down the valley before dropping to its bottom. Further, their analysis shows that training progress in the stable phase is dominated by learning to predict deterministic tokens (facts and knowledge), while in the decay phase, the LLM learns the stochastic tokens (language variability).
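A minimal sketch of a WSD schedule is shown below, assuming a linear warmup, a constant plateau, and a linear ramp-down over the final 10% of training; the exact shape of the decay phase is a design choice, and the names here are purely illustrative.

```python
def wsd_schedule(step: int, total_steps: int, warmup_steps: int,
                 lr_max: float, lr_min: float, decay_fraction: float = 0.1) -> float:
    """Warmup-stable-decay: linear warmup, constant plateau, linear ramp-down."""
    decay_start = int(total_steps * (1 - decay_fraction))
    if step < warmup_steps:
        return lr_max * step / max(1, warmup_steps)
    if step < decay_start:
        return lr_max  # stable phase: hold the peak learning rate
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return lr_max - (lr_max - lr_min) * progress
```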

While a WSD schedule yields a lower loss for the same training budget, knowing the total number of training steps ahead of time is still required for scheduling the decay phase. However, the WSD schedule offers a straightforward way to extend the total number of training steps retroactively: If we find that our final model’s performance is unsatisfactory, we can resume training from a model snapshot taken at the end of the stable phase. This beams us back a small distance up the loss river valley, from where we continue making large jumpy steps towards the river mouth as if we had never descended to the valley’s bottom in the first place.
Restarting this way, we still benefit from 90% of the compute budget spent so far. It allows us to determine the compute budget we need as we go, producing fully trained intermediate models—something that the cosine schedule inherently does not allow for.
Track months-long model training with more confidence. Use neptune.ai’s forking feature to iterate faster and optimize the usage of GPU resources.
With Neptune, users can visualize forked training out of the box. This means you can:
- Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
- Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.
Cyclical cosine schedule
Returning to a high learning rate after decaying to a minimum is not a new idea in machine learning. Long established in gradient-free optimization, it was made popular for deep learning training through the “Stochastic Gradient Descent with Warm Restarts” technique proposed by Ilya Loshchilov and Frank Hutter in 2017. The learning rate is governed by a function very similar to the one for the cosine schedule:
LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π (t mod T)/T))
This time, T is not the total number of training steps but is understood as the schedule’s period. For example, we might train for 10,000 steps with T = 1,000, leading to ten consecutive cosine decay cycles. Commonly, LRmax is set to a new, lower value at the beginning of each cycle.
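PyTorch ships a scheduler for cosine annealing with warm restarts that implements this cyclic behavior. Note that it keeps the same peak learning rate for every cycle, so lowering LRmax at each restart would have to be added on top. A minimal usage sketch, with a toy model standing in for an actual LLM:

```python
import torch

model = torch.nn.Linear(10, 10)  # toy stand-in for an actual LLM
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Restart the cosine cycle every 1,000 steps; eta_min is the learning rate
# reached at the end of each cycle.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=1000, T_mult=1, eta_min=3e-6
)

for step in range(10_000):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()
```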
In the loss landscape river valley, we’re climbing down to the bottom over T steps, making ever slower progress down the river as we keep closer to the bottom. Then, we immediately go back to make large jumps toward the river mouth high up the valley’s slopes.
Right at the beginning of a new cosine cycle, the loss will be significantly higher than it was previously. This could be due to the jump in the learning rate, which might perturb the model. However, Wen and colleagues argue, based on their experiments and theoretical insights, that it is the result of training with a small learning rate for too long.
Whatever the cause, this doesn’t just make training less efficient. It’s also an obstacle to continue model training later. Whether we aim to further pre-train on newly acquired or different data, fine-tune an LLM, or incrementally evolve a model in a continual learning scenario—ideally, we could take a model snapshot and train it effectively, making the most of the compute budget we have available and the compute budget we have already spent. The learning rate schedule used during pretraining directly impacts this.
Cyclical warmup-stable-decay schedule
The Warmup-Stable-Decay (WSD) schedule allows continuing training from the final model checkpoint of the stable phase without incurring a loss penalty. This preserves a large fraction of the compute budget spent, as we only have to discard what we spent on intermediate decay phases. But this is not negligible at the scale of LLM pretraining, where the costs regularly exceed tens of millions of US dollars.
As Wen and colleagues found, starting from the final decay phase model checkpoint in a WSD schedule does not cause the same loss penalty as the cosine schedule. As the WSD schedule’s decay phase is rather short, they hypothesize it does not have the same destructive effect as the cosine schedule’s long and slow decay. Given a total compute budget, consecutively repeating the WSD cycle is more efficient than restarting from the final checkpoint of the latest stable phase.
A cyclical WSD schedule is easier to implement than WSD restarts, as the model evolves continuously down the loss landscape river valley, and no prior checkpoints have to be reloaded. It also helps downstream users, who initially often utilize few-shot prompting to adapt an LLM to their use case. If they later decide to fine-tune an LLM trained with a cyclical WSD schedule, they can efficiently continue training from the same model checkpoint they already use for inference.
Learning behavior
In a neural network, the weights are the parameters of its neurons learned during training. In an LLM, weights include the query, key, and value matrices in the attention heads and the weight matrices of the feed-forward layers. While the learning rate governs the scale of changes made to the model’s weights, we can also control how the weights change on a more fine-grained level.
Weight decay
Employing weight decay during training penalizes large weights, preventing small parts of the model from dominating its output. Weight decay in stochastic gradient descent is implemented by adding a term to the loss function. For example, using L2 regularization, the adapted loss function looks like this:
L = Lorig + λ Σi wi²
Here, Lorig is the original loss function, λ is the weight decay factor, and wi are the model weights.
Weight decay has been applied to transformer-based NLP models since the beginning. In the seminal 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the authors state that they trained the model using “Adam with [a] learning rate of 1e-4, β₁=0.9, β₂=0.999, L2 weight decay of 0.01, learning rate warm up over the first 10,000 steps, and linear decay of the learning rate.”
As Ilya Loshchilov and Frank Hutter point out in their 2019 paper Decoupled Weight Decay Regularization, in adaptive optimizers like Adam, L2 regularization and weight decay are not identical, and L2 regularization is not effective. In Adam, the gradient of the regularization term is scaled along with the gradient of Lorig, which leads to minimal regularization for weights whose gradients are large. They introduced the AdamW optimizer, where the weight decay term is independent of the gradient-based update. AdamW is widely used for LLMs, such as for training Megatron-LM (2019), Llama 1 (2023), Llama 2 (2023), and Llama 3 (2024).
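In PyTorch, the difference comes down to the optimizer class: torch.optim.AdamW applies decoupled weight decay, while passing weight_decay to torch.optim.Adam adds a classic L2 penalty to the gradients. A minimal sketch, reusing the hyperparameter values quoted above and a toy model:

```python
import torch

model = torch.nn.Linear(10, 10)  # toy stand-in for an actual LLM

# Decoupled weight decay: the decay is applied directly to the weights,
# independently of the adaptive gradient scaling.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01
)

# In contrast, torch.optim.Adam(..., weight_decay=0.01) would add an L2 penalty
# to the gradients, the coupled variant that Loshchilov and Hutter showed to be
# less effective with adaptive optimizers.
```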
In LLM pretraining, models often see each training sample only once. Thus, overfitting to training data, which weight decay helps prevent in traditional deep learning scenarios, is only of concern if there are many similar or even identical samples in the training dataset. Still, weight decay positively affects training speed and the final loss.
According to a 2023 analysis by Francesco D’Angelo and colleagues at EPFL, this is because weight decay increases the effective learning rate. The effective learning rate at training step t is defined as LR(t)/||wₜ||², the learning rate scaled by the inverse squared norm of the weight vector. The smaller the weights, the larger the influence of a weight update. Further, D’Angelo and colleagues find that weight decay stabilizes training in reduced floating-point precision.
Gradient clipping
Gradient clipping caps gradient magnitudes, helping maintain numerical stability. In the river valley analogy, we impose a threshold on slope steepness when deciding where to move next. Rather than jumping off a cliff, we treat it as a moderately steep hillside.
There are two common types of gradient clipping:
- Clipping by value: Set predefined minimum and maximum values for gradient magnitudes. A gradient component is clipped to the respective limit if it exceeds these thresholds. This approach has the key benefit of not requiring access to the entire gradient vector.
- Clipping by norm: The entire gradient vector is scaled down if the norm exceeds a specified threshold. For example, Nvidia’s original Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper first published in 2019 notes: “[W]e use global gradient norm clipping of 1.0 to improve the stability of training large models.” In contrast to clipping by value, this preserves the gradient vector’s direction but requires access to the entire gradient vector to compute.
In 2022, Yang and Ma introduced the Component-Wise Gradient Norm Clipping (CWGNC) approach for fine-tuning LLMs. In a nutshell, CWGNC applies gradient-clipping by norm separately to components in the LLM, such as the key, query, and value matrices or feed-forward layers. This stabilizes the training of each component individually, which might progress at significantly different rates.
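In PyTorch, both clipping variants described above are available as utility functions that are called after the backward pass and before the optimizer step. A minimal sketch with a toy model (in practice, you would pick one of the two):

```python
import torch
from torch.nn.utils import clip_grad_norm_, clip_grad_value_

model = torch.nn.Linear(10, 10)  # toy stand-in for an actual LLM
loss = model(torch.randn(4, 10)).sum()
loss.backward()  # fills the .grad attributes

# Clipping by value: every gradient component is clamped to [-1.0, 1.0].
clip_grad_value_(model.parameters(), clip_value=1.0)

# Clipping by norm: the full gradient vector is rescaled if its global norm
# exceeds 1.0, preserving its direction (as in the Megatron-LM setting quoted above).
clip_grad_norm_(model.parameters(), max_norm=1.0)
```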
Next-token generation
LLMs are autoregressive language models. They predict the next token by taking the sequence of previously generated tokens as input and producing a vector containing a probability for each token in the vocabulary. Different post-processing techniques can be used to determine the next token from these probabilities.
Temperature
Typically, LLMs use a softmax function as the final step in computing token probabilities. A temperature parameter controls this function.
The temperature influences the degree of randomness (or “originality” or “creativity”) in an LLM’s predicted text. At low temperatures, the model becomes more deterministic, rarely considering less likely options and instead focusing on the tokens with the highest probabilities. Conversely, a high temperature increases unpredictability, allowing the model to choose from a broader range of tokens. Thus, lower temperatures are helpful when you need reliable answers, while higher temperatures lead to more varied and surprising outputs.
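The effect is easy to reproduce with a few lines of NumPy: dividing the logits by the temperature before the softmax sharpens the distribution for T < 1 and flattens it for T > 1. The logit values below are made up purely for illustration.

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into next-token probabilities at a given temperature."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = np.array([4.0, 2.0, 1.0])  # hypothetical scores for three candidate tokens
print(softmax_with_temperature(logits, 0.2))  # sharply peaked, near-deterministic
print(softmax_with_temperature(logits, 1.2))  # flatter, more room for surprises
```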
The Text Gen Playground Hugging Face Space allows users to experiment with different temperature settings and models. By inputting a prompt and adjusting the temperature parameter, you can observe how the model’s output varies from predictable and deterministic to creative and varied.
For example, using the prompt “The sun rises in the” at different temperatures:
- Low Temperature (e.g., T = 0.2): The model will likely complete the sentence with “east,” reflecting a common and expected continuation.
- High Temperature (e.g., T = 1.2): The model might generate more imaginative completions like “morning haze” or “golden skies,” showcasing increased creativity.
Adjusting the temperature parameter in such playgrounds provides valuable insights into controlling the balance between determinism and creativity in language model outputs.
Sampling strategy
Given the vector of probabilities, there are many ways to select the next token.
A straightforward strategy is always picking the most likely token. Since the sampling process only considers the probabilities for the very next token, this “greedy decoding” leads to highly probable multi-token sequences being discarded if they start with a token that – viewed in isolation – is less likely.
Using beam search or random sampling according to the token probabilities can mitigate this. While the former produces deterministic outputs and thus no variety, the latter can lead to the selection of highly improbable tokens, producing nonsensical sequences.
A more balanced approach is top-k sampling, which restricts sampling of the next token to the k most probable tokens. Alternatively, in top-p sampling, only the most likely tokens up to a cumulative probability of p are considered. This approach adapts dynamically to the probability distribution, sampling from many tokens in uncertain scenarios and picking from only a few when the model is more confident. (p and k can be adjusted during training or inference time.)
As ML engineers, we can fine-tune temperature and sampling strategy parameters according to our project needs. For example, if our tasks require precision (e.g., technical writing or summarization), we’ll use lower temperatures and top-k sampling to prioritize high-probability tokens. If we need more diversity, we’ll begin with common default values (temperature 0.7, top-k: k = 40, top-p: p = 0.9). We’ll iteratively adjust them based on the qualitative evaluation of outputs and document our findings to build a shared knowledge base with our team.
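With the Hugging Face Transformers library, these generation hyperparameters are passed directly to generate(). The sketch below uses GPT-2 purely to keep the example small; any causal LLM from the Hub works the same way, and the values are the common defaults mentioned above, not recommendations for a specific task.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sun rises in the", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # common starting point
    top_k=40,          # restrict sampling to the 40 most probable tokens
    top_p=0.9,         # ... and to a cumulative probability mass of 0.9
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```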
How do we find the optimal hyperparameters?
LLM training involves many hyperparameters, resulting in a combinatorial explosion of the search space. Simply guessing hyperparameters is unlikely to yield good results. Further, hyperparameters interact in complex ways, so the optimal value for one may depend on the values of others. Thus, adjusting hyperparameters one at a time may lead to suboptimal solutions, as we easily become trapped in local optima and don’t adequately explore the hyperparameter space.
Finding an optimal combination of hyperparameters requires a systematic approach. First, it’s paramount to understand the relevant hyperparameters and their influence on the particular LLM. It’s essential to research how similar architectures were trained or how the LLM we want to fine-tune was pre-trained. Further, we should clarify the available time, our compute budget, and the training objectives.
Next, we can sketch a roadmap. Can we afford to conduct experiments with particular hyperparameter combinations we believe are useful? Do we already have an experiment tracker and resource monitoring in place, or do we need to set it up first? What will be the decision points and criteria that ensure we end up with a fully trained LLM at the end of the project? Finally, we can start executing this roadmap and adjust our plans as we gather more information and insight.
The BLOOM team published a detailed paper on their preliminary experiments to determine the optimal model size and architecture. They describe how they started with GPT-3’s hyperparameters and conducted trial runs to estimate the optimal balance between model size and number of tokens given their fixed compute budget. Similar experiments were run by the Meta team that trained Llama 3, who also aimed to predict downstream task performance.
Can we use traditional machine learning hyperparameter optimization methods for LLMs?
Methods for systematic hyperparameter optimization have long been studied in machine learning:
- Learning curve analysis involves training models with varying hyperparameters over several epochs and plotting the loss to identify trends. In deep-learning models, plotting the gradient can further help assess whether and how efficiently a model learns.
- Grid search systematically steps through the hyperparameter space, training a model for each possible combination. Random search samples the hyperparameter space, training models for randomly selected combinations.
While these approaches have successfully been applied to optimize LLM hyperparameters, their use is severely limited by the fact that LLMs are very expensive to train. The computational and memory requirements make it unviable to train large numbers of models. If training a model takes several months on a large cluster, we’ll only get one shot at a full training run.
Advanced strategies for LLM hyperparameter optimization
Beyond starting from a well-known hyperparameter combination and systematically conducting experiments, there is a range of approaches for automatically identifying or optimizing LLM hyperparameters in specific circumstances.
Population-based training (PBT)
Population-Based Training (PBT) is an approach pioneered by Google DeepMind that combines the concepts of evolutionary search and online training. Instead of fixing hyperparameters at the start of training and leaving them static throughout the process, PBT adapts them dynamically, informed by the models’ performance.
In a nutshell, the population-based training process consists of the following steps:
- Set up a population of models, each with unique hyperparameters hi and weights θi.
- Train each model, updating θi every iteration.
- After a fixed number of iterations, evaluate each model’s performance on a validation dataset.
- Identify models that are underperforming relative to others. Replace their current weights and hyperparameters with those of a better-performing model (exploitation).
- Slightly perturb the hyperparameters of previously underperforming models to prevent the population from converging to a single configuration too early and improve diversity (exploration).
- Conclude the training if the compute budget is exhausted or the objective has been met. Otherwise, repeat the process starting from step 2.
This process initially appears resource-intensive since it requires maintaining and updating multiple models simultaneously, which can increase total GPU hours. However, PBT’s dynamic refinement of hyperparameters during training can significantly save wall-clock time. By avoiding restarting from scratch for each hyperparameter configuration and leveraging partially trained models, PBT reduces the number of training epochs needed to achieve optimal performance.
The 2017 DeepMind study on Population-Based Training (PBT) showcased its potential for LLMs by fine-tuning the first transformer model on the WMT 2014 English-German machine translation benchmark. They manually optimized a baseline model and compared it to a model where they used PBT to optimize the dropouts for different layers and the learning rate. Their evaluation showed that the PBT-optimized model outperformed their hand-tuned baseline. Further, they discovered that the learning rate schedule generated through PBT mimicked the human-created one: starting with a small learning rate, it then jumped to a high value before something resembling an exponential decay brought it down to a low value again. DeepMind’s original PBT transformer model also learned noticeably faster.
Ray Tune is a hyperparameter tuning library that supports population-based training. It is part of the open-source Ray framework for scaling machine-learning applications. The Ray Tune documentation includes an example of tuning BERT and RoBERTa on the GLUE benchmark dataset using population-based training.
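A condensed sketch of what a PBT setup with Ray Tune can look like is shown below. The training function is a placeholder that reports a dummy metric, and the exact reporting API differs between Ray versions, so treat this as an outline rather than a drop-in script.

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Hyperparameters listed in hyperparam_mutations are perturbed during exploration;
# underperforming trials copy the weights and config of better-performing ones.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="val_loss",
    mode="min",
    perturbation_interval=4,
    hyperparam_mutations={
        "lr": tune.loguniform(1e-5, 1e-3),
        "weight_decay": tune.uniform(0.0, 0.1),
    },
)

def train_fn(config):
    # Placeholder for a real training loop: build the model and optimizer from
    # config["lr"] and config["weight_decay"], train, and report validation loss.
    val_loss = 10.0
    for _ in range(100):
        val_loss *= 0.99  # dummy improvement; replace with actual training
        tune.report(val_loss=val_loss)

analysis = tune.run(train_fn, num_samples=8, scheduler=pbt)
```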
Bayesian optimization
Bayesian optimization is a popular method for efficiently navigating the hyperparameter space by building a probabilistic model (surrogate model) of the influence of the hyperparameters on the objective (e.g., validation loss). The surrogate model is used to predict promising hyperparameter combinations to try next. The results of this exploration are then used to refine the surrogate model.
The 2024 paper Crafting Efficient Fine-Tuning Strategies for Large Language Models investigates the applicability of Bayesian optimization to fine-tuning LLMs. First, a population of N models is trained for a pre-defined budget t1. As each model is trained, the surrogate model is updated, and the updated version is used to set the hyperparameters of the next model. Once all N models are trained, the top k models are selected and are trained up to t2. Finally, the best model among the k fully trained models is selected.
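As a minimal illustration of the general surrogate-model loop (not the paper’s exact two-stage procedure), the sketch below uses Optuna, which the hands-on section at the end of this article also relies on; its default TPE sampler is a form of Bayesian optimization. The run_short_finetuning function is a hypothetical placeholder for a budget-limited training run that returns a validation loss.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Sample a hyperparameter combination suggested by the surrogate model.
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    warmup_ratio = trial.suggest_float("warmup_ratio", 0.0, 0.1)
    # Hypothetical helper: fine-tune for a small, fixed budget and return the
    # validation loss achieved with this configuration.
    return run_short_finetuning(lr, batch_size, warmup_ratio)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```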
Adaptive Low-Rank Adaptation (LoRA)
Low-Rank Adaptation (LoRA) is a popular technique for reducing the memory footprint and computational demands when fine-tuning LLMs. In brief, the idea is to represent the weights of the fine-tuned model as
Wfine = Wpre + ∆W = Wpre + BA
Here, the fine-tuned weights Wfine are the sum of the original weights Wpre and a difference ∆W, which is the product of two matrices, B and A. Only B and A are updated during fine-tuning, while Wpre remains unchanged. If Wpre and ∆W have dimensions m x n, B and A have dimensions m x r and r x n, respectively. If the rank r is much smaller than m and n, the number of weights to be updated is greatly reduced, leading to faster training progress while requiring less memory.
In practice, it is often unclear to which LLM components LoRA should be applied for the best outcome. While we know that not all weights influence task performance equally, identifying which components are important for a particular objective would require extensive ablation studies. Thus, LoRA is often applied across all suitable weight matrices in a model.
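In practice, LoRA fine-tuning is commonly set up with Hugging Face’s peft library, where the rank r, the scaling factor lora_alpha, and the list of target modules are the central hyperparameters. A minimal sketch using GPT-2, whose fused attention projection is called c_attn; module names differ between architectures, so the values here are illustrative.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank matrices B and A
    lora_alpha=16,              # scaling factor applied to BA
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; e.g., "q_proj"/"v_proj" for Llama-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank A and B matrices are trainable
```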
AdaLoRA (Adaptive Low-Rank Adaptation) is a method to allocate a given parameter budget across weight matrices. The core idea is to apply LoRA to all LLM components but to use different values for the rank r. Important components use a matrix pair with a large r, leading to a ∆W with many weights. Less important components are approximated using a lower-rank matrix pair. AdaLoRA assigns an importance score to each component and sets the values for r such that the total number of weights remains within the user-defined budget. This leads to an optimal training outcome for a fixed compute and memory budget.
AdaMoLE (Adaptive Mixture of Low-Rank Adaptation Experts) similarly aims to reduce the number of weights that need to be updated. It replaces the single low-rank matrix pair of the original LoRA with a collection of multiple matrix pairs (LoRA experts) that are activated dynamically based on the input context. This enables the LLM to learn different tasks with a minimal total number of weights.

Hands-on: LLM hyperparameter optimization with neptune.ai
Optuna is a framework for automated hyperparameter search that uses Bayesian optimization techniques by default. It can be applied to various machine-learning tasks, including LLM hyperparameter tuning.
To see this in action, we’ve prepared a Colab notebook that walks you through the process of finding the optimal combination of learning rate, batch size, and number of epochs for fine-tuning a Hugging Face Transformers model on the IMDB dataset.
The tutorial uses neptune.ai to track training progress and analyze the different hyperparameters. If you don’t want to go through the tutorial yourself right now, you can still explore example results in this public Neptune project.
How about being one of the first to access Neptune Scale?
Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.
What’s next in LLM hyperparameter optimization?
Finding an optimal combination of hyperparameters is essential for training LLMs. In this article, we’ve reviewed key LLM hyperparameters and their influence on the model and training performance. We’ve also discussed how to approach hyperparameter optimization systematically and explored methods to assist or even automate this task in certain scenarios.
From the examples of hyperparameter choices for state-of-the-art LLMs, we’ve seen that while architectures, training tasks, and data change, most models are trained with relatively similar learning rate schedules and optimizer configurations. As our understanding of the model and training mechanics deepens and more experiments yield empirical evidence, we’ll likely see an evolution of the standard recipes and more diversity.