Machine Learning

Home Machine Learning

March 16, 2025

This article is part of a series of articles on automating Data Cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Start with the why

A spreadsheet containing information about awards given to films

Let’s consider this Excel spreadsheet, which contains information on awards given to films. It is sourced from the book Cleaning Data for Effective Data Science and is available here.

This is a typical and common spreadsheet that everyone may own and deal with in their daily tasks. But what is wrong with it?

To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:

Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
Tidy data: a well-normalised data frame that facilitates processing and manipulation.

The second point is the primary foundation of any analysis, including dealing with data quality.

Returning to our example, imagine we want to perform the following actions:

1. For each film involved in multiple awards, list the award and year it is associated with.

2. For each actor/actress winning multiple awards, list the film and award they are associated with.

3. Check that all actor/actress names are correct and well-standardised.

Naturally, this example dataset is small enough to derive those insights by eye or by hand if we structure it (as quickly as coding). But imagine now that the dataset contains the entire awards history; this would be time-consuming, painful, and error-prone without any automation.

Reading this spreadsheet and directly understanding its structure by a machine is difficult, as it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis—all without altering the actual content of the dataset.

Example of a reshaping of this data:

Example of a reshaping of the data from the previous spreadsheet:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.

The main challenge is how to turn a shiny and human-eye-pleasant spreadsheet into a machine-readable tidy version.

What is tidy data? A well-shaped data frame?

The term tidy data was described in a well‐known article named Tidy Data by Hadley Wickham and published in the Journal of Statistical Software in 2014. Below are the key quotes required to understand the underlying concepts better.

Data tidying

“Structuring datasets to facilitate manipulation, visualisation and modelling.”

“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”

Data structure

“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”

Data semantics

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”

“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:

Demographic data collected from each person (age, sex, race),
Medical data collected from each person on each day (number of sneezes, redness of eyes), and
Meteorological data collected on each day (temperature, pollen count).”

Tidy data

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.”

Common problems with messy datasets

Column headers might be values rather than variable names.

Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
Tidy version: A table with a “Year” column and each row representing an observation for a given year.

Multiple variables might be stored in one column.

Messy example: A column named “Age_Gender” containing values like 28_Female
Tidy version: Separate columns for “Age” and “Gender”

Variables might be stored in both rows and columns.

Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored as both column headers and repeated in rows instead of using a single “Subject” column.
Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.

Multiple types of observational units might be stored in the same table.

Messy example: A sales dataset that contains both customer information and store inventory in the same table.
Tidy version: Separate tables for “Customers” and “Inventory.”

A single observational unit might be stored in multiple tables.

Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
Tidy version: A single table or properly linked tables using a unique “Patient ID.”

Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.

Thinking about the how

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham (cf. Leo Tolstoy)

Although these guidelines sound clear in theory, they remain difficult to generalise easily in practice for any kind of dataset. In other words, starting with the messy data, no simple or deterministic process or algorithm exists to reshape the data. This is mainly explained by the singularity of each dataset. Indeed, it is surprisingly hard to precisely define variables and observations in general and then transform data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.

Thus, when complex and hardly maintainable rules-based systems are not suitable (i.e. to precisely deal with all contexts by describing decisions in advance), machine learning models may offer some benefits. This grants the system more freedom to adapt to any data by generalising what it has learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.

Then, let’s describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Diagram of a workflow made of code and LLM-based modules alongside business logic to reshape a spreadsheet

Spreadsheet encoder

This module is designed to serialise into text the main information needed from the spreadsheet data. Only the necessary subset of cells contributing to the table layout is retained, removing non-essential or overly repetitive formatting information. By retaining only the necessary information, this step minimises token usage, reduces costs, and enhances model performance.. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the topic of a next article.

Table structure analysis

Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of questions addressed:

How many tables are present, and what are their locations (regions) in the spreadsheet?
What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
Which rows/columns serve as headers, and do any tables have multi-level headers?
Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
Are there any merged cells, and if so, how should they be handled?

Table schema estimation

Once the analysis of the spreadsheet structure has been completed, it is now time to start thinking about the ideal target table schema. This involves letting the LLM process iteratively by:

Identifying all potential columns (multi-row headers, metadata, etc.)
Comparing columns for domain similarities based on column names and data semantics
Grouping related columns

The module outputs a final schema with names and a short description for each retained column.

Code generation to format the spreadsheet

Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Moreover, no useful content must be omitted (e.g. aggregated or computed values may still be derived from other variables).

As generating code that works well from scratch at the first iteration is challenging, two internal iterative processes are added to revise the code if needed:

Code checking: Whenever code cannot be compiled or executed, the trace error is provided to the model to update its code.
Data frame validation: The metadata of the created data frame—such as column names, first and last rows, and statistics about each column—is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.

Convert the data frame into an Excel file

Finally, if all data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.

Et voilà! The sky’s the limit for making the most of your newly tidy dataset.

Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Final note on the workflow

Why is a workflow proposed instead of an agent for that purpose?

At the time of writing, we consider that a workflow based on LLMs for precise sub-tasks is more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages: more freedom and liberty in actions to perform tasks. Nonetheless, they may still be hard to deal with in practice; for example, they may diverge quickly if the objective is not clear enough. I believe this is our case, but that does not mean that this model would not be applicable in the future in the same way as SWE-agent coding is performing, for example.

Next articles in the series

In upcoming articles, we plan to explore related topics, including:

A detailed description of the spreadsheet encoder mentioned earlier.
Data validity: ensuring each column meets the expectations.
Data uniqueness: preventing duplicate entities within the dataset.
Data completeness: handling missing values effectively.
Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.

The Impact of GenAI and Its Implications for Data Scientists

Machine Learning

dim

March 16, 2025

GenAI systems affect how we work. This general notion is well known. However, we are still unaware of the exact impact of GenAI. For example, how much do these tools affect our work? Do they have a larger impact on certain tasks? What does this mean for us in our daily work?

To answer these questions, Anthropic released a study based on millions of anonymized conversations on Claude.ai. The study provides data on how GenAI is incorporated into real-world tasks and reveals actual GenAI usage patterns.

In this article, I will go through the four main findings of the study. Based on the findings I will derive how GenAI changes our work and what skills we need in the future.

Main findings

GenAI is mostly used for software development and technical writing tasks, reaching almost 50 % of all tasks. This is likely due to LLMs being mostly text-based and thus being less useful for certain tasks.

GenAI has a stronger impact on some groups of occupations than others.More than one-third of occupations use GenAI in at least a quarter of their tasks. In contrast, only 4 % of occupations use it for more than three-quarters of their tasks. We can see that only very few occupations use GenAI across most of their tasks. This suggests that no job is being entirely automated.

GenAI is used for augmentation rather than automation, i.e., 57% vs 43 % of the tasks. But most occupations use both, augmentation and automation across tasks. Here, augmentation means the user collaborates with the GenAI to enhance their capabilities. Automation, in contrast, refers to tasks in which the GenAI directly performs the task. However, the authors guess that the share of augmentation is even higher as users might adjust GenAI answers outside of the chat window. Hence, what seems to be automation is actually augmentation. The results suggest that GenAI serves as an efficiency tool and a collaborative partner, resulting in improved productivity. These results align very well with my own experience. I mostly use GenAI tools to augment my work instead of automating tasks. In the article below you can see how GenAI tools have increased my productivity and what I use them for daily.

GenAI is mostly used for tasks associated with mid-to-high-wage occupations, such as data scientists. In contrast, the lowest and highest-paid roles show a much lower usage of GenAI. The authors conclude that this is due to the current limits of GenAI capabilities and practical barriers when it comes to using GenAI.

Overall, the study suggests that occupations will rather evolve than disappear. This is because of two reasons. First, GenAI integration remains selective rather than comprehensive within most occupations. Although many jobs use GenAI, the tools are only used selectively for certain tasks. Second, the study saw a clear preference for augmentation over automation. Hence, GenAI serves as an efficiency tool and a collaborative partner.

Limitations

Before we can derive the implications of GenAI, we should look at the limitations of the study:

It is unknown how the users used the responses. Are they copy-pasting code snippets uncritically or editing them in their IDE? Hence, some conversations that look like automation might have been augmentation instead.
The authors only used conversations from Claude.ai’s chat but not from API or Enterprise users. Hence, the dataset used in the analysis shows only a fraction of actual GenAI usage.
Automating the classification might have led to the wrong classification of conversations. However, due to the large amount of conversation used the impact should be rather small.
Claude being only text-based restricts the tasks and thus might exclude certain jobs.
Claude is advertised as a state-of-the-art coding model thus attracting mostly users for coding tasks.

Overall, the authors conclude that their dataset is not a representative sample of GenAI use in general. Thus, we should handle and interpret the results with care. Despite the study’s limitations, we can see some implications from the impact of GenAI on our work, particularly as Data Scientists.

Implications

The study shows that GenAI has the potential to reshape jobs and we can already see its impact on our work. Moreover, GenAI is rapidly evolving and still in the early stages of workplace integration.

Thus, we should be open to these changes and adapt to them.

Most importantly, we must stay curious, adaptive, and willing to learn. In the field of Data Science changes happen regularly. With GenAI tools change will happen even more frequently. Hence, we must stay up-to-date and use the tools to support us in this journey.

Currently, GenAI has the potential to enhance our capabilities instead of automating them.

Hence, we should focus on developing skills that complement GenAI. We need skills to augment workflows effectively in our work and analytical tasks. These skills lie in areas with low penetration of GenAI. This includes human interaction, strategic thinking, and nuanced decision-making. This is where we can stand out.

Moreover, skills such as critical thinking, complex problem-solving, and judgment will remain highly valuable. We must be able to ask the right questions, interpret the output of LLMs, and take action based on the answers.

Moreover, GenAI will not replace our collaboration with colleagues in projects. Hence, improving our emotional intelligence will help us to work together effectively.

Conclusion

GenAI is rapidly evolving and still in the early stages of workplace integration. However, we can already see some implications from the impact of GenAI on our work.

In this article, I showed you the main findings of a recent study from Anthropic on the use of their LLMs. Based on the results, I showed you the implications for Data Scientists and what skills might become more important.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article.

Hyperparameter Optimization For LLMs: Advanced Strategies

Machine Learning

dim

March 16, 2025

Finding an optimal set of hyperparameters is essential for efficient and effective training of Large Language Models (LLMs).

The key LLM hyperparameters influence the model size, learning rate, learning behavior, and token generation process.

Due to their computational demands, traditional methods for optimizing hyperparameters, such as grid search, are impractical for LLMs.

Advanced hyperparameter optimization strategies, like population-based training, Bayesian optimization, and adaptive LoRA, promise to balance computational effort and outcome.

The rise of large language models (LLMs) is bringing advances in text generation and contextual understanding. Hyperparameters control the size of LLMs, their training process, and how they generate outputs.

An optimal combination of hyperparameters is fundamental to efficiently pre-training and fine-tuning LLMs. Since LLM training is computationally intensive, exhaustive experimentation is not viable. This rules out traditional machine-learning hyperparameter optimization (HPO) methods that rely on systematically exploring the hyperparameter space by training many models with slightly different configurations.

When configuring models and training processes, LLM developers rely on a thorough understanding of each hyperparameter’s influence, insights from fundamental research, and empirical evidence gained from training state-of-the-art foundation models. Methods for estimating optimal hyperparameter values with limited compute budgets and adapting hyperparameters throughout the training process can help pre-training and fine-tuning.

After reading this article, you’ll be able to answer the following questions:

What key hyperparameters should be considered when developing, training, and applying LLMs?
How does each hyperparameter influence the LLM, and which trade-offs do we need to be aware of?
How can we select an optimal combination of hyperparameters in our scenario without fully training multiple model variants?
What advanced hyperparameter optimization techniques are available for LLMs, and when can we apply them?

LLM hyperparameters

A hyperparameter is a configuration value that controls the behavior of a machine-learning model during the training or inference process. Unlike model parameters (the weights), which are learned directly from the training data, hyperparameters are defined by the model developers. A hyperparameter can be constant or adjusted dynamically according to predefined rules or schedules.

Model size

In the case of LLMs, we often work with pre-trained models, where the activation functions, internal architecture of layers or blocks, and their connections—all examples of hyperparameters—are fixed. If our pre-trained LLM of choice is available in different sizes, the model size is the only hyperparameter affecting the model’s makeup we can actively control.

The size of an LLM refers to the total number of parameters it contains, which influences the model’s capacity to understand and generate complex language patterns. Hyperparameters set and tuned during pre-training influence the total size of an LLM.

One hyperparameter influencing a model’s size is its depth, corresponding to the total number of layers stacked sequentially. Each additional layer in an LLM adds more parameters, such as the weights for the self-attention mechanism and feed-forward layers in a transformer block.

Another hyperparameter influencing an LLM’s size is its hidden size, which refers to the dimensionality of the token embeddings and the internal representations within each layer. The hidden size determines how richly the model can encode information about each input token and how effectively it can process complex language patterns. A larger hidden size means each token is represented in a higher-dimensional space, allowing the model to capture more detailed semantic and syntactic nuances.

Further, the number of parallel attention heads in each transformer block influences the size of the LLM. Multiple heads allow the model to focus on different input aspects simultaneously. Through multi-query and grouped-query attention, we can reduce the number of necessary parameters.

Finally, the vocabulary size and context window (maximum sequence length) also impact the model’s size. They determine the language diversity a model can handle and the context length it can maintain, respectively.

These hyperparameters, set before beginning the training process and unable to be changed later, determine the model size. For example, GPT-3 has 96 layers, a hidden size of 12,288, 96 attention heads, a vocabulary of 50,257 tokens, and a context window of 2,048 tokens, resulting in a total of 175 billion parameters.

Learning rate

The learning rate (LR) is a critical hyperparameter in training LLMs. Optimizing these hyperparameters is essential for efficient learning, stable convergence, and good generalization to unseen data.

The learning rate determines how much model weights are changed during each update. A high learning rate helps speed up the training process but increases the risk of instability and overfitting. A low learning rate increases stability and tends to benefit generalization but leads to slow training.

In the case of LLMs, the learning rate is typically not constant but varies as training progresses. This variation is governed by a learning rate schedule (LRS). The schedule is usually tied to the number of tokens seen—either directly, or indirectly through the number of samples, steps, or epochs. At a high level, it contains phases of a rising, constant, and decreasing learning rate.

How does the learning rate affect training duration and quality?

Following theoretical work by Stanford researcher Kaiyue Wen and colleagues published in December 2024, we can think of LLM training as progressing along a loss landscape that looks like a river valley. They hypothesize that the existence and overall direction of the river are due to the facts and knowledge an LLM learns, which are reflected as highly deterministic and, therefore, easy-to-predict tokens. The valley slopes arise from flexibility and ambiguity inherent to language, i.e., hard-to-predict tokens.

Visualization of LLM training as traveling down a river valley. Using a stable but high learning rate ensures quick progress down the river but leads to jumps between relatively high loss values. Reducing the learning rate during a subsequent decay phase brings the model towards a local loss minimum. | Source

In this picture, the training goal is to reach the river mouth, at which point we should be as close to the bottom of the valley as possible. The first crucial insight is that it does not matter whether we stay at the bottom of the valley until then. Thus, if we can make faster progress down the river by bouncing back and forth between points high up the loss valley’s slopes, we can do this without affecting the final outcome.

Thus, we should aim to use a high learning rate—resulting in large steps towards the loss minimum but leading to wildly fluctuating loss values—for as long as possible. Towards the end of the training, the learning rate should be decreased to a very low value. This will slow down progress towards the river mouth but reduce the oscillations to a point where we constantly stay at the valley’s bottom, i.e., the local loss minimum.

However, all of this is only going to work if we are already in a sufficiently deep loss river valley. When training is first starting, a high learning rate will lead to undirected jumps across the loss landscape. To avoid this, learning rate schedules for LLMs start with a small learning rate and slowly ramp it up to its maximum value. This is called the warmup phase.

Cosine schedule

The cosine schedule (also known as cosine decay or cosine annealing) implements this approach by starting with a linear warmup phase that brings the learning rate to its maximum value, followed by a slow decay following the cosine function:

LR(t) = LR_min + 0.5 (LR_max – LR_min) (1 + cos(π t/T)

Here, LR_min and LR_max are the minimum and maximum learning rates, t is the training step, and T is the total number of training steps. The advantage of this schedule is that it stays close to the peak learning rate for a long time, and the final decay is gradual. It’s also easy to implement, as it depends on just three hyperparameters (LR_max, LR_min, and T) linked by the cosine function.

Cosine schedules have been highly popular for pretraining LLMs. For example, it was used for BLOOM, a 176-billion-parameter multilingual model developed by the BigScience Research Workshop and released in 2022. In an initial warmup phase, the learning rate was ramped to a peak of 6 x 10^-5 over 375 million tokens. Afterward, it was lowered to 10% of this value with cosine decay over 410 million tokens and remained at this value. The implementation and detailed description are publicly accessible in BLOOM’s GitHub repository.

For pre-training their Llama 3 405B model, Meta used a slightly more involved variant of the cosine schedule. In the first stage, a warm-up phase of up to 8,000 steps brought the learning rate to a maximum of 8 x 10^-5. Subsequently, the learning rate decreased to 8 x 10^-7 over 1.2 million steps with a cosine decay. After the second stage focused on training the LLM up to its final context length of 128,000 tokens, the learning rate linearly decreased to 0 over 40 million tokens in the third stage. Supervised fine-tuning was conducted over about 9,000 steps with a learning rate of 10^-5.

A major disadvantage of the cosine schedule is that the total number of training steps has to be known beforehand. When training large foundation models, the total compute budget is typically set, and the optimal number of training tokens can be estimated. However, when fine-tuning or experimenting, it would be preferable to base the decision on when to end training on the model’s performance.

Warmup-stable-decay schedule

The warmup-stable-decay (WSD) schedule is a simple protocol introduced by Shengding Hu and colleagues at Tsinghua University in 2024. It starts with a linear warmup to the maximum learning rate, keeps the learning rate constant for the majority of the training, and ramps it down at the end.

Through experiments, they found that a decay phase that makes up 10% of the total length is sufficient. They also demonstrated that a WSD schedule leads to a lower loss than a cosine schedule. According to Wen and colleagues at Stanford, this can readily be understood in the river valley picture. In the WSD schedule, the learning rate stays at a high value longer than in the cosine schedule. Hence, we make it further down the valley before dropping to its bottom. Further, their analysis shows that training progress in the stable phase is dominated by learning to predict deterministic tokens (facts and knowledge), while in the decay phase, the LLM learns the stochastic tokens (language variability).

Comparison of the loss curves resulting from a cosine and warmup-stable-decay (WSD) learning rate schedule. In the WSD schedule, the learning rate remains at a constant high value during the stable phase. This leads to high intermediate loss values as the loss fluctuates around the local minimum as it progresses towards lower values. During the final 10% of the total training steps, the learning rate is decreased to its minimum, leading to a sharp drop in the loss. Since the learning rate remained at a high value for longer, the final loss resulting from the WSD schedule is smaller than the loss from the cosine schedule. | Source

While a WSD schedule yields a lower loss for the same training budget, knowing the total number of training steps ahead of time is still required for scheduling the decay phase. However, the WSD schedule offers a straightforward way to extend the total number of training steps retroactively: If we find that our final model’s performance is unsatisfactory, we can resume training from a model snapshot taken at the end of the stable phase. This beams us back a small distance up the loss river valley, from where we continue making large jumpy steps towards the river mouth as if we had never descended down to the valley’s bottom in the first place.

Restarting this way, we still benefit from 90% of the compute budget spent so far. It allows us to determine the compute budget we need as we go, producing fully trained intermediate models—something that the cosine schedule inherently does not allow for.

Track months-long model training with more confidence. Use neptune.ai forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

Cyclical cosine schedule

Returning to a high learning rate after decaying to a minimum is not a new idea in machine learning. Long established in gradient-free optimization, it was made popular for deep learning training through the “Stochastic Gradient Descent with Warm Restarts” technique proposed by Ilya Loshchilov and Frank Hutter in 2017. The learning rate is governed by a function very similar to the one for the cosine schedule:

LR(t) = LR_min + 0.5 (LR_max − LR_min) (1 + cos(π (t mod T)/T))

This time, T is not the total number of training steps but is understood as the schedule’s period. For example, we might train for 10,000 steps with T = 1,000, leading to ten consecutive cosine decay cycles. Commonly, LR_max is set to a new, lower value at the beginning of each cycle.

In the loss landscape river valley, we’re climbing down to the bottom over T steps, making ever slower progress down the river as we keep closer to the bottom. Then, we immediately go back to make large jumps toward the river mouth high up the valley’s slopes.

Right at the beginning of a new cosine cycle, the loss will be significantly higher than it was previously. This could be due to the jump in the learning rate, which might perturb the model. However, Wen and colleagues argue, based on their experiments and theoretical insights, that it is the result of training with a small learning rate for too long.

Whatever the cause, this doesn’t just make training less efficient. It’s also an obstacle to continue model training later. Whether we aim to further pre-train on newly acquired or different data, fine-tune an LLM, or incrementally evolve a model in a continual learning scenario—ideally, we could take a model snapshot and train it effectively, making the most of the compute budget we have available and the compute budget we have already spent. The learning rate schedule used during pretraining directly impacts this.

Cyclical warmup-stable-decay schedule

The Warmup-Stable-Decay (WSD) schedule allows continuing training from the final model checkpoint of the stable phase without incurring a loss penalty. This preserves a large fraction of the compute budget spent, as we only have to discard what we spent on intermediate decay phases. But this is not negligible at the scale of LLM pretraining, where the costs regularly exceed tens of millions of US dollars.

As Wen and colleagues found, starting from the final decay phase model checkpoint in a WSD schedule does not cause the same loss penalty as the cosine schedule. As the WSD schedule’s decay phase is rather short, they hypothesize it does not have the same destructive effect as the cosine schedule’s long and slow decay. Given a total compute budget, consecutively repeating the WSD cycle is more efficient than restarting from the final checkpoint of the latest stable phase.

A cyclical WSD schedule is easier to implement than WSD restarts, as the model evolves continuously down the loss landscape river valley, and no prior checkpoints have to be reloaded. It also helps downstream users, who initially often utilize few-shot prompting to adapt an LLM to their use case. If they later decide to fine-tune it, and the LLM is trained with a WSD schedule, training the same model checkpoint they already use for inference is efficient.

Learning behavior

In a neural network, the weights are the parameters of its neurons learned during training. In an LLM, weights include the query, key, and value matrices in the attention heads and the activation function parameters in the feed-forward layers. While the learning rate governs the scale of changes made to the model’s weights, we can also control how the weights change on a more fine-grained level.

Weight decay

Employing weight decay during training penalizes large weights, preventing small parts of the model from dominating its output. Weight decay in stochastic gradient descent is implemented by adding a term to the loss function. For example, using L2 regularization, the adapted loss function looks like this:

Here, L_orig is the original loss function, λ is the weight decay factor, and w_i are the model weights.

Weight decay has been applied to transformer-based NLP models since the beginning. In the seminal 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the authors state that they trained the model using “Adam with [a] learning rate of 1e-4, β₁=0.9, β₂=0.999, L2 weight decay of 0.01, learning rate warm up over the first 10,000 steps, and linear decay of the learning rate.”

As Ilya Loshchilov and Frank Hutter point out in their 2019 paper Decoupled Weight Decay Regularization, in adaptive optimizers like Adam, L2 regularization and weight decay are not identical, and L2 regularization is not effective. In Adam, the gradient of the regularization term is scaled with the gradient of L_orig, which leads to minimal regularization for terms in L for which the gradient is large. They introduced the AdamW optimizer, where the weight decay term is independent of the gradient-based update. AdamW is widely used for LLMs, such as for training Megatron-LM (2019), Llama 1 (2023), Llama 2 (2023), and Llama 3 (2024).

In LLM pretraining, models often see each training sample only once. Thus, overfitting to training data, which weight decay helps prevent in traditional deep learning scenarios, is only of concern if there are many similar or even identical samples in the training dataset. Still, weight decay positively affects training speed and the final loss.

According to a 2023 analysis by Francesco D’Angelo and colleagues at EPFL, this is because weight decay increases the effective learning rate. The effective learning rate at training step t is defined as LR(t)/||w_t||₂, the learning rate scaled by the inverse norm of the weight vector. The smaller the weights, the larger the influence of a weight update. Further, D’Angelo and colleagues find that weight decay stabilizes training in reduced floating-point precision.

Gradient clipping

Gradient clipping caps gradient magnitudes, helping maintain numerical stability. In the river valley analogy, we impose a threshold on slope steepness when deciding where to move next. Rather than jumping off a cliff, we treat it as a moderately steep hillside.

There are two common types of gradient clipping:

Clipping by value: Set predefined minimum and maximum values for gradient magnitudes. A gradient component is clipped to the respective limit if it exceeds these thresholds. This approach has the key benefit of not requiring access to the entire gradient vector.
Clipping by norm: The entire gradient vector is scaled down if the norm exceeds a specified threshold. For example, Nvidia’s original Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper first published in 2019 notes: “[W]e use global gradient norm clipping of 1.0 to improve the stability of training large models.” In contrast to clipping by value, this preserves the gradient vector’s direction but requires access to the entire gradient vector to compute.

In 2022, Yang and Ma introduced the Component-Wise Gradient Norm Clipping (CWGNC) approach for fine-tuning LLMs. In a nutshell, CWGNC applies gradient-clipping by norm separately to components in the LLM, such as the key, query, and value matrices or feed-forward layers. This stabilizes the training of each component individually, which might progress at significantly different rates.

Next-token generation

LLMs are autoregressive language models. They predict the next token by taking the sequence of previously generated tokens as input and producing a vector containing a probability for each token in the vocabulary. Different post-processing techniques can be used to determine the next token from these probabilities.

Temperature

Typically, LLMs use a softmax function as the final step in computing token probabilities. A temperature parameter controls this function.

The temperature influences the degree of randomness (or “originality” or “creativity”) in an LLM’s predicted text. At low temperatures, the model becomes more deterministic, rarely considering less likely options and instead focusing on the tokens with the highest probabilities. Conversely, a high temperature increases unpredictability, allowing the model to choose from a broader range of tokens. Thus, lower temperatures are helpful when you need reliable answers, while higher temperatures lead to more varied and surprising outputs.

The Text Gen Playground Hugging Face Space allows users to experiment with different temperature settings and models. By inputting a prompt and adjusting the temperature parameter, you can observe how the model’s output varies from predictable and deterministic to creative and varied.

For example, using the prompt “The sun rises in the” at different temperatures:

Low Temperature (e.g., T = 0.2): The model will likely complete the sentence with “east,” reflecting a common and expected continuation.
High Temperature (e.g., T = 1.2): The model might generate more imaginative completions like “morning haze” or “golden skies,” showcasing increased creativity.

Adjusting the temperature parameter in such playgrounds provides valuable insights into controlling the balance between determinism and creativity in language model outputs.

Sampling strategy

Given the vector of probabilities, there are many ways to select the next token.

A straightforward strategy is always picking the most likely token. Since the sampling process only considers the probabilities for the very next token, this “greedy decoding” leads to highly probable multi-token sequences being discarded if they start with a token that – viewed in isolation – is less likely.

Using beam search or random sampling according to the token probabilities can mitigate this. While the former produces deterministic outputs and thus no variety, the latter can lead to the selection of highly improbable tokens, producing nonsensical sequences.

A more balanced approach is top-k sampling, which restricts sampling of the next token to the k most probable tokens. Alternatively, in top-p sampling, only the most likely tokens up to a cumulative probability of p are considered. This approach adapts dynamically to the probability distribution, sampling from many tokens in uncertain scenarios and picking from only a few when the model is more confident. (p and k can be adjusted during training or inference time.)

As ML Engineers, we can fine-tune temperature and sampling strategy parameters according to your project needs. For example, if our tasks require precision (e.g., technical writing or summarization), we’ll use lower temperatures and top-k sampling to prioritize high-probability tokens. If we need more diversity, we’ll begin with common default values (temperature 0.7, top-k: k = 40, top-p: p = 0.9). We’ll iteratively adjust them based on the qualitative evaluation of outputs and document our findings to build a shared knowledge base with your team.

How do we find the optimal hyperparameters?

LLM training involves many hyperparameters, resulting in a combinatorial explosion of the search space. Simply guessing hyperparameters is unlikely to yield good results. Further, hyperparameters interact in complex ways, so the optimal value for one may depend on the values of others. Thus, adjusting hyperparameters one at a time may lead to suboptimal solutions, as we easily become trapped in local optima and don’t adequately explore the hyperparameter space.

Finding an optimal combination of hyperparameters requires a systematic approach. First, it’s paramount to understand the relevant hyperparameters and their influence on the particular LLM. It’s essential to research how similar architectures were trained or how the LLM we want to fine-tune was pre-trained. Further, we should clarify the available time, our compute budget, and the training objectives.

Next, we can sketch a roadmap. Can we afford to conduct experiments with particular hyperparameter combinations we believe are useful? Do we already have an experiment tracker and resource monitoring in place, or do we need to set it up first? What will be the decision points and criteria that ensure we end up with a fully trained LLM at the end of the project? Finally, we can start executing this roadmap and adjust our plans as we gather more information and insight.

The BLOOM team published a detailed paper on their preliminary experiments to determine the optimal model size and architecture. They describe how they started with GPT-3’s hyperparameters and conducted trial runs to estimate the optimal balance between model size and number of tokens given their fixed compute budget. Similar experiments were run by the Meta team that trained Llama3, who also aimed to predict downstream task performance.

Can we use traditional machine learning hyperparameter optimization methods for LLMs?

Methods for systematic hyperparameter optimization have long been studied in machine learning:

Learning curve analysis involves training models with varying hyperparameters over several epochs and plotting the loss to identify trends. In deep-learning models, plotting the gradient can further help assess whether and how efficiently a model learns.

Grid search systematically steps through the hyperparameter space, training a model for each possible combination. Random search samples the hyperparameter space, training models for randomly selected combinations.

While these approaches have successfully been applied to optimize LLM hyperparameters, their use is severely limited by the fact that LLMs are very expensive to train. The computational and memory requirements make it unviable to train large numbers of models. If training a model takes several months on a large cluster, we’ll only get one shot at a full training run.

Advanced strategies for LLM hyperparameter optimization

Beyond starting from a well-known hyperparameter combination and systematically conducting experiments, there is a range of approaches for automatically identifying or optimizing LLM hyperparameters in specific circumstances.

Population-based training (PBT)

Population-Based Training (PBT) is an approach pioneered by Google DeepMind that combines the concepts of evolutionary search and online training. Instead of fixing hyperparameters at the start of training and leaving them static throughout the process, PBT adapts them dynamically, informed by the models’ performance.

In a nutshell, the population-based training process consists of the following steps:

Set up a population of models, each with unique hyperparameters hi and weights i.
Train each model, updating i every iteration.
After a fixed number of iterations, evaluate each model’s performance on a validation dataset.
Identify models that are underperforming relative to others. Replace their current weights and hyperparameters with those of a better-performing model (exploitation).
Slightly perturb the hyperparameters of previously underperforming models to prevent the population from converging to a single configuration too early and improve diversity (exploration).
Conclude the training if the compute budget is exhausted or the objective has been met. Otherwise, repeat the process starting from step 2.

This process initially appears resource-intensive since it requires maintaining and updating multiple models simultaneously, which can increase total GPU hours. However, PBT’s dynamic refinement of hyperparameters during training can significantly save wall-clock time. By avoiding restarting from scratch for each hyperparameter configuration and leveraging partially trained models, PBT reduces the number of training epochs needed to achieve optimal performance.

The 2017 DeepMind study on Population-Based Training (PBT) showcased its potential for LLMs by fine-tuning the first transformer model on the WMT 2014 English-German machine translation benchmark. They manually optimized a baseline model and compared it to a model where they used PBT to optimize the dropouts for different layers and the learning rate. Their evaluation showed that the PBT-optimized model outperformed their hand-tuned baseline. Further, they discovered that the learning rate schedule generated through PBT mimicked the human-created one. Starting with a small learning rate, it then jumped to a high value before something resembling an exponential decay” brought it down to a low value again. DeepMind’s original PBT transformer model also learned noticeably faster.

Ray Tune is a hyperparameter tuning library that supports population-based training. It is part of the open-source Ray framework for scaling machine-learning applications. The Ray Tune documentation includes an example of tuning BERT and RoBERTa on the GLUE benchmark dataset using population-based training.

Bayesian optimization

Bayesian optimization is a popular method for efficiently navigating the hyperparameter space by building a probabilistic model (surrogate model) of the influence of the hyperparameters on the objective (e.g., validation loss). The surrogate model is used to predict promising hyperparameter combinations to try next. The results of this exploration are then used to refine the surrogate model.

The 2024 paper Crafting Efficient Fine-Tuning Strategies for Large Language Models investigates the applicability of Bayesian optimization to fine-tuning LLMs. First, a population of N models is trained for a pre-defined budget t₁. As each model is trained, the surrogate model is updated, and the updated version is used to set the hyperparameters of the next model. Once all N models are trained, the top k models are selected and are trained up to t₂. Finally, the best model among the k fully trained models is selected.

Adaptive Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a popular technique for reducing the memory footprint and computational demands when fine-tuning LLMs. In brief, the idea is to represent the weights of the fine-tuned model as

W_fine = W_pre + ∆W = W_pre + BA

Here, the fine-tuned weights W_fine are the sum of the original weights W_pre and a difference ∆W, which is the product of two matrices, B and A. Only B and A are updated during fine-tuning, while W_pre remains unchanged. If W_pre and ∆W have dimensions m x n, B and A have dimensions m x r and r x n, respectively. If the rank r is much smaller than m and n, the number of weights to be updated is greatly reduced, leading to faster training progress while requiring less memory.

In practice, it is often unclear to which LLM components LoRA should be applied for the best outcome. While we know that not all weights influence task performance equally, identifying which components are important for a particular objective would require extensive ablation studies. Thus, LoRA is often applied across all suitable weight matrices in a model.

AdaLoRA (Adaptive Low-Rank Adaptation) is a method to allocate a given parameter budget across weight matrices. The core idea is to apply LoRA to all LLM components but to use different values for the rank r. Important components use a matrix pair with a large r, leading to a ∆W with many weights. Less important components are approximated using a lower-rank matrix pair. AdaLoRA assigns an importance score to each component and sets the values for r such that the total number of weights remains within the user-defined budget. This leads to an optimal training outcome for a fixed compute and memory budget.

AdaMoLE (Adaptive Mixture of Low-Rank Adaptation Experts) similarly aims to reduce the number of weights that need to be updated. It replaces the single low-rank matrix pair of the original LoRA with a collection of multiple matrix pairs (LoRA experts) that are activated dynamically based on the input context. This enables the LLM to learn different tasks with a minimal total number of weights.

Fine-tuning an LLM with the Adaptive Mixture of Low-Rank Adaptation Experts approach. The fine-tuned weights are approximated as the sum of the frozen pre-trained weights and a number of so-called LoRA experts that are activated by a gating function and a threshold function. Different LoRA experts specialize in different contexts, allowing the LLM to learn different tasks with a minimal number of weights. | Modified based on: source

Hands-on: LLM hyperparameter optimization with neptune.ai

Optuna is a framework for optimizing hyperparameter search using Bayesian optimization. It can be applied to various machine-learning tasks, including LLM hyperparameter tuning.

To see this in action, we’ve prepared a Colab notebook that walks you through the process of finding the optimal combination of learning rate, batch size, and number of epochs for fine-tuning a Hugging Face Transformers model on the IMBD dataset.

The tutorial uses neptune.ai to track training progress and analyze the different hyperparameters. If you don’t want to go through the tutorial yourself right now, you can still explore example results in this public Neptune project.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

What’s next in LLM hyperparameter optimization?

Finding an optimal combination of hyperparameters is essential for training LLMs. In this article, we’ve reviewed key LLM hyperparameters and their influence on the model and training performance. We’ve also discussed how to approach hyperparameter optimization systematically and explored methods to assist or even automate this task in certain scenarios.

From the examples of hyperparameter choices for state-of-the-art LLMs, we’ve seen that while architectures, training tasks, and data change, most models are trained with relatively similar learning rate schedules and optimizer configurations. As our understanding of the model and training mechanics deepens and more experiments yield empirical evidence, we’ll likely see an evolution of the standard recipes and more diversity.

Was the article useful?

Explore more content topics:

One Turn After Another | Towards Data Science

Machine Learning

dim

March 16, 2025

While some games, like rock-paper-scissors, only work if all payers decide on their actions simultaneously, other games, like chess or Monopoly, expect the players to take turns one after another. In Game Theory, the first kind of game is called a static game, while turn-taking is a property of so-called dynamic games. In this article, we will analyse the latter with methods from game theory.

This article is the fourth part of a four-chapter series on the fundamentals of game theory. I recommend you to read the first three articles if you haven’t done that yet, as the concepts shown here will build on the terms and paradigms introduced in the previous articles. But if you are already familiar with the core fundamentals of game theory, don’t let yourself be stopped, and go ahead!

Dynamic games

Dynamic games can be visualized as trees. Photo by Adarsh Kummur on Unsplash

While so far we only looked at static games, we will now introduce dynamic games where payers take turns. As previously, such games include a number of players n, a set of actions for each player, and a reward function that assesses the actions of a player given the other players’ actions. Beyond that, for a dynamic game, we need to define an order in which the players take their turns. Consider the following tree-like visualization of a dynamic game.

A visualization of a dynamic game. Figure by author.

At the top we have a node where player 1 has to decide between two actions L and R. This determines whether to follow the left part or the right part of the tree. After player 1’s turn, player 2 takes their turn. If player 1 chooses L, player 2 can decide between l1 and r1. If player 1 chooses R, player 2 has to decide between l2 and r2. At the leaves of the tree (the nodes at the bottom), we see the rewards just like we had them in the matrix cells in static games. For example, if player 1 decides for L and player 2 decides for r1, the reward is (1,0); that is, player 1 gets a reward of 1, and player 2 gets a reward of 0.

I bet you are eager to find the Nash equilibrium of this game, as this is what Game Theory is mainly about (if you still struggle with the concept of Nash equilibrium, you might want to take a look back at chapter 2 of this series). To do that, we can transform the game into a matrix, as we already know how to find a Nash equilibrium in a game displayed as a matrix. Player 1 decides on the row of the matrix, player 2 decides on the column and the values in the cell then specifies the reward. However, there is one important point to notice. When we look at the game displayed as a tree, player 2 decides on their action after player 1 does and hence only cares about the part of the tree that is actually reached. If player 1 chooses action L, player 2 only decides between l1 and r1 and doesn’t care about l2 and r2, because these actions are out of the question anyway. However, when we search for a Nash Equilibrium, we need to be aware of what would happen, if player 1 would change their action. Therefore, we must know what player 2 would have done if player 1 had chosen a different option. That is why we have four columns in the following matrix, to always account for decisions in both parts of the tree.

A column like (r1,l2) can be read as “player 2 chooses r1 if player 1 chose L and chooses l2 if player 1 chose R”. On this matrix, we can search for the best answers. For example, the cell (L, (l1,l2)) with reward 3,1 is a best answer. Player 1 has no reason to change from L to R because that would lower his reward (from 3 to 1), and Player 2 has no reason to change either because none of the other options is better (one is as good, though). In total, we find three Nash equilibria, which are underlined in the upcoming matrix:

The chocolate-pudding market

We will talk about chocolate pudding now. But also about game theory. Photo by American Heritage Chocolate on Unsplash

Our next example brings the idea of dynamic games to life. Let’s assume player 2 is a market-leading retailer of chocolate pudding. Player 1 also wants to build up his business but isn’t sure yet whether to join the chocolate pudding market or whether they rather should sell something else. In our game, player 1 has the first turn and can decide between two actions. Join the market (i.e., sell chocolate pudding), or don’t join the market (i.e., sell something else). If player 1 decides to sell something other than chocolate pudding, player 2 stays the market-dominating retailer for chocolate pudding and player 1 makes some money in the other area they decided for. This is reflected by the reward 1,3 in the right part of the tree in the following figure.

The market-game as a dynamic game. Figure by author.

But what if player 1 is greedy for the unimaginable riches that lie dormant on the chocolate pudding market? If they decide to join the market, it is player 2’s turn. They can decide to accept the new competitor, give in and share the market. In this case, both players get a reward of 2. But player 2 can also decide to start a price war to demonstrate his superiority to the new competitor. In this case, both players get a reward of 0, because they ruin their profit due to dumping prices.

Just like before, we can turn this tree into a matrix and find the Nash equilibria by searching for the best answers:

If player 1 joins the market, the best option for player 1 is to give in. This is an equilibrium because no player has any reason to change. For player 1 it does not make sense to leave the market (that would give a reward of 1 instead of 2) and for player 2 it is no good idea to switch to fighting either (which would give a reward of 0 instead of 2). The other Nash equilibrium happens when player 1 just doesn’t join the market. However, this scenario includes player 2’s decision to fight, if player 1 had chosen to join the market instead. He basically makes a threat and says “If you join the market, I will fight you.” Remember that previously we said we need to know what the players would do even in the cases that don’t appear to happen? Here we see why this is important. Player 1 needs to assume that player 2 would fight because that is the only reason for player 1 to stay out of the market. If player 2 wouldn’t threaten to fight, we wouldn’t have a Nash equilibrium, because then joining the market would become a better option for player 1.

But how reasonable is this threat? It keeps player 1 outside the market, but what would happen if player 1 didn’t believe the threat and decided to still join the market? Would player 2 really carry out his threat and fight? That would be very silly because it would give him a reward of 0, whereas giving in would give a reward of 2. From that perspective, player 2 used an empty threat that is not very reasonable. If the case really occurs, he wouldn’t carry it out anyway, would he?

Subgame perfect equilibrium

The previous example showed that sometimes Nash equilibria occur, that are not very reasonable within the game. To cope with this problem, a more strict concept of equilibrium has been introduced which is called a subgame perfect equilibrium. This adds some stricter conditions to the notion of an equilibrium. Hence every subgame perfect equilibrium is a Nash equilibrium, but not all Nash equilibria are subgame perfect.

A Nash equilibrium is subgame perfect if every subgame of this equilibrium is a Nash equilibrium itself. What does that mean? First, we have to understand that a subgame is a part of the game’s tree that starts at any node. For example, if player 1 chooses L, the remainder of the tree under the node reached by playing L is a subgame. In a likewise fashion, the tree that comes after the node of action R is a subgame. Last but not least, the whole game is always a subgame of itself. As a consequence, the example we started with has three subgames, which are marked in grey, orange and blue in the following:

The market game has three subgames. Figure by author.

We already saw, that this game has three Nash equilibria which are (L,(l1,l2)), (L, (l1,r2)) and (R,(r1,r2)). Let us now find out which of these are subgame perfect. To this end, we investigate the subgames one after another, starting with the orange one. If we only look at the orange part of the tree, there is a single Nash equilibrium that occurs if player 2 chooses l1. If we look at the blue subgame, there is also a single Nash equilibrium that is reached when player 2 chooses r2. Now that tells us that in every subgame perfect Nash equilibrium, player 2 has to choose option l1 if we arrive in the orange subgame (i.e. if player 1 chooses L) and player 2 has to choose option r2 if we arrive at the blue subgame (i.e., if player 1 chooses R). Only one of the previous Nash equilibria fulfills this condition, namely (L,(l1,r2)). Hence this is the only subgame perfect Nash equilibrium of the whole game. The other two versions are Nash equilibria as well, but they are somewhat unlogical in the sense, that they contain some kind of empty threat, as we had it in the chocolate pudding market example before. The method we just used to find the subgame perfect Nash equilibrium is called backwards induction, by the way.

Uncertainty

In dynamic games, it can happen that you have to make decisions without knowing exactly what node of the game you are in. Photo by Denise Jans on Unsplash

So far in our dynamic games, we always knew which decisions the other players made. For a game like chess, this is the case indeed, as every move your opponent makes is perfectly observable. However, there are other situations in which you might not be sure about the exact moves the other players make. As an example, we go back to the chocolate pudding market. You take the perspective of the retailer that is already in the market and you have to decide whether you would start fighting if the other player joins the market. But there is one thing you don’t know, namely how aggressive your opponent will be. When you start fighting, will they be frightened easily and give up? Or will they be aggressive and fight you until only one of you is left? This can be seen as a decision made by the other player that influences your decision. If you expect the other player to be a coward, you might prefer to fight, but if they turn out to be aggressive, you would rather want to give in (reminds you of the birds fighting for food in the previous chapter, doesn’t it?). We can model this scenario in a game like this:

A dynamic game with a hidden decision (indicated by the dotted circle). Figure by author.

The dotted circle around the two nodes indicates, that these are hidden decisions that are not observable to everyone. If you are player 2, you know whether player 1 joined the market or not, but if they joined, you don’t know whether they are aggressive (left node) or moderate (right node). Hence you act under uncertainty, which is a very common ingredient in many games you play in the real world. Poker would become very boring if everybody knew everyone’s cards, that’s why there is private information, namely the cards on your hand only you know about.

Now you still have to decide whether to fight or give in, although you are not exactly sure what node of the tree you are in. To do that, you have to make assumptions about the likelihood of each state. If you are quite certain that the other player is behaving moderately, you might be up for a fight, but if you assume them to be aggressive, you might prefer giving in. Say there is a Probability p that the other player is aggressive and 1-p that they behave moderately. If you assume p to be high, you should give in, but if p becomes smaller, there should be a point where your decision switches to fighting. Let’s try to find that point. In particular, there should be a sweet spot in between where the probability of the other player being aggressive vs. moderate is such that fighting and giving in are equal alternatives to one another. That is, the rewards would be equal, which we can model as follows:

Do you see how this formula is derived from the rewards for fighting or giving in in the different leaves of the tree? This formula solves to p=1/3, so if the probability of the other player being aggressive is 1/3 it would make no difference whether to fight or give in. But if you assume the other player to be aggressive with a probability of more than 1/3, you should give in, and if you assume aggressiveness to be less likely than 1/3, you should fight. This is a chain of thought you also have in other games where you act under uncertainty. When you play poker, you might not calculate the probabilities exactly, but you ask yourself, “How likely is it that John has two kings on his hand?” and depending on your assumption of that probability, you check, raise or give up.

Summary & outlook

Your journey on the seas of game theory has only just begun. There is so much more to explore. Photo by George Liapis on Unsplash

Now we have learned a lot about dynamic games. Let us summarize our key findings.

Dynamic games include an order in which players take turns.
In dynamic games, the players’ possible actions depend on the previously executed actions of the other players.
A Nash equilibrium in a dynamic game can be implausible, as it contains an empty threat that would not be rational.
The concept of subgame perfect equilibria prevents such implausible solutions.
In dynamic games, decisions can be hidden. In that case, players may not exactly know which node of the game they are in and have to assign probabilities to different states of the games.

With that, we have reached the end of our series on the fundamentals of game theory. We have learned a lot, yet there are plenty of things we haven’t been able to cover. Game theory is a science in itself, and we have only been able to scratch the surface. Other concepts that expand the possibilities of game-theoretic analyses include:

Analysing games that are repeated multiple times. If you play the prisoner’s dilemma multiple times, you might be tempted to punish the other player for having betrayed you in the previous round.
In cooperative games, players can conclude binding contracts that determine their actions to reach a solution of the game together. This is different from the non-cooperative games we looked at, where all players are free to decide and maximize their own reward.
While we only looked at discrete games, where each player has a finite number of actions to choose from, continuous games allow an infinite number of actions (e.g., any number between 0 and 1).
A big part of game theory considers the usage of public goods and the problem that individuals might consume these goods without contributing to their maintenance.

These concepts allow us to analyse real-world scenarios from various fields such as auctions, social networks, evolution, markets, information sharing, voting behaviour and much more. I hope you enjoyed this series and find meaningful applications for the knowledge you gained, be it the analysis of customer behaviour, political negotiations or the next game night with your friends. From a game theory perspective, life is a game!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in the English language could be this one:

Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

Multimodal Large Language Models

Machine Learning

dim

March 16, 2025

Multimodal Large Language Models (MLLMs) process data from different modalities like text, audio, image, and video.

Compared to text-only models, MLLMs achieve richer contextual understanding and can integrate information across modalities, unlocking new areas of application. Prime use cases of MLLMs include content creation, personalized recommendations, and human-machine interaction.

Examples of MLLMs that process image and text data include Microsoft’s Kosmos-1, DeepMind’s Flamingo, and the open-source LLaVA. Google’s PaLM-E additionally handles information about a robot’s state and surroundings.

Combining different modalities and dealing with different types of data comes with some challenges and limitations, such as alignment of heterogeneous data, inherited biases from pre-trained models, and lack of robustness.

How would you translate this sentence: “The glasses are broken.” into French: “Les verres sont cases.” or “Les lunettes sont cases.”? What if you have an image? Will you be able to choose the correct translation? As humans, we use different modalities daily to enhance communication. Machines can do the same.

Access to visual context can resolve ambiguity when translating between languages. In this example, the image of drinking glasses resolves the ambiguity in the meaning of “glasses” when translating the sentence from English to French. | Modified based on: source

While Large Language Models (LLMs) have shown impressive capabilities in understanding complex text, they are limited to a single data modality. However, many tasks span several modalities.

This article explores Multimodal Large Language Models, exploring their core functionalities, challenges, and potential for various machine-learning domains.

What is a multimodal large language model?

Let’s break down the concept of Multimodal Large Language Models (MLLMs) by first understanding the terms “modal” and “multimodal:”

“Modal” refers to a particular way of communicating or perceiving information. It’s like a channel through which we receive and express ourselves. Some of the common modalities are:

Visual: Sight, including images, videos, and spatial information.
Auditory: Hearing, including sounds, music, and speech.
Textual: Written language, including words, sentences, and documents.
Haptic: Touch, including sensations of texture, temperature, and pressure.
Olfactory: Smell

“Multimodal” refers to incorporating various modalities to create a richer understanding of the task, e.g., as on a website or in a blog post that integrates text with visuals.

MLLMs can process not just text but other modalities as well. They are trained on samples containing different modalities, which allows them to develop joint representations and utilize multimodal information to solve tasks.

Why do we need multimodal LLMs?

Many industries heavily rely on multimodality, particularly those that handle a blend of data modalities. For example, MLLMs can be used in a healthcare setting to process patient reports comprising doctor notes (text), treatment plans (structured data), and X-rays or MRI scans (images).

Example of a multi-modal model. The model is trained on X-rays, medical reports, actions, and texts describing the diagnosis and outcome. This way, the model learns to use visual and textual information to predict potential diagnoses. | Modified based on: source

MLLMs process and integrate information from different modalities (i.e., text, image, video, and audio), essential to solving many tasks. Some prominent applications are:

Content creation: MLLMs can generate image captions, transform text into visually descriptive narratives, or create multimedia presentations, making them valuable tools for creative and professional industries.

Enhanced human-machine interaction: By understanding and responding to inputs from diverse modalities such as text, speech, and images, MLLMs enable more natural communication. This can enrich the user experience in applications like virtual assistants, chatbots, and smart devices.

Personalized recommendations: MLLMs contribute to refining recommendation systems by analyzing user preferences across diverse modalities. Whether suggesting movies based on textual reviews, recommending products through image recognition, or personalizing content recommendations across varied formats, these models elevate the precision and relevance of recommendations.

Domain-specific problem solving: MLLMs are adaptable and invaluable in addressing challenges across various domains. In healthcare, their capability to interpret medical images aids in diagnostics, while in education, they enhance learning experiences by providing enriched materials that seamlessly combine text and visuals.

How do multimodal LLMs work?

A typical multimodal LLM has three primary modules:

The input module comprises specialized neural networks for each specific data type that output intermediate embeddings.
The fusion module converts the intermediate embeddings into a joint representation.
The output module generates outputs based on the task and the processed information. An output could be, e.g., a text, a classification (like “dog” for an image), or an image. Some MLLMs, like Google’s Gemini family, can produce outputs in more than one modality.

Basic structure of a multimodal LLM. Different modalities are processed by separate input modules. Then, the extracted information is joined in the fusion module. The output module (in this case, a classifier) generates the output in the desired modality.

Examples of multimodal LLMs

Microsoft: Kosmos-1

Kosmos-1 (GitHub) is a multimodal LLM created by Microsoft for natural language and perception-intensive tasks. It can perform visual dialogue, visual explanation, visual question answering, image captioning, math equations, OCR, and zero-shot image classification with and without descriptions.

Architecture and training

Kosmos-1 processes inputs consisting of text and encoded image embeddings. Image embeddings are obtained through the pre-trained CLIP ViT-L/14 (GitHub) model. An embedding module processes this input before feeding it into a transformer-based decoder based on Magneto.

Kosmos-1 used the same initialization as the Magneto transformer for better optimization stability. To capture position information more precisely and better generalize to different sequence lengths (short sequences for training, long ones during testing), Kosmos-1 used xPOS as a relative position encoder.

Kosmos-1 has about 1.6 billion parameters in total, which is smaller than rival models like Flamingo, LLaVA, or GPT-4o. It was trained from scratch on web-scale multimodal corpora (text corpora, image caption pairs, and interleave image-text data).

A main limitation of Kosmos-1 is the limited number of input tokens (2,048) across text and image modalities.

Performance

The creators of Kosmos-1 proposed the Raven IQ test dataset to evaluate the nonverbal reasoning capabilities of MLLMs. This is the first time that a model is tested on nonverbal reasoning. The experimental results from the Kosmos-1 paper show that although the performance of Kosmos-1 is slightly better than that of random choice (random choosing one of the options), it is still far from the average results of adults for the same test. Nevertheless, this shows that MLLMs have the capability of nonverbal reasoning by aligning perception with language models.)

Experimental results published in the Kosmos-1 paper show that MLLMs benefit from performing cross-modal transfer, i.e., learning from one modality and transferring the knowledge to other modalities is more beneficial than using only one modality.

Microsoft published promising results for Kosmos-1 on the OCR-free language understanding task. In this task, the model reads and comprehends the meaning of words and sentences directly from the images. Microsoft also demonstrated that providing descriptions in the context improves the accuracy of zero-shot image classification tasks.

Examples of different Kosmos-1 tasks. The modal can explain an image (1, 2) or answer questions based on an image (3, 4). Kosmos-1 can also extract information from a text in an image (5) or answer math questions (6). The model is able to combine these capabilities to answer questions that require locating specific information in an image (7, 8) | Source

Chain-of-thoughts prompting with Kosmos-1. In the first stage, given an image, a prompt is used to guide the model in generating a rationale. The model is then fed the rationale and a task-aware prompt to produce the final results. | Source

DeepMind: Flamingo

Flamingo architecture overview. Visual data is processed through a pretrained, frozen image encoder to extract image embeddings. These embeddings are passed through a Preceiver Sampler, trained from scratch, which outputs a fixed number of embeddings. The fixed image embeddings and text tokens are fed into gated cross-attention dense blocks, inserted between the frozen LLM blocks, and trained from scratch. The model produces free-form text as output. | Source

Flamingo, a vision language model (VLM) developed by DeepMind, can perform various multimodal tasks, including image captioning, visual dialogue, and visual question answering (VQA). Flamingo models take interleaved image data and text as input and generate free-form text.

Flamingo consists of pre-trained vision and language models connected by a “Perceiver Resampler.” The Perceiver Resampler takes as input a variable number of image or video features from the pre-trained vision encoder and returns a fixed number of visual outputs. A pre-trained and frozen Normalizer-Free ResNet (NFNET) is used as a vision encoder, and a frozen Chinchilla is used as the language model. Gated cross-attention dense blocks (GATED XATTN-DENSE) are inserted between frozen LLM blocks and trained from scratch. The largest Flamingo model has 80B parameters and is trained on three datasets scraped from the web: interleaved image and text, image-text, and video-text pairs.

Experimental results on 16 multimodal image/video and language tasks show that Flamingo 80B models are more effective than fine-tuned models for specific tasks. However, as Flamingo focuses more on open-ended tasks, its performance on classification tasks is not as good as that of contrastive models like BASIC, CLI, and ALIGN.

Some limitations that Flamingo inherits from the pre-trained LLM used include hallucinations, poor sample efficiency during training, and poor generalizations for sequences that are longer than the ones used during training. Other limitations that many VLMs struggle with are outputting offensive language, toxicity, propagating social biases and stereotypes, and leaking private information. One way to mitigate these limitations is to filter them out of the training data and exclude them during evaluation.

LLaVA

The Large Language and Vision Assistant (LLaVA) is an end-to-end trained multimodal LLM that integrates the CLIP ViT-L/14 vision encoder and the Vicuna (a chat model created by fine-tuning Llama 2) for general-purpose visual and language understanding.

Given an input image, the pre-trained CLIP ViT-L/14 vision encoder extracts the vision features, which are transformed into the word embedding space using a simple linear layer. Vicuna was chosen as the LLM model because it is the best open-source instruction-following model for language tasks.

Overview of LLaVA architecture. The pretrained CLIP ViT-L/14 vision encoder extracts visual features from input images Xv, which are then mapped into the word embedding space using a linear projection W. — Overview of LLaVA architecture. The pretrained CLIP ViT-L/14 vision encoder extracts visual features from input images X_v, which are then mapped into the word embedding space using a linear projection W. | Source

LLaVA is trained using a two-stage instruction-tuning process. In the first pre-training stage for feature alignment, both the vision encoder and LLM weights are frozen, and the projection matrix is updated to align image features with the pre-trained LLM word embedding. In the second stage, end-to-end fine-tuning is performed to optimize the model for multimodal chatbot interactions and reasoning within the science domain.

Experimental results show that LLaVA 7B has better instruction-tuning capabilities than GPT-4 and Flamingo 80B despite having fewer parameters. LLaVA can follow user instructions and give a more comprehensive answer than GPT-4. LLaVA also outperforms GPT-4 on the ScienceQA dataset, which has multimodal multiple-choice questions from natural, social, and language sciences.

LLaVA has some limitations, including its perception of images as a “bag of patches,” failing to grasp the complex semantics within them. Similar to Flamingo, it inherits biases from both vision and language encoders and is prone to hallucinations and misinformation. Contrary to Flamingo, LLaVA cannot handle multiple images due to its lack of instructions.

This example shows LLaVA's capabilities of visual reasoning and chat. LLaVA accurately follows the user’s instructions instead of simply describing the scene and offers a comprehensive response. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image. — This example shows LLaVA’s capabilities of visual reasoning and chat. LLaVA accurately follows the user’s instructions instead of simply describing the scene and offers a comprehensive response. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image. | Source

Google: PaLM-E

Google developed an embodied language model, PaLM-E, to incorporate continuous sensor modalities into language models and establish the link between words and perceptions.

PaLM-E is a general-purpose MLLM for embodied reasoning, visual language, and language tasks. PaLM-E uses multimodal sentences, where inputs from different modalities (i.e., images in blue, state estimate of a robot in green) are inserted alongside text tokens (in orange) as input to an LLM and are trained end-to-end. PaLM-E can perform different tasks like robotic planning, visual question answering (VQA), and image captioning. | Source

Architecture and training

PaLM-E is a decoder-only LLM that auto-regressively generates text using a multimodal prompt consisting of text, tokenized image embeddings, and state estimates representing quantities like a robot’s position, orientation, and velocity.

PaLM-E combines PaLM, a decoder-only LLM with 540 billion parameters, and the ViT vision transformer by projecting the latter’s image representations into the former’s input token space. The same approach, which relies on a learned transformation function, is used for projecting state estimates.

Performance

Experimental results show that PALM-E outperforms other baselines like SayCan and PALI in different robotic domains and tasks. This shows that combining pre-trained PALM and ViT with the full mixture of robotics and general visual-language data increases the performance compared to training individual models on individual tasks. Moreover, PALM-E outperforms Flamingo in VQA tasks and PALM in language tasks.

PALM-E 562B has many capabilities, including zero-shot multi-modal chain of thought (CoT) reasoning, multi-image reasoning, OCR-free math reasoning, image captioning, VQA, and few-shot prompting.

Challenges, limitations, and future directions of MLLMs

Expanding LLMs to other modalities comes with challenges regarding data quality, interpretation, safety, and generalization. In a survey paper, Paul Liang et al. proposed a new taxonomy to characterize the challenges and limitations of large multimodal language models:

Representation: How can one represent different modalities in a meaningful and comprehensive manner?
Fusion, i.e., integrating two or more modalities and reducing the number of separate representations, is a closely related challenge. Fusion can happen after unimodal encoders capture unique representations of different modalities or directly using raw modalities, which is more challenging as data is heterogeneous.

Representation coordination aims to organize different modalities in a shared coordinate space, such as Euclidian distance. The objective is to position similar modalities close together and put modalities that are not equivalent far away. For instance, the goal is that the representation of the text “a bike” and an image of a bike are placed close together in cosine distance but far away from an image of a cat.

Human cognition offers valuable insights into developing and further improving multimodal models. Understanding how the brain processes different modalities and combining them can be a promising direction for proposing new approaches to multimodal learning and enabling more effective analysis of complex data.

Alignment: Another challenge is identifying cross-modal connections and interactions between elements of different modalities. For instance, how can we align gestures with speech when a person is talking? Or how can we align an image with a description?
When the elements of multiple modalities are discrete (i.e., there is a clear segmentation between elements, like words in a text) and supervised data exists, contrastive learning is used. It matches the representations of the same concepts expressed in different modalities (e.g., the word “car” with an image of a car).

If the ground truth is unavailable, the alignment is done with all the elements of the modalities to learn the necessary connections and matchings between them. For example, aligning video clips with text descriptions when there are no ground truth labels that link descriptions with video clips requires comparing each video embedding with each text embedding. A similarity score (i.e., cosine similarity) is calculated for all pairs and aligns the modalities.

Alignment is more challenging when elements of a modality are continuous (like time-series data) or data does not contain clear semantic boundaries (e.g., MRI images). Clustering can be used to group continuous data based on semantic similarity to achieve modality alignment.

Further, current multimodal models struggle with long-range sequences and cannot learn interactions over long periods. For instance, aligning the text “After 25 minutes in the oven, the cupcakes are golden brown” with the correct scene in a video requires understanding that “25 minutes in the oven” corresponds to a specific scene later in the video. Capturing and aligning long-term interactions that happen very far in time and space is challenging and complex, but it is an important and promising future direction that needs to be explored.
Reasoning: Reasoning is a complex process that involves drawing conclusions from knowledge through multiple logical steps and observations.
One reasoning-related challenge in MLLMs is structure modeling, which involves learning and representing the relationships over which reasoning happens. Understanding hierarchical relationships where smaller components (atoms) are combined to create larger ones (molecules) is essential for complex reasoning.

Another challenge is encoding or representing multimodal concepts during reasoning so that they are interpretable and effective using attention mechanisms, language, or symbols. It is very important to understand how to go from low-level representations (e.g., pixels of an image or words) to high-level concepts (e.g., “What color is the jacket?”) while still being interpretable by humans.

Understanding the reasoning process of the trained models and how they combine elements from different modalities (i.e., text, vision, audio) is very important for their transparency, reliability, and performance. This will help to discover potential biases and limitations in the reasoning process of MLLMs, enabling the development of robust models to overcome these challenges.
Generation: Research is ongoing on generating meaningful outputs that reflect cross-modal interaction and are structured and coherent.
Generative models focus on generating raw modalities (text, images, or videos) and capturing the relationships and interactions between different modalities. For instance, guided text summarization uses input modalities such as images, video, or audio to compress the data and summarize the most relevant and important information from the original content.

Multimodal translation maps one modality to another while respecting semantic connections and information content. Generating novel high-dimensional data conditioned on initial inputs is extremely challenging. It has to preserve semantics, be meaningful and coherent, and capture many possible generations (different styles, colors, and shapes of the same scene).

One of the main challenges of multimodal generation is the difficulty of evaluating the generated content, primarily when ethical issues (e.g., generating deepfakes, hate speech, and fake news) are involved. Evaluating user studies is time-consuming, costly, and biased.

An insightful future work will be to study if the risk for the above ethical issues is reduced or increased when using a multimodal dataset and if there are ethical issues specific to multimodal generations. Multimodal datasets may reduce ethical issues as they are more diverse and contextually complete and may improve model fairness. On the other hand, the biases from one modality can interact and amplify biases in other modalities, leading to complex ethical issues (i.e., combining video with text metadata may reveal sensitive information).)

Transference: In multimodal modeling, transference refers to the process of transferring knowledge from one modality (the second modality) to another (the primary modality) when the primary modality’s resources are limited (e.g., lack of annotated data, unreliable labels, noisy inputs). By leveraging the information from the second modality, the primary modality can enhance performance and learn new capabilities, which would not be possible without the shared information.
In cross-modal transfer settings, large-scale pre-trained models are fine-tuned for specific downstream tasks with a focus on the primary modality. For example, fine-tuning pre-trained frozen large language models for image captioning. On the other hand, multimodal co-learning aims to transfer the learned information by sharing intermediate spaces between modalities. In this case, a single joint model is used across all modalities. For instance, having both image and text modalities during training and using the model for image classification. Contrary model induction, exemplified by co-training, promotes independent training of models and only exchanges their model predictions (outputs) to enable information transfer while maintaining separation.

Learning from many modalities increases the data heterogeneity and complexity challenges during data processing. Dealing with modalities that aren’t all present simultaneously is a direction that needs further exploration to enhance multimodality models’ performance.

Quantification: Quantification aims to understand better and improve multimodal models’ reliability, interpretability, and robustness. Understanding the dimensions of heterogeneity and their effect on multimodal learning and modeling is very important. Exploring interactions and connections of multimodal modalities enhances the understanding of modality interconnections of the trained models. Improving how the multimodal models are trained and optimized is crucial to achieving better generalization, usability, and efficiency.
Having formal guidelines and theories for evaluating which modalities are beneficial or harmful (adversarial attacks) is a critical challenge. Understanding what modalities to select and compare them in a systematic way is very important for improving multimodal models. Furthermore, it is essential to interpret and explain complex relationships and patterns of the multimodal models before employing them in real-world applications. For instance, recognizing social biases of the data (text or image) is key to ensuring fairness while guaranteeing the robustness of the model against noisy or out-of-distribution modalities. These unresolved core challenges require thorough analysis to ensure that multimodal models can be reliably applied across different domains.

As this extensive list of open research questions and practical challenges shows, multimodal LLMs are still in their early stages. The LLaVA GitHub repository and the unit on multi-modal models in the Hugging Face Community Computer Vision Course are excellent resources to dive deeper and get hands-on experience training and fine-tuning MLLMs.

Was the article useful?

Explore more content topics:

Introduction to State Space Models as Natural Language Models

Machine Learning

dim

March 16, 2025

State Space Models (SSMs) use first-order differential equations to represent dynamic systems.

The HiPPO framework provides a mathematical foundation for maintaining continuous representations of time-dependent data, enabling efficient approximation of long-range dependencies in sequence modeling.

Discretization of continuous-time SSMs lays the groundwork for processing natural language and modeling long-range dependencies in a computationally efficient way.

LSSL, S4, and S5 are increasingly sophisticated and efficient sequence-to-sequence state-space models that pave the way for viable SSM-based alternatives to transformer models.

While transformer-based models are in the limelight of the NLP community, a quiet revolution in sequence modeling is underway. State Space Models (SSMs) have the potential to address one of the key challenges of transformers: scaling efficiently with sequence length.

In a series of articles, we’ll introduce the foundations of SSMs, explore their application to sequence-to-sequence language modeling, and provide hands-on guidance for training the state-of-the-art SSMs Mamba and Jamba.

In this first article of the three-part series, we’ll examine the core principles of SSMs, trace their evolution from Linear State Space Layers (LSSL) to the S5 model, and examine their potential to revolutionize sequence modeling with unparalleled efficiency.

Understanding state space models

Before exploring how State Space Models (SSMs) can function as components of large language models (LLMs), we’ll examine their foundational mechanics. This will allow us to understand how SSMs operate within deep neural networks and why they hold promise for efficient sequence modeling.

SSMs are a method for modeling, studying, and controlling the behavior of dynamic systems, which have a state that varies with time. SSMs represent dynamic systems using first-order differential equations, providing a structured framework for analysis and simplifying computations compared to solving higher-order differential equations directly.

Let’s dissect what this means.

Consider a system consisting of a moving car on the road. When we supply a certain input to this system (like pressing the gas pedal), we alter the car’s current state (for example, the amount of gas the engine is burning) and consequently cause the car to move at a certain speed.

Because our system’s state varies with time, it is considered a dynamic system. In this case, we are studying one state variable (the amount of gas the engine burns) in our state (the car’s internals). State variables are the minimum number of variables we can use to understand the system’s behavior through mathematical representation.

A car as a dynamic system. The system has a certain input, which is a foot pressing the gas pedal. This input is supplied to the car, influencing its state. The state variable being changed is the amount of gas the engine is burning. The output of the system is the speed of the car.

In our scenario, the car was already moving, so it was burning gas—a result of the previous force on the gas pedal. The speed we would get if we pressed the pedal in a stationary car differs from the speed we would get if the car were already moving since the engine would need less additional gas (and less additional input force) to reach a certain speed. Thus, when determining the speed, we should also factor in the car’s previous state.

A dynamic system with a previous state as the input. The value of the state variable depends not only on the input but also on the previous state.

There is one more thing to consider. State Space Models also model a “skip connection,” which represents the direct influence of the input on the output. In our case, the skip connection would model an immediate influence of pressing the gas pedal on the car’s speed, regardless of the current state. In the specific case of a car, this direct feedthrough (D) is zero, but we keep it in the model as, generally, systems can (and do) have direct input‐to‐output dependencies.

A dynamic system with a direct connection between input and output. There is a direct relationship between pressing a car’s gas pedal (input) and the car’s speed (output).

Now that we have considered all the possible connections in our system, let’s try to model it mathematically. First, we need representations for the variables in our system. We have the previous state of the model, x(t-1), the input, u(t), the current state of the model, x(t), and the output, y(t).

We also need a notation to represent the relationship between every two variables in the system. Let’s denote the effect of the previous state on the current one by a matrix A, the effect of the input on the current state by a matrix B, the effect of the state on the output by a matrix C, and the direct effect of the input on the output by the matrix D.

State space representation of a dynamic system. The input u(t), the state x(t), the output y(t), and the system’s previous state x(t-1) are connected through matrices A, B, C, and D, respectively. — State space representation of a dynamic system. The input *u(t)*, the state *x(t)*, the output *y(t)*, and the system’s previous state *x(t-1)* are connected through matrices A, B, C, and D, respectively.

From the input u(t), we need to compute two variables:

1. The new state x(t), which considers the effect of the previous state x(t-1) and the input u(t).

2. The output y(t), which considers the effect of the new state x(t) and the direct effect of the input u(t).

Consequently, we can derive the equations for the two variables:

1. The equation for the new state x(t):

2. The equation for the output y(t):

These two equations form our system’s state space representation (SSR). The SSR allows us to study the system’s stability by analyzing the effects of inputs on the system’s state variables and output.

We can model probabilistic dependencies between state variables and the inputs by introducing noise terms into the dynamics and observation equations. These stochastic extensions enable us to account for uncertainties in the system and its environment, providing a foundation for modeling and controlling the system’s behavior in real-world scenarios.

State space models for natural language processing

State Space Models (SSMs), long established in time series analysis, have been utilized as trainable sequence models for decades. Around 2020, their ability to efficiently handle long sequences spurred significant progress in adapting them for natural language processing (NLP).

The exploration of SSMs as trainable sequence models was gradual through multiple contributions that laid the foundation for introducing SSMs in deep learning models as “State Space Layers” (SSLs). In the following sections, we’ll explore key contributions that led to the use of SSMs as NLP models.

Applying SSMs to natural language processing reframes the input as a token, the state as the contextual representation, and the output as the predicted next token.

HiPPO: recurrent memory with optimal polynomial projections

The primary challenge sequence models face is capturing dependencies between two inputs that are far apart in a long sequence.

Let’s say we have a paragraph where the last sentence references something mentioned in the first sentence:

The word ‘Sushi’ in the first sentence is referenced in the last sentence, with a large number of words in between. Thus, understanding the phrase “that name” in the last sentence requires the first sentence for context.

Historically, sequence models, such as traditional RNNs, GRUs, and LSTMs, struggled to retain such long-range dependencies due to problems like vanishing or exploding gradients. The gating mechanisms these algorithms rely on regulate information flow by selectively retaining important features and discarding irrelevant ones, which mitigates issues like short-term memory loss.

However, these mechanisms are insufficient for capturing long-range dependencies because they struggle to preserve information over extended sequences. This is due to capacity constraints, a tendency to prioritize short-term patterns during training, and cumulative errors that degraded information over long sequences. While transformers address many of these issues through their self-attention mechanism, due to the quadratic complexity of attention, they are computationally inefficient for long sequences.

Albert Gu and colleagues at Stanford attempted to solve this problem by introducing HiPPO (short for “High-order Polynomial Projection Operators”). This mathematical framework aims to compress historical information into a fixed-size representation. The fixed-size representation captures the entire processed sequence and enables sequence models to process and utilize long-range dependencies efficiently. Unlike the hidden state in an LSTM or GRU, which is also a fixed-size representation but primarily optimized for short-term memory retention, HiPPO is explicitly designed to capture the entire processed sequence, enabling sequence models to process and utilize long-range dependencies efficiently.

HiPPO works by constructing a set of polynomial bases that are mathematically orthogonal with respect to a specific weighting function. The weighting function w(t) weighs the importance of historical information using one of two variants:

1. Transform HiPPO Matrix Variations: Transform matrices prioritize the latest inputs and change the system’s response continuously with time. The importance of information stored in the sequence history decays over time.

2. Stationary HiPPO Matrix Variations: Stationary matrices are time-invariant and consider all past data with consistent importance. The rate of natural decay of information remains consistent over time, providing a balance between retaining historical information and responding to new inputs.

Gu and colleagues applied the two variants to three different polynomial families referred to as Leg, Lag, and Cheb. The difference between the Leg, Lag, and Cheb is the amount of information retention, which is determined by the variations in the weighting functions w(t) associated with each set of polynomials and their orthogonality properties:

1. HiPPO-Leg is based on the Legendre polynomials. It gives uniform weighting for all the information in the sequence. Thus, the weighting function w(t) = 1. As the sequence length becomes larger, the older parts of the sequence are compressed into a fixed-size representation.

2. HiPPO-Lag is based on the Laguerre polynomials. There is an exponential decay of information over time.

3. HiPPO-Cheb is based on the Chebyshev polynomials. It creates a non-uniform distribution that prioritizes the latest and oldest information.

The storage and prioritization of the sequence’s historical data is due to the mathematical properties of these polynomials. The appendix of the HiPPO paper contains all the equations and mathematical proofs.

The HiPPO matrix is obtained by deriving differential operators that project the input signal onto the specified polynomial basis in real-time. The operators ensure the orthogonality of the states while preserving the defined weighting function. The following equation defines them:

Here, ϕ(t) are the basis functions of the chosen family of orthogonal polynomials (i.e., Legendre, Laguerre, or Chebyshev), ϕ′i is the derivative of the i-th basis function with respect to time t, and w(t) is the weighting function that defines the importance of information over time. i is the index of the current state or basis function being updated, and j is the index of the previous state or basis function contributing to the update. It points to the j-th basis function that is being integrated with respect to w(t). The integral computes the contribution of the j-th basis function to the update of the i-th state, considering the weighting w(t).

This mechanism allows for efficiently updating the model’s hidden state, minimizing the loss of long-range dependencies. Thus, the HiPPO matrix can be used to control the update of a model’s context or hidden state.

This sounds familiar, right? In the previous section, we saw that the representation of the state change (A) for text data would be the context of the text (or sequence). Just like in RNNs and LSTMs, we can use this context (or hidden state) to predict the next word. Since its structure allows it to handle long- and short-range dependencies, HiPPO acts as a template for the matrix A.

Combining recurrent, convolutional, and continuous-time models with linear state-space layers

HiPPO’s inventors collaborated with other Stanford researchers to develop the Structured State Space Sequence model, which uses the HiPPO framework. This model makes significant strides in applying SSMs to sequence modeling tasks.

Their 2021 paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers aims to combine the best and most efficient properties of all the existing sequence modeling algorithms.

According to the authors, an ideal sequence modeling algorithm would have the following capabilities:

1. Parallelizable training, as is possible with Convolutional Neural Networks (CNNs). This saves computational resources and enables a faster training process.

2. Stateful inference, as provided by Recurrent Neural Networks (RNNs). This allows context to be used as a factor while deciding on the output.

3. Time-scale adaptation, as in Neural Differential Equations (NDEs). This enables the sequence model to adapt to various lengths of input sequences.

In addition to these properties, the model should also be able to handle long-range dependencies in a computationally efficient manner.

Motivated by these goals, the authors explored using State Space Models (SSMs) to develop a computationally efficient and generalizable sequence model suitable for long sequences.

Let’s explore how they did that:

As we learned above, the SSR equations represent a dynamic system with a continuously changing state. To apply SSMs to NLP, we need to adapt these continuous-time models to operate on discrete input sequences. Rather than continuous signals, we’ll now feed strings of individual tokens to the model one by one.

Discretization

We can discretize the continuous SSR equations using numerical methods.

To understand this process, we will return to the example of the continuously moving car. The car’s speed is a continuous signal. To study the variation in the car’s speed, we need to measure it at all times. However, it’s impractical to record every infinitesimal change in speed. Instead, we take measurements at regular intervals—for example, every 30 seconds.

By recording the car’s speed at these specific moments, we convert the continuous speed profile into a series of discrete data points. This process of sampling the continuous signal at regular intervals is called “discretization.” The interval of time we are using to measure the speed is called the time scale Δt, also known as “step size” or “discretization parameter.”

To convert a continuous signal into a discrete signal, it is sampled in fixed intervals Δt. — To convert a continuous signal into a discrete signal, it is sampled in fixed intervals *Δt.*

Similar to discretizing car speed, to adapt SSMs for natural language processing, we start with continuous-time equations that describe how a system evolves. We discretize the equations, converting them into a form that updates at each discrete time step.

The choice of Δt is critical: if it is too large, we risk losing important details of the state dynamics (undersampling):

If Δt is too small, the system might become inefficient or numerically unstable due to excessive computations (oversampling):

In Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, the authors explored several methods for discretizing state-space models to adapt them for sequence modeling tasks. They ultimately selected the Generalized Bilinear Transform (GBT), which effectively balances accuracy (by avoiding oversampling) and stability (by avoiding undersampling). The GBT allows the discrete state-space model to approximate the continuous dynamics while maintaining robustness in numerical computations.

The discrete state equation under GBT is given by:

Here, x is the state representation, Δt is the time step, A is the matrix that represents how the state is influenced by the previous state, B is the matrix that represents the effect of the input on the current state, and I is the identity matrix which ensures that the output has consistent dimensionality.

A critical decision when applying the Generalized Bilinear Transform is the choice of the parameter α, which controls the balance between preserving the characteristics of the continuous-time system and ensuring stability in the discrete domain. The authors selected α = 0.5 as it counterbalances accuracy and numerical stability. The resulting state equation is given by:

The bilinear transform equation is then applied to the initialized continuous-time matrices A and B, discretizing them into A and B respectively.

Now that we have a discretized version of the SSR equations, we can apply them to natural language generation tasks where:

1. u(t) is the input token we feed into the model.

2. x(t) is the context, which is the representation of the sequence’s history thus far.

3. y(t) is the output, the predicted next token.

Thus, we have a representation of SSMs that can handle tokens as input.

State Space Model with discretized matrices A and B. A and B map the current context xt-1 and the input token ut to the new context xt. C maps the context to the output token yt, with D modeling the direct relationship between ut and yt. The direct connection between the input and the output mediated by D is treated as a skip connection and is not explicitly incorporated into the model's internal architecture. — State Space Model with discretized matrices A and B. A and B map the current context x_t-1 and the input token u_t to the new context x_t. C maps the context to the output token y_t, with D modeling the direct relationship between u_t and y_t. The direct connection between the input and the output mediated by D is treated as a skip connection and is not explicitly incorporated into the model’s internal architecture.

The three pillars of SSMs as sequence models

Now that we can use SSMs for NLP tasks, let’s see how they measure up with respect to the other available sequencing algorithms by circling back to the goals the authors stated at the beginning of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.

Parallelizable training

Parallelizable training would save a considerable amount of computational resources and time. Two widely used sequencing architectures are inherently parallelizable during training:

1. Convolutional Neural Networks (CNNs) are inherently parallelizable because the convolution operation can be applied simultaneously across all positions in the input sequence. In sequence modeling, CNNs process the entire input in parallel by applying convolutional filters over the sequence, allowing for efficient computation during training.

2. Transformers achieve parallelism through the self-attention mechanism, which simultaneously computes attention weights between all pairs of tokens in the sequence. This is possible because the computations involve matrix operations that can be parallelized, allowing the model to process entire sequences at once.

Efficiently distributing the computational workload is crucial for sequence algorithms, especially when training on large datasets. To address this challenge, the authors introduced a convolutional representation of SSMs, which allows these models to process sequences in parallel, similar to CNNs and Transformers.

The author’s idea is to express the SSM as a convolution operation with a specific kernel k derived from the state-space parameters, enabling the model to compute outputs over long sequences efficiently.

To derive the SSR equations as a convolution operation, they assume the SSM model to be time-invariant. This means the matrices A, B, C, and D do not vary with time, the matrix A is stable (which is already achieved by adopting the HiPPO matrix for A that allows a numerically stable update of the context), and the initial state x(0) is 0.

Using the SSR equations mentioned earlier (state equation that derives x(t) and output equation that derives y(t)), the kernel k can be derived in two steps:

1. Solving for the state, we start with the state equation from the SSR equations where x₀= 0:

Solving for the state, we start with the state equation from the SSR equations where x0 = 0

We derived the state x_n, which represents the system’s state at time step n, based on the contributions of past inputs. Similarly, u_k denotes the input to the system at a specific time step k within the sequence. The number of time steps n (i.e., the number of times we sample using Δt) depends on the length of the input sequence, as the state x_n is influenced by all preceding inputs up to time n−1.

2. Substitute the x_nin the SSR output equation with the state that is derived from step 1.

We can simplify this equation by combining the state representations (A, B, C, and D) as the kernel k:

Here, m is the index for summing over past inputs. The result is the following equation for the output at step n:

Thus, we are left with the convolutional representation of State Space Representation: We take the input u_nas a common factor and denote the term multiplied by the input as the kernel k. We obtain the outputs from the input sequence by passing the kernel across it.

Stateful inference

Stateful inference refers to a sequence model’s ability to create, maintain, and utilize a “state,” which includes all the relevant context needed for further computations. This ability is desirable because it eliminates the computational inefficiency of understanding the context whenever a new input token is present.

Transformers capture long-range dependencies and context through the self-attention mechanism. However, recomputing the attention weights and value vectors every time we have a new input token is computationally expensive. We can cache the values of key and value vectors to avoid some recomputation, which makes it slightly more efficient. Still, it does not solve the problem of transformers scaling quadratically.

RNNs achieve stateful inference through a hidden state that is only updated and not recomputed for every input token. However, RNNs struggle to retain information from earlier tokens in long sequences. This limitation arises because, during backpropagation, gradients associated with long-range dependencies diminish exponentially as they are propagated through many layers (or time steps), a phenomenon known as the vanishing gradient problem. As a result, RNNs cannot effectively model long-range dependencies between tokens.

Thanks to their state equation, SSMs achieve stateful inference. They inherently maintain a state containing the sequence’s context, making them more computationally efficient than transformer-based models.

To handle long-range dependencies, the authors of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers use the HiPPO-LegS (Stationary form of HiPPO-Leg) formulation to parameterize A.

Time-scale adaptation

Time-scale adaptation refers to a sequence model’s ability to capture dependencies for the input token in different parts of the input sequence. In technical terms, this means the context can retain dependencies that occur over different temporal distances within the same sequence. Time-scale adaptation enables effective capturing of both short-term (immediate) and long-term (distant) relationships between elements in the data.

A model’s context representation is crucial for its ability to capture the internal dependencies within a sequence. SSMs represent the context as the matrix A. Thus, an SSM’s ability to update the state based on the new input through the state equation allows the model to adapt to the contextual dependencies within a sequence, allowing it to handle both long and short-range dependencies.

Linear state space layers (LSSLs)

So far, we’ve seen that State Space Models are efficient sequence models. In their paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, Gu and colleagues introduced the Linear State Space Layer (LSSL) utilizing both the discretized recurrent and convolutional forms of State Space Representation equations. This layer is integrated into deep learning architectures to introduce efficient handling of long-range dependencies and structured sequence representations.

Like RNNs, SSMs are recurrent. They update the context by combining the previous state with the new state. This recurrent form is very slow to train because we need to wait for the previous output to be available before computing the next one. To address this problem, the authors devised the convolutional representation of the SSM equations that we discussed in the previous sections.

While the convolutional representation of SSMs enables training parallelization, it is not without its own problems. The key issue is the fixed size of the kernel. The kernel we are using to process the input sequence is determined by the model parameters (matrices A, B, C, and D) and sequence length, as we saw in the first step of the kernel derivation. However, natural language sequences vary in length. Thus, the kernel would be recomputed during inference based on the input sequence, which is inefficient.

Although recurrent representations are inefficient to train, they can handle varying sequence lengths. Thus, to have a computationally efficient model, we seem to need the properties of both the convolutional and recurrent representations. Gu and colleagues devised a “best of both worlds” approach, using the convolutional representation during training and the recurrent representation during inference.

Comparison of the continuous-time, recurrent, and convolutional forms of SSMs. The Linear State Space Layer adopts both the recurrent and convolutional forms of the SSM representation to leverage their complementary advantages. The recurrent form is used during inference, and the convolutional form during training. | Source

In their paper, Gu and collaborators describe the LSSL architecture as a “deep neural network that involves stacking LSSL layers connected with normalization layers and residual connections.” Similar to the attention layers in the transformer architecture, each LSSL layer is preceded by a normalization layer and followed by a GeLU activation function. Then, through a residual connection, the output is added to the normalized output of a position-wise feedforward layer.

Architecture of a Linear State Space Layer. Each input has H features (the size of the token’s embedding vector) that are processed by independent copies of the SSM as one-dimensional inputs in parallel. Each SSM copy produces an M-dimensional output for each feature. The combined outputs are fed through a GeLU activation function and a position-wise feed-forward layer.

Efficiently modeling long sequences with state structured spaces

The LSSL model performed impressively well on sequence data but was not widely adopted due to computational complexities and memory bottlenecks.

Results of testing the original LSSL model on the sequential MNIST, permuted MNIST, and sequential CIFAR tasks, which are popular benchmarks originally designed to test theability of recurrent models to capture long-term dependencies of length up to1k. LSSL sets SoTA on sCIFAR by more than 10 points.

In the paper Efficiently Modeling Long Sequences with State Structured Spaces, Gu, together with close collaborators Karan Goel and Christopher Ré, advanced the LSSL to reduce the computational complexity and accuracy of the training process.

Improvements on the state matrix A

In the previous section, we explored how the original LSSL relied on a fixed, predefined form of the HiPPO matrix to serve as the state matrix A. While this representation was successful in compressing information, it was computationally inefficient due to the full (dense) matrix representation of A. Gu, Goel, and Ré described this implementation as “infeasible to use in practice because of prohibitive computation and memory requirements induced by the state representation.”

In the LSSL, the state is multiplied by the matrix A to produce the updated version of the state. The most computationally efficient form of the matrix A for multiplication would be a diagonal matrix. Unfortunately, the HiPPO matrix could not be reformed as a diagonal matrix since it does not have a full set of eigenvectors.

However, the authors were able to dissect the matrix into a diagonal plus low-rank decomposition (DPLR). The diagonal matrix has nonzero entries only on the main diagonal, which makes the multiplication process more efficient by requiring only a single multiplication per vector element. The low-rank matrix can be represented as the product of two much smaller matrices. Because of this factorization, the operations needed to multiply by the vector are greatly reduced compared to a full-rank matrix of the same size.

The original LSSL architecture required O(N²L) operations, where N is the state dimension, and L is the sequence length. After the transformation of the matrix A into its diagonal plus low-rank (DPLR) form, both the recursive and convolutional forms’ computational complexity were reduced:

1. For the recurrent form, the DLPR form has only O(NL) matrix-vector multiplications.

2. For the convolutional form, the convolutional kernel was reduced to require only O(N log L + L log L) operations. This was achieved by changing the technique used to derive the kernel, which included using the inverse Fast Fourier Transform (iFFT) and applying the Woodbury identity to reduce the low-rank term of matrix A.

This is a considerable leap in computational efficiency, significantly reducing the scaling with sequence length and bringing SSMs closer to linear time complexity, in contrast to the quadratic scaling of transformers.

Improvements in the training implementation

After tackling the LSSL’s computational complexity, the authors found another significant improvement, which is making the matrix A (partially) learnable. In the LSSL, the matrix was fixed and not updated during the training process. Rather, the matrices B and C were responsible for the update and learnability of the SSM blocks.

Keeping the matrix A fixed ensures computational efficiency, but it limits the model’s ability to capture complex dynamics and underlying patterns in the sequence. A fully learnable matrix A offers the flexibility to adapt to arbitrary dynamics. However, it comes with trade-offs: more parameters to optimize, slower training, and higher computational costs during inference.

To balance these competing demands, the modified LSSL – dubbed S4 – adopts a partially learnable A. By maintaining the DPLR structure of A, the model retains computational efficiency, while the introduction of learnable parameters enhances its ability to capture richer, domain-specific behaviors. By introducing learnable parameters into A, a model can adjust the state dynamics during training and update sequence-specific internal representations in the state.

Additionally, Efficiently Modeling Long Sequences with State Structured Spaces introduces techniques for implementing bidirectional state-space models. These models can process sequences in both the forward and backward directions, capturing dependencies from past and future contexts.

Simplified state space layers for sequence modeling

In Simplified State Space Layers for Sequence Modeling, Jimmy Smith, Andrew Warrington, and Scott Linderman proposed multiple improvements to the S4 architecture to enhance performance while maintaining the same computational complexity.

While the improvements of S4 over the original LSSL mainly focus on reducing the model’s computational complexity, S5 aimed to simplify the architecture, making it more efficient and easier to implement while maintaining or improving performance.

Using parallel associative scan

Parallel scan, also known as parallel associative scan, is an algorithm that allows parallel computation through pre-computing cumulative operations (in this case, products) up to each position in the sequence so they can be selected during the processing step instead of processed one at a time.

Using a parallel associative scan, Smith and colleagues were able to parallelize the training process of recurrent SSMs, removing the need for the use of the convolutional representation.

Thus, the S5 layer operates only in the time domain instead of having the convolutional and frequency domain. This is an important improvement because it allows the time complexity per layer to be O(N log ⁡L) instead of O(NL), leveraging parallel computation over the sequence length while reducing the memory overhead.

Allowing multi-input-multi-output

LSSL and S4 are Single-Input-Single-Output (SISO) models. Allowing Multi-Input-Multi-Output (MIMO) was computationally infeasible since the computations inside LSSL and S4 were designed under the assumption of having one input at a time. For example, adapting the convolutional representation to operate on matrices instead of vectors would have significantly increased the computational cost, making the approach impractical.

Smith and collaborators discretized the MIMO SSM equations instead of the SISO SSM equations. Using the same SSR equations, they extended the discretization process to handle m-dimensional inputs and n-dimensional outputs. Assuming the state has N dimensions, this change makes B an N x m matrix instead of N x 1, and C an n x N matrix instead of 1 x N.

S5’s support for MIMO allows it to handle multidimensional data, such as multivariate and multi-channel time series data, process multiple sequences simultaneously, and produce multiple outputs. This reduces computational overhead by allowing multiple sequences to be processed at the same time instead of having m copies of the SSM.

Diagonalized parametrization

As we discussed above, HiPPO-LegS could not be diagonalized. However, the parallel scan approach requires a diagonal matrix A. Through experimentation, Smith and colleagues discovered that they could represent the HiPPO-LegS matrix as a normal plus low-rank (NLPR) matrix, where the normal component is referred to as HiPPO-N, which can be diagonalized.

They showed that removing the low-rank terms and initializing the HiPPO-N matrix had similar results by proving that HiPPO-N and HiPPO-LegS produced the same dynamics. (A proof is given in the appendix of the paper.) However, if they were to use the diagonal matrix from the DPLR approximation, the approximation would have produced very different dynamics than the original structure.

Using a diagonalized version of the HiPPO-N matrix reduced the model’s computational complexity by removing the need to convert the HiPPO-LegS matrix into its DPLR approximation.

Similar to how using a structured parametrization for matrix A decreased the computational overhead, S5 uses a low-rank representation of matrices B and C, further reducing the number of parameters.

The computational components of an S5 layer, which uses a parallel scan on a diagonalized linear SSM to compute the SSM outputs. A nonlinear activation function is applied to the SSM outputs to produce the layer outputs. | Source

Conclusion and outlook

The evolution of State Space Models (SSMs) as sequence-to-sequence models has highlighted their growing importance in the NLP domain, particularly for tasks requiring the modeling of long-term dependencies. Innovations such as LSSL, S4, and S5 have advanced the field by enhancing computational efficiency, scalability, and expressiveness.

Despite the advancements made by the S5 model, it still lacks the ability to be context-aware. The S5 can efficiently train and infer in the time domain and retain information for long-range dependencies, but it does not explicitly filter or focus on specific parts of the sequence, as Transformers do with attention mechanisms.

Hence, a key next step is to incorporate a mechanism into SSMs that enables them to focus on the most relevant parts of the state rather than processing the entire state uniformly. This is what the Mamba model architecture addresses, which we’ll explore in the upcoming second part of the series.

Was the article useful?

Explore more content topics:

Transformers Key-Value Caching Explained

Machine Learning

dim

March 16, 2025

As the complexity and size of transformer-based models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies.

Key-value (KV) caching is a clever trick to do that: At inference time, key and value matrices are calculated for each generated token. KV caching stores these matrices in memory so that when subsequent tokens are generated, we only compute the keys and values for the new tokens instead of having to recompute everything.

The inference speedup from KV caching comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy.

Implementing K-V caching in large-scale production systems requires careful cache management, including choosing an appropriate strategy for cache invalidation and exploring opportunities for cache reuse.

The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the GPT family, as well as many computer vision tasks.

As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies. Key-value (KV) caching is a clever trick to do just that – let’s see how it works and when to use it.

Transformer architecture overview

Before we dive into KV caching, we will need to take a short detour to the attention mechanism used in transformers. Understanding how it works is required to spot and appreciate how KV caching optimizes transformer inference.

We will focus on autoregressive models used to generate text. These so-called decoder models include the GPT family, Gemini, Claude, or GitHub Copilot. They are trained on a simple task: predicting the next token in sequence. During inference, the model is provided with some text, and its task is to predict how this text should continue.

From a high-level perspective, most transformers consist of a few basic building blocks:

A tokenizer that splits the input text into subparts, such as words or sub-words.
An embedding layer that transforms the resulting tokens (and their relative positions within the texts) into vectors.
A couple of basic neural network layers, including dropout, layer normalization, and regular feed-forward linear layers.

The last building block missing from the list above is the slightly more involved self-attention modules.

The self-attention module is, arguably, the only advanced piece of logic in the transformer architecture. It is the cornerstone of every transformer, enabling it to focus on different parts of the input sequence when generating the outputs. It is this mechanism that gives transformers the ability to model long-range dependencies effectively.

Let’s inspect the self-attention module in more detail.

Basic self-attention module

Self-attention is a mechanism that allows the model to “pay attention” to specific parts of the input sequence as it generates the next token. For example, in generating the sentence “She poured the coffee into the cup,” the model might pay more attention to the words “poured” and “coffee” to predict “into” as the next word since these words provide context for what is likely to come next (as opposed to “she” and “the”).

Mathematically speaking, the goal of self-attention is to transform each input (embedded token) into a so-called context vector, which combines the information from all the inputs in a given text. Consider the text “She poured coffee”. Attention will compute three context vectors, one for each input token (let’s assume tokens are words).

To calculate the context vectors, self-attention computes three kinds of intermediate vectors: queries, keys, and values. The diagram below shows step by step how the context vector for the second word, “poured,” is calculated:

The diagram shows step by step how the context vector for the second word, “poured,” is calculated. | Source: Author

Let’s denote the three tokenized inputs as x1, x2, and x3, respectively. The diagram pictures them as vectors with three elements, but in practice, they will be hundreds or thousands of elements long.

As the first step, self-attention multiplies each input separately with two weight matrices, Wk and Wv. The input for which the context vector is now being computed (x2 in our case) is additionally multiplied with a third weight matrix, Wq. All three W matrices are your usual neural network weights, randomly initialized and optimized in the learning process. The outputs of this step are the keys (k) and values (v) vectors for each input, plus an additional query (q) vector for the input being processed.

In step two, the key vector of each input is multiplied by the query vector of the input being processed (our q2). The output is then normalized (not shown in the diagram) to produce the attention weights. In our example, a21 is the attention weight between the inputs “She” and “poured.”

Finally, each attention weight is multiplied by its corresponding value vector. The outputs are then summed to produce the context vector z. In our example, the context vector z2 corresponds to the input x2, “poured.” The context vectors are the outputs of the self-attention module.

If it’s easier for you to read code than diagrams, take a look at this implementation of the basic self-attention module by Sebastian Raschka. The code is part of his book, “Build A Large Language Model (From Scratch)”:

import torch

class SelfAttention_v2(torch.nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

Sebastian’s code operates on matrices: the x in his forward() method corresponds to our x1, x2, and x3 vectors stacked together as a matrix with three rows. This allows him to simply multiply x with W_key to obtain keys, a matrix consisting of three rows (k1, k2, and k3 in our example).

The important takeaway from this brief explanation of self-attention is that in each forward pass, we multiply keys with the queries and then later with the values. Keep this in mind as you read on.

Advanced self-attention modules

The variant of self-attention described above is its simplest vanilla form. Today’s largest LLMs typically use slightly modified variations that typically differ from our basic flavor in three ways:

1
Attention is causal.

2
Dropout is used on attention weights.

3
Multi-head attention is used.

Causal attention means that the model should only consider previous tokens in the sequence when predicting the next one, preventing it from “looking ahead” at future words. Going back to our example, “She poured coffee.”, when the model was given the word “She” and is now attempting to predict the next one (“poured” would be correct), it should not compute or have access to attention weights between “coffee” and any other word since the word “coffee” has not appeared in the text yet. Causal attention is typically implemented by masking the “look-ahead” part of the attention weights matrix with zeros.

Next, to reduce overfitting during training, dropout is often applied to the attention weights. This means that some of them are randomly set to zero in each forward pass.

Finally, basic attention can be referred to as single-head, meaning that there is just one set of Wk, Wq, and Wv matrices. An easy way to increase the model’s capacity is to switch to multi-head attention. This boils down to having multiple sets of the W-matrices and, consequently, multiple query, key, and value matrices, as well as multiple context vectors for each input.

Additionally, some transformers implement additional modifications of the attention module with the goal of improving speed or accuracy. Three popular ones are:

Grouped-query attention: Instead of looking at every input token individually, tokens are grouped, allowing the model to focus on related groups of words at once, which speeds up processing. This is used by Llama 3, Mixtral, and Gemini.
Paged attention: Attention is broken down into “pages” or chunks of tokens, so the model processes one page at a time, making it faster for very long sequences.
Sliding-window attention: The model only attends to nearby tokens within a fixed “window” around each token, so it focuses on the local context without needing to look at the entire sequence.

All of these state-of-the-art approaches to implementing self-attention don’t change its basic premise and the fundamental mechanism it relies on: one always needs to multiply the keys by the queries and then later by the values. And as it turns out, at inference time, these multiplications show major inefficiencies. Let’s see why that’s the case.

What is key-value caching?

During inference, transformers generate one token at a time. When we prompt the model to start generation by passing “She,” it will produce one word, such as “poured” (for the sake of avoiding distractions, let’s keep assuming one token is one word). Then, we can pass “She poured” to the model, and it produces “coffee.” Next, we pass “She poured coffee” and obtain the end-of-sequence token from the model, indicating that it considers generation to be complete.

This means we have run the forward pass three times, each time multiplying the queries by the keys to obtain the attention scores (the same applies to the later multiplication by the values).

In the first forward pass, there was just one input token (“She”), resulting in just one key vector and one query vector. We multiplied them to obtain the q1k1 attention score.

In the first forward pass, there is just one input token (“She”), resulting in just one key vector and one query vector. We multiplie them to obtain the q1k1 attention score.

Next, we passed “She poured” to the model. It now sees two input tokens, so the computation inside our attention module looks as follows:

Next, we pass “She poured” to the model. It now sees two input tokens.

We did the multiplication to compute three terms, but q1k1 was computed needlessly—we had already calculated it before! This q1k1 element is the same as in the previous forward pass because:

q1 is calculated as the embedding of the input (“She”) times the Wq matrix,
k1 is calculated as the embedding of the input (“She”) times the Wk matrix,
Both the embeddings and the weight matrices are constant at inference time.

Note the grayed-out entries in the attention scores matrix: these are masked with zero to achieve causal attention. For example, the top-right element where q1k3 would have been is not shown to the model as we don’t know the third word (and k3) at the moment of generating the second word.

Finally, here is the illustration of the query-times-keys calculation in our third forward pass.

We get the illustration of the query-times-keys calculation in the third forward pass.

We make the computational effort to calculate six values, half of which we already know and don’t need to recompute!

You may already have a hunch about what key-value caching is all about. At inference, as we compute the keys (K) and values (V) matrices, we store their elements in the cache. The cache is an auxiliary memory from which high-speed retrieval is possible. As subsequent tokens are generated, we only compute the keys and values for the new tokens.

For example, this is how the third forward pass would look with caching:

An example on how the third forward pass could look with caching.

When processing the third token, we don’t need to recompute the previous token’s attention scores. We can retrieve the keys and values for the first two tokens from the cache, thus saving computation time.

Assessing the impact of key-value caching

Key-value caching may have a significant impact on inference time. The magnitude of this impact depends on the model architecture. The more cachable computations there are, the larger the potential to reduce inference time.

Let’s analyze the impact of K-V caching on generation time using the GPT-Neo-1.3B model from EleutherAI, which is available on the Hugging Face Hub.

We will start by defining a timer context manager to calculate generation time:

import time

class Timer:

   def __enter__(self):
       self._start = time.time()
       return self

   def __exit__(self, exc_type, exc_value, traceback):
       self._end = time.time()
       self.duration = self._end - self._start

   def get_duration(self) -> float:
       return self.duration

Next, we load the model from the Hugging Face Hub, set up the tokenizer, and define the prompt:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

input_text = "Why is a pour-over the only acceptable way to drink coffee?"

Finally, we can define the function to run model inference:

def generate(use_cache):
    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt").to(device),
    )
 output_ids = model.generate(
     input_ids,
     max_new_tokens=100,
     use_cache=use_cache,
 )

Note the use_cache argument we pass to model.generate: It controls whether K-V caching is employed.

With this setup, we can measure the average generation time with and without K-V caching:

for use_cache in (False, True):
   gen_times = []
   for _ in range(10):
     with Timer() as t:
       generate(use_cache=use_cache)
     gen_times += [t.duration]
   print(f"Average inference time with use_cache={use_cache}: {np.round(np.mean(gen_times), 2)} seconds")

I have executed this code on Google Colab using their free-tier T4 GPU using torch==2.5.1+cu121 and transformers==4.46.2 on Python 3.10.12 and obtained the following output:

Average inference time with use_cache=False: 9.28 seconds Average inference time with use_cache=True: 3.19 seconds

As you can see, in this case, the speedup from caching is almost threefold.

Challenges and trade-offs

As is usually the case, there is no such thing as a free lunch. The generation speedup we have just seen can only be achieved at the cost of increased memory usage, and it requires considerate management in production systems.

Latency-memory trade-off

Storing data in the cache uses up memory space. Systems with limited memory resources may struggle to accommodate this additional memory overhead, potentially resulting in out-of-memory errors. This is especially the case when long inputs need to be processed, as the memory required for the cache grows linearly with the input length.

Another aspect to keep in mind is that the additional memory consumed by the cache is not available for storing the batches of data. As a result, one might need to reduce the batch size to keep it within the memory limits, thus decreasing the throughput of the system.

If the memory consumed by the cache becomes a problem, one can trade additional memory for some of the model accuracy. Specifically, one can truncate the sequences, prune the attention heads, or quantize the model:

Sequence truncation refers to limiting the maximum input sequence length, thus capping the cache size at the expense of losing long-term context. In tasks where this long context is relevant, the model’s accuracy might suffer.

Reducing the number of layers or attention heads, thereby decreasing both the model size and cache memory requirements, is another strategy to reclaim some memory. However, reducing model complexity may impact its accuracy.

Finally, there is quantization, which means using lower-precision data types (e.g., float16 instead of float32) for caching to reduce memory usage. Yet again, model accuracy can suffer.

To sum up, faster latency provided by K-V caching comes at the cost of increased memory usage. If there is sufficient memory, it’s a non-issue. If the memory becomes the bottleneck, however, one can reclaim it by simplifying the model in various ways, thus transitioning from a latency-memory trade-off to a latency-accuracy trade-off.

KV cache management in production systems

In large-scale production systems with many users, the K-V cache needs to be properly managed to ensure consistent and reliable response time while preventing excessive memory consumption. The two most critical aspects of this are cache invalidation (when to clear it) and cache reuse (how to use the same cache multiple times).

Cache invalidation

Three of the most popular cache invalidation strategies are session-based clearing, time-to-live invalidation, and contextual relevance-based approaches. Let’s explore them in this order.

The most basic cache invalidation strategy is session-based clearing. We simply clear the cache at the end of a user session or conversation with the model. This simple strategy is a perfect fit for applications where conversations are short and independent of each other.

Think about a customer support chatbot application in which each user session typically represents an individual conversation where the user seeks assistance with specific issues. In this context, the contents of this cache are unlikely to be needed again. Clearing the K-V cache once the user ends the chat or the session times out due to inactivity is a good choice, freeing up memory for the application to handle new users.

In situations where individual sessions are long, however, there are better solutions than session-based clearing. In time-to-live (TTL) invalidation, cache contents are automatically cleared after a certain period. This strategy is a good choice when the relevance of cached data diminishes predictably over time.

Consider a news aggregator app that provides real-time updates. Cached keys and values might only be relevant for as long as the news is hot. Implementing a TTL policy where cached entries expire after, say, one day ensures that responses to similar queries about fresh developments are generated fast while old news doesn’t fill up memory.

Finally, the most sophisticated of the three popular cache invalidation strategies is based on contextual relevance. Here, we clear the cache contents as soon as they become irrelevant to the current context or user interaction. This strategy is ideal when the application handles diverse tasks or topics within the same session, and the previous context doesn’t contribute value to the new one.

Think about a coding assistant that works as an IDE plug-in. While the user is working on a particular set of files, the cache should be retained. As soon as they switch to a different codebase, however, the previous keys and values become irrelevant and can be deleted to free memory. Contextual relevance-based approaches might be challenging to implement, though, as they require pinpointing the event or point in time at which the context switch occurs.

Cache reuse

Another important aspect of cache management is its reuse. On some occasions, a once-generated cache can be used again to speed up generation and save memory by avoiding storing the same data multiple times in different users’ cache instances.

Cache reuse opportunities typically show up when there is shared context and/or a warm start is desirable.

In scenarios where multiple requests share a common context, one can reuse the cache for that shared portion. In e-commerce platforms, certain products may have standard descriptions or specifications that are frequently asked about by multiple customers. These may include product details (“55-inch 4K Ultra HD Smart LED TV”), warranty information (“Comes with a 2-year manufacturer’s warranty covering parts and labor.”), or customer instructions (“For best results, mount the TV using a compatible wall bracket, sold separately.”). By caching the key-value pairs for these shared product descriptions, a customer support chatbot will generate responses to common questions faster.

Similarly, one can precompute and cache the initial K-V pairs for frequently used prompts or queries. Consider a voice-activated virtual assistant application. Users frequently start interactions with phrases like “What’s the weather today?” or “Set a timer for 10 minutes.” The assistant can respond more quickly by precomputing and caching the key-value pairs for these frequently used queries.

Conclusion

Key-value (K-V) caching is a technique in transformer models where the key and value matrices from previous steps are stored and reused during the generation of subsequent tokens. It allows for the reduction of redundant computations and speeding up inference time. This speedup comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy. Implementing K-V caching in large-scale production systems requires careful cache management, including choosing the strategy for cache invalidation and exploring the opportunities for cache reuse.

Was the article useful?

Explore more content topics:

Bayesian Deep Learning is Needed in the Age of Large-Scale AI [Paper Reflection]

Machine Learning

dim

March 16, 2025

In his famous blog post Artificial Intelligence — The Revolution Hasn’t Happened Yet, Michael Jordan (the AI researcher, not the one you probably thought of first) tells a story about how he might have almost lost his unborn daughter due to a faulty AI prediction. He speculates that many children die needlessly each year in the same way. Abstracting away the specifics of his case, this is one example of an application in which an AI algorithm’s performance looked good on paper during its development but led to bad decisions once deployed.

In our paper Bayesian Deep Learning is Needed in the Age of Large-Scale AI, we argue that the case above is not the exception but rather the rule and a direct consequence of the research community’s focus on predictive accuracy as a single metric of interest.

Our position paper was born out of the observation that the annual Symposium on Advances of Approximate Bayesian Inference, despite its immediate relevance to these questions, attracted fewer junior researchers over the years. At the same time, many of our students and younger colleagues seemed unaware of the fundamental problems with current practices in machine learning research—especially when it comes to large-scale efforts like the work on foundation models, which grab most of the attention today but fall short in terms of safety, reliability, and robustness.

We reached out to fellow researchers in Bayesian deep learning and eventually assembled a group of researchers from 29 of the most renowned institutions around the world, working at universities, government labs, and industry. Together, we wrote the paper to make the case that Bayesian deep learning offers promising solutions to core problems in machine learning and is ready for application beyond academic experiments. In particular, we point out that there are many other metrics beyond accuracy, such as uncertainty calibration, which we have to take into account to ensure that better models also translate to better outcomes in downstream applications.

In this commentary, I will expand on the importance of decisions as a goal for machine learning systems, in contrast to singular metrics. Moreover, I will make the case for why Bayesian deep learning can satisfy these desiderata and briefly review recent advances in the field. Finally, I will provide an outlook for the future of this research area and give some advice on how you can already use the power of Bayesian deep learning solutions in your research or practice today.

Machine learning for decisions

If you open any machine learning research paper presented at one of the big conferences, chances are that you will find a big table with a lot of numbers. These numbers usually reflect the predictive accuracy of different methods on different datasets, and the line corresponding to the authors’ proposed method probably has a lot of bold numbers, indicating that they are higher than the ones of the other methods.

The results table from the ResNet paper is a typical example of how results are presented in machine learning publications. The researchers applied different models and model variants to the same dataset and measured two metrics. The best metric values—usually belonging to the researchers’ newly devised model—are boldened.

In the results table from the Vision Transformer paper, the authors compare three of their own model variants against the prior state-of-the-art ResNet-152 model. They trained all four models on seven different datasets and measured the accuracy. Their findings indicate that the ViT-H/14 model (first column) outperforms the other models on six of the seven datasets. Crucially, this does not allow any conclusions about how any of the models would perform on a particular downstream task. (The last line of the table, labeled “TPUv3-core-days,” indicates the number of days it took to train the models on TPUs.)

Based on this observation, one might believe that bold numbers in tables are all that matters in the world. However, I want to strongly argue that this is not the case. What matters in the real world are decisions—or, more precisely, decisions and their associated utilities.

A motivating example

Imagine you overslept and are now running the risk of getting late to work. Moreover, there is a new construction site on your usual route to work, and there is also a parade going on in town today. This makes the traffic situation rather hard to predict. It is 08:30 am, and you have to be at work by 09:00. There are three different routes you can take: through the city, via the highway, or through the forest. How do you choose?

Luckily, some clever AI researchers have built tools that can predict the time each route takes. There are two tools to choose from, Tool A and Tool B, and these are their predictions:

Annoyingly, Tool A suggests that you should use the highways, but Tool B suggests the city. However, as a tech-savvy user, you actually know that B uses a newer algorithm, and you have read the paper and marveled at the bold numbers. You know that B yields a lower mean-squared error (MSE), a common measure for predictive performance on regression tasks.

Confidently, you choose to trust Tool B and thus take the route through the city—just to arrive at 09:02 and get an annoyed side-glance from your boss for being late.

But how did that happen? You chose the best tool, after all! Let’s look at the ground-truth travel times:

As we can see, the highway was actually the fastest one and, in fact, the only one that would have gotten you to work on time. But how is that possible? This will become clear when we compute the MSE in these times for the two predictive algorithms:

MSE(A) = [ (35-32)² + (25-25)² + (43-35)²] / 3 = 24.3

MSE(B) = [ (28-32)² + (32-25)² + (35-35)²] / 3 = 21.7

Indeed, we see that Tool B has the better MSE, as advertised in the paper. But that didn’t help you now, did it? What you ultimately cared about was not having the most accurate predictions across all possible routes but making the best decision regarding which route to take, namely the decision that gets you to work in time.

While Tool A makes worse predictions on average, its predictions are better for routes with shorter travel times and get worse the longer a route takes. It also never underestimates travel times.

To get to work on time, you don’t care about the predictions for the slowest routes, only about the fastest ones. You’d also like to have the confidence to arrive on time and not choose a route that then actually ends up taking longer. Thus, while Tool A has a worse MSE, it actually leads to better decisions.

Uncertainty estimation to the rescue

Of course, if you had known that the prediction could have been so wrong, you might have never trusted it in the first place, right? Let’s add another useful feature to the predictions: uncertainty estimation.

Here are the original two algorithms and a new third one (Tool C) that estimates its own predictive uncertainties:

The ranking based on mean predictions of Tool C agrees with Tool B. However, you can now assess how much risk there is that you run late to work. Your true utility is not to be at work in the shortest time possible but to be at work on time, i.e., within a maximum of 30 min.

According to Tool C, the drive through the city can take between 17 and 32 min, so while it seems to be the fastest on average, there is a chance that you will be late. In contrast, the highway can take between 25 and 29 min, so you will be on time in any case. Armed with these uncertainty estimates, you’d make the correct choice of choosing the highway.

This was just one example of a scenario in which we are faced with decisions whose utility does not correlate with an algorithm’s raw predictive accuracy, and uncertainty estimation is crucial to making better decisions.

The case for Bayesian deep learning

Bayesian deep learning uses the foundational statistical principles of Bayesian inference to endow deep learning systems with the ability to make probabilistic predictions. These predictions can then be used to derive uncertainty intervals of the form shown in the previous example (which a Bayesian would call “credible intervals”).

Uncertainty intervals can encompass aleatoric uncertainty, that is, the uncertainty inherent in the randomness of the world (e.g., whether your neighbor decided to leave the car park at the same time as you), and epistemic uncertainty, related to our lack of knowledge (e.g., we might not know how fast the parade moves).

Crucially, by applying Bayes’ theorem, we can incorporate prior knowledge into the predictions and uncertainty estimates of our Bayesian deep learning model. For example, we can use our understanding of how traffic flows around a construction site to estimate potential delays.

Frequentist statisticians will often criticize this aspect of Bayesian inference as “subjective” and will advocate for “distribution-free” approaches, such as conformal prediction, which give you provable guarantees for the coverage of the prediction intervals. However, these guarantees only hold uniformly across all the predictions (in our example, across all the routes), but not necessarily in any given case.

As we have seen in our example, we don’t care that much about the accuracy (and, in extension, uncertainty estimates) on the slower routes. As long as the predictions and uncertainty estimates for the fast routes are accurate, a tool serves its purpose. Conformal methods cannot provide such a marginal coverage guarantee for each route, limiting their applicability in many scenarios.

“But Bayesian deep learning doesn’t work”

If you have only superficially followed the field of Bayesian deep learning a few years ago and have then stopped paying attention, distracted by all the buzz around LLMs and generative AI, you would be excused in believing that it has elegant principles and a strong motivation, but does not actually work in practice. Indeed, this truly was the case until only very recently.

However, in the last few years, the field has seen many breakthroughs that allow for this framework to finally deliver on its promises. For instance, performing Bayesian inference on posterior distributions over millions of neural network parameters used to be computationally intractable, but we now have scalable approximate inference methods that are only marginally more costly than standard neural network training.

Moreover, it used to be hard to choose the right model class for a given problem, but we have made great progress in automating this decision away from the user thanks to advances in Bayesian model selection.

While it is still nearly impossible to design a meaningful prior distribution over neural network parameters, we have found different ways to specify priors directly over functions, which is much more intuitive for most practitioners. Finally, some troubling conundra related to the behavior of the Bayesian neural network posterior, such as the infamous cold posterior effect, are much better understood now.

Armed with these tools, Bayesian deep learning models have then started to have a beneficial impact in many domains, including healthcare, robotics, and science. For instance, we have shown that in the context of predicting health outcomes for patients in the intensive care unit based on time series data, a Bayesian deep learning approach can not only yield better predictions and uncertainty estimates but also lead to recommendations that are more interpretable for medical practitioners. Our position paper contains detailed accounts of this and other noteworthy examples.

However, Bayesian deep learning is unfortunately still not as easy to use as standard deep learning, which you can do these days in a few lines of PyTorch code.

If you want to use a Bayesian deep learning model, first, you have to think about specifying the prior. This is a crucial component of the Bayesian paradigm and might sound like a chore, but if you actually have prior knowledge about the task at hand, this can really improve your performance.

Then, you are still left with choosing an approximate inference algorithm, depending on how much computational budget you are willing to spend. Some algorithms are very cheap (such as Laplace inference), but if you want really high-fidelity uncertainty estimates, you might have to opt for a more expensive one (e.g., Markov Chain Monte Carlo).

Finally, you have to find the right implementation of that algorithm that also works with your model. For instance, some inference algorithms might only work with certain types of normalization operators (e.g., layer norm vs. batch norm) or might not work with low-precision weights.

As a research community, we should make it a priority to make these tools more easily usable for normal practitioners without a background in ML research.

The road ahead

This commentary on our position paper has hopefully convinced you that there is more to machine learning than predictive accuracies on a test set. Indeed, if you use predictions from an AI model to make decisions, in almost all circumstances, you should care about ways to incorporate your prior knowledge into the model and get uncertainty estimates out of it. If this is the case, trying out Bayesian deep learning is likely worth your while.

A good place to start is the Primer on Bayesian Neural Networks that I wrote together with three colleagues. I’ve also written a review on priors in Bayesian Deep Learning that’s published open access. Once you understand the theoretical foundations and feel ready to get your hands dirty with some actual Bayesian deep learning in PyTorch, check out some popular libraries for inference methods such as Laplace inference, variational inference, and Markov chain Monte Carlo methods.

Finally, if you are a researcher and would like to get involved in the Bayesian deep learning community, especially contributing to the goal of better benchmarking to show the positive impact on real decision outcomes and to the goal of building easy-to-use software tools for practitioners, feel free to reach out to me.

Was the article useful?

Explore more content topics:

Building The Most Scalable Experiment Tracker For Foundation Models

Machine Learning

dim

March 16, 2025

At a large-scale model training (in huge models), anomalies are not rare events but problematic patterns that drive failure. Detecting anomalies early in the process saves days of work and training.

ML model training observability is not just about tracking metrics. It requires proactive monitoring to catch issues early and ensure model success, given the high cost of training on large GPU clusters.

If you are an enterprise or a team operating a model, focus on three key areas: fine-tune your prompts to get the most effective outputs (prompt engineering), ensure that your model behaves safely and predictably, and implement robust monitoring and logging to track performance, detecting issues early.

The Neptune Scale experiment tracker supports fault tolerance and is designed to maintain progress despite hardware failures, making it adaptable for enterprise teams tackling LLM fine-tuning, compliance, and building domain-specific models.

Scaling large language model (LLM) operations is a challenge that many of us are facing right now. For those navigating similar waters, I recently shared some thoughts about it on the Data Exchange Podcast based on our journey at neptune.ai over the last few years.

Six years ago, we were mainly focused on MLOps when machine learning in production was still evolving. Experiment tracking back then was straightforward—dealing mostly with single models or small-scale distributed systems. Reinforcement learning was one of the few areas pushing the boundaries of scale. In that reinforcement learning, we wanted to run multiple agents and send data from multiple distributed machines to our experiment tracker. This was a huge challenge.

Scaling LLMs: from ML to LLMOps

The landscape changed two years ago when people started training LLMs at scale. LLMOps has taken center stage, and the importance of scaling large language models has grown with research becoming more industrialized. While researchers continue to lead the training process, they are also adjusting to the transition toward commercial applications.

LLMOps isn’t just MLOps with bigger servers, it is a paradigm shift for tracking experiments. We’re not tracking a few hundred metrics for a couple of hours anymore; we’re tracking thousands, even tens of thousands, over several months. These models are trained on GPU clusters spanning multiple data centers, with training jobs that can take months to complete.

Due to time constraints, training frontier models has become a production workflow rather than experimentation. When a training from scratch run takes 50,000 GPUs over several months in different data centers, you don’t get a second chance if something goes wrong—you need to get it right the first time.

Another interesting aspect of LLM training that only a few companies have truly nailed is the branch-and-fork style of training—something that Google has implemented effectively. This method involves branching off multiple experiments from a continuously running model, requiring a significant amount of data from previous runs. It’s a powerful approach, but it demands infrastructure capable of handling large data inheritance, which makes it feasible only for a handful of companies.

From experiment tracking to experiment monitoring

Now we want to track everything—every layer, every detail—because even a small anomaly can mean the difference between success and failure and many hours of work wasted. During this time, we should not only consider pre-training and training time; post-training takes a huge amount of time and collaborative human work. Grasping this issue, we have re-engineered Neptune’s platform to efficiently ingest and visualize massive volumes of data, enabling fast monitoring and analysis at a larger scale.

One of the biggest lessons we’ve learned is that experiment tracking has evolved into experiment monitoring. Unlike MLOps, tracking is no longer just about logging metrics and reviewing them later or restarting your training from a checkpoint a few steps back. It’s about having real-time insights to keep everything on track. With such long training times, a single overlooked metric can lead to significant setbacks. That’s why we’re focusing on building intelligent alerts and anomaly detection right into our experiment tracking system.

Think of it like this—we’re moving from being reactive trackers to proactive observers. Our goal is for our platform to recognize when something is off before the researcher even knows to look for it.

Fault tolerance in LLMs

When you’re dealing with LLM training at this scale, fault tolerance becomes a critical component. With thousands of GPUs running for months, hardware failures are almost inevitable. It’s crucial to have mechanisms in place to handle these faults gracefully.

At Neptune, our system is designed to ensure that the training can resume from checkpoints without losing any data. Fault tolerance does not only mean preventing failures; it also includes minimizing the impact when they occur, so that time and resources are not wasted.

How about being one of the first to access Neptune Scale?

What does this mean for enterprise teams?

If you’re creating your own LLMs from scratch, or even if you’re an enterprise fine-tuning a model, you might wonder how all this is relevant to you. Here’s the deal: strategies originally designed for handling the massive scale of training LLMs are now being adopted in other areas or by smaller-scale projects.

Today, cutting-edge models are pushing the boundaries of scale, complexity, and performance, but these same lessons are starting to matter in fine-tuning tasks, especially when dealing with compliance, reproducibility, or complex domain-specific models.

For enterprise teams, there are three key focuses to consider:

Prompt Engineering: Fine-tune your prompts to get the most effective outputs. This is crucial for adapting large models to your specific needs without having to train from scratch.
Implement guardrails in your application: Ensuring your models behave safely and predictably is key. Guardrails help manage the risks associated with deploying AI in production environments, especially when dealing with sensitive data or critical tasks.
Observability in your system: Observability is vital to understanding what’s happening inside your models. Implementing robust monitoring and logging allows you to track performance, detect issues early, and ensure your models are working as expected. Neptune’s experiment tracker provides the observability you need to stay on top of your model’s behavior.

The future: what we’re building next

At Neptune, we’ve nailed the data ingestion part—it’s fast, reliable, and efficient. The challenge for the next year is making this data useful at scale. We need more than just filtering; we need smart tools that can surface the most critical insights and the most granular information automatically. The goal is to build an experiment tracker that helps researchers discover insights, not just record data.

We’re also working on developing a platform that combines monitoring and anomaly detection with the deep expertise researchers acquire over years of experience. By embedding that expertise directly into the tool (either automatically or by defining rules manually), less experienced researchers can benefit from the patterns and signals that would otherwise take years to learn.

Was the article useful?

Explore more content topics:

Challenges & Solutions For Monitoring at Hyperscale

Machine Learning

dim

March 16, 2025

“What is not measured, cannot be improved.” This quote has become a guiding principle for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the large volume of data generated to the unpredictability of hardware failures and the need for efficient resource management. These issues require strategic solutions, that’s why monitoring isn’t just a nice-to-have—it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams facing large-scale model training and monitoring. Let’s get into it.

Real-time monitoring prevents costly failures

Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring—the ability to act immediately—is so critical.

Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines the process of comparing results, analyzing results, and debugging issues, saving time and effort.

However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.

The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course. See here some testimonials:

One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.

James Tu

Research Scientist, Waabi

For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.

Wojtek Rosiński

CTO, ReSpo.Vision

Real-time visualization of GPU memory usage (top) and power consumption (bottom) during a large-scale training run. These metrics help identify potential bottlenecks, such as out-of-memory errors or inefficient hardware utilization, enabling immediate corrective actions to maintain optimal performance. | Source: Author

Troubleshooting hardware failures is challenging: simplify it with debugging

Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.

At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.

Igor Tsvetkov

Former Senior Staff Software Engineer, Cruise

AI teams automating error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify if failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters).

Intuitive experiment tracking optimizes resource utilization

Another relevant aspect of hyperscale monitoring is optimizing resource utilization, in particular in a scenario where hardware failures and training interruptions can set teams back significantly. Picture a scenario where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.

Use checkpoints at frequent intervals so you do not have to restart from scratch, but just warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints in the same machine. This doesn’t help if your hardware crashes, or, for example, you are using spot instances and they are reassigned.

For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This does not mean that you upload GBs worth of checkpoints to the tracker (although you can and some of our customers, especially self-hosted customers, do this for security reasons), but rather have pointers to the remote location, like S3, where the checkpoints have been saved. This enables you to link the checkpoint with the corresponding experiment step, and efficiently retrieve the relevant checkpoint at any given step.

A comparison of training workflows with and without advanced experiment tracking and checkpointing. On the left, failed training runs at various stages lead to wasted time and resources. On the right, a streamlined approach with checkpoints and proactive monitoring ensures consistent progress and minimizes the impact of interruptions. | Source: Author

However, there are two caveats to successfully restarting an experiment from a checkpoint: assuming that the experimentation environment is constant, or at least reproducible, and addressing deterministic issues like Out-of-Memory errors (OOMs) or bottlenecks that may require parameter changes to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.

Track months-long model training with more confidence. Use neptune.ai forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

In addition, checkpointing strategies are critical for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.

For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.

Features like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times making frequent checkpointing more viable without compromising training performance.

Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.

Reproducibility and transparency are non-negotiable

Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage—from parent runs to dataset versions—in an accessible dashboard.

This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.

A single source of truth simplifies data visualization and management at large-scale data

Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling thousands or even hundreds of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this scale leads to inefficiencies and missed insights.

A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”—those not explicitly logged at the code level but calculated retrospectively within the tool. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.

Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member—technical or non-technical—could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.

I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.

Łukasz Grad

Chief Data Scientist, ReSpo.Vision

The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the tool reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.

Visualizing large datasets

We generally do not think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment not in the same pipeline as the actual model training, data management and visualization is critical to LLMOps.

Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.

Moving forward

The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.

How about being one of the first to access Neptune Scale?

If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.

Acknowledgments

I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.

Was the article useful?

Explore more content topics:

12 Page 1 of 2

Start with the why

Common problems with messy datasets

Thinking about the how

Next articles in the series

Main findings

Limitations

Implications

Conclusion

LLM hyperparameters

Model size

Learning rate

How does the learning rate affect training duration and quality?

Cosine schedule

Warmup-stable-decay schedule

Cyclical cosine schedule

Cyclical warmup-stable-decay schedule

Learning behavior

Weight decay

Gradient clipping

Next-token generation

Temperature

Sampling strategy

How do we find the optimal hyperparameters?

Can we use traditional machine learning hyperparameter optimization methods for LLMs?

Advanced strategies for LLM hyperparameter optimization

Population-based training (PBT)

Bayesian optimization

Adaptive Low-Rank Adaptation (LoRA)

Hands-on: LLM hyperparameter optimization with neptune.ai

What’s next in LLM hyperparameter optimization?

Was the article useful?

Explore more content topics:

Dynamic games

The chocolate-pudding market

Subgame perfect equilibrium

Uncertainty

Summary & outlook

References

What is a multimodal large language model?

Why do we need multimodal LLMs?

How do multimodal LLMs work?

Examples of multimodal LLMs

Microsoft: Kosmos-1

Architecture and training

Performance

DeepMind: Flamingo

LLaVA

Google: PaLM-E

Architecture and training

Performance

Challenges, limitations, and future directions of MLLMs

Was the article useful?

Explore more content topics:

Understanding state space models

State space models for natural language processing

HiPPO: recurrent memory with optimal polynomial projections

Combining recurrent, convolutional, and continuous-time models with linear state-space layers

Discretization

The three pillars of SSMs as sequence models

Parallelizable training

Stateful inference

Time-scale adaptation

Linear state space layers (LSSLs)

Efficiently modeling long sequences with state structured spaces

Improvements on the state matrix A

Improvements in the training implementation

Simplified state space layers for sequence modeling

Using parallel associative scan

Allowing multi-input-multi-output

Diagonalized parametrization

Conclusion and outlook

Was the article useful?

Explore more content topics:

Transformer architecture overview

Basic self-attention module

Advanced self-attention modules

1 Attention is causal. 2 Dropout is used on attention weights. 3 Multi-head attention is used.

What is key-value caching?

Assessing the impact of key-value caching

Challenges and trade-offs

1
Attention is causal.

2
Dropout is used on attention weights.

3
Multi-head attention is used.