Essential Review Papers on Physics-Informed Neural Networks: A Curated Guide for Practitioners

Staying on top of a fast-growing research field is never easy.

I face this challenge firsthand as a practitioner working with Physics-Informed Neural Networks (PINNs). New papers, whether they propose algorithmic advancements or present cutting-edge applications, are published at an accelerating pace by both academia and industry. While it is exciting to see this rapid development, it inevitably raises a pressing question:

How can one stay informed without spending countless hours sifting through papers?

This is where I have found review papers to be exceptionally valuable. Good review papers are effective tools that distill essential insights and highlight important trends. They are big-time savers guiding us through the flood of information.

In this blog post, I would like to share my personal, curated list of must-read review papers on PINNs that have been especially influential for my own understanding and use of PINNs. These papers cover key aspects of PINNs, including algorithmic developments, implementation best practices, and real-world applications.

In addition to what’s available in the existing literature, I’ve included one of my own review papers, which provides a comprehensive analysis of common functional usage patterns of PINNs, a practical perspective often missing from academic reviews. This analysis is based on my review of around 200 arXiv papers on PINNs across various engineering domains over the past three years and can serve as an essential guide for practitioners looking to deploy these techniques to tackle real-world challenges.

For each review paper, I will explain why it deserves your attention, highlight its unique perspective, and point out practical takeaways that you can benefit from immediately.

Whether you’re just getting started with PINNs, using them to tackle real-world problems, or exploring new research directions, I hope this collection makes navigating the busy field of PINN research easier for you.

Let’s cut through the complexity together and focus on what truly matters.

1️⃣ Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and what’s next

📄 Paper at a glance

  • Authors: S. Cuomo, V. Schiano di Cola, F. Giampaolo, G. Rozza, M. Raissi, and F. Piccialli
  • Year: 2022
  • Link: arXiv

🔍 What it covers

This review is structured around key themes in PINNs: the fundamental components that define their architecture, theoretical aspects of their learning process, and their application to various computing challenges in engineering. The paper also explores the available toolsets, emerging trends, and future directions.

Fig 1. Overview of the #1 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • One of the best introductions to PINN fundamentals. This paper takes a well-paced approach to explaining PINNs from the ground up. Section 2 systematically dissects the building blocks of a PINN, covering various underlying neural network architectures and their associated characteristics, how PDE constraints are incorporated, common training methodologies, and learning theory (convergence, error analysis, etc.) of PINNs.
  • Putting PINNs in historical context. Rather than simply presenting PINNs as a standalone solution, the paper traces their development from earlier work on using deep learning to solve differential equations. This historical framing is valuable because it helps demystify PINNs by showing that they are an evolution of previous ideas, and it makes it easier for practitioners to see what alternatives are available.
  • Equation-driven organization. Instead of just classifying PINN research by scientific domains (e.g., geoscience, material science, etc.) as many other reviews do, this paper categorizes PINNs based on the types of differential equations (e.g., diffusion problems, advection problems, etc.) they solve. This equation-first perspective encourages knowledge transfer as the same set of PDEs could be used across multiple scientific domains. In addition, it makes it easier for practitioners to see the strengths and weaknesses of PINNs when dealing with different types of differential equations.

🛠 Practical goodies

Beyond its theoretical insights, this review paper offers immediately useful resources for practitioners:

  • A complete implementation example. In section 3.4, this paper walks through a full PINN implementation for the 1D nonlinear Schrödinger equation. It covers translating the equation into a PINN formulation, handling boundary and initial conditions, defining the neural network architecture, choosing training strategies, selecting collocation points, and applying optimization methods. All implementation details are clearly documented for easy reproducibility. The paper also compares PINN performance under different hyperparameters, which offers immediately applicable insights for your own PINN experiments (a minimal sketch of the general training pattern follows this list).
  • Available frameworks and software tools. Table 3 compiles a comprehensive list of major PINN toolkits, with detailed tool descriptions provided in section 4.3. The backends considered include not only TensorFlow and PyTorch but also Julia and JAX. This side-by-side comparison of different frameworks is especially useful for picking the right tool for your needs.
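If you have never implemented a PINN yourself, the core pattern behind the paper's walkthrough (minimizing a PDE residual plus boundary losses at collocation points) fits in a few lines. The snippet below is a minimal, generic sketch in PyTorch rather than the paper's Schrödinger example; the toy equation, network size, and training settings are my own illustrative choices.

```python
import math
import torch
import torch.nn as nn

# Toy problem: u''(x) + pi^2 * sin(pi*x) = 0 on (0, 1) with u(0) = u(1) = 0 (exact solution: u = sin(pi*x))
model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def pde_residual(x):
    u = model(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]      # u'(x)
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]   # u''(x)
    return d2u + (math.pi ** 2) * torch.sin(math.pi * x)

for step in range(5000):
    optimizer.zero_grad()
    x = torch.rand(128, 1, requires_grad=True)                     # random collocation points in (0, 1)
    loss_pde = pde_residual(x).pow(2).mean()                       # physics (residual) loss
    loss_bc = model(torch.tensor([[0.0], [1.0]])).pow(2).mean()    # boundary loss: u(0) = u(1) = 0
    (loss_pde + loss_bc).backward()
    optimizer.step()
```

Adapting this skeleton to the paper's Schrödinger example mostly means swapping the residual function and the boundary/initial terms, which is exactly the translation step section 3.4 walks through.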

💡Who would benefit

  • This review paper benefits anyone new to PINNs who is looking for a clear, structured introduction.
  • Engineers and developers looking for practical implementation guidance will find the realistic, hands-on demo and the thorough comparison of existing PINN frameworks most interesting. Additionally, they can find relevant prior work on differential equations similar to their current problem, which offers insights they can leverage in their own problem-solving.
  • Researchers investigating theoretical aspects of PINN convergence, optimization, or efficiency can also greatly benefit from this paper.

2️⃣ From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning

📄 Paper at a glance

  • Authors: J. D. Toscano, V. Oommen, A. J. Varghese, Z. Zou, N. A. Daryakenari, C. Wu, and G. E. Karniadakis
  • Year: 2024
  • Link: arXiv

🔍 What it covers

This paper provides one of the most up-to-date overviews of the latest advancements in PINNs. It emphasizes enhancements in network design, feature expansion, optimization strategies, uncertainty quantification, and theoretical insights. The paper also surveys key applications across a range of domains.

Fig 2. Overview of the #2 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • A structured taxonomy of algorithmic developments. One of the freshest contributions of this paper is its taxonomy of algorithmic advancements. This new taxonomy scheme elegantly categorizes all the advancements into three core areas: (1) representation model, (2) handling governing equations, and (3) optimization process. This structure provides a clear framework for understanding both current developments and potential directions for future research. In addition, the illustrations used in the paper are top-notch and easily digestible.
Fig 3. The taxonomy of algorithmic developments in PINNs proposed by the #2 paper. (Image by author)
  • Spotlight on Physics-informed Kolmogorov–Arnold Networks (KANs). KAN, a new architecture based on the Kolmogorov–Arnold representation theorem, is currently a hot topic in deep learning. In the PINN community, some work has already been done to replace the multilayer perceptron (MLP) representation with KANs to gain more expressiveness and training efficiency. Until now, the community has lacked a comprehensive review of this new line of research; section 3.1 of this paper fills exactly that gap.
  • Review on uncertainty quantification (UQ) in PINNs. UQ is essential for the reliable and trustworthy deployment of PINNs when tackling real-world engineering applications. In section 5, this paper provides a dedicated section on UQ, explaining the common sources of uncertainty in solving differential equations with PINNs and reviewing strategies for quantifying prediction confidence.
  • Theoretical advances in PINN training dynamics. In practice, training PINNs is non-trivial. Practitioners are often puzzled by why PINN training sometimes fails, or how it should be carried out optimally. In section 6.2, this paper provides one of the most detailed and up-to-date discussions on this aspect, covering the Neural Tangent Kernel (NTK) analysis of PINNs, information bottleneck theory, and multi-objective optimization challenges.

🛠 Practical goodies

Even though this review paper leans towards the theory-heavy side, two particularly valuable aspects stand out from a practical perspective:

  • A timeline of algorithmic advances in PINNs. In a table in Appendix A, this paper tracks the milestones of key advancements in PINNs, from the original PINN formulation to the most recent extensions to KANs. If you’re working on algorithmic improvements, this timeline gives you a clear view of what’s already been done. If you’re struggling with PINN training or accuracy, you can use this table to find existing methods that might solve your issue.
  • A broad overview of PINN applications across domains. Compared to the other reviews, this paper strives to give the most comprehensive and up-to-date coverage of PINN applications, not only in engineering domains but also in less-covered fields such as finance. Practitioners can easily find prior work conducted in their domain and draw inspiration.

💡Who would benefit

  • For practitioners working in safety-critical fields that need confidence intervals or reliability estimates on their PINN predictions, the discussion on UQ would be useful. If you are struggling with PINN training instability, slow convergence, or unexpected failures, the discussion on PINN training dynamics can help unpack the theoretical reasons behind these issues.
  • Researchers may find this paper especially interesting because of the new taxonomy, which allows them to see patterns and identify gaps and opportunities for novel contributions. In addition, the review of cutting-edge work on PIKANs can also be inspiring.

3️⃣ Physics-Informed Neural Networks: An Application-Centric Guide

📄 Paper at a glance

  • Authors: S. Guo (this author)
  • Year: 2024
  • Link: Medium

🔍 What it covers

This article reviews how PINNs are used to tackle different types of engineering tasks. For each task category, the article discusses the problem statement, why PINNs are useful, and how PINNs can be implemented to address the problem, followed by a concrete use case published in the literature.

Fig 4. Overview of the #3 review paper. (Image by author)

✨ What’s unique

Unlike most reviews, which categorize PINN applications either by the type of differential equation solved or by specific engineering domains, this article picks the angle that practitioners care about the most: the engineering tasks solved by PINNs. This work is based on reviewing papers on PINN case studies scattered across various engineering domains. The outcome is a list of distilled, recurring functional usage patterns of PINNs:

  • Predictive modeling and simulations, where PINNs are leveraged for dynamical system forecasting, coupled system modeling, and surrogate modeling.
  • Optimization, where PINNs are commonly employed to achieve efficient design optimization, inverse design, model predictive control, and optimized sensor placement.
  • Data-driven insights, where PINNs are used to identify the unknown parameters or functional forms of the system, as well as to assimilate observational data to better estimate the system states.
  • Data-driven enhancement, where PINNs are used to reconstruct the field and enhance the resolution of the observational data.
  • Monitoring, diagnostics, and health assessment, where PINNs are leveraged to act as virtual sensors, anomaly detectors, health monitors, and enablers of predictive maintenance.

🛠 Practical goodies

This article places practitioners’ needs at the forefront. While most existing review papers merely answer the question, “Has PINN been used in my field?”, practitioners often seek more specific guidance: “Has PINN been used for the type of problem I’m trying to solve?”. This is precisely what this article tries to address.

By using the proposed five-category functional classification, practitioners can conveniently map their problems to these categories, see how others have solved them, and what worked and what did not. Instead of reinventing the wheel, practitioners can leverage established use cases and adapt proven solutions to their own problems.

💡Who would benefit

This review is best for practitioners who want to see how PINNs are actually being used in the real world. It can also be particularly valuable for cross-disciplinary innovation, as practitioners can learn from solutions developed in other fields.

4️⃣ An Expert’s Guide to Training Physics-informed Neural Networks

📄 Paper at a glance

  • Authors: S. Wang, S. Sankaran, H. Wang, P. Perdikaris
  • Year: 2023
  • Link: arXiv

🔍 What it covers

Even though it doesn’t market itself as a “standard” review, this paper goes all in on providing a comprehensive handbook for training PINNs. It presents a detailed set of best practices, addressing issues like spectral bias, unbalanced loss terms, and causality violations. It also introduces challenging benchmarks and extensive ablation studies to demonstrate these methods.

Fig 5. Overview of the #4 review paper. (Image by author)

✨ What’s unique

  • A unified “expert’s guide”. The main authors are active researchers in PINNs who have worked extensively on improving PINN training efficiency and model accuracy in recent years. This paper is a distilled summary of the authors’ past work, synthesizing a broad range of recent PINN techniques (e.g., Fourier feature embeddings, adaptive loss weighting, causal training) into a cohesive training pipeline (see the Fourier feature sketch after this list). It feels like having a mentor who tells you exactly what does and doesn’t work with PINNs.
  • A thorough hyperparameter tuning study. This paper conducts various experiments to show how different tweaks (e.g., different architectures, training schemes, etc.) play out on different PDE tasks. Their ablation studies show precisely which methods move the needle, and by how much.
  • PDE benchmarks. The paper compiles a suite of challenging PDE benchmarks and offers state-of-the-art results that PINNs can achieve.
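To give a flavor of the techniques the guide synthesizes, here is a rough PyTorch sketch of random Fourier feature embeddings, one of the remedies for spectral bias it discusses. This is not the authors' JAX implementation; the feature count and scale below are arbitrary placeholders.

```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Map inputs x to [sin(2*pi*Bx), cos(2*pi*Bx)] with a fixed random matrix B."""
    def __init__(self, in_dim: int, num_features: int = 64, scale: float = 10.0):
        super().__init__()
        # B is sampled once and kept fixed (not trained); a larger scale means higher frequencies
        self.register_buffer("B", scale * torch.randn(in_dim, num_features))

    def forward(self, x):
        proj = 2 * math.pi * x @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)

# Prepend the embedding to an ordinary MLP so the network sees high-frequency features of its inputs
embedding = FourierFeatures(in_dim=1, num_features=64)
mlp = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 1))
u = mlp(embedding(torch.rand(16, 1)))   # u has shape (16, 1)
```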

🛠 Practical goodies

  • A problem-solution cheat sheet. This paper thoroughly documents various techniques addressing common PINN training pain points. Each technique is clearly presented in a structured format: the why (motivation), the how (how the approach addresses the problem), and the what (the implementation details). This makes it very easy for practitioners to identify the “cure” based on the “symptoms” observed in their PINN training process. What’s great is that the authors transparently discuss the potential pitfalls of each approach, allowing practitioners to make well-informed decisions and effective trade-offs.
  • Empirical insights. The paper shares valuable empirical insights obtained from extensive hyperparameter tuning experiments. It offers practical guidance on choosing suitable hyperparameters, e.g., network architectures and learning rate schedules, and demonstrates how these parameters interact with the advanced PINN training techniques proposed.
  • Ready-to-use library. The paper is accompanied by an optimized JAX library that practitioners can directly adopt or customize. The library supports multi-GPU environments and scales to large problems.

💡Who would benefit

  • Practitioners who are struggling with unstable or slow PINN training can find many practical strategies to fix common pathologies. They can also benefit from the straightforward templates (in JAX) to quickly adapt PINNs to their own PDE setups.
  • Researchers looking for challenging benchmark problems and aiming to benchmark new PINN ideas against well-documented baselines will find this paper especially handy.

5️⃣ Domain-Specific Review Papers

Beyond general reviews in PINNs, there are several nice review papers that focus on specific scientific and engineering domains. If you’re working in one of these fields, these reviews could provide a deeper dive into best practices and cutting-edge applications.

1. Heat Transfer Problems

Paper: Physics-Informed Neural Networks for Heat Transfer Problems

The paper provides an application-centric discussion on how PINNs can be used to tackle various thermal engineering problems, including inverse heat transfer, convection-dominated flows, and phase-change modeling. It highlights real-world challenges such as missing boundary conditions, sensor-driven inverse problems, and adaptive cooling system design. The industrial case study related to power electronics is particularly insightful for understanding the usage of PINNs in practice.

2. Power Systems

Paper: Applications of Physics-Informed Neural Networks in Power Systems — A Review

This paper offers a structured overview of how PINNs are applied to critical power grid challenges, including state/parameter estimation, dynamic analysis, power flow calculation, optimal power flow (OPF), anomaly detection, and model synthesis. For each type of application, the paper discusses the shortcomings of traditional power system solutions and explains why PINNs could be advantageous in addressing those shortcomings. This comparative summary is useful for understanding the motivation for adopting PINNs.

3. Fluid Mechanics

Paper: Physics-informed neural networks (PINNs) for fluid mechanics: A review

This paper explores three detailed case studies that demonstrate the application of PINNs in fluid dynamics: (1) 3D wake flow reconstruction using sparse 2D velocity data, (2) inverse problems in compressible flow (e.g., shock wave prediction with minimal boundary data), and (3) biomedical flow modeling, where PINNs infer thrombus material properties from phase-field data. The paper highlights how PINNs overcome limitations of traditional CFD, e.g., mesh dependency, expensive data assimilation, and difficulty handling ill-posed inverse problems.

4. Additive Manufacturing

Paper: A review on physics-informed machine learning for monitoring metal additive manufacturing process

This paper examines how PINNs address critical challenges specific to additive manufacturing process prediction or monitoring, including temperature field prediction, fluid dynamics modeling, fatigue life estimation, accelerated finite element simulations, and process characteristics prediction.

6️⃣ Conclusion

In this blog post, we went through a curated list of review papers on PINNs, covering fundamental theoretical insights, the latest algorithmic advancements, and practical application-oriented perspectives. For each paper, we highlighted unique contributions, key takeaways, and the audience that would benefit the most from these insights. I hope this curated collection can help you better navigate the evolving field of PINNs.

Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) process data from different modalities like text, audio, image, and video.

Compared to text-only models, MLLMs achieve richer contextual understanding and can integrate information across modalities, unlocking new areas of application. Prime use cases of MLLMs include content creation, personalized recommendations, and human-machine interaction.

Examples of MLLMs that process image and text data include Microsoft’s Kosmos-1, DeepMind’s Flamingo, and the open-source LLaVA. Google’s PaLM-E additionally handles information about a robot’s state and surroundings.

Combining different modalities and dealing with different types of data comes with some challenges and limitations, such as alignment of heterogeneous data, inherited biases from pre-trained models, and lack of robustness.

How would you translate the sentence “The glasses are broken.” into French: “Les verres sont cassés.” or “Les lunettes sont cassées.”? What if you have an image? Would you then be able to choose the correct translation? As humans, we use different modalities daily to enhance communication. Machines can do the same.

Access to visual context can resolve ambiguity when translating between languages. In this example, the image of drinking glasses resolves the ambiguity in the meaning of “glasses” when translating the sentence from English to French. | Modified based on: source

While Large Language Models (LLMs) have shown impressive capabilities in understanding complex text, they are limited to a single data modality. However, many tasks span several modalities.

This article explores Multimodal Large Language Models: their core functionalities, their challenges and limitations, and their potential across various machine learning domains.

What is a multimodal large language model?

Let’s break down the concept of Multimodal Large Language Models (MLLMs) by first understanding the terms “modal” and “multimodal”:

“Modal” refers to a particular way of communicating or perceiving information. It’s like a channel through which we receive and express ourselves. Some of the common modalities are: 

  • Visual: Sight, including images, videos, and spatial information.
  • Auditory: Hearing, including sounds, music, and speech.
  • Textual: Written language, including words, sentences, and documents.
  • Haptic: Touch, including sensations of texture, temperature, and pressure.
  • Olfactory: Smell

“Multimodal” refers to incorporating various modalities to create a richer understanding of the task, e.g., as on a website or in a blog post that integrates text with visuals.

MLLMs can process not just text but other modalities as well. They are trained on samples containing different modalities, which allows them to develop joint representations and utilize multimodal information to solve tasks.

Why do we need multimodal LLMs?

Many industries heavily rely on multimodality, particularly those that handle a blend of data modalities. For example, MLLMs can be used in a healthcare setting to process patient reports comprising doctor notes (text), treatment plans (structured data), and X-rays or MRI scans (images).

Example of a multi-modal model. The model is trained on X-rays, medical reports, actions, and texts describing the diagnosis and outcome. This way, the model learns to use visual and textual information to predict potential diagnoses. | Modified based on: source

MLLMs process and integrate information from different modalities (i.e., text, image, video, and audio), essential to solving many tasks. Some prominent applications are:

  1. Content creation: MLLMs can generate image captions, transform text into visually descriptive narratives, or create multimedia presentations, making them valuable tools for creative and professional industries.
  2. Enhanced human-machine interaction: By understanding and responding to inputs from diverse modalities such as text, speech, and images, MLLMs enable more natural communication. This can enrich the user experience in applications like virtual assistants, chatbots, and smart devices.
  3. Personalized recommendations: MLLMs contribute to refining recommendation systems by analyzing user preferences across diverse modalities. Whether suggesting movies based on textual reviews, recommending products through image recognition, or personalizing content recommendations across varied formats, these models elevate the precision and relevance of recommendations.
  4. Domain-specific problem solving: MLLMs are adaptable and invaluable in addressing challenges across various domains. In healthcare, their capability to interpret medical images aids in diagnostics, while in education, they enhance learning experiences by providing enriched materials that seamlessly combine text and visuals.

How do multimodal LLMs work?

A typical multimodal LLM has three primary modules:

  • The input module comprises specialized neural networks for each specific data type that output intermediate embeddings.
  • The fusion module converts the intermediate embeddings into a joint representation.
  • The output module generates outputs based on the task and the processed information. An output could be, e.g., a text, a classification (like “dog” for an image), or an image. Some MLLMs, like Google’s Gemini family, can produce outputs in more than one modality.
Basic structure of a multimodal LLM. Different modalities are processed by separate input modules. Then, the extracted information is joined in the fusion module. The output module (in this case, a classifier) generates the output in the desired modality.
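To make this three-module structure concrete, here is a deliberately tiny, hypothetical sketch in PyTorch. The encoders, dimensions, and the classification head are placeholders, not the architecture of any real MLLM.

```python
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    def __init__(self, vocab_size: int = 10_000, num_classes: int = 10, dim: int = 256):
        super().__init__()
        # Input modules: one encoder per modality, each producing an intermediate embedding
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.text_embed = nn.Embedding(vocab_size, dim)
        # Fusion module: merge the per-modality embeddings into a joint representation
        self.fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # Output module: here a simple classifier head
        self.head = nn.Linear(dim, num_classes)

    def forward(self, image, token_ids):
        img_emb = self.image_encoder(image)            # (batch, dim)
        txt_emb = self.text_embed(token_ids).mean(1)   # (batch, dim), averaged over tokens
        joint = self.fusion(torch.cat([img_emb, txt_emb], dim=-1))
        return self.head(joint)

model = TinyMultimodalClassifier()
logits = model(torch.rand(4, 3, 32, 32), torch.randint(0, 10_000, (4, 16)))   # (4, num_classes)
```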

Examples of multimodal LLMs

Microsoft: Kosmos-1

Kosmos-1 (GitHub) is a multimodal LLM created by Microsoft for natural language and perception-intensive tasks. It can perform visual dialogue, visual explanation, visual question answering, image captioning, simple math, OCR, and zero-shot image classification with and without descriptions.

Architecture and training

Kosmos-1 processes inputs consisting of text and encoded image embeddings. Image embeddings are obtained through the pre-trained CLIP ViT-L/14 (GitHub) model. An embedding module processes this input before feeding it into a transformer-based decoder based on Magneto.

Kosmos-1 used the same initialization as the Magneto transformer for better optimization stability. To capture position information more precisely and better generalize to different sequence lengths (short sequences for training, long ones during testing), Kosmos-1 used xPOS as a relative position encoder.

Kosmos-1 has about 1.6 billion parameters in total, which is smaller than rival models like Flamingo, LLaVA, or GPT-4o. It was trained from scratch on web-scale multimodal corpora (text corpora, image-caption pairs, and interleaved image-text data).
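Conceptually, the input construction can be pictured as follows: text tokens are embedded, CLIP image features are projected to the decoder width, and everything is concatenated into one sequence for the decoder. The dimensions, vocabulary size, and number of image embeddings below are illustrative guesses, not Kosmos-1's actual values.

```python
import torch
import torch.nn as nn

text_dim, clip_dim = 1024, 768                       # hypothetical widths
token_embed = nn.Embedding(32_000, text_dim)         # stand-in for the text embedding table
image_proj = nn.Linear(clip_dim, text_dim)           # maps CLIP image features to the decoder width

def build_sequence(text_ids_before, clip_features, text_ids_after):
    """Interleave text-token embeddings and projected image embeddings into one decoder input."""
    return torch.cat([
        token_embed(text_ids_before),                # e.g. "An image of"
        image_proj(clip_features),                   # patch features from the frozen CLIP ViT
        token_embed(text_ids_after),                 # the rest of the prompt
    ], dim=1)

x = build_sequence(torch.randint(0, 32_000, (1, 4)),
                   torch.rand(1, 64, clip_dim),
                   torch.randint(0, 32_000, (1, 8)))
# x has shape (1, 4 + 64 + 8, text_dim) and would be fed to the transformer-based decoder
```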

A main limitation of Kosmos-1 is the limited number of input tokens (2,048) across text and image modalities.

Performance

The creators of Kosmos-1 proposed the Raven IQ test dataset to evaluate the nonverbal reasoning capabilities of MLLMs. This was the first time a model was tested on nonverbal reasoning. The experimental results from the Kosmos-1 paper show that although the performance of Kosmos-1 is slightly better than random choice (randomly picking one of the options), it is still far from the average results of adults on the same test. Nevertheless, this shows that MLLMs have the capability of nonverbal reasoning by aligning perception with language models.

Experimental results published in the Kosmos-1 paper show that MLLMs benefit from performing cross-modal transfer, i.e., learning from one modality and transferring the knowledge to other modalities is more beneficial than using only one modality.

Microsoft published promising results for Kosmos-1 on the OCR-free language understanding task. In this task, the model reads and comprehends the meaning of words and sentences directly from the images. Microsoft also demonstrated that providing descriptions in the context improves the accuracy of zero-shot image classification tasks.

Examples of different Kosmos-1 tasks. The model can explain an image (1, 2) or answer questions based on an image (3, 4). Kosmos-1 can also extract information from text in an image (5) or answer math questions (6). The model is able to combine these capabilities to answer questions that require locating specific information in an image (7, 8). | Source
Chain-of-thought prompting with Kosmos-1. In the first stage, given an image, a prompt is used to guide the model in generating a rationale. The model is then fed the rationale and a task-aware prompt to produce the final results. | Source

DeepMind: Flamingo

Flamingo architecture overview. Visual data is processed through a pretrained, frozen image encoder to extract image embeddings. These embeddings are passed through a Perceiver Resampler, trained from scratch, which outputs a fixed number of embeddings. The fixed image embeddings and text tokens are fed into gated cross-attention dense blocks, inserted between the frozen LLM blocks, and trained from scratch. The model produces free-form text as output. | Source

Flamingo, a vision language model (VLM) developed by DeepMind, can perform various multimodal tasks, including image captioning, visual dialogue, and visual question answering (VQA). Flamingo models take interleaved image data and text as input and generate free-form text.

Flamingo consists of pre-trained vision and language models connected by a “Perceiver Resampler.” The Perceiver Resampler takes as input a variable number of image or video features from the pre-trained vision encoder and returns a fixed number of visual outputs. A pre-trained and frozen Normalizer-Free ResNet (NFNet) is used as the vision encoder, and a frozen Chinchilla is used as the language model. Gated cross-attention dense blocks (GATED XATTN-DENSE) are inserted between the frozen LLM blocks and trained from scratch. The largest Flamingo model has 80B parameters and was trained on three kinds of datasets scraped from the web: interleaved image and text data, image-text pairs, and video-text pairs.
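The gated cross-attention idea can be sketched roughly as follows: the text stream attends to the visual tokens, and tanh gates initialized at zero let the new layers fade in without disturbing the frozen language model. This is a simplified, hypothetical PyTorch sketch, not DeepMind's implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Gates start at zero, so the block is initially an identity and the frozen LM is undisturbed
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ff_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        attn_out, _ = self.attn(text_tokens, visual_tokens, visual_tokens)   # text attends to vision
        x = text_tokens + torch.tanh(self.attn_gate) * attn_out
        return x + torch.tanh(self.ff_gate) * self.ff(x)

block = GatedCrossAttentionBlock()
out = block(torch.rand(2, 16, 512), torch.rand(2, 64, 512))   # (batch, text_len, dim)
```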

Experimental results on 16 multimodal image/video and language tasks show that Flamingo 80B models are more effective than fine-tuned models for specific tasks. However, as Flamingo focuses more on open-ended tasks, its performance on classification tasks is not as good as that of contrastive models like BASIC, CLIP, and ALIGN.

Some limitations that Flamingo inherits from the pre-trained LLM it builds on include hallucinations, poor sample efficiency during training, and poor generalization to sequences longer than the ones used during training. Other limitations that many VLMs struggle with are outputting offensive language, toxicity, propagating social biases and stereotypes, and leaking private information. One way to mitigate these issues is to filter such content out of the training data and exclude it during evaluation.

LLaVA

The Large Language and Vision Assistant (LLaVA) is an end-to-end trained multimodal LLM that integrates the CLIP ViT-L/14 vision encoder and Vicuna (a chat model created by fine-tuning LLaMA) for general-purpose visual and language understanding.

Given an input image, the pre-trained CLIP ViT-L/14 vision encoder extracts the vision features, which are transformed into the word embedding space using a simple linear layer. Vicuna was chosen as the LLM because, at the time, it had the best instruction-following capabilities among open-source language models.

Overview of LLaVA architecture. The pretrained CLIP ViT-L/14 vision encoder extracts visual features from input images Xv, which are then mapped into the word embedding space using a linear projection W. | Source

LLaVA is trained using a two-stage instruction-tuning process. In the first pre-training stage for feature alignment, both the vision encoder and LLM weights are frozen, and the projection matrix is updated to align image features with the pre-trained LLM word embedding. In the second stage, end-to-end fine-tuning is performed to optimize the model for multimodal chatbot interactions and reasoning within the science domain.
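A rough sketch of this setup, with placeholder modules and made-up dimensions standing in for the real CLIP encoder and Vicuna:

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 1024, 4096                      # hypothetical feature widths
vision_encoder = nn.Linear(vision_dim, vision_dim)    # placeholder for the frozen CLIP ViT-L/14
llm_embeddings = nn.Embedding(32_000, llm_dim)        # placeholder for the frozen LLM embedding table
projector = nn.Linear(vision_dim, llm_dim)            # the linear projection trained in stage 1

# Stage 1 (feature alignment): freeze the vision encoder and the LLM, train only the projector
for module in (vision_encoder, llm_embeddings):
    for p in module.parameters():
        p.requires_grad = False

image_patches = torch.rand(1, 256, vision_dim)                    # visual features Xv
visual_tokens = projector(vision_encoder(image_patches))          # now in the word-embedding space
text_tokens = llm_embeddings(torch.randint(0, 32_000, (1, 32)))
decoder_input = torch.cat([visual_tokens, text_tokens], dim=1)    # fed to the LLM
# Stage 2 would unfreeze the LLM and fine-tune the projector and LLM end-to-end.
```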

Experimental results show that LLaVA 7B has better instruction-following capabilities than GPT-4 and Flamingo 80B despite having fewer parameters. LLaVA can follow user instructions and give a more comprehensive answer than GPT-4. LLaVA also outperforms GPT-4 on the ScienceQA dataset, which has multimodal multiple-choice questions from natural, social, and language sciences.

LLaVA has some limitations, including its perception of images as a “bag of patches,” failing to grasp the complex semantics within them. Like Flamingo, it inherits biases from both vision and language encoders and is prone to hallucinations and misinformation. Unlike Flamingo, LLaVA cannot handle multiple images, due to the lack of such instruction data during training.

This example shows LLaVA’s capabilities of visual reasoning and chat. LLaVA accurately follows the user’s instructions instead of simply describing the scene and offers a comprehensive response. Even when merely asked to describe the image, LLaVA identifies atypical aspects of the image. | Source

Google: PaLM-E

Google developed an embodied language model, PaLM-E, to incorporate continuous sensor modalities into language models and establish the link between words and perceptions.

PaLM-E is a general-purpose MLLM for embodied reasoning, visual language, and language tasks. PaLM-E uses multimodal sentences, where inputs from different modalities (i.e., images in blue, state estimate of a robot in green) are inserted alongside text tokens (in orange) as input to an LLM and are trained end-to-end. PaLM-E can perform different tasks like robotic planning, visual question answering (VQA), and image captioning. | Source

Architecture and training

PaLM-E is a decoder-only LLM that auto-regressively generates text using a multimodal prompt consisting of text, tokenized image embeddings, and state estimates representing quantities like a robot’s position, orientation, and velocity.

PaLM-E combines PaLM, a decoder-only LLM with 540 billion parameters, and the ViT vision transformer by projecting the latter’s image representations into the former’s input token space. The same approach, which relies on a learned transformation function, is used for projecting state estimates.
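Applied to a continuous state estimate, the same projection idea might look like this minimal, hypothetical sketch (the state dimension and number of generated tokens are arbitrary):

```python
import torch
import torch.nn as nn

state_dim, llm_dim, n_state_tokens = 9, 4096, 4                  # e.g. position, orientation, velocity
state_encoder = nn.Linear(state_dim, llm_dim * n_state_tokens)   # learned transformation into token space

state = torch.rand(1, state_dim)                                 # continuous sensor reading
state_tokens = state_encoder(state).view(1, n_state_tokens, llm_dim)
# These state "tokens" are inserted into the multimodal sentence alongside text and image tokens.
```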

Performance

Experimental results show that PaLM-E outperforms other baselines like SayCan and PaLI across different robotic domains and tasks. This shows that combining the pre-trained PaLM and ViT with the full mixture of robotics and general visual-language data increases performance compared to training individual models on individual tasks. Moreover, PaLM-E outperforms Flamingo in VQA tasks and PaLM in language tasks.

PaLM-E 562B has many capabilities, including zero-shot multimodal chain-of-thought (CoT) reasoning, multi-image reasoning, OCR-free math reasoning, image captioning, VQA, and few-shot prompting.

Challenges, limitations, and future directions of MLLMs

Expanding LLMs to other modalities comes with challenges regarding data quality, interpretation, safety, and generalization. In a survey paper, Paul Liang et al. proposed a new taxonomy to characterize the challenges and limitations of large multimodal language models:

  1. Representation: How can one represent different modalities in a meaningful and comprehensive manner?

    Fusion, i.e., integrating two or more modalities and reducing the number of separate representations, is a closely related challenge. Fusion can happen after unimodal encoders capture unique representations of different modalities or directly using raw modalities, which is more challenging as data is heterogeneous.

    Representation coordination aims to organize different modalities in a shared coordinate space, for example using Euclidean distance. The objective is to position similar modalities close together and put modalities that are not equivalent far away. For instance, the goal is that the representation of the text “a bike” and an image of a bike are placed close together in cosine distance but far away from an image of a cat.

    Human cognition offers valuable insights into developing and further improving multimodal models. Understanding how the brain processes different modalities and combining them can be a promising direction for proposing new approaches to multimodal learning and enabling more effective analysis of complex data.

  2. Alignment: Another challenge is identifying cross-modal connections and interactions between elements of different modalities. For instance, how can we align gestures with speech when a person is talking? Or how can we align an image with a description?

    When the elements of multiple modalities are discrete (i.e., there is a clear segmentation between elements, like words in a text) and supervised data exists, contrastive learning is used. It matches the representations of the same concepts expressed in different modalities (e.g., the word “car” with an image of a car).

    If the ground truth is unavailable, the alignment is done with all the elements of the modalities to learn the necessary connections and matchings between them. For example, aligning video clips with text descriptions when there are no ground-truth labels linking descriptions with video clips requires comparing each video embedding with each text embedding. A similarity score (e.g., cosine similarity) is calculated for all pairs and used to align the modalities (see the short sketch at the end of this list).

    Alignment is more challenging when elements of a modality are continuous (like time-series data) or data does not contain clear semantic boundaries (e.g., MRI images). Clustering can be used to group continuous data based on semantic similarity to achieve modality alignment.

    Further, current multimodal models struggle with long-range sequences and cannot learn interactions over long periods. For instance, aligning the text “After 25 minutes in the oven, the cupcakes are golden brown” with the correct scene in a video requires understanding that “25 minutes in the oven” corresponds to a specific scene later in the video. Capturing and aligning long-term interactions that happen very far in time and space is challenging and complex, but it is an important and promising future direction that needs to be explored.

  3. Reasoning: Reasoning is a complex process that involves drawing conclusions from knowledge through multiple logical steps and observations.

    One reasoning-related challenge in MLLMs is structure modeling, which involves learning and representing the relationships over which reasoning happens. Understanding hierarchical relationships where smaller components (atoms) are combined to create larger ones (molecules) is essential for complex reasoning. 

    Another challenge is encoding or representing multimodal concepts during reasoning so that they are interpretable and effective using attention mechanisms, language, or symbols. It is very important to understand how to go from low-level representations (e.g., pixels of an image or words) to high-level concepts (e.g., “What color is the jacket?”) while still being interpretable by humans.

    Understanding the reasoning process of the trained models and how they combine elements from different modalities (i.e., text, vision, audio) is very important for their transparency, reliability, and performance. This will help to discover potential biases and limitations in the reasoning process of MLLMs, enabling the development of robust models to overcome these challenges.

  4. Generation: Research is ongoing on generating meaningful outputs that reflect cross-modal interaction and are structured and coherent.

    Generative models focus on generating raw modalities (text, images, or videos) and capturing the relationships and interactions between different modalities. For instance, guided text summarization uses input modalities such as images, video, or audio to compress the data and summarize the most relevant and important information from the original content.

    Multimodal translation maps one modality to another while respecting semantic connections and information content. Generating novel high-dimensional data conditioned on initial inputs is extremely challenging. It has to preserve semantics, be meaningful and coherent, and capture many possible generations (different styles, colors, and shapes of the same scene).

    One of the main challenges of multimodal generation is the difficulty of evaluating the generated content, primarily when ethical issues (e.g., generating deepfakes, hate speech, and fake news) are involved. Evaluating user studies is time-consuming, costly, and biased.

    An insightful direction for future work is to study whether the risk of the above ethical issues is reduced or increased when using a multimodal dataset, and whether there are ethical issues specific to multimodal generation. Multimodal datasets may reduce ethical issues as they are more diverse and contextually complete and may improve model fairness. On the other hand, the biases from one modality can interact with and amplify biases in other modalities, leading to complex ethical issues (e.g., combining video with text metadata may reveal sensitive information).

  5. Transference: In multimodal modeling, transference refers to the process of transferring knowledge from one modality (the second modality) to another (the primary modality) when the primary modality’s resources are limited (e.g., lack of annotated data, unreliable labels, noisy inputs). By leveraging the information from the second modality, the primary modality can enhance performance and learn new capabilities, which would not be possible without the shared information.

    In cross-modal transfer settings, large-scale pre-trained models are fine-tuned for specific downstream tasks with a focus on the primary modality, for example, fine-tuning pre-trained frozen large language models for image captioning. On the other hand, multimodal co-learning aims to transfer the learned information by sharing intermediate spaces between modalities. In this case, a single joint model is used across all modalities, for instance, having both image and text modalities during training and using the model for image classification. In contrast, model induction, exemplified by co-training, promotes independent training of models and only exchanges their model predictions (outputs) to enable information transfer while maintaining separation.

Learning from many modalities increases data heterogeneity and complexity challenges during data processing. Dealing with modalities that aren’t all present simultaneously is a direction that needs further exploration to enhance multimodal models’ performance.

  6. Quantification: Quantification aims to better understand and improve multimodal models’ reliability, interpretability, and robustness. Understanding the dimensions of heterogeneity and their effect on multimodal learning and modeling is very important. Exploring the interactions and connections between modalities enhances the understanding of how trained models combine them. Improving how multimodal models are trained and optimized is crucial to achieving better generalization, usability, and efficiency.

    Having formal guidelines and theories for evaluating which modalities are beneficial or harmful (adversarial attacks) is a critical challenge. Understanding which modalities to select and how to compare them systematically is very important for improving multimodal models. Furthermore, it is essential to interpret and explain the complex relationships and patterns of multimodal models before employing them in real-world applications. For instance, recognizing social biases in the data (text or image) is key to ensuring fairness while guaranteeing the robustness of the model against noisy or out-of-distribution modalities. These unresolved core challenges require thorough analysis to ensure that multimodal models can be reliably applied across different domains.
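As a small illustration of the all-pairs alignment idea mentioned under the Alignment challenge above, the following sketch compares every video embedding with every text embedding via cosine similarity; random vectors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

video_emb = F.normalize(torch.rand(8, 256), dim=-1)   # 8 video-clip embeddings (placeholders)
text_emb = F.normalize(torch.rand(8, 256), dim=-1)    # 8 text-description embeddings (placeholders)

similarity = video_emb @ text_emb.T                   # cosine similarity for all pairs, shape (8, 8)
best_match = similarity.argmax(dim=-1)                # most similar description for each clip
```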

As this extensive list of open research questions and practical challenges shows, multimodal LLMs are still in their early stages. The LLaVA GitHub repository and the unit on multi-modal models in the Hugging Face Community Computer Vision Course are excellent resources to dive deeper and get hands-on experience training and fine-tuning MLLMs.

One Turn After Another

While some games, like rock-paper-scissors, only work if all players decide on their actions simultaneously, other games, like chess or Monopoly, expect the players to take turns one after another. In Game Theory, the first kind of game is called a static game, while turn-taking is a property of so-called dynamic games. In this article, we will analyse the latter with methods from game theory.

This article is the fourth part of a four-chapter series on the fundamentals of game theory. I recommend reading the first three articles if you haven’t done so yet, as the concepts shown here build on the terms and paradigms introduced in the previous articles. But if you are already familiar with the core fundamentals of game theory, don’t let yourself be stopped, and go ahead!

Dynamic games

Dynamic games can be visualized as trees. Photo by Adarsh Kummur on Unsplash

While so far we only looked at static games, we will now introduce dynamic games, where players take turns. As previously, such games include a number of players n, a set of actions for each player, and a reward function that assesses the actions of a player given the other players’ actions. Beyond that, for a dynamic game, we need to define an order in which the players take their turns. Consider the following tree-like visualization of a dynamic game.

A visualization of a dynamic game. Figure by author.

At the top we have a node where player 1 has to decide between two actions L and R. This determines whether to follow the left part or the right part of the tree. After player 1’s turn, player 2 takes their turn. If player 1 chooses L, player 2 can decide between l1 and r1. If player 1 chooses R, player 2 has to decide between l2 and r2. At the leaves of the tree (the nodes at the bottom), we see the rewards just like we had them in the matrix cells in static games. For example, if player 1 decides for L and player 2 decides for r1, the reward is (1,0); that is, player 1 gets a reward of 1, and player 2 gets a reward of 0. 

I bet you are eager to find the Nash equilibrium of this game, as this is what Game Theory is mainly about (if you still struggle with the concept of a Nash equilibrium, you might want to take a look back at chapter 2 of this series). To do that, we can transform the game into a matrix, as we already know how to find a Nash equilibrium in a game displayed as a matrix. Player 1 decides on the row of the matrix, player 2 decides on the column, and the values in the cell then specify the reward. However, there is one important point to notice. When we look at the game displayed as a tree, player 2 decides on their action after player 1 does and hence only cares about the part of the tree that is actually reached. If player 1 chooses action L, player 2 only decides between l1 and r1 and doesn’t care about l2 and r2, because these actions are out of the question anyway. However, when we search for a Nash equilibrium, we need to be aware of what would happen if player 1 changed their action. Therefore, we must know what player 2 would have done if player 1 had chosen a different option. That is why we have four columns in the following matrix, to always account for decisions in both parts of the tree.

A column like (r1,l2) can be read as “player 2 chooses r1 if player 1 chose L and chooses l2 if player 1 chose R”. On this matrix, we can search for the best answers. For example, the cell (L, (l1,l2)) with reward (3,1) is a best answer: player 1 has no reason to change from L to R because that would lower their reward (from 3 to 1), and player 2 has no reason to change either because none of the other options is better (one is as good, though). In total, we find three Nash equilibria, which are underlined in the upcoming matrix:

The chocolate-pudding market

We will talk about chocolate pudding now. But also about game theory. Photo by American Heritage Chocolate on Unsplash

Our next example brings the idea of dynamic games to life. Let’s assume player 2 is a market-leading retailer of chocolate pudding. Player 1 also wants to build up their business but isn’t sure yet whether to join the chocolate pudding market or rather sell something else. In our game, player 1 has the first turn and can decide between two actions: join the market (i.e., sell chocolate pudding) or don’t join the market (i.e., sell something else). If player 1 decides to sell something other than chocolate pudding, player 2 stays the market-dominating retailer for chocolate pudding and player 1 makes some money in the other area they decided for. This is reflected by the reward (1,3) in the right part of the tree in the following figure.

The market-game as a dynamic game. Figure by author. 

But what if player 1 is greedy for the unimaginable riches that lie dormant in the chocolate pudding market? If they decide to join the market, it is player 2’s turn. They can decide to accept the new competitor, give in, and share the market. In this case, both players get a reward of 2. But player 2 can also decide to start a price war to demonstrate their superiority to the new competitor. In this case, both players get a reward of 0, because they ruin their profits with dumping prices.

Just like before, we can turn this tree into a matrix and find the Nash equilibria by searching for the best answers:

If player 1 joins the market, the best option for player 2 is to give in. This is an equilibrium because no player has any reason to change. For player 1 it does not make sense to leave the market (that would give a reward of 1 instead of 2), and for player 2 it is not a good idea to switch to fighting either (which would give a reward of 0 instead of 2). The other Nash equilibrium happens when player 1 just doesn’t join the market. However, this scenario includes player 2’s decision to fight if player 1 had chosen to join the market instead. Player 2 basically makes a threat and says “If you join the market, I will fight you.” Remember that previously we said we need to know what the players would do even in the cases that don’t appear to happen? Here we see why this is important. Player 1 needs to assume that player 2 would fight, because that is the only reason for player 1 to stay out of the market. If player 2 didn’t threaten to fight, we wouldn’t have a Nash equilibrium, because then joining the market would become a better option for player 1.
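If you want to check this best-response search mechanically, a few lines of Python are enough. The sketch below encodes the market game's payoff matrix exactly as described above (the action labels are my own) and enumerates the pure-strategy Nash equilibria:

```python
from itertools import product

# Payoffs as (player 1, player 2); player 2's strategy is what they would do *if* player 1 joins
payoffs = {
    ("join", "give in"): (2, 2),
    ("join", "fight"): (0, 0),
    ("stay out", "give in"): (1, 3),
    ("stay out", "fight"): (1, 3),
}
p1_actions = ["join", "stay out"]
p2_actions = ["give in", "fight"]

def is_nash(a1, a2):
    u1, u2 = payoffs[(a1, a2)]
    p1_ok = all(payoffs[(alt, a2)][0] <= u1 for alt in p1_actions)   # no profitable deviation for player 1
    p2_ok = all(payoffs[(a1, alt)][1] <= u2 for alt in p2_actions)   # no profitable deviation for player 2
    return p1_ok and p2_ok

print([cell for cell in product(p1_actions, p2_actions) if is_nash(*cell)])
# [('join', 'give in'), ('stay out', 'fight')]
```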

But how reasonable is this threat? It keeps player 1 outside the market, but what would happen if player 1 didn’t believe the threat and decided to join the market anyway? Would player 2 really carry out the threat and fight? That would be very silly, because it would give them a reward of 0, whereas giving in would give a reward of 2. From that perspective, player 2 made an empty threat that is not very credible. If the case really occurred, they wouldn’t carry it out anyway, would they?

Subgame perfect equilibrium

For a subgame perfect equilibrium, before you get the whole picture, you need to start with small parts of the game. Photo by Ben Stern on Unsplash

The previous example showed that Nash equilibria can occur that are not very reasonable within the game. To cope with this problem, a stricter concept of equilibrium has been introduced, called a subgame perfect equilibrium. It adds stricter conditions to the notion of an equilibrium. Hence every subgame perfect equilibrium is a Nash equilibrium, but not all Nash equilibria are subgame perfect.

A Nash equilibrium is subgame perfect if every subgame of this equilibrium is a Nash equilibrium itself. What does that mean? First, we have to understand that a subgame is a part of the game’s tree that starts at any node. For example, if player 1 chooses L, the remainder of the tree under the node reached by playing L is a subgame. Likewise, the tree that comes after the node of action R is a subgame. Last but not least, the whole game is always a subgame of itself. As a consequence, the example we started with has three subgames, which are marked in grey, orange and blue in the following:

The market game has three subgames. Figure by author.

We already saw that this game has three Nash equilibria, which are (L,(l1,l2)), (L,(l1,r2)) and (R,(r1,r2)). Let us now find out which of these are subgame perfect. To this end, we investigate the subgames one after another, starting with the orange one. If we only look at the orange part of the tree, there is a single Nash equilibrium that occurs if player 2 chooses l1. If we look at the blue subgame, there is also a single Nash equilibrium that is reached when player 2 chooses r2. That tells us that in every subgame perfect Nash equilibrium, player 2 has to choose option l1 if we arrive in the orange subgame (i.e., if player 1 chooses L) and option r2 if we arrive at the blue subgame (i.e., if player 1 chooses R). Only one of the previous Nash equilibria fulfills this condition, namely (L,(l1,r2)). Hence this is the only subgame perfect Nash equilibrium of the whole game. The other two versions are Nash equilibria as well, but they are somewhat illogical in the sense that they contain some kind of empty threat, as we had in the chocolate pudding market example before. The method we just used to find the subgame perfect Nash equilibrium is called backward induction, by the way.
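Backward induction is easy to mechanize. Here is a tiny Python sketch applied to the chocolate pudding market game from the previous section, whose payoffs are fully known from the text, working from the last decision back to the first:

```python
# Payoffs as (player 1, player 2)
tree = {
    "join": {"give in": (2, 2), "fight": (0, 0)},   # after "join", player 2 moves
    "stay out": (1, 3),                             # terminal: player 1 sells something else
}

# Step 1: in the subgame after "join", player 2 picks the action maximizing their own payoff
p2_choice = max(tree["join"], key=lambda a: tree["join"][a][1])            # -> "give in"

# Step 2: player 1 anticipates this and compares the resulting payoffs
p1_choice = max(["join", "stay out"],
                key=lambda a: tree["join"][p2_choice][0] if a == "join" else tree[a][0])

print(p1_choice, p2_choice)   # join give in  -> the subgame perfect equilibrium
```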

Uncertainty

In dynamic games, it can happen that you have to make decisions without knowing exactly what node of the game you are in. Photo by Denise Jans on Unsplash

So far in our dynamic games, we always knew which decisions the other players made. For a game like chess, this is the case indeed, as every move your opponent makes is perfectly observable. However, there are other situations in which you might not be sure about the exact moves the other players make. As an example, we go back to the chocolate pudding market. You take the perspective of the retailer that is already in the market and you have to decide whether you would start fighting if the other player joins the market. But there is one thing you don’t know, namely how aggressive your opponent will be. When you start fighting, will they be frightened easily and give up? Or will they be aggressive and fight you until only one of you is left? This can be seen as a decision made by the other player that influences your decision. If you expect the other player to be a coward, you might prefer to fight, but if they turn out to be aggressive, you would rather want to give in (reminds you of the birds fighting for food in the previous chapter, doesn’t it?). We can model this scenario in a game like this: 

A dynamic game with a hidden decision (indicated by the dotted circle). Figure by author.

The dotted circle around the two nodes indicates that these are hidden decisions that are not observable to everyone. If you are player 2, you know whether player 1 joined the market or not, but if they joined, you don’t know whether they are aggressive (left node) or moderate (right node). Hence you act under uncertainty, which is a very common ingredient in many games you play in the real world. Poker would become very boring if everybody knew everyone else’s cards; that’s why there is private information, namely the cards in your hand that only you know about. 

Now you still have to decide whether to fight or give in, although you are not exactly sure what node of the tree you are in. To do that, you have to make assumptions about the likelihood of each state. If you are quite certain that the other player is behaving moderately, you might be up for a fight, but if you assume them to be aggressive, you might prefer giving in. Say there is a probability p that the other player is aggressive and 1-p that they behave moderately. If you assume p to be high, you should give in, but if p becomes smaller, there should be a point where your decision switches to fighting. Let’s try to find that point. There should be a sweet spot in between, where the probability of the other player being aggressive vs. moderate is such that fighting and giving in are equally good alternatives. That is, the expected rewards would be equal, which we can model as follows: 
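
Written out with player 2’s rewards from the leaves of the tree (shown here only symbolically; the concrete numbers come from the figure above), the indifference condition is:

p · reward(fight | aggressive) + (1 − p) · reward(fight | moderate) = p · reward(give in | aggressive) + (1 − p) · reward(give in | moderate)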

Do you see how this formula is derived from the rewards for fighting or giving in in the different leaves of the tree? This formula solves to p=1/3, so if the probability of the other player being aggressive is exactly 1/3, it would make no difference whether to fight or give in. But if you assume the other player to be aggressive with a probability of more than 1/3, you should give in, and if you assume aggressiveness to be less likely than 1/3, you should fight. This is a chain of thought you also have in other games where you act under uncertainty. When you play poker, you might not calculate the probabilities exactly, but you ask yourself, “How likely is it that John has two kings in his hand?” and depending on your assumption of that probability, you check, raise or give up. 

Summary & outlook

Your journey on the seas of game theory has only just begun. There is so much more to explore. Photo by George Liapis on Unsplash

Now we have learned a lot about dynamic games. Let us summarize our key findings. 

  • Dynamic games include an order in which players take turns. 
  • In dynamic games, the players’ possible actions depend on the previously executed actions of the other players. 
  • A Nash equilibrium in a dynamic game can be implausible, as it contains an empty threat that would not be rational.
  • The concept of subgame perfect equilibria prevents such implausible solutions. 
  • In dynamic games, decisions can be hidden. In that case, players may not exactly know which node of the game they are in and have to assign probabilities to different states of the games. 

With that, we have reached the end of our series on the fundamentals of game theory. We have learned a lot, yet there are plenty of things we haven’t been able to cover. Game theory is a science in itself, and we have only been able to scratch the surface. Other concepts that expand the possibilities of game-theoretic analyses include: 

  • Analysing games that are repeated multiple times. If you play the prisoner’s dilemma multiple times, you might be tempted to punish the other player for having betrayed you in the previous round. 
  • In cooperative games, players can conclude binding contracts that determine their actions to reach a solution of the game together. This is different from the non-cooperative games we looked at, where all players are free to decide and maximize their own reward. 
  • While we only looked at discrete games, where each player has a finite number of actions to choose from, continuous games allow an infinite number of actions (e.g., any number between 0 and 1). 
  • A big part of game theory considers the usage of public goods and the problem that individuals might consume these goods without contributing to their maintenance. 

These concepts allow us to analyse real-world scenarios from various fields such as auctions, social networks, evolution, markets, information sharing, voting behaviour and much more. I hope you enjoyed this series and find meaningful applications for the knowledge you gained, be it the analysis of customer behaviour, political negotiations or the next game night with your friends. From a game theory perspective, life is a game!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in the English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.


Hyperparameter Optimization For LLMs: Advanced Strategies


Finding an optimal set of hyperparameters is essential for efficient and effective training of Large Language Models (LLMs).

The key LLM hyperparameters influence the model size, learning rate, learning behavior, and token generation process.

Due to their computational demands, traditional methods for optimizing hyperparameters, such as grid search, are impractical for LLMs.

Advanced hyperparameter optimization strategies, like population-based training, Bayesian optimization, and adaptive LoRA, promise to balance computational effort and outcome.

The rise of large language models (LLMs) is bringing advances in text generation and contextual understanding. Hyperparameters control the size of LLMs, their training process, and how they generate outputs.

An optimal combination of hyperparameters is fundamental to efficiently pre-training and fine-tuning LLMs. Since LLM training is computationally intensive, exhaustive experimentation is not viable. This rules out traditional machine-learning hyperparameter optimization (HPO) methods that rely on systematically exploring the hyperparameter space by training many models with slightly different configurations.

When configuring models and training processes, LLM developers rely on a thorough understanding of each hyperparameter’s influence, insights from fundamental research, and empirical evidence gained from training state-of-the-art foundation models. Methods for estimating optimal hyperparameter values with limited compute budgets and adapting hyperparameters throughout the training process can help pre-training and fine-tuning.

After reading this article, you’ll be able to answer the following questions:

  • What key hyperparameters should be considered when developing, training, and applying LLMs?
  • How does each hyperparameter influence the LLM, and which trade-offs do we need to be aware of?
  • How can we select an optimal combination of hyperparameters in our scenario without fully training multiple model variants?
  • What advanced hyperparameter optimization techniques are available for LLMs, and when can we apply them?

LLM hyperparameters

A hyperparameter is a configuration value that controls the behavior of a machine-learning model during the training or inference process. Unlike model parameters (the weights), which are learned directly from the training data, hyperparameters are defined by the model developers. A hyperparameter can be constant or adjusted dynamically according to predefined rules or schedules.

Model size

In the case of LLMs, we often work with pre-trained models, where the activation functions, internal architecture of layers or blocks, and their connections—all examples of hyperparameters—are fixed. If our pre-trained LLM of choice is available in different sizes, the model size is the only hyperparameter affecting the model’s makeup we can actively control.

The size of an LLM refers to the total number of parameters it contains, which influences the model’s capacity to understand and generate complex language patterns. Hyperparameters set and tuned during pre-training influence the total size of an LLM.

One hyperparameter influencing a model’s size is its depth, corresponding to the total number of layers stacked sequentially. Each additional layer in an LLM adds more parameters, such as the weights for the self-attention mechanism and feed-forward layers in a transformer block.

Another hyperparameter influencing an LLM’s size is its hidden size, which refers to the dimensionality of the token embeddings and the internal representations within each layer. The hidden size determines how richly the model can encode information about each input token and how effectively it can process complex language patterns. A larger hidden size means each token is represented in a higher-dimensional space, allowing the model to capture more detailed semantic and syntactic nuances.

Further, the number of parallel attention heads in each transformer block influences the size of the LLM. Multiple heads allow the model to focus on different input aspects simultaneously. Through multi-query and grouped-query attention, we can reduce the number of necessary parameters.

Finally, the vocabulary size and context window (maximum sequence length) also impact the model’s size. They determine the language diversity a model can handle and the context length it can maintain, respectively.

These hyperparameters, set before beginning the training process and unable to be changed later, determine the model size. For example, GPT-3 has 96 layers, a hidden size of 12,288, 96 attention heads, a vocabulary of 50,257 tokens, and a context window of 2,048 tokens, resulting in a total of 175 billion parameters.
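
To see how these hyperparameters translate into the total parameter count, here is a rough back-of-the-envelope sketch in Python. It assumes a standard GPT-style block with a 4x feed-forward expansion and ignores biases, layer norms, and positional embeddings, which contribute comparatively little:

```python
# Rough parameter-count estimate for a GPT-style decoder (a sketch, not an
# exact accounting of any specific implementation).
def estimate_params(n_layers: int, hidden: int, vocab: int) -> int:
    embedding = vocab * hidden                 # token embedding matrix
    attention = 4 * hidden * hidden            # Q, K, V, and output projections
    feed_forward = 2 * hidden * (4 * hidden)   # up- and down-projection (4x expansion)
    per_layer = attention + feed_forward       # ~12 * hidden^2 per transformer block
    return embedding + n_layers * per_layer

# GPT-3-like configuration: 96 layers, hidden size 12,288, vocabulary of 50,257 tokens
print(f"{estimate_params(96, 12288, 50257) / 1e9:.1f}B parameters")  # ~174.6B, i.e., roughly 175B
```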

Learning rate

The learning rate (LR) is a critical hyperparameter in training LLMs. Optimizing it and its schedule is essential for efficient learning, stable convergence, and good generalization to unseen data.

The learning rate determines how much model weights are changed during each update. A high learning rate helps speed up the training process but increases the risk of instability and overfitting. A low learning rate increases stability and tends to benefit generalization but leads to slow training.

In the case of LLMs, the learning rate is typically not constant but varies as training progresses. This variation is governed by a learning rate schedule (LRS). The schedule is usually tied to the number of tokens seen—either directly, or indirectly through the number of samples, steps, or epochs. At a high level, it contains phases of a rising, constant, and decreasing learning rate.

How does the learning rate affect training duration and quality?

Following theoretical work by Stanford researcher Kaiyue Wen and colleagues published in December 2024, we can think of LLM training as progressing along a loss landscape that looks like a river valley. They hypothesize that the existence and overall direction of the river are due to the facts and knowledge an LLM learns, which are reflected as highly deterministic and, therefore, easy-to-predict tokens. The valley slopes arise from flexibility and ambiguity inherent to language, i.e., hard-to-predict tokens.

Visualization of LLM training as traveling down a river valley. Using a stable but high learning rate ensures quick progress down the river but leads to jumps between relatively high loss values. Reducing the learning rate during a subsequent decay phase brings the model towards a local loss minimum. | Source

In this picture, the training goal is to reach the river mouth, at which point we should be as close to the bottom of the valley as possible. The first crucial insight is that it does not matter whether we stay at the bottom of the valley until then. Thus, if we can make faster progress down the river by bouncing back and forth between points high up the loss valley’s slopes, we can do this without affecting the final outcome.

Thus, we should aim to use a high learning rate—resulting in large steps towards the loss minimum but leading to wildly fluctuating loss values—for as long as possible. Towards the end of the training, the learning rate should be decreased to a very low value. This will slow down progress towards the river mouth but reduce the oscillations to a point where we constantly stay at the valley’s bottom, i.e., the local loss minimum.

However, all of this is only going to work if we are already in a sufficiently deep loss river valley. When training is first starting, a high learning rate will lead to undirected jumps across the loss landscape. To avoid this, learning rate schedules for LLMs start with a small learning rate and slowly ramp it up to its maximum value. This is called the warmup phase.

Cosine schedule

The cosine schedule (also known as cosine decay or cosine annealing) implements this approach by starting with a linear warmup phase that brings the learning rate to its maximum value, followed by a slow decay following the cosine function:

LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π t/T))

Here, LRmin and LRmax are the minimum and maximum learning rates, t is the training step, and T is the total number of training steps. The advantage of this schedule is that it stays close to the peak learning rate for a long time, and the final decay is gradual. It’s also easy to implement, as it depends on just three hyperparameters (LRmax, LRmin, and T) linked by the cosine function.
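
As an illustration, here is a minimal Python sketch of the formula above combined with a linear warmup phase. The step counts and learning rates are made-up example values:

```python
import math

def cosine_schedule_with_warmup(step: int, total_steps: int, warmup_steps: int,
                                lr_max: float, lr_min: float = 0.0) -> float:
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Example: peak LR of 6e-5, 2,000 warmup steps, 100,000 total steps
for step in (0, 1_000, 2_000, 50_000, 100_000):
    print(step, f"{cosine_schedule_with_warmup(step, 100_000, 2_000, 6e-5):.2e}")
```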

Cosine schedules have been highly popular for pretraining LLMs. For example, this schedule was used for BLOOM, a 176-billion-parameter multilingual model developed by the BigScience Research Workshop and released in 2022. In an initial warmup phase, the learning rate was ramped to a peak of 6 × 10⁻⁵ over 375 million tokens. Afterward, it was lowered to 10% of this value with cosine decay over 410 million tokens and remained at this value. The implementation and detailed description are publicly accessible in BLOOM’s GitHub repository.

For pre-training their Llama 3 405B model, Meta used a slightly more involved variant of the cosine schedule. In the first stage, a warm-up phase of up to 8,000 steps brought the learning rate to a maximum of 8 × 10⁻⁵. Subsequently, the learning rate decreased to 8 × 10⁻⁷ over 1.2 million steps with a cosine decay. After the second stage focused on training the LLM up to its final context length of 128,000 tokens, the learning rate linearly decreased to 0 over 40 million tokens in the third stage. Supervised fine-tuning was conducted over about 9,000 steps with a learning rate of 10⁻⁵.

A major disadvantage of the cosine schedule is that the total number of training steps has to be known beforehand. When training large foundation models, the total compute budget is typically set, and the optimal number of training tokens can be estimated. However, when fine-tuning or experimenting, it would be preferable to base the decision on when to end training on the model’s performance.

Warmup-stable-decay schedule

The warmup-stable-decay (WSD) schedule is a simple protocol introduced by Shengding Hu and colleagues at Tsinghua University in 2024. It starts with a linear warmup to the maximum learning rate, keeps the learning rate constant for the majority of the training, and ramps it down at the end.

Through experiments, they found that a decay phase that makes up 10% of the total length is sufficient. They also demonstrated that a WSD schedule leads to a lower loss than a cosine schedule. According to Wen and colleagues at Stanford, this can readily be understood in the river valley picture. In the WSD schedule, the learning rate stays at a high value longer than in the cosine schedule. Hence, we make it further down the valley before dropping to its bottom. Further, their analysis shows that training progress in the stable phase is dominated by learning to predict deterministic tokens (facts and knowledge), while in the decay phase, the LLM learns the stochastic tokens (language variability).
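
A minimal Python sketch of such a schedule, using a linear ramp-down and the 10% decay fraction mentioned above (the exact decay shape varies between implementations):

```python
def wsd_schedule(step: int, total_steps: int, warmup_steps: int,
                 lr_max: float, lr_min: float = 0.0,
                 decay_fraction: float = 0.1) -> float:
    """Warmup-stable-decay: linear warmup, constant plateau, ramp-down at the end."""
    decay_start = int(total_steps * (1 - decay_fraction))
    if step < warmup_steps:                          # warmup phase
        return lr_max * step / max(warmup_steps, 1)
    if step < decay_start:                           # stable phase
        return lr_max
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return lr_max - progress * (lr_max - lr_min)     # decay phase
```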

Comparison of the loss curves resulting from a cosine and warmup-stable-decay (WSD) learning rate schedule. In the WSD schedule, the learning rate remains at a constant high value during the stable phase. This leads to high intermediate loss values as the loss fluctuates around the local minimum as it progresses towards lower values. During the final 10% of the total training steps, the learning rate is decreased to its minimum, leading to a sharp drop in the loss. Since the learning rate remained at a high value for longer, the final loss resulting from the WSD schedule is smaller than the loss from the cosine schedule. | Source

While a WSD schedule yields a lower loss for the same training budget, knowing the total number of training steps ahead of time is still required for scheduling the decay phase. However, the WSD schedule offers a straightforward way to extend the total number of training steps retroactively: If we find that our final model’s performance is unsatisfactory, we can resume training from a model snapshot taken at the end of the stable phase. This beams us back a small distance up the loss river valley, from where we continue making large jumpy steps towards the river mouth as if we had never descended down to the valley’s bottom in the first place.

Restarting this way, we still benefit from 90% of the compute budget spent so far. It allows us to determine the compute budget we need as we go, producing fully trained intermediate models—something that the cosine schedule inherently does not allow for.

Track months-long model training with more confidence. Use neptune.ai forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
  • Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

Cyclical cosine schedule

Returning to a high learning rate after decaying to a minimum is not a new idea in machine learning. Long established in gradient-free optimization, it was made popular for deep learning training through the “Stochastic Gradient Descent with Warm Restarts” technique proposed by Ilya Loshchilov and Frank Hutter in 2017. The learning rate is governed by a function very similar to the one for the cosine schedule:

LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π (t mod T)/T))

This time, T is not the total number of training steps but is understood as the schedule’s period. For example, we might train for 10,000 steps with T = 1,000, leading to ten consecutive cosine decay cycles. Commonly, LRmax is set to a new, lower value at the beginning of each cycle.

In the loss landscape river valley, we’re climbing down to the bottom over T steps, making ever slower progress down the river as we get closer to the bottom. Then, we immediately jump back up the valley’s slopes and resume making large steps toward the river mouth.

Right at the beginning of a new cosine cycle, the loss will be significantly higher than it was previously. This could be due to the jump in the learning rate, which might perturb the model. However, Wen and colleagues argue, based on their experiments and theoretical insights, that it is the result of training with a small learning rate for too long.

Whatever the cause, this doesn’t just make training less efficient. It’s also an obstacle to continue model training later. Whether we aim to further pre-train on newly acquired or different data, fine-tune an LLM, or incrementally evolve a model in a continual learning scenario—ideally, we could take a model snapshot and train it effectively, making the most of the compute budget we have available and the compute budget we have already spent. The learning rate schedule used during pretraining directly impacts this.

Cyclical warmup-stable-decay schedule

The Warmup-Stable-Decay (WSD) schedule allows continuing training from the final model checkpoint of the stable phase without incurring a loss penalty. This preserves a large fraction of the compute budget spent, as we only have to discard what we spent on intermediate decay phases. But this is not negligible at the scale of LLM pretraining, where the costs regularly exceed tens of millions of US dollars.

As Wen and colleagues found, starting from the final decay phase model checkpoint in a WSD schedule does not cause the same loss penalty as the cosine schedule. As the WSD schedule’s decay phase is rather short, they hypothesize it does not have the same destructive effect as the cosine schedule’s long and slow decay. Given a total compute budget, consecutively repeating the WSD cycle is more efficient than restarting from the final checkpoint of the latest stable phase.

A cyclical WSD schedule is easier to implement than WSD restarts, as the model evolves continuously down the loss landscape river valley, and no prior checkpoints have to be reloaded. It also helps downstream users, who initially often utilize few-shot prompting to adapt an LLM to their use case. If they later decide to fine-tune it, and the LLM is trained with a WSD schedule, training the same model checkpoint they already use for inference is efficient.

Learning behavior

In a neural network, the weights are the parameters of its neurons learned during training. In an LLM, weights include the query, key, and value matrices in the attention heads and the weight matrices of the feed-forward layers. While the learning rate governs the scale of changes made to the model’s weights, we can also control how the weights change on a more fine-grained level.

Weight decay

Employing weight decay during training penalizes large weights, preventing small parts of the model from dominating its output. Weight decay in stochastic gradient descent is implemented by adding a term to the loss function. For example, using L2 regularization, the adapted loss function looks like this:
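
L = Lorig + λ Σi wi²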

Here, Lorig is the original loss function, λ is the weight decay factor, and wi are the model weights.

Weight decay has been applied to transformer-based NLP models since the beginning. In the seminal 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the authors state that they trained the model using “Adam with [a] learning rate of 1e-4, β₁=0.9, β₂=0.999, L2 weight decay of 0.01, learning rate warm up over the first 10,000 steps, and linear decay of the learning rate.”

As Ilya Loshchilov and Frank Hutter point out in their 2019 paper Decoupled Weight Decay Regularization, in adaptive optimizers like Adam, L2 regularization and weight decay are not identical, and L2 regularization is not effective. In Adam, the gradient of the regularization term is scaled with the gradient of Lorig, which means that weights with large gradients receive hardly any regularization. They introduced the AdamW optimizer, where the weight decay term is independent of the gradient-based update. AdamW is widely used for LLMs, such as for training Megatron-LM (2019), Llama 1 (2023), Llama 2 (2023), and Llama 3 (2024).
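
For reference, this is roughly what the decoupled variant looks like in PyTorch. This is a sketch: the tiny stand-in model and the exact values (which mirror the BERT recipe quoted above) are placeholders, not universal defaults:

```python
import torch
from torch import nn

model = nn.Linear(768, 768)  # stand-in for an LLM

# AdamW applies weight decay directly to the weights, decoupled from the
# gradient-based Adam update.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,
)
```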

In LLM pretraining, models often see each training sample only once. Thus, overfitting to training data, which weight decay helps prevent in traditional deep learning scenarios, is only of concern if there are many similar or even identical samples in the training dataset. Still, weight decay positively affects training speed and the final loss.

According to a 2023 analysis by Francesco D’Angelo and colleagues at EPFL, this is because weight decay increases the effective learning rate. The effective learning rate at training step t is defined as LR(t)/||wt||2, the learning rate scaled by the inverse norm of the weight vector. The smaller the weights, the larger the influence of a weight update. Further, D’Angelo and colleagues find that weight decay stabilizes training in reduced floating-point precision.

Gradient clipping

Gradient clipping caps gradient magnitudes, helping maintain numerical stability. In the river valley analogy, we impose a threshold on slope steepness when deciding where to move next. Rather than jumping off a cliff, we treat it as a moderately steep hillside.

There are two common types of gradient clipping:

  1. Clipping by value: Set predefined minimum and maximum values for gradient magnitudes. A gradient component is clipped to the respective limit if it exceeds these thresholds. This approach has the key benefit of not requiring access to the entire gradient vector.
  2. Clipping by norm: The entire gradient vector is scaled down if the norm exceeds a specified threshold. For example, Nvidia’s original Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper first published in 2019 notes: “[W]e use global gradient norm clipping of 1.0 to improve the stability of training large models.” In contrast to clipping by value, this preserves the gradient vector’s direction but requires access to the entire gradient vector to compute.
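
Both variants are available in PyTorch. A minimal sketch follows; in practice you would pick one of the two, and the threshold of 1.0 mirrors the Megatron-LM setting quoted above:

```python
import torch
from torch import nn

model = nn.Linear(512, 512)  # stand-in for an LLM
loss = model(torch.randn(8, 512)).pow(2).mean()
loss.backward()

# Option 1: clip each gradient component to the range [-1, 1]
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Option 2: rescale the whole gradient vector if its global norm exceeds 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```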

In 2022, Yang and Ma introduced the Component-Wise Gradient Norm Clipping (CWGNC) approach for fine-tuning LLMs. In a nutshell, CWGNC applies gradient-clipping by norm separately to components in the LLM, such as the key, query, and value matrices or feed-forward layers. This stabilizes the training of each component individually, which might progress at significantly different rates.

Next-token generation

LLMs are autoregressive language models. They predict the next token by taking the sequence of previously generated tokens as input and producing a vector containing a probability for each token in the vocabulary. Different post-processing techniques can be used to determine the next token from these probabilities.

Temperature

Typically, LLMs use a softmax function as the final step in computing token probabilities. A temperature parameter controls this function.

The temperature influences the degree of randomness (or “originality” or “creativity”) in an LLM’s predicted text. At low temperatures, the model becomes more deterministic, rarely considering less likely options and instead focusing on the tokens with the highest probabilities. Conversely, a high temperature increases unpredictability, allowing the model to choose from a broader range of tokens. Thus, lower temperatures are helpful when you need reliable answers, while higher temperatures lead to more varied and surprising outputs.
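
To make this concrete, here is a small sketch of a temperature-scaled softmax. The logits are made-up values for a three-token vocabulary:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([4.0, 2.0, 1.0])                 # made-up logits for three tokens
print(softmax_with_temperature(logits, 0.2))       # nearly all mass on one token -> deterministic
print(softmax_with_temperature(logits, 1.2))       # flatter distribution -> more randomness
```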

The Text Gen Playground Hugging Face Space allows users to experiment with different temperature settings and models. By inputting a prompt and adjusting the temperature parameter, you can observe how the model’s output varies from predictable and deterministic to creative and varied.

For example, using the prompt “The sun rises in the” at different temperatures:

  • Low Temperature (e.g., T = 0.2): The model will likely complete the sentence with “east,” reflecting a common and expected continuation.
  • High Temperature (e.g., T = 1.2): The model might generate more imaginative completions like “morning haze” or “golden skies,” showcasing increased creativity.

Adjusting the temperature parameter in such playgrounds provides valuable insights into controlling the balance between determinism and creativity in language model outputs.

Sampling strategy

Given the vector of probabilities, there are many ways to select the next token.

A straightforward strategy is always picking the most likely token. Since the sampling process only considers the probabilities for the very next token, this “greedy decoding” leads to highly probable multi-token sequences being discarded if they start with a token that – viewed in isolation – is less likely.

Using beam search or random sampling according to the token probabilities can mitigate this. While the former produces deterministic outputs and thus no variety, the latter can lead to the selection of highly improbable tokens, producing nonsensical sequences.

A more balanced approach is top-k sampling, which restricts sampling of the next token to the k most probable tokens. Alternatively, in top-p sampling, only the most likely tokens up to a cumulative probability of p are considered. This approach adapts dynamically to the probability distribution, sampling from many tokens in uncertain scenarios and picking from only a few when the model is more confident. (p and k can be adjusted during training or inference time.)

As ML engineers, we can fine-tune temperature and sampling strategy parameters according to our project needs. For example, if our tasks require precision (e.g., technical writing or summarization), we’ll use lower temperatures and top-k sampling to prioritize high-probability tokens. If we need more diversity, we’ll begin with common default values (temperature 0.7, top-k: k = 40, top-p: p = 0.9). We’ll iteratively adjust them based on a qualitative evaluation of outputs and document our findings to build a shared knowledge base with our team.
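
As a sketch of how these parameters are wired together with the Hugging Face transformers API (“gpt2” serves only as a small stand-in model here, and the sampling values are the common defaults mentioned above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The sun rises in the", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.7,
    top_k=40,
    top_p=0.9,
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```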

How do we find the optimal hyperparameters?

LLM training involves many hyperparameters, resulting in a combinatorial explosion of the search space. Simply guessing hyperparameters is unlikely to yield good results. Further, hyperparameters interact in complex ways, so the optimal value for one may depend on the values of others. Thus, adjusting hyperparameters one at a time may lead to suboptimal solutions, as we easily become trapped in local optima and don’t adequately explore the hyperparameter space.

Finding an optimal combination of hyperparameters requires a systematic approach. First, it’s paramount to understand the relevant hyperparameters and their influence on the particular LLM. It’s essential to research how similar architectures were trained or how the LLM we want to fine-tune was pre-trained. Further, we should clarify the available time, our compute budget, and the training objectives.

Next, we can sketch a roadmap. Can we afford to conduct experiments with particular hyperparameter combinations we believe are useful? Do we already have an experiment tracker and resource monitoring in place, or do we need to set it up first? What will be the decision points and criteria that ensure we end up with a fully trained LLM at the end of the project? Finally, we can start executing this roadmap and adjust our plans as we gather more information and insight.

The BLOOM team published a detailed paper on their preliminary experiments to determine the optimal model size and architecture. They describe how they started with GPT-3’s hyperparameters and conducted trial runs to estimate the optimal balance between model size and number of tokens given their fixed compute budget. Similar experiments were run by the Meta team that trained Llama 3, who also aimed to predict downstream task performance.

Can we use traditional machine learning hyperparameter optimization methods for LLMs?

Methods for systematic hyperparameter optimization have long been studied in machine learning:

  • Learning curve analysis involves training models with varying hyperparameters over several epochs and plotting the loss to identify trends. In deep-learning models, plotting the gradient can further help assess whether and how efficiently a model learns.
  • Grid search systematically steps through the hyperparameter space, training a model for each possible combination. Random search samples the hyperparameter space, training models for randomly selected combinations.

While these approaches have successfully been applied to optimize LLM hyperparameters, their use is severely limited by the fact that LLMs are very expensive to train. The computational and memory requirements make it unviable to train large numbers of models. If training a model takes several months on a large cluster, we’ll only get one shot at a full training run.

Advanced strategies for LLM hyperparameter optimization

Beyond starting from a well-known hyperparameter combination and systematically conducting experiments, there is a range of approaches for automatically identifying or optimizing LLM hyperparameters in specific circumstances.

Population-based training (PBT)

Population-Based Training (PBT) is an approach pioneered by Google DeepMind that combines the concepts of evolutionary search and online training. Instead of fixing hyperparameters at the start of training and leaving them static throughout the process, PBT adapts them dynamically, informed by the models’ performance.

In a nutshell, the population-based training process consists of the following steps:

  1. Set up a population of models, each with unique hyperparameters hi and weights θi. 
  2. Train each model, updating θi every iteration.
  3. After a fixed number of iterations, evaluate each model’s performance on a validation dataset.
  4. Identify models that are underperforming relative to others. Replace their current weights and hyperparameters with those of a better-performing model (exploitation).
  5. Slightly perturb the hyperparameters of previously underperforming models to prevent the population from converging to a single configuration too early and improve diversity (exploration).
  6. Conclude the training if the compute budget is exhausted or the objective has been met. Otherwise, repeat the process starting from step 2.
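
A toy sketch of this loop in Python. Weight copying and real training are replaced by a stand-in objective, so it only illustrates the exploit/explore mechanics of steps 2–5:

```python
import random

def train_one_round(member):
    # Toy stand-in for real training: score improves as lr approaches 3e-4.
    member["score"] = -(member["hyperparams"]["lr"] - 3e-4) ** 2

def population_based_training(population, rounds, perturb=0.2):
    for _ in range(rounds):
        for member in population:
            train_one_round(member)                               # step 2
        population.sort(key=lambda m: m["score"], reverse=True)   # step 3
        survivors = population[: len(population) // 2]
        for i, loser in enumerate(population[len(population) // 2:]):
            winner = survivors[i % len(survivors)]                # step 4: exploit
            loser["hyperparams"] = {                              # step 5: explore
                k: v * random.uniform(1 - perturb, 1 + perturb)
                for k, v in winner["hyperparams"].items()
            }
    return population[0]

population = [{"hyperparams": {"lr": random.uniform(1e-5, 1e-3)}, "score": 0.0}
              for _ in range(8)]
print(population_based_training(population, rounds=5)["hyperparams"])
```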

This process initially appears resource-intensive since it requires maintaining and updating multiple models simultaneously, which can increase total GPU hours. However, PBT’s dynamic refinement of hyperparameters during training can significantly save wall-clock time. By avoiding restarting from scratch for each hyperparameter configuration and leveraging partially trained models, PBT reduces the number of training epochs needed to achieve optimal performance.

The 2017 DeepMind study on Population-Based Training (PBT) showcased its potential for LLMs by fine-tuning the original transformer model on the WMT 2014 English-German machine translation benchmark. They manually optimized a baseline model and compared it to a model where they used PBT to optimize the dropouts for different layers and the learning rate. Their evaluation showed that the PBT-optimized model outperformed their hand-tuned baseline. Further, they discovered that the learning rate schedule generated through PBT mimicked the human-created one. Starting with a small learning rate, it then jumped to a high value before something resembling an exponential decay brought it down to a low value again. DeepMind’s original PBT transformer model also learned noticeably faster.

Ray Tune is a hyperparameter tuning library that supports population-based training. It is part of the open-source Ray framework for scaling machine-learning applications. The Ray Tune documentation includes an example of tuning BERT and RoBERTa on the GLUE benchmark dataset using population-based training.

Bayesian optimization

Bayesian optimization is a popular method for efficiently navigating the hyperparameter space by building a probabilistic model (surrogate model) of the influence of the hyperparameters on the objective (e.g., validation loss). The surrogate model is used to predict promising hyperparameter combinations to try next. The results of this exploration are then used to refine the surrogate model.

The 2024 paper Crafting Efficient Fine-Tuning Strategies for Large Language Models investigates the applicability of Bayesian optimization to fine-tuning LLMs. First, a population of N models is trained for a pre-defined budget t1. As each model is trained, the surrogate model is updated, and the updated version is used to set the hyperparameters of the next model. Once all N models are trained, the top k models are selected and are trained up to t2. Finally, the best model among the k fully trained models is selected.

Adaptive Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a popular technique for reducing the memory footprint and computational demands when fine-tuning LLMs. In brief, the idea is to represent the weights of the fine-tuned model as 

Wfine = Wpre + ∆W =  Wpre + BA

Here, the fine-tuned weights Wfine are the sum of the original weights Wpre and a difference ∆W, which is the product of two matrices, B and A. Only B and A are updated during fine-tuning, while Wpre remains unchanged. If Wpre and ∆W have dimensions m x n, B and A have dimensions m x r and r x n, respectively. If the rank r is much smaller than m and n, the number of weights to be updated is greatly reduced, leading to faster training progress while requiring less memory.
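
A minimal numpy sketch of this decomposition and the resulting parameter savings (the shapes are chosen arbitrarily for illustration):

```python
import numpy as np

m, n, r = 4096, 4096, 8                 # full weight shape and a small LoRA rank
W_pre = np.random.randn(m, n)           # frozen pre-trained weights
B = np.zeros((m, r))                    # trainable, commonly initialised to zero
A = np.random.randn(r, n) * 0.01        # trainable

W_fine = W_pre + B @ A                  # effective fine-tuned weights

full, lora = m * n, m * r + r * n
print(f"trainable parameters: {lora:,} instead of {full:,} ({100 * lora / full:.2f}%)")
```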

In practice, it is often unclear to which LLM components LoRA should be applied for the best outcome. While we know that not all weights influence task performance equally, identifying which components are important for a particular objective would require extensive ablation studies. Thus, LoRA is often applied across all suitable weight matrices in a model.

AdaLoRA (Adaptive Low-Rank Adaptation) is a method to allocate a given parameter budget across weight matrices. The core idea is to apply LoRA to all LLM components but to use different values for the rank r. Important components use a matrix pair with a large r, leading to a ∆W with many weights. Less important components are approximated using a lower-rank matrix pair. AdaLoRA assigns an importance score to each component and sets the values for r such that the total number of weights remains within the user-defined budget. This leads to an optimal training outcome for a fixed compute and memory budget.

AdaMoLE (Adaptive Mixture of Low-Rank Adaptation Experts) similarly aims to reduce the number of weights that need to be updated. It replaces the single low-rank matrix pair of the original LoRA with a collection of multiple matrix pairs (LoRA experts) that are activated dynamically based on the input context. This enables the LLM to learn different tasks with a minimal total number of weights.

Fine-tuning an LLM with the Adaptive Mixture of Low-Rank Adaptation Experts approach. The fine-tuned weights are approximated as the sum of the frozen pre-trained weights and a number of so-called LoRA experts that are activated by a gating function and a threshold function. Different LoRA experts specialize in different contexts, allowing the LLM to learn different tasks with a minimal number of weights. | Modified based on: source

Hands-on: LLM hyperparameter optimization with neptune.ai

Optuna is a hyperparameter optimization framework that uses Bayesian optimization techniques to search the hyperparameter space. It can be applied to various machine-learning tasks, including LLM hyperparameter tuning.
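
To sketch what an Optuna study for the three hyperparameters tuned in the notebook looks like (the real objective would fine-tune and evaluate a model; a dummy function stands in here so the snippet runs):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    num_epochs = trial.suggest_int("num_epochs", 1, 5)
    # In a real study, fine-tune the model with these values and return the
    # validation loss. The expression below is only a dummy objective.
    return (learning_rate - 5e-5) ** 2 + 0.01 * num_epochs + 0.001 * (batch_size / 32)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```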

To see this in action, we’ve prepared a Colab notebook that walks you through the process of finding the optimal combination of learning rate, batch size, and number of epochs for fine-tuning a Hugging Face Transformers model on the IMDB dataset.

The tutorial uses neptune.ai to track training progress and analyze the different hyperparameters. If you don’t want to go through the tutorial yourself right now, you can still explore example results in this public Neptune project.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

What’s next in LLM hyperparameter optimization?

Finding an optimal combination of hyperparameters is essential for training LLMs. In this article, we’ve reviewed key LLM hyperparameters and their influence on the model and training performance. We’ve also discussed how to approach hyperparameter optimization systematically and explored methods to assist or even automate this task in certain scenarios.

From the examples of hyperparameter choices for state-of-the-art LLMs, we’ve seen that while architectures, training tasks, and data change, most models are trained with relatively similar learning rate schedules and optimizer configurations. As our understanding of the model and training mechanics deepens and more experiments yield empirical evidence, we’ll likely see an evolution of the standard recipes and more diversity.


Effortless Spreadsheet Normalisation With LLM


This article is part of a series of articles on automating Data Cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Start with the why

A spreadsheet containing information about awards given to films

Let’s consider this Excel spreadsheet, which contains information on awards given to films. It is sourced from the book Cleaning Data for Effective Data Science and is available here.

This is a typical and common spreadsheet that everyone may own and deal with in their daily tasks. But what is wrong with it?

To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:

  • Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
  • Tidy data: a well-normalised data frame that facilitates processing and manipulation.

The second point is the primary foundation of any analysis, including dealing with data quality.

Returning to our example, imagine we want to perform the following actions:

1. For each film involved in multiple awards, list the award and year it is associated with.

2. For each actor/actress winning multiple awards, list the film and award they are associated with.

3. Check that all actor/actress names are correct and well-standardised.

Naturally, this example dataset is small enough to derive those insights by eye or by hand if we restructure it (about as quickly as writing code). But imagine now that the dataset contains the entire awards history; this would be time-consuming, painful, and error-prone without any automation. 

It is difficult for a machine to read this spreadsheet and directly understand its structure, as it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis—all without altering the actual content of the dataset. 

Example of a reshaping of the data from the previous spreadsheet:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.

The main challenge is how to turn a shiny and human-eye-pleasant spreadsheet into a machine-readable tidy version.

What is tidy data? A well-shaped data frame?

The term tidy data was described in a well‐known article named Tidy Data by Hadley Wickham and published in the Journal of Statistical Software in 2014. Below are the key quotes required to understand the underlying concepts better.

Data tidying 

“Structuring datasets to facilitate manipulation, visualisation and modelling.”

“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”

Data structure

“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”

Data semantics

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”

“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:

  • Demographic data collected from each person (age, sex, race),
  • Medical data collected from each person on each day (number of sneezes, redness of eyes), and
  • Meteorological data collected on each day (temperature, pollen count).”

Tidy data

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.”

Common problems with messy datasets

Column headers might be values rather than variable names.

  • Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
  • Tidy version: A table with a “Year” column and each row representing an observation for a given year.
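
For instance, in pandas, this particular kind of messiness can be fixed with a single melt. The data here is made up purely for illustration:

```python
import pandas as pd

# Messy: years as column headers
messy = pd.DataFrame({
    "country": ["France", "Japan"],
    "2019": [10, 7],
    "2020": [12, 9],
    "2021": [15, 11],
})

# Tidy: one "year" column, one observation per row
tidy = messy.melt(id_vars="country", var_name="year", value_name="value")
print(tidy)
```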

Multiple variables might be stored in one column.

  • Messy example: A column named “Age_Gender” containing values like 28_Female
  • Tidy version: Separate columns for “Age” and “Gender”

Variables might be stored in both rows and columns.

  • Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored as both column headers and repeated in rows instead of using a single “Subject” column.
  • Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.

Multiple types of observational units might be stored in the same table.

  • Messy example: A sales dataset that contains both customer information and store inventory in the same table.
  • Tidy version: Separate tables for “Customers” and “Inventory.”

A single observational unit might be stored in multiple tables.

  • Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
  • Tidy version: A single table or properly linked tables using a unique “Patient ID.”

Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.

Thinking about the how

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham (cf. Leo Tolstoy)

Although these guidelines sound clear in theory, they remain difficult to generalise easily in practice for any kind of dataset. In other words, starting with the messy data, no simple or deterministic process or algorithm exists to reshape the data. This is mainly explained by the singularity of each dataset. Indeed, it is surprisingly hard to precisely define variables and observations in general and then transform data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.

Thus, when complex and hardly maintainable rules-based systems are not suitable (i.e. to precisely deal with all contexts by describing decisions in advance), machine learning models may offer some benefits. This grants the system more freedom to adapt to any data by generalising what it has learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.

Then, let’s describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Diagram of a workflow made of code and LLM-based modules alongside business logic to reshape a spreadsheet

Spreadsheet encoder 

This module is designed to serialise into text the main information needed from the spreadsheet data. Only the necessary subset of cells contributing to the table layout is retained, removing non-essential or overly repetitive formatting information. Retaining only this subset minimises token usage, reduces costs, and enhances model performance. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the topic of a future article. 

Table structure analysis 

Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of questions addressed:

  • How many tables are present, and what are their locations (regions) in the spreadsheet?
  • What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
  • Which rows/columns serve as headers, and do any tables have multi-level headers?
  • Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
  • Are there any merged cells, and if so, how should they be handled?

Table schema estimation

Once the analysis of the spreadsheet structure has been completed, it is now time to start thinking about the ideal target table schema. This involves letting the LLM process iteratively by:

  • Identifying all potential columns (multi-row headers, metadata, etc.)
  • Comparing columns for domain similarities based on column names and data semantics
  • Grouping related columns  

The module outputs a final schema with names and a short description for each retained column.

Code generation to format the spreadsheet

Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Moreover, no useful content should be omitted (e.g., aggregated or computed values may still be derived from other variables).

As generating code that works well from scratch at the first iteration is challenging, two internal iterative processes are added to revise the code if needed:

  • Code checking: Whenever code cannot be compiled or executed, the trace error is provided to the model to update its code.
  • Data frame validation: The metadata of the created data frame—such as column names, first and last rows, and statistics about each column—is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.

Convert the data frame into an Excel file

Finally, if all data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.

Et voilà! The sky’s the limit for making the most of your newly tidy dataset.

Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Final note on the workflow

Why is a workflow proposed instead of an agent for that purpose?  

At the time of writing, we consider a workflow based on LLMs for precise sub-tasks to be more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages, such as more freedom in choosing the actions it performs. Nonetheless, agents can still be hard to handle in practice; for example, they may diverge quickly if the objective is not clear enough. I believe this is the case here, but that does not mean such an approach will not become applicable in the future, in the same way that SWE-agent-style coding agents are performing well today.

Next articles in the series

In upcoming articles, we plan to explore related topics, including:

  • A detailed description of the spreadsheet encoder mentioned earlier.
  • Data validity: ensuring each column meets the expectations.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.

Mixture of Experts LLMs: Key Concepts Explained


Mixture of Experts (MoE) is a type of neural network architecture that employs sub-networks (experts) to process specific input parts.

Only a subset of experts is activated per input, enabling models to scale efficiently. MoE models can leverage expert parallelism by distributing experts across multiple devices, enabling large-scale deployments while maintaining efficient inference.

MoE uses gating and load balancing mechanisms to dynamically route inputs to the most relevant experts, ensuring targeted and evenly distributed computation. Parallelizing the experts, along with the data, is key to having an optimized training pipeline.

MoEs have faster training and better or comparable performance than dense LLMs on many benchmarks, especially in multi-domain tasks. Challenges include load balancing, distributed training complexity, and tuning for stability and efficiency.

Scaling LLMs comes at a tremendous computational cost. Bigger models enable more powerful capabilities but require expensive hardware and infrastructure, also resulting in higher latency. So far, we’ve mainly achieved performance gains by making models larger, but this trajectory is not sustainable due to escalating costs, increasing energy consumption, and diminishing returns in performance improvement.

When considering the enormous amount of data and the wide variety of domains in which huge LLMs are trained, it’s natural to ask: instead of using the entire LLM’s capacity, could we pick and choose only the portion of the LLM that is relevant to our particular input? This is the key idea behind Mixture of Experts LLMs.

Mixture of Experts (MoE) is a type of neural network architecture in which parts of the network are divided into specialized sub-networks (experts), each optimized for a specific domain of the input space. During inference, only a part of the model is activated depending on the given input, significantly reducing the computational cost. Further, these experts can be distributed across multiple devices, allowing for parallel processing and efficient large-scale distributed setups.

On an abstract, conceptual level, we can imagine MoE experts specialized in processing specific input types. For example, we might have separate experts for different language translations or different experts for text generation, summarization, solving analytical problems, or writing code. These sub-networks have separate parameters but are part of the single model, sharing blocks and layers at different levels.

In this article, we explore the core concepts of MoE, including architectural blocks, gating mechanisms, and load balancing. We’ll also discuss the nuances of training MoEs and analyze why they are faster to train and yield superior performance in multi-domain tasks. Finally, we address key challenges of implementing MoEs, including distributed training complexity and maintaining stability.

Bridging LLM capacity and scalability with MoE layers

Since the introduction of Transformer-based models, LLM capabilities have continuously expanded through advancements in architecture, training methods, and hardware innovation. Scaling up LLMs has been shown to improve performance. Accordingly, we’ve seen rapid growth in the scale of the training data, model sizes, and infrastructure supporting training and inference.

Pre-trained LLMs have reached sizes of billions and even trillions of parameters. Training these models takes an extremely long time and is expensive, and their inference costs scale proportionally with their size.

In a conventional LLM, all parameters of the trained model are used during inference. The table below gives an overview of the size of several impactful LLMs. It presents the total parameters of each model and the number of parameters activated during inference:

The last five models (highlighted) exhibit a significant difference between the total number of parameters and the number of parameters active during inference. The Switch Transformer, Mixtral, GLaM, GShard, and DeepSeekMoE are Mixture of Experts LLMs (MoEs), which execute only a portion of the model’s computational graph during inference.

MoE building blocks and architecture

The foundational idea behind the Mixture of Experts was introduced before the era of Deep Learning, back in the ’90s, with “Adaptive Mixtures of Local Experts” by Robert Jacobs, together with the “Godfather of AI” Geoffrey Hinton and colleagues. They introduced the idea of dividing the neural network into multiple specialized “experts” managed by a gating network.

With the Deep Learning boom, the MoE resurfaced. In 2017, Noam Shazeer and colleagues (including Geoffrey Hinton once again) proposed the Sparsely-Gated Mixture-of-Experts Layer for recurrent neural language models.

The Sparsely-Gated Mixture-of-Experts Layer consists of multiple experts (feed-forward networks) and a trainable gating network that selects the combination of experts to process each input. The gating mechanism enables conditional computation, directing processing to the parts of the network (experts) that are most suited to each part of the input text.

Such an MoE layer can be integrated into LLMs, replacing the feed-forward layer in the Transformer block. Its key components are the experts, the gating mechanism, and the load balancing.

Overview of the general architecture of a Transformer block with integrated MoE layer. The MoE layer has a gate (router) that activates selected experts based on the input. The aggregated experts’ outputs form the MoE layer’s output. | Source: Author

Experts

The fundamental idea of the MoE approach is to introduce sparsity in the neural network layers. Instead of a dense layer where all parameters are used for every input (token), the MoE layer consists of several “expert” sub-layers. A gating mechanism determines which subset of “experts” is used for each input. The selective activation of sub-layers makes the MoE layer sparse, with only a part of the model parameters used for every input token.

How are experts integrated into LLMs?

In the Transformer architecture, MoE layers are integrated by modifying the feed-forward layers to include sub-layers. The exact implementation of this replacement varies, depending on the end goal and priorities: replacing all feed-forward layers with MoEs will maximize sparsity and reduce the computational cost, while replacing only a subset of feed-forward layers may help with training stability. For example, in the Switch Transformer, all feed-forward components are replaced with the MoE layer. In GShard and GLaM, only every other feed-forward layer is replaced.

The other LLM layers and parameters remain unchanged, and their parameters are shared between the experts. An analogy to this system with specialized and shared parameters could be the completion of a company project. The incoming project needs to be processed by the core team—they contribute to every project. However, at some stages of the project, they may require different specialized consultants, selectively brought in based on their expertise. Collectively, they form a system that shares the core team’s capacity and profits from expert consultants’ contributions.

Visualization of token-level expert selection in the MoE model (layers 0, 15, and 31). Each token is color-coded, indicating the first expert chosen by the gating mechanism. This illustrates how MoE assigns tokens to specific experts at different levels of architecture. It may not always be obvious why the same-colored tokens were directed to the same expert – the model processed high-dimensional representations of these tokens, and the logic and understanding of the token processing are not always similar to human logic. | Source

Gating mechanism

In the previous section, we have introduced the abstract concept of an “expert,” a specialized subset of the model’s parameters. These parameters are applied to the high-dimensional representation of the input at different levels of the LLM architecture. During training, these subsets become “skilled” at handling specific types of data. The gating mechanism plays a key role in this system.

What is the role of the gating mechanism in an MoE layer?

When an MoE LLM is trained, all the experts’ parameters are updated. The gating mechanism learns to distribute the input tokens to the most appropriate experts, and in turn, experts adapt to optimally process the types of input frequently routed their way. At inference, only relevant experts are activated based on the input. This enables a system with specialized parts to handle diverse types of inputs. In our company analogy, the gating mechanism is like a manager delegating tasks within the team.

The gating component is a trainable network within the MoE layer. The gating mechanism has several responsibilities:

  • Scoring the experts based on input. For N experts, N scores are calculated, corresponding to the experts’ relevance to the input token.
  • Selecting the experts to be activated. Based on the experts’ scoring, a subset of the experts is chosen to be activated. This is usually done by top-k selection.
  • Load balancing. Naive selection of the top-k experts would lead to an imbalance in token distribution among experts. Some experts may become too specialized by only handling a minimal input range, while others would be overly generalized. During inference, routing most of the input to a small subset of experts would lead to overloaded and underutilized experts. Thus, the gating mechanism has to distribute the load evenly across all experts.

How is gating implemented in MoE LLMs?

Let’s consider an MoE layer consisting of n experts, denoted Expert_i(x) with i = 1, …, n, that takes input x. Then, the gating layer’s output is calculated as

$$G(x) = \big(g_1(x), \dots, g_n(x)\big) = \mathrm{Softmax}(x \cdot W_g)$$

where g_i is the i-th expert’s score, modeled with the Softmax function, and W_g is the gating layer’s trainable weight matrix. The gating layer’s output is used as the weights when averaging the experts’ outputs to compute the MoE layer’s final output. If g_i is 0, we can forgo computing Expert_i(x) entirely.

The general framework of a MoE gating mechanism looks like

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x),\, k)\big), \qquad H(x) = x \cdot W_g \;\; (\text{optionally with added noise})$$

Some specific examples are listed below, followed by a minimal code sketch of this top-k routing:

  • Top-1 gating: Each token is directed to a single expert, choosing only the top-scored expert. This is used in the Switch Transformer’s Switch layer. It is computationally efficient but requires careful load balancing for an even distribution of tokens across experts.
  • Top-2 gating: Each token is sent to two experts. This approach is used in Mixtral.
  • Noisy top-k gating: Introduced with the Sparsely-Gated Mixture-of-Experts Layer, noise (standard normal) is added before applying Softmax to help with load-balancing. GShard uses a noisy top-2 strategy, adding more advanced load-balancing techniques.
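To make the routing concrete, here is a minimal, PyTorch-style sketch of a sparsely gated MoE layer with (optionally noisy) top-k routing. It illustrates the general mechanism described above rather than any particular model’s implementation; the class name, dimensions, and the unit-variance noise are assumptions made for this example.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKMoE(nn.Module):
    """Illustrative sparsely gated MoE layer with top-k routing (not a specific model's code)."""
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # W_g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.gate(x)                             # H(x) = x @ W_g
        if self.training:
            scores = scores + torch.randn_like(scores)    # optional noise helps load balancing
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)          # Softmax over the k selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # each of the k chosen experts per token
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e             # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 64)                              # 16 tokens with d_model = 64
layer = NoisyTopKMoE(d_model=64, d_hidden=256)
print(layer(tokens).shape)                                # torch.Size([16, 64])

Only the experts actually selected for a token contribute to its output, which is what keeps the effective compute per token far smaller than the layer’s total parameter count.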

Load balancing

The straightforward gating via scoring and selecting top-k experts can result in an imbalance of token distribution among experts. Some experts may become overloaded, being assigned to process a bigger portion of tokens, while others are selected much less frequently and stay underutilized. This causes a “collapse” in routing, hurting the effectiveness of the MoE approach in two ways.

First, the frequently selected experts are continuously updated during training and end up performing better than experts that don’t receive enough data to train properly.

Second, load imbalance causes memory and computational performance problems. When the experts are distributed across different GPUs and/or machines, an imbalance in expert selection will translate into network, memory, and expert capacity bottlenecks. If one expert has to handle ten times as many tokens as another, the total processing time increases because subsequent computations are blocked until all experts finish processing their assigned load.

Strategies for improving load balancing in MoE LLMs include:

  • Adding random noise in the scoring process helps redistribute tokens among experts.

  • Adding an auxiliary load-balancing loss to the overall model loss. It penalizes uneven routing and is minimized when tokens are distributed uniformly across experts. For example, in the Switch Transformer, for N experts and T tokens in batch B, the loss would be

$$\text{loss} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where f_i is the fraction of tokens routed to expert i, P_i is the fraction of the router probability allocated to expert i, and α is a scaling coefficient. (A small code sketch of this loss follows the list.)

  • DeepSeekMoE introduced an additional device-level loss to ensure that tokens are routed evenly across the underlying infrastructure hosting the experts. The experts are divided into g groups, with each group deployed to a single device.

  • Setting a maximum capacity for each expert. GShard and the Switch Transformer define a maximum number of tokens that can be processed by one expert. If the capacity is exceeded, the “overflown” tokens are passed directly to the next layer (skipping all experts) or rerouted to the next-best expert that has not yet reached capacity.
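To illustrate the auxiliary loss above, here is a small sketch that derives f_i and P_i from raw router scores and combines them Switch-Transformer-style. The tensor shapes and the α value are placeholder assumptions (the Switch Transformer paper uses α on the order of 10⁻²).

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss of the form alpha * N * sum_i f_i * P_i for T tokens and N experts."""
    T, N = router_logits.shape
    probs = F.softmax(router_logits, dim=-1)                    # router probabilities per token
    assignments = probs.argmax(dim=-1)                          # top-1 expert chosen per token
    f = torch.bincount(assignments, minlength=N).float() / T    # fraction of tokens per expert
    P = probs.mean(dim=0)                                       # fraction of router probability per expert
    return alpha * N * torch.sum(f * P)

logits = torch.randn(1024, 8)                                   # 1024 tokens routed over 8 experts
print(load_balancing_loss(logits))                              # close to alpha when routing is roughly uniform

Because the loss is smallest when tokens are spread evenly, adding it to the task loss nudges the router away from collapsing onto a few experts.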

Scalability and challenges in MoE LLMs

Selecting the number of experts

The number of experts is a key consideration when designing an MoE LLM. A larger number of experts increases a model’s capacity at the cost of increased infrastructure demands. Using too few experts has a detrimental effect on performance. If the tokens assigned to one expert are too diverse, the expert cannot specialize sufficiently.

The MoE LLMs’ scalability advantage is due to the conditional activation of experts. Thus, keeping the number of active experts k fixed but increasing the total number of experts n increases the model’s capacity (larger total number of parameters). Experiments conducted by the Switch Transformer’s developers underscore this. With a fixed number of active parameters, increasing the number of experts consistently led to improved task performance. Similar results were observed for MoE Transformers with GShard.

The Switch Transformers have 16 to 128 experts, GShard can scale up from 128 to 2048 experts, and Mixtral can operate with as few as 8. DeepSeekMoE takes a more advanced approach by dividing experts into fine-grained, smaller experts. While keeping the number of expert parameters constant, the number of combinations for possible expert selection is increased. For example, N=8 experts with hidden dimension h can be split into m=2 parts, giving N*m=16 experts of dimension h/m. The possible combinations of activated experts in top-k routing will change from 28 (2 out of 8) to 1820 (4 out of 16), which will increase flexibility and targeted knowledge distribution.
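The combination counts quoted above follow directly from the binomial coefficient; a quick check in Python:

from math import comb

print(comb(8, 2))    # 28 ways to activate 2 of 8 coarse experts
print(comb(16, 4))   # 1820 ways to activate 4 of 16 fine-grained experts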

Routing tokens to different experts simultaneously may result in redundancy among experts. To address this problem, some approaches (like DeepSeek and DeepSpeed) can assign dedicated experts to act as a shared knowledge base. These experts are exempt from the gating mechanism, always receiving each input token.

Training and inference infrastructure

While MoE LLMs can, in principle, be operated on a single GPU, they can only be scaled efficiently in a distributed architecture combining data, model, and pipeline parallelism with expert parallelism. The MoE layers are sharded across devices (i.e., their experts are distributed evenly) while the rest of the model (like dense layers and attention blocks) is replicated to each device.

This requires high-bandwidth and low-latency communication for both forward and backward passes. For example, Google’s latest Gemini 1.5 was trained on multiple 4096-chip pods of Google’s TPUv4 accelerators distributed across multiple data centers.

Hyperparameter optimization

Introducing MoE layers adds additional hyperparameters that have to be carefully adjusted to stabilize training and optimize task performance. Key hyperparameters to consider include the overall number of experts, their size, the number of experts to select in the top-k selection, and any load balancing parameters. Optimization strategies for MoE LLMs are discussed comprehensively in the papers introducing the Switch Transformer, GShard, and GLaM.

LLM performance vs. MoE LLM performance

Before we wrap up, let’s take a closer look at how MoE LLMs compare to standard LLMs:

  • MoE models, unlike dense LLMs, activate only a portion of their parameters. Compared to dense LLMs, MoE LLMs with the same number of active parameters can achieve better task performance, having the benefit of a larger number of total trained parameters. For example, Mixtral 8x7B with 13 B active parameters (and 47 B total trained parameters) matches or outperforms LLaMA-2 with 13 B parameters on benchmarks like MMLU, HellaSwag, PIQA, and Math.
  • MoEs are faster, and thus less expensive, to train. The Switch Transformer authors showed, for example, that the sparse MoE outperforms the dense Transformer baseline while needing considerably less compute to reach the same performance. With the same FLOPs per token, the Switch Transformer reached T5-Base’s performance level seven times faster and outperformed it with further training.

What’s next for MoE LLMs?

Mixture of Experts (MoE) is an approach to scaling LLMs to trillions of parameters with conditional computation while avoiding exploding computational costs. MoE allows for the separation of learnable experts within the model, integrated into the shared model skeleton, which helps the model more easily adapt to multi-task, multi-domain learning objectives. However, this comes at the cost of new infrastructure requirements and the need for careful tuning of additional hyperparameters.

The novel architectural solutions for building experts, managing their routing, and stabilizing training are promising directions, with many more innovations to look forward to. Recent SoTA models like Google’s multi-modal Gemini 1.5 and IBM’s enterprise-focused Granite 3.0 are MoE models. DeepSeek R1, which performs comparably to GPT-4o and o1, is an MoE architecture with 671B total parameters (37B activated per token) and 128 experts.

With the publication of open-source MoE LLMs such as DeepSeek R1 and V3, which rival or even surpass the performance of the aforementioned proprietary models, we are heading into exciting times for democratized and scalable LLMs.


Forget About Cloud Computing. On-Premises Is All the Rage Again


Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins.

The tides are turning though. As much as cloud providers still proclaim that they’re the most cost-effective and efficient solution for businesses of all sizes, this is increasingly clashing with the day-to-day experience.

Cloud Computing was touted as the solution for scalability, flexibility, and reduced operational burdens. Increasingly, though, companies are finding that, at scale, the costs and control limitations outweigh the benefits.

Attracted by free AWS credits, my CTO and I started out by setting up our entire company IT infrastructure on the cloud. However, we were shocked when we saw the costs ballooning after just a few software tests. We decided to invest in a high-quality server and moved our whole infrastructure onto it. And we’re not looking back: this decision is already saving us hundreds of euros per month.

We’re not the only ones: Dropbox already made this move in 2016 and saved close to $75 million over the ensuing two years. The company behind Basecamp, 37signals, completed this transition in 2022, and expects to save $7 million over five years.

We’ll dive deeper into the how and why of this trend and the cost savings that are associated with it. You can expect some practical insights that will help you make or influence such a decision at your company, too.

Cloud costs have been exploding

According to a recent study by Harness, 21% of enterprise cloud infrastructure spend—which will be equivalent to $44.5 billion in 2025—is wasted on underutilized resources. According to the study author, cloud spend is one of the biggest cost drivers for many software enterprises, second only to salaries.

The premise of this study is that developers must develop a keener eye on costs. However, I disagree. Cost control can only get you so far—and many smart developers are already spending inordinate amounts of their time on cost control instead of building actual products.

Cloud costs have a tendency to balloon over time: Storage costs per GB of data might seem low, but when you’re dealing with terabytes of data—which even we as a three-person startup are already doing—costs add up very quickly. Add to this retrieval and egress fees, and you’re faced with a bill you cannot unsee.

Steep retrieval and egress fees only serve one thing: Cloud providers want to incentivize you to keep as much data as possible on the platform, so they can make money off every operation. If you download data from the cloud, it will cost you inordinate amounts of money.

Variable costs based on CPU and GPU usage often spike during high-performance workloads. A report by the CNCF found that almost half of Kubernetes adopters had exceeded their budget as a result. Kubernetes is an open-source container orchestration platform that is often used for cloud deployments.

The pay-per-use model of the cloud has its advantages, but billing becomes unpredictable as a result. Costs can then explode during usage spikes. Cloud add-ons for security, monitoring, and data analytics also come at a premium, which often increases costs further.

As a result, many IT leaders have started migrating back to on-premises servers. A 2023 survey by Uptime found that 33% of respondents had repatriated at least some production applications in the past year.

Cloud providers have not restructured their billing in response to this trend. One could argue that doing so would seriously impact their profitability, especially in a largely consolidated market where competitive pressure by upstarts and outsiders is limited. As long as this is the case, the trend towards on-premises is expected to continue.

Cost efficiency and control

There is a reason that cloud providers tend to advertise so much to small firms and startups. The initial setup costs of a cloud infrastructure are low because of pay-as-you-go models and free credits.

The easy setup can be a trap, though, especially once you start scaling. (At my firm, we noticed our costs going out of control even before we scaled to a decent extent, simply because we handle large amounts of data.) Monthly costs for on-premises servers are fixed and predictable; costs for cloud services can quickly balloon beyond expectations.

As mentioned before, cloud providers also charge steep data egress fees, which can quickly add up when you’re considering a hybrid infrastructure.

Security costs can initially be higher on-premises. On the other hand, you have full control over everything you implement. Cloud providers cover infrastructure security, but you remain responsible for data security and configuration. This often requires paid add-ons.

A round-up can be found in the table above. On the whole, an on-premises infrastructure comes with higher setup costs and needs considerable know-how. This initial investment pays off quickly, though, because you tend to have very predictable monthly costs and full control over additions like security measures.

There are plenty of prominent examples of companies that have saved millions by moving back on-premises. Whether this is a good choice for you depends on several factors, though, which need to be assessed carefully.

Should you move back on-premises?

Whether you should make the shift back to server racks depends on several factors. The most important considerations in most cases are financial, operational, and strategic.

From a financial point of view, your company’s cash structure plays a big role. If you prefer lean capital expenditures and have no problem racking up high operational costs every month, then you should remain on the cloud. If you can afford a higher capital expenditure up front and want to stop bleeding cash every month, moving on-premises is the better fit.

At the end of the day, the total cost of ownership (TCO) is key. If your operational costs on the cloud are consistently lower than running servers yourself, then you should absolutely stay on the cloud.
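As a back-of-the-envelope illustration of such a TCO comparison, the sketch below compares a recurring cloud bill with an amortized server purchase; every figure in it is made up for the example.

# Toy TCO comparison: recurring cloud bill vs. up-front server plus running costs.
# All figures are illustrative placeholders, not benchmarks.
cloud_monthly = 1200.0        # compute + storage + egress, EUR per month
server_upfront = 15000.0      # hardware purchase, EUR
server_monthly = 250.0        # power, colocation, maintenance, EUR per month
horizon_months = 36           # planning horizon

cloud_total = cloud_monthly * horizon_months
onprem_total = server_upfront + server_monthly * horizon_months
breakeven = server_upfront / (cloud_monthly - server_monthly)

print(f"Cloud over {horizon_months} months:   {cloud_total:,.0f} EUR")
print(f"On-prem over {horizon_months} months: {onprem_total:,.0f} EUR")
print(f"Break-even after roughly {breakeven:.1f} months")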

From an operational point of view, staying on the cloud can make sense if you often face spikes in usage. On-premises servers can only carry so much traffic; cloud servers scale pretty seamlessly in proportion to demand. If expensive and specialized hardware is more accessible for you on the cloud, this is also a point in favor of staying on the cloud. On the other hand, if you are worried about complying with specific regulations (like GDPR, HIPAA, or CSRD for example), then the shared-responsibility model of cloud services is likely not for you.

Strategically speaking, having full control of your infrastructure can be a strategic advantage. It keeps you from getting locked in with a vendor and having to play along with whatever they bill you and what services they are able to offer you. If you plan a geographic expansion or rapidly deploy new services, then cloud can be advantageous though. In the long run, however, going on-premises might make sense even when you’re expanding geographically or in your scope of services, due to increased control and lower operational costs.

The decision to move back on-premises depends on several factors. Diagram generated with the help of Claude AI.

On the whole, if you value predictability, control, and compliance, you should consider running on-premises. If, on the other hand, you value flexibility, then staying on the cloud might be your better choice.

How to repatriate easily

If you are considering repatriating your services, here is a brief checklist to follow:

  • Assess Current Cloud Usage: Inventory applications and data volume.
  • Cost Analysis: Calculate current cloud costs vs. projected on-prem costs.
  • Select On-Prem Infrastructure: Servers, storage, and networking requirements.
  • Minimize Data Egress Costs: Use compression and schedule transfers during off-peak hours.
  • Security Planning: Firewalls, encryption, and access controls for on-prem.
  • Test and Migrate: Pilot migration for non-critical workloads first.
  • Monitor and Optimize: Set up monitoring for resources and adjust.

Repatriation is not just for enterprise companies that make the headlines. As the example of my firm shows, even small startups need to make this consideration. The earlier you make the migration, the less cash you’ll bleed.

The bottom line: Cloud is not dead, but the hype around it is dying

Cloud services aren’t going anywhere. They offer flexibility and scalability, which are unmatched for certain use cases. Startups and companies with unpredictable or rapidly growing workloads still benefit greatly from cloud solutions.

That being said, even early-stage companies can benefit from on-premises infrastructure, for example if the large data loads they’re handling would make the cloud bill balloon out of control. This was the case at my firm.

The cloud has often been marketed as a one-size-fits-all solution for everything from data storage to AI workloads. We can see that this is not the case; the reality is a bit more granular than this. As companies scale, the costs, compliance challenges, and performance limitations of cloud computing become impossible to ignore.

The hype around cloud services is dying because experience is showing us that there are real limits and plenty of hidden costs. In addition, cloud providers can often not adequately provide for security solutions, options for compliance, and user control if you don’t pay a hefty premium for all this.

Most companies will likely adopt a hybrid approach in the long run: On-premises offers control and predictability; cloud servers can jump into the fray when demand from users spikes.

There’s no real one-size-fits-all solution. However, there are specific criteria that can help guide your decision. Like every hype, there are ebbs and flows. The fact that cloud services are no longer hyped does not mean that you need to go all-in on server racks now. It does, however, invite a deeper reflection on the advantages that this trend offers for your company.

Challenges & Solutions For Monitoring at Hyperscale


“What is not measured cannot be improved.” This quote has become a guiding principle for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the large volume of data generated to the unpredictability of hardware failures and the need for efficient resource management. These issues require strategic solutions, which is why monitoring isn’t just a nice-to-have—it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams facing large-scale model training and monitoring. Let’s get into it.

Real-time monitoring prevents costly failures

Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring—the ability to act immediately—is so critical.

Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines the process of comparing results, analyzing results, and debugging issues, saving time and effort.

However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.

The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course. Here are some testimonials:

One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.

James Tu

Research Scientist, Waabi

For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.

Wojtek Rosiński

CTO, ReSpo.Vision

Real-time visualization of GPU memory usage (top) and power consumption (bottom) during a large-scale training run. These metrics help identify potential bottlenecks, such as out-of-memory errors or inefficient hardware utilization, enabling immediate corrective actions to maintain optimal performance. | Source: Author

Troubleshooting hardware failures is challenging: simplify it with debugging

Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.

At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.

Igor Tsvetkov

Former Senior Staff Software Engineer, Cruise

AI teams automating error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify if failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters). 
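A minimal sketch of such a categorization heuristic might look like the following; the keyword patterns and category names are assumptions for illustration, not Cruise’s actual rules.

import re

# Illustrative keyword heuristics for labeling failures as "infra" or "user" issues.
INFRA_PATTERNS = [r"CUDA out of memory", r"NCCL", r"ECC error", r"connection (reset|timed out)"]
USER_PATTERNS = [r"shape mismatch", r"KeyError", r"NaN loss", r"unrecognized argument"]

def categorize_failure(stack_trace: str) -> str:
    """Label a failure as 'infra', 'user', or 'unknown' based on its stack trace."""
    for pattern in INFRA_PATTERNS:
        if re.search(pattern, stack_trace, flags=re.IGNORECASE):
            return "infra"
    for pattern in USER_PATTERNS:
        if re.search(pattern, stack_trace, flags=re.IGNORECASE):
            return "user"
    return "unknown"

print(categorize_failure("RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"))  # infra
print(categorize_failure("ValueError: shape mismatch between logits and labels"))          # user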

Intuitive experiment tracking optimizes resource utilization

Another relevant aspect of hyperscale monitoring is optimizing resource utilization, in particular in a scenario where hardware failures and training interruptions can set teams back significantly. Picture a scenario where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.

Use checkpoints at frequent intervals so you do not have to restart from scratch, but just warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints in the same machine. This doesn’t help if your hardware crashes, or, for example, you are using spot instances and they are reassigned.

For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This does not mean that you upload GBs worth of checkpoints to the tracker (although you can and some of our customers, especially self-hosted customers, do this for security reasons), but rather have pointers to the remote location, like S3, where the checkpoints have been saved. This enables you to link the checkpoint with the corresponding experiment step, and efficiently retrieve the relevant checkpoint at any given step.
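As a minimal sketch of this pattern, the snippet below saves checkpoints to remote storage and logs only the pointer (the URI) against the training step. The ExperimentTracker class and the save_checkpoint_to_s3 helper are stand-ins for this example, not a specific tracker’s API.

import time

def save_checkpoint_to_s3(model_state: dict, step: int, bucket: str) -> str:
    """Placeholder: upload the checkpoint to remote storage and return its URI."""
    uri = f"s3://{bucket}/checkpoints/step_{step}.pt"
    # ... the actual upload (e.g., via an S3 client) would happen here ...
    return uri

class ExperimentTracker:
    """Stand-in for an experiment tracker client; real tools expose similar logging calls."""
    def log(self, key: str, value, step: int) -> None:
        print(f"[step {step}] {key} = {value}")

tracker = ExperimentTracker()
for step in range(1000, 4000, 1000):
    # ... training happens here ...
    uri = save_checkpoint_to_s3(model_state={}, step=step, bucket="my-training-bucket")
    # Log only the pointer, not the multi-GB checkpoint itself.
    tracker.log("checkpoints/uri", uri, step=step)
    tracker.log("checkpoints/saved_at", time.time(), step=step)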

A comparison of training workflows with and without advanced experiment tracking and checkpointing. On the left, failed training runs at various stages lead to wasted time and resources. On the right, a streamlined approach with checkpoints and proactive monitoring ensures consistent progress and minimizes the impact of interruptions. | Source: Author

However, there are two caveats to successfully restarting an experiment from a checkpoint: assuming that the experimentation environment is constant, or at least reproducible, and addressing deterministic issues like Out-of-Memory errors (OOMs) or bottlenecks that may require parameter changes to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.

Track months-long model training with more confidence. Use neptune.ai’s forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
  • Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

In addition, checkpointing strategies are critical for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.

For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.

Features like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times making frequent checkpointing more viable without compromising training performance.

Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.

Reproducibility and transparency are non-negotiable

Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage—from parent runs to dataset versions—in an accessible dashboard.

This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.

A single source of truth simplifies data visualization and management at large-scale data

Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this scale leads to inefficiencies and missed insights.

A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”—those not explicitly logged at the code level but calculated retrospectively within the tool. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
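For instance, if precision and recall were already logged per experiment, the F1 score can be derived retrospectively without rerunning anything. A self-contained sketch of the idea (the logged values below are placeholders):

# Derive F1 retrospectively from already-logged precision and recall values.
# logged_metrics stands in for whatever per-experiment values the tracker returns.
logged_metrics = {
    "exp-001": {"precision": 0.91, "recall": 0.84},
    "exp-002": {"precision": 0.88, "recall": 0.90},
}

def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

for run_id, metrics in logged_metrics.items():
    print(run_id, round(f1_score(metrics["precision"], metrics["recall"]), 3))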

Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member—technical or non-technical—could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.

I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.

Łukasz Grad

Chief Data Scientist, ReSpo.Vision

The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the tool reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.

Visualizing large datasets

We generally do not think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment not in the same pipeline as the actual model training, data management and visualization is critical to LLMOps.

Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.

Moving forward

The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.

Acknowledgments

I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.


Nine Pico PIO Wats with Rust (Part 2)


This is Part 2 of an exploration into the unexpected quirks of programming the Raspberry Pi Pico PIO with Rust. If you missed Part 1, we uncovered four Wats that challenge assumptions about register count, instruction slots, the behavior of pull noblock, and smart yet cheap hardware.

Now, we continue our journey toward crafting a theremin-like musical instrument — a project that reveals some of the quirks and perplexities of PIO programming. Prepare to challenge your understanding of constants in a way that brings to mind a Shakespearean tragedy.

Wat 5: Inconstant constants

In the world of PIO programming, constants should be reliable, steadfast, and, well, constant. But what if they’re not? This brings us to a puzzling Wat about how the set instruction in PIO works—or doesn’t—when handling larger constants.

Much like Juliet doubting Romeo’s constancy, you might find yourself wondering if PIO constants will, as she says, “prove likewise variable.”

The problem: Constants are not as big as they seem

Imagine you’re programming an ultrasonic range finder and need to count down from 500 while waiting for the Echo signal to drop from high to low. To set up this wait time in PIO, you might naïvely try to load the constant value directly using set:

set y, 500      ; Load max echo wait into y


But here’s the tragic twist: the set instruction in PIO is limited to constants between 0 and 31. Moreover, the star-crossed set instruction doesn’t report an error. Instead, it silently corrupts the entire PIO instruction. This produces a nonsense result.

Workarounds for inconstant constants

To address this limitation, consider the following approaches:

  • Read Values and Store Them in a Register: We saw this approach in Wat 1. You can load your constant into the osr register, then transfer it to y. For example:
; Read the max echo wait into OSR
pull                    ; same as pull block
mov y, osr              ; Load max echo wait into Y
  • Shift and Combine Smaller Values: Using the isr register and the in instruction, you can build up a constant of any size. This, however, consumes time and operations from your 32-operation budget (see Part 1, Wat 2).
; In Rust, be sure 'config.shift_in.direction = ShiftDirection::Left;'

set y, 15       ; Load upper 5 bits (0b01111)
mov isr, y      ; Transfer to ISR (clears ISR)
set y, 20       ; Load lower 5 bits (0b10100)
in y, 5         ; Shift in lower bits to form 500 in ISR
mov y, isr      ; Transfer back to y
  • Slow Down the Timing: Reduce the frequency of the state machine to stretch delays over more system clock cycles. For example, lowering the state machine speed from 125 MHz to 343 kHz reduces the required timeout constant from 182,216 to 500.
  • Use Extra Delays and (Nested) Loops: All instructions support an optional delay, allowing you to add up to 31 extra cycles. (To generate even longer delays, use loops — or even nested loops.)
; Generate 10μs trigger pulse (4 cycles at 343_000Hz)
set pins, 1 [3]       ; Set trigger pin to high, add delay of 3
set pins, 0           ; Set trigger pin to low voltage
  • Use the “Subtraction Trick” to Generate the Maximum 32-bit Integer: In Wat 7, we’ll explore a way to generate 4,294,967,295 (the maximum unsigned 32-bit integer) via subtraction.

Much like Juliet cautioning against swearing by the inconstant moon, we’ve discovered that PIO constants are not always as steadfast as they seem. Yet, just as their story takes unexpected turns, so too does ours, moving from the inconstancy of constants to the uneven nature of conditionals. In the next Wat, we’ll explore how PIO’s handling of conditional jumps can leave you questioning its loyalty to logic.

Wat 6: Conditionals through the looking-glass

In most programming environments, logical conditionals feel balanced: you can test if a pin is high or low, or check registers for equality or inequality. In PIO, this symmetry breaks down. You can jump on pin high, but not pin low, and on x!=y, but not x==y. The rules are whimsical — like Humpty Dumpty in Through the Looking-Glass: “When I define a conditional, it means just what I choose it to mean — neither more nor less.”

These quirks force us to rewrite our code to fit the lopsided logic, creating a gulf between how we wish the code could be written and how we must write it.

The problem: Lopsided conditionals in action

Consider a simple scenario: using a range finder, you want to count down from a maximum wait time (y) until the ultrasonic echo pin goes low. Intuitively, you might write the logic like this:

measure_echo_loop:
 jmp !pin measurement_complete   ; If echo voltage is low, measurement is complete
 jmp y-- measure_echo_loop       ; Continue counting down unless timeout

And when processing the measurement, if we only wish to output values that differ from the previous value, we would write:

measurement_complete:
 jmp x==y cooldown             ; If measurement is the same, skip to cool down
 mov isr, y                    ; Store measurement in ISR
 push                          ; Output ISR
 mov x, y                      ; Save the measurement in X

Unfortunately, PIO doesn’t let you test !pin or x==y directly. You must restructure your logic to accommodate the available conditionals, such as pin and x!=y.

The solution: The way it must be

Given PIO’s limitations, we adapt our logic with a two-step approach that ensures the desired behavior despite the missing conditionals:

  • Jump on the opposite conditional to skip two instructions forward.
  • Next, use an unconditional jump to reach the desired target.

This workaround adds one extra jump (affecting the instruction limit), but the additional label is cost-free.

Here is the rewritten code for counting down until the pin goes low:

measure_echo_loop:
   jmp pin echo_active     ; if echo voltage is high continue count down
   jmp measurement_complete ; if echo voltage is low, measurement is complete
echo_active:
   jmp y-- measure_echo_loop ; Continue counting down unless timeout

And here is the code for processing the measurement such that it will only output differing values:

measurement_complete:
   jmp x!=y send_result    ; if measurement is different, then send it.
   jmp cooldown            ; If measurement is the same, don't send.

send_result:
   mov isr, y              ; Store measurement in ISR
   push                    ; Output ISR
   mov x, y               ; Save the measurement in X

Lessons from Humpty Dumpty’s conditionals

In Through the Looking-Glass, Alice learns to navigate Humpty Dumpty’s peculiar world — just as you’ll learn to navigate PIO’s Wonderland of lopsided conditions.

But as soon as you master one quirk, another reveals itself. In the next Wat, we’ll uncover a surprising behavior of jmp that, if it were an athlete, would shatter world records.

Wat 7: Overshooting jumps

In Part 1’s Wat 1 and Wat 3, we saw how jmp x-- or jmp y-- is often used to loop a fixed number of times by decrementing a register until it reaches 0. Straightforward enough, right? But what happens when y is 0 and we run the following instruction?

jmp y-- measure_echo_loop

If you guessed that it does not jump to measure_echo_loop and instead falls through to the next instruction, you’re absolutely correct. But for full credit, answer this: What value does y have after the instruction?

The answer: 4,294,967,295. Why? Because y is decremented after it is tested for zero. Wat!?

Aside: If this doesn’t surprise you, you likely have experience with C or C++ which distinguish between pre-increment (e.g., ++x) and post-increment (e.g., x++) operations. The behavior of jmp y-- is equivalent to a post-decrement, where the value is tested before being decremented.

This value, 4,294,967,295, is the maximum for a 32-bit unsigned integer. It’s as if a track-and-field long jumper launches off the takeoff board but, instead of landing in the sandpit, overshoots and ends up on another continent.

Aside: As foreshadowed in Wat 5, we can use this behavior intentionally to set a register to the value 4,294,967,295.

Now that we’ve learned how to stick the landing with jmp, let’s see if we can avoid getting stuck by the pins that PIO reads and sets.

Wat 8: Too many pins

In Dr. Seuss’s Too Many Daves, Mrs. McCave had 23 sons, all named Dave, leading to endless confusion whenever she called out their name. In PIO programming, pin and pins can refer to completely different ranges of pins depending on the context. It’s hard to know which Dave or Daves you’re talking to.

The problem: Pin ranges and subranges

In PIO, both pin and pins instructions depend on pin ranges defined in Rust, outside of PIO. However, individual instructions often operate on a subrange of those pin ranges. The behavior varies depending on the command: the subrange could be the first n pins of the range, all the pins, or just a specific pin given by an index. To clarify PIO’s behavior, I created the following table:

This table shows how PIO interprets the terms pin and pins in different instructions, along with their associated contexts and configurations.

Example: Distance program for the range finder

Here’s a PIO program for measuring the distance to an object using Trigger and Echo pins. The key features of this program are:

  • Continuous Operation: The range finder runs in a loop as fast as possible.
  • Maximum Range Limit: Measurements are capped at a given distance, with a return value of 4,294,967,295 if no object is detected.
  • Filtered Outputs: Only measurements that differ from their immediate predecessor are sent, reducing the output rate.

Glance over the program and notice that although it is working with two pins — Trigger and Echo — throughout the program we only see pin and pins.

.program distance

; X is the last value sent. Initialize it to
; u32::MAX which means 'echo timeout'
; (Set X to u32::MAX by subtracting 1 from 0)
   set x, 0
subtraction_trick:
   jmp x-- subtraction_trick

; Read the max echo wait into OSR
   pull                         ; same as pull block

; Main loop
.wrap_target
   ; Generate 10μs trigger pulse (4 cycles at 343_000Hz)
   set pins, 0b1 [3]       ; Set trigger pin to high, add delay of 3
   set pins, 0b0           ; Set trigger pin to low voltage

   ; When the trigger goes high, start counting down until it goes low
   wait 1 pin 0            ; Wait for echo pin to be high voltage
   mov y, osr              ; Load max echo wait into Y

measure_echo_loop:
   jmp pin echo_active     ; if echo voltage is high continue count down
   jmp measurement_complete ; if echo voltage is low, measurement is complete
echo_active:
   jmp y-- measure_echo_loop ; Continue counting down unless timeout

; Y tells where the echo countdown stopped. It
; will be u32::MAX if the echo timed out.
measurement_complete:
   jmp x!=y send_result    ; if measurement is different, then send it.
   jmp cooldown            ; If measurement is the same, don't send.

send_result:
   mov isr, y              ; Store measurement in ISR
   push                    ; Output ISR
   mov x, y               ; Save the measurement in X

; Cool down period before next measurement
cooldown:
   wait 0 pin 0           ; Wait for echo pin to be low
.wrap                      ; Restart the measurement loop

Configuring Pins

To ensure the PIO program behaves as intended:

  • set pins, 0b1 should control the Trigger pin.
  • wait 1 pin 0 should monitor the Echo pin.
  • jmp pin echo_active should also monitor the Echo pin.

Here’s how you can configure this in Rust (followed by an explanation):

let mut distance_state_machine = pio1.sm0;
let trigger_pio = pio1.common.make_pio_pin(hardware.trigger);
let echo_pio = pio1.common.make_pio_pin(hardware.echo);
distance_state_machine.set_pin_dirs(Direction::Out, &[&trigger_pio]);
distance_state_machine.set_pin_dirs(Direction::In, &[&echo_pio]);
distance_state_machine.set_config(&{
   let mut config = Config::default();
   config.set_set_pins(&[&trigger_pio]); // For set instruction
   config.set_in_pins(&[&echo_pio]); // For wait instruction
   config.set_jmp_pin(&echo_pio); // For jmp instruction
   let program_with_defines = pio_file!("examples/distance.pio");
   let program = pio1.common.load_program(&program_with_defines.program);
   config.use_program(&program, &[]); // No side-set pins
   config
});

The keys here are the set_set_pins, set_in_pins, and set_jmp_pin methods on the Config struct.

  • set_in_pins: Specifies the pins for input operations, such as wait(1, pin, …). The “in” pins must be consecutive.
  • set_set_pins: Configures the pin for set operations, like set(pins, 1). The “set” pins must also be consecutive.
  • set_jmp_pin: Defines the single pin used in conditional jumps, such as jmp(pin, ...).

As described in the table, other optional inputs include:

  • set_out_pins: Sets the consecutive pins for output operations, such as out(pins, …).
  • use_program: Sets a) the loaded program and b) consecutive pins for sideset operations. Sideset operations allow simultaneous pin toggling during other instructions.

Configuring Multiple Pins

Although not required for this program, you can configure a range of pins in PIO by providing a slice of consecutive pins. For example, suppose we had two ultrasonic range finders:

let trigger_a_pio = pio1.common.make_pio_pin(hardware.trigger_a);
let trigger_b_pio = pio1.common.make_pio_pin(hardware.trigger_b);
config.set_set_pins(&[&trigger_a_pio, &trigger_b_pio]);

A single instruction can then control both pins:

set pins, 0b11 [3]  ; Sets both trigger pins (17, 18) high, adds delay
set pins, 0b00      ; Sets both trigger pins low

This approach lets you efficiently apply bit patterns to multiple pins simultaneously, streamlining control for applications involving multiple outputs.

Aside: The Word “Set” in Programming

In programming, the word “set” is notoriously overloaded with multiple meanings. In the context of PIO, “set” refers to something to which you can assign a value — such as a pin’s state. It does not mean a collection of things, as it often does in other programming contexts. When PIO refers to a collection, it usually uses the term “range” instead. This distinction is crucial for avoiding confusion as you work with PIO.

Lessons from Mrs. McCave

In Too Many Daves, Mrs. McCave lamented not giving her 23 Daves more distinct names. You can avoid her mistake by clearly documenting your pins with meaningful names — like Trigger and Echo — in your comments.

But if you think handling these pin ranges is tricky, debugging a PIO program adds an entirely new layer of challenge. In the next Wat, we’ll dive into the kludgy debugging methods available. Let’s see just how far we can push them.

Wat 9: Kludgy debugging

I like to debug with interactive breakpoints in VS Code. I also do print debugging, where you insert temporary info statements to see what the code is doing and the values of variables. Using the Raspberry Pi Debug Probe and probe-rs, I can do both of these with regular Rust code on the Pico.

With PIO programming, however, I can do neither.

The fallback is push-to-print debugging. In PIO, you temporarily output integer values of interest. Then, in Rust, you use info! to print those values for inspection.

For example, in the following PIO program, we temporarily add instructions to push the value of x for debugging. We also use set and push to output a constant value, such as 7, which must be between 0 and 31 inclusive.

.program distance

; X is the last value sent. Initialize it to
; u32::MAX which means 'echo timeout'
; (Set X to u32::MAX by subtracting 1 from 0)
   set x, 0
subtraction_trick:
   jmp x-- subtraction_trick

; DEBUG: See the value of x
   mov isr, x
   push

; Read the max echo wait into OSR
   pull                         ; same as pull block

; DEBUG: Send constant value
   set y, 7           ; Push '7' so that we know we've reached this point
   mov isr, y
   push
; ...

Back in Rust, you can read and print these values to help understand what’s happening in the PIO code (full code and project):

// ...
distance_state_machine.set_enable(true);
distance_state_machine.tx().wait_push(MAX_LOOPS).await;
loop {
    let end_loops = distance_state_machine.rx().wait_pull().await;
    info!("end_loops: {}", end_loops);
}
// ...

Outputs:

INFO  Hello, debug!
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:27
INFO  end_loops: 4294967295
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:57
INFO  end_loops: 7
└─ distance_debug::inner_main::{async_fn#0} @ examples\distance_debug.rs:57

When push-to-print debugging isn’t enough, you can turn to hardware tools. I bought my first oscilloscope (a FNIRSI DSO152, for $37). With it, I was able to confirm the Echo signal was working. The Trigger signal, however, was too fast for this inexpensive oscilloscope to capture clearly.

Using these methods — especially push-to-print debugging — you can trace the flow of your PIO program, even without a traditional debugger.

Aside: In C/C++ (and potentially Rust), you can get closer to a full debugging experience for PIO, for example, by using the piodebug project.

That concludes the nine Wats, but let’s bring everything together in a bonus Wat.

Now that all the components are ready, it’s time to combine them into a working theremin-like musical instrument. We need a Rust monitor program. This program starts both PIO state machines — one for measuring distance and the other for generating tones. It then waits for a new distance measurement, maps that distance to a tone, and sends the corresponding tone frequency to the tone-playing state machine. If the distance is out of range, it stops the tone.

Rust’s Place: At the heart of this system is a function that maps distances (from 0 to 50 cm) to tones (approximately B2 to F5). This function is simple to write in Rust, leveraging Rust’s floating-point math and exponential operations. Implementing this in PIO would be virtually impossible due to its limited instruction set and lack of floating-point support.
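To make this concrete, here is a minimal sketch of such a mapping function. It is an illustration rather than the project's exact code: the note frequencies, the exponential interpolation, and the use of the libm crate for powf on a no_std target are all assumptions.

const MIN_FREQ: f32 = 123.47; // roughly B2
const MAX_FREQ: f32 = 698.46; // roughly F5
const MAX_DISTANCE_CM: f32 = 50.0;

/// Map a distance in centimeters to a tone frequency in Hz (sketch only).
fn distance_to_tone_frequency(distance_cm: f32) -> f32 {
    // Exponential interpolation, so equal steps in distance correspond
    // to (roughly) equal musical intervals rather than equal Hz steps.
    let t = (distance_cm / MAX_DISTANCE_CM).clamp(0.0, 1.0);
    MIN_FREQ * libm::powf(MAX_FREQ / MIN_FREQ, t)
}

Out-of-range distances never reach this function, because loop_difference_to_distance_cm returns None in that case, as the monitor loop below shows.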

Here’s the core monitor program to run the theremin (full file and project):

sound_state_machine.set_enable(true);
distance_state_machine.set_enable(true);
distance_state_machine.tx().wait_push(MAX_LOOPS).await;
loop {
    let end_loops = distance_state_machine.rx().wait_pull().await;
    match loop_difference_to_distance_cm(end_loops) {
        None => {
            info!("Distance: out of range");
            sound_state_machine.tx().wait_push(0).await;
        }
        Some(distance_cm) => {
            let tone_frequency = distance_to_tone_frequency(distance_cm);
            let half_period = sound_state_machine_frequency / tone_frequency as u32 / 2;
            info!("Distance: {} cm, tone: {} Hz", distance_cm, tone_frequency);
            sound_state_machine.tx().push(half_period); // non-blocking push
            Timer::after(Duration::from_millis(50)).await;
        }
    }
}

Using two PIO state machines alongside a Rust monitor program lets you literally run three programs at once. This setup is convenient on its own and is essential when strict timing or very high-frequency I/O operations are required.

Aside: Alternatively, Rust Embassy’s async tasks let you implement cooperative multitasking directly on a single main processor. You code in Rust rather than a mixture of Rust and PIO. Although Embassy tasks don’t literally run in parallel, they switch quickly enough to handle applications like a theremin. Here’s a snippet from theremin_no_pio.rs showing a similar core loop:

loop {
    match distance.measure().await {
        None => {
            info!("Distance: out of range");
            sound.rest().await;
        }
        Some(distance_cm) => {
            let tone_frequency = distance_to_tone_frequency(distance_cm);
            info!("Distance: {} cm, tone: {} Hz", distance_cm, tone_frequency);
            sound.play(tone_frequency).await;
            Timer::after(Duration::from_millis(50)).await;
        }
    }
}

See our recent article on Rust Embassy programming for more details.

Now that we’ve assembled all the components, let’s watch the video again of me “playing” the musical instrument. On the monitor screen, you can see the debugging prints displaying the distance measurements and the corresponding tones. This visual connection highlights how the system responds in real time.

Conclusion

PIO programming on the Raspberry Pi Pico is a captivating blend of simplicity and complexity, offering unparalleled hardware control while demanding a shift in mindset for developers accustomed to higher-level programming. Through the nine Wats we’ve explored, PIO has both surprised us with its limitations and impressed us with its raw efficiency.

While we’ve covered significant ground — managing state machines, pin assignments, timing intricacies, and debugging — there’s still much more you can learn as needed: DMA, IRQ, side-set pins, differences between PIO on the Pico 1 and Pico 2, autopush and autopull, FIFO join, and more.


At its core, PIO’s quirks reflect a design philosophy that prioritizes low-level hardware control with minimal overhead. By embracing these characteristics, PIO will not only meet your project’s demands but also open doors to new possibilities in embedded systems programming.

Please follow Carl on Towards Data Science and on @carlkadie.bsky.social. I write on scientific programming in Rust and Python, machine learning, and statistics. I tend to write roughly one article per month.

Open LLMs are Necessary For Current Private Adaptations and Outperform Their Closed Alternatives [Paper Reflection]


Closed Large Language Models (LLMs), which are proprietary and accessible only via APIs, have dominated the LLM space since around 2022 due to their high performance and versatility. However, Open LLMs, whose architecture and parameters are publicly available for use, modification, and distribution, have made substantial progress and are narrowing the performance gap with their Closed counterparts.

For instance, while Closed LLMs like Anthropic’s Claude and OpenAI’s GPT-4 (both released in March 2023) set new benchmarks upon their launches, the Open LLMs Llama 3 (released by Meta in April 2024) and DeepSeek-R1 (released in January 2025) not only matched but surpassed them in tasks such as coding, reasoning, text classification, summarization, and question answering.

While much of the discussion around LLMs centers on task and computational performance, in our paper Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives, we focus on the privacy implications of using Open and Closed LLMs. Specifically, we explore whether and how models can be fine-tuned on sensitive data while ensuring robust privacy guarantees.

To this end, we define threat models, compare various Open and Closed LLMs that leverage differential privacy (DP) on classification and generation tasks, and analyze methodological limitations. The result is a thorough analysis of the privacy-utility tradeoff under different privacy levels.

Our findings indicate that Open LLMs can be adapted to private data without leaking information to third parties, such as LLM providers and malicious users. Thus, they offer a significant privacy advantage over Closed, proprietary models.

The threat space in adapting LLMs to private data

The adaptation of Closed LLMs to private datasets introduces a multifaceted threat space. In typical scenarios, data curators provide their sensitive data to LLM providers for fine-tuning, producing a model tailored to the dataset. This customized model is subsequently queried by external parties, e.g., customers of the data curator.

The resulting threat space can be categorized into three key dimensions:

  1. From the data curator to the LLM provider: The private data shared during fine-tuning may be susceptible to unauthorized access or misuse.
  2. From the querying party to the LLM provider: Queries submitted by end users, which often contain sensitive information intended for the data curator, are exposed to the LLM provider.
  3. From malicious end users to the adapted LLM: Malicious end users may attempt to extract private information through the LLM’s responses to carefully crafted queries.

In contrast to Closed LLMs, Open LLMs provide full control over the model and data, enabling private adaptation without the need to share sensitive information with a third party. This control eliminates the first two threat vectors associated with Closed LLMs, such as unauthorized access or misuse by the provider and exposure of user queries. With Open LLMs, data curators can directly fine-tune the model on private datasets using privacy-preserving techniques, ensuring end-to-end privacy.

What are the current methods for private adaptation of LLMs? 

It follows from our threat space analysis that restricting access to the fine-tuning dataset alone does not guarantee data privacy. Model outputs can still reveal sensitive information from the fine-tuning data. If the fine-tuned model is exposed (e.g., via an API), it remains vulnerable to information extraction and inference attacks.

Differential privacy (DP) introduces a rigorous mathematical framework that ensures the privacy of individuals whose data is used in the fine-tuning process. Specifically, DP adds carefully calibrated noise to the model updates, making it statistically improbable to determine whether any individual’s data was included in the fine-tuning dataset. Its quantifiable and robust privacy guarantee makes DP valuable for protecting sensitive information in LLM fine-tuning.
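For reference, the formal guarantee behind this (a textbook statement, not quoted from the paper) says that a randomized mechanism M is (ε, δ)-differentially private if, for all neighboring datasets D and D′ that differ in a single record and for every set of outcomes S:

\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta

Smaller ε and δ mean stronger privacy; the noise added during fine-tuning is calibrated so that the overall training procedure stays within a chosen (ε, δ) budget.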

While DP provides privacy guarantees for both Open and Closed LLMs, it does not address the issue of trust in third-party providers for Closed LLMs. For these models, data curators must rely on the provider to implement safeguards and handle sensitive data responsibly.

Private adaptation methods for Closed LLMs 

We can rule out fine-tuning services offered by LLM providers (e.g., OpenAI and Amazon), as this entails sharing private data with a third party. Closed LLMs are accessible only via APIs. Thus, we cannot access and adapt the model’s weights directly.

Instead, private adaptation methods for Closed LLMs rely on privacy-preserving discrete prompts or private in-context learning (ICL). These approaches work by carefully crafting input prompts or selecting relevant examples to guide the model’s behavior, all while ensuring that sensitive information in the prompts or examples is protected from potential leakage or inference attacks.

All methods we evaluate in our study follow the PATE (Private Aggregation of Teacher Ensembles) framework. At a high level, PATE achieves data privacy by splitting the private dataset into non-overlapping partitions. Then, each partition is used to train a so-called teacher model. These teacher models are joined into an ensemble model by combining their outputs while adding noise, which preserves privacy.

This ensemble is then used to train a so-called student model in the following way: The ensemble makes predictions for samples from an unlabeled public dataset. The resulting (sample, ensemble prediction) pairs constitute the training data for the student model. Thus, the student learns to make the same predictions as the teacher ensemble but never sees sensitive data samples. The student is what’s released as the final model.

Overview of the PATE framework. The sensitive dataset is divided into non-overlapping partitions, and a separate teacher model is trained on each partition. All teachers are aggregated noisily into an ensemble model, which is used to make predictions on a public dataset. The samples from the public dataset, together with the ensemble’s predictions, constitute the training data for the student model, which is the model that is eventually queried by users. | Source
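To make the noisy voting concrete, the classic PATE aggregation (the methods described next adapt this idea in different ways) labels a public sample x with a noisy plurality vote over the teachers:

\hat{y}(x) = \arg\max_{c} \left( n_c(x) + \mathrm{Lap}\!\left(\tfrac{1}{\gamma}\right) \right)

where n_c(x) is the number of teachers that predict class c for x, and the scale 1/γ of the Laplace noise controls the privacy-utility trade-off.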

The private adaptation methods for Closed LLMs we analyze in our study build on this general framework. They differ in how the teachers are utilized and how their responses are aggregated:

  • Differentially Private In-context Learning (DP-ICL): All teachers process the same prompt, and the ensemble’s response is the noisy consensus.
  • PromptPATE: The teacher ensemble assigns labels to public unlabeled data via private voting. These labeled public sequences are used to create new discrete student prompts, which are deployed with the LLM.
  • DP-FewShotGen: The teacher ensemble generates private synthetic few-shot samples that are then used as in-context learning examples.
  • DP-OPT: A local LLM generates privacy-preserving prompts and instructions from the private dataset. These are used for in-context learning for the third-party Closed LLM.

In our paper, we compare the privacy protection and performance of these four state-of-the-art methods for private adaptation of Closed LLMs. When applying them to the popular Closed LLMs Claude, GPT-3 Babbage, GPT-3 Davinci, and GPT-4 Turbo, we observe that compared to private adaptation of Open LLMs, these methods offer lower performance at a higher cost on various downstream tasks, including dialog summarization, classification, and generation. Further, all methods except DP-OPT leak training data to the LLM provider.

Private adaptation methods for Open LLMs 

Unlike Closed LLMs, Open LLMs provide access to their parameters, enabling more flexible and parameter-centric private adaptation methods. These methods typically follow the Differentially Private Stochastic Gradient Descent (DPSGD) paradigm to ensure privacy. In DPSGD, the influence of each private data point is constrained during training through gradient clipping and the addition of calibrated noise. This approach guarantees that the model does not memorize or leak sensitive information.
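For reference, a standard DPSGD update (shown here as background; the exact variant and hyperparameters differ across the methods described next) clips every per-example gradient and adds Gaussian noise before the step:

\tilde{g}_t = \frac{1}{|B_t|} \left( \sum_{i \in B_t} \frac{g_t(x_i)}{\max\!\left(1,\, \lVert g_t(x_i) \rVert_2 / C\right)} + \mathcal{N}\!\left(0, \sigma^2 C^2 \mathbf{I}\right) \right), \qquad \theta_{t+1} = \theta_t - \eta_t \, \tilde{g}_t

where C is the clipping norm, σ the noise multiplier, and B_t the current batch; a privacy accountant translates σ and the number of steps into an (ε, δ) guarantee.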

In our study, we explore three primary methods for private adaptation of Open LLMs: 

  1. Prompt-based adaptation (PromptDPSGD) introduces a small number of additional parameters (<1% of the model’s total parameters) in the input space through soft prompts or prefix-tuning, and adapts DPSGD to preserve privacy.
  2. Parameter-efficient fine-tuning, such as LoRA, only updates a relatively small number of parameters (<10% of the model’s total parameters) within the model’s architecture to enable efficient updates. PrivateLoRA extends this approach with DP guarantees by building on the DPSGD algorithm.
  3. Full fine-tuning adaptations (DP-FineTune) involve fine-tuning the entire model or a subset of its layers for comprehensive adaptation while adhering to differential privacy principles.

Applying these methods to Vicuna, Llama-3, OpenLLaMa, BART, RoBERTa, and the Pythia suite of models, we find that private adaptation of Open LLMs improves performance on downstream tasks and reduces costs compared to their Closed counterparts. It also provides a critical privacy benefit by eliminating the risk of exposing private data and user queries to LLM providers.

Insightful results

Our analysis of private adaptation methods for both Closed and Open LLMs reveals several critical findings regarding data leakage, performance, and cost:

  1. Query data leakage: All private adaptation methods for Closed LLMs leak query data to the LLM provider. This means that sensitive information from user queries is exposed during the adaptation process, posing a significant privacy risk.
  2. Training data leakage: Only one method (DP-OPT) of the four methods of private adaptation of Closed LLMs successfully protects private training data from the LLM provider. However, this method requires a local LLM to effectively protect the privacy of the training data. The remaining private adaptation methods for Closed LLMs leak a large fraction of the training data to the LLM provider, undermining the privacy guarantees of the adaptation process.
  3. Performance: All adaptation methods for Closed LLMs achieve lower downstream task performance than privacy-preserving local adaptations on Open LLMs, even when the Open LLMs are significantly smaller than their Closed counterparts.
  4. Cost: The training and query costs for private adaptations of Closed LLMs are substantially higher due to the API access costs imposed by the LLM provider. In contrast, private adaptations for Open LLMs are more cost-effective. We estimated the costs assuming an A40 GPU with 48 GB of memory. In this scenario, privately adapting a Closed LLM to text classification tasks with DP-ICL costs about $140, whereas fine-tuning an Open LLM with PrivateLoRA on the same tasks costs about $30.

This leads to the conclusion that for a truly privacy-preserving adaptation of LLMs, one should use Open LLMs. By offering full control over the model and data, Open LLMs eliminate the risks associated with third-party providers and enable robust privacy-preserving techniques. As a result, Open LLMs address the limitations of Closed LLMs and enable efficient and customizable adaptations tailored to sensitive datasets.
