
Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs 


Creating efficient prompts for large language models often starts as a simple task… but it doesn’t always stay that way. Initially, following basic best practices seems sufficient: adopt the persona of a specialist, write clear instructions, require a specific response format, and include a few relevant examples. But as requirements multiply, contradictions emerge, and even minor modifications can introduce unexpected failures. What was working perfectly in one prompt version suddenly breaks in another.

If you have ever felt trapped in an endless loop of trial and error, adjusting one rule only to see another one fail, you’re not alone! The reality is that traditional prompt optimisation lacks the structured, scientific approach needed to ensure reliability.

That’s where functional testing for prompt engineering comes in! This approach, inspired by methodologies of experimental science, leverages automated input-output testing with multiple iterations and algorithmic scoring to turn prompt engineering into a measurable, data-driven process. 

No more guesswork. No more tedious manual validation. Just precise and repeatable results that allow you to fine-tune prompts efficiently and confidently.

In this article, we will explore a systematic approach to mastering prompt engineering that ensures your LLM outputs will be efficient and reliable, even for the most complex AI tasks.

Balancing precision and consistency in prompt optimisation

Adding a large set of rules to a prompt can introduce partial contradictions between rules and lead to unexpected behaviors. This is especially true when a prompt starts with a general rule and then layers on multiple exceptions or specific, contradictory use cases. Adding specific rules and exceptions can cause conflict with the primary instruction and, potentially, with each other.

What might seem like a minor modification can unexpectedly impact other aspects of a prompt. This is not only true when adding a new rule but also when adding more detail to an existing rule, changing the order of the instructions, or even simply rewording them. These minor modifications can unintentionally change the way the model interprets and prioritizes the set of instructions.

The more details you add to a prompt, the greater the risk of unintended side effects. By trying to specify every aspect of your task in too much detail, you also increase the risk of unexpected or distorted results. It is, therefore, essential to find the right balance between clarity and level of specification to maximise the relevance and consistency of the response. At a certain point, fixing one requirement can break two others, creating the frustrating feeling of taking one step forward and two steps backward in the optimization process.

Testing each change manually quickly becomes overwhelming. This is especially true when optimizing prompts that must satisfy numerous competing specifications in a complex AI task. The process cannot simply be about modifying the prompt for one requirement after another, hoping the previous instructions remain unaffected. Nor can it be a system of hand-picking examples and checking them manually. A better process, with a more scientific approach, should focus on ensuring repeatability and reliability in prompt optimization.

From laboratory to AI: Why testing LLM responses requires multiple iterations

Science teaches us to use replicates to ensure reproducibility and build confidence in an experiment’s results. I have been working in academic research in chemistry and biology for more than a decade. In those fields, experimental results can be influenced by a multitude of factors that can lead to significant variability. To ensure the reliability and reproducibility of experimental results, scientists commonly use triplicates: the same experiment is conducted three times under identical conditions, so that experimental variation has only a minor influence on the result. Statistical analysis (mean and standard deviation) of the results, especially in biology, allows the author of an experiment to determine the consistency of the results and strengthens confidence in the findings.

Just like in biology and chemistry, this approach can be used with LLMs to achieve reliable responses. With LLMs, the generation of responses is non-deterministic, meaning that the same input can lead to different outputs due to the probabilistic nature of the models. This variability is challenging when evaluating the reliability and consistency of LLM outputs.

In the same way that biological and chemical experiments require triplicates to ensure reproducibility, testing LLMs requires multiple iterations to measure reproducibility. A single test per use case is, therefore, not sufficient because it does not capture the inherent variability in LLM responses. At least five iterations per use case allow for a better assessment. By analyzing the consistency of the responses across these iterations, we can better evaluate the reliability of the model and identify potential issues or variations. It ensures that the output of the model is properly controlled.

Multiply this across 10 to 15 different prompt requirements, and one can easily understand how, without a structured testing approach, we end up spending time in trial-and-error testing with no efficient way to assess quality.

A systematic approach: Functional testing for prompt optimization

To address these challenges, a structured evaluation methodology can be used to ease and accelerate the testing process and enhance the reliability of LLM outputs. This approach has several key components:

  • Data fixtures: At the core of the approach are data fixtures: predefined input-output pairs specifically created for prompt testing. These fixtures serve as controlled scenarios that represent the various requirements and edge cases the LLM must handle. By using a diverse set of fixtures, the performance of the prompt can be evaluated efficiently across different conditions.
  • Automated test validation: This approach automates the validation of the requirements on a set of data fixtures by comparison between the expected outputs defined in the fixtures and the LLM response. This automated comparison ensures consistency and reduces the potential for human error or bias in the evaluation process. It allows for quick identification of discrepancies, enabling fine and efficient prompt adjustments.
  • Multiple iterations: To assess the inherent variability of the LLM responses, this method runs multiple iterations for each test case. This iterative approach mimics the triplicate method used in biological/chemical experiments, providing a more robust dataset for analysis. By observing the consistency of responses across iterations, we can better assess the stability and reliability of the prompt.
  • Algorithmic scoring: The results of each test case are scored algorithmically, reducing the need for long and laborious “human” evaluation. This scoring system is designed to be objective and quantitative, providing clear metrics for assessing the performance of the prompt. By focusing on measurable outcomes, we can make data-driven decisions to optimize the prompt effectively.

Step 1: Defining test data fixtures

Selecting or creating compatible test data fixtures is the most challenging step of our systematic approach because it requires careful thought. A fixture is not just any input-output pair; it must be crafted meticulously to evaluate the LLM’s performance as accurately as possible for a specific requirement. This process requires:

1. A deep understanding of the task and the behavior of the model to make sure the selected examples effectively test the expected output while minimizing ambiguity or bias.

2. Foresight into how the evaluation will be conducted algorithmically during the test.

The quality of a fixture, therefore, depends not only on how representative the example is but also on whether it can be tested efficiently by an algorithm.

A fixture consists of two components (a minimal code sketch follows the list):

    • Input example: This is the data that will be given to the LLM for processing. It should represent a typical or edge-case scenario that the LLM is expected to handle. The input should be designed to cover a wide range of possible variations that the LLM might have to deal with in production.

    • Expected output: This is the expected result that the LLM should produce with the provided input example. It is used for comparison with the actual LLM response output during validation.
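
To make this concrete, a fixture could be represented in code as follows; the class and field names are illustrative choices, not a prescribed format:

from dataclasses import dataclass

@dataclass
class Fixture:
    name: str             # short label for the requirement being tested
    input_text: str       # the input example given to the LLM
    expected_output: str  # the reference output used during validation

fixtures = [
    Fixture(
        name="signature_removal_simple",
        input_text="A long article\nJean Leblanc",
        expected_output="The long article",
    ),
]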

Step 2: Running automated tests

Once the test data fixtures are defined, the next step involves the execution of automated tests to systematically evaluate the performance of the LLM response on the selected use cases. As previously stated, this process makes sure that the prompt is thoroughly tested against various scenarios, providing a reliable evaluation of its efficiency.

Execution process

    1. Multiple iterations: For each test use case, the same input is provided to the LLM multiple times: a simple for loop over nb_iter iterations (e.g., nb_iter = 5) is enough. A minimal sketch of the full test loop follows this list.

    2. Response comparison: After each iteration, the LLM response is compared to the expected output of the fixture. This comparison checks whether the LLM has correctly processed the input according to the specified requirements.

    3. Scoring mechanism: Each comparison results in a score:

        ◦ Pass (1): The response matches the expected output, indicating that the LLM has correctly handled the input.

        ◦ Fail (0): The response does not match the expected output, signaling a discrepancy that needs to be fixed.

    4. Final score calculation: The scores from all iterations are aggregated to calculate the overall final score. This score represents the proportion of successful responses out of the total number of iterations. A high score, of course, indicates high prompt performance and reliability.
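
The sketch below assumes a call_llm() function that sends the prompt plus the fixture input to the model and a validate() function that compares the response to the expected output; both are hypothetical placeholders for your own implementation:

def run_tests(prompt, fixtures, nb_iter=5):
    scores = {}
    for fixture in fixtures:
        passes = 0
        for _ in range(nb_iter):
            response = call_llm(prompt, fixture.input_text)  # hypothetical LLM call
            passes += validate(response, fixture.expected_output)  # returns 1 (pass) or 0 (fail)
        scores[fixture.name] = passes / nb_iter  # proportion of successful iterations
    return scores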

Example: Removing author signatures from an article

Let’s consider a simple scenario where an AI task is to remove author signatures from an article. To efficiently test this functionality, we need a set of fixtures that represent the various signature styles. 

A dataset for this example could be:

Example | Input                                      | Expected output
--------|--------------------------------------------|------------------
1       | A long article ending with “Jean Leblanc”  | The long article
2       | A long article ending with “P. W. Hartig”  | The long article
3       | A long article ending with “MCZ”           | The long article

Validation process:

  • Signature removal check: The validation function checks if the signature is absent from the rewritten text. This is easily done programmatically by searching for the signature needle in the haystack output text.
  • Test failure criteria: If the signature is still present in the output, the test fails, indicating that the LLM did not correctly remove the signature and that further adjustments to the prompt are required. If the signature is absent, the test passes (a minimal sketch of such a check follows the list).
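
Assuming each fixture also stores the signature string to search for (a hypothetical signature field), the check could look like this:

def validate_signature_removal(llm_output: str, signature: str) -> int:
    # Pass (1) if the signature needle is absent from the output haystack, Fail (0) otherwise
    return 0 if signature in llm_output else 1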

The test evaluation provides a final score that allows a data-driven assessment of the prompt efficiency. If it scores perfectly, there is no need for further optimization. However, in most cases, you will not get a perfect score because either the consistency of the LLM response to a case is low (for example, 3 out of 5 iterations scored positive) or there are edge cases that the model struggles with (0 out of 5 iterations). 

The feedback clearly indicates that there is still room for further improvements and it guides you to reexamine your prompt for ambiguous phrasing, conflicting rules, or edge cases. By continuously monitoring your score alongside your prompt modifications, you can incrementally reduce side effects, achieve greater efficiency and consistency, and approach an optimal and reliable output. 

A perfect score is, however, not always achievable with the selected model. Changing the model might just fix the situation. If it doesn’t, you know the limitations of your system and can take this fact into account in your workflow. With luck, this situation might just be solved in the near future with a simple model update. 

Benefits of this method 

  • Reliability of the result: Running five to ten iterations provides reliable statistics on the performance of the prompt. A single test may succeed once but fail the next time, whereas consistent success across multiple iterations indicates a robust and well-optimized prompt.
  • Efficiency of the process: Unlike traditional scientific experiments that may take weeks or months to replicate, automated testing of LLMs can be carried out quickly. By setting a high number of iterations and waiting for a few minutes, we can obtain a high-quality, reproducible evaluation of the prompt efficiency.
  • Data-driven optimization: The score obtained from these tests provides a data-driven assessment of the prompt’s ability to meet requirements, allowing targeted improvements.
  • Side-by-side evaluation: Structured testing allows for an easy assessment of prompt versions. By comparing the test results, one can identify the most effective set of parameters for the instructions (phrasing, order of instructions) to achieve the desired results.
  • Quick iterative improvement: The ability to quickly test and iterate on prompts makes it possible to construct the prompt carefully, ensuring that previously validated requirements continue to hold as the prompt grows in complexity and length.

By adopting this automated testing approach, we can systematically evaluate and enhance prompt performance, ensuring consistent and reliable outputs with the desired requirements. This method saves time and provides a robust analytical tool for continuous prompt optimization.

Systematic prompt testing: Beyond prompt optimization

Implementing a systematic prompt testing approach offers more advantages than just the initial prompt optimization. This methodology is valuable for other aspects of AI tasks:

    1. Model comparison:

        ◦ Provider evaluation: This approach allows the efficient comparison of different LLM providers, such as ChatGPT, Claude, Gemini, Mistral, etc., on the same tasks. It becomes easy to evaluate which model performs best for your specific needs.

        ◦ Model version: State-of-the-art model versions are not always necessary when a prompt is well-optimized, even for complex AI tasks. A lighter-weight version can often provide the same results with faster responses. This approach allows a side-by-side comparison of different versions of a model, such as Gemini 1.5 Flash vs. 1.5 Pro vs. 2.0 Flash, or GPT-3.5 vs. GPT-4o mini vs. GPT-4o, and enables a data-driven selection of the model version.

    2. Version upgrades:

        ◦ Compatibility verification: When a new model version is released, systematic prompt testing helps validate if the upgrade maintains or improves the prompt performance. This is crucial for ensuring that updates do not unintentionally break the functionality.

        ◦ Seamless Transitions: By identifying key requirements and testing them, this method can facilitate better transitions to new model versions, allowing fast adjustment when necessary in order to maintain high-quality outputs.

    3. Cost optimization:

        ◦ Performance-to-cost ratio: Systematic prompt testing helps in choosing the most cost-effective model based on the performance-to-cost ratio. We can identify the option that best balances performance and operational costs, maximizing the return on LLM spending.

Overcoming the challenges

The biggest challenge of this approach is the preparation of the set of test data fixtures, but the effort invested in this process will pay off significantly as time passes. Well-prepared fixtures save considerable debugging time and enhance model efficiency and reliability by providing a robust foundation for evaluating the LLM response. The initial investment is quickly returned by improved efficiency and effectiveness in LLM development and deployment.

Quick pros and cons

Key advantages:

  • Continuous improvement: The ability to add more requirements over time while ensuring existing functionality stays intact is a significant advantage. This allows for the evolution of the AI task in response to new requirements, ensuring that the system remains up-to-date and efficient.
  • Better maintenance: This approach enables the easy validation of prompt performance with LLM updates. This is crucial for maintaining high standards of quality and reliability, as updates can sometimes introduce unintended changes in behavior.
  • More flexibility: With a set of quality control tests, switching LLM providers becomes more straightforward. This flexibility allows us to adapt to changes in the market or technological advancements, ensuring we can always use the best tool for the job.
  • Cost optimization: Data-driven evaluations enable better decisions on performance-to-cost ratio. By understanding the performance gains of different models, we can choose the most cost-effective solution that meets the needs.
  • Time savings: Systematic evaluations provide quick feedback, reducing the need for manual testing. This efficiency allows you to iterate quickly on prompt improvement and optimization, accelerating the development process.

Challenges

  • Initial time investment: Creating test fixtures and evaluation functions can require a significant investment of time. 
  • Defining measurable validation criteria: Not all AI tasks have clear pass/fail conditions. Defining measurable criteria for validation can sometimes be challenging, especially for tasks that involve subjective or nuanced outputs. This requires careful consideration and can make selecting appropriate evaluation metrics difficult.
  • Cost associated with multiple tests: Multiple test use cases, each run for 5 to 10 iterations, can generate a high number of LLM requests for a single test automation run. But since the cost of a single LLM call is negligible in most cases for text input/output calls, the overall cost of a test run remains minimal.

Conclusion: When should you implement this approach?

Implementing this systematic testing approach is, of course, not always necessary, especially for simple tasks. However, for complex AI workflows in which precision and reliability are critical, this approach becomes highly valuable by offering a systematic way to assess and optimize prompt performance, preventing endless cycles of trial and error.

By incorporating functional testing principles into prompt engineering, we transform a traditionally subjective and fragile process into one that is measurable, scalable, and robust. Not only does it enhance the reliability of LLM outputs, but it also supports continuous improvement and efficient resource allocation.

The decision to implement systematic prompt testing should be based on the complexity of your project. For scenarios demanding high precision and consistency, investing the time to set up this methodology can significantly improve outcomes and speed up the development process. However, for simpler tasks, a more classical, lightweight approach may be sufficient. The key is to balance the need for rigor with practical considerations, ensuring that your testing strategy aligns with your goals and constraints.

Thanks for reading!

Building The Most Scalable Experiment Tracker For Foundation Models


In large-scale model training, anomalies are not rare events but recurring patterns that drive failure. Detecting anomalies early in the process saves days of work and training.

ML model training observability is not just about tracking metrics. It requires proactive monitoring to catch issues early and ensure model success, given the high cost of training on large GPU clusters.

If you are an enterprise or a team operating a model, focus on three key areas: fine-tune your prompts to get the most effective outputs (prompt engineering), ensure that your model behaves safely and predictably, and implement robust monitoring and logging to track performance, detecting issues early.

The Neptune Scale experiment tracker supports fault tolerance and is designed to maintain progress despite hardware failures, making it adaptable for enterprise teams tackling LLM fine-tuning, compliance, and building domain-specific models.

Scaling large language model (LLM) operations is a challenge that many of us are facing right now. For those navigating similar waters, I recently shared some thoughts about it on the Data Exchange Podcast based on our journey at neptune.ai over the last few years. 

Six years ago, we were mainly focused on MLOps when machine learning in production was still evolving. Experiment tracking back then was straightforward, dealing mostly with single models or small-scale distributed systems. Reinforcement learning was one of the few areas pushing the boundaries of scale: running multiple agents and sending data from many distributed machines to our experiment tracker was a huge challenge.

Scaling LLMs: from ML to LLMOps

The landscape changed two years ago when people started training LLMs at scale. LLMOps has taken center stage, and the importance of scaling large language models has grown with research becoming more industrialized. While researchers continue to lead the training process, they are also adjusting to the transition toward commercial applications.

LLMOps isn’t just MLOps with bigger servers; it is a paradigm shift for tracking experiments. We’re not tracking a few hundred metrics for a couple of hours anymore; we’re tracking thousands, even tens of thousands, over several months. These models are trained on GPU clusters spanning multiple data centers, with training jobs that can take months to complete.

Due to time constraints, training frontier models has become a production workflow rather than experimentation. When a from-scratch training run occupies 50,000 GPUs for several months across different data centers, you don’t get a second chance if something goes wrong—you need to get it right the first time.

Another interesting aspect of LLM training that only a few companies have truly nailed is the branch-and-fork style of training—something that Google has implemented effectively. This method involves branching off multiple experiments from a continuously running model, requiring a significant amount of data from previous runs. It’s a powerful approach, but it demands infrastructure capable of handling large data inheritance, which makes it feasible only for a handful of companies.

From experiment tracking to experiment monitoring

Now we want to track everything—every layer, every detail—because even a small anomaly can mean the difference between success and failure and many hours of wasted work. And it is not only pre-training and training time that matter; post-training also takes a huge amount of time and collaborative human work. To address this, we have re-engineered Neptune’s platform to efficiently ingest and visualize massive volumes of data, enabling fast monitoring and analysis at a larger scale.


The Neptune Scale experiment tracker in action: enabling real-time monitoring and visualization of every relevant metric during model training (in this example, BLEU and edit distance). Neptune tracks metrics across 200 runs, allowing users to identify patterns and potential anomalies early in pre-training and post-training, thus reducing risks even in long-running LLM training workflows.

One of the biggest lessons we’ve learned is that experiment tracking has evolved into experiment monitoring. Unlike in classical MLOps, tracking is no longer just about logging metrics to review later or restarting training from a checkpoint a few steps back. It’s about having real-time insights to keep everything on track. With such long training times, a single overlooked metric can lead to significant setbacks. That’s why we’re focusing on building intelligent alerts and anomaly detection right into our experiment tracking system.

Think of it like this—we’re moving from being reactive trackers to proactive observers. Our goal is for our platform to recognize when something is off before the researcher even knows to look for it.

Fault tolerance in LLMs

When you’re dealing with LLM training at this scale, fault tolerance becomes a critical component. With thousands of GPUs running for months, hardware failures are almost inevitable. It’s crucial to have mechanisms in place to handle these faults gracefully. 

At Neptune, our system is designed to ensure that the training can resume from checkpoints without losing any data. Fault tolerance does not only mean preventing failures; it also includes minimizing the impact when they occur, so that time and resources are not wasted.


What does this mean for enterprise teams?

If you’re creating your own LLMs from scratch, or even if you’re an enterprise fine-tuning a model, you might wonder how all this is relevant to you. Here’s the deal: strategies originally designed for handling the massive scale of training LLMs are now being adopted in other areas or by smaller-scale projects. 

Today, cutting-edge models are pushing the boundaries of scale, complexity, and performance, but these same lessons are starting to matter in fine-tuning tasks, especially when dealing with compliance, reproducibility, or complex domain-specific models.

For enterprise teams, there are three key focuses to consider:

  1. Prompt Engineering: Fine-tune your prompts to get the most effective outputs. This is crucial for adapting large models to your specific needs without having to train from scratch.
  2. Implement guardrails in your application: Ensuring your models behave safely and predictably is key. Guardrails help manage the risks associated with deploying AI in production environments, especially when dealing with sensitive data or critical tasks.
  3. Observability in your system: Observability is vital to understanding what’s happening inside your models. Implementing robust monitoring and logging allows you to track performance, detect issues early, and ensure your models are working as expected. Neptune’s experiment tracker provides the observability you need to stay on top of your model’s behavior.

The future: what we’re building next

At Neptune, we’ve nailed the data ingestion part—it’s fast, reliable, and efficient. The challenge for the next year is making this data useful at scale. We need more than just filtering; we need smart tools that can surface the most critical insights and the most granular information automatically. The goal is to build an experiment tracker that helps researchers discover insights, not just record data.

We’re also working on developing a platform that combines monitoring and anomaly detection with the deep expertise researchers acquire over years of experience. By embedding that expertise directly into the tool (either automatically or by defining rules manually), less experienced researchers can benefit from the patterns and signals that would otherwise take years to learn.


Transformers Key-Value Caching Explained


As the complexity and size of transformer-based models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies.

Key-value (KV) caching is a clever trick to do that: At inference time, key and value matrices are calculated for each generated token. KV caching stores these matrices in memory so that when subsequent tokens are generated, we only compute the keys and values for the new tokens instead of having to recompute everything.

The inference speedup from KV caching comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy.

Implementing K-V caching in large-scale production systems requires careful cache management, including choosing an appropriate strategy for cache invalidation and exploring opportunities for cache reuse.

The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the GPT family, as well as many computer vision tasks.

As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies. Key-value (KV) caching is a clever trick to do just that – let’s see how it works and when to use it.

Transformer architecture overview

Before we dive into KV caching, we will need to take a short detour to the attention mechanism used in transformers. Understanding how it works is required to spot and appreciate how KV caching optimizes transformer inference.

We will focus on autoregressive models used to generate text. These so-called decoder models include the GPT family, Gemini, Claude, and GitHub Copilot. They are trained on a simple task: predicting the next token in a sequence. During inference, the model is provided with some text, and its task is to predict how this text should continue.

From a high-level perspective, most transformers consist of a few basic building blocks:

  • A tokenizer that splits the input text into subparts, such as words or sub-words.
  • An embedding layer that transforms the resulting tokens (and their relative positions within the texts) into vectors.
  • A couple of basic neural network layers, including dropout, layer normalization, and regular feed-forward linear layers.

The last building block missing from the list above is the slightly more involved self-attention module.

The self-attention module is, arguably, the only advanced piece of logic in the transformer architecture. It is the cornerstone of every transformer, enabling it to focus on different parts of the input sequence when generating the outputs. It is this mechanism that gives transformers the ability to model long-range dependencies effectively.

Let’s inspect the self-attention module in more detail.

Basic self-attention module

Self-attention is a mechanism that allows the model to “pay attention” to specific parts of the input sequence as it generates the next token. For example, in generating the sentence “She poured the coffee into the cup,” the model might pay more attention to the words “poured” and “coffee” to predict “into” as the next word since these words provide context for what is likely to come next (as opposed to “she” and “the”).

Mathematically speaking, the goal of self-attention is to transform each input (embedded token) into a so-called context vector, which combines the information from all the inputs in a given text. Consider the text “She poured coffee”. Attention will compute three context vectors, one for each input token (let’s assume tokens are words).

To calculate the context vectors, self-attention computes three kinds of intermediate vectors: queries, keys, and values. The diagram below shows step by step how the context vector for the second word, “poured,” is calculated:

The diagram shows step by step how the context vector for the second word, “poured,” is calculated. | Source: Author

Let’s denote the three tokenized inputs as x1, x2, and x3, respectively. The diagram pictures them as vectors with three elements, but in practice, they will be hundreds or thousands of elements long.

As the first step, self-attention multiplies each input separately with two weight matrices, Wk and Wv. The input for which the context vector is now being computed (x2 in our case) is additionally multiplied with a third weight matrix, Wq. All three W matrices are your usual neural network weights, randomly initialized and optimized in the learning process. The outputs of this step are the keys (k) and values (v) vectors for each input, plus an additional query (q) vector for the input being processed.

In step two, the key vector of each input is multiplied by the query vector of the input being processed (our q2). The output is then normalized (not shown in the diagram) to produce the attention weights. In our example, a21 is the attention weight between the inputs “She” and “poured.”

Finally, each attention weight is multiplied by its corresponding value vector. The outputs are then summed to produce the context vector z. In our example, the context vector z2 corresponds to the input x2, “poured.” The context vectors are the outputs of the self-attention module.

If it’s easier for you to read code than diagrams, take a look at this implementation of the basic self-attention module by Sebastian Raschka. The code is part of his book, “Build A Large Language Model (From Scratch)”:

import torch

class SelfAttention_v2(torch.nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

Sebastian’s code operates on matrices: the x in his forward() method corresponds to our x1, x2, and x3 vectors stacked together as a matrix with three rows. This allows him to simply multiply x with W_key to obtain keys, a matrix consisting of three rows (k1, k2, and k3 in our example).

The important takeaway from this brief explanation of self-attention is that in each forward pass, we multiply keys with the queries and then later with the values. Keep this in mind as you read on.

Advanced self-attention modules

The variant of self-attention described above is its simplest, vanilla form. Today’s largest LLMs typically use slightly modified variants that differ from this basic flavor in three ways:

  1. Attention is causal.
  2. Dropout is used on attention weights.
  3. Multi-head attention is used.

Causal attention means that the model should only consider previous tokens in the sequence when predicting the next one, preventing it from “looking ahead” at future words. Going back to our example, “She poured coffee.”, when the model was given the word “She” and is now attempting to predict the next one (“poured” would be correct), it should not compute or have access to attention weights between “coffee” and any other word since the word “coffee” has not appeared in the text yet. Causal attention is typically implemented by masking the “look-ahead” part of the attention weights matrix with zeros.
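
As an illustration, here is one common way to apply such a mask in PyTorch: positions above the diagonal are set to negative infinity before the softmax, so their attention weights become exactly zero. This is a minimal sketch, not the implementation of any particular model:

import torch

seq_len, d_k = 3, 4
attn_scores = torch.randn(seq_len, seq_len)  # stand-in for queries @ keys.T

# True above the diagonal marks the "look-ahead" positions to hide
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attn_scores = attn_scores.masked_fill(mask, float("-inf"))

# After softmax, the masked positions contribute zero attention weight
attn_weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)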

Next, to reduce overfitting during training, dropout is often applied to the attention weights. This means that some of them are randomly set to zero in each forward pass.

Finally, basic attention can be referred to as single-head, meaning that there is just one set of Wk, Wq, and Wv matrices. An easy way to increase the model’s capacity is to switch to multi-head attention. This boils down to having multiple sets of the W-matrices and, consequently, multiple query, key, and value matrices, as well as multiple context vectors for each input.

Additionally, some transformers implement additional modifications of the attention module with the goal of improving speed or accuracy. Three popular ones are:

  • Grouped-query attention: Query heads are organized into groups that share a single set of key and value projections, reducing the number of key and value computations (and the size of the K-V cache) with little loss in quality. This is used by Llama 3, Mixtral, and Gemini.
  • Paged attention: The K-V cache is stored in fixed-size blocks (“pages”) of memory rather than one contiguous buffer, reducing memory fragmentation and making it easier to serve very long sequences and many concurrent requests.
  • Sliding-window attention: The model only attends to nearby tokens within a fixed “window” around each token, so it focuses on the local context without needing to look at the entire sequence.

All of these state-of-the-art approaches to implementing self-attention don’t change its basic premise and the fundamental mechanism it relies on: one always needs to multiply the keys by the queries and then later by the values. And as it turns out, at inference time, these multiplications show major inefficiencies. Let’s see why that’s the case.

What is key-value caching?

During inference, transformers generate one token at a time. When we prompt the model to start generation by passing “She,” it will produce one word, such as “poured” (for the sake of avoiding distractions, let’s keep assuming one token is one word). Then, we can pass “She poured” to the model, and it produces “coffee.” Next, we pass “She poured coffee” and obtain the end-of-sequence token from the model, indicating that it considers generation to be complete.

This means we have run the forward pass three times, each time multiplying the queries by the keys to obtain the attention scores (the same applies to the later multiplication by the values).

In the first forward pass, there was just one input token (“She”), resulting in just one key vector and one query vector. We multiplied them to obtain the q1k1 attention score.

In the first forward pass, there is just one input token (“She”), resulting in just one key vector and one query vector. We multiply them to obtain the q1k1 attention score.

Next, we passed “She poured” to the model. It now sees two input tokens, so the computation inside our attention module looks as follows:

Next, we pass “She poured” to the model. It now sees two input tokens.

We did the multiplication to compute three terms, but q1k1 was computed needlessly—we had already calculated it before! This q1k1 element is the same as in the previous forward pass because:

  • q1 is calculated as the embedding of the input (“She”) times the Wq matrix,
  • k1 is calculated as the embedding of the input (“She”) times the Wk matrix,
  • Both the embeddings and the weight matrices are constant at inference time.

Note the grayed-out entries in the attention scores matrix: these are masked with zero to achieve causal attention. For example, the top-right element where q1k3 would have been is not shown to the model as we don’t know the third word (and k3) at the moment of generating the second word.

Finally, here is the illustration of the query-times-keys calculation in our third forward pass.

The query-times-keys calculation in the third forward pass.

We make the computational effort to calculate six values, half of which we already know and don’t need to recompute!

You may already have a hunch about what key-value caching is all about. At inference, as we compute the keys (K) and values (V) matrices, we store their elements in the cache. The cache is an auxiliary memory from which high-speed retrieval is possible. As subsequent tokens are generated, we only compute the keys and values for the new tokens.

For example, this is how the third forward pass would look with caching:

How the third forward pass looks with caching.

When processing the third token, we don’t need to recompute the previous tokens’ attention scores. We can retrieve the keys and values for the first two tokens from the cache, thus saving computation time.
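
To make the idea tangible, here is a minimal sketch of a caching variant of the earlier self-attention module. It is deliberately simplified (single head, single sequence, one new token per step so no causal mask is needed) and is not meant as production code:

import torch

class CachedSelfAttention(torch.nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=False)
        self.W_key   = torch.nn.Linear(d_in, d_out, bias=False)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=False)
        self.k_cache = None
        self.v_cache = None

    def forward(self, x_new):
        # x_new contains only the newly generated token, not the full sequence
        q_new = self.W_query(x_new)
        k_new = self.W_key(x_new)
        v_new = self.W_value(x_new)

        # Append the new key and value to the cache instead of recomputing old ones
        self.k_cache = k_new if self.k_cache is None else torch.cat([self.k_cache, k_new], dim=0)
        self.v_cache = v_new if self.v_cache is None else torch.cat([self.v_cache, v_new], dim=0)

        # The new query attends to all cached keys and values
        attn_scores = q_new @ self.k_cache.T
        attn_weights = torch.softmax(attn_scores / self.k_cache.shape[-1]**0.5, dim=-1)
        return attn_weights @ self.v_cache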

Assessing the impact of key-value caching

Key-value caching may have a significant impact on inference time. The magnitude of this impact depends on the model architecture. The more cachable computations there are, the larger the potential to reduce inference time.

Let’s analyze the impact of K-V caching on generation time using the GPT-Neo-1.3B model from EleutherAI, which is available on the Hugging Face Hub.

We will start by defining a timer context manager to calculate generation time:

import time

class Timer:

    def __enter__(self):
        self._start = time.time()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._end = time.time()
        self.duration = self._end - self._start

    def get_duration(self) -> float:
        return self.duration

Next, we load the model from the Hugging Face Hub, set up the tokenizer, and define the prompt:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

input_text = "Why is a pour-over the only acceptable way to drink coffee?"

Finally, we can define the function to run model inference:

def generate(use_cache):
    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt",
    ).to(device)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=100,
        use_cache=use_cache,
    )
    return output_ids

Note the use_cache argument we pass to model.generate: It controls whether K-V caching is employed.

With this setup, we can measure the average generation time with and without K-V caching:

import numpy as np

for use_cache in (False, True):
    gen_times = []
    for _ in range(10):
        with Timer() as t:
            generate(use_cache=use_cache)
        gen_times.append(t.duration)
    print(f"Average inference time with use_cache={use_cache}: {np.round(np.mean(gen_times), 2)} seconds")

I have executed this code on Google Colab using their free-tier T4 GPU with torch==2.5.1+cu121 and transformers==4.46.2 on Python 3.10.12 and obtained the following output:

Average inference time with use_cache=False: 9.28 seconds
Average inference time with use_cache=True: 3.19 seconds

As you can see, in this case, the speedup from caching is almost threefold.

Challenges and trade-offs

As is usually the case, there is no such thing as a free lunch. The generation speedup we have just seen can only be achieved at the cost of increased memory usage, and it requires careful management in production systems.

Latency-memory trade-off

Storing data in the cache uses up memory space. Systems with limited memory resources may struggle to accommodate this additional memory overhead, potentially resulting in out-of-memory errors. This is especially the case when long inputs need to be processed, as the memory required for the cache grows linearly with the input length.
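
For a rough sense of scale, the cache holds one key and one value vector per token, per attention head, per layer. A back-of-envelope estimate (the configuration numbers below are assumptions chosen to resemble a GPT-Neo-1.3B-sized model, not exact values) looks like this:

# Assumed configuration, roughly GPT-Neo-1.3B-sized; adjust to your model
num_layers, num_heads, head_dim = 24, 16, 128
seq_len, batch_size, bytes_per_elem = 2048, 1, 2  # 2 bytes per element for float16

# Factor of 2 for keys and values, across every layer, head, position, and batch element
cache_bytes = 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem
print(f"{cache_bytes / 1e9:.2f} GB")  # ~0.40 GB here; grows linearly with seq_len and batch_size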

Another aspect to keep in mind is that the additional memory consumed by the cache is not available for storing the batches of data. As a result, one might need to reduce the batch size to keep it within the memory limits, thus decreasing the throughput of the system.

If the memory consumed by the cache becomes a problem, one can trade additional memory for some of the model accuracy. Specifically, one can truncate the sequences, prune the attention heads, or quantize the model:

  • Sequence truncation refers to limiting the maximum input sequence length, thus capping the cache size at the expense of losing long-term context. In tasks where this long context is relevant, the model’s accuracy might suffer.
  • Reducing the number of layers or attention heads, thereby decreasing both the model size and cache memory requirements, is another strategy to reclaim some memory. However, reducing model complexity may impact its accuracy.
  • Finally, there is quantization, which means using lower-precision data types (e.g., float16 instead of float32) for caching to reduce memory usage. Yet again, model accuracy can suffer.

To sum up, faster latency provided by K-V caching comes at the cost of increased memory usage. If there is sufficient memory, it’s a non-issue. If the memory becomes the bottleneck, however, one can reclaim it by simplifying the model in various ways, thus transitioning from a latency-memory trade-off to a latency-accuracy trade-off.

KV cache management in production systems

In large-scale production systems with many users, the K-V cache needs to be properly managed to ensure consistent and reliable response time while preventing excessive memory consumption. The two most critical aspects of this are cache invalidation (when to clear it) and cache reuse (how to use the same cache multiple times).

Cache invalidation

Three of the most popular cache invalidation strategies are session-based clearing, time-to-live invalidation, and contextual relevance-based approaches. Let’s explore them in this order.

The most basic cache invalidation strategy is session-based clearing. We simply clear the cache at the end of a user session or conversation with the model. This simple strategy is a perfect fit for applications where conversations are short and independent of each other.

Think about a customer support chatbot application in which each user session typically represents an individual conversation where the user seeks assistance with specific issues. In this context, the contents of this cache are unlikely to be needed again. Clearing the K-V cache once the user ends the chat or the session times out due to inactivity is a good choice, freeing up memory for the application to handle new users.

In situations where individual sessions are long, however, there are better solutions than session-based clearing. In time-to-live (TTL) invalidation, cache contents are automatically cleared after a certain period. This strategy is a good choice when the relevance of cached data diminishes predictably over time.

Consider a news aggregator app that provides real-time updates. Cached keys and values might only be relevant for as long as the news is hot. Implementing a TTL policy where cached entries expire after, say, one day ensures that responses to similar queries about fresh developments are generated fast while old news doesn’t fill up memory.
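
As a minimal sketch of TTL-based invalidation, consider a small in-memory store that timestamps each cached entry. This illustrates the policy only; a real serving stack would usually delegate this to its inference framework’s cache manager:

import time

class TTLKVCache:

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, cached keys and values)

    def put(self, key, kv):
        self._store[key] = (time.time(), kv)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        timestamp, kv = entry
        if time.time() - timestamp > self.ttl:
            del self._store[key]  # expired: invalidate and treat as a cache miss
            return None
        return kv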

Finally, the most sophisticated of the three popular cache invalidation strategies is based on contextual relevance. Here, we clear the cache contents as soon as they become irrelevant to the current context or user interaction. This strategy is ideal when the application handles diverse tasks or topics within the same session, and the previous context doesn’t contribute value to the new one.

Think about a coding assistant that works as an IDE plug-in. While the user is working on a particular set of files, the cache should be retained. As soon as they switch to a different codebase, however, the previous keys and values become irrelevant and can be deleted to free memory. Contextual relevance-based approaches might be challenging to implement, though, as they require pinpointing the event or point in time at which the context switch occurs.

Cache reuse

Another important aspect of cache management is its reuse. On some occasions, a once-generated cache can be used again to speed up generation and save memory by avoiding storing the same data multiple times in different users’ cache instances.

Cache reuse opportunities typically show up when there is shared context and/or a warm start is desirable.

In scenarios where multiple requests share a common context, one can reuse the cache for that shared portion. In e-commerce platforms, certain products may have standard descriptions or specifications that are frequently asked about by multiple customers. These may include product details (“55-inch 4K Ultra HD Smart LED TV”), warranty information (“Comes with a 2-year manufacturer’s warranty covering parts and labor.”), or customer instructions (“For best results, mount the TV using a compatible wall bracket, sold separately.”). By caching the key-value pairs for these shared product descriptions, a customer support chatbot will generate responses to common questions faster.

Similarly, one can precompute and cache the initial K-V pairs for frequently used prompts or queries. Consider a voice-activated virtual assistant application. Users frequently start interactions with phrases like “What’s the weather today?” or “Set a timer for 10 minutes.” The assistant can respond more quickly by precomputing and caching the key-value pairs for these frequently used queries.
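
As a sketch of this warm-start idea with the Hugging Face setup used earlier, one can run the shared prefix through the model once and reuse the returned past_key_values. The exact handling of cached values can vary between transformers versions, so treat this as illustrative rather than a drop-in recipe:

# Precompute the K-V cache for a prompt prefix shared by many requests
prefix = "What's the weather today?"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    shared_cache = model(prefix_ids, use_cache=True).past_key_values

# Reuse the cached prefix for a request that starts with the same text;
# only the continuation tokens need fresh key and value computations
continuation_ids = tokenizer(" In Paris, please.", return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    out = model(continuation_ids, past_key_values=shared_cache, use_cache=True)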

Conclusion

Key-value (K-V) caching is a technique in transformer models where the key and value matrices from previous steps are stored and reused during the generation of subsequent tokens. It allows for the reduction of redundant computations and speeding up inference time. This speedup comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy. Implementing K-V caching in large-scale production systems requires careful cache management, including choosing the strategy for cache invalidation and exploring the opportunities for cache reuse.


Understanding LLMs Requires More Than Statistical Generalization [Paper Reflection]


In our paper, Understanding LLMs Requires More Than Statistical Generalization, we argue that current machine learning theory cannot explain the interesting emergent properties of Large Language Models, such as reasoning or in-context learning. From prior work (e.g., Liu et al., 2023) and our experiments, we’ve seen that these phenomena cannot be explained by reaching globally minimal test loss – the target of statistical generalization. In other words, model comparison based on the test loss is nearly meaningless.

We identified three areas where more research is required:

  • Understanding the role of inductive biases in LLM training, including the role of architecture, data, and optimization.
  • Developing more adequate measures of generalization.
  • Using formal languages to study language models in well-defined scenarios to understand transfer performance.

In this commentary, we focus on diving deeper into the role of inductive biases. Inductive biases, such as the choice of model architecture or optimization algorithm, affect which solution the neural network converges to. For example, Stochastic Gradient Descent (SGD) favors neural networks with minimum-norm weights.

Inductive biases influence model performance. Even if two models with parameters θ1 and θ2 yield the same training and test loss, their downstream performance can differ.

How do language complexity and model architecture affect generalization ability?

In their Neural Networks and the Chomsky Hierarchy paper published in 2023, Delétang et al. showed how different neural network architectures generalize better for different language types.

Following the well-known Chomsky hierarchy, they distinguished four grammar types (regular, context-free, context-sensitive, and recursively enumerable) and defined corresponding sequence prediction tasks. Then, they trained different model architectures to solve these tasks and evaluated if and how well the model generalized, i.e., if a particular model architecture could handle the required language complexity.

In our position paper, we follow this general approach to expose the interaction of architecture and data in formal languages to gain insights into complexity limitations in natural language processing. We study popular architectures used for language modeling, e.g., Transformers, State-Space Models (SSMs) such as Mamba, the LSTM, and its novel extended version, the xLSTM.

To investigate how these models deal with formal languages of different complexity, we use a simple setup where each language consists only of two rules. During training, we monitor how well the models perform next-token prediction on the (in-distribution) test set, measured by accuracy.

However, our main question is whether these models generalize out-of-distribution. For this, we introduce the notion of rule extrapolation.

Can models adapt to changing grammar rules?

To understand rule extrapolation, let’s start with an example. A simple formal language is the aⁿbⁿ language, where the strings obey two rules:

  1. a’s come before b’s.
  2. The number of a’s and b’s is the same.

Examples of valid strings include “ab” and “aabb,” whereas strings like “baab” (violates rule 1) and “aab” (violates rule 2) are invalid. Having trained on such strings, we feed the models an out-of-distribution (OOD) string, violating rule 1 (e.g., a string where the first token is b). 
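
For readers who want to experiment with this setup themselves, here is a minimal validity check for the aⁿbⁿ language (an illustration, not the code used in the paper):

def is_valid_anbn(s: str) -> bool:
    n_a = s.count("a")
    # Rule 1: all a's precede all b's; Rule 2: equal numbers of a's and b's
    return s == "a" * n_a + "b" * (len(s) - n_a) and 2 * n_a == len(s)

assert is_valid_anbn("ab") and is_valid_anbn("aabb")
assert not is_valid_anbn("baab")  # violates rule 1
assert not is_valid_anbn("aab")   # violates rule 2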

We find that most models still obey rule 2 when predicting tokens, which we call rule extrapolation – they do not discard the learned rules entirely but adapt to the new situation in which rule 1 is seemingly no longer relevant. 

This finding is surprising because none of the studied model architectures includes conscious choices to promote rule extrapolation. It emphasizes our point from the position paper that we need to understand the inductive biases of language models to explain emergent (OOD) behavior, such as reasoning or good zero-/few-shot prompting performance.

Efficient LLM training requires understanding what makes a language complex for an LLM

According to the Chomsky hierarchy, the context-free aⁿbⁿ language is less complex than the context-sensitive aⁿbⁿcⁿ language, where the n a’s and n b’s are followed by an equal number of c’s.

Despite their different complexity, the two languages seem very similar to humans. Our experiments show that, e.g., Transformers can learn context-free and context-sensitive languages equally well. However, they seem to struggle with regular languages, which are deemed to be much simpler by the Chomsky hierarchy.

Based on this and similar observations, we conclude that language complexity, as the Chomsky hierarchy defines it, is not a suitable predictor for how well a neural network can learn a language. To guide architecture choices in language models, we need better tools to measure the complexity of the language task we want to learn.

It’s an open question what these could look like. Presumably, we’ll need to find different complexity measures for different model architectures that consider their specific inductive biases.


What’s next?

Understanding how and why LLMs are so successful paves the way to greater data, cost, and energy efficiency. If you want to dive deeper into this topic, our position paper’s “Background” section is full of references, and we discuss numerous concrete research questions.

If you’re new to the field, I particularly recommend Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models (2023) by Liu et al., which nicely demonstrates the shortcomings of current evaluation practices based on the test loss. I also encourage you to check out SGD on Neural Networks Learns Functions of Increasing Complexity (2023) by Nakkiran et al. to understand more deeply how using stochastic gradient descent affects what functions neural networks learn.


Challenges & Solutions For Monitoring at Hyperscale


“What is not measured, cannot be improved.” This quote has become a guiding principle for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the large volume of data generated to the unpredictability of hardware failures and the need for efficient resource management. These issues require strategic solutions, which is why monitoring isn’t just a nice-to-have—it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams facing large-scale model training and monitoring. Let’s get into it.

Real-time monitoring prevents costly failures

Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring—the ability to act immediately—is so critical.

Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines comparing runs, analyzing results, and debugging issues, saving time and effort.

However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.
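To make this concrete, here is a minimal sketch of the kind of non-blocking metric collection such high-throughput setups rely on: GPU statistics are sampled in a background thread (assuming NVIDIA GPUs and the pynvml package), and `log_metric` is a hypothetical stand-in for whatever tracker you use, so the training loop itself is never blocked.

```python
# Minimal sketch: sample GPU metrics in a background thread so the training
# loop is never blocked by monitoring. Assumes NVIDIA GPUs and the `pynvml`
# package; `log_metric` is a hypothetical callback, not a specific tracker's API.
import threading
import time

import pynvml


def gpu_sampler(log_metric, device_index=0, interval_s=1.0, stop_event=None):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    try:
        while stop_event is None or not stop_event.is_set():
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            log_metric("gpu/utilization_pct", util.gpu)
            log_metric("gpu/memory_used_gb", mem.used / 1e9)
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()


stop = threading.Event()
monitor = threading.Thread(
    target=gpu_sampler,
    kwargs={"log_metric": lambda key, value: print(key, value), "stop_event": stop},
    daemon=True,
)
monitor.start()
# ... run the training loop here; call stop.set() when the job finishes ...
```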

The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course. Here are some testimonials:

One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.

James Tu

Research Scientist, Waabi

For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.

Wojtek Rosiński

CTO, ReSpo.Vision

Real-time visualization of GPU memory usage (top) and power consumption (bottom) during a large-scale training run. These metrics help identify potential bottlenecks, such as out-of-memory errors or inefficient hardware utilization, enabling immediate corrective actions to maintain optimal performance. | Source: Author

Troubleshooting hardware failures is challenging: simplify it with debugging

Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.

At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.

Igor Tsvetkov

Former Senior Staff Software Engineer, Cruise

AI teams automating error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify whether failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters).
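As an illustration of the idea (not Cruise’s actual implementation), the sketch below labels a failure as an “infra” or “user” issue by matching common signatures in the stack trace; the patterns and labels are assumptions you would adapt to your own stack.

```python
# Minimal sketch of automated error categorization: match common failure
# signatures in a stack trace and label the run as an "infra" or "user"
# issue. The patterns and labels are illustrative assumptions, not a
# framework API.
import re

INFRA_PATTERNS = [
    r"CUDA out of memory",
    r"NCCL (error|timeout)",
    r"ECC error",
    r"Connection (refused|reset)",
]
USER_PATTERNS = [
    r"shape .* (mismatch|invalid)",
    r"KeyError",
    r"ValueError",
    r"Expected .* but got",
]


def categorize_failure(stack_trace: str) -> str:
    """Return 'infra', 'user', or 'unknown' for a raw stack trace."""
    for pattern in INFRA_PATTERNS:
        if re.search(pattern, stack_trace, flags=re.IGNORECASE):
            return "infra"
    for pattern in USER_PATTERNS:
        if re.search(pattern, stack_trace, flags=re.IGNORECASE):
            return "user"
    return "unknown"


print(categorize_failure("RuntimeError: CUDA out of memory. Tried to allocate 2.0 GiB"))
# -> "infra"
```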

Intuitive experiment tracking optimizes resource utilization

Another relevant aspect of hyperscale monitoring is optimizing resource utilization, particularly when hardware failures and training interruptions can set teams back significantly. Picture a scenario where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.

Use checkpoints at frequent intervals so you do not have to restart from scratch, but can warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints on the same machine. This doesn’t help if your hardware crashes or, for example, if you are using spot instances and they are reassigned.

For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This does not mean that you upload GBs worth of checkpoints to the tracker (although you can, and some of our customers, especially self-hosted customers, do this for security reasons), but rather that you store pointers to the remote location, like S3, where the checkpoints have been saved. This enables you to link the checkpoint with the corresponding experiment step and efficiently retrieve the relevant checkpoint at any given step.
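Here is a minimal sketch of that pattern: the checkpoint is written locally, uploaded to S3, and only the S3 URI is logged against the training step. The bucket name, run identifier, and `log_pointer` callback are placeholders, not a specific tracker’s API.

```python
# Minimal sketch of linking checkpoints to an experiment tracker without
# uploading the weights themselves: save locally, push to S3, and log only
# the S3 URI together with the training step. The bucket, the "run-123"
# identifier, and `log_pointer` are placeholders.
import boto3
import torch


def checkpoint_and_log(model, optimizer, step, bucket="my-training-bucket",
                       log_pointer=print):
    local_path = f"/tmp/checkpoint_step_{step}.pt"
    torch.save(
        {"step": step,
         "model_state": model.state_dict(),
         "optimizer_state": optimizer.state_dict()},
        local_path,
    )
    key = f"checkpoints/run-123/step_{step}.pt"
    boto3.client("s3").upload_file(local_path, bucket, key)
    # Log only the pointer, keyed by step, so the exact checkpoint can be
    # retrieved later for the corresponding experiment state.
    log_pointer(f"checkpoints/step_{step}", f"s3://{bucket}/{key}")
```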

A comparison of training workflows with and without advanced experiment tracking and checkpointing. On the left, failed training runs at various stages lead to wasted time and resources. On the right, a streamlined approach with checkpoints and proactive monitoring ensures consistent progress and minimizes the impact of interruptions. | Source: Author

However, there are two caveats to successfully restarting an experiment from a checkpoint: the experimentation environment must be constant, or at least reproducible, and deterministic issues like out-of-memory errors (OOMs) or bottlenecks that may require parameter changes must be addressed to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.

Track months-long model training with more confidence. Use neptune.ai forking feature to iterate faster and optimize the usage of GPU resources.

With Neptune, users can visualize forked training out of the box. This means you can:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
  • Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

In addition, checkpointing strategies are critical for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.

For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.

Features like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times, making frequent checkpointing more viable without compromising training performance.

Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.

Reproducibility and transparency are non-negotiable

Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage—from parent runs to dataset versions—in an accessible dashboard.

This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.

A single source of truth simplifies data visualization and management at large scale

Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this scale leads to inefficiencies and missed insights.

A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”—those not explicitly logged at the code level but calculated retrospectively within the tool. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
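As a small illustration of such a retrospectively computed custom metric, the sketch below derives F1 from precision and recall series that are assumed to have already been logged as (step, value) pairs; no experiment needs to be rerun.

```python
# Minimal sketch of a retrospectively computed "custom metric": derive F1
# from precision and recall values that were already logged, without
# rerunning the experiments. The series are assumed to come from your
# tracker as (step, value) pairs.
def f1_from_logged(precision_series, recall_series):
    f1 = []
    for (step, p), (_, r) in zip(precision_series, recall_series):
        f1.append((step, 2 * p * r / (p + r) if (p + r) > 0 else 0.0))
    return f1


precision = [(0, 0.80), (1, 0.85), (2, 0.88)]
recall = [(0, 0.70), (1, 0.74), (2, 0.79)]
print(f1_from_logged(precision, recall))
```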

Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member—technical or non-technical—could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.

I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.

Łukasz Grad

Chief Data Scientist, ReSpo.Vision

The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the tool reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.

Visualizing large datasets

We generally do not think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment not in the same pipeline as the actual model training, data management and visualization are critical to LLMOps.

Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.

Moving forward

The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.

Acknowledgments

I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.


Open LLMs are Necessary For Current Private Adaptations and Outperform Their Closed Alternatives [Paper Reflection]


Closed Large Language Models (LLMs), which are proprietary and accessible only via APIs, have dominated the LLM space since around 2022 due to their high performance and versatility. However, Open LLMs have made substantial progress, narrowing the performance gap with their Closed LLM counterparts. Open LLMs are models whose architecture and parameters are publicly available for use, modification, and distribution.

For instance, while Closed LLMs like Anthropic’s Claude (released in March 2023) and OpenAI’s GPT-4 (released in March 2023) set new benchmarks upon their launches, the Open LLM Llama 3, released by Meta in April 2024, and DeepSeek-R1, released in January 2025, not only matched but surpassed these models in tasks such as coding, reasoning, text classification, summarization, and question answering.

While much of the discussion around LLMs centers on task and computational performance, in our paper Open LLMs are Necessary for Current Private Adaptations and Outperform their Closed Alternatives, we focus on the privacy implications of using Open and Closed LLMs. Specifically, we explore whether and how models can be fine-tuned on sensitive data while ensuring robust privacy guarantees.

To this end, we define threat models, compare various Open and Closed LLMs that leverage differential privacy (DP) on classification and generation tasks and analyze methodological limitations. Our research results in a thorough analysis of the privacy-utility tradeoff under different privacy levels.

Our findings indicate that Open LLMs can be adapted to private data without leaking information to third parties, such as LLM providers and malicious users. Thus, they offer a significant privacy advantage over Closed, proprietary models.

The threat space in adapting LLMs to private data

The adaptation of Closed LLMs to private datasets introduces a multifaceted threat space. In typical scenarios, data curators provide their sensitive data to LLM providers for fine-tuning, producing a model tailored to the dataset. This customized model is subsequently queried by external parties, e.g., customers of the data curator.

The resulting threat space can be categorized into three key dimensions:

  1. From the data curator to the LLM provider: The private data shared during fine-tuning may be susceptible to unauthorized access or misuse.
  2. From the querying party to the LLM provider: Queries submitted by end users, which often contain sensitive information intended for the data curator, are exposed to the LLM provider.
  3. From malicious end users to the adapted LLM: Malicious end users may attempt to extract private information through the LLM’s responses to carefully crafted queries.

In contrast to Closed LLMs, Open LLMs provide full control over the model and data, enabling private adaptation without the need to share sensitive information with a third party. This control eliminates the first two threat vectors associated with Closed LLMs, such as unauthorized access or misuse by the provider and exposure of user queries. With Open LLMs, data curators can directly fine-tune the model on private datasets using privacy-preserving techniques, ensuring end-to-end privacy.

What are the current methods for private adaptation of LLMs? 

It follows from our threat space analysis that restricting access to the fine-tuning dataset alone does not guarantee data privacy. Model outputs can still reveal sensitive information from the fine-tuning data. If the fine-tuned model is exposed (e.g., via an API), it remains vulnerable to information extraction and inference attacks.

Differential privacy (DP) introduces a rigorous mathematical framework that ensures the privacy of individuals whose data is used in the fine-tuning process. Specifically, DP adds carefully calibrated noise to the model updates, making it statistically improbable to determine whether any individual’s data was included in the fine-tuning dataset. Its quantifiable and robust privacy guarantee makes DP valuable for protecting sensitive information in LLM fine-tuning.

While DP provides privacy guarantees for both Open and Closed LLMs, it does not address the issue of trust in third-party providers for Closed LLMs. For these models, data curators must rely on the provider to implement safeguards and handle sensitive data responsibly.

Private adaptation methods for Closed LLMs 

We can rule out fine-tuning services offered by LLM providers (e.g., OpenAI and Amazon), as this entails sharing private data with a third party. Closed LLMs are accessible only via APIs. Thus, we cannot access and adapt the model’s weights directly.

Instead, private adaptation methods for Closed LLMs rely on privacy-preserving discrete prompts or private in-context learning (ICL). These approaches work by carefully crafting input prompts or selecting relevant examples to guide the model’s behavior, all while ensuring that sensitive information in the prompts or examples is protected from potential leakage or inference attacks.

All methods we evaluate in our study follow the PATE (Private Aggregation of Teacher Ensembles) framework. At a high level, PATE achieves data privacy by splitting the private dataset into non-overlapping partitions. Then, each partition is used to train a so-called teacher model. These teacher models are joined into an ensemble model by combining their outputs while adding noise, which preserves privacy.

This ensemble is then used to train a so-called student model in the following way: The ensemble makes predictions for samples from an unlabeled public dataset. The resulting (sample, ensemble prediction) pairs constitute the training data for the student model. Thus, the student learns to make the same predictions as the teacher ensemble but never sees sensitive data samples. The student is what’s released as the final model.
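The following sketch shows the core of PATE’s noisy aggregation step: teacher votes are counted, Laplace noise is added, and the noisy argmax becomes the label the student trains on. The noise scale is illustrative and not calibrated to a specific privacy budget.

```python
# Minimal sketch of PATE's noisy label aggregation: each teacher votes for
# a class, Laplace noise is added to the vote counts, and the noisy argmax
# becomes the label used to train the student. The noise scale is
# illustrative, not a calibrated DP implementation.
import numpy as np

rng = np.random.default_rng(0)


def noisy_aggregate(teacher_predictions, num_classes, noise_scale=1.0):
    """teacher_predictions: array of shape (num_teachers,) with class ids."""
    votes = np.bincount(teacher_predictions, minlength=num_classes).astype(float)
    votes += rng.laplace(loc=0.0, scale=noise_scale, size=num_classes)
    return int(np.argmax(votes))


# Ten teachers voting on one public sample with three classes:
teacher_preds = np.array([0, 0, 1, 0, 2, 0, 0, 1, 0, 0])
student_label = noisy_aggregate(teacher_preds, num_classes=3)
print(student_label)  # most likely 0
```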

Overview of the PATE framework. The sensitive dataset is divided into non-overlapping partitions, and a separate teacher model is trained on each partition. All teachers are aggregated noisily into an ensemble model, which is used to make predictions on a public dataset. The samples from the public dataset, together with the ensemble’s predictions, constitute the training data for the student model, which is the model that is eventually queried by users. | Source

The private adaptation methods for Closed LLMs we analyze in our study build on this general framework. They differ in how the teachers are utilized and how their responses are aggregated:

  • Differentially Private In-context Learning (DP-ICL): All teachers process the same prompt, and the ensemble’s response is the noisy consensus.
  • PromptPATE: The teacher ensemble assigns labels to public unlabeled data via private voting. These labeled public sequences are used to create new discrete student prompts, which are deployed with the LLM.
  • DP-FewShotGen: The teacher ensemble generates private synthetic few-shot samples that are used as samples for in-context learning.
  • DP-OPT: A local LLM generates privacy-preserving prompts and instructions from the private dataset. These are used for in-context learning for the third-party Closed LLM.

In our paper, we compare the privacy protection and performance of these four state-of-the-art methods for private adaptation of Closed LLMs. When applying them to the popular Closed LLMs Claude, GPT-3 Babbage, GPT-3 Davinci, and GPT-4 Turbo, we observe that compared to private adaptation of Open LLMs, these methods offer lower performance at a higher cost on various downstream tasks, including dialog summarization, classification, and generation. Further, all methods except DP-OPT leak training data to the LLM provider.

Private adaptation methods for Open LLMs 

Unlike Closed LLMs, Open LLMs provide access to their parameters, enabling more flexible and parameter-centric private adaptation methods. These methods typically follow the Differentially Private Stochastic Gradient Descent (DPSGD) paradigm to ensure privacy. In DPSGD, the influence of each private data point is constrained during training through gradient clipping and the addition of calibrated noise. This approach guarantees that the model does not memorize or leak sensitive information.
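To make the DPSGD idea concrete, here is a minimal sketch of a single training step with per-example gradient clipping and Gaussian noise. It is illustrative only: the noise is not calibrated to a target (ε, δ), and in practice libraries such as Opacus handle this accounting.

```python
# Minimal sketch of one DP-SGD step: clip each example's gradient to a
# maximum norm, sum the clipped gradients, and add Gaussian noise before
# the parameter update. Privacy accounting is omitted on purpose.
import torch


def dp_sgd_step(model, loss_fn, batch_x, batch_y, lr=0.1,
                max_grad_norm=1.0, noise_multiplier=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):  # per-example gradients
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip_coef = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)
        for s, g in zip(summed, grads):
            s += g * clip_coef  # clipped contribution of this example

    batch_size = len(batch_x)
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.normal(
                mean=0.0, std=noise_multiplier * max_grad_norm, size=s.shape
            )
            p -= lr * (s + noise) / batch_size  # noisy average gradient


model = torch.nn.Linear(4, 2)
xb, yb = torch.randn(8, 4), torch.randint(0, 2, (8,))
dp_sgd_step(model, torch.nn.functional.cross_entropy, xb, yb)
```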

In our study, we explore three primary methods for private adaptation of Open LLMs: 

  1. Prompt-based adaptation (PromptDPSGD) introduces a small number of additional parameters (<1% of the model’s total parameters) in the input space through soft prompts or prefix-tuning and adapts Differentially Private Stochastic Gradient Descent (DPSGD) to preserve privacy.
  2. Parameter-efficient fine-tuning, such as LoRA, only updates a relatively small number of parameters (<10% of the model’s total parameters) within the model’s architecture to enable efficient updates. PrivateLoRA extends this approach with DP guarantees by building on the DPSGD algorithm.
  3. Full fine-tuning adaptations (DP-FineTune) involve fine-tuning the entire model or a subset of its layers for comprehensive adaptation while adhering to differential privacy principles.

Applying these methods to Vicuna, Llama-3, OpenLLaMa, BART, RoBERTa, and the Pythia suite of models, we find that private adaptation of Open LLMs improves performance on downstream tasks and reduces costs compared to their Closed counterparts. It also provides a critical privacy benefit by eliminating the risk of exposing private data and user queries to LLM providers.

Insightful results

Our analysis of private adaptation methods for both Closed and Open LLMs reveals several critical findings regarding data leakage, performance, and cost:

  1. Query data leakage: All private adaptation methods for Closed LLMs leak query data to the LLM provider. This means that sensitive information from user queries is exposed during the adaptation process, posing a significant privacy risk.
  2. Training data leakage: Only one method (DP-OPT) of the four methods of private adaptation of Closed LLMs successfully protects private training data from the LLM provider. However, this method requires a local LLM to effectively protect the privacy of the training data. The remaining private adaptation methods for Closed LLMs leak a large fraction of the training data to the LLM provider, undermining the privacy guarantees of the adaptation process.
  3. Performance: All adaptation methods for Closed LLMs achieve lower downstream task performance than privacy-preserving local adaptations on Open LLMs, even when the Open LLMs are significantly smaller than their Closed counterparts.
  4. Cost: The training and query costs for private adaptations of Closed LLMs are substantially higher due to the API access costs imposed by the LLM provider. In contrast, private adaptations for Open LLMs are more cost-effective. We estimated the costs assuming an A40 GPU with 48 GB of memory. In this scenario, privately adapting a Closed LLM to text classification tasks with DP-ICL costs about $140. In contrast, fine-tuning an Open LLM with PrivateLoRA on the same tasks costs about $30.

This leads to the conclusion that for a truly privacy-preserving adaptation of LLMs, one should use Open LLMs. By offering full control over the model and data, Open LLMs eliminate the risks associated with third-party providers and enable robust privacy-preserving techniques. As a result, Open LLMs address the limitations of Closed LLMs and enable efficient and customizable adaptations tailored to sensitive datasets.


Mixture of Experts LLMs: Key Concepts Explained


Mixture of Experts (MoE) is a type of neural network architecture that employs sub-networks (experts) to process specific input parts.

Only a subset of experts is activated per input, enabling models to scale efficiently. MoE models can leverage expert parallelism by distributing experts across multiple devices, enabling large-scale deployments while maintaining efficient inference.

MoE uses gating and load balancing mechanisms to dynamically route inputs to the most relevant experts, ensuring targeted and evenly distributed computation. Parallelizing the experts, along with the data, is key to having an optimized training pipeline.

MoEs train faster than dense LLMs and achieve better or comparable performance on many benchmarks, especially in multi-domain tasks. Challenges include load balancing, distributed training complexity, and tuning for stability and efficiency.

Scaling LLMs comes at a tremendous computational cost. Bigger models enable more powerful capabilities but require expensive hardware and infrastructure, also resulting in higher latency. So far, we’ve mainly achieved performance gains by making models larger, but this trajectory is not sustainable due to escalating costs, increasing energy consumption, and diminishing returns in performance improvement.

When considering the enormous amount of data and the wide variety of domains in which huge LLMs are trained, it’s natural to ask: instead of using the entire LLM’s capacity, could we pick and choose only the portion of the LLM that is relevant to our particular input? This is the key idea behind Mixture of Experts LLMs.

Mixture of Experts (MoE) is a type of neural network architecture in which parts of the network are divided into specialized sub-networks (experts), each optimized for a specific domain of the input space. During inference, only a part of the model is activated depending on the given input, significantly reducing the computational cost. Further, these experts can be distributed across multiple devices, allowing for parallel processing and efficient large-scale distributed setups.

On an abstract, conceptual level, we can imagine MoE experts specialized in processing specific input types. For example, we might have separate experts for different language translations or different experts for text generation, summarization, solving analytical problems, or writing code. These sub-networks have separate parameters but are part of the single model, sharing blocks and layers at different levels.

In this article, we explore the core concepts of MoE, including architectural blocks, gating mechanisms, and load balancing. We’ll also discuss the nuances of training MoEs and analyze why they are faster to train and yield superior performance in multi-domain tasks. Finally, we address key challenges of implementing MoEs, including distributed training complexity and maintaining stability.

Bridging LLM capacity and scalability with MoE layers

Since the introduction of Transformer-based models, LLM capabilities have continuously expanded through advancements in architecture, training methods, and hardware innovation. Scaling up LLMs has been shown to improve performance. Accordingly, we’ve seen rapid growth in the scale of the training data, model sizes, and infrastructure supporting training and inference.

Pre-trained LLMs have reached sizes of billions and trillions of parameters. Training these models takes extremely long and is expensive, and their inference costs scale proportionally with their size.

In a conventional LLM, all parameters of the trained model are used during inference. The table below gives an overview of the size of several impactful LLMs. It presents the total parameters of each model and the number of parameters activated during inference:

The last five models (highlighted) exhibit a significant difference between the total number of parameters and the number of parameters active during inference. The Switch Transformer, Mixtral, GLaM, GShard, and DeepSeekMoE are Mixture of Experts LLMs (MoEs), which require only executing a portion of the model’s computational graph during inference.

MoE building blocks and architecture

The foundational idea behind the Mixture of Experts was introduced before the era of Deep Learning, back in the ’90s, with “Adaptive Mixtures of Local Experts” by Robert Jacobs, together with the “Godfather of AI” Geoffrey Hinton and colleagues. They introduced the idea of dividing the neural network into multiple specialized “experts” managed by a gating network.

With the Deep Learning boom, the MoE resurfaced. In 2017, Noam Shazeer and colleagues (including Geoffrey Hinton once again) proposed the Sparsely-Gated Mixture-of-Experts Layer for recurrent neural language models.

The Sparsely-Gated Mixture-of-Experts Layer consists of multiple experts (feed-forward networks) and a trainable gating network that selects the combination of experts to process each input. The gating mechanism enables conditional computation, directing processing to the parts of the network (experts) that are most suited to each part of the input text.

Such an MoE layer can be integrated into LLMs, replacing the feed-forward layer in the Transformer block. Its key components are the experts, the gating mechanism, and the load balancing.

Overview of the general architecture of a Transformer block with integrated MoE layer. The MoE layer has a gate (router) that activates selected experts based on the input. The aggregated experts’ outputs form the MoE layer’s output. | Source: Author

Experts

The fundamental idea of the MoE approach is to introduce sparsity in the neural network layers. Instead of a dense layer where all parameters are used for every input (token), the MoE layer consists of several “expert” sub-layers. A gating mechanism determines which subset of “experts” is used for each input. The selective activation of sub-layers makes the MoE layer sparse, with only a part of the model parameters used for every input token.

How are experts integrated into LLMs?

In the Transformer architecture, MoE layers are integrated by modifying the feed-forward layers to include sub-layers. The exact implementation of this replacement varies, depending on the end goal and priorities: replacing all feed-forward layers with MoEs will maximize sparsity and reduce the computational cost, while replacing only a subset of feed-forward layers may help with training stability. For example, in the Switch Transformer, all feed-forward components are replaced with the MoE layer. In GShard and GLaM, only every other feed-forward layer is replaced.

The other LLM layers and parameters remain unchanged, and their parameters are shared between the experts. An analogy to this system with specialized and shared parameters could be the completion of a company project. The incoming project needs to be processed by the core team—they contribute to every project. However, at some stages of the project, they may require different specialized consultants, selectively brought in based on their expertise. Collectively, they form a system that shares the core team’s capacity and profits from expert consultants’ contributions.

Visualization of token-level expert selection in the MoE model (layers 0, 15, and 31). Each token is color-coded, indicating the first expert chosen by the gating mechanism. This illustrates how MoE assigns tokens to specific experts at different levels of architecture. It may not always be obvious why the same-colored tokens were directed to the same expert – the model processed high-dimensional representations of these tokens, and the logic and understanding of the token processing are not always similar to human logic. | Source

Gating mechanism

In the previous section, we have introduced the abstract concept of an “expert,” a specialized subset of the model’s parameters. These parameters are applied to the high-dimensional representation of the input at different levels of the LLM architecture. During training, these subsets become “skilled” at handling specific types of data. The gating mechanism plays a key role in this system.

What is the role of the gating mechanism in an MoE layer?

When an MoE LLM is trained, all the experts’ parameters are updated. The gating mechanism learns to distribute the input tokens to the most appropriate experts, and in turn, experts adapt to optimally process the types of input frequently routed their way. At inference, only relevant experts are activated based on the input. This enables a system with specialized parts to handle diverse types of inputs. In our company analogy, the gating mechanism is like a manager delegating tasks within the team.

The gating component is a trainable network within the MoE layer. The gating mechanism has several responsibilities:

  • Scoring the experts based on input. For N experts, N scores are calculated, corresponding to the experts’ relevance to the input token.
  • Selecting the experts to be activated. Based on the experts’ scoring, a subset of the experts is chosen to be activated. This is usually done by top-k selection.
  • Load balancing. Naive selection of top-k experts would lead to an imbalance in token distribution among experts. Some experts may become too specialized by only handling a minimal input range, while others would be overly generalized. During inference, routing most of the input to a small subset of experts would lead to overloaded and underutilized experts. Thus, the gating mechanism has to distribute the load evenly across all experts.

How is gating implemented in MoE LLMs?

Let’s consider an MoE layer consisting of n experts denoted as Expertᵢ(x) with i = 1, …, n that takes input x. Then, the MoE layer’s output is calculated as

$$\mathrm{MoE}(x) = \sum_{i=1}^{n} g_i(x)\,\mathrm{Expert}_i(x)$$

where gᵢ is the ith expert’s score, modeled based on the Softmax function. The gating layer’s output is used as the weights when averaging the experts’ outputs to compute the MoE layer’s final output. If gᵢ is 0, we can forgo computing Expertᵢ(x) entirely.

The general framework of an MoE gating mechanism looks like

$$g(x) = \mathrm{Softmax}\big(\mathrm{TopK}(x \cdot W_g,\, k)\big)$$

where Wg is the gating layer’s trainable weight matrix and TopK keeps only the k highest scores, masking out the rest before the Softmax.

Some specific examples are:

  • Top-1 gating: Each token is directed to a single expert when choosing only the top-scored expert. This is used in the Switch Transformer’s Switch layer. It is computationally efficient but requires careful load-balancing of the tokens for even distribution across experts.
  • Top-2 gating: Each token is sent to two experts. This approach is used in Mixtral.
  • Noisy top-k gating: Introduced with the Sparsely-Gated Mixture-of-Experts Layer, noise (standard normal) is added before applying Softmax to help with load-balancing. GShard uses a noisy top-2 strategy, adding more advanced load-balancing techniques.
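A minimal PyTorch sketch of such a (noisy) top-k gate is shown below: a linear router scores the experts, standard-normal noise is optionally added during training, only the top-k scores are kept, and the Softmax over them assigns zero weight to all other experts. It is a simplified illustration, not the exact formulation of any specific paper.

```python
# Minimal sketch of a (noisy) top-k gating layer: score experts with a
# linear router, optionally add standard-normal noise, keep the top-k
# scores, and softmax over them so inactive experts get weight 0.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKGate(nn.Module):
    def __init__(self, d_model, num_experts, k=2, noisy=True):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)
        self.k = k
        self.noisy = noisy

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.w_gate(x)                 # (tokens, num_experts)
        if self.noisy and self.training:
            scores = scores + torch.randn_like(scores)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        mask = torch.full_like(scores, float("-inf"))
        mask.scatter_(-1, topk_idx, topk_vals)  # keep only the top-k scores
        weights = F.softmax(mask, dim=-1)       # zeros for non-selected experts
        return weights, topk_idx


gate = TopKGate(d_model=16, num_experts=8, k=2)
tokens = torch.randn(4, 16)
weights, selected = gate(tokens)
print(weights.shape, selected.shape)  # torch.Size([4, 8]) torch.Size([4, 2])
```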

Load balancing

The straightforward gating via scoring and selecting top-k experts can result in an imbalance of token distribution among experts. Some experts may become overloaded, being assigned to process a bigger portion of tokens, while others are selected much less frequently and stay underutilized. This causes a “collapse” in routing, hurting the effectiveness of the MoE approach in two ways.

First, the frequently selected experts are continuously updated during training, thus performing better than experts who don’t receive enough data to train properly.

Second, load imbalance causes memory and computational performance problems. When the experts are distributed across different GPUs and/or machines, an imbalance in expert selection will translate into network, memory, and expert capacity bottlenecks. If one expert has to handle ten times as many tokens as another, this will increase the total processing time as subsequent computations are blocked until all experts finish processing their assigned load.

Strategies for improving load balancing in MoE LLMs include:

•  Adding random noise in the scoring process helps redistribute tokens among experts.

•  Adding an auxiliary load-balancing loss to the overall model loss. It penalizes uneven routing, encouraging the fraction of the input routed to each expert to stay close to uniform. For example, in the Switch Transformer, for N experts and T tokens in batch B, the loss would be

$$\mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where fᵢ is the fraction of tokens routed to expert i, Pᵢ is the fraction of the router probability allocated for expert i, and α is a scaling hyperparameter. (A short code sketch follows this list.)

•  DeepSeekMoE introduced an additional device-level loss to ensure that tokens are routed evenly across the underlying infrastructure hosting the experts. The experts are divided into g groups, with each group deployed to a single device.

•  Setting a maximum capacity for each expert. GShard and the Switch Transformer define a maximum number of tokens that can be processed by one expert. If the capacity is exceeded, the “overflown” tokens are directly passed to the next layer (skipping all experts) or rerouted to the next-best expert that has not yet reached capacity.
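Here is the short sketch of the Switch Transformer-style auxiliary loss referenced in the list above. It computes fᵢ (the fraction of tokens whose top-1 expert is i) and Pᵢ (the mean router probability for expert i) from raw router logits; the α value is an arbitrary example.

```python
# Minimal sketch of the Switch Transformer-style auxiliary load-balancing
# loss: f is the fraction of tokens whose top-1 choice is each expert,
# P is the mean router probability per expert, and the loss is
# alpha * N * sum(f * P).
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits, alpha=0.01):
    """router_logits: (num_tokens, num_experts) raw gating scores."""
    num_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)             # (T, N)
    top1 = probs.argmax(dim=-1)                          # (T,)
    f = torch.bincount(top1, minlength=num_experts).float() / router_logits.shape[0]
    P = probs.mean(dim=0)                                # (N,)
    return alpha * num_experts * torch.sum(f * P)


logits = torch.randn(32, 8)  # 32 tokens, 8 experts
print(load_balancing_loss(logits))
```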

Scalability and challenges in MoE LLMs

Selecting the number of experts

The number of experts is a key consideration when designing an MoE LLM. A larger number of experts increases a model’s capacity at the cost of increased infrastructure demands. Using too few experts has a detrimental effect on performance. If the tokens assigned to one expert are too diverse, the expert cannot specialize sufficiently.

The MoE LLMs’ scalability advantage is due to the conditional activation of experts. Thus, keeping the number of active experts k fixed but increasing the total number of experts n increases the model’s capacity (larger total number of parameters). Experiments conducted by the Switch Transformer’s developers underscore this. With a fixed number of active parameters, increasing the number of experts consistently led to improved task performance. Similar results were observed for MoE Transformers with GShard.

The Switch Transformers have 16 to 128 experts, GShard can scale up from 128 to 2048 experts, and Mixtral can operate with as few as 8. DeepSeekMoE takes a more advanced approach by dividing experts into fine-grained, smaller experts. While keeping the number of expert parameters constant, the number of combinations for possible expert selection is increased. For example, N=8 experts with hidden dimension h can be split into m=2 parts, giving N*m=16 experts of dimension h/m. The possible combinations of activated experts in top-k routing will change from 28 (2 out of 8) to 1820 (4 out of 16), which will increase flexibility and targeted knowledge distribution.

Routing tokens to different experts simultaneously may result in redundancy among experts. To address this problem, some approaches (like DeepSeek and DeepSpeed) can assign dedicated experts to act as a shared knowledge base. These experts are exempt from the gating mechanism, always receiving each input token.

Training and inference infrastructure

While MoE LLMs can, in principle, be operated on a single GPU, they can only be scaled efficiently in a distributed architecture combining data, model, and pipeline parallelism with expert parallelism. The MoE layers are sharded across devices (i.e., their experts are distributed evenly) while the rest of the model (like dense layers and attention blocks) is replicated to each device.

This requires high-bandwidth and low-latency communication for both forward and backward passes. For example, Google’s latest Gemini 1.5 was trained on multiple 4096-chip pods of Google’s TPUv4 accelerators distributed across multiple data centers.

Hyperparameter optimization

Introducing MoE layers adds additional hyperparameters that have to be carefully adjusted to stabilize training and optimize task performance. Key hyperparameters to consider include the overall number of experts, their size, the number of experts to select in the top-k selection, and any load balancing parameters. Optimization strategies for MoE LLMs are discussed comprehensively in the papers introducing the Switch Transformer, GShard, and GLaM.

LLM performance vs. MoE LLM performance

Before we wrap up, let’s take a closer look at how MoE LLMs compare to standard LLMs:

  • MoE models, unlike dense LLMs, activate only a portion of their parameters. Compared to dense LLMs, MoE LLMs with the same number of active parameters can achieve better task performance, having the benefit of a larger number of total trained parameters. For example, Mixtral 8x7B with 13 B active parameters (and 47 B total trained parameters) matches or outperforms LLaMA-2 with 13 B parameters on benchmarks like MMLU, HellaSwag, PIQA, and Math.
  • MoEs are faster, and thus less expensive, to train. The Switch Transformer authors showed, for example, that the sparse MoE outperforms the dense Transformer baseline with a considerable speedup in achieving the same performance. With a fixed number of FLOPs and training time, the Switch Transformer achieved the T5-Base’s performance level seven times faster and outperformed it with further training.

What’s next for MoE LLMs?

Mixture of Experts (MoE) is an approach to scaling LLMs to trillions of parameters with conditional computation while avoiding exploding computational costs. MoE allows for the separation of learnable experts within the model, integrated into the shared model skeleton, which helps the model more easily adapt to multi-task, multi-domain learning objectives. However, this comes at the cost of new infrastructure requirements and the need for careful tuning of additional hyperparameters.

The novel architectural solutions for building experts, managing their routing, and stable training are promising directions, with many more innovations to look forward to. Recent SoTA models like Google’s multi-modal Gemini 1.5 and IBM’s enterprise-focused Granite 3.0 are MoE models. DeepSeek R1, which has comparable performance to GPT-4o and o1, is an MoE architecture with 671B total parameters, of which 37B are activated per token, and 128 experts.

With the publication of open-source MoE LLMs such as DeepSeek R1 and V3, which rival or even surpass the performance of the aforementioned proprietary models, we are looking at exciting times for democratized and scalable LLMs.


Essential Review Papers on Physics-Informed Neural Networks: A Curated Guide for Practitioners


Staying on top of a fast-growing research field is never easy.

I face this challenge firsthand as a practitioner in Physics-Informed Neural Networks (PINNs). New papers, be they algorithmic advancements or cutting-edge applications, are published at an accelerating pace by both academia and industry. While it is exciting to see this rapid development, it inevitably raises a pressing question:

How can one stay informed without spending countless hours sifting through papers?

This is where I have found review papers to be exceptionally valuable. Good review papers are effective tools that distill essential insights and highlight important trends. They are big-time savers guiding us through the flood of information.

In this blog post, I would like to share with you my personal, curated list of must-read review papers on PINNs, which have been especially influential for my own understanding and use of PINNs. These papers cover key aspects of PINNs, including algorithmic developments, implementation best practices, and real-world applications.

In addition to what’s available in existing literature, I’ve included one of my own review papers, which provides a comprehensive analysis of common functional usage patterns of PINNs — a practical perspective often missing from academic reviews. This analysis is based on my review of around 200 arXiv papers on PINNs across various engineering domains in the past 3 years and can serve as an essential guide for practitioners looking to deploy these techniques to tackle real-world challenges.

For each review paper, I will explain why it deserves your attention by highlighting its unique perspective and indicating practical takeaways that you can benefit from immediately.

Whether you’re just getting started with PINNs, using them to tackle real-world problems, or exploring new research directions, I hope this collection makes navigating the busy field of PINN research easier for you.

Let’s cut through the complexity together and focus on what truly matters.

1️⃣ Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and what’s next

📄 Paper at a glance

  • Authors: S. Cuomo, V. Schiano di Cola, F. Giampaolo, G. Rozza, M. Raissi, and F. Piccialli
  • Year: 2022
  • Link: arXiv

🔍 What it covers

This review is structured around key themes in PINNs: the fundamental components that define their architecture, theoretical aspects of their learning process, and their application to various computing challenges in engineering. The paper also explores the available toolsets, emerging trends, and future directions.

Fig 1. Overview of the #1 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • One of the best introductions to PINN fundamentals. This paper takes a well-paced approach to explaining PINNs from the ground up. Section 2 systematically dissects the building blocks of a PINN, covering various underlying neural network architectures and their associated characteristics, how PDE constraints are incorporated, common training methodologies, and learning theory (convergence, error analysis, etc.) of PINNs.
  • Putting PINNs in historical context. Rather than simply presenting PINNs as a standalone solution, the paper traces their development from earlier work on using deep learning to solve differential equations. This historical framing is valuable because it helps demystify PINNs by showing that they are an evolution of previous ideas, and it makes it easier for practitioners to see what alternatives are available.
  • Equation-driven organization. Instead of just classifying PINN research by scientific domains (e.g., geoscience, material science, etc.) as many other reviews do, this paper categorizes PINNs based on the types of differential equations (e.g., diffusion problems, advection problems, etc.) they solve. This equation-first perspective encourages knowledge transfer as the same set of PDEs could be used across multiple scientific domains. In addition, it makes it easier for practitioners to see the strengths and weaknesses of PINNs when dealing with different types of differential equations.

🛠 Practical goodies

Beyond its theoretical insights, this review paper offers immediately useful resources for practitioners:

  • A complete implementation example. In section 3.4, this paper walks through a full PINN implementation to solve a 1D Nonlinear Schrödinger equation. It covers translating equations into PINN formulations, handling boundary and initial conditions, defining neural network architectures, choosing training strategies, selecting collocation points, and applying optimization methods. All implementation details are clearly documented for easy reproducibility. The paper compares PINN performance by varying different hyperparameters, which could offer immediately applicable insights for your own PINN experiments.
  • Available frameworks and software tools. Table 3 compiles a comprehensive list of major PINN toolkits, with detailed tool descriptions provided in section 4.3. The considered backends include not only TensorFlow and PyTorch but also Julia and JAX. This side-by-side comparison of different frameworks is especially useful for picking the right tool for your needs.

💡Who would benefit

  • This review paper benefits anyone new to PINNs and looking for a clear, structured introduction.
  • Engineers and developers looking for practical implementation guidance would find the realistic, hands-on demo, and the thorough comparison of existing PINN frameworks most interesting. Additionally, they can find relevant prior work on differential equations similar to their current problem, which offers insights they can leverage in their own problem-solving.
  • Researchers investigating theoretical aspects of PINN convergence, optimization, or efficiency can also greatly benefit from this paper.

2️⃣ From PINNs to PIKANs: Recent Advances in Physics-Informed Machine Learning

📄 Paper at a glance

  • Authors: J. D. Toscano, V. Oommen, A. J. Varghese, Z. Zou, N. A. Daryakenari, C. Wu, and G. E. Karniadakis
  • Year: 2024
  • Link: arXiv

🔍 What it covers

This paper provides one of the most up-to-date overviews of the latest advancements in PINNs. It emphasises enhancements in network design, feature expansion, optimization strategies, uncertainty quantification, and theoretical insights. The paper also surveys key applications across a range of domains.

Fig 2. Overview of the #2 review paper. (Image by author)

✨ What’s unique

This review paper stands out in the following ways:

  • A structured taxonomy of algorithmic developments. One of the freshest contributions of this paper is its taxonomy of algorithmic advancements. This new taxonomy scheme elegantly categorizes all the advancements into three core areas: (1) representation model, (2) handling governing equations, and (3) optimization process. This structure provides a clear framework for understanding both current developments and potential directions for future research. In addition, the illustrations used in the paper are top-notch and easily digestible.
Fig 3. The taxonomy of algorithmic developments in PINNs proposed by the #2 paper. (Image by author)
  • Spotlight on Physics-informed Kolmogorov–Arnold Networks (KAN). KAN, a new architecture based on the Kolmogorov–Arnold representation theorem, is currently a hot topic in deep learning. In the PINN community, some work has already been done to replace the multilayer perceptrons (MLP) representation with KANs to gain more expressiveness and training efficiency. The community lacks a comprehensive review of this new line of research. This review paper (section 3.1) fills exactly that gap.
  • Review on uncertainty quantification (UQ) in PINNs. UQ is essential for the reliable and trustworthy deployment of PINNs when tackling real-world engineering applications. In section 5, this paper provides a dedicated section on UQ, explaining the common sources of uncertainty in solving differential equations with PINNs and reviewing strategies for quantifying prediction confidence.
  • Theoretical advances in PINN training dynamics. In practice, training PINNs is non-trivial. Practitioners are often puzzled by why PINN training sometimes fails, or how PINNs should be trained optimally. In section 6.2, this paper provides one of the most detailed and up-to-date discussions on this aspect, covering the Neural Tangent Kernel (NTK) analysis of PINNs, information bottleneck theory, and multi-objective optimization challenges.

🛠 Practical goodies

Even though this review paper leans towards the theory-heavy side, two particularly valuable aspects stand out from a practical perspective:

  • A timeline of algorithmic advances in PINNs. In the table in Appendix A, this paper tracks the milestones of key advancements in PINNs, from the original PINN formulation to the most recent extensions to KANs. If you’re working on algorithmic improvements, this timeline gives you a clear view of what’s already been done. If you’re struggling with PINN training or accuracy, you can use this table to find existing methods that might solve your issue.
  • A broad overview of PINN applications across domains. Compared to all the other reviews, this paper strives to give the most comprehensive and updated coverage of PINN applications in not only the engineering domains but also other less-covered fields such as finance. Practitioners can easily find prior works conducted in their domains and draw inspiration.

💡Who would benefit

  • For practitioners working in safety-critical fields that need confidence intervals or reliability estimates on their PINN predictions, the discussion on UQ would be useful. If you are struggling with PINN training instability, slow convergence, or unexpected failures, the discussion on PINN training dynamics can help unpack the theoretical reasons behind these issues.
  • Researchers may find this paper especially interesting because of the new taxonomy, which allows them to see patterns and identify gaps and opportunities for novel contributions. In addition, the review of cutting-edge work on PI-KAN can also be inspiring.

3️⃣ Physics-Informed Neural Networks: An Application-Centric Guide

📄 Paper at a glance

  • Authors: S. Guo (this author)
  • Year: 2024
  • Link: Medium

🔍 What it covers

This article reviews how PINNs are used to tackle different types of engineering tasks. For each task category, the article discusses the problem statement, why PINNs are useful, and how PINNs can be implemented to address the problem, followed by a concrete use case published in the literature.

Fig 4. Overview of the #3 review paper. (Image by author)

✨ What’s unique

Unlike most reviews that categorize PINN applications either based on the type of differential equations solved or specific engineering domains, this article picks an angle that practitioners care about the most: the engineering tasks solved by PINNs. This work is based on reviewing papers on PINN case studies scattered in various engineering domains. The outcome is a list of distilled recurring functional usage patterns of PINNs:

  • Predictive modeling and simulations, where PINNs are leveraged for dynamical system forecasting, coupled system modeling, and surrogate modeling.
  • Optimization, where PINNs are commonly employed to achieve efficient design optimization, inverse design, model predictive control, and optimized sensor placement.
  • Data-driven insights, where PINNs are used to identify the unknown parameters or functional forms of the system, as well as to assimilate observational data to better estimate the system states.
  • Data-driven enhancement, where PINNs are used to reconstruct the field and enhance the resolution of the observational data.
  • Monitoring, diagnostic, and health assessment, where PINNs are leveraged to act as virtual sensors, anomaly detectors, health monitors, and predictive maintainers.

🛠 Practical goodies

This article places practitioners’ needs at the forefront. While most existing review papers merely answer the question, “Has PINN been used in my field?”, practitioners often seek more specific guidance: “Has PINN been used for the type of problem I’m trying to solve?”. This is precisely what this article tries to address.

By using the proposed five-category functional classification, practitioners can conveniently map their problems to these categories, see how others have solved them, and what worked and what did not. Instead of reinventing the wheel, practitioners can leverage established use cases and adapt proven solutions to their own problems.

💡Who would benefit

This review is best for practitioners who want to see how PINNs are actually being used in the real world. It can also be particularly valuable for cross-disciplinary innovation, as practitioners can learn from solutions developed in other fields.

4️⃣ An Expert’s Guide to Training Physics-informed Neural Networks

📄 Paper at a glance

  • Authors: S. Wang, S. Sankaran, H. Wang, P. Perdikaris
  • Year: 2023
  • Link: arXiv

🔍 What it covers

Even though it doesn’t market itself as a “standard” review, this paper goes all in on providing a comprehensive handbook for training PINNs. It presents a detailed set of best practices for training physics-informed neural networks (PINNs), addressing issues like spectral bias, unbalanced loss terms, and causality violations. It also introduces challenging benchmarks and extensive ablation studies to demonstrate these methods.

Fig 5. Overview of the #4 review paper. (Image by author)

✨ What’s unique

  • A unified “expert’s guide”. The main authors are active researchers in PINNs who have worked extensively on improving PINN training efficiency and model accuracy over the past several years. This paper is a distilled summary of the authors’ past work, synthesizing a broad range of recent PINN techniques (e.g., Fourier feature embeddings, adaptive loss weighting, causal training) into a cohesive training pipeline. It feels like having a mentor who tells you exactly what does and doesn’t work with PINNs.
  • A thorough hyperparameter tuning study. This paper conducts various experiments to show how different tweaks (e.g., different architectures, training schemes, etc.) play out on different PDE tasks. Their ablation studies show precisely which methods move the needle, and by how much.
  • PDE benchmarks. The paper compiles a suite of challenging PDE benchmarks and offers state-of-the-art results that PINNs can achieve.

🛠 Practical goodies

  • A problem-solution cheat sheet. This paper thoroughly documents various techniques addressing common PINN training pain points. Each technique is clearly presented using a structured format: the why (motivation), the how (how the approach addresses the problem), and the what (the implementation details). This makes it very easy for practitioners to identify the “cure” based on the “symptoms” observed in their PINN training process. What’s great is that the authors transparently discuss the potential pitfalls of each approach, allowing practitioners to make well-informed decisions and effective trade-offs.
  • Empirical insights. The paper shares valuable empirical insights obtained from extensive hyperparameter tuning experiments. It offers practical guidance on choosing suitable hyperparameters, e.g., network architectures and learning rate schedules, and demonstrates how these parameters interact with the advanced PINN training techniques proposed.
  • Ready-to-use library. The paper is accompanied by an optimized JAX library that practitioners can directly adopt or customize. The library supports multi-GPU environments and is ready for scaling to large-scale problems.

💡Who would benefit

  • Practitioners who are struggling with unstable or slow PINN training can find many practical strategies to fix common pathologies. They can also benefit from the straightforward templates (in JAX) to quickly adapt PINNs to their own PDE setups.
  • Researchers looking for challenging benchmark problems and aiming to benchmark new PINN ideas against well-documented baselines will find this paper especially handy.

5️⃣ Domain-Specific Review Papers

Beyond general reviews in PINNs, there are several nice review papers that focus on specific scientific and engineering domains. If you’re working in one of these fields, these reviews could provide a deeper dive into best practices and cutting-edge applications.

1. Heat Transfer Problems

Paper: Physics-Informed Neural Networks for Heat Transfer Problems

The paper provides an application-centric discussion on how PINNs can be used to tackle various thermal engineering problems, including inverse heat transfer, convection-dominated flows, and phase-change modeling. It highlights real-world challenges such as missing boundary conditions, sensor-driven inverse problems, and adaptive cooling system design. The industrial case study related to power electronics is particularly insightful for understanding the usage of PINNs in practice.

2. Power Systems

Paper: Applications of Physics-Informed Neural Networks in Power Systems — A Review

This paper offers a structured overview of how PINNs are applied to critical power grid challenges, including state/parameter estimation, dynamic analysis, power flow calculation, optimal power flow (OPF), anomaly detection, and model synthesis. For each type of application, the paper discusses the shortcomings of traditional power system solutions and explains why PINNs could be advantageous in addressing those shortcomings. This comparative summary is useful for understanding the motivation for adopting PINNs.

3. Fluid Mechanics

Paper: Physics-informed neural networks (PINNs) for fluid mechanics: A review

This paper explores three detailed case studies that demonstrate PINN applications in fluid dynamics: (1) 3D wake flow reconstruction using sparse 2D velocity data, (2) inverse problems in compressible flow (e.g., shock wave prediction with minimal boundary data), and (3) biomedical flow modeling, where PINNs infer thrombus material properties from phase-field data. The paper highlights how PINNs overcome limitations in traditional CFD, e.g., mesh dependency, expensive data assimilation, and difficulty handling ill-posed inverse problems.

4. Additive Manufacturing

Paper: A review on physics-informed machine learning for monitoring metal additive manufacturing process

This paper examines how PINNs address critical challenges specific to additive manufacturing process prediction or monitoring, including temperature field prediction, fluid dynamics modeling, fatigue life estimation, accelerated finite element simulations, and process characteristics prediction.

6️⃣ Conclusion

In this blog post, we went through a curated list of review papers on PINNs, covering fundamental theoretical insights, the latest algorithmic advancements, and practical application-oriented perspectives. For each paper, we highlighted unique contributions, key takeaways, and the audience that would benefit the most from these insights. I hope this curated collection can help you better navigate the evolving field of PINNs.

Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster


As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. Over time, various tools and technologies have emerged that make Hadoop more powerful and even more widely applicable. As a result, it goes beyond the pure HDFS & MapReduce platform and offers, for example, SQL as well as NoSQL queries or real-time streaming.

Hive/HiveQL

Apache Hive is a data warehousing system that allows for SQL-like queries on a Hadoop cluster. Traditional relational databases struggle with horizontal scalability and ACID properties in large datasets, which is where Hive shines.

Hive makes it possible to query data stored in HDFS using HiveQL (Hive Query Language), a SQL-like query language, without having to write complex MapReduce processes in Java. This means that business analysts and developers can create simple queries and build evaluations on top of Hadoop data architectures.

Hive was originally developed by Facebook for processing large volumes of structured and semi-structured data. It is particularly useful for batch analyses and can be operated with common business intelligence tools such as Tableau or Apache Superset.

Hive is built around two central components: the metastore and the execution engine. The metastore is the central repository that stores metadata such as table definitions, column names, and HDFS location information, which makes it possible for Hive to manage and organize large datasets. The execution engine, on the other hand, converts HiveQL queries into tasks that Hadoop can process. Depending on the desired performance and infrastructure, you can choose between different execution engines:

  • MapReduce: The classic, slower approach.
  • Tez: A faster alternative to MapReduce.
  • Spark: The fastest option, which runs queries in-memory for optimal performance.
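
Since Hive exposes HiveQL through its HiveServer2 interface, queries can also be submitted programmatically. The following is a minimal sketch using the PyHive library; the host, port, user, and the sales table are illustrative assumptions, not part of the original article:

from pyhive import hive  # assumes the PyHive package is installed

# Connect to a hypothetical HiveServer2 instance
conn = hive.Connection(host="localhost", port=10000, username="analyst", database="default")
cursor = conn.cursor()

# HiveQL is sent as a plain string; the configured execution engine runs the job
cursor.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id LIMIT 10")
for customer_id, total in cursor.fetchall():
    print(customer_id, total)

cursor.close()
conn.close()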

To use Hive in practice, various aspects should be considered to maximize performance. One of the most important is partitioning, so that data is not stored in one huge table but in partitions that can be scanned more quickly. For example, a company’s sales data can be partitioned by year and month:

CREATE TABLE sales_partitioned (
    customer_id STRING,
    amount DOUBLE
) PARTITIONED BY (year INT, month INT);

This means that only the specific partition that is required can be accessed during a query. When creating partitions, it makes sense to partition by columns that are frequently used in query filters. Buckets can also be used to ensure that joins run faster and that data is distributed evenly.

CREATE TABLE sales_bucketed (
    customer_id STRING,
    amount DOUBLE
) CLUSTERED BY (customer_id) INTO 10 BUCKETS;

In conclusion, Hive is a useful tool if structured queries on huge amounts of data are to be possible. It also offers an easy way to connect common BI tools, such as Tableau, with data in Hadoop. However, if the application requires many short-term read and write accesses, then Hive is not the right tool.

Pig

Apache Pig takes this one step further and enables the parallel processing of large amounts of data in Hadoop. Compared to Hive, it is not focused on data reporting but on the ETL process for semi-structured and unstructured data. For these data analyses, it is not necessary to use the complex MapReduce process in Java; instead, simple processes can be written in Pig’s own scripting language, Pig Latin.

In addition, Pig can handle various file formats, such as JSON or XML, and perform data transformations, such as merging, filtering, or grouping data sets. The general process then looks like this:

  • Loading the Information: The data can be pulled from different data sources, such as HDFS or HBase.
  • Transforming the data: The data is then modified depending on the application so that you can filter, aggregate, or join it.
  • Saving the results: Finally, the processed data can be stored in various data systems, such as HDFS, HBase, or even relational databases.

Apache Pig differs from Hive in many fundamental ways. The most important are:

| Attribute | Pig | Hive |
| --- | --- | --- |
| Language | Pig Latin (script-based) | HiveQL (similar to SQL) |
| Target group | Data engineers | Business analysts |
| Data structure | Semi-structured and unstructured data | Structured data |
| Applications | ETL processes, data preparation, data transformation | SQL-based analyses, reporting |
| Optimization | Parallel processing | Optimized, analytical queries |
| Engine options | MapReduce, Tez, Spark | Tez, Spark |

Apache Pig is a component of Hadoop that simplifies data processing through its script-based Pig Latin language and accelerates transformations by relying on parallel processing. It is particularly popular with data engineers who want to work on Hadoop without having to develop complex MapReduce programs in Java.

HBase

HBase is a key-value-based NoSQL database in Hadoop that stores data in a column-oriented manner. Compared to classic relational databases, it can be scaled horizontally, and new servers can be added to expand storage if required. The data model consists of tables whose rows are identified by a unique row key, which can be thought of as the primary key in a relational database.

Each table in turn is made up of columns that belong to a so-called column family and must be defined when the table is created. The key-value pairs are then stored in the cells of a column. By focusing on columns instead of rows, large amounts of data can be queried particularly efficiently.

This structure can also be seen when creating new data records. A unique row key is created first and the values for the individual columns can then be added to this.

// Create a row with row key "1001" and add values to two column families
Put put = new Put(Bytes.toBytes("1001"));
put.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Max"));
put.addColumn(Bytes.toBytes("Orders"), Bytes.toBytes("Product"), Bytes.toBytes("Laptop"));
table.put(put);

The column family is named first and then the key-value pair is defined. The structure is used in the query by first defining the data set via the row key and then calling up the required column and the keys it contains.

Get get = new Get(Bytes.toBytes("1001"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("Personal"), Bytes.toBytes("Name"));
System.out.println("Name: " + Bytes.toString(name));

The structure is based on a master-worker setup. The HMaster is the higher-level control unit for HBase and manages the underlying RegionServers. It is also responsible for load distribution by centrally monitoring system performance and distributing the so-called regions to the RegionServers. If a RegionServer fails, the HMaster ensures that its data is reassigned to other RegionServers so that operations can be maintained. To guard against a failure of the HMaster itself, the cluster can have additional HMasters on standby that take over if needed. During operation, however, a cluster only ever has one active HMaster.

The RegionServers are the working units of HBase, as they store and manage the table data in the cluster. They also answer read and write requests. For this purpose, each HBase table is divided into several subsets, the so-called regions, which are then managed by the RegionServers. A RegionServer can manage several regions, which helps balance the load between the nodes.

The RegionServers work directly with clients and therefore receive read and write requests directly. Incoming writes are first buffered in the so-called MemStore, and read requests are served from the MemStore whenever possible; if the required data is not available there, the permanent storage in HDFS is used. As soon as the MemStore reaches a certain size, its contents are flushed to an HFile in HDFS.

The storage backend for HBase is, therefore, HDFS, which is used as permanent storage. As already described, the HFiles are used for this, which can be distributed across several nodes. The advantage of this is horizontal scalability, as the data volumes can be distributed across different machines. In addition, different copies of the data are used to ensure reliability.

Finally, Apache Zookeeper serves as the superordinate instance of HBase and coordinates the distributed application. It monitors the HMaster and all RegionServers and automatically selects a new leader if an HMaster should fail. It also stores important metadata about the cluster and prevents conflicts if several clients want to access data at the same time. This enables the smooth operation of even larger clusters.

HBase is, therefore, a powerful NoSQL database that is suitable for Big Data applications. Thanks to its distributed architecture, HBase remains accessible even in the event of server failures and offers a combination of RAM-supported processing in the MemStore and the permanent storage of data in HDFS.

Spark

Apache Spark is a further development of MapReduce and is up to 100x faster thanks to the use of in-memory computing. It has since developed into a comprehensive platform for various workloads, such as batch processing, data streaming, and even machine learning, thanks to the addition of many components. It is also compatible with a wide variety of data sources, including HDFS, Hive, and HBase.

At the heart of the components is Spark Core, which offers basic functions for distributed processing:

  • Task management: Calculations can be distributed and monitored across multiple nodes.
  • Fault tolerance: In the event of errors in individual nodes, these can be automatically restored.
  • In-memory computing: Data is stored in the server’s RAM to ensure fast processing and availability.

The central data structures of Apache Spark are the so-called Resilient Distributed Datasets (RDDs). They enable distributed processing across different nodes and have the following properties:

  • Resilient (fault-tolerant): Data can be restored in the event of node failures. The RDDs do not store the data themselves, but only the sequence of transformations. If a node fails, Spark can simply re-execute those transformations to restore the RDD.
  • Distributed: The information is distributed across multiple nodes.
  • Immutable: Once created, RDDs cannot be changed, only recreated.
  • Lazily evaluated (delayed execution): The operations are only executed during an action and not during the definition.
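
To make these properties concrete, here is a minimal PySpark sketch (a hedged example that assumes a local Spark installation; the data is made up):

from pyspark.sql import SparkSession

# Start a local Spark session; on a Hadoop cluster this would typically run on YARN
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 1_000_001))        # distributed collection
squares = numbers.map(lambda x: x * x)               # transformation: only recorded, not executed
even_squares = squares.filter(lambda x: x % 2 == 0)  # still nothing has been computed

# The action below finally triggers the distributed computation; if a partition
# is lost, Spark re-executes the recorded transformations to rebuild it.
print(even_squares.take(5))

spark.stop()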

Apache Spark also consists of the following components:

  • Spark SQL provides an SQL engine for Spark and operates on Datasets and DataFrames. As it works in-memory, processing is particularly fast, making it suitable for all applications where efficiency and speed play an important role.
  • Spark Streaming offers the possibility of processing continuous data streams in real-time by converting them into mini-batches. It can be used, for example, to analyze social media posts or monitor IoT data. It also supports many common streaming data sources, such as Kafka or Flume.
  • With MLlib, Apache Spark offers an extensive library that contains a wide range of machine learning algorithms and can be applied directly to the stored data sets. This includes, for example, models for classification, regression, or even entire recommendation systems.
  • GraphX is a powerful tool for processing and analyzing graph data. This enables efficient analyses of relationships between data points and they can be calculated simultaneously in a distributed manner. There are also special PageRank algorithms for analyzing social networks.

Apache Spark is arguably one of the rising components of Hadoop, as it enables fast in-memory calculations that would previously have been unthinkable with MapReduce. Although Spark is not an exclusive component of Hadoop, as it can also use other file systems such as S3, the two systems are often used together in practice. Apache Spark is also enjoying increasing popularity due to its universal applicability and many functionalities.

Oozie

Apache Oozie is a workflow management and scheduling system that was developed specifically for Hadoop and plans the execution and automation of various Hadoop jobs, such as MapReduce, Spark, or Hive. The most important functionality here is that Oozie defines the dependencies between the jobs and executes them in a specific order. In addition, schedules or specific events can be defined for which the jobs are to be executed. If errors occur during execution, Oozie also has error-handling options and can restart the jobs.

A workflow is defined in XML so that the workflow engine can read it and start the jobs in the correct order. If a job fails, it can simply be repeated or other steps can be initiated. Oozie also has a database backend system, such as MySQL or PostgreSQL, which is used to store status information.

Presto

Presto offers another option for applying distributed SQL queries to large amounts of data. Compared to other Hadoop technologies, such as Hive, queries are processed in real-time, making it optimized for data warehouses running on large, distributed systems. Presto offers broad support for all relevant data sources and does not require a schema definition, so data can be queried directly from the sources. It has also been optimized to work on distributed systems and can, therefore, be used on petabyte-sized datasets.

Presto uses a so-called massively parallel processing (MPP) architecture, which enables particularly efficient processing in distributed systems. As soon as the user sends an SQL query via the Presto CLI or a BI front end, the coordinator analyzes the query and creates an executable query plan. The worker nodes then execute the queries and return their partial results to the coordinator, which combines them into a final result.
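
To make this flow concrete, here is a minimal sketch using the presto-python-client package (prestodb); the coordinator address, catalog, and table are purely illustrative assumptions:

import prestodb  # assumes the presto-python-client package is installed

# Connect to a hypothetical Presto coordinator
conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()

# The coordinator plans the query, the workers execute it in parallel,
# and the combined result is returned to the client.
cursor.execute("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id LIMIT 10")
for row in cursor.fetchall():
    print(row)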

Presto differs from the related systems in Hadoop as follows:

| Attribute | Presto | Hive | Spark SQL |
| --- | --- | --- | --- |
| Query speed | Milliseconds to seconds | Minutes (batch processing) | Seconds (in-memory) |
| Processing model | Real-time SQL queries | Batch processing | In-memory processing |
| Data sources | HDFS, S3, RDBMS, NoSQL, Kafka | HDFS, Hive tables | HDFS, Hive, RDBMS, streams |
| Use case | Interactive queries, BI tools | Slow big data queries | Machine learning, streaming, SQL queries |

This makes Presto the best choice for fast SQL queries on a distributed big data environment like Hadoop.

What are alternatives to Hadoop?

Hadoop was the leading technology for distributed data processing for a long time, especially in the early 2010s. However, several alternatives have since emerged that offer advantages in certain scenarios or are simply better suited to today’s applications.

Cloud-native alternatives to Hadoop

Many companies have moved away from hosting their servers and on-premise systems and are instead moving their big data workloads to the cloud. There, they can benefit significantly from automatic scaling, lower maintenance costs, and better performance. In addition, many cloud providers also offer solutions that are much easier to manage than Hadoop and can, therefore, also be operated by less trained personnel.

Amazon EMR (Elastic MapReduce)

Amazon EMR is a managed big data service from AWS that provides Hadoop, Spark, and other distributed computing frameworks so that these clusters no longer need to be hosted on-premises. This means companies no longer have to actively take care of cluster maintenance and administration. In addition to Hadoop, Amazon EMR supports many other open-source frameworks, such as Spark, Hive, Presto, and HBase. This broad support means that users can move their existing clusters to the cloud without any major problems.

For storage, Amazon EMR uses S3 as primary storage instead of HDFS. This not only makes storage cheaper, as no permanent cluster is required, but also improves availability, as data is stored redundantly across multiple availability zones. In addition, computing and storage can be scaled independently of each other instead of being coupled through the cluster, as is the case with Hadoop.

There is a specially optimized interface, the EMR File System (EMRFS), that allows direct access from Hadoop or Spark to S3. It also provides consistency features and metadata caching for better performance. If necessary, HDFS can still be used, for example, if local, temporary storage is required on the cluster nodes.

Another advantage of Amazon EMR over a classic Hadoop cluster is the ability to use dynamic auto-scaling to not only reduce costs but also improve performance. The cluster size and the available hardware are automatically adjusted to the CPU utilization or the job queue size so that costs are only incurred for the hardware that is needed.

So-called spot instances can then be added temporarily only when they are needed. In a company, for example, it makes sense to add them at night when the data from the production systems is to be loaded into the data warehouse. During the day, on the other hand, smaller clusters are operated and costs can be saved as a result.

Amazon EMR, therefore, offers several optimizations over running Hadoop locally. The optimized storage access to S3, the dynamic cluster scaling, which increases performance while optimizing costs, and the improved network communication between the nodes are particularly advantageous. Overall, data can be processed faster and with fewer resources than with classic Hadoop clusters running on a company’s own servers.

Google BigQuery

In the area of data warehousing, Google BigQuery offers a fully managed and serverless data warehouse that enables fast SQL queries on large amounts of data. It relies on columnar data storage and uses Google’s Dremel technology to handle massive amounts of data more efficiently. At the same time, it largely dispenses with cluster management and infrastructure maintenance.

In contrast to native Hadoop, BigQuery uses a columnar orientation and can, therefore, save immense amounts of storage space by using efficient compression methods. In addition, queries are accelerated as only the required columns need to be read rather than the entire row. This makes it possible to work much more efficiently, which is particularly noticeable with very large amounts of data.

BigQuery also uses Dremel technology, which is capable of executing SQL queries in parallel hierarchies and distributing the workload across different machines. Since such architectures often lose performance when the partial results have to be merged again, BigQuery uses tree aggregation to combine the partial results efficiently.
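
As a brief illustration of this serverless model, a query can be submitted directly from Python with the google-cloud-bigquery client; the project, dataset, and table names below are hypothetical, and configured Google Cloud credentials are assumed:

from google.cloud import bigquery  # assumes google-cloud-bigquery is installed

client = bigquery.Client()  # picks up credentials from the environment

# Only the referenced columns are scanned, thanks to the columnar storage format
query = """
    SELECT customer_id, SUM(amount) AS total_amount
    FROM `my-project.sales_dataset.sales`  -- hypothetical table
    GROUP BY customer_id
    ORDER BY total_amount DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.customer_id, row.total_amount)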

BigQuery is the better alternative to Hadoop, especially for applications that focus on SQL queries, such as data warehouses or business intelligence. For unstructured data, on the other hand, Hadoop may be the more suitable alternative, although the cluster architecture and the associated costs must be taken into account. Finally, BigQuery also offers a good connection to the various machine learning offerings from Google, such as Google AI or AutoML, which should be taken into account when making a selection.

Snowflake

If you don’t want to become dependent on the Google Cloud with BigQuery or are already pursuing a multi-cloud strategy, Snowflake can be a valid alternative for building a cloud-native data warehouse. It offers dynamic scalability by separating computing power and storage requirements so that they can be adjusted independently of each other.

Compared to BigQuery, Snowflake is cloud-agnostic and can therefore be operated on common platforms such as AWS, Azure, or even the Google Cloud. Although Snowflake also offers the option of scaling the hardware depending on requirements, it does not offer the same kind of serverless automatic scaling as BigQuery. On the other hand, multi-cluster warehouses can be created across which the workload is distributed, thereby maximizing performance.

On the cost side, the providers differ due to their architectures. Thanks to BigQuery’s complete management and automatic scaling, Google Cloud can bill per query and does not charge any direct costs for computing power or storage. With Snowflake, on the other hand, you are free to choose the cloud provider, and in most cases billing follows a pay-as-you-go model in which the provider charges for storage and computing power.

Overall, Snowflake offers a more flexible solution that can be hosted by various providers or even operated as a multi-cloud service. However, this requires greater knowledge of how to operate the system, as the resources have to be adapted independently. BigQuery, on the other hand, has a serverless model, which means that no infrastructure management is required.

Open-source alternatives for Hadoop

In addition to these complete, large cloud data platforms, several powerful open-source programs have been developed as alternatives to Hadoop that specifically address its weaknesses, such as real-time data processing, performance, and complexity of administration. As we have already seen, Apache Spark is very powerful and can be used as a replacement for a Hadoop cluster, so we will not cover it again here.

Apache Flink

Apache Flink is an open-source framework that was specially developed for distributed stream processing so that data can be processed continuously. In contrast to Hadoop’s batch processing or Spark’s micro-batches, Flink handles data in near real-time with very low latency. This makes Apache Flink an alternative for applications in which information is generated continuously and needs to be reacted to in real-time, such as sensor data from machines.

While Spark Streaming processes the data in so-called mini-batches and thus simulates streaming, Apache Flink offers real streaming with an event-driven model that can process data just milliseconds after it arrives. This can further minimize latency as there is no delay due to mini-batches or other waiting times. For these reasons, Flink is much better suited to high-frequency data sources, such as sensors or financial market transactions, where every second counts.
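
The following sketch with the PyFlink DataStream API illustrates this event-driven style; the bounded example collection stands in for a real sensor stream (e.g., a Kafka topic), and all names and thresholds are made up:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A bounded collection standing in for a continuous stream of sensor readings
readings = env.from_collection([("machine-1", 71.2), ("machine-1", 98.6), ("machine-2", 64.0)])

# Each event is processed as soon as it arrives instead of being collected into micro-batches
alerts = readings.filter(lambda r: r[1] > 90.0).map(lambda r: f"ALERT {r[0]}: temperature {r[1]}")

alerts.print()
env.execute("temperature-alerts")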

Another advantage of Apache Flink is its advanced stateful processing. In many real-time applications, the context of an event plays an important role, such as the previous purchases of a customer for a product recommendation, and must therefore be saved. With Flink, this storage already takes place in the application so that long-term and stateful calculations can be carried out efficiently.

This becomes particularly clear when analyzing machine data in real-time, where previous anomalies, such as too high a temperature or faulty parts, must also be included in the current report and prediction. With Hadoop or Spark, a separate database must first be accessed for this, which leads to additional latency. With Flink, on the other hand, the machine’s historical anomalies are already stored in the application so that they can be accessed directly.

In conclusion, Flink is the better alternative for highly dynamic and event-based data processing. Hadoop, on the other hand, is based on batch processes and therefore cannot analyze data in real-time, as there is always a latency to wait for a completed data block.

Modern data warehouses

For a long time, Hadoop was the standard solution for processing large volumes of data. However, companies today also rely on modern data warehouses as an alternative, as these offer an optimized environment for structured data and thus enable faster SQL queries. In addition, there are a variety of cloud-native architectures that also offer automatic scaling, thus reducing administrative effort and saving costs.

In this section, we focus on the most common data warehouse alternatives to Hadoop and explain why they may be a better choice compared to Hadoop.

Amazon Redshift

Amazon Redshift is a cloud-based data warehouse that was developed for structured analyses with SQL. It is optimized for processing large relational datasets and enables fast column-based queries.

One of the main differences to traditional data warehouses is that data is stored in columns instead of rows, meaning that only the relevant columns need to be loaded for a query, which significantly increases efficiency. Hadoop, and HDFS in particular, is optimized for semi-structured and unstructured data and does not natively support SQL queries. This makes Redshift ideal for OLAP analyses in which large amounts of data need to be aggregated and filtered.

Another feature that increases query speed is the use of a massively parallel processing (MPP) architecture, in which queries are distributed across several nodes and processed in parallel. This achieves very high parallelism and processing speed.

In addition, Amazon Redshift offers very good integration into Amazon’s existing systems and can be seamlessly integrated into the AWS environment without the need for open-source tools, as is the case with Hadoop. Frequently used tools are:

  • Amazon S3 offers direct access to large amounts of data in cloud storage.
  • AWS Glue can be used for ETL processes in which data is prepared and transformed.
  • Amazon QuickSight is a possible tool for the visualization and analysis of data.
  • Finally, machine learning applications can be implemented with the various AWS ML services.

Amazon Redshift is a real alternative to Hadoop, especially for relational queries, if you are looking for a managed and scalable data warehouse solution and already run on AWS or want to build your architecture on top of it. It can also offer a real advantage for high query speeds and large volumes of data thanks to its column-based storage and massively parallel processing.

Databricks (lakehouse platform)

Databricks is a cloud platform based on Apache Spark that has been specially optimized for data analysis, machine learning, and artificial intelligence. It extends the functionalities of Spark with an easy-to-understand user interface and optimized cluster management, and it also offers the so-called Delta Lake, which provides data consistency, scalability, and performance advantages compared to Hadoop-based systems.

Databricks offers a fully managed environment that can be easily operated and automated using Spark clusters in the cloud. This eliminates the need for manual setup and configuration as with a Hadoop cluster. In addition, the use of Apache Spark is optimized so that batch and streaming processing can run faster and more efficiently. Finally, Databricks also includes automatic scaling, which is very valuable in the cloud environment as it can save costs and improve scalability.

Classic Hadoop platforms have the problem that they do not guarantee ACID properties, so the consistency of the data is not always ensured when it is distributed across different servers. With Databricks, this problem is solved with the help of the so-called Delta Lake:

  • ACID transactions: The Delta Lake ensures that all transactions fulfill the ACID guidelines, allowing even complex pipelines to be executed completely and consistently. This ensures data integrity even in big data applications.
  • Schema evolution: The data models can be updated dynamically so that existing workflows do not have to be adapted.
  • Optimized storage & queries: Delta Lake uses processes such as indexing, caching, or automatic compression to make queries many times faster compared to classic Hadoop or HDFS environments.
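
A minimal PySpark sketch of this idea is shown below. It assumes the open-source delta-spark package is available on the cluster (on Databricks itself, Delta support is built in); the paths and table contents are illustrative:

from pyspark.sql import SparkSession

# Configure a Spark session with Delta Lake support (not needed on Databricks)
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

sales = spark.createDataFrame([(1001, 99.90), (1002, 149.00)], ["customer_id", "amount"])

# Writing in the Delta format records every change in a transaction log,
# which is what provides the ACID guarantees described above.
sales.write.format("delta").mode("overwrite").save("/tmp/delta/sales")

# Readers always see a consistent snapshot of the table
spark.read.format("delta").load("/tmp/delta/sales").show()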

Finally, Databricks goes beyond the classic big data framework by also offering an integrated machine learning & AI platform. The most common machine learning platforms, such as TensorFlow, scikit-learn, or PyTorch, are supported so that the stored data can be processed directly. As a result, Databricks offers a simple end-to-end pipeline for machine learning applications. From data preparation to the finished model, everything can take place in Databricks and the required resources can be flexibly booked in the cloud.

This makes Databricks a valid alternative to Hadoop if a data lake with ACID transactions and schema flexibility is required. It also offers additional components, such as the end-to-end solution for machine learning applications. In addition, the cluster in the cloud is easier to operate and saves costs by automatically adapting the hardware to the requirements, and thanks to its Spark basis it also offers significantly more performance than a classic Hadoop cluster.


In this part, we explored the Hadoop ecosystem, highlighting key tools like Hive, Spark, and HBase, each designed to enhance Hadoop’s capabilities for various data processing tasks. From SQL-like queries with Hive to fast, in-memory processing with Spark, these components provide flexibility for big data applications. While Hadoop remains a powerful framework, alternatives such as cloud-native solutions and modern data warehouses are worth considering for different needs.

This series has introduced you to Hadoop’s architecture, components, and ecosystem, giving you the foundation to build scalable, customized big data solutions. As the field continues to evolve, you’ll be equipped to choose the right tools to meet the demands of your data-driven projects.

How to Build a RAG System Using LangChain, Ragas, and Neptune


LangChain provides composable building blocks to create LLM-powered applications, making it an ideal framework for building RAG systems. Developers can integrate components and APIs of different vendors into coherent applications.

Evaluating a RAG system’s performance is crucial to ensure high-quality responses and robustness. The Ragas framework offers a large number of RAG-specific metrics as well as capabilities for generating dedicated evaluation datasets.

neptune.ai makes it easy for RAG developers to track evaluation metrics and metadata, enabling them to analyze and compare different system configurations. The experiment tracker can handle large amounts of data, making it well-suited for quick iteration and extensive evaluations of LLM-based applications.

Imagine asking a chat assistant about LLMOps only to receive outdated advice or irrelevant best practices. While LLMs are powerful, they rely solely on their pre-trained knowledge and lack the ability to fetch current data.

This is where Retrieval-Augmented Generation (RAG) comes in. RAG combines the generative power of LLMs with external data retrieval, enabling the assistant to access and use real-time information. For example, instead of outdated answers, the chat assistant could pull insights from Neptune’s LLMOps article collection to deliver accurate and contextually relevant responses.

In this guide, we’ll show you how to build a RAG system using the LangChain framework, evaluate its performance using Ragas, and track your experiments with neptune.ai. Along the way, you’ll learn to create a baseline RAG system, refine it using Ragas metrics, and enhance your workflow with Neptune’s experiment tracking.

Part 1: Building a baseline RAG system with LangChain

In the first part of this guide, we’ll use LangChain to build a RAG system for the blog posts in the LLMOps category on Neptune’s blog.

Overview of a baseline RAG system. A user’s question is used as the query to retrieve relevant documents from a database. The documents returned by the search are added to the prompt that is passed to the LLM together with the user’s question. The LLM uses the information in the prompt to generate an answer. | Source

What is LangChain?

LangChain offers a collection of open-source building blocks, including memory management, data loaders for various sources, and integrations with vector databases—all the essential components of a RAG system.

LangChain stands out among the frameworks for building RAG systems for its composability and versatility. Developers can combine and connect these building blocks using a coherent Python API, allowing them to focus on creating LLM applications rather than dealing with the nitty-gritty of API specifications and data transformations.

Overview of the categories of building blocks provided by LangChain. The framework includes interfaces to models and vector stores, document loaders, and text processing utilities like output parsers and text splitters. Further, LangChain offers features for prompt engineering, like templates and example selectors. The framework also contains a collection of tools that can be called by LLM agents. | Source

Step 1: Setting up

We’ll begin by installing the necessary dependencies (I used Python 3.11.4 on Linux):

pip install -qU langchain-core==0.1.45 langchain-openai==0.0.6 langchain-chroma==0.1.4 ragas==0.2.8 neptune==1.13.0 pandas==2.2.3 datasets==3.2.0

For this example, we’ll use OpenAI’s models and configure the API key. To access OpenAI models, you’ll need to create an OpenAI account and generate an API key. Our usage in this blog should be well within the free-tier limits.

Once we have obtained our API key, we’ll set it as an environment variable so that LangChain’s OpenAI building blocks can access it:

import os
os.environ["OPENAI_API_KEY"] = "YOUR_KEY_HERE"

You can also use any of LangChain’s other embedding and chat models, including local models provided by Ollama. Thanks to the compositional structure of LangChain, all it takes is replacing OpenAIEmbeddings and ChatOpenAI in the code with the respective alternative building blocks.
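
For example, here is a hedged sketch of what that swap could look like with the Ollama integrations from langchain-community (assuming Ollama is running locally and the named models have been pulled):

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# Drop-in replacements for OpenAIEmbeddings and ChatOpenAI in the rest of this guide
embeddings = OllamaEmbeddings(model="nomic-embed-text")  # hypothetical local embedding model
llm = ChatOllama(model="llama3")                         # hypothetical local chat model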

Step 2: Load and parse the raw data

Source data for RAG systems is often unstructured documents. Before we can use it effectively, we’ll need to process and parse it into a structured format.

Fetch the source data

Since we’re working with a blog, we’ll use LangChain’s WebBaseLoader to load data from Neptune’s blog. WebBaseLoader reads raw webpage content, capturing text and structure, such as headings.

The web pages are loaded as LangChain documents, which include the page content as a string and metadata associated with that document, e.g., the source page’s URL.

In this example, we select 3 blog posts to create the chat assistant’s knowledge base:

import bs4
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=[
        "https://neptune.ai/blog/llm-hallucinations",
        "https://neptune.ai/blog/llmops",
        "https://neptune.ai/blog/llm-guardrails"
    ],
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(name=["p", "h2", "h3", "h4"])
    ),
)
docs = loader.load()

Split the data into smaller chunks

To meet the embedding model’s token limit and improve retrieval performance, we’ll split the long blog posts into smaller chunks.

The chunk size is a trade-off between specificity (capturing detailed information within each chunk) and efficiency (reducing the total number of resulting chunks). By overlapping chunks, we mitigate the loss of critical information that occurs when a self-contained sequence of the source text is split into two incoherent chunks.

Visualization of the chunks created from the article LLM Hallucinations 101. The text is split into four chunks highlighted in blue, lime green, dark orange, and dark yellow. The overlaps between chunks are marked in olive green. | Created with ChunkViz

For generic text, LangChain recommends the RecursiveCharacterTextSplitter. We set the chunk size to a maximum of 1,000 characters with an overlap of 200 characters. We also filter out unnecessary parts of the documents, such as the header, footer, and any promotional content:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

header_footer_keywords = ["peers about your research", "deepsense", "ReSpo", "Was the article useful?", "related articles", "All rights reserved"]

splits = []
for s in text_splitter.split_documents(docs):
    if not any(kw in s.page_content for kw in header_footer_keywords):
        splits.append(s)

len(splits)

Step 3: Set up the vector store

Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.

Choose a vector store

LangChain supports many vector stores. In this example, we’ll use Chroma, an open-source vector store specifically designed for LLM applications.

By default, Chroma stores the collection in memory; once the session ends, all the data (embeddings and indices) are lost. While this is fine for our small example, in production, you’ll want to persist the database to disk by passing the persist_directory keyword argument when initializing Chroma.
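
For reference, a persistent setup could look roughly like this (the directory path is an illustrative assumption):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Persist embeddings and indices to disk so they survive the end of the session
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",  # hypothetical path
)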

Specify which embedding model to use

Embedding models convert chunks into vectors. There are many embedding models to choose from. The Massive Text Embedding Benchmark (MTEB) leaderboard is a great resource for selecting one based on model size, embedding dimensions, and performance requirements.

The MTEB Leaderboard provides a standardized comparison of embedding models across diverse tasks and datasets, including retrieval, clustering, classification, and reranking. The leaderboard provides a clear comparison of model performance and makes selecting embedding models easier through filters and ranking.

For our example LLMOps RAG system, we’ll use OpenAIEmbeddings with its default model. (At the time of writing, this was text-embedding-ada-002.)

Create a retriever object from the vector store

A retriever performs semantic searches to find the most relevant pieces of information based on a user query. For this baseline example, we’ll configure the retriever to return only the top result, which will be used as context for the LLM to generate an answer.

Initializing the vector store for our RAG system and instantiating a retriever takes only two lines of code:

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
   documents=splits,
   embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

In the last line, we have specified through search_kwargs that the retriever only returns the most similar document (top-k retrieval with k = 1).

Step 4: Bring it all together

Now that we’ve set up a vector database with the source data and initialized the retriever to return the most relevant chunk given a query, we’ll combine it with an LLM to complete our baseline RAG chain.

Define a prompt template

We need to set a prompt to guide the LLM in responding. This prompt should tell the model to use the retrieved context to answer the query.

We’ll use a standard RAG prompt template that specifically asks the LLM to use the provided context (the retrieved chunk) to answer the user query concisely:

from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

Create the full RAG chain

We’ll use the create_stuff_documents_chain utility function to set up the generative part of our RAG chain. It combines an instantiated LLM and a prompt template with a {context} placeholder into a chain that takes a set of documents as its input, which are “stuffed” into the prompt before it is fed into the LLM. In our case, that’s OpenAI’s GPT-4o-mini.

from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)

Then, we can use the create_retrieval_chain utility function to finally instantiate our complete RAG chain: 

from langchain.chains import create_retrieval_chain

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

Get an output from the RAG chain

To see how our system works, we can run a first inference call. We’ll send a query to the chain that we know can be answered using the contents of one of the blog posts:

response = rag_chain.invoke({"input": "What are DOM-based attacks?"})
print(response["answer"])

The response is a dictionary that contains “input,” “context,” and “answer” keys:

{
  "input": 'What are DOM-based attacks?',
  'context': [Document(metadata={'source': 'https://neptune.ai/blog/llm-guardrails'}, page_content='By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.DOM-based attacksDOM-based attacks are an extension of the traditional prompt injection attacks. The key idea is to feed a harmful instruction into the system by hiding it within a website’s code.Consider a scenario where your program crawls websites and feeds the raw HTML to an LLM on a daily basis. The rendered page looks normal to you, with no obvious signs of anything wrong. Yet, an attacker can hide a malicious key phrase by matching its color to the background or adding it in parts of the HTML code that are not rendered, such as a style Tag.While invisible to human eyes, the LLM will')],
  "answer": "DOM-based attacks are a type of vulnerability where harmful instructions are embedded within a website's code, often hidden from view. Attackers can conceal malicious content by matching its color to the background or placing it in non-rendered sections of the HTML, like style tags. This allows the malicious code to be executed by a system, such as a language model, when it processes the website's HTML."}

We see that the retriever appropriately identified a snippet from the LLM Guardrails: Secure and Controllable Deployment article as the most relevant chunk.

Define a prediction function

Now that we have a fully functioning end-to-end RAG chain, we can create a convenience function that enables us to query our RAG chain. It takes a RAG chain and a query and returns the chain’s response. We’ll also implement the option to pass just the stuff documents chain and provide the list of context documents via an additional input parameter. This will come in handy when evaluating the different parts of our RAG system.

Here’s what this function looks like:

from langchain_core.runnables.base import Runnable
from langchain_core.documents import Document

def predict(chain: Runnable, query: str, context: list[Document] | None = None) -> dict:
    """
    Accepts a retrieval chain or a stuff documents chain. If the latter, context must be passed in.
    Return a response dict with keys "input", "context", and "answer"
    """
    inputs = {"input": query}
    if context:
        inputs.update({"context": context})

    response = chain.invoke(inputs)

    result = {
        response["input"]: {
            "context": [d.page_content for d in response['context']],
            "answer": response["answer"],
        }
    }
    return result
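
For example, querying the full retrieval chain with the question from earlier looks like this:

result = predict(rag_chain, "What are DOM-based attacks?")
print(result["What are DOM-based attacks?"]["answer"])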

Part 2: Evaluating a RAG system using Ragas and neptune.ai

Once a RAG system is built, it’s important to evaluate its performance and establish a baseline. The proper way to do this is by systematically testing it using a representative evaluation dataset. Since such a dataset is not available in our case yet, we’ll have to generate one.

To assess both the retrieval and generation aspects of the system, we’ll use Ragas as the evaluation framework and neptune.ai to track experiments as we iterate.

What is Ragas?

Ragas is an open-source toolkit for evaluating RAG applications. It offers both LLM-based and non-LLM-based metrics to assess the quality of retrieval and generated responses. Ragas works smoothly with LangChain, making it a great choice for evaluating our RAG system.

Step 1: Generate a RAG evaluation dataset

An evaluation set for RAG tasks is similar to a question-answering task dataset. The key difference is that each row includes not just the query and a reference answer but also reference contexts (documents that we expect to be retrieved to answer the query).

Thus, an example evaluation set entry looks like this:

| Query | Reference context | Reference answer |
| --- | --- | --- |
| How can users trick a chatbot to bypass restrictions? | [‘By prompting the application to pretend to be a chatbot that “can do anything” and is not bound by any restrictions, users were able to manipulate ChatGPT to provide responses to questions it would usually decline to answer.’] | Users trick chatbots to bypass restrictions by prompting the application to pretend to be a chatbot that ‘can do anything’ and is not bound by any restrictions, allowing it to provide responses to questions it would usually decline to answer. |

Ragas provides utilities to generate such a dataset from a list of reference documents using an LLM.

As the reference documents, we’ll use the same chunks that we fed into the Chroma vector store in the first part, which is precisely the knowledge base from which our RAG system is drawing.

To test the generative part of our RAG chain, we’ll need to generate example queries and reference answers using a different model. Otherwise, we’d be testing our system’s self-consistency. We’ll use the full-sized GPT-4o model, which should outperform the GPT-4o-mini in our RAG chain.

As in the first part, it is possible to use a different LLM. The LangchainLLMWrapper and LangChainEmbeddingsWrapper make any model available via LangChain accessible to Ragas.

What happens under the hood?

Ragas’ TestSetGenerator builds a knowledge graph in which each node represents a chunk. It extracts information like named entities from the chunks and uses this data to model the relationship between nodes. From the knowledge graph, so-called query synthesizers derive scenarios consisting of a set of nodes, the desired query length and style, and a user persona. This scenario is used to populate a prompt template instructing an LLM to generate a query and answer (example). For more details, refer to the Ragas Testset Generation documentation.

Creating an evaluation dataset with 50 rows for our RAG system should take about a minute. We’ll generate a mixture of abstract queries (“What is concept A?”) and specific queries (“How often does subscription plan B bill its users?”):

from ragas.llms import LangChainLLMWrapper
from ragas.embeddings import LangChainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.testset import TestsetGenerator
from ragas.testset.synthesizers import AbstractQuerySynthesizer, SpecificQuerySynthesizer

generator_llm = LangChainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangChainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

dataset = generator.generate_with_langchain_docs(
    splits,
    testset_size=50,
    query_distribution=[
        (AbstractQuerySynthesizer(llm=generator_llm), 0.1),
        (SpecificQuerySynthesizer(llm=generator_llm), 0.9),
    ],
)

Filtering unwanted data

We want to focus our evaluation on cases where the reference answer is helpful. In particular, we don’t want to include test samples with responses containing phrases like “the context is insufficient” or “the context does not contain.” Duplicate entries in the dataset would skew the evaluation, so they should also be omitted.

For filtering, we’ll use the ability to easily convert Ragas datasets into Pandas DataFrames or Hugging Face Datasets:


# Deduplicate the generated samples based on the query text
unique_indices = set(dataset.to_pandas().drop_duplicates(subset=["user_input"]).index)

# Find samples whose reference answer signals an unhelpful or missing context
not_helpful = set(dataset.to_pandas()[dataset.to_pandas()["reference"].str.contains("does not contain|does not provide|context does not|is insufficient|is incomplete", case=False, regex=True)].index)

unique_helpful_indices = unique_indices - not_helpful

# Keep only the unique, helpful samples as a Hugging Face Dataset
ds = dataset.to_hf_dataset().select(unique_helpful_indices)

This leaves us with unique samples that look like this:

User input: What role does reflection play in identifying and correcting hallucinations in LLM outputs?
Reference contexts: [‘After the responseCorrecting a hallucination after the LLM output has been generated is still beneficial, as it prevents the user from seeing the incorrect information. This approach can effectively transform correction into prevention by ensuring that the erroneous response never reaches the user. The process can be broken down into the following steps:This method is part of multi-step reasoning strategies, which are increasingly important in handling complex problems. These strategies, often referred to as “agents,” are gaining popularity. One well-known agent pattern is reflection. By identifying hallucinations early, you can address and correct them before they impact the user.’]
Reference answer: Reflection plays a role in identifying and correcting hallucinations in LLM outputs by allowing early identification and correction of errors before they impact the user.

User input: What are some examples of LLMs that utilize a reasoning strategy to improve their responses?
Reference contexts: [‘Post-training or alignmentIt is hypothesized that an LLM instructed not only to respond and follow instructions but also to take time to reason and reflect on a problem could largely mitigate the hallucination issue—either by providing the correct answer or by stating that it does not know how to answer.Furthermore, you can teach a model to use external tools during the reasoning process, like getting information from a search engine. There are a lot of different fine-tuning techniques being tested to achieve this. Some LLMs already working with this reasoning strategy are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.’]
Reference answer: Some examples of LLMs that utilize a reasoning strategy to improve their responses are Matt Shumer’s Reflection-LLama-3.1-70b and OpenAI’s O1 family models.

User input: What distnguishes ‘promt injecton’ frm ‘jailbraking’ in vulnerabilties n handling?
Reference contexts: [‘Although “prompt injection” and “jailbreaking” are often used interchangeably in the community, they refer to distinct vulnerabilities that must be handled with different methods.’]
Reference answer: ‘Prompt injection’ and ‘jailbreaking’ are distinct vulnerabilities that require different handling methods.

In the third sample, the query contains a lot of typos. This is an example of the “MISSPELLED” query style.

💡 You can find a full example evaluation dataset on Hugging Face.

Step 2: Choose RAG evaluation metrics

As mentioned earlier, Ragas offers both LLM-based and non-LLM-based metrics for RAG system evaluation.

For this example, we’ll focus on LLM-based metrics. They are better suited than purely quantitative metrics for tasks requiring semantic and contextual understanding, while being significantly less resource-intensive than having humans evaluate each response. This makes them a reasonable tradeoff despite concerns about reproducibility.

From the wide range of metrics available in Ragas, we’ll select five:

  1. LLM Context Recall measures how many of the relevant documents are successfully retrieved. It uses the reference answer as a proxy for the reference context and determines whether all claims in the reference answer can be attributed to the retrieved context.
  2. Faithfulness measures the generated answer’s factual consistency with the given context by assessing how many claims in the generated answer can be found in the retrieved context.
  3. Factual Correctness evaluates the factual accuracy of the generated answer by assessing whether its claims are present in the reference answer (true and false positives) and whether any claims from the reference answer are missing (false negatives). From these counts, precision, recall, or F1 scores are calculated (see the short worked example after this list).
  4. Semantic Similarity measures the similarity between the reference answer and the generated answer.
  5. Noise Sensitivity measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
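
To make the precision, recall, and F1 computation behind Factual Correctness concrete, here is a small worked example with made-up claim counts (the numbers are purely illustrative):

# Hypothetical claim counts for one generated answer compared to its reference answer
tp = 4  # claims in the generated answer that also appear in the reference
fp = 1  # claims in the generated answer that do not appear in the reference
fn = 2  # claims in the reference that are missing from the generated answer

precision = tp / (tp + fp)                          # 4 / 5 = 0.80
recall = tp / (tp + fn)                             # 4 / 6 ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.73

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")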

Each of these metrics requires specifying an LLM or an embedding model for its calculations. We’ll again use GPT-4o for this purpose:

from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, SemanticSimilarity, NoiseSensitivity
from ragas import EvaluationDataset
from ragas import evaluate

evaluator_llm = LangChainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangChainEmbeddingsWrapper(OpenAIEmbeddings())

metrics = [
    LLMContextRecall(llm=evaluator_llm),
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    SemanticSimilarity(embeddings=evaluator_embeddings),
    NoiseSensitivity(llm=evaluator_llm),
]

Step 3: Evaluate the baseline RAG system’s performance

To evaluate our baseline RAG system, we’ll generate predictions and analyze them with the five selected metrics.

To speed up the process, we’ll use a concurrent approach to handle the I/O-bound predict calls from the RAG chain. This allows us to process multiple queries in parallel. Afterward, we can convert the results into a data frame for further inspection and manipulation. We’ll also store the results in a CSV file.

Here’s the complete performance evaluation code:

from concurrent.futures import ThreadPoolExecutor, as_completed
from datasets import Dataset

def concurrent_predict_retrieval_chain(chain: Runnable, dataset: Dataset):
    """Run the chain's I/O-bound predict calls for all queries in parallel."""
    results = {}
    threads = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        for query in dataset["user_input"]:
            threads.append(pool.submit(predict, chain, query))
        for task in as_completed(threads):
            results.update(task.result())
    return results

predictions = concurrent_predict_retrieval_chain(rag_chain, ds)

# Add the generated responses and retrieved contexts to the evaluation dataset
ds_k_1 = ds.map(lambda example: {
    "response": predictions[example["user_input"]]["answer"],
    "retrieved_contexts": predictions[example["user_input"]]["context"],
})

results = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k_1), metrics=metrics)


# Convert the results to a DataFrame for inspection and store them as a CSV file
df = results.to_pandas()
df.to_csv("eval_results.csv", index=False)

Part 3: Iteratively refining the RAG performance

With the evaluation setup in place, we can now start to improve our RAG system. Using the initial evaluation results as our baseline, we can systematically make changes to our RAG chain and assess whether they improve performance.

While we could make do with saving all evaluation results in cleanly named files and taking notes, we’d quickly be overwhelmed with the amount of information. To efficiently iterate and keep track of our progress, we’ll need a way to record, analyze, and compare our experiments.

What is neptune.ai?

Neptune is a machine-learning experiment tracker focused on collaboration and scalability. It provides a centralized platform for tracking, logging, and comparing metrics, artifacts, and configurations.

Neptune can track not only single metric values but also more complex metadata, such as text, arrays, and files. All metadata can be accessed and analyzed through a highly versatile user interface as well as programmatically. All this makes it a great tool for developing RAG systems and other LLM-based applications.
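
As a quick illustration of what this looks like in code (the field names and values below are arbitrary, and run is an initialized Neptune run object, as created in the next step):

run["config/k"] = 1                             # a single value
run["notes/summary"] = "Baseline RAG chain"     # free-form text
run["eval/faithfulness"].append(0.87)           # one point in a metric series
run["eval/results"].upload("eval_results.csv")  # a file artifact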

Step 1: Set up neptune.ai for experiment tracking

To get started with Neptune, sign up for a free account at app.neptune.ai and follow the steps to create a new project. Once that’s done, set the project name and API token as environment variables and initialize a run:

os.environ["NEPTUNE_PROJECT"] = "YOUR_PROJECT"
os.environ["NEPTUNE_API_TOKEN"] = "YOUR_API_TOKEN"

import neptune

run = neptune.init_run()

In Neptune, each run corresponds to one tracked experiment. Thus, every time we execute our evaluation script, we start a new experiment.

Logging Ragas metrics to neptune.ai

To make our lives easier, we’ll define a helper function that stores the Ragas evaluation results in the Neptune Run object, which represents the current experiment.

We’ll track the metrics for each sample in the evaluation dataset as well as overall performance metrics, which in our case are simply each metric averaged across the entire dataset:

import io

import neptune
import pandas as pd

def log_detailed_metrics(results_df: pd.DataFrame, run: neptune.Run, k: int):
    run["eval/k"].append(k)

    # Per-sample metric values
    for i, row in results_df.iterrows():
        for m in metrics:
            val = row[m.name]
            run[f"eval/q{i}/{m.name}"].append(val)

        # Question, generated response, and reference answer
        run[f"eval/q{i}/user_input"] = row["user_input"]
        run[f"eval/q{i}/response"].append(row["response"])
        run[f"eval/q{i}/reference"] = row["reference"]

        # Store retrieved and reference contexts side by side as a CSV file
        context_df = pd.DataFrame(
            zip(row["retrieved_contexts"], row["reference_contexts"]),
            columns=["retrieved", "reference"],
        )
        context_stream = io.StringIO()
        context_df.to_csv(context_stream, index=True, index_label="k")
        context_stream.seek(0)
        run[f"eval/q{i}/contexts/{k}"].upload(
            neptune.types.File.from_stream(context_stream, extension="csv")
        )
      
    
    # Overall performance: each metric averaged over the entire dataset
    overall_metrics = results_df[[m.name for m in metrics]].mean(axis=0).to_dict()
    for name, val in overall_metrics.items():
        run[f"eval/overall/{name}"].append(val)

log_detailed_metrics(df, run, k=1)


run.stop()

Once we run the evaluation and switch to Neptune’s Experiments tab, we see our currently active run and the first round of metrics that we’ve logged.

Step 2: Iterate over a retrieval parameter

In our baseline RAG chain, we only use the first retrieved document chunk in the LLM context. But what if there are relevant chunks ranked lower, perhaps in the top 3 or top 5? To explore this, we can experiment with using different values for k, the number of retrieved documents.

We’ll evaluate k = 1, k = 3, and k = 5 to see how the results change. For each configuration, we instantiate a new retrieval chain, run the prediction and evaluation functions, and log the results for comparison:

for k in [1, 3, 5]:
    retriever_k = vectorstore.as_retriever(search_kwargs={"k": k})
    rag_chain_k = create_retrieval_chain(retriever_k, question_answer_chain)
    predictions_k = concurrent_predict_retrieval_chain(rag_chain_k, ds)

    # Add the chain's responses and retrieved contexts to the evaluation dataset
    ds_k = ds.map(lambda example: {
        "response": predictions_k[example["user_input"]]["answer"],
        "retrieved_contexts": predictions_k[example["user_input"]]["context"],
    })

    results_k = evaluate(dataset=EvaluationDataset.from_hf_dataset(ds_k), metrics=metrics)
    df_k = results_k.to_pandas()

    # Save the detailed results to a separate file per k and upload it to Neptune
    df_k.to_csv(f"eval_results_k{k}.csv", index=False)
    run[f"eval/eval_data/{k}"].upload(f"eval_results_k{k}.csv")

    log_detailed_metrics(df_k, run, k)


run.stop()

Once the evaluation is complete (this should take between 5 and 10 minutes), the script displays “Shutting down background jobs” and then “Done!” when the process has finished.

Results overview

Let’s take a look at the results. Navigate to the Charts tab. The graphs all share a common x-axis labeled “step.” The evaluations for k = [1, 3, 5] are recorded as steps [0, 1, 2].


Comparison of metric values across three different values of k: the metric values averaged over all samples (top row) and the metric values for the first sample question (bottom row) indicate that the third step (k = 5) yielded the best outcome.

Looking at the overall metrics, we can observe that increasing k improved most of them. Factual correctness decreased by a small amount, and noise sensitivity, where a lower value is preferable, increased. This is expected, since increasing k leads to more irrelevant chunks being included in the context. However, as both context recall and answer semantic similarity went up, it seems to be a worthwhile tradeoff.

Step 3: Iterate further

From here on, there are numerous possibilities for further experimentation, for example:

  • Trying different chunking strategies, such as semantic chunking, which determines the breakpoints between chunks based on semantic similarity rather than strict token counts.
  • Leveraging hybrid search, which combines keyword search algorithms like BM25 with embedding-based semantic search (see the sketch after this list).
  • Trying other models that excel at question-answering tasks, like the Anthropic models, which are also available through LangChain.
  • Adding support components for dialogue systems, such as chat history.
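
As an example of the hybrid search idea, here is a minimal sketch using LangChain’s BM25Retriever and EnsembleRetriever. It assumes the langchain-community and rank_bm25 packages are installed and reuses the splits and vectorstore objects from earlier; the retriever weights are illustrative:

from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Keyword-based retriever over the same document chunks
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 5

# Embedding-based retriever backed by the existing vector store
semantic_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Combine both retrievers; the weights control each retriever's influence on the ranking
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, semantic_retriever],
    weights=[0.5, 0.5],
)

# The hybrid retriever plugs into the same chain setup used earlier
rag_chain_hybrid = create_retrieval_chain(hybrid_retriever, question_answer_chain)

From there, the evaluation and logging loop from this part can be reused unchanged to compare the hybrid setup against the purely semantic baseline.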

Looking ahead

In the three parts of this tutorial, we’ve used LangChain to build a RAG system based on OpenAI models and the Chroma vector database, evaluated it with Ragas, and analyzed our progress with Neptune. Along the way, we explored essential foundations of developing performant RAG systems, such as:

  • How to efficiently chunk, store, and retrieve data to ensure our RAG system consistently delivers relevant and accurate responses to user queries.
  • How to generate an evaluation dataset for our particular RAG chain and use RAG-specific metrics like faithfulness and factual correctness to evaluate it.
  • How Neptune makes it easy to track, visualize, and analyze RAG system performance, allowing us to take a systematic approach when iteratively improving our application.

As we saw at the end of part 3, we’ve barely scratched the surface when it comes to improving retrieval performance and response quality. Using the triplet of tools we introduced and our evaluation setup, any new technique or change applied to the RAG system can be assessed and compared with alternative configurations. This allows us to confidently assess whether a modification improves performance and detect unwanted side effects.
