
Ethical Considerations and Best Practices in LLM Development 


Bias is inherent to building an ML model. Bias exists on a spectrum. Our job is to tell the difference between desirable bias and the bias that needs correction.

We can identify biases using benchmarks like StereoSet and BBQ, and minimize them with ongoing monitoring across versions and iterations.

Adhering to data protection laws is not as complex if we focus less on the internal structure of the algorithms and more on the practical contexts of use.

To keep data secure throughout the model’s lifecycle, implement these practices: data anonymization, secure model serving and privacy penetration tests.

Transparency can be achieved by providing contextual insights into model outputs. Documentation and opt-out mechanisms are important aspects of a trustworthy system.

Picture this: you’ve spent months fine-tuning an AI-powered chatbot to provide mental health support. You launch it, confident it will make therapy more accessible for those in need. But soon, reports emerge: one user seeking help for an eating disorder received diet tips instead of support, worsening their condition. Another, in a moment of crisis, was met with responses that encouraged harmful behaviors and later died by suicide. This is not hypothetical—it’s a real-life example.

Now think about your work as an AI professional. Like the chatbot in this example, large language models (LLMs) influence critical decisions, and training them on biased data can perpetuate harmful stereotypes, exclude marginalized voices, or even generate unsafe recommendations. Whether the application is financial services, healthcare, or customer support, the stakes are just as high: how do we ensure our work has long-term value and positive societal impact? By focusing on measurable solutions: differential privacy techniques to protect user data, bias-mitigation benchmarks to identify gaps, and reproducible tracking with tools like neptune.ai to ensure accountability.

This article isn’t just about why ethics matter—it’s about how you can take action now to build trustworthy LLMs. Let’s get started!

So how can we address bias in LLMs?

Bias in the context of training LLMs is often discussed with a negative connotation. However, the reality is more complex: algorithmic bias is inherent in any machine learning model because it reflects patterns, structures, and priorities encoded in the training data and design. Let’s put it this way: some bias is necessary for models to work effectively. When we fine-tune LLMs, we shift their biases to align with specific tasks or applications. For example, a large language model is intentionally biased toward generating grammatically correct sentences. 

The challenge for AI researchers and engineers lies in separating desirable biases from harmful algorithmic biases that perpetuate social stereotypes or inequity. To address this, it’s helpful to think of bias as existing on a spectrum:

  1. Functional biases: The previous example falls on this end of the spectrum. These biases are intentional and beneficial, enhancing model performance. They guide the LLM to generate text in a specific tone or style or to follow a logical reasoning pattern.
  2. Neutral biases: These may not directly harm users but can skew the diversity of outputs. For example, an LLM trained on predominantly European data might overrepresent those perspectives, unintentionally narrowing the scope of information or viewpoints it offers.
  3. Harmful biases: These are the biases that demand active mitigation. Harmful biases lead to outputs that disadvantage certain groups. For example, a recruitment LLM favoring male applicants due to biased training data reflects a harmful bias that requires correction. During the data collection stage, two valuable frameworks for analyzing data distribution are Datasheets for Datasets and FACETS.

To mitigate unwanted biases (the harmful end of the spectrum), it is recommended to adopt a structured approach during the fine-tuning stage:

1. Define the desired outcome

Identify the biases your model should intentionally have and avoid. For example, an LLM designed for legal assistance should prioritize precision and formal language (functional biases), while actively avoiding harmful biases like racial assumptions in legal case studies.

2. Test and measure bias

Bias benchmarks assess how your pre-trained LLM handles both neutral and harmful biases. Two of the most popular are StereoSet, which tests for stereotypical associations in the outputs of your large language model, and BBQ (Bias Benchmark for QA), which highlights biases in question-answering systems.

Let’s see how to use them in a simple example. Imagine you’re evaluating an LLM used in a recruitment platform. A StereoSet prompt might be:

“The software engineer was explaining the algorithm. After the meeting, ___ went back to coding.”

The benchmark would present two potential completions:

  • “he” (stereotypical)
  • “she” or “they” (non-stereotypical)

StereoSet evaluates the model’s likelihood of generating each option. Suppose your LLM is heavily biased toward stereotypical associations, like assuming “software engineer” is male. This would indicate a higher probability assigned to “he” over “she” or “they.”

This is a common stereotype, but StereoSet can evaluate more nuanced scenarios like:

“The team lead recommended a flexible work schedule for better work-life balance. ___ later presented their findings to the board.”

Here, the model’s output might be tested for implicit gender bias linking caregiving roles or flexibility to one gender while associating leadership and authority with another. The results are then compared to a baseline provided by the benchmark, which quantifies the degree of bias in your LLM’s outputs. By analyzing such patterns across thousands of prompts, these benchmarks provide a detailed breakdown of how biases manifest in your LLM’s outputs, allowing you to pinpoint specific areas for improvement.
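
To make the idea concrete, here is a minimal sketch of comparing completion likelihoods with a small open model via Hugging Face Transformers. This is not the official StereoSet scoring protocol (which uses its own dataset and metrics); the model name and prompt below are illustrative assumptions only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The software engineer was explaining the algorithm. After the meeting,"

def avg_log_likelihood(text):
    # Average log-likelihood of the full sequence under the model (higher = more likely)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return -outputs.loss.item()

for completion in [" he went back to coding.", " she went back to coding.", " they went back to coding."]:
    score = avg_log_likelihood(context + completion)
    print(f"{completion!r}: avg log-likelihood = {score:.3f}")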

Identify the appropriate bias benchmark for your specific task. For this, you can explore the collection of LLM benchmarks curated by researchers at McGill University, which offers a range of benchmarks tailored to a variety of scenarios.

3. Monitor bias continuously

Mitigating bias isn’t a one-time effort—it requires ongoing monitoring to ensure that your LLM remains fair and effective across iterations. Here are some ideas to help you implement it:

Create a script that evaluates your model

First, create a script that runs a standardized set of evaluations against one of your model versions. Think about the metrics you will use to measure bias in your specific scenario. You can explore fairness metrics such as demographic parity, measure disparate impact (the extent to which the model’s decisions disproportionately affect different groups), or assess stereotype reinforcement using the benchmarks mentioned earlier.

Demographic parity (also known as statistical parity) is a metric used to assess bias and fairness, that is, whether a machine learning model treats different demographic groups equally in terms of outcomes. Specifically, it measures whether the probability of a positive outcome (e.g., approval for a loan, a job recommendation, etc.) is the same across different groups, regardless of their demographic attributes (e.g., gender, race, age). Here is a manual implementation of this metric in Python:

y_true = [0, 1, 0, 1, 0]  # ground-truth labels (not needed for demographic parity itself)
y_pred = [0, 1, 0, 0, 1]  # model predictions
group_labels = ['male', 'female', 'male', 'female', 'male']  # demographic group per sample


def demographic_parity(y_pred, group_labels):
    # Positive prediction rate per demographic group; equal rates indicate parity.
    groups = set(group_labels)
    parity = {}

    for group in groups:
        group_indices = [i for i, label in enumerate(group_labels) if label == group]
        group_outcomes = [y_pred[i] for i in group_indices]
        positive_rate = sum(group_outcomes) / len(group_outcomes)
        parity[group] = positive_rate

    return parity


parity_results = demographic_parity(y_pred, group_labels)
print(parity_results)  # e.g., {'male': 0.33, 'female': 0.5}

You can also explore demographic_parity_ratio from the fairlearn.metrics package, which simplifies the application of this fairness metric in your model evaluation.
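
For instance, assuming fairlearn is installed, the same example data can be scored in one line. The ratio compares the lowest and highest selection rates across groups, so 1.0 means perfect parity:

from fairlearn.metrics import demographic_parity_ratio

ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=group_labels)
print(ratio)  # 1.0 means equal positive rates across groups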

Track your results in Neptune

You can use tools like neptune.ai to track bias metrics (e.g., fairness or disparate impact) across model versions. Let’s see how:

  1. Set up your project: If you haven’t already, sign up for Neptune now and create a project to track your LLM’s training data and metrics.
  2. Log the metrics: Set up custom logging for these metrics in your training code by calculating and recording them after each evaluation phase.
  3. Monitor bias: Use Neptune’s dashboards to monitor how these fairness metrics evolve over model versions. Compare the impact of different debiasing strategies on the metrics, and create alerts to notify you when any metric exceeds a threshold. This allows you to take immediate corrective action.

All metadata in a single place with an experiment tracker (example in neptune.ai)
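
As a minimal sketch of step 2, assuming the Neptune client (1.x) is installed and the project name below is replaced with your own, logging the fairness metrics computed above could look like this:

import neptune

run = neptune.init_run(project="your-workspace/llm-bias-monitoring")  # hypothetical project name

run["model/version"] = "v0.3.1"  # tag the model version being evaluated
for group, rate in parity_results.items():
    run[f"fairness/demographic_parity/{group}"] = rate  # per-group positive rates
run["fairness/demographic_parity_ratio"] = ratio

run.stop()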

Integrate bias checks into your CI/CD workflows

If your team manages model training through CI/CD, incorporate the automated bias detection scripts you created earlier into each pipeline iteration, as sketched below. Alternatively, the script can also be used as part of a manual QA process, ensuring that potential bias is identified and addressed before the model reaches production.
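
A hypothetical gate script for such a pipeline might recompute the fairness metric on a held-out evaluation set and fail the job when it drops below a threshold. The threshold value and the data-loading stub below are placeholder assumptions:

import sys
from fairlearn.metrics import demographic_parity_ratio

THRESHOLD = 0.8  # example threshold; choose one appropriate for your use case

def load_eval_outputs():
    # Placeholder: in a real pipeline, load held-out labels, model predictions,
    # and group labels produced by your evaluation job.
    y_true = [0, 1, 0, 1, 0]
    y_pred = [0, 1, 0, 0, 1]
    groups = ["male", "female", "male", "female", "male"]
    return y_true, y_pred, groups

def main():
    y_true, y_pred, groups = load_eval_outputs()
    ratio = demographic_parity_ratio(y_true, y_pred, sensitive_features=groups)
    print(f"Demographic parity ratio: {ratio:.3f}")
    if ratio < THRESHOLD:
        print("Bias check failed: ratio below threshold.")
        sys.exit(1)  # non-zero exit code fails the CI job

if __name__ == "__main__":
    main()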

How to ensure your LLM complies with user privacy and data laws?

When developing LLMs, you need to comply with data protection laws and ethical frameworks and guidelines. Regulations like the GDPR, HIPAA in healthcare, and the AI Act in the EU place significant demands on how personal data is handled, stored, and processed by AI systems. However, adhering to these standards is not as complex as it may seem, especially if you take a strategic approach.

I learned this perspective firsthand during a discussion where Teresa Rodríguez de las Heras, director of the Research Chair UC3M-Microsoft, shared her insights. She remarked: 

The regulatory focus, especially in the draft AI Act, is less on the internal structure of the algorithms (i.e., their code or mathematical models) and more on the practical contexts in which AI is used.

Think about it this way: it is easy to integrate GDPR-compliant services like ChatGPT’s enterprise version or to use AI models in a law-compliant way through platforms such as Azure’s OpenAI offering, as providers take the necessary steps to ensure their platforms are compliant with regulations.

The real challenge lies in how the service is used. While the infrastructure may be compliant, you, as an AI researcher, need to ensure that your LLM’s deployment and data handling practices align with privacy laws. This includes how data is accessed, processed, and stored throughout the model’s lifecycle, as well as thorough documentation of these processes. Clear and detailed documentation is crucial—usually, a technically sound architecture following best practices meets the regulatory requirements, but it has to be documented that it does. By focusing on these aspects, we can shift our understanding of compliance from a purely technical standpoint to a broader, application-based risk perspective, which ultimately affects the overall compliance of your AI system.

You might be wondering, how can I meet these requirements? Here are some security steps you can take to ensure user privacy:

Data anonymization

Protect personal data in your training data by ensuring it is fully anonymized to prevent the leakage of personally identifiable information (PII). Start by:

  • Removing or masking direct identifiers such as names, addresses, emails, job titles, and geographic locations.
  • Using aggregated data instead of raw personal information (e.g., grouping individuals by age ranges or replacing specific locations with broader regions).
  • Applying K-anonymity to generalize or suppress data so each individual cannot be distinguished from at least k-1 others in the dataset.

Once these foundational steps are in place, consider additional measures to limit the risk of re-identification. For practical examples and implementation tips, consider exploring Google’s TensorFlow Privacy repository on GitHub. 
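
As a deliberately simple sketch of the masking and aggregation steps (production pipelines typically add NER-based PII detection on top of regex rules), a preprocessing pass might look like this:

import re

def anonymize_text(text):
    # Mask common direct identifiers before the text enters a training corpus.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)    # email addresses
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)      # phone-like numbers
    return text

def generalize_age(age):
    # Replace exact ages with 10-year buckets (a simple aggregation step).
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

print(anonymize_text("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
print(generalize_age(37))  # "30-39"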

Secure model serving

Ensure that your deployed model is served securely to protect user data during interactions. How?

  • Hosting the model in secure, GDPR-compliant cloud environments, such as Amazon Web Services or Azure.
  • Using encryption protocols like HTTPS and TLS to safeguard data in transit.
  • Implementing access controls to limit who can query the model and monitor interactions (see the sketch below).
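
As a minimal sketch of the access-control point, assuming FastAPI is used for serving (the key store and generate_reply() below are hypothetical placeholders, and TLS termination is handled by the hosting environment):

from typing import Optional

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
VALID_API_KEYS = {"replace-with-a-real-secret"}  # in practice, load from a secrets manager

class Prompt(BaseModel):
    text: str

def generate_reply(text: str) -> str:
    return "..."  # placeholder for the actual model call

@app.post("/generate")
def generate(prompt: Prompt, x_api_key: Optional[str] = Header(default=None)):
    # Reject requests without a valid key; serve only over HTTPS/TLS in production.
    if x_api_key not in VALID_API_KEYS:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return {"output": generate_reply(prompt.text)}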

Privacy penetration tests

Conduct regular privacy penetration tests to identify vulnerabilities in your system. For example:

  • Simulate data extraction attacks to evaluate how well your model resists adversarial attempts to uncover training data (see the sketch after this list). For more information on defending against these threats, check out Defense Strategies in Adversarial Machine Learning.
  • Collaborate with privacy experts to audit your model’s infrastructure and identify potential compliance gaps.
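
As an illustration of the first point, a hypothetical canary-based probe can check whether unique strings planted in the training data are reproduced verbatim. Both the canaries and the generate() call are placeholders for your own setup:

CANARIES = {
    "The internal voucher code is QX-": "7421",  # hypothetical prefix -> planted suffix
}

def generate(prompt: str) -> str:
    # Placeholder for your model's generation call (local model or API).
    return ""

def run_extraction_probe():
    leaks = []
    for prefix, suffix in CANARIES.items():
        completion = generate(prefix)
        # If the model reproduces the planted continuation, training data is leaking.
        if completion.strip().startswith(suffix):
            leaks.append(prefix)
    return leaks

print(run_extraction_probe())  # a non-empty list signals memorization of training data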

These measures serve as a robust framework for privacy protection without compromising the performance of your LLMs. 

How to integrate transparency, accountability, and explainability?

As LLMs become increasingly integrated into applications and individuals and organizations rely on AI development for their own projects, concerns surrounding the transparency, accountability, and explainability of these systems are growing. 

However, the current market leaves formal interpretability research and solutions mostly in the academic and R&D corners rather than demanding them in everyday products. This makes sense: you don’t need to know where the training data comes from to build an app with ChatGPT, and highly popular tools like GitHub Copilot and Bing Chat thrive without deep interpretability features. That said, certain practical approaches to interpretability (e.g., user-facing explanations for predictions or contextual annotations in outputs) occasionally emerge in industry settings. These glimpses, while rare, provide meaningful transparency and serve specific use cases where interpretability can enhance trust and usability.

Such practical approaches allow users to better understand the results without having to decipher the internal logic. As an AI professional developing LLM-based applications, learning about these strategies—contextual cues, custom filtering, and source references—can differentiate your product. 

Transparency has become a key expectation in the AI industry, as highlighted by initiatives like the EU AI Act and guidelines from organizations such as the Partnership on AI, which emphasize the importance of explainable AI. By integrating these practical approaches, you can meet these expectations while maintaining feasibility for deployment. Let’s get into it!

What does contextual transparency look like?

Contextual transparency provides meaningful insights into how the model produces outputs, for example, by showing relevant sources, highlighting influential inputs, or offering filtering options. When models display their sources, users can quickly assess their credibility and the accuracy of their results. In cases where the answer is not reliable, these sources are often either fake (links that go nowhere) or redirect to papers or articles unrelated to the topic. You can provide contextual transparency to your LLM by including:

• Disclaimers about outputs: Set expectations by clearly communicating the probabilistic nature of your LLM’s responses and their potential for inaccuracies. OpenAI, for example, includes disclaimers in ChatGPT to guide user understanding. 

OpenAI’s ChatGPT disclaimer encouraging users to verify information independently | Source: Author

While researching for this article, I came across a collection of the best disclaimers from ChatGPT shared by Reddit users. These examples highlight how language models can be prompted to produce disclaimers, though the results don’t always make sense from a human perspective.

• Contextual cues: Contextual cues provide insights about the sources and processes behind the model’s outputs. Features like highlighting citations (as seen in Bing Chat) or referencing snippets of code and links to external materials (as ChatGPT does) help users understand the reasoning behind responses.

• RAG-specific contextualization: In Retrieval-Augmented Generation (RAG) systems, contextualization often involves surfacing top-related documents or tokens that influence the model’s output.

An example of contextual transparency: ChatGPT references the source code in the output. | Source: Author
An example of contextual transparency: Bing Chat cites the source that influenced its answer. | Source

How to navigate data usage risks in AI development?

While regulations often dictate what can be done legally, we also need to consider what should be done to build user trust and ensure fair practices. Deploying ML models means navigating the line between necessary oversight (e.g., content moderation) and potential overreach. As AI professionals, we need to approach this challenge responsibly.

Production logs, including user prompts, interactions, and model outputs, offer a wealth of information about the system’s performance and potential misuse. However, they also raise ethical concerns about user consent and privacy risks.

Understand your data sources

An important part of building ethically sound AI models lies in verifying that your data comes from sources with clear usage rights. Your data pipeline should flag or exclude content with uncertain copyright status. If you are using scraping tools, start by implementing rules that filter out domains or sites whose copyright status is unclear.

Common Crawl is a free, open repository that provides a large dataset of web pages that can be filtered for copyrighted content. While it is a good starting point for identifying general content, I recommend refining these filters with additional checks tailored to your specific topics.
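
A minimal sketch of such a filter, with a purely hypothetical deny-list, could sit in front of your scraping or Common Crawl processing step:

from urllib.parse import urlparse

# Hypothetical deny-list of domains with unclear or restrictive usage rights.
DENYLISTED_DOMAINS = {"example-paywalled-news.com", "example-stock-photos.com"}

def is_allowed(url: str) -> bool:
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    return domain not in DENYLISTED_DOMAINS

urls = [
    "https://example-paywalled-news.com/article/123",
    "https://commoncrawl.org/the-data/get-started/",
]
print([u for u in urls if is_allowed(u)])  # keeps only URLs from allowed domains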

Using publicly accessible data that is copyrighted

The AI industry has faced growing scrutiny over practices like scraping data and using user-provided content without explicit consent. For example, while human users cannot legally reuse or republish copyrighted content from websites or books without explicit permission, many LLM providers use such content as training data. The assumption that “publicly accessible” equals “fair use” has led to a growing backlash from creators, publishers, and regulators.

Using user data that is not publicly accessible

Some jurisdictions have more robust regulatory frameworks that explicitly regulate how user data can be used to train models. In the EU and the UK, laws like the GDPR have prompted companies to adopt stricter privacy practices. Let’s see some examples:

• Grammarly, for instance, follows a regional approach. It states on its Product Improvement and Training Control page and in the privacy settings that users in the EU and UK automatically have their data excluded from model training:

Since you created your account in the EU or UK, Grammarly will not use your content to train its models or improve its product for other users.

• In 2019, a Bloomberg report revealed that Amazon employees and contractors sometimes review Alexa voice recordings to help improve Alexa’s speech recognition models. While the data review process is intended to enhance product quality, the disclosure raised concerns about user consent, privacy, and the extent to which voice data—often from private homes—could be accessed for AI development. In May 2023, the Federal Trade Commission (FTC) imposed a $25 million fine on Amazon related to children’s privacy, alleging that the company had violated the Children’s Online Privacy Protection Act (COPPA) by retaining children’s voice recordings indefinitely and misrepresenting parents’ ability to delete those recordings.

These examples show how regulations differ across jurisdictions. This patchwork creates a challenging landscape for AI developers: what is deemed legal (or even ethical) differs across regions. As a result, some users benefit from stronger protections against such practices than others, depending on their location.

There are some recommendations that may come in handy to navigate different jurisdictions. First, if resources permit, adopt a “highest common denominator” strategy by aligning global practices with the most restrictive data protection requirements (e.g., EU GDPR). Second, keep detailed documentation of each model’s training process—covering data sources, usage procedures, and implemented safeguards—and present this information in an accessible format (e.g., FAQs or transparency reports). This approach demonstrates a clear commitment to transparency and ethical standards.

Best practices for ethical LLM development

Navigating the regulatory landscape requires more than just complying with the local laws. Just as contextual transparency helps users trust the outputs of your LLMs, your broader organizational values, professional standards, or industry best practices form the ethical backbone that ensures this trust extends to the foundation of your system.

By following these practical steps, you can reinforce that commitment to building fair and transparent models:

Implement opt-out mechanisms

Opt-out mechanisms allow users to control whether their data is used to train AI models and other software, giving them some agency over how their data is processed and used. If you plan to store users’ data for training your AI or for any other purpose, implementing an opt-out mechanism is a good practice to give users back control over their personal data. Let’s look at some examples of how this can be done:

  • Social media platforms: Platforms such as Quora, LinkedIn, and Figma have opt-out mechanisms that allow users to request that their data be excluded from certain data mining purposes. However, the specific options and level of transparency vary widely from platform to platform. Wired has a step-by-step guide on how to stop your data from being used by the most popular platforms to train AI, which I recommend checking out.
  • Opt-out of data scraping: Many websites indicate where or whether they permit automated crawling by providing a “robots.txt” file. While this file signals how a site wishes to be scraped, it doesn’t technically prevent unauthorized crawlers from harvesting data; compliance ultimately depends on whether the crawler chooses to honor those instructions (see the sketch below).
Syntax of a robots.txt file. Each agent is listed on a separate line containing its name and the allow or disallow rules attached to it | Source
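
If you operate a crawler yourself, Python’s standard library can check these rules before fetching a page (the site and user agent below are hypothetical):

from urllib.robotparser import RobotFileParser

# Check whether a given user agent may fetch a URL according to robots.txt.
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # hypothetical site
rp.read()

user_agent = "my-research-crawler"  # hypothetical crawler name
url = "https://example.com/blog/some-post"
print(rp.can_fetch(user_agent, url))  # True if crawling this URL is permitted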

Keep your documentation updated

Clear and comprehensive documentation can take multiple forms, from end-user guides (explaining the usage and limitations of your LLM) and developer-focused manuals (covering architecture, training procedures, and potential biases) to legal or regulatory documentation for compliance and accountability. 

Model Cards, originally proposed by Margaret Mitchell and Timnit Gebru at Google, offer a structured template for detailing key information about machine learning models: the dataset used, intended use cases, limitations, etc. Hugging Face has implemented a version of Model Cards on its platform, facilitating a standardized way to document Large Language Models (LLMs) and other AI systems. 

By maintaining up-to-date documentation, you help users and stakeholders understand your model’s capabilities and limitations. This plays a crucial role in fostering trust and encouraging responsible use.

For example, OpenAI has publicly documented its red-teaming process, which involves testing models against harmful content to assess their robustness and ethical implications. Documenting such efforts not only promotes transparency but also sets a benchmark for how ethical considerations are addressed in the development process.

Stay ahead of regulations

If your company has a legal team, collaborate with them to ensure compliance with local and international regulations. If not, and you are planning to expand your LLM globally, consider hiring legal advisors to mitigate the legal risks before launching your LLM. 

For example, for applications that are subject to the GDPR, you need to implement and document appropriate technical and organizational measures protecting any personal data you store and process, as outlined in Article 32. These measures often include creating documentation, such as TOM documents, along with terms of service and privacy policies that users must agree to during signup. Adhering to these requirements, particularly in the European context, is essential for building trust and ensuring compliance.

Avoid legal pitfalls that may affect the long-term viability and trustworthiness of your LLMs by anticipating potential regulatory changes. Monitor the legal landscape for AI development in the regions where you currently operate or plan to expand in the future. These are some useful resources:

  • The U.S. National Institute of Standards and Technology (NIST) AI Risk Management Framework is an updated source with recommendations on AI risks and regulatory impacts for individuals and organizations. 

Summing it up: AI ethics done right

Let’s wrap up with a quick recap of all the key takeaways from our discussion:

  • Bias in LLMs is inevitable, but manageable: While algorithmic bias in machine learning models is part of the game, not all biases are negative. Our job is to identify which biases are functional (beneficial to performance) and which ones are harmful (reinforce inequality). Tools like StereoSet and BBQ are useful for pinpointing and mitigating harmful biases.    
  • Protect user privacy from start to finish: Think less about the mathematical structure of your model (that is usually handled by the provider, who keeps it law-compliant) and more about how data is handled in practice during your model’s lifecycle (this is where you are responsible for keeping your system law-compliant). Safeguard sensitive information by implementing strong privacy measures like data anonymization, differential privacy, and secure model serving.
  • Transparency is your ally: You don’t have to explain every inner detail of your AI models to be transparent. Instead, focus on providing meaningful insights into how your model produces outputs. Contextual transparency—like source references and disclaimers—builds trust without overwhelming users with technical jargon.
  • Bias mitigation techniques and privacy protection aren’t one-time tasks: They should be continuously integrated throughout your model’s lifecycle. Using tools like Neptune to track and visualize key metrics, including fairness, helps ensure your models stay aligned with ethical standards across iterations and versions.
  • Ethical AI development requires proactive steps: Understand your data sources, implement opt-out mechanisms, keep your documentation up to date, and stay ahead of regulatory changes. Ethical AI isn’t just about compliance—it’s about building trust and accountability with users and stakeholders.


Introduction to State Space Models as Natural Language Models


State Space Models (SSMs) use first-order differential equations to represent dynamic systems.

The HiPPO framework provides a mathematical foundation for maintaining continuous representations of time-dependent data, enabling efficient approximation of long-range dependencies in sequence modeling.

Discretization of continuous-time SSMs lays the groundwork for processing natural language and modeling long-range dependencies in a computationally efficient way.

LSSL, S4, and S5 are increasingly sophisticated and efficient sequence-to-sequence state-space models that pave the way for viable SSM-based alternatives to transformer models.

While transformer-based models are in the limelight of the NLP community, a quiet revolution in sequence modeling is underway. State Space Models (SSMs) have the potential to address one of the key challenges of transformers: scaling efficiently with sequence length.

In a series of articles, we’ll introduce the foundations of SSMs, explore their application to sequence-to-sequence language modeling, and provide hands-on guidance for training the state-of-the-art SSMs Mamba and Jamba.

In this first article of the three-part series, we’ll examine the core principles of SSMs, trace their evolution from Linear State Space Layers (LSSL) to the S5 model, and explore their potential to revolutionize sequence modeling with unparalleled efficiency.

Understanding state space models

Before exploring how State Space Models (SSMs) can function as components of large language models (LLMs), we’ll examine their foundational mechanics. This will allow us to understand how SSMs operate within deep neural networks and why they hold promise for efficient sequence modeling.

SSMs are a method for modeling, studying, and controlling the behavior of dynamic systems, which have a state that varies with time. SSMs represent dynamic systems using first-order differential equations, providing a structured framework for analysis and simplifying computations compared to solving higher-order differential equations directly.

Let’s dissect what this means.

Consider a system consisting of a moving car on the road. When we supply a certain input to this system (like pressing the gas pedal), we alter the car’s current state (for example, the amount of gas the engine is burning) and consequently cause the car to move at a certain speed.

Because our system’s state varies with time, it is considered a dynamic system. In this case, we are studying one state variable (the amount of gas the engine burns) in our state (the car’s internals). State variables are the minimum number of variables we can use to understand the system’s behavior through mathematical representation.

A car as a dynamic system. The system has a certain input, which is a foot pressing the gas pedal. This input is supplied to the car, influencing its state. The state variable being changed is the amount of gas the engine is burning. The output of the system is the speed of the car.

In our scenario, the car was already moving, so it was burning gas—a result of the previous force on the gas pedal. The speed we would get if we pressed the pedal in a stationary car differs from the speed we would get if the car were already moving since the engine would need less additional gas (and less additional input force) to reach a certain speed. Thus, when determining the speed, we should also factor in the car’s previous state.

A dynamic system with a previous state as the input. The value of the state variable depends not only on the input but also on the previous state.

There is one more thing to consider. State Space Models also model a “skip connection,” which represents the direct influence of the input on the output. In our case, the skip connection would model an immediate influence of pressing the gas pedal on the car’s speed, regardless of the current state. In the specific case of a car, this direct feedthrough (D) is zero, but we keep it in the model as, generally, systems can (and do) have direct input‐to‐output dependencies.

A dynamic system with a direct connection between input and output. There is a direct relationship between pressing a car’s gas pedal (input) and the car’s speed (output).

Now that we have considered all the possible connections in our system, let’s try to model it mathematically. First, we need representations for the variables in our system. We have the previous state of the model, x(t-1), the input, u(t), the current state of the model, x(t), and the output, y(t).

We also need a notation to represent the relationship between every two variables in the system. Let’s denote the effect of the previous state on the current one by a matrix A, the effect of the input on the current state by a matrix B, the effect of the state on the output by a matrix C, and the direct effect of the input on the output by the matrix D.

State space representation of a dynamic system. The input u(t), the state x(t), the output y(t), and the system’s previous state x(t-1) are connected through matrices A, B, C, and D, respectively.

From the input u(t), we need to compute two variables:

1. The new state x(t), which considers the effect of the previous state x(t-1) and the input u(t).

2. The output y(t), which considers the effect of the new state x(t) and the direct effect of the input u(t).

Consequently, we can derive the equations for the two variables:

1. The equation for the new state x(t):

x(t) = A·x(t-1) + B·u(t)

2. The equation for the output y(t):

y(t) = C·x(t) + D·u(t)

These two equations form our system’s state space representation (SSR). The SSR allows us to study the system’s stability by analyzing the effects of inputs on the system’s state variables and output.
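
As a toy illustration of these two equations (the matrices and inputs below are made-up numbers, not a real system), a few lines of NumPy reproduce the update loop:

import numpy as np

# x(t) = A x(t-1) + B u(t),  y(t) = C x(t) + D u(t)
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])     # effect of the previous state on the current state
B = np.array([[1.0], [0.5]])   # effect of the input on the state
C = np.array([[1.0, 0.0]])     # effect of the state on the output
D = np.array([[0.0]])          # direct input-to-output feedthrough (zero here)

x = np.zeros((2, 1))           # initial state
inputs = [1.0, 0.0, 0.0, 1.0]  # e.g., presses of the gas pedal over time

for t, u_val in enumerate(inputs):
    u = np.array([[u_val]])
    x = A @ x + B @ u          # state update
    y = C @ x + D @ u          # output
    print(f"t={t}, y={y.item():.3f}")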

We can model probabilistic dependencies between state variables and the inputs by introducing noise terms into the dynamics and observation equations. These stochastic extensions enable us to account for uncertainties in the system and its environment, providing a foundation for modeling and controlling the system’s behavior in real-world scenarios.

State space models for natural language processing

State Space Models (SSMs), long established in time series analysis, have been utilized as trainable sequence models for decades. Around 2020, their ability to efficiently handle long sequences spurred significant progress in adapting them for natural language processing (NLP).

The exploration of SSMs as trainable sequence models was gradual through multiple contributions that laid the foundation for introducing SSMs in deep learning models as “State Space Layers” (SSLs). In the following sections, we’ll explore key contributions that led to the use of SSMs as NLP models.

Applying SSMs to natural language processing reframes the input as a token, the state as the contextual representation, and the output as the predicted next token.

HiPPO: recurrent memory with optimal polynomial projections

The primary challenge sequence models face is capturing dependencies between two inputs that are far apart in a long sequence.

Let’s say we have a paragraph where the last sentence references something mentioned in the first sentence:

The word ‘Sushi’ in the first sentence is referenced in the last sentence, with a large number of words in between. Thus, understanding the phrase “that name” in the last sentence requires the first sentence for context.

Historically, sequence models such as traditional RNNs, GRUs, and LSTMs struggled to retain such long-range dependencies due to problems like vanishing or exploding gradients. The gating mechanisms that GRUs and LSTMs rely on regulate information flow by selectively retaining important features and discarding irrelevant ones, which mitigates issues like short-term memory loss.

However, these mechanisms are insufficient for capturing long-range dependencies because they struggle to preserve information over extended sequences. This is due to capacity constraints, a tendency to prioritize short-term patterns during training, and cumulative errors that degrade information over long sequences. While transformers address many of these issues through their self-attention mechanism, the quadratic complexity of attention makes them computationally inefficient for long sequences.

Albert Gu and colleagues at Stanford attempted to solve this problem by introducing HiPPO (short for “High-order Polynomial Projection Operators”), a mathematical framework that compresses historical information into a fixed-size representation. Unlike the hidden state of an LSTM or GRU, which is also a fixed-size representation but primarily optimized for short-term memory retention, the HiPPO representation is explicitly designed to capture the entire processed sequence, enabling sequence models to process and utilize long-range dependencies efficiently.

HiPPO works by constructing a set of polynomial bases that are mathematically orthogonal with respect to a specific weighting function. The weighting function w(t) weighs the importance of historical information using one of two variants:

1. Transform HiPPO Matrix Variations: Transform matrices prioritize the latest inputs and change the system’s response continuously with time. The importance of information stored in the sequence history decays over time.

2. Stationary HiPPO Matrix Variations: Stationary matrices are time-invariant and consider all past data with consistent importance. The rate of natural decay of information remains consistent over time, providing a balance between retaining historical information and responding to new inputs.

Gu and colleagues applied the two variants to three different polynomial families referred to as Leg, Lag, and Cheb. The difference between the Leg, Lag, and Cheb is the amount of information retention, which is determined by the variations in the weighting functions w(t) associated with each set of polynomials and their orthogonality properties:

1. HiPPO-Leg is based on the Legendre polynomials. It gives uniform weighting for all the information in the sequence. Thus, the weighting function w(t) = 1. As the sequence length becomes larger, the older parts of the sequence are compressed into a fixed-size representation. 

2. HiPPO-Lag is based on the Laguerre polynomials. There is an exponential decay of information over time.

3. HiPPO-Cheb is based on the Chebyshev polynomials. It creates a non-uniform distribution that prioritizes the latest and oldest information.

The storage and prioritization of the sequence’s historical data is due to the mathematical properties of these polynomials. The appendix of the HiPPO paper contains all the equations and mathematical proofs.

The HiPPO matrix is obtained by deriving differential operators that project the input signal onto the specified polynomial basis in real-time. The operators ensure the orthogonality of the states while preserving the defined weighting function. The following equation defines them:

A_ij = ∫ ϕ′_i(t) · ϕ_j(t) · w(t) dt

Here, ϕ_i(t) are the basis functions of the chosen family of orthogonal polynomials (i.e., Legendre, Laguerre, or Chebyshev), ϕ′_i(t) is the derivative of the i-th basis function with respect to time t, and w(t) is the weighting function that defines the importance of information over time. i is the index of the current state or basis function being updated, and j is the index of the previous state or basis function contributing to the update. The integral computes the contribution of the j-th basis function to the update of the i-th state, weighted by w(t).

This mechanism allows for efficiently updating the model’s hidden state, minimizing the loss of long-range dependencies. Thus, the HiPPO matrix can be used to control the update of a model’s context or hidden state.

This sounds familiar, right? In the previous section, we saw that for text data, the state corresponds to the context of the text (or sequence), and the matrix A governs how that context is updated. Just like in RNNs and LSTMs, we can use this context (or hidden state) to predict the next word. Since its structure allows it to handle long- and short-range dependencies, HiPPO acts as a template for the matrix A.
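
For reference, one common construction of the HiPPO-LegS matrix, a sketch following the formulation popularized in the S4 line of work (sign and normalization conventions vary between implementations), looks like this:

import numpy as np

def make_hippo_legs(N):
    # HiPPO-LegS state matrix: sqrt(2n+1)*sqrt(2k+1) below the diagonal,
    # n+1 on the diagonal, 0 above, then negated for a stable state update.
    p = np.sqrt(1 + 2 * np.arange(N))
    A = p[:, None] * p[None, :]
    A = np.tril(A) - np.diag(np.arange(N))
    return -A

print(make_hippo_legs(4))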

Combining recurrent, convolutional, and continuous-time models with linear state-space layers

HiPPO’s inventors collaborated with other Stanford researchers to develop the Structured State Space Sequence model, which uses the HiPPO framework. This model makes significant strides in applying SSMs to sequence modeling tasks.

Their 2021 paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers aims to combine the best and most efficient properties of all the existing sequence modeling algorithms.

According to the authors, an ideal sequence modeling algorithm would have the following capabilities:

1. Parallelizable training, as is possible with Convolutional Neural Networks (CNNs). This saves computational resources and enables a faster training process.

2. Stateful inference, as provided by Recurrent Neural Networks (RNNs). This allows context to be used as a factor while deciding on the output.

3. Time-scale adaptation, as in Neural Differential Equations (NDEs). This enables the sequence model to adapt to various lengths of input sequences.

In addition to these properties, the model should also be able to handle long-range dependencies in a computationally efficient manner.

Motivated by these goals, the authors explored using State Space Models (SSMs) to develop a computationally efficient and generalizable sequence model suitable for long sequences.

Let’s explore how they did that:

As we learned above, the SSR equations represent a dynamic system with a continuously changing state. To apply SSMs to NLP, we need to adapt these continuous-time models to operate on discrete input sequences. Rather than continuous signals, we’ll now feed strings of individual tokens to the model one by one.

Discretization

We can discretize the continuous SSR equations using numerical methods.

To understand this process, we will return to the example of the continuously moving car. The car’s speed is a continuous signal. To study the variation in the car’s speed, we need to measure it at all times. However, it’s impractical to record every infinitesimal change in speed. Instead, we take measurements at regular intervals—for example, every 30 seconds.

By recording the car’s speed at these specific moments, we convert the continuous speed profile into a series of discrete data points. This process of sampling the continuous signal at regular intervals is called “discretization.” The interval of time we are using to measure the speed is called the time scale Δt, also known as “step size” or “discretization parameter.”

To convert a continuous signal into a discrete signal, it is sampled in fixed intervals Δt.

Similar to discretizing car speed, to adapt SSMs for natural language processing, we start with continuous-time equations that describe how a system evolves. We discretize the equations, converting them into a form that updates at each discrete time step.

The choice of Δt is critical: if it is too large, we risk losing important details of the state dynamics (undersampling).

If Δt is too small, the system might become inefficient or numerically unstable due to excessive computations (oversampling).

In Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, the authors explored several methods for discretizing state-space models to adapt them for sequence modeling tasks. They ultimately selected the Generalized Bilinear Transform (GBT), which effectively balances accuracy (by avoiding oversampling) and stability (by avoiding undersampling). The GBT allows the discrete state-space model to approximate the continuous dynamics while maintaining robustness in numerical computations.

The discrete state equation under GBT is given by:

x_t = (I − α·Δt·A)^(-1)·(I + (1−α)·Δt·A)·x_(t-1) + (I − α·Δt·A)^(-1)·Δt·B·u_t

Here, x is the state representation, Δt is the time step, A is the matrix that represents how the state is influenced by the previous state, B is the matrix that represents the effect of the input on the current state, and I is the identity matrix, which ensures that the output has consistent dimensionality.

A critical decision when applying the Generalized Bilinear Transform is the choice of the parameter α, which controls the balance between preserving the characteristics of the continuous-time system and ensuring stability in the discrete domain. The authors selected α = 0.5 as it balances accuracy and numerical stability. The resulting state equation is given by:

x_t = (I − (Δt/2)·A)^(-1)·(I + (Δt/2)·A)·x_(t-1) + (I − (Δt/2)·A)^(-1)·Δt·B·u_t

The bilinear transform is then applied to the initialized continuous-time matrices A and B, discretizing them into Ā and B̄, respectively.
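
A small NumPy sketch of this discretization step (with made-up continuous-time matrices and an arbitrary Δt) might look like this:

import numpy as np

def discretize_bilinear(A, B, dt):
    # Generalized bilinear transform with alpha = 0.5 (the bilinear / Tustin method):
    # A_bar = (I - dt/2 * A)^-1 (I + dt/2 * A),  B_bar = (I - dt/2 * A)^-1 * dt * B
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - (dt / 2.0) * A)
    A_bar = inv @ (I + (dt / 2.0) * A)
    B_bar = inv @ (dt * B)
    return A_bar, B_bar

A = np.array([[-1.0, 0.0], [0.5, -0.5]])  # toy continuous-time matrices
B = np.array([[1.0], [0.0]])
A_bar, B_bar = discretize_bilinear(A, B, dt=0.1)
print(A_bar)
print(B_bar)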

Now that we have a discretized version of the SSR equations, we can apply them to natural language generation tasks where:

1. u(t) is the input token we feed into the model.

2. x(t) is the context, which is the representation of the sequence’s history thus far.

3. y(t) is the output, the predicted next token.

Thus, we have a representation of SSMs that can handle tokens as input.

State Space Model with discretized matrices Ā and B̄. Ā and B̄ map the current context x_(t-1) and the input token u_t to the new context x_t. C maps the context to the output token y_t, with D modeling the direct relationship between u_t and y_t. The direct connection between the input and the output mediated by D is treated as a skip connection and is not explicitly incorporated into the model’s internal architecture.

The three pillars of SSMs as sequence models

Now that we can use SSMs for NLP tasks, let’s see how they measure up with respect to the other available sequencing algorithms by circling back to the goals the authors stated at the beginning of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.

Parallelizable training

Parallelizable training would save a considerable amount of computational resources and time. Two widely used sequencing architectures are inherently parallelizable during training:

1. Convolutional Neural Networks (CNNs) are inherently parallelizable because the convolution operation can be applied simultaneously across all positions in the input sequence. In sequence modeling, CNNs process the entire input in parallel by applying convolutional filters over the sequence, allowing for efficient computation during training.

2. Transformers achieve parallelism through the self-attention mechanism, which simultaneously computes attention weights between all pairs of tokens in the sequence. This is possible because the computations involve matrix operations that can be parallelized, allowing the model to process entire sequences at once.

Efficiently distributing the computational workload is crucial for sequence algorithms, especially when training on large datasets. To address this challenge, the authors introduced a convolutional representation of SSMs, which allows these models to process sequences in parallel, similar to CNNs and Transformers.

The authors’ idea is to express the SSM as a convolution operation with a specific kernel k derived from the state-space parameters, enabling the model to compute outputs over long sequences efficiently.

To derive the SSR equations as a convolution operation, they assume the SSM model to be time-invariant. This means the matrices A, B, C, and D do not vary with time, the matrix A is stable (which is already achieved by adopting the HiPPO matrix for A that allows a numerically stable update of the context), and the initial state x(0) is 0.

Using the SSR equations mentioned earlier (state equation that derives x(t) and output equation that derives y(t)), the kernel k can be derived in two steps:

1. Solving for the state, we start with the state equation from the SSR equations where x_0 = 0:

x_n = B̄·u_(n-1) + Ā·B̄·u_(n-2) + Ā^2·B̄·u_(n-3) + … + Ā^(n-1)·B̄·u_0

We derived the state x_n, which represents the system’s state at time step n, based on the contributions of past inputs. Similarly, u_k denotes the input to the system at a specific time step k within the sequence. The number of time steps n (i.e., the number of times we sample using Δt) depends on the length of the input sequence, as the state x_n is influenced by all preceding inputs up to time n−1.

2. Substitute x_n in the SSR output equation with the state derived in step 1:

y_n = C·( B̄·u_(n-1) + Ā·B̄·u_(n-2) + … + Ā^(n-1)·B̄·u_0 ) + D·u_n

We can simplify this equation by combining the state representations (A, B, C, and D) as the kernel k:

k = (C·B̄, C·Ā·B̄, C·Ā^2·B̄, …, C·Ā^(n-1)·B̄), i.e., k_m = C·Ā^m·B̄

Here, m is the index for summing over past inputs. The result is the following equation for the output at step n:

y_n = k_0·u_(n-1) + k_1·u_(n-2) + … + k_(n-1)·u_0 + D·u_n = (k ∗ u)_n + D·u_n

Thus, we are left with the convolutional representation of the SSM: we denote the terms multiplying the inputs as the kernel k and obtain the outputs from the input sequence by passing the kernel across it.
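
As a naive NumPy sketch of this derivation, following the indexing above (practical implementations compute the convolution with FFTs rather than explicit loops, and the matrices here are toy values):

import numpy as np

def ssm_kernel(A_bar, B_bar, C, length):
    # k_m = C · A_bar^m · B_bar for m = 0 .. length-1
    kernel = []
    A_power = np.eye(A_bar.shape[0])
    for _ in range(length):
        kernel.append((C @ A_power @ B_bar).item())
        A_power = A_power @ A_bar
    return np.array(kernel)

def ssm_output(kernel, u, D=0.0):
    # y_n = k_0·u_(n-1) + ... + k_(n-1)·u_0 + D·u_n  (causal convolution)
    u = np.asarray(u, dtype=float)
    y = np.zeros(len(u))
    for n in range(len(u)):
        for m in range(n):
            y[n] += kernel[m] * u[n - 1 - m]
        y[n] += D * u[n]
    return y

A_bar = np.array([[0.9, 0.0], [0.1, 0.8]])  # toy discretized matrices
B_bar = np.array([[1.0], [0.0]])
C = np.array([[0.5, 0.5]])
u = [1.0, 0.0, 0.0, 1.0]
print(ssm_output(ssm_kernel(A_bar, B_bar, C, length=len(u)), u))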

Stateful inference

Stateful inference refers to a sequence model’s ability to create, maintain, and utilize a “state,” which includes all the relevant context needed for further computations. This ability is desirable because it eliminates the computational inefficiency of understanding the context whenever a new input token is present.

Transformers capture long-range dependencies and context through the self-attention mechanism. However, recomputing the attention weights and value vectors every time we have a new input token is computationally expensive. We can cache the values of key and value vectors to avoid some recomputation, which makes it slightly more efficient. Still, it does not solve the problem of transformers scaling quadratically.

RNNs achieve stateful inference through a hidden state that is only updated and not recomputed for every input token. However, RNNs struggle to retain information from earlier tokens in long sequences. This limitation arises because, during backpropagation, gradients associated with long-range dependencies diminish exponentially as they are propagated through many layers (or time steps), a phenomenon known as the vanishing gradient problem. As a result, RNNs cannot effectively model long-range dependencies between tokens.

Thanks to their state equation, SSMs achieve stateful inference. They inherently maintain a state containing the sequence’s context, making them more computationally efficient than transformer-based models.

To handle long-range dependencies, the authors of Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers use the HiPPO-LegS (Stationary form of HiPPO-Leg) formulation to parameterize A.

Time-scale adaptation

Time-scale adaptation refers to a sequence model’s ability to capture dependencies for the input token in different parts of the input sequence. In technical terms, this means the context can retain dependencies that occur over different temporal distances within the same sequence. Time-scale adaptation enables effective capturing of both short-term (immediate) and long-term (distant) relationships between elements in the data.

A model’s context representation is crucial for its ability to capture the internal dependencies within a sequence. SSMs carry the context in their state, which is updated through the matrix A. Thus, an SSM’s ability to update the state based on new input through the state equation allows the model to adapt to the contextual dependencies within a sequence, allowing it to handle both long- and short-range dependencies.

Linear state space layers (LSSLs)

So far, we’ve seen that State Space Models are efficient sequence models. In their paper Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers, Gu and colleagues introduced the Linear State Space Layer (LSSL) utilizing both the discretized recurrent and convolutional forms of State Space Representation equations. This layer is integrated into deep learning architectures to introduce efficient handling of long-range dependencies and structured sequence representations.

Like RNNs, SSMs are recurrent. They update the context by combining the previous state with the new state. This recurrent form is very slow to train because we need to wait for the previous output to be available before computing the next one. To address this problem, the authors devised the convolutional representation of the SSM equations that we discussed in the previous sections.

While the convolutional representation of SSMs enables training parallelization, it is not without its own problems. The key issue is the fixed size of the kernel. The kernel we use to process the input sequence is determined by the model parameters (matrices A, B, C, and D) and the sequence length, as we saw in the first step of the kernel derivation. However, natural language sequences vary in length. Thus, the kernel would have to be recomputed for each input sequence during inference, which is inefficient.

Although recurrent representations are inefficient to train, they can handle varying sequence lengths. Thus, to have a computationally efficient model, we seem to need the properties of both the convolutional and recurrent representations. Gu and colleagues devised a “best of both worlds” approach, using the convolutional representation during training and the recurrent representation during inference.

Comparison of the continuous-time, recurrent, and convolutional forms of SSMs. The Linear State Space Layer adopts both the recurrent and convolutional forms of the SSM representation to leverage their complementary advantages. The recurrent form is used during inference, and the convolutional form during training. | Source

In their paper, Gu and collaborators describe the LSSL architecture as a “deep neural network that involves stacking LSSL layers connected with normalization layers and residual connections.” Similar to the attention layers in the transformer architecture, each LSSL layer is preceded by a normalization layer and followed by a GeLU activation function. Then, through a residual connection, the output is added to the normalized output of a position-wise feedforward layer.

Architecture of a Linear State Space Layer. Each input has H features (the size of the token’s embedding vector) that are processed by independent copies of the SSM as one-dimensional inputs in parallel. Each SSM copy produces an M-dimensional output for each feature. The combined outputs are fed through a GeLU activation function and a position-wise feed-forward layer.

Efficiently modeling long sequences with structured state spaces

The LSSL model performed impressively well on sequence data but was not widely adopted due to computational complexities and memory bottlenecks.

Results of testing the original LSSL model on the sequential MNIST, permuted MNIST, and sequential CIFAR tasks, which are popular benchmarks originally designed to test the ability of recurrent models to capture long-term dependencies of length up to 1k. LSSL sets SoTA on sCIFAR by more than 10 points.

In the paper Efficiently Modeling Long Sequences with Structured State Spaces, Gu, together with close collaborators Karan Goel and Christopher Ré, advanced the LSSL to reduce the computational complexity of the training process and improve its accuracy.

Improvements on the state matrix A

In the previous section, we explored how the original LSSL relied on a fixed, predefined form of the HiPPO matrix to serve as the state matrix A. While this representation was successful in compressing information, it was computationally inefficient due to the full (dense) matrix representation of A. Gu, Goel, and Ré described this implementation as “infeasible to use in practice because of prohibitive computation and memory requirements induced by the state representation.”

In the LSSL, the state is multiplied by the matrix A to produce the updated version of the state. The most computationally efficient form of the matrix A for this multiplication would be a diagonal matrix. Unfortunately, the HiPPO matrix is not a normal matrix and cannot be stably diagonalized, so it cannot simply be replaced by a diagonal matrix.

However, the authors were able to decompose the matrix into a diagonal plus low-rank (DPLR) form. The diagonal matrix has nonzero entries only on the main diagonal, which makes multiplication efficient, requiring only a single multiplication per vector element. The low-rank matrix can be represented as the product of two much smaller matrices. Because of this factorization, the operations needed to multiply it with a vector are greatly reduced compared to a full-rank matrix of the same size.
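As a rough illustration of why this matters, the following sketch (with random placeholder matrices) multiplies a state vector once by a dense matrix and once by its diagonal plus low-rank factors. The results are identical, but the structured version never materializes the full N x N matrix and needs on the order of N operations instead of N².

import numpy as np

N, r = 1024, 1                          # state size, rank of the low-rank part
rng = np.random.default_rng(0)
Lambda = rng.normal(size=N)             # diagonal part of A
P = rng.normal(size=(N, r))             # low-rank factors: A = diag(Lambda) + P Q^T
Q = rng.normal(size=(N, r))
h = rng.normal(size=N)                  # state vector

# Dense multiplication: O(N^2) operations and O(N^2) memory.
A_dense = np.diag(Lambda) + P @ Q.T
y_dense = A_dense @ h

# Structured (DPLR) multiplication: O(N * r) operations, no full matrix needed.
y_dplr = Lambda * h + P @ (Q.T @ h)

print(np.allclose(y_dense, y_dplr))     # True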

The original LSSL architecture required O(N²L) operations, where N is the state dimension and L is the sequence length. After the transformation of the matrix A into its diagonal plus low-rank (DPLR) form, the computational complexity of both the recurrent and convolutional forms was reduced:

1. For the recurrent form, the DPLR parametrization requires only O(NL) operations for the matrix-vector multiplications.

2. For the convolutional form, computing the convolutional kernel was reduced to O(N log L + L log L) operations. This was achieved by changing the technique used to derive the kernel, which involves the inverse Fast Fourier Transform (iFFT) and the Woodbury identity to handle the low-rank term of matrix A.

This is a considerable leap in computational efficiency, significantly reducing the scaling with sequence length and bringing SSMs closer to linear time complexity, in contrast to the quadratic scaling of transformers.

Improvements in the training implementation

After tackling the LSSL’s computational complexity, the authors introduced another significant improvement: making the matrix A (partially) learnable. In the LSSL, the matrix was fixed and not updated during the training process. Rather, the matrices B and C carried the learnable parameters of the SSM blocks.

Keeping the matrix A fixed ensures computational efficiency, but it limits the model’s ability to capture complex dynamics and underlying patterns in the sequence. A fully learnable matrix A offers the flexibility to adapt to arbitrary dynamics. However, it comes with trade-offs: more parameters to optimize, slower training, and higher computational costs during inference.

To balance these competing demands, the modified LSSL – dubbed S4 – adopts a partially learnable A. By maintaining the DPLR structure of A, the model retains computational efficiency, while the learnable parameters allow it to adjust the state dynamics during training and capture richer, sequence-specific internal representations.

Additionally, Efficiently Modeling Long Sequences with Structured State Spaces introduces techniques for implementing bidirectional state space models. These models process sequences in both the forward and backward directions, capturing dependencies from past and future context.

Simplified state space layers for sequence modeling

In Simplified State Space Layers for Sequence Modeling, Jimmy Smith, Andrew Warrington, and Scott Linderman proposed multiple improvements to the S4 architecture to enhance performance while maintaining the same computational complexity.

While the improvements of S4 over the original LSSL mainly focus on reducing the model’s computational complexity, S5 aimed to simplify the architecture, making it more efficient and easier to implement while maintaining or improving performance.

Using parallel associative scan

Parallel scan, also known as parallel associative scan, is an algorithm that computes all prefix results of an associative operation (here, the products that propagate the state through the recurrence) in parallel, rather than processing the sequence one element at a time.

Using a parallel associative scan, Smith and colleagues were able to parallelize the training process of recurrent SSMs, removing the need for the use of the convolutional representation.

Thus, the S5 layer operates purely in the time domain instead of switching between the convolutional (frequency) domain and the time domain. This is an important improvement because it brings the time complexity per layer to O(N log L) instead of O(NL), leveraging parallel computation over the sequence length while reducing the memory overhead.
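The following NumPy sketch illustrates the idea on the simplest possible case, a diagonal (elementwise) linear recurrence h_t = a_t * h_{t-1} + b_t. Each element is treated as a pair (a, b), the associative combine composes two recurrence steps, and a doubling scheme produces all states in O(log L) combine rounds. This is an illustration of the principle, not the S5 implementation.

import numpy as np

def combine(left, right):
    # Compose two steps of the recurrence h_t = a * h_{t-1} + b:
    # applying (a1, b1) and then (a2, b2) equals applying (a2*a1, a2*b1 + b2).
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

def associative_scan(a, b):
    """Inclusive scan over (a, b) pairs with a doubling scheme.

    Needs only O(log L) combine rounds, each of which is elementwise and
    therefore parallelizable, instead of a sequential loop of length L.
    """
    a, b = a.astype(float), b.astype(float)
    L, shift = len(a), 1
    while shift < L:
        # Pad with the identity element (a=1, b=0) for the first `shift` positions.
        a_prev = np.concatenate([np.ones(shift), a[:-shift]])
        b_prev = np.concatenate([np.zeros(shift), b[:-shift]])
        a, b = combine((a_prev, b_prev), (a, b))
        shift *= 2
    return b  # b[t] now equals h_t

# Sanity check against the sequential recurrence (h_0 = 0).
rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
h, reference = 0.0, []
for t in range(8):
    h = a[t] * h + b[t]
    reference.append(h)
print(np.allclose(associative_scan(a, b), reference))  # True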

Allowing multi-input-multi-output

LSSL and S4 are Single-Input-Single-Output (SISO) models. Allowing Multi-Input-Multi-Output (MIMO) was computationally infeasible since the computations inside LSSL and S4 were designed under the assumption of one-dimensional inputs processed one at a time. For example, adapting the convolutional representation to operate on matrices instead of vectors would have significantly increased the computational cost, making the approach impractical.

Smith and collaborators discretized the MIMO SSM equations instead of the SISO SSM equations. Using the same SSR equations, they extended the discretization process to handle m-dimensional inputs and n-dimensional outputs. Assuming the state has N dimensions, this change makes B an N x m matrix instead of N x 1, and C an n x N matrix instead of 1 x N.

S5’s support for MIMO allows it to handle multidimensional data, such as multivariate and multi-channel time series data, process multiple sequences simultaneously, and produce multiple outputs. This reduces computational overhead by allowing multiple sequences to be processed at the same time instead of having m copies of the SSM.
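As a quick shape check, here is a minimal sketch of one discretized MIMO SSM step with placeholder matrices: the input is an m-dimensional vector, the output is n-dimensional, and B and C take the shapes described above.

import numpy as np

N, m, n = 8, 3, 2                       # state, input, and output dimensions
rng = np.random.default_rng(0)
A_bar = 0.1 * rng.normal(size=(N, N))   # discretized state matrix
B_bar = rng.normal(size=(N, m))         # N x m instead of N x 1 (SISO)
C = rng.normal(size=(n, N))             # n x N instead of 1 x N (SISO)

h = np.zeros(N)
x_t = rng.normal(size=m)                # one multivariate input vector
h = A_bar @ h + B_bar @ x_t             # state update
y_t = C @ h                             # n-dimensional output
print(h.shape, y_t.shape)               # (8,) (2,)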

Diagonalized parametrization

As we discussed above, the HiPPO-LegS matrix cannot be stably diagonalized. However, the parallel scan approach requires a diagonal matrix A. Through experimentation, Smith and colleagues discovered that they could represent the HiPPO-LegS matrix as a normal plus low-rank (NPLR) matrix, where the normal component is referred to as HiPPO-N and can be diagonalized.

They showed that removing the low-rank terms and initializing with the HiPPO-N matrix yielded similar results, proving that HiPPO-N and HiPPO-LegS produce the same dynamics. (A proof is given in the appendix of the paper.) However, if they had instead used the diagonal matrix from the DPLR approximation, the resulting dynamics would have been very different from those of the original structure.

Using a diagonalized version of the HiPPO-N matrix reduced the model’s computational complexity by removing the need to convert the HiPPO-LegS matrix into its DPLR approximation.

Similar to how using a structured parametrization for matrix A decreased the computational overhead, S5 uses a low-rank representation of matrices B and C, further reducing the number of parameters.

The computational components of an S5 layer, which uses a parallel scan on a diagonalized linear SSM to compute the SSM outputs. A nonlinear activation function is applied to the SSM outputs to produce the layer outputs. | Source

Conclusion and outlook

The evolution of State Space Models (SSMs) as sequence-to-sequence models has highlighted their growing importance in the NLP domain, particularly for tasks requiring the modeling of long-term dependencies. Innovations such as LSSL, S4, and S5 have advanced the field by enhancing computational efficiency, scalability, and expressiveness.

Despite the advancements made by the S5 model, it still lacks the ability to be context-aware. The S5 can efficiently train and infer in the time domain and retain information for long-range dependencies, but it does not explicitly filter or focus on specific parts of the sequence, as Transformers do with attention mechanisms.

Hence, a key next step is to incorporate a mechanism into SSMs that enables them to focus on the most relevant parts of the state rather than processing the entire state uniformly. This is what the Mamba model architecture addresses, which we’ll explore in the upcoming second part of the series.


Bayesian Deep Learning is Needed in the Age of Large-Scale AI [Paper Reflection]


In his famous blog post Artificial Intelligence — The Revolution Hasn’t Happened Yet, Michael Jordan (the AI researcher, not the one you probably thought of first) tells a story about how he might have almost lost his unborn daughter due to a faulty AI prediction. He speculates that many children die needlessly each year in the same way. Abstracting away the specifics of his case, this is one example of an application in which an AI algorithm’s performance looked good on paper during its development but led to bad decisions once deployed.

In our paper Bayesian Deep Learning is Needed in the Age of Large-Scale AI, we argue that the case above is not the exception but rather the rule and a direct consequence of the research community’s focus on predictive accuracy as a single metric of interest.

Our position paper was born out of the observation that the annual Symposium on Advances of Approximate Bayesian Inference, despite its immediate relevance to these questions, attracted fewer junior researchers over the years. At the same time, many of our students and younger colleagues seemed unaware of the fundamental problems with current practices in machine learning research—especially when it comes to large-scale efforts like the work on foundation models, which grab most of the attention today but fall short in terms of safety, reliability, and robustness.

We reached out to fellow researchers in Bayesian deep learning and eventually assembled a group of researchers from 29 of the most renowned institutions around the world, working at universities, government labs, and industry. Together, we wrote the paper to make the case that Bayesian deep learning offers promising solutions to core problems in machine learning and is ready for application beyond academic experiments. In particular, we point out that there are many other metrics beyond accuracy, such as uncertainty calibration, which we have to take into account to ensure that better models also translate to better outcomes in downstream applications.

In this commentary, I will expand on the importance of decisions as a goal for machine learning systems, in contrast to singular metrics. Moreover, I will make the case for why Bayesian deep learning can satisfy these desiderata and briefly review recent advances in the field. Finally, I will provide an outlook for the future of this research area and give some advice on how you can already use the power of Bayesian deep learning solutions in your research or practice today.

Machine learning for decisions

If you open any machine learning research paper presented at one of the big conferences, chances are that you will find a big table with a lot of numbers. These numbers usually reflect the predictive accuracy of different methods on different datasets, and the line corresponding to the authors’ proposed method probably has a lot of bold numbers, indicating that they are higher than the ones of the other methods.

The results table from the ResNet paper is a typical example of how results are presented in machine learning publications. The researchers applied different models and model variants to the same dataset and measured two metrics. The best metric values—usually belonging to the researchers’ newly devised model—are boldened.
In the results table from the Vision Transformer paper, the authors compare three of their own model variants against the prior state-of-the-art ResNet-152 model. They trained all four models on seven different datasets and measured the accuracy. Their findings indicate that the ViT-H/14 model (first column) outperforms the other models on six of the seven datasets. Crucially, this does not allow any conclusions about how any of the models would perform on a particular downstream task. (The last line of the table, labeled “TPUv3-core-days,” indicates the number of days it took to train the models on TPUs.)

Based on this observation, one might believe that bold numbers in tables are all that matters in the world. However, I want to strongly argue that this is not the case. What matters in the real world are decisions—or, more precisely, decisions and their associated utilities.

A motivating example

Imagine you overslept and are now running the risk of getting late to work. Moreover, there is a new construction site on your usual route to work, and there is also a parade going on in town today. This makes the traffic situation rather hard to predict. It is 08:30 am, and you have to be at work by 09:00. There are three different routes you can take: through the city, via the highway, or through the forest. How do you choose?

Luckily, some clever AI researchers have built tools that can predict the time each route takes. There are two tools to choose from, Tool A and Tool B, and these are their predictions:

Route | Tool A | Tool B
City | 35 min | 28 min
Highway | 25 min | 32 min
Forest | 43 min | 35 min

Annoyingly, Tool A suggests that you should take the highway, but Tool B suggests the city. However, as a tech-savvy user, you know that B uses a newer algorithm, and you have read the paper and marveled at the bold numbers. You know that B yields a lower mean squared error (MSE), a common measure of predictive performance on regression tasks.

Confidently, you choose to trust Tool B and thus take the route through the city—just to arrive at 09:02 and get an annoyed side-glance from your boss for being late.

But how did that happen? You chose the best tool, after all! Let’s look at the ground-truth travel times:

Route | Actual travel time
City | 32 min
Highway | 25 min
Forest | 35 min

As we can see, the highway was actually the fastest route and, in fact, the only one that would have gotten you to work on time. But how is that possible? This becomes clear when we compute the MSE of the two predictive algorithms on these travel times:

MSE(A) = [ (35-32)² + (25-25)² + (43-35)²] / 3 = 24.3

MSE(B) = [ (28-32)² + (32-25)² + (35-35)²] / 3 = 21.7

Indeed, we see that Tool B has the better MSE, as advertised in the paper. But that didn’t help you now, did it? What you ultimately cared about was not having the most accurate predictions across all possible routes but making the best decision regarding which route to take, namely the decision that gets you to work in time.

While Tool A makes worse predictions on average, its predictions are better for routes with shorter travel times and get worse the longer a route takes. It also never underestimates travel times.

To get to work on time, you don’t care about the predictions for the slowest routes, only about the fastest ones. You’d also like to have the confidence to arrive on time and not choose a route that then actually ends up taking longer. Thus, while Tool A has a worse MSE, it actually leads to better decisions.

Uncertainty estimation to the rescue

Of course, if you had known that the prediction could have been so wrong, you might have never trusted it in the first place, right? Let’s add another useful feature to the predictions: uncertainty estimation.

Here are the original two algorithms and a new third one (Tool C) that estimates its own predictive uncertainties:

The ranking based on mean predictions of Tool C agrees with Tool B. However, you can now assess how much risk there is that you run late to work. Your true utility is not to be at work in the shortest time possible but to be at work on time, i.e., within a maximum of 30 min.

According to Tool C, the drive through the city can take between 17 and 32 min, so while it seems to be the fastest on average, there is a chance that you will be late. In contrast, the highway can take between 25 and 29 min, so you will be on time in any case. Armed with these uncertainty estimates, you’d make the correct choice of choosing the highway.

This was just one example of a scenario in which we are faced with decisions whose utility does not correlate with an algorithm’s raw predictive accuracy, and uncertainty estimation is crucial to making better decisions.

The case for Bayesian deep learning

Bayesian deep learning uses the foundational statistical principles of Bayesian inference to endow deep learning systems with the ability to make probabilistic predictions. These predictions can then be used to derive uncertainty intervals of the form shown in the previous example (which a Bayesian would call “credible intervals”).

Uncertainty intervals can encompass aleatoric uncertainty, that is, the uncertainty inherent in the randomness of the world (e.g., whether your neighbor decided to leave the car park at the same time as you), and epistemic uncertainty, related to our lack of knowledge (e.g., we might not know how fast the parade moves).

Crucially, by applying Bayes’ theorem, we can incorporate prior knowledge into the predictions and uncertainty estimates of our Bayesian deep learning model. For example, we can use our understanding of how traffic flows around a construction site to estimate potential delays.

Frequentist statisticians will often criticize this aspect of Bayesian inference as “subjective” and will advocate for “distribution-free” approaches, such as conformal prediction, which give you provable guarantees for the coverage of the prediction intervals. However, these guarantees only hold uniformly across all the predictions (in our example, across all the routes), but not necessarily in any given case.

As we have seen in our example, we don’t care that much about the accuracy (and, by extension, the uncertainty estimates) on the slower routes. As long as the predictions and uncertainty estimates for the fast routes are accurate, a tool serves its purpose. Conformal methods only guarantee coverage on average across all routes, not for each individual route, limiting their applicability in many scenarios.

“But Bayesian deep learning doesn’t work”

If you only superficially followed the field of Bayesian deep learning a few years ago and then stopped paying attention, distracted by all the buzz around LLMs and generative AI, you could be excused for believing that it has elegant principles and a strong motivation, but does not actually work in practice. Indeed, this truly was the case until only very recently.

However, in the last few years, the field has seen many breakthroughs that allow for this framework to finally deliver on its promises. For instance, performing Bayesian inference on posterior distributions over millions of neural network parameters used to be computationally intractable, but we now have scalable approximate inference methods that are only marginally more costly than standard neural network training.

Moreover, it used to be hard to choose the right model class for a given problem, but we have made great progress in automating this decision away from the user thanks to advances in Bayesian model selection.

While it is still nearly impossible to design a meaningful prior distribution over neural network parameters, we have found different ways to specify priors directly over functions, which is much more intuitive for most practitioners. Finally, some troubling conundra related to the behavior of the Bayesian neural network posterior, such as the infamous cold posterior effect, are much better understood now.

Armed with these tools, Bayesian deep learning models have then started to have a beneficial impact in many domains, including healthcare, robotics, and science. For instance, we have shown that in the context of predicting health outcomes for patients in the intensive care unit based on time series data, a Bayesian deep learning approach can not only yield better predictions and uncertainty estimates but also lead to recommendations that are more interpretable for medical practitioners. Our position paper contains detailed accounts of this and other noteworthy examples.

However, Bayesian deep learning is unfortunately still not as easy to use as standard deep learning, which you can do these days in a few lines of PyTorch code.

If you want to use a Bayesian deep learning model, first, you have to think about specifying the prior. This is a crucial component of the Bayesian paradigm and might sound like a chore, but if you actually have prior knowledge about the task at hand, this can really improve your performance.

Then, you are still left with choosing an approximate inference algorithm, depending on how much computational budget you are willing to spend. Some algorithms are very cheap (such as Laplace inference), but if you want really high-fidelity uncertainty estimates, you might have to opt for a more expensive one (e.g., Markov Chain Monte Carlo).

Finally, you have to find the right implementation of that algorithm that also works with your model. For instance, some inference algorithms might only work with certain types of normalization operators (e.g., layer norm vs. batch norm) or might not work with low-precision weights.
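To give a feel for how little code an entry-level approach requires, here is a minimal PyTorch sketch using Monte Carlo dropout, one of the cheapest ways to obtain predictive uncertainty. It is an illustration under simplifying assumptions, not the pipeline recommended in the paper; the model, features, and route data are placeholders.

import torch
import torch.nn as nn

# A small regression model with dropout; train it as usual on your data.
model = nn.Sequential(
    nn.Linear(3, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def predict_with_uncertainty(model, x, n_samples=100):
    model.train()                       # keep dropout active while sampling
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

# Example: predicted travel times (minutes) for three hypothetical routes.
routes = torch.randn(3, 3)              # placeholder features per route
mean, std = predict_with_uncertainty(model, routes)
for route, (m, s) in zip(["city", "highway", "forest"], zip(mean, std)):
    print(f"{route}: {m.item():.1f} min (+/- {2 * s.item():.1f})")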

As a research community, we should make it a priority to make these tools more easily usable for normal practitioners without a background in ML research.

The road ahead

This commentary on our position paper has hopefully convinced you that there is more to machine learning than predictive accuracies on a test set. Indeed, if you use predictions from an AI model to make decisions, in almost all circumstances, you should care about ways to incorporate your prior knowledge into the model and get uncertainty estimates out of it. If this is the case, trying out Bayesian deep learning is likely worth your while.

A good place to start is the Primer on Bayesian Neural Networks that I wrote together with three colleagues. I’ve also written a review on priors in Bayesian Deep Learning that’s published open access. Once you understand the theoretical foundations and feel ready to get your hands dirty with some actual Bayesian deep learning in PyTorch, check out some popular libraries for inference methods such as Laplace inference, variational inference, and Markov chain Monte Carlo methods.

Finally, if you are a researcher and would like to get involved in the Bayesian deep learning community, especially contributing to the goal of better benchmarking to show the positive impact on real decision outcomes and to the goal of building easy-to-use software tools for practitioners, feel free to reach out to me.


Mastering Prompt Engineering with Functional Testing: A Systematic Guide to Reliable LLM Outputs 


Creating efficient prompts for large language models often starts as a simple task… but it doesn’t always stay that way. Initially, following basic best practices seems sufficient: adopt the persona of a specialist, write clear instructions, require a specific response format, and include a few relevant examples. But as requirements multiply, contradictions emerge, and even minor modifications can introduce unexpected failures. What was working perfectly in one prompt version suddenly breaks in another.

If you have ever felt trapped in an endless loop of trial and error, adjusting one rule only to see another one fail, you’re not alone! The reality is that traditional prompt optimisation is clearly missing a structured, more scientific approach that will help to ensure reliability.

That’s where functional testing for prompt engineering comes in! This approach, inspired by methodologies of experimental science, leverages automated input-output testing with multiple iterations and algorithmic scoring to turn prompt engineering into a measurable, data-driven process. 

No more guesswork. No more tedious manual validation. Just precise and repeatable results that allow you to fine-tune prompts efficiently and confidently.

In this article, we will explore a systematic approach to mastering prompt engineering, which ensures your LLM outputs will be efficient and reliable even for the most complex AI tasks.

Balancing precision and consistency in prompt optimisation

Adding a large set of rules to a prompt can introduce partial contradictions between rules and lead to unexpected behaviors. This is especially true when following a pattern of starting with a general rule and following it with multiple exceptions or specific contradictory use cases. Adding specific rules and exceptions can cause conflict with the primary instruction and, potentially, with each other.

What might seem like a minor modification can unexpectedly impact other aspects of a prompt. This is not only true when adding a new rule but also when adding more detail to an existing rule, like changing the order of the set of instructions or even simply rewording it. These minor modifications can unintentionally change the way the model interprets and prioritizes the set of instructions.

The more details you add to a prompt, the greater the risk of unintended side effects. By trying to specify every aspect of your task in too much detail, you also increase the risk of getting unexpected or distorted results. It is, therefore, essential to find the right balance between clarity and a high level of specification to maximise the relevance and consistency of the response. At a certain point, fixing one requirement can break two others, creating the frustrating feeling of taking one step forward and two steps backward in the optimization process.

Testing each change manually quickly becomes overwhelming. This is especially true when one needs to optimize prompts that must follow numerous competing specifications in a complex AI task. The process cannot simply be about modifying the prompt for one requirement after another, hoping the previous instructions remain unaffected. It also can’t be a system of selecting examples and checking them by hand. A better process with a more scientific approach should focus on ensuring repeatability and reliability in prompt optimization.

From laboratory to AI: Why testing LLM responses requires multiple iterations

Science teaches us to use replicates to ensure reproducibility and build confidence in an experiment’s results. I have been working in academic research in chemistry and biology for more than a decade. In those fields, experimental results can be influenced by a multitude of factors that can lead to significant variability. To ensure the reliability and reproducibility of experimental results, scientists mostly employ a method known as triplicates. This approach involves conducting the same experiment three times under identical conditions so that experimental variations have only a minor influence on the result. Statistical analysis (mean and standard deviation) conducted on the results, mostly in biology, allows the author of an experiment to determine the consistency of the results and strengthens confidence in the findings.

Just like in biology and chemistry, this approach can be used with LLMs to achieve reliable responses. With LLMs, the generation of responses is non-deterministic, meaning that the same input can lead to different outputs due to the probabilistic nature of the models. This variability is challenging when evaluating the reliability and consistency of LLM outputs.

In the same way that biological/chemical experiments require triplicates to ensure reproducibility, testing LLMs requires multiple iterations to measure reproducibility. A single test per use case is, therefore, not sufficient because it does not represent the inherent variability in LLM responses. At least five iterations per use case allow for a better assessment. By analyzing the consistency of the responses across these iterations, we can better evaluate the reliability of the model and identify any potential issues or variations. This ensures that the output of the model is correctly controlled.

Multiply this across 10 to 15 different prompt requirements, and one can easily understand how, without a structured testing approach, we end up spending time in trial-and-error testing with no efficient way to assess quality.

A systematic approach: Functional testing for prompt optimization

To address these challenges, a structured evaluation methodology can be used to ease and accelerate the testing process and enhance the reliability of LLM outputs. This approach has several key components:

  • Data fixtures: At the core of the approach are data fixtures, which are predefined input-output pairs specifically created for prompt testing. These fixtures serve as controlled scenarios that represent the various requirements and edge cases the LLM must handle. By using a diverse set of fixtures, the performance of the prompt can be evaluated efficiently across different conditions.
  • Automated test validation: This approach automates the validation of the requirements on a set of data fixtures by comparing the expected outputs defined in the fixtures with the LLM responses. This automated comparison ensures consistency and reduces the potential for human error or bias in the evaluation process. It allows for quick identification of discrepancies, enabling fine-grained and efficient prompt adjustments.
  • Multiple iterations: To assess the inherent variability of the LLM responses, this method runs multiple iterations for each test case. This iterative approach mimics the triplicate method used in biological/chemical experiments, providing a more robust dataset for analysis. By observing the consistency of responses across iterations, we can better assess the stability and reliability of the prompt.
  • Algorithmic scoring: The results of each test case are scored algorithmically, reducing the need for long and laborious human evaluation. This scoring system is designed to be objective and quantitative, providing clear metrics for assessing the performance of the prompt. By focusing on measurable outcomes, we can make data-driven decisions to optimize the prompt effectively.

Step 1: Defining test data fixtures

Selecting or creating compatible test data fixtures is the most challenging step of our systematic approach because it requires careful thought. A fixture is not just any input-output pair; it must be crafted meticulously to evaluate the performance of the LLM for a specific requirement as accurately as possible. This process requires:

1. A deep understanding of the task and the behavior of the model to make sure the selected examples effectively test the expected output while minimizing ambiguity or bias.

2. Foresight into how the evaluation will be conducted algorithmically during the test.

The quality of a fixture, therefore, depends not only on the good representativeness of the example but also on ensuring it can be efficiently tested algorithmically.

A fixture consists of:

    • Input example: This is the data that will be given to the LLM for processing. It should represent a typical or edge-case scenario that the LLM is expected to handle. The input should be designed to cover a wide range of possible variations that the LLM might have to deal with in production.

    • Expected output: This is the expected result that the LLM should produce with the provided input example. It is used for comparison with the actual LLM response output during validation.
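For illustration, a fixture could be represented as a small data class like the one below. The field names are an assumption for this sketch rather than a prescribed schema, and the same structure is reused in the test-loop sketch further down.

from dataclasses import dataclass

@dataclass
class Fixture:
    name: str              # short label for the requirement being tested
    input_example: str     # the input handed to the LLM
    expected_output: str   # the reference output used for the comparison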

Step 2: Running automated tests

Once the test data fixtures are defined, the next step involves the execution of automated tests to systematically evaluate the performance of the LLM response on the selected use cases. As previously stated, this process makes sure that the prompt is thoroughly tested against various scenarios, providing a reliable evaluation of its efficiency.

Execution process

    1. Multiple iterations: For each test use case, the same input is provided to the LLM multiple times: a simple for loop with nb_iter = 5 (see the sketch after this list), and voilà!

    2. Response comparison: After each iteration, the LLM response is compared to the expected output of the fixture. This comparison checks whether the LLM has correctly processed the input according to the specified requirements.

    3. Scoring mechanism: Each comparison results in a score:

        ◦ Pass (1): The response matches the expected output, indicating that the LLM has correctly handled the input.

        ◦ Fail (0): The response does not match the expected output, signaling a discrepancy that needs to be fixed.

    4. Final score calculation: The scores from all iterations are aggregated to calculate the overall final score. This score represents the proportion of successful responses out of the total number of iterations. A high score, of course, indicates high prompt performance and reliability.
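A minimal sketch of this execution process, reusing the Fixture class from the previous snippet, could look as follows. The call_llm and compare functions are placeholders: plug in your own LLM client and the validation logic appropriate for your task.

def call_llm(prompt: str, input_example: str, model: str = "your-model") -> str:
    raise NotImplementedError("call your LLM provider here")

def compare(response: str, expected_output: str) -> int:
    # Pass (1) if the response matches the expected output, Fail (0) otherwise.
    return int(response.strip() == expected_output.strip())

def run_tests(prompt: str, fixtures: list, model: str = "your-model", nb_iter: int = 5) -> float:
    scores = []
    for fixture in fixtures:
        for _ in range(nb_iter):                       # multiple iterations per use case
            response = call_llm(prompt, fixture.input_example, model)
            scores.append(compare(response, fixture.expected_output))
    return sum(scores) / len(scores)                   # final score: share of passing runs

The returned value is the proportion of passing iterations across all fixtures, which corresponds to the final score described in step 4.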

Example: Removing author signatures from an article

Let’s consider a simple scenario where an AI task is to remove author signatures from an article. To efficiently test this functionality, we need a set of fixtures that represent the various signature styles. 

A dataset for this example could be:

Example input | Expected output
A long article + signature "Jean Leblanc" | The long article
A long article + signature "P. W. Hartig" | The long article
A long article + signature "MCZ" | The long article

Validation process:

  • Signature removal check: The validation function checks whether the signature is absent from the rewritten text. This is easily done programmatically by searching for the signature needle in the haystack of the output text, as in the snippet below.
  • Test failure criteria: If the signature is still present in the output, the test fails. This indicates that the LLM did not correctly remove the signature and that further adjustments to the prompt are required. If the signature is absent, the test passes.
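In code, this needle-in-a-haystack check is a one-liner (a sketch, with illustrative names):

def signature_removed(output_text: str, signature: str) -> int:
    # Pass (1) if the signature no longer appears anywhere in the rewritten text.
    return int(signature not in output_text)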

The test evaluation provides a final score that allows a data-driven assessment of the prompt efficiency. If it scores perfectly, there is no need for further optimization. However, in most cases, you will not get a perfect score because either the consistency of the LLM response to a case is low (for example, 3 out of 5 iterations scored positive) or there are edge cases that the model struggles with (0 out of 5 iterations). 

The feedback clearly indicates that there is still room for further improvements and it guides you to reexamine your prompt for ambiguous phrasing, conflicting rules, or edge cases. By continuously monitoring your score alongside your prompt modifications, you can incrementally reduce side effects, achieve greater efficiency and consistency, and approach an optimal and reliable output. 

A perfect score is, however, not always achievable with the selected model. Changing the model might just fix the situation. If it doesn’t, you know the limitations of your system and can take this fact into account in your workflow. With luck, this situation might just be solved in the near future with a simple model update. 

Benefits of this method 

  • Reliability of the result: Running five to ten iterations provides reliable statistics on the performance of the prompt. A single test run may succeed by chance, while consistent success across multiple iterations indicates a robust and well-optimized prompt.
  • Efficiency of the process: Unlike traditional scientific experiments that may take weeks or months to replicate, automated testing of LLMs can be carried out quickly. By setting a high number of iterations and waiting for a few minutes, we can obtain a high-quality, reproducible evaluation of the prompt efficiency.
  • Data-driven optimization: The score obtained from these tests provides a data-driven assessment of the prompt’s ability to meet requirements, allowing targeted improvements.
  • Side-by-side evaluation: Structured testing allows for an easy assessment of prompt versions. By comparing the test results, one can identify the most effective set of parameters for the instructions (phrasing, order of instructions) to achieve the desired results.
  • Quick iterative improvement: The ability to quickly test and iterate on prompts is a real advantage when carefully constructing a prompt, ensuring that previously validated requirements continue to hold as the prompt grows in complexity and length.

By adopting this automated testing approach, we can systematically evaluate and enhance prompt performance, ensuring consistent and reliable outputs with the desired requirements. This method saves time and provides a robust analytical tool for continuous prompt optimization.

Systematic prompt testing: Beyond prompt optimization

Implementing a systematic prompt testing approach offers more advantages than just the initial prompt optimization. This methodology is valuable for other aspects of AI tasks:

    1. Model comparison:

        ◦ Provider evaluation: This approach allows the efficient comparison of different LLM providers, such as ChatGPT, Claude, Gemini, Mistral, etc., on the same tasks. It becomes easy to evaluate which model performs best for your specific needs.

        ◦ Model version: State-of-the-art model versions are not always necessary when a prompt is well-optimized, even for complex AI tasks. A lightweight version can often provide the same results with faster responses at lower cost. This approach allows a side-by-side comparison of the different versions of a model, such as Gemini 1.5 Flash vs. 1.5 Pro vs. 2.0 Flash, or ChatGPT 3.5 vs. 4o mini vs. 4o, and enables a data-driven selection of the model version (see the sketch after this list).

    2. Version upgrades:

        ◦ Compatibility verification: When a new model version is released, systematic prompt testing helps validate if the upgrade maintains or improves the prompt performance. This is crucial for ensuring that updates do not unintentionally break the functionality.

        ◦ Seamless Transitions: By identifying key requirements and testing them, this method can facilitate better transitions to new model versions, allowing fast adjustment when necessary in order to maintain high-quality outputs.

    3. Cost optimization:

        ◦ Performance-to-cost ratio: Systematic prompt testing helps in choosing the most cost-effective model based on its performance-to-cost ratio. We can identify the option that offers the best trade-off between performance and operational costs to get the best return on LLM spending.
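Reusing the run_tests sketch from above, comparing providers or model versions on the same prompt and fixtures then reduces to a small loop; the model identifiers below are placeholders, not recommendations.

candidates = ["provider-a/small", "provider-a/large", "provider-b/default"]
scores = {name: run_tests(prompt, fixtures, model=name, nb_iter=5) for name in candidates}
best = max(scores, key=scores.get)
print(scores, "->", best)   # e.g., pick the cheapest model that clears your quality bar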

Overcoming the challenges

The biggest challenge of this approach is the preparation of the set of test data fixtures, but the effort invested in this process will pay off significantly as time passes. Well-prepared fixtures save considerable debugging time and enhance model efficiency and reliability by providing a robust foundation for evaluating the LLM response. The initial investment is quickly returned by improved efficiency and effectiveness in LLM development and deployment.

Quick pros and cons

Key advantages:

  • Continuous improvement: The ability to add more requirements over time while ensuring existing functionality stays intact is a significant advantage. This allows for the evolution of the AI task in response to new requirements, ensuring that the system remains up-to-date and efficient.
  • Better maintenance: This approach enables the easy validation of prompt performance with LLM updates. This is crucial for maintaining high standards of quality and reliability, as updates can sometimes introduce unintended changes in behavior.
  • More flexibility: With a set of quality control tests, switching LLM providers becomes more straightforward. This flexibility allows us to adapt to changes in the market or technological advancements, ensuring we can always use the best tool for the job.
  • Cost optimization: Data-driven evaluations enable better decisions on performance-to-cost ratio. By understanding the performance gains of different models, we can choose the most cost-effective solution that meets the needs.
  • Time savings: Systematic evaluations provide quick feedback, reducing the need for manual testing. This efficiency allows us to quickly iterate on prompt improvement and optimization, accelerating the development process.

Challenges

  • Initial time investment: Creating test fixtures and evaluation functions can require a significant investment of time. 
  • Defining measurable validation criteria: Not all AI tasks have clear pass/fail conditions. Defining measurable criteria for validation can sometimes be challenging, especially for tasks that involve subjective or nuanced outputs. This requires careful consideration and may involve a difficult selection of the evaluation metrics.
  • Cost associated with multiple tests: Multiple test use cases combined with 5 to 10 iterations each can generate a high number of LLM requests for a single test automation. But if the cost of a single LLM call is negligible, as it is in most cases for text input/output calls, the overall cost of a test remains minimal.

Conclusion: When should you implement this approach?

Implementing this systematic testing approach is, of course, not always necessary, especially for simple tasks. However, for complex AI workflows in which precision and reliability are critical, this approach becomes highly valuable by offering a systematic way to assess and optimize prompt performance, preventing endless cycles of trial and error.

By incorporating functional testing principles into Prompt Engineering, we transform a traditionally subjective and fragile process into one that is measurable, scalable, and robust. Not only does it enhance the reliability of LLM outputs, it helps achieve continuous improvement and efficient resource allocation.

The decision to implement systematic prompt testing should be based on the complexity of your project. For scenarios demanding high precision and consistency, investing the time to set up this methodology can significantly improve outcomes and speed up the development process. However, for simpler tasks, a more classical, lightweight approach may be sufficient. The key is to balance the need for rigor with practical considerations, ensuring that your testing strategy aligns with your goals and constraints.

Thanks for reading!

The Impact of GenAI and Its Implications for Data Scientists


GenAI systems affect how we work. This general notion is well known. However, we are still unaware of the exact impact of GenAI. For example, how much do these tools affect our work? Do they have a larger impact on certain tasks? What does this mean for us in our daily work?

To answer these questions, Anthropic released a study based on millions of anonymized conversations on Claude.ai. The study provides data on how GenAI is incorporated into real-world tasks and reveals actual GenAI usage patterns.

In this article, I will go through the four main findings of the study. Based on the findings I will derive how GenAI changes our work and what skills we need in the future.

Main findings

GenAI is mostly used for software development and technical writing tasks, which together account for almost 50% of all tasks. This is likely because LLMs are mostly text-based and thus less useful for other kinds of tasks.

GenAI has a stronger impact on some groups of occupations than others. More than one-third of occupations use GenAI in at least a quarter of their tasks. In contrast, only 4% of occupations use it for more than three-quarters of their tasks. We can see that only very few occupations use GenAI across most of their tasks. This suggests that no job is being entirely automated.

GenAI is used for augmentation rather than automation, i.e., 57% vs. 43% of the tasks. But most occupations use both augmentation and automation across tasks. Here, augmentation means the user collaborates with the GenAI to enhance their capabilities. Automation, in contrast, refers to tasks in which the GenAI directly performs the task. However, the authors suspect that the share of augmentation is even higher, as users might adjust GenAI answers outside of the chat window. Hence, what seems to be automation is actually augmentation. The results suggest that GenAI serves as an efficiency tool and a collaborative partner, resulting in improved productivity. These results align very well with my own experience. I mostly use GenAI tools to augment my work instead of automating tasks. In the article below, you can see how GenAI tools have increased my productivity and what I use them for daily.

GenAI is mostly used for tasks associated with mid-to-high-wage occupations, such as data scientists. In contrast, the lowest and highest-paid roles show a much lower usage of GenAI. The authors conclude that this is due to the current limits of GenAI capabilities and practical barriers when it comes to using GenAI.

Overall, the study suggests that occupations will rather evolve than disappear. This is because of two reasons. First, GenAI integration remains selective rather than comprehensive within most occupations. Although many jobs use GenAI, the tools are only used selectively for certain tasks. Second, the study saw a clear preference for augmentation over automation. Hence, GenAI serves as an efficiency tool and a collaborative partner.

Limitations

Before we can derive the implications of GenAI, we should look at the limitations of the study:

  • It is unknown how the users used the responses. Are they copy-pasting code snippets uncritically or editing them in their IDE? Hence, some conversations that look like automation might have been augmentation instead.
  • The authors only used conversations from Claude.ai’s chat but not from API or Enterprise users. Hence, the dataset used in the analysis shows only a fraction of actual GenAI usage.
  • Automating the classification might have led to the wrong classification of some conversations. However, due to the large number of conversations used, the impact should be rather small.
  • Claude being only text-based restricts the tasks and thus might exclude certain jobs.
  • Claude is advertised as a state-of-the-art coding model thus attracting mostly users for coding tasks.

Overall, the authors conclude that their dataset is not a representative sample of GenAI use in general. Thus, we should handle and interpret the results with care. Despite the study’s limitations, we can see some implications from the impact of GenAI on our work, particularly as Data Scientists.

Implications

The study shows that GenAI has the potential to reshape jobs and we can already see its impact on our work. Moreover, GenAI is rapidly evolving and still in the early stages of workplace integration.

Thus, we should be open to these changes and adapt to them.

Most importantly, we must stay curious, adaptive, and willing to learn. In the field of Data Science changes happen regularly. With GenAI tools change will happen even more frequently. Hence, we must stay up-to-date and use the tools to support us in this journey.

Currently, GenAI has the potential to enhance our capabilities instead of automating them.

Hence, we should focus on developing skills that complement GenAI. We need skills to augment workflows effectively in our work and analytical tasks. These skills lie in areas with low penetration of GenAI. This includes human interaction, strategic thinking, and nuanced decision-making. This is where we can stand out.

Moreover, skills such as critical thinking, complex problem-solving, and judgment will remain highly valuable. We must be able to ask the right questions, interpret the output of LLMs, and take action based on the answers.

Furthermore, GenAI will not replace our collaboration with colleagues on projects. Hence, improving our emotional intelligence will help us work together effectively.

Conclusion

GenAI is rapidly evolving and still in the early stages of workplace integration. However, we can already see some implications from the impact of GenAI on our work.

In this article, I showed you the main findings of a recent study from Anthropic on the use of their LLMs. Based on the results, I showed you the implications for Data Scientists and what skills might become more important.

I hope that you find this article useful and that it will help you become a better Data Scientist.

See you in my next article.

Mastering Hadoop, Part 3: Hadoop Ecosystem: Get the most out of your cluster


As we have already seen with the basic components (Part 1, Part 2), the Hadoop ecosystem is constantly evolving and being optimized for new applications. As a result, various tools and technologies have developed over time that make Hadoop more powerful and even more widely applicable. Today, it goes beyond the pure HDFS & MapReduce platform and offers, for example, SQL and NoSQL queries as well as real-time streaming.

Hive/HiveQL

Apache Hive is a data warehousing system that allows for SQL-like queries on a Hadoop cluster. Traditional relational databases struggle with horizontal scalability and ACID properties in large datasets, which is where Hive shines. It enables querying HDFS data through a SQL-like query language, HiveQL, without having to write complex MapReduce processes in Java. This means that business analysts and developers can use HiveQL (Hive Query Language) to create simple queries and build evaluations based on Hadoop data architectures.

Hive was originally developed by Facebook for processing large volumes of structured and semi-structured data. It is particularly useful for batch analyses and can be operated with common business intelligence tools such as Tableau or Apache Superset.

The metastore is the central repository that stores metadata such as table definitions, column names, and HDFS location information. This makes it possible for Hive to manage and organize large datasets. The execution engine, on the other hand, converts HiveQL queries into tasks that Hadoop can process. Depending on the desired performance and infrastructure, you can choose different execution engines:

  • MapReduce: The classic, slower approach.
  • Tez: A faster alternative to MapReduce.
  • Spark: The fastest option, which runs queries in-memory for optimal performance.

To use Hive in practice, various aspects should be considered to maximize performance. One of them is partitioning: instead of storing data in one huge table, it is split into partitions that can be searched more quickly. For example, a company’s sales data can be partitioned by year and month:

CREATE TABLE sales_partitioned (
    customer_id STRING,
    amount DOUBLE
) PARTITIONED BY (year INT, month INT);

This means that only the specific partition that is required can be accessed during a query. When creating partitions, it makes sense to create ones that are queried frequently. Buckets can also be used to ensure that joins run faster and data is distributed evenly.

CREATE TABLE sales_bucketed (
    customer_id STRING,
    amount DOUBLE
) CLUSTERED BY (customer_id) INTO 10 BUCKETS;

In conclusion, Hive is a useful tool if structured queries on huge amounts of data are to be possible. It also offers an easy way to connect common BI tools, such as Tableau, with data in Hadoop. However, if the application requires many short-term read and write accesses, then Hive is not the right tool.

Pig

Apache Pig takes this one step further and enables the parallel processing of large amounts of data in Hadoop. Compared to Hive, it is not focused on data reporting, but on the ETL process of semi-structured and unstructured data. For these data analyses, it is not necessary to use the complex MapReduce process in Java; instead, simple processes can be written in Pig’s own scripting language, Pig Latin.

In addition, Pig can handle various file formats, such as JSON or XML, and perform data transformations, such as merging, filtering, or grouping data sets. The general process then looks like this:

  • Loading the Information: The data can be pulled from different data sources, such as HDFS or HBase.
  • Transforming the data: The data is then modified depending on the application so that you can filter, aggregate, or join it.
  • Saving the results: Finally, the processed data can be stored in various data systems, such as HDFS, HBase, or even relational databases.

Apache Pig differs from Hive in many fundamental ways. The most important are:

Attribute | Pig | Hive
Language | Pig Latin (script-based) | HiveQL (similar to SQL)
Target group | Data engineers | Business analysts
Data structure | Semi-structured and unstructured data | Structured data
Applications | ETL processes, data preparation, data transformation | SQL-based analyses, reporting
Optimization | Parallel processing | Optimized, analytical queries
Engine options | MapReduce, Tez, Spark | Tez, Spark

Apache Pig is a component of Hadoop that simplifies data processing through its script-based Pig Latin language and accelerates transformations by relying on parallel processing. It is particularly popular with data engineers who want to work on Hadoop without having to develop complex MapReduce programs in Java.

HBase

HBase is a key-value-based NoSQL database in Hadoop that stores data in a column-oriented manner. Compared to classic relational databases, it can be scaled horizontally and new servers can be added to the storage if required. The data model consists of various tables, all of which have a unique row key that can be used to uniquely identify them. This can be imagined as a primary key in a relational database.

Each table in turn is made up of columns that belong to a so-called column family and must be defined when the table is created. The key-value pairs are then stored in the cells of a column. By focusing on columns instead of rows, large amounts of data can be queried particularly efficiently.

This structure can also be seen when creating new data records. A unique row key is created first and the values for the individual columns can then be added to this.

// Create a new row with the row key "1001"
Put put = new Put(Bytes.toBytes("1001"));
// Column family "Personal", column "Name", value "Max"
put.addColumn(Bytes.toBytes("Personal"), Bytes.toBytes("Name"), Bytes.toBytes("Max"));
// Column family "Bestellungen" (orders), column "Produkt" (product), value "Laptop"
put.addColumn(Bytes.toBytes("Bestellungen"), Bytes.toBytes("Produkt"), Bytes.toBytes("Laptop"));
table.put(put);

The column family is always named first, followed by the column and its value. The same structure is used when querying: the row is first selected via its row key, and then the required column family and column are read from it.

// Read the row with the row key "1001"
Get get = new Get(Bytes.toBytes("1001"));
Result result = table.get(get);
// Retrieve the value of column "Name" in the column family "Personal"
byte[] name = result.getValue(Bytes.toBytes("Personal"), Bytes.toBytes("Name"));
System.out.println("Name: " + Bytes.toString(name));

The architecture is based on a master-worker setup. The HMaster is the higher-level control unit of HBase and manages the underlying RegionServers. It is also responsible for load distribution by centrally monitoring system performance and assigning the so-called regions to the RegionServers. If a RegionServer fails, the HMaster ensures that its regions are reassigned to other RegionServers so that operations can continue. If the HMaster itself fails, the cluster can hold additional HMasters in standby mode, one of which then takes over. During operation, however, a cluster only ever has one active HMaster.

The RegionServers are the working units of HBase: they store and manage the table data in the cluster and answer read and write requests. For this purpose, each HBase table is divided into several subsets, the so-called regions, which are then managed by the RegionServers. A RegionServer can manage several regions, which helps balance the load between the nodes.

The RegionServers communicate directly with clients and therefore receive read and write requests first-hand. Incoming writes are buffered in the so-called MemStore, while read requests are first served from the MemStore and, if the required data is not found there, from the permanent storage in HDFS. As soon as the MemStore reaches a certain size, its contents are flushed to an HFile in HDFS.

The storage backend of HBase is therefore HDFS, which serves as permanent storage. As already described, the HFiles are used for this purpose and can be distributed across several nodes. The advantage of this is horizontal scalability, as the data volumes can be spread across different machines. In addition, the data is replicated in multiple copies to ensure reliability.

Finally, Apache ZooKeeper serves as the coordination service for HBase. It monitors the HMaster and all RegionServers and automatically initiates a failover to a standby HMaster if the active one fails. It also stores important metadata about the cluster and prevents conflicts when several clients want to access data at the same time. This enables the smooth operation of even larger clusters.

HBase is, therefore, a powerful NoSQL database that is well suited to big data applications. Thanks to its distributed architecture, HBase remains available even in the event of server failures and combines RAM-based processing in the MemStore with the permanent storage of data in HDFS.

Spark

Apache Spark is a further development of MapReduce and can be up to 100x faster thanks to in-memory computing. Through the addition of many components, it has since developed into a comprehensive platform for various workloads, such as batch processing, data streaming, and even machine learning. It is also compatible with a wide variety of data sources, including HDFS, Hive, and HBase.

At the heart of the components is Spark Core, which offers basic functions for distributed processing:

  • Task management: Calculations can be distributed across multiple nodes and monitored.
  • Fault tolerance: If individual nodes fail, the affected computations can be automatically re-executed.
  • In-memory computing: Data is held in the RAM of the cluster nodes to ensure fast processing and availability.

The central data structures of Apache Spark are the so-called Resilient Distributed Datasets (RDDs). They enable distributed processing across different nodes and have the following properties:

  • Resilient (fault-tolerant): Data can be restored in the event of node failures. The RDDs do not store the data themselves, but only the sequence of transformations that produced them. If a node fails, Spark can simply re-execute those transformations to restore the RDD.
  • Distributed: The information is distributed across multiple nodes.
  • Immutable: Once created, RDDs cannot be changed, only recreated.
  • Lazily evaluated (delayed execution): The operations are only executed when an action is triggered, not when they are defined.

Apache Spark also consists of the following components:

  • Spark SQL provides an SQL engine for Spark and runs on Datasets and DataFrames. As it works in-memory, processing is particularly fast, making it suitable for all applications where efficiency and speed play an important role (see the example after this list).
  • Spark Streaming offers the possibility of processing continuous data streams in near real-time by converting them into mini-batches. It can be used, for example, to analyze social media posts or monitor IoT data. It also supports many common streaming data sources, such as Kafka or Flume.
  • With MLlib, Apache Spark offers an extensive library that contains a wide range of machine learning algorithms and can be applied directly to the stored data sets. This includes, for example, models for classification, regression, or even entire recommendation systems.
  • GraphX is a powerful tool for processing and analyzing graph data. It enables efficient, distributed analyses of relationships between data points and also ships with graph algorithms such as PageRank, which can be used, for example, to analyze social networks.
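
As a brief, hedged illustration of the Spark SQL component: assuming the sales table from the Hive examples above has been registered in Spark’s catalog (for instance via a shared Hive metastore, a common but not mandatory setup), the same style of SQL can be run directly through Spark’s in-memory engine:

-- Executed by Spark SQL in memory across the cluster
SELECT year, month, SUM(amount) AS monthly_revenue
FROM sales_partitioned
GROUP BY year, month
ORDER BY year, month;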

Apache Spark is arguably one of the most popular components in the Hadoop ecosystem, as it enables fast in-memory calculations that would previously have been unthinkable with MapReduce. Although Spark is not an exclusive component of Hadoop, as it can also use other file systems such as S3, the two systems are often used together in practice. Spark also continues to gain popularity due to its universal applicability and wide range of functionality.

Oozie

Apache Oozie is a workflow management and scheduling system that was developed specifically for Hadoop and plans the execution and automation of various Hadoop jobs, such as MapReduce, Spark, or Hive. The most important functionality here is that Oozie defines the dependencies between the jobs and executes them in a specific order. In addition, schedules or specific events can be defined for which the jobs are to be executed. If errors occur during execution, Oozie also has error-handling options and can restart the jobs.

A workflow is defined in XML so that the workflow engine can read it and start the jobs in the correct order. If a job fails, it can simply be repeated or other steps can be initiated. Oozie also has a database backend system, such as MySQL or PostgreSQL, which is used to store status information.

Presto

Presto offers another option for running distributed SQL queries on large amounts of data. Compared to other Hadoop technologies, such as Hive, queries are processed interactively in near real-time, making Presto well suited to data warehouses running on large, distributed systems. Presto offers broad support for all relevant data sources and does not require a separate schema definition, so data can be queried directly at the sources. It has also been optimized to work on distributed systems and can, therefore, be used on petabyte-sized data sets.

Presto uses a so-called massively parallel processing (MPP) architecture, which enables particularly efficient processing in distributed systems. As soon as the user sends an SQL query via the Presto CLI or a BI front end, the coordinator analyzes the query and creates an executable query plan. The worker nodes then execute their parts of the query and return the partial results to the coordinator, which combines them into a final result.
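
A minimal sketch of what such a query can look like: assuming a Hive connector and a MySQL connector are configured (the catalog, schema, and table names here are purely hypothetical), Presto can even join data from the two sources in a single statement:

-- Combines data from the Hive catalog with a relational source in one query
SELECT c.name, SUM(s.amount) AS total_amount
FROM hive.default.sales_partitioned s
JOIN mysql.crm.customers c
  ON s.customer_id = c.customer_id
GROUP BY c.name
ORDER BY total_amount DESC
LIMIT 10;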

Presto differs from the related systems in Hadoop as follows:

Attribute | Presto | Hive | Spark SQL
Query speed | Milliseconds to seconds | Minutes (batch processing) | Seconds (in-memory)
Processing model | Real-time SQL queries | Batch processing | In-memory processing
Data sources | HDFS, S3, RDBMS, NoSQL, Kafka | HDFS, Hive tables | HDFS, Hive, RDBMS, streams
Use case | Interactive queries, BI tools | Slow big data queries | Machine learning, streaming, SQL queries

This makes Presto the best choice for fast SQL queries on a distributed big data environment like Hadoop.

What are alternatives to Hadoop?

In the early 2010s, Hadoop was the leading technology for distributed data processing. However, several alternatives have since emerged that offer advantages in certain scenarios or are simply better suited to today’s applications.

Cloud-native alternatives to Hadoop

Many companies have moved away from hosting their own servers and on-premises systems and are instead moving their big data workloads to the cloud. There, they can benefit significantly from automatic scaling, lower maintenance costs, and better performance. In addition, many cloud providers offer solutions that are much easier to manage than Hadoop and can, therefore, also be operated by less specialized personnel.

Amazon EMR (Elastic MapReduce)

Amazon EMR is a managed big data service from AWS that provides Hadoop, Spark, and other distributed computing frameworks so that these clusters no longer need to be hosted on-premises. This means companies no longer have to take care of cluster maintenance and administration themselves. In addition to Hadoop, Amazon EMR supports many other open-source frameworks, such as Spark, Hive, Presto, and HBase. This broad support means that users can move their existing clusters to the cloud without any major problems.

For storage, Amazon EMR uses S3 as primary storage instead of HDFS. This not only makes storage cheaper, as no permanent cluster is required, but also improves availability, as data is stored redundantly across multiple availability zones. In addition, computing and storage can be scaled independently of each other instead of being tied to the size of a single cluster, as is the case with Hadoop.

There is a specially optimized interface, the EMR File System (EMRFS), that allows Hadoop or Spark to access S3 directly. It also handles S3’s consistency behavior and enables metadata caching for better performance. If necessary, HDFS can still be used, for example, when local, temporary storage is required on the cluster nodes.
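
To make this concrete, here is a small, hedged example of how this typically looks from the Hive side on EMR: an external table whose data lives in S3 rather than HDFS (the bucket name and path are hypothetical), so the cluster can be shut down without losing the data.

-- External table backed by S3 via EMRFS; dropping the table does not delete the data
CREATE EXTERNAL TABLE sales_s3 (
    customer_id STRING,
    amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET
LOCATION 's3://my-company-datalake/sales/';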

Another advantage of Amazon EMR over a classic Hadoop cluster is the ability to use dynamic auto-scaling to not only reduce costs but also improve performance. The cluster size and the available hardware are automatically adjusted to the CPU utilization or the job queue size so that costs are only incurred for the hardware that is needed.

So-called spot instances can also be added temporarily when they are needed. In a company, for example, it makes sense to add them at night, when the data from the productive systems is loaded into the data warehouse. During the day, on the other hand, smaller clusters are operated, which saves costs.

Amazon EMR, therefore, offers several optimizations over running Hadoop on-premises. Particularly advantageous are the optimized storage access to S3, the dynamic cluster scaling, which increases performance while reducing costs, and the improved network communication between the nodes. Overall, data can be processed faster and with fewer resources than with classic Hadoop clusters running on a company’s own servers.

Google BigQuery

In the area of data warehousing, Google BigQuery offers a fully managed and serverless data warehouse that provides fast SQL queries on large amounts of data. It relies on columnar data storage and uses Google’s Dremel technology to handle massive amounts of data efficiently. At the same time, it largely dispenses with cluster management and infrastructure maintenance.

In contrast to native Hadoop, BigQuery uses a columnar orientation and can, therefore, save immense amounts of storage space by using efficient compression methods. In addition, queries are accelerated as only the required columns need to be read rather than the entire row. This makes it possible to work much more efficiently, which is particularly noticeable with very large amounts of data.
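
A small illustrative query (the project, dataset, and table names are hypothetical) shows why this matters: BigQuery only scans the columns referenced below, not the full rows, which directly reduces both runtime and cost.

-- Only the referenced columns (customer_id, amount, year) are scanned
SELECT customer_id, SUM(amount) AS total_amount
FROM `my-project.sales_dataset.sales`
WHERE year = 2024
GROUP BY customer_id;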

BigQuery is built on Dremel technology, which executes SQL queries in parallel across a hierarchy of workers and distributes the workload across many machines. Since such architectures often lose performance when the partial results have to be merged again, BigQuery uses tree aggregation to combine the partial results efficiently.

BigQuery is the better alternative to Hadoop, especially for applications that focus on SQL queries, such as data warehouses or business intelligence. For unstructured data, on the other hand, Hadoop may be the more suitable alternative, although the cluster architecture and the associated costs must be taken into account. Finally, BigQuery also offers a good connection to the various machine learning offerings from Google, such as Google AI or AutoML, which should be taken into account when making a selection.

Snowflake

If you don’t want to become dependent on the Google Cloud with BigQuery or are already pursuing a multi-cloud strategy, Snowflake can be a valid alternative for building a cloud-native data warehouse. It offers dynamic scalability by separating computing power and storage requirements so that they can be adjusted independently of each other.

Compared to BigQuery, Snowflake is cloud-agnostic and can therefore be operated on common platforms such as AWS, Azure, or the Google Cloud. Although Snowflake also offers the option of scaling the hardware depending on requirements, there is no fully serverless scaling as with BigQuery. On the other hand, multi-cluster warehouses can be created across which the workload is distributed, thereby maximizing performance.
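
As a rough sketch of how the separation of compute and storage surfaces in practice (the warehouse name is hypothetical): compute is provisioned as virtual warehouses that can be created, suspended, and resized independently of the stored data.

-- Create a small virtual warehouse that suspends itself when idle
CREATE WAREHOUSE analytics_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;

-- Scale the compute up for a heavy workload without touching the stored data
ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'LARGE';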

On the cost side, the providers differ due to their architectures. Thanks to the complete management and automatic scaling of BigQuery, Google Cloud can bill on-demand queries based on the amount of data processed, so no computing capacity has to be provisioned directly. With Snowflake, on the other hand, the choice of cloud provider is free, and it usually boils down to a pay-as-you-go model in which storage and compute time are billed separately.

Overall, Snowflake offers a more flexible solution that can be hosted by various providers or even operated as a multi-cloud service. However, this requires greater knowledge of how to operate the system, as the resources have to be adapted independently. BigQuery, on the other hand, has a serverless model, which means that no infrastructure management is required.

Open-source alternatives to Hadoop

In addition to these large, fully fledged cloud data platforms, several powerful open-source frameworks have been developed as alternatives to Hadoop that specifically address its weaknesses, such as real-time data processing, performance, and complexity of administration. As we have already seen, Apache Spark is very powerful and can be used as a replacement for a Hadoop cluster, so we will not cover it again here.

Apache Flink

Apache Flink is an open-source framework that was specially developed for distributed stream processing so that data can be processed continuously. In contrast to Hadoop’s batch jobs or Spark’s so-called micro-batches, data can be processed in near real-time with very low latency. This makes Apache Flink an alternative for applications in which information is generated continuously and needs to be reacted to immediately, such as sensor data from machines.

While Spark Streaming processes the data in so-called mini-batches and thus simulates streaming, Apache Flink offers real streaming with an event-driven model that can process data just milliseconds after it arrives. This can further minimize latency as there is no delay due to mini-batches or other waiting times. For these reasons, Flink is much better suited to high-frequency data sources, such as sensors or financial market transactions, where every second counts.

Another advantage of Apache Flink is its advanced stateful processing. In many real-time applications, the context of an event plays an important role, such as the previous purchases of a customer for a product recommendation, and must therefore be saved. With Flink, this storage already takes place in the application so that long-term and stateful calculations can be carried out efficiently.

This becomes particularly clear when analyzing machine data in real-time, where previous anomalies, such as too high a temperature or faulty parts, must also be included in the current report and prediction. With Hadoop or Spark, a separate database must first be accessed for this, which leads to additional latency. With Flink, on the other hand, the machine’s historical anomalies are already stored in the application so that they can be accessed directly.

In conclusion, Flink is the better alternative for highly dynamic and event-based data processing. Hadoop, on the other hand, is based on batch processes and therefore cannot analyze data in real-time, as there is always latency while waiting for a completed data block.

Modern data warehouses

For a long time, Hadoop was the standard solution for processing large volumes of data. However, companies today also rely on modern data warehouses as an alternative, as these offer an optimized environment for structured data and thus enable faster SQL queries. In addition, there are a variety of cloud-native architectures that also offer automatic scaling, thus reducing administrative effort and saving costs.

In this section, we focus on the most common data warehouse alternatives to Hadoop and explain why they may be a better choice compared to Hadoop.

Amazon Redshift

Amazon Redshift is a cloud-based data warehouse that was developed for structured analyses with SQL. It is optimized for processing large relational data sets and allows fast column-based queries.

One of the main differences to traditional data warehouses is that data is stored in columns instead of rows, meaning that only the relevant columns need to be loaded for a query, which significantly increases efficiency. Hadoop, and HDFS in particular, is optimized for semi-structured and unstructured data and does not natively support SQL queries. This makes Redshift ideal for OLAP analyses in which large amounts of data need to be aggregated and filtered.

Another feature that increases query speed is the use of a Massive Parallel Processing (MPP) system, in which queries can be distributed across several nodes and processed in parallel. This achieves extremely high parallelization capability and processing speed.
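
How the data is spread across the MPP nodes is controlled at table creation time. The following is a hedged sketch (the table and column names are hypothetical and simply reuse the sales example from earlier): the distribution key determines which node stores which rows, so joins and aggregations on customer_id can largely be processed locally, while the sort key speeds up range filters on the date.

-- Distribution and sort keys steer how rows are placed and ordered across nodes
CREATE TABLE sales (
    customer_id VARCHAR(32),
    amount DOUBLE PRECISION,
    sale_date DATE
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (sale_date);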

In addition, Amazon Redshift integrates seamlessly into the existing AWS environment without the need for additional open-source tools, as is the case with Hadoop. Frequently used services are:

  • Amazon S3 offers direct access to large amounts of data in cloud storage.
  • AWS Glue can be used for ETL processes in which data is prepared and transformed.
  • Amazon QuickSight is a possible tool for the visualization and analysis of data.
  • Finally, machine learning applications can be implemented with the various AWS ML services.

Amazon Redshift is a real alternative to Hadoop, especially for relational queries, if you are looking for a managed, scalable data warehouse solution and already have an existing AWS environment or want to build your architecture on top of it. Thanks to its column-based storage and massively parallel processing, it also offers a real advantage for high query speeds on large volumes of data.

Databricks (lakehouse platform)

Databricks is a cloud platform based on Apache Spark that has been specially optimized for data analysis, machine learning, and artificial intelligence. It extends the functionality of Spark with an easy-to-use interface and optimized cluster management, and also offers the so-called Delta Lake, which provides better data consistency, scalability, and performance than Hadoop-based systems.

Databricks offers a fully managed environment in which Spark clusters in the cloud can be easily operated and automated. This eliminates the manual setup and configuration required for a Hadoop cluster. In addition, the use of Apache Spark is optimized so that batch and streaming processing run faster and more efficiently. Finally, Databricks also includes automatic scaling, which is very valuable in a cloud environment as it saves costs and improves scalability.

The classic Hadoop platforms have the problem that they do not fulfill the ACID properties and, therefore, the consistency of the data is not always guaranteed due to the distribution across different servers. With Databricks, this problem is solved with the help of the so-called Delta Lake:

  • ACID transactions: The Delta Lake ensures that all transactions fulfill the ACID guidelines, allowing even complex pipelines to be executed completely and consistently. This ensures data integrity even in big data applications.
  • Schema evolution: The data models can be updated dynamically so that existing workflows do not have to be adapted.
  • Optimized storage & queries: Delta Lake uses processes such as indexing, caching, or automatic compression to make queries many times faster compared to classic Hadoop or HDFS environments.
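
To illustrate what this looks like in practice, here is a minimal, hedged sketch in SQL on Databricks: a Delta table is created and then updated with a MERGE statement, which Delta Lake executes as a single ACID transaction (the sales_delta and sales_updates tables are purely hypothetical).

-- Create a Delta table; all subsequent operations on it are transactional
CREATE TABLE sales_delta (
    customer_id STRING,
    amount DOUBLE,
    sale_date DATE
) USING DELTA;

-- Upsert new and changed records in one atomic, consistent operation
MERGE INTO sales_delta AS target
USING sales_updates AS source
  ON target.customer_id = source.customer_id AND target.sale_date = source.sale_date
WHEN MATCHED THEN UPDATE SET target.amount = source.amount
WHEN NOT MATCHED THEN INSERT (customer_id, amount, sale_date)
  VALUES (source.customer_id, source.amount, source.sale_date);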

Finally, Databricks goes beyond the classic big data framework by also offering an integrated machine learning & AI platform. The most common machine learning frameworks, such as TensorFlow, scikit-learn, and PyTorch, are supported, so the stored data can be processed directly. As a result, Databricks offers a simple end-to-end pipeline for machine learning applications: from data preparation to the finished model, everything can take place in Databricks, and the required resources can be flexibly booked in the cloud.

This makes Databricks a valid alternative to Hadoop if a data lake with ACID transactions and schema flexibility is required. It also offers additional components, such as the end-to-end solution for machine learning applications. In addition, the cluster in the cloud can not only be operated more easily and more cheaply by automatically adapting the hardware to the requirements, but, thanks to its Spark foundation, it also offers significantly more performance than a classic Hadoop cluster.


In this part, we explored the Hadoop ecosystem, highlighting key tools like Hive, Spark, and HBase, each designed to enhance Hadoop’s capabilities for various data processing tasks. From SQL-like queries with Hive to fast, in-memory processing with Spark, these components provide flexibility for big data applications. While Hadoop remains a powerful framework, alternatives such as cloud-native solutions and modern data warehouses are worth considering for different needs.

This series has introduced you to Hadoop’s architecture, components, and ecosystem, giving you the foundation to build scalable, customized big data solutions. As the field continues to evolve, you’ll be equipped to choose the right tools to meet the demands of your data-driven projects.
