
WatchGuard unveils FireCloud Internet Access


WatchGuard® Technologies, a provider of unified cybersecurity, has announced the launch of FireCloud Internet Access, the first in what it’s describing as “a new family of hybrid secure access service edge (SASE) products”. The company said that FireCloud “uniquely meets the needs of hybrid organisations and WatchGuard’s partners by delivering consistency across Fireboxes and FireCloud with nearly identical configurations and no learning curve.”

Managing real-world cybersecurity means managing hybrid networks that combine traditional on-premises and Cloud/firewall-as-a-service (FWaaS) environments. Many vendors providing SASE solutions overlook the importance of integrating on-premises environments, which diminishes the value of deploying a SASE solution. When a SASE solution does not take these environments into account, it ends up creating isolated systems that are managed separately, leading to unnecessary complexity and overhead.

FireCloud Internet Access, WatchGuard said, is the “right answer” for hybrid environments because it integrates with WatchGuard Cloud and shares unified policy management with Firebox, combining firewall-as-a-service (FWaaS) and secure web gateway (SWG) to deliver robust protection without complexity. Furthermore, WatchGuard enables managed service providers (MSPs) to deliver a valuable SASE solution to their clients with an adoption model that fits their hybrid environments. This solution is part of the WatchGuard Unified Security Platform® architecture, which includes Identity, Network, and Endpoint security components, unified management in the WatchGuard Cloud, and a common installation framework for WatchGuard endpoints.

“FireCloud Internet Access provides real security for real-world challenges that today’s businesses face. As remote and distributed work environments evolve and companies transition to the Cloud, the range of threat surfaces and location of endpoints that need protection has expanded,” said Andrew Young, chief product officer at WatchGuard. “Existing solutions don’t allow security teams to seamlessly manage their network security in concert with their SASE deployments, creating security gaps and management complexities. To overcome these limitations, we have developed a new hybrid SASE approach which begins with FireCloud Internet Access.”

 

The FireCloud Internet Access Difference: In addition to being uniquely designed for hybrid Cloud/on-premises environments, FireCloud Internet Access also promises ease of deployment, flexible and scalable licensing and pricing, and integration into WatchGuard’s threat detection and response platform.

  • Designed for Hybrid – WatchGuard’s SASE architecture is one of the few solutions that is designed to deliver value and benefits to a hybrid environment. For lean IT teams or MSPs, this approach means easier management, consistent security controls, and lower costs over other SASE offerings.
  • Ease of Deployment – Administrators can configure and enforce security policies from a single interface, which simplifies management by using consistent policy structures and terminology. Security settings are automatically deployed to all WatchGuard-hosted points of presence (PoPs) worldwide, ensuring consistent policy enforcement no matter where the user is located. FireCloud clients are delivered from the WatchGuard Cloud, making them easy to deploy and manage.
  • Flexible and Scalable – The flexible pricing available with WatchGuard’s FlexPay helps partners build and grow a managed security services provider (MSSP) business. As a firewall-as-a-service, the number of users doesn’t impact performance, and more licenses can easily be added as customers grow.

WatchGuard is committed to delivering a complete SASE solution that meets partners’ and their clients’ needs. Over time, WatchGuard will build out and deploy its FireCloud family of solutions covering private access, SD-WAN, ZTNA, and CASB. Along the way, FireCloud customers will also benefit from soon-to-be-released integrations with ThreatSync+ software as a service (SaaS), delivering overwatch threat detection and response, and the client will be integrated with the soon-to-be-released WatchGuard Universal Agent, which simplifies device management. As always, WatchGuard said it will work closely with partners to determine the specific SASE needs of their clients.

“SASE is the future of secure connectivity, merging network and security functions into a Cloud-native service. With FireCloud Internet Access and its overall approach to hybrid SASE architecture, WatchGuard’s focus on delivering powerful cybersecurity solutions specially designed for MSPs is on full display,” said Kevin Willette, president of Verus. “This is an affordable and effective solution to protect our clients’ networks and users while still using the same enterprise security found in our Firebox, which makes my business more efficient and improves our bottom line.”

This news follows WatchGuard’s recent acquisition of ActZero, a leading provider of Managed Detection and Response (MDR) services, to accelerate MDR growth for MSP partners and extend their sales reach. WatchGuard, which received recognition from IT Awards, ChannelVision, Fortress Cybersecurity, InfoSec Awards, and TMCnet for its security solutions in 2024, continues to lead the industry in security innovation to offer MSPs more scalable, ready-to-sell solutions that drive revenue.

 


Allen Institute for AI (AI2) Releases OLMo 32B: A Fully Open Model to Beat GPT 3.5 and GPT-4o mini on a Suite of Multi-Skill Benchmarks


The rapid evolution of artificial intelligence (AI) has ushered in a new era of large language models (LLMs) capable of understanding and generating human-like text. However, the proprietary nature of many of these models poses challenges for accessibility, collaboration, and transparency within the research community. Additionally, the substantial computational resources required to train such models often limit participation to well-funded organizations, thereby hindering broader innovation.​

Addressing these concerns, the Allen Institute for AI (AI2) has introduced OLMo 2 32B, the latest and most advanced model in the OLMo 2 series. This model distinguishes itself as the first fully open model to surpass GPT-3.5 Turbo and GPT-4o mini across a suite of widely recognized, multi-skill academic benchmarks. By making all data, code, weights, and training details freely available, AI2 promotes a culture of openness and collaboration, enabling researchers worldwide to build upon this work.

OLMo 2 32B’s architecture comprises 32 billion parameters, reflecting a significant scaling from its predecessors. The training process was meticulously structured in two primary phases: pretraining and mid-training. During pretraining, the model was exposed to approximately 3.9 trillion tokens from diverse sources, including DCLM, Dolma, Starcoder, and Proof Pile II, ensuring a comprehensive understanding of language patterns. The mid-training phase utilized the Dolmino dataset, which consists of 843 billion tokens curated for quality, encompassing educational, mathematical, and academic content. This phased approach ensured that OLMo 2 32B developed a robust and nuanced grasp of language.

A notable aspect of OLMo 2 32B is its training efficiency. The model achieved performance levels comparable to leading open-weight models while utilizing only a fraction of the computational resources. Specifically, it required approximately one-third of the training compute compared to models like Qwen 2.5 32B, highlighting AI2’s commitment to resource-efficient AI development. ​

In benchmark evaluations, OLMo 2 32B demonstrated impressive results. It matched or exceeded the performance of models such as GPT-3.5 Turbo, GPT-4o mini, Qwen 2.5 32B, and Mistral 24B. Furthermore, it approached the performance levels of larger models like Qwen 2.5 72B and Llama 3.1 and 3.3 70B. These assessments spanned various tasks, including Massive Multitask Language Understanding (MMLU), mathematics problem-solving (MATH), and instruction-following evaluations (IFEval), underscoring the model’s versatility and competence across diverse linguistic challenges. ​

The release of OLMo 2 32B signifies a pivotal advancement in the pursuit of open and accessible AI. By providing a fully open model that not only competes with but also surpasses certain proprietary models, AI2 exemplifies how thoughtful scaling and efficient training methodologies can lead to significant breakthroughs. This openness fosters a more inclusive and collaborative environment, empowering researchers and developers globally to engage with and contribute to the evolving landscape of artificial intelligence.






The Ethical Implications of AI in Personal Interactions



Introduction

Artificial intelligence has transformed nearly every aspect of our lives, from how we shop to how we communicate. But perhaps one of the most fascinating developments lies in its role in personal interactions. AI-powered tools and applications have started to serve as companions, emotional support systems, and even romantic partners.

This progress sparks excitement but also raises pressing questions about ethical boundaries. As we embrace this AI-driven world, understanding the implications of these technologies is crucial for shaping a future where innovation is balanced with responsibility.

Understanding AI in Personal Interactions

AI in personal interactions refers to technology designed to simulate or enhance human connection. Think of chatbots, virtual assistants, and AI-driven matchmaking platforms that foster communication or companionship.

Examples include:

  • Virtual companions, such as AI girlfriend chatbots, which simulate emotional engagement.
  • Smart assistants like Siri and Alexa, blending functionality with conversational interaction.
  • Mental health support tools, such as AI-based therapy chatbots.

What sets these apart is their ability to process natural language, learn from behavior, and adapt responses to mimic human emotions. These capabilities blur the line between tool and companion.

Key Ethical Considerations

AI in personal interactions raises significant ethical questions. Here’s a closer look at some of the main concerns:

Privacy Concerns: AI applications often require substantial data to function effectively. But how is this data collected, and who controls it?

  • Risks: Sensitive information might be misused or shared without consent.
  • Solutions: Developers need to prioritize transparency in data policies and offer users control over their data.

Emotional Manipulation: AI tools, especially apps designed for emotional support, are built to foster connection. However, creating emotional dependency poses risks.

  • Over-reliance on AI can affect real-world relationships.
  • Manipulative algorithms could exploit vulnerable users for profit or influence.

Bias in Algorithms: AI systems are only as unbiased as the data they’re trained on.

  • Impact: Biased responses can reinforce stereotypes or exclude certain user groups.
  • Solution: Diverse training data and regular audits of AI systems are essential.

Accountability and Transparency: If an AI chatbot causes harm—be it emotional or financial—who is responsible?

  • Developers? Users? The AI itself?
  • Clear accountability structures are crucial as we move forward.

Societal Impact of AI in Personal Interactions

AI isn’t just changing individual lives—it’s reshaping society.

Positive Impacts:

  • Reduced loneliness through AI companions such as girlfriend chatbots.
  • Enhanced accessibility for individuals with disabilities via voice-assisted technologies.
  • Improved mental health support with AI-based counseling.

Negative Impacts:

  • Over-reliance on AI may weaken human relationships.
  • AI’s role in workplaces might lead to job displacement in communication-heavy roles like customer service.

Example:
Consider the rise of AI in dating apps. While AI matchmaking is convenient, it can commodify relationships and set unrealistic expectations for human interactions.

Ethical Frameworks and Guidelines

Creating a strong ethical framework is critical to mitigating risks while leveraging AI’s benefits.

Current Efforts:

  • Governments and tech companies are working on AI-specific regulations to ensure responsible use.
  • Initiatives addressing the ethics of AI in adult content creation aim to set boundaries for sensitive areas.

Key Guidelines:

  • Transparency: Users should know when they’re interacting with AI versus a human.
  • Consent: Explicit permission must be sought for collecting and using personal data.
  • Fairness: Systems should be inclusive and accessible to all demographics.

Future Trends and Ethical Challenges

AI is advancing rapidly, and with it comes new opportunities—and challenges.

Emerging Trends:

  • Real-time emotion analysis in AI companions, enabling more tailored interactions.
  • Advanced AI girlfriend chatbots integrating augmented reality for immersive experiences.
  • Widespread adoption of AI apps for personalized mental health support.

Ethical Challenges:

  • How do we ensure AI doesn’t perpetuate harmful stereotypes?
  • How do we define boundaries for emotional attachment to AI systems?
  • What happens when AI begins to replace human relationships entirely?

Balancing Innovation and Ethics

Achieving harmony between innovation and ethics requires collaboration from developers, users, and regulators.

What Companies Can Do:

  • Invest in ethical AI research and development.
  • Be transparent about how AI systems are trained and used.

What Users Can Do:

  • Stay informed about the AI systems they engage with.
  • Advocate for ethical practices and responsible AI development.

Ultimately, it’s about building trust—ensuring AI serves as a tool for good while respecting human dignity.

Conclusion

As AI continues to redefine personal interactions, it’s essential to address its ethical implications. From user experiences with AI girlfriend chatbots to the ethics of AI in adult content creation, these technologies hold immense potential—but only if developed responsibly.

By embracing transparency, fairness, and accountability, we can ensure that AI enhances human lives without compromising our values. Let’s shape a future where AI complements, not replaces, our humanity.

One Turn After Another | Towards Data Science


While some games, like rock-paper-scissors, only work if all players decide on their actions simultaneously, other games, like chess or Monopoly, expect the players to take turns one after another. In game theory, the first kind of game is called a static game, while turn-taking is a property of so-called dynamic games. In this article, we will analyse the latter with methods from game theory.

This article is the fourth part of a four-chapter series on the fundamentals of game theory. I recommend reading the first three articles if you haven’t done so yet, as the concepts shown here build on the terms and paradigms introduced there. But if you are already familiar with the core fundamentals of game theory, don’t let yourself be stopped, and go ahead!

Dynamic games

Dynamic games can be visualized as trees. Photo by Adarsh Kummur on Unsplash

While so far we only looked at static games, we will now introduce dynamic games, where players take turns. As before, such games include a number of players n, a set of actions for each player, and a reward function that assesses the actions of a player given the other players’ actions. Beyond that, for a dynamic game, we need to define an order in which the players take their turns. Consider the following tree-like visualization of a dynamic game.

A visualization of a dynamic game. Figure by author.

At the top we have a node where player 1 has to decide between two actions L and R. This determines whether to follow the left part or the right part of the tree. After player 1’s turn, player 2 takes their turn. If player 1 chooses L, player 2 can decide between l1 and r1. If player 1 chooses R, player 2 has to decide between l2 and r2. At the leaves of the tree (the nodes at the bottom), we see the rewards just like we had them in the matrix cells in static games. For example, if player 1 decides for L and player 2 decides for r1, the reward is (1,0); that is, player 1 gets a reward of 1, and player 2 gets a reward of 0. 

I bet you are eager to find the Nash equilibrium of this game, as this is what game theory is mainly about (if you still struggle with the concept of a Nash equilibrium, you might want to take a look back at chapter 2 of this series). To do that, we can transform the game into a matrix, as we already know how to find a Nash equilibrium in a game displayed as a matrix. Player 1 decides on the row of the matrix, player 2 decides on the column, and the values in the cell then specify the rewards. However, there is one important point to notice. When we look at the game displayed as a tree, player 2 decides on their action after player 1 does and hence only cares about the part of the tree that is actually reached. If player 1 chooses action L, player 2 only decides between l1 and r1 and doesn’t care about l2 and r2, because these actions are out of the question anyway. However, when we search for a Nash equilibrium, we need to be aware of what would happen if player 1 changed their action. Therefore, we must know what player 2 would have done if player 1 had chosen a different option. That is why we have four columns in the following matrix, to always account for decisions in both parts of the tree.

A column like (r1,l2) can be read as “player 2 chooses r1 if player 1 chose L and chooses l2 if player 1 chose R”. On this matrix, we can search for the best answers. For example, the cell (L, (l1,l2)) with reward 3,1 is a best answer: player 1 has no reason to change from L to R because that would lower their reward (from 3 to 1), and player 2 has no reason to change either because none of the other options is better (one is as good, though). In total, we find three Nash equilibria: (L,(l1,l2)), (L,(l1,r2)) and (R,(r1,r2)).

The chocolate-pudding market

We will talk about chocolate pudding now. But also about game theory. Photo by American Heritage Chocolate on Unsplash

Our next example brings the idea of dynamic games to life. Let’s assume player 2 is a market-leading retailer of chocolate pudding. Player 1 also wants to build up a business but isn’t sure yet whether to join the chocolate pudding market or to sell something else. In our game, player 1 has the first turn and can decide between two actions: join the market (i.e., sell chocolate pudding) or don’t join the market (i.e., sell something else). If player 1 decides to sell something other than chocolate pudding, player 2 stays the market-dominating retailer for chocolate pudding and player 1 makes some money in the other area they decided on. This is reflected by the reward 1,3 in the right part of the tree in the following figure.

The market-game as a dynamic game. Figure by author. 

But what if player 1 is greedy for the unimaginable riches that lie dormant on the chocolate pudding market? If they decide to join the market, it is player 2’s turn. They can decide to accept the new competitor, give in and share the market. In this case, both players get a reward of 2. But player 2 can also decide to start a price war to demonstrate his superiority to the new competitor. In this case, both players get a reward of 0, because they ruin their profit due to dumping prices. 

Just like before, we can turn this tree into a matrix and find the Nash equilibria by searching for the best answers:

If player 1 joins the market, the best option for player 2 is to give in. This is an equilibrium because no player has any reason to change. For player 1 it does not make sense to leave the market (that would give a reward of 1 instead of 2), and for player 2 it is no good idea to switch to fighting either (which would give a reward of 0 instead of 2). The other Nash equilibrium happens when player 1 just doesn’t join the market. However, this scenario includes player 2’s decision to fight if player 1 had chosen to join the market instead. Player 2 basically makes a threat and says, “If you join the market, I will fight you.” Remember that previously we said we need to know what the players would do even in the cases that don’t appear to happen? Here we see why this is important. Player 1 needs to assume that player 2 would fight, because that is the only reason for player 1 to stay out of the market. If player 2 didn’t threaten to fight, we wouldn’t have a Nash equilibrium, because then joining the market would become a better option for player 1.

But how reasonable is this threat? It keeps player 1 outside the market, but what would happen if player 1 didn’t believe the threat and decided to still join the market? Would player 2 really carry out his threat and fight? That would be very silly because it would give him a reward of 0, whereas giving in would give a reward of 2. From that perspective, player 2 used an empty threat that is not very reasonable. If the case really occurs, he wouldn’t carry it out anyway, would he?

Subgame perfect equilibrium

For a subgame perfect equilibrium, before you get the whole picture, you need to start with small parts of the game. Photo by Ben Stern on Unsplash

The previous example showed that sometimes Nash equilibria occur that are not very reasonable within the game. To cope with this problem, a stricter concept of equilibrium has been introduced, called the subgame perfect equilibrium. It adds stricter conditions to the notion of an equilibrium. Hence, every subgame perfect equilibrium is a Nash equilibrium, but not all Nash equilibria are subgame perfect.

A Nash equilibrium is subgame perfect if it induces a Nash equilibrium in every subgame of the game. What does that mean? First, we have to understand that a subgame is a part of the game’s tree that starts at any node. For example, if player 1 chooses L, the remainder of the tree under the node reached by playing L is a subgame. Likewise, the tree that comes after the node of action R is a subgame. Last but not least, the whole game is always a subgame of itself. As a consequence, the example we started with has three subgames, which are marked in grey, orange and blue in the following:

The market game has three subgames. Figure by author.

We already saw that this game has three Nash equilibria, namely (L,(l1,l2)), (L,(l1,r2)) and (R,(r1,r2)). Let us now find out which of these are subgame perfect. To this end, we investigate the subgames one after another, starting with the orange one. If we only look at the orange part of the tree, there is a single Nash equilibrium, which occurs if player 2 chooses l1. If we look at the blue subgame, there is also a single Nash equilibrium, which is reached when player 2 chooses r2. That tells us that in every subgame perfect Nash equilibrium, player 2 has to choose option l1 if we arrive in the orange subgame (i.e., if player 1 chooses L) and option r2 if we arrive in the blue subgame (i.e., if player 1 chooses R). Only one of the previous Nash equilibria fulfills this condition, namely (L,(l1,r2)). Hence this is the only subgame perfect Nash equilibrium of the whole game. The other two versions are Nash equilibria as well, but they are somewhat illogical in the sense that they contain some kind of empty threat, as we saw in the chocolate pudding market example before. The method we just used to find the subgame perfect Nash equilibrium is called backward induction, by the way.
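Backward induction is also easy to express in code. Below is a minimal sketch (not from the original article) that applies it to the chocolate-pudding market game from the previous section, using the payoffs described there: don’t join = (1, 3), join and give in = (2, 2), join and fight = (0, 0).

```python
# Backward induction on the chocolate-pudding market game.
# Payoffs are (player 1, player 2).

def backward_induction():
    # Player 2's subgame (only reached if player 1 joins the market):
    # player 2 picks the action that maximizes their own payoff.
    player2_options = {"give in": (2, 2), "fight": (0, 0)}
    best_p2_action = max(player2_options, key=lambda a: player2_options[a][1])
    join_payoff = player2_options[best_p2_action]

    # Player 1 compares joining (anticipating player 2's best reply)
    # with staying out of the market.
    player1_options = {"join": join_payoff, "don't join": (1, 3)}
    best_p1_action = max(player1_options, key=lambda a: player1_options[a][0])

    return best_p1_action, best_p2_action, player1_options[best_p1_action]

print(backward_induction())  # ('join', 'give in', (2, 2))
```

Running it reproduces the subgame perfect equilibrium discussed above: player 2 would give in, so player 1 joins the market.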

Uncertainty

In dynamic games, it can happen that you have to make decisions without knowing exactly what node of the game you are in. Photo by Denise Jans on Unsplash

So far in our dynamic games, we always knew which decisions the other players made. For a game like chess, this is the case indeed, as every move your opponent makes is perfectly observable. However, there are other situations in which you might not be sure about the exact moves the other players make. As an example, we go back to the chocolate pudding market. You take the perspective of the retailer that is already in the market and you have to decide whether you would start fighting if the other player joins the market. But there is one thing you don’t know, namely how aggressive your opponent will be. When you start fighting, will they be frightened easily and give up? Or will they be aggressive and fight you until only one of you is left? This can be seen as a decision made by the other player that influences your decision. If you expect the other player to be a coward, you might prefer to fight, but if they turn out to be aggressive, you would rather want to give in (reminds you of the birds fighting for food in the previous chapter, doesn’t it?). We can model this scenario in a game like this: 

A dynamic game with a hidden decision (indicated by the dotted circle). Figure by author.

The dotted circle around the two nodes indicates that these are hidden decisions that are not observable to everyone. If you are player 2, you know whether player 1 joined the market or not, but if they joined, you don’t know whether they are aggressive (left node) or moderate (right node). Hence you act under uncertainty, which is a very common ingredient in many games you play in the real world. Poker would become very boring if everybody knew everyone’s cards; that’s why there is private information, namely the cards in your hand that only you know about.

Now you still have to decide whether to fight or give in, although you are not exactly sure which node of the tree you are in. To do that, you have to make assumptions about the likelihood of each state. If you are quite certain that the other player is behaving moderately, you might be up for a fight, but if you assume them to be aggressive, you might prefer giving in. Say there is a probability p that the other player is aggressive and 1-p that they behave moderately. If you assume p to be high, you should give in, but if p becomes smaller, there should be a point where your decision switches to fighting. Let’s try to find that point. In particular, there should be a sweet spot in between where the probability of the other player being aggressive vs. moderate is such that fighting and giving in are equally good alternatives. That is, the expected rewards would be equal, which we can model as follows:

Do you see how this formula is derived from the rewards for fighting or giving in in the different leaves of the tree? This formula solves to p=1/3, so if the probability of the other player being aggressive is 1/3 it would make no difference whether to fight or give in. But if you assume the other player to be aggressive with a probability of more than 1/3, you should give in, and if you assume aggressiveness to be less likely than 1/3, you should fight. This is a chain of thought you also have in other games where you act under uncertainty. When you play poker, you might not calculate the probabilities exactly, but you ask yourself, “How likely is it that John has two kings on his hand?” and depending on your assumption of that probability, you check, raise or give up. 
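For reference, here is one way the indifference condition can be written out. The leaf rewards below are assumed for illustration (they are not visible in the text version of the figure): giving in yields 2 for you in either case, while fighting yields 0 against an aggressive opponent and 3 against a moderate one. Any payoffs with this structure lead to the same threshold.

```latex
% Assumed illustrative rewards for player 2 (the incumbent retailer):
% fight: 0 vs. aggressive, 3 vs. moderate; give in: 2 in both cases.
\underbrace{p \cdot 0 + (1 - p) \cdot 3}_{\text{expected reward for fighting}}
= \underbrace{p \cdot 2 + (1 - p) \cdot 2}_{\text{expected reward for giving in}}
\;\Longrightarrow\; 3 - 3p = 2
\;\Longrightarrow\; p = \tfrac{1}{3}
```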

Summary & outlook

Your journey on the seas of game theory has only just begun. There is so much more to explore. Photo by George Liapis on Unsplash

Now we have learned a lot about dynamic games. Let us summarize our key findings. 

  • Dynamic games include an order in which players take turns. 
  • In dynamic games, the players’ possible actions depend on the previously executed actions of the other players. 
  • A Nash equilibrium in a dynamic game can be implausible, as it contains an empty threat that would not be rational.
  • The concept of subgame perfect equilibria prevents such implausible solutions. 
  • In dynamic games, decisions can be hidden. In that case, players may not exactly know which node of the game they are in and have to assign probabilities to different states of the games. 

With that, we have reached the end of our series on the fundamentals of game theory. We have learned a lot, yet there are plenty of things we haven’t been able to cover. Game theory is a science in itself, and we have only been able to scratch the surface. Other concepts that expand the possibilities of game-theoretic analyses include: 

  • Analysing games that are repeated multiple times. If you play the prisoner’s dilemma multiple times, you might be tempted to punish the other player for having betrayed you in the previous round. 
  • In cooperative games, players can conclude binding contracts that determine their actions to reach a solution of the game together. This is different from the non-cooperative games we looked at, where all players are free to decide and maximize their own reward. 
  • While we only looked at discrete games, where each player has a finite number of actions to choose from, continuous games allow an infinite number of actions (e.g., any number between 0 and 1). 
  • A big part of game theory considers the usage of public goods and the problem that individuals might consume these goods without contributing to their maintenance. 

These concepts allow us to analyse real-world scenarios from various fields such as auctions, social networks, evolution, markets, information sharing, voting behaviour and much more. I hope you enjoyed this series and find meaningful applications for the knowledge you gained, be it the analysis of customer behaviour, political negotiations or the next game night with your friends. From a game theory perspective, life is a game!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in the English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

Hyperparameter Optimization For LLMs: Advanced Strategies


Finding an optimal set of hyperparameters is essential for efficient and effective training of Large Language Models (LLMs).

The key LLM hyperparameters influence the model size, learning rate, learning behavior, and token generation process.

Due to their computational demands, traditional methods for optimizing hyperparameters, such as grid search, are impractical for LLMs.

Advanced hyperparameter optimization strategies, like population-based training, Bayesian optimization, and adaptive LoRA, promise to balance computational effort and outcome.

The rise of large language models (LLMs) is bringing advances in text generation and contextual understanding. Hyperparameters control the size of LLMs, their training process, and how they generate outputs.

An optimal combination of hyperparameters is fundamental to efficiently pre-training and fine-tuning LLMs. Since LLM training is computationally intensive, exhaustive experimentation is not viable. This rules out traditional machine-learning hyperparameter optimization (HPO) methods that rely on systematically exploring the hyperparameter space by training many models with slightly different configurations.

When configuring models and training processes, LLM developers rely on a thorough understanding of each hyperparameter’s influence, insights from fundamental research, and empirical evidence gained from training state-of-the-art foundation models. Methods for estimating optimal hyperparameter values with limited compute budgets and adapting hyperparameters throughout the training process can help pre-training and fine-tuning.

After reading this article, you’ll be able to answer the following questions:

  • What key hyperparameters should be considered when developing, training, and applying LLMs?
  • How does each hyperparameter influence the LLM, and which trade-offs do we need to be aware of?
  • How can we select an optimal combination of hyperparameters in our scenario without fully training multiple model variants?
  • What advanced hyperparameter optimization techniques are available for LLMs, and when can we apply them?

LLM hyperparameters

A hyperparameter is a configuration value that controls the behavior of a machine-learning model during the training or inference process. Unlike model parameters (the weights), which are learned directly from the training data, hyperparameters are defined by the model developers. A hyperparameter can be constant or adjusted dynamically according to predefined rules or schedules.

Model size

In the case of LLMs, we often work with pre-trained models, where the activation functions, internal architecture of layers or blocks, and their connections—all examples of hyperparameters—are fixed. If our pre-trained LLM of choice is available in different sizes, the model size is the only hyperparameter affecting the model’s makeup we can actively control.

The size of an LLM refers to the total number of parameters it contains, which influences the model’s capacity to understand and generate complex language patterns. Hyperparameters set and tuned during pre-training influence the total size of an LLM.

One hyperparameter influencing a model’s size is its depth, corresponding to the total number of layers stacked sequentially. Each additional layer in an LLM adds more parameters, such as the weights for the self-attention mechanism and feed-forward layers in a transformer block.

Another hyperparameter influencing an LLM’s size is its hidden size, which refers to the dimensionality of the token embeddings and the internal representations within each layer. The hidden size determines how richly the model can encode information about each input token and how effectively it can process complex language patterns. A larger hidden size means each token is represented in a higher-dimensional space, allowing the model to capture more detailed semantic and syntactic nuances.

Further, the number of parallel attention heads in each transformer block influences the size of the LLM. Multiple heads allow the model to focus on different input aspects simultaneously. Through multi-query and grouped-query attention, we can reduce the number of necessary parameters.

Finally, the vocabulary size and context window (maximum sequence length) also impact the model’s size. They determine the language diversity a model can handle and the context length it can maintain, respectively.

These hyperparameters, set before beginning the training process and unable to be changed later, determine the model size. For example, GPT-3 has 96 layers, a hidden size of 12,288, 96 attention heads, a vocabulary of 50,257 tokens, and a context window of 2,048 tokens, resulting in a total of 175 billion parameters.
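As a rough sanity check, we can estimate the parameter count from these architectural hyperparameters. The sketch below uses the common approximation of about 12 · d² parameters per transformer layer (attention plus feed-forward, ignoring biases and layer norms) plus the embeddings; treat it as an approximation, not GPT-3’s exact breakdown.

```python
def approx_param_count(n_layers, hidden_size, vocab_size, context_window):
    """Rough transformer parameter estimate: ~12 * d^2 per layer
    (4*d^2 for attention + 8*d^2 for the feed-forward block), plus
    token and position embeddings. Ignores biases and layer norms."""
    per_layer = 12 * hidden_size ** 2
    embeddings = (vocab_size + context_window) * hidden_size
    return n_layers * per_layer + embeddings

# GPT-3 hyperparameters from the text: 96 layers, hidden size 12,288,
# 50,257-token vocabulary, 2,048-token context window.
print(f"{approx_param_count(96, 12_288, 50_257, 2_048) / 1e9:.0f}B parameters")
# ~175B, matching the reported total.
```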

Learning rate

The learning rate (LR) is a critical hyperparameter in training LLMs. Optimizing it is essential for efficient learning, stable convergence, and good generalization to unseen data.

The learning rate determines how much model weights are changed during each update. A high learning rate helps speed up the training process but increases the risk of instability and overfitting. A low learning rate increases stability and tends to benefit generalization but leads to slow training.

In the case of LLMs, the learning rate is typically not constant but varies as training progresses. This variation is governed by a learning rate schedule (LRS). The schedule is usually tied to the number of tokens seen—either directly, or indirectly through the number of samples, steps, or epochs. At a high level, it contains phases of a rising, constant, and decreasing learning rate.

How does the learning rate affect training duration and quality?

Following theoretical work by Stanford researcher Kaiyue Wen and colleagues published in December 2024, we can think of LLM training as progressing along a loss landscape that looks like a river valley. They hypothesize that the existence and overall direction of the river are due to the facts and knowledge an LLM learns, which are reflected as highly deterministic and, therefore, easy-to-predict tokens. The valley slopes arise from flexibility and ambiguity inherent to language, i.e., hard-to-predict tokens.

Visualization of LLM training as traveling down a river valley. Using a stable but high learning rate ensures quick progress down the river but leads to jumps between relatively high loss values. Reducing the learning rate during a subsequent decay phase brings the model towards a local loss minimum. | Source

In this picture, the training goal is to reach the river mouth, at which point we should be as close to the bottom of the valley as possible. The first crucial insight is that it does not matter whether we stay at the bottom of the valley until then. Thus, if we can make faster progress down the river by bouncing back and forth between points high up the loss valley’s slopes, we can do this without affecting the final outcome.

Thus, we should aim to use a high learning rate—resulting in large steps towards the loss minimum but leading to wildly fluctuating loss values—for as long as possible. Towards the end of the training, the learning rate should be decreased to a very low value. This will slow down progress towards the river mouth but reduce the oscillations to a point where we constantly stay at the valley’s bottom, i.e., the local loss minimum.

However, all of this is only going to work if we are already in a sufficiently deep loss river valley. When training is first starting, a high learning rate will lead to undirected jumps across the loss landscape. To avoid this, learning rate schedules for LLMs start with a small learning rate and slowly ramp it up to its maximum value. This is called the warmup phase.

Cosine schedule

The cosine schedule (also known as cosine decay or cosine annealing) implements this approach by starting with a linear warmup phase that brings the learning rate to its maximum value, followed by a slow decay following the cosine function:

LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π t/T))

Here, LRmin and LRmax are the minimum and maximum learning rates, t is the training step, and T is the total number of training steps. The advantage of this schedule is that it stays close to the peak learning rate for a long time, and the final decay is gradual. It’s also easy to implement, as it depends on just three hyperparameters (LRmax, LRmin, and T) linked by the cosine function.
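As an illustration, here is a minimal implementation of the warmup-plus-cosine-decay schedule described by the formula above; the step counts and learning rate values are placeholders, not taken from any specific model.

```python
import math

def cosine_schedule(step, total_steps, lr_max, lr_min=0.0, warmup_steps=0):
    """Linear warmup to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    t = step - warmup_steps                      # steps since warmup ended
    T = max(total_steps - warmup_steps, 1)       # length of the decay phase
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

# Placeholder values: 100 warmup steps, then cosine decay over the rest.
lrs = [cosine_schedule(s, total_steps=1_000, lr_max=6e-5, lr_min=6e-6,
                       warmup_steps=100) for s in range(1_000)]
```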

Cosine schedules have been highly popular for pretraining LLMs. For example, it was used for BLOOM, a 176-billion-parameter multilingual model developed by the BigScience Research Workshop and released in 2022. In an initial warmup phase, the learning rate was ramped to a peak of 6 × 10⁻⁵ over 375 million tokens. Afterward, it was lowered to 10% of this value with cosine decay over 410 billion tokens and remained at this value. The implementation and detailed description are publicly accessible in BLOOM’s GitHub repository.

For pre-training their Llama 3 405B model, Meta used a slightly more involved variant of the cosine schedule. In the first stage, a warm-up phase of up to 8,000 steps brought the learning rate to a maximum of 8 × 10⁻⁵. Subsequently, the learning rate decreased to 8 × 10⁻⁷ over 1.2 million steps with a cosine decay. After the second stage, focused on training the LLM up to its final context length of 128,000 tokens, the learning rate linearly decreased to 0 over 40 million tokens in the third stage. Supervised fine-tuning was conducted over about 9,000 steps with a learning rate of 10⁻⁵.

A major disadvantage of the cosine schedule is that the total number of training steps has to be known beforehand. When training large foundation models, the total compute budget is typically set, and the optimal number of training tokens can be estimated. However, when fine-tuning or experimenting, it would be preferable to base the decision on when to end training on the model’s performance.

Warmup-stable-decay schedule

The warmup-stable-decay (WSD) schedule is a simple protocol introduced by Shengding Hu and colleagues at Tsinghua University in 2024. It starts with a linear warmup to the maximum learning rate, keeps the learning rate constant for the majority of the training, and ramps it down at the end.

Through experiments, they found that a decay phase that makes up 10% of the total length is sufficient. They also demonstrated that a WSD schedule leads to a lower loss than a cosine schedule. According to Wen and colleagues at Stanford, this can readily be understood in the river valley picture. In the WSD schedule, the learning rate stays at a high value longer than in the cosine schedule. Hence, we make it further down the valley before dropping to its bottom. Further, their analysis shows that training progress in the stable phase is dominated by learning to predict deterministic tokens (facts and knowledge), while in the decay phase, the LLM learns the stochastic tokens (language variability).

Comparison of the loss curves resulting from a cosine and warmup-stable-decay (WSD) learning rate schedule. In the WSD schedule, the learning rate remains at a constant high value during the stable phase. This leads to high intermediate loss values as the loss fluctuates around the local minimum as it progresses towards lower values. During the final 10% of the total training steps, the learning rate is decreased to its minimum, leading to a sharp drop in the loss. Since the learning rate remained at a high value for longer, the final loss resulting from the WSD schedule is smaller than the loss from the cosine schedule. | Source

While a WSD schedule yields a lower loss for the same training budget, knowing the total number of training steps ahead of time is still required for scheduling the decay phase. However, the WSD schedule offers a straightforward way to extend the total number of training steps retroactively: If we find that our final model’s performance is unsatisfactory, we can resume training from a model snapshot taken at the end of the stable phase. This beams us back a small distance up the loss river valley, from where we continue making large jumpy steps towards the river mouth as if we had never descended down to the valley’s bottom in the first place.

Restarting this way, we still benefit from 90% of the compute budget spent so far. It allows us to determine the compute budget we need as we go, producing fully trained intermediate models—something that the cosine schedule inherently does not allow for.
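A sketch of the WSD schedule in the same style as the cosine example above; the warmup fraction and the linear shape of the final ramp-down are illustrative choices rather than prescriptions from the paper.

```python
def wsd_schedule(step, total_steps, lr_max, lr_min=0.0,
                 warmup_frac=0.01, decay_frac=0.10):
    """Warmup-stable-decay: linear warmup, a long constant phase,
    and a ramp down over the final decay_frac of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    if step < decay_start:
        return lr_max                                # stable phase
    progress = (step - decay_start) / max(total_steps - decay_start, 1)
    return lr_max + (lr_min - lr_max) * progress     # final decay phase
```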


Cyclical cosine schedule

Returning to a high learning rate after decaying to a minimum is not a new idea in machine learning. Long established in gradient-free optimization, it was made popular for deep learning training through the “Stochastic Gradient Descent with Warm Restarts” technique proposed by Ilya Loshchilov and Frank Hutter in 2017. The learning rate is governed by a function very similar to the one for the cosine schedule:

LR(t) = LRmin + 0.5 (LRmax − LRmin) (1 + cos(π (t mod T)/T))

This time, T is not the total number of training steps but is understood as the schedule’s period. For example, we might train for 10,000 steps with T = 1,000, leading to ten consecutive cosine decay cycles. Commonly, LRmax is set to a new, lower value at the beginning of each cycle.
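Implementing the cyclical variant only requires replacing the total step count with the cycle period; a minimal sketch with illustrative values (lowering LRmax from cycle to cycle, as mentioned above, is omitted for brevity):

```python
import math

def cyclical_cosine(step, period, lr_max, lr_min=0.0):
    """Cosine decay that restarts from lr_max at the start of every cycle."""
    t = step % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))

# Example: ten cycles of 1,000 steps each (illustrative values).
lrs = [cyclical_cosine(s, period=1_000, lr_max=6e-5) for s in range(10_000)]
```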

In the loss landscape river valley, we’re climbing down to the bottom over T steps, making ever slower progress down the river as we keep closer to the bottom. Then, we immediately go back to make large jumps toward the river mouth high up the valley’s slopes.

Right at the beginning of a new cosine cycle, the loss will be significantly higher than it was previously. This could be due to the jump in the learning rate, which might perturb the model. However, Wen and colleagues argue, based on their experiments and theoretical insights, that it is the result of training with a small learning rate for too long.

Whatever the cause, this doesn’t just make training less efficient. It’s also an obstacle to continue model training later. Whether we aim to further pre-train on newly acquired or different data, fine-tune an LLM, or incrementally evolve a model in a continual learning scenario—ideally, we could take a model snapshot and train it effectively, making the most of the compute budget we have available and the compute budget we have already spent. The learning rate schedule used during pretraining directly impacts this.

Cyclical warmup-stable-decay schedule

The Warmup-Stable-Decay (WSD) schedule allows continuing training from the final model checkpoint of the stable phase without incurring a loss penalty. This preserves a large fraction of the compute budget spent, as we only have to discard what we spent on intermediate decay phases. But this is not negligible at the scale of LLM pretraining, where the costs regularly exceed tens of millions of US dollars.

As Wen and colleagues found, starting from the final decay phase model checkpoint in a WSD schedule does not cause the same loss penalty as the cosine schedule. As the WSD schedule’s decay phase is rather short, they hypothesize it does not have the same destructive effect as the cosine schedule’s long and slow decay. Given a total compute budget, consecutively repeating the WSD cycle is more efficient than restarting from the final checkpoint of the latest stable phase.

A cyclical WSD schedule is easier to implement than WSD restarts, as the model evolves continuously down the loss landscape river valley, and no prior checkpoints have to be reloaded. It also helps downstream users, who initially often utilize few-shot prompting to adapt an LLM to their use case. If they later decide to fine-tune it, and the LLM is trained with a WSD schedule, training the same model checkpoint they already use for inference is efficient.

Learning behavior

In a neural network, the weights are the parameters of its neurons learned during training. In an LLM, weights include the query, key, and value matrices in the attention heads and the activation function parameters in the feed-forward layers. While the learning rate governs the scale of changes made to the model’s weights, we can also control how the weights change on a more fine-grained level.

Weight decay

Employing weight decay during training penalizes large weights, preventing small parts of the model from dominating its output. Weight decay in stochastic gradient descent is implemented by adding a term to the loss function. For example, using L2 regularization, the adapted loss function looks like this:

L = Lorig + λ Σi wi²

Here, Lorig is the original loss function, λ is the weight decay factor, and wi are the model weights.

Weight decay has been applied to transformer-based NLP models since the beginning. In the seminal 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, the authors state that they trained the model using “Adam with [a] learning rate of 1e-4, β₁=0.9, β₂=0.999, L2 weight decay of 0.01, learning rate warm up over the first 10,000 steps, and linear decay of the learning rate.”

As Ilya Loshchilov and Frank Hutter point out in their 2019 paper Decoupled Weight Decay Regularization, in adaptive optimizers like Adam, L2 regularization and weight decay are not identical, and L2 regularization is not effective. In Adam, the gradient of the regularization term is scaled along with the gradient of Lorig, which leads to minimal regularization for weights whose gradients are large. They introduced the AdamW optimizer, which makes the weight decay term independent of the gradient-based update. AdamW is widely used for LLMs, such as for training Megatron-LM (2019), Llama 1 (2023), Llama 2 (2023), and Llama 3 (2024).
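In PyTorch, decoupled weight decay is available through torch.optim.AdamW; the toy model and hyperparameter values below are illustrative only.

```python
import torch

model = torch.nn.Linear(4_096, 4_096)   # toy stand-in for an LLM
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01,   # applied directly to the weights,
)                        # decoupled from the gradient-based Adam update
```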

In LLM pretraining, models often see each training sample only once. Thus, overfitting to training data, which weight decay helps prevent in traditional deep learning scenarios, is only of concern if there are many similar or even identical samples in the training dataset. Still, weight decay positively affects training speed and the final loss.

According to a 2023 analysis by Francesco D’Angelo and colleagues at EPFL, this is because weight decay increases the effective learning rate. The effective learning rate at training step t is defined as LR(t)/||wt||², the learning rate scaled by the inverse squared norm of the weight vector. The smaller the weights, the larger the influence of a weight update. Further, D’Angelo and colleagues find that weight decay stabilizes training in reduced floating-point precision.

Gradient clipping

Gradient clipping caps gradient magnitudes, helping maintain numerical stability. In the river valley analogy, we impose a threshold on slope steepness when deciding where to move next. Rather than jumping off a cliff, we treat it as a moderately steep hillside.

There are two common types of gradient clipping:

  1. Clipping by value: Set predefined minimum and maximum values for gradient magnitudes. A gradient component is clipped to the respective limit if it exceeds these thresholds. This approach has the key benefit of not requiring access to the entire gradient vector.
  2. Clipping by norm: The entire gradient vector is scaled down if the norm exceeds a specified threshold. For example, Nvidia’s original Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism paper first published in 2019 notes: “[W]e use global gradient norm clipping of 1.0 to improve the stability of training large models.” In contrast to clipping by value, this preserves the gradient vector’s direction but requires access to the entire gradient vector to compute.
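Both clipping variants are available as utilities in PyTorch; the toy model and threshold below are illustrative, and in practice you would pick one of the two methods rather than applying both.

```python
import torch

model = torch.nn.Linear(512, 512)                  # toy stand-in for an LLM
loss = model(torch.randn(8, 512)).sum()
loss.backward()

# Clipping by value: each gradient component is limited to [-1.0, 1.0].
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)

# Clipping by norm: rescale the whole gradient vector if its global norm
# exceeds 1.0, preserving its direction (cf. Megatron-LM's setting of 1.0).
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
optimizer.step()
```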

In 2022, Yang and Ma introduced the Component-Wise Gradient Norm Clipping (CWGNC) approach for fine-tuning LLMs. In a nutshell, CWGNC applies gradient-clipping by norm separately to components in the LLM, such as the key, query, and value matrices or feed-forward layers. This stabilizes the training of each component individually, which might progress at significantly different rates.

Next-token generation

LLMs are autoregressive language models. They predict the next token by taking the sequence of previously generated tokens as input and producing a vector containing a probability for each token in the vocabulary. Different post-processing techniques can be used to determine the next token from these probabilities.

Temperature

Typically, LLMs use a softmax function as the final step in computing token probabilities. A temperature parameter controls this function.

The temperature influences the degree of randomness (or “originality” or “creativity”) in an LLM’s predicted text. At low temperatures, the model becomes more deterministic, rarely considering less likely options and instead focusing on the tokens with the highest probabilities. Conversely, a high temperature increases unpredictability, allowing the model to choose from a broader range of tokens. Thus, lower temperatures are helpful when you need reliable answers, while higher temperatures lead to more varied and surprising outputs.
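A minimal sketch of temperature-scaled softmax over a toy logit vector (the logits are made up for illustration):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to token probabilities; low temperatures sharpen
    the distribution, high temperatures flatten it."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                     # for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

logits = [2.0, 1.0, 0.1]                       # made-up logits for 3 tokens
print(softmax_with_temperature(logits, 0.2))   # close to one-hot (deterministic)
print(softmax_with_temperature(logits, 1.2))   # flatter (more random)
```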

The Text Gen Playground Hugging Face Space allows users to experiment with different temperature settings and models. By inputting a prompt and adjusting the temperature parameter, you can observe how the model’s output varies from predictable and deterministic to creative and varied.

For example, using the prompt “The sun rises in the” at different temperatures:

  • Low Temperature (e.g., T = 0.2): The model will likely complete the sentence with “east,” reflecting a common and expected continuation.
  • High Temperature (e.g., T = 1.2): The model might generate more imaginative completions like “morning haze” or “golden skies,” showcasing increased creativity.

Adjusting the temperature parameter in such playgrounds provides valuable insights into controlling the balance between determinism and creativity in language model outputs.

Sampling strategy

Given the vector of probabilities, there are many ways to select the next token.

A straightforward strategy is always picking the most likely token. Since the sampling process only considers the probabilities for the very next token, this “greedy decoding” leads to highly probable multi-token sequences being discarded if they start with a token that – viewed in isolation – is less likely.

Using beam search or random sampling according to the token probabilities can mitigate this. While the former produces deterministic outputs and thus no variety, the latter can lead to the selection of highly improbable tokens, producing nonsensical sequences.

A more balanced approach is top-k sampling, which restricts sampling of the next token to the k most probable tokens. Alternatively, in top-p sampling, only the most likely tokens up to a cumulative probability of p are considered. This approach adapts dynamically to the probability distribution, sampling from many tokens in uncertain scenarios and picking from only a few when the model is more confident. (p and k can be adjusted during training or inference time.)
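A minimal sketch of combined top-k and top-p (nucleus) sampling over a probability vector; it illustrates the filtering logic only and is not taken from any particular library.

```python
import numpy as np

def sample_top_k_top_p(probs, k=40, p=0.9, seed=None):
    """Sample a token index after keeping only the k most likely tokens
    and, of those, the smallest set whose cumulative probability reaches p."""
    rng = np.random.default_rng(seed)
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]            # token indices, most likely first
    keep = order[:k]                           # top-k filter
    cumulative = np.cumsum(probs[keep])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # top-p (nucleus) filter
    keep = keep[:cutoff]
    renormalized = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=renormalized))

vocab_probs = [0.5, 0.2, 0.15, 0.1, 0.05]      # made-up next-token probabilities
print(sample_top_k_top_p(vocab_probs, k=3, p=0.8, seed=0))
```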

As ML engineers, we can fine-tune the temperature and sampling strategy parameters according to our project needs. For example, if our tasks require precision (e.g., technical writing or summarization), we’ll use lower temperatures and top-k sampling to prioritize high-probability tokens. If we need more diversity, we’ll begin with common default values (temperature 0.7, top-k: k = 40, top-p: p = 0.9). We’ll iteratively adjust them based on qualitative evaluation of the outputs and document our findings to build a shared knowledge base with our team.

How do we find the optimal hyperparameters?

LLM training involves many hyperparameters, resulting in a combinatorial explosion of the search space. Simply guessing hyperparameters is unlikely to yield good results. Further, hyperparameters interact in complex ways, so the optimal value for one may depend on the values of others. Thus, adjusting hyperparameters one at a time may lead to suboptimal solutions, as we easily become trapped in local optima and don’t adequately explore the hyperparameter space.

Finding an optimal combination of hyperparameters requires a systematic approach. First, it’s paramount to understand the relevant hyperparameters and their influence on the particular LLM. It’s essential to research how similar architectures were trained or how the LLM we want to fine-tune was pre-trained. Further, we should clarify the available time, our compute budget, and the training objectives.

Next, we can sketch a roadmap. Can we afford to conduct experiments with particular hyperparameter combinations we believe are useful? Do we already have an experiment tracker and resource monitoring in place, or do we need to set it up first? What will be the decision points and criteria that ensure we end up with a fully trained LLM at the end of the project? Finally, we can start executing this roadmap and adjust our plans as we gather more information and insight.

The BLOOM team published a detailed paper on their preliminary experiments to determine the optimal model size and architecture. They describe how they started with GPT-3’s hyperparameters and conducted trial runs to estimate the optimal balance between model size and number of tokens given their fixed compute budget. Similar experiments were run by the Meta team that trained Llama 3, who also aimed to predict downstream task performance.

Can we use traditional machine learning hyperparameter optimization methods for LLMs?

Methods for systematic hyperparameter optimization have long been studied in machine learning:

  • Learning curve analysis involves training models with varying hyperparameters over several epochs and plotting the loss to identify trends. In deep-learning models, plotting the gradient can further help assess whether and how efficiently a model learns.
  • Grid search systematically steps through the hyperparameter space, training a model for each possible combination. Random search samples the hyperparameter space, training models for randomly selected combinations.
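As a minimal illustration (with a toy stand-in for the expensive training run), random search over a small hyperparameter grid could look like this; the search space and loss function are invented for the example.

```python
import random

search_space = {"learning_rate": [1e-5, 3e-5, 1e-4, 3e-4],
                "batch_size": [16, 32, 64],
                "warmup_ratio": [0.0, 0.03, 0.1]}

def random_search(n_trials, train_and_evaluate):
    """Sample random configurations and keep the one with the lowest loss."""
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: random.choice(values) for name, values in search_space.items()}
        loss = train_and_evaluate(config)   # in reality: a full (and costly) training run
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

# Toy stand-in for an expensive training run, used only to make the sketch runnable.
toy_loss = lambda cfg: abs(cfg["learning_rate"] - 3e-5) * 1e4 + cfg["warmup_ratio"]
print(random_search(n_trials=10, train_and_evaluate=toy_loss))
```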

While these approaches have successfully been applied to optimize LLM hyperparameters, their use is severely limited by the fact that LLMs are very expensive to train. The computational and memory requirements make it unviable to train large numbers of models. If training a model takes several months on a large cluster, we’ll only get one shot at a full training run.

Advanced strategies for LLM hyperparameter optimization

Beyond starting from a well-known hyperparameter combination and systematically conducting experiments, there is a range of approaches for automatically identifying or optimizing LLM hyperparameters in specific circumstances.

Population-based training (PBT)

Population-Based Training (PBT) is an approach pioneered by Google DeepMind that combines the concepts of evolutionary search and online training. Instead of fixing hyperparameters at the start of training and leaving them static throughout the process, PBT adapts them dynamically, informed by the models’ performance.

In a nutshell, the population-based training process consists of the following steps:

  1. Set up a population of models, each with unique hyperparameters hᵢ and weights θᵢ. 
  2. Train each model, updating its weights θᵢ every iteration.
  3. After a fixed number of iterations, evaluate each model’s performance on a validation dataset.
  4. Identify models that are underperforming relative to others. Replace their current weights and hyperparameters with those of a better-performing model (exploitation).
  5. Slightly perturb the hyperparameters of previously underperforming models to prevent the population from converging to a single configuration too early and improve diversity (exploration).
  6. Conclude the training if the compute budget is exhausted or the objective has been met. Otherwise, repeat the process starting from step 2.
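A heavily simplified, runnable sketch of this loop is shown below; the “training” and “perturbation” functions are toy stand-ins for real training code and are purely illustrative.

```python
import random

def train_step(member):
    # Toy stand-in: pretend a higher learning rate and lower dropout "train" better.
    member["score"] += member["hparams"]["lr"] * (1.0 - member["hparams"]["dropout"])

def perturb(hparams):
    # Exploration: jitter the hyperparameters of the copied configuration.
    return {"lr": hparams["lr"] * random.choice([0.8, 1.2]),
            "dropout": min(max(hparams["dropout"] + random.uniform(-0.05, 0.05), 0.0), 0.5)}

population = [{"score": 0.0,
               "hparams": {"lr": random.choice([1e-4, 3e-4, 1e-3]),
                           "dropout": random.uniform(0.0, 0.3)}}
              for _ in range(8)]                                    # step 1

for generation in range(5):
    for member in population:                                       # step 2: train
        for _ in range(100):
            train_step(member)
    population.sort(key=lambda m: m["score"], reverse=True)         # step 3: evaluate
    top, bottom = population[:2], population[-2:]
    for loser in bottom:                                            # step 4: exploit
        winner = random.choice(top)
        loser["score"] = winner["score"]       # stands in for copying the weights
        loser["hparams"] = perturb(winner["hparams"])                # step 5: explore
```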

This process initially appears resource-intensive since it requires maintaining and updating multiple models simultaneously, which can increase total GPU hours. However, PBT’s dynamic refinement of hyperparameters during training can significantly save wall-clock time. By avoiding restarting from scratch for each hyperparameter configuration and leveraging partially trained models, PBT reduces the number of training epochs needed to achieve optimal performance.

The 2017 DeepMind study on Population-Based Training (PBT) showcased its potential for LLMs by fine-tuning the first transformer model on the WMT 2014 English-German machine translation benchmark. They manually optimized a baseline model and compared it to a model where they used PBT to optimize the dropout rates for different layers and the learning rate. Their evaluation showed that the PBT-optimized model outperformed their hand-tuned baseline. Further, they discovered that the learning rate schedule generated through PBT mimicked the human-created one: starting with a small learning rate, it jumped to a high value before something resembling an exponential decay brought it down to a low value again. DeepMind’s original PBT transformer model also learned noticeably faster.

Ray Tune is a hyperparameter tuning library that supports population-based training. It is part of the open-source Ray framework for scaling machine-learning applications. The Ray Tune documentation includes an example of tuning BERT and RoBERTa on the GLUE benchmark dataset using population-based training.

Bayesian optimization

Bayesian optimization is a popular method for efficiently navigating the hyperparameter space by building a probabilistic model (surrogate model) of the influence of the hyperparameters on the objective (e.g., validation loss). The surrogate model is used to predict promising hyperparameter combinations to try next. The results of this exploration are then used to refine the surrogate model.

The 2024 paper Crafting Efficient Fine-Tuning Strategies for Large Language Models investigates the applicability of Bayesian optimization to fine-tuning LLMs. First, a population of N models is trained for a pre-defined budget t1. As each model is trained, the surrogate model is updated, and the updated version is used to set the hyperparameters of the next model. Once all N models are trained, the top k models are selected and are trained up to t2. Finally, the best model among the k fully trained models is selected.
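To get a feel for how such a surrogate-driven search is set up in practice, here is a minimal sketch using Optuna (whose default sampler is a Bayesian-style method, and which reappears in the hands-on section below); the objective is a synthetic stand-in for a real fine-tuning run, and the hyperparameter names and ranges are illustrative.

```python
import optuna

def objective(trial):
    # Illustrative search space; in a real setup this would launch a (short)
    # fine-tuning run and return the validation loss.
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    warmup = trial.suggest_float("warmup_ratio", 0.0, 0.1)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    # Synthetic stand-in for the validation loss.
    return (lr - 2e-5) ** 2 * 1e8 + (warmup - 0.03) ** 2 + 1.0 / batch_size

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)
```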

Adaptive Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) is a popular technique for reducing the memory footprint and computational demands when fine-tuning LLMs. In brief, the idea is to represent the weights of the fine-tuned model as 

Wfine = Wpre + ∆W =  Wpre + BA

Here, the fine-tuned weights Wfine are the sum of the original weights Wpre and a difference ∆W, which is the product of two matrices, B and A. Only B and A are updated during fine-tuning, while Wpre remains unchanged. If Wpre and ∆W have dimensions m x n, B and A have dimensions m x r and r x n, respectively. If the rank r is much smaller than m and n, the number of weights to be updated is greatly reduced, leading to faster training progress while requiring less memory.
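As a rough illustration (not the reference implementation from the LoRA paper), a LoRA-augmented linear layer in PyTorch could be sketched as follows; the rank r and scaling factor are chosen arbitrarily.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weights W_pre
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: B (out x r) and A (r x in)
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_fine x = W_pre x + (B A) x, with only A and B receiving gradients
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(4, 768))
```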

In practice, it is often unclear to which LLM components LoRA should be applied for the best outcome. While we know that not all weights influence task performance equally, identifying which components are important for a particular objective would require extensive ablation studies. Thus, LoRA is often applied across all suitable weight matrices in a model.

AdaLoRA (Adaptive Low-Rank Adaptation) is a method to allocate a given parameter budget across weight matrices. The core idea is to apply LoRA to all LLM components but to use different values for the rank r. Important components use a matrix pair with a large r, leading to a ∆W with many weights. Less important components are approximated using a lower-rank matrix pair. AdaLoRA assigns an importance score to each component and sets the values for r such that the total number of weights remains within the user-defined budget. This leads to an optimal training outcome for a fixed compute and memory budget.

AdaMoLE (Adaptive Mixture of Low-Rank Adaptation Experts) similarly aims to reduce the number of weights that need to be updated. It replaces the single low-rank matrix pair of the original LoRA with a collection of multiple matrix pairs (LoRA experts) that are activated dynamically based on the input context. This enables the LLM to learn different tasks with a minimal total number of weights.
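A heavily simplified sketch of this mixture-of-LoRA-experts idea (not the actual AdaMoLE implementation) is shown below: a gating network scores each expert per input, experts below a threshold are dropped, and the remaining experts’ low-rank updates are added to the frozen base projection. All sizes and the threshold value are illustrative.

```python
import torch
import torch.nn as nn

class MixtureOfLoRAExperts(nn.Module):
    def __init__(self, in_features, out_features, r=4, num_experts=4, threshold=0.2):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)            # frozen pre-trained weights
        self.A = nn.ParameterList([nn.Parameter(torch.randn(r, in_features) * 0.01)
                                   for _ in range(num_experts)])
        self.B = nn.ParameterList([nn.Parameter(torch.zeros(out_features, r))
                                   for _ in range(num_experts)])
        self.gate = nn.Linear(in_features, num_experts)    # gating function
        self.threshold = threshold                         # threshold function

    def forward(self, x):
        weights = torch.softmax(self.gate(x), dim=-1)      # per-input expert scores
        weights = torch.where(weights >= self.threshold, weights,
                              torch.zeros_like(weights))   # drop weakly activated experts
        out = self.base(x)
        for i in range(len(self.A)):
            out = out + weights[..., i:i + 1] * (x @ self.A[i].T @ self.B[i].T)
        return out

layer = MixtureOfLoRAExperts(768, 768)
out = layer(torch.randn(4, 768))
```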

Fine-tuning an LLM with the Adaptive Mixture of Low-Rank Adaptation Experts approach. The fine-tuned weights are approximated as the sum of the frozen pre-trained weights and a number of so-called LoRA experts that are activated by a gating function and a threshold function. Different LoRA experts specialize in different contexts, allowing the LLM to learn different tasks with a minimal number of weights. | Modified based on: source

Hands-on: LLM hyperparameter optimization with neptune.ai

Optuna is a framework for hyperparameter optimization based on Bayesian optimization. It can be applied to various machine-learning tasks, including LLM hyperparameter tuning.

To see this in action, we’ve prepared a Colab notebook that walks you through the process of finding the optimal combination of learning rate, batch size, and number of epochs for fine-tuning a Hugging Face Transformers model on the IMDB dataset.

The tutorial uses neptune.ai to track training progress and analyze the different hyperparameters. If you don’t want to go through the tutorial yourself right now, you can still explore example results in this public Neptune project.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

What’s next in LLM hyperparameter optimization?

Finding an optimal combination of hyperparameters is essential for training LLMs. In this article, we’ve reviewed key LLM hyperparameters and their influence on the model and training performance. We’ve also discussed how to approach hyperparameter optimization systematically and explored methods to assist or even automate this task in certain scenarios.

From the examples of hyperparameter choices for state-of-the-art LLMs, we’ve seen that while architectures, training tasks, and data change, most models are trained with relatively similar learning rate schedules and optimizer configurations. As our understanding of the model and training mechanics deepens and more experiments yield empirical evidence, we’ll likely see an evolution of the standard recipes and more diversity.


126% Surge in Attacks in February 2025


February 2025 saw a record 126% surge in ransomware attacks, with Cl0p leading the charge. Hackers exploited file transfer flaws, infostealers, and AI-driven tactics, reveals Bitdefender’s latest Threat Debrief report.

Cybersecurity just reached a new milestone, and not in a good way. According to Bitdefender’s latest Threat Debrief report, February 2025 was the worst month in history for ransomware attacks, with a 126% increase in claimed victims compared to the same period last year.

This surprising jump saw the number of victims soar from 425 in February 2024 to a staggering 962 in February 2025. The massive surge in ransomware attacks occurred despite the United States-led alliance of 40 countries, announced in November 2023, aimed at dismantling ransomware gangs and their infrastructure. The initiative focused on disrupting payments, taking down infrastructure, and enhancing intelligence sharing.

Clop (Cl0p) Ransomware at Its Peak

According to Bitdefender’s report, shared with Hackread.com ahead of publishing on Thursday, the Cl0p (Clop) ransomware group was responsible for more than a third of the attacks, claiming 335 victims in just one month. This marks a 300% increase from the previous month.

So, what’s behind this sudden rise in attacks? Cybersecurity experts point to a new trend that’s not so new: attackers are increasingly targeting vulnerabilities in edge network devices, such as file transfer systems and remote access tools.

Instead of focusing on specific industries, these opportunistic hackers are scanning the internet for easily exploitable flaws and launching automated attacks. For example, the Cl0p ransomware gang is notorious for exploiting vulnerabilities in MOVEit, a managed file transfer (MFT) software, most heavily in 2023. The group stole so much data through MOVEit vulnerabilities that it launched a clearnet website to leak stolen information from victims worldwide.

In December 2024, Cl0p also announced exploiting security vulnerabilities in Cleo’s managed file transfer software, specifically targeting the Cleo Harmony, VLTrader, and LexiCom products. Bitdefender’s Threat Debrief report also spotted Cl0p’s exploitation of Cleo vulnerabilities, especially CVE-2024-50623 and CVE-2024-55956, both rated 9.8 out of 10 in severity.

Both flaws allow attackers to execute commands remotely on compromised systems and were disclosed late last year. Despite patches being available, many organizations failed to update their systems in time, leaving them wide open to exploitation and leading to the surge in victims seen in February 2025.

The illustration highlights the rapid pace at which ransomware gangs exploit vulnerabilities and shift to new targets. (Credit: Bitdefender)

Other Notable Developments in the Ransomware World

Beyond the record-breaking numbers, Bitdefender researchers noticed several other noteworthy trends in February 2025, including:

FunkSec’s New Infostealer

FunkSec, a growing ransomware group, released Wolfer, a tool designed to extract sensitive information from infected machines. It communicates with a Telegram bot to gather system details, Wi-Fi passwords, and more.

A ransomware gang using infostealers is bad news, especially as researchers recently found that cybercriminals are successfully breaching U.S. national security with infostealers as cheap as $10. Even high-security institutions like the military and the FBI have had their systems compromised, with access being sold on the dark web.

Black Basta Gets Analyzed by AI

On February 11, 2025, the notorious Black Basta ransomware gang had its internal chats leaked. These chats contained over 200,000 Russian-language messages. Hudson Rock’s researchers created a chatbot called BlackBastaGPT to sift through the chat logs.

Insights revealed details about their profits, use of deepfake technology, and internal conflicts. The group’s leader emphasized avoiding detection by using built-in system tools, a tactic known as “living off the land.”

Ghost Ransomware Under Scrutiny

A joint advisory from CISA highlighted Ghost (also known as Cring), a China-based ransomware operation exploiting older but still unpatched vulnerabilities. Recommendations include patching affected software, segmenting networks, and backing up data regularly.

Akira’s Webcam Hack

The Akira ransomware gang found a creative way to bypass security by hijacking a victim’s webcam. Since the device ran Linux and wasn’t monitored closely, it became the perfect launchpad for encrypting files across the network undetected.

Stephen Kowski, Field CTO at Pleasanton, Calif.-based SlashNext Email Security+, commented on the latest development, emphasizing the need to address vulnerabilities by improving threat detection and response capabilities.

“We expect ransomware attacks to continue increasing this year, especially targeting healthcare, manufacturing, critical infrastructure, and supply chains. High-profile incidents in 2024 highlight the ongoing vulnerabilities,” warned Kowski. “To combat this, organizations need to focus on strengthening email security, implementing zero-trust architectures, and improving threat detection and response capabilities.”

Top 10 Countries Most Targeted by Ransomware Gangs

The United States, Canada, the UK, Germany, and other developed nations remain the biggest targets of ransomware groups. These countries are highly vulnerable due to their reliance on connected edge devices, cloud infrastructure, and critical business data.

In total, these are the top 10 countries most targeted by ransomware gangs:

  1. USA
  2. Canada
  3. The UK
  4. Germany
  5. France
  6. Australia
  7. Brazil
  8. Mexico
  9. Italy
  10. Sweden

For those looking to understand the full scope of modern ransomware operations and how to fight back, Bitdefender has published a comprehensive whitepaper detailing current attack methods and defence strategies. You can access it here.


KnowBe4 research reveals a confidence gap in cybersecurity, putting organisations at risk


KnowBe4, a cybersecurity platform that comprehensively addresses human risk management, has released new research indicating that while 86% of employees believe they can confidently identify phishing emails, nearly half have fallen for scams. The study, which surveyed professionals across the UK, USA, Germany, France, Netherlands, and South Africa, reveals a growing gap between confidence and competence in identifying cyber threats.

Notably, South Africa leads with both the highest confidence levels and the highest scam victimization rate, suggesting that misplaced confidence can create a false sense of security, leaving employees more susceptible to advanced cyber threats. Beyond training, the report highlights the importance of fostering a transparent security culture. While 56% of employees feel “very comfortable” reporting security concerns, 1 in 10 still hesitate due to fear or uncertainty.

Key findings from the survey included:

●      86% of employees believe they can confidently identify phishing emails.

●      24% have fallen for phishing attacks.

●      12% have been tricked by deepfake scams.

●      68% of South African employees reported falling for scams—the highest victimisation rate.

“Overconfidence fosters a dangerous blind spot—employees assume they are scam-savvy when, in reality, cybercriminals can exploit more than 30 susceptibility factors, including psychological and cognitive biases, situational awareness gaps, behavioural tendencies, and even demographic traits,” said Anna Collard, SVP content strategy and evangelist, KnowBe4. “With phishing, AI-driven social engineering, and deepfake scams evolving rapidly, organisations must counteract misplaced confidence with hands-on, scenario-based training. True cyber resilience comes not from assumed knowledge but from continuous education, real-world testing, and an adaptive security mindset.”

The survey findings emphasize the critical need for personalised, relevant, and adaptive training that caters to employees’ individual needs while considering regional influences and evolving cyber tactics. Organisations that prioritise this approach will not only reduce risk but also cultivate a genuine security-first culture. In the battle against digital deception, the most dangerous mistake employees can make is assuming they are immune.

The survey findings, published in “Security Approaches Around the Globe: The Confidence Gap,” are available for download here.

The post KnowBe4 research reveals a confidence gap in cybersecurity, putting organisations at risk appeared first on IT Security Guru.

Patronus AI Introduces the Industry’s First Multimodal LLM-as-a-Judge (MLLM-as-a-Judge): Designed to Evaluate and Optimize AI Systems that Convert Image Inputs into Text Outputs


​In recent years, the integration of image generation technologies into various platforms has opened new avenues for enhancing user experiences. However, as these multimodal AI systems—capable of processing and generating multiple data forms like text and images—expand, challenges such as “caption hallucination” have emerged. This phenomenon occurs when AI-generated descriptions of images contain inaccuracies or irrelevant details, potentially diminishing user trust and engagement. Traditional methods of evaluating these systems often rely on manual inspection, which is neither scalable nor efficient, highlighting the need for automated and reliable evaluation tools tailored to multimodal AI applications.​

Addressing these challenges, Patronus AI has introduced the industry’s first Multimodal LLM-as-a-Judge (MLLM-as-a-Judge), designed to evaluate and optimize AI systems that convert image inputs into text outputs. This tool utilizes Google’s Gemini model, selected for its balanced judgment approach and consistent scoring distribution, distinguishing it from alternatives like OpenAI’s GPT-4V, which has shown higher levels of egocentricity. The MLLM-as-a-Judge aligns with Patronus AI’s commitment to advancing scalable oversight of AI systems, providing developers with the means to assess and enhance the performance of their multimodal applications.

Technically, the MLLM-as-a-Judge is equipped to process and evaluate image-to-text generation tasks. It offers built-in evaluators that create a ground truth snapshot of images by analyzing attributes such as text presence and location, grid structures, spatial orientation, and object identification. The suite of evaluators includes criteria like:​

  • caption-describes-primary-object
  • caption-describes-non-primary-objects
  • caption-hallucination
  • caption-hallucination-strict
  • caption-mentions-primary-object-location

These evaluators enable a thorough assessment of image captions, ensuring that generated descriptions accurately reflect the visual content. Beyond verifying caption accuracy, the MLLM-as-a-Judge can be used to test the relevance of product screenshots in response to user queries, validate the accuracy of Optical Character Recognition (OCR) extractions for tabular data, and assess the fidelity of AI-generated brand images and logos. ​

A practical application of the MLLM-as-a-Judge is its implementation by Etsy, a prominent e-commerce platform specializing in handmade and vintage products. Etsy’s AI team employs generative AI to automatically generate captions for product images uploaded by sellers, streamlining the listing process. However, they encountered quality issues with their multimodal AI systems, as the autogenerated captions often contained errors and unexpected outputs. To address this, Etsy integrated Judge-Image, a component of the MLLM-as-a-Judge, to evaluate and optimize their image captioning system. This integration allowed Etsy to reduce caption hallucinations, thereby improving the accuracy of product descriptions and enhancing the overall user experience. ​

In conclusion, as organizations continue to adopt and scale multimodal AI systems, addressing the unpredictability of these systems becomes essential. Patronus AI’s MLLM-as-a-Judge offers an automated solution to evaluate and optimize image-to-text AI applications, mitigating issues such as caption hallucination. By providing built-in evaluators and leveraging advanced models like Google Gemini, the MLLM-as-a-Judge enables developers and organizations to enhance the reliability and accuracy of their multimodal AI systems, ultimately fostering greater user trust and engagement.




Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.


How Will AI Reshape Apps and App Development in the Future


Even if you’re not a tech-savvy person, you’re probably using your smartphone and computer for work. We all browse the Internet, and it’s impossible to avoid keywords and articles on AI. Artificial intelligence models are getting smarter thanks to machine learning algorithms. And while many hate them and consider them unfair to the current business environment, numerous generative AI applications will make our lives better.

Generative AI is supposed to reshape app development. Companies are already tapping into the power of AI for their apps, which are supposed to become better and more intuitive. In this article, we’ll break down the ways AI will reshape apps and app development sooner than expected.

A Variety of AI Services

Many apps are already using various AI services, and the implementation is well underway. For example, online casino apps use AI bots as part of their customer support service, and casino developers also rely on AI models to speed up app development. This mostly focuses on slot and casino game theme innovations, as well as math models.

Casino apps are quite popular these days, and playing slots and apps will become even more intuitive in the future. AI is supposed to drive this model forward, which likely means more games and tailored experiences.

It’s not just casino apps. GenAI is used in all kinds of enterprise apps, such as Microsoft’s 365 product suite. Google Workspace has also introduced a variety of AI services, while Snapchat introduced an AI-powered chatbot named My AI.

Writing platforms such as Grammarly have moved to an AI model too. This powerful app and plugin improves the writing for all kinds of content, and with its AI approach in the past year, it’s becoming more and more powerful. In just a short time, grammar mistakes could be a thing of the past, and it’s all thanks to artificial intelligence.

AI-Powered vs. Traditional App Development

There’s a great difference between traditional app development and that powered by AI. Development cycles have gotten a great boost. Up until now, long development cycles meant extensive testing and debugging, as well as patching all kinds of apps. This is lengthy and expensive, as it requires a team of testers, which costs companies money.

AI embraces agile development instead. It may or may not integrate continuous machine learning, but either way, the whole process is much faster. That’s because AI can spot problems before they materialize, resulting in a much better end product. Plus, AI models keep improving thanks to constant training, meaning customers get a better and more polished experience overall.

Data handling is also getting much better. In traditional app development cycles, real-time processing of structured data is a problem. AI models are much better at handling all kinds of data. While real-time analytics still needs time to mature, AI-assisted development is already better at handling sensitive data.

Decision-making is about to get much better. Companies are already using AI models to rely on user-focused decision-making instead of static algorithms used in traditional app development. The current techniques have limited adaptability, but with the help of AI, they will soon become a thing of the past. Decision-making with AI relies on self-learning algorithms, which means predictive decisions.

One notable example of this decision-making AI implementation is Google Maps’ optimized routes. They’ve already been rolled out for some time, giving you better and more fuel-efficient routes when you enter the correct data. Some may not like it, but thanks to AI, we’re looking at much better routes in the future.

With superior app scaling, app development will become even better in the future. So will maintenance, where AI is already in use. Extensive redevelopment clogs up traditional app development, which rarely utilizes dynamic scaling methods. AI-powered app development, by contrast, is inherently adaptable and scalable. Netflix is already using it as part of its responsive content delivery system, and many other apps will soon start using it too.

In the maintenance department, AI-powered app development will speed things up in over-the-air updates. This has already been implemented by many companies, including Tesla. AI can scan for in-app delivery errors or updates much faster and more precisely. Self-improving machine learning algorithms will make maintenance and evolution much better. With an updated software update map, users will enjoy a much better experience.

User experience will also be more personalized and highly adaptive in the future. App developers can use it to deliver a more custom-tailored experience. For example, casino apps can recommend games most suitable to player preferences. Spotify has already adopted such a model in its ever-evolving music recommendations.

This is also notable in streaming apps such as Netflix and HBO, as well as dating apps and similar alternatives.

Why is AI Integral in Modern App Development?

There are several reasons why artificial intelligence is crucial for modern app development. First and foremost, automated processes are streamlining development lifecycles. It means less strain on developers, as AI models and machine learning algorithms are more precise with their predictions.

Adaptive learning is another factor that makes AI integral for future app development. AI-powered apps are adjusting to user feedback and implementing changes faster than ever before. Social media algorithms are getting the most out of these models at the moment. They deliver much more precise recommendations to a level we haven’t experienced before.

The predictive capabilities of AI app development are out of this world. AI doesn’t just predict changes – it anticipates user needs and updates features proactively. Thanks to enhanced personalization, we’ll soon be getting apps that offer a custom-tailored experience, which applies especially to gaming apps and retail shopping apps.

Resource optimization is another factor where AI app development excels. It enhances app performance and reduces operational costs. Some employees in certain departments may not like it, but the future is already here, and we need to adapt to it.

Effortless Spreadsheet Normalisation With LLM


This article is part of a series of articles on automating Data Cleaning for any tabular dataset.

You can test the feature described in this article on your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Start with the why

A spreadsheet containing information about awards given to films

Let’s consider this Excel spreadsheet, which contains information on awards given to films. It is sourced from the book Cleaning Data for Effective Data Science and is available here.

This is a typical and common spreadsheet that everyone may own and deal with in their daily tasks. But what is wrong with it?

To answer that question, let us first recall the end goal of using data: to derive insights that help guide our decisions in our personal or business lives. This process requires at least two crucial things:

  • Reliable data: clean data without issues, inconsistencies, duplicates, missing values, etc.
  • Tidy data: a well-normalised data frame that facilitates processing and manipulation.

The second point is the primary foundation of any analysis, including dealing with data quality.

Returning to our example, imagine we want to perform the following actions:

1. For each film involved in multiple awards, list the award and year it is associated with.

2. For each actor/actress winning multiple awards, list the film and award they are associated with.

3. Check that all actor/actress names are correct and well-standardised.

Naturally, this example dataset is small enough to derive those insights by eye or by hand if we structure it, almost as quickly as writing code. But imagine now that the dataset contains the entire awards history; this would be time-consuming, painful, and error-prone without any automation.

Reading this spreadsheet and directly understanding its structure by a machine is difficult, as it does not follow good practices of data arrangement. That is why tidying data is so important. By ensuring that data is structured in a machine-friendly way, we can simplify parsing, automate quality checks, and enhance business analysis—all without altering the actual content of the dataset.

Example of a reshaping of the data from the previous spreadsheet:

Now, anyone can use low/no-code tools or code-based queries (SQL, Python, etc.) to interact easily with this dataset and derive insights.

The main challenge is how to turn a shiny and human-eye-pleasant spreadsheet into a machine-readable tidy version.

What is tidy data? A well-shaped data frame?

The term tidy data was described in a well‐known article named Tidy Data by Hadley Wickham and published in the Journal of Statistical Software in 2014. Below are the key quotes required to understand the underlying concepts better.

Data tidying 

“Structuring datasets to facilitate manipulation, visualisation and modelling.”

“Tidy datasets provide a standardised way of linking the structure of a dataset (its physical layout) with its semantics (its meaning).”

Data structure

“Most statistical datasets are rectangular tables composed of rows and columns. The columns are almost always labelled, and the rows are sometimes labelled.”

Data semantics

“A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organised in two ways. Every value belongs to both a variable and an observation. A variable contains all values that measure the same underlying attribute (such as height, temperature or duration) across units. An observation contains all values measured on the same unit (for example, a person, a day or a race) across attributes.”

“In a given analysis, there may be multiple levels of observation. For example, in a trial of a new allergy medication, we might have three types of observations:

  • Demographic data collected from each person (age, sex, race),
  • Medical data collected from each person on each day (number of sneezes, redness of eyes), and
  • Meteorological data collected on each day (temperature, pollen count).”

Tidy data

“Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is considered messy or tidy depending on how its rows, columns and tables correspond to observations, variables and types. In tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.”

Common problems with messy datasets

Column headers might be values rather than variable names.

  • Messy example: A table where column headers are years (2019, 2020, 2021) instead of a “Year” column.
  • Tidy version: A table with a “Year” column and each row representing an observation for a given year.

Multiple variables might be stored in one column.

  • Messy example: A column named “Age_Gender” containing values like 28_Female
  • Tidy version: Separate columns for “Age” and “Gender”

Variables might be stored in both rows and columns.

  • Messy example: A dataset tracking student test scores where subjects (Math, Science, English) are stored as both column headers and repeated in rows instead of using a single “Subject” column.
  • Tidy version: A table with columns for “Student ID,” “Subject,” and “Score,” where each row represents one student’s score for one subject.

Multiple types of observational units might be stored in the same table.

  • Messy example: A sales dataset that contains both customer information and store inventory in the same table.
  • Tidy version: Separate tables for “Customers” and “Inventory.”

A single observational unit might be stored in multiple tables.

  • Messy example: A patient’s medical records are split across multiple tables (Diagnosis Table, Medication Table) without a common patient ID linking them.
  • Tidy version: A single table or properly linked tables using a unique “Patient ID.”
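To make the first two messy patterns concrete, here is how they could be tidied with pandas; the column names and values are invented for illustration.

```python
import pandas as pd

# Messy pattern 1: column headers are values (years) rather than a variable.
wide = pd.DataFrame({
    "film": ["Film A", "Film B"],
    "2019": [1, 0],
    "2020": [2, 1],
})
tidy = wide.melt(id_vars="film", var_name="year", value_name="awards")

# Messy pattern 2: multiple variables stored in one column ("Age_Gender").
messy = pd.DataFrame({"Age_Gender": ["28_Female", "35_Male"]})
messy[["Age", "Gender"]] = messy["Age_Gender"].str.split("_", expand=True)
tidy2 = messy.drop(columns="Age_Gender")
```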

Now that we have a better understanding of what tidy data is, let’s see how to transform a messy dataset into a tidy one.

Thinking about the how

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” Hadley Wickham (cf. Leo Tolstoy)

Although these guidelines sound clear in theory, they remain difficult to generalise easily in practice for any kind of dataset. In other words, starting with the messy data, no simple or deterministic process or algorithm exists to reshape the data. This is mainly explained by the singularity of each dataset. Indeed, it is surprisingly hard to precisely define variables and observations in general and then transform data automatically without losing content. That is why, despite massive improvements in data processing over the last decade, data cleaning and formatting are still done “manually” most of the time.

Thus, when complex and hard-to-maintain rule-based systems are not suitable (i.e., when it is impractical to describe every decision in advance for all contexts), machine learning models may offer some benefits. This grants the system more freedom to adapt to any data by generalising what it has learned during training. Many large language models (LLMs) have been exposed to numerous data processing examples, making them capable of analysing input data and performing tasks such as spreadsheet structure analysis, table schema estimation, and code generation.

Then, let’s describe a workflow made of code and LLM-based modules, alongside business logic, to reshape any spreadsheet.

Diagram of a workflow made of code and LLM-based modules alongside business logic to reshape a spreadsheet

Spreadsheet encoder 

This module is designed to serialise into text the main information needed from the spreadsheet data. Only the necessary subset of cells contributing to the table layout is retained, removing non-essential or overly repetitive formatting information. This minimises token usage, reduces costs, and enhances model performance. The current version is a deterministic algorithm inspired by the paper SpreadsheetLLM: Encoding Spreadsheets for Large Language Models, which relies on heuristics. More details about it will be the topic of a future article.

Table structure analysis 

Before moving forward, asking an LLM to extract the spreadsheet structure is a crucial step in building the next actions. Here are examples of questions addressed:

  • How many tables are present, and what are their locations (regions) in the spreadsheet?
  • What defines the boundaries of each table (e.g., empty rows/columns, specific markers)?
  • Which rows/columns serve as headers, and do any tables have multi-level headers?
  • Are there metadata sections, aggregated statistics, or notes that need to be filtered out or processed separately?
  • Are there any merged cells, and if so, how should they be handled?
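One way to put these questions to an LLM is sketched below, assuming an OpenAI-compatible chat client; the model name and prompt wording are illustrative, not the exact ones used in the workflow.

```python
from openai import OpenAI  # assumes an OpenAI-compatible chat API and API key

client = OpenAI()

STRUCTURE_PROMPT = """You are given a serialized spreadsheet.
Identify: (1) how many tables it contains and their cell ranges,
(2) which rows/columns are headers (including multi-level headers),
(3) any metadata, notes, or aggregated rows to exclude,
(4) any merged cells and how they should be handled.
Answer as JSON."""

def analyze_structure(encoded_sheet: str) -> str:
    """Ask the model to describe the spreadsheet structure."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "system", "content": STRUCTURE_PROMPT},
                  {"role": "user", "content": encoded_sheet}],
    )
    return response.choices[0].message.content
```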

Table schema estimation

Once the analysis of the spreadsheet structure has been completed, it is now time to start thinking about the ideal target table schema. This involves letting the LLM process iteratively by:

  • Identifying all potential columns (multi-row headers, metadata, etc.)
  • Comparing columns for domain similarities based on column names and data semantics
  • Grouping related columns  

The module outputs a final schema with names and a short description for each retained column.

Code generation to format the spreadsheet

Considering the previous structure analysis and the table schema, this last LLM-based module should draft code that transforms the spreadsheet into a proper data frame compliant with the table schema. Moreover, no useful content must be omitted (e.g. aggregated or computed values may still be derived from other variables).

As generating code that works well from scratch at the first iteration is challenging, two internal iterative processes are added to revise the code if needed:

  • Code checking: Whenever code cannot be compiled or executed, the trace error is provided to the model to update its code.
  • Data frame validation: The metadata of the created data frame—such as column names, first and last rows, and statistics about each column—is checked to validate whether the table conforms to expectations. Otherwise, the code is revised accordingly.
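An illustrative version of these two feedback loops is sketched below; generate_code, run_code, and validate_dataframe are hypothetical helpers standing in for the LLM call, a sandboxed execution step, and the business validation rules.

```python
MAX_ATTEMPTS = 3

def reshape_spreadsheet(sheet, schema):
    """Generate, execute, and validate reshaping code, retrying with feedback."""
    feedback = ""
    for attempt in range(MAX_ATTEMPTS):
        code = generate_code(sheet, schema, feedback)   # hypothetical LLM call
        try:
            df = run_code(code, sheet)                   # hypothetical sandboxed execution
        except Exception as error:                       # code checking loop
            feedback = f"The code failed with: {error!r}. Please fix it."
            continue
        issues = validate_dataframe(df, schema)          # hypothetical validation rules
        if not issues:
            return df
        feedback = f"The data frame does not match the schema: {issues}"
    raise RuntimeError("Could not produce a valid data frame within the retry budget.")
```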

Convert the data frame into an Excel file

Finally, if all data fits properly into a single table, a worksheet is created from this data frame to respect the tabular format. The final asset returned is an Excel file whose active sheet contains the tidy spreadsheet data.
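Assuming the reshaped data ends up in a pandas data frame, writing the final asset is a one-liner; the file and sheet names here are illustrative, and an Excel writer such as openpyxl must be installed.

```python
import pandas as pd

# tidy_df stands in for the data frame produced by the generated code
tidy_df = pd.DataFrame({"film": ["Film A"], "award": ["Best Picture"], "year": [2020]})
tidy_df.to_excel("tidy_awards.xlsx", index=False, sheet_name="Tidy data")
```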

Et voilà! The sky’s the limit for making the most of your newly tidy dataset.

Feel free to test it with your own dataset using the CleanMyExcel.io service, which is free and requires no registration.

Final note on the workflow

Why is a workflow proposed instead of an agent for that purpose?  

At the time of writing, we consider that a workflow based on LLMs for precise sub-tasks is more robust, stable, iterable, and maintainable than a more autonomous agent. An agent may offer advantages, such as more freedom in the actions it takes to perform tasks. Nonetheless, agents can still be hard to deal with in practice; for example, they may diverge quickly if the objective is not clear enough. I believe this is the case here, but that does not mean such a model would not be applicable in the future, in the same way that SWE-agent-style coding is performing, for example.

Next articles in the series

In upcoming articles, we plan to explore related topics, including:

  • A detailed description of the spreadsheet encoder mentioned earlier.
  • Data validity: ensuring each column meets the expectations.
  • Data uniqueness: preventing duplicate entities within the dataset.
  • Data completeness: handling missing values effectively.
  • Evaluating data reshaping, validity, and other key aspects of data quality.

Stay tuned!

Thank you to Marc Hobballah for reviewing this article and providing feedback.

All images, unless otherwise noted, are by the author.
