
Transformers Key-Value Caching Explained


As the complexity and size of transformer-based models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies.

Key-value (KV) caching is a clever trick to do that: At inference time, key and value matrices are calculated for each generated token. KV caching stores these matrices in memory so that when subsequent tokens are generated, we only compute the keys and values for the new tokens instead of having to recompute everything.

The inference speedup from KV caching comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy.

Implementing K-V caching in large-scale production systems requires careful cache management, including choosing an appropriate strategy for cache invalidation and exploring opportunities for cache reuse.

The transformer architecture is arguably one of the most impactful innovations in modern deep learning. Proposed in the famous 2017 paper “Attention Is All You Need,” it has become the go-to approach for most language-related modeling, including all Large Language Models (LLMs), such as the GPT family, as well as many computer vision tasks.

As the complexity and size of these models grow, so does the need to optimize their inference speed, especially in chat applications where the users expect immediate replies. Key-value (KV) caching is a clever trick to do just that – let’s see how it works and when to use it.

Transformer architecture overview

Before we dive into KV caching, we will need to take a short detour to the attention mechanism used in transformers. Understanding how it works is required to spot and appreciate how KV caching optimizes transformer inference.

We will focus on autoregressive models used to generate text. These so-called decoder models include the GPT family, Gemini, Claude, and the model behind GitHub Copilot. They are trained on a simple task: predicting the next token in a sequence. During inference, the model is provided with some text, and its task is to predict how this text should continue.

From a high-level perspective, most transformers consist of a few basic building blocks:

  • A tokenizer that splits the input text into subparts, such as words or sub-words.
  • An embedding layer that transforms the resulting tokens (and their relative positions within the texts) into vectors.
  • A couple of basic neural network layers, including dropout, layer normalization, and regular feed-forward linear layers.

The last building block missing from the list above is the slightly more involved self-attention module.

The self-attention module is, arguably, the only advanced piece of logic in the transformer architecture. It is the cornerstone of every transformer, enabling it to focus on different parts of the input sequence when generating the outputs. It is this mechanism that gives transformers the ability to model long-range dependencies effectively.

Let’s inspect the self-attention module in more detail.

Basic self-attention module

Self-attention is a mechanism that allows the model to “pay attention” to specific parts of the input sequence as it generates the next token. For example, in generating the sentence “She poured the coffee into the cup,” the model might pay more attention to the words “poured” and “coffee” to predict “into” as the next word since these words provide context for what is likely to come next (as opposed to “she” and “the”).

Mathematically speaking, the goal of self-attention is to transform each input (embedded token) into a so-called context vector, which combines the information from all the inputs in a given text. Consider the text “She poured coffee”. Attention will compute three context vectors, one for each input token (let’s assume tokens are words).

To calculate the context vectors, self-attention computes three kinds of intermediate vectors: queries, keys, and values. The diagram below shows step by step how the context vector for the second word, “poured,” is calculated:

The diagram shows step by step how the context vector for the second word, “poured,” is calculated. | Source: Author

Let’s denote the three tokenized inputs as x1, x2, and x3, respectively. The diagram pictures them as vectors with three elements, but in practice, they will be hundreds or thousands of elements long.

As the first step, self-attention multiplies each input separately with two weight matrices, Wk and Wv. The input for which the context vector is now being computed (x2 in our case) is additionally multiplied with a third weight matrix, Wq. All three W matrices are your usual neural network weights, randomly initialized and optimized in the learning process. The outputs of this step are the keys (k) and values (v) vectors for each input, plus an additional query (q) vector for the input being processed.

In step two, the key vector of each input is multiplied by the query vector of the input being processed (our q2). The output is then normalized (not shown in the diagram) to produce the attention weights. In our example, a21 is the attention weight between the inputs “She” and “poured.”

Finally, each attention weight is multiplied by its corresponding value vector. The outputs are then summed to produce the context vector z. In our example, the context vector z2 corresponds to the input x2, “poured.” The context vectors are the outputs of the self-attention module.
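
To make these three steps concrete, here is a minimal numerical sketch in PyTorch. The token embeddings and weight values are random stand-ins (the real ones are learned); only the sequence of operations mirrors the description above.

import torch

torch.manual_seed(0)

# Toy embeddings for "She", "poured", "coffee" (3 tokens, 3 dimensions each).
x = torch.rand(3, 3)

# Randomly initialized weight matrices (learned in a real model).
d_in, d_out = 3, 3
W_q = torch.rand(d_in, d_out)
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

# Step 1: keys and values for every input; a query only for the token in focus (x2, "poured").
keys = x @ W_k          # k1, k2, k3 stacked as rows
values = x @ W_v        # v1, v2, v3 stacked as rows
q2 = x[1] @ W_q         # query for "poured"

# Step 2: attention scores = the query times each key, normalized into attention weights.
attn_scores = keys @ q2
attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)

# Step 3: the weighted sum of the value vectors gives the context vector z2.
z2 = attn_weights @ values
print(z2)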

If it’s easier for you to read code than diagrams, take a look at this implementation of the basic self-attention module by Sebastian Raschka. The code is part of his book, “Build A Large Language Model (From Scratch)”:

import torch

class SelfAttention_v2(torch.nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = torch.nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = torch.nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

Sebastian’s code operates on matrices: the x in his forward() method corresponds to our x1, x2, and x3 vectors stacked together as a matrix with three rows. This allows him to simply multiply x with W_key to obtain keys, a matrix consisting of three rows (k1, k2, and k3 in our example).

The important takeaway from this brief explanation of self-attention is that in each forward pass, we multiply keys with the queries and then later with the values. Keep this in mind as you read on.

Advanced self-attention modules

The variant of self-attention described above is its simplest vanilla form. Today’s largest LLMs use slightly modified variations that typically differ from our basic flavor in three ways:

  1. Attention is causal.
  2. Dropout is used on attention weights.
  3. Multi-head attention is used.

Causal attention means that the model should only consider previous tokens in the sequence when predicting the next one, preventing it from “looking ahead” at future words. Going back to our example, “She poured coffee.”, when the model was given the word “She” and is now attempting to predict the next one (“poured” would be correct), it should not compute or have access to attention weights between “coffee” and any other word since the word “coffee” has not appeared in the text yet. Causal attention is typically implemented by masking the “look-ahead” part of the attention weights matrix with zeros.
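
As a minimal sketch of how causal masking is commonly implemented (the shapes and values below are made up, not taken from any particular model), the scores for “future” positions are set to negative infinity before the softmax, so their attention weights come out as exactly zero:

import torch

seq_len, d_k = 3, 4
queries = torch.rand(seq_len, d_k)
keys = torch.rand(seq_len, d_k)

attn_scores = queries @ keys.T

# The upper triangle marks the "future" positions (column index > row index).
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# Setting future scores to -inf makes their softmax weights exactly zero,
# i.e., the look-ahead part of the attention weights matrix ends up as zeros.
attn_scores = attn_scores.masked_fill(mask, float("-inf"))
attn_weights = torch.softmax(attn_scores / d_k**0.5, dim=-1)

print(attn_weights)  # lower-triangular: each token attends only to itself and earlier tokens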

Next, to reduce overfitting during training, dropout is often applied to the attention weights. This means that some of them are randomly set to zero in each forward pass.

Finally, basic attention can be referred to as single-head, meaning that there is just one set of Wk, Wq, and Wv matrices. An easy way to increase the model’s capacity is to switch to multi-head attention. This boils down to having multiple sets of the W-matrices and, consequently, multiple query, key, and value matrices, as well as multiple context vectors for each input.
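
The sketch below shows the usual implementation trick rather than literally storing separate W-matrices per head: one larger projection is computed and its output is split into heads, which is mathematically equivalent because each head’s weights are just a slice of the big matrix.

import torch

seq_len, d_model, n_heads = 3, 8, 2
head_dim = d_model // n_heads

x = torch.rand(seq_len, d_model)
W_q = torch.rand(d_model, d_model)
W_k = torch.rand(d_model, d_model)
W_v = torch.rand(d_model, d_model)

# Project once, then split the last dimension into (n_heads, head_dim).
q = (x @ W_q).view(seq_len, n_heads, head_dim).transpose(0, 1)  # (heads, seq, head_dim)
k = (x @ W_k).view(seq_len, n_heads, head_dim).transpose(0, 1)
v = (x @ W_v).view(seq_len, n_heads, head_dim).transpose(0, 1)

# Each head computes its own attention weights and context vectors.
attn_weights = torch.softmax(q @ k.transpose(1, 2) / head_dim**0.5, dim=-1)
context = attn_weights @ v                                      # (heads, seq, head_dim)

# Concatenate the heads back into one context vector per token.
context = context.transpose(0, 1).reshape(seq_len, d_model)
print(context.shape)  # (3, 8)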

Additionally, some transformers implement additional modifications of the attention module with the goal of improving speed or accuracy. Three popular ones are:

  • Grouped-query attention: Several query heads share a single set of key and value heads instead of each head having its own. This reduces the number of key and value projections that must be computed and cached, speeding up inference with little loss in quality. It is used by Llama 3, Mixtral, and Gemini.
  • Paged attention: The keys and values are stored in fixed-size blocks or “pages” rather than in one contiguous buffer (an approach popularized by vLLM), which reduces memory fragmentation and makes serving very long sequences and many concurrent requests more efficient.
  • Sliding-window attention: The model only attends to nearby tokens within a fixed “window” around each token, so it focuses on the local context without needing to look at the entire sequence.

All of these state-of-the-art approaches to implementing self-attention don’t change its basic premise and the fundamental mechanism it relies on: one always needs to multiply the keys by the queries and then later by the values. And as it turns out, at inference time, these multiplications show major inefficiencies. Let’s see why that’s the case.

What is key-value caching?

During inference, transformers generate one token at a time. When we prompt the model to start generation by passing “She,” it will produce one word, such as “poured” (for the sake of avoiding distractions, let’s keep assuming one token is one word). Then, we can pass “She poured” to the model, and it produces “coffee.” Next, we pass “She poured coffee” and obtain the end-of-sequence token from the model, indicating that it considers generation to be complete.

This means we have run the forward pass three times, each time multiplying the queries by the keys to obtain the attention scores (the same applies to the later multiplication by the values).

In the first forward pass, there was just one input token (“She”), resulting in just one key vector and one query vector. We multiplied them to obtain the q1k1 attention score.

In the first forward pass, there is just one input token (“She”), resulting in just one key vector and one query vector. We multiply them to obtain the q1k1 attention score.

Next, we passed “She poured” to the model. It now sees two input tokens, so the computation inside our attention module looks as follows:

Next, we pass “She poured” to the model. It now sees two input tokens.

We did the multiplication to compute three terms, but q1k1 was computed needlessly—we had already calculated it before! This q1k1 element is the same as in the previous forward pass because:

  • q1 is calculated as the embedding of the input (“She”) times the Wq matrix,
  • k1 is calculated as the embedding of the input (“She”) times the Wk matrix,
  • Both the embeddings and the weight matrices are constant at inference time.

Note the grayed-out entries in the attention scores matrix: these are masked with zero to achieve causal attention. For example, the top-right element where q1k3 would have been is not shown to the model as we don’t know the third word (and k3) at the moment of generating the second word.

Finally, here is the illustration of the query-times-keys calculation in our third forward pass.

The query-times-keys calculation in the third forward pass.

We make the computational effort to calculate six values, half of which we already know and don’t need to recompute!

You may already have a hunch about what key-value caching is all about. At inference, as we compute the keys (K) and values (V) matrices, we store their elements in the cache. The cache is an auxiliary memory from which high-speed retrieval is possible. As subsequent tokens are generated, we only compute the keys and values for the new tokens.

For example, this is how the third forward pass would look with caching:

An example of how the third forward pass could look with caching.

When processing the third token, we don’t need to recompute the previous token’s attention scores. We can retrieve the keys and values for the first two tokens from the cache, thus saving computation time.
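
Here is a minimal, self-contained sketch of a decoding loop with a KV cache. It is not the article’s code and ignores batching, multi-head attention, and the rest of the transformer block; it only illustrates why, at each step, we compute the key, value, and query for the new token and read everything else from the cache.

import torch

d_in, d_out = 8, 8
W_q = torch.rand(d_in, d_out)
W_k = torch.rand(d_in, d_out)
W_v = torch.rand(d_in, d_out)

# The cache holds one key row and one value row per token seen so far.
k_cache, v_cache = [], []

def attend_with_cache(x_new):
    """Compute the context vector for the newest token only."""
    # Keys and values are computed once per token and then cached.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    q_new = x_new @ W_q

    keys = torch.stack(k_cache)      # keys for all tokens so far
    values = torch.stack(v_cache)

    # Only one row of attention scores is needed: the new query against all keys.
    attn_scores = keys @ q_new
    attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)
    return attn_weights @ values     # context vector for the new token

# Simulate three decoding steps ("She", "poured", "coffee"); embeddings are random stand-ins.
for step in range(3):
    x_new = torch.rand(d_in)
    z = attend_with_cache(x_new)
    print(f"step {step}: cache holds {len(k_cache)} keys and values")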

Assessing the impact of key-value caching

Key-value caching may have a significant impact on inference time. The magnitude of this impact depends on the model architecture. The more cacheable computations there are, the larger the potential to reduce inference time.

Let’s analyze the impact of K-V caching on generation time using the GPT-Neo-1.3B model from EleutherAI, which is available on the Hugging Face Hub.

We will start by defining a timer context manager to calculate generation time:

import time

class Timer:

   def __enter__(self):
       self._start = time.time()
       return self

   def __exit__(self, exc_type, exc_value, traceback):
       self._end = time.time()
       self.duration = self._end - self._start

   def get_duration(self) -> float:
       return self.duration

Next, we load the model from the Hugging Face Hub, set up the tokenizer, and define the prompt:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

input_text = "Why is a pour-over the only acceptable way to drink coffee?"

Finally, we can define the function to run model inference:

def generate(use_cache):
    input_ids = tokenizer.encode(
        input_text,
        return_tensors="pt",
    ).to(device)
    output_ids = model.generate(
        input_ids,
        max_new_tokens=100,
        use_cache=use_cache,
    )
    return output_ids

Note the use_cache argument we pass to model.generate: It controls whether K-V caching is employed.

With this setup, we can measure the average generation time with and without K-V caching:

import numpy as np

for use_cache in (False, True):
    gen_times = []
    for _ in range(10):
        with Timer() as t:
            generate(use_cache=use_cache)
        gen_times.append(t.duration)
    print(f"Average inference time with use_cache={use_cache}: {np.round(np.mean(gen_times), 2)} seconds")

I executed this code in Google Colab on a free-tier T4 GPU with torch==2.5.1+cu121 and transformers==4.46.2 on Python 3.10.12 and obtained the following output:

Average inference time with use_cache=False: 9.28 seconds
Average inference time with use_cache=True: 3.19 seconds

As you can see, in this case, the speedup from caching is almost threefold.

Challenges and trade-offs

As is usually the case, there is no such thing as a free lunch. The generation speedup we have just seen can only be achieved at the cost of increased memory usage, which requires careful management in production systems.

Latency-memory trade-off

Storing data in the cache uses up memory space. Systems with limited memory resources may struggle to accommodate this additional memory overhead, potentially resulting in out-of-memory errors. This is especially the case when long inputs need to be processed, as the memory required for the cache grows linearly with the input length.
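
To get a feel for the numbers: the cache roughly needs 2 (keys and values) × number of layers × hidden size × sequence length × bytes per element. The back-of-the-envelope calculation below assumes GPT-Neo-1.3B’s configuration of 24 layers and a hidden size of 2048, stored in float16; treat the exact figures as illustrative rather than exact.

# Rough KV cache size estimate (a sketch; exact numbers depend on the model config and dtype).
n_layers = 24              # assumed GPT-Neo-1.3B depth
hidden_size = 2048         # assumed GPT-Neo-1.3B hidden size
bytes_per_element = 2      # float16
seq_len = 2048
batch_size = 8

cache_per_token = 2 * n_layers * hidden_size * bytes_per_element   # keys + values
cache_per_sequence = cache_per_token * seq_len
cache_per_batch = cache_per_sequence * batch_size

print(f"{cache_per_token / 1024:.0f} KB per token")            # ~192 KB
print(f"{cache_per_sequence / 1024**2:.0f} MB per sequence")   # ~384 MB
print(f"{cache_per_batch / 1024**3:.1f} GB per batch")         # ~3.0 GB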

Another aspect to keep in mind is that the additional memory consumed by the cache is not available for storing the batches of data. As a result, one might need to reduce the batch size to keep it within the memory limits, thus decreasing the throughput of the system.

If the memory consumed by the cache becomes a problem, one can trade additional memory for some of the model accuracy. Specifically, one can truncate the sequences, prune the attention heads, or quantize the model:

  • Sequence truncation refers to limiting the maximum input sequence length, thus capping the cache size at the expense of losing long-term context. In tasks where this long context is relevant, the model’s accuracy might suffer.
  • Reducing the number of layers or attention heads, thereby decreasing both the model size and cache memory requirements, is another strategy to reclaim some memory. However, reducing model complexity may impact its accuracy.
  • Finally, there is quantization, which means using lower-precision data types (e.g., float16 instead of float32) for caching to reduce memory usage. Yet again, model accuracy can suffer. A minimal example of this idea is sketched right after this list.
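
As a minimal illustration of the last point, loading the model in half precision via the torch_dtype argument of from_pretrained halves the memory used by both the weights and the cached keys and values, since the cache is stored in the same dtype as the attention computation. This is the simplest form of the idea; dedicated cache-quantization schemes go further but depend on the library version.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loading the weights in float16 roughly halves model and cache memory;
# accuracy may degrade slightly. Assumes a CUDA GPU is available.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda")

input_ids = tokenizer.encode(
    "Why is a pour-over the only acceptable way to drink coffee?",
    return_tensors="pt",
).to("cuda")

output_ids = model_fp16.generate(input_ids, max_new_tokens=100, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))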

To sum up, faster latency provided by K-V caching comes at the cost of increased memory usage. If there is sufficient memory, it’s a non-issue. If the memory becomes the bottleneck, however, one can reclaim it by simplifying the model in various ways, thus transitioning from a latency-memory trade-off to a latency-accuracy trade-off.

KV cache management in production systems

In large-scale production systems with many users, the K-V cache needs to be properly managed to ensure consistent and reliable response time while preventing excessive memory consumption. The two most critical aspects of this are cache invalidation (when to clear it) and cache reuse (how to use the same cache multiple times).

Cache invalidation

Three of the most popular cache invalidation strategies are session-based clearing, time-to-live invalidation, and contextual relevance-based approaches. Let’s explore them in this order.

The most basic cache invalidation strategy is session-based clearing. We simply clear the cache at the end of a user session or conversation with the model. This simple strategy is a perfect fit for applications where conversations are short and independent of each other.

Think about a customer support chatbot application in which each user session typically represents an individual conversation where the user seeks assistance with specific issues. In this context, the contents of this cache are unlikely to be needed again. Clearing the K-V cache once the user ends the chat or the session times out due to inactivity is a good choice, freeing up memory for the application to handle new users.

In situations where individual sessions are long, however, there are better solutions than session-based clearing. In time-to-live (TTL) invalidation, cache contents are automatically cleared after a certain period. This strategy is a good choice when the relevance of cached data diminishes predictably over time.

Consider a news aggregator app that provides real-time updates. Cached keys and values might only be relevant for as long as the news is hot. Implementing a TTL policy where cached entries expire after, say, one day ensures that responses to similar queries about fresh developments are generated fast while old news doesn’t fill up memory.

Finally, the most sophisticated of the three popular cache invalidation strategies is based on contextual relevance. Here, we clear the cache contents as soon as they become irrelevant to the current context or user interaction. This strategy is ideal when the application handles diverse tasks or topics within the same session, and the previous context doesn’t contribute value to the new one.

Think about a coding assistant that works as an IDE plug-in. While the user is working on a particular set of files, the cache should be retained. As soon as they switch to a different codebase, however, the previous keys and values become irrelevant and can be deleted to free memory. Contextual relevance-based approaches might be challenging to implement, though, as they require pinpointing the event or point in time at which the context switch occurs.

Cache reuse

Another important aspect of cache management is its reuse. On some occasions, a once-generated cache can be used again to speed up generation and save memory by avoiding storing the same data multiple times in different users’ cache instances.

Cache reuse opportunities typically show up when there is shared context and/or a warm start is desirable.

In scenarios where multiple requests share a common context, one can reuse the cache for that shared portion. In e-commerce platforms, certain products may have standard descriptions or specifications that are frequently asked about by multiple customers. These may include product details (“55-inch 4K Ultra HD Smart LED TV”), warranty information (“Comes with a 2-year manufacturer’s warranty covering parts and labor.”), or customer instructions (“For best results, mount the TV using a compatible wall bracket, sold separately.”). By caching the key-value pairs for these shared product descriptions, a customer support chatbot will generate responses to common questions faster.

Similarly, one can precompute and cache the initial K-V pairs for frequently used prompts or queries. Consider a voice-activated virtual assistant application. Users frequently start interactions with phrases like “What’s the weather today?” or “Set a timer for 10 minutes.” The assistant can respond more quickly by precomputing and caching the key-value pairs for these frequently used queries.

Conclusion

Key-value (K-V) caching is a technique in transformer models where the key and value matrices from previous steps are stored and reused during the generation of subsequent tokens. It reduces redundant computations and speeds up inference. This speedup comes at the cost of increased memory consumption. When memory is a bottleneck, one can reclaim some of it by simplifying the model, thus sacrificing its accuracy. Implementing K-V caching in large-scale production systems requires careful cache management, including choosing the strategy for cache invalidation and exploring the opportunities for cache reuse.


LockBit Developer Rostislav Panev Extradited from Israel to the US


The US extradites LockBit ransomware developer, Rostislav Panev, from Israel. Learn how his arrest impacts the fight against cybercrime and understand LockBit’s devastating impact.

The United States has achieved a significant victory in its ongoing battle against cybercrime with the extradition of Rostislav Panev, a 51-year-old dual Russian and Israeli national, who is accused of being a key developer of the notorious LockBit ransomware. 

Panev is alleged to have been deeply involved in the development and maintenance of the LockBit ransomware from its inception around 2019 until at least February 2024. During this period, he and his co-conspirators are believed to have transformed LockBit into what the Department of Justice (DoJ) describes as “the most active and destructive ransomware group in the world.”

The group, operating as a ransomware-as-a-service (RaaS) model, is believed to have targeted over 2,500 victims across at least 120 countries, including approximately 1,800 victims within the United States. These victims spanned across critical sectors, encompassing hospitals, schools, and government agencies, causing widespread disruption and financial losses.

The financial impact of LockBit’s activities is staggering. According to the DoJ, the group successfully extracted at least $500 million in ransom payments, while causing billions of dollars in additional losses through lost revenue and recovery costs. Evidence uncovered by law enforcement indicates Panev’s direct involvement in the development of tools that facilitated these attacks.

“The LockBit group attacked more than 2,500 victims in at least 120 countries around the world, including 1,800 in the United States. Their victims ranged from individuals and small businesses to multinational corporations, including hospitals, schools, nonprofit organizations, critical infrastructure, and government and law-enforcement agencies,” the DoJ’s press release revealed.

Authorities discovered administrator credentials on his computer, granting access to a dark web repository containing the source code for multiple versions of the LockBit builder, which enabled affiliates to generate custom malware.

They also found source code for the StealBit tool, used to exfiltrate stolen data, as well as evidence of direct communications between Panev and Dmitry Yuryevich Khoroshev, the alleged primary administrator of LockBit, who has also been charged by the DoJ. The communications discussed development work on the LockBit builder and control panel.

Furthermore, financial records revealed cryptocurrency transfers exceeding $230,000 from Khoroshev to Panev between June 2022 and February 2024, providing concrete evidence of their financial relationship. In interviews with Israeli authorities, Panev reportedly admitted to performing coding, development, and consulting work for LockBit, confirming the regular cryptocurrency payments he received.

Panev’s extradition from Israel, where he was apprehended in August 2024 following a US provisional arrest request, marks a crucial step in holding individuals accountable for their roles in the devastating ransomware attacks that have plagued organizations worldwide. He has since appeared before a US magistrate and will remain detained pending his trial.



Building The Most Scalable Experiment Tracker For Foundation Models


In large-scale training of huge models, anomalies are not rare events but problematic patterns that drive failure. Detecting anomalies early in the process saves days of work and training.

ML model training observability is not just about tracking metrics. It requires proactive monitoring to catch issues early and ensure model success, given the high cost of training on large GPU clusters.

If you are an enterprise or a team operating a model, focus on three key areas: fine-tune your prompts to get the most effective outputs (prompt engineering), ensure that your model behaves safely and predictably, and implement robust monitoring and logging to track performance and detect issues early.

The Neptune Scale experiment tracker supports fault tolerance and is designed to maintain progress despite hardware failures, making it adaptable for enterprise teams tackling LLM fine-tuning, compliance, and building domain-specific models.

Scaling large language model (LLM) operations is a challenge that many of us are facing right now. For those navigating similar waters, I recently shared some thoughts about it on the Data Exchange Podcast based on our journey at neptune.ai over the last few years. 

Six years ago, we were mainly focused on MLOps when machine learning in production was still evolving. Experiment tracking back then was straightforward—dealing mostly with single models or small-scale distributed systems. Reinforcement learning was one of the few areas pushing the boundaries of scale: we wanted to run multiple agents and send data from multiple distributed machines to our experiment tracker. This was a huge challenge.

Scaling LLMs: from ML to LLMOps

The landscape changed two years ago when people started training LLMs at scale. LLMOps has taken center stage, and the importance of scaling large language models has grown with research becoming more industrialized. While researchers continue to lead the training process, they are also adjusting to the transition toward commercial applications.

LLMOps isn’t just MLOps with bigger servers; it is a paradigm shift for tracking experiments. We’re not tracking a few hundred metrics for a couple of hours anymore; we’re tracking thousands, even tens of thousands, over several months. These models are trained on GPU clusters spanning multiple data centers, with training jobs that can take months to complete.

Due to time constraints, training frontier models has become a production workflow rather than experimentation. When a training-from-scratch run occupies 50,000 GPUs for several months across different data centers, you don’t get a second chance if something goes wrong—you need to get it right the first time.

Another interesting aspect of LLM training that only a few companies have truly nailed is the branch-and-fork style of training—something that Google has implemented effectively. This method involves branching off multiple experiments from a continuously running model, requiring a significant amount of data from previous runs. It’s a powerful approach, but it demands infrastructure capable of handling large data inheritance, which makes it feasible only for a handful of companies.

From experiment tracking to experiment monitoring

Now we want to track everything—every layer, every detail—because even a small anomaly can mean the difference between success and failure and many hours of wasted work. We should consider not only pre-training and training time; post-training also takes a huge amount of time and collaborative human work. Recognizing this, we have re-engineered Neptune’s platform to efficiently ingest and visualize massive volumes of data, enabling fast monitoring and analysis at a larger scale.


The Neptune Scale experiment tracker in action: enabling real-time monitoring and visualization of every relevant metric during model training (in this example, BLEU and edit distance). Neptune helps track metrics across 200 runs, allowing users to identify patterns and potential anomalies early in pre-training and post-training, thus reducing risks even in long-running LLM training workflows.

One of the biggest lessons we’ve learned is that experiment tracking has evolved into experiment monitoring. Unlike MLOps, tracking is no longer just about logging metrics and reviewing them later or restarting your training from a checkpoint a few steps back. It’s about having real-time insights to keep everything on track. With such long training times, a single overlooked metric can lead to significant setbacks. That’s why we’re focusing on building intelligent alerts and anomaly detection right into our experiment tracking system.

Think of it like this—we’re moving from being reactive trackers to proactive observers. Our goal is for our platform to recognize when something is off before the researcher even knows to look for it.

Fault tolerance in LLMs

When you’re dealing with LLM training at this scale, fault tolerance becomes a critical component. With thousands of GPUs running for months, hardware failures are almost inevitable. It’s crucial to have mechanisms in place to handle these faults gracefully. 

At Neptune, our system is designed to ensure that the training can resume from checkpoints without losing any data. Fault tolerance does not only mean preventing failures; it also includes minimizing the impact when they occur, so that time and resources are not wasted.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

What does this mean for enterprise teams?

If you’re creating your own LLMs from scratch, or even if you’re an enterprise fine-tuning a model, you might wonder how all this is relevant to you. Here’s the deal: strategies originally designed for handling the massive scale of training LLMs are now being adopted in other areas or by smaller-scale projects. 

Today, cutting-edge models are pushing the boundaries of scale, complexity, and performance, but these same lessons are starting to matter in fine-tuning tasks, especially when dealing with compliance, reproducibility, or complex domain-specific models.

For enterprise teams, there are three key focuses to consider:

  1. Prompt Engineering: Fine-tune your prompts to get the most effective outputs. This is crucial for adapting large models to your specific needs without having to train from scratch.
  2. Implement guardrails in your application: Ensuring your models behave safely and predictably is key. Guardrails help manage the risks associated with deploying AI in production environments, especially when dealing with sensitive data or critical tasks.
  3. Observability in your system: Observability is vital to understanding what’s happening inside your models. Implementing robust monitoring and logging allows you to track performance, detect issues early, and ensure your models are working as expected. Neptune’s experiment tracker provides the observability you need to stay on top of your model’s behavior.

The future: what we’re building next

At Neptune, we’ve nailed the data ingestion part—it’s fast, reliable, and efficient. The challenge for the next year is making this data useful at scale. We need more than just filtering; we need smart tools that can surface the most critical insights and the most granular information automatically. The goal is to build an experiment tracker that helps researchers discover insights, not just record data.

We’re also working on developing a platform that combines monitoring and anomaly detection with the deep expertise researchers acquire over years of experience. By embedding that expertise directly into the tool (either automatically or by defining rules manually), less experienced researchers can benefit from the patterns and signals that would otherwise take years to learn.


Modat launches premier product, Modat Magnify for Cybersecurity Professionals


The Hague, the Netherlands, March 13th, 2025, CyberNewsWire

Founded in 2024, Modat, the European-crafted, research-driven, AI-powered cybersecurity company, has announced the launch of its premier product, Modat Magnify.

Designed by and for cybersecurity professionals, the product aims to make their lives easier by giving them access to the largest Internet ‘Device DNA’ dataset available. The ‘Device DNA’ catalogues the essential attributes of each internet-connected device to create a unique profile.

  • FAST. AI-powered for unparalleled speed. Continuously scans the entire internet and identifies adversary infrastructure in real time.
  • SMART. Research enhances the development of the platform and offers contextualized data, historical context, and predictive insights.
  • EASY. User-centric UI designed from firsthand experience as cybersecurity professionals. From the initial query to the results pages, everything is easy to filter, read, and use.

“It starts with research to gain insight and build,” says Soufian El Yadmani, CEO & Founder of Modat. “Offensive and defensive professionals shared what solutions they need to be faster and to focus on what they do best. Scanning the internet is just a beginning. Speed, contextual data, and insight is vital to our products and services. Our ‘Device DNA’ gives value in the results to increase proactive efforts and build cyber resilience.” 

“Protecting your country takes clear insight into internet connected devices. Modat helps you to protect your country’s infrastructure with this insight,” emphasized Vincent Thiele, COO & Co-Founder of Modat. “We support communities to improve the health of the Internet and deliver products to help make the internet a safer place.”


Pricing & Access:  

  • FREE:  covers most basic use cases. Solid start for many security professionals 
  • Practitioner: €20/m  
  • Professional: €60/m 
  • Business: €400/m 
  • Enterprise: tailored solutions for more complex needs of organisations and governments 

About Modat

Modat, founded in 2024 is the European-crafted, AI-powered, research-driven cybersecurity company dedicated to helping security professionals outpace adversaries and stay ahead of evolving threats. Their flagship product, Modat Magnify, provides access to the world’s largest Internet “Device DNA” dataset.  

Modat was created by researching, listening to, and directly experiencing the needs and challenges of security professionals. Their products enable the security community by giving access to unparalleled speed, contextualized data, and predictive insights.  

By design, the Modat Magnify platform helps offensive and defensive professionals by giving them a fast, smart, easy way to stop searching and start finding. Our ‘Device DNA’ catalogues the essential attributes of each internet-connected device to create a unique profile to support proactive cybersecurity. 

Modat empowers individuals, companies, and governments to strengthen their security posture and increase cyber resiliency. The team is actively joining the fight to get ahead of cyber-attacks by narrowing the growing gap between digital threats and resilience. Join us to outpace and outlast.

Contact:

modat.io


For quotes/to schedule an interview, users can reach: 

Soufian El Yadmani – CEO & Founder 

Email: [email protected]  


Vincent Thiele – COO & Co-Founder 

Email: [email protected]  


Contact

Head of Marketing
Bessie Schenk
Modat
[email protected]


KnowBe4 Wins Cybersecurity Company of the Year at the 2025 teissAwards


KnowBe4, the world-renowned cybersecurity platform that comprehensively addresses human risk management, today announced that it has been awarded first place in this year’s teissAwards Cybersecurity Company of the Year category for enterprise organisations.

The teissAwards celebrate excellence in cyber and information security, recognising the outstanding contributions of vendors and technologies over the past year.

Winning first place in the Cybersecurity Company of the Year category underscores KnowBe4’s commitment to innovation, product development, and addressing the human element in cybersecurity. It also reflects the organisation’s dedication to improving cyber resilience by placing the customer at the heart of its operations.

Over the past 12 months, KnowBe4 has consistently integrated advanced AI-driven capabilities into its platform, providing organisations with an innovative approach to managing human risk in real-time. This enhancement highlights KnowBe4’s ongoing commitment to adapting its offerings to meet the evolving demands of the security landscape, particularly in addressing vulnerabilities stemming from human error.

“This recognition is a testament to our team’s hard work and dedication to empowering organisations to manage human risk effectively,” said Stu Sjouwerman, CEO of KnowBe4. “Our platform’s success comes from combining innovative technology with effective human risk management, helping organisations build a strong security culture from the ground up. We remain committed to continuous innovation and providing our customers with the tools and knowledge they need to stay ahead of evolving cyber threats.”

For more information on the teissAwards, please visit here. For more information on KnowBe4, please visit here.

The post KnowBe4 Wins Cybersecurity Company of the Year at the 2025 teissAwards appeared first on IT Security Guru.

How AI is Shaping the Future of Stock Market Predictions



Introduction:

The stock market is a dynamic and unpredictable environment, and for years, predicting its movements has been both an art and a science. But what if technology could enhance our ability to predict these fluctuations more accurately and efficiently? Enter artificial intelligence (AI). AI is now making a significant impact in financial markets, providing tools to better predict trends, optimize portfolios, and even forecast market crashes. In this article, I’ll explore how AI in high-frequency trading, AI predicting market crashes, and machine learning in portfolio optimization are revolutionizing the way investors approach the stock market.

The Basics of AI in Stock Market Predictions

Before diving deep into the applications, let’s first understand what AI and machine learning are. Artificial Intelligence (AI) refers to the ability of machines to perform tasks that would normally require human intelligence, such as learning, problem-solving, and decision-making. Machine learning, a subset of AI, enables systems to learn from data, improve their predictions over time, and make decisions without explicit programming.

In stock market predictions, AI algorithms analyze vast amounts of data to identify patterns, correlations, and trends. For example, AI might look at historical stock prices, news articles, financial reports, and even social media to predict future market behavior. By using predictive analytics and sophisticated algorithms, AI is helping investors make more informed decisions.

The Evolution of AI in Stock Market Predictions

AI’s role in stock market predictions has evolved significantly over the years. In the early days, traders relied on simple statistical models and human intuition. But as computing power increased, so did the complexity of predictive models. The introduction of AI in high-frequency trading marked a major turning point. AI-driven algorithms can now execute trades at lightning speeds, analyzing vast data sets and making decisions in milliseconds.

The rise of machine learning further enhanced stock market predictions by allowing models to learn from data without human intervention. Over time, the algorithms became more accurate, capable of recognizing intricate patterns that were once invisible to human traders. Today, AI can predict stock price movements with impressive precision, analyze market sentiment, and even foresee potential market crashes.

How AI Enhances Stock Market Predictions

So, how exactly does AI enhance stock market predictions? Let’s break it down into several key areas.

Big Data Integration

AI thrives on data. The more information it has, the better it can predict market trends. Unlike traditional models, AI can process large amounts of unstructured data, such as news articles, social media posts, and financial reports. This enables it to detect subtle signals that could impact the market, providing investors with a more comprehensive view of the situation.

Sentiment Analysis

AI can also analyze investor sentiment by examining social media posts, news stories, and forums. By understanding how investors feel about certain stocks or the market in general, AI can predict market movements that are driven by emotions like fear or optimism. This is especially important in volatile market conditions, where sentiment plays a significant role.

Pattern Recognition

Machine learning algorithms are exceptional at recognizing patterns in vast data sets. For example, AI can identify recurring patterns in stock price movements or correlations between specific economic events and market behavior. This pattern recognition can be invaluable for predicting future price movements and adjusting investment strategies accordingly.

Speed and Efficiency

AI can analyze and process data far faster than any human. This gives it a significant advantage in high-frequency trading, where the ability to act quickly can make a substantial difference. AI’s speed and efficiency allow it to capitalize on market opportunities that would otherwise be missed by human traders.

Automation of Decision-Making

One of AI’s most important advantages is its ability to automate decision-making. In high-frequency trading, for example, AI can make thousands of trades per second, adjusting its strategies in real-time based on data. This automation reduces the risk of human error and increases the overall efficiency of trading systems.

AI vs. Traditional Methods: Pros and Cons

AI has undoubtedly revolutionized stock market predictions, but it’s essential to compare its effectiveness with traditional methods.

Benefits of AI

  • Speed: AI can process vast amounts of data in seconds, enabling quicker decisions.
  • Accuracy: AI models are trained to identify patterns that may be missed by human analysts.
  • Adaptability: AI algorithms continuously learn and adapt based on new data.
  • Risk Reduction: AI’s automated decision-making can reduce the chances of human error.
  • Comprehensive Data Analysis: AI can analyze unstructured data, such as news articles and social media, which traditional methods cannot.

Limitations of AI

  • Data Dependency: AI is only as good as the data it’s given. If the data is biased or incomplete, the predictions can be flawed.

  • Lack of Human Judgment: While AI is excellent at analyzing data, it lacks the intuitive judgment that human investors bring to the table.
  • Overfitting: AI models can sometimes become too finely tuned to historical data, which can limit their effectiveness in predicting future market behavior.
  • The “Black-Box” Problem: Many AI models operate as black boxes, meaning it’s often unclear how they arrive at specific predictions. This can make it difficult to trust the system fully.

Real-World Applications of AI in Stock Market Predictions

AI is already being used in a variety of real-world applications to improve stock market predictions.

Algorithmic Trading: AI in high-frequency trading has been a game-changer for the financial industry. AI-powered algorithms can execute trades at lightning speeds, far faster than any human could. These algorithms analyze market data in real-time and execute trades based on predefined criteria, capitalizing on small price movements that occur in fractions of a second.

Robo-Advisors: Robo-advisors use AI to provide automated, algorithm-driven financial planning services. They assess individual investor preferences, goals, and risk tolerance to create personalized portfolios. Machine learning in portfolio optimization helps these robo-advisors adjust portfolios automatically based on market conditions, minimizing risk and maximizing returns.

Hedge Funds and Investment Banks: Many hedge funds and investment banks are now using AI to gain an edge in the market. For example, AI can analyze vast datasets, including alternative data like satellite images and weather reports, to predict stock movements. This allows institutional investors to make data-driven decisions faster and more accurately.

AI-Powered Prediction Platforms: Platforms such as QuantConnect and Kavout offer AI-driven predictions for stocks, using machine learning algorithms to identify profitable trades. These platforms have become increasingly popular among retail investors who want to leverage AI to make better trading decisions.

Challenges and Ethical Considerations

Despite the many advantages, there are several challenges and ethical concerns surrounding the use of AI in stock market predictions.

Data Bias and Ethical Implications: AI models are heavily dependent on the data they’re trained on. If the data is biased or flawed, the predictions can be inaccurate, which could lead to unethical market behavior. It’s essential to ensure that AI models are trained on diverse, representative data to avoid reinforcing existing biases.

Market Manipulation Risks: AI-driven trading systems, especially those in high-frequency trading, have the potential to manipulate markets. The speed at which these systems operate could give a few investors an unfair advantage, potentially distorting stock prices and creating market instability.

The Role of Regulation: As AI continues to influence stock market predictions, regulators will need to establish guidelines to ensure fair and transparent use of AI in financial markets. Governments must create frameworks to address concerns like algorithmic manipulation, data privacy, and the ethical use of AI.

Over-Reliance on AI: There’s a risk that investors might become overly reliant on AI, ignoring the human judgment that is essential in complex market conditions. AI should be seen as a tool to assist investors, not replace them entirely.

The Future of AI in Stock Market Predictions

AI is constantly evolving, and its potential in stock market predictions is vast. Here are some ways AI might shape the future of stock market predictions:

Advancements in AI Technology: As AI technology continues to improve, we can expect even more accurate predictions and more sophisticated trading algorithms. The combination of AI with other emerging technologies, such as quantum computing, could revolutionize stock market predictions.

Integrating AI with Other Technologies: AI’s role in the stock market will continue to grow, especially when integrated with technologies like blockchain and big data. For example, blockchain could provide a more secure and transparent way of recording AI-driven trades.

Impact on Investment Strategies: As AI becomes more ingrained in the stock market, it will likely lead to a shift in investment strategies. Both retail and institutional investors will increasingly rely on AI to make data-driven decisions, which could level the playing field and open up new opportunities for smaller investors.

Ethical Frameworks for the Future: In the future, it will be crucial to develop ethical frameworks to govern the use of AI in stock market predictions. These frameworks should address issues such as transparency, accountability, and fairness to ensure that AI is used responsibly and ethically in financial markets.

Conclusion

AI has already had a profound impact on stock market predictions, enhancing the speed, accuracy, and efficiency of trading. From AI in high-frequency trading to AI predicting market crashes and machine learning in portfolio optimization, the potential for AI to transform financial markets is vast. While there are challenges and ethical concerns, AI’s ability to analyze vast amounts of data and identify hidden patterns is reshaping the way investors approach the stock market. Looking ahead, AI will likely continue to evolve, making stock market predictions even more accurate and accessible. The future of stock market predictions will be shaped by how well investors combine AI’s analytical power with sound human judgment.

Understanding LLMs Requires More Than Statistical Generalization [Paper Reflection]


In our paper, Understanding LLMs Requires More Than Statistical Generalization, we argue that current machine learning theory cannot explain the interesting emergent properties of Large Language Models, such as reasoning or in-context learning. From prior work (e.g., Liu et al., 2023) and our experiments, we’ve seen that these phenomena cannot be explained by reaching globally minimal test loss – the target of statistical generalization. In other words, model comparison based on the test loss is nearly meaningless.

We identified three areas where more research is required:

  • Understanding the role of inductive biases in LLM training, including the role of architecture, data, and optimization.
  • Developing more adequate measures of generalization.
  • Using formal languages to study language models in well-defined scenarios to understand transfer performance.

In this commentary, we focus on diving deeper into the role of inductive biases. Inductive biases, such as the model architecture or the optimization algorithm, affect which solution the neural network converges to. For example, Stochastic Gradient Descent (SGD) favors neural networks with minimum-norm weights.

Inductive biases influence model performance. Even if two models with parameters θ1 and θ2 yield the same training and test loss, their downstream performance can differ.

How do language complexity and model architecture affect generalization ability?

In their Neural Networks and the Chomsky Hierarchy paper published in 2023, Delétang et al. showed how different neural network architectures generalize better for different language types.

Following the well-known Chomsky hierarchy, they distinguished four grammar types (regular, context-free, context-sensitive, and recursively enumerable) and defined corresponding sequence prediction tasks. Then, they trained different model architectures to solve these tasks and evaluated if and how well the model generalized, i.e., if a particular model architecture could handle the required language complexity.

In our position paper, we follow this general approach to expose the interaction of architecture and data in formal languages to gain insights into complexity limitations in natural language processing. We study popular architectures used for language modeling, e.g., Transformers, State-Space Models (SSMs) such as Mamba, the LSTM, and its novel extended version, the xLSTM.

To investigate how these models deal with formal languages of different complexity, we use a simple setup where each language consists only of two rules. During training, we monitor how well the models perform next-token prediction on the (in-distribution) test set, measured by accuracy.

However, our main question is whether these models generalize out-of-distribution. For this, we introduce the notion of rule extrapolation.

Can models adapt to changing grammar rules?

To understand rule extrapolation, let’s start with an example. A simple formal language is the anbn language, where the strings obey two rules:

  1. a’s come before b’s.
  2. The number of a’s and b’s is the same.

Examples of valid strings include “ab” and “aabb,” whereas strings like “baab” (violates rule 1) and “aab” (violates rule 2) are invalid. Having trained on such strings, we feed the models an out-of-distribution (OOD) string, violating rule 1 (e.g., a string where the first token is b). 

We find that most models still obey rule 2 when predicting tokens, which we call rule extrapolation – they do not discard the learned rules entirely but adapt to the new situation in which rule 1 is seemingly no longer relevant. 
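
As a small illustration (not code from the paper), checking the two rules on a model’s output is straightforward, which is essentially what a rule-extrapolation evaluation has to do; the prompt and completion below are hypothetical.

def follows_rule_1(s: str) -> bool:
    """All a's come before all b's (no "ba" substring over the alphabet {a, b})."""
    return "ba" not in s

def follows_rule_2(s: str) -> bool:
    """The number of a's equals the number of b's."""
    return s.count("a") == s.count("b")

# In-distribution examples
assert follows_rule_1("aabb") and follows_rule_2("aabb")
assert not follows_rule_1("baab")     # violates rule 1
assert not follows_rule_2("aab")      # violates rule 2

# Rule extrapolation: an OOD prompt violating rule 1 (it starts with "b").
# A model extrapolates rule 2 if its completion still balances a's and b's.
prompt = "b"
completion = "aab"                    # hypothetical model output
assert follows_rule_2(prompt + completion)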

This finding is surprising because none of the studied model architectures includes conscious choices to promote rule extrapolation. It emphasizes our point from the position paper that we need to understand the inductive biases of language models to explain emergent (OOD) behavior, such as reasoning or good zero-/few-shot prompting performance.

Efficient LLM training requires understanding what is a complex language for an LLM

According to the Chomsky hierarchy, the context-free anbn language is less complex than the context-sensitive anbncn language, where the n a’s and n b’s are followed by an equal number of c’s.

Despite their different complexity, the two languages seem very similar to humans. Our experiments show that, e.g., Transformers can learn context-free and context-sensitive languages equally well. However, they seem to struggle with regular languages, which are deemed to be much simpler by the Chomsky hierarchy.

Based on this and similar observations, we conclude that language complexity, as the Chomsky hierarchy defines it, is not a suitable predictor for how well a neural network can learn a language. To guide architecture choices in language models, we need better tools to measure the complexity of the language task we want to learn.

It’s an open question what these could look like. Presumably, we’ll need to find different complexity measures for different model architectures that consider their specific inductive biases.

Searching for a free experiment tracking solution for your academic research?

Join 1000s of researchers, professors, students, and Kagglers using neptune.ai for free to make monitoring experiments, comparing runs, and sharing results far easier than with open source tools.

What’s next?

Understanding how and why LLMs are so successful paves the way to more data-, cost- and energy efficiency. If you want to dive deeper into this topic, our position paper’s “Background” section is full of references, and we discuss numerous concrete research questions.

If you’re new to the field, I particularly recommend Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models (2023) by Liu et al., which nicely demonstrates the shortcomings of current evaluation practices based on the test loss. I also encourage you to check out SGD on Neural Networks Learns Functions of Increasing Complexity (2023) by Nakkiran et al. to understand more deeply how using stochastic gradient descent affects what functions neural networks learn.


Using AI-Driven Cybersecurity Training to Counter Emerging Threats


Cary, North Carolina, March 13th, 2025, CyberNewsWire

As Artificial Intelligence (AI)-powered cyber threats surge, INE Security, a global leader in cybersecurity training and certification, is launching a new initiative to help organizations rethink cybersecurity training and workforce development. The company warns that AI is reshaping both the threat landscape and the skills required for cybersecurity professionals. While AI offers significant advantages in cyber defense, organizations must ensure their teams are properly trained to leverage it effectively without becoming overly reliant on automation.

“The rise of AI in cybersecurity isn’t just a challenge—it’s an opportunity,” said Dara Warn, CEO of INE Security. “By training cybersecurity professionals properly, AI can be leveraged to filter noise, reduce burnout, and increase efficiency. However, if we don’t train people to understand the ‘why’ behind AI-driven decisions, we risk a future where cybersecurity professionals are blindly following AI without the expertise to think critically beyond it.”

AI as a Force Multiplier: Improving SOC Efficiency and Threat Detection

AI-driven security tools are improving the signal-to-noise ratio, making Security Operations Centers (SOCs) more efficient by reducing false positive alerts—an area cybersecurity tools have been refining for over a decade. AI can prioritize critical threats, allowing analysts to focus on real dangers rather than wasting time investigating false alarms.

“AI is making threat detection smarter, but it’s not foolproof,” said Tracy Wallace, Director of Content at INE Security. “Security professionals need to be trained to work alongside AI, not just follow its outputs. AI is great at reducing alert fatigue, but analysts still need the expertise to investigate, interpret, and respond to threats accurately.”

Generative AI: A Double-Edged Sword for Cybersecurity Talent

One of the most promising yet complex aspects of AI’s rise is its impact on the cybersecurity workforce. On one hand, generative AI will lower the barrier to entry, allowing more professionals to enter the cybersecurity field and reducing the global labor shortage. 

However, this shift also presents risks. “The concern isn’t that AI is making cybersecurity easier,” said Wallace. “The concern is that if professionals become too dependent on AI outputs, they won’t develop the critical-thinking skills necessary to work beyond what the AI gives them. Organizations must ensure that cybersecurity training teaches professionals not just how to use AI but how to work independently of it when needed.”

The Data Privacy Dilemma: AI and LLM Security Risks

Another concern in AI-driven cybersecurity is data privacy and security risks with large language models (LLMs). While concerns over data leakage with cloud-based AI models are growing, this isn’t a new challenge—it’s an evolution of longstanding security principles. Organizations must ensure AI-powered security solutions do not require external data sharing.

“As AI becomes more deeply integrated into cybersecurity operations, privacy-first security architectures are crucial,” said Wallace. “Organizations need AI models that can operate securely without exposing sensitive data to external systems.”

The Future of AI Security Training: Agentic Architectures and AI-Driven Automation

Looking ahead, agentic AI architectures are becoming a hot topic in cybersecurity. While some view them as buzzword hype, there is real potential for AI-driven security agents that autonomously investigate threats, adjust defenses in real time, and improve security workflows with minimal human intervention.

However, automation must be carefully balanced. “Agentic AI might be the future, but we can’t let it replace hands-on expertise and human decision-making,” said Warn. “Security professionals must be trained to interpret AI-driven insights, make judgment calls, and recognize when AI is wrong.”

Training as the Solution: INE Security’s AI-Powered Cybersecurity Curriculum

To close the cybersecurity skills gap and help professionals work effectively with AI, INE Security is working to expand its AI-driven training programs. These programs will focus on:

  • AI-Driven Threat Analysis – Training security teams to interpret AI-generated threat intelligence and reduce false positives.
  • Machine Learning for Cyber Defense – Teaching professionals how AI-powered security models work and how attackers exploit AI vulnerabilities.
  • Generative AI in Cybersecurity – Helping cybersecurity teams understand the risks and benefits of AI-generated attacks and defenses.
  • Hands-On AI Security Labs – Simulating real-world AI-powered attacks and training professionals on how to counter them manually and with AI assistance.

“Our end goal is not just to train security professionals how to use AI but to train them how to think critically in an AI-driven world,” said Wallace. 

The Call to Action: Prepare for AI-Driven Threats Now

With AI transforming cybersecurity threats at an unprecedented pace, INE Security urges companies to:

  • Train their cybersecurity teams on AI-driven tools, while ensuring they develop critical problem-solving skills.
  • Prioritize AI-powered security solutions that enhance, not replace, human expertise.
  • Implement privacy-first AI models that reduce data exposure risks.

“The AI revolution in cybersecurity is here,” concluded Warn. “Organizations that act now—by investing in security training, developing cybersecurity talent, and understanding how AI truly impacts the field—will be the ones leading the industry forward. The future of cybersecurity belongs to those who train for it.”

About INE Security

INE Security is the premier provider of online networking and cybersecurity training and certification. Harnessing a powerful hands-on lab platform, cutting-edge technology, a global video distribution network, and world-class instructors, INE Security is the top training choice for Fortune 500 companies worldwide for cybersecurity training in business and for IT professionals looking to advance their careers, offering both Red Team training and Blue Team training. INE Security’s suite of learning paths offers an incomparable depth of expertise across cybersecurity and is committed to delivering advanced technical training while also lowering the barriers worldwide for those looking to enter and excel in an IT career.

Contact

Kathryn Brown
INE Security
[email protected]


Advancing Gender Equality in 2025 and Beyond


International Women’s Day (IWD) 2025 carries the powerful theme: ‘Accelerate Action.’ This theme calls on individuals, communities, and organisations to take decisive steps toward achieving gender equality. Despite ongoing efforts, at the current rate of progress, it will take until 2158, more than five generations, to reach full gender parity, according to the World Economic Forum. Such a timeline is unacceptable. Now, more than ever, we must accelerate action to break down systemic barriers and biases that hinder gender equality in both personal and professional spheres.

The Role of Women in Cybersecurity

The cybersecurity industry is one of the fastest-growing sectors globally, yet women remain deeply underrepresented within it. As of 2022, women accounted for only 25% of cybersecurity jobs, with projections suggesting an increase to 30% by 2025. However, leadership positions remain scarce for women, particularly in the UK. The challenge is ensuring this growth is meaningful, extending beyond entry-level roles to positions of influence and decision-making.

With Diversity, Equity, and Inclusion (DEI) initiatives under increasing threat due to shifting global political landscapes, it is crucial to keep the conversation about gender equality alive, even when formal policies may be at risk. Industry leaders must explore how to sustain progress and prevent regression in diverse hiring and inclusive workplace cultures.

Step Outside Your Comfort Zone

Liz Harvey, Director of Product Research at Huntress, believes true growth comes from stepping outside one’s comfort zone and embracing diversity in all its forms. She emphasises the importance of rejecting sameness and actively fostering inclusivity by challenging norms and making intolerance unacceptable.

“Build tolerance. Become the other,” Harvey says, reflecting on the experiences that shaped her perspective. From working as an AmeriCorps Construction Crew Lead at Habitat for Humanity, where she defied traditional gender roles, to immersing herself in different cultures while studying abroad, she has continuously sought opportunities to broaden her worldview.

She recalls joining community soccer leagues and summer camps organised by a religion different from her own, gaining first-hand insight into new beliefs and perspectives. Throughout her career, she has often been the only woman in the room, yet she has never let that limit her. Instead, she encourages others to step beyond familiar spaces, embrace discomfort, and contribute to a more inclusive world.

“Reject sameness. Embrace adventuring out of your comfort zone to evolve humanity,” Harvey urges. “Make intolerance unacceptable.” Her message is clear: true progress comes from diversity, curiosity, and the courage to challenge societal norms.

Joy Burkholder Meier, General Counsel and Chief Human Resources Officer at Black Duck, agrees that stepping outside one’s comfort zone is important. She attributes much of her career growth to mentorship, not through formal programs but through organic relationships with leaders who offered guidance and encouragement.

Meier stresses the importance of being prepared for opportunities, embracing challenges, and actively solving problems rather than merely identifying them. Her key advice is to work hard, be a problem-solver, and make others’ jobs easier to stand out and advance in your career.

On diversity, Meier states: “Diverse viewpoints lead to the best results. If we don’t problem-solve with these diverse customer bases in mind, then we will have blind spots. And for me, diversity means a lot of different things – not only people of varying gender, race, or nationality but also different educational backgrounds and experiences. A diverse team is going to win every time.”

Breaking Barriers and Driving Change

Dr. Ksenia Peguero, Director of Software Engineering at Black Duck, underscores the historical significance of International Women’s Day, particularly in Russia and other countries where it’s been observed for over a century. “Having grown up in the Soviet Union, International Women’s Day has always been important for me. Firstly, it was and still is a federal holiday in my home country and in many other countries. It was declared a holiday in Russia by Vladimir Lenin as a day to celebrate gender equality in labour and voting rights more than a hundred years ago. Secondly, although the agenda of the holiday has changed throughout the years, its main focus on women’s rights and the advancement of women in the workplace and in all spheres of life is as important today as it was a hundred years ago,” she explains. 

Despite progress, gender disparities in pay, leadership opportunities, and household responsibilities persist, making the observance of this day more relevant than ever. In the tech industry, initiatives such as Girls Who Code and workplace employee resource groups (ERGs) are actively working to reduce bias and support women’s success. “In the technical field, women and allies have been working hard over the last few years to advance the success of women,” Peguero emphasises.

Aditi Gupta, Senior Manager of Professional Services Consulting at Black Duck, reflects on the slow but steady progress of women in STEM. “When I entered the technology workforce in India over 15 years ago, women made up roughly 12% of the STEM workforce. Growing up in my small Indian town, my exposure to professional women was primarily limited to teachers and bank employees, even though countless women contributed invisibly to the economy through informal labour. As one of the fortunate 8% of women enrolled in engineering programs then, I learned early on to pursue the less travelled path,” she notes. 

Despite these challenges, research consistently shows that companies with diverse leadership financially outperform their peers by 25%. At Black Duck, initiatives like the Women’s Employee Resource Group (ERG) play a crucial role in bridging gender gaps by providing mentorship, sponsorship, and networking opportunities. “Our ERG works to increase the visibility and representation of women in the industry,” Gupta emphasises, reinforcing the importance of continued efforts to foster diversity in technology.

More Female Representation Will Drive Change

Women face higher rates of cybercrime and online harassment, which makes cybersecurity awareness a vital tool for personal safety. That same first-hand experience can also translate into exciting career opportunities and professional growth within the industry.

Zoya Schaller, Director of Compliance at Keeper Security, emphasises the critical role of cybersecurity in protecting women from the unique threats they face online. “Women experience higher rates of cybercrime, online harassment, and privacy violations,” she explains. With most modern women having some form of online presence, understanding cybersecurity basics is essential for safeguarding personal information and maintaining control over digital identities. 

Beyond personal security, Schaller highlights the growing career opportunities in cybersecurity, an industry that combines intellectual challenges with excellent compensation and rapid growth potential. “By joining this field, women can both protect their own digital lives and help safeguard others,” she says, noting that diverse perspectives strengthen the industry’s ability to combat cyber threats more effectively.

Increasing female representation in cybersecurity is about more than just filling positions; it’s about transforming the industry with fresh perspectives and problem-solving approaches. “When we expand the talent pool to include more women, we’re not only addressing the huge skills gap in the field, but we’re also bringing in new ways of thinking about and solving security problems,” Schaller points out. 

Women’s ability to connect with people and communicate complex concepts in an accessible way makes a tangible impact, especially in designing security measures that users will actually adopt. “What good is a security solution if users find it so frustrating that they look for workarounds?” she asks. Women in cybersecurity also bring invaluable firsthand experience in tackling issues like online harassment and digital privacy, contributing to more effective solutions. Moreover, female leaders tend to uplift other women in the field through mentorship, fostering a ripple effect that benefits the entire industry. “A more diverse cybersecurity industry is better equipped to protect all of us in our increasingly connected world,” Schaller concludes.

Carla Roncato, VP of Identity at WatchGuard Technologies, also looks at how women’s experiences can open doors to career opportunities while addressing critical global challenges. “Today, approximately 850 million people around the world do not have an official ID or a digitally verifiable identification. This impacts their ability to access digital services, such as opening a bank account or applying for a loan. Women, in particular, are disproportionately affected by this identity gap,” she explains. This issue impacts countless communities, including those displaced by conflict and climate disasters, individuals facing housing insecurity, vulnerable youth without legal guardianship, and survivors of domestic violence seeking critical support.

Roncato stresses the importance of raising awareness around the need for digitally verifiable identification to enhance identity protection, reduce fraud, prevent identity theft, and provide broader access to essential services. She also encourages women to consider careers in technology and security, emphasising the opportunities in Digital Identity. “Digital Identity offers not just professional growth but the chance to create impactful change for women everywhere. There has never been a more important time to join this mission and help drive a more inclusive digitally secure future for all.” 

Shaping the Future 

Catarina Santos, Data Protection Consultant at Data Protection People, emphasises the vital role women play in shaping policies, enforcing regulations, and safeguarding data security. She highlights how gender diversity strengthens digital infrastructure and fosters public trust, making the industry more resilient and effective.

“On International Women’s Day, we acknowledge the critical role women play in the evolving field of data protection. As the digital world grows increasingly complex, their expertise is central to shaping policies, enforcing regulations, and ensuring that personal data is kept safe and secure. Women in data protection are instrumental in tackling the challenges of data security, compliance, and privacy in today’s interconnected environment. Their work helps build trust, protect individuals’ rights, and support the integrity of the digital infrastructure. This day serves as a reminder of the importance of diverse leadership and the ongoing need for excellence and innovation in the field of data protection.”

Teresa Jose, Consultant at Pentest People, reflects on her journey into cybersecurity, expressing excitement about her growth and learning in the field. “I was thrilled to enter the cybersecurity industry when I first joined Pentest People as a graduate consultant. I’m incredibly proud of how much I’ve developed my understanding of security within the extended digital environments of organisational structures,” she shares.

She encourages more women to explore careers in cybersecurity, acknowledging the industry’s gender imbalance, particularly in offensive security roles. “Compared to other fields, cybersecurity has fewer female role models, especially in offensive security. I believe more women should consider entering this space,” she says. For those looking to break into the industry, Jose recommends earning fundamental certifications. “Getting certified is a great way to build a strong foundation in cybersecurity and gain a solid understanding of the cyber environment,” she advises.

Natalia Lewandowska, a Security Consultant at Pentest People, highlights the inspiration that comes with being a woman in cybersecurity. “It’s incredible to see more women breaking barriers in this field, bringing diverse perspectives and strengthening the industry as a whole,” she says. The increasing presence of female professionals, including those at Pentest People, fills her with pride and motivation. “Knowing that we are paving the way for future generations to thrive in tech and security is truly inspiring,” she adds.

The Need for Collective Action

The message for IWD 2025 is clear: gender equality cannot wait until 2158. While massive strides have been made, the risk of regression is real, especially with DEI initiatives under threat. Women in cybersecurity and all industries must continue advocating for inclusivity, challenging biases, and accelerating action toward gender equality.

The cybersecurity industry is a prime example of how diverse teams produce better outcomes, bridge skill gaps, and enhance problem-solving. The time for action is now.


Optimizing Test-Time Compute for LLMs: A Meta-Reinforcement Learning Approach with Cumulative Regret Minimization


Enhancing the reasoning abilities of LLMs by optimizing test-time compute is a critical research challenge. Current approaches primarily rely on fine-tuning models with search traces or on RL with binary outcome rewards, but these methods may not exploit test-time compute efficiently. Recent research suggests that increasing test-time compute can improve reasoning by generating longer solution traces and incorporating structured steps such as reflection, planning, and algorithmic search. Two key challenges remain: whether LLMs allocate computational resources effectively based on task complexity, and whether they can discover solutions to more difficult problems when given a larger test-time compute budget. Addressing these is crucial for improving efficiency and generalization in LLM reasoning.

Recent advancements in scaling test-time compute have explored training separate verifiers for selection-based methods like best-of-N or beam search, which can sometimes be more effective than increasing data or model size. However, fine-tuning on unfamiliar search traces may lead to memorization rather than genuine reasoning improvements. RL-based approaches have demonstrated promise in generating chain-of-thought reasoning, enabling models to introspect, plan, and refine their outputs. However, increasing reasoning length does not always correlate with higher accuracy, as models may generate unnecessarily long sequences without meaningful progress. To address this, recent efforts have incorporated structured reward mechanisms and length penalties to encourage efficient reasoning, ensuring that models focus on producing informative, concise solutions rather than excessive computation.

Researchers from Carnegie Mellon University & Hugging Face investigate optimizing test-time compute for LLMs by refining how models allocate computational resources during reasoning. Instead of relying solely on outcome-reward RL, they introduce a fine-tuning approach that balances exploration and exploitation, ensuring steady progress toward correct answers. Their method incorporates a dense reward bonus to quantify progress, improving efficiency. Evaluations on mathematical benchmarks demonstrate that this approach significantly outperforms existing methods, enhancing both accuracy and token efficiency. Their findings also suggest that optimizing for progress minimizes computational regret while improving solution discovery without sacrificing accuracy.

The problem of optimizing test-time compute is framed as a meta reinforcement learning (meta RL) challenge. The goal is to maximize an LLM’s performance within a given test-time token budget by balancing exploration and exploitation. Instead of solely optimizing for outcomes, the proposed Meta Reinforcement Fine-Tuning (MRT) approach minimizes cumulative regret by rewarding progress across sequential episodes. This budget-agnostic strategy allows LLMs to make steady progress regardless of training constraints. By incorporating a reward bonus based on incremental improvements, MRT ensures efficient test-time compute usage, enhancing adaptability and response accuracy within deployment constraints.
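To illustrate the idea, the sketch below shows one way a progress-based bonus of the kind described could be computed: score how much each reasoning episode increases the estimated probability of reaching the correct final answer, and add that increment to the usual outcome reward. This is a hedged sketch under stated assumptions, not the authors' implementation; names such as mrt_style_reward and prob_correct are hypothetical.

```python
from typing import Callable, List

def mrt_style_reward(
    episodes: List[str],
    prob_correct: Callable[[str], float],
    outcome_reward: float,
    alpha: float = 1.0,
) -> float:
    """Sketch of a dense, progress-based reward of the kind MRT describes.

    episodes: the solution trace, split into sequential reasoning episodes.
    prob_correct: hypothetical scorer returning the estimated probability of
        reaching the correct final answer given the trace prefix so far.
    outcome_reward: the usual sparse 0/1 outcome reward for the full trace.
    alpha: weight on the dense progress bonus.
    """
    progress_bonus = 0.0
    prefix = ""
    prev = prob_correct(prefix)          # progress before any episode
    for episode in episodes:
        prefix += episode
        curr = prob_correct(prefix)
        progress_bonus += curr - prev    # reward only the increment in progress
        prev = curr
    return outcome_reward + alpha * progress_bonus
```

Because only increments are rewarded, episodes that add length without adding progress contribute nothing, which is consistent with the goal of discouraging unnecessarily long traces.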

The study evaluates the effectiveness of MRT in optimizing test-time computation, focusing on achieving high accuracy while maintaining computational efficiency. It presents key findings, compares MRT’s efficiency with prior methods, and conducts ablation experiments on token budget and progress. MRT consistently outperforms baseline models and outcome-reward RL (GRPO), achieving state-of-the-art results in its size category. It also improves out-of-distribution robustness and delivers larger performance gains with weaker models. Furthermore, MRT significantly enhances token efficiency, requiring fewer tokens for comparable accuracy. Additional experiments highlight its effectiveness in backtracking search and linearized evaluations.

In conclusion, the study reframes optimizing test-time compute as a meta-reinforcement learning (meta-RL) problem, introducing cumulative regret as a key metric. State-of-the-art outcome-reward RL models fail to minimize regret, often struggling with novel queries within a token budget. This limitation arises from training solely with outcome rewards, which lack the granularity to guide stepwise progress. To address this, MRT is proposed, incorporating a dense reward bonus that encourages incremental improvement. MRT enhances test-time compute efficiency, achieving 2-3x better performance and 1.5x greater token efficiency in mathematical reasoning compared to outcome-reward RL, though several open questions remain.
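For intuition about the regret metric, one simple way to picture cumulative regret is as the gap, summed over episodes within the budget, between an oracle that already has the answer and the model's actual progress. This is a hypothetical illustration in the spirit of the metric, not the paper's exact definition:

```python
def cumulative_regret(progress_per_episode, oracle_progress=1.0):
    """Sum of the gap between an oracle's progress and the model's progress
    after each episode within the test-time budget (illustrative only)."""
    return sum(oracle_progress - p for p in progress_per_episode)

# A trace that makes steady progress accumulates less regret than one
# that stalls and only reaches the answer in the final episode.
steady  = [0.2, 0.5, 0.8, 1.0]
stalled = [0.1, 0.1, 0.1, 1.0]
assert cumulative_regret(steady) < cumulative_regret(stalled)
```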




