
Mixture of Experts LLMs: Key Concepts Explained


Mixture of Experts (MoE) is a type of neural network architecture that employs sub-networks (experts) to process specific input parts.

Only a subset of experts is activated per input, enabling models to scale efficiently. MoE models can leverage expert parallelism by distributing experts across multiple devices, enabling large-scale deployments while maintaining efficient inference.

MoE uses gating and load balancing mechanisms to dynamically route inputs to the most relevant experts, ensuring targeted and evenly distributed computation. Parallelizing the experts, along with the data, is key to an optimized training pipeline.

MoEs train faster than dense LLMs and achieve better or comparable performance on many benchmarks, especially in multi-domain tasks. Challenges include load balancing, distributed training complexity, and tuning for stability and efficiency.

Scaling LLMs comes at a tremendous computational cost. Bigger models enable more powerful capabilities but require expensive hardware and infrastructure, also resulting in higher latency. So far, we’ve mainly achieved performance gains by making models larger, but this trajectory is not sustainable due to escalating costs, increasing energy consumption, and diminishing returns in performance improvement.

When considering the enormous amount of data and the wide variety of domains on which these huge LLMs are trained, it’s natural to ask: instead of using the entire LLM’s capacity, could we pick and choose only the portion of the LLM that is relevant to our particular input? This is the key idea behind Mixture of Experts LLMs.

Mixture of Experts (MoE) is a type of neural network architecture in which parts of the network are divided into specialized sub-networks (experts), each optimized for a specific domain of the input space. During inference, only a part of the model is activated depending on the given input, significantly reducing the computational cost. Further, these experts can be distributed across multiple devices, allowing for parallel processing and efficient large-scale distributed setups.

On an abstract, conceptual level, we can imagine MoE experts specialized in processing specific input types. For example, we might have separate experts for different language translations or different experts for text generation, summarization, solving analytical problems, or writing code. These sub-networks have separate parameters but are part of the single model, sharing blocks and layers at different levels.

In this article, we explore the core concepts of MoE, including architectural blocks, gating mechanisms, and load balancing. We’ll also discuss the nuances of training MoEs and analyze why they are faster to train and yield superior performance in multi-domain tasks. Finally, we address key challenges of implementing MoEs, including distributed training complexity and maintaining stability.

Bridging LLM capacity and scalability with MoE layers

Since the introduction of Transformer-based models, LLM capabilities have continuously expanded through advancements in architecture, training methods, and hardware innovation. Scaling up LLMs has been shown to improve performance. Accordingly, we’ve seen rapid growth in the scale of the training data, model sizes, and infrastructure supporting training and inference.

Pre-trained LLMs have reached sizes of hundreds of billions and even trillions of parameters. Training these models is extremely time-consuming and expensive, and their inference costs scale proportionally with their size.

In a conventional LLM, all parameters of the trained model are used during inference. The table below gives an overview of the size of several impactful LLMs. It presents the total parameters of each model and the number of parameters activated during inference:

The last five models (highlighted) exhibit a significant difference between the total number of parameters and the number of parameters active during inference. The Switch Transformer, Mixtral, GLaM, GShard, and DeepSeekMoE are Mixture of Experts LLMs (MoEs), which only require executing a portion of the model’s computational graph during inference.

MoE building blocks and architecture

The foundational idea behind the Mixture of Experts was introduced before the era of Deep Learning, back in the ’90s, with “Adaptive Mixtures of Local Experts” by Robert Jacobs, together with the “Godfather of AI” Geoffrey Hinton and colleagues. They introduced the idea of dividing the neural network into multiple specialized “experts” managed by a gating network.

With the Deep Learning boom, the MoE resurfaced. In 2017, Noam Shazeer and colleagues (including Geoffrey Hinton once again) proposed the Sparsely-Gated Mixture-of-Experts Layer for recurrent neural language models.

The Sparsely-Gated Mixture-of-Experts Layer consists of multiple experts (feed-forward networks) and a trainable gating network that selects the combination of experts to process each input. The gating mechanism enables conditional computation, directing processing to the parts of the network (experts) that are most suited to each part of the input text.

Such an MoE layer can be integrated into LLMs, replacing the feed-forward layer in the Transformer block. Its key components are the experts, the gating mechanism, and the load balancing.

Overview of the general architecture of a Transformer block with integrated MoE layer. The MoE layer has a gate (router) that activates selected experts based on the input. The aggregated experts’ outputs form the MoE layer’s output. | Source: Author

Experts

The fundamental idea of the MoE approach is to introduce sparsity in the neural network layers. Instead of a dense layer where all parameters are used for every input (token), the MoE layer consists of several “expert” sub-layers. A gating mechanism determines which subset of “experts” is used for each input. The selective activation of sub-layers makes the MoE layer sparse, with only a part of the model parameters used for every input token.

How are experts integrated into LLMs?

In the Transformer architecture, MoE layers are integrated by replacing some or all of the feed-forward layers with MoE sub-layers. The exact implementation of this replacement varies, depending on the end goal and priorities: replacing all feed-forward layers with MoEs maximizes sparsity and reduces the computational cost, while replacing only a subset of feed-forward layers may help with training stability. For example, in the Switch Transformer, all feed-forward components are replaced with the MoE layer. In GShard and GLaM, only every other feed-forward layer is replaced.
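To make the replacement pattern concrete, here is a minimal PyTorch sketch of the GShard/GLaM-style approach of swapping every other feed-forward layer for an MoE layer. The MoELayer below is a simplified top-1 illustration, not the exact implementation of any of the models mentioned above; the dimensions and the number of experts are arbitrary placeholders.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Standard position-wise feed-forward sub-layer."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class MoELayer(nn.Module):
    """Toy top-1 MoE layer: each token is routed to its single highest-scoring expert."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([FeedForward(d_model, d_ff) for _ in range(n_experts)])
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.gate(x).softmax(dim=-1)      # (batch, seq, n_experts)
        top_score, top_idx = scores.max(dim=-1)    # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                    # tokens routed to expert i
            if mask.any():
                out[mask] = top_score[mask].unsqueeze(-1) * expert(x[mask])
        return out


class Block(nn.Module):
    """Transformer block whose feed-forward sub-layer is either a dense FFN or an MoE layer."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, use_moe: bool):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = MoELayer(d_model, d_ff) if use_moe else FeedForward(d_model, d_ff)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))


# Replace every other feed-forward layer with an MoE layer (GShard/GLaM-style pattern).
blocks = nn.ModuleList([Block(512, 8, 2048, use_moe=(i % 2 == 1)) for i in range(12)])
```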

The other LLM layers and parameters remain unchanged, and their parameters are shared between the experts. An analogy to this system with specialized and shared parameters could be the completion of a company project. The incoming project needs to be processed by the core team—they contribute to every project. However, at some stages of the project, they may require different specialized consultants, selectively brought in based on their expertise. Collectively, they form a system that shares the core team’s capacity and profits from expert consultants’ contributions.

Visualization of token-level expert selection in the MoE model (layers 0, 15, and 31). Each token is color-coded, indicating the first expert chosen by the gating mechanism. This illustrates how MoE assigns tokens to specific experts at different levels of architecture. It may not always be obvious why the same-colored tokens were directed to the same expert – the model processed high-dimensional representations of these tokens, and the logic and understanding of the token processing are not always similar to human logic. | Source

Gating mechanism

In the previous section, we have introduced the abstract concept of an “expert,” a specialized subset of the model’s parameters. These parameters are applied to the high-dimensional representation of the input at different levels of the LLM architecture. During training, these subsets become “skilled” at handling specific types of data. The gating mechanism plays a key role in this system.

What is the role of the gating mechanism in an MoE layer?

When an MoE LLM is trained, all the experts’ parameters are updated. The gating mechanism learns to distribute the input tokens to the most appropriate experts, and in turn, experts adapt to optimally process the types of input frequently routed their way. At inference, only relevant experts are activated based on the input. This enables a system with specialized parts to handle diverse types of inputs. In our company analogy, the gating mechanism is like a manager delegating tasks within the team.

The gating component is a trainable network within the MoE layer. The gating mechanism has several responsibilities:

  • Scoring the experts based on input. For N experts, N scores are calculated, corresponding to the experts’ relevance to the input token.
  • Selecting the experts to be activated. Based on the experts’ scoring, a subset of the experts is chosen to be activated. This is usually done by top-k selection.
  • Load balancing. Naive selection of top-k experts would lead to an imbalance in token distribution among experts. Some experts may become too specialized by only handling a minimal input range, while others would be overly generalized. During inference, routing most of the input to a small subset of experts would lead to overloaded and underutilized experts. Thus, the gating mechanism has to distribute the load evenly across all experts.

How is gating implemented in MoE LLMs?

Let’s consider an MoE layer consisting of n experts denoted as Experti(x), with i = 1, …, n, that takes input x. Then, the gating layer’s output is calculated as

gi(x) = Softmaxi(x · Wg),   i = 1, …, n,

where gi is the ith expert’s score, modeled based on the Softmax function. The gating layer’s output is used as the weights when averaging the experts’ outputs to compute the MoE layer’s final output. If gi is 0, we can forgo computing Experti(x) entirely.

The general framework of an MoE gating mechanism looks like

G(x) = Softmax(TopK(score(x), k)),

where score(x) produces one score (logit) per expert, optionally with added noise, and TopK keeps only the k largest scores while setting the rest to −∞ so that their Softmax weights become zero.

Some specific examples are:

  • Top-1 gating: Each token is directed to a single expert, choosing only the top-scored expert. This is used in the Switch Transformer’s Switch layer. It is computationally efficient but requires careful load balancing of the tokens for even distribution across experts.
  • Top-2 gating: Each token is sent to two experts. This approach is used in Mixtral.
  • Noisy top-k gating: Introduced with the Sparsely-Gated Mixture-of-Experts Layer, noise (standard normal) is added before applying Softmax to help with load balancing. GShard uses a noisy top-2 strategy with more advanced load-balancing techniques. A minimal code sketch of noisy top-k gating follows this list.
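The following is a minimal PyTorch sketch of noisy top-k gating in the spirit of the Sparsely-Gated Mixture-of-Experts Layer. The exact formulation differs between papers; the learned noise scale, the dimensions, and the number of experts here are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(d_model, n_experts, bias=False)   # clean expert scores
        self.w_noise = nn.Linear(d_model, n_experts, bias=False)  # learned per-expert noise scale
        self.k = k

    def forward(self, x):                                    # x: (n_tokens, d_model)
        clean = self.w_gate(x)
        noise_std = F.softplus(self.w_noise(x))
        noisy = clean + torch.randn_like(clean) * noise_std  # add scaled standard-normal noise
        topk_val, topk_idx = noisy.topk(self.k, dim=-1)
        # Keep only the top-k scores; all others become -inf, so their Softmax weight is 0.
        masked = torch.full_like(noisy, float("-inf")).scatter(-1, topk_idx, topk_val)
        gates = masked.softmax(dim=-1)                       # (n_tokens, n_experts), sparse weights
        return gates, topk_idx


gate = NoisyTopKGate(d_model=512, n_experts=8, k=2)
gates, expert_ids = gate(torch.randn(16, 512))  # 16 tokens; each gets 2 nonzero gate weights
```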

Load balancing

The straightforward gating via scoring and selecting top-k experts can result in an imbalance of token distribution among experts. Some experts may become overloaded, being assigned to process a bigger portion of tokens, while others are selected much less frequently and stay underutilized. This causes a “collapse” in routing, hurting the effectiveness of the MoE approach in two ways.

First, the frequently selected experts are continuously updated during training and thus perform better than experts that don’t receive enough data to train properly.

Second, load imbalance causes memory and computational performance problems. When the experts are distributed across different GPUs and/or machines, an imbalance in expert selection translates into network, memory, and expert-capacity bottlenecks. If one expert has to handle ten times as many tokens as another, this increases the total processing time, as subsequent computations are blocked until all experts finish processing their assigned load.

Strategies for improving load balancing in MoE LLMs include:

•  Adding random noise in the scoring process helps redistribute tokens among experts.

•  Adding an auxiliary load-balancing loss to the overall model loss. It encourages an even distribution of tokens across experts. For example, in the Switch Transformer, for N experts and T tokens in batch B, the loss is

loss = α · N · Σ fi · Pi   (summing over i = 1, …, N),

where fi is the fraction of tokens routed to expert i, Pi is the fraction of the router probability allocated to expert i, and α is a weighting coefficient. (A short code sketch of this loss follows the list.)

•  DeepSeekMoE introduced an additional device-level loss to ensure that tokens are routed evenly across the underlying infrastructure hosting the experts. The experts are divided into g groups, with each group deployed to a single device.

•  Setting a maximum capacity for each expert. GShard and the Switch Transformer define a maximum number of tokens that can be processed by one expert. If the capacity is exceeded, the “overflown” tokens are directly passed to the next layer (skipping all experts) or rerouted to the next-best expert that has not yet reached capacity.
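As referenced above, here is a small PyTorch sketch of a Switch-Transformer-style auxiliary load-balancing loss. It assumes top-1 routing and treats the weighting coefficient α as a hyperparameter; it is an illustration, not the paper’s reference implementation.

```python
import torch


def load_balancing_loss(router_logits: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: alpha * N * sum_i(f_i * P_i)."""
    n_tokens, n_experts = router_logits.shape
    probs = router_logits.softmax(dim=-1)             # router probabilities per token
    top1 = probs.argmax(dim=-1)                       # expert chosen for each token (top-1)
    f = torch.bincount(top1, minlength=n_experts).float() / n_tokens  # fraction of tokens per expert
    P = probs.mean(dim=0)                             # mean router probability per expert
    return alpha * n_experts * torch.sum(f * P)


# 1,024 tokens in the batch, 8 experts; this term is added to the main training loss.
loss_aux = load_balancing_loss(torch.randn(1024, 8))
```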

Scalability and challenges in MoE LLMs

Selecting the number of experts

The number of experts is a key consideration when designing an MoE LLM. A larger number of experts increases a model’s capacity at the cost of increased infrastructure demands. Using too few experts has a detrimental effect on performance. If the tokens assigned to one expert are too diverse, the expert cannot specialize sufficiently.

The MoE LLMs’ scalability advantage is due to the conditional activation of experts. Thus, keeping the number of active experts k fixed but increasing the total number of experts n increases the model’s capacity (larger total number of parameters). Experiments conducted by the Switch Transformer’s developers underscore this. With a fixed number of active parameters, increasing the number of experts consistently led to improved task performance. Similar results were observed for MoE Transformers with GShard.

The Switch Transformers have 16 to 128 experts, GShard can scale up from 128 to 2048 experts, and Mixtral can operate with as few as 8. DeepSeekMoE takes a more advanced approach by dividing experts into fine-grained, smaller experts. While keeping the number of expert parameters constant, the number of combinations for possible expert selection is increased. For example, N=8 experts with hidden dimension h can be split into m=2 parts, giving N*m=16 experts of dimension h/m. The possible combinations of activated experts in top-k routing will change from 28 (2 out of 8) to 1820 (4 out of 16), which will increase flexibility and targeted knowledge distribution.
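A quick way to check the combinatorics mentioned above (top-2 out of 8 coarse experts versus top-4 out of 16 fine-grained experts) is with Python’s math.comb:

```python
from math import comb

print(comb(8, 2))   # 28 possible expert pairs with top-2 routing over 8 experts
print(comb(16, 4))  # 1820 possible expert quadruples with top-4 routing over 16 experts
```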

Routing tokens to different experts simultaneously may result in redundancy among experts. To address this problem, some approaches (like DeepSeekMoE and DeepSpeed) assign dedicated experts to act as a shared knowledge base. These experts are exempt from the gating mechanism and always receive each input token.

Training and inference infrastructure

While MoE LLMs can, in principle, be operated on a single GPU, they can only be scaled efficiently in a distributed architecture combining data, model, and pipeline parallelism with expert parallelism. The MoE layers are sharded across devices (i.e., their experts are distributed evenly) while the rest of the model (like dense layers and attention blocks) is replicated to each device.

This requires high-bandwidth and low-latency communication for both forward and backward passes. For example, Google’s latest Gemini 1.5 was trained on multiple 4096-chip pods of Google’s TPUv4 accelerators distributed across multiple data centers.

Hyperparameter optimization

Introducing MoE layers adds additional hyperparameters that have to be carefully adjusted to stabilize training and optimize task performance. Key hyperparameters to consider include the overall number of experts, their size, the number of experts to select in the top-k selection, and any load balancing parameters. Optimization strategies for MoE LLMs are discussed comprehensively in the papers introducing the Switch Transformer, GShard, and GLaM.

LLM performance vs. MoE LLM performance

Before we wrap up, let’s take a closer look at how MoE LLMs compare to standard LLMs:

  • MoE models, unlike dense LLMs, activate only a portion of their parameters. Compared to dense LLMs, MoE LLMs with the same number of active parameters can achieve better task performance, having the benefit of a larger number of total trained parameters. For example, Mixtral 8x7B with 13 B active parameters (and 47 B total trained parameters) matches or outperforms LLaMA-2 with 13 B parameters on benchmarks like MMLU, HellaSwag, PIQA, and Math.
  • MoEs are faster, and thus less expensive, to train. The Switch Transformer authors showed, for example, that the sparse MoE outperforms the dense Transformer baseline with a considerable speedup in achieving the same performance. With a fixed number of FLOPs and training time, the Switch Transformer achieved the T5-Base’s performance level seven times faster and outperformed it with further training.

What’s next for MoE LLMs?

Mixture of Experts (MoE) is an approach to scaling LLMs to trillions of parameters with conditional computation while avoiding exploding computational costs. MoE allows for the separation of learnable experts within the model, integrated into the shared model skeleton, which helps the model more easily adapt to multi-task, multi-domain learning objectives. However, this comes at the cost of new infrastructure requirements and the need for careful tuning of additional hyperparameters.

The novel architectural solutions for building experts, managing their routing, and stabilizing training are promising directions, with many more innovations to look forward to. Recent SoTA models like Google’s multi-modal Gemini 1.5 and IBM’s enterprise-focused Granite 3.0 are MoE models. DeepSeek R1, which has comparable performance to GPT-4o and o1, is an MoE architecture with 671B total parameters (37B activated) and 128 experts.

With the publication of open-source MoE LLMs such as DeepSeek R1 and V3, which rival or even surpass the performance of the aforementioned proprietary models, we are looking into exciting times for democratized and scalable LLMs.


FBI and CISA Urge Enabling 2FA to Counter Medusa Ransomware


FBI and CISA warn of Medusa ransomware attacks impacting critical infrastructure. Learn about Medusa’s tactics, prevention tips, and why paying ransoms is discouraged. 

A joint advisory by the Federal Bureau of Investigation (FBI) and the Cybersecurity and Infrastructure Security Agency (CISA) has revealed a particularly aggressive digital threat: a criminal operation known as the Medusa ransomware gang.

According to the advisory (#StopRansomware: Medusa Ransomware), Medusa, a ransomware-as-a-service (RaaS) group first identified in June 2021, has become a serious threat to critical infrastructure sectors in the United States.

Authorities have identified a pattern of attacks affecting organizations across diverse sectors, including healthcare, education, law firms, insurance providers, technology companies, and manufacturers. Their victims include Bell Ambulance in Wisconsin, CPI Books, Customer Management Systems, and Heartland Health Center. The sheer number of victims, surpassing 300 as of December 2024, highlights the scope of this threat. 

The actors utilize different methods to infiltrate systems, including deceptive communications (phishing) and exploiting unpatched software vulnerabilities (e.g. ScreenConnect authentication bypass CVE-2024-1709). Once inside a network, they use legitimate system administration tools to move undetected. 

They employ a unique approach to extortion, which involves encrypting victims’ data and rendering it inaccessible, along with threatening to expose sensitive information if their demands are not met. This tactic creates immense pressure on targeted organizations, forcing them to consider paying the ransom to prevent public disclosure of their data.  

“Medusa developers typically recruit initial access brokers (IABs) in cybercriminal forums and marketplaces to obtain initial access to potential victims. Potential payments between $100 USD and $1 million USD are offered to these affiliates with the opportunity to work exclusively for Medusa,” the advisory (PDF) warns.

Medusa uses advanced techniques to conceal its activities, such as remote access software to control compromised systems and using encrypted scripts and tools to create hidden connections to its command servers, thereby evading security software detection. 

A particularly concerning aspect of this operation is the aggressive nature of their extortion tactics. Victims are given a very short window of time to pay the ransom, often just two days. They are pressured through direct communication, and if they fail to comply, their stolen data is made available on darknet websites. There are even reports that paying the initial ransom might not guarantee the end of the ordeal, as further demands may follow.

In response to this growing threat, federal agencies have emphasized the need for ensuring regular software updates, implementing reliable access controls, and using multi-factor authentication. They also advise monitoring network activity for suspicious behaviour, limiting the use of remote desktop protocols, and segmenting networks to contain any potential breaches. 

Moreover, users are urged to enable two-factor authentication (2FA) for webmail and VPNs as social engineering is a significant factor in these attacks. All organizations affected by the Medusa ransomware are requested to report the incidents to law enforcement and to avoid paying any ransom demands.


Check Point Software Celebrates Continued Partner Success at UK Partner Awards


Check Point® Software has announced the winners of its UK Partner Awards. The annual awards ceremony, which took place at One Moorgate Place on March 6th, 2025, celebrated the input of Check Point’s affiliate companies and the growing partner community across the UK.

The 2025 Check Point UK Partner Awards recognised the continued dedication of trusted UK partners over the past year and their commitment to helping organisations become more secure. A gala dinner was held to celebrate these successes, followed by the awards presentations. Mark Weir, Regional Director UK&I at Check Point Software, and Martin Rutterford, Channel Director for the UK & Ireland at Check Point Software, opened the event by reflecting on the company’s achievements over the past year. Charlotte Wilson, Head of Enterprise Sales at Check Point Software, joined esteemed comedian Tom Allen to present the awards.

Organisations of all sizes have faced unprecedented challenges when it comes to cyber security over the past year. Check Point’s State of Cyber Security 2025 report revealed a worrying 44% increase in global cyberattacks year on year, with a 58% surge in infostealer attacks, pointing to a maturing threat ecosystem. This, compounded by the rising threat posed by AI-fuelled attacks, increased targeting of edge devices, and the growing complexity of ransomware, has presented organisations with a challenging cyber landscape to manage, especially alongside maintaining innovation and business growth. Check Point’s partners help organisations manage the rising risks with trust and ease, making the business ecosystem safer for all.

The Check Point UK Partner Awards recognise the exceptional accomplishments of regional industry leaders in tackling the critical cyber security issues their clients face. These awards celebrate the commitment, effort, and triumphs of key figures in the cyber security field who are working relentlessly to safeguard businesses and individuals in the face of rising threats. Channel partners are indispensable as an extension of these organisations, assisting in the development of resilience and the reinforcement of cyber security, all without requiring internal Security Operations Centres (SOCs).

The winners of the 2025 UK Partner Awards were: 

  • Marketplace Partner of the Year: Computacenter
  • Quantum Partner of the Year: BT
  • Harmony Partner of the Year: Softcat
  • Cloud Partner of the Year: Computacenter
  • Infinity Partner of the Year: Bytes
  • Distribution Partner of the Year: Westcon
  • Rising Star Partner of the Year: Systal
  • New Logo Partner of the Year: Softcat
  • Project of the Year: World Wide Technology
  • Technical Champion of the Year: John Tammaro, SEP2
  • Sales Champion of the Year: Becky Clayton, Westcon
  • Marketing Champion of the Year: Daniela Miccardi, Bytes
  • Check Point Champion of the Year: Michael Lenham, Bytes
  • Global Systems Integrator of the Year: BT
  • Partner of the Year: BT

“Every day, our partners are on the frontlines, helping businesses stay one step ahead of increasingly sophisticated cyber threats,” said Mark Weir, Regional Director UK&I at Check Point Software. “In a year where AI-fuelled attacks and targeted ransomware campaigns have surged, their dedication, expertise, and innovation have been crucial in protecting organisations across the UK. These awards are not just about recognising success—they’re about celebrating the relentless commitment of our partners to keeping businesses secure, resilient, and future-ready. We’re incredibly proud to work alongside such a talented and driven network and look forward to another year of growth and shared victories.”

At the ceremony, over £2,000 was raised for LupusUK.


HPC-AI Tech Releases Open-Sora 2.0: An Open-Source SOTA-Level Video Generation Model Trained for Just $200K


AI-generated videos from text descriptions or images hold immense potential for content creation, media production, and entertainment. Recent advancements in deep learning, particularly in transformer-based architectures and diffusion models, have propelled this progress. However, training these models remains resource-intensive, requiring large datasets, extensive computing power, and significant financial investment. These challenges limit access to cutting-edge video generation technologies, making them primarily available to well-funded research groups and organizations.

Training AI video models is expensive and computationally demanding. High-performance models require millions of training samples and powerful GPU clusters, making them difficult to develop without significant funding. Large-scale models, such as OpenAI’s Sora, push video generation quality to new heights but demand enormous computational resources. The high cost of training restricts access to advanced AI-driven video synthesis, limiting innovation to a few major organizations. Addressing these financial and technical barriers is essential to making AI video generation more widely available and encouraging broader adoption.

Different approaches have been developed to handle the computational demands of AI video generation. Proprietary models like Runway Gen-3 Alpha feature highly optimized architectures but are closed-source, restricting broader research contributions. Open-source models like HunyuanVideo and Step-Video-T2V offer transparency but require significant computing power. Many rely on extensive datasets, autoencoder-based compression, and hierarchical diffusion techniques to enhance video quality. However, each approach comes with trade-offs between efficiency and performance. While some models focus on high-resolution output and motion accuracy, others prioritize lower computational costs, resulting in varying performance levels across evaluation metrics. Researchers continue to seek an optimal balance that preserves video quality while reducing financial and computational burdens.

HPC-AI Tech researchers introduce Open-Sora 2.0, a commercial-level AI video generation model that achieves state-of-the-art performance while significantly reducing training costs. This model was developed with an investment of only $200,000, making it five to ten times more cost-efficient than competing models such as MovieGen and Step-Video-T2V. Open-Sora 2.0 is designed to democratize AI video generation by making high-performance technology accessible to a wider audience. Unlike previous high-cost models, this approach integrates multiple efficiency-driven innovations, including improved data curation, an advanced autoencoder, a novel hybrid transformer framework, and highly optimized training methodologies.

The research team implemented a hierarchical data filtering system that refines video datasets into progressively higher-quality subsets, ensuring optimal training efficiency. A significant breakthrough was the introduction of the Video DC-AE autoencoder, which improves video compression while reducing the number of tokens required for representation. The model’s architecture incorporates full attention mechanisms, multi-stream processing, and a hybrid diffusion transformer approach to enhance video quality and motion accuracy. Training efficiency was maximized through a three-stage pipeline: text-to-video learning on low-resolution data, image-to-video adaptation for improved motion dynamics, and high-resolution fine-tuning. This structured approach allows the model to understand complex motion patterns and spatial consistency while maintaining computational efficiency.

The model was tested across multiple dimensions: visual quality, prompt adherence, and motion realism. Human preference evaluations showed that Open-Sora 2.0 outperforms proprietary and open-source competitors in at least two categories. In VBench evaluations, the performance gap between Open-Sora and OpenAI’s Sora was reduced from 4.52% to just 0.69%, demonstrating substantial improvements. Open-Sora 2.0 also achieved a higher VBench score than HunyuanVideo and CogVideo, establishing itself as a strong contender among current open-source models. Also, the model integrates advanced training optimizations such as parallelized processing, activation checkpointing, and automated failure recovery, ensuring continuous operation and maximizing GPU efficiency.

Key takeaways from the research on Open-Sora 2.0 include:

  1. Open-Sora 2.0 was trained for only $200,000, making it five to ten times more cost-efficient than comparable models.
  2. The hierarchical data filtering system refines video datasets through multiple stages, improving training efficiency.
  3. The Video DC-AE autoencoder significantly reduces token counts while maintaining high reconstruction fidelity.
  4. The three-stage training pipeline optimizes learning from low-resolution data to high-resolution fine-tuning.
  5. Human preference evaluations indicate that Open-Sora 2.0 outperforms leading proprietary and open-source models in at least two performance categories.
  6. The model reduced the performance gap with OpenAI’s Sora from 4.52% to 0.69% in VBench evaluations.
  7. Advanced system optimizations, such as activation checkpointing and parallelized training, maximize GPU efficiency and reduce hardware overhead.
  8. Open-Sora 2.0 demonstrates that high-performance AI video generation can be achieved with controlled costs, making the technology more accessible to researchers and developers worldwide.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


Aswin AK is a consulting intern at MarkTechPost. He is pursuing his Dual Degree at the Indian Institute of Technology, Kharagpur. He is passionate about data science and machine learning, bringing a strong academic background and hands-on experience in solving real-life cross-domain challenges.


A Guide to AI Sexting Apps


Customizing Your Virtual Companion: A Guide to AI Sexting Apps

Introduction

The world of artificial intelligence (AI) has expanded into nearly every corner of our lives, including personal and emotional connections. With the advent of AI sexting apps, technology now enables users to interact with highly personalized virtual companions. These apps aim to provide comfort, intimacy, and engaging conversations, filling gaps that traditional relationships might leave unaddressed.

In this guide, we’ll dive deep into the world of AI sexting app technology, exploring how it works, its benefits, and the ethical considerations that come with it. Whether you’re curious or considering using one, this article will leave you well-informed.

The Rise of AI Sexting Apps

A Brief History
The journey of AI sexting apps began with rudimentary chatbots designed to mimic human conversation. As technology advanced, these bots evolved into interactive systems capable of understanding context, tone, and emotion. Today, AI sexting app technology stands at the forefront of emotional intelligence.

Why the Boom?

  • Increasing social isolation and loneliness.
  • A desire for safe, judgment-free connections.
  • Growing tech accessibility globally.

Popular Platforms
Many apps cater to this space, offering a range of features from basic interactions to immersive role-playing experiences. Notable examples include Replika and Paradot, each offering a unique take on virtual companionship.

Key Features of AI Sexting Apps

Personalization Options
The magic of these apps lies in their ability to tailor interactions. Users can select personality traits, tone, and even the depth of conversation. This customization makes each interaction feel unique and personal.

Adaptive AI Technology
Powered by machine learning, these apps improve with time. They adapt based on user preferences, providing more meaningful interactions the longer they are used.

Free vs Premium Options

  • Free AI sexting apps often provide basic features.
  • Premium AI sexting apps include advanced features like in-depth role-playing, image generation, and voice interaction.

How to Customize Your Virtual Companion

Setting Up Preferences: When you first start using an app, you’ll typically be asked to set up your companion’s personality. Want a playful, witty bot? Or a more serious and caring one? It’s entirely up to you.

Exploring Visual Customization: Some apps allow users to create avatars for their virtual companions, adding a layer of visual engagement.

Scripted Interactions: For those looking for more control, premium apps offer tools to script specific scenarios, creating unique conversational flows tailored to your desires.

Ethical and Privacy Considerations

Data Security
Given the sensitive nature of these apps, data security is paramount. Reputable apps provide robust privacy measures, but users should still be cautious about sharing personal details.

Ethical Concerns

  • Potential emotional dependency on AI.
  • The balance between realistic interaction and manipulation.

Transparency in AI Sexting App Technology
It’s vital for users to understand how the app’s AI operates, ensuring an ethical balance between functionality and user safety.

Benefits of AI Sexting Apps

Emotional Support: These apps can provide a safe space for expressing thoughts and feelings, acting as a non-judgmental confidant.

Accessibility and Flexibility: Unlike human relationships, virtual companions are always available, offering consistent interaction regardless of time zones or schedules.

Exploration and Learning: AI sexting apps also allow users to explore communication styles or intimacy preferences in a pressure-free environment.

Challenges and Limitations

Unrealistic Expectations: While the apps are powerful, they can sometimes lead users to develop unrealistic expectations about human interactions.

Cultural and Linguistic Barriers: Not all apps are equally adept at understanding nuances across cultures and languages.

Free vs Premium AI Sexting Apps: Free versions might have limited capabilities, while premium versions come with a price tag that may not be accessible to everyone.

Future Trends in AI Sexting Apps

AR and VR Integration: The next wave of innovation involves immersive technologies. Imagine interacting with a virtual companion in augmented or virtual reality!

Emotionally Intelligent AI: Future apps will likely feature advanced emotional intelligence, making interactions even more lifelike and fulfilling.

Expanding Beyond Sexting: AI chatbots may grow beyond intimate interactions to offer broader emotional and mental health support, redefining what virtual companionship means.

Conclusion

AI sexting apps are a testament to how far technology has come in addressing human needs. With their ability to adapt and personalize, they provide unique opportunities for connection and self-expression. However, users must navigate their use responsibly, balancing the benefits with ethical and privacy considerations.

As technology advances, the line between digital and real relationships will continue to blur, promising an exciting yet challenging future for AI sexting app technology.

Forget About Cloud Computing. On-Premises Is All the Rage Again


Ten years ago, everybody was fascinated by the cloud. It was the new thing, and companies that adopted it rapidly saw tremendous growth. Salesforce, for example, positioned itself as a pioneer of this technology and saw great wins.

The tides are turning though. As much as cloud providers still proclaim that they’re the most cost-effective and efficient solution for businesses of all sizes, this is increasingly clashing with the day-to-day experience.

Cloud Computing was touted as the solution for scalability, flexibility, and reduced operational burdens. Increasingly, though, companies are finding that, at scale, the costs and control limitations outweigh the benefits.​

Attracted by free AWS credits, my CTO and I set up our entire company IT infrastructure on the cloud. However, we were shocked when we saw the costs ballooning after just a few software tests. We decided to invest in a high-quality server and moved our whole infrastructure onto it. And we’re not looking back: this decision is already saving us hundreds of euros per month.

We’re not the only ones: Dropbox already made this move in 2016 and saved close to $75 million over the ensuing two years. The company behind Basecamp, 37signals, completed this transition in 2022, and expects to save $7 million over five years.

We’ll dive deeper into the how and why of this trend and the cost savings that are associated with it. You can expect some practical insights that will help you make or influence such a decision at your company, too.

Cloud costs have been exploding

According to a recent study by Harness, 21% of enterprise cloud infrastructure spend—which will be equivalent to $44.5 billion in 2025—is wasted on underutilized resources. According to the study author, cloud spend is one of the biggest cost drivers for many software enterprises, second only to salaries.

The premise of this study is that developers must develop a keener eye on costs. However, I disagree. Cost control can only get you so far—and many smart developers are already spending inordinate amounts of their time on cost control instead of building actual products.

Cloud costs have a tendency to balloon over time: Storage costs per GB of data might seem low, but when you’re dealing with terabytes of data—which even we as a three-person startup are already doing—costs add up very quickly. Add to this retrieval and egress fees, and you’re faced with a bill you cannot unsee.

Steep retrieval and egress fees only serve one thing: Cloud providers want to incentivize you to keep as much data as possible on the platform, so they can make money off every operation. If you download data from the cloud, it will cost you inordinate amounts of money.

Variable costs based on CPU and GPU usage often spike during high-performance workloads. A report by CNCF found that almost half of Kubernetes adopters found that they’d exceeded their budget as a result. Kubernetes is an open-source container orchestration software that is often used for cloud deployments.

The pay-per-use model of the cloud has its advantages, but billing becomes unpredictable as a result. Costs can then explode during usage spikes. Cloud add-ons for security, monitoring, and data analytics also come at a premium, which often increases costs further.

As a result, many IT leaders have started migrating back to on-premises servers. A 2023 survey by Uptime found that 33% of respondents had repatriated at least some production applications in the past year.

Cloud providers have not restructured their billing in response to this trend. One could argue that doing so would seriously impact their profitability, especially in a largely consolidated market where competitive pressure by upstarts and outsiders is limited. As long as this is the case, the trend towards on-premises is expected to continue.

Cost efficiency and control

There is a reason that cloud providers tend to advertise so much to small firms and startups. The initial setup costs of a cloud infrastructure are low because of pay-as-you-go models and free credits.

The easy setup can be a trap, though, especially once you start scaling. (At my firm, we noticed our costs going out of control even before we scaled to a decent extent, simply because we handle large amounts of data.) Monthly costs for on-premises servers are fixed and predictable; costs for cloud services can quickly balloon beyond expectations.

As mentioned before, cloud providers also charge steep data egress fees, which can quickly add up when you’re considering a hybrid infrastructure.

Security costs can initially be higher on-premises. On the other hand, you have full control over everything you implement. Cloud providers cover infrastructure security, but you remain responsible for data security and configuration. This often requires paid add-ons.

On the whole, an on-premises infrastructure comes with higher setup costs and requires considerable know-how. This initial investment pays off quickly, though, because you tend to have very predictable monthly costs and full control over additions like security measures.

There are plenty of prominent examples of companies that have saved millions by moving back on-premises. Whether this is a good choice for you depends on several factors, though, which need to be assessed carefully.

Should you move back on-premises?

Whether you should make the shift back to server racks depends on several factors. The most important considerations in most cases are financial, operational, and strategic.

From a financial point of view, your company’s cash structure plays a big role. If you prefer lean capital expenditures and have no problem racking up high operational costs every month, you should remain on the cloud. If you can make a higher capital expenditure up front and avoid bleeding cash every month afterwards, on-premises is the better choice.

At the end of the day, though, the total cost of ownership (TCO) is key. If your operational costs on the cloud are consistently lower than running servers yourself, then you should absolutely stay on the cloud.

From an operational point of view, staying on the cloud can make sense if you often face spikes in usage. On-premises servers can only carry so much traffic; cloud servers scale pretty seamlessly in proportion to demand. If expensive and specialized hardware is more accessible for you on the cloud, this is also a point in favor of staying on the cloud. On the other hand, if you are worried about complying with specific regulations (like GDPR, HIPAA, or CSRD for example), then the shared-responsibility model of cloud services is likely not for you.

Strategically speaking, having full control of your infrastructure can be a strategic advantage. It keeps you from getting locked in with a vendor and having to play along with whatever they bill you and what services they are able to offer you. If you plan a geographic expansion or rapidly deploy new services, then cloud can be advantageous though. In the long run, however, going on-premises might make sense even when you’re expanding geographically or in your scope of services, due to increased control and lower operational costs.

The decision to move back on-premises depends on several factors. Diagram generated with the help of Claude AI.

On the whole, if you value predictability, control, and compliance, you should consider running on-premises. If, on the other hand, you value flexibility, then staying on the cloud might be your better choice.

How to repatriate easily

If you are considering repatriating your services, here is a brief checklist to follow:

  • Assess Current Cloud Usage: Inventory applications and data volume.
  • Cost Analysis: Calculate current cloud costs vs. projected on-prem costs (see the sketch after this list).
  • Select On-Prem Infrastructure: Servers, storage, and networking requirements.
  • Minimize Data Egress Costs: Use compression and schedule transfers during off-peak hours.
  • Security Planning: Firewalls, encryption, and access controls for on-prem.
  • Test and Migrate: Pilot migration for non-critical workloads first.
  • Monitor and Optimize: Set up monitoring for resources and adjust.
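As a companion to the cost-analysis step above, here is a back-of-the-envelope comparison in Python. All numbers are purely hypothetical placeholders; plug in your own cloud bill, hardware quote, and planning horizon.

```python
MONTHS = 36                # planning horizon in months

cloud_monthly = 3_500      # hypothetical: compute + storage + egress per month (EUR)
cloud_total = cloud_monthly * MONTHS

server_capex = 25_000      # hypothetical: one-time hardware purchase (EUR)
onprem_monthly = 700       # hypothetical: colocation, power, maintenance per month (EUR)
onprem_total = server_capex + onprem_monthly * MONTHS

print(f"Cloud over {MONTHS} months:   {cloud_total:>10,} EUR")
print(f"On-prem over {MONTHS} months: {onprem_total:>10,} EUR")
print(f"Break-even after ~{server_capex / (cloud_monthly - onprem_monthly):.1f} months")
```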

Repatriation is not just for enterprise companies that make the headlines. As the example of my firm shows, even small startups need to make this consideration. The earlier you make the migration, the less cash you’ll bleed.

The bottom line: Cloud is not dead, but the hype around it is dying

Cloud services aren’t going anywhere. They offer flexibility and scalability, which are unmatched for certain use cases. Startups and companies with unpredictable or rapidly growing workloads still benefit greatly from cloud solutions.

That being said, even early-stage companies can benefit from on-premises infrastructure, for example if the large data loads they’re handling would make the cloud bill balloon out of control. This was the case at my firm.

The cloud has often been marketed as a one-size-fits-all solution for everything from data storage to AI workloads. We can see that this is not the case; the reality is a bit more granular than this. As companies scale, the costs, compliance challenges, and performance limitations of cloud computing become impossible to ignore.

The hype around cloud services is dying because experience is showing us that there are real limits and plenty of hidden costs. In addition, cloud providers often cannot adequately provide security solutions, compliance options, and user control unless you pay a hefty premium for all of this.

Most companies will likely adopt a hybrid approach in the long run: On-premises offers control and predictability; cloud servers can jump into the fray when demand from users spikes.

There’s no real one-size-fits-all solution. However, there are specific criteria that should help guide your decision. Like every hype, there are ebbs and flows. The fact that cloud services are no longer hyped does not mean that you need to go all-in on server racks now. It does, however, invite a deeper reflection on the advantages that this trend offers for your company.

Challenges & Solutions For Monitoring at Hyperscale


“What is not measured, cannot be improved.” This quote has become a guiding principle for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the large volume of data generated to the unpredictability of hardware failures and the need for efficient resource management. These issues require strategic solutions, which is why monitoring isn’t just a nice-to-have—it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams facing large-scale model training and monitoring. Let’s get into it.

Real-time monitoring prevents costly failures

Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring—the ability to act immediately—is so critical.

Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines the process of comparing results, analyzing results, and debugging issues, saving time and effort.

However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.

The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course. Here are some testimonials:

One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.

James Tu

Research Scientist, Waabi

For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.

Wojtek Rosiński

CTO, ReSpo.Vision

Real-time visualization of GPU memory usage (top) and power consumption (bottom) during a large-scale training run. These metrics help identify potential bottlenecks, such as out-of-memory errors or inefficient hardware utilization, enabling immediate corrective actions to maintain optimal performance. | Source: Author
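As a concrete illustration of the kind of live hardware telemetry described above, here is a minimal sketch that samples GPU utilization and memory every second using NVIDIA’s pynvml bindings. The log_metric function is a stand-in for whatever experiment tracker you use, not a specific library’s API.

```python
import time

import pynvml


def log_metric(name: str, value: float, step: int) -> None:
    # Replace with your tracker's logging call (e.g., appending a step-indexed scalar).
    print(f"step={step} {name}={value}")


pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for step in range(100):
    # ... one training step would run here ...
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    log_metric("gpu/utilization_pct", util.gpu, step)
    log_metric("gpu/memory_used_gb", mem.used / 1e9, step)
    time.sleep(1)  # in practice, log on every step or on a fixed schedule

pynvml.nvmlShutdown()
```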

Troubleshooting hardware failures is challenging: simplify it with debugging

Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.

At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.

Igor Tsvetkov

Former Senior Staff Software Engineer, Cruise

AI teams automating error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify if failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters). 
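As a toy sketch of the kind of automatic error categorization described above, the snippet below labels a stack trace as an “infra” or “user” issue based on simple patterns. The patterns are illustrative assumptions, not an exhaustive or official taxonomy.

```python
import re

INFRA_PATTERNS = [r"CUDA out of memory", r"NCCL", r"ECC error", r"connection reset", r"timed out"]
USER_PATTERNS = [r"shape mismatch", r"KeyError", r"ValueError", r"NaN"]


def categorize(stack_trace: str) -> str:
    """Label a failure as an infrastructure issue or a user (code/config) issue."""
    if any(re.search(p, stack_trace, re.IGNORECASE) for p in INFRA_PATTERNS):
        return "infra"
    if any(re.search(p, stack_trace, re.IGNORECASE) for p in USER_PATTERNS):
        return "user"
    return "unknown"


print(categorize("RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB"))  # -> infra
```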

Intuitive experiment tracking optimizes resource utilization

Another relevant aspect of hyperscale monitoring is optimizing resource utilization, in particular in a scenario where hardware failures and training interruptions can set teams back significantly. Picture a scenario where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.

Use checkpoints at frequent intervals so you do not have to restart from scratch, but just warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints in the same machine. This doesn’t help if your hardware crashes, or, for example, you are using spot instances and they are reassigned.

For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This does not mean that you upload GBs worth of checkpoints to the tracker (although you can and some of our customers, especially self-hosted customers, do this for security reasons), but rather have pointers to the remote location, like S3, where the checkpoints have been saved. This enables you to link the checkpoint with the corresponding experiment step, and efficiently retrieve the relevant checkpoint at any given step.

A comparison of training workflows with and without advanced experiment tracking and checkpointing. On the left, failed training runs at various stages lead to wasted time and resources. On the right, a streamlined approach with checkpoints and proactive monitoring ensures consistent progress and minimizes the impact of interruptions. | Source: Author
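Here is a minimal sketch of linking checkpoints to an experiment tracker by reference, as described above. The S3 upload and the run handle are placeholders (a plain dict below), not a specific tracker’s API; adapt them to your storage and tracking setup.

```python
import torch


def upload_to_s3(local_path: str, bucket: str, key: str) -> str:
    # Placeholder: e.g., boto3.client("s3").upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def save_and_register_checkpoint(run: dict, model, optimizer, step: int, bucket: str) -> None:
    state = {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step}
    local_path = f"checkpoint_{step}.pt"
    torch.save(state, local_path)
    uri = upload_to_s3(local_path, bucket, key=f"ckpts/checkpoint_{step}.pt")
    # Log only the pointer to the remote checkpoint, linked to the training step.
    run[f"checkpoints/step_{step}"] = uri


run = {}  # stand-in for an experiment-tracker run handle
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
save_and_register_checkpoint(run, model, optimizer, step=1000, bucket="my-training-bucket")
print(run)  # {'checkpoints/step_1000': 's3://my-training-bucket/ckpts/checkpoint_1000.pt'}
```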

However, there are two caveats to successfully restarting an experiment from a checkpoint: assuming that the experimentation environment is constant, or at least reproducible, and addressing deterministic issues like Out-of-Memory errors (OOMs) or bottlenecks that may require parameter changes to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.


With Neptune, users can visualize forked training out of the box. This means you can:

  • Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
  • Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.

In addition, checkpointing strategies are critical for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.

For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.

Features like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times, making frequent checkpointing more viable without compromising training performance.
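A minimal sketch, assuming a recent PyTorch release (2.3 or later, where torch.distributed.checkpoint.async_save is available) and an already-initialized distributed setup; check the exact API of your PyTorch version:

```python
import torch.distributed.checkpoint as dcp

def async_checkpoint(model, optimizer, step, prev_future=None):
    # Make sure the previous asynchronous save has finished before starting a new one.
    if prev_future is not None:
        prev_future.result()
    state_dict = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}
    # async_save stages the state and writes it in the background, returning a future,
    # so the training loop is only blocked for the in-memory staging step.
    return dcp.async_save(state_dict, checkpoint_id=f"checkpoints/step_{step}")
```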

Beyond models, checkpointing the state of dataloaders remains a challenge due to distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.
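As a rough, hypothetical illustration of the idea (not Meta’s in-house solution), a resumable sampler only needs to expose and restore its position alongside the model checkpoint:

```python
class ResumableSampler:
    """Hypothetical sketch: a sampler that reports and restores its position,
    so a dataloader built on top of it can resume mid-epoch after a crash."""

    def __init__(self, num_samples, start_index=0):
        self.num_samples = num_samples
        self.index = start_index

    def __iter__(self):
        while self.index < self.num_samples:
            yield self.index
            self.index += 1

    def state_dict(self):
        # Saved alongside the model/optimizer checkpoint.
        return {"index": self.index}

    def load_state_dict(self, state):
        self.index = state["index"]
```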

Reproducibility and transparency are non-negotiable

Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage—from parent runs to dataset versions—in an accessible dashboard.

This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.

A single source of truth simplifies data visualization and management at scale

Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this volume leads to inefficiencies and missed insights.

A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”—those not explicitly logged at the code level but calculated retrospectively within the tool. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
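In Neptune, such derived metrics are configured in the dashboard rather than in code, but conceptually the F1 example boils down to a computation like this sketch over already-logged values:

```python
def f1_from_logged_metrics(precision, recall):
    # Derive F1 retrospectively from precision and recall that were already logged,
    # so past experiments don't need to be rerun when the KPI changes.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_from_logged_metrics(0.82, 0.74))  # ~0.778
```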

Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member—technical or non-technical—could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.

I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and easily see it on one screen. Then, any other person can view the same thing, so that’s pretty nice.

Łukasz Grad

Chief Data Scientist, ReSpo.Vision

The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the tool reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.

Visualizing large datasets

We generally do not think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment that is not part of the same pipeline as the actual model training, data management and visualization are critical to LLMOps.

Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.
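As a generic illustration (not specific to Deepscatter or Jupyter Scatter), a common workaround is to plot a random subsample of a precomputed 2D projection of the embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for projected embeddings (e.g., precomputed UMAP or PCA coordinates).
rng = np.random.default_rng(0)
xy = rng.normal(size=(10_000_000, 2)).astype(np.float32)

# With billions of points, render a random subsample to keep the plot responsive.
sample = xy[rng.choice(len(xy), size=100_000, replace=False)]
plt.scatter(sample[:, 0], sample[:, 1], s=1, alpha=0.1)
plt.title("Subsampled embedding projection")
plt.show()
```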

Moving forward

The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.

How about being one of the first to access Neptune Scale?

Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.

If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.

Acknowledgments

I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.


AI Chatbot DeepSeek R1 Can Be Manipulated to Create Malware


Tenable Research reveals that AI chatbot DeepSeek R1 can be manipulated to generate keyloggers and ransomware code. While not fully autonomous, it provides a playground for cybercriminals to refine and exploit its capabilities for malicious purposes.

A new analysis from cybersecurity firm Tenable Research reveals that the open-source AI chatbot DeepSeek R1 can be manipulated to generate malicious software, including keyloggers and ransomware.

Tenable’s research team set out to assess DeepSeek’s ability to create harmful code. They focused on two common types of malware: keyloggers, which secretly record keystrokes, and ransomware, which encrypts files and demands payment for their release.

While the AI chatbot isn’t producing fully functional malware “out of the box” and requires proper guidance and manual code corrections to produce a fully working keylogger, the research suggests that it could lower the barrier to entry for cybercriminals.

Initially, like other large language models (LLMs), DeepSeek adhered to its built-in ethical guidelines and refused direct requests to write malware. However, the Tenable researchers employed a “jailbreak” technique, tricking the AI by framing the request as being for “educational purposes” to bypass these restrictions.

The researchers leveraged a key part of DeepSeek’s functionality: its “chain-of-thought” (CoT) capability. This feature allows the AI to explain its reasoning process step-by-step, much like someone thinking aloud while solving a problem. By observing DeepSeek’s CoT, researchers gained insights into how the AI approached malware development and even recognised the need for stealth techniques to avoid detection.

DeepSeek Building Keylogger

When tasked with building a keylogger, DeepSeek first outlined a plan and then generated C++ code. This initial code was flawed and contained several errors that the AI itself could not fix. However, with a few manual code adjustments by the researchers, the keylogger became functional, successfully logging keystrokes to a file.

Taking it a step further, the researchers prompted DeepSeek to help enhance the malware by hiding the log file and encrypting its contents, which it managed to provide code for, again requiring minor human correction.

This screenshot displays the keylogger created by DeepSeek running in the Task Manager, alongside the log file it generated. (Credit: Tenable Research)

DeepSeek Building Ransomware

The experiment with ransomware followed a similar pattern. DeepSeek laid out its strategy for creating file-encrypting malware. It produced several code samples designed to perform this function, but none of these initial versions would compile without manual editing.

Nevertheless, after some tweaking by the Tenable team, some of the ransomware samples were made operational. These functional samples included features for finding and encrypting files, a method to ensure the malware runs automatically when the system starts, and even a pop-up message informing the victim about the encryption.

DeepSeek Struggled with Complex Malicious Tasks

While DeepSeek demonstrated an ability to generate the basic building blocks of malware, Tenable’s findings highlight that it’s far from a push-button solution for cybercriminals. Creating effective malware still requires technical knowledge to guide the AI and debug the resulting code. For instance, DeepSeek struggled with more complex tasks like making the malware process invisible to the system’s task manager.

However, despite these limitations, Tenable researchers believe that access to tools like DeepSeek could accelerate malware development activities. The AI can provide a significant head start, offering code snippets and outlining necessary steps, which could be particularly helpful for individuals with limited coding experience looking to engage in cybercrime.

“DeepSeek can create the basic structure for malware,” explains Tenable’s technical report shared with Hackread.com ahead of its publishing on Thursday. “However, it is not capable of doing so without additional prompt engineering as well as manual code editing for more advanced features.” The AI struggled with more complex tasks like completely hiding the malware’s presence from system monitoring tools.

Trey Ford, Chief Information Security Officer at Bugcrowd, a San Francisco, Calif.-based leader in crowdsourced cybersecurity, commented on the latest development, emphasising that AI can aid both good and bad actors, but security efforts should focus on making cyberattacks more costly by hardening endpoints rather than expecting EDR solutions to prevent all threats.

“Criminals are going to be criminals – and they’re going to use every tool and technique available to them. GenAI-assisted development is going to enable a new generation of developers – for altruistic and malicious efforts alike,” said Ford.

“As a reminder, the EDR market is explicitly endpoint DETECTION and RESPONSE – they’re not intended to disrupt all attacks. Ultimately, we need to do what we can to drive up the cost of these campaigns by making endpoints harder to exploit – pointedly, they need to be hardened to CIS 1 or 2 benchmarks,” he explained.


Cold Wallets vs. Hot Wallets: Which Offers Better Security?


Cryptocurrency isn’t just a buzzword anymore. By December 2024, the number of global cryptocurrency owners reached approximately 659 million, marking a 13% increase from January 2024. That might not sound like a massive chunk, but it still represents millions of individuals who want to protect their virtual holdings. Where regular banking once ruled, self-managed wallets are now front and center for those who prefer having full control of their tokens.

Part of the appeal is the chance to bypass middlemen. However, questions arise on the best way to handle security—especially for people who want quick access to their coins while also trying to avoid potential hacks. 

Hot Wallets and Why People Use Them

Hot wallets and cold wallets both serve important purposes in this field, yet they each come with a unique mix of convenience and risk. Readers of Anthony Clarke’s research on crypto storage might notice that he discusses various features of the top web3 wallets. A significant number of these are what we call “hot” wallets, which are connected to the internet at nearly all times. Plenty of enthusiasts who enjoy web-based gaming services lean on hot wallets because they often allow speedy deposits and withdrawals, leading to near-instant play. Beyond gaming, these wallets also appeal to traders, freelancers, or anyone who wants immediate transfers.

Hot wallets are praised for their ease of use. They’re typically tied to user-friendly apps or browser extensions, so you can send or receive tokens within seconds. While this makes day-to-day transactions painless, it also means a constant link to the internet. Hackers often eye anything that’s frequently connected, so staying sharp with two-factor authentication and strong passwords is a must. Phishing attacks are a known threat, where someone might trick you into giving away personal details or private keys.

Another consideration is how these hot solutions store your credentials. Some keep private keys on external servers, while others let you store them on your own device. Either way, the open nature of being connected leaves a bigger window for unwanted visitors to sneak through. If you’re someone who likes fast trades, though, hot wallets remain a popular choice.

Cold Wallets: Safeguarding Your Crypto Offline

While hot wallets thrive on convenience, cold wallets shut off direct access to the web. They come in the form of hardware devices that look like USB sticks, or even paper wallets with keys and QR codes printed on them. Because these storage methods aren’t plugged into the internet all the time, they present a far smaller target for hackers. Someone would need physical control of your device or printout, making it way harder for them to stage a remote break-in.

Cold wallets are known for long-term storage. If you have coins you’re holding for months or years, it makes sense to lock them away from prying eyes. Many large investors keep the bulk of their funds in offline vaults to minimize risk. However, this approach creates its own challenges. Losing the device or paper could be devastating, and there’s no customer support line that can restore lost private keys. You might want multiple backups—perhaps in separate secure locations—so one house fire or other mishap doesn’t wipe out your stash.

Though it can be more tedious to move your coins in and out of cold storage, the added security is often worth that extra step. Many people prefer a hybrid strategy: store most of your holdings offline, and keep a small portion in a hot wallet for quick trades.

Picking the Right Match for Your Needs

Hot wallets and cold wallets each have their strengths, so the choice depends on how you plan to manage your cryptocurrency. If you’re regularly trading tokens, a hot wallet feels more convenient. Just stay on your toes: never click random links or download unverified software, and consider pairing your wallet with hardware-based two-factor solutions. That level of caution is essential, because even a moment of inattention can lead to stolen funds.

On the flip side, if you’re happy to park coins for a while, cold wallets offer a sense of security that’s tough to beat. Not being connected nearly closes the door on remote hacking attempts. The downside is that you’ll have to keep track of your physical device and backups. Anyone who loses their cold wallet without a recovery phrase faces the possibility of never seeing their crypto again.

Some people take a balanced path, splitting their holdings between the two methods. A portion stays hot for day-to-day transactions, while the rest sits offline. This gives you that sweet spot of easy access and lower risk. Think of it like keeping a bit of cash in your pocket for small expenses, with the bulk of your savings safely locked away.

In the crypto world, your personal habits play a big role in choosing the best wallet type. Day traders and gamers may favor rapid moves, but that also means they should be extra cautious with security steps. Long-haul investors often breathe easier knowing their coins are tucked away in cold storage, though they accept the burden of safeguarding physical devices.

Meet Attentive Reasoning Queries (ARQs): A Structured Approach to Enhancing Large Language Model Instruction Adherence, Decision-Making Accuracy, and Hallucination Prevention in AI-Driven Conversational Systems


Large Language Models (LLMs) have become crucial in customer support, automated content creation, and data retrieval. However, their effectiveness is often hindered by their inability to consistently follow detailed instructions across multiple interactions. This issue is particularly critical in high-stakes environments, such as financial services and customer support systems, where strict adherence to guidelines is essential. LLMs frequently struggle with instruction recall, leading to deviations from intended behaviors. Also, they generate misleading or incorrect information, commonly called hallucination, making their deployment challenging in scenarios requiring precise, context-aware decision-making.

Maintaining reasoning consistency in complex scenarios remains a challenge for LLMs. While they generate coherent responses to simple queries, their performance declines in multi-turn conversations influenced by past interactions. One key issue is alignment drift, where models gradually move away from original instructions, causing misinterpretation of guidelines and incorrect recommendations. Context forgetfulness is another concern, where models prioritize recent information over earlier details, often disregarding critical constraints. These factors contribute to errors that undermine the reliability of LLM-driven systems. Despite strategies like Chain-of-Thought (CoT) and verification-based prompting, existing methods do not provide enough structure to guide models reliably through complex tasks.

Various prompting techniques have been developed to improve instruction adherence. CoT prompting encourages step-by-step reasoning to enhance logical accuracy, while Chain-of-Verification requires explicit self-checking of outputs. Although these methods improve upon direct response generation, they lack mechanisms to reinforce domain-specific constraints and systematically prevent common failures. AI frameworks like LangChain add structural elements for tool integration and workflow automation but treat LLM reasoning as a black box, limiting their ability to enforce strict guidelines. The lack of mechanisms to prevent hallucination and instruction drift highlights the need for a more structured approach.

Researchers at Emcie Co Ltd. developed Attentive Reasoning Queries (ARQs) to address these shortcomings. This novel approach introduces a structured reasoning blueprint designed to guide LLMs systematically through predefined queries. Unlike free-form reasoning methods, ARQs implement a structured JSON schema that directs the model’s attention to specific decision points at critical moments. This design enables ARQs to enhance guideline adherence while minimizing failures caused by misinterpretation or loss of contextual details. To evaluate its effectiveness, the approach was tested within Parlant, a framework used for building customer-facing AI applications. Initial findings demonstrated that ARQs significantly improved instruction-following capabilities while mitigating hallucination-related errors.

The ARQ framework consists of multiple stages that collectively enhance reasoning performance. The first step involves issuing targeted, structured queries that remind the model of key constraints before response generation. These queries reinforce critical instructions, ensuring the model does not deviate from predefined guidelines. Next, the model processes a series of step-by-step queries to reinforce task-specific reasoning. In some implementations, an additional verification step follows, where the model checks its response against predefined correctness criteria before finalizing the output. This structured approach contrasts sharply with CoT prompting by incorporating explicit mechanisms to ensure consistency at every stage of the reasoning process.
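The paper defines its own schema, but as a hypothetical illustration, an ARQ-style prompt structure might combine the active guidelines with targeted queries and an explicit self-check, roughly like this:

```python
import json

# Hypothetical illustration only: the actual ARQ schema used in Parlant is defined by the
# authors. This sketch just shows the idea of predefined, targeted queries plus a
# verification field the model must fill in before producing its final answer.
arq_prompt = {
    "active_guidelines": [
        "Never quote interest rates without the standard disclaimer.",
        "Escalate to a human agent if the customer mentions fraud.",
    ],
    "queries": {
        "which_guidelines_apply": None,           # model lists the guidelines relevant to this turn
        "key_facts_from_context": None,           # model restates constraints it must not forget
        "draft_response": None,
        "response_violates_any_guideline": None,  # explicit self-check before finalizing
    },
}
print(json.dumps(arq_prompt, indent=2))
```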

In a performance evaluation within the Parlant framework, conducted in a controlled test environment comprising 87 distinct conversational scenarios, ARQs achieved a 90.2% success rate, outperforming both CoT reasoning (86.1%) and direct response generation (81.5%). The ARQ methodology excelled in addressing two critical failure modes: guideline re-application and hallucination prevention. Specifically, in cases where the model needed to reapply earlier instructions, ARQs ensured a 92.19% success rate, significantly higher than CoT (87.81%) and direct response generation (85.31%). Also, ARQs reduced the occurrence of factual inaccuracies, with models trained on ARQs exhibiting a 23% lower hallucination rate than those relying on standard CoT techniques. These results underscore the importance of structured reasoning approaches in improving LLM reliability.

Several key takeaways from the research include:

  1. ARQs improved instruction adherence, achieving a 90.2% success rate across 87 test cases, surpassing Chain-of-Thought (86.1%) and direct response generation (81.5%).
  2. ARQs significantly reduced hallucination errors by 23% compared to CoT, making them particularly useful for business-critical AI applications requiring factual consistency.
  3. In guideline re-application scenarios, ARQs outperformed CoT by 4.38%, achieving a success rate of 92.19% compared to CoT’s 87.81%.
  4. The structured nature of ARQs allowed for more efficient reasoning in classification tasks, reducing token usage by 29% compared to CoT.
  5. The verification mechanism in ARQs was key to preventing alignment drift. It ensured that models focused on predefined constraints even in extended conversations.
  6. Future research aims to optimize ARQ efficiency further by refining query design and exploring its application in diverse AI-driven decision-making systems.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.

