Stereo depth estimation plays a crucial role in computer vision by allowing machines to infer depth from a pair of images. This capability is vital for autonomous driving, robotics, and augmented reality applications. Despite advancements in deep learning, many existing stereo-matching models require domain-specific fine-tuning to achieve high accuracy. The challenge lies in developing a model that generalizes across different environments without additional training.
One of the key problems in stereo depth estimation is the domain gap between training and real-world data. Many current approaches depend on small, specific datasets that fail to capture the complexity of natural environments. This limitation results in models that perform well on controlled benchmarks but fail in diverse scenarios. Furthermore, fine-tuning these models for new domains is computationally expensive and impractical for real-time applications. Overcoming these challenges requires a more robust approach that eliminates the need for domain-specific training.
Traditional stereo depth estimation methods rely on constructing cost volumes, which encode the matching cost between image pairs across candidate disparities. These methods utilize 3D convolutional neural networks (CNNs) for cost filtering but struggle to generalize beyond their training data. Iterative refinement techniques attempt to enhance accuracy by progressively improving disparity predictions, but they are limited by their reliance on recurrent modules, which increase computational costs. Some recent methods have explored transformer-based architectures yet have struggled to handle the disparity search space effectively while maintaining efficiency.
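To make the cost-volume idea concrete, here is a minimal sketch assuming correlation-based matching over features from a shared encoder; the feature shapes and the choice of correlation rather than concatenation are illustrative, not FoundationStereo's specific design.

```python
# Illustrative sketch: building a correlation-based cost volume by shifting the
# right-image features across candidate disparities.
import torch

def build_cost_volume(feat_left, feat_right, max_disp):
    """feat_left, feat_right: (B, C, H, W) feature maps from a shared encoder."""
    B, C, H, W = feat_left.shape
    cost = feat_left.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            # Correlate left features with right features shifted d pixels.
            cost[:, d, :, d:] = (feat_left[:, :, :, d:] * feat_right[:, :, :, :-d]).mean(dim=1)
    return cost  # (B, D, H, W): a matching score per candidate disparity

# A 3D CNN (or a module such as AHCF) then filters this volume before
# disparities are regressed.
```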
Researchers at NVIDIA introduced FoundationStereo, a foundation model designed to address these limitations and achieve strong zero-shot generalization. To build this model, the research team created a large-scale synthetic training dataset containing one million stereo-image pairs with high photorealism and diverse scenarios. An automated self-curation pipeline was developed to filter out ambiguous samples, ensuring high-quality training data. Further, the model incorporates a side-tuning feature backbone, which leverages monocular priors from existing vision foundation models. This adaptation bridges the gap between synthetic and real-world data, improving generalization without requiring per-domain fine-tuning.
The methodology behind FoundationStereo integrates several innovative components. The Attentive Hybrid Cost Volume (AHCF) module is a key element that enhances disparity estimation by combining 3D Axial-Planar Convolution and a Disparity Transformer. The 3D Axial-Planar Convolution refines cost volume filtering by separating spatial and disparity information, leading to improved feature aggregation. Meanwhile, the Disparity Transformer introduces long-range context reasoning, allowing the model to process complex depth structures effectively. Moreover, FoundationStereo employs a hybrid approach, integrating a CNN with a Vision Transformer (ViT) to adapt monocular depth priors into the stereo framework. Combining these techniques ensures a more precise initial disparity estimation, which is further refined through iterative processing.
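As a rough illustration of the axial-planar idea, the sketch below factors a full 3D convolution over the (disparity, height, width) cost volume into a spatial convolution followed by a disparity-axis convolution; channel counts, kernel sizes, and activations are assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of separating spatial and disparity filtering in a cost volume.
import torch.nn as nn

class AxialPlanarConv3d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Spatial (height/width) filtering only.
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Disparity-axis filtering only.
        self.disparity = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, C, D, H, W) cost volume
        return self.act(self.disparity(self.act(self.spatial(x))))
```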
Performance evaluation of FoundationStereo demonstrates its superiority over existing methods. To assess its zero-shot generalization capabilities, the model was tested on multiple datasets, including Middlebury, KITTI, and ETH3D. When trained solely on Scene Flow, FoundationStereo significantly reduced error rates compared to previous models. On the Middlebury dataset, for instance, it recorded a BP-2 error of 4.4%, outperforming prior state-of-the-art methods. On ETH3D, it achieved a BP-1 error of 1.1%, further establishing its robustness. On KITTI-15, the model attained a D1 error rate of 2.3%, marking a significant improvement over previous benchmarks. Qualitative comparisons on in-the-wild images revealed its ability to handle challenging scenarios, including reflections, textureless surfaces, and complex lighting conditions. These results highlight the effectiveness of FoundationStereo’s architecture in achieving reliable depth estimation without fine-tuning.
The research presents a major advancement in stereo-depth estimation by addressing generalization challenges and computational efficiency. By leveraging a large-scale synthetic dataset and integrating monocular priors with innovative cost-filtering techniques, FoundationStereo eliminates the need for domain-specific training while maintaining high accuracy across different environments. The findings demonstrate how the proposed methodology sets a new benchmark for zero-shot stereo-matching models and paves the way for more versatile applications in real-world settings.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in Material Science, he is exploring new advancements and creating opportunities to contribute.
Artificial Neural Networks (ANNs) have revolutionized computer vision with great performance, but their “black-box” nature creates significant challenges in domains requiring transparency, accountability, and regulatory compliance. The opacity of these systems hampers their adoption in critical applications where understanding decision-making processes is essential. Scientists are curious to understand these models’ internal mechanisms and want to utilize these insights for effective debugging, model improvement, and exploring potential parallels with neuroscience. These factors have catalyzed the rapid development of explainable artificial intelligence (XAI) as a dedicated field. It focuses on the interpretability of ANNs, bridging the gap between machine intelligence and human understanding.
Concept-based methods are powerful frameworks among XAI approaches for revealing intelligible visual concepts within ANNs’ complex activation patterns. Recent research characterizes concept extraction as dictionary learning problems, where activations map to a higher-dimensional, sparse “concept space” that is more interpretable. Techniques like Non-negative Matrix Factorization (NMF) and K-Means are used to accurately reconstruct original activations, while Sparse Autoencoders (SAEs) have recently gained prominence as powerful alternatives. SAEs achieve an impressive balance between sparsity and reconstruction quality but suffer from instability. Training identical SAEs on the same data can produce different concept dictionaries, limiting their reliability and interpretability for meaningful analysis.
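As a small, generic illustration of casting concept extraction as dictionary learning (not the authors' pipeline), the snippet below factors an activation matrix into non-negative codes and a concept dictionary with NMF; the activation data and number of concepts are placeholders.

```python
# Sketch: concept extraction as dictionary learning via NMF on activations.
import numpy as np
from sklearn.decomposition import NMF

activations = np.abs(np.random.randn(10_000, 768))   # stand-in for ViT token activations
nmf = NMF(n_components=32, init="nndsvd", max_iter=300)
codes = nmf.fit_transform(activations)                # (n_tokens, n_concepts) concept codes
dictionary = nmf.components_                          # (n_concepts, d_model) concept directions

reconstruction = codes @ dictionary
print("reconstruction error:", np.linalg.norm(activations - reconstruction))
```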
Researchers from Harvard University, York University, CNRS, and Google DeepMind have proposed two novel variants of Sparse Autoencoders to address the instability issues: Archetypal-SAE (A-SAE) and its relaxed counterpart (RA-SAE). These approaches build upon archetypal analysis to enhance stability and consistency in concept extraction. The A-SAE model constrains each dictionary atom to reside strictly within the convex hull of the training data, which imposes a geometric constraint that improves stability across different training runs. The RA-SAE extends this framework further by incorporating a small relaxation term, allowing for slight deviations from the convex hull to enhance modeling flexibility while maintaining stability.
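A hedged sketch of the geometric constraint described above: each dictionary atom is parameterized as a convex combination of candidate points (for example, K-Means centroids of the data), so it stays inside their convex hull, and RA-SAE adds a small relaxation term. The names and the exact relaxation mechanics here are illustrative, not the released implementation.

```python
# Sketch: dictionary atoms constrained to the convex hull of candidate points.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchetypalDictionary(nn.Module):
    def __init__(self, candidates, n_atoms, relax=0.0):
        super().__init__()
        self.candidates = candidates              # (n_points, d), e.g. K-Means centroids of the data
        self.logits = nn.Parameter(torch.randn(n_atoms, candidates.shape[0]))
        self.relax = relax
        self.residual = nn.Parameter(torch.zeros(n_atoms, candidates.shape[1]))

    def forward(self):
        weights = F.softmax(self.logits, dim=-1)  # rows on the simplex -> convex combinations
        atoms = weights @ self.candidates         # atoms lie inside the convex hull of candidates
        if self.relax > 0:                        # RA-SAE-style relaxation: small deviation allowed
            atoms = atoms + self.relax * self.residual
        return atoms
```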
The researchers evaluate their approach using five vision models: DINOv2, SigLip, ViT, ConvNeXt, and ResNet50, all obtained from the timm library. They construct overcomplete dictionaries five times the feature dimension (e.g., 768×5 for DINOv2 and 2048×5 for ConvNeXt), providing sufficient capacity for concept representation. The models are trained on the entire ImageNet dataset, processing approximately 1.28 million images that generate over 60 million tokens per epoch for ConvNeXt and more than 250 million tokens per epoch for DINOv2, for 50 epochs. Moreover, RA-SAE builds upon a TopK SAE architecture to maintain consistent sparsity levels across experiments, and the candidate matrix used for the archetypal convex combinations is obtained by K-Means clustering of the entire dataset into 32,000 centroids.
The results demonstrate significant performance differences between traditional approaches and the proposed methods. Classical dictionary learning algorithms and standard SAEs show comparable performance but struggle to recover true generative factors in the tested datasets accurately. In contrast, RA-SAE achieves higher accuracy in recovering underlying object classes across all synthetic datasets used in the evaluation. In qualitative results, RA-SAE uncovers meaningful concepts, including shadow-based features linked to depth reasoning, context-dependent concepts like “barber”, and fine-grained edge detection capabilities in flower petals. Moreover, it learns more structured within-class distinctions than TopK-SAEs, separating features like rabbit ears, faces, and paws into distinct concepts rather than mixing them.
In conclusion, researchers have introduced two variants of Sparse Autoencoders: A-SAE and its relaxed counterpart RA-SAE. A-SAE constrains dictionary atoms to the convex hull of the training data and enhances stability while preserving expressive power. Then, RA-SAE effectively balances reconstruction quality with meaningful concept discovery in large-scale vision models. To evaluate these approaches, the team developed novel metrics and benchmarks inspired by identifiability theory, providing a systematic framework for measuring dictionary quality and concept disentanglement. Beyond computer vision, A-SAE establishes a foundation for more reliable concept discovery across broader modalities, including LLMs and other structured data domains.
Check out the Paper. All credit for this research goes to the researchers of this project.
Sajjad Ansari is a final year undergraduate from IIT Kharagpur. As a Tech enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible manner.
Modern VLMs struggle with tasks requiring complex visual reasoning, where understanding an image alone is insufficient, and deeper interpretation is needed. While recent advancements in LLMs have significantly improved text-based reasoning, similar progress in the visual domain remains limited. Existing VLMs often fail when required to combine visual and textual cues for logical deductions, highlighting a critical gap in their capabilities. This limitation is particularly evident in tasks that demand stepwise reasoning, where merely recognizing objects in an image is inadequate without an underlying understanding of relationships and contextual information.
Prior research on multimodal AI has primarily focused on object detection, captioning, and question answering, with limited exploration of higher-order reasoning. Some studies have attempted to enhance VLMs with chain-of-thought prompting or explicit reasoning structures. Still, these approaches are either restricted to textual data or fail to generalize across diverse visual tasks. Moreover, most open-source efforts in this area remain underdeveloped, making it difficult to advance visual reasoning beyond simple recognition tasks. Addressing these gaps is crucial for developing VLMs to perform sophisticated reasoning on real-world images.
Groundlight researchers explored training VLMs for visual reasoning with reinforcement learning, leveraging GRPO to enhance efficiency. While prior work, such as DeepSeek’s research on advanced reasoning in language models, had demonstrated the value of these techniques, little had been done to extend them to VLMs. To demonstrate their approach, the researchers designed a cryptogram-solving task requiring both visual and textual processing. The model deciphers encoded messages using a randomly generated decoder image, achieving 96% accuracy with a 3B parameter model. Attention analysis confirms the model actively engages with the visual input, focusing on the relevant decoder regions while solving the task.
Training VLMs with GRPO presents multiple challenges, particularly in tokenization and reward design. Since models process text as tokens rather than individual characters, tasks requiring precise character-level reasoning can be problematic. To mitigate this, researchers formatted messages with spaces between letters to simplify decoding. Reward design was another crucial aspect, as reinforcement learning models require well-structured feedback to learn effectively. Three reward types were used: a format reward ensuring consistency in output, a decoding reward encouraging meaningful transformations of scrambled text, and a correctness reward refining accuracy. By carefully balancing these rewards, the researchers prevented unintended learning shortcuts, ensuring the model genuinely improved at cryptogram solving.
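Below is an illustrative sketch of how the three reward signals might look in code; the answer-tag format, partial-credit values, and function names are hypothetical, not Groundlight's exact reward shaping.

```python
# Sketch of format, decoding, and correctness rewards for the cryptogram task.
import re

def format_reward(completion: str) -> float:
    """Reward well-formed output, e.g. an answer wrapped in <answer> tags."""
    return 1.0 if re.search(r"<answer>.*</answer>", completion, re.DOTALL) else 0.0

def decoding_reward(completion: str, ciphertext: str) -> float:
    """Reward producing a transformation of the scrambled text rather than copying it."""
    answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
    return 0.5 if answer and answer.lower() != ciphertext.lower() else 0.0

def correctness_reward(completion: str, target: str) -> float:
    """Reward character-level agreement with the true decoded message."""
    answer = completion.split("<answer>")[-1].split("</answer>")[0].strip()
    if not answer:
        return 0.0
    matches = sum(a == b for a, b in zip(answer.lower(), target.lower()))
    return matches / max(len(target), 1)
```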
GRPO, which optimizes learning by comparing multiple outputs rather than relying on direct gradient computation, provided advantages in stabilizing training. By generating various responses per query and evaluating them relative to each other, the approach allowed for smoother learning curves. The research also highlighted the potential of VLMs in reasoning-based tasks but acknowledged the high computational costs associated with complex vision models. Techniques like selective model escalation were proposed to address efficiency concerns, where expensive models are used only for ambiguous cases. Additionally, integrating pre-trained models for object detection, segmentation, and depth estimation was suggested to enhance reasoning without significantly increasing computational overhead. This tool-based approach offers a scalable alternative to training massive end-to-end models, emphasizing efficiency without compromising accuracy.
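A minimal sketch of the group-relative scoring at the heart of GRPO: several completions are sampled per query, and each completion's reward is normalized against the group's mean and standard deviation to form its advantage. The example numbers are illustrative.

```python
# Sketch: group-relative advantages as used in GRPO-style training.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (n_queries, n_samples_per_query) scalar rewards."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([[0.2, 0.9, 0.4, 0.1]])   # four sampled responses for one query
print(group_relative_advantages(rewards))        # the best response gets the largest advantage
```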
In conclusion, the Groundlight team has made significant strides in enhancing VLMs by integrating reinforcement learning techniques, specifically GRPO. Their approach was tested on a cryptogram-solving task, where the model demonstrated impressive accuracy. This advancement underscores the potential of combining visual and textual data to improve VLM performance. By open-sourcing their methodology and tools, Groundlight aims to empower the broader community to further develop visual reasoning capabilities in AI systems.
Check out the Technical details, GitHub Page and Demo. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
Did you know that 43% of cyberattacks target small businesses, yet only 14% are prepared to defend themselves? For start-ups, the stakes are even higher. A single breach could lead to significant financial losses, tarnish your reputation and derail your growth trajectory. So, all things considered, cybersecurity is no longer a luxury or an afterthought; it is the need of the hour.
As cybercriminals become more sophisticated, start-ups must recognise their vulnerabilities and take proactive measures to safeguard their operations. This guide explores the essential steps you need to protect your venture, from identifying risks to implementing strong cybersecurity practices.
Why Cybersecurity Matters for Start-ups
Start-ups often operate under the misconception that their small size makes them unlikely targets for cybercriminals. However, this couldn’t be further from the truth. Start-ups are particularly vulnerable to cyber attacks due to their limited resources, growing digital footprint, and often less-developed security infrastructure.
Cybercriminals understand that strained resources mean that start-ups cannot invest significant time, effort or money into large-scale security measures, making them easier targets. The generally tighter-knit employee pool also means that the company is more susceptible to social engineering tactics than a large decentralised corporation.
The consequences of a cyber attack can be devastating for a start-up. Beyond immediate financial losses, businesses face potential damage to their reputation, loss of customer trust and legal ramifications. And for many start-ups, a significant security breach could mean the difference between survival and failure.
But cybersecurity is not always about the sheer volume of resources you can throw at the problem. Understanding the nature of the attacks headed your way can significantly help you fight against them.
Common Cyber Risks Facing Start-ups
Understanding the threats your start-up faces is the first step toward protecting against them. Here are the primary risks that start-ups need to be aware of:
Data Breaches
Start-ups often handle sensitive information, including customer data, intellectual property and financial records. A data breach can expose this valuable information, and such attacks commonly target large-scale data storage such as servers or cloud platforms. The resulting financial losses and the hit to the company’s reputation can be especially damaging.
Ransomware Attacks
These attacks encrypt a company’s data and demand payment for its release. Start-ups are particularly vulnerable due to their often limited backup systems and immediate need for data access to maintain operations. While the methods vary from social engineering to physically infecting systems, the attack typically involves a custom-made program that locks away access to data, with payment demanded under the threat of deleting or releasing it.
Phishing Scams
Sophisticated phishing attempts can trick employees into revealing sensitive information or downloading malicious software (usually by masquerading as a trusted entity). These attacks often target start-ups due to their typically less-experienced workforce. Start-ups also often consist of a small group of employees who have high-level access to sensitive information, making them more vulnerable.
Cloud Security Vulnerabilities
As start-ups increasingly rely on cloud services for operations, inadequate cloud security measures can leave sensitive data exposed to unauthorised access. Solutions like large offline server farms are often not financially feasible for start-ups, leaving them with no fallback if a cloud service is compromised.
Cybersecurity Best Practices for Start-ups
Implementing strong cybersecurity measures doesn’t have to be overwhelming. Here are key practices that every start-up should adopt:
Regular Security Assessments
Conduct periodic security audits to identify vulnerabilities in your systems and processes. This helps you stay ahead of potential threats and address weaknesses before they can be exploited.
Data Encryption
Implement strong encryption protocols for all sensitive data, both in transit and at rest. This ensures that even if data is intercepted, it remains unreadable to unauthorised parties. Encryption protocols and programs can get expensive, but they are often worth the investment.
Access Control
Establish strict access control policies, limiting employee access to only the data and systems they need for their specific roles. This is especially important for start-ups, where a small team often holds broad access to sensitive information. Hierarchical access control restricts the damage that can be inflicted even if a single employee is compromised.
Backup Systems
Maintain regular backups of all critical data and systems, storing them securely off-site or in the cloud. This provides a safety net in case of data loss or ransomware attacks.
How to Implement Cybersecurity Measures
Creating a strong cybersecurity foundation requires a systematic approach and ongoing commitment. Here’s how to get started:
1. Develop a Security Policy
Create clear guidelines for data handling, access controls and security procedures. This policy should be documented and easily accessible to all employees.
2. Invest in Employee Education
While IT professionals with qualifications like a Master of Cyber Security understand the complexities of cyber threats, most employees will need additional security training. Regular security awareness training helps employees recognise and respond to potential threats, making them your first line of defence against cyber attacks.
3. Implement Technical Controls
Deploy essential security tools such as:
Multi-factor authentication
Endpoint protection solutions
Virtual Private Networks (VPNs)
Firewalls and antivirus software
4. Create an Incident Response Plan
Develop and maintain a clear plan for responding to security incidents. This should include steps for containing the breach, assessing damage and notifying affected parties.
5. Regular Updates and Maintenance
Keep all software and systems updated with the latest security patches. Regularly review and update security measures to address emerging threats.
Building a Security-First Culture
Creating a security-conscious culture is crucial for long-term success. This means making cybersecurity a priority from day one and integrating it into every aspect of your operations. Encourage open communication about security concerns and celebrate security-conscious behaviour.
Investing in Cybersecurity
For start-ups, cybersecurity isn’t just an IT issue – it’s a business imperative. By understanding the risks and implementing appropriate security measures, you can protect your venture’s future while building trust with customers and stakeholders.
Remember, cybersecurity is an ongoing journey, not a destination. Start with the basics, build incrementally, and stay vigilant as your business grows.
While the initial investment in cybersecurity might seem substantial, the cost of a security breach far outweighs the resources required for prevention. By taking proactive steps to secure your start-up today, you’re investing in its long-term success and sustainability in an increasingly digital world.
Normalization layers have become fundamental components of modern neural networks, significantly improving optimization by stabilizing gradient flow, reducing sensitivity to weight initialization, and smoothing the loss landscape. Since the introduction of batch normalization in 2015, various normalization techniques have been developed for different architectures, with layer normalization (LN) becoming particularly dominant in Transformer models. Their widespread use is largely attributed to their ability to accelerate convergence and enhance model performance, especially as networks grow deeper and more complex. Despite ongoing architectural innovations that replace other core components like attention or convolution layers, normalization layers remain integral to most designs, underscoring their perceived necessity in deep learning.
While normalization layers have proven beneficial, researchers have also explored methods to train deep networks without them. Studies have proposed alternative weight initialization strategies, weight normalization techniques, and adaptive gradient clipping to maintain stability in models like ResNets. In Transformers, recent efforts have examined modifications that reduce reliance on normalization, such as restructuring Transformer blocks or gradually removing LN layers through fine-tuning. These approaches demonstrate that, while normalization layers offer optimization advantages, they are not strictly indispensable, and alternative training techniques can achieve stable convergence with comparable performance.
Researchers from FAIR, Meta, NYU, MIT, and Princeton propose Dynamic Tanh (DyT) as a simple yet effective alternative to normalization layers in Transformers. DyT operates as an element-wise function, DyT(x) = tanh(αx), where α is a learnable parameter that scales activations while limiting extreme values. Unlike layer normalization, DyT eliminates the need for activation statistics, simplifying computations. Empirical evaluations show that replacing normalization layers with DyT maintains or improves performance across various tasks without extensive hyperparameter tuning. Additionally, DyT enhances training and inference efficiency, challenging the assumption that normalization is essential for modern deep networks.
Researchers analyzed normalization layers in Transformers using models like ViT-B, wav2vec 2.0, and DiT-XL. They found that LN often exhibits a tanh-like, S-shaped input-output mapping, primarily linear for most values but squashing extreme activations. Inspired by this, they propose Dynamic Tanh (DyT) as a replacement for LN. Defined as DyT(x) = γ · tanh(αx) + β, where α, γ, and β are learnable parameters, DyT preserves LN’s effects without computing activation statistics. Empirical results show DyT integrates seamlessly into existing architectures, maintaining stability and reducing the need for hyperparameter tuning.
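A minimal sketch of a DyT layer following the definition above, intended as a drop-in replacement for a normalization layer; the initialization of α is an assumption, not the paper's prescribed value.

```python
# Sketch: DyT(x) = gamma * tanh(alpha * x) + beta, with no activation statistics.
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))   # scalar learnable scale
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x):
        # Unlike layer normalization, no mean or variance is computed here.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```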
To evaluate DyT’s effectiveness, experiments were conducted across various architectures and tasks by replacing LN or RMSNorm with DyT while keeping hyperparameters unchanged. In supervised vision tasks, DyT slightly outperformed LN in ImageNet-1K classification. For self-supervised learning, diffusion models, language models, speech processing, and DNA sequence modeling, DyT achieved performance comparable to existing normalization methods. Efficiency tests on LLaMA-7B showed DyT reduced computation time. Ablation studies highlighted the importance of the tanh function and learnable parameter α, which correlated with activation standard deviation, acting as an implicit normalization mechanism. DyT demonstrated competitive performance with improved efficiency.
In conclusion, the study shows that modern neural networks, particularly Transformers, can be trained effectively without normalization layers. The proposed DyT replaces traditional normalization using a learnable scaling factor alpha and an S-shaped tanh function to regulate activation values. Despite its simplicity, DyT replicates normalization behavior and achieves comparable or superior performance across various tasks, including recognition, generation, and self-supervised learning. The results challenge the assumption that normalization layers are essential, offering new insights into their function. DyT provides a lightweight alternative that simplifies training while maintaining or improving performance, often without requiring hyperparameter adjustments.
Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.
Sana Hassan, a consulting intern at Marktechpost and dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
LLMs are widely used for conversational AI, content generation, and enterprise automation. However, balancing performance with computational efficiency is a key challenge in this field. Many state-of-the-art models require extensive hardware resources, making them impractical for smaller enterprises. The demand for cost-effective AI solutions has led researchers to develop models that deliver high performance with lower computational requirements.
Training and deploying AI models present hurdles for researchers and businesses. Large-scale models require substantial computational power, making them costly to maintain. Also, AI models must handle multilingual tasks, ensure high instruction-following accuracy, and support enterprise applications such as data analysis, automation, and coding. Current market solutions, while effective, often demand infrastructure beyond the reach of many enterprises. The challenge is to optimize AI models for processing efficiency without compromising accuracy or functionality.
Several AI models currently dominate the market, including GPT-4o and DeepSeek-V3. These models excel in natural language processing and generation but require high-end hardware, sometimes needing up to 32 GPUs to operate effectively. While they provide advanced capabilities in text generation, multilingual support, and coding, their hardware dependencies limit accessibility. Some models also struggle with enterprise-level instruction-following accuracy and tool integration. Businesses need AI solutions that maintain competitive performance while minimizing infrastructure and deployment costs. This demand has driven efforts to optimize language models to function with minimal hardware requirements.
Researchers from Cohere introduced Command A, a high-performance AI model designed specifically for enterprise applications requiring maximum efficiency. Unlike conventional models that require large computational resources, Command A operates on just two GPUs while maintaining competitive performance. The model comprises 111 billion parameters and supports a context length of 256K, making it suitable for enterprise applications that involve long-form document processing. Its ability to efficiently handle business-critical agentic and multilingual tasks sets it apart from its predecessors. The model has been optimized to provide high-quality text generation while reducing operational costs, making it a cost-effective alternative for businesses aiming to leverage AI for various applications.
The underlying technology of Command A is structured around an optimized transformer architecture, which includes three layers of sliding window attention, each with a window size of 4096 tokens. This mechanism enhances local context modeling, allowing the model to retain important details across extended text inputs. A fourth layer incorporates global attention without positional embeddings, enabling unrestricted token interactions across the entire sequence. The model’s supervised fine-tuning and preference training further refine its ability to align responses with human expectations regarding accuracy, safety, and helpfulness. Also, Command A supports 23 languages, making it one of the most versatile AI models for businesses with global operations. Its chat capabilities are preconfigured for interactive behavior, enabling seamless conversational AI applications.
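As a rough illustration of the interleaved attention pattern described above (and not Cohere's actual configuration), the sketch below alternates three sliding-window layers with one global-attention layer; whether the local layers use positional embeddings and how the pattern repeats across the full depth are assumptions.

```python
# Sketch: interleaving sliding-window attention layers with periodic global layers.
def layer_attention_pattern(n_layers: int, window: int = 4096):
    pattern = []
    for i in range(n_layers):
        if (i + 1) % 4 == 0:
            # Every fourth layer: global attention, no positional embeddings.
            pattern.append({"type": "global", "window": None, "positional_embeddings": False})
        else:
            # Local context modeling over a 4096-token window.
            pattern.append({"type": "sliding_window", "window": window, "positional_embeddings": True})
    return pattern

for layer in layer_attention_pattern(8):
    print(layer)
```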
Performance evaluations indicate that Command A competes favorably with leading AI models such as GPT-4o and DeepSeek-V3 across various enterprise-focused benchmarks. The model achieves a token generation rate of 156 tokens per second, 1.75 times higher than GPT-4o and 2.4 times higher than DeepSeek-V3, making it one of the most efficient models available. Regarding cost efficiency, private deployments of Command A are up to 50% cheaper than API-based alternatives, significantly reducing the financial burden on businesses. Command A also excels in instruction-following tasks, SQL-based queries, and retrieval-augmented generation (RAG) applications. It has demonstrated high accuracy in real-world enterprise data evaluations, outperforming its competitors in multilingual business use cases.
In a direct comparison of enterprise task performance, human evaluation results show that Command A consistently outperforms its competitors in fluency, faithfulness, and response utility. The model’s enterprise-ready capabilities include robust retrieval-augmented generation with verifiable citations, advanced agentic tool use, and high-level security measures to protect sensitive business data. Its multilingual capabilities extend beyond simple translation, demonstrating superior proficiency in responding accurately in region-specific dialects. For instance, evaluations of Arabic dialects, including Egyptian, Saudi, Syrian, and Moroccan Arabic, revealed that Command A delivered more precise and contextually appropriate responses than leading AI models. These results emphasize its strong applicability in global enterprise environments where language diversity is crucial.
Several key takeaways from the research include:
Command A operates on just two GPUs, significantly reducing computational costs while maintaining high performance.
With 111 billion parameters, the model is optimized for enterprise-scale applications that require extensive text processing.
The model supports a 256K context length, enabling it to process longer enterprise documents more effectively than competing models.
Command A is trained on 23 languages, ensuring high accuracy and contextual relevance for global businesses.
It achieves 156 tokens per second, 1.75x higher than GPT-4o and 2.4x higher than DeepSeek-V3.
The model consistently outperforms competitors in real-world enterprise evaluations, excelling in SQL, agentic, and tool-based tasks.
Advanced RAG capabilities with verifiable citations make it highly suitable for enterprise information retrieval applications.
Private deployments of Command A can be up to 50% cheaper than API-based models.
The model includes enterprise-grade security features, ensuring safe handling of sensitive business data.
Demonstrates high proficiency in regional dialects, making it ideal for businesses operating in linguistically diverse regions.
Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.