Seamless AI Acceleration for Developers Everywhere
To scale the AI opportunity, developers need fast paths to AI deployment, together with performance that best suits their specific workload. Arm is dedicated to maximizing AI performance across the entirety of the Arm platform, helping to ensure seamless acceleration for every developer, every model, and every workload.
Robust AI Ecosystem
Arm connects developers with a vibrant ecosystem of machine learning software providers, frameworks, and open-source projects supporting the latest AI features.
Open Source Technology
Arm Kleidi technologies are integrated into popular frameworks, accelerating models on Arm CPUs without extra developer effort.
Developer Enablement
Community contributions and extensive resources include usage guides, learning paths, and demos to empower developers.
Connecting Developers to a Robust AI Software Ecosystem
The purpose of Arm Kleidi is to collaborate with leading AI frameworks, cloud service providers, and the machine learning ISV community to deliver out-of-the-box inference performance improvements across the full ML stack for billions of workloads, with no extra developer work or expertise required.
PyTorch
Arm works closely with the PyTorch community, helping to ensure models running on PyTorch just work on Arm and driving seamless acceleration for even the most demanding AI workloads.
BERT-Large
Arm has been working to improve PyTorch inference performance on Arm CPUs, including optimizing the primary execution modes, Eager Mode and Graph Mode.
Integrating Kleidi improves Llama model inference by up to 18 times and Gemma 2 2B by 15 times, and lifts performance for natural language processing (NLP) models, including a 2.2 times uplift on BERT-Large.
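As an illustrative sketch only (the model choice and timing loop are ours, not Arm's demo code), the snippet below shows how a developer might compare Eager Mode with Graph Mode via torch.compile for BERT-Large on an Arm CPU; Kleidi-backed kernels are engaged automatically by the framework, with no extra code.

```python
# Minimal sketch: comparing PyTorch Eager Mode and Graph Mode (torch.compile)
# inference for BERT-Large on an Arm CPU. Model and input are illustrative;
# optimized kernels are picked up by the framework where available.
import time
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased").eval()
inputs = tokenizer("Arm CPUs accelerate AI inference.", return_tensors="pt")

def bench(m, n=10):
    with torch.inference_mode():
        m(**inputs)                      # warm-up (triggers compilation)
        start = time.perf_counter()
        for _ in range(n):
            m(**inputs)
    return (time.perf_counter() - start) / n

eager_s = bench(model)                   # Eager Mode baseline
graph_s = bench(torch.compile(model))    # Graph Mode via torch.compile
print(f"eager: {eager_s * 1e3:.1f} ms, compiled: {graph_s * 1e3:.1f} ms")
```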
Llama 3.1 8B
Using Arm Neoverse V2-based Graviton4 processors, we can achieve an estimated 12 times uplift in token generation rate for a chatbot demo with KleidiAI optimizations applied to PyTorch.
This demo shows how easy it is to build AI applications using LLMs, making use of existing Arm-based compute capacity.
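A minimal sketch of such an application is shown below, assuming access to Meta's gated Llama 3.1 8B Instruct weights on Hugging Face; the model ID and parameters are illustrative rather than taken from Arm's demo.

```python
# Illustrative sketch of a simple CPU-based chatbot turn with Hugging Face
# transformers on an Arm server (for example, a Graviton4 instance).
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed, gated model ID
    torch_dtype=torch.bfloat16,                # bfloat16 suits Neoverse V2
)

messages = [{"role": "user", "content": "Summarize the Arm architecture in one sentence."}]
out = chat(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # assistant's reply
```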
RoBERTa
AWS collaborated with Arm to optimize the PyTorch torch.compile feature for Neoverse V1-based Graviton3 processors with Arm Compute Library (ACL) kernels using oneDNN.
This optimization results in up to 2 times inference performance improvement for the most popular NLP models on Hugging Face.
FunASR Paraformer-Large
FunASR is an advanced open-source automatic speech recognition (ASR) toolkit developed by Alibaba DAMO Academy.
By integrating ACL with PyTorch via oneDNN, we have seen a 2.3 times performance improvement when running the Paraformer model on Neoverse N2-based AliCloud Yitian710 processors.
ExecuTorch
Together, Arm and ExecuTorch, a lightweight ML framework, enable efficient on-device inference at the edge.
Stable Audio Open
Stability AI and Arm have partnered to accelerate on-device generative AI, unlocking real-time audio generation capabilities without the need for an internet connection.
Through model distillation and Arm KleidiAI, Stable Audio Open now delivers text-to-audio generation on Arm-based smartphones 30 times faster than before, letting users create high-quality sounds at the edge in seconds.
Llama 3.2 1B
Thanks to the collaborative efforts of Arm and Meta, AI developers can now run quantized Llama 3.2 models up to 20% faster on Arm CPUs than before.
By integrating KleidiAI with ExecuTorch and developing optimized quantization schemes, we have achieved speeds of over 350 tokens per second on the prefill stage for generative AI workloads on mobile.
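For orientation, the sketch below follows ExecuTorch's documented export flow, lowering a small stand-in PyTorch module to a .pte program via the XNNPACK backend, the route through which KleidiAI micro-kernels are reached on Arm CPUs. The real Llama 3.2 recipe and its quantization scheme are more involved; quantization is omitted here for brevity.

```python
# Sketch of the ExecuTorch export flow: capture a PyTorch model, lower it
# through the Edge dialect, delegate supported ops to the XNNPACK backend,
# and serialize a runtime-loadable .pte file. TinyMLP is a stand-in model.
import torch
from torch.export import export
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 8)
        )

    def forward(self, x):
        return self.net(x)

example_inputs = (torch.randn(1, 64),)
exported = export(TinyMLP().eval(), example_inputs)  # capture the graph
edge = to_edge(exported)                             # convert to Edge dialect
edge = edge.to_backend(XnnpackPartitioner())         # delegate ops to XNNPACK
program = edge.to_executorch()                       # emit ExecuTorch program

with open("tiny_mlp.pte", "wb") as f:
    f.write(program.buffer)                          # runtime-loadable binary
```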
Llama.cpp
To demonstrate the capability of Arm-based CPUs for LLM inference, Arm and partners are optimizing the int4 and int8 kernels implemented in llama.cpp to leverage newer Arm instructions, such as the Neon dot product (dotprod) and int8 matrix multiply (i8mm) extensions.
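As a hedged illustration of how these kernels are exercised in practice, the snippet below uses the llama-cpp-python bindings to run a 4-bit (Q4_0) GGUF model on CPU threads; the model path is hypothetical, and any Q4_0-quantized GGUF file will do.

```python
# Minimal sketch: running a Q4_0 (4-bit) GGUF model on Arm CPU threads via
# the llama-cpp-python bindings, which wrap llama.cpp's optimized kernels.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-q4_0.gguf",  # hypothetical local path
    n_threads=8,                               # match available CPU cores
    n_ctx=2048,                                # context window
)

out = llm("Explain KV caching in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```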
Custom SLM
AWS and Arm have fine-tuned the TinyLlama 1.1B SLM to create a chatbot for a car manual, enabling drivers to interact directly with their vehicle. Using KleidiAI, SLM inference is 10 times faster than before on Arm Cortex-A76 CPUs, achieving response times of 3 seconds.
TinyLlama 1.1B
Using llama.cpp with KleidiAI, VicOne accelerated performance, doubling prefill speed and achieving a 60% uplift in encode. This partnership enables fast in-vehicle cybersecurity threat detection by reducing cloud dependency, lowering costs, and keeping data secure onboard.
TinyStories
TinyStories is a dataset containing words a typical 3-year-old might understand, used to train and evaluate small models with fewer than 10M parameters. When running a TinyStories-trained model on the Arm Cortex-A320 CPU, a performance uplift of over 70% has been achieved.
Llama 3.3 70B
In partnership with Meta, and leveraging KleidiAI with 4-bit quantization, the model achieved performance similar to the far larger Llama 3.1 405B, sustaining a consistent 50 tokens per second when deployed on Arm Neoverse-powered Google Axion processors.
Phi 3 3.8B
Due to our optimizations, time-to-first-token (TTFT) for Microsoft’s Phi 3 LLM is accelerated by around 190% when running a chatbot demo on the Arm Cortex-X925 CPU used in premium smartphones.
Llama 3 8B
Running a text generation demo on Graviton3 processors with our optimizations achieves a 2.5 times performance uplift for TTFT and over 35 tokens per second in the text generation phase, which is more than sufficient for real-time use cases.
Other Leading Frameworks
To maximize AI performance across the entirety of the Arm compute platform, we are dedicated to optimizing inference workloads across all major AI and ML frameworks.
MNN
MNN is an open-source deep learning framework developed by Alibaba. Our partnership helps improve performance and efficiency for on-device multimodal use cases.
As demonstrated with the multilingual instruction-tuned Qwen2-VL 2B model, integrating Kleidi with MNN accelerates prefill performance by 57% and decode by 28%.
OpenCV
With increasing demand for advanced, energy-efficient computer vision (CV) at the edge, KleidiCV helps ensure optimized performance for CV applications on Arm CPUs.
With KleidiCV now integrated into OpenCV 4.11, developers benefit from four times faster processing for key image processing tasks such as blurring, filtering, rotation, and resizing. This acceleration helps boost performance for image segmentation, object detection, and recognition use cases.
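Because KleidiCV plugs in underneath OpenCV, existing code benefits without modification. The snippet below simply exercises the kinds of primitives it accelerates; the image and parameter values are illustrative.

```python
# Illustrative OpenCV snippet touching image-processing primitives that
# KleidiCV accelerates transparently in OpenCV 4.11+ on Arm CPUs: the same
# API calls simply run faster, with no code changes.
import cv2
import numpy as np

img = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)  # stand-in frame

blurred = cv2.GaussianBlur(img, (5, 5), sigmaX=1.5)                     # blur
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)                      # rotation
resized = cv2.resize(img, (960, 540), interpolation=cv2.INTER_LINEAR)   # resizing
print(blurred.shape, rotated.shape, resized.shape)
```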
MediaPipe
Arm’s partnership with Google AI Edge on MediaPipe and XNNPACK is accelerating AI workloads on current and future Arm CPUs. This enables developers to deliver outstanding AI performance for mobile, web, edge and IoT, using numerous LLMs, like Gemma and Falcon.
Thanks to Kleidi integration with MediaPipe via XNNPACK, a 30% acceleration in TTFT has been achieved when running a Gemma 1 2B chatbot demo on Arm-based premium smartphones.
Angel
Tencent’s Angel ML framework supports Hunyuan LLM, available in sizes from 1B to over 300B parameters. It enables AI capabilities across a wide range of devices, including smartphones and Windows on Arm PCs.
Our partnership was announced at the 2024 Tencent Global Digital Ecosystem Summit and is having a positive impact on real-world workloads by providing users with even more powerful and efficient on-device AI services across Tencent’s many applications.
Key Developer Technologies for Accelerating CPU Performance
Arm Kleidi includes the latest developer enablement technologies designed to advance AI model capability, accuracy, and speed. This helps ensure AI workloads get the best out of the underlying Arm Cortex-A, Arm Cortex-X, or Arm Neoverse CPU.
The KleidiAI and KleidiCV libraries provide lightweight kernels designed to make it easy for machine learning (ML) and computer vision (CV) frameworks to target optimum performance and leverage the latest features for enhancing AI and CV in Arm CPU-based designs.
The Arm Compute Library (ACL) is a comprehensive and flexible library that enables independent software vendors to source ML functions optimized for Cortex-A and Neoverse CPUs. The library is OS agnostic and portable to Android, Linux, and bare-metal systems.
Latest News and Resources

AI Workloads
Guide to Understanding AI Inference on CPU
Demand for running AI workloads on CPU is growing. Our helpful guide explores the benefits and considerations for CPU inference across a range of sectors.

Generative AI
The Role of Generative AI in Business Transformation
Explore how to unlock the full potential of generative AI and the role of Arm in leading this transformation.

Software AI Acceleration
Why Software is Crucial to Achieving AI’s Full Potential
Discover why software is the key to implementing AI and how to accelerate the creation of high-performance and secure AI applications.

Generative AI
Scale Generative AI With Flexibility and Speed
The race to scale new generative AI capabilities is creating both opportunities for innovation and challenges. Learn how to overcome these challenges and successfully deploy AI on Arm everywhere.
Stay Connected
Subscribe to stay up to date on the latest news, case studies, and insights.