Revolutionizing On-Device AI: 4-Bit Quantization and Small Language Models on ARM
Introduction: Edge AI Moves Onto the Phone
Artificial intelligence is slowly moving off remote servers and onto personal devices. On smartphones and other ARM-based systems, this shift—often called edge AI—means models can run locally, respond faster, work offline, and keep data on the device. The change is driven by two technical trends: more capable mobile processors, and new ways to shrink language models without breaking them.
Low-bit quantization is central to this shift. Recent work shows that mixed-precision matrix multiplication (mpGEMM) makes it practical to run large language models using very small numerical formats, including 4-bit weights, on constrained hardware. At the same time, modern mobile systems-on-chip have seen steady gains in compute and memory. Together, these advances make it possible to run small language models directly on phones.
One concrete example is Alif Semiconductor. Its hardware acceleration for small language models draws about 36 milliwatts during text generation, a level suitable for battery-powered devices. The combination of 4-bit quantization and compact models is changing what “on-device AI” can realistically mean.
Small Language Models: Smaller by Design
Small language models, or SLMs, are designed to trade scale for efficiency. Models such as Llama2-7B and Mistral-7B are far smaller than cloud-only systems, but still expressive enough for many tasks. Their size makes them candidates for local execution on ARM processors.
Alif’s Ensemble E4, E6, and E8 microcontrollers illustrate this direction. These chips are built to run transformer-based generative models locally, using the Arm Ethos-U85 neural processing unit for acceleration. The Ethos-U85 is designed specifically to speed up transformer workloads on embedded devices.
Research and deployment have also centered on models like the Llama 3.1 family, which has become a reference point for evaluating small-model performance. Google’s Gemma 3 270M is another example. It has a footprint of 536 MB, a 256,000-token vocabulary, and handles rare tokens well enough to support domain-specific fine-tuning on mobile hardware.
Liquid Foundation Models take a different approach. They use liquid neural networks, which are optimized for sequential reasoning and can be deployed across a wide range of devices, including ARM-based phones. Tooling such as LEAP for on-device development and Apollo for rapid testing has lowered the barrier to experimenting with these models. In practice, frameworks like llama.cpp and MLC LLM are already running 7B-parameter models on mobile hardware.
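For hands-on experimentation, the Python bindings for llama.cpp offer a short path to loading a 4-bit model and generating text on a development machine or ARM board. The snippet below is a minimal sketch; the model path, thread count, and generation parameters are placeholder assumptions, not a recommended configuration.

```python
# Minimal sketch: load a 4-bit GGUF model with llama-cpp-python and generate text.
# Assumes `pip install llama-cpp-python` and a quantized model file on disk
# (the path below is a placeholder).
from llama_cpp import Llama

llm = Llama(
    model_path="models/mistral-7b-instruct-q4_k_m.gguf",  # any 4-bit GGUF file
    n_ctx=2048,      # context window
    n_threads=4,     # roughly match the number of big cores on the target ARM SoC
)

out = llm(
    "Explain 4-bit quantization in one sentence.",
    max_tokens=64,
    temperature=0.7,
)
print(out["choices"][0]["text"])
```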
What 4-Bit Quantization Actually Does
Quantization reduces the number of bits used to represent model weights and activations. Moving from 16- or 32-bit values down to 4-bit integers cuts memory use dramatically. In one example, a model that previously required four high-end GPUs could run on a single low-end GPU after 4-bit quantization.
The savings are concrete. A 1-billion-parameter model stored in 32-bit floating point takes roughly 4 GB. In 4-bit form, it shrinks to about 0.5 GB, an eightfold reduction. Lower precision also reduces computation and power draw. Some measurements show power use dropping by around 60%, which matters on battery-limited ARM devices.
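The arithmetic behind these figures is straightforward. The helper below is a back-of-the-envelope illustration, not a measurement tool: it counts only weight storage and ignores quantization scales, activations, and the KV cache, so real footprints are somewhat larger.

```python
# Back-of-the-envelope weight storage for different precisions.
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    return num_params * bits_per_weight / 8 / 1e9  # bytes -> GB

for bits in (32, 16, 8, 4):
    print(f"1B params @ {bits:>2}-bit: {weight_footprint_gb(1e9, bits):.2f} GB")
# 32-bit: 4.00 GB, 16-bit: 2.00 GB, 8-bit: 1.00 GB, 4-bit: 0.50 GB
```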
The trade-off is accuracy. Reducing precision introduces approximation error, and output quality can suffer depending on the task and method used. Still, formats such as W4A16 (4-bit weights with 16-bit activations) have proven effective. In single-stream mobile scenarios, W4A16 delivers roughly 3.5× compression and a 2.4× speedup.
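A minimal sketch of the weight-only idea behind W4A16 follows: weights are quantized to 4-bit integers with one scale per small group, then dequantized on the fly so the matmul runs against 16-bit activations. The group size and symmetric scheme here are illustrative assumptions, not a description of any particular library.

```python
import numpy as np

def quantize_w4_groupwise(w: np.ndarray, group_size: int = 64):
    """Symmetric 4-bit weight quantization with one fp16 scale per group."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0        # int4 range: [-8, 7]
    scale = np.maximum(scale, 1e-8)                           # guard against all-zero groups
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)   # stored as packed 4-bit in practice
    return q, scale.astype(np.float16)

def dequantize_w4(q: np.ndarray, scale: np.ndarray, shape):
    return (q.astype(np.float16) * scale).reshape(shape)

# W4A16-style linear layer: 4-bit weights, fp16 activations.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
x = rng.standard_normal((1, 256)).astype(np.float16)

q, s = quantize_w4_groupwise(w)
w_hat = dequantize_w4(q, s, w.shape)
y = x @ w_hat.T                                               # matmul runs in fp16
print("mean abs weight error:", np.abs(w - w_hat.astype(np.float32)).mean())
```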
More aggressive approaches have shown that performance loss is not always proportional to size reduction. TAKANE reports 89% performance retention with a 3× speedup in some low-bit settings, far better than earlier methods that fell below 50% retention. The same work shows student models as small as 1/100th the size of their teachers, with GPU memory use and cost reduced by about 70%.
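The student-teacher side of such results follows the standard distillation recipe: the small model is trained to match the large model's output distribution as well as the ground-truth labels. The snippet below is a generic temperature-scaled distillation loss in PyTorch, a sketch of the general technique rather than a description of TAKANE's training pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (against the teacher) and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: vocabulary of 100, batch of 4.
student = torch.randn(4, 100, requires_grad=True)
teacher = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
distillation_loss(student, teacher, labels).backward()
```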
New Quantization Techniques for Edge Hardware
Beyond basic quantization, newer techniques focus on making low-bit models run efficiently on real hardware. The Ladder data-type compiler converts unsupported low-precision formats into ones that edge devices can execute, without losing numerical fidelity. T-MAC’s mpGEMM library takes a different approach. It relies on lookup tables to avoid expensive dequantization and multiplication steps, which makes it well suited to ARM CPUs running low-bit models.
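The lookup-table idea can be illustrated in a few lines: partial dot products for every possible bit pattern of a small weight group are precomputed once per activation block, so the matrix-vector product reduces to table lookups and additions instead of per-element dequantization and multiplication. The NumPy sketch below shows the concept for 1-bit weights only; T-MAC's actual kernels operate on packed multi-bit data with ARM-specific table-lookup instructions.

```python
import numpy as np
from itertools import product

def lut_matvec_1bit(weights_bits: np.ndarray, x: np.ndarray, g: int = 4) -> np.ndarray:
    """Matrix-vector product with 1-bit {0,1} weights via lookup tables."""
    rows, cols = weights_bits.shape
    assert cols % g == 0
    n_groups = cols // g

    # All 2^g bit patterns, as a (2^g, g) matrix.
    patterns = np.array(list(product((0, 1), repeat=g)), dtype=np.float32)

    # One table per activation group: tables[group, pattern] = pattern . x_group
    x_groups = x.reshape(n_groups, g)
    tables = x_groups @ patterns.T                         # (n_groups, 2^g)

    # Index of each weight group's bit pattern (MSB first).
    pows = 2 ** np.arange(g - 1, -1, -1)
    idx = weights_bits.reshape(rows, n_groups, g) @ pows   # (rows, n_groups)

    # Gather precomputed partial sums and accumulate.
    return tables[np.arange(n_groups), idx].sum(axis=1)

# Check against the direct product.
rng = np.random.default_rng(0)
W = rng.integers(0, 2, size=(8, 16))
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(lut_matvec_1bit(W, x), W @ x, atol=1e-5)
```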
Vector quantization shows similar trade-offs. Compressing one million 1,536-dimensional vectors by 8× using 4-bit representations requires about 732 MB of memory and still achieves 90–97% recall. Higher compression rates, such as 32×, save more memory but noticeably degrade quality. In practice, moderate compression performs best.
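The memory arithmetic for that example is easy to reproduce: 1,000,000 vectors × 1,536 dimensions × 4 bits ≈ 768 MB (about 732 MiB), an 8× reduction from float32. The sketch below shows a simple per-vector 4-bit scalar quantizer of the kind such systems build on; production vector databases use more elaborate codebooks and asymmetric distance computation.

```python
import numpy as np

def quantize_4bit(vectors: np.ndarray):
    """Per-vector uniform 4-bit quantization (min/max), packed two codes per byte."""
    lo = vectors.min(axis=1, keepdims=True)
    hi = vectors.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / 15.0, 1e-8)
    codes = np.clip(np.round((vectors - lo) / scale), 0, 15).astype(np.uint8)
    packed = (codes[:, 0::2] << 4) | codes[:, 1::2]
    return packed, lo.astype(np.float32), scale.astype(np.float32)

def dequantize_4bit(packed, lo, scale):
    codes = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    codes[:, 0::2] = packed >> 4
    codes[:, 1::2] = packed & 0x0F
    return codes * scale + lo

rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 1536)).astype(np.float32)

packed, lo, scale = quantize_4bit(emb)
approx = dequantize_4bit(packed, lo, scale)

print("float32 bytes:", emb.nbytes)       # 6,144,000
print("packed  bytes:", packed.nbytes)    #   768,000 -> 8x smaller (plus small scale overhead)
print("mean abs error:", np.abs(emb - approx).mean())
```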
GranQ uses zero-shot quantization with layer- and channel-level awareness. It minimizes error without retraining and outperforms several quantization-aware training methods on edge models. On vision benchmarks such as CIFAR and ImageNet, it reaches state-of-the-art results for mobile-oriented models.
any4 takes yet another route. It learns 4-bit numeric representations directly, without preprocessing, and outperforms int4, fp4, and nf4 formats across Llama and Mistral models using only a single calibration sample. By applying Lloyd–Max optimization to weight rows and generating efficient lookup tables, it improves perplexity. TinyGEMM implements this approach efficiently on GPUs and is suitable for ARM-based mobile deployments.
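The core of a learned 4-bit format can be sketched with a one-dimensional Lloyd–Max quantizer (equivalent to 1-D k-means) per weight row: 16 centroids are fitted to each row's values, the row is stored as 4-bit indices, and the centroids serve as a lookup table at inference time. This is a conceptual sketch of the general technique, not the any4 or TinyGEMM implementation.

```python
import numpy as np

def lloyd_max_row(values: np.ndarray, n_levels: int = 16, iters: int = 25):
    """Fit a 1-D Lloyd–Max quantizer to one weight row."""
    # Initialize centroids at evenly spaced quantiles of the row.
    centroids = np.quantile(values, np.linspace(0, 1, n_levels))
    for _ in range(iters):
        # Assign each value to its nearest centroid, then recenter.
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(n_levels):
            if np.any(idx == k):
                centroids[k] = values[idx == k].mean()
    return idx.astype(np.uint8), centroids.astype(np.float32)

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)

codes, lut = lloyd_max_row(row)        # 4-bit codes + 16-entry lookup table
row_hat = lut[codes]                   # dequantization is a table lookup
print("mean abs error:", np.abs(row - row_hat).mean())
```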
Performance gains can be substantial. Deepseek 7B reaches about 130 tokens per second using AWQ 4-bit quantization on an RTX 4090, compared with 52 tokens per second without it.
ARM Hardware and Software Acceleration
ARM platforms are increasingly optimized for this kind of workload. The Ethos-U85 NPU in Alif’s chips accelerates transformer inference directly on embedded hardware. On server-class ARM systems, KleidiAI running on Neoverse V2-based Graviton4 processors delivers up to a 12× increase in token throughput for PyTorch chatbots.
ARM has also worked with Meta to speed up quantized Llama models. Llama 3.2 runs about 20% faster on ARM CPUs after optimization. In one deployment, Arcee AI’s Virtuoso-Lite model, running with 4-bit weights through llama.cpp, achieves roughly 40 tokens per second while offering a reported 4.5× cost advantage for enterprise use.
On mobile-class cores, KleidiAI improves small-model inference by up to 10× on Cortex-A76, enabling responses from TinyLlama 1.1B in about three seconds. In partnership with Meta, ARM reports that 4-bit KleidiAI reaches around 50 tokens per second on Axion processors for Llama 3.1 405B-class workloads.
Quantization also makes models like Gemma 3 270M practical on phones. After quantization, its size drops by roughly 2–3× from the original 536 MB. On a Pixel 9 Pro, the INT4 version used about 0.75% of the battery to handle 25 conversations.
Benchmarks and Deployments
Benchmarks increasingly reflect these gains. Running 7B-parameter models on mobile devices with llama.cpp and MLC LLM has become routine in testing environments. Tools such as Snapdragon Profiler and Arm Streamline are used to confirm hardware-level improvements.
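Throughput figures of this kind are typically collected with a simple wall-clock measurement around a generation call. The harness below reuses the llama-cpp-python bindings shown earlier and is purely illustrative; it is not the methodology behind any of the cited numbers, and the model path is a placeholder.

```python
import time
from llama_cpp import Llama

# Placeholder path to a 4-bit GGUF model; any quantized model file works here.
llm = Llama(model_path="models/model-q4_k_m.gguf", n_ctx=2048, n_threads=4)

prompt = "Write a short explanation of edge AI."

start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f}s -> {generated / elapsed:.1f} tokens/s")
```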
Several results stand out:
Deepseek 7B shows a jump to 130 tokens per second with 4-bit AWQ.
Alif’s text generation at 36 mW enables sustained use on battery-powered devices.
W4A16 delivers a 2.4× speedup in latency-sensitive ARM workloads.
KleidiAI reports up to 10× gains on Cortex-A76 and around 50 tokens per second in larger ARM deployments.
TAKANE’s results combine 89% accuracy retention with 3× speed improvements.
Liquid Foundation Models also report strong performance on ARM NPUs across real-world multimodal tasks.
Limits and Open Problems
The gains come with real constraints. On average, moving to 8-bit precision reduces accuracy by about 0.8%. Dropping to 4-bit can cause much larger losses—up to 59% on long-context tasks—depending on the model and method. For example, Qwen-2.5 72B remains relatively stable under BNB-nf4 quantization, while Llama-3.1 70B loses about 32% performance on the same tasks.
Vector compression shows similar limits. Very high compression rates save memory but hurt recall, which is why 4× compression often outperforms more aggressive schemes. any4 reduces some of these losses through calibration, while methods like GranQ still require careful handling of activation quantization.
Current research points toward mixed-precision approaches and wider use of liquid neural networks to balance efficiency and accuracy. TAKANE’s results, with student models at 1/100th the size and large cost reductions, suggest how edge deployments might scale further.
Conclusion
Running language models directly on ARM-based devices is no longer speculative. The combination of 4-bit quantization and small language models has made it practical, measurable, and already deployable. Techniques such as T-MAC’s lookup-table execution and KleidiAI’s ARM-specific optimizations show that low-bit models can run quickly without relying on the cloud.
Real deployments reinforce the point. Gemma 3’s low battery use on a Pixel phone and Alif’s 36 mW text generation are not lab curiosities; they are shipping results. As quantization methods like any4 and GranQ continue to reduce accuracy loss, the boundary between cloud-hosted AI and on-device inference will keep narrowing.