LiteRT vs. Llama Stack: Privacy-Centric On-Device AI for Android Developers

Why On-Device AI Has Become a Privacy Issue

Picture a phone that can transcribe a private voice memo without sending audio to a remote server. That is the practical promise of on-device AI. Instead of uploading data for processing, the model runs directly on the device. For Android developers, this shift matters because trust in cloud processing has weakened. Large breaches are common, and regulations such as GDPR now impose strict limits on how personal data can be handled.

On-device AI reduces exposure by design. Data stays on the phone. There is no network transfer and no server to secure. Google has pushed this approach through its Android AI stack, particularly LiteRT, which allows developers to deploy custom machine-learning models that run locally with hardware acceleration. Meta’s Llama Stack takes a similar stance from a different angle, offering lightweight open models designed to run entirely on devices. Both approaches aim to support AI features that work offline, reduce latency, and avoid cloud dependence.

One visible example is Gemini Nano in the Google Pixel voice recorder app. It generates summaries locally, even without an internet connection. That design choice is not cosmetic. It is a direct response to privacy risk and regulatory pressure, as described in Android developer documentation and Google AI Edge resources.

What "On-Device AI" Means in Practice

On-device AI refers to running inference locally on a phone’s CPU, GPU, or dedicated neural processing unit (NPU). Inputs never leave the device. There is no round-trip to a data center, no logging on external servers, and no exposure during transmission.

This approach has trade-offs. Phones have limited memory, battery, and thermal headroom. Models must be smaller and more efficient. That constraint has driven work on runtimes such as LiteRT and compact model families such as Llama’s 1B and 3B variants. The result is not general-purpose cloud AI on a phone, but focused systems tuned for specific tasks such as summarization, transcription, image understanding, or text rewriting.

Google LiteRT: A Runtime Built Around Local Execution

LiteRT is Google’s runtime for deploying machine-learning models directly on devices. Its core goal is to make local inference practical and fast, while hiding hardware differences across phones. Developers can ship custom models that run on CPUs, GPUs, or NPUs through a single API, rather than writing vendor-specific code.
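As a minimal sketch (not an official sample), the snippet below loads a bundled model through the Interpreter API that LiteRT carries forward from TensorFlow Lite and runs it entirely on-device. The model file name and tensor shapes are placeholders for illustration.

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Sketch: memory-map a bundled .tflite model and run inference locally.
// "classifier.tflite" and the [1, N] -> [1, 3] tensor shapes are placeholders.
class OnDeviceModel(context: Context) {

    private val interpreter = Interpreter(loadModel(context, "classifier.tflite"))

    // One forward pass, entirely on-device; no network calls are involved.
    fun predict(features: FloatArray): FloatArray {
        val input = arrayOf(features)        // shape [1, features.size]
        val output = arrayOf(FloatArray(3))  // shape [1, 3]
        interpreter.run(input, output)
        return output[0]
    }

    private fun loadModel(context: Context, name: String): MappedByteBuffer =
        context.assets.openFd(name).use { fd ->
            FileInputStream(fd.fileDescriptor).channel.use { channel ->
                channel.map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
            }
        }
}
```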

Google Play for On-Device AI handles distribution. Models are delivered through App Bundles, keeping APK size down and allowing Google Play to select the best variant for a given device. According to Google, LiteRT abstracts away vendor SDKs and exposes unified access to hardware accelerators, including NPUs.

Performance data from Google’s internal benchmarks shows why this matters. AI-specific NPUs can run models up to 25 times faster than CPUs while using roughly one-fifth the power. LiteRT’s TensorBuffer API further reduces overhead by allowing direct access to hardware memory, avoiding repeated CPU copies.

The runtime itself is small, measured in a few megabytes. It supports models converted from JAX, Keras, PyTorch, and TensorFlow, and runs across Android, iOS, the web, and even microcontrollers. Google also provides higher-level APIs for common tasks in vision, audio, text, and generative AI. These APIs are designed to run fully offline, which is central to LiteRT’s privacy model.
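One of those higher-level paths is the LLM Inference task from MediaPipe / Google AI Edge, which runs a downloaded model entirely on the device. The sketch below is a hedged illustration rather than a canonical integration: the model path and token limit are assumptions, and option names may vary between releases.

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Sketch: run a local LLM through the MediaPipe LLM Inference task.
// The model path and maxTokens value are placeholders; the model file
// must already be on the device (downloaded or bundled).
fun summarizeOffline(context: Context, note: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/model.task")  // placeholder path
        .setMaxTokens(512)
        .build()

    val llm = LlmInference.createFromOptions(context, options)
    // The prompt and the response never leave the device.
    val summary = llm.generateResponse("Summarize this note:\n$note")
    llm.close()
    return summary
}
```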

The cost implications are straightforward. If inference runs locally, there is no cloud bill. Latency is also lower, because the model does not wait on network requests. For apps that handle sensitive data, this architecture removes entire classes of risk.

Llama Stack: Open Models Designed for the Edge

Meta’s Llama Stack approaches the same problem from the model side. Instead of focusing on a single runtime, it provides a family of open models optimized for on-device and edge deployment. The smaller 1B and 3B models target tasks such as summarization, instruction following, rewriting, and tool calling. They support multilingual use and offer context windows up to 128,000 tokens, which is unusually large for models intended to run locally.

Llama Stack is designed to run across environments, including fully on-device setups. On Android, this is typically done through PyTorch ExecuTorch, which enables local execution without cloud involvement. Both pre-trained and aligned variants are available, and developers can fine-tune models using torchtune and deploy them with torchchat.
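For a sense of what that looks like in app code, here is a rough sketch modeled on the LlamaModule wrapper used in ExecuTorch's Android Llama demo. The package name, constructor arguments, and callback methods are assumptions and may differ between ExecuTorch releases.

```kotlin
import org.pytorch.executorch.LlamaCallback
import org.pytorch.executorch.LlamaModule

// Hypothetical sketch based on the LlamaModule wrapper from the ExecuTorch
// Llama demo; exact signatures vary by release, and some releases may
// require an explicit load() call before generate().
class LocalSummarizer(modelPath: String, tokenizerPath: String) : LlamaCallback {

    private val builder = StringBuilder()

    // Model and tokenizer files live in app-private storage; nothing is uploaded.
    private val module = LlamaModule(modelPath, tokenizerPath, /* temperature = */ 0.7f)

    fun summarize(text: String): String {
        builder.clear()
        // generate() streams tokens back through onResult() until completion.
        module.generate("Summarize the following note:\n$text", this)
        return builder.toString()
    }

    override fun onResult(token: String) {
        builder.append(token)
    }

    override fun onStats(tokensPerSecond: Float) {
        // Optional: log local decoding speed for benchmarking.
    }
}
```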

Larger models, such as the 11B and 90B variants, add vision capabilities. These support document understanding, image captioning, and visual grounding. While not all devices can run these models locally, they extend the same privacy-first principle to more demanding tasks when hardware allows.

Performance optimizations come from tools like KleidiAI. On Arm Cortex-A76 CPUs, small language models can run up to 10 times faster, with response times around three seconds and prefill speeds exceeding 350 tokens per second. Stable Audio Open, a separate model optimized for the same Arm hardware, achieves up to 30 times faster text-to-audio generation on smartphones. All of this runs locally, without sending prompts or outputs to external servers.

Llama’s community license permits modification and redistribution, which appeals to developers who want control over their models and deployment pipelines.

Cloud Risk and the Case for Local Processing

The appeal of on-device AI is not abstract. Cloud systems concentrate data, which makes them attractive targets. Public breach reports show that large centralized databases are routinely compromised, often exposing millions of records at once. Each cloud request also creates a new copy of user data, increasing the attack surface and compliance burden.

Local processing avoids these risks by design. There is no transmission, no central store, and no shared infrastructure. LiteRT enforces this through offline execution paths and Play-managed delivery. Llama Stack achieves it by making local execution the default for its smaller models.

This does not eliminate all risk, but it removes the largest one: sending sensitive user data to someone else’s servers.

Comparing LiteRT and Llama Stack

Both LiteRT and Llama Stack are built around data staying on the device, but they differ in emphasis.

LiteRT focuses on runtime consistency and hardware abstraction. It integrates tightly with Google Play and Android’s system components. Developers get predictable performance across devices and strong NPU acceleration, with reported gains of up to 25x in speed and 5x in power efficiency over CPU-only execution.

Llama Stack emphasizes model flexibility and openness. Its smaller models are designed to run efficiently on CPUs, and performance gains from KleidiAI reach 10x for inference and over 350 tokens per second in some scenarios. The stack supports more aggressive customization through fine-tuning and model modification.

Both approaches support full offline use. LiteRT is more opinionated and Google-centric. Llama Stack trades some of that integration for openness and model variety.

Android Integration and Real-World Examples

In practice, both stacks already appear in real Android apps. LiteRT is used to deploy models across Android, iOS, and edge devices such as Raspberry Pi, with Google Play managing delivery to minimize app size. Gemini Nano’s offline summarization in the Pixel voice recorder is a clear example of sensitive data being processed locally.

Llama Stack integrates through ExecuTorch on Android. Developers use it for tasks such as local text rewriting, summarization, and image reasoning without cloud calls: privacy-focused health apps can run symptom analysis on the 1B models, and legal and enterprise apps can apply Llama’s vision models to document scanning. On the LiteRT side, camera apps can perform object detection on-device, and financial tools rely on its low-latency inference for secure processing.
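As one illustration of the camera case, the sketch below uses the Task Library vision API that predates the LiteRT rebrand; the model file name and thresholds are placeholders.

```kotlin
import android.content.Context
import android.graphics.Bitmap
import org.tensorflow.lite.support.image.TensorImage
import org.tensorflow.lite.task.vision.detector.Detection
import org.tensorflow.lite.task.vision.detector.ObjectDetector

// Sketch: on-device object detection with the Task Library vision API.
// "detector.tflite" and the thresholds are placeholders.
class CameraFrameAnalyzer(context: Context) {

    private val detector: ObjectDetector = ObjectDetector.createFromFileAndOptions(
        context,
        "detector.tflite",
        ObjectDetector.ObjectDetectorOptions.builder()
            .setMaxResults(5)
            .setScoreThreshold(0.5f)
            .build()
    )

    // Each frame is analyzed locally; no image data leaves the device.
    fun analyze(frame: Bitmap): List<Detection> =
        detector.detect(TensorImage.fromBitmap(frame))
}
```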

These use cases rely on the same principle: data sovereignty enforced by architecture, not policy.

Practical Challenges and Implementation Choices

On-device AI is not effortless. Hardware varies widely across Android devices. LiteRT reduces this complexity through backend abstraction, but developers still need to test across CPUs, GPUs, and NPUs. Llama Stack’s quantization and optimization tools can significantly boost speed, but they require careful tuning.
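A common pattern is to probe for accelerator support at startup and fall back to the CPU. The sketch below does this with the GPU compatibility check LiteRT inherits from TensorFlow Lite; the thread count is illustrative, and NPU-specific delegates are omitted because they vary by vendor.

```kotlin
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.CompatibilityList
import org.tensorflow.lite.gpu.GpuDelegate
import java.nio.MappedByteBuffer

// Sketch: pick the best available backend at runtime, then fall back to CPU.
// The thread count and delegate choices are illustrative, not tuned values.
fun buildInterpreter(model: MappedByteBuffer): Interpreter {
    val compatibility = CompatibilityList()
    val options = Interpreter.Options().apply {
        if (compatibility.isDelegateSupportedOnThisDevice) {
            // Use the GPU configuration the library recommends for this device.
            addDelegate(GpuDelegate(compatibility.bestOptionsForThisDevice))
        } else {
            // CPU fallback: multi-threaded execution.
            setNumThreads(4)
        }
    }
    return Interpreter(model, options)
}
```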

Battery consumption is another concern. Both stacks address it through efficiency. LiteRT’s small runtime and NPU acceleration reduce power draw. Llama’s lightweight models are designed to fit within mobile constraints.

Common best practices include starting with prebuilt APIs, then moving to custom models when needed. Developers typically fine-tune Llama models with torchtune or adapt LiteRT pipelines using Keras or PyTorch. Distribution through Google Play simplifies updates, and performance should be monitored with real benchmarks rather than assumptions.

Conclusion: Choosing Between LiteRT and Llama Stack

LiteRT and Llama Stack represent two mature paths toward privacy-first AI on Android. LiteRT offers a tightly integrated Google ecosystem with strong hardware acceleration and predictable deployment. Llama Stack offers open models, flexible fine-tuning, and strong performance on modest hardware.

Both avoid cloud dependency by design. The choice comes down to priorities. If you want uniform deployment and deep Android integration, LiteRT is the more direct fit. If you want control over models and greater flexibility in how they evolve, Llama Stack is the stronger option.

What they share is more important than how they differ. Both assume that sensitive data should stay on the device. That assumption, more than any feature, defines the future of private AI on mobile.