
Run large language models like Qwen and LLaMA locally on Android for offline, private, real-time question answering and chat - powered by ONNX Runtime.

🤖 Local LLMs on Android (Offline, Private & Fast)

An Android application that brings a large language model (LLM) to your phone — fully offline, no internet needed. Powered by ONNX Runtime and a Hugging Face-compatible tokenizer, it provides fast, private, on-device question answering with streaming responses.


✨ Features

  • 📱 Fully on-device LLM inference with ONNX Runtime
  • 🔤 Hugging Face-compatible BPE tokenizer (tokenizer.json)
  • 🧠 Qwen2.5 & Qwen3 prompt formatting with streaming generation
  • 🧩 Custom ModelConfig for precision, prompt style, and KV cache
  • 🧘‍♂️ Thinking Mode toggle (enabled in Qwen3) for step-by-step reasoning
  • 🚀 Coroutine-based UI for a smooth user experience
  • 🔐 Runs 100% offline — no network, no telemetry
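The Thinking Mode toggle above relies on Qwen3 emitting its step-by-step reasoning inside `<think>…</think>` tags before the final answer. A minimal sketch of separating the two parts of a response (the tag format follows Qwen3's output convention; the function name is illustrative, not the app's actual API):

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a Qwen3 response into (reasoning, answer).

    Qwen3 wraps its chain of thought in <think>...</think>;
    everything after the closing tag is the user-facing answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # Thinking mode off (or Qwen2.5): no tags emitted
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_thinking(
    "<think>2+2 is basic arithmetic.</think>The answer is 4."
)
```

In the app, the UI can show or hide the `reasoning` part depending on the toggle, while always rendering the `answer`.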

📸 Inference Preview


Figure: App interface showing prompt input and generated answers using the local LLM.


📂 App Variants

This repo includes two modes of interaction:

**QA style** (Qwen_QA_style_app):

  • Single-turn QA with a minimal prompt.
  • Fastest response time.
  • Best for quick facts or instructions.

**Chat style** (Qwen_chat_style_app):

  • Multi-turn chat with short-term memory.
  • Qwen-style prompt formatting with context compression.
  • Best for reasoning, assistant-style dialogue, and follow-up questions.
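Both variants build their prompts with Qwen's ChatML-style template (`<|im_start|>role … <|im_end|>`). A minimal Python sketch of the multi-turn formatting, with a naive recent-turns cap standing in for the chat app's context compression (the function name, history shape, and `max_turns` cap are illustrative, not the repo's actual API):

```python
def build_qwen_prompt(system: str, history: list[tuple[str, str]],
                      user_msg: str, max_turns: int = 4) -> str:
    """Format a chat prompt using Qwen's ChatML-style template.

    history is a list of (user, assistant) turns; only the most recent
    max_turns are kept, a crude stand-in for context compression.
    """
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for user, assistant in history[-max_turns:]:
        parts.append(f"<|im_start|>user\n{user}<|im_end|>\n")
        parts.append(f"<|im_start|>assistant\n{assistant}<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_msg}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")  # model continues from here
    return "".join(parts)

prompt = build_qwen_prompt("You are a helpful assistant.", [], "What is ONNX?")
```

The QA-style app is the degenerate case: an empty history and a single user turn, which is why it responds fastest.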

🧠 Model Info

This app supports both Qwen2.5-0.5B-Instruct and Qwen3-0.6B — optimized for instruction-following, QA, and reasoning tasks.

🔁 Option 1: Use Preconverted ONNX Model

Download the model.onnx and tokenizer.json from Hugging Face:

⚙️ Option 2: Convert Model Yourself

```shell
pip install "optimum[onnxruntime]"
# or install the latest development version
python -m pip install git+https://github.com/huggingface/optimum.git
```

Export the model:

```shell
optimum-cli export onnx --model Qwen/Qwen2.5-0.5B-Instruct qwen2.5-0.5B-onnx/
```

  • You can also convert any fine-tuned variant by specifying its model path or Hub ID.
  • Learn more about Optimum here.

⚙️ Requirements

  • Android Studio
  • ONNX Runtime for Android (already included in this repo).
  • A physical Android device for deployment and testing: ≥ 4 GB RAM for FP16/Q4 models, ≥ 6 GB RAM for FP32 models.
  • Real hardware is preferred; emulators are acceptable for UI checks only.


Choose which Qwen model to run

In MainActivity.kt you will find two pre-defined ModelConfig objects:

```kotlin
val modelconfigqwen25 = ModelConfig(/* … */)  // Qwen2.5-0.5B
val modelconfigqwen3  = ModelConfig(/* … */)  // Qwen3-0.6B
```

Right below them is a single line that tells the app which one to use:

```kotlin
val config = modelconfigqwen25  // ← change to modelconfigqwen3 for Qwen 3
```

How to Build & Run

  1. Open Android Studio and create a new project (Empty Activity).
  2. Name your app local_llm.
  3. Copy all the project files from Qwen_QA_style_app or Qwen_chat_style_app into the appropriate folders.
  4. Place your model.onnx and tokenizer.json in:
    app/src/main/assets/
    
  5. Connect your Android phone using wireless debugging or USB.
  6. To install:
    • Press Run ▶️ in Android Studio, or
    • Go to Build → Generate Signed Bundle / APK to export the .apk file.
  7. Once installed, look for the Pocket LLM icon on your home screen.

Note: All Kotlin files are declared in the package com.example.local_llm, and the Gradle script sets applicationId "com.example.local_llm". If you name the app (or change the package) to anything other than local_llm, you must refactor:

  • The directory structure in app/src/main/java/...,
  • Every package com.example.local_llm line, and
  • The applicationId in app/build.gradle.

Otherwise, Android Studio will raise “package … does not exist” errors and the project will fail to compile.

📦 Download Prebuilt APKs

🛠️ Customize Your App Experience

  • Define the assistant’s tone and role by setting defaultSystemPrompt (in your model config).
  • Adjust TEMPERATURE to control response randomness — lower for accuracy, higher for creativity (OnnxModel.kt).
  • Use REPETITION_PENALTY to avoid repetitive answers and improve fluency (OnnxModel.kt).
  • Change MAX_TOKENS to limit or expand the length of generated replies (OnnxModel.kt).
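The first two knobs interact at sampling time: the repetition penalty reshapes the logits of tokens that have already been generated, then temperature rescales everything before the softmax. A pure-Python sketch of that order of operations (OnnxModel.kt works on ONNX Runtime tensors in Kotlin; the names and list-based math here are illustrative):

```python
import math

def sample_probs(logits, generated_ids, temperature=0.7, repetition_penalty=1.1):
    """Turn raw logits into sampling probabilities.

    Already-generated tokens are penalized (positive logits divided,
    negative ones multiplied, in the CTRL-style scheme), then temperature
    scaling sharpens (<1) or flattens (>1) the distribution.
    """
    adjusted = list(logits)
    for tok in set(generated_ids):
        if adjusted[tok] > 0:
            adjusted[tok] /= repetition_penalty
        else:
            adjusted[tok] *= repetition_penalty
    scaled = [x / temperature for x in adjusted]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lowering `temperature` concentrates probability on the top token (more deterministic answers); raising `repetition_penalty` above 1.0 makes the model less likely to emit a token it has already produced.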

📄 License Notice

Note: These ONNX models are based on Qwen, which is licensed under the Apache License 2.0.
