A privacy-first React application that demonstrates intelligent routing between local browser-based AI inference (WebLLM) and cloud-based models (OpenAI) with automatic PII redaction and complexity analysis.
WebLLM is a framework that enables local execution of Large Language Models (LLMs) directly in the browser using WebGPU acceleration. Instead of relying on external servers, users can download compact models and perform inference directly on their machines.
- WebGPU Acceleration: Uses WebGPU kernels for high-performance computation, maintaining approximately 85% of native performance
- Cross-Platform GPU Support: WebGPU acts as an abstraction layer, running efficiently across different GPUs (NVIDIA, AMD, Apple Metal)
- Browser Caching: Model weights and WebAssembly libraries are downloaded once and cached in IndexedDB for future use
- Privacy-First: All inference happens locally - your data never leaves your browser
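For a sense of what the local path looks like, here is a minimal sketch using the `@mlc-ai/web-llm` package's OpenAI-compatible API (this project wraps the equivalent logic in its `useWebLlm` hook):

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Downloads the model on first use (or loads it from the IndexedDB cache),
// compiles the WebGPU kernels, then runs a chat completion entirely in-browser.
const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC");

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "What is the capital of France?" }],
});
console.log(reply.choices[0].message.content);
```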
- GPU: Modern GPU with WebGPU support (integrated or dedicated)
- RAM: Minimum 8GB system RAM (4GB+ available for the browser)
- Storage: ~700MB-1GB free space for model caching
- Chrome/Edge: Version 113+ with WebGPU enabled
- Firefox: Experimental WebGPU support (requires enabling in `about:config`)
- Safari: WebGPU support in Safari 17+
Note: WebLLM is currently not supported on most mobile devices including iPhones and many Android devices due to memory constraints and limited WebGPU support.
If WebGPU is not available by default:
- Chrome/Edge: Go to `chrome://flags` and enable "Unsafe WebGPU"
- Firefox: Go to `about:config` and set `dom.webgpu.enabled` to `true`
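Before flipping flags, you can also confirm what the browser reports programmatically; a small check like the following works in any modern browser:

```typescript
// Quick feature check: true only if the browser exposes WebGPU
// and can actually hand back a GPU adapter.
async function hasWebGpu(): Promise<boolean> {
  if (!("gpu" in navigator)) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null;
}
```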
- Node.js 18+ and npm/yarn
- A modern browser with WebGPU support
1. Clone the repository

   ```bash
   git clone https://github.com/your-username/webllm-privacy-experiment.git
   cd webllm-privacy-experiment
   ```

2. Install dependencies

   ```bash
   npm install
   ```

3. Set up environment variables (optional, for OpenAI integration)

   ```bash
   cp .env.example .env
   # Add your OpenAI API key
   echo "VITE_OPENAI_API_KEY=your-api-key-here" >> .env
   ```

4. Start the development server

   ```bash
   npm run dev
   # or
   npm start
   ```

5. Open your browser and navigate to `http://localhost:5173`
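Vite exposes variables prefixed with `VITE_` on `import.meta.env`, so the key written to `.env` in step 3 can be read in application code roughly like this (the file path is illustrative, not necessarily where this repo keeps it):

```typescript
// src/config/openaiKey.ts (illustrative location)
// Vite inlines VITE_-prefixed variables at build time.
export const OPENAI_API_KEY: string | undefined =
  import.meta.env.VITE_OPENAI_API_KEY;

if (!OPENAI_API_KEY) {
  // Without a key, cloud routing presumably falls back to local-only inference.
  console.warn("VITE_OPENAI_API_KEY is not set; OpenAI routing is unavailable.");
}
```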
- `npm run dev` / `npm start` - Start the development server
- `npm run build` - Build for production
- `npm run lint` - Run ESLint
- `npm run preview` - Preview the production build
When you first launch the application, WebLLM will download and cache the AI model:
- Download Size: ~700MB (Llama-3.2-1B model)
- Download Time: Varies by connection speed:
- 1 Gbps fiber: ~30 seconds
- 50 Mbps cable: ~3-4 minutes
- Rural 4G (12 Mbps): ~20-45 minutes
The model is cached in your browser's IndexedDB, so subsequent visits will load instantly.
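WebLLM exposes an `initProgressCallback` option that reports download and compilation progress, which is handy for showing the user how far along that first ~700MB fetch is; a sketch:

```typescript
import { CreateMLCEngine, InitProgressReport } from "@mlc-ai/web-llm";

const engine = await CreateMLCEngine("Llama-3.2-1B-Instruct-q4f32_1-MLC", {
  // Invoked repeatedly while weights are fetched (or read back from IndexedDB)
  // and the WebGPU kernels are compiled.
  initProgressCallback: (report: InitProgressReport) => {
    console.log(`${Math.round(report.progress * 100)}%`, report.text);
  },
});
```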
The application automatically analyzes query complexity and routes requests to the most appropriate model:
- Simple queries → WebLLM (local, private, fast)
- Complex queries → OpenAI (cloud-based, more capable)
Override automatic selection using mentions:
@webllm What is the capital of France?
@openai Explain quantum computing in detail
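A simplified sketch of that routing decision (the real logic lives in the `useChatModel` and `useComplexityAnalysis` hooks described below; the function here is illustrative):

```typescript
type Route = "webllm" | "openai";

// Illustrative routing: an explicit @mention wins, otherwise the
// 1-5 complexity score decides (1-2 stays local, 3-5 goes to the cloud).
function pickRoute(query: string, complexityScore: number): Route {
  const trimmed = query.trimStart();
  if (trimmed.startsWith("@webllm")) return "webllm";
  if (trimmed.startsWith("@openai")) return "openai";
  return complexityScore <= 2 ? "webllm" : "openai";
}
```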
For queries sent to OpenAI, the application automatically:
- Detects personally identifiable information (PII)
- Replaces names with placeholders (e.g., `PERSON_1`, `ORG_1`)
- Sends the redacted query to OpenAI
- Restores original names in the response
Example:
Input: "Tell John Smith at Microsoft about our new product"
Sent to OpenAI: "Tell PERSON_1 at ORG_1 about our new product"
Response: "I'll help you tell John Smith at Microsoft about your new product"
The system uses a 5-point complexity scale:
- 1-2: Simple queries (handled by WebLLM)
- 3-5: Complex queries (sent to OpenAI)
Complexity Factors:
- Word count and query length
- Domain-specific terminology
- Multi-step reasoning requirements
- Context complexity
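These factors could be combined into a simple heuristic along the following lines (illustrative only, not the project's exact scoring):

```typescript
// Illustrative heuristic mapping a query onto the 1-5 complexity scale.
function scoreComplexity(query: string): number {
  let score = 1;
  const wordCount = query.trim().split(/\s+/).length;
  if (wordCount > 20) score += 1;                                             // length
  if (/\b(quantum|cryptograph|compiler|theorem)\w*/i.test(query)) score += 1; // domain terms
  if (/\b(compare|analy[sz]e|step[- ]by[- ]step|derive)\b/i.test(query)) score += 1; // multi-step reasoning
  if (/\b(given that|assuming|in the context of)\b/i.test(query)) score += 1; // extra context
  return Math.min(score, 5);
}
```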
User: What is 2+2?
→ WebLLM: 2+2 equals 4.
User: Explain the implications of quantum computing on cryptography
→ OpenAI: [Detailed explanation about quantum computing and its impact on cryptographic systems...]
User: Draft an email to Sarah Johnson at Apple about our collaboration
→ Redacted: "Draft an email to PERSON_1 at ORG_1 about our collaboration"
→ Response: "Here's a draft email to Sarah Johnson at Apple about your collaboration..."
User: @webllm Write a simple poem about cats
→ WebLLM: [Generates poem locally]
User: @openai Write a complex analysis of modern literature
→ OpenAI: [Generates detailed analysis via cloud API]
- `useChatModel`: Main orchestration hook
- `useWebLlm`: WebLLM integration and local inference
- `useOpenAi`: OpenAI API integration
- `useComplexityAnalysis`: Query complexity evaluation
- `usePrivacyRedaction`: PII detection and redaction
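A hypothetical example of how a component might compose the orchestration hook (names and return values below are assumptions for illustration; check the actual hook signatures in `src/hooks/`):

```tsx
import { useChatModel } from "./hooks/useChatModel"; // path assumed

// Hypothetical usage; the real return shape of useChatModel may differ.
function ChatPanel() {
  const { messages, sendMessage, activeModel, isLoading } = useChatModel();

  return (
    <div>
      <ul>
        {messages.map((m, i) => (
          <li key={i}>{m.content}</li>
        ))}
      </ul>
      <p>Routing to: {activeModel}</p>
      {isLoading && <p>Thinking...</p>}
      <button onClick={() => sendMessage("What is 2+2?")}>Ask</button>
    </div>
  );
}
```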
```typescript
MODEL: {
  WEB_LLM: {
    DEFAULT_MODEL: 'Llama-3.2-1B-Instruct-q4f32_1-MLC',
    VRAM_REQUIRED_MB: 1128.82,
    CONTEXT_WINDOW_SIZE: 4096
  }
}
```
```
src/
├── components/   # React components
├── hooks/        # Custom React hooks
├── models/       # Type definitions and interfaces
├── utils/        # Utility functions
├── config/       # Configuration files
└── assets/       # Static assets
```
WebLLM (local):
- First Token: < 1 second
- Privacy: Complete privacy (no network requests)
- Cost: Free after the initial download
- Reliability: Good for simple queries

OpenAI (cloud):
- First Token: ~8 seconds (including network latency)
- Privacy: PII redacted before sending
- Cost: Pay-per-use API calls
- Reliability: Excellent for complex queries
- Local Processing: Simple queries never leave your browser
- PII Redaction: Personal information is automatically stripped from cloud requests
- Selective Routing: Only complex queries that require advanced capabilities are sent to the cloud
- Compliance Ready: Can help support GDPR, HIPAA, and other privacy requirements
- Model Size: Currently limited to smaller models (1B-8B parameters)
- Hardware Dependency: Requires WebGPU-compatible hardware
- Mobile Support: Limited support on mobile devices
- Consistency: May be unpredictable for edge cases
- Initial Download: Large one-time download required
- Browser Compatibility: Limited to WebGPU-enabled browsers
- Resource Usage: Consumes significant GPU memory during inference
- Fine-tuned PII Models: Specialized models for better privacy redaction
- Mobile Optimization: Smaller models for mobile device compatibility
- Offline Mode: Complete offline operation for privacy-sensitive environments
- Multi-Model Support: Support for different specialized models
This project uses ESLint for code quality. The configuration can be expanded to enable type-aware lint rules as described in the original template:
```js
// eslint.config.js
import tseslint from 'typescript-eslint'

export default tseslint.config({
  languageOptions: {
    parserOptions: {
      // Point the parser at the project's tsconfig files to enable type-aware rules
      project: ['./tsconfig.node.json', './tsconfig.app.json'],
      tsconfigRootDir: import.meta.dirname,
    },
  },
})
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- WebLLM for enabling browser-based LLM inference
- OpenAI for providing cloud-based AI capabilities
- The React and TypeScript communities for excellent development tools
If you encounter issues:
- WebGPU Support: Check if your browser supports WebGPU
- Memory Issues: Ensure you have sufficient RAM (8GB+ recommended)
- Network Issues: Verify internet connection for initial model download
- Browser Compatibility: Try Chrome/Edge with WebGPU enabled
For technical issues, please open an issue on GitHub with:
- Browser version and WebGPU support status
- System specifications (RAM, GPU)
- Error messages or console logs