Private AI running inside your app.
GemmaKit turns Gemma 4 E2B into an Apple Silicon-optimised local text runtime for apps that need private inference without a cloud dependency. The 4-bit MLX repack is 2.63 GB on disk, runs 902.7 MiB leaner than the source checkpoint, and has passed a 500-generation structured-output validation run at 100% parseability. It sits behind an OpenAI-compatible Chat Completions API, with prompt content kept on the device.
- Client: Swift · OpenAI-style
- Runtime: local Gemma server
- Network: licence channel only
Point your existing OpenAI-compatible client at a local base URL.
No SDK rewrite. Same JSON shape on the supported subset. Streamed or buffered, with optional local bearer-token auth.
# Stream a chat completion against the local server
curl http://127.0.0.1:11436/v1/chat/completions \
  -H "Authorization: Bearer $GEMMAKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it",
    "stream": true,
    "stream_options": { "include_usage": true },
    "messages": [
      { "role": "system", "content": "You are concise." },
      { "role": "user", "content": "Summarise the changelog above." }
    ]
  }'
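For illustration, a streamed reply arrives as standard Chat Completions Server-Sent Events: one `data:` line per chunk, a usage chunk at the end when `include_usage` is set, then `data: [DONE]`. The ids, token text, and counts below are made up; only the field shape is the point.

```
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","model":"gemma-4-e2b-it","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","model":"gemma-4-e2b-it","choices":[{"index":0,"delta":{"content":"The changelog adds"},"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","model":"gemma-4-e2b-it","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","model":"gemma-4-e2b-it","choices":[],"usage":{"prompt_tokens":24,"completion_tokens":6,"total_tokens":30}}

data: [DONE]
```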
Gemma 4 power, repacked for the local footprint.
A 0.7B-parameter, 4-bit text runtime built for Apple Silicon apps: 2.63 GB on disk, 1,234 text tensors kept, 1,415 unused audio and vision tensors removed, and 100% parseability across a 500-generation structured-output validation run.
● Included
- Gemma 4 E2B instruction text model in an Apple Silicon-optimised 4-bit MLX package
- Text-only repack that saves 902.7 MiB against the source MLX checkpoint
- Internal 500-generation validation run with 100% parseability and 100% determinism
- Runtime shape built for quick local loading inside app-controlled storage
- Local text chat completions on the customer device
- OpenAI-compatible Chat Completions endpoint
- Swift client helpers for app integration
- Buffered and streamed responses
○ Out of scope
- Cloud inference of any kind
- Full OpenAI API or Responses API parity
- Tool / function calling, images, audio, embeddings
- Hosted retrieval or stored completions
- Automatic model download or remote registry
- Unlimited offline entitlement
- Replacement for legal review of model distribution
A request, a stream, a local response.
Click Run to preview the streaming shape. The response is illustrative; production tokens are generated by the local runtime on the device.
No prompts leave the device. Not in the demo. Not in production.
Prompts, completions, local documents, model artefacts, and embeddings are not sent to the licence service. The runtime binds to 127.0.0.1 by default, and only the licence channel reaches the network.
Every piece of the runtime is replaceable, inspectable, and on the device.
Six components. One binary. A Swift package. The rest is your app.
Local Gemma server
A converted Gemma text model packaged behind a local HTTP server. Binds to 127.0.0.1 by default and never opens external ports.
OpenAI-compatible subset
The Chat Completions endpoint, with the same JSON request and SSE response shape your existing client already speaks. No Responses API.
Swift helpers
A small Swift surface for issuing chat completions, handling streamed deltas, and managing local bearer tokens from inside an Apple-platform app.
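To make the shape concrete, here is a minimal sketch of consuming the local endpoint with plain `URLSession`, no GemmaKit package required. The URL, port, model id, and bearer header match the curl example above; everything else (function name, decoding approach) is illustrative, not the GemmaKit helper API.

```swift
import Foundation

// Sketch: stream deltas from the local server and print them as they arrive.
// Requires macOS 12 / iOS 15+ for URLSession's async bytes API.
func streamChat(prompt: String, apiKey: String) async throws {
    var request = URLRequest(url: URL(string: "http://127.0.0.1:11436/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: [
        "model": "gemma-4-e2b-it",
        "stream": true,
        "messages": [["role": "user", "content": prompt]]
    ])

    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {
        // SSE frames look like "data: {...}"; the stream ends with "data: [DONE]".
        guard line.hasPrefix("data: ") else { continue }
        let payload = String(line.dropFirst(6))
        if payload == "[DONE]" { break }
        if let data = payload.data(using: .utf8),
           let obj = try JSONSerialization.jsonObject(with: data) as? [String: Any],
           let choices = obj["choices"] as? [[String: Any]],
           let delta = choices.first?["delta"] as? [String: Any],
           let text = delta["content"] as? String {
            print(text, terminator: "")
        }
    }
}
```

Because the server binds to 127.0.0.1, the same code works unchanged in a sandboxed app with the outgoing-connections entitlement.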
Bearer + CORS
Optional local bearer-token enforcement and configurable CORS for embedding the runtime in a webview-bearing app or a sibling local web tool.
Pro org licensing
Pro organisation keys, optional app-id binding, signed local licence certificates, and active-device reporting — without sending prompts or completions.
Just text
Text in, text out. No images, audio, embeddings, tool calls, retrieval, or stored completions — those are intentionally outside the boundary.