GemmaKit

Private AI
running inside your app.

GemmaKit turns Gemma 4 E2B into an Apple Silicon-optimised local text runtime for apps that need private inference without a cloud dependency. The 4-bit MLX repack is 2.63 GB on disk, 902.7 MiB smaller than the source MLX checkpoint, and has passed a 500-generation structured-output validation run at 100% parseability. It sits behind an OpenAI-compatible Chat Completions API, with integration paths for Swift, TypeScript, Dart, Rust, Cordova, and webviews.

2.63 GB runtime
902.7 MiB smaller
500-run validated
Swift · TS · Dart · Rust
Running on device · offline
Notebook • on device
Plan my week around two deep work blocks.
Tuesday and Thursday mornings are clearest. Block 9–11 on each, tag them as Focus, and decline meetings.
Add them to my calendar.
Client: Swift · TS · Dart · Rust
Runtime: local Gemma server
Network: licence channel only
GemmaKit / Pro · private beta
OpenAI-compatible · subset
Chat completions only
Swift · TS · Dart · Rust

Point your app shell at a local base URL.

No SDK rewrite for OpenAI-style clients, and thin GemmaKit clients where that makes app code cleaner. Streamed or buffered, with optional local bearer-token auth.

# Stream a chat completion against the local server
# (-N turns off curl's output buffering so chunks print as they arrive)
curl -N http://127.0.0.1:11436/v1/chat/completions \
  -H "Authorization: Bearer $GEMMAKIT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-e2b-it",
    "stream": true,
    "stream_options": { "include_usage": true },
    "messages": [
      { "role": "system", "content": "You are concise." },
      { "role": "user",   "content": "Summarise the changelog above." }
    ]
  }'
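
The same streamed request can be made from app code with plain fetch. The sketch below is illustrative only: it assumes the local server emits OpenAI-style SSE lines ("data: {...}", ending with "data: [DONE]"), and the names `parseSseLines` and `streamChat` are hypothetical, not part of any GemmaKit client.

```typescript
// Sketch only. Assumes OpenAI-style SSE chunks from the local server.
interface ChatChunk {
  choices?: { delta?: { content?: string } }[];
}

// Extract text deltas from complete SSE lines and report whether the
// [DONE] sentinel was seen.
function parseSseLines(lines: string[]): { deltas: string[]; done: boolean } {
  const deltas: string[] = [];
  let done = false;
  for (const raw of lines) {
    const line = raw.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const chunk = JSON.parse(payload) as ChatChunk;
    // A usage-only chunk has an empty choices array, so this may be undefined.
    const text = chunk.choices?.[0]?.delta?.content;
    if (text) deltas.push(text);
  }
  return { deltas, done };
}

// Stream one completion and pass each text delta to onDelta.
async function streamChat(
  baseUrl: string,
  apiKey: string,
  prompt: string,
  onDelta: (text: string) => void,
): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gemma-4-e2b-it",
      stream: true,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!res.ok || !res.body) throw new Error(`request failed: HTTP ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // keep the trailing partial line for the next read
    const parsed = parseSseLines(lines);
    parsed.deltas.forEach(onDelta);
    if (parsed.done) return;
  }
}
```

A caller might use `streamChat("http://127.0.0.1:11436", key, "Hello", (t) => console.log(t))`. Splitting on newlines and carrying the partial last line forward keeps the parser correct when a network read ends mid-line.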

One local runtime, several app shells.

GemmaKit is the local server plus small client surfaces. Native, cross-platform, Rust-backed, and webview apps all talk to the same loopback Chat Completions endpoint.

Swift and Apple apps

Use GemmaKitCore for typed requests, streamed chunks, and OpenAI-compatible response models. Use the Pro package when the app also manages the local server process and approved model acquisition.

Swift · macOS · Apple Silicon

React Native

@gemmakit/client uses standard fetch and ReadableStream, so React Native apps can call the same local endpoint once the host app has started the GemmaKit runtime.

TypeScript · Hermes · streaming

Flutter

The Dart gemmakit package mirrors the Swift and TypeScript scope: buffered generation, streamed text deltas, full chunks, and OpenAI-shaped errors against the local server.

Dart · Flutter · mobile
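
OpenAI-shaped errors can be normalised on the client in a few lines. This is a minimal sketch that assumes the local server returns the usual OpenAI error envelope, an object with an "error" member carrying "message", "type", "param", and "code". The `LocalApiError` type and `toApiError` function are hypothetical names used for illustration.

```typescript
// Sketch only. Assumes the OpenAI-style error envelope.
interface LocalApiError {
  status: number;
  type: string;
  code: string | null;
  message: string;
}

interface ErrorEnvelope {
  error?: { message?: string; type?: string; param?: string | null; code?: string | null };
}

// Turn an HTTP status and raw response body into one consistent error value.
function toApiError(status: number, bodyText: string): LocalApiError {
  try {
    const parsed = JSON.parse(bodyText) as ErrorEnvelope;
    if (parsed.error && typeof parsed.error.message === "string") {
      return {
        status,
        type: parsed.error.type ?? "unknown",
        code: parsed.error.code ?? null,
        message: parsed.error.message,
      };
    }
  } catch {
    // Body was not JSON (or was JSON null); fall through to the generic error.
  }
  return { status, type: "unknown", code: null, message: bodyText || `HTTP ${status}` };
}
```

Falling back to the raw body keeps non-JSON failures, such as a proxy error page, visible instead of hiding them behind a parse exception.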

Electron

Electron main and renderer processes can use @gemmakit/client or any OpenAI-compatible JavaScript SDK pointed at 127.0.0.1.

desktop · Node · webview

Ionic, Capacitor, Cordova

Webview apps integrate over local HTTP with the JavaScript client when fetch and streams are available. Cordova projects can use cordova-plugin-gemmakit for a plugin install path.

webview · plugin shell · CORS

Tauri and Rust

Tauri apps can call GemmaKit from the webview through @gemmakit/client or from the Rust side through gemmakit-client. The server still stays on loopback by default.

Rust crate · webview · loopback

Node and CLIs

Use @gemmakit/client, the official OpenAI JavaScript SDK, or direct HTTP calls for local developer tools, build assistants, and same-machine command-line workflows.

Node 18+ · OpenAI-style · SSE
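
For the direct-HTTP route, a buffered (non-streamed) call needs only the global fetch available in Node 18+. The sketch below assumes the server returns the standard Chat Completions response shape with the text at choices[0].message.content. The helper names are hypothetical.

```typescript
// Sketch only. Assumes the standard Chat Completions response shape.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatCompletion {
  choices: { message: { role: string; content: string | null }; finish_reason: string | null }[];
}

// Build the JSON body for a buffered request (stream: false).
function buildBufferedBody(messages: ChatMessage[]): string {
  return JSON.stringify({ model: "gemma-4-e2b-it", stream: false, messages });
}

// Return the first choice's text, or an empty string if there is none.
function firstMessageText(completion: ChatCompletion): string {
  return completion.choices[0]?.message.content ?? "";
}

// Send one buffered request and resolve with the full reply text.
async function complete(baseUrl: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: buildBufferedBody(messages),
  });
  if (!res.ok) throw new Error(`local server returned HTTP ${res.status}`);
  return firstMessageText((await res.json()) as ChatCompletion);
}
```

A command-line tool could call `complete("http://127.0.0.1:11436", [{ role: "user", content: "Hello" }])` and print the result once the whole reply has arrived.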

Other OpenAI-style clients

Python, Go, Kotlin, and other HTTP clients can use GemmaKit by changing only the base URL and staying inside the supported Chat Completions subset.

HTTP · JSON · subset

Gemma 4 power, repacked for the local footprint.

A 4-bit Gemma 4 E2B text runtime built for Apple Silicon apps: 2.63 GB on disk, 1,234 text tensors kept, 1,415 unused audio and vision tensors removed, and 100% parseability across a 500-generation structured-output validation run.

  Included

  • Gemma 4 E2B instruction text model in an Apple Silicon-optimised 4-bit MLX package
  • Text-only repack that saves 902.7 MiB against the source MLX checkpoint
  • Internal 500-generation validation run with 100% parseability and 100% determinism
  • Runtime shape built for quick local loading inside app-controlled storage
  • Local text chat completions on the customer device
  • OpenAI-compatible Chat Completions endpoint
  • Swift client helpers for app integration
  • TypeScript client for React Native, Electron, Ionic, Capacitor, Cordova, Tauri webviews, and Node
  • Dart client for Flutter and Dart command-line applications
  • Rust client for Tauri backends and Rust applications
  • Cordova plugin shell for Cordova and Ionic installs
  • Signed model manifests and approved model download installation
  • Buffered and streamed responses

  Out of scope

  • Cloud inference of any kind
  • Full OpenAI API or Responses API parity
  • Tool / function calling, images, audio, embeddings
  • Hosted retrieval or stored completions
  • Open-ended remote model registry or unapproved artefact downloads
  • Unlimited offline entitlement
  • Replacement for legal review of model distribution
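
An app that reuses an existing OpenAI-style request builder may want to catch out-of-scope features before a request is sent. The guard below is a hypothetical client-side sketch based on the list above. How the server itself treats these fields (reject or ignore) is not stated here, so treat the field list as an assumption.

```typescript
// Sketch only. Chat Completions request fields that map to features listed
// as out of scope: tool/function calling, audio, and stored completions.
const UNSUPPORTED_FIELDS = [
  "tools",
  "tool_choice",
  "functions",
  "function_call",
  "audio",
  "modalities",
  "store",
];

// Return a list of human-readable problems; an empty list means the body
// stays inside the text-only subset as this sketch understands it.
function findSubsetViolations(body: Record<string, unknown>): string[] {
  const problems: string[] = [];
  for (const field of UNSUPPORTED_FIELDS) {
    if (body[field] !== undefined) problems.push(`unsupported field: ${field}`);
  }
  const rawMessages = body["messages"];
  const messages: unknown[] = Array.isArray(rawMessages) ? rawMessages : [];
  messages.forEach((m: unknown, i: number) => {
    const content = (m as { content?: unknown } | null)?.content;
    // Image and audio inputs arrive as content-part arrays, so require a string.
    if (typeof content !== "string") {
      problems.push(`messages[${i}].content must be a plain string`);
    }
  });
  return problems;
}
```

Running the guard in development builds gives an early, local signal when shared request code drifts outside the supported subset.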

A request, a stream, a local response.

The request below previews the streaming shape. The response is illustrative; production tokens are generated by the local runtime on the device.

POST 127.0.0.1:11436/v1/chat/completions · ready
user: Explain the difference between buffered and streamed responses in two sentences.
Model: gemma-4-e2b-it
Stream: true
Latency · ttft: not yet measured
Tokens: 0
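
The latency (time to first token) and token readouts in a panel like this can be derived from the stream itself. A minimal sketch, assuming each received chunk is timestamped by the caller and that the completion token count arrives on the final usage chunk when stream_options.include_usage is set. `TimedChunk` and `summariseStream` are hypothetical names.

```typescript
// Sketch only. One record per chunk received from the stream.
interface TimedChunk {
  atMs: number; // arrival time, taken from the same clock as startMs
  delta?: string; // text delta, if the chunk carried one
  completionTokens?: number; // present only on the final usage chunk
}

interface StreamSummary {
  text: string;
  ttftMs: number | null; // null if no text was ever received
  completionTokens: number | null; // null if no usage chunk was received
}

// Fold the chunks into the full text, time to first token, and token count.
function summariseStream(startMs: number, chunks: TimedChunk[]): StreamSummary {
  let text = "";
  let ttftMs: number | null = null;
  let completionTokens: number | null = null;
  for (const c of chunks) {
    if (c.delta) {
      if (ttftMs === null) ttftMs = c.atMs - startMs; // first non-empty delta
      text += c.delta;
    }
    if (c.completionTokens !== undefined) completionTokens = c.completionTokens;
  }
  return { text, ttftMs, completionTokens };
}
```

A buffered response has no meaningful time to first token, since the whole reply arrives at once; only the streamed path produces the per-chunk timings used here.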

No prompts leave the device. Not in the demo. Not in production.

Prompts, completions, local documents, model artefacts, and embeddings are not sent to the licence service. The runtime binds to 127.0.0.1 by default, and only the licence channel reaches the network.

bind: 127.0.0.1:11436
auth: local bearer · optional
cors: configurable
egress: licence channel only
content: not sent server-side
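
App code can enforce the same boundary on its side by refusing any base URL that is not loopback. The check below is a hypothetical sketch, not a GemmaKit API. Accepting "localhost" and "[::1]" alongside 127.x.x.x is an assumption of this example.

```typescript
// Parse a URL string, returning null instead of throwing on bad input.
function parseUrl(value: string): URL | null {
  try {
    return new URL(value);
  } catch {
    return null;
  }
}

// True only for http(s) URLs whose host is a loopback address.
function isLoopbackBaseUrl(baseUrl: string): boolean {
  const url = parseUrl(baseUrl);
  if (url === null) return false;
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname;
  return (
    host === "localhost" ||
    host === "[::1]" ||
    /^127\.\d{1,3}\.\d{1,3}\.\d{1,3}$/.test(host)
  );
}
```

Calling `isLoopbackBaseUrl` before every request means a misconfigured base URL fails closed rather than sending a prompt to another machine.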

Every piece of the runtime is replaceable, inspectable, and on the device.

Six components. One binary. Swift, TypeScript, Dart, Rust, and Cordova clients. The rest is your app.

01 / runtime

Local Gemma server

A converted Gemma text model packaged behind a local HTTP server. Binds to 127.0.0.1 by default, so it opens no externally reachable listening ports.

02 / api

OpenAI-compatible subset

The Chat Completions endpoint, with the same JSON request and SSE response shape your existing client already speaks. No Responses API.

03 / clients

Thin client helpers

Small Swift, TypeScript, Dart, Rust, and Cordova surfaces for issuing chat completions, handling streamed deltas, and keeping app code typed without changing the local API contract.

04 / auth

Bearer + CORS

Optional local bearer-token enforcement and configurable CORS for React Native, Electron, Ionic, Capacitor, Cordova, Tauri, and sibling local web tools.
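
Because bearer enforcement is optional, client code can attach the Authorization header only when a token is configured. A small sketch with a hypothetical helper name:

```typescript
// Build request headers for the local endpoint. The Authorization header is
// added only when a non-empty token is supplied.
function buildLocalHeaders(token?: string): Record<string, string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  const trimmed = token?.trim();
  if (trimmed) headers["Authorization"] = `Bearer ${trimmed}`;
  return headers;
}
```

In a webview, adding an Authorization header generally makes the browser send a CORS preflight for cross-origin requests, so the server's configured CORS policy needs to allow the app's origin and that header.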

05 / licence

Pro org licensing

Pro organisation keys, optional app-id binding, signed local licence certificates, and active-device reporting — without sending prompts or completions.

06 / scope

Just text

Text in, text out. No images, audio, embeddings, tool calls, retrieval, or stored completions — those are intentionally outside the boundary.

Monthly platform fee plus active-device usage.

A device counts once per billing period after activation, licence refresh, or a gated generation call. Repeated requests do not become per-token or per-request billing.

Pro · base + device meter
Configured in Stripe · billed monthly
  • Pro organisation keys with optional app-id binding
  • Signed local licence certificates
  • Active-device reporting · no prompt content
  • Device and certificate revocation paths
sample
{
  "org_id":        "org_4e2fa1",
  "app_id":        "app.example.studio",
  "device_id":     "dev_opaque_installation_id",
  "event":         "gated_generation",
  "period_start":  "2026-05-01T00:00:00Z",
  "prompt_content": null,
  "completion_content": null,
  "certificate_id": "lic_9c1a..."
}
— what gets reported, in full
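
The once-per-period rule amounts to counting distinct devices per organisation and billing period. The sketch below illustrates that arithmetic using the field names from the sample report. It is not the billing implementation, and `countActiveDevices` is a hypothetical name.

```typescript
// Sketch only. Field names follow the sample report; content fields are
// omitted because they are always null.
interface DeviceReport {
  org_id: string;
  app_id: string;
  device_id: string;
  event: string; // the sample shows "gated_generation"
  period_start: string; // ISO 8601 start of the billing period
}

// Count distinct devices per "org_id|period_start" key. Repeated reports
// from the same device in the same period do not raise the count.
function countActiveDevices(reports: DeviceReport[]): Map<string, number> {
  const seen = new Map<string, Set<string>>();
  for (const r of reports) {
    const key = `${r.org_id}|${r.period_start}`;
    let devices = seen.get(key);
    if (!devices) {
      devices = new Set<string>();
      seen.set(key, devices);
    }
    devices.add(r.device_id);
  }
  const counts = new Map<string, number>();
  seen.forEach((devices, key) => counts.set(key, devices.size));
  return counts;
}
```

Three reports from the same device in one period yield a count of one, which is why repeated requests never turn into per-request billing.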

One package. Drop it next to your app.