Air-gapped AI: how to run LLMs in secure environments without sacrificing control
Offline networks need APIs, audit, and inference inside the boundary—not shadow SaaS. Separate data plane, model custody, and gateway enforcement so control stays provable.
- ai
- security
- deployment
- compliance
Air-gapped does not mean “we never patch.” It means no reliance on the public internet at runtime for the protected workload, and data that must not leave a defined boundary stays inside it. Large language models complicate the picture: leadership wants copilots and agents without exporting prompts, telemetry, or keys to providers they cannot hold to local standards, contractually or technically.
This article describes an architecture pattern enterprises use in regulated, defense, and critical infrastructure contexts: operate policy and evidence on your stack, run inference only where risk accepts it, and refuse the two-stack trap—where “secure” APIs go through a gateway but AI uses a browser plugin and a public API key.
What “air-gapped AI” is trying to guarantee
Stakeholders usually want all of the following at once; half measures fail reviews:
- Data residency — Sensitive payloads and audit logs remain in your data center or approved private cloud.
- No runtime egress — Production data plane paths do not depend on vendor SaaS, public model APIs, or opaque control planes outside the boundary.
- Provable control — Who invoked what, with standardized or tamper-evident evidence suitable for assessors and internal risk teams.
- Operational reality — Patching, model updates, break-glass, and disaster recovery are documented and tested, not tribal knowledge.
- Supply chain visibility — Model weights, containers, and dependencies have owners, versions, and update paths.
Air-gapped programs fail when teams smuggle convenience: personal consumer chat accounts, shared API keys in Slack, or unapproved model binaries that phone home for licensing, telemetry, or updates.
Reference architecture: split inference, authorization, and evidence
A clean decomposition keeps roles clear.
Inference plane (the model)
Run approved models inside the boundary. Common patterns:
- On-prem or VPC-only OpenAI-compatible endpoints (Ollama, vLLM, commercial stacks your security team has vetted).
- Dedicated GPU pools with network isolation, not shared with interactive research workstations unless compensating controls are in place.
- Secrets for provider endpoints (even internal) stay in Vault-class stores; no model keys in browser code for admin UIs.
Zerq Copilot supports bring-your-own-model configurations—including Ollama, Azure OpenAI, Amazon Bedrock, and other providers—with server-side custody so credentials do not ship to clients.
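Server-side custody is straightforward to enforce in code: the credential lives in the environment (or a Vault-backed secret mount) on the server, and only the assembled request ever reaches the network. A minimal sketch against an OpenAI-compatible endpoint, where the host name, model name, and `INTERNAL_LLM_KEY` variable are illustrative assumptions:

```python
import json
import os

def build_chat_request(base_url: str, model: str, messages: list[dict]) -> tuple[str, dict, bytes]:
    """Build an OpenAI-compatible chat request for an internal endpoint.

    The API key is read server-side from the environment (injected by a
    Vault-class store); it never appears in client code or browser bundles.
    """
    api_key = os.environ["INTERNAL_LLM_KEY"]  # hypothetical secret name
    url = f"{base_url.rstrip('/')}/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages}).encode()
    return url, headers, body

# Usage (inside the boundary; hypothetical internal host):
# url, headers, body = build_chat_request(
#     "http://llm.internal:8000", "llama3",
#     [{"role": "user", "content": "Summarize this incident report."}])
# urllib.request.urlopen(urllib.request.Request(url, data=body, headers=headers))
```

The point is the seam: anything client-facing sees only the proxy, never the bearer token.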
Authorization plane (who may act)
LLMs do not replace IAM. Every tool call that hits a business API—or every automation action that changes gateway configuration—must pass through the same identity, RBAC, and separation of duties as the rest of the enterprise. Break-glass must be rare, logged, time-bound, and reviewed.
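The check is deliberately boring: before any tool call executes, the caller's scopes (from the IdP, same as any other principal) are tested against the scope the tool requires. A sketch with hypothetical scope names:

```python
from dataclasses import dataclass, field

@dataclass
class Principal:
    subject: str                       # identity from the IdP, human or automation
    scopes: set = field(default_factory=set)

class AuthorizationError(Exception):
    pass

def authorize_tool_call(principal: Principal, tool: str, required_scope: str) -> None:
    """Gate an LLM tool call with the same RBAC check as any API client.

    The model's stated intent never grants access; only the caller's
    scopes do. Denials raise (and in production also emit an audit event).
    """
    if required_scope not in principal.scopes:
        raise AuthorizationError(
            f"{principal.subject} lacks {required_scope} for {tool}")

# Usage: an agent holding invoices:read may fetch but not pay.
# authorize_tool_call(Principal("svc:agent", {"invoices:read"}),
#                     "get_invoice", "invoices:read")
```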
Evidence plane (audit and observability)
Config and audit data live in your infrastructure. Structured logs and metrics feed your SIEM and your dashboards—not a third-party “AI ops” product outside the boundary.
Zerq is designed for on-prem or fully offline operation: no outbound dependency at runtime—see Architecture and Deployment flexibility on Capabilities.
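One way to make evidence tamper-evident without any external service is a hash chain: each audit record carries the SHA-256 of its predecessor, so editing any record breaks verification. A minimal sketch (field names are illustrative, not a specific product's log schema):

```python
import hashlib
import json
import time

def append_audit_event(log: list, actor: str, action: str, target: str) -> dict:
    """Append a tamper-evident audit record: each entry hashes its predecessor."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "ts": time.time(),
        "actor": actor,
        "action": action,
        "target": target,
        "prev": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = "0" * 64
    for rec in log:
        if rec["prev"] != prev:
            return False
        body = {k: v for k, v in rec.items() if k != "hash"}
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

A SIEM inside the boundary can run `verify_chain` on ingest; assessors get evidence that is checkable, not just asserted.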
Network zones: where the gateway sits
In segmented networks, the API gateway is usually a choke point between untrusted clients (partners, agents, DMZ services) and trusted application tiers. Inference hosts may live in a separate zone with strict egress (often none). Management access to those hosts should be jump-host or SRE-only—with session logging.
Discovery (e.g. which operations exist for an MCP client) must still respect RBAC: internal catalogs and partner catalogs are not the same surface—see For AI agents.
Updating models and containers without breaking the air gap
Air-gapped does not mean static. It means controlled supply:
- Artifact registry inside the boundary (or sneakernet-style import with hash verification).
- SBOM or dependency inventory for inference stacks—same discipline as any critical server.
- Staging that mirrors production network policy—not “staging has internet, prod does not.”
- Rollback plans when a new model degrades quality or latency.
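The import step above can be mechanical: every artifact that crosses the boundary is checked against a manifest of expected digests (the manifest itself approved and signed before the media moves). A sketch, streaming because model weights are large:

```python
import hashlib
from pathlib import Path

def verify_import(artifact: Path, manifest: dict) -> str:
    """Verify an imported model artifact against an approved manifest.

    `manifest` maps artifact filename -> expected SHA-256 hex digest.
    Raises ValueError on mismatch so nothing unverified gets loaded.
    """
    expected = manifest[artifact.name]
    h = hashlib.sha256()
    with artifact.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            h.update(chunk)
    digest = h.hexdigest()
    if digest != expected:
        raise ValueError(f"hash mismatch for {artifact.name}: refuse to load")
    return digest
```

Wire this into the registry's intake path so the check cannot be skipped by hand.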
API gateway: the choke point for apps and agents
Whether traffic comes from a microservice, a partner, or an MCP client, the gateway validates tokens, enforces scopes, applies rate limits, and emits structured logs—before expensive upstream work. Smart callers do not get a pass.
Zerq’s model: same credentials as REST, one deployment, full visibility in metrics and logs—Observability.
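The ordering matters: authenticate, authorize, rate-limit, log, and only then call upstream. A compressed sketch of that pipeline, with in-memory tokens and a fixed 60-second window standing in for a real token validator and rate limiter:

```python
import json
import time
from collections import defaultdict

class Gateway:
    """Sketch of gateway ordering: cheap checks and a structured log line
    happen before any expensive upstream work. Token and scope names are
    illustrative, not a specific product's API."""

    def __init__(self, tokens: dict, limit_per_minute: int = 60):
        self.tokens = tokens              # token -> {"sub": ..., "scopes": {...}}
        self.limit = limit_per_minute
        self.windows = defaultdict(list)  # subject -> recent request timestamps

    def handle(self, token: str, scope: str, upstream):
        claims = self.tokens.get(token)
        if claims is None:
            status, sub = 401, "unknown"
        elif scope not in claims["scopes"]:
            status, sub = 403, claims["sub"]
        elif not self._within_limit(claims["sub"]):
            status, sub = 429, claims["sub"]
        else:
            status, sub = 200, claims["sub"]
        # Structured log line destined for the in-boundary SIEM.
        print(json.dumps({"sub": sub, "scope": scope, "status": status}))
        return status, (upstream() if status == 200 else None)

    def _within_limit(self, sub: str) -> bool:
        now = time.time()
        window = [t for t in self.windows[sub] if now - t < 60]
        allowed = len(window) < self.limit
        if allowed:
            window.append(now)
        self.windows[sub] = window
        return allowed
```

An MCP client and a partner microservice hit the same `handle` path: same denials, same log schema, one correlation story.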
Edge workflows without shipping another microservice
Teams often need scope checks, request shaping, and consistent error envelopes without a deploy per change. Visual workflows at the gateway can express policy close to enforcement—keep them narrow so they do not become a shadow application server. See Design gateway workflows without shipping another microservice.
Failure modes we see in “secure AI” programs
| Failure | Symptom |
|---|---|
| Split observability | REST in Splunk, “AI” in a vendor UI—incidents cannot correlate. |
| Operator bypass | Break-glass without MFA or audit because “emergency.” |
| Model sprawl | Dozens of unvetted weights on shared GPUs; no owner. |
| Data leakage via prompts | PII in prompts to unapproved endpoints—DLP must cover API proxies, not only email. |
| Dual identity | Agents use long-lived shared keys that never appear in access certification. |
Hardening checklist (air-gapped LLM + API)
- Network — Document allowed egress by zone; default deny for data-plane servers that do not need general web access.
- Identity — Same IdP for humans and automation; no immortal shared keys for admin MCP.
- Gateway — One authorization and audit story for all API consumers, including agents.
- Models — Approved inventory, versioning, owner, update path; no ad hoc binaries.
- Evidence — Tabletop: complete trace for one agent API call and one config change without crossing the boundary.
- Break-glass — Time-bound, logged, reviewed; not a standing shared root.
Summary: Air-gapped AI works when inference and policy respect the same perimeter. Control means gateway-enforced access and auditable operations—not a parallel stack for “smart” clients.
Request an enterprise demo to discuss offline deployment, Copilot with on-prem models, and unified audit for REST and AI traffic.