
On-Device vs. Cloud AI in the Field: A Systems Architecture Playbook for Latency, Privacy, and ROI

A deep dive into building AI copilots for field teams with the right split between on-device, edge, and cloud. We quantify latency budgets, privacy risk, cost models, failure modes, and present a reference architecture and rollout blueprint that actually works in the field.

Engineering at CoSkip
11/12/2025

Field technicians don’t care which GPU rendered a vector or where an embedding lives—they care whether the next step arrives before their hand reaches the panel, and whether close-out finishes before they shut the van door. This playbook explains how to architect the right split of on-device, edge, and cloud AI for real-world field operations—so latency drops, privacy strengthens, and the ROI is provable.


TL;DR

  • Design around latency budgets, not model fads. Voice UX needs a ≤200 ms round-trip; AR assist needs <50 ms for object overlays; proof-of-work OCR can tolerate 1–5 s in the background.
  • Process sensitive data on-device by default; offload only what benefits from fleet learning or heavy models.
  • Engineer failure-tolerant paths (offline-first, state machines, resumable queues) so jobs don’t die when the cell tower does.
  • Measure ROI in callbacks avoided and minutes saved (not token counts). The split pays for itself when first-time-fix climbs and close-out shrinks.

The Decision Triangle: Latency, Privacy, Cost

Every AI request in the field hits three constraints:

  1. Latency: How fast must the response be to keep the tech “heads-up”?
  2. Privacy/Regulatory: Can this data leave the device? If yes, must retention be zero?
  3. Cost/Footprint: What compute, network, and battery budgets do we have?

Architecture rule: Place each capability at the lowest tier that meets its latency and privacy requirements without blowing cost.

Table 1 — Typical SLOs & Placement

| Capability | Target SLO | Default Placement | Notes |
| --- | --- | --- | --- |
| Wake word + VAD | < 30 ms | On-device | Keeps UX snappy & private |
| Command parsing (NLU) | 80–150 ms | On-device → Edge fallback | Distill + quantize small model |
| Step guidance TTS | 80–150 ms | On-device (cached) | Cache prompts/voices |
| AR object alignment | 16–33 ms/frame | On-device GPU/Neural | Prefer device NN APIs |
| Document OCR (proof) | 1–5 s (async) | On-device → Edge | Batch + opportunistic upload |
| Long-form summarization | 1–4 s | Edge/Cloud | Retain redactions + no-retain modes |
| Retrieval over KB | 150–500 ms | Edge (with local cache) | Push hot docs to device |
| Fleet analytics & model training | Minutes → hours | Cloud | De-identified aggregates only |
Three-tier AI architecture: device, edge, cloud with data flows.
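The placement rule can be expressed as a small decision function. This is an illustrative sketch only: the tier latencies, the `Capability` fields, and the `fits_on_device` footprint flag are assumptions for the example, not part of any real framework.

```python
from dataclasses import dataclass

# Illustrative best-case round-trip latency per tier (ms); real numbers
# depend on radio conditions, model size, and region.
TIER_LATENCY_MS = {"device": 5, "edge": 60, "cloud": 250}

@dataclass
class Capability:
    name: str
    max_latency_ms: float        # the capability's SLO
    data_may_leave_device: bool  # privacy/regulatory constraint
    fits_on_device: bool         # compute/battery footprint

def place(cap: Capability) -> str:
    """Pick the lowest tier that meets latency + privacy without blowing cost."""
    for tier in ("device", "edge", "cloud"):
        if tier == "device" and not cap.fits_on_device:
            continue                      # too heavy for the handset
        if tier != "device" and not cap.data_may_leave_device:
            break                         # privacy forbids any egress
        if TIER_LATENCY_MS[tier] <= cap.max_latency_ms:
            return tier
    return "device"  # nothing fits cleanly: keep it local and degrade

print(place(Capability("wake_word", 30, False, True)))      # device
print(place(Capability("kb_retrieval", 500, True, False)))  # edge
```

The same loop generalizes to any tier ordering; the key property is that privacy is a hard gate, not a weighted score.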

Latency Budgets You Can Actually Hit

Voice: the unforgiving path

  • Wake → Intent → Response audio must feel instantaneous.
  • Budget: Wake (10–20 ms) + NLU (≤120 ms) + TTS (≤80 ms), trading headroom across stages so the full round-trip stays ≤ 200 ms end-to-end.

Tactics

  • Distill a small NLU (e.g., 30–100M params) with intents/slots aligned to your job grammar.
  • Pre-compile TTS prompts for static guidance; cache audio on device.
  • Use audio ring buffers and half-duplex control to avoid barge-in chaos.
Voice pipeline with latency budgets per stage.
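To catch budget regressions per stage rather than only end-to-end, each pipeline stage can be timed against its budget. A minimal sketch (the stage names and budget figures mirror the ones above; everything else is illustrative):

```python
import time
from contextlib import contextmanager

# Per-stage budgets (ms) from the voice pipeline above.
BUDGETS_MS = {"wake": 20, "nlu": 120, "tts": 80}

timings = {}  # stage name -> measured latency (ms) for the last request

@contextmanager
def stage(name):
    """Time one pipeline stage and record it for SLO checks."""
    start = time.monotonic()
    try:
        yield
    finally:
        timings[name] = (time.monotonic() - start) * 1000.0

def over_budget():
    """Return the stages that blew their budget on this request."""
    return [n for n, ms in timings.items() if ms > BUDGETS_MS.get(n, float("inf"))]

with stage("nlu"):
    pass  # run the on-device NLU model here

print(over_budget())  # empty when every stage stayed under budget
```

In production you would aggregate these into p95s per stage, which is what makes the "which stage ate the budget" triage possible.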

AR: the perception trap

  • Pose + alignment need 30–60 FPS for comfort.
  • Offload only heavy recognition; keep pose tracking local (ARKit/ARCore/NNAPI/Metal).

Tactics

  • Use multi-rate pipelines: 60 FPS pose → 10 FPS detect → 1 FPS heavy classify.
  • Quantize models; prefer device GPU/NPUs.
  • Fallback: degrade gracefully to static overlays and photo annotations.
AR pipeline showing multi-rate processing.
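The multi-rate idea reduces to modular arithmetic on the frame index. A sketch with placeholder model calls (the 60/10/1 FPS split matches the tactic above; the function names are illustrative):

```python
# Multi-rate scheduling: at a 60 FPS base loop, run pose every frame,
# the detector every 6th frame (~10 FPS), and the heavy classifier
# every 60th frame (~1 FPS). Model calls are placeholders.

def run_pose(frame): ...
def run_detect(frame): ...
def run_classify(frame): ...

def process_frame(frame_index, frame):
    ran = ["pose"]
    run_pose(frame)                  # every frame: keep overlays locked on
    if frame_index % 6 == 0:
        run_detect(frame)
        ran.append("detect")
    if frame_index % 60 == 0:
        run_classify(frame)
        ran.append("classify")
    return ran
```

Because pose runs unconditionally, a slow detector or classifier degrades label freshness, never overlay stability.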

Privacy First: Data Minimization by Design

  • Default to on-device for audio and images; ship no-retention modes server-side.
  • Redaction at source: serials okay; customer PII masked before leaving device.
  • Immutable audit: hash of media + timestamp + step id; store proofs against job/asset IDs.

Pattern: Local Redact → Remote Summarize

  1. On device: OCR + redact PII + create proof bundle (images, readings, signatures).
  2. Edge/Cloud: Summarize + structure without raw media unless customer policy allows.
Redact locally, summarize remotely architecture.
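Step 1's "redact before egress" can be sketched as a masking pass over the OCR text. This toy version uses two regexes; a production system would use a trained PII detector, and the patterns here are illustrative:

```python
import re

# Toy on-device redaction: mask phone numbers and email addresses in
# OCR text before anything leaves the device. Serial numbers (allowed
# per policy) are left intact.
PII_PATTERNS = [
    re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),  # US-style phone
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),      # email address
]

def redact(text, mask="[REDACTED]"):
    for pattern in PII_PATTERNS:
        text = pattern.sub(mask, text)
    return text

print(redact("Unit SN-4482A, owner 555-867-5309, [email protected]"))
```

The important architectural point is where this runs, not how: the raw string never reaches the edge summarizer.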

Cost Modeling That Survives Finance Review

Token costs are a rounding error relative to truck rolls. Still, predictability matters.

OPEX Model

  • On-device: one-time model packaging; marginal cost near zero; battery budget matters.
  • Edge: predictable per-request cost; cache and dedupe to reduce spikes.
  • Cloud LLM: bursty; use tiered QoS (small model by default; escalate on exception).

Example Monthly Model (100 techs)

  • Voice/AR on-device: $0.00 marginal (battery cost only).
  • Retrieval (edge KV with CDN): $0.20/user.
  • Summaries (small model): 3k calls × $0.001 ≈ $3.
  • Heavy exception calls (cloud): 500 × $0.02 ≈ $10.
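The arithmetic behind that example is worth making explicit, because it is what survives the finance review. Using the illustrative rates above (not vendor pricing):

```python
# Monthly AI OPEX for the 100-tech example above.
techs = 100
retrieval = 0.20 * techs     # edge KV + CDN, per user per month
summaries = 3_000 * 0.001    # small-model summary calls
heavy_calls = 500 * 0.02     # cloud escalations on exceptions
total = retrieval + summaries + heavy_calls
print(f"AI OPEX: ${total:.2f}/month for {techs} techs")  # $33.00/month
```

Roughly $33/month of variable AI cost against five-figure monthly savings from callback reduction is the shape of the argument.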

Savings Driver: 30% fewer callbacks on 1,000 jobs → $80–120k/month saved (see Post #1).

Cost vs. savings dashboard emphasizing callback reduction.

The Reference Architecture (Deployable)

Device (Phone/Headset)

  • Wake/VAD, small NLU, TTS cache, pose tracking, OCR-lite, Local Proof Store (encrypted).
  • Event Log: append-only; state machine for steps; resumable queue for sync.

Edge (Regional)

  • RAG Gateway (KB shards + vector index with hot-doc prefetch to devices).
  • Summarization/Normalization (small models); Policy Engine (retention, redaction).
  • Signed Media Store (if permitted): pre-signed URLs; lifecycle rules; WORM options.

Cloud

  • Fleet Learning (de-identified traces); Model Registry; A/B rollout.
  • Observability: per-step latency SLOs; offline rate; failure taxonomy.

Event Types

  • voice.intent, step.enter/exit, photo.proof, reading.capture, summary.ready, sync.retry.
Microservices architecture showing device/edge/cloud responsibilities.
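The append-only event log needs one property above all: replayed deliveries must be no-ops. A minimal idempotency sketch keyed on a content hash (the event names follow the list above; the storage is an in-memory stand-in):

```python
import hashlib
import json

# Append-only event log with idempotent writes: appending the same
# event (same type + payload + sequence number) twice is a no-op, so
# retried syncs never duplicate steps.
class EventLog:
    def __init__(self):
        self._events = []
        self._seen = set()

    def append(self, kind, payload, seq):
        key = hashlib.sha256(
            json.dumps([kind, payload, seq], sort_keys=True).encode()
        ).hexdigest()
        if key in self._seen:
            return False          # duplicate delivery: ignore
        self._seen.add(key)
        self._events.append({"kind": kind, "seq": seq, **payload})
        return True

log = EventLog()
assert log.append("photo.proof", {"step": 3}, seq=1)
assert not log.append("photo.proof", {"step": 3}, seq=1)  # retry is a no-op
```

On device the `_seen` set would be persisted alongside the log so idempotency survives restarts.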

Reliability Engineering: When the Tower Goes Dark

Patterns

  • Offline-first UI: everything important works with zero network.
  • Idempotent events: retried events don’t duplicate steps.
  • Backpressure: pause nonessential uploads when battery/network are low.
  • Escalation ladder: small→medium→large model hops only when needed.
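The retry and backpressure patterns above can be sketched together in one small queue. Everything here is illustrative (a real queue persists to disk and uses OS battery/network signals), but the two mechanisms shown — full-jitter exponential backoff and priority-gated draining — are the ones that matter:

```python
import random

# Resumable upload queue: exponential backoff with full jitter, plus
# backpressure that holds nonessential items while battery is low.
class UploadQueue:
    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.items = []   # (priority, payload) pairs; persist these in practice
        self.attempt = 0
        self.base, self.max = base_delay, max_delay

    def next_delay(self):
        """Delay before the next retry: capped exponential with jitter."""
        self.attempt += 1
        cap = min(self.max, self.base * 2 ** self.attempt)
        return random.uniform(0, cap)

    def drain(self, battery_pct, online):
        """Yield items safe to send now; keep nonessential ones queued."""
        if not online:
            return
        for priority, payload in list(self.items):
            if battery_pct < 20 and priority != "essential":
                continue          # backpressure: stays queued for later
            self.items.remove((priority, payload))
            yield payload

q = UploadQueue()
q.items = [("essential", "proof.jpg"), ("bulk", "walkthrough.mp4")]
print(list(q.drain(battery_pct=15, online=True)))  # ['proof.jpg']
```

Jitter matters in fleets: without it, every van that regains signal at the same tower retries in lockstep.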

Minimal State Machine (pseudocode)

```mermaid
stateDiagram-v2
  [*] --> Idle
  Idle --> Listening : wake
  Listening --> Intent : vad_ok
  Intent --> Step : intent_ok
  Step --> Capture : requires_proof
  Capture --> Step : proof_ok
  Step --> Summarize : job_end
  Summarize --> Sync : local_ok
  Sync --> Idle : remote_ok
  state Sync {
    [*] --> Queue
    Queue --> Retry : net_fail/battery_low
    Retry --> Queue : recover
    Queue --> [*] : ok
  }
```
Reliability patterns: offline, retry, backpressure, escalation.
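That state machine maps directly onto a plain transition table, which is easy to test and makes illegal transitions fail loudly instead of corrupting job state. A minimal Python sketch (Sync's inner queue/retry loop is collapsed into the Sync state for brevity):

```python
# Transition table mirroring the state diagram above.
TRANSITIONS = {
    ("Idle", "wake"): "Listening",
    ("Listening", "vad_ok"): "Intent",
    ("Intent", "intent_ok"): "Step",
    ("Step", "requires_proof"): "Capture",
    ("Capture", "proof_ok"): "Step",
    ("Step", "job_end"): "Summarize",
    ("Summarize", "local_ok"): "Sync",
    ("Sync", "remote_ok"): "Idle",
}

class JobSession:
    def __init__(self):
        self.state = "Idle"

    def fire(self, event):
        nxt = TRANSITIONS.get((self.state, event))
        if nxt is None:
            raise ValueError(f"illegal transition: {self.state} --{event}-->")
        self.state = nxt
        return nxt

s = JobSession()
for e in ["wake", "vad_ok", "intent_ok", "job_end", "local_ok", "remote_ok"]:
    s.fire(e)
print(s.state)  # Idle
```

Keeping the table as data (rather than branching code) also makes it trivial to diff against the diagram during review.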

Evaluation: Prove It Works (and Keeps Working)

SLOs

  • Voice round-trip (p95): ≤ 200 ms
  • AR overlay stability: ≥ 95% of session time
  • Proof bundle completeness: ≥ 99% of jobs
  • Offline successful close-out: ≥ 90% of offline starts

Field KPIs

  • Callback rate: ↓ 25–35% within 60–90 days
  • Close-out time: ↓ 8–15 minutes/job
  • Warranty approval time: ↓ 30–50%

Experiment Design

  • A/B by geography/crew.
  • Pre-register metrics and hold out at least one crew for 8 weeks.
  • Weekly triage on failure taxonomy (where did it break, why, fix).
Experiment dashboard for callback reduction and SLOs.
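Since the headline KPI is a rate (callbacks per job), a two-proportion z-test is the natural significance check for the cohort comparison. A sketch with illustrative counts — a real analysis should be pre-registered and account for crew-level clustering:

```python
import math

# Two-proportion z-test for callback-rate A/B between cohorts.
def callback_z(callbacks_a, jobs_a, callbacks_b, jobs_b):
    p_a, p_b = callbacks_a / jobs_a, callbacks_b / jobs_b
    p = (callbacks_a + callbacks_b) / (jobs_a + jobs_b)  # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / jobs_a + 1 / jobs_b))
    return (p_a - p_b) / se

# Baseline crew: 120 callbacks on 1,000 jobs; pilot crew: 84 on 1,000
# (a 30% relative reduction). |z| > 1.96 is significant at the 5% level.
z = callback_z(120, 1000, 84, 1000)
print(f"z = {z:.2f}")
```

At these volumes a 30% relative reduction clears significance comfortably; at a few hundred jobs per cohort it often will not, which is why the holdout runs for 8 weeks.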

Security & Trust Controls (Operator-Friendly)

  • Zero-retention toggles at tenant and job-type levels.
  • Admin-visible processing map: clearly shows what stayed local vs. what left the device.
  • Per-step attestations cryptographically signed by device keys.
  • Customer-facing proof packet with redacted media and readings.
Trust console showing retention toggles and audit.
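The per-step attestation can be sketched as: hash the proof media, bind it to the step id and timestamp, and sign with a device-held key. This sketch uses HMAC for simplicity; a production device would use hardware-backed asymmetric keys (Secure Enclave / Android Keystore), and the key and field names here are illustrative:

```python
import hashlib
import hmac
import json
import time

DEVICE_KEY = b"demo-device-key"  # placeholder only; never hardcode real keys

def attest(media, step_id):
    """Build a signed attestation record for one proof capture."""
    record = {
        "media_sha256": hashlib.sha256(media).hexdigest(),
        "step_id": step_id,
        "ts": int(time.time()),
    }
    payload = json.dumps(record, sort_keys=True).encode()  # canonical form
    record["sig"] = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return record

def verify(record):
    """Recompute the signature over everything except 'sig' itself."""
    body = {k: v for k, v in record.items() if k != "sig"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(DEVICE_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])

a = attest(b"panel-photo-bytes", step_id="step-07")
assert verify(a)
```

Canonical JSON (sorted keys) before signing is the detail that makes verification deterministic across device and server.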

Rollout Blueprint (90 Days)

Phase 0 (Week 0): Choose one job type + baseline

  • Instrument: callback %, close-out minutes, failure taxonomy.
  • Pick a high-volume, well-defined job (e.g., PM visits).

Phase 1 (Weeks 1–3): On-device Voice Core

  • Ship wake/VAD/NLU/TTS; cache scripted steps locally.
  • Success: p95 voice round-trip ≤ 200 ms.

Phase 2 (Weeks 4–6): Proof Defaults + OCR

  • Require 2–3 annotated photos + readings; perform local redaction.
  • Success: Proof bundle completeness ≥ 95%.

Phase 3 (Weeks 7–9): AR & RAG Edge

  • Pilot overlays on 1–2 confusing components; enable hot-doc caches.
  • Success: Δ callback ↓ ≥ 25% vs. baseline crew.

Phase 4 (Weeks 10–13): Observability & A/B

  • Model registry, cohort testing, weekly failure triage.
  • Success: Stable deltas, reproducible across crews.
90-day rollout timeline with milestones.

The Executive POV: Why This Pays

Risk (Trust & Privacy). Sending raw field media to generic clouds is a non-starter for many enterprise buyers. On-device first wins deals by default: sensitive data stays local; only derived signals or redacted assets egress.

Speed (Behavior Change). Sub-second guidance (p95 ≤ 200 ms) shifts technician behavior immediately. When the “right way” is also the fastest way, adoption sticks without mandates.

Scale (Smart Use of Cloud). Edge/cloud is used surgically—for fleet learning, summaries, and cross-site search—not for every keystroke. This lowers variable costs and reduces blast radius.

Return (Hard Dollars). Callback reduction is the cash engine; close-out acceleration and faster warranty approvals compound the benefit. The result is higher gross margin with no additional headcount.

Bottom line: Place intelligence as close to the work as possible. Promote only what benefits from the crowd.

Executive summary emphasizing latency, privacy, ROI.

Appendix A — Quick Architecture Checklist

  • Voice pipeline p95 ≤ 200 ms; TTS cached.
    Wake/VAD/NLU/TTS tuned for sub-second round-trip; cache hot prompts and synthesis.

  • On-device redaction; zero-retention pathways.
    Blur faces/PII on device; enforce tenant/job-type zero-retention toggles.

  • Local proof store (encrypted); immutable hashes.
    Photo/video/readings written to an encrypted store; generate content hashes for audit trails.

  • RAG gateway with device hot-doc cache.
    Retrieve only the smallest, relevant chunks; cache job-type manuals/SOPs locally for offline.

  • Offline-first state machine + resumable queues.
    Deterministic job state transitions; queue uploads, retries with backoff when connectivity returns.

  • Observability: step SLOs, offline rate, failure taxonomy.
    Emit per-step latency/accuracy metrics, % time offline, and standardized failure codes.

  • A/B cohorts; callback and close-out KPIs.
    Cohort flags by crew/region; pre-register metrics; track callbacks ↓, close-out mins ↓, warranty approvals ↑.

Architecture checklist cards for voice, privacy, proof, RAG, offline, observability, and A/B.

Conclusion & Next Steps

Reducing callbacks and accelerating close-outs is a systems problem, not a single feature bet. The pattern that survives contact with the field is consistent:

  • Latency: Sub-200 ms voice loops change behavior.
  • Clarity: AR/visual confirmations remove ambiguity where it actually occurs.
  • Proof by default: Evidence is captured as a byproduct of doing the work.
  • Privacy by design: On-device first, zero-retention paths, auditable flows.
  • Observability: SLOs and failure taxonomy make improvement compounding.

If you adopt one thing this quarter, start with Voice → Guidance → Proof on a single, well-scoped job type. Baseline, instrument, A/B, and iterate weekly. The 25–35% callback reduction isn’t a moonshot—it’s a byproduct of removing friction and ambiguity where techs feel it most.

Call to action: If you’d like a reference flow, SLO template, or a pilot plan tailored to your fleet, reach out—let’s make first-time fix the norm, not the exception.


Further Reading & Tools

  • SLO Starter Sheet: Voice RT p95, AR stability %, proof completeness %, offline close-out %
  • Failure Taxonomy Template: Diagnosis, Procedure, Proof, Handoff, Close-out
  • Field Trial Playbook: Cohort design, pre-registration, weekly triage ritual
  • Edge Privacy Checklist: On-device redaction, zero-retention modes, processing map

Need a copy of the templates? Email [email protected] with subject “Field SLO Kit”.

Tags: on-device, edge-computing, cloud, latency, privacy, cost-model, reliability, voice, AR, MLOps, observability, architecture & systems
