A deep dive into building AI copilots for field teams with the right split between on-device, edge, and cloud. We quantify latency budgets, privacy risk, cost models, and failure modes, then present a reference architecture and a rollout blueprint built for field conditions.
On-Device, Edge, or Cloud? Architecting AI for Real-World Field Operations
Field technicians don’t care which GPU rendered a vector or where an embedding lives—they care whether the next step arrives before their hand reaches the panel, and whether close-out finishes before they shut the van door.
This playbook explains how to architect the right split of on-device, edge, and cloud AI for real-world field operations—so latency drops, privacy strengthens, and the ROI is provable.
TL;DR
- Design around latency budgets, not model fads. Voice UX needs <150 ms round-trip; AR assist needs <50 ms for object overlays; proof-of-work OCR can tolerate 1–5 s in the background.
- Process sensitive data on-device by default; offload only what benefits from fleet learning or heavy models.
- Engineer failure-tolerant paths with offline-first design, state machines, and resumable queues so jobs don’t die when the cell tower does.
- Measure ROI in callbacks avoided and minutes saved — not token counts. The split pays for itself when first-time-fix climbs and close-out shrinks.
The Decision Triangle: Latency, Privacy, Cost
Every AI request in the field hits three constraints:
- Latency: How fast must the response be to keep the tech “heads-up”?
- Privacy/Regulatory: Can this data leave the device? If yes, must retention be zero?
- Cost/Footprint: What compute, network, and battery budgets do we have?
Architecture rule: Place each capability at the lowest tier that meets its latency and privacy requirements without blowing cost.
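The rule above can be read as a simple search from the nearest tier outward. The following is a minimal sketch under stated assumptions: `TierProfile`, `Requirement`, `place`, and every number in `PROFILES` are hypothetical names and placeholder values, not measurements from any real fleet.

```python
from dataclasses import dataclass

# Tiers ordered from closest to the work to farthest from it.
TIERS = ("device", "edge", "cloud")


@dataclass(frozen=True)
class TierProfile:
    latency_ms: float        # assumed typical round-trip for this tier
    leaves_device: bool      # True when data must leave the device
    cost_per_call: float     # assumed marginal cost in dollars
    capabilities: frozenset  # what this tier has a model for


@dataclass(frozen=True)
class Requirement:
    capability: str
    max_latency_ms: float
    may_leave_device: bool
    max_cost_per_call: float


def place(req, profiles):
    """Return the lowest tier that satisfies the requirement, or None."""
    for tier in TIERS:
        p = profiles[tier]
        if req.capability not in p.capabilities:
            continue
        if p.latency_ms > req.max_latency_ms:
            continue
        if p.leaves_device and not req.may_leave_device:
            continue
        if p.cost_per_call > req.max_cost_per_call:
            continue
        return tier
    return None


# Hypothetical profiles; measure your own fleet before trusting these.
PROFILES = {
    "device": TierProfile(20, False, 0.0, frozenset({"nlu", "ocr"})),
    "edge": TierProfile(120, True, 0.001, frozenset({"nlu", "ocr", "summarize"})),
    "cloud": TierProfile(900, True, 0.02, frozenset({"nlu", "summarize", "train"})),
}
```

A `None` result is useful in its own right: it flags a capability whose latency, privacy, and cost requirements cannot all be met by any tier as currently profiled.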
Table 1 — Typical SLOs & Placement
| Capability | Target SLO | Default Placement | Notes |
|---|---|---|---|
| Wake word + VAD | <30 ms | On-device | Keeps UX snappy and private |
| Command parsing / NLU | 80–150 ms | On-device → Edge fallback | Distill and quantize small model |
| Step guidance TTS | 80–150 ms | On-device, cached | Cache prompts and voices |
| AR object alignment | 16–33 ms/frame | On-device GPU / Neural | Prefer device NN APIs |
| Document OCR / proof | 1–5 s, async | On-device → Edge | Batch and opportunistic upload |
| Long-form summarization | 1–4 s | Edge / Cloud | Retain redactions and no-retain modes |
| Retrieval over KB | 150–500 ms | Edge, with local cache | Push hot docs to device |
| Fleet analytics and model training | Minutes → hours | Cloud | De-identified aggregates only |

Latency Budgets You Can Actually Hit
Voice: The Unforgiving Path
- Wake → Intent → Response audio must feel instantaneous.
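- Budget: Wake 10–20 ms + NLU ≤120 ms + TTS ≤80 ms, against a ≤200 ms end-to-end target. The stage ceilings add up to 220 ms, so the stages cannot all run at their ceilings at once.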
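A budget like this is easy to check mechanically, both at design time and against measured traces. The sketch below is illustrative only: the function names are hypothetical, and the stage ceilings are copied from the budget line above.

```python
# Stage ceilings from the budget above, in milliseconds.
STAGE_CEILINGS_MS = {"wake": 20, "nlu": 120, "tts": 80}
TARGET_MS = 200


def headroom_ms(stage_ms, target_ms=TARGET_MS):
    """Positive means slack; negative means the budget is overspent."""
    return target_ms - sum(stage_ms.values())


def over_budget_stages(measured_ms, ceilings_ms=STAGE_CEILINGS_MS):
    """Return the stages whose measured latency exceeds their ceiling."""
    return sorted(s for s, ms in measured_ms.items() if ms > ceilings_ms[s])
```

Calling `headroom_ms(STAGE_CEILINGS_MS)` returns a negative number, which is the signal that at least one stage has to run under its ceiling for the end-to-end target to hold.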
Tactics
- Distill a small NLU model, such as 30–100M parameters, with intents and slots aligned to your job grammar.
- Pre-compile TTS prompts for static guidance and cache audio on device.
- Use audio ring buffers and half-duplex control to avoid barge-in chaos.

AR: The Perception Trap
- Pose + alignment need 30–60 FPS for comfort.
- Offload only heavy recognition; keep pose tracking local with ARKit, ARCore, NNAPI, or Metal.
Tactics
- Use multi-rate pipelines: 60 FPS pose → 10 FPS detect → 1 FPS heavy classify.
- Quantize models and prefer device GPUs / NPUs.
- Fall back gracefully to static overlays and photo annotations.
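The multi-rate idea can be sketched as a per-frame scheduler that decides which stages run on each camera frame. This is a simplified illustration, not an ARKit or ARCore API: `stages_for_frame` is a hypothetical helper, and it assumes the slower rates divide the pose rate evenly.

```python
def stages_for_frame(frame_index, pose_fps=60, detect_fps=10, classify_fps=1):
    """Return the pipeline stages to run on this frame.

    Pose runs every frame; detection and heavy classification run on a
    fixed stride derived from the ratio of frame rates.
    """
    detect_stride = pose_fps // detect_fps      # 6 with the default rates
    classify_stride = pose_fps // classify_fps  # 60 with the default rates
    stages = ["pose"]
    if frame_index % detect_stride == 0:
        stages.append("detect")
    if frame_index % classify_stride == 0:
        stages.append("classify")
    return stages
```

In a real pipeline the slower stages would typically run on a background thread so a slow classify pass never stalls pose tracking; the stride logic stays the same.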

Privacy First: Data Minimization by Design
- Default to on-device for audio and images; ship no-retention modes server-side.
- Redaction at source: serials can be preserved, while customer PII is masked before leaving the device.
- Immutable audit: hash media with timestamp and step ID; store proofs against job and asset IDs.
Pattern: Local Redact → Remote Summarize
- On device: OCR, redact PII, and create a proof bundle with images, readings, and signatures.
- Edge / Cloud: Summarize and structure without raw media unless customer policy allows.
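A minimal sketch of the on-device half of this pattern follows. It assumes PII shows up as email addresses and US-style phone numbers in OCR text, which is a deliberate simplification; the regexes, `redact`, and `proof_hash` are hypothetical and would need extending for real customer data.

```python
import hashlib
import re

# Simplified PII patterns; real deployments need a broader set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Mask emails and phone numbers; serial numbers pass through untouched."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def proof_hash(media, timestamp, step_id):
    """SHA-256 over timestamp, step ID, and media bytes for the audit trail."""
    h = hashlib.sha256()
    # Null separators keep ("ab", "c") distinct from ("a", "bc").
    h.update(timestamp.encode("utf-8"))
    h.update(b"\x00")
    h.update(step_id.encode("utf-8"))
    h.update(b"\x00")
    h.update(media)
    return h.hexdigest()
```

Only the redacted text and the hash need to leave the device; the raw media can stay in the local encrypted store unless customer policy allows upload.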
Cost Modeling That Survives Finance Review
Token costs are a rounding error relative to truck rolls. Still, predictability matters.
OPEX Model
- On-device: One-time model packaging; marginal cost near zero; battery budget matters.
- Edge: Predictable per-request cost; cache and dedupe to reduce spikes.
- Cloud LLM: Bursty; use tiered QoS with a small model by default and escalation on exception.
Example Monthly Model: 100 Techs
| Cost Driver | Estimate |
|---|---|
| Voice / AR on-device | $0.00 marginal, battery cost only |
| Retrieval, edge KV with CDN | $0.20/user |
| Summaries, small model | 3k calls × $0.001 ≈ $3 |
| Heavy exception calls, cloud | 500 × $0.02 ≈ $10 |
Savings Driver: 30% fewer callbacks on 1,000 jobs → $80–120k/month saved.
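The table's arithmetic can be reproduced in a few lines, which also makes it easy to swap in your own volumes. The unit prices below come from the table; `callback_savings` and its inputs are hypothetical, because the source does not state the baseline callback rate or the cost per callback behind the $80–120k figure.

```python
TECHS = 100

# Monthly OPEX lines from the example table, in dollars.
opex = {
    "retrieval_edge": TECHS * 0.20,            # $0.20 per user
    "summaries_small_model": 3_000 * 0.001,    # 3k calls
    "heavy_exception_calls": 500 * 0.02,       # 500 cloud calls
}
monthly_opex = round(sum(opex.values()), 2)


def callback_savings(jobs, baseline_callback_rate, relative_reduction,
                     cost_per_callback):
    """Dollars saved per month from avoided callbacks.

    All four inputs are yours to supply; none are given in the table.
    """
    avoided = jobs * baseline_callback_rate * relative_reduction
    return avoided * cost_per_callback
```

With these line items the AI spend totals about $33 per month for 100 techs, so the finance conversation is really about the savings inputs rather than the compute bill.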
The Reference Architecture: Deployable
Device: Phone / Headset
- Wake / VAD
- Small NLU
- TTS cache
- Pose tracking
- OCR-lite
- Local Proof Store, encrypted
- Event Log: append-only state machine for steps
- Resumable sync queue
Edge: Regional
- RAG Gateway with KB shards, vector index, and hot-doc prefetch to devices
- Summarization / Normalization using small models
- Policy Engine for retention and redaction
- Signed Media Store, if permitted:
- Pre-signed URLs
- Lifecycle rules
- WORM options
Cloud
- Fleet Learning from de-identified traces
- Model Registry
- A/B rollout
- Observability:
- Per-step latency SLOs
- Offline rate
- Failure taxonomy
Event Types
voice.intent
step.enter
step.exit
photo.proof
reading.capture
summary.ready
sync.retry
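One way to carry these event types through the append-only log is a small envelope with a unique ID, so that a retried event is recognized and dropped. This is a sketch under assumptions: the `Event` fields, the `EventLog` class, and the JSON Lines export are illustrative choices, not a schema given in the source.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field

EVENT_TYPES = {
    "voice.intent", "step.enter", "step.exit", "photo.proof",
    "reading.capture", "summary.ready", "sync.retry",
}


@dataclass(frozen=True)
class Event:
    type: str
    job_id: str
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")


class EventLog:
    """Append-only log that ignores events it has already seen."""

    def __init__(self):
        self._events = []
        self._seen = set()

    def append(self, event):
        """Return True if stored, False if it was a duplicate retry."""
        if event.event_id in self._seen:
            return False
        self._seen.add(event.event_id)
        self._events.append(event)
        return True

    def __len__(self):
        return len(self._events)

    def to_jsonl(self):
        """Serialize for the sync queue, one JSON object per line."""
        return "\n".join(json.dumps(asdict(e), sort_keys=True)
                         for e in self._events)
```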
Reliability Engineering: When the Tower Goes Dark
Patterns
- Offline-first UI: Everything important works with zero network.
- Idempotent events: Retried events don’t duplicate steps.
- Backpressure: Pause nonessential uploads when battery or network is low.
- Escalation ladder: Small → medium → large model hops only when needed.
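The backpressure and escalation patterns above can be sketched as two small decision functions. Both are illustrative: the 20% battery threshold, the `network_strong` flag, and the `good_enough` callback are assumptions to tune for your own devices and models.

```python
def should_upload(essential, connected, network_strong, battery_pct,
                  min_battery_pct=20):
    """Backpressure: hold nonessential uploads on weak network or low battery."""
    if not connected:
        return False
    if essential:
        return True
    return network_strong and battery_pct >= min_battery_pct


def escalate(request, ladder, good_enough):
    """Try models smallest-first and stop at the first acceptable answer.

    `ladder` is a list of (name, callable) pairs ordered small to large.
    If no answer is acceptable, the largest model's answer is returned.
    """
    if not ladder:
        raise ValueError("ladder must contain at least one model")
    for name, model in ladder:
        answer = model(request)
        if good_enough(answer):
            break
    return name, answer
```

Because the loop stops at the first acceptable answer, larger models are only invoked on the exception path, which is what keeps cloud spend bursty rather than constant.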
Minimal State Machine
stateDiagram-v2
[*] --> Idle
Idle --> Listening : wake
Listening --> Intent : vad_ok
Intent --> Step : intent_ok
Step --> Capture : requires_proof
Capture --> Step : proof_ok
Step --> Summarize : job_end
Summarize --> Sync : local_ok
Sync --> Idle : remote_ok
state Sync {
[*] --> Queue
Queue --> Retry : net_fail/battery_low
Retry --> Queue : recover
Queue --> [*] : ok
}
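The diagram translates directly into a transition table. The sketch below mirrors its states and events; treating `net_fail/battery_low` as two separate events, naming the nested exit state `Done`, and ignoring unknown events are assumptions made here for illustration.

```python
# Top-level job flow, copied from the diagram.
TRANSITIONS = {
    ("Idle", "wake"): "Listening",
    ("Listening", "vad_ok"): "Intent",
    ("Intent", "intent_ok"): "Step",
    ("Step", "requires_proof"): "Capture",
    ("Capture", "proof_ok"): "Step",
    ("Step", "job_end"): "Summarize",
    ("Summarize", "local_ok"): "Sync",
    ("Sync", "remote_ok"): "Idle",
}

# Nested Sync sub-machine; "Done" stands in for the diagram's exit node.
SYNC_TRANSITIONS = {
    ("Queue", "net_fail"): "Retry",
    ("Queue", "battery_low"): "Retry",
    ("Retry", "recover"): "Queue",
    ("Queue", "ok"): "Done",
}


def step(state, event, table=TRANSITIONS):
    """Apply one event; events with no matching transition leave state as-is."""
    return table.get((state, event), state)


def run(events, state="Idle", table=TRANSITIONS):
    """Replay a sequence of events, for example from the local event log."""
    for event in events:
        state = step(state, event, table)
    return state
```

Replaying the event log through `run` after a crash or restart recovers the job state deterministically, which is what lets a job resume mid-step without network access.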

Evaluation: Prove It Works and Keeps Working
SLOs
- Voice round-trip, p95: ≤200 ms
- AR overlay stability: ≥95% of session time
- Proof bundle completeness: ≥99% of jobs
- Offline successful close-out: ≥90% of offline starts
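Checking the voice SLO comes down to computing a p95 over measured round-trips. The sketch below uses the nearest-rank percentile definition, which is one of several common conventions; the function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_voice_slo(latencies_ms, limit_ms=200.0):
    """True when the p95 voice round-trip is within the limit."""
    return percentile_nearest_rank(latencies_ms, 95) <= limit_ms
```

With small sample counts a single slow request can swing the p95, so it helps to evaluate the SLO over a window long enough to hold at least a few hundred interactions.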
Field KPIs
- Callback rate: ↓ 25–35% within 60–90 days
- Close-out time: ↓ 8–15 minutes/job
- Warranty approval time: ↓ 30–50%
Experiment Design
- A/B by geography or crew.
- Pre-register metrics and hold out at least one crew for 8 weeks.
- Weekly triage on failure taxonomy:
- Where did it break?
- Why did it break?
- What should be fixed first?
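For the A/B comparison itself, the headline number is the relative reduction in callback rate between the held-out crew and the treated crew. This sketch computes only that point estimate; it does no significance testing, and the function names are hypothetical.

```python
def callback_rate(callbacks, jobs):
    """Fraction of jobs that needed a return visit."""
    if jobs <= 0:
        raise ValueError("jobs must be positive")
    return callbacks / jobs


def relative_reduction(control_rate, treatment_rate):
    """Relative drop versus control; 0.30 means 30% fewer callbacks."""
    if control_rate <= 0:
        raise ValueError("control rate must be positive")
    return (control_rate - treatment_rate) / control_rate
```

For example, a control crew with 40 callbacks on 200 jobs and a treated crew with 28 on 200 gives a relative reduction of about 0.30. Crew-level samples are usually small, so treat a single period's delta as a signal to confirm across crews rather than as proof.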

Security & Trust Controls: Operator-Friendly
- Zero-retention toggles at tenant and job-type levels.
- Admin-visible processing map that clearly shows what stayed local versus what left the device.
- Per-step attestations cryptographically signed by device keys.
- Customer-facing proof packet with redacted media and readings.
Rollout Blueprint: 90 Days
Phase 0: Week 0 — Choose One Job Type + Baseline
- Instrument:
- Callback percentage
- Close-out minutes
- Failure taxonomy
- Pick a high-volume, well-defined job, such as PM visits.
Phase 1: Weeks 1–3 — On-Device Voice Core
- Ship wake / VAD / NLU / TTS.
- Cache scripted steps locally.
Success metric: p95 voice round-trip ≤200 ms.
Phase 2: Weeks 4–6 — Proof Defaults + OCR
- Require 2–3 annotated photos plus readings.
- Perform local redaction.
Success metric: Proof bundle completeness ≥95%.
Phase 3: Weeks 7–9 — AR & RAG Edge
- Pilot overlays on 1–2 confusing components.
- Enable hot-doc caches.
Success metric: Callback delta ↓ ≥25% versus baseline crew.
Phase 4: Weeks 10–13 — Observability & A/B
- Model registry
- Cohort testing
- Weekly failure triage
Success metric: Stable deltas that reproduce across crews.
The Executive POV: Why This Pays
Risk: Trust & Privacy
Sending raw field media to generic clouds is a non-starter for many enterprise buyers. On-device first wins deals by default: sensitive data stays local; only derived signals or redacted assets egress.
Speed: Behavior Change
Sub-second guidance, with p95 ≤200 ms, shifts technician behavior immediately. When the “right way” is also the fastest way, adoption sticks without mandates.
Scale: Smart Use of Cloud
Edge and cloud are used surgically for fleet learning, summaries, and cross-site search—not for every keystroke. This lowers variable costs and reduces blast radius.
Return: Hard Dollars
Callback reduction is the cash engine; close-out acceleration and faster warranty approvals compound the benefit. The result is higher gross margin with no additional headcount.
Bottom line: Place intelligence as close to the work as possible. Promote only what benefits from the crowd.

Appendix A — Quick Architecture Checklist
- Voice pipeline p95 ≤200 ms; TTS cached.
  Wake / VAD / NLU / TTS tuned for sub-second round-trip; cache hot prompts and synthesis.
- On-device redaction; zero-retention pathways.
  Blur faces and PII on device; enforce tenant/job-type zero-retention toggles.
- Local proof store, encrypted; immutable hashes.
  Photo, video, and readings written to an encrypted store; generate content hashes for audit trails.
- RAG gateway with device hot-doc cache.
  Retrieve only the smallest relevant chunks; cache job-type manuals and SOPs locally for offline use.
- Offline-first state machine + resumable queues.
  Deterministic job state transitions; queue uploads and retry with backoff when connectivity returns.
- Observability: step SLOs, offline rate, failure taxonomy.
  Emit per-step latency and accuracy metrics, percentage of time offline, and standardized failure codes.
- A/B cohorts; callback and close-out KPIs.
  Cohort flags by crew or region; pre-register metrics; track callbacks ↓, close-out minutes ↓, and warranty approvals ↑.
Conclusion & Next Steps
Reducing callbacks and accelerating close-outs is a systems problem, not a single feature bet. The pattern that survives contact with the field is consistent:
- Latency: Sub-200 ms voice loops change behavior.
- Clarity: AR and visual confirmations remove ambiguity where it actually occurs.
- Proof by default: Evidence is captured as a byproduct of doing the work.
- Privacy by design: On-device first, zero-retention paths, and auditable flows.
- Observability: SLOs and failure taxonomy make improvement compounding.
Adopt one thing this quarter: start with Voice → Guidance → Proof on a single, well-scoped job type.
Baseline, instrument, A/B, and iterate weekly. The 25–35% callback reduction isn’t a moonshot—it’s a byproduct of removing friction and ambiguity where techs feel it most.
Call to action: If you’d like a reference flow, SLO template, or a pilot plan tailored to your fleet, reach out—let’s make first-time fix the norm, not the exception.
Further Reading & Tools
- SLO Starter Sheet: Voice RT p95, AR stability percentage, proof completeness percentage, offline close-out percentage
- Failure Taxonomy Template: Diagnosis, procedure, proof, handoff, close-out
- Field Trial Playbook: Cohort design, pre-registration, weekly triage ritual
- Edge Privacy Checklist: On-device redaction, zero-retention modes, processing map
Need a copy of the templates? Email [email protected] with subject line: Field SLO Kit.