
Scaling guidance — what to scale, when

Stub

This How-to is a stub. Production scaling guidance will evolve as JARVIS stabilizes its durability and scheduling model.

Goal

After this guide, you will know when to scale the Run Coordinator, the executors, or the registry.

When to use this

You see timeouts or rising latency under load.
You want predictable run throughput.

Prerequisites

Basic metrics (request rate, latency, error rate) per service
Clear separation of responsibilities (stateless vs stateful services)

Steps

Scale stateless services first (gateway, selection, executors).
Scale stateful stores carefully (run store, artifact store, event stream).
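Treat coordinator scaling as durability-sensitive. Before adding coordinator replicas:
- ensure dispatch is idempotent, so a retried or repeated request has no additional effect,
- avoid double-dispatch, so two replicas never start the same node run.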
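One way to picture the idempotency and double-dispatch points above is a claim table with a unique key per node run. The sketch below is illustrative only: the table name `dispatch_claims`, the function `try_claim`, and the `run_id:node_id:attempt` key format are hypothetical and are not taken from JARVIS. It uses SQLite because this guide mentions a local SQLite store; an external database with a unique constraint would follow the same pattern.

```python
import sqlite3


def open_claims(path=":memory:"):
    """Open (or create) a claims store. Table name and schema are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dispatch_claims ("
        " idempotency_key TEXT PRIMARY KEY,"
        " coordinator_id TEXT NOT NULL)"
    )
    return conn


def try_claim(conn, run_id, node_id, attempt, coordinator_id):
    """Return True only for the first caller that claims this node run.

    The primary key makes the insert a no-op for every later caller,
    so at most one coordinator replica proceeds to dispatch.
    """
    key = f"{run_id}:{node_id}:{attempt}"  # hypothetical key format
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO dispatch_claims"
            " (idempotency_key, coordinator_id) VALUES (?, ?)",
            (key, coordinator_id),
        )
    # rowcount is 1 when the row was inserted, 0 when the key already existed
    return cur.rowcount == 1
```

A coordinator replica would call `try_claim` before dispatching and skip the dispatch when it returns `False`. The design choice here is to let the store's uniqueness constraint arbitrate between replicas rather than relying on in-process locks, which do not span replicas.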

Verify

Scaling does not change run semantics: no node run is executed more than once.
Throughput improves without regressions in latency or error rate.
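The duplicate check above can be automated against whatever execution records you already collect. The following is a minimal sketch that assumes each record is a dict with `run_id`, `node_id`, and `attempt` fields; those field names and the function `find_duplicate_node_runs` are hypothetical, so adapt them to your actual record shape.

```python
from collections import Counter


def find_duplicate_node_runs(records):
    """Return (run_id, node_id, attempt) keys that appear more than once.

    `records` is any iterable of dicts with the assumed fields
    "run_id", "node_id", and "attempt".
    """
    counts = Counter(
        (r["run_id"], r["node_id"], r["attempt"]) for r in records
    )
    return sorted(key for key, n in counts.items() if n > 1)
```

Run it on records captured before and after changing the replica count. An empty result after scaling suggests that no node run was executed twice within the sample.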

Troubleshooting

Increased duplicates → idempotency keys missing or not enforced.
Store contention → move from local SQLite to external DB when needed.
Hot spots → cache inventory snapshots carefully (with invalidation).
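For the hot-spot case, caching with invalidation could look like the sketch below. It assumes the inventory snapshot can be fetched by a single callable; the class `SnapshotCache`, its `loader` parameter, and the 30-second default TTL are hypothetical and not part of JARVIS. It combines a time-to-live with an explicit `invalidate()` call so that stale snapshots are bounded in age and can also be dropped as soon as the inventory is known to have changed.

```python
import time


class SnapshotCache:
    """Cache one snapshot, refreshed on TTL expiry or explicit invalidation."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader        # callable that fetches a fresh snapshot
        self._ttl = ttl_seconds      # maximum age of a cached snapshot
        self._clock = clock          # injectable for tests
        self._snapshot = None
        self._loaded_at = 0.0
        self._valid = False

    def get(self):
        """Return the cached snapshot, reloading if it is missing or expired."""
        now = self._clock()
        if not self._valid or now - self._loaded_at >= self._ttl:
            self._snapshot = self._loader()
            self._loaded_at = now
            self._valid = True
        return self._snapshot

    def invalidate(self):
        """Drop the cached snapshot so the next get() reloads it."""
        self._valid = False
```

This version is single-threaded and per-process. With several replicas, each one holds its own copy, so invalidation has to reach every replica (for example through the event stream) or the TTL has to be short enough to tolerate the staleness.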

Cleanup / Rollback

Reduce replicas and return to baseline.

Next steps

How-to: Enable OpenTelemetry tracing
Concept: Artifacts and replay
