Upgrading the Connector
The Connector is distributed as a Docker image and follows semantic versioning (e.g., 1.35.0). Updates are applied by pulling the latest image and restarting the container; there is no in-place auto-update mechanism.
Configuration changes (policies, resources, listeners, users, etc.) are pushed from the Control Plane to the Connector in real time: they usually don’t require a restart or an upgrade.
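For a plain Docker deployment, the pull-and-restart flow can be sketched as follows (the image path formal/connector and container name formal-connector are placeholders; substitute your registry path, container name, and target version):

```shell
# Placeholder image and container names; substitute your own.
docker pull formal/connector:1.35.0    # fetch the new version
docker stop -t 30 formal-connector     # SIGTERM; allow time for graceful shutdown
docker rm formal-connector
docker run -d --name formal-connector \
  -p 8080:8080 \
  formal/connector:1.35.0
```

Run flags (ports, volumes, environment) should match your existing container; only the image tag changes.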
Version Checking
Before upgrading, it's useful to know what version is currently running. The running version can be checked through:
- OTLP Metrics: the `service.version` attribute on all emitted metrics
- Connector Logs: the version is logged at startup
- Console: the “Connected Instances” section on the Connector page shows the version
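As a quick spot check for a Docker deployment, the startup log line can be grepped (the container name is a placeholder, and the exact log format may differ by version):

```shell
docker logs formal-connector 2>&1 | grep -i version | head -n 1
```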
Upgrade Procedure
The upgrade procedure depends on how the Connector is deployed: AWS ECS Fargate, Kubernetes, or Docker. For AWS ECS Fargate:
- Update the Connector image tag in your ECS task definition
- Deploy the new task definition
- ECS performs a rolling update: it starts new tasks with the new image, waits for them to pass health checks, then drains and stops old tasks
- Verify the new version is running via health metrics
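The steps above can be scripted with the AWS CLI (the cluster, service, and task definition names are placeholders):

```shell
# Placeholder names (my-cluster, formal-connector, taskdef.json); substitute your own.
# 1. Register a new task definition revision referencing the new image tag.
aws ecs register-task-definition --cli-input-json file://taskdef.json
# 2. Point the service at the new revision; ECS begins the rolling update.
aws ecs update-service --cluster my-cluster \
  --service formal-connector --task-definition formal-connector
# 3. Block until the new tasks are healthy and the old tasks have drained.
aws ecs wait services-stable --cluster my-cluster --services formal-connector
```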
Lifecycle
The Connector is designed for zero-downtime deployments.
Health Checks
The Connector exposes two HTTP health check endpoints on port 8080:

| Endpoint | Purpose | Success | Failure |
|---|---|---|---|
| GET /health | Liveness probe: is the process running? | 200 OK | No response |
| GET /ready | Readiness probe: is initialization complete? | 200 OK | 503 Service Unavailable |

/ready returns 503 until the Connector has connected to the Control Plane, loaded its configuration, and started all listeners.
The Quickstart deployment options come with probes already configured to make the best use of these health check endpoints.
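In deploy or init scripts, the readiness endpoint can also be polled directly before sending traffic to a new instance. A minimal sketch (port 8080 and the /ready path come from the table above; the host and the `wait_ready` name are illustrative):

```shell
# Poll GET /ready until it returns 200 OK, or give up after $2 attempts.
wait_ready() {
  host="$1"
  retries="${2:-30}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    # curl prints the HTTP status code; connection failures yield "000".
    code=$(curl -s -o /dev/null -w '%{http_code}' "$host/ready" || true)
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example: gate a deploy step on readiness (hypothetical host).
# wait_ready http://connector.internal:8080 60 || exit 1
```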
Graceful Shutdown
When the Connector receives a SIGTERM or SIGINT signal (e.g., during a rolling update or manual stop), it performs a graceful shutdown:
- Stops accepting new connections on all listeners
- Leaves the cluster gracefully, notifying other Connector instances
- Stops the telemetry exporter
- Closes the cache and cleans up resources
- Shuts down the health check server
Disaster Recovery
The Connector is designed to be mostly stateless, which makes disaster recovery straightforward. All configuration (policies, resources, users, etc.) is stored in the Control Plane and pushed to the Connector on startup and in real time, so any lost instance can simply be replaced. The only local data that can be lost is:

| Data | Location | Impact |
|---|---|---|
| Log spool | /formal/logs | Buffered logs are lost if the disk isn’t persistent |
| Cluster state | In-memory | Rate limit counters and cache reset; rebuilds via gossip on rejoin |
Recovery Scenarios
Single instance failure
Scenario: A Connector instance stops unexpectedly.
Impact: Clients connected to the failed instance are disconnected.
Recovery:
- The orchestrator (ECS/Kubernetes) automatically starts a replacement instance
- The new instance connects to the Control Plane and loads configuration
- Clients reconnect via the load balancer to a healthy instance
- If clustering is enabled, the new instance rejoins the cluster and receives state replay
All instances fail
Scenario: All Connector instances stop unexpectedly.
Impact: All client connections are dropped.
Recovery:
- The orchestrator starts new instances
- Each instance connects to the Control Plane and loads full configuration
- Instances discover each other and form a new cluster
- Clients reconnect via the load balancer
Control Plane outage
Scenario: The Formal Control Plane becomes unavailable.
Impact: The Connector continues operating with its last known configuration. Configuration updates are not received, logs are spooled to disk, and new instances cannot start.
Recovery:
- The Connector automatically reconnects when the Control Plane is restored
- Pending configuration updates are received
- Buffered logs are flushed
Network partition: Connectors cluster
Scenario: Network issues prevent some Connector instances from communicating with each other.
Impact: If clustering is enabled, each partition continues operating independently with its own local state; rate limiting and cache sharing only work within each partition.
Recovery:
- Nodes rejoin when the partition heals
- State is reconciled via event replay
- Configuration and policies are not affected
Network partition: Control Plane
Scenario: Network issues prevent the Connector from reaching the Control Plane.
Impact: The Connector continues operating with its last known configuration. Configuration updates are not received, logs are spooled to disk, and new instances cannot start.
Recovery:
- The Connector automatically reconnects when network access is restored
- Pending configuration updates are received
- Buffered logs are flushed
Best Practices for Resilience
Run multiple instances
Deploy at least 2 Connector instances behind a load balancer for high availability. Distribute across availability zones for fault tolerance.
Configure probes
Set up probes to ensure traffic is only routed to healthy instances. See Health Checks for details.
Configure disk spool
Mount a persistent volume to /formal/logs to buffer logs during outages. A typical log entry is ~1 KB; at 100 requests/sec, 10 GB provides over a day of buffer. See Log Spool for details.
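The sizing guidance above can be sanity-checked with quick arithmetic (the figures are this section's estimates, not guarantees):

```shell
# Buffer duration = disk size / (entry size * request rate)
bytes_per_entry=1024                        # ~1 KB per log entry
req_per_sec=100
disk_bytes=$((10 * 1024 * 1024 * 1024))     # 10 GB spool volume
seconds=$((disk_bytes / (bytes_per_entry * req_per_sec)))
echo "$((seconds / 3600)) hours of buffer"  # ~29 hours, just over a day
```

Scale the volume size to your own request rate and the outage window you want to survive.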
Monitor Connector health
Set up alerts on Connector metrics: heartbeat gaps, memory usage spikes, connection count anomalies, and missed Control Plane pings.
Use rolling updates
Always use rolling update strategies (the default in ECS and Kubernetes) to avoid downtime during upgrades. Ensure readiness probes are configured so traffic isn’t routed to instances that haven’t finished starting.
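On Kubernetes, for example, a rolling image update gated on readiness can be sketched as follows (the deployment name, container name, and image path are placeholders):

```shell
# Placeholder names; substitute your deployment, container, and image tag.
kubectl set image deployment/formal-connector connector=formal/connector:1.35.0
# Block until new pods pass their readiness probe and old pods terminate.
kubectl rollout status deployment/formal-connector --timeout=5m
```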