Once a Connector is deployed, keeping it running smoothly requires occasional maintenance. This guide covers day-to-day operational tasks: upgrades, health monitoring, and what to do when things go wrong.

Upgrading the Connector

The Connector is distributed as a Docker image and follows semantic versioning (e.g., 1.35.0). Updates are applied by pulling the new image version and restarting the container; there is no in-place auto-update mechanism.
Configuration changes (policies, resources, listeners, users, etc.) are pushed from the Control Plane to the Connector in real time: they usually don’t require a restart or an upgrade.

Version Checking

Before upgrading, check which version is currently running. The version is reported in three places:
  • OTLP Metrics: the service.version attribute on all emitted metrics
  • Connector Logs: the version is logged at startup
  • Console: the “Connected Instances” section on the Connector page shows the version

Upgrade Procedure

  1. Update the Connector image tag in your ECS task definition
  2. Deploy the new task definition
  3. ECS performs a rolling update: it starts new tasks with the new image, waits for them to pass health checks, then drains and stops old tasks
  4. Verify the new version is running via health metrics
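Step 1 might look like the following task definition fragment. This is an illustrative sketch only: the family, container name, image path, and health check values are placeholders, not shipped defaults.

```json
{
  "family": "formal-connector",
  "containerDefinitions": [
    {
      "name": "connector",
      "image": "<your-registry>/connector:1.35.0",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -fsS http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      },
      "stopTimeout": 30
    }
  ]
}
```

Registering this revision and updating the service triggers the rolling update described in steps 2–3; `stopTimeout` gives the old tasks time to shut down gracefully before ECS kills them.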

Lifecycle

The Connector is designed for zero-downtime deployments.

Health Checks

The Connector exposes two HTTP health check endpoints on port 8080:
Endpoint       Purpose                                         Success   Failure
GET /health    Liveness probe: is the process running?         200 OK    No response
GET /ready     Readiness probe: is initialization complete?    200 OK    503 Service Unavailable
The readiness endpoint returns 503 until the Connector has connected to the Control Plane, loaded its configuration, and started all listeners.
The Quickstart deployment options ship with probes preconfigured against these health check endpoints.
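For a self-managed Kubernetes deployment, the probes could be wired up as follows. The timing values here are examples to adapt, not shipped defaults; only the paths and port come from the table above.

```yaml
# Illustrative probe configuration for the Connector container.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```

Because /ready returns 503 until the Connector has connected to the Control Plane and started all listeners, the readiness probe keeps new instances out of the load balancer rotation until they can actually serve traffic.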

Graceful Shutdown

When the Connector receives a SIGTERM or SIGINT signal (e.g., during a rolling update or manual stop), it performs a graceful shutdown:
  1. Stops accepting new connections on all listeners
  2. Leaves the cluster gracefully, notifying other Connector instances
  3. Stops the telemetry exporter
  4. Closes the cache and cleans up resources
  5. Shuts down the health check server
Use multiple Connector instances behind a load balancer to ensure clients can reconnect to a healthy instance during rolling updates.
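The shutdown ordering above can be sketched as a signal handler. This is a minimal illustration of the documented sequence; the step names are hypothetical placeholders, not the Connector's actual internals.

```python
import signal

# Hypothetical names mirroring the documented shutdown order;
# the real implementation is internal to the Connector.
SHUTDOWN_STEPS = [
    "stop_listeners",      # 1. stop accepting new connections
    "leave_cluster",       # 2. notify peer instances and leave the cluster
    "stop_telemetry",      # 3. stop the telemetry exporter
    "close_cache",         # 4. close the cache and clean up resources
    "stop_health_server",  # 5. shut down the health check server last
]

completed = []

def graceful_shutdown(signum=None, frame=None):
    """Run every shutdown step in order, even if one fails."""
    for step in SHUTDOWN_STEPS:
        try:
            completed.append(step)  # placeholder for the real work
        except Exception:
            pass  # a failing step must not block the remaining ones

# Wire the handler to SIGTERM and SIGINT, as the Connector does.
signal.signal(signal.SIGTERM, graceful_shutdown)
signal.signal(signal.SIGINT, graceful_shutdown)
```

Note that the health check server is stopped last, so orchestrator probes keep getting answers until the instance has fully drained.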

Disaster Recovery

The Connector is designed to be mostly stateless, which makes disaster recovery straightforward. All configuration (policies, resources, users, etc.) is stored in the Control Plane and pushed to the Connector on startup and in real time, so any lost instance can simply be replaced. The only local data that can be lost is:
Data            Location        Impact
Log spool       /formal/logs    Buffered logs are lost if the disk isn’t persistent
Cluster state   In-memory       Rate limit counters and cache reset; rebuilds via gossip on rejoin
When running multiple instances with clustering enabled, cluster state is shared and rebuilds automatically on rejoin.

Recovery Scenarios

Scenario: A Connector instance stops unexpectedly.
Impact: Clients connected to the failed instance are disconnected.
Recovery:
  1. The orchestrator (ECS/Kubernetes) automatically starts a replacement instance
  2. The new instance connects to the Control Plane and loads configuration
  3. Clients reconnect via the load balancer to a healthy instance
  4. If clustering is enabled, the new instance rejoins the cluster and receives state replay
Data loss: In-memory state (rate limit counters, cache) is lost unless clustering is enabled. Buffered logs on persistent disk are recovered on restart.
Scenario: All Connector instances stop unexpectedly.
Impact: All client connections are dropped.
Recovery:
  1. The orchestrator starts new instances
  2. Each instance connects to the Control Plane and loads full configuration
  3. Instances discover each other and form a new cluster
  4. Clients reconnect via the load balancer
Data loss: All in-memory state (rate limit counters, cache) is lost. Buffered logs on disk spool are recovered on restart.
Scenario: The Formal Control Plane becomes unavailable.
Impact: The Connector continues operating with its last known configuration. Configuration updates are not received, logs are spooled to disk, and new instances cannot start.
Recovery:
  • The Connector automatically reconnects when the Control Plane is restored
  • Pending configuration updates are received
  • Buffered logs are flushed
Data loss: None, if a persistent disk is available for the log spool.
Scenario: Network issues prevent some Connector instances from communicating with each other.
Impact: If clustering is enabled, each partition continues operating independently with its own local state; rate limiting and cache sharing only work within each partition.
Recovery:
  • Nodes rejoin when the partition heals
  • State is reconciled via event replay
  • Configuration and policies are not affected
Data loss: None.
Scenario: Network issues prevent the Connector from reaching the Control Plane.
Impact: The Connector continues operating with its last known configuration. Configuration updates are not received, logs are spooled to disk, and new instances cannot start.
Recovery:
  • The Connector automatically reconnects when network access is restored
  • Pending configuration updates are received
  • Buffered logs are flushed
Data loss: None, if a persistent disk is available for the log spool.

Best Practices for Resilience

Deploy at least 2 Connector instances behind a load balancer for high availability. Distribute across availability zones for fault tolerance.
Set up probes to ensure traffic is only routed to healthy instances. See Health Checks for details.
Mount a persistent volume to /formal/logs to buffer logs during outages. A typical log entry is ~1 KB — at 100 requests/sec, 10 GB provides over a day of buffer. See Log Spool for details.
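The sizing rule of thumb above works out as follows; the figures are the same illustrative ones (1 KB entries, 100 requests/sec, 10 GB of disk), not a guarantee about your workload.

```python
# Back-of-the-envelope log spool sizing.
entry_size_bytes = 1_000        # ~1 KB per log entry (assumed average)
requests_per_sec = 100          # sustained request rate
spool_bytes = 10 * 1_000**3     # 10 GB of persistent disk

bytes_per_sec = entry_size_bytes * requests_per_sec  # 100 KB/s of logs
buffer_seconds = spool_bytes / bytes_per_sec         # 100,000 seconds
buffer_hours = buffer_seconds / 3600                 # ~27.8 hours

print(f"{buffer_hours:.1f} hours of buffer")
```

At that rate the spool absorbs roughly 28 hours of logs, which is where the “over a day of buffer” figure comes from; scale the volume linearly with your request rate.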
Set up alerts on Connector metrics: heartbeat gaps, memory usage spikes, connection count anomalies, and missed Control Plane pings.
Always use rolling update strategies (the default in ECS and Kubernetes) to avoid downtime during upgrades. Ensure readiness probes are configured so traffic isn’t routed to instances that haven’t finished starting.