For load balancing and redundancy, we recommend deploying multiple Connector instances to represent the same logical Connector. The instances can operate entirely independently, but sharing state across them enables advanced features such as distributed rate limiting and BigQuery job ID caching. To share state, the instances must form a cluster.

Configuration

Discovery

To enable clustering, Connector instances must be able to discover each other. Clustering is supported on AWS ECS and Kubernetes.
For AWS ECS deployments, configure IAM permissions to allow Connector instances to list and describe ECS tasks. This enables them to discover other instances in the same service:
# Allow ECS tasks to discover peers for cluster formation
resource "aws_iam_policy" "ecs_task_permissions" {
  name        = "connector-ecs-task-permissions"
  description = "Allow ECS tasks to discover peers for cluster formation"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = ["ecs:DescribeTasks", "ecs:ListTasks"],
        Effect   = "Allow",
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_task_permissions" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = aws_iam_policy.ecs_task_permissions.arn
}
See our AWS Terraform example for a complete configuration.
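For Kubernetes deployments, the Connector instead discovers peers by listing pods (see How It Works below), so its service account needs RBAC permission to do so. Below is a minimal sketch using the Terraform Kubernetes provider; the "connectors" namespace and "connector" service account names are placeholders, and your exact manifest may differ:
# Hypothetical RBAC sketch: allow Connector pods to list their peers.
resource "kubernetes_role" "connector_discovery" {
  metadata {
    name      = "connector-peer-discovery"
    namespace = "connectors" # placeholder namespace
  }

  rule {
    api_groups = [""]
    resources  = ["pods"]
    verbs      = ["get", "list"]
  }
}

resource "kubernetes_role_binding" "connector_discovery" {
  metadata {
    name      = "connector-peer-discovery"
    namespace = "connectors"
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "Role"
    name      = kubernetes_role.connector_discovery.metadata[0].name
  }

  subject {
    kind      = "ServiceAccount"
    name      = "connector" # placeholder service account
    namespace = "connectors"
  }
}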

Network

Connectors communicate with each other over the network on port 7946 (TCP and UDP).
Configure the security group to allow inter-instance communication on the cluster ports:
# Allow internal cluster communication on port 7946
resource "aws_security_group" "connector" {
  ingress {
    description = "Cluster communication port TCP"
    from_port   = 7946
    to_port     = 7946
    protocol    = "tcp"
    self        = true  # Allow from same security group
  }
  ingress {
    description = "Cluster communication port UDP"
    from_port   = 7946
    to_port     = 7946
    protocol    = "udp"
    self        = true  # Allow from same security group
  }
  # ...
}
See our AWS Terraform example for a complete configuration.

Example

Here’s an example policy that uses the cluster’s shared state to enforce a rate limit on S3 bucket access:
package formal.v2

import future.keywords.if

pre_request := { "action": "block", "type": "block_with_formal_message" } if {
  input.bucket.name == "sensitive-bucket"
  input.bucket.access_count.last_minute > 5
}
This policy blocks access to sensitive-bucket once the user has accessed it more than five times in the last minute, counted across all Connector instances.

How It Works

Instances discover each other via service discovery (e.g. ECS task listing, Kubernetes pod listing) and communicate over port 7946 (TCP and UDP) using a gossip protocol. Each node maintains a membership list and shared state such as rate limit counters and cache entries.
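This membership behavior matches gossip libraries such as HashiCorp's open-source memberlist, whose default port is also 7946; whether the Connector uses it internally is an assumption, not something this documentation confirms. As a sketch, here is how instances might form a membership list with memberlist (the node name and peer addresses are hypothetical):
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// DefaultLANConfig gossips on port 7946 (TCP and UDP), the same
	// port the Connector cluster uses.
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = "connector-1" // hypothetical unique instance name

	list, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatalf("failed to start gossip layer: %v", err)
	}

	// In the Connector, these addresses would come from service
	// discovery (ECS task listing or Kubernetes pod listing).
	peers := []string{"10.0.1.12", "10.0.2.34"} // hypothetical peers
	if _, err := list.Join(peers); err != nil {
		log.Printf("could not reach any peers yet: %v", err)
	}

	// Every node converges on the same membership list via gossip.
	for _, m := range list.Members() {
		fmt.Printf("member %s at %s\n", m.Name, m.Addr)
	}
}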

Node Failure and Recovery

When a cluster member becomes unresponsive:
  1. Other nodes detect the failure through missed gossip heartbeats
  2. The node is marked as failed in the membership list
  3. After 1 minute, the failed node is removed from the cluster state
  4. Remaining healthy nodes continue operating with shared state preserved
When a new or recovered instance joins the cluster:
  1. The instance discovers existing members via service discovery
  2. It sends a join request to known peers
  3. A peer sends recent state updates from its event buffer, up to 10,000 updates (see the sketch after this list)
  4. The node can resume normal operation with up-to-date shared state
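The event buffer itself is not exposed, but conceptually it behaves like a bounded log that a peer replays on join. A sketch of that idea follows; the type and field names are illustrative, not the Connector's actual API:
// Hypothetical sketch of the bounded event buffer used to bring a
// joining node up to date: peers keep the most recent updates and
// replay them in order on a join request.
package cluster

const bufferCap = 10000 // matches the documented replay limit

// StateUpdate is a hypothetical shared-state event, e.g. a rate
// limit counter increment, ordered by a Lamport timestamp (see
// Consistency Model below).
type StateUpdate struct {
	LamportTime uint64
	Key         string // e.g. "bucket:sensitive-bucket:access_count"
	Delta       int64
}

// EventBuffer retains up to bufferCap recent updates.
type EventBuffer struct {
	events []StateUpdate
}

// Append records a new update, dropping the oldest once full.
func (b *EventBuffer) Append(u StateUpdate) {
	if len(b.events) == bufferCap {
		b.events = b.events[1:]
	}
	b.events = append(b.events, u)
}

// Replay returns all buffered updates newer than the joiner's last
// seen Lamport time, so a recovering node can catch up.
func (b *EventBuffer) Replay(since uint64) []StateUpdate {
	var out []StateUpdate
	for _, u := range b.events {
		if u.LamportTime > since {
			out = append(out, u)
		}
	}
	return out
}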

Consistency Model

The cluster uses eventual consistency via gossip. State updates are broadcast to all nodes and causally ordered using a Lamport clock. There is a brief propagation delay before all nodes converge on the same state. The Control Plane remains the source of truth for all configuration (policies, resources, users).
Rate limit counters may be briefly inaccurate during this propagation window: with strict rate limiting, a user might get a few extra requests through before the limit is enforced on every node.
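A Lamport clock is a simple counter that provides this causal ordering: every update carries a timestamp, and every node advances its own counter past any timestamp it receives, so later local events always sort after the events they observed. An illustrative sketch, not the Connector's actual implementation:
// Minimal Lamport clock, the causal-ordering mechanism the cluster
// uses for state updates. Names here are illustrative.
package cluster

import "sync"

type LamportClock struct {
	mu   sync.Mutex
	time uint64
}

// Tick advances the clock for a locally generated update and
// returns the timestamp to attach to it.
func (c *LamportClock) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.time++
	return c.time
}

// Witness merges a timestamp received via gossip: the local clock
// jumps past the remote time so later local events sort after it.
func (c *LamportClock) Witness(remote uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if remote > c.time {
		c.time = remote
	}
	c.time++
}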

Network Partitions

If a network partition separates cluster members, each partition continues operating independently with its own local state; rate limiting and cache sharing work only within each partition. When the partition heals, nodes rejoin and reconcile state via event replay. Configuration and policies are unaffected, since each Connector independently receives updates from the Control Plane.