Cluster Failover Testing
Objective
This test simulates a scenario in which the central cluster goes down. It evaluates the behavior of Kafka topics, leader election, in-sync replicas (ISRs), and how a stretched Kafka cluster handles failover. After the failover test, the central cluster is brought back up to determine whether the previous state is restored.
Cluster setup
We have three Kubernetes clusters with a stretched Kafka deployment across them.
Cluster Configurations
Central Cluster (calico-1)
$ .kube % kubectl get pods --kubeconfig calico-1 -n strimzi
NAME READY STATUS RESTARTS AGE
my-cluster-broker-0 1/1 Running 0 23h
my-cluster-broker-1 1/1 Running 0 23h
my-cluster-broker-2 1/1 Running 0 23h
my-cluster-controller-3 1/1 Running 0 23h
my-cluster-controller-4 1/1 Running 0 23h
my-cluster-controller-5 1/1 Running 0 23h
strimzi-cluster-operator-5b7b9d9bf6-w4hxl 1/1 Running 0 23h
Stretch Cluster 1 (calico-2)
$ .kube % kubectl get pods --kubeconfig calico-2 -n strimzi
NAME READY STATUS RESTARTS AGE
my-cluster-stretch1-broker-6 1/1 Running 0 23h
my-cluster-stretch1-broker-7 1/1 Running 0 23h
my-cluster-stretch1-broker-8 1/1 Running 0 23h
my-cluster-stretch1-controller-10 1/1 Running 0 23h
my-cluster-stretch1-controller-11 1/1 Running 0 23h
my-cluster-stretch1-controller-9 1/1 Running 0 23h
strimzi-cluster-operator-6d7db9dd95-k5m24 1/1 Running 0 23h
Stretch Cluster 2 (calico-3)
$ .kube % kubectl get pods --kubeconfig calico-3 -n strimzi
NAME READY STATUS RESTARTS AGE
my-cluster-stretch2-broker-12 1/1 Running 0 23h
my-cluster-stretch2-broker-13 1/1 Running 0 23h
my-cluster-stretch2-broker-14 1/1 Running 0 23h
my-cluster-stretch2-controller-15 1/1 Running 0 23h
my-cluster-stretch2-controller-16 1/1 Running 0 23h
my-cluster-stretch2-controller-17 1/1 Running 0 23h
strimzi-cluster-operator-7966fb9659-zqfmv 1/1 Running 0 23h
The central cluster contains the Kafka and KafkaNodePool CRs:
$ .kube % kubectl get kafka -n strimzi --kubeconfig calico-1
NAME DESIRED KAFKA REPLICAS DESIRED ZK REPLICAS READY METADATA STATE WARNINGS
my-cluster
$ .kube % kubectl get kafkanodepool -n strimzi --kubeconfig calico-1
NAME DESIRED REPLICAS ROLES NODEIDS
broker 3 ["broker"] [0,1,2]
controller 3 ["controller"] [3,4,5]
stretch1-broker 3 ["broker"] [6,7,8]
stretch1-controller 3 ["controller"] [9,10,11]
stretch2-broker 3 ["broker"] [12,13,14]
stretch2-controller 3 ["controller"] [15,16,17]
Listing the metadata quorum
Checking whether the quorum metadata reflects all brokers and controllers across the three clusters. The nine controllers (IDs 3-5, 9-11, 15-17) should appear as voters and the nine brokers (IDs 0-2, 6-8, 12-14) as observers:
[kafka@my-cluster-broker-0 kafka]$ bin/kafka-metadata-quorum.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 describe --status
ClusterId: 1RYWwDxMT8mT0lpiqtc69w
LeaderId: 11
LeaderEpoch: 195814
HighWatermark: 168406
MaxFollowerLag: 0
MaxFollowerLagTimeMs: 54
CurrentVoters: [16,17,3,4,5,9,10,11,15]
CurrentObservers: [0,1,2,6,7,8,12,13,14]
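The same tool can also report per-node replication details (log end offset and lag for every voter and observer), which is useful for spotting a lagging controller before running a failover test. Command only; output omitted:

[kafka@my-cluster-broker-0 kafka]$ bin/kafka-metadata-quorum.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 describe --replication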
Topic Creation and Message Testing
[kafka@my-cluster-broker-0 kafka]$ bin/kafka-topics.sh --create --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --replication-factor 6 --partitions 6 --topic failover-test
Created topic failover-test.
Describing the topic shows that partitions 4 and 5 have leaders from the central cluster (brokers 0 and 1), and that every partition's ISR contains brokers from the central cluster.
[kafka@my-cluster-broker-0 kafka]$ bin/kafka-topics.sh --describe --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
Topic: failover-test TopicId: 7U-yMkfgT1GfJRY-DoyEhQ PartitionCount: 6 ReplicationFactor: 6 Configs: min.insync.replicas=2
Topic: failover-test Partition: 0 Leader: 8 Replicas: 8,12,13,14,0,1 Isr: 8,12,13,14,0,1 Elr: LastKnownElr:
Topic: failover-test Partition: 1 Leader: 12 Replicas: 12,13,14,0,1,2 Isr: 12,13,14,0,1,2 Elr: LastKnownElr:
Topic: failover-test Partition: 2 Leader: 13 Replicas: 13,14,0,1,2,6 Isr: 13,14,0,1,2,6 Elr: LastKnownElr:
Topic: failover-test Partition: 3 Leader: 14 Replicas: 14,0,1,2,6,7 Isr: 14,0,1,2,6,7 Elr: LastKnownElr:
Topic: failover-test Partition: 4 Leader: 0 Replicas: 0,1,2,6,7,8 Isr: 0,1,2,6,7,8 Elr: LastKnownElr:
Topic: failover-test Partition: 5 Leader: 1 Replicas: 1,2,6,7,8,12 Isr: 1,2,6,7,8,12 Elr: LastKnownElr:
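Because acks=all makes the leader wait for every in-sync replica, and each ISR here spans brokers in multiple clusters, a write acknowledged under that setting survives the loss of a whole cluster. To be explicit about it when producing (console-producer defaults vary across Kafka versions), the property can be passed directly:

[kafka@my-cluster-broker-0 kafka]$ bin/kafka-console-producer.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test --producer-property acks=all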
Producing and consuming messages from the topic
[kafka@my-cluster-broker-0 kafka]$ bin/kafka-console-producer.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
>asdfasdf
>this is stretch
>sending data from one cluster to the other
>Hello Kafka
>pushing enough messages to kafka cluster
>Hello world
>Testing
>Testing asdf
[kafka@my-cluster-stretch1-broker-8 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
asdfasdf
this is stretch
sending data from one cluster to the other
Hello Kafka
pushing enough messages to kafka cluster
Hello world
Testing
Testing asdf
^CProcessed a total of 8 messages
Simulating Central Cluster Failure
Manually shutting down the central cluster to simulate a full cluster failure.
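The exact shutdown mechanism is environment-specific. One approach is to power off the calico-1 nodes themselves rather than deleting pods (which the operator would simply recreate); a sketch with hypothetical hostnames:

# Hostnames are illustrative; power off every node of the calico-1 cluster
# so its API server, kubelets, and Kafka pods all become unreachable.
for node in calico-1-control-plane calico-1-worker-1 calico-1-worker-2; do
  ssh "$node" sudo shutdown -h now
done

With the cluster down, kubectl can no longer reach the calico-1 API server: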
$ .kube % kubectl get pods --kubeconfig calico-1 -n strimzi -v=8
I0313 13:57:19.830013 14803 loader.go:395] Config loaded from file: calico-1
I0313 13:57:19.834745 14803 round_trippers.go:463] GET https://9.46.88.97:6443/api/v1/namespaces/strimzi/pods?limit=500
I0313 13:57:19.834762 14803 round_trippers.go:469] Request Headers:
I0313 13:57:19.834768 14803 round_trippers.go:473] Accept: application/json;as=Table;v=v1;g=meta.k8s.io,application/json;as=Table;v=v1beta1;g=meta.k8s.io,application/json
I0313 13:57:19.834771 14803 round_trippers.go:473] User-Agent: kubectl1.31.1/v1.31.1 (darwin/arm64) kubernetes/948afe5
I0313 13:57:49.837331 14803 round_trippers.go:574] Response Status: in 30002 milliseconds
I0313 13:57:49.837405 14803 round_trippers.go:577] Response Headers:
I0313 13:57:49.838573 14803 helpers.go:264] Connection error: Get https://9.46.88.97:6443/api/v1/namespaces/strimzi/pods?limit=500: dial tcp 9.46.88.97:6443: i/o timeout
Unable to connect to the server: dial tcp 9.46.88.97:6443: i/o timeout
Testing whether we can still produce and consume using the two surviving clusters.
We produce from stretch cluster 2 (calico-3) and consume from stretch cluster 1 (calico-2):
[kafka@my-cluster-stretch2-broker-12 kafka]$ bin/kafka-console-producer.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
>hello world
>Seems like the data is not lost
[kafka@my-cluster-stretch1-broker-8 kafka]$ bin/kafka-console-consumer.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test --from-beginning
asdfasdf
this is stretch
sending data from one cluster to the other
Hello Kafka
pushing enough messages to kafka cluster
Hello world
Testing
Testing asdf
hello world
Seems like the data is not lost
^CProcessed a total of 10 messages
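This works because the metadata quorum still holds a majority: six of the nine controller voters (9, 10, 11 and 15, 16, 17) survive the loss of controllers 3, 4, and 5. Re-running the quorum status check from a surviving broker should show a leader among the remaining controllers (command only; the output was not captured during this run):

[kafka@my-cluster-stretch2-broker-12 kafka]$ bin/kafka-metadata-quorum.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 describe --status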
Checking whether leader election assigned new partition leaders
Partitions 4 and 5, which had leaders from the failed central cluster, have been assigned a new leader (broker 6) from stretch cluster 1. Brokers 0, 1, and 2 have also dropped out of every partition's ISR:
[kafka@my-cluster-stretch2-broker-12 kafka]$ bin/kafka-topics.sh --describe --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
Topic: failover-test TopicId: 7U-yMkfgT1GfJRY-DoyEhQ PartitionCount: 6 ReplicationFactor: 6 Configs: min.insync.replicas=2
Topic: failover-test Partition: 0 Leader: 8 Replicas: 8,12,13,14,0,1 Isr: 8,12,14,13 Elr: LastKnownElr:
Topic: failover-test Partition: 1 Leader: 12 Replicas: 12,13,14,0,1,2 Isr: 12,14,13 Elr: LastKnownElr:
Topic: failover-test Partition: 2 Leader: 13 Replicas: 13,14,0,1,2,6 Isr: 14,6,13 Elr: LastKnownElr:
Topic: failover-test Partition: 3 Leader: 14 Replicas: 14,0,1,2,6,7 Isr: 14,6,7 Elr: LastKnownElr:
Topic: failover-test Partition: 4 Leader: 6 Replicas: 0,1,2,6,7,8 Isr: 6,7,8 Elr: LastKnownElr:
Topic: failover-test Partition: 5 Leader: 6 Replicas: 1,2,6,7,8,12 Isr: 6,7,8,12 Elr: LastKnownElr:
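A quicker way to find partitions whose ISR has shrunk below the full replica list is the built-in filter on kafka-topics.sh; during the outage it should list all six partitions of this topic:

[kafka@my-cluster-stretch2-broker-12 kafka]$ bin/kafka-topics.sh --describe --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test --under-replicated-partitions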
Recovering the central cluster
Manually recovering the central cluster from failure and watching the pods come back up:
$ .kube % kubectl get pods --kubeconfig calico-1 -n strimzi -w
NAME READY STATUS RESTARTS AGE
my-cluster-broker-0 0/1 Running 1 (2m6s ago) 24h
my-cluster-broker-1 0/1 Running 1 (2m9s ago) 23h
my-cluster-broker-2 0/1 Running 1 (2m9s ago) 24h
my-cluster-controller-3 1/1 Running 1 (2m9s ago) 24h
my-cluster-controller-4 1/1 Running 1 (2m6s ago) 24h
my-cluster-controller-5 1/1 Running 1 (2m9s ago) 24h
strimzi-cluster-operator-5b7b9d9bf6-w4hxl 0/1 Running 1 (2m6s ago) 24h
strimzi-cluster-operator-5b7b9d9bf6-w4hxl 1/1 Running 1 (2m11s ago) 24h
my-cluster-broker-0 1/1 Running 1 (2m12s ago) 24h
my-cluster-broker-2 1/1 Running 1 (2m17s ago) 24h
my-cluster-broker-1 1/1 Running 1 (2m56s ago) 23h
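Pod readiness is only part of the picture; the operator also sets a Ready condition on the Kafka CR once it considers the cluster reconciled. A one-liner to block until then (the timeout value is an arbitrary choice):

$ kubectl wait kafka/my-cluster -n strimzi --kubeconfig calico-1 --for=condition=Ready --timeout=600s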
Testing whether the central cluster's brokers rejoin the ISRs
The brokers from the recovered cluster are catching up and rejoining the ISRs. Note that broker 1, the last pod to become ready, has not yet rejoined any ISR in this capture:
[kafka@my-cluster-stretch2-broker-12 kafka]$ bin/kafka-topics.sh --describe --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --topic failover-test
Topic: failover-test TopicId: 7U-yMkfgT1GfJRY-DoyEhQ PartitionCount: 6 ReplicationFactor: 6 Configs: min.insync.replicas=2
Topic: failover-test Partition: 0 Leader: 8 Replicas: 8,12,13,14,0,1 Isr: 0,14,13,12,8 Elr: LastKnownElr:
Topic: failover-test Partition: 1 Leader: 12 Replicas: 12,13,14,0,1,2 Isr: 0,14,13,2,12 Elr: LastKnownElr:
Topic: failover-test Partition: 2 Leader: 14 Replicas: 13,14,0,1,2,6 Isr: 0,14,6,13,2 Elr: LastKnownElr:
Topic: failover-test Partition: 3 Leader: 14 Replicas: 14,0,1,2,6,7 Isr: 0,14,6,2,7 Elr: LastKnownElr:
Topic: failover-test Partition: 4 Leader: 6 Replicas: 0,1,2,6,7,8 Isr: 0,6,2,7,8 Elr: LastKnownElr:
Topic: failover-test Partition: 5 Leader: 6 Replicas: 1,2,6,7,8,12 Isr: 6,2,12,7,8 Elr: LastKnownElr:
Interpretation of Results
- With six of the nine controller voters surviving (a clear majority), the metadata quorum remains available when the central cluster fails, and partition leadership moves to brokers in the surviving clusters.
- Message production and consumption continue from the surviving clusters, and no previously written data is lost, demonstrating the resilience of the stretched deployment.
- Upon recovery, the central cluster's brokers rejoin the ISRs. Leadership does not immediately return to them in this capture; a preferred leader election (see the sketch after this list) can restore the original assignments.
- The system effectively handles failover and restoration, ensuring high availability and minimal downtime.
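As noted above, leadership did not automatically move back to the recovered brokers in this capture. Kafka's auto.leader.rebalance.enable setting (true by default) eventually re-elects the preferred, first-listed replica for each partition; to trigger that immediately, the bundled election tool can be run. A sketch against the same bootstrap address:

[kafka@my-cluster-broker-0 kafka]$ bin/kafka-leader-election.sh --bootstrap-server my-cluster-kafka-bootstrap.strimzi.svc:9092 --election-type preferred --all-topic-partitions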
Conclusion
This failover test validates the effectiveness of stretched Kafka clusters in handling central cluster failures while maintaining data integrity.