Production Infra Box · 5 days

VPN Service: Kafka Automation and Zero-Downtime Updates

A cross-platform VPN service managed Kafka clusters manually. Every update was an incident risk. Configurations on dev, staging, and prod diverged. In 5 days, we automated the entire Kafka lifecycle through Ansible.

Deploy an identical Kafka cluster: 30 min
Incidents during updates: 0
dev/staging/prod parity: 100%
Delivery timeline: 5 days

The Problem

  • Manual Kafka cluster management via SSH — each deployment took half a day
  • dev/staging/prod configs had diverged; 'works on my machine' was the norm
  • Kafka updates without config testing → production incidents
  • No way to quickly spin up a new cluster for testing
  • Runbook existed only in one engineer's head

The Solution

  • Ansible roles for full Kafka lifecycle (install, configure, upgrade, rolling restart)
  • Single source of truth for configuration — one vars.yml for all environments
  • Configuration testing before rollout via Molecule + Testinfra
  • Rolling update without service interruption — brokers updated one at a time
  • Documentation in code: every parameter with a comment and link to Kafka docs
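
The pre-rollout testing step could be wired up with a Molecule scenario along the following lines. This is only a minimal sketch: the case study does not show the actual scenario, so the file path, the Docker driver, the instance name, and the base image are all assumptions chosen for illustration.

```yaml
# roles/kafka/molecule/default/molecule.yml (hypothetical path and contents)
---
driver:
  name: docker            # assumption: containers serve as throwaway test hosts
platforms:
  - name: kafka-test-1    # hypothetical instance name
    image: debian:12      # assumption: the real base image is not stated
provisioner:
  name: ansible           # the converge step applies the role under test
verifier:
  name: testinfra         # Testinfra assertions run after converge
```

With a scenario like this, `molecule test` would create the instance, apply the role, check idempotence, run the Testinfra checks, and destroy the instance, so a broken configuration fails in CI rather than on a production broker.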

How the Automation Works

Ansible Project Structure

kafka-automation/
├── roles/
│   ├── kafka/
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   └── defaults/main.yml
│   └── zookeeper/
├── inventories/
│   ├── dev/
│   ├── staging/
│   └── prod/
└── playbooks/
    ├── deploy.yml
    └── rolling-update.yml
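
To illustrate the "documentation in code" idea, a role defaults file might look roughly like the sketch below. The variable names and values are hypothetical and not taken from the project; only the Kafka broker settings they refer to (`default.replication.factor`, `min.insync.replicas`, `log.retention.hours`) are real.

```yaml
# roles/kafka/defaults/main.yml (illustrative; variable names are hypothetical)
---
# Replication factor for automatically created topics.
# Maps to default.replication.factor.
# Docs: https://kafka.apache.org/documentation/#brokerconfigs
kafka_default_replication_factor: 3   # example value, not from the case study

# Minimum number of in-sync replicas that must acknowledge a write
# when producers use acks=all. Maps to min.insync.replicas.
# Docs: https://kafka.apache.org/documentation/#brokerconfigs
kafka_min_insync_replicas: 2          # example value, not from the case study

# Hours to keep a log segment before it becomes eligible for deletion.
# Maps to log.retention.hours (Kafka's own default is 168).
# Docs: https://kafka.apache.org/documentation/#brokerconfigs
kafka_log_retention_hours: 168
```

Keeping the comment and the documentation link next to each value means a reviewer can judge a change without leaving the merge request.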

Rolling Update Without Downtime

# rolling-update.yml
- name: Rolling Kafka update
  hosts: kafka_brokers
  serial: 1  # one broker at a time
  tasks:
    - name: Wait for partition reassignment
      kafka_info:
        bootstrap_servers: ...
      register: result  # required so the until condition can read the output
      until: result.under_replicated == 0
      retries: 30
    - name: Stop broker
      service:
        name: kafka
        state: stopped
    - name: Update and start
      include_role:
        name: kafka
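
A play like this could also confirm that the restarted broker is accepting connections before Ansible moves on to the next host. The task below is a sketch of one way to do that and is not part of the original playbook; the port and timeout are assumptions, and it presumes the broker listens on an address reachable from the host itself.

```yaml
# Hypothetical task appended to the tasks list in rolling-update.yml
- name: Wait for the broker port to accept connections
  ansible.builtin.wait_for:
    port: 9092      # assumption: default PLAINTEXT listener port
    timeout: 120    # assumption: seconds to wait before failing the host
```

Because the play runs with `serial: 1`, a failure here stops the rollout on the current broker, leaving the remaining brokers on the previous version.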

Results

New cluster deployment: half a day → 30 min
Kafka update: manual + risk → automated + tests
dev/staging/prod configs: diverged → 100% parity

"I used to be afraid to update Kafka in production. Now it's just a command in GitLab CI — I run it and go get coffee."

— Lead Engineer, VPN Service (SaaS / VPN)

Technology Stack

Kafka, Ansible, Kubernetes, Terraform, GitLab CI/CD, ArgoCD, Grafana, Prometheus, PostgreSQL
