ETL Strategies and Where Kafka Fits | Yasser Alattas

ETL is not one architecture. It is a set of trade-offs around latency, ownership, operational cost, and how fresh the data needs to be.

The mistake I see often is treating every data movement problem the same way. Some workloads need a nightly batch. Some need near real-time events. Some only need clean, reliable extracts with clear ownership.

Quick comparison

Strategy	Best for	Trade-off
Batch ETL	Reports, finance, historical analysis	Simple and cheap, but delayed
ELT	Analytics warehouses and lakehouses	Flexible, but pushes complexity into SQL/dbt layers
CDC	Database change replication	Low impact on source services, but schema changes need discipline
Streaming ETL	Real-time workflows, fraud, notifications, operations	Fast and scalable, but operationally more complex
Reverse ETL	Sending modeled data back to business tools	Useful for activation, but easy to create hidden coupling

There is no universal winner. The right choice depends on the SLA.

If the business can wait until tomorrow, batch is usually enough. If downstream systems need to react within seconds, streaming becomes part of the core architecture.
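One way to make that decision explicit is to write the freshness requirement down and derive a starting point from it. The sketch below is only illustrative: the function name `suggest_strategy` and the 12-hour and 5-minute thresholds are my assumptions, not rules, and a real decision would also weigh ownership and operational cost.

```python
from datetime import timedelta


def suggest_strategy(max_staleness: timedelta) -> str:
    """Map a freshness SLA to a starting-point strategy.

    The thresholds are illustrative assumptions, not industry standards.
    """
    if max_staleness >= timedelta(hours=12):
        # The business can wait until tomorrow.
        return "batch"
    if max_staleness >= timedelta(minutes=5):
        # Fresher than nightly, but nobody reacts within seconds.
        return "frequent batch or CDC"
    # Downstream systems need to react within seconds.
    return "streaming"


print(suggest_strategy(timedelta(days=1)))
print(suggest_strategy(timedelta(minutes=30)))
print(suggest_strategy(timedelta(seconds=2)))
```

The point is less the function than the habit: state the SLA first, then pick the strategy.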

Where Kafka contributes

Kafka is useful when data movement becomes a platform concern, not just a pipeline concern.

Instead of connecting every producer directly to every consumer, Kafka provides a durable event backbone:

Producers publish events once.
Consumers read independently.
New consumers can replay history, as far back as the topic's retention allows.
Temporary downstream failures do not immediately break producers.
Teams can own topics and contracts instead of point-to-point integrations.

This changes the architecture from tightly coupled pipelines to event-driven data flows.

For example, an order service can publish order.created. Analytics, notifications, fraud checks, inventory, and ML feature pipelines can all consume the same event without the order service knowing about each consumer.
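The idea can be sketched without any Kafka client at all. The toy `EventLog` class below is a hypothetical in-memory stand-in for a single topic, not Kafka's API: it keeps an append-only list and one read offset per consumer group, which is enough to show publish-once, independent consumption, and replay.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log: one topic, one read offset per consumer group."""

    def __init__(self):
        self._events = []                 # stand-in for the durable log
        self._offsets = defaultdict(int)  # next offset to read, per group

    def publish(self, event: dict) -> int:
        """Append an event once and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, group: str, max_events: int = 10) -> list:
        """Return the next events for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._events[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek_to_beginning(self, group: str) -> None:
        """Rewind a group so it replays the whole log."""
        self._offsets[group] = 0


log = EventLog()
log.publish({"type": "order.created", "order_id": 1})
log.publish({"type": "order.created", "order_id": 2})

analytics = log.poll("analytics")        # reads orders 1 and 2
log.publish({"type": "order.created", "order_id": 3})
fraud = log.poll("fraud")                # a new group reads all three events
analytics_next = log.poll("analytics")   # analytics only sees order 3
```

The producer never learns that `fraud` exists. In real Kafka, where a brand-new consumer group starts reading depends on the `auto.offset.reset` setting, and how far back it can replay depends on topic retention.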

That separation is the real value.

Kafka is not only about speed. It is about decoupling, replayability, and operational control.

Where Strimzi fits

If Kafka runs on Kubernetes, Strimzi is a strong operator for managing it.

Strimzi gives Kafka a Kubernetes-native operating model:

Kafka clusters are declared as custom resources.
Brokers, listeners, users, topics, and ACLs can be managed through Kubernetes manifests.
Upgrades and configuration changes become part of the GitOps workflow.
Platform teams can standardize Kafka operations across environments.

This matters because Kafka is not a simple stateless workload. Broker storage, networking, certificates, users, topic configuration, and rolling upgrades all need careful handling.

Strimzi does not remove the need to understand Kafka. It gives teams a better control plane for running Kafka consistently on Kubernetes.
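To give a feel for what "declared as custom resources" means, here is a minimal sketch that builds a Strimzi `KafkaTopic` manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name `kafka_topic_manifest`, the topic name `order-created`, the cluster name `my-cluster`, and the sizing values are all assumptions for illustration. The `apiVersion` shown is `kafka.strimzi.io/v1beta2`; check which version your installed Strimzi CRDs actually serve.

```python
import json

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # 604800000


def kafka_topic_manifest(name, cluster, partitions=3, replicas=3,
                         retention_ms=SEVEN_DAYS_MS):
    """Build a KafkaTopic custom resource as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",  # assumption: verify against your CRDs
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # This label tells the Topic Operator which Kafka cluster owns the topic.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": {"retention.ms": retention_ms},
        },
    }


manifest = kafka_topic_manifest("order-created", cluster="my-cluster")
print(json.dumps(manifest, indent=2))
```

Because the topic is just a manifest, it can live in Git, go through review, and be applied by the same GitOps tooling as everything else on the cluster.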

Practical architecture direction

A practical data architecture usually combines multiple strategies:

Use batch or ELT for heavy analytical workloads.
Use CDC when database changes need to be replicated reliably.
Use Kafka for events that multiple systems need to consume independently.
Use streaming ETL only where latency justifies the operational cost.
Use Strimzi when Kafka is part of the Kubernetes platform and needs GitOps-friendly operations.
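For the CDC item in particular, "replicated reliably" mostly comes down to applying changes in order and making each apply safe to repeat. The sketch below assumes a simplified, hypothetical change-event shape (`op`, `key`, `row`), not the format of any specific CDC tool, and uses a dict as the replica.

```python
def apply_change(replica: dict, change: dict) -> None:
    """Apply one change event to an in-memory replica keyed by primary key."""
    key = change["key"]
    if change["op"] == "delete":
        replica.pop(key, None)        # deleting a missing row is a no-op
    else:
        replica[key] = change["row"]  # insert and update are both upserts


changes = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
    {"op": "delete", "key": 2},
]

replica = {}
for change in changes:
    apply_change(replica, change)

print(replica)
```

Because inserts and updates are upserts and deletes tolerate missing rows, replaying the same ordered sequence after a failure leaves the replica in the same state.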

The goal is not to make everything real-time.

The goal is to put each data flow on the right path: simple where possible, event-driven where valuable, and operationally controlled where critical.