ETL is not one architecture. It is a set of trade-offs around latency, ownership, operational cost, and how fresh the data needs to be.
The mistake I see often is treating every data movement problem the same way. Some workloads need a nightly batch. Some need near real-time events. Some only need clean, reliable extracts with clear ownership.
Quick comparison
| Strategy | Best for | Trade-off |
|---|---|---|
| Batch ETL | Reports, finance, historical analysis | Simple and cheap, but delayed |
| ELT | Analytics warehouses and lakehouses | Flexible, but pushes complexity into SQL/dbt layers |
| CDC | Database change replication | Low impact on source services, but source schema changes need discipline |
| Streaming ETL | Real-time workflows, fraud, notifications, operations | Fast and scalable, but operationally more complex |
| Reverse ETL | Sending modeled data back to business tools | Useful for activation, but easy to create hidden coupling |
There is no universal winner. The right choice depends on the freshness SLA: how stale the data is allowed to be when it reaches the consumer.
If the business can wait until tomorrow, batch is usually enough. If downstream systems need to react within seconds, streaming becomes part of the core architecture.
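One way to make that decision concrete is to compare a batch job's worst-case staleness with the SLA. The sketch below is only an illustration: the function names and the formula (schedule interval plus run duration) are simplifying assumptions, not a standard definition.

```python
from datetime import timedelta


def worst_case_staleness(interval: timedelta, run_duration: timedelta) -> timedelta:
    """Rough upper bound on how old data can be when a batch load finishes.

    Assumption: a record that arrives just after one run starts extracting
    waits a full interval, plus the duration of the next run.
    """
    return interval + run_duration


def batch_meets_sla(interval: timedelta, run_duration: timedelta, sla: timedelta) -> bool:
    """True if a batch schedule stays within the freshness SLA."""
    return worst_case_staleness(interval, run_duration) <= sla


nightly = timedelta(hours=24)
runtime = timedelta(hours=2)

# Reporting that can wait until tomorrow: a nightly batch is enough.
print(batch_meets_sla(nightly, runtime, sla=timedelta(hours=36)))   # True

# A consumer that must react within seconds: batch cannot meet it.
print(batch_meets_sla(nightly, runtime, sla=timedelta(seconds=5)))  # False
```

If the check fails even for the shortest batch interval you can realistically run, that is a sign the flow belongs on a CDC or streaming path.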
Where Kafka contributes
Kafka is useful when data movement becomes a platform concern, not just a pipeline concern.
Instead of connecting every producer directly to every consumer, Kafka provides a durable event backbone:
- Producers publish events once.
- Consumers read independently.
- New consumers can replay history, as long as it is still within the topic's retention.
- Temporary downstream failures do not immediately break producers.
- Teams can own topics and contracts instead of point-to-point integrations.
This changes the architecture from tightly coupled pipelines to event-driven data flows.
For example, an order service can publish order.created. Analytics, notifications, fraud checks, inventory, and ML feature pipelines can all consume the same event without the order service knowing about each consumer.
That separation is the real value.
Kafka is not only about speed. It is about decoupling, replayability, and operational control.
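The decoupling and replay behaviour described above can be illustrated with a toy in-memory log. This is not the Kafka client API; `EventLog`, `publish`, and `poll` are hypothetical names, and the sketch models a single partition with no retention, persistence, or consumer groups.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Toy append-only log standing in for one topic partition."""
    events: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next offset to read

    def publish(self, event: dict) -> int:
        """Append an event once; the producer knows nothing about consumers."""
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def poll(self, consumer: str) -> list:
        """Return the events this consumer has not read yet and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.events[start:]
        self.offsets[consumer] = len(self.events)
        return batch


log = EventLog()
log.publish({"type": "order.created", "order_id": 1})
log.publish({"type": "order.created", "order_id": 2})

analytics_seen = log.poll("analytics")    # reads both events
log.publish({"type": "order.created", "order_id": 3})
fraud_seen = log.poll("fraud-checks")     # new consumer starts from offset 0
analytics_next = log.poll("analytics")    # only the event it has not read yet

print(len(analytics_seen), len(fraud_seen), len(analytics_next))  # 2 3 1
```

Each consumer tracks its own position, so adding `fraud-checks` later required no change to the producer and did not affect `analytics`.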
Where Strimzi fits
If Kafka runs on Kubernetes, Strimzi is a strong operator for managing it.
Strimzi gives Kafka a Kubernetes-native operating model:
- Kafka clusters are declared as custom resources.
- Brokers, listeners, users, topics, and ACLs can be managed through Kubernetes manifests.
- Upgrades and configuration changes become part of the GitOps workflow.
- Platform teams can standardize Kafka operations across environments.
This matters because Kafka is not a simple stateless workload. Broker storage, networking, certificates, users, topic configuration, and rolling upgrades all need careful handling.
Strimzi does not remove the need to understand Kafka. It gives teams a better control plane for running Kafka consistently on Kubernetes.
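As a rough sketch of what "declared as custom resources" looks like, the snippet below builds a Strimzi `KafkaTopic` manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The cluster name `my-cluster`, the topic settings, and the helper `kafka_topic_manifest` are assumptions for illustration, and the `apiVersion` shown may differ depending on the Strimzi release installed.

```python
import json


def kafka_topic_manifest(resource_name: str, topic_name: str, cluster: str,
                         partitions: int, replicas: int, retention_ms: int) -> dict:
    """Build a KafkaTopic custom resource for the Strimzi Topic Operator."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",  # check against your Strimzi version
        "kind": "KafkaTopic",
        "metadata": {
            "name": resource_name,
            # This label tells the operator which Kafka cluster owns the topic.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "topicName": topic_name,
            "partitions": partitions,
            "replicas": replicas,
            "config": {"retention.ms": retention_ms},
        },
    }


manifest = kafka_topic_manifest(
    resource_name="order-created",
    topic_name="order.created",
    cluster="my-cluster",                 # hypothetical cluster name
    partitions=6,                         # illustrative values only
    replicas=3,
    retention_ms=7 * 24 * 60 * 60 * 1000,  # seven days
)

print(json.dumps(manifest, indent=2))
```

Committing the generated manifest to Git and letting the operator reconcile it is what brings topic changes into the same review and rollout process as the rest of the platform.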
Practical architecture direction
A practical data architecture usually combines multiple strategies:
- Use batch or ELT for heavy analytical workloads.
- Use CDC when database changes need to be replicated reliably.
- Use Kafka for events that multiple systems need to consume independently.
- Use streaming ETL only where latency justifies the operational cost.
- Use Strimzi when Kafka is part of the Kubernetes platform and needs GitOps-friendly operations.
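The list above can be read as a small routing rule. The sketch below encodes one possible ordering of those checks; the `DataFlow` fields, the 60-second threshold, and the rule priority are all assumptions chosen for illustration, not a recommendation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    name: str
    max_staleness_seconds: int          # freshness SLA for this flow
    replicates_database_changes: bool = False
    independent_consumers: int = 1


def choose_path(flow: DataFlow) -> str:
    """Pick a data movement strategy for one flow (hypothetical rule order)."""
    if flow.replicates_database_changes:
        return "CDC"
    if flow.independent_consumers > 1:
        return "Kafka events"
    if flow.max_staleness_seconds <= 60:  # assumed cut-off for "needs streaming"
        return "streaming ETL"
    return "batch or ELT"


flows = [
    DataFlow("finance-report", max_staleness_seconds=86_400),
    DataFlow("orders-replica", max_staleness_seconds=300, replicates_database_changes=True),
    DataFlow("order.created", max_staleness_seconds=5, independent_consumers=5),
    DataFlow("fraud-scoring", max_staleness_seconds=2),
]

for flow in flows:
    print(f"{flow.name}: {choose_path(flow)}")
```

The useful part is less the exact thresholds than writing the criteria down, so each new flow gets the same questions asked of it.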
The goal is not to make everything real-time.
The goal is to put each data flow on the right path: simple where possible, event-driven where valuable, and operationally controlled where critical.