Every data pipeline starts with a choice: batch, stream, or event-driven. Pick the wrong topology, and you'll fight latency, waste compute, or drown in complexity. This guide compares the three topologies, not as abstract patterns, but as real-world workflows with concrete trade-offs. We'll walk through where each fits, where they break, and how to decide without over-engineering.
Why Topology Choices Matter in Production
Data pipeline topology defines how data moves from source to destination—how often, how reliably, and with what latency. In production, the wrong topology can mean missed SLAs, ballooning costs, or brittle systems that fail under load. Teams often choose based on what's familiar rather than what fits the problem, leading to rework down the line.
Consider a typical e-commerce company: they need real-time inventory updates for the website, nightly sales reports for finance, and event-driven notifications when stock runs low. Each use case pulls toward a different topology. Trying to force one topology to serve all three usually results in compromises that satisfy no one.
We see three dominant topologies in practice: batch (periodic, high-latency, high-throughput), stream (continuous, low-latency, moderate throughput), and event-driven (reactive, decoupled, variable throughput). Each has a sweet spot, but also sharp edges. Understanding those edges is what prevents a pipeline from becoming a liability.
What This Guide Covers
We'll compare the three topologies across five dimensions: latency, throughput, complexity, fault tolerance, and maintenance cost. We'll also highlight anti-patterns that teams commonly fall into, such as over-streaming batch workloads or under-provisioning event brokers. By the end, you should be able to map your own use cases to the right topology—or a hybrid that actually works.
Foundations: What Each Topology Actually Does
Before comparing, we need a clear definition of each topology. Batch processes data in discrete chunks at scheduled intervals. Stream processes data continuously as it arrives, with latency that typically ranges from milliseconds to a few seconds. Event-driven reacts to specific events, often using a message broker to decouple producers and consumers.
Batch is the oldest and most mature. It excels at large-scale transformations where latency isn't critical—think nightly aggregations, historical analysis, or backup generation. The trade-off is that data is always stale: a batch pipeline that runs every hour shows a view of the past hour, not the present.
Stream processing, popularized by tools like Apache Kafka (the durable log) and Apache Flink (the processing engine), handles data as it flows. It's ideal for real-time dashboards, fraud detection, and monitoring. But stream pipelines are harder to debug: state management, exactly-once semantics, and backpressure handling require careful design.
Event-driven topology overlaps with streaming and often runs on the same brokers, but its focus is decoupling. Instead of a continuous flow, events are discrete messages that trigger actions. This pattern shines in microservices architectures, where each service reacts to events without direct dependencies. The challenge is ensuring event ordering, deduplication, and eventual consistency.
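Deduplication is the easiest of these challenges to illustrate. The following is a minimal sketch, assuming each event carries a producer-assigned unique ID; the `Event` and `InventoryConsumer` names and the `event_id` field are hypothetical, and a real consumer would keep the seen-ID set in durable storage rather than in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str        # hypothetical unique ID assigned by the producer
    sku: str
    quantity_delta: int


@dataclass
class InventoryConsumer:
    stock: dict[str, int] = field(default_factory=dict)
    seen_ids: set[str] = field(default_factory=set)

    def handle(self, event: Event) -> bool:
        """Apply the event once; return False if it was a duplicate."""
        if event.event_id in self.seen_ids:
            return False
        self.stock[event.sku] = self.stock.get(event.sku, 0) + event.quantity_delta
        self.seen_ids.add(event.event_id)
        return True


consumer = InventoryConsumer()
events = [
    Event("e1", "sku-42", 10),
    Event("e2", "sku-42", -3),
    Event("e2", "sku-42", -3),  # redelivered by an at-least-once broker
]
results = [consumer.handle(e) for e in events]
```

Because the redelivered event is skipped, the stock level reflects each change exactly once even though the broker delivered one message twice.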
Common Misconceptions
One frequent mistake is equating stream with real-time. Stream can be real-time, but it can also be near-real-time with micro-batching. Another is assuming event-driven is always more scalable—without proper partitioning, event brokers become bottlenecks. Finally, batch is often dismissed as legacy, but for many workloads (e.g., end-of-day reconciliations), it remains the most efficient choice.
Patterns That Usually Work
Years of observing production systems point to a few reliable patterns. For batch, a simple and dependable one is the periodic ETL with idempotent writes: run a job every N minutes, overwrite the target, and handle failures by replaying the batch. When some consumers also need fresh data, the lambda architecture pairs that batch layer with a streaming speed layer. It remains a solid foundation, though maintaining two code paths adds complexity.
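The periodic ETL with idempotent writes can be sketched with nothing more than the standard library. This is an illustration under assumptions, not a production job: the `sales_summary` table, its columns, and the `run_batch` helper are made up for the example, and SQLite stands in for whatever warehouse the pipeline targets.

```python
import sqlite3


def run_batch(conn, window, rows):
    """Replace everything for one batch window in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM sales_summary WHERE batch_window = ?", (window,))
        conn.executemany(
            "INSERT INTO sales_summary (batch_window, sku, total) VALUES (?, ?, ?)",
            [(window, sku, total) for sku, total in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (batch_window TEXT, sku TEXT, total REAL)")

rows = [("sku-1", 120.0), ("sku-2", 75.5)]
run_batch(conn, "2024-01-01T00", rows)
run_batch(conn, "2024-01-01T00", rows)  # replay after a failure: no duplicates

row_count = conn.execute("SELECT COUNT(*) FROM sales_summary").fetchone()[0]
```

Deleting and reinserting the window inside one transaction is what makes the replay safe: a rerun either fully replaces the window or leaves the previous result untouched.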
For stream, the Kappa architecture (single stream pipeline for both real-time and historical) reduces duplication. It works well when you can reprocess from a log (like Kafka) and have a stream processor that supports stateful operations. The key is to treat the stream as the source of truth, not just a transport.
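To make "reprocess from a log" concrete, here is a toy sketch in which a Python list stands in for a Kafka topic and a plain function stands in for the stream processor. The record shape and the `build_view` name are assumptions for illustration only.

```python
from collections import defaultdict

# An append-only list of click events; stands in for a Kafka topic.
log = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/cart"},
]


def build_view(records, start_offset=0):
    """Fold log records into a page-view count, starting at a given offset."""
    counts = defaultdict(int)
    for record in records[start_offset:]:
        counts[record["page"]] += 1
    return dict(counts)


live_view = build_view(log)       # view derived from the events seen so far
log.append({"user": "c", "page": "/cart"})
rebuilt_view = build_view(log)    # full reprocess from offset 0
```

The same function serves both the live and the historical case, which is the point of Kappa: a changed computation is rolled out by replaying the log through the new code instead of maintaining a separate batch job.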
Event-driven patterns often follow the choreography or orchestration model. Choreography (each service reacts to events independently) is simpler but harder to trace. Orchestration (a central coordinator) provides visibility but introduces a single point of failure. A pragmatic middle ground is to use orchestration for critical workflows and choreography for fire-and-forget events.
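The difference between the two models is easier to see side by side. The sketch below runs both in a single process, which is a deliberate simplification: the in-memory `subscribers` registry stands in for a broker, and the service functions (`charge_payment`, `ship_order`) are hypothetical placeholders.

```python
from collections import defaultdict

# --- Choreography: services subscribe to events and react independently ---
subscribers = defaultdict(list)
audit_log = []


def subscribe(event_type, handler):
    subscribers[event_type].append(handler)


def publish(event_type, payload):
    for handler in subscribers[event_type]:
        handler(payload)


subscribe("order_placed", lambda o: audit_log.append(f"email sent for {o['id']}"))
subscribe("order_placed", lambda o: audit_log.append(f"stock reserved for {o['id']}"))
publish("order_placed", {"id": "o-1"})


# --- Orchestration: one coordinator calls each step and sees every outcome ---
def charge_payment(order):
    return f"charged {order['id']}"


def ship_order(order):
    return f"shipped {order['id']}"


def run_checkout(order):
    """Central coordinator: the step order is explicit and visible in one place."""
    return [step(order) for step in (charge_payment, ship_order)]


steps = run_checkout({"id": "o-1"})
```

In the choreographed half, the publisher has no idea who reacts, so tracing requires correlating logs across handlers. In the orchestrated half, `run_checkout` holds the whole workflow, which is also why its availability matters so much.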
Hybrid Topologies
Many teams find that a single topology doesn't fit all. A common hybrid is to use stream for ingestion, batch for heavy transformations, and event-driven for notifications. For example, a streaming pipeline ingests clickstream data into a Kafka topic. A batch job runs hourly to aggregate that data into a data warehouse. Meanwhile, an event-driven service triggers alerts when certain thresholds are crossed. This hybrid plays to each topology's strengths, at the cost of operating three kinds of infrastructure.
Anti-Patterns and Why Teams Revert
Even with good patterns, teams often hit walls and revert to simpler setups. One anti-pattern is over-streaming: building a stream pipeline for a workload that only needs hourly updates. The cost of maintaining state, handling exactly-once semantics, and debugging backpressure outweighs the benefit of lower latency. Teams end up switching to batch after months of operational pain.
Another anti-pattern is under-provisioning event brokers. Event-driven systems rely on the broker's ability to handle spikes. If you size the cluster based on average throughput, a sudden burst can cause message loss or backpressure cascading to producers. Teams then add retries and dead-letter queues, which complicate the system further.
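For a sense of what that added machinery looks like, here is a minimal sketch of bounded retries with a dead-letter queue. It is an in-memory illustration under assumptions: the `process_with_dlq` helper and the `parse_quantity` handler are hypothetical, and a real system would add backoff between attempts and persist the dead letters.

```python
def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts; park persistent failures in a DLQ."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break
            except ValueError as exc:
                if attempt == max_attempts:
                    dead_letters.append({"message": message, "error": str(exc)})
    return processed, dead_letters


def parse_quantity(message):
    # Hypothetical handler: fails permanently on malformed input.
    return int(message)


processed, dead_letters = process_with_dlq(["5", "oops", "7"], parse_quantity)
```

Even this small version introduces a second output that someone has to monitor and drain, which is the extra complexity the paragraph above warns about.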
Batch anti-patterns include running too many small jobs instead of batching them into larger, less frequent runs. This increases overhead and contention. Also common is ignoring idempotency: if a batch job fails halfway, rerunning it can produce duplicates. Without idempotent writes, teams must implement manual cleanup scripts.
Why Teams Revert
The most common reason for reverting is complexity that doesn't pay off. A team adopts stream processing because it's trendy, but their latency requirements are in minutes, not seconds. After months of debugging state stores and checkpointing, they switch to a simple batch job that runs every five minutes. The lesson: choose topology based on requirements, not novelty.
Maintenance, Drift, and Long-Term Costs
Every topology has a maintenance cost that grows over time. Batch pipelines are relatively cheap to maintain: schedule jobs, monitor failures, and reprocess when needed. But they accumulate technical debt in the form of outdated scripts, hardcoded paths, and missing documentation. Over years, batch pipelines become fragile and hard to change.
Stream pipelines require ongoing attention to state management, versioning of schemas, and scaling partitions. As data volume grows, you may need to rebalance partitions, upgrade cluster nodes, and tune checkpointing intervals. These tasks require specialized skills that are harder to hire for.
Event-driven systems suffer from drift in event schemas. Producers and consumers evolve independently, and without a schema registry, events can become unparseable. Teams then spend time reconciling mismatches. Additionally, event ordering guarantees (or lack thereof) can cause subtle bugs that only surface under load.
Cost Comparison
Infrastructure cost also varies. Batch is often cheapest because you can use spot instances and scale down between runs. Stream requires always-on clusters, which can be expensive for low-throughput workloads. Event-driven falls in between: a broker such as Kafka still runs continuously, but consumers can scale independently and shrink when traffic is low. The total cost of ownership includes not just compute but also engineering time for debugging and maintenance.
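A back-of-the-envelope calculation shows why always-on clusters dominate the bill. Every number below is a made-up placeholder, not a real cloud price, and the formula covers compute only; storage and engineering time would need to be added separately.

```python
def monthly_compute_cost(nodes, hourly_rate, hours_per_day, days=30):
    """Compute cost = nodes x hourly rate x hours used per day x days."""
    return nodes * hourly_rate * hours_per_day * days


# Placeholder figures for illustration only.
batch_cost = monthly_compute_cost(nodes=4, hourly_rate=0.10, hours_per_day=2)
stream_cost = monthly_compute_cost(nodes=3, hourly_rate=0.25, hours_per_day=24)
cost_ratio = stream_cost / batch_cost
```

With these assumed inputs the batch cluster costs roughly 24 units a month against roughly 540 for the always-on cluster. The gap comes almost entirely from hours of use, so it narrows quickly as batch runs become longer or more frequent.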
When Not to Use Each Topology
Batch is not suitable for real-time decision-making. If your application needs sub-second responses to new data (e.g., fraud blocking, live pricing), batch will introduce unacceptable delay. Also avoid batch for incremental processing of high-velocity data—you'll end up with large, expensive runs that overlap.
Stream is overkill for periodic reporting. If your stakeholders only need daily or hourly updates, a stream pipeline adds complexity without benefit. Stream is also a poor fit when you need exactly-once results but the source cannot support them (e.g., many change data capture feeds deliver at-least-once, so duplicates must be handled downstream).
Event-driven topology should be avoided when you need strong consistency across services. Eventual consistency is inherent; if you need ACID transactions, a single transactional database or a distributed transaction coordinator is more appropriate. Also avoid event-driven for simple request-response patterns, where a direct API call is simpler and faster.
Edge Cases
One edge case is the need for both low latency and high throughput. In theory, stream can handle both, but in practice, you may need to partition aggressively and accept some latency. Another edge case is regulatory compliance: batch pipelines make auditing easier because data is processed in discrete windows. Stream and event-driven require careful logging and replay mechanisms to meet audit requirements.
Open Questions and FAQ
We often hear the same questions from teams evaluating topologies. Here are answers to the most common ones.
Can we use stream processing for batch workloads?
Yes, but it's usually not worth it. You can set a stream processor to micro-batch every few minutes, but you lose the simplicity of a traditional batch system. Only do this if you already have a stream infrastructure and want to avoid maintaining a separate batch system.
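Micro-batching itself is a small idea. The sketch below groups a stream by item count rather than by time, which is a simplification of the "every few minutes" trigger described above; the `micro_batches` name is an assumption for this example and not an API of any stream processor.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size items from any iterable stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


totals = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
```

The last batch is smaller than the others, which is typical: micro-batch jobs have to tolerate partial batches, one of the details a traditional scheduled batch job never has to think about.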
How do we handle schema evolution in event-driven systems?
Use a schema registry (like Confluent Schema Registry or Apicurio). It enforces compatibility rules and allows producers and consumers to evolve independently. Without it, you'll face serialization errors and silent data corruption.
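To give a feel for what a compatibility rule checks, here is a deliberately simplified sketch of one backward-compatibility test: a new schema can read old records only if every field it adds has a default. This is not the API or the full rule set of any real registry; the dict-based schema shape and the `is_backward_compatible` helper are assumptions for illustration.

```python
def is_backward_compatible(old_fields, new_fields):
    """New schema can read old data if every added field carries a default.

    Each schema is a dict of field name -> {"type": ..., optional "default": ...}.
    """
    for name, spec in new_fields.items():
        if name not in old_fields:
            if "default" not in spec:
                return False  # old records have no value for this field
        elif spec["type"] != old_fields[name]["type"]:
            return False      # type changes are rejected in this simplified check
    return True


v1 = {"order_id": {"type": "string"}, "amount": {"type": "double"}}
v2_ok = {**v1, "currency": {"type": "string", "default": "USD"}}
v2_bad = {**v1, "currency": {"type": "string"}}

ok_result = is_backward_compatible(v1, v2_ok)
bad_result = is_backward_compatible(v1, v2_bad)
```

A registry runs checks of this kind when a producer registers a new schema version, so an incompatible change is rejected at publish time instead of surfacing later as a consumer failure.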
What's the best topology for a startup?
Start with batch. It's simple, cheap, and covers most early use cases. Add stream or event-driven only when you have a clear latency requirement that batch cannot meet. Adopting complex topologies prematurely is a common and costly mistake for small teams.
How do we migrate from batch to stream?
Gradually. Start by streaming a subset of data (e.g., critical metrics) while keeping batch for the rest. Use a dual-write pattern: write to both the stream and the batch store, then verify consistency. Once the stream pipeline is stable, retire the batch path.
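The "verify consistency" step can be as simple as diffing the two outputs for the same time window. The following sketch assumes both paths produce per-key totals; the `diff_outputs` helper and the sample figures are hypothetical.

```python
def diff_outputs(batch_totals, stream_totals, tolerance=0.0):
    """Return keys whose values disagree or that exist on only one side."""
    mismatches = {}
    for key in sorted(set(batch_totals) | set(stream_totals)):
        b, s = batch_totals.get(key), stream_totals.get(key)
        if b is None or s is None or abs(b - s) > tolerance:
            mismatches[key] = (b, s)
    return mismatches


batch_totals = {"sku-1": 100, "sku-2": 40, "sku-3": 7}
stream_totals = {"sku-1": 100, "sku-2": 38}
mismatches = diff_outputs(batch_totals, stream_totals)
```

Running a check like this on every window during the migration gives a concrete signal for when the stream path is stable enough to retire the batch path. A nonzero `tolerance` may be appropriate if late-arriving events are expected to cause small, temporary differences.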
Is event-driven always more resilient?
No. Event-driven systems can fail in new ways: broker outages, message loss, and consumer lag. Resilience comes from careful design (replication, dead-letter queues, monitoring), not from the topology itself.
Summary and Next Experiments
Choosing a data pipeline topology is not a one-time decision. As your data volume, latency needs, and team evolve, the right topology may shift. Start with the simplest topology that meets your requirements—usually batch—and only add complexity when you have a clear, measurable need.
To apply this guide, try these experiments:
- Audit your current pipelines: For each pipeline, note the required latency, throughput, and fault tolerance. Then check if the topology matches. You may find mismatches that are costing you.
- Run a cost comparison: Estimate the total cost (compute, storage, engineering time) of your current topology vs. a simpler alternative. Often, batch is cheaper than you think.
- Prototype a hybrid: Pick one pipeline that could benefit from lower latency. Build a stream version alongside the existing batch, and compare operational overhead after a month.
- Test failure modes: Simulate a broker outage or a batch job failure. Measure recovery time and data loss. This will reveal weaknesses in your topology choice.
- Talk to your stakeholders: Ask what latency they actually need. You might discover that hourly updates are fine, saving you from over-engineering a stream solution.
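The audit in the first experiment can start as a few lines of code. This sketch encodes the guide's rule of thumb with an assumed cutoff: the 60-second threshold, the `suggest_topology` helper, and the pipeline names are placeholders to adjust for your own requirements.

```python
def suggest_topology(max_latency_seconds, reacts_to_discrete_events=False):
    """Rule of thumb: simplest topology that meets the latency requirement."""
    if reacts_to_discrete_events:
        return "event-driven"
    if max_latency_seconds < 60:  # assumed cutoff, not a universal rule
        return "stream"
    return "batch"


pipelines = [
    {"name": "nightly_sales_report", "max_latency_seconds": 86_400, "current": "batch"},
    {"name": "inventory_sync", "max_latency_seconds": 1, "current": "batch"},
    {"name": "finance_rollup", "max_latency_seconds": 3_600, "current": "stream"},
]

mismatched = [
    p["name"] for p in pipelines
    if suggest_topology(p["max_latency_seconds"]) != p["current"]
]
```

Here the audit flags one pipeline that is too slow for its requirement and one that is more complex than it needs to be, the two kinds of mismatch this guide describes.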
Data pipeline topology is a tool, not a religion. Use the one that fits your problem, and don't be afraid to switch when the problem changes. The best pipeline is the one that runs reliably, costs less, and lets your team sleep at night.