Temporal Orbits and Data Flow: Conceptualizing Stream Topologies Against Batch-Oriented Process Constellations

When we design data pipelines, we often reach for a familiar mental model: batch processing runs on a schedule, stream processing reacts to events as they arrive. But real-world systems rarely fit neatly into one bucket. A nightly ETL job that ingests files from a shared drive is clearly batch. A Kafka consumer that updates a dashboard in milliseconds is clearly stream. Yet between these poles lies a vast gray zone where teams must decide how to conceptualize their data flow—and that decision shapes everything from infrastructure cost to debugging complexity.

This guide is for engineers and architects who already know the basics of batch and stream processing but want a more nuanced framework for choosing and combining topologies. We'll use the metaphor of temporal orbits—the idea that each pipeline has a characteristic cadence, latency budget, and reprocessing strategy—to compare approaches without falling into the trap of declaring one paradigm universally superior.

Where Temporal Orbits Collide: The Real-World Context

The tension between batch and stream topologies surfaces most acutely in systems that must serve both operational and analytical workloads. Consider a logistics company tracking delivery vehicles: telemetry data arrives every few seconds (a stream), but the monthly revenue report needs to reconcile all trips with invoicing records (a batch job). The same data, two very different temporal requirements.

Common scenarios that demand hybrid thinking

Teams often encounter the batch-stream tension in these situations:

  • Real-time dashboards with historical baselines—a monitoring system needs current latency (stream) but compares against last week's patterns (batch).
  • Event-driven microservices that feed a data warehouse—the service emits events in real time, but the warehouse loads in hourly or daily batches.
  • Machine learning feature pipelines—training uses historical data (batch), while inference requires fresh features (stream).
  • Audit and compliance—transactions must be processed quickly for fraud detection (stream) but retained and reconciled in batches for regulators.

In each case, the topology choice is not absolute. The question is not “stream or batch?” but “where along the temporal continuum does each part of my pipeline belong?”

Why the metaphor matters

Thinking in terms of temporal orbits helps us see that every pipeline has a natural rhythm—a cycle time that matches the business need. A stream is not faster batch; it's a fundamentally different relationship with time. Batch processing says, “I will run when data is complete.” Stream processing says, “I will run as data arrives, accepting uncertainty.” Recognizing this shift is the first step to designing topologies that don't fight their own nature.

Foundations Readers Confuse: Event Time, Processing Time, and Watermarks

One of the most persistent sources of confusion is the distinction between event time (when the data was generated) and processing time (when the pipeline handles it). Batch systems naturally align with processing time: they run on a fixed schedule and see all data that arrived before the run. Stream systems must explicitly track event time, because data can arrive late or out of order.

Event time vs. processing time

Imagine a sensor that reports temperature every second. If the network glitches for ten seconds, then the sensor's readings arrive in a burst. A batch job that runs every minute would see all ten readings together and process them as a group. A stream processor, however, must decide whether to wait for the burst or emit results immediately with incomplete data. This is where watermarks come in: a heuristic that estimates how late data can arrive before the pipeline considers a window complete.

Teams often assume that stream processing guarantees low latency without trade-offs. The reality is that watermarks are imprecise. If you set a watermark too aggressively (a short allowed delay), windows close early and late events are dropped or must be handled separately. If you set it too conservatively, every result waits longer and latency goes up. Batch processing sidesteps this trade-off by simply waiting until a scheduled run, but that means higher latency by design.
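The watermark trade-off can be made concrete with a small simulation. This is a minimal sketch in plain Python, not the API of any stream processor: the function name `windowed_counts` and the parameter `max_delay` are assumptions made for illustration, and the watermark is modeled as the largest event time seen so far minus a fixed delay.

```python
from collections import defaultdict


def windowed_counts(events, window_size, max_delay):
    """Count events per tumbling event-time window.

    events: event-time timestamps (seconds), listed in arrival order.
    window_size: width of each tumbling window, in seconds.
    max_delay: how far the watermark trails the largest event time seen.
    Returns (emitted, dropped): emitted holds (window_start, count) pairs in
    the order the windows closed; dropped holds timestamps that arrived
    after their window had already closed.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")
    for ts in events:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            dropped.append(ts)  # the window for this event is already closed
            continue
        open_windows[start] += 1
        watermark = max(watermark, ts - max_delay)
        # Close every window whose end the watermark has passed.
        for w in sorted(open_windows):
            if w + window_size <= watermark:
                emitted.append((w, open_windows.pop(w)))
    # End of input: flush whatever is still open.
    emitted.extend(sorted(open_windows.items()))
    return emitted, dropped


arrivals = [1, 2, 3, 14, 4, 25, 5]  # 4 and 5 arrive out of order
conservative = windowed_counts(arrivals, window_size=10, max_delay=5)
aggressive = windowed_counts(arrivals, window_size=10, max_delay=0)
```

With `max_delay=5` the late reading `4` still lands in its window and only `5` is dropped; with `max_delay=0` both are dropped, but the first window closes one arrival earlier. That is the latency-versus-completeness dial described above.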

Exactly-once semantics and state

Another foundational confusion is exactly-once processing. Batch systems achieve it naturally because they can retry failed tasks using idempotent writes. Stream systems require coordination between source, processor, and sink—often using transactional writes or idempotent sinks. Many teams discover that their stream processor's “exactly-once” guarantee only works under specific conditions (e.g., when the sink supports idempotency) and that state size grows unbounded if not managed.
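To see why idempotent writes make retries safe, consider this minimal sketch. The sink is just a Python dict keyed by a hypothetical `event_id` field, and `run_with_retry` simulates a task that crashes partway through and is then rerun in full; none of these names come from a real framework.

```python
def write_batch(sink, records):
    """Upsert records keyed by event_id, so replaying a record is a no-op."""
    for record in records:
        sink[record["event_id"]] = record["amount"]


def run_with_retry(sink, records, fail_after=None):
    """Write records one by one; on a simulated crash, retry the whole batch."""
    try:
        for i, record in enumerate(records):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated crash")
            write_batch(sink, [record])
    except RuntimeError:
        # The retry rewrites records that were already written. Because the
        # write is an upsert, nothing is double counted.
        write_batch(sink, records)


records = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 20},
    {"event_id": "e3", "amount": 30},
]
sink = {}
run_with_retry(sink, records, fail_after=2)
```

Had the sink been an append-only list, the retry would have left five entries instead of three. A streaming sink faces the same problem, except that the "batch" being retried is whatever was in flight since the last checkpoint.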

Common mental model errors

  • “Streaming is just batch with smaller windows.” No: streaming must handle unbounded input that arrives out of order and late; batch assumes bounded, complete input that it can sort however it likes.
  • “Batch is always cheaper.” For low-volume, low-latency needs, a stream topology can reduce infrastructure cost because it processes data immediately rather than storing it until a batch run.
  • “Exactly-once is the same in batch and stream.” In batch, exactly-once is straightforward; in stream, it requires careful design and often comes with performance costs.

Clearing up these misconceptions early prevents teams from choosing a topology based on buzzwords rather than actual requirements.

Patterns That Usually Work: Hybrid Topologies and Their Sweet Spots

After years of experimentation, the industry has converged on a few reliable patterns that combine batch and stream elements. These patterns are not one-size-fits-all, but they offer a starting point for most real-world pipelines.

Lambda architecture

Lambda architecture runs two parallel pipelines: a speed layer (stream) for low-latency results and a batch layer for accurate, complete results. A serving layer merges both outputs. This pattern works well when you need both freshness and correctness, but it introduces operational complexity—you must maintain two codebases and reconcile results when the batch layer catches up.
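The serving-layer merge is the part of Lambda that is easiest to get wrong, so here is a minimal sketch of one possible rule: trust the batch view up to the last hour it has fully processed, and fill in everything newer from the speed layer. The function name, the hourly keys, and the `batch_watermark` parameter are all assumptions made for illustration.

```python
def merge_views(batch_view, speed_view, batch_watermark):
    """Combine batch and speed-layer results into one served view.

    batch_view, speed_view: dicts mapping hour (int) -> count.
    batch_watermark: the last hour the batch layer has fully processed.
    """
    # Batch results win for every hour the batch layer has covered.
    merged = {h: c for h, c in batch_view.items() if h <= batch_watermark}
    # The speed layer only fills in hours the batch layer has not reached.
    for hour, count in speed_view.items():
        if hour > batch_watermark:
            merged[hour] = count
    return merged


batch_view = {0: 100, 1: 120}
speed_view = {1: 118, 2: 40, 3: 7}  # hour 1 is approximate in the speed layer
served = merge_views(batch_view, speed_view, batch_watermark=1)
```

Note that hour 1 is served as 120, not 118: once the batch layer catches up, its value replaces the speed layer's approximation. That replacement is the reconciliation step mentioned above.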

Kappa architecture

Kappa architecture simplifies lambda by using a single stream processing pipeline for all data, with the ability to replay historical data from a log (like Kafka). This avoids the dual-codebase problem but requires that your stream processor supports stateful reprocessing and that your data retention in the log is long enough to replay. Kappa works best for organizations that can tolerate eventual consistency and have strong stream processing infrastructure.
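Replay is the core move in Kappa, and a toy model shows its shape. In this sketch the `Log` class is an in-memory stand-in for a retained topic (it is not the Kafka client API), and `materialize` is a hypothetical helper that rebuilds a view by running one transform over the log from a chosen offset.

```python
class Log:
    """Append-only in-memory log, standing in for a retained Kafka topic."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset):
        return iter(self._records[offset:])


def materialize(log, transform, offset=0):
    """Rebuild a keyed sum by running `transform` over the log from `offset`."""
    view = {}
    for key, value in log.read_from(offset):
        view[key] = view.get(key, 0) + transform(value)
    return view


log = Log()
for record in [("a", 1), ("b", 2), ("a", 3)]:
    log.append(record)

view_v1 = materialize(log, lambda v: v)       # original logic
view_v2 = materialize(log, lambda v: v * 10)  # changed logic, replayed from offset 0
```

The same code path serves live processing and reprocessing; only the starting offset and the transform differ. The constraint from the paragraph above is visible here too: if the log had already discarded the early records, `view_v2` could not be rebuilt.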

Micro-batching

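Micro-batching (the execution model of Spark Streaming and the default mode of Spark Structured Streaming) processes data in small groups cut at a fixed interval, typically a few seconds' worth of events. Flink, by contrast, processes records one at a time and offers mini-batching only as an optional optimization in its Table/SQL layer. Micro-batching gives a middle ground: latency of a few seconds, with the fault tolerance and simplicity of batch. It is a good choice when your use case can tolerate second-level latency and you want to reuse batch processing tools.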
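The grouping step behind micro-batching is simple enough to sketch directly. This is an illustration only, assuming events are `(arrival_time, value)` pairs already in arrival order; real engines cut batches on a trigger timer rather than by inspecting timestamps, and `micro_batches` is a hypothetical name.

```python
def micro_batches(events, interval):
    """Group (arrival_time, value) pairs into batches of `interval` seconds.

    Events must be in arrival order. Intervals with no events yield nothing.
    """
    batch, current_slot = [], None
    for arrival, value in events:
        slot = arrival // interval  # which interval this event falls into
        if slot != current_slot and batch:
            yield batch
            batch = []
        current_slot = slot
        batch.append(value)
    if batch:
        yield batch


events = [(0.1, "a"), (0.9, "b"), (1.2, "c"), (3.5, "d")]
batches = list(micro_batches(events, interval=1))
```

Each yielded list can then be handed to ordinary batch code, which is the main appeal of the pattern: the per-batch logic does not need to know it is running inside a stream.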

Pattern | Latency | Correctness | Complexity | Best for
Lambda | Seconds (speed) + hours (batch) | High (batch reconciles) | High | Dashboards with historical accuracy
Kappa | Seconds to minutes | Eventual (depends on replay) | Medium | Event-driven systems with replay capability
Micro-batching | Seconds | High (batch-like fault tolerance) | Low to medium | Near-real-time analytics with existing batch tools

Each pattern has a sweet spot. The key is to match the pattern to your latency requirements and tolerance for complexity, not to the latest trend.

Anti-Patterns and Why Teams Revert to Batch

Despite the promise of stream processing, many teams eventually revert to batch for parts of their pipeline. Understanding the common anti-patterns can help you avoid the same pitfalls.

Anti-pattern 1: Streaming everything because it's modern

Some teams adopt stream processing for all data flows, including use cases that are inherently batch—like monthly reporting or historical analysis. The result is a system that is more complex than necessary, with state management issues and higher operational costs. The fix is to ask: does this data need to be acted upon within seconds? If not, batch is simpler and more reliable.

Anti-pattern 2: Underestimating state management

Stream processors often maintain state—aggregations, joins, session windows. State grows with data volume and time. Teams that don't plan for state size, checkpointing, and recovery find themselves with out-of-memory errors or slow restarts. Batch avoids this because state is ephemeral (per job). The lesson: if your streaming use case requires large state, consider micro-batching or periodic state snapshots.
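One common way to keep state bounded is to expire keys that have been idle for too long. The sketch below is a minimal illustration of that idea, not the state API of any particular engine; the class name `TtlState` and its methods are assumptions, and time is passed in explicitly so the behavior is easy to follow.

```python
class TtlState:
    """Keyed counters that are evicted once idle for longer than `ttl` seconds."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._counts = {}
        self._last_seen = {}

    def update(self, key, now):
        """Increment the counter for `key` and record when it was last touched."""
        self._counts[key] = self._counts.get(key, 0) + 1
        self._last_seen[key] = now
        return self._counts[key]

    def evict_expired(self, now):
        """Drop every key idle for more than ttl; return the evicted keys."""
        expired = [k for k, t in self._last_seen.items() if now - t > self.ttl]
        for key in expired:
            del self._counts[key]
            del self._last_seen[key]
        return expired

    def __len__(self):
        return len(self._counts)


state = TtlState(ttl=60)
state.update("a", now=0)
state.update("a", now=10)
state.update("b", now=50)
evicted = state.evict_expired(now=100)  # "a" has been idle for 90 seconds
```

The cost of eviction is correctness for very late events: if `"a"` shows up again after being evicted, its count restarts from one. Choosing the TTL is therefore the same kind of judgment call as choosing a watermark delay.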

Anti-pattern 3: Ignoring backpressure

Stream pipelines must handle backpressure—when the sink cannot keep up with the source. Batch systems handle this naturally because they process data in discrete chunks. Stream systems that ignore backpressure can cause data loss or unbounded memory growth. Proper design includes buffering, rate limiting, or adaptive scaling.
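The simplest form of backpressure is a bounded buffer whose `put` blocks when it is full, which forces the producer to slow down to the consumer's pace. Here is a minimal sketch using Python's standard `queue.Queue`; the function `run_pipeline` and its return values are assumptions made for illustration.

```python
import queue
import threading


def run_pipeline(items, capacity):
    """Move items from a producer to a consumer through a bounded buffer.

    Returns (received, max_depth): the items in the order consumed, and the
    largest buffer depth the consumer observed.
    """
    buffer = queue.Queue(maxsize=capacity)
    received = []
    depth = {"max": 0}
    sentinel = object()  # marks the end of the stream

    def consumer():
        while True:
            depth["max"] = max(depth["max"], buffer.qsize())
            item = buffer.get()
            if item is sentinel:
                return
            received.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for item in items:
        buffer.put(item)  # blocks while the buffer is full: backpressure
    buffer.put(sentinel)
    worker.join()
    return received, depth["max"]


received, max_depth = run_pipeline(range(200), capacity=4)
```

Nothing is lost and memory stays bounded by `capacity`, at the price of stalling the producer. When the source cannot be stalled (a sensor, for example), the alternatives named above (rate limiting, scaling out, or deliberately shedding load) have to take over.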

Why teams revert to batch

When the cost of maintaining a stream topology outweighs the latency benefit, teams often revert to batch for the problematic parts. Common triggers: debugging late data issues, dealing with state corruption after a failure, or high infrastructure costs from running 24/7 stream processors. Reverting is not a failure—it's a pragmatic choice. The goal is to find the right temporal orbit for each data flow, not to prove that streaming works everywhere.

Maintenance, Drift, and Long-Term Costs

Every data pipeline degrades over time if not actively maintained. Stream topologies introduce unique drift patterns that batch systems don't face.

Schema evolution and compatibility

In batch systems, schema changes are handled by running a migration job before the next scheduled run. In stream systems, the pipeline must handle multiple schema versions simultaneously because events in flight may have different formats. This requires schema registries and backward-compatible changes, adding to maintenance overhead.
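One way to cope with several schema versions in flight is to upgrade every event to the latest version at the edge of the pipeline, so downstream logic sees a single shape. The sketch below assumes a hypothetical `schema_version` field and a hypothetical version 2 that adds a `currency` field with a default; a schema registry would normally supply this information.

```python
LATEST_VERSION = 2

# One upcaster per old version, each producing the next version.
# Hypothetical change: version 2 adds `currency`, defaulting to "USD".
UPCASTERS = {
    1: lambda event: {**event, "schema_version": 2, "currency": "USD"},
}


def normalize(event):
    """Upgrade an event step by step until it matches the latest schema."""
    version = event.get("schema_version", 1)  # unversioned events count as v1
    while version < LATEST_VERSION:
        event = UPCASTERS[version](event)
        version = event["schema_version"]
    return event


old_event = {"schema_version": 1, "amount": 5}
new_event = {"schema_version": 2, "amount": 5, "currency": "EUR"}
normalized = [normalize(old_event), normalize(new_event)]
```

Adding a field with a default is the backward-compatible kind of change this approach handles well. Renaming or removing a field needs more care, because old consumers may still be reading the stream.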

State drift and data quality

Stream state (like running counts or session windows) can drift due to bugs in windowing logic or watermark misconfiguration. Detecting drift is harder than in batch because there is no single point of comparison—the state is constantly changing. Teams should instrument their stream pipelines with monitoring that compares output against a batch reference periodically.
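A periodic comparison against a batch reference can be as small as the following sketch. It assumes both pipelines produce a dict of per-key counts for the same closed time range; the function name `find_drift` and the default relative tolerance of 1% are illustrative choices, not recommendations.

```python
def find_drift(stream_counts, batch_counts, tolerance=0.01):
    """Return keys whose streaming value deviates from the batch reference.

    The deviation is relative to the batch value. Keys missing from the
    streaming output are treated as zero.
    Returns a dict mapping key -> (stream_value, batch_value).
    """
    drifted = {}
    for key, expected in batch_counts.items():
        actual = stream_counts.get(key, 0)
        denominator = max(abs(expected), 1)  # avoid dividing by zero
        if abs(actual - expected) / denominator > tolerance:
            drifted[key] = (actual, expected)
    return drifted


stream_counts = {"a": 100, "b": 95}
batch_counts = {"a": 100, "b": 100, "c": 10}
drift = find_drift(stream_counts, batch_counts)
```

Only compare ranges that the streaming side considers closed; otherwise windows that are still accumulating will show up as false drift.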

Operational costs

Stream processors run continuously, consuming compute resources even when data volume is low. Batch jobs run only when needed, making them more cost-effective for variable or low-volume workloads. As a rough, workload-dependent rule of thumb, an always-on stream topology can cost two to three times as much over a year as an equivalent batch pipeline for the same data volume, especially if the stream processor is over-provisioned for peak load.
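The arithmetic behind that comparison is worth writing down, because the ratio depends entirely on how many hours the batch cluster actually runs. The sketch below uses made-up hourly rates purely for illustration; substitute your own billing figures.

```python
def monthly_cost(hourly_rate, hours_per_day, days=30):
    """Compute cost for a cluster billed per hour of runtime."""
    return hourly_rate * hours_per_day * days


def cost_ratio(stream_rate, batch_rate, batch_hours_per_day):
    """Cost of an always-on stream job relative to a scheduled batch job."""
    stream = monthly_cost(stream_rate, hours_per_day=24)
    batch = monthly_cost(batch_rate, hours_per_day=batch_hours_per_day)
    return stream / batch


# Hypothetical numbers: a small stream cluster at 1.0 per hour running 24/7,
# versus a larger batch cluster at 4.0 per hour running 2 hours a day.
ratio = cost_ratio(stream_rate=1.0, batch_rate=4.0, batch_hours_per_day=2)
```

With these illustrative inputs the stream job costs three times as much, even though its cluster is a quarter of the size. If the batch job ran eight hours a day instead, the ratio would drop below one, which is why the comparison has to be made per workload.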

Long-term maintainability

Batch pipelines are easier to understand and debug because they are deterministic—given the same input, they produce the same output. Stream pipelines depend on timing and ordering, making reproducibility harder. Teams should weigh the long-term cost of debugging stream issues against the latency benefit. For many use cases, a hybrid approach that uses batch for complex transformations and stream for simple filtering or routing is more maintainable.

When Not to Use This Approach: The Case Against Stream Topologies

Not every data flow benefits from stream processing. In some situations, batch is clearly superior, and trying to force a stream topology will lead to unnecessary complexity.

When data completeness is mandatory

If your pipeline must account for every event, with no tolerance for late arrivals being dropped or handled approximately, batch is the safer choice. Stream processing always involves some uncertainty about late data, and watermarks are heuristics, not guarantees. For financial reconciliation or regulatory reporting, batch provides the certainty you need.

When latency requirements are loose

If your use case can tolerate latency of minutes or hours, batch is simpler and cheaper. There is no need to pay the complexity premium of stream processing if a nightly job meets the business need.

When the data source is inherently batch

Some data sources—like files dropped into S3, or daily exports from a legacy system—are naturally batch. Trying to stream them by polling frequently or using change data capture (CDC) adds complexity with little benefit. Instead, schedule a batch job that runs after the source is ready.

When the team lacks stream processing expertise

Stream processing requires a different skill set than batch. If your team is experienced with SQL and batch ETL but new to stateful stream processing, the learning curve will be steep. In that case, start with micro-batching or a managed stream service that abstracts away the complexity, and only move to full streaming when the team is ready.

In short, stream topologies are a tool, not a goal. Use them where they fit, and don't hesitate to use batch where it serves the business need better.

Open Questions and Common Confusions

Even experienced teams have lingering questions about stream topologies. Here are the ones we hear most often.

How do I handle exactly-once semantics in a stream pipeline?

Exactly-once in streaming is achievable but requires coordination between source, processor, and sink. Use idempotent sinks (e.g., upserts keyed by event ID), transactional writes (e.g., Kafka transactions or Flink's two-phase-commit sinks), or deduplication logic. Remember that exactly-once often comes with a performance trade-off: in exchange for higher throughput, you may choose at-least-once delivery with deduplication instead.
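The at-least-once-plus-deduplication option can be sketched with a bounded memory of recently seen event IDs. This is a minimal illustration, assuming every event carries a unique ID; the class name `Deduplicator` and the `max_ids` bound are hypothetical.

```python
from collections import OrderedDict


class Deduplicator:
    """Remember the most recent `max_ids` event IDs and reject repeats."""

    def __init__(self, max_ids):
        self.max_ids = max_ids
        self._seen = OrderedDict()

    def is_new(self, event_id):
        """Return True the first time an ID is seen within the memory bound."""
        if event_id in self._seen:
            self._seen.move_to_end(event_id)  # refresh its recency
            return False
        self._seen[event_id] = True
        if len(self._seen) > self.max_ids:
            self._seen.popitem(last=False)  # forget the least recent ID
        return True


dedup = Deduplicator(max_ids=3)
deliveries = ["e1", "e2", "e1", "e3", "e2", "e4"]  # e1 and e2 are redelivered
accepted = [event_id for event_id in deliveries if dedup.is_new(event_id)]
```

The bound is what keeps state from growing forever, and it is also the weak point: a duplicate that arrives after its ID has been evicted is accepted again. Size the memory to cover the longest redelivery delay you expect, or pair it with an idempotent sink.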

What is the best window size for streaming aggregations?

There is no universal answer. The window size should follow from your business requirement: if you need alerts within 5 minutes, the window length plus the watermark delay must fit inside those 5 minutes, so a 5-minute tumbling window is an upper bound rather than a default. Larger windows emit fewer results and smooth out noise but increase latency. Test with realistic data to see how watermarking behaves with your data's typical delay.
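The relationship between window size, watermark delay, and alert latency is simple arithmetic. This sketch assumes tumbling windows that emit once when the watermark passes the window end, and it ignores processing and delivery time; both function names are hypothetical.

```python
def worst_case_alert_latency(window_size, watermark_delay):
    """Longest wait between an event and the result that includes it.

    An event at the very start of a window waits for the window to end,
    then for the watermark to pass the window end. Units are seconds.
    """
    return window_size + watermark_delay


def max_window_for_budget(latency_budget, watermark_delay):
    """Largest tumbling window that still meets the latency budget."""
    window = latency_budget - watermark_delay
    if window <= 0:
        raise ValueError("watermark delay alone exceeds the latency budget")
    return window


latency = worst_case_alert_latency(window_size=300, watermark_delay=30)
window = max_window_for_budget(latency_budget=300, watermark_delay=30)
```

With a 30-second watermark delay, a 300-second window can take up to 330 seconds to report, so a 5-minute alerting budget allows a window of at most 270 seconds.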

How do I reprocess historical data in a stream pipeline?

Reprocessing requires replaying data from a log (like Kafka) or from batch storage. In Kappa architecture, you simply reset the consumer offset and reprocess. In Lambda, you run the batch layer and then reconcile. The key is to design your pipeline so that reprocessing is a first-class operation, not an afterthought.

Should I use a managed stream service or build my own?

Managed services (like Amazon Kinesis, Google Cloud Pub/Sub, or Confluent Cloud) reduce operational overhead but limit customization. Building your own (e.g., with Kafka and Flink) gives more control but requires expertise. For most teams, starting with a managed service and migrating to self-managed only if needed is the pragmatic path.

These questions have no single right answer—the best choice depends on your data volume, latency needs, and team skills.

Summary and Next Experiments

Stream and batch topologies are not rivals; they are different temporal orbits that suit different data flows. The key takeaway is to map each part of your pipeline to the right rhythm: use stream processing where low latency and event-driven action matter, and use batch where completeness, simplicity, and cost efficiency are more important.

To put this into practice, try these experiments:

  • Pick one pipeline that currently runs on a strict schedule (e.g., hourly) and instrument it to measure the actual latency requirement. If the business can tolerate minutes, consider micro-batching.
  • Set up a stream processor for a small, non-critical data flow (like internal monitoring metrics) and run it alongside a batch version for a week. Compare the results for correctness and operational effort.
  • Review your existing batch pipelines for data that arrives irregularly—those are candidates for stream processing if latency matters.
  • Implement a watermark monitor that tracks how often late data is discarded. Use that data to decide whether your stream topology is too aggressive or too conservative.
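For the last experiment, a watermark monitor can start as a few counters. The sketch below is a minimal illustration, assuming you can observe each event's timestamp together with the watermark at the moment it arrives; the class name `LateDataMonitor` and its attributes are hypothetical.

```python
class LateDataMonitor:
    """Track how often events arrive behind the watermark, and by how much."""

    def __init__(self):
        self.total = 0
        self.late = 0
        self.max_lateness = 0

    def observe(self, event_time, watermark):
        """Record one event and the watermark at the moment it arrived."""
        self.total += 1
        if event_time < watermark:
            self.late += 1
            self.max_lateness = max(self.max_lateness, watermark - event_time)

    @property
    def late_ratio(self):
        """Fraction of observed events that arrived behind the watermark."""
        return self.late / self.total if self.total else 0.0


monitor = LateDataMonitor()
for event_time, watermark in [(10, 5), (4, 8), (12, 9), (3, 10)]:
    monitor.observe(event_time, watermark)
```

A late ratio near zero with a generous watermark delay suggests you are paying latency for nothing; a high ratio, or a `max_lateness` far beyond the configured delay, suggests the watermark is too aggressive for your data.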

By treating topology as a design decision rather than a dogma, you can build data systems that are both responsive and reliable, no matter the temporal orbit they occupy.
