Real-time aggregation is one of those features that sounds absolutely necessary until it starts breaking your system.
The idea is simple: fetch the latest data every time a user interacts with a dashboard or an API. But in cloud platforms with complex data, high-frequency events, and shared upstream relationships, this approach quietly becomes a liability.
Behind every real-time dashboard is a database trying to join, compute, and aggregate metrics on the fly: for every request, for every user, often across multiple levels. It works for small teams and controlled usage. But once your platform scales, things begin to fail: query latencies spike, systems time out, and users experience more loading time than insight.
This isn’t just a performance issue; it’s a design flaw, and one that won’t be solved by throwing more infrastructure at the problem.
In this blog, we’ll explore how event-driven, pre-aggregated architectures help platforms scale past these bottlenecks. We’ll break down the technical challenges, walk through an architectural response, and highlight practical engineering lessons you can apply, no matter your domain.
In multi-tenant platforms where users or entities exist in hierarchical relationships (e.g., teams, branches, partners, affiliates), a single action can ripple through multiple levels. For example, calculating financials or operational performance might require aggregating both an individual’s data and that of their upstream or downstream connections.
If this computation happens every time someone opens a dashboard or triggers an API call, the backend must:

- Traverse the hierarchy to find every upstream and downstream entity the action touches
- Join across multiple tables to gather the raw data
- Compute the aggregates on the fly, per request, per user, often across multiple levels
This approach simply doesn’t scale.
Modern SQL databases are powerful, but deep joins, recursive traversals, and simultaneous reads and writes on shared resources slow them down.
In one architecture example, a single transactional event led to dozens of row updates across hierarchy and aggregate tables. Scaling this to 500 concurrent events resulted in over 42,000 row updates. No amount of query optimization can make that sustainable, especially when it happens in real time.
Shared hierarchies introduce data contention. If one transaction updates a parent entity’s metrics, and another user further down the chain performs an action simultaneously, both updates may target the same row. Locking, retries, and database contention follow.
Add more users and more actions, and your transactional model collapses under its own weight.
The key architectural shift is simple: stop aggregating data when it’s requested. Instead, compute and store it at the moment the event occurs, asynchronously and reliably.
Here’s how to do that:
Every critical business activity (e.g., a user transaction, data update, or structural change) is published to a Kafka topic. This message includes all the necessary data to compute downstream metrics.
Kafka provides:

- Durable, replayable event logs, so nothing is lost if a consumer falls behind
- Ordering guarantees within each partition
- Decoupling of producers from consumers, so bursts of activity never block user-facing requests
This shifts the system from request-driven computation to event-driven updates.
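As a concrete illustration, here is roughly what publishing such an event might look like in Go with the segmentio/kafka-go client. The broker address, topic name, and event fields are illustrative assumptions, not the actual schema:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// TransactionEvent carries everything a consumer needs to compute
// downstream metrics, so no follow-up lookups are required.
// Field names here are assumptions for the sake of the example.
type TransactionEvent struct {
	EventID   string    `json:"event_id"`
	UserID    string    `json:"user_id"`
	Amount    float64   `json:"amount"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "business-events", // hypothetical topic name
		Balancer: &kafka.Hash{},     // same key always lands in the same partition
	}
	defer w.Close()

	evt := TransactionEvent{
		EventID:   "evt-001",
		UserID:    "user-42",
		Amount:    100.0,
		Timestamp: time.Now(),
	}
	payload, err := json.Marshal(evt)
	if err != nil {
		log.Fatal(err)
	}

	// Keying by user ID keeps all of a user's events in one partition,
	// preserving per-user ordering for consumers.
	err = w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte(evt.UserID),
		Value: payload,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```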
Dedicated consumers, implemented here in Go for its performance and concurrency model, listen to the Kafka topics and process events as they arrive.
Each consumer:

- Reads events from its assigned partitions as they arrive
- Computes the metrics affected at every level of the hierarchy
- Writes the results to the pre-aggregated tables in a single transaction
- Acknowledges the message only after the write succeeds
Because processing is event-driven, each update happens once per event, before anyone queries the data.
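A minimal sketch of such a consumer loop, building on the producer snippet's imports and event type above. The group ID is an assumption, and applyMetrics stands in for the real aggregation logic (expanded in a later sketch):

```go
// applyMetrics is a stand-in for the real aggregation logic: it updates
// the pre-aggregated tables for every affected hierarchy level inside
// a single database transaction.
func applyMetrics(ctx context.Context, evt TransactionEvent) error {
	// ... single-transaction writes to the summary tables ...
	return nil
}

// runConsumer reads events from the shared topic and commits each
// offset only after the database write succeeds, giving at-least-once
// processing semantics.
func runConsumer(ctx context.Context) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "metrics-aggregator", // hypothetical group name
		Topic:   "business-events",
	})
	defer r.Close()

	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}

		var evt TransactionEvent
		if err := json.Unmarshal(msg.Value, &evt); err != nil {
			// A malformed event would be routed to the failure topic
			// here; for brevity we just skip past it.
			_ = r.CommitMessages(ctx, msg)
			continue
		}

		if err := applyMetrics(ctx, evt); err != nil {
			// A failed event must be handed to the failure topic before
			// moving on, or it would be lost once a later offset is
			// committed; that consumer is sketched further down.
			continue
		}

		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```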
Consumers write to purpose-built tables optimized for dashboard queries, such as per-user metric summaries and per-level rollups keyed by entity and time window.
Dashboards now read from pre-computed views; no more on-the-fly computation per request.
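A dashboard read then collapses to a single indexed lookup. Here is a sketch using Go's database/sql, where the table and column names are assumptions:

```go
import (
	"context"
	"database/sql"
)

// loadUserSummary serves a dashboard metric with one indexed read;
// no joins or aggregation happen at request time.
func loadUserSummary(ctx context.Context, db *sql.DB, userID string) (float64, error) {
	var total float64
	err := db.QueryRowContext(ctx,
		`SELECT total_volume FROM user_metrics_summary WHERE user_id = $1`,
		userID,
	).Scan(&total)
	return total, err
}
```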
To ensure data consistency and reliability, the system includes a two-step acknowledgment flow:

1. After its transaction commits, the consumer publishes an acknowledgment for the event to a dedicated acknowledgment topic.
2. A recovery processor watches for events that never receive an acknowledgment and re-queues them for processing.
This recovery mechanism guarantees eventual consistency, even if parts of the system temporarily fail.
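One way such a flow can be realized, as a sketch: the producer records each event as pending, the acknowledgment consumer clears the record when the ack arrives, and a recovery processor periodically re-queues whatever is left. The pending_events table, its columns, and the five-minute grace period are all assumptions (the SQL is PostgreSQL-flavored):

```go
import (
	"context"
	"database/sql"

	"github.com/segmentio/kafka-go"
)

// ackEvent is called by the acknowledgment consumer: once an ack
// arrives for an event, its pending record is cleared.
func ackEvent(ctx context.Context, db *sql.DB, eventID string) error {
	_, err := db.ExecContext(ctx,
		`DELETE FROM pending_events WHERE event_id = $1`, eventID)
	return err
}

// republishStale finds events that were recorded but never acknowledged
// within the grace period and re-queues them for processing.
func republishStale(ctx context.Context, db *sql.DB, w *kafka.Writer) error {
	rows, err := db.QueryContext(ctx, `
		SELECT event_id, payload FROM pending_events
		WHERE created_at < now() - interval '5 minutes'`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id string
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return err
		}
		// Re-queue the original payload; the consumer will process it
		// and the normal ack path will clear the pending record.
		if err := w.WriteMessages(ctx, kafka.Message{
			Key:   []byte(id),
			Value: payload,
		}); err != nil {
			return err
		}
	}
	return rows.Err()
}
```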
In complex platforms, hierarchies aren’t static. Users are created, suspended, reassigned, or restructured regularly. And when data aggregation depends on the accuracy of these relationships, even the slightest misalignment can cause discrepancies in reporting, financials, or system behavior.
To maintain consistency, the architecture must keep hierarchical structures accurate in near real-time.
Every time a user’s status or relationship changes, whether it’s onboarding a new user, suspending an account, or modifying reporting lines, a background task is triggered. This task:

- Recomputes the affected portion of the hierarchy
- Updates the denormalized relationship tables that the aggregation consumers depend on
- Propagates the change downstream without blocking the request that triggered it
This approach ensures low-latency propagation of structural changes without blocking real-time operations.
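As an illustration, here is roughly what such a task might look like if the hierarchy were denormalized into a closure table. The schema is an assumption, and relinking the moved user's entire subtree is elided for brevity:

```go
import (
	"context"
	"database/sql"
)

// syncHierarchy rebuilds the denormalized ancestor links for a user
// after a structural change, inside one transaction.
func syncHierarchy(ctx context.Context, db *sql.DB, userID, newParentID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// Drop the user's stale ancestor links (keep the self-link).
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM user_closure WHERE descendant_id = $1 AND ancestor_id <> $1`,
		userID); err != nil {
		return err
	}

	// Relink the user under every ancestor of the new parent,
	// including the parent itself.
	if _, err := tx.ExecContext(ctx, `
		INSERT INTO user_closure (ancestor_id, descendant_id, depth)
		SELECT ancestor_id, $1, depth + 1
		FROM user_closure WHERE descendant_id = $2`,
		userID, newParentID); err != nil {
		return err
	}
	return tx.Commit()
}
```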
Despite real-time syncing, edge cases and transient failures, like network blips, processing timeouts, or message delivery issues, can occasionally disrupt hierarchy updates. To ensure nothing falls through the cracks, the system includes a dedicated failure-handling mechanism powered by Kafka.
Instead of relying on scheduled jobs, a failure topic captures and tracks events that weren’t successfully processed during their initial attempt. A specialized consumer monitors this topic and:

- Retries each failed event, backing off between attempts
- Publishes the acknowledgment once reprocessing succeeds
- Escalates events that exhaust their retries so they can be investigated rather than lost
This reactive flow ensures eventual consistency without introducing additional time-based reconciliation logic. It also provides greater traceability and granularity, since each failed event is treated as a first-class citizen in the system rather than being swept up by a batch job.
By embedding failure detection directly into the event pipeline, the architecture stays true to its real-time, event-first design philosophy, combining immediacy, observability, and resilience.
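Here is a sketch of what that specialized consumer could look like, reusing the TransactionEvent type and applyMetrics stand-in from the earlier snippets. The failure topic name, retry limit, and backoff schedule are assumptions:

```go
// runFailureConsumer drains the failure topic and retries each event
// with exponential backoff, escalating once retries are exhausted.
func runFailureConsumer(ctx context.Context) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "failure-recovery",
		Topic:   "business-events-failed", // hypothetical failure topic
	})
	defer r.Close()

	const maxAttempts = 5
	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}

		var evt TransactionEvent
		if err := json.Unmarshal(msg.Value, &evt); err == nil {
			var processed bool
			for attempt := 0; attempt < maxAttempts; attempt++ {
				if err := applyMetrics(ctx, evt); err == nil {
					processed = true
					break
				}
				// Exponential backoff between attempts: 1s, 2s, 4s, ...
				time.Sleep(time.Second << attempt)
			}
			if !processed {
				// Exhausted retries: escalate (e.g., a dead-letter topic
				// or an alert) rather than silently dropping the event.
				log.Printf("event %s exceeded retry limit", evt.EventID)
			}
		}

		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```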
This model brings several benefits for data-heavy platforms:
Dashboards hit optimized, indexed tables with pre-computed results. Queries are near-instantaneous, even with high concurrency.
Updates happen asynchronously, once per event. There’s no redundant recomputation, and the database isn’t overloaded with reactive joins.
Kafka’s durable logs and recovery processors ensure that events aren’t lost, and errors are retried without manual intervention.
Kafka consumers can scale with partitions. Each instance processes messages independently, and Go’s concurrency model makes scaling lightweight and efficient.
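For instance, several consumer instances can share one consumer group within a single process, and Kafka spreads the topic's partitions across them. This sketch reuses the runConsumer function from earlier:

```go
import (
	"context"
	"log"
	"sync"
)

// runConsumerPool starts several consumers in the same consumer group;
// Kafka assigns each one a share of the partitions, and goroutines
// keep every instance cheap to run.
func runConsumerPool(ctx context.Context, instances int) {
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := runConsumer(ctx); err != nil {
				log.Printf("consumer %d stopped: %v", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```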
This architectural approach is especially effective in platforms where data dependencies are complex, event volumes are high, and real-time expectations are growing.
Whether it’s multi-level partner networks, affiliate programs, field operations teams, or reseller models, data relationships can span multiple levels. Each action by a user may impact not just their own metrics, but those of several upstream or downstream entities. If your current system struggles to compute these relationships on the fly, a pre-aggregated, event-driven architecture can eliminate that bottleneck.
In systems where events are generated constantly, think thousands of transactions, updates, or activities per minute, real-time queries quickly become a scalability nightmare. Event-driven processing allows these updates to happen independently of dashboard load or user access patterns, enabling much higher throughput and concurrency.
If your business depends on dashboards that need to reflect near-real-time metrics without slowing down, this model ensures data freshness without the latency of live computation. Pre-aggregated tables ensure your dashboards stay performant, even under pressure.
In platforms where a single action impacts multiple entities, ensuring consistency is critical. Event-driven processing guarantees that every metric update is computed using the same logic, within the same transaction, and applied systematically across all related tables: no inconsistencies, no skipped steps.
This pattern is also valuable if you're designing for resilience and reliability. With event logs, acknowledgment topics, and recovery processors in place, your system can handle crashes, delays, and even partial failures gracefully, without losing critical data or requiring manual intervention.
Event-driven pre-aggregation is more than a performance optimization; it's an architectural mindset. Here are the key takeaways for engineering leaders and architects:
Live queries are expensive. Every time a user loads a page and your backend has to scan, join, and compute, you're spending unnecessary resources. Instead, move your compute to the event pipeline. Let updates happen asynchronously, and reserve query time for reading only.
It’s a common misconception that pre-aggregated data is less flexible or accurate. In reality, it enables real-time insights without sacrificing system performance. With the right recovery logic and schema design, it becomes a reliable, scalable solution for dynamic environments.
When processing events that affect multiple tables and multiple levels of hierarchy, partial writes can introduce major inconsistencies. Always bundle updates into a single atomic transaction per event. If something fails, the whole transaction rolls back, keeping your system clean and traceable.
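To make this concrete, here is the applyMetrics stand-in from the consumer sketch expanded, with the database handle passed explicitly. The table names and the closure-table join carry over from the earlier sketches and remain assumptions:

```go
// applyMetrics updates every table touched by one event inside a
// single transaction, so either all hierarchy levels see the change
// or none do.
func applyMetrics(ctx context.Context, db *sql.DB, evt TransactionEvent) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // rolls everything back unless Commit succeeds

	// Update the acting user's own summary row.
	if _, err := tx.ExecContext(ctx,
		`UPDATE user_metrics_summary
		 SET total_volume = total_volume + $1
		 WHERE user_id = $2`,
		evt.Amount, evt.UserID); err != nil {
		return err
	}

	// Roll the same amount up to every ancestor in one statement
	// (PostgreSQL UPDATE ... FROM syntax).
	if _, err := tx.ExecContext(ctx,
		`UPDATE user_metrics_summary m
		 SET total_volume = m.total_volume + $1
		 FROM user_closure c
		 WHERE c.descendant_id = $2
		   AND m.user_id = c.ancestor_id
		   AND c.ancestor_id <> $2`,
		evt.Amount, evt.UserID); err != nil {
		return err
	}
	return tx.Commit()
}
```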
Design your architecture with visibility in mind. Track every message processed, every acknowledgment sent, and every error encountered. Use tools like Prometheus and OpenTelemetry to build observability into the flow, not as a monitoring afterthought, but as an architectural requirement.
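For example, with the prometheus/client_golang library, pipeline counters can be declared once and incremented at each stage. The metric names below are illustrative:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One counter per pipeline stage: processed events, acknowledgments,
// and failures, each sliceable by label.
var (
	eventsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "pipeline_events_processed_total",
		Help: "Events successfully processed, by topic.",
	}, []string{"topic"})

	acksSent = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pipeline_acks_sent_total",
		Help: "Acknowledgments published after successful writes.",
	})

	processingErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "pipeline_processing_errors_total",
		Help: "Events that failed processing, by stage.",
	}, []string{"stage"})
)
```

Each counter is then incremented inline, for example eventsProcessed.WithLabelValues("business-events").Inc() right after a successful commit, and the standard promhttp handler exposes them for scraping.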
It’s not about whether failures will happen; they will. What matters is how you recover. A well-architected event-driven system includes backoff, retries, recovery jobs, and acknowledgment audits. These aren’t hacks; they’re what make your system resilient and production-grade.
Event-driven aggregation isn’t a workaround. It’s a deliberate shift in how modern engineering teams think about data flow, consistency, and user experience. It decouples the urgency of computation from the pressure of user interaction, unlocking performance, predictability, and peace of mind.
At Axelerant, we don’t just build systems that work; we engineer systems that scale, recover, and evolve with your business. Whether you're designing a new platform or modernizing a legacy architecture, our teams bring the depth, discipline, and innovation needed to architect for the next phase, not the last one.
If your current backend is struggling to keep up, it’s not a sign to scale up; it’s a signal to level up your architecture.