Real-time aggregation is one of those features that sounds absolutely necessary until it starts breaking your system.
The idea is simple: fetch the latest data every time a user interacts with a dashboard or an API. But in cloud platforms with complex data, high-frequency events, and shared upstream relationships, this approach quietly becomes a liability.
Behind every real-time dashboard is a database trying to join, compute, and aggregate metrics on the fly: for every request, for every user, often across multiple levels. It works for small teams and controlled usage. But once your platform scales, things begin to fail: query latencies spike, systems time out, and users experience more loading time than insight.
This isn’t just a performance issue; it’s a design flaw, and one that won’t be solved by throwing more infrastructure at the problem.
In this blog, we’ll explore how event-driven, pre-aggregated architectures help platforms scale past these bottlenecks. We’ll break down the technical challenges, walk through an architectural response, and highlight practical engineering lessons you can apply, no matter your domain.
In multi-tenant platforms where users or entities exist in hierarchical relationships (e.g., teams, branches, partners, affiliates), a single action can ripple through multiple levels. For example, calculating financials or operational performance might require aggregating both an individual’s data and that of their upstream or downstream connections.
If this computation happens every time someone opens a dashboard or triggers an API call, the backend must:

- Traverse the hierarchy to find every upstream and downstream entity the action touches
- Join across multiple tables to gather the raw data
- Compute the aggregates on the fly, per request, per user, often across multiple levels
This approach simply doesn’t scale.
Modern SQL databases are powerful, but deep joins, recursive traversals, and simultaneous reads and writes on shared resources slow them down.
In one architecture example, a single transactional event led to dozens of row updates across hierarchy and aggregate tables. Scaling this to 500 concurrent events resulted in over 42,000 row updates. No amount of query optimization can make that sustainable, especially when it happens in real time.
Shared hierarchies introduce data contention. If one transaction updates a parent entity’s metrics, and another user further down the chain performs an action simultaneously, both updates may target the same row. Locking, retries, and database contention follow.
Add more users and more actions, and your transactional model collapses under its own weight.
The key architectural shift is simple: stop aggregating data when it’s requested. Instead, compute and store it at the moment the event occurs, asynchronously and reliably.
Here’s how to do that:
Every critical business activity (e.g., a user transaction, data update, or structural change) is published to a Kafka topic. This message includes all the necessary data to compute downstream metrics.
Kafka provides:

- Durable, replayable event logs, so nothing is lost if a consumer falls behind
- Ordering guarantees within each partition
- Decoupling of producers from consumers, so bursts of activity never block user-facing requests
This shifts the system from request-driven computation to event-driven updates.
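As a concrete illustration, here is roughly what publishing such an event might look like in Go with the segmentio/kafka-go client. The broker address, topic name, and event fields are illustrative assumptions, not the actual schema:

```go
package main

import (
	"context"
	"encoding/json"
	"log"
	"time"

	"github.com/segmentio/kafka-go"
)

// TransactionEvent carries everything a consumer needs to compute
// downstream metrics, so no follow-up lookups are required.
// Field names here are assumptions for the sake of the example.
type TransactionEvent struct {
	EventID   string    `json:"event_id"`
	UserID    string    `json:"user_id"`
	Amount    float64   `json:"amount"`
	Timestamp time.Time `json:"timestamp"`
}

func main() {
	w := &kafka.Writer{
		Addr:     kafka.TCP("localhost:9092"),
		Topic:    "business-events", // hypothetical topic name
		Balancer: &kafka.Hash{},     // same key always lands in the same partition
	}
	defer w.Close()

	evt := TransactionEvent{
		EventID:   "evt-001",
		UserID:    "user-42",
		Amount:    100.0,
		Timestamp: time.Now(),
	}
	payload, err := json.Marshal(evt)
	if err != nil {
		log.Fatal(err)
	}

	// Keying by user ID keeps all of a user's events in one partition,
	// preserving per-user ordering for consumers.
	err = w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte(evt.UserID),
		Value: payload,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```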
Dedicated consumers, implemented here in Go for its performance and concurrency model, listen to the Kafka topics and process events as they arrive.
Each consumer:

- Reads events from its assigned partitions as they arrive
- Computes the metrics affected at every level of the hierarchy
- Writes the results to the pre-aggregated tables in a single transaction
- Acknowledges the message only after the write succeeds
Because processing is event-driven, each update happens once per event, before anyone queries the data.
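A minimal sketch of such a consumer loop, building on the producer snippet's imports and event type above. The group ID is an assumption, and applyMetrics stands in for the real aggregation logic (expanded in a later sketch):

```go
// applyMetrics is a stand-in for the real aggregation logic: it updates
// the pre-aggregated tables for every affected hierarchy level inside
// a single database transaction.
func applyMetrics(ctx context.Context, evt TransactionEvent) error {
	// ... single-transaction writes to the summary tables ...
	return nil
}

// runConsumer reads events from the shared topic and commits each
// offset only after the database write succeeds, giving at-least-once
// processing semantics.
func runConsumer(ctx context.Context) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "metrics-aggregator", // hypothetical group name
		Topic:   "business-events",
	})
	defer r.Close()

	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}

		var evt TransactionEvent
		if err := json.Unmarshal(msg.Value, &evt); err != nil {
			// A malformed event would be routed to the failure topic
			// here; for brevity we just skip past it.
			_ = r.CommitMessages(ctx, msg)
			continue
		}

		if err := applyMetrics(ctx, evt); err != nil {
			// A failed event must be handed to the failure topic before
			// moving on, or it would be lost once a later offset is
			// committed; that consumer is sketched further down.
			continue
		}

		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```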
Consumers write to purpose-built tables optimized for dashboard queries, such as per-user metric summaries and per-level rollups keyed by entity and time window.
Dashboards now read from pre-computed views; no more on-the-fly computation per request.
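A dashboard read then collapses to a single indexed lookup. Here is a sketch using Go's database/sql, where the table and column names are assumptions:

```go
import (
	"context"
	"database/sql"
)

// loadUserSummary serves a dashboard metric with one indexed read;
// no joins or aggregation happen at request time.
func loadUserSummary(ctx context.Context, db *sql.DB, userID string) (float64, error) {
	var total float64
	err := db.QueryRowContext(ctx,
		`SELECT total_volume FROM user_metrics_summary WHERE user_id = $1`,
		userID,
	).Scan(&total)
	return total, err
}
```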
To ensure data consistency and reliability, the system includes a two-step acknowledgment flow:

1. After its transaction commits, the consumer publishes an acknowledgment for the event to a dedicated acknowledgment topic.
2. A recovery processor watches for events that never receive an acknowledgment and re-queues them for processing.
This recovery mechanism guarantees eventual consistency, even if parts of the system temporarily fail.
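One way such a flow can be realized, as a sketch: the producer records each event as pending, the acknowledgment consumer clears the record when the ack arrives, and a recovery processor periodically re-queues whatever is left. The pending_events table, its columns, and the five-minute grace period are all assumptions (the SQL is PostgreSQL-flavored):

```go
import (
	"context"
	"database/sql"

	"github.com/segmentio/kafka-go"
)

// ackEvent is called by the acknowledgment consumer: once an ack
// arrives for an event, its pending record is cleared.
func ackEvent(ctx context.Context, db *sql.DB, eventID string) error {
	_, err := db.ExecContext(ctx,
		`DELETE FROM pending_events WHERE event_id = $1`, eventID)
	return err
}

// republishStale finds events that were recorded but never acknowledged
// within the grace period and re-queues them for processing.
func republishStale(ctx context.Context, db *sql.DB, w *kafka.Writer) error {
	rows, err := db.QueryContext(ctx, `
		SELECT event_id, payload FROM pending_events
		WHERE created_at < now() - interval '5 minutes'`)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id string
		var payload []byte
		if err := rows.Scan(&id, &payload); err != nil {
			return err
		}
		// Re-queue the original payload; the consumer will process it
		// and the normal ack path will clear the pending record.
		if err := w.WriteMessages(ctx, kafka.Message{
			Key:   []byte(id),
			Value: payload,
		}); err != nil {
			return err
		}
	}
	return rows.Err()
}
```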
In complex platforms, hierarchies aren’t static. Users are created, suspended, reassigned, or restructured regularly. And when data aggregation depends on the accuracy of these relationships, even the slightest misalignment can cause discrepancies in reporting, financials, or system behavior.
To maintain consistency, the architecture must keep hierarchical structures accurate in near real-time.
Every time a user’s status or relationship changes, whether it’s onboarding a new user, suspending an account, or modifying reporting lines, a background task is triggered. This task:

- Recomputes the affected portion of the hierarchy
- Updates the denormalized relationship tables that the aggregation consumers depend on
- Propagates the change downstream without blocking the request that triggered it
This approach ensures low-latency propagation of structural changes without blocking real-time operations.
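As an illustration, here is roughly what such a task might look like if the hierarchy were denormalized into a closure table. The schema is an assumption, and relinking the moved user's entire subtree is elided for brevity:

```go
import (
	"context"
	"database/sql"
)

// syncHierarchy rebuilds the denormalized ancestor links for a user
// after a structural change, inside one transaction.
func syncHierarchy(ctx context.Context, db *sql.DB, userID, newParentID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once Commit succeeds

	// Drop the user's stale ancestor links (keep the self-link).
	if _, err := tx.ExecContext(ctx,
		`DELETE FROM user_closure WHERE descendant_id = $1 AND ancestor_id <> $1`,
		userID); err != nil {
		return err
	}

	// Relink the user under every ancestor of the new parent,
	// including the parent itself.
	if _, err := tx.ExecContext(ctx, `
		INSERT INTO user_closure (ancestor_id, descendant_id, depth)
		SELECT ancestor_id, $1, depth + 1
		FROM user_closure WHERE descendant_id = $2`,
		userID, newParentID); err != nil {
		return err
	}
	return tx.Commit()
}
```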
Despite real-time syncing, edge cases and transient failures, like network blips, processing timeouts, or message delivery issues, can occasionally disrupt hierarchy updates. To ensure nothing falls through the cracks, the system includes a dedicated failure-handling mechanism powered by Kafka.
Instead of relying on scheduled jobs, a failure topic captures and tracks events that weren’t successfully processed during their initial attempt. A specialized consumer monitors this topic and:

- Retries each failed event, backing off between attempts
- Publishes the acknowledgment once reprocessing succeeds
- Escalates events that exhaust their retries so they can be investigated rather than lost
This reactive flow ensures eventual consistency without introducing additional time-based reconciliation logic. It also provides greater traceability and granularity, since each failed event is treated as a first-class citizen in the system rather than being swept up by a batch job.
By embedding failure detection directly into the event pipeline, the architecture stays true to its real-time, event-first design philosophy, combining immediacy, observability, and resilience.
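Here is a sketch of what that specialized consumer could look like, reusing the TransactionEvent type and applyMetrics stand-in from the earlier snippets. The failure topic name, retry limit, and backoff schedule are assumptions:

```go
// runFailureConsumer drains the failure topic and retries each event
// with exponential backoff, escalating once retries are exhausted.
func runFailureConsumer(ctx context.Context) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "failure-recovery",
		Topic:   "business-events-failed", // hypothetical failure topic
	})
	defer r.Close()

	const maxAttempts = 5
	for {
		msg, err := r.FetchMessage(ctx)
		if err != nil {
			return err
		}

		var evt TransactionEvent
		if err := json.Unmarshal(msg.Value, &evt); err == nil {
			var processed bool
			for attempt := 0; attempt < maxAttempts; attempt++ {
				if err := applyMetrics(ctx, evt); err == nil {
					processed = true
					break
				}
				// Exponential backoff between attempts: 1s, 2s, 4s, ...
				time.Sleep(time.Second << attempt)
			}
			if !processed {
				// Exhausted retries: escalate (e.g., a dead-letter topic
				// or an alert) rather than silently dropping the event.
				log.Printf("event %s exceeded retry limit", evt.EventID)
			}
		}

		if err := r.CommitMessages(ctx, msg); err != nil {
			return err
		}
	}
}
```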
This model brings several benefits for data-heavy platforms:
Dashboards hit optimized, indexed tables with pre-computed results. Queries are near-instantaneous, even with high concurrency.
Updates happen asynchronously, once per event. There’s no redundant recomputation, and the database isn’t overloaded with reactive joins.
Kafka’s durable logs and recovery processors ensure that events aren’t lost, and errors are retried without manual intervention.
Kafka consumers can scale with partitions. Each instance processes messages independently, and Go’s concurrency model makes scaling lightweight and efficient.
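For instance, several consumer instances can share one consumer group within a single process, and Kafka spreads the topic's partitions across them. This sketch reuses the runConsumer function from earlier:

```go
import (
	"context"
	"log"
	"sync"
)

// runConsumerPool starts several consumers in the same consumer group;
// Kafka assigns each one a share of the partitions, and goroutines
// keep every instance cheap to run.
func runConsumerPool(ctx context.Context, instances int) {
	var wg sync.WaitGroup
	for i := 0; i < instances; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := runConsumer(ctx); err != nil {
				log.Printf("consumer %d stopped: %v", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```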
This architectural approach is especially effective in platforms where data dependencies are complex, event volumes are high, and real-time expectations are growing.
Whether it’s multi-level partner networks, affiliate programs, field operations teams, or reseller models, data relationships can span multiple levels. Each action by a user may impact not just their own metrics, but those of several upstream or downstream entities. If your current system struggles to compute these relationships on the fly, a pre-aggregated, event-driven architecture can eliminate that bottleneck.
In systems where events are generated constantly, think thousands of transactions, updates, or activities per minute, real-time queries quickly become a scalability nightmare. Event-driven processing allows these updates to happen independently of dashboard load or user access patterns, enabling much higher throughput and concurrency.
If your business depends on dashboards that need to reflect near-real-time metrics without slowing down, this model ensures data freshness without the latency of live computation. Pre-aggregated tables ensure your dashboards stay performant, even under pressure.
In platforms where a single action impacts multiple entities, ensuring consistency is critical. Event-driven processing guarantees that every metric update is computed using the same logic, within the same transaction, and applied systematically across all related tables: no inconsistencies, no skipped steps.
This pattern is also valuable if you're designing for resilience and reliability. With event logs, acknowledgment topics, and recovery processors in place, your system can handle crashes, delays, and even partial failures gracefully, without losing critical data or requiring manual intervention.
Event-driven pre-aggregation is more than a performance optimization; it's an architectural mindset. Here are the key takeaways for engineering leaders and architects:
Live queries are expensive. Every time a user loads a page and your backend has to scan, join, and compute, you're spending unnecessary resources. Instead, move your compute to the event pipeline. Let updates happen asynchronously, and reserve query time for reading only.
It’s a common misconception that pre-aggregated data is less flexible or accurate. In reality, it enables real-time insights without sacrificing system performance. With the right recovery logic and schema design, it becomes a reliable, scalable solution for dynamic environments.
When processing events that affect multiple tables and multiple levels of hierarchy, partial writes can introduce major inconsistencies. Always bundle updates into a single atomic transaction per event. If something fails, the whole transaction rolls back, keeping your system clean and traceable.
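To make this concrete, here is the applyMetrics stand-in from the consumer sketch expanded, with the database handle passed explicitly. The table names and the closure-table join carry over from the earlier sketches and remain assumptions:

```go
// applyMetrics updates every table touched by one event inside a
// single transaction, so either all hierarchy levels see the change
// or none do.
func applyMetrics(ctx context.Context, db *sql.DB, evt TransactionEvent) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // rolls everything back unless Commit succeeds

	// Update the acting user's own summary row.
	if _, err := tx.ExecContext(ctx,
		`UPDATE user_metrics_summary
		 SET total_volume = total_volume + $1
		 WHERE user_id = $2`,
		evt.Amount, evt.UserID); err != nil {
		return err
	}

	// Roll the same amount up to every ancestor in one statement
	// (PostgreSQL UPDATE ... FROM syntax).
	if _, err := tx.ExecContext(ctx,
		`UPDATE user_metrics_summary m
		 SET total_volume = m.total_volume + $1
		 FROM user_closure c
		 WHERE c.descendant_id = $2
		   AND m.user_id = c.ancestor_id
		   AND c.ancestor_id <> $2`,
		evt.Amount, evt.UserID); err != nil {
		return err
	}
	return tx.Commit()
}
```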
Design your architecture with visibility in mind. Track every message processed, every acknowledgment sent, and every error encountered. Use tools like Prometheus and OpenTelemetry to build observability into the flow, not as a monitoring afterthought, but as an architectural requirement.
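For example, with the prometheus/client_golang library, pipeline counters can be declared once and incremented at each stage. The metric names below are illustrative:

```go
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// One counter per pipeline stage: processed events, acknowledgments,
// and failures, each sliceable by label.
var (
	eventsProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "pipeline_events_processed_total",
		Help: "Events successfully processed, by topic.",
	}, []string{"topic"})

	acksSent = promauto.NewCounter(prometheus.CounterOpts{
		Name: "pipeline_acks_sent_total",
		Help: "Acknowledgments published after successful writes.",
	})

	processingErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "pipeline_processing_errors_total",
		Help: "Events that failed processing, by stage.",
	}, []string{"stage"})
)
```

Each counter is then incremented inline, for example eventsProcessed.WithLabelValues("business-events").Inc() right after a successful commit, and the standard promhttp handler exposes them for scraping.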
It’s not about whether failures will happen; they will. What matters is how you recover. A well-architected event-driven system includes backoff, retries, recovery jobs, and acknowledgment audits. These aren’t hacks; they’re what make your system resilient and production-grade.
Event-driven aggregation isn’t a workaround. It’s a deliberate shift in how modern engineering teams think about data flow, consistency, and user experience. It decouples the urgency of computation from the pressure of user interaction, unlocking performance, predictability, and peace of mind.
At Axelerant, we don’t just build systems that work; we engineer systems that scale, recover, and evolve with your business. Whether you're designing a new platform or modernizing a legacy architecture, our teams bring the depth, discipline, and innovation needed to architect for the next phase, not the last one.
If your current backend is struggling to keep up, it’s not a sign to scale up; it’s a signal to level up your architecture.