Is Your Dashboard Slow Because of Too Much Data? Think Again.
Our team specializes in telematics. We began as a division focused on data analysis and have since expanded to manage the full data journey: from acquiring data with custom scripts on telematics devices attached to electric buses, forklifts, and hydrogen fuel cells, to delivering actionable insights on screen so that end users can make evidence-based decisions.
Handling large datasets is part of our daily routine. As an IT consulting company, we've had the opportunity to understand and assist many enterprises with their big data challenges. One recurring issue is companies seeking performance improvements in their data management, particularly in their dashboards. The common scenario: a project is built by an internal team or an external partner, but once it's deployed and field data starts pouring in, the system becomes unresponsive.
We all assume the main issue is the sheer amount of data, right? Or is it?
Let's define the problem we're addressing: dashboards. While data ingress challenges such as parallelization and horizontal scaling are topics for another blog, the primary pain point customers face is at the dashboard level. Charts take forever to load. When you consult the engineering team, they explain that plotting hundreds of thousands of data points in a time series chart is causing the delay, and they suggest limiting the amount of data. But you push back, pointing out the need to inspect an entire day, week, or more of operational data, a problem that is aggravated when values are collected at one-second or sub-second intervals.
Here's where we argue, from a first-principles perspective, that the limiting factor isn't the amount of data; it's the number of pixels on a monitor. Whether we have hundreds of thousands of data points or millions, a monitor that is, say, 2,000 pixels wide can only display so many of them. The data gets compressed by the charting library to fit the screen anyway. So why not perform that compression on the server and send just the necessary data points? The chart would look the same.
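To make that concrete, here is a quick back-of-the-envelope calculation using the dataset size from the example below and the horizontal resolution of a Full HD monitor:

```python
# How many raw samples compete for a single pixel column?
points = 300_000        # size of the example dataset below
pixel_columns = 1_920   # horizontal resolution of a Full HD monitor

print(points / pixel_columns)  # ~156 raw samples per pixel column
```

In other words, roughly 156 values end up squeezed into every column of pixels; most of them can never be seen individually.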
This idea often meets resistance. I've seen excellent engineers challenge this line of thought, insisting that no information should be lost and that all data should remain intact and be charted.
Working Code Instead of Words
Rather than debate, we built a simple proof of concept: a downsampling algorithm. While there's plenty of literature on algorithms that achieve these results, as an IT consulting company we'd prefer to focus on the outcomes rather than the "secret sauce."
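Still, to make the discussion tangible, here is a minimal sketch of one well-known approach: keeping only the minimum and maximum of each pixel-sized bucket. This is not the algorithm behind the charts below, just an illustration of the idea, written in Python with NumPy:

```python
import numpy as np

def downsample_minmax(t, y, buckets):
    """Keep each bucket's minimum and maximum so spikes and outliers survive.

    Illustrative only: `t` and `y` are equally long 1-D arrays, and the result
    has at most 2 * buckets points.
    """
    t, y = np.asarray(t), np.asarray(y)
    edges = np.linspace(0, len(y), buckets + 1, dtype=int)
    keep = set()
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi <= lo:  # can happen when buckets > len(y)
            continue
        chunk = y[lo:hi]
        keep.add(lo + int(chunk.argmin()))
        keep.add(lo + int(chunk.argmax()))
    idx = np.array(sorted(keep))
    return t[idx], y[idx]
```

Because every bucket contributes its extremes, spikes that span a single sample still make it to the chart.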
Original Dataset (300,000 Data Points):
This chart shows two 300,000-point series generated from a sine function, each with 100 completely random outliers added.
The screenshots were taken on a 1920x1080 monitor.
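For reference, test data of this shape can be generated along these lines (the exact sine parameters and outlier magnitudes are not what matters here):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300_000
t = np.arange(n)

def noisy_sine(phase):
    """A 300,000-point sine wave with a little noise and 100 random outliers."""
    y = np.sin(2 * np.pi * t / 25_000 + phase) + rng.normal(0, 0.05, n)
    spikes = rng.choice(n, 100, replace=False)
    y[spikes] += rng.uniform(3, 5, 100) * rng.choice([-1, 1], 100)
    return y

series_a, series_b = noisy_sine(0.0), noisy_sine(np.pi / 2)
```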
Downsampled Dataset (3,000 Data Points):
Here we use our downsampling algorithm to reduce the dataset to 1% of its original size.
As you can see, all the essential information is preserved, including the outliers.
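Continuing the sketches above, the 1% version corresponds to 1,500 buckets per series, and it is easy to check programmatically that the extremes survive:

```python
# 1,500 buckets x (min + max) ≈ 3,000 points, i.e. 1% of the original series.
t_small, a_small = downsample_minmax(t, series_a, 1_500)

print(len(a_small))                       # ~3,000
print(series_a.max() == a_small.max())    # True: the largest outlier is kept
print(series_a.min() == a_small.min())    # True: so is the smallest value
```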
Here are the two charts side by side for easier comparison:
Further Downsampling (300 Data Points):
Reducing the sample size below the number of available pixels results in a loss of visual information and doesn’t help the point we are trying to convey. However, for the sake of transparency, it’s useful to see how the algorithm performs at the limit.
Imagine shipping 99% less data for that 300,000-data-point chart, and how much faster your dashboards would load.
By performing this downsampling on the backend, we reduce network traffic, shrink the client's memory footprint, and, most importantly, save you time.
Currently, this compression is often handled by your charting library on the client side. This means all the data is retrieved from your database, sent over the network to your API server, and then transmitted to your web client, where the chart library downscales it based on the available pixels.
Returning to first principles, we know that bringing computation closer to the data improves overall architecture. With our proposal, we can implement this downsampling technology directly on your database. In this example, only 1% of the data would be sent to the API and then to the web client.
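As a sketch of what that could look like (assuming PostgreSQL, a hypothetical sensor_readings(ts, value) table, and psycopg2 on the API side), the database can group the requested time window into one bucket per pixel column and return only each bucket's extremes:

```python
import psycopg2

# One (min, max) pair per pixel column; everything else stays in the database.
QUERY = """
SELECT bucket, min(value) AS y_min, max(value) AS y_max
FROM (
    SELECT width_bucket(extract(epoch FROM ts),
                        extract(epoch FROM %(t0)s::timestamptz),
                        extract(epoch FROM %(t1)s::timestamptz),
                        %(buckets)s) AS bucket,
           value
    FROM sensor_readings
    WHERE ts >= %(t0)s AND ts < %(t1)s
) samples
GROUP BY bucket
ORDER BY bucket;
"""

conn = psycopg2.connect("dbname=telematics")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(QUERY, {"t0": "2024-06-01", "t1": "2024-06-02", "buckets": 1920})
    rows = cur.fetchall()  # at most 1,920 rows leave the database
```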
If you're concerned about the performance cost of running this computation on the database, we can also employ pre-computation using map-reduce approaches. This way, all the data is preprocessed at different resolutions and ready to be queried—a topic we'll explore in another blog.
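To give a flavor of what those pre-computed resolutions might look like, here is a small in-memory pandas rollup standing in for an actual map-reduce job, with synthetic data in place of real telemetry:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for roughly a day of 250 ms telemetry.
rng = np.random.default_rng(0)
n = 300_000
raw = pd.DataFrame(
    {"value": np.sin(np.linspace(0, 40 * np.pi, n)) + rng.normal(0, 0.05, n)},
    index=pd.date_range("2024-06-01", periods=n, freq="250ms"),
)

# Pre-compute min/max rollups at fixed resolutions; the dashboard then queries
# whichever resolution fits the requested time window instead of the raw rows.
rollups = {
    res: raw["value"].resample(res).agg(["min", "max"])
    for res in ("1s", "10s", "1min", "15min")
}
```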
If you're facing data challenges like this one, feel free to reach out here to schedule some time with our Director of Business Development. We'd be happy to put some brainpower behind it.