Battling the Steep Price of Storage for Real-Time Analytics

Storing the vast amounts of data needed for real-time analytics poses big and expensive challenges. Here's how the latest version of InfluxDB can help improve performance and cut costs.

Sep 22nd, 2023 6:00am by B. Cameron Gain

Photo by Alexander Schimmeck from Unsplash.

Nowadays, customers demand that database providers offer massive amounts of data storage for real-time analytics. For many use cases, the amount of data that these users are working with requires large amounts of storage.

Plus, this storage needs to be readily accessible and fast. Manufacturers, healthcare providers, climate change scientists, and many other users need to access data stored in memory caches in real time, while simultaneously leveraging historical data relevant to a given data point.

Adding AI into this mix increases the amount of data companies have to deal with exponentially. The generation of predictive models results in applications calculating more data inferences, which, in turn, creates even more data.

As organizations seek to achieve greater observability into their systems and applications, they’re tasked with collecting more data from more devices — such as industrial Internet of Things (IoT) devices and aerospace telemetry. In many cases, these sources generate data at high resolutions, which increases storage costs even more.

“The fact of the matter is that companies have a lot more data coming in and the gap between what it was, even a few years ago, and what it looks like today is orders of magnitude wider,” Rick Spencer, vice president of products at InfluxData, told The New Stack.

While real-time data analytics alone requires cutting-edge database and streaming technologies, the cost of storage to meet these demands remains too high for many, if not most, organizations.

“Customers just have so much data these days,” Spencer said. “And they have two things they want to do with it: act on it and perform analytics on it.”

Acting on it in real time requires users to write automation that detects and responds to any change in activity. This can range from a wobbling spacecraft to rising error rates in shopping carts: whatever users need to detect in order to respond quickly.

“The other thing they want to do is perform historical analytics on that data. So, the dilemma that customers faced in the past is over what data to keep, because attempting to keep all the data becomes extremely expensive.”

With that in mind, let’s look at some of the technology challenges that real-time data analytics poses and offer more details about the associated storage cost conundrum. We’ll also explore InfluxDB 3.0, the latest version of InfluxData’s leading time series database, which promises to reduce data storage costs by up to 90%.

The latest iteration of the InfluxDB 3.0 product suite, InfluxDB Clustered, delivers these capabilities for self-managed environments.

Real-Time Evolution

The capacity to execute queries against vast amounts of data is typically a key requirement for large-scale real-time data analytics.

InfluxDB 3.0, InfluxData’s columnar time series database, is purpose-built to handle this. Users can conduct historical queries or analytical queries across multiple rows. These queries might consist of calculating the mean or moving average for all rows in large columnar datasets. The time needed to do so could be measured in milliseconds, even when retrieving data from objects.
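To make the kind of aggregate described above concrete, here is a minimal sketch in plain Python of a mean and a moving average computed over a single column of values. It does not use InfluxDB or its query language; the `temperature` column, the `moving_average` helper and the window size are all hypothetical, chosen only for illustration.

```python
from collections import deque
from statistics import fmean


def moving_average(values, window):
    """Yield the mean of the most recent `window` values once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)  # holds at most `window` values
    running_sum = 0.0
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be pushed out of the window.
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        if len(recent) == window:
            yield running_sum / window


# Hypothetical column of temperature readings (illustrative data only).
temperature = [20.0, 21.0, 23.0, 22.0, 24.0]

column_mean = fmean(temperature)                        # mean across the whole column
smoothed = list(moving_average(temperature, window=3))  # one value per full window

print(column_mean, smoothed)
```

Both calculations touch only one column, which is the access pattern a columnar layout is meant to serve: the values being aggregated sit together rather than being interleaved with unrelated fields.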

However, Spencer noted, InfluxData’s customers demand a lot from its databases. “Our users tend to push the limits of our query capabilities,” he said. “If there was a query, say, across a month of data that used to time out but doesn’t now, they’ll run it. So the question isn’t necessarily about how slow the queries are but rather, how much data you can query based on your requirements.”

Previously, InfluxDB 1.x and 2.x releases provided exceptionally fast data transfers for tag value matching. However, in 1.x and 2.x, it was challenging to perform analytic queries or to store large volumes of data other than metrics, such as logs and traces.

By contrast, the new InfluxDB 3.0, which was released for general availability in January, provides those capabilities.

For queries against large data sets, it might take 40 seconds to access data such as logs and traces with InfluxDB 3.0, where those same queries would have timed out in earlier versions. Queries against smaller data sets complete in milliseconds, Spencer said.

“Now we can handle much more than metrics, resulting in cost savings as you can consolidate various databases,” he said.

The cost savings come into even more direct play with the recent InfluxDB Clustered release, which added a final mile to InfluxDB 3.0’s capabilities.

The idea here is to keep data in object storage instead of on an attached local disk, as traditional databases do. Object stores cost about 1/13th the price of an attached disk, Spencer said.

Efficient Data Compression, Enhanced Performance

Among the main features of InfluxDB are four components that offer:

Data ingestion.
Data querying.
Data compaction.
Garbage collection.

The main components of InfluxDB 3.0. (Source: InfluxData)

With InfluxDB Clustered, organizations can extend InfluxDB 3.0’s capabilities to on-premises and private cloud environments. These core capabilities consist of what InfluxData says is unlimited cardinality, high-speed ingest, real-time querying and very efficient data compression, to realize the 90% reduction in storage costs that low-cost object storage and separation of compute and storage offer.

InfluxDB 3.0 also heavily uses Parquet files. This is an open source, column-oriented data file format developed for efficient data storage and retrieval. It is designed to provide efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

A significant aspect of Parquet files lies in the fact that their specification is designed by a highly skilled community of developers, aiming to facilitate efficient compression of analytical data, Spencer said.

“Given your time series use case, we can make specific assumptions that allow for substantial compression,” he said. “Parquet files become quite compact due to their columnar structure. It turns out that as data accumulates, a columnar database generally compresses much more efficiently.”
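The difference between a row layout and a columnar layout can be sketched with Python’s standard library alone. The example below is an illustration under stated assumptions, not a description of how Parquet or InfluxDB encodes data: it uses `zlib` as a stand-in compressor, and the sample schema (timestamp, host ID, temperature) and the helper names are hypothetical. It serializes the same records both ways and compresses each buffer so the sizes can be compared.

```python
import struct
import zlib


def make_rows(n):
    """Build hypothetical (timestamp, host_id, temperature) samples."""
    return [(1_700_000_000 + i, i % 4, 20.0 + (i % 10) * 0.5) for i in range(n)]


def row_layout(rows):
    # Row-oriented: the fields of each record are stored next to each other.
    return b"".join(struct.pack("<qBd", ts, host, temp) for ts, host, temp in rows)


def column_layout(rows):
    # Column-oriented: all timestamps, then all host IDs, then all temperatures.
    n = len(rows)
    timestamps, hosts, temps = zip(*rows)
    return (
        struct.pack(f"<{n}q", *timestamps)
        + struct.pack(f"<{n}B", *hosts)
        + struct.pack(f"<{n}d", *temps)
    )


rows = make_rows(1000)
row_bytes = row_layout(rows)
col_bytes = column_layout(rows)

# zlib is only a stand-in here; Parquet applies its own encodings and codecs.
row_compressed = zlib.compress(row_bytes)
col_compressed = zlib.compress(col_bytes)

print("raw bytes:", len(row_bytes))
print("row layout, compressed:", len(row_compressed))
print("column layout, compressed:", len(col_compressed))
```

Both buffers hold exactly the same information and have the same raw size; only the ordering differs. How much the columnar buffer gains depends on the data and the codec, so the printed sizes are best read as a way to experiment with your own data rather than as a benchmark.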

Storage Costs: a Drop from $8 Million to $150,000 per Year

One InfluxData customer was spending $8 million annually on storage costs. The customer was concerned that this cost would severely impact its business.

“However, adopting InfluxDB 3.0 reduced their storage costs to approximately $150,000 per year,” Spencer said. “Consider what this means for a business — transitioning from an $8 million budget to $150,000 is truly remarkable and highly beneficial for their business.

“With this approach, I can tell customers that even if their budget only allows for $10,000, and they’re currently spending $100,000 to retain their full data fidelity, they may be able to afford to keep all their data.”
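The figures quoted in this article can be put on a common footing with a little arithmetic. The sketch below is only a worked calculation using the numbers Spencer cites; the `percent_reduction` helper is a hypothetical name introduced here, and none of this reflects InfluxData’s pricing model.

```python
def percent_reduction(before, after):
    """Return the percentage by which `after` is lower than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100


# Customer case quoted above: $8 million down to about $150,000 per year.
customer_case = percent_reduction(8_000_000, 150_000)

# Spencer's budget example: $100,000 down to $10,000.
budget_example = percent_reduction(100_000, 10_000)

# Object storage at about 1/13th the price of an attached disk.
price_ratio = percent_reduction(13, 1)

print(f"customer case:  {customer_case:.1f}%")
print(f"budget example: {budget_example:.1f}%")
print(f"price ratio:    {price_ratio:.1f}%")
```

The customer case works out to a reduction of roughly 98%, the budget example to 90%, and the 1/13th price ratio to roughly 92%, which is in line with the “up to 90%” savings cited for the product.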

Driving the Time Series Market Forward

InfluxDB 3.0 takes several giant leaps forward when it comes to performance, including data compression. Not only can the database itself compress data more tightly than previous versions could, but its persistence format compounds that benefit because Apache Parquet is designed for optimized compression of columnar data.

Taken together, these improvements can drastically reduce an organization’s financial commitment to data storage. They also mean that InfluxDB enables users to store more of the data they want, to manage that data easily and, most importantly, to generate value from it in real time.

BC Gain is founder and principal analyst for ReveCom Media. His obsession with computers began when he hacked a Space Invaders console to play all day for 25 cents at the local video arcade in the early 1980s. He then...