Data Storage Systems Overview | The New Stack https://thenewstack.io/storage/

NoSQL Data Modeling Mistakes that Ruin Performance https://thenewstack.io/nosql-data-modeling-mistakes-that-ruin-performance/ (Wed, 27 Sep 2023)

Getting your data modeling wrong is one of the easiest ways to ruin your performance. And it’s especially easy to screw this up when you’re working with NoSQL, which (ironically) tends to be used for the most performance-sensitive workloads. NoSQL data modeling might initially appear quite simple: just model your data to suit your application’s access patterns. But in practice, that’s much easier said than done.

Fixing data modeling is no fun, but it’s often a necessary evil. If your data modeling is fundamentally inefficient, your performance will suffer once you scale to some tipping point that varies based on your specific workload and deployment. Even if you adopt the fastest database on the most powerful infrastructure, you won’t be able to tap its full potential unless you get your data modeling right.

This article explores three of the most common ways to ruin your NoSQL database performance, along with tips on how to avoid or resolve them.

Not Addressing Large Partitions

Large partitions commonly emerge as teams scale their distributed databases. Large partitions are partitions that grow so big that they start introducing performance problems across the cluster’s replicas.

One of the questions that we hear often — at least once a month — is “What constitutes a large partition?” Well, it depends. Some things to consider:

  • Latency expectations:  The larger your partition grows, the longer it will take to be retrieved. Consider your page size and the number of client-server round trips needed to fully scan a partition.
  • Average payload size: Larger payloads generally lead to higher latency. They require more server-side processing time for serialization and deserialization and also incur a higher network data transmission overhead.
  • Workload needs: Some workloads organically require larger payloads than others. For instance, I’ve worked with a Web3 blockchain company that would store several transactions as BLOBs under a single key, and every key could easily get past 1 megabyte in size.
  • How you read from these partitions: For example, a time series use case will typically have a timestamp clustering component. In that case, reading from a specific time window will retrieve much less data than if you were to scan the entire partition.

The following table illustrates the impact of large partitions under different payload sizes, such as 1, 2 and 4 kilobytes.

As you can see, the higher your payload gets under the same row count, the larger your partition is going to be. However, if your use case frequently requires scanning partitions as a whole, then be aware that databases have limits to prevent unbounded memory consumption.

For example, ScyllaDB cuts off pages at 1MB to prevent the system from potentially running out of memory. Other databases (even relational ones) have similar protection mechanisms to prevent an unbounded bad query from starving the database’s resources.

To scan a partition holding 10K rows with a 4KB payload in ScyllaDB, a single query would need to retrieve at least 40 of those 1MB pages. This may not seem like a big deal at first. However, as you scale over time, it can affect your overall client-side tail latency.

Another consideration: With databases like ScyllaDB and Cassandra, data written to the database is stored in the commit log and under an in-memory data structure called a “memtable.”

The commit log is a write-ahead log that is never really read from, except when there’s a server crash or a service interruption. Since the memtable lives in memory, it eventually gets full. To free up memory space, the database flushes memtables to disk. That process results in SSTables (sorted strings tables), which is how your data gets persisted.

What does all this have to do with large partitions? Well, SSTables have specific components that need to be held in memory when the database starts. This ensures that reads are always efficient and minimizes wasted storage disk I/O when looking for data. When you have extremely large partitions (for example, we recently had a user with a 2.5-terabyte partition in ScyllaDB), these SSTable components introduce heavy memory pressure, shrinking the database’s room for caching and further constraining your latencies.

How do you address large partitions via data modeling? Basically, it’s time to rethink your primary key. The primary key determines how your data will be distributed across the cluster, so a well-chosen key improves both performance and resource utilization.

A good partition key should have high cardinality and roughly even distribution. For example, a high cardinality attribute like User Name, User ID or Sensor ID might be a good partition key. Something like State would be a bad choice because states like California and Texas are likely to have more data than less populated states such as Wyoming and Vermont.
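
To make that concrete, here is a hedged sketch of the two choices; the table and column names are made up for illustration rather than taken from a real schema:

-- Low-cardinality partition key: every row for a populous state piles into one partition
CREATE TABLE users_by_state (
   state text,
   user_id uuid,
   user_name text,
   PRIMARY KEY (state, user_id)
);

-- High-cardinality partition key: rows spread roughly evenly across the cluster
CREATE TABLE users_by_id (
   user_id uuid PRIMARY KEY,
   user_name text,
   state text
);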

Or consider this example. The following table could be used in a distributed air quality monitoring system with multiple sensors:

CREATE TABLE air_quality_data (
   sensor_id text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY (sensor_id, time)
);


With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large, especially if data is gathered every couple of milliseconds. This innocent-looking table can eventually become unusable. In this example, it takes only ~50 days.

A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at the updated air_quality_data table:

CREATE TABLE air_quality_data (
   sensor_id text,
   date text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY ((sensor_id, date), time)
);


After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow. This technique is called bucketing, as it allows us to control how much data is stored in partitions.
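
As an illustrative read against the bucketed table (the sensor ID, date and time window are hypothetical values), a query now touches only a single day’s partition instead of a sensor’s entire history:

SELECT time, co_ppm
FROM air_quality_data
WHERE sensor_id = 'sensor-42'
  AND date = '2023-09-27'
  AND time >= '2023-09-27 00:00:00+0000'
  AND time < '2023-09-27 06:00:00+0000';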

Bonus: See how Discord applies the same bucketing technique to avoid large partitions.

Introducing Hot Spots

Hot spots can be a side effect of large partitions. If you have a large partition (storing a large portion of your data set), it’s quite likely that your application access patterns will hit that partition more frequently than others. In that case, it also becomes a hot spot.

Hot spots occur whenever a problematic data access pattern causes an imbalance in the way data is accessed in your cluster. One culprit: when the application fails to impose any limits on the client side and allows tenants to potentially spam a given key.

For example, think about bots in a messaging app frequently spamming messages in a channel. Hot spots could also be introduced by erratic client-side configurations in the form of retry storms. That is, a client attempts to query specific data, times out before the database does and retries the query while the database is still processing the previous one.

Monitoring dashboards should make it simple for you to find hot spots in your cluster. For example, this dashboard shows that shard 20 is overwhelmed with reads.

For another example, the following graph shows three shards with higher utilization, which correlates to the replication factor of three, configured for the keyspace in question.

Here, shard 7 introduces a much higher load due to the spamming.

How do you address hot spots? First, use a vendor utility on one of the affected nodes to sample which keys are most frequently hit during your sampling period. You can also use tracing, such as probabilistic tracing, to analyze which queries are hitting which shards and then act from there.

If you find hot spots, consider:

  • Reviewing your application access patterns. You might find that you need a data modeling change such as the previously mentioned bucketing technique. If you need sorting, you could use a monotonically increasing component, such as Snowflake IDs. Or maybe it’s best to apply a concurrency limiter and throttle down potential bad actors.
  • Specifying per-partition rate limits, after which the database will reject any queries that hit that same partition (see the sketch after this list).
  • Ensuring that your client-side timeouts are higher than the server-side timeouts to prevent clients from retrying queries before the server has a chance to process them (“retry storms”).
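
As a sketch of the per-partition rate-limit option mentioned above, ScyllaDB, for example, exposes it as a table property; the table name and thresholds here are hypothetical, and the exact syntax should be checked against your database version’s documentation:

ALTER TABLE messaging.messages_by_channel
WITH per_partition_rate_limit = {
   'max_writes_per_second': 100,
   'max_reads_per_second': 500
};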

Misusing Collections

Teams don’t always use collections, but when they do, they often use them incorrectly. Collections are meant for storing/denormalizing a relatively small amount of data. They’re essentially stored in a single cell, which can make serialization/deserialization extremely expensive.

When you use collections, you can define whether the field in question is frozen or non-frozen. A frozen collection can only be written as a whole; you cannot append or remove elements from it. A non-frozen collection can be appended to, and that’s exactly the type of collection that people most misuse. To make matters worse, you can even have nested collections, such as a map that contains another map, which includes a list, and so on.

Misused collections will introduce performance problems much sooner than large partitions do. If you care about performance, collections can’t be very large at all. For example, if we create a simple key:value table, where our key is a sensor_id and our value is a collection of samples recorded over time, our performance will be suboptimal as soon as we start ingesting data.

CREATE TABLE IF NOT EXISTS {table} (
   sensor_id uuid PRIMARY KEY,
   events map<timestamp, FROZEN<map<text, int>>>
);
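
Appending to that non-frozen map looks roughly like the statement below; the UUID, timestamp and map contents are made-up illustration values. Each such update forces the database to merge the new entry into the single, ever-growing collection cell:

UPDATE {table}
SET events = events + {'2023-09-27 12:00:00+0000': {'co_ppm': 410, 'no2_ppb': 18}}
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000;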


The following monitoring snapshots show what happens when you try to append several items to a collection at once.

You can see that while the throughput decreases, the p99 latency increases. Why does this occur?

  • Collection cells are stored in memory as sorted vectors.
  • Adding elements requires a merge of two collections (old and new).
  • Adding an element has a cost proportional to the size of the entire collection.
  • Trees (instead of vectors) would improve the performance, BUT…
  • Trees would make small collections less efficient!

Returning to that same example, the solution would be to move the timestamp to a clustering key and transform the map into a frozen collection (since you no longer need to append data to it). These very simple changes will greatly improve the performance of the use case.

CREATE TABLE IF NOT EXISTS {table} (
   sensor_id uuid,
   record_time timestamp,
   events FROZEN<map<text, int>>,
   PRIMARY KEY (sensor_id, record_time)
);
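
With this layout, each sample becomes its own clustering row and the frozen map is written once as a whole. A minimal write might look like the following sketch (the UUID, timestamp and map contents are hypothetical):

INSERT INTO {table} (sensor_id, record_time, events)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2023-09-27 12:00:00+0000', {'co_ppm': 410, 'no2_ppb': 18});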

Learn More: On-Demand NoSQL Data Modeling Masterclass

Want to learn more about NoSQL data modeling best practices for performance? Take a look at our NoSQL data modeling masterclass — three hours of expert instruction, now on demand and free. You will learn how to:

  • Analyze your application’s data usage patterns and determine which data modeling approach will be most performant for your specific usage patterns.
  • Select the appropriate data modeling options to address a broad range of technical challenges, including the benefits and trade-offs of each option.
  • Apply common NoSQL data modeling strategies in the context of a sample application.
  • Identify signs that indicate your data modeling is at risk of causing hot spots, timeouts and performance degradation — and how to recover.

Address High Scale Google Drive Data Exposure with Bulk Remediation https://thenewstack.io/address-high-scale-google-drive-data-exposure-with-bulk-remediation/ (Tue, 26 Sep 2023)

Millions of organizations around the globe use SaaS applications like Google Drive to store and exchange company files internally and externally. Because of the collaborative nature of these applications, company files can be accessed easily by the public, held externally with vendors, or shared within private emails. Data risk exposure exponentially increases as companies scale operations and internal data. Shared files through SaaS applications like Google Drive enable significant business-critical data exposure that could potentially get into the wrong hands.

As technology companies experience mass layoffs, IT professionals should take extra caution when managing shared file permissions. For example, if a company recently laid off an employee who shared work files externally with their private email, the former employee will still have access to the data. Moreover, if the previous employee begins working for a competitor, they can share sensitive company files, reports and data with their new employer. Usually, once internal links are publicly shared with an external source, the owner of the file is unable to see who else has access. This poses an enormous security risk for organizations, as anyone, including bad actors or competitors, can easily steal personal or proprietary information within the shared documents.

Digitization and Widespread SaaS Adoption

Smaller, private companies tend to underestimate their risk of data exposure when externally sharing files. An organization is still at risk even if they only have a small number of employees. On average, one employee creates 50 new SaaS assets every week. It only takes one publicly-shared asset to expose private company data.

The growing adoption of SaaS applications and digital transformation are exacerbating this problem. In today’s digital age, companies are becoming more digitized and shifting from on-premises or legacy systems to the cloud. Within 24 months, a typical business’s total SaaS assets will multiply by four times. As organizations grow and scale, the amount of SaaS data and events becomes uncontrollable for security teams to maintain. Without the proper controls and automation in place, businesses are leaving a massive hole in their cloud security infrastructure that only worsens as time goes on. The longer they wait to tackle this challenge, the harder it becomes to truly gain confidence in their SaaS security posture.

Pros and Cons of Bulk Remediating

Organizations looking to protect themselves from this risk should look to bulk remediate their data security. By bulk remediating, IT leaders can quickly ensure a large amount of sensitive company files remain private and are unable to be accessed by third parties without explicit permission. This is a quick way to guarantee data security as organizations scale and become digitized.

However, as an organization grows, they will likely retain more employees, vendors, and shared drives. When attempting to remediate inherited permissions for multiple files, administrators face the difficulty of ensuring accurate and appropriate access levels for each file and user. It requires meticulous planning and a thorough understanding of the existing permission structure to avoid unintended consequences.

Coordinating and executing bulk remediation actions can also be time-consuming and resource-intensive, particularly when dealing with shared drives that contain a vast amount of files and multiple cloud, developer, security, and IT teams with diverse access requirements. The process becomes even more intricate when trying to strike a balance between minimizing disruption to users’ workflows and enforcing proper data security measures.

Managing SaaS Data Security

Organizations looking to manage their SaaS data security should first understand their current risk exposure and the number of applications currently used within the company. This will help IT professionals gain a better understanding of which files to prioritize that contain sensitive information that needs to quickly be remediated. Next, IT leaders should look for an automated and flexible bulk remediation solution to help them quickly manage complex file permissions as the company grows.

Companies should ensure they are only using SaaS applications that are up to their specific security standards. This is crucial to not only avoid data exposure, but also comply with business compliance regulations. IT admins should reassess each quarter their overall data posture and whether current SaaS applications are properly securing their private assets. Automation workflows within specific bulk remediation plans should be continuously updated to ensure companies are not missing security blind spots.

Each organization has different standards and policies that they will determine as best practices to keep their internal data safe. As the world becomes increasingly digital and the demand for SaaS applications exponentially grows, it is important for businesses to ensure they are not leaving their sensitive data exposed to third parties. Those that fail to remediate their SaaS security might be the next victim of a significant data breach.

Battling the Steep Price of Storage for Real-Time Analytics https://thenewstack.io/battling-the-steep-price-of-storage-for-real-time-analytics/ (Fri, 22 Sep 2023)

Nowadays, customers demand that database providers offer massive amounts of data storage for real-time analytics. For many use cases, the amount of data that these users are working with requires large amounts of storage.

Plus, this storage needs to be readily accessible and fast. Manufacturers, healthcare providers, climate change scientists, and various other use cases need to access data stored in memory caches in real time, while simultaneously leveraging historical data relevant to that data point.

Adding AI into this mix increases the amount of data companies have to deal with exponentially. The generation of predictive models results in applications calculating more data inferences, which, in turn, creates even more data.

As organizations seek to achieve greater observability into their systems and applications, they’re tasked with collecting more data from more devices — such as industrial Internet of Things (IoT) devices and aerospace telemetry. In many cases, these sources generate data at high resolutions, which increases storage costs even more.

“The fact of the matter is that companies have a lot more data coming in and the gap between what it was, even a few years ago, and what it looks like today is orders of magnitude wider,” Rick Spencer, vice president of products at InfluxData, told The New Stack.

While real-time data analytics alone requires cutting-edge database and streaming technologies, the cost of storage to meet these demands remains too high for many, if not most, organizations.

“Customers just have so much data these days,” Spencer said. “And they have two things they want to do with it: act on it and perform analytics on it.”

Acting on it in real time requires users to write automation that detects and responds to any change in activity. This can range from a wobbling spacecraft to rising error rates in shopping carts – whatever users need to detect in order to respond quickly.

“The other thing they want to do is perform historical analytics on that data. So, the dilemma that customers faced in the past is over what data to keep, because attempting to keep all the data becomes extremely expensive.”

With that in mind, let’s look at some of the technology challenges that real-time data analytics pose and offer more details about the associated storage cost conundrum. We’ll also explore InfluxDB 3.0, the latest version of InfluxData’s leading time series database, which promises to reduce data storage costs by up to 90%.

The latest iteration of the InfluxDB 3.0 product suite, InfluxDB Clustered, delivers these capabilities for self-managed environments.

Real-Time Evolution

The capacity to execute queries against vast amounts of data is typically a key requirement for large-scale real-time data analytics.

InfluxDB 3.0, InfluxData’s columnar time series database, is purpose-built to handle this. Users can conduct historical queries or analytical queries across multiple rows. These queries might consist of calculating the mean or moving average for all rows in large columnar datasets. The time needed to do so could be measured in milliseconds, even when retrieving data from object storage.

However, Spencer noted, InfluxData’s customers demand a lot from its databases. “Our users tend to push the limits of our query capabilities,” he said. “If there was a query, say, across a month of data that used to time out but doesn’t now, they’ll run it. So the question isn’t necessarily about how slow the queries are but rather, how much data you can query based on your requirements.”

Previously, the InfluxDB 1.x and 2.x releases provided exceptionally fast data transfers for tag value matching. However, in 1.x and 2.x it was challenging to perform analytic queries or to store much data beyond metrics, such as logs and traces.

By contrast, the new InfluxDB 3.0, which was released for general availability in January, provides those capabilities.

For queries against large data sets, it might take 40 seconds to access data such as logs and traces with InfluxDB 3.0, where those same queries would have timed out in earlier versions. Queries against smaller data sets complete in milliseconds, Spencer said.

“Now we can handle much more than metrics, resulting in cost savings as you can consolidate various databases,” he said.

The cost savings come into even more direct play with the recent InfluxDB Clustered release that added a final mile to Influx 3.0 capabilities.

The idea here is to keep data in object storage, instead of in an attached local disk, like traditional databases do. Object stores cost about 1/13th the price of an attached disk, Spencer said.

Efficient Data Compression, Enhanced Performance

InfluxDB 3.0 is built around four main components:

  • Data ingestion.
  • Data querying.
  • Data compaction.
  • Garbage collection.

The main components of InfluxDB 3.0. (Source: InfluxData)

With InfluxDB Clustered, organizations can extend InfluxDB 3.0’s capabilities to on-premises and private cloud environments. These core capabilities consist of what InfluxData says is unlimited cardinality, high-speed ingest, real-time querying and very efficient data compression, to realize the 90% reduction in storage costs that low-cost object storage and separation of compute and storage offer.

InfluxDB 3.0 also heavily uses Parquet files. This is an open source, column-oriented data file format developed for efficient data storage and retrieval. It is designed to provide efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

A significant aspect of Parquet files lies in the fact that their specification is designed by a highly skilled community of developers, aiming to facilitate efficient compression of analytical data, Spencer said.

“Given your time series use case, we can make specific assumptions that allow for substantial compression,” he said. ”Parquet files become quite compact due to their columnar structure. It turns out that as data accumulates, a columnar database generally compresses much more efficiently.”

Storage Costs: a Drop from $8 Million to $150,000 per Year

One InfluxData customer was spending $8 million annually on storage costs. The customer was concerned that this cost would severely impact its business.

“However, adopting InfluxDB 3.0 reduced their storage costs to approximately $150,000 per year,” Spencer said. “Consider what this means for a business — transitioning from an $8 million budget to $150,000 is truly remarkable and highly beneficial for their business.

“With this approach, I can tell customers that even if their budget only allows for $10,000, and they’re currently spending $100,000 to retain their full data fidelity, they may be able to afford to keep all their data.”

Driving the Time Series Market Forward

InfluxDB 3.0 takes several giant leaps forward when it comes to performance, including data compression. Not only is the database itself able to compress data smaller than previous versions, but its persistence format compounds that benefit because Apache Parquet is designed for optimized compression of columnar data.

Taken together, these improvements can drastically reduce an organization’s financial commitment to data storage. It also means that InfluxDB enables users to store more of the data they want, to easily manage that data, and — most importantly — to generate value from that data in real time.

Unlock Data’s Full Potential with a Mature Analytics Strategy https://thenewstack.io/unlock-datas-full-potential-with-a-mature-analytics-strategy/ (Fri, 08 Sep 2023)

Over the past decade, businesses have harnessed the power of “big data” to unlock new possibilities and enhance their analytical capabilities. Today, those businesses must accelerate those capabilities by moving beyond experimentation with analytics toward mature investments and capabilities, or risk losing a competitive edge.

A mature data analytics strategy is critical to deriving the most value from data, but many organizations struggle to get it right. Despite the exponential growth in data collection, about 73% of enterprise data remains unused for analytics, according to Forrester. This means that only about a quarter of the data generated is effectively leveraged to gain valuable insights. Embracing modern technology, such as containerized storage capabilities, can help leaders obtain a strong handle on their data and derive actionable insights from it to truly drive business growth.

Legacy Analytics Architectures Are Obstructing Innovation

Today’s software applications need to handle millions of users across the globe on demand while running on multiple platforms and environments. They also need to provide high availability to enable businesses to innovate and respond to changing market conditions. Legacy platforms were designed prior to ubiquitous fast storage and network fabric, presenting more challenges than solutions for organizations looking to get ahead of the competition.

When I spoke to IT leaders who use legacy deployment models, the number-one complaint I heard was that it requires too much effort to support data at the indexer layer, which leads to reduced operational efficiency. Hours, days and even weeks can be spent on software updates, patches and scaling hardware to support growth. This, in turn, affects optimization, as teams operating at scale are challenged to meet the needs of their growing organization.

Additionally, legacy architectures require multiple copies of data, which significantly increases compute and storage requirements. When you add storage in a distributed architecture, you add compute regardless of organizational needs, affecting overall utilization and the ability to control costs.

Lastly, with varying performance capabilities across different storage tiers, there is a risk of slower query response times or inconsistent search results. This can hinder the speed and accuracy of data analysis. A mature analytics strategy faces these challenges head-on to provide operational efficiency, accelerated innovation and reduced cost of doing business.

The Case for Containerizing Modern Analytics Loads

Managing modern data involves more than relying on cloud architecture capabilities alone. Containerization can seamlessly integrate into cloud infrastructure to support modern analytics workloads. Imagine the convenience of running an application in a virtual environment without the hefty resource requirements of a hypervisor. By encapsulating software into virtual self-contained units, that’s exactly what a container can do.

Containerized applications provide greater performance and can run reliably from one computing environment to another. More application instances allow for greater performance overall, and the portability of the storage method enables centralized image management, rapid deployment and elasticity for organizations to scale storage capacity based on demand.

Interestingly, containerized applications can help with CPU utilization as well. In testing, we found that containerized applications enabled up to 60% utilization, compared to only 17% from a bare metal application model. Pair containerization with a high-performance storage solution, and organizations can achieve more flexibility and quicker response as data volumes increase.

Kubernetes’ Role in Unlocking Agile Data Management

Container orchestration platforms like Kubernetes provide robust tools for managing and orchestrating containerized applications at scale. With Kubernetes, platform and DevOps teams can easily deploy and run thousands of applications in a containerized or VM format, on any infrastructure, and can operate with much lower operational costs.

But to fully derive the benefits of a powerful application platform like Kubernetes, users need an equally powerful data platform to complete the solution. The Portworx Data Platform offers advancements such as automated and declarative storage provisioning, volume management, high availability and data replication, data protection and backup, business continuity and disaster recovery, security and robust cost optimization and management. These capabilities enable organizations to efficiently manage and control their data storage across distributed cloud environments, ensuring data availability and agility.

When using Kubernetes for containerized storage, there are considerations to keep in mind to ensure an organization’s mature analytics strategy is optimized and agile. First, using Kubernetes operators can further enhance storage capabilities by automating and simplifying complex tasks.

It’s also crucial to set up high availability at both the data service layer and the storage layer because relying on a single instance in a Kubernetes environment can be risky. Lastly, understanding whether an organization’s data service can be scaled up or scaled out will allow IT teams to choose the best solution to add more capacity or compute power as needed.

Organizations with mature analytics investments are achieving bigger impacts on business outcomes across the board, from customer experience and strategy to product innovation. Through modern data management like container applications and Kubernetes, organizations can make greater use of their data for innovation and growth and, more to the point, increase sales.

Ditching Databases for Apache Kafka as System of Record https://thenewstack.io/ditching-databases-for-apache-kafka-as-system-of-record/ (Fri, 25 Aug 2023)

Databases have long acted as a system of record, with most organizations still using them to store and manage critical data in a reliable and persistent manner.

But times are changing. Many emerging trends are influencing the way data is stored and managed today, forcing companies to rethink data storage and offering lots of paths to innovation.

Take KOR Financial, for example, our financial services startup where we use Kafka as the system of record instead of relying on relational databases to store data.

We store all our data in Kafka, allowing us to cost-effectively and securely store tens or even hundreds of petabytes of data and retain it over many decades.

Instituting this approach not only provided immense flexibility and scalability in our data architecture, it has also enabled lean and agile operations.

In this article, I’ll explain why organizations need to think differently about data storage, the benefits of using Kafka as a system of record and offer advice for anyone interested in following this path.

Why Data Storage Requires ‘Outside the Box’ Thinking

A modern flexible data architecture that enables companies to harness data-driven decisions has become more important than ever. And having robust, reliable and flexible data storage is a key component for success.

But the rise of big data, distributed systems, cloud computing and real-time data processing, just a few examples of the emerging trends I mentioned earlier, means traditional databases can no longer keep up with the velocity and volume of data being generated every second.

That’s because traditional databases were not designed for this kind of scale. Their rigid structure impedes the flexibility that businesses need from their data architecture.

As an operator of different business-to-business financial trade repositories globally along with complementary modular services, we deal with a ton of data. Our data-streaming-first approach is what differentiates us from our competitors. Our goal: to revolutionize the way the derivatives market and global regulators think about trade reporting, data management and compliance.

This means putting Kafka at the core of our architecture, which enables us to capture events instead of just state. And storing data in Kafka, rather than a database, and using it as our system of record enables us to keep track of all these events, process them and create materialized views of the data depending on our use cases — now or in the future.

Other trade repositories and intermediary service providers often use databases like Oracle Exadata for their data storage needs, but that approach can be very expensive and pose data management challenges. It allows you to run SQL queries, but the challenge lies in managing large SQL databases and ensuring data consistency within those systems.

Being in the business of global mandated trade reporting means you are serving multiple jurisdictions, each with its own unique data model and interpretation. If you consolidate all data into a single schema or model, the task of uniformly managing that becomes increasingly complex. And schema evolution is challenging without a historical overview of the data, as it is materialized in a specific version of the state — further adding to data management woes.

Plus, the scalability of a traditional database is limited when dealing with vast amounts of data.

In contrast, we use Confluent Cloud for our Kafka and its Infinite Storage, which allows users to store as much data as they want in Kafka, for as long as they need, while only paying for the storage used.

While the number of partitions is a consideration, the amount of data you can put in Confluent Cloud is unlimited, and storage grows automatically as you need it without limits on retention time.

It allows us to be completely abstracted from how data is stored underneath and enables a cost-effective way to keep all of our data.

This enables us to scale our operations without limitations and to interpret events in any representation that we would like.

Powering the Ability to Replay Data

One of the remarkable advantages of using Kafka as a system of record lies in its ability to replay data, a native capability that traditional databases lack. For us, this feature aligns with our preference to store events versus states, which is critical for calculating trade states accurately.

We receive a whole bunch of deltas, which we call submissions or messages, which contribute to the trade state at a given point in time. Each incoming message or event modifies the trade and alters its current state. If any errors occur during our stream-processing logic, it can result in incorrect state outputs.

If that information is stored directly in a fixed representation or a traditional database, the events leading up to the state are lost. Even if the interpretation of those events were incorrect, there’s no way of revisiting the context that led to that interpretation.

However, by preserving the historical order of events in an immutable and append-only log, Kafka offers the ability to replay those events.

Given our business’s regulatory requirements, it is imperative to store everything in an immutable manner. We are required to capture and retain all data as it was originally received. While most databases, including SQL, allow modifications, Kafka by design prohibits any changes to its immutable log.

Using Kafka as a system of record and having infinite storage means we can go back in time, analyze how things unfolded, make changes to our interpretation, manage point-in-time historical corrections and create alternative representations without affecting the current operational workload.

This flexibility provides a significant advantage, especially when operating in a highly regulated market where it is crucial to rectify any mistakes promptly and efficiently.

Enabling Flexibility in Our Data Architecture

Using Kafka as a system of record introduces remarkable flexibility to our data architecture. We can establish specific views tailored to each use case and use dedicated databases or technologies that align precisely with those requirements, then read off the Kafka topics that contain the source of those events.

Take customer data management, for instance. We can use a graph database designed specifically for that use case without having our entire system built around a graph database because it’s just a view or a projection based on Kafka.

This approach allows us to use different databases based on use cases without designating them as our system of record. Instead, they serve as representations of the data, enabling us to be flexible. Otherwise you’re plugged into a database, data lake or data warehouse, which are rigid and don’t allow transformation of data into representations optimized for specific use cases.

From a startup perspective (KOR was founded in 2021), this flexibility also allows us to avoid being locked into a specific technology direction prematurely. Following the architectural best practice of deferring decisions until the last responsible moment, we can delay committing to a particular technology choice until it is necessary and aligns with our requirements. This approach means we can adapt and evolve our technological landscape as our business needs evolve and enable future scalability and flexibility.

Apart from flexibility, the use of Schema Registry ensures data consistency so we know the data’s origins and the schema associated with it. Confluent Cloud also allows you to have a clear evolution policy set up with Schema Registry. If we instead put all the data in a data lake, for instance, it gets harder to manage all the different versions, the different schemas and the different representations of that data.

Kafka as a System of Record: It’s More a Mindset Shift than a Technology Shift

To successfully adopt Kafka as a system of record, a company must foster a culture that encourages everyone to embrace an event-driven model.

This shift in thinking should also extend to the way applications are developed, with stream processing at the core. Failure to do so will result in a compatibility mismatch. The goal is to help everyone on your team understand that they are dealing with immutable data, and if they’ve written something, they can’t just go in and change it.

It’s advisable to start with a single team that comprehends stream processing and the significance of events as a system of proof. By demonstrating advantages within that team, they can then act as ambassadors to other teams, encouraging the adoption of events as the ultimate truth and embracing stream processing with states as an eventual representation.

Watch this on-demand webinar to learn more about how KOR Financial leveraged Kafka and Confluent Cloud to cost-effectively store and secure all data to stay in compliance with financial regulations.

A Brief DevOps History: Databases to Infinity and Beyond, Part 2 https://thenewstack.io/a-brief-devops-history-databases-to-infinity-and-beyond-part-2/ (Wed, 16 Aug 2023)

This is the second of a two-part series. Read Part 1.

We left off in 1979, with the release of INGRES and Oracle v2. At that point in history, databases were almost exclusively a tool built for and used by enterprises, developed according to their needs. But then the 1980s arrived (in our opinions, the most fashionable decade) and with them came the desktop computing era and some historically significant hacker movies to go with them.

Computers were no longer something that took up an entire room and required a specialized skill set to operate; they fit on your desk, they were affordable and a lot of the difficulties of interacting with one had been abstracted away.

Initially, a handful of different lightweight databases jockeyed for dominance in the desktop market. When IBM was developing its DOS-based PCs, it commissioned a DOS port of dBase. The IBM PCs were released in 1981 with dBase as one of the first pieces of software available, and it rocketed into popularity.

Interestingly, there is no dBase I — it was originally released as Vulcan and renamed when it was re-released. The name “dBase II” was chosen solely because the “two” implies a second, and thus less-buggy, release. The marketing stunt worked, and dBase II was destined for dominance.

dBase abstracted away a lot of the required but boring and technically complex aspects of interacting with a database, like opening and closing files and managing the allocation of storage space. This ease of use, relative to its predecessors, secured its place in history. Entire businesses sprung up around it, with multiple databases built on top of it and the associated programming language, but none was initially able to unseat it.

dBase remained one of the top-selling pieces of software throughout the ’80s and most of the ’90s, until a single bad release very nearly killed it.

In the 1990s, changes in the way we think about software development pushed databases in a slightly different direction. Object-oriented programming became the dominant design paradigm, and this necessitated a change in the way databases handle data.

Since we began thinking of both our code and our data as reusable objects with associated attributes, we needed to interact with the data in a slightly different way than a lot of databases at the time allowed out of the box. Additional abstraction layers become necessary so that we can think about what we’re doing rather than the specific implementation. This is how we got object-relational mapping tools (ORMs).

To answer the needs of object-oriented programming, Microsoft acquired FoxPro and subsequently built Visual FoxPro based on it with support for some object-oriented design features. That acquisition gave them something even more important, though — FoxPro’s query optimization routines, which were built into Microsoft Access, almost immediately making it the most widely-used database in Windows environments.

In 1995, Access began shipping as part of the standard Microsoft Office suite rather than a standalone product, increasing its spread further and solidifying its dominance in the Windows market.

In the 2000s, the widespread popularity of the internet and an ever-present need to scale wider than ever before forced another innovation in databases and NoSQL entered the ring, but let’s go into the name first.

Carlo Strozzi originally used the name “NoSQL” in 1998 for a lightweight database he was developing, but it bears no resemblance to the NoSQL of today. Strozzi was still building a relational database; it just didn’t use SQL. Instead, it used shell scripts. According to Strozzi, today’s NoSQL should more accurately be called NoRel.

The term made a comeback in 2009 thanks to Johan Oskarsson at an event he held in response to the emergence and growth of some new technologies in databases: Google’s BigTable and Amazon’s Dynamo, as well as their open source clones.

“Open source distributed, non-relational databases” was a bit too wordy and not pithy enough for a Twitter hashtag, though, so Eric Evans of Rackspace suggested an alternative: NoSQL. It took off, and the rest is history.

Back to the technology itself: While relational databases focus on ACID (atomicity, consistency, isolation, durability) guarantees, non-relational databases are designed around the CAP (consistency, availability, partition tolerance) theorem. The idea is that no distributed system is immune to network failures by its very nature, so you may only have two of the three properties. When a failure occurs, a choice has to be made: ensure consistency by canceling the operation, which sacrifices availability, or ensure availability by continuing with the operation, sacrificing consistency.

Most distributed databases address this shortfall by offering “eventual consistency,” wherein changes aren’t necessarily propagated to all nodes at the same time, but within a few milliseconds of one another.

NoSQL

Usually, when people think about a NoSQL database, they’re thinking about something using a document model, similar to MongoDB. The playing field is much wider than that, though — we have several different flavors of key-value databases like Redis, wide column stores like DynamoDB, graph databases like Neo4j, hybrid databases that implement all these models like CosmosDB and more. These all have different strengths, weaknesses and use cases, but they all store denormalized data and generally do not support join operations.

The pursuit of massively distributed databases that can scale horizontally into infinity has led to an explosion of specialized databases, with literally dozens of differing data models and entire products released for hyper-specific use cases. Technically, the world wide web itself is a large, distributed hypertext database.

Between the variety of relational and non-relational databases available today, the modern era is the database era. Nearly every action we take to interact with the world today is a database action made possible in large part by technology that originated before most of the people building the tools of the future were even born, with decades of iterative growth in between, and we’re moving faster than ever before. What will speed and scale mean to us in another 60 years?

How Quantic Improved Developer Experience, Scalability https://thenewstack.io/how-quantic-improved-developer-experience-scalability/ (Mon, 14 Aug 2023)

Quantic is a fast-growing startup that helps businesses streamline their operations by using a full-featured cloud-based point-of-sale (POS) platform, including devices carried around by staff. Initially, our company primarily focused on restaurant operations. We weren’t planning for the technology to become a stand-alone product. However, by 2015, the demand from other restaurants was so high that we decided to roll it out, extending services to retailers, grocery stores, hotels, mini marts, gift shops, car washes and other businesses. As workloads grew and customer portfolios expanded, the business had to navigate a wide variety of different rules, regulations and workflows for the different industries served.

The Challenge

Over time, as our business expanded to hundreds of clients, we outgrew our original NoSQL database. Our application needed to scale beyond what our existing database was able to handle. We also needed to provide customers with real-time data synchronization capabilities, which enable the replication of data across clusters located in different data centers, something our database didn’t support. On top of that, unplanned downtime hurt customer experience, while managing various clusters placed a huge strain on our developers. DevOps teams were forced to deal with complex database management issues rather than focus on software development.

Following a competitive evaluation, Quantic turned to Couchbase Capella database as a service for a scalable and simple, yet powerful, way to keep pace with an ever-expanding number of customers, products and features.

The Solution

Building a database in-house to support the needs of the company was out of the question due to the cost, time and talent requirements. The database, data syncing from cloud to edge and storage costs alone wouldn’t make sense for a business that’s trying to scale. After evaluating various databases, we selected Couchbase Capella on Amazon Web Services (AWS) for its high performance, multidimensional scalability and a flexible NoSQL architecture that developers found familiar and easy to use.

The price performance of Capella coupled with mobile support and developer-friendly features allowed customer applications to be available 24/7, even when network connectivity was down. Capella’s offline synchronization capabilities, paired with the flexibility of JSON and SQL++, ensured applications were always on and always fast. This made Capella an easy choice for the business.

Additionally, Capella simplified decision-making, allowing Quantic’s clients to increase their business by determining when to offer special deals based on historical sales data, all within a single platform. Through Capella’s high-performance indexing, reports can be generated faster, allowing customers to get the data they need when they need it. With Capella, Quantic powers its applications, including tableside order placement and payments, customer management, couponing, QR codes, loyalty programs and fast checkout.

The Results

Once Capella was deployed, Quantic experienced an immediate impact on the business. Our customers need most applications to function in real time. From time tracking to sending orders to the kitchen via kitchen display systems, real-time communication is essential. This requires data to be synced quickly across the organization’s entire architecture. With Capella in our tech stack, the company receives instant updates and is able to provide a seamless end-user experience. Customers have uninterrupted access to data that has been aggregated over short or long periods of time.

The business also noticed that indexing speeds became faster, which made a substantial impact on reporting. And in some instances, query times were cut in half for end users.

What the Future Holds

Quantic continues to grow. The business recently developed a white-label POS platform, allowing other vendors to sell our services as their own. Through the white-label program, anyone, from an independent sales organization to a bank, can provide its customers with a solution that can grow its brand. Since the heavy lifting has been done using Capella, we’re able to expand our POS systems, while simultaneously helping partners expand their brands.

As the company scales and workloads balloon, Couchbase will help Quantic reduce database management efforts so development teams can focus on product enhancements to provide end users with a seamless experience.

Learn more about how Couchbase Capella on AWS enables Quantic to manage and scale the company’s growing workloads while improving the developer experience here. Try Capella on AWS for free here.

How Vector Search Can Optimize Retail Trucking Routes https://thenewstack.io/how-vector-search-can-optimize-retail-trucking-routes/ (Fri, 11 Aug 2023)

Vectors and vector search are key components of large language models (LLMs), but they are useful in a host of other applications across many use cases that you might not have considered. How about the most efficient way to deliver retail goods?

In two prior articles in this series, I told a story of a hypothetical contractor who was hired to help implement AI/ML solutions at a big-box retailer, and then explored how this distributed systems and AI specialist used vector search to drive results with customer promotions at the company. Now, I’ll walk you through how this contractor uses vector search to optimize trucking routes.

The Problem

While we were looking at our options for scaling-down (and ultimately disabling) the recommendation batch job from the first story in this series, we were invited to a meeting with the Transportation Services team. They had heard how we assisted the Promotions team and were wondering if we could take a look at a problem of theirs.

BigBoxCo has its products trucked in from airports and shipping ports. Once at the distribution center (DC), they are tagged and separated into smaller shipments for the individual brick-and-mortar stores. While we have our own semi trailers for this part of the product journey, the fleet isn’t efficiently organized.

Currently, the drivers are given a list of stores on the truck’s digital device, and the supervisor suggests a route. However, the drivers often balk at the order of store stops, and they frequently disregard their supervisors’ route suggestions. This, of course, leads to variances in expected shipment and restock times, as well as in total time taken.

Knowing this, the DC staff is unable to fill each truck container completely, because they have to leave space in the truck for access to the product pallets for each store. Ideally, the product pallets would be ordered with the first store’s pallet in the most accessible position in the trailer.

Improving the Experience

The Transportation Services team would like us to examine the available data and see if there is a smarter way to approach this problem. For instance, what if there was a way that we could pre-determine the best possible route to take by determining the order in which the driver should visit the stores?

This is similar to the “traveling salesman problem” (TSP), a hypothetical problem in which a salesman is given a list of cities to visit, and needs to figure out the most efficient route between them. While coded implementations of the TSP can become quite complex, we might be able to use Apache Cassandra’s vector search capability to solve this.

The obvious approach is to plot the geolocation coordinates of each destination city. However, the cities are spread out over a single local metropolitan area, which means the whole-number portions of their latitudes and longitudes are mostly identical. That won’t produce much detectable variance, so we can sharpen the data by considering only the digits to the right of the decimal point in each Geo URI coordinate.

For example, the city of Rogersville (the location of one of our BigBoxCo stores) has a Geo URI of 45.200,-93.567. We’ll be able to detect variance from this and other vectors more easily if we look to the right of each decimal point of our coordinates, arriving at adjusted coordinates of 200,-567 (instead of 45.200,-93.567).
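To make that adjustment concrete, here is a minimal Python sketch of the idea, assuming three decimal places of precision as in the coordinates above (the helper name is ours):

def adjust(coordinate: float) -> int:
    # Keep only the digits to the right of the decimal point,
    # preserving the sign (e.g., -93.567 becomes -567).
    fractional = abs(coordinate) % 1
    adjusted = round(fractional * 1000)
    return -adjusted if coordinate < 0 else adjusted

# Rogersville's Geo URI is 45.200,-93.567
print([adjust(45.200), adjust(-93.567)])  # [200, -567]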

Taking this approach with the local metro cities with our stores gives us the following data:

Table 1 – Adjusted geo URI scheme coordinates for each of the cities with BigBoxCo stores, as well as the distribution center in Farley.

 location_name | adjusted coordinates
---------------+----------------------
 Farley (DC)   | [86, -263]
 Zarconia      | [37, -359]
 Parktown      | [-52, -348]
 Victoriaville | [94, -356]
 Rockton       | [11, -456]
 Maplewood     | [73, -456]
 Rogersville   | [200, -567]

With a list of simplified coordinates like these, let’s look at getting something to work with them.

Implementation

Now that we have data, we can create a table in our Cassandra cluster with a two-dimensional vector. We will also need to create a storage-attached index (SAI) on the vector column:

CREATE TABLE bigbox.location_vectors (
    location_id text PRIMARY KEY,
    location_name text,
    location_vector vector<float, 2>);
CREATE CUSTOM INDEX ON bigbox.location_vectors (location_vector) USING 'StorageAttachedIndex';

This will enable us to use a vector search to determine the order in which to visit each city. It’s important to note, however, that vector searches are based on cosine-based calculations for distance, assuming that the points are on a flat plane. As we know, the Earth is not a flat plane. Calculating distances over a large geographic area should be done using another approach like the Haversine formula, which takes the characteristics of a sphere into account. But for our purposes in a small, local metro area, computing an approximate nearest neighbor (ANN) should work just fine.

Now let us load our city vectors into the table, and we should be able to query it:

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B1643','Farley',[86, -263]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B9787','Zarconia',[37, -359]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B2346','Parktown',[-52, -348]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B1566','Victoriaville',[94, -356]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B6789','Rockton',[11, -456]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B2345','Maplewood',[73, -456]);

INSERT INTO bigbox.location_vectors (location_id, location_name, location_vector) VALUES ('B5243','Rogersville',[200, -567]);

To begin a route, we will first consider the warehouse distribution center in Farley, which we have stored with a vector of 86, -263. We can begin by querying the location_vectors table for the ANNs of Farley’s vector:

SELECT location_id, location_name, location_vector, similarity_cosine(location_vector,[86, -263]) AS similarity
FROM bigbox.location_vectors
ORDER BY location_vector
ANN OF [86, -263] LIMIT 7;

The results of the query look like this:

location_id | location_name | location_vector | similarity 
-------------+---------------+-----------------+------------
       B1643 |        Farley |      [86, -263] |          1 
       B5243 |   Rogersville |     [200, -567] |   0.999867
       B1566 | Victoriaville |      [94, -356] |   0.999163 
       B2345 |     Maplewood |      [73, -456] |   0.993827 
       B9787 |      Zarconia |      [37, -359] |   0.988665
       B6789 |       Rockton |      [11, -456] |   0.978847
       B2346 |      Parktown |     [-52, -348] |   0.947053

(7 rows)


Note that we have also included the results of the similarity_cosine function, so that the similarity of the ANN results is visible to us. As we can see, after disregarding Farley at the top (100% match for our starting point), the city of Rogersville is coming back as the approximate nearest neighbor.
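If you want to see where those similarity numbers come from, a few lines of Python reproduce them. The scores above correspond to cosine similarity mapped onto a 0-to-1 range, (1 + cosine) / 2; the snippet below is just an illustration of that math, not how Cassandra computes it internally:

import math

def similarity(a, b):
    # Cosine similarity mapped onto the 0-1 range: (1 + cosine) / 2.
    dot = sum(x * y for x, y in zip(a, b))
    cosine = dot / (math.hypot(*a) * math.hypot(*b))
    return (1 + cosine) / 2

farley = [86, -263]        # the distribution center, our starting point
rogersville = [200, -567]
parktown = [-52, -348]

print(similarity(farley, rogersville))  # ~0.999867, the nearest neighbor
print(similarity(farley, parktown))     # ~0.947053, the farthest in the result set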

Next, let’s build a microservice endpoint that essentially traverses the cities based on a starting point and the top ANN returned. It will also need to disregard cities that it’s already been to. Therefore, we build a method that we can POST to, so that we can provide the ID of the starting city, as well as the list of cities for the proposed route in the body of the request:

curl -s -XPOST http://127.0.0.1:8080/transportsvc/citylist/B1643 \
-d '["Rockton","Parktown","Rogersville","Victoriaville","Maplewood","Zarconia"]' \
-H 'Content-Type: application/json'

Calling this service with the location_id “B1643” (Farley) returns the following output:

["Rogersville","Victoriaville","Maplewood","Zarconia","Rockton","Parktown"]


So this works great in the sense that it’s providing some systematic guidance for our trucking routes. However, our service endpoint and (by proxy) our ANN query don’t have an understanding of the highway system that connects each of these cities. For now, it’s simply assuming that our trucks can travel to each city directly “as the crow flies.”

Realistically, we know this isn’t the case. In fact, let’s look at a map of our metro area, with each of these cities and connecting highways marked (Figure 1).

Figure 1 – A map of our local metropolitan area showing each of the cities with BigBoxCo stores, as well as the connecting highway system. Each highway is labeled with its name and colored differently so the routes are easy to tell apart.

One way to increase accuracy here would be to create vectors for the segments of the highways. In fact, we could create a highway table, and generate vectors for each by their starting and ending coordinates based on how they intersect with each other and our cities.

CREATE TABLE highway_vectors (
    highway_name TEXT PRIMARY KEY,
    highway_vector vector<float,4>);

CREATE CUSTOM INDEX ON highway_vectors(highway_vector) USING 'StorageAttachedIndex';

We can then insert vectors for each highway. We will also create entries for both directions of the highway segments, so that our ANN query can use either city as the starting or ending points. For example:

INSERT INTO highway_vectors(highway_name,highway_vector)
VALUES('610-E2',[94,-356,86,-263]);
INSERT INTO highway_vectors(highway_name,highway_vector)
VALUES('610-W2',[86,-263,94,-356]);

Going off of the result from our original query, we can run another query to pull back highway vectors with an ANN of the coordinates for the DC in Farley (86,-263) and our store in Rogersville (200,-567):

SELECT * FROM highway_vectors
ORDER BY highway_vector
ANN OF [86,-263,200,-567]
LIMIT 4;

 highway_name | highway_vector
--------------+-----------------------
       610-W2 |  [86, -263, 94, -356]
         54NW | [73, -456, 200, -567]
        610-W |  [94, -356, 73, -456]
        81-NW |  [37, -359, 94, -356]

(4 rows)


If we look at the map shown in Figure 1, we can see that Farley and Rogersville are indeed connected by highways 610 and 54. Now we’re on to something!

We could build another service endpoint to build a highway route from one city to another, based on the coordinates of the starting and ending cities. To really make this service complete, we would want it to eliminate any “orphan” highways returned (highways that aren’t on our expected route) and include any cities with stores that we may want to stop at on the way.

If we used the location_ids of Farley (B1643) and Rogersville (B5243), we should get output that looks like this:

curl -s -XGET http://127.0.0.1:8080/transportsvc/highways/from/B1643/to/B5243 \
-H 'Content-Type: application/json'
{"highways":[
    {"highway_name":"610-W2",
        "Highway_vector":{"values":[86.0,-263.0,94.0,-356.0]}},
    {"highway_name":"54NW",
        "highway_vector":{"values":[73.0,-456.0,200.0,-567.0]}},
    {"highway_name":"610-W",
        "highway_vector":{"values":[94.0,-356.0,73.0,-456.0]}}],
 "citiesOnRoute":["Maplewood","Victoriaville"]}

Conclusions and Next Steps

These new transportation services should be a significant help to our drivers and DC management. They should now be getting mathematically grounded results for route determination between stores. A nice side benefit is that DC staff can also fill the truck more efficiently. With access to the route ahead of time, they can load pallets into the truck in last-in, first-out (LIFO) order, so the first stop’s pallets go in last and sit closest to the doors, using more of the available space.

While this is a good first step, there are some future improvements that we could make, once this initiative is deemed successful. A subscription to a traffic service will help in the route planning and augmentation. This would allow a route recalculation based on significant local traffic events on one or more highways.

We could also use the n-vector approach for coordinate positioning, instead of using the abbreviated latitude and longitude coordinates. The advantage here is that our coordinates would already be converted to vectors, which would likely lead to more accurate nearest-neighbor approximations.

Check out this GitHub repository for code for the above-described example transportation service endpoints, and learn more about how DataStax enables generative AI with vector search.

The post How Vector Search Can Optimize Retail Trucking Routes appeared first on The New Stack.

]]>
A Brief DevOps History: Databases to Infinity and Beyond https://thenewstack.io/a-brief-devops-history-databases-to-infinity-and-beyond/ Wed, 09 Aug 2023 14:15:54 +0000 https://thenewstack.io/?p=22715204

This is the first in a two-part series. The further we get from the origin of a piece of technology,

The post A Brief DevOps History: Databases to Infinity and Beyond appeared first on The New Stack.

]]>

This is the first in a two-part series.

The further we get from the origin of a piece of technology, the easier it is for us to take for granted the incredible amount of work and huge leaps in innovation that were required to get us to where we are today. Databases are one such piece of technology. Many of us were born decades after the first database was implemented, and while we know the technology is old, we have no understanding of the path taken to get where we are today.

To understand how data storage and organization evolved, we first need to understand how computers were used at the dawn of computing. There wasn’t “data storage” as we know it today, just big boxes of punch cards. There was no real storage built into the machine itself, no operating systems, no nothing. This was mostly fine for the way people were doing things. Computers could only run one task at a time, so you’d show up for your appointment with the computer, run your punch cards, take your printed output and leave. You couldn’t interact with the computer otherwise. They were more like huge, advanced calculators.

In 1951, the Univac I computer was released, and with it the first magnetic tape storage drive. This allowed for much faster writes — hundreds of records per second — but tape storage is otherwise slow if you need to seek something since it can only be accessed sequentially.

That changed just a few short years later, in 1956, when IBM introduced disc storage with the 305 RAMAC. Unlike magnetic tape, data stored on discs could be accessed randomly, which sped up both reads and writes. We’d only been accessing data and executing programs sequentially until then, so conceptually, this was a pretty huge jump for people. Without a system to make organizing and accessing that data easier, it wasn’t actually a huge boon.

In 1961, Charles Bachman wrote the first database management system for General Electric. It was called the Integrated Data Store, or IDS, and it opened the door to a lot of new technology. Architecturally it was a masterpiece, and there are IDS-type databases still in use today; for certain applications, their performance simply cannot be matched by a relational database.

A few years later, with other general-purpose database systems entering the market but no real standards set for interacting with them, Bachman got involved with the Conference on Data Systems Languages (CODASYL), the standards group behind COBOL, and its Data Base Task Group began work on a standard way for programs to define and access these databases.

In 1966 another navigational database was released that would alter the course of history — IBM’s Information Management System, which was developed and released on the incredible IBM System/360 for the Apollo missions. Nothing we have today, from a computing perspective, would be possible without the System/360 and the things that were built for it. Countless innovations in computing, from virtualization to data storage, were pioneered at IBM for the System/360 mainframe. In this case, IMS sent us to the moon by handling the inventory for the bill of materials for the cool rocket ship, the Saturn V. IBM calls this a hierarchical database, but both IDS and IMS are examples of the earliest navigational databases.

In the 1970s, the collars got wider and the databases became relational. Edgar Codd was working on hard disks at IBM at the time, and he was pretty frustrated with the CODASYL approach since functionally, everything was a linked list, making a search feature impossible.


In his paper “A Relational Model of Data for Large Shared Data Banks,” he described an alternative model that would organize data into a group of tables, with each table containing a different category of data. Within the table, data would be organized into a fixed number of columns, with one column containing a unique identifier for that particular item and the remaining columns containing the item’s attributes. From this model, he described queries that joined tables based on the relationships between those unique keys to return your result. Sounds familiar, yes?

Originally, Codd described this model using mathematical terms. Instead of tables, rows and columns, he used relations, tuples and domains. The name of the model itself, “relational database,” comes from the mathematical system of relational calculus upon which the operations that allow for joins in this model are built. He was, allegedly, not the biggest fan of people standardizing on tables, rows and columns rather than the mathematical terms he described.

In 1974, IBM was also developing a prototype based on Codd’s paper. System R is the first implementation of the SQL we know and love today, the first proof that the relational model could provide solid performance and the algorithmic influence for many systems that came later. IBM fussed around with this until 1979, before realizing it needed a production version, which eventually became Db2. IBM’s papers about System R were the basis for Larry Ellison’s Oracle Database, which ultimately beat IBM to the market with the public release of Oracle v2 in 1979.

Also, in 1979 came the public release of INGRES, built by Eugene Wong and Michael Stonebraker and based on Codd’s ideas. It originally used a query language called QUEL, but it eventually moved to SQL once it became clear that was where standards were headed. While INGRES itself did not stick around long term, the lessons learned from it did: Nearly 20 years later, Michael Stonebraker released a database we all know and love: PostgreSQL.

Up until this point, the evolution of databases had largely been driven by the changing needs of enterprises and enterprise hardware. Computers didn’t yet exist in a form factor and price point that made them accessible to regular people, whether at home or at the office. We needed another leap, another shift in the way people fundamentally use and think about computers, to see databases evolve once again. To find out more, stay tuned for Part 2 in this series.

The post A Brief DevOps History: Databases to Infinity and Beyond appeared first on The New Stack.

]]>
Best Practices: Collect and Query Data from Multiple Sources https://thenewstack.io/best-practices-collect-and-query-data-from-multiple-sources/ Thu, 03 Aug 2023 15:24:31 +0000 https://thenewstack.io/?p=22714742

In today’s data-driven world, the ability to collect and query data from multiple sources has become vital. With the rise

The post Best Practices: Collect and Query Data from Multiple Sources appeared first on The New Stack.

]]>

In today’s data-driven world, the ability to collect and query data from multiple sources has become vital. With the rise of the Internet of Things (IoT), cloud computing and distributed systems, organizations face the challenge of handling diverse data streams effectively. It’s common to have multiple databases/data storage options. For many large companies, the days of storing everything in a singular database are in the past.

It is crucial to implement best practices for efficient data collection and querying to maximize your data stores’ potential. This includes optimizing data ingestion pipelines, designing appropriate schema structures and using advanced querying techniques. On top of this, you need data stores that are flexible when querying data back out and are compatible with other data stores.

By adhering to these best practices, organizations can unlock the true value of their data and gain actionable insights to drive business growth and innovation. One such option is InfluxDB, a powerful time series database that provides a robust solution for managing and analyzing time-stamped data, allowing organizations to make informed decisions based on real-time insights.

Understanding Different Data Sources

When it comes to data collection, it is crucial to explore different data sources and understand their unique characteristics. This involves identifying the types of data available, their formats and the potential challenges associated with each source. After identifying the data sources, selecting the appropriate data ingestion methods becomes essential. This involves using APIs, Telegraf plugins or implementing batch writes, depending on the specific requirements and constraints of the data sources.

Data space and speed are essential considerations, especially with IoT data. Ensuring data integrity and consistency throughout the collection process is of utmost importance. So, too, is having backup plans for data loss, stream corruption and storage at the edge. This involves implementing robust mechanisms to handle errors, to handle duplicate or missing data, and to validate the accuracy of the collected data. Proper data tagging and organization strategies also are essential to efficient data management and retrieval. By tagging data with relevant metadata and organizing it in a structured manner, it becomes easier to search, filter and analyze effectively.

It’s helpful to note here that most data storage solutions come with their own recommendations for how to begin collecting data into the system. For InfluxDB, we always suggest Telegraf, our open source data-ingestion agent. Or for language-specific needs, we suggest our client libraries written in Go, Java, Python, C# and JavaScript. The important takeaway here is to go with recommended and well-documented tools. While it might be tempting to use a tool you are already familiar with, if it’s not recommended, you might be missing out on the mechanisms for handling problems.
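As an illustration, a basic write with the InfluxDB Python client library looks roughly like this (the URL, token, org, bucket, tag and field names below are placeholders, not values from any real system):

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details; substitute your own instance, org and token.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One time-stamped measurement, tagged so it can be filtered later.
point = Point("generator_data").tag("serial_number", "SN-0001").field("battery_level", 87.5)
write_api.write(bucket="generators", record=point)
client.close()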

Effective Data Modeling

Effective data modeling is a crucial aspect of building robust and scalable data systems. It involves understanding the structure and relationships of data entities and designing schemas that facilitate efficient data storage, retrieval and analysis. A well-designed data model provides clarity, consistency and integrity to the data, ensuring its accuracy and reliability. The most important piece when dealing with multiple data sources is determining your “connector,” or your data piece that connects your data.

For example, let’s look at a generator that has two separate data sets: one in a SQL database storing the unit stats and one in the InfluxDB database that has real-time data about the battery capacity. You might need to identify a faulty generator and its owner based on these two data sets. It might seem like common sense that you would have some kind of shared ID between these two data sets, but when you are first modeling your data, the concern is less about being able to combine data sets and more about the main data use case and removing unnecessary data. Also the other question is: How unique is your connector and how easy will it be to store? For this example, the real-time battery storage might not have easy access to a serial number. That might need to be a hard-coded value added to all data collected from the generator.

Furthermore, as data evolves over time and variations occur, it becomes essential to employ strategies to handle these changes effectively. This may involve techniques such as versioning, migration scripts or implementing dynamic schema designs to accommodate new data attributes or modify existing ones.

For example, if our generator adds new data sets, it’s important that we add our original connector to that new data. But what if you are dealing with an existing data set? Then it gets trickier. You might have to go back and retroactively implement your connector. In this example, you might require people to manually enter their serial number in the app where they register their generator and view battery information. This allows you to tag them as the owner, and you can run analysis on their device from a distance to determine if it’s operating within normal range.

Obviously, this is a very simple example, but many companies and industries use this concept. The idea of data living in a vacuum is starting to disappear as many stakeholders expect to access multiple data sources and have an easy way to combine the data sets. So let’s dive into how to combine data sets once you have them. Let’s continue from our previous example with InfluxDB and a SQL database, a common use case for combining data.

When it comes to querying your data, and especially when it comes to combining data sets, there are a couple of recommended tools to accomplish this task. First is SQL, which is broadly used to query many data sources, including InfluxDB. And when it comes to data manipulation and analysis, a second tool, Pandas, is useful for flexible and efficient data processing. Pandas is a Python library that is agnostic to the data it accepts, as long as it’s within a Pandas DataFrame. Many data sources document how to convert their data streams into a Pandas DataFrame because it is such a popular tool.

The following code is an example of a SQL query in InfluxDB, which returns the average battery level over the past week for this specific device (via serial number):
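A sketch of such a query, using hypothetical measurement and column names:

-- Average battery level over the last 7 days for one device (hypothetical schema).
SELECT AVG(battery_level) AS avg_battery_level
FROM generator_data
WHERE serial_number = 'SN-0001'
  AND time >= now() - INTERVAL '7 days';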

This query would happen on the app side. When a user logs in and registers their generator’s serial number, that enables you to store the data with a serial number tag to use for filtering. For the readability of this query, it’s easier to imagine all generator data goes into one large database. In reality, it’s more likely that each serial number would be in a unique data storage, especially if you want to offer customers the chance to store their data longer for a fee, which is a common offer for some businesses and use cases, like residential solar panels.

An app developer would likely write several such queries to cover averages for the day and week, and to account for battery usage, battery levels and most recent values, etc. Ultimately, they hope to end up with between 10 and 20 values that they can show to the end user. You can find a list of these functions for InfluxDB here.

Once they have these values, they can combine all the data points with their SQL database that houses customer data, things like name, address, etc. They can use the InfluxDB Python client library to combine their two data sets in Pandas.

This is an example of what that join would look like in the end. When it comes to joining, Pandas has a few options. In this example, I’m using an inner join because I don’t want to lose any of the data from my two data sets. You would probably need to rename some columns, but overall, this query results in a combined data frame that you can then convert as needed for use.
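A sketch of that join, assuming the InfluxDB results and the customer records have already been pulled into two DataFrames that share a serial_number column (the sample values are made up for illustration):

import pandas as pd

# battery_df: aggregated values queried from InfluxDB (for example, via its Python client)
# customers_df: customer records pulled from the SQL database
battery_df = pd.DataFrame({"serial_number": ["SN-0001"], "avg_battery_level": [87.5]})
customers_df = pd.DataFrame({"serial_number": ["SN-0001"], "name": ["A. Customer"]})

combined = battery_df.merge(customers_df, how="inner", on="serial_number")
print(combined)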

You can imagine how data scientists might use these tools to run anomaly detection on the data sets to identify faulty devices and alert customers to the degradation and needed repairs. If there is a charge for storing data, users can also combine this data with a financial data set to confirm which customers pay for extended storage time and possibly receive extra information. Even in this simple example, there are many stakeholders, and at scale the number of people who need to access and use multiple data sets only expands.

Key Takeaways

With so much data in the world, the notion of storing everything in a single database or data store may seem tempting. (To be clear, you may want to store all of the same type of data, such as time series data, in a single database.) While it can be a viable solution at a small scale, the reality is that both small- and large-scale companies can benefit from the cost savings, efficiency improvements and enhanced user experiences that arise from using multiple data sources.

As industries evolve, engineers must adapt and become proficient in working with multiple data stores, and the ability to seamlessly collect and query data from diverse sources becomes increasingly important. Embracing this approach enables organizations to leverage the full potential of their data and empowers engineers to navigate the ever-expanding landscape of data management with ease.

The post Best Practices: Collect and Query Data from Multiple Sources appeared first on The New Stack.

]]>
ScyllaDB Is Moving to a New Replication Algorithm: Tablets https://thenewstack.io/scyllladb-is-moving-to-a-new-replication-algorithm-tablets/ Wed, 02 Aug 2023 13:31:31 +0000 https://thenewstack.io/?p=22714612

Like Apache Cassandra, ScyllaDB has historically decided on replica sets for each partition using Vnodes. The Vnode-based replication strategy tries

The post ScyllaDB Is Moving to a New Replication Algorithm: Tablets appeared first on The New Stack.

]]>

Like Apache Cassandra, ScyllaDB has historically decided on replica sets for each partition using Vnodes. The Vnode-based replication strategy tries to evenly distribute the global token space shared by all tables among nodes and shards. It’s very simplistic. Vnodes (token space split points) are chosen randomly, which may cause an imbalance in the actual load on each node.

Also, the allocation happens only when adding nodes, and it involves moving large amounts of data, which limits its flexibility. Another problem is that the distribution is shared by all tables in a keyspace, which is not efficient for relatively small tables, whose token space is fragmented into many small chunks.

In response to these challenges, ScyllaDB is moving to a new replication algorithm: tablets. Initial support for tablets is now in experimental mode.

Tablets allow each table to be laid out differently across the cluster. With tablets, we start from a different side. We divide the resources of the replica-shard into tablets, with a goal of having a fixed target tablet size, and then assign those tablets to serve fragments of tables (also called tablets).

This will allow us to balance the load in a more flexible manner by moving individual tablets around. Also, unlike with Vnode ranges, tablet replicas live on a particular shard on a given node, which will allow us to bind Raft groups to tablets.

This new replication algorithm allows each table to make different choices about how it is replicated and for those choices to change dynamically as the table grows and shrinks. It separates the token ownership from servers, ultimately allowing ScyllaDB to scale faster and in parallel.

Tablets require strong consistency from the control plane; this is provided by Raft. We talked about this detail in the ScyllaDB Summit talk below (starting at 17:26).

Raft vs. Lightweight Transactions for Strongly Consistent Tables

ScyllaDB is in the process of bringing the technology of Raft to user tables and allowing users to create strongly consistent tables that are based on Raft. We already provide strong consistency in the form of lightweight transactions, which are Paxos-based, but they have several drawbacks.

Generally, lightweight transactions are slow. They require three rounds to replicas for every request and they have poor scaling if there are conflicts between transactions. If there are concurrent conflicting requests, the protocol will retry due to conflict. As a result, they may not be able to make progress. This will not scale well.

Raft doesn’t suffer from this issue. First and foremost, it requires only one round to replicas per request when you’re on the leader — or even less than one per request because it can batch multiple commands in a single request. It also supports pipelining, meaning that it can keep sending commands without waiting for previous commands to be acknowledged. The pipelining goes down to a single CPU on which every following state machine runs. This leads to high throughput.

But Raft also has drawbacks in this context. Because there is a single leader, Raft tables may experience latency when the leader dies because the leader has to undergo a failover. Most of the delay is actually due to detection latency because Raft doesn’t switch the leader back and forth so easily. It waits for 1 second until it decides to elect a new leader. Lightweight transactions don’t have this, so they are theoretically more highly available.

Another problem with Raft is that you have to have an extra hop to the leader when the request starts executing not on the leader. This can be remedied by improving drivers to make them leader-aware and route requests to the leader directly.

Also, Raft tables require a lot of Raft groups to distribute load among shards evenly. That’s because every request has to go through a single CPU, the leader, and you have to have many such leaders to have even load. Lightweight transactions are much easier to distribute.

Balancing the Data across the Cluster

So let’s take a closer look at this problem. This is how the load is distributed currently using our standard partitioning (the Vnode partitioning), which also applies to tables that use lightweight transactions.

Replication metadata, which is per keyspace, determines the set of replicas for a given key. The request then is routed to every replica. On that replica, there is a sharding function that picks the CPU in which the request is served, which owns the data for a given key. The sharding function makes sure that the keys are evenly distributed among CPUs, and this provides good load distribution.

The story with Raft is a bit different because there is no sharding function applied on the replica. Every request that goes to a given Raft group will go to a fixed set of Raft state machines and Raft leader, and their location of CPUs is fixed.

They have a fixed shard, so the load distribution is not as good as with the sharding function with standard tables.

We could remedy the situation by creating more tokens inside the replication metadata so that we have more ranges and more narrow ranges. However, this creates a lot of Raft groups, which may lead to an explosion of metadata and management overhead because of Raft groups.

The solution to this problem depends on another technology: tablet partitioning.

Tablet Partitioning

In tablet partitioning, replication metadata is not per keyspace. Every table has a separate replication metadata. For every table, the range of keys (as with Vnode partitioning) is divided into ranges, and those ranges are called tablets. Every tablet is replicated according to the replication strategy, and the replicas live on a particular shard on the owning node. Unlike with Vnodes, requests will not be routed to nodes that then independently decide on the assignment of the key to the shard, but will rather be routed to specific shards.

This will give us more control over where data lives, which is managed centrally. This gives us finer control over the distribution of data.

The system will aim to keep the tablets at a manageable size. With too many small tablets, there’s a lot of metadata overhead associated with having tablets. But with too few large tablets, it’s more difficult to balance the load by moving tablets around. A table will start with just a few tablets. For small tables, it may end there. This is a good thing, because unlike with the Vnode partitioning, the data will not be split into many tiny fragments, which adds management overhead and also negatively affects performance. Data will be localized in large chunks that are easy to process efficiently.

As tables grow, as they accumulate data, eventually they will hit a threshold, and will have to be split. Or the tablet becomes popular with requests hitting it, and it’s beneficial to split it and redistribute the two parts so the load is more evenly distributed.

The tablet load balancer decides where to move the tablets, either within the same node to balance the shards or across the nodes to balance the global load in the cluster. This will help to relieve overloaded shards and balance utilization in the cluster, something which the current Vnode partitioner cannot do.

This depends on fast, reliable, fault-tolerant topology changes because this process will be automatic. It works in small increments and can be happening more frequently than current node operations.

Tablets also help us implement Raft tables. Every Raft group will be associated with exactly one tablet, and the Raft servers will be associated with tablet replicas. Moving a tablet replica also moves the associated Raft server.

Additional Benefits: Resharding and Cleanup

Turns out the tablets will also help with other things. For example, resharding will be very cheap. SSTables are split at the tablet boundary, so resharding is only a logical operation that reassigns tablets to shards.

Cleanup is also cheap because cleaning up all data after a topology change is just about deleting the SSTable — there’s no need to rewrite them.

The post ScyllaDB Is Moving to a New Replication Algorithm: Tablets appeared first on The New Stack.

]]>
Create a Samba Share and Use from in a Docker Container https://thenewstack.io/create-a-samba-share-and-use-from-in-a-docker-container/ Sat, 29 Jul 2023 13:00:16 +0000 https://thenewstack.io/?p=22714223

At some point in either your cloud- or container-development life, you’re going to have to share a folder from the

The post Create a Samba Share and Use from in a Docker Container appeared first on The New Stack.

]]>

At some point in either your cloud- or container-development life, you’re going to have to share a folder from the Linux server. You may only have to do this in a dev environment, where you want to be able to share files with other developers on a third-party, cloud-hosted instance of Linux. Or maybe file sharing is part of an app or service you are building.

And because Samba (the Linux application for Windows file sharing) is capable of high availability and scaling, it makes perfect sense that it could be used (by leveraging a bit of creativity) within your business, your app stack, or your services.

You might even want to use a Samba share to house a volume for persistent storage (which I’m going to also show you how). This could be handy if you want to share the responsibilities for, say, updating files for an NGINX-run website that was deployed via Docker.

Even if you’re not using Samba shares for cloud or container development, you’re going to need to know how to install Samba and configure it such that it can be used for sharing files to your network from a Linux server and I’m going to show you how it’s done.

There are a few moving parts here, so pay close attention.

I’m going to assume you already have Docker installed on a Ubuntu server but that’s the only assumption I’ll make.

How to Install Samba on Ubuntu Server

The first thing we have to do is install Samba on Ubuntu Server. Log into your instance and install the software with the command:

sudo apt-get install samba -y


When that installation finishes, start and enable the Samba service with:

sudo systemctl enable --now smbd


Samba is now installed and running.

You then have to add a password for any user who’ll access the share. Let’s say you have the user Jack. To set Jack’s Samba password, issue the following command:

sudo smbpasswd -a jack


You’ll be prompted to type and verify the password.

Next, enable the user with:

sudo smbpasswd -e jack

How to Configure Your First Samba Share

Okay, let’s assume you want to create your share in the folder /data. First, create that folder with the command:

sudo mkdir /data


In order to give it the proper permissions (so the users who need access can read and write to it), you might want to create a new group and then add users to the group. For example, create a group named editors with the command:

sudo groupadd editors


Now, change the ownership of the /data directory with the command:

sudo chown -R :editors /data


Next, add a specific user to that new group with:

sudo usermod -aG editors USER


Where USER is the specific user name.

Now, make sure the editors group has write permission for the /data directory with:

sudo chmod -R g+w /data
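Samba will only serve directories that are defined as shares in /etc/samba/smb.conf, so add a share definition for the /data folder (the share name and options below are one reasonable example, not the only way to configure it), then restart the service:

[data]
   path = /data
   browseable = yes
   writable = yes
   valid users = @editors

sudo systemctl restart smbd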


At this point, any member of the editors group should be able to access the Samba share. How they do that will depend on the operating system they use.
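For example, from another Linux machine with cifs-utils installed, the share defined above could be mounted like this (where SERVER is the IP address of the Samba host):

sudo mkdir -p /mnt/data
sudo mount -t cifs //SERVER/data /mnt/data -o username=jack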

How to Create a Persistent Volume Mapped to the Share

For our next trick, we’re going to create a persistent Docker volume (named public) that is mapped to the /data directory. This is done with the following command:

docker volume create --opt type=none --opt o=bind --opt device=/data public


To verify the creation, you can inspect the volume with the command:

docker volume inspect public


The output will look something like this:

[
    {
        "CreatedAt": "2023-07-27T14:44:52Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/public/_data",
        "Name": "public",
        "Options": {
            "device": "/data",
            "o": "bind",
            "type": "none"
        },
        "Scope": "local"
    }
]


Let’s now add an index.html file that will be housed in the share and used by our Docker NGINX container. Create the file with:

nano /data/index.html


In that file, paste the following:
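(Any simple HTML page will do; here is a minimal example.)

<!DOCTYPE html>
<html>
  <head>
    <title>Docker + Samba</title>
  </head>
  <body>
    <h1>Hello from NGINX, served out of a Samba share!</h1>
  </body>
</html>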

Save and close the file.

Deploy the NGINX Container

We can now deploy our NGINX container that will use the index.html file in our public volume that is part of our Samba share. To do that, issue the command:

docker run -d --name nginx-samba -p 8090:80 -v public:/usr/share/nginx/html nginx


Once the container is deployed, point a web browser to http://SERVER:8090 (where SERVER is the IP address of the hosting server), and you should see the index.html file that we created above (Figure 1).

Figure 1: Our custom index.html has been officially served in a Docker container.

Another really cool thing about this setup is that anyone with access to the Samba share can edit the index.html file (even with the container running) to change the page. You don’t even have to stop the container. You could even create a script to automate updates of the file if you like. For this reason, you need to be careful who has access to the share.

Congrats, you’ve just used Docker and Samba together. Although this might not be a wise choice for production environments, for dev or internal services/apps, it could certainly come in handy.

The post Create a Samba Share and Use from in a Docker Container appeared first on The New Stack.

]]>
Data Warehouses Are Terrible Application Backends https://thenewstack.io/data-warehouses-are-terrible-application-backends/ Thu, 13 Jul 2023 13:39:30 +0000 https://thenewstack.io/?p=22713061

The ever-increasing tide of data has become a paradox of plenty for today’s developers. According to a report from Seagate,

The post Data Warehouses Are Terrible Application Backends appeared first on The New Stack.

]]>

The ever-increasing tide of data has become a paradox of plenty for today’s developers. According to a report from Seagate, by 2025 worldwide data will grow to a staggering 163 zettabytes, over 10 times the volume in 2016. More data should mean deeper insights and better user experiences, but it also leads to problems.

For data-oriented developers, this explosion is a double-edged sword. It presents an incredible opportunity to build user-facing features powered by data and harnessing real-time analytics. On the other hand, processing all that data with minimal latency and at high concurrency can be really challenging with a typical modern data stack.

Data warehouses, in particular, are the resting place of choice for big data at modern companies, and their online analytical processing (OLAP) approach is perfect for complex, long-running analytical queries over big data for things like business intelligence reports and dashboards.

However, they make terrible application backends.

This post explains why a combination of job pool management, concurrency constraints and latency concerns preclude data warehouses from effectively functioning as a storage layer for user-facing applications, and why you should consider alternative technology for your data app stack.

Understanding the Data Warehouse

Ten years ago, data warehouses were the hot, new thing in the data world. Capable of storing a whole mess of structured data and processing complex analytical queries, data warehouses set a new bar for how businesses run their internal business-intelligence processes.

Specifically, data warehouses do three things that have made analytics accessible and powerful:

  1. They separate storage and compute, reducing costs to scale.
  2. They leverage distributed compute and cloud networking to maximize query throughput.
  3. They democratize analytics with the well-known SQL.

If you want a good primer on why data warehouses exist and what they’ve enabled for modern data teams, I encourage you to read this.

Today, data warehouses like Snowflake, BigQuery, Redshift and Azure Synapse still occupy the head of the table in many companies’ data stacks, and because of their favored position within the organization, developers may be tempted to use them as a storage layer for user-facing analytics. They have the power to run complex analytical queries that these use cases demand; the data is already there, and you’re already paying for them. What’s not to love?

As it turns out, quite a bit. Here are the reasons why application developers can’t rely on data warehouses as a storage layer for their user-facing analytics.

The Unpredictable World of Job Pools and Nondeterministic Latency

Data warehouses process analytical queries in a job pool. Snowflake, for example, uses a shared pool approach to process queries concurrently, aiming to optimize available computing resources.

Here’s the problem: Job pools create nondeterministic latency with a set floor. A simple SELECT 1 on Snowflake could potentially run in a few milliseconds, but more likely it will take a second or more simply because it must be processed in a queue with all the other queries.

Even the best query-optimization strategies can’t overcome this limitation.

Running a query on a data warehouse is like playing a game of “latency roulette.” You can spin the wheel the same way every time, but the final outcome (in this case, the latency of the query response) lands unpredictably.

Now, if you’re a backend developer building APIs over a storage layer, you’d never take a chance on nondeterministic latency like this. Users expect snappy APIs that respond within milliseconds. In fact, the database query should be one of the fastest things in the request path, even compared to network latency. If you’re building on top of a data warehouse, this won’t be the case, and your users will feel the pain.

The Illusion of Scalability

Latency is but one part of the equation for API builders. The second is concurrency. If you’re building an API that you intend to scale, solid fundamentals demand that you provide low-latency responses for a highly concurrent set of users.

When you dive deeper into the functionality of data warehouses, you’ll realize that to genuinely scale horizontally to accommodate increased query concurrency, you need to either spin up new virtual warehouses or increase their cluster limit, or both. For example, if you wanted to support just 100 concurrent queries per minute on Snowflake, you’d need 10 multicluster warehouses.

And spinning up new warehouses isn’t cheap. Just ask your buddies over in data engineering. For the Snowflake example, you’d be paying more than $30,000 a month.

Concurrency constraints in data warehouses like Snowflake present one of the most significant challenges when it comes to developing real-time applications. With a large volume of queries knocking at your warehouse’s door, and a limited number of resources to serve them, you’re bound to experience some serious latency issues unless you scale up and out. And scaling up and out is often prohibitively expensive.

Building Cache Layers: A Recent Trend and Its Drawbacks

OK, so nobody really builds an application directly on top of a data warehouse, right? Obviously, you’d use a caching layer like Redis or some other real-time database to make sure your API requests are fast and balanced even with many concurrent users.

This is a common approach when the data you need to support your application resides in a data warehouse. In theory, the approach seems workable. In reality, it carries some serious drawbacks, the most significant of which is data freshness.

Simply put, using a cache layer works great for shrinking query latency, but it still won’t work for applications built over streaming data that must always serve the most recent events.

Think about a fraud-detection use case where a financial institution must determine if a transaction is fraudulent within the time it takes to complete the transaction (a few seconds). This usually involves a complex analytical process or online machine learning feature store based on just-created data. If that data hits the warehouse before your backend APIs, no cache layer will save you. The cache is great for enabling low-latency API requests by storing analytics recently run in batch ETL (extract, transform, load) processes, but it can’t access just-created data, because the warehouse is still processing it.

The Alternative: Real-Time Data Platforms

As we’ve discussed, the fundamental problem of building data-intensive applications over data warehouses boils down to a failure to maintain:

  • low-latency queries
  • from highly concurrent users
  • over fresh data

So, what’s the alternative?

For building user-facing applications, you should turn to real-time data platforms like Tinybird.

What Is a Real-Time Data Platform?

A real-time data platform helps data and engineering teams create high-concurrency, low-latency data products over streaming data at scale.

A real-time data platform uses a columnar database under the hood so it can handle the complex analytical workloads previously relegated to data warehouses but at a much faster pace. Furthermore, a real-time data platform typically offers a low-latency publication layer, exposing low-latency APIs to data-intensive applications that may rely on both batch and streaming data sources. Building APIs at scale over streaming data is often overlooked, but it can be a massive pain to maintain and scale as your data grows.

Reference Architectures for Real-Time Data Platforms

When building on top of real-time data platforms, consider two incremental architectures for your data stack.

In the first approach, the data warehouse can still be the primary underpinning storage layer, where the real-time data platform effectively serves as a publication layer. In this architecture, data is synced between the data warehouse and the real-time data platform either on a schedule or on ingestion, and the real-time data platform handles additional transformations as well as providing a low-latency, high concurrency API.

Real-time data platforms like Tinybird can function like a cache layer over a data warehouse using native connectors. In this way, they eliminate the need for custom object–relational mapping (ORM) code but still may suffer some data freshness constraints.

In practice, this is similar to using a real-time data platform as a caching layer, with the added benefit of avoiding the need to write custom API code to connect the cache with your application and having the ability to perform additional enrichment or transformations with the power of full online analytical processing (OLAP).

The second approach bypasses the data warehouse entirely or operates in parallel. Assuming event data is placed on some kind of message queue or streaming platform, the real-time data platform subscribes to streaming topics and ingests data as it’s created, performing the necessary transformations and offering an API layer for the application to use.

This can be the preferred approach since it eliminates the data freshness issues that still exist when a caching layer is used over a data warehouse and, with the right real-time data platform, streaming ingestion can be quite trivial.

The Benefits of a Real-Time Data Platform

  1. Native data-source connectors: Real-time data platforms can integrate with various data sources and other tech stack components. This makes it quite easy to unify and join multiple data sources for real-world use cases. For example, you can combine data from Snowflake or BigQuery with streaming data from Confluent or Apache Kafka. Tinybird, for example, even offers a simple HTTP-streaming endpoint that makes it trivial to stream events directly within upstream application code.
  2. Real-time OLAP power: Like data warehouses, a real-time data platform gives developers the ability to run complex OLAP workloads.
  3. Cost-effective: Establishing a publication layer on Snowflake using traditional methods would necessitate additional virtual warehouses, thereby leading to increased costs. In contrast, the pricing models for real-time data platforms are often predicated on the volume of data processed via the publication layer, resulting in a significant reduction in cost when used as an application backend.
  4. Scalability: Many real-time data platforms are serverless, so infrastructure scales with you, handling big data with high levels of performance and availability. Rather than host your database on bare metal servers or tweak cluster settings with managed databases, you can focus on building and shipping use cases while the real-time data platform handles scale under the hood.
  5. Zero glue code: Even with a cache layer over data warehouses, you’d still have to write glue code: ETLs to get data from the warehouse to your cache, and ORM code to publish APIs from your cache. A real-time data platform, in contrast, handles the entire data flow, from ingestion to publication, with zero glue code. Data gets synced using native connectors, transformations get defined with SQL, and queries are instantly published as scalable APIs with built-in documentation, authentication token management and dynamic query parameters.

Like a data warehouse, Tinybird offers OLAP storage with SQL-based transformations. Unlike data warehouses, it preserves data freshness and offers a low-latency, high-concurrency API layer to support application development.

Where data warehouses fail as application backends, real-time data platforms like Tinybird shine. Like data warehouses, these platforms support heavy data loads and complex analytics, but they do so in a way that preserves data freshness, minimizes query latency and scales to support high concurrency.

Wrapping Up

Data warehouses aren’t bad technology, but they are bad application backends. Despite their power and usefulness for business intelligence, they simply can’t cost-effectively handle the freshness, latency and concurrency requirements that data-oriented applications must support.

Real-time data platforms, on the other hand, function exceptionally well as backends for data-intensive applications across a wide variety of use cases: real-time personalization, in-product analytics, operational intelligence, anomaly detection, usage-based pricing, sports betting and gaming, inventory management and more.

Ready to experience the industry-leading real-time data platform? Try Tinybird today for free.

The post Data Warehouses Are Terrible Application Backends appeared first on The New Stack.

]]>
JSON and Relational Tables: How to Get the Best of Both https://thenewstack.io/json-and-relational-tables-how-to-get-the-best-of-both/ Wed, 05 Jul 2023 14:19:25 +0000 https://thenewstack.io/?p=22712320

JSON and relational tables are both popular and extremely useful, and they have their own distinct approaches to organizing data

The post JSON and Relational Tables: How to Get the Best of Both appeared first on The New Stack.

]]>

JSON and relational tables are both popular and extremely useful, and they have their own distinct approaches to organizing data in an application. In this follow-up to a popular article by my colleague, Chris Saxon, on managing JSON files using SQL, I’d like to discuss a major development that helps developers work with both relational tables and JavaScript Object Notation (JSON) documents, without the tradeoffs of either model.

The Relational Model

I’ll start with the relational model, which is a general-purpose model that uses data normalization to ensure data integrity and avoids data duplication. In addition, SQL makes data access, manipulation and modeling flexible and easy. The strength of the relational model, when coupled with SQL, lies in its general-purpose design: You can use normalization techniques to take data from an application and split it into stand-alone logical parts by storing these in separate tables.

Imagine a student class schedule. In its simplest form, such a data item would have four parts that in combination make up the schedule: a) the information about the student for whom the schedule is for, b) the courses the student wants to attend, c) the time when the courses are scheduled, d) the classrooms where the courses are held. Each piece of information is important for students to know where and when they are supposed to be, but the relational model would normalize these four pieces into separate tables.
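A minimal relational sketch of that normalization might look like this (table and column names are illustrative):

CREATE TABLE students   (student_id   INT PRIMARY KEY, name  VARCHAR(100));
CREATE TABLE courses    (course_id    INT PRIMARY KEY, title VARCHAR(100));
CREATE TABLE classrooms (classroom_id INT PRIMARY KEY, room  VARCHAR(20));

CREATE TABLE schedules (
  student_id   INT REFERENCES students(student_id),
  course_id    INT REFERENCES courses(course_id),
  classroom_id INT REFERENCES classrooms(classroom_id),
  starts_at    TIMESTAMP,
  PRIMARY KEY (student_id, course_id)
);

Each question in the paragraphs that follow, whether it comes from a student, a teacher or the facilities team, can then be answered by joining these same tables in different ways.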

Why is this normalization important? Because it increases the reusability of the data for other applications or purposes. In this example, teachers also have schedules, which look different than student schedules. Teachers care about which courses they are giving, and when and in what classrooms they are. Yet the teacher’s schedule is likely to look very different from a student’s because the teacher only teaches a few courses while students probably also attend many courses taught by other teachers.

Next, add the facilities team to the mix. They probably do not care about the teachers or the students, but they care very much about which classrooms need special setups. They also need to know when classes are not being used in order to schedule maintenance.

Thanks to the relational model and the normalization of the data, all these questions can be answered without having to restructure the data physically or logically.

Yet, while the relational model gives us this nice, general-purpose design, it is not always the easiest for developers to use. That’s because developers usually build apps in terms of object-oriented programming, where objects are structured hierarchically into classes and therefore aren’t naturally aligned with the notion of normalized data in rows and columns within separate tables.

JSON Document Model

The JSON document model overcomes many of the rows-and-columns limitations of the relational model by allowing apps to directly map objects into a corresponding hierarchical JSON format. This can drastically reduce the complexities of object-relational mapping.

Furthermore, JSON is a self-describing data format that makes inter-application communication easy, because a JSON document not only contains the actual data, but also information about the data — the metadata or schema. However, although the JSON document model provides great benefits to developers by more closely mapping to the object-oriented nature of the application’s internal operations, it isn’t ideal either.

Here’s one weakness: Due to the self-describing and hierarchical nature of JSON documents, information is stored redundantly, leading to inefficiencies and potential inconsistency down the road.

In the scheduling example, what happens if the facilities crew needs to take a classroom out of commission due to a water leakage? If the data for the student and teacher schedules were all modeled in JSON format, chances are that each student and teacher would have one JSON document with redundant data about the affected classroom.

Those documents may not have the same structures: One document may have the student information as the root element while the other has the teacher information as the root element. The student schedule document may have information about each class’s teachers, but the teacher schedule document may not have any information about individual students or about co-teachers.

In this example, the facilities-management application would have to somehow propagate the classroom change, searching for the classroom in differently structured JSON documents and replacing it with another. While doable, this adds complexity to the facilities app, which arguably should not have to worry about student and teacher schedules and their associated data structures at all; it should merely map everything in classroom 225 to classroom 316B for the next two weeks.
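To make the propagation problem concrete, here is a minimal Python sketch of what the facilities app would have to do if classroom data were embedded redundantly in differently structured schedule documents. The document shapes and field names are hypothetical.

```python
# Hypothetical schedule documents with redundantly embedded classroom data.
student_schedule = {
    "student": {"id": 1001, "name": "Dana"},
    "classes": [
        {"course": "Biology 101", "teacher": "Dr. Lee", "classroom": "225"},
        {"course": "Chemistry 201", "teacher": "Dr. Kim", "classroom": "118"},
    ],
}
teacher_schedule = {
    "teacher": {"id": 77, "name": "Dr. Lee"},
    "sessions": [
        {"course": "Biology 101", "room": "225"},  # same fact, different shape and key
    ],
}

def reassign_room(doc, old: str, new: str) -> None:
    """Walk an arbitrarily shaped document and rewrite every room reference."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            # The facilities app has to know every key that might hold a room.
            if key in ("classroom", "room") and value == old:
                doc[key] = new
            else:
                reassign_room(value, old, new)
    elif isinstance(doc, list):
        for item in doc:
            reassign_room(item, old, new)

# Every schedule document in the store must be scanned and rewritten.
for doc in (student_schedule, teacher_schedule):
    reassign_room(doc, old="225", new="316B")
```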

In contrast, if the data was modeled relationally, there would only be one table that contains all schedules — and therefore only one table would require an update. All three applications, whether used by students, teachers or facilities staff, would automatically use the correct data consistently. (Of course, this example assumes all data to be in one database or document store for simplicity.)

The Worst of Both Worlds?

To get around the problem of inconsistency, some JSON document databases recommend normalizing documents using references: Instead of including the classroom information in the student or teacher schedule documents, the documents may simply include an ID for that classroom residing in another document. The problem is that normalizing documents this way makes a JSON document store work like a relational database, which defeats the simplicity benefits of the document model. Developers end up with a model that is the worst of both worlds.

By the way, it’s difficult to model many-to-many relationships using the JSON document model. Attempts to model the relationships lead to even greater data duplication and the potential for additional inconsistencies.

Developers get around these issues in different ways. Historically, they have often resorted to object-relational mapping (ORM) frameworks to consume and manipulate data from relational databases in hierarchical, object-oriented form while leaving the data itself in relational form.

Problem solved? No. ORMs are not perfect either. They add a layer of abstraction that developers have limited control over, often provide their own conflict or collision resolution that may or may not be adequate, and may not take full advantage of all the features a given database engine provides. ORMs can quickly become the lowest common denominator, moving the application even further away from the data, and from efficient manipulation of it, than originally intended.

To sum up:

  • The relational model is a general-purpose model that makes it easy to query and manipulate parts of data but sometimes poses challenges for developers when the data is consumed by their apps.
  • The JSON document format makes it easy for developers to consume data in hierarchical form but adds data duplication and the associated inconsistency and reusability challenges of its own.
  • ORMs alleviate the task of decomposing and reconstructing data between object-oriented hierarchical structures and relational tables but may bring their own set of challenges to the overall architecture.

The Best of Both Worlds

These issues with relational databases and JSON document models are well known, and many database providers have been working to offer a solution. A new approach is an Oracle Database feature called JSON Relational Duality. As the name suggests, it combines the benefits of the JSON document and relational worlds in a single database while avoiding the tradeoffs I’ve discussed so far.

The new feature is available in Oracle Database 23c Free—Developer Release, which anyone can download and use; no lawyers will ever come knocking. If you are curious about that release, check out my conversation about it with James Governor, co-founder of RedMonk. (It’s an unprecedented move by Oracle to offer a new database version free to developers before the paid version arrives. But I digress.)

Let’s get into what JSON relational duality can do:

With a capability called a JSON Relational Duality View, data is still stored in relational tables in a highly efficient, normalized format but is accessed by apps as if it were JSON documents. Developers can continue to think in terms of JSON documents for data access while, behind the scenes, the system uses an efficient, multipurpose relational model for data storage.

In this and other ways, Duality Views hide all the complexities of database-level concurrency control from the user, providing document-level serializability.

Duality Views can be declared over any number of tables using intuitive GraphQL or SQL/JSON syntax. For example, the scheduling scenario described earlier could be defined as a Duality View that renders the relational data in the students, schedules, courses and teachers tables as a JSON document corresponding to an app-tier StudentSchedule object.

Developers can easily define different Duality Views on the same or overlapping set of relational tables, making it easy to support many use cases on the same data (such as TeacherSchedule and StudentSchedule Duality Views that share common tables for courses and schedules).

Using Duality Views, developers have JSON document access to all data, including access to data stored in relational tables. At the same time, they can access the relational data directly using SQL, if they choose. This way, applications using Duality Views can now simply read a document from the database, make any changes they need and write the modified document back to the database without having to worry about the underlying relational structure.

This is where the magic trick comes in: The database consumes the JSON document and performs the correct creates, reads, updates and deletes on the corresponding rows based on the Duality View definitions, with full ACID concurrency and consistency controls and runtime optimizations, all in just one round trip to the database. And the best part is that developers can manipulate documents realized by Duality Views in the ways they’re used to, using their usual drivers, frameworks, tools and development methods.
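As a rough illustration of that read-modify-write flow, the sketch below uses the python-oracledb driver against a hypothetical student_schedule_dv view. The view name, document fields, bind style and update statement are illustrative assumptions rather than verbatim Oracle 23c syntax; the GitHub tutorials mentioned below show the exact form.

```python
# Hypothetical sketch: read a document from a Duality View, change it, write it back.
# The view name, document shape and SQL details are illustrative assumptions.
import json
import oracledb

conn = oracledb.connect(user="app", password="app", dsn="localhost/freepdb1")
cur = conn.cursor()

# Each row of a Duality View exposes one JSON document assembled from the
# underlying relational tables (students, schedules, courses, teachers).
cur.execute(
    "SELECT data FROM student_schedule_dv WHERE json_value(data, '$.studentId') = :id",
    id="1001",
)
(doc,) = cur.fetchone()
# Depending on driver settings, the JSON may arrive as a dict or a string.
schedule = json.loads(doc) if isinstance(doc, (str, bytes, bytearray)) else doc

# Modify the document as the app sees it; no knowledge of the table layout needed.
schedule["classes"][0]["classroom"] = "316B"

# Writing the document back lets the database perform the corresponding
# inserts, updates and deletes on the normalized rows in one round trip.
cur.execute(
    "UPDATE student_schedule_dv SET data = :doc "
    "WHERE json_value(data, '$.studentId') = :id",
    doc=json.dumps(schedule), id="1001",
)
conn.commit()
```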

Lock-Free Concurrency Control

Duality Views also benefit from a novel lock-free or optimistic concurrency control architecture, so developers can manage their data consistently across stateless operations. You can find more details about that here, but the basic idea is that stateless document-level serializability using built-in optimistic concurrency control avoids pessimistic locking and the associated performance issues so that developers can continue to focus on building their app instead of debugging concurrency control mechanisms and race conditions.

Why use JSON Relational Duality:

  • Duality Views eliminate the need for object-relational mapping (ORM) frameworks and JSON-relational de-/serializers.
  • Document-centric applications can access Duality Views via document APIs, such as Oracle Database API for MongoDB and Oracle REST Data Services (ORDS), or they can use standards-based SQL/JSON functions.
  • Application operations against Duality Views execute optimally inside the database, because all rows needed for an app-tier object use case can be fetched and stored in a single, database-optimized round trip.
  • Duality Views eliminate data duplication: Many Duality Views can be defined over the same or overlapping relational tables, giving every application its business objects without duplicating data.
  • Duality Views allow relational data to be modified concurrently and consistently via JSON document-based PUT and POST operations.

By giving developers the flexibility and data access benefits of the JSON document model as well as the storage efficiency and general-purpose power of the relational model, JSON Relational Duality is a big step forward in simplifying app development.

The cool thing is that anyone can try these capabilities for free by downloading Oracle Database 23c Free—Developer Release.

Furthermore, Oracle is making it easy for you to experience Duality Views when building apps, with well-documented and easy-to-learn tutorials on GitHub. You can browse and download the tutorials and use them with this release. In the tutorials, you can use SQL, REST and the Oracle Database API for MongoDB to try features, capabilities and examples related to Duality Views. Last but not least, Oracle also provides LiveLabs to help you play with Duality Views in a ready-to-run environment.

The post JSON and Relational Tables: How to Get the Best of Both appeared first on The New Stack.

]]>
Nvidia Uses OpenStack Swift Storage as Part of Its AI/ML Process https://thenewstack.io/nvidia-uses-openstack-swift-storage-as-part-of-its-ai-ml-process/ Fri, 30 Jun 2023 13:50:47 +0000 https://thenewstack.io/?p=22712264

When you think about artificial intelligence (AI) and machine learning (ML), the OpenStack Infrastructure as a Service (IaaS) cloud, and

The post Nvidia Uses OpenStack Swift Storage as Part of Its AI/ML Process appeared first on The New Stack.

]]>

When you think about artificial intelligence (AI) and machine learning (ML), the OpenStack Infrastructure as a Service (IaaS) cloud and its object storage component, Swift, are not the first technologies that come to mind. But that’s exactly what trillion-dollar-plus chip and AI powerhouse Nvidia uses to power its ML efforts.

John Dickinson, Nvidia’s Principal Systems Software Engineer, explained at the recent OpenInfra Summit that ML requires speedy, robust storage solutions. “With the rise of AI and ML technologies, our primary role as storage providers is to fuel the engine with as much data as possible, as quickly as possible,” said Dickinson. To keep up with the increasing demand, storage solutions must offer high capacity, availability, and aggregate throughput.

Enter Swift

True, Dickinson continued, “While Nvidia’s Grace and Hopper chips and Spectrum switches are pushing the boundaries in the computing and networking domains, storage speed is also vital.” Open source Swift, a distributed object storage system designed to scale from a single machine to thousands of servers, is optimized for multitenancy and high concurrency. A simple, REST-based API is normally used to access Swift.

During his keynote at the conference, Dickinson illustrated the ML workflow, emphasizing the significance of understanding data access patterns while building the supporting storage systems. After all, ML demands massive datasets that are much too large to fit into GPU memory or server flash storage.

According to Dickinson, the answer lies in object storage, which offers high throughput and large capacity. While object storage presents its own set of challenges, including caching complexities and varying APIs, he firmly stated that the goal is to “enable users to do what was previously impossible.”

Two Key Concepts — Inner and Outer

Nvidia, he disclosed, is implementing two key concepts to tackle these issues — an “inner ring” and an “outer ring”. The inner ring is characterized by high speed, low latency, and its connection to a specific GPU cluster, resembling file storage for the end users. The outer ring, on the other hand, offers large capacity, high throughput, and high availability. For the outer ring, Nvidia uses Swift, thanks to its suitability for large capacity and high throughput storage.

Implementing these storage concepts has enabled Nvidia to support massive datasets that were previously impossible to handle, improve performance, and increase workload portability. Swift also delivers improved I/O performance, with a single read from the outer ring on the first ML epoch. This outer ring data is also accessible from every compute cluster. In addition, since Swift supports many standard APIs, such as POSIX and NFS for file access and S3, Azure and native Swift for object access, it’s easy to work with the datasets regardless of how you need to access them.
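As a small illustration of what multiprotocol access to the outer ring can look like from a training job, the sketch below pulls a dataset shard from an S3-compatible endpoint with boto3 and caches it locally for later epochs. The endpoint, bucket and object names are made up, and this is not Nvidia’s actual tooling.

```python
# Hypothetical sketch: read a training shard from an S3-compatible object store
# (such as a Swift cluster exposing the S3 API). Endpoint and names are made up.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://swift.example.internal",  # outer-ring object store
    aws_access_key_id="training-key",
    aws_secret_access_key="training-secret",
)

# First epoch: one read from the outer ring; later epochs can hit a local cache.
response = s3.get_object(Bucket="datasets", Key="imagenet/shard-0001.tar")
shard_bytes = response["Body"].read()

with open("/tmp/shard-0001.tar", "wb") as cache_file:  # simple local cache
    cache_file.write(shard_bytes)

print(f"fetched {len(shard_bytes)} bytes from the outer ring")
```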

The strategy doesn’t stop at the inner and outer rings. Acknowledging the increasing difficulty of data exploration as datasets grow, Nvidia has created a dataset service aiming to simplify this process. In a live demonstration, Dickinson showcased how these storage services facilitate large-scale machine learning, highlighting how a user can load a dataset into Swift, explore it in a Jupyter notebook, and run an ML task without worrying about the down and dirty details of accessing that storage.

This live demo impressed the OpenInfra audience of about 750 users. It’s rare that a technical audience is impressed by a demo. They’ve seen it all, and they know all the tricks. But this one caught their attention. OpenStack and Swift have a clear role to play in serious work with massive ML datasets.

The post Nvidia Uses OpenStack Swift Storage as Part of Its AI/ML Process appeared first on The New Stack.

]]>
MinIO’s Object Storage Supports External Tables for Snowflake https://thenewstack.io/minios-object-storage-supports-external-tables-for-snowflake/ Tue, 27 Jun 2023 14:56:32 +0000 https://thenewstack.io/?p=22711896

MinIO provides cloud-agnostic object storage that’s equally at home in on-premises, co-location, and edge environments for workloads involving advanced machine

The post MinIO’s Object Storage Supports External Tables for Snowflake appeared first on The New Stack.

]]>

MinIO provides cloud-agnostic object storage that’s equally at home in on-premises, co-location, and edge environments for workloads involving advanced machine learning, streaming datasets, unstructured data, semi-structured data, and structured data.

Its impact on these data types is of more than academic interest to Snowflake users. MinIO’s capability to supply object storage almost anywhere data exists nicely complements Snowflake’s notion of external tables, which minimize data movement, decrease costs and allow organizations to apply more of their data to any given use case. The company had a major presence at the Snowflake Summit held this week and spoke with The New Stack about its relationship with Snowflake.

According to Jonathan Symonds, MinIO CMO, Snowflake “wants access to more data and not less and so, therefore, they basically created this concept called external tables. That allows you to query in place wherever the data may exist.”

When storing data with MinIO, there are few limits to where that data might actually be.

External Tables

With this paradigm, Snowflake users can query data wherever they have external tables set up which, when working with MinIO’s object storage, might be in adjacent clouds, on-premises data centers, and in edge devices. From an end-user perspective, the data may as well be in Snowflake — sans all the data preparation and data pipeline work required for it to get there. “The only thing that needs to happen is the administrator has to set up MinIO as an external table and give permissions to the user to be able to use it,” explained MinIO executive Satish Ramakrishnan. “So once they see this as an external table, then they can just run their regular queries. To them, it just looks like rows and columns in a database.”
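In practice, that setup amounts to pointing Snowflake at the MinIO endpoint as an external stage and defining an external table over it. The sketch below, using the Snowflake Python connector, is a hypothetical illustration: the account, endpoint, credentials and table names are made up, and the exact stage and external-table options can vary by account and version.

```python
# Hypothetical sketch: expose an S3-compatible MinIO bucket to Snowflake and
# query it as an external table. All names and credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="admin", password="...", warehouse="ANALYTICS_WH"
)
cur = conn.cursor()

# An external stage that targets S3-compatible storage (here, a MinIO endpoint).
cur.execute("""
    CREATE OR REPLACE STAGE minio_stage
      URL = 's3compat://analytics-bucket/orders/'
      ENDPOINT = 'minio.internal.example.com'
      CREDENTIALS = (AWS_KEY_ID = 'minio-access-key' AWS_SECRET_KEY = 'minio-secret-key')
""")

# An external table over the staged Parquet files; no data is copied into Snowflake.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE orders_ext
      LOCATION = @minio_stage
      FILE_FORMAT = (TYPE = PARQUET)
      AUTO_REFRESH = FALSE
""")

# To the analyst this just looks like rows and columns.
for row in cur.execute("SELECT COUNT(*) FROM orders_ext"):
    print(row)
```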

Snowflake is responsible for querying the external data as though it were located internally. Ramakrishnan noted that for external tables, the cloud warehouse “does the same thing it does for its own internal systems, like caching queries and creating materialized views. It does that all automatically.” The performance issues appear to be negligible and are attributed in part to the caching techniques. Ramakrishnan referenced a use case in which external tables were queried from Snowflake and “the first time when it does the fetch it took a few seconds and from then on everything else was in milliseconds…So, we know that there’s a lot of that caching, which they’re already undertaking.”

In-Place Querying

The in-place querying capabilities Snowflake’s external tables enable in MinIO’s object storage create numerous advantages for the enterprise. The most noteworthy may be that data in distributed environments no longer has to move. Data movement has traditionally been considered a bottleneck and is often costly, if not cumbersome.

“You’re able to actually run this without any of that data movement, either the cost [of it] or having to clean it up,” Ramakrishnan commented about this in-place querying approach. “You can do it on all of your data. And most importantly, it’s current. It doesn’t have to go from a pipeline from your data lake all the way into Snowflake.” Depending on the use case and the velocity of the data, when data pipelines are involved it’s not uncommon for new data to have already been generated by the time data is transported to Snowflake.

Other Gains

The prohibitive costs of such traditional approaches frequently force users to choose which data they move, preventing them from querying or accessing all of it. Another advantage of the external table approach is that data is accessible from multiple instances of Snowflake, which is beneficial for organizations with decentralized teams in different geographic locations.

“You can have a Snowflake instance in AWS and a Snowflake instance on GCP and still access that same table,” Ramakrishnan remarked. “There’s no data movement required.” There are also fewer copies of data, which helps security, access control, and data governance efforts. Plus, users get a uniform version of their data to support the proverbial single version of the truth. “You don’t have to move data around and you can actually run all your regular Snowflake jobs; queries and applications will all work as is,” Ramakrishnan added.

Overarching Significance

The overarching significance of object storage may very well be its ability to provide highly detailed metadata descriptions of unstructured and semi-structured data, which is swiftly retrievable at scale. However, Snowflake’s in-place querying via external tables roundly expands those benefits by forsaking the data movement, costs, and latency of data pipelines. The cloud data warehouse’s broad user base may very well avail itself of this benefit as much as it does for other applications of object storage.

The post MinIO’s Object Storage Supports External Tables for Snowflake appeared first on The New Stack.

]]>
The Architect’s Guide to Storage for AI https://thenewstack.io/the-architects-guide-to-storage-for-ai/ Thu, 01 Jun 2023 14:44:06 +0000 https://thenewstack.io/?p=22709651

Choosing the best storage for all phases of a machine learning (ML) project is critical. Research engineers need to create

The post The Architect’s Guide to Storage for AI appeared first on The New Stack.

]]>

Choosing the best storage for all phases of a machine learning (ML) project is critical. Research engineers need to create multiple versions of datasets and experiment with different model architectures. When a model is promoted to production, it must operate efficiently when making predictions on new data.  A well-trained model running in production is what adds AI to an application, so this is the ultimate goal.

As an AI/ML architect, this has been my life for the past few years. I wanted to share what I’ve learned about the project requirements for training and serving models and the available options. Consider this a survey. I will cover everything from traditional file systems variants to modern cloud native object stores that are designed for the performance at scale requirements associated with large language models (LLMs) and other generative AI systems.

Once we understand requirements and storage options, I will review each option to see how it stacks up against our requirements.

Before we dive into requirements, reviewing what is happening in the software industry today will be beneficial. As you will see, the evolution of machine learning and artificial intelligence is driving these requirements.

The Current State of ML and AI

Large language models (LLMs) that can chat with near-human precision are dominating the media. These models require massive amounts of data to train. There are many other exciting advances concerning generative AI, for example, text-to-image and sound generation. These also require a lot of data.

It is not just about LLMs. Other model types exist that solve basic lines of business problems. Regression, classification and multilabel are model types that are nongenerative but add real value to an enterprise. More and more organizations are looking to these types of models to solve a variety of problems.

Another phenomenon is that an increasing number of enterprises are becoming SaaS vendors that offer model-training services using a customer’s private data. Consider an LLM that an engineer trained on data from the internet and several thousand books to answer questions, much like ChatGPT. This LLM would be a generalist capable of answering basic questions on various topics.

However, it might not provide a helpful, detailed and complete answer if a user asks a question requiring advanced knowledge of a specific industry like health care, financial services or professional services. It is possible to fine-tune a trained model with additional data.

So an LLM trained as a generalist can be further trained with industry-specific data. The model would then provide better answers to questions about the specified industry. Fine-tuning is especially beneficial when done with LLMs, as their initial training can cost millions, while fine-tuning costs far less.

Regardless of the model you are building, once it is in production, it has to be available, scalable and resilient, just like any other service you deploy as a part of your application. If you are a business offering a SaaS solution, you will have additional security requirements. You will need to prevent direct access to any models that represent your competitive advantage, and you will need to secure your customers’ data.

Let’s look at these requirements in more detail.

Machine Learning Storage Requirements

The storage requirements listed below are from the lens of a technical decision-maker assessing the viability of any piece of technology. Specifically, technologies used in a software solution must be scalable, available, secure, performant, resilient and simple. Let’s see what each requirement means to a machine learning project.

Scalable: Scalability in a storage solution refers to its ability to handle an increasing amount of storage without requiring significant changes. In other words, scalable storage can continue to function optimally as capacity and throughput requirements increase. Consider an organization starting its ML/AI journey with a single project. This project by itself may not have large storage requirements. Soon, however, other teams will launch initiatives of their own. Individually, these teams may have small storage requirements, but collectively they can place a considerable demand on a central storage solution. A scalable storage solution should scale its resources (either out or up) to handle the additional capacity and throughput needed as new teams onboard their data.

Available: Availability is a property that refers to an operational system’s ability to carry out a task. Operations personnel often measure availability for an entire system over time. For example, the system was available for 99.999% of the month. Availability can also refer to the wait time an individual request experiences before a resource can start processing it. Excessive wait times render a system unavailable.

Regardless of the definition, availability is essential for model training and storage. Model training should not experience delays due to lack of availability in a storage solution. A model in production should be available for 99.999% of the month. Requests for data or the model itself, which may be large, should experience low wait times.

Secure: Before all read or write operations, a storage system should know who you are and what you can do. In other words, storage access needs to be authenticated and authorized. Data should also be secure at rest and provide options for encryption. The hypothetical SaaS vendor mentioned in the previous section must pay close attention to security as they provide multitenancy to their customers. The ability to lock data, version data and specify retention policy are also considerations that are part of the security requirement.

Performant: A performant storage solution is optimized for high throughput and low latency. Performance is crucial during model training because higher performance means that experiments are completed faster. The number of experiments an ML engineer can perform is directly proportional to the accuracy of the final model. If a neural network is used, it will take many experiments to determine the optimal architecture. Additionally, hyperparameter tuning requires even further experimentation. Organizations using GPUs must take care to prevent storage from becoming the bottleneck. If a storage solution cannot deliver data at a rate equal to or greater than a GPU’s processing rate, the system will waste precious GPU cycles.

Resilient: A resilient storage solution should not have a single point of failure. A resilient system tries to prevent failure, but when failures occur, it can gracefully recover. Such a solution should be able to participate in failover and stay exercises where the loss of an entire data center is emulated to test the resiliency of a whole application.

Models running in a production environment require resiliency. However, resiliency can also add value to model training. Suppose an ML team uses distributed training techniques that use a cluster. In that case, the storage that serves this cluster, as well as the cluster itself, should be fault tolerant, preventing the team from losing hours or days due to failures.

Simple: Engineers use the words “simple” and “beauty” synonymously. There is a reason for this. When a software design is simple, it is well thought out. Simple designs fit into many different scenarios and solve a lot of problems. A storage system for ML should be simple, especially in the proof of concept (PoC) phase of a new ML project when researchers need to focus on feature engineering, model architectures and hyperparameter tuning while trying to improve the performance of a model so it is accurate enough to add value to the business.

The Storage Landscape

There are several storage options for machine learning and serving. Today, these options fall into the following categories: local file storage, network-attached storage (NAS), storage-area networks (SAN), distributed file systems (DFS) and object storage. In this section, I’ll discuss each and compare them to our requirements. The goal is to find an option that measures up the best across all requirements.

Local file storage: The file system on a researcher’s workstation and the file system on a server dedicated to model serving are examples of local file systems used for ML storage. The underlying device for local storage is typically a solid-state drive (SSD), but it could also be a more advanced nonvolatile memory express drive (NVMe). In both scenarios, compute and storage are on the same system.

This is the simplest option. It is also a common choice during the PoC phase, where a small R&D team attempts to get enough performance out of a model to justify further expenses. While common, there are drawbacks to this approach.

Local file systems have limited storage capacity and are unsuitable for larger datasets. Since there is no replication or autoscaling, a local file system cannot operate in an available, reliable and scalable fashion. They are as secure as the system they are on. Once a model is in production, there are better options than a local file system for model serving.

Network-attached storage (NAS): A NAS is a storage device connected to a TCP/IP network with its own IP address, much like a computer. The underlying storage technology is a RAID array of drives, and files are delivered to clients over TCP. These devices are often sold as appliances: The compute needed to manage the data and the RAID array is packaged into a single device.

NAS devices can be secured, and the RAID configuration of the underlying storage provides some availability and reliability. NAS uses file-sharing protocols like Server Message Block (SMB) and Network File System (NFS), which run over TCP, for data transfer.

NAS devices run into scaling problems when there are a large number of files. This is due to the hierarchy and pathing of their underlying storage structure, which maxes out at millions of files. This is a problem with all file-based solutions. Maximum storage for a NAS is on the order of tens of terabytes.

Storage-area network (SAN): A SAN combines servers and RAID storage on a high-speed interconnect. With a SAN, you can put storage traffic on a dedicated Fibre Channel network using the Fibre Channel Protocol (FCP). A request for a file operation may arrive at a SAN via TCP, but all data transfer occurs via a network dedicated to delivering data efficiently. If a dedicated fiber network is unavailable, a SAN can use Internet Small Computer System Interface (iSCSI), which uses TCP for storage traffic.

A SAN is more complicated to set up than a NAS device since it is a network and not a device. You need a separate dedicated network to get the best performance out of a SAN. Consequently, a SAN is costly and requires considerable effort to administer.

While a SAN may look compelling when compared to a NAS (improved performance and similar levels of security, availability and reliability), it is still a file-based approach with all the problems previously described. The improved performance does not make up for the extra complexity and cost. Total storage maxes out around hundreds of petabytes.

Distributed file system: A distributed file system (DFS) is a file system that spans multiple computers or servers and enables data to be stored and accessed in a distributed manner. Instead of a single centralized system, a distributed file system distributes data across multiple servers or containers, allowing users to access and modify files as if they were on a single, centralized file system.

Some popular examples of distributed file systems include Hadoop Distributed File System (HDFS), Google File System (GFS), Amazon Elastic File System (EFS) and Azure Files.

Files can be secured, like the file-based solutions above, since the operating system is presented with an interface that looks like a traditional file system. Distributed file systems run in a cluster that provides reliability. Running in a cluster may result in better throughput when compared to a SAN; however, they still run into scaling problems when there are a large number of files (like all file-based solutions).

Object storage: Object storage has been around for quite some time but was revolutionized when Amazon made it the first AWS service in 2006 with Simple Storage Service (S3). Modern object storage was native to the cloud, and other clouds soon brought their offerings to market. Microsoft offers Azure Blob Storage, and Google has its Google Cloud Storage service. The S3 API is the de facto standard for developers to interact with storage and the cloud, and there are multiple companies that offer S3-compatible storage for the public cloud, private cloud, edge and co-located environments. Regardless of where an object store is located, it is accessed via a RESTful interface.

The most significant difference with object storage compared to the other storage options is that data is stored in a flat structure. Buckets are used to create logical groupings of objects. Using S3 as an example, a user would first create one or more buckets and then place their objects (files) in one of these buckets. A bucket cannot contain other buckets, and a file must exist in only one bucket. This may seem limiting, but objects have metadata, and using metadata, you can emulate the same level of organization that directories and subdirectories provide within a file system.
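For a feel of how the flat namespace and object metadata work in practice, here is a minimal sketch using the MinIO Python SDK against a hypothetical local endpoint; the bucket, object keys and metadata fields are illustrative.

```python
# Minimal sketch: buckets and objects in a flat namespace, with key prefixes and
# metadata standing in for the hierarchy a file system would provide.
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

if not client.bucket_exists("training-data"):
    client.make_bucket("training-data")

# "Directories" are just key prefixes plus metadata; the namespace stays flat.
client.fput_object(
    "training-data",
    "imagenet/train/shard-0001.tar",   # the prefix emulates a folder path
    "/tmp/shard-0001.tar",
    metadata={"dataset": "imagenet", "split": "train", "version": "v3"},
)

# Listing by prefix gives the same ergonomics as walking a directory.
for obj in client.list_objects("training-data", prefix="imagenet/train/", recursive=True):
    print(obj.object_name, obj.size)
```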

Object storage solutions also perform best when running as a distributed cluster. This provides them with reliability and availability.

Object stores differentiate themselves when it comes to scale. Due to the flat address space of the underlying storage (every object in only one bucket and no buckets within buckets), object stores can find an object among potentially billions of objects quickly. Additionally, object stores offer near-infinite scale to petabytes and beyond. This makes them perfect for storing datasets and managing large models.

Below is a storage scorecard showing solutions against requirements.

The Best Storage Option for AI

Ultimately, the choice of storage options will be informed by a mix of requirements, reality and necessity; however, for production environments, there is a strong case to be made for object storage.

The reasons are as follows:

  1. Performance at scale: Modern object stores are fast and remain fast even in the face of hundreds of petabytes and concurrent requests. You cannot achieve that with other options.
  2. Unstructured data: Many machine learning datasets are unstructured — audio, video and images. Even tabular ML datasets that could be stored in a database are more easily managed in an object store. For example, it is common for an engineer to treat the thousands or millions of rows that make up a training set as a single entity that can be stored and retrieved via a single simple request. The same is true for validation sets and test sets.
  3. RESTful APIs: RESTful APIs have become the de facto standard for communication between services. Consequently, proven messaging patterns exist for authentication, authorization, security in motion and notifications.
  4. Encryption: If your datasets contain personally identifiable information, your data must be encrypted while at rest.
  5. Cloud native (Kubernetes and containers): A solution that can run its services in containers that are managed by Kubernetes is portable across all the major public clouds. Many enterprises have internal Kubernetes clusters that could run a Kubernetes-native object storage deployment.
  6. Immutable: It’s important for experiments to be repeatable, and they’re not repeatable if the underlying data moves or is overwritten. In addition, protecting training sets and models from deletion, accidental or intentional, will be a core capability of an AI storage system when governments around the world start regulating AI.
  7. Erasure coding vs. RAID for data resiliency and availability: Erasure coding uses simple drives to provide the redundancy required for resilient storage. A RAID array (made up of a controller and multiple drives), on the other hand, is another device type that has to be deployed and managed. Erasure coding works on an object level, while RAID works on a block level. If a single object is corrupted, erasure coding can repair that object and return the system to a fully operational state quickly (as in minutes). RAID would need to rebuild the entire volume before any data can be read or written, and rebuilding can take hours or days, depending on the size of the drive.
  8. As many files as needed: Many large datasets used to train models are created from millions of small files. Imagine an organization with thousands of IoT devices, each taking a measurement every second. If each measurement is a file, then over time, the total number of files will be more than a file system can handle.
  9. Portable across environments: A software-defined object store can use a local file, NAS, SAN and containers running with NVMe drives in a Kubernetes cluster as its underlying storage. Consequently, it is portable across different environments and provides access to underlying storage via the S3 API everywhere.

MinIO for ML Training and Inference

MinIO has become a foundational component in the AI/ML stack for its performance, scale, performance at scale and simplicity. MinIO is ideally configured in a cluster of containers that use NVMe drives; however, you have options to use almost any storage configuration as requirements demand.

An advantage of implementing a software-defined, cloud native approach is that the code becomes portable. Training code and model-serving code do not need to change as an ML project matures from a proof of concept to a funded project, and finally to a model in production serving predictions in a cluster.

While portability and flexibility are important, they are meaningless if they are at the expense of performance. MinIO’s performance characteristics are well known, and all are published as benchmarks.

Next Steps

Researchers starting a new project should install MinIO on their workstations. The following guides will get you started.

If you are responsible for the clusters that make up your development, testing and production environments, then consider adding MinIO.

Learn by Doing

Developers and data scientists increasingly control their own storage environments. The days of IT closely guarding storage access are gone. Developers naturally gravitate toward technologies that are software defined, open source, cloud native and simple. That essentially defines object storage as the solution.

If you are setting up a new workstation with your favorite ML tools, consider adding an object store to your toolkit by installing a local solution. Additionally, if your organization has formal environments for experimentation, development, testing and production, then add an object store to your experimental environment. This is a great way to introduce new technology to all developers in your organization. You can also run experiments using a real application running in this environment. If your experiments are successful, promote your application to dev, test and prod.

The post The Architect’s Guide to Storage for AI appeared first on The New Stack.

]]>
8 Real-Time Data Best Practices https://thenewstack.io/8-real-time-data-best-practices/ Thu, 01 Jun 2023 12:00:14 +0000 https://thenewstack.io/?p=22709548

More than 300 million terabytes of data are created every day. The next step in unlocking the value from all

The post 8 Real-Time Data Best Practices appeared first on The New Stack.

]]>

More than 300 million terabytes of data are created every day. The next step in unlocking the value from all that data we’re storing is being able to act on it almost instantaneously.

Real-time data analytics is the pathway to the speed and agility that you need to respond to change. Real-time data also amplifies the challenges of batched data, just continuously, at the terabyte scale.

Then, when you’re making changes or upgrades to real-time environments, “you’re changing the tires on the car, while you’re going down the road,” Ramos Mays, CTO of Semantic-AI, told The New Stack.

How do you know if your organization is ready for that deluge of data and insights? Do you have the power, infrastructure, scalability and standardization to make it happen? Are all of the stakeholders at the planning board? Do you even need real time for all use cases and all datasets?

Before you go all-in on real time, there are a lot of data best practices to evaluate and put in place before committing to that significant cost.

1. Know When to Use Real-Time Data

Just because you can collect real-time data, doesn’t mean you always need it. Your first step should be thinking about your specific needs and what sort of data you’ll require to monitor your business activity and make decisions.

Some use cases, like supply chain logistics, rely on real-time data for real-time reactions, while others simply demand a much slower information velocity and only need analysis on historical data.

Most real-time data best practices come down to understanding your use cases up front because, Mays said, “maintaining a real-time infrastructure and organization has costs that come alongside it. You only need it if you have to react in real time.

“I can have a real-time ingestion of traffic patterns every 15 seconds, but, if the system that’s reading those traffic patterns for me only reads it once a day, as a snapshot of only the latest value, then I don’t need real-time 15-second polling.”

Nor, he added, should he need to support the infrastructure to maintain it.

Most companies, like users of Semantic-AI, an enterprise intelligence platform, need a mix of historical and real-time data; Mays’ company, for instance, is selective about when it does and doesn’t opt for using information collected in real time.

He advises bringing together your stakeholders at the start of your machine learning journey and asking: Do we actually need real-time data, or is near-real-time streaming enough? What’s our plan to react to that data?

Often, you just need to react if there’s a change, so you would batch most of your data, and then go for real time only for critical changes.

“With supply chain, you only need real-time time if you have to respond in real time,” Mays said. “I don’t need real-time weather if I’m just going to do a historic risk score, [but] if I am going to alert there’s a hurricane through the flight path of your next shipment [and] it’s going to be delayed for 48 hours, you’re reacting in real time.”

2. Keep Data as Lightweight as Possible

Next you need to determine which categories of data actually add value by being in real time in order to keep your components lightweight.

“If I’m tracking planes, I don’t want my live data tracking system to have the flight history and when the tires were last changed,” Mays said. “I want as few bits of information as possible in real time. And then I get the rest of the embellishing information by other calls into the system.”

Real-time data must be designed differently than batch, he said. Start thinking about where it will be presented in the end, and then, he recommended, tailor your data and streams to be as close to your display format as possible. This helps determine how the team will respond to changes.

For example, if you’ve got customers and orders, one customer can have multiple orders. “I want to carry just the amount of information in my real-time stream that I need to display to the users,” he said, such as the customer I.D. and order details. Even then, you will likely only show the last few orders in live storage, and then allow customers to search and pull from the archives.

For risk scoring, a ground transportation algorithm needs real-time earthquake information, while the aviation algorithm needs real-time wind speed — it’s rare that they would both need both.

Whenever possible, Mays added, only record deltas — changes in your real-time data. If your algorithm is training on stock prices, but those prices only change every 18 seconds, you don’t need to poll every quarter of a second. There’s no need to push those 72 data points across the network when you could send a single message only when the value changes. This in turn reduces your organizational resource requirements and keeps the focus on what’s actionable.
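A minimal sketch of that delta-only idea: sample values at whatever rate the source provides, but only forward a message downstream when the value actually changes. The price feed and publish function are hypothetical stand-ins for a real producer.

```python
# Minimal sketch: emit a message only when the observed value changes (a delta),
# rather than republishing an unchanged value every polling interval.
from typing import Iterable

def publish(symbol: str, price: float) -> None:
    # Stand-in for a real producer (Kafka, Kinesis, a webhook, etc.).
    print(f"publish {symbol} -> {price}")

def emit_deltas(symbol: str, prices: Iterable[float]) -> int:
    """Forward only changed values; return how many messages were sent."""
    last_seen = None
    sent = 0
    for price in prices:
        if price != last_seen:          # only the delta matters downstream
            publish(symbol, price)
            sent += 1
            last_seen = price
    return sent

# 72 quarter-second samples of a price that changed once: 2 messages, not 72.
samples = [101.5] * 40 + [101.7] * 32
print(emit_deltas("ACME", samples), "messages sent for", len(samples), "samples")
```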

3. Unclog Your Pipes

Your data can be stored in the RAM of your computer, on disk or in the network pipe. Reading and writing everything to the disk is the slowest. So, Mays recommended, if you’re dealing in real-time systems, stay in memory as much as you can.

“You should design your systems, if at all possible, to only need the amount of data to do its thing so that it can fit in memory,” he said, so your real-time processing isn’t held up writing to and reading from the disk.

“Computer information systems are like plumbing,” Mays said. “Still very mechanical.”

Think of the amount of data as water. The size of the pipes determines how much water you can send through. One stream of water may need to split into five places. Your pipes are your network cables or, when inside the machine, the I/O bus that moves the data from RAM to the hard disk. The networks are the water company mainlines, while the bus inside acts like the connection between the mainlines and the different rooms.

Most of the time, this plumbing just sits there, waiting to be used. You don’t really think about it until you are filling up your bathtub (RAM.) If you have a hot water heater (or a hard drive), it’ll heat up right away; if it’s coming from your water main (disk or networking cable), it takes time to heat up. Either way, when you have finished using the  water (data) in your bathtub (RAM) it drains and is gone when you’re done with it.

You must have telemetry and monitoring, Mays said, extending the metaphor, because “we also have to do plumbing while the water is flowing a lot of times. And if you have real-time systems and real-time consumers, you have to be able to divert those streams or store them and let it back up or to divert it around a different way,” while fixing it, in order to meet your service-level agreement.

4. Look for Outliers

As senior adviser for the Office of Management, Strategy and Solutions at the U.S. Department of State, Landon Van Dyke oversees the Internet of Things network for the whole department — including, but not limited to, all sensor data, smart metering, air monitoring, and vehicle telematics across offices, embassies and consulates, and residences. Across all resources and monitors, his team exclusively deals in high-frequency, real-time data, maintaining two copies of everything.

He takes a contrary perspective to Mays and shared it with The New Stack. With all data in real time, Van Dyke’s team is able to spot crucial outliers more often and faster.

“You can probably save a little bit of money if you look at your utility bill at the end of the month,” Van Dyke said, explaining why his team took on its all-real-time strategy to uncover better patterns at a higher frequency. “But it does not give you the fidelity of what was operating at three in the afternoon on a Wednesday, the third week of the month.”

The specificity of energy consumption patterns are necessary to really make a marked difference, he argued. Van Dyke’s team uses that fidelity to identify when things aren’t working or something can be changed or optimized, like when a diplomat is supposed to be away but the energy usage shows that someone has entered their residence without authorization.

“Real-time data provides you an opportunity for an additional security wrapper around facilities, properties and people, because you understand a little bit more about what is normal operations and what is not normal,” he said. “Not normal is usually what gets people’s attention.”

5. Find Your Baseline

“When people see real-time data, they get excited. They’re like, ‘Hey, I could do so much if I understood this was happening!’ So you end up with a lot of use cases upfront,” Van Dyke observed. “But, most of the time, people aren’t thinking on the backend. Well, what are you doing to ensure that use case is fulfilled?”

Without proper planning upfront, he said, teams are prone to just slap on sensors that produce data every few seconds, connecting them to the internet and to a server somewhere, which starts ingesting the data.

“It can get overwhelming to your system real fast,” Van Dyke said. “If you don’t have somebody manning this data 24/7, the value of having it there is diminished.”

It’s not a great use of anyone’s time to pay people to stare at a screen 24 hours a day, so you need to set up alerts. But, in order to do that, you need to identify what an outlier is.

That’s why, he said, you need to first understand your data and set up your baseline, which could take up to six months, or even longer, when data points begin to have more impact on each other, like within building automation systems. You have to manage people’s expectations of the value of machine learning early on.

Once you’ve identified your baseline, you can set the outliers and alerts, and go from there.
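Here is a small, hypothetical sketch of that baseline-then-alert pattern: build a rolling baseline from historical readings, then flag new readings that fall several standard deviations outside it. The window size and threshold are illustrative choices, not anything prescribed by the State Department.

```python
# Minimal sketch: establish a baseline from historical readings, then alert on
# outliers. The window size and 3-sigma threshold are illustrative choices.
from collections import deque
from statistics import mean, stdev

class BaselineDetector:
    def __init__(self, window: int = 1000, sigmas: float = 3.0):
        self.history = deque(maxlen=window)
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Return True if the value is an outlier against the current baseline."""
        is_outlier = False
        if len(self.history) >= 30:  # need enough history for a stable baseline
            mu, sd = mean(self.history), stdev(self.history)
            if sd > 0 and abs(value - mu) > self.sigmas * sd:
                is_outlier = True
        self.history.append(value)
        return is_outlier

detector = BaselineDetector()
for reading in [12.0, 12.1, 11.9] * 20 + [19.5]:  # the last reading spikes
    if detector.observe(reading):
        print(f"ALERT: reading {reading} is outside the baseline")
```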

6. Move to a Real-Time Ready Database

Still, at the beginning of this machine learning journey, Van Dyke says, most machines aren’t set up to handle that massive quantity of data. Real-time data easily overwhelms memory.

“Once you get your backend analysis going, it ends up going through a series of models,” Van Dyke said. “Most of the time, you’ll bring the data in. It needs to get cleaned up. It needs to go through a transformation. It needs to be run through an algorithm for cluster analysis or regression models. And it’s gotta do it on the fly, in real time.”

As you move from batch processing to real-time data, he continued, you quickly realize your existing system is not able to accomplish the same activities at a two- to five-second cadence. This inevitably leads to more delays, as your team has to migrate to a faster backend system that’s set up to work in real time.

This is why the department moved over to Kinetica’s real-time analytics database, which, Van Dyke said, has the speed built in to handle running these backend analyses on a series of models, ingesting and cleaning up data, and providing analytics. “Whereas a lot of the other systems out there, they’re just not built for that,” he added. “And they can be easily overwhelmed with real-time data.”

7. Use Standardization to Work with Non-Tech Colleagues

What’s needed now won’t necessarily be in demand in the next five years, Van Dyke predicted.

“For right now, where the industry is, if you really want to do some hardcore analytics, you’re still going to want people that know the coding, and they’re still going to want to have a platform where they can do coding on,” he said. “And Kinetica can do that.”

He sees a lot of graphical user interfaces cropping up and predicts ownership and understanding of analytics will soon shift to becoming a more cross-functional collaboration. For instance, the subject matter expert (SME) for building analytics may now be the facilities manager, not someone trained in how to code. For now, these knowledge gaps are closed by a lot of handholding between data scientists and SMEs.

Standardization is essential among all stakeholders. Since everything real time is done at a greater scale, you need to know what your format, indexing and keys are well in advance of going down that rabbit hole.

This standardization is no simple feat in an organization as distributed as the U.S. State Department. However, its solution can be mimicked in most organizations — finance teams are most likely to already have a cross-organizational footprint in place. State controls the master dataset, indexes and metadata for naming conventions and domestic agencies, standardizing them across the government based on the Treasury’s codes.

Van Dyke’s team ensured via logical standardization that “no other federal agency should be able to have its own unique code on U.S. embassies and consulates.”

8. Back-up in Real Time

As previously mentioned, the State Department also splits its data into two streams — one for model building and one for archival back-up. This still isn’t common practice in most real-time data-driven organizations, Van Dyke said, but it follows the control versus variable rule of the scientific method.

“You can always go back to that raw data and run the same algorithms that your real-time one is doing — for provenance,” he said. “I can recreate any outcome that my modeling has done, because I have the archived data on the side.” The State Department also uses the archived data for forensics, like finding patterns of motion around the building and then flagging deviations.

Yes, this potentially doubles the cost, but data storage is relatively inexpensive these days, he said.

The department also standardizes ways to reduce metadata repetition. For example, a team might want to capture the speed of a fan in a building, but the metadata for each reading would include the fan’s make and model and the firmware for the fan controller. Van Dyke’s team dramatically reduces this repetitive data in a table column by leveraging JSON to create nested arrays, which lets the team shrink the amount of data by associating one firmware note with all of the speed logs.
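A small sketch of that nesting idea, with hypothetical field names: instead of repeating the fan’s make, model and firmware on every speed reading, store them once and nest the readings in an array.

```python
# Minimal sketch: one metadata note with a nested array of readings, instead of
# repeating make/model/firmware on every row. Field names are hypothetical.
import json

# Repetitive form: every reading carries the same metadata.
flat_rows = [
    {"fan": "AHU-7", "make": "Acme", "model": "FX-200", "firmware": "2.4.1",
     "ts": f"2023-06-01T00:00:{i:02d}Z", "rpm": 1200 + i}
    for i in range(60)
]

# Nested form: the metadata appears once; the readings are a nested array.
nested_doc = {
    "fan": "AHU-7",
    "make": "Acme",
    "model": "FX-200",
    "firmware": "2.4.1",
    "readings": [{"ts": r["ts"], "rpm": r["rpm"]} for r in flat_rows],
}

flat_size = len(json.dumps(flat_rows))
nested_size = len(json.dumps(nested_doc))
print(f"flat: {flat_size} bytes, nested: {nested_size} bytes "
      f"({100 * nested_size / flat_size:.0f}% of the flat form)")
```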

It’s not just for real time, but in general, Van Dyke said: “You have to know your naming conventions, know your data, know who your stakeholders are, across silos. Make sure you have all the right people in the room from the beginning.”

Data is a socio-technical game, he noted. “The people that have produced the data are always protective of it. Mostly because they don’t want the data to be misinterpreted. They don’t want it to be misused. And sometimes they don’t want people to realize how many holes are in the data or how incomplete the data [is]. Either way, people have become very protective of their data. And you need to have them at the table at the very beginning.”

In the end, real-data best practices rely on collaboration across stakeholders and a whole lot of planning upfront.

The post 8 Real-Time Data Best Practices appeared first on The New Stack.

]]>
Raft Native: The Foundation for Streaming Data’s Best Future https://thenewstack.io/raft-native-the-foundation-for-streaming-datas-best-future/ Tue, 30 May 2023 16:44:09 +0000 https://thenewstack.io/?p=22709492

Consensus is fundamental to consistent, distributed systems. To guarantee system availability in the event of inevitable crashes, systems need a

The post Raft Native: The Foundation for Streaming Data’s Best Future appeared first on The New Stack.

]]>

Consensus is fundamental to consistent, distributed systems. To guarantee system availability in the event of inevitable crashes, systems need a way to ensure each node in the cluster is in alignment, such that work can seamlessly transition between nodes in the case of failures.

Consensus protocols such as Paxos, Raft, View Stamped Replication (VSR), etc. help to drive resiliency for distributed systems by providing the logic for processes like leader election, atomic configuration changes, synchronization and more.

As with all design elements, the different approaches to distributed consensus offer different tradeoffs. Paxos is the oldest consensus protocol around and is used in many systems like Google Spanner, Apache Cassandra, Amazon DynamoDB and Neo4j.

Paxos achieves consensus in a three-phased, leaderless, majority-wins protocol. While Paxos is effective in driving correctness, it is notoriously difficult to understand, implement and reason about. This is partly because it obscures many of the challenges in reaching consensus (such as leader election and reconfiguration), making it difficult to decompose into subproblems.

Raft (for reliable, replicated, redundant and fault-tolerant) can be thought of as an evolution of Paxos — focused on understandability. This is because Raft can achieve the same correctness as Paxos but is more understandable and simpler to implement in the real world, so it can often provide greater reliability guarantees.

For example, Raft uses a stable form of leadership, which simplifies replication log management. And its leader election process, driven through an elegant “heartbeat” system, is more compatible with the Kafka-producer model of pushing data to the partition leader, making it a natural fit for streaming data systems like Redpanda. More on this later.
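To illustrate the heartbeat-driven leadership Raft uses, here is a deliberately simplified sketch of a follower's timer logic. It is not Redpanda's implementation (Redpanda is written in C++); it only shows the core idea that a follower that stops hearing heartbeats promotes itself to candidate after a randomized timeout:

```python
import random
import time

ELECTION_TIMEOUT_RANGE = (0.15, 0.30)  # seconds; randomized to reduce split votes

class Follower:
    def __init__(self):
        self.term = 0
        self.deadline = self._new_deadline()

    def _new_deadline(self):
        return time.monotonic() + random.uniform(*ELECTION_TIMEOUT_RANGE)

    def on_heartbeat(self, leader_term):
        # A heartbeat (an empty AppendEntries) from a current or newer leader
        # resets the election timer and keeps this node a passive follower.
        if leader_term >= self.term:
            self.term = leader_term
            self.deadline = self._new_deadline()

    def tick(self):
        # No heartbeat before the timeout expired: become a candidate,
        # bump the term and (in a full implementation) request votes.
        if time.monotonic() >= self.deadline:
            self.term += 1
            return "candidate"
        return "follower"
```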

Because Raft decomposes the consensus problem into distinct logical components, for example by making leader election a separate step before replication, it is a flexible protocol to adapt for complex, modern distributed systems that must maintain correctness and performance while scaling to petabytes of throughput, while remaining simple for new engineers hacking on the codebase to understand.

For these reasons, Raft has been rapidly adopted for today’s distributed and cloud native systems like MongoDB, CockroachDB, TiDB and Redpanda to achieve greater performance and transactional efficiency.

How Redpanda Implements Raft Natively to Accelerate Streaming Data

When Redpanda founder Alex Gallego determined that the world needed a new streaming data platform — to support the kind of gigabytes-per-second workloads that bring Apache Kafka to a crawl without major hardware investments — he decided to rewrite Kafka from the ground up.

The requirements for what would become Redpanda were: 1) it needed to be simple and lightweight to reduce the complexity and inefficiency of running Kafka clusters reliably at scale; 2) it needed to maximize the performance of modern hardware to provide low latency for large workloads; and 3) it needed to guarantee data safety even for very large throughputs.

The initial design for Redpanda used chain replication: Data is produced to node A, then replicated from A to B, B to C and so on. This was helpful in supporting throughput, but fell short for latency and performance, due to the inefficiencies of chain reconfiguration in the event of node downtime (say B crashes: Do you fail the write? Does A try to write to C?). It was also unnecessarily complex, as it would require an additional process to supervise the nodes and push reconfigurations to a quorum system.

Ultimately, Alex decided on Raft as the foundation for Redpanda consensus and replication, due to its understandability and strong leadership. Raft satisfied all of Redpanda’s high-level design requirements:

  • Simplicity. Every Redpanda partition is a Raft group, so everything in the platform is reasoning around Raft, including both metadata management and partition replication. This contrasts with the complexity of Kafka, where data replication is handled by ISR (in-sync replicas) and metadata management is handled by ZooKeeper (or KRaft), and you have two systems that must reason with one another.
  • Performance. The Redpanda Raft implementation can tolerate disturbances to a minority of replicas, so long as the leader and a majority of replicas are stable. In cases when a minority of replicas have a delayed response, the leader does not have to wait for their responses to progress, mitigating impact on latency. Redpanda is therefore more fault-tolerant and can deliver predictable performance at scale.
  • Reliability. When Redpanda ingests events, they are written to a topic partition and appended to a log file on disk. Every topic partition then forms a Raft consensus group, consisting of a leader plus a number of followers, as specified by the topic’s replication factor. A Redpanda Raft group can tolerate ƒ failures given 2ƒ+1 nodes; for example, in a cluster with five nodes and a topic with a replication factor of five, two nodes can fail and the topic will remain operational. Redpanda leverages the Raft joint consensus protocol to provide consistency even during reconfiguration.
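The reliability point above comes down to simple quorum arithmetic, sketched here as a small helper. This illustrates the general Raft rule rather than any Redpanda-specific code:

```python
def tolerable_failures(replication_factor: int) -> int:
    # A Raft group of 2f + 1 replicas tolerates f failures, because a
    # majority (f + 1) must survive to elect a leader and commit writes.
    return (replication_factor - 1) // 2

assert tolerable_failures(3) == 1
assert tolerable_failures(5) == 2  # the five-replica example above
```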

Redpanda also extends core Raft functionality in some critical ways to achieve the scalability, reliability and speed required of a modern, cloud native solution. Redpanda enhancements to Raft tend to focus on Day 2 operations, for instance how to ensure the system runs reliably at scale. These innovations include changes to the election process, heartbeat generation and, critically, support for Apache Kafka acks.

Redpanda’s optimistic implementation of Raft is what enables it to be significantly faster than Kafka while still guaranteeing data safety. In fact, Jepsen testing has verified that Redpanda is a safe system without known consistency problems and a solid Raft-based consensus layer.

But What about KRaft?

While Redpanda takes a Raft-native approach, the legacy streaming data platforms have been laggards in adopting modern approaches to consensus. Kafka itself is a replicated distributed log, but it has historically relied on yet another replicated distributed log — Apache ZooKeeper — for metadata management and controller election.

This has been problematic for a few reasons: 1) Managing multiple systems introduces administrative burden; 2) Scalability is limited due to inefficient metadata handling and double caching; 3) Clusters can become very bloated and resource intensive — in fact, it is not uncommon to see clusters with equal numbers of ZooKeeper and Kafka nodes.

These limitations have not gone unacknowledged by Apache Kafka’s committers and maintainers, who are in the process of replacing ZooKeeper with a self-managed metadata quorum: Kafka Raft (KRaft).

This event-based flavor of Raft achieves metadata consensus via an event log, called a metadata topic, that improves recovery time and stability. KRaft is a positive development for the upstream Apache Kafka project because it helps alleviate pains around partition scalability and generally reduces the administrative challenges of Kafka metadata management.

Unfortunately, KRaft does not solve the problem of having two different systems for consensus in a Kafka cluster. In the new KRaft paradigm, KRaft partitions handle metadata and cluster management, but replication is handled by the brokers using ISR, so you still have these two distinct platforms and the inefficiencies that arise from that inherent complexity.

The engineers behind KRaft are upfront about these limitations, although some exaggerated vendor pronouncements have created ambiguity around the issue, suggesting that KRaft is far more transformative.

Combining Raft with Performance Engineering: A New Standard for Streaming Data

As data industry leaders like CockroachDB, MongoDB, Neo4j and TiDB have demonstrated, Raft-based systems deliver simpler, faster and more reliable distributed data environments. Raft is becoming the standard consensus protocol for today’s distributed data systems because it marries particularly well with performance engineering to further boost the throughput of data processing.

For example, Redpanda combines Raft with speedy architectural ingredients to perform at least 10 times faster than Kafka at tail latencies (p99.99) when processing a 1GBps workload, on one-third the hardware, without compromising data safety.

Traditionally, GBps+ workloads have been a burden for Apache Kafka, but Redpanda can support them with double-digit millisecond latencies, while retaining Jepsen-verified reliability. How is this achieved? Redpanda is written in C++, and uses a thread-per-core architecture to squeeze the most performance out of modern chips and network cards. These elements work together to elevate the value of Raft for a distributed streaming data platform.

Redpanda vs. Kafka with KRaft performance benchmark – May 11, 2023

An example of this in terms of Redpanda internals: Because Redpanda bypasses the page cache and the Java virtual machine (JVM) dependency of Kafka, it can embed hardware-level knowledge into its Raft implementation.

Typically, every time you write in Raft you have to flush to guarantee the durability of writes on disk. In Redpanda’s approach to Raft, smaller intermittent flushes are dropped in favor of a larger flush at the end of a call. While this introduces some additional latency per call, it reduces overall system latency and increases overall throughput, because it is reducing the total number of flush operations.
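The idea of coalescing flushes can be sketched in a few lines. This is only a simplified illustration of the general pattern, not Redpanda's C++ code: the expensive flush is issued once per batch instead of once per record.

```python
import os

def append_batch(fd: int, records: list[bytes]) -> None:
    """Append a batch of records, flushing once at the end.

    A naive implementation would fsync after every record; coalescing the
    flush amortizes the most expensive operation across the whole batch,
    trading a little per-call latency for higher overall throughput.
    """
    for record in records:
        os.write(fd, record)
    os.fsync(fd)  # a single flush makes the entire batch durable

# Usage sketch:
# fd = os.open("segment.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT)
# append_batch(fd, [b"event-1\n", b"event-2\n", b"event-3\n"])
```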

While there are many effective ways to ensure consistency and safety in distributed systems (blockchains do it very well with Proof of Work and Proof of Stake protocols), Raft is a proven approach and flexible enough that it can be enhanced to adapt to new challenges.

As we enter a new world of data-driven possibilities, driven in part by AI and machine learning use cases, the future is in the hands of developers who can harness real-time data streams. Raft-based systems, combined with performance-engineered elements like C++ and thread-per-core architecture, are driving the future of data streaming for mission-critical applications.

The post Raft Native: The Foundation for Streaming Data’s Best Future appeared first on The New Stack.

]]>
Why the Document Model Is More Cost-Efficient Than RDBMS https://thenewstack.io/why-the-document-model-is-more-cost-efficient-than-rdbms/ Thu, 25 May 2023 16:24:22 +0000 https://thenewstack.io/?p=22709066

A relational database management system (RDBMS) is great at answering random questions. In fact, that is why it was invented.

The post Why the Document Model Is More Cost-Efficient Than RDBMS appeared first on The New Stack.

]]>

A relational database management system (RDBMS) is great at answering random questions. In fact, that is why it was invented. A normalized data model represents the lowest common denominator for data. It is agnostic to all access patterns and optimized for none.

The mission of the IBM System R team, creators of arguably the first RDBMS, was to enable users to query their data without having to write complex code requiring detailed knowledge of how their data is physically stored. Edgar Codd, inventor of the relational model, made this claim in the opening line of his famous paper, “A Relational Model of Data for Large Shared Data Banks”:

Future users of large data banks must be protected from having to know how the data is organized in the machine.

The need to support online analytical processing (OLAP) workloads drove this reasoning. Users sometimes need to ask new questions or run complex reports on their data. Before the RDBMS existed, this required software engineering skills and a significant time investment to write the code required to query data stored in a legacy hierarchical management system (HMS). RDBMS increased the velocity of information availability, promising accelerated growth and reduced time to market for new solutions.

The cost of this data flexibility, however, was significant. Critics of the RDBMS quickly pointed out that the time complexity, or the time required to query a normalized data model, was very high compared to HMS. As such, it was probably unsuitable for the high-velocity online transaction processing (OLTP) workloads that consume 90% of IT infrastructure. Codd himself recognized the tradeoffs; the time cost of normalization is acknowledged in the same paper:

“If the strong redundancies in the named set are directly reflected in strong redundancies in the stored set (or if other strong redundancies are introduced into the stored set), then, generally speaking, extra storage space and update time are consumed with a potential drop in query time for some queries and in load on the central processing units.”

This would probably have killed the RDBMS before the concept went beyond prototype if not for Moore’s law. As processor efficiency increased, the perceived cost of the RDBMS decreased. Running OLTP workloads on top of normalized data eventually became feasible from a total cost of ownership (TCO) perspective, and from 1980 to 1985, RDBMS platforms were crowned as the preferred solution for most new enterprise workloads.

As it turns out, Moore’s law is actually a financial equation rather than a physical law. As long as the market will bear the cost of doubling transistor density every two years, it remains valid.

Unfortunately for RDBMS technology, that ceased to be the case around 2013, when the cost of moving to 5-nanometer fabrication for server CPUs proved to be a near-insurmountable barrier to demand. The mobile market adopted 5nm technology as a loss leader, recouping the cost through years of subscription services associated with the mobile device.

However, there was no subscription revenue driver in the server processing space. As a result, manufacturers have been unable to ramp up 5nm CPU production and per-core server CPU performance has been flattening for almost a decade.

Last February, AMD announced that it is decreasing its 5nm wafer inventory indefinitely in response to weak demand for server CPUs due to high cost. The reality is that server CPU efficiency might not see another order-of-magnitude improvement without a generational technology shift, which could take years to bring to market.

All this is happening while storage cost continues to plummet. Normalized data models used by RDBMS solutions rely on cheap CPU cycles to enable efficient solutions. NoSQL solutions rely on efficient data models to minimize the amount of CPU required to execute common queries. Oftentimes this is accomplished by denormalizing the data, essentially trading CPU for storage. NoSQL solutions become more and more attractive as CPU efficiency flattens while storage costs continue to fall.

The gap between RDBMS and NoSQL has been widening for almost a decade. Fortune 10 companies like Amazon have run the numbers and gone all-in with a NoSQL-first development strategy for all mission-critical services.

A common objection from customers before they try a NoSQL database like MongoDB Atlas is that their developers already know how to use RDBMS, so it is easy for them to “stay the course.” Believe me when I say that nothing is easier than storing your data the way your application actually uses it.

A proper document data model mirrors the objects that the application uses. It stores data using the same data structures already defined in the application code, in containers that mimic the way the data is actually processed. There is no abstraction layer between the application and the physical storage, and no added time complexity to the query. The result is less CPU time spent processing the queries that matter.
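As a concrete (and hypothetical) example of storing data the way the application uses it, consider an order document with its customer details and line items embedded. The collection and field names below are invented and the connection string is a placeholder; the sketch uses the standard PyMongo driver:

```python
from pymongo import MongoClient  # assumes a reachable MongoDB or Atlas cluster

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
orders = client.shop.orders

# The document mirrors the application's order object: customer details and
# line items are embedded, so a single read returns everything the app needs,
# with no joins across normalized tables.
orders.insert_one({
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [
        {"sku": "A-100", "qty": 2, "price": 19.99},
        {"sku": "B-205", "qty": 1, "price": 49.00},
    ],
    "status": "shipped",
})

print(orders.find_one({"_id": "order-1001"}))
```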

One might say this sounds a bit like hard-coding data structures into storage like the HMS systems of yesteryear. So what about those OLAP queries that RDBMS was designed to support?

MongoDB has always invested in APIs that allow users to run the ad hoc queries required by common enterprise workloads. The recent addition of an SQL-92 compatible API means that Atlas users can now run the business reports they need using the same tooling they have always used when connecting to MongoDB Atlas, just like any other RDBMS platform via ODBC (Open Database Connectivity).
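As a rough sketch of what that looks like in practice, the snippet below runs an ad hoc report over ODBC with pyodbc. It assumes an ODBC driver for the Atlas SQL interface is installed and a DSN has been configured; the DSN name, credentials, table and columns are all placeholders:

```python
import pyodbc  # assumes an ODBC driver and DSN configured for the SQL interface

# "AtlasSQL" is a placeholder DSN; credentials and driver setup are environment-specific.
conn = pyodbc.connect("DSN=AtlasSQL;UID=report_user;PWD=secret")
cursor = conn.cursor()

# A typical ad hoc reporting query issued through the SQL-92 compatible API.
cursor.execute("""
    SELECT status, COUNT(*) AS orders
    FROM orders
    GROUP BY status
""")
for row in cursor.fetchall():
    print(row.status, row.orders)
```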

Complex SQL queries are expensive. Running them at high velocity means hooking up a firehose to the capex budget. NoSQL databases avoid this problem by optimizing the data model for the high velocity queries. These are the ones that matter. The impact of this design choice is felt when running OLAP queries that will always be less efficient when executed on denormalized data.

The good news is that nobody really cares if the daily report that used to take five seconds to run now takes 10; it only runs once a day. Similarly, the data analyst or support engineer running an ad hoc query to answer a question will never notice whether the result comes back in 10 milliseconds or 100. The fact is that OLAP query performance almost never matters; we just need to be able to get answers.

MongoDB leverages the document data model and the Atlas Developer Data Platform to provide high OLTP performance while also supporting the vast majority of OLAP workloads.

The post Why the Document Model Is More Cost-Efficient Than RDBMS appeared first on The New Stack.

]]>
Amazon Aurora vs. Redshift: What You Need to Know https://thenewstack.io/amazon-aurora-vs-redshift-what-you-need-to-know/ Thu, 25 May 2023 13:27:00 +0000 https://thenewstack.io/?p=22709014

Companies have an ever-expanding amount of data to sort through, analyze and manage. Fortunately, Amazon Web Services (AWS) provides powerful tools and services to

The post Amazon Aurora vs. Redshift: What You Need to Know appeared first on The New Stack.

]]>

Companies have an ever-expanding amount of data to sort through, analyze and manage. Fortunately, Amazon Web Services (AWS) provides powerful tools and services to help you manage data at scale, including Amazon Aurora and Amazon Redshift. But which service is best for you? Choosing a winner in the Aurora vs. Redshift debate requires careful consideration of each service’s strengths and limitations — and your business needs.

Learn more about each service’s benefits and what makes them different, as well as how to choose the service that’s best for your use cases.

Overview of Amazon Aurora

Amazon Aurora is a proprietary relational database management system developed by AWS. It’s a fully managed MySQL and PostgreSQL-compatible database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.

Aurora provides businesses with a secure and reliable database engine that meets the needs of modern applications. It’s highly available and offers up to five times the throughput of standard MySQL and up to three times the throughput of standard PostgreSQL.

Aurora also offers a low-cost pricing structure, enterprise-grade security features and the ability to scale with ease. With this service, businesses can build and maintain sophisticated applications that require high performance, scalability and availability.

Overview of Amazon Redshift

Amazon Redshift is a petabyte-scale data warehouse service that can store and analyze vast amounts of data quickly and easily. Businesses use it to store and analyze large datasets in the cloud, enabling them to make better use of their data and gain deeper insights.

Redshift’s features make it well-suited for large-scale data-warehousing needs. It enables companies to analyze customer behavior, track sales performance and process log data, all of which are essential components of a data pipeline.

Redshift supports structured, semi-structured and unstructured data. It also provides advanced data compression and encoding capabilities to help businesses optimize storage and query performance. The service integrates with many data visualization and business intelligence tools.

Amazon Aurora vs. Redshift: Key Differences

The choice between Redshift and Aurora requires understanding how these powerful services differ, especially if you’re moving software-as-a-service platforms to AWS. Here are some distinctions between their key features and potential use cases.

OLTP vs. OLAP

One of the main differences between Aurora and Redshift is the type of workloads they’re designed for. Aurora is optimized for online transaction processing (OLTP) workloads, which involve processing small, fast transactions in real time. OLTP systems are typically used for operational tasks, such as inserting, updating and deleting data in a database. They’re designed to support high-volume, low-latency transactions. Data is stored in a normalized format to avoid redundancy.

Redshift is optimized for online analytical processing (OLAP) workloads, which involve processing complex, large-scale queries that require aggregation and analysis from multiple data sources. OLAP systems are typically used for data analysis and reporting tasks, such as generating sales reports or analyzing customer behavior. They’re designed to support complex queries that involve large amounts of data. Data is stored in a denormalized format to improve query performance.
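The contrast is easiest to see in the shape of the queries themselves. The table and column names below are hypothetical; the point is that one workload touches a single row with low latency while the other aggregates across many rows:

```python
# Hypothetical table and column names, for illustration only.

# OLTP (Aurora-style): a small, targeted transaction on a single row.
oltp_query = """
    UPDATE orders
    SET status = 'shipped'
    WHERE order_id = 1001
"""

# OLAP (Redshift-style): a large scan with aggregation across many rows.
olap_query = """
    SELECT region,
           DATE_TRUNC('month', order_date) AS month,
           SUM(total) AS revenue
    FROM orders
    GROUP BY region, month
    ORDER BY month, revenue DESC
"""
```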

Data Models and Storage

Amazon Aurora is a relational database engine that stores data in tables with rows and columns. This makes it ideal for storing transactional data, such as customer orders, inventory and financial records. For example, an e-commerce business that needs to house and analyze customer order data could be a great fit for Aurora.

Amazon Redshift is a columnar data store that’s optimized for analytical queries involving large amounts of data. This includes business intelligence reporting, data warehousing and data exploration. For example, a media company that needs to analyze advertisement impressions and engagement data to optimize an ad campaign would be a good fit for Redshift.

Aside from the differences in data models, Aurora and Redshift also differ in their approach to data storage. Aurora uses a distributed storage model where data is stored across multiple nodes in a cluster. Meanwhile, Redshift uses a massively parallel processing (MPP) architecture, where data is divided into multiple slices and distributed across different nodes. This allows for faster data retrieval and processing, as each slice can be processed in parallel.

Performance

Aurora is optimized for transactional workloads, and it’s capable of delivering high performance for small, fast transactions that require low latency. This means it’s great for situations where fast response times are critical, such as online transactions or real-time data processing.

Redshift is optimized for analytical workloads and excels at processing complex queries that involve aggregating and analyzing large amounts of data from multiple sources. The columnar storage format used by Redshift enables efficient compression and fast query execution, while the MPP architecture enables high performance even with large datasets.

Scale

Both services are designed to scale horizontally, meaning you can add more nodes to increase processing power and storage capacity. Aurora also enables vertical scaling through upgrading instance types, while Redshift offers a concurrency scaling feature that essentially handles an unlimited number of concurrent users.

Amazon Aurora can scale up to 128 tebibytes of storage, depending on the engine, and up to 15 read replicas to handle high read traffic. Redshift can scale up to petabyte-scale data warehouses. Redshift also offers automatic scaling and workload management features, allowing you to easily add or remove nodes to handle changing workloads. It also provides local storage on each node, which reduces network traffic and improves performance.
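For a sense of how that scaling is driven programmatically, the sketch below uses the AWS SDK for Python (boto3) to add an Aurora read replica and resize a Redshift cluster. The identifiers and instance class are placeholders, and it assumes AWS credentials are already configured:

```python
import boto3  # assumes AWS credentials and region are configured

rds = boto3.client("rds")
redshift = boto3.client("redshift")

# Aurora: scale out reads by adding a replica to an existing cluster.
rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-replica-2",   # placeholder
    DBClusterIdentifier="my-aurora-cluster",      # placeholder
    DBInstanceClass="db.r6g.large",
    Engine="aurora-mysql",
)

# Redshift: scale the warehouse by resizing the cluster to more nodes.
redshift.resize_cluster(
    ClusterIdentifier="my-redshift-cluster",      # placeholder
    NumberOfNodes=6,
)
```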

Pricing

Aurora’s pricing is based on the amount of data stored, the compute and memory used, and the number of I/O requests made. The pricing model for Redshift is more complex, as it includes the type of instance that you choose, the number of compute nodes, the amount of storage and the duration of usage.

Differences between Aurora and Redshift:

  • OLTP vs. OLAP
  • Data models and storage
  • Performance
  • Scale
  • Pricing

Amazon Aurora vs. Redshift: Shared Benefits

Amazon Aurora and Redshift are each designed to help businesses manage their data in a secure and compliant manner. Here are some of the benefits they have in common.

High Performance

Amazon Aurora databases have a distributed architecture that delivers high performance and availability for transactional workloads. Unlike traditional databases, which can experience performance issues when scaling, Aurora’s architecture is specifically designed to scale out and maintain performance.

Amazon Redshift’s MPP architecture allows it to distribute processing tasks across multiple nodes to handle petabyte-scale data warehouses with ease. It can process large queries quickly, making it a great choice for analytical workloads. Additionally, Redshift provides advanced performance-tuning options, such as automatic query optimization and workload management.

Scalability

Amazon Aurora provides automatic scaling capabilities for compute and storage resources. Using this service means you can easily increase or decrease the number of database instances to meet changes in demand. Aurora also supports read replicas, allowing you to offload read queries from the primary instance to one or more replicas, which improves performance and enables horizontal scale reads.

Amazon Redshift provides elastic scaling, which allows you to easily add or remove compute nodes as your data warehouse grows, all while handling more concurrent queries and improving query performance. Redshift provides automatic distribution and load balancing of data across nodes, which improves performance and scalability.

Cost-Effective

With Amazon Aurora and Redshift, you only pay for the resources you use, allowing you to scale up or down without additional financial concerns.

Amazon Aurora offers a low-cost pricing structure compared to traditional commercial databases, with no upfront costs or long-term commitments. Redshift offers reserved instance pricing options, which saves money if you can commit to using the service for a longer period.

Security

As part of the AWS portfolio, Redshift and Aurora share valuable security features such as multifactor authentication, automated backups and encryption for data at rest and in transit.

You can control who can access your resources using AWS Identity and Access Management (IAM). Permissions can be customized for specific users, groups or roles while controlling access at the network level. You can set up security groups to allow traffic only from trusted sources and configure access control policies to ensure that users are granted the appropriate level of access.

Benefits of Amazon Aurora and Redshift:

  • High Performance
  • Scalability
  • Cost-Effective
  • Security

Amazon Redshift vs. Aurora: Cost Comparisons

Storage Costs

There are two key factors to consider when assessing storage costs: per-gigabyte pricing and reserved node fees. Both services offer per-gigabyte pricing, which means you pay for usage rather than for access. However, Redshift charges additional fees for reserved nodes, which are required for running a cluster with predictable performance and capacity.

Despite this fee structure, Redshift can be a more cost-effective option for larger datasets because of its advanced compression algorithms and columnar storage technologies. These technologies enable Redshift to store and query large amounts of data efficiently, resulting in lower storage costs and faster query performance.

Compute Costs

Compute costs include I/O operations, such as scanning or sorting large datasets, among other elements. Redshift’s advanced compression algorithms and columnar storage improve the efficiency of processing complex analytics, which reduces the number of I/O operations required and results in lower compute costs.

Disk I/O latency, or the time it takes for the storage system to retrieve data, can also affect compute costs. Aurora generally has lower latency compared to Redshift because of its architecture. Query performance is another consideration. Redshift can handle more complex queries than Aurora and may be required for businesses with complex analytical needs, although higher compute costs are possible.

Generally speaking, Aurora is better suited for reading and writing operations while Redshift offers superior capabilities in terms of complex analytics processing, making it worth investing in higher cluster sizes with Redshift over Aurora’s 64 nodes per cluster limit.

Data Transfer Fees

In addition to storage and compute expenses, factor in data transfer fees that might apply. Data transfers within AWS regions are free of charge, but transferring between regions has associated costs depending on how much data needs to be transmitted.

Amazon Aurora vs. Redshift: What’s Right for You?

Businesses looking at Aurora, Redshift and similar services need to understand how these offerings fit with their workload requirements. Aurora advantages include its sub-second latency read/write operations, making it suitable for tasks such as online transaction processing. Its multi-availability-zone deployment capabilities offer high availability and fault tolerance. But Aurora does have limitations. Certain functions such as sharding and table partitioning may not be supported by this platform because of its maximum cluster size limit.

Amazon Redshift shines when executing complex analytical queries on vast datasets, making it perfect for data-warehousing duties like business intelligence or analytics. Its columnar storage, data compression and high-performance capabilities allow it to handle intense workloads. Redshift’s scalability options are limited, which could be a disadvantage for large cloud solutions on AWS.

Besides the features and pricing, consider what kind of implementation needs your business requires. Amazon Aurora has quick setup times compared to on-premises solutions, while deploying a Redshift cluster can take days or even weeks.

After you’ve determined the right solution, you’ll need to migrate data to the new database. Amazon offers several migration tools to help you transfer your data, including Amazon Database Migration Service (DMS).

Throughout this process, partnering with a premier tier services partner such as Mission Cloud can ensure a smooth and successful migration. Want to learn more? See how Amazon DMS makes database migration easy.

The post Amazon Aurora vs. Redshift: What You Need to Know appeared first on The New Stack.

]]>
Boost DevOps Maturity with a Data Lakehouse https://thenewstack.io/boost-devops-maturity-with-a-data-lakehouse/ Wed, 17 May 2023 17:20:53 +0000 https://thenewstack.io/?p=22708319

In a world riven by macroeconomic uncertainty, businesses increasingly turn to data-driven decision-making to stay agile. That’s especially true of

The post Boost DevOps Maturity with a Data Lakehouse appeared first on The New Stack.

]]>

In a world riven by macroeconomic uncertainty, businesses increasingly turn to data-driven decision-making to stay agile.

That’s especially true of the DevOps teams tasked with driving digital-fueled sustainable growth. They’re unleashing the power of cloud-based analytics on large data sets to unlock the insights they and the business need to make smarter decisions. From a technical perspective, however, that’s challenging. Observability and security data volumes are growing all the time, making it harder to orchestrate, process, analyze and turn information into insight. Cost and capacity constraints are becoming a significant burden to overcome.

Data Scale and Silos Present Challenges

DevOps teams are often thwarted in their efforts to drive better data-driven decisions with observability and security data. That’s because of the heterogeneity of the data their environments generate and the limitations of the systems they rely on to analyze this information.

Most organizations are battling cloud complexity. Research has found that 99% of organizations have embraced a multicloud architecture. On top of these cloud platforms, they’re using an array of observability and security tools to deliver insight and control — seven on average. This results in siloed data that is stored in different formats, adding further complexity.

This challenge is exacerbated by the high cardinality of data generated by cloud native, Kubernetes-based apps. The sheer number of permutations can break traditional databases.

Many teams look to huge cloud-based data lakes, repositories that store data in its natural or raw format, to help centralize disparate data. A data lake enables teams to keep as much raw, “dumb” data as they wish, at relatively low cost, until someone in the business finds a use for it.

When it comes to extracting insight, however, data needs to be transferred to a warehouse technology so it can be aggregated and prepared before it is analyzed. Various teams usually then end up transferring the data again to another warehouse platform, so they can run queries related to their specific business requirements.

When Data Storage Strategies Become Problematic

Data warehouse-based approaches add cost and time to analytics projects.

As many as tens of thousands of tables may need to be manually defined to prepare data for querying. There’s also the multitude of indexes and schemas needed to retrieve and structure the data and define the queries that will be asked of it. That’s a lot of effort.

Any user who wants to ask a new question for the first time will need to start from scratch to redefine all those tables and build new indexes and schemas, which creates a lot of manual effort. This can add hours or days to the process of querying data, meaning insights are at risk of being stale or are of limited value by the time they’re surfaced.

The more cloud platforms, data warehouses and data lakes an organization maintains to support cloud operations and analytics, the more money they will need to spend. In fact, the storage space required for the indexes used to support data retrieval and analysis may end up costing more than the data storage itself.

Further costs will arise if teams need technologies to track where their data is and to monitor data handling for compliance purposes. Frequently moving data from place to place may also create inconsistencies and formatting issues, which could affect the value and accuracy of any resulting analysis.

Combining Data Lakes and Data Warehouses

A data lakehouse approach combines the capabilities of a warehouse and a lake to solve the challenges associated with each architecture, thanks to its enormous scalability and massively parallel processing capabilities. With a data lakehouse approach to data retention, organizations can cope with high-cardinality data in a time- and cost-effective manner, maintaining full granularity and extra-long data retention to support instant, precise and contextual predictive analytics.

But to realize this vision, a data lakehouse must be schemaless, indexless and lossless. Being schema-free means users don’t need to predetermine the questions they want to ask of data, so new queries can be raised instantly as the business need arises.

Indexless means teams have rapid access to data without the storage cost and resources needed to maintain massive indexes. And lossless means technical and business teams can query the data with its full context in place, such as interdependencies between cloud-based entities, to surface more precise answers to questions.

Unifying Observability Data

Let’s consider the key types of observability data that any lakehouse must be capable of ingesting to support the analytics needs of a modern digital business.

  • Logs are the highest volume and often most detailed data that organizations capture for analytics projects or querying. Logs provide vital insights to verify new code deployments for quality and security, identify the root causes of performance issues in infrastructure and applications, investigate malicious activity such as a cyberattack and support various ways of optimizing digital services.
  • Metrics are the quantitative measurements of application performance or user experience that are calculated or aggregated over time to feed into observability-driven analytics. The challenge is that aggregating metrics in traditional data warehouse environments can create a loss of fidelity and make it more difficult for analysts to understand the relevance of data. There’s also a potential scalability challenge with metrics in the context of microservices architectures. As digital services environments become increasingly distributed and are broken into smaller pieces, the sheer scale and volume of the relationships among data from different sources is too much for traditional metrics databases to capture. Only a data lakehouse can handle such high-cardinality data without losing fidelity.
  • Traces are the data source that reveals the end-to-end path a transaction takes across applications, services and infrastructure. With access to the traces across all services in their hybrid and multicloud technology stack, developers can better understand the dependencies they contain and more effectively debug applications in production. Cloud native architectures built on Kubernetes, however, greatly increase the length of traces and the number of spans they contain, as there are more hops and additional tiers such as service meshes to consider. A data lakehouse can be architected such that teams can better track these lengthy, distributed traces without losing data fidelity or context.
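To make the three data types concrete, here are simplified, hypothetical examples of the records a lakehouse would ingest. Real log, metric and trace schemas vary by tool; these field names are illustrative only.

```python
# Simplified, hypothetical observability records.

log_record = {
    "timestamp": "2023-05-17T12:00:03Z",
    "level": "ERROR",
    "service": "checkout",
    "message": "payment gateway timeout after 3 retries",
}

metric_record = {
    "timestamp": "2023-05-17T12:00:00Z",
    "name": "http.server.duration_ms",
    "value": 412.7,
    # High-cardinality dimensions like pod and version are what strain
    # traditional metrics stores but can be kept at full fidelity here.
    "dimensions": {"service": "checkout", "pod": "checkout-6f9c", "version": "1.42.0"},
}

trace_span = {
    "trace_id": "a1b2c3d4",
    "span_id": "0007",
    "parent_span_id": "0003",
    "service": "payment",
    "operation": "POST /charge",
    "duration_ms": 389,
}
```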

There are many other sources of data beyond metrics, logs, and traces that can provide additional insight and context to make analytics more precise. For example, organizations can derive dependencies and application topology from logs and traces.

If DevOps teams can build a real-time topology map of their digital services environment and feed this data into a lakehouse alongside metrics, logs and traces, it can provide critical context about the dynamic relationships between application components across all tiers. This provides centralized situational awareness that enables DevOps teams to raise queries about the way their multicloud environments work so they can understand how to optimize them more effectively.

User session data can also be used to gain a better understanding of how customers interact with application interfaces so teams can identify where optimization could help.

As digital services environments become more complex and data volumes explode, observability is certainly becoming more challenging. However, it’s also never been more critical. With a data lakehouse-based approach, DevOps teams can finally turn petabytes of high-fidelity data into actionable intelligence without breaking the bank or becoming burnt out in the effort.

The post Boost DevOps Maturity with a Data Lakehouse appeared first on The New Stack.

]]>
Vercel Offers Postgres, Redis Options for Frontend Developers https://thenewstack.io/vercel-offers-postgres-redis-options-for-frontend-developers/ Mon, 01 May 2023 16:00:40 +0000 https://thenewstack.io/?p=22706763

Increasingly, cloud provider Vercel is positioning itself as a one-stop for frontend developers. A slew of announcements this week makes

The post Vercel Offers Postgres, Redis Options for Frontend Developers appeared first on The New Stack.

]]>

Increasingly, cloud provider Vercel is positioning itself as a one-stop for frontend developers. A slew of announcements this week makes that direction clear by adding to the platform a suite of serverless storage options, as well as new security and editing features.

“Basically, for the longest time, frontend developers have struggled to come to define how you put together these best-in-class tools into a single platform,” Lee Robinson, Vercel’s vice president of developer experience, told The New Stack. “The idea here really is what would storage look like if it was reimagined from the perspective of a frontend developer.”

All of the announcements will be explored in a free online conference of sorts later this week.

Rethinking Storage for Frontend Developers

Vercel wanted to think about storage that works with new compute primitives, such as serverless and edge — functions that mean frontend developers don’t have to think through some of the more traditional ways of connecting to a database, Robinson said.

Developers are moving away from monolithic database architectures and embracing distributed databases “that can scale and perform in the cloud,” the company said in its announcement. Vercel also wants to differentiate by integrating storage with JavaScript frameworks, such as Next.js, Sveltekit or Nuxt, Robinson said.

The new options came out of conversations in which developers said they wanted first-party integration with storage and a unified way to handle billing and usage, a single account to manage both their compute as well as their storage, all integrated into their frontend framework and their frontend cloud, Robinson added.

“Historically, frontend developers — trying to retrofit databases that were designed for a different era — have struggled to integrate those in modern frontend frameworks,” Robinson said. “They have to think about manually setting up connection pooling as their application scales in size and usage. They have to think about dialing the knobs for how much CPU or storage space they’re allotting for their database. And for a lot of these developers, they just want a solution that more or less works out of the box and scales with them as their site grows.”

The three storage products Vercel announced this week are:
1. Vercel Postgres, through a partnership with Neon.

“Postgres is an incredible technology. Developers love it,” Robinson said. “We wanted to build on a SQL platform that was reimagined for serverless and that could pair well with Vercel as a platform, and that’s why we chose to have the first-party integration with Neon, a serverless database platform, a serverless Postgres platform.”

The integration will give developers access to a fully managed, highly scalable, truly serverless fault-tolerant database, which will offer high performance and low latency for web applications, the company added. Vercel Postgres is designed to work seamlessly with the Next.js App Router and Server Components, which allow web apps to fetch data from the database to render dynamic content on the server, Vercel added.

2. Vercel KV, a scalable, durable Redis-compatible database.

Redis is used as a key-value store in frontend development. Like Postgres, Redis is one of the top-rated databases and caches for developers, he said. Developers love its flexibility and API and the fact that it’s open source, he said.

“These databases can be used for rate limiting, session management and application state,” Vercel stated in its press release. “With Vercel KV, frontend developers don’t need to manage scaling, instance sizes or Redis clusters — it’s truly serverless.”

Vercel’s lightweight SDK works from edge or serverless functions and scales with a brand’s traffic.

“The interesting thing here — and what I’m really excited about with this one — is that traditionally, a lot of Redis instances would be ephemeral. So you would use them as a cache, you would store some data in them, and that cache would expire,” Robinson said. “The cool thing about durable storage, or our durable Vercel KV for Redis, is that you can actually use it like a database. You can store data in there and it will persist. So developers get the power and the flexibility that they love from Redis.”
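The rate-limiting use case mentioned above translates into a few lines against any Redis-compatible store. The sketch below uses the generic redis-py client rather than Vercel's own SDK (which is JavaScript), and the connection details are placeholders:

```python
import redis  # generic Redis-compatible client; Vercel's own SDK is JavaScript

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

def allow_request(user_id: str, limit: int = 100, window_seconds: int = 60) -> bool:
    """Fixed-window rate limiter: at most `limit` requests per user per window."""
    key = f"ratelimit:{user_id}"
    count = r.incr(key)  # atomic increment; creates the key at 1 if missing
    if count == 1:
        r.expire(key, window_seconds)  # start the window on the first request
    return count <= limit
```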

3. Vercel Blob, a secure object store, which has been one of the top requests from the Vercel community. Vercel Blob offers file storage in the cloud using an API built on Web standard APIs, allowing users to upload files or attachments of any size. It will enable companies to host moderately complex apps entirely on Vercel without the need for a separate backend or database provider.

“Vercel Blob is effectively a fast and simple way to upload files,” Robinson said. “We’re working in partnership with Cloudflare and using their R2 product that allows you to effectively very easily upload and store files in the cloud, and have a really simple API that you can use; again, that works well with your frontend frameworks to make it easy to store images or any other type of file.”

Each offers developers an easy way to solve different types of storage problems, he said.

“If you step back and you look at the breadth of the storage products that we’re having these first-party integrations for, we’re trying to give developers a convenient, easy way to solve all of these different types of storage solutions,” Robinson said.

New Security Offerings from Vercel

Along with Vercel’s new storage products, the frontend cloud provider has also launched Vercel Secure Compute, which gives businesses the ability to create private connections between serverless functions and protect their backend cloud. Previously, companies had to allow all IP addresses on their backend cloud for a deployment to be able to connect with it, Vercel explained. With Vercel Secure Compute, the deployments and build container will be placed in a private network with a dedicated IP address in the region of the user’s choice and logically separated from other containers, the press release stated.

“Historically on the Vercel platform, you’ve had your compute, which is serverless functions or edge functions, and when we talk to our largest customers, our enterprise customers, they love the flexibility that offers, but they wanted to take it a step further and add additional security controls on top,” Robinson said. “To do that, we’ve offered a product called Vercel Secure Compute, which allows you to really isolate that compute and put it inside of the same VPC [virtual private cloud] as the rest of your infrastructure.”

It’s targeting large teams who have specific security rules or compliance rules and want additional control over their infrastructure, he added. Along with that, they introduced Vercel Firewall, with plans to introduce a VPN at some point in the future.

“The same customers when they’re saying, ‘I want more control, more granularity over my compute,’ they also want more control over the Vercel Edge network, and how they can allow or block traffic. So with Vercel firewall we’re giving our enterprise customers more flexibility for allowing or blocking specific IP addresses,” Robinson said.

Visual Editing Pairs with Comments on Preview

The company also released Vercel Visual Editing, which dovetails on the company’s release in December of Comments on Preview Deployments. Visual Editing means developers can work with non-technical colleagues and across departments to live-edit site content. To do that, Vercel partnered with Sanity, a real-time collaboration platform for structured content, to introduce a new open standard for content source mapping for headless CMS [content management systems]. The new standard works with any framework and headless CMS, the company added.

Vercel used it for the blog posts it’s creating about the new announcements, collectively nicknamed Vercel Ship, allowing the team to edit the content.

“The way that visual editing pairs into this, it actually works in harmony with Comments,” he said. “So for example, all of the blog posts that we’re working on for this upcoming Vercel Ship week, we’re using a combination of comments, as well as visual editing to allow our teams to give feedback say, ‘Let’s change this word here to a different word. Let’s fix this typo.’ Then the author or the editors can go and click the edit button go in make those changes directly and address the comment.”

The post Vercel Offers Postgres, Redis Options for Frontend Developers appeared first on The New Stack.

]]>
Why We Need an Inter-Cloud Data Standard https://thenewstack.io/why-we-need-an-inter-cloud-data-standard/ Thu, 27 Apr 2023 15:19:18 +0000 https://thenewstack.io/?p=22706411

The cloud has completely changed the world of software. Everyone from startup to enterprise has access to vast amounts of

The post Why We Need an Inter-Cloud Data Standard appeared first on The New Stack.

]]>

The cloud has completely changed the world of software. Everyone from startup to enterprise has access to vast amounts of compute hardware whenever they want it. Need to test the latest version of that innovative feature you were working on? Go ahead. Spin up a virtual machine and set it free. Does your company need to deploy a disaster recovery cluster? Sure, if you can foot the bill.

No longer are we limited by having to procure expensive physical servers and maintain on premises data centers. Instead, we are free to experiment and innovate. Each cloud service provider has a litany of tools and systems to help accelerate your modernization efforts. The customer is spoiled with choices.

Except that they aren’t. Sure, a new customer can weigh the fees and compare the offerings. But that might be their last opportunity, because once you check in, you can’t check out.

The Hotel California Problem

Data is the lifeblood of every app and product, as all software, at its core, is simply manipulating data to create an experience for the user. And thankfully, cloud providers will happily take your precious data and keep it safe and secure for you. The cost to upload your data is minimal. However, when you want to take your ever-growing database and move it somewhere else, you will be in for a surprise: the worst toll road in history. Heavy bills and slow speeds.

Is there a technical reason? Let’s break it down. Just like you and me, the big cloud providers pay for their internet infrastructure and networking, and that must be factored into their business model. Since the costs can’t be absorbed, it is worth subsidizing the import fees and taxing the exports.

Additionally, the bandwidth of their internet networks is also limited. It makes sense to deprioritize the large data exports so that production application workloads are not affected.

Combine these factors and you can see why Amazon Web Services (AWS) offers a service where it sends a shipping container full of servers and data drives so you can migrate data to its services. It is often cheaper and faster to mail a hard drive than it is to download its contents from the web.

It helps that all these factors align with the interests of the company. When a large data export is detected, it probably is a strong indicator that the customer wants to lower their fees or take their business elsewhere. It is not to the cloud provider’s benefit to make it easy for customers to move their data out.

Except that it is in the cloud provider’s interest. It’s in everyone’s interest.

It Really Matters

The situation is not unlike the recent revolution in the smart home industry. Since its inception, it has been a niche enthusiast hobby. But in 2023, it is poised to explode.

Amazon, Google and Apple have ruthlessly pursued this market for years, releasing products designed to coordinate your home. They have tried to sell the vision of a world where your doors are unlocked by Alexa, where Siri watches your cameras for intruders and where Google sets your air conditioning to the perfect temperature. But you were only allowed one assistant. Alexa, Siri or Google.

By design, there was no cross compatibility; you had to go all in. This meant that companies who wanted to develop smart home products also had to choose an ecosystem, as developing a product that works with and is certified for all three platforms was prohibitively expensive. Product boxes had a ridiculous number of logos on them indicating which systems they work with and what protocol they operate on.


It was a minefield for consumers. The complexity of finding products that work with your system was unbearably high and required serious planning and research. It was likely you would walk out of the shop with a product that wouldn’t integrate with your smart home assistant.


This will change. In late 2022, products certified on the new industry standard, named Matter, started hitting shelves, and they work with all three ecosystems. No questions asked, and only one logo to look for. This reduces consumer complexity. It makes developing products easier, and it means that the smart home market can grow beyond a niche hobby for technology enthusiasts and into the mass market. By 2022, only 14% of households had experimented with smart technology. However, in the next four years, adoption of smart home technology is set to double, with another $100 billion of revenue being generated by the industry.

Bringing It Back to Cloud

We must look at it from the platform vendor’s perspective. Before Matter, users had to choose, and if they chose your ecosystem, it was a big win! Yet the story isn’t that simple, as the customers were left unfulfilled, limited to a small selection of products that they could use. Worse, the friction that this caused limited the size of the market and ensured that even if the vendor improved its offering, it was unlikely to cause a competitor’s customers to switch.

In this case, lock-in was so incredibly detrimental to the platform owners that all the players in the space acknowledged the existential threats to the budding market, driving traditionally bitter rivals to rethink, reorganize and build a new, open ecosystem.

The cloud service providers (CSPs) are in a similar position. The value proposition of the cloud was originally abundantly clear, and adoption exploded. Today, sentiment is shifting, and the cloud backlash has begun. After 10 years of growing cloud adoption, organizations are seeing their massive cloud bills continue to grow, with an expected $100 billion increase in spending in 2023 alone, while cloud lock-in limits their agility.

With so much friction in moving cloud data around, it might be easier for customers to never move data there and just manage the infrastructure themselves.

The value still exists for sporadic workloads, or development and innovation, as purchasing and procurement is a pain for these sorts of use cases. Yet, even these bleeding-edge use cases can be debilitated by lock-in. Consider that there may be technology offered by AWS and another on Google Cloud that together could solve a key technical challenge that a development team faces. This would be a win-win-win. Both CSPs would gain valuable revenue, and the customer would be able to build their technology. Unfortunately, today this is impossible as the data transfer costs make this unreasonably expensive.

There are second-order effects as well. Each CSP currently must invest in hundreds of varying services for its customers: for each technology category, the cloud provider must offer a solution to its locked-in customers. This spreads development thin, perhaps limiting the excellence of each individual service since so many of them need to be developed and supported. As thousands of employees are let go by Amazon (27,000), Google (12,000) and Microsoft (10,000), can these companies really keep up the pace? Wouldn’t quality and innovation go up if these companies could focus their efforts on their differentiators and best-in-class solutions? Customers could shop at multiple providers and always get the best tools for their money.

High availability is another key victim of the current system. Copies of the data must be stored and replicated in a set of discrete locations to avoid data loss. Yet data transfer costs mean that replicating data between availability zones within a single cloud region already drives up the bill. Forget replicating any serious amount of data between cloud providers, as that becomes infeasible due to cost and latency. This places real limits on how well customers can protect their data from disasters or failures, artificially capping risk mitigations.

An Industry Standard

So many of today’s cloud woes come down to the data-transfer cost paradigm. The industry needs a rethink. Just like the smart home companies came together to build a single protocol called Matter, perhaps the CSPs could build a simple, transparent and unified system for data transfer fees.

The CSPs could invest in building an inter-cloud super highway: an industry-owned and -operated infrastructure designed solely for moving data between CSPs with the required throughput. Costs would go down as the public internet would no longer be a factor.

A schema could be developed to ensure interoperability between technologies and reduce friction for users looking to migrate their data and applications. An encryption standard could be enforced to ensure security and compliance and use of the aforementioned cross-cloud network would reduce risk of interception by malicious actors. For critical multicloud applications, customers could pay a premium to access faster inter-cloud rates.

Cloud providers would be able to further differentiate their best product offerings knowing that if they build it, the customers will come, no longer locked into their legacy cloud provider.

Customers could avoid lengthy due diligence when moving to the cloud, as they could simply search for the best service for their requirements, no longer buying the store when they just need one product. Customers would benefit from transparent and possibly reduced costs with the ability to move their business when and if they want to. Overall agility would increase, allowing strategic migration on and off the cloud as requirements change.

And of course, a new level of data resilience could be unlocked as data could be realistically replicated back and forth between different cloud providers, ensuring the integrity of the world’s most important data.

This is a future where everyone wins. The industry players could ensure the survival and growth of their offerings in the face of cloud skepticism. Customers get access to the multitudes of benefits listed above. Yes, it would require historic humility and cooperation from some of the largest companies in the world, but together they could usher in a new generation of technology innovation.

We need an inter-cloud data mobility standard.

In the Meantime

Today no such standard exists; in fact, quite the opposite is true. The risks of cloud lock-in are high, and customers must mitigate them by using the cloud carefully and intelligently. Data transfer fees cannot be avoided, but there are other ways to lower your exposure.

That's why Couchbase enables its NoSQL cloud database to be used in a multitude of different ways. You can manage it yourself, or use the Autonomous Operator to deploy it on any Kubernetes infrastructure (on premises or in the cloud). We also built our database-as-a-service, Capella, to run natively on all three major cloud platforms.

Couchbase natively leverages multiple availability zones and its internal replication technology to ensure high availability. With Cross Datacenter Replication (XDCR), you can asynchronously replicate your data across regions and even across cloud platforms, so your data stays safe even in the worst-case scenarios.
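As a rough illustration, here is a minimal sketch of defining an XDCR replication between two self-managed clusters through Couchbase's XDCR REST endpoints. The hostnames, credentials and bucket names are placeholders, and you should verify the endpoints and parameters against the documentation for your Couchbase Server version.

import requests

SRC = "http://source-cluster:8091"      # source cluster admin endpoint (placeholder)
AUTH = ("Administrator", "password")    # placeholder credentials

# 1. Register the target cluster as a remote reference on the source cluster.
requests.post(
    f"{SRC}/pools/default/remoteClusters",
    auth=AUTH,
    data={
        "name": "dr-cluster",                # label for the remote cluster reference
        "hostname": "target-cluster:8091",   # reachable node on the target cluster
        "username": "Administrator",
        "password": "password",
    },
).raise_for_status()

# 2. Start a continuous replication from a source bucket to the target bucket.
requests.post(
    f"{SRC}/controller/createReplication",
    auth=AUTH,
    data={
        "fromBucket": "app-data",
        "toCluster": "dr-cluster",
        "toBucket": "app-data",
        "replicationType": "continuous",
    },
).raise_for_status()

The same replication can also be defined in the web console or with the couchbase-cli tool; the point is that a cross-region or cross-cloud copy is a configuration step, not an application change.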

Try Couchbase Capella today with a free trial and no commitment.

The post Why We Need an Inter-Cloud Data Standard appeared first on The New Stack.

]]>
We Designed Our Chips with First Pass Success — and So Can You https://thenewstack.io/we-designed-our-chips-with-first-pass-success-and-so-can-you/ Wed, 26 Apr 2023 17:00:12 +0000 https://thenewstack.io/?p=22704549

Like most thrilling adventures, this one began with a question: When interviewing for my current job at ScaleFlux, a computational

The post We Designed Our Chips with First Pass Success — and So Can You appeared first on The New Stack.

]]>

Like most thrilling adventures, this one began with a question: When interviewing for my current job at ScaleFlux, a computational storage vendor, in early 2019, my future boss asked me, “Do you think we can design our own chips?” For a small startup like ScaleFlux, it might seem like an insurmountable project. But, appreciating a challenge, I responded enthusiastically, “Yes!”

What are the benefits? You gain flexibility and increase sustainability. You level the playing field so you can compete with major global technology companies. You also retain control of your architecture rather than giving it to someone else. After all, your architecture is your company’s secret sauce.

Another advantage is time to market. If you don’t have control over your chip design, a change in your assembled final product could result in a six-month delay in production — or longer — because of the redesign and validation steps involved.

Most of the strategies involved in chip design are the same regardless of what type of chip you build. However, at ScaleFlux, we were specifically interested in designing a system on a chip (SoC) for next-generation SSD drives because it can combine many aspects of a computer system into a single chip.

My goal for our chip design was "A-0" success, meaning immediate production readiness with no modification or redesign necessary. Producing a market-ready A-0 chip is a monumental challenge rarely attempted by even the largest, most sophisticated tech companies. Still, it is achievable with the right mentality and a "can do" attitude.

As we discovered later, stalled capacity at fabs (the fabrication plants that manufacture semiconductors) gave us further incentive to be in control of our own destiny. The global economy was struggling through the chip shortage, which brought shipping challenges caused by shortages of the materials used to package the chips.

Yet, because we had reserved our position in the fabs, we could weather the worst of the chip shortage as competitors floundered.

We designed a working chip on our first try, and I am proud of our success. Let me tell you that now is a fantastic time to get started with designing your chips, and I want to see others accomplish the same goal. While a smart — and likely profitable — venture, designing your own chips isn’t easy. It requires a substantial initial investment of time and money. I experienced that for myself when ScaleFlux took on the challenge. There were bumps along the way, but looking back, it’s all been worth it — and the ROI took just a couple of years.

So, allow me to share six strategies I found tremendously helpful in streamlining the chip design process:

1. Commit Enough Resources to the Project

The initial investment includes building your team, purchasing simulation and design technology tools and contracting with production partners to develop your chips. While the schedule is king, quality is queen, so build enough time into your schedule to keep the quality high.

ScaleFlux's investment was tens of millions of dollars, and the project took a year and a half. Whatever your level of investment, allocate a sufficient amount for success and never cut corners. After all, there's no patch for most hardware failures.

2. Gather a Skilled Team

Every person on our 50-person team contributed to the project’s success. I didn’t need a lone hero who could “do it all.” I needed a cross-functional group of humble, hard-working people who could admit what they didn’t know, communicate effectively and focus on the company’s objectives over their egos. I’ll never forget the big smiles on team members’ faces at the party to celebrate the successful validation of our first chips.

3. Pick the Right Partners

Select technology partners with a track record of success in specialized domains, such as physical design, testability and packaging. Qualified semiconductor engineers are a hot commodity in the US, so be flexible about where you source your talent. Your next team member could be an up-and-coming intern still in college or a semiconductor engineer who lives on the other side of the globe. For the verification team, aim for a minimum 1:2 ratio of designers to verification engineers. And as your team achieves, reward the team rather than individual members.

4. Adopt a Leadership Mindset

Agility is a significant advantage at a startup — you shouldn’t waste time making decisions. Stepping outside your comfort zone as a company and designing new technology is risky. But with the right mindset and appropriate resources, you can build a foundation of discipline and thorough planning to take calculated risks that bring tremendous gains.

While we limited architectural changes after the design kickoff, chip design involves a million small tasks. If just one of them goes wrong — like a team member misreading a design spec and making an error — your chip won't work. We automated process steps whenever possible and, following the adage to trust but verify, used numerous checklists at every significant design phase to keep us on task during countless reviews.

5. Seek Outside Expertise When Needed

Large companies sometimes want to do everything in-house and resist partnering with outside resources. When they do, they are often limited to working with a short list of approved partners. Startups and smaller companies are more comfortable reaching out for expertise because of headcount limitations and have more flexibility to work with promising startups.

Our partnerships on our chip design initiative let us tap into some of the industry’s top chip design and production experts and benefit from their knowledge and connections. One supplier had paid years in advance for production line capacity, which proved valuable. Another was a relatively unknown, small company started by a talented lead engineer I’d worked with for over 15 years at a previous company.

6. Resist Being Dazzled by "Cool Technology"

Instead of prioritizing cost, I prioritized quality and production readiness in components. Some employees and board members urged me to use new technology in our chip design. I resisted. My first rule for my team was to limit ourselves to only selecting established, market-tested intellectual property (IP) when sourcing technology for our chips.

All the IP we used when designing our chips was production-worthy rather than new. I chose older, time-tested technology over shiny new technology. I wanted IP that had already pushed dozens of SoCs to production and contributed to millions of shipped chips.

You, Too, Can Design Your Own Chips

With supply chain issues continuing and China dominating the chip fabrication market, companies can save money and gain a competitive edge by handling the first half of chip manufacturing — chip design — themselves.

Now, ScaleFlux is working on technology focused on reducing costs and designing next-generation chips for our future products. I hope our story inspires you. We’re a technology startup without the resources of an established company. We saw a need — and recognized the significant benefits we could realize — and we made it happen.

So can you.

The post We Designed Our Chips with First Pass Success — and So Can You appeared first on The New Stack.

]]>
ACID Transactions Change the Game for Cassandra Developers https://thenewstack.io/acid-transactions-change-the-game-for-cassandra-developers/ Wed, 26 Apr 2023 15:33:43 +0000 https://thenewstack.io/?p=22706385

For years, Apache Cassandra has been solving big data challenges such as horizontal scaling and geolocation for some of the

The post ACID Transactions Change the Game for Cassandra Developers appeared first on The New Stack.

]]>

For years, Apache Cassandra has been solving big data challenges such as horizontal scaling and geolocation for some of the most demanding use cases. But one area, distributed transactions, has proven particularly challenging for a variety of reasons.

It’s an issue that the Cassandra community has been hard at work to solve, and the solution is finally here. With the release of Apache Cassandra version 5.0, which is expected later in 2023, Cassandra will offer ACID transactions.

ACID transactions will be a big help for developers, who have been calling for more SQL-like functionality in Cassandra. Developers can now avoid the complex workaround code they previously wrote to apply changes across multiple rows. And some applications that currently use multiple databases to handle ACID transactions can rely solely on Cassandra for their transaction needs.

What Are ACID Transactions and Why Would You Want Them?

ACID transactions adhere to the following characteristics:

  • Atomicity — Operations in the transaction are treated as a single unit of work and can be rolled back if necessary.
  • Consistency — Different from the “consistency” that we’re familiar with from the CAP Theorem, this is about upholding the state integrity of all data affected by the transaction.
  • Isolation — Assuring that the data affected by the transaction cannot be interfered with by competing operations or transactions.
  • Durability — The data will persist at the completion of the transaction, even in the event of a hardware failure.

While some NoSQL databases have managed to implement ACID transactions, they traditionally have only been a part of relational database management systems (RDBMS). One reason for that: RDBMSs historically have been contained within a single machine instance. The reality of managing database operations is that it’s much easier to provide ACID properties when everything is happening within the bounds of one system. This is why the inclusion of full ACID transactions into a distributed database such as Cassandra is such a big deal.

The advantage of ACID transactions is that multiple operations can be grouped together and essentially treated as a single operation. For instance, if you're updating several points of data that depend on a specific event or action, you don't want to risk some of those points being updated while others aren't. ACID transactions remove that risk.

Example Transaction

Let’s look at a game transaction as an example. Perhaps we’re playing one of our favorite board games about buying properties. One of the players, named “Avery,” lands on a property named “Augustine Drive” and wants to buy it from the bank for $350.

There are three separate operations needed to complete the transaction:

  • Deduct $350 from Avery
  • Add $350 to the bank
  • Hand ownership of Augustine Drive to Avery

ACID transactions will help to ensure that:

  • Avery’s $350 doesn’t disappear
  • The bank doesn’t just receive $350 out of thin air
  • Avery doesn’t get Augustine Drive for free

Essentially, an ACID transaction helps to ensure that all parts of this transaction are either applied in a consistent manner or rolled back.
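As an illustration, here is a minimal sketch of how that purchase might be expressed as a single multi-partition transaction, using the CQL grammar proposed in CEP-15 and submitted through the DataStax Python driver. The keyspace, tables and columns are invented for this example, and the final transaction syntax and driver support may differ in released Cassandra 5.x builds.

from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("game")  # hypothetical keyspace for this example

purchase = """
BEGIN TRANSACTION
  LET avery = (SELECT balance FROM accounts WHERE player = 'Avery');
  IF avery.balance >= 350 THEN
    UPDATE accounts SET balance -= 350 WHERE player = 'Avery';
    UPDATE accounts SET balance += 350 WHERE player = 'Bank';
    UPDATE properties SET owner = 'Avery' WHERE name = 'Augustine Drive';
  END IF
COMMIT TRANSACTION;
"""

# Either all three updates are applied together at one Accord timestamp, or none are.
session.execute(purchase)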

Consensus with Accord

Cassandra will be able to support ACID transactions thanks to the Accord protocol. CEP-15, a Cassandra Enhancement Proposal, introduces general-purpose transactions based on the Accord white paper. The main points of CEP-15 are:

  • Implementation of the Accord consensus protocol
  • Strict serializable isolation
  • Best-effort completion of each transaction in a single round trip
  • Operation over multiple partition keys

With the Accord consensus protocol, each node in a Cassandra cluster has a structure called a "reorder buffer." This buffer is designed to hold transaction timestamps for the future.


Figure 1: A coordinator node presenting a future timestamp to its voting replicas.

Essentially, a coordinator node takes a transaction and proposes a future timestamp for it. It then presents this timestamp (Figure 1) to the “electorate” (voting replicas for the transaction). The replicas then check to see if they have any conflicting operations.

As long as a quorum of the voting replicas accepts the proposed timestamp (Figure 2), the coordinator applies the transaction at that time. This process is known as the “Fast Path,” because it can be done in a single round trip.

Figure 2: All the voting replicas “accept” the proposed timestamp, and the “Fast Path” application of the transaction can proceed.

However, if a quorum of voting replicas fails to “accept” the proposed timestamp, the conflicting operations are reported back to the coordinator along with a newly proposed timestamp for the original transaction.
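To make the voting step concrete, here is a deliberately simplified toy model of the fast path, not the real Accord implementation: the coordinator proposes a future timestamp, each voting replica accepts it only if it knows of no conflicting operation at or after that timestamp, and the transaction commits in a single round trip once a simple majority accepts.

from dataclasses import dataclass

@dataclass
class Replica:
    highest_promised: int = 0  # latest timestamp this replica has already accepted

    def vote(self, proposed_ts: int) -> bool:
        """Accept the proposal only if it does not conflict with anything seen so far."""
        if proposed_ts > self.highest_promised:
            self.highest_promised = proposed_ts
            return True
        return False  # conflict: the coordinator must retry with a newer timestamp

def fast_path_commit(replicas: list, proposed_ts: int) -> bool:
    accepts = sum(replica.vote(proposed_ts) for replica in replicas)
    quorum = len(replicas) // 2 + 1
    return accepts >= quorum  # commit at proposed_ts if a quorum agrees

electorate = [Replica() for _ in range(3)]
print(fast_path_commit(electorate, proposed_ts=100))  # True: fast path, one round trip
print(fast_path_commit(electorate, proposed_ts=90))   # False: conflicts force a new proposal

The real protocol tracks dependencies between conflicting transactions and uses richer timestamps than plain integers, but the quorum-on-a-proposed-timestamp shape is the same idea shown in Figures 1 and 2.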

Wrapping Up

The addition of ACID transactions to a distributed database like Cassandra is an exciting change, in part because it opens Cassandra up to several new use cases:

  • Automated payments
  • Game transactions
  • Banking transfers
  • Inventory management
  • Authorization policy enforcement

Previously, Cassandra was unsuited to the use cases listed above. Many times, developers have had to say, "We want to use Cassandra for X, but we need ACID." No more!

More importantly, this is the beginning of Cassandra's evolution into a feature-rich database. That is going to improve the developer experience by leaps and bounds and help make Cassandra a first-choice datastore for developers building mission-critical applications. If you want to stay up to date on the latest Cassandra news, check out Planet Cassandra. While you're there, you can browse an ever-growing list of real-world use cases if you need some ideas. And if you're a Cassandra user, we'd love to publish your story here.

The post ACID Transactions Change the Game for Cassandra Developers appeared first on The New Stack.

]]>