Database and Data Management News & Trends | The New Stack
https://thenewstack.io/data/

LLMs and Data Privacy: Navigating the New Frontiers of AI
https://thenewstack.io/llms-and-data-privacy-navigating-the-new-frontiers-of-ai/ | Wed, 27 Sep 2023 17:00:30 +0000

Large Language Models (LLMs) like ChatGPT are revolutionizing how we interact online, offering unmatched efficiency and personalization. But as these AI-driven tools become more prevalent, they bring significant concerns about data privacy to the forefront. With models like OpenAI’s ChatGPT becoming staples in our digital interactions, the need for robust confidentiality measures is more pressing than ever.

I have been thinking about security for generative AI lately. Not because I have tons of private data but because my clients do. I also need to be mindful of taking their data and manipulating it or analyzing it in SaaS-based LLMs, as doing so could breach privacy. Numerous cautionary tales exist already of professionals doing this either knowingly or unknowingly. Among my many goals in life, being a cautionary tale isn’t one of them.

Current AI Data Privacy Landscape

Despite the potential of LLMs, there’s growing apprehension about their approach to data privacy. For instance, OpenAI’s ChatGPT, while powerful, refines its capabilities using user data and sometimes shares this with third parties. Platforms like Anthropic’s Claude and Google’s Bard have retention policies that might not align with users’ data privacy expectations. These practices highlight an industry-wide need for a more user-centric approach to data handling.

The digital transformation wave has seen generative AI tools emerge as game-changers. Some industry pundits even compare their transformative impact to landmark innovations like the internet, arguing that generative AI’s impact is likely to be just as great, if not greater. As the adoption of LLM applications and tools skyrockets, there’s a glaring gap: preserving the privacy of the data these models process, which means securing both the training data fed into them and any data the models output. This presents a unique challenge. While LLMs require vast data to function optimally, they must also navigate a complex web of data privacy regulations.

Legal Implications and LLMs

The proliferation of LLMs hasn’t escaped the eyes of regulatory bodies. Frameworks like the EU AI Act, General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) have set stringent data sharing and retention standards. These regulations aim to protect user data, but they also pose challenges for LLM developers and providers, emphasizing the need for innovative solutions that prioritize user privacy.

Top LLM Data Privacy Threats

In August, the Open Web Application Security Project (OWASP) released the Top 10 for LLM Applications 2023, a comprehensive guide to the most critical security risks to LLM applications. One such concern is training data poisoning. This happens when changes to data or process adjustments introduce vulnerabilities, biases, or even backdoors. These modifications can endanger the security and ethical standards of the model. To tackle this, confirming the genuineness of the training data’s supply chain is vital.

Using sandboxing can help prevent unintended data access, and it’s crucial to vet specific training datasets rigorously. Another challenge is supply chain vulnerabilities. The core foundation of LLMs, encompassing training data, ML models and deployment platforms, can be at risk due to weaknesses in the supply chain. Addressing this requires a comprehensive evaluation of data sources and suppliers. Relying on trusted plugins and regularly engaging in adversarial testing ensures the system remains updated with the latest security measures.

Sensitive information disclosure is another challenge. LLMs might unintentionally disclose confidential data, leading to privacy concerns. To mitigate this risk, it’s essential to use data sanitization techniques. Implementing strict input validation processes and hacker-driven adversarial testing can help identify potential vulnerabilities.
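
To make prompt sanitization concrete, here is a minimal, illustrative Python sketch that redacts a couple of common patterns (email addresses and US-style Social Security numbers) before a prompt ever reaches an LLM. The patterns and the redact helper are assumptions for the example, not a production-grade PII detector.

import re

# Hypothetical, minimal redaction rules; real systems use NER models and far
# broader pattern coverage.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace matches of each pattern with a placeholder token."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED_{label}]", prompt)
    return prompt

raw = "Contact Jane at jane.doe@example.com, SSN 123-45-6789, about the audit."
print(redact(raw))
# Contact Jane at [REDACTED_EMAIL], SSN [REDACTED_SSN], about the audit.

Input validation and adversarial testing then probe for prompts that slip past rules like these.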

Enhancing LLMs with plugins can be beneficial but can also introduce security concerns due to insecure plugin design. These plugins can become potential gateways for security threats. To ensure these plugins remain secure, it’s essential to have strict input guidelines and robust authentication methods. Continuously testing these plugins for security vulnerabilities is also crucial.

Lastly, excessive agency in LLMs can be problematic. Giving too much autonomy to these models can lead to unpredictable and potentially harmful outputs. It’s essential to set clear boundaries on the tools and permissions granted to these models to prevent such outcomes. Functions and plugins should be clearly defined, and human oversight should always be in place, especially for significant actions.

Three Approaches to LLM Security

There isn’t a one-size-fits-all approach to LLM security. It’s a balancing act between how you want to interact with both internal and external sources of information and the users of those models. For example, you may want both a customer-facing chatbot and an internal one that collates private institutional knowledge.

Data Contagion Within Large Language Models (LLMs)

Data contagion in Large Language Models (LLMs) is the accidental dissemination of confidential information via a model’s inputs. Given the intricate nature of LLMs and their expansive training datasets, ensuring that these models do not inadvertently disclose proprietary or sensitive data is imperative.

In the contemporary digital landscape, characterized by frequent data breaches and heightened privacy concerns, mitigating data contagion is essential. An LLM that inadvertently discloses sensitive data poses substantial risks, both in terms of reputational implications for entities and potential legal ramifications.

One approach to address this challenge encompasses refining the training datasets to exclude sensitive information, ensuring periodic model updates to rectify potential vulnerabilities and adopting advanced methodologies capable of detecting and mitigating risks associated with data leakage.

Sandboxing Techniques for LLMs

Sandboxing is another strategy to keep data safe when working with AI models. Sandboxing entails the creation of a controlled computational environment wherein a system or application operates, ensuring that its actions and outputs remain isolated and don’t make their way outside of that environment.

For LLMs, the application of sandboxing is particularly salient. By establishing a sandboxed environment, entities can regulate access to the model’s outputs, ensuring interactions are limited to authorized users or systems. This strategy enhances security by preventing unauthorized access and potential model misuse.

With more than 300,000 models available on Hugging Face, including exceptionally powerful large language models, it’s well within reason for enterprises that have the means to deploy their own EnterpriseGPT that remains private.

Effective sandboxing necessitates the implementation of stringent access controls, continuous monitoring of interactions with the LLM and establishing defined operational parameters to ensure the model’s actions remain within prescribed limits.

Data Obfuscation Before LLM Input

The technique of “obfuscation” has emerged as a prominent strategy in data security. Obfuscation pertains to modifying original data to render it unintelligible to unauthorized users while retaining its utility for computational processes. In the context of LLMs, this implies altering data to remain functional for the model but become inscrutable for potential malicious entities. Given the omnipresent nature of digital threats, obfuscating data before inputting it into an LLM is a protective measure. In the event of unauthorized access, the obfuscated data, devoid of its original context, offers minimal value to potential intruders.

Several methodologies are available for obfuscation, such as data masking, tokenization and encryption. It is vital to choose a technique that aligns with the operational requirements of the LLM and the inherent nature of the data being processed. Selecting the right approach ensures optimal protection while preserving the integrity of the information.

In conclusion, as LLMs continue to evolve and find applications across diverse sectors, ensuring their security and the integrity of the data they process remains paramount. Proactive measures, grounded in rigorous academic and technical research, are essential to navigate the challenges posed by this dynamic domain.

OpaquePrompts: Open Source Obfuscation for LLMs

In response to these challenges, OpaquePrompts was recently released on GitHub by Opaque Systems. It preserves the privacy of user data by sanitizing it, ensuring that personal or sensitive details are removed before interfacing with the LLM. By harnessing advanced technologies such as confidential computing and trusted execution environments (TEEs), OpaquePrompts guarantees that only the application developer can access the full scope of the prompt’s data. OpaquePrompts’ suite of tools is available on GitHub for those interested in delving deeper.

OpaquePrompts is engineered for scenarios demanding insights from user-provided contexts. Its workflow is comprehensive (a simplified sketch of the round trip follows the list below):

  • User Input Processing: LLM applications create a prompt, amalgamating retrieved-context, memory and user queries, which is then relayed to OpaquePrompts.
  • Identification of Sensitive Data: Within a secure TEE, OpaquePrompts utilizes advanced NLP techniques to detect and flag sensitive tokens in a prompt.
  • Prompt Sanitization: All identified sensitive tokens are encrypted, ensuring the sanitized prompt can be safely relayed to the LLM.
  • Interaction with LLM: The sanitized prompt is processed by the LLM, which then returns a similarly sanitized response.
  • Restoring Original Data: OpaquePrompts restores the original data in the response, ensuring users receive accurate and relevant information.
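
The restore step is what distinguishes this pattern from plain redaction: sensitive values are swapped for placeholders on the way in and swapped back on the way out. The Python sketch below illustrates only that general round trip; the helper names (pseudonymize, call_llm, restore) are hypothetical, and this is not OpaquePrompts’ actual API.

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize(prompt: str):
    """Swap sensitive tokens for placeholders and remember the mapping."""
    mapping = {}
    def _swap(match):
        placeholder = f"<PII_{len(mapping)}>"
        mapping[placeholder] = match.group(0)
        return placeholder
    return EMAIL.sub(_swap, prompt), mapping

def restore(text: str, mapping: dict) -> str:
    """Put the original values back into the model's response."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; it simply echoes the sanitized prompt.
    return f"Summary of request: {prompt}"

sanitized, mapping = pseudonymize("Email bob@corp.example the Q3 numbers.")
response = restore(call_llm(sanitized), mapping)
print(sanitized)  # Email <PII_0> the Q3 numbers.
print(response)   # Summary of request: Email bob@corp.example the Q3 numbers.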

The Future: Merging Confidentiality with LLMs

In the rapidly evolving landscape of Large Language Models (LLMs), the intersection of technological prowess and data privacy has emerged as a focal point of discussion. As LLMs, such as ChatGPT, become integral to our digital interactions, the imperative to safeguard user data has never been more pronounced. While these models offer unparalleled efficiency and personalization, they also present challenges in terms of data security and regulatory compliance.

OpaquePrompts is one of many solutions to come that exemplify how data privacy at the prompt layer can be a game-changer. Instead of venturing into the daunting task of self-hosting a foundation model, focusing on prompt-layer privacy provides data confidentiality from the get-go, without the expertise and costs associated with in-house model serving. This simplifies LLM integration and reinforces user trust, underscoring the commitment to data protection.

It is evident that as we embrace the boundless potential of LLMs, a concerted effort is required to ensure that data privacy is not compromised. The future of LLMs hinges on this delicate balance, where technological advancement and data protection coalesce to foster trust, transparency and transformative experiences for all users.

NoSQL Data Modeling Mistakes that Ruin Performance
https://thenewstack.io/nosql-data-modeling-mistakes-that-ruin-performance/ | Wed, 27 Sep 2023 16:00:16 +0000

Getting your data modeling wrong is one of the easiest ways to ruin your performance. And it’s especially easy to screw this up when you’re working with NoSQL, which (ironically) tends to be used for the most performance-sensitive workloads. NoSQL data modeling might initially appear quite simple: just model your data to suit your application’s access patterns. But in practice, that’s much easier said than done.

Fixing data modeling is no fun, but it’s often a necessary evil. If your data modeling is fundamentally inefficient, your performance will suffer once you scale to some tipping point that varies based on your specific workload and deployment. Even if you adopt the fastest database on the most powerful infrastructure, you won’t be able to tap its full potential unless you get your data modeling right.

This article explores three of the most common ways to ruin your NoSQL database performance, along with tips on how to avoid or resolve them.

Not Addressing Large Partitions

Large partitions commonly emerge as teams scale their distributed databases. Large partitions are partitions that grow too big, up to the point when they start introducing performance problems across the cluster’s replicas.

One of the questions that we hear often — at least once a month — is “What constitutes a large partition?” Well, it depends. Some things to consider:

  • Latency expectations:  The larger your partition grows, the longer it will take to be retrieved. Consider your page size and the number of client-server round trips needed to fully scan a partition.
  • Average payload size: Larger payloads generally lead to higher latency. They require more server-side processing time for serialization and deserialization and also incur a higher network data transmission overhead.
  • Workload needs: Some workloads organically require larger payloads than others. For instance, I’ve worked with a Web3 blockchain company that would store several transactions as BLOBs under a single key, and every key could easily get past 1 megabyte in size.
  • How you read from these partitions: For example, a time series use case will typically have a timestamp clustering component. In that case, reading from a specific time window will retrieve much less data than if you were to scan the entire partition.

The following table illustrates the impact of large partitions under different payload sizes, such as 1, 2 and 4 kilobytes.

As you can see, the higher your payload gets under the same row count, the larger your partition is going to be. However, if your use case frequently requires scanning partitions as a whole, then be aware that databases have limits to prevent unbounded memory consumption.

For example, ScyllaDB cuts off pages at every 1MB to prevent the system from potentially running out of memory. Other databases (even relational ones) have similar protection mechanisms to prevent an unbounded bad query from starving the database resources.

To retrieve a payload size of 4KB and 10K rows with ScyllaDB, you would need to retrieve at least 40 pages to scan the partition with a single query. This may not seem a big deal at first. However, as you scale over time, it could affect your overall client-side tail latency.

Another consideration: With databases like ScyllaDB and Cassandra, data written to the database is stored in the commit log and under an in-memory data structure called a “memtable.”

The commit log is a write-ahead log that is never really read from, except when there’s a server crash or a service interruption. Since the memtable lives in memory, it eventually gets full. To free up memory space, the database flushes memtables to disk. That process results in SSTables (sorted strings tables), which is how your data gets persisted.

What does all this have to do with large partitions? Well, SSTables have specific components that need to be held in memory when the database starts. This ensures that reads are always efficient and minimizes wasting storage disk I/O when looking for data. When you have extremely large partitions (for example, we recently had a user with a 2.5 terabyte partition in ScyllaDB), these SSTable components introduce heavy memory pressure, therefore shrinking the database’s room for caching and further constraining your latencies.

How do you address large partitions via data modeling? Basically, it’s time to rethink your primary key. The primary key determines how your data will be distributed across the cluster; choosing it well improves performance as well as resource utilization.

A good partition key should have high cardinality and roughly even distribution. For example, a high cardinality attribute like User Name, User ID or Sensor ID might be a good partition key. Something like State would be a bad choice because states like California and Texas are likely to have more data than less populated states such as Wyoming and Vermont.

Or consider this example. The following table could be used in a distributed air quality monitoring system with multiple sensors:

CREATE TABLE air_quality_data (
   sensor_id text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY (sensor_id, time)
);


With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large, especially if data is gathered every couple of milliseconds. This innocent-looking table can eventually become unusable. In this example, a sensor’s partition reaches a problematic size in only about 50 days.

A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at the updated air_quality_data table:

CREATE TABLE air_quality_data (
   sensor_id text,
   date text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY ((sensor_id, date), time)
);


After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow. This technique is called bucketing, as it allows us to control how much data is stored in partitions.
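
In application code, writes and reads then derive the date bucket from each reading’s timestamp. Here is a minimal Python sketch of that helper; the date format and the CQL strings are illustrative assumptions, and driver setup is omitted.

from datetime import datetime, timezone

def bucket_for(ts: datetime) -> str:
    """Derive the 'date' partition bucket from a reading's timestamp."""
    return ts.strftime("%Y-%m-%d")

now = datetime.now(timezone.utc)
partition_key = ("sensor-42", bucket_for(now))

# Statements the application would issue with its Cassandra/ScyllaDB driver:
insert_cql = (
    "INSERT INTO air_quality_data (sensor_id, date, time, co_ppm) "
    "VALUES (%s, %s, %s, %s)"
)
select_cql = (
    "SELECT co_ppm FROM air_quality_data "
    "WHERE sensor_id = %s AND date = %s AND time >= %s AND time < %s"
)
print(partition_key)

Queries that span more than one day simply fan out over the relevant buckets.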

Bonus: See how Discord applies the same bucketing technique to avoid large partitions.

Introducing Hot Spots

Hot spots can be a side effect of large partitions. If you have a large partition (storing a large portion of your data set), it’s quite likely that your application access patterns will hit that partition more frequently than others. In that case, it also becomes a hot spot.

Hot spots occur whenever a problematic data access pattern causes an imbalance in the way data is accessed in your cluster. One culprit: when the application fails to impose any limits on the client side and allows tenants to potentially spam a given key.

For example, think about bots in a messaging app frequently spamming messages in a channel. Hot spots could also be introduced by erratic client-side configurations in the form of retry storms. That is, a client attempts to query specific data, times out before the database does and retries the query while the database is still processing the previous one.

Monitoring dashboards should make it simple for you to find hot spots in your cluster. For example, this dashboard shows that shard 20 is overwhelmed with reads.

For another example, the following graph shows three shards with higher utilization, which correlates to the replication factor of three, configured for the keyspace in question.

Here, shard 7 introduces a much higher load due to the spamming.

How do you address hot spots? First, use a vendor utility on one of the affected nodes to sample which keys are most frequently hit during your sampling period. You can also use tracing, such as probabilistic tracing, to analyze which queries are hitting which shards and then act from there.

If you find hot spots, consider:

  • Reviewing your application access patterns. You might find that you need a data modeling change such as the previously mentioned bucketing technique. If you need sorting, you could use a monotonically increasing component, such as a Snowflake ID. Or, maybe it’s best to apply a concurrency limiter and throttle down potential bad actors (a minimal client-side limiter sketch follows this list).
  • Specifying per-partition rate limits, after which the database will reject any queries that hit that same partition.
  • Ensuring that your client-side timeouts are higher than the server-side timeouts to prevent clients from retrying queries before the server has a chance to process them (“retry storms”).
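
As a rough illustration of the client-side throttling idea from the first bullet, here is a minimal per-key token-bucket limiter in Python. The rate and burst values, and the shape of the API, are assumptions for the sketch rather than recommended settings.

import time
from collections import defaultdict

class PerKeyLimiter:
    """Allow roughly `rate` requests per key per second, with a small burst."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = defaultdict(lambda: float(burst))
        self.last = defaultdict(time.monotonic)  # first access initializes to "now"

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        # Refill tokens for this key based on elapsed time, capped at the burst size.
        self.tokens[key] = min(
            self.burst, self.tokens[key] + (now - self.last[key]) * self.rate
        )
        self.last[key] = now
        if self.tokens[key] >= 1:
            self.tokens[key] -= 1
            return True
        return False  # drop, queue or back off instead of hammering the partition

limiter = PerKeyLimiter(rate=100, burst=20)  # budget per partition key
if limiter.allow("channel:12345"):
    pass  # safe to issue the query for this hot key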

Misusing Collections

Teams don’t always use collections, but when they do, they often use them incorrectly. Collections are meant for storing/denormalizing a relatively small amount of data. They’re essentially stored in a single cell, which can make serialization/deserialization extremely expensive.

When you use collections, you can define whether the field in question is frozen or non-frozen. A frozen collection can only be written as a whole; you cannot append or remove elements from it. A non-frozen collection can be appended to, and that’s exactly the type of collection that people most misuse. To make matters worse, you can even have nested collections, such as a map that contains another map, which includes a list, and so on.

Misused collections will introduce performance problems much sooner than large partitions, for example. If you care about performance, collections can’t be very large at all. For example, if we create a simple key:value table, where our key is a sensor_id and our value is a collection of samples recorded over time, our performance will be suboptimal as soon as we start ingesting data.

CREATE TABLE IF NOT EXISTS {table} (
    sensor_id uuid PRIMARY KEY,
    events map<timestamp, FROZEN<map<text, int>>>
)


The following monitoring snapshots show what happens when you try to append several items to a collection at once.

You can see that while the throughput decreases, the p99 latency increases. Why does this occur?

  • Collection cells are stored in memory as sorted vectors.
  • Adding elements requires a merge of two collections (old and new).
  • Adding an element has a cost proportional to the size of the entire collection.
  • Trees (instead of vectors) would improve the performance, BUT…
  • Trees would make small collections less efficient!

Returning to that same example, the solution would be to move the timestamp to a clustering key and transform the map into a frozen collection (since you no longer need to append data to it). These very simple changes will greatly improve the performance of the use case.

CREATE TABLE IF NOT EXISTS {table} (
    sensor_id uuid,
    record_time timestamp,
    events FROZEN<map<text, int>>,
    PRIMARY KEY (sensor_id, record_time)
)

Learn More: On-Demand NoSQL Data Modeling Masterclass

Want to learn more about NoSQL data modeling best practices for performance? Take a look at our NoSQL data modeling masterclass — three hours of expert instruction, now on demand and free. You will learn how to:

  • Analyze your application’s data usage patterns and determine which data modeling approach will be most performant for your specific usage patterns.
  • Select the appropriate data modeling options to address a broad range of technical challenges, including the benefits and trade-offs of each option.
  • Apply common NoSQL data modeling strategies in the context of a sample application.
  • Identify signs that indicate your data modeling is at risk of causing hot spots, timeouts and performance degradation — and how to recover.

CelerData Upends Real-Time Data Analytics with Dynamic Table Joins
https://thenewstack.io/celerdata-upends-real-time-data-analytics-with-dynamic-table-joins/ | Tue, 26 Sep 2023 14:25:05 +0000

The shift to real-time analytics, infrastructure, and architecture is impacting organizations across industries and use cases. Whether it involves Internet of Things deployments like digital twins or wearables, horizontal concerns like supply chain management, or fraud detection and recommendation engines in AdTech, the need to analyze and act on data with low latency is only increasing.

The most accomplished OLAP databases for such tasks are written in C++ to accommodate these performance needs. Many integrate with streaming data platforms like Apache Flink or Spark Streaming to handle the preprocessing their architectures require for such timely analytics.

Regardless of the particular approach or database involved in such matters, there’s no getting around one simple fact that’s consistently proved determinative for real-time OLAP databases.

In almost all cases, the data being analyzed lives in more than one table.

“Aside from analyzing logs, or analyzing user behavior, and sometimes not even that, for every other scenario you actually need joins,” revealed CelerData product marketing manager Sida Shen. “There’s really not that many scenarios where you don’t need joins.”

CelerData’s real-time, open source OLAP database StarRocks is one of the few options in this space that dynamically performs join operations on tables with low latency data. Because of its architecture, this real-time database is considerably more flexible, swifter, and cost-effective than many of its competitors are — which produces tremendous advantages for users when it’s deployed at an enterprise scale.

On-the-Fly Joins

According to Shen, StarRocks’ ability to rapidly perform dynamic joins on real-time data is “unique” among OLAP databases in this field. From an architectural perspective, this advantage is largely due to the fact that StarRocks “has a natively built cost-based optimizer,” Shen remarked, which supports scalable join operations. Typically, other OLAP databases can only process single table queries on real-time data and require preprocessing to join tables so organizations can query across them.

Considering the speed and sizes of the data in real-time analytics use cases, preprocessing for joins is “one of the most expensive things you can do with OLAP databases, joining two large tables,” Shen commented. Since StarRocks can join tables on the fly for these low latency use cases, its users avoid those costs and the time spent denormalizing their tables to facilitate joins. “Data lake engines can do joins because they do ETL jobs, but real-time OLAP databases give up on that because it needs a lot of optimization on the query planning side,” Shen explained. “Our architecture supports joins internally.”

Denormalization Realities

Without the capability to dynamically join tables, other OLAP databases for real-time analytics account for this fact with denormalization processing that frequently entails platforms like Spark Streaming or Flink. “Denormalization is when you pre-join your tables together based on your query pattern,” Shen specified. After the tables are joined into a flat table in the preprocessing platform, the latter table is ingested and analyzed in the real-time OLAP database. It’s not uncommon for organizations to generate copious amounts of code for these operations, and that code can be fragile.

“This is where it gets very complicated,” Shen admitted. “It’s very difficult to configure, it breaks a lot, and it requires a lot of resources. Just a lot of maintenance, and this is on the cost side, like hardware and man-hour costs.” Moreover, when schema changes arise, there’s a definite possibility of having to redo this preparation work. In that case, “you have to reconfigure the entire pipeline and sometimes you need to backfill all of the data for your flat table,” Shen observed. “Because one thing changes, the whole flat table can change.”

Architectural Advantages

Organizations can avoid such inflexibility, costs, and time preprocessing their tables by employing a real-time OLAP database that joins tables at enterprise scale for instant data analysis. StarRocks’ architecture enables it to support in-memory data shuffling, which helps with joins and complicated aggregation operations. Data shuffling becomes influential in distributed environments in which “one of the challenges is to send the data to the appropriate nodes, so the nodes can get the data and they all do their part,” Shen noted. “Data shuffling is, basically, you shuffle the two. Let’s say you join two tables and shuffle all the data on the join key to all of the nodes.”

This operation allows organizations to perform scalable joins. Without it, users would have to attempt what Shen termed a “broadcast join” that involves replicating a smaller table and sending it to all the nodes. According to Shen, for CelerData’s real-time OLAP competitors, “The most they can do without shuffling is to have a big table join a very tiny table on a cluster that’s not very big. But we can do a big table joining a big table or any other kind.”

Additionally, because StarRocks is based on C++, some of its performance gains — which become palpable when competing with other Java-based query engines like Presto or Trino for directly querying data lakes — are based on its utilization of Single Instruction, Multiple Data (SIMD) instructions. With SIMD, “you process multiple data points with one instruction, so you touch your memory a lot less by executing one query,” Shen said. This increased efficiency is characteristic of OLAP databases predicated on C++; Shen mentioned it’s not possible with Java-based options.

The End of Table Denormalization?

A real-time OLAP database that dynamically joins tables whenever organizations specify it has considerable consequences for real-time analytics. On the one hand, it could herald an end to denormalization and the time, effort, and costs denormalization exacts from organizations to pre-join tables according to specific query patterns. On the other, it could signal an era in which there’s much more flexibility for real-time databases to adjust to changes in schema, source data, and business requirements. Either way, this capability could further advance the usefulness of real-time data analytics.

5 Hard Problems in Vector Search, and How Cassandra Solves Them
https://thenewstack.io/5-hard-problems-in-vector-search-and-how-cassandra-solves-them/ | Fri, 22 Sep 2023 16:48:25 +0000

Vector search is a critical component of generative AI tooling because retrieval augmented generation (RAG) techniques like FLARE help LLMs incorporate up-to-date, customized information while avoiding hallucinations. At the same time, vector search is a feature, not a product — you need to query vectors as they relate to the rest of your data, not in isolation, and you shouldn’t need to build a pipeline to sync the rest of your data with a vector store to do that.

This year, we have seen an explosion in vector search products and projects, making selecting among them a serious effort.  As you research the options, you’ll need to consider the following hard problems and the different approaches to solving them. Here, I’ll walk you through these challenges and describe how DataStax tackled them for our implementation of vector search for DataStax Astra DB and Apache Cassandra.

The Curse of Dimensionality

At the core of these hard problems lies what researchers call “the curse of dimensionality.” What this means in practice is that algorithms that work for exact vector search in 2D or 3D space, like k-d trees, fall apart when you get to 10s or 100s or 1000s of dimensions in your vectors. The result is that there is no shortcut for exact similarity search with high-dimensional vectors; to get logarithmic-time results, we need to use approximate nearest neighbor (ANN) algorithms, which bring with them challenges in the following areas.
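
To see why exact search stops being viable, consider the brute-force version: it has to compare the query against every stored vector, so cost grows linearly with both the dataset size and the dimensionality. A tiny, illustrative NumPy sketch (random data and invented sizes):

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 768)).astype(np.float32)  # the "database"
query = rng.normal(size=768).astype(np.float32)

# Exact k-NN: measure the distance to every stored vector, then take the top k.
distances = np.linalg.norm(vectors - query, axis=1)
top_k = np.argsort(distances)[:10]
print(top_k, distances[top_k])

ANN indexes exist to avoid that full scan while keeping recall acceptably high.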

Problem 1: Scale-out

Many vector search algorithms were designed for datasets that fit in memory on a single machine, and this is still the only scenario tested by ann-benchmarks. (Ann-benchmarks further restricts its testing to a single core!) This might be okay for academic work of a million documents or rows, but it’s been many years since that could be considered representative of real-world workloads.

As with any other domain, scale-out requires both replication and partitioning, as well as subsystems to handle replacing dead replicas, repairing them after a network partition, and so forth.

This was an easy one for us: scale-out replication is Cassandra’s bread and butter, and combining that with the new-in-Cassandra-5.0 SAI (Storage-Attached Indexing — see CEP-7 for how it works, and the SAI documentation for how to use it) gave our vector search implementation powerful scale-out capabilities effectively for free.

Problem 2: Efficient Garbage Collection

By “garbage collection,” I mean removing obsolete information from the index — cleaning out deleted rows and dealing with rows whose indexed vector value has changed. This might not seem worth mentioning — it’s been a more or less solved problem for forty years in the relational database world — but vector indexes are unique.

The key problem is that all the algorithms we know of that provide both low-latency searches and high-recall results are graph-based. There are lots of other vector indexing algorithms available — FAISS implements many of them — but all of them are either too slow to build, too slow to search, or offer recall that’s too low (and sometimes all three!) to be a general-purpose solution. That’s why every production vector database that I know of uses a graph-based index, the simplest of which is HNSW.  Hierarchical Navigable Small World graphs were introduced by Yury Malkov et al in 2016; the paper is quite readable and I highly recommend it. More on HNSW below.
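
To give a feel for how a graph index is used in practice, here is a small example with the hnswlib library, a standalone HNSW implementation (unrelated to JVector or Lucene); the parameter values are illustrative only.

import hnswlib
import numpy as np

dim, num_elements = 128, 50_000
data = np.random.rand(num_elements, dim).astype(np.float32)

index = hnswlib.Index(space="cosine", dim=dim)
# M controls graph connectivity; ef_construction trades build time for recall.
index.init_index(max_elements=num_elements, M=16, ef_construction=200)
index.add_items(data, np.arange(num_elements))

index.set_ef(64)  # search-time candidate list size: higher means better recall, more work
labels, distances = index.knn_query(data[:5], k=10)
print(labels.shape)  # (5, 10): ten approximate neighbors per query vector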

The challenge with graph indexes is that you can’t just rip the old (vector-associated) node out when a row or document changes; if you do that more than a handful of times, your graph will no longer be able to perform its purpose of directing BFS (breadth-first search) toward areas of greater similarity to a query vector.

So you’ll need to rebuild your indexes at some point to perform this garbage collection, but when — and how — do you organize it? If you always rebuild everything when changes are made, you will massively increase the physical writes performed; this is called write amplification. On the other hand, if you never rebuild then you’ll do extra work filtering out obsolete rows at query time (“read amplification”).

This is another problem space that Cassandra has been working on for years. Since SAI indexes are attached to the main storage lifecycle, they also participate in Cassandra compaction, which logarithmically increases the storage units to provide a healthy balance between reads and writes.

Sidebar: Cloud Application Workloads

DataStax Astra DB builds on Apache Cassandra to provide a platform for cloud application workloads. That means workloads that are:

  • Massively concurrent: thousands to millions of requests per second, usually to retrieve just a few rows apiece. This is why you couldn’t run Netflix on Snowflake, even if you could afford it: Snowflake and similar analytical systems are designed to handle only a few concurrent requests that each run for many seconds to minutes or even longer.
  • Larger than memory: If your dataset fits in memory on a single machine, it almost doesn’t matter what tools you use. SQLite, MongoDB, MySQL — they’ll all work fine. Things get more challenging when that’s not the case — and the bad news is that vector embeddings are usually several KB, or around an order of magnitude larger than typical database documents, so you’ll get to larger-than-memory relatively quickly.
  • Core to the application: If you don’t care if you lose your data, either because it’s not that important or because you can rebuild it from the actual source of record, then again it kind of doesn’t matter what tools you use. Databases like Cassandra, and Astra DB, are built to keep your data available and durable no matter what.

Problem 3: Concurrency

I mentioned above that the well-known ann-benchmarks comparison limits all algorithms to a single core. While this levels the playing field, it also handicaps those who can take advantage of the major source of hardware improvement over the last two decades.

A related problem is that ann-benchmarks only performs one type of operation at a time: first, it builds the index, then it queries it. Dealing with updates interleaved with searches is optional — and likely even a handicap; if you know that you don’t need to deal with updates, you can make a lot of simplifying assumptions that look good on artificial benchmarks.

If you care about being able to do multiple concurrent operations with your index, or update it after it’s built, then you need to look a little deeper to understand how it works and what tradeoffs are involved.

First, all general-purpose vector databases that I know of use graph-based indexes. That’s because you can start querying a graph index as soon as the first vector is inserted. Most other options require you to either build the entire index before querying it, or at least do a preliminary scan of the data to learn some statistical properties.

However there are still important implementation details even within the graph index category. For example, we thought at first that we could save time by using Lucene’s HNSW index implementation, the way MongoDB and Elastic and Solr do. But we quickly learned that Lucene only offers single-threaded, non-concurrent index construction. That is, you can neither query it as it is being built (which should be one of the primary reasons to use this data structure!) nor allow multiple threads to build it concurrently.

The HNSW paper suggests that a fine-grained locking approach can work, but we went one better and built a non-blocking index. This is open sourced in JVector.

JVector scales concurrent updates linearly to at least 32 threads. This graph is log-scaled on both the x and y axes, showing that doubling the thread count halves the build time.

More importantly, JVector’s non-blocking concurrency benefits more realistic workloads mixing searches with updates. Here is a comparison of Astra DB’s performance (using JVector) compared to Pinecone across different datasets.  While Astra DB is about 10% faster than Pinecone for a static dataset, it is 8x to 15x faster while also indexing new data. We picked the best available pod tier with Pinecone (Pod Type: p2 and Pod Size: x8, with two pods per replica) based on their recommendations for higher throughput and lower latency. Pinecone does not disclose what physical resources this corresponds to. On the Astra DB side, we picked the default PAYG deployment model and did not have to worry about choosing the resources as it is serverless. Tests were performed using NoSQLBench.

Astra DB does this while maintaining higher recall and precision as well (F1 is a combination of recall and precision).

Problem 4: Effective Use of Disk

We started with the HNSW graph indexing algorithm because it’s fast to build the index, fast to query, highly accurate, and straightforward to implement. However, it has a well-known downside: it requires a lot of memory.

An HNSW index is a series of layers, where each layer above the base layer has roughly 10% as many nodes as the previous. This enables the upper layers to act as a skip list, allowing the search to zero in on the right neighborhood of the bottom layer that contains all of the vectors.

However, this design means that (in common with all graph indexes) you can’t get away with saying “the disk cache will save us,” because, unlike normal database query workloads, every vector in the graph has an almost equal chance of being relevant to a search. (The exception is the upper layers, which we can and do cache.)

Here’s what a profile of serving a 100M vector dataset of Wikipedia article chunks (about 120GB on disk) looked like on my desktop with 64GB of RAM, back when we were using Lucene:

Cassandra is spending almost all of its time waiting to read vectors off of the disk.

To solve this problem, we implemented a more advanced algorithm called DiskANN and open-sourced it as a standalone embedded vector search engine, JVector. (Specifically, JVector implements the incremental version of DiskANN described in the FreshDiskANN paper.) Briefly, DiskANN uses a single-level graph with longer edges than HNSW and an optimized layout of vectors and neighbors to reduce disk IOPS and keeps a compressed representation of vectors in memory to speed up similarity computations. This results in tripling the throughput for the Wikipedia workload.

Here is what HNSW versus DiskANN looks like in a purely embedded scenario with no client/server components. This measures the speed of searching the Deep100M dataset under Lucene (HNSW) and JVector (DiskANN).  The Lucene index is 55GB, including the index and the raw vectors. The JVector index is 64GB. The search was performed on my 24GB MacBook, which has about one-third as much memory as it would take to hold the index in RAM.

Problem 5: Composability

Composability in the context of database systems refers to the ability to seamlessly integrate various features and capabilities in a coherent manner. This is particularly significant when discussing the integration of a new category of capability, like vector search. Non-toy applications will always require both classic CRUD database features as well as the new vector search.

Consider the simple AI chatbot sample application for Astra DB. This is about as pure of a RAG application as you will find, using vector search to surface appropriate documentation to the LLM to respond to user questions. However, even a simple demo like this still needs to make “normal,” non-vector queries to Astra DB to retrieve the conversation history, which must also be included with every request to the LLM so that it can “remember” what has already taken place. So the key queries include:

  1. Find the most relevant documents (or document fragments) for the user’s question
  2. Retrieve the last twenty messages from the user’s conversation

In a more realistic use case, one of our solutions engineers was recently working with a company in Asia that wanted to add semantic search to their product catalog, but also wanted to enable term-based matching. For example, if the user searches for [“red” ball valve] then they want to restrict the search to items whose description matches the term “red”, no matter how semantically similar the vector embeddings are. The key new query then (on top of classic functionality like session management, order history, and shopping cart updates) is thus: Restrict the products to those that contain all quoted terms, then find the most similar to the user’s search.

This second example makes it clear that applications not only need both classic query functionality and vector search, but also that they often need to be able to use elements of each in the same query.

The current state of the art in this young field is to try to do what I’m calling classic queries in a “normal” database, vector queries in a vector database, and then stitching the two together in an ad hoc fashion when both are required at the same time. This is error-prone, slow, and expensive; its only virtue is that you can make it work until you have a better solution.

In Astra DB, we’ve built (and open sourced) that better solution, on top of Cassandra SAI. Because SAI allows for creating custom index types that are all tied to the Cassandra SSTable and compaction life cycle, it is straightforward for Astra DB to allow developers to mix and match boolean predicates, term-based search, and vector search, with no overhead of managing and synchronizing separate systems. This gives developers building generative AI applications more sophisticated query capabilities that drive greater productivity and faster time-to-market.
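
As a rough sketch of what that looks like from application code, here is a Python example using the Cassandra driver to combine a boolean predicate with an approximate-nearest-neighbor ordering in a single CQL statement. The keyspace, table and column names are invented, both columns are assumed to carry SAI indexes, and the CQL approximates the Cassandra 5.0/Astra DB ANN syntax; check the SAI documentation for the authoritative form.

from cassandra.cluster import Cluster

# Connect to a local Cassandra 5.0 node; "catalog" is a made-up keyspace.
session = Cluster(["127.0.0.1"]).connect("catalog")

query_embedding = [0.12, 0.07, 0.33]  # would come from an embedding model

rows = session.execute(
    """
    SELECT product_id, name, description
    FROM products
    WHERE category = %s
    ORDER BY embedding ANN OF %s
    LIMIT 10
    """,
    ("valves", query_embedding),
)
for row in rows:
    print(row.product_id, row.name)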

Conclusion

Vector search engines are an important new database feature with multiple architectural challenges, including scale-out, garbage collection, concurrency, effective use of disk, and composability. I believe that in building vector search for Astra DB we have been able to leverage Cassandra’s capabilities to deliver a best-in-class experience for developers of generative AI applications. Learn more about Astra DB here, or if you want to get up close and personal with vector search algorithms, check out JVector.

PostgreSQL 16 Expands Analytics Capabilities
https://thenewstack.io/postgresql-16-expands-analytics-capabilities/ | Fri, 22 Sep 2023 15:00:22 +0000

The recent release of PostgreSQL 16 is significant for a number of reasons. It enables more flexible access control mechanisms, which have immediate consequences for deployments involving Managed Service Providers (MSPs).

Version 16 also supports hot standby capabilities which, when a standby serves as the source for logical replication, “allow for new architectures,” affirmed Adam Wright, Senior Product Manager for EnterpriseDB (EDB). EDB was one of the foremost code contributors to PostgreSQL 16, an open source relational database management system.

Most importantly, however, the latest version of PostgreSQL includes analytics functionality for facilitating complicated aggregation and windowing queries. This enhancement, when paired with the database’s extensions for managing geospatial and vector data, respectively, is perhaps the latest indicator of the increasing relevance of transactional databases for analytics.

“I think what you’re starting to see is the need for specialty data warehousing is starting to get lower,” Wright reflected. “You might have extreme ends of the data warehousing market, where having a specialty system is necessary. But that’s really starting to become the extreme end and, for a lot of use cases, you can just use Postgres and not need that specialty system.”

Analytics

Although PostgreSQL is widely used in transactional systems, its implementation of what Wright called the “any value function” in version 16 has definite analytics overtones. This function is part of the SQL:2023 standard. “This function is really mainly used for analytical databases,” Wright revealed. “Complex aggregations/windowing queries is kind of the subheading for what you can use that for.”

This particular function allows administrators and developers to do calculations across a set of rows in a table. “You might compare how a calculation for one row is done against another row and get some aggregation of those two,” Wright explained. “So, things like getting a running total and doing that easily through a few lines of SQL.” This feature is valuable for use cases such as comparing product types for stock ordering in retail, particularly across different locations represented in a large table.
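
As a concrete illustration of the kind of query Wright describes, here is a short Python sketch that uses psycopg2 to run a running-total window query against PostgreSQL; the table and column names are invented for the example.

import psycopg2

conn = psycopg2.connect("dbname=retail user=analyst")  # connection details assumed
cur = conn.cursor()

# Running total of units sold per store, ordered by day: a few lines of SQL
# instead of application-side loops over the full result set.
cur.execute("""
    SELECT store_id,
           sale_date,
           units,
           SUM(units) OVER (PARTITION BY store_id ORDER BY sale_date) AS running_total
    FROM daily_sales
    ORDER BY store_id, sale_date
""")
for store_id, sale_date, units, running_total in cur.fetchall():
    print(store_id, sale_date, units, running_total)

cur.close()
conn.close()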

Vector and Geospatial Data

According to Wright, it’s often more efficient to manage this task at the database level than at the application level. With the latter approach, administrators or developers would have to write more code than they otherwise would, as well as write multiple functions. Then, they’d have to bring the data back and compare it, instead of filtering everything out of the database server. However, “if I’m on the database server and I write this aggregation query, I’m comparing different rows in the table,” Wright mentioned. “Once that’s all done, I’m going to stream back only the records that are necessary to the application.”

A PostgreSQL extension, PostGIS, enables users to store and query geospatial data. Wright referenced another extension for managing vector workloads that includes storing vector data and supporting vector operators. “It’s just another use case that’s available in Postgres that you don’t need to go to a specialty database vendor and get another contract, and another support, and have to onboard whatever you may need to actually support those workloads,” Wright commented.

Logical Replications

Although the extensions Wright mentioned are not part of the new capabilities unveiled in PostgreSQL 16, they attest to the database’s growing analytics usefulness, which coincides with the recently released aggregation and windowing function. The new edition also includes enhancements to its logical replication capabilities, the most substantial of which is “that you’re able to do logical replication from a standby server,” Wright remarked. With this paradigm, users can do logical replication from what Wright termed a hot standby, which he described as involving a physical replication of every change in PostgreSQL data to a target system — frequently for high availability.

“With logical replication, instead of getting all the changes from the database, you might just want to replicate a couple of tables, a sales table, an orders table, and feed them to these different systems,” Wright observed. Logical replication enables users to specify the tables that are replicated, which, when combined with the ability to do so from standby systems, broadly expands the possibilities for replications and architectural approaches. “You can have cascading multiple replications and things like that,” Wright said. “You can do everything from one system, but can also get subsets of the data from Postgres more easily, get into read-only systems… There’s just a lot more architectures that are going to be supported because of these native features.”
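
A rough sketch of the moving parts, again using psycopg2: the publication covers just the two tables Wright mentions, and the host names and connection strings are placeholders. The publication is created on the primary (and carried to any physical standby with the rest of the database), while PostgreSQL 16 allows the subscriber’s CONNECTION string to point at a hot standby instead of the primary.

import psycopg2

# On the primary: publish only the tables of interest.
src = psycopg2.connect("host=primary.example dbname=shop")
src.autocommit = True
src.cursor().execute("CREATE PUBLICATION sales_pub FOR TABLE sales, orders")

# On the downstream (read-only or reporting) system: subscribe. With
# PostgreSQL 16 the connection can target a hot standby rather than the primary.
dst = psycopg2.connect("host=reporting.example dbname=shop")
dst.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block
dst.cursor().execute(
    "CREATE SUBSCRIPTION sales_sub "
    "CONNECTION 'host=standby.example dbname=shop' "
    "PUBLICATION sales_pub"
)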

Superuser Improvements

PostgreSQL 16 also enhances the degree and scope of control associated with superusers — users with a considerable amount of privileges for data and system access. In previous versions, superusers had the latitude to do almost anything, including tampering with the underlying “operating system as the service that’s running as Postgres,” Wright admitted. “This is a big problem for managed data services.” Consequently, managed service providers (including some of the hyperscalers) would “fork” or replicate the database, users, and settings from the original cluster to another.

According to Wright, this process may lead to “bugs and security issues.” Consequently, the more refined superuser controls in the most recent version of PostgreSQL enable “you granular management of privileges and to delegate tasks that are needed to manage the database for DBAs, but not give them things that let them break out of the database,” Wright said. “Or, you can manage a role but not necessarily manage the data for that role.”

The Database Space

The new additions to PostgreSQL 16 are, for the most part, indicative of developments that are impacting the database space as a whole. Systems that were conventionally used for transactional purposes are taking on more analytics responsibilities. PostgreSQL’s native support of an aggregate function for writing aggregation and windowing queries — and extensions for workloads pertaining to geospatial and vector data — is perhaps a harbinger of a future in which the traditional divide between transactional and analytics databases is not so pronounced.

Battling the Steep Price of Storage for Real-Time Analytics
https://thenewstack.io/battling-the-steep-price-of-storage-for-real-time-analytics/ | Fri, 22 Sep 2023 13:00:59 +0000

Nowadays, customers demand that database providers offer massive amounts of data storage for real-time analytics. For many use cases, the amount of data that these users are working with requires large amounts of storage.

Plus, this storage needs to be readily accessible and fast. Manufacturers, healthcare providers, climate change scientists, and various other use cases need to access data stored in memory caches in real time, while simultaneously leveraging historical data relevant to that data point.

Adding AI into this mix increases the amount of data companies have to deal with exponentially. The generation of predictive models results in applications calculating more data inferences, which, in turn, creates even more data.

As organizations seek to achieve greater observability into their systems and applications, they’re tasked with collecting more data from more devices — such as industrial Internet of Things (IoT) devices and aerospace telemetry. In many cases, these sources generate data at high resolutions, which increases storage costs even more.

“The fact of the matter is that companies have a lot more data coming in and the gap between what it was, even a few years ago, and what it looks like today is orders of magnitude wider,” Rick Spencer, vice president of products at InfluxData, told The New Stack.

While real-time data analytics alone requires cutting-edge database and streaming technologies, the cost of storage to meet these demands remains too high for many, if not most, organizations.

“Customers just have so much data these days,” Spencer said. “And they have two things they want to do with it: act on it and perform analytics on it.”

Acting on it in real time requires users to write automation that detects and responds to any change in activity. This can range from spacecraft wobbling to increasing error rates in shopping carts – whatever users need to detect in order to respond quickly.

“The other thing they want to do is perform historical analytics on that data. So, the dilemma that customers faced in the past is over what data to keep, because attempting to keep all the data becomes extremely expensive.”

With that in mind, let’s look at some of the technology challenges that real-time data analytics pose and offer more details about the associated storage cost conundrum. We’ll also explore InfluxDB 3.0, the latest version of InfluxData’s leading time series database, which promises to reduce data storage costs by up to 90%.

The latest iteration of the InfluxDB 3.0 product suite, InfluxDB Clustered, delivers these capabilities for self-managed environments.

Real-Time Evolution

The capacity to execute queries against vast amounts of data is typically a key requirement for large-scale real-time data analytics.

InfluxDB 3.0, InfluxData’s columnar time series database, is purpose-built to handle this. Users can conduct historical queries or analytical queries across multiple rows. These queries might consist of calculating the mean or moving average for all rows in large columnar datasets. The time needed to do so could be measured in milliseconds, even when retrieving data from objects.

However, Spencer noted, InfluxData’s customers demand a lot from its databases. “Our users tend to push the limits of our query capabilities,” he said. “If there was a query, say, across a month of data that used to time out but doesn’t now, they’ll run it. So the question isn’t necessarily about how slow the queries are but rather, how much data you can query based on your requirements.”

Previously, InfluxDB 1.x and 2.x releases provided exceptionally fast data transfers for tag value matching. However, in 1.x and 2.x, it was challenging to perform analytic queries or to store data other than metrics, such as logs and traces.

By contrast, the new InfluxDB 3.0, which was released for general availability in January, provides those capabilities.

For queries against large data sets, it might take 40 seconds to access data such as logs and traces with InfluxDB 3.0, where those same queries would have timed out in earlier versions. Queries against smaller data sets complete in milliseconds, Spencer said.

“Now we can handle much more than metrics, resulting in cost savings as you can consolidate various databases,” he said.

The cost savings come into even more direct play with the recent InfluxDB Clustered release that added a final mile to Influx 3.0 capabilities.

The idea here is to keep data in object storage, instead of in an attached local disk, like traditional databases do. Object stores cost about 1/13th the price of an attached disk, Spencer said.

Efficient Data Compression, Enhanced Performance

Among the main features of InfluxDB 3.0 are four components, which handle:

  • Data ingestion.
  • Data querying.
  • Data compaction.
  • Garbage collection.

The main components of InfluxDB 3.0. (Source: InfluxData)

With InfluxDB Clustered, organizations can extend InfluxDB 3.0’s capabilities to on-premises and private cloud environments. These core capabilities consist of what InfluxData says is unlimited cardinality, high-speed ingest, real-time querying and very efficient data compression, to realize the 90% reduction in storage costs that low-cost object storage and separation of compute and storage offer.

InfluxDB 3.0 also heavily uses Parquet files. This is an open source, column-oriented data file format developed for efficient data storage and retrieval. It is designed to provide efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

A significant advantage of Parquet files, Spencer said, is that their specification is maintained by a highly skilled community of developers and is designed to compress analytical data efficiently.

“Given your time series use case, we can make specific assumptions that allow for substantial compression,” he said. ”Parquet files become quite compact due to their columnar structure. It turns out that as data accumulates, a columnar database generally compresses much more efficiently.”
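As a rough, hedged illustration of that columnar compression effect, the sketch below writes the same synthetic time series to CSV and to Parquet and compares the file sizes. The dataset is fabricated, and real-world savings depend on the shape and cardinality of the data.

```python
# Rough illustration of columnar compression: write the same synthetic time
# series to CSV and to Parquet and compare the resulting file sizes.
# Assumptions: the dataset is fabricated and the savings shown here are only
# indicative; real-world results depend on the data's shape and cardinality.
import os

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": pd.date_range("2023-09-01", periods=100_000, freq="s"),
    "sensor": np.random.choice(["a", "b", "c"], size=100_000),
    "value": np.random.normal(0.0, 1.0, size=100_000),
})

df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", compression="zstd")  # requires pyarrow

print("csv bytes:    ", os.path.getsize("sample.csv"))
print("parquet bytes:", os.path.getsize("sample.parquet"))
```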

Storage Costs: A Drop from $8 Million to $150,000 per Year

One InfluxData customer was spending $8 million annually on storage costs. The customer was concerned that this cost would severely impact its business.

“However, adopting InfluxDB 3.0 reduced their storage costs to approximately $150,000 per year,” Spencer said. “Consider what this means for a business — transitioning from an $8 million budget to $150,000 is truly remarkable and highly beneficial for their business.

“With this approach, I can tell customers that even if their budget only allows for $10,000, and they’re currently spending $100,000 to retain their full data fidelity, they may be able to afford to keep all their data.”

Driving the Time Series Market Forward

InfluxDB 3.0 takes several giant leaps forward when it comes to performance, including data compression. Not only is the database itself able to compress data smaller than previous versions, but its persistence format compounds that benefit because Apache Parquet is designed for optimized compression of columnar data.

Taken together, these improvements can drastically reduce an organization’s financial commitment to data storage. It also means that InfluxDB enables users to store more of the data they want, to easily manage that data, and — most importantly — to generate value from that data in real time.

The post Battling the Steep Price of Storage for Real-Time Analytics appeared first on The New Stack.

]]>
MySQL HeatWave Gets Generative AI and JavaScript, Slew of New Features https://thenewstack.io/mysql-heatwave-gets-generative-ai-and-javascript-slew-of-new-features/ Thu, 21 Sep 2023 17:18:48 +0000 https://thenewstack.io/?p=22718388

As the Oracle CloudWorld conference takes place in Las Vegas this week, Oracle‘s MySQL team is announcing a number of

The post MySQL HeatWave Gets Generative AI and JavaScript, Slew of New Features appeared first on The New Stack.

]]>

As the Oracle CloudWorld conference takes place in Las Vegas this week, Oracle‘s MySQL team is announcing a number of enhancements to the HeatWave platform that shore up its core functionality; add capabilities in the realm of generative AI; enhance support for the data lakehouse approach to analytics data management, autonomous operation, and in-database machine learning; and address core programmability and performance on the OLTP side, too.

Developer Goodies

The MySQL team briefed the media by starting on the analytics side, and leaving the developer-oriented features for last. As far as readers of The New Stack are concerned, I say they buried the lede, so I’m going to kick off with what the MySQL team left until last: goodies for developers including JSON acceleration and JavaScript-based stored procedures and functions.

JSON support in the base MySQL platform allows JSON data to be materialized in binary and text columns in tables or in virtual columns. It also allows JSON payloads to be passed to stored procedures and functions as arguments. MySQL supports use of its MongoDB API-compatible XDevAPI on the client side and numerous programming languages can be used in the MySQL shell to manipulate the JSON data on the input or output side. But now JSON data can be brought into HeatWave, where it is stored in binary format, partitioned, compressed up to 3x and scaled across nodes. The MySQL team says simple filter queries can be accelerated up to 20x, aggregation queries up to 22x and large join queries up to 144x.
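For readers who have not used MySQL's JSON support, here is a minimal sketch of the base-platform behavior described above: storing JSON documents in a JSON column and filtering on a field inside them. The table, columns and connection details are hypothetical; HeatWave's acceleration applies once the data is loaded into the cluster.

```python
# Minimal sketch of base MySQL JSON support as described above: store JSON in
# a JSON column, then filter on a field inside the document.
# Assumptions: table name, columns and connection details are hypothetical;
# HeatWave's acceleration applies once the table is loaded into the cluster.
import mysql.connector  # MySQL Connector/Python

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="demo")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT AUTO_INCREMENT PRIMARY KEY,
        doc JSON
    )
""")
cur.execute("INSERT INTO orders (doc) VALUES (%s)",
            ('{"customer": "acme", "total": 42.50}',))

# The ->> operator (shorthand for JSON_UNQUOTE(JSON_EXTRACT(...))) pulls
# fields out of the stored document.
cur.execute("SELECT doc->>'$.total' FROM orders "
            "WHERE doc->>'$.customer' = 'acme'")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```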

Moving on from the JavaScript Object Notation format to the JavaScript language itself, stored procedures in HeatWave can now be coded in that language, in addition to the long-supported use of SQL. SQL is a declarative, set-based language, which can make it hard to perform more imperative tasks. JavaScript stored procs and functions eliminate this constraint and are called and used in exactly the same way as SQL-based ones, be it in queries, views, data manipulation language (DML) commands or data definition language (DDL) commands.

Data type conversions between the two languages are implemented implicitly. The JavaScript code executes in a GraalVM virtual machine, which provides for secure/sandboxed use of compute and memory, and which blocks direct network and file system access.

Lakehouse Enhancements

Now let’s move on to HeatWave’s lakehouse capabilities, as there are a few dimensions to it. First off, HeatWave is adding support for the Apache Avro data file format to its existing compatibility with CSV and Apache Parquet formats. The functionality includes support for multiple compression algorithms, across which the team says performance is consistent. Avro support also includes — via HeatWave’s “Autopilot” assistance feature — schema inference, cluster capacity estimation for data load operations, and a time estimate for same.

What’s key in this announcement is that HeatWave now supports an optimized data format for row-oriented data. Compare this with the unoptimized text-based CSV and the column-oriented Parquet format and you can see that Oracle’s MySQL team is paying attention to OLTP workloads, in addition to the analytical workload support that was HeatWave’s original hook. Meanwhile, that analytical side would benefit from support for the Delta, Iceberg and/or Hudi open table formats that build on top of the Parquet standard.

Next on the lakehouse side is support for HeatWave on the Amazon Web Services cloud. This means data in any of the three supported formats that any customer may already have in Amazon’s S3 object storage is now available for processing with HeatWave. Even though HeatWave itself runs in Oracle’s own AWS account, connectivity to data in the customer’s account is still provided. Adding S3 data to HeatWave can be done simply by providing an ENGINE = LAKEHOUSE clause in a CREATE TABLE command, and that command can itself be auto-generated by Autopilot, leveraging the schema inference we’ve already discussed.
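A hedged sketch of what that looks like: only the ENGINE = LAKEHOUSE clause below comes from the announcement, the table and column definitions are made up, and the options that map the table to specific S3 files are omitted because, as noted above, Autopilot can generate the full statement via schema inference.

```python
# Sketch of exposing S3 data to HeatWave Lakehouse via CREATE TABLE.
# Only the ENGINE = LAKEHOUSE clause comes from the announcement; the table
# and column definitions are hypothetical, and the options that point the
# table at specific S3 objects are omitted here -- Autopilot can generate
# the full statement using schema inference.
ddl = """
CREATE TABLE trips (
    pickup_time DATETIME,
    fare DECIMAL(8, 2)
) ENGINE = LAKEHOUSE;
"""

# In practice this statement would be executed against a HeatWave DB system
# (for example, via mysql.connector as in the JSON sketch above).
print(ddl)
```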

AutoML Enhanced, Now Encompasses Generative AI

Moving on to the world of AI, HeatWave’s AutoML (automated machine learning) can leverage this S3 data access, including the new Avro support, to build machine learning models that reside in HeatWave and are trained on HeatWave data. HeatWave AutoML also supports recommendation models, beyond other AutoML platforms’ typical support for classification, regression, clustering/anomaly detection and time-series forecasting models.

With respect to competition, Oracle claims HeatWave’s training times are 25x faster than those for Amazon Redshift, with the implication that HeatWave is a better analytics database for AWS than AWS’ own data warehouse offering. And beyond Redshift, Snowflake’s SnowPark ML provides a bridge to scikit-learn and doesn’t provide any built-in AutoML, according to the MySQL team.

There’s generative AI support in MySQL AutoML too, and it takes a couple of forms, including support for Large Language Models (LLMs) and a built-in vector store. On the LLM side, HeatWave can use BERT and Tfidf to generate embeddings from the content of text columns in the database and submit them to the AutoML engine, alongside numerical representations of data in conventional scalar data columns. From all these inputs, tuned models are produced.

Documents in object storage factor in as well, as vector embeddings for them can be stored and indexed in the HeatWave vector store. Together, these features lead to more contextual answers to generative AI queries, as data in the vector store can be used to augment the prompts sent to the LLM.

Autonomous Autopilot

Moving on to HeatWave’s Autopilot, which uses AI to implement autonomous operation, or assistance with advanced features, the team has added support for Autopilot indexing, auto unload, auto compression, and adaptive query execution. The last of these, according to the MySQL team, dynamically adjusts data structures and system resources even after query execution has begun, to accommodate the actual distribution of the data observed as the query engine encounters it. The MySQL team reports first-run performance improvement of between 10% and 25% as a result of adaptive query execution.

Autopilot indexing is a machine learning-driven service that recommends secondary indexes for OLTP workloads, and includes suggesting new indexes as well as pointing out superfluous (e.g. unused or duplicate) indexes that should be dropped. Autopilot indexing takes both queries and DML operations — like UPDATE, INSERT and DELETE — into account. The service also predicts both storage requirements and performance, and it provides explanations for its recommendations.

Auto load and unload moves data from a conventional MySQL database into and out of the HeatWave cluster, based on frequency of access, helping developers avoid performing these operations manually. Auto-column compression will mix and match compression algorithms on a per-column basis, finding the right balance between memory usage and performance. The company claims memory savings of between 6% and 25% and performance increases between 6% and 10%. The fact that there can be improvement on both the memory and perf axes, rather than making developers choose between them, is an impressive testimonial to the value of algorithmic optimization.

And More

Other capabilities include a bulk data ingest/load feature, partitioning, analytics functions, SET operations, and availability on multiple clouds (Amazon Web Services, Microsoft’s Azure and Oracle Cloud Infrastructure). These and all the other capabilities discussed here should ensure continued momentum for MySQL HeatWave that Oracle says it has seen in the digital marketing, gaming, healthcare and fintech sectors. This is a real smorgasbord of capabilities, demonstrating that Oracle views MySQL as a strategic asset in its portfolio. Does Oracle Database itself rule the roost? Maybe. But MySQL, with its decades-long ecosystem, its huge community, and its modular, pluggable engine architecture, has found new life in the cloud, in analytics, in machine learning, and now in generative AI.

The post MySQL HeatWave Gets Generative AI and JavaScript, Slew of New Features appeared first on The New Stack.

]]>
Oracle Introduces New App Analytics Platform, Enhances Analytics Cloud https://thenewstack.io/oracle-introduces-new-app-analytics-platform-enhances-analytics-cloud/ Thu, 21 Sep 2023 13:47:03 +0000 https://thenewstack.io/?p=22718386

At its Oracle CloudWorld conference in Las Vegas this week, Oracle is introducing a range of new analytics capabilities. In addition

The post Oracle Introduces New App Analytics Platform, Enhances Analytics Cloud appeared first on The New Stack.

]]>

At its Oracle CloudWorld conference in Las Vegas this week, Oracle is introducing a range of new analytics capabilities. In addition to its core Oracle Database, MySQL and MySQL HeatWave businesses, Oracle focuses on analytics and applications. As such, the new analytics capabilities it is announcing accrue to both its Oracle Analytics Cloud (OAC) platform as well as the value-added functionality for Oracle applications that run atop that platform.

A Full Data Intelligence Platform

It’s with respect to the latter that Oracle is announcing the new Fusion Data Intelligence Platform. This new service is an evolution of the Fusion Analytics platform that preceded it, but in addition to Fusion Analytics’ semantic models that are defined and materialized in Oracle Analytics Cloud, the new service includes 360-degree data models, analytic artifacts, AI and BI models and pre-built intelligent apps.

Those pre-built apps bring in data models, ML models and analytics, designed to be accessible to people who don’t currently use self-service BI, and prefer to stay a level of abstraction above it. Oracle demoed a “Supply Chain Command Center” application as an example. It was a full-blown browser-based application with BI and AI capabilities already implemented and built in.

External Data too, All in the Lakehouse

Like Fusion Analytics, Fusion Data Intelligence Platform is not an island. For example, it will allow the addition of external data and will link to the likes of Salesforce, LinkedIn, and other external services with business-relevant data. On the Oracle applications side, Fusion Data Intelligence Platform will tie into Oracle Netsuite, Oracle Health and Oracle Industries applications. Fusion Data Intelligence Platform also integrates with, and includes an instance of, OAC, which Fusion Analytics did as well.

All data will land in a single Oracle Cloud Infrastructure (OCI) data lakehouse with a semantic model, ML models, etc. and OAC tie-ins. Though the lakehouse will feature a single model, it will be broken into multiple “subject areas” for specific target audiences.

OAC Gets AI

It’s not only at the Fusion Data Intelligence Platform level where Oracle has added AI capabilities. After all, Fusion Data Intelligence Platform is a layer above OAC, where Oracle has added AI capabilities as well.

OAC now has an Analytics Assistant, offering a chatbot interface on your data, with links to public data via ChatGPT. In partnership with Synthesia, the Assistant features avatars that can act as “news readers” to deliver data stories verbally to business decision-makers.

AI-Powered Document Understanding can scan JPEG and PDF files and extract values and context. One example Oracle mentioned for applying this in practice was reading individual receipt images to ensure their totals match the data in expense reports.

Narratives, Teams Integration, and the Business User Strategy

Contextual Insights implements natural language generation to provide narratives of users’ data. It’s similar in concept to Power BI’s Smart Narratives and Data Stories/narratives in Tableau. OAC now also integrates with Microsoft Teams, letting users bring OAC dashboards, visualizations, and insights into Teams channel chats. The functionality provided is similar to the previously introduced integration of OAC with Slack.

The range of capabilities added to Oracle’s Analytics platform should greatly benefit Oracle Applications customers. While customers might think of Power BI or Tableau when the subject of analytics comes up, Oracle is making it unnecessary to bring in third-party platforms when it comes to AI- and BI-driven insights on its applications’ data. Its goal is to go beyond self-service analytics and instead just surface analytics capabilities in business users’ tools. Clearly, Oracle is delivering in that area.

The post Oracle Introduces New App Analytics Platform, Enhances Analytics Cloud appeared first on The New Stack.

]]>
Developers: Is Your API Designed for Attackers? https://thenewstack.io/developers-is-your-api-designed-for-attackers/ Wed, 20 Sep 2023 18:24:40 +0000 https://thenewstack.io/?p=22718723

When an organization has a security problem with an API, it’s usually one it built internally, according to Jeremy Snyder,

The post Developers: Is Your API Designed for Attackers? appeared first on The New Stack.

]]>

When an organization has a security problem with an API, it’s usually one it built internally, according to Jeremy Snyder, founder and CEO of API security firm FireTail.io.

The security firm analyzed 40 public breaches to see what role APIs played in security problems, which Snyder featured in his 2023 Black Hat conference presentation. The issue might be built-in vulnerabilities, misconfigurations in the API, or even a logical flaw in the application itself — and that means it falls on developers to fix it, Snyder said.

“It’s a range of things, but it is generally with their own APIs,” Snyder told The New Stack. ”It is in their domain of influence, and honestly, their domain of control, because it is ultimately down to them to build a secure API.”

The number of breaches analyzed is small — it was limited to publicly disclosed breaches — but Snyder said the problem is potentially much more pervasive.

“First of all, 83% of all internet requests, if not more, are API requests,” he said. “It’s not the total volume of traffic. It’s the number of requests that are flowing across the Internet day to day, more than four-fifths of all requests are actually API requests, not user-initiated queries.”

In the last couple of months, he said, security researchers who work on this space have uncovered billions of records that could have been breached through poor API design. He pointed to the API design flaws in basically every full-service carrier’s frequent flyer program, which could have exposed entire datasets or allowed for the awarding of unlimited miles and hotel points.

“We’ve seen a few very, very high-profile examples,” he said. “Effectively the entire connected car ecosystem has had API design flaws that could have exposed not only the owners of all of these vehicles, [but that] allows you to update the owner records, allows you to unlock and start these vehicles and drive them away.”

Snyder explained some of the top API problems and outlined best practices developers can use to improve APIs.

Common API Flaws

Insecure direct object reference, or IDOR, is a common problem, Snyder said. It allows someone with a legitimate user’s access to manipulate the API request to access another user’s data.

“That is a super common — that may be, on its own, the single number one problem that we see consistently across the data set,” he said.

Another common problem is excessive data exposure, in which the API returns too much data. For instance, a page might have a photo, your name, an address, whatever, and the API sends everything — including personal data. Then the developer relies on either the mobile app or the web browser to hide all the data that wasn’t requested.

“Of course, bad actors don’t play by those rules,” he said. “They’re not going to go through your app or go through your web interface to try to scrape data from your API.”
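To make the flaw and its fix concrete, here is a minimal sketch of server-side field filtering: the handler returns only an explicit allowlist of fields rather than the full record and trusting the client to hide the rest. The record and field names are hypothetical.

```python
# Hedged sketch of avoiding excessive data exposure: the API handler returns
# only an explicit allowlist of fields rather than the full user record.
# The record and field names are hypothetical.
FULL_USER_RECORD = {
    "id": 42,
    "name": "Ada",
    "photo_url": "https://example.com/ada.png",
    "home_address": "10 Downing St",   # personal data that should not leak
    "password_hash": "(redacted)",     # never belongs in an API response
}

PUBLIC_FIELDS = {"id", "name", "photo_url"}

def public_profile(record: dict) -> dict:
    """Return only the fields the endpoint is meant to expose."""
    return {key: value for key, value in record.items() if key in PUBLIC_FIELDS}

print(public_profile(FULL_USER_RECORD))
```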

Developers aren’t doing this on purpose, but mistakes happen when other pressures mount, he added.

“I don’t think any developer sets out to intentionally return too much data or to intentionally build a bad API,” he said. “But I think there’s a trade-off between how quickly I can build something — speed and convenience versus the security and privacy considerations.”

Best Practices to Fix API Flaws

Write a specification. Very few developers start from absolute zero when they’re building an API, Snyder noted. Typically, they’ll use a common open source framework for building that API. Part of that initial work should include a specification file governing how the API should work, he said.

Use common tools. Don’t try to create your own kind of identity and authentication mechanisms, Snyder said. “There’s everything from WebAuthn to single sign-on mechanisms and the less you make yourself build and design around identity, the higher the chances that you could get it right easily by leveraging a proven solution,” he said.

Think about the data. “Think about designing your API in a way that doesn’t expose too much and also is like checking an authorization for each data request,” Snyder suggested. Sometimes, developers push that authorization check to the frontend on a mobile client or Internet of Things device. In one famous case, the authorization was happening inside the logic of a Peloton exercise bike. “Again, you know, hackers don’t play by those rules, so they went straight to the Peloton API using the scripting language,” he said. “They just started manipulating authorization requests and they were able to extract about 3 million records.”
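Here is a minimal sketch of that per-request, server-side authorization check; the in-memory data and helper names are hypothetical stand-ins for a real framework and database.

```python
# Minimal sketch of a server-side, per-request object authorization check
# (the defense against IDOR-style flaws described above). The in-memory
# "database" and helper names are hypothetical stand-ins.
WORKOUTS = {
    "w1": {"owner_id": "alice", "calories": 310},
    "w2": {"owner_id": "bob", "calories": 450},
}

class Forbidden(Exception):
    pass

def get_workout(requesting_user_id: str, workout_id: str) -> dict:
    record = WORKOUTS[workout_id]
    # The authorization check runs on the server for every request -- never
    # only in the mobile client or device firmware.
    if record["owner_id"] != requesting_user_id:
        raise Forbidden(f"user {requesting_user_id} may not read {workout_id}")
    return record

print(get_workout("alice", "w1"))      # allowed: alice owns w1
try:
    get_workout("alice", "w2")         # blocked: alice is not the owner
except Forbidden as err:
    print("denied:", err)
```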

The post Developers: Is Your API Designed for Attackers? appeared first on The New Stack.

]]>
A Long Time Ago, on a Server Far, Far Away… https://thenewstack.io/a-long-time-ago-on-a-server-far-far-away/ Wed, 20 Sep 2023 16:36:07 +0000 https://thenewstack.io/?p=22718613

Calling a programming language Rust almost seems like a misnomer. Rust is the brittle byproduct of corrosion — not something

The post A Long Time Ago, on a Server Far, Far Away… appeared first on The New Stack.

]]>

Calling a programming language Rust almost seems like a misnomer. Rust is the brittle byproduct of corrosion — not something that would typically inspire confidence. But fortunately, software developers have very different concerns from metallurgists. In the digital realm, Rust is a game changer.

The following is a brief case study that explores the logistics and motivations that would lead a successful company to spend time and resources completely rewriting the core of their flagship product in Rust. InfluxData makes InfluxDB, the leading time series database in the world. When it comes to time series data use cases, the 1.x and 2.x iterations of InfluxDB are great for metrics. They’re able to handle analytics use cases to a certain extent, but there was always a danger of high-cardinality data impacting database performance.

The vision for InfluxDB is not to simply master metrics, but to provide solutions for all time series use cases. To achieve this, developers needed to solve the cardinality problem. Doing so would throw open the floodgates for time series data and InfluxDB.

As developers sought solutions for the cardinality problem, it became clear that to achieve their desired end they needed to rewrite significant portions of InfluxDB’s core. They needed to build a columnar database and, as a company with roots in, and a commitment to open source, turned to Apache Arrow for the columnar framework of the new database version. Versions 1.x and 2.x were written in Go, but for this new stack InfluxData founder and CTO Paul Dix saw an opportunity to try something different. Enter Rust.

Why Rust?

Rust has many attractive features for developers. The real-time nature of time series data brings with it significant performance demands, and Rust has the inherent performance capabilities to meet them. For example, Rust offers fearless concurrency: an approach to systems programming that enforces discipline around how threads share and mutate data, helping developers mitigate or eliminate subtle concurrency bugs in their code. Another benefit of this approach is that it makes applications easier to refactor without introducing new bugs. The borrow checker is another critical aspect of Rust. It enforces the language’s ownership and borrowing rules, helping developers manage memory safely: variables must be initialized before use, and a value that has been moved cannot be used again unintentionally.

Some additional perks of using Rust include the fact that its libraries can export a foreign function interface (FFI) compatible with many different programming languages. This provides extensibility and interoperability that makes Rust a major potential value-add to a wide range of applications. Rust uses the Crates.io packaging system, which gives developers everything they need right out of the box. In Rust, errors are first-class citizens and developers don’t have to deal with a garbage collector.

Rust also gives developers more control over runtimes than many other languages. Its async/await support is much more advanced than that of other languages like JavaScript. In JavaScript, for example, users can’t control the order of asynchronous functions when they execute in Node.js. Async runtimes are runtimes optimized to execute async functions in specific environments. In Rust, however, developers have granular control over the execution order of asynchronous functions using async runtimes.

This just scratches the surface for the advantages of Rust. However, memory management and runtime control are two contributing factors that led to InfluxData’s decision to build its new database engine in Rust.

Rust Challenges

While Rust presents a lot of advantages, it has its share of challenges as well. The most significant is its steep learning curve. It is a uniquely designed programming language, complete with its own design patterns. In some cases, these unique qualities are driven by the very capabilities that make Rust appealing, like the borrow checker. Developers with a background limited to dynamic programming languages, such as Python, Ruby or JavaScript, tend to have a harder time learning Rust than developers with a background in static programming languages, like C++ or Swift.

Another sticking point that developers must adapt to is Rust’s lengthy compile time. This puts pressure on developers to write code that optimizes compile time. But the uphill climb might just be worth it because Dix believes that developers and companies will write more and more high-performance server software in Rust moving forward.

Supporting a Shift to Rust

Hiring seasoned Rust developers may not always be an option as demand for them continues to increase. So, it’s important for individuals and companies alike to tap into available resources that will help mitigate the language’s steep learning curve. Rust is an open source language with a growing community supporting it, so leaning on the community is a great starting point for motivated developers.

Rust Results

InfluxData set out to expand the analytical capability of its leading time series database and used Rust to accomplish that task. InfluxDB 3.0 is the result of years of research and development. It takes several key database concepts and applies them to the time series use case. Columnar databases aren’t new. Neither is the idea of separating storage and compute. But combining these concepts for the time series use case results in a database that can drive both monitoring and real-time analytics projects at scale.

InfluxDB 3.0 can handle data with unlimited cardinality, can scale compute and storage separately and supports native SQL queries. Its performance gains compared to previous versions of InfluxDB OSS are seismic. With a “hot” storage tier for leading-edge data, users can perform real-time analytics. And the combination of a columnar database and the use of Apache Parquet as its persistence format ushers in drastic data compression gains. Using low-cost cloud object storage for “cold” data can save users up to 90% on storage costs, all while enabling them to keep more, high-granularity data for longer periods.

Rust was a key difference maker in the creation of InfluxDB 3.0. While the decision to rewrite the database core was a major one, the end results speak for themselves. Thanks to Rust, InfluxDB is poised to remain atop the time series category for the foreseeable future.

Try InfluxDB for yourself and see what a difference Rust makes.

The post A Long Time Ago, on a Server Far, Far Away… appeared first on The New Stack.

]]>
How Apache Flink Delivers for Deliveroo https://thenewstack.io/how-apache-flink-delivers-for-deliveroo/ Wed, 20 Sep 2023 13:54:21 +0000 https://thenewstack.io/?p=22717952

Deliveroo is a leading food delivery company that builds tools and services using Apache Flink. The Flink framework offers a

The post How Apache Flink Delivers for Deliveroo appeared first on The New Stack.

]]>

Deliveroo is a leading food delivery company that builds tools and services using Apache Flink.

The Flink framework offers a distributed processing engine for “stateful computations over unbounded and bounded data streams.”

Deliveroo has a three-sided marketplace. They have delivery drivers, restaurants, and customers who place orders on the application to get food or groceries delivered to their door. Joining us to discuss using Apache Flink and the Amazon Managed Service for Apache Flink were two engineers from Deliveroo: Felix Angell and Duc Anh Khu.

Deliveroo sought to do more real-time streaming. They explored behaviors to understand the customer journey and how those customers use the Deliveroo application.

That meant modernizing to a more stable platform. The old platform could scale up but not down due to earlier decisions made about the technology. They looked at other services such as Apache Spark and Kafka Streams. But Flink had feature parity with the legacy platform, and large companies used Flink for production.

Deliveroo started experimenting with Flink as a proof-of-concept on Kubernetes. They used third-party operators but found many needed more support and were not maintained. They turned to the Amazon Managed Service for Flink (MSF), which allowed the Deliveroo team to focus on its core responsibilities, such as CI/CD and taking updates to production.

Angell said he would like more Deliveroo teams using Apache Flink. Duc said they move very fast to roll out the latest product. And that means their modeling could be more flexible to adapt to that demand.

Flexibility comes with a cost, he said. Sometimes, you need to remodel things, though other times, you need to normalize the data model. It would help make that process easier for teams to do.

“And for me, one of the features that we would like to see is a self-serve configuration from MSF, so that we can just tweak some of the low-level configuration as well as auto-scaling requirements based on application metrics,” Duc said.

Learn more about Apache Flink and AWS:

Kinesis, Kafka and Amazon Managed Service for Apache Flink

Apache Flink for Real Time Data Analysis

The post How Apache Flink Delivers for Deliveroo appeared first on The New Stack.

]]>
Implementing High-Performance Ad Tech Demand-Side Platforms (DSPs) https://thenewstack.io/implementing-high-performance-ad-tech-demand-side-platforms-dsps/ Tue, 19 Sep 2023 15:30:55 +0000 https://thenewstack.io/?p=22718446

Tightening the Latency Gap As companies become more data and software-driven, there is a push by technology leadership to reduce

The post Implementing High-Performance Ad Tech Demand-Side Platforms (DSPs) appeared first on The New Stack.

]]>

Tightening the Latency Gap

As companies become more data and software-driven, there is a push by technology leadership to reduce the latency gap between the time data is produced and when it is actioned. We see this in many industries and use cases:

  • fraud prevention in retail and financial services, including payment processing;
  • personalization in retail, gaming and advertising;
  • route optimization for transportation and
  • mitigating emergent threats in cybersecurity.

All of these examples have real consequences if data is delayed, so it’s important to reduce the total time to action to stay competitive.

In this article, we’ll focus on a use case that brings billions of data points and thousands of companies together in milliseconds to make instant decisions: real-time online ad bidding.

The old advertising saying is that “half the money I spend on advertising is wasted; the trouble is I don’t know which half.” But data is a lot easier to come by in the digital age. For example, ad exchanges can anonymously share that visitors on a given web property have recently visited both StubHub and Lionel Messi’s Instagram page. A paid ad for an Inter Miami jersey on that page is money well spent and is much more likely to create a happy customer than an untargeted ad for a jersey from a different team or sport.

Interactions like this — the decisions to evaluate, buy and place ads — happen in milliseconds, millions of times a day. Critical to these transactions is the free flow of data for timely actions, often facilitated by real-time databases and streaming data platforms. This article unveils how Demand Side Platforms (DSPs), in particular, can implement a solution leveraging the capabilities of Redpanda (a streaming data platform) and Aerospike (a real-time database) to achieve success in the fast-paced environment of real-time bidding.

What Is Aerospike?

Aerospike is a real-time database that ingests, stores and retrieves data, handling millions of transactions per second (TPS) throughput with sub-millisecond latency. Event streams are one of the frequent sources of data ingested into an Aerospike database. Aerospike can ingest events directly, or through its connectors, to streaming data platforms like Redpanda, Apache Kafka, Apache Pulsar and others.

It’s common to ingest data into Aerospike from a streaming topic, run several evaluations of that data and publish data to another topic, all in milliseconds. In ad tech, one key use case for Aerospike is as a user profile store. This store holds extensive anonymized information about individual users, from preferences to past interactions.

What Is Redpanda?

You can think of Redpanda as a rebuilt Kafka. Kafka, one of the most popular streaming platforms, ensures seamless communication among various parts of the online ecosystem. Redpanda uses the Kafka API to ensure compatibility with the existing ecosystem but is rewritten from the ground up in C++ to maximize modern hardware utilization. One of Redpanda’s selling points is its ability to reliably handle large spikes in volume, supporting up to multiple gigabytes (GB) per second on average.

Demand-Side Platforms in Real-Time Bidding

To understand how DSPs fit into the ad tech landscape, we need a glimpse into real-time bidding (RTB). In RTB, advertisers compete to display their ads on websites in real time, which entails a complex engagement between supply-side platforms (SSPs), DSPs, websites, and ad exchanges. DSPs play a crucial role by representing advertisers and making split-second decisions to bid on ad slots. DSPs must differentiate themselves by swiftly processing data, targeting the right audience and optimizing bids to win the auction.


Figure 1: Real-time bidding overview

 

A Real-Time Data Architecture for DSPs

In the interconnected world of DSPs, multiple key data components come into play: user profiles, available ad slots, historical data, and ad creatives. DSPs need to rapidly create profiles of potential users based on available data, assess ad slot opportunities from SSP or ad exchanges, and craft bids that align with user preferences and active campaigns. This process requires efficient data movement and rapid decision-making.

This is where Redpanda and Aerospike come in. Redpanda provides a highly performant streaming engine, enabling lumpy spikes in data volumes to move through the DSP’s internal environment and across the rest of the Ad Tech ecosystem in milliseconds. Aerospike serves as the data historian, delivering speedy retrieval of data lookups such as users, devices and sessions. It swiftly retrieves user profiles and past interactions, allowing DSPs to craft bids tailored to individual preferences.

Figure 2: Real-time data architecture for a DSP

As illustrated in Figure 2, DSPs incorporate information from ad exchanges and the publisher side, such as available ad formats and anonymized user interests and geolocation data. Some DSPs budget just 10 milliseconds to load all this data to inform their bidding process. Aerospike can ingest data in under a millisecond and will typically connect directly to an ad exchange’s API.

There are many examples where it makes more sense to decouple the data producer from the data consumer. In these cases Aerospike can use its Kafka Connector to publish the ad exchange data to a Redpanda topic, enabling multiple services within the DSP to subscribe to the topic. A pricing optimization service, for example, could augment new real-time data with historical information about targeted users to help predict the ad spend for upcoming bids.
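Because Redpanda speaks the Kafka API, any standard Kafka client can subscribe to the topic the connector publishes to. Below is a hedged sketch using the kafka-python client; the broker address, topic name and message layout are assumptions.

```python
# Hedged sketch: a DSP-side service subscribing to the topic that the
# Aerospike Kafka Connector publishes ad exchange data to. Redpanda exposes
# the Kafka API, so a standard Kafka client works. The broker address, topic
# name and message layout are hypothetical.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "ad-exchange-events",                      # hypothetical topic name
    bootstrap_servers="redpanda:9092",         # hypothetical broker address
    group_id="pricing-optimization",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    # A pricing-optimization service would augment this real-time event with
    # historical user data before estimating the spend for an upcoming bid.
    print(event.get("user_id"), event.get("ad_slot"))
```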

Redpanda provides a Kafka API that consistently delivers p50 latency as low as 5 milliseconds. Some of the DSP’s external data will go directly onto a Redpanda topic, such as data on impressions and bidding results. Aerospike can also be a subscriber to any of these topics, as it extends the historical database.

DSPs harness the Kafka API to swiftly process incoming data, assess ad opportunities and generate bids. The real-time bidding environment demands speed, accuracy and the ability to handle immense data flows. The combination of Aerospike and Redpanda provides both real-time data flow and the nearly instantaneous retrieval of past data to help DSPs make informed, tailored bids. This movement and integration of data ensures that bid requests, ad opportunities and decisions flow seamlessly, enabling timely ad placements.

Conclusion: Mastering the Real-Time Landscape

In the dynamic world of real-time bidding, Demand Side Platforms (DSPs) depend on powerful, modern platforms and efficient tooling to move and synthesize disparate data in mere milliseconds, ultimately making the best bid for their clients. In one such architecture, Redpanda’s performant Kafka API synchronizes seamlessly with Aerospike’s lightning-fast retrieval of user profiles, ensuring that DSPs are armed with all the information they need to craft real-time bids that are highly personalized for users. With these platforms in their arsenal, DSPs can navigate the intricate real-time bidding ecosystem with agility and precision, paving the way for successful ad placements and enriched user experiences.

The post Implementing High-Performance Ad Tech Demand-Side Platforms (DSPs) appeared first on The New Stack.

]]>
How to Get the Right Vector Embeddings https://thenewstack.io/how-to-get-the-right-vector-embeddings/ Mon, 18 Sep 2023 13:09:38 +0000 https://thenewstack.io/?p=22718285

Vector embeddings are critical when working with semantic similarity. However, a vector is simply a series of numbers; a vector

The post How to Get the Right Vector Embeddings appeared first on The New Stack.

]]>

Vector embeddings are critical when working with semantic similarity. However, a vector is simply a series of numbers; a vector embedding is a series of numbers representing input data. Using vector embeddings, we can structure unstructured data or work with any type of data by converting it into a series of numbers. This approach allows us to perform mathematical operations on the input data, rather than relying on qualitative comparisons.

Vector embeddings are influential for many tasks, particularly for semantic search. However, it is crucial to obtain the appropriate vector embeddings before using them. For instance, if you use an image model to vectorize text, or vice versa, you will probably get poor results.

In this post, we will learn what vector embeddings mean, how to generate the right vector embeddings for your applications using different models and how to make the best use of vector embeddings with vector databases like Milvus and Zilliz Cloud.

How Are Vector Embeddings Created?

Now that we understand the importance of vector embeddings, let’s learn how they work. A vector embedding is the internal representation of input data in a deep learning model, also known as an embedding model or a deep neural network. So, how do we extract this information?

We obtain vectors by removing the last layer and taking the output from the second-to-last layer. The last layer of a neural network usually outputs the model’s prediction, so we take the output of the second-to-last layer. The vector embedding is the data fed to a neural network’s predictive layer.

The dimensionality of a vector embedding is equivalent to the size of the second-to-last layer in the model and, thus, interchangeable with the vector’s size or length. Common vector dimensionalities include 384 (generated by Sentence Transformers Mini-LM), 768 (by Sentence Transformers MPNet), 1,536 (by OpenAI) and 2,048 (by ResNet-50).

What Does a Vector Embedding Mean?

Someone once asked me about the meaning of each dimension in a vector embedding. The short answer is nothing. A single dimension in a vector embedding does not mean anything, as it is too abstract to determine its meaning. However, when we take all dimensions together, they provide the semantic meaning of the input data.

The dimensions of the vector are high-level, abstract representations of different attributes. The represented attributes depend on the training data and the model itself. Text and image models generate different embeddings because they’re trained for fundamentally different data types. Even different text models generate different embeddings. Sometimes they differ in size; other times, they differ in the attributes they represent. For instance, a model trained on legal data will learn different things than one trained on health-care data. I explored this topic in my post comparing vector embeddings.

Generate the Right Vector Embeddings

How do you obtain the proper vector embeddings? It all starts with identifying the type of data you wish to embed. This section covers embedding five different types of data: images, text, audio, videos and multimodal data. All models we introduce here are open source and come from Hugging Face or PyTorch.

Image Embeddings

Image recognition took off in 2012 after AlexNet hit the scene. Since then, the field of computer vision has witnessed numerous advancements. The latest notable image recognition model is ResNet-50, a 50-layer deep residual network based on the earlier ResNet-34 architecture.

Residual neural networks (ResNet) solve the vanishing gradient problem in deep convolutional neural networks using shortcut connections. These connections allow the output from earlier layers to go to later layers directly without passing through all the intermediate layers, thus avoiding the vanishing gradient problem. This design makes ResNet less complex than VGGNet (Visual Geometry Group), a previously top-performing convolutional neural network.

I recommend two ResNet-50 implementations as examples: ResNet 50 on Hugging Face and ResNet 50 on PyTorch Hub. While the networks are the same, the process of obtaining embeddings differs.

The code sample below demonstrates how to use PyTorch to obtain vector embeddings. First, we load the model from PyTorch Hub. Next, we remove the last layer and call .eval() to instruct the model to behave like it’s running for inference. Then, the embed function generates the vector embedding.
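The original code sample is not reproduced here, so the following is a hedged reconstruction of the steps described: load ResNet-50 from PyTorch Hub, drop the final classification layer, switch to eval mode and embed an image. The image path is hypothetical, and a recent torchvision release is assumed.

```python
# Hedged reconstruction of the PyTorch steps described above: load ResNet-50
# from PyTorch Hub, drop the final classification layer, call .eval() and
# produce a 2,048-dimensional embedding. The image path is hypothetical and
# a recent torchvision release is assumed.
import torch
from PIL import Image
from torchvision import transforms

model = torch.hub.load("pytorch/vision", "resnet50", weights="DEFAULT")
model = torch.nn.Sequential(*list(model.children())[:-1])  # remove last layer
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert("RGB")
    batch = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        return model(batch).squeeze()  # shape: (2048,)

print(embed("cat.jpg").shape)
```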

HuggingFace uses a slightly different setup. The code below demonstrates how to obtain a vector embedding from Hugging Face. First, we need a feature extractor and model from the transformers library. We will use the feature extractor to get inputs for the model and use the model to obtain outputs and extract the last hidden state.
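Again, the referenced sample is not reproduced here; below is a hedged reconstruction using the transformers library, where the microsoft/resnet-50 checkpoint and the image path are assumptions.

```python
# Hedged reconstruction of the Hugging Face steps described above: a feature
# extractor prepares the inputs, the model produces outputs, and the last
# hidden state serves as the embedding. The checkpoint name and image path
# are assumptions.
import torch
from PIL import Image
from transformers import AutoFeatureExtractor, AutoModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/resnet-50")
model = AutoModel.from_pretrained("microsoft/resnet-50")

image = Image.open("cat.jpg").convert("RGB")
inputs = extractor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pool (e.g. average) the last hidden state to obtain a flat embedding vector.
embedding = outputs.last_hidden_state
print(embedding.shape)
```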

Text Embeddings

Engineers and researchers have been experimenting with natural language and AI since the invention of AI. Some of the earliest experiments include:

  • ELIZA, the first AI therapist chatbot.
  • John Searle’s Chinese Room, a thought experiment that examines whether the ability to translate between Chinese and English requires an understanding of the language.
  • Rule-based translations between English and Russian.

AI’s handling of natural language has evolved significantly from those rule-based beginnings. Starting with simple neural networks, we added recurrence relations through RNNs to keep track of steps in time. From there, we used transformers to solve the sequence transduction problem.

Transformers consist of an encoder, which encodes an input into a matrix representing the state, an attention matrix and a decoder. The decoder decodes the state and attention matrix to predict the correct next token to finish the output sequence. GPT-3, the most popular language model to date, is a decoder-only model: its stacked decoders encode the input and predict the appropriate next token(s).

Here are two models from the sentence-transformers library by Hugging Face that you can use in addition to OpenAI’s embeddings:

  • A MiniLM model such as all-MiniLM-L6-v2, which produces 384-dimensional embeddings.
  • An MPNet model such as all-mpnet-base-v2, which produces 768-dimensional embeddings.

You can access embeddings from both models in the same way.
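A hedged sketch of doing so, assuming the standard sentence-transformers checkpoints that match the dimensionalities mentioned earlier (all-MiniLM-L6-v2 at 384 and all-mpnet-base-v2 at 768):

```python
# Hedged sketch: generating text embeddings with the sentence-transformers
# library. The checkpoints below are the standard MiniLM and MPNet models
# (384 and 768 dimensions respectively); the sentences are made up.
from sentence_transformers import SentenceTransformer

sentences = ["Vector embeddings encode meaning.", "Time series data grows fast."]

for checkpoint in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(checkpoint)
    embeddings = model.encode(sentences)
    print(checkpoint, embeddings.shape)  # (2, 384) and (2, 768)
```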

Multimodal Embeddings

Multimodal models are less well-developed than image or text models. They often relate images to text.

The most useful open source example is CLIP ViT, a model that maps images and text into a shared embedding space. You can access CLIP ViT’s embeddings in the same way as you would an image model’s, as shown in the code below.
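The referenced code is not reproduced here, so below is a hedged sketch using the transformers library; the openai/clip-vit-base-patch32 checkpoint and the image path are assumptions.

```python
# Hedged sketch: obtaining CLIP ViT image embeddings via transformers. The
# checkpoint name and image path are assumptions; the same model also embeds
# text into the shared space via get_text_features.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # shape: (1, 512)

print(image_embedding.shape)
```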

Audio Embeddings

AI for audio has received less attention than AI for text or images. The most common use case for audio is speech-to-text for industries such as call centers, medical technology and accessibility. One popular open source model for speech-to-text is Whisper from OpenAI. The code below shows how to obtain vector embeddings from the speech-to-text model.
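The referenced code is not reproduced here; the following is a hedged sketch that runs Whisper's audio encoder directly to obtain embeddings. The audio path is hypothetical, and the exact embedding shape depends on the model size.

```python
# Hedged sketch: extracting audio embeddings from OpenAI's open source
# Whisper model by running its audio encoder directly. The audio path is
# hypothetical (decoding requires ffmpeg), and the embedding shape depends
# on the model size chosen.
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("base")

audio = whisper.load_audio("speech.wav")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    embedding = model.embed_audio(mel.unsqueeze(0))  # encoder output

print(embedding.shape)
```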

Video Embeddings

Video embeddings are more complex than audio or image embeddings. A multimodal approach is necessary when working with videos, as they include synchronized audio and images. One popular video model is the multimodal perceiver from DeepMind. This notebook tutorial shows how to use the model to classify a video.

To get the embeddings of the input, use outputs[1][-1].squeeze() from the code shown in the notebook instead of deleting the outputs. I highlight this code snippet in the autoencode function.

Storing, Indexing and Searching Vector Embeddings with Vector Databases

Now that we understand what vector embeddings are and how to generate them using various powerful embedding models, the next question is how to store and take advantage of them. Vector databases are the answer.

Vector databases like Milvus and Zilliz Cloud are purposely built for storing, indexing and searching across massive datasets of unstructured data through vector embeddings. They are also one of the most critical infrastructures for various AI stacks.

Vector databases usually use the Approximate Nearest Neighbor (ANN) algorithm to calculate the spatial distance between the query vector and vectors stored in the database. The closer the two vectors are located, the more relevant they are. Then the algorithm finds the top k nearest neighbors and delivers them to the user.
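To ground the idea, the brute-force sketch below scores stored vectors against a query by cosine similarity and returns the top k; vector databases approximate exactly this search with ANN indexes at much larger scale, and the random vectors here are stand-ins for real embeddings.

```python
# Brute-force illustration of the nearest-neighbor search that vector
# databases approximate with ANN indexes: score stored vectors against a
# query by cosine similarity and return the top k. The random vectors are
# stand-ins for real embeddings.
import numpy as np

rng = np.random.default_rng(0)
stored = rng.normal(size=(10_000, 384))   # pretend these are 384-dim embeddings
query = rng.normal(size=384)

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    scores = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    return np.argsort(scores)[::-1][:k]   # indices of the k most similar vectors

print(top_k_cosine(query, stored))
```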

Vector databases are popular in use cases such as LLM retrieval augmented generation (RAG), question and answer systems, recommender systems, semantic searches, and image, video and audio similarity searches.

To learn more about vector embeddings, unstructured data and vector databases, consider starting with the Vector Database 101 series.

Summary

Vectors are a powerful tool for working with unstructured data. Using vectors, we can mathematically compare different pieces of unstructured data based on semantic similarity. Choosing the right vector-embedding model is critical for building a vector search engine for any application.

In this post, we learned that vector embeddings are the internal representation of input data in a neural network. As a result, they depend highly on the network architecture and the data used to train the model. Different data types, such as images, text and audio, require specific models. Fortunately, many pretrained open source models are available for use. In this post, we covered models for the five most common types of data: images, text, multimodal, audio and video. In addition, if you want to make the best use of vector embeddings, vector databases are the most popular tool.

The post How to Get the Right Vector Embeddings appeared first on The New Stack.

]]>
Python Delights Excel Data Nerds Plus Data Lake Enthusiasts https://thenewstack.io/python-delights-excel-data-nerds-plus-data-lake-enthusiasts/ Fri, 15 Sep 2023 15:35:40 +0000 https://thenewstack.io/?p=22718275

Anaconda, which helped pioneer the use of Python for data science in 2009, has launched its Anaconda Toolbox, a new

The post Python Delights Excel Data Nerds Plus Data Lake Enthusiasts appeared first on The New Stack.

]]>

Anaconda, which helped pioneer the use of Python for data science in 2009, has launched its Anaconda Toolbox, a new suite of tools built to enhance the experience and capabilities of Python in Excel.

The new Microsoft Excel add-in brings the AI-powered Anaconda Assistant, curated data catalogs, and cloud features to Python in Excel users.

The Toolbox will be accessible to current Python in Excel beta users through the Microsoft Marketplace.

AI Assistant

Launched last month, Python in Excel now boasts new features added by Anaconda Toolbox that enable developers to use Python in Excel even if they don’t know the language. Included in Toolbox is Anaconda Assistant, the recently released AI assistant designed specifically for Python users and data scientists, which can guide you through your first steps or supercharge your work if you already have advanced experience.

Python in Excel beta users can sign up to experience Anaconda Toolbox today.

Anaconda Toolbox enables anyone, regardless of experience, to quickly generate code and visualizations while learning Python along the way, the company said. Because the code runs in Excel, you know how it will work when you share the file with others, even if they don’t have Toolbox.

“The AI revolution has triggered an explosion in creativity and productivity. The Anaconda Toolbox fits neatly in that same area as it provides the perfect on-ramp for advanced data science and AI with Python,” said Timothy Hewitt, Senior Product Manager for Python in Excel at Anaconda. “We understand that many Excel users have never used Python, that’s why we included our AI-powered Anaconda Assistant. This AI-assistant helps users accomplish what they need using natural language without needing to know all of the underlying Python code. Whether you need to visualize a data set, develop a script, or quickly generate insights, the Anaconda Assistant makes that possible — and it’s now just one click away.”

Ask the Assistant

Know what you want to do, but don’t know how to do it in Python? Just ask Anaconda Assistant, the company says. When it gives you the code, just push it to the Excel grid, where you can edit and run it just like other Python code. If you start with one of our provided prompts, it will analyze your tables and recommend different ways of working with your data.

Microsoft has released Python in Excel as a public preview to its Insiders Beta Channel, so it is still early days for the technology. But the company will continue to roll out updates covering improved editing experiences (such as autocomplete and syntax highlighting), default repairs, enhanced error behaviors, help and documentation, and more, said Stefan Kinnestrand, a general manager of product marketing/management at Microsoft, in a blog post.

With Python in Excel, users can integrate Python and Excel analytics within the same Excel grid for uninterrupted workflow.

“Python in Excel combines Python’s powerful data analysis and visualization libraries with Excel’s features you know and love,” Kinnestrand said. “You can manipulate and explore data in Excel using Python plots and libraries, and then use Excel’s formulas, charts and PivotTables to further refine your insights.”

Partnership

To help with this integration, Microsoft has partnered with Anaconda, a leading enterprise-grade Python repository used by tens of millions of data practitioners worldwide. Microsoft said Python in Excel leverages Anaconda Distribution for Python running in Azure, which includes the most popular Python libraries such as pandas for data manipulation, statsmodels for advanced statistical modeling, and Matplotlib and seaborn for data visualization.

“Python has become the lingua franca and Swiss Army knife of working with data, and it’s the de facto language of data science and machine learning,” said Andrew Brust, CEO of Blue Badge Insights, a data consultancy. “It’s present in Microsoft Fabric, Azure Synapse Analytics, Azure Machine Learning, Azure Databricks, Visual Studio, VS Code, SQL Server and Power BI. And since Microsoft and Anaconda have collaborated around many of these integrations, doing so in the Excel case was almost a foregone conclusion.”

In 2022 Anaconda launched PyScript, a web-based tool for coding in the browser and deploying apps with the click of a button. The company also launched Anaconda Learning to help people build foundational skills in Python, data visualization, machine learning, and more.

Python education is part of Anaconda’s mission. Every day, more and more people start learning Python, and for many of them, Anaconda is their first stop on that journey.

“We want to see the Python community continue to grow, so we’ve developed an extensive library of free educational content and certificates that have helped thousands of new users break into a whole new world of data science and AI,” Hewitt told The New Stack. “The Anaconda Toolbox for Python in Excel absolutely extends our mission of Python education. In the toolbox, users can find a curated selection of open-source data sets to test new data science skills and the built-in Anaconda Assistant can be used to guide users in self-learning, evaluate code, and explain the code it develops.”

Ibis and PyStarburst

Meanwhile, Starburst, the data lake analytics platform, recently announced extended support for Python and a new integration with the open source Python library, Ibis (built in collaboration with Voltron Data) to reinforce its commitment to openness.

For developers and data engineers used to working with PySpark and Snowpark, PyStarburst provides a familiar syntax that makes it easy to not only build new data pipelines but also migrate existing pipelines to Starburst without rewriting lots of code. Meanwhile, the new Ibis integration provides a uniform Python API and an open backend that connects to any cloud data source so that data and software engineers can build and scale data applications from development to production without rewriting code.

“Many data engineers prefer writing code over SQL for transformations, and many software engineers are used to building data applications in Python. With PyStarburst, we’re giving them the freedom to do so with the increased productivity and performance of Starburst’s enterprise-grade Trino,” said Martin Traverso, CTO of Starburst, in a statement.

For developers and data engineers looking to build scalable data applications, the new Ibis integration provides a uniform Python API that can execute queries on more than 18 different engines — including DuckDB, Pandas, PostgreSQL, and now Starburst Galaxy. This means you can scale from development on a laptop to production in Galaxy without rewriting a single line of code.

There is a lot of tooling going into the Python ecosystem for analytics, data transformation and data engineering, along with libraries for machine learning and data science, Traverso told The New Stack. Python tends to be the glue for everything, and it is the language data scientists use on a day-to-day basis: they build AI models and interact with data engines to massage their data before feeding it to their AI modeling systems. Spark, he noted, was originally built in Scala, and its first APIs were Scala-based, which made it hard for many programmers to deal with. Python is more flexible and easier to pick up, so a whole ecosystem grew up around it, and it eventually became the language of choice for interacting with Spark. Anyone dealing with large-scale data processing on Spark will be familiar with Python, and Starburst is capitalizing on that investment and expertise by bringing it to its own platform.

“At Starburst everything is built with openness in mind, and we are interoperable with nearly any data environment, so we’re extending that commitment to our programming languages. The partnership with Voltron Data and Ibis was a natural fit,” said Harrison Johnson, Head of Technology Partnerships at Starburst.

Together, Ibis and Starburst Galaxy empower users to write portable Python code that executes on Starburst’s high-performance data lake analytics engine, operating on data from more than 50 supported sources. Users will now be able to build analytic expressions across multiple data sources with reusable scripts that execute at any scale.

“Python users struggle to bridge the gap between prototypes on their laptops and production apps running on platforms like Starburst Galaxy. Ibis makes it much easier to bridge this gap,” said Josh Patterson, CEO of Voltron Data. “With Ibis, you can write Python code once and run it anywhere, with any supported backend execution engine. You can move seamlessly from crunching gigabyte-scale test data on your laptop to crunching petabyte-scale data in production using Starburst Galaxy.”

The post Python Delights Excel Data Nerds Plus Data Lake Enthusiasts appeared first on The New Stack.

]]>
Web Dev Platform Netlify Releases Software Development Kit https://thenewstack.io/web-dev-platform-netlify-releases-software-development-kit/ Thu, 14 Sep 2023 15:35:11 +0000 https://thenewstack.io/?p=22718154

Web development platform Netlify released a software development kit (SDK) Wednesday that it said will make it easier for tech

The post Web Dev Platform Netlify Releases Software Development Kit appeared first on The New Stack.

]]>

Web development platform Netlify released a software development kit (SDK) Wednesday that it said will make it easier for tech partners and customers to design custom integrations with Netlify.

“The SDK is exciting to me because it opens up for partners and other tool makers to integrate into Netlify, and for enterprise companies to build integrations specific to their services on Netlify, from the beginning,” CEO Matt Biilmann told The New Stack.

Netlify offers serverless backend services for web applications and dynamic websites. The SDK supports taking a composable architecture approach to web applications and websites at scale, Biilmann said.

“We coined the term Jamstack and pioneered this whole idea of building decoupled web UIs that talk to all these different APIs and services,” he said. “Now that’s maturing into this idea of composable architectures at scale, where you combine together many different tools instead of buying one big monolithic tool.”

Netlify Connect, which was released in June, plays a role in that, he added. Netlify Connect allows developers to integrate content from multiple sources into a single data unification layer for access through a GraphQL API, according to the documentation. That allows data updates to sync automatically. The SDK includes connectors to support connecting to and syncing data from a custom data source in Netlify Connect.

SDK Simplifies Flows, Authentication and Connectors

The SDK also will simplify flows, OAuth authentication and connectors, Biilmann told The New Stack.

“The connector part of the SDK allows partners or internal developers to build their own connectors and define ‘here’s a connector’ for Sanity, Sitecore or Adobe Experience Manager, or as a large company, ‘here is a connector to our internal product catalog.’ Once that connector is defined, any team building with it can simply install it, get data into Netlify Connect and start building on top of it,” he said.

Already, partner companies have deployed connectors using the SDK. For example, the MySQL platform PlanetScale created an integration that allows joint customers to deploy data-intensive applications without worrying about the underlying infrastructure or issues with data scalability.

It also incorporates a build event handler, which is a function that is called during the build process. For instance, performance monitoring firm Sentry has built a connector that sends all the source maps from the build through Sentry, by leveraging the SDK’s build event handlers.

“Now if there is an error in your frontend, it will be reported to Sentry and Sentry can use the source maps to tell you exactly where in the code it happened,” Biilmann said. “The build event handler will allow an integrator like Sentry to orchestrate all of that so when you install the Sentry integration, they can see from now on in your build.”

Previously, third-party integrations were handled by plug-ins written as NPM modules, he explained.

“There was no real control over the UI installation experience and those pieces and other parts of it,” Biilmann said. “If you wanted to do all our flows and so on, we had to do custom work together with a partner.”

Support for Enterprise Software Integration

The SDK also incorporates API handlers and an integration UI.

“The integration UI gives you a declarative way of building the UI for your integration within Netlify,” he said. “The API handlers allow you to use Netlify itself to build the backend for that UI, because, obviously, you probably need a backend that has access to the right secrets, that can talk to Sentry’s API, talk to Netlify’s API and make everything fit together. That’s part of the SDK.”

The SDK allows developers to define what should happen at build time, what should be injected into the runtime code, what path should be a connector, how the UI should look and what the API handlers should be to make that UI actually function and work, he added. For instance, with Sentry’s integration, developers can click OAuth to do an OAuth flow in the associated Netlify project.

It also allows enterprises to create their own integrations with their own partner software. Enterprises will “almost certainly” have off-the-shelf software they’re using and want to connect to, he said.

“They’ll almost certainly also have a bunch of internal APIs and services that they want to make reusable for their UI teams, and that’s why the SDK is also really the toolkit that they can use to build private integrations that are not publicly shared with any other Netlify team, but within their organization,” he said. “[That] can be how they make reusable building blocks that a web developer can simply come in, click through some options to install, and now they’re off to the races.”

The post Web Dev Platform Netlify Releases Software Development Kit appeared first on The New Stack.

]]>
Discover the Performance Gain with Retrieval Augmented Generation https://thenewstack.io/discover-the-performance-gain-with-retrieval-augmented-generation/ Tue, 12 Sep 2023 17:00:13 +0000 https://thenewstack.io/?p=22717433

Large Language Models (LLMs) are smart enough to understand context. They can answer questions, leveraging their vast training data to

The post Discover the Performance Gain with Retrieval Augmented Generation appeared first on The New Stack.

]]>

Large Language Models (LLMs) are smart enough to understand context. They can answer questions, leveraging their vast training data to provide coherent and contextually relevant responses, no matter whether the topic is astronomy, history or even physics. However, LLMs tend to hallucinate (deliver compelling yet false facts) when asked to answer questions outside the scope of their training data, or when they can’t remember the details in the training data.

A new technique, Retrieval Augmented Generation (RAG), fills the knowledge gaps, reducing hallucinations by augmenting prompts with external data. Combined with a vector database (like MyScale), it substantially increases the performance gain in extractive question answering.

To this end, this article focuses on determining the performance gain with RAG on the widely-used MMLU dataset. We find that both the performance of commercial and open source LLMs can be significantly improved when knowledge can be retrieved from Wikipedia using a vector database. More interestingly, this result is achieved even when Wikipedia is already in the training set of these models.

You can find the code for the benchmark framework and this example here.

Retrieval Augmented Generation

But first, let’s describe Retrieval Augmented Generation (RAG).

Research projects aim to enhance LLMs like gpt-3.5 by coupling them with external knowledge bases (like Wikipedia), databases, or the internet to create more knowledgeable and contextually aware systems. For example, let’s assume a user asks an LLM what Newton’s most important result is. To help the LLM retrieve the correct information, we can search for Newton’s wiki and provide the wiki page as context to the LLM.

This method is called Retrieval Augmented Generation (RAG). Lewis et al. in Retrieval Augmented Generation for Knowledge-Intensive NLP Tasks define Retrieval Augmented Generation as:

“A type of language generation model that combines pre-trained parametric and non-parametric memory for language generation.”

Moreover, the authors of this academic paper go on to state that they:

“Endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach.”

Note: Parametric-memory LLMs are massive self-reliant knowledge repositories like ChatGPT and Google’s PaLM. Non-parametric memory LLMs leverage external resources that add additional context to parametric-memory LLMs.

Combining external resources with LLMs seems feasible as LLMs are good learners, and referring to specific external knowledge domains can improve truthfulness. But how much of an improvement will this combination be?

Two major factors affect a RAG system:

  • How much can an LLM learn from the external context?
  • How accurate and related is the external context?

Both of these factors are hard to evaluate. The knowledge gained by the LLM from the context is implicit, so the most practical way to assess these factors is to examine the LLM’s answer. However, the accuracy of the retrieved context is also tricky to evaluate.

Measuring the relevance between paragraphs, especially in question answering or information retrieval, can be a complex task. The relevance assessment is crucial to determine whether a given section contains information directly related to a specific question. This is especially important in tasks that involve extracting information from large datasets or documents, like the WikiHop dataset.

Sometimes, datasets employ multiple annotators to assess the relevance between paragraphs and questions. Using multiple annotators to vote on relevance helps mitigate subjectivity and potential biases that can arise from individual annotators. This method also adds a layer of consistency and ensures that the relevance judgment is more reliable.

As a consequence of all these uncertainties, we developed an open-sourced end-to-end evaluation of the RAG system. This evaluation considers different model settings, retrieval pipelines, knowledge base choices, and search algorithms.

We aim to provide valuable baselines for RAG system designs and hope that more developers and researchers join us in building a comprehensive and systematic benchmark. More results will help us disentangle these two factors and create a dataset closer to real-world RAG systems.

Note: Share your evaluation results at GitHub. PRs are very welcome!

A Simple End-to-End Baseline for a RAG System

In this article, we focus on a simple baseline evaluated on MMLU (the Massive Multitask Language Understanding dataset), a widely used benchmark for LLMs containing multiple-choice, single-answer questions on many subjects such as history, astronomy and economics.

We set out to find out if an LLM can learn from extra contexts by letting it answer multiple-choice questions.

To achieve our aim, we chose Wikipedia as our source of truth because it covers many subjects and knowledge domains, and we used the version cleaned by Cohere.ai on Hugging Face, which includes 34,879,571 paragraphs belonging to 5,745,033 titles. An exhaustive search of these paragraphs would take quite a long time, so we need an appropriate ANNS (Approximate Nearest Neighbor Search) algorithm to retrieve relevant documents; specifically, we use the MyScale database with its MSTG vector index.

Semantic Search Model

Semantic search is a well-researched topic, with many competing models and detailed benchmarks available. When incorporated with vector embeddings, semantic search gains the ability to recognize paraphrased expressions, synonyms and contextual understanding.

Moreover, embeddings provide dense and continuous vector representations that enable the calculation of meaningful metrics of relevance. These dense metrics capture semantic relationships and context, making them valuable for assessing relevance in LLM information retrieval tasks.

Taking into account the factors mentioned above, we have decided to use the paraphrase-multilingual-mpnet-base-v2 model from Hugging Face to extract features for retrieval tasks. This model is part of the MPNet family, designed to generate high-quality embeddings suitable for various NLP tasks, including semantic similarity and retrieval.
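As a rough sketch of what the retrieval step looks like with this model (the paragraphs and question below are invented, and in the actual benchmark the embeddings live in MyScale and are searched with the MSTG index rather than held in memory):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# Toy stand-ins for Wikipedia paragraphs; the real corpus has ~35M of them.
paragraphs = [
    "Isaac Newton formulated the laws of motion and universal gravitation.",
    "The Great Wall of China was built across several dynasties.",
]
question = "What is Newton best known for?"

doc_embeddings = model.encode(paragraphs, convert_to_tensor=True)
query_embedding = model.encode(question, convert_to_tensor=True)

# Cosine similarity between the question and each candidate paragraph.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(paragraphs[best], float(scores[best]))
```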

Large Language Models (LLMs)

For our LLMs, we chose OpenAI’s gpt-3.5-turbo and llama2-13b-chat with 6-bit quantization. These are among the most popular commercial and open source models, respectively. The LLaMA2 model is quantized with llama.cpp. We chose the 6-bit quantization setup because it is affordable without sacrificing performance.

Note: You can also try other models to test their RAG performance.

Our RAG System

The following image describes how to formulate a simple RAG system:

Figure 1: Simple Benchmarking RAG

Note: Transform can be anything as long as it can be fed into the LLM, returning the correct answer. In our use case, Transform injects context into the question.

Our final LLM prompt is as follows:

```python
template = (
    "The following are multiple choice questions (with answers) with context:"
    "\n\n{context}Question: {question}\n{choices}Answer: "
)
```
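Continuing from the template above, in our sketch the Transform step boils down to ordinary string formatting; the retrieved paragraph, question and choices below are invented purely for illustration:

```python
# Hypothetical retrieved context and question, for illustration only.
retrieved = [
    "Newton's law of universal gravitation describes the attraction between masses.",
]
context = "\n".join(retrieved) + "\n\n"
question = "Which law describes the gravitational attraction between two masses?"
choices = (
    "A. Hooke's law\n"
    "B. Newton's law of universal gravitation\n"
    "C. Ohm's law\n"
    "D. Boyle's law\n"
)

# Fill the template defined above and send the prompt to the LLM under test.
prompt = template.format(context=context, question=question, choices=choices)
print(prompt)
```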


Now let’s move on to the result.

Several Benchmarking Insights

Our benchmark test results are collated in Table 1 below.

But first, our summarized findings are:

  1. Extra context usually helps
  2. More context sometimes helps
  3. Smaller models are hungrier for knowledge

Table 1: Retrieval Accuracy with Different Contexts

| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo | none | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| gpt-3.5-turbo | Top-1 | 75.66% (+3.95%) | 78.40% (+8.03%) | 46.00% (+8.00%) | 67.05% (-0.58%) | 73.21% (-1.51%) | 71.50% (+3.45%) |
| gpt-3.5-turbo | Top-3 | 76.97% (+5.26%) | 81.79% (+11.42%) | 48.00% (+10.00%) | 65.90% (-1.73%) | 73.96% (-0.76%) | 72.98% (+4.93%) |
| gpt-3.5-turbo | Top-5 | 78.29% (+6.58%) | 79.63% (+9.26%) | 42.00% (+4.00%) | 68.21% (+0.58%) | 74.34% (-0.38%) | 72.39% (+4.34%) |
| gpt-3.5-turbo | Top-10 | 78.29% (+6.58%) | 79.32% (+8.95%) | 44.00% (+6.00%) | 71.10% (+3.47%) | 75.47% (+0.75%) | 73.27% (+5.22%) |
| llama2-13b-chat-q6_0 | none | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| llama2-13b-chat-q6_0 | Top-1 | 58.55% (+5.26%) | 61.73% (+4.32%) | 45.00% (+12.00%) | 46.24% (+1.73%) | 54.72% (+4.53%) | 55.13% (+4.83%) |
| llama2-13b-chat-q6_0 | Top-3 | 63.16% (+9.87%) | 63.27% (+5.86%) | 49.00% (+16.00%) | 46.82% (+2.31%) | 55.85% (+5.66%) | 57.10% (+6.80%) |
| llama2-13b-chat-q6_0 | Top-5 | 63.82% (+10.53%) | 65.43% (+8.02%) | 51.00% (+18.00%) | 51.45% (+6.94%) | 57.74% (+7.55%) | 59.37% (+9.07%) |
| llama2-13b-chat-q6_0 | Top-10 | 65.13% (+11.84%) | 66.67% (+9.26%) | 46.00% (+13.00%) | 49.71% (+5.20%) | 57.36% (+7.17%) | 59.07% (+8.77%) |
* The benchmark uses MyScale MSTG as a vector index
* This benchmark can be reproduced with our GitHub repository retrieval-qa-benchmark

1. Extra Context Usually Helps

In these benchmarking tests, we compared performance with and without context. The test without context shows how well the model’s internal knowledge can answer the questions, while the test with context shows how well an LLM can learn from the supplied context.

Note: Both llama2-13b-chat and gpt-3.5-turbo are enhanced by around 3-5% overall, even with only one extra context. 

The table also shows that some numbers are negative, for example, when we add retrieved context to the clinical-knowledge questions for gpt-3.5-turbo.

This might be related to the knowledge base itself, since Wikipedia does not contain much detailed clinical knowledge, or to the fact that OpenAI’s terms of use and guidelines strongly discourage, and may even prohibit, using its models for medical advice. Despite this, the overall increase is quite evident for both models.

Notably, the gpt-3.5-turbo results suggest that a RAG system can be powerful enough to compete with larger language models. Some of the reported numbers, such as those on prehistory and astronomy, approach the performance of GPT-4 once extra context tokens are supplied, suggesting RAG could be an alternative route to specialized Artificial General Intelligence (AGI) compared to fine-tuning.

Note: RAG is more practical than fine-tuning models as it is a plug-in solution and works with both self-hosted and remote models.

2. More Context Sometimes Helps

Figure 2: Performance Gain vs. the Number of Contexts

The benchmark above suggests that you should supply as much context as possible. In most cases, LLMs will learn from all the supplied contexts, and in theory the model should give better answers as the number of retrieved documents increases. However, our benchmarking shows that some numbers drop as more contexts are retrieved.

Our results align with a Stanford University paper titled "Lost in the Middle: How Language Models Use Long Contexts," which suggests that LLMs mostly attend to the head and tail of the context. Therefore, choose fewer but more accurate contexts from the retrieval system to augment your LLM.

3. Smaller Models Are Hungrier for Knowledge

The larger the LLM, the more knowledge it stores. Larger LLMs tend to have a greater capacity to store and understand information, which often translates to a broader knowledge base of generally understood facts. Our benchmarking tests tell the same story: the smaller LLMs lack knowledge and are hungrier for more knowledge.

Our results show that llama2-13b-chat gains more from the retrieved context than gpt-3.5-turbo, suggesting that the context injects relatively more new knowledge into the smaller model. These results also imply that gpt-3.5-turbo was often given information it already knows, while llama2-13b-chat is still learning from the context.

Last but Not Least…

Almost every LLM uses the Wikipedia corpus as a training dataset, meaning both gpt-3.5-turbo and llama2-13b-chat should already be familiar with the contexts added to the prompt. Therefore, the questions that arise are:

  • What is the reason for the increases in this benchmark test?
  • Is the LLM really learning using the supplied contexts?
  • Or do these additional contexts help recall memories learned from the training set data?

We currently don’t have any answers to these questions. As a result, research is still needed.

Contributing to Building a RAG Benchmark Together

Contribute to research to help others.

We can only cover a limited set of evaluations in this blog. But we know more is needed. The results of every benchmark test matter, regardless of whether they are replications of existing tests or some new findings based on novel RAGs.

With the aim of helping everyone create benchmark tests to test their RAG systems, we have open sourced our end-to-end benchmark framework. To fork our repository, check out our GitHub page.

This framework includes the following tools:

  1. A universal profiler that wraps functions added to your searches or LLMs.
  2. A YAML configuration that stores all the details of your experiments.
  3. A chain for building RAG execution graphs.

It’s up to you to create your own benchmark. We believe RAG can be a possible solution to AGI. Therefore, we built this framework for the community to make everything trackable and reproducible.

PRs are welcome.

In Conclusion

We have evaluated a small subset of MMLU with a simple RAG system built with different LLMs and vector search algorithms and described our process and results in this article. We also donated the evaluation framework to the community and called for more RAG benchmarks. We will continue to run benchmarking tests and update the latest results to GitHub and the MyScale blog, so follow us on Twitter or join us on Discord to stay updated.

The post Discover the Performance Gain with Retrieval Augmented Generation appeared first on The New Stack.

]]>
The Role of SQL in LLM Apps and How Snowflake Uses LangChain https://thenewstack.io/the-role-of-sql-in-llm-apps-and-how-snowflake-uses-langchain/ Tue, 12 Sep 2023 15:51:08 +0000 https://thenewstack.io/?p=22718021

Meta’s recent release of Code Llama, a large language model (LLM) for code generation, prompted the data cloud company Snowflake

The post The Role of SQL in LLM Apps and How Snowflake Uses LangChain appeared first on The New Stack.

]]>

Meta’s recent release of Code Llama, a large language model (LLM) for code generation, prompted the data cloud company Snowflake to evaluate Code Llama’s performance on SQL code generation. It found that “Code Llama models outperform Llama2 models by 11-30 percent accuracy points on text-to-SQL tasks and come very close to GPT4 performance.” Snowflake also discovered that by fine-tuning Code Llama, it could make it up to 50 percent accuracy points better.

To find out more about Snowflake’s plans for SQL in the generative AI era, and why it’s suddenly all-in on Code Llama, I spoke to Adrien Treuille, director of product management and head of Streamlit (a Python app builder that was acquired by Snowflake in March 2022).

Riding First Class with SQL and Python

Treuille began by noting that Streamlit’s Community Cloud is currently host to over 10,000 LLM-powered apps, so it’s already become a leading platform for LLM app developers. “It’s a linchpin of Snowflake’s app strategy as well,” he added.

When it comes to connecting LLMs with Snowflake’s extensive data platform, SQL is the glue. “Snowflake was built on SQL,” said Treuille, “and so all functionality is available in SQL as a first-class citizen.” SQL, of course, enables you to add structure to massive swathes of data. Also, as Treuille put it, Snowflake’s “original market was database admins, people who basically speak SQL for a living.”

As for Streamlit, it was built on the back of Python. Now that Snowflake owns Streamlit, Python has also become a first-class language in the company.

“It means that, basically, all functionality [in Snowflake] has first-class Python bindings,” Treuille explained. “And of course, in Python, you can call SQL if you need an escape hatch down into the bowels of Snowflake. So yes, we are committed to both Python and SQL as being the languages of Snowflake.”

Building a Structured Data App with LLMs and SQL

A developer might decide to use Snowflake to build an LLM app when the data they’re accessing and querying is so complex that it needs further structure before it can be used in an application. Usually, this means both an LLM and at least one external data source are involved — that external data could be stored in Snowflake and/or elsewhere, such as in a vector database.

Treuille said that apps like a customer support chatbot or a “product suggestion bot” are good examples of the type of apps typically built on Snowflake using this “combination of LLMs and structured search.”

In a demo entitled “Building an LLM-Powered Chatbot,” at the Snowflake Summit 2023 in late June, Treuille showed how interacting with a Streamlit chatbot app in natural language can generate and run SQL queries on a data store in Snowflake.

“We now have a chatbot that is actually creating SQL on the fly based on our natural language input, running the SQL query and generating the response inline in our custom chatbot,” he said in the demo (see screenshot below).

Snowflake LLM app

SQL is generated and run by the LLM chatbot.
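The pattern behind the demo can be sketched roughly as follows; the generate_sql() helper is a placeholder for whatever LLM call produces the SQL, and the connection parameters are dummies, so treat this as an outline rather than Snowflake’s actual implementation.

```python
import snowflake.connector

def generate_sql(question: str) -> str:
    """Placeholder: call an LLM that turns natural language into SQL here."""
    raise NotImplementedError

# Dummy credentials; a real app would load these from config or a secrets store.
conn = snowflake.connector.connect(
    account="MY_ACCOUNT",
    user="MY_USER",
    password="MY_PASSWORD",
    warehouse="MY_WAREHOUSE",
    database="MY_DATABASE",
)

question = "What were the top five products by revenue last month?"
sql = generate_sql(question)  # the LLM writes the SQL on the fly

cur = conn.cursor()
cur.execute(sql)              # run the generated query against Snowflake
for row in cur.fetchall():    # results feed the chatbot's inline response
    print(row)
cur.close()
conn.close()
```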

Why Code Llama Is So Important

It makes perfect sense that Snowflake would want to promote SQL code generation in LLMs, but why is it so excited about Meta’s new Code Llama LLM in particular?

“Six months ago, there was a fear that you were either one of the two or three superpowers who could build hyper-intelligent LLMs — like OpenAI — and there was everyone else,” Treuille replied. “And you either went to VCs and raised billions of dollars — like, you know, Anthropic — or you would inevitably be a customer and ultimately disintermediated by these massive super-intelligent LLMs from others.”

But now, he continued, “Facebook has completely abandoned that paradigm, by open sourcing some of the most powerful LLMs in the world.”

So Snowflake is, essentially, hitching its wagon to the open source LLMs being released by Meta (and perhaps others later). Snowflake can fine-tune an LLM like Code Llama to suit its own purposes — in this case, so that it does text-to-SQL better. It means the company doesn’t have to rely on a proprietary LLM provider, like OpenAI, because it can build its own LLMs from Meta’s open sourced models.

“Snowflake’s LLMs are near GPT level on standard tasks,” said Treuille, adding that “anyone can benchmark this.” In other words, he’s saying that its fine-tuned Open Llama LLM is “near” the quality of OpenAI’s GPT on tasks like text-to-SQL. “And that is totally game-changing,” insists Treuille.

Other Parts of the LLM App Ecosystem

In addition to creating its own fine-tuned LLMs, Snowflake plays nicely with other parts of the LLM app ecosystem, said Treuille. He added that not only is Snowflake “compatible with vector databases,” but it is “in private preview for our own vector database product.” This isn’t surprising, given how many different product types are already represented in Snowflake’s platform.

Perhaps more interesting is how Snowflake works alongside LangChain, the LLM orchestrator that has been a core part of many early LLM applications. During the presentation that Treuille and a couple of colleagues did at Snowflake Summit 2023, the group demonstrated how LangChain can be used to “help us organize the LLM’s thoughts so that it actually can decide the strategy it wants to take to solve a problem.”

In the example that was demoed, LangChain (which we were told was using GPT-4) acted as a kind of facilitator between the user and the SQL queries that the main LLM was generating.

Snowflake and LangChain

Snowflake and LangChain co-ordination.

Everyone Will Have Their Own LLM

I asked Treuille how he thinks the LLM app ecosystem will evolve over the next few years, and what Snowflake’s role will be in this.

“If I could describe a North Star,” he replied, “it would be: talk to your data.”

Eventually, he thinks the industry will get to a place where every enterprise company essentially has its own LLM that “embodies all their knowledge.” He acknowledged that “it’ll be a little bit more structured than that — you may have an LLM that embodies all the knowledge, [but] you will still have structured databases against which you can run queries, and there’s going to be some non-trivial logic in-between.”

But from a product point of view, enterprise customers will end up with what they view as their own custom LLM solution. Which, of course, Snowflake hopes to provide.

The post The Role of SQL in LLM Apps and How Snowflake Uses LangChain appeared first on The New Stack.

]]>
Getting Started with Infrastructure Monitoring https://thenewstack.io/getting-started-with-infrastructure-monitoring/ Mon, 11 Sep 2023 14:01:40 +0000 https://thenewstack.io/?p=22717742

While building new features and launching new products is fun, none of it matters if your software isn’t reliable. One

The post Getting Started with Infrastructure Monitoring appeared first on The New Stack.

]]>

While building new features and launching new products is fun, none of it matters if your software isn’t reliable. One key part of making sure your apps run smoothly is having robust infrastructure monitoring in place. In this article you will learn about the following:

  • The different components of infrastructure monitoring.
  • Popular tools used for infrastructure monitoring.
  • How to set up monitoring for an application.

If you prefer video, you can also check out this presentation, which covers some of the themes discussed in this article.

Components of Infrastructure Monitoring

Infrastructure monitoring consists of a number of different architecture components that are needed to serve a modern application. To ensure software is reliable, all of these components need to be properly monitored.

  • Network monitoring — Network monitoring focuses on hardware like routers and switches and involves tracking things like bandwidth usage, uptime and device status. It is used to identify bottlenecks, downtime and potentially inefficient network routing.
  • Server monitoring — Server monitoring is focused on monitoring the performance and health of physical and virtual server instances. Metrics like CPU, RAM and disk utilization are common. Server monitoring is important for capacity planning.
  • Application performance monitoring (APM) — APM is focused on software and is used to track how an application is performing at every layer from the UI to how data is stored. Common metrics are things like error rates and response times.
  • Cloud infrastructure monitoring — Cloud monitoring, as the name implies, is about monitoring cloud infrastructure like databases, different types of storage and VMs. The goal is to track availability and performance, as well as resource utilization to prevent over or under provisioning of cloud hardware.

Each of these types of monitoring acts as a different lens through which teams can view and manage their infrastructure. By taking advantage of all of this data, companies can ensure their infrastructure is performing optimally while reducing costs.

Tools for Infrastructure Monitoring

Choosing the right tools for the job is critical when it comes to creating an infrastructure monitoring system. There are a number of open source and commercial options available. You also have the option of choosing a full-service solution or creating your own custom solution by combining specialized tools. Regardless, there are three main questions to consider: How will you collect your data? Where will you store it? And what will you do with it? Let’s look at some of the tools available for each step.

Data Collection Tools

One of the biggest challenges with infrastructure monitoring is collecting data that may be coming from many different sources, often with no standardized protocol or API. The key goal here should be to choose a tool that saves you from having to reinvent the wheel, doesn’t lock you in and is extensible so you can scale or modify data collection as your app changes.

Telegraf

Telegraf is an open source server agent that is ideal for infrastructure monitoring data collection. Telegraf solves most of the problems mentioned above. It has over 300 different plugins for inputs and outputs, meaning you can easily collect data from new sources and output that data to whichever storage solution works best for your use case.

The result is that Telegraf saves you a ton of engineering resources by not having to write custom code for collecting data and prevents vendor lock-in because you can change storage outputs easily. Telegraf also has plugins for data processing and transformation, so in some use cases it can simplify your architecture by replacing stream-processing tools.

OpenTelemetry

OpenTelemetry is an open source set of SDKs and tools that make it easy to collect metrics, logs and traces from applications. The primary advantage of OpenTelemetry is that it is vendor agnostic, so you don’t have to worry about getting locked into an expensive APM tool with high switching costs. OpenTelemetry also saves your developers time by providing tools to make instrumenting applications for data collection easy.
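As a small, hedged example of what application-side instrumentation might look like with the OpenTelemetry Python SDK (the meter and counter names are invented, and the OTLP exporter packages need to be installed separately), exporting metrics over OTLP/gRPC to a collector such as the Telegraf OpenTelemetry input used later in this article:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Export metrics over OTLP/gRPC to a local collector listening on port 4317.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("demo-service")  # invented instrumentation scope name
request_counter = meter.create_counter(
    "app_requests", description="Number of handled requests"
)

# Record a data point; the attributes become tags/labels downstream.
request_counter.add(1, {"endpoint": "/chat"})
```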

Data Storage Tools

After you start collecting data from your infrastructure, you’ll need a place to store that data. While a general-purpose database can be used for this data, in many cases you will want to look for a more specialized database designed for working with the types of time series data collected for infrastructure monitoring. Here are a few available options:

InfluxDB

InfluxDB is an open source time series database designed for storing and analyzing high volumes of time series data. It offers efficient storage and retrieval capabilities, scalability and support for real-time analytics. With InfluxDB, you can easily capture and store metrics from various sources, making it a good fit for monitoring and analyzing the performance and health of your infrastructure.

Prometheus

Prometheus is an open source monitoring and alerting toolkit built for collecting and storing metrics data. It is specifically designed to monitor dynamic and cloud native environments. Prometheus provides a flexible data model and powerful query language, making it well-suited for storing infrastructure monitoring data. With its built-in alerting and visualization capabilities, Prometheus enables you to gain insight into the performance and availability of your infrastructure.

Graphite

Graphite is a time series database and visualization tool that focuses on storing and rendering graphs of monitored data. It is widely used for monitoring and graphing various metrics, making it a suitable option for storing infrastructure monitoring data. Graphite excels at visualizing time series data, allowing you to create interactive and customizable dashboards to monitor the performance and trends of your infrastructure. Its scalable architecture and extensive ecosystem of plugins make it a popular choice for monitoring and analyzing infrastructure metrics.

Data Analysis Tools

Once you’ve got your data stored, it’s time for the fun part, actually doing something with it to create value. Here are a few tools that you can use for analyzing your data.

Grafana

Grafana is a powerful open source data visualization and analytics tool that allows users to create, explore and share interactive dashboards. It is commonly used for analyzing infrastructure monitoring data by connecting to various data sources such as databases, APIs and monitoring systems. With Grafana, users can create visualizations, set up alerts and gain insights into their infrastructure metrics, logs and traces.

Apache Superset

Apache Superset is a modern enterprise-ready business intelligence web application that enables users to explore, visualize and analyze data. It provides a user-friendly interface for creating interactive dashboards, charts and reports. When it comes to analyzing infrastructure monitoring data, Apache Superset can be used to connect to monitoring systems, databases or other data sources to explore and visualize key metrics, generate reports and gain insights into the performance and health of the infrastructure.

Jaeger

Jaeger is an open source, end-to-end distributed tracing system that helps users monitor and troubleshoot complex microservices architectures. It can be used for analyzing infrastructure monitoring data by providing detailed insights into the interactions and dependencies between different components of the infrastructure. Jaeger captures and visualizes traces, which represent the path of requests as they travel through the system, allowing users to identify bottlenecks, latency issues and performance optimizations in the infrastructure.

Infrastructure Monitoring Tutorial

Now let’s look at an example of how to implement a monitoring system for an application. This tutorial will focus on a combination of open source tools known as the TIG stack: Telegraf, InfluxDB and Grafana. The TIG stack allows developers to easily build an infrastructure monitoring solution that is scalable and extensible in the long term.

Architecture Overview

The example application for this tutorial is a chat app powered by an AI model that returns responses based on user input. The app has a hybrid architecture with the backend hosted on AWS, and the AI model is run on dedicated GPUs outside the cloud. The primary challenge is ensuring reliability of the service while also scaling infrastructure due to rapid user growth. Doing this requires collecting large amounts of data to track resource utilization in real time for monitoring and also for future capacity planning based on user growth.

Infrastructure Monitoring Setup

Now let’s look at how to set up and configure monitoring for this application. The first step will be configuring Telegraf to collect the data we want from each part of our infrastructure. We’ll take advantage of the following Telegraf plugins:

  • SNMP input — The SNMP plugin is used to collect the metrics needed for network monitoring.
  • CPU, Disk, Nvidia SMI, DiskIO, mem, swap, system input — These plugins are used to collect server monitoring metrics.
  • OpenTelemetry input — OpenTelemetry is used to collect application performance metrics like logs, metrics and traces.
  • AWS Cloudwatch input — The AWS CloudWatch plugin makes it easy to collect all the cloud infrastructure metrics we need from AWS.
  • InfluxDB V2 output — The InfluxDB output plugin will send all of these collected metrics to the specified InfluxDB instance.

And here’s an example of a Telegraf configuration TOML file for this setup:

```TOML
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  # user = "$USER"

[agent]
  interval = "10s"
  round_interval = true

  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"

  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""

  # debug = false
  # quiet = false
  # logtarget = "file"
  # logfile = ""
  # logfile_rotation_interval = "0d"
  # logfile_rotation_max_size = "0MB"
  # logfile_rotation_max_archives = 5

  hostname = ""
  omit_hostname = false

[[inputs.snmp]]
  agents = ["udp://127.0.0.1:161"].
  timeout = "15s"
   version = 2
   community = "SNMP"
  retries = 1


  [[inputs.snmp.field]]
    oid = "SNMPv2-MIB::sysUpTime.0"
    name = "uptime"
    conversion = "float(2)"

  [[inputs.snmp.field]]
    oid = "SNMPv2-MIB::sysName.0"
    name = "source"
    is_tag = true

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.nvidia_smi]]

[[inputs.opentelemetry]]
  service_address = "0.0.0.0:4317"
  timeout = "5s"
  metrics_schema = "prometheus-v2"
  tls_cert = "/etc/telegraf/cert.pem"
  tls_key = "/etc/telegraf/key.pem"

[[inputs.cloudwatch_metric_streams]]

  service_address = ":443"

[[inputs.cloudwatch]]
  region = "us-east-1"

[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]

  ## Token for authentication.
  token = ""

  ## Organization is the name of the organization you wish to write to.
  organization = ""

  ## Destination bucket to write into.
  bucket = ""

  ## The value of this tag will be used to determine the bucket.  If this
  ## tag is not set the 'bucket' option is used as the default.
  # bucket_tag = ""

  ## If true, the bucket tag will not be added to the metric.
  # exclude_bucket_tag = false

  ## Timeout for HTTP messages.
  # timeout = "5s"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Proxy override, if unset values the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

```


This Telegraf configuration takes care of both the data collection and data storage steps by collecting all the designated data and sending it to InfluxDB for storage. Let’s go over some ways you can use that data.
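Before wiring up dashboards, it can help to confirm the data is actually landing in InfluxDB. Here is a minimal sketch using the influxdb-client Python library; the token, org and bucket are placeholders that should match the values in the Telegraf output config above.

```python
from influxdb_client import InfluxDBClient

# Placeholders: use the same URL, token, org and bucket as the Telegraf output.
client = InfluxDBClient(url="http://127.0.0.1:8086", token="MY_TOKEN", org="MY_ORG")

# Average CPU idle over the last 15 minutes, as written by the cpu input plugin.
flux = '''
from(bucket: "MY_BUCKET")
  |> range(start: -15m)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> mean()
'''

for table in client.query_api().query(flux):
    for record in table.records:
        print(record.get_field(), record.get_value())

client.close()
```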

Data Visualization

One of the first steps for many companies is to create dashboards and data visualizations for their infrastructure monitoring system. These dashboards can be used for everything from high-level reports to detailed analysis by engineers monitoring things in real time. Here’s an example of a Grafana dashboard built using the data collected for this tutorial:

Alerting

While dashboards are nice, it’s impossible to manually track everything happening with your infrastructure at scale. To help with this problem, setting up automated alerting is a common feature of infrastructure monitoring systems. Here’s an example of how Grafana can be used to set value thresholds for metrics and create automated alerts if those thresholds are violated.

Grafana integrates with third-party tools like PagerDuty and Slack so engineers can be notified if something goes wrong. In some cases, alerting like this could be used to completely automate certain actions, like automatically scaling cloud capacity if hardware utilization hits a certain level.

Predictive Analytics and Forecasting

Predictive analytics and forecasting are probably the ideal end goal for many engineering teams. While alerting is a reactive approach that only works after something has gone wrong, predictive analytics and forecasting allow you to take action before the issue occurs. Creating accurate forecasts is obviously easier said than done, but it has huge benefits when done right.

Next Steps

Hopefully this article helped you to better understand infrastructure monitoring and some of the tools that are available for building your own system. If you want to play around with some real data you can check out the following resources:

The post Getting Started with Infrastructure Monitoring appeared first on The New Stack.

]]>
Unlock Data’s Full Potential with a Mature Analytics Strategy https://thenewstack.io/unlock-datas-full-potential-with-a-mature-analytics-strategy/ Fri, 08 Sep 2023 17:16:18 +0000 https://thenewstack.io/?p=22717769

Over the past decade, businesses have harnessed the power of “big data” to unlock new possibilities and enhance their analytical

The post Unlock Data’s Full Potential with a Mature Analytics Strategy appeared first on The New Stack.

]]>

Over the past decade, businesses have harnessed the power of “big data” to unlock new possibilities and enhance their analytical capabilities. Today, those businesses must accelerate those capabilities by moving beyond experimentation with analytics toward mature investments and capabilities, or risk losing a competitive edge.

A mature data analytics strategy is critical to deriving the most value from data, but many organizations struggle to get it right. Despite the exponential growth in data collection, about 73% of enterprise data remains unused for analytics, according to Forrester. This means that just one-fourth of the data generated is effectively leveraged to gain valuable insights. Embracing modern technology, such as containerized storage capabilities, can help leaders obtain a strong handle on their data and derive actionable insights from it to truly drive business growth.

Legacy Analytics Architectures Are Obstructing Innovation

Today’s software applications need to handle millions of users across the globe on demand while running on multiple platforms and environments. They also need to provide high availability to enable businesses to innovate and respond to changing market conditions. Legacy platforms were designed prior to ubiquitous fast storage and network fabric, presenting more challenges than solutions for organizations looking to get ahead of the competition.

When I spoke to IT leaders who use legacy deployment models, the number-one complaint I heard is that it requires too much effort to support data at the indexer layer, which leads to reduced operational efficiencies. Hours, days and even weeks can be spent on software updates, patches and scaling hardware to support growth. This, in turn, affects optimization as at-scale teams are challenged to meet the needs of their growing organization.

Additionally, legacy architectures require multiple copies of data, which significantly increases compute and storage requirements. When you add storage in a distributed architecture, you add compute regardless of organizational needs, affecting overall utilization and the ability to control costs.

Lastly, with varying performance capabilities across different storage tiers, there is a risk of slower query response times or inconsistent search results. This can hinder the speed and accuracy of data analysis. A mature analytics strategy faces these challenges head-on to provide operational efficiency, accelerated innovation and reduced cost of doing business.

The Case for Containerizing Modern Analytics Loads

Managing modern data involves more than relying on cloud architecture capabilities alone. Containerization can seamlessly integrate into cloud infrastructure to support modern analytics workloads. Imagine the convenience of running an application in a virtual environment without the hefty resource requirements of a hypervisor. By encapsulating software into virtual self-contained units, that’s exactly what a container can do.

Containerized applications provide greater performance and can run reliably from one computing environment to another. More application instances allow for greater performance overall, and the portability of the storage method enables centralized image management, rapid deployment and elasticity for organizations to scale storage capacity based on demand.

Interestingly, containerized applications can help with CPU utilization as well. In testing, we found that containerized applications enabled up to 60% utilization, compared to only 17% from a bare metal application model. Pair containerization with a high-performance storage solution, and organizations can achieve more flexibility and quicker response as data volumes increase.

Kubernetes’ Role in Unlocking Agile Data Management

Container orchestration platforms like Kubernetes provide robust tools for managing and orchestrating containerized applications at scale. With Kubernetes, platform and DevOps teams can easily deploy and run thousands of applications in a containerized or VM format, on any infrastructure, and can operate with much lower operational costs.

But to fully derive the benefits of a powerful application platform like Kubernetes, users need an equally powerful data platform to complete the solution. The Portworx Data Platform offers advancements such as automated and declarative storage provisioning, volume management, high availability and data replication, data protection and backup, business continuity and disaster recovery, security and robust cost optimization and management. These capabilities enable organizations to efficiently manage and control their data storage across distributed cloud environments, ensuring data availability and agility.

When using Kubernetes for containerized storage, there are considerations to keep in mind to ensure an organization’s mature analytics strategy is optimized and agile. First, using Kubernetes operators can further enhance storage capabilities by automating and simplifying complex tasks.

It’s also crucial to set up high availability at both the data service layer and the storage layer because relying on a single instance in a Kubernetes environment can be risky. Lastly, understanding whether an organization’s data service can be scaled up or scaled out will allow IT teams to choose the best solution to add more capacity or compute power as needed.

Organizations with mature analytics investments are achieving bigger impacts on business outcomes across the board, from customer experience and strategy to product innovation. Through modern data management like container applications and Kubernetes, organizations can make greater use of their data for innovation and growth and, more to the point, increase sales.

The post Unlock Data’s Full Potential with a Mature Analytics Strategy appeared first on The New Stack.

]]>
Stream Processing 101: What’s Right for You? https://thenewstack.io/stream-processing-101-whats-right-for-you/ Fri, 08 Sep 2023 13:20:07 +0000 https://thenewstack.io/?p=22717716

Over the last decade, the growing adoption of Apache Kafka has allowed data streaming — the continuous transmission of streams

The post Stream Processing 101: What’s Right for You? appeared first on The New Stack.

]]>

Over the last decade, the growing adoption of Apache Kafka has allowed data streaming — the continuous transmission of streams of data — to go mainstream.

To run operational and analytics use cases in real time, you don’t want to work with pockets of data that will sit and go stale. You want continuous streams of data that you can deal with and apply as they’re generated and ingested. That’s why so many companies have turned to data streaming, but the reality is that data streaming alone is not enough to maximize the value of real-time data. For that, you need stream processing.

What Is Stream Processing and How Does It Work?

Stream processing means performing operations on data as soon as it’s received. Processing data in flight allows you to extract its value as soon as it arrives rather than waiting for data collection and then batch processing.

By default, most systems are designed with high latency. Batch jobs are strung together to periodically move data from one place to another, like a Rube Goldberg machine. But that doesn’t have to be the case. Organizations gain an advantage when they architect for faster processing, especially in use cases designed to improve an organization’s responsiveness.

The TV streaming apps many of us use are a great example of how stream processing can improve both frontend experiences and backend processes. Every button pressed on a remote control provides information about viewing behavior that can inform the categorization of content to improve the user experience.

At the same time, the app can be designed to ensure viewing quality by monitoring streams of data on rebuffering events and regional outages. Compare that to a system or app that can only provide data on interruptions in predetermined intervals, minutes, hours or even days apart. That’s the difference between using batch-based versus streaming data pipelines to capture the data that runs a business. And once an organization makes the jump to data streaming, incorporating stream processing into the new pipelines they build is the only thing that makes sense.

Organizations that adopt data streaming without taking advantage of stream processing are left dealing with more latency and higher costs than they have to. Why bother to capture data in real time if you’re not going to process and transform it in real time too?

Although not every application you build requires processing data in flight, many of the most valuable use cases such as fraud detection, cyber security and location tracking need real-time processing to work effectively.

When streaming data isn’t processed in real time, it has to be stored in a traditional file system or a cloud data warehouse until an application or service requests that data. That means executing queries from scratch every time you want the data to be joined, aggregated or enriched so it’s ready for downstream systems and applications.

In contrast, stream processing allows you to “look” at the data once rather than having to apply the same operations to it over and over. That reduces storage and compute costs, especially as your data-streaming use cases scale over time.

Stream Processing in the Real World

Once you have stream processing pipelines built, you can connect them to all the places your data lives — from on-premise relational databases to the increasingly popular cloud data warehouses and data lakes. Or you can use these pipelines to connect directly to a live application.

A great example of the benefits of stream processing is real-time e-commerce. Stream processing allows an e-commerce platform to update downstream systems as soon as there’s new information available. When it comes to data points like product pricing and inventory, there can be multiple operational and customer-facing use cases that need that information.

If these platforms have to process data in batches, this leads to greater lag time between the information customers want — new sales and promotions, shipping updates or refunds — and the notifications they actually receive. That’s a poor customer experience that businesses need to avoid if they want to be competitive, and something that’s applicable across every industry.

But before companies and their developers can get started, they need to choose the right data-stream-processing technology. And that choice isn’t necessarily a straightforward one.

Common Stream Processing Technologies

Over the last seven or eight years, a few open source technologies have dominated the world of stream processing. This small handful of technologies are trying to solve the problem of putting data to work faster without compromising data quality or consistency, even if the technical, architectural and operational details underneath differ.

Let’s look at three commonly used stream processors.

  • Apache Flink is a data-processing framework designed to process large-scale data streams. Flink supports both event-driven processing and batch processing, as well as interactive analytics.
  • Kafka Streams, part of the Apache Kafka ecosystem, is a microservices-based, client-side library that allows developers to build real-time stream-processing applications and scalable, high-throughput pipelines.
  • Apache Spark is a distributed engine built for big data analytics that uses micro-batches, which approximates the real-time processing achieved with Flink and Kafka Streams.

Each of these technologies has its strengths, and there are even use cases where it makes sense to combine these technologies. Whether considering these three technologies or the many others available in the broader ecosystem, organizations need to consider how this decision will further their long-term data strategy and allow them to pursue use cases that will keep them competitive as data streaming becomes more widespread.
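To make the micro-batch style concrete, here is a minimal, hypothetical sketch using PySpark Structured Streaming to count page views arriving on a Kafka topic; the broker address and topic name are placeholders, and running it requires the Spark Kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pageview-counts").getOrCreate()

# Placeholder broker and topic; requires the spark-sql-kafka connector package.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-views")
    .load()
)

# Kafka values arrive as bytes; treat each message value as a page identifier.
pages = events.selectExpr("CAST(value AS STRING) AS page")

# Continuously maintained running count per page, updated every micro-batch.
counts = pages.groupBy("page").count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```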

How Organizations Can Choose Their Stream-Processing Technologies

Organizations adopting stream processing today often base this decision on the existing skill set of their developer and operations teams. That’s why you often see businesses with a significant community of practice around Kafka turning to Kafka Streams, for example.

The developer experience is an important predictor of productivity if you plan to build streaming applications in the near future. For example, using a SQL engine (Flink SQL, ksqlDB or Spark SQL) to process data streams may be the right choice for making real-time data accessible to business analysts in your organization. In contrast, for developers used to working with Java, the ease of use and familiarity of Kafka Streams might be a better fit for their skill set.

While this reasoning makes sense for not blocking the way of innovation in the short term, it’s not always the most strategic decision and can limit how far you can take your stream-processing use cases.

How to Get Started with Stream Processing Today

Getting started with stream processing looks different from a practitioner perspective versus an organizational one. While organizations need to think about business requirements, practitioners can focus on the technology that helps them launch and learn fast.

Start by looking at side-by-side comparisons of the streaming technologies you want to use. While a company might evaluate several technologies at once, I’d recommend against that approach for developers — you don’t want to do a proof of concept (POC) on five different technologies. Instead, narrow down your list to two options that fit your requirements, and then build a POC for each.

The easiest way to do this is to find a tutorial that closely matches your use case and dive in. A great way to start is by building streaming pipelines that ingest and process data from Internet of Things (IoT) devices or public data sets like Wikipedia updates. Here are some places to start learning:

Developing streaming applications and services can be challenging because they require a different approach than traditional synchronous programming. Practitioners not only need to become familiar with the technology but also how to solve problems by reacting to events and streams of data, rather than by applying conditions and operations to data at rest.

While the technology you choose today may not be the one you use tomorrow, the problem-solving and stream-processing skills you’re gaining won’t go to waste.

The post Stream Processing 101: What’s Right for You? appeared first on The New Stack.

]]>
How AI Helped Us Add Vector Search to Cassandra in 6 Weeks https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/ Wed, 06 Sep 2023 13:11:53 +0000 https://thenewstack.io/?p=22717457

With the huge demand for vector search functionality that’s required to enable generative AI applications, DataStax set an extremely ambitious

The post How AI Helped Us Add Vector Search to Cassandra in 6 Weeks appeared first on The New Stack.

]]>

With the huge demand for vector search functionality that’s required to enable generative AI applications, DataStax set an extremely ambitious goal to add this capability to Apache Cassandra and Astra DB, our managed service built on Cassandra.

Back in April, when I asked our chief product officer who was going to build it, he said, “Why don’t you do it?”

With two other engineers, I set out to deliver a new vector search implementation on June 7 — in just six weeks.

Could new AI coding tools help us meet that goal? Some engineers have confidently claimed that AI makes so many mistakes that it’s a net negative to productivity.

After trying them out on this critical project, I’m convinced that these tools are in fact a massive boost to productivity. In fact, I’m never going back to writing everything by hand. Here’s what I learned about coding with ChatGPT, GitHub Copilot and other AI tools.

Copilot

Copilot is simple: It’s enhanced autocomplete. Most of the time it will complete a line for you or pattern-match a completion of several lines from context. In one case, I wrote a comment and then started a new line with neighbors, and Copilot offered to complete the rest of the expression correctly.

Here’s a slightly more involved example from test code, where I started off writing the loop as a mapToLong but then changed my data structures so that it ended up being cleaner to invoke a method with forEach instead. Copilot had my back and offered the forEach version.
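As a rough illustration of that kind of rewrite (a hypothetical reconstruction, not the code from the original screenshots), the before-and-after shapes look something like this:

```java
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical reconstruction of the mapToLong-to-forEach change described above;
// the Node type and the counts are invented for illustration.
public class CopilotCompletionSketch {
    record Node(long visitCount) {}

    public static void main(String[] args) {
        List<Node> nodes = List.of(new Node(3), new Node(7), new Node(5));

        // Before: summing via a stream pipeline.
        long total = nodes.stream().mapToLong(Node::visitCount).sum();

        // After a data-structure change, accumulating with forEach read more cleanly;
        // this is the shape of completion Copilot tends to offer from context.
        LongAdder sum = new LongAdder();
        nodes.forEach(node -> sum.add(node.visitCount()));

        System.out.println(total + " == " + sum.sum());
    }
}
```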

And occasionally (this is more the exception than the rule), it surprises me by offering to complete an entire method.

Copilot is useful, but limited, for two reasons. First, it’s tuned to (correctly) err on the side of caution. It can still hallucinate, but it’s rare; when it doesn’t think it knows what to do, it doesn’t offer completions. Second, it is limited by the requirement to be fast enough to seamlessly integrate with a brief pause in human typing, which rules out using a heavyweight model like GPT-4, for now.

(See also this tweet from Max Krieger for a “Copilot maximalist” take.)

ChatGPT

You can try to get Copilot to generate code from comments, but for that use case you will almost always get better results from GPT-4, via paid ChatGPT or API access.

If you haven’t tried GPT-4 yet, you absolutely should. It’s true that it sometimes hallucinates, but it does so much less than GPT-3.5 or Claude. It’s also true that sometimes it can’t figure out simple problems (here I am struggling to get it to understand a simple binary search). But other times it’s almost shockingly good, like this time when it figured out my race condition on its first try. And even when it’s not great, having a rubber duck debugging partner that can respond with a passable simulacrum of intelligence is invaluable to stay in the zone and stay motivated.

And you can use it for everything. Or at least anything you can describe with text, which is very close to everything, especially in a programming context.

Here are some places I used GPT-4:

  • Random questions about APIs that I would have had to source dive for. This is the most likely category to result in hallucinations, and I have largely switched to Phind for this use case (see below).
  • Micro-optimizations. It’s like Copilot but matching against all of Stack Overflow, because that’s (part of) what it was trained on.
  • Involved Stream pipelines, because I am not yet very good at turning the logic in my head into a functional chain of Stream method calls. Sometimes, as in this example, the end result is worse than where we started, but that happens a lot in programming. It’s much easier and faster to do that exploration with GPT than one keystroke at a time. And making that time-to-results loop faster makes it more likely that I’ll try out a new idea, since the cost of experimenting is lower.
  • Of course GPT also knows about git, but maybe you didn’t realize how good it is at building custom tools using git. Like the other bullets in this list, this is stuff I could have done before by hand, but having GPT there to speed things up means that now I’ll create tools like this (before, I usually would have reached for whatever the second-best solution was, instead of spending an hour on a one-off script like this).

Here’s my favorite collaboration with GPT-4. I needed to write a custom class to avoid the garbage collection overhead of the box/unbox churn from a naive approach using ConcurrentHashMap<Integer, Integer>, and this was for Lucene, which has a strict no-external-dependencies policy, so I couldn’t just sub in a concurrent primitives map like Trivago’s fastutil-concurrent-wrapper.
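For context, a primitive-keyed map of that kind has roughly the shape sketched below. This is a deliberately simplified illustration, with coarse-grained synchronization and no resizing, not the finer-grained concurrent class the project actually needed.

```java
import java.util.Arrays;

// Simplified sketch of an int-to-int map that avoids Integer boxing entirely.
// Synchronized methods and a fixed capacity keep the example short; the real class
// required finer-grained concurrency and went through several GPT-assisted iterations.
public final class IntIntHashMap {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel; this key value is unsupported
    private final int[] keys;
    private final int[] values;
    private int size;

    public IntIntHashMap(int expectedEntries) {
        // Use a power-of-two table at least four times the expected entry count.
        int capacity = Integer.highestOneBit(Math.max(expectedEntries, 8) * 4);
        keys = new int[capacity];
        values = new int[capacity];
        Arrays.fill(keys, EMPTY);
    }

    public synchronized void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("sentinel key not supported");
        if (size >= keys.length / 2) throw new IllegalStateException("table full (no resizing in this sketch)");
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask; // linear probing
        }
        if (keys[i] == EMPTY) size++;
        keys[i] = key;
        values[i] = value;
    }

    public synchronized int getOrDefault(int key, int defaultValue) {
        int mask = keys.length - 1;
        int i = mix(key) & mask;
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return defaultValue;
    }

    private static int mix(int x) { // spread the hash bits so clustered keys probe less
        x ^= x >>> 16;
        return x * 0x7feb352d;
    }
}
```

The point is that keys and values live in plain int arrays, so lookups never allocate the Integer objects that a ConcurrentHashMap<Integer, Integer> would.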

I went back and forth several times with GPT, improving its solution. This conversation illustrates what I think are several best practices with GPT (as of mid-2023):

  1. When writing code, GPT does best with nicely encapsulated problems. By contrast, I have been mostly unsuccessful trying to get it to perform refactorings that touch multiple parts of a class, even a small one.
  2. Phrase suggestions as questions. “Would it be more efficient to … ?” GPT (and, even more so, Claude) is reluctant to directly contradict its user. Leave it room to disagree or you may unintentionally force it to start hallucinating.
  3. Don’t try to do everything in the large language model (LLM). The final output from this conversation still needs some tweaks, but it’s close enough to what I wanted that it was easier and faster to just finish it manually instead of trying to get GPT to get it exactly right.
  4. Generally, I am not a believer in magical prompts — better to use a straightforward prompt, and if GPT goes off in the wrong direction, correct it — but there are places where the right prompt can indeed help a great deal. Concurrent programming in Java is one of those places. GPT’s preferred solution is to just slap synchronized on everything and call it a day. I found that telling it to think in the style of concurrency wizard Cliff Click helps a great deal. More recently, I’ve also switched to using a lightly edited version of Jeremy Howard’s system prompt.

Looking at this list, it’s striking how well it fits with the rule of thumb that AI is like having infinite interns at your disposal. Interns do best with self-contained problems, are often reluctant to contradict their team lead and frequently it’s easiest to just finish the job yourself rather than explain what you want in enough detail that the intern can do it. (While I recommend resisting the temptation to do that with real interns, with GPT it doesn’t matter.)

Advanced Data Analysis

Advanced Data Analysis, formerly known as Code Interpreter — also part of ChatGPT — is next level, and I wish it were available for Java yesterday. It wraps GPT-4 Python code generation in a Jupyter-like sandbox and puts it in a loop to correct its own mistakes. Here’s an example from when I was troubleshooting why my indexing code was building a partitioned graph.

The main problem to watch for is that ADA likes to “solve” problems with unexpected input by throwing the offending lines away, which usually isn’t what you want. And it’s usually happy with its efforts once the code runs to completion without errors – you will need to be specific about sanity checks that you want it to include. Once you tell it what to look for, it will add that to its “iterate until it succeeds” loop, and you won’t have to keep repeating yourself.

Also worth mentioning: The rumor mill suggests that ADA is now running a more advanced model than regular GPT-4, with (at minimum) a longer context window. I use ADA for everything by default now, and it does seem like an improvement; the only downside is that sometimes it will start writing Python for me when I want Java.

Claude

Claude is a competitor of OpenAI’s GPT from Anthropic. Claude is roughly at GPT 3.5 level for writing code — it’s noticeably worse than GPT-4.

But Claude has a 100,000 token context window, which is over 10 times what you get with GPT-4. (OpenAI just announced an Enterprise ChatGPT that increases GPT-4’s context window to 32,000 tokens, which is still only a third of Claude.)

I used Claude for three things:

  1. Pasting in entire classes of Cassandra code to help figure out what they do.
  2. Uploading research papers and asking questions about them.
  3. Doing both at once: Here’s a research paper; here’s my implementation in Java. How are they different? Do those differences make sense given constraints X and Y?

Bing and Phind

Bing Chat got a bunch of attention when it launched earlier this year, and it’s still a good source of free GPT-4 (select the “Creative” setting), but that’s about it. I have stopped using it almost entirely. Whatever Microsoft did to Bing’s flavor of GPT-4 made it much worse at writing code than the version in ChatGPT.

Instead, when I want AI-flavored search, I use Phind. It’s what Bing should have been, but for whatever reason a tiny startup out-executed Microsoft on one of its flagship efforts. Phind has completely replaced Google for my “how do I do X”-type questions in Java, Python, git and more. Here’s a good example of solving a problem with an unfamiliar library. On this kind of query, Phind almost always nails it — and with relevant sources, too. In contrast, Bing will almost always cite at least one source as saying something different than it actually does.

Bard

I haven’t found anything that Bard is good at yet. It doesn’t have GPT-4’s skill at writing code or Claude’s large context window. Meanwhile, it hallucinates more than either.

Making Coding Productive — and Fun

Cassandra is a large and mature codebase, which can be intimidating to a new person looking to add a feature — even to me, after 10 years spent mostly on the management side. If AI is going to help any of us move faster, this is the way. ChatGPT and related AI tooling are good at writing code to solve well-defined problems, whether as part of a larger project designed by a human engineer or as one-off tooling. They are also useful for debugging, sketching out prototypes and exploring unfamiliar code.

In short, ChatGPT and Copilot were key to meeting our deadline. Having these tools makes me 50% to 100% more productive, depending on the task. They have limitations, but they excel at tirelessly iterating on smaller tasks and help their human supervisor stay in the zone by acting as a tireless, uncomplaining partner to bounce ideas off of. Even if you have years of programming experience, you need to do this.

Finally, even setting aside productivity, coding with an AI that helps with the repetitive parts is just more fun. It’s given me a second wind and a new level of excitement for building cool things. I look forward to using more advanced versions of these tools as they evolve and mature.

Try building on Astra DB with vector search.

The post How AI Helped Us Add Vector Search to Cassandra in 6 Weeks appeared first on The New Stack.

]]>
Apache Flink for Real Time Data Analysis https://thenewstack.io/apache-flink-for-real-time-data-analysis/ Tue, 05 Sep 2023 18:09:47 +0000 https://thenewstack.io/?p=22717070

In this latest podcast from The New Stack, we explore Apache Flink, a platform for running both batch and real-time

The post Apache Flink for Real Time Data Analysis appeared first on The New Stack.

]]>

In this latest podcast from The New Stack, we explore Apache Flink, a platform for running both batch and real-time streaming data analysis jobs. This recording is the first in a three-part series on a new managed service Amazon Web Services is unveiling built on Flink. Subsequent episodes will focus on the managed service itself and on the customer experience.

Joining the podcast are Danny Cranmer, a principal engineer at Amazon Web Services as well as an Apache Flink PMC member and committer, and Hong Teoh, a software development engineer at AWS.

Flink is a high-level framework that can be used to define data analytics jobs. It supports bounded (“batch”) and unbounded (“streaming”) data sets. It provides a set of APIs against which you can build analysis jobs using Java, Python, SQL and other languages. In addition to the framework, there is an engine to run the jobs. Jobs run in a distributed manner with fault tolerance and horizontal scaling capabilities.

Extract-Transform-Load (ETL) is one use case, in which raw data is gathered and formatted for a particular workload. Flink is good for doing this job quickly when you need the results immediately. Flink is built for such an unbounded data stream and can offer really low latency for the transforms.
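As a hypothetical sketch of that kind of low-latency transform, the minimal Flink DataStream job below processes each record the moment it arrives; the socket source and printed sink are stand-ins for real connectors such as Kafka or Kinesis, not anything specified in the episode.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Minimal streaming ETL sketch: read raw lines, drop blanks, normalize each record
// as it arrives and print the result. Placeholder source and sink.
public class StreamingEtlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> rawEvents = env.socketTextStream("localhost", 9999);

        DataStream<String> cleaned = rawEvents
                .filter(line -> !line.isBlank()) // drop empty records immediately
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String line) {
                        return line.trim().toUpperCase(); // stand-in for a real transform
                    }
                });

        cleaned.print(); // a real pipeline would write to a sink connector instead

        env.execute("streaming-etl-sketch");
    }
}
```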

“So the moment your data is generated, you will process it and output it,” Teoh said. “Let’s say you wanted to take in your data in real-time, you want to kind of analyze what’s happened, generate some insights, maybe it’s business data, or sports data, right? And you want to display that on a dashboard. Flink is very good at that because you can analyze things immediately on the fly.”

Event-driven applications rely on Flink as well. In this scenario, an event, such as a user looking up the weather, triggers an immediate reaction, such as the weather data being served up.

Streaming data, unlike batch data, is constantly being updated with new values. As a result, it needs to be processed incrementally, and the results are delivered incrementally. Flink can do both batch processing and streaming with the same SQL commands.

Like any good real-time data processing system, Flink guarantees exactly-once processing, or the ability to avoid duplicates, Cranmer explained. As with any distributed system, a transaction may be processed separately by two different nodes. In fields such as banking, this fault could lead to duplicate transactions — clearly not a good idea. Flink also periodically checkpoints each job so that, if any node fails, it will automatically return to the last known good state.
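Turning that behavior on is a small amount of configuration. The sketch below uses an arbitrary 60-second interval, and the exact classes and defaults vary somewhat between Flink versions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Checkpoint every 60 seconds (an arbitrary choice) with exactly-once guarantees,
// so a failed job can roll back to its last known good state.
public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}
```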

In addition to the Flink architecture, we also discuss the AWS role in maintaining the open source project and the future of Flink.

The post Apache Flink for Real Time Data Analysis appeared first on The New Stack.

]]>
Change Data Capture for Real-Time Access to Backend Databases https://thenewstack.io/change-data-capture-for-real-time-access-to-backend-databases/ Tue, 05 Sep 2023 15:05:52 +0000 https://thenewstack.io/?p=22717272

In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are

The post Change Data Capture for Real-Time Access to Backend Databases appeared first on The New Stack.

]]>

In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are designed to support real-time analytics as a part of event-driven architectures. They prioritize high write throughput; low query latency, even for complex analytical queries that include filtered aggregates and joins; and high levels of concurrent requests.

This highly-specialized class of database, which includes open source variants such as ClickHouse, Apache Pinot and Apache Druid, is often the first choice when you’re building a real-time data pipeline from scratch. But more often than not, real-time analytics is pursued as an add-on to an existing application or service, where a more traditional, relational database like PostgreSQL, SQL Server or MySQL has already been collecting data for years.

In the post I linked above, I also briefly touched on how these online transactional processing (OLTP) databases aren’t optimized for analytics at scale. When it comes to analytics, they simply cannot deliver the same query performance at the necessary levels of concurrency. If you want to understand why in more detail, read this.

But the Internet Is Built on These Databases!

Row-based databases may not work for real-time analytics, but we can’t get around the fact that they are tightly integrated with backend data systems around the world and across the internet. They’re everywhere, and they host critical data sets that are integral to and provide context for many of the real-time systems and use cases we want to build. They store facts and dimensions about customers, products, locations and more that we want to use to enrich streaming data and build more powerful user experiences.

So, what are we to do? How do you bring this row-oriented, relational data into the high-speed world of real-time analytics? And how do you do it without overwhelming your relational database server?

Here’s How Not to Do It

Right now, the prevailing pattern to get data out of a relational database and into an analytical system is using a batch extract, transform, load (ETL) process scheduled with an orchestrator to pull data from the database, transform it as needed and dump it into a data warehouse so the analysts can query it for the dashboards and reports. Or, if you’re feeling fancy, you go for an extract, load, transform (ELT) approach and let the analytics engineers build 500 dbt models on the Postgres table you’ve replicated in Snowflake.

This may as well be an anti-pattern in real-time analytics. It doesn’t work. Data warehouses make terrible application backends, especially when you’re dealing with real-time data.

Batch ETL processes read from the source system on a schedule, which not only introduces latency but also puts strain on your relational database server.

ETL/ELT is simply not designed for serving high volumes of concurrent data requests in real-time. By nature, it introduces untenable latency between data updates and their availability to downstream consumers. With these batch approaches, latencies of more than an hour are common, with five-minute latencies about as fast as can be expected.

And finally, ETLs put your application or service at risk. If you’re querying a source system (often inefficiently) on a schedule, that puts a strain on your database server, which puts a strain on your application and degrades your user experience. Sure, you can create a read replica, but now you’re doubling your storage costs, and you’re still stuck with the same latency and concurrency constraints.

Change Data Capture (CDC) to the Real-Time Rescue

Hope is not lost, however, thanks to real-time change data capture (CDC). CDC is a method of tracking changes made to a database such as inserts, updates and deletes, and sending those changes to a downstream system in real time.

Change data capture works by monitoring a transaction log of the database. CDC tools read the transaction log and extract the changes that have been made. These changes are then sent to the downstream system.

Change data capture tools read from the database log file and propagate change events to a message queue for downstream consumers.

The transaction log, such as PostgreSQL’s Write Ahead Log (WAL) or MySQL’s “bin log,” chronologically records database changes and related data. This log-based CDC minimizes the additional load on the source system, making it superior to other methods executing queries directly on source tables.

CDC tools monitor these logs for new entries and append them to a topic on an event-streaming platform like Apache Kafka or some other message queue, where they can be consumed and processed by downstream systems such as data warehouses, data lakes or real-time data platforms.
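On the consuming side, a downstream system is often just a Kafka consumer subscribed to those change topics. The sketch below uses the plain Kafka Java client; the broker address, group id and topic name are placeholders.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Reads change events (as JSON strings) from a hypothetical CDC topic and prints them.
// In a real pipeline, the consumer would be a real-time database's ingestion layer.
public class CdcTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "cdc-demo-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("shop.public.orders")); // placeholder CDC topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is one insert, update or delete captured from the source database.
                    System.out.printf("key=%s change=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```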

Real-Time Analytics with Change Data Capture Data

If your service or product uses a microservices architecture, it’s highly likely that you have several (perhaps dozens!) of relational databases that are continually being updated with new information about your customers, your products and even how your internal systems are running. Wouldn’t it be nice to be able to run analytics on that data in real time so you can implement features like real-time recommendation engines or real-time visualizations in your products or internal tools like anomaly detection, systems automation or operational intelligence dashboards?

For example, let’s say you run an e-commerce business. Your website runs over a relational database that keeps track of customers, products and transactions. Every customer action, such as viewing products, adding to a cart and making a purchase, triggers a change in a database.

Using change data capture, you can keep these data sources in sync with real-time analytics systems to provide the up-to-the-second details needed for managing inventory, logistics and positive customer experiences.

Now, when you want to place a personalized offer in front of a shopper during checkout to improve conversion rates and increase average order value, you can rely on your real-time data pipelines, fed by the most up-to-date change data to do so.

How Do You Build a Real-Time CDC Pipeline?

OK, that all sounds great. But how do you build a CDC event pipeline? How do you stream changes from your relational database into a system that can run real-time analytics and then expose them back as APIs that you can incorporate into the products you’re building?

Let’s start with the components you’ll need:

  • Source data system: This is the database that contains the data being tracked by CDC. It could be Postgres, MongoDB, MySQL or any other such database. Note that the database server’s configuration may need to be updated to support CDC.
  • CDC connector: This is an agent that monitors the data source and captures changes to the data. It connects to a database server, monitors transaction logs and publishes events to a message queue. These components are built to navigate database schema and support tracking specific tables. The most common tool here is Debezium, an open source change data capture framework on which many data stack companies have built change data tooling (a sketch of registering a Debezium connector appears after this list).
  • Event streaming platform: This is the transport mechanism for your change data. Change data streams get packaged as messages, which are placed onto topics, where they can be read and consumed by many downstream consumers. Apache Kafka is the go-to open source tool here, with Confluent and Redpanda, among others, providing some flexibility and performance extensions on Kafka APIs.
  • Real-time database or platform: For batch analytics workflows like business intelligence and machine learning, this is usually a data warehouse or data lake. But we’re here for real-time analytics, so in this case, we’d go with a real-time database like those I mentioned above or a real-time data platform like Tinybird. This system subscribes to change data topics on the event streaming platform and writes them to a database optimized for low-latency, high-concurrency analytics queries.
  • Real-time API layer: If your goal, like many others, is to build user-facing features on top of change data streams, then you’ll need an API layer to expose your queries and scale them to support your new service or feature. This is where real-time data platforms like Tinybird provide advantages over managed databases, as they offer API creation out of the box. Otherwise, you can turn to tried-and-tested ORMs (object-relational mappings) and build the API layer yourself.

An example real-time CDC pipeline for PostgreSQL. Note that unless your destination includes an API layer, you’ll have to build one to support user-facing features.
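To show what wiring the connector into this pipeline can look like, here is a hypothetical registration of a Debezium PostgreSQL connector against a Kafka Connect worker. Every host, credential, table and topic name is a placeholder, and Debezium property names shift between releases, so treat this as a sketch rather than a recipe.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Registers a placeholder Debezium PostgreSQL connector via the Kafka Connect REST API.
public class RegisterCdcConnector {
    public static void main(String[] args) throws Exception {
        String connectorJson = """
            {
              "name": "shop-orders-cdc",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "plugin.name": "pgoutput",
                "database.hostname": "postgres.internal",
                "database.port": "5432",
                "database.user": "cdc_user",
                "database.password": "secret",
                "database.dbname": "shop",
                "table.include.list": "public.orders",
                "topic.prefix": "shop"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```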

Put all these components together, and you’ve got a real-time analytics pipeline built on fresh data from your source data systems. From there, what you build is limited only by your imagination (and your SQL skills).

Change Data Capture: Making Your Relational Databases Real Time

Change data capture (CDC) bridges the gap between traditional backend databases and modern real-time streaming data architectures. By capturing and instantly propagating data changes, CDC gives you the power to create new event streams and enrich others with up-to-the-second information from existing applications and services.

So what are you waiting for? It’s time to tap into that 20-year-old Postgres instance and mine it for all it’s worth. Get out there, research the right CDC solution for your database, and start building. If you’re working with Postgres, MongoDB or MySQL, here are some links to get you started:

The post Change Data Capture for Real-Time Access to Backend Databases appeared first on The New Stack.

]]>
What’s ‘Pipeline-Free’ Real-Time Data Analytics? https://thenewstack.io/whats-pipeline-free-real-time-data-analytics/ Tue, 05 Sep 2023 14:41:58 +0000 https://thenewstack.io/?p=22717336

Organizations across various industries are dealing with massive volumes of data that require extensive analysis and querying to help them

The post What’s ‘Pipeline-Free’ Real-Time Data Analytics? appeared first on The New Stack.

]]>

Organizations across various industries are dealing with massive volumes of data that require extensive analysis and querying to help them better serve their customers. The sheer scale of this data can involve tens of thousands of metrics and dimensions, and data stores spanning several petabytes.

To achieve real-time analytics, it usually takes a monumental effort to implement the query layer. Many organizations turn to open source alternatives like Apache Druid or Presto, along with data denormalization in separate pipelines, to ingest diverse data sources for multi-table queries.

However, this process demands significant resources and expertise, involving teams of engineers for implementation and maintenance, leading to time-consuming and resource-intensive efforts. Even minor schema changes can require days of effort, creating challenges for large organizations.

“Many people tend to give up on real-time analytics because of the organizational complexities they face when dealing with the software,” Sida Shen, product manager at CelerData, told The New Stack. “It’s the primary challenge they encounter, and it often leads them to dismiss the idea altogether.”

In this article, we explore an alternative approach to data analytics that eliminates the need for traditional data pipelines.

The Limits of Traditional Data Pipelines

Traditional pipelines lack flexibility, making it cumbersome to modify data models or pipelines. Each component adds complexity and increases the possibility of failure. Those components will likely lead to degradation in performance over time, not to mention the high operational costs.

Proper real-time analytics relies on various data transformations and data-cleaning processes. Additionally, teams typically lean on pre-aggregation, which performs certain calculations in advance, and on denormalization. (Denormalization means adding precomputed, redundant data to a relational database to improve its read performance.)

A “pipeline-free” solution addresses delays in data refreshing, minimizes latency, and reduces the complexity associated with denormalization and pre-aggregation steps, which often introduce time limits and delays in real-time analytics.

“The main advantage of going pipeline-free for real-time analytics is that it becomes much more accessible to a broader range of users, including those who may not be experienced engineers,” Shen said.

“With fewer complexities, organizations can easily manage their data and keep their five tables intact within the database, without resorting to the cumbersome process of pre-joining them into one table. This added flexibility is a significant boon, making the entire data more efficient.”

Pre-aggregation and denormalization are “Band-Aids” needed to enable real-time queries, along with dashboards and data applications across distributed data sources within and outside of the enterprise, according to Torsten Volk, an analyst for Enterprise Management Associates.

“Both practices sacrifice efficiency, flexibility and cost for query performance and simplicity,” Volk told The New Stack. “The more data sources we connect and the more historical data we include, the more we blow up the size of the underlying data stores, and the more SQL we have to write to join database tables.

“This makes it harder to manage and update data pipelines,” he added. “All of these factors dramatically lower the enthusiasm for building real-time data apps and queries, preventing enterprises from enhancing and automating their decision-making capabilities.”

A pipeline-free, real-time analytics alternative can significantly reduce the headaches organizations face during real-time analytics projects.

By using multi-table joins, you can eliminate the denormalization process and streamline real-time analytics processes, offering a substantial advantage in managing and implementing data pre-aggregation internally.

Joins merge data from two or more tables in a relational database into a unified result. CelerData describes the joins it offers with open source StarRocks as essential for real-time analytics. This eliminates a vast number of steps, resources and operational complexities, making real-time analytics more manageable and efficient.
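From an application’s point of view, a query-time join is just a SQL statement. StarRocks speaks the MySQL wire protocol, so a standard MySQL JDBC driver (assumed on the classpath here) can run a multi-table join directly, with no pre-joined wide table; the endpoint, schema and table names below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical example: join normalized fact and dimension tables at query time
// instead of maintaining a denormalized wide table in a separate pipeline.
public class QueryTimeJoinSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://starrocks-fe.example.com:9030/ecommerce"; // placeholder endpoint
        try (Connection conn = DriverManager.getConnection(url, "analyst", "secret");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT p.category, SUM(o.amount) AS revenue "
                   + "FROM orders o JOIN products p ON o.product_id = p.id "
                   + "WHERE o.order_ts >= DATE_SUB(NOW(), INTERVAL 1 HOUR) "
                   + "GROUP BY p.category ORDER BY revenue DESC");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                System.out.printf("%s: %.2f%n", rs.getString("category"), rs.getDouble("revenue"));
            }
        }
    }
}
```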

Airbnb recently migrated to StarRocks. With it, Airbnb engineers can maintain the tables in a snowflake schema and perform joins on the fly at query time, according to CelerData.

“This current definition of ‘pipeline-free’ refers to freeing data pipelines from the overhead generated by data scientists and analysts working around the limitations of joining data across standard row-based database systems,” Volk said.

“Everyone who has written the SQL code required to pull this off for a few dozen tables knows how hard it is to predict query performance and query cost and also how difficult it is to still understand your own query next week. Eliminating this overhead is a really big deal.”

An Open Source Alternative

As mentioned previously, many organizations struggle with real-time analytics due to the complexities of setting up data pipelines and managing denormalization processes. This often deters them from fully embracing real-time data analysis, leaving them feeling overwhelmed and opting for traditional batch processing solutions.

However, there is a promising alternative that can simplify real-time analytics and make it accessible to a broader audience. By leveraging tools like StarRocks, an open source project created in 2020, organizations can achieve real-time analytics without the need for extensive data pipelines or additional stream-processing tools.

“StarRocks provides built-in functionality to support these operations, eliminating the need for additional tools like Spark Streaming,” Shen said.

Thus far, StarRocks, an online analytical processing (OLAP) database donated to The Linux Foundation in February, has racked up more than 5,000 GitHub stars and 1,200 forks.

StarRocks is a sub-second massively parallel processing (MPP) OLAP database for full analytics scenarios, including multidimensional analytics, real-time analytics and ad hoc queries, according to the project’s documentation.

While CelerData created and largely maintains the project, it is drawing interest from the developer community, with over 1,500 active pull requests, 70 active developers and 624 commits to the main GitHub branch this month.

Indeed, StarRocks has drawn “an active developer community,” Volk said, adding, “These metrics confirm that the StarRocks project is real and that there is significant demand for a database platform that enables real-time analytics right out of the box.”

Pre-aggregation plays a crucial role when using StarRocks. By performing calculations ahead of time, organizations can streamline analytics processes and significantly reduce resource and time consumption. Furthermore, with pipeline-free real-time analytics, it’s more efficient to manage data refreshing; the approach also minimizes latency and delays in data availability.

Gaining Flexibility

One of the most significant advantages of adopting this “pipeline-free” strategy is flexibility. Unlike traditional solutions that force organizations to pre-join multiple tables into a wide table, pipeline-free analytics allows them to keep individual tables in the database. This freedom to maintain separate tables and make schema changes without backfilling historical data can prove invaluable for scaling and managing data efficiently.

Incorporating StarRocks into real-time analytics empowers organizations to handle massive amounts of data with ease. Whether they are large corporations or small Software as a Service providers, StarRocks adapts to various use cases and data sizes. The end-to-end latency of less than 10 seconds ensures timely and accurate results, making it an ideal choice for organizations seeking efficient real-time data analysis solutions.

Ultimately, by embracing pipeline-free real-time analytics with StarRocks, organizations can streamline their processes, minimize complexity and unlock the full potential of their data analytics endeavors.

“Addressing query performance, query cost, data freshness and complexity is a critical step forward as it makes real-time analytics more accessible for enterprise use cases,” Volk said. “It is all about squeezing the most value out of the data you have ‘lying around anyway,’ and this is what the StarRocks database aims to help you achieve.”

The post What’s ‘Pipeline-Free’ Real-Time Data Analytics? appeared first on The New Stack.

]]>
D-Wave Suggests Quantum Annealing Could Help AI https://thenewstack.io/d-wave-suggests-quantum-annealing-could-help-ai/ Mon, 04 Sep 2023 10:00:24 +0000 https://thenewstack.io/?p=22716834

The effect of quantum computing on Artificial Intelligence could be as understated as it is profound. Some say quantum computing

The post D-Wave Suggests Quantum Annealing Could Help AI appeared first on The New Stack.

]]>

The effect of quantum computing on Artificial Intelligence could be as understated as it is profound.

Some say quantum computing is necessary to achieve artificial general intelligence. Certain expressions of this paradigm, such as quantum annealing, are inherently probabilistic and well suited to machine learning. The most pervasive quantum annealing use cases center on optimization and constraints, problems that have traditionally involved non-statistical AI approaches like rules, symbols, and reasoning.

When one considers the fact that there are now cloud options for accessing this form of quantum computing (replete with resources for making it enterprise-applicable for any number of deployments) sans expensive hardware, one fact becomes unmistakably clear.

“With quantum computing, a lot of times we’re talking about what will it be able to do in the future,” observed Mark Johnson, D-Wave’s SVP of Quantum Technologies and Systems Products. “But no, you can do things with it today.”

Granted, not all those things involve data science intricacies. Supply chain management and logistics are just as easily handled by quantum annealing technologies. But when these applications are considered in tandem with some of the more progressive approaches to AI enabled by quantum annealing, their value to organizations across verticals becomes apparent.

Understanding Quantum Annealing

Quantum annealing is the variety of quantum computing in which a specific problem, even an NP-hard one, is solved when the quantum computer reaches its lowest energy state. Thus, whether users are trying to select features for a machine learning model or find the optimal route for a fleet of grocery store delivery drivers, quantum annealing provides these solutions when the lowest energy state is achieved. “Annealing quantum computing is a heuristic probabilistic solver,” Johnson remarked. “So, you might end up with the very best answer possible or, if you don’t, you will end up with a very good answer.”

Quantum annealing’s merit lies in its ability to supply these answers at an enormous scale — such as that required for a defense agency’s need to analyze all possible threats and responses for a specific location at a given time. It excels in cases in which “you need to consider many, many possibilities and it’s hard to wade through them,” Johnson mentioned. Classical computational models consider each possibility one at a time for such a combinatorial optimization problem.

Quantum annealing considers those possibilities simultaneously.

Statistical AI

The data science implications for this computational approach are almost limitless. One developer resource D-Wave has made available via the cloud is a plug-in for the SDK for Ocean — a suite of open source Python tools — that integrates with scikit-learn to improve feature selection. It supports “recognizing in a large pattern of data, can I pick out features that correlate with certain things and being able to navigate that,” Johnson remarked. “I understand it ends up mapping into an optimization problem.” The statistical aspects of quantum annealing are suitable for other facets of advanced machine learning, too.

According to Johnson, because of its “probabilistic nature, one of the interesting things that quantum annealing does is not just picking the best answer or a good answer, but coming up with a distribution, a diversity of answers, and understanding the collection of answers and a little about how they relate to each other.” This quality of quantum annealing is useful for numerous dimensions of machine learning including backpropagation, which is used to adjust a neural network’s parameters while going from the output to the input. It can also reinforce what Johnson termed “Boltzmann sampling,” which involves randomly sampling combinatorial structures.

Cloud, Hybrid Framework

There are considerable advantages to making quantum annealing available through the cloud. The cloud architecture for accessing this form of computing is just as viable for accessing what Johnson called the “gate model” type of quantum computing, which is primed for factoring numbers and used in “RSA encryption schema,” Johnson confirmed. Organizations can avail themselves of quantum annealing in D-Wave’s cloud platform. Moreover, they can also utilize hybrid quantum and classical computing infrastructure as well, which is becoming ever more relevant in modern quantum computing conversations. “You would just basically be using both of them together for the part of the problem that’s most efficient,” Johnson explained.

In addition to the ready availability of each of these computational models, D-Wave’s cloud platform furnishes documentation for a range of example use cases for common business problems across industries. There’s also an “integrated developer environment you can pull up that already has in it Ocean, our open source suite of tools, which help the developer interface with the quantum computer,” Johnson added. Examples include the ability to write code in Python. When organizations find documentation in the cloud about a previous use case that’s similar to theirs, “You can pull up sample code that will… use the quantum computer to solve that problem in your integrated developer environment,” Johnson noted.

Quantum Supremacy

That sample code provides an excellent starting point for developers to build applications that apply quantum computing and hybrid quantum-classical computing methods to an array of business problems in financial services, manufacturing, life sciences and more. It’s just one of the many benefits of quantum computing through the cloud. The appeal of quantum annealing, of course, lies in its ability to reduce the time required to solve combinatorial optimization problems.

As the ready examples of quantum solutions — the vast majority of which entail quantum annealing — across the aforesaid verticals indicate, such issues “are, the harder we look, ubiquitous throughout business,” Johnson indicated. The data science utility of quantum annealing for feature selection, Boltzmann sampling, and backpropagation is equally horizontal and may prove influential to the adoption rates of this computational approach.

The post D-Wave Suggests Quantum Annealing Could Help AI appeared first on The New Stack.

]]>
Microservices and Mortgage Meltdown: Let’s Get Relational https://thenewstack.io/microservices-and-mortgage-meltdown-lets-get-relational/ Wed, 30 Aug 2023 14:51:08 +0000 https://thenewstack.io/?p=22716952

Financial services are embracing digital transformation, but if the UK’s 2022 mortgage meltdown proved anything it’s that there’s a long

The post Microservices and Mortgage Meltdown: Let’s Get Relational appeared first on The New Stack.

]]>

Financial services are embracing digital transformation, but if the UK’s 2022 mortgage meltdown proved anything it’s that there’s a long way left to go.

The mortgage crisis saw many well-known lenders unable to update the business IT systems behind their mortgage products in time to keep up with surging interest rates. Rather than leave themselves exposed, they withdrew mortgage products, losing money and customers as a result.

Technology wasn’t supposed to be like this. So what went wrong?

Playing Catch-up

The financial sector is changing, but the pace is slowed by decades of IT legacy. Behind each mortgage product sits business IT systems responsible for different phases of the mortgage process, from web offers through approval to account and customer management.

Tales of IT legacy are almost as old as London’s venerable financial heart, but the IT problem is no longer siloed. Each of these systems employs a monolithic application founded on hundreds of thousands of lines of code, running on lots of different platforms — mainframes, client-server and hybrid cloud, bought or built in-house. A change to one system has consequences for the others in the chain, so each must be taken offline for work.

The situation was complicated by banks’ IT change processes. As banks have built or bought IT systems, the job of managing them has spread out across teams that extend to third parties with requests for change communicated through arcane, ticket-based systems. It took one building society using such a system seven days to change the interest rates displayed on its customer portal.

Finally, some organizations simply lacked joined-up digital processes to move at speed. I know of organizations where staff printed online applications for manual review — taking three days to deliver a decision.

It’s tempting to write off the meltdown as a black swan, the culmination of unique events, but this would be wrong. It was merely a microcosm — change is now business as usual. The period between March 2009 and December 2021 saw six interest rate changes; rates since December 2021 have changed more than 16 times with UK mortgage interest rates now hitting 15-year highs.

As debt increases, financial services firms have now become preoccupied with how best to gain customers and stop them from being poached by more adroit rivals over the coming 18 months. Those with better packages and a unified customer journey to onboard those newcomers will dominate.

The Stack: Done Right

The prescribed answer to achieving this would seem obvious: to unravel the business IT software monoliths behind the mortgage businesses and reimplement the functionality as microservices. The logic is compelling: To rewrite integrated IT stacks as independent services that can be changed quickly and with minimal impact on the full application or fellow services.

Continuous-delivery guru Dave Farley explains what this looks like: services that are deployed independently of each other, loosely coupled, organized around business capabilities, scoped to a bounded context, and owned and maintained by small teams. In the case of the mortgage meltdown scenario, an interest rate service could be quickly updated by the team dedicated to its maintenance without taking down the entire lending application.
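As a minimal sketch of what such a service can look like, the hypothetical example below uses only the JDK’s built-in HTTP server: one bounded context, one endpoint, deployable and updatable on its own schedule. The product name, port and rate value are placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicReference;

// A deliberately tiny interest-rate service: a single, independently deployable process
// whose rate can be changed without touching the rest of the lending stack.
public class InterestRateService {
    private static final AtomicReference<String> CURRENT_RATE = new AtomicReference<>("5.25");

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/rates/standard-variable", exchange -> {
            byte[] body = ("{\"product\":\"standard-variable\",\"rate\":" + CURRENT_RATE.get() + "}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        // In a real deployment the rate would come from the service's own data store,
        // so a rate change is a data update or redeploy of this one service.
    }
}
```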

But microservices can come with baggage. One issue people have been struggling with is the creation of distributed monoliths. One cause of this is continued reliance on a single database, which means services remain anchored to the data sources and lose their flexibility.

In the mortgage scenario, the application has one source of truth, but the whole application must be taken down to be updated.

The answer is a distributed data model. The catch is that the model must deliver the performance and reliability of the relational model, which has historically struggled to perform at this kind of huge scale.

The Relational Route

Relational has been a reliable engine of business – an architecture of rows, columns and tables founded on principles of transaction completeness and isolation. This has seen relational databases used in both transactional and analytical workloads. The relational model, however, has historically proved complicated to scale up for cloud native, which opened the door to the uptake of NoSQL: document stores and other non-relational designs that achieve scale and speed using a different architecture. What we’ve seen, though, is firms trying to run relational workloads on NoSQL and, as a result, having to build their own rows, columns and tables at great cost and technical overhead.

It’s vital to pick the right database for this job: That means a system capable of delivering the reliable capabilities of relational but that’s distributed by design. This avoids using arcane practices and workarounds to make traditional relational scale and sidesteps the compromises of NoSQL.

The next step is to define the transactional model. At a high level that means understanding the use cases and the workloads. If we’re talking about a mortgage…

  • It means understanding the customer and business dynamics of the home-buying journey.
  • It means describing the data that will be used and the data store for systems like the financial ledger. It means describing what that ledger looks like.
  • It means specifying the systems and applications in the chain, the data flow and where the data would be updated.

That’s vital as the data won’t simply flow through offer, approval and settlement systems. It’ll flow through customer, employee and broker portals too. A change to a mortgage rate microservice will have to ripple through each system.

With the database and transactional model in place, it’s finally possible to create functionally independent microservices. Modern cloud infrastructure is invariably built on container technologies that are managed through a variety of orchestration and life cycle management tools. While this is great for open systems and developer choice, it can also result in a complex infrastructure that makes cloud native difficult and slow to manage. This can present a particular problem when deploying data-driven applications in a microservices architecture because data-driven updates must propagate consistently across each service and system on that infrastructure.

It therefore makes sense to employ an orchestration system that works in lockstep with your database’s automated deployment and administration capabilities. Achieving this will mean that data- and application-level changes can be packaged up and rolled out as part of any container or cluster update and deployed automatically using consistent processes and without manual intervention. The result is microservices that can be updated independently, autonomously and isolated from other services, allowing banks to respond at speed to changing business events.

What’s Next?

We are in the midst of the AI hype cycle and mortgage lenders are starting to think about how to integrate AI to improve the mortgage lending process. It can automate routine tasks, provide valuable insights, reduce risk and fraud and improve the customer experience — yet another reason to ensure your tech stack is capable of handling the influx of data that will be stored and analyzed to transform an antiquated industry.

Additionally, operational resilience — an organization’s ability to adapt and respond to disruptions or unexpected events while maintaining continuous operations without interruption — is going to be of huge importance in Finserv going forward, especially now that the stakes are high enough that governments have begun stepping in.

The UK is leading the way in holding financial firms responsible and accountable for their operational resiliency. Regulators have instructed financial firms to meet operational resilience requirements, overlaying governmental oversight on top of internal decision-making. Other countries are also pursuing similar regulatory initiatives in their financial sectors. One of the most significant is the European Union’s proposed Digital Operational Resilience Act (DORA), which seeks to ensure that all financial market participants have effective strategies and capabilities in place to manage operational resilience.

We are reaching the end of the “single cloud provider as automatic best practice” era. Centering your application architecture on cloud platform-agnostic services and tools has become an essential survival strategy that requires a distributed solution.

Conclusion

We’ve heard much about digitization in financial services, but 2022 proved just how much work is left outstanding. With businesses and consumers facing a challenging 18 months, it’s time for core banking IT systems to undergo a microservices overhaul. That means embracing a distributed relational data model to make the financial products — and the business — genuinely agile.

The post Microservices and Mortgage Meltdown: Let’s Get Relational appeared first on The New Stack.

]]>