Observability Overview, News and Trends | The New Stack
https://thenewstack.io/observability/

Build a Home Internet Speed Test with Grafana and InfluxDB
https://thenewstack.io/build-a-home-internet-speed-test-with-grafana-and-influxdb/


Time series data is ubiquitous in everyday life. Collecting, storing and analyzing time series data provides clarity and insight on just about anything. Time series data exists in outer space and inside your home. Time series data also sheds light on the dark question: “Am I getting the internet speeds I’m paying for?” Rather than write another article about how InfluxDB can turn time series data into real-time insights, this tutorial shows you how to track your internet speeds and to finally answer your questions about whether your internet really is too slow or if you might need to do a bit of deeper breathing.

This tutorial uses the TIG stack: Telegraf, InfluxDB Cloud Serverless and Grafana. For detailed information about the TIG stack, check out this article. The time series data in this tutorial comes directly from your home internet. Telegraf’s Internet Speed plugin collects the data; it uses Speedtest by Ookla to test a user’s internet speed and quality. The test locates the closest servers, finds the best latency, then downloads and uploads a file from the selected server to capture the user’s internet connection speed.
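To get a feel for what that test actually does, here is a rough standalone sketch in Python using the third-party speedtest-cli package (this is separate from the Telegraf plugin, and the exact calls should be treated as illustrative rather than authoritative):

import speedtest

st = speedtest.Speedtest()
st.get_best_server()           # locate the closest server with the best latency
download_bps = st.download()   # download a file and measure throughput
upload_bps = st.upload()       # upload a file and measure throughput

print(f"ping: {st.results.ping:.0f} ms")
print(f"download: {download_bps / 1e6:.1f} Mbps, upload: {upload_bps / 1e6:.1f} Mbps")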

Required Materials

Before getting started, make sure you have the following:

Part One: InfluxDB Cloud Serverless

Log in to InfluxDB Cloud Serverless, go to the Resource Center, navigate to “Add Data” and select “Configure Agent”. This takes you to the Telegraf section. Then select “Create Configuration”:

Use the “Bucket” dropdown to select an existing bucket or create a new one. Then scroll down through the Telegraf plugins until you reach the “Internet Speed” plugin. Click the “Internet Speed” plugin, then click the blue “Continue Configuring” button that appears in the bottom right-hand corner.

Name the configuration. You can add a brief description if you want. Complete this section by selecting “Save and Test”:

Next, open your terminal. If you don’t already have an API token for your bucket, create one from the API Token tab. Once you have a token with the settings you want, copy the API token, paste it into the terminal and also save it to a text file. The token disappears as soon as you leave this page, so it’s important to save it now. Copy the “Start Telegraf” command and paste it into the same text file.

Click “Finish” to close the window.

As soon as the “Create a Telegraf Configuration” window closes, you’ll see the name of the Telegraf configuration you just created appear in the main section of your screen. Click the name of that configuration. Just below [agent] you’ll see interval = "20s"; update that to interval = "60s". This gives Telegraf a longer window to gather data from the speed test servers.

Locate the line that contains token = $INFLUX_TOKEN and replace the $INFLUX_TOKEN value with the API token you copied earlier. Complete this step by selecting “Save Changes”:

Paste the entire “Start Telegraf” command from your text file into your terminal. The command takes the form telegraf --config <URL>, and it is the only command needed to start Telegraf. The image below depicts what a successfully started Telegraf instance looks like:

The last thing to do in Part One is watch the data appear. Head to the Data Explorer (the graph icon on the top left side of the page). You can immediately query the data using SQL. You can organize the query by fields or tags. The Data Explorer includes a dropdown menu UI, meaning you don’t need to have deep SQL knowledge to execute queries.

Part Two: Configuring and Visualizing Data with Grafana

Part Two follows along with this tutorial, which also includes a YouTube video component.

Launch Grafana Cloud. Click “Connect Data” in the top right-hand corner. Select FlightSQL, then select “Add New Data Source” in the top right-hand corner. Choose “install via Grafana.com” to install the FlightSQL plugin:

Click on settings in the FlightSQL Plugin. The host name comes from InfluxDB Cloud Serverless. The host name is the URL with the protocol omitted.

For the port, define the secure port 443. The entire Host:Port will end up looking similar to this:

us-east-1-1.aws.cloud2.influxdata.com:443

The next step is to define the authentication type. Select “token” from the dropdown menu because InfluxDB Cloud Serverless uses tokens. Navigate back to InfluxDB Cloud Serverless because InfluxDB generates the token.

In InfluxDB Cloud Serverless’s main menu, navigate to Access Token Manager > Go to Tokens. Select Generate API Token > All Access API Token. Write a description if desired and save. Copy the token because it will disappear once you close the window. Navigate back to Grafana.

Once in Grafana, paste the all-access token into the text box. Then enable TLS/SSL. This completes the basic connection setup.

The final setup step is enabling metadata. The metadata defines exactly where to query within InfluxDB Cloud Serverless. We query by bucket, which is essentially its own database within InfluxDB. We’ll use the key “bucket-name”, and the value will be the name of the bucket that the home internet data is being sent to.

When you’re done, click Save & Test. If everything was set up correctly, the OK message will appear.

The next step in our Grafana configuration is building the dashboard. Navigate to the dashboard panel and create a new dashboard, then select “Add New Panel”. Once the new window opens, you’ll see the FlightSQL data source just above the SQL editor, around the middle of the page. Grafana can help generate SQL for you or let you write your own; this example uses a custom query. To open the custom SQL editor, click “Edit SQL”.

The custom SQL query we’re going to use is this:

SELECT time, download
FROM "internet_speed"
WHERE $__timeRange(time)

This query selects the time and download fields from the internet_speed measurement. The $__timeRange(time) macro means we can update the time range from the dropdown menu. Clicking the circular arrows at the top right of the panel creates or refreshes the visualization.

Part Three: Data Analysis

Looking at the graph above, it’s clear that my internet speeds are mostly within the 650-800 Mbps range. My first step in the analysis is determining whether or not external factors are affecting the download speeds. The next thing I’ll do is add the server_id tag to my query.

To query by tag, just add the name of the tag in quotes. In this example, the new query line looks like SELECT time, download, "server_id".

Grafana uses a tool called Transformations to organize the results by tag. Click “Transform” and the transformation we’re going to use is “Partition by Values”. Click select field > server_id. The resulting panel now includes each download speed by server id:

This is only the beginning. What other factors are at play? How does the latency affect downloads? What tags and fields do you think matter? Is your internet speed up to par?

Join the Time Series Data Revolution

This just scratches the surface of the vast world of time series data. Time series data is everywhere, but with the TIG stack, setup never has to be overly complicated and real-time insights are always available.

To keep working within our community, check out our community page, which includes a link to our Slack and more cool projects to help get you started.

Top Ways to Reduce Your Observability Costs: Part 2
https://thenewstack.io/top-ways-to-reduce-your-observability-costs-part-2/


This is the second in a two-part series. Read Part 1 here. 

Organizations are constantly trying to figure out how to balance data value versus data cost, especially as data is rapidly outgrowing infrastructure and becoming costly to store and manage. To help, we’ve created a series that outlines how data can get expensive and ways to reduce your data bill. 

Last time, we included a primer on “what is cardinality” before offering two tips on how to reduce observability costs: using downsampling and lowering retention. Before we jump into the last two tips on cost reduction — limiting dimensionality and using aggregation — we’ll do another quick primer, this time on classifying cardinality.

Classifying Cardinality: A Primer

When it comes to cardinality in metrics, you can classify dimensions into three high-level buckets to consider the balance between value and cardinality.

High value — These are the dimensions you need to measure to understand your systems, and they are always or often preserved when consuming metrics in alerts or dashboards. An example is including service/endpoint as a dimension for a metric tracking request latency. There’s no question that this is essential for visibility to make decisions about your system. But in a microservices environment, even a simple example like this can end up adding quite a lot of cardinality. When you have dozens of services, each with a handful of endpoints, you quickly end up with many thousands of series even before you add other sensible dimensions such as region or status code.
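To make the arithmetic concrete, here is a back-of-the-envelope calculation in Python; the counts are hypothetical, not figures from any real system:

services = 30       # hypothetical number of microservices
endpoints = 8       # endpoints per service
regions = 3
status_codes = 5

series = services * endpoints * regions * status_codes
print(series)       # 3,600 time series for a single request-latency metric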

Low value — These dimensions are of more questionable value. They may not even be intentionally included, but rather come along as a byproduct of how metrics are collected from your systems. An example dimension here is the instance label in Prometheus — it is automatically added to every metric you collect. In some cases you may be interested in per-instance metrics, but for a metric such as request latency on a stateless service running in Kubernetes, you might not look at per-instance latency at all. Having it as a dimension does not necessarily add much value.

No value (useless or even harmful) — These are essentially anti-patterns to be avoided at all costs, such as attaching a unique request ID as a metric dimension. Including them can result in serious consequences to your metric system’s health by exploding the amount of data you collect and causing significant problems when you query metrics.

Now for the good stuff: Our final two tips on how to reduce observability costs.

Keeping Costs Low: Dimensionality and Aggregation

Each team has to continuously make accurate trade-offs between the cost of observing their service or application, and the value of the insights the platform drives. This sweet spot will be different for every service, as some have higher business value than others, so those services can capture more dimensions, with higher cardinality, better resolution and longer retention than others.

This constant balancing of cost and derived value also means there is no easy fix. There are, however, some things you can do to keep costs in check.

1. Limit Dimensionality

The simplest way of managing the explosion of observability data is by reducing which dimensions you collect for metrics. By setting standards on what types of labels are collected as part of a metric, some of the cardinality can be farmed out to a log or a trace, which are much less affected by the high cardinality problem. And the observability team is uniquely positioned to help teams set appropriate defaults for their services.

These standards may include which metrics use which labels, as well as moving higher-cardinality dimensions, like unique request IDs, to the tracing system to unburden the metrics system.

This is a strategy that limits what is ingested, which reduces the amount of data sent to the metrics platform. This can be a good strategy when teams and applications are emitting metrics data that is not relevant, reducing cardinality before it becomes a problem.
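As a minimal sketch of the idea (plain Python with a hypothetical allow-list, not any particular vendor’s feature): keep the agreed-upon dimensions on the metric and hand everything else to the tracing or logging system.

ALLOWED_LABELS = {"service", "endpoint", "region", "status_code"}   # hypothetical standard

def scrub_labels(labels: dict) -> tuple[dict, dict]:
    # Keep standard dimensions on the metric; everything else belongs on a span or log line.
    kept = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    moved = {k: v for k, v in labels.items() if k not in ALLOWED_LABELS}
    return kept, moved

kept, moved = scrub_labels({"service": "checkout", "endpoint": "/pay", "request_id": "abc-123"})
# kept  -> {'service': 'checkout', 'endpoint': '/pay'}
# moved -> {'request_id': 'abc-123'}, better attached to the trace than to the metric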

2. Use Aggregation

Instead of throwing away intermediate data points, aggregate individual data points into new summarized data points. This reduces the amount of data that needs to be processed and stored, lowering storage cost and improving query performance for larger, older data sets.

Aggregation can be a good strategy because it lets teams continue to emit highly dimensional, high cardinality data from their services, and then adjust it based on the value it provides as it ages.

While tweaking resolution and retention are relatively simple ways to reduce the amount of data stored by deleting data, they don’t do much to reduce the computational load on the observability system. Because teams often don’t need to view metrics across all dimensions, a simplified, aggregate view (for instance, without a per-pod or per-label level) is good enough to understand how your system is performing at a high level. So instead of querying tens of thousands of time series across all pods and labels, we can make do with querying the aggregate view with only a few hundred time series.

Aggregation is a way of rolling data into a more summarized, but less dimensional state, creating a specific view of metrics and dimensions that are important. The underlying raw metrics data can be kept for other use cases, or it can be discarded to save on storage space and reduce the cardinality of data if there is no use for the raw unaggregated data.
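As a rough sketch of what such a rollup looks like (plain Python over made-up per-pod counters, not a specific product’s API):

from collections import defaultdict

# hypothetical per-pod request counts: (service, pod) -> value
raw = {
    ("checkout", "pod-1"): 120,
    ("checkout", "pod-2"): 95,
    ("payments", "pod-7"): 300,
}

aggregated = defaultdict(int)
for (service, _pod), value in raw.items():
    aggregated[service] += value    # drop the pod dimension, keep the per-service view

print(dict(aggregated))             # {'checkout': 215, 'payments': 300}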

There Are Two Schools of Aggregation: Streaming vs. Batch

With stream aggregation, metrics data streams continuously, and the aggregation is done in memory on the streaming ingest path before writing results to the time series database. Because data is aggregated in real time, streaming aggregation is typically meant for information that’s needed immediately. This is especially useful for dashboards, which need to query the same expression repeatedly every time they refresh. Streaming aggregation makes it easy to drop the raw unaggregated data to avoid unnecessary load on the database.

Batch aggregation first stores raw metrics in the time series database and periodically fetches them and writes back the aggregated metrics. Because data is aggregated in batches over time, batch aggregation is typically done for larger swaths of data that isn’t time sensitive. Batch aggregation cannot skip ingesting the raw nonaggregated data, and it even incurs additional load as written raw data has to be read and rewritten to the database, adding additional query overhead.

The additional overhead of batch aggregation makes streaming better suited to scaling the platform, but there are limits to the complexity real-time processing can handle; batch processing can deal with more complex expressions and queries.
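A toy sketch of the difference, in plain Python and purely illustrative: streaming keeps a running aggregate on the ingest path so raw points never need to be stored, while batch reads raw points back out of storage and writes a summary.

# Streaming: aggregate in memory as points arrive; only the rollup is written to storage.
running_sum, running_count = 0.0, 0

def on_ingest(value: float) -> None:
    global running_sum, running_count
    running_sum += value
    running_count += 1      # the raw point can be dropped after this

# Batch: read raw points that were already stored, then write back a single aggregate.
def batch_rollup(stored_points: list[float]) -> float:
    return sum(stored_points) / len(stored_points)   # extra read/rewrite load on the database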

Rethink Observability, Control Your Costs

Before you adopt a cloud native observability platform, be sure it will help you keep costs low by enabling you to understand the value of your observability data as well as shaping and transforming data based on need, context and utility. Get more from your investment too, with capabilities that permit you to delegate responsibility for controlling cardinality and growth and continuously optimize platform performance.

The cloud native Chronosphere observability platform does all this and more. It helps you keep costs low by identifying and reducing waste. It also improves engineers’ experience by reducing noise. Best of all, teams remediate issues faster with Chronosphere’s automated tools and optimized performance.

Next-Gen Observability: Monitoring and Analytics in Platform Engineering
https://thenewstack.io/next-gen-observability-monitoring-and-analytics-in-platform-engineering/


As applications become more complex, dynamic, and interconnected, the need for robust and resilient platforms to support them has become a foundational requirement. Platform engineering is the art of crafting these robust foundations, encompassing everything from orchestrating microservices to managing infrastructure at scale.

In this context, the concept of Next-Generation Observability emerges as a crucial enabler for platform engineering excellence. Observability transcends the traditional boundaries of monitoring and analytics, providing a comprehensive and insightful view into the inner workings of complex software ecosystems. It goes beyond mere visibility, empowering platform engineers with the knowledge and tools to navigate the intricacies of distributed systems, respond swiftly to incidents, and proactively optimize performance.

Challenges Specific to Platform Engineering

Platform engineering presents unique challenges that demand innovative solutions. As platforms evolve, they inherently become more intricate, incorporating a multitude of interconnected services, microservices, containers, and more. This complexity introduces a host of potential pitfalls:

  • Distributed Nature: Services are distributed across various nodes and locations, making it challenging to comprehend their interactions and dependencies.
  • Scaling Demands: As platform usage scales, ensuring seamless scalability across all components becomes a priority, requiring dynamic resource allocation and load balancing.
  • Resilience Mandate: Platform outages or degraded performance can have cascading effects on the applications that rely on them, making platform resilience paramount.

The Role of Next-Gen Observability

Next-Gen observability steps in as a transformative force to address these challenges head-on. It equips platform engineers with tools to see beyond the surface, enabling them to peer into the intricacies of service interactions, trace data flows, and understand the performance characteristics of the entire platform. By aggregating data from metrics, logs, and distributed traces, observability provides a holistic perspective that transcends the limitations of siloed monitoring tools.

This article explores the marriage of Next-Gen Observability and platform engineering. It delves into the intricacies of how observability reshapes platform management by providing real-time insights, proactive detection of anomalies, and informed decision-making for optimizing resource utilization. By combining the power of observability with the art of platform engineering, organizations can architect resilient and high-performing platforms that form the bedrock of modern applications.

Understanding Platform Engineering

Platform engineering plays a pivotal role in shaping the foundation upon which applications are built and delivered. At its core, platform engineering encompasses the design, development, and management of the infrastructure, services, and tools that support the entire application ecosystem.

Platform engineering is the discipline that crafts the technical underpinnings required for applications to thrive. It involves creating a cohesive ecosystem of services, libraries, and frameworks that abstract away complexities, allowing application developers to focus on building differentiated features rather than grappling with infrastructure intricacies.

A defining characteristic of platforms is their intricate web of interconnected services and components. These components range from microservices to databases, load balancers, caching systems, and more. These elements collaborate seamlessly to provide the functionalities required by the applications that rely on the platform.

The management of platform environments is marked by inherent complexities. Orchestrating diverse services, ensuring seamless communication, managing the scale-out and scale-in of resources, and maintaining consistent performance levels present a multifaceted challenge. Platform engineers must tackle these complexities while also considering factors like security, scalability, and maintainability.

Platform outages wield repercussions that stretch beyond the boundaries of the platform itself, casting a pervasive shadow over the entire application ecosystem. These disruptions reverberate, resulting in downtimes, data loss, and a clientele that’s both agitated and dismayed. The ramifications encompass more than just the immediate fiscal losses; they extend to a long-lasting tarnish on a company’s reputation, eroding trust and confidence.

In the contemporary landscape, user expectations hinge on the delivery of unwaveringly consistent and dependable experiences. The slightest lapse in platform performance has the potential to mar user satisfaction. This can, in turn, lead to a disheartening ripple effect, manifesting as user attrition and missed avenues for business growth. The prerequisite for safeguarding high-quality user experiences necessitates the robustness of the platform itself.

Enter the pivotal concept of observability — a cornerstone in the architecture of modern platform engineering. Observability serves as a beacon of hope, endowing platform engineers with an arsenal of tools that transcend mere visibility. These tools enable engineers to transcend the surface and plunge into the intricate machinations of the platform’s core.

This dynamic insight allows them to navigate the labyrinth of intricate interactions, promptly diagnosing issues and offering remedies in real-time. With its profound capacity to unfurl the platform’s inner workings, observability empowers engineers to swiftly identify and address problems, thereby mitigating the impact of disruptions and fortifying the platform’s resilience against adversity.

Core Concepts of Next-Gen Observability for Platform Engineering

Amidst the intricacies of platform engineering, where a multitude of services work in concert to deliver a spectrum of functionalities, comprehending the intricate interplay within a distributed platform presents an imposing challenge.

At the heart of this challenge lies a complexity born of a web of interconnected services, each with specific tasks and responsibilities. These services often span a gamut of nodes, containers, and even geographical locations. Consequently, tracing the journey of a solitary request as it navigates this intricate network becomes an endeavor fraught with intricacies and nuances.

In this labyrinthine landscape, the beacon of distributed tracing emerges as a powerful solution. This technique, akin to unraveling a tightly woven thread, illuminates the flow of requests across the expanse of services. In capturing these intricate journeys, distributed tracing unravels insights into service dependencies, bottlenecks causing latency, and the intricate tapestry of communication patterns. As if endowed with the ability to see the threads that weave the fabric of the platform, platform engineers gain a holistic view of the journey each request undertakes. This newfound clarity empowers them to pinpoint issues with precision and optimize with agility.

However, the advantages of distributed tracing transcend the microcosm of individual services. The insights garnered extend their reach to encompass the platform as a whole. Platform engineers leverage these insights to unearth systemic concerns that span multiple services. Bottlenecks, latency fluctuations, and failures that cast a shadow over the entire platform are promptly brought to light. The outcomes are far-reaching: heightened performance, curtailed downtimes, and ultimately, a marked enhancement in user experiences. In the intricate dance of platform engineering, distributed tracing emerges as a beacon that dispels complexity, illuminating pathways to optimal performance and heightened resilience.

At the nucleus of observability, metrics and monitoring take center stage, offering a panoramic view of the platform’s vitality and efficiency.

Metrics, those quantifiable signposts, unfold a tapestry of data that encapsulates the platform’s multifaceted functionality. From the utilization of the CPU and memory to the swift cadence of response times and the mosaic of error rates, metrics lay bare the inner workings, revealing a clear depiction of the platform’s operational health.

A parallel function of this duo is the art of monitoring — an ongoing vigil that unveils deviations from the expected norm. The metrics, acting as data sentinels, diligently flag sudden surges in resource consumption, the emergence of perplexing error rates, or deviations from the established patterns of performance. Yet, the role of monitoring transcends mere alerting; it is a beacon of foresight. By continuously surveying these metrics, monitoring predicts the need for scalability. As the platform’s utilization ebbs and flows, as users and requests surge and recede, the platform’s orchestration must adapt in stride. Proactive monitoring stands guard, ensuring that resources are dynamically assigned, and ready to accommodate surging demands.

And within this dance of metrics and monitoring, the dynamic nature of platform scalability comes to the fore. In the tapestry of modern platforms, scalability is woven as an intrinsic thread. As users and their requests ebb and flow, as services and their load vary, the platform must be malleable, capable of graceful expansion and contraction. Observability, cast in the role of a linchpin, empowers platform engineers with the real-time pulse of these transitions. Armed with the insights furnished by observability, the engineers oversee the ebb and flow of the platform’s performance, ensuring a proactive, rather than reactive, approach to scaling. Thus, as the symphony of the platform unfolds, observability lends its harmonious notes, orchestrating the platform’s graceful ballet amidst varying loads.

In the intricate tapestry of platform engineering, logs emerge as the textual chronicles that unveil the story of platform events.

Logs assume the role of a scribe, documenting the narrative of occurrences, errors, and undertakings within the platform’s realm. In their meticulously structured entries, they create a chronological trail of the endeavors undertaken by various components. The insights gleaned from logs provide a contextual backdrop for observability, enabling platform engineers to dissect the sequences that lead to anomalies or incidents.

However, in the context of multi-service environments within complex platforms, the aggregation and analysis of logs take on a daunting hue. With a myriad of services coexisting, logs are spread across diverse nodes and instances. Uniting these scattered logs to craft a coherent narrative poses a formidable challenge, amplified by the sheer volume of logs generated in such an environment.

Addressing this intricate challenge are solutions that carve paths for efficient log analysis. The likes of log aggregation tools, with exemplars like the ELK Stack comprising Elasticsearch, Logstash, and Kibana, stand as guiding beacons. These tools facilitate the central collection, indexing, and visualization of logs. The platform engineer’s endeavors to search, filter, and analyze logs are fortified by these tools, offering a streamlined process. Swiftly tracing the origins of incidents becomes a reality, empowering engineers in the realm of effective troubleshooting and expedited resolution. As logs evolve from mere entries to a mosaic of insight, these tools, augmented by observability, light the way to enhanced platform understanding and resilience.

Implementing Next-Gen Observability in Platform Engineering

Instrumenting code across the breadth of services within a platform is the gateway to achieving granular observability.

Here are some factors to consider:

  • Granular Observability Data: Instrumentation involves embedding code with monitoring capabilities to gather insights into service behavior. This allows engineers to track performance metrics, capture traces, and log events at the code level. Granular observability data provides a fine-grained view of each service’s interactions, facilitating comprehensive understanding.
  • Best Practices for Instrumentation: Effective instrumentation requires a thoughtful approach. Platform engineers need to carefully select the metrics, traces, and logs to capture without introducing excessive overhead. Best practices include aligning instrumentation with key business and operational metrics, considering sampling strategies to manage data volume, and ensuring compatibility with observability tooling.
  • Code-Level Observability for Bottleneck Identification: Code-level observability plays a pivotal role in identifying bottlenecks that affect platform performance. Engineers can trace request flows, pinpoint latency spikes, and analyze service interactions. By understanding how services collaborate and identifying resource-intensive components, engineers can optimize the platform for enhanced efficiency.

Proactive Monitoring and Incident Response

Proactive monitoring enables platform engineers to preemptively identify potential issues before they escalate into major incidents.

The proactive monitoring approach involves setting up alerts and triggers that detect anomalies based on predefined thresholds. By continuously monitoring metrics, engineers can identify deviations from expected behavior early on. This empowers them to take corrective actions before users are affected.
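As a toy illustration (plain Python with made-up numbers, not any particular alerting product): compare the latest reading against a recent baseline and fire when it deviates beyond a predefined tolerance.

from statistics import mean

def breaches_threshold(recent: list[float], latest: float, tolerance: float = 0.25) -> bool:
    # Alert when the latest reading deviates more than `tolerance` from the recent baseline.
    baseline = mean(recent)
    return abs(latest - baseline) > tolerance * baseline

# e.g. p99 latency in milliseconds over the last few scrape intervals
if breaches_threshold([210, 220, 215, 205], latest=330):
    print("fire alert: latency deviates from expected behavior")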

Observability data seamlessly integrates into incident response workflows. When an incident occurs, engineers can access real-time observability insights to quickly diagnose the root cause. This reduces mean time to resolution (MTTR) by providing immediate context and actionable data for effective incident mitigation.

Observability provides real-time insights into the behavior of the entire platform during incidents. Engineers can analyze traces, metrics, and logs to trace the propagation of issues across services. This facilitates accurate root cause analysis and swift remediation.

Scaling Observability with Platform Growth

Scaling observability alongside the platform’s growth introduces challenges related to data volume, resource allocation, and tooling capabilities. The sheer amount of observability data generated by numerous services can overwhelm traditional approaches.

To manage the influx of data, observability pipelines come into play. These pipelines facilitate the collection, aggregation, and processing of observability data. By strategically designing pipelines, engineers can manage data flow, filter out noise, and ensure that relevant insights are available for analysis.

Observability is not static; it evolves alongside the platform’s expansion. Engineers need to continually assess and adjust their observability strategies as the platform’s architecture, services, and user base evolve. This ensures that observability remains effective in uncovering insights that aid in decision-making and optimization.

Achieving Platform Engineering Excellence Through Observability

At its core, observability unfurls real-time insights into the dynamic symphony of platform resource utilization. Metrics, such as the rhythm of CPU usage, the cadence of memory consumption, and the tempo of network latency, play harmonious notes that guide engineers. These metrics, akin to notes on a musical score, disclose the underutilized instruments and the overplayed chords. Such insights propel engineers to allocate resources judiciously, deftly treading the fine line between scaling and conserving, balancing and distributing.

Yet, observability is not just a map; it’s an artist’s palette. With its brushes dipped in data, it empowers engineers to craft performances of peak precision. Within the intricate canvas of observability data lies the artist’s ability to diagnose performance constraints and areas of inefficiency. Traces and metrics unveil secrets, pointing out latency crescendos, excessive resource indulgence, and the interplay of service dependencies that orchestrate slowdowns. Armed with these revelations, engineers don the mantle of virtuosity, fine-tuning the components of the platform. The aim is nothing short of optimal performance, a symphony of efficiency that resonates throughout the platform.

Real-world vignettes, cast as case studies, offer a vivid tableau of observability’s transformative impact. These tales show how insights, gleaned through observability, yield tangible performance enhancements. The chronicles narrate stories of reduced response times, streamlined operations, and harmonized experiences. These are not merely anecdotes but showcases of observability data weaving into the very fabric of engineering decisions, orchestrating leaps of performance that resonate with discernible gains. In the intricate choreography of platform engineering, observability dons multiple roles — an instructor, a composer, and an architect of performance enhancement.

Ensuring Business Continuity and User Satisfaction

In the intricate interplay of business operations and user satisfaction, observability emerges as a safety net, a sentinel that safeguards business continuity and elevates user contentment.

In the realm of business operations, observability stands as a sentinel against the tempestuous tide of platform outages. The turbulence of such outages can unsettle business operations and erode the very bedrock of user trust. Observability steps in, orchestrating a swift ballet of incident identification and resolution. In this dynamic dance, engineers leverage real-time insights as beacons, pinpointing the elusive root causes that underlie issues. The power of observability ensures that recovery is swift, and the impact is pared down, a testament to its role in minimizing downtime’s blow.

Yet, observability’s canvas extends beyond the realm of business operations. It stretches its reach to the very threshold of user experience. Here, it unveils a compelling correlation—platform health waltzes in tandem with user satisfaction. A sluggish response, a dissonant error, or the stark absence of service can fracture user experiences, spurring disenchantment and even churn. The portal to user interactions, as illuminated by observability data, becomes the looking glass through which engineers peer. This vantage point affords a glimpse into the sentiment of users and their interactions. The insights unveiled through observability carve a pathway for engineers to align platform behavior with user sentiment, choreographing proactive measures that engender positive experiences.

As the proverbial cherry on top, case studies illuminate observability’s transformative prowess. These real-world tales narrate how the tapestry of observability-driven optimizations interlaces with the fabric of user satisfaction.

From smoothing the checkout processes in the e-commerce realm to fine-tuning video streaming experiences, these examples resonate as testimonies to observability’s role in crafting user-centric platforms. In this symphony of platform engineering, observability stands as a conductor, orchestrating harmony between business continuity and user contentment.

Conclusion

Observability isn’t a mere tool; it’s a mindset that reshapes how we understand, manage, and optimize platforms. The world of software engineering is evolving, and those who embrace the power of Next-Gen Observability will be better equipped to build robust, scalable, and user-centric platforms that define the future.

As you continue your journey in platform engineering, remember that the path to excellence is paved with insights, data, and observability. Embrace this paradigm shift and propel your platform engineering endeavors to new heights by integrating observability into the very DNA of your strategies. Your platforms will not only weather the storms of complexity but will also emerge stronger, more resilient, and ready to redefine the boundaries of what’s possible.

Top Ways to Reduce Your Observability Costs: Part 1
https://thenewstack.io/top-ways-to-reduce-your-observability-costs-part-1/


This is the first of a two-part series. 

Organizations are constantly trying to figure out how to balance data value versus data cost, especially as data is rapidly outgrowing infrastructure and becoming costly to store and manage. To help, we’ve created a series that outlines how data can get expensive and ways to reduce your data bill. 

This article covers how cloud native architecture increases data growth, what cardinality is and how you can curb data costs. 

The Cloud Native Observability Challenge

Companies of all sizes are rapidly moving to cloud native technologies and practices. This modern strategy offers speed, efficiency, availability and the ability to innovate faster, which means organizations can seize business opportunities that simply aren’t possible with a traditional monolithic architecture.

Yet moving to an architecture based on containers and microservices creates a new set of challenges that, if not managed well, will undermine the promised benefits.

  • Exploding observability data growth. Cloud native environments emit a massive amount of monitoring data — somewhere between 10 and 100 times more than traditional VM-based environments. This is because every container/microservice is emitting as much data as a single VM. Additionally, service owners start adding metrics to measure and track more granularly to run the business. Scaling containers into the thousands and collecting more and more complex data (higher data cardinality) results in data volume becoming unmanageable.
  • Rapid cost increases. The explosive growth in data volume and the need for engineers to collect an ever-increasing breadth of data has broken the economics and value of existing infrastructure and application monitoring and tools. Costs can unexpectedly spike from a single developer rolling out new code. Observability data costs can exceed the cost of the underlying infrastructure.

As the amount of metrics data being produced grows, so does the pressure on the observability platform, increasing cost and complexity to a point where the value of the platform diminishes. So how do observability teams take control over the growth of the platform’s cost and complexity without dialing down the usefulness of the platform? This article describes the trade-offs between cost and value that can come with investing in observability.

Cardinality: A Primer

To understand the balance between cost and insight, it’s important to understand cardinality. This is the number of possible ways you can group your data, depending on its properties, also called dimensions.

Metric cardinality is defined as the number of unique time series that are produced by a combination of metric names and associated dimensions. The total number of combinations that exist are cardinalities. The more combinations possible, the higher a metric’s cardinality. Here’s a delicious practical example: purchasing fine cheese.

Understanding Data Sets

If your only preference is that the cheese you buy is made of sheep’s milk, your data would have just one dimension. Analyze 100 different kinds of cheese based on that dimension, and you’d have 100 data points, each labeling the cheese as either sheep’s milk–based or not (made from another source).

But then you decide you only want sheep’s milk cheese made in France. That would add another dimension to track for each cheese made of sheep’s milk — the country of origin. Think of all the cheese-producing countries in the world — about 200 — and you can understand how the cardinality, or the ways to group the data, can quickly increase.

If you then decide to analyze the data based on the type of cheese, it adds many hundreds of other dimensions for grouping (think of all the different kinds of cheese in the world).

Finally, you decide you want to only consider Camembert, and group Camembert cheese only by whether it was made with raw milk, warm milk or completely pasteurized milk. That’s three more dimensions. You’d be right in thinking that, with all these dimensions, the cardinality would be high, even in traditional on-premises, VM-based environments.

A key point: it’s difficult to calculate the overall cardinality of a data set. You can’t just multiply together the cardinality of individual dimensions to know what the overall cardinality is — you will frequently have dimensions that only apply to a subset of your data.
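A tiny, made-up illustration of the point: the actual cardinality is the number of distinct label combinations you observe, which can be far below the product of the dimension sizes when some dimensions only apply to part of the data.

series = [
    {"milk": "sheep", "country": "France", "type": "Camembert", "treatment": "raw"},
    {"milk": "sheep", "country": "France", "type": "Camembert", "treatment": "pasteurized"},
    {"milk": "cow", "country": "Italy", "type": "Pecorino"},   # treatment not recorded here
]

naive = 2 * 2 * 2 * 2                                          # product of dimension sizes: 16
actual = len({tuple(sorted(s.items())) for s in series})       # combinations actually observed: 3
print(naive, actual)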

Controlling Cardinality

With the transition from monolithic to cloud native environments, there’s been an explosion of metrics data in terms of cardinality. This is because microservices and containerized applications generate metrics data an order of magnitude more than monolithic applications on VM-based cloud environments. To achieve good observability in a cloud native system then, you need to deal with large-scale data and take steps to understand and control cardinality.

(Image: from 150,000 metrics in a legacy, virtual machine environment to 150 million metrics in a cloud native environment.)

In addition to cardinality, it’s important to understand two other terms when managing data quantity in an observability platform: resolution and retention.

  • Resolution is the interval of the measurement — how often a measurement is taken. This is important because a longer interval often smooths out peaks and troughs in measurements, so they may not even show up in the data. Time precision is an important aspect of catching transient and spiky behaviors.
  • Retention is how long high-precision measurements are kept before being aggregated and downsampled into longer-term trend data. Summarizing and collating reduces resolution, trading off storage and performance with less accurate data.

Ways to Keep Costs Low: Data Sampling and Retention

Each team has to continuously make accurate trade-offs between the cost of observing their service or application and the value of the insights the platform drives. This sweet spot will be different for every service, as some have higher business value than others, so those services can capture more dimensions, with higher cardinality, better resolution and longer retention than others.

This constant balancing of cost and derived value also means there is no easy fix. There are, however, some things you can do to keep costs in check.

1. Use Downsampling

Downsampling is a tactic to reduce the overall volume of data by lowering the sampling rate of data. This is a great strategy to apply, as the value of the resolution of metrics data diminishes as it ages. Very high resolution is only really needed for the most recent data, and it’s perfectly OK for older data to have a much lower resolution so it’s cheaper to store and faster to query.

Downsampling can be done by reducing the rate at which metrics are emitted to the platform, or it can be done as the data ages. This means that fresh data has the highest frequency, but more and more intermediate data points are removed from the data set as it ages. It is, of course, important to be able to apply resolution reduction policies at a granular level using filters, since different services and application components across different environments need different levels of granularity.

By downsampling resolution as the metrics data ages, the amount of data that needs to be saved is reduced by orders of magnitude. If we downsample data from one-second to one-minute resolution, that’s 60 times less data to store. Additionally, it vastly improves query performance.
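A minimal sketch of the idea in plain Python (made-up one-second samples, not any particular database’s downsampling feature): bucket raw points into one-minute averages.

from statistics import mean

# hypothetical (unix_timestamp, value) samples collected every second
samples = [(t, 700 + (t % 7)) for t in range(0, 180)]          # 3 minutes of 1s data

buckets: dict[int, list[float]] = {}
for ts, value in samples:
    buckets.setdefault(ts - ts % 60, []).append(value)         # align each point to its minute

downsampled = [(minute, mean(values)) for minute, values in sorted(buckets.items())]
print(len(samples), "->", len(downsampled))                    # 180 points -> 3 points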

A solid downsampling strategy prioritizes which metrics data (per service, application or team) to downsample and helps determine a staggering age strategy. Often organizations adapt a week-month-year strategy to their exact needs, keeping high-resolution data for a week (or two) and stepping down resolution after a month (or two) — and, after a year, keeping a few years of data. With this strategy, teams retain the ability to do historical trend analysis with week-over-week, month-over-month and year-over-year comparisons.

2. Lower Retention

By lowering retention, we’re tweaking the total amount of metrics data kept in the system by discarding older data (optionally after downsampling first).

By classifying and prioritizing data, we can get a handle on what data is ephemeral and only needed for a relatively short amount of time, such as dev or staging environments or low-business-value services, and which data is important to keep for a longer period of time to refer back to as teams are triaging issues. Again, being able to apply these retention policies granularly is key for any production-ready system, as a one-size-fits-all approach just doesn’t work for every metric.

For production environments, keeping a long-term record, even at a lower resolution, is key to being able to look at longer trends and compare year-over-year.

However, we don’t need all dimensions or even all metrics for this long-term analysis. Helping teams choose what data to keep at a low resolution, and what metrics to discard after a certain time, will help limit the amount of metrics data that we store but never look at again.

Similarly, we don’t need to keep data for some kinds of environments, such as dev, test or staging environments. The same is true for services with low business value or non-customer-facing (internal) services. By choosing to limit retention for these, teams can balance their ability to query health and operational state without overburdening the metrics platform.

In the next and final installment of this series, we’ll include a primer on classifying different types of cardinality before diving into our last two tips on reducing observability costs: limiting dimensionality and using aggregation. Stay tuned.

4 Key Observability Best Practices
https://thenewstack.io/4-key-observability-best-practices/


With bigger systems, higher loads and more interconnectivity between microservices in cloud native environments, everything has become more complex. Cloud native environments emit somewhere between 10 and 100 times more observability data than traditional, VM-based environments.

As a result, engineers aren’t able to make the most out of their workdays, spending more time on investigations and cobbling together a story of what happened from siloed telemetry, leaving less time to innovate.

Without the right observability setup, precious engineering time is wasted trying to sift through data to spot where a problem lies rather than shipping new features — potentially introducing buggy features and affecting the customer experience.

So, how can modern organizations find relevant insights in a sea of telemetry and make their telemetry data work for them, not the other way around? Let’s explore why observability is key to understanding your cloud native systems, and four observability best practices for your team.

What Are the Benefits of Observability?

Before we dive into ways your organization can improve observability, lower costs and ensure smoother customer experience, let’s talk about what the benefits of investing in observability actually are.

Better Customer Experience

With better understanding and visibility into relevant data, your organization’s support teams can gain customer-specific insights to understand the impact of issues on particular customer segments. Maybe a recent upgrade works for all of your customers except for those under the largest load, or during a certain time window. Using this information, on-call engineers can resolve incidents quickly and provide more detailed incident reports.

Better Engineering Experience and Retention

By investing in observability, site reliability engineers (SREs) benefit from knowing the health of teams or components of the systems to better prioritize their reliability efforts and initiatives.

As for developers, benefits of observability include more effective collaboration across team boundaries, faster onboarding to new services/inherited services and better napkin math for upcoming changes.

Four Observability Best Practices

Now that we have a better understanding of why teams need observability to run their cloud native system effectively, let’s dive into four observability best practices teams can use to set themselves up for success.

1. Integrating with Developer Experience

Observability is everyone’s job, and the best people to instrument it are the ones who are writing the code. Maintaining instrumentation and monitors should not be a job just for the SREs or leads on your team.

A thorough understanding of the telemetry life cycle — the life of a span, metric or log — is key, from setting up configuration to emitting signals to any modifications or processing done before the data gets stored. If there is a high-level architecture diagram, engineers can better understand if or where their instrumentation gets modified (like aggregating or dropping). Often, this processing falls in the SRE domain and is invisible to developers, who won’t understand why their new telemetry is partially or entirely missing.

You can check out simple instrumentation examples in this OpenTelemetry Python Cookbook.
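For flavor, here is a minimal sketch using the OpenTelemetry Python API (the service, span and attribute names are made up):

from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def process_payment(order_id: str, amount: float) -> None:
    # Each call produces a span carrying business-relevant attributes.
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount", amount)
        # ... actual payment logic goes here ...

Without an SDK and exporter configured, these API calls are no-ops, so instrumentation like this is safe to add before the pipeline is fully wired up.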

If there are enough resources and a clear need for a central internal tool, platform engineering teams should consider writing thin wrappers around instrumentation libraries to ensure standard metadata is available out of the box.
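A sketch of what such a thin wrapper might look like, building on the snippet above (the standard attributes here are hypothetical, not an established convention):

from contextlib import contextmanager
from opentelemetry import trace

STANDARD_ATTRIBUTES = {"team": "payments", "environment": "production"}   # hypothetical defaults

@contextmanager
def standard_span(name: str, **attributes):
    # Start a span that always carries the organization's standard metadata.
    tracer = trace.get_tracer("internal-instrumentation-wrapper")
    with tracer.start_as_current_span(name) as span:
        for key, value in {**STANDARD_ATTRIBUTES, **attributes}.items():
            span.set_attribute(key, value)
        yield span

A service team would then write with standard_span("charge_card", order_id="abc-123"): instead of calling the underlying library directly.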

Viewing Changes to Instrumentation

Another way to enable developers is by providing a quick feedback loop when instrumenting locally, so that they can view changes to the instrumentation before merging a pull request. This recommendation is helpful for training purposes and for those teammates who are new to instrumenting or unsure about how to.

Updating the On-Call Process

Updating the on-call onboarding process to pair a new engineer with a tenured one for production investigations can help distribute tribal knowledge and orient the newbie to your observability stack. It’s not just the new engineers who benefit. Seeing the system through new eyes can challenge seasoned engineers’ mental models and assumptions. Exploring production observability data together is a richly rewarding practice you might want to keep after the onboarding period.

You can check out more in this talk from SRECon, “Cognitive Apprenticeship in Practice with Alert Triage Hour of Power.”

2. Monitor Observability Platform Usage in More than One Way

For cost reasons, becoming comfortable with tracking the current telemetry footprint and reviewing options for tuning — like dropping data, aggregating or filtering — can help your organization better monitor costs and platform adoption proactively. The ability to track telemetry volume by type (metrics, logs, traces or events) and by team can help define and delegate cost-efficiency initiatives.

Once you’ve gotten a handle on how much telemetry you’re emitting and what it’s costing you, consider tracking the daily and monthly active users. This can help you pinpoint which engineers need training on the platform.

These observability best practices for training and cost will lead to a better understanding of the value each vendor provides, as well as what’s underutilized.

3. Center Business Context in Observability Data

Deciphering the business context in a pile of observability data can help shortcut high-stakes investigations in different ways:

  • By making it easier to translate incidents affecting workflows and functionality from a user perspective.
  • By creating a more efficient onboarding process for engineers.

One way to center business context in observability data is by renaming default dashboards, charts and monitors.

4. Un-Silo Your Telemetry

Teams need better investigations. One way to ensure a smoother remediation process is an organized approach, like following breadcrumbs, rather than juggling 10 different bookmarked links and a mental map of what data lives where.

One way to do this is by understanding what telemetry your system emits from metrics, logs and traces and pinpointing the potential duplication or better sources of data. To achieve this, teams can create a trace-derived metric that represents an end-to-end customer workflow, such as:

  • “Transfer money from this account to that account.”
  • “Apply for this loan.”

Regardless of whether you’re sending to multiple vendors or a mix of DIY in-house stack and vendors, ensuring that you are able to link data between systems — such as adding the traceID to log lines, or a dashboard note with links to preformatted queries for relevance — will add that extra support for your team to perform better investigations and remediate issues faster.
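For instance, a minimal sketch of stamping the current trace ID onto log lines using the OpenTelemetry Python API and the standard logging module (the trace_id field name is illustrative; match whatever your log pipeline expects):

import logging
from opentelemetry import trace

logger = logging.getLogger("checkout")

def log_with_trace(message: str) -> None:
    span_context = trace.get_current_span().get_span_context()
    trace_id = format(span_context.trace_id, "032x")   # hex form, as tracing backends display it
    logger.info("%s trace_id=%s", message, trace_id)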

Explore Chronosphere’s Future-Proof Solution

Engineering time comes at a premium. The more you invest in high-fidelity insights and in helping engineers understand what telemetry is available, the more fearless instrumenting becomes, the faster troubleshooting gets, and the better positioned your team is to make future-proof, data-informed decisions when weighing options.

As companies transition to cloud native, uncontrollable costs and rampant data growth can stop your team from performing successfully and innovating. That’s why cloud native requires more reliability and compatibility with future-proof observability. Take back control of your observability today, and learn how Chronosphere’s solutions manage scale and meet modern business needs.

More Lessons from Hackers: How IT Can Do Better
https://thenewstack.io/more-lessons-from-hackers-how-it-can-do-better/


Kelly Shortridge is an advocate for better resiliency in IT systems. The author of Security Chaos Engineering: Sustaining Resilience in Software and Systems and a senior principal engineer in Fastly’s office of the CTO, she spoke at this year’s Black Hat conference, explaining why attackers are more resilient and what IT organizations can do to become more resilient and responsive.

Recently, The New Stack looked at Shortridge’s recommendations to leverage Infrastructure as Code and the Continuous Integration/Continuous Delivery pipeline to improve and become more resilient. In this follow-up post, we’ll look at the final lessons IT can take from attackers to improve their security posture:

  • Design-based defense
  • Systems thinking
  • Measuring tangible and actionable success

Design-Based Defense: Modularity and Isolation

“The solutions that actually help with this aren’t the ones we usually consider in cybersecurity or at least traditional cybersecurity. We want to design solutions that encourage the nimbleness that we envy in attackers, we want to design solutions that help us become the best ever-evolving defenders,” she said. “The less dependent it is on human behavior, the better it is.”

From Kelly Shortridge’s Black Hat 2023 presentation

She created the ice cream cone hierarchy of security solutions to demonstrate how organizations should prioritize security and resilience mitigations. As an example of a design-based solution, she pointed to Kelly Long’s push to use HTTPS as the default for Tumblr’s user blogs.

“That’s a fantastic example of a design-based solution,” Shortridge said. “She knew that security should be invisible to the end users, so we shouldn’t put the burden of security on end users who aren’t technical. I think she’s really ahead of her time.”

Instead of offloading that work onto end users and peers, IT should try to automate security and use design-based defense when possible. That means deliberately designing in modularity. Modularity allows structurally or functionally distinct parts to retain autonomy during periods of stress and allows for easier recovery from loss, Shortridge explained. A queue, for instance, adds a buffer, and message brokers can replay messages and make calls non-blocking.

“Message brokers and queues provide a standardization for passing data around the system. It also provides a centralized view into it,” she said. “What you get here is visibility, you can see where data is flowing in your system.”

Modularity also supports an airlock approach, so that if an attack gets through, it won't necessarily bring your system down. She demonstrated an air gap between two services talking to each other with a queue in between. The queue allows you to take the downstream service offline and fix it while service A continues to send requests, which the queue holds, so the system stays available and functioning until the fix is put into place.
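
As a rough, in-process sketch of that buffering idea (not from Shortridge's talk; a real setup would use a durable message broker, and the names and timings here are invented), the producer below keeps enqueueing while the consumer is "offline," and once the consumer comes back it drains the backlog:

```python
import queue
import threading
import time

buffer = queue.Queue()                 # stands in for a durable message broker
consumer_online = threading.Event()    # starts "offline" while Service B is being fixed

def service_a():
    # Service A keeps sending requests regardless of Service B's health.
    for i in range(10):
        buffer.put({"request_id": i})
        time.sleep(0.05)

def service_b():
    # Service B only drains the queue while it is online.
    for _ in range(10):
        consumer_online.wait()
        item = buffer.get()
        print("processed", item["request_id"], "backlog:", buffer.qsize())

producer = threading.Thread(target=service_a)
consumer = threading.Thread(target=service_b)
producer.start()
consumer.start()

time.sleep(0.5)            # simulate the time it takes to fix Service B
consumer_online.set()      # fix deployed: Service B comes back and catches up
producer.join()
consumer.join()
```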

“Modularity, when done right, minimizes incident impact because it keeps things separate,” she said. “Modularity allows us to break things down into smaller components, and that makes it much harder for attackers to persist, especially if it’s ephemeral, and harder for them to move laterally and gain widespread access in our system.”

Mozilla and UC San Diego have used this approach and have reported they no longer have to worry about zero day attacks because these sandboxes of components give them time to roll out a reliable fix without taking the system down, she added.

Systems Thinking

Repeatedly, speakers at Black Hat said attackers are “systems thinkers.” Shortridge reiterated this in her talk.

“Attackers think in systems, while defenders think in components, [which is] especially apparent when I talk to security teams. Thinking about how traffic and data flow between surfaces is often overlooked,” Shortridge said. “We’re so focused as an industry on ingress and egress that we miss how services talk to each other. And by the way, attackers love that we missed this.”

Attackers tend to focus on one thing: Your assumptions. You assume that parsing the string will always be fast, or that the message that shows up will always be post-authentication, or that an alert will always fire when the malicious executable appears. But will it really? Attackers will test your assumptions and then keep looking to see if you’re just a little wrong or really wrong, she said.

“We want to be fast, ever-evolving defenders, we want to refine our mental models continuously rather than waiting for attackers to exploit the difference between our mental models and reality,” she said. “Decision trees and resilient stress testing can help us do just that.”

Decision trees can help find the gaps in your security mitigations, she said, and force IT to examine the “this will always be true” assumptions before attackers do. Resilience stress tests — called chaos engineering in security circles — build upon decision trees, helping to identify where systems can fail.

“Chaos engineering seeks to understand how disruptions impact the entire system’s ability to recover and adapt,” she said. “It appreciates the inherent interactivity in the system across time and space. So it means we’re stress testing at the system level, not the component level as you usually do. It forces you to adopt a systems mindset.”
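
As an illustration of what stress-testing an assumption can look like in practice (this sketch is not from Shortridge's talk; the failure rate and the guarded function are invented), a small failure-injection wrapper deliberately violates an assumption so you can watch how the surrounding system copes:

```python
import random

def inject_failure(rate=0.2, exc=TimeoutError):
    """Wrap a call so it sometimes fails, the way chaos experiments deliberately
    violate an assumption ("the parser is always fast", "the alert always fires")
    to see how the rest of the system responds."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if random.random() < rate:
                raise exc(f"injected failure in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_failure(rate=0.3)
def fetch_alert_status(check_id):
    # Hypothetical call standing in for "the alert will always fire."
    return {"check_id": check_id, "fired": True}

failures = 0
for i in range(100):
    try:
        fetch_alert_status(i)
    except TimeoutError:
        failures += 1   # in a real experiment, verify fallbacks and alerting here

print(f"{failures} of 100 calls failed; did the system degrade gracefully?")
```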

Measuring Tangible and Actionable Success

Attackers have another advantage — they can measure success and receive immediate feedback on their metrics. Attacker metrics are straightforward: Do they have access? How much access do they have? Can they accomplish their goal? Security vendors, by contrast, often struggle to create lucid, actionable metrics — especially metrics that offer immediate feedback, she said.

“We want to be fast ever-evolving defenders, we need system signals that can inspire quick action, we need system signals that can inform change,” she said. “It turns out reliability signals are friends here, they’re really useful for security.”

IT security should learn and use the organization’s observability stack, she advised. Those signals can even help detect the presence of attackers, she added.

“Again, attackers monitor the system they’re compromising to make sure they’re not tipping off defenders, or tripping over any sort of alert thresholds. So in the resilience revolution, we want to collect system signals, too, so we can be fast and ever-evolving right back,” she said.

The post More Lessons from Hackers: How IT Can Do Better appeared first on The New Stack.

]]>
SRE vs Platform Engineer: Can’t We All Just Get Along? https://thenewstack.io/sre-vs-platform-engineer-cant-we-all-just-get-along/ Wed, 30 Aug 2023 14:48:48 +0000 https://thenewstack.io/?p=22716665

So far, 2023 has been all about doing more with less. Thankfully, tech layoffs — a reaction to sustained, uncontrolled

The post SRE vs Platform Engineer: Can’t We All Just Get Along? appeared first on The New Stack.

]]>

So far, 2023 has been all about doing more with less. Thankfully, tech layoffs — a reaction to sustained, uncontrolled growth and economic downturn — seem to have slowed. Still, many teams are left with fewer engineers working on increasingly complex and distributed systems. Something’s got to give.

It’s no wonder that this year has seen the rise of platform engineering. After all, this sociotechnical practice looks to use toolchains and processes to streamline the developer experience (DevEx), reducing friction on the path to release, so those that are short-staffed can focus on their end game — delivering value to customers faster.

What might be surprising, however, is the rolling back of the site reliability engineering or SRE movement. Both platform teams and SREs tend to work cross-organizationally on the operations side of things. But, while platform engineers focus on that DevEx, SREs focus on reliability and scalability of systems — usually involving monitoring and observability, incident response, and maybe even security. Platform teams are all about increasing developer productivity and speed, while SRE teams are all about increasing uptime in production.

Lately, a lot of organizations are also in the habit of simply waving a fairy wand and — bibbidi-bobbidi-boo!— changing job titles, like from site reliability engineer, sysadmin or DevOps engineer to platform engineer. Is this just because the latter makes for cheaper employees? Or can a change in role really make a difference? How many organizations are changing to adopt a platform as a product mindset versus just finding a new way to add to the ops backlog?

What do these trends actually mean in reality? Is it really SRE versus platform engineering? Are companies actually skipping site reliability engineering and jumping right into a platform-first approach? Or, as Justin Warren, founder and principal analyst at PivotNine, wrote in Forbes, is platform engineering already at risk of “collapsing under the weight of its own popularity, hugged to death by over-eager marketing folk?”

In 2023, we have more important things to worry about than two teams with similar objectives feeling pitted against each other. Let’s talk about where this conflict ends and where collaboration and corporate cohabitation begins.

SREs Should Be More Platform-focused

There’s opportunity in bringing platform teams and SREs together, but a history of friction and frustration can slow that collaboration. Often, SREs can be seen as gatekeepers, while platform engineers are just setting up the guardrails. That could be the shine effect for more nascent platform teams or it can be the truth at some orgs.

“Outside of Google, SREs in most organizations lack the capacity to constantly think about ways to enable better developer self-service or improve architecture and infrastructure tooling while also establishing an observability and tracing setup. Most SRE teams are just trying to survive,” wrote Luca Galante, from Humanitec’s product and growth team. He argues that too many companies are trying to follow the lead of these “elite engineering organizations,” and the result is still developers tossing code over the wall, leaving too much of the burden on SREs as they try to catch up.

Instead, Galante argues, a platform as a product approach allows organizations to focus on the developer experience, which, in turn, should lighten the load of operations. After all, when deployed well, platform engineering can actually help support the site reliability engineering team by reducing incidents and tickets via guardrails and systemization.

In fact, Dynatrace’s 2022 State of SRE Report emphasizes that the way forward for SRE teams is a “platform-based solution with state-of-the-art automation and everything-as-code capabilities that support the full lifecycle from configuration and testing to observability and remediation.” The report continues that SREs are still essential in creating a “single version of the truth” in an organization.

A platform is certainly part of the solution. It’s just that, as we know from this year’s Puppet State of Platform Engineering Report, most companies have three to six different internal developer platforms running at once. That could leave platform and SRE teams working in isolation.

Xenonstack technical strategy consultancy actually places platform engineering and SRE at different layers of the technical stack, not in opposition to each other. It looks at SRE as a lower level or foundational process, while platform engineering is a higher level process that abstracts out ops work, including that which the SRE team puts in place.

Both SRE and platform teams are deemed necessary functions in the cloud native world. The next step is to figure out how they can not just collaborate but integrate their work together. After all, a focus on standardization, as is inherent to platform engineering, only supports security and uptime goals.

Another opportunity is in how SREs use service level objectives (SLOs) and error budgets to set expectations for reliability. Platform engineers should consider applying the same practices but for their internal customers.
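
To make that concrete, here is a quick worked example (the numbers are illustrative, not drawn from either report): a 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of error budget, and a platform team could track an internal "golden path provisioning" SLO with exactly the same arithmetic.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    return window_days * 24 * 60 * (1 - slo)

def budget_remaining(slo: float, measured_availability: float,
                     window_days: int = 30) -> float:
    """Budget left over the window, given availability measured so far."""
    burned = window_days * 24 * 60 * (1 - measured_availability)
    return error_budget_minutes(slo, window_days) - burned

# An external-facing service SLO ...
print(error_budget_minutes(0.999))        # ~43.2 minutes per 30 days
# ... and the same math applied to a hypothetical internal platform SLO.
print(budget_remaining(0.995, 0.997))     # positive: budget still in the bank
```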

The same Dynatrace State of SRE Report also found that, in 2022, more than a third of respondents already had the platform team managing the external SLOs.

In the end, it is OK if these two job buckets become grayer — even to the developer audience — so long as your engineers can work through one single viewpoint and, when things deviate from that singularity, they know who to ask.

How SREs Built Electrolux’s Platform

Whether a platform enables your site reliability team or your SREs can help drive your platform-as-a-product approach, collaboration yields better results than conflict. How it’s implemented is as varied as an organization’s technical stack and company culture.

Back in 2017, the second largest home appliance maker in the world, Electrolux, shifted toward its future in the Internet of Things. It opened a digital products division to eventually connect hundreds of home goods. This product team kicked off with ten developers and two SREs. Now, in 2023, the company has grown to about 200 developers helping to build over 350 connected products — supported by only seven SREs.

Electrolux teammates, Kristina Kondrashevich, SRE product owner, and Gang Luo, SRE manager, spoke at this year’s PlatformCon about how building their own platform allowed them to scale their development and product coverage without proportionally scaling their SRE team.

Initially, the SREs and developers sat on the same product team. Eventually, they split up but still worked on the same products. As the company scaled with more product teams, the support tickets started to pile up. This is when the virtual event’s screen filled with screenshots of Slack notifications around developer pain points, including service requests, meetings and logs for any new cluster, pipeline or database migration.

Electrolux engineering realized that it needed to scale the automation and knowledge sharing, too.

“[Developers] would like to write code and push it into production immediately, but we want them to be focused on how it’s delivered, how they provision the infrastructure for their services. How do they achieve their SLO? How much does it cost for them?” Kondrashevich said, realizing that the developers don’t usually care about this information. “They want it to be done. And we want our consumers to be happy.”

She said they realized that “We needed to create for them a golden path where they can click one button and get a new AWS environment.”

As the company continued to scale to include several product teams serving hundreds of connected appliances, the SRE team pivoted to becoming its own product team, as Electrolux set out to build an internal developer platform in order to offer a self-service model to all product teams.

Electrolux’s platform was built to hold all the existing automation, as well as well-defined policies, patterns and best practices.

“If developers need any infrastructure today — for example, if they need a Kubernetes cluster or database — they can simply go to the platform and click a few buttons and make some selections, and they will get their infrastructure up and running in a few minutes,” Luo said. He emphasized that “They don’t need to fire any tickets to the SRE team and we ensure that all the infrastructure that gets created has the same kind of policies, [and] they follow the same patterns as well.”

[Diagram: the platform includes infrastructure templates, service templates, API templates and internal tools, and brings into the cloud availability, CI/CD, monitoring, SLOs, alerting, security, cost and dashboards.]

“For developers, they don’t need to navigate different tools, they can use the single platform to access most of the resources,” he continued, across infrastructure, services and APIs. “Each feature contains multiple pre-defined templates, which has our policies embedded, so, if someone creates a new infrastructure or creates a new service, we can ensure that it already has what we need for security, for observability. This provided the golden path for our developers,” who no longer need to worry about things like setting up CI/CD or monitoring.
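
A minimal sketch of what such a golden-path request might look like behind the button click (every name, parameter and policy default below is hypothetical, not Electrolux's actual implementation): the developer supplies only the few fields they care about, and the platform merges in the mandatory policies, monitoring and CI/CD settings.

```python
from dataclasses import dataclass, field

# Everything here is hypothetical; it illustrates a "golden path" request where
# organizational policy is baked into defaults rather than left to each developer
# (or an SRE ticket) to remember.

ORG_POLICY = {
    "region": "eu-north-1",
    "monitoring": True,
    "slo_target": 0.999,
    "cost_center_tag_required": True,
    "ci_cd_pipeline": "standard-deploy",
}

@dataclass
class ClusterRequest:
    team: str
    service: str
    size: str = "small"                       # the only knobs developers touch
    settings: dict = field(default_factory=dict)

def provision(request: ClusterRequest) -> dict:
    """Merge the developer's minimal request with mandatory platform policy."""
    spec = {**ORG_POLICY, **request.settings,
            "team": request.team, "service": request.service, "size": request.size}
    # A real platform would now hand this spec to its IaC/provisioning backend.
    return spec

print(provision(ClusterRequest(team="oven-connectivity", service="telemetry-api")))
```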

Electrolux’s SRE team actually evolved into a platform-as-a-product team, as a way to cover the whole developer journey. As part of this, Kondrashevich explained, they created a platform plug-in to track cloud costs as well as service requests per month.

“The first intention was to show that it costs money to do manual work. Rather the SRE team can spend time and provide the automation — then it will be for free,” she said. Also, by observing costs via the platform, they’ve enabled cross-organization visibility and FinOps. “Before our SRE team was responsible for cost and infrastructure. Today, we see how our product teams are owners of not only their products but…their expenses for where they run their services, pipelines, etcetera.”

They also measure platform success with continuous surveying and office hours.

In the end, whether it’s the SRE or the product team running the show, “Consumer experience is everything,” Kondrashevich said. “When you have visibility of what other teams are doing now, you can understand more, and you can speak more, and you can share this experience with others.”

To achieve any and all of this, she argues, you really need to understand what site reliability engineering means for your individual company.

The colleagues ended their PlatformCon presentation with an important disclaimer: “You shouldn’t simply follow the same steps as we have done because you might not have the same result.”

The post SRE vs Platform Engineer: Can’t We All Just Get Along? appeared first on The New Stack.

]]>
Top 4 Factors for Cloud Native Observability Tool Selection  https://thenewstack.io/top-4-factors-for-cloud-native-observability-tool-selection/ Tue, 29 Aug 2023 10:00:43 +0000 https://thenewstack.io/?p=22716741

This is the fourth of a four-part series. Read Part 1, Part 2 and Part 3. Cloud native adoption

The post Top 4 Factors for Cloud Native Observability Tool Selection  appeared first on The New Stack.

]]>

This is the fourth of a four-part series. Read Part 1, Part 2 and Part 3.

Cloud native adoption isn’t something that can be done with a lift-and-shift migration. There’s much to learn and consider before taking the leap to ensure the cloud native environment can help with business and technical needs. For those who are early in their modernization journeys, this can mean learning the various cloud native terms, benefits, pitfalls and how cloud native observability is essential to success. 

To help, we’ve created a four-part primer around “getting started with cloud native.” These articles are designed to educate and help outline the what and why of cloud native architecture.

Our most recent article covered why traditional application performance monitoring (APM) tools can’t keep up with modern observability needs. This one covers the features and business requirements to consider when selecting cloud native observability tools. 

The Need for Cloud Native Observability

Today’s developers are driven by two general issues pervasive throughout organizations of any size in any industry. First, they must be able to rapidly create and frequently update applications to meet ever-changing business opportunities. And they also must cater to stakeholders and a user base that expects (and demands) apps to be highly available and responsive, and incorporate the newest technologies as they emerge.

Monolithic approaches cannot meet these objectives, but cloud native architecture can. However, enterprises going from monolithic infrastructures to cloud native environments will fail without a modern approach to observability. But while the challenges cloud native adoption brings are real, they are not insurmountable.

Arming teams with modern observability that is purpose-built for cloud native environments will allow them to quickly detect and remediate issues across the environment. Your applications will work as expected. Customers will be happy. Revenue will be protected.

What to Look for in a Cloud Native Observability Solution

A suitable cloud native observability solution will:

Control Data … and Costs

Your cloud native observability solution should help you understand how much data you have and where it’s coming from, as well as make it simpler to quickly find the data you need to solve issues and achieve business results.

Traditional APM and infrastructure monitoring tools lack the ability to efficiently manage the exponential growth of observability data and the technical and organizational complexities of cloud native environments. APM and infrastructure monitoring tools require you to collect, store and pay for all your data regardless of the value you get from it.

With a central control plane, you optimize costs as data grows, without any surprise budget overruns. You get persistent system reliability and improve your user experience. You enjoy platform-generated recommendations that optimize performance. You also waste fewer valuable engineering resources on troubleshooting; instead, you reduce noise and redirect your engineers to solve problems faster. A central control plane allows you to refine, transform and manage your observability data based on need, context and utility. That way, you can analyze and understand the value of your observability data faster, including what’s useful and what’s waste.
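
As a vendor-neutral illustration of the kind of rule such a control plane might apply (the metric, labels and rule below are made up), aggregating away a high-cardinality label keeps the signal while shedding the cost:

```python
from collections import defaultdict

# Hypothetical raw samples carrying a high-cardinality "pod" label.
samples = [
    {"metric": "http_requests_total",
     "labels": {"service": "checkout", "pod": "checkout-7f9c", "status": "500"}, "value": 3},
    {"metric": "http_requests_total",
     "labels": {"service": "checkout", "pod": "checkout-9d2a", "status": "500"}, "value": 5},
]

DROP_LABELS = {"pod"}   # illustrative rule: per-pod detail isn't worth its cost

def aggregate(samples):
    """Roll up samples after dropping the labels the rule says we don't need."""
    rolled = defaultdict(float)
    for s in samples:
        kept = tuple(sorted((k, v) for k, v in s["labels"].items() if k not in DROP_LABELS))
        rolled[(s["metric"], kept)] += s["value"]
    return dict(rolled)

print(aggregate(samples))
# {('http_requests_total', (('service', 'checkout'), ('status', '500'))): 8.0}
```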

Avoid Vendor Lock-In with Open Source Compatibility

Proprietary formats not only make it difficult for engineers to learn how to use systems, but they add customization complexity. Modern cloud native observability solutions natively integrate with open source standards such as Prometheus and OpenTelemetry, which eliminates switching costs. In times of economic uncertainty and when tech talent is scarce, you’ll want to invest in a solution that is open to possibilities.

Ensure Availability and Reliability

When your observability platform is down — even short, intermittent outages or performance degradations — your team is flying blind with no visibility into your services. The good news is that you don’t have to live with unpredictable and unreliable observability.

Your modernization plan should include working with an observability vendor that offers a 99.9% uptime service-level agreement (SLA), which can be confirmed by finding out what the actual delivered uptime has been for the past 12 months. Also, dig in a little to understand how vendors define and monitor SLAs and at what point they notify customers of problems. A best-in-class solution will proactively monitor its own systems for downtime and count any period greater than a few minutes of system inaccessibility as downtime, prompting immediate customer notification.

Predict and Resolve Customer-Facing Problems Faster

A cloud native observability solution can improve engineering metrics such as mean time to remediate (MTTR) and mean time to detect (MTTD) as well as time to deploy. But that’s not all. It can also provide real-time insights that help improve business key performance indicators (KPIs) such as payment failures, orders submitted/processed or application latency that can hurt the customer experience.
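
To show how these engineering metrics are typically computed (the incident records below are invented, and this sketch uses one common convention: MTTD measured from impact start to detection, MTTR from detection to resolution):

```python
from datetime import datetime

# Hypothetical incident records; the timestamps would normally come from your
# incident-management or observability tooling.
incidents = [
    {"started": "2023-08-01T02:10", "detected": "2023-08-01T02:18", "resolved": "2023-08-01T03:05"},
    {"started": "2023-08-09T14:00", "detected": "2023-08-09T14:03", "resolved": "2023-08-09T14:40"},
]

def _minutes(a: str, b: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(b, fmt) - datetime.strptime(a, fmt)).total_seconds() / 60

mttd = sum(_minutes(i["started"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(_minutes(i["detected"], i["resolved"]) for i in incidents) / len(incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```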

Promote Strong Developer Productivity from the Jump

Today’s engineering on-call shifts are stressful because people can’t find the right data, run queries quickly or remediate issues fast — something enterprises should try to avoid when transitioning to a modern environment.

Most APM tools were introduced more than a decade ago when most engineering teams were organized in a top-down hierarchical fashion. In a DevOps world, developers own responsibility for the operations of their applications. The best way to support a modern environment that’s been organized with small, interdependent engineering teams is with an observability solution that supports workflows aligned with how your distributed, interdependent engineering teams are operating.

Your Observability Vendor Should Be a Partner in Your Cloud Native Success

Technical expertise isn’t a nice to have; it’s a must have for successful businesses. Vendor support experts help teams meet service-level agreements. Therefore, your observability vendor should offer customer support experts that are always available to help you navigate your cloud native journey — at no additional charge.

Read our full series on getting started with cloud native:

  1. 5 Things to Know Before Adopting Cloud Native
  2. Pros and Cons of Cloud Native to Consider Before Adoption
  3. 3 Ways Traditional APM Systems Hinder Modern Observability 
  4. Top 4 Factors for Cloud Native Observability Tool Selection

The post Top 4 Factors for Cloud Native Observability Tool Selection  appeared first on The New Stack.

]]>
3 Ways Traditional APM Systems Hinder Modern Observability  https://thenewstack.io/3-ways-traditional-apm-systems-hinder-modern-observability/ Tue, 22 Aug 2023 16:39:47 +0000 https://thenewstack.io/?p=22716321

This is the third of a four-part series. Read Part 1 and Part 2.  Cloud native adoption isn’t something that

The post 3 Ways Traditional APM Systems Hinder Modern Observability  appeared first on The New Stack.

]]>

This is the third of a four-part series. Read Part 1 and Part 2

Cloud native adoption isn’t something that can be done with a lift-and-shift migration. There’s much to learn and consider before taking the leap to ensure the cloud native environment can help with business and technical needs. For those who are early in their modernization journeys, this can mean learning the various cloud native terms, benefits, pitfalls and about how cloud native observability is essential to success. 

To help, we’ve created a four-part primer around “getting started with cloud native.” These articles are designed to educate and help outline the what and why of cloud native architecture.

The previous article discussed the benefits and drawbacks of cloud native architecture. This article explains why traditional application performance monitoring tools aren’t suited for modern observability needs. 

Cloud Native Requires New Tools

As cloud native approaches are more widely adopted, new challenges emerge. Organizations find it harder to understand the interdependencies between the various elements that make up an application or service. And their staff can spend enormous amounts of time trying to get to the root cause of an issue and fix problems.

What makes cloud native environments so different and more challenging to manage? Enterprises monitoring early cloud native workloads only need access to simple performance and availability data. In this scenario, the siloed nature of these platforms isn’t an obstacle to keeping applications or infrastructure running and healthy. So, traditional application performance monitoring (APM) and infrastructure monitoring tools do the job.

But as organizations begin their cloud native initiatives and use DevOps principles to speed application development, they need more. APM and infrastructure monitoring tools simply cannot provide the scalability, reliability and shared data insights needed to rapidly deliver cloud native applications at scale.

Legacy Tool Shortcomings

Here are some key ways legacy monitoring tools fail to meet cloud native challenges. These shortcomings will cause acute pain as your cloud native environment grows and should be factors that are considered when devising your modernization plan:

  • Inability to navigate microservices. Legacy tools are unable to navigate and highlight all the interdependencies of a microservices environment, making it nearly impossible to detect and remediate issues in a timely manner.
  • Lack of control. APM and infrastructure monitoring solutions lack data controls and visibility into observability data usage across teams and individuals. Simple code changes or new deployments can result in surprise overages.
  • Vendor lock-in. Proprietary solutions make it nearly impossible to switch tools, leaving you powerless when prices go up.

And though these may seem like engineering-centric challenges, they end up having a big impact on overall business health:

  • Costs increase. Because the pricing models for these tools are aligned to data ingestion, users or hosts, and there are no mechanisms to control data growth, it’s easy for costs to spiral out of control.
  • Teams end up flying blind. Rapidly rising costs force teams to restrict custom metrics and cardinality tags, limiting visibility into how the stack is behaving and leaving teams without important data.
  • Developer productivity plummets. Engineers are spending long nights and weekends troubleshooting. Burnout sets in. The skills gap worsens.
  • There is downtime and data loss. Service-level agreements (SLAs) and service-level objectives (SLOs) aren’t being met. Small changes lead to data loss.

What’s Needed?

These shortcomings have consequences due to the way modern businesses operate. Customer experience and application responsiveness are critical differentiators. Anything that affects either of these things can drive away customers, infuriate internal workers or alienate partners. Today, rather than waiting for problems — including performance degradation, disruption and downtime — to happen, businesses need to be ahead of the issues. They need to anticipate problems in the making and take corrective actions before they affect the application or the user.

It is obvious that cloud native architectures offer many benefits, but organizations also potentially have many challenges to overcome. Traditional application, infrastructure and security monitoring tools offer some help, but what they truly need is an observability solution designed for cloud native environments.

In the next and final installment, we’ll cover four main considerations you should have during the cloud native observability software selection process.

Read our full series on getting started with cloud native:

  1. 5 Things to Know Before Adopting Cloud Native
  2. Pros and Cons of Cloud Native to Consider Before Adoption
  3. 3 Ways Traditional APM Systems Hinder Modern Observability
  4. Top 4 Factors for Cloud Native Observability Tool Selection

The post 3 Ways Traditional APM Systems Hinder Modern Observability  appeared first on The New Stack.

]]>
Your App Will Fail if Your Documentation Is Bad https://thenewstack.io/your-app-will-fail-if-your-documentation-is-bad/ Mon, 14 Aug 2023 10:00:46 +0000 https://thenewstack.io/?p=22714587

Struggles with providing proper documentation continue to be a weak point in many, if not most, software projects, which too

The post Your App Will Fail if Your Documentation Is Bad appeared first on The New Stack.

]]>

Struggles with providing proper documentation continue to be a weak point in many, if not most, software projects. While at the very least this failing undermines the user’s experience, poor-quality documentation can cause projects to fail that otherwise had a lot of value to offer.

There are numerous reasons for the dilemma. They include the frequent lack of sufficient time and resources. Sometimes developers think they don’t have the talent to properly document their pull requests and other work, when in fact the vast majority do. In many cases, developers assume their code is “self-documenting,” and must go back and figure out what they did when a support ticket comes their way about a feature a readme file could have easily explained.

Worse still, developers might wrongly assume that writing proper documentation isn’t their responsibility as they are primarily tasked with software development.

A “content specialist,” who might lack knowledge about software, could be assigned to the task, making claims about SEO expertise and other dubious skills but with little know-how to get the job done. On the other end of the spectrum, talented documentation specialists and tech writers and editors might be spread too thinly, earnestly striving to produce documentation but facing a scarcity of resources. They are often given the responsibility without the corresponding priority and contributions from those with the knowledge base in the organizations and the user and open source community to help. Consequently, software documentation often gets deprioritized and frequently lacks the necessary resources for accurate execution.

As Kelsey Hightower emphasizes in the introduction of “Docs for Developers”: “If developers are the superheroes of the software industry, then the lack of documentation is our kryptonite,” Hightower writes.

At the same time, for the developer, the most successful projects have documentation to guide developers through each workflow step, Hightower writes. “It’s because documentation is a feature,” Hightower writes. “In fact, it’s the first feature of your project most users interact with, because it’s the first thing we look for when trying to solve a problem.”

In this article, we look at why proper documentation is critical, why it so often fails, and what can be done not only to improve the user experience when seeking help and support but, in many cases, to save otherwise great projects.

Besides tailoring documentation to meet the needs of the users, getting there requires a team effort. The onus might initially be on the developer to document code as they create it, but once that pull request is made, there should be an army of contributors ready to add their valuable input. Open source is also highly conducive to letting users contribute directly.

Among other things, good documentation requires input and collaboration from all stakeholders, especially the user, Fiona Peers Artiaga, director of documentation and technical writing at Grafana Labs, said during GrafanaCon.

With good documentation, that 4:00 AM alert or urgent job ticket can be resolved by the user. The code is commented and READMEs are accurate and up to date in this case, Hightower writes. “You have a getting started guide and a set of tutorials that target your users’ top use cases. When a user asks you for help, you point them to documentation that’s genuinely helpful,” Hightower writes. “That four AM pager alert? It took five minutes to resolve because you found what you needed with your first search.”

Important as the Code Itself

According to Alanna Burke, community manager and developer relations advocate for Lagoon, an application delivery platform, documentation is as “important as the code itself.” Burke makes this statement based on data and research. For example, a study with Cornell University found that 61% of professionals using distributed tools struggle to figure out their colleagues’ work. Additionally, 44% face challenges in identifying duplicated efforts due to siloed digital media tools, while 62% miss opportunities to collaborate effectively with coworkers.

Burke highlights the significance of these statistics, as they shed light on the common excuses that lead to bad documentation practices. Some developers believe that code is self-documenting or find it too arduous to keep documentation up to date, impacting productivity. Google’s case study revealed that 48% of their engineers considered poor documentation their top productivity issue, while 50% of Site Reliability Engineering (SRE) problems were attributed to documentation-related issues.

The consequences of inadequate documentation are significant. IT professionals spend about 20% of their time searching for information, resulting in wasted resources. For an average salary of $60,000 per year, this translates to approximately $13,760 per employee annually. The opportunity cost, which reflects the potential earnings if time were not wasted, amounts to about $34,400 per employee.

“By recognizing the importance of documentation on par with code, organizations can make it an official part of every employee’s job description. Establishing this expectation fosters an intrinsic value for documentation, motivating individuals to update and disseminate information effectively,” Burke said.

Burke emphasizes “the necessity of documenting everything as humans have limitations in data storage. Effective documentation should be easy to understand, devoid of excessive jargon, and complemented by relevant visuals,” Burke said. “It should provide clear instructions without delving into unnecessary details.”

Additionally, documentation must be vetted for accuracy, with authors and dates specified to track any inaccuracies or updates, Burke said.

Ultimately, Burke said, documentation empowers employees to grasp their work context, promotes collaboration, and encourages product adoption. Without clear documentation, organizations risk confusion, duplication of efforts, and inefficient data storage. Making documentation a priority ensures a seamless workflow and facilitates successful product usage and comprehension.

Burke emphasized that good documentation should be easily understandable and avoid excessive use of jargon or complicated language. It should be straightforward, getting to the point without telling unnecessary stories, and instead, focusing on providing clear instructions for users.

According to Burke, incorporating good visuals in documentation is crucial as people are naturally drawn to pleasing and visually appealing content. Well-designed and laid-out documentation captures users’ attention and encourages them to engage with the material for longer periods, Burke said.

On the other hand, Burke said, difficult-to-read documentation discourages users from spending significant time on it. To encourage user engagement, well-designed and interesting documentation is essential. Accuracy is of utmost importance, and inaccurate documentation can lead to significant issues. Google’s documentation practices include specifying the author and date, allowing easy identification of any inaccuracies and their origins. Furthermore, knowing the age of the documentation helps users understand its relevance and potential updates, Burke said.

The Right Audience

Addressing the right audience is another key aspect of effective documentation. Tailoring the content to the end users or administrators ensures that the information provided is relevant and meets their specific needs, Burke said.

Burke highlights that well-crafted documentation is essential for seamless understanding, improved user engagement, and reliable information dissemination. By following these best practices, organizations can ensure that their documentation remains valuable and supportive of their users’ needs, Burke said.

“One of my favorite aspects is ensuring that code blocks come with clear context, telling people where the code should be used. For instance, in this scenario, we have an application deployment Yaml file. In this example, users have a designated section to provide feedback, indicating whether the deputation was successful. Moreover, we go the extra mile by including essential details such as the author’s name, the date of the update, and a direct link to the corresponding Git commit.

“This comprehensive approach to documentation ensures that users have all the necessary information to understand the code’s purpose, its origin, and any recent modifications made to it. By providing such clarity, we empower developers and users to work with the codebase more efficiently and effectively,” Burke said. “So, whenever you encounter a code block, rest assured that the documentation will guide you to its proper implementation, allowing for a smooth and well-informed coding experience.”

Grafana Initiative

Grafana Labs launched its Writers’ Toolkit last year, announcing it during GrafanaCon. Fiona Peers Artiaga, director of documentation and technical writing at Grafana Labs, said the toolkit is designed to empower all contributors to enhance documentation regardless of their role, whether from Grafana Labs or among the many global open-source contributors.

It ensures the production of high-quality documentation by providing downloadable content, templates, and models. The toolkit offers writing guidance and templates, streamlining the process and reducing friction, thus helping contributors avoid common pitfalls that hinder others from finding technical solutions, Artiaga said.

Artiaga said collaboration between those with in-depth knowledge of a feature and those skilled in presenting information is a core aspect. The toolkit promotes scalability by aiding contributors in creating technical documentation and objects, while also minimizing technical debt. It aligns closely with technical writing best practices and provides guidance and templates. When making a contribution, the technical writing team collaborates to ensure users receive the best possible information, Artiaga said.

“At Grafana Labs, we want everyone to bring their gifts to the documentation. Engineers are best positioned to explain how the code works; technical writers are best positioned to create consistent, comprehensive, and consumable information,” Artiaga said.

Key enhancements made in 2023 include improved navigation, recognizing that users search for information for various reasons such as problem-solving, answering questions, or gaining a deeper understanding of the tools used, Artiaga said. Diverse methods are employed, including search engines and navigating through different topics on the documentation site or starting from the docs homepage.

Incorporation of more feedback into the design process is an active endeavor, Artiaga said. Increased interest and feedback from contributors help identify specific pages and topics they would like to see highlighted more prominently. This direct user connection is a source of enthusiasm for the writing team, Artiaga said.

Moreover, the user context upon entering documentation has been enhanced. Recognizing that users often arrive from search engines, visual cues are provided to help them understand their location within the content and guide them to the next steps. Unlike the previous design, users are guided to primary topics and encouraged to stay within critical content, Artiaga said. A sidebar is thoughtfully designed to outline key subjects of interest and nest related topics underneath, creating a user-friendly ecosystem of documentation, Artiaga said.

This is exemplified in the documentation for Grafana Tempo. By comparing the old and new designs side by side, the enhanced accessibility and focus on critical elements that make information easily findable and usable for users seeking answers are evident.

At Grafana Labs, commitment to documentation involves contributions from various sources, including the field engineering team, technical support team, and developers who collaboratively develop content, and especially, the users themselves, by way of direct feedback on the documentation pages on the website or via pull requests on GitHub, Artiaga said. It truly is a collaborative effort that is behind the creation of the Writer’s Toolkit, Artiaga said.

“We value the input we receive from our contributors, so we develop our open source documentation in the open using a docs-as-code model,” Artiaga said. “This approach ensures that contributors can comment on the documentation, open issues, and improve the documentation themselves.”

The post Your App Will Fail if Your Documentation Is Bad appeared first on The New Stack.

]]>
Incident Management: How Organizational Context Can Help https://thenewstack.io/incident-management-how-organizational-context-can-help/ Fri, 11 Aug 2023 10:00:09 +0000 https://thenewstack.io/?p=22715500

It’s no secret that today’s IT environments are more dynamic and complex than ever — which means effective incident management

The post Incident Management: How Organizational Context Can Help appeared first on The New Stack.

]]>

It’s no secret that today’s IT environments are more dynamic and complex than ever — which means effective incident management matters more than ever.

Too often, however, help desk teams, IT operations engineers, and other technologists are forced to search for the proverbial needle in a haystack.

That haystack has become more like a hayfield. Homogeneous, monolithic systems have given way to far more dynamic and distributed applications and infrastructure. Think: containerization and orchestration, multicloud and hybrid cloud, microservices architecture, CI/CD, and a software supply chain that spans a virtually limitless number of sources.

That technical complexity is mirrored in the makeups and structures of modern businesses themselves, according to Chris Evans, co-founder and CPO of incident.io, an incident management platform.

“Organizations are incredibly complicated these days, whether it is from technology — where you’ve got myriads of different software systems and infrastructure and teams that own them — but also the interconnections between teams themselves and extensions out to customers,” Evans told The New Stack.

“Organizations can be thought of as this big web or graph of things that can be connected and mapped to other things in the organization.”

The sheer quantity and scale of those things — and how they explicitly and implicitly connect — is vast today. While incident management is a longstanding pillar in enterprise IT, both as a practice and in terms of the various tools and platforms that support that effort, it hasn’t necessarily kept pace with the scope and scale of that organizational web (or graph) that Evans described.

That’s why incident.io recently launched Catalog, a new feature on its platform intended to arm teams with the dynamic contextual awareness needed to effectively and efficiently respond to incidents when they occur — and without burning the house down in the process.

What Is Catalog?

Catalog is essentially a modern take on an older approach to this, the configuration management database (CMDB). CMDBs have typically been used to organize and track various IT assets, from employee laptops to databases.

“It was a static list back when organizations were a bit more static, certainly from a technology standpoint,” Evans said.

Service catalogs are a bit more dynamic, but Evans notes they have typically been restricted to a fairly fixed topology along the lines of: We have teams, those teams own services, and those services depend on the underlying infrastructure.

Catalog essentially pairs the two approaches and aims to push them further to better mirror and adapt to modern organizations and technology systems.

“Catalog is a very flexible data structure that lets you model all of the things that exist in your organization and all of the connections between them,” Evans said.

That flexibility means you can model virtually anything — not just, say, the straight line between an application, the internal team that owns it, and the infra it runs on, but connections to different business functions, or to customers and the account managers that take care of them, or to virtually any other facet of a specific organization.

The table stakes here, according to Evans, are the ability to map a particular incident to how it might impact a particular customer or a particular business process. But then you can layer on additional data that creates a richer context, one that essentially magnetizes those needles out of the haystack instead of sending valuable team members on a never-ending chase.

Moreover, the feature can use inference to kickstart automated workflows from the moment an incident is created.

For example, a customer-support representative (CSR) might get bombarded with inbound calls about problems with a customer-facing mobile application. Odds are that CSR doesn’t know who’s responsible for the app or how to troubleshoot.

But once an incident is created, Catalog can automatically notify the people or teams that need to act because of their connection to a particular system or business functionality (the mobile app).

That workflow can get very granular: What is that team’s Slack channel? Who is the team lead? Where is the PagerDuty Escalation Policy I should use? And so forth.
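
A toy sketch of that idea (this is not incident.io's actual data model; the entity names, Slack channels and escalation policies are made up) shows how modeling the organization as connected records lets an incident against one entity, such as "mobile-app," resolve to the owning team, its Slack channel, its escalation policy and the customers affected:

```python
# All names below are hypothetical; the point is the shape of the data:
# a graph of organizational entities and the links between them.
catalog = {
    "features":  {"mobile-app": {"owned_by": "team-payments"}},
    "teams":     {"team-payments": {"slack": "#payments-oncall",
                                    "lead": "alex",
                                    "escalation_policy": "payments-primary"}},
    "customers": {"acme-corp": {"uses": ["mobile-app"], "account_manager": "sam"}},
}

def route_incident(feature: str) -> dict:
    """Resolve who to page and where to talk, starting from the affected feature."""
    team_name = catalog["features"][feature]["owned_by"]
    team = catalog["teams"][team_name]
    impacted = [c for c, rec in catalog["customers"].items() if feature in rec["uses"]]
    return {"notify": team["slack"], "page": team["escalation_policy"],
            "lead": team["lead"], "impacted_customers": impacted}

print(route_incident("mobile-app"))
```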

Reducing Cognitive Load, Enabling Faster Response

Essentially, Evans said, Catalog helps reduce team members’ cognitive load, while encoding organization rules that can considerably cut down on lead time and manual effort in terms of response and mitigation.

This wasn’t especially necessary 10 or 20 years ago in the conventional IT environment: A database, a monolithic application or two, a few (or even a bunch of) servers that you could walk down the hall to see running in your data center. But that’s not the reality for most modern enterprises of any kind of substantial size — say, 100 employees and up.

“Modern organizations just aren’t building things the same way,” Evans said.

He shared an illustration from his previous company, an online bank in the U.K. When Evans started there in 2017, the bank ran on roughly 250 microservices running on hundreds of Amazon Web Services servers, all managed by five or so engineering teams.

That has enough complexity on its own. But when Evans left the firm in 2021, the bank ran about 2,500 microservices — with multiple copies of each running for redundancy, which meant north of 10,000 different workflows managing them. The microservices were deployed on nearly 1,000 cloud-based servers, and the company had grown from roughly 70 employees to about 2,500.

If an incident occurred that was strictly an engineering issue, the firm’s service catalog typically sufficed. But Evans said it lacked the context needed to quickly span out to the complete organization whenever that might have been needed — say, identifying the particular executive who might need to be looped in, or the ability to rapidly determine which customers might be most directly impacted and act accordingly.

In most organizations, that kind of context is pushed onto the plates of the front-line employees actually responding to incidents, which almost invariably adds time, headaches and costs.

Said Evans: “It’s exactly that [challenge] that we’re trying to solve with Catalog: Give everyone the shared context of the organization and navigate that live during an incident, when things are already super high-pressure and you don’t have the time to go talk to a million different people. Give that context to everyone in one place.”

The post Incident Management: How Organizational Context Can Help appeared first on The New Stack.

]]>
Why Developers Need Their Own Observability https://thenewstack.io/why-developers-need-their-own-observability/ Thu, 27 Jul 2023 17:15:40 +0000 https://thenewstack.io/?p=22714165

“Why did it do that?” That’s the question software developers repeat all too frequently. Something went wrong with the app

The post Why Developers Need Their Own Observability appeared first on The New Stack.

]]>

“Why did it do that?”

That’s the question software developers repeat all too frequently. Something went wrong with the app they were working on. Did they introduce a bug? Was it something a colleague changed? Or perhaps it was some kind of infrastructure issue somewhere between the frontend and the backend?

In the bad old waterfall days, developers worked in boxes. Not only were Dev, Test and Ops entirely separate endeavors, but even within Dev, frontend and backend teams worked largely independently of each other.

No more. In today’s distributed, cloud native world, every software component is intertwined in a complex web of dependencies with many others, as are the teams that work on them.

In this rapidly changing, interconnected environment, developers need answers not only to why their app might be performing poorly, but also to how to fix it. And to get those answers, they need observability.

Not the Observability You Might Think

Observability is all the rage in Ops circles. Equip all the software to generate streams of telemetry data, and then use one of dozens of application performance management (APM) and infrastructure management or IT operations management (ITOM) tools to make sense of all that data.

The goal of operators’ and site reliability engineers’ observability efforts is straightforward: Aggregate logs and other telemetry, detect threats, monitor application and infrastructure performance, detect anomalies in behavior, prioritize those anomalies, identify their root causes and route discovered problems to their owners.

Basically, operators want to keep everything up and running — an important goal but not one that developers may share.

Developers require observability as well, but for different reasons. Today’s developers are responsible for the success of the code they deploy. As a result, they need ongoing visibility into how the code they’re working on will behave in production.

Unlike operations-focused observability tooling, developer-focused observability focuses on issues that matter to developers, like document object model (DOM) events, API behavior, detecting bad code patterns and smells, identifying problematic lines of code and test coverage.

Observability, therefore, means something different to developers than it does to operators, because developers want to look at application telemetry data in different ways to help them solve code-related problems.

Developers Need Observability Built for Their Needs

Because today’s developers work on complex distributed applications, they need observability that provides insight into how such applications behave. In particular, developers require:

  • End-to-end traceability: Developers must be able to trace a user interaction or other event from the frontend to the backend.
  • Visibility into API behavior: APIs are the glue that holds modern software together. Building APIs is thus a central role of the developer. How well an API behaves depends on both the code exposing it and the software that consumes it. Developers need visibility into both sides of this equation.
  • Event trails, aka “breadcrumbs”: Tracking down problems can be a whodunit mystery. Stepping through the events that led to a problem can identify its cause — and its fix. (A minimal sketch of recording such a trail follows this list.)
  • Version changes and responsible parties: In many cases, a problem crops up that is the result of some other developer’s work. Keeping track of who’s doing what and how their changes affect the current state of the software landscape is essential. Developers want to quickly assess the issue and either resolve it or route it to the responsible party.
  • Release tracking: Developers need to keep track of both new issues and regressions for each release. They must also track issue resolutions and monitor application health.
  • Test coverage: Developers also require test cases to ensure their code is reliable. In addition, if some code snippet never runs, then there’s no way to test it. The presence of such code may also reveal a logic error that prevents it from running.
  • Visibility into both pre-release and post-release behavior: Developers typically code and test in a pre-release environment. They require pre-release observability to ensure their app works in this environment. Only then do they push code to production. Afterward, they still need to monitor and observe the app so they know it works post-release, because other developers continue to push code that could affect their own.
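
To make the “breadcrumbs” idea concrete, here is a minimal sketch of recording an event trail with the Sentry Python SDK (Sentry comes up again later in this piece); the DSN, categories and messages are illustrative assumptions rather than a recommended setup:

```python
import sentry_sdk

# Illustrative placeholder DSN; use your own project's DSN.
sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def checkout(cart_id: str) -> None:
    # Each breadcrumb records one step on the trail leading up to a potential error.
    sentry_sdk.add_breadcrumb(category="cart", message=f"Loading cart {cart_id}", level="info")
    sentry_sdk.add_breadcrumb(category="api", message="Calling payment service", level="info")
    try:
        raise TimeoutError("payment service did not respond")  # stand-in for a real failure
    except TimeoutError as exc:
        # The captured event carries the breadcrumb trail, so a developer can step
        # through what happened just before the failure.
        sentry_sdk.capture_exception(exc)
```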

Clearly, the observability needs of developers are quite different from the needs of operators. Without the benefits of developer-focused observability, developers will be less productive and produce poorer code overall.

Worst of all, without sufficient observability, developers will stumble upon issues as though they were walking around in the dark — issues that will trip them up and detract from the creative flow that developers need to do their jobs well.

The Intellyx Take

We’ve all run into slow-loading pages and other performance issues, as well as the dreaded HTTP 500 Internal Server Error — an otherwise empty web page that indicates that something went wrong somewhere.

Nobody wants to see such errors — developers least of all. Their lack of information is matched only by the urgency of the fix.

Without the visibility a developer observability tool can provide, developers would remain in the dark. Such tools should be on every development team’s shopping list.

Unfortunately, shopping for a developer-focused observability tool can be tricky. APM and ITOM are well-defined categories, with established vendors who offer mature products. Not so with developer observability.

One of the few vendors pioneering this space is Sentry, but one vendor doesn’t make a market. Given the significance of developer observability, however, it won’t be long until other vendors crowd into this important segment.

The post Why Developers Need Their Own Observability appeared first on The New Stack.

]]>
VictoriaMetrics Offers Prometheus Replacement for Time Series Monitoring https://thenewstack.io/victoriametrics-offers-prometheus-replacement-for-timeseries-monitoring/ Mon, 17 Jul 2023 10:00:03 +0000 https://thenewstack.io/?p=22709776

Prometheus has emerged as the leading open source observability tool for cloud native infrastructure monitoring. But during its development during

The post VictoriaMetrics Offers Prometheus Replacement for Time Series Monitoring appeared first on The New Stack.

]]>

Prometheus has emerged as the leading open source observability tool for cloud native infrastructure monitoring. But over the past several years of its development, some say a gap has opened between the needs of its user base and the direction in which Prometheus is evolving.

One such user and maintainer was Roman Khavronenko, co-founder of VictoriaMetrics, which continues to expand on its flagship open source MetricsQL query language for time series monitoring.

“Our query language is designed to address the issues we encountered with Prometheus. When we initially experimented with Prometheus, we were satisfied with its capabilities, but as we delved deeper, we saw problems on an architectural level. Language problems, for example, were pointed out by Prometheus community members, but were rejected by maintainers,” Khavronenko said. “VictoriaMetrics has listened to the community, and no longer dependent on Prometheus libraries, we were able to shape our query engine to fulfill the new requirements.”

Conversely, projects such as Thanos, Cortex and Mimir re-use libraries developed by Prometheus maintainers, Khavronenko noted. Doing this helps to maintain “the highest level of compatibility, as all listed solutions are basically using the same code,” he said. “But once any of them wants to change something, it is a very long process to convince other parties that change is needed and satisfy all their requirements,” Khavronenko said. “VictoriaMetrics uses none of Prometheus libraries. And while it decreases compatibility level, it gives us a lot of flexibility to add features we want whenever we want.”

VictoriaMetrics’ system for alerts is similar to that of Prometheus, except it exists as a separate service. Image: VictoriaMetrics.

For the application developer, it’s crucial to maintain a fast pace and have control over the development process, Khavronenko said. Depending on external libraries can create vulnerabilities and other issues. “The main reason for developing our own query engine is that we made it more efficient and flexible. For example, MetricsQL was multithreaded from the very beginning, while PromQL remains single-threaded,” Khavronenko said. “This language retains the capabilities of the Prometheus query language while addressing the issues we encountered.”

Indeed, there is room for applications that can better meet the different needs of Prometheus users. “There currently are too many opportunities to make significant mistakes during Prometheus deployment and configuration. As the number of Kubernetes clusters increase throughout the enterprise, there are simply too many potential failure points to consistently ensure reliable and accurate monitoring across all clusters,” Torsten Volk, an analyst for Enterprise Management Associates (EMA), said. “Ideally, each new Kubernetes cluster should automatically include monitoring and alerting for all of its relevant metrics. The fact that these metrics and alerts can be different depending on the applications running on a specific cluster makes this even more challenging.”

Khavronenko said MetricsQL was specifically designed to:

  • help users solve the most common metrics queries.
  • be compatible with industry-standard Prometheus PromQL.
  • offer HDR-like histograms for accurate analysis of extreme data ranges.

MetricsQL is designed for querying time series data. It provides a rich list of functions for various aggregations, transformations, rollups and other functionality specific for time series, and “remains simple and efficient to use on any scale,” Khavronenko said.
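
As a rough illustration of what that looks like in practice, the sketch below sends a rate() expression, which is valid in both PromQL and MetricsQL, to a Prometheus-compatible HTTP query endpoint. The host, port and metric name are assumptions for the example, so adjust them for your own deployment:

```python
import requests

# Assumed endpoint: a Prometheus-compatible /api/v1/query API, such as the one a
# single-node VictoriaMetrics instance typically exposes; adjust for your setup.
QUERY_URL = "http://localhost:8428/api/v1/query"

# rate() over a counter yields per-second request throughput.
resp = requests.get(QUERY_URL, params={"query": "rate(http_requests_total[5m])"})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    _timestamp, value = series["value"]
    print(f"{labels}: {float(value):.2f} req/s")
```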

Applications include streaming services for video games, online music services, scientific research and other similar applications involving the distribution of streaming data. The applications can easily require the monitoring of billions of metrics that may be spread across multiple cloud deployments and physically located anywhere in the world, Khavronenko noted. This is where Prometheus typically falls short, he said.

Prometheus does not handle well the task of revealing the number of requests an application processes per second, a capability MetricsQL is designed to offer, Khavronenko noted. “Prometheus often provides extrapolated results rather than exact ones, leading to misleading information and potential issues,” Khavronenko said. “This issue was discussed extensively in the Prometheus GitHub repository as far back as 2019.”

Volk agreed: “There is simply too much knowledge needed to reliably monitor Kubernetes clusters,” Volk said. “If DevOps teams have to worry about how to optimally configure their queries so that they actually measure the correct data without causing resource issues on the cluster, effective monitoring is too difficult for a mainstream technology like Kubernetes.”

VictoriaMetrics’ revenue primarily comes from the enterprise version and the services provided to large companies. “We offer architectural support and additional features tailored to the needs of large organizations,” Khavronenko said. “Since we didn’t seek external investments and started earning money within six months of the release, we have been profitable from the beginning.”

VictoriaMetrics also recently introduced VictoriaLogs for monitoring applications, for what the company says is a “more strategic ‘state of all systems’ enterprise-wide observability.” VictoriaLogs works with both structured and unstructured logs for maximum backward compatibility with the large-scale infrastructure needed by users, whether they are academic or commercial, working in e-commerce or on video gaming teams, the company says.

While logs, metrics and traces make up the three pillars of observability, “many companies do not rely on traces at all and I’ve seen many fewer organizations not using metrics,” Khavronenko said. “But I haven’t seen a single IT company not doing logs. So, while VictoriaMetrics offers a scalable performance solution for metrics, VictoriaLogs now does the same for logs.”

The post VictoriaMetrics Offers Prometheus Replacement for Time Series Monitoring appeared first on The New Stack.

]]>
Why Did Grafana Labs Need to Add Adaptive Metrics? https://thenewstack.io/why-did-grafana-labs-need-to-add-adaptive-metrics/ Wed, 05 Jul 2023 12:00:38 +0000 https://thenewstack.io/?p=22709766

It is hard not to hear high cloud costs as a pain point when talking about the challenges of cloud

The post Why Did Grafana Labs Need to Add Adaptive Metrics? appeared first on The New Stack.

]]>

It is hard not to hear about high cloud costs as a pain point when talking about the challenges of cloud native architectures and Kubernetes. A major concern that organizations face, even after successfully transitioning to cloud native, is the unexpected rise in operations costs. Ironically, one of the ways to mitigate these costs is through observability, which can itself become expensive when relied on to improve application productivity, operational efficiency and security.

In the observability space, the surge in metric data to monitor represents a major culprit when it comes to cloud native costs. This is because a surge in redundant metrics — which often come in spikes following an incident or misconfiguration — represents wasted storage, computing power, memory, analytics and other expensive cloud resources. The issue is described as high cardinality: in the general sense, “cardinality” is defined as the number of elements in a given set, according to Merriam-Webster. In the context of observability, cardinality refers to the count of distinct values associated with a specific label.

Because Prometheus is a popular open source monitoring tool for cloud native environments, its metrics data is often scrutinized as a way to better manage cardinality, given the abundance of metrics that are crucial for observability. This pain point was felt at Grafana Labs, which is almost universally known for its Grafana panels. In response, Grafana recently introduced Adaptive Metrics, which aims to reduce cardinality and, consequently, cloud costs, and made the feature available to Grafana Cloud users.

Reducing Cardinality

This reduction in cardinality is achieved through automated processes intended to decrease the number of metric series. Adaptive Metrics automates the identification of unused time series data and eliminates it through aggregation. By reducing the number of series, and thus cardinality, the feature is designed to help organizations optimize cloud expenses. It also helps teams interpret the collected data and extract actionable insights for decision-making.

Reducing cardinality is a standard problem for data scientists to solve, involving the evaluation of the contribution of individual values to the prediction accuracy for the target variable, Torsten Volk, an analyst at Enterprise Management Associates (EMA), told The New Stack. For example, in observability, the target variables often are app performance, user experience, cost and resiliency. To reduce cardinality, the software can simply apply standard techniques such as principal component analysis, target mean encoding and binning. These calculations combine or eliminate values based on their contribution toward accurately predicting the target variables, Volk said.

“For example, instead of tracking exact numbers in milliseconds for response time, you may not lose any prediction accuracy by translating these numbers into percentiles. Or instead of tracking each individual value of a data stream, e.g. for memory usage, the algorithm might look at historical data and determine that you will get the same predictive accuracy by analyzing averages at the minute or even 10-minute level,” Volk said. “This is not a trivial challenge, as in certain cases prediction accuracy may significantly benefit from sub-second level measurement values, while for other cases aggregating these same measurements over 60 minutes may give you the same level of accuracy.”
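
A toy sketch of the kind of aggregation Volk describes, assuming per-second response-time samples held in a pandas series; in a real adaptive-metrics system these rollups are chosen automatically, whereas here the one-minute average and 95th-percentile rollups are picked by hand:

```python
import numpy as np
import pandas as pd

# Simulated per-second response times (ms) for one hour: 3,600 raw samples.
index = pd.date_range("2023-07-05 12:00", periods=3600, freq="S")
raw = pd.Series(np.random.lognormal(mean=4.0, sigma=0.3, size=3600), index=index)

# Rollup 1: keep only one-minute averages, 60 points instead of 3,600.
per_minute_avg = raw.resample("1min").mean()

# Rollup 2: keep a coarse percentile per minute instead of every exact value.
per_minute_p95 = raw.resample("1min").quantile(0.95)

print(f"raw points: {raw.size}, after aggregation: {per_minute_avg.size + per_minute_p95.size}")
```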

Adaptive Metrics

As mentioned above, Grafana first developed adaptive metrics to address its own cardinality challenges. “Prometheus has become hugely popular for good reason, but when there’s rapid adoption within an organization, unpredictable growth and cardinality can be a real challenge. We’ve felt this pain ourselves at Grafana Labs. We were spending quite a lot of money running our own Prometheus monitoring for Grafana Cloud, as one of our clusters had grown to over 100 million active series,” Tom Wilkie, CTO for Grafana Labs, told The New Stack. “Adaptive Metrics was the solution we built for this problem. And we knew that in this current macroeconomic climate when budgets are tightening and people are gasping at $65 million observability bills, a feature that helps you cut some unnecessary costs in a flexible, intelligent way would be incredibly valuable to our users, just as it has been for us.”

As Wilkie explained, open source changes “the relationship between vendor and customer, because they can always go run it themselves.”

“We look at our relationships with our customers as long-term partnerships, so we want to do what’s right by them (proactively lowering their bills) even if this means less growth for us in the short term,” Wilkie said. “With features like Adaptive Metrics, we are making the case that it’s always more cost-effective to use Grafana Cloud, even compared to running the OSS yourself.”

At Work

In a blog post, Grafana Labs’ Archana Kesavan, director of product marketing, and Jen Villa, senior group product manager for databases, described how Grafana’s Adaptive Metrics capability analyzes “every metric coming into Grafana Cloud” and compares it to how users access and interact with the metric. In particular, they wrote that it looks at whether each metric is:

  1. used in an alerting or a recording rule.
  2. used to power a dashboard.
  3. queried ad hoc via Grafana Explore or Grafana’s API.

To answer the first two questions, it analyzes the alerting rules, recording rules and dashboards in a user’s hosted Grafana. To answer the third, it looks at the last 30 days of a user’s query logs. With these three signals, Adaptive Metrics determines whether a metric is unused, partially used or an integral part of your observability ecosystem (a simplified sketch of this logic follows the list below):

  • Unused metrics. There has been no reference made to the metric based on any of those three signals.
  • Partially used metrics. The metric is being accessed, but it has been segmented with labels to create many time series, and people are only using a small subset of them.
  • Used metrics. All the labels on that metric are being used to slice and dice the data.
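
The classification described above can be boiled down to a deliberately simplified sketch; this is only an illustration of the logic as the blog post describes it, not Grafana’s actual implementation, and the data structures are invented for the example:

```python
def classify_metric(referenced_series: set[str], all_series: set[str]) -> str:
    """Classify a metric from the subset of its series that alerting rules,
    recording rules, dashboards and ad hoc queries referenced over the window."""
    if not referenced_series:
        return "unused"          # none of the three signals touched this metric
    if referenced_series < all_series:
        return "partially used"  # only a subset of its label combinations is read
    return "used"                # every series behind the metric is being sliced

# Hypothetical metric split into four series by a label, with only two of them queried.
all_series = {"route=/a", "route=/b", "route=/c", "route=/d"}
print(classify_metric({"route=/a", "route=/b"}, all_series))  # -> "partially used"
```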

“Our initial tests in more than 150 customer environments show that on average, Adaptive Metrics users can reduce time series volume by 20%-50% by aggregating unused and partially used metrics into lower cardinality versions of themselves,” Kesavan and Villa wrote.

The post Why Did Grafana Labs Need to Add Adaptive Metrics? appeared first on The New Stack.

]]>
Observing and Experimenting: Enhanced Kubernetes Optimization https://thenewstack.io/observing-and-experimenting-enhanced-kubernetes-optimization/ Fri, 30 Jun 2023 17:00:44 +0000 https://thenewstack.io/?p=22712344

As organizations increasingly adopt Kubernetes for their infrastructure, understanding and optimizing its performance becomes vital. However, best practices and configuration

The post Observing and Experimenting: Enhanced Kubernetes Optimization appeared first on The New Stack.

]]>

As organizations increasingly adopt Kubernetes for their infrastructure, understanding and optimizing its performance becomes vital.

However, best practices and configuration guides for Kubernetes optimization only go so far. Researchers wrote more than 900 scholarly articles on factors influencing the optimization of Kubernetes in 2022 alone. Many variables can affect the speed and efficiency of applications running in a Kubernetes environment, but to gather the insights that lead to effective optimizations, you need to consistently observe and experiment.

In this article, we’ll look at how to develop a system of observation, experiments and feedback loops to continually improve the performance of applications running in Kubernetes environments.

Combine Observation and Experimentation

IT infrastructure is analogous to a manufacturing environment, comprising many components that move, process and store data so that the data can be useful to the organization. For the system to produce optimal throughput, components must be secured, configured, instrumented and linked to one another and to the data being processed. However, merely setting up the components and letting them run isn’t sufficient. Instead, we must observe the process by collecting metrics on each component in the system and on their interactions.

Learning from Observation

In the application development environment, things change daily. Network traffic, data quality and new software or security patches can all affect performance, affecting the health of your application deployment. You need proper metrics to notice these changes and optimize application performance.

There are many categories of metrics. Common examples include Kubernetes cluster and node usage metrics, container and application metrics, and deployment and pod metrics, such as the number of pods running or the number of pod resource requests. You might also consider environmental metrics like traffic, network state and new deployments, since adding new services or features to an application can change infrastructure efficiency and effectiveness.

For example, website traffic is often seasonally variable: Retailers know that traffic increases over the holiday season and decreases after. To avoid delays and lost customers in peak season, or overpaying for capacity in the off-season, they need accurate metrics to configure their application scaling.

A key consideration is that your application and the Kubernetes infrastructure it runs on are not static. Variables that influence its performance change over time. To find opportunities for improvement, you need baseline metrics detailing how the infrastructure runs under average conditions. By establishing and understanding these metrics, you can analyze the effects of traffic spikes, outages and other events.

Once you have established which metrics you wish to observe, you need tools and a repository to capture and store this information. A common choice is the open source tool Prometheus.
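
For instance, here is a minimal sketch of exposing application metrics for Prometheus to scrape using the Python client library; the metric names and port are arbitrary choices for the example:

```python
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being processed")

if __name__ == "__main__":
    # Prometheus scrapes http://<pod-ip>:8000/metrics on its configured interval.
    start_http_server(8000)
    while True:
        with IN_FLIGHT.track_inprogress():
            REQUESTS.inc()
            time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
```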

Learning from Experimentation

Once you’ve established metrics and tooling, you can start learning from the data. You can simply observe your metrics over the course of a week or month, which allows you to see how your application performs under various circumstances. You might be able to pick up on simple patterns and make changes based on these patterns, and then watch for another week or month to see how it affects your metrics.

This is a very slow and imprecise form of experimentation that uses human observation over long periods of time — when your application, the traffic to it, or the environment it runs in might also be changing and introducing noise into your experiment.

Let’s go back to the example of the holiday season and retail. Suppose high volumes of application use at peak season don’t affect throughput. This might mean you have the configuration and scaling just right, or it might mean you have over-provisioned. To find out which it is, you need controlled experiments. You can discover the most cost-effective configuration by systematically varying the traffic, application and service configurations and closely observing the results. By changing the different parameters in many combinations and tracking the results, you can eliminate the noise introduced by simply observing the application over time.

However, it quickly becomes difficult, if not impossible, to perform these experiments manually because of the number of variables, and even worse, the possible combinations of variables. The solution is to apply automation and machine learning (ML).

ML is a subset of artificial intelligence that enables a computer to ingest data, learn and develop an underlying model to predict and respond to data inputs. Data is divided randomly into two sets. The first set is used to train a model; then this model is applied to the second data set to predict outcomes. Next, the predicted outcomes are compared to the actual outcomes to assess the accuracy of the model. The results are adjusted, the ML program develops a new model and the whole process restarts.
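
A small sketch of that train-and-validate loop, assuming you have already exported historical observations (say, request rate as a feature and pod CPU usage as the target) from your metrics store; the feature and target choices here are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Placeholder observations: requests per second (feature) and pod CPU usage (target).
rng = np.random.default_rng(42)
traffic = rng.uniform(50, 500, size=(1000, 1))
cpu = 0.002 * traffic[:, 0] + rng.normal(0, 0.05, size=1000)

# Split the data randomly into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(traffic, cpu, test_size=0.3)

model = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)

# Compare predictions on the held-out set with the actual outcomes to assess accuracy.
error = mean_absolute_error(y_test, model.predict(X_test))
print(f"mean absolute error: {error:.3f} CPU cores")
```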

Observed behaviors or conditions can and should lead to experiments to gain a deeper understanding. Continuing with the traffic example, let’s say you want to include additional variables in a new model or simulate conditions not present in the original data set. By increasing traffic levels above what was observed in the data set, you can better learn how robust the model is under additional traffic.

Benefits of Experimentation

This kind of experimentation is all about learning how a dynamic system behaves so you can optimize its efficiency. To understand the capabilities and limitations of specific environments, or combinations of configurations, you need to experiment.

Experiments should aim to discover what you should do differently. Are you over- or under-provisioning resources? Are you spending more than you should? One experiment may not answer all the questions, but it can point the way to the next test and an eventual solution.

If your initial application configuration is based on best practice guidelines, you may want to validate those guidelines for your specific situation. You can vary application configuration, traffic patterns, horizontal pod autoscaling (HPA) configuration, etc. You can conduct a variety of A/B tests of different configurations.

Experimental findings don’t always yield improvement. However, finding a specific setting robust to changes in traffic is also valuable, as it means we can limit our experimentation in this area.

Conclusion

Kubernetes is highly flexible when it comes to automating container operations across environments, with many configuration options. When it comes to optimizing configuration, the best practice guidelines only go so far. To push the boundaries, we need to establish metrics, experiment with the environment and observe closely.

The number of variables and interactions often makes it impractical to configure Kubernetes environments manually. Automation and ML make the experimentation process viable. Observability processes and analytics allow us to conduct planned experiments and understand how environmental variables affect performance.

StormForge gives you all the observability, automation and ML you need to get the most from Kubernetes experimentation. Visit StormForge.io to learn more and start a free trial.

The post Observing and Experimenting: Enhanced Kubernetes Optimization appeared first on The New Stack.

]]>
How We Slashed Detection and Resolution Time in Half https://thenewstack.io/how-we-slashed-detection-and-resolution-time-in-half/ Wed, 28 Jun 2023 17:00:08 +0000 https://thenewstack.io/?p=22710718

My role as the Director of Platform Engineering at Salt Security lets me pursue my passion for cloud native tech

The post How We Slashed Detection and Resolution Time in Half appeared first on The New Stack.

]]>

My role as the Director of Platform Engineering at Salt Security lets me pursue my passion for cloud native tech and for solving difficult system-design challenges. One of the recent challenges we solved had to do with visibility into our services.

Or lack thereof.

Initially, we decided to adopt OpenTelemetry, but that didn’t give us everything we needed as we still had blind spots in our system.

Eventually, we found a solution that helped us zero in on service errors and slash the time it takes us to detect and resolve issues in half.

But let’s back up a bit.

70 Services and 50 Billion Monthly Spans Strong

At Salt Security, we have about 70 services, based on Scala, Go and NodeJS, which generate 50 billion monthly spans.

Since 70 is no small number and neither is 50 billion, we needed assistance gaining visibility into the requests between the services.

The Need to See

Why did we need to see into our services?

1. At the macro level, we needed to monitor and identify problems after making changes in our system. For example, we needed to detect filters, anomalies and any other signals of problematic flows.

2. At the micro level, we needed to be able to zero in on the causes of any problem we identified: for example, errors, slow operations or incomplete flows, whether they involved gRPC or Kafka operations or communication with databases.

To be clear, when we say “visibility” we mean a deep level of granularity at the payload level. Because just one single, slow query in the database might slow down the entire flow, impacting our operations and customers.

Gaining this visibility proved to be a tough nut to crack. Not only because of the sheer number of services and spans, but also due to the complexity of some flows.

For example, one single flow might involve as many as five services, three databases and thousands of internal requests.

Attempt #1: OpenTelemetry and Jaeger

Naturally, our first go-to was OpenTelemetry with our own Jaeger instance.

This amazing open source collection makes capturing distributed traces and metrics from applications and infrastructure easy. The SDKs, the Collector and the OpenTelemetry Protocol (OTLP) enable gathering traces and metrics from all sources and propagating trace context with the W3C TraceContext and Zipkin B3 formats.
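
As a rough sketch (not Salt Security’s actual code) of what the service-side wiring can look like, this is the Python OTel SDK configured with an OTLP exporter and a low-probability sampler; the collector endpoint is an assumption, and the 3% ratio simply mirrors the sampling rate described later in this piece:

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample roughly 3% of traces and ship them to a collector over OTLP/gRPC.
# W3C TraceContext propagation is the SDK default, so no extra setup is needed for it.
provider = TracerProvider(
    resource=Resource.create({"service.name": "example-service"}),
    sampler=TraceIdRatioBased(0.03),
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-request") as span:
    span.set_attribute("request.id", "abc-123")  # illustrative attribute
```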

Here’s a high-level diagram of what the resulting OTel setup looked like:

As you can see, we used the OTel collector to gather, process and move data from our services. Then, the data was propagated to another open source tool: Jaeger. Jaeger was used for viewing the data.

Jaeger is fantastic, but it fell short of meeting our needs. We weren’t able to cover the critical parts of our system, leading to blind spots when we encountered errors.

Hello, Helios

That’s when we discovered Helios. Helios visualizes distributed tracing for fast troubleshooting. We chose Helios over other solutions because it answers both our macro and micro-level needs, and is especially incredible at the micro level.

Helios treats backend services, like databases and queues, and protocols, such as gRPC, HTTP, Mongo queries and others, as first-class citizens. The data is formatted according to what it represents.

For example, a Mongo query will be shown firsthand when looking at a Mongo DB call, with JSON formatting. An HTTP call will be separated into header and body. A Kafka topic publishing or consuming a message will show the header and payload separately. This visualization makes it extremely easy to understand why the call or query is slow.

Helios also provides super-advanced support for cloud and third-party API calls. When it comes to Kafka, Helios shows the list of topics it picked up. For AWS, Helios shows the list of AWS services in use, and they are highlighted when our services use them.

In addition, Helios folks came up with an entire testing strategy based on traces! We can generate tests in a single click when looking at a specific span. There are also many other fantastic features, like advanced search, previews of flows in search results, error highlighting of traces that weren’t closed and so on.

Our Helios setup is made up of:

  • An OTel collector running on our Kubernetes cluster.
  • The Helios SDK, which is used by each service in any language, and wraps the OTel SDK.
  • Two pipelines:
    • Between the OTel collector and Helios.
    • Between the OTel collector and Jaeger, with a one-day retention. (We’re using a sampling of 3% when we send spans to Helios and a much higher sampling rate that is sent to Jaeger, but with much lower retention — for development purposes).
  • Probability sampling for spans sent to Helios is at approximately 3%.

The Proof Is in the Pudding

The transition to Helios as an additional layer on top of OpenTelemetry proved successful. We use Helios daily when making changes in our system or when we’re trying to identify the source of an issue.

In one case, we used Helios to identify an erroring Span that occurred when a NodeJS service using the AWS SDK was timing out on requests to S3. Thanks to Helios, we were able to identify the issue and quickly fix it.

In another case, one of our complicated flows was failing. The flow involves three services, three databases, Kafka and gRPC calls. However, the errors were not being propagated properly and logs were missing. With Helios, we could examine the trace and understand the problem end-to-end immediately.

One more thing we like about Helios is its UI, which presents the services involved in each flow.

Here’s what that complicated flow looks like in Helios:

Simple and easy to understand, right?

Closing Remarks

We’re all familiar with the challenges of microservices and how blind we are when an error occurs. But while we’re flooded with tools for understanding that there’s a problem, we were missing a tool that could help us understand the exact location of the problem.

With Helios, we can see the actual queries and payloads without having to dig through span metadata. Their visualization significantly simplifies root cause analysis.

I highly recommend Helios for troubleshooting errors.

The post How We Slashed Detection and Resolution Time in Half appeared first on The New Stack.

]]>
Demystifying Service-Level Objectives for You and Me https://thenewstack.io/demystifying-service-level-objectives-for-you-and-me/ Wed, 28 Jun 2023 15:12:10 +0000 https://thenewstack.io/?p=22712065

If you follow Adriana’s writings, you know that she’s talked about service-level objectives (SLOs) on more than one occasion. Like

The post Demystifying Service-Level Objectives for You and Me appeared first on The New Stack.

]]>

If you follow Adriana’s writings, you know that she’s talked about service-level objectives (SLOs) on more than one occasion. Like here. And here. And here.

You might know what they are at a high level. You might know that they’re important, especially if you’re a site reliability engineer (SRE). But what is the big deal about them? Why are they so important for SRE practitioners? What are some good SLO practices?

Well, my friend, you’ve come to the right place! Today, all of your burning questions about SLOs will be answered.

Why Do We Need SLOs?

When Adriana graduated from university in 2001, having internet access on your mobile phone was barely a thing (shoutout to flip phones — hey-o). The Google search engine had been launched only three years earlier, and we were just starting to go from 3.5-inch floppy disks to USB flash drives for portable personal storage. Monolithic apps were a thing. Java was the hot language. Cloud? What cloud?

As the technology industry continues to evolve, the way we build applications has become more complex. Our applications have many moving parts that are all choreographed to work beautifully in unison, and for this to happen, they must be reliable.

SLOs are all about reliability. A service is said to be reliable if it is doing what its users need it to do. For example, suppose that we have a shopping cart service. When a user adds an item to the shopping cart, they expect that the item is added on top of any other items already in the shopping cart. If, instead, the shopping cart service adds the new item and removes prior cart items, then the service is not considered to be reliable. Sure, the service is running. It may even be performing well, but it’s not doing what users expect it to do.

SLOs help to keep us honest and ensure that everyone is on the same page. By keeping reliability in mind and, by extension, prioritizing user impact and experience, SLOs help to ensure that our systems are working in ways that meet user expectations. And if our systems fall short of those expectations, breached SLOs set off alerts to notify engineers that it’s time to dig deep into what’s going on. More on that later.

What the Heck Is an SLO?

First, let’s get back to basics and define SLOs. But before we can talk about SLOs, we need to define a few terms: SLAs, error budgets and SLIs. I promise you’ll see why very shortly!

An SLA, or service-level agreement, is a contract between a service provider and a service user (customer). It specifies which products or services are to be delivered. It also sets customer support expectations and monetary compensation in case the SLA is breached.

An error budget is your wiggle room for failure. It answers the question, “What is your failure tolerance threshold for this service or group of services on the critical path?” It is calculated as 100% minus your SLO, which is itself expressed as a percentage over a period of time, as we’ll see later. When you exhaust your error budget, it means that you should focus your attention on improving system reliability rather than deploying new features. Error budgets also create room for innovation and experimentation to happen. We will talk about that later when we discuss chaos engineering.
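
As a quick worked example (the 99.5% target below is just an assumed figure for illustration), the error budget of a 99.5% availability SLO over a 28-day window works out to roughly 3.4 hours of allowed unavailability:

```python
SLO_TARGET = 0.995   # assumed example target: 99.5% availability
WINDOW_DAYS = 28     # rolling 28-day window, as discussed later in this piece

error_budget = 1.0 - SLO_TARGET                      # 0.5% of the window
budget_minutes = error_budget * WINDOW_DAYS * 24 * 60

print(f"Error budget: {error_budget:.1%} of the window")
print(f"Allowed unavailability: {budget_minutes:.0f} minutes (~{budget_minutes / 60:.1f} hours)")
```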

A service-level indicator (SLI) is nothing more than a metric. That is, it’s a thing that you measure. More specifically, it’s a two-dimensional metric that answers the question, “What should I be measuring and observing?”

Examples of SLIs include:

  • Number of good events / total number of events (success rate).
  • Number of requests completed successfully in 100 milliseconds / total number of requests.

And this is important because we need SLIs for SLOs.

SLIs are the building blocks of SLOs. SLOs help us answer the question, “What is the reliability goal of this service?” SLOs are made up of a target and a window. The target is usually a percentage, based on your SLI. The window is the time period that the target applies to.

With that information in hand, let’s go back to our SLIs from before and see if we can use them to create SLOs.

Our first SLI was a number of good events/total number of events (success rate).

This means that our SLO could be something like: 95% success rate over a rolling 28-day period.

You might be wondering where a 28-day period comes from. More on that later.

Our second SLI was the number of requests completed successfully in 100ms / total number of requests.

This means that our SLO could be something like: 98% of our requests completed successfully in 100ms out of our total requests over a rolling 28-day period.
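
Putting the pieces together, here is a minimal sketch of evaluating such an SLO from raw counts over the rolling window; the counts are made up for illustration:

```python
# Made-up counts over the rolling 28-day window.
fast_successful_requests = 9_870_000   # completed successfully in 100ms
total_requests = 10_000_000

SLO_TARGET = 0.98                      # 98% over a rolling 28-day period

sli = fast_successful_requests / total_requests
error_budget = 1.0 - SLO_TARGET
budget_consumed = (1.0 - sli) / error_budget

print(f"SLI: {sli:.2%} (target {SLO_TARGET:.0%})")
print(f"SLO met: {sli >= SLO_TARGET}, error budget consumed: {budget_consumed:.0%}")
```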

Cool. That wasn’t too bad, right? But why do we need SLOs again?

SLO Practices

Great. We get why SLOs are important, but what are some guidelines on defining them?

1. Don’t set your SLOs in stone.

When defining SLOs, the most important thing to keep in mind is that if you don’t get SLOs right the first time, you keep iterating on them until you get them right. And then you have to iterate on them some more, because every time you make changes to your systems, you need to re-evaluate your SLOs to ensure that they still make sense.

Remember: They are living, breathing things that require iteration.

SLOs should always be revisited after an outage. This allows you to see if the SLO caught the incident (for instance, was the SLO breached?) or if an SLO is missing for your application’s critical path. If the SLO was not breached, then it should be adjusted so that it does catch the incident next time. As you adjust your SLO, you’ll want to think about how the SLO change affects your team. For example, would your SRE team have gotten paged more under the previous SLO, compared to the revised version, and would those extra pages have been false alarms?

2. Speak the same (time) language.

Remember our 28-day SLO period from earlier? What’s the deal with that? Why not use 14 days or 30 days? By using 28 days, it standardizes the SLO over a four-week period that you can compare month to month. This allows you to see if you’re drifting into failure as you release your features into production. In addition, this sets a good example as a proper SLO practice and standard to follow within your SRE organization.

3. Make SLOs actionable.

SLOs are no good to us if we don’t do something with them. SLOs need to be actionable. Suppose that you have the following SLO: 95% of service requests will respond in less than 4 seconds.

All of a sudden, you notice that only 90% of service requests are responding in less than 4 seconds. This means that the SLO has been breached and that you should do something about it. Normally, the SLO breach triggers an alert to an on-call team so that it can look into why the SLO was breached.

4. Ditch the wall o’ dashboards in favor of SLOs.

Traditional monitoring is driven by two things: querying and dashboarding. When something breaks, engineers run queries against these dashboards to figure out why The Thing is breaking. Yuck.

What if, instead of using metrics from dashboards, you used SLO-based alerting? This means that if your SLO is breached, your system fires off an alert to your on-call engineers. By using this approach, it’s like your SLOs are telling you, “Hey, you! There’s something wrong, customers are impacted, and here’s where you should start looking.” Oh, and if you tie your SLOs to your observability data — telemetry signals such as traces, metrics and logs — and you should, the data serves as breadcrumbs for your engineers to follow to answer the question, “Why is this happening?”

Now you’ve gone from digging for a needle in a haystack to doing a more direct search. You don’t need a whole lot of domain knowledge to figure out what’s up. Which means that you don’t always have to rely on your super senior folks who hold all the domain knowledge to troubleshoot. Yes, you can lean more on the junior folk.

For more on this, check out Austin Parker’s talk at last year’s SLOConf.

5. Make your SLOs customer-facing.

SLOs should be customer-facing (close to customer impact). This means that rather than tie your SLOs directly to an API, you should tie them to the API’s client instead. Why? Because if the API is down, you have nothing to measure and therefore you have no SLOs.

Another way to look at this is to look back at our definition of reliability: “A service is said to be reliable if it is doing what its users need it to do.”

Or, as Honeycomb CTO Charity Majors has said, “Nines don’t matter if users aren’t happy.”

Since users are at the heart of reliability, defining your SLOs as close to your users as possible is the logical choice.

6. SLOs should be independent of root cause.

When writing SLOs, you should never care how or why your SLO was breached. The SLO is just a warning. A canary in the coal mine, if you will. An SLO tells you that your reliability threshold isn’t met, but not why or how. The how is explained by your observability data emitted by your application and infrastructure.

7. Treat SLOs as a team sport.

It might surprise you to hear that building SLOs and learning about them isn’t the hard part. Figuring out what you want to measure is the hard part, and that’s where collaboration comes in. Having SRE teams work with developers and other stakeholders across the company (and yes, that means the business stakeholders, too) to understand what’s important to the folks using their systems helps drive the creation of meaningful SLOs. But how do you do that?

This is where observability data (telemetry) comes into play. SREs can use observability data to understand how users interact with their systems, and in doing so they can define those meaningful SLOs.

Also, I hope that you’ve noticed a recurring theme here, of SLOs and observability going hand in hand. Just sayin’ … 😎

8. Codify your SLOs.

In keeping with the SRE principle of “codifying all the things,” there is a movement to codify SLOs, thanks to projects like OpenSLO. Some observability vendors even allow for the codification of SLOs through Terraform providers. My hope is that we see greater adoption of OpenSLO as a standard for defining and codifying SLOs, and for SRE teams to integrate that into their workflows.

9. Be proactive about SLO creation.

We can and should be proactive about failure and creating SLOs from chaos engineering and game days.

Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. We want to inject failure into our systems to see how it would react if this failure were to happen on its own. This allows us to learn from failure, document it and prepare for failures like it. We can start practicing these types of experiments with game days.

A game day is a time where your team or organization comes together to do chaos engineering. This can look different for each organization, depending on the organization’s maturity and architecture. These can be in the form of tabletop exercises, open source/internal tooling such as Chaos Monkey or LitmusChaos, or vendors like Harness Chaos Engineering or Gremlin. No matter how you go about starting this practice, you can get comfortable with failure and continue to build a culture of embracing failure at your organization. This also allows you to continue checking on those SLOs we just set up.

Remember that error budget we talked about earlier? It can be seen as your room for experimentation and innovation, which is perfect for running some chaos engineering experiments. Having this budget set aside helps reduce the risk of failure for your organization by ensuring that these experiments are accounted for. Remember that budgets are meant to be spent! Check out Jason Yee’s SLOConf talk for more on this topic.

How to SLO?

So, how do you create your SLOs? Well, you’ve got options:

Final Thoughts

I hope that this has helped demystify SLOs.

Keep in mind that this is barely scratching the surface. SLOs are an extensive topic, and there’s tons to learn. I suggest you check out the following to further your SLO journey:

Before we wrap up, here are some final nuggets on SLOs from the SLO guru himself, Alex Hidalgo, on the On-Call Me Maybe Podcast:

    1. SLOs are not just a checkbox item in your work/SRE process. Make sure to think and strategize around them.
    2. SLOs are not worth anything if they’re not telling you something.
    3. You don’t just set up SLOs and walk away. You iterate on them constantly because things change, and your SLOs need to reflect those changes.
    4. Defining SLOs is an iterative process. You might not get them right the first time, so you need to keep tweaking them.

The post Demystifying Service-Level Objectives for You and Me appeared first on The New Stack.

]]>
Acryl Data Unveils Data Observability Capabilities, Adds Funding https://thenewstack.io/acryl-data-unveils-data-observability-capabilities-adds-funding/ Fri, 23 Jun 2023 19:00:00 +0000 https://thenewstack.io/?p=22711775

Yesterday, Acryl Data announced the launch of Acryl Observe, a data observability module for its flagship Acryl Cloud offering. The

The post Acryl Data Unveils Data Observability Capabilities, Adds Funding appeared first on The New Stack.

]]>

Yesterday, Acryl Data announced the launch of Acryl Observe, a data observability module for its flagship Acryl Cloud offering. The company also received $21 million in Series A funding. Acryl Cloud is a data management platform positioned atop the DataHub Project, an open source metadata platform, data catalog, and control plane for data.

Acryl Observe is currently in private beta for use by what Acryl Data CEO Swaroop Jagadish termed “early design partners.” The funding was provided by 8VC partner Bhaskar Ghosh, Sherpalo Ventures founder Ram Shriram, and Vercel CEO Guillermo Rauch.

The addition of the data observability layer to Acryl Data’s stack enables organizations to uniformly access data governance, observability, and data management capabilities in a single solution. Traditionally, governance and data observability “have been needlessly seen as separate problems,” Shriram pointed out. “Business users, at the end of the day, look for a unified reliability indicator and, by bringing governance and data observability together, the technical and the business users come together much more.”

Moreover, by providing these capabilities for contemporary decentralized architectures such as data mesh, organizations can monitor, validate, and improve their data products at the pace of contemporary business.

The Data Observability Engine

Acryl Data enables users to monitor their data health and detect incidents with low latency. According to Acryl Data CTO Shirshanka Das, “This is different from classical approaches because this is more real-time, event-oriented streaming metadata. We’re getting every single Spark job that is running. You’re getting continuous data profiling insights.” This capacity becomes even more useful with the automation characterizing the data observability layer.

The solution’s robust data discovery mechanisms employ machine learning to scrutinize historical data patterns and establish a baseline for what healthy data looks like. The results form the basis for automatically generated suggestions for data contracts for specific datasets — which users can modify or supplement with business logic. “With shift-left approaches, that is the central and hardest problem to solve because otherwise, data producers are going to always be lagging,” Das reflected. “So, we help them by suggesting what responsibilities they should be signing up.”
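
As a heavily simplified illustration of the general idea (this is not Acryl’s engine, and the dataset, thresholds and check are invented), a baseline learned from historical profiles can be turned into the kind of expectation a suggested data contract might encode:

```python
import statistics

# Invented historical profile: daily row counts for one dataset over recent days.
historical_row_counts = [10_250, 9_980, 10_410, 10_120, 9_890, 10_300, 10_050]

mean = statistics.mean(historical_row_counts)
stdev = statistics.stdev(historical_row_counts)

def within_contract(todays_row_count: int, sigmas: float = 3.0) -> bool:
    """Naive contract: today's row count should stay within a few standard
    deviations of the historical baseline. Real systems learn far richer rules."""
    return abs(todays_row_count - mean) <= sigmas * stdev

print(within_contract(10_200))  # True: consistent with the baseline
print(within_contract(4_000))   # False: flag a potential data incident
```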

The Data Control Plane

Coupling these data observability capabilities with the metadata management features of DataHub proves mutually beneficial. The tandem provides a simplified architecture for addressing the increasingly distributed data paradigms characterizing modern organizations investing in data fabric and data mesh approaches. It also delivers a rich metadata foundation replete with business definitions, semantic clarity, and rules with which to contextualize and monitor the data via the data observability capabilities. “The metadata control plane gives us that fire hose of events that we’re able to continuously monitor and detect, whether it’s a data quality incident or a divergence in a certain distribution,” Das remarked.

DataHub’s data catalog and metadata platform is the substrate for merging business understanding of data with technical characteristics to inform the data contracts upheld by the data observability layer. Acryl Data’s stack relies on several components that underlie the data control plane, including a key-value store, a Kafka integration for data streams, and an Elastic integration to make contents almost instantly searchable. The metadata itself is connected via a knowledge graph for a heightened understanding of connections, implications, and use of data elements.

Data Products

For organizations with data assets decentralized throughout multiple clouds, repositories, and tools, or simply those that have assigned respective business domains ownership of data, this approach is timely. Either way, users can analyze their metadata to optimize the process of creating reusable data products, then conveniently monitor them for data reliability and optimal data health with the data observability module.

“You can look at the existing technical lineage graph and then say, ‘Oh, these things belong as a single data product,’” Das commented. “Or, you can come in with a clean opinion of what a data product is and you can define it using your favorite declarative language. We have GitHub Actions and things like that to translate and provision that into Acryl Cloud.”

Going Forward

Acryl Data plans to channel its recent funding into better understanding and servicing its community of users. “We’re investing in our core community,” Jagadish acknowledged. “The 7,500 plus people that we have in our community give us advantages. We learn at scale and improve our product rapidly.”

The company also seeks to democratize, if not evangelize, the utilitarian data control plane it’s championing. “In terms of the control plane vision, we are executing on the vision there,” Jagadish said. “There are some concrete and practical use cases we are tackling with data contracts and data mesh.”

The post Acryl Data Unveils Data Observability Capabilities, Adds Funding appeared first on The New Stack.

]]>
5 Ways OpenTelemetry Can Reduce Costs https://thenewstack.io/5-ways-opentelemetry-can-reduce-costs/ Fri, 23 Jun 2023 17:00:22 +0000 https://thenewstack.io/?p=22710713

During times of economic uncertainty, companies can often find themselves having to do more with less. Whether they’re facing reduced

The post 5 Ways OpenTelemetry Can Reduce Costs appeared first on The New Stack.

]]>

During times of economic uncertainty, companies can often find themselves having to do more with less. Whether they’re facing reduced spending, heightened customer expectations, or fiercer competition, organizations need to embrace nimble and cost-effective approaches to remain resilient. One such way to harness operational resiliency and reduce costs is through OpenTelemetry.

So, what is OpenTelemetry, and how can it save your company money?

OpenTelemetry provides a standardized, vendor-neutral way for organizations to collect telemetry data from cloud native applications and infrastructure so that it can be sent to any destination for processing and analytics. It includes the agents, SDKs, protocol, and semantic conventions required to capture distributed traces, metrics, logs, and other critical signals.
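
To make that concrete, here is a minimal sketch of capturing one of those signals (a metric) with the OpenTelemetry Python SDK; the meter name, metric name and attribute are arbitrary, and a console exporter stands in for whichever backend you ultimately send data to:

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Export collected metrics periodically; swap the console exporter for an OTLP
# exporter to send the same data to any OpenTelemetry-compatible backend.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(), export_interval_millis=5000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout-service")
request_counter = meter.create_counter(
    "requests", unit="1", description="Number of requests handled"
)

# Instrument once; the same counter works regardless of which vendor analyzes it.
request_counter.add(1, {"route": "/checkout"})
```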

By offering a unified approach to telemetry data collection, OTel makes it easier for organizations to integrate and share data across different systems. This, in turn, helps organizations improve their application performance, reduce operational costs, speed up development, improve reliability and simplify the management of telemetry data.

And many are seeing the value in OpenTelemetry. It is the second most active project within the Cloud Native Computing Foundation, after Kubernetes. The project’s origin story began in 2019, when two open source projects (OpenCensus and OpenTracing) merged to create OpenTelemetry and establish a single standard.

The goal was to create a unified, vendor-neutral approach to observability, so developers can instrument and collect telemetry data across different programming languages and platforms. Since its creation, OpenTelemetry has rapidly evolved, with regular releases to improve its capabilities and expand its support across different platforms and environments. Today, it’s a popular tool in the observability space, with a robust community of 900 monthly-active contributors.

But while OpenTelemetry’s primary value is to help organizations reduce the complexity of managing telemetry data in their cloud native applications, it can also decrease the cost of monitoring, optimize cloud computing budgets, and improve resource utilization, reliability, and application performance.

Let’s dive into how OpenTelemetry can do all of this below, by outlining five key ways it can help save your business money:

  1. Organizations can build things faster — much faster. With OpenTelemetry, engineers don’t have to re-instrument code or install different proprietary agents every time an analytics platform is changed. This enables them to be more efficient and strategic with their time, as they’re no longer having to manually instrument their tech stack. Second, observability tools can significantly enhance an organization’s development velocity. They enable developers to understand the interactions of existing services and how to extend them, by relying on the tools to show them the current functionality. In addition, virtually no time is wasted on training staff on proprietary agents and all development on OpenTelemetry is supported through the OpenTelemetry community. This ability to solicit help from the community eliminates time wasted on re-inventing the wheel.
  2. It helps prevent loss of revenue. OpenTelemetry bridges the visibility gaps by providing a common format of instrumentation across all services (such as infrastructure, applications and observability solutions), which can help with root cause analysis and troubleshooting. This unified view allows organizations to more quickly identify and resolve issues that impact revenue. When engineering teams can see with more precision and clarity, they can reduce the number of customer-impacting incidents and minimize downtime. And this is significant, as downtime can potentially cost organizations an average of $87 million per year due to lost revenue and productivity. So a faster approach to resolution can become a business-critical capability.
  3. Orgs can optimize cloud computing and observability costs. When companies use monitoring tools that are not OpenTelemetry native, they often pay for yet another tool to reduce the volume/dimensionality of data they’re analyzing. In contrast, OpenTelemetry’s modular and extensible architecture enables businesses to customize and fine-tune their telemetry data collection and processing to meet their specific needs and preferences, without the need for expensive and proprietary tools. Additionally, OpenTelemetry can provide insights into the usage patterns of resources, such as CPU or memory, allowing organizations to optimize their resource allocation. Support for continuous profiling is being added to OpenTelemetry, which will provide further cost reductions with the newest generation of profiling tools, pinpointing the CPU and memory consumed by individual functions in an application’s code.
  4. It empowers companies to avoid vendor lock-in and select the tools they need. Without OpenTelemetry, organizations are stuck using whatever observability tool they picked up first (even if that tool lacks critical functionality) because switching the proprietary instrumentation and agents required by monitoring solutions is unimaginably expensive. Being dependent on one vendor makes an organization incredibly vulnerable to both a vendor’s limitations as well as their decisions and whims, such as price hikes or discontinuation of products. With an OpenTelemetry native observability provider, companies adopt solutions that are right for their business and budget. Open standards mean the solution can be customized and innovated upon.
  5. Own your observability software supply chain with OpenTelemetry. Users of proprietary agents and other components can’t audit their contents, and their vendors are effectively within their security boundary. Additionally, requests for enhancements are at the mercy of one’s relationship with their vendor. In contrast, OpenTelemetry is open source and helps organizations audit the code, run their own builds, and propose or add any functionality that they may like.

OpenTelemetry lets you truly own your data and, in turn, make your organization more agile, more resilient to failure and less beholden to vendors and potential security threats. It offers a standardized, cost-effective and flexible approach to data collection that can help businesses reduce operational costs and improve application performance. Embracing OpenTelemetry today helps to future-proof your business for tomorrow, and beyond.

The post 5 Ways OpenTelemetry Can Reduce Costs appeared first on The New Stack.

]]>
The Rise of Developer Native Dynamic Observability  https://thenewstack.io/the-rise-of-developer-native-dynamic-observability/ Fri, 23 Jun 2023 16:14:15 +0000 https://thenewstack.io/?p=22711600

Observability has emerged as a crucial concept in modern software development, enabling teams to gain deep insights into the performance,

The post The Rise of Developer Native Dynamic Observability  appeared first on The New Stack.

]]>

Observability has emerged as a crucial concept in modern software development, enabling teams to gain deep insights into the performance, health and behavior of complex systems. Traditional monitoring approaches, based on predefined metrics and static thresholds, have limitations in addressing the dynamic and distributed nature of modern applications. As a result, a new paradigm called dynamic observability has gained traction, offering real-time visibility into dynamic systems by combining metrics, logs and traces.

With the ability to adapt to changing environments and capture contextual information, dynamic observability enables development teams to proactively identify and resolve issues, reducing mean time to resolution (MTTR) and enhancing overall system reliability, as well as developer productivity.

Nowadays, dynamic observability drives the evolution of development practices by integrating runtime debugging into the development life cycle, encouraging teams to embrace a proactive approach to system monitoring and troubleshooting. With the insights provided by dynamic observability, development teams can optimize system performance, detect anomalies and make data-driven decisions throughout the development process.

The rise of dynamic observability represents a transformative shift in the software development landscape. By embracing dynamic observability, organizations can foster a culture of continuous improvement, enhance system reliability and deliver high-quality software products in an increasingly complex and dynamic digital ecosystem.

Observability: The Good, the Bad and the Overwhelming

Observability is the ability to understand the behavior of a system by looking at its outputs. This includes logs, traces and metrics. The main goal is to be able to diagnose and debug issues by having a clear overview of the system’s health and state. In simpler terms, it is the ability to look inside a system, understand how it works and actively debug it.

Observability = Metrics + Logs + Traces

While in the past a mix of application performance monitoring (APM) tools, logging tools and built-in integrated development environment (IDE) debugging tools could suffice to identify issues in applications, today’s reality is more complicated. The transition to microservices, cloud native and serverless apps, progressive delivery workflows and other advanced architectures places a new level of challenge on developers and operations teams.

In Lightrun’s experience, the above are a representative subset of the challenges developers commonly face when troubleshooting in production environments.

For a variety of reasons, it is challenging to access a remote instance of an application or service deployed in a Kubernetes cluster, or to debug a serverless Lambda function, from a developer machine. To start with, access from the developer machine to a production deployment is either forbidden or so complex that it cripples developers and slows down the debugging process. Even once that obstacle has been overcome, obtaining runtime telemetry from a production environment has traditionally meant stopping the app, changing its state and redeploying it to gain those additional logs, which is an extremely lengthy and error-prone process.

These are only a few examples that led to the rise of dynamic observability and its shift toward being part of the software development life-cycle tool stack. Giving developers more tools and power to troubleshoot their own applications regardless of where they run and how complex they are is the new normal from a requirements perspective.

The Case for Dynamic Observability: Key Business Values

As we defined earlier, developers fall short when they need to troubleshoot remote and complex workloads from their IDEs using traditional debugging tools and APM solutions. Dynamic observability comes to address and solve these challenges.

Basically, as opposed to static logging, with dynamic observability developers enjoy end-to-end observability across app deployments and environments directly from their IDEs. This translates into reduced MTTR, enhanced developer productivity and overall cost optimization since developers debug and consume logs and telemetry data where and when they need it rather than monitoring everything.

Dynamic observability has emerged as a pivotal approach in modern software development, enabling teams to gain deep insights into system behavior and make informed decisions. It goes beyond traditional testing and monitoring methodologies, offering a comprehensive understanding of system patterns, strengths and weaknesses.

The rationale and benefits of adopting dynamic observability can be summarized into the following:

Enhancing System Understanding

Dynamic observability empowers developers to observe system states and behaviors, providing them with detailed insights. By monitoring real-time metrics, logs and traces, developers can gain a holistic view of their systems, identify potential issues and make data-driven decisions.

Complementing Test-Driven Development

While test-driven development (TDD) focuses on writing tests to ensure code correctness, dynamic observability goes further to observe and analyze the behavior of the system as a whole. It recognizes that testing alone may not capture all possible scenarios and encourages developers to embrace observability for comprehensive system understanding.

Strengthening Behavior-Driven Development

While behavior-driven development (BDD) focuses on collaboration between stakeholders to define behavior expectations, dynamic observability takes it a step further by actively observing and detecting system behavior in runtime across environments (production, staging, CI/CD, QA), uncovering patterns and potential risks that might otherwise remain hidden.

Complementing the Role of Monitoring

Dynamic observability and developer native observability do not replace monitoring; rather, they complement it. Monitoring ensures system stability and performance, while dynamic observability provides a deeper level of understanding. By combining monitoring and dynamic observability, development teams can create a reliable and streamlined system, reducing vulnerabilities and risks.

Bottom Line: Dynamic Observability Is the New Normal

Dynamic observability represents a paradigm shift in software development, enabling developers to gain a detailed understanding of system behavior and make informed decisions. Using tools and practices that go beyond traditional testing, it empowers teams to create robust and reliable systems. It works in harmony with existing methodologies such as DevOps, TDD and BDD, augmenting their principles with a focus on system behavior. Through the adoption of dynamic observability, organizations can enhance their development processes and build more resilient and efficient software systems.

What to Look for in a Dynamic Observability Solution

  • Live debugging: Developers must be able to add logging statements and metrics to their code in real time, without having to redeploy or restart their applications.
  • On-demand snapshots: Dynamic snapshots give developers virtual breakpoints without stopping the execution of their applications. The ability to add conditions, evaluate expressions and inspect any code is a key feature in this category.
  • Custom metrics: Right from the IDE, developers should be able to create custom metrics to track specific aspects of the system’s behavior, such as response time, method duration or error rates — on the variable level. All tasks should be completed without the need to redeploy any code.
  • Integration with existing developer and enterprise tools, including cloud native tools such as Prometheus and Grafana, IDEs such as VSCode, public cloud providers and more.
  • Cost optimization: Being able to analyze observability-related inefficiency and optimize to reduce costs, such as by replacing static logs with dynamic logs, is a significant capability of such tools.

Dynamic observability empowers developers to gain a deep understanding of their live applications, significantly enhances developers’ productivity, reduces operational and infrastructure costs, and simplifies the process of debugging and fixing problems.

As a result, developers can spend more time doing creative coding by handing off complex troubleshooting to dynamic observability solutions.

The post The Rise of Developer Native Dynamic Observability  appeared first on The New Stack.

]]>
OpenTelemetry and Elastic Common Schema Comes Not Too Soon https://thenewstack.io/opentelemetry-and-elastic-common-standard-comes-not-too-soon/ Fri, 23 Jun 2023 15:22:34 +0000 https://thenewstack.io/?p=22709770

Standardization is always a good thing, but when it comes to very popular and utilized tools and technologies, a common

The post OpenTelemetry and Elastic Common Schema Comes Not Too Soon appeared first on The New Stack.

]]>

Standardization is always a good thing, but when it comes to very popular and widely used tools and technologies, a common standard at the very least makes life a lot easier for developers. At the other extreme, a technology’s survival depends on it. This is the case with WebAssembly, which direly needs a common standard for components so Wasm can be used to efficiently deploy code anywhere, across any type of device running a CPU. Today we are talking about the marriage between the observability tools Elastic Common Schema (ECS) and OpenTelemetry Semantic Conventions. Specifically, the creators of open source Elastic are contributing ECS to OTel and are committed to the joint development of the two projects.

Second Highest

OpenTelemetry is the second-highest CNCF “velocity project,” thanks to the strong growth of its user base in the CNCF ecosystem, and has become a widely adopted way to add instrumentation to an application to gather metrics, logs and traces from your favorite observability source. Telemetry data (metrics, logs and traces) from different sources can then be combined for monitoring with your favorite panel, such as Grafana. The wildly popular ECS, according to its documentation, is used to define a common set of fields for storing event data in Elasticsearch, such as logs and metrics, to specify field names and Elasticsearch datatypes for each field, and to provide descriptions and example usage. ECS will become that much better under the OTel umbrella. In fact, machine learning is being integrated with Elastic, which is already offering some very interesting results. With the collaboration between OpenTelemetry and Elastic, this is a case where standardization could not have come sooner.

“This collaboration between ECS and OpenTelemetry is a marriage made in heaven,” Torsten Volk, an analyst at Enterprise Management Associates (EMA), said. “ECS addresses the most critical bottleneck of true visibility and observability: the creation and maintenance of a common data model for all telemetry data.”

Developers Choice

Now OpenTelemetry can reliably collect data fields from Python, C#, JavaScript or the language of the developer’s choice from various APM tools and benchmarking platforms to fill in the context gap that often slows down or entirely prevents app-centric monitoring, Volk said.

“For instance, an e-commerce platform experiences a sudden surge in server load during a flash sale. With different services coded in different languages and monitored by different APM tools, the root-cause analysis would be tricky,” Volk said. “But with a common data model in place, all of the different languages and APMs dump their telemetry data into consistent JSON files that today’s magical AI-driven observability platforms can easily analyze. Thinking this further, the ECS standard can help OpenTelemetry analyze asset information from today’s heterogeneous universe of smart devices and report back to developers which devices work best and, more importantly, which ones work the poorest with their latest code creations.“

The contribution of ECS to OTel will have a positive impact for OpenTelemetry users both generally and particularly for users of OpenTelemetry’s in-development logging capabilities, said Morgan McLean, director of product management at Splunk. This is because, beyond OpenTelemetry’s Collector agent and language instrumentation, one of the project’s biggest draws is its unified semantic conventions, which ensure that consistent metadata and resource information is attached to every signal, McLean said.

For example, spans of HTTP requests captured from services written in different languages will share the same keys and value encodings for their duration, URL, service name, host, etc., which “allows them to be analyzed very effectively,” McLean said.

“While this is already the case with spans and metrics in OpenTelemetry, we’re in the midst of adding support for logs, which introduce significantly more scenarios that require dedicated semantic conventions,” McLean said. “By merging ECS and its thousands of existing conventions into OpenTelemetry, everyone who uses OpenTelemetry will receive well-structured and consistent metadata on their logs, traces, metrics and more from the source in a huge variety of scenarios. These signals can then be processed, filtered through, compared, and analyzed efficiently without massive amounts of special casing logic or ignoring signals that lack expected metadata.”
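
As a rough sketch of what those shared conventions look like from application code (Python here; the service and URL are invented, and the attribute keys follow OpenTelemetry’s HTTP semantic conventions):

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Because every language SDK uses the same semantic-convention keys, a backend
# can group and compare these spans with spans emitted by Java, Go or C# services.
with tracer.start_as_current_span("GET /api/cart") as span:
    span.set_attribute("http.request.method", "GET")
    span.set_attribute("url.full", "https://shop.example.com/api/cart")
    span.set_attribute("server.address", "shop.example.com")
    span.set_attribute("http.response.status_code", 200)
```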

ECS and OTel

The integration of ECS with OTel underscores OTel’s reach and its creators’ goal of letting users merge telemetry data into a single panel for more comprehensive observability analysis. Describing this aspect of OTel in general, without commenting specifically on the ECS contribution to OTel, Cedric Ziel, Grafana Labs senior product manager, noted how OpenTelemetry is a community-driven and CNCF-governed initiative “that aims at commoditizing data collection concerns for observability.”

“The ideals behind OpenTelemetry are about vendor-neutral instrumentation of application code and the project was created to remove the need for people to rip and replace their instrumentation whenever they want to lean in on a different observability provider or even support multiple vendors at the same time. This is solving the problem of observability in our time: it is inevitable that you need multiple vendors to support your needs in different dimensions — settling on the same protocol and the same conventions for this is the holy grail in observability,” Ziel said. “There is not just one single thing that’s moving in OTel that makes it more attractive: It’s the overall thing. Seeing all signal types being more sustainably available across instrumentation libraries is a continuous effort and exciting to see.”

Indeed, the integration of ECS with OTel helps the OTel project move toward the ultimate goal of total compatibility and standardization with any observability tool or process.

“Since day one, OpenTelemetry has been focused on providing consistent, clear, and accurate telemetry data about cloud native systems to empower developers and operators with observability. The alignment of the ECS and the OpenTelemetry Semantic Conventions is another step along that journey, ensuring that high-quality and consistent metadata is available to end users,” Austin Parker, head of developer relations, Lightstep, said. “In addition, this step ensures that OpenTelemetry data will be the gold standard for the next generation of observability tooling powered by AI, empowering end-users to get better answers about their system and its state.”

The post OpenTelemetry and Elastic Common Schema Comes Not Too Soon appeared first on The New Stack.

]]>
Run OpenTelemetry on Docker https://thenewstack.io/run-opentelemetry-on-docker/ Tue, 20 Jun 2023 15:30:34 +0000 https://thenewstack.io/?p=22697186

The OpenTelemetry project offers vendor-neutral integration points that help organizations obtain the raw materials — the “telemetry” — that fuel

The post Run OpenTelemetry on Docker appeared first on The New Stack.

]]>

The OpenTelemetry project offers vendor-neutral integration points that help organizations obtain the raw materials — the “telemetry” — that fuel modern observability tools, and with minimal effort at integration time.

But what does OpenTelemetry mean for those who use their favorite observability tools but don’t exactly understand how it can help them? How might OpenTelemetry be relevant to the folks who are new to Kubernetes (the majority of KubeCon attendees during the past years) and those who are just getting started with observability?

The OpenTelemetry project has created demo services to help cloud native community members better understand cloud native development practices and test out OpenTelemetry, as well as Kubernetes, observability software, container environments like Docker, etc.

At this juncture in DevOps history, there has been considerable hype around observability for developers and operations teams. More recently, much attention has been given to combining the different observability solutions in use through a single interface, and to that end OpenTelemetry has emerged as a key standard.

Learning Curve

Observability and OpenTelemetry, while conceptually straightforward, do involve a learning curve. To that end, the OpenTelemetry project has released a demo to help. It is intended to help users both better understand cloud native development practices and test out OpenTelemetry, as well as Kubernetes, observability software and more, the project’s creators say.

The OpenTelemetry Demo v1.0 general release is available on GitHub and on the OpenTelemetry site. The demo helps with learning how to add instrumentation to an application to gather metrics, logs and traces for observability. There is extensive instruction for open source projects like Prometheus for Kubernetes monitoring and Jaeger for distributed tracing, and the demo shows how to acquaint yourself with tools such as Grafana to create dashboards. It also extends to scenarios in which failures are created and OpenTelemetry data is used for troubleshooting and remediation. The demo was designed for beginner- or intermediate-level users, and can be set up to run on Docker or Kubernetes in about five minutes.

The project team’s stated goals for the OpenTelemetry demo are:

  • Provide a realistic example of a distributed system that can be used to demonstrate OpenTelemetry instrumentation and observability.
  • Build a base for vendors, tooling authors, and others to extend and demonstrate their OpenTelemetry integrations.
  • Create a living example for OpenTelemetry contributors to use for testing new versions of the API, SDK, and other components or enhancements.

OpenTelemetry and Docker

In this tutorial, we look at how to run the OpenTelemetry demo in a Docker environment. Let’s get started.

The prerequisites are:

Note that if you are running Docker on Windows, you need to make sure you have Admin privileges activated to deploy the OpenTelemetry demo in Microsoft PowerShell (yet another Windows aggravation).

We first clone the repo:

Navigate to the cloned folder:

Run Docker Compose (--no-build) and start the demo:
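
A minimal sketch of those three steps, assuming the upstream opentelemetry-demo repository and the Compose file it ships with:

```bash
git clone https://github.com/open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up --no-build
```

Omitting --no-build makes Compose build the images locally instead of using the pre-built ones, which takes considerably longer.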

Head over to your Docker Desktop if you are on Windows and you should see the OpenTelemetry container ready to go in the dashboard:

Access the OpenTelemetry-Demo-Main and watch the Demo metrics data live:

And that is it. Now the fun can start!

Getting the Demo to run on Docker is, of course, just the beginning. There are loads of possibilities available to do more with the Demo that will likely be the subject of future tutorials.

This includes setting up the Astronomy Shop eCommerce demo application, which the maintainers of the project describe as an example of an application that a cloud native developer might be responsible for building, maintaining, etc.:

Several pre-built dashboards for the eCommerce application are available, such as this one for Grafana. It is used to track latency metrics from spans for each endpoint:

Feature Flags

Feature flags, such as the recommendationCache feature flag, will initiate failures in the code that can be monitored with the panel using Grafana or Jaeger (Jaeger is used here):

Once the images are built and containers are started you can access:

Long Way

This OpenTelemetry demo project has come a long way. Bugs can exist, of course, and that is partly why the project lives on GitHub: you can help it become even better than it already is. The demo GitHub page also offers a number of resources to get started.

Stay tuned for a future tutorial on the steps to get the Astronomy Shop eCommerce demo application up and running and on viewing all the fabulous metrics provided by OpenTelemetry in a Grafana panel.

The post Run OpenTelemetry on Docker appeared first on The New Stack.

]]>
Signs You Have Outgrown Your Mobile Monitoring Solution https://thenewstack.io/signs-you-have-outgrown-your-mobile-monitoring-solution/ Fri, 16 Jun 2023 13:36:55 +0000 https://thenewstack.io/?p=22710997

Imagine you start a new hobby — let’s say bike riding. You don’t want to invest a lot in a

The post Signs You Have Outgrown Your Mobile Monitoring Solution appeared first on The New Stack.

]]>

Imagine you start a new hobby — let’s say bike riding. You don’t want to invest a lot in a bike because you’re not sure that you’ll like it. Luckily, you snag a free bike from a friend. It’s clunky, but the price is right. You start out with short rides around your neighborhood and eventually find yourself riding every day, going on longer and longer rides. Your free, heavy bike is holding you back. Now that you’re cycling more seriously, it’s time for a better bike.

Mobile monitoring is like that. When you first start building a mobile app, a free tool may be fine for identifying and solving most crashes. But as your user base grows and you want to improve mobile stability and performance to gain higher ratings and boost your visibility on app stores, you need a better tool. Here are some signs that you’ve outgrown your mobile monitoring tool.

1. You Get So Many Alerts that You’ve Started to Tune Them Out

To solve a mobile crash, you first need to know that something is wrong. Well-defined alerts notify you when a crash or performance issue is affecting enough users to warrant immediate attention. If your mobile monitoring doesn’t provide meaningful and issue-specific alerts, you and your team won’t know which crashes, errors and issues really demand your attention and which are just false alarms. Pretty soon, you’ll start ignoring them. And if you’re ignoring all your alerts, it’s like you have no alerts at all.

2. You Don’t Have Enough Context to Solve Crashes Fast

Knowing that crashes and other errors are happening is important, but without the right context, it’s hard to identify the root cause and solve it fast. You need to collect enough data from user sessions to resolve issues quickly, including user metadata (device type, location, app version), user actions (the steps a user took that led up to the crash or error), screenshots and view hierarchy. Grouping errors is helpful to isolate the issue impact to specific devices or versions. Stack traces with un-minified source code give you insight into the sequence of events that led to the bug, as well as the line of code where you can find it. If you have two-way integrations with source code management platforms like GitHub, you can even determine the original owner of the commit, who might know exactly what happened.

3. You Are Wasting Time Dealing with Broken Workflows

If you are building a mobile app solo or with a very small team, you may not notice minor snags in your workflow. But, as your app grows and many developers are collaborating across teams, inefficiencies in your workflow can really slow you down. We usually hear about two types of workflow challenges from users:

  • Issue triage and assignment: keeping track of issues to solve and owners to solve them
  • Tool integration: ensuring critical data flows between key apps and services, like Jira and GitHub

Because they are intended for smaller teams, entry-level mobile monitoring solutions lack the workflow tooling that most organizations need to find and fix bugs, errors and performance issues smoothly. Investing in a tool that streamlines your workflows is well worth it — the less time your team spends troubleshooting, the more time they can spend building.

4. You No Longer Trust the Data You’re Seeing

Unfortunately, we have heard from many users that entry-level mobile monitoring tools do not report all crashes or errors. When they compare crash reports from their monitoring console to the Google Play Developer Console or App Store Connect, there are often discrepancies in the results. You can only solve the crashes and errors that you know about. If your customers are experiencing issues that go unreported, it will negatively affect their experience.

5. Your Users Are Complaining

This is probably the most important sign that you have outgrown your mobile monitoring solution. If you aren’t able to accurately identify and quickly solve issues that are affecting the stability and performance of your mobile application, your customers will get frustrated, leave bad reviews and potentially abandon your app altogether. In most cases, the primary benefit of an entry-level mobile monitoring solution is that it is no cost. But when you consider that 48% of users will delete an app after experiencing a single performance issue, and that conversions drop by 7% for every one-second delay in screen load times, a “free” mobile monitoring solution that indirectly hurts customer experience will cost you.

If any of these signs ring true, it’s probably time to move on from your current monitoring solution. The success of your mobile application depends on it.

The post Signs You Have Outgrown Your Mobile Monitoring Solution appeared first on The New Stack.

]]>
No More FOMO: Efficiency in SLO-Driven Monitoring https://thenewstack.io/no-more-fomo-efficiency-in-slo-driven-monitoring/ Wed, 14 Jun 2023 17:00:07 +0000 https://thenewstack.io/?p=22708546

Observability is a concept that has been defined in various ways by different experts and practitioners. However, the core idea

The post No More FOMO: Efficiency in SLO-Driven Monitoring appeared first on The New Stack.

]]>

Observability is a concept that has been defined in various ways by different experts and practitioners. However, the core idea that underlies all these definitions is efficiency.

Efficiency means using the available resources in the best possible way to achieve the desired outcomes. In the current scenario, where every business is facing fierce competition and changing customer demands, efficiency is crucial for survival and growth. Resources include not only money, but also time, productivity, quality and strategy.

IT spending is often a reflection of the market conditions. When the market is booming, companies tend to spend more on IT projects and tools, without being too concerned about the value they are getting from them. This can create some problems, such as having too many tools that are not integrated or aligned with the business goals, wasting resources on unnecessary or redundant tasks, and losing visibility and control over the IT environment.

IT spend always correlates to market temperature.

Even the companies that spend heavily on cloud services are reconsidering their big decisions that involve significant, long-term investments. Companies are reassessing their existing substantial spend to ensure their investments can be aligned with revenues or future revenue potential.

Observability tools are also subject to the same review. It is essential that the total operating cost of observability tools can also be directly linked to revenue, customer satisfaction, growth in business innovation and operational efficiency.

Why Do We Need Monitoring?

  • If we had a system that would absolutely never fail, we wouldn’t need to monitor that system.
  • If we had a system for which we never have to worry about being performant, reliable or functional, we wouldn’t need to monitor that system.
  • If we had a system that corrects itself and auto-recovers from failures, we wouldn’t need to monitor that system.

None of the aforementioned points are true today, and it is obvious that we need to set up monitoring for our infrastructure and applications, no matter the scale at which we operate.

What Is FOMO-Driven Monitoring?

When you are responsible for operating a critical production system, it is natural to want to collect as much monitoring data as possible. After all, the more data you have, the better equipped you will be to identify and troubleshoot problems. However, there are a number of challenges associated with collecting too much monitoring data.

Data Overload

One of the biggest challenges of collecting too much monitoring data is data overload. When you have too much data, it can be difficult to know what to look at and how to prioritize your time. This can lead to missed problems and delayed troubleshooting.

Storage Costs

Another challenge of collecting too much monitoring data is storage costs. Monitoring data can be very large, and storing it can be expensive. If you are not careful, you can quickly rack up a large bill for storage.

Reduced Visibility

When there is too much data, it can be difficult to see the big picture. This can make it difficult to identify trends and patterns that could indicate potential problems.

Increased Noise

More data also means more noise. This can make it difficult to identify important events and trends.

Security Concerns

Collecting too much monitoring data can also raise security concerns. If your monitoring data is not properly secured, it could be vulnerable to attack. This could lead to theft of sensitive data or disruption of your production systems.

FOMO-driven monitoring

Ultimately, an approach driven by the fear of missing out does not result in an optimal observability setup and, in fact, can contribute to plenty of chaos, increased expenses, ambiguity between teams and an overall decrease in efficiency.

You can address this situation by being intentional in your decisions on all aspects of the observability pipeline, including signal collection, dashboarding and alerting. Using service-level objectives (SLOs) is one strategy that offers plenty of benefits.

What Are SLOs?

An SLO is a target or goal for a specific service or system. A good SLO will define the level of performance your application needs, but not any higher than necessary.

SLOs help us set a target performance level for a system and measure the performance over a period of time.

Example SLO: An API’s p95 response time will not exceed 300ms.

How Do You Set SLOs?

SLOs are ultimately driven by customers. Yes, they are the final authority. However, as you can imagine, customers do not literally set SLOs themselves. It is up to the business teams to tell the IT operations and development teams the expected performance and availability of a system.

For example, the business teams operating a marketing lead sign-up page can tell the IT teams that they want the page to load within 200ms at least 90% of the time. They would derive this conclusion by looking at the customer behavior already captured.

Now the IT teams can set the SLO for tracking by identifying SLIs (service-level indicators) with which to measure the SLO over a period of time. SLIs are the specific metrics, and the queries over those metrics, used to keep track of SLO progression.
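
As a rough illustration of the relationship between an SLI and its SLO (the latency samples below are made up; real measurements would come from your metrics backend), here is a short sketch:

```python
# Hypothetical latency samples (ms) for the sign-up page over one evaluation window.
latencies_ms = [120, 180, 95, 240, 310, 150, 205, 175, 160, 190]

SLO_TARGET = 0.90        # the business asked for 90% of requests...
SLO_THRESHOLD_MS = 200   # ...to load within 200ms

# The SLI is the measured fraction of requests that met the latency threshold.
sli = sum(1 for latency in latencies_ms if latency <= SLO_THRESHOLD_MS) / len(latencies_ms)

status = "meeting" if sli >= SLO_TARGET else "missing"
print(f"SLI: {sli:.0%} of requests under {SLO_THRESHOLD_MS}ms "
      f"(target {SLO_TARGET:.0%}) -> currently {status} the SLO")
```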

Here is what your observability life cycle looks like implementing an SLO-driven strategy.

SLO-driven strategy

There is an intentional loopback mechanism built into the SLO-driven strategy. Observability is never a settled problem. Organizations that do not keep reinventing their observability strategy fall behind very quickly, ending up with ambiguous tooling and outdated processes and practices, which in the end increases overall operational cost while decreasing efficiency.

With this approach, you get the ability to scientifically measure your infrastructure and application performance over a period of time. The data collected as a result can be used to influence important decisions on infrastructure spend, which in turn helps further improve efficiency.

What Does This Tell Us?

Taking an SLO-first approach allows us to be intentional about which metrics to collect in order to meet commitments to the business.

These are some of the benefits that organizations can achieve by following an SLO-based observability strategy:

  • Improves the signal-to-noise ratio
  • Reduces tool proliferation
  • Enriches monitoring data, resulting in reduced MTTR/MTTI
  • Provides continuous improvement opportunities through the feedback loop
  • Connects monitoring costs to business outcomes, making spend easier to justify to management

Use SLOs to drive your monitoring decisions:

  • Measure, revisit and review SLOs periodically based on outcomes
  • Improve observability posture through
    • Lower cost
    • Reduced issue resolution time
    • Increased team efficiency and innovation

Conclusion

We live in an era where efficiency is critical for organizational success. Observability costs can become uncontrollable if you do not have a proper strategy in place. An SLO-driven observability strategy can help you set guardrails, track performance goals and business metrics, and measure impact in a consistent manner, all while increasing operational efficiency and innovation.

The post No More FOMO: Efficiency in SLO-Driven Monitoring appeared first on The New Stack.

]]>
How Adobe Uses OpenTelemetry Collector https://thenewstack.io/how-adobe-uses-opentelemetry-collector/ Mon, 05 Jun 2023 17:02:50 +0000 https://thenewstack.io/?p=22710118

Adobe’s Chris Featherstone and Shubhanshu Surana praised the OpenTelemetry Collector as the Swiss army knife of observability in their talk at

The post How Adobe Uses OpenTelemetry Collector appeared first on The New Stack.

]]>

Adobe’s Chris Featherstone and Shubhanshu Surana praised the OpenTelemetry Collector as the Swiss army knife of observability in their talk at Open Source Summit North America.

They went on to explain how they use it to track the massive amount of observability data their company collects: metrics (330 million unique series a day), span data (3.6 terabytes a day) and log data (over 1 petabyte a day).

Adobe’s Chris Featherstone and Shubhanshu Surana

Featherstone, senior manager for software development, explained that not all of this data flows through his team or the OTel collector, but “it’s a pretty good chunk.”

Distributed tracing led his team to OpenTelemetry. Adobe is largely made up of acquisitions, he explained, and with every new company brought in, people have their own opinions of the best cloud, this tool, that text editor, etc.

“With distributed tracing specifically, that becomes a huge challenge,” he said. “Imagine trying to stitch a trace across clouds, vendors, open source. So eventually, that’s what led us to the collector. But we were trying to build a distributed tracing platform based on Jaeger agents.” That was in 2019.

Adobe started rolling out the OTel Collector in April 2020 to replace the Jaeger agents. Originally, the collector was used just to ingest traces, but in September 2021 the team brought in metrics, and they’re looking to bring in logs as well.

The team instruments applications using OpenTelemetry libraries, primarily via auto instrumentation and primarily in Java. It does some application enrichment, brings in Adobe-specific data and enriches its pipelines as data flows to the collector. It has some custom extensions and processors, and the team does configuration via GitOps where possible.

“The collector is very dynamic, extending to multiple destinations with one set of data, and this was huge for us. … Sometimes we send collector data to other collectors to further process. So it’s the Swiss Army knife of observability,” Featherstone said.

His team at Adobe is called developer productivity, with a charter to help developers write better code faster.

For the Java services, in particular, it has a base container and “if you’re using a Java image, you should go use this … It has a number of quality-of-life features already rolled into it, including the OpenTelemetry Java instrumentation in the jar. [The configuration is] pulled from our docs, and this is exactly how we configure it for Java.

“So we set the Jaeger endpoint to the local DaemonSet collector. We set the metrics exporter to Prometheus, we set the propagators, we set some extra resource attributes, we set the tracer, the exporter to Jaeger. And we set the trace sampler to parent-based always off,” he said, pointing out that this is all rolled into the Java image.

So with these configurations, any Java service that spins up in Kubernetes at Adobe is already participating in tracing. Everything set up this way passes through the collector.

“So everyone’s participating in tracing just by spinning this up,” he said. “The metrics, we’ve tried to reduce the friction, people would still need to somehow go get those metrics out of that exporter. We’ve made that pretty easy, but it’s not automatic.” He said about 75% of what they run is Java, but they’re trying the same concept with Node.js and Python and other images.

Managing the Data

They do a lot of enrichment, as well as ensuring no secrets are being sent as part of tracing or metrics data, said Surana, Adobe’s cloud operations site reliability engineer for observability.

The team uses multiple processors, including the redaction processor as well as a custom processor in the OpenTelemetry Collector, to eliminate certain fields they don’t want sent to the backend, such as personally identifiable information or other sensitive data. The processors are also used to enrich the data, because adding more fields such as service identifiers, Kubernetes clusters and region helps improve search.

“Adobe is built out of active acquisitions, and we run multiple different products in different ecosystems. There is a high possibility of service names colliding under different products, or of similar microservice names, so we wanted to ensure that doesn’t happen,” he said.

The team also uses an Adobe-specific service registry, where every service has a unique ID attached to the service name. This allows any engineer at Adobe to uniquely identify a service in the single tracing backend.

“It [also] allows the engineers to quickly search on things, even though they don’t know the service, or they don’t know who owns that service, they can go look into our service registry, find out the engineering contact for that particular product or team and get on a call to resolve their issue,” Surana said.

They also send data to multiple export destinations.

“This is probably the most common use case,” he said. “Before the introduction of the OpenTelemetry Collector, engineering teams at Adobe have been using different processes, different libraries in a different format. And they were sending it to vendor products, open source projects, and it was very hard for us to get the engineering teams to change their backend, or to just do any small change in the backend code or their application code because engineers have their own product features and product requests, which they are working on.

“With the introduction of OpenTelemetry Collector, as well as the OTLP [OpenTelemetry protocol] format, this made it super easy for us; we are able to send their data to multiple vendors, multiple toolings with just a few changes on our side.”

Last year, they were able to send the tracing data to three different backends at the same time to test out one engineering-specific use case.

They’re now sending data to another set of OTel collectors at the edge where they can do transformations including inverse sampling, rule-based sampling and throughput-based sampling.

He said they’re always looking into other ways to get richer insights while sending less data to the backend.

“This entire configuration is managed by git. We make use of the OpenTelemetry Operator Helm charts primarily for our infrastructure use case. … It takes away the responsibility from the engineers to be subject matter experts … and makes the configuration super easy,” he said.

Auto instrumentation with OpenTelemetry Operator allows engineers to just pass in a couple of annotations to instrument their service automatically for all the different signals without writing a single line of code.

“This is huge for us,” he said. “This takes developer productivity to the next level.”

They also built out a custom extension on top of the OpenTelemetry Collector using the custom authenticator interface. They had two key requirements for this authentication system: to be able to use a single system to securely send data to the different backends and to be able to secure it for both open source and vendor tools.

The OpenTelemetry Collector comes with a rich set of processors for building data pipelines, including an attributes processor that allows you to add attributes on top of log data and metric data. It allows you to transform, enrich or modify the data in transit without the application engineers doing anything. Adobe also uses it to improve search capabilities in its backends.

The memory limiter processor helps ensure the collector never runs out of memory by checking the amount of memory needed to keep things in state. They also use the span-to-metrics processor and the service graph processor to generate metrics out of traces and build metrics dashboards on the fly.
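
To illustrate the general idea of deriving metrics from trace data, here is a conceptual Python sketch (it is not the collector’s actual span-to-metrics processor, and the span values are invented):

```python
from collections import defaultdict

# Hypothetical finished spans: (service, operation, duration in ms, error flag).
finished_spans = [
    ("cart", "GET /api/cart", 42.0, False),
    ("cart", "GET /api/cart", 55.5, True),
    ("checkout", "POST /api/checkout", 120.3, False),
]

# Aggregate spans into per-service/operation call counts, error counts and
# latency sums, which is conceptually what a span-to-metrics pipeline produces.
calls = defaultdict(int)
errors = defaultdict(int)
latency_sum_ms = defaultdict(float)

for service, operation, duration_ms, is_error in finished_spans:
    key = (service, operation)
    calls[key] += 1
    errors[key] += int(is_error)
    latency_sum_ms[key] += duration_ms

for key, count in calls.items():
    avg_ms = latency_sum_ms[key] / count
    print(f"{key}: calls={count} errors={errors[key]} avg_latency={avg_ms:.1f}ms")
```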

So What’s Next?

Two things, according to Featherstone: improving data quality, namely getting rid of data no one is going to look at, and rate limiting spans at the edge.

The collector provides the ability at the edge to create rules and drop some data.

“For metrics, imagine that we had the ability to aggregate right in the collector itself. You know, maybe we don’t need quite 15-second granularity, let’s dumb that down to five minutes, and then send that off,” Featherstone said.

“Another one might be sending some metrics to be stored for long term and sending some on to be further processed in some operational data lake or something like that. We have the ability to just pivot right in the collector and do all kinds of things.”

The second thing is rate-limiting spans at the edge.

“We have one of our edges taking like 60 billion hits per day, and we’re trying to do tracing on that. That becomes a lot of data when you’re talking about piping that all the way down to somewhere to be stored. So we’re trying to figure out where’re the right places to implement rate limiting, in which collectors and at what levels … just to prevent unknown bursts of traffic, that kind of thing,” he said.

They’re also trying to pivot more to trace-first troubleshooting.

“We have so many east/west services that trying to do it through logs and trying to pull up the right log index for whatever team and do I even have access to it or whatever. It’s so slow and so hard to do, that we’re trying to really shift the way that people are troubleshooting within Adobe, to something like this, where we’ve made a lot of effort to make these traces, pretty complete,” he said.

They are also looking into how people go about troubleshooting and whether the tools they have provide the best way to do that.

They’re looking forward to integrating the OpenTelemetry logging libraries with core application libraries and running OTel collectors as sidecars to send metrics, traces and logs. They are also exploring the new connector component and building a trace sampling extension at the edge to improve data quality.

Wrapping up, he lauded the collector’s plug-in-based architecture and the ability to send data to different destinations with a single binary. There is a rich set of extensions and processors that gives a lot of flexibility with your data, he said.

“OpenTelemetry in general feels a lot to me, like the early days of Kubernetes where everybody was just kind of buzzing about it, and it started like we’re on the hockey stick path right now,” he said. “The community is awesome. The project is awesome. If you haven’t messed with the collector yet, you should definitely go check it out.”

The post How Adobe Uses OpenTelemetry Collector appeared first on The New Stack.

]]>
Red Hat Ansible Gets Event-Triggered Automation, AI Assist on Playbooks https://thenewstack.io/red-hat-ansible-gets-event-triggered-automation-ai-assist-on-playbooks/ Wed, 24 May 2023 13:22:59 +0000 https://thenewstack.io/?p=22708864

In their unceasing conquest to rule over humankind, machines have scored yet another key victory. Red Hat’s IT automation tool Ansible

The post Red Hat Ansible Gets Event-Triggered Automation, AI Assist on Playbooks appeared first on The New Stack.

]]>

In their unceasing conquest to rule over humankind, machines have scored yet another key victory.

Red Hat’s IT automation tool Ansible will soon have the ability to execute certain actions, automatically, without human input. Heretofore Ansible’s automation scripts required a human to mash down on the go button to set in motion the actions captured in that script, called an Ansible Playbook.

As an option in the upcoming Ansible 2.4, administrators can have Ansible automatically execute a special kind of Ansible Playbook called a Runbook, whenever it is triggered by an external alert, such as a notice from an observability tool that a service is down.

“What we’re trying to do with Event-Driven Ansible is automate that decision making that a system administrator often goes through in the operations flow,” explained Red Hat Principal Product Manager Joe Pisciotta, during a presentation of the new technology at the Red Hat Summit, being held this week in Boston. “‘Ops-as-code’ is what we’re going for here. It’s codifying that operational logic.”

With Event-Driven Ansible, messages from an external systems monitoring app such as Datadog, Prometheus or Dynatrace can trigger an action. Red Hat is working with select partners on custom plug-ins, though Ansible can be configured so that any message that arrives by way of a webhook can trigger an action.

Ansible event triggers can come from these sources (Red Hat).

The resulting actions can be anything under the IT management purview of Ansible, such as restarting a server or provisioning more memory.

System events that result in a lot of tickets being generated by the service desk, even though the remediation is well-known, would be a good use of Event-Driven Ansible. Capacity metrics could be another trigger, allowing Ansible to archive data when a certain storage capacity threshold is reached.

Administrators have to specify what can trigger an Ansible automation. The automation controller that handles the event triggers is not a full-fledged messaging bus, though depending on the rules provided, it can execute actions such as collecting additional information, and then filling in a service ticket with that data.

“You have a high degree of confidence that the action that you want to take is consistently correlated to the event that you’re observing,” said Chris Wright, Red Hat chief technology officer and senior vice president of global engineering, in an interview with The New Stack.

Event Driven Ansible will be an optional feature of Ansible 2.4, due to be released in June.

Computer Help

Another new feature for Ansible comes from a partnership with IBM Research (Red Hat is a subsidiary of IBM): an AI-assisted natural language interface for Ansible, called Ansible Lightspeed.

Through a plug-in from Watson Code Assistant, users are given a freeform dialogue input box, where they can ask Ansible to execute a task. It can be used to carry out an action such as starting an EC2 instance, where the user may not know the exact commands to trigger that action.

Think of Ansible Lightspeed as a version of ChatGPT, but one that answers only with Ansible Playbooks.

Lightspeed isn’t quite at the point where a complete novice could use it, but it could be very useful to a junior developer or admin to get up to speed more quickly.

“You still need to have someone who understands Ansible,” Wright said. Administrators should still validate that the AI-generated Playbooks do what the administrators expected them to do. Still, the automation can reduce the time it takes to write a Playbook from 30 minutes to about five, Wright suggested.

IBM built the Code Assistant service on foundational models derived from thousands of Playbooks. In the future, Ansible may see new capabilities from this service, such as the ability to parse what an existing Playbook does, or if the Playbook has any inherent security vulnerabilities, Wright predicted.

Other news today from the Red Hat Summit:

Disclosure: Red Hat paid for this reporter’s travel and lodging to attend the Red Hat Summit.

The post Red Hat Ansible Gets Event-Triggered Automation, AI Assist on Playbooks appeared first on The New Stack.

]]>