The Basics of Event-Driven Architectures https://thenewstack.io/the-basics-of-event-driven-architectures/ Fri, 22 Sep 2023 14:45:38 +0000 https://thenewstack.io/?p=22718689

Real-time data and event-driven systems have enabled a wide range of use cases responsible for making the internet what it is today. From real-time personalization, fraud detection, content recommendation engines and inventory management systems to advancements in customer service, environmental monitoring and communication, event-driven systems power much of the web innovation we’ve seen over the past decade.

As you build and evolve your own products, features and fundamental data handling, it is very likely that their designs will benefit from the application of event-driven architectural concepts.

We live in an exciting time to be building event-driven systems. So let’s explore what event-driven architectures are and how they make it easier to work with high volumes of real-time data.

What Is Event-Driven Architecture?

Event-driven architectures (EDAs) are a way of designing and building systems that are based on the exchange of events. Events are notifications of some change in state or data, and they are typically published by one component and consumed by another in real time.

EDAs are becoming increasingly popular because they offer a number of advantages over traditional, request-response architectures. For example, EDAs can:

  • Decouple systems so they can be scaled and updated independently.
  • Handle high volumes of data with low latency.
  • Support real-time processing and analytics.
  • Be more scalable and resilient to failures.

First, what does real-time data mean and what is an event?

By real-time data, we mean data that is published as it is generated, with no known end of that data being published. The data just keeps coming and is available as soon as it is created. These data are commonly described as traveling in streams or queues that are built to handle large volumes and velocities of data.

Events are data that describe some action or fact. For example, events are generated when a user clicks on a product, when an IoT device is sampled or when a vehicle moves to a new location.

Events are represented as objects with one or more attributes. For example, a webstore event object would include event timestamps (likely with multiple timestamps that document its journey along a data pipeline) and event-type attributes (such as view, cart and purchase events). These event objects are commonly rendered as JSON objects, a popular format for encoding and packaging data.
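
To make that concrete, here is a small, hypothetical webstore event assembled and serialized in Python; the field names are illustrative rather than any standard schema:

import json
import time

# A hypothetical webstore purchase event; field names are illustrative only.
event = {
    "event_type": "purchase",                  # e.g., view, cart or purchase
    "product_id": "sku-12345",
    "user_id": "user-789",
    "event_time_ms": int(time.time() * 1000),  # when the action happened
    "pipeline_ingest_time_ms": None,           # stamped later along the data pipeline
    "price_usd": 49.99,
}

payload = json.dumps(event)  # JSON is a common way to encode and package events
print(payload)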

Next, let’s break down the term “event-driven architecture.” Here, we are building real-time data pipelines in which events drive the system: an action triggers a set of components that deliver the event data to anything listening for it. You subscribe to an action of interest, and it’s made available to you as soon as it happens.

By “architecture” we are referring to how software and hardware components are arranged and connected to build services needed to provide real-time communication and message exchange. These components include things like databases, communication and message channels, brokers that manage those channels, and methods for publishing and sharing with other systems. To support real-time data use cases, these components need to be designed from the ground up to handle and process data with low latency and high concurrency.

At the core of event-driven systems is the component that acts as an event or message broker. This component receives source data, publishes it on source-specific topics and enables listeners to subscribe to topics of interest. Often this component is referred to as an “event bus,” where bus is an analogy for something that moves things from place to place while picking up new passengers.

How Are Event-Driven Architectures Built?

Event-driven architectures are made up of components that have specific roles in transmitting and processing data. In general, there are components that make data available, components that publish data onto streams or queues, components that implement a message broker or “event bus,” and one or more components that listen for and consume the data.

This data may be consumed to perform analysis and generate real-time visualizations. The data may be stored in a real-time database and used to generate responses to API endpoint requests. Often the data is ingested into a real-time data platform used to serve all these needs. Real-time data platforms are built with real-time databases, provide ways to filter, sort, and aggregate data, and offer ways to publish and share the data.

Let’s explore these components in more detail.

Data sources

Data sources encapsulate the reasons you want to build an event-driven system. They contain the information and data you want and need to share with other stakeholders, such as other internal systems and customers. These sources may take the form of webstore customer actions, IoT networks providing weather and logistics information, financial data, logging systems or any other data that is being generated in real time.

Often this data lives in legacy storage systems that were not built to support the low latency and high concurrency that event-driven systems demand. Commonly, this data is stored in traditional databases such as MySQL, or even as files behind a network server. The good news is that many tools and techniques exist to integrate these sources into real-time systems. If you have legacy databases and file systems with data you’d like to share with other systems in real time, you can use change data capture (CDC). Also, see this blog post for an overview of architectural best practices for integrating databases and files.

Event generators

These components read your source data and publish it on a stream or queue. Generators write data to a message broker or event bus, making the data available to event listeners.

This component can range from a CDC connector that writes database changes to the event bus, to custom code that writes data objects directly to it. For example, Debezium and its cloud-hosted variants are popular tools for implementing CDC-based sources. Also, here is an example Python script that writes data directly to an event bus.
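
As a rough sketch of what such generator code can look like, the snippet below publishes an event to an Apache Kafka topic using the kafka-python client; the broker address and topic name are assumptions made for this example:

import json

from kafka import KafkaProducer  # pip install kafka-python

# Assumed broker address for this sketch.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

event = {"event_type": "purchase", "product_id": "sku-12345", "user_id": "user-789"}

# Publish the event on a source-specific topic of the event bus.
producer.send("webstore-events", value=event)
producer.flush()  # block until the message has actually been delivered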

Event buses

An event bus is where events are loaded and offloaded. This component is responsible for ingesting real-time data from the generators and for making that data available to the components listening for it.

There are many forms of event buses, often referred to as streams or queues. Options for streaming components include Apache Kafka, Amazon Kinesis, Google Pub/Sub, Confluent Cloud and Redpanda. Likewise, there are many excellent choices for implementing queues, such as Amazon SQS, RabbitMQ and Redis.

These components are based on classic publish/subscribe concepts and offer a range of features and capabilities to meet the diverse needs of organizations.

Event buses can be configured to retain a custom window of data, enabling data consumers to get data on their own schedule. The advent of event bus architectures has been a boon for anyone who requires full-fidelity and reliable systems. If a consumer fails, it can reconnect and start where it left off.

Event Listeners/Subscribers

These components consume the data, and there is commonly more than one listener. While most consumers are designed to read data only, it is possible to have consumers that remove data from the event bus or queue.

Listeners are designed to access data as soon as it is available by continuously polling the event bus for topics and messages of interest. In some cases, it may be possible to implement triggered polling, where some other signal is given to start the continuous polling process.

Listeners also have the responsibility to do something with the incoming data — to write the data somewhere. This could be to databases, data lakes or even additional downstream streams and queues.
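
A minimal listener sketch, again using kafka-python and assuming the same broker and topic as above, looks like this; the sink function is a placeholder for whatever destination your listener writes to:

import json

from kafka import KafkaConsumer  # pip install kafka-python

def write_to_destination(event: dict) -> None:
    # Placeholder sink; in practice this writes to a real-time database,
    # a data lake or another downstream stream or queue.
    print("storing", event)

consumer = KafkaConsumer(
    "webstore-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics-writer",
    auto_offset_reset="earliest",  # a reconnecting consumer resumes from its stored offset
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:  # continuously polls the event bus for new messages
    write_to_destination(message.value)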

Real-Time Databases

One common destination that listeners write to is a database. To handle the data volumes and velocities typically associated with EDAs, it’s important to use a real-time database. There are several options here, including the open source Apache Druid, Apache Pinot and ClickHouse. All of these open source databases are also available as cloud-hosted services.

These databases can be primary or secondary sources of data. Primary sources are the original and authoritative source of data, while secondary sources are copies of data from multiple sources. This compilation of data from multiple sources is a common motivation for building EDAs.

Like traditional databases, these databases support SQL for filtering, sorting, aggregating and joining data, a query language already familiar across a wide range of technical roles. Chances are that you and your colleagues are well-equipped to start analyzing and integrating these new data streams.

In addition, these databases may support real-time, incremental materialized views, which auto-populate query results into new table views as event-driven data is ingested in real time.

Publication Layer

In most cases, event-driven systems are built to make real-time data available to a variety of consumers and stakeholders. These stakeholders may include data scientists and analysts performing ad hoc analysis, dashboards and report generators, web and mobile application features driven by real-time data or automated control systems that take actions without human intervention.

While this data availability may be implemented with a wide range of methods, ranging from webhook events to generating flat files, the most common method is building API endpoints for consumers to request data from. These API endpoints have the advantage of being extremely flexible, since they are able to serve customized data content to their consumers.
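
A minimal sketch of such an endpoint, using FastAPI with a placeholder query function (the path, parameters and data shape are assumptions for illustration):

from fastapi import FastAPI  # pip install fastapi uvicorn

app = FastAPI()

def top_products(limit: int) -> list[dict]:
    # Placeholder for a parameterized query against a real-time database
    # such as Druid, Pinot or ClickHouse.
    return [{"product_id": "sku-12345", "purchases_last_hour": 42}][:limit]

@app.get("/api/top-products")
def get_top_products(limit: int = 10):
    # Query parameters let each consumer request customized content.
    return {"data": top_products(limit)}

# Run locally with: uvicorn main:app --reload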

Real-Time Data Platforms

Real-time data platforms combine many of the components that EDAs are built with. These platforms include native data connectors for both streaming and batch data sources. In the case of streaming sources, these platforms provide ways to seamlessly consume from a variety of event buses such as Apache Kafka and others implementing the publisher/subscriber model. In addition, these platforms typically provide an endpoint for streaming data directly into the platform.

The platforms also manage data storage of the incoming data by integrating real-time databases. The systems are typically built on top of open source real-time databases, which enable them to manage and process high volumes and velocities of data.

Along with these integrated databases comes the ability to perform data analysis with SQL. The platforms commonly provide user interfaces for writing and designing queries for filtering and aggregating data and joining from multiple data sources.

Finally, the platforms integrate methods for publishing and sharing data. In some cases, they are used to publish data to streams or export data in a batch process. Most advanced platforms make it possible to serve data via low-latency and high-concurrency data APIs.

The advantage of building a real-time data platform into your event-driven architecture is that by combining fundamental EDA components, they remove many of the complexities of using separate components and ‘gluing’ them together. In particular, self-hosting real-time databases and building APIs from scratch demand experience and expertise that is abstracted away by real-time data platforms.

Event-Driven System Design Patterns

To demonstrate how these components fit together, here are two reference architectures (For additional architecture examples, see this blog post).

First, we have a fundamental pattern that focuses on data storage. Here incoming events are immediately stored in three different types of storage:

  • Transactional database — may power functionality for user-facing applications and typically persists fundamental state information.
  • Data warehouse — built for large datasets, long-term storage and building historical archives.
  • Real-time database — may power real-time analytics and a publication layer. Built to support low-latency data retrieval and high concurrency and is well-suited to serve data consumers at scale.

Here we have three listeners and destinations for new event data. With this design, the data warehouse makes requests for new data directly from the real-time database.

We can extend that design by adding a real-time data analytics platform, along with a publication layer. With this type of design, the data analytics platform encapsulates several EDA components.

For example, here the real-time data platform includes a “connector” that listens for events and consumes them, provides a real-time database and analytical tools, and is able to host APIs for sharing data and analysis.

Designing Your Event-Driven Future

Whether you are designing for something new or revisiting designs implemented a long time ago, it’s worth exploring how event-driven architecture patterns can help. You can probably identify many cases where introducing real-time data would improve your product, system or user experience.

If you are in the business of building customer-facing apps, you probably already have a list of data-driven features that would delight your customers. At a minimum, there are probably a few existing pain points that are due for a tune up, along with lots of opportunities for small, quick performance improvements.

As you get started, there are three distinct areas to consider. First, identify where and how the data you want to build with is generated, stored and made available. Perhaps you have a data source that is already written to a stream or queue and all you have to do is add a new listener. Perhaps you have a backend database that you can integrate using change data capture techniques.

Second, decide what type of “event bus” to implement and start building the bridge from where events are generated to where you can listen for them. As mentioned above, there are many open source solutions available that can be self-hosted or cloud-hosted.

Third, with your data sources and event stream sorted out, it’s time to build data consumers.

Real-time data platforms are a common type of event data consumer. These platforms integrate many system components into a single package. For example, Tinybird is a real-time data platform that manages real-time event ingestion and storage, provides real-time data-processing and analysis tools, and hosts scalable, secure API endpoints.

For more information about Tinybird, check out the documentation. For more resources on building event-driven or real-time architectures, you can read this, this or this.

A Microservices Outcome: Testing Boomed https://thenewstack.io/a-microservices-outcome-testing-boomed/ Fri, 15 Sep 2023 18:15:18 +0000 https://thenewstack.io/?p=22718199

A microservices outcome of the past five to ten years: testing boomed.

It boomed as more people just needed to test microservices. Microservices and the rise of Kubernetes reflected the shift from large application architectures to approaches that broke services into little pieces, said Bruno Lopes of Kubeshop.

Kubeshop is a Kubernetes company incubator, Lopes said. It has six different projects that it created in the Kubernetes ecosystem. Lopes is the product leader of the company’s Kubernetes-native testing framework, TestKube.

The ability to test more easily makes testing more accessible to everybody. People feel more comfortable with testing due to the better developer experience. For example, automation improves product quality, especially as it gives people more time for differentiating work instead of manual tasks.

Teams use Kubernetes; they develop applications there but then don’t test the applications where they live, Lopes said. They have the old ways of testing, but they also want to push out new features. Developers move fast — often faster than the organization can change its methodologies. Modern testing methods get adopted, but it takes time for the organization to adapt.

Lopes said no one should ship anything that did not get tested before it goes into production. Second, a company should establish an environment resembling production where you can run all your tests and deploy applications. The environments are never 100% the same, but they can be close to production.

“And make it very fast,” Lopes said. “You shouldn’t make your development team wait for manual QA to make sure everything is all right before deploying. It should deploy as fast as you can. You should deploy without waiting for manual tests.”

Take the SRE team, for example. They need to respond fast to issues. They want fast debugging. The more time they spend looking at problems, the more downtime their customers experience.

Sometimes, especially in critical systems, the applications cannot be exposed to the Internet, Lopes said. That means it becomes essential to run the tests in Kubernetes itself. This will take time for companies to understand, of course, but adoption will accelerate as the developer experience improves.

Event-Driven Microservices Offer Flexibility and Real-Time Responsiveness https://thenewstack.io/event-driven-microservices-offer-flexibility-and-real-time-responsiveness/ Wed, 13 Sep 2023 17:00:16 +0000 https://thenewstack.io/?p=22717093

In today’s dynamic business environment, developers are increasingly pressured to deliver fast, reliable, scalable solutions to keep up with evolving business demands, and traditional applications are proving to be a hurdle in achieving these objectives. Microservices offer a well-understood and promising alternative, but there is a powerful augmentation to this approach that drives even more agility and time-to-value for developers: more specifically, what I refer to as an event-driven programming model, leveraging event-driven microservices.

Event-driven microservices are a powerful architectural pattern that combines the modularity and flexibility of microservices with the real-time responsiveness and efficiency of event-driven architectures. At their core, event-driven microservices rely on three fundamental principles: loose coupling, message-driven communication and asynchronous processing. These principles combine to create scalable, resilient and highly performant distributed systems:

Loose Coupling

Loose coupling is a critical aspect of event-driven microservices, as it promotes modularity and separation of concerns. Loose coupling allows each microservice to evolve independently, minimizing the dependencies between individual services without impacting the overall system. Loose coupling enables faster development and deployment cycles and ensures that issues in one service do not cascade and affect other parts of the system.

Message-Driven Communication

In an event-driven microservice architecture, services communicate through messages representing events or data changes occurring within the system. Events are passed between services via event handlers, which serve as intermediaries that decouple event producers from event consumers. By adopting message-driven communication, event-driven microservices can effectively handle varying loads, ensuring the system remains responsive and resilient even during heavy traffic or peak usage.

Asynchronous Processing

Asynchronous processing is another fundamental principle of event-driven microservices. Instead of waiting for an immediate response or completion of a task, services in this architecture can continue processing other tasks while awaiting the completion of previous requests. This approach significantly reduces system latency and allows for greater parallelism, as multiple services can process events concurrently without being blocked by synchronous calls.
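
As a minimal, framework-agnostic illustration of asynchronous processing, the Python sketch below handles several events concurrently instead of blocking on each one in turn:

import asyncio

async def handle_event(event_id: int) -> None:
    # Simulate non-blocking I/O, such as calling another service or writing to a store.
    await asyncio.sleep(0.1)
    print(f"processed event {event_id}")

async def main() -> None:
    # All five events are processed concurrently rather than one blocking call at a time.
    await asyncio.gather(*(handle_event(i) for i in range(5)))

asyncio.run(main())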

These fundamentals establish the foundation for event-driven microservices and, thus, event-driven programming, allowing developers to create highly scalable, resilient and responsive distributed systems. By embracing loose coupling, message-driven communication and asynchronous processing, event-driven microservices can efficiently handle complex, dynamic workloads and adapt to the ever-changing requirements of modern applications.

Embracing Loose Coupling: The Key to Scalable and Resilient Event-Driven Microservices

Loose coupling is an essential feature of event-driven microservices that facilitates the separation of concerns and modularity in a distributed system. This design principle helps to minimize the dependencies between individual services, allowing them to evolve and scale independently without impacting the overall system.

In a loosely coupled architecture, services are designed to react only to incoming commands, process them and emit events. This approach has several benefits:

  • Service Autonomy: By limiting a service’s responsibility to processing commands and emitting events, each service operates independently of others. This autonomy allows for flexibility in development, as teams can modify or extend a single service without affecting other components in the system.
  • Decoupled Communication: Instead of directly invoking other services or sharing data through APIs, services in a loosely coupled architecture communicate through events. This indirect communication decouples services from one another, reducing the risk of creating brittle dependencies or tight coupling that can hinder scalability and maintainability.
  • Enhanced Scalability: Each service is responsible for processing its commands and emitting events, which can be scaled independently to handle increased demand or improve performance. This feature enables the system to adapt to changing workloads or growth in user traffic without affecting other services or the entire system.
  • Improved Fault Tolerance: Loose coupling helps to contain failures within individual services. If a service encounters an issue, it can be isolated and fixed without causing a cascading failure across the entire system. This containment improves overall system reliability and resilience.
  • Easier Maintenance and Updates: With each service operating independently, developers can deploy updates, bug fixes or add new features to a single service without impacting others. This modularity simplifies maintenance and enables faster iteration cycles.

Developers can create more robust, maintainable and scalable event-driven microservices by embracing loose coupling and designing services that react only to incoming commands, process them and emit events. This isolation allows for greater flexibility and adaptability to changing requirements and growing workloads, ensuring the system remains responsive and resilient.

Harnessing Message-Driven Communication in Event-Driven Systems: Events, Commands and Downstream Services

Message-driven communication is fundamental to event-driven systems, enabling services to communicate asynchronously and maintain loose coupling. This process involves the interaction between upstream services, events, commands, and downstream services in a coordinated manner. Let’s break down each step of this communication process:

  • Publishing Events: Upstream services, or event producers, generate events in response to specific actions or changes within the system. These events represent state changes or important occurrences that must be communicated to other services. Event producers publish these events to an event broker or journal, disseminating them to interested parties.
  • Transforming Events into Commands: Once the events are received by a message handler or an intermediary service, they are typically transformed into commands. Commands represent actions that need to be executed by downstream services. This transformation process often involves extracting relevant data from the event payload, validating the data and mapping it to the appropriate command structure.
  • Publishing Commands to Downstream Services: After transforming events into commands, the message handler or intermediary service publishes the commands to the downstream services or command consumers. These services are responsible for executing the actions specified in the commands, processing the data and, if necessary, generating new events to notify other services of the results.

Message-driven communication in event-driven systems offers several benefits:

  • Asynchronous Interaction: By communicating through events and commands, services can interact asynchronously without waiting for immediate responses. This approach reduces system latency, allows for better parallelism and enhances responsiveness.
  • Decoupling Services: Using events and commands as the primary means of communication between services promotes loose coupling, as services do not need to be aware of each other’s internal implementations or APIs. This decoupling simplifies development and allows services to evolve independently.
  • Scalability and Resilience: Message-driven communication enables better load balancing and resource utilization, as services can independently scale and adapt to changing workloads. Additionally, this communication pattern improves fault tolerance, as the failure of one service does not immediately impact the entire system.

In summary, message-driven communication in event-driven systems is essential for promoting loose coupling, asynchronous processing and scalability. By publishing events from upstream services, transforming them into commands, and publishing those commands to downstream services, event-driven systems can efficiently handle complex workloads and adapt to the ever-changing requirements of modern applications.
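
To illustrate the transform step described above, here is a small Python sketch with hypothetical event and command types; a real message handler would add schema validation and an actual broker client:

from dataclasses import dataclass

@dataclass
class OrderPlacedEvent:          # hypothetical upstream event
    order_id: str
    customer_id: str
    amount: float

@dataclass
class ChargePaymentCommand:      # hypothetical downstream command
    order_id: str
    amount: float

def handle_order_placed(event: OrderPlacedEvent) -> ChargePaymentCommand:
    # Extract the relevant data, validate it and map it to the command structure.
    if event.amount <= 0:
        raise ValueError("invalid amount")
    return ChargePaymentCommand(order_id=event.order_id, amount=event.amount)

def publish_command(command: ChargePaymentCommand) -> None:
    # Placeholder for publishing to the downstream (command consumer) service.
    print("publishing", command)

publish_command(handle_order_placed(OrderPlacedEvent("o-1", "c-9", 25.0)))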

Transitioning from Synchronous to Asynchronous Event-Driven Architectures: Learning from Experience

Developers and teams are often accustomed to synchronous communication patterns, as they are familiar and intuitive from their experience with object-oriented or functional programming. In these paradigms, objects invoke methods on other objects or functions that synchronously call other functions. This familiarity often leads to adopting synchronous communication patterns between microservices in distributed systems.

However, synchronous processing flows may not be well-suited to distributed processing environments for several reasons:

  • Coupling: Synchronous communication leads to tight coupling between services, as they need to be aware of each other’s APIs and implementation details. This coupling makes it difficult to evolve, scale or maintain services independently.
  • Latency: When services communicate synchronously, they must wait for responses before proceeding, which increases system latency and reduces responsiveness, especially when dealing with complex workflows or high workloads.
  • Reduced fault tolerance: Synchronous communication can lead to cascading failures, where issues in one service can quickly propagate to other services, leading to system-wide instability.
  • Limited scalability: Synchronous communication patterns limit the system’s ability to scale horizontally. Services must always be available and responsive to handle incoming requests, which can be challenging in high-traffic scenarios or under heavy workloads.

As developers encounter production stability issues and recognize the limitations of brittle synchronous processing patterns, they begin to appreciate the merits of asynchronous event-driven architectures. These architectures offer several advantages:

  • Loose coupling: Asynchronous event-driven architectures use message-driven communication, which decouples services and allows them to evolve independently, promoting greater modularity and maintainability.
  • Improved responsiveness: Asynchronous processing enables services to continue working on other tasks without waiting for responses, reducing system latency and enhancing responsiveness.
  • Enhanced fault tolerance: Asynchronous communication helps to contain failures within individual services, preventing cascading failures and improving overall system resilience.
  • Scalability: Asynchronous event-driven systems can more effectively scale horizontally, as services can process events concurrently and independently without being blocked by synchronous calls.

By embracing asynchronous event-driven architectures, developers can address the limitations of synchronous communication patterns and build more scalable, resilient and efficient distributed systems. Learning from experience, they can create more robust and maintainable microservice applications that can better adapt to the ever-changing requirements of modern software development.

Summary

Adopting event-driven microservices is a strategic move transforming how businesses and developers approach software design and management. As noted here, the benefits for developers are immense in terms of time, resources, and quality code. Beyond simply business interests, there are significant benefits to individual industries. Consider, in the healthcare sector, how event-driven architectures enable hospital networks to monitor patient health data in real time and trigger alerts to healthcare professionals when anomalies are detected. This could save lives by ensuring immediate action in critical situations.

This example demonstrates how the principles of event-driven microservices can revolutionize a wide range of industries by delivering robust, adaptable and responsive applications.

Britive: Just-in-Time Access across Multiple Clouds https://thenewstack.io/britive-just-in-time-access-across-multiple-clouds/ Thu, 07 Sep 2023 10:00:41 +0000 https://thenewstack.io/?p=22717397

Traditionally when a user was granted access to an app or service, they kept that access until they left the company. Unfortunately, too often it wasn’t revoked even then. This perpetual 24/7 access left companies open to a multitude of security exploits.

More recently the idea of just-in-time (JIT) access has come into vogue, addressing companies’ growing attack surface that comes with the proliferation of privileges granted for every device, tool and process. Rather than ongoing access, the idea is to grant it only for a specific time period.

But managing access manually for the myriad technologies workers use on a daily basis, especially for companies with thousands of employees, would be onerous. And with many companies adopting a hybrid cloud strategy, each cloud with its own identity and access management (IAM) protocols, the burden grows. With zero standing privileges considered a pillar of a zero trust architecture, JIT access paves the way to achieve it.

Glendale, California-based Britive is taking on the challenge of automating JIT access across multiple clouds not only for humans but also for machine processes.

“We recognize that in the cloud, access is typically not required to be permanent or perpetual,” pointed out Britive CEO and co-founder Art Poghosyan. “Most of access is so frequently changing and dynamic, it really doesn’t have to be perpetual standing access … if you’re able to provision with an identity at a time when [users] need it. With proper security, guardrails in place and authorization in place, you really don’t need to keep that access there forever. … And that’s what we do, we call it just-in-time ephemeral privilege management or access management.”

‘Best Left to Automation’

Exploited user privileges have led to some massive breaches in recent years, like SolarWinds, MGM Resorts, Uber and Capital One. Even IAM vendor Okta fell victim.

In the Cloud Security Alliance report “Top Threats to Cloud Computing,” more than 700 industry experts named identity issues as the top threat overall.

And in “2022 Trends in Securing Digital Identities,” of more than 500 people surveyed, 98% said the number of identities is increasing, primarily driven by cloud adoption, third-party relationships and machine identities.

Pointing in particular to cloud identity misconfigurations, a problem occurring all too often, Matthew Chiodi, then Palo Alto Networks’ public cloud chief security officer, cited a lack of IAM governance and standards multiplied by “the sheer volume of user and machine roles combined with permissions and services that are created in each cloud account.”

Chiodi added, “Humans are good at many things, but understanding effective permissions and identifying risky policies across hundreds of roles and different cloud service providers are tasks best left to algorithms and automation.”

JIT systems take into account whether a user is authorized to have access, the user’s location and the context of their current task. Access is granted only if the given situation justifies it and is then revoked when the task is done.
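
Conceptually, this is a time-boxed grant rather than standing access. The toy Python sketch below illustrates the idea only; it is not how Britive or any particular product implements it:

from datetime import datetime, timedelta, timezone

# Active grants: (user, resource) -> expiry time. Names are illustrative.
grants: dict[tuple[str, str], datetime] = {}

def grant_access(user: str, resource: str, minutes: int = 30) -> None:
    # Grant elevated access only for a bounded window.
    grants[(user, resource)] = datetime.now(timezone.utc) + timedelta(minutes=minutes)

def has_access(user: str, resource: str) -> bool:
    # Access is treated as revoked once the window expires.
    expiry = grants.get((user, resource))
    return expiry is not None and datetime.now(timezone.utc) < expiry

grant_access("alice", "prod-db", minutes=15)
print(has_access("alice", "prod-db"))  # True within the window, False afterward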

Addressing Need for Speed

Founded in 2018, Britive automates JIT access privileges, including tokens and keys, for people and software accessing cloud services and apps across multiple clouds.

Aside from the different identity management processes involved with cloud platforms like Azure, Oracle, Amazon Web Services (AWS) and Google, developers in particular require access to a range of tools, Poghosyan pointed out.

“Considering the fact that a lot of what they do requires immediate access … speed is the topmost priority for users, right?” he said.

“And so they use a lot of automation, tools and things like HashiCorp Terraform or GitHub or GitLab and so on. All these things also require access and keys and tokens. And that reality doesn’t work well with the traditional IAM tools where it’s very much driven from a sort of corporate centralized, heavy workflow and approval process.

“So we built technology that really, first and foremost, addresses this high velocity and highly automated process that cloud environments users need, especially development teams,” he said, adding that other teams, like data analysts who need access to things like Snowflake or Google Big Query and whose needs change quickly, would find value in it as well.

“That, again, requires a tool or a system that can dynamically adapt to the needs of the users and to the tools that they use in their day-to-day job,” he said.

Beyond Role-Based Access

Acting as an abstraction layer between the user and the cloud platform or application, Britive uses an API-first approach to grant access with the level of privileges authorized for the user. A temporary service account sits inside containers for developer access rather than using hard-coded credentials.

While users normally work with the least privileges required for their day-to-day jobs, just-in-time access grants elevated privileges for a specific period and revokes those permissions when the time is up. Going beyond role-based access control (RBAC), the system is flexible enough to allow companies to alternatively base access on attributes of the resource in question (attribute-based access) or policy (policy-based access), Poghosyan said.

The patented platform integrates with most cloud providers and with CI/CD automation tools like Jenkins and Terraform.

Its cross-cloud visibility provides a single view into issues such as misconfigurations, high-risk permissions and unusual activity across your cloud infrastructure, platform and data tools. Data analytics offers risk scores and right-sizing access recommendations based on historical use patterns. The access map provides a visual representation of the relationships between policies, roles, groups and resources, letting you know who has access to what and how it is used.

The company added cloud infrastructure entitlement management (CIEM) in 2021 to understand privileges across multicloud environments and to identify and mitigate risks when the level of access is higher than it should be.

The company launched Cloud Secrets Manager in March 2022, a cloud vault for static secrets and keys when ephemeral access is not feasible. It applies the JIT concept of ephemeral creation of human and machine IDs like a username or password, database credential, API token, TLS certificate, SSH key, etc. It addresses the problems of hard-coded secrets management in a single platform, replacing embedded API keys in code by retrieving keys on demand and providing visibility into who has access to which secrets and how and when they are used.

In August it released Access Builder, which provides self-service access requests to critical cloud infrastructure, applications and data. Users set up a profile that can be used as the basis of access and can track the approval process. Meanwhile, administrators can track requested permissions, gaining insights into which identities are requesting access to specific applications and infrastructure.

Range of Integrations

Poghosyan previously co-founded Advancive, an IAM consulting company acquired by Optiv in 2016. Poghosyan and Alex Gudanis founded Britive in 2018. It has raised $35.9 million, most recently $20.5 million in a Series B funding round announced in March. Its customers include Gap, Toyota, Forbes and others.

Identity and security analysts KuppingerCole named Britive among the innovation leaders in its 2022 Leadership Compass report along with the likes of CyberArk, EmpowerID, Palo Alto Networks, Senhasegura, SSH and StrongDM that it cited for embracing “the new worlds of CIEM and DREAM (dynamic resource entitlement and access management) capability.”

“Britive has one of the widest compatibilities for JIT machine and non-machine access cloud services [including infrastructure, platform, data and other ‘as a service’ solutions] including less obvious provisioning for cloud services such as Snowflake, Workday, Okta Identity Cloud, Salesforce, ServiceNow, Google Workspace and others – some following specific requests from customers. This extends its reach into the cloud beyond many rivals, out of the box,” the report states.

It adds that it is “quite eye-opening in the way it supports multicloud access, especially in high-risk develop environments.”

Poghosyan pointed to two areas of focus for the company going forward: one is building support for non-public cloud environments because that’s still an enterprise reality, and the other is going broader into the non-infrastructure technologies. It’s building a framework to enable any cloud application or cloud technology vendor to integrate with Britive’s model, he said.

7 Steps to Highly Effective Kubernetes Policies https://thenewstack.io/7-steps-to-highly-effective-kubernetes-policies/ Wed, 06 Sep 2023 14:37:06 +0000 https://thenewstack.io/?p=22717476

You just started a new job where, for the first time, you have some responsibility for operating and managing a Kubernetes infrastructure. You’re excited about dipping your toes even deeper into cloud native, but also terribly worried.

Yes, you’re concerned about the best way to write secure applications that follow best practices for naming and resource usage control, but what about everything else that’s already deployed to production? You spin up a new tool to peek into what’s happening and find 90 CVEs and YAML misconfiguration issues of high or critical importance. You close the tab and tell yourself you’ll deal with all of that … later.

Will you?

Maybe the most ambitious and fearless of you will, but the problem is that while the cloud native community likes to talk about security, standardization and “shift left” a lot, none of these conversations deaden the feeling of being overwhelmed by security, resource, syntax and tooling issues. No development paradigm or tool seems to have discovered the right way to present developers and operators with the “sweet spot” of making misconfigurations visible without also overwhelming them.

Like all the to-do lists we might face, whether it’s work or household chores, our minds can only effectively deal with so many issues at a time. Too many issues and we get lost in context switching and prioritizing half-baked Band-Aids over lasting improvements. We need better ways to limit scope (aka triage), set milestones and finally make security work manageable.

It’s time to ignore the number of issues and focus on interactively shaping, then enforcing, the way your organization uses established policies to make an impact — no overwhelming feeling required.

The Cloudy History of Cloud Native Policy

From Kubernetes’ first days, YAML configurations have been the building blocks of a functioning cluster and happily running applications. As the essential bridge between a developer’s application code and an Ops engineer’s work to keep the cluster humming, they’re not only challenging to get right, but also the cause of most deployment/service-level issues in Kubernetes. To add in a little extra spiciness, no one — not developers and not Ops engineers — wants to be solely responsible for them.

Policy entered the cloud native space as a way to automate the way YAML configurations are written and approved for production. If no one person or team wants the responsibility of manually checking every configuration according to an internal style guide, then policies can slowly shape how teams tackle common misconfigurations around security, resource usage and cloud native best practices. Not to mention any rules or idioms unique to their application.

The challenge with policies in Kubernetes is that it’s agnostic to how, when and why you enforce them. You can write rules in multiple ways, enforce them at different points in the software development life cycle (SDLC) and use them for wildly different reasons.

There is no better example of this confusion than pod security policy (PSP), which entered the Kubernetes ecosystem in 2016 with v1.3. PSP was designed to control how a pod can operate and reject any noncompliant configurations. For example, it allowed a K8s administrator to prevent developers from running privileged pods everywhere, essentially decoupling low-level Linux security decisions away from the development life cycle.

PSP never left the beta phase for a few good reasons. These policies were only applied when a person or process requested the creation of a pod, which meant there was no way to retrofit PSPs or enable them by default. The Kubernetes team admits PSP made it too easy to accidentally grant too-broad permissions, among other difficulties.

The PSP era of Kubernetes security was so fraught that it inspired a new rule for release cycle management: No Kubernetes project can stay in beta for more than two release cycles, either becoming stable or marked for deprecation and removal.

On the other hand, PSP moved the security-in-Kubernetes space in one positive direction: By separating the creation and instantiation of Kubernetes security policy, PSP opened up a new ecosystem for external admission controllers and policy enforcement tools, like Kyverno, Gatekeeper and, of course, Monokle.

These are the tools we’ve used to shed our clusters of the PSP shackles and replace them with… the Pod Security Standard (PSS). We’ll come back to that big difference in a minute.

A Phase-Based Approach to Kubernetes Policy

With this established decoupling between policy creation and instantiation, you can now apply a consistent policy language across your clusters, environments and teams, regardless of which tools you choose. You can also switch the tools you use for creation and instantiation at will and get reliable results in your clusters.

Creation typically happens in an integrated development environment (IDE), which means you can stick with your current favorite to express rules using rule-specific languages like Open Policy Agent (OPA), a declarative syntax like Kyverno, or a programming language like Go or TypeScript.

Instantiation and enforcement can happen in different parts of the software development life cycle. As we saw in our previous 101-level post on Kubernetes YAML policies, you can apply validation at one or more points in the configuration life cycle:

  1. Pre-commit directly in a developer’s command line interface (CLI) or IDE,
  2. Pre-deployment via your CI/CD pipeline,
  3. Post-deployment via an admission controller like Kyverno or Gatekeeper, or
  4. In-cluster for checking whether the deployed state still meets your policy standards.

The later policy instantiation, validation and enforcement happen in your SDLC, the more likely a dangerous misconfiguration slips its way into the production environment, and the more work will be needed to identify and fix the original source of any misconfigurations found. You can instantiate and enforce policies at several stages, but earlier is always better — something Monokle excels at, with robust pre-commit and pre-deployment validation support.

With the scenario in place — those dreaded 90 issues — and an understanding of the Kubernetes policy landscape, you can start to whittle away at the misconfigurations before you.

Step 1: Implement the Pod Security Standard

Let’s start with the PSS mentioned earlier. Kubernetes now describes three encompassing policies that you can quickly implement and enforce across your cluster. The “Privileged” policy is entirely unrestricted and should be reserved only for system and infrastructure workloads managed by administrators.

You should start with instantiating the “Baseline” policy, which allows for the minimally specified pod, which is where most developers new to Kubernetes begin:

apiVersion: v1
kind: Pod
metadata:
  name: default
spec:
  containers:
    - name: my-container
      image: my-image


The advantage of starting with the Baseline is that you prevent known privilege escalations without needing to modify all your existing Dockerfiles and Kubernetes configurations. There will be some exceptions, which I’ll talk about in a moment.

Creating and instantiating this policy level is relatively straightforward — for example, on the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: latest


You will inevitably have some special services that require more access than Baseline allows, like a Promtail agent for collecting logs and observability. In these cases, where you need certain beneficial features, those namespaces will need to operate under the Privileged policy. You’ll need to keep up with security improvements from that vendor to limit your risk.

By enforcing the Baseline level of the Pod Security Standard for most configurations and allowing Privileged for a select few, then fixing any misconfigurations that violate these policies, you’ve checked off your next policy milestone.

Step 2: Fix Labels and Annotations

Labels are meant to identify resources for grouping or filtering, while annotations are for important but nonidentifying context. If your head is still spinning from that, here’s a handy definition from Richard Li at Ambassador Labs: “Labels are for Kubernetes, while annotations are for humans.”

Labels should only be used for their intended purpose, and even then, be careful with where and how you apply them. In the past, attackers have used labels to probe deeply into the architecture of a Kubernetes cluster, including which nodes are running individual pods, without leaving behind logs of the queries they ran.

The same idea applies to your annotations: While they’re meant for humans, attackers often use them to obtain credentials that, in turn, give them access to even more secrets. If you use annotations to describe the person who should be contacted in case of an issue, know that you’re creating additional soft targets for social engineering attacks.

Step 3: Migrate to the Restricted PSS

While Baseline is permissible but safe-ish, the “Restricted” Pod Security Standard employs current best practices for hardening a pod. As Red Hat’s Mo Khan once described it, the Restricted standard ensures “the worst you can do is destroy yourself,” not your cluster.

With the Restricted standard, developers must write applications that run in read-only mode, have enabled only the Linux features necessary for the Pod to run, cannot escalate privileges at any time and so on.

I recommend starting with the Baseline and migrating to Restricted later, as separate milestones, because the latter almost always requires active changes to existing Dockerfiles and Kubernetes configurations. As soon as you instantiate and enforce the Restricted policy, your configurations will need to adhere to these policies or they’ll be rejected by your validator or admission controller.

Step 3a: Suppress, Not Ignore, the Inevitable False Positives

As you work through the Baseline and Restricted milestones, you’re approaching a more mature (and complicated) level of policy management. To ensure everyone stays on the same page regarding the current policy milestone, you should start to deal with the false positives or configurations you must explicitly allow despite the Restricted PSS.

When choosing between ignoring a rule or suppressing it, always favor suppression. That requires an auditable action, with logs or a configuration change, to codify an exception to the established policy framework. You can add suppressions in source, directly into your K8s configurations or externally, where a developer requests their operations peer to reconfigure their validator or admission controller to allow a “misconfiguration” to pass through.

In Monokle, you add in-source suppressions directly in your configuration as an annotation, with what the Static Analysis Results Interchange Format (SARIF) specification calls a justification:

metadata:
  annotations:
    monokle.io/suppress.pss.host-path-volumes: Agent requires access to back up cluster volumes

Step 4: Layer in Common Hardening Guidelines

At this point, you’ve moved beyond established Kubernetes frameworks for security, which means you need to take a bit more initiative on building and working toward your own milestones.

The National Security Agency (NSA) and Cybersecurity and Infrastructure Security Agency (CISA) have a popular Kubernetes Hardening Guide, which details not only pod-level improvements, such as effectively using immutable container file systems, but also network separation, audit logging and threat detection.

Step 5: Time to Plug and Play

After implementing some or all of the established hardening guidelines, every new policy is about choices, trust and trade-offs. Spend some time on Google or Stack Overflow and you’ll find plenty of recommendations for plug-and-play policies into your enforcement mechanism.

You can benefit from crowdsourced policies, many of which come from those with more unique experience, but remember that while rules might be well-intentioned, you don’t understand the recommender’s priorities or operating context. They know how to implement certain “high-hanging fruit” policies because they have to, not because they’re widely valuable.

One ongoing debate is whether to, and how strictly to, limit the resource needs of a container. Same goes for request limits. Not configuring limits can introduce security risks, but if you severely constrain your pods, they might not function properly.

Step 6: Add Custom Rules for the Unforeseen Peculiarities

Now you’re at the far end of Kubernetes policy, well beyond the 20% of misconfigurations and vulnerabilities that create 80% of the negative impact on production. But even now, having implemented all the best practices and collective cloud native knowledge, you’re not immune to misconfigurations that unexpectedly spark an incident or outage — the wonderful unknown unknowns of security and stability.

A good rule of thumb is if a peculiar (mis)configuration causes issues in production twice, it’s time to codify it as a custom rule to be enforced during development or by the admission controller. It’s just too important to be latently documented internally with the hope that developers read it, pay attention to it and catch it in each other’s pull-request reviews.

Once codified into your existing policy, custom rules become guardrails you enforce as close to development as possible. If you can reach developers with validation before they even commit their work, which Monokle Cloud does seamlessly with custom plugins and a development server you run locally, then you can save your entire organization a lot of rework and twiddling their thumbs waiting for CI/CD pipeline to inevitably fail when they could be building new features or fixing bugs.

Wrapping Up

If you implement all the frameworks and milestones covered above and make all the requisite changes to your Dockerfiles and Kubernetes configurations to meet these new policies, you’ll probably find your list of 90 major vulnerabilities has dropped to a far more manageable number.

You’re seeing the value of our step-by-step approach to shaping and enforcing Kubernetes policies. The more you can interact with the impact of new policies and rules, the way Monokle does uniquely at the pre-commit stage, the easier it’ll be to make incremental steps without overwhelming yourself or others.

You might even find yourself proudly claiming that your Kubernetes environment is entirely misconfiguration-free. That’s a win, no doubt, but it’s not a guarantee — there will always be new Kubernetes versions, new applications and new best practices to roll into what you’ve already done. It’s also not the best way to talk about your accomplishments with your leadership or executive team.

The advantage of leveraging the frameworks and hardening guidelines is that you have a better common ground to talk about your impact on certification, compliance and long-term security goals.

What sounds more compelling to a non-expert:

  • You reduced your number of CVEs from 90 to X,
  • Or that you fully complied with the NSA’s Kubernetes hardening guidelines?

The sooner we worry less about numbers and more about common milestones, enforced as early in the application life cycle as possible (ideally pre-commit!), the sooner we can find the sustainable sweet spot for each of our unique forays into cloud native policy.

Setting up Multicluster Service Mesh with Rafay CLI https://thenewstack.io/setting-up-multicluster-service-mesh-with-rafay-cli/ Wed, 30 Aug 2023 17:12:08 +0000 https://thenewstack.io/?p=22716971

This is the second of a two-part series. Read Part 1.

Over the past several months, our team has been working on scaling Rafay’s SaaS controller. As a crucial part of this, we embarked on setting up multicluster Istio environments. During this process, we encountered and successfully tackled the challenges previously mentioned. These challenges encompassed managing the complexity of the configuration, ensuring consistent settings across clusters, establishing secure network connectivity and handling service discovery, monitoring and troubleshooting complexities.

To overcome these challenges, we adopted Infrastructure as Code (IaC) approaches for configuration management and developed a command line interface (CLI) automation tool to ensure consistent and streamlined multicluster Istio deployments. The CLI follows the “multi-primary on different networks” model described in the Istio documentation. The topology we use in our multicluster Istio deployments looks like the image below.

The CLI uses a straightforward configuration. Below is an example of the configuration format:

$ cat examples/mesh.yaml
apiVersion: ristioctl.k8smgmt.io/v3
kind: Certificate
metadata:
  name: ristioctl-certs	
spec:
  validityHours: 2190
  password: false
  sanSuffix: istio.io # Subject Alternative Name Suffix
  meshID: uswestmesh
---
apiVersion: ristioctl.k8smgmt.io/v3
kind: Cluster
metadata:
  name: cluster1
spec:
  kubeconfigFile: "kubeconfig-istio-demo.yaml"
  context: cluster1
  meshID: uswestmesh
  version: "1.18.0"
  installHelloWorld: true #deploy sample HelloWorld application
---
apiVersion: ristioctl.k8smgmt.io/v3
kind: Cluster
metadata:
  name: cluster2
spec:
  kubeconfigFile: "kubeconfig-istio-demo.yaml"
  context: cluster2
  meshID: uswestmesh
  version: "1.18.0"
  installHelloWorld: true   #deploy sample HelloWorld application


(Note: The example above is a generic representation.)

In this configuration, the CLI is set up to work with two Kubernetes clusters: cluster1 and cluster2. Each cluster is defined with its respective details, including the Kubernetes kubeconfig file, the context and the version of Istio to be installed. The CLI uses this configuration to establish connectivity between services across the clusters and create the multicluster service mesh.

Explanation of the configuration:

Certificate: The CLI establishes trust between all clusters in the mesh using this configuration. It will generate and deploy distinct certificates for each cluster. All cluster certificates are issued by the same root certificate authority (CA). Internally, the CLI uses the step-ca tool.

Explanation:

  • apiVersion: The version of the API being used, in this case, it’s ristioctl.k8smgmt.io/v3.
  • kind: The type of resource, which is Certificate in this case.
  • metadata: Metadata associated with the resource, such as the resource name.
  • spec: This section contains the specifications or settings for the resource.
    • validityHours: Specifies the validity period of the certificate in hours.
    • password: Indicates whether a password is required.
    • sanSuffix: Subject Alternative Name (SAN) Suffix for the certificate.
    • meshID: Identifier for the multicluster service mesh.

Cluster: These are cluster resources used to define individual Kubernetes clusters that will be part of the multicluster service mesh. Each cluster resource represents a different Kubernetes cluster.

Explanation:

  • kubeconfigFile: Specifies the path to the kubeconfig file for the respective cluster, which contains authentication details and cluster information.
  • context: The Kubernetes context associated with the cluster, which defines a named set of access parameters.
  • meshID: Identifies the multicluster service mesh that these clusters will be connected to.
  • version: Specifies the version of Istio to be deployed in the clusters.
  • installHelloWorld: Indicates whether to deploy a sample HelloWorld application in each cluster.

Overall, this configuration describes the necessary settings to set up a multicluster service mesh using the ristioctl CLI tool. It includes the specification for a certificate and Kubernetes clusters that will be part of the service mesh. The ristioctl CLI tool will use this configuration to deploy Istio and other required configurations to create a unified and scalable mesh over these clusters. The steps below outline the tasks the CLI tool handles internally to set up a multicluster service mesh. Let’s further explain each step:

  • Configure trust across all clusters in the mesh: The CLI tool establishes trust between the Kubernetes clusters participating in the multicluster service mesh. This trust allows secure communication and authentication between services in different clusters. This involves generating and distributing certificates and keys for mutual TLS (Transport Layer Security) authentication.
  • Deploy Istio into the clusters: The CLI deploys Istio into each Kubernetes cluster within the mesh.
  • Deploy east-west gateway into the clusters: The east-west gateway is an Istio component responsible for handling traffic within the service mesh, specifically the traffic flowing between services in different clusters (east-west traffic). The CLI deploys the east-west gateway into each cluster to enable cross-cluster communication.
  • Expose services in the clusters: The CLI ensures that services running within each cluster are appropriately exposed and accessible to the other clusters in the multicluster service mesh.
  • Provision cross-cluster service discovery using a Rafay ZTKA-based secure channel: Rafay ZTKA (Zero Trust Kubectl Access) is a secure channel technology that enables cross-cluster Kube API server communication.

By automating these steps, the CLI simplifies setting up a multicluster service mesh, reducing the operational complexity for users and ensuring a unified and scalable mesh over clusters in different environments. This approach enhances connectivity, security and observability, allowing organizations to adopt a multicloud or hybrid cloud strategy easily.

To use it:

ristioctl apply -f examples/mesh.yaml

The CLI is open source. You can find more details at https://github.com/RafaySystems/rafay-istio-multicluster/blob/main/README.md.
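
Once the mesh is up, a quick way to confirm cross-cluster traffic is Istio's standard HelloWorld check. The sketch below assumes the sample application landed in a "sample" namespace and that a curl-capable client pod (such as Istio's "sleep" sample) is available; adjust the names to whatever the CLI actually deploys in your environment.

# Hedged verification sketch, adapted from Istio's multicluster docs
kubectl exec --context=cluster1 -n sample -c sleep deploy/sleep -- \
  curl -sS helloworld.sample:5000/hello
# Repeated calls should alternate between "Hello version: v1" and
# "Hello version: v2", showing requests are served from both clusters.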

We use Rafay Zero Trust Kubectl Access (ZTKA) to avoid exposing each Kubernetes cluster's Kube API server to a different network, which improves security. To implement this, you need to incorporate Rafay's ZTKA kubeconfig in the configuration. The resulting topology will resemble the following:

Conclusion

Multicluster service connectivity is crucial for various organizational needs. While Istio provides multicluster connectivity, configuring it can be complex and cumbersome. Therefore, we have developed a tool to simplify the configuration process. Ensuring secure network connectivity between clusters is paramount to safeguarding data in the multicluster environment. With our tool, organizations can streamline the setup of multicluster service mesh and establish a secure and scalable infrastructure to support their distributed applications effectively.

The post Setting up Multicluster Service Mesh with Rafay CLI appeared first on The New Stack.

Dapr: Create Applications Faster with Standardized APIs https://thenewstack.io/dapr-create-applications-faster-with-standardized-apis/ Thu, 20 Jul 2023 15:24:26 +0000 https://thenewstack.io/?p=22713641


Building applications and getting them ready for production is a huge task. Take a look at the Twelve-Factor App that lists many essential aspects involved in creating and running SaaS products. Development teams must be able to understand and implement all of these technical factors in a timely manner. Teams are expected to deliver functionality fast, while operational costs need to be minimized.

Tools and frameworks that lower the barrier to creating and running production-ready applications are essential, since they allow developers to spend more time implementing business features rather than yak shaving. The runtime framework Dapr is popular for building distributed applications based on microservice architectures.

This article highlights how Dapr increases developer productivity and enables an agile architecture, regardless of the software architecture chosen.

Increasing Developer Productivity

Dapr helps developers be efficient by providing easy-to-understand APIs, also known as building blocks, to start developing applications quickly. Many organizations use Dapr to build microservices on top of Kubernetes. But what if you don’t want or need a microservice architecture? What if you want to build monoliths, modular monoliths or N-tier architecture applications? Dapr building blocks can benefit all kinds of backend applications.

Standard Building Block APIs

Some Dapr building blocks, such as service-to-service invocation, pub/sub and distributed lock, are essential for distributed systems. The majority of building blocks, however, are relevant for any type of software architecture:

  • State management (key/value store): read/write key/value pairs to create stateful services.
  • Secrets management: securely access secrets from your application.
  • Bindings: interface with external systems.
  • Configuration: manage application configuration and have your application subscribed to changes.
  • Workflow: enables stateful and long-running business processes.
  • Actors: self-contained unit that contains state and behavior.
  • Cryptography: perform cryptographic operations without exposing keys to your application.

The diagram below illustrates how an N-tier application or a modular monolith can benefit from using Dapr building blocks in the logic tier.

The logic tier of these architectures usually contains a business logic layer and a data access layer. Building blocks such as Workflow and Actors are meant for business logic development, so they are a great fit here. The Configuration building block is useful when used as a lightweight feature flag solution. The State management, Bindings and Secrets building blocks are particularly useful in the data access layer when external resources are accessed.

All building block APIs are intentionally basic to keep the learning curve as flat as possible. For instance, if you use the state management API for key/value data, these are the methods that you would use:

Create/Insert Key/Value Pair

POST http://localhost:<daprPort>/v1.0/state/<storename>
Content-Type: application/json

[
	{ "key" : "<key1>", "value" : "<value1>" },
	{ "key" : "<key2>", "value" : "<value2>" }
]


(<storename> refers to the component name of the state store)

Get a Value by Key

GET http://localhost:<daprPort>/v1.0/state/<storename>/<key>

Get Multiple Values by Keys

POST http://localhost:<daprPort>/v1.0/state/<storename>/bulk
Content-Type: application/json

{
	"keys": [ "<key1>", "<key2>" ]
}

Delete a Key/Value Pair

DELETE http://localhost:<daprPort>/v1.0/state/<storename>/<key>


By offering these standard APIs that can be used via HTTP, gRPC or one of the Dapr client SDKs, developers can use their favorite language and stay in their integrated development environment (IDE) of choice.
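
As a concrete illustration, assume a state store component named statestore (a hypothetical name) and a Dapr sidecar listening on its default HTTP port of 3500. The templates above then reduce to plain HTTP calls:

# Save a key/value pair
curl -X POST http://localhost:3500/v1.0/state/statestore \
  -H "Content-Type: application/json" \
  -d '[{ "key": "order-1", "value": { "status": "shipped" } }]'

# Read it back; returns the stored value, { "status": "shipped" }
curl http://localhost:3500/v1.0/state/statestore/order-1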

Cross-Cutting Concerns

So besides the standard building block APIs, how exactly can Dapr help speed up development time even more? All applications need to consider cross-cutting concerns such as:

  • Observability
  • Resiliency
  • Security

If no frameworks are used, these concerns need to be implemented in each application over and over again, which is not the best use of development time. Some teams roll their own solution to manage them, with the downside of owning and maintaining a larger codebase. With Dapr, all of the above cross-cutting concerns come out of the box and are managed with declarative configuration files.

Observability

Dapr has observability built in; this includes tracing, metrics, and logging.

  • Tracing: captures the flow of a request through a system.
  • Metrics: describes a point-in-time measurement of a system property that can easily be aggregated.
  • Logging: captures discrete events in a richer format to provide context.

Dapr can connect to any observability tool that supports the OpenTelemetry (OTEL) or Zipkin protocols, such as DataDog, New Relic, Grafana and Jaeger. Tracing is configured as part of a configuration YAML file:

spec:
  tracing:
    samplingRate: "1"
    otel:
      endpointAddress: "https://..."
      isSecure: true
      protocol: grpc


Dapr sidecars expose a Prometheus metrics endpoint that can be used to gain insight into how Dapr itself is behaving. Metric types that are collected include: control plane, application, component and resiliency policies. See the full list with all individual metrics on GitHub.

Resilience

Making applications resilient is hard work. Resiliency is usually applied by implementing well-known architectural patterns such as retry, circuit breaker and timeout.

  • The retry pattern is used when a call to a service fails, and the call is retried several times.
  • The circuit breaker pattern is used to prevent cascading failures when a call to a service fails.
  • The timeout pattern is used to wait for a response from a service that might be slow to respond.

Typically, combining these patterns is preferred to obtain a good overall system resiliency.

There are some libraries out there that can help, such as Polly (.NET) or Resilience4j (Java), but they’re typically available for only one programming language or runtime and must be built into your application. With Dapr, resiliency is configured using resiliency spec files that contain policies and targets that the policies apply to. Resiliency spec files have the following structure:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency
scopes:
  # optionally scope the policy to specific apps
spec:
  policies:
    timeouts:
      # timeout policy definitions

    retries:
      # retry policy definitions

    circuitBreakers:
      # circuit breaker policy definitions

  targets:
    apps:
      # apps and their applied policies here

    actors:
      # actor types and their applied policies here

    components:
      # components and their applied policies here


If a constant retry policy is required that will retry 10 times with an interval of 5 seconds, this can be configured as follows:

retries:
  pubsubRetry:
    policy: constant
    duration: 5s
    maxRetries: 10
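
Timeout and circuit breaker policies follow the same declarative shape. A minimal sketch (the policy names here are illustrative):

timeouts:
  general: 5s

circuitBreakers:
  simpleCB:
    maxRequests: 1
    timeout: 30s
    trip: consecutiveFailures >= 5

The trip condition determines when the breaker opens; once the timeout elapses, the breaker moves to half-open and allows up to maxRequests test calls through.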


Policies are applied to targets, which can be applications, actors or components. This example configures the pubsubRetry outbound policy for a component named messageBrokerA:

targets:
  components:
    messageBrokerA: # any component name -- happens to be another pubsub broker here
      outbound:
        retry: pubsubRetry


See a full example of a resiliency spec in the Dapr docs. Dapr resiliency is language-agnostic, and policies are not limited to Dapr applications; they can also be applied when calling external (non-Dapr) endpoints.

Security

Dapr provides end-to-end security by using mTLS encryption and security policies, where access to Dapr APIs, Dapr applications and components is explicitly configured.

In this configuration example, access to the state management HTTP API is allowed:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: myappconfig
  namespace: default
spec:
  api:
    allowed:
      - name: state
        version: v1.0
        protocol: http


Access to components can be scoped to specific Dapr applications. In this component file example, the statestore component can be accessed by app1 and app2:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: statestore
  namespace: production
spec:
  type: state.redis
  version: v1
  metadata:
  - name: redisHost
    value: redis-master:6379
scopes:
- app1
- app2


If the solution exposes web APIs, endpoint authorization can be implemented with Dapr's OAuth middleware, using either the Authorization Code Grant flow or the Client Credentials Grant flow. In addition, authentication tokens can be required for accessing the Dapr API via a reverse proxy.
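
As a hedged sketch of what that looks like, the OAuth2 middleware is itself declared as a component; the endpoint URLs and credentials below are placeholders, and the middleware only takes effect once it is referenced from the httpPipeline section of a Dapr Configuration resource.

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: oauth2
spec:
  type: middleware.http.oauth2
  version: v1
  metadata:
  - name: clientId
    value: "<client-id>"
  - name: clientSecret
    value: "<client-secret>"
  - name: scopes
    value: "<scopes>"
  - name: authURL
    value: "<authorization-endpoint>"
  - name: tokenURL
    value: "<token-endpoint>"
  - name: redirectURL
    value: "<redirect-url>"
  - name: authHeaderName
    value: "authorization"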

Agile Architecture

When a new software project starts in an agile organization, usually a combination of emergent and intentional architectural design is done. This combination is known as the architectural runway. This runway is in contrast with big design up front (BDUF), which is commonly done when doing waterfall-style project management. BDUF takes a long time to complete, leaves no room for changes and is therefore often a limiting factor later in the project.

There should be just enough intentional architecture design to get the development going and deliver usable features each sprint. The rest of the architectural design emerges while the team is building and makes further decisions. It helps to postpone some decisions as long as possible to allow more time to explore options during development.

One of the intentional architecture decisions could be to use a framework such as Dapr that allows postponing other architectural decisions that will later emerge. Frequently, the development team is not in control of certain decisions.

For instance, the procurement department is still in negotiation with several cloud providers, so there’s no decision yet which one to use. The development team can still proceed with Dapr since Dapr applications can run on any cloud that offers Kubernetes or VMs.

Or the application requires state management for key/value storage, but the exact store is still under review by the security team. The development team can continue with application development using the Dapr state management API, since that is a standard building block, and its implementation can later be switched from a locally hosted solution to whichever cloud store is approved, simply by selecting a different state store component.

The diagram below illustrates the flexibility in application hosting and provides an example of various components that can be used via the standard API for state management.

Flexibility with Components

Since Dapr applications use a standard API to access features such as state management, pub/sub messaging, bindings, configuration stores and secrets stores, these APIs and their underlying implementations are decoupled by Dapr components. Dapr has over 115 built-in components across the 11 building block APIs. Components of the same type are interchangeable without changing the application code; the only required change is the component file, which contains the component's name, its type and metadata describing how to connect to the underlying resource.

This example shows a component file where Redis is used as the state store:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: <storename> 
spec:
  type: state.redis
  version: v1
  metadata:
  - name: redisHost
    value: <host>
  - name: redisPassword
    value: <password>


The only thing the application requires to use this component is the metadata.name value (<storename> in this example):

GET http://localhost:<daprPort>/v1.0/state/<storename>/<key>
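
Because the application only ever addresses the store by that name, swapping the backing resource later means replacing the component file rather than the code. A minimal sketch using Dapr's built-in in-memory state store (handy for local development; the name must match what the application already calls):

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: <storename>   # unchanged, so the application code stays the same
spec:
  type: state.in-memory
  version: v1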


At the component level, there is flexibility when it comes to the resources used by the building block APIs, so developers are not stuck with the lowest common denominator across all resources. For instance, although the state management API has no methods to configure time to live (TTL) for key/value pairs, at the component level (declarative configuration) this can be specified for resources that support it natively:

- name: ttlInSeconds
  value: <int> # Optional


In case the built-in components are not sufficient, Dapr allows the creation of pluggable components for state management, pub/sub messaging and bindings. This allows development teams to use the well-known building block APIs while providing maximum flexibility in the resources they require.

Closing

Many applications need some form of state management, access to configuration and secrets, and integration with external systems. Since Dapr offers building blocks for these features, development teams benefit significantly from using Dapr, even if this doesn't involve microservices initially. The standard API for the building blocks has a modest learning curve, and combined with out-of-the-box handling of cross-cutting concerns for observability, security, and resiliency, Dapr helps developers build production-ready applications faster. The decoupling of the APIs from the components allows architectural decisions to be postponed, which keeps the application architecture flexible and portable.

Want to learn more about Dapr? Take a look at the learn-dapr repository on GitHub, and please join the Dapr Discord, with over 6,000 members, to ask for help and share your knowledge.

The post Dapr: Create Applications Faster with Standardized APIs appeared first on The New Stack.

State of the API: Microservices Gone Macro and Zombie APIs https://thenewstack.io/state-of-the-api-microservices-gone-macro-and-zombie-apis/ Wed, 28 Jun 2023 15:55:17 +0000 https://thenewstack.io/?p=22712129


Microservices expanding into unwieldy messes and zombie APIs were among the concerns that emerged from this year’s Postman State of the API survey. But even with those problems, the survey also found that APIs are paying off for organizations — almost two-thirds of the 40,000 respondents said their APIs generate revenue.

Forty-three percent said APIs generate over a quarter of company revenue. For a handful of companies, APIs generated more than 75% of total revenue. These companies were almost twice as likely to be in financial services as other sectors, the report stated.

Given API profitability, it’s not hard to see why 92% of global respondents say that API investments will increase or stay the same over the next 12 months. That’s up three percentage points from last year’s report.

“This increase may reflect a sense in some quarters that the worst of tech’s economic contraction has passed,” the report noted. “At the same time, fewer respondents say they expect to cut investments into APIs this year.”

Microservices, Mesoservices, and Monoliths

Microservices remain the dominant architectural style of APIs at the majority of organizations, but it seems that doesn’t always go as planned. In this year’s report, 10% of respondents said that APIs powering microservices have grown large and unwieldy, creating “macroservices” instead of microservices.

Microservices were defined as small services working independently to carry out single responsibilities, while macroservices were defined as microservices that have grown large and unwieldy, approaching monoliths. Monoliths are single-tiered apps in which the interface and data access are combined into one package. Finally, mesoservices are the Goldilocks preference: not too big, not too small — just right.

Monoliths and mesoservices each represent a little over 20% of organizations surveyed.

APIs as macroservices, microservices, monoliths or something in between

API architecture style image via Postman’s 2023 State of the API report.

“There is obviously observations that have come in from a decade of microservices going mainstream and how companies are thinking about it,” Ankit Sobti, co-founder and CTO of Postman, told The New Stack. “The idea of microservices that grow and become unwieldy and lose the essence of being a microservice in the first place is something that we wanted to call out through the survey.”

One reason to use microservices is that, in theory, the microservice and its API can be reused more easily. It's worth noting, then, that 21% of self-identified "API first" leaders cited reusing APIs or microservices as a pain point for their organizations.

“The aspect of reusability is how consumable is the API, and the struggle that we see when we talk to customers is outside of discovering the API in the first place,” Sobti said. “Is the API consistent, conformant, and easy to just set up? Authentication ends up being a big problem in just starting out with API. So I think those factors that drive the consumption of the API make a big difference in the way that the API’s are difficult to integrate in the network.”

Zombie APIs Rise as Layoffs Happen

Nearly 60% of all respondents are concerned over zombie APIs — APIs that lack proper documentation and ownership, but persist after a developer has left the organization. It ranked as the second leading concern when developers leave, behind poor documentation. Engineers and developers ranked zombie APIs as a higher concern than executives did, who placed “loss of institutional memory” as slightly more concerning than loss of maintenance, aka zombie APIs.

“These APIs have no owner, oversight, or maintenance — and are sometimes forgotten by the company,” the report noted. “At worst, zombie APIs pose a security risk; at best, they deliver a poor consumer experience.”

Troubles when developers leave graph

When Developers Leave graph image via Postman’s 2023 State of the API report.

One solution is to maintain a catalog of APIs used, suggested Sobti.

“That’s the emergence of zombie APIs, because a lot of institutional knowledge lies with the people who built it,” Sobti told The New Stack. “Once the people transition out, the change management is complex, and that’s where cataloging your API has internal APIs, in particular, becomes very critical.”

API catalogs can keep track of internal APIs in one place, he added. There are dedicated teams that are now responsible for not just building the underlying infrastructure that allows the catalogs to exist, but also managing the catalog and creating the practices on building to get those APIs into the catalogs. That is where reuse becomes critical, he added.

As further proof of the need for better documentation, the survey found that a lack of documentation was cited as the primary obstacle to consuming an API.

Fewer than one in 20 API changes fail, according to half of respondents. Among industries, healthcare claimed the best rate, with 55% of respondents stating that fewer than one in 20 API deployments failed. Education was at the other end of the spectrum; only 43% of respondents there said their failure rate was that good. Perhaps that ties to another key finding: Education was also the sector likeliest to skip API testing and to spend the least amount of time on API development.

API-first leaders were less likely to encounter failures than all respondents, with 60% stating that failures occurred less than one time in 20.

API-First Development

The report also noticed a trend for what it terms API-first companies to perform better across a variety of API issues than companies that are not API-first. API-first companies prioritize APIs at the start of the development process, positioning APIs as the building blocks of software. Over 75% of respondents somewhat agreed or strongly agreed that developers at API-first companies are more productive, create better software and integrate faster with partners.

Benefits of being API first

Benefits of being API first image via Postman’s 2023 State of the API survey.

Being API-first means developing APIs before writing other code, instead of treating them as afterthoughts, according to Postman, which defined the term for survey respondents.

“API-first companies are the ones that acknowledged that APIs are the building blocks of their software strategy,” Sobti said. “So you’re thinking of a development model where applications are conceptualized as this interconnection of internal and external services to these APIs. The API-first organizations are becoming more cognizant of the business and the technical implications of APIs.”

Companies are realizing the strategic value of APIs to the business: More companies reported that their APIs are generating a significant portion of company revenues this year, he said. The survey also found that revenue was the second most important metric of API success, after usage itself. Benefits to API-first companies increase as the company size and developer numbers grow, the survey found.

“At small companies with 100 developers or fewer, 32% of respondents strongly agreed that API-first companies onboard faster. But as developer headcount rose above 100, that figure steadily climbed,” the report stated. “By the time a company reached 5,000 developers, 42% of its respondents strongly agreed with the statement. We see similar increases across almost all metrics when answers are sorted by company size.”

“Technically, what we see are API-first companies being ones that are able to build APIs faster, report fewer failures, and when APIs go down, they’re able to restore and respond in less than an hour,” he said. “What we see is APIs definitely increasing in number across organizations, both internally and externally, and a part of that is the ability to reuse more of the capabilities that have been created within your organization and also externally, that you can now either use or subscribe or buy.”

Leveraging reuse is what drives the ability of developers to do more with less, whether that means doing it with fewer people or not having to build business functions at all because they can, for instance, subscribe to Stripe and manage billing through Stripe's API, he said.

GraphQL Improves Position as an API Architecture

The survey also found that while REST remains the most-used API architecture by far, it has lost a bit of ground to newcomers. This year, 86% of respondents said they used REST, down from 89% in last year's report and 92% the year prior.

SOAP usage fell to 26% of all respondents this year versus 34% last year. Taking SOAP’s place is GraphQL, which was used by 29% of survey-takers.

When it comes to API specifications, JSON Schema is the favorite, followed by Swagger/OpenAPI 2.0 and Open API 3.x, which are tied almost evenly. GraphQL came in fourth in popularity.

API Schemas preferred image via Postman’s 2023 State of the API report.

Editor’s Note: On June 29, 2023, this piece was updated to show an updated specifications chart.

The post State of the API: Microservices Gone Macro and Zombie APIs appeared first on The New Stack.

In the Great Microservices Debate, Value Eats Size for Lunch https://thenewstack.io/in-the-great-microservices-debate-value-eats-size-for-lunch/ Tue, 13 Jun 2023 13:10:42 +0000 https://thenewstack.io/?p=22710290


In May, an old hot topic in software design long thought to be settled was stirred up again, sparked by an article from the Amazon Prime Video architecture team about moving from serverless microservices to a monolith. This sparked some spirited takes and also a clamor for using the right architectural pattern for the right job.

Two interesting aspects can be observed in the melee of ensuing conversations. First, the original article was more about the scaling challenges with serverless “lambdas” rather than purely about microservices. Additionally, it covered state changes within Step Functions leading to higher costs, data transfer between lambdas and S3 storage, and so on.

It bears reminding that there are other, and possibly better, ways of implementing microservices than serverless alone. The choice of serverless lambdas is not synonymous with the choice of microservices. Choosing serverless as a deployment vehicle should be contingent upon factors such as expected user load and call frequency patterns, among other things.

The second and more interesting aspect was about the size of the services (micro!) and this was the topic of most debates that emerged. How micro is micro? Is it a binary choice of micro versus monolith? Or is there a spectrum of choices of granularity? How should the size or granularity factor into the architecture?

Value-Based Services: Decoupling to Provide Value Independently

A key criterion for a service to be standing alone as a separate code base and a separately deployable entity is that it should provide some value to the users — ideally the end users of the application. A useful heuristic to determine whether or not a service satisfies this criterion is to think about whether most enhancements to the service would result in benefits perceivable by the user. If in a vast majority of updates the service can only provide such user benefit by having to also get other services to release enhancements, then the service has failed the criterion.

Services Providing Shared Internal Value: Coupling Non-Divergent Dependent Paths

What about services that offer capabilities internally to other services and not directly to the end user? For instance, there might be a service that offers a certain specialty queuing that is required for the application. In such cases, the question becomes whether the capabilities provided by the service have just one internal client or several internal clients.

If, apart from a few exceptional cases where the call path diverges, a service ends up calling exactly one other service most of the time, then there is little benefit in separating that service from its predominant dependency. Another useful heuristic: if a circuit breaks and a service is unable to reach one of its dependencies, can the calling service still provide anything at all to its users, or nothing?

Avoiding Talkative Services with Heavy Payloads

Providing value is also about the cost efficiency of designing as multiple services versus combining as a single service. One such aspect that was highlighted in the Prime Video case was chatty network calls. This could be a double whammy because it not only results in additional latency before a response goes back to the user, but it might also increase your bandwidth costs.

This would be more problematic if you have large or several payloads moving around between services across network boundaries. To mitigate, one could consider the use of a storage service, so one doesn’t need to move the payload around, rather only an identifier of the payload and only services that need it to consume it.

However even if an ID is passed around, if several services along the call path need to inspect or operate on the payload, those would need to pull the payload down from the storage service which would completely nullify and possibly worsen the situation.

How and where payloads are handled should be an important part of designing service boundaries, thereby influencing the number of services in the system.

Testability and Deployability

Finally, one more consideration would be the cost of rapidly testing and deploying services. Consider a scenario wherein a majority of the time multiple services need to be simultaneously enhanced in order to provide a feature enhancement to the user.

Feature testing would involve testing all of those services together. This could potentially result in release bottlenecks or necessitate complex release control and testing mechanisms such as feature flags or blue-greening sets of services, among other things. This tendency is a sure-shot sign of the disadvantageous proliferation of too many discrete parts.

Too many teams fall into the trap of building "service enhancements" in every release, with those enhancements doing little for the end user because a number of other pieces from other services still need to come together. Such a highly coupled architecture complicates both dependency management and versioning, with delays in delivering "end user value".

Value-Based Services, Not 'Micro' Services!

Architecture should be able to deliver value to the end users a majority of the time by the release of individual services independently. Considerations of coupling, dependencies, ease of testing and frequency of deployment matter more, while the size of the service itself has limited usefulness other than for applying reasonable limits on becoming too gigantic or too nano-sized.

There may be other, more esoteric reasons for splitting out multiple services, such as the way our teams are organized (Conway's law, anyone?) or providing flexibility with languages and frameworks, but these are rarely real needs for providing value in enterprise software development.

One could very well have a performant cost-efficient architecture that delivers “value” with a diverse mix of services of various sizes — some big, some micro, and others somewhere in between. Think of it as a “value-based services architecture” rather than a “microservices-based architecture” that enables services to deliver value quickly and independently. Because value always eats size for lunch!

The post In the Great Microservices Debate, Value Eats Size for Lunch appeared first on The New Stack.

Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All https://thenewstack.io/amazon-prime-videos-microservices-move-doesnt-lead-to-a-monolith-after-all/ Tue, 13 Jun 2023 13:00:28 +0000 https://thenewstack.io/?p=22710277


In any organizational structure, once you break down regular jobs into overly granularized tasks and delegate them to too many individuals, their messaging soon becomes unmanageable, and the organization stops growing.

Last March 22, in a blog post that went unnoticed for several weeks, Amazon Prime Video’s engineers reported the service quality monitoring application they had originally built to determine quality-of-service (QoS) levels for streaming videos — an application they built on a microservices platform — was failing, even at levels below 10 percent of service capacity.

What’s more, they had already applied a remedy: a solution their post described as “a monolith application.”

The change came at least five years after Prime Video — home of on-demand favorites such as “Game of Thrones” and “The Marvelous Mrs. Maisel” — successfully outbid traditional broadcast outlets for the live-streaming rights to carry NFL Thursday Night Football.

One of the leaders in on-demand streaming now found itself in the broadcasting business, serving an average 16.6 million real-time viewers simultaneously. To keep up with live sports viewers’ expectations of their “networks” — in this case, CBS, NBC, or Fox — Prime Video’s evolution needed to accelerate.

It wasn’t happening. When the 2022 football season kicked off last September, too many of Prime Video’s tweets were prefaced with the phrase, “We’re sorry for the inconvenience.

Prime Video engineers overcame these glitches, the engineers’ blog reported, by consolidating QoS monitoring operations that had been separated into isolated AWS Step Functions and Lambda functions, into a unified code module.

As initially reported, their results appeared to finally confirm many organizations’ suspicions, well-articulated over the last decade, that the costs incurred in maintaining system complexity and messaging overhead inevitably outweighed any benefits to be realized from having adopted microservices architecture.

Once that blog post awakened from its dormancy, several experts declared all of microservices architecture dead. “It’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system,” wrote Ruby on Rails creator David Heinemeier Hansson.  “Are we seeing a resurgence of the majestic monolith?” asked .NET MVP Milan Jovanović on Twitter. “I hope so.”

“That’s great news for Amazon because it will save a ton of money,” declared Jeff Delaney on his enormously popular YouTube channel Fireship, “but bad news for Amazon because it just lost a great revenue source.”

Yet there were other experts, including CodeOpinion.com’s Derek Comartin, who compared Prime’s “before” and “after” architectural diagrams with one another, and noticed some glaring disconnects between those diagrams and their accompanying narrative.

As world-class experts speaking with The New Stack also noticed, and as a high-ranking Amazon Web Services engineer finally confirmed for us, the solution Prime Video adopted does not fit the profile of a monolithic application. In every respect that truly matters, including scalability and functionality, it is a more evolved microservice than what Prime Video had before.

That Dear Perfection

“This definitely isn’t a microservices-to-monolith story,” remarked Adrian Cockcroft, the former vice president of cloud architecture strategy at AWS, now an advisor for Nubank, in an interview with The New Stack. “It’s a Step Functions-to-microservices story. And I think one of the problems is the wrong labeling.”

Cockcroft, as many regular New Stack readers will be familiar, is one of microservices architecture’s originators, and certainly its most outspoken champion. He has not been directly involved with Prime Video or AWS since becoming an advisor, but he’s familiar with what actually happened there, and he was an AWS executive when Prime’s stream quality monitoring project began. He described for us a kind of prototyping strategy where an organization utilizes AWS Step Functions, coupled with serverless orchestration, for visually modeling business processes.

With this adoption strategy, an architect can reorganize digital processes essentially at will, eventually discovering their best alignment with business processes. He’s intimately familiar with this methodology because it’s part of AWS’ best practices — advice which he himself co-authored. Speaking with us, Cockcroft praised the Prime Video team for having followed that advice.

As Cockcroft understands it, Step Functions was never intended to run processes at the scale of live NFL sports events. It’s not a staging system for processes whose eventual, production-ready state would need to become more algorithmic, more efficient, more consolidated. So the trick to making the Step Functions model workable for more than just prototyping is not just to make the model somewhat scalable, but also transitional.

“If you know you’re going to eventually do it at some scale,” said Cockcroft, “you may build it differently in the first place. So the question is, do you know how to do the thing, and do you know the scale you’re going to run it at? Those are two separate cases. If you don’t know either of those, or if you know it’s small-scale, complex, and you’re not exactly sure how it’s going to be built, then you want to build a prototype that’s going to be very fast to build.”

However, he suggested, if an organization knows from the outset its application will be very widely deployed and highly scalable, it should optimize for that situation by investing in more development time up-front. The Prime Video team did not have that luxury. In that case, Cockcroft said, the team was following best practices: building the best system they could, to accomplish the business objectives as they interpreted them at the time.

“A lot of workloads cost more to build than to run,” Cockcroft explained. “[For] a lot of internal corporate IT workloads, lots of things that are relatively small-scale, if you’re spending more on the developers than you are on the execution, then you want to optimize for saving developer time by building it super-quickly. And I think the first version… was optimized that way; it wasn’t intended to run at scale.”

As any Step Functions-based system becomes refined, according to those same best practices, the next stage of its evolution will be transitional. Part of that metamorphosis may involve, contrary to popular notions, service consolidation. Despite how Prime Video’s blog post described it, the result of consolidation is not a monolith. It’s now a fully-fledged microservice, capable of delivering those 90% cost reductions engineers touted.

“This is an independently scalable chunk of the overall Prime Video workload,” described Cockcroft. “If they’re not running a live stream at the moment, it would scale down or turn off — which is one reason to build it with Step Functions and Lambda functions to start with. And if there’s a live stream running, it scales up. That’s a microservice. The rest of Prime Video scales independently.”

Following the publication of this article, an AWS spokesperson contacted The New Stack offering further advice on how Step Functions may be put to use within organizations. Many AWS customers, including Liberty Mutual and Taco Bell, the spokesperson told us, begin their architectural plan with Step Functions, and have chosen to stick with it as their deployments scale up and out. The Prime Video stream QoS service that was the topic of the original Prime blog post, the spokesperson asserted, is one of many services the streamer utilizes on the AWS platform, and many of those others may continue to use Step Functions for the foreseeable future.

The New Stack spoke with Ajay Nair, AWS’ general manager for Lambda and for its managed container service App Runner. Nair confirmed Cockcroft’s account in its entirety for how the project was initially framed in Step Functions, as well as how it ended up a scalable microservice.

Nair outlined for us a typical microservices development pattern. Here, the original application’s business processes may be too rigidly coupled together to allow for evolution and adaptation. So they’re decoupled and isolated. This decomposition enables developers to define the contracts that spell out each service’s expected inputs and outputs, requirements and outcomes. For the first time, business teams can directly observe the transactional activities that, in the application’s prior incarnations, had been entirely obscured by its complexity and unintended design constraints.

From there, Nair went on, software engineers may codify the isolated serverless functions as services. In so doing, they may further decompose some services — as AWS did for Amazon S3, which is now served by over 300 microservice classes. They may also consolidate other services. One possible reason: Observing their behavior may reveal they actually did not need to be scaled independently after all.

“It is a natural evolution of any architecture where services that are built get consolidated and redistributed,” said Nair. “The resulting capability still has a well-established contract, [and] has a single team managing and deploying it. So it technically meets the definition of a microservice.”

Breakdown

“I think the definition of a microservice is not necessarily crisp,” stated Brendan Burns, the co-creator of Kubernetes, now corporate vice president at Microsoft, in a note to The New Stack.

“I tend to think of it more in terms of capabilities around functionality, scaling, and team size,” Burns continued. “A microservice should be a consistent function or functions — this is like good object-oriented design. If your microservice is the CatAndDog() service, you might want to consider breaking that into Cat() and Dog() services. But if your microservice is ThatOneCatOnMyBlock(), it might be a sign that you have broken things down too far.”

“The level of granularity that you decompose to,” explained F5 Networks Distinguished Engineer Lori MacVittie, speaking with The New Stack, “is still limited by the laws of physics, by network speed, by how much [code] you’re actually wrapping around. Could you do it? Could you do everything as functions inside a containerized environment, and make it work? Yes. It’d be slow as heck. People would not use it.”

Adrian Cockcroft advises that the interpretability of each service’s core purpose, even by a non-developer, should be a tenet of microservice architecture itself. That fact alone should mitigate against poor design choices.

“It should be simple enough for one person to understand how it works,” Cockcroft advocated. “There are lots of definitions of microservices, but basically, you’ve partitioned your problem into multiple, independent chunks that are scaled independently.”

“Everything we’re describing,” remarked F5’s MacVittie, “is just SOA without the standards… We’re doing the same thing; it’s the same pattern. You can take a look at the frameworks, objects, and hierarchies, and you’d be like, ‘This is not that much different than what we’ve been doing since we started this.’ We can argue about that. Who wins? Does it matter? Is Amazon going to say, ‘You’re right, that’s a big microservice, thank you?’ Does it change anything? No. They have solved a problem that they had, by changing how they design things. If they happen to stumble on what they should have been doing in the first place, according to the experts on the Internet, great. It worked for them. They’re saving money, and they did expose one of those problems with decomposing something too far, on a set of networks on the Internet that is not designed to handle it yet.

“We are kinda stuck by physics, right?” she continued.  “We’re unlikely to get any faster than we are right now, so we have to work around that.”

Perhaps you’ve noticed: Enterprise technology stories thrive on dichotomy. For any software architecture to be introduced to the reader as something of value, vendors and journalists frame it in opposition to some other architecture. When an equivalent system or methodology doesn’t yet exist, the new architecture may end up being portrayed as the harbinger of a revolution that overturns tradition.

One reason may be because the discussion online is being led either by vendors, or by journalists who tend to speak with vendors first.

“There is this ongoing disconnect between how software companies operate, and how the rest of the world operates,” remarked Platify Insights analyst Donnie Berkholz. “In a software company, you’ve got ten times the staffing and software engineering on a per capita basis across the company, as you do in many other companies. That gives you a lot of capacity and talent to do things that other people can’t keep up with.”

Maybe the big blazing “Amazon” brand obscured the fact — despite the business units’ proximity to one another — that Prime Video was a customer of AWS. With its engineers’ blog post, Prime joined an ongoing narrative that may have already spun out of control. Certain writers may have focused so intently upon selected facets of microservices architecture, that they let readers draw their own conclusions about what the alternatives to that architecture must look like. If microservices were, by definition, small (an aspect that one journalist in particular was guilty as hell of over-emphasizing), its evil counterpart must be big, or bigness itself.

Subsequently, in a similar confusion of scale, if Amazon Prime Video embraces a monolith, so must all of Amazon. Score one come-from-behind touchdown for monoliths in the fourth quarter, and cue the Thursday Night Football theme.

“We’ve seen the same thing happening over and over across the years,” mentioned Berkholz. “The leading-edge software companies, web companies, and startups encounter a problem because they’re operating at a different scale than most other companies. And a few years later, that problem starts to hit the masses.”

Buildup

The original “axis of evil” in the service-orientation dichotomy was 1999’s Big Ball of Mud. First put forth by Professors Brian Foote and Joseph Yoder of the University of Illinois at Urbana-Champaign, the Big Ball helped catalyze a resurgence in support for distributed systems architecture. It was seated at the discussion table where the monolith sits now, but not for the same reasons.

The Big Ball wasn’t a daunting tower of rigid, inflexible, tightly-coupled processes, but rather programs haphazardly heaped onto other programs, with data exchanged between them by means of file dumps onto floppy disks carried down office staircases in cardboard boxes. Amid the digital chaos of the 1990s and early 2000s, anything definable as not a Big Ball of Mud, was already halfway beautiful.

“Service Oriented Architecture was actually the same idea as microservices,” recalls Forrester senior analyst David Mooter. “The idea was, you create services that align with your business capabilities and your business operating model. Most organizations, what they heard was, ‘Just put stuff [places] and do a Web service,’ [the result being] you just make things SOAP. And when you create haphazard SOAP, you create Distributed Little Balls of Mud. SOA got a bad name because everyone was employing SOA worst practices.”

Mooter shared some of his latest opinions in a Forrester blog post entitled, “The Death of Microservices?” In an interview with us, he noted, “I think you’re seeing, with some of the reaction to this Amazon blog, when you do microservices worst practices, and you blame microservices rather than your poor architectural decisions, then everyone says microservices stink… Put aside microservices: Any buzzword tech trend cannot compensate for poor architectural decisions.”

The sheer fact that “Big Ball” is a nebulous, plastic metaphor has enabled almost any methodology or architecture that fell out of favor over the past quarter-century, to become associated with it. When microservices makes inroads with organizations, it’s the monolith that gets to wear the crown of thorns. More recently, with some clever phraseology, microservices has carried the moniker of shame.

“Our industry swings like a pendulum between innovation, experimentation, and growth (sometimes just called ‘peacetime’) and belt-tightening and pushing for efficiency (‘wartime’),” stated Laura Tacho, long-time friend of The New Stack, and a professional engineering coach.  “Of course, most companies have both scenarios going on in different pockets, but it’s obvious that we’re in a period of belt-tightening now. This is when some of those choices — for example, breaking things into microservices — can no longer be justified against the efficiency losses.”

Berkholz has been observing the same trend: “There’s been this push back-and-forth within the industry — some sort of a pendulum happening, from monolith to microservices and back again. Years ago, it was SOA and back again.”

Defenders of microservices against the mud-throwing that happens when the pendulum swings back, say their architecture won’t be right for every case, or even every organization. That’s a problem. Whenever a market is perceived as being served by two or more equivalent, competing solutions, that market may correctly be portrayed as fragmented. Which is exactly the kind of market enterprises typically avoid participating in.

“Fragmentation implies that the problem hasn’t been well-solved for everybody yet,” Berkholz told us, “when there’s a lot of different solutions, and nobody’s consolidated on a single one that makes sense most of the time. That is something that companies watch. Is this a fragmented ecosystem, where it’s hard to make choices? Or is this an ecosystem where there’s a clear and obvious master?”

From time to time, Lori MacVittie told us, F5 Networks surveys its clients, asking them for the relative percentages of their applications portfolios they would describe as monoliths, microservices, mobile apps and middleware-infused client/server apps.  “Most organizations were operating at some percentage of each of those,” she told us. When the question was adjusted, asking only whether their apps were “traditional” or “modern,” the split usually has been 60/40, respectively.

“They’re doing both,” she said. “And within those, they’re doing different styles. Is that a mess? I don’t think so. They had specific uses for them.”

“I kind of feel like microservice-vs.-monolith isn’t a great argument,” stated Microsoft’s Brendan Burns. “It’s like arguing about vectors vs. linked lists or garbage collection vs. memory management. These designs are all tools — what’s important is to understand the value that you get from each, and when you can take advantage of that value. If you insist on microservicing everything, you’re definitely going to microservice some monoliths that probably you should have just left alone. But if you say, ‘We don’t do microservices,’ you’re probably leaving some agility, reliability and efficiency on the table.”

The Big Ball of Mud metaphor’s creators cited, as the reason software architectures become bloated and unwieldy, Conway’s Law: “Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” Advocates of microservices over the years have taken this notion a few steps further, suggesting business structures and even org charts should be deliberately remodeled to align with software, systems, and services.

When the proverbial pendulum swings back, notes Tacho, companies start reconsidering this notion. “Perhaps it’s not only Conway’s Law coming home to roost,” she told us, “but also, ‘Do market conditions allow us to take a gamble on ignoring Conway’s Law for the time being, so we could trade efficiency for innovation?’”

Continuing her war-and-peace metaphor, Tacho went on: “Everything’s a tradeoff. Past decisions to potentially slow development down and make processes less efficient due to microservices might have been totally fine during peacetime, but having to continuously justify those inefficiencies, especially during a period of belt-tightening, is tiresome. What surprises me sometimes is that rearchitecting a large codebase is not something that most companies would invest in during wartime. They simply have to have other priorities with a better ROI for the business, but big fish like Amazon have more flexibility.”

“The first thing you should look at is your business,” advised Forrester’s Mooter, “and what is the right architecture for that? Don’t start with microservices. Start with, what are the business outcomes you’re trying to achieve? What Forrester calls, ‘Outcome-Driven Architecture.’ How do we align our IT systems and infrastructure and applications, to optimize your ability to deliver that? It will change over time.”

“It’s definitely the case,” remarked Microsoft’s Burns, “that one of the benefits of microservices design is that it enables small teams to behave autonomously because they own very specific APIs with crisp contracts between teams. If the rest of your development culture prevents your small teams from operating autonomously, then you’re never going to gain the agility benefits of microservices. Of course, there are other benefits too, like increased resiliency and potentially improved efficiency from more optimal scaling. It’s not an all-or-nothing, but it’s also the case that an engineering culture that is structured for independence and autonomy is going to do better when implementing microservices. I don’t think that this is that much different than the cultural changes that were associated with the DevOps movement a decade ago.”

Prime Video made a huge business gamble on NFL football rights, and the jury is still out as to whether, over time, that gamble will pay off. That move lit a fire under certain sensitive regions of Prime Video’s engineering team. The capabilities they may have planned to deliver three to five years hence were suddenly needed now. So they made an architectural shift — perhaps the one they’d planned on anyway, or maybe an adaptation. Did they enable business flexibility down the road, as their best practices advised? Or have they just tied Prime Video down to a service contract, to which their business will be forced to adapt forever? Viewed from that perspective, one could easily forget which option was the monolith, and which was the microservice.

It’s a dilemma we put to AWS’ Ajay Nair, and his response bears close scrutiny, not just by software engineers: “Building an evolvable architectural software system is a strategy, not a religion.”

Update: Since publication, this story has been updated with additional material from AWS around Step Functions.

The post Amazon Prime Video’s Microservices Move Doesn’t Lead to a Monolith after All appeared first on The New Stack.

]]>
Case Study: A WebAssembly Failure, and Lessons Learned https://thenewstack.io/webassembly/case-study-a-webassembly-failure-and-lessons-learned/ Thu, 25 May 2023 14:00:55 +0000 https://thenewstack.io/?p=22708922

VANCOUVER — In their talk “Microservices and WASM, Are We There Yet?” at the Open Source Summit North America, Kingdon

The post Case Study: A WebAssembly Failure, and Lessons Learned appeared first on The New Stack.

]]>

VANCOUVER — In their talk “Microservices and WASM, Are We There Yet?” at the Linux Foundation’s Open Source Summit North America, Kingdon Barrett, of Weaveworks, and Will Christensen, of Defense Unicorns, said they were surprised as anyone that their talk was accepted since they were newbies who had spent about three weeks delving into this nascent technology.

And their project failed. (Barrett argued, “It only sort of failed  … We accomplished the goal of the talk!”)

But they learned a lot about what WebAssembly, or Wasm, can and cannot do.

“Wasm has largely delivered on its promise in a browser and in apps, but what about for microservices?” the pair’s talk synopsis summarized. “We didn’t know either, so we tried to build a simple project that seemed fun, and learned Wasm for microservices is not as mature and a bit more complicated than running in the browser.”

“Are we there yet? Not really. There’s some caveats,” said Christensen. “But there are a lot of things that do work, but it’s not enough that I wouldn’t bet the farm on it kind of thing.”

Finding Wasm’s Limitations

Barrett, an open source support engineer at Weaveworks, called WebAssembly “this special compiled bytecode language that works on some kind of like a virtual machine that’s very native toward JavaScript. It’s definitely shown that is significantly faster than, let’s say, JavaScript running with the JIT (just-in-time compiler).

“And when you write software to compile for it, you just need to treat it like a different target — like on x86 or Arm architectures; we can compile to a lot of different targets.”

The speakers found there are limitations or design constraints, if you will:

  • You cannot access the network in an unpermissioned way.
  • You cannot pass a string as an argument to a function.
  • You cannot access the file system unless you have specified the things that are permitted.

“There is no string type,” Barrett said. “As far as I can tell, you have to manage memory and count the bytes you’re going to pass. Make sure you don’t lose that number. That’s a little awkward, but there is a way around that as well.”

The talk was part of the OpenGovCon track at the conference.

“We came up with this concept, being the government space, that I thought was going to be really interesting for an ATO perspective” — authorized to operate — “which is, how do you enable continuous delivery while still maintaining a consistent environment?” Christensen said.

The government uses ATO certification to manage risk in contractors’ networks by evaluating the security controls for new and existing systems.

One of the big potential benefits for government contractors with Wasm, Christensen said, is the ability to use existing code and to retain people with deep knowledge in a particular language.

“You can use that, tweak it a little bit and get life out of it,” he said. “You may have some performance losses where there may be some nuances, but largely you can retain a lot of that domain language or that sort of domain knowledge and carry it over for the future.”

Barrett and Christensen set out to write a Kubernetes operator.

“I wanted to write something in Go … so all your functions for this or wherever you need come in the event hooks,” Christensen said.

Then, instead of calling a function or a class that you have inside of that monolithic operator design, the idea is that you can somehow reference the last value in an external store. It could be a Redis cache, a database or object storage. Wasm modules are small enough that the binary can be loaded at initialization.

If cold start times are not a problem, you could write something that will accept a request, pull a Wasm module, load it, run it and return the result.

And, Christensen continued, “if you really want to get creative, you can shove it in as a config map inside of Kubernetes and … whatever you want to do, but the biggest thing is Wasm gets pulled in. And the idea is you call it almost like a function, and you just execute it.

“And each one of those executions would be a sandbox so you can control the exposure and security and what’s exposed throughout the entire operator. … You could statically compile the entire operator and control it that way. Anyone who wants to work in the sandbox with modules, they would have the freedom within the sandbox to execute. This is the dream. … Well, it didn’t work.”

The idea was that there would be stringent controls in a sandbox about how the runtime would be exposed to the Wasm module, which would include logging and traceability for compliance.

Runtimes and Languages

WebAssembly is being hailed for its ability to compile from any language, though Andrew Cornwall, a Forrester analyst, told The New Stack that it’s easier to compile languages that do not have garbage collectors, so languages such as Java, Python and interpreted languages tend to be more difficult to run in WebAssembly than languages such as C or Rust.

Barrett and Christensen took a few runtimes and languages for (ahem) a spin. Here’s what they found:

Fermyon Spin

RuntimeClass has been available since Kubernetes v1.12. It’s easy to get started with, but light on controls. The design requires privileged access to your nodes, and containerd shims control which nodes get provisioned with the runtime.

Kwasm

“There’s a field on the deployment class called runtimeClassName, and you can set that to whatever you want, as long as containerd knows what that means. So Kwasm operator breaks into the host node and sets up some containerd configuration imports of binary from wherever — this is not production ready,” Barrett said, unless you already had separate controls around all of those knobs and know how to authorize that type of grant safely.

He added, “Anyway, this was very easy to get your Wasm modules to run directly on Kubernetes this way, despite it does require privileged access to the nodes and it’s definitely not ATO.”
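
For orientation, the mechanism both of these approaches lean on is Kubernetes’ RuntimeClass plus a containerd shim on the node. A minimal sketch of what that wiring can look like follows; the class name, handler and image are assumptions that depend on which shim (Spin, WasmEdge and so on) was installed on the node:

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: wasmedge          # arbitrary name referenced by workloads
handler: wasmedge         # must match the containerd shim configured on the node
---
apiVersion: v1
kind: Pod
metadata:
  name: wasm-demo
spec:
  runtimeClassName: wasmedge
  containers:
  - name: demo
    image: registry.example.com/wasm-demo:latest   # hypothetical OCI image wrapping the Wasm module

Only nodes whose containerd configuration knows about the handler can run such pods, which is exactly the provisioning step the Kwasm operator automates and why it needs privileged access.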

WASI/WAGI

WASI (WebAssembly System Interface) provides system interfaces; WAGI (WebAssembly Gateway Interface) permits standard IO to be treated as a connection.

“Basically, you don’t have to handle connections, the runtime handles that for you,” Barrett said. “That’s how I would summarize WAGI, and WASI is the system interface that makes that possible. You have standard input, standard output, you have the ability to share memory, and functions — you can import them or export them, call them from inside or outside of the Wasm, but only in ways that you permit.”

WasmEdge

WasmEdge Runtime, based on C++, became a Cloud Native Computing Foundation project in 2021.

The speakers extolled an earlier talk at the conference by Michael Yuan, a maintainer of the project, and urged attendees to look for it.

Wasmer/Wasmtime

Barrett and Christensen touted the documentation on these runtime projects.

“There are a lot of language examples that are pretty much parallel to what I went through … and it started to click for me,” Barrett said. “I didn’t really understand WASI at first, but going through those examples made it pretty clear.”

They’re designed to get you thinking about low-level constructs of Wasm:

  • What is possible with a function, memory, compiler.
  • How to exercise these directly from within the host language.
  • How to separate your business logic.
  • Constraints in these environments will help you scope your project’s deliverable functions down smaller and smaller.

Wasmtime or Wasmer run examples in WAT (WebAssembly Text Format), a textual representation of the Wasm binary format, something to keep in mind when working in a language like Go. If you’re trying to figure out how to call modules in Go and it’s not working, check out Wazero, the zero-dependency WebAssembly runtime written in Go, Barrett said.

Rust

It has first-class support and the most documentation, the speakers noted.

“If you have domain knowledge of Rust already, you can start exploring right now how to use Wasm in your production workflow,” Christensen said.

Node.js/Deno

Wasm was first designed for use in web browsers. There’s a lot of information out there already about the V8 engine running code that wasn’t just JavaScript in the browser. V8 is implemented in C++ with support for JavaScript. That same V8 engine is found at the heart of NodeJS and Deno. The browser-native JavaScript runtimes in something like Node.js or Deno are what made their use with Wasm so simple.

“A lot of the websites that had the integration already with the V8 engine, so we found that from the command line from a microservices perspective was kind of really easy to implement,” Christensen said.

“So the whole concept about the strings part, about passing it with a pointer, if you’re running Node.js and Deno, you can pass strings natively and you don’t even know it’s any different. …Using Deno, it was really simple to implement. …There are a lot of examples that we’ve discovered, one of which is ‘Hello World,’ actually works. I can compile it so it actually runs and can pass a string and get a string out simply from a web assembly module with Deno.”

Christensen said that Deno or Node.js currently provides the best combination of production-ready Wasm support and a sufficient developer experience.

A Few Caveats

“But a little bit of warning when you go to compile,” Christensen said. “What we have discovered is: all WASM is not compiled the same.”

There are three compilers for Wasm:

  • Singlepass doesn’t have the fastest runtime, but has the fastest compilation.
  • Cranelift is the main engine used in Wasmer and Wasmtime. It doesn’t produce the fastest runtime code, but it compiles much faster than LLVM, sitting between the other two.
  • LLVM has the slowest compile time. No one who’s ever used LLVM is surprised there, but it is the fastest runtime.

A Few Problems

Pointer functions for handling strings are problematic. String passing, specifically with Rust, even when done correctly, could decrease performance by up to 20 times, they said.

There is a significant difference between compiled and interpreted languages when compiled to a Wasm target. Wasm binaries for Ruby and Python may see 20 to 50MB size penalties compared to Go or Rust because of the inclusion of the interpreter.

“And specifically, just because we’re compiling Ruby or Python to Wasm, you do need to compile the entire interpreter into it,” Christensen said. “So that means if you are expecting Wasm to be better for boot times and that kind of stuff, if you’re using an interpreted language, you are basically shoving the entire interpreter into the Wasm binary and then running your code to be on the interpreter. So please take note that it’s not a uniform experience.”

“If you’re using an interpreted language, it’s still interpreted in Wasm,” Barrett said. “If you’re passing the script itself into Wasm, the interpreter is compiled in Wasm but the script is still interpreted.”

And Christensen added, “You’re restricted to the runtime restrictions of the browser itself, which means sometimes they may be single-threaded. Good, bad, just be aware.”

A web browser, Deno and Node.js all use the V8 engine, meaning they all exhibit the same limitations when running Wasm.

And language threading needs to be known at runtime for both host and module.

“One thing I’ve noticed: in Go, if I use the HTTP module to do a request from a Wasm-compiled Go module from Deno, there is no way that I can turn around and make sure that’s not gonna break the threaded nature of Deno and that V8 engine,” Christensen said.

He added, “Maybe there’s an answer there, but I didn’t find it. So if you are just getting started and you’re just trying to mess around and try to find all that happening, just know that you may spend some time there.”

And what happens when you have a C dependency with your RubyGem?

Barrett said he didn’t try that at all.

“Most Ruby dependencies are probably native Ruby, not native extensions,” he said. “They’re pure Ruby, but a ‘native extension’ is Ruby compiling C code. And then you have to deal with C code now,” in addition to Ruby.

“Of course, C compiles to Wasm, so I’m sure there is a solution for this. But I haven’t found anyone who has solved it yet.”

It applies to some Python packages as well, Christensen said.

“They [Python eggs] are using the binary modules as well, so there is definitely no way to do a [native system] binary translation into Wasm — binary to binary,” he said. “So if you need to do it, you need to get your hands dirty, compile the library itself to Wasm, then compile whatever gem or package that function calls are there.”

The speakers said that in working with Wasm, they found that ChatGPT wasn’t very helpful and that debugging can be harsh.

So, Should You Be Excited about Wasm?

“Yes. There’s plenty of reasons to be excited,” Christensen said. “It may not be ready yet, but I definitely think it’s enough to move forward and start playing around yourself.”

When Wasm is fully mature, he said, it will have benefits in terms of tech workforce retention, especially in governmental organizations: “You can take existing workforce, you don’t have to re-hire and you can get longevity out of them. Especially to have all that wonderful domain knowledge and you don’t have to re-solve the same problem using a new tool.

“If you have a lot of JavaScript stuff, [you’ll have] better control over it and it runs faster, which is the whole reason why Wasm is interesting,” Christensen said. The reason is that JavaScript compiled to Wasm is much faster, as the V8 engine no longer has to do “just-in-time” operations.

“And then finally, I’m sure a lot of you have an ARM MacBook, and then you try to deploy something to the cloud,” he said. “And next thing you realize, ‘Oh look, my entire stack is in x86.’ Well, Wasm magically does take care of this. I did test this out on a Mac Mini and ran it on a brand new AMD 64 system and Deno couldn’t tell the difference.”

WebAssembly is ready to be tested, Christensen said, and the open source community is the way to make that happen.

“Let the maintainers know; start talking about it. Bring up issues. We need more working examples. That’s missing. We can’t even get ChatGPT to give us anything decent,” he said, so the community is relying on its members to experiment with it and share their experiences.

The post Case Study: A WebAssembly Failure, and Lessons Learned appeared first on The New Stack.

]]>
RabbitMQ Is Boring, and I Love It https://thenewstack.io/rabbitmq-is-boring-and-i-love-it/ Mon, 15 May 2023 13:30:32 +0000 https://thenewstack.io/?p=22707624

RabbitMQ is boring. Very boring. And we tend not to think about boring things. RabbitMQ, like the electrical grid, is

The post RabbitMQ Is Boring, and I Love It appeared first on The New Stack.

]]>

RabbitMQ is boring. Very boring. And we tend not to think about boring things. RabbitMQ, like the electrical grid, is entirely uninteresting — until it stops working. The human brain is conditioned to recognize and respond to pain and peril more than peace, so we tend only to remember the traumas in life. In this post, I want to try to change that. Let’s talk about RabbitMQ, an open source message broker I’ve been using for the better part of 15 years — happily and bored.

My background is in, among other things, messaging and integration technologies. Unfortunately, legacy systems are often hostile, mostly because those who came before us did not foresee the highly distributed nature of today’s modern, API-dominant architectures, such as cloud native computing and microservices.

There are many ways to approach integration. In their book, “Enterprise Integration Patterns,” Gregor Hohpe and Bobby Woolf talk about four approaches: shared databases, messaging, remote procedure call (RPC) and file transfer. Integration is all about optionality and coupling: How do we take services that don’t know about each other and make them work together without overly coupling them by proximity or time? Messaging, that’s how. Messaging is the integration approach that comes with batteries included. It has all the benefits and few, if any, of the drawbacks of the three other integration styles. Messaging means I can sleep at night. Messaging is boring. I love boring.

With the support of multiple open protocols, such as AMQP 0.9, 1.0, MQTT, STOMP and others, RabbitMQ gives people options and flexibility and interoperability, so a service written in Python can communicate with another in C# or Java, and both can be none the wiser.

A Beautiful Indirection 

RabbitMQ has a straightforward programming model: Clients send messages to exchanges. The exchange acts as the broker’s front door, accepting incoming messages and routing them onward. An exchange looks at the incoming message and the message’s headers — and sometimes at one special header in particular, called a routing key — and decides to which queue (or queues) it should send the message. These exchanges can even send messages to other brokers. Queues are the thing consumers consume from. This beautiful indirection is why inserting an extra hop between a producer and a consumer is possible without affecting the producer or the consumer.

It all seems so straightforward — and boring! — when you think about it. But you wouldn’t believe how many people got this stuff wrong from the get-go. Let’s look at Java Message Service (JMS), the Java-standardized API for messaging. It has no concept of an exchange, so it is impossible to reroute a message (without sidestepping the JMS interfaces) once a producer and a consumer connect. Meanwhile, some JMS brokers couple consumers and producers to the broker through the Java client driver used to talk to it. If the client supports version X of the broker, and someone has upgraded the broker to X+1, then the producer and the consumer may need to upgrade their Java client drivers to X+1.

Born in 2007, RabbitMQ was conceived due to the need for large banks to standardize their digital systems so they and their customers (that’s us) can transact more easily. RabbitMQ implemented the AMQP protocol from the jump; it’s still the most popular way to connect to the broker today. But it’s not the only way.

Let Me Count the Ways

As mentioned, RabbitMQ supports multiple protocols, which certainly offers choice, but there are other benefits as well. Like MQTT, popular in the Internet-of-Things space, where millions of clients — think microwaves, refrigerators and cars — might need to communicate with a single broker in a lightweight, efficient way. This work’s ongoing and keeps getting better by the day. For example, Native MQTT was recently announced, dramatically reducing memory footprint and increasing scalability.

RabbitMQ supports federation and active/passive deployments. It has various approaches to storing the messages in RAM or on disk. It supports transactions. It’s speedy, and it guarantees the consistency of your data. However, RabbitMQ has traditionally served messaging and integration use cases, not stream processing pipelines.

The community around RabbitMQ is vibrant, burgeoning and fast-moving, and the last few years have been incredibly prolific. Have you tried RabbitMQ Streams? Streams are a new persistent and replicated data structure that models an append-only log with nondestructive consumer semantics. You can use Streams from a RabbitMQ client library as a plain ol’ queue or through a dedicated binary protocol plugin and associated clients for even better throughput and performance.

To say it’s been successful would be an understatement. StackShare states that, among others, Reddit, Robinhood, Zillow, Backbase, Hello Fresh and Alibaba Travels all use RabbitMQ.

There are drivers for virtually every language, platform and paradigm. For example, I work on the Spring team, and we have several increasingly abstract ways by which you can use RabbitMQ, starting with the Spring for RabbitMQ foundational layer and going all the way up to the support in Spring Cloud Data Flow, our stream and batch processing stack.

RabbitMQ is open and extensible, supporting unique features with plugins to the server and extensions to the AMQP protocol.

The RabbitMQ site has a nonexhaustive list of some of the best ones. They include things like publisher confirms, dead letter exchanges, priority queues, per-message and per-queue TTL (time to live values tell RabbitMQ how long an unacknowledged message is allowed to remain in a queue before being deleted), exchange-to-exchange bindings and so much more. These are extensions to the protocol itself, implemented in the broker.

Plugins are slightly different. Numerous plugins extend the broker proper and introduce new management, infrastructure and engine capabilities. For example, there are plugins to support Kubernetes service discovery, OAuth 2, LDAP, WAN federation, STOMP and so much more.

What does all this mean for me? It means that, like PostgreSQL, RabbitMQ is a Swiss army knife. Does it do everything that the costly alternatives from Tibco or IBM do? No. But I’ll bet it can do 95% of whatever I’d need, and it’ll do so in a way that leaves my options open in the future. (And you can’t beat the price!)

Maybe it’s just all those years spent wearing a pager or running into data centers with winter jackets on at 3 a.m., but I actively avoid running anything I can’t charge for directly. I prefer to have someone run RabbitMQ for me. It’s cheap and easy enough to do so on several different cloud providers or the Kubernetes distribution of your choice.
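
If Kubernetes is where it ends up, the RabbitMQ Cluster Operator keeps things appropriately boring: the broker becomes just another declarative resource. A minimal sketch, assuming the operator is already installed (the cluster name and replica count here are arbitrary):

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: boring-rabbit     # arbitrary cluster name
spec:
  replicas: 3             # a three-node cluster for availability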

As a developer, RabbitMQ couldn’t be more boring. As an operator in production, RabbitMQ couldn’t be more boring. I love boring, and I love RabbitMQ.

If boring appeals to you too, I encourage you to learn more about RabbitMQ, how it works as an event-streaming broker, how it compares to Kafka and about the beta of RabbitMQ as a service.

The post RabbitMQ Is Boring, and I Love It appeared first on The New Stack.

]]>
How OpenSearch Visualizes Jaeger’s Distributed Tracing https://thenewstack.io/how-opensearch-visualizes-jaegars-distributed-tracing/ Thu, 11 May 2023 17:00:12 +0000 https://thenewstack.io/?p=22706795

We all know how important observability is. Open source tooling is always a popular option. The complexity of selecting tooling

The post How OpenSearch Visualizes Jaeger’s Distributed Tracing appeared first on The New Stack.

]]>

We all know how important observability is. Open source tooling is always a popular option. The complexity of selecting tooling is always a challenge. Typically, we end up with several best-of-breed tools in use in most organizations, which include many different projects and databases.

As organizations continue to implement microservices-based architectures and cloud native technologies, operational data is becoming increasingly large and complex. Because of the distributed nature of the data, the old approach of sorting through logs is not scalable.

As a result, organizations are continuing to adopt distributed tracing as a way of gaining insight into their systems. Distributed tracing helps determine where to start investigating issues and ultimately reduces the time spent on root cause analysis. It serves as an observability signal that captures the entire lifecycle of a particular request as it traverses distributed services. Traces can have multiple service hops, called spans, that comprise the entire operation.

Jaeger

One of the most popular open source solutions for distributed tracing is Jaeger. Jaeger is an open source, end-to-end solution hosted by the Cloud Native Computing Foundation (CNCF). Jaeger leverages data from instrumentation SDKs that are OpenTelemetry (OTel) based and supports multiple open source data stores, such as Cassandra, OpenSearch and Elasticsearch, for trace storage.

While Jaeger does provide a UI solution for visualizing and analyzing traces along with monitoring data from Prometheus, OpenSearch now provides the option to visualize traces in OpenSearch Dashboards, the native OpenSearch visualization tool.

Trace Analytics

OpenSearch provides extensive support for log analytics and observability use cases. Starting with version 1.3, OpenSearch added support for distributed trace data analysis with the Observability feature. Using Observability, you can analyze the crucial rate, errors, and duration (RED) metrics in trace data. Additionally, you can evaluate various components of your system for latency and errors and pinpoint services that need attention.

The OpenSearch Project launched the trace analytics feature with support for OTel-compliant trace data provided by Data Prepper — the OpenSearch server-side data collector. To incorporate the popular Jaeger trace data format, in version 2.5 OpenSearch introduced the trace analytics feature in Observability.

With Observability, you can now filter traces to isolate the spans with errors in order to quickly identify the relevant logs. You can use the same feature-rich analysis capabilities for RED metrics, contextually linking traces and spans to their related logs, which are available for the Data Prepper trace data. The following image shows how you can view traces with Observability.

Keep in mind that the OTel and Jaeger formats have several differences, as outlined in OpenTelemetry to Jaeger Transformation in the OpenTelemetry documentation.

Try It out

To try out this new feature, see the Analyzing Jaeger trace data documentation. The documentation includes a Docker Compose file that shows you how to add sample data using a demo and then visualize it using trace analytics. To enable this feature, you need to set the --es.tags-as-fields.all flag to true, as described in the related GitHub issue. This is necessary because of an OpenSearch Dashboards limitation.
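
For orientation, the relevant piece of such a Compose file looks roughly like the snippet below; the service name, image tag and OpenSearch URL are assumptions, and the important part is the flag passed on the collector’s command line:

services:
  jaeger-collector:
    image: jaegertracing/jaeger-collector:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch        # OpenSearch uses Jaeger's Elasticsearch-compatible storage
    command:
      - "--es.server-urls=https://opensearch:9200"
      - "--es.tags-as-fields.all=true"         # required for the OpenSearch Dashboards trace analytics view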

In Dashboards, you can see the top service and operation combinations with the highest latency and the greatest number of errors. Selecting any service or operation will automatically direct you to the Traces page with the appropriate filters applied, as shown in the following image. You can also investigate any trace or service on your own by applying various filters.

Next Steps

To try the OpenSearch trace analytics feature, check out the OpenSearch Playground or download the latest version of OpenSearch. We welcome your feedback on the community forum!

The post How OpenSearch Visualizes Jaeger’s Distributed Tracing appeared first on The New Stack.

]]>
Spring Cloud Gateway: The Swiss Army Knife of Cloud Development https://thenewstack.io/spring-cloud-gateway-the-swiss-army-knife-of-cloud-development/ Mon, 08 May 2023 13:47:40 +0000 https://thenewstack.io/?p=22707347

A microservice has to fulfill many functional and nonfunctional requirements. When implementing one, I mostly start with the happy path

The post Spring Cloud Gateway: The Swiss Army Knife of Cloud Development appeared first on The New Stack.

]]>

A microservice has to fulfill many functional and nonfunctional requirements. When implementing one, I mostly start with the happy path to see if I meet these requirements. And with all the other nonfunctional requirements, like protecting my service or scaling various parts independently, I love to work with Spring Cloud Gateway as this tiny and, IMHO, underrated tool is powerful, even with just a few lines of configuration.

What Is Spring Cloud Gateway?

Spring Cloud Gateway is an open source, lightweight and highly customizable API gateway that provides routing, filtering and load-balancing functionality for microservices. It is built on top of Spring Framework and integrates easily with other Spring Cloud components.

If you’re new to Spring Cloud Gateway, this article outlines some common use cases where it can come in handy and requires minimal configuration.

How to Get Started with Spring Cloud Gateway

The easiest way to start experimenting with Spring Cloud Gateway is by using Spring Initializr. So let’s go to start.spring.io and generate a project stub. Pick the project type, language and the versions you want, and be sure to add the Spring Cloud Gateway dependency. Once you are done, hit the Download button.

What you get is a basic project structure: a standard Maven or Gradle layout with a generated application class under src/main/java and an application configuration file under src/main/resources.

And this is already a fully working, almost production-ready Spring Cloud Gateway. The important part is the application.yml, which will hold all further configuration. Now let’s add some magic.

Protect Services through Rate Limiting

Sometimes it’s necessary to protect your service from misbehaving clients to ensure availability for correctly behaving ones. In that case, Spring Cloud Gateway can help with its rate-limiting capabilities. By combining it with a KeyResolver, you can correctly identify all your clients and assign them a quota of requests they are allowed per second.

It also offers a burst mode, where you can get above the assigned quota for a short period of time to cope with sudden bursts of requests.

As the gateway, in that case, requires some sort of shared state to track request counts, you should combine it with an attached Redis cache. This also allows for horizontally scaling your gateways.

spring:
  cloud:
    gateway:
      default-filters:
      - name: RequestRateLimiter
        args:
          # how many requests per second each client is allowed (the rate at which tokens are replenished)
          redis-rate-limiter.replenishRate: 10
          # the maximum number of requests a client can make in a single second (burst)
          redis-rate-limiter.burstCapacity: 20
          # the key resolver bean used to identify the client
          key-resolver: "#{@apiKeyResolver}"
      routes:
      - id: example
        uri: http://example.com
        predicates:
        - Path=/**


The KeyResolver is registered as a Spring bean named apiKeyResolver, which is what the SpEL expression #{@apiKeyResolver} in the configuration refers to. It can hold any custom logic required to distinguish different clients. In this example, it’s just the X-API-Key header, but another common choice is an Authorization header or some sort of session cookie.

package com.example;

import org.springframework.cloud.gateway.filter.ratelimit.KeyResolver;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

// Registered as a bean named "apiKeyResolver" so that the SpEL expression
// "#{@apiKeyResolver}" in application.yml resolves to this class.
@Component("apiKeyResolver")
public class ApiKeyResolver implements KeyResolver {

    @Override
    public Mono<String> resolve(ServerWebExchange exchange) {
        return Mono.justOrEmpty(exchange.getRequest().getHeaders().getFirst("X-API-Key"));
    }
}


With this, you can easily add a rate-limiting capability to your microservice without having to implement this within the service itself.
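
As mentioned above, the only extra piece the RequestRateLimiter needs is a Redis instance that every gateway instance can reach, along with the spring-boot-starter-data-redis-reactive dependency. A minimal sketch of the connection settings using Spring Boot 3 property names (the host here is an assumption):

spring:
  data:
    redis:
      host: redis.internal.example.com   # assumed address of the shared Redis instance
      port: 6379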

Adding a Global Namespace to Various Microservices

As services mature, it might be necessary to make the step from version 1 to version 2. But implementing the new version in the same deployable unit carries a risk: while working on version 2, you might accidentally change something related to version 1. So why not leave version 1 as it is and implement the new version as a separate deployment unit?

A simple configuration can help bring these two applications under a common name to seem as though they are one unit of deployment.

spring:
  cloud:
    gateway:
      routes:
      - id: api-v1
        uri: http://v1api.example.com/
        predicates:
        - Path=/v1/**
      - id: api-v2
        uri: http://v2api.example.com/
        predicates:
        - Path=/v2/**


This allows us to access our microservice with the URLs https://api.example.com/v1/* and https://api.example.com/v2/* but have them be deployed and maintained separately.

Scale Subcontext of Your Services Independently

In the previous tip, we showed how to bring together components that need to be together. But this can extend to other scenarios as well. So let’s assume our microservice is simply too big to be managed by a single team. We can use the same configuration to unite different logical parts of the service under one common umbrella so that they look like one service.

spring:
  cloud:
    gateway:
      routes:
      - id: locations
        uri: http://locationapi.example.com/
        predicates:
        - Path=/v1/locations/**
      - id: weather
        uri: http://weatherapi.example.com/
        predicates:
        - Path=/v1/weather/**


This would also allow us to independently scale the different parts of the API as needed.

AB Test Your Service with a Small Customer Group

AB testing is common when developing new applications since it is a good way of testing whether your service meets requirements without having to roll it out completely. AB testing gives only a small group of users access to the new version of a service and asks them whether they like it. One group could be your colleagues within the company network, for example.

spring:
  cloud:
    gateway:
      routes:
      - id: example-a
        uri: http://example.com/service-a
        predicates:
        - Header=X-AB-Test, A
        - RemoteAddr=192.168.1.1/24
      - id: example-b
        uri: http://example.com/service-b
        predicates:
        - Header=X-AB-Test, B
        - RemoteAddr=192.168.10.1/24
      - id: example-default
        uri: http://example.com/service-default
        predicates:
        - Path=/**


In this example, the gateway evaluates the routes and their predicates in the order they are defined, and the first route whose predicates all match is selected. This means:

  • Customers coming from the 192.168.1.1/24 network with the X-AB-Test header set to A will be presented with service variant A.
  • Customers coming from the 192.168.10.1/24 network with the X-AB-Test header set to B will be presented with service variant B.
  • All other customers will see the service-default variant, as its catch-all Path=/** predicate always matches.

Protect Services by Adding Authentication

This is not really a specific Spring Cloud Gateway feature, but it’s also handy in combination with the gateway. Imagine you have a blackbox service and you want to enhance its implementation without having the source code or the permission to change it.

In this case, a gateway as a reverse proxy in front can help add features like authentication.

spring:
  security:
    oauth2:
      resourceserver:
        jwt:
          issuer-uri: https://client.idp.com/oauth2/default
          audience: api://default
  cloud:
    gateway:
      routes:
      - id: api
        uri: http://api.example.com/
        predicates:
        - Path=/**


As seen here, the gateway part takes care of forwarding traffic to the target, while the Spring Security resource-server configuration requires every incoming request to present a valid OAuth2 token from the configured issuer before it is passed through.

Add Audit Logging to Your Services

Another scenario for enhancing existing services might be adding an audit log to the service.

Want to know who’s calling which of your service’s operations? Add some audit logging. This needs a bit more implementation work, as I haven’t found a ready-made filter for it.

spring:
  cloud:
    gateway:
      routes:
      - id: backend
        uri: http://backend.example.com
        predicates:
        - Path=/**
        filters:
        - name: RequestLogger


The RequestLogger can be implemented as a Spring bean. Writing it as a GatewayFilterFactory is what allows the route configuration above to reference it by the name RequestLogger:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.gateway.filter.GatewayFilter;
import org.springframework.cloud.gateway.filter.OrderedGatewayFilter;
import org.springframework.cloud.gateway.filter.factory.AbstractGatewayFilterFactory;
import org.springframework.core.Ordered;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.server.reactive.ServerHttpRequest;
import org.springframework.stereotype.Component;

// Extending AbstractGatewayFilterFactory lets the route configuration
// reference this filter by its name ("RequestLogger") in application.yml.
@Component
public class RequestLogger extends AbstractGatewayFilterFactory<Object> {

    private static final Logger LOGGER = LoggerFactory.getLogger(RequestLogger.class);

    @Override
    public GatewayFilter apply(Object config) {
        // Wrap the logic in an OrderedGatewayFilter so it runs first in the filter chain.
        return new OrderedGatewayFilter((exchange, chain) -> {
            ServerHttpRequest request = exchange.getRequest();
            HttpMethod method = request.getMethod();
            String uri = request.getURI().toString();
            HttpHeaders headers = request.getHeaders();
            String authHeader = headers.getFirst(HttpHeaders.AUTHORIZATION);
            LOGGER.info("Request - method: {}, uri: {}, authHeader: {}", method, uri, authHeader);
            return chain.filter(exchange);
        }, Ordered.HIGHEST_PRECEDENCE);
    }
}


In this implementation, the RequestLogger filter logs the HTTP method, Uniform Resource Identifier (URI) and authorization header of each request using the SLF4J logging framework. The filter is implemented as a Spring @Component extending AbstractGatewayFilterFactory, which is what allows the route configuration to reference it by name in the filters property of the application.yml file. Wrapping the logic in an OrderedGatewayFilter with the highest precedence ensures that it runs first in the filter chain.

Protect Services by a Circuit Breaker

In some cases, rate limiting, as discussed previously, is not enough to operate a service safely, and if response times for your service get too high, it’s sometimes necessary to cut off traffic for a short period of time to let the service recover itself. This is where all the classical resilience patterns can help. A pattern like the circuit breaker has to be implemented outside of the affected service. The Spring Cloud Gateway can also help here.

spring:
  cloud:
    gateway:
      routes:
      - id: slow-service
        uri: http://example.com/slow-service
        predicates:
        - Path=/slow/**
        filters:
        - name: CircuitBreaker
          args:
            name: slow-service
            fallbackUri: forward:/fallback/slow-service
            statusCodes:
              - SERVICE_UNAVAILABLE
      - id: fast-service
        uri: http://example.com/fast-service
        predicates:
        - Path=/fast/**

# Resilience4J settings for the "slow-service" circuit breaker and its time limiter.
# This assumes the Reactor Resilience4J circuit breaker starter is on the classpath,
# which supports property-based configuration of named instances.
resilience4j:
  timelimiter:
    instances:
      slow-service:
        timeoutDuration: 1s
  circuitbreaker:
    instances:
      slow-service:
        slidingWindowSize: 5
        permittedNumberOfCallsInHalfOpenState: 2
        failureRateThreshold: 50
        waitDurationInOpenState: 10s


In this example, calls to the slow service are wrapped in a circuit breaker named slow-service. The time limiter cuts a call off if it takes longer than 1,000 milliseconds, and once enough calls fail, the circuit opens and the service gets 10 seconds to rest before traffic is let back in. In the meantime, requests go to the fallback URI, which can be served by a small controller inside the gateway. This fallback can be a simple error message, a “Temporarily Unavailable” status code or maybe some more helpful implementation, depending on the use case.

More Creative Ways of Using Spring Cloud Gateway

Like Legos, there are endless possibilities, combining all these building blocks to build whatever is needed. One of the most creative ways of using this that I have seen is the following:

A team encountered a challenge when using autoscaling in combination with Java applications. They realized that newly started applications were much slower than those that were already running. While this is a normal behavior for Java applications due to the just-in-time (JIT) compiling process, it affects end-user experience. The team added a Spring Cloud Gateway for load balancing to all these services and configured it as load balancing based on the average response time. So backends with fast average response times would get more traffic than backend instances with slower average response times. Overall, this allowed freshly started instances to warm up their JIT without negatively affecting overall performance, and it also helped reduce traffic to overloaded instances, leading to better performance overall for the end user.

As you can see, there are a lot of possibilities where this highly flexible API gateway can be used in a wide range of products. Due to its extensibility, you can easily add new features and capabilities by simply implementing custom filters or predicates and letting the traffic flow just the way you need. That’s why this is one of my favorite tools for shaping microservice landscapes.

The post Spring Cloud Gateway: The Swiss Army Knife of Cloud Development appeared first on The New Stack.

]]>
Return of the Monolith: Amazon Dumps Microservices for Video Monitoring https://thenewstack.io/return-of-the-monolith-amazon-dumps-microservices-for-video-monitoring/ Thu, 04 May 2023 14:23:21 +0000 https://thenewstack.io/?p=22707172

A blog post from the engineering team at Amazon Prime Video has been roiling the cloud native computing community with

The post Return of the Monolith: Amazon Dumps Microservices for Video Monitoring appeared first on The New Stack.

]]>

A blog post from the engineering team at Amazon Prime Video has been roiling the cloud native computing community with its explanation that, at least in the case of video monitoring, a monolithic architecture has produced superior performance to a microservices- and serverless-led approach.

For a generation of engineers and architects raised on the superiority of microservices, the assertion is shocking indeed. In a microservices architecture, an application is broken into individual components, which then can be worked on and scaled independently.

“This post is an absolute embarrassment for Amazon as a company. Complete inability to build internal alignment or coordinated communications,” wrote analyst Donnie Berkholz, who recently started his own industry-analyst firm Platify.

“What makes this story unique is that Amazon was the original poster child for service-oriented architectures,” weighed in Ruby-on-Rails creator and Basecamp co-founder David Heinemeier Hansson, in a blog item posted Thursday. “Now the real-world results of all this theory are finally in, and it’s clear that in practice, microservices pose perhaps the biggest siren song for needlessly complicating your system. And serverless only makes it worse.”

In the original post, dated March 22, Amazon Prime Senior Software Development Engineer Marcin Kolny explained how moving the video stream monitoring to a monolithic architecture reduced costs by 90%. It turns out that the way components from Amazon Web Services were combined hampered scalability and sent costs skyrocketing.

The Video Quality Analysis (VQA) team at Prime Video initiated the work.

The task was to monitor the thousands of video streams that Prime Video delivered to customers. Originally this work was done by a set of distributed components orchestrated by AWS Step Functions, a serverless orchestration service, and the AWS Lambda serverless service.

In theory, the use of serverless would allow the team to scale each service independently. It turned out, however, that at least for how the team implemented the components, they hit a hard scaling limit at only 5% of the expected load. The costs of scaling up to monitor thousands of video streams would also be unduly expensive, due to the need to send data across multiple components.

Initially, the team tried to optimize individual components, but this did not bring about significant improvements. So, the team moved all the components into a single process, hosting them on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS).

Takeaway

Kolny was careful to mention that the architectural decisions made by the video quality team may not work in all instances.

“Microservices and serverless components are tools that do work at high scale, but whether to use them over monolith has to be made on a case-by-case basis,” he wrote.

To be fair, the industry has been looking to temper the enthusiasm of microservices over the past decade, stressing it is only good in some cases.

“As with many good ideas, this pattern turned toxic as soon as it was adopted outside its original context, and wreaked havoc once it got pushed into the internals of single-application architectures,” Hansson wrote. “In many ways, microservices is a zombie architecture. Another strain of an intellectual contagion that just refuses to die.”

The IT world is nothing if not cyclical: an architectural trend derided as hopelessly archaic one year can be the new hot thing the following year. Certainly, over the past decade when microservices ruled (and the decade before when web services did), we’ve heard more than one joke in the newsroom about “monoliths being the next big thing.” Now it may actually come to pass.

The post Return of the Monolith: Amazon Dumps Microservices for Video Monitoring appeared first on The New Stack.

]]>
Cloud Native Basics: 4 Concepts to Know  https://thenewstack.io/cloud-native-basics-4-concepts-to-know/ Thu, 27 Apr 2023 16:42:17 +0000 https://thenewstack.io/?p=22706518

To stay competitive, companies must adjust and adapt their technology stack to accelerate their digital transformation. This means engineering teams

The post Cloud Native Basics: 4 Concepts to Know  appeared first on The New Stack.

]]>

To stay competitive, companies must adjust and adapt their technology stack to accelerate their digital transformation. This means engineering teams now experience exponential data growth that is starting to outgrow underlying infrastructure. That requires durable infrastructure that can support rapid data growth and high availability. With cloud native architecture, companies can meet all their availability requirements and effectively store data in real time.

So what is cloud native? Well, cloud native is an approach to building and running applications that takes full advantage of cloud computing technology. If something is “cloud native,” then it is designed and coded from the start of the application development process to run on a cloud architecture such as Kubernetes.

At its core, cloud native is about designing applications as a collection of microservices, each of which can be deployed independently and scaled horizontally to meet demand. This allows for greater flexibility because developers can update specific services as needed, instead of updating the entire application.

Such agility lets engineering teams rapidly deploy and update applications through agile development, containers and orchestration. It also provides improved scalability because teams can easily spin up containers in response to traffic demand, which maximizes resource usage and reduces cost. Additionally, applications that are distributed across multiple servers or nodes mean that one component’s failure does not bring down the entire system.

The 4 Basic Cloud Native Components

Before your organization implements any sort of cloud native architecture, it’s important to understand its basic components. The four pillars of cloud native are microservices, DevOps, open source standards and containers.

No. 1: Microservices are the foundation of cloud native architecture because they offer several benefits, including scalability, fault tolerance and agility. Microservices are smaller and more focused than monolithic applications, which makes them easier to develop, test and deploy. This allows teams to move faster and respond more quickly to changing business requirements and application needs. Plus, a failure in one microservice does not cause an outage of the entire application. This means that developers can replace or update individual microservices and not disrupt the entire system.

No. 2: DevOps is a set of practices that emphasize collaboration and communication between development and operations teams. Its goal is to deliver software faster and more reliably. DevOps plays a critical role in enabling continuous delivery and deployment of cloud native architecture. DevOps teams collaborate to rapidly test and integrate code changes, and focus on automating as much of the deployment process as possible. Another key aspect of DevOps in a cloud native architecture is the use of Infrastructure as Code (IaC) tools, which allow for declarative configuration of infrastructure resources. DevOps’ focus on CI/CD enables products and features to be released to market faster; improves software quality; ensures that secure coding practices are met and reduces cost for the organization; and improves collaboration between the development and operations teams.

No. 3: There are a variety of industrywide open source standards such as Kubernetes, Prometheus and the Open Container Initiative. These cloud native open source standards are important for several reasons:

  • They help organizations avoid vendor lock-in by ensuring that applications and infrastructure are not tied to any particular cloud provider or proprietary technology.
  • Open source standards promote interoperability between different cloud platforms, technologies and organizations to integrate their environments with a wide range of tools and services to meet business needs.
  • Open source standards foster innovation as they allow developers and organizations to collaborate on new projects and coding advancements for cloud native architectures across the industry.
  • Open source standards are developed through a community-driven process, which ensures that the needs and perspectives of a wide range of stakeholders are considered.

No. 4: Containers enable organizations to package applications into a standard format to easily deploy and run on any cloud platform. Orchestration on the other hand, is the process of managing and automating the deployment, scaling and management of containerized applications. Containers and orchestration help build and manage scalable, portable and resilient applications. This allows businesses to quickly respond to market changes, which gives them a competitive advantage so they can constantly implement value-add features and keep customer-facing services online.
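
To make that concrete, here is a minimal sketch of a containerized service described declaratively for an orchestrator such as Kubernetes; the names and image are placeholders:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: storefront
spec:
  replicas: 3                       # the orchestrator keeps three copies running and can scale this number
  selector:
    matchLabels:
      app: storefront
  template:
    metadata:
      labels:
        app: storefront
    spec:
      containers:
      - name: storefront
        image: registry.example.com/storefront:1.0.0   # placeholder container image
        ports:
        - containerPort: 8080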

Chronosphere + Cloud Native 

Cloud native practices offer significant business benefits, including faster time-to-market, greater scalability, improved resilience, reduced costs, and better application agility and flexibility. With cloud native adoption, organizations can improve their software development processes and deliver better products and services to their customers.

When migrating to a cloud native architecture, teams must have observability software to oversee system health. Observability tools provide real-time visibility into system performance that help developers to quickly identify and resolve issues, optimize system performance and design better applications for the cloud.

Built specifically for cloud native environments, Chronosphere provides a full suite of observability tools for your organization to control data cardinality and understand costs with the Chronosphere control plane, and assist engineering teams with cloud native adoption.

The post Cloud Native Basics: 4 Concepts to Know  appeared first on The New Stack.

]]>
Kubernetes Evolution: From Microservices to Batch Processing Powerhouse https://thenewstack.io/kubernetes-evolution-from-microservices-to-batch-processing-powerhouse/ Sun, 16 Apr 2023 17:00:54 +0000 https://thenewstack.io/?p=22704735

Kubernetes has come a long way since its inception in 2014. Initially focused on supporting microservice-based workloads, Kubernetes has evolved

The post Kubernetes Evolution: From Microservices to Batch Processing Powerhouse appeared first on The New Stack.

]]>

Kubernetes has come a long way since its inception in 2014.

Initially focused on supporting microservice-based workloads, Kubernetes has evolved into a powerful and flexible tool for building batch-processing platforms. This transformation is driven by the growing demand for machine learning (ML) training capabilities, the shift of high-performance computing (HPC) systems to the cloud, and the evolution towards more loosely coupled mathematical models in the industry.

Recent work by PGS to use Kubernetes to build a compute platform equivalent to the world’s seventh-ranked supercomputer, with 1.2 million vCPUs, but running in the cloud on Spot VMs, is a great highlight of this trend.

In its early days, Kubernetes was primarily focused on building features for microservice-based workloads. Its strong container orchestration capabilities made it ideal for managing the complexity of such applications.

However, users with batch workloads frequently preferred to rely on other frameworks like Slurm, Mesos, HTCondor, or Nomad. These frameworks provided the necessary features and scalability for batch processing tasks, but they lacked the vibrant ecosystem, community support, and integration capabilities offered by Kubernetes.

In recent years, the Kubernetes community has recognized the growing demand for batch processing support and has made significant investments in this direction. One such investment is the formation of the Batch Working Group, which has undertaken several initiatives to enhance Kubernetes’ batch processing capabilities.

The Batch Working Group has built numerous improvements to the Job API, making it more robust and flexible so it can support a wider range of batch processing workloads. The revamped API allows users to easily manage batch jobs and brings scalability, performance and reliability enhancements.

Kueue (https://kueue.sigs.k8s.io/) is a new job scheduler developed by the Batch Working Group, designed specifically for Kubernetes batch processing workloads. It offers advanced features such as job prioritization, backfilling, resource flavor orchestration and preemption, ensuring efficient and timely execution of batch jobs while keeping resource usage at maximum efficiency.
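
To make this concrete, here is a minimal sketch of how a batch Job might be submitted through the Kubernetes Python client and pointed at a Kueue queue. The queue name, namespace, image and command are hypothetical placeholders, and the suspended-Job plus queue-name label convention reflects Kueue’s documented behavior as I understand it, so treat this as an illustration rather than a definitive integration.

```python
from kubernetes import client, config

# Load credentials from the local kubeconfig (in-cluster config also works).
config.load_kube_config()

# A suspended Job: the queue controller admits it once quota is available.
# The queue name, namespace, image and command are hypothetical placeholders.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(
        name="sample-batch-job",
        labels={"kueue.x-k8s.io/queue-name": "team-a-queue"},  # assumed Kueue label
    ),
    spec=client.V1JobSpec(
        suspend=True,  # let the queue decide when the pods actually start
        parallelism=3,
        completions=3,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="worker",
                        image="registry.example.com/batch-worker:latest",
                        command=["python", "process.py"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```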

The team is now working on building its integrations with various frameworks like Kubeflow, Ray, Spark and Airflow. These integrations allow users to leverage the power and flexibility of Kubernetes while utilizing the specialized capabilities of these frameworks, creating a seamless and efficient batch-processing experience.

There are also a number of other capabilities that the group is looking to deliver. These include job-level provisioning APIs in autoscaling, scheduler plugins, node-level runtime improvements and many others.

As Kubernetes continues to invest in batch processing support, it becomes an increasingly competitive option for users who previously relied on other frameworks. Kubernetes brings a number of advantages to the table, including:

  1. Extensive Multitenancy Features: Kubernetes provides robust security, auditing, and cost allocation features, making it an ideal choice for organizations managing multiple tenants and heterogeneous workloads.
  2. Rich Ecosystem and Community: Kubernetes boasts a thriving open-source community, with a wealth of tools and resources available to help users optimize their batch-processing tasks.
  3. Managed Hosting Services: Kubernetes is available as a managed service on all major cloud providers. This offers tight integrations with their compute stacks, enabling users to take advantage of unique capabilities, and simplified orchestration of harder-to-use scarce resources like Spot VMs or accelerators. Using these services will result in faster development cycles, more elasticity and lower total cost of ownership.
  4. Compute orchestration standardization and portability: Enterprises can choose a single API layer to wrap their computational resources to mix their batch and serving workloads. They can use Kubernetes to reduce lock-in to a single provider and get the flexibility of leveraging the best of all that the current cloud market has to offer.

A user’s transition to Kubernetes usually also involves containerizing their batch workloads. Containers have revolutionized the software development process, and for computational workloads they greatly accelerate release cycles, leading to much faster innovation.

Containers encapsulate an application and its dependencies in a single, self-contained unit that can run consistently across different platforms and environments. They eliminate the “it works on my machine” problem and enable rapid prototyping and faster iteration cycles. Combined with cloud hosting, this agility helps HPC- and ML-oriented companies innovate faster.

The Kubernetes community still needs to solve a number of challenges, including the need for more advanced controls of the runtime on each host node, and the need for more advanced Job API support. HPC users are accustomed to having more control over the runtime.

Setting up large-scale platforms using Kubernetes on premises still requires a significant amount of skill and expertise. There is also currently some fragmentation in the batch processing ecosystem, with different frameworks re-implementing common concepts (like Job, Job Group and Job Queueing) in different ways. Going forward, we’ll see these addressed with each Kubernetes release.

The evolution of Kubernetes from a microservices-focused platform to a powerful tool for batch processing demonstrates the adaptability and resilience of the Kubernetes community. By addressing the growing demand for ML training capabilities and the migration of HPC to the cloud, Kubernetes has become an increasingly attractive option for batch-processing workloads.

Kubernetes’ extensive multitenancy features, rich ecosystem, and managed hosting services on major cloud providers make it a great choice for organizations seeking to optimize their batch-processing tasks and tap into the power of the cloud. If you want to join the Batch Working Group and help contribute to Kubernetes then you can find all the details here. We have regular meetings, a Slack channel and an email group that you can join.

The post Kubernetes Evolution: From Microservices to Batch Processing Powerhouse appeared first on The New Stack.

]]>
What Is Container Monitoring? https://thenewstack.io/what-is-container-monitoring/ Wed, 05 Apr 2023 14:31:21 +0000 https://thenewstack.io/?p=22704515

Container monitoring is the process of collecting metrics on microservices-based applications running on a container platform. Containers are designed to

The post What Is Container Monitoring? appeared first on The New Stack.

]]>

Container monitoring is the process of collecting metrics on microservices-based applications running on a container platform. Containers are designed to spin up and shut down quickly, which makes it essential to know when something goes wrong, as downtime is costly and outages damage customer trust.

Containers are an essential part of any cloud native architecture, which makes it paramount to have software that can effectively monitor and oversee container health and optimize resources to ensure high infrastructure availability.

Let’s take a look at the components of container monitoring, how to select the right software and current offerings.

Benefits and Constraints of Containers

Containers provide IT teams with a more agile, scalable, portable and resilient infrastructure. Container monitoring tools are necessary, as they let engineers resolve issues more proactively, get detailed visualizations, access performance metrics and track changes. Because engineers get all of this data in near-real time, there is good potential to reduce mean time to repair (MTTR).

Engineers must be aware of the limitations of containers: complexity and changing performance baselines. While containers can spin up quickly, they can increase infrastructure sprawl, which means greater environmental complexity. It also can be hard to define baseline performance as containerized infrastructure consistently changes.

Container monitoring must be specifically suited for the technology; legacy monitoring platforms, designed for virtualized environments, are inadequate and do not scale well with container environments. Cloud native architectures don’t rely on dedicated hardware like virtualized infrastructure, which changes monitoring requirements and processes.

How Container Monitoring Works

A container monitoring platform uses logs, tracing, notifications and analytics to gather data.

What Does Container Monitoring Data Help Users Do?

It allows users to:

  • Know when something is amiss
  • Triage the issue quickly
  • Understand the incident to prevent future occurrences

The software uses these methods to capture data on memory utilization, CPU usage, CPU limits and memory limits, to name a few.

Distributed tracing is an essential part of container monitoring. Tracing helps engineers understand containerized application performance and behavior. It also provides a way to identify bottlenecks and latency problems, how changes affect the overall system and what fixes work best in specific situations. It’s very effective at providing insights into the path taken by an application through a collection of microservices when it’s making a call to another system.
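
As a rough illustration of what that instrumentation can look like, here is a minimal sketch using the OpenTelemetry Python SDK to wrap one request’s path through two services in spans. The service and span names are hypothetical, and the console exporter stands in for whatever tracing backend your monitoring platform actually uses.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; a real deployment would export to a collector
# or tracing backend rather than the console.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def charge_payment(order_id: str) -> None:
    # Child span: records the downstream call so its latency shows up in the trace.
    with tracer.start_as_current_span("payment-service.charge") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment microservice here ...

def handle_checkout(order_id: str) -> None:
    # Parent span: one request's journey through the containerized services.
    with tracer.start_as_current_span("checkout.handle") as span:
        span.set_attribute("order.id", order_id)
        charge_payment(order_id)

handle_checkout("order-123")
```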

More comprehensive container monitoring offerings account for all stack layers. They can also produce text-based error data such as “container restart” or “could not connect to database” for quicker incident resolution. Detailed container monitoring means users can learn which types of incidents affect container performance and how shared computing resources connect with each other.

How Do You Monitor Container Health?

Container monitoring requires multiple layers throughout the entire technology stack to collect metrics about the container and any supporting infrastructure, much like application monitoring. Engineers should make sure they can use container monitoring software to track the cluster manager, cluster nodes, the daemon, container and original microservice to get a full picture of container health.

For effective monitoring, engineers must create a connection across the microservices running in containers. Instead of using service-to-service communication for multiple independent services, engineers can implement a service mesh to manage communication across microservices. Doing so allows users to standardize communication among microservices, control traffic, streamline the distributed architecture and get visibility of end-to-end communication.

How to Select a Container Monitoring Tool

In the container monitoring software selection process, it’s important to identify which functions are essential, nice to have or unnecessary. Tools often include these features:

  • Alerts: Notifications that provide information to users about incidents when they occur.
  • Anomaly detection: A function that lets users have the system continuously oversee activity and compare against programmed baseline patterns.
  • Architecture visualization: A graphical depiction of services, integrations and infrastructure that support the container ecosystem.
  • Automation: A service that performs changes to mitigate container issues without human intervention.
  • API monitoring: A function that tracks containerized environment connections to identify anomalies, traffic and user access.
  • Configuration monitoring: A capability that lets users oversee rule sets, enforce policies and log changes within the environment.
  • Dashboards and visualization: The ability to present container data visually so users can quickly see how the system is performing.

Beyond specific features and functions, there are also user experience questions to ask about the software:

  • How quickly and easily can users add instrumentation to code?
  • What is the process for alarm, alert and automation?
  • Can users see each component and layer to isolate the source of failure?
  • Can users view entire application performance for both business and technical organizations?
  • Is it possible to proactively and reactively correlate events and logs to spot abnormalities?
  • Can the software analyze, display and alarm on any set of acquired metrics?

The right container monitoring software should make it easy for engineers to create alarms and automate actions when the system reaches certain resource usage thresholds.

When it comes to container management and monitoring, the industry offers a host of open source and open-source-managed offerings: Prometheus, Kubernetes, Jaeger, Linkerd, Fluentd and cAdvisor are a few examples.

Ways Chronosphere Can Monitor Containers 

Chronosphere’s offering is built for cloud native architectures and Kubernetes to help engineering teams that are collecting container data at scale. Chronosphere’s platform can ingest all standard data for Kubernetes clusters, such as pod and node metrics, using standard ingestion protocols such as Prometheus.

Container monitoring software generates a lot of data. When combined with cloud native environment metrics, this creates a data overload that outpaces infrastructure growth. This makes it important to have tools that can help refine what data is useful so that it gets to the folks who need it the most and ends up on the correct dashboards.

The Control Plane can help users fine-tune which container metrics and traces the system ingests. Plus, with the Metrics Usage Analyzer, users are put back in control of which container observability data is being used and, more importantly, shown when data is not used. Users decide which data is important after ingestion with the Control Plane, so their organization avoids excessive costs across their container and services infrastructure.

To see how Chronosphere can help you monitor your container environments, contact us for a demo today. 

The post What Is Container Monitoring? appeared first on The New Stack.

]]>
How to Fix Kubernetes Monitoring https://thenewstack.io/how-to-fix-kubernetes-monitoring/ Fri, 31 Mar 2023 17:00:32 +0000 https://thenewstack.io/?p=22703174

It’s astonishing how much data is emitted by Kubernetes out of the box. A simple three-node Kubernetes cluster with Prometheus

The post How to Fix Kubernetes Monitoring appeared first on The New Stack.

]]>

It’s astonishing how much data is emitted by Kubernetes out of the box. A simple three-node Kubernetes cluster with Prometheus will ship around 40,000 active series by default! Do we really need all that data?

It’s time to talk about the unspoken challenges of monitoring Kubernetes. The difficulties include not just the bloat and usability of metric data, but also the high churn rate of pod metrics, configuration complexity when running multiple deployments, and more.

This post is inspired by my recent episode of OpenObservability Talks, in which I spoke with Aliaksandr Valialkin, CTO of VictoriaMetrics, a company that offers the open source time series database and monitoring solution by the same name.

Let’s unpack Kubernetes monitoring.

A Bloat of Out-of-the-Box Default Metrics

One of the reasons that Prometheus has become so popular is the ease of getting started collecting metrics. Most of the tools and projects expose metrics in OpenMetrics format, so you just need to turn that on, and then install the Prometheus server to start scraping those metrics.

Prometheus Operator, the standard installation path, installs additional components for monitoring Kubernetes, such as kube-state-metrics, node-exporter and cAdvisor. Using the default Prometheus Operator to monitor even a small 3-node Kubernetes cluster results in around 40,000 different metrics! That’s the starting point, before even adding any applicative or custom metrics.
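
If you want a rough check on that number for your own cluster, one option is to ask Prometheus itself how many active series it is tracking. The sketch below runs an instant query against the standard Prometheus HTTP API; the server address is a placeholder for wherever your Prometheus endpoint is reachable.

```python
import requests

# Placeholder address; substitute your Prometheus server or port-forwarded endpoint.
PROMETHEUS_URL = "http://localhost:9090"

# Instant query that counts every time series currently being scraped.
resp = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query",
    params={"query": 'count({__name__=~".+"})'},
    timeout=10,
)
resp.raise_for_status()

result = resp.json()["data"]["result"]
active_series = int(result[0]["value"][1]) if result else 0
print(f"Active series: {active_series}")
```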

And this number keeps growing at a fast pace. Valialkin shared that since 2018, the amount of metrics exposed by Kubernetes has increased by 3.5 times. This means users are flooded with monitoring data from Kubernetes. Are all these metrics really needed?

Not at all! In fact, the vast majority of these metrics aren’t used anywhere. Valialkin said that 75% of these metrics are never put to use in any dashboards or alert rules. I see quite a similar trend among Logz.io users.

The Metrics We Really Need

Metrics need to be actionable. If you don’t act on them, then don’t collect them. This is even more evident with managed Kubernetes solutions, in which end users don’t manage the underlying system anyway, so many of the exposed metrics are simply not actionable for them.

This drove us to compose a curated set of recommended metrics: essential Kubernetes metrics to be collected, whether from self-hosted Kubernetes or from managed Kubernetes services such as EKS, AKS and GKE. We share our curated sets publicly as part of our Helm charts on GitHub (based on OpenTelemetry, kube-state-metrics and prometheus-node-exporter charts). VictoriaMetrics and other vendors have similarly created their curated lists.

However, we cannot rely on individual vendors to create such sets. And most end-users aren’t acquainted enough with the various metrics to determine themselves what they need, so they look for the defaults, preferring the safest bet of collecting everything so as not to lack important data later.

Rather, we should come together as the Kubernetes and cloud native community, vendors and end-users alike, and join forces to define a standard set of golden metrics for each component. Valialkin also believes that “third-party monitoring solutions should not install additional components for monitoring Kubernetes itself,” referring to additional components such as kube-state-metrics, node-exporter and cadvisor. He suggests that “all these metrics from such companions should be included in Kubernetes itself.”

I’d also add that we should look into removing unused labels. Do we really need the details on each network card or CPU core from prometheus-node-exporter? Each label adds a dimension to the metric, and the number of time series multiplies with every distinct value of that label.
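
A quick back-of-the-envelope calculation shows why this matters. The label counts below are made up for illustration, but the multiplication is exactly how series cardinality grows:

```python
from math import prod

# Hypothetical distinct-value counts for the labels on a single metric,
# such as a per-core CPU metric from node-exporter.
label_cardinality = {
    "node": 50,   # nodes in the cluster
    "cpu": 32,    # CPU cores per node
    "mode": 8,    # idle, user, system, iowait, ...
}

series = prod(label_cardinality.values())
print(f"Time series for one metric: {series:,}")          # 12,800

# Drop the per-core label and the count shrinks by that factor.
without_cpu = prod(v for k, v in label_cardinality.items() if k != "cpu")
print(f"Without the 'cpu' label: {without_cpu:,}")        # 400
```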

Microservices Proliferation

Kubernetes has made it easy to package, deploy and manage complex microservices architectures at scale with containers. The growth in the number of microservices results in an increased load on the monitoring system: Every microservice exposes system metrics, such as CPU, memory, and network utilization. On top of that, every microservice exposes its own set of application metrics, depending on the business logic it implements. In addition, the networking between the microservices needs to be monitored as well for latency, RPS and similar metrics. The proliferation of microservices generates a significant amount of telemetry data, which can get quite costly.

High Churn Rate of Pods

People move to Kubernetes to be more agile and release more frequently. This results in frequent deployments of new versions of microservices. With every deployment in Kubernetes, new instances of pods are created and deleted, in what is known as “pod churn.” The new pod gets a unique identifier, different from previous instances, even if it is essentially a new version of the same service instance.

I’d like to pause here and clarify an essential point about metrics. Metrics data is time series data. Time series is uniquely defined by the metric name and a set of labeled values. If one of the label values changes, then a new time series is created.

Back to our ephemeral pods, many practitioners use the pod name as a label within their metrics time series data. This means that with every new deployment and the associated pod churn, the old time series stops receiving new samples and is effectively terminated, while a new time series is initiated, which causes discontinuity in the logical metric data sequence.

Kubernetes workloads typically have high pod churn rates due to frequent deployments of new versions of a microservice, as well as autoscaling of pods based on incoming traffic, or resource constraints on the underlying nodes that require eviction and rescheduling of pods. The discontinuity of metric time series makes it difficult to apply continuous monitoring on the logical services and analyze trends over time on their respective metrics.

A potential solution can be to use the ReplicaSet or StatefulSet ID for the metric label, as these remain fixed as the set adds and removes pods. Valialkin, however, refers to this as somewhat of a hack, saying we should push as a community to have first-level citizen nomenclature in Kubernetes monitoring to provide consistent naming.

Configuration Complexity with Multiple Deployments

Organizations typically run hundreds and even thousands of different applications. When these applications are deployed on Kubernetes, this results in hundreds and thousands of deployment configurations, and multiple Prometheus scrape_config configurations defining how to scrape (pull) these metrics, rules, filters and relabeling to apply, ports to scrape and other configurations. Managing hundreds and thousands of different configurations can quickly become unmanageable at scale. Furthermore, it can burden the Kubernetes API server, which needs to serve requests on all these different configurations.

As a community, we can benefit from a standard for service discovery of deployments and pods in Kubernetes on top of the Prometheus service discovery mechanism. In Valialkin’s vision, “in most cases Prometheus or some other monitoring system should automatically discover all the deployments, all the pods which need to be scraped to collect metrics without the need to write custom configuration per each deployment. And only in some exceptional cases when you need to customize something, then you can write these custom definitions for scraping.”

Want to learn more? Check out the OpenObservability Talks episode: “Is Kubernetes Monitoring Flawed?” On Spotify, Apple Podcasts, or other podcast apps.

The post How to Fix Kubernetes Monitoring appeared first on The New Stack.

]]>
Saga Without the Headaches https://thenewstack.io/making-the-saga-pattern-work-without-all-the-headaches/ Fri, 24 Mar 2023 17:00:11 +0000 https://thenewstack.io/?p=22703126

Part 1: The Problem with Sagas We’ve all been at that point in a project when we realize that our

The post Saga Without the Headaches appeared first on The New Stack.

]]>

Part 1: The Problem with Sagas

We’ve all been at that point in a project when we realize that our software processes are more complex than we thought. Handling this process complexity has traditionally been painful, but it doesn’t have to be.

A landmark software development playbook called the Saga design pattern has helped us to cope with process complexity for over 30 years. It has served thousands of companies as they build more complex software to serve more demanding business processes.

This pattern’s downside is its higher cost and complexity.

In this post, we’ll first pick apart the traditional way of coding the Saga pattern to handle transaction complexity and look at why it isn’t working. Then, I’ll explain in more depth what happens to development teams that don’t keep an eye on this plumbing code issue. Finally, I’ll show you how to avoid the project rot that ensues.

Meeting the need for durable execution

The Saga pattern emerged to cope with a pressing need in complex software processes: durable execution. When the transactions you’re writing make a single, simple database call and get a quick response, you don’t need to accommodate anything outside that transaction in your code. However, things get more difficult when transactions rely on more than one database, or indeed on other transaction executions, to get things done.

For example, an application that books a car ride might need to check that the customer’s account is in good standing, then check their location, then examine which cars are in that area. Then it would need to book the ride, notify both the driver and the customer, then take the customer’s payment when the ride is done, writing everything to a central store that updates the driver and customer’s account histories.

Processes like these, which chain dependent transactions, need to keep track of data and state throughout the entire sequence of events. They must be able to survive problems that arise in the transaction flow. If a transaction takes more time than expected to return a result (perhaps a mobile connection falters for a moment, or a database hits peak load and takes longer to respond), the software must adapt.

It must wait for the necessary transaction to complete, retrying until it succeeds and coordinating other transactions in the execution queue. If a transaction crashes before completion, the process must be able to roll back to a consistent state to preserve the integrity of the overall application.

This is difficult enough in a use case that requires a response in seconds. Some applications might execute over hours or days, depending on the nature of the transactions and the process they support. The challenge for developers is maintaining the state of the process across the period of execution.

This kind of reliability — a transaction that cannot fail or time out — is known as a strong execution guarantee. It is the opposite of a volatile execution, which can cease to exist at any time without completing everything that it was supposed to do. Volatile executions can leave the system in an inconsistent state.

What seemed simple at the outset turns into a saga with our software as the central character. Developers have to usher it through multiple steps on its journey to completion, ensuring that its state is preserved if something goes wrong.

Understanding the Saga pattern

The Saga pattern provides a road map for that journey. First discussed in a 1987 paper, this pattern brings durable execution to complex processes by enabling them to communicate with each other. A central controller manages that service communication and transaction state.

The pattern offers developers the three things they need for durable execution. It can string together transactions to support long-running processes and guarantee their execution by retrying in the event of failure. It also offers consistency by ensuring that either a process completes entirely or doesn’t complete at all.

However, there’s a heavy price to pay for using the Saga pattern. While there’s nothing wrong with the concept in principle, everything depends on the implementation. Developers have traditionally had to code the pattern themselves as part of their application. That makes its design, deployment and maintenance so difficult that the application can become a slave to the pattern, which ends up taking most of the developers’ time.

Eventually, developers are spending more time maintaining the plumbing code as they add more transactions. What was a linear development workload now becomes exponential. The time spent on development increases disproportionately with every new change.

Coding the Saga pattern manually involves breaking up a coherent process into chunks and then wrapping them with code that manages their operation, including retrying them if they fail. The developer must also manage the scheduling and coordination of these tasks across different processes that depend on each other. They must juggle databases, queues and timers to manage this inter-process communication.
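
To give a feel for what that hand-written plumbing looks like, here is a deliberately simplified sketch in Python of a single saga step with retries and a compensating action. The booking and payment functions are hypothetical stand-ins; a real implementation would also need durable state storage, timers and inter-process coordination, which is exactly the code that grows unmanageable.

```python
import time

class StepFailed(Exception):
    pass

def run_step(action, compensation, max_attempts=5, base_delay=1.0):
    """Run one saga step, retrying on failure and compensating if it never succeeds."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:  # broad on purpose for this sketch
            if attempt == max_attempts:
                compensation()  # roll back to keep the overall process consistent
                raise StepFailed(str(exc)) from exc
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

# Hypothetical transactions from the ride-booking example.
def book_ride():
    print("booking ride")        # would call the booking service here
    return "ride-42"

def cancel_ride():
    print("cancelling ride")     # compensation if a later step fails for good

def charge_customer():
    print("charging customer")   # would call the payment service here

ride_id = run_step(book_ride, compensation=lambda: None)   # nothing to undo yet
run_step(charge_customer, compensation=cancel_ride)        # undo the booking on permanent failure
print(f"saga completed for {ride_id}")
```

Multiply this by every dependent transaction, queue and timer in the system, and the maintenance burden becomes clear.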

Increasing the volume of software processes and dependencies requires more developer hours to create and maintain the plumbing infrastructure, which in turn drives up application cost. This increasing complexity also makes it more difficult for developers to prove the reliability and security of their code, which carries implications for operations and compliance.

Abstraction is the key

Abstraction is the key to retaining the Saga pattern’s durable execution benefits while discarding its negative baggage. Instead of leaving developers to code the pattern into their applications, we must hide the transaction sequencing from them by abstracting it to another level.

Abstraction is a well-understood process in computing. It gives each application the illusion that it owns everything, eliminating the need for the developer to accommodate the shared environment. Virtualization systems do this with the help of a hypervisor. The TCP stack does it by retrying network connections automatically so that developers don’t have to write their own handshaking code. Relational databases do it when they roll back failed transactions invisibly to keep data consistent.

Running a separate platform to manage durable execution brings these benefits to transaction sequencing by creating what Temporal calls a workflow. Developers still have control over workflows, but they need not concern themselves with the underlying mechanics.
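
For contrast, here is a minimal sketch of what the ride-booking steps could look like as a workflow written against Temporal’s Python SDK, as I understand its documented interface. The activity names, arguments and timeouts are hypothetical; the point is that retries and state preservation come from the platform rather than from hand-written plumbing.

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def book_ride(customer_id: str) -> str:
    # Hypothetical call to the booking service.
    return f"ride-for-{customer_id}"

@activity.defn
async def charge_customer(customer_id: str) -> None:
    # Hypothetical call to the payment service.
    ...

@workflow.defn
class RideBookingWorkflow:
    @workflow.run
    async def run(self, customer_id: str) -> str:
        # Each activity is retried and its state preserved by the platform,
        # not by plumbing code inside the application.
        ride_id = await workflow.execute_activity(
            book_ride,
            customer_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
        await workflow.execute_activity(
            charge_customer,
            customer_id,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(maximum_attempts=5),
        )
        return ride_id
```

The workflow reads like straight-line code; the durable execution guarantees sit underneath it.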

Abstracting durable execution to workflows brings several benefits aside from ease of implementation. A tried-and-tested workflow management layer makes complex transaction sequences less likely to fail than home-baked ad-hoc plumbing code. Eliminating thousands of lines of custom code for each project also makes the code that remains easier to maintain and reduces technical debt.

Developers see these benefits most clearly when debugging. Root cause analysis and remediation get exponentially harder when you’re having to mock and manage plumbing code, too. Workflows hide an entire layer of potential problems.

Productive developers are happy developers

Workflow-based durable execution boosts the developer experience. Instead of disappearing down the transaction management rabbit hole, they get to work on what’s really important to them. This improves morale and is likely to help retain them. With the number of open positions for software engineers in the US expected to grow by 25% between 2021 and 2031, competition for talent is intense. Companies can’t afford much attrition.

Companies have been moving in the right direction in their use of the Saga pattern to handle context switching in software processes. However, they can go further by abstracting these Saga patterns away from the application layer to a separate service. Doing this well could move software maturity forward years in an organization.

Part 2: Avoiding the Tipping Point

In the first half of this post, I talked about how burdensome it is to coordinate transactions and preserve the state at the application layer. Now, we’ll talk about how that sends software projects off-course and what you can do about it.

Any software engineering project of reasonable size runs into the need for durable execution.

Ideally, the cost and time involved in creating new software features would be consistent and calculable. Coding for durability shatters that consistency. It makes the effort involved with development look more like a hockey-stick curve than a linear slope.

The tipping point is where the time and effort spent on coding new features begins its upward spike. It’s when the true extent of managing long-term transactions becomes clear. I’ll describe what it is, why it happens and why hurriedly writing plumbing code isn’t the right way to handle it.

What triggers the tipping point

Life before the tipping point is generally good because the developer experience is linear. The application framework that the developers are using supports each new feature they add with no nasty surprises. That enables the development team to scale up the application with predictable implementation times for new features.

This linear scale works as long as developers make quantitative changes, adding more of the same thing. Things often break when someone has to make a change that isn’t like the rest and discovers a shortcoming in the application framework. This is usually a qualitative change that demands a change in the way the application works.

This change might involve calls to multiple databases, or reliance on multiple dependent transactions for the first time. It might call on a software process that takes an unpredictable amount of time to deliver a result.

The change might not be enough to force the tipping point at first, but life for developers will begin to change. They might write the plumbing code to manage the inter-process communication in a bid to guarantee execution and keep transactions consistent. But this is just the beginning. That code took time to write, and now, developers must expand it to cope with every new qualitative change that they introduce.

They’ll keep doing that for a while, but the rot gets worse. Eventually, developers are spending more time maintaining the plumbing code as they add more transactions. What was a linear development workload now becomes exponential. The time spent on development increases disproportionately with every new change.

The “Meeting of Doom”

Some people are unaware of the tipping point until it happens. Junior developers without the benefit of experience often wander into them unaware. Senior developers are often in the worst position of all; they know the tipping point is coming but politics often renders them powerless to do anything other than wait and pick up the pieces.

Eventually, someone introduces a change that surfaces the problem. It is the straw that breaks the camel’s back. Perhaps a change breaks the software delivery schedule and someone with influence complains. Then, someone calls the “Meeting of Doom.”

This meeting is where the team admits that their current approach is unsustainable. The application has become so complex that these ad hoc plumbing changes are no longer supporting project schedules or budgets.

This realization takes developers through the five stages of grief:

  • Denial. This will have been happening for a while. People try to ignore the problem, arguing that it’ll be fine to continue as they are. This gives way to…
  • Anger. Someone in the meeting explains that this will not be fine. Their budgets are broken; their schedules are shot; and the problem needs fixing. They won’t take no for an answer. So people try…
  • Bargaining. People think of creative ways to prop things up for longer with more ad hoc changes. But eventually, they realize that this isn’t scalable, leading to…
  • Depression. Finally, developers realize that they’ll have to make more fundamental architectural changes. Their ad hoc plumbing code has taken on a life of its own, and the tail is now wagging the dog. This goes hand in hand with…
  • Acceptance. Everyone leaves the meeting with a sense of doom and knows that nothing is going to be good after this. It’s time to cancel a few weekends and get to work.

That sense of doom is justified. As I explained, plumbing code is difficult to write and maintain. From the tipping point onward, things get more difficult as developers find code harder to write and maintain. Suddenly, the linear programming experience they’re used to evaporates. They’re spending more time writing transaction management code than they are working through software features on the Kanban board. That leads to developer burnout, and ultimately, attrition.

Preventing the tipping point

How can we avoid this tipping point, smoothing out the hockey-stick curve and preserving a linear ratio between software features and development times? The first suggestion is usually to accept defeat this time around and pledge to write the plumbing code from the beginning next time or reuse what you’ve already cobbled together.

That won’t work. It leaves us with the same problem, which is that the plumbing code will ultimately become unmanageable. Rather than a tipping point, the development would simply lose linearity earlier. You’d create a more gradual decline into development dysphoria beginning from the project’s inception.

Instead, the team needs to do what it should have done at the beginning: make a major architectural change that supports durable execution systematically.

We’ve already discussed abstraction as the way forward. Begin by abstracting the plumbing functions from the application layer into their own service layer before you write a line more of project code. That will unburden developers by removing the non-linear work, enabling them to scale and keeping the time needed to implement new features constant.

This abstraction maintains the linear experience for programmers. They’ll always feel in control of their time, and certain that they’re getting things done. They will no longer need to consider strategic decisions around tasks such as caching and queuing. Neither will they have to worry about bolting together sprawling sets of software tools and libraries to manage those tasks.

The project managers will be just as happy as the developers with an abstracted set of transaction workflows. Certainty and predictability are key requirements for them, which makes the tipping point with its break from linear development especially problematic. Abstracting the task of transaction sequencing removes the unexpected developer workload and preserves that linearity, giving them the certainty they need to meet scheduling and budgetary commitments.

Tools that support this abstraction and the transformation of plumbing code into manageable workflows will help you preserve predictable software development practices, eliminating the dreaded tipping point and saving you the stress of project remediation. The best time to deploy these abstraction services is before your project begins, but even if your team is in crisis right now, it offers a way out of your predicament.

 

The post Saga Without the Headaches appeared first on The New Stack.

]]>
What Is Microservices Architecture? https://thenewstack.io/microservices/what-is-microservices-architecture/ Thu, 23 Feb 2023 12:22:35 +0000 https://thenewstack.io/?p=22701069

Microservices are a hot topic when people talk about cloud native application development, and for good reason. Microservices architecture is

The post What Is Microservices Architecture? appeared first on The New Stack.

]]>

Microservices are a hot topic when people talk about cloud native application development, and for good reason.

Microservices architecture is a structured manner for deploying a collection of self-contained and independent services in an organization. They are game changing compared to some past application development methodologies, allowing development teams to work independently and at cloud native scale.

Let’s dive into the history of application development, characteristics of microservices and what that means for cloud native observability.

Microservices Architecture vs. Monolithic Architecture

The easiest way to understand what microservices architecture does is to compare it to monolithic architecture.

Monolithic Architecture

Monolithic architecture, as the prefix “mono” implies, is software that is written with all components combined into a single executable. There are typically three advantages to this architecture:

  • Simple to develop
    •  Many development tools support the creation of monolithic applications.
  • Simple to deploy
    • Deploy a single file or directory to your runtime.
  • Simple to scale
    • Scaling the application is easily done by running multiple copies behind some sort of load balancer.

Microservices Architecture

Microservices are all about small, self-contained services. The advantages are that these services are highly maintainable and testable, loosely coupled with other services, independently deployable and developed by small, highly productive developer teams. Some service types for a microservices architecture may include the following examples.

  • Client
    Client-side services are used for collecting client requests, such as requests to search, build, etc.
  • Identity Provider
    Before being sent to an API gateway, requests from clients are processed by an identity provider, which is a service that creates, authenticates and manages digital identity information.
  • API Gateway
    An API (application programming interface) is the intermediate system that allows for two services to talk to each other. In the case of microservices, an API gateway is like an entry point: It accepts requests from clients, collects the services needed to fulfill those requests from the backend and returns the correct response.
  • Database
    In microservices, each microservice usually has its own database, which is updated through an API service.
  • Message Formatting
    Services communicate in two types of ways: synchronous messages and asynchronous messages.
    • Synchronous messaging is for when a client waits for a response. This type includes both REST (representational state transfer) and HTTP protocols.
    • Asynchronous messaging is for scenarios where clients don’t wait for an immediate response from services. Common protocols for this type of messaging include AMQP, STOMP and MQTT. (A minimal sketch of both messaging styles follows this list.)
  • Static Content
    Once microservices have finished communicating, any static content is sent to a cloud-based storage system that can directly deliver that content to the client.
  • Management
    Having a management element in a microservices structure can help monitor services and identify failures to keep everything running smoothly.
  • Service Discovery
    For both client-side and server-side services, service discovery is a tool that locates devices and services on a specific network.
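
To make the messaging distinction concrete, here is a minimal Python sketch of both styles: a synchronous REST call that blocks until the response arrives, and an asynchronous publish to an AMQP broker using the pika client. The URLs, queue name and payload are hypothetical placeholders.

```python
import json

import pika
import requests

# Synchronous: the caller blocks until the inventory service responds (or times out).
resp = requests.get("http://inventory-service/api/items/42", timeout=5)  # hypothetical URL
item = resp.json()

# Asynchronous: publish an event to an AMQP broker and move on; a consumer
# picks it up whenever it is ready.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="order-events", durable=True)
channel.basic_publish(
    exchange="",
    routing_key="order-events",
    body=json.dumps({"order_id": 42, "status": "created"}),
)
connection.close()
```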

The monolithic model is more traditional and certainly has some pros, but microservices are becoming more and more common in cloud native environments. The following are the most defining differences between the two models.

Benefits of a Microservices Architecture

Microservices are showing lots of success in modern cloud native businesses. Businesses benefit from using this structure for a number of reasons.

Flexibility: Because each service is independent, the programming language can vary between all microservices (although it’s prudent to standardize as much as possible on one modern programming language).

Faster deployment: Not only are microservices easier to understand for most developers, but they’re also faster to deploy. Change one thing in the code for a monolithic structure, and it affects everything across the board. Microservices are independently deployed and don’t affect other services.

Scalability: If you run everything through one application, it’s hard to manage the massive scale of services as an application grows. Instead, a microservices architecture allows a team to modify the capabilities of an individual service instead of redeploying an entire system. This even applies to the scale of each service within an application. If a payment service, for example, is seeing more demand than other services, that microservice can be scaled as needed.
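
As a rough sketch of what that per-service scaling can look like in a Kubernetes-based deployment, the snippet below bumps the replica count of a single, hypothetical payment service with the Kubernetes Python client, leaving every other service untouched.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Scale only the payment service; other microservices keep their replica counts.
# The deployment name and namespace are hypothetical placeholders.
apps.patch_namespaced_deployment_scale(
    name="payment-service",
    namespace="shop",
    body={"spec": {"replicas": 5}},
)
```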

Isolated failures: With a monolithic architecture, a failure in one service can compromise the entire application. Microservices isolate each component, so if one service fails or has issues, the rest of the application can still function, although likely in a degraded state.

Ultimately, microservices architecture saves teams time, offers more granular modifications to each service and scales with the need of every individual service and interface as a whole.

Challenges

Microservices are incredibly useful, but they do come with their own challenges.

Increased complexity and dependency issues: The design of microservices favors individual services, but that also increases the number of users, the variety of user behavior and the interactions between services. This makes tracing individual traffic from frontend to backend more difficult and can cause dependency issues between services. When microservices are so independent of each other, it’s not always easy to manage compatibility and other effects caused by the different existing versions and workloads.

Testing: Testing can be difficult across the entire application because of how different and variant each executed service route can be. Flexibility is a great element of microservices, but the same diversity can be hard to observe consistently in a distributed deployment.

Managing communication systems: Even though services can easily communicate with each other, developers have to manage the architecture in which services communicate. APIs play an essential role in reliable and effective communication between services, which requires an API gateway. These gateways are helpful, but they can fail, lead to dependency problems or bottleneck communication.

Observability: Because microservices are distributed, monitoring and observability can be a challenge for developers and observability teams. It’s important to consider a monitoring system or platform to help oversee, troubleshoot and observe an entire microservices system.

Should You Adopt Microservices Architecture?

Sometimes a monolithic structure may be the way to go, but microservices architecture is growing for a reason. Ask yourself:

What Environment Are My Applications Working In?

Because most cloud native architectures are designed for microservices, microservices are the way to go if you want to get the full benefits of a cloud native environment. With applications moving to cloud-based settings, application development favors microservices architecture and will continue to do so moving forward.

How Does Your Team Function?

In a microservices architecture, the codebase can typically be managed by smaller teams. Still, development teams also need the tools to identify, monitor and execute the activity of different components, including if and how they interact with each other. Teams also need to determine which services are reusable so they don’t have to start from scratch when building a new service.

How Flexible Are Your Applications?

If you have to consistently modify your application or make adjustments, a microservices approach is going to be best because you can edit individual services instead of the entire monolithic system.

How Many Services Do You Have and Will There Be More?

If you have a lot of services or plan to continue growing — in a cloud-based platform, you should expect growth — then a microservices architecture is also ideal because monolithic application software doesn’t scale well.

Chronosphere Scales with Microservices

Microservices architecture is designed to make application software development in a cloud native environment simpler, not more difficult. With the challenges that come along with managing each individual service, it’s even more critical for teams to have observability solutions that scale with the growth of their business.

Chronosphere specializes in observability for a cloud native world so your team has greater control, reliable functionality and flexible scalability at every level of your application development. Monitoring microservices with a trustworthy and flexible platform greatly lowers risks, helps anticipate failures and empowers development teams to understand their data.

The post What Is Microservices Architecture? appeared first on The New Stack.

]]>
Java’s History Could Point the Way for WebAssembly https://thenewstack.io/webassembly/javas-history-could-point-the-way-for-webassembly/ Thu, 12 Jan 2023 11:00:28 +0000 https://thenewstack.io/?p=22697040

It’s hard to believe that it’s been over 20 years since the great dotcom crash happened in 2001, which continues

The post Java’s History Could Point the Way for WebAssembly appeared first on The New Stack.

]]>

It’s hard to believe that it’s been over 20 years since the great dotcom crash happened in 2001, which continues to serve as a harbinger of potential doom whenever the tech cycle is on a downward path. I remember quite distinctly hanging out with either unemployed or underemployed folks in the IT field shortly after the great crash in 2001. We were doing just that: hanging out with time on our hands.

During that time, one day in a park in the New York City area (Brookdale Park in Montclair, NJ), one of my friends was sitting on a park bench pounding away on his laptop, and he told me there was this really cool thing for website creation called Java. It had actually been around for a long time, but he described how amazing it was that you could write Java code and deploy it wherever you wanted on websites. Java, of course, went on to play a key role in transforming the user experience on websites compared to the days of the 1990s, when HTML code provided the main elements of website design. Sure, why not, I’ll check it out, I said. And the rest is history, as Java secured its place not only in web development, but across IT infrastructure.

Flash forward to today: There’s this thing called WebAssembly, or Wasm, which makes a very similar claim, where you write once and deploy anywhere. That applies not only to the web applications for which it was originally created, but across networks and on anything that runs on a CPU.

Remind you of something?

“Wasm could be Java’s grandchild that follows the same broad principle of allowing developers to run the same code on any device, but at the same time Wasm fixes the fundamental issues that prevented the original vision of “Java on any device” from becoming reality,” Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack.

The Simple Case

Wasm has shown itself to be very effective in a number of different hardware environments, ranging from server-side to edge deployments and IoT devices, or wherever code can be run directly on a CPU. The code runs bundled in a neatly packaged Wasm executable that can be compared to a container, or even a mini operating system, that runs with significantly less configuration (if any) required for the code and the target. Essentially, wherever code can be deployed, the applications reach far beyond the confines of the web browser environment. The developer thus creates the code and deploys it. It really can be that simple, especially when PaaS solutions are used.

Most importantly, Wasm enables true “code once, deploy” anywhere capabilities, where the same code runs on any supported device without the need to recompile, Volk noted. “Wasm is not tied to one development language but supports Python and many other popular languages. Developers can run their code on shared environments on servers and other devices without having to worry about the underlying Kubernetes cluster or hypervisor,” Volk said. “They also receive unified logging and tracing for their microservices, right out of the box. This simplified developer experience is another big plus compared to Java.”

During a recent KubeCon + CloudNativeCon conference, a talk was given about using Wasm to replace Kafka for lower-latency data streaming. At the same time, Java continues to be used for networking apps even though alternatives can offer better performance, because developers just like working with Java. So, even if Wasm’s runtime performance were not great (which it is), developers might still adopt it merely for its simplicity of use.

“One of the big pluses of Wasm is that it is very easy for developers to get started, just by deploying some code and instantly watching it run. This is one of these value propositions where it may take a little while to fully understand it, but once you get hooked, you don’t want to worry about the ins and outs of the underlying infrastructure anymore,” Volk said. “You can then decide if it makes sense to replace Kafka or if you just want to connect it to your Wasm app.”

Java’s entire “write once, run anywhere” promise is quite similar to WebAssembly’s, Fermyon Technologies CEO and co-founder Matt Butcher told The New Stack: “In fact, Luke [Wagner, the original author of WebAssembly] once told me that he considered Java to be 20 years of useful research that formed the basis of how to write the next generation (e.g. Wasm),” Butcher said.

Still Not the Same

There is one key difference between Java and Wasm: their security postures.

Wasm’s portability and consistency can make security and compliance much easier to manage (again, it runs in a binary format at the CPU level). Also, part of Wasm’s simplicity in structure means that code is delivered in a closed sandbox environment, almost directly to the endpoint. Java’s (as well as .NET’s) default security posture is “to trust the code it is running,” with Java granting the code access to the file system, the environment, processes and the network, Butcher said.

“In contrast, Wasm’s default security posture is to not trust the code that is running in the language runtime. For Fermyon (being cloud- and edge-focused), this was the critical feature that made Wasm a good candidate for a cloud service,” Butcher said. “Because it is the same security posture that containers and virtual machines take. And it’s what makes it possible for us as cloud vendors to sell a service to a user without having to vet or approve the user’s code.”

In other words, there is an exponentially greater number of attack points to worry about when working with distributed containerized and microservices environments. Volk agreed with Butcher’s assessment: Relying on the zero trust principle allows for multitenancy based on the same technologies, like mTLS and JWT, that are already being used for application containers running on Kubernetes, he said. “This makes Wasm easy to safely try out in shared environments, which should lower the initial barriers to get started,” Volk said.

Another big difference between Java and Wasm (which can actually run in a Linux kernel) is that Java requires the JVM and additional resources, such as a garbage collector, Sehyo Chang, CTO at InfinyOn, told The New Stack. “Wasm, on the other hand, is very close to the underlying CPU and doesn’t need GC or other heavy glue logic,” Chang said. “This allows Wasm to run on a very low-power CPU suitable for running in embedded devices or IoT sensors to run everywhere.”

 

The post Java’s History Could Point the Way for WebAssembly appeared first on The New Stack.

]]>
Limiting the Deployment Blast Radius https://thenewstack.io/limiting-the-deployment-blast-radius/ Tue, 15 Nov 2022 16:14:52 +0000 https://thenewstack.io/?p=22692941

Complex application environments often require deployments to keep things running smoothly. But deployments, especially microservice updates, can be risky because

The post Limiting the Deployment Blast Radius appeared first on The New Stack.

]]>

Complex application environments often require deployments to keep things running smoothly. But deployments, especially microservice updates, can be risky because you never know what could go wrong.

In this article, we’ll explain what can go wrong with deployments and what you can do to limit the blast radius.

What Is a Blast Radius?

A blast radius is an area around an explosion where damage can occur. In the context of deployments, the blast radius is the area of the potential impact that a deployment might have.

For example, if you deploy a new feature to your website, the blast radius might be the website itself. But if you’re deploying a new database schema, the blast radius might be the database and all the applications that use it. The problem with deployments is that they often have an infinite blast radius.

While we always expect some blast radius, an infinite blast radius means that anything could go wrong and cause problems. That’s bad.

What Causes Blast Damage?

Poor Planning

Hastily developed and scheduled deployments are often the leading causes of infinite blast radius. When you rush a deployment, you’re more likely to make mistakes. These mistakes can include forgetting to update the documentation, accidentally breaking something in production, or not giving other interested parties, like dependent service owners, a chance to reflect and respond to the deployment.

Poor Communication

Have you ever woken up to frantic calls that your app is not working only to discover that another team had an unplanned deployment and didn’t inform you that it was going ahead? I have.

Telling someone they’re deploying without giving them adequate time to prepare is another source of trouble. Not communicating is a recipe for disaster.

Lack of Testing

One of the most important things you can do to limit a deployment’s blast radius is to test it thoroughly before pushing it to production. That means testing in a staging environment configured as close to production as possible. It also means doing things like unit testing and end-to-end testing.

By thoroughly testing your code before deploying it, you can catch any potential issues and fix them before they cause problems in production.
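
As a tiny, hypothetical illustration of the kind of unit test that catches problems before they reach production, here is a pytest-style check for a discount calculation; the function and expected values are invented for the example.

```python
# test_pricing.py -- run with `pytest`
import pytest

def apply_discount(total: float, percent: float) -> float:
    """Hypothetical pricing helper under test."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)

def test_apply_discount_happy_path():
    assert apply_discount(200.0, 10) == 180.0

def test_apply_discount_rejects_bad_percent():
    with pytest.raises(ValueError):
        apply_discount(200.0, 150)
```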

Incomplete or Unclear Requirements

If your developers had to guess what the requirements were, they most likely didn’t have a clear test plan either. Unclear requirements can lead to code that works in development but breaks in production. In addition, it can lead to code that doesn’t play well with other systems. It can also lead to features that don’t meet users’ needs.

To avoid this, make sure you have a clear and complete set of requirements before starting development. Precise requirements will help ensure that your developers understand what they need to build and that they can test it properly before deploying it.

Configuration Errors

The most common way that deployments go wrong is when configuration changes occur. For example, you might change the database settings and forget to update the application. Or you might change the way you serve your website and break the links to all of your other websites. Configuration changes are often the cause of deployments going wrong because they can affect so many different parts of your system.

Human Failure

Human beings get tired, make mistakes, and forget things. When deploying a complex system, there’s a lot of room for error. Even the most experienced engineers can make mistakes.

Environment Parity Issues

Sometimes things get changed directly in production in the chaotic world of production support, and those changes never trickle down to the testing and development environments. This lack of environment parity can cause problems when you go to deploy: your application might not work in the new environment, you might be missing necessary files and configurations, or you discover the dreaded things in production that weren't in testing and that no one can explain.

Software Issues

Finally, the software itself can go wrong. Software is complex, and it's often hard to predict how it will behave in different environments: code that works fine in development can have unintended consequences in production. And if your code is tightly coupled "spaghetti" that's difficult to maintain, you will probably have trouble with deployments.

A lot can go wrong! But what if there were ways you could limit the blast radius?

How to Limit the Blast Radius

There are a few ways to limit the blast radius.

Plan and Schedule

Set your team up for success by establishing a regular cadence and process for deployments. Consistency will minimize the potential for human error.

When everyone knows when to expect deployments and what happens in the exceptional cases, your deployments will go much more smoothly. You should also plan for what to do if something goes wrong. Can you roll back? Are there backups?

Over Communicate

Part of planning the deployment is making sure that all responsible and affected parties are informed adequately and have time to review, reflect, and respond. Communicating includes sending an email to all users informing them of the upcoming deployment and telling them what to expect. In addition, it means communicating with the people who will do the deployment. Give them adequate time to prepare, review, practice, and clear their calendars. It’s better to err on the side of providing too much information rather than too little.

Understand Risks

The most important way to prepare is to understand your risks. First, you need to know what could go wrong and how it would affect your system. Only then can you take steps to prevent it from happening.

Automate Everything

Another way to limit the blast radius is to automate as much of the deployment process as possible. If something can be automated, automate it. Automation helps ensure consistency and accuracy: every step is covered, performed correctly, and nothing gets forgotten.

There are many different tools available to help you automate your deployments. Choose the one that best fits your needs and then automate as much of the process as possible.
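To make "automate everything" concrete, here is a minimal sketch of a deploy-verify-rollback script. It is deliberately tool-agnostic: the deploy and rollback commands and the health check URL are hypothetical placeholders you would swap for your own pipeline's steps.

```python
#!/usr/bin/env python3
"""Minimal deploy-verify-rollback sketch (placeholder commands and URL)."""
import subprocess
import sys
import time
import urllib.request

DEPLOY_CMD = ["./deploy.sh", "--env", "production"]                  # placeholder
ROLLBACK_CMD = ["./deploy.sh", "--env", "production", "--rollback"]  # placeholder
HEALTH_URL = "https://example.internal/healthz"                      # placeholder
RETRIES, WAIT_SECONDS = 5, 10


def healthy(url: str) -> bool:
    """Return True if the service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


def main() -> int:
    subprocess.run(DEPLOY_CMD, check=True)

    # Poll the health endpoint; roll back automatically if it never comes up.
    for _ in range(RETRIES):
        if healthy(HEALTH_URL):
            print("Deployment healthy.")
            return 0
        time.sleep(WAIT_SECONDS)

    print("Health check failed; rolling back.", file=sys.stderr)
    subprocess.run(ROLLBACK_CMD, check=True)
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

Even a small wrapper like this removes two of the most error-prone manual steps: remembering to verify that the release is actually healthy, and remembering exactly how to roll it back.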

Use an Internal Developer Portal

Finally, use an internal developer portal. An internal developer portal organizes much of the information you may need to help limit the blast radius.

For instance, it can help you understand the downstream services that depend on your service, identify the owners of those services, find related documentation, and visualize key metrics about those services all in one place. This enables you to know who to communicate with ahead of a deployment and how to get in touch, provides context on how those downstream services work, and offers a place to monitor those services during testing (assuming your internal developer portal is infrastructure-aware at the environment level).

An internal developer portal can also help you understand your risks by surfacing version control information: what was changed, when it was changed, and by whom. This way, you can identify any changes that might have introduced risk and take steps to mitigate that risk.

One such deployment management tool is configure8. It will help you understand the blast radius of your deployments and limit the potential fallout, ensuring your deployments run more smoothly and reliably. It enables your team to answer the following questions:

  • Who owns the service?
  • Who’s on call for the service?
  • What was in the last deployment?
  • What applications depend on the service, and how mission-critical are they?
  • What’s the health of the service, including monitoring at the environment level?
  • When was the last time someone updated the service?

Learn more about how your engineering team can benefit from Configure8 and how it can help you limit the blast radius of your deployments. You can check it out here.

The post Limiting the Deployment Blast Radius appeared first on The New Stack.

Do … or Do Not: Why Yoda Never Used Microservices https://thenewstack.io/do-or-do-not-why-yoda-never-used-microservices/ Tue, 25 Oct 2022 13:00:44 +0000 https://thenewstack.io/?p=22689827


Microservices were meant to be a blessing, but for many, they’re a burden. Some developers have even moved away from them after negative experiences. Operational complexity becomes a headache for this distributed, granular software model in production. Is it possible to solve microservices’ problems while retaining their advantages?

Microservices shorten development cycles. Changing a monolithic code base is a complex affair that risks unexpected ramifications. It's like unraveling a sweater so that you can change its design. Breaking that monolith down into lots of smaller services managed by two-pizza teams can make software easier to develop, update, and fix. It's what helped Amazon grow from a small e-commerce outfit to the beast it is today.

Microservices also introduce new challenges. Their distributed nature exposes developers to complex state management issues.

The Yoda Principle

Ideally, developers shouldn’t deal with state management at all. Instead, the platform should handle it as a core abstraction. Database transaction management is a good example; many database platforms support atomic transactions, which divide a single transaction into a set of smaller operations and ensure that either all of them happen, or none of them do. To achieve this behavior the database uses transaction isolation, which restricts the visibility of each operation in a transaction until the entire transaction completes. If an operation fails, the application using the database sees only the pre-transaction state, as though none of the operations happened.

This transactionality enables the developer to concentrate on their business logic while the database platform handles the underlying state. A database transaction doesn’t fail half-complete and then leave the developer to sort out what happened. An account won’t be debited without the corresponding party’s account being credited, for example. As Yoda said: “Do or do not. There is no try.” Appreciate ACID SQL databases, he would have.
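To see how much work this abstraction saves, here is a minimal sketch of an atomic debit/credit transfer using Python's built-in sqlite3 module; the schema, account names, and the simulated failure are invented purely for illustration.

```python
import sqlite3

# Toy schema and data, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one atomic unit: both happen or neither does."""
    try:
        with conn:  # sqlite3 opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            if amount > 50:  # simulate a crash after the debit but before the credit
                raise RuntimeError("simulated failure mid-transfer")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except RuntimeError:
        pass  # the database has already rolled back; nothing to clean up by hand


transfer(conn, "alice", "bob", 75)
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 100, 'bob': 0} -- the failed transfer left no half-finished state behind
```

Because the database rolls the transaction back, the failed transfer leaves no half-finished state for the application to reconcile.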

“Phew,” you think. “Thank goodness I don’t have to write code to unravel half-completed operations just to work out the transaction state.” Unfortunately, microservices developers are still living in that era. This is why Yoda never used Kubernetes.

I’ve Got a Bad Feeling about This

In microservice architectures, a single business process interacts with multiple services, each of which operates and fails autonomously. There is no single monolithic engine to manage and maintain state in the event of a failure.

This lack of transactionality between independent services leaves developers holding the bag. Instead of just focusing on their own applications’ functionality, they must also handle application resilience by managing what happens when things go wrong. What was once abstracted is now their problem.
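As a rough illustration of that burden, here is what even a two-step transfer tends to look like once the developer owns the failure handling; the service-client functions below are hypothetical stand-ins for real HTTP or RPC calls.

```python
# Hypothetical service clients; in a real system these would be HTTP/RPC calls.
def debit_account(account_id: str, amount: int) -> None: ...
def credit_account(account_id: str, amount: int) -> None: ...
def refund_account(account_id: str, amount: int) -> None: ...


def transfer(src: str, dst: str, amount: int) -> None:
    """Manual 'saga': every step needs its own failure handling and compensation."""
    debit_account(src, amount)
    try:
        credit_account(dst, amount)
    except Exception:
        # The credit failed after the debit succeeded, so the developer must
        # write, test, and monitor the compensating action themselves.
        refund_account(src, amount)
        raise


transfer("acct-123", "acct-456", 25)  # hypothetical account IDs
```

And even this doesn't survive a crash of the calling process between the debit and the compensation; handling that requires persisting progress somewhere, which is exactly the code nobody wants to write.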

In practice, things can go wrong quickly in microservice architectures, with cascading failures that cause performance and reliability problems. For example, a service that one development team updates can cause other services to fail if they haven’t also been updated to handle those new errors.

The brittle complexity of microservices is a challenge, in part because of the weakest-link effect: an application's overall reliability is only as good as its least reliable microservice. The whole thing becomes a lot harder with asynchronous primitives, and state management is more difficult still when a microservice's response time is uncertain.
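A quick back-of-the-envelope calculation shows how fast reliability erodes when a single request has to traverse many services in series (assuming, optimistically, that failures are independent):

```python
# Composite availability when one request touches N services in series.
per_service = 0.999  # each service individually offers "three nines"
for n in (1, 10, 50):
    print(f"{n:>2} services: {per_service ** n:.3%} available")
# Roughly 99.9% for 1 service, ~99.0% for 10, and ~95.1% for 50.
```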

Look at the Size of That Thing

Another aspect of this problem is that managing state on your own doesn’t scale well. The more microservices a user has, the more time-consuming managing their state becomes. Companies often have thousands of microservices in production, outnumbering their developers. This is what we noticed as early developers at Uber. Uber had 4,000 microservices, even back in 2018. We spent most of our time writing code to manage the microservice state in this environment.

Developers have taken several homegrown approaches to state management. Some use Kafka event streams hidden behind an API to queue microservice-based messages, but the lack of diagnostics makes root cause analysis a nightmare. Others use databases and timers to keep track of the system state.

Monitoring and tracing can help, but only up to a point. Monitoring tools oversee platform services and infrastructure health, while tracing makes it easier to troubleshoot bottlenecks and unexpected anomalies. There are many on offer. For example, Prometheus offers open source monitoring that developers can query, while Grafana, commonly paired with it, adds visualization capabilities to trace system behavior.

These solutions can be useful, providing at least some observability into microservices-based systems. However, monitoring tools don’t help with the task of state management, leaving that burden with the developer. That’s why developers spend way too much time writing state management code instead of highly differentiated business logic. In an ideal world, something else would abstract state management for them.

Use the Microservices State Management Platform, Luke

The answer to simplifying state management in microservices is to offer it as a core abstraction for distributed systems.

We worked on a state management platform after spending far too much time manually managing microservices state at Uber. We wanted a product that would enable us to define workflows that make calls to different microservices (in the language of the developer's choosing) and then execute them without worrying about their state afterwards.

In our solution, which we originally called Cadence, a workflow automatically maintains state while waiting for potentially long-running microservices to respond. Its concurrent nature also enables the workflow to continue with other non-dependent operations in the meantime.

The system handles disruptions to state without requiring developer intervention. For example, in the event of a hardware failure, the state management platform will restart a workflow on another machine in the same state, without the developer needing to do anything.

Do. Don’t Not Do.

A dedicated state management platform for microservices gives us the same kind of abstraction that we see in atomic database transactions. Developers can be certain that a workflow will run once, to completion; Temporal, the platform that grew out of Cadence, takes care of any failures and restarts under the hood. Now microservices-based applications can guarantee that a debit from one account automatically credits the other in just a couple of lines of code. They get the best of both worlds.
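For a feel of what that looks like to a developer, here is a sketch loosely modeled on Temporal's Python SDK. Treat the decorator names and signatures as approximate rather than authoritative, and the debit and credit activities as hypothetical wrappers around separate microservices.

```python
from datetime import timedelta
from temporalio import activity, workflow


# Hypothetical activities wrapping calls to separate microservices.
@activity.defn
async def debit(account: str, amount: int) -> None:
    ...  # call the accounts service


@activity.defn
async def credit(account: str, amount: int) -> None:
    ...  # call the accounts service


@workflow.defn
class TransferWorkflow:
    @workflow.run
    async def run(self, src: str, dst: str, amount: int) -> None:
        # The platform persists progress between these steps; if a worker or
        # host dies, the workflow resumes here rather than leaving a half-done
        # transfer for the developer to reconcile.
        await workflow.execute_activity(
            debit, args=[src, amount],
            start_to_close_timeout=timedelta(seconds=30),
        )
        await workflow.execute_activity(
            credit, args=[dst, amount],
            start_to_close_timeout=timedelta(seconds=30),
        )
```

Actually running this would also require a Temporal (or Cadence) server, a worker, and a client, which are omitted here; the point is simply that the retry, persistence, and resume-after-failure logic lives in the platform rather than in the application code.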

This fixes a long-standing problem with microservices and supercharges developer productivity, especially now that developers are typically responsible for operating their applications in addition to developing them. Finally, developers who want the benefits of microservices can enjoy them without having to go to the dark side.

The post Do … or Do Not: Why Yoda Never Used Microservices appeared first on The New Stack.

The Gateway API Is in the Firing Line of the Service Mesh Wars  https://thenewstack.io/the-gateway-api-is-in-the-firing-line-of-the-service-mesh-wars/ Mon, 17 Oct 2022 14:00:55 +0000 https://thenewstack.io/?p=22682717


It appears that the leading service mesh vendors are leaning toward the Kubernetes Gateway API, replacing Ingress with a single API that can be shared for the management of Kubernetes nodes and clusters through service mesh. While the Gateway API, like service mesh itself, is designed to serve infrastructure management uses beyond Kubernetes, it has been built for Kubernetes specifically and was created by Kubernetes creator Google.

“In general, if implementing the Gateway API for Kubernetes solves any operational friction that exists today between infrastructure providers, platform admins, and developers and can ease any friction developers experience in how they deploy North-South APIs and services, I think it makes sense to assess what that change will look like,” Nick Rago, field CTO for Salt Security, told The New Stack. This will put organizations in a good position down the road as the Gateway API specification matures and the gateway controller providers support more of the specs, reducing the need for vendor or platform-specific annotation knowledge and usage.

The Controversy

In that sense, this helps explain to some degree why projects such as Linkerd and Istio, and especially Google, are backing the Gateway API as a standard and building on it. The push to move organizations to the Gateway API is not without controversy, though.

To wit, Linkerd’s August release of Linkerd 2.12 is what Buoyant CEO William Morgan describes as “a first step towards adopting the Gateway API as a core configuration mechanism.” However, Morgan is cautious and wary of standards in general and the risk of leading to vendor lock-in and other issues (as he vociferously expresses below).

Wariness of standards is not unreasonable given that they may or may not be appropriate for certain use cases, often depending on the maturity of the project.

“Standards can be a gift and a curse depending on the lifecycle stage of the underlying domain/products. They can be an enabler to allow higher-order innovation, or they can be overly constraining,” Daniel Bryant, head of developer relations at Ambassador Labs, told The New Stack. “I believe that Kubernetes ingress and networking is a well enough explored and understood domain for standards to add a lot of value to support additional innovation. This is why we’re seeing not only Ingress projects like Emissary-ingress and Contour adopt the Gateway API spec, but also service mesh products like Linkerd and Istio.”

Needless to say, Google supports the Gateway API as far as standards go. In an emailed response, Louis Ryan, principal engineer at Google Cloud, offered this when asked why projects such as Linkerd, Istio and especially Google are supporting the Gateway API:

“Kubernetes has proven itself an effective hub for standardizing APIs with broad cross-industry engagement and support. The Gateway API has benefitted from this and, as a result, is a very well-designed solution for a wide variety of traffic management use cases; ingress, egress, and intra-cluster,” Ryan wrote. “Applying the Gateway API to mesh traffic management is a very natural next step and should benefit users by creating a standard that is thorough, community-driven and durable.”

The Istio Steering Committee’s decision to offer its service mesh project as an incubating project with the Cloud Native Computing Foundation (CNCF) in part to improve Istio’s integration with Kubernetes through the Gateway API (as well as with gRPC with proxyless mesh and Envoy). The Gateway API is also seen as a viable Ingress replacement.

“Donating the project to the CNCF offered reassurance that Istio is in good shape and that it’s not a Google project but a community project,” Idit Levine, founder and CEO of solo.io — the leading provider of tools for Istio — told The New Stack.

Istio’s move followed concerns by IBM — one of the original creators with Google and car-sharing provider Lyft — and other community members over the project’s governance, specifically Google’s advocacy of the creation of the Open Usage Commons (OUC) for the project in 2020.

In Linkerd’s case, Linkerd 2.12 provides a first step towards supporting the Kubernetes Gateway API, Linkerd CEO William Morgan said. While the Gateway API was originally designed as a richer and more flexible alternative to the long-standing Ingress resource in Kubernetes, it “provides a great foundation for describing service mesh traffic and allows Linkerd to keep its added configuration machinery to a minimum,” Morgan wrote in a blog post.

“The value of the Gateway API for Linkerd is that it’s already on users’ clusters because it’s a part of Kubernetes. So to the extent that Linkerd can build on top of the Gateway API, that reduces the amount of novel configuration machinery we need to introduce,” Morgan told The New Stack. “Reducing configuration is part and parcel of our mission to deliver all the benefits of the service mesh without the complexity of other projects in the space.”

The portability promised by the Gateway API spec “is attractive to operators and platform engineers,” Bryant said. “Although many of them will choose a service mesh for the long haul, using the Gateway API does enable the ability to both standardize configuration across all service meshes deployed within an organization and also open the door to swapping out an implementation if the need arises (although, I’m sure this wouldn’t be an easy task).”

However, Linkerd implements only parts of the Gateway API (e.g., CRDs such as HTTPRoute) to configure Linkerd’s route-based policies. This approach allows Linkerd to start using Gateway API types without implementing the portions of the spec “that don’t make sense for Linkerd,” Morgan wrote in a blog post. As the Gateway API evolves to better fit Linkerd’s needs, Linkerd’s intention is to switch to the source types in a way that minimizes friction for its users.

“I think the biggest concern is that the Gateway API gets co-opted by one particular project or company and stops serving the needs of the community as a whole. While the Gateway API today is reasonably complete and stable for the ingress use case, making it amenable to service meshes is still an ongoing effort (the “GAMMA initiative”) and there’s plenty of room for that process to go south,” Morgan told The New Stack. “In particular, many of the participants in the Gateway API today are from Google and work on Istio; if the GW API develops in an Istio-specific way then it doesn’t actually help the end user because we’ll end up with projects (like Linkerd) just developing their own APIs rather than conforming to something that doesn’t make sense to them. (We saw this a little bit with SMI.)”

The Way to Go

Meanwhile, Linkerd is only providing parts of the Gateway API to remain in line with the service mesh provider’s vision to keep the service mesh experience “light and simple,” Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. “They will not want to adopt anything that creates admin overhead for their user base or could potentially introduce network latency that would take away from their high-performance claim,” Volk said. “They even advertise on their website, ‘as little YAML and as few CRDs as possible,’ meaning that they will want to critically evaluate any advanced features they might need to offer to fully support Gateway API. This would dilute simplicity and performance as Linkerd’s key differentiators against Istio.”

Istio and Linkerd, of course, represent competing service mesh alternatives. Istio is GKE’s service mesh of choice, so for users for whom GKE support is critical, Istio “might be the way to go,” Volk said. “However, most other vendors of proxies, ingress controllers and service mesh platforms have also indicated their support of Gateway API, at some point in the future,” Volk said. “Therefore, it might be wisest to trust your vendor of choice to ultimately support the critical elements of the Gateway API standard.”

HashiCorp, Kong, Ambassador and others are supporting the Gateway API, Bryant noted. Already, “the majority of Kubernetes API Gateway providers offer some level of support for the Gateway API spec,” Bryant said. “Both Emissary-ingress and Ambassador Edge Stack have offered this type of support for quite some time, and this will continue to evolve in the future.”

Ambassador Labs is also working with other founding contributors on the Envoy Gateway project, which will be the reference implementation of the Kubernetes Gateway API spec, Bryant said. They include Tetrate, VMware, Fidelity and others. “Our goal here is to collaborate on a standardized K8s API Gateway implementation that we can all build upon and innovate on top of,” Bryant said.

The post The Gateway API Is in the Firing Line of the Service Mesh Wars  appeared first on The New Stack.

AmeriSave Moved Its Microservices to the Cloud with Traefik’s Dynamic Reverse Proxy https://thenewstack.io/amerisave-moved-its-microservices-to-the-cloud-with-traefiks-dynamic-reverse-proxy/ Thu, 08 Sep 2022 21:02:23 +0000 https://thenewstack.io/?p=22682687


When AmeriSave Mortgage Corporation decided to make the shift to microservices, the financial services firm was taking the first step in modernizing a legacy technology stack that had been built over the previous decade. The entire project — migrating from on-prem to cloud native — would take longer.

Back in 2002, when company founder and CEO Patrick Markert started AmeriSave, only general guidelines for determining rates were available online. “At that time, finance was very old-school, with lots of paper and face-to-face visits,” said Shakeel Osmani, the company’s principal lead software engineer.

But Markert had a technology background, and AmeriSave became a pioneer in making customized rates available online. “That DNA of technology being the driver of our business has remained with us,” said Osmani.

Since then, AmeriSave has automated the creation and processing of loan applications, giving it lower overall operating costs. With six major loan centers in 49 states and over 5,000 employees, the company’s continued rapid growth demanded an efficient, flexible technology stack.

Steps to the Cloud

With many containerized environments on-prem, company management initially didn’t want to migrate to a cloud native architecture. “The financial industry was one of the verticals hesitant to adopt the cloud because the term ‘public’ associated with it prompted security concerns,” said Maciej Miechowicz, AmeriSave’s senior vice president of enterprise architecture.

Most of the engineers on his team came from companies that had already adopted microservices, so that’s where they started. First, they ported legacy applications into microservices deployed on-prem in Docker Swarm environments, while continuing to use the legacy reverse proxy solution NGINX for routing.

“We then started seeing some of the limitations of the more distributed Docker platform, mostly the way that networking operated, and also some of the bottlenecks in that environment due to increased internal network traffic,” said Miechowicz.

The team wanted to move to an enterprise-grade cloud environment for more flexibility and reliability, so the next step was migrating microservices to Microsoft’s Azure cloud platform. Azure Red Hat OpenShift, already available in the Azure environment, offered high performance and predictable cost.

The many interdependencies among AmeriSave’s hundreds of microservices required the ability to switch traffic easily and quickly between Docker Swarm and OpenShift environments, so the team wanted to use the same URL for both on-prem and in the cloud. Without that ability, extensive downtime would be required to update configurations of each microservice when its dependency microservice was being migrated. With over 100 services, that migration task would cause severe business interruptions.

First, the team tried out Azure Traffic Manager, an Azure-native, DNS-based traffic load balancer. But because it’s not automated, managing all those configurations through Azure natively would require a huge overhead of 300 to 500 lines of code for each service, said Miechowicz.

One of the lead engineers had used Traefik, a dynamic reverse proxy, at his prior company and liked it, so the team began discussions with Traefik Labs about its enterprise-grade Traefik Enterprise for cloud native networking.

Cloud and Microservices Adoption Simplified

Traefik was created to deliver a reverse proxy for microservices that can automatically reconfigure itself on the fly, without the need to go offline.

The open source Traefik Proxy handles all of the microservices application networking in a company’s infrastructure, said Traefik Labs founder and CEO Emile Vauge. This includes all incoming traffic management: routing, load balancing, and security.

Traefik Enterprise is built on top of that. “Its additional features include high availability and scalability, and advanced security, as well as advanced options for routing traffic to applications,” he said. “It also integrates API gateway features, and connects to legacy environments.”

Vauge began work on Traefik as an open source side project while he was developing a Mesosphere-based microservices platform. “I wanted to automate 2,000 microservices on it,” he said. “But there wasn’t much in microservices available at that time, especially for edge routing.”

He founded Traefik Labs in 2016 and the software is now one of the top 10 downloaded packages on GitHub: it’s been downloaded more than 3 billion times.

“The whole cloud native movement is driven by open source, and we think everything should be open source-based,” he said. “We build everything with simplicity in mind: we want to simplify cloud and microservices adoption for all enterprises. We want to automate all the complexity of the networking stack.”

Multilayered Routing Eliminates Downtime

Working together, Traefik’s team and Miechowicz’s team brainstormed the idea of dynamic path-based routing of the same URL, between on-prem Docker Swarm and cloud-based OpenShift. This means a service doesn’t need to be updated while its dependency microservice is being migrated.

Any migration-related problem can be quickly fixed in Traefik Enterprise by redirecting routing from OpenShift back to on-prem Docker Swarm, correcting the issue, and redirecting back to OpenShift. Also, there’s no need to update configurations of any other services.

This is made possible by the way that Traefik Enterprise’s multilayered routing works. “Layer 1 of Traefik Enterprise dynamically collects path-based and host-based routing configured in Layer 2,” said Miechowicz. “In our case, we had two Layer 2 sources: on-prem Docker Swarm and cloud-based OpenShift. Layer 1 then directs the traffic to the source that matches the host/path criteria and has a higher priority defined. Rollback from OpenShift to Docker Swarm simply consists of lowering the priority on the OpenShift route. We did a proof-of-concept and it worked perfectly and fast.”
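The selection logic Miechowicz describes is easy to picture with a toy model: Layer 1 gathers candidate routes from both environments and sends traffic to the highest-priority match, so rolling back is just a priority change. The sketch below only simulates that idea; it is not Traefik's actual configuration syntax, and the paths and priorities are invented.

```python
from dataclasses import dataclass


@dataclass
class Route:
    source: str       # which Layer 2 environment published the route
    path_prefix: str  # path matching rule (host matching omitted for brevity)
    priority: int     # higher wins when several routes match


# Hypothetical routes for one service during migration.
routes = [
    Route("docker-swarm", "/loans", priority=10),
    Route("openshift",    "/loans", priority=20),  # cloud route currently preferred
]


def pick_route(path: str, routes: list[Route]) -> Route:
    """Layer 1: among routes whose prefix matches, choose the highest priority."""
    matching = [r for r in routes if path.startswith(r.path_prefix)]
    return max(matching, key=lambda r: r.priority)


print(pick_route("/loans/quote", routes).source)  # openshift

# "Rollback" during a migration issue: just lower the OpenShift route's priority.
routes[1].priority = 5
print(pick_route("/loans/quote", routes).source)  # docker-swarm
```

In the real setup, the same effect is achieved by adjusting route priorities in Traefik Enterprise rather than editing the configuration of any dependent service.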

This contrasts with how NGINX works. “You may configure it to route to a hundred services, but if one service does not come up, NGINX will fail to start and cause routing outage of all the services,” said Osmani. But Traefik Enterprise will detect a service that’s failing and stop routing to it, while other services continue to work normally. Then, once the affected service comes back up, Traefik Enterprise automatically establishes routing again.

NGINX also doesn’t have Traefik’s other capabilities, like routing on the same URL, and it’s only suited for a smaller number of services, Osmani said. Both Azure Traffic Manager and Traefik must be maintained and managed, but that’s a lot easier to do with Traefik.

No More Service Interruptions

Osmani said adopting Traefik Enterprise was one of the best decisions the team has made in the past year because it’s removed many pain points.

“When we were on-prem, we were responsible for managing everything — we’ve often gotten up at midnight to fix something that someone broke,” he said. “But with Traefik you can only take down the service you’re affecting at that moment.”

From the business standpoint, the main thing that’s better is the migration, said Osmani. “Because we are a living, breathing system, customers are directly affected. In the online mortgage lending business, if a service is down people will just move on to the next mortgage lender’s site. Now we don’t experience service interruptions. There’s no other way we could have easily accomplished this.”

“For developers in our organization, the result works like magic,” said Miechowicz. “We just add a few labels and Traefik Enterprise routes to our services. As our developers move services to the cloud, none of them have seen a solution as streamlined and automated as this before.”

The post AmeriSave Moved Its Microservices to the Cloud with Traefik’s Dynamic Reverse Proxy appeared first on The New Stack.
