Online Networking Architectures | The New Stack | https://thenewstack.io/networking/

Performant and Programmable Telco Networking with eBPF | https://thenewstack.io/performant-and-programmable-telco-networking-with-ebpf/ | Fri, 11 Aug 2023

To keep the world connected, telecommunication networks demand performance and programmability to meet customers when and where they are, from streaming the winning goal of the World Cup to coordinating responses to the latest natural disaster.

When switchboards were still run by human operators, telco companies were all about custom hardware, with “black boxes” from vendors providing the speed the network needed. These black boxes determined the performance of the network, which also made that performance dependent on where the boxes were actually deployed.

As telcos moved from traditional phone calls to additional services like messaging and mobile data, the demands on the network pushed the boundaries of what was possible. Network Functions Virtualization (NFV) sought to allow telcos to use “white box” commodity hardware to scale out throughput and increase flexibility.

Technologies like the Data Plane Development Kit (DPDK) were developed to accelerate networking and bring it closer to the performance of custom hardware by reimplementing large portions of the Linux networking stack in userspace. This achieved better throughput and lower packet-processing latency, and made it easier to add functionality than trying to get changes upstreamed into the kernel. However, these two trends (commodity “white box” hardware and finely tuned userspace acceleration) came to be at odds with one another.

The performance of the boxes became dependent upon fine-tuning on a per-host basis, which is hard to maintain at scale without absolutely uniform servers, and absolute uniformity is itself impossible at scale, creating an operational nightmare. Telcos were left with the worst of both worlds, getting cloud complexity without cloud benefits like flexibility and scalability.

Now, with consumers online 24/7 and every company moving to the cloud to deliver their services, what telcos need to deliver and what they can rely upon has drastically changed yet again. The rise of cloud native approaches to building scalable distributed systems has put the industry at a critical juncture, where they need cloud native infrastructure but are still stuck with the baggage from NFV, DPDK, and other related technologies.

The next generation of networking for telco providers needs to weave together performance, flexibility, and operational scalability, which raises the question of whether we can finally deliver the vision of performant and programmable networks everywhere. Learning from these past technology transitions, we can see that the key is to have high-performance technology that is actually available everywhere. Enter eBPF.

eBPF is a Linux kernel technology with roots in networking. It allows users to programmatically extend the functionality of the Linux kernel while still doing so in a safe, performant, and integrated manner. Not only that, eBPF is part of the Linux kernel itself, so it is available everywhere that is running a semi-modern kernel (4.18 and above).
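
As a quick sanity check, here is a rough sketch of how an operator might confirm that a host can run eBPF programs (assuming root access and that the bpftool utility is installed):

uname -r                 # the kernel version should be 4.18 or newer

sudo bpftool prog list   # lists any eBPF programs already loaded in the kernel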

Instead of re-routing traffic to wherever a given function happens to be deployed, eBPF enables efficient packet processing everywhere. eBPF is already transforming telco networks because it provides flexibility for integrating different protocols like SCTP, programmability for reducing operational complexity by leveraging NAT46/64 and SRv6, performant load balancing through XDP, and complete observability to see where the bottlenecks are.

With eBPF, telcos may finally be able to deliver the performant, programmable networking that they have been driving toward for so long without getting tangled in an operational nightmare. Telco vendors, as critical players in the value chain, now have a unique opportunity and a network-level toolset to modernize their network functions and make them suitable for the cloud native world with eBPF.

Bringing Network Function Virtualization up to Speed

As telcos began the transition from specialized hardware boxes to virtualized network functions (VNFs) running on top of “white box” commodity hardware, their performance and operational considerations needed to change to address this new paradigm.

With dedicated devices like packet processors, DSPs, and FPGAs, key network performance characteristics like throughput (bits and packets per second), latency, and jitter (the variation in latency) were more or less consistent and predictable. When relinquishing the performance advantages of “bare metal” and dedicated devices, telcos still needed to keep the performance characteristics of the network up to speed — even when scaling horizontally — to keep pace with demand and customer expectations.

Fig 1. Telcos transitioned from hardware box to network function virtualization to scale out, but traded off flexibility and some performance

Telcos now needed compensatory measures to address these crucial networking parameters. The Data Plane Development Kit (DPDK) was created as a collection of building blocks for composing high-performance data processing, with a strong focus on network packet processing.

Single Root I/O Virtualization (SR-IOV) allowed for “splitting” a single PCI device into multiple virtual functions. Thus, instead of “sharing” a common device, multiple processes could get their own mini version of the original PCI device. Direct PCI assignment enabled detaching a PCI device from the kernel and allowing a process (or virtual machine, or container) to operate it directly.

Combining these three concepts (DPDK, SR-IOV, and direct PCI assignment) allowed for building powerful data planes that resembled the operation of the traditional packet switch. On top of that, combining them with the pre-existing control planes from vendors created a virtualized router (a VNF). These became (and still are) a huge topic in conferences, talks, and discussions.

Provisioning efficient networking for KVM virtual machines became a hot area, with special functionality added to KVM itself, QEMU, libvirt and finally OpenStack (through Neutron ML2 plugins). The OPNFV project was also spun up to explore performance in the virtualized world with theoretical references, measurements, and test suites.

NFV Operations Still Running on Spreadsheets

While these technologies have been effective in bringing the performance of NFV closer to bare metal, they also impose significant operational burdens. Provisioning efficient networking for virtual machines requires meticulous orchestration and management of resources at scale. 

More often than not, this “orchestration” is still not automated and remains a very labor-intensive process. It is not uncommon to see Excel sheets listing PCI device addresses per server name. During normal operations, maintaining them is a manual and error-prone task, and during a network incident, they become a hopeless maze of cells and rows to navigate while trying to resolve the issue and restore connectivity.

Attempting to automate these details spawned projects like Open Network Automation Platform (ONAP) and Airship but still left the network performance dependent on which server was running what software. And even with these projects, there are too many details to orchestrate when done at a telco scale across thousands of locations.

Trading performance for operational complexity left telcos between a rock and a hard place: they needed to scale their networks to keep up with demand, but scaling out impacted performance and introduced additional operational complexity. Most VNFs were poorly ported versions of black-box software that was never designed to run in the cloud.

Vendors were caught flat-footed, and rewriting the entire stack from scratch was too expensive. A performant and programmable solution was needed across the network, one that worked wherever customers and coverage needed to go.

Fig. 2: eBPF moves the operations story out of the sheets.

NFV Collides with Cloud Native

At this crucial inflection point, another technological revolution was happening in the IT world with the rise of cloud and cloud native computing. Containers and container orchestration were popping up faster than you could say Kubernetes, and workload IP addresses were changing faster than people switch apps on their phones.

In contrast to OpenStack, Kubernetes came from the world of hyperscale cloud providers and had fundamentally different assumptions about how the network would look than the telco world was used to. Instead of complex topologies and multiple interfaces, Kubernetes comes with a flat network where every pod should be able to communicate with every other pod without NAT.

Trying to mesh this model with the expectations of telco networks led to the creation of projects like Multus and an SR-IOV container network interface (CNI) plugin. However, rather than accelerating the transformation of telco networks, these technologies instead tried to replicate the NFV model, with PCI device IDs, CPU pinning and so on, in the cloud native world, thus missing the benefits of both. To really go cloud native at their scale, telcos needed a way to decouple and abstract their workloads from the hardware details.

Enter eBPF — Accelerating and Simplifying Networks Everywhere

Telcos now need the performance of bare metal while being decoupled from the underlying hardware details, and they need to achieve it in an increasingly dynamic world without increasing operational complexity.

Rather than requiring telcos to wait for the Linux kernel to change, eBPF can modify and accelerate networking while being seamlessly integrated with the Linux networking stack — and the best part is that it is already part of the kernel and available everywhere. With eBPF, telcos can achieve a versatile and highly efficient implementation of packet processing, enabling improved throughput and performance without the operational overhead of figuring out where it is available.

By being part of the kernel and integrating with it, rather than trying to bypass it as DPDK did, or pinning pods to specific nodes using specific PCI devices pinned to specific CPUs, eBPF is able to take advantage of the existing kernel networking stack while also modifying it, accelerating it, and making it more efficient when needed.

Because eBPF is part of the Linux kernel, it’s already available everywhere, allowing telcos to commoditize network scale-out while still keeping decent performance. In the cloud native era, eBPF offers a promising avenue to enhance networking capabilities while streamlining operations for telcos, resulting in both a more manageable and a more scalable networking environment. It also offers an avenue for telco vendors to modernize parts of their network functions that are hardcoded to SR-IOV, DPDK, or even to particular hardware, enabling them to work without having to worry about the underlying infrastructure.

Fig 3. eBPF brings flexibility, observability, and performance to telco networks.

Observability for Day 2 and Decade 2

Since networks are always on and can take decades to retire, telcos also have to think about the day 2 and decade 2 operational concerns of their infrastructure. Physical boxes and VNFs had tools that worked for static, hardware-dependent environments, but those tools can no longer keep pace in the cloud native world. With Kubernetes, infrastructure becomes much more dynamic and unpredictable, so having a good operations and observability story goes from a “nice to have” to mandatory.

In contrast to previous tools, eBPF is part of the kernel itself which means that no modifications are needed to applications or infrastructure to get good systems, network, and protocol observability.

Rather than having to ask each vendor to instrument or modify their network functions, eBPF provides telcos complete visibility, massively lowering the operational hurdle to getting started. And with eBPF everywhere, observability now comes out of the box – even for legacy systems that can’t or won’t be updated. With eBPF, telcos can gradually retire the complex, proprietary network and protocol tracing systems that are still significant operational cost drivers.

For telco networks, eBPF can provide the panacea of improved performance, simplified operations, and complete visibility that cloud native demands while still supporting existing systems.

Telco Networking in the Real World with eBPF

If eBPF seems too good to be true, let’s look at a few examples of how it is transforming networks in the real world today, like integrating different protocols, supporting dual stack and IPv6, and increasing load balancing performance.

As a networking technology, eBPF is well-positioned to support telco workloads because it works on the packet rather than the protocol level. Cilium, a networking project based on eBPF, was easily able to add support for SCTP, and eBPF can do full GTP/GPRS protocol parsing despite the Linux kernel not fully understanding the protocol itself.

The world is also in transition from IPv4 to IPv6. Once again, by understanding the packets, eBPF is able to translate seamlessly between IPv4 and IPv6. This also allows it to support advanced telco networking topologies like SRv6, enabling telcos to add value to their network. By processing packets in the kernel rather than transferring the information to user space, eBPF also offers low-CPU-cost observability.

Finally, with XDP, by processing packets before they even hit the networking stack, eBPF is able to provide extremely high-performance load balancing. By switching to eBPF, Seznam was able to reduce CPU consumption by 72x, and Cloudflare does all of its DDoS protection with eBPF, allowing it to drop over 8 million packets per second with a single server. eBPF running on SmartNICs/DPUs also allows them to be programmed in a standardized way rather than being tied to a specific vendor interface. eBPF is transforming networks today, not just a promise for the future.
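
To give a feel for how lightweight this is operationally, a compiled XDP program can be attached to a network interface with standard iproute2 tooling. The object file name, section name and interface below are only placeholders for illustration:

sudo ip link set dev eth0 xdp obj xdp_prog.o sec xdp   # attach the XDP program to eth0

sudo ip link set dev eth0 xdp off                      # detach it again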

Fig 4. eBPF drastically drops resource consumption.

Performant and Programmable Telco Networking with eBPF

For telcos seeking to enhance their networking capabilities and streamline operations for the cloud native era and beyond, embracing eBPF presents a compelling opportunity. It offers versatile and efficient packet processing, decoupled from hardware-specific details while integrating seamlessly with the Linux networking stack. Since eBPF is already available in their networks through the Linux kernel, telcos can leverage it today rather than search through spreadsheets to find out which server it is available on. They can achieve improved throughput capacity, reduce operational burden, and enhance visibility and observability, and that is just the start.

Create a Samba Share and Use from in a Docker Container | https://thenewstack.io/create-a-samba-share-and-use-from-in-a-docker-container/ | Sat, 29 Jul 2023

At some point in either your cloud- or container-development life, you’re going to have to share a folder from the Linux server. You may only have to do this in a dev environment, where you want to be able to share files with other developers on a third-party, cloud-hosted instance of Linux. Or maybe file sharing is part of an app or service you are building.

And because Samba (the Linux implementation of the SMB protocol used for Windows file sharing) is capable of high availability and scaling, it makes perfect sense that it could be used (with a bit of creativity) within your business, your app stack, or your services.

You might even want to use a Samba share to house a volume for persistent storage (which I’m also going to show you how to do). This could be handy if you want to share the responsibilities for, say, updating files for an NGINX-run website that was deployed via Docker.

Even if you’re not using Samba shares for cloud or container development, you’re going to need to know how to install Samba and configure it so that it can be used for sharing files to your network from a Linux server, and I’m going to show you how it’s done.

There are a few moving parts here, so pay close attention.

I’m going to assume you already have Docker installed on an Ubuntu server, but that’s the only assumption I’ll make.

How to Install Samba on Ubuntu Server

The first thing we have to do is install Samba on Ubuntu Server. Log into your instance and install the software with the command:

sudo apt-get install samba -y


When that installation finishes, start and enable the Samba service with:

sudo systemctl enable --now smbd


Samba is now installed and running.

You then have to add a password for any user who’ll access the share. Let’s say you have the user Jack. To set Jack’s Samba password, issue the following command:

sudo smbpasswd -a jack


You’ll be prompted to type and verify the password.

Next, enable the user with:

sudo smbpasswd -e jack

How to Configure Your First Samba Share

Okay, let’s assume you want to create your share in the folder /data. First, create that folder with the command:

sudo mkdir /data


In order to give it the proper permissions (so that only the users who need access have it), you might want to create a new group and then add users to the group. For example, create a group named editors with the command:

sudo groupadd editors


Now, change the ownership of the /data directory with the command:

sudo chown -R :editors /data


Next, add a specific user to that new group with:

sudo usermod -aG editors USER


Where USER is the specific user name.

Now, make sure the editors group has write permission for the /data directory with:

sudo chmod -R g+w /data
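
If you haven’t yet defined the share itself in Samba’s configuration, a typical entry for the /data directory might look like the following (the share name and settings here are just an example; adjust them for your environment), followed by a restart of the service:

sudo tee -a /etc/samba/smb.conf > /dev/null <<'EOF'
[data]
   path = /data
   browseable = yes
   read only = no
   valid users = @editors
EOF

sudo systemctl restart smbd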


At this point, any member of the editors group should be able to access the Samba share. How they do that will depend on the operating system they use.

How to Create a Persistent Volume Mapped to the Share

For our next trick, we’re going to create a persistent Docker volume (named public) that is mapped to the /data directory. This is done with the following command:

docker volume create --opt type=none --opt o=bind --opt device=/data public


To verify the creation, you can inspect the volume with the command:

docker volume inspect public


The output will look something like this:

[
    {
        "CreatedAt": "2023-07-27T14:44:52Z",
        "Driver": "local",
        "Labels": {},
        "Mountpoint": "/var/lib/docker/volumes/public/_data",
        "Name": "public",
        "Options": {
            "device": "/data",
            "o": "bind",
            "type": "none"
        },
        "Scope": "local"
    }
]


Let’s now add an index.html file that will be housed in the share and used by our Docker NGINX container. Create the file with:

nano /data/index.html


In that file, paste the following:
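
Any small, valid HTML page will do here; the content below is just a placeholder example:

<!DOCTYPE html>
<html>
  <head>
    <title>Served from a Samba share</title>
  </head>
  <body>
    <h1>Hello from NGINX, Docker and Samba!</h1>
  </body>
</html>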

Save and close the file.

Deploy the NGINX Container

We can now deploy our NGINX container that will use the index.html file in our public volume that is part of our Samba share. To do that, issue the command:

docker run -d --name nginx-samba -p 8090:80 -v public:/usr/share/nginx/html nginx


Once the container is deployed, point a web browser to http://SERVER:8090 (where SERVER is the IP address of the hosting server), and you should see the index.html file that we created above (Figure 1).

Figure 1: Our custom index.html has been officially served in a Docker container.
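
If you prefer the command line, a quick check from the hosting server itself (assuming the container is listening on port 8090 as above) might be:

curl -s http://localhost:8090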

Another really cool thing about this setup is that anyone with access to the Samba share can edit the index.html file (even with the container running) to change the page. You don’t even have to stop the container. You could even create a script to automate updates of the file if you like. For this reason, you need to be careful who has access to the share.

Congrats, you’ve just used Docker and Samba together. Although this might not be a wise choice for production environments, for dev or internal services/apps, it could certainly come in handy.

CIOs, Heed On-Premises App and Infrastructure Performance | https://thenewstack.io/cios-heed-on-premises-app-and-infrastructure-performance/ | Wed, 05 Jul 2023

Although legacy applications and infrastructure may not be a popular topic, their significance to organizations is crucial.

As cloud native technologies are poised to become a dominant part of computing, certain applications and infrastructure must remain on premises, particularly in regulated and other sensitive industries.

Amid the buzz surrounding no-code and low-code platforms, technologists must prioritize acquiring the appropriate tools and insights to manage on-premises environments’ availability and performance. Consumer expectations for flawless digital experiences continue to rise, so companies must optimize their on-premises customer-facing applications to accommodate.

For Some, On-Premises Infrastructure Will Remain Essential

Much of the recent digital transformation across multiple industries can be attributed to a substantial shift to the cloud. Cloud native technologies are in high demand due to their ability to expedite release velocity and optimize operations with speed, agility, scale and resilience.

Nevertheless, it’s easy to overlook the fact that many organizations, especially larger enterprises, still run their applications and infrastructure on premises. While this may seem surprising, it’s partially due to the time-consuming process of seamlessly and securely migrating highly intricate, legacy applications to the cloud. Often, only a portion of an application may be migrated to the cloud while major components will remain on-premises. Additionally, as cloud expenses continue to escalate, cost must be considered and closely managed. In today’s tough economic environment, business leaders and technologists are increasingly discerning about what and how they migrate to the cloud, keeping cost management top of mind.

However, the core reason organizations are keeping their applications on premises is control and visibility. Technologists want these two things for their applications and infrastructure: they want to know their data’s exact location and to manage upgrades within their own premises. In fact, this need for control is particularly common among large global brands with sensitive intellectual property (IP), as IT leaders perceive it as too great a risk to store their most valuable assets outside of their organization.

Naturally, there are also other industries where data privacy and security severely restrict organizations’ ability to migrate to the cloud. Some federal government agencies, for example, are required to operate air-gapped environments with no internet access, and there are stringent regulations governing the handling of citizen data in industries such as healthcare and pharmaceuticals.

Also, financial services institutions must comply with strict data sovereignty regulations to ensure customer data stays within the border of the operating country, making it impossible to relocate applications that manage customer data to a public cloud environment. It’s clear organizations must manage and optimize legacy applications within their on-premises environment, and this will remain the case for the foreseeable future.

Sudden Increases in Demand Handled On-Premises 

Technologists today are faced with the challenges of scaling on-premises applications and infrastructure to accommodate fluctuating demand. Cloud providers are able to excel in effectively managing this challenge through automatic workload scaling in modernized architectures.

Every industry experiences major spikes in demand, including retail, finance, travel or healthcare, making it essential for IT teams to prepare for these fluctuations with seamless and rapid scaling of their on-premises applications and infrastructure to get them through.

To achieve this seamless experience, organizations must use an observability platform with dynamic baselining capabilities that can trigger additional capacity in their hyperscaler environments.

Prepare Yourself for the Hybrid Future 

Over the next few years, many organizations are likely to adopt a hybrid strategy of maintaining mission-critical applications and infrastructure on-premises, while transitioning other IT elements into public cloud environments. By combining control and compliance of on-premises with the scale, agility and speed of cloud native, this approach offers the best outlook for a hybrid future.

As more applications run across on-premises and cloud environments, IT teams responsible for managing availability and performance face significant challenges. Today, most IT departments use separate tools to monitor on-premises and cloud applications, which leaves them without visibility across the entire application path in hybrid environments. IT leaders can’t visualize the path up and down the application stack and can’t derive business context, making it virtually impossible to troubleshoot issues quickly. This leaves them in firefighting mode, trying to solve issues before they affect end users. The risk of an IT department’s worst nightmare, such as an outage or damaging downtime, surges as metrics such as MTTR and MTTX inevitably rise.

To avoid these issues, IT teams require an observability platform for unified visibility across their entire IT estate. Through this platform, IT leaders can access real-time insights of IT availability and performance across both on-premises and public cloud environments and are able to correlate IT data with real-time business metrics, allowing them to prioritize issues that matter most to customers and the business.

While cloud native technologies will continue to dominate headlines, IT teams must also optimize availability and performance within on-premises environments. For many organizations, their most critical applications will remain on-premises for the foreseeable future. In return, it’s crucial for technologists to keep their eye on the ball and ensure they have the tools and visibility to monitor and manage highly dynamic and complex microservices environments, optimizing availability and performance at all times.

Hasura Launches New Data Network for APIs Only | https://thenewstack.io/hasura-launches-new-data-network-for-apis-only/ | Thu, 29 Jun 2023

Data networks are generally used for file sharing, application operations or internet access, but what about a network strictly for distributing application programming interfaces? After all, an API is pretty esoteric, given that it is not standard data but a set of rules that define how two pieces of software can interact with each other.

Well, that out-of-the-ordinary system now exists, and it’s designed to do a ton of heavy lifting behind the scenes that developers will appreciate.

Bangalore- and San Francisco-based Hasura recently launched Hasura DDN, a new edge network built around GraphQL and designed for transporting real-time, streaming and analytical data. It enables developers to run low-latency, high-performance data APIs at a global scale, with no additional effort and no additional fees, according to the company.

Hasura CEO and co-founder Tanmai Gopal told The New Stack that it is “the world’s first CDN (content delivery network) for data,” in which all projects deployed on Hasura Cloud are automatically deployed to an edge network of 100-plus global regions. It pre-connects all those hard-to-navigate networking nodes and protocols that take more time than they should to secure. Hasura automatically routes and executes client requests on the Hasura instance closest to the client, minimizing latency.

The edge-based network integrates with distributed databases including CockroachDB, Amazon Aurora, Yugabyte and others, Gopal said. The company is also unafraid to guarantee 99.99% uptime, which is an important consideration, he added.

“Our service is multicloud and multiregion, and we make sure that people can connect their sources of truth to the medium,” Gopal said. “The EVN (Easy Virtual Network, a Cisco creation that simplifies Layer 3 network virtualization) becomes the API that enables other applications that are external and other microservices or APIs — anything — connect to that layer and get access. So that’s the way that we think about it.”

Hasura’s GraphQL Engine provides GraphQL APIs over new or existing Postgres databases. With a query, it instantly composes a GraphQL API that is backed by databases and services so that the developer team gets immediately productive, Gopal said.
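
For a concrete sense of what this looks like to a developer, a Hasura GraphQL endpoint is queried with a single HTTP call. The project URL, users table and admin-secret header below are placeholders for illustration:

curl -s -X POST https://my-project.hasura.app/v1/graphql \
  -H 'Content-Type: application/json' \
  -H 'x-hasura-admin-secret: <ADMIN_SECRET>' \
  -d '{"query": "query { users { id name } }"}'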

The Rise of Polyglot Data

“A big change that has happened over the past few years is the rise of polyglot data. One general-purpose database is not going to fit all,” Gopal said. “First, you know that you’re going to need multiple databases for which you need to build next-generation applications (for various use cases). You’ll want to combine your general-purpose database with AI for a vector database; for real-time analytics solutions, again, you’d need something else to upgrade your general-purpose DB. So systems are becoming polyglot, which is the big goal for this.

“The second change is that we have so much data in so many different technologies. If only we could unify them, we would be able to extract so much more value, and be able to build next-generation applications, so we can add value for our users. That’s what we’re seeing, and that explains the timing of having this now become an infrastructure layer.”

Hasura DDN is made possible by a major architecture change of the Hasura engine that reduced its cold start time to under 1 millisecond, Gopal said. As a result, the Hasura runtime can be instantiated on the edge region when the API is invoked, enabling instant auto-scaling to handle any spike in traffic globally. Numbers folks will want to know that DDN is a cost-efficient way to get value-based API pricing, instead of infrastructure-based pricing and alternative approaches, such as always-on or warmed-up instances, he said.

Trending: New Methods of Developing and Managing APIs

“Organizations should not be caught up in a single way of doing API development and integration,” Paul Nashawaty, principal analyst at Enterprise Strategy Group, told The New Stack. “Often organizations do not realize there is a better way to enable the creation and usage of API performance at a global scale. By integrating with distributed databases, Hasura minimizes the latency from the consumer to the underlying data source. This can be achieved with both heritage and new data sources.”

Hasura GraphQL Engine’s new features include:

  • Instant GraphQL APIs: The engine can generate a GraphQL API from a Postgres database in seconds, making it easy to get started and to build new features.
  • Built-in authorization: Includes a built-in authorization engine that allows users to control who has access to data.
  • Real-time subscriptions: Supports real-time subscriptions, which allow users to keep clients up to date with changes to their data.
  • Webhooks: The engine can also generate webhooks, which allow users to be notified when certain events occur in their database. This is useful for integrating GraphQL APIs with other systems.

Hasura is used by a variety of companies, including Atlassian (powers its GraphQL APIs for Jira, Confluence, and Bitbucket), GitLab (GraphQL APIs for GitLab.com) and Red Hat (GraphQL APIs for OpenShift).

Unveiling the Future of Application Networking: Trends and Impacts | https://thenewstack.io/unveiling-the-future-of-application-networking-trends-and-impacts/ | Wed, 28 Jun 2023

Where are the application networking features heading, and how might this affect the way we design and approach distributed applications in the future? The revelations might surprise you. Let’s explore the shifting sands of application networking, focusing on the movement of networking concerns with the rise of the application cloud. Unraveling the intricacies of transparent, synchronous and asynchronous networking, let’s examine the migration and transformation of these aspects within the context of modern distributed architectures.

Transparent Networking Descends to the Platform

Distributed applications are composed of multiple components interacting with each other using networking. These interactions can be controlled at runtime transparently to the applications through service mesh and other similar technologies, or from within the application through explicitly implemented patterns such as point-to-point integration, event-driven or orchestration-based interactions.

Here I define transparent networking as control and monitoring mechanisms that can be added to the applications’ interactions with each other without the developers or the application implementation being aware of them. Think of service discovery or load balancing that is performed by Kubernetes without the application being aware of how it is done. Think of traffic shifting and resiliency features such as retry and circuit breaking performed by Istio’s sidecars. Think of mTLS, authentication and authorization, or network tracing and observability that you can get from the Linux kernel through Cilium’s eBPF-based implementation.

All of these features can be added to a distributed application at runtime without changing the application code and without developers implementing a single line of code within the application.

Transparent networking features merge with the runtime platform.

These concerns used to be addressed by developers in the application layer through language-specific libraries such as Apache Camel or Spring Cloud Netflix in the Java ecosystem, but today they are increasingly delegated to polyglot runtimes such as Dapr, or delegated to the platform layer through transparent sidecars such as Envoy, and even get deeply embedded with the compute platform in the case of eBPF technology and closed source networking cloud services.

Synchronous Networking in Transit Away from Applications

Synchronous interactions between applications are ones that don’t require any kind of intermediating persistent state store, such as a message broker, to offload a request into a medium that sits between the applications. As a result, the synchronous interactions I describe here are typically blocking interactions initiated by client applications and reaching the target service in the same invocation. The kind of application responsibilities considered here are connectors to various external APIs, calls between services within a solution and protocol conversions. This also includes content-based routing, filtering and light transformations of requests, aggregation of multiple messages into one or splitting of a large message into multiple. The last group can be done with a persistent state store, but here I consider the case where it is done on the fly, without persistence. Broadly, these application networking concerns consist of the message routing and message transformation patterns listed in the book “Enterprise Integration Patterns.”

While these concerns traditionally were implemented from within the application and popular mainly in the Java ecosystem, for example with projects such as Apache Camel and Spring Integration, today we can see these features moving out into purpose-built plug-and-play runtimes that can be used with many polyglot applications. Examples of these are Dapr sidecar, Apache Kafka Connect, Knative Event Sources, NATS and various managed cloud-based connectors and traffic routing services such as AWS API Gateway for routing traffic or AWS EventBridge for routing events, etc.

In all of these examples, the application passes a message to a separate runtime where message routing and transformation logic is performed, and the result is passed back to the application or forwarded to another application. The routing, filtering and transformation logic applied affects the shape of the data and the target it is flowing into.

Synchronous connectivity patterns move into plug-and-play runtimes.

In contrast to the transparent features that can be applied by the operations teams after an app has been implemented, synchronous networking capabilities are used by developers, and the application has to be designed and implemented with that in mind.

As a result, we can see synchronous networking features not sinking down into the platform transparently, but transforming from libraries into purpose-built reusable runtimes and cloud services that can be plugged into any applications and swapped when needed without affecting the application implementation. The latter is possible by designing the application with principles of hexagonal architecture and decoupling it from external dependencies through well-adopted open standards.

Today there is no single universally adopted standard or implementation in this space, but there are a few commonly used messaging patterns (such as filter, content-based router, wiretap, aggregator and splitter) that serve as a commonly understood language. These patterns are typically implemented through domain-specific languages or using the Common Expression Language spec, and act on data that is JSON or ProtoBuf formatted in a CloudEvents wrapper traveling on HTTP or gRPC protocols.

The Ascent of Asynchronous Networking Toward the Cloud

Asynchronous networking allows applications to store state in an external system, either for their own use or as a temporary store before exchanging data with another service. For example, a developer may use an external state store such as Redis for key/value access or an object store such as AWS S3 to store state and make the service stateless. An application may use a message broker such as Apache Kafka to publish an event that another service might be interested in. An application can start a business process stored in a persistent workflow engine such as Conductor, which needs to orchestrate interactions with other services. When we look at the end-to-end interaction between the source and target services, state is persisted in an intermediate system before being exchanged with other services. These asynchronous network interaction styles distribute state among participants in a predictable and reliable manner using a few well-known methods such as pub/sub, key/value access, orchestration, cron jobs, distributed locks, etc.

Asynchronous networking infrastructure transitions into SaaS.

Each of the asynchronous networking patterns offers a unique interaction style on top of state. Key/value and object stores offload state that is typically accessed from the same application. A message broker is used to asynchronously communicate between a publisher service and one or more recipient services. And workflows are used to coordinate complex stateful interactions between multiple applications or to trigger a service endpoint on time-based intervals.

There are other specialized examples of stateful application infrastructure too: for example, distributing application configurations from a central configuration store, distributing secrets, or mutually exclusive access to a resource using a distributed lock. These interactions are explicit to the application, and developers need to build the application to interact with these specialized systems. There are a number of APIs that are becoming widely adopted standards in each respective domain. For example, Redis, MongoDB and Amazon Web Services (AWS) S3 are popular APIs for key/value and document access. Apache Kafka, AMQP and NATS are examples of asynchronous interaction protocols. Camunda, Conductor and Cadence are examples of stateful orchestration engines.

While these projects focus on a single type of stateful interaction and provide both the implementation and the API, the Dapr project is focused on providing unified APIs for different interaction styles and plugging them into existing backend implementations. For example, the Dapr state store API can be used with Redis, MongoDB, PostgreSQL and others. The Dapr pub/sub API can be used with Kafka, AWS SQS, GCP Pub/Sub, Azure Event Hubs and others. Its configuration, secret and distributed lock APIs also plug into existing infrastructure systems and offer unified, polyglot, higher-level HTTP- and gRPC-based protocols to abstract these backends.
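
As a rough sketch of what these unified APIs look like from the application’s side, the calls below assume a Dapr sidecar listening on its default HTTP port 3500, with components named statestore and pubsub (the component names, key and topic are just examples):

# save a key/value pair through the state building block
curl -X POST http://localhost:3500/v1.0/state/statestore \
  -H 'Content-Type: application/json' \
  -d '[{"key": "order-1", "value": {"status": "pending"}}]'

# publish an event through the pub/sub building block
curl -X POST http://localhost:3500/v1.0/publish/pubsub/orders \
  -H 'Content-Type: application/json' \
  -d '{"orderId": "order-1", "status": "pending"}'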

In response to the inherent complexities of managing state, the industry is witnessing a remarkable shift where asynchronous networking capabilities are increasingly being offered as SaaS solutions. This transition streamlines the adoption process, simplifies scalability and enhances the manageability of these services.

Apache Kafka, a widely used message broker, is now accessible as Confluent Cloud and AWS Managed Streaming for Apache Kafka (MSK). Similarly, key/value stores like Redis and MongoDB, traditionally managed in-house, have evolved into cloud services. Redis Labs’ fully managed cloud service and MongoDB Atlas’ globally available service with integrated resource and workload optimization are testaments to this shift.

Stateful workflow systems, too, have entered the SaaS realm, simplifying developers’ task of orchestrating intricate stateful interactions between applications. AWS Step Functions, Temporal Cloud, Orkes and Diagrid Cloud are leading this evolution. This SaaS transformation of stateful networking projects is driven by the desire to abstract state management complexities. It enables developers to focus on business logic rather than intricate asynchronous interactions.

The Divergent Path for Application Networking

Distributed applications are composed of multiple components distributed across numerous processes that interact with each other over the network. The primary advantages of distributed applications such as faster release cycles and scalability lie in how different networking patterns facilitate isolation of dependencies and state distribution among participants in a predictable manner. However, networking introduces new challenges in terms of the distributed systems programming model, reliability, security and observability. Similarly to how container adoption shifted significant application responsibilities from developers to operations teams, we can observe a shift in different types of networking concerns too.

Transparent networking features, while limited in their capabilities, are becoming more prevalent as they are integrated into platform offerings. With the right platform capabilities available, developers no longer need to bother with network security, observability and traffic management.

Stateless interactions combine networking with knowledge of data formats and message transformation logic. Such interactions are increasingly made reusable via standard connectors and enterprise integration patterns implemented as purpose-built distributed system middleware. Developers don’t have to continually reinvent the wheel in every language and application stack, but plug such capabilities into their applications at runtime. Given enough time, these networking patterns become reusable libraries, purpose-built frameworks and sidecars, and eventually transition into cloud-based APIs.

Asynchronous interactions present a higher degree of complexity as they necessitate behind-the-scenes state management. These networking interactions are often provided as specialized standalone software or as managed services, ideally fronted by widely adopted APIs. Unlike transparent APIs that sink into the compute layer and are primarily used by operations teams, asynchronous networking interactions arise in cloud offerings, created for application developers.

The evolution of application networking concerns

This progression of networking responsibilities is anticipated to drive transparent runtime and networking features further into the compute platform. Meanwhile, explicit features will continue to consolidate, forming common APIs and elevating into the cloud as ubiquitous serverless capabilities. Appropriately delegating responsibilities across different layers and choosing the appropriate standardized APIs for diverse networking tasks is becoming increasingly essential. Consequently, this megatrend will empower developers to focus on implementing business logic, integrating other capabilities either transparently or via universally recognized and portable APIs.

Red Hat Launches OpenStack Platform 17.1 with Enhanced Security | https://thenewstack.io/red-hat-launches-openstack-platform-17-1-with-enhanced-security/ | Wed, 14 Jun 2023

VANCOUVER — At OpenInfra Summit here, Red Hat announced the impending release of its OpenStack Platform 17.1. This release is the product of the company’s ongoing commitment to support telecoms as they build their next-generation 5G network infrastructures.

In addition to bridging existing 4G technologies with emerging 5G networks, the platform enables advanced use cases like 5G standalone (SA) core, open virtualized radio access networks (RAN), and network, storage, and compute functionalities, all with increased resilience. And, when it comes to telecoms, the name of the game is resilience. Without it, your phone won’t work, and that can’t happen.

Runs On OpenShift

The newest version of the OpenStack Platform runs on Red Hat OpenShift, the company’s Kubernetes distro, with Red Hat Enterprise Linux (RHEL) 8.4 or 9.2 underneath. This means it can support logical volume management partitioning and Domain Name System as a Service (DNSaaS).

The volume management partitioning enables short-lived snapshot and revert functionality, which lets service providers revert to a previous state during an upgrade if something goes wrong. Of course, we all know that everything goes smoothly during updates and upgrades. Not.

This take on DNSaaS includes a framework for integration with Compute (Nova) and OpenStack Networking (Neutron) notifications, allowing auto-generated DNS records. In addition, DNSaaS includes integration support for Bind9.

Other Improvements

Red Hat also announced improvements to its Open Virtual Network (OVN) capabilities, the Octavia load balancer, and virtual data path acceleration. These enhancements ensure higher network service quality and improved OVN migration time for large-scale deployments.

OpenStack Platform 17.1 continues its legacy of providing a secure and flexible private cloud built on open source foundations. This latest release offers role-based access control (RBAC), FIPS-140 (ISO/IEC 19790) compatibility, federation through OpenID Connect, and Fernet tokens, ensuring a safer, more controlled IT environment.

Looking ahead to the next version, Red Hat software engineers are working on making it much easier to upgrade its OpenStack distro from one version to the next. Historically, this has always been a major headache for all versions of OpenStack. Red Hat’s control plane-based approach, a year or so in the future, sounds very promising.

WithSecure Pours Energy into Making Software More Efficient | https://thenewstack.io/withsecure-pours-energy-into-making-software-more-efficient/ | Thu, 01 Jun 2023

WithSecure has unveiled a mission to reduce software energy consumption, backing research on how users trade off energy consumption against performance and developing a test bench for measuring energy use, which it ultimately plans to make open source.

The Finnish cyber security firm has also kicked off discussions on establishing standards for measuring software power consumption with government agencies in Finland and across Europe, after establishing that there is little in the way of guidance currently.

Power Consumption

Power consumption by backend infrastructure is a known problem. Data centers, for example, account for up to 1.3% of worldwide electricity consumption, according to the International Energy Agency. While this figure has stayed relatively stable in recent years, it excludes the impact of crypto mining, which accounts for almost half as much.

A report for the UK Parliament last year cited estimates that user devices consume more energy than networks and data centers combined.

Leszek Tasiemski, WithSecure’s vice president for product management, spoke at Sphere 2023 in Helsinki, saying that most of the firm’s own operations run in the cloud, which gives it good visibility into the resources it is using and their CO2 impact.

Most of the data centers it uses already run on renewable energy sources, he said, and it was already “optimizing the code as much as we can so that it performs less operations. Or it performs the same operations with a different approach or a different programming language or different libraries so that it results in less CPU cycles, less I/O operations.”

It is harder to have an impact on power consumption outside of the platforms it directly controls, says Tasiemski, but the firm is working to optimize the agent software its clients run on their systems.

The energy consumption of the WithSecure agent, which runs on clients’ devices, might be relatively small, but Tasiemski said, “This is where we have economies of scale. We have millions of devices out there.”

This would benefit the users, he said. “This is not for our direct benefit. It’s not our electricity bills, it’s not our heat to remove.” He added that, as for its own systems, lowering energy usage usually means better performance. “It’s not always black and white, but it’s related.”

The Challenge

The challenge is how to do this without compromising security. Users can vary settings in WithSecure’s profile editor, for example, how often scans are run. Optimizing or adjusting these settings could reduce resource use. But this could also be dangerous if admins are so focused on energy reduction that they dial things back too far.

So it has kicked off research at Poland’s Poznan University of Technology to examine how general users and security pros are likely to visualize energy consumption versus risk appetite. “We are doing this research to see how we can, in a responsible way, show this information,” said Tasiemski.

Tasiemski said another problem is that there aren’t many standards for measuring energy consumption by software, so WithSecure intends to meet with government organizations and institutions to try to kickstart a conversation. There is no tangible work at present, either in Finland or the European Commission. He said there seems to be some work going on in France, so he is trying to contact the relevant organizations there.

“In the case of software, it’s incredibly hard to figure out common standards for energy efficiency. We have it for buildings. Buildings are also not the easiest thing to measure, so I think it can be done.”

He said there was no direct commercial objective to this. “I absolutely don’t mind if somebody steals our idea. We do the research; it will be open to everybody. So if other companies would like to use it, yeah, go ahead.”

Likewise, he said, WithSecure has built a test bench for measuring energy usage of software. It has been using this since January to measure the power consumption of its ongoing agent releases by modeling typical user behavior. The goal is to establish a baseline against which it can measure progress in reducing consumption over time.

“I absolutely wouldn’t mind open sourcing that because this is not our core business, and it’s only for the greater good.” He said the biggest brake on making it open source so far is that it was still being tweaked and he wanted to be sure the documentation was good enough.

But ultimately, making such tools open source was the right thing to do, he said. “It doesn’t make sense if every company builds things like that on their own because it’s going to be built in a different way. Classical reinventing the wheel.” And that would be a waste of resources in itself.

Don’t Force Containers and Disrupt Workflows | https://thenewstack.io/dont-force-containers-and-disrupt-workflows/ | Thu, 25 May 2023

How do you allow people to use their technologies in their workflows? The first thing you do is not force people to use containers, says Rob Barnes, a senior developer advocate at HashiCorp, in this episode of The New Stack Makers.

Barnes came by The New Stack booth at KubeCon Europe in Amsterdam to discuss how HashiCorp builds intent into Consul so users may use containers or virtual machines in their workflows.

Consul from HashiCorp is one of the early implementations of service mesh technology, writes Janakiram MSV in The New Stack. “It comes with a full-featured control plane with service discovery, configuration, and segmentation functionality. The best thing about Consul is the support for various environments including traditional applications, VMs, containers, and orchestration engines such as Nomad and Kubernetes.”

Consul is, at heart, a networking service that provides identity, for example, in Kubernetes. A service mesh knows about all services across the stack. In Kubernetes, Helm charts get configured to register the services to Consul automatically. That’s a form of intent. Trust is critical to that intent in Kubernetes.

“We can then assign identity — so in a kind of unofficial way, Consul has almost become an identity provider for services,” Barnes said.

In Consul, identity helps provide more granular routing to services, Barnes said. Consul can dictate which services can talk to each other, and the intent gets established through a rules-based system that, for instance, allows some service-to-service connections and denies others.
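
In Consul these rules are expressed as “intentions.” A sketch using the Consul CLI, with hypothetical web and db services, might look like this:

consul intention create -allow web db   # allow the web service to call db

consul intention create -deny '*' db    # deny every other service from reaching db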

“I think that’s an opportunity that HashiCorp has taken advantage of,” Barnes said. “We can do a lot more here to make people’s lives easier and more secure.”

So what’s the evolution of service mesh?

“There’s a lot of misconceptions with service mesh,” Barnes said. “As I say, I think people feel that if you’re using service meshes, that means you’re using containers, right? Whereas, like, I can speak for Consul specifically, that’s not the case. Right? I think the idea is that if more service meshes out, they make themselves a bit more flexible and meet people where they are. I think the adoption of the service mesh, and all the good stuff that comes with it, is only going to grow.”

“So I think what’s next for service mesh isn’t necessarily the service mesh itself. I think it’s people understanding how it fits into the bigger picture. And I think it’s an educational piece and where there are gaps, maybe we as vendors need to make some advances.”

The post Don’t Force Containers and Disrupt Workflows appeared first on The New Stack.

]]>
How to Decide Between a Layer 2 or Layer 3 Network https://thenewstack.io/how-to-decide-between-a-layer-2-or-layer-3-network/ Tue, 25 Apr 2023 17:00:10 +0000 https://thenewstack.io/?p=22704440

As communication service providers (CSPs) continue to provide essential services to businesses and individuals, the demand for faster and more

The post How to Decide Between a Layer 2 or Layer 3 Network appeared first on The New Stack.

]]>

As communication service providers (CSPs) continue to provide essential services to businesses and individuals, the demand for faster and more reliable network connectivity continues to grow in both scale and complexity. To meet these demands, CSPs must offer a variety of connectivity services that provide high-quality network performance, reliability and scalability.

When it comes to offering network connectivity services, CSPs have many options for providing Layer 2 (data link) or Layer 3 (network, or packet, layer) connectivity, referring to layers of the Open Systems Interconnection (OSI) model for network communication.

This article will explore some of the advantages and benefits of each type of connectivity to help CSPs determine which one may be better suited to different types of environments or applications.

What Is Layer 2 Connectivity?

At a basic level, Layer 2 connectivity refers to the use of the data link layer of the OSI Model. It is often used to connect local area networks (LANs) or to provide point-to-point connectivity between two networks or even broadcast domains or devices.

Often, Layer 2 connectivity is referred to as Ethernet connectivity, as Ethernet is one of the most common Layer 2 protocols used today, and it comes with several advantages.

First off, Layer 2 connectivity generally provides low latency as it requires fewer network hops than Layer 3 connectivity. This makes it ideal for applications that require low latency, such as real-time voice, video or highly interactive applications.

Layer 2 connectivity is also relatively simple to configure and maintain when compared to Layer 3 connectivity. It reduces the complexity of network configurations by eliminating the need for complex routing protocols and configurations. This makes it an attractive option for small- to medium-sized businesses that do not have dedicated IT resources.

In addition to offering low latency and simplicity, Layer 2 connectivity also provides high network performance as it can take advantage of the full bandwidth of the network.

What Is Layer 3 Connectivity?

On the other hand, Layer 3 connectivity refers to the use of the network layer of the OSI model for network communication. It is often used to provide wide area network (WAN) connectivity, to connect different LANs and to provide access to the internet. Layer 3 connectivity is often referred to as IP connectivity, as IP is the most common Layer 3 protocol used today.

As with Layer 2, Layer 3 connectivity comes with its own set of advantages.

To start, Layer 3 connectivity is highly scalable and can handle large networks with many devices. Likewise, it provides flexibility in terms of routing and network design, making it suitable for complex network architectures.

Unlike Layer 2, Layer 3 connectivity provides enhanced security features, including firewalls and virtual private networks (VPNs), which can protect the network from external threats.

Additionally, it can help reduce network congestion by routing traffic more efficiently rather than relying on the management of large broadcast domains.

Layer 2 vs. Layer 3 Connectivity: Which Is Better?

The decision to use Layer 2 or Layer 3 connectivity depends on the specific needs of the application(s) or network. However, there are some general guidelines to consider.

For local network connectivity, Layer 2 connectivity is generally more suitable. It provides low latency and high performance, making it ideal for real-time applications such as voice and video.

For wide-area network connectivity, on the other hand, Layer 3 connectivity is generally more suitable as it provides scalability, flexibility and enhanced security features, making it ideal for connecting different LANs and for accessing the internet.

For applications that require both local and wide area network connectivity, a combination of Layer 2 and Layer 3 connectivity might be necessary to achieve optimal network performance.

Both Layer 2 and Layer 3 connectivity have their own distinct advantages and benefits.

While Layer 2 connectivity is simple to configure, provides low latency and high performance and is ideal for local network connectivity, Layer 3 connectivity is highly scalable, flexible and provides enhanced security features, making it ideal for wide-area network connectivity.

By weighing the network qualities they need, CSPs can determine the best connectivity service for their application and environment.

The post How to Decide Between a Layer 2 or Layer 3 Network appeared first on The New Stack.

]]>
Linkerd Service Mesh Update Addresses More Demanding User Base https://thenewstack.io/linkerd-service-mesh-update-addresses-more-demanding-user-base/ Tue, 11 Apr 2023 13:17:14 +0000 https://thenewstack.io/?p=22704836

Five years ago, when the hype around the service mesh was at its greatest, Buoyant CEO William Morgan, fielded a

The post Linkerd Service Mesh Update Addresses More Demanding User Base appeared first on The New Stack.

]]>

Five years ago, when the hype around the service mesh was at its greatest, Buoyant CEO William Morgan fielded a lot of questions about the company’s flagship open source service mesh software, Linkerd. Many in the open source community were very curious about what it could do and what it could be used for.

These days, Morgan still gets questions, but now they are a lot more pointed, about how Linkerd would work in a specific situation. Users are less worried about how it works and more concerned about just getting the job done. So they are more direct about what they want, and what they want to pay for.

“In the very early days of the service mesh, a lot of open source enthusiasts who were excited about the technology wanted to get to the details, and wanted to do all the exciting stuff,” Morgan explained. “Now the audience coming in just wants it to work. They don’t want to get into the details, because they’ve got like a business to run.”

In anticipation of this year’s KubeCon + CloudNativeCon EU, Buoyant has released an update to Linkerd. Version 2.13 includes new features such as dynamic request routing, circuit breaking, automated health monitoring, vulnerability alerts, proxy upgrade assistance, and FIPS-140 “compatibility.”

And on April 18, the day before the Amsterdam-based KubeCon EU 2023 kicks off in earnest, the first-ever Linkerd Day co-located conference will be held.

What Is a Service Mesh?

Categorically, service mesh software is a tool for adding reliability, security, and observability features to Kubernetes environments. Kubernetes is a platform for building platforms, so it is not meant for managing the other parts of a distributed system, such as networking, Morgan explained.

In the networking realm, service mesh software handles the additional networking needs beyond the simple TCP handshake Kubernetes offers, such as retries, mitigating failing requests, sending traffic to other clusters, encryption and access management. The idea with the service mesh is to add a “sidecar” to each instance of the application, so developers don’t have to mess with all these aspects, with which they may not be familiar.
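To make that contrast concrete, here is a minimal sketch of the retry-with-backoff plumbing an application would otherwise carry itself; with a mesh, this moves into the sidecar and out of application code. The function and parameter names are illustrative only, not any particular mesh’s API.

import random
import time

def call_with_retries(request_fn, attempts=3, base_delay=0.2):
    """Retry a failing call with exponential backoff and a little jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return request_fn()
        except ConnectionError:
            if attempt == attempts:
                raise
            # Back off: 0.2s, 0.4s, 0.8s ... plus jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1))

# Usage sketch: wrap whatever callable performs the network request.
# call_with_retries(lambda: flaky_service.get("/orders"))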

There are multiple service mesh packages — Istio, Consul, Traefik Mesh and so on — but what defines Linkerd specifically is its ease of use, Morgan said.

“When people come to us because they recognize the value of a service mesh, they want to add it to their stack,” Morgan said. “But they want a simple version, they don’t want a complicated thing. They don’t want to have to have a team of four service mesh engineers on call.”

Buoyant likes to tout Linkerd as the Cloud Native Computing Foundation’s “only graduated service mesh” (CNCF also provides a home for Istio, though that service mesh is still at an incubating level). The graduated status simply means that Linkerd is not some “fly-by-night open source thing that’s just been around for six months. It’s a recognition of the maturity of the project.”

New Features of Linkerd 2.13

For Kubernetes users, the newly-added dynamic request routing provides fine-grained control over the routing of individual HTTP and gRPC requests.

To date, Linkerd has offered a fair amount of traffic shaping, such as the ability to send a certain percentage of traffic to a different node. Now, the level of granularity is much finer, with the ability to parse traffic by, say, query parameter or a specific URL. Requests can be routed based on HTTP headers, gRPC methods, query parameters or almost any other aspect of the request.

One immediate use case that comes to mind is sticky sessions, where all a user’s transactions take place on a single node, in order to get the full benefit of caching. User-based A/B testing, canary deploys, and dynamic staging environments are some of the other possible uses. And they can be set up either by the users themselves, or even by third-party software vendors who want to offer specialized services around testing, for instance.

Linkerd’s dynamic request routing came about thanks to the Kubernetes Gateway API. Leveraging the Gateway API “reduces the amount of new configuration machinery introduced onto the cluster,” Buoyant states in its press materials. Although the Gateway API standard, which concerns network ingress, wasn’t specifically designed to address service mesh “east-west” capabilities, many of the same resource types can also be used to shape east-west traffic, relieving the administrative burden of learning yet another configuration syntax, Morgan said admiringly of the standard.

(Morgan also pointed to a promising new initiative within the Kubernetes community, called GAMMA, which would further synthesize service mesh requirements into the Gateway API.)

Another new feature with Linkerd: Circuit breaking, where Kubernetes users can mark services as delicate, so that meshed clients will automatically reduce traffic should these services start throwing a lot of errors.
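Circuit breaking is a general resilience pattern rather than anything unique to Linkerd, and Linkerd’s version is configured declaratively and enforced by its proxies rather than hand-written. Still, a minimal Python sketch, with thresholds chosen arbitrarily for illustration, shows the behavior being described:

import time

class CircuitBreaker:
    """After too many consecutive failures, stop sending traffic for a cool-down period."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed and traffic flows

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: backend is shedding load")
            # Cool-down elapsed: let a trial request through.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result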

Security, Gratis

A version of the 2.13 release comes in “a FIPS-compatible form,” the company asserts.

Managed by the U.S. National Institute of Standards and Technology (NIST), the Federal Information Processing Standard (FIPS, currently at version 3) is a set of standards for deploying encryption modules, with requirements around interfaces, operating environments, security management and lifecycle assurance. It is a government requirement for any software that touches encrypted traffic. Many other industries, such as finance, also follow the government’s lead in using FIPS-compliant products.

That said, Linkerd is not certified for use by the U.S. government. “Compatible” means Buoyant feels it could pass muster with a NIST-accredited lab, though the company has no immediate plans to certify the software.

And, finally, Buoyant itself is offering all Linkerd users basic health monitoring, vulnerability reporting and upgrade assistance through its Buoyant Cloud SaaS automation platform. This feature is for all users, even of the open source version, not just paid subscribers.

“We realized a lot of Linkerd users out there are actually in a vulnerable position,” Morgan explained. “They aren’t subscribed to the security mailing lists. They’re not necessarily monitoring the health of their deployments. They’re avoiding upgrades because that sounds like a pain. So we’re trying to provide them with tools. Even if it’s pure, open source, they can at least keep their clusters secure, and healthy and up to date.”

Of course, those with the paid edition get a more in-depth set of features.

To upgrade to Linkerd 2.13 or install it fresh, start here, or search for it on the Azure Marketplace.

The post Linkerd Service Mesh Update Addresses More Demanding User Base appeared first on The New Stack.

]]>
Wireshark Celebrates 25th Anniversary with a New Foundation https://thenewstack.io/wireshark-celebrates-25th-anniversary-with-a-new-foundation/ Tue, 28 Mar 2023 12:00:46 +0000 https://thenewstack.io/?p=22702112

No doubt, countless engineers and hackers remember the first time they used Wireshark, or — if they’re a bit older

The post Wireshark Celebrates 25th Anniversary with a New Foundation appeared first on The New Stack.

]]>

No doubt, countless engineers and hackers remember the first time they used Wireshark, or — if they’re a bit older — Wireshark’s predecessor, Ethereal. The experience of using Wireshark is a bit like what Robert Hooke must have felt in 1665 when using the newly-developed microscope to view cells for the first time ever: What was once just an inscrutable packet had opened up to reveal a treasure trove of useful information.

This year, the venerable Wireshark has turned 25, and its creators are taking a step back from this massively successful open source project to let additional parties help govern it. This month, Sysdig, the current sponsor of Wireshark, launched a new foundation that will serve as the long-term custodian of the project. The Wireshark Foundation will house the Wireshark source code and assets, and manage SharkFest, Wireshark’s developer and user conference (Singapore April 17-19 and San Diego June 10-15).

The creators call the software the “world’s foremost traffic protocol analyzer” with considerable justification. Just in the past five years, it has been downloaded more than 60 million times and has attracted more than 2,000 contributors. Today, Wireshark is free and available under the GNU General Public License (GPL) version 2.

Wireshark provides a glimpse into the traffic going across your network at a packet level, allowing users to understand the system better and diagnose problems. A powerful built-in data parsing engine is only half the appeal; an extensible design has allowed others to easily provide plug-ins for an endless array of new protocols and data formats.

There were packet analyzers prior to Ethereal, of course, but they were expensive.

When network engineer Gerald Combs first released this code as open source in 1998, he democratized IP packet inspection for everyone. And a few years later, when WiFi was being introduced, Ethereal was put into action by every system administrator trying to fix a buggy WiFi connection. It also inspired an entire generation of hackers — friendly or otherwise — to sniff out unsecured wireless connections (“wardriving”).

“Wireshark is my favourite ‘I told you so’ tool. You can’t imagine how useful it is for network troubleshooting,” one Hacker News commenter enthused.

Network Observability for All

Combs created Ethereal while working as an engineer for a Kansas City Internet Service Provider, for the purposes of troubleshooting. At the time, the only packet sniffers available were costly, and the ISP didn’t have a budget for one (which could run into tens of thousands of dollars).

This was a few years into the commercial use of the Internet, and so when Combs released Ethereal, he immediately started getting contributions from others.

One of those early contributors was Loris Degioanni, now CTO and Founder of cloud security company Sysdig. He was in school at the time. His computer network professor had said that the best way to understand the network is to observe the network. But since there were no inexpensive packet sniffers for Windows, Degioanni wrote WinPcap, a driver for capturing packets in Windows machines, which many people immediately started using with Ethereal.  

One factor for Ethereal’s success was its extensibility. It allowed many developers to work in parallel, creating plug-ins that would run on top of Ethereal’s network analysis capabilities. In this way, it was “really easy for the project to accumulate features and functionality and become more and more useful at a very rapid pace,” Degioanni said.

Contributions came in not just from students and hobbyists, but from engineers at actual companies, which found it more cost-effective to dedicate an engineer to creating and maintaining support for some obscure protocol that otherwise would require a more expensive tool to analyze.

The killer use, however, came from the emerging use of wireless (WiFi) networks. When it was introduced for home use in 1999, WiFi was still incredibly buggy. Degioanni worked with Combs to develop a plug-in for inspecting 802.11 wireless traffic on Windows XP, called AirPcap, which proved helpful for many who just wondered why their packets seemingly vanished into the air.

With wireless, Ethereal also attracted the attention of hackers, who could use network analysis to intercept wireless data packets from people and companies while sitting outside in a car with a laptop and a copy of Ethereal.

“It’s not a community we specifically cater to, but it’s a community that finds the tool to be useful,” Combs said. “I don’t know that it was a surprise that the security industry latched on to it. But it has been interesting seeing how that developed.”

The two thought there would be a business in this market, so they set off to start CACE Technologies (since purchased by Riverbed) to manage Wireshark and related technologies. Combs’ prior employer held the trademark for Ethereal, so the duo forked the technology, renaming it Wireshark.

Today, the software is being used across a wide range of industries, each with its own set of oddball protocols and network traffic patterns to be grappled with.  When Degioanni launched Sysdig in 2013, it immediately put Wireshark to use in helping parse log data in real time from the cloud providers, Degioanni said.

At its heart, Ethereal had a powerful dissection engine. You could feed it “these blobs of data and it will break them down and tear them apart and show you all the various bits and bytes needed to its best ability,” Combs said. “And this also lets you apply filters and apply all these other powerful features. But the thing is that the engine doesn’t really care if it’s packet data, it can be any sort of data you want.”

Currently, for instance, Combs is looking to extend it to non-IP sources of data such as Bluetooth and USB devices.

New Foundation

Beyond its massive usefulness, Wireshark has also played a role in educating generations of programmers and administrators on how a network works. Just looking at the GUI as it decodes packets off the wire, you can get a sense of how the Internet actually works.

“I think it’s important to educate people about low-level analysis, whether it’s in packets, system events or system calls,” Combs said. “I think that’s very important knowledge to pass on and to educate people on.”

Education will be one of the chief missions of the new Wireshark Foundation, which will provide a formal support structure for Wireshark, Combs said. Today, Wireshark’s chief income comes from its conferences; with the foundation, the project will be able to accept contributions directly.

It will also provide some much-needed relief to Combs.  To date, Combs has been the chief maintainer, or the “benign dictator,” so to speak. The foundation will shift the structure to something more resembling a benign democracy.

“You can tell that all of us are starting to get a little bit of gray hair. And it’s pretty clear at this point, that Wireshark is big enough, relevant enough for the whole planet, that it is going to survive us,” Degioanni said.

The post Wireshark Celebrates 25th Anniversary with a New Foundation appeared first on The New Stack.

]]>
This Week in Computing: Malware Gone Wild https://thenewstack.io/this-week-in-computing-malware-gone-wild/ Sat, 25 Mar 2023 14:10:18 +0000 https://thenewstack.io/?p=22703513

Malware is sneaky AF. It tries to hide itself and cover up its actions. It detects when it is being

The post This Week in Computing: Malware Gone Wild appeared first on The New Stack.

]]>

Malware is sneaky AF. It tries to hide itself and cover up its actions. It detects when it is being studied in a virtual sandbox, and so it sits still to evade detection. But when it senses a less secure environment — such as an unpatched Windows 7 box — it goes wild, as if possessing a split personality.

In other words, malware can no longer be fully understood simply by studying it in a lab setting, asserted University of Maryland associate professor Tudor Dumitras in a recently posted talk from USENIX‘s last-ever Enigma security and privacy conference.

Today, most malware is studied by examining the execution traces that the malicious program generates (“dynamic malware analysis”). This is usually done in a controlled environment, such as a sandbox or virtual machine. Such analysis creates the signatures used to describe the behavior of the malicious software.

The malware community, of course, has long been hip to this scrutiny and has developed an evasion technique known as red pills, which helps malware detect when it is in a controlled environment and change its behavior accordingly.

As a result, many of the signatures used for commercial malware detection packages may not be able to adequately identify malware in all circumstances, depending on what traces the signature actually captured.

What we really need, Dumitras said, is execution traces from the wild. Dumitras led a study that collected info on real-world attacks, consisting of over 7.6 million traces from 5.4 million users.

“Sandbox traces can not account for the range of behaviors encountered in the wild.”

They found that, as Dumitras expected, traces collected in a sandbox rarely capture the full behavior of malware in the wild.

In the case of the WannaCry ransomware attack, for instance, sandbox tracing only caught 18% of all the actions that the ransomware executed in the wild.

For the keepers of malware detection engines, Dumitras advised using traces from multiple executions in the wild. He advised using three separate traces, as diminishing returns set in after that.

Full video of the talk here:

Reporter’s Notebook

“So far, having an AI CEO hasn’t had any catastrophic consequences for NetDragon Websoft. In fact, since Yu’s appointment, the company has outperformed Hong Kong’s stock market.” — The Hustle, on replacing CEOs with AI Chatbots.

AI “Latent space embeddings end up being a double-edged sword. They allow the model to efficiently encode and use a large amount of data, but they also cause possible problems where the AI will spit out related but wrong information.” — Geek Culture, on why ChatGPT lies.

“We think someone who writes for a living needs to constantly be thinking about the best way to express complex ideas in their own words.” — Wired, on its editorial use of generative AI.

“I think with Kubernetes, we did a decent job on the backend. But we did not get developers, not one little bit. That was a missed opportunity to really bring the worlds together in a natural way” — Kubernetes co-founder Craig McLuckie, on how the operations-centric Kubernetes perplexed developers (See: YAML), speaking at a Docker press roundtable this week.

McLuckie also noted that 60% of machine learning workloads now run on Kubernetes.

“After listening to feedback and consulting our community, it’s clear that we made the wrong decision in sunsetting our Free Team plan. Last week we felt our communications were terrible but our policy was sound. It’s now clear that both the communications and the policy were wrong, so we’re reversing course and no longer sunsetting the Free Team plan” —Docker, responding to the outcry in the open source community over the suspension of its free Docker Hub tier for teams.

“Decorators are by far the biggest new feature, making it possible to decorate classes and their members to make them more easily reusable. […] Decorators are just syntactic glue aiming to simplify the definition of higher-order functions” — Software Engineer Sergio De Simone on the release of TypeScript 5.0, in InfoQ.

“If these details cannot be hidden from you, and you need to build a large knowledge base around stuff that does not directly contribute to implementing your program, then choose another platform.” — Hacker News commenter, on the needless complexity that came with using Microsoft Foundation Classes (MFC) for C++ coding.

Now 25 years old, the venerable Unix curl utility can enjoy an adult beverage in New Delhi.

Ken Thompson “has a long and storied history of trolling the computer industry […] he revealed, during his Turing Award lecture, that he had planted an essentially untraceable back door in the original C compiler… and it was still there.” — Liam Proven, The Register.

“It’s just like planning a dinner. You have to plan ahead and schedule everything so it’s ready when you need it.” —  Grace Hopper, 1967, explaining programming to the female audience of Cosmopolitan.

The post This Week in Computing: Malware Gone Wild appeared first on The New Stack.

]]>
JWTs: Connecting the Dots: Why, When and How https://thenewstack.io/jwts-connecting-the-dots-why-when-and-how/ Mon, 20 Mar 2023 14:10:34 +0000 https://thenewstack.io/?p=22702931

JSON web tokens (JWTs) are great — they are easy to work with and stateless, requiring less communication with a

The post JWTs: Connecting the Dots: Why, When and How appeared first on The New Stack.

]]>

JSON web tokens (JWTs) are great — they are easy to work with and stateless, requiring less communication with a centralized authentication server. JWTs are handy when you need to securely pass information between services. As such, they’re often used as ID tokens or access tokens.

This is generally considered a secure practice as the tokens are usually signed and encrypted. However, when incorrectly configured or misused, JWTs can lead to broken object-level authorization or broken function-level authorization vulnerabilities. These vulnerabilities can expose a state where users can access other data or endpoints beyond their privileges. Therefore, it’s vital to follow best practices for using JWTs.

Knowing and understanding the fundamentals of JWTs is essential when determining a behavior strategy.

What Are JWTs?

JWT is a standard defined in RFC 7519, and its primary purpose is to pass a JSON message between two parties in a compact, URL-safe and tamper-proof way. The token looks like a long string divided into sections and separated by dots. Its structure depends on whether the token is signed (JWS) or encrypted (JWE).

JWS Structure

JWE Structure
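To make the dot-separated layout concrete, the following self-contained Python sketch splits a JWS-style token and base64url-decodes its header and payload. The token it builds is a throwaway with a fake signature, and decoding like this is inspection only, not validation.

import base64
import json

def b64url_decode(segment: str) -> bytes:
    # JWTs use unpadded base64url; restore the padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))

def b64url_encode(data: dict) -> str:
    raw = json.dumps(data, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

def inspect_jws(token: str) -> None:
    header_b64, payload_b64, signature_b64 = token.split(".")
    print("header:   ", json.loads(b64url_decode(header_b64)))
    print("payload:  ", json.loads(b64url_decode(payload_b64)))
    print("signature:", len(b64url_decode(signature_b64)), "bytes (opaque)")

# Build a throwaway token just to have something to inspect; real tokens come
# from your authorization server and carry a real signature.
demo = ".".join([
    b64url_encode({"alg": "ES256", "typ": "JWT"}),
    b64url_encode({"iss": "https://issuer.example", "sub": "user-123", "exp": 1700000000}),
    base64.urlsafe_b64encode(b"\x00" * 64).rstrip(b"=").decode(),  # fake signature
])
inspect_jws(demo)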

Are JWTs Secure? 

The short answer is that it depends. The security of JWTs is not a given. As mentioned above, JWTs are often considered secure because they are signed or encrypted, but their security really depends on how they are used. A JWT is a message format in which structure and security measures are defined by the RFC, but it is up to you to ensure their use does not harm the safety of your whole system in any way.

When to Use JWTs

Should they be used as access and ID tokens?

JWTs are commonly used as access tokens and ID tokens in OAuth and OpenID Connect flows. They can also serve different purposes, such as transmitting information, requesting objects in OpenID Connect, authenticating applications, authorizing operations and other generic use cases.

Some say that using JWTs as access tokens is an unwise decision. However, in my opinion, there is nothing wrong if developers choose this strategy based on well-done research with a clear understanding of what JWTs essentially are. The worst-case scenario, on the other hand, is to start using JWTs just because they are trendy. There is no such thing as too many details when it comes to security, so following the best practices and understanding the peculiarities of JWTs is essential.

JWTs are by-value tokens containing data intended for the API developers so that APIs can decode and validate the token. However, if JWTs are issued to be used as access tokens to your clients, there is a risk that client developers will also access this data. You should be aware that this may lead to accidental data leaks since some claims from the token should not be made public. There is also a risk of breaking third-party integrations that rely on the contents of your tokens.

Therefore, it is recommended to:

  • Remember that introducing changes into JWTs used as access tokens may cause problems with app integrations.
  • Consider switching to Phantom tokens or Split tokens when sensitive or personal information is used in a token. In these cases, an opaque token should be used outside your infrastructure.
  • When a high level of security is required, use Proof-of-Possession tokens instead of Bearer tokens by adding a confirmation claim to mitigate the risks of unwanted access.

Should they be used to handle sessions?

An example of improper use of JWTs is choosing them as a session-retention mechanism and replacing session cookies and centralized sessions with JWTs. One of the reasons you should avoid this tactic is that JWTs cannot be invalidated, meaning you won’t be able to revoke old or malicious sessions. Size issues pose another problem, as JWTs can take up a lot of space. Thus, storing them in cookies can quickly exceed size limits. Solving this problem might involve storing them elsewhere, like in local storage, but that will leave you vulnerable to cross-site scripting attacks.

JWTs were never intended to handle sessions, so I recommend avoiding this practice.

Claims Used in JWTs and How to Handle Them

JWTs use claims to deliver information. Properly using those claims is essential for security and functionality. Here are some basics on how to deal with them.

iss: Shows the issuer of the token.
  • Always check against an allowlist to ensure it has been issued by someone you expect to issue it.
  • The value of the claim should exactly match the value you expect it to be.

sub: Indicates the user or other subject of the token.
  • As anyone can decode the token and access the data, avoid using sensitive data or PII.

aud: Indicates the receiver of the token.
  • Always verify that the token is issued to an audience you expect.
  • Reject any request intended for a different audience.

exp: Indicates the expiration time for the token.
  • Use a short expiration time — minutes or hours at maximum.
  • Remember that server times can differ slightly between different machines.
  • Consider allowing a clock skew when checking the time-based values.
  • Don’t use more than 30 seconds of clock skew.
  • Use iat to reject tokens that haven’t expired but which, for security reasons, you deem to be issued too far in the past.

nbf: Identifies the time before which the JWT must not be accepted for processing.

iat: Identifies the time at which the JWT was issued.

jti: Provides a unique identifier for the token.
  • It must be assigned in a way that prevents the same value from being used with a different object.

Validating Tokens

It is important to remember that incoming JWTs should always be validated. It doesn’t matter if you only work on an internal network (with the authorization server, the client and the resource server not connected through the internet). Environment settings can be changed, and if services become public, your system can quickly become vulnerable. Implementing token validation can also protect your system if a malicious actor is working from the inside.

When validating JWTs, always make sure they are used as intended (a code sketch follows this list):

  • Check the scope of the token.
  • Don’t trust all the claims. Verify whether keys contained in claims, or any URIs, correspond to the token’s issuer and contain a value that you expect.
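As a sketch of those checks using the PyJWT library, where the issuer, audience, scope name and key handling are placeholders to adapt to your own authorization server rather than values prescribed here:

import jwt  # PyJWT, installed with: pip install "pyjwt[crypto]"

EXPECTED_ISSUER = "https://login.example.com"  # placeholder issuer
EXPECTED_AUDIENCE = "orders-api"               # placeholder audience
ALLOWED_ALGS = ["ES256", "EdDSA"]

def validate(token: str, public_key: str) -> dict:
    claims = jwt.decode(
        token,
        public_key,
        algorithms=ALLOWED_ALGS,     # never let the token pick its own algorithm
        issuer=EXPECTED_ISSUER,      # iss must match exactly
        audience=EXPECTED_AUDIENCE,  # reject tokens meant for someone else
        leeway=30,                   # tolerate modest clock skew on exp/nbf
        options={"require": ["exp", "iat", "sub"]},
    )
    # Standard-claim checks are not enough on their own: also confirm the token
    # carries whatever scope this endpoint requires.
    if "orders:read" not in claims.get("scope", "").split():
        raise PermissionError("token lacks required scope")
    return claims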

Best Algorithms to Use with JWTs

The registry for JSON Web Signatures and Encryption Algorithms lists all available algorithms that can be used to sign or encrypt JWTs. It is also very useful to help you choose which algorithms should be implemented by clients and servers.

Currently, the most recommended algorithms for signing are EdDSA and ES256. They are preferred over RS256, the most popular and well-tried option, because they are much faster.

No matter the token type — JWS or JWE — they contain an alg claim in the header. This claim indicates which algorithm has been used for signing or encryption. This claim should always be checked with a safelist of algorithms accepted by your system. Allowlisting helps to mitigate attacks that attempt to tamper with tokens (these attacks may try to force the system to use different, less secure algorithms to verify the signature or decrypt the token). It is also more efficient than denylisting, as it prevents issues with case sensitivity.
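A minimal PyJWT sketch of that allowlist check might look like the following; the algorithm names simply mirror the recommendations above, and the explicit header inspection is there only to make the rejection visible:

import jwt  # PyJWT

ALLOWED_ALGS = ["EdDSA", "ES256"]  # the safelist your system accepts

def decode_with_allowlist(token: str, key: str) -> dict:
    header = jwt.get_unverified_header(token)
    if header.get("alg") not in ALLOWED_ALGS:
        raise ValueError(f"rejected token: alg {header.get('alg')!r} is not allowlisted")
    # Passing the same allowlist to decode() means the library will also refuse
    # to verify with anything outside it, including "none".
    return jwt.decode(token, key, algorithms=ALLOWED_ALGS)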

How to Sign JWTs

One thing to remember about JWS signatures is that they are used to sign both the payload and the token header. Therefore, if you make changes to either the header or the payload, whether merely adding or removing spaces or line breaks, your signature will no longer validate.

My recommendations when signing JWTs are the following:

  • To avoid duplicating tokens, add a random token ID in the jti claim. Many authorization servers provide this opportunity.
  • Validate signatures, keys and certificates. Keys and certificates can be obtained from the authorization server. A good practice is to use an endpoint and download them dynamically. This makes it easy to rotate keys in a way that would not break the implementation.
  • Check the keys and certificates sent in the header of the JWS against an allowlist, or validate the trust chain for certificates.

Symmetric keys are not recommended for use in signing JWTs. Using symmetric signing presupposes that all parties need to know the shared secret. As the number of involved parties grows, it becomes more difficult to guard the safety of the secret and replace it if it is compromised.

Another problem with symmetric signing is that you don’t know who actually signed the token. When using asymmetric keys, you’re sure that the JWT was signed by whoever possesses the private key. In the case of symmetric signing, any party with access to the secret can also issue signed tokens. Always choose asymmetric signing. This way, you’ll know who actually signed the JWT and make security management easier.
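As an illustrative sketch of asymmetric signing with ES256, using PyJWT together with the cryptography package; the key handling, kid, issuer and audience values are made up for the example, and in practice an authorization server generates, stores and rotates these keys:

import time
import uuid

import jwt  # PyJWT with the crypto extra: pip install "pyjwt[crypto]"
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import ec

# Generate a P-256 key pair; only the private key can sign, and verifiers
# need nothing but the published public key (for example via a JWKS endpoint).
private_key = ec.generate_private_key(ec.SECP256R1())
private_pem = private_key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.PKCS8,
    serialization.NoEncryption(),
)

now = int(time.time())
claims = {
    "iss": "https://login.example.com",  # placeholder issuer
    "sub": "user-123",
    "aud": "orders-api",
    "iat": now,
    "exp": now + 600,                    # short-lived: ten minutes
    "jti": str(uuid.uuid4()),            # random token ID to avoid duplicates
}

token = jwt.encode(claims, private_pem, algorithm="ES256", headers={"kid": "signing-key-1"})
print(token)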

JWTs and API Security

API security has become one of the main focuses of cybersecurity efforts. Unfortunately, vulnerabilities have increased as APIs have become critical for overall functionality. One of the ways to mitigate the risks is to ensure that JWTs are used correctly. JWTs should be populated with scopes and claims that correspond well to the client, user, authentication method used and other factors.

Conclusion

JWTs are a great technology that can save developers time and effort and ensure the security of APIs and systems. To fully reap their benefits, however, you must ensure that choosing JWTs fits your particular needs and use case. Moreover, it is essential to make sure they are used correctly. To do this, follow the best practices from security experts.


The post JWTs: Connecting the Dots: Why, When and How appeared first on The New Stack.

]]>
Palo Alto Networks Adds AI to Automate SASE Admin Operations https://thenewstack.io/palo-alto-networks-adds-ai-to-automate-sase-admin-operations/ Fri, 17 Mar 2023 13:00:10 +0000 https://thenewstack.io/?p=22702821

Whether one pronounces SASE as “sassy” or “sayce,” a secure access service edge is IT that is fast becoming central

The post Palo Alto Networks Adds AI to Automate SASE Admin Operations appeared first on The New Stack.

]]>

Whether one pronounces SASE as “sassy” or “sayce,” a secure access service edge is infrastructure that is fast becoming central to enterprise systems as increasing amounts of data flow into them from a multiplicity of channels. SASE distributes wide-area network and security controls as a cloud service directly to the source of connection at the edge of the network rather than to a data center.

As its contribution to managing this tangle of virtual wires, Palo Alto Networks this week revealed new capabilities to update its Prisma SASE platform by — you guessed it — adding AIOps to automate these operations and make them more efficient. The company describes this as the industry’s first “natively integrated Artificial Intelligence for IT Operations” for SASE, because it brings together what normally are best-of-breed components (SDN, zero trust security, software-defined secure web gateway) into the same centralized package.

New Features

The new features enable organizations to automate their increasingly complex IT and network operations center (NOC) functions, Palo Alto Networks VP of SASE Marketing Matt De Vincentes told The New Stack.

“You can mix and match these components from multiple different vendors, and you get a potential stack when you have these capabilities kind of integrated together,” De Vincentes said. “But increasingly, we’re seeing a movement toward what we call single-vendor SASE, which is all of these capabilities brought together by a single thing that you can simplify. That’s exactly what we’re doing.

“So all of the capabilities that a customer would need to build out this SASE deployment they can get through a single (SaaS) service. Then on top of that, with one vendor you can bring all the data together into one single data lake — and do some interesting AI on top of that.”

AIOps

Palo Alto Networks calls this Autonomous Digital Experience Management (ADEM), which also provides users end-to-end observability across their network, De Vincentes said. Since ADEM is integrated within Prisma SASE, it does not require additional appliances or agents to be deployed, he said.

According to De Vincentes, AIOps for ADEM:

  • proactively remediates issues that can cause service interruption through AI-based problem detection and predictive analytics;
  • isolates issues faster (reduced mean time to repair) through an easy-to-use query interface; and
  • discovers network anomalies from a single dashboard.

Palo Alto Networks also announced three new SD-WAN (software-defined wide-area network) features for users to secure IoT devices, automate branch management, and manage their SD-WAN via on-premises controllers. Capabilities, according to the company, include:

  • Prisma SD-WAN Command Center provides AI-powered and segment-wise insights and always-on monitoring for network and apps for proactive problem resolution at the branch level.
  • Prisma SD-WAN with integrated IoT security enables existing Prisma SD-WAN appliances to help secure IoT devices. This enables accurate detection and identification of branch IoT devices.
  • On-Prem Controller for Prisma SD-WAN helps meet customer regulatory and compliance requirements and works with on-prem and cloud controller deployments.

Users can now elect to deploy Prisma SD-WAN using the cloud-management console, on-prem controllers, or both in a hybrid scenario, the company said.

All new capabilities will be available by May 2023, except the Prisma SD-WAN Command Center, which will be available by July, the company said.

The post Palo Alto Networks Adds AI to Automate SASE Admin Operations appeared first on The New Stack.

]]>
TrueNAS SCALE Network Attached Storage Meets High Demand https://thenewstack.io/truenas-scale-network-attached-storage-meets-high-demand/ Thu, 02 Mar 2023 15:13:27 +0000 https://thenewstack.io/?p=22701589

TrueNAS SCALE might not be a distribution on the radar of most cloud native developers, but it should be. Although

The post TrueNAS SCALE Network Attached Storage Meets High Demand appeared first on The New Stack.

]]>

TrueNAS SCALE might not be a distribution on the radar of most cloud native developers, but it should be. Although TrueNAS SCALE is, by design, a network-attached storage solution (based on Debian), it is also possible to create integrated virtual machines and even Linux containers.

TrueNAS SCALE can be deployed as a single node or even to a cluster. It can be expanded with third-party applications, offers snapshotting, and can be deployed on off-the-shelf hardware or as a virtual machine.

iXsystems’ TrueNAS SCALE is built on TrueNAS CORE, is designed for hybrid clouds and will soon offer enterprise support options. The operating system is powered by OpenZFS and Gluster for scalable ZFS features and data management.

You’ll find support for KVM virtual machines, Kubernetes, and Docker.

Even better, TrueNAS SCALE is open source and free to use.

Latest Release

Recently, the company launched TrueNAS SCALE 22.12.1 (Bluefin), which includes numerous improvements and bug fixes. The list of improvements to the latest release includes the following:

  • SMB Share Proxy to provide a redirect mechanism for SMB shares in a common namespace.
  • Improvements to rootless login.
  • Fixes to ZFS HotPlug.
  • Improved Dashboard for both Enterprise HA and Enclosure management.
  • Improved Host Path Validation for SCALE applications.
  • Support for external share paths added.

There have also been a number of new features added to the latest release, including the following:

  • SSH Key Upload to simplify and better secure remote access for users.
  • DFS Proxy Share
  • Kubernetes Pass-Through enables external access to the Kubernetes API within a node.
  • Improved first UI login (when root password has not been set).
  • Allow users to create and manage ACL presets.
  • Sudo fields to provide correct privileges for remote targets.

Read the entire changelog to find out all of the improvements and new features that were added to TrueNAS SCALE.

Up-Front Work

One thing to keep in mind when considering TrueNAS SCALE is that there is a bit of up-front work required to get it running. Upon installation of the OS, you’ll have to create storage pools, users, shares and more. There is a bit of a learning curve with this NAS solution, but the end result is very much worth the time you’ll spend.

As far as the web UI is concerned, you’ll find it to be incredibly well-designed (Figure 1).

Figure 1: The default TrueNAS SCALE web UI is a thing of beauty.

In order to use the Virtualization feature, your CPU must support KVM extensions. This can be problematic when using TrueNAS as a virtual machine (with the likes of VirtualBox). To make this work, you must enable Nested Virtualization. Here’s how you do that.

First, create the virtual machine for TrueNAS. Once the VM is created, you’ll need to find the .vbox file in the TrueNAS VirtualBox folder. Open that file for editing (in my example, the file is TRUENAS.vbox). Look for the following section:

<CPU count="2">
  <PAE enabled="false"/>
  <LongMode enabled="true"/>
  <X2APIC enabled="true"/>
  <HardwareVirtExLargePages enabled="true"/>
</CPU>


Add the following line to that section:

<NestedHWVirt enabled="true"/>


The new section should look like this:

<CPU count="2">
  <PAE enabled="false"/>
  <LongMode enabled="true"/>
  <X2APIC enabled="true"/>
  <HardwareVirtExLargePages enabled="true"/>
  <NestedHWVirt enabled="true"/>
</CPU>

The GUI Method

If you prefer the GUI method, open the Settings for the VM, go to System, and click the checkbox for Enable Nested VT-x/AMD-V, and click OK. Start the VM and Virtualization should now work. You’ll know if it’s working if you click on the Virtualization section and you see Add Virtual Machine (Figure 2).

Figure 2: Virtualization is now enabled for our TrueNAS VM.

In a soon-to-be-written tutorial, I will show you how to start working with containers via TrueNAS. Until then, I highly recommend you download an ISO image of this incredible NAS solution, install it, create your pools/users/shares, and start enjoying the ability to share files and folders to your network.

The post TrueNAS SCALE Network Attached Storage Meets High Demand appeared first on The New Stack.

]]>
How Secure Is Your API Gateway? https://thenewstack.io/how-secure-is-your-api-gateway/ Tue, 28 Feb 2023 13:16:57 +0000 https://thenewstack.io/?p=22701351

Quick, how many APIs does your organization use? We’re talking for internal products, for external services and even for infrastructure

The post How Secure Is Your API Gateway? appeared first on The New Stack.

]]>

Quick, how many APIs does your organization use? We’re talking for internal products, for external services and even for infrastructure management such as Amazon’s S3 object storage or Kubernetes. If you don’t know the answer, you are hardly alone. In survey after survey, CIOs and CISOs admit they don’t have an accurate catalog of all their APIs. Yet the march toward greater use of APIs is inevitable, driven by continued adoption of API-centric technology paradigms like cloud native computing and microservices.

According to statistics shared by Mark O’Neill, chief of research for software engineering at Gartner, in 2022:

  • 98% of organizations use or are planning to use internal APIs, up from 88% in 2019
  • 94% of organizations use or are planning to use public APIs provided by third parties, up from 52% in 2019
  • 90% of organizations use or are planning to use private APIs provided by partners, up from 68% in 2019
  • 80% of organizations provide or are planning to provide publicly exposed APIs, up from 46% in 2019

API Gateways Remain Critical Infrastructure Components

To deal with this rapid growth and the management and security challenges it creates, CIOs, Platform Ops teams, and cloud architects are turning to API gateways to centrally manage API traffic. API gateways help discover, manage, observe and secure API traffic on a network.

In truth, the API gateway is a function that can be performed by a reverse proxy, a load balancer or, increasingly, an ingress controller. We know this for a fact because many NGINX open source users configure their NGINX instances specifically to manage API traffic.

This requires considerable customization, however, so it’s not surprising that many DevOps teams instead choose to deploy an API gateway that is already configured to handle some of the most important use cases for API management, like NGINX Plus.

API gateways improve security by acting as a central point of control and access for external applications accessing APIs. They can enforce authentication and authorization policies, as well as implement rate limiting and other security measures to protect against malicious attacks and unauthorized access.

Additionally, API gateways can encrypt data in transit and provide visibility and monitoring capabilities to help identify and prevent security breaches. API gateways can also prioritize traffic, enforce service-level agreements (SLAs) or business decisions around API usage, and conserve network and compute resources.
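To make the rate-limiting piece concrete, here is a minimal token-bucket sketch in Python. It illustrates the kind of per-client policy a gateway applies before proxying a request, not how any particular gateway, NGINX Plus included, implements or configures it.

import time

class TokenBucket:
    """Each client gets a bucket that refills at `rate` tokens per second,
    up to `capacity`; a request is admitted only if a token is available."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}  # one bucket per API key or client ID

def admit(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5, capacity=10))
    return bucket.allow()  # a gateway would return HTTP 429 when this is False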

Once installed and fully deployed, API gateways tend to be sticky and hard to remove. So ensuring that you pick the right API gateway the first time is imperative. The stakes are high. Not all API gateways offer the same level of security, latency, observability and flexibility.

Some rely on underlying open source technologies that can cause security vulnerabilities or difficulties with reliability. Others may require cumbersome integration steps and generate unforeseen traffic latencies. All of these can affect the security of your API gateway and need to be considered during the selection process.

What’s under the Hood Matters — A Lot

The majority of API gateway solutions on the market are built atop modified versions of open source software. NGINX, HAProxy, Envoy and Traefik are all commonly used. However, many API gateway solutions are closed source (they use open source wrapped in proprietary code). That said, such proprietary solutions are still completely dependent on the underlying security of the open source components.

This can create significant security gaps. When a vulnerability is announced in an open source project underlying a proprietary API gateway solution, it may take months for the gateway vendor to push a security patch because any changes to the reverse proxy layer require regression testing and other quality assurance measures to ensure the fix does not affect stability or performance. Attackers know this and often look to target the exposed and unpatched open source layers in these products.

The bottom line? You need to know which technologies are part of your API gateway. Dependencies on third parties for modules and foundational components, either open source or proprietary, can generate unacceptable risks if you require a highly secure solution for your APIs.

Audit Your Security Dependencies with a Software Bill of Materials

Creating a software bill of materials (SBOM) is one of the most common ways to assess potential vulnerabilities. Simply put, an SBOM is a detailed inventory of all the software components, commercial and open source, that make up an application. To learn more about SBOMs, read “Create a Software Bill of Materials for Your Operating System.”

Once you have a full picture of your software stack, you can assess whether all your items meet your security and compliance standards. You’ll often find that many tools have embedded dependencies within them. Some projects are actively maintained and release patches for known CVEs (common vulnerabilities and exposures) with a standardized service-level agreement.

But many major open source projects are not commercial entities, so they might not issue SLAs on vulnerability disclosure or guaranteed patch times, leaving you more vulnerable to CVEs. That, in turn, can unintentionally put your services out of compliance with the required standards. For those reasons, you need to verify whether each individual component in your SBOM can be made compliant.

You can read “How to Prepare Your Apps for Regulated Markets” for more information about auditing your software technology stack.

Easy Integration with Other Security Controls Is Critical

While API gateways are a critical part of API security, they are only one element. Most organizations running API gateways also need a web application firewall (WAF) in front of their gateway to block attacks (OWASP API Security Top 10 and others). If their infrastructure is distributed, they need more than one WAF. In larger enterprises, the API gateway needs to integrate with a global firewall that polices all traffic going in or out.

Even newer API security solutions that help with challenges like API discovery and threat analysis depend on robust integration with an API gateway. These tools often rely on the API gateway for visibility into API traffic, and usually work with the API gateway to address any emerging threats.

In all cases, tight integration between the API gateway and security tools is critical for maintaining effective security. It’s most convenient if you can use a single monitoring solution to track both firewall and gateway traffic.

This can be a challenging integration, particularly if an organization is operating in a multicloud or hybrid environment. Integration challenges can also mean that changes to gateway configurations require updates to the WAF or global firewall, adding to team workloads or — worst case — slowing down application development teams that have to wait for their firewall or gateway configuration requests to be synced.

Policy Granularity Can Vary Widely across Environments

In theory, an API gateway can enforce the same policy no matter what environment it is operating in. The reality is very different if you have to build your API gateways from different mixtures of components in different environments.

For example, an API management solution might use one underlying open source technology for on-premises or hosted installations of its API gateway and another for cloud services. Policy granularity and all of the resulting security benefits can also be starkly limited by the underlying foundational reverse proxy itself, or the mismatch of capabilities between the two implementations.

For these reasons, it’s critical to run an extensive proof of concept (POC) that closely emulates live production traffic. It’s the only way to be sure the API gateway solution can provide the type of policy granularity and control you require for your growing API constellation.

Inadequate policy granularity and control can result in less agile security capabilities, often reducing the API gateway to a blunt instrument rather than the finely honed scalpel required for managing the rapidly shifting attack surface of your API landscape.

Speed Matters to Application Development Teams

How fast an API gateway can pass traffic safely while still enforcing policies is of critical importance to application teams and API owners. Slow APIs can affect overall performance of applications in compounding ways by forcing dependent processes to wait, generating a poor user experience.

Teams forced to deal with slow APIs are more likely to circumvent security systems or roll their own to improve performance and better control user experience and dependencies. This is the API equivalent of shadow IT, and it creates considerable security risks if APIs are not properly locked down, tested and monitored.

The API gateway alone must be fast. But it’s equally important to look at the latency hit generated by the combination of WAF and API gateway. Ideally, the two are tightly integrated, reducing the need to slow down packets. This is another reason why a near-production POC is crucial for making the right decision.

Conclusion: Your API Gateway Security Mileage Can Vary — Choose Wisely

APIs are the future of technology infrastructure and composable, loosely coupled applications. Their rapid proliferation is likely to accelerate as more and more organizations move to the cloud, microservices and other decoupled and distributed computing paradigms.

Even if you are going against the tide and moving in the opposite direction to monoliths, your applications still need to manage APIs to communicate with the rest of the world, including partners, customers, storage layers, payment providers like Stripe, critical cloud services like CDNs and more.

An API gateway is a serious purchase requiring careful consideration. The most important consideration of all, naturally, is security.

The four criteria we laid out in this post — reliable underlying technology, easy integration with security tools, policy granularity across environments and low latency — are just a few of the many boxes an API gateway needs to check before you put it into production.

Choose wisely, think deeply, and may the API force be with you!

The post How Secure Is Your API Gateway? appeared first on The New Stack.

]]>
Bullet-Proofing Your 5G Security Plan https://thenewstack.io/bullet-proofing-your-5g-security-plan/ Fri, 24 Feb 2023 15:24:00 +0000 https://thenewstack.io/?p=22700996

With latency improvements and higher data speeds, 5G represents exponential growth opportunities with the potential to transform entire industries —

The post Bullet-Proofing Your 5G Security Plan appeared first on The New Stack.

]]>

With latency improvements and higher data speeds, 5G represents exponential growth opportunities with the potential to transform entire industries — from fueling connected autonomous vehicles, smart cities, mixed reality technologies, robotics and more.

As enterprises rethink connectivity, 5G will be a major investment area. However, according to Palo Alto Networks’ What’s Next in Cyber survey, while 88% of global executives say they understand the security challenges associated with 5G, only 21% of them have a plan to address such challenges.

As is true for any emerging technology, there will always be a level of uncertainty. However, with a few key considerations, executives across industries can foster confidence in their organization’s ability to handle 5G security challenges.

Outlining the Framework 

A comprehensive 5G security plan is built by first outlining the framework and identifying the key security principles that should inform every component of the plan. The goal, of course, is to navigate risks and secure your organization’s 5G network while also advancing digital transformation. Your framework should ultimately center around visibility, control, enforcement, dynamic threat correlation and life cycle.

Implementing visibility is key to having a complete understanding of the enterprise 5G network. Data logs, for instance, can capture data from multiple systems to better secure an entire environment. In terms of control, it’s beneficial to use cloud-delivered advanced threat protection to detect command-and-control traffic as well as malware.

Adopting a zero trust model can ensure strong enforcement and consistent security visibility across the entire network, while dynamic threat correlation can help isolate infected devices. Last, life-cycle monitoring sheds light on usage patterns and potential security gaps, helping you stay one step ahead of an evolving threat landscape.

Embracing AI and Automation 

With the expanded surface area of a 5G network spanning multiaccess edge computing (MEC), network slices and instantaneous service orchestration, there is much more room for potential threats. Coupled with the proliferation of user-owned devices and IoT, this highly distributed environment creates grounds for threats to evolve faster and do more damage.

Given this, automation plays an important role in building a secure 5G network. With advanced automation, organizations can alleviate the stress put on their cybersecurity teams to scan a multitude of areas for potential threats. As new services are configured and added to the 5G network, automation also helps to quickly scan and serves as a repeatable approach to deploying security.

Additionally, as threat actors leverage AI to automate attacks, similar technology is needed at the organizational level to best defend. With the complexity of 5G deployments, an AI-powered approach can intelligently stop attacks and threats while also providing granular application identification policies to protect against advanced threats regardless of their origin.

Adopting Zero Trust 

Zero trust for 5G means removing implicit trust and continuously monitoring and approving each stage of digital interaction. This means that regardless of what the situation is, who the user is or what application they are trying to gain access to, each interaction has to be validated. On a network security level, zero trust specifically protects sensitive data and critical applications.

More specifically, zero trust leverages network segmentation, provides Layer 7 threat prevention, prevents lateral movement and simplifies granular user-access control. Whereas a traditional security model operates under the assumption that everything within the organization’s purview can be trusted, this model understands that trust is a vulnerability. Ultimately, zero trust provides an opportunity for your organization to rethink security and keep up with digital transformation.

5G represents a paradigm shift and has the potential to expand connectivity. As your organization embarks on its own journey toward a 5G future, security cannot be an afterthought. Building a strong 5G security plan must start from the ground up as new, sophisticated cyberattacks are always looming. However, by building an informed framework, leveraging AI and automation, and implementing a zero trust framework, your organization will enjoy the innovation, reliability and performance that 5G has to offer.

The post Bullet-Proofing Your 5G Security Plan appeared first on The New Stack.

]]>
What David Flanagan Learned Fixing Kubernetes Clusters https://thenewstack.io/what-david-flanagan-learned-fixing-kubernetes-clusters/ Fri, 17 Feb 2023 18:54:39 +0000 https://thenewstack.io/?p=22700777

People are mean. That’s one of the first things David Flanagan learned by fixing 50+ deliberately broken Kubernetes clusters on

The post What David Flanagan Learned Fixing Kubernetes Clusters appeared first on The New Stack.

]]>

People are mean. That’s one of the first things David Flanagan learned by fixing 50+ deliberately broken Kubernetes clusters on his YouTube series, “Klustered.”

In one case, the submitter substituted a ‘c’ character with a Unicode doppelganger — it looked identical to a ‘c’ in the terminal output — causing an error that led to Flanagan doubting himself and his ability to fix clusters.
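
A quick way to unmask a homoglyph like that is to dump the raw bytes instead of trusting what the terminal renders. For example, the Cyrillic “с” (U+0441) looks identical to the Latin “c” but encodes differently (an illustrative shell session):

# Latin "c" is a single byte in UTF-8...
echo -n "c" | xxd    # 00000000: 63
# ...while the Cyrillic look-alike is two bytes
echo -n "с" | xxd    # 00000000: d181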

“I really hate that guy,” Flanagan confided at the Civo Navigate conference last week in Tampa. “That was a long episode, nearly two hours we spent trying to fix this. And what I love about that clip — because I promise you, I’m quite smart and I’m quite good with Kubernetes — but it had me doubting things that I know are not the fault. The fact that I thought a six digit number is going to cause any sort of overflow on a 64 bit system — of course not. But debugging is hard.”

After that show, Klustered adopted a policy of no Unicode breaks.

“You only learn when things go wrong,” Flanagan said. “This is why I really love doing Klustered. If you just have a cluster that just works, you’re never really going to learn how to operate that beyond a certain level of scale. And Klustered brings us a situation where we can have people bring their failures from their own companies, their own organizations, their own teams, and we replicate those issues on a live stream format, but it allows us to see how individuals debug it as well.”

Linux Problems

Debugging is hard, he said, even when you have a team from Red Hat working to resolve the problem, as he learned during another episode featuring teams from Red Hat and Talos. In that situation, Red Hat had removed the executable bit from important binaries such as kubectl, kubeadm, and even Perl — which has the ability to execute most syscalls on a machine — limiting the Talos team’s ability to fix the fault.

“What we learned from this episode is you can actually execute the dynamic linker on Linux. So with this ld-linux.so you can actually execute any binary on a machine, proxying it through that linker. So you can chmod /bin/chmod, like so, which is a really cool trick.”
/lib/ld-linux.so /bin/chmod +x /bin/chmod

People have also modified attributes on a Linux file system.

“Anyone know what attributes are in a Linux file system?” he asked. “No, of course not. Why should you?”

But these attributes allow you to get really low level with the file system. He showed how they marked a file as immutable.

“So you can pack a file that you know, kubectl or Kubernetes has to write to and mark it as immutable, and you’ve immediately broken the system,” he said. “You’re not going to be able to detect that break by running your regular LS commands, you actually do need to do an lsattr on the file, and understand what these obscure references mean when you list them all. So, again, Klustered just gives us an environment where we get to extract all of this knowledge from people that have done stuff that we haven’t done before.”
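
As a rough sketch of that kind of break (the file path here is purely illustrative):

# Mark a file the cluster relies on as immutable
chattr +i /var/lib/kubelet/config.yaml

# ls shows nothing unusual; lsattr reveals the "i" (immutable) attribute
lsattr /var/lib/kubelet/config.yaml
----i---------e------- /var/lib/kubelet/config.yaml

# Removing the attribute fixes the break
chattr -i /var/lib/kubelet/config.yaml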

On another episode, he had Kris Nóva, a kernel hacker who has worked in security and Kubernetes, along with Thomas Stromberg, a previous maintainer of minikube while at Google who has also worked in forensic analysis of intrusions. Stromberg had to fix the cluster broken by Nóva, a security industry elite.

“Thomas came on and runs this fls command,” he said. “It’s a very old toolkit, written in the late ’90s, called Sleuth Kit, that does forensic analysis of Linux file systems.”

“By running this command, he got a time ordered change of every modification to the Linux file system. He had every answer to every question he wanted to answer for the last 48 hours…. So I love that we have these opportunities of complete serendipity to share knowledge with everyone,” he added.

Network Breaks Common

Networking breaks are fairly common on that show. Kubernetes has core networking policies in place to keep them from happening… but still, it happens.

“However, we’re now seeing fragmentation as other CNI providers bring on their own adaptations to network policies,” Flanagan relayed. “It’s not enough to check for network policies or cluster network policies. … [What] you need to know to successfully operate a Kubernetes cluster at the networking level continues to evolve and get very cumbersome, scary, complicated, but also easier.”

Flanagan’s biggest frustration with Kubernetes is the default DNS policy.

“Who thinks the default DNS policy in Kubernetes is the default DNS policy? Now we have this DNS policy called default,” he said. “But it’s not the default. The default is cluster first, which means it’s going to try and resolve the DNS name within the cluster. And the default policy actually resolves to the default routing on the host.”
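
In a pod spec the distinction looks like this (a minimal sketch; ClusterFirst is what you get when dnsPolicy is omitted):

apiVersion: v1
kind: Pod
metadata:
  name: dns-demo
spec:
  # "Default" does not mean the default behavior: it inherits DNS
  # resolution from the node the pod is running on.
  dnsPolicy: Default
  # The actual default is ClusterFirst, which resolves names through
  # the cluster DNS service first:
  # dnsPolicy: ClusterFirst
  containers:
  - name: app
    image: nginx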

Flanagan said he’s been discussing with people like Tim Hockin and other Kubernetes maintainers how the community can remove some of the anomalies out there that essentially trip up people who just haven’t encountered these problems before.

eBPF Changing the Landscape

eBPF is changing the landscape as well, he said. Developers no longer go into a Linux machine and run iptables -L, a habit he noted has been ingrained into developers’ skulls for the past 20 years. Now developers are supposed to look at all the eBPF probes and traffic policies, and essentially you need other eBPF tools that can understand the existing eBPF tools.

He recommended checking out Hubble for a visual representation of network policies — Kubernetes and Cilium policies specifically, he added. Hubble also ships with a CLI.
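
If Hubble is deployed alongside Cilium, its CLI can surface individual flows and the policy verdicts applied to them (a minimal illustration; the available flags depend on your Hubble version):

# Stream recent flows, scoped to one namespace
hubble observe --namespace kube-system

# Confirm Hubble itself is healthy
hubble status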

“We have the tools to understand networking within our cluster. If you’re lucky enough to be using Cilium, if you’re using other CNI, you will have to find other tools, but they do exist as well,” he said.

He also recommended Cilium Editor.

“You can build a Kubernetes networking policy, or a Cilium network policy by dragging boxes, changing labels and changing port numbers,” Flanagan said. “So you don’t actually need to learn how to navigate these esoteric YAML files anymore.”

Cilium Editor allows you to use drag and drop to build out a Kubernetes networking policy, he said.

Other Learnings

There are other ways to break Kubernetes clusters, of course. You can attack the container runtime, he noted. People have rolled back the kubectl binary by as many as 25 versions — the number it took to actually break backward compatibility so that it could no longer speak to the API server. Storage is another consideration with your own CSI providers, he added.

He also recommended three resources:
  • Brendan Gregg’s book
  • BCC
  • Ebpfkit

What he’d like to normalize is engineers admitting what they don’t know and sharing knowledge.

“The one rule I give people is please don’t sit there quietly, Googling off camera to get an answer and go, Oh, I know how to fix this,” he said. “I’d love to get senior engineers to set better norms for the newcomers in our industry and remove the hero culture we’ve established over the last 30 years.”

Civo paid for Loraine Lawson’s travel and accommodations to attend the conference.

The post What David Flanagan Learned Fixing Kubernetes Clusters appeared first on The New Stack.

]]>
API Gateway, Ingress Controller or Service Mesh: When to Use What and Why https://thenewstack.io/api-gateway-ingress-controller-or-service-mesh-when-to-use-what-and-why/ Fri, 17 Feb 2023 14:40:32 +0000 https://thenewstack.io/?p=22700655

In just about every conversation on ingress controllers and service meshes, we hear some variation of the questions, “How is

The post API Gateway, Ingress Controller or Service Mesh: When to Use What and Why appeared first on The New Stack.

]]>

In just about every conversation on ingress controllers and service meshes, we hear some variation of the questions, “How is this tool different from an API gateway?” or “Do I need both an API gateway and an ingress controller (or service mesh) in Kubernetes?”

This confusion is understandable for two reasons:

  • Ingress controllers and service meshes can fulfill many API gateway use cases.
  • Some vendors position their API gateway tool as an alternative to using an ingress controller or service mesh — or they roll multiple capabilities into one tool.

Here, we will tackle how these tools differ and which to use for Kubernetes-specific API gateway use cases. For a deeper dive, including demos, watch the webinar “API Gateway Use Cases for Kubernetes.”

Definitions

At their cores, API gateways, ingress controllers and service meshes are each a type of proxy, designed to get traffic into and around your environments.

What Is an API Gateway?

An API gateway routes API requests from a client to the appropriate services. But a big misunderstanding about this simple definition is the idea that an API gateway is a unique piece of technology. It’s not. Rather, “API gateway” describes a set of use cases that can be implemented via different types of proxies, most commonly an ADC or load balancer and reverse proxy, and increasingly an ingress controller or service mesh. In fact, we often see users, from startup to enterprise, deploying out-of-the-box NGINX as an API gateway with reverse proxies, web servers or load balancers, and customizing configurations to meet their use case needs.

There isn’t a lot of agreement in the industry about what capabilities are “must haves” for a tool to serve as an API gateway. We typically see customers requiring the following abilities (grouped by use case):

Resilience Use Cases

  • A/B testing, canary deployments and blue-green deployments
  • Protocol transformation (between JSON and XML, for example)
  • Rate limiting
  • Service discovery

Traffic Management Use Cases

  • Method-based routing and matching
  • Request/response header and body manipulation
  • Request routing at Layer 7
  • Retries and keepalives

Security Use Cases

  • API schema enforcement
  • Client authentication and authorization
  • Custom responses
  • Fine-grained access control
  • TLS termination

Almost all these use cases are commonly used in Kubernetes. Protocol transformation and request/response header and body manipulation are less common since they’re generally tied to legacy APIs that aren’t well-suited for Kubernetes and microservices environments. They also tend to be indicative of monolithic applications that are less likely to run in Kubernetes.

What Is an Ingress Controller?

An ingress controller is a specialized Layer 4 and Layer 7 proxy that gets traffic into Kubernetes, to the services, and back out again (referred to as ingress-egress or north-south traffic). In addition to traffic management, ingress controllers can also be used for visibility and troubleshooting, security and identity, and all but the most advanced API gateway use cases.

What Is a Service Mesh?

A service mesh handles traffic flowing between Kubernetes services (referred to as service-to-service or east-west traffic). It is commonly used to achieve end-to-end encryption (E2EE) and for applying TLS to all traffic. A service mesh can be used as a distributed (lightweight) API gateway very close to the apps, made possible on the data plane level by service mesh sidecars.

Note: Choosing a service mesh is its own journey that is worth some consideration.

Use Kubernetes Native Tools for Kubernetes Environments

So how do you decide which tool is right for you? We’ll make it simple: If you need API gateway functionality inside Kubernetes, it’s usually best to choose a tool that can be configured using native Kubernetes config tooling such as YAML. Typically, that’s an ingress controller or service mesh. But we hear you saying, “My API gateway tool has so many more features than my ingress controller (or service mesh). Aren’t I missing out?” No! More features do not equal better tools, especially within Kubernetes where tool complexity can be a killer.

Note: “Kubernetes native” (not the same as Knative) refers to tools that were designed and built for Kubernetes. Typically, they work with the Kubernetes CLI, can be installed using Helm and integrate with Kubernetes features.

Most Kubernetes users prefer tools they can configure in a Kubernetes native way because that avoids changes to the development or GitOps experience. A YAML-friendly tool provides three major benefits:

  • YAML is a familiar language to Kubernetes teams, so the learning curve is low or even nonexistent if you’re using an existing Kubernetes tool for API gateway functionality. This helps your teams work within their existing skill set without the need to learn how to configure a new tool that they might only use occasionally.
  • You can automate a YAML-friendly tool in the same fashion as your other Kubernetes tools. Anything that cleanly fits into your workflows will be popular with your team, increasing the probability that they use it.
  • You can shrink your Kubernetes traffic-management tool stack by using Kubernetes native tools already in the stack. Every extra hop matters, and there’s no reason to add unnecessary latency or single points of failure. And of course, reducing the number of technologies deployed within Kubernetes is also good for your budget and overall security.

North-South API Gateway Use Cases: Use an Ingress Controller

Ingress controllers have the potential to enable many API gateway use cases. In addition to the ones outlined in Definitions, we find organizations most value an ingress controller that can implement:

  • Offload of authentication and authorization
  • Authorization-based routing
  • Layer 7 level routing and matching (HTTP, HTTP/S, headers, cookies, methods)
  • Protocol compatibility (HTTP, HTTP/2, WebSocket, gRPC)
  • Rate limiting

Sample Scenario: Method-Level Routing

You want to implement method-level matching and routing using the ingress controller to reject the POST method in API requests.

Some attackers look for vulnerabilities in APIs by sending request types that don’t comply with an API definition — for example, sending POST requests to an API that is defined to accept only GET requests. Web application firewalls (WAF) can’t detect these kinds of attacks. They examine only request strings and bodies for attacks, so it’s best practice to use an API gateway at the ingress layer to block bad requests.

As an example, suppose the new API /coffee/{coffee-store}/brand was just added to your cluster. The first step is to expose the API using an ingress controller simply by adding the API to the upstreams field.

apiVersion: k8s.nginx.org/v1
kind: VirtualServer
metadata:
  name: cafe
spec:
  host: cafe.example.com
  tls:
    secret: cafe-secret
  upstreams:
  - name: tea
    service: tea-svc
    port: 80
  - name: coffee
    service: coffee-svc
    port: 80


To enable method-level matching, you add a /coffee/{coffee-store}/brand path to the routes field and add two conditions that use the $request_method variable to distinguish between GET and POST requests. Any traffic using the HTTP GET method is passed automatically to the coffee service. Traffic using the POST method is directed to an error page with the message "You are rejected!" And just like that, you’ve protected the new API from unwanted POST traffic.

routes:
  - path: /coffee/{coffee-store}/brand
    matches:
    - conditions:
      - variable: $request_method
        value: POST
      action:
        return:
          code: 403
          type: text/plain
          body: "You are rejected!"
    - conditions:
      - variable: $request_method
        value: GET
      action:
        pass: coffee
  - path: /tea
    action:
      pass: tea


For more details on how you can use method-level routing and matching with error pages, check out these ingress controller docs. Additionally, you can dive into a security-related example of using an ingress controller for API gateway functionality.

East-West API Gateway Use Cases: Use a Service Mesh

A service mesh is not required, or even initially helpful, for most API gateway use cases because most of what you might want to accomplish can, and ought to, happen at the ingress layer. But as your architecture increases in complexity, you’re more likely to get value from using a service mesh. The use cases we find most beneficial are related to E2EE and traffic splitting, such as A/B testing, canary deployments and blue-green deployments.

Sample Scenario: Canary Deployment

You want to set up a canary deployment between services with conditional routing based on HTTP/S criteria.

The advantage is that you can gradually roll out API changes — such as new functions or versions — without affecting most of your production traffic.

Currently, your ingress controller routes traffic between two services managed by NGINX service mesh: Coffee.frontdoor.svc and Tea.frontdoor.svc. These services receive traffic from the ingress controller and route it to the appropriate app functions, including Tea.cream1.svc. You decide to refactor Tea.cream1.svc, calling the new version Tea.cream2.svc. You want your beta testers to provide feedback on the new functionality, so you configure a canary traffic split based on the beta testers’ unique session cookie, ensuring your regular users only experience Tea.cream1.svc.

Using a service mesh, you begin by creating a traffic split between all services fronted by Tea.frontdoor.svc, including Tea.cream1.svc and Tea.cream2.svc. To enable the conditional routing, you create an HTTPRouteGroup resource (named tea-hrg) and associate it with the traffic split, the result being that only requests from your beta users (requests with the session cookie set to version=beta) are routed from Tea.frontdoor.svc to Tea.cream2.svc. Your regular users continue to experience only version 1 services behind Tea.frontdoor.svc.

apiVersion: split.smi-spec.io/v1alpha3
kind: TrafficSplit
metadata:
  name: tea-svc
spec:
  service: tea.1
  backends:
  - service: tea.1
    weight: 0
  - service: tea.2
    weight: 100
  matches:
  - kind: HTTPRouteGroup
    name: tea-hrg


apiVersion: specs.smi-spec.io/v1alpha3
kind: HTTPRouteGroup
metadata:
  name: tea-hrg
  namespace: default
spec:
  matches:
  - name: beta-session-cookie
    headers:
    - cookie: "version=beta"


This example starts your canary deployment with a 0-100 split, meaning all your beta testers experience Tea.cream2.svc, but of course you could start with whatever ratio aligns with your beta-testing strategy. Once your beta testing is complete, you can use a simple canary deployment (without the cookie routing) to test the resilience of Tea.cream2.svc.

Check out these docs for more details on traffic splits with a service mesh. The above traffic split configuration is self-referential, as the root service is also listed as a backend service. This configuration is not currently supported by the Service Mesh Interface specification (smi-spec). However, the spec is currently in alpha and subject to change.

When (and How) to Use an API Gateway Tool for Kubernetes Apps

Though most API gateway use cases for Kubernetes can (and should) be addressed by an ingress controller or a service mesh, there are some specialized situations where an API gateway tool is suitable.

Business Requirements

Using both an ingress controller and an API gateway inside Kubernetes can provide flexibility for organizations to achieve business requirements. Some scenarios include:

  • Your API gateway team isn’t familiar with Kubernetes and doesn’t use YAML. For example, if they’re comfortable with NGINX config, then it eases friction and lessens the learning curve if they deploy NGINX as an API gateway in Kubernetes.
  • Your Platform Ops team prefers to dedicate the ingress controller solution to app traffic management only.
  • You have an API gateway use case that only applies to one of the services in your cluster. Rather than using an ingress controller to apply a policy to all your north-south traffic, you can deploy an API gateway to apply the policy only where it’s needed.

Migrating APIs into Kubernetes Environments

When migrating existing APIs into Kubernetes environments, you can publish those APIs to an API gateway tool that’s deployed outside of Kubernetes. In this scenario, API traffic is typically routed through an external load balancer (for load balancing between clusters), then to a load balancer configured to serve as an API gateway, and finally to the ingress controller or Gateway API module within your Kubernetes cluster.

The Future of Gateway API for API Gateway Use Cases

This conversation would be incomplete without a brief discussion of the Kubernetes Gateway API (which is not the same as an API gateway). Generally seen as the future successor of the Ingress API, the Gateway API can be implemented for both north-south and east-west traffic. This means an implementation could perform ingress controller capabilities, service mesh capabilities or both. Ultimately, there’s potential for Gateway API implementations to act as an API gateway for all your Kubernetes traffic.

Gateway API is in beta, and there are numerous vendors, including NGINX, experimenting with implementations. It’s worth keeping a close eye on the innovation in this space and maybe even start experimenting with the beta version yourself.

Watch this short video to learn more about how an API gateway differs from the Gateway API:

https://youtu.be/GQOf4t4KGbw

Conclusion: Right Tools for the Right Use Case

For Kubernetes newcomers and even for folks with a decent amount of experience, APIs can be painfully confusing. We hope these rules of the road can provide guidance on how to build out your Kubernetes architecture effectively and efficiently.

Of course, your mileage may vary, and your use case or situation may be unique. But if you stick to Kubernetes native tools to simplify your tool stack and only consider using a separate API gateway (particularly outside of Kubernetes) for very specific situations like those outlined above, your journey should be much smoother.

The post API Gateway, Ingress Controller or Service Mesh: When to Use What and Why appeared first on The New Stack.

]]>
13 Years Later, the Bad Bugs of DNS Linger on https://thenewstack.io/13-years-later-the-bad-bugs-of-dns-linger-on/ Tue, 14 Feb 2023 17:00:09 +0000 https://thenewstack.io/?p=22699810

It’s 2023, and we are still copying code without fully debugging. Did we not learn from the Great DNS Vulnerability

The post 13 Years Later, the Bad Bugs of DNS Linger on appeared first on The New Stack.

]]>

It’s 2023, and we are still copying code without fully debugging. Did we not learn from the Great DNS Vulnerability of 2008? Fear not: internet godfather Paul Vixie has provided guidelines on how to do better.

Vixie, a Distinguished Engineer and Vice President of Security at Amazon Web Services (and a contributor to the internet before it was “The Internet”), spoke about the cost of open source dependencies in a talk at Open Source Summit Europe in Dublin — which he revisited in a recent blog post. Both are highly recommended viewing and reading.

Flashback to 2008

In 2008, security expert Dan Kaminsky discovered a fundamental design flaw in DNS code that allowed for arbitrary cache poisoning that affected nearly every DNS server on the planet. The patch was released in July 2008 followed by the permanent solution, Domain Name Security Extensions (DNSSEC), in 2010. The Domain Name System is the basic name-based global addressing system for The Internet, so vulnerabilities in DNS could spell major trouble for pretty much everyone on the Internet.

Vixie and Kaminsky “set [their] hair on fire” to build the security vulnerability solution that “13 years later, is not widely enough deployed to solve this problem,” Vixie said. All of this software is open-source and inspectable but the DNS bugs are still being brought to Vixie’s attention in the present day.

“This is never going to stop if we don’t start writing down the lessons people should know before they write software,”  Vixie said.

How Did This Happen?

It’s our fault; “the call is coming from inside the house.” Before internet commercialization and the dawn of the home computer room, publishers of the Berkeley Software Distribution (BSD) of UNIX decided to support the then-new DNS protocol. “Spinning up a new release, making mag tapes, and putting them all in shipping containers was a lot of work,” so they published DNS as a patch posted to Usenet newsgroups, making it available to anyone who wanted it via an FTP server and mailing list.

When Vixie began working on DNS at Berkeley, DNS was for all intents and purposes abandonware, insofar as all the original creators had since moved on. Since there was no concept of importing code and making dependencies, embedded systems vendors copied the original code and changed the API names to suit their local engineering needs… sound familiar?

And then Linux came along. The internet E-X-P-L-O-D-E-D. You get an AOL account. And you get an AOL account…

Distros had to build their first C library and copied some version of the old Berkeley code, whether they knew what it was or not. It was a copy of a copy that some other distro was using; they made a local version forever divorced from the upstream. DSL modems are an early example of this. Now the Internet of Things is everywhere and “all of this DNS code in all of the billions of devices are running on some fork of a fork of a fork of code that Berkeley published in 1986.”

Why does any of this matter? The original DNS bugs were written and shipped by Vixie. He then went on to fix them in the ’90s, but some still appear today. “For embedded systems today to still have that problem, any of those problems, means that whatever I did to fix it wasn’t enough. I didn’t have a way of telling people.”

Where Do We Go from Here?

“Sure would have been nice if we already had an internet when we were building one,” Vixie said. But, try as we might, we can’t go backward; we can only go forward. Vixie made it very clear: “if you can’t afford to do these things [below] then free software is too expensive for you.”

Here is some of Vixie’s advice for software producers:

  • Do the best you can with the tools you have but “try to anticipate what you’re going to have.”
  • Assume all software has bugs “not just because it always has, but because that’s the safe position to take.” Machine-readable updates are necessary because “you can’t rely on a human to monitor a mailing list.”
  • Version numbers are must-haves for your downstream. “The people who are depending on you need to know something more than what you thought worked on Tuesday.” It doesn’t matter what it is as long as it uniquely identifies the bug level of the software.
  • Cite code sources in the README files in source code comments. It will help anyone using your code and chasing bugs.
  • Automate monitoring of your upstreams, review all changes, and integrate patches. “This isn’t optional.”
  • Let your downstream know about changes automatically “otherwise these bugs are going to do what the DNS bugs are doing.”

Here is the advice for software consumers:

  • Your software’s dependencies are your dependencies. “As a consumer when you import something, remember that you’re also importing everything it depends on… So when you check your dependencies, you’d have to do it recursively you have to go all the way up.”
  • Uncontracted dependencies can make free software incredibly expensive but are an acceptable operating risk because “we need the software that everybody else is writing.” Orphaned dependencies require local maintenance and therein lies the risk because that’s a much higher cost than monitoring the developments that are coming out of other teams. “It’s either expensive because you hire enough people and build enough automation or it’s expensive because you don’t.”
  • Automate dependency upgrades (mostly) because sometimes “the license will change from one you could live with to one that you can’t or at some point someone decides they’d like to get paid” [insert adventure here].
  • Specify acceptable version numbers. If versions 5+ have the fix needed for your software, say that to make sure you don’t accidentally get an older one.
  • Monitor your supply chain and ticket every release. Have an engineer review every update to determine if it’s “set my hair on fire, work over the weekend or we’ll just get to it when we get to it” priority level.

He closed with “we are all in this together but I think we could get it organized better than we have.” And it sure is one way to say it.

There is a certain level of humility and grace required when you were on the tiny team that prevented a potential DNS collapse, have been a leader in your field for over a generation, and still have your early-career bugs (that you solved 30 years ago) brought to your attention at regular intervals because adopters aren’t inspecting the source code.

The post 13 Years Later, the Bad Bugs of DNS Linger on appeared first on The New Stack.

]]>
EU Analyst: The End of the Internet Is Near https://thenewstack.io/eu-analyst-the-end-of-the-internet-is-near/ Thu, 02 Feb 2023 11:00:07 +0000 https://thenewstack.io/?p=22699205

The internet as we know it may no longer be a thing, warns a European Union-funded researcher. If it continues

The post EU Analyst: The End of the Internet Is Near appeared first on The New Stack.

]]>

The internet as we know it may no longer be a thing, warns a European Union-funded researcher. If it continues to fray, our favorite “network of networks” will just go back to being a bunch of networks again. And it will be the fault of us all.

“The idea of an open and global internet is progressively deteriorating and the internet itself is changing,” writes Konstantinos Komaitis, author of the report, “Internet Fragmentation: Why It Matters for Europe” posted Tuesday by the EU Cyber Diplomacy Initiative.

In short, the global and open nature of the internet is being impacted by larger geo-political forces, perhaps beyond everyone’s control. “Internet fragmentation must be seen both as a driver and as a reflection of an international order that is increasingly growing fragmented,” Komaitis concluded.

The vision for the internet has always been one of end-to-end communications, where one end device on the internet can exchange packets with any other end device, regardless of what network either one of them was on. And, by nature, the internet was meant to be open, with no central governing authority, allowing everyone in the world to join, for the benefit of all, rich or poor.

In practice, these technical and ideological goals may have played out inconsistently (NAT… cough), but the internet has managed to keep on keeping on for a remarkably long time for such a minimally-managed effort.

Yet, this may not always be the case, Komaitis foretells.

He notes the internet is besieged from all sides by potential fragmentation: from commercial pressures, technical changes and government interference. Komaitis highlighted a few culprits:

  • DNS: The Domain Name System is the index that holds everything together, mapping domain names to IP numbers. The Internet Corporation for Assigned Names and Numbers (ICANN) manages this work on a global scale, but there’s nothing to stop another party from setting up an alternative root server. A few have tried: The International Telecommunications Union’s Digital Object Architecture (DOA) as well as Europe’s Network and Information Systems both set out to challenge the global DNS.
  • Stalled IPv4 to IPv6 transition: The effort to move the internet from the limited IPv4 addressing scheme to the much larger IPv6 address pool has been going on for well over two decades now, with only limited success thus far. “Even though there is a steady increase in the adoption of IPv6 addresses, there is still a long way to go,” Komaitis writes. He notes that “Just 32 economies” have IPv6 adoption rates above the global average of 30%. Without full IPv6 adoption, he argues, the internet will continue to be fragmented, with no assurance of end-to-end connectivity across those using one version or the other.
  • Internet content blocking: Governments have taken an increasing interest in curating the internet for their own citizens, using tools such as DNS filtering, IP blocking, distributed denial of service (DDoS) attacks and search result removals. The most prominent example is China, which runs “a sophisticated filtering system that can control which content users are exposed to,” Komaitis wrote.
  • Breakdown of peering agreements: The internet is the result of a set of bilateral peering agreements, which allow very small internet service providers to share the address space with global conglomerates. Increasingly, however, the large telcos are prioritizing their own traffic at the expense of smaller players. The European Union is looking at ways to restructure these agreements, though South Korea tried this, and the results ended up just confusing and burdening the market, Komaitis wrote.

Other contributing factors that Komaitis discussed include walled gardens, data localization practices (e.g., GDPR) and ongoing governmental interest and interference in open standards bodies.

What does all this mean for the European Union, which funded this overview? The Union has already pledged to offer everyone online access by 2030, as well as to thwart any commercial or government attempts to throttle or prioritize internet traffic. It has also made a pledge, with the U.S. and other governments, to ensure the internet is “open, global and interoperable.”

So the EU needs to make the choice of whether or not to back its pledges.

“Moving forward, Europe must make a choice as to what sort of internet it wants: an open, global, interoperable internet or one that is fragmented and limited in choice?” Komaitis wrote.

The EU Cyber Diplomacy Initiative is “an EU-funded project focused on policy support, research, outreach and capacity building in the field of cyber diplomacy,” according to the project’s site.

The post EU Analyst: The End of the Internet Is Near appeared first on The New Stack.

]]>
Turbocharging Host Workloads with Calico eBPF and XDP https://thenewstack.io/turbocharging-host-workloads-with-calico-ebpf-and-xdp/ Fri, 27 Jan 2023 18:36:56 +0000 https://thenewstack.io/?p=22698818

In Linux, network-based applications rely on the kernel’s networking stack to establish communication with other systems. While this process is

The post Turbocharging Host Workloads with Calico eBPF and XDP appeared first on The New Stack.

]]>

In Linux, network-based applications rely on the kernel’s networking stack to establish communication with other systems. While this process is generally efficient and has been optimized over the years, in some cases it can create unnecessary overhead that can affect the overall performance of the system for network-intensive workloads such as web servers and databases.

XDP (eXpress Data Path) is an eBPF-based high-performance datapath inside the Linux kernel that allows you to bypass the kernel’s networking stack and directly handle packets at the network driver level.

XDP can achieve this by executing a custom program to handle packets as they are received by the kernel. This can greatly reduce overhead, improve overall system performance and improve network-based applications by shortcutting the normal networking path of ordinary traffic.

However, using raw XDP can be challenging due to its programming complexity and the high learning curve involved. Solutions like Calico Open Source offer an easier way to tame these technologies.

Calico Open Source is a networking and security solution that seamlessly integrates with Kubernetes and other cloud orchestration platforms. While best known for its policy engine and security capabilities, it offers many other features that can be used in an environment by installing Calico. These include routing, IP address management and a pluggable data plane with various options such as eBPF (extended Berkeley Packet Filter), iptables and Vector Packet Processor (VPP).

In addition to these features, Calico also makes it simple to create and load custom XDP programs to target your cluster host interfaces.

The power of XDP can be used to improve the performance of your host workloads by using the same familiar Kubernetes-supported syntax that you use to manage your cluster resources every day.

Calico’s integration with XDP works the same way for any cluster running Linux and Calico, whether it uses iptables, IP Virtual Server (IPVS) or Calico’s eBPF data plane.

By the end of this article, you will know how to harness the power of XDP programs to accelerate your host workloads without needing to learn any programming languages. You will also learn more about these technologies, how Calico offers an easier way to adapt to the ever-changing cloud computing scene and how to use these cutting-edge technologies to boost your cluster performance.

XDP Workload Acceleration

One of the main advantages of XDP for high-connection workloads is its ability to operate at line rate with low overhead for the system. Because XDP is implemented directly in the kernel at the earliest possible execution point, it can process packets at very high speeds with minimal latency. This makes it well suited for high-performance networking applications that require fast and efficient packet processing.

The following image illustrates the packet flow in the Linux kernel.

Fig 1: XDP and general networking packet flow

https://upload.wikimedia.org/wikipedia/commons/3/37/Netfilter-packet-flow.svg

To better understand the advantages of XDP, let’s examine it in a common setup by running an in-memory database on a Kubernetes host and using XDP to bypass the Linux conntrack features.

Host Network Security

A host-based workload is a type of application that runs directly on the host machine. It is often used to describe workloads that are not deployed in a container orchestration platform like Kubernetes, but rather directly on a host.

The following code block illustrates a host workload network socket.
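
For instance, an in-memory database such as Redis running directly on the node shows up as a listening socket on the host itself (output abridged and illustrative):

# List listening TCP sockets on the host
ss -tlnp | grep redis
LISTEN  0  511  0.0.0.0:6379  0.0.0.0:*  users:(("redis-server",pid=1432,fd=6))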

This type of workload is quite difficult to secure by using the ordinary Kubernetes network policy (KNP) resource, since host workloads do not belong to any of the namespaces that Kubernetes moderates. In fact, one of the shortcomings of KNP is its limitation in addressing these types of traffic. But don’t worry, the modular nature of Kubernetes allows us to use Container Networking Interface (CNI) plugins, such as Calico, to address such issues.

Calico Host Endpoint Policy (HEP)

Calico host endpoint policy (HEP) is a Kubernetes custom resource definition that enables the manipulation of host traffic within a cluster. A HEP in Calico represents a host participating in a cluster and the traffic generated by the workload on that host. HEPs can be associated with host network cards and allow you to apply network security policies to the traffic generated by the workload on the host.

A host endpoint policy is defined using the HostEndpoint resource in Calico, and it has the following structure:
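
A minimal sketch (the node name, interface and IP address below are placeholders):

apiVersion: projectcalico.org/v3
kind: HostEndpoint
metadata:
  name: node1-eth0
  labels:
    hep: redis-host
spec:
  # The node this HEP belongs to and the host interface it represents
  node: node1
  interfaceName: eth0
  # The IP addresses Calico should expect on that interface
  expectedIPs:
  - 192.168.1.10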

The metadata field contains information about the host endpoint, such as its name and any labels associated with it. As with other Kubernetes resources, these labels can be used by other resources to reference the traffic to or from a HEP.

The spec field contains the configuration for the HEP, including the name of the interface that it is associated with, the node on which it is running and the expected IP addresses of the designated Kubernetes node network interface card.

Using HostEndpoint with Security Policies

Similar to other Kubernetes security resources, a HEP defaults to deny-all behavior and will impose a lockdown on your cluster in the absence of explicit permit rules, so take a preemptive look at the traffic that should be allowed in your cluster before implementing such a resource. On top of its security advantages, you can refer to HEP labels from other Calico security resources, such as global security policies, to control traffic that might otherwise be difficult to manage and to create more complex, fine-grained security rules for your cluster environment.

The following selector in a security policy would allow you to filter packets that are associated with the redis-host host endpoint policy:

selector: has(hep) && hep == "redis-host"

This selector matches packets that have the hep label with a value of redis-host. You can use this selector in combination with other rules in a security policy to specify how the matched packets should be treated by your CNI. In the next section, we will use the same logic to bypass the Linux conntrack feature.

Connection Tracking in Linux

By default, networking in Linux is stateless, meaning that each incoming and outgoing traffic flow must be specified before it can be processed by the system networking stack. While this provides a strong security measure, it can also add complexity in certain situations. To address this, conntrack was developed.

Conntrack, or connection tracking, is a core feature of the Linux kernel used by technologies such as stateful firewalls. It allows the kernel to keep track of all logical network connections or flows by maintaining a list of the state of each connection, such as the source and destination IP addresses, protocol and port numbers.

This list has an adjustable soft limit, meaning that it can expand as needed to accommodate new connections. However, in some cases, such as with short-lived connections, conntrack can become a bottleneck for the system and affect performance.

For example, this is the limit on my local computer.
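
Both the soft limit and the current count are exposed under /proc (the numbers below are illustrative; the limit scales with available memory):

# Maximum number of connections the conntrack table will hold
cat /proc/sys/net/netfilter/nf_conntrack_max
262144

# Number of connections currently being tracked
cat /proc/sys/net/netfilter/nf_conntrack_count
1873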

While it is possible to dig deeper into the conntrack table details from the /proc/sys/net/netfilter path in your system, applications such as conntrack can give you a more organized view of these records.

The following code block illustrates recorded entries by my local Kubernetes host.
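
Each entry records the state and both directions of one tracked flow (addresses and ports are illustrative, and the output is abridged):

conntrack -L | head -2
tcp   6 86390 ESTABLISHED src=10.0.0.15 dst=10.96.0.1 sport=48032 dport=443 src=10.96.0.1 dst=10.0.0.15 sport=443 dport=48032 [ASSURED] mark=0 use=1
udp   17 28 src=10.0.0.15 dst=10.96.0.10 sport=51324 dport=53 src=10.96.0.10 dst=10.0.0.15 sport=53 dport=51324 mark=0 use=1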

In addition to securing a cluster, a host endpoint selector can be added to a Calico global security policy with a doNotTrack value to bypass Linux connection tracking for specific flows. This method can be beneficial for network-intensive workloads that receive a massive number of short-lived requests, and it prevents the Linux conntrack table from overflowing.

The following code block is a policy example that bypasses the Linux conntrack table.
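
A sketch of such a policy, reusing the redis-host label from the earlier host endpoint and Redis’s default port (note that Calico requires applyOnForward: true on untracked policies, and the explicit egress rule is the stateless return path discussed below):

apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: redis-donottrack
spec:
  selector: has(hep) && hep == "redis-host"
  order: 10
  # Skip Linux connection tracking for traffic matched by this policy
  doNotTrack: true
  applyOnForward: true
  ingress:
  - action: Allow
    protocol: TCP
    destination:
      ports:
      - 6379
  egress:
  # Untracked traffic is stateless, so replies must be allowed explicitly
  - action: Allow
    protocol: TCP
    source:
      ports:
      - 6379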

It is worth noting that since doNotTrack disables Linux conntrack capabilities, any traffic that matches the previous policy becomes stateless. In order for responses to return to their originating source, we have to explicitly add a return (egress) rule to our Calico global security policy resource.

Calico’s workload acceleration is not limited to XDP. In fact, Calico’s eBPF data plane can be used to provide acceleration for other types of workloads in a cluster. If you would like to learn more about Calico’s eBPF data plane, please click here.

Conclusion

Overall, eBPF and XDP are powerful technologies that offer significant advantages for high-connection workloads, including high performance, low overhead and programmability.

In this article, we established how Calico makes it easier to take advantage of these technologies without worrying about the complexities associated with the learning curve involved with these complicated technologies.

Check out our free self-paced workshop, “Turbocharging host workloads with Calico eBPF and XDP,” to learn more.

The post Turbocharging Host Workloads with Calico eBPF and XDP appeared first on The New Stack.

]]>
Azure Went Dark https://thenewstack.io/azure-went-dark/ Wed, 25 Jan 2023 21:06:13 +0000 https://thenewstack.io/?p=22698744

And down went all Microsoft 365 services around the world. One popular argument against putting your business trust in the

The post Azure Went Dark appeared first on The New Stack.

]]>

And down went all Microsoft 365 services around the world.

One popular argument against putting your business’s trust in the cloud is that if your hyper-cloud provider goes down, so does your business. Well, early on the U.S. East Coast morning, it happened. Microsoft Azure went down, and along with it went Microsoft 365, Exchange Online, Outlook, SharePoint Online, OneDrive for Business, GitHub, Microsoft Authenticator, and Teams. In short, pretty much everything running on Azure went boom.

Azure’s status page revealed the outage hit everything in the Americas, Europe, Asia-Pacific, the Middle East, and Africa. The only area to avoid the crash was China.

First Report

Microsoft first reported the problem at 2:31 a.m. Eastern, just as Europe was getting to work. The Microsoft 365 Status Twitter account reported, “We’re investigating issues impacting multiple Microsoft 365 services.”

Of course, by that time, users were already screaming. As one Reddit user on the sysadmin subreddit wrote, “Move it to the cloud, they said, it will never go down, they said, we will save so much money they said.”

The Resolution

Later, Microsoft reported, “We’ve rolled back a network change that we believe is causing impact. We’re monitoring the service as the rollback takes effect.” By 9:31 a.m., Microsoft said the disaster was over. “We’ve confirmed that the impacted services have recovered and remain stable.” But, “We’re investigating some potential impact to the Exchange Online Service.” So, Exchange admins and users? Don’t relax just yet.

What Caused It?

So, what really caused it? Microsoft isn’t saying, but my bet, as a former network administrator, is it was either a Domain Name System (DNS) or Border Gateway Protocol (BGP) misconfiguration. Given the sheer global reach of the failure across multiple Azure Regions, I’m putting my money on BGP.

The post Azure Went Dark appeared first on The New Stack.

]]>
Performance Measured: How Good Is Your WebAssembly? https://thenewstack.io/performance-measured-how-good-is-your-webassembly/ Thu, 19 Jan 2023 16:00:25 +0000 https://thenewstack.io/?p=22697175

WebAssembly adoption is exploding. Almost every week at least one startup, SaaS vendor or established software platform provider is either

The post Performance Measured: How Good Is Your WebAssembly? appeared first on The New Stack.

]]>

WebAssembly adoption is exploding. Almost every week at least one startup, SaaS vendor or established software platform provider is either beginning to offer Wasm tools or has already introduced Wasm options in its portfolio, it seems. But how can all of the different offerings compare performance-wise?

The good news is that given Wasm’s runtime simplicity, the actual performance at least for runtime can be compared directly among the different WebAssembly offerings. This direct comparison is certainly much easier to do when benchmarking distributed applications that run on or with Kubernetes, containers and microservices.

This means whether a Wasm application is running in a browser, on an edge device or on a server, the computing optimization that Wasm offers in each instance is end-to-end, and its runtime environment is in a tunnel of sorts — obviously good for security — and not affected by the environments in which it runs, as it runs directly at the machine level on the CPU.

Historically, Wasm had been around for a while before the World Wide Web Consortium (W3C) named it a web standard in 2019, making it the fourth web standard alongside HTML, CSS and JavaScript. But while web browser applications have represented Wasm’s central and historical use case, again, the point is that it is designed to run anywhere on a properly configured CPU.

In the case of a PaaS or SaaS service in which Wasm is used to optimize computing performance — whether it is running in a browser or not — the computing optimization that Wasm offers in runtime can be measured directly between the different options.

At least one application is increasingly being adopted to benchmark the different runtimes, compilers and JITs of the various Wasm implementations: libsodium. While anecdotal evidence, this writer has contacted at least 10 firms that have used it or know of it.

Libsodium consists of a library for encryption, decryption, signatures, password hashing and other security-related applications. Its maintainers describe it in the documentation as a portable, cross-compilable, installable and packageable fork of NaCl, with an extended API to improve usability.

Since its introduction, the libsodium benchmark has been widely used to pick the best runtimes, cryptography engineer Frank Denis said. Libsodium includes 70 tests, covering a large number of optimizations code generators can implement, Denis noted. None of these tests perform any kind of I/O (disk or network), so they are actually measuring the real efficiency of compilers and runtimes in a platform-independent way. “Runtimes would rank the same on a local laptop and on a server in the cloud,” Denis said.

Indeed, libsodium is worthwhile for testing some Wasm applications, Fermyon Technologies CEO and co-founder Matt Butcher told the New Stack. “Any good benchmark tool has three desirable characteristics: It must be repeatable, fair (or unbiased toward a particular runtime), and reflective of production usage,” Butcher said. “Libsodium is an excellent candidate for benchmarking. Not only is cryptography itself a proper use case, but the algorithms used in cryptography will suss out the true compute characteristics of a runtime.”

Libsodium is also worthwhile for testing some Wasm environments because it includes benchmarking tasks with a wide range of requirement profiles, some probing for raw CPU or memory performance while others check for more nuanced performance profiles, Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. “The current results show the suite’s ability to reveal significant differences in performance between the various runtimes, both for compiled languages and for interpreted ones,” Volk said. “Comparing this to the performance of apps that run directly on the operating system, without WASM in the middle, provides us with a good idea of the potential for future optimization of these runtimes.”

True Specs

In a blog post, Denis described how different Wasm runtimes were benchmarked in tests he completed. They included:

  • Iwasm, which is part of the WAMR (“WebAssembly micro runtime”) package — pre-compiled files downloaded from their repository.
  • Wasm2c, included in the Zig source code for bootstrapping the compiler.
  • Wasmer 3.0, installed using the command shown on their website. The three backends have been individually tested.
  • Wasmtime 4.0, compiled from source.
  • Node 18.7.0 installed via the Ubuntu package.
  • Bun 0.3.0, installed via the command shown on their website.
  • Wazero from git rev 796fca4689be6, compiled from source.

Which one came out on top in the runtime tests? Iwasm, which is part of WebAssembly Micro Runtime (WAMR), according to Denis’ results. The iwasm VM core is used to run WASM applications. It supports interpreter mode, ahead-of-time compilation (AOT) mode and just-in-time compilation (JIT) modes, LLVM JIT and Fast JIT, according to the project’s documentation.

This does not mean that iwasm wins accolades for simplicity of use. “Compared to other options, [iwasm] is intimidating,” Denis wrote. “It feels like a kitchen sink, including disparate components.” These include IDE integration, an application framework library’s remote management and an SDK “that makes it appear as a complicated solution to simple problems. The documentation is also a little bit messy and overwhelming,” Denis writes.

Runtime Isn’t Everything

Other benchmarks exist to gauge the differences in performance among different Wasm alternatives, and Denis pointed to several test alternatives.

However, benchmarking runtime performance is not an essential metric for all WebAssembly applications. Other test alternatives exist to test different Wasm runtimes that focus on very specific tasks, such as calculating the Fibonacci sequence, sorting data arrays or summing up integers, Volk noted. There are other more comprehensive benchmarks consisting of the analysis of entire use cases, such as video processing, editing of PDF, or even deep learning-based object recognition, Volk said.

“Wasm comes with the potential of delivering near-instant startup and scalability and can therefore be used for the cost-effective provisioning and scaling of network bandwidth and functional capabilities,” Volk said. “Evaluating this rapid startup capability based on specific use case requirements can show the direct impact of a Wasm runtime on the end-user experience.”

Some Wasm applications are used in networking to improve latency. Runtime performance is important, of course, but it is latency that counts in this case, said Sehyo Chang, chief technology officer at InfinyOn. That is because latency plays a crucial role in determining the overall user experience in any application, Chang said. “A slow response time can greatly impact user engagement and lead to dissatisfaction, potentially resulting in lost sales opportunities,” Chang said.

During a recent KubeCon + CloudNativeCon conference, Chang gave a talk about using Wasm to replace Kafka for lower-latency data streaming. Streaming technology based on Java, like Kafka, can experience high latency due to garbage collection and the JVM, Chang said. Using WebAssembly for stream processing avoids these penalties, resulting in a significant reduction in latency while also providing more flexibility and security, he said.

The post Performance Measured: How Good Is Your WebAssembly? appeared first on The New Stack.

]]>
How to Overcome Challenges in an API-Centric Architecture https://thenewstack.io/how-to-overcome-challenges-in-an-api-centric-architecture/ Mon, 09 Jan 2023 17:00:50 +0000 https://thenewstack.io/?p=22697222

This is the second in a two-part series. For an overview of a typical architecture, how it can be deployed

The post How to Overcome Challenges in an API-Centric Architecture appeared first on The New Stack.

]]>

This is the second in a two-part series. For an overview of a typical architecture, how it can be deployed and the right tools to use, please refer to Part 1.

Most APIs impose usage limits on the number of requests per month as well as rate limits, such as a maximum of 50 requests per minute. A third-party API can be used by many parts of the system. Handling subscription limits requires the system to track all API calls and raise alerts when a limit is about to be reached.

Often, increasing the limit requires human involvement, and alerts need to be raised well in advance. The system deployed must be able to track API usage data persistently to preserve data across service restarts or failures. Also, if the same API is used by multiple applications, collecting those counts and making decisions needs careful design.
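
As an illustration of what such tracking can look like, here is a minimal sketch that persists per-API monthly counts in SQLite so they survive restarts and prints an alert at 80% of the quota; the quota, threshold and alerting mechanism are placeholders for whatever a real deployment would use:

    # Track third-party API usage persistently and alert before the monthly
    # quota is exhausted. The quota and the alert hook are illustrative.
    import sqlite3
    from datetime import datetime, timezone

    MONTHLY_QUOTA = 100_000      # hypothetical subscription limit
    ALERT_AT = 0.8               # raise an alert once 80% of the quota is used

    db = sqlite3.connect("api_usage.db")   # survives service restarts
    db.execute("CREATE TABLE IF NOT EXISTS usage (api TEXT, month TEXT, calls INTEGER, "
               "PRIMARY KEY (api, month))")

    def record_call(api: str) -> None:
        month = datetime.now(timezone.utc).strftime("%Y-%m")
        db.execute("INSERT INTO usage VALUES (?, ?, 1) "
                   "ON CONFLICT(api, month) DO UPDATE SET calls = calls + 1", (api, month))
        db.commit()
        (calls,) = db.execute("SELECT calls FROM usage WHERE api = ? AND month = ?",
                              (api, month)).fetchone()
        if calls >= MONTHLY_QUOTA * ALERT_AT:
            print(f"ALERT: {api} has used {calls} of {MONTHLY_QUOTA} calls this month")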

Rate limits are more complicated. If the problem is handed down to developers, they will invariably add sleep statements, which solves the problem in the short term but leads to complicated issues when the timing changes. A better approach is to use a concurrent data structure that limits rates. Even then, if the same API is used by multiple applications, controlling rates is more complicated.

One option is to assign each application a portion of the rate limit, but the downside is that some bandwidth will be wasted: while some applications are waiting for capacity, others might be idle. The most practical solution is to send all calls through an outgoing proxy that can enforce all limits in one place.
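
One common concurrent data structure for this is a token bucket. The sketch below shows a minimal in-process version; an outgoing proxy would apply the same idea across all applications. The rate and burst values are illustrative:

    # A thread-safe token bucket: callers block until a token is available, so
    # the process as a whole never exceeds the agreed request rate.
    import threading
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = float(burst)
            self.updated = time.monotonic()
            self.lock = threading.Lock()

        def acquire(self) -> None:
            while True:
                with self.lock:
                    now = time.monotonic()
                    self.tokens = min(self.capacity,
                                      self.tokens + (now - self.updated) * self.rate)
                    self.updated = now
                    if self.tokens >= 1:
                        self.tokens -= 1
                        return
                    wait = (1 - self.tokens) / self.rate
                time.sleep(wait)   # sleep outside the lock, then retry

    # 50 requests per minute, matching the example limit above
    limiter = TokenBucket(rate_per_sec=50 / 60, burst=5)
    limiter.acquire()   # call before every outgoing API request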

Apps that use external APIs will almost always run into this challenge. Even internal APIs will have the same challenge if they are used by many applications. If an API is only used by one application, there is little point in making that an API. It may be a good idea to try to provide a general solution that handles subscription and rate limits.

Overcoming High Latencies and Tail Latencies

Given a series of service calls, tail latencies are the few calls that take the longest to finish. If tail latencies are high, some requests will take too long or time out. If the API calls happen over the internet, tail latencies get even worse. When we build apps by combining multiple services, each service adds latency, and the risk of timeouts increases significantly.

Tail latency has been widely discussed, and we will not repeat that discussion here. However, it is a good idea to explore this area if you plan to run APIs under high-load conditions. See [1], [2], [3], [4] and [5] for more information.

But why is this a problem? If the APIs we expose do not provide service-level agreement (SLA) guarantees (such as responding within 700 milliseconds at the 99th percentile), it would be impossible for downstream apps that use our APIs to provide any guarantees of their own. Unless everyone can stick to reasonable guarantees, the whole API economy will come crashing down. Newer API specifications, such as the Australian Open Banking specification, define latency limits as part of the specification.

There are several potential solutions. If the use case allows it, the best option is to make tasks asynchronous. If you are calling multiple services, the combined call inevitably takes too long, and it is often better to set the right expectations by promising to deliver the results when they are ready rather than forcing the end user to wait on the request.
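
A minimal sketch of the asynchronous approach: accept the request, return a job ID immediately and do the slow multi-service work in a background worker. The generate_report function is a placeholder for the real downstream calls, and a production system would use a durable queue rather than an in-memory one:

    # Accept the request, hand back a job ID and finish the work later.
    import queue
    import threading
    import uuid

    jobs: dict[str, dict] = {}
    work: queue.Queue = queue.Queue()

    def generate_report(request: dict) -> dict:   # placeholder for slow multi-service work
        return {"report": "..."}

    def submit(request: dict) -> str:
        job_id = str(uuid.uuid4())
        jobs[job_id] = {"status": "pending", "result": None}
        work.put((job_id, request))
        return job_id                             # the client polls, or is called back, later

    def worker() -> None:
        while True:
            job_id, request = work.get()
            jobs[job_id] = {"status": "done", "result": generate_report(request)}

    threading.Thread(target=worker, daemon=True).start()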

When service calls do not have side effects (such as search), there is a second option: latency hedging, where we start a second call when the wait time exceeds the 80th percentile and respond when one of them has returned. This can help control the long tail.
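
A sketch of latency hedging using a thread pool; the search function is a placeholder for the real, side-effect-free call, and the 80th-percentile value would come from measured latencies:

    # Fire one request; if it has not returned by the p80 latency, fire a
    # second identical request and take whichever finishes first.
    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    _pool = ThreadPoolExecutor(max_workers=8)   # shared pool, not shut down per call

    def search(query: str) -> list:             # placeholder for the real idempotent call
        return ["result for " + query]

    def hedged_search(query: str, p80_seconds: float = 0.2) -> list:
        first = _pool.submit(search, query)
        done, _ = wait([first], timeout=p80_seconds)
        futures = [first] if done else [first, _pool.submit(search, query)]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return next(iter(done)).result()        # the slower call finishes in the background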

The third option is to complete as much work as possible in parallel: instead of waiting for each response before making the next service call, start as many calls as possible at once. This is not always possible, because some service calls may depend on the results of earlier ones. Also, code that calls multiple services in parallel, then collects and combines the results, is more complex than code that calls them one after the other.
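
A minimal fan-out sketch for the parallel case, where the three fetch functions are placeholders for independent service calls; the total wait is roughly the slowest call rather than the sum of all of them:

    # Fan out independent downstream calls in parallel and combine the results.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_profile(user_id: str) -> dict:           # placeholder service call
        return {"name": "..."}

    def fetch_orders(user_id: str) -> dict:            # placeholder service call
        return {"orders": []}

    def fetch_recommendations(user_id: str) -> dict:   # placeholder service call
        return {"items": []}

    def build_dashboard(user_id: str) -> dict:
        with ThreadPoolExecutor(max_workers=3) as pool:
            futures = [pool.submit(f, user_id)
                       for f in (fetch_profile, fetch_orders, fetch_recommendations)]
            combined: dict = {}
            for fut in futures:        # total wait is about the slowest call, not the sum
                combined.update(fut.result())
            return combined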

When a timely response is needed, you are at the mercy of your dependent APIs. Unless caching is possible, an application cannot respond faster than its slowest dependent service. When load increases, if a dependent endpoint cannot scale while keeping response times within its SLA, we will experience higher latencies. If the dependent API can stay within its SLA, we can get more capacity by paying for a higher level of service or by buying multiple subscriptions. When that is possible, staying within the latency SLA becomes a capacity planning problem, where we have to keep enough capacity in hand to manage the risk of latency problems.

Another option is to have multiple API options for the same function. For example, if you want to send an SMS or email, there are multiple options. However, it is not the same for many other services. It is possible that as the API economy matures, there will be multiple competing options for many APIs. When multiple options are available, the application can send more traffic to the API that responds faster, giving it more business.

If our API has only one client, things are simple: we can let that client use the API as heavily as our system allows. However, if we are supporting multiple clients, we need to reduce the possibility of one client slowing down the others. This is the same reason other APIs impose rate limits, and we should define rate limits in our own API’s SLA as well. When a client sends too many requests too fast, we should reject them with a status code such as HTTP 503, which tells the client it must slow down. This process is called backpressure: we signal to upstream clients that the service is overloaded, and that signal is eventually propagated to the end user.
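
A minimal sketch of that per-client check, using a sliding one-minute window and returning 503 with a Retry-After hint once a client exceeds its share; the limit and window are illustrative:

    # Per-client backpressure: track each client's recent requests and reject
    # with 503 once the client exceeds its published per-minute allowance.
    import time
    from collections import defaultdict, deque

    CLIENT_LIMIT = 100          # requests per minute per client, per the published SLA
    WINDOW = 60.0               # seconds

    _recent: dict = defaultdict(deque)

    def check_client(client_id: str) -> tuple:
        now = time.monotonic()
        calls = _recent[client_id]
        while calls and now - calls[0] > WINDOW:
            calls.popleft()                       # drop requests outside the window
        if len(calls) >= CLIENT_LIMIT:
            return 503, {"Retry-After": "5"}      # tell this client to slow down
        calls.append(now)
        return 200, {}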

If we are overloaded without any single client sending requests too fast, we need to scale up. If we cannot scale up, we still need to reject some requests. It is important to note that rejecting requests in this case makes our system unavailable, while rejecting requests in the earlier case, where one client is exceeding its SLA, does not count as unavailable time.

Cold start times (the time for a container to boot up and begin serving requests) are another latency source. A simple solution is to keep a replica running at all times; this is acceptable for high-traffic APIs but can be expensive if you have many low-traffic APIs. In such cases, you can predict the traffic and warm up the container ahead of time (using heuristics, AI or both). Another option is to optimize the startup time of the servers to allow for fast bootup.

Latency, scale and high availability are closely linked. Even a well-tuned system will need to scale to keep running within acceptable latency. If our API has to reject valid requests due to load, it is unavailable from the user’s point of view.

Managing Transactions across Multiple APIs

If all the code runs in a single runtime (such as a JVM), it can commit its work as one transaction. For example, pre-microservices-era monolithic applications could handle most transactions directly with the database. However, as we break the logic across multiple services (and hence multiple runtimes), we cannot carry a single database transaction across multiple service invocations without doing additional work.

One solution has been programming language-specific transaction implementations provided by an application server (such as Java transactions). Another is using Web Services atomic transactions, if your platform supports them. Yet another is to use a workflow system (such as Apache ODE or Camunda) that has support for transactions. You can also use queues and combine database transactions and queue system transactions into a single transaction through a transaction manager such as Atomikos.

This topic has been discussed in detail in the context of microservices, and we will not repeat those discussions here. Please refer to [6], [7] and [8] for more details.

Finally, with API-based architectures, troubleshooting is likely to be more involved. It is important to have enough tracing and logging to determine whether an error is happening on our side of the system or on the side of a third-party API. We also need clear data we can share when we need help from a third-party API provider to isolate and fix a problem.
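
One simple habit that helps here is tagging every outbound call with a correlation ID and logging it on both request and response, so the same ID can be quoted to the provider. A minimal sketch, assuming the requests library and a hypothetical endpoint:

    # Tag each outbound third-party call with a correlation ID and log it, so a
    # failure can be matched to our own request and shared with the provider.
    import logging
    import uuid

    import requests

    log = logging.getLogger("outbound")

    def call_third_party(payload: dict) -> requests.Response:
        correlation_id = str(uuid.uuid4())
        log.info("calling payments API, correlation_id=%s", correlation_id)
        resp = requests.post(
            "https://api.example.com/v1/charges",          # hypothetical endpoint
            json=payload,
            headers={"X-Correlation-ID": correlation_id},
            timeout=5,
        )
        log.info("payments API responded %s, correlation_id=%s",
                 resp.status_code, correlation_id)
        return resp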

I would like to thank Frank Leymann, Eric Newcomer and others whose thoughtful feedback significantly shaped these posts.

The post How to Overcome Challenges in an API-Centric Architecture appeared first on The New Stack.

]]>
How to Use Time-Stamped Data to Reduce Network Downtime  https://thenewstack.io/how-to-use-time-stamped-data-to-reduce-network-downtime/ Mon, 09 Jan 2023 14:41:54 +0000 https://thenewstack.io/?p=22697210

Increased regulations and emerging technologies forced telecommunications companies to evolve quickly in recent years. These organizations’ engineers and site reliability

The post How to Use Time-Stamped Data to Reduce Network Downtime  appeared first on The New Stack.

]]>

Increased regulations and emerging technologies forced telecommunications companies to evolve quickly in recent years. These organizations’ engineers and site reliability engineering (SRE) teams must use technology to improve performance, reliability and service uptime. Learn how WideOpenWest uses a time series platform to monitor its entire service delivery network.

Trends in the Telecommunications Industry 

Telecommunication companies face challenges that vary depending on where each company is in its life cycle. Across the industry, businesses must modernize their infrastructure while also maintaining legacy systems. At the same time, new regulations at both the local and federal levels increase competition within the industry, and new businesses challenge the status quo set by current industry leaders.

In recent years, the surge in people working from home has required more reliable internet connections to handle increased network bandwidth needs. The growing popularity of smartphones and other devices means there are more devices requiring network connectivity, all without a reduction in network speeds. Latency issues or poor uptime lead to unhappy customers, who then become flight risks. Add to this more frequent security breaches, which require all businesses to monitor their networks so they can detect potential breaches faster.

Challenges to Modernizing Networks

Founded in 1996 in Denver, Colorado, WideOpenWest (WOW) provides internet, video and voice services in various markets across the United States. Over the years, WOW acquired various telecommunication organizations, and as its network expanded, it needed a better network monitoring tool to address a growing list of challenges. For instance, WOW engineers wanted to be able to analyze an individual customer’s cable modem, determine the health of a node and understand the overall state of the network. However, several roadblocks stood in the way. WideOpenWest already used multiple monitoring platforms internally, and purchasing hardware to monitor individual nodes was too expensive. The company had a basic process in place to collect telemetry data from specific modems, but there was no single source of truth to tie everything together.

Using Time Series Data to Reduce Network Latency 

A few years ago, WideOpenWest decided to replace its legacy time series database, and after considering other solutions, it chose InfluxDB, the purpose-built time series database. It now has a four-node cluster of InfluxDB Enterprise in production and a two-node cluster running on OpenStack for testing. The team uses Ansible to automate cluster setup and installation.

The primary motivations for using InfluxDB are to improve overall observability of the entire network and to implement better alerting. The WOW engineers use Telegraf for data collection whenever possible because it integrates easily with all the other systems. Some legacy hardware requires them to use Filebeat, custom scripts and vendor APIs.

They make extensive use of Simple Network Management Protocol (SNMP) polling and traps in the data collection process because that remains an industry standard, despite its age. Specifically, they use SNMP to collect metrics from cable modems and Telegraf to collect time-stamped data from their virtual machines and containers. Using InfluxDB provided the team with the necessary flexibility to work around restrictions from vendor-managed systems, and they now collect data from all desired sources.
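
Where Telegraf cannot reach a device directly, a small collector script can write time-stamped points itself. A minimal sketch, assuming the influxdb (1.x) Python client; poll_modem is a hypothetical stand-in for an SNMP or vendor-API read, and the host and field names are illustrative:

    # Poll a device and write the reading to InfluxDB as a time-stamped point.
    from datetime import datetime, timezone

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="influx.example.net", port=8086, database="telemetry")

    def poll_modem(modem_id: str) -> dict:        # hypothetical SNMP/vendor-API read
        return {"snr_db": 36.2, "uptime_s": 86_400}

    def collect(modem_id: str) -> None:
        reading = poll_modem(modem_id)
        client.write_points([{
            "measurement": "cable_modem",
            "tags": {"modem_id": modem_id},
            "time": datetime.now(timezone.utc).isoformat(),
            "fields": reading,
        }])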

Next they stream the data to Kafka to better control data input and output. Kafka also allows them to easily consume or move data into different regions or systems, if necessary. From the Kafka cluster, they use Telegraf to send data to their InfluxDB Enterprise cluster.

WOW’s team aggregates various metrics from the fiber-to-the-node network, such as:

  • Telemetry metrics, like usage and uptime, from over 650,000 cable modems on a five-minute polling cycle.
  • Status of all television channels upstream and downstream, including audio and visual signal strength and outages.
  • Average signal, port and power levels.
  • Signal-to-noise ratio (SNR) — used to ensure the highest level of wireless functionality.
  • Modulation error ratio (MER) — another measurement used to understand signal quality that factors in the amount of interference occurring on the transmission channel.

The WOW team uses all this data to gain insights from real-time analytics to create visualizations and to trigger alerts and troubleshoot processes. Once the data is in InfluxDB, they use Grafana for all their visualizations. They also leverage InfluxDB’s alerting frameworks to send alerts via ServiceNow, Slack and email. Adopting InfluxDB allowed the WOW team to implement an Infrastructure-as-Code (IaC) system, so instead of spending time manually managing their infrastructure, they can write config files to simplify processes.

Diagram 1: WideOpenWest’s InfluxDB implementation

WideOpenWest’s next big project is to implement a full CI/CD pipeline with automated code promotions. With this, they hope to improve automated testing. WOW also wants to streamline all monitoring across the organization and increase the level of infrastructure monitoring.

The post How to Use Time-Stamped Data to Reduce Network Downtime  appeared first on The New Stack.

]]>