DevOps Overview, News, Trends and Analysis | The New Stack
https://thenewstack.io/devops/

Software Delivery Enablement, Not Developer Productivity (Tue, 26 Sep 2023)
https://thenewstack.io/software-delivery-enablement-not-developer-productivity/

BILBAO — Anna Daugherty thinks we shouldn’t be so obsessed with developer productivity. That was a spicy take at last week’s Continuous Delivery Mini Summit.

“This is something I talked about with almost everyone at the conference,” Daugherty, director of product marketing at Opsera, told The New Stack.

“There’s a difference between an individual trying their best and singling them out for not being productive. But productive doesn’t mean anything,” she continued. “Is individual developer productivity determined by code commits? What they accomplished in a sprint? The individual on the team doesn’t a product make.”

Productivity metrics are trying to answer the wrong question, Daugherty argues, when you really should focus on:

  • Are your customers and end users seeing the value from accelerated delivery?
  • Are developers more satisfied with their job? Do they feel more enabled?
  • Are you creating more opportunities for revenue and investment?

Just like DevOps looks to accelerate the speed of the team’s delivery of software, Daugherty contends, you should look to focus on software team enablement, not individual developer productivity.

How Do You Measure Team Enablement?

The most common DevOps metrics aren’t really metrics. DORA is more of a framework, she says, to measure velocity — via lead time for changes and deployment frequency — and stability — via change failure rate and time to restore service.
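Concretely, the four DORA signals can be derived from little more than a log of deployments. Here is a minimal Python sketch of the arithmetic, assuming hypothetical deployment records with invented field names; it is an illustration, not a reference implementation.

```python
# A rough sketch of the four DORA signals, computed from a hypothetical
# log of deployments. Field names and data are invented for the example.
deployments = [
    {"date": "2023-09-01", "lead_time_hours": 20, "failed": False},
    {"date": "2023-09-03", "lead_time_hours": 48, "failed": True, "restore_hours": 2},
    {"date": "2023-09-07", "lead_time_hours": 12, "failed": False},
]
window_days = 30  # reporting window the log above covers

# Velocity: deployment frequency and lead time for changes.
deployment_frequency = len(deployments) / window_days
avg_lead_time = sum(d["lead_time_hours"] for d in deployments) / len(deployments)

# Stability: change failure rate and time to restore service.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)
time_to_restore = sum(d["restore_hours"] for d in failures) / len(failures) if failures else 0.0

print(f"Deployment frequency:  {deployment_frequency:.2f} deploys/day")
print(f"Lead time for changes: {avg_lead_time:.1f} hours")
print(f"Change failure rate:   {change_failure_rate:.0%}")
print(f"Time to restore:       {time_to_restore:.1f} hours")
```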

DORA “allows you to have some sort of metrics that your teams can work toward, or that have been identified as being metrics that indicate high performance,” she said. “But it’s not necessarily like the end all, be all. It’s not a Northstar metric. It’s an example of what constitutes a high-performing software team.”

For Northstar metrics you could go for the 2021 SPACE developer productivity framework or the recent McKinsey developer productivity effort, which she says “is both SPACE plus DORA plus some other nonsense that they’ve all wrapped up.”

But really, you have to keep it simple. For Daugherty, that comes down to asking why you’re creating software in the first place, which comes down to three audiences:

  • The users.
  • The people who create it.
  • The market.

While DORA and SPACE can point you in the right direction, she says you should be measuring outcomes that help measure the satisfaction of those three reasons to build software.

Customer Enablement: Measure for Customer Satisfaction.

This looks to answer if the software that you’re delivering is usable and delights your customers, she explained. This can be assessed via net promoter scores (NPS), G2 and other product review sites, and customer testimonials.

You need both qualitative data, with tight feedback cycles with your product’s users, and quantitative tracking, like drop-off rates.
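To make the quantitative side concrete, here is a minimal sketch of the two measurements named above, assuming survey responses on the usual 0–10 NPS scale and an invented signup funnel for the drop-off rates; the data shapes are placeholders only.

```python
# Invented survey scores (0-10) and a toy signup funnel -- placeholders for
# real NPS responses and product analytics.
survey_scores = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]

promoters = sum(1 for s in survey_scores if s >= 9)
detractors = sum(1 for s in survey_scores if s <= 6)
nps = (promoters - detractors) / len(survey_scores) * 100
print(f"NPS: {nps:+.0f}")

# Quantitative drop-off tracking: where do users fall out of the funnel?
funnel = [("visited", 1000), ("signed_up", 400), ("activated", 250), ("retained_30d", 180)]
for (step_a, count_a), (step_b, count_b) in zip(funnel, funnel[1:]):
    print(f"{step_a} -> {step_b}: {1 - count_b / count_a:.0%} drop-off")
```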

Developer Enablement: Measure for Employee Satisfaction.

Look to answer: Do your developers enjoy creating and releasing software? Do you have a high level of developer burnout? This is where platform engineering comes in as a way to increase developer enablement and reduce friction to release. This can be measured via platform adoption rate, regular developer surveys with an actionable follow-up strategy, Glassdoor reviews and sentiment on their public social media.

Business Enablement: Measure for Market Share.

Is your delivered software helping capture the desired market share? Is it creating investment and/or partnership opportunities? Is it actually moving the sales pipeline along, generating measurable profit? Daugherty explained that business metrics are assessed by measuring things like the sales pipeline, investment and partnerships.

Some companies only seem to focus on the business metrics. But, while there’s been a noticeable shift in the tech industry “from growth at any cost to every dollar matters,” Daugherty emphasizes, how to increase developer productivity isn’t the right question to be asking.

Part of this is the fundamental disconnect between business leadership and engineering teams.

“Business leadership is always measuring revenue and pipeline, but that isn’t making its way to the engineering teams, or it’s not being translated in a way that they can understand,” she said. “They’re always chasing their tails about revenue, about pipeline, about partnerships [and] about investment, but it really should be a full conversation amongst the entirety of business, with engineering as a huge consideration for who that audience should be.”

Indeed, engineering tends to have the highest salaries, making it an important cost center. One of the early goals of platform engineering should be to facilitate a common language where business understands the benefits of engineering, while engineers understand the connection of their work to delivering business value.

Still, a lot of organizations fall short here. Sometimes, Daugherty says, that persistent chasm can be bridged by the blended role of Chief Digital Officer or Chief Transformation Officer.

How to Help Teams Improve Their Outcomes

Software delivery enablement and 2023’s trend of platform engineering won’t succeed by focusing solely on people and technology. At most companies, processes need an overhaul too.

A team has “either a domain that they’re working in or they have a piece of functionality that they have to deliver,” she said. “Are they working together to deliver that thing? And, if not, what do we have to do to improve that?”

Developer enablement should be concentrated at the team outcome level, says Daugherty, which can be positively influenced by four key capabilities:

  • Continuous integration and continuous delivery (CI/CD)
  • Automation and Infrastructure as Code (IaC)
  • Integrated testing and security
  • Immediate feedback

“Accelerate,” the iconic, metrics-centric guide to DevOps and scaling high-performing teams, identifies decisions that are proven to help teams speed up delivery.

One is that when teams are empowered to choose which tools they use, this is proven to improve performance. When asked if this goes against platform engineering and its establishment of golden paths, Daugherty remarked that this train of thought derails from the focus on enablement.

“Platform engineering is not about directing which tools you use. That’s maybe what it has been reduced to in some organizations, but that’s not the most effective version of that,” she said. “Platform thinking is, truly, you are Dr. Strange from the Avengers, and you see the bigger picture and where things come together and align.”

Platform teams shouldn’t be adopting a rigid, siloed mindset of this team does this and uses that tool.

Platform engineering is about bringing people, products and processes together to increase efficiency and effectiveness for all teams, Daugherty clarified. “Does that mean maybe sometimes choosing better workflows or technology and architecture? Yes, maybe for your business. But that’s just a reductive way of thinking about it,” if that’s your whole platform strategy.

Regardless of job roles, she emphasizes, DevOps and platform engineering are ways of working, not things you do or don’t do. And a platform team looks to trace over the DevOps infinity symbol, to make the pathway to delivery and the communication between Dev and Ops even smoother.

“A lot of people tell me: ‘I hate people like you, because you come in and tell me that I need this tool, and I need to do it this way’,” she said. After all, Opsera is a unified DevOps platform for engineering teams of any size.

But she always counters, “I’m not here to tell you anything. I’m here to help you do the work that you want to do because your work matters. And to help you understand how to communicate that value that you’re bringing to your organization to those business leaders who want more from you. And they constantly will be asking more from you.”

Her role, Daugherty says, is to help teams — and by extension the individual developers that make them up — to figure out how to deliver more, without increasing developer burnout.

DevOps Is First about Facilitating Meaningful Communication

DevOps is about enabling the right kind of communication to increase speed and collaboration — not creating more human-in-the-loop bumps in the road.

“Teams that reported no approval process or used peer review achieved higher software delivery performance,” found the research cited in Accelerate. “Teams that required approval by an external body achieved lower performance.”

It doesn’t necessarily mean zero approval process, but it’s more about keeping the majority of decisions within that team unit. Accelerate continues with the recommendation to “use a lightweight change approval process based on peer review, such as pair programming or intra-team code review, combined with a deployment pipeline to detect and reject bad changes.”

This could be an ops-driven, automated approval like in your Infrastructure as Code or an integrated test, explained Daugherty, or a human-to-human opportunity for shared knowledge.

“It’s gone through a peer review so that people on your team are in agreement that this is what they want to deliver,” she said. “And then it’s automatically deployed to production and not hanging around waiting at some gate [or] some bottleneck. If you have integrated testing and security throughout your pipeline, that’s going to enable you to do that.”
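A minimal sketch of what such a lightweight, automated gate might look like, assuming a hypothetical change record assembled earlier in the pipeline (the field names and checks are invented for illustration, not any particular CI system’s API):

```python
# Hypothetical change record assembled earlier in the pipeline; the fields and
# checks are invented for illustration, not any particular CI system's API.
change = {
    "id": "payments-svc-4821",
    "peer_approvals": ["alice"],   # intra-team review, not an external approval board
    "tests_passed": True,          # integrated testing in the pipeline
    "security_scan_passed": True,  # integrated security in the pipeline
}

def ready_to_deploy(change: dict) -> bool:
    """Lightweight gate: peer review plus automated checks, no external approval step."""
    return (
        len(change["peer_approvals"]) >= 1
        and change["tests_passed"]
        and change["security_scan_passed"]
    )

if ready_to_deploy(change):
    print(f"{change['id']}: deploying to production")
else:
    print(f"{change['id']}: rejected by the pipeline, feedback goes straight back to the team")
```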

Peer review processes ensure readability and thus maintainability of code, while also facilitating informal training. Daugherty recalled what Andrew Fenner, master designer at Ericsson, said during a panel on developer experience and productivity, also at the Continuous Delivery Mini Summit.

“Ericsson is kind of an old school sort of company, and so them being able to do this lightweight approval process is kind of a miracle.” Daugherty continued that Fenner spoke about how, sometimes, their most senior developers spend most of their time helping more junior developers, instead of committing code themselves. If you measured these senior members by traditional, individual developer productivity metrics, they would score poorly. But, in reality, their less measurable impact has them helping to improve Ericsson’s junior developers every day. It also means knowledge is not perilously held by one team member but shared across the team.

“That’s what I mean by lightweight — not expecting your developers to have all the answers all the time, [but] to have some mechanism that’s easy for them to get feedback quickly. And to utilize the most knowledgeable and helpful people on your teams to be able to deliver quickly that feedback,” Daugherty said. “That lightweight idea is very much about not standing in people’s way. The better and easier they can deploy to production, the better outcomes they will have, based on velocity and stability.”

 

Platform Engineering Helps a Scale-up Tame DevOps Complexity (Tue, 26 Sep 2023)
https://thenewstack.io/platform-engineering-helps-a-scale-up-tame-devops-complexity/


Going from startup to scale-up is a great moment for any tech company. It means you have great customer traction and proof of value that you can expand your reach to new markets and verticals.

But it also means it’s time to scale up your technology, often in the cloud. And that isn’t easy.

Capillary Technologies, which builds Software as a Service (SaaS) products within the customer loyalty and engagement domain, saw its customers increase in number from 100 to 250. It started experiencing the typical scale-up growing pains, Piyush Kumar, the company’s CTO, told The New Stack.

As Capillary’s team grew significantly, its challenges pertaining to DevOps complexity also grew. Read on to see if these challenges ring true for you and how Capillary Technologies leveraged Facets.cloud’s self-service infrastructure management and adopted platform engineering to boost developer productivity and deliver value to end customers faster.

DevOps Doesn’t Scale by Itself

When Kumar joined Capillary as a principal architect in 2016, the company’s presence was growing in India, Southeast Asia and the Middle East, while starting to gain traction in China. But when it looked to go further, this company built on Amazon Web Services (AWS) started hitting some common roadblocks in the cloud.

“The ratio of number of developers to the number of people in our DevOps infrastructure team was starting to get skewed,” Kumar said. “That meant that the number of requests going in from the engineers to the DevOps teams was growing, so the operations tickets were basically growing, and our response times were beginning to slow down.”

Toward the end of 2019, Capillary started to expand to new markets and cloud regions in the U.S. and Europe. These opportunities also presented challenges.

“Newer regions essentially meant spinning off the entire software, infrastructure, monitoring, everything else in a different region,” he said.

Launching in new regions requires organizations to adhere to data sovereignty and data localization laws.

As these launches occurred, Capillary’s infrastructure was in a semi-automated mode. “When you’re in that mode, there are things that are automated and then there are quite a few things that are not. So you don’t have enough visibility into your overall environment stack,” said Kumar.

New regions brought a lot of surprises — the DevOps team had to grow to manage the new environments, and had to meet the new demands of the growing customer base, product portfolio and required number of infrastructure components.

At the same time, Capillary grew from about 100 to 250 engineers.

“We didn’t want stability to start to take a hit, because we now needed to release across multiple environments,” Kumar said. In short, he noted, “more than linear scaling was needed to manage all of this.”

The Cloud Native Complexity Problem

A lot of platform engineering initiatives are sparked by struggles with disparate dev and ops tooling. This was not the case at Capillary, which has always had centrally managed infrastructure.

This is why, in order to battle this complexity at scale, the team members logically tried to increase the automation coverage of their infrastructure. But they found themselves stuck in a constant game of catchup.

“So we tried to continue to automate more and more, and it continued as a team, where you would do more and then you will realize that there is more to be done, so it felt like a constant battle because that landscape kept growing,” Kumar said.

“In six months, whatever we went ahead and automated, we basically carried newer debt, so there was more to be automated.”

For instance, they adopted the open source database MongoDB to bring new infrastructure, storage and database capabilities into the Capillary ecosystem. The DevOps team soon realized that they couldn’t easily automate everything — from launching in new regions to monitoring, backups, upgrades, patches and restoration.

By the time the Capillary teams automated whatever they could, they had also adopted Apache Kafka for real-time data streaming and AWS EMR to run and scale workloads — which they then also tried to automate.

Capillary’s teams had gone the open source route to avoid vendor lock-in. But whether they went open source or proprietary, they realized the complexity of the cloud native landscape means stitching a lot of automation toolchains together.

To tackle this, they needed:

  • Something that would make the overall infrastructure and deployment architecture more uniform, more visible and 100% automated, from build to deploy.
  • To move developers from being reliant on the DevOps team, to being able to provision infrastructure in a self-service way. This includes documentation uniformity to create a single source of truth.
  • A tool to manage the environment, infrastructure and deployment.

The solution Capillary sought, Kumar said, would allow users to “go ahead and create a document. You would say that this is my source of truth. And now I go ahead and do all of this in this way. And I do it uniformly all the time.”

In short, he wondered, “Is this something that a software could translate in terms of managing your environment, infrastructure, deployments, everything?”

Building an Infrastructure Blueprint

A lot of companies kick off their adoption of platform engineering with a journey of discovery. They literally ask themselves: what technology do we have and who owns what?

In late 2020, Capillary began partnering with Facets to co-build a solution to help answer this question. Capillary chose Facets in part because it automated the cataloging of applications, databases, caches, queues and storage across the infrastructure, as well as the interdependencies among them. This cataloging helped to create a deployment blueprint of how architecture should look in an environment.

Example of Facets.cloud Blueprint Designer interface

Facets’ Blueprint Designer provides a high-level view of the entire architecture and detailed information on deployed resources.

“Once you have a single blueprint, then whatever it is you do downstream in terms of launching your infrastructure, in terms of running your applications, in terms of monitoring and managing, everything becomes a downstream activity from there,” Kumar said.

“This essentially is the piece which brings in good visibility and a standardized structure of how your blueprint would look like for your entire environment and applications.”
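Facets’ actual blueprint format isn’t shown here, but as a rough mental model, a deployment blueprint boils down to a catalog of resources plus their interdependencies, from which downstream activities such as launch ordering can be derived. A hypothetical Python sketch, not Facets’ schema:

```python
# Hypothetical model of a deployment blueprint -- not Facets' actual schema.
from dataclasses import dataclass, field

@dataclass
class Resource:
    name: str
    kind: str                      # e.g. "service", "database", "queue"
    depends_on: list[str] = field(default_factory=list)

@dataclass
class Blueprint:
    """Single source of truth describing how an environment should look."""
    name: str
    resources: list[Resource]

    def launch_order(self) -> list[str]:
        """Very naive dependency-first ordering for launching a new environment."""
        ordered, seen = [], set()
        def visit(r: Resource):
            for dep in r.depends_on:
                visit(next(x for x in self.resources if x.name == dep))
            if r.name not in seen:
                seen.add(r.name)
                ordered.append(r.name)
        for r in self.resources:
            visit(r)
        return ordered

bp = Blueprint("loyalty-stack", [
    Resource("mongodb", "database"),
    Resource("kafka", "queue"),
    Resource("loyalty-api", "service", depends_on=["mongodb", "kafka"]),
])
print(bp.launch_order())  # ['mongodb', 'kafka', 'loyalty-api']
```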

Another reason Capillary went with Facets is because it was running 10 environments globally — three for testing and the rest in production. This meant the whole migration to Facets took four to five months to complete, ensuring that all existing data had migrated.

The teams specifically spent about three months moving the testing environments to ensure that everything worked perfectly. The production environments, Kumar said, were much faster to move.

Seeing Results

By mid-2021, Kumar’s team had witnessed some clear results:

Operations Tickets Down by 95%.

“What we’ve been able to do with Facets is that we have created a self-service environment where, as a developer, if you have to create a new application, you go ahead and add it into that catalog,” Kumar said. “Somebody in your team, like your lead or architect, will go ahead and approve that. And then it gets launched on its own. There is no involvement required from the DevOps team.”
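That flow is essentially a small state machine: a developer files a catalog request, a lead or architect approves it, and the launch fires automatically with no DevOps ticket in the loop. A toy sketch, with the states and roles invented for illustration:

```python
# A toy model of the self-service flow described above; states and roles are
# assumptions for illustration, not Facets' implementation.
PENDING, APPROVED, LAUNCHED = "pending", "approved", "launched"

def request_application(catalog: dict, name: str, requested_by: str) -> None:
    catalog[name] = {"requested_by": requested_by, "state": PENDING}

def approve(catalog: dict, name: str, approver_role: str) -> None:
    if approver_role in ("lead", "architect") and catalog[name]["state"] == PENDING:
        catalog[name]["state"] = APPROVED
        launch(catalog, name)          # launch happens automatically on approval

def launch(catalog: dict, name: str) -> None:
    catalog[name]["state"] = LAUNCHED  # no DevOps ticket in the loop
    print(f"{name} launched for {catalog[name]['requested_by']}")

catalog: dict = {}
request_application(catalog, "loyalty-reports", "dev-team-a")
approve(catalog, "loyalty-reports", "lead")
```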

The DevOps teams were no longer involved in the day-to-day software launching. Now they were able to run about 15 environments across two product stacks with a six-member DevOps team.

In fact, Capillary renamed its DevOps team “SRE and developer experience,” pivoting to site reliability engineering and creating solutions to enable its developers.

Overall Uptime Increased from 99.8% to 99.99%.

“Our environment stability has basically taken a massive movement forward,” Kumar said. “Our environments are monitored continuously. Anything that you are seeing as a blip will basically get alerted. Your backups, your fallbacks, they are all pretty standardized.”

A 20% Increase in Developer Productivity.

“The biggest thing that has happened is that the queue time or the wait time on the DevOps team is gone,” Kumar said.

There’s also now uniformity across engineering operations, including logs and monitoring, which further increases developer productivity.

“And because our releases are completely automated, the monitoring of releases is completely automated,” Kumar said.

This has meant that over the last two years, the Capillary team has gone from releasing every two weeks to now releasing daily. Plus they’ve moved into an automated, unattended release mode with verifications. Now, said Kumar,  “In case something is broken, you will get an immediate alert on that to go ahead and attend.”

The Capillary engineering team continues to grow with new products, the CTO said, as well as become more efficient. In 2016, it took 64 developer weeks to launch an environment. Now, it takes just eight developer weeks, including all verifications and stabilization.

Using the blueprint the company created with Facets, he said, the users have to define how a new environment “will handle this kind of workload and hence, this is the kind of capacity that is required. And so once you set that up, the environment launch is all automated. So you save a lot of time on that.”

Earlier this year, Capillary acquired another tech company, which required the launch of a new developer environment. The engineering team was able to define the blueprint within Facets and launch a new environment in two and a half weeks.

Greater Visibility of Infrastructure Costs.

Finally, three to four years ago, Kumar could only monitor infrastructure costs through post-mortem analysis, which caused a delayed response and leaked costs. Now, he said, Facets has helped with auditing and given the company more visibility into how it’s using its infrastructure and where it’s over-provisioning.

The new capabilities, Kumar said, have sparked more proactive monitoring and CloudOps and FinOps, “where there are signals that I get on the cost spikes much sooner.”

Don’t Listen to a Vendor about AI, Do the DevOps Redo (Thu, 21 Sep 2023)
https://thenewstack.io/dont-listen-to-a-vendor-about-ai-do-the-devops-redo/


Don’t listen to a vendor about AI, says John Willis, a well-known technologist and author, in the latest episode of The New Stack Makers.

“They’re going to tell you to buy the one size fits all,” Willis said. “It’s like going back 30 to 40 years ago and saying, ‘Oh, don’t learn how to code Java, you’re not going to need it — here, buy this product.’”

Willis said that DevOps provides an example of how human capital, not products, solves problems. The C-level crowd needs to learn how to manage the AI beast and then decide what to buy and not buy. They need a DevOps redo.

One of the pioneers of the DevOps movement, Willis said now is a time for a “DevOps redo.” It’s time to experiment and collaborate as companies did at the beginning of the DevOps movement.

“If you look at the patterns of DevOps, like the ones who invested early, some of the phenomenal banks that came out unbelievably successful by using a DevOps methodology,” Willis said. “They invested very early in the human capital. They said let’s get everybody on the same page, let’s run internal DevOps days.”

Just don’t let it sort of happen on its own and start buying products, Willis said. The minute you start buying products is the minute you enter a minefield of startups that will be gone soon enough or will get bought up by large companies.

Instead, people will need to learn how to manage their data using techniques such as retrieval augmentation, which provides ways to fine-tune a large language model with, for example, a vector database.

It’s a cleansing process, Willis said. Organizations will need that cleansing to create robust data pipelines that keep LLMs from hallucinating or giving up code or data that a company would never want an LLM to provide to someone. We’re talking about the danger of giving away code that makes a bank billions in revenues or the contract for a superstar athlete.
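As a rough, vendor-neutral sketch of the idea: cleanse the corpus of anything the model must never surface, index only what is left, and retrieve from that vetted set to ground the prompt. The keyword blocklist and bag-of-words similarity below are stand-ins for real cleansing pipelines, embedding models and vector databases.

```python
# A toy retrieval-augmentation sketch: cleanse, index, retrieve, then prompt.
# The blocklist and bag-of-words "embeddings" stand in for real pipelines,
# embedding models and vector databases.
from collections import Counter
from math import sqrt

documents = [
    "Runbook: how to restart the payments service safely.",
    "SECRET: pricing model that drives the bank's trading revenue.",  # must never leak
    "How to rotate API keys for the loyalty platform.",
]

BLOCKLIST = ("secret", "confidential")

def cleansed(docs):
    return [d for d in docs if not any(w in d.lower() for w in BLOCKLIST)]

def embed(text: str) -> Counter:
    return Counter(w.strip(".,?:") for w in text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

corpus = [(d, embed(d)) for d in cleansed(documents)]

def retrieve(question: str, k: int = 1):
    q = embed(question)
    return [d for d, _ in sorted(corpus, key=lambda x: cosine(q, x[1]), reverse=True)[:k]]

question = "How do I restart the payments service?"
context = retrieve(question)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```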

Coding gets fun again when a company, whatever its size, gets this right: LLMs at scale with some form of retrieval augmentation.

Getting it right means adding some governance to the retrieval augmentation model. “You know, some structuring: ‘Can you do content moderation?’ Are you red-teaming the data? So these are the things I think will get really interesting that you’re not going to hear vendors tell you about necessarily; vendors are going to say, ‘We’ll just pop your product in our vector database.’”

CloudBees Scales Jenkins, Redefines DevSecOps (Thu, 21 Sep 2023)
https://thenewstack.io/cloudbees-scales-jenkins-redefines-devsecops/


CloudBees, which offers a software delivery platform for enterprises, announced significant performance and scalability enhancements to Jenkins with new updates to its CloudBees Continuous Integration (CI) software. The company also delivered a new DevSecOps solution based on Tekton.

CloudBees made the announcements at the recent DevOps World 2023 conference. CloudBees CI is an enterprise version of Jenkins. Jenkins is the most widely used CI/CD software globally, with an estimated 11.2 million developers using it as part of their software delivery process, the company said.

HA, Scalability, Performance

The new updates bring high availability and horizontal scalability to Jenkins, eliminating the bottlenecks that plague administrators and developers as they massively scale CI/CD workloads on Jenkins, said Sacha Labourey, co-founder and chief strategy officer at CloudBees.

“The ability to roll out, protect, and scale Jenkins on top of Kubernetes is critical to Jenkins remaining the go-to platform for managing CI/CD pipelines,” said Torsten Volk, an analyst at Enterprise Management Associates. “The over 1000 existing integrations are still a massive argument for many DevOps teams to adopt or stay with Jenkins, but now these integrations no longer come at the expense of adding tech debt.”

In addition, CloudBees announced additional performance-enhancing capabilities such as workspace caching to speed up builds and a new AI-powered pipeline explorer for easier and faster debugging.

“I think these changes are significant to existing Jenkins users, and there are still a lot of Jenkins users,” said Jim Mercer, an analyst at IDC. Specifically, he noted:

  • The caching will help to improve startup times and the speed of Jenkins pipelines.
  • The HA and scaling create additional controller replicas to balance the load of multiple users doing builds concurrently, appearing to the developer as a single controller. Previously, organizations attempted to mitigate the Jenkins controller issue by adding more Jenkins instances, creating other overhead for administration, etc.

“These changes’ overall theme is improving the developer experience by addressing issues where time is lost and enhancing their lives,” Mercer said. “These are not sexy changes per se, but they benefit the Jenkins user base.”

Jenkins has long had scalability issues, said Jason Bloomberg, an analyst at Intellyx. “So the CloudBees High Availability Mode is a welcome update. Now Jenkins will no longer have a single point of failure and will also offer automatic load balancing — capabilities expected in any cloud environment and long overdue for Jenkins.”

Moreover, high availability and horizontal scalability for Jenkins is a capability our enterprise customers have wanted for a long time, Labourey told The New Stack.

“The ability to run Jenkins at massive scale with active-active high availability becomes especially critical when you’re dealing with thousands of developers, running multiple thousands or hundreds of thousands of jobs across a small set of monolithic, overloaded controllers,” said Shawn Ahmed, chief product officer, CloudBees, in a statement. “At this scale, you are dealing with a community of developers that want a high-resiliency developer experience with no disruption. We have removed significant barriers in scaling Jenkins, enabling enterprises to run greater workloads than ever before. The new capabilities in CloudBees CI are a game-changing experience for DevOps teams.”

Other Features

In addition to high availability and horizontal scalability, additional performance-enhancing features introduced include:

  • Workspace Caching – Improves the performance of Jenkins by speeding up builds.
  • Pipeline Explorer – Easier and faster AI-powered debugging. Find and fix pipeline issues in complex environments with massive Jenkins workloads.
  • Build Storm Prevention – Baseline your repository without causing build storms (gridlock on startup).

“We had a full-fledged CI/CD offering but it was solely available as software,” Labourey said. “So our customers were deployed on-premises or in their own public cloud accounts. But they own and operate those environments. And obviously there is a desire for more SaaS consumption, but also the evolution towards more cloud native types of workloads. And so that’s what we are releasing and announcing now and releasing on November 1 to all customers. And we’ve been working on this for a long time.”

Tekton-Based DevSecOps

Meanwhile, the new CloudBees DevSecOps platform is built on Tekton, uses a GitHub Actions style domain-specific language (DSL), and adds feature flagging, security, compliance, pipeline orchestration, analytics and value stream management (VSM) into a fully managed single-tenant SaaS, multitenant SaaS or on-premises virtual private cloud instance.

“The new CloudBees platform turns Tekton into an easy-to-use pipeline automation solution and can directly benefit from Jenkins also running and scaling with Kubernetes,” Volk said. “This strategy makes sense as it builds on existing differentiation (1000 plugins) and aims to make Tekton, an incredibly scalable pipeline automation framework, accessible to the masses.”

CloudBees said its new extensible DevSecOps platform redefines DevSecOps by addressing the challenges associated with delivering better, more secure and compliant cloud native software at a faster pace than ever.

“DevSecOps has been harder to implement than people would like, so bringing the ‘Sec’ part of the equation into CloudBees’ expertise with DevOps can only be a step forward,” Bloomberg said.

What’s Old Is New?

But is it just old wine in new bottles?

“The adoption of Tekton by CloudBees was originally announced back in the DevOps days in 2019 when they announced the Jenkins X project would delegate the execution layer to Tekton. So, I don’t see this as new,” Mercer told The New Stack. “Outside of this, the collection of capabilities, such as value stream management, compliance, and feature flags, provide compelling capabilities as an integrated stack. I am not a fan of the addition of a new DSL. I also feel like they would do well to promote their compliance capabilities more since our survey data shows this is a top challenge for teams.”

Moreover, CloudBees cites a new discipline called platform engineering, which has emerged as an evolution of DevOps practices. The discipline brings together multiple roles such as site reliability engineers (SREs), DevOps engineers, security teams, product managers, and operations teams. Their shared mission is to integrate all the siloed technology and tools used within the organization into a golden path for developers. The CloudBees platform is purpose-built for this mission, the company said in a statement.

In addition, CloudBees said its focus going forward is on the following imperatives:

  1. Developer-Centric Experience

Enhances the developer experience by minimizing cognitive load and making DevOps processes nearly invisible, using concepts of blocks, automation and golden paths.

  2. Open and Extensible

Embraces the DevOps ecosystem of tools, starting with Jenkins. This flexibility to orchestrate any other tool enables organizations to protect the investments they have already made in tooling. Teams can continue to use their preferred technologies simply by plugging them into the platform.

  3. Self-Service Model

Enables platform engineering to customize the platform, thus providing autonomy for development teams. For example, platform engineers can design automation and actions that are then consumed in a self-service mode by developers. Developers focus on what they do best: create innovation. No waiting for needed automation, actions, or resources.

  4. Security and Compliance

Centralizes security and compliance. The CloudBees platform comes with out-of-the-box workflow templates containing built-in security. Sensitive information, like passwords and tokens, is abstracted out of the pipeline, significantly enhancing security and compliance throughout the software development life cycle. Automated DevSecOps is baked in, with best-of-breed checks across source code, binaries, cloud environments, data and identity, all based on Open Policy Agent (OPA). Continuous compliance just happens, with out-of-the-box regulatory frameworks for standards such as FedRAMP and SOC 2, and automated evidence collection for the auditors.

“We have been using the CloudBees platform in beta. One significant value add for us was that it significantly reduced the time it took to pass our ISO 27001 compliance audit,” said Michel Lopez, founder and CEO at E2F, in a statement. “The auditor had scheduled 12 hours of interviews, but it ended after 60 minutes. This was because all of the controls were provided by the CloudBees platform.”

DevOps, DevSecOps, and SecDevOps Offer Different Advantages (Fri, 15 Sep 2023)
https://thenewstack.io/devops-devsecops-and-secdevops-offer-different-advantages/


Within the business of software development, DevOps (Development and Operations) and DevSecOps (Development, Security, and Operations) practices have similarities and differences… and both offer advantages and disadvantages. DevOps offers efficiency and speed while DevSecOps integrates security initiatives into every stage of the software development lifecycle. However, gaining a better view of the DevOps vs. DevSecOps question requires a deeper inspection.

Development Teams Gain an Advantage Through Agility

The similarities and differences between DevOps and DevSecOps begin with Agile project management and the values found within Agile software development. Built around an emphasis on cross-functional teams, successful Agile management depends on the effectiveness of teamwork and the constant integration of customer requirements into the software development cycle. Rather than focus on processes, tools, and volumes of comprehensive documentation, Agile values a development environment that cultivates the adaptability, creativity, and collaboration of the individuals who make up the development and operations teams. Because of the reliance on Agile management, DevOps produces working software that satisfies customer needs.

While traditional approaches to development and testing can result in communication failures and siloed actions, DevOps asks project leads, programmers, testers, and modelers to work smarter as one cohesive unit. In addition, customers serve as important and valued members of DevOps teams through continuous feedback. Melding the development, testing, and operations teams together speeds the process of producing code and, in turn, delivers applications and services to customers at a much faster pace.

Incorporating continuous feedback into the development process creates a quality loop within DevOps. As a result, sustaining quality occurs at each point of the software development cycle. With the needs of the customer driving quality, programmers constantly check for errors in code while adapting to changing customer requests. As the cycle continues, testers measure application functionality against business risks.

Speed, quality, and efficiency grow from the daily integration of testing through Continuous Integration (CI) and Continuous Delivery (CD). Teams can quickly detect integration errors while building, configuring, and packaging software for customers. The practices come full circle as customers use the software and offer feedback.

What Is the Difference Between DevOps and DevSecOps?

DevOps — and the utilization of Agile management principles — establishes the foundation for DevSecOps. Both methodologies utilize the same guiding principles and rely on constant development iterations, continuous integration, continuous delivery, and timely feedback from customers. Even with those similarities in mind, though, the question of “what is the difference between DevOps and DevSecOps?” remains.

When comparing DevOps vs. DevSecOps, the objective shifts from a sole focus on speed and quality to speed, quality, and security. The key difference, though, rests within the placement of security within the development cycle and the need for sharing responsibility for security. Teams working within the DevOps framework incorporate the need for security at the end of the development process.

In contrast, teams working within the DevSecOps framework consider the need for security at each part — from the beginning to the end — of the software development cycle. Because development and operations teams share responsibility, security moves from an add-on to a prominent part of project plans and the development cycle. As a result, DevSecOps mitigates risk within the entire software development process.

Another difference between DevOps and DevSecOps also exists. The definition of quality for DevSecOps moves beyond the needs of the customer and adds security as a key ingredient. Because security integrates into DevSecOps processes from start to finish, the design process includes developers, testers, and security experts. With this shift in mindset and workplace culture, developers must recognize that their code — and any dependencies within that code — have implications for security. Integrating security tools from beginning to end of the coding process increases opportunities for developers and testers to discover flaws that could open applications to cybercrime.

The principles of CI and CD not only serve to automate processes but also lead to more frequent checks and controls for coding, testing, and version control. Integrating security into the development process provides a greater window for mitigating or eliminating business risks while shortening the delivery cycle.
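One concrete way those checks show up in a pipeline is a small gate that fails the build when a dependency matches a known advisory. The package names and advisory below are invented placeholders; a real pipeline would pull them from a lockfile and a vulnerability feed via a proper scanner.

```python
# Toy dependency gate for a CI pipeline. Package names, versions and the
# advisory are invented placeholders, not real vulnerability data.
import sys

dependencies = {"examplelib": "1.2.0", "otherlib": "3.4.1"}
advisories = {("examplelib", "1.2.0"): "example advisory: update to 1.2.1 or later"}

findings = [
    f"{name}=={version}: {advisories[(name, version)]}"
    for name, version in dependencies.items()
    if (name, version) in advisories
]

if findings:
    print("Security gate failed:")
    print("\n".join(findings))
    sys.exit(1)  # break the build early, not after release
print("Security gate passed")
```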

Another Alternative Exists: SecDevOps vs. DevSecOps

Development teams always search for methods to create better code and to decrease the time needed to bring products to market. While DevOps and DevSecOps offer distinct advantages in terms of speed and security, another alternative has entered the development arena. SecDevOps moves teams beyond integrating security into each stage of software development by prioritizing security and eliminating vulnerabilities across the lifecycle. Within the SecDevOps environment, developers work as security experts who write code.

When comparing SecDevOps vs. DevSecOps, SecDevOps places less emphasis on continuous assessment and communication. Instead of emphasizing business practices and reducing time-to-market, SecDevOps may sacrifice speed and efficiency for security. However, the SecDevOps vs. DevSecOps comparison takes another turn when considering security testing and risk mitigation.

With DevSecOps, security testing occurs at the completion of the coding cycle. Because SecDevOps prioritizes security, testing happens at the beginning of the software development cycle. Development and Operations teams apply security policies and standards during the planning phase as well as within each development phase.  Creating clean, bug-free code becomes the responsibility of everyone on the respective teams.

The transition to SecDevOps requires coders who have an intimate knowledge of security policies and standards. Although SecDevOps may reduce errors in code (and subsequently cut development costs), some costs may run higher because of the need to train or hire coders who can recognize and implement security protocols. SecDevOps also requires lengthier planning processes that can add costs to the development cycle. SecDevOps teams may also request specialized software to detect bugs and tools for improved data protection. As a result, the costs of prioritizing security may not align with all the benefits that businesses seek.

SecDevOps vs. DevSecOps vs. DevOps… and the Winner Is…

Ultimately, customers win in the DevOps vs. DevSecOps vs. SecDevOps comparison. Each offers significant advantages — and similar principles exist in each method. However, the definition of “win” varies and certainly could involve the phrase, “it depends.” While DevOps brings development and operations teams together for better communication and cooperation, DevSecOps maintains the emphasis on teams, customers, and time-to-market but slightly changes the model by inserting security at each stage of the development process. SecDevOps places much less emphasis on speed while protecting the customer from vulnerabilities that lead to cyberattacks and loss of reputation or business.

Today — and well into the future — customers seek a balance between achieving business goals and protecting against vulnerabilities. Including security from start to finish while maintaining the ability to quickly deliver applications to customers and to quickly adapt to customer needs gives DevSecOps a business advantage.

How to Create an Internal Developer Portal MVP (Tue, 12 Sep 2023)
https://thenewstack.io/how-to-create-an-internal-developer-portal-mvp/


What needs to go into an internal developer portal, and how should it be set up by platform engineers and used by developers? This post will take a practical approach to building a portal minimum viable product (MVP), assuming a GitOps and Kubernetes-native environment. MVPs are a great way to get started with an idea and see what it can materialize into. We’ll explore the software catalog, both a basic catalog and an extended one, and then look at setting up developer self-service actions, specifically how to deploy a microservice from testing to production. Then we’ll add some scorecards and automations.

Sounds difficult? It’s actually quite simple.

5 Steps to Creating an MVP of Your Developer Portal

  1. Forming an initial software catalog. In the example below we will show how to populate the initial software catalog using Port’s GitHub app and a git-based template.
  2. Enriching the data model beyond the initial blueprints, bringing in more valuable data to the portal.
  3. Creating your first self-service action. In the example below we will show how to scaffold a new microservice, but you can also think of adding Day 2 actions, or an action with a TTL (temporary environment, for instance).
  4. Enriching the data model with additional blueprints and Kubernetes data, and allowing developers to build additional self-service actions so that they can test and then promote the service to production.
  5. Adding scorecards and dashboards. These features offer developers insight into ongoing activities and quality initiatives.

Defining and Creating the Basic Data Model for the Software Catalog

The basic setup of the software catalog will be based on raw GitHub data, though you can make other choices. But how will the developer portal “classify” the data and create software catalog entities?

In Port, blueprints are where you define the metadata associated with the software catalog entities you choose to add to your catalog. Blueprints support the representation of any asset in Port, such as microservice, environment, package, cluster, databases, etc. Once blueprints are populated with data — in this case, coming from GitHub — the software catalog entities are discovered automatically and formed.

What are the right blueprints for this initial catalog and how do we define their relations? Let’s look at the diagram:

Let’s dive a little deeper:

  • The Workflow Run blueprint shows metadata associated with GitHub workflow pipelines.
  • The Pull Request blueprint shows metadata associated with, well, pull requests. This will allow you to create custom views for the PRs relevant to teams or individual developers.
  • The Issues blueprint shows metadata associated with GitHub issues.
  • The Workflow blueprint explores pipelines and workflows that currently exist in your GitHub (and uses them to create self-service actions in Port that can trigger more GitHub workflows).
  • The Microservice blueprint shows GitHub repositories and monorepos represented as microservices in the portal.

This basic catalog provides a developer with a strong foundation to understand the software development life cycle. It helps developers become familiar with the tech stack, understand who owns the different services, access documentation for each service directly from Port, keep track of deployments and changes made to a given service, and so on.
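To make the idea concrete without reproducing Port’s actual schema, here is a hypothetical sketch of a blueprint as data: an identifier, some properties and relations to other blueprints, plus one entity discovered from GitHub that conforms to it. The shapes and field names are assumptions for illustration only.

```python
# Hypothetical shapes only -- not Port's actual blueprint schema.
microservice_blueprint = {
    "identifier": "microservice",
    "properties": {"language": "string", "owner": "string", "readme_url": "string"},
    "relations": {"workflow_runs": "workflow_run", "pull_requests": "pull_request"},
}

# An entity auto-discovered from GitHub data and mapped onto the blueprint.
entity = {
    "blueprint": "microservice",
    "identifier": "payments-service",
    "properties": {
        "language": "Go",
        "owner": "team-payments",
        "readme_url": "https://github.com/example/payments-service#readme",
    },
    "relations": {"workflow_runs": ["run-1042"], "pull_requests": ["pr-77"]},
}

def validate(entity: dict, blueprint: dict) -> bool:
    """Check that an entity only uses properties and relations its blueprint defines."""
    return (set(entity["properties"]) <= set(blueprint["properties"])
            and set(entity["relations"]) <= set(blueprint["relations"]))

print(validate(entity, microservice_blueprint))  # True
```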

Data Model Extension: Domain and System Integration

Given that these fundamental blueprints provide good visibility into the life cycle of each service, the model we just discussed can suffice. You can also take it one step further and extend the data model by introducing domain and system blueprints. Domains often correspond to high-level engineering functions, maybe a pivotal service or feature within a product.

System blueprints are a depiction of a collection of microservices that collectively enhance a segment of functionality provided by the domain. With the addition of these two blueprints, we can now see how a microservice fits into a greater app or piece of functionality, which provides the developer with additional insight into how their microservice interacts with the broader tech stack. This information can be invaluable to speed up the onboarding process for new developers, as well as make diagnosing and debugging incidents easier, since the dependency between microservices and products within the company is clearer.

This mirrors the Backstage C4 model, with the addition of the running service, which provides additional stateful information.

When we finish ingestion, we’ll have a fully populated MVP software catalog. Drilling down into an entity, we can understand dependencies, health, on-call data and more.

Internal developer portals aren’t only about a software catalog, containing microservice and underlying resource and DevOps asset data. They are mostly about enabling developer self-service actions. Let’s go ahead and do that.

First Developer Self-Service Action Setup

Internal developer portals are made to relieve developer cognitive load and allow developers to access the self-service section in the portal and do their work with the right guardrails in place. This is done by defining the right flow in the portal’s UI, and then by loosely coupling it with the underlying platform that will execute the self-service action, while still providing feedback to developers about their executed actions, such as logs, relevant links and the effects of the action on the software catalog. We can also show whether a self-service action is waiting for manual approval.

For the MVP, let’s define a self-service action for scaffolding a new microservice. This is what developers will see:

When setting up a self-service action, the platform engineer doesn’t just define the backend process, but also sets up the UI in the developer self-service form. By being able to control what the developer sees and can do, as well as permissions, we can allow developers to perform actions on their own within a defined flow, setting guardrails and relieving cognitive load.
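As a hypothetical sketch (not Port’s actual action format), a scaffolding action boils down to a form the developer fills in plus a backend trigger the platform team wires up, with guardrails encoded as validation rules:

```python
# Hypothetical self-service action definition -- illustrative only.
import re

scaffold_action = {
    "name": "scaffold-new-microservice",
    "inputs": {
        "service_name": {"type": "string", "pattern": "^[a-z][a-z0-9-]+$"},
        "language": {"type": "string", "allowed": ["go", "python", "typescript"]},
        "team": {"type": "string"},
    },
    "backend": {"kind": "github-workflow", "workflow": "scaffold.yml"},
    "requires_approval": False,
}

def submit(action: dict, inputs: dict) -> str:
    """Validate a developer's form submission against the platform team's guardrails."""
    for key, spec in action["inputs"].items():
        value = inputs.get(key)
        if value is None:
            return f"rejected: missing {key}"
        if "pattern" in spec and not re.match(spec["pattern"], value):
            return f"rejected: {key} does not match the naming convention"
        if "allowed" in spec and value not in spec["allowed"]:
            return f"rejected: {key} must be one of {spec['allowed']}"
    return f"triggering {action['backend']['workflow']} for {inputs['service_name']}"

print(submit(scaffold_action, {"service_name": "loyalty-api", "language": "go", "team": "payments"}))
```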

Expanding the Data Model with Kubernetes Abstractions

We’ve begun by saying that we’re working in a Kubernetes native environment. Kubernetes knowledge is not common, and our goal is to abstract Kubernetes for developers, providing them with the information they need.

Let’s add the different Kubernetes resources (deployments, namespaces, pods, etc.) into our software catalog. This then allows us to configure abstractions and thus reduce cognitive load for developers.

When populated, the cluster blueprint will show its correlated entities. This will allow developers to view the different components that make up a cluster in an abstracted way that’s defined by the platform engineer.

To bring everything together, let’s create an “environment” blueprint. This will allow us to differentiate between multiple environments that are in the same organization and create an in-context view (including CI/CD pipelines, microservices, etc.) of all running services in an individual environment. In this instance we will create a test environment and also a production environment.

Now let’s build a relation between the microservice blueprint we made in our initial data model to the workload blueprint. This will allow us to understand which microservices are running in each cluster as workloads. A workload is a running service of a microservice. This will allow us to have an in-context view of a microservice, including which environments it is running in, meaning we now know exactly what is going on and where (Is a service healthy? Which cluster is a service deployed on?).

Generally, creating relations between blueprints can be compared to linking a number of tables with a foreign key but using identical names for both tables. You can customize the information you see on each entity or blueprint, thus modeling them to suit your needs exactly. You can build relations that are one-to-one or one-to-many. In our example, the link between workload and microservice is a one-to-one relation, as each workload is only one deployment of one microservice.

Let’s now create a relation between cluster and environment so that we know where we have running clusters. We could also expand this idea to a cloud region or environment, depending on the context.

Let’s also create a relation between microservice and system, and workflow run and workload. This allows us to see the source of every workload, as well as see what microservices make up the systems in our architecture.

And that’s it!

Scorecards and Dashboards: Promoting Engineering Quality

The ability to define scorecards and dashboards has proven to be of great significance within enterprises, as they help push initiatives and drive engineering quality. This is thanks to teams now being able to visualize service maturity and engineering quality of different services in a domain and thus understand how close or far they are from reaching a production-ready service.
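A scorecard is essentially a set of pass/fail rules rolled up per service. A minimal sketch, with the rules and tier names invented for illustration:

```python
# Invented rules and tiers -- a minimal scorecard sketch.
services = [
    {"name": "payments", "has_readme": True, "has_oncall": True, "test_coverage": 0.82},
    {"name": "loyalty-reports", "has_readme": False, "has_oncall": True, "test_coverage": 0.41},
]

rules = [
    ("has a README", lambda s: s["has_readme"]),
    ("has an on-call owner", lambda s: s["has_oncall"]),
    ("test coverage >= 70%", lambda s: s["test_coverage"] >= 0.7),
]

def scorecard(service: dict) -> str:
    passed = sum(1 for _, check in rules if check(service))
    tier = {3: "gold", 2: "silver"}.get(passed, "needs work")
    return f"{service['name']}: {passed}/{len(rules)} checks passed, {tier}"

for s in services:
    print(scorecard(s))
```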

In Conclusion

The highly discussed distinction between portal and platform fades away when put into practice. While one focuses on infrastructure and backend definitions, the other empowers developers to take control of their needs through a software catalog and self-service actions, while also offering great insight into service and infrastructure well-being through scorecards and visualizations.

Want to try Port or see it in action? Go here.

Is Policy as Code the Cure for Multicloud Config Chaos? (Mon, 11 Sep 2023)
https://thenewstack.io/is-policy-as-code-the-cure-for-multicloud-config-chaos/


Hosting software across public cloud and private cloud is, at present, inherently less manageable and less secure than simpler hosting paradigms — full stop. The 2022 edition of IBM’s Cost of a Data Breach report states that 15% of all breaches are still attributed to cloud misconfiguration.

The same year, a white paper by Osterman Research and Ermetic placed detecting general cloud misconfigurations, such as unencrypted resources and missing multifactor authentication, at the top of the list of concerns in organizations with a high cloud maturity level.

The move to multicloud has created silos between environments and every layer of IT. Instead of working together to create better infrastructure for better software and better service, IT Ops teams spend valuable time mitigating the liabilities of cross-deployed infrastructure and misconfigured services — all with different tools and no visibility into the policies they’re expected to enforce.

But there is a better way to manage the cloud and ensure that policy enforcement is in place: Policy as Code. Policy as Code (sometimes called PaC) is a development approach that expresses infrastructure and application behavior policies in code, rather than being hardcoded.

That means those policies can be used and reused to automatically enforce consistent configurations across the estate — like security, compliance, baselines and more. Policy as Code can enforce configurations throughout the entire software development life cycle, rather than relying on manual checks and processes.
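A minimal illustration of the shape of the approach, in plain Python rather than a dedicated policy language such as OPA’s Rego: policies live as code, so the same checks run automatically against every resource in every environment. The resource model and policy names are invented for illustration.

```python
# Policies expressed as code: each policy is a named check over a resource's
# configuration. Resource model and policy names are invented for illustration.
policies = {
    "storage-must-be-encrypted": lambda r: r["type"] != "storage" or r.get("encrypted", False),
    "admin-accounts-require-mfa": lambda r: r["type"] != "account" or not r.get("admin") or r.get("mfa", False),
    "no-public-buckets": lambda r: r["type"] != "storage" or not r.get("public", False),
}

resources = [
    {"name": "customer-data", "type": "storage", "encrypted": False, "public": True},
    {"name": "ops-admin", "type": "account", "admin": True, "mfa": True},
]

def evaluate(resources, policies):
    """Run every policy against every resource; violations can fail a pipeline or trigger remediation."""
    return [(r["name"], name) for r in resources
            for name, check in policies.items() if not check(r)]

for resource, policy in evaluate(resources, policies):
    print(f"VIOLATION: {resource} breaks '{policy}'")
```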

Despite its obvious benefits for DevOps, PaC still isn’t a common practice in the industry — and it’s rarely used as a tool for tackling tangled messes like cloud misconfiguration. Let’s break down how PaC can help bridge today’s cloud config gaps.

The Power of Policy as Code in Multicloud Configuration

  • With PaC, one size actually can fit all. Policy as Code is used to unite public cloud with private cloud for simpler management and faster scaling of software, resources and services offered by each.
  • Policy as Code can standardize a governable process across multiple layers of IT, from central IT and infrastructure all the way up to app developers. It does that by making your policies visible, auditable and shareable.
  • Policy as Code supports business expectations by aligning configurations, from baseline through deployment, with strategic business objectives.
  • Policy as Code lets developers do what they do best: code. Adding configuration to developers’ plates is asking them to work outside their sweet spot. That can ruin the developer experience, causing burnout and turnover, which makes all your other problems worse.
  • Policy as Code ensures greater security by shifting compliance responsibility away from burdened individuals and onto repeatable code that’s automatically enforced.

Simple, Right? Then Why Aren’t Organizations Implementing Policy as Code?

When organizations started migrating services to public clouds, most failed to consider the long-term implications of such a move. The past few years have revealed the lasting effect of cloud migration on the standardized processes they’d spent so long building on the ground:

  • The pandemic drove an insatiable desire for availability of services and resources, which overrode caution.
  • The abstraction of cost attracted bottom-liners and business leaders.
  • Service-level agreements for cloud availability were supposed to make in-house security guarantees obsolete.
  • The cloud gave organizations across industries the chance to “remain competitive” as early adopters saw a rush of benefits.

Developers, too, helped drive some of the fervor for cloud. Developers at the app layer needed the flexibility of the cloud (the freedom to choose tools and workflows at will). Only later did organizations realize that this detachment from corporate policy was leading to misconfigurations across hybrid deployments, complicating an already messy paradigm.

Cloud repatriation is only exacerbating those problems, even when done by degrees. Today, some organizations are scaling back their cloud deployments or diversifying them by returning mainframe hosting to the mix. But that mix still lacks the standardization needed to effectively manage it all. Far from a solution, cloud repatriation is, in fact, an aggravating factor for the issues associated with cross-deployed infrastructure.

As long as organizations have one foot in the data center and one foot in the cloud — and as long as they obscure their cloud configuration approach with disparate toolsets — cloud misconfiguration will keep holding back the potential of their hybrid cloud Ops. A lack of standardization will keep leading to business problems like security gaps, unauthorized access, rampant drift, resource inefficiencies, noncompliance and data loss.

How to Start Building a Policy as Code Practice

The best way to create PaC for your infrastructure is through reverse engineering. Start by defining your ideal state, identify the potential risks and gaps you’ll uncover on your way there, and develop a framework to mitigate those risks.

Here are a few recommendations to start building a PaC approach that can enforce desired state for better infrastructure and better DevOps, wherever you’re deployed:

Don’t spend a bunch of resources on new tools. PaC isn’t about reinventing the wheel — it’s about leveraging the tools and processes you’ve got (like Infrastructure as Code) to enforce a repeatable state across all of your infrastructure. Strong automation and configuration management are at the core of PaC, so use the tools you already have to establish a PaC approach.

Define the desired state of your infrastructure across data center, multicloud and hybrid. Identify potential areas of risk that can result from configuration drift, like compliance errors, and chart a course back to your desired infrastructure configurations through state enforcement. With desired state enforcement through PaC, you can preempt and prevent misconfigurations even in cross-deployed infrastructure.
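As a hedged illustration of desired state enforcement, the sketch below compares a declared desired state against the observed state of a resource and reports drift. The data structures are placeholders for whatever your configuration management tooling actually exposes.

```python
# Illustrative drift check: desired and observed states are hypothetical
# dictionaries standing in for data from your configuration management tool.

DESIRED = {
    "web-server": {"tls": "1.2", "logging": "enabled", "port": 443},
}


def find_drift(observed):
    """Compare observed resource settings against the desired state."""
    drift = {}
    for name, desired_settings in DESIRED.items():
        actual = observed.get(name, {})
        diffs = {k: (v, actual.get(k))
                 for k, v in desired_settings.items()
                 if actual.get(k) != v}
        if diffs:
            drift[name] = diffs
    return drift


if __name__ == "__main__":
    observed_state = {"web-server": {"tls": "1.0", "logging": "enabled", "port": 443}}
    for resource, diffs in find_drift(observed_state).items():
        for key, (want, got) in diffs.items():
            print(f"{resource}: {key} should be {want!r} but is {got!r}")
```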

Align your infrastructure with the business goals it supports. When creating PaC, guardrails are crucial to targeting your efforts where they’re needed most. Start with your infrastructure management journeys: Consider who in your organization needs infrastructure resources, what their main use cases are for that infrastructure, and where and how they consume infrastructure. Map those needs to Day 0 (provisioning), Day 1 (configuration management) and Day 2 (state enforcement and compliance) for a strong PaC framework that supports your whole DevOps cycle.

Test your PaC. Challenge both your infrastructure configuration management and state enforcement to ensure it’s doing what you want it to do from the perspectives of both your business goals and risk assessment.

Your Cloud Infra Management Won’t Get Simpler on Its Own

Developers can’t be expected to handle policy enforcement in their own tools. When they can rely on configuration files written as code, they can work quickly and confidently in line with company standards, using tools they already know, rather than toying with functional code to make it compliant at their layer.

With PaC, your team can support the needs of developers in the cloud and changing expectations of compliance to help you realize the reasons you moved to the cloud in the first place.

The post Is Policy as Code the Cure for Multicloud Config Chaos? appeared first on The New Stack.

]]>
How Platform Engineering Can Help Keep Cloud Costs in Check https://thenewstack.io/how-platform-engineering-can-help-keep-cloud-costs-in-check/ Mon, 11 Sep 2023 17:18:27 +0000 https://thenewstack.io/?p=22717751

This is the fifth part in a series. Read Part 1, Part 2, Part 3 and Part 4. Picture being

The post How Platform Engineering Can Help Keep Cloud Costs in Check appeared first on The New Stack.

]]>

This is the fifth part in a series. Read Part 1, Part 2, Part 3 and Part 4.

Picture being in a never-ending cycle of cloud costs that keep piling up, no matter what you do. You’re in good company. Most businesses are stuck in the same loop, using the same old audits and tools as quick fixes. But let’s face it, that’s just putting a Band-Aid on a problem that needs surgery.

Now, we all know audits and quick reviews are essential; they’re like the routine checkups we need to stay healthy. But when it comes to cloud costs, those checkups aren’t enough. They might spot the immediate problems, but they rarely dig deep to find the root cause. It’s time to think longer term.

Instead of just putting out fires, why not prevent them in the first place? A more sustainable approach to managing cloud costs is to focus on building an efficient system from the ground up. This isn’t about quick fixes; it’s about laying a strong foundation that prevents issues down the road.

Good news: The playbook for platform engineering is still being written, which gives the teams building these platforms an opportunity to help you do exactly that. Think of it as designing your new toolkit for smarter, more efficient cloud management. With platform engineering, your team gets access to high-level tools that go beyond patching holes. They help you map out a well-planned route through the confusing world of cloud costs.

Attempted Solutions and Reactive Approaches

The moment the cloud cost alarm bells start ringing, specialized centralized teams or “war rooms” are often created to manage the process. These teams look closely at cost reports, figure out which departments are spending too much, and then tell them to cut back. Here’s how it typically goes down:

  • By audit: Relying on audits to identify areas of excessive spending. Continuous audit cycles are used to understand and potentially optimize cloud costs. It’s often seen as a never-ending process.
  • Manual oversight: The centralized team is responsible for scrutinizing cost dashboards, identifying responsible teams for various infrastructure parts and informing them to take corrective action.
  • Project tracker: A project tracker is created to monitor the cost-reducing activities and to keep all stakeholders updated.
  • Tools and anomaly detection: Specialized tools that offer better analysis and anomaly detection capabilities are deployed, with some even allowing automated actions.
  • Ops team responsibility: Typically, the operations team handles the burden of cost management, but they are often lean and already over-burdened with other critical tasks.

The problem? All of these steps are more reactive than proactive, and prone to toil. They focus on trimming existing costs — often described as cutting the fat — rather than building a cost-efficient system from the start. The result is a strategy that’s more about short-term gains than long-term sustainability.

Further, in the world of cloud native apps, Ops teams alone can’t take optimizations beyond a certain point. Service and architectural enhancements made by application developers deliver the biggest results in the long run. But today’s system isn’t inclusive enough to involve them.

So, how do we break this cycle? By shifting the focus from immediate cost-cutting to long-term financial health. That means adopting strategies that don’t just react to problems as they arise but prevent them from happening in the first place.

Platform Engineering: The Linchpin

This is where platform engineering comes in. The platform engineering team is responsible for laying down the path not only to make developers own their cost, but also to inherently control costs. Here’s how platform engineering contributes to cloud cost sustenance:

Sharing ownership and accountability: Platform engineers need to let go of sole control over cost ownership and instead create a collaborative experience in which developers share that ownership.

Building cost-efficient golden paths: The platform engineering team’s first order of business is to lay down golden paths engineered to be cost-efficient from the start. This becomes the playground for developers to experiment and build, but cost control isn’t just nice to have; it’s a must-have.

Providing developer-friendly cost breakdowns: The platform gives developers the tools to see costs broken down in a language they understand. The platform should present a zoomed-in view that allows each development team to see only the costs related to the resources they’re directly managing. This focus helps teams zero in on costs specific to their own projects or services.

Providing smart cost correlation: Understanding the “why” behind the costs is as crucial as knowing the “what.” The platform lets developers tie costs to specific runtime metrics like “utilization” or business metrics like “number of transactions,” paving the way for smarter decision-making.
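As a simple illustration of that kind of correlation, the following sketch divides daily cloud spend by daily transaction volume to produce a cost-per-transaction figure a team can track over time. The numbers and field names are made up for the example.

```python
# Hypothetical daily cost and transaction counts for one service.
daily = [
    {"day": "2023-09-01", "cost_usd": 410.0, "transactions": 182_000},
    {"day": "2023-09-02", "cost_usd": 455.0, "transactions": 187_500},
    {"day": "2023-09-03", "cost_usd": 540.0, "transactions": 189_000},
]

# A rising unit cost while transactions stay flat is the signal worth acting on.
for point in daily:
    unit_cost = point["cost_usd"] / point["transactions"]
    print(f'{point["day"]}: ${unit_cost:.4f} per transaction')
```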

Assigning budgets: Setting a budget shouldn’t feel like walking a tightrope. The platform allows teams to set up budgetary guardrails for different resources and activities. If you’re about to go over budget, consider yourself notified or even restricted — keeping costs in check.

Ability to prevent leaks: Unused or underutilized resources are the silent budget killers. The platform should be designed to prevent these so-called “leakages” earlier in the software development life cycle and prevent them from draining your budget in the future.
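As one concrete (and hedged) example of a common leak, the sketch below uses the AWS SDK (boto3) to list unattached block storage volumes, which keep billing even when nothing uses them. The region is an example value, other clouds expose equivalent APIs, and pagination is omitted for brevity; a platform could run a check like this early in the delivery pipeline rather than after the bill arrives.

```python
# Minimal sketch: list unattached (and therefore still-billed) EBS volumes.
# Assumes AWS credentials are configured; the region is an example value.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]  # not attached to anything
)

for vol in resp["Volumes"]:
    print(f'{vol["VolumeId"]}: {vol["Size"]} GiB, created {vol["CreateTime"]}')
```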

In essence, platform engineering aims to create a symbiotic relationship between developers and their cloud environment. It’s not just about empowering developers; it’s about making them conscientious stewards of their resources. This fosters a culture where cost efficiency and developer freedom coexist, setting clear guidelines for how to manage both effectively.

Developer Responsibilities

In a world powered by platform engineering, treating cost as an afterthought just won’t cut it. Developers need to elevate cost to the VIP status of “first-class citizen” in their sprints, right next to other big-league players like performance and availability.

Be your own landlord: Owning cloud infrastructure, including services and resources, isn’t just a responsibility, it’s a necessity. With ownership comes the imperative of constant vigilance: Developers need to be on top of monitoring both costs and resource use, around the clock.

Budget mastery: Staying within the lines of a coloring book is basic; doing the same with budgets is an art. Developers must stick to the budget frameworks laid out by the platform engineering team, while making sure cost-optimization tasks don’t get pushed to the back burner during sprints.

Business-metrics harmony: Translating cloud costs into business speak is a win-win. Developers should align their resource utilization metrics with tangible business outcomes. Want to know the cost of a single business transaction or operation? That’s the kind of clarity this alignment can offer.

Resource optimization: Don’t let resource “leakages” turn into resource “floods.” Developers should break down the attributed cost to pinpoint and plug these leakages, and to fine-tune the overall resource landscape for optimal efficiency.

Innovation: Many cost-optimization projects are tweaks to your service performance and architecture that can lead to tremendous results.

Keep the dialogue going: A fruitful partnership with the platform engineering team isn’t a one-off event; it’s an ongoing conversation. Developers should keep the lines of communication open to continuously refine tools, metrics and best practices for sustainable cloud management.

By taking ownership of these responsibilities, developers aren’t just lightening the load on the Ops team; they’re stepping up as co-pilots in navigating the cloud cost landscape. It’s a team effort aimed at achieving a leaner, more efficient cloud without compromising on performance or possibilities.

In a Nutshell

| Criteria | Cloud Optimization by Audit | Cloud Sustenance |
| --- | --- | --- |
| Objective | Reduce immediate costs through audits and one-time actions. | Maintain a sustainable, cost-effective architecture by design. |
| Methodology | Audit-based, reactionary measures taken after costs have escalated. | Planning and a set of practices and mechanisms for long-term sustainability. |
| Primary responsibility | A centralized team or Ops team usually handles this through audits and dashboards. | Both platform engineering teams and development teams are responsible for cost management. |
| Impact | Short-term cost reduction. | Long-term efficiency and cost-effectiveness. |
| Continuity | Generally a recurring but isolated exercise. | Integrated into development sprints and long-term planning. |

While audit-based cloud optimization might offer a rapid-fire way to trim costs, let’s be honest — it’s a reactive, temporary solution mostly overseen by operations teams. And because it often sprawls across the entire cloud, pinpointing who’s responsible for what in the cost-saving equation can get muddled.

On the flip side, cloud sustenance is a proactive, long-game approach that zeroes in on specific projects, distributing cost responsibilities across developers, platform engineers and operations.

While the journey toward sustainable cloud management needs everyone on board, the upfront time and resources invested pay off big time. We’re talking about a cloud ecosystem that’s built for long-term efficiency and resilience. So why not invest a little more now for peace of mind later?

The post How Platform Engineering Can Help Keep Cloud Costs in Check appeared first on The New Stack.

]]>
Getting Started with Infrastructure Monitoring https://thenewstack.io/getting-started-with-infrastructure-monitoring/ Mon, 11 Sep 2023 14:01:40 +0000 https://thenewstack.io/?p=22717742

While building new features and launching new products is fun, none of it matters if your software isn’t reliable. One

The post Getting Started with Infrastructure Monitoring appeared first on The New Stack.

]]>

While building new features and launching new products is fun, none of it matters if your software isn’t reliable. One key part of making sure your apps run smoothly is having robust infrastructure monitoring in place. In this article you will learn about the following:

  • The different components of infrastructure monitoring.
  • Popular tools used for infrastructure monitoring.
  • How to set up monitoring for an application.

If you prefer video, you can also check out this presentation, which covers some of the themes discussed in this article.

Components of Infrastructure Monitoring

Infrastructure monitoring consists of a number of different architecture components that are needed to serve a modern application. To ensure software is reliable, all of these components need to be properly monitored.

  • Network monitoring — Network monitoring focuses on hardware like routers and switches and involves tracking things like bandwidth usage, uptime and device status. It is used to identify bottlenecks, downtime and potentially inefficient network routing.
  • Server monitoring — Server monitoring is focused on monitoring the performance and health of physical and virtual server instances. Metrics like CPU, RAM and disk utilization are common. Server monitoring is important for capacity planning.
  • Application performance monitoring (APM) — APM is focused on software and is used to track how an application is performing at every layer from the UI to how data is stored. Common metrics are things like error rates and response times.
  • Cloud infrastructure monitoring — Cloud monitoring, as the name implies, is about monitoring cloud infrastructure like databases, different types of storage and VMs. The goal is to track availability and performance, as well as resource utilization to prevent over or under provisioning of cloud hardware.

Each of these types of monitoring act as a different lens for teams to view and manage their infrastructure. By taking advantage of all of this data, companies can ensure their infrastructure is performing optimally while reducing costs.

Tools for Infrastructure Monitoring

Choosing the right tools for the job is critical when it comes to creating an infrastructure monitoring system. There are a number of open source and commercial options available. You also have the option of choosing a full-service solution or creating your own custom solution by combining specialized tools. Regardless, there are three main questions to consider: How will you collect your data, how will you store it and what will you do with it? Let’s look at some of the tools available for each step.

Data Collection Tools

One of the biggest challenges with infrastructure monitoring is collecting data that may be coming from many different sources, often with no standardized protocol or API. The key goal here should be to choose a tool that saves you from having to reinvent the wheel, doesn’t lock you in and is extensible so you can scale or modify data collection as your app changes.

Telegraf

Telegraf is an open source server agent that is ideal for infrastructure monitoring data collection. Telegraf solves most of the problems mentioned above. It has over 300 different plugins for inputs and outputs, meaning you can easily collect data from new sources and output that data to whichever storage solution works best for your use case.

The result is that Telegraf saves you a ton of engineering resources by not having to write custom code for collecting data and prevents vendor lock-in because you can change storage outputs easily. Telegraf also has plugins for data processing and transformation, so in some use cases it can simplify your architecture by replacing stream-processing tools.

OpenTelemetry

OpenTelemetry is an open source set of SDKs and tools that make it easy to collect metrics, logs and traces from applications. The primary advantage of OpenTelemetry is that it is vendor agnostic, so you don’t have to worry about getting locked into an expensive APM tool with high switching costs. OpenTelemetry also saves your developers time by providing tools to make instrumenting applications for data collection easy.

Data Storage Tools

After you start collecting data from your infrastructure, you’ll need a place to store that data. While a general-purpose database can be used for this data, in many cases you will want to look for a more specialized database designed for working with the types of time series data collected for infrastructure monitoring. Here are a few available options:

InfluxDB

InfluxDB is an open source time series database designed for storing and analyzing high volumes of time series data. It offers efficient storage and retrieval capabilities, scalability and support for real-time analytics. With InfluxDB, you can easily capture and store metrics from various sources, making it a good fit for monitoring and analyzing the performance and health of your infrastructure.

Prometheus

Prometheus is an open source monitoring and alerting toolkit built for collecting and storing metrics data. It is specifically designed to monitor dynamic and cloud native environments. Prometheus provides a flexible data model and powerful query language, making it well-suited for storing infrastructure monitoring data. With its built-in alerting and visualization capabilities, Prometheus enables you to gain insight into the performance and availability of your infrastructure.

Graphite

Graphite is a time series database and visualization tool that focuses on storing and rendering graphs of monitored data. It is widely used for monitoring and graphing various metrics, making it a suitable option for storing infrastructure monitoring data. Graphite excels at visualizing time series data, allowing you to create interactive and customizable dashboards to monitor the performance and trends of your infrastructure. Its scalable architecture and extensive ecosystem of plugins make it a popular choice for monitoring and analyzing infrastructure metrics.

Data Analysis Tools

Once you’ve got your data stored, it’s time for the fun part, actually doing something with it to create value. Here are a few tools that you can use for analyzing your data.

Grafana

Grafana is a powerful open source data visualization and analytics tool that allows users to create, explore and share interactive dashboards. It is commonly used for analyzing infrastructure monitoring data by connecting to various data sources such as databases, APIs and monitoring systems. With Grafana, users can create visualizations, set up alerts and gain insights into their infrastructure metrics, logs and traces.

Apache Superset

Apache Superset is a modern enterprise-ready business intelligence web application that enables users to explore, visualize and analyze data. It provides a user-friendly interface for creating interactive dashboards, charts and reports. When it comes to analyzing infrastructure monitoring data, Apache Superset can be used to connect to monitoring systems, databases or other data sources to explore and visualize key metrics, generate reports and gain insights into the performance and health of the infrastructure.

Jaeger

Jaeger is an open source, end-to-end distributed tracing system that helps users monitor and troubleshoot complex microservices architectures. It can be used for analyzing infrastructure monitoring data by providing detailed insights into the interactions and dependencies between different components of the infrastructure. Jaeger captures and visualizes traces, which represent the path of requests as they travel through the system, allowing users to identify bottlenecks, latency issues and performance optimizations in the infrastructure.

Infrastructure Monitoring Tutorial

Now let’s look at an example of how to implement a monitoring system for an application. This tutorial will focus on a combination of open source tools known as the TIG stack: Telegraf, InfluxDB and Grafana. The TIG stack allows developers to easily build an infrastructure monitoring solution that is scalable and extensible in the long term.

Architecture Overview

The example application for this tutorial is a chat app powered by an AI model that returns responses based on user input. The app has a hybrid architecture with the backend hosted on AWS, and the AI model is run on dedicated GPUs outside the cloud. The primary challenge is ensuring reliability of the service while also scaling infrastructure due to rapid user growth. Doing this requires collecting large amounts of data to track resource utilization in real time for monitoring and also for future capacity planning based on user growth.

Infrastructure Monitoring Setup

Now let’s look at how to set up and configure monitoring for this application. The first step will be configuring Telegraf to collect the data we want from each part of our infrastructure. We’ll take advantage of the following Telegraf plugins:

  • SNMP input — The SNMP plugin is used to collect the metrics needed for network monitoring.
  • CPU, Disk, DiskIO, Mem, Swap, System, Processes and Nvidia SMI inputs — These plugins are used to collect server monitoring metrics.
  • OpenTelemetry input — OpenTelemetry is used to collect application performance metrics like logs, metrics and traces.
  • AWS Cloudwatch input — The AWS CloudWatch plugin makes it easy to collect all the cloud infrastructure metrics we need from AWS.
  • InfluxDB v2 output — The InfluxDB output plugin will send all of these collected metrics to the specified InfluxDB instance.

And here’s an example of a Telegraf configuration TOML file for this setup:

```TOML
[global_tags]
  # dc = "us-east-1" # will tag all metrics with dc=us-east-1
  # rack = "1a"
  # user = "$USER"

[agent]
  interval = "10s"
  round_interval = true

  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"

  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""

  # debug = false
  # quiet = false
  # logtarget = "file"
  # logfile = ""
  # logfile_rotation_interval = "0d"
  # logfile_rotation_max_size = "0MB"
  # logfile_rotation_max_archives = 5

  hostname = ""
  omit_hostname = false

[[inputs.snmp]]
  agents = ["udp://127.0.0.1:161"]
  timeout = "15s"
  version = 2
  community = "SNMP"
  retries = 1


  [[inputs.snmp.field]]
    oid = "SNMPv2-MIB::sysUpTime.0"
    name = "uptime"
    conversion = "float(2)"

  [[inputs.snmp.field]]
    oid = "SNMPv2-MIB::sysName.0"
    name = "source"
    is_tag = true

[[inputs.cpu]]
  percpu = true
  totalcpu = true
  collect_cpu_time = false
  report_active = false

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]
[[inputs.mem]]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.nvidia_smi]]

[[inputs.opentelemetry]]
  service_address = "0.0.0.0:4317"
  timeout = "5s"
  metrics_schema = "prometheus-v2"
  tls_cert = "/etc/telegraf/cert.pem"
  tls_key = "/etc/telegraf/key.pem"

[[inputs.cloudwatch_metric_streams]]
  service_address = ":443"

[[inputs.cloudwatch]]
  region = "us-east-1"

[[outputs.influxdb_v2]]
  urls = ["http://127.0.0.1:8086"]

  ## Token for authentication.
  token = ""

  ## Organization is the name of the organization you wish to write to.
  organization = ""

  ## Destination bucket to write into.
  bucket = ""

  ## The value of this tag will be used to determine the bucket.  If this
  ## tag is not set the 'bucket' option is used as the default.
  # bucket_tag = ""

  ## If true, the bucket tag will not be added to the metric.
  # exclude_bucket_tag = false

  ## Timeout for HTTP messages.
  # timeout = "5s"

  ## Additional HTTP headers
  # http_headers = {"X-Special-Header" = "Special-Value"}

  ## HTTP Proxy override, if unset values the standard proxy environment
  ## variables are consulted to determine which proxy, if any, should be used.
  # http_proxy = "http://corporate.proxy:3128"

```


This Telegraf configuration takes care of both the data collection and data storage steps by collecting all the designated data and sending it to InfluxDB for storage. Let’s go over some ways you can use that data.

Data Visualization

One of the first steps for many companies is to create dashboards and data visualizations for their infrastructure monitoring system. These dashboards can be used for everything from high-level reports to detailed analysis by engineers monitoring things in real time. Here’s an example of a Grafana dashboard built using the data collected for this tutorial:

Alerting

While dashboards are nice, it’s impossible to manually track everything happening with your infrastructure at scale. To help with this problem, setting up automated alerting is a common feature of infrastructure monitoring systems. Here’s an example of how Grafana can be used to set value thresholds for metrics and create automated alerts if those thresholds are violated.

Grafana integrates with third-party tools like PagerDuty and Slack so engineers can be notified if something goes wrong. In some cases, alerting like this could be used to completely automate certain actions, like automatically scaling cloud capacity if hardware utilization hits a certain level.
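Grafana handles this natively, but the underlying pattern is simple enough to sketch. The example below assumes a placeholder function that returns the latest CPU utilization from your metrics store and a Slack incoming-webhook URL; both are stand-ins, not part of any tool mentioned above.

```python
# Illustrative threshold alert: read_latest_cpu() is a placeholder for a
# query against your metrics store (InfluxDB, Prometheus, etc.).
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/EXAMPLE/WEBHOOK/URL"  # placeholder
CPU_THRESHOLD = 85.0  # percent


def read_latest_cpu() -> float:
    """Placeholder: replace with a real query against your metrics store."""
    return 91.2


def main():
    cpu = read_latest_cpu()
    if cpu > CPU_THRESHOLD:
        # Notify the on-call channel; the same branch could trigger scaling.
        requests.post(SLACK_WEBHOOK, json={
            "text": f"CPU utilization is {cpu:.1f}%, above the {CPU_THRESHOLD}% threshold."
        }, timeout=10)


if __name__ == "__main__":
    main()
```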

Predictive Analytics and Forecasting

Predictive analytics and forecasting are probably the ideal end goal for many engineering teams. While alerting is a reactive approach that only works after something has gone wrong, predictive analytics and forecasting allow you to take action before the issue occurs. Creating accurate forecasts is obviously easier said than done, but it has huge benefits when done right.
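As a toy illustration of the idea, the sketch below fits a straight line to a short series of hypothetical disk usage readings and estimates when the disk would fill, using only NumPy. A real capacity forecast would use far more data and account for seasonality.

```python
# Toy capacity forecast: fit a linear trend to hypothetical daily disk usage
# and estimate when it crosses 100%.
import numpy as np

days = np.arange(14)                                                    # last two weeks
disk_pct = 52 + 1.8 * days + np.random.normal(0, 0.5, size=days.size)  # fake readings

slope, intercept = np.polyfit(days, disk_pct, 1)
if slope > 0:
    days_until_full = (100 - (slope * days[-1] + intercept)) / slope
    print(f"Disk grows ~{slope:.2f}%/day; full in roughly {days_until_full:.0f} days.")
else:
    print("Disk usage is flat or shrinking; no action needed.")
```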

Next Steps

Hopefully this article helped you to better understand infrastructure monitoring and some of the tools that are available for building your own system. If you want to play around with some real data you can check out the following resources:

The post Getting Started with Infrastructure Monitoring appeared first on The New Stack.

]]>
Can ChatGPT Save Collective Kubernetes Troubleshooting? https://thenewstack.io/can-chatgpt-save-collective-kubernetes-troubleshooting/ Fri, 08 Sep 2023 14:54:27 +0000 https://thenewstack.io/?p=22717725

Decades ago, sysadmins started flooding the internet with questions about the technical problems they faced daily. They had long, vibrant

The post Can ChatGPT Save Collective Kubernetes Troubleshooting? appeared first on The New Stack.

]]>

Decades ago, sysadmins started flooding the internet with questions about the technical problems they faced daily. They had long, vibrant and valuable discussions about how to investigate and troubleshoot their way to understanding the root cause of the problem; then they detailed the solution that ultimately worked for them.

This flood has never stopped, only changed the direction of its flow. Today, these same discussions still happen on Stack Overflow, Reddit and postmortems on corporate engineering blogs. Each one is a valuable contribution to the global anthology of IT system troubleshooting.

Kubernetes has profoundly altered the flow as well. The microservice architecture is far more complex than the virtual machines (VMs) and monolithic applications that have troubled sysadmin and IT folks for decades. Local reproductions of K8s-scale bugs are often impossible to operate. Observability data gets fragmented across multiple platforms, if captured at all, due to Kubernetes’ lack of data persistence. Mapping the interconnectedness of dozens or hundreds of services, resources and dependencies is an effort in futility.

Now your intuition, driven by experience, isn’t necessarily enough. You need to know how to debug the cluster for clues as to your next step.

This complexity means that public troubleshooting discussions are more important now than ever, but now we’re starting to see this valuable flood not get redirected, but dammed up entirely. You’ve seen this in Google. Any search for a Kubernetes-related issue brings you a half-dozen paid ads and at least a page of SEO-driven articles that lack technical depth. Stack Overflow is losing its dominance as the go-to Q&A resource for technical folks, and Reddit’s last few years have been mired in controversy.

Now, every DevOps platform for Kubernetes is building one last levee: Centralize your troubleshooting knowledge within their platform, and replace it with AI and machine learning (ML) until the entire stack becomes a black box to even your most experienced cloud native engineers. When this happens, you lose the skills for individually probing, troubleshooting and fixing your system. This trend turns what used to be a flood of crowdsourced troubleshooting know-how into a mere trickle compared to what was available in the past.

When we become dependent on platforms, the collective wisdom of troubleshooting techniques disappears.

The Flood Path of Troubleshooting Wisdom

In the beginning, sysadmins relied on genuine books for technical documentation and holistic best practices to implement in their organizations. As the internet proliferated in the ‘80s and ‘90s, these folks generally adopted Usenet to chat with peers and ask technical questions about their work in newsgroups like comp.lang.*, which operated like stripped-down versions of the forums we know today.

The general availability of the World Wide Web quickly and almost completely diverted the flood of troubleshooting wisdom. Instead of newsgroups, engineers and administrators flocked to thousands of forums, including Experts Exchange, which went live in 1996. After amassing a repository of questions and answers, the team behind Experts Exchange put all answers behind a $250-a-year paywall, which isolated countless valuable discussions from public consumption and ultimately led to the site’s sinking relevance.

Stack Overflow came next, opening up these discussions to the public again and gamifying discussions through reputation points, which could be earned by providing insights and solutions. Other users then vote for and validate the “best” solution, which helps follow-on searchers find an answer quickly. The gamification, self-moderation and community around Stack Overflow made it the singular channel where the flood of troubleshooting know-how flowed.

But, like all the other eras, nothing good can last forever. Folks have been predicting the “decline of Stack Overflow” for nearly 10 years, citing that it “hates new users” due to its combative nature and structure of administration by whoever has the most reputation points. While Stack Overflow has certainly declined in relevance and popularity, with Reddit’s development/engineering-focused subreddits filling the void, it remains the largest repository of publicly accessible troubleshooting knowledge.

Particularly so for Kubernetes and the cloud native community, which is still experiencing major growing pains. And that’s an invaluable resource, because if you think Kubernetes is complex now …

The Kubernetes Complexity Problem

In a fantastic article about the downfall of “intuitive debugging,” software delivery consultant Pete Hodgson argues that the modern architectures for building and delivering software, like Kubernetes and microservices, are far more complex than ever. “The days of naming servers after Greek gods and sshing into a box to run tail and top are long gone for most of us,” he writes, but “this shift has come at a cost … traditional approaches to understanding and troubleshooting production environments simply will not work in this new world.”

Cynefin model. Source: Wikipedia

Hodgson uses the Cynefin model to illustrate how software architecture used to be complicated, in that given enough experience, one could understand the cause-and-effect relationship between troubleshooting and resolution.

He argues that distributed microservice architectures are instead complex, in that even experienced folks only have a “limited intuition” as to the root cause and how to troubleshoot it. Instead of driving straight toward results, they must spend more time asking and answering questions with observability data to eventually hypothesize what might be going wrong.

If we agree with Hodgson’s premise — that Kubernetes is inherently complex and requires much more time spent analyzing the issue before responding — then it seems imperative that engineers working with Kubernetes learn which questions are most important to ask, and how to answer them with observability data, to make the optimal next move.

That’s exactly the type of wisdom disappearing into this coming generation of AI-driven troubleshooting platforms.

Two Paths for AI in Kubernetes Troubleshooting

For years, companies like OpenAI have been scraping and training their models based on public data published on Stack Overflow, Reddit and others, which means these AI models have access to lots of systems and applications knowledge including Kubernetes. Others recognize that an organization’s observability data is a valuable resource for training AI/ML models for analyzing new scenarios.

They’re both asking the same question: How can we leverage this existing data about Kubernetes to simplify the process of searching for the best solution to an incident or outage? The products they’re building take very different paths.

First: Augment the Operator’s Analysis Efforts

These tools automate and streamline access to that existing flood of troubleshooting knowledge published publicly online. They don’t replace the human intuition and creativity that’s required to do proper troubleshooting or root-cause analysis (RCA), but rather thoughtfully automate how an operator finds relevant information.

For example, if a developer new to Kubernetes struggles with deploying their application because they see a CrashLoopBackOff status when running kubectl get pods, they can query an AI-powered tool to provide recommendations, like running kubectl describe $POD or kubectl logs $POD. Those steps might in turn lead the developer to investigate the relevant deployment with kubectl describe $DEPLOYMENT.
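That back-and-forth triage is easy to see in code. The sketch below uses the official Kubernetes Python client to find pods stuck in CrashLoopBackOff and print their recent logs, which is essentially the manual first pass these AI assistants aim to shortcut. The namespace and line count are arbitrary example values.

```python
# Minimal triage sketch with the official Kubernetes Python client:
# find CrashLoopBackOff pods in a namespace and print their recent logs.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() inside a cluster
v1 = client.CoreV1Api()
namespace = "default"              # example value

for pod in v1.list_namespaced_pod(namespace).items:
    for status in (pod.status.container_statuses or []):
        waiting = status.state.waiting
        if waiting and waiting.reason == "CrashLoopBackOff":
            print(f"--- {pod.metadata.name} ({status.name}) ---")
            print(v1.read_namespaced_pod_log(
                name=pod.metadata.name,
                namespace=namespace,
                container=status.name,
                tail_lines=20,
            ))
```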

At Botkube, we found ourselves invested in this concept of using AI, trained on the flood of troubleshooting wisdom, to automate this back-and-forth querying process. Users should be able to ask questions directly in Slack, like “How do I troubleshoot this nonfunctional service?” and receive a response penned by ChatGPT. During a companywide hackathon, we followed through, building a new plugin for our collaborative troubleshooting platform designed around this concept.

With Doctor, you can tap into the flood of troubleshooting know-how, with Botkube as the bridge between your Kubernetes cluster and your messaging/collaboration platform without trawling through Stack Overflow or Google search ads, which is particularly useful for newer Kubernetes developers and operators.

The plugin also takes automation a step further by generating a Slack message with a Get Help button for any error or anomaly, which then queries ChatGPT for actionable solutions and next steps. You can even pipe the results from the Doctor plugin into other actions or integrations to streamline how you actively use the existing breadth of Kubernetes troubleshooting knowledge to debug more intuitively and sense the problem faster.

Second: Remove the Operator from Troubleshooting

These tools don’t care about the flood of public knowledge. If they can train generalist AI/ML models based on real observability data, then fine-tune based on your particular architecture, they can seek to cut out the human operator in RCA and remediation entirely.

Causely is one such startup, and they’re not shying away from their vision of using AI to “eliminate human troubleshooting.” The platform hooks up to your existing observability data and processes them to fine-tune causality models, which theoretically take you straight to remediation steps — no probing or kubectl-ing required.

I’d be lying if I said a Kubernetes genie doesn’t sound tempting on occasion, but I’m not worried about a tool like Causely taking away operations jobs. I’m worried about what happens to our valuable flood of troubleshooting knowledge in a Causely-led future.

The Gap Between These Paths: The Data

I’m not priming a rant about how “AI will replace all DevOps jobs.” We’ve all read too many of these doomsday scenarios for every niche and industry. I’m far more interested in the gap between these two paths: What data is used for training and answering questions or presenting results?

The first path generally uses existing public data. Despite concerns around AI companies crawling these sites for training data — looking at you, Reddit and Twitter — the openness of this data still provides an incentive loop to keep developers and engineers contributing to the continued flood of knowledge on Reddit, Stack Overflow and beyond.

The cloud native community is also generally amenable to an open source-esque sharing of technical knowledge and the idea that a rising tide (of Kubernetes troubleshooting tips) lifts all boats (of stressed-out Kubernetes engineers).

The second path looks bleaker. With the rise of AI-driven DevOps platforms, more troubleshooting knowledge gets locked inside these dashboards and the proprietary AI models that power them. We all agree that Kubernetes infrastructure will continue to get more complex, not less, which means that over time, we’ll understand even less about what’s happening between our nodes, pods and containers.

When we stop helping each other analyze a problem and sense a solution, we become dependent on platforms. That feels like a losing path for everyone but the platforms.

How Can We Not Lose (or Lose Less)?

The best thing we can do is continue to publish amazing content online about our troubleshooting endeavors in Kubernetes and beyond, like “A Visual Guide on Troubleshooting Kubernetes Deployments”; create apps that educate through gamification, like SadServers; take our favorite first steps when troubleshooting a system, like “Why I Usually Run ‘w’ First When Troubleshooting Unknown Machines”; and conduct postmortems that detail the stressful story of probing, sensing and responding to potentially disastrous situations, like the July 2023 Tarsnap outage.

We can go beyond technical solutions, too, like talking about how we can manage and support our peers through stressful troubleshooting scenarios, or building organizationwide agreement on what observability is.

Despite their current headwinds, Stack Overflow and Reddit will continue to be reliable outlets for discussing troubleshooting and seeking answers. If they end up in the same breath as Usenet and Experts Exchange, they’ll likely be replaced by other publicly available alternatives.

Regardless of when and how that happens, I hope you’ll join us at Botkube, and the new Doctor plugin, to build new channels for collaboratively troubleshooting complex issues in Kubernetes.

It doesn’t matter if AI-powered DevOps platforms continue to train new models based on scraped public data about Kubernetes. As long as we don’t willingly and wholesale deposit our curiosity, adventure and knack for problem-solving into these black boxes, there will always be a new path to keep the invaluable flood of troubleshooting know-how flowing.

The post Can ChatGPT Save Collective Kubernetes Troubleshooting? appeared first on The New Stack.

]]>
Stream Processing 101: What’s Right for You? https://thenewstack.io/stream-processing-101-whats-right-for-you/ Fri, 08 Sep 2023 13:20:07 +0000 https://thenewstack.io/?p=22717716

Over the last decade, the growing adoption of Apache Kafka has allowed data streaming — the continuous transmission of streams

The post Stream Processing 101: What’s Right for You? appeared first on The New Stack.

]]>

Over the last decade, the growing adoption of Apache Kafka has allowed data streaming — the continuous transmission of streams of data — to go mainstream.

To run operational and analytics use cases in real time, you don’t want to work with pockets of data that will sit and go stale. You want continuous streams of data that you can deal with and apply as they’re generated and ingested. That’s why so many companies have turned to data streaming, but the reality is that data streaming alone is not enough to maximize the value of real-time data. For that, you need stream processing.

What Is Stream Processing and How Does It Work?

Stream processing means performing operations on data as soon as it’s received. Processing data in flight allows you to extract its value as soon as it arrives rather than waiting for data collection and then batch processing.

By default, most systems are designed with high latency. Batch jobs are strung together to periodically move data from one place to another, like a Rube Goldberg machine. But that doesn’t have to be the case. Organizations gain an advantage when they architect for faster processing, especially in use cases designed to improve an organization’s responsiveness.

The TV streaming apps many of us use are a great example of how stream processing can improve both frontend experiences and backend processes. Every button pressed on a remote control provides information about viewing behavior that can inform the categorization of content to improve the user experience.

At the same time, the app can be designed to ensure viewing quality by monitoring streams of data on rebuffering events and regional outages. Compare that to a system or app that can only provide data on interruptions in predetermined intervals, minutes, hours or even days apart. That’s the difference between using batch-based versus streaming data pipelines to capture the data that runs a business. And once an organization makes the jump to data streaming, incorporating stream processing into the new pipelines they build is the only thing that makes sense.

Organizations that adopt data streaming without taking advantage of stream processing are left dealing with more latency and higher costs than they have to. Why bother to capture data in real time if you’re not going to process and transform it in real time too?

Although not every application you build requires processing data in flight, many of the most valuable use cases such as fraud detection, cyber security and location tracking need real-time processing to work effectively.

When streaming data isn’t processed in real time, it has to be stored in a traditional file system or a cloud data warehouse until an application or service requests that data. That means executing queries from scratch every time you want the data to be joined, aggregated or enriched so it’s ready for downstream systems and applications.

In contrast, stream processing allows you to “look” at the data once rather than having to apply the same operations to it over and over. That reduces storage and compute costs, especially as your data-streaming use cases scale over time.
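A stripped-down way to picture the difference: instead of re-querying stored events, a stream processor keeps running state and updates it as each event arrives. The sketch below is framework-free Python with a made-up event shape, purely to illustrate the “look at the data once” idea.

```python
# Framework-free illustration of stateful stream processing:
# maintain a running count of failed payments per user as events arrive.
from collections import defaultdict


def event_source():
    """Stand-in for a real stream (a Kafka topic, a socket, etc.)."""
    yield {"user": "a17", "type": "payment", "status": "failed"}
    yield {"user": "b42", "type": "payment", "status": "ok"}
    yield {"user": "a17", "type": "payment", "status": "failed"}


failed_counts = defaultdict(int)

for event in event_source():
    if event["type"] == "payment" and event["status"] == "failed":
        failed_counts[event["user"]] += 1
        if failed_counts[event["user"]] >= 2:
            # React immediately instead of waiting for a nightly batch job.
            print(f'Possible fraud: {event["user"]} has '
                  f'{failed_counts[event["user"]]} failed payments')
```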

Stream Processing in the Real World

Once you have stream processing pipelines built, you can connect them to all the places your data lives — from on-premise relational databases to the increasingly popular cloud data warehouses and data lakes. Or you can use these pipelines to connect directly to a live application.

A great example of the benefits of stream processing is real-time e-commerce. Stream processing allows an e-commerce platform to update downstream systems as soon as there’s new information available. When it comes to data points like product pricing and inventory, there can be multiple operational and customer-facing use cases that need that information.

If these platforms have to process data in batches, this leads to greater lag time between the information customers want — new sales and promotions, shipping updates or refunds — and the notifications they actually receive. That’s a poor customer experience that businesses need to avoid if they want to be competitive, and something that’s applicable across every industry.

But before companies and their developers can get started, they need to choose the right data-stream-processing technology. And that choice isn’t necessarily a straightforward one.

Common Stream Processing Technologies

Over the last seven or eight years, a few open source technologies have dominated the world of stream processing. This small handful of technologies are trying to solve the problem of putting data to work faster without compromising data quality or consistency, even if the technical, architectural and operational details underneath differ.

Let’s look at three commonly used stream processors.

  • Apache Flink is a data-processing framework designed to process large-scale data streams. Flink supports both event-driven processing and batch processing, as well as interactive analytics.
  • Kafka Streams, part of the Apache Kafka ecosystem, is a microservices-based, client-side library that allows developers to build real-time stream-processing applications and scalable, high-throughput pipelines.
  • Apache Spark is a distributed engine built for big data analytics using micro-batches, and is similar to the real-time processing achieved with Flink and Kafka Streams.

Each of these technologies has its strengths, and there are even use cases where it makes sense to combine these technologies. Whether considering these three technologies or the many others available in the broader ecosystem, organizations need to consider how this decision will further their long-term data strategy and allow them to pursue use cases that will keep them competitive as data streaming becomes more widespread.

How Organizations Can Choose Their Stream-Processing Technologies

Organizations adopting stream processing today often base this decision on the existing skill set of their developer and operations teams. That’s why you often see businesses with a significant community of practice around Kafka, turning to Kafka Streams, for example.

The developer experience is an important predictor of productivity if you plan to build streaming applications in the near future. For example, using a SQL engine (Flink SQL, ksqlDB or Spark SQL) to process data streams may be the right choice for making real-time data accessible to business analysts in your organization. In contrast, for developers used to working with Java, the ease of use and familiarity of Kafka Streams might be a better fit for their skill set.

While this reasoning makes sense for not blocking the way of innovation in the short term, it’s not always the most strategic decision and can limit how far you can take your stream-processing use cases.

How to Get Started with Stream Processing Today

Getting started with stream processing looks different from a practitioner perspective versus an organizational one. While organizations need to think about business requirements, practitioners can focus on the technology that helps them launch and learn fast.

Start by looking at side-by-side comparisons of the streaming technologies you want to use. While a company might evaluate several technologies at once, I’d recommend against that approach for developers — you don’t want to do a proof of concept (POC) on five different technologies. Instead, narrow down your list to two options that fit your requirements, and then build a POC for each.
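If one of your candidates is Spark, a first POC can be as small as the sketch below: read a Kafka topic with Structured Streaming and count events per minute, printing results to the console. It assumes a local broker, a topic named page-views and the Spark Kafka connector on the classpath, all placeholders for your own setup.

```python
# Minimal Structured Streaming POC: count Kafka events per minute.
# Assumes a local broker, a "page-views" topic and the Spark-Kafka connector.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, window

spark = SparkSession.builder.appName("stream-poc").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page-views")
    .load()
)

# Kafka values arrive as bytes; cast to string and aggregate per minute.
per_minute = (
    events.selectExpr("CAST(value AS STRING) AS page", "timestamp")
    .groupBy(window(col("timestamp"), "1 minute"), col("page"))
    .agg(count("*").alias("views"))
)

query = per_minute.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```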

The easiest way to do this is to find a tutorial that closely matches your use case and dive in. A great way to start is by building streaming pipelines that ingest and process data from Internet of Things (IoT) devices or public data sets like Wikipedia updates. Here are some places to start learning:

Developing streaming applications and services can be challenging because they require a different approach than traditional synchronous programming. Practitioners not only need to become familiar with the technology but also how to solve problems by reacting to events and streams of data, rather than by applying conditions and operations to data at rest.

While the technology you choose today may not be the one you use tomorrow, the problem-solving and stream-processing skills you’re gaining won’t go to waste.

The post Stream Processing 101: What’s Right for You? appeared first on The New Stack.

]]>
Achieve Cloud Native without Kubernetes https://thenewstack.io/achieve-cloud-native-without-kubernetes/ Thu, 07 Sep 2023 15:57:34 +0000 https://thenewstack.io/?p=22717648

This is the second of a two-part series. Read part one here. At its core, cloud native is about leveraging the

The post Achieve Cloud Native without Kubernetes appeared first on The New Stack.

]]>

This is the second of a two-part series. Read part one here.

At its core, cloud native is about leveraging the benefits of the cloud computing model to its fullest. This means building and running applications that take advantage of cloud-based infrastructures. The foundational principles that consistently rise to the forefront are:

  • Scalability — Dynamically adjust resources based on demand.
  • Resiliency — Design systems with failure in mind to ensure high availability.
  • Flexibility — Decouple services and make them interoperable.
  • Portability — Ensure applications can run on any cloud provider or even on premises.

In Part 1 we highlighted the learning curve and situations where directly using Kubernetes might not be the best fit. This part zeros in on constructing scalable cloud native applications using managed services.

Managed Services: Your Elevator to the Cloud

Reaching the cloud might feel like constructing a ladder piece by piece using tools like Kubernetes. But what if we could simply press a button and ride smoothly upward? That’s where managed services come into play, acting as our elevator to the cloud. While it might not be obvious without deep diving into specific offerings, managed services often use Kubernetes behind the scenes to build scalable platforms for your applications.

There’s a clear connection between control and complexity when it comes to infrastructure (and software in general). We can begin to tear down the complexity by delegating some of the control to managed services from cloud providers like AWS, Azure or Google Cloud.

Managed services empower developers to concentrate on applications, relegating the concerns of infrastructure, scaling and server management to the capable hands of the cloud provider. The essence of this approach is crystallized in its core advantages: eliminating server management and letting the cloud provider handle dynamic scaling.

Think of managed services as an extension of your IT department, bearing the responsibility of ensuring infrastructure health, stability and scalability.
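To make that concrete, here is a hedged sketch of what focusing on the application can look like when managed services do the heavy lifting, in this case AWS via boto3. The bucket, queue URL and secret name are made-up examples, and Azure and Google Cloud offer equivalent SDK calls.

```python
# Illustrative use of managed services via the AWS SDK (boto3); the bucket,
# queue URL and secret name are placeholder values, not real resources.
import json

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
secrets = boto3.client("secretsmanager")

# Durable object storage without running any storage servers.
s3.put_object(Bucket="example-reports", Key="2023/09/report.json",
              Body=json.dumps({"status": "ok"}))

# Reliable async messaging without operating a broker.
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-jobs",
    MessageBody=json.dumps({"job": "render-report"}),
)

# Centralized secret management instead of baking credentials into config.
api_key = secrets.get_secret_value(SecretId="example/api-key")["SecretString"]
```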

Choosing Your Provider

When designing a cloud native application, the primary focus should be on architectural principles, patterns and practices that enable flexibility, resilience and scalability. Instead of immediately selecting a specific cloud provider, it’s much more valuable for teams to start development without the blocker of this decision-making.

Luckily, the competitive nature of the cloud has driven cloud providers toward feature parity. Basically, they have established foundational building blocks that draw heavy inspiration from one another and ultimately offer the same or extremely similar functionality and value to end users.

This paves the way for abstraction layers and frameworks like Nitric, which can be used to take advantage of these similarities to deliver cloud development patterns for application developers with greater flexibility. The true value here is the ability to make decisions about technologies like cloud providers on the timeline of the engineering team, not upfront as a blocker to starting development.

Resources that Scale by Default

The resource choices for apps set the trajectory for their growth; they shape the foundation upon which an application is built, influencing its scalability, security, flexibility and overall efficiency. Let’s categorize and examine some of the essential components that contribute to crafting a robust and secure application.

Execution, Processing and Interaction

  • Handlers: Serve as entry points for executing code or processing events. They define the logic and actions performed when specific events or triggers occur.
  • API gateway: Acts as a single entry point for managing and routing requests to various services. It provides features like rate limiting, authentication, logging and caching, offering a unified interface to your backend services or microservices.
  • Schedules: Enable tasks or actions to be executed at predetermined times or intervals. Essential for automating repetitive or time-based workloads such as data backups or batch processing.

Communication and Event Management

  • Events: Central to event-driven architectures, these represent occurrences or changes that can initiate actions or workflows. They facilitate asynchronous communication between systems or components.
  • Queues: Offer reliable message-based communication between components, enhancing fault tolerance, scalability and decoupled, asynchronous communication.

Data Management and Storage

  • Collections: Data structures, such as arrays, lists or sets that store and organize related data elements. They underpin many application scenarios by facilitating efficient data storage, retrieval and manipulation.
  • Buckets: Containers in object storage systems like Amazon S3 or Google Cloud Storage. They provide scalable and reliable storage for diverse unstructured data types, from media files to documents.

Security and Confidentiality

  • Secrets: Concerned with securely storing sensitive data like API keys or passwords. Using centralized secret management systems ensures these critical pieces of information are protected and accessible only to those who need them.

Automating Deployments

Traditional cloud providers have offered services for CI/CD but often fall short of delivering a truly seamless experience. Services like AWS CodePipeline or Azure DevOps require intricate setup and maintenance.

Why is this a problem?

  1. Time-consuming: Setting up and managing these pipelines takes away valuable developer time that could be better spent on feature development.
  2. Complexity: Each cloud provider’s CI/CD solution might have its quirks and learning curves, making it harder for teams to switch or maintain multicloud strategies.
  3. Error-prone: Manual steps or misconfigurations can lead to deployment failures or worse, downtime.

You might notice a few similarities here with some of the challenges of adopting K8s, albeit at a smaller scale. However, there are options that simplify the deployment process significantly, such as using an automated deployment engine.

Example: Simplified Process

This is the approach Nitric takes to streamline the deployment process:

  1. The developer pushes code to the repository.
  2. Nitric’s engine detects the change, builds the necessary infrastructure specification and determines the minimal permissions, policies and resources required.
  3. The entire infrastructure needed for the app is automatically provisioned, without the developer explicitly defining it and without the need for a standalone Infrastructure as Code (IaC) project.

Basically, the deployment engine intelligently deduces and sets up the required infrastructure for the application, ensuring roles and policies are configured for maximum security with minimal privileges.
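
For teams that want to trigger this flow from their own pipeline, a minimal CI job can run the Nitric CLI on each push. The workflow below is a generic GitHub Actions-style sketch rather than Nitric’s documented setup: the job layout, install step and credential secret names are illustrative assumptions, and the `nitric up` deploy command should be checked against the current CLI docs.

name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nitric CLI
        # Placeholder install step; follow Nitric's installation docs for the exact command.
        run: ./scripts/install-nitric.sh
      - name: Deploy
        env:
          # Cloud credentials for the target provider; secret names are placeholders.
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        # Deploys the stack described by the application code; confirm flags in the docs.
        run: nitric up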

This streamlined process relieves application and operations teams from activities like:

  • Containerizing images.
  • Crafting, troubleshooting and sustaining IaC tools like Terraform.
  • Managing discrepancies between application needs and existing infrastructure.
  • Initiating temporary servers for prototypes or test phases.

Summary

Using managed services streamlines the complexities associated with infrastructure, allowing organizations to zero in on their primary goals: application development and business expansion. Managed services, serving as an integrated arm of the IT department, facilitate a smoother and more confident transition to the cloud than working directly with K8s. They’re a great choice for cloud native development to reinforce digital growth and stability.

With tools like Nitric streamlining the deployment processes and offering flexibility across different cloud providers, the move toward a cloud native environment without Kubernetes seems not only feasible but also compelling. If you’re on a journey to build a cloud native application or a platform for multiple applications, we’d love to hear from you.

Read Part 1 of this series: “Kubernetes Isn’t Always the Right Choice.”

The post Achieve Cloud Native without Kubernetes appeared first on The New Stack.

7 Benefits of Developer Access to Production  https://thenewstack.io/7-benefits-of-developer-access-to-production/ Thu, 07 Sep 2023 15:12:10 +0000 https://thenewstack.io/?p=22717636


Platform engineering has emerged as a game-changer in modern software development. It promises to revolutionize the developer experience and deliver customer value faster through automated infrastructure operations and self-service capabilities. At the heart of this approach lies a critical aspect: developer access to production.

Let’s explore why providing developers access to production environments is crucial for their productivity and the success of the product, and how it aligns with the principles of platform engineering. Secure, controlled production access also benefits operations teams: it reduces the burden of repetitive tasks and frees them to focus on high-value work such as infrastructure scaling or security enhancements.

Productivity Enabler

When developers have access to production environments, they can directly interact with the real-life systems they build and deploy. This access translates to greater productivity, reducing the need for communication with separate operations teams to diagnose and resolve issues. This firsthand interaction means that developers can instantly diagnose, troubleshoot and rectify anomalies or inefficiencies they detect without waiting for feedback or navigating bureaucratic processes.

By immersing themselves in the production environment, developers gain invaluable insights, identify potential bottlenecks and fine-tune their code with real-world data. The result is faster iterations and a more efficient development process than is possible in a local development environment, which usually can’t fully reproduce how production behaves.

Faster Issue Resolution

Every minute counts in the fast-paced world of software development. Hence, delays in addressing issues can lead to considerable setbacks.

When developers have access to production systems, they can swiftly diagnose and address issues as they arise, minimizing mean time to resolution (MTTR). This capability is especially beneficial during high-pressure situations such as system outages, where developers’ firsthand experience with the codebase usually means getting to the problematic components faster and knowing exactly which logs, events or data to gather to troubleshoot and diagnose the problem.

This ability to troubleshoot and debug in real time not only reduces downtime but also leads to improved overall system stability, as it makes it easier to predict potential system bottlenecks or failures. Developers can provide insights into future updates or changes that might affect system performance, allowing operations teams to prepare in advance.

Ownership and Accountability

Granting developers access to production fosters a sense of ownership and accountability. When development teams are responsible for their product’s performance and reliability, they take more significant ownership of its success. This sense of responsibility drives them to deliver high-quality code and actively participate in maintaining the application’s health.

Well-regulated access to production should lead to a shared responsibility model between the development and operation teams, as the responsibility for system health and uptime becomes a shared endeavor. This collaborative approach ensures that developers and operations teams are aligned in their goals, reducing the likelihood of miscommunication or misaligned priorities.

Empowering Innovation

Developers are at their creative best when they can explore and experiment freely. By providing access to production, organizations enable their development teams to innovate and push boundaries. Developers can directly prototype and validate new features in the production environment, leading to more creative and innovative solutions.

Feedback Loop Improvement

In the traditional setup, feedback from operations teams might take time to reach developers. However, with direct access to production, developers receive instant feedback on their code’s impact, performance and scalability by gathering logs, data samples and events. Additionally, the real-time data and insights from the live environment empower developers to make informed decisions, refine their code based on actual user interactions and iteratively improve the software.

This feedback loop enables continuous improvement, leading to faster and more reliable updates. This direct involvement not only streamlines the development and maintenance processes but also ensures that solutions are tailored to real-world demands and challenges, leading to faster development cycles and reduced time to market.

Empowering the Operations Team

In most traditional setups, operations teams act as gatekeepers to production. While this helps protect the production environment from certain risks, it also forces operations teams to spend time on repetitive tasks like gathering logs and events, tweaking configurations and analyzing payloads. By granting developers controlled access to production, operations teams can shed that burden and improve overall productivity, focusing more on strategic tasks and proactive system improvements rather than being bogged down with routine troubleshooting.

In essence, granting developers access to production paves the way for a more symbiotic relationship between them and operations teams. It promotes collaboration, fosters knowledge exchange and, most importantly, ensures that both teams work harmoniously toward a singular goal: delivering a seamless and resilient user experience.

Cost Efficiency

When developers can debug directly in production, organizations can significantly reduce logging costs, circumvent the need for costly redeployments or initiate new CI/CD cycles merely to add new log lines. This direct access speeds up issue resolution and eliminates unnecessary spending on reiteration cycles. Cost optimization also affects operations teams: With developers directly resolving certain issues in autonomy, operations teams can better allocate their resources and prioritize tasks that demand their specific expertise.

Overcoming Challenges through Developer Observability 

Lightrun’s Developer Observability Platform streamlines the debugging process in production applications through dynamic log additions, metrics integration and virtual breakpoints without requiring code changes, application restarts or redeployment.

Lightrun’s platform facilitates developer access to production via:

  • Dynamic logs, which allow developers to add new log lines anywhere in the production codebase without writing new code or redeploying the application, and without losing state.
  • Snapshots, which are virtual breakpoints that provide typical breakpoint functionalities without stopping execution, allowing them to be used directly on production. Once a snapshot is captured, the developer can view the captured data and act on it.
  • Metrics, which can monitor production applications in real time and on demand. They can, for example, monitor the size of data structures over time, allowing users to find bugs that can be reproduced only on the production system.

How Lightrun Overcomes the Challenges Associated with Production Access

While granting developers access to production has advantages, it also poses challenges in security, auditing and data confidentiality. Here’s how Lightrun addresses them:

  • Security: Lightrun implements robust security measures and access controls to prevent unauthorized access and mitigate risks, ensuring controlled and safe developer access to production.
  • Auditing and compliance: Its comprehensive audit system facilitates continuous compliance monitoring, simplifies auditing processes and ensures adherence to industry standards.
  • Data confidentiality: It safeguards sensitive data in production environments, preventing exposure in logs and snapshots. This enables developers to work with production data securely and compliantly.
  • Controlled access management: Lightrun enables organizations to define precise access controls for users and roles, creating a secure and collaborative development environment.

Conclusion

Allowing developers access to production environments is a cornerstone of platform engineering. It empowers them with the tools to create, innovate and maintain their products more efficiently, ultimately benefiting the entire organization and its customers.

Granting developers access to production is pivotal for productivity and product success, and a robust platform like Lightrun represents a powerful enabler for this strategy.

The post 7 Benefits of Developer Access to Production  appeared first on The New Stack.

Is Security a Dev, DevOps or Security Team Responsibility? https://thenewstack.io/is-security-a-dev-devops-or-security-team-responsibility/ Thu, 07 Sep 2023 13:26:09 +0000 https://thenewstack.io/?p=22717624


No matter what role you work in — software development, DevOps, ITOps, security or any other technical position — you probably appreciate the importance of strong cyber hygiene.

But you may be unsure whose job it is to act on that principle. Although the traditional approach to cybersecurity at most organizations was to expect security teams to manage risks, security engineers often point fingers at other teams, telling them it’s their job to ensure that applications are designed and deployed securely.

For their part, developers might claim that security is mainly the responsibility of DevOps or ITOps, since those are the teams that have to manage applications in production — the place where most attacks occur — whereas developers only design and build software.

Meanwhile, the operations folks often point their fingers back at developers, arguing that if there are vulnerabilities inside an application that attackers exploit once the app is in production, the root cause of the problem is mistakes made by developers, not DevOps or ITOps engineers.

On top of all of this, engineers can treat other stakeholders as bearing primary responsibility for security. They might say that if a breach occurs, it’s because a cloud provider didn’t have strong access controls or because end users did something irresponsible, for example.

Cloud Security Is a Collective Responsibility

Who’s right? Nobody, actually. Security is not the job of any one group or type of role.

On the contrary, security is everyone’s job. Forward-thinking organizations must dispense with the mindset that a certain team “owns” security, and instead embrace security as a truly collective team responsibility that extends across the IT organization and beyond.

After all, there is a long list of stakeholders in cloud security, including:

  • Security teams, who are responsible for understanding threats and providing guidance on how to avoid them.
  • Developers, who must ensure that applications are designed with security in mind and that they do not contain insecure code or depend on vulnerable third-party software to run.
  • ITOps engineers, whose main job is to manage software once it is in production and who therefore play a leading role both in configuring application-hosting environments to be secure and in monitoring applications to detect potential risks.
  • DevOps engineers, whose responsibilities span both development and ITOps work, placing them in a position to secure code during both the development and production stages.
  • Cloud-service providers, who are responsible for ensuring that underlying cloud infrastructure is secure, and who provide some (though certainly not all) of the tooling (like identity and access management frameworks) that organizations use to protect cloud workloads.
  • End users, who need to be educated about cloud security best practices in order to resist risks like insecure sharing of business data between applications and phishing attacks.

It would be nice if just one of these groups could manage all aspects of cybersecurity, but they can’t. There are too many types of risks, which manifest across too many different workflows and resources, for cloud security to be the responsibility of any one group.

Every Organization — and Every Security Responsibility Model — Is Different

On top of this, there is the challenge that, depending on your organization, not all of the groups above may even exist. Maybe you no longer have development and ITOps teams because you’ve consolidated them into a single DevOps team. Maybe you’re not large enough to employ a full-time security team. Maybe you don’t use the public cloud, in which case there is no cloud provider helping to secure your underlying infrastructure.

My point here is that organizations vary, and so do the security models that they can enforce. There is no one-size-fits-all strategy for delegating security responsibilities between teams or roles.

Putting DevSecOps into Practice

All of the above is why it’s critical to operationalize DevSecOps — the idea that cloud security is a shared responsibility between developers, security teams, and operations teams — across your organization.

Now, this may seem obvious. There’s plenty of talk today about DevSecOps and plenty of organizations that claim to be “doing” DevSecOps.

But just because a business says it has embraced DevSecOps doesn’t necessarily mean that security has seeped into all units and processes of the business. Sometimes DevSecOps is just jargon that executives toss around to sound like they take security seriously, even though they haven’t actually changed the organizational culture surrounding security. Other times, DevSecOps basically means that your security team talks to developers and ITOps, but your business still treats the security team as the primary stakeholder in security operations.

Approaches like these aren’t enough. In a world where every year sets new records for the pace and scope of cyberattacks, security truly needs to be the job of your entire organization — not just technical teams, but also nontechnical stakeholders like your “business” employees and even external stakeholders such as cloud-service providers and partners. It’s only by enforcing security at every level of the organization, and at every stage of your processes, that you can move the needle against risks.

Conclusion: To Change Security, Change Your Mindset

So, don’t just talk about DevSecOps or rest on your laurels because you’ve designated a certain group of engineers as the team that “owns” security. Strive instead to make cloud security a priority for every stakeholder inside and outside your business who plays a role in helping to protect IT assets. Until the answer to “who’s responsible for security?” is “everyone,” you’ll never be as secure as you can be.

Want to take charge of your cloud security? The Orca Cloud Security Platform offers comprehensive visibility into your cloud environment, providing prioritized alerts for vulnerabilities, misconfigurations, compromises and other potential threats across your entire inventory of cloud accounts. To get started, request a demo of the Orca cloud-security platform or sign up for a free cloud risk assessment today.


The post Is Security a Dev, DevOps or Security Team Responsibility? appeared first on The New Stack.

Britive: Just-in-Time Access across Multiple Clouds https://thenewstack.io/britive-just-in-time-access-across-multiple-clouds/ Thu, 07 Sep 2023 10:00:41 +0000 https://thenewstack.io/?p=22717397


Traditionally when a user was granted access to an app or service, they kept that access until they left the company. Unfortunately, too often it wasn’t revoked even then. This perpetual 24/7 access left companies open to a multitude of security exploits.

More recently the idea of just-in-time (JIT) access has come into vogue, addressing companies’ growing attack surface that comes with the proliferation of privileges granted for every device, tool and process. Rather than ongoing access, the idea is to grant it only for a specific time period.

But manually managing access for the myriad technologies workers use on a daily basis would be onerous, especially for companies with thousands of employees. And with many companies adopting a hybrid cloud strategy in which each cloud has its own identity and access management (IAM) protocols, the burden grows. With zero standing privileges considered a pillar of a zero trust architecture, JIT access paves the way to achieving it.

Glendale, California-based Britive is taking on the challenge of automating JIT access across multiple clouds not only for humans but also for machine processes.

“We recognize that in the cloud, access is typically not required to be permanent or perpetual,” pointed out Britive CEO and co-founder Art Poghosyan. “Most of access is so frequently changing and dynamic, it really doesn’t have to be perpetual standing access … if you’re able to provision with an identity at a time when [users] need it. With proper security, guardrails in place and authorization in place, you really don’t need to keep that access there forever. … And that’s what we do, we call it just-in-time ephemeral privilege management or access management.”

‘Best Left to Automation’

Exploited user privileges have led to some massive breaches in recent years, like SolarWinds, MGM Resorts, Uber and Capital One. Even IAM vendor Okta fell victim.

In the Cloud Security Alliance report “Top Threats to Cloud Computing,” more than 700 industry experts named identity issues as the top threat overall.

And in “2022 Trends in Securing Digital Identities,” of more than 500 people surveyed, 98% said the number of identities is increasing, primarily driven by cloud adoption, third-party relationships and machine identities.

Pointing in particular to cloud identity misconfigurations, a problem occurring all too often, Matthew Chiodi, then Palo Alto Networks’ public cloud chief security officer, cited a lack of IAM governance and standards multiplied by “the sheer volume of user and machine roles combined with permissions and services that are created in each cloud account.”

Chiodi added, “Humans are good at many things, but understanding effective permissions and identifying risky policies across hundreds of roles and different cloud service providers are tasks best left to algorithms and automation.”

JIT systems take into account whether a user is authorized to have access, the user’s location and the context of their current task. Access is granted only if the given situation justifies it, and is then revoked when the task is done.

Addressing Need for Speed

Founded in 2018, Britive automates JIT access privileges, including tokens and keys, for people and software accessing cloud services and apps across multiple clouds.

Aside from the different identity management processes involved with cloud platforms like Azure, Oracle, Amazon Web Services (AWS) and Google, developers in particular require access to a range of tools, Poghosyan pointed out.

“Considering the fact that a lot of what they do requires immediate access … speed is the topmost priority for users, right?” he said.

“And so they use a lot of automation, tools and things like HashiCorp Terraform or GitHub or GitLab and so on. All these things also require access and keys and tokens. And that reality doesn’t work well with the traditional IAM tools where it’s very much driven from a sort of corporate centralized, heavy workflow and approval process.

“So we built technology that really, first and foremost, addresses this high velocity and highly automated process that cloud environments users need, especially development teams,” he said, adding that other teams, like data analysts who need access to things like Snowflake or Google Big Query and whose needs change quickly, would find value in it as well.

“That, again, requires a tool or a system that can dynamically adapt to the needs of the users and to the tools that they use in their day-to-day job,” he said.

Beyond Role-Based Access

Acting as an abstraction layer between the user and the cloud platform or application, Britive uses an API-first approach to grant access with the level of privileges authorized for the user. A temporary service account sits inside containers for developer access rather than using hard-coded credentials.

While users normally work with the least privileges required for their day-to-day jobs, just-in-time access grants elevated privileges for a specific period and revokes those permissions when the time is up. Going beyond role-based access control (RBAC), the system is flexible enough to allow companies to alternatively base access on attributes of the resource in question (attribute-based access) or on policy (policy-based access), Poghosyan said.

The patented platform integrates with most cloud providers and with CI/CD automation tools like Jenkins and Terraform.

Its cross-cloud visibility provides a single view into issues such as misconfigurations, high-risk permissions and unusual activity across your cloud infrastructure, platform and data tools. Data analytics offers risk scores and right-sizing access recommendations based on historical use patterns. The access map provides a visual representation of the relationships between policies, roles, groups and resources, letting you know who has access to what and how it is used.

The company added cloud infrastructure entitlement management (CIEM) in 2021 to understand privileges across multicloud environments and to identify and mitigate risks when the level of access is higher than it should be.

The company launched Cloud Secrets Manager in March 2022, a cloud vault for static secrets and keys when ephemeral access is not feasible. It applies the JIT concept of ephemeral creation of human and machine IDs like a username or password, database credential, API token, TLS certificate, SSH key, etc. It addresses the problems of hard-coded secrets management in a single platform, replacing embedded API keys in code by retrieving keys on demand and providing visibility into who has access to which secrets and how and when they are used.

In August it released Access Builder, which provides self-service access requests to critical cloud infrastructure, applications and data. Users set up a profile that can be used as the basis of access and can track the approval process. Meanwhile, administrators can track requested permissions, gaining insights into which identities are requesting access to specific applications and infrastructure.

Range of Integrations

Poghosyan previously co-founded Advancive, an IAM consulting company acquired by Optiv in 2016. Poghosyan and Alex Gudanis founded Britive in 2018. It has raised $35.9 million, most recently $20.5 million in a Series B funding round announced in March. Its customers include Gap, Toyota, Forbes and others.

Identity and security analysts KuppingerCole named Britive among the innovation leaders in its 2022 Leadership Compass report along with the likes of CyberArk, EmpowerID, Palo Alto Networks, Senhasegura, SSH and StrongDM that it cited for embracing “the new worlds of CIEM and DREAM (dynamic resource entitlement and access management) capability.”

“Britive has one of the widest compatibilities for JIT machine and non-machine access cloud services [including infrastructure, platform, data and other ‘as a service’ solutions] including less obvious provisioning for cloud services such as Snowflake, Workday, Okta Identity Cloud, Salesforce, ServiceNow, Google Workspace and others – some following specific requests from customers. This extends its reach into the cloud beyond many rivals, out of the box,” the report states.

It adds that it is “quite eye-opening in the way it supports multicloud access, especially in high-risk develop environments.”

Poghosyan pointed to two areas of focus for the company going forward: one is building support for non-public cloud environments because that’s still an enterprise reality, and the other is going broader into the non-infrastructure technologies. It’s building a framework to enable any cloud application or cloud technology vendor to integrate with Britive’s model, he said.

The post Britive: Just-in-Time Access across Multiple Clouds appeared first on The New Stack.

How to Pave Golden Paths That Actually Go Somewhere https://thenewstack.io/how-to-pave-golden-paths-that-actually-go-somewhere/ Wed, 06 Sep 2023 17:53:04 +0000 https://thenewstack.io/?p=22717592


More than ever, software engineering organizations are turning to platform engineering to enable standardization by design and true developer self-service. Platform engineers build an internal developer platform (IDP), which is the sum of all the tech and tools bound together to pave golden paths for developers. According to Humanitec’s CEO Kaspar von Grünberg, golden paths are any procedure “in the software development life cycle that a user can follow with minimal cognitive load and that drives standardization.” Golden paths have long been discussed as an important goal of successful platform (and DevOps) setups.

Von Grünberg’s PlatformCon 2023 talk, “Build golden paths for day 50, not day 1!”, dove into how and why software engineering organizations should shift their focus to golden paths for the long term, complete with specific examples. Let’s explore the problem with the way most platform teams approach golden paths, how platform teams can fix their priorities and what scalable golden paths actually look like.

The Problem with Most Golden Paths? Bad Priorities

When deciding which golden paths to build and in what order, too many organizations make whatever comes first in the application and development life cycle their top priority. They start optimizing processes that only take place on Day 1 of the application life cycle, like how to create a new application or service via scaffolding. However, when evaluating the entire life cycle of an application, it’s clear that golden paths for Day 1 don’t go that far. Prioritizing golden paths for Day 2 to 50 (or day 1,000, for that matter) has a much larger impact on developer productivity and business performance.

Von Grünberg started studying the practices of top-performing engineering organizations years before platform engineering was on everyone’s radar. He has long considered this prioritization failure one of the top 10 fallacies in platform engineering, writing: “Of the time your team will invest in an application, the creation process is below 1%.” In his view, the return on investment (ROI) on this small part of the chain is too small to justify investing in its golden paths first. Organizations should instead invest in golden paths for Day 50 and beyond.

Lessons from Netflix’s Platform Console

The first iterations of Netflix’s federated platform console, which is a developer portal, demonstrate that not all golden paths are created equal. Senior software engineer Brian Leathem shared that one of the platform team’s original goals was to “unify the end-to-end experience to improve usability.”

Through user research, Leathem’s team found that developers were struggling with the high volume and variety of tools distributed across their workflows. They also found that limited discovery was hurting both new and tenured developers, who had difficulty onboarding or were unaware of new offerings that would improve their existing workflows. The solution they chose was a platform console, or as Leathem described it, a “common front door” for developers.

They adopted the Backstage plugin UI so they could invest their development resources into building custom UI components for the Netflix portal. The result was a portal in which users could manage their software across the software development life cycle in a single view. They introduced “collections,” or fleets of services for which the developer wants to view and assess performance together, to ease the burden of managing multiple services and software. They decided to use a golden path (Leathem used the term “paved road”) to tackle the discoverability problem only.

To start, the golden path was a static website that featured all documentation and recommended appropriate tools for the problems developers were solving. The goal was to weave the golden paths into the console to more deeply integrate documentation with its corresponding running services. Further down the line, Leathem’s team also hoped to build functionality for developers to create, modify and operate services through the console.

In feedback on the first iteration of the platform console, Netflix developers said the “View and Access” experience was not compelling enough for them to abandon their old habits and routines. In response, the platform team switched their focus to end-to-end workflows not available with existing tooling to keep users returning to the console. In Leathem’s PlatformCon 2023 talk, he said the approach significantly boosted the number of recurring users on the console.

Netflix’s example demonstrates that platforms need more than the developer portal component to be compelling to users. Developers want golden paths for end-to-end workflows.

Furthermore, usability is one of many problems a platform can improve. For example, an organization can design golden paths that improve usability, productivity and scalability by focusing on end-to-end workflows. Golden paths for different workflows can enable standardization by design and true developer self-service.

How to Prioritize Potential Golden Paths

With more of an application’s life cycle to cover, golden paths for Day 50 can be daunting to prioritize. Inspired by his research, von Grünberg proposed a simple exercise to help platform teams determine their priorities based on how frequently each procedure occurs and how much developer and operations time (including waiting) it consumes. The table below shows what this analysis could look like for an evaluation of 100 deployments.

Procedure | Frequency (% of deployments) | Dev time in hours (including waiting and errors) | Ops time in hours (including waiting and errors)
Add/update app configurations (such as env variables) | 5%* | 1* | 1*
Add services and dependencies | 1%* | 16* | 8*
Add/update resources | 0.38%* | 8* | 24*
Refactor and document architecture | 0.28%* | 40* | 8*
Waiting due to blocked environment | 0.5%* | 15* | 0*
Spinning up environment | 0.33%* | 24* | 24*
Onboarding devs, retrain and swap teams | 1%* | 80* | 16*
Rollback failed deployment | 1.75% | 10* | 20*
Debugging, error tracing | 4.40% | 10* | 10*
Waiting for other teams | 6.30% | 16* | 16*

*per 100 deployments

From this table, organizations can gain a holistic view of the processes their golden paths need to address.

Since von Grünberg shared this exercise in early 2022, he says that the explosive growth of the platform engineering community has enabled him to observe the most common and pressing pain points across thousands of top engineering organizations and learn successful approaches to soothing them. These insights were valuable in understanding what types of Day 50 processes are the most important for platform teams to optimize and how best to optimize them. He found that tackling the most pressing pain points with golden paths first consistently netted the best ROI. More importantly, he learned that most organizations’ pain points with these processes had the same root cause and could be mitigated in large part by addressing that common cause directly.

The Universal Pain Point: Static Configuration Management

The problem in question is that most organizations have IDPs that enable developers to deploy an updated image from one stage to another only when the infrastructure of the application does not change. These static configuration files are manually scripted against a set of static environments and infrastructure and, as a result, are prone to errors or excessive overhead when moving beyond the simplest use cases.

With static configuration management, rolling back, changing configs, adding services and dependencies, and similarly complex tasks are arduous for developers. They can either choose to manage infrastructure themselves, reducing their time spent coding and creating shadow operations, or they could submit a ticket to ops, increasing their waiting times and exacerbating existing bottlenecks.

With static configuration management, neither developers nor ops win. Therefore, golden paths that address the challenges of static configuration management have greater potential to optimize a much larger range of processes and at scale.

Dynamic Configuration Management: The Key to Scalable Golden Paths

Instead of settling for static configuration management, organizations should enable dynamic configuration management (DCM). DCM is “a methodology used to structure the configuration of compute workloads. Developers create workload specifications describing everything their workloads need to run successfully. The specification is then used to dynamically create the configuration, to deploy the workload in a specific environment.” With DCM, developers aren’t slowed down by the need to define or maintain any environment-specific configuration for their workloads. DCM drives standardization by design and enables true developer self-service.

The Humanitec Platform Orchestrator, in combination with the workload specification Score, enables DCM by following an RMCD (read, match, create, deploy) pattern: it reads and interprets the workload specification and context, matches the correct configuration templates and resources, creates application configurations and deploys the workload into the target environment wired up to its dependencies. A platform orchestrator is the core of any enterprise-grade IDP because it enables platform teams to enforce organization-wide standards with every git push.

Examples of Scalable Golden Paths

In his PlatformCon 2023 talk, von Grünberg shared a few examples of how a platform orchestrator can facilitate the creation of impactful and scalable golden paths. These examples are also featured in Humanitec’s IDP reference architecture for AWS-based setups.

Simple Deployment to Dev

For example, consider a golden path that enables developers to deploy changes made on a workload to dev more efficiently and consistently.

Let’s say a developer wants to deploy a change on their workload to dev. All the developer has to do is modify the workload and git-push the code. From there, the CI pipeline picks up the code and runs; the image is built and stored in the image registry.

The workload source code contains the workload specification Score:
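
A minimal sketch of what such a Score file might look like follows; the workload, container and resource names are placeholders, and the exact field names are defined by the Score specification:

# Illustrative Score workload specification (names are placeholders).
apiVersion: score.dev/v1b1
metadata:
  name: my-workload
containers:
  my-container:
    image: my-image
resources:
  db:
    type: postgres
  storage:
    type: s3
  dns:
    type: dns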

In this example, the resources section of the workload specification declares that the workload requires a database of type postgres, storage of type s3 and a DNS entry of type dns.

After the CI pipeline has built the image, the platform orchestrator reads the context and looks up which resources are matched against it. It checks whether those resources have been created and reaches out to the Amazon Web Services (AWS) API to retrieve the resource credentials. The target compute in this architecture is Amazon Elastic Kubernetes Service (EKS), so the platform orchestrator creates the app configs in the form of manifests. Then the platform orchestrator deploys the configs and uses Vault to inject the secrets into the container at runtime.

Deployments like this happen all the time, so optimizing this process makes a major difference for developers and the business at large.

Create a New Resource

In a static setup, many golden paths fail when faced with a developer request the system is unfamiliar with. With DCM, everything is repository-based, and developers can extend the set of available resources or customize them.

For example, if a developer needs ArangoDB but it isn’t known to the setup so far, they can add a resource definition to the general baseline of the organization. This way, the developer has easily extended the setup in a way that can be reused by the next developer.
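
Conceptually, a resource definition ties a resource type to the driver and inputs needed to provision it, plus matching criteria that tell the orchestrator when to apply it. The snippet below is only an illustrative sketch with hypothetical field names — it is not Humanitec’s actual resource definition schema, which is managed through the orchestrator’s own API and tooling.

# Illustrative sketch only; field names are hypothetical, not Humanitec's schema.
id: arangodb-dev
type: arangodb              # the resource type workloads can request
driver: my-org/arangodb     # hypothetical driver that knows how to provision it
inputs:
  instance_size: small
criteria:
  - env_type: development   # use this definition when deploying to dev environments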

Update a Resource

Updating a resource is a great example of how platform engineers use a platform orchestrator to maintain a high degree of standardization.

Let’s say you want to update the resource “Postgres” to the latest Postgres version.

With a dynamic approach to golden paths, the “thing” you need to update is only the resource definition where Postgres is specified.

You can find which workloads currently depend on the resource definition by pinging the platform orchestrator API or looking at the UI in the resource definition section. Once identified, you can auto-enforce a deployment across all workloads that depend on the resource.

With this golden path, rolling out the updated resource across all workloads and dependencies is simplified and scalable.

Good Golden Paths Turn Every Day into Day 1

When platform teams invest in these scalable golden paths, von Grünberg argues, everyone wins. Golden paths that leverage a platform orchestrator and DCM enable developers and ops to execute common tasks with greater ease and peace of mind from the earliest stages of an IDP’s development, delivering more value, faster.

Paving golden paths with this approach also catalyzes an important mindset shift for platform teams, according to von Grünberg. With DCM, every day can become Day 1, a starting point for further optimization and opportunity to reduce technical debt. This shift enables organizations to make the most of what platform engineering has to offer.

Get Recommended Golden Path Examples

Humanitec has created reference architectures for AWS, Azure, and GCP based on McKinsey’s research. These resources walk through examples of recommended golden paths in more detail, as well as how they fit into the larger IDP architecture.

The post How to Pave Golden Paths That Actually Go Somewhere appeared first on The New Stack.

SBOMs, SBOMs Everywhere https://thenewstack.io/sboms-sboms-everywhere/ Wed, 06 Sep 2023 16:13:26 +0000 https://thenewstack.io/?p=22717562


Talk about software bills of materials or SBOMs has become even more prevalent in the wake of many supply chain attacks that have occurred in the past few years. Software supply chain attacks can target upstream elements of your software, like open source libraries and packages, and SBOMs are a way to understand what’s in your application or container images.

But while SBOMs are a useful piece of information, there are plenty of questions teams are asking about them: Do we need an SBOM? What do we do with them once we produce them? How can I use them during a security incident?

To answer these and other questions, let’s start with what an SBOM actually is.

What Is an SBOM?

A software bill of materials is a comprehensive inventory of all of the software components and dependencies used in a software application or system. This enables security teams as well as developers to have a better understanding of the third-party resources and imports they are using, particularly when new vulnerabilities in open source packages are constantly being discovered. To protect your organization from these threats, you first have to know what you even have in your stack.

Containers have become the de facto way that developers package and ship software in today’s cloud native landscape. In the context of containers, an SBOM is typically a JSON file listing all the packages, libraries and components used in both the application and the surrounding container image. This file records version information for every component and, just as important, is machine-readable.

If we were to liken this to something we’re all familiar with, this JSON is almost like the nutrition label you’d find on packaged food, but for your containerized application. This JSON file is also a point-in-time artifact, meaning it is tied to a specific SHA256 digest of a container. This differs from mutable tags like latest, in that an SBOM for a mutable tag would change over time, but not for a specific SHA256 digest.

Another important benefit comes from keeping these SBOM files under version control: git history gives you critical retrospective knowledge of what was running in your container for any given build. When a vulnerability is discovered, you can take action quickly because you know what is in the containers currently running in production, and the data is easier to search for rapid inventory during zero-day attacks.

The Log4j zero-day attack that occurred in December 2021 is believed to have affected over 100 million software environments and applications, and CISA recognized it as a threat to governments and organizations across the world. The Cyber Safety Review Board’s postmortem report on Log4j, backed by OpenSSF and the Linux Foundation, encouraged the industry to use software component tools such as SBOMs to reduce the time between awareness and mitigation of vulnerabilities.

SBOMs make it easier for companies to understand which versions of containers they are running in production and how exposed their production systems are when a Log4j-type incident occurs. As in many areas of the software supply chain, there are plenty of excellent open source tools that can produce an SBOM for you — such as Syft, Trivy, BOM and CycloneDX — as well as others that provide commercial services.

Show Me the Code

Below you’ll find an example of a slice of an SBOM from a Node container image from Docker Hub with over 1.4 billion pulls. We hopped into the Slim platform (portal.slim.dev/login) to search for the public Node image, analyzed the image for vulnerabilities, and then downloaded the SBOM directly off the platform in a CycloneDX JSON format.

"$schema": "http://cyclonedx.org/schema/bom-1.4.schema.json",
 "bomFormat": "CycloneDX",
 "specVersion": "1.4",
 "serialNumber": "urn:uuid:aaf2dfd5-5294-4277-8cc1-f7fe6f6d514b",
 "version": 1,
 "metadata": {
   "timestamp": "2023-06-21T13:26:31Z",
   "tools": [
     {
       "vendor": "slim.ai",
       "name": "slim",
       "version": "0.0.1"
     }
   ],
   "component": {
     "bom-ref": "b1ef6d159e61300a",
     "type": "container",
     "name": "index.docker.io/library/latest:latest",
     "version": "sha256:b3fc03875e7a7c7e12e787ffe406c126061e5f68ee3fb93e0ef50aa3f299c481"
   }
 },


You can see the metadata that’s provided. A full example of an SBOM download would include all the associated components, packages, libraries, a short description of their purpose/use, publishers, distribution types and their dependencies.

This brings us back to the question we started with: What do we actually do with an SBOM? The truth is, this is still a work in progress, and few senior developers or security engineers have a perfect answer. For the most part, the goal is to keep this inventory accessible, backed up and safe. You’ll often find an SBOM stored in an artifact repository, backed up to an S3 bucket or hosted with a provider so it’s easy to reach when knowing what’s running in your container becomes mission critical.

Not only is there a question of how to truly extract value out of these types of artifacts, but how will teams manage them as containers continue to grow in size and complexity? This growth can lead to longer CI/CD processing times and an increased workload for DevSecOps teams. The use and management of SBOMs will continue to have a spotlight in software supply chain management.

The Impending Importance of Software Transparency

Chris Hughes and Tony Turner break down the fundamental principles of what SBOMs encapsulate in their latest book, “Software Transparency,” where the function of SBOMs is described as a foundational element in achieving software transparency, enabling organizations to identify potential vulnerabilities and proactively address them. Although there are concerns about SBOMs providing visibility for attackers, Hughes states that “having an SBOM puts software consumers in a much better position to understand both the initial risk associated with software use as well as new and emerging vulnerabilities associated with software components in the software they consume.”

According to Gartner, by 2025, SBOMs will be a requirement for 60% of software providers, as they become a critical component to achieving software supply chain security. This predictive insight exemplifies the need for SBOM generation ahead of the demand that is inevitably growing.

AWS recently announced support for SBOM export in Amazon Inspector, a vulnerability management service that scans AWS workloads across your entire AWS organization to give insight into your software supply chain. When heavy hitters such as Amazon Web Services (AWS) and Docker release SBOM export features, it’s a strong hint that demand for software transparency from software providers will only grow. Slim.AI also provides a pathway for generating and managing SBOMs for your container images.

The Slim Solution for the SBOM Surge

In the ever-evolving landscape of software supply chain security (SSCS), staying ahead of future requirements is imperative. Slim uses advanced scanning and analysis capabilities to generate SBOMs that you can immediately download. We thoroughly inspect the entire stack to extract crucial information and construct a detailed inventory of components. This enables organizations to maintain a robust security posture while ensuring compliance with evolving regulatory requirements.

NTIA recommends generating and storing SBOMs at build time or in your container images in preparation for new releases. On the Slim platform, you can connect to your container registries (such as AWS, Docker Hub, Google Container Registry, and others) to store SBOMs for each of your container images. SBOMs are generated for both the original and hardened container images as part of the many artifacts that are accessible via the platform. Flow through our container hardening process on the platform to generate a smaller, more optimized, and less vulnerable version of your container image to deploy to production.
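
If you’re not using a platform for this, a single CI step with one of the open source tools mentioned earlier can produce the SBOM at build time. The job below is a generic GitHub Actions-style sketch using the anchore/sbom-action wrapper around Syft; the image reference and file name are placeholders, and the action’s input names should be verified against its documentation.

name: sbom
on:
  push:
    branches: [main]
jobs:
  generate-sbom:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Generate CycloneDX SBOM with Syft
        uses: anchore/sbom-action@v0
        with:
          image: my-registry/my-app:latest   # placeholder image reference
          format: cyclonedx-json
          output-file: sbom.cdx.json
      - name: Store the SBOM as a build artifact
        uses: actions/upload-artifact@v4
        with:
          name: sbom
          path: sbom.cdx.json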

The Evolving Future of SBOMs

In the Congressional hearings that followed Log4j, the message from the cloud native industry was clear: SBOMs are just a starting point. While full software inventories are necessary for triaging risk in the event of an attack, they are not by themselves a means to prevent attacks. There’s excitement around what new tools will be made available that use SBOMs as their source of truth.

Until then, most registries are working on the capability to store and manage SBOMs directly inside your registry of choice.

With hugely popular containers and packages having built-in and maintained SBOMs, it will be much easier and faster to start mitigating and reducing risk with resources taken from the wild. In addition, many security organizations like OWASP and the OpenSSF are working toward making tooling more accessible and dev-friendly to drive adoption and wider usage.

Added measures like slimming and hardening containers can also add greater security benefits in ensuring you only ship to production the critical packages truly required by your application. This will provide us with greater trust in our third-party packages and imports, and greater security for our entire software supply chain.

The post SBOMs, SBOMs Everywhere appeared first on The New Stack.

7 Steps to Highly Effective Kubernetes Policies https://thenewstack.io/7-steps-to-highly-effective-kubernetes-policies/ Wed, 06 Sep 2023 14:37:06 +0000 https://thenewstack.io/?p=22717476


You just started a new job where, for the first time, you have some responsibility for operating and managing a Kubernetes infrastructure. You’re excited about toeing your way even deeper into cloud native, but also terribly worried.

Yes, you’re concerned about the best way to write secure applications that follow best practices for naming and resource usage control, but what about everything else that’s already deployed to production? You spin up a new tool to peek into what’s happening and find 100 CVEs and YAML misconfiguration issues of high or critical importance. You close the tab and tell yourself you’ll deal with all of that … later.

Will you?

Maybe the most ambitious and fearless of you will, but the problem is that while the cloud native community likes to talk about security, standardization and “shift left” a lot, none of these conversations deaden the feeling of being overwhelmed by security, resource, syntax and tooling issues. No development paradigm or tool seems to have discovered the right way to present developers and operators with the “sweet spot” of making misconfigurations visible without also overwhelming them.

Like all the to-do lists we might face, whether it’s work or household chores, our minds can only effectively deal with so many issues at a time. Too many issues and we get lost in context switching and prioritizing half-baked Band-Aids over lasting improvements. We need better ways to limit scope (aka triage), set milestones and finally make security work manageable.

It’s time to ignore the number of issues and focus on interactively shaping, then enforcing, the way your organization uses established policies to make an impact — no overwhelming feeling required.

The Cloudy History of Cloud Native Policy

From Kubernetes’ first days, YAML configurations have been the building blocks of a functioning cluster and happily running applications. As the essential bridge between a developer’s application code and an Ops engineer’s work to keep the cluster humming, they’re not only challenging to get right, but also the cause of most deployment/service-level issues in Kubernetes. To add in a little extra spiciness, no one — not developers and not Ops engineers — wants to be solely responsible for them.

Policy entered the cloud native space as a way to automate the way YAML configurations are written and approved for production. If no one person or team wants the responsibility of manually checking every configuration according to an internal style guide, then policies can slowly shape how teams tackle common misconfigurations around security, resource usage and cloud native best practices. Not to mention any rules or idioms unique to their application.

The challenge with policies in Kubernetes is that it’s agnostic to how, when and why you enforce them. You can write rules in multiple ways, enforce them at different points in the software development life cycle (SDLC) and use them for wildly different reasons.

There is no better example of this confusion than pod security policy (PSP), which entered the Kubernetes ecosystem in 2016 with v1.3. PSP was designed to control how a pod can operate and to reject any noncompliant configurations. For example, it allowed a K8s administrator to prevent developers from running privileged pods everywhere, essentially decoupling low-level Linux security decisions from the development life cycle.

PSP never left the beta phase, for a few good reasons. These policies were only applied when a person or process requested the creation of a pod, which meant there was no way to retrofit PSPs or enable them by default. The Kubernetes team admits PSP made it too easy to accidentally grant too-broad permissions, among other difficulties.

The PSP era of Kubernetes security was so fraught that it inspired a new rule for release cycle management: No Kubernetes project can stay in beta for more than two release cycles, either becoming stable or marked for deprecation and removal.

On the other hand, PSP moved the security-in-Kubernetes space in one positive direction: By separating the creation and instantiation of Kubernetes security policy, PSP opened up a new ecosystem for external admission controllers and policy enforcement tools, like Kyverno, Gatekeeper and, of course, Monokle.

These are the tools we’ve used to shed the PSP shackles from our clusters, replacing them with… the Pod Security Standard (PSS). We’ll come back to that big difference in a minute.

A Phase-Based Approach to Kubernetes Policy

With this established decoupling between policy creation and instantiation, you can now apply a consistent policy language across your clusters, environments and teams, regardless of which tools you choose. You can also switch the tools you use for creation and instantiation at will and get reliable results in your clusters.

Creation typically happens in an integrated development environment (IDE), which means you can stick with your current favorite and express rules in a policy-specific language like Open Policy Agent’s (OPA) Rego, a declarative YAML syntax like Kyverno’s, or a general-purpose programming language like Go or TypeScript.

Instantiation and enforcement can happen in different parts of the software development life cycle. As we saw in our previous 101-level post on Kubernetes YAML policies, you can apply validation at one or more points in the configuration life cycle:

  1. Pre-commit directly in a developer’s command line interface (CLI) or IDE,
  2. Pre-deployment via your CI/CD pipeline,
  3. Post-deployment via an admission controller like Kyverno or Gatekeeper, or
  4. In-cluster for checking whether the deployed state still meets your policy standards.

The later policy instantiation, validation and enforcement happen in your SDLC, the more likely it is that a dangerous misconfiguration slips into the production environment, and the more work it takes to identify and fix the original source of any misconfiguration found. You can instantiate and enforce policies at several stages, but earlier is always better — something Monokle excels at, with robust pre-commit and pre-deployment validation support.
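To make the pre-deployment stage concrete, here is a minimal sketch of a CI job that validates rendered manifests against your policies before anything reaches a cluster. It assumes GitHub Actions and the Kyverno CLI purely for illustration (Monokle, Gatekeeper's gator CLI or another validator could fill the same slot), and the repository paths are placeholders:

name: validate-manifests
on: [pull_request]

jobs:
  policy-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate manifests against policies
        # Assumes the Kyverno CLI is already installed on the runner (for
        # example via a setup action or a pinned download step). The job fails
        # the pull request if any manifest under manifests/ violates a policy
        # under policies/, long before an admission controller would reject it.
        run: kyverno apply policies/ --resource manifests/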

With the scenario in place — those dreaded 90 issues — and an understanding of the Kubernetes policy landscape, you can start to whittle away at the misconfigurations before you.

Step 1: Implement the Pod Security Standard

Let’s start with the PSS mentioned earlier. Kubernetes now describes three encompassing policies that you can quickly implement and enforce across your cluster. The “Privileged” policy is entirely unrestricted and should be reserved only for system and infrastructure workloads managed by administrators.

You should start by instantiating the “Baseline” policy, which allows for the minimally specified pod, the place where most developers new to Kubernetes begin:

apiVersion: v1
kind: Pod
metadata:
  name: default
spec:
  containers:
    - name: my-container
      image: my-image


The advantage of starting with the Baseline is that you prevent known privilege escalations without needing to modify all your existing Dockerfiles and Kubernetes configurations. There will be some exceptions, which I’ll talk about in a moment.

Creating and instantiating this policy level is relatively straightforward — for example, on the namespace level:

apiVersion: v1
kind: Namespace
metadata:
  name: my-baseline-namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: baseline
    pod-security.kubernetes.io/warn-version: latest


You will inevitably have some special services that require more access than Baseline allows, like a Promtail agent that collects logs for observability. In these cases, where you need those beneficial features, the namespaces they run in will need to operate under the Privileged policy. You’ll also need to keep up with security improvements from that vendor to limit your risk.
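For those few namespaces, the label-based approach looks just like the Baseline example above, only with a different level. A hypothetical namespace for a log-collection agent might look like this (the name is a placeholder; auditing against Baseline keeps any would-be violations visible in the audit log):

apiVersion: v1
kind: Namespace
metadata:
  name: logging-agents
  labels:
    # The agent needs elevated access, so enforcement is relaxed to privileged,
    # while audit mode still records anything that would have violated Baseline.
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/audit-version: latest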

By enforcing the Baseline level of the Pod Security Standard for most configurations and allowing Privileged for a select few, then fixing any misconfigurations that violate these policies, you’ve checked off your next policy milestone.

Step 2: Fix Labels and Annotations

Labels are meant to identify resources for grouping or filtering, while annotations are for important but nonidentifying context. If your head is still spinning from that, here’s a handy definition from Richard Li at Ambassador Labs: “Labels are for Kubernetes, while annotations are for humans.”

Labels should only be used for their intended purpose, and even then, be careful with where and how you apply them. In the past, attackers have used labels to probe deeply into the architecture of a Kubernetes cluster, including which nodes are running individual pods, without leaving behind logs of the queries they ran.

The same idea applies to your annotations: While they’re meant for humans, attackers often use them to obtain credentials that, in turn, give access to even more secrets. If you use annotations to describe the person who should be contacted in case of an issue, know that you’re creating additional soft targets for social engineering attacks.

Step 3: Migrate to the Restricted PSS

While Baseline is permissive but safe-ish, the “Restricted” Pod Security Standard employs current best practices for hardening a pod. As Red Hat’s Mo Khan once described it, the Restricted standard ensures “the worst you can do is destroy yourself,” not your cluster.

With the Restricted standard, developers must write applications that can run in read-only mode, enable only the Linux capabilities necessary for the pod to run, never escalate privileges and so on.

I recommend starting with the Baseline and migrating to Restricted later, as separate milestones, because the latter almost always requires active changes to existing Dockerfiles and Kubernetes configurations. As soon as you instantiate and enforce the Restricted policy, your configurations will need to adhere to these policies or they’ll be rejected by your validator or admission controller.
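To give a sense of what those changes look like, here is a minimal sketch of a pod spec that passes the Restricted checks. The names and image are placeholders, and the read-only root file system reflects the read-only guidance above:

apiVersion: v1
kind: Pod
metadata:
  name: restricted-example
spec:
  securityContext:
    runAsNonRoot: true          # Restricted forbids running as root
    seccompProfile:
      type: RuntimeDefault      # a seccomp profile must be set explicitly
  containers:
    - name: my-container
      image: my-image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]         # drop every Linux capability by default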

Step 3a: Suppress, Not Ignore, the Inevitable False Positives

As you work through the Baseline and Restricted milestones, you’re approaching a more mature (and complicated) level of policy management. To ensure everyone stays on the same page regarding the current policy milestone, you should start to deal with the false positives or configurations you must explicitly allow despite the Restricted PSS.

When choosing between ignoring a rule and suppressing it, always favor suppression. Suppression requires an auditable action, a log entry or a configuration change, to codify an exception to the established policy framework. You can add suppressions in source, directly in your K8s configurations, or externally, where a developer asks their operations peer to reconfigure the validator or admission controller to let a “misconfiguration” pass through.

In Monokle, you add in-source suppressions directly in your configuration as an annotation, with what the Static Analysis Results Interchange Format (SARIF) specification calls a justification:

metadata:
  annotations:
    monokle.io/suppress.pss.host-path-volumes: Agent requires access to back up cluster volumes

Step 4: Layer in Common Hardening Guidelines

At this point, you’ve moved beyond established Kubernetes frameworks for security, which means you need to take a bit more initiative on building and working toward your own milestones.

The National Security Agency (NSA) and Cybersecurity and Infrastructure Security Agency (CISA) have a popular Kubernetes Hardening Guide, which details not only pod-level improvements, such as effectively using immutable container file systems, but also network separation, audit logging and threat detection.

Step 5: Time to Plug and Play

Once you’ve implemented some or all of the established hardening guidelines, every new policy is about choices, trust and trade-offs. Spend some time on Google or Stack Overflow and you’ll find plenty of plug-and-play policies you can drop into your enforcement mechanism.

You can benefit from crowdsourced policies, many of which come from people with specialized experience, but remember that while a rule might be well-intentioned, you don’t know the recommender’s priorities or operating context. They know how to implement certain “high-hanging fruit” policies because they have to, not because those policies are widely valuable.

One ongoing debate is whether, and how strictly, to limit the resources a container can consume. The same goes for resource requests. Not configuring limits can introduce security risks, but if you constrain your pods too severely, they might not function properly.
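Wherever you land in that debate, the mechanics are simple; the values below are placeholders you would tune per workload:

apiVersion: v1
kind: Pod
metadata:
  name: constrained-example
spec:
  containers:
    - name: my-container
      image: my-image
      resources:
        requests:
          cpu: 250m       # what the scheduler reserves for the container
          memory: 256Mi
        limits:
          cpu: 500m       # CPU above this is throttled
          memory: 512Mi   # exceeding this gets the container OOM-killed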

Step 6: Add Custom Rules for the Unforeseen Peculiarities

Now you’re at the far end of Kubernetes policy, well beyond the 20% of misconfigurations and vulnerabilities that create 80% of the negative impact on production. But even now, having implemented all the best practices and collective cloud native knowledge, you’re not immune to misconfigurations that unexpectedly spark an incident or outage — the wonderful unknown unknowns of security and stability.

A good rule of thumb: If a peculiar (mis)configuration causes issues in production twice, it’s time to codify it as a custom rule enforced during development or by the admission controller. A rule like that is too important to live only in internal documentation, with the hope that developers read it, pay attention to it and catch violations in each other’s pull-request reviews.
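What codifying looks like depends on your toolchain. As one hedged illustration, a Kyverno-style rule that requires a team label on every Deployment (the policy name, label and scope are hypothetical) might look like this:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label
spec:
  validationFailureAction: Enforce   # reject violations instead of only reporting them
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Every Deployment must carry a team label so incidents can be routed."
        pattern:
          metadata:
            labels:
              team: "?*"   # any non-empty value

The same intent could be expressed as a Monokle custom plugin or a Gatekeeper constraint; the point is that the rule now fails fast instead of living only in a wiki.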

Once codified into your existing policy, custom rules become guardrails you enforce as close to development as possible. If you can reach developers with validation before they even commit their work, which Monokle Cloud does seamlessly with custom plugins and a development server you run locally, then you can save your entire organization a lot of rework and thumb-twiddling spent waiting for the CI/CD pipeline to inevitably fail, when people could be building new features or fixing bugs.

Wrapping Up

If you implement all the frameworks and milestones covered above and make all the requisite changes to your Dockerfiles and Kubernetes configurations to meet these new policies, you’ll probably find your list of 90 major vulnerabilities has dropped to a far more manageable number.

You’re seeing the value of our step-by-step approach to shaping and enforcing Kubernetes policies. The more you can interact with the impact of new policies and rules, the way Monokle does uniquely at the pre-commit stage, the easier it’ll be to make incremental steps without overwhelming yourself or others.

You might even find yourself proudly claiming that your Kubernetes environment is entirely misconfiguration-free. That’s a win, no doubt, but it’s not a guarantee — there will always be new Kubernetes versions, new applications and new best practices to roll into what you’ve already done. It’s also not the best way to talk about your accomplishments with your leadership or executive team.

The advantage of leveraging the frameworks and hardening guidelines is that you have a better common ground to talk about your impact on certification, compliance and long-term security goals.

What sounds more compelling to a non-expert:

  • You reduced your number of CVEs from 90 to X,
  • Or that you fully complied with the NSA’s Kubernetes hardening guidelines?

The sooner we worry less about numbers and more about common milestones, enforced as early in the application life cycle as possible (ideally pre-commit!), the sooner we can find the sustainable sweet spot for each of our unique forays into cloud native policy.

The post 7 Steps to Highly Effective Kubernetes Policies appeared first on The New Stack.

]]>
How AI Helped Us Add Vector Search to Cassandra in 6 Weeks https://thenewstack.io/how-ai-helped-us-add-vector-search-to-cassandra-in-6-weeks/ Wed, 06 Sep 2023 13:11:53 +0000 https://thenewstack.io/?p=22717457

With the huge demand for vector search functionality that’s required to enable generative AI applications, DataStax set an extremely ambitious

The post How AI Helped Us Add Vector Search to Cassandra in 6 Weeks appeared first on The New Stack.

]]>

With the huge demand for vector search functionality that’s required to enable generative AI applications, DataStax set an extremely ambitious goal to add this capability to Apache Cassandra and Astra DB, our managed service built on Cassandra.

Back in April, when I asked our chief product officer who was going to build it, he said, “Why don’t you do it?”

With two other engineers, I set out to deliver a new vector search implementation on June 7 — in just six weeks.

Could new AI coding tools help us meet that goal? Some engineers have confidently claimed that AI makes so many mistakes that it’s a net negative to productivity.

After trying them out on this critical project, I’m convinced that these tools are in fact a massive boost to productivity. In fact, I’m never going back to writing everything by hand. Here’s what I learned about coding with ChatGPT, GitHub Copilot and other AI tools.

Copilot

Copilot is simple: It’s enhanced autocomplete. Most of the time it will complete a line for you or pattern-match a completion of several lines from context. Here, I’ve written a comment and then started a new line writing neighbors. Copilot offered to complete the rest, correctly, with the text following “neighbors” on the second line.

Here’s a slightly more involved example from test code, where I started off writing the loop as a mapToLong but then changed my data structures so that it ended up being cleaner to invoke a method with forEach instead. Copilot had my back.

And occasionally (this is more the exception than the rule), it surprises me by offering to complete an entire method.

Copilot is useful, but limited, for two reasons. First, it’s tuned to (correctly) err on the side of caution. It can still hallucinate, but it’s rare; when it doesn’t think it knows what to do, it doesn’t offer completions. Second, it is limited by the requirement to be fast enough to seamlessly integrate with a brief pause in human typing, which rules out using a heavyweight model like GPT-4, for now.

(See also this tweet from Max Krieger for a “Copilot maximalist” take.)

ChatGPT

You can try to get Copilot to generate code from comments, but for that use case you will almost always get better results from GPT-4, via paid ChatGPT or API access.

If you haven’t tried GPT-4 yet, you absolutely should. It’s true that it sometimes hallucinates, but it does so much less than GPT-3.5 or Claude. It’s also true that sometimes it can’t figure out simple problems (here I am struggling to get it to understand a simple binary search). But other times it’s almost shockingly good, like this time when it figured out my race condition on its first try. And even when it’s not great, having a rubber duck debugging partner that can respond with a passable simulacrum of intelligence is invaluable to stay in the zone and stay motivated.

And you can use it for everything. Or at least anything you can describe with text, which is very close to everything, especially in a programming context.

Here are some places I used GPT-4:

  • Random questions about APIs that I would have had to source dive for. This is the most likely category to result in hallucinations, and I have largely switched to Phind for this use case (see below).
  • Micro-optimizations. It’s like Copilot but matching against all of Stack Overflow, because that’s (part of) what it was trained on.
  • Involved Stream pipelines, because I am not yet very good at turning the logic in my head into a functional chain of Stream method calls. Sometimes, as in this example, the end result is worse than where we started, but that happens a lot in programming. It’s much easier and faster to do that exploration with GPT than one keystroke at a time. And making that time-to-results loop faster makes it more likely that I’ll try out a new idea, since the cost of experimenting is lower.
  • Of course GPT also knows about git, but maybe you didn’t realize how good it is at building custom tools using git. Like the other bullets in this list, this is stuff I could have done before by hand, but having GPT there to speed things up means that now I’ll create tools like this (before, I usually would have reached for whatever the second-best solution was, instead of spending an hour on a one-off script like this).

Here’s my favorite collaboration with GPT-4. I needed to write a custom class to avoid the garbage collection overhead of the box/unbox churn from a naive approach using ConcurrentHashMap<Integer, Integer>, and this was for Lucene, which has a strict no-external-dependencies policy, so I couldn’t just sub in a concurrent primitives map like Trivago’s fastutil-concurrent-wrapper.

I went back and forth several times with GPT, improving its solution. This conversation illustrates what I think are several best practices with GPT (as of mid-2023):

  1. When writing code, GPT does best with nicely encapsulated problems. By contrast, I have been mostly unsuccessful trying to get it to perform refactorings that touch multiple parts of a class, even a small one.
  2. Phrase suggestions as questions. “Would it be more efficient to … ?” GPT (and, even more so, Claude) is reluctant to directly contradict its user. Leave it room to disagree or you may unintentionally force it to start hallucinating.
  3. Don’t try to do everything in the large language model (LLM). The final output from this conversation still needs some tweaks, but it’s close enough to what I wanted that it was easier and faster to just finish it manually instead of trying to get GPT to get it exactly right.
  4. Generally, I am not a believer in magical prompts — better to use a straightforward prompt, and if GPT goes off in the wrong direction, correct it — but there are places where the right prompt can indeed help a great deal. Concurrent programming in Java is one of those places. GPT’s preferred solution is to just slap synchronized on everything and call it a day. I found that telling it to think in the style of concurrency wizard Cliff Click helps a great deal. More recently, I’ve also switched to using a lightly edited version of Jeremy Howard’s system prompt.

Looking at this list, it’s striking how well it fits with the rule of thumb that AI is like having infinite interns at your disposal. Interns do best with self-contained problems, are often reluctant to contradict their team lead and frequently it’s easiest to just finish the job yourself rather than explain what you want in enough detail that the intern can do it. (While I recommend resisting the temptation to do that with real interns, with GPT it doesn’t matter.)

Advanced Data Analysis

Advanced Data Analysis, formerly known as Code Interpreter — also part of ChatGPT — is next level, and I wish it were available for Java yesterday. It wraps GPT-4 Python code generation in a Jupyter-like sandbox and puts it in a loop to correct its own mistakes. Here’s an example from when I was troubleshooting why my indexing code was building a partitioned graph.

The main problem to watch for is that ADA likes to “solve” problems with unexpected input by throwing the offending lines away, which usually isn’t what you want. And it’s usually happy with its efforts once the code runs to completion without errors – you will need to be specific about sanity checks that you want it to include. Once you tell it what to look for, it will add that to its “iterate until it succeeds” loop, and you won’t have to keep repeating yourself.

Also worth mentioning: The rumor mill suggests that ADA is now running a more advanced model than regular GPT-4, with (at minimum) a longer context window. I use ADA for everything by default now, and it does seem like an improvement; the only downside is that sometimes it will start writing Python for me when I want Java.

Claude

Claude, from Anthropic, is a competitor of OpenAI’s GPT. For writing code, Claude is roughly at GPT-3.5 level — it’s noticeably worse than GPT-4.

But Claude has a 100,000-token context window, which is over 10 times what you get with GPT-4. (OpenAI just announced an Enterprise ChatGPT that increases GPT-4’s context window to 32,000 tokens, which is still only a third of Claude’s.)

I used Claude for three things:

  1. Pasting in entire classes of Cassandra code to help figure out what they do.
  2. Uploading research papers and asking questions about them.
  3. Doing both at once: Here’s a research paper; here’s my implementation in Java. How are they different? Do those differences make sense given constraints X and Y?

Bing and Phind

Bing Chat got a bunch of attention when it launched earlier this year, and it’s still a good source of free GPT-4 (select the “Creative” setting), but that’s about it. I have stopped using it almost entirely. Whatever Microsoft did to Bing’s flavor of GPT-4 made it much worse at writing code than the version in ChatGPT.

Instead, when I want AI-flavored search, I use Phind. It’s what Bing should have been, but for whatever reason a tiny startup out-executed Microsoft on one of its flagship efforts. Phind has completely replaced Google for my “how do I do X”-type questions in Java, Python, git and more. Here’s a good example of solving a problem with an unfamiliar library. On this kind of query, Phind almost always nails it — and with relevant sources, too. In contrast, Bing will almost always cite at least one source as saying something different than it actually does.

Bard

I haven’t found anything that Bard is good at yet. It doesn’t have GPT-4’s skill at writing code or Claude’s large context window. Meanwhile, it hallucinates more than either.

Making Coding Productive — and Fun

Cassandra is a large and mature codebase, which can be intimidating to a new person looking to add a feature — even to me, after 10 years spent mostly on the management side. If AI is going to help any of us move faster, this is the way. ChatGPT and related AI tooling are good at writing code to solve well-defined problems, whether as part of a larger project designed by a human engineer or as one-off tooling. They are also useful for debugging, sketching out prototypes and exploring unfamiliar code.

In short, ChatGPT and Copilot were key to meeting our deadline. Having these tools makes me 50% to 100% more productive, depending on the task. They have limitations, but they excel at iterating on smaller tasks and help their human supervisor stay in the zone by acting as a tireless, uncomplaining partner to bounce ideas off. Even if you have years of programming experience, you should try them.

Finally, even without the productivity gains, coding with an AI that helps with the repetitive parts is just more fun. It’s given me a second wind and a new level of excitement for building cool things. I look forward to using more advanced versions of these tools as they evolve and mature.

Try building on Astra DB with vector search.

The post How AI Helped Us Add Vector Search to Cassandra in 6 Weeks appeared first on The New Stack.

]]>
Backstage in Production: Considerations for Platform Teams https://thenewstack.io/backstage-in-production-considerations-for-platform-teams/ Tue, 05 Sep 2023 19:47:13 +0000 https://thenewstack.io/?p=22717447

The developer portal is a prominent aspect of most platforms, as it’s a privileged point of interaction between you and

The post Backstage in Production: Considerations for Platform Teams appeared first on The New Stack.

]]>

The developer portal is a prominent aspect of most platforms, as it’s a privileged point of interaction between you and your users. Developer portals reflect the features of the platform through a centralized UI, which means they must be tailored to your developers and the capabilities you want to provide.

Here’s where Backstage shines: customizability. You can make the developer portal of your dreams with Backstage, which could include replacing the UI with your organization’s design system or bringing your own data consumption mechanism. This is possible because Backstage is not a ready-made developer portal, but a framework that provides the building blocks to build one.

However, developer portals are web apps. Thus, when you adopt and extend Backstage from scratch, you’re signing up for its full-stack consequences. For this reason, Gartner and others have reported that setting up and maintaining Backstage yourself can be challenging, yet doing so still has overwhelming benefits for many companies.

With that said, there is no one-size-fits-all way to adopt Backstage. When you set out to stand up Backstage yourself, you’ll run into a few common tasks nobody told you were part of adopting the framework. In this article, I’ll walk you through a few considerations to make when planning your team’s work.

Initial Setup and Deployment

Backstage provides a create-app command through its command line interface (CLI) to help you get started with a new instance. The result will run fine on your machine, but from this point on, you still have some work to do to make it a production-ready developer portal.

My recommendation for a Backstage proof of concept is to implement a single meaningful integration first, like GitHub. This will let you go through all the touch points, from React and Node configs to deployment.

Your developer portal most likely will have to connect data from various sources through integrations. Therefore you’ll need to implement a secret-management strategy that lets you inject secrets into the container that will be running Backstage.
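As a small sketch of what that can look like, Backstage’s app-config.yaml resolves environment variables at startup, so the secret itself can live in whatever secret manager feeds your container. The GitHub integration and variable name here are just examples:

# app-config.production.yaml (fragment)
integrations:
  github:
    - host: github.com
      # Resolved from the container's environment at startup; the value should
      # be injected by your secret manager, never committed to the repo.
      token: ${GITHUB_TOKEN}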

In terms of deployment, the Backstage team recommends using what you normally would for a similar application. Thus, you can benefit from applying your standard CI/CD practices to your developer portal.

In the case of a Roadie-managed Backstage instance, all these considerations are built into the product so you don’t have to invest time into any of them.

Authentication and Security

Your developer portal is a one-stop shop that integrates third-party services such as GitHub, Argo CD or PagerDuty. The developer portal will allow users to request infrastructure through its self-service or golden paths capabilities. Therefore, it is important to ensure that your Backstage instance is secure.

First, you’ll need to install and set up an authentication mechanism. Thankfully, Backstage offers open source integrations with 13 providers from Okta to Google IAP.

Next, you’ll need to use the permissions framework that comes with Backstage. By default, Backstage’s backend routes are not protected behind authentication because there’s an openness assumption in a developer portal.
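A hedged sketch of both pieces in app-config.yaml, assuming GitHub as the auth provider (the client ID and secret variable names are placeholders):

# app-config.production.yaml (fragment)
auth:
  environment: production
  providers:
    github:
      production:
        clientId: ${AUTH_GITHUB_CLIENT_ID}
        clientSecret: ${AUTH_GITHUB_CLIENT_SECRET}

permission:
  # Turns on the permission framework; you still need to supply a policy in
  # the backend that decides which users can perform which actions.
  enabled: true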

Additionally, I recommend you set up your scaffolder to execute tasks in ephemeral environments. Doing this at Roadie from the beginning prevented all of our customers from being affected by last year’s Backstage remote code execution vulnerability.

Always remember to keep an eye on the security patches released by the Backstage team and upgrade your instance regularly.

Upgrades and Maintenance

The Backstage team merges around 300 pull requests monthly from many contributors, resulting in minor version releases every second week. This process gives the framework an impressive flow of features and bug fixes.

I recommend adding upgrades to your planning regularly. Backstage’s upgrade process currently involves a few manual steps, with varying complexity for each release.

Be aware that some improvements arrive as an additional API you need to hook into your developer portal, or as a new set of UI components that can benefit your instance.

Most importantly, it’s useful to stay tuned to new features as sometimes you have to opt into them even after you upgrade Backstage. I write the Backstage Weekly newsletter, so you don’t have to go through the codebase yourself. Feel free to sign up.

Working with Plugins

There are more than 100 plugins available in the Backstage ecosystem, and you’re encouraged to build your own plugin to integrate your unique development needs into Backstage.

Plugins are usually implemented in two or three packages: backend, frontend and common. Plugins can also provide extension points, so they can be customized or adapted for different circumstances.

The Backstage community is actively working on simplifying the process of installing and upgrading plugins, but it remains a bit of manual work at the moment and requires you to redeploy your instance.

When authoring a plugin, be aware that Backstage is consolidating a new backend system with simplified APIs, so it might be worth checking out.

A Backstage Instance Is Just the First Step

Or maybe it’s the second step? You first need to identify what you want your developer portal to solve. Then once you set up an instance, you’ll be on track to a longer-term journey onboarding more use cases and iterating on your developer portal as you learn from your developers.

If you want to adopt Backstage but don’t want to own its implementation or maintenance, Roadie.io offers hosted and managed Backstage instances. Roadie offers a no-code Backstage experience and more refined features while remaining fully compatible with the open source software framework. Additionally, Roadie offers fully-fledged Scorecards to measure the software quality and maturity across your organization.

If you’re interested in learning about the advantages and trade-offs of managed production-grade Backstage instances vs. self-hosted ones, check out our white paper.

The post Backstage in Production: Considerations for Platform Teams appeared first on The New Stack.

]]>
Change Data Capture for Real-Time Access to Backend Databases https://thenewstack.io/change-data-capture-for-real-time-access-to-backend-databases/ Tue, 05 Sep 2023 15:05:52 +0000 https://thenewstack.io/?p=22717272

In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are

The post Change Data Capture for Real-Time Access to Backend Databases appeared first on The New Stack.

]]>

In a recent post on The New Stack, I discussed the emergence and significance of real-time databases. These databases are designed to support real-time analytics as a part of event-driven architectures. They prioritize high write throughput, low query latency (even for complex analytical queries that include filters, aggregates and joins) and high levels of concurrent requests.

This highly-specialized class of database, which includes open source variants such as ClickHouse, Apache Pinot and Apache Druid, is often the first choice when you’re building a real-time data pipeline from scratch. But more often than not, real-time analytics is pursued as an add-on to an existing application or service, where a more traditional, relational database like PostgreSQL, SQL Server or MySQL has already been collecting data for years.

In the post I linked above, I also briefly touched on how these online transactional processing (OLTP) databases aren’t optimized for analytics at scale. When it comes to analytics, they simply cannot deliver the same query performance at the necessary levels of concurrency. If you want to understand why in more detail, read this.

But the Internet Is Built on These Databases!

Row-based databases may not work for real-time analytics, but we can’t get around the fact that they are tightly integrated with backend data systems around the world and across the internet. They’re everywhere, and they host critical data sets that are integral to and provide context for many of the real-time systems and use cases we want to build. They store facts and dimensions about customers, products, locations and more that we want to use to enrich streaming data and build more powerful user experiences.

So, what are we to do? How do you bring this row-oriented, relational data into the high-speed world of real-time analytics? And how do you do it without overwhelming your relational database server?

Here’s How Not to Do It

Right now, the prevailing pattern to get data out of a relational database and into an analytical system is using a batch extract, transform, load (ETL) process scheduled with an orchestrator to pull data from the database, transform it as needed and dump it into a data warehouse so the analysts can query it for the dashboards and reports. Or, if you’re feeling fancy, you go for an extract, load, transform (ELT) approach and let the analytics engineers build 500 dbt models on the Postgres table you’ve replicated in Snowflake.

This may as well be an anti-pattern in real-time analytics. It doesn’t work. Data warehouses make terrible application backends, especially when you’re dealing with real-time data.

Batch ETL processes read from the source system on a schedule, which not only introduces latency but also puts strain on your relational database server.

ETL/ELT is simply not designed for serving high volumes of concurrent data requests in real-time. By nature, it introduces untenable latency between data updates and their availability to downstream consumers. With these batch approaches, latencies of more than an hour are common, with five-minute latencies about as fast as can be expected.

And finally, ETLs put your application or service at risk. If you’re querying a source system (often inefficiently) on a schedule, that puts a strain on your database server, which puts a strain on your application and degrades your user experience. Sure, you can create a read replica, but now you’re doubling your storage costs, and you’re still stuck with the same latency and concurrency constraints.

Change Data Capture (CDC) to the Real-Time Rescue

Hope is not lost, however, thanks to real-time change data capture (CDC). CDC is a method of tracking changes made to a database such as inserts, updates and deletes, and sending those changes to a downstream system in real time.

Change data capture works by monitoring a transaction log of the database. CDC tools read the transaction log and extract the changes that have been made. These changes are then sent to the downstream system.

Change data capture tools read from the database log file and propagate change events to a message queue for downstream consumers.

The transaction log, such as PostgreSQL’s Write Ahead Log (WAL) or MySQL’s “bin log,” chronologically records database changes and related data. This log-based CDC minimizes the additional load on the source system, making it superior to other methods executing queries directly on source tables.

CDC tools monitor these logs for new entries and append them to a topic on an event-streaming platform like Apache Kafka or some other message queue, where they can be consumed and processed by downstream systems such as data warehouses, data lakes or real-time data platforms.
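To make that concrete, here is a rough sketch of a log-based CDC connector for PostgreSQL, declared as a Strimzi KafkaConnector resource running Debezium. Strimzi, the hostnames, credentials and table list are assumptions for illustration, and Debezium 2.x property names are assumed:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: shop-postgres-cdc
  labels:
    strimzi.io/cluster: my-connect-cluster   # the Kafka Connect cluster that runs this connector
spec:
  class: io.debezium.connector.postgresql.PostgresConnector
  tasksMax: 1
  config:
    plugin.name: pgoutput                 # read the Write Ahead Log via the built-in pgoutput plugin
    database.hostname: shop-db.internal   # placeholder host
    database.port: 5432
    database.user: cdc_user
    database.password: change-me          # use a Kafka Connect config provider or secret in practice
    database.dbname: shop
    topic.prefix: shop                    # change events land on topics like shop.public.orders
    table.include.list: public.orders,public.customers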

Real-Time Analytics with Change Data Capture Data

If your service or product uses a microservices architecture, it’s highly likely that you have several (perhaps dozens!) of relational databases that are continually being updated with new information about your customers, your products and even how your internal systems are running. Wouldn’t it be nice to be able to run analytics on that data in real time so you can implement features like real-time recommendation engines or real-time visualizations in your products or internal tools like anomaly detection, systems automation or operational intelligence dashboards?

For example, let’s say you run an e-commerce business. Your website runs over a relational database that keeps track of customers, products and transactions. Every customer action, such as viewing products, adding to a cart and making a purchase, triggers a change in a database.

Using change data capture, you can keep these data sources in sync with real-time analytics systems to provide the up-to-the-second details needed for managing inventory, logistics and positive customer experiences.

Now, when you want to place a personalized offer in front of a shopper during checkout to improve conversion rates and increase average order value, you can rely on your real-time data pipelines, fed by the most up-to-date change data to do so.

How Do You Build a Real-Time CDC Pipeline?

OK, that all sounds great. But how do you build a CDC event pipeline? How do you stream changes from your relational database into a system that can run real-time analytics and then expose them back as APIs that you can incorporate into the products you’re building?

Let’s start with the components you’ll need:

  • Source data system: This is the database that contains the data being tracked by CDC. It could be Postgres, MongoDB, MySQL or any other such database. Note that the database server’s configuration may need to be updated to support CDC.
  • CDC connector: This is an agent that monitors the data source and captures changes to the data. It connects to a database server, monitors transaction logs and publishes events to a message queue. These components are built to navigate database schema and support tracking specific tables. The most common tool here is Debezium, an open source change data capture framework on which many data stack companies have built change data tooling.
  • Event streaming platform: This is the transport mechanism for your change data. Change data streams get packaged as messages, which are placed onto topics, where they can be read and consumed by many downstream consumers. Apache Kafka is the go-to open source tool here, with Confluent and Redpanda, among others, providing some flexibility and performance extensions on Kafka APIs.
  • Real-time database or platform: For batch analytics workflows like business intelligence and machine learning, this is usually a data warehouse or data lake. But we’re here for real-time analytics, so in this case, we’d go with a real-time database like those I mentioned above or a real-time data platform like Tinybird. This system subscribes to change data topics on the event streaming platform and writes them to a database optimized for low-latency, high-concurrency analytics queries.
  • Real-time API layer: If your goal, like many others, is to build user-facing features on top of change data streams, then you’ll need an API layer to expose your queries and scale them to support your new service or feature. This is where real-time data platforms like Tinybird provide advantages over managed databases, as they offer API creation out of the box. Otherwise, you can turn to tried-and-tested ORMs (object-relational mappings) and build the API layer yourself.

An example real-time CDC pipeline for PostgreSQL. Note that unless your destination includes an API layer, you’ll have to build one to support user-facing features.

Put all these components together, and you’ve got a real-time analytics pipeline built on fresh data from your source data systems. From there, what you build is limited only by your imagination (and your SQL skills).

Change Data Capture: Making Your Relational Databases Real Time

Change data capture (CDC) bridges the gap between traditional backend databases and modern real-time streaming data architectures. By capturing and instantly propagating data changes, CDC gives you the power to create new event streams and enrich others with up-to-the-second information from existing applications and services.

So what are you waiting for? It’s time to tap into that 20-year-old Postgres instance and mine it for all it’s worth. Get out there, research the right CDC solution for your database, and start building. If you’re working with Postgres, MongoDB or MySQL, here are some links to get you started:

The post Change Data Capture for Real-Time Access to Backend Databases appeared first on The New Stack.

]]>
4 Key Observability Best Practices https://thenewstack.io/4-key-observability-best-practices/ Fri, 01 Sep 2023 16:11:29 +0000 https://thenewstack.io/?p=22717190

With bigger systems, higher loads and more interconnectivity between microservices in cloud native environments, everything has become more complex. Cloud

The post 4 Key Observability Best Practices appeared first on The New Stack.

]]>

With bigger systems, higher loads and more interconnectivity between microservices in cloud native environments, everything has become more complex. Cloud native environments emit somewhere between 10 and 100 times more observability data than traditional, VM-based environments.

As a result, engineers aren’t able to make the most out of their workdays, spending more time on investigations and cobbling together a story of what happened from siloed telemetry, leaving less time to innovate.

Without the right observability set up, precious engineering time is wasted trying to sift through data to spot where a problem lies, rather than shipping new features — potentially introducing buggy features and affecting the customer experience.

So, how can modern organizations find relevant insights in a sea of telemetry and make their telemetry data work for them, not the other way around? Let’s explore why observability is key to understanding your cloud native systems, and four observability best practices for your team.

What Are the Benefits of Observability?

Before we dive into ways your organization can improve observability, lower costs and ensure smoother customer experience, let’s talk about what the benefits of investing in observability actually are.

Better Customer Experience

With better understanding and visibility into relevant data, your organization’s support teams can gain customer-specific insights to understand the impact of issues on particular customer segments. Maybe a recent upgrade works for all of your customers except for those under the largest load, or during a certain time window. Using this information, on-call engineers can resolve incidents quickly and provide more detailed incident reports.

Better Engineering Experience and Retention

By investing in observability, site reliability engineers (SREs) benefit from knowing the health of teams or components of the systems to better prioritize their reliability efforts and initiatives.

As for developers, benefits of observability include more effective collaboration across team boundaries, faster onboarding to new services/inherited services and better napkin math for upcoming changes.

Four Observability Best Practices

Now that we have a better understanding of why teams need observability to run their cloud native system effectively, let’s dive into four observability best practices teams can use to set themselves up for success.

1. Integrating with Developer Experience

Observability is everyone’s job, and the best people to instrument it are the ones who are writing the code. Maintaining instrumentation and monitors should not be a job just for the SREs or leads on your team.

A thorough understanding of the telemetry life cycle — the life of a span, metric or log — is key, from setting up configuration to emitting signals and any modifications or processing done before getting stored. If there is a high-level architecture diagram, engineers can better understand if or where their instrumentation gets modified (like aggregating or dropping, for example.) Often, this processing falls in the SRE domain and is invisible to developers, who won’t understand why their new telemetry is partially or entirely missing.

You can check out simple instrumentation examples in this OpenTelemetry Python Cookbook.

If there are enough resources and a clear need for a central internal tool, platform engineering teams should consider writing thin wrappers around instrumentation libraries to ensure standard metadata is available out of the box.

Viewing Changes to Instrumentation

Another way to enable developers is by providing a quick feedback loop when instrumenting locally, so that they can view changes to the instrumentation before merging a pull request. This recommendation is helpful for training purposes and for teammates who are new to instrumenting or unsure how to approach it.

Updating the On-Call Process

Updating the on-call onboarding process to pair a new engineer with a tenured one for production investigations can help distribute tribal knowledge and orient the newbie to your observability stack. It’s not just the new engineers who benefit. Seeing the system through new eyes can challenge seasoned engineers’ mental models and assumptions. Exploring production observability data together is a richly rewarding practice you might want to keep after the onboarding period.

You can check out more in this talk from SRECon, “Cognitive Apprenticeship in Practice with Alert Triage Hour of Power.”

2. Monitor Observability Platform Usage in More than One Way

For cost reasons, becoming comfortable with tracking the current telemetry footprint and reviewing options for tuning — like dropping data, aggregating or filtering — can help your organization better monitor costs and platform adoption proactively. The ability to track telemetry volume by type (metrics, logs, traces or events) and by team can help define and delegate cost-efficiency initiatives.
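As one hedged example of what “dropping data” can look like in practice, if an OpenTelemetry Collector sits in your pipeline, its filter processor can discard low-value telemetry such as health-check spans before it ever reaches (and gets billed by) your backend. The endpoint and matched attribute below are illustrative:

receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  batch: {}
  filter/drop-health-checks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/healthz"'   # drop spans for health-check requests

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com   # placeholder backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/drop-health-checks, batch]
      exporters: [otlphttp]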

Once you’ve gotten a handle on how much telemetry you’re emitting and what it’s costing you, consider tracking the daily and monthly active users. This can help you pinpoint which engineers need training on the platform.

These observability best practices for training and cost will lead to better understanding the value that each vendor is providing you, as well as what’s underutilized.

3. Center Business Context in Observability Data

Surfacing the business context in a pile of observability data can shortcut high-stakes situations in different ways:

  • By making it easier to translate incidents into the workflows and functionality they affect from a user perspective.
  • By creating a more efficient onboarding process for engineers.

One way to center business context in observability data is by renaming default dashboards, charts and monitors.

4. Un-Silo Your Telemetry

Teams need better investigations. One way to ensure a smoother remediation process is through an organized process like following breadcrumbs rather than having 10 different bookmark links and a mental map of what data lives where.

One way to do this is by understanding what telemetry your system emits from metrics, logs and traces and pinpointing the potential duplication or better sources of data. To achieve this, teams can create a trace-derived metric that represents an end-to-end customer workflow, such as:

  • “Transfer money from this account to that account.”
  • “Apply for this loan.”
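One hedged way to produce such a trace-derived metric is the OpenTelemetry Collector’s spanmetrics connector, which turns spans into request-count and latency metrics grouped by attributes you choose. The workflow attribute and backend endpoint below are assumptions about your setup:

receivers:
  otlp:
    protocols:
      grpc: {}

connectors:
  spanmetrics:
    dimensions:
      - name: workflow.name   # e.g. "transfer-money", set as a span attribute by your services
    histogram:
      explicit:
        buckets: [100ms, 250ms, 500ms, 1s, 2s]

exporters:
  otlphttp:
    endpoint: https://observability-backend.example.com   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]   # spans feed the connector...
    metrics:
      receivers: [spanmetrics]   # ...which emits workflow-level metrics
      exporters: [otlphttp]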

Regardless of whether you’re sending to multiple vendors or a mix of DIY in-house stack and vendors, ensuring that you are able to link data between systems — such as adding the traceID to log lines, or a dashboard note with links to preformatted queries for relevance — will add that extra support for your team to perform better investigations and remediate issues faster.

Explore Chronosphere’s Future-Proof Solution

Engineering time comes at a premium. The more you invest in high-fidelity insights and in helping engineers understand what telemetry is available, the more fearless instrumenting becomes, the faster troubleshooting gets and the better positioned your team is to make future-proof, data-informed decisions when weighing options.

As companies transition to cloud native, uncontrollable costs and rampant data growth can stop your team from performing successfully and innovating. That’s why cloud native requires more reliability and compatibility with future-proof observability. Take back control of your observability today, and learn how Chronosphere’s solutions manage scale and meet modern business needs.

The post 4 Key Observability Best Practices appeared first on The New Stack.

]]>
The Past, Present and Future of Multitenancy https://thenewstack.io/the-past-present-and-future-of-multitenancy/ Fri, 01 Sep 2023 13:56:52 +0000 https://thenewstack.io/?p=22717152

Multitenanted software arrived alongside agile software development, Software as a Service (SaaS) and cloud computing. To understand modern multitenancy, we

The post The Past, Present and Future of Multitenancy appeared first on The New Stack.

]]>

Multitenanted software arrived alongside agile software development, Software as a Service (SaaS) and cloud computing. To understand modern multitenancy, we must look at how existing practices shaped early multitenancy and dig into the technological advancements that mean we must revise how we approach it.

Let’s look at how software was built around the turn of the century. You’d likely create a gold copy of your application once or twice a year. This would be burned onto CD-ROMs that systems administrators and end users would use to install the software.

For line-of-business applications, you’d use a network location instead of a disc and a checklist instead of an installation wizard.

Instead of self-hosting your purchased software, you could pay an application-service provider to manage it. The provider would manage the infrastructure and installations for you. This led to the idea of the software vendor offering a managed instance on your behalf, which we now call SaaS.

SaaS and Multitenancy

Once the same organization both built and operated the software, it inevitably found that it could reduce infrastructure and license costs by sharing the software among many customers.

A customer self-hosting an application could use virtualization to reduce the operational cost of the software, but a SaaS provider could reduce costs even further by sharing application instances between many customers.

The three hosting models are shown below, along with the nonshared costs that can be attributed to a single tenant.

  • Dedicated infrastructure: nothing is shared. Nonshared costs: power, racks, networking, servers, operating systems, application licenses.
  • Virtual machines: servers are shared. Nonshared costs: operating systems, application licenses.
  • Multitenanted software: the application instance is shared. Nonshared costs: application licenses.

This comparison to dedicated instances understandably influenced early multitenancy. Designs for multitenancy would attempt to replicate the boundary created by client-specific software installations by enforcing a perimeter around the users and data for each tenant. The concerns held by security-conscious organizations would have reinforced this design.

While it was easy to predict the infrastructure costs and savings, the total cost of multitenanted software also included intangible costs for operational complexity and code complexity.

The configuration must become more dynamic to support multiple tenants on a single application instance, and the information boundary must be carefully managed. This can result in an increased cost for new features.

Operationally, while there are fewer instances to manage, the instance must meet new scaling demands. A new class of problems emerged when tenant-specific data needed to be managed within a shared database.

The cost-benefit analysis for multitenanted software would have placed more weight on the tangible costs. For example, counting the number of servers and operating system licenses is a simple task compared to predicting the future costs associated with the additional code complexity.

Technical Advancements

Since the emergence of multitenanted software, several technical advancements have been made that directly affect it, namely containers, deployment automation and infrastructure automation.

Virtualization has been transformed with the arrival of containers, which provide isolation without the weight and cost of multiple operating systems. Deployment automation eliminates the costs of manually installing software and upgrades, which makes deploying a thousand instances as easy as deploying one. Infrastructure automation does the same for provisioning dedicated cloud infrastructure or containers for each tenant.

Combining these developments with the experience gained from two decades of multitenanted software architecture requires a new approach. The economics have changed dramatically.

The Future of Multitenancy

The old dichotomy was multitenancy versus dedicated physical or virtual infrastructure. This led to a binary choice between single or multitenanted software. We must adjust this view by considering multitenancy as a whole-system design process.

There are two dimensions to consider in this new approach: layers of sharing and component-level granularity for decisions. Where we used to consider whether an application instance was multitenanted, we now look at each layer to decide if it should be shared. We can combine a tenant-specific database with a single application instance or share only the codebase and the CI/CD pipeline with tenant-specific infrastructure provisioned automatically.

Equally, we don’t make this decision once for the whole system but per component. This allows us to design a system with tenant-specific instances that call shared multitenanted services. This can be particularly powerful if the service is tenant unaware, such as a stateless service or one whose state is not tenant-specific.
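As a purely hypothetical sketch of that shape on Kubernetes, each tenant gets its own namespace, application instance and database connection while calling a shared, tenant-unaware service. Every name, image and URL below is made up for illustration:

apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  namespace: tenant-acme
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tenant-app
  template:
    metadata:
      labels:
        app: tenant-app
    spec:
      containers:
        - name: app
          image: registry.example.com/line-of-business-app:1.0
          env:
            - name: DATABASE_URL
              value: postgres://tenant-acme-db:5432/app                   # tenant-specific data layer
            - name: PRICING_SERVICE_URL
              value: http://pricing.shared-services.svc.cluster.local     # shared, tenant-unaware service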

By broadening the view of multitenancy to apply to software, infrastructure and CI/CD pipelines, it is possible to design systems that take advantage of lightweight tenant-specific applications, isolated by modern virtualization and calling out to scalable shared services. A more thoughtful approach can be taken regarding data that should be protected for a tenant and data that isn’t tenant-specific.

This new approach means we can get a better system-level outcome. We can have stronger data isolation without the drawbacks of earlier forms of dedicated infrastructure.

Summing It up

While early multitenancy was defined by sharing an instance of the application and database, technological advancements require a broader perspective and a more granular decision-making approach. The only way to find the right balance between the trade-offs is to consider the widest range of options available.

You should no longer view a tenant as an analog for dedicated isolated instances of users and data. A tenant is a view of data that must be protected. It’s common for sets of data to be stored outside the boundary of a tenant and for users to have access to multiple tenants.

Aside from the technology changes, businesses are more security-conscious than ever, as shown by the increase in ISO 27001 certifications since 2005.

Component-level decisions about the layers of sharing define modern multitenancy.

The post The Past, Present and Future of Multitenancy appeared first on The New Stack.

]]>
Governance Engineering Breaks Down the Silos in Regulated Software https://thenewstack.io/governance-engineering-breaks-down-the-silos-in-regulated-software/ Thu, 31 Aug 2023 17:00:46 +0000 https://thenewstack.io/?p=22716816

For as long as we can remember, the big conversation in software engineering has been about tension between dev and

The post Governance Engineering Breaks Down the Silos in Regulated Software appeared first on The New Stack.

]]>

For as long as we can remember, the big conversation in software engineering has been about tension between dev and ops. The core chronic conflict came down to two tribes who spoke different languages, had different values and were rewarded by different incentives. What’s worse is that these different factions also were siloed into separate organizational units with little incentive to collaborate.

This realization gave rise to the DevOps movement, ushering in a new culture of collaboration, automation, measurement, lean and sharing.

But, if you’re delivering software in a regulated industry like financial services or healthcare you might also be experiencing a new kind of conflict.

In these industries, DevOps comes with extra challenges that are still unsolved. Because, while engineering teams are technically capable of deploying multiple times per day, they have governance concerns that are still being performed by legacy manual processes for compliance, audit and security. And these governance requirements are set and monitored by folks in security, change management, audit and risk management in siloed parts of the organization.

The good news is that the solution to this type of problem is known: DevOps.  Or as Mark Twain said, “History never repeats itself, but it does often rhyme.”

Regulated Software

In today’s interconnected world, software permeates every aspect of our lives. It enables us to share cat pictures, read cinema reviews and (my favorite) play Wordle. However, software’s influence extends far beyond entertainment; it powers our financial systems, propels our vehicles and even controls life-saving medications.

When software of this importance fails, the consequences can be significant. Companies developing critical software must navigate substantial risks, driven by legal requirements and the imperative of maintaining their brand reputation.

The complex task of managing these risks is known as software governance. Yet in many organizations, the individuals responsible for software governance operate in isolated silos, disconnected from each other and from the broader context.

The Challenge of Governance Silos

Within organizations operating in high-stakes environments, specialized governance processes and roles have been established to manage risks effectively. These roles encompass a diverse range of specialists, including risk officers, change managers, security experts, compliance and legal professionals, and quality officers. Despite their different titles and responsibilities, their overarching objective remains the same: to safeguard the company against risk.

Software governance teams speak in terms of risk, value safety, and are rewarded by compliance

Each specialist group speaks the language of risk, values safety and is rewarded by the company according to compliance. These individuals possess the authority to establish standards, policies, and guardrails to mitigate risks and are often tasked with inspecting and ensuring adherence to these standards.

What’s surprising about this siloed arrangement is that it is often deliberate. For instance, financial institutions adopt the Three Lines of Defense strategy, which provides independent oversight for risk management.

However, a notable challenge arises from this setup: these governance specialists rarely have a comprehensive understanding of the technical risks involved. Their expertise lies in their respective domains of risk management, compliance and legal frameworks, which may not encompass the intricacies of the software systems themselves.

This knowledge gap poses a significant handicap in ensuring effective governance, as it hampers the ability of these specialists to assess and address the specific technical risks that underlie regulated software.

The Challenge of Engineering Silos

As organizations recognize the outsize value that technology brings, every company has become a software company.

Software engineering teams speak in terms of tech, value freedom, and are rewarded by speed

Engineers, who form the backbone of software development, speak the language of technology. They value freedom in their work and are rewarded by the speed of their delivery. However, they often find themselves in a situation in which they don’t speak the same language as the governance specialists.

Technologists struggle to understand why there is an abundance of red tape and bureaucratic processes surrounding their software development. They may feel disconnected from the objectives and constraints imposed by governance and compliance requirements. And they largely feel disempowered to influence or change these processes.

The Wall of Confusion in Governance

This chasm in language, values and rewards leads to a disconnect between the engineering teams and the governance specialists, resulting in a chronic breakdown: the wall of confusion.

The conflict at the heart of governance engineering

Governance silos set rules that engineering doesn’t understand or control

One of the key issues behind the wall of confusion is rules and processes that engineering teams struggle to understand or control. Examples of such rules include segregation of duties and change approvals. These directives are often imposed without clear context or explanation of the underlying risks. What’s worse, the implementation of these rules often gets ossified in legacy, one-size-fits-all processes that don’t keep up with other tech improvements.

Governance teams set rules that don't make sense to engineering

All of this leads to frustration and confusion among engineers. Without a comprehensive understanding of why these rules are in place, or the specific risks they aim to mitigate, engineers can perceive them as unnecessary bureaucratic hurdles. This lack of context and transparency can breed resistance, non-compliance and poor governance.

Engineering delivers compliance evidence that governance doesn’t understand

And the confusion runs both ways! When it comes time to validate compliance through an audit, the evidence provided takes the form of tickets, Docker image SHAs and git commits that are impossible for a non-engineer to navigate.

Engineering teams provide evidence that doesn't make sense to governance teams

So a simple question from an auditor like “Can you tell me every change to production?” quickly escalates into spelunking through a multitude of incomprehensible CI logs.

All this results in poor risk management, massive amounts of toil and frustration at audit, and ultimately clogs and demotivates engineering.

Toward a Better Approach with Governance Engineering

The fantastic news is this: we have seen the problem before, and we already know the solution. This is exactly the chronic conflict that used to define the tension between dev and ops, and the answer is DevOps!

The solution is to bring these two disciplines together to collaborate on risk management using the knowledge from both sides of the wall. A combination of Culture, Automation, Lean, Measurement and Sharing (CALMS) and holistic thinking can have an outsize positive impact.
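
One concrete form that collaboration can take is expressing a governance rule as code that both sides can read, review and test. The sketch below is illustrative only; it assumes a hypothetical change-request model rather than any specific framework’s API, and it encodes a basic segregation-of-duties rule: a production change must be approved by someone other than its author.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    author: str
    target_env: str
    approvers: set[str] = field(default_factory=set)

def violates_segregation_of_duties(change: ChangeRequest) -> bool:
    """A production change must be approved by at least one person other than its author."""
    if change.target_env != "production":
        return False
    independent_approvers = change.approvers - {change.author}
    return len(independent_approvers) == 0

# Blocked: the only approval comes from the author.
assert violates_segregation_of_duties(
    ChangeRequest(author="alice", target_env="production", approvers={"alice"}))
# Allowed: a second person has reviewed and approved.
assert not violates_segregation_of_duties(
    ChangeRequest(author="alice", target_env="production", approvers={"alice", "bob"}))
```

Because the rule is versioned and testable, governance can verify the risk is covered and engineering can see exactly why the guardrail exists.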

And we already see a lot of first steps towards bringing governance and engineering together. Books have been written on the subject, and the beginnings of a community are forming.

What’s missing is a name for this. And then Bill Bensing had an insight that he shared in a talk at DOES Vegas last year: What if we applied SRE principles to software governance? Or, in his words:

“Governance engineering is what happens when you ask a software engineer to design a governance team.”

So what is governance engineering? Well, it’s DevOps, of course! ;-) Just this time with governance folks brought into the fold.
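
As a small illustration of what that can look like, the auditor’s “Can you tell me every change to production?” question from earlier can be answered by generating a readable report from records engineering already produces. The record fields and wording below are assumptions for the sketch, not a real audit standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Deployment:
    """One production change, as engineering systems already record it."""
    deployed_at: datetime
    service: str
    git_commit: str
    image_sha: str
    approved_by: str
    ticket: str

def production_change_report(deployments: list[Deployment]) -> str:
    """Render engineering evidence as a report an auditor can actually read."""
    lines = ["Changes to production, most recent first:"]
    for d in sorted(deployments, key=lambda d: d.deployed_at, reverse=True):
        lines.append(
            f"- {d.deployed_at:%Y-%m-%d %H:%M} UTC: {d.service} updated to commit "
            f"{d.git_commit[:8]} (image {d.image_sha}), approved by {d.approved_by}, "
            f"tracked in {d.ticket}"
        )
    return "\n".join(lines)

print(production_change_report([
    Deployment(datetime(2023, 8, 30, 14, 5, tzinfo=timezone.utc), "payments-api",
               "9f2c1ab7e4d", "sha256:aa81f04c2b9d", "bob", "CHG-1042"),
]))
```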

If you’d like to learn more about how folks are working with governance engineering in real life, you can connect with the community over at the Governance Engineering LinkedIn group.

The post Governance Engineering Breaks Down the Silos in Regulated Software appeared first on The New Stack.

How to Give Developers Cloud Security Tools They’ll Love https://thenewstack.io/how-to-give-developers-cloud-security-tools-theyll-love/ Thu, 31 Aug 2023 15:10:43 +0000 https://thenewstack.io/?p=22717100

There are few better ways to make developers resent cybersecurity than to impose security tools on them that get in the way of development operations.

After all, although many developers recognize the importance of securing applications and the environments that host them, their main priority as software engineers is to build software, not to secure it. If you burden them with security tools that hamper their ability to write code efficiently, you’re likely to get resistance to the solutions, along with rampant security risks, because your developers may not take the tools seriously or use them to maximum effect.

Fortunately, that doesn’t have to be the case. There are ways to square the need for rigorous security tools with developers’ desire for efficiency and flexibility in their own work. Here are some tips to help you choose the right security tools and features to ensure that security solutions effectively mitigate risks without burdening developers.

What to Look for in Modern Cloud Security Tools

There are many types of security tools out there, each designed to protect a specific type of environment or stage of the software delivery life cycle, or to guard against a certain type of risk. You might use “shift left” security tools to detect security risks early in the software delivery pipeline, for example, while relying on cloud security posture management (CSPM) and cloud identity and entitlement management (CIEM) solutions to detect and manage risks within the cloud environments that host applications.

You could leverage all of these features via an integrated cloud native application protection platform (CNAPP) solution, or you could implement them individually, using separate tools for each one.

However, regardless of the type of security tools you need to deploy or types of risks you’re trying to manage, your solutions should provide a few key benefits to ensure they don’t get in the way of developer productivity.

Context-Aware Security

Context-aware security is the use of contextual information to assess whether a risk exists in the first place, and if so, the potential severity of that risk. It’s different from a more-generic, blunter approach to security wherein all potential risks are treated the same, regardless of context.

The key benefit of context-aware security for developers is that it’s a way of balancing security requirements with usability and productivity. Based on the context of each situation, your security tools can evaluate how rigorously to deploy protections that may slow down development operations.

For example, imagine that you’ve configured multifactor authentication (MFA) by default for the source code management (SCM) system that your developers use. In general, requiring MFA to access source code is a best practice from a security perspective because it reduces the risk of unauthorized users being able to inject malicious code or dependencies into your repositories. However, having to enter multiple login factors every time developers want to push code to the SCM or view its status can slow down operations.

To provide a healthy balance between risk and productivity in this case, you could deploy a context-aware security platform that requires MFA by default when accessing the SCM but only requires one login factor when a developer connects from the same IP address and during the same time window from which he or she has previously connected. Based on contextual information, lighter security protections can be deployed in some circumstances so that developers can work faster.
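
As a rough sketch of how that decision could be expressed, the example below assumes two contextual signals, the last verified IP address and the time of the last successful MFA, plus a 24-hour window. The signals, the window and the function name are all assumptions for illustration, not the behavior of any particular product.

```python
from datetime import datetime, timedelta, timezone

def require_full_mfa(ip: str, now: datetime,
                     last_verified_ip: str | None,
                     last_mfa_at: datetime | None,
                     window: timedelta = timedelta(hours=24)) -> bool:
    """Challenge with full MFA unless the request matches recently verified context."""
    if last_verified_ip is None or last_mfa_at is None:
        return True                        # no prior context: always challenge
    same_network = ip == last_verified_ip  # a real system might compare subnets or geolocation
    recently_verified = now - last_mfa_at <= window
    return not (same_network and recently_verified)

now = datetime.now(timezone.utc)
# Same IP, MFA passed two hours ago: a single login factor is enough.
assert not require_full_mfa("203.0.113.7", now, "203.0.113.7", now - timedelta(hours=2))
# New IP address: fall back to full MFA.
assert require_full_mfa("198.51.100.2", now, "203.0.113.7", now - timedelta(hours=2))
```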

Security Integrations

The more security tools you require developers to integrate with their own tooling, the harder their lives will be. Not only will the initial setup take a long time, but they’ll also be stuck having to update integrations every time they update their own tools.

To mitigate this challenge, look for security platforms that offer a wide selection of out-of-the-box integrations. Native integrations mean that developers can connect security tooling to their own tools quickly and easily, and that updates can happen automatically. It’s another way to ensure that development operations are secure, but without hampering developer efficiency or experience.

Comprehensive Protection

The more security features and protections you can deploy through a single platform, the fewer security tools and processes your developers will have to contend with to secure their own tools and resources.

This is the main reason why choosing a consolidated, all-in-one cloud security platform leads to a better developer experience. It not only simplifies tool deployment, but also gives developers a one-stop solution for reporting, managing and remediating risks. Instead of toggling through different tools to manage different types of security challenges, they can do it all from a single location, and then get back to their main job — development.

Getting Developers on Board with Security

At their worst, security tools are the bane of developers’ existence. They get in developers’ way and slow them down, and developers treat them as a burden they have to bear.

Well-designed, well-implemented security tools do the opposite. By using strategies such as context-aware security, broad integrations and comprehensive, all-in-one cloud security platforms, organizations can deploy the protections they need to keep IT resources secure while simultaneously keeping developers happy and productive.

Interested in strengthening your cloud security posture? The Orca Cloud Security Platform offers complete visibility and prioritized alerts for potential threats across your entire cloud estate. Sign up for a free cloud risk assessment or request a demo today to learn more.

The post How to Give Developers Cloud Security Tools They’ll Love appeared first on The New Stack.

How to Tackle Tool Sprawl Before It Becomes Tool Hell https://thenewstack.io/how-to-tackle-tool-sprawl-before-it-becomes-tool-hell/ Thu, 31 Aug 2023 13:18:03 +0000 https://thenewstack.io/?p=22717081

Today’s digital-first companies know their customers demand seamless, compelling experiences. But incidents are inevitable. That puts the pressure on operations teams already struggling with a heavy workload.

Teams looking for novel ways to tackle these challenges often hit a formidable roadblock in the form of tool sprawl. When the world is on fire, swivel-chairing between tools while trying to get the full picture is the last thing incident responders need as they try to resolve incidents and deliver a great customer experience. But complaining will get them nowhere. The key is to be able to articulate a business case for change to senior leaders.

Into the Valley of Tool Sprawl

Digital operations teams may have a slew of poorly connected tools across their environment, handling event correlation, internal communication, collaboration, workflows, status pages, customer-service case management, ticketing and more. Within each category, there may also be separate tools doing similar things. And they may be built to or governed by different standards, further siloing their operation and slowing things down.

Incident response is a collaborative process. It is also one where seconds and minutes of delay can have a real-world impact on customer experience and, ultimately, revenue and reputation.

Stakeholders from network teams, senior developers, database administrators, customer service and others may need to come together quickly to triage and work through incidents. Their ability to do so is impaired when much time and effort must be spent simply jumping between tools to get everyone on the same page and in the same place. That’s not to mention the extra licensing costs, the people needed to manage and maintain each tool and the additional security patching.

How to Tell the Right Story

Incident responders need a unified platform to tackle issues without constantly switching context. Integrating and consolidating tools can reduce sprawl and drive simplicity end to end, underpinned by a single set of standards. We’re talking about one common data model and one data flow, enabling teams to reduce costs and go faster, at scale.

Such platforms exist. However, engineers and developers typically don’t have the power to demand change and drive adoption. But that shouldn’t stop them from asking for change. To do this, they must play a longer game, one designed to influence those holding the purse strings. It’s about telling a story in the language that senior executives will understand. That means focusing on business impact.

Humans are naturally story-driven creatures, so senior leaders will likely respond well to real-life examples of how disruptive context switching can be. When speaking to them, teams should seek to bring problems to life with a story.

Consider the most recent incident that affected customers. How did your team identify and triage it? In many cases, teams don’t have a centralized place to capture incident context, so they have to chase information across systems to understand what happened and to gather the context needed to start remediation. This adds critical time to the process and, in larger incidents, erodes customer trust.

Once the issue has been identified, you then have to communicate to the right people. This involves a lot of tools to pull in incident responders and subject matter experts. On top of this, teams also need to communicate about incidents to business and customer stakeholders, which again requires switching between different systems to craft and send messages.

Much of this is manual work that could be automated, but that’s only possible from one place, not disparate systems. The intent isn’t to get to a single pane of glass, which can be a fool’s errand as tools and processes evolve, but building a first pane of glass with the necessary context to immediately resolve issues is a great target.
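
As a simple illustration of what automating from one place can mean, a single incident record can fan out one consistent update to every channel without anyone switching tools. The incident fields and channels below are hypothetical stand-ins, not a specific product’s API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Incident:
    id: str
    severity: str
    affected_service: str
    summary: str

def notify_all(incident: Incident, channels: list[Callable[[str], None]]) -> None:
    """Send one consistent update to every channel from a single incident record."""
    message = (f"[{incident.severity}] {incident.affected_service}: "
               f"{incident.summary} (incident {incident.id})")
    for send in channels:
        send(message)

# `print` stands in for real chat, status-page and paging integrations.
notify_all(
    Incident(id="INC-123", severity="SEV1",
             affected_service="payments", summary="Checkout latency above threshold"),
    channels=[print],
)
```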

Using this scenario, don’t be shy in naming all the specific tools and systems teams had to switch between to get to the end goal: uptime. Build a picture of the volume you are having to juggle. It’s also important to weave in the impact of the tool sprawl on the business.

A good starting point is to calculate how much time managing these disparate solutions added to resolving the last SEV 1 incident. Then multiply the figure by how many such incidents there were in the previous 12 months, and then work out how that translates into team costs.
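
A back-of-the-envelope version of that calculation might look like the following; every number here is made up purely to show the shape of the math.

```python
# Hypothetical figures purely for illustration; substitute your own incident data.
extra_minutes_per_sev1 = 45            # time lost to tool switching per SEV 1 incident
sev1_incidents_last_year = 24
responders_per_incident = 6
loaded_cost_per_person_hour = 120      # in your currency

extra_hours = (extra_minutes_per_sev1 / 60) * sev1_incidents_last_year * responders_per_incident
annual_team_cost = extra_hours * loaded_cost_per_person_hour
print(f"{extra_hours:.0f} responder-hours a year, roughly {annual_team_cost:,.0f} in team cost")
# -> 108 responder-hours a year, roughly 12,960 in team cost
```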

These are the kinds of calculations that can make a big impact on senior decision-makers. It’s about showing the financial and temporal impact of tool sprawl on incident response, and ultimately, the business. If the figure is impactful, it might be enough to start a conversation with the people who can make a difference. The same capability can then be applied to lower severity but more frequently occurring issues, which can solidify your position.

By bringing the problem to life and showing the business and, most importantly, customer impact, teams can have practical conversations with decision-makers that can help to drive change and bring incident response processes into one place.

One Tool to Rule Them All

The valley of tool sprawl is bad enough. But combine it with a deluge of manual processes, and you have a recipe for too much toil and multiple points of failure. Maintaining and managing multiple tools is time-consuming, unwieldy and expensive. It requires continuous training for staff and disrupts critical workflows at a time when seconds often count. In this context, something as simple as an operations cloud to capture incident context from multiple systems of record and automate incident workflows can make a huge difference to responder productivity.

Centralizing on a single, unified platform for digital operations should be a no-brainer. But to get there, teams have to engage senior decision-makers. It’s no use complaining that context switching between tools is causing problems.

The key is to back up that claim with data and stories that provide irrefutable proof. It’s the way to win over hearts, minds and wallets, and to lay a pathway out of the valley of tool sprawl toward optimized operations.

The post How to Tackle Tool Sprawl Before It Becomes Tool Hell appeared first on The New Stack.
