The Pillars of Platform Engineering: Part 6 — Observability https://thenewstack.io/the-pillars-of-platform-engineering-part-6-observability/ Wed, 27 Sep 2023

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part six addresses observability. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

The last leg of any platform workflow is the monitoring and maintenance of your deployments. You want to build observability practices and automation into your platform, measuring the quality and performance of software, services, platforms and products to understand how systems are behaving. Good system observability makes investigating and diagnosing problems faster and easier.

Fundamentally, observability is about recording, organizing and visualizing data. The mere availability of data doesn’t deliver enterprise-grade observability. Site reliability engineering, DevOps or other teams first determine what data to generate, collect, aggregate, summarize and analyze to gain meaningful and actionable insights.

Then those teams adopt and build observability solutions. Observability solutions use metrics, traces and logs as data types to understand and debug systems. Enterprises need unified observability across the entire stack: cloud infrastructure, runtime orchestration platforms such as Kubernetes or Nomad, cloud-managed services such as Azure Managed Databases, and business applications. This unification helps teams understand the interdependencies of cloud services and components.

But unification is only the first step of baking observability into the platform workflow. Within that workflow, a platform team needs to automate the best practices of observability within modules and deployment templates. Just as platform engineering helps security functions shift left, observability integrations and automations should also shift left into the infrastructure coding and application build phases by baking observability into containers and images at deployment. This helps your teams build and implement a comprehensive telemetry strategy that’s automated into platform workflows from the outset.

The benefits of integrating observability solutions in your infrastructure code are numerous: Developers can better understand how their systems operate and the reliability of their applications. Teams can quickly debug issues and trace them back to their root cause. And the organization can make data-driven decisions to improve the system, optimize performance, and enhance the user experience.
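To make the three signal types concrete, here is a minimal sketch of instrumenting a single code path using the OpenTelemetry Python API (one possible toolkit; the article doesn’t prescribe one). The service name, span attributes and region label are invented, and without an SDK and exporter configured the calls are no-ops, which is enough to show the instrumentation shape:

```python
# Minimal sketch: emitting all three observability signal types from one code
# path with the OpenTelemetry API (pip install opentelemetry-api). Names like
# "checkout" are invented examples.
import logging

from opentelemetry import metrics, trace

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")                # logs

tracer = trace.get_tracer("checkout")              # traces
meter = metrics.get_meter("checkout")              # metrics
orders = meter.create_counter("orders_processed")


def process_order(order_id: str) -> None:
    # One span per unit of work; attributes make traces queryable.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        orders.add(1, {"region": "eu-west-1"})     # one metric data point
        log.info("processed order %s", order_id)   # one log line


process_order("ord-123")
```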

Workflow: Observability

An enterprise-level observability workflow might follow these eight steps:

  1. Code: A developer commits code.
    1. Note: Developers may have direct network control plane access depending on the RBACs assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the observability resource.
  7. Collect: Metrics and tracing data are collected based on configured emitters and aggregators.
  8. Response: Completion of the provisioner request is provided to the CI/CD platform for subsequent processing and/or handoff to external systems, for purposes such as security scanning or integration testing.

Observability Requirements Checklist

Enterprise-level observability requires:

  • Real-time issue and anomaly detection
  • Auto-discovery and integrations across different control planes and environments
  • Accurate alerting, tracing, logging and monitoring
  • High-cardinality analytics
  • Tagging, labeling, and data-model governance
  • Observability as code (see the sketch after this list)
  • Scalability and performance for multi-cloud and hybrid deployments
  • Security, privacy, and RBACs for self-service visualization, configuration, and reporting
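
As one way to satisfy the “observability as code” item above, here is a minimal sketch that renders a Prometheus alerting rule from plain Python data, so alert definitions live in version control and ship through the same pipeline as the services they watch. The rule name, PromQL expression and threshold are invented examples; because JSON is valid YAML, the output can be written to a Prometheus rules file as-is:

```python
# Minimal "observability as code" sketch: alert rules defined as data and
# rendered to a Prometheus rule group. The rule and threshold are invented.
import json
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    name: str
    expr: str                       # PromQL expression
    for_: str = "5m"                # condition must hold this long to fire
    labels: dict = field(default_factory=dict)

    def render(self) -> dict:
        return {"alert": self.name, "expr": self.expr,
                "for": self.for_, "labels": self.labels}


rules = [
    AlertRule("HighErrorRate",
              'sum(rate(http_requests_total{code=~"5.."}[5m])) '
              '/ sum(rate(http_requests_total[5m])) > 0.05',
              labels={"severity": "page"}),
]

# JSON is valid YAML, so this output can be committed as a rules file.
print(json.dumps({"groups": [{"name": "service-slos",
                              "rules": [r.render() for r in rules]}]}, indent=2))
```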

Next Steps and Technology Selection Criteria

Platform building is never totally complete. It’s not an upfront-planned project that’s finished once everyone has signed off and started using it. It’s more like an iterative agile development project than a traditional waterfall one.

You start with a minimum viable product (MVP), and then you have to market your platform to the organization. Show teams how they’re going to benefit from adopting the platform’s common patterns and best practices for the entire development lifecycle. It can be effective to conduct a process analysis (current vs. future state) with various teams to jointly work on and understand the benefits of adoption. Finally, it’s essential to make onboarding as easy as possible.

As you start to check off the boxes for these six platform pillar requirements, platform teams will want to take on the mindset of a UX designer. Investigate the wants and needs of various teams, understanding that you’ll probably be able to satisfy only 80% to 90% of use cases. Some workflows will be too delicate or unique to bring into the platform. You can’t please everyone. Toolchain selection should be a cross-functional process, and executive sponsorship at the outset is necessary to drive adoption.

Key toolchain questions checklist:

  • Practitioner adoption: Are you starting by asking what technologies your developers are excited about? What enables them to quickly support the business? What do they want to learn and is this skillset common in the market?
  • Scale: Can this tool scale to meet enterprise expectations for performance, security/compliance and ease of adoption? Can you learn from peer institutions instead of venturing into uncharted territory?
  • Support: Are the selected solutions supported by organizations that can meet SLAs for core critical infrastructure (24/7/365) and satisfy your customers’ availability expectations?
  • Longevity: Are these solution suppliers financially strong and capable of supporting these pillars and core infrastructure long-term?
  • Developer flexibility: Do these solutions provide flexible interfaces (GUI, CLI, API, SDK) to create a tailored user experience?
  • Documentation: Do these solutions provide comprehensive, up-to-date documentation?
  • Ecosystem integration: Are there extensible ecosystem integrations to neatly link to other tools in the chain, like security or data warehousing solutions?

For organizations that have already invested in some of these core pillars, the next step involves collaborating with ecosystem partners like HashiCorp to identify workflow enhancements and address coverage gaps with well-established solutions.

Software Delivery Enablement, Not Developer Productivity https://thenewstack.io/software-delivery-enablement-not-developer-productivity/ Tue, 26 Sep 2023

BILBAO — Anna Daugherty thinks we shouldn’t be so obsessed with developer productivity. That was a spicy take at last week’s Continuous Delivery Mini Summit.

“This is something I talked about with almost everyone at the conference,” Daugherty, director of product marketing at Opsera, told The New Stack.

“There’s a difference between an individual trying their best and singling them out for not being productive. But productive doesn’t mean anything,” she continued. “Is individual developer productivity determined by code commits? What they accomplished in a sprint? The individual on the team, doesn’t a product make.”

Productivity metrics are trying to answer the wrong question, Daugherty argues, when you really should focus on:

  • Are your customers and end users seeing the value from accelerated delivery?
  • Are developers more satisfied with their job? Do they feel more enabled?
  • Are you creating more opportunities for revenue and investment?

Just as DevOps looks to accelerate a team’s delivery of software, Daugherty contends, you should focus on software team enablement, not individual developer productivity.

How Do You Measure Team Enablement?

The most common DevOps metrics aren’t really metrics. DORA is more of a framework, she says, to measure velocity — via lead time for changes and deployment frequency — and stability — via change failure rate and time to restore service.

DORA “allows you to have some sort of metrics that your teams can work toward, or that have been identified as being metrics that indicate high performance,” she said. “But it’s not necessarily like the end-all, be-all. It’s not a Northstar metric. It’s an example of what constitutes a high-performing software team.”
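
For a concrete picture of those four measurements (lead time for changes, deployment frequency, change failure rate and time to restore service), here is a back-of-the-envelope sketch over a few invented deployment records; a real pipeline would pull these from VCS and CI/CD APIs:

```python
# Back-of-the-envelope DORA arithmetic over invented deployment records.
from datetime import datetime
from statistics import median

deploys = [  # (merged_at, deployed_at, caused_failure, restored_at)
    (datetime(2023, 9, 1, 9), datetime(2023, 9, 1, 11), False, None),
    (datetime(2023, 9, 1, 13), datetime(2023, 9, 2, 10), True,
     datetime(2023, 9, 2, 11)),
    (datetime(2023, 9, 3, 9), datetime(2023, 9, 3, 9, 30), False, None),
]

lead_times = [deployed - merged for merged, deployed, _, _ in deploys]
days = max((deploys[-1][1] - deploys[0][1]).days, 1)
frequency = len(deploys) / days                                # velocity
failures = [r - d for _, d, failed, r in deploys if failed]
failure_rate = len(failures) / len(deploys)                    # stability

print("median lead time for changes:", median(lead_times))    # velocity
print(f"deployment frequency: {frequency:.1f}/day")
print(f"change failure rate: {failure_rate:.0%}")
print("time to restore service:", failures[0])                # stability
```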

For Northstar metrics you could go for the 2021 SPACE developer productivity framework or the recent McKinsey developer productivity effort, which she says “is both SPACE plus DORA plus some other nonsense that they’ve all wrapped up.”

But really, you have to keep it simple. For Daugherty, that comes down to asking why you’re creating software in the first place, which comes down to three audiences:

  • The users.
  • The people who create it.
  • The market.

While DORA and SPACE can point you in the right direction, she says you should be measuring outcomes that help measure the satisfaction of those three reasons to build software.

Customer Enablement: Measure for Customer Satisfaction.

This looks to answer if the software that you’re delivering is usable and delights your customers, she explained. This can be assessed via net promoter scores (NPS), G2 and other product review sites, and customer testimonials.

You need both qualitative data, with tight feedback cycles with your product’s users, and quantitative tracking, like drop-off rates.
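
On the quantitative side, the net promoter score arithmetic is simple enough to sketch in a few lines: the share of promoters (scores of 9 or 10) minus the share of detractors (0 through 6). The survey scores below are made up:

```python
# NPS = % promoters (9-10) minus % detractors (0-6) on a 0-10 survey scale.
scores = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]   # invented survey responses

promoters = sum(s >= 9 for s in scores)
detractors = sum(s <= 6 for s in scores)
nps = 100 * (promoters - detractors) / len(scores)

print(f"NPS: {nps:+.0f}")  # +30 for this sample
```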

Developer Enablement: Measure for Employee Satisfaction.

Look to answer: Do your developers enjoy creating and releasing software? Do you have a high level of developer burnout? This is where platform engineering comes in as a way to increase developer enablement and reduce friction to release. This can be measured via platform adoption rate, regular developer surveys with an actionable follow-up strategy, Glassdoor reviews and sentiment on their public social media.

Business Enablement: Measure for Market Share.

Is your delivered software helping capture the desired market share? Is it creating investment and/or partnership opportunities? Is it actually moving the sales pipeline along, generating measurable profit? Daugherty explained that business metrics are assessed by measuring things like the sales pipeline, investment and partnerships.

Some companies only seem to focus on the business metrics. But, while there’s been a noticeable shift in the tech industry “from growth at any cost to every dollar matters,” Daugherty emphasizes that how to increase developer productivity still isn’t the right question to be asking.

Part of this is the fundamental disconnect between business leadership and engineering teams.

“Business leadership is always measuring revenue and pipeline, but that isn’t making its way to the engineering teams, or it’s not being translated in a way that they can understand,” she said. “They’re always chasing their tails about revenue, about pipeline, about partnerships [and] about investment, but it really should be a full conversation amongst the entirety of business, with engineering as a huge consideration for who that audience should be.”

Indeed, engineering tends to have the highest salaries, making it an important cost center. One of the early goals of platform engineering should be to facilitate a common language where business understands the benefits of engineering, while engineers understand the connection of their work to delivering business value.

Still, a lot of organizations fall short here. Sometimes, Daugherty says, that persistent chasm can be bridged by the blended role of Chief Digital Officer or Chief Transformation Officer.

How to Help Teams Improve Their Outcomes

Software delivery enablement and 2023’s trend of platform engineering won’t succeed by focusing solely on people and technology. At most companies, processes need an overhaul too.

A team has “either a domain that they’re working in or they have a piece of functionality that they have to deliver,” she said. “Are they working together to deliver that thing? And, if not, what do we have to do to improve that?”

Developer enablement should be concentrated at the team outcome level, says Daugherty, which can be positively influenced by four key capabilities:

  • Continuous integration and continuous delivery (CI/CD)
  • Automation and Infrastructure as Code (IaC)
  • Integrated testing and security
  • Immediate feedback

“Accelerate,” the iconic, metrics-centric guide to DevOps and scaling high-performing teams, identifies certain decisions that are proven to help teams speed up delivery.

One is that when teams are empowered to choose which tools they use, performance demonstrably improves. When asked whether this goes against platform engineering and its establishment of golden paths, Daugherty remarked that this train of thought derails from the focus on enablement.

“Platform engineering is not about directing which tools you use. That’s maybe what it has been reduced to in some organizations, but that’s not the most effective version of that,” she said. “Platform thinking is, truly, you are Dr. Strange from the Avengers, and you see the bigger picture and where things come together and align.”

Platform teams shouldn’t be adopting a rigid, siloed mindset of this team does this and uses that tool.

Platform engineering is about bringing people, products and processes together to increase efficiency and effectiveness for all teams, Daugherty clarified. “Does that mean maybe sometimes choosing better workflows or technology and architecture? Yes, maybe for your business. But that’s just a reductive way of thinking about it,” if that’s your whole platform strategy.

Job roles aside, she emphasizes, DevOps and platform engineering are ways of working, not things you do or don’t do. And a platform team looks to trace over the DevOps infinity symbol, making the pathway to delivery and the communication between Dev and Ops even smoother.

“A lot of people tell me: ‘I hate people like you, because you come in and tell me that I need this tool, and I need to do it this way’,” she said. After all, Opsera is a unified DevOps platform for engineering teams of any size.

But she always counters, “I’m not here to tell you anything. I’m here to help you do the work that you want to do because your work matters. And to help you understand how to communicate that value that you’re bringing to your organization to those business leaders who want more from you. And they constantly will be asking more from you.”

Her role, Daugherty says, is to help teams — and by extension the individual developers that make them up — to figure out how to deliver more, without increasing developer burnout.

DevOps Is First about Facilitating Meaningful Communication

DevOps is about enabling the right kind of communication to increase speed and collaboration — not creating more human-in-the-loop bumps in the road.

“Teams that reported no approval process or used peer review achieved higher software delivery performance,” found the research cited in Accelerate. “Teams that required approval by an external body achieved lower performance.”

It doesn’t necessarily mean zero approval process, but it’s more about keeping the majority of decisions within that team unit. Accelerate continues with the recommendation to “use a lightweight change approval process based on peer review, such as pair programming or intra-team code review, combined with a deployment pipeline to detect and reject bad changes.”

This could be an ops-driven, automated approval like in your Infrastructure as Code or an integrated test, explained Daugherty, or a human-to-human opportunity for shared knowledge.

“It’s gone through a peer review so that people on your team are in agreement that this is what they want to deliver,” she said. “And then it’s automatically deployed to production and not hanging around waiting at some gate [or] some bottleneck. If you have integrated testing and security throughout your pipeline, that’s going to enable you to do that.”
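
A toy sketch of such a lightweight gate, under the assumption that a change ships once an intra-team peer has approved it and the automated checks are green; the field names are invented rather than any particular CI system’s API:

```python
# Toy model of a lightweight change approval gate: peer review plus green
# automated checks, with no external approval board in the loop.
from dataclasses import dataclass, field


@dataclass
class Change:
    author: str
    peer_approvals: list = field(default_factory=list)  # reviewers on the team
    checks_passed: bool = False                         # tests, scans, etc.


def can_deploy(change: Change) -> bool:
    reviewed = any(r != change.author for r in change.peer_approvals)
    return reviewed and change.checks_passed


print(can_deploy(Change("ana", ["ben"], checks_passed=True)))  # True
print(can_deploy(Change("ana", [], checks_passed=True)))       # False: no review
```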

Peer review processes ensure the readability, and thus the maintainability, of code, while also facilitating informal training. Daugherty recalled what Andrew Fenner, master designer at Ericsson, said during a panel on developer experience and productivity, also at the Continuous Delivery Mini Summit.

“Ericsson is kind of an old school sort of company, and so them being able to do this lightweight approval process is kind of a miracle.” Daugherty continued that Fenner spoke about how, sometimes, their most senior developers spend most of their time helping more junior developers instead of committing code themselves. If you measured these more senior members by traditional, individual developer productivity metrics, they would score poorly. But, in reality, their less measurable impact has them helping to improve Ericsson’s junior developers every day. It also means knowledge is not perilously held by one team member but shared across the team.

“That’s what I mean by lightweight — not expecting your developers to have all the answers all the time, [but] to have some mechanism that’s easy for them to get feedback quickly. And to utilize the most knowledgeable and helpful people on your teams to be able to deliver quickly that feedback,” Daugherty said. “That lightweight idea is very much about not standing in people’s way. The better and easier they can deploy to production, the better outcomes they will have, based on velocity and stability.”

The Pillars of Platform Engineering: Part 5 — Orchestration https://thenewstack.io/the-pillars-of-platform-engineering-part-5-orchestration/ Tue, 26 Sep 2023

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part five addresses orchestration. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines, and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

When it comes time to deploy your application workload, whether you’re running distributed applications, microservices or simply want resilience across cloud infrastructure, the job will be much easier with a workload orchestrator.

Workload orchestrators such as Kubernetes and HashiCorp Nomad provide a multitude of benefits over traditional technologies, though the effort required to achieve those benefits varies. For example, rearchitecting for containerization to adopt Kubernetes may involve a higher degree of effort than using an orchestrator like HashiCorp Nomad, which is oriented more toward supporting a variety of workload types. In either case, workload orchestrators enable:

  • Improved resource utilization
  • Scalability and elasticity
  • Multicloud and hybrid cloud support
  • Developer self-service
  • Service discovery and networking (built-in or pluggable)
  • High availability and fault tolerance
  • Advanced scheduling and placement control
  • Resource isolation and security
  • Cost optimization

Orchestrators provide optimization algorithms to determine the most efficient way to allocate workloads into your infrastructure resources (e.g. bin-packing, spread, affinity, anti-affinity, autoscaling, dynamic application sizing, etc.), which can lower costs. They automate distributed computing and resilience strategies without developers having to know much about how it works under the hood.
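
To illustrate one of those placement strategies, here is a toy first-fit-decreasing bin-packing sketch. Real orchestrators weigh CPU, memory, affinity rules and more; this version packs a single resource dimension, with invented workload names:

```python
# Toy first-fit-decreasing bin packing: place workloads onto as few nodes as
# possible. Each node is modeled as a dict of {workload_name: size}.
def bin_pack(workloads: dict, node_capacity: int) -> list:
    nodes = []
    for name, size in sorted(workloads.items(), key=lambda kv: -kv[1]):
        for node in nodes:
            if node_capacity - sum(node.values()) >= size:
                node[name] = size
                break
        else:
            nodes.append({name: size})  # no fit: start a new node
    return nodes


placements = bin_pack({"api": 4, "cache": 2, "batch": 7, "web": 3},
                      node_capacity=8)
print(placements)  # [{'batch': 7}, {'api': 4, 'web': 3}, {'cache': 2}]
```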

As with the other platform pillars, the main goal is to standardize workflows, and an orchestrator is a common way modern platform teams unify deployment workflows to eliminate ticket-driven processes.

When choosing an orchestrator, it’s important to make sure it’s flexible enough to handle future additions to your environments and heterogeneous workflows. It’s also crucial that the orchestrator can handle multitenancy and easily federate across multiple on-premises data centers and multicloud environments.

It is important to note that not all systems can be containerized or shifted to a modern orchestrator (vendor-provided monolithic appliances or applications, for example), so platform teams should identify opportunities for other teams to optimize engagement and automation with orchestrators, per the tenets of the other platform pillars. Modern orchestrators provide a broad array of native features. While specific implementations and functionality vary across systems, there are a number of core requirements.

Workflow: Orchestration

A typical orchestration workflow should follow these eight steps:

  1. Code: A developer commits code.
    1. Note: Developers may have direct network control plane access depending on the RBACs assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers common tasks (test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts, or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the orchestrator resource.
  7. Job: The orchestrator runs jobs on target resources based on defined tasks and policies.
  8. Response: Completion of the provisioner request is provided to the CI/CD platform for subsequent processing and/or handoff to external systems that perform actions such as security scanning or integration testing.

Orchestration flow

Orchestration Requirements Checklist

Successful orchestration requires:

  • Service/batch schedulers
  • Flexible task drivers
  • Pluggable device interfaces
  • Flexible upgrade and release strategies
  • Federated deployment topologies
  • Resilient, highly available deployment topologies
  • Autoscaling (dynamic and fixed)
  • An access control system (IAM JWT/OIDC and ACLs)
  • Support for multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Integration with trusted identity providers with single sign-on and delegated RBAC
  • Functional, logical, and/or physical isolation of tasks
  • Native quota systems
  • Audit logging
  • Enterprise support based on an SLA (e.g. 24/7/365)
  • Configuration through automation (infrastructure as code, runbooks)

The sixth and final pillar of platform engineering is observability: Check back tomorrow!

Platform Engineering Helps a Scale-up Tame DevOps Complexity https://thenewstack.io/platform-engineering-helps-a-scale-up-tame-devops-complexity/ Tue, 26 Sep 2023

Going from startup to scale-up is a great moment for any tech company. It means you have great customer traction and proof of value that you can expand your reach to new markets and verticals.

But it also means it’s time to scale up your technology, often in the cloud. And that isn’t easy.

Capillary Technologies, which builds Software as a Service (SaaS) products within the customer loyalty and engagement domain, saw its customer base grow from 100 to 250. It started experiencing the typical scale-up growing pains, Piyush Kumar, the company’s CTO, told The New Stack.

As Capillary’s team grew significantly, so did its challenges with DevOps complexity. Read on to see if these challenges ring true for you, and how Capillary Technologies leveraged Facets.cloud’s self-service infrastructure management and adopted platform engineering to improve developer productivity and deliver value to end customers faster.

DevOps Doesn’t Scale by Itself

When Kumar joined Capillary as a principal architect in 2016, the company’s presence was growing in India, Southeast Asia and the Middle East, while starting to gain traction in China. But when it looked to go further, this company built on Amazon Web Services (AWS) started hitting some common roadblocks in the cloud.

“The ratio of number of developers to the number of people in our DevOps infrastructure team was starting to get skewed,” Kumar said. “That meant that the number of requests going in from the engineers to the DevOps teams was growing, so the operations tickets were basically growing, and our response times were beginning to slow down.”

Toward the end of 2019, Capillary started to expand to new markets and cloud regions in the U.S. and Europe. These opportunities also presented challenges.

“Newer regions essentially meant spinning off the entire software, infrastructure, monitoring, everything else in a different region,” he said.

Launching in new regions requires organizations to adhere to data sovereignty and data localization laws.

As these launches occurred, Capillary’s infrastructure was in a semi-automated mode. “When you’re in that mode, there are things that are automated and then there are quite a few things that are not. So you don’t have enough visibility into your overall environment stack,” said Kumar.

New regions brought a lot of surprises — the DevOps team had to grow to manage the new environments, and had to meet the new demands of the growing customer base, product portfolio and required number of infrastructure components.

At the same time, Capillary grew from about 100 to 250 engineers.

“We didn’t want stability to start to take a hit, because we now needed to release across multiple environments,” Kumar said. In short, he noted, “more than linear scaling was needed to manage all of this.”

The Cloud Native Complexity Problem

A lot of platform engineering initiatives are sparked by struggles with disparate dev and ops tooling. This was not the case at Capillary, which has always had centrally managed infrastructure.

This is why, in order to battle this complexity at scale, the team members logically tried to increase the automation coverage of their infrastructure. But they found themselves stuck in a constant game of catchup.

“So we tried to continue to automate more and more, and it continued as a team, where you would do more and then you will realize that there is more to be done, so it felt like a constant battle because that landscape kept growing,” Kumar said.

“In six months, whatever we went ahead and automated, we basically carried newer debt, so there was more to be automated.”

For instance, they adopted the open source database MongoDB to bring new infrastructure, storage and database capabilities into the Capillary ecosystem. The DevOps team soon realized that they couldn’t easily automate everything, from launching in new regions to monitoring, backups, upgrades, patches and restoration.

By the time the Capillary teams had automated whatever they could, they had also adopted Apache Kafka for real-time data streaming and Amazon EMR to run and scale workloads, which they then also tried to automate.

Capillary’s teams had gone the open source route to avoid vendor lock-in. But whether they went open source or proprietary, they realized that the complexity of the cloud native landscape means stitching a lot of automation toolchains together.

To tackle this, they needed:

  • Something that would make the overall infrastructure and deployment architecture more uniform, more visible and 100% automated, from build to deploy.
  • To move developers from being reliant on the DevOps team, to being able to provision infrastructure in a self-service way. This includes documentation uniformity to create a single source of truth.
  • A tool to manage the environment, infrastructure and deployment.

The solution Capillary sought, Kumar said, would allow users to “go ahead and create a document. You would say that this is my source of truth. And now I go ahead and do all of this in this way, and I do it uniformly all the time.”

In short, he wondered, “Is this something that a software could translate in terms of managing your environment, infrastructure, deployments, everything?”

Building an Infrastructure Blueprint

A lot of companies kick off their adoption of platform engineering with a journey of discovery. They literally ask themselves: what technology do we have and who owns what?

In late 2020, Capillary began partnering with Facets to co-build a solution to help answer this question. Capillary chose Facets in part because it automated the cataloging of applications, databases, caches, queues and storage across the infrastructure, as well as the interdependencies among them. This cataloging helped to create a deployment blueprint of how architecture should look in an environment.

Example of Facets.cloud Blueprint Designer interface

Facets’ Blueprint Designer provides a high-level view of the entire architecture and detailed information on deployed resources.

“Once you have a single blueprint, then whatever it is you do downstream in terms of launching your infrastructure, in terms of running your applications, in terms of monitoring and managing, everything becomes a downstream activity from there,” Kumar said.

“This essentially is the piece which brings in good visibility and a standardized structure of what your blueprint would look like for your entire environment and applications.”
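
A toy model of that blueprint idea: a declared catalog of resources and their interdependencies that downstream launch, deployment and monitoring steps consume. This illustrates the concept only and is not Facets’ actual data model or API:

```python
# Toy blueprint: resources plus dependencies, from which a launch order falls
# out via a simple topological sort. Resource names are invented.
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    kind: str                       # application, database, cache, queue...
    depends_on: list = field(default_factory=list)


blueprint = [
    Resource("orders-db", "database"),
    Resource("orders-api", "application", depends_on=["orders-db", "events"]),
    Resource("events", "queue"),
]

launched, remaining = [], {r.name: r for r in blueprint}
while remaining:
    ready = [n for n, r in remaining.items()
             if all(dep in launched for dep in r.depends_on)]
    if not ready:
        raise ValueError("dependency cycle in blueprint")
    launched += ready
    for n in ready:
        del remaining[n]

print(launched)  # ['orders-db', 'events', 'orders-api']
```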

Another reason Capillary went with Facets was that it was running 10 environments globally: three for testing and the rest in production. The whole migration to Facets took four to five months to complete, ensuring that all existing data had migrated.

The teams specifically spent about three months moving the testing environments to ensure that everything worked perfectly. The production environments, Kumar said, were much faster to move.

Seeing Results

By mid-2021, Kumar’s team had witnessed some clear results:

Operations Tickets Down by 95%.

“What we’ve been able to do with Facets is that we have created a self-service environment where, as a developer, if you have to create a new application, you go ahead and add it into that catalog,” Kumar said. “Somebody in your team, like your lead or architect, will go ahead and approve that. And then it gets launched on its own. There is no involvement required from the DevOps team.”

The DevOps teams were no longer involved in the day-to-day software launching. Now they were able to run about 15 environments across two product stacks with a six-member DevOps team.

In fact, Capillary renamed its DevOps team “SRE and developer experience,” pivoting to site reliability engineering and creating solutions to enable its developers.

Overall Uptime Increased from 99.8% to 99.99%.

“Our environment stability has basically taken a massive movement forward,” Kumar said. “Our environments are monitored continuously. Anything that you are seeing as a blip will basically get alerted. Your backups, your fallbacks, they are all pretty standardized.”

A 20% Increase in Developer Productivity.

“The biggest thing that has happened is that the queue time or the wait time on the DevOps team is gone,” Kumar said.

There’s also now uniformity across engineering operations, including logs and monitoring, which further increases developer productivity.

“And because our releases are completely automated, the monitoring of releases is completely automated,” Kumar said.

This has meant that over the last two years, the Capillary team has gone from releasing every two weeks to releasing daily. Plus, they’ve moved into an automated, unattended release mode with verifications. Now, said Kumar, “In case something is broken, you will get an immediate alert on that to go ahead and attend.”

The Capillary engineering team continues to grow with new products, the CTO said, as well as become more efficient. In 2016, it took 64 developer weeks to launch an environment. Now, it takes just eight developer weeks, including all verifications and stabilization.

Using the blueprint the company created with Facets, he said, the users have to define how a new environment “will handle this kind of workload and hence, this is the kind of capacity that is required. And so once you set that up, the environment launch is all automated. So you save a lot of time on that.”

Earlier this year, Capillary acquired another tech company, which required the launch of a new developer environment. The engineering team was able to define the blueprint within Facets and launch a new environment in two and a half weeks.

Greater Visibility of Infrastructure Costs.

Finally, three to four years ago, Kumar could only monitor infrastructure costs through post-mortem analysis, which caused a delayed response and leaked costs. Now, he said, Facets has helped with auditing and given it more visibility on how it’s using its infrastructure and where it’s over-provisioning.

The new capabilities, Kumar said, have sparked more proactive monitoring and CloudOps and FinOps, “where there are signals that I get on the cost spikes much sooner.”

The Pillars of Platform Engineering: Part 4 — Connectivity https://thenewstack.io/the-pillars-of-platform-engineering-part-4-connectivity/ Mon, 25 Sep 2023

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part four addresses network connectivity. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines, and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

Network connectivity is a hugely under-discussed pillar of platform engineering, with legacy patterns and hardware still in use at many enterprises. It needs careful consideration and strategy right alongside the provisioning pillar, since connectivity is what allows apps to exchange data and is part of both the infrastructure and application architectures.

Traditionally, ticket-driven processes were expected to support routine tasks like creating DNS entries, opening firewall ports or network ACLs, and updating traffic routing rules. This caused (and still causes in some enterprises) days-to-weeks-long delays in simple application delivery tasks, even when the preceding infrastructure management is fully automated. In addition, these simple updates are often manual, error-prone, and not conducive to dynamic, highly fluctuating cloud environments. Without automation, connectivity definitions and IP addresses quickly become stale as infrastructure is rotated at an increasingly rapid pace.

To adapt networking to modern dynamic environments, platform teams are bringing networking functions, software, and appliances into their infrastructure as code configurations. This brings the automated speed, reliability, and version-controlled traceability benefits of infrastructure as code to networking.

If organizations adopt microservices architectures, they quickly realize the value of software-driven service discovery and service mesh solutions. These solutions create an architecture in which services are discovered and automatically connected based on centralized policies in a zero trust network: services that have permission are connected, and the secure default is to deny all other service-to-service connections. In this model, service-based identity is critical to ensuring strict adherence to common security frameworks.
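
A minimal sketch of that deny-by-default model: a service-to-service call is allowed only if an explicit intention exists for the source and destination identity pair. The service names are invented, and a real mesh enforces this with mutual TLS and certificates rather than an in-memory lookup table:

```python
# Zero trust intentions: connections are denied unless an explicit
# (source, destination) pair has been allowed by central policy.
ALLOWED_INTENTIONS = {
    ("web", "api"),
    ("api", "billing"),
}


def connection_allowed(source: str, destination: str) -> bool:
    # No match means deny: there is no implicit "same network" trust.
    return (source, destination) in ALLOWED_INTENTIONS


print(connection_allowed("web", "api"))      # True: explicitly allowed
print(connection_allowed("web", "billing"))  # False: denied by default
```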

An organization’s choice for its central shared registry should be multicloud, multiregion, and multiruntime — meaning it can connect a variety of cluster types, including VMs, bare metal, serverless, or Kubernetes. Teams should minimize the need for traditional networking ingress or egress points that pull their environments back toward an obsolete “castle-and-moat” network perimeter approach to security.

Workflow: Connectivity

A typical network connectivity workflow should follow these eight steps:

  1. Code: The developer commits code.
    1. Note: Developers may have direct network control plane access depending on the role-based access controls (RBAC) assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g. test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts, or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the connectivity platform.
  7. Connect: Target systems are updated based on defined policies.
  8. Response: A metadata response packet is sent to CI/CD and to external systems that perform actions such as security scanning or integration testing.

Connectivity flow (the Connect box includes service mesh and service registry)

Connectivity Requirements Checklist

Successful network connectivity automation requires:

  • A centralized shared registry to discover, connect, and secure services across any region, runtime platform, and cloud service provider
  • Support for multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Health checks
  • Multiple segmentation and isolation models
  • Layer 4 and Layer 7 traffic management
  • Implementation of security best practices such as defense-in-depth and deny-by-default
  • Integration with trusted identity providers with single sign-on and delegated RBAC
  • Audit logging
  • Enterprise support based on an SLA (e.g. 24/7/365)
  • Support for automated configuration (infrastructure as code, runbooks)

Check back tomorrow for the fifth pillar of platform engineering: Orchestration.

Why You Can’t Go from Zero to Platform Engineering https://thenewstack.io/why-you-cant-go-from-zero-to-platform-engineering/ Mon, 25 Sep 2023

Much of the conversation around platform engineering has suggested that it is a silver bullet for maximizing the developer experience while speeding application development. While it can provide significant benefits in these areas, this is only part of the story.

To achieve centralized control over infrastructure while reaping these benefits, infrastructure governance must be frictionless and consistent, which necessitates a significant level of automation maturity. So, what is automation maturity and why is automation required for a successful platform engineering strategy?

Why Automation Maturity Is Crucial for Platform Engineering

Most large organizations deliver automated infrastructure at multiple levels of maturity. The objective is for all infrastructure to be delivered and managed consistently, removing disparity and reducing the diversity of infrastructure types. For our purposes, we will consider three levels of maturity for infrastructure automation: ad hoc, automated (some automation) and frictionless.

Organizations considered to have an ad hoc level of maturity employ a mix of manual, scripted and niche infrastructure methods. This level is people-intensive and error-prone. Maturing requires consistent, simple, measurable and secure infrastructure automation. It should be frictionless, enabling infrastructure to be created, administered and supported without needing ever more of the experts who are in high demand and short supply.

Automation must be easy, providing native support for entire infrastructure stacks, which vary in size and complexity and reside in multiple locations. An effective automation solution removes all manual work and reduces complexity and fragmentation, ensuring the infrastructure is governed, secure and accountable to the business.

Most enterprises have multiple types and levels of infrastructure automation, some visible and others buried in code. So, while enterprises may use a mix of Infrastructure as Code (IaC), cloud services, virtualization, containers and configuration tools, that does not mean things are under control or providing measurable value. The use of automation is mandatory; however, too much disparate and fragmented automation creates more issues than it solves. These include difficulties in administration, inconsistent policies, unknown risks, increasing skill requirements, and difficulties debugging and supporting the automation.

Having many automation tools creates islands of automation where each method of provisioning infrastructure stands alone. Since modern application architectures use a mix of infrastructure resources, it’s common to see automation procedures trigger other automation procedures to deliver all the things needed at any point in an application software-delivery life cycle. Though this second level of automation maturity is where many businesses can begin to think about a platform engineering approach, many still have work to do to deliver the associated business value.

Finally, the move from automated to frictionless requires the infrastructure to stop being an abstraction and become an integral part of the activities for which it is used. For developers using an IDP, infrastructure is a seamless integrated function. The activity defines what infrastructure is needed at any point in time.

What makes frictionless automation possible is the notion of environments. An environment is an easy-to-understand model that includes all the things necessary to support an activity — the middleware, data, tools, infrastructure, etc. As environments are derived from an activity or task, they contextualize the purpose for those environments, allowing organizations to understand the reason, the criticality and the value of each environment. Then multiple environments can be associated so organizations understand how environments are used, why they are used and by whom, no matter the cloud provider, the infrastructure types or the complexity of the infrastructure layers. This gives businesses the ability to plan, prioritize and budget based on the value each activity delivers.
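
A toy sketch of that environment notion, with illustrative fields only: the full stack needed for an activity is tied to the activity itself, so cost and criticality can be reasoned about per purpose:

```python
# Toy environment model: components, purpose and cost travel together, so
# spend can be prioritized by the activity it supports. All values invented.
from dataclasses import dataclass, field


@dataclass
class Environment:
    activity: str                   # why this environment exists
    criticality: str                # e.g. "production", "test"
    components: list = field(default_factory=list)  # middleware, data, infra
    monthly_cost: float = 0.0


envs = [
    Environment("checkout release testing", "test",
                ["postgres", "kafka", "k8s-namespace"], monthly_cost=1200),
    Environment("checkout production", "production",
                ["postgres", "kafka", "k8s-cluster"], monthly_cost=9800),
]

for e in sorted(envs, key=lambda e: -e.monthly_cost):
    print(f"{e.activity:<28} {e.criticality:<11} ${e.monthly_cost:,.0f}/mo")
```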

Plot a Path to Automation Maturity

While smaller teams typically have less complexity, larger teams and organizations are much more complex, requiring a greater level of automation. This means that the requisite level of automation to successfully employ a platform engineering strategy will vary from company to company or even between disparate teams.

Platform engineering for organizations with little to no infrastructure automation is a non-starter. Instead, these organizations should start by focusing on their desired outcomes and implement the necessary tools and processes to achieve them. For example, is the objective developer experience, speeding application development, managing costs, enforcing governance or something else? Based on that, organizations can then determine which tools and processes make sense to automate based on the associated business value.

Organizations with a medium level of automation maturity are in a good position to explore implementing platform engineering. The goals should be to standardize all infrastructure, understand all components that make up the complete environments used, enforce security and compliance, associate cloud costs with those environments and the business value they provide and, finally, make infrastructure readily available on demand via self-service when developers need it. On top of that, there should be a high level of ongoing visibility into every environment instance throughout its life cycle to maintain governance and manage costs.

Achieve More from Platform Engineering

Platform engineering, while still an emerging methodology, promises many operational benefits. While speeding application development and keeping your developers happy provide significant value for businesses, they should strive to realize the most value possible. By outlining the desired outcomes and implementing the requisite automated processes, the pursuit of platform engineering will yield better results more quickly for organizations, justifying the cost and effort of technological and organizational investments.

Metrics-Driven Developer Productivity Engineering at Spotify https://thenewstack.io/metrics-driven-developer-productivity-engineering-at-spotify/ Mon, 25 Sep 2023

At the crux of any platform engineering success or failure is your ability to get enough developers to adopt your platform and then your ability to measure if it is actually helping them. Except, like the discipline itself, developer productivity metrics are inherently socio-technical, which makes them challenging to accurately gauge. And then, how do you align your platform metrics to the organization’s overall goals?

“I have a strong bias for platform-focused development and developer productivity,” said Laurent Ploix, engineering manager on the Platform Insights team at Spotify. Over the last three years, his team has worked on making data-informed decisions, based on a mix of platform engineering, data science, research and development, and product management.

In the lead-up to last week’s DPE Summit on developer productivity engineering and developer experience, Ploix gave The New Stack what he refers to as his opinionated view on metrics-informed development, which drives platform engineering at Spotify and, through Backstage, the company’s widely adopted open source IDP, much of the tech industry by extension.

Searching for the Right Developer Productivity Metrics

There’s been a lot of huff and puff about developer productivity over the last couple of months. In reality, companies like Google and Spotify have been tracking it for years. And then a white paper on DevEx metrics was released last May.

Why so much focus in 2023 on measuring developer productivity? It’s the year the industry’s over-hiring slowed and most teams are trying to do more with less. It’s also a time when the cloud native landscape is so sprawling and complex that developer flow states are constantly interrupted, resulting in unbearable cognitive load.

Spotify puts the different developer metrics on a scale, from very leading all the way to very lagging.

The lagging metrics, Ploix says, are the value- or impact-focused ones, like the long-term trends of revenue, monthly users, and user satisfaction. “They tend to [be] low noise like they don’t move that fast from one day to the next. They move slowly when we take action. They are kind of hard to relate to the action we take,” he said, as they are more indicative of long-term trends. Still, these lagging metrics are strategically important.

Common value-focused lagging metrics are revenue and end-user satisfaction.

These lagging metrics aren’t quick to tell you the magic thing to change in order to boost your engineering team’s productivity, but, Laura Tacho, engineering leadership coach and teacher of the course Measuring Development Team Performance, said they can:

  • Track progress against goals and benchmark against yourselves.
  • Add a quantitative perspective to known issues or trends in developer experience.
  • Help create a narrative to explain your team operations, or defend project investments, to higher-level stakeholders like your exec team or the board.

Leading metrics, on the other hand, can be much more easily understood and are therefore more actionable, like the number of pull requests in a given day or build time. They also tend to be easier to measure, via both automated tooling and developer surveys, as these metrics live closer to the day-to-day data and developer experience.

“They’re going to be useful for tactical, short-term action. They move fast when we take action,” Ploix said. But he warned that they also “can easily turn into vanity metrics. They tend to be difficult to relate to the actual value that is created. And the most problematic part is it’s kind of easy to game.”

In the end, no matter which is preferred, both leading and lagging metrics matter. “Metrics which are both value-focused and actionable typically don’t exist. Stop looking for them,” Ploix emphasized. “What you truly care about, at the end of the day for the company, is value creation, but in a sustainable way. This is nevertheless a very lagging metric.”

Tying Developer Metrics to Organizational Goals

The Spotify platform insights team looks at a big priority metric — in this case, the organizational top-level objective of increasing end-user satisfaction — and then proxy metrics to support it, like leading technical metrics.

Mean time to recover, or MTTR, factors into user satisfaction because, as Ploix said, “If we have fewer incidents or if the incidents are closed faster, we hope that the end users will be happier.” He says this is an example of one of the “bets” his team makes and then measures over time in an effort to align with cross-company objectives.

A way to decrease MTTR could be to focus on the site reliability engineering (SRE) experience, which had the team look to answer: “Are we going to fix problems faster if SREs are more efficient? And are SREs going to be more efficient if we have a faster log ingestion?”
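
The MTTR arithmetic behind such a bet is straightforward to sketch; the incident timestamps below are invented, not Spotify’s data:

```python
# Mean time to recover, computed from incident open/close timestamps.
from datetime import datetime, timedelta

incidents = [  # (opened, resolved)
    (datetime(2023, 9, 1, 9, 0), datetime(2023, 9, 1, 9, 45)),
    (datetime(2023, 9, 4, 14, 0), datetime(2023, 9, 4, 16, 30)),
    (datetime(2023, 9, 8, 23, 0), datetime(2023, 9, 9, 0, 10)),
]

durations = [resolved - opened for opened, resolved in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print("MTTR:", mttr)  # 1:28:20 for this sample
```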

Developer productivity metrics should always be tied back to organizational goals.

A diagram showing how metrics “influence” each other, leading to lagging. A latency of logs ingestion (which is leading/actionable) can influence software reliability efficiency, which can affect time to recover, which can impact Spotify user satisfaction.

An example, provided by Ploix, of how to create team-level OKRs based on a leading metric, that impact OKRs for larger groups for the more lagging metrics.

An engineering department could have an OKR on the lagging metric of MTTR and a platform team supporting SREs would have a leading metric of log ingestion speed. These would both be in support of the company-level OKR to increase customer satisfaction, which is measured by things like net promoter scores (NPS), active users and churn rate.

This emphasizes one of the important goals of platform engineering which is to increase engineers’ sense of purpose by connecting their work more closely to delivering business value.

“Productivity cannot be measured easily. And certainly not with a single accurate number. And probably not even with a few of them. So these metrics about SRE efficiency or developer productivity, they need to be contextualized for your own company, your tech stack, your team even,” he said, emphasizing that the trends are typically more important than the actual values. “That does not mean that we cannot have a productive conversation about them. But it does mean there is no absolute way to measure” developer productivity, knowing that proxy metrics will never capture everything.

Spotify has found that it’s really useful to align everyone in the company around OKRs and that by changing some leading metrics, they are indeed able to move some of the lagging ones.

The platform insights team has also uncovered three axioms that are needed to successfully connect developer productivity metrics to all levels of OKRs:

  1. The metrics you attach to the OKR must be sensitive to the change you implement.
  2. The change you implement must be aligned to value creation.
  3. The metrics you are trying to move need to be moveable within your OKR tracking period.

Through these mixed metrics-driven experiments at Spotify, they’ve also found that build time — specifically, the number of builds on the continuous integration pipeline per day — impacts developer satisfaction. And it’s been long held that happy workers are more productive, so satisfied developers should be more productive.

“The faster things are built, the more people can produce code and possibly [increase] deployment frequency,” Ploix said. “We also know that developer satisfaction has an impact on attrition. That might actually mean that build time has an impact on attrition.”
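
Before betting an OKR on a leading metric like builds per day, a team might first check that it actually moves with the perception it is supposed to influence. A sketch with invented paired samples (statistics.correlation requires Python 3.10 or later):

```python
# Pearson correlation between a leading metric (CI builds per day) and a
# perception metric (survey satisfaction, 1-5). Both samples are invented.
from statistics import correlation

builds_per_day = [40, 55, 40, 70, 90, 65, 80, 95]         # per team
satisfaction = [3.1, 3.4, 3.0, 3.9, 4.4, 3.7, 4.1, 4.5]   # same teams

r = correlation(builds_per_day, satisfaction)
print(f"Pearson r = {r:.2f}")  # strongly positive here, by construction
```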

The platform insights team has also realized that when test coverage is done well, it can help cut technical debt. And it’s already proven that technical debt can have a negative effect on developer morale.

How to Get Started with Your DevEx Metrics

Don’t wait! The best way to get started on your developer experience or DevEx metrics is by getting started.

“Start by collecting data. Then try to grow some metrics from that, but the fact is it’s not going to be good,” Ploix warned. “You will not know if your data has bad quality until you have metrics,” so the only way to improve it is by starting that data collection. “Metrics are products that require iterations and have bugs. Deal with it.”

Data scientists should work with decision-makers to figure out what’s important to track. Then, once you start collecting data, you’ll start noticing visible trends that can be understood across business and technical domains, which, he said, creates knowledge and influences company culture.
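
A minimal way to act on that advice is to log raw events somewhere cheap and grow metrics from them later. The sketch below appends build events to a local CSV file; the file name and fields are illustrative, and a data warehouse would serve the same purpose:

```python
import csv
import time
from pathlib import Path

EVENTS = Path("devex_events.csv")  # stand-in for a real event store

def record_build(repo: str, duration_s: float, succeeded: bool) -> None:
    """Append one raw build event; metrics can be derived from these later."""
    is_new = not EVENTS.exists()
    with EVENTS.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["ts", "repo", "duration_s", "succeeded"])
        writer.writerow([int(time.time()), repo, f"{duration_s:.1f}", succeeded])

record_build("music-service", 412.7, True)
```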

Often, the perception of productivity is as important as the actual numbers, which makes perception a good place to kick off your developer productivity metrics journey.

“If you pair these kinds of workflow metrics with perception-based metrics — like those gathered through a DevEx survey — you’ll have an easier time identifying the right things to do to reduce friction in your development cycles,” Tacho wrote in a recent LinkedIn post. “Your team already knows where the inefficiencies are. They deal with the pain all day long.”

Spotify’s quarterly developer survey includes questions around developers’ perceived productivity. Of course, individual developers may not be 100% accurate, but the trends don’t lie.

They’ve also uncovered a direct link between tool satisfaction and developer productivity — or at least the perception of it. This quarterly Engineering Satisfaction Survey has a deeply sociotechnical approach to data collection and has a whole section on how devs are using tools and how they feel about them. It also asks engineers if they feel productive. Spotify has learned from this developer research that people who dislike tools feel less productive.

“People are surprisingly good at telling about what happens — productivity and blockers,” Ploix reminded. “Trust people! Ask them!”

The post Metrics-Driven Developer Productivity Engineering at Spotify appeared first on The New Stack.

]]>
The Pillars of Platform Engineering: Part 3 — Provisioning https://thenewstack.io/the-pillars-of-platform-engineering-part-3-provisioning/ Fri, 22 Sep 2023 13:22:47 +0000 https://thenewstack.io/?p=22718670

This guide outlines the workflows and checklists for the six primary technical areas of developer experience in platform engineering. Published

The post The Pillars of Platform Engineering: Part 3 — Provisioning appeared first on The New Stack.

]]>

This guide outlines the workflows and checklists for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part three will address infrastructure provisioning. The other parts of the guide are listed below, and you can download the full PDF version for the complete set of guidance, outlines and checklists.

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

In the first two pillars, a platform team provides self-service VCS and CI/CD pipeline workflows with security workflows baked in to act as guardrails from the outset. These are the first steps for software delivery. Now that you have application code to run, where will you run it?

Every IT organization needs an infrastructure plan at the foundation of its applications, and platform teams need to treat that plan as the foundation of their initiatives. Their first goal is to eliminate ticket-driven workflows for infrastructure provisioning, which aren’t scalable in modern IT environments. Platform teams typically achieve this goal by providing a standardized shared infrastructure provisioning service with curated self-service workflows, tools and templates for developers. Then they connect those workflows with the workflows of the first two pillars.

Building an effective modern infrastructure platform hinges on the adoption of Infrastructure as Code. When infrastructure configurations and automations are codified, even the most complex provisioning scenarios can be automated. The infrastructure code can then be version controlled for easy auditing, iteration and collaboration. There are a few solutions for adopting Infrastructure as Code, but the most common by a wide margin is Terraform.

Terraform is the most popular choice for organizations adopting Infrastructure as Code because of its large integration ecosystem. This ecosystem helps platform engineers meet the final major requirement for a provisioning platform: extensibility. An extensive plugin ecosystem allows platform engineers to quickly adopt new technologies and services that developers want to deploy, without having to write custom code.

Provisioning: Modules and Images

Building standardized infrastructure workflows requires platform teams to break down their infrastructure into reusable, and ideally immutable, components. Immutable infrastructure is a common standard in modern IT that reduces complexity and simplifies troubleshooting while also improving reliability and security.

Immutability means deleting and re-provisioning infrastructure for all changes, which minimizes server patching and configuration changes, helping to ensure that every service iteration initiates a new tested and up-to-date instance. It also forces runbook validation and promotes regular testing of failover and canary deployment exercises. Many organizations put immutability into practice by using Terraform, or another provisioning tool, to build and rebuild large swaths of infrastructure by modifying configuration code. Some also build golden image pipelines, which focus on building and continuous deployment of repeatable machine images that are tested and confirmed for security and policy compliance (golden images).

Along with machine images, modern IT organizations are modularizing their infrastructure code to compose commonly used components into reusable modules. This is important because a core principle of software development is the concept of not “reinventing the wheel,” and it applies to infrastructure code as well. Modules create lightweight abstractions to describe infrastructure in terms of architectural principles, rather than discrete objects. They are typically managed through version control and interact with third-party systems, such as a service catalog or testing framework.
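
The module idea can be sketched in plain Python, even though in practice it would be expressed declaratively, for example as a Terraform module: the caller states architectural intent and the module expands it into the discrete resources the platform team has standardized. All names and defaults below are invented:

```python
def web_service_module(name: str, env: str, instance_count: int = 2) -> dict:
    """Expand intent-level inputs ("a web service") into standardized resources."""
    return {
        "load_balancer": {"name": f"{name}-{env}-lb", "tls": True},
        "instances": [
            {"name": f"{name}-{env}-{i}", "image": "golden-base-2023.09"}
            for i in range(instance_count)
        ],
        "tags": {"service": name, "env": env, "managed-by": "platform-team"},
    }

print(web_service_module("checkout", "prod"))
```

Developers supply a few intent-level inputs; naming conventions, golden images and tagging policy come along for free.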

High-performing IT teams bring together golden image pipelines and their own registry of modules for developers to use when building infrastructure for their applications. With little knowledge required about the inner workings of this infrastructure and its setup, developers can use infrastructure modules and golden image pipelines in a repeatable, scalable and predictable workflow that has security and company best practices built in on the first deployment.

Workflow: Provisioning Modules and Images

A typical provisioning workflow will follow these six steps, sketched in code below:

  1. Code: A developer commits code and submits a task to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: CI/CD-automated workflow to build modules, artifacts, images and/or other infrastructure components.
  5. Response: The response (success/failure and metadata) is passed to the CI/CD platform.
  6. Output: The infrastructure components such as modules, artifacts and image configurations are deployed or stored.

Module- and image-provisioning flow
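
Collapsed into code, the six steps might look like the following runnable sketch, where every function is a stubbed stand-in rather than a real API:

```python
def validate_with_idp(token: str) -> bool:
    """Stand-in for steps 2-3: AuthN/AuthZ against the IdP."""
    return token == "valid-token"

def build_components(workspace: str) -> dict:
    """Stand-in for step 4: build modules, artifacts and images."""
    return {"ok": True, "artifacts": [f"{workspace}/module.zip"]}

def handle_commit(workspace: str, token: str) -> list[str]:
    """Steps 1-6 end to end: code submitted, validated, built, reported, stored."""
    if not validate_with_idp(token):
        raise PermissionError("IdP rejected the request")
    result = build_components(workspace)
    print("pipeline notified:", "success" if result["ok"] else "failure")  # step 5
    return result["artifacts"]  # step 6: deployed or stored

print(handle_commit("networking-module", "valid-token"))
```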

Provisioning: Policy as Code

Agile development practices have shifted the focus of infrastructure provisioning from an operations problem to an application-delivery expectation. Infrastructure provisioning is now a gating factor for business success. Its value lies in driving organizational strategy and the customer mission, not purely in controlling operational expenditures.

In shifting to an application-delivery expectation, we need to shift workflows and processes. Historically, operations personnel applied workflow and compliance controls to the provisioning process through tickets. These tickets usually involved validating access, approvals, security, costs and so on. The whole process was also audited for compliance and control practices.

This process now must change to enable developers and other platform end users to provision via a self-service workflow. This means that a new set of codified security controls and guardrails must be implemented to satisfy compliance and control practices.

Within cloud native systems, these controls are implemented via policy as code. Policy as code is a practice that uses programmable rules and conditions for software and infrastructure deployment that codify best practices, compliance requirements, security rules and cost controls.

Some tools and systems include their own policy system, but there are also higher-level policy engines that integrate with multiple systems. The fundamental requirement is that these policy systems can be managed as code and will provide evaluations, controls, automation and feedback loops to humans and systems within the workflows.

Implementing policy as code helps shift workflows “left” by providing feedback to users earlier in the provisioning process and enabling them to make better decisions faster. But before they can be used, these policies need to be written. Platform teams should own the policy-as-code practice, working with security, compliance, audit and infrastructure teams to ensure that policies are mapped properly to risks and controls.
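
As a hedged illustration of what such a policy might encode, the sketch below evaluates a planned change in plain Python. A real engine such as HashiCorp Sentinel or Open Policy Agent would express the same rules in its own policy language, and the plan schema here is invented:

```python
REQUIRED_TAGS = {"owner", "cost-center"}

def check_plan(planned_resources: list[dict]) -> list[str]:
    """Return policy violations; an empty list means the plan may proceed."""
    violations = []
    for resource in planned_resources:
        missing = REQUIRED_TAGS - set(resource.get("tags", {}))
        if missing:
            violations.append(f"{resource['name']}: missing tags {sorted(missing)}")
        if resource.get("type") == "storage" and not resource.get("encrypted", False):
            violations.append(f"{resource['name']}: storage must be encrypted")
    return violations

plan = [{"name": "logs-bucket", "type": "storage", "tags": {"owner": "sre"}}]
print(check_plan(plan))  # two violations: missing cost-center tag, unencrypted
```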

Workflow: Policy as Code

Implementing policy-as-code checks in an infrastructure-provisioning workflow typically involves five steps:

  1. Code: The developer commits code and submits a task to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: The provisioner runs the planned change through a policy engine and the request is either allowed to go through (sometimes with warnings) or rejected if the code doesn’t pass policy tests.
  5. Response: A metadata response packet is sent to CI/CD and to external systems from there, such as security scanning or integration testing.

Provisioning flow with policy as code

Provisioning Requirements Checklist

Successful self-service provisioning of infrastructure requires:

  • A consolidated control and data plane for end-to-end automation
  • Automated configuration (infrastructure as code, runbooks)
  • Predefined and fully configurable workflows
  • Native integrations with VCS and CI/CD tools
  • Support for a variety of container and virtual machine images required by the business
  • Multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Use of a widely adopted Infrastructure-as-Code language — declarative language strongly recommended
  • Compatibility with industry-standard testing and security frameworks, data management (encryption) and secrets management tools
  • Integration with common workflow components such as notification tooling and webhooks
  • Support for codified guardrails, including:
    • Policy as code: Built-in policy-as-code engine with extensible integrations
    • RBAC: Granularly scoped permissions to implement the principle of least privilege
    • Token-based access credentials to authenticate automated workflows
    • Prescribed usage of organizationally approved patterns and modules
  • Integration with trusted identity providers with single sign-on (SSO) and RBAC
  • Maintenance of resource provisioning metadata (state, images, resources, etc.):
    • Controlled via deny-by-default RBAC
    • Encrypted
    • Accessible to humans and/or machines via programmable interfaces
    • Stored with logical isolation maintained via traceable configuration
  • Scalability across large distributed teams
  • Support for both public and private modules
  • Full audit logging and log-streaming capabilities
  • Financial operations (FinOps) workflows to enforce cost-based policies and optimization
  • Well-defined documentation and developer enablement
  • Enterprise support based on an SLA (e.g., 24/7/365)

Stay tuned for our post on the fourth pillar of platform engineering: connectivity. Or download the full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists.

The post The Pillars of Platform Engineering: Part 3 — Provisioning appeared first on The New Stack.

]]>
A Practical Step-by-Step Approach to Building a Platform https://thenewstack.io/a-practical-step-by-step-approach-to-building-a-platform/ Thu, 21 Sep 2023 17:41:38 +0000 https://thenewstack.io/?p=22718661

In my previous article, I discussed the concept of a platform in the context of cloud native application development. In

The post A Practical Step-by-Step Approach to Building a Platform appeared first on The New Stack.

]]>

In my previous article, I discussed the concept of a platform in the context of cloud native application development. In this article, I will dig into the journey of a platform engineering team and outline a step-by-step approach to building such a platform. It is important to note that building a platform should be treated no differently than building any other product, as the platform is ultimately developed for internal users.

Therefore, all the software development life cycle (SDLC) practices and methodologies typically employed in product development are equally applicable to platform building. This includes understanding end users’ pain points and needs, assembling a dedicated team with a product owner, defining a minimum viable product (MVP), devising an architecture/design, implementing and testing the platform, deploying it and ensuring its continuous evolution beyond the MVP stage.

Step 1: Define Clear Goals

Before starting to build a platform, it is important to determine if the organization actually needs one and what is driving the need for it. Additionally, it is crucial to establish clear goals for the platform and define criteria for measuring its success. Identifying the specific business goals and outcomes that the platform will address is essential to validate its necessity.

While the benefits of reducing cognitive load for developers, providing self-serve infrastructure and improving the developer experience are obvious, it is important to understand the organization’s unique challenges and pain points and how the platform can address them. Some common business goals include the following:

  • Accelerating application modernization through shared Kubernetes infrastructure.
  • Reducing costs by consolidating infrastructure and tools.
  • Addressing skill-set gaps through automation and self-serve infrastructure.
  • Improving product delivery times by reducing developer toil.

Step 2: Discover Landscape and Identify Use Cases

Once platform teams establish high-level business goals, the next step in the platform development process is to understand the current technology landscape of the organization. Platform teams must develop a thorough understanding of their existing infrastructure and their future infrastructure needs, applications, services, frameworks and tools. Platform teams must also understand how their internal teams are structured, their skills in using frameworks like Terraform, the SDLC tools, etc. This can be done via a series of discovery calls and user interviews with different application teams/business units, inventory audits and interviews with potential platform users.

Through the discovery process, platform teams must identify the challenges that the internal teams face with the current services and tools, deriving the use cases for the platform based on the pain points of the internal users. The use cases can be as simple as creating self-serve development environments to more complex use cases like a single pane of glass administration for infrastructure management and application deployment. The following are several discovery items:

  • Current infrastructure (e.g., public clouds, private clouds)
  • Kubernetes distributions in usage (Amazon EKS, AKS, GKE, Upstream Kubernetes)
  • Managed services (databases, storage, registry, etc.)
  • CI/CD methodologies currently in use
  • Security tools
  • SDLC tools
  • Internal teams and their structure for implementing RBAC, clear isolation boundaries and team-specific workflows
  • HA/DR requirements
  • Applications, services in use, common frameworks and technology stacks (Python, Java, Go, React.js, etc.) to create standard templates, catalogs and documentation

Step 3: Define the Product Roadmap

The use cases gathered during the discovery process should be considered to create a roadmap for the platform. This roadmap should outline the MVP requirements necessary to build an initial platform that can demonstrate its value. Platform teams may initially focus on one or two use cases, prioritizing those potentially benefiting a larger group of internal users.

It is recommended to start by piloting the MVP with a small group of internal users, application teams or business units to gather feedback and make improvements. As the platform becomes more robust, it can be expanded to serve a broader range of users and address additional use cases. The following are several example user stories from cloud native application development projects:

  • As a developer, I want to create a CI pipeline to compile my code and create artifacts. (CI as a Service and Registry as a Service)
  • As a developer, I want to create a sandbox environment and deploy my application to the sandbox for testing. (Environment as a Service)
  • As a developer, I want to deploy my applications into Kubernetes clusters. (Deployment as a Service)
  • As a developer, I want access to application logs and metrics to troubleshoot product issues.
  • As an SRE, I want to create and manage cloud environments and Kubernetes clusters compliant with my organization’s security and governance policies.
  • As a FinOps engineer, I want to create chargeback reports and allocate costs to various business units. (Cost Management as a Service)
  • As a security engineer, I want to consistently apply network security and OPA policies across the Kubernetes infrastructure. I also want to see policy violations and access logs in the central SIEM platform. (Network and OPA policy management as a Service)

Step 4: Build the Platform

Building the platform involves developing the automation backend to provide the infrastructure, services and tools that internal users need in a self-serve manner. The self-serve interface can vary from Jenkins pipelines to Terraform modules to Backstage IDP to a custom portal.

The backend involves automating tasks such as creating cloud environments, provisioning Kubernetes clusters, creating Kubernetes namespaces, deploying workloads in Kubernetes, viewing application logs, metrics, etc. Care must be taken to apply the organization’s security, governance and compliance policies as platform teams automate these tasks. The following simple technology stack is assumed for the example organization:

  • Infrastructure: AWS
  • Kubernetes: AWS EKS
  • Registry: AWS ECR
  • CI/CD: GitLab for CI and ArgoCD for application deployment
  • Databases: AWS RDS Postgres, Amazon ElastiCache for Redis
  • Observability: AWS OpenSearch, Prometheus and Grafana for metrics, OpsGenie for alerts
  • Security: Okta for SSO, Palo Alto Prisma Cloud

The example organization runs workloads in the AWS cloud. All stateless application workloads are containerized and run in Amazon EKS clusters. Workloads utilize AWS RDS Postgres for the database and Amazon ElastiCache (Redis) for the cache. The initial user stories are:

  • Create an AWS environment that creates a separate AWS account, VPC, an IAM role, security groups, AWS RDS Postgres and Amazon ElastiCache.
  • Create an EKS cluster with add-ons required for security, governance and compliance.
  • Download Kubeconfig file.
  • Create a Kubernetes namespace.
  • Deploy workload.

Using Backstage as the developer portal and the Rafay Backstage plugins as the automation backend, the following are the high-level steps to build a self-serve platform supporting the above use cases:

  • Install the Backstage app and configure Postgres.
  • Configure authentication using Backstage’s auth provider.
  • Set up Backstage catalog to ingest organization data from LDAP.
  • Set up Backstage to load and discover entities using GitHub integration.
  • Create a blueprint in Rafay console to define a baseline set of software components required by the organization (cost profiles, monitoring, ingress controllers, network security and OPA policies, etc.).
  • Install Rafay frontend and backend plugins in the Backstage app.
  • Use template actions provided by the Rafay backend plugin to add software templates for creating services.
    • Create a Cluster template with ‘rafay:create-cluster’ action and provide the blueprint and other configuration from user input or by defining defaults in cluster-config.yaml.
    • Create Namespace and Workload templates using ‘rafay:create-namespace’ and ‘rafay:create-workload’ actions.
  • Import UI widgets from the Rafay frontend plugin to create component pages for services and resources developed through templates (EntityClusterInfo, EntityClusterPodList, EntityNamespaceInfo, EntityWorkloadInfo, etc.).

Screens from the Backstage developer portal after implementation

While this is a simple representation of a platform built using Backstage and Rafay backstage plugins, the actual platform may need to solve for many other use cases, which may require a larger effort. Similarly, platform teams may use some other interface and automation backend for building the platform.

Treat the Platform as a Product

When embarking on the journey of building a platform, it is essential to treat the platform as a product and follow a systematic approach similar to any other product development. The first step is to invest time in thoroughly discovering and understanding the organization’s technological landscape, identifying current pain points and gathering requirements from internal users. Based on these findings, a roadmap for the platform should be defined, setting clear milestones and establishing success criteria for each milestone.

Building such a platform requires consideration of various factors, including current and future infrastructure needs, application deployment, security, operating models, cost management, developer experience, and shared services and tools. Conducting a build versus buy analysis helps determine which parts of the platform should be built internally and which open source and commercial tools can be leveraged. Most platforms ultimately use all of these components. It is crucial to treat internal users as the platform’s customers, continuously seeking their feedback and iteratively improving the platform to ensure its success.

The post A Practical Step-by-Step Approach to Building a Platform appeared first on The New Stack.

]]>
The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline https://thenewstack.io/the-6-pillars-of-platform-engineering-part-2-ci-cd-vcs-pipeline/ Thu, 21 Sep 2023 15:07:38 +0000 https://thenewstack.io/?p=22718638

This guide outlines the workflows and steps for the six primary technical areas of developer experience in platform engineering. Published

The post The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline appeared first on The New Stack.

]]>

This guide outlines the workflows and steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part two will cover the application deployment pipeline. The other parts of the guide are listed below, and you can also download the full PDF version for the complete set of guidance, outlines and checklists.

  1. Security (includes introduction)
  2. Pipeline (VCS, CI/CD)
  3. Provisioning
  4. Connectivity
  5. Orchestration
  6. Observability (includes conclusion and next steps)

Platform Pillar 2: Pipeline

One of the first steps in any platform team’s journey is integrating with and potentially restructuring the software delivery pipeline. That means taking a detailed look at your organization’s version control systems (VCS) and continuous integration/continuous deployment (CI/CD) pipelines.

Many organizations have multiple VCS and CI/CD solutions in different maturity phases. These platforms also evolve over time, so a component-based API platform or catalog model is recommended to support future extensibility without compromising functionality or demanding regular refactoring.

In a cloud native model, infrastructure and configuration are managed as code, and therefore a VCS is required for this core function. Using a VCS and managing code provide the following benefits:

  • Consistency and standardization
  • Agility and speed
  • Scalability and flexibility
  • Configuration as documentation
  • Reusability and sharing
  • Disaster recovery and reproducibility
  • Debuggability and auditability
  • Compliance and security

VCS and CI/CD enable interaction and workflows across multiple infrastructure systems and platforms, which requires careful assessment of all the VCS and CI/CD requirements listed below.

Workflow: VCS and CI/CD

A typical VCS and CI/CD workflow should follow these five steps (a pipeline-as-code sketch follows the flow):

  1. Code: The developer commits code to the VCS and a task is automatically submitted to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. Response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Output: The output and/or artifacts are shared within platform components or with external systems for further processing.
  5. Operate: Security systems may be involved in post-run tasks, such as deprovisioning access credentials.

VCS and CI/CD pipeline flow
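
The checklist below calls for pipeline as code, and the essence of that idea fits in a few lines: the pipeline definition is data, version controlled next to the application. This sketch uses invented stage names and shells out locally, standing in for what a CI runner would do:

```python
import subprocess

# The pipeline definition lives in the repository, just like application code.
PIPELINE = [
    ("test",  ["python", "-m", "pytest", "-q"]),
    ("build", ["python", "-m", "build"]),
]

def run_pipeline(stages=PIPELINE) -> bool:
    """Run stages in order, stopping at the first failure (steps 3-4 above)."""
    for name, cmd in stages:
        print(f"--- stage: {name}")
        if subprocess.run(cmd).returncode != 0:
            print(f"stage {name} failed; halting pipeline")
            return False
    return True

if __name__ == "__main__":
    run_pipeline()
```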

VCS and CI/CD Requirements Checklist

Successful VCS and CI/CD solutions should deliver:

  • A developer experience tailored to your team’s needs and modern efficiencies
  • Easy onboarding
  • A gentle learning curve with limited supplementary training needed (leveraging industry-standard tools)
  • Complete and accessible documentation
  • Support for pipeline as code
  • Platform agnosticism (API driven)
  • Embedded expected security controls (RBAC, auditing, etc.)
  • Support for automated configuration (infrastructure as code, runbooks)
  • Support for secrets management, identity and authorization platform integration
  • Encouragement and support for a large partner ecosystem with a broad set of enterprise technology integrations
  • Extended service footprint, with runners to delegate and isolate span of control
  • Enterprise support based on an SLA (e.g., 24/7/365)

Note: VCS and CI/CD systems may have more specific requirements not listed here.

As platform teams select and evolve their VCS and CI/CD solutions, they need to consider what this transformation means for existing/legacy provisioning practices, security and compliance. Teams should assume that building new platforms will affect existing practices, and they should work to identify, collaborate and coordinate change within the business.

Platform teams should also be forward-looking. VCS and CI/CD platforms are rapidly evolving to further abstract away the complexity of the CI/CD process from developers. With HashiCorp Waypoint, for example, HashiCorp aims to simplify these workflows by providing a consistent way to deploy, manage and observe applications across multiple runtimes, including Kubernetes and serverless environments.

Stay tuned for our post on the third pillar of platform engineering: provisioning. Or download the full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists.

The post The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline appeared first on The New Stack.

]]>
CloudBees Scales Jenkins, Redefines DevSecOps https://thenewstack.io/cloudbees-scales-jenkins-redefines-devsecops/ Thu, 21 Sep 2023 12:00:46 +0000 https://thenewstack.io/?p=22718786

CloudBees, which offers a software delivery platform for enterprises, announced significant performance and scalability enhancements to Jenkins with new updates

The post CloudBees Scales Jenkins, Redefines DevSecOps appeared first on The New Stack.

]]>

CloudBees, which offers a software delivery platform for enterprises, announced significant performance and scalability enhancements to Jenkins with new updates to its CloudBees Continuous Integration (CI) software. The company also delivered a new DevSecOps solution based on Tekton.

CloudBees made the announcements at the recent DevOps World 2023 conference. CloudBees CI is an enterprise version of Jenkins. Jenkins is the most widely used CI/CD software globally, with an estimated 11.2 million developers using it as part of their software delivery process, the company said.

HA, Scalability, Performance

The new updates bring high availability and horizontal scalability to Jenkins, eliminating the bottlenecks that plague administrators and developers as they massively scale CI/CD workloads on Jenkins, said Sacha Labourey, co-founder and chief strategy officer at CloudBees.

“The ability to roll out, protect, and scale Jenkins on top of Kubernetes is critical to Jenkins remaining the go-to platform for managing CI/CD pipelines,” said Torsten Volk, an analyst at Enterprise Management Associates. “The over 1000 existing integrations are still a massive argument for many DevOps teams to adopt or stay with Jenkins, but now these integrations no longer come at the expense of adding tech debt.”

In addition, CloudBees announced additional performance-enhancing capabilities such as workspace caching to speed up builds and a new AI-powered pipeline explorer for easier and faster debugging.

“I think these changes are significant to existing Jenkins users, and there are still a lot of Jenkins users,” said Jim Mercer, an analyst at IDC. Specifically, he noted:

  • The caching will help to improve startup times and the speed of Jenkins pipelines.
  • The HA and scaling create additional controller replicas to balance the load of multiple users doing builds concurrently, appearing to the developer as a single controller. Previously, organizations attempted to mitigate the Jenkins controller issue by adding more Jenkins instances, creating other overhead for administration, etc.

“These changes’ overall theme is improving the developer experience by addressing issues where time is lost and enhancing their lives,” Mercer said. “These are not sexy changes per se, but they benefit the Jenkins user base.”

Jenkins has long had scalability issues, said Jason Bloomberg, an analyst at Intellyx. “So the Cloudbees High Availability Mode is a welcome update. Now Jenkins will no longer have a single point of failure and will also offer automatic load balancing — capabilities expected in any cloud environment and long overdue for Jenkins.”

Moreover, high availability and horizontal scalability for Jenkins is a capability our enterprise customers have wanted for a long time, Labourey told The New Stack.

“The ability to run Jenkins at massive scale with active-active high availability becomes especially critical when you’re dealing with thousands of developers, running multiple thousands or hundreds of thousands of jobs across a small set of monolithic, overloaded controllers,” said Shawn Ahmed, chief product officer, CloudBees, in a statement. “At this scale, you are dealing with a community of developers that want a high-resiliency developer experience with no disruption. We have removed significant barriers in scaling Jenkins, enabling enterprises to run greater workloads than ever before. The new capabilities in CloudBees CI are a game-changing experience for DevOps teams.”

Other Features

In addition to high availability and horizontal scalability, additional performance-enhancing features introduced include:

  • Workspace Caching – Improves the performance of Jenkins by speeding up builds.
  • Pipeline Explorer – Easier and faster AI-powered debugging. Find and fix pipeline issues in complex environments with massive Jenkins workloads.
  • Build Storm Prevention – Baseline your repository without causing build storms (gridlock on startup).

“We had a full-fledged CI/CD offering but it was solely available as software,” Labourey said. “So our customers were deployed on-premises or in their own public cloud accounts. But we own and operate those environments. And obviously there is a desire for more SaaS consumption, but also the evolution towards more cloud native types of workloads. And so that’s what we are releasing and announcing now and releasing on November 1 to all customers. And we’ve been working on this for a long time.”

Tekton-Based DevSecOps

Meanwhile, the new CloudBees DevSecOps platform is built on Tekton, uses a GitHub Actions style domain-specific language (DSL), and adds feature flagging, security, compliance, pipeline orchestration, analytics and value stream management (VSM) into a fully managed single-tenant SaaS, multitenant SaaS or on-premises virtual private cloud instance.

“The new CloudBees platform turns Tekton into an easy-to-use pipeline automation solution and can directly benefit from Jenkins also running and scaling with Kubernetes,” Volk said. “This strategy makes sense as it builds on existing differentiation (1000 plugins) and aims to make Tekton, an incredibly scalable pipeline automation framework, accessible to the masses.”

CloudBees said its new extensible DevSecOps platform redefines DevSecOps by addressing the challenges associated with delivering better, more secure and compliant cloud native software at a faster pace than ever.

“DevSecOps has been harder to implement than people would like, so bringing the ‘Sec’ part of the equation into CloudBees’ expertise with DevOps can only be a step forward,” Bloomberg said.

What’s Old Is New?

But is it just old wine in new bottles?

“The adoption of Tekton by Cloudbees was originally announced back in the DevOps days in 2019 when they announced the JenkinsX project would delegate the execution layer to Tekton. So, I don’t see this as new,” Mercer told The New Stack. “Outside of this, the collection of capabilities, such as value stream management, compliance, and feature flags, provide compelling capabilities as an integrated stack. I am not a fan of the addition of a new DSL. I also feel like they would do well to promote their compliance capabilities more since our survey data shows this is a top challenge for teams.”

Moreover, CloudBees cites a new discipline called platform engineering, which has emerged as an evolution of DevOps practices. The discipline brings together multiple roles such as site reliability engineers (SREs), DevOps engineers, security teams, product managers, and operations teams. Their shared mission is to integrate all the siloed technology and tools used within the organization into a golden path for developers. The CloudBees platform is purpose-built for this mission, the company said in a statement.

In addition, CloudBees said its focus going forward is on the following imperatives:

  1. Developer-Centric Experience

Enhances the developer experience by minimizing cognitive load and making DevOps processes nearly invisible, using concepts of blocks, automation and golden paths.

  2. Open and Extensible

Embraces the DevOps ecosystem of tools, starting with Jenkins. This flexibility to orchestrate any other tool enables organizations to protect the investments they have already made in tooling. Teams can continue to use their preferred technologies simply by plugging them into the platform.

  3. Self-Service Model

Enables platform engineering to customize the platform, thus providing autonomy for development teams. For example, platform engineers can design automation and actions that are then consumed in a self-service mode by developers. Developers focus on what they do best: create innovation. No waiting for needed automation, actions, or resources.

  4. Security and Compliance

Centralizes security and compliance. The CloudBees platform comes with out-of-the-box workflow templates containing built-in security. Sensitive information, like passwords and tokens, is abstracted out of the pipeline, significantly enhancing security and compliance throughout the software development life cycle. Automated DevSecOps is baked in, with best-of-breed checks across source code, binaries, cloud environments, data and identity, all based on Open Policy Agent (OPA). Continuous compliance just happens, with out-of-the-box regulatory frameworks for standards such as FedRAMP and SOC 2, and automated evidence collection for the auditors.

“We have been using the CloudBees platform in beta. One significant value add for us was that it significantly reduced the time it took to pass our ISO 27001 compliance audit,” said Michel Lopez, founder and CEO at E2F, in a statement. “The auditor had scheduled 12 hours of interviews, but it ended after 60 minutes. This was because all of the controls were provided by the CloudBees platform.”

The post CloudBees Scales Jenkins, Redefines DevSecOps appeared first on The New Stack.

]]>
The 6 Pillars of Platform Engineering: Part 1 — Security https://thenewstack.io/the-6-pillars-of-platform-engineering-part-1-security/ Wed, 20 Sep 2023 18:00:10 +0000 https://thenewstack.io/?p=22718618

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering teams.

The post The 6 Pillars of Platform Engineering: Part 1 — Security appeared first on The New Stack.

]]>

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering teams. These tools and workflows comprise an internal developer platform, which is often referred to as just “a platform.” The goal of a platform team is to increase developer productivity, facilitate more frequent releases, improve application stability, lower security and compliance risks and reduce costs.

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, this part, part one, introduces the series and focuses on security. (Note: You can download a full PDF version of the six pillars of platform engineering for the complete set of guidance, outlines and checklists.)

Platform Engineering Is about Developer Experience

The solutions engineers and architects I work with at HashiCorp have supported many organizations as they scale their cloud operating model through platform teams, and the key for these teams to meet their goals is to provide a satisfying developer experience. We have observed two common themes among companies that deliver great developer experiences:

  1. Standardizing on a set of infrastructure services to reduce friction for developers and operations teams: This empowers a small, centralized group of platform engineers with the right tools to improve the developer experience across the entire organization, with APIs, documentation and advocacy. The goal is to reduce tooling and process fragmentation, resulting in greater core stability for your software delivery systems and environments.
  2. A Platform as a Product practice: Heritage IT projects typically have a finite start and end date. That’s not the case with an internal developer platform. It is never truly finished. Ongoing tasks include backlog management, regular feature releases and roadmap updates to stakeholders. Think in terms of iterative agile development, not big upfront planning like waterfall development.

No platform should be designed in a vacuum. A platform is effective only if developers want to use it. Building and maintaining a platform involves continuous conversations and buy-in from developers (the platform team’s customers) and business stakeholders. This guide functions as a starting point for those conversations by helping platform teams organize their product around six technical elements or “pillars” of the software delivery process along with the general requirements and workflow for each.

The 6 Pillars of Platform Engineering

What are the specific building blocks of a platform strategy? In working with customers in a wide variety of industries, the solutions engineers and architects at HashiCorp have identified six foundational pillars that comprise the majority of platforms, and each one will be addressed in a separate article:

  1. Security
  2. Pipeline (VCS, CI/CD)
  3. Provisioning
  4. Connectivity
  5. Orchestration
  6. Observability

Platform Pillar 1: Security

The first questions developers ask when they start using any system are: “How do I create an account? Where do I set up credentials? How do I get an API key?” Even though version control, continuous integration and infrastructure provisioning are fundamental to getting a platform up and running, security also should be a first concern. An early focus on security promotes a secure-by-default platform experience from the outset.

Historically, many organizations invested in network perimeter-based security, often described as a “castle-and-moat” security approach. As infrastructure becomes increasingly dynamic, however, perimeters become fuzzy and challenging to control without impeding developer velocity.

In response, leading companies are choosing to adopt identity-based security, identity-brokering solutions and modern security workflows, including centralized management of credentials and encryption methodologies. This promotes visibility and consistent auditing practices while reducing operational overhead in an otherwise fragmented solution portfolio.

Leading companies have also adopted “shift-left” security, implementing security controls throughout the software development lifecycle. This leads to earlier detection and remediation of potential attack vectors and increased vigilance around control implementations. This approach demands automation-by-default instead of ad-hoc enforcement.

Enabling this kind of DevSecOps mindset requires tooling decisions that support modern identity-driven security. There also needs to be an “as code” implementation paradigm to avoid ascribing and authorizing identity based on ticket-driven processes. That paves the way for traditional privileged access management (PAM) practices to embrace modern methodologies like just-in-time (JIT) access and zero-trust security.

Identity Brokering

In a cloud operating model approach, humans, applications and services all present an identity that can be authenticated and validated against a central, canonical source. A multi-tenant secrets management and encryption platform along with an identity provider (IdP) can serve as your organization’s identity brokers.

Workflow: Identity Brokering

In practice, a typical identity brokering workflow might look something like this (a token-validation sketch follows the list):

  1. Request: A human, application, or service initiates interaction via a request.
  2. Validate: One (or more) identity providers validate the provided identity against one (or more) sources of truth/trust.
  3. Response: An authenticated and authorized validation response is sent to the requestor.
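
Step 2 is where most of the machinery lives. As one hedged example, if the IdP issues JSON Web Tokens, validation could look like the sketch below using the PyJWT library; the issuer, audience and key handling are simplified assumptions:

```python
import jwt  # PyJWT: pip install pyjwt

def validate_identity(token: str, idp_public_key: str) -> dict:
    """Return the verified claims, or raise jwt.InvalidTokenError, in which
    case step 3 denies the request. Audience and issuer values are invented."""
    return jwt.decode(
        token,
        idp_public_key,
        algorithms=["RS256"],  # pin the algorithm; never accept "none"
        audience="platform-api",
        issuer="https://idp.example.com",
    )
```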

Identity Brokering Requirements Checklist

Successful identity brokering has a number of prerequisites:

  • All humans, applications and services must have a well-defined form of identity.
  • Identities can be validated against a trusted IdP.
  • Identity systems must be interoperable across multi-runtime and multicloud platforms.
  • Identity systems should be centralized or have limited segmentation in order to simplify audit and operational management across environments.
  • Identity and access management (IAM) controls are established for each IdP.
  • Clients (humans, machines and services) must present a valid identity for AuthN and AuthZ.
  • Once verified, access is brokered through deny-by-default policies to minimize impact in the event of a breach.
  • AuthZ review is integrated into the audit process and, ideally, is granted just in time.
    • Audit trails are routinely reviewed to identify excessively broad or unutilized privileges and are retroactively analyzed following threat detection.
    • Historical audit data provides non-repudiation and compliance for data storage requirements.
  • Fragmentation is minimized with a flexible identity brokering system supporting heterogeneous runtimes, including:
    • Platforms (VMware, Microsoft Azure VMs, Kubernetes/OpenShift, etc.)
    • Clients (developers, operators, applications, scripts, etc.)
    • Services (MySQL, MSSQL, Active Directory, LDAP, PKI, etc.)
  • Enterprise support 24/7/365 via a service level agreement (SLA)
  • Configured through automation (infrastructure as code, runbooks)

Access Management: Secrets Management and Encryption

Once identity has been established, clients expect consistent and secure mechanisms to perform the following operations:

  • Retrieving a secret (a credential, password, key, etc.)
  • Brokering access to a secure target
  • Managing secure data (encryption, decryption, hashing, masking, etc.)

These mechanisms should be automatable — requiring as little human intervention as possible after setup — and promote compliant practices. They should also be extensible to ensure future tooling is compatible with these systems.

Workflow: Secrets Management and Encryption

A typical secrets management workflow should follow five steps, illustrated in the sketch after the flow:

  1. Request: A client (human, application or service) requests a secret.
  2. Validate: The request is validated against an IdP.
  3. Request: A secret request is served if managed by the requested platform. Alternatively:
    1. The platform requests a temporary credential from a third party.
    2. The third-party system responds to the brokered request with a short-lived secret.
  4. Broker response: The initial response passes through an IAM cryptographic barrier for offload or caching.
  5. Client response: The final response is provided back to the requestor.

Secrets management flow
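
As one hedged example of this workflow, the sketch below uses hvac, the Python client for HashiCorp Vault, with the KV version 2 secrets engine. The AppRole credentials, mount and secret path are assumptions about how the platform is configured:

```python
import hvac  # pip install hvac

client = hvac.Client(url="https://vault.example.com:8200")

# Steps 1-2: the client presents an identity that Vault validates. The
# role_id/secret_id would be delivered by the platform, not hardcoded.
client.auth.approle.login(role_id="<role-id>", secret_id="<secret-id>")

# Steps 3-5: the secret request is served and the response returned.
secret = client.secrets.kv.v2.read_secret_version(path="payments/db")
db_password = secret["data"]["data"]["password"]
```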

Access Management: Secure Remote Access (Human to Machine)

Human-to-machine access in the traditional castle-and-moat model has always been inefficient. The workflow requires multiple identities, planned intervention for AuthN and AuthZ controls, lifecycle planning for secrets and complex network segmentation planning, which creates a lot of overhead.

While PAM solutions have evolved over the last decade to provide delegated solutions like dynamic SSH key generation, this does not satisfy the broader set of ecosystem requirements, including multi-runtime auditability or cross-platform identity management. Introducing cloud architecture patterns such as ephemeral resources, heterogeneous cloud networking topologies, and JIT identity management further complicates the task for legacy solutions.

A modern solution for remote access addresses the challenges of ephemeral resources and the complexities that come with them, such as dynamic resource registration, identity, access and secrets. These modern secure remote access tools no longer rely on network access such as VPNs as an initial entry point, CMDBs, bastion hosts, manual SSH and/or secrets managers with check-in/check-out workflows.

Enterprise-level secure remote access tools use a zero-trust model where human users and resources have identities. Users connect directly to these resources. Scoped roles — via dynamic resource registries, controllers, and secrets — are automatically injected into resources, eliminating many manual processes and security risks such as broad, direct network access and long-lived secrets.

Workflow: Secure Remote Access (Human to Machine)

A modern remote infrastructure access workflow for a human user typically follows these eight steps:

  1. Request: A user requests system access.
  2. Validate (human): Identity is validated against the trusted identity broker.
  3. Validate (to machine): Once authenticated, authorization is validated for the target system.
  4. Request: The platform requests a secret (static or short-lived) for the target system.
  5. Inject secret: The platform injects the secret into the target resource.
  6. Broker response: The platform returns a response to the identity broker.
  7. Client response: The platform grants access to the end user.
  8. Access machine/database: The user securely accesses the target resource via a modern secure remote access tool.

Secure remote access flow

Access Management Requirements Checklist

All secrets in a secrets management system should be:

  • Centralized
  • Encrypted in transit and at rest
  • Limited in scoped role and access policy
  • Dynamically generated, when possible
  • Time-bound (i.e., defined time-to-live — TTL)
  • Fully auditable

Secrets management solutions should:

  • Support multi-runtime, multicloud and hybrid-cloud deployments
  • Provide flexible integration options
  • Include a diverse partner ecosystem
  • Embrace zero-touch automation practices (API-driven)
  • Empower developers and delegate implementation decisions within scoped boundaries
  • Be well-documented and commonly used across industries
  • Be accompanied by enterprise support 24/7/365 based on an SLA
  • Support automated configuration (infrastructure as code, runbooks)

Additionally, systems implementing secure remote access practices should:

  • Dynamically register service catalogs
  • Implement an identity-based model
  • Provide multiple forms of authentication capabilities from trusted sources
  • Be configurable as code
  • Be API-enabled and contain internal and/or external workflow capabilities for review and approval processes
  • Enable secrets injection into resources
  • Provide detailed role-based access controls (RBAC)
  • Provide capabilities to record actions, commands, sessions and give a full audit trail
  • Be highly available, multiplatform, multicloud capable for distributed operations, and resilient to operational impact

Stay tuned for our post on the second pillar of platform engineering: version control systems (VCS) and the continuous integration/continuous delivery (CI/CD) pipeline. Or download a full PDF version of the six pillars of platform engineering for the complete set of guidance, outlines and checklists.

The post The 6 Pillars of Platform Engineering: Part 1 — Security appeared first on The New Stack.

]]>
A Guide to Open Source Platform Engineering https://thenewstack.io/a-guide-to-open-source-platform-engineering/ Wed, 20 Sep 2023 15:22:36 +0000 https://thenewstack.io/?p=22718708

BILBAO — The rise of platform engineering in 2023 is part of a pendulum swinging back away from developer autonomy

The post A Guide to Open Source Platform Engineering appeared first on The New Stack.

]]>

BILBAO — The rise of platform engineering in 2023 is part of a pendulum swinging back away from developer autonomy but not all the way back to Waterfall command and control. We want developers to retain their sense of choice, encouraging the problem-solving side of these creative workers. But we also want to cut down on tool sprawl and cut the cost and risk that can come with it.

This week at the Linux Foundation’s Open Source Summit Europe, open source community members looked to highlight the on-ramp to getting things done with an open source stack. That included Gedd Johnson, a full-stack engineer at Defense Unicorns, a consultancy that builds open source DevSecOps platforms, including for the U.S. Department of Defense, talking about how to get on board with this recent trend of platform engineering.

Most importantly, Johnson’s talk exhibited how to achieve the benefits of open source platform engineering, diagramming an example of building a Kubernetes-based platform entirely on free and open source software.

Oops! DevOps Causes Ops Sprawl

“Adopting a DevOps culture means that the team has full ownership over the entire software lifecycle,” Johnson said. “The team is proficient across the entire stack,” which they’ve chosen to suit their unique context.

But with great freedom and responsibility comes great inefficiency.

“If you have disparate app teams that have adopted a DevOps culture, and they’re owning their own processes from end to end, those teams have a tendency to inadvertently silo themselves,” he said. “And if this happens, you can bet that multiple app teams are going to be solving the same problems over and over again, and maybe not necessarily at the application code level.”

He gave the example of an organization built this way on Amazon Web Services.

“A very popular method for deploying on AWS is with Elastic Container Service, or ECS. And if you have multiple app teams using ECS, and those app teams are siloed away, you can bet that they’d all solve the problem of deploying their app on ECS in their own way,” Johnson said.

Some app teams might take an Infrastructure as Code approach with Terraform. Others might just “ClickOps” their way through the AWS console, scrolling through menus to find the right settings. Each team is likely using load balancers in front of their apps, but may choose different ones for different reasons.

“So even in just one AWS service, there’s a multitude of different ways to do things, and I’m just talking about ECS,” he observed. “In AWS as a whole, there are hundreds of ways to deploy an app and, without any org-level guidance and opinionation, you end up with what’s called ops sprawl.”

Ops sprawl is Johnson’s term for a cloud environment overloaded with somewhat differing, but often duplicated, work, usually done without the context of why a tool or process was chosen in the first place. This sprawl happens across all these accidental silos — an ironic casualty of DevOps, which was born to break down silos.

“This is also likely very expensive for your org because you likely have orphaned resources lying all over your cloud accounts,” he continued. “Lastly, security and compliance can become especially difficult if each app team has their own snowflake deployment process. That means your security and compliance engineers have to re-accredit each of these individual snowflake applications.”

Platform Engineering to the Standardization Rescue!

Everyone has a different explanation for what platform engineering is. Johnson says, “The goal of platform engineering is to standardize the process for deploying and operating applications as well as their underlying infrastructure.” A platform strategy kicks off by discovering the disparate tools and processes used by each app team, and then “forms an opinion on the best, most robust and most relevant to our contexts methods for deploying an application and operating its infrastructure,” thereby laying, enforcing and automating the golden path to production.

What’s in a golden path besides yellow bricks?  This can range from creating reusable continuous integration pipelines that enforce static code analysis and dependency scanning all the way to a streamlining of your (probably costly) AWS accounts by “scoping applications to use the minimum set of permissions necessary to operate,” he said.

Now, if you’ve been reading The New Stack for a while now, you know that an internal developer platform isn’t successful if it’s a top-down initiative — that’s a surefire plan to build just what your engineers do not want. So you have to be careful in internally launching your platform as standardization and opinionation start to make people “rightfully” nervous that their autonomy is being stripped away — after all, in the DevOps world their team had all the control.

“It can sound like platform engineering is here to take that freedom away and go back to a world where other teams who don’t have an appropriate level of context are dictating how an app team does their job,” Johnson warned. “In reality, the crux of platform engineering is balancing this level of freedom you give developers with the underlying opinionation of the platform.”

Or, as Spotify puts it, standards can set you free.

How to Build an Open Source Internal Developer Platform

The underlying technical stack needed in a platform, as listed in the article.

In order to sell your DevOps-driven app teams on this new pseudo-freedom, you need to explain to them what platform engineering is — and, platform engineering being such a nascent discipline, the industry hasn’t exactly coalesced on a single definition of the internal developer platform (IDP).

“In its simplest form, a platform is just the underlying set of services and capabilities that an application requires to run effectively in a production environment,” Johnson said. Platform-driven automation makes it really easy to do the right things and really hard to do the wrong ones.

While every IDP should naturally be unique based on the results of your ops sprawl crawl, there is a certain pattern his team has uncovered for what goes into a completely free open source platform engineering strategy. Johnson talked us through what to build around the frontend and backend layers you want your app team to focus on for delivering value to end users.

IDP Demand #1: A Monitoring Layer

The monitoring and observability needs are likely the same across your organization, making it a fair place to start your platform journey.

“When this app gets deployed, engineers are going to want to know if it actually got deployed, is it healthy, is it reachable and also how is this app performing,” Johnson said, which is why, at bare minimum, any abstraction must include assurances and health checks, monitoring of connectivity, and “some quantitative insight into how the app is performing from like a CPU and networking perspective.”

He suggests the Kubernetes dashboard Metrics Scraper to achieve this.
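
To make this concrete, here is a minimal sketch of the liveness and readiness checks a platform abstraction might bake into every Kubernetes Deployment; the app name, image and health endpoint are placeholders for illustration, not anything Johnson prescribes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app               # hypothetical app name
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: example-app
      template:
        metadata:
          labels:
            app: example-app
        spec:
          containers:
            - name: example-app
              image: registry.example.com/example-app:1.0.0   # placeholder image
              ports:
                - containerPort: 8080
              readinessProbe:          # is the app reachable and ready for traffic?
                httpGet:
                  path: /healthz
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 10
              livenessProbe:           # is the app still healthy, or should it be restarted?
                httpGet:
                  path: /healthz
                  port: 8080
                periodSeconds: 15

Paired with a metrics scraper, these probes cover the “is it deployed, healthy and reachable” questions; the CPU and networking insight comes from the metrics side.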

IDP Demand #2: A Logging Layer

Engineers are going to want to view logs in real-time, as well as query historical logs, which means you need a logging stack, says Johnson. “The app is going to emit logs to a particular directory, and we’ll need a log scraper to grab logs from that directory and then forward them on to a log database.”

Promtail and Fluent Bit are popular open source log scrapers.
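
As a rough sketch of that scrape-and-forward flow, a Promtail configuration along these lines tails an application’s log directory and pushes entries to Loki; the paths, port and Loki URL are assumptions for illustration:

    server:
      http_listen_port: 9080            # Promtail’s own HTTP port
    positions:
      filename: /tmp/positions.yaml     # tracks how far each file has been read
    clients:
      - url: http://loki:3100/loki/api/v1/push   # assumed in-cluster Loki endpoint
    scrape_configs:
      - job_name: app-logs
        static_configs:
          - targets:
              - localhost
            labels:
              job: example-app                   # hypothetical label
              __path__: /var/log/app/*.log       # the directory the app emits logs to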

IDP Demand #3: A Dashboard

Then you need a dashboard to make it easy for engineers to observe their logging and monitoring. In order to do this safely, Johnson says, you need network encryption with TLS. Each application can implement its own TLS, or you can point them all at a centralized certificate authority (CA) server and manually pass around certificates — but anything manual contributes to your ops sprawl. So he offers the alternative of adding a service mesh, which provides the same capabilities without having to touch application code.
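
For a sense of what the service mesh route looks like in practice, enabling strict mutual TLS mesh-wide in Istio is roughly this small; it assumes Istio is installed in the istio-system namespace, and no application code changes are needed:

    apiVersion: security.istio.io/v1beta1
    kind: PeerAuthentication
    metadata:
      name: default
      namespace: istio-system    # applying it here makes the policy mesh-wide
    spec:
      mtls:
        mode: STRICT             # sidecars accept only mutually authenticated TLS traffic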

IDP Demand #4: Load Balancer and DNS

A load balancer and DNS records will make the app accessible to its end users by creating a public interface.
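
A minimal Kubernetes sketch of that public interface is an Ingress that routes a DNS name to the app’s service; the hostname and service name are placeholders:

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: example-app
    spec:
      rules:
        - host: app.example.com          # the DNS record points here
          http:
            paths:
              - path: /
                pathType: Prefix
                backend:
                  service:
                    name: example-app    # assumed ClusterIP service for the app
                    port:
                      number: 8080

The load balancer itself is typically provisioned behind this resource by the ingress controller or the cloud provider.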

IDP Demand #5: An Internal Developer Portal

This is where your platform engineering strategy will sink or swim. This developer portal should enable self-service so devs can work out what they need to do. It also should give them an easy way to view all these dashboards in one place, and give them a snapshot of the health of their app at all times. In addition, this dev portal lets them see what other teams are doing and who owns what, giving cross-organizational clarity and enabling reusability.

The internal developer portal should give them confidence that platform engineering is the way forward and make them want to actually adopt it.

IDP Demand #6: A Resource Scaler

“As more users interact with the application, and with the platform, we’ll probably want some mechanisms to scale up and scale down these various resources,” Johnson said, “so we’re making efficient use of the underlying compute that this is all running on.” He suggests the Kubernetes resource scaler.
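
One common mechanism for this on Kubernetes is the HorizontalPodAutoscaler. A minimal sketch, assuming a Deployment named example-app and a CPU-based target:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: example-app
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: example-app          # hypothetical deployment to scale
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # scale out when average CPU passes 70%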

IDP Demand #7: Security at All Layers

“At some point, either the app or one of these platform components will have a zero-day critical security vulnerability, and this system will get hacked,” he warned. This demands a runtime security component to identify this vulnerability and to alert the team.

Build or Buy?

Going the free open source software route is not typically free in terms of time or energy. “There is basically an endless number of moving parts and tech decisions that you have to make to build something that your app devs actually want to use, and is easy to operate,” Johnson said. This is why many orgs choose to simply outsource it to their cloud provider.

However, in the highly regulated, egress-limited environments where Defense Unicorns works, what the big three offer may not be enough, leaving you to build your own.

A bonus: by going the open source route, you’re able to avoid vendor lock-in.

An Open Source Platform Engineering Case Study

Defense Unicorns chooses to build its open stack with Kubernetes, which it calls a cornerstone of platform engineering — as well as, often, the trigger to go the platform route.

“At its core, all Kubernetes does is provide a really robust and extensible API for managing containers. And this API is so popular that Kubernetes has become the de facto standard for deploying containers, both on-prem and in the cloud.” But, Johnson warned, “When you choose to adopt Kubernetes, many of those platform components that your cloud provider was providing you, you now own and you now have to operate and maintain and update.”

This is why he recommends having a team of platform engineers dedicated to building and maintaining the platform — and, of course, in tight feedback loops with their internal app developer customers. But the benefits typically outweigh the cost, he argues, because “using Kubernetes as a base, we can build this entire system using exclusively free and open source software.”

Big Bang architecture, as described in the piece.

Johnson walked the OS Summit through Big Bang, which was created by the U.S. Air Force and then open sourced under the Department of Defense’s Platform One DevSecOps program. “Big Bang is an open source declared baseline of applications and configurations used to create a secure Kubernetes-based platform,” he said.

Big Bang aims to take the most widely used and adopted open source components and bundle them together in a single Helm chart that can be deployed as one unit. The free and open source tools included are:

  • Logging stack – Promtail and Grafana Loki.
  • Monitoring stack – Prometheus.
  • Dashboard – Grafana.
  • Service mesh – Istio, to secure traffic between platform components and handle egress from the cluster.
  • Runtime security – NeuVector, monitoring for cluster anomalies or intrusions.
  • Resource scaler – Kubernetes.
  • Continuous delivery – Flux, which has the concept of the Helm release.
  • AWS EKS cluster.
  • EBS CSI driver.

Essentially, Johnson explained, “Big Bang itself is a Helm chart made up of Flux Helm releases for all of these various platform components.” Big Bang provides the default configuration, which automatically wires together many of the platform components.
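
To make the “Helm chart of Helm releases” idea concrete, a Flux HelmRelease wrapping a chart like Big Bang looks roughly like this; the repository name, chart reference and value keys are illustrative assumptions rather than Big Bang’s actual defaults:

    apiVersion: helm.toolkit.fluxcd.io/v2beta1
    kind: HelmRelease
    metadata:
      name: bigbang
      namespace: bigbang
    spec:
      interval: 10m                 # how often Flux reconciles the release
      chart:
        spec:
          chart: bigbang            # assumed chart name
          sourceRef:
            kind: HelmRepository
            name: bigbang-repo      # assumed repository resource
      values:                       # illustrative component toggles
        istio:
          enabled: true
        loki:
          enabled: true
        neuvector:
          enabled: true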

He live-demoed building it on stage, and, within a few minutes, was able to check the health and connectivity of just under 50 pods and the virtual services, including CPU and memory utilization.

“And you have all these fancy charts to show your boss, and all I’m checking here is that, okay, metrics data is definitely being calculated and collected and we can view it. So now for logs, what we can do is go to this explorer tab in Grafana,” Johnson said. In addition, with this open source platform engineering stack, you can see the most recent logs and query historical logs.

You train NeuVector in discovery mode for a bit and then switch it over to protect mode, “and it will use what it learned in discovery mode as heuristics for detecting various anomalies and intrusions, and you can even set it up to automatically neutralize certain behaviors.”

Platform Engineering Lessons (from One of the Biggest AWS Bills in the U.S.)

The Department of Defense is a massive organization, so it’s unsurprising that it experienced a lot of ops sprawl across hundreds of accounts, with each app team owning its own process and more or less doing whatever it wanted.

Add to this that these applications operate within highly regulated environments, handling highly sensitive data. Any baseline architecture has to comply with NIST 800-53 requirements, across hundreds of app teams.

“And because of this amount of ops sprawl, it took security teams an egregious amount of time to accredit these apps to run in the various regulated environments,” Johnson said.

They were actually able to build and deploy their open source internal developer platform in six months.

The main goal, though — as with most IDPs — is adoption. Johnson named two things that must be true of internal developer platforms:

  1. It should make developers’ lives easier.
  2. It should enhance workflows.

“If the platform doesn’t do that, then fundamentally like we have missed the mark.”

And there were some roadblocks to adoption along the way. First, they realized that, while many teams were in the cloud running on AWS, very few were actually containerized — a prerequisite to running in Kubernetes. Most were either running in a virtual machine or completely serverless using lambdas.

So before they could sell people on the platform, they actually needed teams to commit a few engineering cycles to containerization — which most teams were reluctant to do.

“In the beginning, we were marketing this platform more towards security and compliance engineers, and really highlighting the value prop of automatically satisfying all of these security controls. So there was less emphasis on the actual app devs,” Johnson admitted.

In order to get the app teams on board, they decided to inner source the work, inviting them to “come build with us,” encouraging community adoption and being transparent about what was being built.

And while this was happening, even though the ops sprawl was starting to get under control, Johnson’s team still had to run and manage Kubernetes and Big Bang for those almost 50 pods. “And when I say manage, I mean things like checking for image updates, and refactoring whatever upstream Helm chart has breaking changes and its values — and that happens way more often than you think.”

He’d find himself refactoring the logging stack for the tenth time across the many ways AWS does logging.

“Depending on how fast the team moves and the amount of changes being made, the operational overhead of running a homegrown, Kubernetes-based platform, it can get pretty absurd,” Johnson said. “So the lesson learned there is, when you’re architecting a platform, don’t forget to weigh [in] that operational overhead.”

And don’t assume that Kubernetes is the right choice for every organization and every platform — have a data-driven, identified need for Kubernetes. Once you commit to Kubernetes, it’s hard to back out of that technical decision.

In the end, those original six months to build turned into a year. He learned that the best platforms come from extensive user research up front and then building smaller to fail faster.

Finally, Johnson advised that platform teams focus on the problem, not the tech: “The core problems that all these buzzwords are trying to get at are like, how do we make software development suck less and how do we build better products?”

The post A Guide to Open Source Platform Engineering appeared first on The New Stack.

Drive Platform Engineering Success with Humanitec and Port https://thenewstack.io/drive-platform-engineering-success-with-humanitec-and-port/ Fri, 15 Sep 2023 14:18:20 +0000 https://thenewstack.io/?p=22718171

Platform engineering is evolving like crazy. So many cool new tools are popping up left, right and center, all designed to boost developer productivity and slash lead time. But here’s the thing. For many organizations, this makes picking the right tools to build an internal developer platform an instant challenge, especially when so many seem to do the same job.

At Humanitec we get the need to ensure you’re using the right tech to get things rolling smoothly. This article aims to shed some light on two such tools: the Humanitec Platform Orchestrator and Port, an internal developer portal. We’ll dive into the differences, their functionalities and the unique roles they play in building an enterprise-grade internal developer platform.

The Power of the Platform Orchestrator

The Humanitec Platform Orchestrator sits at the core of an enterprise-grade internal developer platform and enables dynamic configuration management across the entire software delivery cycle. The Platform Orchestrator enables a clear separation of concerns: Platform engineers define how resources should be provisioned in a standardized and dynamic way. For developers, it removes the need to define and maintain environment-specific configs for their workloads. They can use an open source workload specification called Score to define the resources required by their workloads in a declarative way. With every git push, the Platform Orchestrator automatically figures out the necessary resources and configs for the workload to run.

When used to build an internal developer platform, the Platform Orchestrator cuts out a ton of manual tasks. The platform team defines the rules, and the Platform Orchestrator handles the rest, as it follows an “RMCD” execution pattern:

  1. Read: Interpret workload specification and context.
  2. Match: Identify the correct configuration baselines to create application configurations and identify what resources to resolve or create based on the matching context.
  3. Create: Create application configurations; if necessary, create (infrastructure) resources, fetch credentials and inject credentials as secrets.
  4. Deploy: Deploy the workload into the target environment wired up to its dependencies.

No more config headaches, just more time to focus on the important stuff that adds real value.

The Pivotal Role of Internal Developer Portals

Like the Platform Orchestrator, internal developer portals like Port also play a pivotal role from the platform perspective, mainly because the portal acts as the interface to the platform, enhancing the developer experience. However, the two tools belong to different categories, occupy different planes of a platform’s architecture and have different primary use cases.

Note: The components and tools referenced below apply to a GCP-based setup, but all are interchangeable. Similar reference architectures can be implemented for AWS, Azure, OpenShift or any hybrid setup. Use this reference as a starting point, but prioritize incorporating whatever components your setup already has in place.

For example, where the Platform Orchestrator is the centerpiece of an internal developer platform, Port acts as the user interface to the platform, providing the core pillars of the internal developer portal: a software catalog that provides developers with the right information in context, a developer self-service action layer (e.g., setting up a temporary environment, provisioning a cloud resource and scaffolding a service), a scorecards layer (e.g., indicating whether software catalog entities comply with certain requirements) and an automation layer (for instance, alerting users when a scorecard drops below a certain level). Port lets you define any catalog for services, resources, Kubernetes, CI/CD, etc., and it is easily extensible.

Same Tools or Apples and Oranges?

So, you could say comparing the Humanitec Platform Orchestrator and Port is like comparing apples to oranges. Both play important roles in building a successful platform. But they’re not the same thing at all. The Platform Orchestrator is designed to generate and manage configurations. It interprets what resources and configurations are required for a workload to run, it creates app and infrastructure configs based on rules defined by the platform team and executes them. As a result, developers don’t have to worry about dealing with environment-specific configs for their workloads anymore. The Platform Orchestrator handles it all behind the scenes, making life easier for them.

Port, on the other hand, is like the front door to your platform. It acts as the interface, containing anything developers need to use to be self-sufficient, from developer self-service actions through the catalog and even automation that can alert them on vulnerabilities, ensuring AppSec through the entire software development life cycle. In short, an internal developer portal drives productivity by allowing developers to self-serve without placing too much cognitive load on them, from setting up a temporary environment, getting a cloud resource or starting a new service. It’s all about making self-service and self-sufficiency super smooth for developers.

Building the Perfect Platform with Port and the Platform Orchestrator

The real magic happens when these two tools join forces. While they support different stages in the application life cycle, they can be used in tandem to build an effective enterprise-grade internal developer platform that significantly enhances DevOps productivity.

So when it comes to the Humanitec Platform Orchestrator and Port, it’s not about choosing one over the other. Both can be valuable tools for your platform. What matters is the order in which you bring them into the mix, and how you integrate them.

Step one, let’s set the foundation right. You should structure your internal developer platform to drive standardization across the end-to-end software development life cycle and establish a clear separation of concerns between app developers and platform teams. And the best way to do that is by starting with a Platform Orchestrator like the one from Humanitec. Think of it like the beating heart of your platform.

Next, you can decide what abstraction layers should be exposed to developers in the portal, what self-service actions you need to offer them to unleash their productivity, and which scorecards and automations need to be in place. For this, you can adopt Port as a developer portal on top of the platform.

Port and Humanitec in Action

Here’s what combining the Humanitec Platform Orchestrator and Port could look like:

  1. First, you’ll need to set up both Humanitec and Port. For Port, you’ll need to think about the data model of the software catalog that you will want to cover in Port, for instance, bringing in CI/CD data, API data, resource data, Kubernetes data or all of the above. You’ll also need to identify a set of initial popular self-service actions that you will want to provide in the portal.
  2. Let’s assume you want to create a self-service action to deploy a new build of a microservice in Port.
  3. Make sure the microservice repository/definition includes a Score file, which defines the workload dependencies.
  4. Port receives the action request and triggers a GitHub Workflow to execute the service build.
  5. Once the service is built, the Platform Orchestrator is notified and dynamically creates configuration files based on the deployment context. The Platform Orchestrator can derive the context from API calls or from tags passed on by any CI system.
  6. Humanitec deploys the new service.
  7. The resulting new microservice deployment entity will appear in Port’s software catalog.

Don’t forget what happens after Day 1, though. Dealing with complex app and infra configs, and having to add or remove workload-dependent resources (stuff like databases, DNS, storage) for different types of environments, can equate to a ton of headaches.

This is where the Platform Orchestrator and Score do their thing. With Score, an open source workload specification, developers can easily request the resources their workloads need or tweak configs in a simple way, depending on the context — like what kind of environment they’re working with. Let’s dive into an example to make it clear:

1. Add the following request to the Score file.

    resources:
      bucket:
        type: s3

2. Run a git push.

3. The Orchestrator will pick this up and update or create the correct S3 bucket based on the context, create the app configs and inject the secrets.

4. At the end, it will register the new resource in the portal.

5. Resources provisioned by Humanitec based on requests from the Score file will be shown on the Port service catalog for visibility and also to enable a graphical overview of the resources.
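
Putting those steps together, a complete Score file for such a workload might look roughly like this; the workload name, image, variable name and resource output are illustrative assumptions rather than Humanitec’s exact syntax:

    apiVersion: score.dev/v1b1
    metadata:
      name: example-service                 # hypothetical workload name
    containers:
      example-service:
        image: registry.example.com/example-service:latest   # placeholder image
        variables:
          BUCKET_NAME: ${resources.bucket.name}   # assumed output of the resolved bucket
    resources:
      bucket:
        type: s3                            # the request the Orchestrator resolves in context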

Port comes with additional capabilities beyond scaffolding microservices and spinning up environments, including self-service actions that are long-running and asynchronous or that require manual approval. Sample Day 2 actions are:

  • Add temporary permissions to a cloud resource.
  • Extend developer environment TTL.
  • Update running service replica count.
  • Create code dependencies upgrade PR.
  • And more.

Drive Productivity and Slash Time to Market

To sum up, the Humanitec Platform Orchestrator and Port are an awesome match when it comes to building an effective enterprise-grade internal developer platform. And the best place to start? Build your platform around the Platform Orchestrator. That’s the key to unlocking the power of dynamic configuration management (DCM), which will standardize configs, ensure a clear separation of concerns and take your DevEx to the next level. Then, choose your developer abstraction layers. This is where you can use Port as a developer portal sitting on top of the platform. Successfully integrate the two and expect productivity gains, a boost to developer performance and, ultimately, slashed time to market.

The post Drive Platform Engineering Success with Humanitec and Port appeared first on The New Stack.

Demo: Building an Internal Developer Portal with Port https://thenewstack.io/demo-building-an-internal-developer-portal-with-port/ Thu, 14 Sep 2023 13:45:52 +0000 https://thenewstack.io/?p=22717667

Platform engineering is supposed to make the lives of both Devs and Ops easier and more productive — reducing complexity for developers so they can focus on building and delivering their applications, and freeing operations engineers from repetitive tasks.

And that mission includes building an internal developer portal, or IDP, through which developers can access everything they need for self-service.

Port, a two-year-old company, started when its co-founders, Zohar Einy and Yonatan Boguslavski, built an IDP for more than 2,000 engineers at a previous employer to use.

“This is what got us inspired to start a company around it and allow every organization in the world to adopt the portal,” Einy said.

Now Port’s CEO, he demonstrated his company’s platform for building an IDP in this episode of The New Stack Demos. The no-code portal, which users can set up to suit their specific needs, includes a robust service catalog and a service actions tab, where, Einy said, “developers can get everything that they need when it comes to provisioning infrastructure or for performing data operations and so on.”

The open platform allows users to integrate tools and programs like Kubernetes and Jira, as well as custom plugins. It offers visibility into the entire life cycle of a service, along with a way to create scorecards for different software components. “Scorecards are the way to define the guardrails you want to put in place for developers to be able to comply with their organizational standards,” Einy said.

To see more of how Port works, check out the video. There’s also an interactive demo. “It’s publicly available, and you don’t need to authenticate,” Einy said. “You can play around with the demo. And you can break it, it will refresh after three hours. So don’t worry.”

The post Demo: Building an Internal Developer Portal with Port appeared first on The New Stack.

Platform Engineering: What’s Hype and What’s Not? https://thenewstack.io/platform-engineering-whats-hype-and-whats-not/ Wed, 13 Sep 2023 13:43:08 +0000 https://thenewstack.io/?p=22718089

There’s a sea change happening in the software development world. The merging of developers and operations that gave us DevOps may be entering a new chapter. In this new world, the emerging discipline of platform engineering is quickly gaining popularity.

Platform engineering is the practice of designing and building self-service capabilities to reduce cognitive load for developers and to facilitate fast-flow software delivery. Many large organizations have struggled to reap the benefits of DevOps, in part because shifting more operational and security concerns “left” and into the domain of software developers has created bottlenecks for dev teams. At the same time, faced with a growing cognitive burden of repetitive, time-consuming tasks that kick them out of the flow state of highly productive coding, many devs want less and less to do with ops and the “you build it, you run it” paradigm.

Platform engineering is emerging as the solution to many of these challenges. But in all the buzz around it, it’s hard to know what’s real and what’s not. To help you separate the facts from the hype, here’s a round-up of viewpoints on what platform engineering is and is not.

Platform Engineering Is New – Hype

There are those who hail platform engineering as the new kid on the block. But there’s nothing new about the building of digital platforms as a means of delivering software at scale. It even pre-dates the birth of the DevOps movement in the mid-2000s. According to Puppet’s 2023 State of DevOps Report, large software companies have been taking a platform approach for decades as a way to enable developer teams to build, ship, and operate applications more quickly and at higher quality.

What is new, especially in the enterprise space, is the rapidly growing traction of platform engineering as a way for larger companies to improve software delivery at scale. Gartner identified platform engineering (which it interchangeably calls “platform ops”) as one of the Top Strategic Technology Trends of 2023. Gartner analysts predict that 80% of software engineering organizations will establish platform teams by 2026, and 75% of those will include developer self-service portals.

Platform Engineering Has Toppled DevOps – Hype

Those who claim that DevOps is dead and that platform engineering has supplanted it are engaging in hyperbole. “DevOps is dead, long live Platform Engineering!” tweeted software engineer and DevOps commentator Sid Palas in 2022. “1. Developers don’t like dealing with infra, 2. Companies need control of their infra as they grow. Platform engineering enables these two facts to coexist.”

Platform Engineering Is the Next Evolution of DevOps and SRE – Fact

Rather than dealing a death blow to DevOps, a more accurate take is that platform engineering is the next evolution of DevOps and SRE (site reliability engineering). In particular, it benefits developers struggling with code production bottlenecks as they wait on internal approvals or fulfillment. It also helps devs deliver on their own timeline rather than that of their IT team. And it helps operator types (such as SREs or DevOps engineers) who are feeling the pain of repetitive request fulfillment and operational firefighting — busy work that keeps them from building their vision for the future.

Platform Engineering Should Embrace DevOps Culture – Fact

The agile development practices that are at the core of DevOps culture — such as collaboration, communication, and continuous improvement — have not always extended to the operations domain. This has hobbled the ability of agile development teams to quickly deliver products. In order not to perpetuate this dynamic, DevOps team culture should evolve to support platform engineering, and platform teams should embrace DevOps team culture.

Platform Engineering Is a Con – Hype

There are those, like independent technology consultant Sam Newman, who argue platform engineering is just another vendor-generated label to be slapped on to old practices in a bid to mask the horrendously complex technology ecosystems we’ve accumulated. Newman’s concern is that “single-issue” platform teams risk becoming the very bottleneck they’re supposed to alleviate, and can become so focused on managing the tool that they forget about outcomes. Rather than using names like Platform Team, Newman suggests more outcome-oriented labels such as “Delivery Support” or even better “Delivery Enablement.”

Platform Engineering Is All about Scaling – Fact

Platform engineering solves the challenges of scaling and accelerating DevOps adoption by dedicating a team to the delivery of a shared self-service platform for app developers. It works best for enterprises with more mature DevOps practices that need to scale and move fast. It often does not make sense in smaller companies, for a single development team, or when multiple divergent platforms need to be supported — in short, wherever scale is not yet a driving factor of success.

Platform Engineering Centers on IDPs and Golden Paths – Fact

One definition of platform engineering is the practice of creating a reusable set of standardized tools, components, and automated processes, often referred to as an internal developer platform (IDP). IDPs, and the teams that build them, provide paths of least resistance that developers can take to complete their day-to-day tasks. These “golden paths” come with recommended tools and best security practices built in, enabling developers to self-serve and self-manage their code.

Platform Engineering Requires a Platform as Product Approach – Fact

Gartner and others recommend treating your platform as a product by treating the developers who use it as your customers, so that they in turn can deliver services to your organization’s customers. Like any other product, pushing your platform on to developers without their input is unlikely to produce positive outcomes. So it’s essential to talk to your internal consumers to solve for their needs. Many traditional infrastructure teams don’t do this and often don’t even understand the workloads running on their platforms.

Striking a Balance

Successful IDPs achieve a balance between allowing developers to remain in the flow state of highly productive coding while eliminating repetitive tasks through automated, full-stack environments. Developers can deliver apps faster because platform engineers smooth the path for them, enabling them to create their own environment with every check-in. This allows devs to review, share, and test apps without waiting in line or worrying about code conflicts. When done well, platform engineering delivers best-in-class developer experiences; provides choices of leading tools, platforms and clouds across the software development lifecycle; and gives self-service access to full-stack environments to every developer.

The post Platform Engineering: What’s Hype and What’s Not? appeared first on The New Stack.

How to Create an Internal Developer Portal MVP https://thenewstack.io/how-to-create-an-internal-developer-portal-mvp/ Tue, 12 Sep 2023 14:22:33 +0000 https://thenewstack.io/?p=22717824

What needs to go into an internal developer portal, and how should it be set up by platform engineers and used by developers? This post will take a practical approach to building a portal minimum viable product (MVP), assuming a GitOps and Kubernetes-native environment. MVPs are a great way to get started with an idea and see what it can materialize into. We’ll explore the software catalog, both a basic catalog and an extended one, and then look at setting up developer self-service actions — specifically, how to deploy a microservice from testing to production. Then we’ll add some scorecards and automations.

Sounds difficult? It’s actually quite simple.

5 Steps to Creating an MVP of Your Developer Portal

  1. Forming an initial software catalog. In the example below we will show how to populate the initial software catalog using Port’s GitHub app and a git-based template.
  2. Enriching the data model beyond the initial blueprints, bringing in more valuable data to the portal.
  3. Creating your first self-service action. In the example below we will show how to scaffold a new microservice, but you can also think of adding Day 2 actions, or an action with a TTL (temporary environment, for instance).
  4. Enriching the data model with additional blueprints and Kubernetes data, and allowing developers to build additional self-service actions so that they can test and then promote the service to production.
  5. Adding scorecards and dashboards. These features offer developers insight into ongoing activities and quality initiatives.

Defining and Creating the Basic Data Model for the Software Catalog

The basic setup of the software catalog will be based on raw GitHub data, though you can make other choices. But how will the developer portal “classify” the data and create software catalog entities?

In Port, blueprints are where you define the metadata associated with the software catalog entities you choose to add to your catalog. Blueprints support the representation of any asset in Port, such as microservice, environment, package, cluster, databases, etc. Once blueprints are populated with data — in this case, coming from GitHub — the software catalog entities are discovered automatically and formed.
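
For a feel of the shape of a blueprint, here is a rough sketch of a microservice blueprint. Port natively defines blueprints as JSON through its UI and API; the YAML below simply mirrors that structure for readability, and the property names are illustrative assumptions rather than Port’s exact schema:

    identifier: microservice
    title: Microservice
    schema:
      properties:
        language:
          type: string
          title: Language
        repositoryUrl:
          type: string
          format: url
          title: Repository URL
      required:
        - repositoryUrl
    relations: {}          # filled in as the data model grows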

What are the right blueprints for this initial catalog, and how do we define their relations?

Let’s dive a little deeper:

  • The Workflow Run blueprint shows metadata associated with GitHub workflow pipelines.
  • The Pull Request blueprint shows metadata associated with, well, pull requests. This will allow you to create custom views for the PRs relevant to teams or individual developers.
  • The Issues blueprint shows metadata associated with GitHub issues.
  • The Workflow blueprint explores pipelines and workflows that currently exist in your GitHub (and uses them to create self-service actions in Port that can trigger more GitHub workflows).
  • The Microservice blueprint shows GitHub repositories and monorepos represented as microservices in the portal.

This basic catalog provides a developer with a strong foundation to understand the software development life cycle. This helps developers become familiar with the tech stack, understand who are the owners of the different services, access documentation for each service directly from Port, keep track of deployments and changes made to a given service, etc.

Data Model Extension: Domain and System Integration

Given that these fundamental blueprints provide good visibility into the life cycle of each service, the model we just discussed can suffice. You can also take it one step further and extend the data model by introducing domain and system blueprints. Domains often correspond to high-level engineering functions, maybe a pivotal service or feature within a product.

System blueprints depict a collection of microservices that collectively enhance a segment of functionality provided by the domain. With the addition of these two blueprints, we can now see how a microservice fits into a greater app or piece of functionality, which gives the developer additional insight into how their microservice interacts with the greater tech stack. This information can be invaluable for speeding up the onboarding process for new developers, as well as making incidents easier to diagnose and debug, since the dependencies between microservices and products within the company are clearer.

This mirrors the Backstage C4 model, with the addition of the running service, which provides additional stateful information.

When we finish ingestion, we’ll have a fully populated MVP software catalog. Drilling down into an entity, we can understand dependencies, health, on-call data and more.

Internal developer portals aren’t only about a software catalog, containing microservice and underlying resource and DevOps asset data. They are mostly about enabling developer self-service actions. Let’s go ahead and do that.

First Developer Self-Service Action Setup

Internal developer portals are made to relieve developer cognitive load and allow developers to access the self-service section in the portal and do their work with the right guardrails in place. This is done by defining the right flow in the portal’s UI, and then by loosely coupling it with the underlying platform that will execute the self-service action, while still providing feedback to developers about their executed actions, such as logs, relevant links and the effects of the action on the software catalog. We can also show whether a self-service action is waiting for manual approval.

For the MVP, let’s define a self-service action for scaffolding a new microservice. This is what developers will see:

When setting up a self-service action, the platform engineer doesn’t just define the backend process, but also sets up the UI in the developer self-service form. By being able to control what the developer sees and can do, as well as permissions, we can allow developers to perform actions on their own within a defined flow, setting guardrails and relieving cognitive load.
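
As a very rough sketch of that pairing of form and backend, a scaffolding action might look like the following; the field names approximate Port’s action schema and should be checked against its docs, and the GitHub org, repo and workflow are hypothetical:

    identifier: scaffold_microservice
    title: Scaffold New Microservice
    userInputs:                     # the form developers see in the portal
      properties:
        service_name:
          type: string
          title: Service Name
      required:
        - service_name
    invocationMethod:               # the backend that actually executes the action
      type: GITHUB
      org: example-org              # assumed GitHub organization
      repo: platform-actions        # assumed repo holding the workflow
      workflow: scaffold.yml        # assumed GitHub workflow file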

Expanding the Data Model with Kubernetes Abstractions

We’ve begun by saying that we’re working in a Kubernetes native environment. Kubernetes knowledge is not common, and our goal is to abstract Kubernetes for developers, providing them with the information they need.

Let’s add the different Kubernetes resources (deployments, namespaces, pods, etc.) into our software catalog. This then allows us to configure abstractions and thus reduce cognitive load for developers.

When populated, the cluster blueprint will show its correlated entities. This will allow developers to view the different components that make up a cluster in an abstracted way that’s defined by the platform engineer.

To bring everything together, let’s create an “environment” blueprint. This will allow us to differentiate between multiple environments that are in the same organization and create an in-context view (including CI/CD pipelines, microservices, etc.) of all running services in an individual environment. In this instance we will create a test environment and also a production environment.

Now let’s build a relation between the microservice blueprint we made in our initial data model and the workload blueprint. This will allow us to understand which microservices are running in each cluster as workloads. A workload is a running instance of a microservice. This gives us an in-context view of a microservice, including which environments it is running in, meaning we now know exactly what is going on and where (Is a service healthy? Which cluster is a service deployed on?).

Generally, creating relations between blueprints can be compared to linking tables with a foreign key. You can customize the information you see on each entity or blueprint, thus modeling them to suit your needs exactly. You can build relations that are one-to-one or one-to-many. In our example, the link between workload and microservice is a one-to-one relation, as each workload is one deployment of one microservice.
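
Continuing the illustrative YAML from the blueprint sketch earlier, the workload-to-microservice link could be declared on the workload blueprint along these lines; again, the field names are assumptions:

    relations:
      microservice:
        title: Microservice
        target: microservice    # the blueprint this relation points at
        required: true
        many: false             # one-to-one: each workload runs exactly one microservice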

Let’s now create a relation between cluster and environment so that we know where we have running clusters. We could also expand this idea to a cloud region or environment, depending on the context.

Let’s also create a relation between microservice and system, and workflow run and workload. This allows us to see the source of every workload, as well as see what microservices make up the systems in our architecture.

And that’s it!

Scorecards and Dashboards: Promoting Engineering Quality

The ability to define scorecards and dashboards has proven to be of great significance within enterprises, as they help push initiatives and drive engineering quality. This is thanks to teams now being able to visualize service maturity and engineering quality of different services in a domain and thus understand how close or far they are from reaching a production-ready service.

In Conclusion

The highly discussed distinction between portal and platform fades away when put into practice. While one focuses on infrastructure and backend definitions, the other empowers developers to take control of their needs through a software catalog and self-service actions, and gives great insight into service and infrastructure well-being through scorecards and visualizations.

Want to try Port or see it in action? Go here.

The post How to Create an Internal Developer Portal MVP appeared first on The New Stack.

Next-Gen Observability: Monitoring and Analytics in Platform Engineering https://thenewstack.io/next-gen-observability-monitoring-and-analytics-in-platform-engineering/ Tue, 12 Sep 2023 11:00:21 +0000 https://thenewstack.io/?p=22717524

As applications become more complex, dynamic, and interconnected, the need for robust and resilient platforms to support them has become a foundational requirement. Platform engineering is the art of crafting these robust foundations, encompassing everything from orchestrating microservices to managing infrastructure at scale.

In this context, the concept of Next-Generation Observability emerges as a crucial enabler for platform engineering excellence. Observability transcends the traditional boundaries of monitoring and analytics, providing a comprehensive and insightful view into the inner workings of complex software ecosystems. It goes beyond mere visibility, empowering platform engineers with the knowledge and tools to navigate the intricacies of distributed systems, respond swiftly to incidents, and proactively optimize performance.

Challenges Specific to Platform Engineering

Platform engineering presents unique challenges that demand innovative solutions. As platforms evolve, they inherently become more intricate, incorporating a multitude of interconnected services, microservices, containers, and more. This complexity introduces a host of potential pitfalls:

  • Distributed Nature: Services are distributed across various nodes and locations, making it challenging to comprehend their interactions and dependencies.
  • Scaling Demands: As platform usage scales, ensuring seamless scalability across all components becomes a priority, requiring dynamic resource allocation and load balancing.
  • Resilience Mandate: Platform outages or degraded performance can have cascading effects on the applications that rely on them, making platform resilience paramount.

The Role of Next-Gen Observability

Next-Gen observability steps in as a transformative force to address these challenges head-on. It equips platform engineers with tools to see beyond the surface, enabling them to peer into the intricacies of service interactions, trace data flows, and understand the performance characteristics of the entire platform. By aggregating data from metrics, logs, and distributed traces, observability provides a holistic perspective that transcends the limitations of siloed monitoring tools.

This article explores the marriage of Next-Gen Observability and platform engineering. It delves into the intricacies of how observability reshapes platform management by providing real-time insights, proactive detection of anomalies, and informed decision-making for optimizing resource utilization. By combining the power of observability with the art of platform engineering, organizations can architect resilient and high-performing platforms that form the bedrock of modern applications.

Understanding Platform Engineering

Platform engineering plays a pivotal role in shaping the foundation upon which applications are built and delivered. At its core, platform engineering encompasses the design, development, and management of the infrastructure, services, and tools that support the entire application ecosystem.

Platform engineering is the discipline that crafts the technical underpinnings required for applications to thrive. It involves creating a cohesive ecosystem of services, libraries, and frameworks that abstract away complexities, allowing application developers to focus on building differentiated features rather than grappling with infrastructure intricacies.

A defining characteristic of platforms is their intricate web of interconnected services and components. These components range from microservices to databases, load balancers, caching systems, and more. These elements collaborate seamlessly to provide the functionalities required by the applications that rely on the platform.

The management of platform environments is marked by inherent complexities. Orchestrating diverse services, ensuring seamless communication, managing the scale-out and scale-in of resources, and maintaining consistent performance levels present a multifaceted challenge. Platform engineers must tackle these complexities while also considering factors like security, scalability, and maintainability.

Platform outages wield repercussions that stretch beyond the boundaries of the platform itself, casting a pervasive shadow over the entire application ecosystem. These disruptions reverberate, resulting in downtimes, data loss, and a clientele that’s both agitated and dismayed. The ramifications encompass more than just the immediate fiscal losses; they extend to a long-lasting tarnish on a company’s reputation, eroding trust and confidence.

In the contemporary landscape, user expectations hinge on the delivery of unwaveringly consistent and dependable experiences. The slightest lapse in platform performance has the potential to mar user satisfaction. This can, in turn, lead to a disheartening ripple effect, manifesting as user attrition and missed avenues for business growth. The prerequisite for safeguarding high-quality user experiences necessitates the robustness of the platform itself.

Enter the pivotal concept of observability — a cornerstone in the architecture of modern platform engineering. Observability serves as a beacon of hope, endowing platform engineers with an arsenal of tools that transcend mere visibility. These tools enable engineers to transcend the surface and plunge into the intricate machinations of the platform’s core.

This dynamic insight allows them to navigate the labyrinth of intricate interactions, promptly diagnosing issues and offering remedies in real-time. With its profound capacity to unfurl the platform’s inner workings, observability empowers engineers to swiftly identify and address problems, thereby mitigating the impact of disruptions and fortifying the platform’s resilience against adversity.

Core Concepts of Next-Gen Observability for Platform Engineering

Amidst the intricacies of platform engineering, where a multitude of services work in concert to deliver a spectrum of functionalities, comprehending the intricate interplay within a distributed platform presents an imposing challenge.

At the heart of this challenge lies a complexity born of a web of interconnected services, each with specific tasks and responsibilities. These services often span a gamut of nodes, containers, and even geographical locations. Consequently, tracing the journey of a solitary request as it navigates this intricate network becomes an endeavor fraught with intricacies and nuances.

In this labyrinthine landscape, the beacon of distributed tracing emerges as a powerful solution. This technique, akin to unraveling a tightly woven thread, illuminates the flow of requests across the expanse of services. In capturing these intricate journeys, distributed tracing unravels insights into service dependencies, bottlenecks causing latency, and the intricate tapestry of communication patterns. As if endowed with the ability to see the threads that weave the fabric of the platform, platform engineers gain a holistic view of the journey each request undertakes. This newfound clarity empowers them to pinpoint issues with precision and optimize with agility.

However, the advantages of distributed tracing transcend the microcosm of individual services. The insights garnered extend their reach to encompass the platform as a whole. Platform engineers leverage these insights to unearth systemic concerns that span multiple services. Bottlenecks, latency fluctuations, and failures that cast a shadow over the entire platform are promptly brought to light. The outcomes are far-reaching: heightened performance, curtailed downtimes, and ultimately, a marked enhancement in user experiences. In the intricate dance of platform engineering, distributed tracing emerges as a beacon that dispels complexity, illuminating pathways to optimal performance and heightened resilience.
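
A common, concrete way to wire up such tracing is an OpenTelemetry Collector that receives spans from services and forwards them to a tracing backend. A minimal sketch, with the backend endpoint as an assumption:

    receivers:
      otlp:                        # services send spans here via OTLP
        protocols:
          grpc:
          http:
    processors:
      batch:                       # batch spans to reduce export overhead
    exporters:
      otlp:
        endpoint: tempo:4317       # assumed tracing backend (Tempo or Jaeger, via OTLP)
        tls:
          insecure: true           # demo-only setting; use TLS in production
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]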

At the nucleus of observability, metrics and monitoring take center stage, offering a panoramic view of the platform’s vitality and efficiency.

Metrics, those quantifiable signposts, unfold a tapestry of data that encapsulates the platform’s multifaceted functionality. From the utilization of the CPU and memory to the swift cadence of response times and the mosaic of error rates, metrics lay bare the inner workings, revealing a clear depiction of the platform’s operational health.

A parallel function of this duo is the art of monitoring — an ongoing vigil that unveils deviations from the expected norm. The metrics, acting as data sentinels, diligently flag sudden surges in resource consumption, the emergence of perplexing error rates, or deviations from established patterns of performance. Yet the role of monitoring transcends mere alerting; it is a beacon of foresight. By continuously surveying these metrics, monitoring predicts the need for scalability. As the platform’s utilization ebbs and flows, as users and requests surge and recede, the platform’s orchestration must adapt in stride. Proactive monitoring stands guard, ensuring that resources are dynamically assigned and ready to accommodate surging demands.
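
In Prometheus terms, one such deviation alert might look like the rule below; the metric name and thresholds are illustrative assumptions:

    groups:
      - name: platform-alerts
        rules:
          - alert: HighErrorRate
            # fire when more than 5% of requests return a 5xx over the last 5 minutes
            expr: >
              sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
            for: 10m
            labels:
              severity: page
            annotations:
              summary: Error rate above 5% for 10 minutes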

And within this dance of metrics and monitoring, the dynamic nature of platform scalability comes to the fore. In the tapestry of modern platforms, scalability is woven as an intrinsic thread. As users and their requests ebb and flow, as services and their load vary, the platform must be malleable, capable of graceful expansion and contraction. Observability, cast in the role of a linchpin, empowers platform engineers with the real-time pulse of these transitions. Armed with the insights furnished by observability, the engineers oversee the ebb and flow of the platform’s performance, ensuring a proactive, rather than reactive, approach to scaling. Thus, as the symphony of the platform unfolds, observability lends its harmonious notes, orchestrating the platform’s graceful ballet amidst varying loads.

In the intricate tapestry of platform engineering, logs emerge as the textual chronicles that unveil the story of platform events.

Logs assume the role of a scribe, documenting the narrative of occurrences, errors, and undertakings within the platform’s realm. In their meticulously structured entries, they create a chronological trail of the endeavors undertaken by various components. The insights gleaned from logs provide a contextual backdrop for observability, enabling platform engineers to dissect the sequences that lead to anomalies or incidents.

However, in the context of multi-service environments within complex platforms, the aggregation and analysis of logs take on a daunting hue. With a myriad of services coexisting, the task of corralling logs spread across diverse nodes and instances becomes formidable. Uniting these scattered logs to craft a coherent narrative poses a challenge amplified by the sheer volume of logs generated in such an environment.

Addressing this intricate challenge are solutions that carve paths for efficient log analysis. The likes of log aggregation tools, with exemplars like the ELK Stack comprising Elasticsearch, Logstash, and Kibana, stand as guiding beacons. These tools facilitate the central collection, indexing, and visualization of logs. The platform engineer’s endeavors to search, filter, and analyze logs are fortified by these tools, offering a streamlined process. Swiftly tracing the origins of incidents becomes a reality, empowering engineers in the realm of effective troubleshooting and expedited resolution. As logs evolve from mere entries to a mosaic of insight, these tools, augmented by observability, light the way to enhanced platform understanding and resilience.
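
Staying with the Elastic stack example, a lightweight shipper such as Filebeat (a sibling of Logstash in that ecosystem) typically feeds the pipeline. A minimal sketch, with the paths and hosts as assumptions:

    filebeat.inputs:
      - type: filestream               # tail log files as a stream
        paths:
          - /var/log/app/*.log         # assumed application log directory
    output.elasticsearch:
      hosts: ["elasticsearch:9200"]    # assumed in-cluster Elasticsearch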

Implementing Next-Gen Observability in Platform Engineering

Instrumenting code across the breadth of services within a platform is the gateway to achieving granular observability.

Here are some factors to consider:

  • Granular Observability Data: Instrumentation involves embedding code with monitoring capabilities to gather insights into service behavior. This allows engineers to track performance metrics, capture traces, and log events at the code level. Granular observability data provides a fine-grained view of each service’s interactions, facilitating comprehensive understanding.
  • Best Practices for Instrumentation: Effective instrumentation requires a thoughtful approach. Platform engineers need to carefully select the metrics, traces, and logs to capture without introducing excessive overhead. Best practices include aligning instrumentation with key business and operational metrics, considering sampling strategies to manage data volume, and ensuring compatibility with observability tooling.
  • Code-Level Observability for Bottleneck Identification: Code-level observability plays a pivotal role in identifying bottlenecks that affect platform performance. Engineers can trace request flows, pinpoint latency spikes, and analyze service interactions. By understanding how services collaborate and identifying resource-intensive components, engineers can optimize the platform for enhanced efficiency.
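
One low-friction way to apply these practices on Kubernetes is auto-instrumentation via the OpenTelemetry Operator, which injects a language agent through a pod annotation rather than code changes. A hedged sketch, assuming the Operator and an Instrumentation resource are already installed in the cluster:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: example-app
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: example-app
      template:
        metadata:
          labels:
            app: example-app
          annotations:
            # asks the OTel Operator to inject Java auto-instrumentation;
            # inject-python, inject-nodejs and inject-dotnet variants also exist
            instrumentation.opentelemetry.io/inject-java: "true"
        spec:
          containers:
            - name: example-app
              image: registry.example.com/example-app:1.0.0   # placeholder image

Manual SDK instrumentation still offers finer-grained spans; the annotation route trades that granularity for near-zero developer effort.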

Proactive Monitoring and Incident Response

Proactive monitoring enables platform engineers to preemptively identify potential issues before they escalate into major incidents.

The proactive monitoring approach involves setting up alerts and triggers that detect anomalies based on predefined thresholds. By continuously monitoring metrics, engineers can identify deviations from expected behavior early on. This empowers them to take corrective actions before users are affected.

Observability data seamlessly integrates into incident response workflows. When an incident occurs, engineers can access real-time observability insights to quickly diagnose the root cause. This reduces mean time to resolution (MTTR) by providing immediate context and actionable data for effective incident mitigation.

Observability provides real-time insights into the behavior of the entire platform during incidents. Engineers can analyze traces, metrics, and logs to trace the propagation of issues across services. This facilitates accurate root cause analysis and swift remediation.

Scaling Observability with Platform Growth

Scaling observability alongside the platform’s growth introduces challenges related to data volume, resource allocation and tooling capabilities. The sheer amount of observability data generated by numerous services can overwhelm traditional approaches.

To manage the influx of data, observability pipelines come into play. These pipelines facilitate the collection, aggregation, and processing of observability data. By strategically designing pipelines, engineers can manage data flow, filter out noise, and ensure that relevant insights are available for analysis.
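
Sketching such a pipeline with the OpenTelemetry Collector: a memory limiter guards the collector itself, a probabilistic sampler filters out noise by keeping only a fraction of traces, and batching smooths export. The processor choices, sampling rate and backend endpoint are illustrative:

    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      memory_limiter:                 # must run first; protects the collector under load
        check_interval: 1s
        limit_mib: 512
      probabilistic_sampler:          # keep roughly 10% of traces to control volume
        sampling_percentage: 10
      batch:
    exporters:
      otlp:
        endpoint: backend:4317        # assumed downstream observability backend
        tls:
          insecure: true              # demo-only setting
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, probabilistic_sampler, batch]
          exporters: [otlp]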

Observability is not static; it evolves alongside the platform’s expansion. Engineers need to continually assess and adjust their observability strategies as the platform’s architecture, services, and user base evolve. This ensures that observability remains effective in uncovering insights that aid in decision-making and optimization.

Achieving Platform Engineering Excellence Through Observability

At its core, observability provides real-time insight into platform resource utilization. Metrics such as CPU usage, memory consumption and network latency reveal which resources sit idle and which are overloaded. Those insights let engineers allocate resources judiciously, walking the fine line between scaling and conserving, balancing and distributing load.

Observability is also a performance-tuning instrument. Traces and metrics expose latency spikes, excessive resource consumption and the service dependencies behind slowdowns. Armed with this data, engineers can diagnose performance constraints and inefficiencies, then fine-tune platform components toward optimal performance.

Real-world case studies make this impact tangible. They document reduced response times, streamlined operations and more consistent user experiences. These are not merely anecdotes; they show observability data feeding directly into engineering decisions and producing measurable performance gains.

Ensuring Business Continuity and User Satisfaction

Observability also acts as a safety net for the business, safeguarding continuity and elevating user satisfaction.

Platform outages unsettle business operations and erode user trust. Observability speeds incident identification and resolution: engineers use real-time insights to pinpoint the root causes underlying issues, so recovery is swift and downtime's impact is minimized.

Observability's value extends beyond business operations to the user experience itself. Platform health and user satisfaction move in tandem: a sluggish response, an error or an outright service outage fractures the user experience, spurring disenchantment and even churn. Observability data gives engineers a window into user interactions and sentiment, letting them align platform behavior with user expectations and take proactive measures that create positive experiences.

Case studies underline this in practice, showing how observability-driven optimizations translate into user satisfaction.

From smoothing checkout processes in e-commerce to fine-tuning video streaming experiences, these examples demonstrate observability's role in building user-centric platforms and in keeping business continuity and user contentment in step.

Conclusion

Observability isn’t a mere tool; it’s a mindset that reshapes how we understand, manage, and optimize platforms. The world of software engineering is evolving, and those who embrace the power of Next-Gen Observability will be better equipped to build robust, scalable, and user-centric platforms that define the future.

As you continue your journey in platform engineering, remember that the path to excellence is paved with insights, data and observability. Integrate observability into the DNA of your strategies, and your platforms will not only withstand complexity but emerge stronger, more resilient and ready to push the boundaries of what's possible.

The post Next-Gen Observability: Monitoring and Analytics in Platform Engineering appeared first on The New Stack.

]]>
Platform Engineering Demands a Product Mindset https://thenewstack.io/platform-engineering-demands-a-product-mindset/ Tue, 12 Sep 2023 10:00:46 +0000 https://thenewstack.io/?p=22717817

LONDON — If you build it, they won’t come. Successful platform engineers keep beating this drum. Less successful platform engineers

The post Platform Engineering Demands a Product Mindset appeared first on The New Stack.

]]>

LONDON — If you build it, they won’t come. Successful platform engineers keep beating this drum. Less successful platform engineers are stuck thinking they know best — after all, they are engineers so surely they must know what their fellow engineers want better than they do.

Platform engineering — the discipline dedicated to removing friction and frustration across the software development lifecycle in order to improve developer experience — demands a platform as a product mindset. This is where your internal developer platform is built not only with your internal developers “in [often back of] mind” but with demos and prototypes and continuous feedback throughout. Basically, your developers become your customers. And you want to make dang sure you’re building what they want because otherwise they won’t use it, and you’ll be back where you started, except having wasted everyone’s time, money and trust.

At CIVO Navigate, Syntasso’s Principal Engineer Abigail Bangser reflected on what it really means to adopt a platform-as-a-product mindset, and on where she has fallen short over her years in platform engineering roles.

Platform as a Team Topology

In math, a topology is a structure that holds strong despite being stretched and squashed, even under constant pressure and change. Recognizing that software development teams can be laid out in different ways that are more sustainable and flexible to change under pressure, Matthew Skelton and Manuel Pais created a series of principles and patterns they call team topologies.

“The shapes of teams look a little bit different throughout an organization and how they interact is a little bit different so they codified patterns that we’ve seen throughout the years,” Bangser explained.

The shape of teams may vary, but most organizations share common types of teams. What team topologies call stream-aligned teams, usually referred to as application development teams, focus on products and features that deliver value to the end user.

“In an ideal world, 100% of engineers would be working on that thing, because that’s what customers pay for,” she said. “But the reality is that you can’t have 100% of people working on customer-facing features because there’s a lot of underlying requirements that they have to depend on.” And with the continually more complex cloud native landscape, your dependencies have dependencies, which is why you have other teams to support those value-driving app teams.

Most organizations also have enabling teams that bring specialization in areas like testing, agile coaching or databases. There are also complicated-subsystem teams that bring specialization in internal features like AI or security. Finally, there’s the platform team, which grows to provide the underlying services that most if not all app teams use to release that value.

In the age of DevOps and continuous delivery pipelines, feature teams have come to rely more and more on tooling created by these supporting, cross-functional teams in order to release their own code to production.

All three of these teams work to reduce the cognitive load of stream-aligned teams, with platform teams often abstracting and combining the work of the enabling and complex subsystem teams into easier to consume workflows and toolchains.

Collaboration vs. Flow State

Team topologies also highlight three ways in which the platform team interacts with its app team customers:

  1. Collaboration — person-to-person interaction, like asking a specific person for access or permission to do something such as provisioning a container. When you’re a baby startup with everyone sitting next to each other, this mode of interaction can be good enough. At any company size, it’s also good for brainstorming and testing out new ideas. However, collaboration does not scale, and it can be a major source of friction, often slowing down deployment and interrupting flow states.
  2. Facilitation — a one-to-many interaction pattern where documentation, training and communities of practice spread information and knowledge. This scales, but only at a human scale. And developers notoriously put off writing documentation and often don’t have time for extra training.
  3. X-as-a-Service — a shift away from human-to-human interaction, where your developers interact with an internal platform via an API. Done with your development-team customers in mind, this can scale nearly without limit.

“As a topology, platform teams are often operating all three of these modes at different points in their time,” Bangser said. All three ways are important and, in the world of platform engineering, not everything should be automated, lest you become too distanced from your customers.

“I think that platform engineering has a lot of potential but I don’t think we always take advantage of that potential.” In the end, Bangser contends that platform engineering initiatives most often falter when platform teams don’t see themselves as product teams.

The Trial and Error of Platform Engineering

Bangser reflected on her eight years of building platforms across a dozen companies. At one scale-up, not uncommonly, there were several app teams dedicated to delivering business value, which were supported by other teams like customer support, finance and marketing. Perhaps more unusually, each different feature delivery team had a different Amazon Web Services account.

“When we were a small company, we wanted to allow people autonomy and to move fast and easy, but we also wanted to make sure that we didn’t risk permissions issues and stomping on each other. The fastest and easiest way to get people going was accounts,” Bangser said. As teams grew and split, new AWS accounts were opened.

Provisioning a new AWS account kicked off with a request via Jira, which the platform team picked up; they consulted the runbook in Confluence, then returned to the feature team with a properly configured account. This process was “good enough” right up until finance started asking why it was getting expense reports and requests to reimburse personal credit cards for AWS charges, a classic sign of shadow IT. And we all know how cloud costs can add up.

Bangser’s team went down the Jira data rabbit hole. As her Syntasso colleague Paula Kennedy previously told The New Stack, Jira is a great first step for platform teams to uncover repeated work, widespread pain and lengthy bottlenecks.

They realized not only that the financial tracking was less than ideal, but that it was taking two weeks to get these teams up and running with their new AWS accounts.

“So we looked at our process, and we realized that the thing that was causing the issue was this manual runbook,” Bangser said. “It’s not just that it was manual. It was tiring for us to do it because people weren’t confident with it. And it was a pain in the butt and it took a lot of time and so people avoided it.” All of which added to the platform team’s backlog.

So they automated the runbook, which initially seemed to limit unexpected expenses, pleasing finance. It also removed a frustrating, repetitive, manual task for the platform team, which meant less toil and a lower risk of errors.

“We’re pretty happy with ourselves because now, when a ticket comes into JIRA, we click one button, run one script and we’re back out the door,” Bangser said. “It reduced our time to market a lot as an individual team.”

Three months later, finance returned to the problem, realizing it hadn’t really been solved for them. They were still getting unexpected expense reports.

“The problem was, we had fixed a problem, not the problem,” she said. “Instead of a technical implementation point of view — how can we speed things up? — we needed to look at this from a customer point of view: What is it that the customers who are trying to create accounts need, and how can we deliver that as a product team?”

They were correct that this two-week lag time was a problem. But they were centering the solution on the platform engineer’s experience, not the software developer’s motivations.

Jobs-to-Be-Done Framework

The platform team went on a journey of discovery following the Jobs-to-be-Done Theory, the outcome-driven innovation framework for defining, categorizing, capturing, and organizing customer needs. Remember, for platform teams, your colleagues are your customers.

“What the Jobs-to-be-Done Theory says is: No matter how great your data is about people, if you don’t understand what motivates them and what they need to complete by using your product, you’re going to be insufficient in solving problems,” Bangser explained.

This strategy developed by Tony Ulwick at the turn of the century argues that demographics are not the most important information about your prospects. What matters is answering: What job are they trying to do?

There are four characteristics to jobs, explained Bangser:

  1. Solution agnostic — there can be many ways to complete that job.
  2. You need to complete the job — progress must be made.
  3. A job is stable over time — you can innovate to do the job better, not for the sake of innovation.
  4. No need is just functional — there are social and emotional aspects too. Indeed, platform engineering is always a sociotechnical endeavor.

Jobs vary, as she explained using everyday life examples, and can be:

  • One-off or unexpected — breaking a bone.
  • Regular, repeated, or expected — tax season.
  • Small — making dinner.
  • Big — moving house.

An app team will (hopefully) adopt an internal developer platform to get the job of operations done.

“When it comes to internal platforms, we need to learn about what jobs our customers — these application developers — need to achieve and…how they want to build and operate their software,” Bangser said. The team needed to ask: Why are you creating an AWS account?

Like all good customer relationships, this kicks off with a conversation, taking maybe 15 minutes out of your day.

“You’re building a relationship. Show that you care about what they’re trying to do, and actually care about what they’re trying to do because they are your customers, right? You don’t want to tell them they’re right or wrong. You don’t want to problem-solve with them. You just want to hear from them.” — Abigail Bangser, Syntasso

They realized that different app teams had different jobs to get done which led them to open an AWS account. They could be splitting teams as they scaled beyond two-pizza-sized. Some projects wanted to get to production more quickly. Some wanted to duplicate a project to launch in a new country. Others wanted to support more authentication options across all products.

Bangser and her team realized that an AWS account was a means to an end: “They didn’t really know why they needed an account, when all they actually needed was a source of documents in S3 [AWS object storage], or all they actually needed was access to a server and that was all they wanted.”

They also realized that teams were still circumventing the platform-led path to cloud access because Jira intimidated them and it was easier to copy an old ticket, throw it on their personal credit cards and file for reimbursement or to manually reach out to a friend on the platform team to help them get the job done.

“They needed services, not accounts.” She explained that these app developers, “weren’t pros at using a cloud. They wanted to have this [process] much more approachable and much more usable for their use cases.”

Prototyping without Judgment

The platform team started brainstorming solutions.

“We had lots of big visions and ideas for where the platform would go,” Bangser said, “but finance was on a time budget and we had to get a solution ASAP.”

They settled on eight possible solutions:

  • Simplify Jira.
  • Create a Slackbot interface.
  • Create a buddy system where existing confident users become mentors to newer users.
  • Platform team offers the services.
  • Change pull request services.
  • Pair programming with app teams and platform engineers.
  • Build configuration templates.
  • Absorb accounts.

Their product value mapping showed Jira as the cheapest and fastest pathway to value delivery.

Then they compared the solutions, considering cost versus value in terms of improving the platform offering. And settled on simplifying Jira as a way to balance the need to quickly appease finance with the need to invest in clearer interfaces for the application teams.

“Except Jira can be really hard to automate,” she said, to a knowingly chuckling audience. “We ran into the same problem as we run into with any code, which is that we want fast feedback but that’s not always easy to get.” Because code is expensive and takes time to write, test, implement and maintain, she explained, especially if you’re not sure it’s what your customers want.

“You have to give people something tangible to look at to get realistic feedback. If you tell them an idea, they either are sort of tuned out or they don’t really know how to respond to you,” she continued.

Rapid Feedback De-Risks Decisions

So they made a Jira prototype. And the feedback was that, in their bid to simplify, they were actually creating more stress. The prototype simply asked developers to identify an account type, but it didn’t explain what that even meant and the app teams — who knew both the organizational constraints and the tool’s complexity — didn’t even believe it was an accurate depiction of reality.

They went back to their ideation finalists, accepting that they’d have to go with something that would cost more time and money, and chose the chatbot. As Bangser clarified in an interview: “This allowed for more distance from the existing interfaces that caused friction as well as a more interactive feedback loop for users when making requests.”

This second ideation got positive feedback and they went ahead and implemented this chatbot-based platform engineering solution.

“Talk to your customers and try and get faster feedback loops,” she reminded the CIVO Navigate audience. “Platforms are viewed as mandatory, even if they shouldn’t be. Products don’t get feedback if they’re mandatory.”

The post Platform Engineering Demands a Product Mindset appeared first on The New Stack.

]]>
How Platform Engineering Can Help Keep Cloud Costs in Check https://thenewstack.io/how-platform-engineering-can-help-keep-cloud-costs-in-check/ Mon, 11 Sep 2023 17:18:27 +0000 https://thenewstack.io/?p=22717751

This is the fifth part in a series. Read Part 1, Part 2, Part 3 and Part 4. Picture being

The post How Platform Engineering Can Help Keep Cloud Costs in Check appeared first on The New Stack.

]]>

This is the fifth part in a series. Read Part 1, Part 2, Part 3 and Part 4.

Picture being in a never-ending cycle of cloud costs that keep piling up, no matter what you do. You’re in good company. Most businesses are stuck in the same loop, using the same old audits and tools as quick fixes. But let’s face it, that’s just putting a Band-Aid on a problem that needs surgery.

Now, we all know audits and quick reviews are essential; they’re like the routine checkups we need to stay healthy. But when it comes to cloud costs, those checkups aren’t enough. They might spot the immediate problems, but they rarely dig deep to find the root cause. It’s time to think longer term.

Instead of just putting out fires, why not prevent them in the first place? A more sustainable approach to managing cloud costs is to focus on building an efficient system from the ground up. This isn’t about quick fixes; it’s about laying a strong foundation that prevents issues down the road.

The good news: Platform engineering is still being written, and that presents an opportunity for its practitioners to help you do exactly that. Think of it as designing a new toolkit for smarter, more efficient cloud management. With platform engineering, your team gets access to high-level tools that go beyond patching holes; they help you map out a well-planned route through the confusing world of cloud costs.

Attempted Solutions and Reactive Approaches

The moment the cloud cost alarm bells start ringing, specialized centralized teams or “war rooms” are often created to manage the response. These teams look closely at cost reports, figure out which department is spending too much, and then tell them to cut back. Here’s how it typically goes down:

  • By audit: Relying on audits to identify areas of excessive spending. Continuous audit cycles are used to understand and potentially optimize cloud costs. It’s often seen as a never-ending process.
  • Manual oversight: The centralized team is responsible for scrutinizing cost dashboards, identifying responsible teams for various infrastructure parts and informing them to take corrective action.
  • Project tracker: A project tracker is created to monitor the cost-reducing activities and to keep all stakeholders updated.
  • Tools and anomaly detection: Specialized tools that offer better analysis and anomaly detection capabilities are deployed, with some even allowing automated actions.
  • Ops team responsibility: Typically, the operations team handles the burden of cost management, but they are often lean and already over-burdened with other critical tasks.

The problem? All of these steps are more reactive than proactive, and prone to toil. They focus on trimming existing costs — often described as cutting the fat — rather than building a cost-efficient system from the start. The result is a strategy that’s more about short-term gains than long-term sustainability.

Further, in the world of cloud native apps, ops teams alone can’t take optimization beyond a certain point. Service and architectural enhancements by application developers deliver the biggest results in the long run. But today’s system isn’t inclusive enough.

So, how do we break this cycle? By shifting the focus from immediate cost-cutting to long-term financial health. That means adopting strategies that don’t just react to problems as they arise but prevent them from happening in the first place.

Platform Engineering: The Linchpin

This is where platform engineering comes in. The platform engineering team is responsible for laying down the path not only to make developers own their cost, but also to inherently control costs. Here’s how platform engineering contributes to cloud cost sustenance:

Sharing ownership and accountability: Platform engineers need to let go of sole control over cost ownership and instead create a collaborative experience in which developers share ownership.

Building cost-efficient golden paths: The platform engineering team’s first order of business is to lay down golden paths engineered to be cost-efficient from the start. This becomes the playground for developers to experiment and build, but cost control isn’t just nice to have; it’s a must-have.

Providing developer-friendly cost breakdowns: The platform gives developers the tools to see costs broken down in a language they understand. The platform should present a zoomed-in view that allows each development team to see only the costs related to the resources they’re directly managing. This focus helps teams zero in on costs specific to their own projects or services.

Providing smart cost correlation: Understanding the “why” behind the costs is as crucial as knowing the “what.” The platform lets developers tie costs to specific runtime metrics like “utilization” or business metrics like “number of transactions,” paving the way for smarter decision-making.

Assigning budgets: Setting a budget shouldn’t feel like walking a tightrope. The platform allows teams to set up budgetary guardrails for different resources and activities. If you’re about to go over budget, consider yourself notified or even restricted — keeping costs in check.

Ability to prevent leaks: Unused or underutilized resources are silent budget killers. The platform should be designed to catch these so-called “leakages” early in the software development life cycle, before they drain your budget.
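On Kubernetes-based platforms, one concrete form these budgetary guardrails can take is a per-team ResourceQuota. The namespace and limits below are illustrative assumptions, not prescriptions:

```yaml
# Sketch: a namespace ResourceQuota acting as a budget guardrail.
# Requests beyond these caps are rejected, surfacing cost pressure
# early instead of on next month's cloud bill.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-checkout-quota
  namespace: team-checkout       # hypothetical team namespace
spec:
  hard:
    requests.cpu: "20"           # total CPU the team may request
    requests.memory: 64Gi        # total memory the team may request
    persistentvolumeclaims: "10" # cap storage claims to limit leaks
```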

In essence, platform engineering aims to create a symbiotic relationship between developers and their cloud environment. It’s not just about empowering developers; it’s about making them conscientious stewards of their resources. This fosters a culture where cost efficiency and developer freedom coexist, setting clear guidelines for how to manage both effectively.

Developer Responsibilities

In a world powered by platform engineering, treating cost as an afterthought just won’t cut it. Developers need to elevate cost to the VIP status of “first-class citizen” in their sprints, right next to other big-league players like performance and availability.

Be your own landlord: Owning cloud infrastructure, including services and resources, isn’t just a responsibility, it’s a necessity. With ownership comes the imperative of constant vigilance: Developers need to be on top of monitoring both costs and resource use, around the clock.

Budget mastery: Staying within the lines of a coloring book is basic; doing the same with budgets is an art. Developers must stick to the budget frameworks laid out by the platform engineering team, while making sure cost-optimization tasks don’t get pushed to the back burner during sprints.

Business-metrics harmony: Translating cloud costs into business speak is a win-win. Developers should align their resource utilization metrics with tangible business outcomes. Want to know the cost of a single business transaction or operation? That’s the kind of clarity this alignment can offer.

Resource optimization: Don’t let resource “leakages” turn into resource “floods.” Developers should break down the attributed cost to pinpoint and plug these leakages, and to fine-tune the overall resource landscape for optimal efficiency.

Innovation: Many cost-optimization projects are really tweaks to your service performance and architecture, and they can lead to tremendous results.

Keep the dialogue going: A fruitful partnership with the platform engineering team isn’t a one-off event; it’s an ongoing conversation. Developers should keep the lines of communication open to continuously refine tools, metrics and best practices for sustainable cloud management.

By taking ownership of these responsibilities, developers aren’t just lightening the load on the Ops team; they’re stepping up as co-pilots in navigating the cloud cost landscape. It’s a team effort aimed at achieving a leaner, more efficient cloud without compromising on performance or possibilities.

In a Nutshell

| Criteria | Cloud Optimization by Audit | Cloud Sustenance |
| --- | --- | --- |
| Objective | To reduce immediate costs through audits and one-time actions. | To maintain a sustainable, cost-effective architecture by design. |
| Methodology | Audit-based, reactionary measures taken after costs have escalated. | Planning, plus a set of practices and mechanisms for long-term sustainability. |
| Primary responsibility | A centralized team or ops team usually handles this through audits and dashboards. | Both platform engineering teams and development teams are responsible for cost management. |
| Impact | Short-term cost reduction. | Long-term efficiency and cost-effectiveness. |
| Continuity | Generally a recurring but isolated exercise. | Integrated into development sprints and long-term planning. |

While audit-based cloud optimization might offer a rapid-fire way to trim costs, let’s be honest — it’s a reactive, temporary solution mostly overseen by operations teams. And because it often sprawls across the entire cloud, pinpointing who’s responsible for what in the cost-saving equation can get muddled.

On the flip side, cloud sustenance is a proactive, long-game approach that zeroes in on specific projects, distributing cost responsibilities across developers, platform engineers and operations.

While the journey toward sustainable cloud management needs everyone on board, the upfront time and resources invested pay off big time. We’re talking about a cloud ecosystem that’s built for long-term efficiency and resilience. So why not invest a little more now for peace of mind later?

The post How Platform Engineering Can Help Keep Cloud Costs in Check appeared first on The New Stack.

]]>
7 Benefits of Developer Access to Production  https://thenewstack.io/7-benefits-of-developer-access-to-production/ Thu, 07 Sep 2023 15:12:10 +0000 https://thenewstack.io/?p=22717636

Platform engineering has emerged as a game-changer in modern software development. It promises to revolutionize the developer experience and deliver

The post 7 Benefits of Developer Access to Production  appeared first on The New Stack.

]]>

Platform engineering has emerged as a game-changer in modern software development. It promises to revolutionize the developer experience and deliver customer value faster through automated infrastructure operations and self-service capabilities. At the heart of this approach lies a critical aspect: developer access to production.

Let’s explore why providing developers access to production environments is crucial for their productivity and the success of the product, and how it aligns perfectly with the principles of platform engineering. Also, secure and controlled access to production for developers can significantly benefit operation teams by reducing their operational burden on repetitive tasks, allowing them to prioritize resources on high-value tasks such as infrastructure scaling or security enhancements.

Productivity Enabler

When developers have access to production environments, they can directly interact with the real-life systems they build and deploy. This access translates to greater productivity, reducing the need for communication with separate operations teams to diagnose and resolve issues. This firsthand interaction means that developers can instantly diagnose, troubleshoot and rectify anomalies or inefficiencies they detect without waiting for feedback or navigating bureaucratic processes.

By immersing themselves in the production environment, developers gain invaluable insights, identify potential bottlenecks and fine-tune their code with real-world data. The result is faster iterations and more efficient development processes, which are difficult to achieve in a local development environment that usually cannot fully reproduce production behavior.

Faster Issue Resolution

Every minute counts in the fast-paced world of software development. Hence, delays in addressing issues can lead to considerable setbacks.

When developers have access to production systems, they can swiftly diagnose and address issues as they arise, minimizing mean time to resolution (MTTR). This capability is especially beneficial during high-pressure situations such as system outages, where developers’ firsthand experience with the codebase usually means getting to the problematic components faster and knowing exactly which logs, events or data to gather to troubleshoot and diagnose the problem.

This ability to troubleshoot and debug in real time not only reduces downtime but also leads to improved overall system stability, as it makes it easier to predict potential system bottlenecks or failures. Developers can provide insights into future updates or changes that might affect system performance, allowing operations teams to prepare in advance.

Ownership and Accountability

Granting developers access to production fosters a sense of ownership and accountability. When development teams are responsible for their product’s performance and reliability, they take more significant ownership of its success. This sense of responsibility drives them to deliver high-quality code and actively participate in maintaining the application’s health.

Well-regulated access to production should lead to a shared responsibility model between the development and operation teams, as the responsibility for system health and uptime becomes a shared endeavor. This collaborative approach ensures that developers and operations teams are aligned in their goals, reducing the likelihood of miscommunication or misaligned priorities.

Empowering Innovation

Developers are at their creative best when they can explore and experiment freely. By providing access to production, organizations enable their development teams to innovate and push boundaries. Developers can directly prototype and validate new features in the production environment, leading to more creative and innovative solutions.

Feedback Loop Improvement

In the traditional setup, feedback from operations teams might take time to reach developers. With direct access to production, however, developers receive instant feedback on their code’s impact, performance and scalability by gathering logs, data samples and events. The real-time data and insights from the live environment also empower developers to make informed decisions, refine their code based on actual user interactions and iteratively improve the software.

This feedback loop enables continuous improvement, leading to faster and more reliable updates. This direct involvement not only streamlines the development and maintenance processes but also ensures that solutions are tailored to real-world demands and challenges, leading to faster development cycles and reduced time to market.

Empowering the Operation Team

In most traditional setups, the operation teams act as gatekeepers to production. While this helps in protecting the production environment from certain risks, it also forces the operations team to engage in repetitive tasks, like gathering logs and events, tweaking configurations, analyzing payloads, etc. By granting controlled access to production to developers, operations teams can reduce the existing burden and enhance the overall team’s productivity. Operations teams can focus more on strategic tasks and proactive system improvements rather than being bogged down with routine troubleshooting.

In essence, granting developers access to production paves the way for a more symbiotic relationship between them and operations teams. It promotes collaboration, fosters knowledge exchange and, most importantly, ensures that both teams work harmoniously toward a singular goal: delivering a seamless and resilient user experience.

Cost Efficiency

When developers can debug directly in production, organizations can significantly reduce logging costs and circumvent costly redeployments or new CI/CD cycles run merely to add log lines. This direct access speeds up issue resolution and eliminates unnecessary spending on reiteration cycles. Cost optimization also benefits operations teams: with developers resolving certain issues autonomously, operations teams can better allocate their resources and prioritize tasks that demand their specific expertise.

Overcoming Challenges through Developer Observability 

Lightrun’s Developer Observability Platform streamlines the debugging process in production applications through dynamic log additions, metrics integration and virtual breakpoints without requiring code changes, application restarts or redeployment.

Lightrun’s platform facilitates developer access to production via:

  • Dynamic logs, which allow developers to add new log lines anywhere in the production codebase without writing new code or redeploying the application, and without losing state.
  • Snapshots, which are virtual breakpoints that provide typical breakpoint functionalities without stopping execution, allowing them to be used directly on production. Once a snapshot is captured, the developer can view the captured data and act on it.
  • Metrics, which can monitor production applications in real time and on demand. They can, for example, monitor the size of data structures over time, allowing users to find bugs that can be reproduced only on the production system.

How Lightrun Overcomes the Challenges Associated with Production Access

While granting developers access to production has advantages, it also poses challenges in security, auditing and data confidentiality. Here’s how Lightrun addresses them:

  • Security: Lightrun implements robust security measures and access controls to prevent unauthorized access and mitigate risks, ensuring controlled and safe developer access to production.
  • Auditing and compliance: Its comprehensive audit system facilitates continuous compliance monitoring, simplifies auditing processes and ensures adherence to industry standards.
  • Data confidentiality: It safeguards sensitive data in production environments, preventing exposure in logs and snapshots. This enables developers to work with production data securely and compliantly.
  • Controlled access management: Lightrun enables organizations to define precise access controls for users and roles, creating a secure and collaborative development environment.

Conclusion

Allowing developers access to production environments is a cornerstone of platform engineering. It empowers them with the tools to create, innovate and maintain their products more efficiently, ultimately benefiting the entire organization and its customers.

Granting developers access to production is pivotal for productivity and product success, and a robust platform like Lightrun represents a powerful enabler for this strategy.

The post 7 Benefits of Developer Access to Production  appeared first on The New Stack.

]]>
How to Pave Golden Paths That Actually Go Somewhere https://thenewstack.io/how-to-pave-golden-paths-that-actually-go-somewhere/ Wed, 06 Sep 2023 17:53:04 +0000 https://thenewstack.io/?p=22717592

More than ever, software engineering organizations are turning to platform engineering to enable standardization by design and true developer self-service.

The post How to Pave Golden Paths That Actually Go Somewhere appeared first on The New Stack.

]]>

More than ever, software engineering organizations are turning to platform engineering to enable standardization by design and true developer self-service. Platform engineers build an internal developer platform (IDP), which is the sum of all the tech and tools bound together to pave golden paths for developers. According to Humanitec’s CEO Kaspar von Grünberg, golden paths are any procedure “in the software development life cycle that a user can follow with minimal cognitive load and that drives standardization.” Golden paths have long been discussed as an important goal of successful platform (and DevOps) setups.

Grünberg’s PlatformCon 2023 talk, “Build golden paths for day 50, not day 1!”, dove into how and why software engineering organizations should shift their focus to golden paths for the long term, complete with specific examples. Let’s explore the problem with the way most platform teams approach golden paths, how platform teams can fix their priorities and what scalable golden paths actually look like.

The Problem with Most Golden Paths? Bad Priorities

When deciding which golden paths to build and in what order, too many organizations make whatever comes first in the application and development life cycle their top priority. They start optimizing processes that only take place on Day 1 of the application life cycle, like how to create a new application or service via scaffolding. However, when evaluating the entire life cycle of an application, it’s clear that golden paths for Day 1 don’t go that far. Prioritizing golden paths for Day 2 to 50 (or day 1,000, for that matter) has a much larger impact on developer productivity and business performance.

Grünberg started studying the practices of top-performing engineering organizations years before platform engineering was on everyone’s radar. He has long considered this prioritization failure one of the top 10 fallacies in platform engineering, writing: “Of the time your team will invest in an application, the creation process is below 1%.” In his view, the return on investment (ROI) on this small part of the chain is too small to justify investing in its golden paths first. Organizations should instead invest in golden paths for Day 50 and beyond.

Lessons from Netflix’s Platform Console

The first iterations of Netflix’s federated platform console, which is a developer portal, demonstrate that not all golden paths are created equal. Senior software engineer Brian Leathem shared that one of the platform team’s original goals was to “unify the end-to-end experience to improve usability.”

Through user research, Leathem’s team found that developers were struggling with the high volume and variety of tools distributed across their workflows. They also found that limited discovery was hurting both new and tenured developers, who had difficulty onboarding or were unaware of new offerings that would improve their existing workflows. The solution they chose was a platform console, or as Leathem described it, a “common front door” for developers.

They adopted the Backstage plugin UI so they could invest their development resources into building custom UI components for the Netflix portal. The result was a portal in which users could manage their software across the software development life cycle in a single view. They introduced “collections,” or fleets of services for which the developer wants to view and assess performance together, to ease the burden of managing multiple services and software. They decided to use a golden path (Leathem used the term “paved road”) to tackle the discoverability problem only.

To start, the golden path was a static website that featured all documentation and recommended appropriate tools for the problems developers were solving. The goal was to weave the golden paths into the console to more deeply integrate documentation with its corresponding running services. Further down the line, Leathem’s team also hoped to build functionality for developers to create, modify and operate services through the console.

In feedback on the first iteration of the platform console, Netflix developers said the “View and Access” experience was not compelling enough for them to abandon their old habits and routines. In response, the platform team switched their focus to end-to-end workflows not available with existing tooling to keep users returning to the console. In Leathem’s PlatformCon 2023 talk, he said the approach significantly boosted the number of recurring users on the console.

Netflix’s example demonstrates that platforms need more than the developer portal component to be compelling to users. Developers want golden paths for end-to-end workflows.

Furthermore, usability is one of many problems a platform can improve. For example, an organization can design golden paths that improve usability, productivity and scalability by focusing on end-to-end workflows. Golden paths for different workflows can enable standardization by design and true developer self-service.

How to Prioritize Potential Golden Paths

With more of an application’s life cycle to cover, golden paths for Day 50 can be daunting to prioritize. Inspired by his research, von Grünberg proposed a simple exercise to help platform teams set priorities based on how frequently a procedure occurs and how much developer and operations time, including waiting, it consumes. The table below is an example of what this analysis could look like based on an evaluation of 100 deployments.

| Procedure | Frequency (% of deployments) | Dev time in hours (including waiting and errors) | Ops time in hours (including waiting and errors) |
| --- | --- | --- | --- |
| Add/update app configurations (such as env variables) | 5%* | 1* | 1* |
| Add services and dependencies | 1%* | 16* | 8* |
| Add/update resources | 0.38%* | 8* | 24* |
| Refactor and document architecture | 0.28%* | 40* | 8* |
| Waiting due to blocked environment | 0.5%* | 15* | 0* |
| Spinning up environment | 0.33%* | 24* | 24* |
| Onboarding devs, retrain and swap teams | 1%* | 80* | 16* |
| Rollback failed deployment | 1.75% | 10* | 20* |
| Debugging, error tracing | 4.40% | 10* | 10* |
| Waiting for other teams | 6.30% | 16* | 16* |

*per 100 deployments

From this table, organizations can gain a holistic view of the processes their golden paths need to address.

Since von Grünberg shared this exercise in early 2022, he says that the explosive growth of the platform engineering community has enabled him to observe the most common and pressing pain points across thousands of top engineering organizations and learn successful approaches to soothing them. These insights were valuable in understanding what types of Day 50 processes are the most important for platform teams to optimize and how best to optimize them. He found that tackling the most pressing pain points with golden paths first consistently netted the best ROI. More importantly, he learned that most organizations’ pain points with these processes had the same root cause and could be mitigated in large part by addressing that common cause directly.

The Universal Pain Point: Static Configuration Management

The problem in question is that most organizations have IDPs that let developers deploy an updated image from one stage to another only when the application’s infrastructure does not change. Their configuration files are static, manually scripted against a fixed set of environments and infrastructure, and as a result they are prone to errors or excessive overhead beyond the simplest use cases.

With static configuration management, rolling back, changing configs, adding services and dependencies, and similarly complex tasks are arduous for developers. They can either choose to manage infrastructure themselves, reducing their time spent coding and creating shadow operations, or they could submit a ticket to ops, increasing their waiting times and exacerbating existing bottlenecks.

With static configuration management, neither developers nor ops win. Therefore, golden paths that address the challenges of static configuration management have greater potential to optimize a much larger range of processes and at scale.

Dynamic Configuration Management: The Key to Scalable Golden Paths

Instead of settling for static configuration management, organizations should enable dynamic configuration management (DCM). DCM is “a methodology used to structure the configuration of compute workloads. Developers create workload specifications describing everything their workloads need to run successfully. The specification is then used to dynamically create the configuration, to deploy the workload in a specific environment.” With DCM, developers aren’t slowed down by the need to define or maintain any environment-specific configuration for their workloads. DCM drives standardization by design and enables true developer self-service.

The Humanitec Platform Orchestrator, in combination with the workload specification Score, enables DCM by following an RMCD (read, match, create, deploy) pattern: it reads and interprets the workload specification and context, matches the correct configuration templates and resources, creates application configurations and deploys the workload into the target environment wired up to its dependencies. A platform orchestrator is the core of any enterprise-grade IDP because it enables platform teams to enforce organization-wide standards with every git push.

Examples of Scalable Golden Paths

In his PlatformCon 2023 talk, von Grünberg shared a few examples of how a platform orchestrator can facilitate the creation of impactful and scalable golden paths. These examples are also featured in Humanitec’s IDP reference architecture for AWS-based setups.

Simple Deployment to Dev

For example, consider a golden path that enables developers to deploy the changes made on a workload to dev more efficiently and consistently.

Let’s say a developer wants to deploy a change on their workload to dev. All the developer has to do is modify the workload and git-push the code. From there, the CI pipeline picks up the code and runs; the image is built and stored in the image registry.

The workload source code contains the workload specification Score:
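The spec embedded at this point in the original article did not survive extraction. A minimal Score file consistent with the description that follows might look like this sketch, in which the workload name and container image are assumptions:

```yaml
# Sketch of a Score workload specification. The metadata and image are
# hypothetical; the resources section matches the description below.
apiVersion: score.dev/v1b1
metadata:
  name: my-workload
containers:
  main:
    image: registry.example.com/my-workload:latest
resources:
  db:
    type: postgres
  storage:
    type: s3
  dns:
    type: dns
```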

In this example, the resources section of the workload specification states that the developer requires a database type Postgres, a storage type of S3, and a DNS type of DNS.

After the CI run completes, the platform orchestrator reads the context and looks up which resources are matched against it. It checks whether the resources have been created and reaches out to the Amazon Web Services (AWS) API to retrieve the resource credentials. The target compute in this architecture is Amazon Elastic Kubernetes Service (EKS), so the platform orchestrator creates the app configs in the form of manifests. Then the platform orchestrator uses Vault to deploy the configs and inject the secrets at runtime into the container.

Deployments like this happen all the time, so optimizing this process makes a major difference for developers and the business at large.

Create a New Resource

In a static setup, many golden paths fail when faced with a developer request the system is unfamiliar with. With DCM, everything is repository-based, and developers can extend the set of available resources or customize them.

For example, if a developer needs ArangoDB but it isn’t known to the setup so far, they can add a resource definition to the general baseline of the organization. This way, the developer has easily extended the setup in a way that can be reused by the next developer.

Update a Resource

Updating a resource is a great example of how platform engineers use a platform orchestrator to maintain a high degree of standardization.

Let’s say you want to update the resource “Postgres” to the latest Postgres version.

With a dynamic approach to golden paths, the “thing” you need to update is only the resource definition where Postgres is specified.
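To illustrate, a resource definition along these lines might pin the Postgres version in one place. This sketch only gestures at the idea; the field names and apiVersion are assumptions and may not match Humanitec’s actual schema:

```yaml
# Illustrative sketch only: a resource definition that standardizes
# Postgres for all matching workloads. Bumping "version" here rolls
# the change out everywhere this definition is matched.
apiVersion: entity.humanitec.io/v1b1   # assumed; check vendor docs
kind: Definition
metadata:
  id: postgres-baseline
entity:
  type: postgres
  driver_type: humanitec/postgres      # assumed driver name
  driver_inputs:
    values:
      version: "16"                    # single place to update the version
  criteria:
    - env_type: development            # match development environments
```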

You can find which workloads currently depend on the resource definition by pinging the platform orchestrator API or looking at the UI in the resource definition section. Once identified, you can auto-enforce a deployment across all workloads that depend on the resource.

With this golden path, rolling out the updated resource across all workloads and dependencies is simplified and scalable.

Good Golden Paths Turn Every Day into Day 1

When platform teams invest in these scalable golden paths, von Grünberg argues, everyone wins. Golden paths that leverage a platform orchestrator and DCM enable developers and ops to execute common tasks with greater ease and peace of mind from the earliest stages of an IDP’s development, delivering more value, faster.

Paving golden paths with this approach also catalyzes an important mindset shift for platform teams, according to von Grünberg. With DCM, every day can become Day 1, a starting point for further optimization and opportunity to reduce technical debt. This shift enables organizations to make the most of what platform engineering has to offer.

Get Recommended Golden Path Examples

Humanitec has created reference architectures for AWS, Azure, and GCP based on McKinsey’s research. These resources walk through examples of recommended golden paths in more detail, as well as how they fit into the larger IDP architecture.

The post How to Pave Golden Paths That Actually Go Somewhere appeared first on The New Stack.

]]>
Streamline Platform Engineering with Kubernetes https://thenewstack.io/streamline-platform-engineering-with-kubernetes/ Wed, 06 Sep 2023 15:39:26 +0000 https://thenewstack.io/?p=22717521

Platform engineering plays a pivotal role in the modern landscape of application development and deployment. As software applications have evolved

The post Streamline Platform Engineering with Kubernetes appeared first on The New Stack.

]]>

Platform engineering plays a pivotal role in the modern landscape of application development and deployment. As software applications have evolved to become more complex and distributed, the need for a robust and scalable infrastructure has become paramount. This is where platform engineering steps in, acting as the backbone that supports the entire software development lifecycle. Let’s delve deeper into the essential role of platform engineering in creating and maintaining the infrastructure for applications.

Understanding Platform Engineering

At its core, platform engineering involves creating an environment that empowers developers to focus on building applications without the burden of managing underlying infrastructure intricacies. Platform engineers architect, build, and maintain the infrastructure and tools necessary to ensure that applications run smoothly and efficiently, regardless of the complexities they might encompass.

In the dynamic world of application development, platform engineers face multifaceted challenges. One of the most prominent challenges is managing diverse applications and services that vary in requirements, technologies, and operational demands. As applications span across cloud environments, on-premises setups, and hybrid configurations, platform engineers are tasked with creating a unified, consistent, and reliable infrastructure.

Managing this diverse landscape efficiently is crucial to ensuring applications’ reliability and availability. In the absence of streamlined management, inefficiencies arise, leading to resource wastage, operational bottlenecks, and decreased agility. This is where Kubernetes comes into the spotlight as a transformative solution for platform engineering.

Enter Kubernetes: A Powerful Solution

Kubernetes, a container orchestration platform, has emerged as a game-changer in the field of platform engineering. With its ability to automate deployment, scaling, and management of containerized applications, Kubernetes addresses the very challenges that platform engineers grapple with. By providing a unified platform to manage applications regardless of their underlying infrastructure, Kubernetes aligns seamlessly with the goals of platform engineering.

Kubernetes takes the burden off platform engineers by allowing them to define application deployment, scaling, and management processes in a declarative manner. This eliminates manual interventions and streamlines repetitive tasks, enabling platform engineers to focus on higher-level strategies and optimizations.
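For instance, a declarative Kubernetes Deployment describes only the desired state, and Kubernetes continuously reconciles the cluster toward it. In this sketch, the name, image and replica count are illustrative assumptions:

```yaml
# Sketch: Kubernetes keeps three replicas of the assumed image running,
# replacing failed pods and rolling out updates without manual steps.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
        - name: web
          image: registry.example.com/web-frontend:1.4.2  # hypothetical
          ports:
            - containerPort: 8080
```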

Furthermore, Kubernetes promotes collaboration between different teams, including developers and operations, by providing a common language for application deployment and infrastructure management. This fosters a DevOps culture, where the lines between development and operations blur, and teams work collaboratively to achieve shared goals.

From here, we will delve deeper into the specifics of Kubernetes orchestration and how it revolutionizes platform engineering practices. From managing multi-tenancy to automating infrastructure, from ensuring security to optimizing scalability, Kubernetes offers a comprehensive toolkit that addresses the intricate needs of platform engineers. Join us on this journey as we explore how Kubernetes empowers platform engineering to streamline deployment and management, ultimately leading to more efficient and reliable software ecosystems.

Challenges of Managing Diverse Applications: A Platform Engineer’s Dilemma

The role of a platform engineer is akin to being the architect of a bustling metropolis, responsible for designing and maintaining the infrastructure that supports a myriad of applications and services. However, in today’s technology landscape, this task has become increasingly intricate and challenging. Platform engineers grapple with a range of difficulties as they strive to manage diverse applications and services across complex and dynamic environments.

In the ever-expanding digital realm, applications exhibit a stunning diversity in terms of their technologies, frameworks, and dependencies. From microservices to monoliths, from stateless to stateful, each application type presents its own set of demands. Platform engineers are tasked with creating an environment that caters to this diversity seamlessly, ensuring that every application can function optimally without interfering with others.

Modern applications are no longer confined to a single server or data center. They span hybrid cloud setups, utilize various cloud providers, and often incorporate on-premises resources. This heterogeneity of infrastructure introduces challenges in resource allocation, data consistency, and maintaining a coherent operational strategy. Platform engineers must find ways to harmonize these diverse elements into a unified and efficient ecosystem.

Applications’ resource requirements are seldom static. They surge and recede based on user demand, seasonal patterns, or promotional campaigns. Platform engineers must design an infrastructure that can dynamically scale resources up or down to match these fluctuations. This requires not only technical acumen but also predictive analytics to foresee resource needs accurately.

In today’s always-on digital landscape, downtime is not an option. Platform engineers are tasked with ensuring high availability and fault tolerance for applications, which often involves setting up redundant systems, implementing failover strategies, and orchestrating seamless transitions in case of failures. This becomes even more complex when applications are spread across multiple regions or cloud providers.

Applications and services need continuous updates to stay secure, leverage new features, and remain compatible with evolving technologies. However, updating applications without causing downtime or compatibility issues is a challenge. Platform engineers need to orchestrate updates carefully, often requiring extensive testing and planning to ensure a smooth transition.

In an era of heightened cybersecurity threats and stringent data regulations, platform engineers must prioritize security and compliance. They need to implement robust security measures, control access to sensitive data, and ensure that applications adhere to industry-specific regulations. Balancing security with usability and performance is a constant tightrope walk.

In an environment with diverse applications and services, achieving standardization can be elusive. Different development teams might have varying deployment practices, configurations, and toolsets. Platform engineers need to strike a balance between accommodating these unique requirements and establishing standardized processes that ensure consistency and manageability.

Kubernetes: A Paradigm Shift in Platform Engineering

As platform engineers grapple with the intricate landscape of managing diverse applications and services across complex environments, a beacon of transformation has emerged: Kubernetes. This open source container orchestration platform has swiftly risen to prominence as a powerful solution that directly addresses the challenges faced by platform engineers.

The diversity of applications, each with its own unique requirements and dependencies, can create an operational labyrinth for platform engineers. Kubernetes steps in as a unifying force, providing a standardized platform for deploying, managing, and scaling applications, irrespective of their underlying intricacies. By encapsulating applications in containers, Kubernetes abstracts away the specifics, enabling platform engineers to treat every application consistently.

Kubernetes doesn’t shy away from the complexities of modern infrastructure. Whether applications span hybrid cloud setups, multiple cloud providers, or on-premises data centers, Kubernetes offers a common language for orchestrating across these diverse terrains. It promotes the notion of “write once, deploy anywhere,” allowing platform engineers to leverage the same configuration across various environments seamlessly.

The challenge of resource allocation and scaling based on fluctuating user demands finds an elegant solution in Kubernetes. With its automated scaling mechanisms, such as Horizontal Pod Autoscaling, platform engineers are empowered to design systems that can dynamically expand or contract resources based on real-time metrics. This elasticity ensures optimal performance without the need for manual intervention.
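
To make that concrete, here is a minimal HorizontalPodAutoscaler sketch using the autoscaling/v2 API. The target Deployment name (web) and the 70% CPU threshold are illustrative placeholders, not recommendations:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-hpa
    spec:
      scaleTargetRef:              # the workload this autoscaler manages
        apiVersion: apps/v1
        kind: Deployment
        name: web                  # hypothetical Deployment
      minReplicas: 2               # floor during quiet periods
      maxReplicas: 10              # ceiling during traffic spikes
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70   # scale out above ~70% average CPU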

Kubernetes embodies the principles of high availability and fault tolerance, critical aspects of platform engineering. By automating load balancing, health checks, and failover mechanisms, Kubernetes creates an environment where applications can gracefully navigate failures and disruptions. Platform engineers can architect systems that maintain continuous service even in the face of unforeseen challenges.

The daunting task of updating applications while minimizing downtime and compatibility hiccups finds a streamlined approach in Kubernetes. With features like rolling updates and canary deployments, platform engineers can orchestrate updates that are seamless, incremental, and reversible. This not only enhances the reliability of the deployment process but also boosts the confidence of developers and operations teams.
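
As a hedged sketch of what that looks like, the Deployment below caps disruption during a rollout; every name and number is a placeholder:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web                    # hypothetical workload
    spec:
      replicas: 4
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1        # never take more than one pod down at a time
          maxSurge: 1              # allow one extra pod above the desired count
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
            - name: web
              image: example.com/web:1.2.0   # bumping this tag triggers a rolling update

If the new version misbehaves, kubectl rollout undo deployment/web reverses it, which is the "reversible" property described above.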

Security and Compliance at the Core

Security is paramount in platform engineering, and Kubernetes doesn’t fall short in this domain. By enforcing Role-Based Access Control (RBAC), Network Policies, and Secrets Management, Kubernetes empowers platform engineers to establish robust security practices. Compliance requirements are also met through controlled access and encapsulation of sensitive data.
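
As an illustration of the RBAC piece, the Role and RoleBinding below grant a hypothetical team-a group day-to-day rights only inside its own namespace; the group and namespace names are assumptions for the example:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: developer
      namespace: team-a            # permissions stop at this namespace boundary
    rules:
      - apiGroups: ["", "apps"]
        resources: ["pods", "services", "deployments"]
        verbs: ["get", "list", "watch", "create", "update", "patch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: developer-binding
      namespace: team-a
    subjects:
      - kind: Group
        name: team-a               # hypothetical group from your identity provider
        apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: developer
      apiGroup: rbac.authorization.k8s.io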

Kubernetes bridges the gap between accommodating unique application requirements and establishing standard practices. It provides a foundation for creating reusable components through Helm charts and Operators, promoting a cohesive approach while allowing for flexibility. This journey towards standardization enhances manageability, reduces human error, and boosts collaboration across teams.

In the realm of platform engineering, the concept of multitenancy stands as a critical pillar. As organizations host multiple teams or projects within a shared infrastructure, the challenge lies in ensuring resource isolation, security, and efficient management. Kubernetes, with its robust feature set, provides an effective solution to tackle the intricacies of multitenancy.

Understanding Multitenancy

Multitenancy refers to the practice of hosting multiple isolated instances, or “tenants,” within a single infrastructure. These tenants can be teams, departments, or projects, each requiring their own isolated environment to prevent interference and maintain security.

Kubernetes introduces the concept of Namespaces to address the requirements of multitenancy. A Namespace is a logical partition within a cluster that allows for resource isolation, naming uniqueness, and access control. Platform engineers can leverage Namespaces to create segregated environments for different teams or projects, ensuring that resources are isolated and managed independently.
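
Creating one is deliberately trivial; the team name below is hypothetical:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: team-a
      labels:
        team: team-a        # labels make Namespaces easy to target in policies and quotas

The one-liner kubectl create namespace team-a achieves the same result; the manifest form is easier to keep in version control.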

Here are some advantages of Namespaces:

  • Resource Isolation: Namespaces provide an isolated space where resources such as pods, services, and configurations are contained. This isolation prevents conflicts and resource contention between different teams or projects.
  • Security and Access Control: Namespaces allow platform engineers to set Role-Based Access Control (RBAC) rules specific to each Namespace. This ensures that team members can only access and manipulate resources within their designated Namespace.
  • Naming Scope: Namespaces ensure naming uniqueness across different teams or projects. Resources within a Namespace are identified by their names, and Namespaces provide a clear context for these names, avoiding naming clashes.
  • Logical Partitioning: Platform engineers can logically partition applications within the same cluster, even if they belong to different teams or projects. This makes it easier to manage a diverse application landscape within a shared infrastructure.

Challenges of Resource Allocation and Isolation

While Kubernetes Namespaces offer a solid foundation for multitenancy, challenges related to resource allocation and isolation persist:

  • Resource Allocation: In a multitenant environment, resource allocation becomes a balancing act. Platform engineers need to ensure that each Namespace receives adequate resources while preventing resource hogging that could impact other Namespaces.
  • Resource Quotas: Kubernetes enables setting resource quotas at the Namespace level, which can be complex to fine-tune. Striking the right balance between restricting resource usage and allowing flexibility is crucial (a sample quota is sketched after this list).
  • Isolation Assurance: Ensuring complete isolation between Namespaces requires careful consideration. Leaked resources or network communication between Namespaces can compromise the intended isolation.
  • Managing Complexity: As the number of Namespaces grows, managing and maintaining configurations, RBAC rules, and resource allocations can become challenging. Platform engineers need efficient tools and strategies to manage this complexity effectively.
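
For reference, here is a minimal ResourceQuota sketch for the quota bullet above; the namespace and limits are illustrative, not sizing advice:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-a-quota
      namespace: team-a       # hypothetical tenant namespace
    spec:
      hard:
        requests.cpu: "10"    # sum of CPU requests across all pods in the namespace
        requests.memory: 20Gi
        limits.cpu: "20"      # sum of CPU limits
        limits.memory: 40Gi
        pods: "50"            # hard cap on pod count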

In the realm of platform engineering, the pursuit of efficiency and reliability hinges on automation, and Kubernetes gives platform engineers the machinery to automate deployment and scaling. Let’s explore how Kubernetes streamlines these processes and empowers platform engineers to elevate their infrastructure management.

Kubernetes Controllers: The Automation Engine

Kubernetes controllers play a pivotal role in orchestrating automated tasks that range from scaling applications to ensuring self-healing.

  • Scaling: Horizontal Pod Autoscaling (HPA) is a prime example. HPA automatically adjusts the number of pod replicas based on observed CPU utilization or custom metrics. This ensures that applications can seamlessly handle traffic fluctuations without manual intervention.
  • Self-Healing: Liveness and readiness probes are key components that contribute to application self-healing. Liveness probes detect application failures and trigger pod restarts, while readiness probes ensure that only healthy pods receive traffic (see the probe sketch after this list).
  • Updating: Kubernetes controllers, such as Deployments, automate application updates by maintaining a desired number of replicas while transitioning to a new version. This prevents service disruptions during updates and rollbacks, ensuring seamless transitions.
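
Here is a hedged sketch of the probe configuration referenced in the self-healing bullet; the endpoints, port, and timings are placeholders to adapt to your application:

    apiVersion: v1
    kind: Pod
    metadata:
      name: app                          # hypothetical pod
    spec:
      containers:
        - name: app
          image: example.com/app:1.0.0   # placeholder image
          livenessProbe:                 # failing this probe restarts the container
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10      # give the app time to boot before probing
            periodSeconds: 15
          readinessProbe:                # failing this probe removes the pod from Service endpoints
            httpGet:
              path: /ready
              port: 8080
            periodSeconds: 5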

Kustomize: Customized Automation

Kustomize is a tool that allows platform engineers to customize Kubernetes manifests without the need for complex templating. It provides a declarative approach to configuration management, enabling engineers to define variations for different environments, teams, or applications.
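
A typical layout pairs a shared base with thin, environment-specific overlays. In the sketch below, the web Deployment and the directory names are hypothetical:

    # base/kustomization.yaml (configuration shared by every environment)
    resources:
      - deployment.yaml
      - service.yaml

    # overlays/production/kustomization.yaml (production-only deltas)
    resources:
      - ../../base
    namespace: production
    replicas:
      - name: web          # scale the base Deployment up for production
        count: 5
    commonLabels:
      env: production

Running kubectl apply -k overlays/production renders and applies the merged result, while development and staging overlays reuse the same base untouched.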

Some benefits of Kustomize include:

  • Reusability: Kustomize promotes reusability by enabling the creation of base configurations that can be extended or modified as needed.
  • Environment-Specific Customization: Platform engineers can customize configurations for different environments (development, staging, production) or teams without duplicating the entire configuration.
  • Efficiency: Kustomize reduces duplication and minimizes manual editing, which reduces the risk of inconsistencies and errors.

Policy Enforcement and Governance: Navigating the Path to Stability

In the dynamic landscape of platform engineering, enforcing policies and governance emerges as a linchpin for ensuring stability, security, and compliance. Kubernetes offers tools like RBAC and network policies to establish control and enforce governance.

Policy enforcement ensures that the platform adheres to predefined rules and standards. This includes access control, security policies, resource quotas, and compliance requirements. By enforcing these policies, platform engineers maintain a secure and reliable environment for applications.

In a dynamic Kubernetes environment, maintaining security and compliance can be challenging. As applications evolve, keeping track of changing policies and ensuring consistent enforcement across clusters and namespaces becomes complex. The ephemeral nature of Kubernetes resources adds another layer of complexity to achieving persistent security and compliance.

DevOps Culture and Collaboration: Bridging the Divide

In the pursuit of efficient and collaborative platform engineering, fostering a DevOps culture is paramount.

DevOps culture bridges the gap between development, operations, and platform engineering teams. It encourages seamless communication, shared goals, and a collective sense of responsibility for the entire application lifecycle.

Kubernetes acts as a catalyst for collaboration by providing a common language for application deployment and infrastructure management. It encourages cross-functional communication and allows teams to work collaboratively on shared configurations.

Kubernetes’ declarative nature and shared tooling break down silos that often arise in traditional workflows. Developers, operators, and platform engineers can collectively define, manage, and evolve applications without being constrained by rigid boundaries.

The post Streamline Platform Engineering with Kubernetes appeared first on The New Stack.

Backstage in Production: Considerations for Platform Teams https://thenewstack.io/backstage-in-production-considerations-for-platform-teams/ Tue, 05 Sep 2023 19:47:13 +0000 https://thenewstack.io/?p=22717447


The developer portal is a prominent aspect of most platforms, as it’s a privileged point of interaction between you and your users. Developer portals reflect the features of the platform through a centralized UI, which means they must be tailored to your developers and the capabilities you want to provide.

Here’s where Backstage shines: customizability. You can make the developer portal of your dreams with Backstage, which could include replacing the UI with your organization’s design system or bringing your own data consumption mechanism. This is possible because Backstage is not a ready-made developer portal, but a framework that provides the building blocks to build one.

However, developer portals are web apps. Thus, when you adopt and extend Backstage from scratch, you’re signing up for its full-stack consequences. For this reason, Gartner and others have reported that setting up and maintaining Backstage yourself can be challenging, yet for many companies the benefits of doing so are overwhelming.

With that said, there is no one-size-fits-all way to adopt Backstage. When you set out to stand up Backstage yourself, you’ll run into a few common tasks nobody told you were part of adopting the framework. In this article, I’ll walk you through a few considerations to make when planning your team’s work.

Initial Setup and Deployment

Backstage provides a create-app command through its command line interface (CLI) to help you get started with a new instance. The result will run fine on your machine, but from this point on, you still have some work to do to make it a production-ready developer portal.
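
At the time of writing, scaffolding and running a local instance looks roughly like this; the app name is whatever you choose at the prompt, and the exact commands can shift between releases:

    npx @backstage/create-app@latest   # scaffolds a new app, prompting for a name
    cd my-backstage-app                # the directory matches the name you chose
    yarn dev                           # runs the frontend and backend together locally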

My recommendation for a Backstage proof of concept is to first implement a single meaningful integration, such as GitHub. This will let you go through all the touch points, from React and Node configs to deployment.

Your developer portal most likely will have to connect data from various sources through integrations. Therefore you’ll need to implement a secret-management strategy that lets you inject secrets into the container that will be running Backstage.
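
Backstage’s app-config files support environment-variable substitution, which pairs naturally with whatever secret store injects variables into the running container. A hedged sketch, with variable names that are conventions rather than requirements:

    # app-config.production.yaml (sketch of env-var substitution)
    integrations:
      github:
        - host: github.com
          token: ${GITHUB_TOKEN}           # injected at runtime by your secret manager
    backend:
      database:
        client: pg
        connection:
          host: ${POSTGRES_HOST}
          user: ${POSTGRES_USER}
          password: ${POSTGRES_PASSWORD}   # never committed to the repo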

In terms of deployment, the Backstage team recommends using what you normally would for a similar application. Thus, you can benefit from applying your standard CI/CD practices to your developer portal.

In the case of a Roadie-managed Backstage instance, all these considerations are built into the product so you don’t have to invest time into any of them.

Authentication and Security

Your developer portal is a one-stop shop that integrates third-party services such as GitHub, Argo CD or PagerDuty. The developer portal will allow users to request infrastructure through its self-service or golden paths capabilities. Therefore, it is important to ensure that your Backstage instance is secure.

First, you’ll need to install and set up an authentication mechanism. Thankfully, Backstage offers open source integrations with 13 providers from Okta to Google IAP.

Next, you’ll need to use the permissions framework that comes with Backstage. By default, Backstage’s backend routes are not protected behind authentication because there’s an openness assumption in a developer portal.
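
Both pieces live in app-config.yaml. The sketch below assumes GitHub as the provider; the environment-variable names follow Backstage convention but are placeholders here:

    auth:
      environment: production
      providers:
        github:
          production:
            clientId: ${AUTH_GITHUB_CLIENT_ID}
            clientSecret: ${AUTH_GITHUB_CLIENT_SECRET}
    permission:
      enabled: true   # opts into the permission framework; policies are wired up in the backend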

Additionally, I recommend you set up your scaffolder to execute tasks in ephemeral environments. Doing this at Roadie from the beginning prevented all of our customers from being affected by last year’s Backstage remote code execution vulnerability.

Always remember to keep an eye on the security patches released by the Backstage team, and upgrade your instance promptly.

Upgrades and Maintenance

The Backstage team merges around 300 pull requests monthly from many contributors, resulting in minor version releases every second week. This process gives the framework an impressive flow of features and bug fixes.

I recommend adding upgrades to your planning regularly. Backstage’s upgrade process currently involves a few manual steps, with varying complexity for each release.

Beware that some improvements come as an additional API that you need to hook into your developer portal or a new set of UI components that can benefit your instance.

Most importantly, it’s useful to stay tuned to new features as sometimes you have to opt into them even after you upgrade Backstage. I write the Backstage Weekly newsletter, so you don’t have to go through the codebase yourself. Feel free to sign up.

Working with Plugins

There are more than 100 plugins available in the Backstage ecosystem, and you’re encouraged to build your own plugin to integrate your unique development needs into Backstage.

Plugins are usually implemented in two or three packages: backend, frontend and common. Plugins can also provide extension points, so they can be customized or adapted for different circumstances.

The Backstage community is actively working on simplifying the process of installing and upgrading plugins, but it remains a bit of manual work at the moment and requires you to redeploy your instance.

When authoring a plugin, be aware that Backstage is consolidating a new backend system with simplified APIs, so it might be worth checking out.

A Backstage Instance Is Just the First Step

Or maybe it’s the second step? You first need to identify what you want your developer portal to solve. Then once you set up an instance, you’ll be on track to a longer-term journey onboarding more use cases and iterating on your developer portal as you learn from your developers.

If you want to adopt Backstage but don’t want to own its implementation or maintenance, Roadie.io offers hosted and managed Backstage instances. Roadie offers a no-code Backstage experience and more refined features while remaining fully compatible with the open source software framework. Additionally, Roadie offers fully-fledged Scorecards to measure the software quality and maturity across your organization.

If you’re interested in learning about the advantages and trade-offs of managed production-grade Backstage instances vs. self-hosted ones, check out our white paper.

The post Backstage in Production: Considerations for Platform Teams appeared first on The New Stack.

How Spotify Achieved a Voluntary 99% Internal Platform Adoption Rate https://thenewstack.io/how-spotify-achieved-a-voluntary-99-internal-platform-adoption-rate/ Tue, 05 Sep 2023 16:29:52 +0000 https://thenewstack.io/?p=22717367


“There was this first revolution, where everything became a microservice, and that kind of set us free in certain ways. And then there was the Kubernetes revolution that just shaped the cloud native landscape. And as cloud computing just gained momentum, there was this need [for] efficient orchestration and management across a diverse set of environments and whatnot. All of this, brought a lot of agility for teams to experiment,” Helen Greul, head of engineering for Backstage at Spotify, told The New Stack.

“But the downside — there has to be a downside — in our case, was the growing burden on developers. I’ve seen this with my own eyes, this developer mandate expanding,” she continued, reflecting on the impetus for the seemingly sudden rise of platform engineering.

Platform engineering is the sociotechnical mindset and set of best practices, toolchains and workflows that look to reduce friction and increase developer productivity. This year, when everyone is trying to do more with less, has seen a meteoric rise in the popularity of this discipline. We’ve also witnessed the exponential increase in adoption of the most popular open source internal developer platform (IDP): Backstage.

It was only natural that The New Stack sat down with Spotify to learn about why they invested so much time and money in Backstage, which they open sourced for any team to leverage — and how they got their own developers on board to happily use it, too.

The Origin of Backstage at Spotify

“As a hiring manager, you would first hire an engineer that would code, but then they would [have to] code and ship. Then they would need to care about runtime. Then all of a sudden, they need to manage across environments and then optimize,” Greul said, of how the developer mandate continued to explode and expand at Spotify.

Around five years ago, Spotify engineering realized they could no longer face complexity by just hiring more developers. And the modern concept of platform engineering was born, with the goal of streamlining software development.

“We have a platform for platforms here at Spotify, that’s how obsessed we are.” Greul remarked that Backstage is the biggest example, but the popular audio streaming service has several other internal ones, all at least inner-sourced.

“I think if you’re building a product for developers, it’s either going to become a platform — if it’s successful enough, if it lives long enough — or it’s just going to die out.” — Helen Greul, Spotify

Some would say the 2023 rise of platform engineering is a sign that the pendulum is swinging back from developer autonomy toward platform-driven command and control. Not Greul. She thinks the platform is the best way to deliver developers choices, giving them the ability to build onto those golden paths that they want.

“Developers love their platforms and products to be extensible,” she said. “There are very few turnkey solutions that I can think about that are used by developers and it’s true for almost all software that we use.”

Getting Started with Platform Engineering at Spotify

Spotify may stand out at the forefront of company culture and innovation, but its first goals ring true for most companies starting on their platform engineering journey. First and foremost, they wanted a catalog to create clarity of ownership.

“It was hard to know who owned the majority of microservices — there was a sprawl,” Greul said. “The concept of ownership was really important to us,” so this internal developer platform kicked off with a software catalog — just components, teams, and a few other abstractions, she outlined. This became the first slice that morphed into Backstage as we know it today.

Next — again, not unusual — Spotify engineering looked to abstract the complexity of Kubernetes. The nascent platform team was looking to address the problems of orchestration, management, deployment, testing, and tech health.

“And that was the point where it exploded internally. When we opened up,” Greul said.

The next demand was for a single pane of glass to understand all the services eventually housed in the internal developer platform, with inventory and a release launchpad.

Also Read: Metrics-Driven Developer Productivity Engineering at Spotify

Inner Source Your Platform Engineering Experience

If they build it, they will use it.

Greul attributes Spotify’s successful platform strategy to its strong culture of inner sourcing: “Everyone is free to contribute to any repository that’s available within the company. There is no codebase that is locked.”

Long before this new age of internal developer platforms and the fashion of a focus on developer experience, inner sourcing was a companywide policy from early on at Spotify, as a way to ensure organizational fluidity. After all, she remarked, the launch of an audiobook can leverage as many as 50 different repositories. Everyone should have the chance to improve their own developer experience. Bonus: more people building it means more people eager to use it over other options — and evangelize to their colleagues about it.

Even at Spotify, Backstage usage is not mandatory. “We try to work on people’s incentives and make it easier to do the right thing,” Greul continued. “And the right thing is a bit of standardization, not too much, not too little, but just enough.”

Still, at the home of Backstage, platform engineering adoption is high. But she says that’s because Spotify squads adhere to a philosophy that standards can set you free. Spotify’s version of Backstage includes three choices of frontend technology and three of backend, in what they call their technology radar.

This tool choice is reevaluated frequently, including in quarterly surveys for measuring developer productivity. They offer great incentives to stay on the golden paths, including widely promoting wins and sharing survey results internally.

The New Stack has already written about how these golden paths have been proven to cut Spotify’s developer onboarding time down from 110 days to 20.

“As a new joiner, joining Spotify, you’re kind of almost being handheld through the journey with those standards and boilerplate templates in mind,” Greul said. “You have a bit of room [to] pick one or the other, but there are some guardrails in place that are part of your onboarding.”

Of course, new tooling may arise within the organization, but she assured that the plug-and-play style of Backstage allows best practices to be integrated fairly easily.

In the end, after about a year of internal onboarding and promotion, she reckons 99% of Spotify’s development team had adopted Backstage — because it was convenient to do so. She points to their incentives or Platform as a Product mindset to why that rate is so high.

Also Read: How Spotify Adopted and Outsourced Its Platform Mindset

Why Is Backstage Adoption Rate Stagnant at 10%?

Spotify is famously open source by default, which Greul postulates is because “if you want something to become [an] industry standard, there is no way software can be proprietary,” arguing that all technical standardization, at least over the last decade, has been via an open source pathway. This is how Spotify is paving so-called golden paths at more than 260 organizations.

But Spotify isn’t your average company culture. Greul admitted that many other companies that have adopted Backstage for platform engineering are stuck trying to get past the 10% adoption rate.

“Oftentimes it happens that they hit a road bump, or maybe the rollout is not as easy as they would have hoped,” she said. “This is where sometimes the adoption kind of struggles or stagnates or it doesn’t go beyond the POC [proof of concept].”

It all comes back to the incentives, she said, “Developers have to see how this is beneficial for their day-to-day.” For a startup, their cognitive load could be just fine with the AWS Management Console, and not be motivated to change. “But once you reach a certain scale, it sort of becomes almost a necessity to have the tools to combat the cognitive load.” Usually, an IDP becomes necessary.

You don’t just have to sell the developers on it either. Business stakeholders need to understand that an investment in platform engineering is necessary to improve developer productivity.

Just remember that Backstage isn’t an out-of-the-box panacea. In fact, Greul said that Backstage could be a bad fit for monolithic architecture.

But, consistently across interviews, we’ve heard that the number one thing developers want from their internal developer platform is extensibility — platform engineering is not about top-down, command-and-control platforms.

“The unique kind of selling point of Backstage is that you can make it look and feel your own. You can make it so you can customize it to solve your specific needs,” Greul explained. “For developers, there is no usually one-size-fits-all solution and they love that they can keep their unique kind of culture, unique set of tools, and unique flavors that make it look and feel like it’s a homegrown solution, even though it’s not.”

Most importantly, it has to solve their problems. Their first concerns are usually discoverability and standardization.

Of course, she remarked, this journey of discovery doesn’t start with deciding you need an internal developer portal. It starts with clarifying what the challenges to developer productivity are, and what is keeping developers stressed and up at night.

In the end, it’s as Greul said, “With a product that’s built for developers, with empathy for developers, they recognize the value, and it’s just easier for them to use what’s already available than spinning something weird on the side.”

The post How Spotify Achieved a Voluntary 99% Internal Platform Adoption Rate appeared first on The New Stack.

SRE vs Platform Engineer: Can’t We All Just Get Along? https://thenewstack.io/sre-vs-platform-engineer-cant-we-all-just-get-along/ Wed, 30 Aug 2023 14:48:48 +0000 https://thenewstack.io/?p=22716665


So far, 2023 has been all about doing more with less. Thankfully, tech layoffs — a reaction to sustained, uncontrolled growth and economic downturn — seem to have slowed. Still, many teams are left with fewer engineers working on increasingly complex and distributed systems. Something’s got to give.

It’s no wonder that this year has seen the rise of platform engineering. After all, this sociotechnical practice looks to use toolchains and processes to streamline the developer experience (DevEx), reducing friction on the path to release, so those that are short-staffed can focus on their end game — delivering value to customers faster.

What might be surprising, however, is the rolling back of the site reliability engineering or SRE movement. Both platform teams and SREs tend to work cross-organizationally on the operations side of things. But, while platform engineers focus on that DevEx, SREs focus on reliability and scalability of systems — usually involving monitoring and observability, incident response, and maybe even security. Platform teams are all about increasing developer productivity and speed, while SRE teams are all about increasing uptime in production.

Lately, a lot of organizations are also in the habit of simply waving a fairy wand and — bibbidi-bobbidi-boo! — changing job titles, say from site reliability engineer, sysadmin or DevOps engineer to platform engineer. Is this just because the latter makes for cheaper employees? Or can a change in role really make a difference? How many organizations are changing to adopt a platform-as-a-product mindset versus just finding a new way to add to the ops backlog?

What do these trends actually mean in reality? Is it really SRE versus platform engineering? Are companies actually skipping site reliability engineering and jumping right into a platform-first approach? Or, as Justin Warren, founder and principal analyst at PivotNine, wrote in Forbes, is platform engineering already at risk of “collapsing under the weight of its own popularity, hugged to death by over-eager marketing folk?”

In 2023, we have more important things to worry about than two teams with similar objectives feeling pitted against each other. Let’s talk about where this conflict ends and where collaboration and corporate cohabitation begins.

SREs Should Be More Platform-focused

There’s opportunity in bringing platform teams and SREs together, but a history of friction and frustration can slow that collaboration. Often, SREs can be seen as gatekeepers, while platform engineers are just setting up the guardrails. That framing may just be the shine on more nascent platform teams, or it can be the truth at some orgs.

“Outside of Google, SREs in most organizations lack the capacity to constantly think about ways to enable better developer self-service or improve architecture and infrastructure tooling while also establishing an observability and tracing setup. Most SRE teams are just trying to survive,” wrote Luca Galante, from Humanitec’s product and growth team. He argues that too many companies are trying to follow suit of these “elite engineering organizations,” and the result is still developers tossing code over the wall, leaving too much burden on SREs to try to catch up.

Instead, Galante argues, a platform as a product approach allows organizations to focus on the developer experience, which, in turn, should lighten the load of operations. After all, when deployed well, platform engineering can actually help support the site reliability engineering team by reducing incidents and tickets via guardrails and systemization.

In fact, Dynatrace’s 2022 State of SRE Report emphasizes that the way forward for SRE teams is a “platform-based solution with state-of-the-art automation and everything-as-code capabilities that support the full lifecycle from configuration and testing to observability and remediation.” The report continues that SREs are still essential in creating a “single version of the truth” in an organization.

A platform is certainly part of the solution, it’s just, as we know from this year’s Puppet State of Platform Engineering Report, most companies have three to six different internal developer platforms running at once. That could leave platform and SRE teams working in isolation.

The technical strategy consultancy Xenonstack actually places platform engineering and SRE at different layers of the technical stack, not in opposition to each other. It treats SRE as a lower-level, foundational process, while platform engineering is a higher-level process that abstracts away ops work, including the work the SRE team puts in place.

Both SRE and platform teams are deemed necessary functions in the cloud native world. The next step is to figure out how they can not just collaborate but integrate their work together. After all, a focus on standardization, as is inherent to platform engineering, only supports security and uptime goals.

Another opportunity is in how SREs use service level objectives (SLOs) and error budgets to set expectations for reliability. Platform engineers should consider applying the same practices but for their internal customers.
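
The arithmetic transfers directly. As an illustrative example, suppose the platform team promises an internal 99.9% monthly availability SLO for its deployment pipeline:

    error budget = (1 − 0.999) × 30 days
                 = 0.001 × 43,200 minutes
                 ≈ 43 minutes of allowable downtime per month

Once the pipeline has burned through those 43 minutes, the platform team deprioritizes new features in favor of reliability work, the same trade SREs already make for external services.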

The same Dynatrace State of SRE Report also found that, in 2022, more than a third of respondents already had the platform team managing the external SLOs.

In the end, it is OK if these two job buckets become grayer — even to the developer audience — so long as your engineers can work through one single viewpoint and, when things deviate from that singularity, they know who to ask.

How SREs Built Electrolux’s Platform

Whether a platform enables your site reliability team or your SREs can help drive your platform-as-a-product approach, collaboration yields better results than conflict. How it’s implemented is as varied as an organization’s technical stack and company culture.

Back in 2017, the second largest home appliance maker in the world, Electrolux, shifted toward its future in the Internet of Things. It opened a digital products division to eventually connect hundreds of home goods. This product team kicked off with ten developers and two SREs. Now, in 2023, the company has grown to about 200 developers helping to build over 350 connected products — supported by only seven SREs.

Electrolux teammates, Kristina Kondrashevich, SRE product owner, and Gang Luo, SRE manager, spoke at this year’s PlatformCon about how building their own platform allowed them to scale their development and product coverage without proportionally scaling their SRE team.

Initially, the SREs and developers sat on the same product team. Eventually, they split up but still worked on the same products. As the company scaled with more product teams, the support tickets started to pile up. This is when the virtual event’s screen filled with screenshots of Slack notifications around developer pain points, including service requests, meetings and logs for any new cluster, pipeline or database migration.

Electrolux engineering realized that it needed to scale the automation and knowledge sharing, too.

“[Developers] would like to write code and push it into production immediately, but we want them to be focused on how it’s delivered, how they provision the infrastructure for their services. How do they achieve their SLO? How much does it cost for them?” Kondrashevich said, realizing that the developers don’t usually care about this information. “They want it to be done. And we want our consumers to be happy.”

She said they realized that “We needed to create for them a golden path where they can click one button and get a new AWS environment.”

As the company continued to scale to include several product teams serving hundreds of connected appliances, the SRE team pivoted to becoming its own product team, as Electrolux set out to build an internal developer platform in order to offer a self-service model to all product teams.

Electrolux’s platform was built to hold all the existing automation, as well as well-defined policies, patterns and best practices.

“If developers need any infrastructure today — for example, if they need a Kubernetes cluster or database — they can simply go to the platform and click a few buttons and make some selections, and they will get their infrastructure up and running in a few minutes,” Luo said. He emphasized that “They don’t need to fire any tickets to the SRE team and we ensure that all the infrastructure that gets created has the same kind of policies, [and] they follow the same patterns as well.”

[Presentation slide: the platform includes infrastructure templates, service templates, API templates and internal tools, and brings into the cloud availability, CI/CD, monitoring, SLOs, alerting, security, cost and dashboards.]

“For developers, they don’t need to navigate different tools, they can use the single platform to access most of the resources,” he continued, across infrastructure, services and APIs. “Each feature contains multiple pre-defined templates, which has our policies embedded, so, if someone creates a new infrastructure or creates a new service, we can ensure that it already has what we need for security, for observability. This provided the golden path for our developers,” who no longer need to worry about things like setting up CI/CD or monitoring.

Electrolux’s SRE team actually evolved into a platform-as-a-product team, as a way to cover the whole developer journey. As part of this, Kondrashevich explained, they created a platform plug-in to track cloud costs as well as service requests per month.

“The first intention was to show that it costs money to do manual work. Rather the SRE team can spend time and provide the automation — then it will be for free,” she said. Also, by observing costs via the platform, they’ve enabled cross-organization visibility and FinOps. “Before our SRE team was responsible for cost and infrastructure. Today, we see how our product teams are owners of not only their products but…their expenses for where they run their services, pipelines, etcetera.”

They also measure platform success with continuous surveying and office hours.

In the end, whether it’s the SRE or the product team running the show, “Consumer experience is everything,” Kondrashevich said. “When you have visibility of what other teams are doing now, you can understand more, and you can speak more, and you can share this experience with others.”

To achieve any and all of this, she argues, you really need to understand what site reliability engineering means for your individual company.

The colleagues ended their PlatformCon presentation with an important disclaimer: “You shouldn’t simply follow the same steps as we have done because you might not have the same result.”

The post SRE vs Platform Engineer: Can’t We All Just Get Along? appeared first on The New Stack.
