The Pillars of Platform Engineering: Part 6 — Observability

Give platform teams workflows and checklists for building observability into their platform.

Sep 27th, 2023 6:12am by Michael Fonseca

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. It is published in six parts: part one introduced the series and focused on security, and part six addresses observability. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists:

Security (includes introduction)
Pipeline (VCS, CI/CD)
Provisioning
Connectivity
Orchestration
Observability (includes conclusion and next steps)

The last leg of any platform workflow is the monitoring and maintenance of your deployments. You want to build observability practices and automation into your platform, measuring the quality and performance of software, services, platforms and products to understand how systems are behaving. Good system observability makes investigating and diagnosing problems faster and easier.

Fundamentally, observability is about recording, organizing and visualizing data. The mere availability of data doesn’t deliver enterprise-grade observability. Site reliability engineering, DevOps or other teams must first determine what data to generate, collect, aggregate, summarize and analyze to gain meaningful and actionable insights.

Then those teams adopt and build observability solutions, which use three data types (metrics, traces and logs) to understand and debug systems. Enterprises need unified observability across the entire stack: cloud infrastructure, runtime orchestration platforms such as Kubernetes or Nomad, cloud-managed services such as Azure’s managed databases, and business applications. This unification helps teams understand the interdependencies of cloud services and components.
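One way to picture that unification is as three signal types that share a common set of labels. The sketch below is illustrative only: the class names, the `service` and `layer` labels, and the sample values are assumptions made for this example, not the data model of any particular observability product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Metric:
    name: str
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Span:
    trace_id: str
    operation: str
    duration_ms: float
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class LogRecord:
    message: str
    trace_id: str
    labels: Dict[str, str] = field(default_factory=dict)


def signals_for(service: str, signals: List[object]) -> Dict[str, List[object]]:
    """Group every signal carrying the given `service` label by its type name."""
    grouped: Dict[str, List[object]] = {}
    for signal in signals:
        if signal.labels.get("service") == service:
            grouped.setdefault(type(signal).__name__, []).append(signal)
    return grouped


# Hypothetical sample data spanning the infrastructure and application layers.
signals = [
    Metric("http_requests_total", 1042.0, {"service": "checkout", "layer": "application"}),
    Metric("node_cpu_utilization", 0.71, {"service": "checkout", "layer": "infrastructure"}),
    Span("t-1", "POST /orders", 182.0, {"service": "checkout"}),
    LogRecord("payment gateway timeout", "t-1", {"service": "checkout"}),
    LogRecord("cache warmed", "t-9", {"service": "catalog"}),
]

checkout = signals_for("checkout", signals)
# Logs and spans that share a trace ID can be read side by side.
related_logs = [log for log in checkout["LogRecord"] if log.trace_id == "t-1"]
```

Because every signal carries the same `service` label, a single query can pull infrastructure metrics, application metrics, traces and logs for one service, which is the kind of cross-layer view the unified approach is meant to provide.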

But unification is only the first step of building observability into the platform workflow. Within that workflow, a platform team needs to automate observability best practices in its modules and deployment templates. Just as platform engineering helps security functions shift left, observability integrations and automations should also shift left into the infrastructure coding and application build phases, so that containers and images already include telemetry when they are deployed. This helps your teams build and implement a comprehensive telemetry strategy that’s automated into platform workflows from the outset.
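As a rough illustration of what automating observability in a deployment template can mean, the sketch below merges platform-wide telemetry defaults into a team's deployment spec. Everything here is hypothetical: the spec shape, the `OBSERVABILITY_DEFAULTS` values and the `with_observability` helper are assumptions for the example, not the format of any specific provisioner or orchestrator.

```python
import copy

# Hypothetical platform-wide defaults; names and values are illustrative only.
OBSERVABILITY_DEFAULTS = {
    "metrics": {"enabled": True, "path": "/metrics", "port": 9090},
    "logging": {"format": "json", "level": "info"},
    "tracing": {"enabled": True, "sample_rate": 0.1},
}


def with_observability(spec: dict) -> dict:
    """Return a copy of a deployment spec with observability defaults merged in.

    Settings the team already declared take precedence over platform defaults.
    """
    result = copy.deepcopy(spec)
    obs = result.setdefault("observability", {})
    for section, defaults in OBSERVABILITY_DEFAULTS.items():
        merged = dict(defaults)
        merged.update(obs.get(section, {}))
        obs[section] = merged
    return result


# A team's spec that only overrides the trace sampling rate.
spec = {
    "name": "checkout",
    "image": "registry.example.com/checkout:1.4.2",
    "observability": {"tracing": {"sample_rate": 1.0}},
}
rendered = with_observability(spec)
```

The design choice worth noting is that defaults are applied by the platform while team overrides win, so a service gets metrics, logging and tracing settings without its developers having to write them.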

The benefits of integrating observability solutions in your infrastructure code are numerous: Developers can better understand how their systems operate and the reliability of their applications. Teams can quickly debug issues and trace them back to their root cause. And the organization can make data-driven decisions to improve the system, optimize performance, and enhance the user experience.

Workflow: Observability

An enterprise-level observability workflow might follow these eight steps:

Code: A developer commits code. (Note: Developers may have direct network control plane access, depending on the role-based access controls assigned to them.)
Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts or validating policy against internal and external engines, ultimately provisioning defined resources.
Provision: Infrastructure is provisioned and configured, if not already available.
Configure: The provisioner configures the observability resource.
Collect: Metrics and tracing data are collected based on configured emitters and aggregators.
Response: Completion of the provisioner request is provided to the CI/CD platform for subsequent processing and/or handoff to external systems, for purposes such as security scanning or integration testing.
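The eight steps above can be sketched as an ordered pipeline that stops when the identity provider rejects the request. This is a toy model under stated assumptions: each function is a stand-in for a real system (VCS, IdP, provisioner, telemetry collectors), and all names, IDs and resource lists are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]


def code(ctx: Context) -> None:
    ctx["commit"] = "abc123"  # hypothetical commit ID


def validate(ctx: Context) -> None:
    # Stand-in for the IdP's AuthN/AuthZ decision.
    ctx["authorized"] = ctx["user"] in ctx["allowed_users"]


def idp_response(ctx: Context) -> None:
    if not ctx["authorized"]:
        raise PermissionError("IdP rejected the pipeline request")
    ctx["tasks"] = ["test", "build", "deploy"]


def request(ctx: Context) -> None:
    ctx["plan"] = ["network", "cluster", "telemetry-collector"]


def provision(ctx: Context) -> None:
    ctx["resources"] = list(ctx["plan"])


def configure(ctx: Context) -> None:
    ctx["emitters"] = ["metrics", "traces"]


def collect(ctx: Context) -> None:
    ctx["collected"] = {emitter: [] for emitter in ctx["emitters"]}


def response(ctx: Context) -> None:
    ctx["status"] = "complete"


STEPS: List[Tuple[str, Callable[[Context], None]]] = [
    ("code", code), ("validate", validate), ("idp_response", idp_response),
    ("request", request), ("provision", provision), ("configure", configure),
    ("collect", collect), ("response", response),
]


def run_workflow(ctx: Context) -> List[str]:
    """Run the steps in order and return the names of those that completed."""
    completed: List[str] = []
    for name, step in STEPS:
        try:
            step(ctx)
        except PermissionError as exc:
            ctx["error"] = str(exc)
            break
        completed.append(name)
    return completed


ok_ctx: Context = {"user": "dev-1", "allowed_users": {"dev-1"}}
ok_steps = run_workflow(ok_ctx)

denied_ctx: Context = {"user": "intruder", "allowed_users": {"dev-1"}}
denied_steps = run_workflow(denied_ctx)
```

In this sketch an authorized commit runs all eight steps, while a rejected one stops after validation, before anything is provisioned or configured.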

Observability Requirements Checklist

Enterprise-level observability requires:

Real-time issue and anomaly detection
Auto-discovery and integrations across different control planes and environments
Accurate alerting, tracing, logging and monitoring
High-cardinality analytics
Tagging, labeling, and data-model governance
Observability as code
Scalability and performance for multi-cloud and hybrid deployments
Security, privacy, and role-based access control (RBAC) for self-service visualization, configuration, and reporting
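Tagging, labeling and data-model governance is one checklist item that lends itself to automation. The sketch below is a minimal example, assuming a hypothetical policy of three required labels and a fixed set of environment names; it checks a label set against that policy and reports how many distinct values each label takes, which can help teams see where high-cardinality data is coming from.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical governance policy; adjust to your organization's data model.
REQUIRED_LABELS = {"service", "environment", "team"}
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def label_violations(labels: Dict[str, str]) -> List[str]:
    """Return a list of policy violations for one set of labels."""
    missing = sorted(REQUIRED_LABELS - set(labels))
    problems = [f"missing label: {key}" for key in missing]
    env = labels.get("environment")
    if env is not None and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown environment: {env}")
    return problems


def label_cardinality(samples: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct values observed for each label key."""
    values: Dict[str, Set[str]] = {}
    for labels in samples:
        for key, value in labels.items():
            values.setdefault(key, set()).add(value)
    return {key: len(seen) for key, seen in values.items()}


good = {"service": "checkout", "environment": "prod", "team": "payments"}
bad = {"service": "checkout", "environment": "production"}

samples = [
    {"service": "checkout", "environment": "prod", "request_id": "r-1"},
    {"service": "checkout", "environment": "prod", "request_id": "r-2"},
    {"service": "catalog", "environment": "dev", "request_id": "r-3"},
]
cardinality = label_cardinality(samples)
```

A check like `label_violations` could run in the pipeline before resources are provisioned, so that label problems surface at build time rather than in a dashboard.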

Next Steps and Technology Selection Criteria

Platform building is never totally complete. It’s not an upfront-planned project that’s finished after everyone has signed off and started using it. It’s more like an iterative agile development project than a traditional waterfall one.

You start with a minimum viable product (MVP), and then you have to market your platform to the organization. Show teams how they’re going to benefit from adopting the platform’s common patterns and best practices for the entire development lifecycle. It can be effective to conduct a process analysis (current vs. future state) with various teams to jointly work on and understand the benefits of adoption. Finally, it’s essential to make onboarding as easy as possible.

As platform teams start to check off the boxes for these six platform pillar requirements, they will want to take on the mindset of a UX designer. Investigate the wants and needs of various teams, understanding that you’ll probably be able to satisfy only 80% to 90% of use cases. Some workflows will be too delicate or unique to bring into the platform. You can’t please everyone. Toolchain selection should be a cross-functional process, and executive sponsorship at the outset is necessary to drive adoption.

Key toolchain questions checklist:

Practitioner adoption: Are you starting by asking what technologies your developers are excited about? What enables them to quickly support the business? What do they want to learn and is this skillset common in the market?
Scale: Can this tool scale to meet enterprise expectations for performance, security/compliance, and ease of adoption? Can you learn from peer institutions instead of venturing into uncharted territory?
Support: Are the selected solutions supported by organizations that can meet SLAs for core critical infrastructure (24/7/365) and satisfy your customers’ availability expectations?
Longevity: Are these solution suppliers financially strong and capable of supporting these pillars and core infrastructure long-term?
Developer flexibility: Do these solutions provide flexible interfaces (GUI, CLI, API, SDK) to create a tailored user experience?
Documentation: Do these solutions provide comprehensive, up-to-date documentation?
Ecosystem integration: Are there extensible ecosystem integrations to neatly link to other tools in the chain, like security or data warehousing solutions?

For organizations that have already invested in some of these core pillars, the next step involves collaborating with ecosystem partners like HashiCorp to identify workflow enhancements and address coverage gaps with well-established solutions.

Mike is a Global Staff Solutions Engineer at HashiCorp. He has over 20 years of experience developing and implementing technology platforms, specifically focusing on resilient architectures, cloud-native design, and information security.