Engineer’s Guide to Cloud Cost Optimization: Manual DIY Optimization

Cloud cost optimization often starts with engineering teams that expect to save time and money with a do-it-yourself (DIY) approach.
Sep 25th, 2023 12:00pm
Feature image by Alexander Stein from Pixabay.
The following is the first part in a three-part series on cloud cost optimization. Check back tomorrow and Wednesday for parts two and three.

Cloud cost optimization often starts with engineering teams that expect to save time and money with a do-it-yourself (DIY) approach. They soon discover these methods are much more time-consuming than initially believed, and the effort doesn’t lead to expected savings outcomes. Let’s break down these DIY methods and explore their inherent challenges.

Introduction

One challenge cited in the State of FinOps data remains constant, despite great strides in FinOps best practices and processes. For the past three years, FinOps practitioners have experienced difficulty getting engineers to take action on cost optimization recommendations.

We don’t think they have to.

What prevents engineering and finance teams from achieving optimal cloud savings outcomes?

According to the State of FinOps data from the FinOps Foundation:

  • Teams are optimizing in the wrong order. Engineers typically default to starting with usage optimizations before addressing their discount rate. This guide explains why, in reality, you should begin with rate optimization, the lever ProsperOps automates.
  • Engineers have too much to do. Relying on them to do more work isn’t helpful and doesn’t yield desired results.
  • Cost optimization isn’t the work engineers want to do. Is cost optimization the best use of engineering time and talent?

According to engineers working in cost optimization:

  • Recommendations from non-engineers were oversimplified.
    • An instance may be sized not to meet a vCPU or memory number, but rather to satisfy a throughput, bandwidth or device-attachment requirement. Therefore, from a vCPU/RAM standpoint, an instance may look underutilized, yet it cannot be downsized because of a disk I/O requirement or another metric not observed by the recommendation process.
    • A lookback period that is too brief will produce poor recommendations. For example, if the rightsizing recommendations are based on a 60-day lookback and the app has quarterly seasonality, the recommendation may not capture the "high water mark," or seasonal peak (see the sketch after this list). In these cases, the application would ideally be modified to scale horizontally by increasing instance quantity as needed, but that is not always possible with legacy or monolithic applications.
  • Other than engineers, most people in the organization are unaware of the impact of each recommendation.
    • For instance, engineering may be asked to convert workloads to Graviton because of the cost savings opportunity. However, that would require that all installed binaries have a variant for the ARM architecture, which may be the case for the primary application but may not be available for other tools the organization requires, such as a SIEM agent, DLP tool or antivirus software.
    • Another common case in which instances may look eligible for downsizing is when a high-availability application has a warm standby (typically at the database layer) that is online due to a low RTO but is only synchronizing data, not actively serving end users.
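
To make the lookback point concrete, here is a minimal sketch in Python using boto3. It compares the peak CPU a rightsizing process would see over a 60-day window versus a full year; the instance ID is hypothetical, and daily CPU maximums stand in for whatever metric actually drives the sizing decision.

    import boto3
    from datetime import datetime, timedelta, timezone

    cloudwatch = boto3.client("cloudwatch")

    def peak_cpu(instance_id: str, lookback_days: int) -> float:
        """Highest daily-maximum CPU utilization over the lookback window."""
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(days=lookback_days),
            EndTime=now,
            Period=86400,  # one datapoint per day
            Statistics=["Maximum"],
        )
        points = resp["Datapoints"]
        return max(p["Maximum"] for p in points) if points else 0.0

    instance = "i-0123456789abcdef0"  # hypothetical instance ID
    print(f"60-day peak:  {peak_cpu(instance, 60):.1f}%")   # can miss a quarterly spike
    print(f"365-day peak: {peak_cpu(instance, 365):.1f}%")  # captures the seasonal high water mark

If the two peaks diverge sharply, a 60-day recommendation is sizing for the wrong workload.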

Cloud Cost Optimization 2.0 Mitigates FinOps Friction

In a Cloud Cost Optimization 2.0 world, cost-conscious engineers build their architectures smarter — requiring fewer and less frequent resource optimizations.

But to achieve this, the discount management process (the financial side of cloud cost optimization) must be managed efficiently. Autonomous software like ProsperOps eliminates the need for anyone, including engineers, to manage rate optimization — the lever that produces immediate and magnified cloud savings.

Engineers should take their time back and let sophisticated algorithms do the work.

But we also recognize that engineers are expert problem solvers, and some may not see the need to yield control to an autonomous solution. This guide outlines your options and the reality of various cost-optimization methods, starting with a manual do-it-yourself approach.

Part 1: The Ins and Outs of Manual DIY Cloud Cost Optimization

In the cloud, costs can be a factor throughout the software lifecycle. So as engineers begin to work with cloud-based infrastructure, they often have to become more cost-conscious. It’s a responsibility that falls outside their primary role, yet they often have to decide how to mitigate uncontrollable cloud costs and optimize what falls in their domain — resources.

Many teams use in-house methods to do this, which may include manual DIY methods and/or software that recommends optimizations.

Manual Cloud Cost Optimization Methods

These options are usually perceived to be low-cost, low-risk, community-based, open-source, or homegrown solutions that address pieces of optimization tasks.

Engineering teams may go this route because they don’t know the reality of a manual DIY optimization experience or the tools that offer the best support. They may oppose third-party tools or believe they can manage everything in-house.

Standard DIY manual work to identify and track resources and manage waste is usually handled with a combination of:

  • Spreadsheets
  • Scripts
  • Alerts and notifications
  • Tagging to define and identify resources using AWS Cost Explorer
  • Native tools like AWS Trusted Advisor, Amazon CloudWatch, AWS Compute Optimizer, AWS Config, Amazon EventBridge, Cost Intelligence Dashboards and AWS Budgets
  • Forecasting
  • Rightsizing and other resource strategies
  • Writing code to automate simple tasks (see the sketch after this list)
  • Custom dashboards based on Cost and Usage Report (CUR) data
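
As an illustration of the "writing code" bullet above, the following sketch pulls last month's spend by service from AWS Cost Explorer with boto3. It is a starting point, not a finished reporting pipeline.

    import boto3
    from datetime import date, timedelta

    ce = boto3.client("ce")  # AWS Cost Explorer

    end = date.today().replace(day=1)                  # first day of this month
    start = (end - timedelta(days=1)).replace(day=1)   # first day of last month

    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )

    for group in resp["ResultsByTime"][0]["Groups"]:
        service = group["Keys"][0]
        amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if amount > 0:
            print(f"{service:45s} ${amount:,.2f}")

Even a script this small carries the maintenance burden the sections below describe: someone has to own it, schedule it and distribute its output.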

Regarding cloud purchases, some engineers may run manual calculations to verify that cloud service changes and purchases fall within budget.

Whatever the method, a typical result is this: it’s time-consuming to act on reported findings.

Manual DIY Challenges

Getting Data and Providing Visibility for Stakeholders

With a manual approach, data lives in different places, and there is a constant effort to assimilate it for decision-making. There isn’t centralized reporting or a single pane of glass through which to view data. Data is hard to collect and is often managed in silos, so multiple stakeholders cannot see the data that matters to them.

Given this constraint, it’s easy to see how DevOps teams can struggle to validate that they are using the correct discount type, coverage of instance families and regions, and/or commitment term.
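
Even answering a basic question like "what commitments do we hold, and when do they lapse?" takes per-service API calls and manual stitching. Here is a minimal sketch of one slice of it, covering EC2 Reserved Instances only:

    import boto3

    ec2 = boto3.client("ec2")

    # One silo of discount data: active EC2 Reserved Instances and their end dates.
    # Savings Plans require a separate client ("savingsplans"), and other services
    # (RDS, ElastiCache, etc.) each have their own reservation APIs.
    resp = ec2.describe_reserved_instances(
        Filters=[{"Name": "state", "Values": ["active"]}]
    )
    for ri in resp["ReservedInstances"]:
        print(ri["ReservedInstancesId"], ri["InstanceType"],
              f'count={ri["InstanceCount"]}', f'expires={ri["End"]:%Y-%m-%d}')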

It Wastes Time and Resources

Cloud costs need to be actively managed. But engineers shouldn’t be the ones doing it.

Requiring engineers to shift from shipping product innovations to manual cloud cost optimizations is an inefficient use of engineering resources. Establishing a FinOps practice, however, helps reallocate and redistribute workload more appropriately. It’s a better strategy.

Processes Don’t Scale

Then there’s the issue of working with limiting processes, like using a spreadsheet to track resources. Spreadsheets offer a simple, static view that doesn’t capture the cloud’s dynamic nature. In practice, it’s challenging to map containers and resources or see usage at any point in time, for example.

Who is managing the costs when multiple people are working in one account? As organizations grow, it becomes more challenging to identify the point of contact managing the moving parts of discounts and resources.

With DIY, one person or a team may be absorbed in time-consuming data management and distribution with processes that don’t scale.

Missed Savings Opportunities and Discount Mismanagement

Many companies take advantage of the attractive discounts Savings Plans provide for AWS cloud services. The trade-off for the discount is a 12-month or 36-month commitment term. Without centralized, real-time data, engineering teams are more likely to act conservatively and under-commit, leaving the company exposed to higher on-demand rates.

Cloud discount instruments, like Savings Plans and Reserved Instances, offer considerable relief from on-demand rates. Managing them involves the coordination of the right type of instrument, which:

  • Is purchased at different times.
  • Covers different resources and time frames.
  • Expires at different times.
  • Must keep pace with a resource footprint or architecture that can change.

Managing discount instruments is a dynamic process involving a lot of moving parts. It’s extremely challenging to coordinate these elements manually. That’s why autonomous discount management successfully produces consistent, incremental savings outcomes, an approach we’ll cover later in this guide.
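
A back-of-the-envelope calculation shows what is at stake. Assuming, purely for illustration, $100,000 of monthly on-demand-equivalent spend and a 30% average discount on committed usage, coverage drives the effective savings rate directly:

    # Illustrative figures only, not actual AWS pricing.
    on_demand_spend = 100_000   # monthly compute spend at on-demand rates ($)
    coverage = 0.60             # fraction of usage covered by commitments
    discount = 0.30             # average discount of committed rates vs. on-demand

    covered_cost = on_demand_spend * coverage * (1 - discount)
    uncovered_cost = on_demand_spend * (1 - coverage)
    effective_savings_rate = 1 - (covered_cost + uncovered_cost) / on_demand_spend

    print(f"Effective savings rate: {effective_savings_rate:.1%}")  # 18.0%

Raising coverage from 60% to 90% at the same discount lifts the effective savings rate from 18% to 27%; that gap is what lapsed, mismatched or conservatively sized instruments leave on the table.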

It Prioritizes Waste Reduction

Engineering teams naturally focus on reducing waste because it's familiar ground and work they are already doing. However, it takes effort and time to identify and actively manage waste, which is constantly being created.

And, surprisingly, reducing waste in the cloud doesn't improve savings as much, or as quickly, as optimizing discount instruments, also called rate optimization. Consider waste reduction the third item on the list of optimization priorities.

The reality of manual DIY optimization is more challenging than many engineering teams expect. Software alleviates some of these pain points.

Cloud Cost Management Software Generates Optimization Recommendations. Is That Enough?

Most cloud optimization software helps teams understand opportunities for optimization. These tools collect, centralize and present data:

  • Across accounts and organizations.
  • From multiple sources, including native tools like the AWS Cost and Usage Report and Amazon CloudWatch, as well as third parties.
  • In a single, customizable pane of glass for various stakeholders.

These solutions centralize reporting so that stakeholders have a more comprehensive, accurate view of the environment and potential inefficiencies.

Recommendations-Based Software Challenges

It’s Still a Manual Process

The problem is that although engineering teams are being given directives on optimization, they still have to vet each recommendation and do the optimization work themselves.

Engineers have to validate the recommendation based on software and other requirements, which could involve coordination between various teams. And once a decision has been made, the change is usually implemented manually.

Limited Savings Potential

Most solutions recommend a single discount instrument as part of their optimization guidance.

Without a strategic mix of discount instruments or flexibility to accommodate usage as it scales up and down or scales horizontally, companies will end up with suboptimal savings outcomes.

Risk and Savings Loss

The instruments that offer the best cloud discounts also require a one-year or three-year commitment. Even with software-supplied data, internal and external factors make it challenging to predict usage patterns over one-year or three-year terms.

As a result, engineering teams may unintentionally under- or over-commit, both of which result in suboptimal savings.
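
A quick sketch puts numbers on both failure modes (the figures are illustrative, not actual AWS pricing): over-commitment bills for capacity that goes unused, while under-commitment forgoes the discount on usage left at on-demand rates.

    # Illustrative figures only, not actual AWS pricing.
    HOURS_PER_MONTH = 730
    discount = 0.30          # committed rate discount vs. on-demand

    # Over-commitment: the commitment is billed whether or not usage materializes.
    commitment = 10.00       # committed spend, $/hour
    applied_usage = 8.00     # eligible usage actually applied against it, $/hour
    unused = max(commitment - applied_usage, 0) * HOURS_PER_MONTH
    print(f"Unused commitment: ${unused:,.0f}/month")  # $1,460 of pure waste

    # Under-commitment: uncovered usage pays the full on-demand premium.
    uncovered_usage = 2.00   # usage left at on-demand rates, $/hour
    premium = uncovered_usage * discount * HOURS_PER_MONTH
    print(f"Foregone discount on uncovered usage: ${premium:,.0f}/month")  # $438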

Neither DIY Method Solves the Internal Coordination ‘Problem’

Operating successfully — and cost-effectively — in the cloud involves ongoing communication, education and coordination between engineers (who manage resources) and the budget holder (who may serve in IT, finance or FinOps).

This is the case whether teams are using manual DIY methods or recommendation software. The coordination looks like this:

  • The budget holder sees an optimization recommendation
  • The engineer validates if it is viable
  • If the recommendation involves resources, the engineer takes action to optimize
  • If the recommendation involves cloud rates, the budget holder or FinOps team takes action to optimize

Both DIY methods involve a lot of back-and-forth exchanges and manual effort to perform the optimization.

It’s possible to consolidate steps in this process (eliminating the back-and-forth) and shift the conversations from tactical to strategic with autonomous rate optimization.

FinOps teams, in particular, thrive with autonomous rate optimization. Because the financial aspect of cloud operations is being managed using sophisticated algorithms, cloud savings are actively realized in a hands-free manner.

With costs under control, engineers can focus on innovation and resource optimization as it fits into their project schedule.

FinOps teams, too, are freed up to address other top FinOps challenges and plans related to FinOps maturity.

Rate Optimization Provides Faster Cloud Savings Than DIY

In seeking DIY methods of cloud cost optimization, companies may lay some groundwork in understanding cloud costs, but they aren't likely to find impactful savings. This is largely because there is still manual work associated with both approaches, and resource optimization is the focus.

Resource optimization is important, but rate optimization is an often overlooked approach that yields more efficient, impactful cloud savings. We will explore rate and resource optimization further in this guide and explain why the order of optimization action makes a major difference in savings outcomes.
