Developer and IT Operations | The New Stack https://thenewstack.io/operations/

NoSQL Data Modeling Mistakes that Ruin Performance https://thenewstack.io/nosql-data-modeling-mistakes-that-ruin-performance/ Wed, 27 Sep 2023

Getting your data modeling wrong is one of the easiest ways to ruin your performance. And it’s especially easy to screw this up when you’re working with NoSQL, which (ironically) tends to be used for the most performance-sensitive workloads. NoSQL data modeling might initially appear quite simple: just model your data to suit your application’s access patterns. But in practice, that’s much easier said than done.

Fixing data modeling is no fun, but it’s often a necessary evil. If your data modeling is fundamentally inefficient, your performance will suffer once you scale to some tipping point that varies based on your specific workload and deployment. Even if you adopt the fastest database on the most powerful infrastructure, you won’t be able to tap its full potential unless you get your data modeling right.

This article explores three of the most common ways to ruin your NoSQL database performance, along with tips on how to avoid or resolve them.

Not Addressing Large Partitions

Large partitions commonly emerge as teams scale their distributed databases. A large partition is one that grows so big that it starts to introduce performance problems across the cluster’s replicas.

One of the questions that we hear often — at least once a month — is “What constitutes a large partition?” Well, it depends. Some things to consider:

  • Latency expectations:  The larger your partition grows, the longer it will take to be retrieved. Consider your page size and the number of client-server round trips needed to fully scan a partition.
  • Average payload size: Larger payloads generally lead to higher latency. They require more server-side processing time for serialization and deserialization and also incur a higher network data transmission overhead.
  • Workload needs: Some workloads organically require larger payloads than others. For instance, I’ve worked with a Web3 blockchain company that would store several transactions as BLOBs under a single key, and every key could easily get past 1 megabyte in size.
  • How you read from these partitions: For example, a time series use case will typically have a timestamp clustering component. In that case, reading from a specific time window will retrieve much less data than if you were to scan the entire partition.

The following table illustrates the impact of large partitions under different payload sizes, such as 1, 2 and 4 kilobytes.
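
The table itself appears as an image in the original article. As a rough stand-in, the short sketch below computes the same comparison for an assumed partition of 100,000 rows (the row count is an illustrative assumption, not a figure from the article):

# Illustrative only: partition size grows linearly with payload size at a fixed row count.
rows_per_partition = 100_000  # assumed row count

for payload_kb in (1, 2, 4):
    partition_mb = rows_per_partition * payload_kb / 1024
    print(f"{payload_kb} KB payload x {rows_per_partition:,} rows ~= {partition_mb:,.0f} MB per partition")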

As you can see, the higher your payload gets under the same row count, the larger your partition is going to be. However, if your use case frequently requires scanning partitions as a whole, then be aware that databases have limits to prevent unbounded memory consumption.

For example, ScyllaDB cuts off pages at every 1MB to prevent the system from potentially running out of memory. Other databases (even relational ones) have similar protection mechanisms to prevent an unbounded bad query from starving the database resources.

For example, to scan a single ScyllaDB partition holding 10K rows of 4KB each (roughly 40MB), you would need to retrieve at least 40 pages with a single query. This may not seem like a big deal at first, but as you scale over time, it can affect your overall client-side tail latency.
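
How those pages are consumed is largely the driver’s job. A minimal sketch with the Python driver, which works against both ScyllaDB and Cassandra, is shown below; the contact point, keyspace, table and key are placeholders. Each page is another client-server round trip, so a full partition scan pays that cost repeatedly:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect("my_keyspace")  # placeholder keyspace

# fetch_size controls rows per page; scanning a whole partition may span many pages.
stmt = SimpleStatement(
    "SELECT * FROM my_table WHERE pk = %s",
    fetch_size=5000,
)
for row in session.execute(stmt, ("some-partition-key",)):  # paging happens transparently
    print(row)  # placeholder for application logic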

Another consideration: With databases like ScyllaDB and Cassandra, data written to the database is stored in the commit log and under an in-memory data structure called a “memtable.”

The commit log is a write-ahead log that is never really read from, except when there’s a server crash or a service interruption. Since the memtable lives in memory, it eventually gets full. To free up memory space, the database flushes memtables to disk. That process results in SSTables (sorted strings tables), which is how your data gets persisted.

What does all this have to do with large partitions? Well, SSTables have specific components that need to be held in memory when the database starts. This ensures that reads are always efficient and minimizes wasted disk I/O when looking up data. When you have extremely large partitions (for example, we recently had a user with a 2.5-terabyte partition in ScyllaDB), these SSTable components introduce heavy memory pressure, shrinking the database’s room for caching and further constraining your latencies.

How do you address large partitions via data modeling? Basically, it’s time to rethink your primary key. The primary key determines how your data will be distributed across the cluster; choosing it well improves both performance and resource utilization.

A good partition key should have high cardinality and roughly even distribution. For example, a high cardinality attribute like User Name, User ID or Sensor ID might be a good partition key. Something like State would be a bad choice because states like California and Texas are likely to have more data than less populated states such as Wyoming and Vermont.

Or consider this example. The following table could be used in a distributed air quality monitoring system with multiple sensors:

CREATE TABLE air_quality_data (
   sensor_id text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY (sensor_id, time)
);


With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large, especially if data is gathered every couple of milliseconds. This innocent-looking table can eventually become unusable. In this example, it takes only ~50 days.

A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at the updated air_quality_data table:

CREATE TABLE air_quality_data (
   sensor_id text,
   date text,
   time timestamp,
   co_ppm int,
   PRIMARY KEY ((sensor_id, date), time)
);


After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow. This technique is called bucketing, as it allows us to control how much data is stored in partitions.
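
With the bucketed schema, every read must now name both the sensor and the day, which is exactly what keeps result sets bounded. A hedged sketch of that read path, assuming a driver session like the one in the earlier paging example (the sensor ID, date format and one-hour window are illustrative):

from datetime import datetime, timezone
from cassandra.query import SimpleStatement

day_query = SimpleStatement(
    "SELECT time, co_ppm FROM air_quality_data "
    "WHERE sensor_id = %s AND date = %s AND time >= %s AND time < %s",
    fetch_size=1000,
)
start = datetime(2023, 9, 27, 8, 0, tzinfo=timezone.utc)
end = datetime(2023, 9, 27, 9, 0, tzinfo=timezone.utc)

# One (sensor, day) bucket, further narrowed by the clustering key.
for row in session.execute(day_query, ("sensor-42", "2023-09-27", start, end)):
    print(row.time, row.co_ppm)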

Bonus: See how Discord applies the same bucketing technique to avoid large partitions.

Introducing Hot Spots

Hot spots can be a side effect of large partitions. If you have a large partition (storing a large portion of your data set), it’s quite likely that your application access patterns will hit that partition more frequently than others. In that case, it also becomes a hot spot.

Hot spots occur whenever a problematic data access pattern causes an imbalance in the way data is accessed in your cluster. One culprit: when the application fails to impose any limits on the client side and allows tenants to potentially spam a given key.

For example, think about bots in a messaging app frequently spamming messages in a channel. Hot spots could also be introduced by erratic client-side configurations in the form of retry storms. That is, a client attempts to query specific data, times out before the database does and retries the query while the database is still processing the previous one.

Monitoring dashboards should make it simple for you to find hot spots in your cluster. For example, this dashboard shows that shard 20 is overwhelmed with reads.

For another example, the following graph shows three shards with higher utilization, which correlates to the replication factor of three, configured for the keyspace in question.

Here, shard 7 introduces a much higher load due to the spamming.

How do you address hot spots? First, use a vendor utility on one of the affected nodes to sample which keys are most frequently hit during your sampling period. You can also use tracing, such as probabilistic tracing, to analyze which queries are hitting which shards and then act from there.

If you find hot spots, consider:

  • Reviewing your application access patterns. You might find that you need a data modeling change, such as the previously mentioned bucketing technique. If you need sorting, you could use a monotonically increasing component, such as Snowflake IDs. Or it may be best to apply a concurrency limiter and throttle down potential bad actors (a minimal limiter sketch follows this list).
  • Specifying per-partition rate limits, after which the database will reject any queries that hit that same partition.
  • Ensuring that your client-side timeouts are higher than the server-side timeouts to prevent clients from retrying queries before the server has a chance to process them (“retry storms”).
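
For the client-side throttling option mentioned above, a per-key concurrency limiter could look something like the following sketch (the cap of eight in-flight requests per key is an arbitrary assumption, and the function it wraps is a placeholder):

import asyncio

class PerKeyLimiter:
    """Cap concurrent requests per partition key so one hot key can't monopolize the cluster."""

    def __init__(self, max_in_flight: int = 8):
        self._max = max_in_flight
        self._semaphores: dict[str, asyncio.Semaphore] = {}

    async def run(self, key: str, make_request):
        sem = self._semaphores.setdefault(key, asyncio.Semaphore(self._max))
        async with sem:  # excess callers for this key wait here
            return await make_request()

# Usage: await limiter.run(sensor_id, lambda: fetch_readings(sensor_id))
# where fetch_readings is whatever async query function your application defines.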

Misusing Collections

Teams don’t always use collections, but when they do, they often use them incorrectly. Collections are meant for storing/denormalizing a relatively small amount of data. They’re essentially stored in a single cell, which can make serialization/deserialization extremely expensive.

When you use collections, you can define whether the field in question is frozen or non-frozen. A frozen collection can only be written as a whole; you cannot append or remove elements from it. A non-frozen collection can be appended to, and that’s exactly the type of collection that people most misuse. To make matters worse, you can even have nested collections, such as a map that contains another map, which includes a list, and so on.

Misused collections will introduce performance problems much sooner than large partitions, for example. If you care about performance, collections can’t be very large at all. For example, if we create a simple key:value table, where our key is a sensor_id and our value is a collection of samples recorded over time, our performance will be suboptimal as soon as we start ingesting data.

CREATE TABLE IF NOT EXISTS {table} (
    sensor_id uuid PRIMARY KEY,
    events map<timestamp, FROZEN<map<text, int>>>
)


The following monitoring snapshots show what happens when you try to append several items to a collection at once.

You can see that throughput decreases while the p99 latency increases. Why does this occur?

  • Collection cells are stored in memory as sorted vectors.
  • Adding elements requires a merge of two collections (old and new).
  • Adding an element has a cost proportional to the size of the entire collection.
  • Trees (instead of vectors) would improve the performance, BUT…
  • Trees would make small collections less efficient!

Returning to that same example, the solution would be to move the timestamp to a clustering key and transform the map into a frozen collection (since you no longer need to append data to it). These very simple changes will greatly improve the performance of the use case.

CREATE TABLE IF NOT EXISTS {table} (
    sensor_id uuid,
    record_time timestamp,
    events FROZEN<map<text, int>>,
    PRIMARY KEY (sensor_id, record_time)
)
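
With this layout, each sample becomes its own row, so ingestion appends rows to the partition instead of rewriting an ever-growing cell. A hedged sketch of the write path, reusing a driver session and a sensor_id from the earlier examples (the concrete table name is an assumption, since the article uses a {table} placeholder):

from datetime import datetime, timezone

insert = session.prepare(
    "INSERT INTO sensor_events (sensor_id, record_time, events) VALUES (?, ?, ?)"
)
session.execute(
    insert,
    (sensor_id,
     datetime.now(timezone.utc),
     {"co_ppm": 412, "no2_ppb": 14}),  # the whole frozen map is written once, never appended to
)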

Learn More: On-Demand NoSQL Data Modeling Masterclass

Want to learn more about NoSQL data modeling best practices for performance? Take a look at our NoSQL data modeling masterclass — three hours of expert instruction, now on demand and free. You will learn how to:

  • Analyze your application’s data usage patterns and determine which data modeling approach will be most performant for your specific usage patterns.
  • Select the appropriate data modeling options to address a broad range of technical challenges, including the benefits and trade-offs of each option.
  • Apply common NoSQL data modeling strategies in the context of a sample application.
  • Identify signs that indicate your data modeling is at risk of causing hot spots, timeouts and performance degradation — and how to recover.


Engineer’s Guide to Cloud Cost Optimization: Prioritize Cloud Rate Optimization https://thenewstack.io/the-engineers-guide-to-cloud-cost-optimization-engineering-prioritize-cloud-rate-optimization/ Wed, 27 Sep 2023

More than 60% of Amazon Web Services users’ cloud bills come from compute spend, via resources in Elastic Compute Cloud (EC2), Lambda and/or Fargate. So it makes sense to prioritize optimization in compute — this is where you’ll realize savings most efficiently.

It’s possible to reduce compute costs by more than 40%, while reducing the overall AWS bill by 25%, using Reserved Instances (RIs) and Savings Plans. The trick is to secure the appropriate level of commitment. Both over-committing and under-committing produce suboptimal savings.

Commitment Management Is Complex; Focus on Effective Savings Rate

Even though discount instruments are designed to produce savings, companies need to choose the right instrument for workloads to fully realize those savings.

All discount instruments have benefits, tradeoffs and specific rules.

How do these elements relate to cloud savings? They can be easy to misinterpret. That’s why optimizing for Effective Savings Rate (ESR) is a recommended best practice. ESR simplifies rate optimization; it focuses FinOps teams on one metric that reveals the savings outcome.

ESR is the percentage of savings you are actually realizing. It is calculated by dividing your total savings (the difference between what your usage would have cost at on-demand prices and what you actually spent using discounts like RIs and Savings Plans) by that on-demand equivalent cost.

Because utilization, coverage and discount rate are part of the calculation, it produces a consistent measure of savings performance and a reliable benchmarking metric.
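
In code, the calculation reduces to a one-liner; the dollar figures below are made-up illustrations, not benchmarks:

def effective_savings_rate(actual_spend: float, on_demand_equivalent: float) -> float:
    """Fraction saved versus running the same usage entirely at on-demand rates."""
    return 1 - (actual_spend / on_demand_equivalent)

# Example: a $70,000 bill for usage that would have cost $100,000 on demand -> 30% ESR.
print(f"{effective_savings_rate(70_000, 100_000):.0%}")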

Best Practice for Cost Optimization: Rate First, then Resource

Because cloud cost optimization is complicated, it’s helpful to organize it in these buckets and track:

  1. How much you are spending, by monitoring cloud spend per month (see the sketch after this list)
  2. Savings potential by tracking Effective Savings Rate
  3. Waste reduction and other resource optimization strategies, such as re-architecting, tracking untagged/unknown spend, rightsizing, and removing unused/unattached resources
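
For the first bucket, monthly spend can be pulled programmatically rather than from spreadsheets. A short sketch using the AWS Cost Explorer API via boto3 (the billing period is a placeholder, and the caller needs ce:GetCostAndUsage permission):

import boto3

ce = boto3.client("ce", region_name="us-east-1")  # Cost Explorer is served from us-east-1

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-09-01", "End": "2023-10-01"},  # placeholder billing period
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
)
for period in response["ResultsByTime"]:
    cost = period["Total"]["UnblendedCost"]
    print(period["TimePeriod"]["Start"], cost["Amount"], cost["Unit"])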

Optimize Rate and Resource at the Same Time with Autonomous Discount Management

Discount instruments contain many moving parts and they are complex to manage manually.

Using automation in an autonomous approach, however, creates a more efficient rate optimization experience. It enables “hands-free” management of cloud discount instruments. With cloud rate optimization being managed in a way that produces consistent incremental savings, engineering teams can focus on innovation and resource optimization synchronously.

Optimize Cloud Discount Rates and Engineering Resources at the Same Time

While prioritizing cloud cost optimization around discount rates is a way to jump-start cloud savings, it’s possible — and preferable — to optimize discount rates and engineering resources at the same time.

When discount rates are managed using algorithms, it creates intra-team efficiency for FinOps. Not only are challenges mitigated with the manual management of discount instruments, but engineering teams are also freed up to focus on strategic projects and resource optimization. It’s a scenario that produces maximized cloud savings.

Why Is an Autonomous Approach Necessary to Optimize Discount Rates?

There are certain jobs that complex algorithms perform better than humans. Cloud cost management is one of them. More specifically, calculations performed by sophisticated algorithms enable a more efficient, accurate and responsive approach: the autonomous management of discount instruments.

The concept of saving money using AWS Savings Plans and Reserved Instances (RIs) appears simple. However, it is challenging to manage and exchange these instruments in a way that provides coverage flexibility. Each instrument comes with benefits, limitations and tradeoffs, along with inherent challenges due to infrastructure volatility, commitments and terms.

Automation in the form of algorithmic calculations handles these intricacies efficiently.

Infrastructure Volatility: a Wild Card

Resource usage is dynamic in the cloud, and that movement creates unpredictable patterns whether it manifests as:

  • Increasing and decreasing usage
  • Moving from EC2 instances to Spot
  • Switching between instance families
  • Converting from EC2 to Fargate
  • Moving to various containers

These types of engineering optimizations create volatility in company infrastructure and are challenging to match at scale, particularly when rigid discount commitment terms are in place.

Discount Commitment Rules and Coverage Planning

Two things that might not be obvious about working with discount instruments: optimization efforts can be hindered by commitment rules and challenges in discount coverage planning.

Compute Savings Plans, for example, are applied in a specific order: first, to the resource that will receive the greatest discount in the account where Savings Plans are purchased.

The discount benefit can next float to other accounts within an organization, but Savings Plans are not transferable once deployed in an account. In order to maximize benefits to an organization and centralize discount management, it is a best practice to purchase savings plans in an account isolated from resource usage.

Coverage commitment planning for Savings Plans, too, is tricky because commitments are made in post-discount dollars — an abstract concept. FinOps teams must quantify upcoming needs (using post-discount dollars) for resources that contain variable discounts. It’s comparable to estimating a gift card amount for products with varying discount rates that will be bought, exchanged or returned.
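
A small worked example makes the “post-discount dollars” point concrete; both numbers are hypothetical:

# Forecast of steady-state compute usage, expressed at on-demand prices.
forecast_on_demand_per_hour = 10.00   # hypothetical $/hour
expected_blended_discount = 0.30      # hypothetical average Savings Plans discount

# Savings Plans commitments are made in post-discount dollars per hour,
# so the commitment is smaller than the on-demand forecast it is meant to cover.
commitment_per_hour = forecast_on_demand_per_hour * (1 - expected_blended_discount)
print(f"Commit roughly ${commitment_per_hour:.2f}/hour to cover "
      f"~${forecast_on_demand_per_hour:.2f}/hour of on-demand usage")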

Rigid Terms and Lock-In Risk

Companies can get locked into commitment terms that end up creating more risk than benefit.

AWS discount instruments, for example, are procured in 12-month or 36-month commitment terms. While Convertible RIs can be exchanged and Standard RIs can be resold on the RI Marketplace, Savings Plans are immutable. Once made, these commitments cannot be modified; they must be maintained through the end of the term.

Most companies respond to these constraints by under-committing their coverage. In an effort to be conservative and avoid risk, they incur on-demand rates, which are much higher.

AWS Discount Instrument Profiles.

Manual Management of Discount Instruments Produces Suboptimal Outcomes

It’s nearly impossible to execute all of these moving parts in a timely manner without a technical assist.

Companies that try to manage discount instruments manually, or via a pure play RI broker, generally wind up with:

  • Discount mismanagement
  • Missed savings opportunities
  • Overcommitment

This media company, for example, was locked into one-year discount rates and did not seek the opportunity to secure more favorable three-year rates. Higher discounts were missed. Because discount changes were being performed manually (and therefore at a slower pace), the company paid for commitments that were unutilized. With suboptimal coverage and discounts, they paid higher prices for cloud services, missing out on $3 million in potential savings.

This dynamic is very common.

Automation, however, is only part of the solution.

Autonomous Discount Management: Algorithms Enable Hands-Free Optimization

Many automated tools simplify steps or entire process sequences. Most will provide recommendations or a list of actions that require human intervention to implement. While that has value, particularly in saving time, cloud rate optimization (or discount management) is achieved holistically, with automation that performs algorithmic calculations and uses real-time telemetry to:

  1. Recognize resource usage patterns and scale up and down to cover them
  2. Autonomously manage and deploy discounts using a blended portfolio of Savings Plans, Standard Reserved Instances and Convertible Reserved Instances, taking into account the benefits and risks of each instrument
  3. Optimize for savings performance using Effective Savings Rate (ESR) and not for coverage or utilization alone.

Autonomous Discount Management is a hands-free experience that enables synchronous rate and resource optimization. FinOps teams can let the autonomous solution act for them (optimizing cloud costs) while they pursue other strategic tasks; engineering teams can optimize resources at the same time.

How Drift Optimizes Rate and Resources Synchronously

Like most companies, Drift had established internal methods to understand AWS costs and optimization performance. Savings, however, were still elusive: Drift lacked visibility into cost drivers and the appropriate tools, and discount instruments were being managed manually.

Consequently, Drift was receiving a 27% discount with only 57% coverage of discountable resources.

With one tool providing cost visibility and attribution (CloudZero) and another executing autonomous discount management (ProsperOps), Drift was able to optimize rates and resources synchronously. Drift’s ESR nearly doubled, and more than $2.9 million in savings has been returned to its cloud budget in only a few months.

Drift can now:

  • Identify optimization opportunities and proactively respond to anomalies in real time
  • Review reports in minutes, not hours, with data about costs, usage, coverage, discounts and overall cloud savings performance
  • Understand ESR improvements over time (savings performance and ROI) and their drivers
  • Continue to realize incremental savings

Engineering and finance teams also have more efficient communication and coordination. This is just one example of what is possible when rate and resource optimization is synchronized.

Synchronous Optimization Addresses Top FinOps Challenges

The right tools and FinOps-supportive culture are key to achieving synchronous rate and resource optimization—and results like Drift’s. It’s a “better together” approach that helps teams resolve key FinOps challenges depicted in The State of FinOps 2023 report, including:

  • Empowering engineers to take action on optimization
  • Getting to unit economics
  • Organizational adoption of FinOps
  • Reducing waste or unused resources
  • Enabling automation

The approach produces greater visibility into costs across the organization, a fast way to reduce costs and create savings, more time for engineering projects and improved intra-team communication.

Start a More Efficient Cloud Cost Optimization Journey with ProsperOps

We believe cloud cost optimization can be best addressed by first implementing autonomous rate optimization and working with respected vendors for specific resource optimization support.

A free savings analysis is the first step in the cost optimization journey; it reveals savings potential, results being achieved with the current strategy and optimization opportunities. To chart a course for maximized, consistent, long-term cloud savings, register for your savings analysis today!

The Pillars of Platform Engineering: Part 6 — Observability https://thenewstack.io/the-pillars-of-platform-engineering-part-6-observability/ Wed, 27 Sep 2023

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part six addresses observability. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

The last leg of any platform workflow is the monitoring and maintenance of your deployments. You want to build observability practices and automation into your platform, measuring the quality and performance of software, services, platforms and products to understand how systems are behaving. Good system observability makes investigating and diagnosing problems faster and easier.

Fundamentally, observability is about recording, organizing and visualizing data. The mere availability of data doesn’t deliver enterprise-grade observability. Site reliability engineering, DevOps or other teams first determine what data to generate, collect, aggregate, summarize and analyze to gain meaningful and actionable insights.

Then those teams adopt and build observability solutions. Observability solutions use metrics, traces and logs as data types to understand and debug systems. Enterprises need unified observability across the entire stack: cloud infrastructure, runtime orchestration platforms such as Kubernetes or Nomad, cloud-managed services such as Azure Managed Databases, and business applications. This unification helps teams understand the interdependencies of cloud services and components.

But unification is only the first step of baking observability into the platform workflow. Within that workflow, a platform team needs to automate the best practices of observability within modules and deployment templates. Just as platform engineering helps security functions shift left, observability integrations and automations should also shift left into the infrastructure coding and application build phases by baking observability into containers and images at deployment. This helps your teams build and implement a comprehensive telemetry strategy that’s automated into platform workflows from the outset.
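
To make “baking observability in” concrete, here is a hedged sketch of the kind of tracing setup a platform team might template into every service, using the OpenTelemetry Python SDK; the collector endpoint and service name are placeholders that a real platform would inject at deploy time:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Values a platform template would normally inject; hard-coded here only for illustration.
provider = TracerProvider(resource=Resource.create({"service.name": "payments-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-request"):
    pass  # application code emits spans without knowing where telemetry is shipped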

The benefits of integrating observability solutions in your infrastructure code are numerous: Developers can better understand how their systems operate and the reliability of their applications. Teams can quickly debug issues and trace them back to their root cause. And the organization can make data-driven decisions to improve the system, optimize performance, and enhance the user experience.

Workflow: Observability

An enterprise-level observability workflow might follow these eight steps:

  1. Code: A developer commits code.
    1. Note: Developers may have direct network control plane access depending on the RBACs assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the observability resource.
  7. Collect: Metrics and tracing data are collected based on configured emitters and aggregators.
  8. Response: Completion of the provisioner request is provided to the CI/CD platform for subsequent processing and/or handoff to external systems, for purposes such as security scanning or integration testing.

Observability Requirements Checklist

Enterprise-level observability requires:

  • Real-time issue and anomaly detection
  • Auto-discovery and integrations across different control planes and environments
  • Accurate alerting, tracing, logging and monitoring
  • High-cardinality analytics
  • Tagging, labeling, and data-model governance
  • Observability as code
  • Scalability and performance for multi-cloud and hybrid deployments
  • Security, privacy, and RBACs for self-service visualization, configuration, and reporting

Next Steps and Technology Selection Criteria

Platform building is never totally complete. It’s not an upfront-planned project that’s finished after everyone has signed off and started using it. It’s more like an iterative agile development project rather than a traditional waterfall one.

You start with a minimum viable product (MVP), and then you have to market your platform to the organization. Show teams how they’re going to benefit from adopting the platform’s common patterns and best practices for the entire development lifecycle. It can be effective to conduct a process analysis (current vs. future state) with various teams to jointly work on and understand the benefits of adoption. Finally, it’s essential to make onboarding as easy as possible.

As you start to check off the boxes for these six platform pillar requirements, platform teams will want to take on the mindset of a UX designer. Investigate the wants and needs of various teams, understanding that you’ll probably be able to satisfy only 80 – 90% of use cases. Some workflows will be too delicate or unique to bring into the platform. You can’t please everyone. Toolchain selection should be a cross-functional process, and executive sponsorship at the outset is necessary to drive adoption.

Key toolchain questions checklist:

  • Practitioner adoption: Are you starting by asking what technologies your developers are excited about? What enables them to quickly support the business? What do they want to learn and is this skillset common in the market?
  • Scale: Can this tool scale to meet enterprise expectations for performance, security/compliance, and ease of adoption? Can you learn from peer institutions instead of venturing into uncharted territory?
  • Support: Are the selected solutions supported by organizations that can meet SLAs for core critical infrastructure (24/7/365) and satisfy your customers’ availability expectations?
  • Longevity: Are these solution suppliers financially strong and capable of supporting these pillars and core infrastructure long-term?
  • Developer flexibility: Do these solutions provide flexible interfaces (GUI, CLI, API, SDK) to create a tailored user experience?
  • Documentation: Do these solutions provide comprehensive, up-to-date documentation?
  • Ecosystem integration: Are there extensible ecosystem integrations to neatly link to other tools in the chain, like security or data warehousing solutions?

For organizations that have already invested in some of these core pillars, the next step involves collaborating with ecosystem partners like HashiCorp to identify workflow enhancements and address coverage gaps with well-established solutions.

Controlling the Machines: Feature Flagging Meets AI https://thenewstack.io/controlling-the-machines-feature-flagging-meets-ai/ Tue, 26 Sep 2023

Have you ever stopped to consider how many movie plotlines would have been solved with a feature flag? Well, you probably haven’t — but since I spend most of my time working on different scenarios in which teams use feature flags to drive feature releases, it crosses my mind a lot. There are more than six Terminator movies, and if Cyberdyne had just feature-flagged Skynet, they could’ve killswitched the whole problem away! We could make the same analogies to The Matrix or any of a dozen other movies.

Cinema references aside, there are real translations of how these controlled release scenarios apply in the technology space. Artificial intelligence is ushering in a time of great innovation in software. What started with OpenAI and GPT-3 quickly accelerated to what seems like new models being released every week.

We’ve watched GPT-3 move to 3.5 and then to GPT-4. We’re seeing GPT-4’s 32K model emerge for larger content consumption and interaction. We’ve watched the emergence of Llama from Meta, Claude from Anthropic and Bard from Google — and that’s just the text-based LLMs. New LLMs are springing up for image creation, enhancement, document review and many other functions.

Furthermore, within each of these AI model domains, additional versions are being released as new capabilities are unlocked and trained in new ways. I can’t help but see the parallel to software development in the realm of AI models as well. These LLMs have their own software lifecycle as they are enhanced and shipped to users.

Each vendor has its own beta programs supporting segments of users being enabled for models. Product management and engineering teams are evaluating the efficacy of these models versus their predecessors and determining if they are ready for production. There are releases of these new models, in the same way you’d release a new piece of software, and along with that, there’s been rollbacks of models that have already been released.

LLMs as a Feature

Looking at the concept through that lens, it becomes easy to see the connection between AI models and the practice of feature flagging and feature management. We at LaunchDarkly talk a lot about controlling the experience of users, enabling things like beta programs or even robust context-based targeting with regard to features that are being released. The same concepts translate directly to the way users consume any AI model.

What if you wanted to enable basic GPT-3.5 access for the majority of your users, but your power users were entitled to leverage GPT-4, and your most advanced users were able to access the GPT-4-32K model that supports significantly longer character limits at a higher cost? Concepts like this are table stakes for feature flagging. Even Sam Altman at OpenAI talks about the availability of a killswitch concept that lives within GPT-4. Essentially, we’ve come full circle to The Terminator reference: he is advocating for a means to disable it if things ever get too scary.

Take the following JavaScript code sample as an example, from a NextJS 13.4-based application that simulates the ability to opt-in and opt-out of API models:

In this example, we’re getting the model from a LaunchDarkly feature flag, deciding what sort of token length to leverage based on the model selected and feeding that model into our OpenAI API call. This is a specific example leveraging the OpenAI API, but the same concept would translate to using something like Vercel’s AI package, which allows a more seamless transition between different types of AI models.
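
The original JavaScript sample is embedded as an image in the source article and isn’t reproduced here. As a rough Python equivalent of the same pattern, where the flag decides the model and the model decides the token budget, a sketch might look like the following (the SDK keys, flag key, token limits and prompt are all assumptions, and the OpenAI call uses the 2023-era ChatCompletion API):

import ldclient
from ldclient import Context
from ldclient.config import Config
import openai

openai.api_key = "sk-placeholder"              # placeholder credential
ldclient.set_config(Config("sdk-key-placeholder"))
ld = ldclient.get()

user = Context.builder("user-key-123").kind("user").build()

# The flag decides which model this user gets; fall back to the baseline model.
model = ld.variation("ai-model", user, "gpt-3.5-turbo")

# Pick a response token budget based on the model served (illustrative values).
max_tokens = {"gpt-3.5-turbo": 512, "gpt-4": 1024, "gpt-4-32k": 4096}.get(model, 512)

response = openai.ChatCompletion.create(
    model=model,
    max_tokens=max_tokens,
    messages=[{"role": "user", "content": "Summarize today's release notes."}],
)
print(response["choices"][0]["message"]["content"])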

Within the application itself, once you log in, you’re presented with the option to opt in to a new model as needed, as well as opt out to return to the default model.

Measuring the Model

As these models mature, we’ll want more ways to measure how effective they are against different vendors and model types. We’ll have to consider questions such as:

  • How long does a model take to return a valid response?
  • How often is a model returning correct information versus a hallucination?
  • How can we visualize this performance with data and use it to help us understand which model is right to use, and when?
  • What about when we want to serve the new model to 50% of our users to evaluate against?

Software is in a constant state of evolution; this is something we’ve become accustomed to in our space, but it’s so interesting how much of it still relies on the same core principles. The software delivery lifecycle is still a real thing. Code is still shipped to a destination to run on, and released to users to consume. AI is no different in this case.

As we see the LLM space become more commoditized, with multiple vendors offering unique experiences, the tie-ins into concepts like CI/CD, feature flagging and software releases are only going to grow in frequency. The way organizations integrate AI into their product and ultimately switch models to gain better efficiency is going to become a practice the software delivery space will need to adopt.

At LaunchDarkly Galaxy 23, our user conference, I’ll be walking through a hands-on example of these concepts using LaunchDarkly to control AI availability in an application. It’ll be a session focused on hands-on experiences, showing live what this model looks like in a product. With any luck, we’ll build a solid foundation of how we can establish a bit more control over machines and protect ourselves from the ultimate buggy code, which results in the machines taking control. At minimum, I’ll at least show you how to write in a killswitch. =)

Address High Scale Google Drive Data Exposure with Bulk Remediation https://thenewstack.io/address-high-scale-google-drive-data-exposure-with-bulk-remediation/ Tue, 26 Sep 2023

Millions of organizations around the globe use SaaS applications like Google Drive to store and exchange company files internally and externally. Because of the collaborative nature of these applications, company files can easily be made public, held externally with vendors, or shared with private email accounts. Data exposure risk increases exponentially as companies scale operations and internal data. Files shared through SaaS applications like Google Drive can expose significant business-critical data that could end up in the wrong hands.

As technology companies experience mass layoffs, IT professionals should take extra caution when managing shared file permissions. For example, if a company recently laid off an employee that shared work files externally with their private email, the former employee will still have access to the data. Moreover, if the previous employee begins working for a competitor, they can share sensitive company files, reports and data with their new employer. Usually, once internal links are publicly shared with an external source, the owner of the file is unable to see who else has access. This poses an enormous security risk for organizations as anyone, including bad actors or competitors, can easily steal personal or proprietary information within the shared documents.

Digitization and Widespread SaaS Adoption

Smaller, private companies tend to underestimate their risk of data exposure when externally sharing files. An organization is still at risk even if they only have a small number of employees. On average, one employee creates 50 new SaaS assets every week. It only takes one publicly-shared asset to expose private company data.

The growing adoption of SaaS applications and digital transformation are exacerbating this problem. In today’s digital age, companies are becoming more digitized and shifting from on-premises or legacy systems to the cloud. Within 24 months, a typical business’s total SaaS assets will multiply by four times. As organizations grow and scale, the amount of SaaS data and events becomes uncontrollable for security teams to maintain. Without the proper controls and automation in place, businesses are leaving a massive hole in their cloud security infrastructure that only worsens as time goes on. The longer they wait to tackle this challenge, the harder it becomes to truly gain confidence in their SaaS security posture.

Pros and Cons of Bulk Remediating

Organizations looking to protect themselves from this risk should look to bulk remediate their data security. By bulk remediating, IT leaders can quickly ensure a large amount of sensitive company files remain private and are unable to be accessed by third parties without explicit permission. This is a quick way to guarantee data security as organizations scale and become digitized.

However, as an organization grows, they will likely retain more employees, vendors, and shared drives. When attempting to remediate inherited permissions for multiple files, administrators face the difficulty of ensuring accurate and appropriate access levels for each file and user. It requires meticulous planning and a thorough understanding of the existing permission structure to avoid unintended consequences.

Coordinating and executing bulk remediation actions can also be time-consuming and resource-intensive, particularly when dealing with shared drives that contain a vast amount of files and multiple cloud, developer, security, and IT teams with diverse access requirements. The process becomes even more intricate when trying to strike a balance between minimizing disruption to users’ workflows and enforcing proper data security measures.
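
As a concrete illustration of what one remediation step can look like, the sketch below removes “anyone with the link” permissions from a single file via the Google Drive API. It is a hedged example: the credential file, scopes, and the batching, dry-run and shared-drive handling a real bulk remediation job would need are all assumptions left out here.

from google.oauth2 import service_account
from googleapiclient.discovery import build

creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder credential file
    scopes=["https://www.googleapis.com/auth/drive"],
)
drive = build("drive", "v3", credentials=creds)

def remove_public_links(file_id: str) -> None:
    """Delete any 'anyone' (public link) permission on one file."""
    perms = drive.permissions().list(
        fileId=file_id, fields="permissions(id,type,role)"
    ).execute()
    for perm in perms.get("permissions", []):
        if perm["type"] == "anyone":
            drive.permissions().delete(fileId=file_id, permissionId=perm["id"]).execute()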

Managing SaaS Data Security

Organizations looking to manage their SaaS data security should first understand their current risk exposure and the number of applications currently used within the company. This will help IT professionals gain a better understanding of which files to prioritize that contain sensitive information that needs to quickly be remediated. Next, IT leaders should look for an automated and flexible bulk remediation solution to help them quickly manage complex file permissions as the company grows.

Companies should ensure they are only using SaaS applications that meet their specific security standards. This is crucial not only to avoid data exposure, but also to comply with business compliance regulations. IT admins should reassess their overall data posture each quarter and evaluate whether current SaaS applications are properly securing their private assets. Automation workflows within specific bulk remediation plans should be continuously updated to ensure companies are not missing security blind spots.

Each organization has different standards and policies that they will determine as best practices to keep their internal data safe. As the world becomes increasingly digital and the demand for SaaS applications exponentially grows, it is important for businesses to ensure they are not leaving their sensitive data exposed to third parties. Those that fail to remediate their SaaS security might be the next victim of a significant data breach.

Confluent Kafka Cloud Gets Apache Flink Instant Analytics https://thenewstack.io/confluent-kafka-cloud-gets-apache-flink-instant-analytics/ Tue, 26 Sep 2023

Confluent is embellishing its Kafka-based message brokering cloud service with real-time analytics capabilities, through a managed version of Apache Flink. The service aims to eliminate the burden of setting up analytics on the enterprise backend, and its serverless scale-up model should save on provisioning costs as well, the company says.

The serverless-based billing is “based on usage, not on allocated compute or allocated capacity,” said James Rowland-Jones, Confluent stream processing product leader, in an interview with TNS. “And so what users end up paying for is actual usage, not provisioned capacity.”

More details will be forthcoming at the company’s Current 2023 user conference, being held this week in San Jose, California.

In addition to the inclusion of Flink, the company also unveiled a new self-service data portal, which will provide a graphical user interface for running data streams. The company has also forged a number of partnerships with AI companies to bring AI capabilities to data streams. It also has a new package, called Enterprise Cluster, that allows users of Confluent Cloud to set up private clusters accessible over VPC. It is built on the Kora Engine.

The use of real-time data analysis is on the rise as more businesses compete in crowded marketplaces. The 2023 Data Streaming report estimated that 72% of organizations use it to power mission-critical systems.

While Kafka can manage large-scale data streams coming in from a source, additional analysis may be needed to make sense of the data. Over the past few years, Apache Flink has risen to prominence, often in conjunction with Kafka, as an open source platform for high-throughput, low-latency stream processing. Data processing in this context, according to Confluent, can mean tasks such as matching drivers and riders for a ride-sharing company, ferreting out fraudulent activity for financial companies, and detecting unusual activity for security companies.

Initially, the interface for Flink will be through SQL, meaning developers can write SQL queries to interrogate the data. If someone creates a topic and a schema within Kafka, Flink will create a SQL table that can be queried against in Flink, eliminating the need to set up a table separately. “And so for you as a user, you don’t have to duplicate the metadata, it’s already there,” Rowland-Jones said. Next year, the company will open Flink to more programmatic analysis through a set of programmatic APIs for Python and Java.
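
Outside Confluent’s managed service, a rough self-managed sketch of the same topic-to-SQL-table idea with PyFlink might look like the following (the topic, brokers and schema are placeholders, and the Flink Kafka connector jar is assumed to be available on the classpath):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare a table backed by a Kafka topic; in Confluent's managed Flink the
# table is derived from the topic and its schema automatically.
t_env.execute_sql("""
    CREATE TABLE rides (
        ride_id STRING,
        amount  DOUBLE,
        ts      TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'rides',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

# A continuous query over the stream.
t_env.execute_sql("SELECT ride_id, amount FROM rides WHERE amount > 100").print()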

Apache Flink will be available as an open preview for current Confluent Cloud customers using AWS in select regions for testing and experimentation purposes. General availability is coming soon.

More highlights from the conference:

Engineer’s Guide to Cloud Cost Optimization: Engineering Resources in the Cloud https://thenewstack.io/the-engineers-guide-to-cloud-cost-optimization-engineering-resources-in-the-cloud/ Tue, 26 Sep 2023

As companies search for ways to optimize costs, they want — and need — data about many different elements, including the amount of money they are spending in the cloud, the savings potential and resources they can eliminate.

It can get complicated fast. To help teams prioritize, it’s a good idea to break down cloud cost optimization into two different universes: engineering and finance.

Resource Optimization vs. Rate Optimization

Engineering teams are actively building and maintaining cloud infrastructure. For this reason, they tend to own resource optimizations.

Rate optimization is about pricing and payment for cloud services. Depending on the organization’s structure and FinOps maturity, a person in IT, finance or FinOps may manage the cloud budget and levers used to achieve cloud savings.

Resource Optimization Happens in the Engineering Universe

Any company can get caught in the vicious cycle of cloud costs. It’s important to understand the implications of resource decisions (made by engineering teams), whether companies are new to the cloud or born in the cloud. Otherwise, they scramble to adjust and react to costs with potentially low visibility and governance.

Companies taking advantage of AWS Migration Acceleration Program (MAP) credits can fast-track cloud migrations with engineering guidance and cost-cutting service credits. Without forethought, resource and cost strategy, and visibility and accountability, there can be sticker shock once the assistance runs out.

Cloud-native companies may understand the cloud and even have costs fairly contained. But sudden, rapid growth or a significant business change means that they have to adjust to unknowns. Costs can rise even with optimization mechanisms in place.

Both situations involve an issue around scale. It takes time for companies to monitor, report and determine how to best optimize cost and resources. As they figure this out, they may be paying higher on-demand rates, which is not cost-effective.

With regard to cloud compute, one area that engineers may feel in command of, because of the nature of their work, is resource waste. They may use resource deployment and monitoring techniques like the following to use resources more efficiently and manage waste:

  • Right-sizing: Matching instance types and sizes to workload performance and capacity requirements. This is challenging to implement in organizations that have rapidly changing or scaling environments because it requires constant analysis and manual infrastructure changes to achieve results.
  • Auto-scaling: Setting up application scaling plans that efficiently utilize resources. This optimization technique might be effective for highly dynamic workloads, but it doesn’t offer as much cost savings for organizations with static usage.
  • Scheduling: Configuring instances to automatically stop and start based on a certain schedule. This might be practical for organizations with strict oversight and monitoring capabilities, but most basic usage and cost analysis tools make regular predictions challenging (a minimal scheduling sketch follows this list).
  • Spot Instances: Leveraging EC2 Spot Instances to take advantage of unused capacity at a deep discount. These spot instances might only be useful for background tasks or other fault-tolerant workloads because there’s a chance they can be reclaimed by AWS with just a two-minute warning.
  • Cloud-native design: Building more cost-efficient applications that fully utilize the unique capabilities of the cloud. This can require a significant overhaul if an application is migrated from on-premises infrastructure without any redesign.
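
For the scheduling technique above, the sketch below stops tagged instances outside working hours. The Schedule=office-hours tag is a made-up local convention, and in practice this would run from a scheduled Lambda or cron job with a matching start routine in the morning:

import boto3

ec2 = boto3.client("ec2")

def stop_office_hours_instances() -> None:
    """Stop running instances tagged Schedule=office-hours (tag name is a local convention)."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)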

It takes time and effort to define, locate and actively manage waste. But the surprising thing is that reducing waste doesn’t affect cloud costs as much as one might expect. There is another lever that produces more immediate, expansive cloud savings: rate optimization.

Cloud Rate Optimization: the Financial Side of Cloud Computing

Cloud providers offer various discount instruments to incentivize the use of cloud services. AWS, for example, offers:

  • Spot Instance discounts
  • Savings Plans
    • EC2 Instance
    • Compute
  • Reserved Instances
    • Standard Reserved Instances
    • Convertible Reserved Instances
  • Enterprise Discount Programs
  • Promotional Credits

Spot Instances

Spot Instances provide a deep discount for unused EC2 instances. Spot is a good option for applications that can run anytime and be interrupted. When appropriate, consider Spot Instances for operational tasks like data analysis, batch jobs and background processing.

Savings Plans

Compute Savings Plans offer attractive discounts that span EC2, Lambda and Fargate. They require an hourly commitment (in unit dollars) for a one-year or three-year term and can be applied to any instance type, size, tenancy, region and operating system.

It’s possible to save more than 60% using Compute Savings Plans, compared to on-demand pricing.

While the regional coverage is attractive, commitment terms are fixed with Savings Plans; they can’t be adjusted or decreased once made. This rigidity often leads to under- and over-commitment (and suboptimal savings).

EC2 Savings Plans only offer savings for EC2 instances. They require a one-year or three-year commitment with the option to pay all, partial or no upfront. They are flexible enough to cover instance family changes, including alterations in an operating system, tenancy and instance size. But these kinds of changes aren’t very common, and they can negatively affect the discount rate.

Overcommitment is common with EC2 Savings Plans. Standard RIs are viewed as a better option since they produce similar savings and are easier to acquire and offload on the RI marketplace.

Reserved Instances

Reserved Instances (RI) provide a discounted hourly rate for an EC2 capacity reservation or an RI that is scoped to a region. In either scenario, discounts are applied automatically for a quantity of instances:

  • When EC2 instances match active RI attributes
  • For AWS Availability Zones and instance sizes specific to certain regions

It’s possible to save more than 70% off of on-demand rates with Reserved Instances. They offer flexibility: instance families, operating system types and tenancies can be changed to help companies capitalize on discount opportunities. RIs don’t alter the infrastructure itself, so they are not a resource risk.

There are two different types of Reserved Instances: Standard RIs and Convertible RIs:

Standard RIs can provide more than 70% off of on-demand rates, but flexibility is restricted to liquidity in the RI marketplace and there are restrictions regarding when and how often they can be sold. They are best used for common instance types, in common regions, on common operating systems.

Convertible RIs can offer up to 66% off of on-demand rates, but their flexibility can actually generate greater savings than other discount instruments due to higher coverage potential. Convertible RIs are:

  • Flexible. It’s possible to apply discounts to EC2 instances before rightsizing or executing another resource optimization.
  • Offer Greater Amounts of Savings. Even with three-year commitment terms, CRIs can be exchanged. So if EC2 usage changes, it is still possible to maintain EC2 coverage while maximizing an Effective Savings Rate.
  • Produce Faster Savings. Commitments can be made in aggregate for expected EC2 usage, instead of by specific instances that may have been forecasted. CRIs enable a more efficient “top-down” approach to capacity planning so that companies can realize greater savings faster without active RI management.

Enterprise Discount Program/Private Pricing Agreements/Promotional Credits

AWS Enterprise Discount Program (EDP) offers high discounts (starting at $500,000 per year) for commitments to spend a certain amount or use a certain amount of resources over a period of time. Generally, these agreements require a commitment term for higher discounts, typically ranging from two to seven years. They are best for organizations that currently have, or plan to have, a large footprint in the cloud.

It’s common for companies to double down and commit fairly aggressively for discounts using an EDP, but they may not realize how the agreement may limit their ability to make additional optimizations (using additional discount instruments, reducing waste or changing resources).

Overcommitting to high spend or usage before making optimizations is common, so it’s always recommended to negotiate EDPs based on a post-optimization environment. Discount tiers may be fixed, but support type, growth targets and commitment spanning years (instead of a blanket commitment) can — and should — be negotiated.

The Pillars of Platform Engineering: Part 5 — Orchestration https://thenewstack.io/the-pillars-of-platform-engineering-part-5-orchestration/ Tue, 26 Sep 2023

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part five addresses orchestration. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines, and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

When it comes time to deploy your application workloads, a workload orchestrator makes the job much easier if you’re working with distributed applications or microservices, or if you simply want resilience across cloud infrastructure.

Workload orchestrators such as Kubernetes and HashiCorp Nomad provide a multitude of benefits over traditional technologies, though the level of effort required to achieve those benefits varies. For example, rearchitecting for containerization to adopt Kubernetes may involve more effort than using an orchestrator like HashiCorp Nomad, which is oriented more toward supporting a variety of workload types. In either case, workload orchestrators enable:

  • Improved resource utilization
  • Scalability and elasticity
  • Multicloud and hybrid cloud support
  • Developer self-service
  • Service discovery and networking (built-in or pluggable)
  • High availability and fault tolerance
  • Advanced scheduling and placement control
  • Resource isolation and security
  • Cost optimization

Orchestrators provide optimization algorithms to determine the most efficient way to allocate workloads across your infrastructure resources (bin-packing, spread, affinity, anti-affinity, autoscaling, dynamic application sizing and so on), which can lower costs. They automate distributed computing and resilience strategies without developers having to know much about how they work under the hood.
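
As a concrete illustration of these placement controls, the following is a minimal sketch of a Kubernetes Deployment that combines resource requests (which drive bin-packing decisions) with anti-affinity so replicas are spread across nodes; the names and image are illustrative, not taken from any specific environment:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: web
            topologyKey: kubernetes.io/hostname   # never co-locate two replicas on one node
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:            # the scheduler bin-packs against these requests
            cpu: 250m
            memory: 256Mi
EOF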

As with the other platform pillars, the main goal is to standardize workflows, and an orchestrator is a common way modern platform teams unify deployment workflows to eliminate ticket-driven processes.

When choosing an orchestrator, it’s important to make sure it’s flexible enough to handle future additions to your environments and heterogeneous workflows. It’s also crucial that the orchestrator can handle multitenancy and easily federate across multiple on-premises data centers and multicloud environments.

It is important to note that not all systems can be containerized or shifted to a modern orchestrator; vendor-provided monolithic appliances and applications are common examples. Platform teams should therefore identify opportunities for other teams to optimize engagement and automation with orchestrators, in line with the tenets of the other platform pillars. Modern orchestrators provide a broad array of native features. While specific implementations and functionality vary across systems, there are a number of core requirements.

Workflow: Orchestration

A typical orchestration workflow should follow these eight steps:

  1. Code: A developer commits code.
    1. Note: Developers may have direct access to the orchestrator control plane, depending on the role-based access controls (RBAC) assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers common tasks (test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts, or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the orchestrator resource.
  7. Job: The orchestrator runs jobs on target resources based on defined tasks and policies.
  8. Response: Completion of the provisioner request is provided to the CI/CD platform for subsequent processing and/or handoff to external systems that perform actions such as security scanning or integration testing.

Orchestration flow

Orchestration Requirements Checklist

Successful orchestration requires:

  • Service/batch schedulers
  • Flexible task drivers
  • Pluggable device interfaces
  • Flexible upgrade and release strategies
  • Federated deployment topologies
  • Resilient, highly available deployment topologies
  • Autoscaling (dynamic and fixed)
  • An access control system (IAM JWT/OIDC and ACLs)
  • Support for multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Integration with trusted identity providers with single sign-on and delegated RBAC
  • Functional, logical, and/or physical isolation of tasks
  • Native quota systems
  • Audit logging
  • Enterprise support based on an SLA (e.g. 24/7/365)
  • Configuration through automation (infrastructure as code, runbooks)

The sixth and final pillar of platform engineering is observability: Check back tomorrow!

The post The Pillars of Platform Engineering: Part 5 — Orchestration appeared first on The New Stack.

]]>
The Pillars of Platform Engineering: Part 4 — Connectivity https://thenewstack.io/the-pillars-of-platform-engineering-part-4-connectivity/ Mon, 25 Sep 2023 20:00:49 +0000 https://thenewstack.io/?p=22718860

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering.

The post The Pillars of Platform Engineering: Part 4 — Connectivity appeared first on The New Stack.

]]>

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part four addresses network connectivity. The other parts of the guide are listed below, and you can download a full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines, and checklists:

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

Network connectivity is a hugely under-discussed pillar of platform engineering, with legacy patterns and hardware still in use at many enterprises. It needs careful consideration and strategy right alongside the provisioning pillar, since connectivity is what allows applications to exchange data and is part of both the infrastructure and application architectures.

Traditionally, ticket-driven processes were expected to support routine tasks like creating DNS entries, opening firewall ports or network ACLs, and updating traffic routing rules. This caused (and still causes in some enterprises) days-to-weeks-long delays in simple application delivery tasks, even when the preceding infrastructure management is fully automated. In addition, these simple updates are often manual, error-prone, and not conducive to dynamic, highly fluctuating cloud environments. Without automation, connectivity definitions and IP addresses quickly become stale as infrastructure is rotated at an increasingly rapid pace.

To adapt networking to modern dynamic environments, platform teams are bringing networking functions, software, and appliances into their infrastructure as code configurations. This brings the automated speed, reliability, and version-controlled traceability benefits of infrastructure as code to networking.

When organizations adopt microservices architectures, they quickly realize the value of software-driven service discovery and service mesh solutions. These solutions create an architecture in which services are discovered and automatically connected according to centralized policies in a zero trust network: services with permission can communicate, and the secure default is to deny all other service-to-service connections. In this model, service-based identity is critical to ensuring strict adherence to common security frameworks.
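
To make the deny-by-default model concrete, here is a minimal sketch assuming a Consul-based service mesh; the service names web and api are illustrative, and newer Consul versions manage the same rules as service-intentions configuration entries:

# Deny all service-to-service traffic by default
consul intention create -deny '*' '*'

# Explicitly allow only the connections the application needs
consul intention create web api

# Verify whether a specific connection would be allowed
consul intention check web api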

An organization’s choice for its central shared registry should be multicloud, multiregion and multiruntime, meaning it can connect a variety of runtimes, including VMs, bare metal, serverless platforms and Kubernetes clusters. Teams need to minimize the need for traditional networking ingress or egress points that pull their environments back toward an obsolete “castle-and-moat” network perimeter approach to security.

Workflow: Connectivity

A typical network connectivity workflow should follow these eight steps:

  1. Code: The developer commits code.
    1. Note: Developers may have direct network control plane access depending on the role-based access controls (RBAC) assigned to them.
  2. Validate: The CI/CD platform submits a request to the IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g. test, build, deploy).
  4. Request: The provisioner executes requested patterns, such as building modules, retrieving artifacts, or validating policy against internal and external engines, ultimately provisioning defined resources.
  5. Provision: Infrastructure is provisioned and configured, if not already available.
  6. Configure: The provisioner configures the connectivity platform.
  7. Connect: Target systems are updated based on defined policies.
  8. Response: A metadata response packet is sent to CI/CD and to external systems that perform actions such as security scanning or integration testing.

Connectivity flow (the Connect box includes service mesh and service registry)

Connectivity Requirements Checklist

Successful network connectivity automation requires:

  • A centralized shared registry to discover, connect, and secure services across any region, runtime platform, and cloud service provider
  • Support for multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Health checks
  • Multiple segmentation and isolation models
  • Layer 4 and Layer 7 traffic management
  • Implementation of security best practices such as defense-in-depth and deny-by-default
  • Integration with trusted identity providers with single sign-on and delegated RBAC
  • Audit logging
  • Enterprise support based on an SLA (e.g. 24/7/365)
  • Support for automated configuration (infrastructure as code, runbooks)

Check back tomorrow for the fifth pillar of platform engineering: Orchestration.

The post The Pillars of Platform Engineering: Part 4 — Connectivity appeared first on The New Stack.

]]>
Engineer’s Guide to Cloud Cost Optimization: Manual DIY Optimization https://thenewstack.io/engineers-guide-to-cloud-cost-optimization-manual-diy-optimization/ Mon, 25 Sep 2023 19:00:23 +0000 https://thenewstack.io/?p=22718826

Cloud cost optimization often starts with engineering teams that expect to save time and money with a do-it-yourself (DIY) approach.

The post Engineer’s Guide to Cloud Cost Optimization: Manual DIY Optimization appeared first on The New Stack.

]]>

Cloud cost optimization often starts with engineering teams that expect to save time and money with a do-it-yourself (DIY) approach. They soon discover these methods are much more time-consuming than initially believed, and the effort doesn’t lead to expected savings outcomes. Let’s break down these DIY methods and explore their inherent challenges.

Introduction

One challenge cited in The State of FinOps data remains constant, despite great strides in FinOps best practices and processes. For the past three years, FinOps practitioners have experienced difficulty getting engineers to take action on cost optimization recommendations.

We don’t think they have to.

What prevents engineering and finance teams from achieving optimal cloud savings outcomes?

According to the State of FinOps data, from the FinOps Foundation:

  • Teams are optimizing in the wrong order. Engineers typically default to starting with usage optimizations before addressing their discount rate. This guide explains why you should, in reality, begin with rate optimizations, specifically with ProsperOps.
  • Engineers have too much to do. Relying on them to do more work isn’t helpful and doesn’t yield desired results.
  • Cost optimization isn’t the work engineers want to do. Is cost optimization the best use of engineering time and talent?

According to engineers working in cost optimization:

  • Recommendations from non-engineers were over-simplified
    • An instance may be sized not to meet a vCPU or memory number, but rather to satisfy a throughput, bandwidth or device-attachment requirement. So from a vCPU/RAM standpoint, an instance may look underutilized, yet it cannot be downsized because of a disk I/O requirement or another metric the recommendation process doesn’t observe.
    • A lookback period that is too brief will also produce poor recommendations. For example, if rightsizing recommendations are based on a 60-day lookback and the application has quarterly seasonality, the recommendation may miss the seasonal peak, or “high-water mark.” Ideally, the application would be modified to scale horizontally by adding instances as needed, but that is not always possible with legacy or monolithic applications. (A sketch of how to sanity-check recommendations against longer-term metrics follows this list.)
  • Other than engineers, most people in the organization are unaware of the impact of each recommendation.
    • For instance, engineering may be asked to convert workloads to Graviton because of the cost savings opportunity. However, that requires every installed binary to have a variant for the Arm microarchitecture, which may be true for the primary application but not for other tools the organization requires, such as a SIEM agent, DLP tool or antivirus software.
    • Another common case in which instances look eligible for downsizing is a high-availability application with a warm standby (typically at the database layer) that is kept online to meet a low RTO but is only synchronizing data, not actively serving end users.
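
As referenced above, one way engineers sanity-check a rightsizing recommendation is to pull a longer lookback of peak metrics themselves. Here is a minimal sketch using the AWS CLI; the instance ID is a placeholder, the 12-month window is illustrative, and the same check should be repeated for the disk and network metrics that rightsizing tools often ignore:

# Peak CPU for one instance over a 12-month lookback, one data point per day
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=<instance-id> \
    --start-time 2022-09-01T00:00:00Z \
    --end-time 2023-09-01T00:00:00Z \
    --period 86400 \
    --statistics Maximum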

Cloud Cost Optimization 2.0 Mitigates FinOps Friction

In a Cloud Cost Optimization 2.0 world, cost-conscious engineers build their architectures smarter — requiring fewer and less frequent resource optimizations.

But to achieve this, the discount management process (the financial side of cloud cost optimization) must be managed efficiently. Autonomous software like ProsperOps eliminates the need for anyone, including engineers, to manage rate optimization — the lever that produces immediate and magnified cloud savings.

Engineers should take their time back and let sophisticated algorithms do the work.

But we also recognize that engineers are expert problem solvers, and some may not see the need to yield control to an autonomous solution. This guide outlines your options and the reality of various cost-optimization methods, starting with a manual do-it-yourself approach.

Part 1: The Ins-and-Outs of Manual DIY Cloud Cost Optimization

In the cloud, costs can be a factor throughout the software lifecycle. So as engineers begin to work with cloud-based infrastructure, they often have to become more cost-conscious. It’s a responsibility that falls outside their primary role, yet they often have to decide how to mitigate uncontrollable cloud costs and optimize what falls in their domain — resources.

Many teams use in-house methods to do this, which may include manual DIY methods and/or software that recommends optimizations.

Manual Cloud Cost Optimization Methods

These options are usually perceived to be low-cost, low-risk, community-based, open-source, or homegrown solutions that address pieces of optimization tasks.

Engineering teams may go this route because they don’t know the reality of a manual DIY optimization experience or the tools that offer the best support. They may oppose third-party tools or believe they can manage everything in-house.

Standard manual DIY methods for identifying and tracking resources and managing waste are usually handled with a combination of the following (a sample cost query appears after the list):

  • Spreadsheets
  • Scripts
  • Alerts and notifications
  • Tagging to define and identify resources using AWS Cost Explorer
  • Native tools like AWS Trusted Advisor, Amazon CloudWatch, AWS Compute Optimizer, AWS Config, Amazon EventBridge, Cost Intelligence Dashboards and AWS Budgets
  • Forecasting
  • Rightsizing and other resource strategies
  • Writing code to automate simple tasks
  • Custom dashboards, based on CUR data
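
For example, the tagging-plus-Cost Explorer approach above often boils down to ad hoc CLI queries. Here is a minimal sketch using the AWS CLI; the date range and the “team” cost-allocation tag are illustrative assumptions:

# Last month's unblended cost, grouped by the team cost-allocation tag
aws ce get-cost-and-usage \
    --time-period Start=2023-08-01,End=2023-09-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=TAG,Key=team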

Regarding cloud purchases, some engineers may run manual calculations to verify that cloud service changes and purchases fall within the budget.

Whatever the method, a typical result is this: it’s time-consuming to act on reported findings.

Manual DIY Challenges

Getting Data and Providing Visibility for Stakeholders

With a manual approach, data lives in different places, and there is a constant effort to assimilate it for decision-making. There isn’t centralized reporting or a single pane of glass through which to view data. Data is hard to collect and is often managed in silos, so multiple stakeholders cannot see the data that matters to them.

Given this constraint, it’s easy to see how DevOps teams can struggle to validate that they are using the correct discount type, coverage of instance families and regions, and/or commitment term.

It Wastes Time and Resources

Cloud costs need to be actively managed. But engineers shouldn’t be the ones doing it.

Requiring engineers to shift from shipping product innovations to manual cloud cost optimizations is an inefficient use of engineering resources. Establishing a FinOps practice, however, helps reallocate and redistribute workload more appropriately. It’s a better strategy.

Processes Don’t Scale

Then there’s the issue of working with limiting processes, like using a spreadsheet to track resources. Spreadsheets offer a simple, static view that doesn’t capture the cloud’s dynamic nature. In practice, it’s challenging to map containers and resources or see usage at any point in time, for example.

Who is managing the costs when multiple people are working in one account? As organizations grow, it becomes more challenging to identify the point of contact managing the moving parts of discounts and resources.

With DIY, one person or a team may be absorbed in time-consuming data management and distribution with processes that don’t scale.

Missed Savings Opportunities and Discount Mismanagement

Many companies take advantage of the attractive discounts Savings Plans provide for AWS cloud services. The trade-off for the discount is a 12-month or 36-month commitment term. Without centralized, real-time data, engineering teams are more likely to act conservatively and under-commit, leaving the company exposed to higher on-demand rates.

Cloud discount instruments, like Savings Plans and Reserved Instances, offer considerable relief from on-demand rates. Managing them involves coordinating the right mix of instruments, which:

  • Are purchased at different times.
  • Cover different resources and time frames.
  • Expire at different times.
  • Must keep pace as the resource footprint or architecture changes.

Managing discount instruments is a dynamic process involving a lot of moving parts. It’s extremely challenging to coordinate these elements manually. That’s why autonomous discount management successfully produces consistent, incremental savings outcomes, an approach we’ll cover later in this guide.
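
Even checking how well existing commitments are being used is a manual, multi-command exercise. Here is a minimal sketch using the AWS CLI Cost Explorer commands, with an illustrative date range:

# Savings Plans utilization and coverage for the previous month
aws ce get-savings-plans-utilization \
    --time-period Start=2023-08-01,End=2023-09-01
aws ce get-savings-plans-coverage \
    --time-period Start=2023-08-01,End=2023-09-01

# Reserved Instance utilization over the same window
aws ce get-reservation-utilization \
    --time-period Start=2023-08-01,End=2023-09-01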

Prioritizes Waste Reduction

Engineering teams naturally focus on reducing waste because it’s familiar ground and work they are already doing. However, it takes time and effort to identify and actively manage waste, which is constantly being created.

And, surprisingly, reducing waste in the cloud doesn’t affect savings as much or as fast as optimizing discount instruments, also called rate optimization. Consider waste reduction No. 3 on the list of optimization priorities.

The reality of manual DIY optimization is more challenging than many engineering teams expect. Software alleviates some of these pain points.

Cloud Cost Management Software Generates Optimization Recommendations. Is That Enough?

Most cloud optimization software helps teams understand opportunities for optimization. These tools collect, centralize and present data:

  • Across accounts and organizations.
  • From multiple data sources, including native tools like the AWS Cost and Usage Report and Amazon CloudWatch, as well as third parties.
  • Through a single customizable pane of glass for various stakeholders.

These solutions centralize reporting so that stakeholders have a more comprehensive, accurate view of the environment and potential inefficiencies.

Recommendations-Based Software Challenges

It’s Still a Manual Process

The problem is that although engineering teams are given directives on optimization, they still have to vet each recommendation and then manually do the optimization work.

Engineers have to validate the recommendation based on software and other requirements, which could involve coordination between various teams. And once a decision has been made, the change is usually implemented manually.

Limited Savings Potential

Most solutions recommend one discount instrument as part of their optimization recommendation.

Without a strategic mix of discount instruments or flexibility to accommodate usage as it scales up and down or scales horizontally, companies will end up with suboptimal savings outcomes.

Risk and Savings Loss

The instruments that offer the best cloud discounts also require a one-year or three-year commitment. Even with the data assist from software, internal and external factors make it challenging to predict usage patterns over one-year or three-year terms.

As a result, engineering teams may unintentionally under- or over-commit, both of which result in suboptimal savings.

Neither DIY Method Solves the Internal Coordination ‘Problem’

Operating successfully — and cost-effectively — in the cloud involves ongoing communication, education and coordination between engineers (who manage resources) and the budget holder (who may serve in IT, finance or FinOps).

This is the case whether teams are using manual DIY methods or recommendation software. The coordination looks like this:

  • The budget holder sees an optimization recommendation
  • The engineer validates if it is viable
  • If the recommendation involves resources, the engineer takes action to optimize
  • If the recommendation involves cloud rates, the budget holder or FinOps team takes action to optimize

Both DIY methods involve a lot of back-and-forth exchanges and manual effort to perform the optimization.

It’s possible to consolidate steps in this process (eliminating the back-and-forth) and shift the conversations from tactical to strategic with autonomous rate optimization.

FinOps teams, in particular, thrive with autonomous rate optimization. Because the financial aspect of cloud operations is being managed using sophisticated algorithms, cloud savings are actively realized in a hands-free manner.

With costs under control, engineers can focus on innovation and resource optimization as it fits into their project schedule.

FinOps teams, too, are freed up to address other top FinOps challenges and plans related to FinOps maturity.

Rate Optimization Provides Faster Cloud Savings Than DIY

In seeking DIY methods of cloud cost optimization, companies may lay some groundwork in understanding cloud costs, but they aren’t likely to find impactful savings. This is largely because there is still manual work associated with both approaches and resource optimization is the focus.

Resource optimization is important, but rate optimization is an often overlooked approach that yields more efficient, impactful cloud savings. We will explore rate and resource optimization further in this guide and explain why the order of optimization action makes a major difference in savings outcomes.

The post Engineer’s Guide to Cloud Cost Optimization: Manual DIY Optimization appeared first on The New Stack.

]]>
How to Use the Docker exec Command https://thenewstack.io/how-to-use-the-docker-exec-command/ Sat, 23 Sep 2023 13:00:54 +0000 https://thenewstack.io/?p=22718222

For those who are just getting started on your journey with Docker containers, there is much to learn. Beyond pulling

The post How to Use the Docker exec Command appeared first on The New Stack.

]]>

For those who are just getting started on your journey with Docker containers, there is much to learn. Beyond pulling images and deploying basic containers, one of the first things you’ll want to understand is the exec command.

Essentially, the exec command allows you to run commands within an already deployed container, and it lets you do so in two different ways: from inside or from outside the container. I’m going to show you how to do both. In the end, you’ll be better prepared to interact with your running Docker containers.

What You’ll Need

The only thing you’ll need for this is a running instance of the Docker runtime engine installed on a supported platform. I’ll demonstrate on Ubuntu Server 22.04.

In case you don’t have Docker installed, let’s take care of that first. If you already have Docker installed, go ahead and jump to the next section.

Install Docker

Before you can install Docker on Ubuntu Server, you must first add the official Docker GPG key with the command:

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg


You’ll be prompted for your sudo password.

Once the GPG key is successfully added, create the necessary Docker repository with the following command:
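
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null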

Install a few dependencies with this command:

sudo apt-get install apt-transport-https ca-certificates curl gnupg lsb-release -y


Update apt with:

sudo apt-get update


Install Docker with the command:

sudo apt-get install docker-ce docker-ce-cli containerd.io -y


Next, you must add your user to the docker group with the command:

sudo usermod -aG docker $USER


Log out and log back in so the changes take effect.

Congrats, Docker is now ready to go.

Deploy a Test Container

To use the exec command, we first must deploy a simple test container. For that, we’ll use the tried-and-true NGINX and deploy a container with the command:

docker run --name docker-nginx -p 8080:80 -d nginx


After running the command, Docker should report back the ID of the container. If you miss that, you can always view it with:

docker ps


You’ll only need the first four characters of the ID.

Access the Running Container’s Shell

Now, we can access the running container’s shell, which will allow you to run commands from inside the container. This is done with the exec command like so:

docker exec -it ID /bin/bash


Where ID is the first four characters of the running container’s ID. You should then find yourself at the running container’s bash prompt. Let’s say you want to upgrade the software. You can do so with the commands:

apt-get update
apt-get upgrade -y


After the upgrade completes, you can exit the shell with the command:

exit


Let’s simplify the process.

Run a Command from Outside the Container

Thanks to the exec command, you don’t have to first access the container’s shell before running a command. Instead, you can send the command inside the container. Let’s stick with our example of updating and upgrading the running NGINX container. Again, we’ll need the container ID for this command.

To update and upgrade the software for our NGINX container (without first accessing the container), the command would be:

docker exec ID sh -c "apt-get update && apt-get upgrade -y"


Where ID is the first four characters of the container ID.

The use of the && operator is common in Linux and makes it possible to daisy-chain commands so they run one after another. Wrapping the chain in sh -c ensures that both commands execute inside the container, rather than the second command running on your host.

You can use this method to run just about any command. For example, you could view the index.html file used by NGINX with the command:

docker exec ID cat /usr/share/nginx/html/index.html


Where ID is the first four characters of the container ID.

Let’s copy a new index.html file into the running container and then use exec to view it. Create the new file on your host with:

nano index.html


In that file, paste the following contents:
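
Hello, New Stack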

Save and close the file. Next, copy the file to the running NGINX container with the command:

docker cp index.html ID:/usr/share/nginx/html/


Where ID is the ID of the running container.

Now, we can view the contents of the file with:

docker exec ID cat /usr/share/nginx/html/index.html


The output should simply be:

Hello, New Stack


And that’s how you use the Docker exec command. With this tool, you can better (and more efficiently) manage your running Docker containers.

The post How to Use the Docker exec Command appeared first on The New Stack.

]]>
3 Tips to Secure Your Cloud Infrastructure and Workloads https://thenewstack.io/3-tips-to-secure-your-cloud-infrastructure-and-workloads/ Fri, 22 Sep 2023 19:05:37 +0000 https://thenewstack.io/?p=22705699

As companies move to the cloud for benefits like efficiency and scalability, it is the job of security teams to

The post 3 Tips to Secure Your Cloud Infrastructure and Workloads appeared first on The New Stack.

]]>

As companies move to the cloud for benefits like efficiency and scalability, it is the job of security teams to enable them to do so safely.

In this reality, it is vital that IT leaders understand how threat actors are targeting their cloud infrastructure. As one might suspect, attackers first go after low-hanging fruit — the systems and applications that are the easiest to exploit.

In the 2023 CrowdStrike Global Threat Report, our researchers noted that adversaries:

  • Target neglected cloud infrastructure slated for retirement that still contains sensitive data.
  • Use a lack of outbound restrictions and workload protection to exfiltrate data.
  • Leverage common cloud services as a way to obfuscate malicious activity.

Neglected or Misconfigured Cloud Infrastructure

Neglected and soon-to-be-retired infrastructure are prime targets for attackers, often because that infrastructure no longer receives security configuration updates and regular maintenance. Security controls such as monitoring, expanded logging, security architecture and planning, and posture management no longer exist for these assets.

Lack of Outbound Restrictions and Container Life Cycle Security

Unfortunately, we still see cases where neglected cloud infrastructure contains critical business data and systems. As a result, attacks have led to sensitive data leaks that required costly investigation and reporting obligations. Additionally, some attacks on abandoned cloud environments resulted in impactful service outages, since those environments still provided critical services that hadn’t been fully transitioned to new infrastructure. Moreover, the triage, containment and recovery from incidents in these environments had a tremendous negative impact on some organizations.

Launching Attacks from the Cloud

Not only are attackers targeting cloud infrastructure, but we also observed threat actors leveraging the cloud to make their attacks more effective. Over the past year, threat actors used well-known cloud services, such as Microsoft Azure, and data storage syncing services, such as MEGA, to exfiltrate data and proxy network traffic. A lack of outbound restrictions combined with a lack of workload protection allowed threat actors to interact with local services over proxies to IP addresses in the cloud. This gave attackers additional time to interrogate systems and exfiltrate data from services ranging from partner-operated, web-based APIs to databases — all while appearing to originate from inside victims’ networks. These tactics allowed attackers to dodge detection by barely leaving a trace on local file systems.

So How Do I Protect My Cloud Environment?

The cloud introduces new wrinkles to proper protection, and not all of them translate exactly from a traditional on-premises data center model. Security teams should keep the following firmly in mind as they strive to remain grounded in best practices.

  • Enable runtime protection to obtain real-time visibility. You can’t protect what you don’t have visibility into, even if you have plans to decommission the infrastructure. Central to securing your cloud infrastructure to prevent a breach is runtime protection and visibility provided by cloud workload protection (CWP). It remains critical to protect your workloads with next-generation endpoint protection, including servers, workstations and mobile devices, regardless of whether they reside in an on-premises data center, virtual cluster or hosted in the cloud.
  • Eliminate configuration errors. The most common root cause of cloud intrusions continues to be human errors and omissions introduced during common administrative activities. It’s important to set up new infrastructure with default patterns that make secure operations easy to adopt. One way to do this is to use a cloud account factory to create new sub-accounts and subscriptions easily. This strategy ensures that new accounts are set up in a predictable manner, eliminating common sources of human error. Also, make sure to set up roles and network security groups that keep developers and operators from needing to build their own security profiles and accidentally doing it poorly.
  • Leverage a cloud security posture management (CSPM) solution. Ensure your cloud account factory includes enabling detailed logging and a CSPM — like the security posture included in CrowdStrike Falcon Cloud Security — with alerting to responsible parties including cloud operations and security operations center (SOC) teams. Actively seek out unmanaged cloud subscriptions, and when found, don’t assume it’s managed by someone else. Instead, ensure that responsible parties are identified and motivated to either decommission any shadow IT cloud environments or bring them under full management along with your CSPM. Then use your CSPM on all infrastructure up until the day the account or subscription is fully decommissioned to ensure that operations teams have continuous visibility.

Because the cloud is dynamic, so too must be the tools used to secure it. The visibility needed to see the type of attack that traverses from an endpoint to different cloud services is not possible with siloed security products that only focus on a specific niche. However, with a comprehensive approach rooted in visibility, threat intelligence and threat detection, organizations can give themselves the best opportunity to leverage the cloud without sacrificing security.

The post 3 Tips to Secure Your Cloud Infrastructure and Workloads appeared first on The New Stack.

]]>
The Pillars of Platform Engineering: Part 3 — Provisioning https://thenewstack.io/the-pillars-of-platform-engineering-part-3-provisioning/ Fri, 22 Sep 2023 13:22:47 +0000 https://thenewstack.io/?p=22718670

This guide outlines the workflows and checklists for the six primary technical areas of developer experience in platform engineering. Published

The post The Pillars of Platform Engineering: Part 3 — Provisioning appeared first on The New Stack.

]]>

This guide outlines the workflows and checklists for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part three will address infrastructure provisioning. The other parts of the guide are listed below, and you can download the full PDF version for the complete set of guidance, outlines and checklists.

  1.   Security (includes introduction)
  2.   Pipeline (VCS, CI/CD)
  3.   Provisioning
  4.   Connectivity
  5.   Orchestration
  6.   Observability (includes conclusion and next steps)

In the first two pillars, a platform team provides self-service VCS and CI/CD pipeline workflows with security workflows baked in to act as guardrails from the outset. These are the first steps for software delivery. Now that you have application code to run, where will you run it?

Every IT organization needs an infrastructure plan at the foundation of its applications, and platform teams need to treat that plan as the foundation of their initiatives. Their first goal is to eliminate ticket-driven workflows for infrastructure provisioning, which aren’t scalable in modern IT environments. Platform teams typically achieve this goal by providing a standardized shared infrastructure provisioning service with curated self-service workflows, tools and templates for developers. Then they connect those workflows with the workflows of the first two pillars.

Building an effective modern infrastructure platform hinges on the adoption of Infrastructure as Code. When infrastructure configurations and automations are codified, even the most complex provisioning scenarios can be automated. The infrastructure code can then be version controlled for easy auditing, iteration and collaboration. There are a few solutions for adopting Infrastructure as Code, but the most common is Terraform: a provisioning solution that is more widely used than competing tools by a wide margin.

Terraform is the most popular choice for organizations adopting Infrastructure as Code because of its large integration ecosystem. This ecosystem helps platform engineers meet the final major requirement for a provisioning platform: extensibility. An extensive plugin ecosystem allows platform engineers to quickly adopt new technologies and services that developers want to deploy, without having to write custom code.
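
In day-to-day use, that provisioning workflow typically reduces to a short sequence of Terraform CLI steps, sketched below; the working directory, backend and variables are assumed to already be configured:

terraform init                                  # download providers/modules and configure remote state
terraform fmt -check && terraform validate      # enforce formatting and catch configuration errors early
terraform plan -out=tfplan                      # preview the proposed changes as a saved plan
terraform apply tfplan                          # apply exactly the plan that was reviewed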

Provisioning: Modules and Images

Building standardized infrastructure workflows requires platform teams to break down their infrastructure into reusable, and ideally immutable, components. Immutable infrastructure is a common standard in modern IT that reduces complexity and simplifies troubleshooting while also improving reliability and security.

Immutability means deleting and re-provisioning infrastructure for all changes, which minimizes server patching and configuration changes, helping to ensure that every service iteration initiates a new tested and up-to-date instance. It also forces runbook validation and promotes regular testing of failover and canary deployment exercises. Many organizations put immutability into practice by using Terraform, or another provisioning tool, to build and rebuild large swaths of infrastructure by modifying configuration code. Some also build golden image pipelines, which focus on building and continuous deployment of repeatable machine images that are tested and confirmed for security and policy compliance (golden images).

Along with machine images, modern IT organizations are modularizing their infrastructure code to compose commonly used components into reusable modules. This is important because a core principle of software development is the concept of not “reinventing the wheel,” and it applies to infrastructure code as well. Modules create lightweight abstractions to describe infrastructure in terms of architectural principles, rather than discrete objects. They are typically managed through version control and interact with third-party systems, such as a service catalog or testing framework.

High-performing IT teams bring together golden image pipelines and their own registry of modules for developers to use when building infrastructure for their applications. With little knowledge required about the inner workings of this infrastructure and its setup, developers can use infrastructure modules and golden image pipelines in a repeatable, scalable and predictable workflow that has security and company best practices built in on the first deployment.

Workflow: Provisioning Modules and Images

A typical provisioning workflow will follow these six steps:

  1. Code: A developer commits code and submits a task to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: CI/CD-automated workflow to build modules, artifacts, images and/or other infrastructure components.
  5. Response: The response (success/failure and metadata) is passed to the CI/CD platform.
  6. Output: The infrastructure components such as modules, artifacts and image configurations are deployed or stored.

Module- and image-provisioning flow

Provisioning: Policy as Code

Agile development practices have shifted the focus of infrastructure provisioning from an operations problem to an application-delivery expectation. Infrastructure provisioning is now a gating factor for business success. Its value is aligned around driving organizational strategy and the customer mission, not purely based on controlling operational expenditures.

In shifting to an application-delivery expectation, we need to shift workflows and processes. Historically, operations personnel applied workflows and controls to the provisioning process through tickets. These tickets usually involved validating access, approvals, security, costs and so on. The whole process was also audited for compliance and control practices.

This process now must change to enable developers and other platform end users to provision via a self-service workflow. This means that a new set of codified security controls and guardrails must be implemented to satisfy compliance and control practices.

Within cloud native systems, these controls are implemented via policy as code. Policy as code is a practice that uses programmable rules and conditions for software and infrastructure deployment that codify best practices, compliance requirements, security rules and cost controls.

Some tools and systems include their own policy system, but there are also higher-level policy engines that integrate with multiple systems. The fundamental requirement is that these policy systems can be managed as code and will provide evaluations, controls, automation and feedback loops to humans and systems within the workflows.

Implementing policy as code helps shift workflows “left” by providing feedback to users earlier in the provisioning process and enabling them to make better decisions faster. But before they can be used, these policies need to be written. Platform teams should own the policy-as-code practice, working with security, compliance, audit and infrastructure teams to ensure that policies are mapped properly to risks and controls.
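
As one illustration, an open source policy engine such as Open Policy Agent, driven by the Conftest CLI, can evaluate a Terraform plan before it is applied. A minimal sketch follows; the policy/ directory and the Rego rules inside it are assumptions, not part of any specific product:

# Render the Terraform plan as JSON so the policy engine can inspect it
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json

# Evaluate the plan against the organization's Rego policies; a failed test blocks the pipeline
conftest test tfplan.json --policy policy/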

Workflow: Policy as Code

Implementing policy-as-code checks in an infrastructure-provisioning workflow typically involves five steps:

  1. Code: The developer commits code and submits a task to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. IdP response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Request: The provisioner runs the planned change through a policy engine and the request is either allowed to go through (sometimes with warnings) or rejected if the code doesn’t pass policy tests.
  5. Response: A metadata response packet is sent to CI/CD and to external systems from there, such as security scanning or integration testing.

Provisioning flow with policy as code

Provisioning Requirements Checklist

Successful self-service provisioning of infrastructure requires:

  • A consolidated control and data plane for end-to-end automation
  • Automated configuration (infrastructure as code, runbooks)
  • Predefined and fully configurable workflows
  • Native integrations with VCS and CI/CD tools
  • Support for a variety of container and virtual machine images required by the business
  • Multiple interfaces for different personas and workflows (GUI, API, CLI, SDK)
  • Use of a widely adopted Infrastructure-as-Code language — declarative language strongly recommended
  • Compatibility with industry-standard testing and security frameworks, data management (encryption) and secrets management tools
  • Integration with common workflow components such as notification tooling and webhooks
  • Support for codified guardrails, including:
    • Policy as code: Built-in policy-as-code engine with extensible integrations
    • RBAC: Granularly scoped permissions to implement the principle of least privilege
    • Token-based access credentials to authenticate automated workflows
    • Prescribed usage of organizationally approved patterns and modules
  • Integration with trusted identity providers with single sign on and RBAC
  • Maintenance of resource provisioning metadata (state, images, resources, etc.):
    • Controlled via deny-by-default RBAC
    • Encrypted
    • Accessible to humans and/or machines via programmable interfaces
    • Stored with logical isolation maintained via traceable configuration
  • Scalability across large distributed teams
  • Support for both public and private modules
  • Full audit logging and log-streaming capabilities
  • Financial operations (FinOps) workflows to enforce cost-based policies and optimization
  • Well-defined documentation and developer enablement
  • Enterprise support based on an SLA (e.g., 24/7/365)

Stay tuned for our post on the fourth pillar of platform engineering: connectivity. Or download the full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists.

The post The Pillars of Platform Engineering: Part 3 — Provisioning appeared first on The New Stack.

]]>
A Practical Step-by-Step Approach to Building a Platform https://thenewstack.io/a-practical-step-by-step-approach-to-building-a-platform/ Thu, 21 Sep 2023 17:41:38 +0000 https://thenewstack.io/?p=22718661

In my previous article, I discussed the concept of a platform in the context of cloud native application development. In

The post A Practical Step-by-Step Approach to Building a Platform appeared first on The New Stack.

]]>

In my previous article, I discussed the concept of a platform in the context of cloud native application development. In this article, I will dig into the journey of a platform engineering team and outline a step-by-step approach to building such a platform. It is important to note that building a platform should be treated no differently than building any other product, as the platform is ultimately developed for internal users.

Therefore, all the software development life cycle (SDLC) practices and methodologies typically employed in product development are equally applicable to platform building. This includes understanding end users’ pain points and needs, assembling a dedicated team with a product owner, defining a minimum viable product (MVP), devising an architecture/design, implementing and testing the platform, deploying it and ensuring its continuous evolution beyond the MVP stage.

Step 1: Define Clear Goals

Before starting to build a platform, it is important to determine if the organization actually needs one and what is driving the need for it. Additionally, it is crucial to establish clear goals for the platform and define criteria for measuring its success. Identifying the specific business goals and outcomes that the platform will address is essential to validate its necessity.

While the benefits of reducing cognitive load for developers, providing self-serve infrastructure and improving the developer experience are obvious, it is important to understand the organization’s unique challenges and pain points and how the platform can address them. Some common business goals include the following:

  • Accelerating application modernization through shared Kubernetes infrastructure.
  • Reducing costs by consolidating infrastructure and tools.
  • Addressing skill-set gaps through automation and self-serve infrastructure.
  • Improving product delivery times by reducing developer toil.

Step 2: Discover Landscape and Identify Use Cases

Once platform teams establish high-level business goals, the next step in the platform development process is to understand the current technology landscape of the organization. Platform teams must develop a thorough understanding of their existing infrastructure and their future infrastructure needs, applications, services, frameworks and tools. Platform teams must also understand how their internal teams are structured, their skills in using frameworks like Terraform, the SDLC tools, etc. This can be done via a series of discovery calls and user interviews with different application teams/business units, inventory audits and interviews with potential platform users.

Through the discovery process, platform teams must identify the challenges that the internal teams face with the current services and tools, deriving the use cases for the platform based on the pain points of the internal users. The use cases can be as simple as creating self-serve development environments to more complex use cases like a single pane of glass administration for infrastructure management and application deployment. The following are several discovery items:

  • Current infrastructure (e.g., public clouds, private clouds)
  • Kubernetes distributions in use (Amazon EKS, AKS, GKE, upstream Kubernetes)
  • Managed services (databases, storage, registry, etc.)
  • CI/CD methodologies currently in use
  • Security tools
  • SDLC tools
  • Internal teams and their structure for implementing RBAC, clear isolation boundaries and team-specific workflows
  • HA/DR requirements
  • Applications, services in use, common frameworks and technology stacks (Python, Java, Go, React.Js, etc.) to create standard templates, catalogs and documentation

Step 3: Define the Product Roadmap

The use cases gathered during the discovery process should be considered to create a roadmap for the platform. This roadmap should outline the MVP requirements necessary to build an initial platform that can demonstrate its value. Platform teams may initially focus on one or two use cases, prioritizing those potentially benefiting a larger group of internal users.

It is recommended to start by piloting the MVP with a small group of internal users, application teams or business units to gather feedback and make improvements. As the platform becomes more robust, it can be expanded to serve a broader range of users and address additional use cases. The following are several example user stories from cloud native application development projects:

  • As a developer, I want to create a CI pipeline to compile my code and create artifacts. (CI as a Service and Registry as a Service)
  • As a developer, I want to create a sandbox environment and deploy my application to the sandbox for testing. (Environment as a Service)
  • As a developer, I want to deploy my applications into Kubernetes clusters. (Deployment as a Service)
  • As a developer, I want access to application logs and metrics to troubleshoot product issues.
  • As an SRE, I want to create and manage cloud environments and Kubernetes clusters compliant with my organization’s security and governance policies.
  • As a FinOps, I want to create chargeback reports and allocate costs to various business units. (Cost management as a Service)
  • As a security engineer, I want to consistently apply network security and OPA policies across the Kubernetes infrastructure. I also want to see policy violations and access logs in the central SIEM platform. (Network and OPA policy management as a Service)

Step 4: Build the Platform

Building the platform involves developing the automation backend to provide the infrastructure, services and tools that internal users need in a self-serve manner. The self-serve interface can vary from Jenkins pipelines to Terraform modules to Backstage IDP to a custom portal.

The backend involves automating tasks such as creating cloud environments, provisioning Kubernetes clusters, creating Kubernetes namespaces, deploying workloads in Kubernetes, viewing application logs, metrics, etc. Care must be taken to apply the organization’s security, governance and compliance policies as platform teams automate these tasks. The following simple technology stack is assumed for the example organization:

  • Infrastructure: AWS
  • Kubernetes: AWS EKS
  • Registry: AWS ECR
  • CI/CD: GitLab for CI and ArgoCD for application deployment
  • Databases: AWS RDS Postgres, Amazon ElastiCache for Redis
  • Observability: AWS OpenSearch, Prometheus and Grafana for metrics, OpsGenie for alerts
  • Security: Okta for SSO, Palo Alto Prisma Cloud

The example organization runs workloads in the AWS cloud. All stateless application workloads are containerized and run in Amazon EKS clusters. Workloads use AWS RDS Postgres for the database and Amazon ElastiCache (Redis) for the cache. The initial user stories are:

  • Create an AWS environment that includes a separate AWS account, a VPC, an IAM role, security groups, AWS RDS Postgres and Amazon ElastiCache.
  • Create an EKS cluster with add-ons required for security, governance and compliance.
  • Download Kubeconfig file.
  • Create a Kubernetes namespace.
  • Deploy workload.

Using Backstage as the developer portal and Rafay backstage plugins as the automation backend, the following are the high-level steps to build the self-serve platform supporting the above use cases:

  • Install the Backstage app and configure Postgres.
  • Configure authentication using Backstage’s auth provider.
  • Set up Backstage catalog to ingest organization data from LDAP.
  • Set up Backstage to load and discover entities using GitHub integration.
  • Create a blueprint in Rafay console to define a baseline set of software components required by the organization (cost profiles, monitoring, ingress controllers, network security and OPA policies, etc.).
  • Install Rafay frontend and backend plugins in the Backstage app.
  • Use template actions provided by the Rafay backend plugin to add software templates for creating services.
    • Create a Cluster template with ‘rafay:create-cluster’ action and provide the blueprint and other configuration from user input or by defining defaults in cluster-config.yaml.
    • Create Namespace and Workload templates using ‘rafay:create-namespace’ and ‘rafay:create-workload’ actions.
  • Import UI widgets from the Rafay frontend plugin to create component pages for services and resources developed through templates (EntityClusterInfo, EntityClusterPodList, EntityNamespaceInfo, EntityWorkloadInfo, etc.).

The screens in the backstage developer portal look like the following after the implementation:

While this is a simple representation of a platform built using Backstage and Rafay backstage plugins, the actual platform may need to solve for many other use cases, which may require a larger effort. Similarly, platform teams may use some other interface and automation backend for building the platform.

Treat the Platform as a Product

When embarking on the journey of building a platform, it is essential to treat the platform as a product and follow a systematic approach similar to any other product development. The first step is to invest time in thoroughly discovering and understanding the organization’s technological landscape, identifying current pain points and gathering requirements from internal users. Based on these findings, a roadmap for the platform should be defined, setting clear milestones and establishing success criteria for each milestone.

Building such a platform requires consideration of various factors, including current and future infrastructure needs, application deployment, security, operating models, cost management, developer experience, and shared services and tools. Conducting a build versus buy analysis helps determine which parts of the platform should be built internally and which open source and commercial tools can be leveraged. Most platforms ultimately use all of these components. It is crucial to treat internal users as the platform’s customers, continuously seeking their feedback and iteratively improving the platform to ensure its success.

The post A Practical Step-by-Step Approach to Building a Platform appeared first on The New Stack.

]]>
The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline https://thenewstack.io/the-6-pillars-of-platform-engineering-part-2-ci-cd-vcs-pipeline/ Thu, 21 Sep 2023 15:07:38 +0000 https://thenewstack.io/?p=22718638

This guide outlines the workflows and steps for the six primary technical areas of developer experience in platform engineering. Published

The post The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline appeared first on The New Stack.

]]>

This guide outlines the workflows and steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, part one introduced the series and focused on security. Part two will cover the application deployment pipeline. The other parts of the guide are listed below, and you can also download the full PDF version for the complete set of guidance, outlines and checklists.

  1. Security (includes introduction)
  2. Pipeline (VCS, CI/CD)
  3. Provisioning
  4. Connectivity
  5. Orchestration
  6. Observability (includes conclusion and next steps)

Platform Pillar 2: Pipeline

One of the first steps in any platform team’s journey is integrating with and potentially restructuring the software delivery pipeline. That means taking a detailed look at your organization’s version control systems (VCS) and continuous integration/continuous deployment (CI/CD) pipelines.

Many organizations have multiple VCS and CI/CD solutions in different maturity phases. These platforms also evolve over time, so a component-based API platform or catalog model is recommended to support future extensibility without compromising functionality or demanding regular refactoring.

In a cloud native model, infrastructure and configuration are managed as code, and therefore a VCS is required for this core function. Using a VCS and managing code provide the following benefits:

  • Consistency and standardization
  • Agility and speed
  • Scalability and flexibility
  • Configuration as documentation
  • Reusability and sharing
  • Disaster recovery and reproducibility
  • Debuggability and auditability
  • Compliance and security

VCS and CI/CD enable interaction and workflows across multiple infrastructure systems and platforms, which requires careful assessment of all the VCS and CI/CD requirements listed below.

Workflow: VCS and CI/CD

A typical VCS and CI/CD workflow should follow these five steps (a minimal scripted sketch of the triggered tasks follows the list):

  1. Code: The developer commits code to the VCS and a task is automatically submitted to the pipeline.
  2. Validate: The CI/CD platform submits a request to your IdP for validation (AuthN and AuthZ).
  3. Response: If successful, the pipeline triggers tasks (e.g., test, build, deploy).
  4. Output: The output and/or artifacts are shared within platform components or with external systems for further processing.
  5. Operate: Security systems may be involved in post-run tasks, such as deprovisioning access credentials.
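
To make step 3 concrete, here is a minimal, tool-agnostic sketch of pipeline tasks defined as code. It assumes a container-based build; the registry, image name, test script and the GIT_COMMIT variable are placeholders that your CI system and repository would supply.

#!/usr/bin/env bash
# Tasks a CI runner might trigger once identity validation succeeds:
# test, build and publish an artifact. All names below are placeholders.
set -euo pipefail

IMAGE="registry.example.com/myapp:${GIT_COMMIT:-dev}"

echo "Running tests..."
./scripts/run-tests.sh        # swap in your stack's test runner

echo "Building artifact..."
docker build -t "$IMAGE" .

echo "Publishing artifact..."
docker push "$IMAGE"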

VCS and CI/CD pipeline flow

VCS and CI/CD Requirements Checklist

Successful VCS and CI/CD solutions should deliver:

  • A developer experience tailored to your team’s needs and modern efficiencies
  • Easy onboarding
  • A gentle learning curve with limited supplementary training needed (leveraging industry-standard tools)
  • Complete and accessible documentation
  • Support for pipeline as code
  • Platform agnosticism (API driven)
  • Embedded expected security controls (RBAC, auditing, etc.)
  • Support for automated configuration (infrastructure as code, runbooks)
  • Support for secrets management, identity and authorization platform integration
  • Encouragement and support for a large partner ecosystem with a broad set of enterprise technology integrations
  • Extended service footprint, with runners to delegate and isolate span of control
  • Enterprise support based on an SLA (e.g., 24/7/365)

Note: VCS and CI/CD systems may have more specific requirements not listed here.

As platform teams select and evolve their VCS and CI/CD solutions, they need to consider what this transformation means for existing/legacy provisioning practices, security and compliance. Teams should assume that building new platforms will affect existing practices, and they should work to identify, collaborate and coordinate change within the business.

Platform teams should also be forward-looking. VCS and CI/CD platforms are rapidly evolving to further abstract away the complexity of the CI/CD process from developers. HashiCorp Waypoint, for example, aims to simplify these workflows by giving developers a consistent way to deploy, manage and observe applications across multiple runtimes, including Kubernetes and serverless environments.

Stay tuned for our post on the third pillar of platform engineering: provisioning. Or download the full PDF version of The 6 Pillars of Platform Engineering for the complete set of guidance, outlines and checklists.

The post The 6 Pillars of Platform Engineering: Part 2 — CI/CD & VCS Pipeline appeared first on The New Stack.

]]>
The 6 Pillars of Platform Engineering: Part 1 — Security https://thenewstack.io/the-6-pillars-of-platform-engineering-part-1-security/ Wed, 20 Sep 2023 18:00:10 +0000 https://thenewstack.io/?p=22718618

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering teams.

The post The 6 Pillars of Platform Engineering: Part 1 — Security appeared first on The New Stack.

]]>

Platform engineering is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering teams. These tools and workflows comprise an internal developer platform, which is often referred to as just “a platform.” The goal of a platform team is to increase developer productivity, facilitate more frequent releases, improve application stability, lower security and compliance risks and reduce costs.

This guide outlines the workflows and checklist steps for the six primary technical areas of developer experience in platform engineering. Published in six parts, this part, part one, introduces the series and focuses on security. (Note: You can download a full PDF version of the six pillars of platform engineering for the complete set of guidance, outlines and checklists.)

Platform Engineering Is about Developer Experience

The solutions engineers and architects I work with at HashiCorp have supported many organizations as they scale their cloud operating model through platform teams, and the key for these teams to meet their goals is to provide a satisfying developer experience. We have observed two common themes among companies that deliver great developer experiences:

  1. Standardizing on a set of infrastructure services to reduce friction for developers and operations teams: This empowers a small, centralized group of platform engineers with the right tools to improve the developer experience across the entire organization, with APIs, documentation and advocacy. The goal is to reduce tooling and process fragmentation, resulting in greater core stability for your software delivery systems and environments.
  2. A Platform as a Product practice: Heritage IT projects typically have a finite start and end date. That’s not the case with an internal developer platform. It is never truly finished. Ongoing tasks include backlog management, regular feature releases and roadmap updates to stakeholders. Think in terms of iterative agile development, not big upfront planning like waterfall development.

No platform should be designed in a vacuum. A platform is effective only if developers want to use it. Building and maintaining a platform involves continuous conversations and buy-in from developers (the platform team’s customers) and business stakeholders. This guide functions as a starting point for those conversations by helping platform teams organize their product around six technical elements or “pillars” of the software delivery process along with the general requirements and workflow for each.

The 6 Pillars of Platform Engineering

What are the specific building blocks of a platform strategy? In working with customers in a wide variety of industries, the solutions engineers and architects at HashiCorp have identified six foundational pillars that comprise the majority of platforms, and each one will be addressed in a separate article:

  1. Security
  2. Pipeline (VCS, CI/CD)
  3. Provisioning
  4. Connectivity
  5. Orchestration
  6. Observability

Platform Pillar 1: Security

The first questions developers ask when they start using any system are: “How do I create an account? Where do I set up credentials? How do I get an API key?” Even though version control, continuous integration and infrastructure provisioning are fundamental to getting a platform up and running, security also should be a first concern. An early focus on security promotes a secure-by-default platform experience from the outset.

Historically, many organizations invested in network perimeter-based security, often described as a “castle-and-moat” security approach. As infrastructure becomes increasingly dynamic, however, perimeters become fuzzy and challenging to control without impeding developer velocity.

In response, leading companies are choosing to adopt identity-based security, identity-brokering solutions and modern security workflows, including centralized management of credentials and encryption methodologies. This promotes visibility and consistent auditing practices while reducing operational overhead in an otherwise fragmented solution portfolio.

Leading companies have also adopted "shift-left" security: implementing security controls throughout the software development lifecycle, which leads to earlier detection and remediation of potential attack vectors and increased vigilance around control implementations. This approach demands automation by default instead of ad hoc enforcement.

Enabling this kind of DevSecOps mindset requires tooling decisions that support modern identity-driven security. There also needs to be an "as code" implementation paradigm to avoid ascribing and authorizing identity based on ticket-driven processes. That paves the way for traditional privileged access management (PAM) practices to embrace modern methodologies like just-in-time (JIT) access and zero-trust security.

Identity Brokering

In a cloud operating model approach, humans, applications and services all present an identity that can be authenticated and validated against a central, canonical source. A multi-tenant secrets management and encryption platform along with an identity provider (IdP) can serve as your organization’s identity brokers.

Workflow: Identity Brokering

In practice, a typical identity brokering workflow might look something like this:

  1. Request: A human, application, or service initiates interaction via a request.
  2. Validate: One (or more) identity providers validate the provided identity against one (or more) sources of truth/trust.
  3. Response: An authenticated and authorized validation response is sent to the requestor.

Identity Brokering Requirements Checklist

Successful identity brokering has a number of prerequisites:

  • All humans, applications and services must have a well-defined form of identity.
  • Identities can be validated against a trusted IdP.
  • Identity systems must be interoperable across multi-runtime and multicloud platforms.
  • Identity systems should be centralized or have limited segmentation in order to simplify audit and operational management across environments.
  • Identity and access management (IAM) controls are established for each IdP.
  • Clients (humans, machines and services) must present a valid identity for AuthN and AuthZ.
  • Once verified, access is brokered through deny-by-default policies to minimize impact in the event of a breach.
  • AuthZ review is integrated into the audit process and, ideally, is granted just in time.
    • Audit trails are routinely reviewed to identify excessively broad or unutilized privileges and are retroactively analyzed following threat detection.
    • Historical audit data provides non-repudiation and compliance for data storage requirements.
  • Fragmentation is minimized with a flexible identity brokering system supporting heterogeneous runtimes, including:
    • Platforms (VMware, Microsoft Azure VMs, Kubernetes/OpenShift, etc.)
    • Clients (developers, operators, applications, scripts, etc.)
    • Services (MySQL, MSSQL, Active Directory, LDAP, PKI, etc.)
  • Enterprise support 24/7/365 via a service level agreement (SLA)
  • Configured through automation (infrastructure as code, runbooks)

Access Management: Secrets Management and Encryption

Once identity has been established, clients expect consistent and secure mechanisms to perform the following operations:

  • Retrieving a secret (a credential, password, key, etc.)
  • Brokering access to a secure target
  • Managing secure data (encryption, decryption, hashing, masking, etc.)

These mechanisms should be automatable — requiring as little human intervention as possible after setup — and promote compliant practices. They should also be extensible to ensure future tooling is compatible with these systems.

Workflow: Secrets Management and Encryption

A typical secrets management workflow should follow five steps (a brief CLI sketch follows the list):

  1. Request: A client (human, application or service) requests a secret.
  2. Validate: The request is validated against an IdP.
  3. Request: The secret request is served directly if the secret is managed by the platform. Alternatively:
    1. The platform requests a temporary credential from a third party.
    2. The third-party system responds to the brokered request with a short-lived secret.
  4. Broker response: The initial response passes through an IAM cryptographic barrier for offload or caching.
  5. Client response: The final response is provided back to the requestor.
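
As one concrete illustration of this flow, the sketch below uses HashiCorp Vault's CLI to authenticate against an IdP and request a short-lived database credential brokered from a third-party system. The OIDC method and the database/creds/app-readonly path are illustrative; your auth methods, mounts and role names will differ.

# Steps 1-2: the client authenticates and the identity is validated via OIDC
vault login -method=oidc

# Steps 3-5: request a credential; the platform brokers a short-lived secret
# from the database and returns it to the client (path and role are examples)
vault read database/creds/app-readonly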

Secrets management flow

Access Management: Secure Remote Access (Human to Machine)

Human-to-machine access in the traditional castle-and-moat model has always been inefficient. The workflow requires multiple identities, planned intervention for AuthN and AuthZ controls, lifecycle planning for secrets and complex network segmentation planning, which creates a lot of overhead.

While PAM solutions have evolved over the last decade to provide delegated solutions like dynamic SSH key generation, this does not satisfy the broader set of ecosystem requirements, including multi-runtime auditability or cross-platform identity management. Introducing cloud architecture patterns such as ephemeral resources, heterogeneous cloud networking topologies, and JIT identity management further complicates the task for legacy solutions.

A modern solution for remote access addresses the challenges of ephemeral resources and the complexities that come with them, such as dynamic resource registration, identity, access and secrets. These modern secure remote access tools no longer rely on VPNs as an initial entry point, nor on CMDBs, bastion hosts, manual SSH and/or secrets managers with check-in/check-out workflows.

Enterprise-level secure remote access tools use a zero-trust model where human users and resources have identities. Users connect directly to these resources. Scoped roles — via dynamic resource registries, controllers, and secrets — are automatically injected into resources, eliminating many manual processes and security risks such as broad, direct network access and long-lived secrets.

Workflow: Secure Remote Access (Human to Machine)

A modern remote infrastructure access workflow for a human user typically follows these eight steps (an example CLI interaction follows the list):

  1. Request: A user requests system access.
  2. Validate (human): Identity is validated against the trusted identity broker.
  3. Validate (to machine): Once authenticated, authorization is validated for the target system.
  4. Request: The platform requests a secret (static or short-lived) for the target system.
  5. Inject secret: The platform injects the secret into the target resource.
  6. Broker response: The platform returns a response to the identity broker.
  7. Client response: The platform grants access to the end user.
  8. Access machine/database: The user securely accesses the target resource via a modern secure remote access tool.
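
HashiCorp Boundary is one example of a tool built around this kind of workflow. Below is a rough sketch of what the user-facing side can look like with its CLI; the auth method and target IDs are placeholders.

# Steps 1-3: authenticate the human user against the trusted identity broker
boundary authenticate oidc -auth-method-id amoidc_1234567890

# Steps 4-8: connect to the target; a scoped credential is brokered and
# injected for the session, so the user never handles a long-lived secret
boundary connect ssh -target-id ttcp_1234567890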

Secure remote access flow

Access Management Requirements Checklist

All secrets in a secrets management system should be:

  • Centralized
  • Encrypted in transit and at rest
  • Limited in scoped role and access policy
  • Dynamically generated, when possible
  • Time-bound (i.e., defined time-to-live — TTL)
  • Fully auditable

Secrets management solutions should:

  • Support multi-runtime, multicloud and hybrid-cloud deployments
  • Provide flexible integration options
  • Include a diverse partner ecosystem
  • Embrace zero-touch automation practices (API-driven)
  • Empower developers and delegate implementation decisions within scoped boundaries
  • Be well-documented and commonly used across industries
  • Be accompanied by enterprise support 24/7/365 based on an SLA
  • Support automated configuration (infrastructure as code, runbooks)

Additionally, systems implementing secure remote access practices should:

  • Dynamically register service catalogs
  • Implement an identity-based model
  • Provide multiple forms of authentication capabilities from trusted sources
  • Be configurable as code
  • Be API-enabled and contain internal and/or external workflow capabilities for review and approval processes
  • Enable secrets injection into resources
  • Provide detailed role-based access controls (RBAC)
  • Provide capabilities to record actions, commands, sessions and give a full audit trail
  • Be highly available, multiplatform, multicloud capable for distributed operations, and resilient to operational impact

Stay tuned for our post on the second pillar of platform engineering: version control systems (VCS) and the continuous integration/continuous delivery (CI/CD) pipeline. Or download a full PDF version of the six pillars of platform engineering for the complete set of guidance, outlines and checklists.

The post The 6 Pillars of Platform Engineering: Part 1 — Security appeared first on The New Stack.

]]>
Linux Foundation to Fork HashiCorp Terraform into ‘OpenTofu’ https://thenewstack.io/linux-foundation-joins-opentf-to-fork-for-terraform-into-opentofu/ Wed, 20 Sep 2023 17:52:37 +0000 https://thenewstack.io/?p=22718721

BILBAO, SPAIN — When HashiCorp opted out of its open source license for the Business Source License (BSL) for its

The post Linux Foundation to Fork HashiCorp Terraform into ‘OpenTofu’ appeared first on The New Stack.

]]>

BILBAO, SPAIN — When HashiCorp swapped the open source license on its flagship infrastructure as code (IaC) tool, Terraform, for the Business Source License (BSL), many expected some opposition from the open source community. But how many expected a full-fledged Terraform fork? Or that it would be the Linux Foundation launching the fork, called OpenTofu?

When OpenTF announced its opposition to Terraform's shift from the Mozilla Public License v2.0 (MPLv2) to the BSL v1.1, many people dismissed its subsequent fork plans as sour grapes, doomed to failure. But with the Linux Foundation behind it, and major Terraform users such as insurance giant Allianz and virtual private network (VPN) provider ExpressVPN announcing they were moving to OpenTofu, the new fork is on its way to success.

OpenTofu promises to be a community-centric, modular, and backward-compatible successor to the MPLv2-licensed Terraform. The project has already garnered support from such IaC companies as Harness, Gruntwork, Spacelift, env0, Scalr, and many more. Over 140 organizations and 600 individuals have pledged their commitment to OpenTofu, ensuring a robust development team for at least the next half-decade.

Jim Zemlin, the Linux Foundation‘s Executive Director, said in his Open Source Summit Europe keynote, “OpenTofu embodies our collective dedication to genuine open collaboration in infrastructure as code. It stands as a testament to our shared goal of delivering dependable, accessible tools for the tech world.” Zemlin added, “OpenTofu’s dedication to open source principles underscores our shared vision of providing accessible, reliable tools that empower the tech community.”

Chris Aniszczyk, CTO of the Cloud Native Computing Foundation (CNCF), said that OpenTofu is looking into joining the CNCF as a sandbox project. Aniszczyk added, "We also look forward to their collaboration with the CNCF community."

Before this, Zemlin said that while HashiCorp’s license change is disturbing, he thinks it’s because “in the last few years, there was a lot of private equity investment in open source. Venture capitalists looked for popular open source projects and started pumping in capital. They used open sources as a way to go to market. This enabled them to sell commercial products off the open source project. Now, most of those companies are maturing, they might have hundreds of millions of dollars in revenue and have grown significantly.” Now they want to cash out, and “open-source is no longer working as an ongoing basis for them. It’s their prerogative to do so.”

Of course, this is bad news for open source projects. Zemlin suggests that open source developers steer clear of single-company open source projects that require contributors to sign over their code's copyright.

Instead, as Yevgeniy (Jim) Brikman, Co-founder and CEO of Gruntwork, and OpenTofu founding team member, said, “We believe that the essential building blocks of the modern internet — tools such as Linux, Kubernetes, and Terraform — must be truly open source. That is the only way to ensure that we are building our industry on top of solid and predictable underpinnings.” Thus, in OpenTF’s specific case, “having this project in the hands of a foundation, rather than a single company, means OpenTofu will be community-driven and truly open source — always.”

Zemlin concluded, “What I think we’re seeing is a good lesson for all of us. The idea of open governance and an open source license is the sweet spot enabling commercial organizations to invest confidently in the codebase.”

The OpenTF (now OpenTofu) founders are gobsmacked by the rapid pace of their project's advancement. As Sebastian Stadil, Scalr CEO, said at the conference keynote, "It's just so hard to believe that all of this started just five weeks ago." Looking ahead, the next major step will be the release of the code itself. The new codebase should appear within the next few weeks.

Oh, and that name? OpenTofu was chosen to avoid any possible legal trademark problems with Terraform. As Stadil explained, “HashiCorp has been super aggressive in sending cease-and-desist out to folks. We thought TF was a little bit too close to Terraform.”

HashiCorp has declined to respond publicly to this matter.

The post Linux Foundation to Fork HashiCorp Terraform into ‘OpenTofu’ appeared first on The New Stack.

]]>
How Apache Flink Delivers for Deliveroo https://thenewstack.io/how-apache-flink-delivers-for-deliveroo/ Wed, 20 Sep 2023 13:54:21 +0000 https://thenewstack.io/?p=22717952

Deliveroo is a leading food delivery company that builds tools and services using Apache Flink. The Flink framework offers a

The post How Apache Flink Delivers for Deliveroo appeared first on The New Stack.

]]>

Deliveroo is a leading food delivery company that builds tools and services using Apache Flink.

The Flink framework offers a distributed processing engine for “stateful computations over unbounded and bounded data streams.”

Deliveroo has a three-sided marketplace. They have delivery drivers, restaurants, and customers who place orders on the application to get food or groceries delivered to their door. Joining us to discuss using Apache Flink and the Amazon Managed Service for Apache Flink were two engineers from Deliveroo: Felix Angell and Duc Anh Khu.

Deliveroo sought to do more real-time streaming. They explored behaviors to understand the customer journey and how those customers use the Deliveroo application.

That meant modernizing to a more stable platform. The old platform could scale up but not down due to earlier decisions made about the technology. They looked at other services such as Apache Spark and Kafka Streams. But Flink had feature parity with the legacy platform, and large companies used Flink for production.

Deliveroo started experimenting with Flink as a proof-of-concept on Kubernetes. They used third-party operators but found many needed more support and were not maintained. They turned to the Amazon Managed Service for Flink (MSF), which allowed the Deliveroo team to focus on its core responsibilities, such as CI/CD and taking updates to production.

Angell said he would like to see more Deliveroo teams using Apache Flink. Duc said the teams move very fast to roll out the latest product, which means their data modeling needs to be flexible enough to adapt to that demand.

Flexibility comes with a cost, he said: sometimes you need to remodel things, and other times you need to normalize the data model. Making that process easier for teams would help.

“And for me, one of the features that we would like to see is a self-serve configuration from MSF, so that we can just tweak some of the low-level configuration as well as auto-scaling requirements based on application metrics,” Duc said.

Learn more about Apache Flink and AWS:

Kinesis, Kafka and Amazon Managed Service for Apache Flink

Apache Flink for Real Time Data Analysis

The post How Apache Flink Delivers for Deliveroo appeared first on The New Stack.

]]>
What to Know about Container Security and Digital Payments https://thenewstack.io/what-to-know-about-container-security-and-digital-payments/ Wed, 20 Sep 2023 13:19:03 +0000 https://thenewstack.io/?p=22718587

Managing containers in the world of digital payments just got a little easier. Now that containers are a preferred option

The post What to Know about Container Security and Digital Payments appeared first on The New Stack.

]]>

Managing containers in the world of digital payments just got a little easier. Now that containers are a preferred option for cloud native architectures, practitioners need guidance to support highly regulated industries, such as financial services and payments. While the PCI guidelines for virtual machines (VMs) are still in use and likely will be for many years, they’ve evolved to include container orchestration. Here’s what you need to know.

New guidance for containers and container orchestration was released by the Payment Card Industry (PCI) Security Standards Council last year. These guidelines are important for any business that processes credit card payments and wants to use containers at scale to support its business goals. The new guidance is part of the PCI Data Security Standards (DSS) 4.0 and is similar to that for VMs, but there are important differences. Any business or practitioner working with a company that takes credit card payments can now reference a set of best practices to help them meet the latest PCI DSS requirements when using containers and container orchestration tools.

I participated in working groups that helped update the PCI guidelines. As a qualified security assessor (QSA), I performed PCI audits, payment application audits and penetration tests for many companies. Eventually, I started a company that helps small and midsize organizations that were struggling with the technical requirements PCI imposed on their applications and infrastructure. At that time, most application infrastructure was run on virtual machines, and many companies were struggling to understand the implications on their application infrastructure. That’s why the PCI Virtualization Special Interest Group (SIG) proved to be an invaluable resource for practitioners and large infrastructure teams alike when it published the PCI Data Security Standards Virtualization Guidelines in 2011.

Today, with the popularity of containers among large enterprises, the PCI Special Interest Group (SIG) focusing on container orchestration aims to do the same for a cloud native world. VMware — along with other SIG participants that work with modern orchestration systems and that represent companies from all over the world — created new guidance on achieving PCI compliance with containers and container orchestration, which you can read about in this blog post.

Like the virtualization guidelines before, the guidance for containers and container orchestration is not a step-by-step guide to achieving PCI DSS compliance. Rather, it’s an overview of unique threats, best practices, and example use cases to help businesses and practitioners better understand the technologies and practices available that can help with PCI compliance when using containers.

Among other things, the guidance includes a list of common threats specific to containerized environments and the best practices to address each threat. Some of the guidelines will sound familiar, because there are some very basic principles and best practices that just make sense regardless of your environment. The threats and best practices are segmented by use cases like baseline, development and management, account data transmission, and containerization in a mixed scope environment. These use cases help practitioners understand the intent of scope where the best practices apply.

The working group also breaks down the best practices into 16 subsections:

  1. Authentication
  2. Authorization
  3. Workload security
  4. Network security
  5. PKI
  6. Secrets management
  7. Container orchestration tool auditing
  8. Container monitoring
  9. Container runtime security
  10. Patching
  11. Resource management
  12. Container image building
  13. Registry
  14. Version management
  15. Configuration management
  16. Segmentation

While some of these are applicable outside of containerized environments and are considered good security hygiene, some areas are specific to containers. Here are the ones I believe are most critical for containerized environments:

  • Workload security – In a containerized environment, the workload is the actual container or application. In virtualized environments, PCI defines everything as a system. In containerized environments, a workload is defined as a smaller unit or instance. When applications are packaged as "containers" that fully encapsulate a minimal operating system layer along with an application runtime (e.g., .NET, Node.js and Spring), there are no external dependencies, and all internal dependencies run at the versions required by the application.
  • Container orchestration tool auditing – The main benefit of orchestration tools is automation. The tools can include CI/CD, pipelines, supply chains and even Kubernetes. PCI recognizes that automation is just as critical to the environment as the application and the data, and, as such, the tools you are using must be audited.
  • Container monitoring – Because of the ephemeral nature of containers, container monitoring should not be tied to a specific instance. PCI suggests a secured, centralized log monitoring system that allows us to make better correlations of events across instances of the same container. In addition, PCI suggests monitoring and auditing access to the orchestration system API(s) for indications of unauthorized access.
  • Container runtime security – Just as with its container monitoring, PCI’s recommendations for container runtime security are essentially the same as those for VMs. By calling this out as the container runtime security, PCI is recognizing that containers are unique from VMs and, as such, have their own unique runtime elements.
  • Resource management – Container orchestration capabilities like those found in Cloud Foundry or Kubernetes include resource management as part of the platform. Since it’s common to have workloads (i.e., containers and apps) share clusters and resources, PCI recommends having defined resource limits to reduce the risk of availability issues with workloads in the same cluster. With virtualized environments, this is defined as availability.
  • Container image building – This is perhaps the most pronounced difference between the PCI recommendations for VMs and containers, because images are unique to containers. They are what allow us to run a container anywhere. It is also why there is so much we can do with tooling, building and releasing containers, and why the implications for your security posture are profound. Automating the image build lets us set policies to specify which is the trusted (or "golden") image. Container provenance and dependencies are major concerns for QSAs and security teams. Consider this: A quick search on Docker Hub for Java produces a list of 10,000 images using Java. While some are "official" images, many are built by unknown people around the world, might not have been updated in a long time or could include code that compromises the security of businesses using them. (A small example of pinning a trusted image follows this list.)
  • Registry – Registries are required if you are running containers at scale. They function as gatekeepers, enforcing policies around things like image signing and scanning.
  • Version management – Container orchestration systems can run blue-green deploys based on the version management system. This means we can deploy or roll back an upgrade with no effect on our application. Version management should also be used for more than just application changes. Platforms should be automated and have configuration in version control and be treated like any other product that the company might create, operate and manage.
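
As a small, hedged illustration of the image-building guidance, the commands below show one way to insist on signed images and to record the digest of an approved "golden" base image so builds pin to it rather than to a mutable tag. The image name is only an example.

# Refuse to pull images that have not been signed (Docker Content Trust)
export DOCKER_CONTENT_TRUST=1
docker pull eclipse-temurin:17-jre

# Record the digest of the approved "golden" image so Dockerfiles and build
# policies can reference it by digest instead of a mutable tag
docker images --digests eclipse-temurin:17-jre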

The final section of the orchestration guidance ties together the threats and best practices with practical use cases to help businesses and QSAs apply the guidance in real-world scenarios. Each use case provides a diagram and documentation showing how the scenario is organized, highlights the threats it involves and maps them to the previously documented threats and best practices.

PCI is a complex process that has many “it depends” scenarios. While the PCI guidance is helpful, this white paper gives a practical example using VMware technologies. You can also reach out to the PCI Security Standards Council or your QSA if you have questions about PCI and your business.

Rita Manachi contributed to this article.

The post What to Know about Container Security and Digital Payments appeared first on The New Stack.

]]>
Kubecost Cloud Manages K8s Costs for FinOps Teams https://thenewstack.io/kubecost-cloud-manages-k8s-costs-for-finops-teams/ Tue, 19 Sep 2023 10:00:25 +0000 https://thenewstack.io/?p=22718333

When Kubecost’s two founders, Ajay Tripathy and Webb Brown, were working at Google on the then-internal tool, Borg, they kept

The post Kubecost Cloud Manages K8s Costs for FinOps Teams appeared first on The New Stack.

]]>

When Kubecost’s two founders, Ajay Tripathy and Webb Brown, were working at Google on the then-internal tool, Borg, they kept running into the same challenge: How do you get users to care about cost?

They bet that when Google released Kubernetes, the open source descendant of Borg, its new adopters would face the same challenge. They were right.

FinOps Challenges for Kubernetes Users

In an interview with The New Stack, Kubecost’s Founding Partner Rob Faraj elaborated on the challenges Kubernetes (K8s) users face today: visibility, optimization, and internal governance. (For those of you who remember, the Kubecost folks were behind StackWatch, which Kubecost evolved from.)

Perhaps the most immediate concern for companies running multitenant clusters is visibility and what Faraj calls a lack of transparency: “At the end of the month, most K8s users are left thinking, ‘I’m getting a single cost, and now I have to figure out how to distribute it across my teams. But how much does a cluster cost? And what is the fair accurate share of the overall costs for a single K8s cluster?’”

Even as recently as three or four years ago, this was the chief obstacle for companies working with K8s. But that was before optimization became a priority.

Previously considered a nice-to-have, optimization and cost savings have now become must-haves in today’s macroeconomic climate, with companies seeking ways to reduce spending without sacrificing application performance.

Optimization becomes even more important in light of all the cloud spending that is wasted — a frightening 32%, per the Flexera 2022 State of the Cloud Report.

This waste, Faraj says, “is the result of how easy it is to get things up and running in the cloud — and how difficult it is to then go back and audit your activity.”

These challenges with visibility and optimization have given rise to the new practice of proactive notifications and alerts on cost-related activities. Faraj says these simply didn’t exist years ago: “You didn’t inform an engineer that their resources were at or over budget. You would only alert them if their applications were down.” Though designed to help soothe visibility and optimization challenges, these and other FinOps strategies have introduced their own challenges for internal governance.

For example, historically, a cloud bill would arrive at the end of the month, and it would be the sole responsibility of the finance team. Only months later would reports of any issues actually make it back to the infrastructure team. Now, however, that process is changing, with infrastructure teams being pulled into regular discussions on cost savings and optimization.

Turns out, Tripathy and Brown were right: It’s not easy to create a culture of cost-awareness for tech users overnight. And outlining FinOps best practices to help teams monitor, manage, and optimize K8s-related cloud costs requires careful cross-team collaboration.

Bringing Engineers into the FinOps Conversation

Despite growing awareness of its many challenges, cloud optimization remains a stubborn obstacle for many teams. In fact, Flexera’s 2023 State of the Cloud Report names “optimizing existing use of the cloud” as a top initiative — for the seventh year in a row.

When asked why cloud optimization is such a continued struggle for companies, Faraj blames the approach of the industry’s most common tools: “Most tools take a top-down approach; they focus on the finance users and the CFOs. But there’s quite a bit of neglect to engineers who are ultimately the ones capable of implementing the changes to realize savings.”

Kubecost Cloud, on the other hand, he claims takes a different approach: “We’re about getting to the engineers first and giving them the information they need to make informed decisions in the interest of savings.”

Getting engineers to care, it seems, is at the crux of all FinOps challenges. After all, FinOps is, at its core, a cultural shift, focusing on best practices that empower finance and engineering teams to work together to maximize the value of every dollar spent on the cloud. Visibility, optimization, and internal governance challenges can’t be solved by simply pushing a new tool onto engineers. Instead, engineers need data that they can use to take action.

Kubecost Launches Kubecost Cloud

With the launch of Kubecost Cloud, Kubecost thinks they have the solution that both FinOps teams and FinOps engineers will like.

Kubecost Cloud is the SaaS version of Kubecost, a solution for monitoring, managing, and optimizing K8s spend at scale. It’s all built on Opencost, the open source, community-led specification and cost model.

Now available on Google Cloud Marketplace, Kubecost Cloud is supposed to help companies gain visibility into K8s-related cloud costs and, ultimately, reduce waste. For example, to help tackle workload rightsizing, which is identified as one of Google’s Four Golden Signals in their inaugural State of Kubernetes Cost Optimization Report, Kubecost Cloud gives users a clear picture of how many resources a workload requests versus how many it actually uses. Meanwhile, to support demand-based downscaling (another Golden Signal), the solution can also detect which workloads aren’t seeing a lot of use and when, like during weekends when most developers aren’t working. It then sends alerts to trigger these workloads to go offline — a worthwhile strategy for optimization, since cloud providers like Amazon Web Services (AWS) will charge you for compute capacity whether you use it or not.
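
To see the gap Kubecost is measuring for rightsizing, you can get a rough view yourself with kubectl by comparing what each workload requests against what it actually consumes. This is only a sketch: it assumes metrics-server is installed, and the namespace name is hypothetical.

# What each pod requests (the reserved capacity you pay for)
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# What each pod is actually using right now (requires metrics-server)
kubectl top pods -n payments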

These are just some of the ways Kubecost Cloud enables companies to act on the Four Golden Signals, which Google identifies as the signals top-performing FinOps teams dedicate their time to.

Another notable benefit of Kubecost Cloud is its scalability. Because the solution doesn’t need to be deployed on all clusters at once, a company can start with just one, get billed only for what it uses, and then grow from there.

In the future, Faraj says Kubecost wants to continue helping users further grow and optimize by adding more automation to the SaaS solution. But he cautions: “Even as we push towards automation, we don’t ever want people to think that you can just turn on an autopilot button and then your job is done.”

Instead of striving for a fully automated, robo solution, Faraj says he’s more excited about the collaborative future of the FinOps industry — particularly, The FinOps Foundation, of which Kubecost is a certified platform and partner: “The FinOps Foundation is merging the traditional finance mindset with the folks whose hands are on the keyboard. And I love that marriage.”

But an easy marriage, it isn’t always. As finance teams and engineers come together to deploy FinOps strategies and optimize K8s-related costs, Kubecost Cloud hopes to be the solution that eases the union.

The post Kubecost Cloud Manages K8s Costs for FinOps Teams appeared first on The New Stack.

]]>
The /Now Page Movement: Publicly Declaring What You Work on https://thenewstack.io/the-now-page-movement-publicly-declaring-what-you-work-on/ Sun, 17 Sep 2023 13:00:48 +0000 https://thenewstack.io/?p=22717863

In 2015 the world’s very first /now page was created — an HTML file announcing what the owner of that

The post The /Now Page Movement: Publicly Declaring What You Work on appeared first on The New Stack.

]]>

In 2015 the world’s very first /now page was created — an HTML file announcing what the owner of that domain is working on… now.

In an email interview, creator Derek Sivers remembers that he was mostly inspired by two friends he’d been thinking about often. “Noticing I was wishing for that, I thought maybe someone feels the same about me.” (Sivers is a long-time entrepreneur/author/musician currently living in New Zealand, who is also an occasional giver of TED talks…)

But he also described it as “a public declaration of priorities,” where the very act of creating the page also brings an opportunity to confront this clarifying question. “If I’m doing something that’s not on my list, is it something I want to add, or something I want to stop?”

And that turned out to be just the beginning…

Later in 2015 another /now page was created by Gregory Brown, who offers life coaching services for software developers. Brown later wrote that he experienced the same sensation: that adding new items to his /now page felt like “an intentional priority shift.” And when Sivers had tweeted out Brown’s announcement, eight more people created their own /now pages.

Sivers' conclusion? A movement had been created. Soon he'd whipped up a directory of all the /now pages, and given it an easy-to-remember URL: NowNowNow.com.

Sivers told me “It’s been a really fun nerdy project to maintain.”

A Long Time Coming

After eight years, there are now a whopping 2,822 /now pages in that directory — and it's still being updated. A few are outdated, and there are even a few that are dead links. But there are plenty of others. "It goes in waves," Sivers said in an email. Though the usual pace is 1 or 2 new pages a day, "Someone mentions it to their network and I'll get 20 new signups per day for a while."

The page displays the faces of each happy /now page creator, in rows of four, along with their name, job title, and, of course, the URL of their /now page…

Screenshot from NowNowNow dotcom

There are programmers and designers, authors and musicians, agile coaches and SEO consultants — a vast and varied range of careers. Yes, many of these so-called /now pages haven’t been updated in years. But many of them have, offering those precious peeks at what somebody else is working on right now.

  • “Don’t tell anyone, but I’ve started work on my first graphic novel!” writes British filmmaker Adam Westbrook this month.
  • “At the moment I am working as a Senior Software Developer (full remote) at a large consumer electronics retailer…” writes UK-based Matthew Fedak.
  • Over in London, “I’m all in on my new business,” writes entrepreneur Jodie Cook. “Coachvox AI. We create artificially intelligent coaches based on thought leaders.”

In our email interview, Sivers shared a fond memory from roughly two years ago, when a web service was offering to create personal webpages for people. “They liked the idea of the /now page so much that they made it automatic for everyone who has a site with them.”

But Sivers also shared the secret history of the URL: he’d already bought the domain back in 2003. By 2009 it was even part of his online startup, with NowNowNow.com becoming a place to document his progress on building services for musicians. Archive.org preserves a copy of that 2009 version, a homepage promising visitors a “transparent office where you can watch everything being built, or even contribute if you’d like.”

So why did he call that domain NowNowNow? Sivers liked the motivational message that was built right into its name. “I liked the idea of doing things now now now instead of procrastinating…”

Sivers admits he probably reinvented the wheel. Early Linux systems included the finger command for pulling up a user's self-composed status updates. (Though it's since been removed from many distros, in some cases replaced by a simpler command named pinky.)

But Sivers takes pride in creating a tool that lets others offer a window into their own world. “It’s not a business. It’s not social media,” explains the official “About” page at NowNowNow.com, emphasizing that businesses are intentionally not being included. “Browsing nownownow.com is only interesting because you get a glimpse into people’s lives and how they focus. If it became full of businesses, it would lose its appeal…”

As a satirical Easter Egg, Sivers even created a /now page for NowNowNow.com. Anyone who thinks to look at NowNowNow.com/now gets a congratulatory webpage gushing “How clever you are… That is so meta, it’s crazy.”

It’s followed by a funny list of six things being done now… that changes every two seconds.

Screenshot of NowNowNow dot com's own Now page

The New Now-ify Movement?

Another superfan of /now pages is Taylor Troesh, a Southern California-based software engineer and tech blogger. “Maintaining a /now page helps me focus on ‘one big thing’ at a time…” Troesh said in an email. “My natural desire to move on to my next Big Thing every few days has been forcing me to complete what I’m working on ‘now’.”

“As a side-effect, I’ve actually found that it’s a great shortcut for people to learn who I am…”

But in 2020 Troesh took things to the next level, creating his own “opinionated time management system” called nowify. (“Use any cron program to make nowify check for overdue items and beep,” explains its ReadMe page on GitHub.) As a newly-remote worker, Troesh saw this system as “an attempt to learn some self-control outside of the physical office context.” And looking back Troesh says the ongoing reminders “successfully rewired my brain as hoped… Timetracking is probably the most effective way to completely change your life.” (Along with turning off intermittent notifications.)

He’s now getting several emails a month from users interested in implementing it themselves. And his friend Brian Hicks was even inspired to make his own similar system called Montage. Hicks says Montage “annoys me when I’m not doing the thing I said I was gonna do” — after prompting the user to specify an intention, and also an amount of time to spend on it.

Troesh tells me, “I suspect that there’s quite a few people out there with similar home-grown systems.”

Followers Are Leaders

And of course, Troesh still has his own /now page, breaking out various projects as “Recent,” “Current,” and “Soon”.

“Life moves both more slowly and more quickly than expected!” Troesh wrote at the top of the page. “Looking back at ‘big things’ puts you into perspective…”

But to this day Sivers gives a lot of credit to Gregory Brown, who will always be that all-important “first follower” — the first person to follow Sivers’ lead and make a second /now page back in 2015.

Strangely, there’s an inspiring leadership lesson in here. One of Sivers’ Ted talks was on the importance of the “first follower,” calling it “an underappreciated form of leadership.” While the original leader stands alone, daring to look ridiculous, it’s that first follower who provides the first public validation of the idea — transforming that original lone nut into an actual leader. “Everyone needs to see the followers, because new followers emulate followers — not the leader… And ladies and gentlemen that is how a movement is made!”

This was in 2010, but it's easy to see the logic that would lead Sivers to declare a /now page movement. Mixed in with logistical advice — "Be public. Be easy to follow!" — Sivers' talk also included some specific philosophical thoughts. Sivers spoke about "the importance of nurturing your first few followers as equals, making everything clearly about the movement, not you." The talk eerily foreshadows some of the intentionality Sivers would later bring to his idea for a /now page community.

“We’re told we all need to be leaders, but that would be really ineffective,” Sivers told his audience in 2010. “The best way to make a movement, if you really care, is to courageously follow and show others how to follow.

“When you find a lone nut doing something great, have the guts to be the first person to stand up and join in.”

The post The /Now Page Movement: Publicly Declaring What You Work on appeared first on The New Stack.

]]>
Harden Ubuntu Server to Secure Your Container and Other Deployments https://thenewstack.io/harden-ubuntu-server-to-secure-your-container-and-other-deployments/ Sat, 16 Sep 2023 13:00:48 +0000 https://thenewstack.io/?p=22717712

Ubuntu Server is one of the more popular operating systems used for container deployments. Many admins and DevOps team members

The post Harden Ubuntu Server to Secure Your Container and Other Deployments appeared first on The New Stack.

]]>

Ubuntu Server is one of the more popular operating systems used for container deployments. Many admins and DevOps team members assume if they focus all of their security efforts starting with the container image on up, everything is good to go.

However, if you neglect the operating system on which everything is installed and deployed, you are neglecting one of the most important (and easiest) steps to take.

In that vein, I want to walk you through a few critical tasks you can undertake with Ubuntu Server to make sure the foundation of your deployments is as secure as possible. You’ll be surprised at how easy this is.

Are you ready?

Let’s do this.

Schedule Regular Upgrades

I cannot tell you how many servers I've happened upon where the admin (or team of admins) failed to run regular upgrades. This should be an absolute no-brainer, but I do understand the reasoning behind the failure. First off, people get busy, so upgrades tend to fall by the wayside in favor of putting out fires.

Second, when the kernel is upgraded, the server must be rebooted. Given how downtime is frowned upon, it's understandable why some admins hesitate to run upgrades.

Don’t.

Upgrades are the only way to ensure your server is patched against the latest threats; if you don't upgrade, those servers remain vulnerable.

Because of this, find a time when a reboot won’t interrupt service and apply the upgrades then.

Of course, you could also add Ubuntu Livepatch to the system, so patches are automatically downloaded, verified, and applied to the running kernel, without having to reboot.
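
If it helps, the manual route is only a couple of commands, and Ubuntu can also be configured to apply security updates automatically with unattended-upgrades (kernel updates will still require a reboot unless you use Livepatch):

# Apply all pending updates now
sudo apt-get update && sudo apt-get dist-upgrade -y

# Or let Ubuntu install security updates automatically
sudo apt-get install unattended-upgrades -y
sudo dpkg-reconfigure -plow unattended-upgrades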

Do Not Enable Root

Ubuntu ships with the root account disabled. In its place is sudo, and I cannot recommend enough that you do not enable and use the root account. By enabling the root account, you open your system(s) up to security risks. You can even go so far as to lock the root account's password, with the command:

sudo passwd -l root


What the above command does is lock the root account's password, so until you set a new password for root, the root user is effectively inaccessible.
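
To confirm the account is locked, check its status; an "L" in the second field of the output means the password is locked:

sudo passwd -S root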

Disable SSH Login for the Root User

The next step you should take is to disable the root user SSH login. By default, Ubuntu Server permits root SSH logins with key-based authentication (the prohibit-password setting), which should be considered a security issue waiting to happen. Fortunately, disabling root SSH access is very simple.

Log in to your Ubuntu Server and open the SSH daemon config file with:

sudo nano /etc/ssh/sshd_config


In that file, look for the line:

#PermitRootLogin prohibit-password


Change that to:

PermitRootLogin no


Save and close the file. Restart SSH with:

sudo systemctl restart sshd


The root user will no longer be allowed access via SSH.

Use SSH Key Authentication

Speaking of Secure Shell, you should always use key authentication, as it is much more secure than traditional password-based logins. This process takes a few steps and starts with you creating an SSH key pair on the system(s) that will be used to access the server. You’ll want to do this on any machine that will use SSH to remote into your server.

The first thing to do is generate an SSH key with the command:

ssh-keygen


Follow the prompts and SSH will generate a key pair and save it in ~/.ssh.

Next, copy that key to the server with the command:

ssh-copy-id SERVER


Where SERVER is the IP address of the remote server.

Once the key has been copied, make sure to attempt an SSH login from the local machine to verify it works.
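
Because you're about to turn off password logins, it's worth proving the key works on its own first. You can force key-only authentication for a single test connection like this (replace USER and SERVER with your own values):

ssh -o PasswordAuthentication=no -o PubkeyAuthentication=yes USER@SERVER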

Repeat the above steps on any machine that needs SSH access to the server, because we're about to disable SSH password authentication. One thing to keep in mind is that, once you disable password authentication, you will only be able to access the server from a machine that has copied its SSH key to the server. Because of this, make sure you have local access to the server in question (just in case).

To disable SSH password authentication, open the SSH daemon configuration file again and look for the following lines:

#PubkeyAuthentication yes


and

#PasswordAuthentication yes


Remove the # characters from both lines and change yes to no on the second. Once you've done that, save and close the file. Restart SSH with:

sudo systemctl restart sshd


Your server will now only accept SSH connections using key authentication.

Install Fail2ban

Speaking of SSH logins, one of the first things you should do with Ubuntu Server is install fail2ban. This system keeps tabs on specific log files to detect unwanted SSH logins. When fail2ban detects an attempt to compromise your system via SSH, it automatically bans the offending IP address.

The fail2ban application can be installed from the standard repositories, using the command:

sudo apt-get install fail2ban -y


Once installed, you’ll need to configure an SSH jail. Create the jail file with:

sudo nano /etc/fail2ban/jail.local


In the file, paste the following contents:

[sshd]
enabled = true
port = 22
filter = sshd
logpath = /var/log/auth.log
maxretry = 3


Restart fail2ban with:

sudo systemctl restart fail2ban


Now, anytime someone attempts to log into your Ubuntu server and fails three times, their IP address will be banned for fail2ban's configured ban time (10 minutes by default; set bantime = -1 in the jail if you want a permanent ban).
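
To confirm the jail is active and see how many addresses it has banned, ask fail2ban directly:

sudo fail2ban-client status sshd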

Secure Shared Memory

By default, shared memory is mounted read/write, with programs allowed to execute from it. That means the /run/shm space can be abused by any application or service that has access to it. To avoid this, mount /run/shm with more restrictive options, such as noexec and nosuid.

The one caveat to this is you might run into certain applications or services that require read/write access to shared memory. Fortunately, most applications that require such access are GUIs, but that’s not an absolute. So if you find certain applications start behaving improperly, you’ll have to return read/write mounting to shared memory.

To do this, open /etc/fstab for editing with the command:

sudo nano /etc/fstab


At the bottom of the file, add the following line:

tmpfs /run/shm tmpfs defaults,noexec,nosuid 0 0


Save and close the file. Reboot the system with the command:

sudo reboot


Once the system reboots, shared memory is mounted with the noexec and nosuid options, so nothing can be executed from it and setuid binaries are ignored there.

Enable the Firewall

Uncomplicated Firewall (UFW) is disabled by default. This is not a good idea for production machines. Fortunately, UFW is incredibly easy to use and I highly recommend you enable it immediately.

If you're administering the server over SSH, first allow SSH connections so that enabling the firewall doesn't lock you out. That command is:

sudo ufw allow ssh


Then enable UFW with the command:

sudo ufw enable


You can then allow other services as needed, such as HTTP and HTTPS, like so:

sudo ufw allow http
sudo ufw allow https
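
You can check which rules are active at any time with:

sudo ufw status verbose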


For more information on UFW, make sure to read the man page with the command:

man ufw

Final Thoughts

These are the first (and often most important) steps to hardening Ubuntu Server. You can take this a bit further with password policies and two-factor authentication, but the above steps will go a long way toward giving you a solid base to build on.

The post Harden Ubuntu Server to Secure Your Container and Other Deployments appeared first on The New Stack.

]]>
Drive Platform Engineering Success with Humanitec and Port https://thenewstack.io/drive-platform-engineering-success-with-humanitec-and-port/ Fri, 15 Sep 2023 14:18:20 +0000 https://thenewstack.io/?p=22718171

Platform engineering is evolving like crazy. So many cool new tools are popping up left, right and center, all designed

The post Drive Platform Engineering Success with Humanitec and Port appeared first on The New Stack.

]]>

Platform engineering is evolving like crazy. So many cool new tools are popping up left, right and center, all designed to boost developer productivity and slash lead time. But here’s the thing. For many organizations, this makes picking the right tools to build an internal developer platform an instant challenge, especially when so many seem to do the same job.

At Humanitec we get the need to ensure you’re using the right tech to get things rolling smoothly. This article aims to shed some light on two such tools: the Humanitec Platform Orchestrator and Port, an internal developer portal. We’ll dive into the differences, their functionalities and the unique roles they play in building an enterprise-grade internal developer platform.

The Power of the Platform Orchestrator

The Humanitec Platform Orchestrator sits at the core of an enterprise-grade internal developer platform and enables dynamic configuration management across the entire software delivery cycle. It also establishes a clear separation of concerns: Platform engineers define how resources should be provisioned in a standardized and dynamic way. For developers, it removes the need to define and maintain environment-specific configs for their workloads. They can use an open source workload specification called Score to define the resources required by their workloads in a declarative way. With every git push, the Platform Orchestrator automatically figures out the necessary resources and configs for their workload to run.

When used to build an internal developer platform, the Platform Orchestrator cuts out a ton of manual tasks. The platform team defines the rules, and the Platform Orchestrator handles the rest, as it follows an “RMCD” execution pattern:

  1. Read: Interpret workload specification and context.
  2. Match: Identify the correct configuration baselines to create application configurations and identify what resources to resolve or create based on the matching context.
  3. Create: Create application configurations; if necessary, create (infrastructure) resources, fetch credentials and inject credentials as secrets.
  4. Deploy: Deploy the workload into the target environment wired up to its dependencies.

No more config headaches, just more time to focus on the important stuff that adds real value.

The Pivotal Role of Internal Developer Portals

Like the Platform Orchestrator, internal developer portals such as Port play a pivotal role from the platform perspective, mainly because the portal acts as the interface to the platform, enhancing the developer experience. However, the two tools belong to different categories, occupy different planes of a platform's architecture and have different primary use cases.

Note: The components and tools referenced below apply to a GCP-based setup, but all are interchangeable. Similar reference architectures can be implemented for AWS, Azure, OpenShift or any hybrid setup. Use this reference as a starting point, but prioritize incorporating whatever components your setup already has in place.

For example, where the Platform Orchestrator is the centerpiece of an internal developer platform, Port acts as the user interface to the platform, providing the core pillars of the internal developer portal: a software catalog that provides developers with the right information in context, a developer self-service action layer (e.g., setting up a temporary environment, provisioning a cloud resource and scaffolding a service), a scorecards layer (e.g., indicating whether software catalog entities comply with certain requirements) and an automation layer (for instance, alerting users when a scorecard drops below a certain level). Port lets you define any catalog for services, resources, Kubernetes, CI/CD, etc., and it is easily extensible.

Same Tools or Apples and Oranges?

So, you could say comparing the Humanitec Platform Orchestrator and Port is like comparing apples to oranges. Both play important roles in building a successful platform. But they're not the same thing at all. The Platform Orchestrator is designed to generate and manage configurations. It interprets what resources and configurations are required for a workload to run, creates app and infrastructure configs based on rules defined by the platform team, and executes them. As a result, developers don't have to worry about dealing with environment-specific configs for their workloads anymore. The Platform Orchestrator handles it all behind the scenes, making life easier for them.

Port, on the other hand, is like the front door to your platform. It acts as the interface, containing everything developers need to be self-sufficient, from self-service actions through the catalog to automation that can alert them to vulnerabilities, ensuring AppSec throughout the entire software development life cycle. In short, an internal developer portal drives productivity by allowing developers to self-serve without placing too much cognitive load on them, whether that means setting up a temporary environment, getting a cloud resource or starting a new service. It's all about making self-service and self-sufficiency super smooth for developers.

Building the Perfect Platform with Port and the Platform Orchestrator

The real magic happens when these two tools join forces. While they support different stages in the application life cycle, they can be used in tandem to build an effective enterprise-grade internal developer platform that significantly enhances DevOps productivity.

So when it comes to the Humanitec Platform Orchestrator and Port, it’s not about choosing one over the other. Both can be valuable tools for your platform. What matters is the order in which you bring them into the mix, and how you integrate them.

Step one, let’s set the foundation right. You should structure your internal developer platform to drive standardization across the end-to-end software development life cycle and establish a clear separation of concerns between app developers and platform teams. And the best way to do that is by starting with a Platform Orchestrator like the one from Humanitec. Think of it like the beating heart of your platform.

Next, you can decide what abstraction layers should be exposed to developers in the portal, what self-service actions you need to offer them to unleash their productivity, and which scorecards and automations need to be in place. For this, you can adopt Port as a developer portal on top of the platform.

Port and Humanitec in Action

Here’s what combining the Humanitec Platform Orchestrator and Port could look like:

  1. First, you’ll need to set up both Humanitec and Port. For Port, you’ll need to think about the data model of the software catalog that you will want to cover in Port, for instance, bringing in CI/CD data, API data, resource data, Kubernetes data or all of the above. You’ll also need to identify a set of initial popular self-service actions that you will want to provide in the portal.
  2. Let’s assume you want to create a self-service action to deploy a new build of a microservice in Port.
  3. Make sure the microservice repository/definition includes a Score file, which defines the workload dependencies.
  4. Port receives the action request and triggers a GitHub Workflow to execute the service build.
  5. Once the service is built, the Platform Orchestrator is notified and dynamically creates configuration files based on the deployment context. The Platform Orchestrator can derive the context from API calls or from tags passed on by any CI system.
  6. Humanitec deploys the new service.
  7. The resulting new microservice deployment entity will appear in Port’s software catalog.

Don’t forget what happens after Day 1, though. Dealing with complex app and infra configs, and having to add or remove workload-dependent resources (stuff like databases, DNS, storage) for different types of environments, can equate to a ton of headaches.

This is where the Platform Orchestrator and Score do their thing. With Score, an open source workload specification, developers can easily request the resources their workloads need or tweak configs in a simple way, depending on the context — like what kind of environment they're working with. Let's dive into an example to make it clear:

1. Add the following request to the resources section of the Score file (see the sketch after this list for a fuller example):

bucket:
  type: s3

2. Run a git push.

3. The Orchestrator will pick this up and update or create the correct S3 bucket based on the context, create the app configs and inject the secrets.

4. At the end, it will register the new resource in the portal.

5. Resources provisioned by Humanitec based on requests from the Score file will be shown on the Port service catalog for visibility and also to enable a graphical overview of the resources.
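
To make the Score piece more concrete, here is a minimal sketch of what such a file could look like, written as a shell heredoc you can paste into a terminal. The workload name, container name and image are hypothetical placeholders, and your Score version's schema may differ slightly, so treat it as an illustration rather than a copy-paste template:

cat > score.yaml <<'EOF'
apiVersion: score.dev/v1b1
metadata:
  name: my-workload                                # hypothetical workload name
containers:
  main:
    image: registry.example.com/my-service:latest  # hypothetical image reference
resources:
  bucket:
    type: s3                                       # the S3 bucket requested in step 1
EOF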

Port comes with additional capabilities beyond scaffolding microservices and spinning up environments, including self-service actions that are long-running, asynchronous or require manual approval. Sample Day 2 actions are:

  • Add temporary permissions to a cloud resource.
  • Extend developer environment TTL.
  • Update running service replica count.
  • Create code dependencies upgrade PR.
  • And more.

Drive Productivity and Slash Time to Market

To sum up, the Humanitec Platform Orchestrator and Port are an awesome match when it comes to building an effective enterprise-grade internal developer platform. And the best place to start? Build your platform around the Platform Orchestrator. That’s the key to unlocking the power of dynamic configuration management (DCM), which will standardize configs, ensure a clear separation of concerns and take your DevEx to the next level. Then, choose your developer abstraction layers. This is where you can use Port as a developer portal sitting on top of the platform. Successfully integrate the two and expect productivity gains, a boost to developer performance and, ultimately, slashed time to market.

The post Drive Platform Engineering Success with Humanitec and Port appeared first on The New Stack.

]]>
Deploy Multilanguage Apps to Kubernetes with Open Source Korifi https://thenewstack.io/deploy-multilanguage-apps-to-kubernetes-with-open-source-korifi/ Fri, 15 Sep 2023 08:06:43 +0000 https://thenewstack.io/?p=22717682

Kubernetes can be intricate to manage, and companies want to leverage its power while avoiding its complexity. A recent survey

The post Deploy Multilanguage Apps to Kubernetes with Open Source Korifi appeared first on The New Stack.

]]>

Kubernetes can be intricate to manage, and companies want to leverage its power while avoiding its complexity. A recent survey found that 84% of companies don’t see value in owning Kubernetes themselves. To address this complexity, Cloud Foundry introduced open source Korifi, which preserves the classic Cloud Foundry experience of being able to deploy apps written in any language or framework with a single cf push command. But the big difference is that this time, apps are pushed to Kubernetes.

In this tutorial, we'll explore how to use Korifi to deploy web applications written in different languages: Ruby, Node.js, ASP.NET, and PHP. Today, it is quite common to see web applications built from multiple languages. I will also provide insights into how Korifi works and its basic configuration, helping you kick-start your multicloud, multitenant, and polyglot journey.

Ruby

For all the examples in this tutorial, I will use sample web applications that you can download from this GitHub repository, but feel free to use your own. You can also find instructions on installing Korifi in this article, which guides you through the easiest way to achieve that by running two Bash scripts that will set everything up for you.

Once you have Korifi installed and have cloned a Ruby sample application, go into the root folder and type the following command:
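
For example, with a hypothetical app name of my-ruby-app, the push is simply:

cf push my-ruby-app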

That’s it! That is all you need to deploy a Ruby application to Kubernetes. Keep in mind that while the first iteration of cf push will take some time as Korifi needs to download a number of elements (I will explain this in the next paragraph), all subsequent runs will be much faster.

At any point, if you want to check the status of a Korifi app, you can use the cf app command, which, in the case of our Ruby app, would be:
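
Assuming the hypothetical my-ruby-app name from the push above, that would be:

cf app my-ruby-app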

Node.js

Before deploying a Node.js application to Kubernetes using Korifi, let me explain how it works under the hood.

One of the key components at play here is Cloud Native Buildpacks. The buildpack concept was initially introduced by Heroku in 2011 and later adopted by PaaS providers like Google App Engine, GitLab, Deis, and Dokku. The Cloud Native Buildpacks project became a part of the CNCF in 2018.

Buildpacks are primarily designed to convert your application’s source code into an OCI image, such as a Docker image. This process unfolds in two steps: first, it scans your application to identify its dependencies and configures them for seamless operation across diverse clouds. Then, it assembles an image using a Builder, a structured amalgamation of Buildpacks, a foundational build image, a lifecycle, and a reference to a runtime image.

Although you have the option to construct your own build images and Buildpacks, you can also leverage those provided by established entities such as Google, Heroku, and Paketo Buildpacks. In this tutorial, I will exclusively use ones provided by Paketo — an open source project that delivers production-ready Buildpacks for popular programming languages.

Let's briefly demonstrate what Korifi does under the hood by manually building an OCI image from a Node.js application. You can follow the installation instructions here to install the pack CLI. Then, get into the root folder of your application and run the following command:
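
The builder you pass to pack is up to you; as an example, using the Paketo Jammy Base builder and a hypothetical image name:

pack build my-nodejs-app --builder paketobuildpacks/builder-jammy-base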

Your Node.js OCI image is available; you can check this by running the command:
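
docker images


The newly built image (my-nodejs-app in the example above) should appear in the list.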

Once the Docker image is ready, Korifi utilizes Kubernetes RBAC and CRDs to mimic the robust Cloud Foundry paradigm of orgs and spaces. But the beauty of Korifi is that you don’t have to manage any of that. You only need one command to push a Node.js application to Kubernetes:
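
Again assuming a hypothetical app name:

cf push my-nodejs-app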

That’s it!

ASP.NET

Now, let’s push an ASP.NET application. If you run cf push my-aspnet-app, the build will fail, and you will get the following error message:

These logs tell us that Korifi may not have a valid Buildpack for packaging an ASP.NET application. We can verify that by running the following command:
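
cf buildpacks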

In the output, you'll see that there are no .NET-related buildpacks listed.

To fix that, first, we need to tell Korifi which Buildpack to use for an ASP.NET application by editing the ClusterStore:
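
Korifi uses kpack under the hood, so the store is a kpack ClusterStore resource. The exact resource name depends on your installation, so list the stores first and then edit yours (NAME below is a placeholder):

kubectl get clusterstores
kubectl edit clusterstore NAME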

Make sure any namespace or resource name you use matches the value you chose during your Korifi cluster configuration (tutorial-space in the installation article). Then add the line - image: gcr.io/paketo-buildpacks/dotnet-core to the spec sources list.

Then we need to tell Korifi in which order to use the Buildpacks by editing our ClusterBuilder:
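
As with the store, list the builders to find the right name, then edit it:

kubectl get clusterbuilders
kubectl edit clusterbuilder NAME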

Add the line - id: paketo-buildpacks/dotnet-core at the top of the spec order list.

If everything was done right, you should see the .NET Core Paketo Buildpack in the list output by the cf buildpacks command. Finally, you can simply run cf push my-aspnet-app to push your ASP.NET application to Kubernetes.

PHP

We need to follow the same process for PHP: the paketo-buildpacks/php Buildpack needs to be added to the ClusterStore and ClusterBuilder.

But if we run cf push my-php-app, Korifi will fail to start the app and return the following error message:

The OCI image is missing the libxml library, which is required by PHP; this is probably because the builder doesn't support PHP. To check, let's see which builder Korifi is using by running this command:
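
One way to check (the resource name may vary with your installation) is to inspect the kpack ClusterStack the builder is built on, which lists the build and run images in use:

kubectl get clusterstacks -o yaml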

Look at the build and run images in the output.

As you can see, Korifi currently uses Paketo Jammy Base, which, according to its GitHub repo description, does not support PHP. You can also check that by looking at the builder's builder.toml file or by running the command pack builder suggest, which lists the suggested builders along with a description of what each supports.

While Jammy Base does not support PHP, the Jammy Full builder does. There are multiple ways to get Korifi to use another builder; I will cover just one in this tutorial. It assumes you installed Korifi the easy way, with the deploy-on-kind.sh script.

You need to go to the Korifi source code and edit the file scripts/assets/values.yaml so that the field clusterStackBuildImage is set to paketobuildpacks/build-jammy-full and clusterStackRunImage is set to paketobuildpacks/run-jammy-full. You can do that by running this command:
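
Assuming the defaults are currently the Jammy Base images, a quick way to make the swap is:

sed -i 's/jammy-base/jammy-full/g' scripts/assets/values.yaml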

Then, run the scripts/deploy-on-kind.sh script.

That's it! Korifi will now use the Jammy Full builder and will be able to deploy your PHP application with a cf push my-php-app command.

Summary

Hopefully, you've now experienced just how easy it is to use Korifi to deploy applications written in Ruby, Node.js, ASP.NET, and PHP to Kubernetes. You can keep up with the Korifi project by following the Cloud Foundry X account and joining the Slack workspace.

The post Deploy Multilanguage Apps to Kubernetes with Open Source Korifi appeared first on The New Stack.

]]>
Automating Retry for Failed Terraform Launches https://thenewstack.io/automating-retry-for-failed-terraform-launches/ Thu, 14 Sep 2023 17:53:59 +0000 https://thenewstack.io/?p=22718168

We work with a lot of Terraform users to orchestrate and deploy application environments. As part of the deployment process,

The post Automating Retry for Failed Terraform Launches appeared first on The New Stack.

]]>

We work with a lot of Terraform users to orchestrate and deploy application environments.

As part of the deployment process, we saw a recurring challenge among Terraform users. A number of transient errors with Terraform or the cloud service provider caused specific commands to fail.

Even though simply retrying the Terraform command tends to fix the problem, it was still just as frustrating to our team as it was to the Terraform users we worked with. Users who typically deploy environments in a matter of minutes were pulled away from their day-to-day work just to retry Terraform deployments. It was the kind of additional step that we try to remove for our users.

So we looked into what we can do to automate the process.

First, it helps to understand how our platform works. Quali Torque orchestrates YAML files — which we call blueprints — for application environments directly from the IaC modules defined in Git. Once an administrator is comfortable with the inputs, dependencies, and outputs for the environment defined in the YAML, they can “publish” it, or release it to the teams to deploy. Publishing a blueprint lists it on a self-service catalog within Quali Torque (and makes it available to integrate with other tools). Once published, users can deploy the environment based on that blueprint as many times as needed.

For every Terraform module in each environment, the deployment process executes the Terraform plan as defined in Git. Any failed Terraform command would break this process. This also applies to the destroy command — any failure there will leave environments running (and accruing costs) longer than the developer intended, even though they attempted to terminate them.

To help address this problem, we spoke with a few of our customers to understand some of the most common transient errors that were fixed after simply retrying the command.

Once we settled on an initial list of common errors, we pushed an update that instructed Torque to recognize a failed command caused by one of those errors and automatically retry the Terraform plan in response. Essentially, the platform automates that step so the DevOps engineer doesn’t get pulled away from more important work.
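
To picture the general pattern, here is a minimal shell sketch of retry-on-transient-error logic wrapped around a Terraform command. The error patterns and retry count are hypothetical placeholders, not what Torque actually uses:

MAX_RETRIES=3
TRANSIENT_REGEX='RequestLimitExceeded|Throttling|timeout while waiting'  # hypothetical transient-error patterns

for attempt in $(seq 1 "$MAX_RETRIES"); do
  # Capture stdout and stderr so the failure can be inspected
  if output=$(terraform apply -auto-approve 2>&1); then
    echo "$output"
    exit 0
  fi
  echo "$output"
  # Only retry when the failure looks transient
  if ! grep -Eq "$TRANSIENT_REGEX" <<< "$output"; then
    echo "Non-transient failure; not retrying." >&2
    exit 1
  fi
  echo "Transient error detected; retrying ($attempt/$MAX_RETRIES)..." >&2
  sleep $((attempt * 10))
done
exit 1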

While it sounds like a simple automation, our users have already reported an improved success rate for environment deployments, fewer idle cloud resources and better overall resiliency as a result. This kind of automation is just one way DevOps teams can cut out redundant manual work.

The post Automating Retry for Failed Terraform Launches appeared first on The New Stack.

]]>
Can WebAssembly Get Its Act Together for a Component Model? https://thenewstack.io/can-webassembly-get-its-act-together-for-a-component-model/ Thu, 14 Sep 2023 16:33:48 +0000 https://thenewstack.io/?p=22717260

The final mile for WebAssembly remains a work in progress as the Wasm community races to finalize a common standard.

The post Can WebAssembly Get Its Act Together for a Component Model? appeared first on The New Stack.

]]>

The final mile for WebAssembly remains a work in progress as the Wasm community races to finalize a common standard. Among other things, it awaits the standardization of the component interface WASI, the layer required to ensure endpoint compatibility among the different devices and servers on which Wasm applications are deployed. Progress has been made so that apps written in different languages can be deployed with no configuration across numerous and varied endpoints, but the survival of WebAssembly as a widely adopted tool remains at stake until such a standard is completed. However, the community is aggressively working to finalize the component model, which became apparent during the many talks given at the first Linux Foundation-sponsored WasmCon 2023 last week.

“WebAssembly has boasted a handful of small advantages over other runtime technologies,” Matt Butcher, co-founder and CEO of Fermyon Technologies, told The New Stack. “The component model is the big differentiator. It is the thing that opens up avenues for development that have simply never existed before. It’s fair to call this an existentially important moment for WebAssembly.”

Implementations for WASI-Preview 2

This roadmap, released in July, reflects changes occurring in standards within the WebAssembly Community Group (CG) and the WASI Subgroup within the W3C. This includes WebAssembly Core, the WebAssembly Component Model, WASI (the WebAssembly System Interface) and a number of WASI-based interfaces.

The Component Model proposal, developed on top of the core specification, includes the WebAssembly Interface Types (WIT) IDL; WIT is the language of high-level types used to describe the interfaces of a component, as Bailey Hayes, director of the Bytecode Alliance Technical Standards Committee and CTO at Cosmonic, explained in a blog post.

The Component Model adds high-level types with imported and exported interfaces, making components composable and virtualizable, Hayes said. This is important for allowing different programming languages to function in the same module because it allows for the creation and combination of components that were originally written in different programming languages, Hayes said.

The latest standards for WebAssembly (Wasm) are of significant importance as they focus the efforts of developers, community members, and adopters on tooling that supports a portable ecosystem, Liam Randall, CEO and co-founder of Cosmonic, told The New Stack. “With a focus on WebAssembly Components, they enable Components to act as the new containers, ensuring portability across various companies developing across the landscape,” Randall said. “This standardization also fosters better collaboration between language tooling that creates components from various languages and hot-swappable modules defined by WASI. What this means to developers is that we can now use code from across our language silos, creating a powerful ‘better together’ story for the WebAssembly ecosystem.”

In other words, WASI-Preview 2 is an exciting step as it addresses critical areas such as performance, security and JavaScript interactions — and one more step on the journey toward interoperability, Torsten Volk, an analyst for Enterprise Management Associates (EMA), told The New Stack. “The common component model is absolute key for accelerating the adoption of WebAssembly, as it is the precondition for users to just run any of their applications on any cloud, data center or edge location without having to change app code or configuration,” Volk said.

An API call requesting access to a GPU, a database or a machine learning model would then work independently of the specific type of the requested component, Volk said. “This means I could define how a datastream should be written to a NoSQL database and the same code function would work with MongoDB, Cassandra or Amazon DynamoDB,” Volk said.

WASI began as a POSIX-style library for WebAssembly. However, it has outgrown those roots, becoming something more akin to JavaScript’s WinterCG: a core set of interfaces to commonly used features like files, sockets, and environments, Butcher said. “WASI Preview 2 exemplifies this movement away from POSIX and toward a truly modern set of core features. Instead of re-implementing a 1970s vision of network computing, WASI is moving to a contemporary view of distributed applications.”

The component aspect plays a key role in the release of new features for Fermyon’s experimental SDK for developing Spin apps using the Python programming language.

Relating to components, Fermyon’s new componentize-py can be used to build a simple component using a mix of Python and native code, type-check it using MyPy, and run it using Spin. The user can then update the app to use the wasi-http proposal, a vendor-agnostic API for sending and receiving HTTP requests.

“Providing developers with the ability to integrate with runtime elements that are not yet completely defined by a CCM makes it less likely for them to hit a wall in their development process, and should therefore be welcomed,” Volk said.

Regarding Python, it is a “top language choice, and is vitally important for AI,” Butcher said. “Yet, to this point some of the most powerful Python libraries like NumPy have been unavailable. The reason was that these core libraries were written in C and dynamically loaded into Python,” Butcher said. “Who would have thought that the solution to this conundrum was the Component Model?”

With the new componentize-py project, Python can take its place as a top-tier WebAssembly language, Butcher noted. “Most excitingly, we are so close to being able to link across language boundaries, where Rust libraries can be used from Python or Go libraries can be used from JavaScript,” Butcher said. “Thanks to the Component Model, we’re on the cusp of true polyglot programming.”

Future Work

The work to finalize the component model that Wasm needs for wide-scale adoption remains ongoing, as an extension of the incremental steps described above, Luke Wagner, a distinguished engineer for edge cloud platform provider Fastly, told The New Stack during WasmCon 2023 last week. Wagner defines a component as a “standard, portable, lightweight, finely sandboxed, cross-language compositional module.”

During his conference talk, Wagner described the developer preview to be released this year:

  • Preview 2 covers both the component model and a subset of WASI interfaces.
  • The top-line goals are stability and backward compatibility.

“We have an automatic conversion converting Preview 1 core modules to Preview 2 components, and we're committing to, in the future, having a similar tool to convert Preview 2 components into whatever comes next,” Wagner said during his talk.

Preview 2 features include, Wagner said:

  • A first wave of languages that include Rust, JavaScript, Python, Go and C.
  • A first wave of WASI proposals, including filesystems, sockets, CLI, HTTP and possibly others.
  • A browser/Node polyfill: jco transpile.
  • Preliminary support for WASI virtualization, in the form of wasi-virt.
  • Preliminary support for component composition, in the form of wasm-compose.
  • Experimental component registry tooling, in the form of warg.
“Next year it's all about improving the concurrency story,” Wagner said. This is because Preview 2 “does the best it can but concurrency remains warty.”

These “wart” aspects Wagner described include:

  • Async interfaces, which are going to be too complex for direct use and need manual glue code, while the general goal is to be able to use the automatic bindings directly without manual glue code.
  • Streaming performance isn’t as good as it could be.
  • Concurrency is not currently composable, which means two components doing concurrent stuff will end up blocking each other in some cases. And if you’ve virtualized one of these async interfaces, it ends up being that you have to virtualize them all.

Preview 3 will be designed to:

  • Fix these drawbacks by adding native future and stream types to Wit and components.
  • Pave the way for ergonomic, integrated automatic bindings for many languages.
  • Offer an efficient io_uring-friendly ABI.

Composable concurrency: For example, in Preview 2, we need two interfaces for HTTP, one for outgoing requests and one for incoming requests, each with different types and different signatures, Wagner said. With Preview 3, the two interfaces will be merged into a single one, i.e. the Wasi handler.

This will allow a single component to both import and export the same interface: it will be possible to import a handler for outgoing requests and export a handler to receive incoming requests. Because they use the same interface, two services can be chained and linked directly together using component linking, and executing the whole compound request becomes just an async function call, which supports modularity without requiring a microservices setup.

“Our goal by the end of this year is to complete Preview 2 milestones, which will lead to a stable, maybe beta release,” Wagner told The New Stack after WasmCon 2023.

“The idea is, once we hit this, you will continue to be able to produce Preview 2 binaries and run them in Preview 2 engines so stuff stops breaking.”

The post Can WebAssembly Get Its Act Together for a Component Model? appeared first on The New Stack.

]]>