The Cloud SustaiNative Era: A Recap of KubeCon + CloudNativeCon NA 2023
The Cloud Native Computing Foundation's (CNCF) flagship conference, KubeCon + CloudNativeCon North America 2023, and its co-located events took place in Chicago from November 6th to 9th, attracting nearly 14,000 attendees in person and virtually. With COP28 underway, I would like to share my key takeaways and event highlights, focusing on sessions covering cloud cost optimization and the often-overlooked aspect of environmental sustainability in the cloud native realm.
But first, let's briefly touch on the broader highlights of the event. Sessions provided insights into platform engineering, emphasizing the overarching theme of "GitOpsifying all the things!" and the codification of golden paths, enabling platform engineers to present streamlined, standardized processes to development teams as self-service abstractions. Tools like Backstage, Crossplane, and Argo CD were spotlighted for their role in building internal developer platforms (IDPs), abstracting complexity, and supporting platform engineering.
Environmental Sustainability Takes Center Stage
In the keynote "Environmental Sustainability in the Cloud Is Not a Mythical Creature", Frederick Kautz, Rimma Iontel, Tammy McClellan, Marlow Weston, and Niki Manoledaki delved into the intersection of cloud technologies and environmental sustainability. The panel discussion kicked off with a profound question: how much power does it take to run software? Unsurprisingly, it is a complex question to answer simply. While measuring and reducing the energy and carbon footprints of software is not yet a widespread practice, the momentum is growing. Among the initiatives highlighted by the panelists are the Green Software Foundation's Software Carbon Intensity (SCI) specification project and the ongoing efforts within the CNCF TAG Environmental Sustainability to quantify the environmental impact of software.
“We need stubborn optimism to create spaces, find solutions, empower each other, and create change as a community.” – Niki Manoledaki, Grafana Labs
According to Intel estimates, over 50% of greenhouse gas (GHG) emissions in data centers result from infrastructure and software inefficiencies. These inefficiencies were exemplified by a case in which a software solution's carbon footprint was slashed by 45% through the simple act of right-sizing the VMs it used. A notable highlight was the carbon-aware KEDA (Kubernetes Event-driven Autoscaling) operator, which dynamically adjusts scaling behavior based on carbon intensity data.
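The carbon-aware KEDA operator caps an application's scale according to how clean the grid currently is. The real operator is configured declaratively with intensity-to-replica mappings; the function below is only a rough Python sketch of the thresholding idea, and its name and mapping shape are illustrative rather than the operator's actual API:

```python
def allowed_max_replicas(carbon_intensity_gco2_kwh: float,
                         ceiling_mappings: list[tuple[float, int]],
                         floor_replicas: int) -> int:
    """Return the replica ceiling for the current grid carbon intensity.

    ceiling_mappings holds (intensity_threshold, max_replicas) pairs;
    the lowest threshold at or above the current intensity wins. If the
    grid is dirtier than every threshold, fall back to floor_replicas.
    """
    for threshold, max_replicas in sorted(ceiling_mappings):
        if carbon_intensity_gco2_kwh <= threshold:
            return max_replicas
    return floor_replicas
```

With mappings like (100, 10), (300, 5), (500, 2), a dirtier grid yields a smaller replica cap, so deferrable capacity shrinks exactly when carbon intensity is high.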
The conversation continued into another session, where Marlow Weston and Niki Manoledaki shared updates on the CNCF Environmental Sustainability TAG activities.
A key highlight is the work being carried out by the recently created Green Reviews Working Group, which adopts the SCI specification as its guiding principle. The working group collaborates with the maintainers of CNCF projects, using tools like Kepler to gather energy metrics and review carbon footprints. The first project under review is Falco, with plans to share data through the CNCF's Grafana dashboard available at https://devstats.cncf.io. A series of open questions was outlined, inviting the community to contribute, from building infrastructure with external contributors to deciding on load-testing tools and cadence. The call to action resonated: get involved! Regular meetings are held every second and fourth Wednesday of the month, complemented by a dedicated Slack channel for asynchronous collaboration.
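The SCI specification reduces to a simple formula: operational emissions (energy consumed times grid carbon intensity) plus embodied hardware emissions, divided by a functional unit such as requests served. A minimal sketch of that calculation:

```python
def software_carbon_intensity(energy_kwh: float,
                              grid_intensity_gco2e_per_kwh: float,
                              embodied_gco2e: float,
                              functional_units: float) -> float:
    """SCI = ((E * I) + M) / R, per the Green Software Foundation spec:
    E = energy consumed (kWh), I = grid carbon intensity (gCO2e/kWh),
    M = embodied hardware emissions (gCO2e), R = functional unit
    (e.g. number of requests, users, or jobs)."""
    operational = energy_kwh * grid_intensity_gco2e_per_kwh
    return (operational + embodied_gco2e) / functional_units
```

For example, 2 kWh consumed on a 100 gCO2e/kWh grid with 50 gCO2e of amortized embodied carbon, spread over 1,000 requests, gives an SCI of 0.25 gCO2e per request.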
Maximizing Cloud Cost Efficiency through FinOps Strategies
Nathan Taber, Rachel Leekin, and Antoinette Mills of AWS covered cost optimization techniques in the sessions "Sponsored Keynote: Reduce, Reuse, Recycle" and "Kube-Costbusters: Optimizing Kubernetes Clusters for Efficiency and Epic Savings!".
Decisions about how systems are built and run, and the inefficient choices made along the way, can lead to significant monetary and resource costs.
Both sessions featured Karpenter, an open-source cluster autoscaler created by AWS. Karpenter optimizes resource utilization by identifying and consolidating underutilized nodes, for example by moving pods from larger instances to smaller ones and decommissioning the now-underused nodes. Karpenter was further spotlighted in the talk "Rapidly Scaling for Breaking News with Karpenter and KEDA", presented by Mel Cone and Deepak Goel of The New York Times. To handle traffic spikes without over-provisioning, Karpenter dynamically provisions nodes, using a bin-packing algorithm to find an efficient way to pack pods onto them.
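Karpenter's actual consolidation logic is price- and constraint-aware, but the underlying bin-packing idea can be illustrated with a classic first-fit-decreasing sketch (a simplification, assuming a single CPU dimension and uniform node capacity):

```python
def pack_pods(pod_cpus: list[float], node_capacity: float) -> list[list[float]]:
    """First-fit-decreasing bin packing: place each pod (largest first)
    onto the first node that has room, opening a new node only when no
    existing node can fit the pod."""
    nodes: list[list[float]] = []
    for cpu in sorted(pod_cpus, reverse=True):
        for node in nodes:
            if sum(node) + cpu <= node_capacity:
                node.append(cpu)
                break
        else:  # no existing node had room
            nodes.append([cpu])
    return nodes
```

Here, six pods needing 8 vCPUs in total pack onto two 4-vCPU nodes instead of spreading across six underutilized ones.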
More tips for optimizing costs include performing regular audits; monitoring what matters and collecting only what is needed; setting resource limits and quotas; deleting load balancers that are no longer needed; right-sizing resources; addressing hidden costs caused by orphan pods and unmounted PVs; moving compliance-required logs to cost-effective storage; caching the latest images; and enabling topology-aware routing to keep traffic in the zone where it originated.
GreenOps in Action
In their presentation “Cutting Climate Costs with Kubernetes and CAPI”, Shiva Rezaie and Steve Francis from Sidero Labs showcased a promising approach to minimize carbon footprint and cut climate costs. Here are the key highlights from their talk:
A focal point of their talk was the need for a custom, emissions-aware scheduler. Because schedulers can be assigned to specific deployments, the default scheduler can keep running critical workloads and the control plane while the custom scheduler handles less critical workloads.
Diving into the technicalities of the scheduler, the presenters outlined its three components: the scheduling logic, the pod manager responsible for evicting pods, and the node manager that powers nodes on and off. A custom scheduler is created with plugins that change its behavior at different extension points (queue sorting, pre-filter, filter, and so on); any extension point not overridden by a plugin falls back to the default scheduler's behavior. The project checks the emissions index of the data center's grid area, a scale from 0 to 100, where 0 denotes entirely renewable, clean energy and 100 the dirtiest. The scheduler runs a pod if its priority is higher than the current emissions index, while lower-priority pods (workloads with periodic demand that can be time-shifted, like batch jobs) run only when the energy is completely renewable.
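The admission rule described above condenses into a few lines. A minimal sketch, assuming the emissions index and pod priority share the same 0-100 scale described in the talk (function and parameter names are illustrative):

```python
def should_schedule(pod_priority: int, emissions_index: int,
                    time_shiftable: bool) -> bool:
    """Emissions-aware admission sketch. The grid emissions index runs
    from 0 (fully renewable) to 100 (dirtiest). Time-shiftable workloads
    such as batch jobs wait for fully renewable energy; everything else
    runs when its priority exceeds the current emissions index."""
    if time_shiftable:
        return emissions_index == 0
    return pod_priority > emissions_index
```

A priority-80 pod is admitted at an emissions index of 50, while a priority-30 pod waits for a cleaner grid, and a batch job waits for an index of exactly 0.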
The discussion further delved into priority-based strategies: with the MostAllocated strategy, high-priority pods will (in the simple case) tend to run on the same node, optimizing node allocation and facilitating efficient resource utilization. The presenters also noted considerations for shutting down racks strategically to conserve energy, underlining the importance of annotating nodes with their rack locations and having the node manager prioritize common rack locations first. They also addressed time-shifted workloads, suggesting that less time-sensitive tasks, like batch jobs, could be scheduled to run only when renewable energy is available, minimizing environmental impact.
As the session drew to a close, intriguing ideas popped up for future enhancements, such as implementing a mechanism to ensure a maximum wait time for jobs and creating an audit trail to visually represent energy savings achieved over time.
In another session, "Sustainability and Efficiency: Environmentally Friendly Software Development with Kube-Green", Davide Bianchi of Mia-Platform provided insights into green software development. Highlighting the concept of green software, Bianchi touched upon its carbon-efficient nature: it strives to minimize carbon emissions. Drawing an analogy with eco modes in appliances, an application's performance can be dynamically adjusted to reduce carbon emissions, offering users the option to switch to an eco mode when the application is running on less environmentally friendly energy sources.
At the heart of the talk is Kube-Green, an open-source operator designed to cut the carbon footprint of Kubernetes clusters. The operator can be installed from a single file with 'kubectl apply' and is configurable, allowing users to, for instance, run workloads during working hours and shut them down otherwise.
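As a minimal sketch of the policy such a configuration expresses (the real operator declares sleep and wake schedules declaratively in a custom resource rather than in code, and the hours below are illustrative):

```python
from datetime import datetime

def should_be_awake(now: datetime,
                    wake_hour: int = 8, sleep_hour: int = 18) -> bool:
    """Kube-Green-style policy sketch: keep workloads running on
    weekdays during working hours and scaled to zero otherwise."""
    is_weekday = now.weekday() < 5  # Monday=0 .. Friday=4
    return is_weekday and wake_hour <= now.hour < sleep_hour
```

A reconciliation loop applying this predicate would scale deployments up at 08:00 on weekdays and back down at 18:00 and over weekends.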
AI and Sustainability: Friend or Foe?
In the session "Environmentally Sustainable AI via Power-Aware Batch Scheduling", presented by Atanas Atanasov from Intel and Daniel Wilson from Boston University, the focus was on addressing energy consumption in the context of training large language models (LLMs). Atanasov pointed out that while inference often takes the spotlight, training is also a significant contributor to energy usage.
Drawing inspiration from power management techniques employed in High-Performance Computing (HPC), the speakers proposed extending these strategies to cloud computing systems, specifically for batch-oriented training of AI models. The solution presented aims to improve performance per watt for LLM training. Its key component is a pod-scheduling algorithm designed to minimize power usage and implement power capping for AI workloads within Kubernetes. The solution leverages the Kueue batch framework to configure power limits and utilizes the built-in sidecar containers feature introduced in Kubernetes v1.28, which gives the power-capping mechanism a place to run alongside applications without interfering with them.
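The talk's actual capping mechanism lives in a sidecar and drives hardware interfaces, but the feedback idea behind power capping can be sketched as a simple control step (all names, the step size, and the floor are illustrative assumptions, not details from the talk):

```python
def next_power_cap(current_cap_w: float, measured_power_w: float,
                   budget_w: float, step_w: float = 5.0,
                   floor_w: float = 50.0) -> float:
    """One iteration of a power-capping control loop: tighten the cap
    when measured power exceeds the budget, relax it (never past the
    budget) when there is headroom. Real implementations would apply
    the resulting cap via an interface such as RAPL."""
    if measured_power_w > budget_w:
        return max(floor_w, current_cap_w - step_w)
    return min(budget_w, current_cap_w + step_w)
```

Run periodically against a power meter, this converges the node toward its power budget while never starving it below a safe floor.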
Moving on to another thought-provoking session titled "Sustainable Scaling of Kubernetes Workloads with In-Place Pod Resize and Predictive AI" presented by Vinay Kulkarni from eBay and Haoran Qiu from UIUC, the conversation pointed out the economic and resource implications of overprovisioning. Citing statistics from Gartner and CAST AI's The State of Kubernetes Report, the speakers underscored the substantial dollar cost of overprovisioning and the underutilization of CPU resources in cloud-native applications.
"37% of CPUs for cloud-native applications are never used, on average." – CAST AI's The State of Kubernetes Report 2022
The session showcased the in-place pod resize feature, a notable addition available starting from Kubernetes v1.27. This feature allows the dynamic adjustment of pod resources without restarts, which opens the door to sustainable, multidimensional pod autoscaling, where reinforcement learning can play a role in further improving efficiency.
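As a toy illustration of how a right-sizing recommendation for such an in-place resize might be derived from observed usage (the percentile and headroom values are arbitrary assumptions, not figures from the talk):

```python
def recommend_cpu_request(usage_samples_mcpu: list[int],
                          headroom: float = 1.2) -> int:
    """Right-sizing sketch: recommend a new CPU request (in millicores)
    at roughly the 95th percentile of observed usage plus headroom.
    With in-place pod resize, the recommendation could be applied by
    patching the running pod's resources without a restart."""
    ordered = sorted(usage_samples_mcpu)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    return round(p95 * headroom)
```

An overprovisioned pod requesting 1000m that only ever uses around 100m would be resized down to 120m, reclaiming the difference for other workloads.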
In both sessions, the speakers not only highlighted the current challenges but also presented innovative solutions and practical approaches to foster environmentally conscious practices.
Kepler: Revolutionizing Cloud Efficiency
During the sessions "Energy Observability Using Kepler: Revolutionizing Cloud Efficiency" and "Kepler: Project Update and Deep Dive", Marcelo Amaral, Sally O'Malley, and Tatsuhiro Chiba shared the latest developments in Kepler, an observability framework designed for monitoring power within and beyond Kubernetes. Leveraging eBPF, Kepler collects metrics from different APIs, including Intel's RAPL API, which exposes CPU power consumption; regression algorithms are then applied to these metrics to create the power model. For bare-metal nodes that do not expose real-time power consumption data, hardware sensors are employed to collect real-time power metrics and resource utilization.
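The regression step can be illustrated with an ordinary least-squares fit of a linear power model. Kepler's real models are trained on richer features than a single utilization signal, so this is only a sketch of the idea:

```python
def fit_power_model(cpu_utilization: list[float],
                    measured_power_w: list[float]) -> tuple[float, float]:
    """Least-squares fit of P = idle + slope * utilization, in the
    spirit of training a power model from RAPL readings paired with
    resource-utilization samples. Returns (idle watts, watts per unit
    of utilization)."""
    n = len(cpu_utilization)
    mean_x = sum(cpu_utilization) / n
    mean_y = sum(measured_power_w) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(cpu_utilization, measured_power_w))
    var = sum((x - mean_x) ** 2 for x in cpu_utilization)
    slope = cov / var
    idle = mean_y - slope * mean_x
    return idle, slope
```

Once fitted, such a model lets an agent attribute estimated power to individual containers from their utilization share, even on nodes without per-process power counters.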
The sessions underlined Kepler's recent induction as a CNCF sandbox project, symbolizing its growing significance in the cloud native ecosystem. To further reduce the overhead, the speakers also stressed the importance of adjusting the default Prometheus configuration, opting for increased intervals between metric scrapes. Both presentations collectively highlighted Kepler's pivotal role in simplifying the power usage estimation of containers, making it a key player in the evolving landscape of energy-aware observability within Kubernetes environments.
Lastly, in the lightning talk titled "Towards Greener Deployment: Evaluating Energy Efficiency in Argo CD and Traditional CD Pipelines," Al-Hussein Hameed Jasim of Tetra Pak explored the energy footprint of traditional CD pipelines, exemplified by GitHub Actions, versus the GitOps approach with Argo CD. This session focused on revealing power consumption patterns in these deployment methodologies, emphasizing the importance of measurement to drive improvements and adopt more sustainable deployment practices in the cloud native realm.
Wrapping up
As we bid farewell to KubeCon + CloudNativeCon events of the year, the imperative to measure, mitigate, and innovate for a greener future forges the path forward. The cloud native community's commitment to sustainability, efficiency, and innovation echoes loudly, setting the stage for a future where technology and the environment coexist harmoniously.