Geert Pingen @gpgn

Cloud & Machine Learning Engineer, Research Scientist
Sustainable Computing
Sustainability
Systems Engineering
info
This is an active project. It will grow with future updates, check back soon!

Introduction

The average global temperature has been rising since the Industrial Revolution. Although there is some natural variability, overwhelming evidence indicates that human activities are causing the Earth to warm, especially those that cause greenhouse gas (GHG) emissions (most importantly carbon dioxide, methane, and nitrous oxide). And while climate change is a known problem, atmospheric concentrations of these gases have continued to increase in recent years.

The impacts of climate change include extreme weather events; changes in sea level; droughts and fresh water scarcity; erosion and other soil degradation; dying coral reefs; frequent forest fires; and ecosystems becoming uninhabitable for certain animal species. Even now, islands are disappearing due to sea level rise. Climate change is also a major threat to international peace and security, heightening competition for resources, leading to mass displacement, and fueling socio-economic tension. Critically, the poorest countries face the greatest risks from these changes, although many of the world's most climate-vulnerable countries are among the least polluting.

Many western countries and large companies are electrifying at a quick pace and adding renewable energy sources to their energy mix. This, however, is not without its issues: the Netherlands, for example, is experiencing growing pains in the form of major congestion of its power grid. These challenges threaten timely climate action.

The digital sector is currently estimated to be responsible for 2-4% of global greenhouse gas emissions. This includes both the production of digital technologies (for example computers, televisions, and smartphones) and their consumption (devices, data centres, and networks). This number is expected to grow to at least 6% by 2025, as data traffic is rising rapidly. Importantly, greenhouse gas emissions are just one example of the environmental impact of the tech sector. Other major effects exist, such as depletion of energy, water, and natural resources.

Figure 1. Carbon emission scopes. Image taken from https://go.microsoft.com/fwlink/p/?linkid=2161861.

Sustainable Computing

The growth of the tech sector is unsustainable. However, changes can be made to turn that around. Industries are being established around recycling EV batteries. Companies are taking steps to reduce their carbon footprint by transitioning to electricity from renewable energy sources. Additionally, the Power Usage Effectiveness (PUE) of large data centres is nearing 1.0, meaning that the energy consumed by data centres is being used almost solely for compute (and not for cooling and other supporting infrastructure).

These uplifting figures often end up in corporate sustainability reports, but only tell part of the story. Many companies report on Scope 1 and 2 emissions, but lack transparency on Scope 3 emissions as they are much harder to track.

As visualized in Figure 1, Scope 1 emissions are direct emissions from company-owned and controlled resources (small in the digital sector); Scope 2 emissions are indirect emissions from the generation of energy that is consumed by a company's resources (substantial in the digital sector); Scope 3 emissions are all indirect upstream and downstream emissions, e.g. those generated in the manufacturing, transport and recycling value chains.

As the carbon content of energy tends to zero, Scope 3 dominates carbon footprints. This means that much of the carbon impact of the tech sector is tied up in Scope 3 emissions. Think about the mining of raw minerals, shipping electronics around the world, and disposal of hardware. A recent Microsoft report estimated them to be up to 50 times the size of Scope 1 and 2 emissions (see Figure 2). Consequently, the usage of material resources is underreported as well.

The big cloud providers are slowly moving to expose Scope 3 carbon emission data. Microsoft and Google are currently providing detailed Scope 3 data, but Amazon currently only does this under strict NDA. These figures, however, are very coarse-grained (yearly or monthly aggregations), rather than real-time and attributable to specific processes.

It is clear that we are moving in the right direction. However, two aspects that are severely lacking in this space are observability and awareness. While using renewable energy sources is a good step forward, it is important to think about the use of that energy. For example, how much energy or carbon emission is it worth to bump the accuracy of our cat detector model a little? And how are we able to measure that accurately and precisely?

Figure 2. Microsoft 2021 Scope 1, 2, and 3 emissions.

Measuring power consumption

There are currently few tools to estimate GHG emissions (especially Scope 3 emissions) in modern IT infrastructure stacks, which are typically deployed in shared cloud environments where many processes make use of the same underlying hardware. Additionally, the IT services (compute, data storage, AI models, etc.) are used by many different tenants. This makes it difficult to track and expose individual contributions to total carbon emissions, and thus to raise awareness or tie them into local carbon budgets.

A starting point for measuring carbon emissions is measuring power consumption (and multiplying that by carbon content). One reasonably accurate proxy for actual power readings is RAPL, or Running Average Power Limit. This is an interface that, among other things, exposes the accumulated energy consumption of various system-on-chip (SoC) power domains through a set of counters. RAPL is not an analog power meter, but rather uses a software power model. This model estimates energy usage by using hardware performance counters and I/O models.

Thankfully, RAPL readings are highly correlated with actual plug power measurements. The processor has one or more so-called packages. These are part of the actual processor. Each package contains multiple cores. Each core typically has hyperthreading, meaning it contains two logical CPUs. The uncore is the part of the package outside of the cores, including various components like L3 cache, memory controller, and possibly an integrated GPU. RAM is separate from the processor. Depending on chip architecture, RAPL is exposed for the full package (PKG), the cores (PP0), an uncore device which is usually the GPU (PP1), and the main memory (DRAM). See Figure 3 for a visual overview of the power domains.

In Linux, RAPL is exposed to user space through the power capping framework, and can typically be found under /sys/class/powercap/intel-rapl.
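As a rough sketch of how these counters can be read, the snippet below samples one power zone twice and converts the difference to watts. It assumes the common powercap layout, in which a zone directory such as intel-rapl:0 contains the files name, energy_uj (a cumulative counter in microjoules), and max_energy_range_uj (the value at which that counter wraps). The helper names are illustrative, and on recent kernels reading energy_uj may require root.

```python
import time
from pathlib import Path

# Assumed location of the powercap RAPL zones; may differ per system.
RAPL_ROOT = Path("/sys/class/powercap/intel-rapl")


def energy_delta_uj(before: int, after: int, max_range: int) -> int:
    """Energy used between two counter readings, tolerating one wraparound."""
    if after >= before:
        return after - before
    # The counter passed max_range and restarted near zero.
    return (max_range - before) + after


def read_int(path: Path) -> int:
    return int(path.read_text().strip())


def sample_power_watts(zone: Path, interval_s: float = 1.0) -> float:
    """Average power in watts for one RAPL zone over interval_s seconds."""
    max_range = read_int(zone / "max_energy_range_uj")
    before = read_int(zone / "energy_uj")
    time.sleep(interval_s)
    after = read_int(zone / "energy_uj")
    joules = energy_delta_uj(before, after, max_range) / 1_000_000
    return joules / interval_s


if __name__ == "__main__":
    zone = RAPL_ROOT / "intel-rapl:0"  # assumed: the first package zone
    try:
        name = (zone / "name").read_text().strip()
        print(f"{name}: {sample_power_watts(zone):.2f} W")
    except OSError as err:
        # Typical on VMs, non-Intel hardware, or without sufficient privileges.
        print(f"RAPL not readable here: {err}")
```

Sampling the counter at an interval, rather than reading it once, matters because energy_uj is cumulative: only the difference between two readings says anything about current consumption.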

Note that a machine might have other sources of energy consumption than the CPU, such as GPUs, RAM, disks, the power supply itself, fans, cooling, and LEDs, although these typically consume (much) less energy.

To conclude, there are ways to measure power consumption. You can check this for yourself on your machine or laptop. However, things get a bit trickier in virtualized environments, for example Virtual Machines (VMs). These run on top of a hypervisor, and do not automatically have RAPL metrics exposed to them. If the hypervisor does not forward that information to the VM, there is no way to know. Unfortunately, this is a typical situation in cloud environments, where you do not have access to the underlying host machine. And cloud providers have good reason not to expose that information directly.

Figure 3. CPU power domains supported by RAPL.

From power consumption to carbon emissions

Let's say despite all of these challenges, we obtain real-time process-level metrics on power consumption. We will also disregard Scope 3 emissions for the moment. How do we then determine the carbon content (Scope 2 emissions) of that used energy? We will recap a lot of information from Adrian Cockcroft's excellent slides from QCon London 2023. Let's take a look at the calculation for Scope 2 emissions:

Scope 2 Carbon = Energy Mix * PUE * Capacity Used * Emissions factor per capacity

PUE is not well standardized, but it is at least reported by most cloud providers (and, as mentioned, nearing 1.0). And based on the previous section, we assume we know the capacity used and the emissions factor at a fine-grained level (even user-specific, which is difficult to measure in a multi-tenant environment where infrastructure is shared).
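To make the calculation concrete, here is a small sketch of one possible reading of the formula. It folds the energy mix and the emissions factor into a single weighted grid intensity (gCO2e per kWh) and multiplies that by the energy used and the PUE. The function names and the emission factors per source are assumptions chosen for illustration, not measured data.

```python
# Illustrative emission factors in gCO2e per kWh (assumed values, not real data).
EMISSION_FACTORS = {"coal": 900.0, "gas": 450.0, "solar": 40.0, "wind": 10.0}


def grid_intensity(energy_mix: dict[str, float]) -> float:
    """Weighted average gCO2e/kWh for a mix given as source -> share."""
    total_share = sum(energy_mix.values())
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError("energy mix shares must sum to 1")
    return sum(share * EMISSION_FACTORS[src] for src, share in energy_mix.items())


def scope2_grams(energy_kwh: float, pue: float, energy_mix: dict[str, float]) -> float:
    """Scope 2 estimate: IT energy scaled by PUE, times grid carbon intensity."""
    return energy_kwh * pue * grid_intensity(energy_mix)


# Hypothetical workload: 2 kWh of IT energy in a data centre with a PUE of 1.1.
mix = {"coal": 0.2, "gas": 0.3, "solar": 0.2, "wind": 0.3}
print(f"{scope2_grams(energy_kwh=2.0, pue=1.1, energy_mix=mix):.1f} gCO2e")
```

Every input here is exactly the kind of value that is hard to obtain in practice: the energy per workload, the mix at the time of execution, and the factor per source.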

Even including all these assumptions, we still run into some new challenges, for example: What is the energy mix in this specific location? Is it coal, hydro, solar, nuclear, ...? Do we have that data real-time (typically this is only known months later)? And do we have it at sufficient resolution (what if our process runs for a couple of minutes)?

The way cloud providers manage their power mix is through:

  • Standard utility contracts which are billed monthly (or delayed longer).
  • Power Purchase Agreements (PPAs) which are contracts to build and consume power generation capacity (solar, wind, etc.).
  • Renewable Energy Credits (RECs) which are purchases of renewable energy generation capacity on the open market from existing generators (which can also be claimed at a later time). There are different types of RECs (some of which do not even represent usable energy capacity, i.e. they are simply offsetting carbon emissions in another part of the world) which complicates things even more. They can also be traded for up to a year, which changes the entire calculation again, retroactively!

This makes it super challenging to determine the energy mix that was used at the time of a short-running process. We simply don't know exactly. Over time though, as the utility grid decarbonizes, this will be less of a problem. What's also great is that we are starting to see APIs that expose hourly grid mix data.

For now however, the main thing we can do is to use less energy. And for that, we need to empower the end-user by allowing them to make carbon-informed decisions.

Over time, we should move towards reporting real-time carbon metrics with open standards, to enable automated scheduling and optimization.

Tools and Ecosystems

Two main tools in the open-source cloud-native domain that focus on monitoring power consumption are Scaphandre and Kepler. Both aim to tell users how much energy is spent by a workload (process, container, or VM).

Scaphandre does this by checking, at every interval, the PIDs of all relevant processes and the RAPL output. It then uses the CPU clock time/ticks taken from /proc/{pid}/stat and divides the total joules according to that ratio. Kepler basically says this is not accurate enough because per-instruction power usage is different (so you can't simply divide according to the ratio of scheduled time). Instead, it records the instructions per process (since different instructions draw different amounts of power). It then estimates the CPU time ratio based on ML model-based coefficients for these instructions.
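The ratio-based approach can be illustrated with a few lines of code. This is a simplified sketch of the idea described above, not Scaphandre's actual implementation: the function name and the sample numbers are hypothetical, and the tick counts stand in for per-process deltas of the utime and stime fields from /proc/{pid}/stat.

```python
def attribute_energy(total_joules: float, cpu_ticks: dict[int, int]) -> dict[int, float]:
    """Split total_joules across PIDs in proportion to their CPU ticks."""
    total_ticks = sum(cpu_ticks.values())
    if total_ticks == 0:
        # Nothing was scheduled in this interval, so nothing is attributed.
        return {pid: 0.0 for pid in cpu_ticks}
    return {pid: total_joules * ticks / total_ticks for pid, ticks in cpu_ticks.items()}


# Hypothetical interval: RAPL reports 12 J, and three processes used CPU time.
ticks = {101: 30, 202: 60, 303: 10}
for pid, joules in attribute_energy(12.0, ticks).items():
    print(f"pid {pid}: {joules:.2f} J")
```

Kepler's objection is visible in the signature: the split depends only on scheduled time, so two processes with equal ticks receive equal energy regardless of which instructions they executed.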

Using these tools, and by integrating APIs like the Carbon Aware SDK (which can provide information on current and predicted GHG emissions from energy consumption), we can start to build carbon-aware and even carbon-adaptive systems.
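As a minimal sketch of what carbon-aware behaviour could look like, the snippet below picks the start time for a deferrable job from an hourly carbon intensity forecast. It does not call the Carbon Aware SDK; the forecast is a hypothetical list of (timestamp, gCO2e/kWh) pairs, and the function name is illustrative.

```python
from datetime import datetime, timedelta


def best_start(forecast: list[tuple[datetime, float]], duration_h: int) -> datetime:
    """Return the start hour whose window has the lowest mean carbon intensity.

    forecast holds consecutive hourly (timestamp, gCO2e/kWh) pairs.
    """
    if duration_h < 1 or duration_h > len(forecast):
        raise ValueError("duration must fit inside the forecast")

    def mean_intensity(i: int) -> float:
        return sum(value for _, value in forecast[i:i + duration_h]) / duration_h

    candidates = range(len(forecast) - duration_h + 1)
    return forecast[min(candidates, key=mean_intensity)][0]


# Hypothetical forecast values, not real grid data.
start = datetime(2024, 1, 1, 0, 0)
values = [420.0, 390.0, 310.0, 180.0, 150.0, 260.0]
forecast = [(start + timedelta(hours=h), v) for h, v in enumerate(values)]
print(best_start(forecast, duration_h=2))
```

A carbon-adaptive system would go one step further and change what it does (for example, training a smaller model) rather than only when it does it.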

Kepler is part of a larger ecosystem of tools developed by contributors from Red Hat, IBM, and Intel. At this time, development is still in very early stages though, and it feels more like a proof of concept. Scaphandre, on the other hand, is developed by Hubblo as part of their set of open-source sustainability tools. Hubblo is a member of Boavizta, an international inter-organizational working group aimed at evaluating the environmental impact of digital technologies. In turn, Boavizta is a partner in the Sustainable Digital Infrastructure Alliance (SDIA).

Two other important groups doing interesting work in the domain are the Kubernetes TAG on Environmental Sustainability and the Green Software Foundation.

Experiments

As we've seen, there are still a lot of open challenges in this domain. Most notably, how can we accurately map power consumption to GHG emission and expose that to end users, to empower them to make informed decisions and enable effective regulation?

This is especially difficult for Scope 3 emissions, and involves many sub-questions. For example, how much Scope 3 emissions can be attributed to 100ms (or 3s, 10min, 24h, ...) of usage of a given CPU chip? (Likely this will be tied to hardware depreciation expenses, as new hardware purchases embody new Scope 3 emissions.) And subsequently, how can we precisely track that usage in a hyper-distributed, multi-owner, multi-tenant, and multi-modality cloud world?
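One naive way to approach the first question is linear amortization: spread the embodied emissions of a piece of hardware evenly over its expected lifetime and charge each workload for the time and share of the device it used. The sketch below assumes exactly that; the function name, the embodied figure, and the lifetime are hypothetical, and real depreciation schedules are unlikely to be this simple.

```python
def amortized_embodied_grams(embodied_kg: float, lifetime_years: float,
                             usage_seconds: float, share: float = 1.0) -> float:
    """Embodied emissions (gCO2e) attributed to a slice of hardware usage.

    Assumes emissions are spread evenly over the hardware lifetime and that
    share is the fraction of the device reserved for the workload.
    """
    lifetime_seconds = lifetime_years * 365 * 24 * 3600
    return embodied_kg * 1000 * (usage_seconds / lifetime_seconds) * share


# Hypothetical server: 1000 kgCO2e embodied, 4-year lifetime, workload uses a quarter of it.
for label, seconds in [("100ms", 0.1), ("10min", 600), ("24h", 86_400)]:
    grams = amortized_embodied_grams(1000, 4, seconds, share=0.25)
    print(f"{label}: {grams:.6f} gCO2e")
```

Even this simple model needs inputs that are rarely available today: a trustworthy embodied emissions figure per device and an accurate record of who used which share of it, and when.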

The goal of this project is to experiment with the aforementioned technologies, perhaps touching on a few of these challenges, and work towards an implementation that can be used in standard ML workflows. In the meantime, I hope to use the project to get more familiar with Rust.

References

[1] NASA.gov | World of Change: Global Temperatures

[2] Malmodin, Jens, et al. "ICT Sector Electricity Consumption and Greenhouse Gas Emissions–2020 Outcome." Available at SSRN 4424264 (2023).

[3] Khan, Kashif Nizam, et al. "RAPL in Action: Experiences in Using RAPL for Power measurements." ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS) 3.2 (2018): 1-26.

[4] Lipp, Moritz, et al. "PLATYPUS: Software-based power side-channel attacks on x86." 2021 IEEE Symposium on Security and Privacy (SP). IEEE, 2021.