The Carbon Footprint of LLMs — A Disaster in Waiting?

Nathan Bailey
16 min read · Sep 2, 2024


Extinction Rebellion Protesting Outside a Data Centre [2]

The Carbon Footprint of Pretraining. This single header in Meta’s LLaMA 2 paper [1] intrigued me about the carbon footprint of LLMs. Previously I had seen article headlines about the carbon impact of training LLMs and other large AI models, but I did not know the scope of these models’ energy demands, or just how large the problem is or could become.

I delved into some research on the topic and went down a somewhat shallow rabbit hole. To clarify, I am by no means an expert on the environment or on LLMs, but I thought I’d type up my notes in a blog post to hopefully present some of this information simply and concisely. I aim to spread awareness of the energy impact of LLMs; this is not to say we should stop using them. Frankly, the genie is out of the bottle for LLMs, and trying to stop individuals and companies from using them on carbon grounds would be silly and impractical.

LLMs — Where does the energy usage come from?

Carbon emissions arising from the use of LLMs can be split into two categories: embodied emissions and operational emissions. Operational emissions are caused by actually using the LLM, in either training or inference, whereas embodied emissions come from manufacturing the hardware the LLM runs on [3].

Embodied Emissions

Embodied emissions are not necessarily the first factor that comes to mind when considering the energy impact of LLMs. They were left out when Meta calculated the carbon footprint of the LLaMA 2 model and are left out of other works too. However, they are a considerable factor when tallying up the true carbon impact of a model. One paper estimates that embodied carbon can be around 24%-35% of the overall carbon footprint of an LLM [3].

As previously stated, embodied emissions arise from the carbon emitted during the manufacturing of the devices used to train and run inference. This does not just include the GPUs or TPUs. One must also consider the servers used, as well as storage, network and devices used to cool and heat the data centre [4].

Luccioni et al [4] calculated the carbon footprint of Hugging Face’s LLM BLOOM. They estimated that each server had an embodied footprint of 2500 kg of CO2 eq and each GPU a footprint of 150 kg of CO2 eq. Assuming a hardware replacement rate of 6 years, and given the training time of 1.08 million GPU hours across 384 GPUs on 48 nodes, this gave a total embodied carbon of 11.2 tonnes of CO2 eq. Due to the difficulty of measuring it, this figure did not include the embodied carbon of the wider infrastructure (storage, network, cooling etc.). The total footprint of BLOOM was 50 tonnes of CO2 eq, making the embodied carbon roughly 22% of the model’s total, in line with the estimates above.
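As a back-of-the-envelope check, the amortisation above can be sketched in a few lines of Python. The per-device footprints and 6-year lifetime are the paper’s figures; the simple linear amortisation over device lifetime is my own assumption.

```python
# Rough amortised embodied-carbon estimate for BLOOM's training run,
# using the per-device figures reported by Luccioni et al [4].
# The linear amortisation over device lifetime is a sketch assumption.

SERVER_EMBODIED_KG = 2500   # embodied CO2 eq per server
GPU_EMBODIED_KG = 150       # embodied CO2 eq per GPU
N_SERVERS = 48
N_GPUS = 384
GPU_HOURS = 1.08e6          # total GPU hours of training
LIFETIME_YEARS = 6          # assumed hardware replacement rate

def amortised_embodied_kg() -> float:
    total_embodied = N_SERVERS * SERVER_EMBODIED_KG + N_GPUS * GPU_EMBODIED_KG
    wall_clock_hours = GPU_HOURS / N_GPUS            # ~2812 hours of training
    lifetime_hours = LIFETIME_YEARS * 365 * 24
    # Charge the run only for the fraction of the hardware's life it used
    return total_embodied * wall_clock_hours / lifetime_hours

print(f"{amortised_embodied_kg() / 1000:.1f} tonnes CO2 eq")
```

This crude amortisation lands at roughly 9.5 tonnes, in the same ballpark as the paper’s reported 11.2 tonnes; the gap comes from details of how the paper attributes hardware time, which this sketch ignores.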

Operational Emissions

Operational emissions arise from actually using the model, in both training and inference mode.

Due to the nature of ML research, training a model can use a significant amount of energy. This can be split into two components: dynamic power consumption and idle power consumption [4].

Dynamic power consumption covers the emissions from the main accelerators used to train the model. It is determined by multiplying the number of GPU hours by the TDP of the GPUs and by the carbon intensity of the energy grid powering the hardware. For example, BLOOM, Hugging Face’s open-source model, was trained for 1.08 million GPU hours on a hardware partition of A100 GPUs with 80GB of memory and a TDP of 400W. This represents a consumption of 433,195 kWh of electricity during training. Multiplied by the carbon intensity of the energy grid, approximately 57 gCO2 eq/kWh, this gives a total of 24.69 tonnes of CO2 eq [4].
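That dynamic-consumption arithmetic is simple enough to sketch directly; the figures below are the BLOOM values reported in [4], with the GPU-hour count rounded to 1.08 million.

```python
# Dynamic operational emissions = GPU hours x TDP x grid carbon intensity,
# using the BLOOM figures from Luccioni et al [4].

GPU_HOURS = 1.08e6            # total GPU hours of training (rounded)
TDP_KW = 0.4                  # A100 80GB TDP: 400 W
CARBON_INTENSITY = 57.0       # gCO2 eq per kWh (mostly-nuclear French grid)

energy_kwh = GPU_HOURS * TDP_KW               # ~432,000 kWh
emissions_tonnes = energy_kwh * CARBON_INTENSITY / 1e6

print(f"{energy_kwh:,.0f} kWh -> {emissions_tonnes:.1f} tonnes CO2 eq")
```

With the rounded GPU-hour figure this comes out at ~24.6 tonnes, matching the paper’s 24.69 tonnes (computed from the exact hour count) to within rounding.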

Idle power consumption covers the broader infrastructure that maintains and connects the GPUs, e.g. network, storage, cooling and heating: the overhead without which training cannot take place. Hugging Face estimated this at 256,646 kWh, and subsequently 14.6 tonnes of CO2 eq [4].

How do different LLMs stack up when training?

Unfortunately, it is hard to directly compare the energy impact, and therefore the operational carbon impact, of training different LLMs due to a lack of publicly available information. Luccioni et al [4] contrast the energy impact and carbon emissions of different LLMs. Some numbers have to be inferred (shown in italics), and the comparison initially considers only dynamic power consumption. The authors then estimate the combined dynamic and idle power consumption using the Power Usage Effectiveness (PUE) of a data centre, a multiplier capturing the energy needed for idle consumption (overhead). However, it is important to note that PUE is not a complete measure of the energy consumed by data centre infrastructure [4].

The results are shown in the figure below. The emissions of BLOOM are the smallest of the models compared, and significantly smaller than those of Gopher and GPT-3. To put the numbers into perspective, 25 tonnes of CO2 eq is roughly equivalent to 30 flights between London and New York.

Interestingly, BLOOM used more energy than OPT but resulted in a smaller carbon impact. This was because it was trained on a French supercomputer that is mostly powered by nuclear energy [5].

Comparison of Energy and Carbon Impact of Different LLMs [4] (italics are inferred numbers)

Strubell et al [6] also performed a comparison of a few different types of models back in 2019. They looked at two transformer models (big and base), ELMo, BERT and GPT-2. The models were trained and their carbon footprint was calculated.

They also used the big transformer model to perform a neural architecture search (NAS) which optimises a model through trial and error by tweaking the network’s design [7].

Their calculation of power consumption differs slightly from Luccioni et al. The total power consumption pt (in kWh) is estimated as:

pt = 1.58 · t · (pc + pr + g·pg) / 1000

Where:

  • t is the total training time in hours.
  • pc is the average power draw of the CPU sockets.
  • pr is the average power draw from DRAM.
  • pg is the average power draw of a GPU, multiplied by the number of GPUs g.
  • The factor 1.58 is the PUE coefficient.
Formula for Power Consumption of Training a Model [6]

The total power consumption can then be multiplied by the average CO2 eq produced per kWh of electricity consumed in the US (0.954 lbs of CO2 eq per kWh in [6]). This takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewables) used to produce energy in the United States.
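As a sketch, the formula and the US-grid conversion from [6] can be written as follows; the input values in the example are purely illustrative, not taken from any of the models above.

```python
# Strubell et al's [6] estimate of training power and emissions:
#   p_t (kWh) = 1.58 * t * (p_c + p_r + g * p_g) / 1000
#   CO2 eq (lbs) = 0.954 * p_t   (average US grid mix)

PUE = 1.58
US_LBS_CO2_PER_KWH = 0.954

def training_kwh(hours: float, cpu_w: float, dram_w: float,
                 n_gpus: int, gpu_w: float) -> float:
    """Total power consumption of a training run, in kWh."""
    return PUE * hours * (cpu_w + dram_w + n_gpus * gpu_w) / 1000

def training_co2_lbs(kwh: float) -> float:
    """Emissions for that consumption on the average US grid, in lbs CO2 eq."""
    return US_LBS_CO2_PER_KWH * kwh

# Illustrative (made-up) numbers: 100 h on 8 GPUs drawing 300 W each,
# with 100 W of CPU and 50 W of DRAM draw.
kwh = training_kwh(hours=100, cpu_w=100, dram_w=50, n_gpus=8, gpu_w=300)
print(f"{kwh:.1f} kWh, {training_co2_lbs(kwh):.1f} lbs CO2 eq")
```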

The results are shown in the figure below (CO2 eq is in lbs).

CO2 eq Results [6]

We can see that the largest model in terms of carbon emissions is BERT, with 1438 lbs (652 kg) of CO2 eq. This is roughly equivalent to a trans-American flight [6].

Performing a neural architecture search is incredibly costly in terms of carbon, producing 630 thousand pounds (285 thousand kg) of CO2 eq, while only increasing the BLEU score by 0.1 [6]. To put this in perspective, the graphic below shows the cost of the NAS compared to other carbon footprint benchmarks.

Comparing a NAS Model to Common Benchmarks [7]

An important note they make is that a single training run is normally the bare minimum amount of work: AI researchers developing a new model from scratch can require many rounds of training. For example, building and testing a final paper-worthy model required training 4789 models over 6 months, equating to 78,000 pounds of CO2 eq [7].

What about inference?

We have covered training an LLM, but what about inference? A single inference is much less energy-demanding than training [8]. However, training is performed only once, while ChatGPT, for example, sees 10 million users per day [8]. It is therefore only a matter of time before the cumulative energy impact of inference reaches that of training; the question is how long it will take.

Luccioni and Strubell look at this in their recent paper [8]. They find that the break-even point is hundreds of millions of inferences for a single training run. For example, it takes roughly 590 million inferences for the largest BLOOMz model (7 billion parameters) to reach the energy usage of its training. For the smallest model (560 million parameters), it takes 200 million inferences.

BLOOMz Models and Their Inference Parity [8]

Given that ChatGPT has 10 million users per day and saw 1.7 billion visits in October 2023 alone, it could take as little as a few weeks or months of deployment for inference energy costs to surpass training energy costs. It is therefore just as important to focus on reducing inference energy costs as training energy costs [8].
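The break-even arithmetic is easy to sketch. Assuming, purely for illustration, that each of ChatGPT’s 10 million daily users issues a single query against a model with BLOOMz-7B’s parity figure:

```python
# Days until cumulative inference energy matches training energy, for the
# BLOOMz-7B parity figure reported in [8]. The one-query-per-user-per-day
# traffic model is a deliberately crude assumption.

PARITY_INFERENCES = 590e6       # inferences needed to match training energy
QUERIES_PER_DAY = 10e6          # ~ChatGPT's reported daily users

days_to_parity = PARITY_INFERENCES / QUERIES_PER_DAY
print(f"{days_to_parity:.0f} days")   # 59 days, i.e. about two months
```

Even under this crude traffic model, inference overtakes training in about two months, which is what makes the "weeks or months" estimate above plausible.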

Does the task and model matter?

For inference, it is also important to consider which tasks are being performed, as different tasks have vastly different energy costs.

This was the main focus of Luccioni and Strubell’s paper [8], which examined the carbon emissions of a range of AI models and their associated tasks at inference time. The energy impact of different tasks was compared, as was that of generic models versus task-specific models.

Ten ML tasks were chosen. For each task, 8 task-specific models and 3 datasets were chosen. For each model, 1000 inferences were run for each of the 3 datasets. All experiments were run on an A100 NVIDIA GPU hosted on AWS in the US-West-2 region which has an average carbon intensity of 297.6 grams CO2 eq per kWh [8]. The results for each of the tasks across all models are shown below.

Energy Impact of Different Tasks [8]

The key takeaways from these results are:

  • Classification tasks for images and texts are on the lower end of the energy spectrum.
  • Generative tasks such as text generation and summarisation use an average of 10x more energy.
  • Image generation is the most energy-intensive task and tasks involving images are more energy-intensive than tasks involving text.
  • The length of text generated also impacts energy usage. Text generation uses 15 times more energy than masked language modelling. This makes sense given masked language modelling generates a single token. However, text generation in this case generates 10 new tokens for an input text.

Given these results, we can see the wide difference in energy impact, and thus carbon impact, between different tasks. To add some perspective, the most efficient text generation model used as much energy as 9% of a full smartphone charge for 1000 inferences, whereas the least efficient image generation model used as much energy as 522 smartphone charges, meaning each inference used as much energy as roughly half a smartphone charge [8].

Luccioni and Strubell also looked at model size and energy impact, finding that as model size increased, so did energy usage and thus emissions. However, it is clear from the graph that the task is still the dominant factor in the variation of emissions.

Model Size vs Model Emissions [8]

The investigation above looked at task-specific models, that is, models fine-tuned to carry out a specific task. In the same paper, they also investigated the difference in energy impact between these task-specific models and general-purpose models (models trained for multiple tasks).

A subset of the tasks was chosen: text classification, extractive question answering, and summarisation. The results show that task-specific models have lower emissions than both multi-purpose sequence-to-sequence and decoder-only generative models. As seen from the graph, the difference is several orders of magnitude. This is because the task-agnostic models are trying to do many things at once, such as generate, classify and summarise text, instead of being fine-tuned for one task [8].

General-Purpose Models vs Task-Specific Models [8]

They also found differences among the general-purpose models themselves: size and emissions are correlated, with smaller models emitting less carbon and using less energy. However, when models of the same size are compared, sequence-to-sequence models are more efficient than their decoder-only counterparts.

Decoder vs Sequence-to-Sequence Models [8]

They find that for both types of model, an increase in the number of output tokens increases energy usage, although the effect is more prominent for decoder-only models. The gap in energy usage between decoder-only and sequence-to-sequence models widens as the output token length increases.

Output Length of Model vs Model Emissions [8]

What can be done about this?

The main conclusion so far is that training LLMs uses a significant amount of energy: an estimated 502 tonnes of CO2 eq for GPT-3 (175 billion parameters) [4]. Given that GPT-4 was even larger, at an estimated 1.76 trillion parameters, training it would have had an even larger energy impact, increasing the environmental cost further still [9]. Moreover, we should not rule out the energy impact of inference either; as seen, it is significant and can overtake the energy impact of training remarkably quickly.

We should also not be oblivious to the embodied carbon impact; as seen, it can make up a significant percentage of the carbon footprint of an LLM. Furthermore, data centres are making the move to renewables (for example, 97% of the energy used in Meta’s data centres is renewable [3]), so the embodied carbon footprint of LLMs will come to dominate their total footprint. Currently, carbon-free electricity in Taiwan and South Korea, where most chips are produced, sits at only 6%. The fabs there plan to increase their carbon-free energy by 40% and 20% respectively by 2030. However, even assuming that fab demand is unchanged (which is unrealistic) and that renewable energy supply increases by 20%, the industry will miss its goal of reducing emissions by 45% by 2030 [10].

The question then is: what can be done about this? Even if 100% of training and inference energy were powered by renewables, the energy impact of LLMs should still be examined and, where possible, reduced, as this would free up renewable energy for use elsewhere and speed up the transition to net zero. Moreover, there is currently little public information on the energy impact of these models, and no standardised method for reporting their carbon impact. If AI developers were more aware of the carbon impact of these models, more climate-conscious decisions could be made when developing and integrating them into products and services.

LLMCarbon

LLMCarbon is a tool proposed by Faiz et al [3], which calculates the carbon footprint of an LLM. It factors in both the embodied and operational emissions of the LLM.

Having a tool like this is important as it allows developers to compare and contrast the carbon emissions of various LLMs and make informed decisions. It also provides an estimate of an LLM’s carbon emissions before it is trained. If the estimate can provide insights into metrics such as test loss, training duration and inference latency, then a trade-off between test loss and carbon emissions can be made (since a larger model normally results in better performance).

The tool predicts the operational carbon emissions of an LLM using a few stages:

  • The parameter model estimates the LLM’s parameter count based on its architectural attributes. Alternatively, the parameter count can be inputted directly.
  • Test loss is calculated using a neural scaling law, based on the parameter count and the training token count.
  • The FLOP model estimates the volume of FLOPs required for LLM processing.
  • From the parameter count, the optimal parallelism settings are generated.
  • Taking into account the parallelism settings and the hardware configuration, the hardware efficiency model computes the hardware efficiency.
  • Finally, using the FLOP count, data centre details and hardware efficiency, LLMCarbon applies a carbon model to derive the LLM’s operational carbon footprint.

Similarly, using the hardware configuration, LLMCarbon’s embodied carbon model provides the LLM’s embodied carbon footprint.
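A heavily simplified sketch of this pipeline might look as follows. The 6 × parameters × tokens FLOP estimate is the standard transformer-training approximation; the efficiency, TDP, PUE and carbon-intensity values are placeholder assumptions of mine, not LLMCarbon’s fitted models.

```python
# Toy version of LLMCarbon's [3] operational pipeline: parameter count
# -> training FLOPs -> GPU hours -> energy -> carbon. All constants below
# are illustrative assumptions, not the tool's actual models.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens        # standard transformer estimate

def gpu_hours(flops: float, peak_flops: float, efficiency: float) -> float:
    # Achieved throughput = peak throughput x hardware efficiency
    return flops / (peak_flops * efficiency) / 3600

def operational_co2_tonnes(hours: float, tdp_kw: float, pue: float,
                           g_per_kwh: float) -> float:
    return hours * tdp_kw * pue * g_per_kwh / 1e6

# Hypothetical 7B-parameter model trained on 2T tokens on A100s
# (312 TFLOPS peak) at an assumed 40% hardware efficiency.
flops = training_flops(7e9, 2e12)
hours = gpu_hours(flops, peak_flops=312e12, efficiency=0.4)
co2 = operational_co2_tonnes(hours, tdp_kw=0.4, pue=1.1, g_per_kwh=300)
print(f"{hours:,.0f} GPU hours, {co2:.1f} tonnes CO2 eq")
```

Under these placeholder assumptions, the hypothetical model comes out at around 187,000 GPU hours and roughly 25 tonnes of CO2 eq; LLMCarbon’s value of a pre-training estimate is exactly that such numbers can be compared across designs before any energy is spent.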

This process is shown in the diagram below:

LLMCarbon Flow [3]

To test the tool, LLMCarbon was used to compute the operational footprints of five LLMs. In the worst case, the computed footprint differs from the actual value by 8.2%.

Operational Carbon Emissions [3]

For the embodied footprint, LLMCarbon was used to compute the embodied footprint of Meta’s XLM model, resulting in a small error rate of 3%.

Embodied Carbon Emissions [3]

This tool is incredibly useful and a necessary step towards a unified framework for easily calculating the carbon emissions of an LLM before training it.

AI Energy Star Rating

LLMCarbon is a useful tool for estimating the carbon impact of an LLM. However, it is infeasible for developers to run this calculation across a wide range of candidate models, and the general public cannot be expected to use such a tool when deciding which model to use in a chatbot.

The AI Energy Star project [2] is inspired by Energy Star ratings, which provide customers with a straightforward measure of the energy consumption of washing machines and cars for example. The Energy Star rating programme has helped to achieve more than 4 billion tonnes of greenhouse gas reductions over the past 30 years [2].

The goal of the AI Energy Star project is to provide an energy rating for LLMs to help users choose the most appropriate models for their use case quickly. The project will launch a leaderboard website with a testing platform that can be used to compare and benchmark models as they come out.

Data Centres

Given that we now have two useful tools for providing insight into the emissions of an LLM, what are some of the steps that we can take to reduce the emissions at a data centre level?

In a broader paper, Lee et al [10] consider an ecosystem for more sustainable computing in data centres.

Renewables are key to dealing with this energy demand. Sustainable data centres will need to delay computation when carbon-free energy is scarce and boost it when carbon-free energy is abundant [10].

We will need to employ demand response frameworks to modulate energy demand in response to the carbon-free energy supply. The data centre should receive information about energy supply and be given real-time prices that incentivise it to modulate energy use. Tokens could be used to modulate energy usage: more tokens could be required when carbon-free energy is scarce, and more tokens allocated to users who delay their jobs during such periods [10].

A data centre modulating its energy use would necessitate load shifting for developers: training could be paused and saved using model checkpoints, and moved between data centres depending on the availability of renewable energy. Tracking and monitoring the energy usage of LLMs during training and operation is key for this to work [11].
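The checkpoint-and-shift idea can be sketched as a simple carbon-aware training loop. The threshold policy and the hourly intensity values below are illustrative assumptions, not a production scheduler:

```python
# Sketch of carbon-aware load shifting: pause (checkpoint) training when
# grid carbon intensity is above a threshold and resume when it drops.
# The threshold and the intensity series are illustrative assumptions.

from typing import List

THRESHOLD_G_PER_KWH = 200  # pause when the grid is dirtier than this

def schedule(intensities: List[float], steps_needed: int) -> List[str]:
    """Return the action taken in each hourly slot until training finishes."""
    actions, done = [], 0
    for g in intensities:
        if done >= steps_needed:
            break
        if g > THRESHOLD_G_PER_KWH:
            actions.append("checkpoint+pause")   # wait for cleaner energy
        else:
            actions.append("train")
            done += 1
    return actions

# Hypothetical hourly carbon intensities (gCO2 eq/kWh):
plan = schedule([120, 250, 300, 90, 110, 400, 80], steps_needed=3)
print(plan)  # trains in the clean slots, pauses in the dirty ones
```

A real scheduler would also weigh deadline pressure and the energy cost of checkpointing and migration against the carbon saved by waiting.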

We should also investigate new cooling methods; for example, Microsoft is experimenting with an underwater data centre [9]. We could even enhance energy efficiency using AI itself: DeepMind is looking at using BCOOLER, an AI model, to optimise energy consumption in data centres [9].

Hardware

As seen above, embodied carbon is significant. What are some of the steps that we can take to reduce this?

This comes down to the 3 R’s: reduce, reuse and recycle [10].

Reduce: We need to produce and use exactly the mix of hardware required for application needs. Hardware functions should be created as small chiplets, which can then be connected with fast interconnects [10].

Reuse: The hardware in data centres could be organised into collections of Lego-style network-attached components. This would enable individual components to be replaced rather than the complete system, extending the hardware’s average tenure and amortising embodied carbon over a longer lifespan [10].

Recycle: We need to facilitate an efficient secondary market that disassembles systems into components and sells them for a second life, reducing embodied carbon. A system that tells second-hand buyers the usage history of a component, similar to vehicle history reports, would be incredibly useful in this new economy [10].

There is also a need for AI models that offer performance, efficiency and accuracy across a range of hardware platforms. Such models would ensure backwards compatibility and equity of access to AI, slowing the need for hardware refreshes. Models and platforms should be developed that remain relevant over longer periods, better amortising the embodied carbon costs [10].

Conclusions

The increasing use and development of LLMs brings with it a rise in energy usage and therefore carbon impact. It is a big problem, and it will only get worse as models get larger and larger in the pursuit of progress. Like most advances in our history, carbon impact seems to be an afterthought: we pursue development first and worry about the consequences later. However, as our population becomes increasingly climate-conscious, that timeframe has shrunk massively, and active work is being carried out now to understand and reduce the carbon impact of LLMs.

Progress still needs to be made, especially by the likes of DeepMind, OpenAI and Meta, in making the carbon impact of these models available to customers and developers. Additionally, making LLMs and their weights open-source would mean that these models potentially only need to be pre-trained once, by their original developers [1].

Projects such as the AI Energy Star project are a big step towards bringing the carbon impact of LLMs to the forefront for developers and users. If users start demanding energy-efficient models, they can drive the market towards sustainable solutions, and developers of AI models will be encouraged to prioritise sustainability. Users can nudge developers in the right direction by favouring models that disclose their energy consumption.

There also needs to be more awareness of the energy impact of LLMs on the public side. Users of sites like ChatGPT, Perplexity or Copilot should be aware of how vastly the energy impact varies between tasks. A monitor showing the estimated, or even real-time, energy and carbon impact of the tasks they are performing would be a huge step in this direction.

I hope this work has shed some light on the current situation around carbon impact and LLMs; it is clear that more work has to be done. However, I am optimistic that we are on the right path, and that as more people become climate-conscious and aware of the carbon impact of LLMs, the industry can steer in the right direction and curb this issue.


Nathan Bailey

MSc AI and ML Student @ ICL. Ex ML Engineer @ Arm, Ex FPGA Engineer @ Arm + Intel, University of Warwick CSE Graduate, Climber. https://www.nathanbaileyw.com