A simple guide for greener data science & ML: everything you need to get started

greenDnA · 7 min read · Jan 11, 2023

Welcome to green data and analytics! This is the second in a short series of posts covering what you need to know to find your way through the sustainability jungle of the corporate data world. I’m focusing on the impact of advanced analytics (ML/data science) activities, discussing what needs to change and how to get started at an organizational level.

TLDR:

  • State-of-the-art analytics and AI consume more and more energy
  • Going forward, analytics projects must be optimized along two dimensions, accuracy and energy efficiency, to arrive at the commercially optimal solution
  • The primary levers are all about keeping things simple, becoming more efficient in model training and less greedy on data volumes
  • To change the way we work, we must revise incentives
  • The best place to get started in your organization? Generate awareness and a shared purpose, collect data and define with your teams what can work for you!

In the space of Natural Language Processing (NLP), in early 2023 one could sometimes get the impression that there was nothing before ChatGPT and everything since. In fact, back in the second half of the 2010s we saw a methodological quantum leap when a new generation of so-called transformers was shown to beat established approaches by miles. Training such a model takes a lot of data (think all of Wikipedia) and typically requires immense computational resources: a 2019 paper from researchers at the University of Massachusetts (UMass) estimates that a single training run of Google’s BERT(base) produces more CO2 than the average human over a period of six weeks. In an example using generic neural architecture search (NAS), a single training run generates the CO2 emissions of five cars over their full lifecycle, including fuel. Google published a response in 2022 discussing measures to save energy and asserting that the UMass estimate is above the actual carbon emissions when the right hardware and NAS methodology are used.

Whether the estimate is overstated or not, transformer models are a prime example of the biggest challenge in data science and AI today: in the past decade, innovation meant pushing computational boundaries (and budgets) when training a model, with efficiency and energy savings taking a back seat. The figure below nicely illustrates just how much more compute popular AI models require today (note the log scale!).

In the past decade, compute usage in training AI systems (primarily language, vision, games) shows a steep increase. The graphic displays n=102 milestone ML models across the deep learning space; for each, the total training compute in floating-point operations (FLOP) of the last reported training run is depicted. Source: https://arxiv.org/pdf/2202.05924v2.pdf, see also https://openai.com/blog/ai-and-compute/

When we look beyond training to model deployment in production, it gets harder to find relevant estimates of energy consumption per family of models, because there are even more use-case-specific parameters to consider: how frequently is your inference engine called, what latency is required, how much parallel resource do you need, etc. To get a very rough and generic idea of the total energy spent on using (training and deploying) ML models, Google researchers estimate that training makes up 40% of energy consumption while operating data science applications consumes 60%; data from NVIDIA even suggests a 10–90 split. While the big cloud providers are investing heavily in more efficient use of energy and in renewables, the sheer volume of carbon emissions today means that a responsible, more sustainable way of designing and operating data products must be established as the data science and AI application landscape matures and grows further. Coming back to the case of ChatGPT: as of March 2023, OpenAI and Microsoft have provided neither a carbon emission estimate nor the infrastructure information needed to infer this production footprint, but there are some educated guesses like this one, which puts it at the equivalent of 605 Americans. Considering this technology is freely accessible for anyone to play around with and has hundreds of millions of “fun” users, that looks pretty expensive. How can analytics and AI be used more responsibly?

The key levers to improve energy efficiency are the data volume used, the complexity of your model and lifecycle management. How these levers can best be pulled depends, unsurprisingly, on the use case, so there is no one-size-fits-all solution. Here are a few examples:

  1. Keep it simple: it’s very common that ML models are more complex (in terms of parameters/weights), and thus more energy-consuming, than necessary, as they tend to be built to maximize accuracy without considering the energy footprint. With a sustainability mindset, the winning model should be selected via a two-dimensional optimization, i.e. considering both accuracy and expected life-cycle energy consumption (yes, hard to predict, but an educated guess is better than nothing); see the Optuna sketch below for one way this could look.
  2. Beware diminishing marginal returns on data: analytics people (I’m including myself) are greedy people and will always be biased towards collecting more data rather than less, under the implicit (and correct) assumption that the bigger the data set, the more information it contains. However, data tends to show diminishing marginal returns: one additional observation doesn’t move the needle once you’ve collected a thousand, just like the engineering of yet another unintuitive feature (“let’s try all cross products of all numeric columns”) in ML won’t lead to mind-blowing improvements. A nice empirical evaluation of diminishing marginal returns on data has been performed here for computer vision (see also the figure below). In the future, ideally, data-frugal approaches such as model adaptation and updating will take a more prominent role: for instance, Bayesian updating (using only the most recent observations to update model parameters) and the adaptation of pre-trained models (see below) are excellent tools to save compute; the incremental-updating sketch below illustrates the idea.
  3. Get crafty: if you enjoy automation and work in data science, there’s a risk of becoming lazy, for instance using a week’s compute for grid-searching optimal hyperparameters. In the future, we have to establish transparency on the impact such a strategy has and ultimately balance energy consumption and hours spent coding against each other (and adopt approaches such as Bayesian optimization for hyperparameter or even neural architecture search, as in the Optuna sketch below). Additionally, anyone working in a scalable cloud environment has to be trained to decide on the type and size of instance, the required uptime and possibly other twists, such as how to run code in a greener data center.
Diminishing returns on data in computer vision. For a standard CV application of a deep learner (ImageNet) onto real-world data, as the sample data grows exponentially, accuracy merely increases linearly. Source: https://arxiv.org/abs/1805.00932
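
To make levers 1 and 3 a bit more tangible, here is a minimal sketch, not a prescription, of what a two-dimensional model search could look like: Optuna’s default TPE sampler replaces an exhaustive grid search, and the objective rewards accuracy while penalizing the emissions that codecarbon measures for each trial. The model, the toy data set and the “exchange rate” between accuracy and kg CO2eq are purely illustrative assumptions.

```python
import optuna
from codecarbon import EmissionsTracker
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # toy data, stand-in for your own problem

def objective(trial):
    # Sample a complexity level instead of exhaustively grid-searching it.
    n_estimators = trial.suggest_int("n_estimators", 10, 300)
    max_depth = trial.suggest_int("max_depth", 2, 20)
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)

    tracker = EmissionsTracker(log_level="error")  # estimates kg CO2eq of this trial
    tracker.start()
    accuracy = cross_val_score(model, X, y, cv=3).mean()
    emissions_kg = tracker.stop()

    # Two-dimensional objective: reward accuracy, penalize estimated emissions.
    # The weight (100) is a hypothetical exchange rate you would set per use case.
    return accuracy - 100 * emissions_kg

study = optuna.create_study(direction="maximize")  # default TPE sampler = Bayesian-style search
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)
```

The interesting design choice is the penalty weight: it encodes, in one number, how much accuracy you are willing to trade for a kilogram of CO2eq, and it should be agreed on per use case rather than hard-coded.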
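
And for lever 2, a minimal sketch of the incremental-updating idea: instead of re-training on the ever-growing full history, only the newest batch of observations is used to update the model. Strictly speaking this uses plain stochastic-gradient updates via scikit-learn’s partial_fit rather than Bayesian updating, and the data below is synthetic, but the compute-saving pattern is the same.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)
classes = np.array([0, 1])

for day in range(30):
    # Stand-in for "today's" freshly collected observations.
    X_new = rng.normal(size=(200, 5))
    y_new = (X_new[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
    # One cheap pass over the recent batch only, no retraining on the full history.
    model.partial_fit(X_new, y_new, classes=classes)
```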

Coming back to the evolution of NLP models: the current generation is hugely impressive and has enabled many use cases that were unthinkable before, including in non-core NLP fields such as protein sequence modeling. They also come with an amazing feature, namely that pre-trained versions can be adapted to new use cases. In short, pre-trained models are neural networks that have already learned language from large data sets. During adaptation, they are fine-tuned on a task-specific data set. Clearly, if your use case requires such advanced NLP solutions, you’re best off “recycling” what others have trained from scratch.
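
To illustrate how little code such “recycling” takes, here is a minimal fine-tuning sketch using the Hugging Face transformers library; the checkpoint, the data set and the training settings are illustrative choices, not recommendations, and the subsample is kept deliberately small.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from weights someone else already spent the training energy on.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Fine-tune on a small task-specific data set instead of training from scratch.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=0).select(range(2000)))
trainer.train()
```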

These and similar considerations can define best practices and ultimately create a greener mindset among data scientists, ML engineers, analysts and the like. However, to seriously change how we work, we must revise incentives: the best possible model is not necessarily the one with the highest predictive power, but rather the best compromise between expected accuracy and lifetime energy cost. It’s not 100% accuracy (or maximum expected top-line impact) that should be the goal in an analytics project. Rather, if we consider only economic aspects, the best solution is the one that maximizes the expected top-line impact of a given model less its lifetime energy footprint, which is roughly equivalent to the bottom-line value added plus the cost of hours invested. At its core, the question becomes “what’s the marginal euro value added by one percent extra accuracy, considering the additional energy cost?”. Depending on an organization’s technology stack and geolocation, the energy used is more or less green, and the impact from emissions has to be factored in. I’m convinced that in the future, ESG reporting will require organizations to establish transparency regarding their analytics carbon footprint, and ultimately to find a balance between profit and impact on the carbon balance sheet.
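
As a purely hypothetical back-of-envelope version of that marginal question (every number below is made up for illustration):

```python
# Is one more percentage point of accuracy worth it? All figures are assumptions.
extra_value_eur = 5_000          # assumed yearly top-line value of +1% accuracy
extra_energy_kwh = 12_000        # assumed extra lifetime energy of the bigger model
price_per_kwh_eur = 0.25         # electricity price, assumed
grid_kg_co2_per_kwh = 0.4        # grid carbon intensity, depends on region/provider
carbon_price_eur_per_kg = 0.09   # internal carbon price, assumed

extra_cost_eur = extra_energy_kwh * (price_per_kwh_eur
                                     + grid_kg_co2_per_kwh * carbon_price_eur_per_kg)
print(f"Marginal net value of +1% accuracy: {extra_value_eur - extra_cost_eur:,.0f} EUR")
```

In this made-up case the extra accuracy would still pay off, but not by much; with a dirtier grid or a larger model the sign flips.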

So how do you get the green ball rolling? This will naturally depend on your organization’s DNA (pun intended). However, step one is certainly to generate awareness and a shared purpose. I believe it takes an open dialogue with data and analytics experts to understand where the organization as a whole stands and where it can realistically get to. Moreover, since change requires creativity and testing things out, it takes time and needs enough space for experimentation. That said, there are a few cool tools and resources out there that can help accelerate your journey, such as the codecarbon project.
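
A minimal way to start collecting that data with codecarbon, for instance, is its decorator interface, which logs the estimated emissions of a function call to a CSV file; the project name and the dummy workload below are placeholders.

```python
from codecarbon import track_emissions

@track_emissions(project_name="my-training-job")  # appends estimated emissions to emissions.csv
def train():
    # Stand-in for your usual training code.
    return sum(i * i for i in range(10_000_000))

train()
```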

To wrap this up: today we’re just getting started on our move towards sustainable value from data. Step one is establishing a greener mindset, i.e. awareness of the carbon footprint of our work in the field. Add in transparency on energy consumed and the right incentives for your organization, and nothing will stand in the way of greener AI and analytics.

This content is the author’s personal opinion. If you feel a topic warrants additional detail, or if you spot a mistake, please reach out. Also, constructive feedback is always welcome! :)
