Error-freeness per kilowatt-hour: A Proposed Metric for Machine Learning

Abstract

Accuracy on well-known problems is widely used as a measure of the state of the art in machine learning. Accuracy is a good metric for algorithms in a world where energy has negligible cost. We do not live in such a world.

I propose an alternative metric, error-freeness per kilowatt-hour, which improves on accuracy by trading off accuracy against energy efficiency in a useful way. It has the desirable properties of approximate linearity in the relevant ranges (1 error per 1,000 is 10 times better than 1 error per 100) and of weighting energy use in a way that accounts for the cost of training as a realistic fraction of a delivered service. Error-freeness per kWh is calculated as e = 1/(1 + g - Accuracy)/(h + training time in hours * (GPU+CPU wattage)/1000). The granularity g is the point of diminishing returns for improving accuracy. The overhead h is the energy cost of delivering a software service with no ML training. The parameters may be tuned to circumstance, but as a general-purpose metric, I propose that g=1/100,000 and h=100kWh are good human-scale, commercially relevant parameter values.

Detail

Where training is very expensive, it is not helpful to score machine learning algorithms on accuracy alone, with no account taken of the resources consumed to train to the level reported. In a competitive setting, it biases to the richest player; in the global, or society-wide, or customer-focussed setting, it ignores a real cost. This leads at best to sub-optimal choices and at worst to a growing harm.

I suggest that a useful, general-purpose metric has the following characteristics:

  1. For gross errors, it is linear in the error rate. Halving the error doubles
    the score.
  2. For very small errors, improving the error rate even to perfection adds only incremental value. Perfection is only notionally better than an error rate so small that one error during the application's lifetime is unlikely.
  3. For extremely large energy consumption, the financial cost of the delivered service becomes proportional to the energy cost. We are concerned that the total human cost of energy use is very much disproportionate, in that emissions from increased energy consumption are an existential threat to the human race. For much less extreme energy consumption we might nonetheless accept the linear financial cost of energy as a proxy for the real cost.
  4. For small energy consumption, the energy cost of training becomes an insignificant fraction of the whole cost of delivering a software service. The parameter h represents the energy cost of a service that uses no training.

Error-freeness per kWh can now be formulated as:

e = 1 / (1 + g - Accuracy)
      / (h + training time in hours * (GPU + CPU training wattage) / 1000)
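
As a sketch of how the score might be computed (the function and parameter names below are my own illustration, not an established library), with the proposed default parameters built in:

    def error_freeness_per_kwh(accuracy, training_hours, total_kw, g=1e-5, h=100.0):
        """Error-freeness per kWh for one training run.

        accuracy       -- fraction in [0, 1], e.g. 0.99
        training_hours -- wall-clock hours for a single training run
        total_kw       -- combined GPU + CPU draw in kilowatts
        g              -- granularity, default 1/100,000
        h              -- service overhead energy in kWh, default 100
        """
        training_kwh = training_hours * total_kw
        return 1.0 / (1.0 + g - accuracy) / (h + training_kwh)

    # e.g. 100 hours on ten 0.535 kW servers at 99% accuracy:
    # error_freeness_per_kwh(0.99, 100, 10 * 0.535)  ->  roughly 0.16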

Setting the Parameters g and h

The granularity g

For general-purpose, human-scale and commercial purposes I suggest that a granularity of g=1/100,000 is the level at which halving the error rate grants only incremental extra value. It is about the level at which human perception of error takes real effort. Consider a 1m x 10m jigsaw of 100x1,000 pieces in which 1 piece is missing. An observer standing back to see the entire 1m x 10m work will not see the error. They would have to spend effort searching the 10-meter length of the jigsaw.

Changing g by an order of magnitude either way makes little difference to scores until accuracy approaches 99.999%. So an alternative way to think about g is:

“If you can tell the difference between accuracy of 99% versus 99.9%, but cannot tell the difference between accuracy of 99.99% and 99.999%, then your granularity g is smaller than 1/1,000 but not smaller than 1/100,000.”
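
To put numbers on the claim that g matters little until accuracy approaches 99.999% (a quick sketch; the energy term is held fixed so only the error term varies):

    # Score is proportional to 1/(1 + g - accuracy) when the energy term is fixed.
    for accuracy in (0.99, 0.99999):
        print(accuracy, [round(1 / (1 + g - accuracy)) for g in (1e-4, 1e-5, 1e-6)])
    # 0.99    -> [99, 100, 100]         g makes little difference
    # 0.99999 -> [9091, 50000, 90909]   g dominates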

The software overhead energy cost, h

We estimate the energy cost of an algorithm-centric software service as follows:

  • A typical single core of cloud compute requires 135W [1], for an energy cost of about 1.0 MWh per year per server.
  • The software parts of a service in a fast-moving sector (and “being a candidate for using ML” currently all but defines fast-moving sectors) have a typical lifespan of about 1 year. (The whole service may last longer, but as with the ship of Theseus, the parts do not).
  • A typical size of service that uses a single algorithm is 4 cores plus 4 more for development and test. (Larger services will use more algorithms. We want the cost of a service of a size that uses only one algorithm).
  • 8 such cores running 24/7 for 1 year is 8MWh.

We should set h to some fraction of 8MWh. There is little gain in attempting a more accurate baseline for general-purpose use. See the supplementary discussion below. We set that fraction based on 2 considerations.

Those 8 cores are often shared with other services, both in cloud-compute and self-hosted deployments. The large majority of the world's systems—anything outside the global top 10,000 websites—have minimal overnight traffic; office-hours usage is more realistic. Anything from 1% to 99% of a CPU-year might be a realistic percentage, the lower figure for virtual cloud computing and the higher for dedicated hardware.

It is pragmatic to measure training time as the time for a single training run, rather than imagining developers keep a careful record of every full or partial training run during development. We can more properly account for the total energy cost of all training time by dividing 8MWh by a typical number of training runs. If the final net takes 100 hours to train, it may have taken 10 or 10,000 training runs to settle on that net, through development, hyper-parameter tuning, multiple runs for statistical analysis, comparison with alternatives and so on. For a researcher, 1,000 training runs may be too few, whereas for a commercial team doing only hyper-parameter tuning and testing, 50 runs might be more than enough. It is the widespread commercial usage that concerns our metric.

Combining the shared CPU usage with a typical number of training runs, one might argue for any fraction of 8MWh as typical, from 1/20th to 1/1000th. I propose a broad-brush rule of thumb setting h = 1/80th of 8MWh, or 100kWh. For large projects, it is simple to set g and h to values based on an actual business case and costs.
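
The arithmetic behind that rule of thumb, as a quick sketch (the fractions are the ones discussed above):

    # Fractions of the 8 MWh baseline discussed above.
    baseline_kwh = 8000
    for fraction in (1/20, 1/80, 1/1000):
        print(round(baseline_kwh * fraction), "kWh")
    # prints 400 kWh, 100 kWh (the proposed h) and 8 kWh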

Proposed general purpose parameter values

This gives us standard parameters for error-freeness per kWh of

g=1/100,000
h=100kWh

Examples

A net is trained for 100 hours on a grid of ten 135W servers, each with a 400W GPU (i.e. 0.535 kW per server), and achieves an accuracy of 99%:

  • e = 1/(1+ g - 0.99)/(h + 100*10*.535) = 0.16.
  • It reaches accuracy=99.5% by quadrupling the number of servers: e=0.09.
  • A different algorithm for the same task achieves 99.4% on the original 10 servers: e=0.26.

On a different task, a net required only 10 hours on just a single server and GPU to reach 99% accuracy:

  • e = 1/(1+ g - 0.99)/(h + 10 * .535) = 0.95.
  • It reaches accuracy=99.5% by quadrupling training time to 40 hours. e=1.64.
  • A different algorithm achieves 99.4% in the original 10 hours. e=1.58.
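
These figures can be reproduced with a few lines of Python (the same formula as above; the wattages and durations are those stated in the examples):

    def e(accuracy, training_kwh, g=1e-5, h=100.0):
        return 1 / (1 + g - accuracy) / (h + training_kwh)

    # First task: 100 hours on ten 0.535 kW servers = 535 kWh per run
    print(round(e(0.99,  100 * 10 * 0.535), 2))   # 0.16
    print(round(e(0.995, 100 * 40 * 0.535), 2))   # 0.09  quadruple the servers
    print(round(e(0.994, 100 * 10 * 0.535), 2))   # 0.26  different algorithm

    # Second task: 10 hours on one 0.535 kW server = 5.35 kWh per run
    print(round(e(0.99,  10 * 0.535), 2))         # 0.95
    print(round(e(0.995, 40 * 0.535), 2))         # 1.64  quadruple the training time
    print(round(e(0.994, 10 * 0.535), 2))         # 1.58  different algorithm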

In the first case, training costs ½ a megawatt-hour per run (around £4,000 for 40 training runs at UK 2020 energy prices) and the energy cost is well reflected in the score: we may consider the halving of the error rate not worth the quadrupling of energy cost. In the second case, where the training cost is a small fraction (around £40 for 40 training runs) of the cost of the delivered service, even a small gain in accuracy outweighs a quadrupling of energy cost.

Conclusion

When you measure people's performance, “what you measure is what you get”. People who are striving for excellence will measure their success by the measure you use. By promoting a metric that takes explicit account of energy usage, we create a culture of caring about energy usage.

The question this metric aims to answer is, “Given algorithms and training times that can achieve differing accuracy levels for different energy usage, which ought we to choose?” The point is to focus our attention on this question, in preference to letting us linger on the increasingly counter-productive question, “What accuracy score can I reach if I ignore resource costs?”

Because the parameters are calibrated for real-world, general-purpose use, this metric gives a useful insight into the value versus energy cost of deploying one algorithm rather than another.

Supplementary Discussion [Work In Progress] – the energy cost of a software service

To ask for the energy cost of a deployed software service is like asking for the length of a piece of string. In the absence of a survey of systems using ML, the calculation given is anecdotal on 3 points: How big a service does a typical single ML algorithm serve; what is the lifespan of such a service; for what fraction of that lifespan is the service consuming power?

In 1968, typical software application lifespan was estimated at 6-7 years [2], but a single service is a fraction of such an application, and the churn of software services has increased with the ease of development and replacement. I propose 1 year or less is a realistic lifespan for an algorithmic service in a competitive commercial environment.

The figure of 4 cores for a service arises from considering that although the deployed algorithm may only use a single core (or one low-power GPU), a service is typically deployed as part of an application with a user interface and some persistence mechanism. A whole service might then use 2 cores (for a monolithic deployment with redundancy) or 6 or more (for a multi-tier service with redundancy). Anything bigger is likely already part of a larger application, unconnected to the work of machine learning. We can reasonably set the boundary for “that part of the system which we are only shipping because we have an algorithm to power it” at no bigger than that.

The figure of 135W for a single-socket server might, in the context of efficiency-driven cloud computing, be discounted by 99% or more for low-usage services sharing hardware and consuming zero energy when not in use. Setting h = 1/80 rather than, say, h = 1/500 probably represents very heavy usage.

Other links

On data centre power usage: https://davidmytton.blog/how-much-energy-do-data-centers-use/

1. https://eta.lbl.gov/publications/united-states-data-center-energy#page-9

2. https://mitosystems.com/software-evolution/

no magic

“If science shows we are composed of trillions of cells and no ‘magic ingredients’ then…”

The thought is blind to the fact that if there are such ingredients then they are ipso facto invisible to scientific tools. Your first-person subjective experience, for instance, is invisible to science. It can only be accessed by asking you to report it.

Physicalism, as an attempt to explain all of reality, has a selective vision problem: it rules out anything it can't see.

fish shell quickstart for converting bash scripts

After some years of bash and PowerShell, and some hours of using fish, I've realised that expansion & predictive typeahead are good features in a shell, whereas “be a great programming language” is less important than I thought: there is no need to write scripts in the language of your shell.

Fish has slicker typeahead and expansions than bash or even PowerShell. But to switch to a fish shell, you do still have to convert your profile & start-up scripts. So here's my quick-start guide for converting bash to fish.

  • Do this first: at the fish prompt type help. Behold! the fish documentation in your browser is much easier to search than man pages are.
  • Calmly accept that fish uses set var value instead of var=value. Roll your eyes if it helps.
  • Use end everywhere that bash has fi, done, esac, braces {} etc. e.g. function definition is done with function ... end. The keywords do and then are redundant everywhere; just remove them. On a single line, else is followed by a semicolon. case requires a leading switch (expr).
  • There is no [[ condition ]] but [ ... ] or test ... work. Type help test to see all the file and numeric tests you expect, such as if [ -f filename ] etc. string and regex conditionals are done with the string match command (see below). You can replace [[ -f this && -z that || -z other ]] with [ -f this -a -z that -o -z other ] but see below for how fish can also replace || and && constructions with or and and statements.
  • But first! type help string to see the marvels of proper built-in string commands.
  • Replace function parameters $*, $1, $2 etc with $argv, $argv[1], $argv[2] etc. If that makes you scowl, then type help argparse. See! That's much better than kludging about in bash.
  • Remove the $ from $(subcommand) leaving just (subcommand). Inside quotes, take the subcommand outside the quote: "Today is $(date)" becomes "Today is "(date). (Recall that quotes in bash & fish don't work at all like quotes in most programming languages. Quote marks are not token delimiters and a"bc"d is a valid single token and is parsed identically to each of abcd , "abcd", abc'd').
  • Replace heredocs with multi-line literal strings and standard piping syntax. However, note that if you pipe or read to a variable, the default multiline behaviour is to split on newline and generate an array. Defeat this by piping through string split0 – see https://fishshell.com/docs/current/index.html#command-substitution

Search-and-replace Script Snippets

Here is my hit-list of things to search and replace to convert a bash shell to fish. These resolved almost all of my issues in converting a few hundred lines of bash script to fish.

Each entry shows the bash form, then the fish form after the arrow, with notes:

  • var=value → set var value
  • export var=value → set -x var value
  • export -f functionname → redundant; just remove it
  • alias abbr='commandstring' → no change; alias syntax is accepted as an abbreviation for a function definition since fish 3
  • command $(subshell command) or command `subshell command` → command (subshell command), or command (subshell command | string split0). Just remove the $ but keep the (). See below for when you want to add string split0.
  • command "$(subshell command)" → command (subshell command). Remove both the $ and the quotes "" to make this work.
  • if [[ condition ]] ; then this ; else that ; fi → if [ condition ] ; this ; else ; that ; end. See below for more on fish's multiline and and or syntax.
  • if [[ number != number ]] ; then this ; else that ; fi → if [ number -ne number ] ; this ; else ; that ; end
  • while condition ; do something ; done → while condition ; something ; end
  • $* → $argv
  • $1, $2 → $argv[1], $argv[2]. But see help argparse.
  • if [[ testthis =~ substring ]] → if string match -q '*substring*' testthis. string match without -r does glob-style testing.
  • if [[ testthis =~ regexpattern ]] → if string match -rq regexpattern testthis. string match with -r does regex testing.
  • [ guardcondition ] && command and [ guardcondition ] || command → work as-is. But see or and and below for when it's more complex.
  • var=${this:-$that} → if set -q this ; set var $this ; else ; set var $that ; end
  • cat > outfile <<< "multiline … heredoc" → echo "multiline … heredoc" | cat > outfile. No heredocs, but multiline strings are fine. NB printf is better than echo for anything complicated, in any shell.
  • if [[ -z $this && $that =~ $pattern ]] → if [ -z $this ] ; and string match -rq $pattern $that
  • content=$(curl $url) → set content (curl $url | string split0). Without the pipe to string split0, content will be split on newlines into an array of lines.

Fish's multiline and and or syntax

Fish has a multiline and and or syntax that may be clearer than && and || in both conditionals and guarded commands. It is less terse.

[ condition ]
and do this
or do that

That said, && and || are still valid in commands:

[ condition ] && do this || do that

Other gotchas

  • You may have to read up on how fish does parameter expansion, and especially handling spaces, differently to bash.
  • Pipe & subcommand output to multiline strings or arrays: set x (cat myfile.txt) will set x to an array of the lines of myfile.txt. To keep x as a single multiline string, use string split0: set x (cat myfile.txt | string split0)

Official tips for new fishers:

See the FAQ at https://fishshell.com/docs/3.0/faq.html

Tower of Babel

Now—all the earth one language and one speech—and as they set off eastward they found a plain in the land of Shinar and became citizens in that place.

They said each to his neighbour 
Come! Let us make bricks and burn them with fire.
And bricks were for them stone,

and asphalt was for them mortar
And they said Come! Let us build for ourselves

a City-and-Tower
And its head in the heavens,
And let us make a name for ourselves
Lest we be scattered on the face of the earth.
…
     And the LORD came down to see 
      the city and the tower 
      which the sons of man built …
And the Lord said, “See, one people one language, all of this, and this their start of work
And nothing will be impossible for them,

all they plan they will do
Come, let us go down to that place

and mix up their language
That they will not hear,

each the speech of his neighbour
So the Lord scattered them from there

over all the earth,
and they stopped building the city.

That is why it was called Babel, because there the Lord confused the language of the whole world and from there the Lord scattered them over the face of the whole earth.” – Genesis 11.

Main structure

The so-called “chiastic” or crossover structure, in which the second half of a story or section (or even a single sentence) mirrors & reverses the first half, is a common structure in the Bible's literature.

The structure exposes the themes. The opening and closing sentences tell us the theme of the story and when you contrast the opening with the close you see how the turning point — often the exact middle sentence — has changed things.

The rest provides the detail. Comparing the detail of the first half with the detail of the last half shows what has changed in the light of the central turning point.

Reversals abound. The united language is disunited. The settling together is reversed by scattering. They want to build up to heaven, but instead God comes down from heaven. They want to make a name for themselves but instead are confused.

Less obvious is the importance of the place name. Babel is not named at the beginning because it serves as a pun for “Balel”—to confuse—at the end, which is appropriate after the turning point, not before it. Before then it is referred to as 'that place' in Shinar.

Second Structure

As well as this main structure, there is a second parallel structure between the two halves. The parallels rest as much on the words and sounds as the meaning, and again v5 is the mid-point:

v1
One language One Speech
In That Place
They speak, each to his neighbour
Build a City and a Name
Lest We are Scattered over the face of the whole earth
v5
And the LORD came down to see the city and the tower which the sons of man built.
v6
One People One Language
In That Place
They cannot hear, each the speech of his neighbour
Stop Building the City, 'great' Babel
Scattered Over the face of the whole earth

In this structure we can see that the second half of the story repeats, in the same order, the vocabulary of the first half.

Third Structure

The parallel structure can be folded one more time into a third, ‘anti-parallel’ structure:

v1
One language One Speech
  In That Place
    They speak, each to his neighbour
  Build a City
Lest We are Scattered over the face of the whole earth

v5
And the LORD came down to see the city and the tower which the sons of man built.

v6
One People One Language
  In That Place
   They cannot hear! each his neighbour
  Stop! Building the City
Scattered Over the face of the whole earth

The point here is that the first half of each parallel half is parallel-by-similarity (One language; one speech; in that place); but the second half is parallel-by-contrast. Each half is a mini-chiasm on the theme of unity vs scattering. The turning point inside each half is speech: successful in the first half but unsuccessful in the second half. In this structure, the theme is speech (successful vs unsuccessful) whilst the place and the city serve as examples of what might have been when people are united in speech.

Words

We mentioned that Babel is punned in Hebrew as Balel, to mix or confuse. Invisible in English is the more extended alliteration, in the short speech of v3-4, of the consonants n, b and l in the words for “come”, “let us build”, “brick” and “stone”. This same alliteration is picked up again in the pun in verses 7 & 9 on “let us confuse” (nbl), Babel (bbl) and “he confused” (bbl).
[Hebrew was first written with just consonants not vowels; the letters h & m in this section are mostly parts of grammar not vocabulary]

Archeology

You can see pictures of Babylonian & nearby towers on Wikipedia:

https://en.wikipedia.org/wiki/Ziggurat

The oldest surviving one is dated to about 3000 BC, but similar earlier structures have been suggested as early as 6000 BC.

Meaning

The Genesis text seems easy enough to interpret: The towers were intended to reach, figuratively at least, to the heavens. The “name” in “Make a name for ourselves” should be understood as fame or reputation.

Polemics?

The Genesis text suggests the idea of men reaching heaven and makes no mention of the polytheist religion of Babylon. On the other hand:
• Babylonian texts probably consider the name Babel to derive from “Gate of god”.
• Herodotus says the top of the ziggurat was a shrine for the dwelling of gods.
• The Enmerkar epic has the confusion of languages being due to Enki (a senior god) making mischief and suggests that in the future (possibly the past; interpretation is uncertain) the languages will be united again.
• The main 1st millennium temple to Marduk in Babylon was Esagil – “house with the uplifted head” – and was next to the (probably 2nd millennium) Etemenanki – “Temple of the Foundation of Heaven and Earth”.
• The Enuma Elish considers the Babylonian temple to be “a likeness on earth of what he has wrought in heaven”. Indeed, it says it was built by the minor gods, the Annunaki:

The Annunaki wielded the hoe
For one whole year they moulded its bricks.
When the second year arrived
they raised the head of Esagil, a replica of the Apsû.
They built the lofty ziggurat of the Apsû
and established its … as a dwelling for Anu, Enlil and Ea [3 of the main gods].

Politically, the Babylonian empire was a major power for much of the period from the time of Genesis 12 in the 2nd millennium BC through to Babylon's defeat by Persia in 539 BC.

All of which raises the question whether early readers understood the story as a polemic against the power and religion of Babylon. Like all empires, Babylon thought itself the centre of the world and the divinely blessed pinnacle of humanity.

But the Genesis story mocks. The tower to the heavens is so small that God has to come down to see it. All-conquering, empire-building Babylon was once defeated by a little trick of speech; the resurgent Babylonian empires of the readers' times should be taken no more seriously.

Readers with their eyes open will be well aware that the impressive structures of Babylon—look again at those pictures on Wikipedia—were, like the monumental architecture of every other empire in history, built on the back of slaves, paid for by conquest, murder and theft. What is alleged to be the impressive demonstration of united humanity is in reality a testament to oppression & forced labour.

This is not the point made in the text, however. Rather, the point made by God's interference is, perhaps, the foolishness of human boasting. They who think themselves great achievers do not notice how contingent their achievements are. They who aspire to fame and monumental achievement should realise how futile those things are.

References

• Enmerkar https://www.britannica.com/biography/Enmerkar
• Esagila https://en.wikipedia.org/wiki/Esagila
• Etemenanki https://en.wikipedia.org/wiki/Etemenanki
• Enuma Elish 6:59-64 : Full text at https://www.ancient.eu/article/225/enuma-elish---the-babylonian-epic-of-creation---fu/
• Babel as Gate of God: Wenham's Genesis commentary quoting Gelb, I. J. “The Name of Babylon.” Journal of the Institute of Asian Studies 1 (1955) 1-4