Error-freeness per Kilowatt-hour: A Proposed Metric for Machine Learning

Abstract

Accuracy on well-known problems is widely used as a measure of the state of the art in machine learning. Accuracy is a good metric for algorithms in a world where energy has negligible cost. We do not live in such a world.

I propose an alternative metric, error-freeness per kilowatt-hour, which improves on accuracy by trading off accuracy against energy efficiency in a useful way. It has the desirable properties of approximate linearity in the relevant ranges (1 error per 1,000 is 10 times better than 1 error per 100) and of weighting energy use in a way that accounts for the cost of training as a realistic fraction of a delivered service. Error-freeness per kWh is calculated as e = 1/(1 + g - Accuracy)/(h + (training time in hours * (GPU+CPU wattage)/1000)). The granularity g is the point of diminishing returns for improving accuracy; h is a measure of the wider energy cost of a delivered software service and the point of diminishing returns for savings in the training step alone. For an unspecialised general-purpose metric, I propose g=1/100,000 and h=100kWh as good human-scale, commercially relevant parameter values.

Detail

Where training is very expensive, it is not helpful to score machine learning algorithms on accuracy alone, with no account taken of the resources consumed to train to the level reported. In a competitive setting, it biases to the richest player; in the global, or society-wide, or customer-focussed setting, it ignores a real cost. This leads at best to sub-optimal choices and at worst to a growing harm.

I suggest that a useful, general-purpose metric has the following
characteristics:

  1. For gross errors, it is linear in the error rate. Halving the error doubles
    the score.
  2. For very small errors, improving the error rate even to perfection adds only incremental value. Perfection is only notionally better than an error rate so small that one error during the application's lifetime is unlikely.
  3. For extremely large energy consumption, the financial cost of the delivered service becomes proportional to the energy cost. We are concerned that the total human cost of energy use is very much disproportionate, in that emissions from increased energy consumption are an existential threat to the human race. For much less extreme energy consumption we might nonetheless accept the linear financial cost of energy as a proxy for the real cost.
  4. For small energy consumption, the energy cost of training becomes an insignificant fraction of the whole cost of delivering a software service. The parameter h represents the energy cost of a service that uses no training.

Error-freeness per kWh can now be formulated as:

e = 
    1 / (1 + g - Accuracy)
      / (h + (training time in hours * GPU+CPU training wattage)/1000)
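Expressed as code, the formula is straightforward (a minimal sketch; the defaults are the general-purpose g and h proposed below, and training energy is supplied directly in kWh):

```python
def error_freeness_per_kwh(accuracy, training_kwh, g=1e-5, h=100.0):
    """Error-freeness per kWh: e = 1/(1 + g - accuracy)/(h + training_kwh).

    accuracy     -- fraction in [0, 1]
    training_kwh -- energy of one training run, in kWh
    g            -- granularity (default 1/100,000)
    h            -- baseline service energy cost in kWh (default 100)
    """
    return 1.0 / (1.0 + g - accuracy) / (h + training_kwh)

# First worked example below: 100 h on ten 0.535 kW servers = 535 kWh at 99%.
e = error_freeness_per_kwh(0.99, 100 * 10 * 0.535)  # ≈ 0.16
```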

Setting the Parameters g and h

The granularity g

For general-purpose, human-scale and commercial purposes I suggest a granularity of g=1/100,000 is a level at which halving the error rate grants only incremental extra value. It is about the level at which human perception of error takes real effort. (Consider a 1m x 10m jigsaw of 100 x 1,000 pieces in which 1 piece is missing. An observer positioned to see the entire 1m x 10m work will not see the error. It would take some effort to search the 10 meter length of jigsaw.)

Changing g by an order of magnitude either way makes little difference to scores, until accuracy approaches 99.999%. So an alternative way to think about g is:

“If you can tell the difference between accuracy of 99% versus 99.9%, but cannot tell the difference between accuracy of 99.99% and 99.999%, then your granularity g is smaller than 1/1,000 but not smaller than 1/100,000.”

The service energy cost, h

We estimate the energy cost of an algorithm-centric service as follows:

  • A typical single core of cloud compute requires 135W [1], for an energy cost of roughly 1 MWh per year per server.
  • The software parts of a service in a fast moving sector (and, “being a candidate for using ML” currently all but defines fast-moving sectors) have a typical lifespan of about 1 year. (The whole service may last longer but as with the ship of Theseus, the parts do not).
  • A typical size of service that uses a single algorithm is 4 cores plus 4 more for development and test. (Larger services will use more algorithms. We want the cost of a service of a size that uses only one algorithm).
  • 8 such cores running 24/7 for 1 year is 8MWh.

We should set h to some fraction of 8MWh. There is little gain in attempting a more accurate baseline for general-purpose use; see the supplementary discussion below. We set that fraction based on two considerations.

Those 8 cores are often shared by other services, both in cloud-compute and self-hosting deployments. The large majority of the world's systems—anything outside the global top 10,000 websites—have minimal overnight traffic; office-hours is more realistic. Anything from 1% to 99% of a CPU-year might be a realistic percentage, the lower figure for virtual cloud-computing and the highest for dedicated hardware.

It is pragmatic to measure training time as the time for a single training run, rather than imagining developers keep careful records of every full or partial training run during development. We can more properly account for the total energy cost of all training time by dividing 8MWh by a typical number of training runs. If the final net takes 100 hours to train, it may have taken 10 or 10,000 training runs to settle on that net, in development, hyper-parameter tuning, multiple runs for statistical analysis, comparison with alternatives and so on. For a researcher, 1,000 training runs may be too few, whereas for a commercial team doing only hyper-parameter tuning and testing, 50 runs might be more than enough. It is the widespread commercial usage that concerns our metric.

Combining the shared CPU usage with a typical number of training runs, one might argue for any fraction of 8MWh as typical, from 1/20th to 1/1000th. I propose a broad-brush rule of thumb setting h= 1/80th of 8MWh, or 100kWh. For large projects, it is simple to set g and h to values based on an actual business case and costs.

Proposed general purpose parameter values

This gives us standard parameters for error-freeness per kWh of

g=1/100,000
h=100kWh

Examples

A net is trained for 100 hours on a grid of ten 135W servers, each with a 400W GPU (i.e. 0.535 kW per server), and achieves an accuracy of 99%:

  • e = 1/(1 + g - 0.99)/(h + 100*10*0.535) = 0.16.
  • It reaches accuracy=99.5% by quadrupling the number of servers: e =0.09.
  • A different algorithm for the same task achieves 99.4% on the original 10 servers: e=0.26.

On a different task, a net required only 10 hours on just a single server and GPU to reach 99% accuracy:

  • e = 1/(1 + g - 0.99)/(h + 10*0.535) = 0.95.
  • It reaches accuracy=99.5% by quadrupling training time to 40 hours: e=1.64.
  • A different algorithm achieves 99.4% in the original 10 hours. e=1.58.
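These six scores can be reproduced directly from the formula (the snippet redefines it here so that it stands alone):

```python
def e(accuracy, kwh, g=1e-5, h=100.0):
    """Error-freeness per kWh with the proposed general-purpose g and h."""
    return 1.0 / (1.0 + g - accuracy) / (h + kwh)

kw = 0.535  # one 135 W server plus a 400 W GPU, in kW

# Task 1: 100 hours on 10 servers; 4x the servers; a better algorithm.
print(round(e(0.99,  100 * 10 * kw), 2))   # 0.16
print(round(e(0.995, 100 * 40 * kw), 2))   # 0.09
print(round(e(0.994, 100 * 10 * kw), 2))   # 0.26
# Task 2: 10 hours on 1 server; 4x the training time; a better algorithm.
print(round(e(0.99,  10 * kw), 2))         # 0.95
print(round(e(0.995, 40 * kw), 2))         # 1.64
print(round(e(0.994, 10 * kw), 2))         # 1.58
```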

In the first case, training costs half a megawatt-hour per run (around £4,000 for 40 training runs at UK 2020 energy prices) and the energy cost is well reflected in the score. We may consider the halving of the error rate not worth the quadrupling of cost. Where the training cost is a small fraction of the cost of a delivered service (around £40 for 40 training runs), even a small gain in accuracy outweighs a quadrupling of energy cost.

Conclusion

When you measure people's performance, “what you measure is what you get”. People who are striving for excellence will measure their success by the measure you use. By promoting a metric that takes explicit account of energy usage, we create a culture of caring about energy usage.

The question this metric aims to answer is, “Given algorithms and training times that can achieve differing accuracy levels for different energy usage, which ought we to choose?” The point is to focus our attention on this question, in preference to letting us linger on the increasingly counter-productive question “What accuracy score can I reach if I ignore resource costs?”

Because the parameters are calibrated for real world general purposes, this metric represents a useful insight into the value vs energy cost of deploying one algorithm versus another.

Supplementary Discussion [Work In Progress] – the energy cost of a software service

To ask for the energy cost of a deployed software service is like asking for the length of a piece of string. In the absence of a survey of systems using ML, the calculation given is anecdotal on 3 points: How big a service does a typical single ML algorithm serve; what is the lifespan of such a service; for what fraction of that lifespan is the service consuming power?

In 1968, typical software application lifespan was estimated at 6-7 years [2], but a single service is a fraction of such an application, and the churn of software services has increased with the ease of development and replacement. I propose 1 year or less is a realistic lifespan for an algorithmic service in a competitive commercial environment.

The figure of 4 cores for a service arises from considering that although the deployed algorithm may only use a single core (or one low-power GPU), a service is typically deployed as part of an application with a user interface and some persistence mechanism. A whole service might then use 2 cores (for a monolithic deployment with redundancy) or 6 or more (for a multi-tier service with redundancy). Anything beyond that is likely already looking at parts of a larger application, unconnected to the work of machine learning. We can reasonably set the boundary for “that part of the system which we are only shipping because we have an algorithm to power it” at no bigger than that.

The figure of 135W for a single-socket server might, in the context of efficiency-driven cloud computing, be discounted by 99% or more for low-usage services sharing hardware and consuming zero energy when not in use. Setting h=1/80 rather than, say, h=1/500 probably represents very heavy usage.

Other links

On data centre power usage: https://davidmytton.blog/how-much-energy-do-data-centers-use/

[1] https://eta.lbl.gov/publications/united-states-data-center-energy#page-9

[2] https://mitosystems.com/software-evolution/

Why p-values can’t tell you what you need to know and what to do about it

Transcript of Prof. David Colquhoun, Oct 2020 talk at RIOT Science Club

https://www.youtube.com/watch?v=agC-SG5-Qyk

“Today I speak to you of war. A war that has pitted statistician against statistician for nearly 100 years. A mathematical conflict that has recently come to the attention of the normal people and these normal people look on in fear, in horror, but mostly in confusion because they have no idea why we're fighting.”

Kristin Lennox (Director of Statistical Consulting, Lawrence Livermore National Laboratory)

So that sums up a lot of what's been going on. The problem is that there is near unanimity among statisticians that p-values don't tell you what you need to know but statisticians themselves haven't been able to agree on a better way of doing things.

This talk is about the probability that if we claim to have made a discovery we'll be wrong. This is what people very frequently want to know. And that is not the p-value. You want to know the probability that you'll make a fool of yourself by claiming that an effect is real when in fact it's nothing but chance.

Just to be clear, what I'm talking about is how you interpret the results of a single unbiased experiment. Unbiased in the sense the experiment is randomized, and all the assumptions made in the analysis are exactly true. Of course in real life false positives can arise in any number of other ways: faults in the randomization and blinding, incorrect assumptions in the analysis, multiple comparisons, p-hacking and all of these things are going to make the risk of false positives even worse. So in a sense what I'm talking about is your minimum risk of making a false positive even if everything else were perfect.

The conclusion of this talk will be:

If you observe a p-value close to 0.05 and conclude that you've discovered something, then the chance that you'll be wrong is not 5%, but is somewhere between 20% and 30% depending on the exact assumptions you make. If the hypothesis was an implausible one to start with, the false positive risk will be much higher.

There's nothing new about this at all. This was written by a psychologist in 1966. The major point of this paper is that the test of significance does not provide the information concerning phenomena characteristically attributed to it, and that a great deal of mischief has been associated with its use. He goes on to say this is already well known, but if so it's not well known by many journal editors or indeed many users.

The p-value

Let's start by defining the p-value. An awful lot of people can't do this, but even if you can, it's surprisingly difficult to interpret.

I consider it in the context of comparing two independent samples to make it a bit more concrete. So the p-value is defined thus:

If there were actually no effect—for example if the true means of the two samples were equal, so the difference was zero—then the probability of observing a value for the difference between means which is equal to or greater than that actually observed is called the p-value.

Now there are at least four things wrong with that when you think about it. It sounds very plausible but it's not.

  1. “If there were actually no effect …”: first of all, this implies that the denominator for the probability is the number of cases in which there is no effect, and this is not known.
  2. “… or greater than…” : why on earth should we be interested in values that haven't been observed? We know what the effect size that was observed was, so why should we be interested in values that are greater than that which haven't been observed?
  3. It doesn't compare the hypothesis of no effect with anything else. This is put well by Sellke et al in 2001, “knowing that the data are rare when there is no true difference—that's what the p-value tells you—is of little use unless one determines whether or not they are also rare when there is a true difference”. In order to understand things properly, you've got to have, not only the null hypothesis but an alternative hypothesis.
  4. The definition makes the error of the transposed conditional. That sounds a bit fancy but it's very easy to say what it is. The probability that you have four legs given that you're a cow is high, but the probability that you're a cow given that you've got four legs is quite low: many animals that have four legs aren't cows. Take a legal example. The probability of getting the evidence given that you're guilty may be known. (It isn't of course, but that's the sort of thing you can hope to get.) But it's not what you want. What you want is the probability that you're guilty given the evidence. The probability you're Catholic given that you're the Pope is probably very high, but the probability you're the Pope given that you're a Catholic is very low.

The nub of the matter

So now the nub of the matter: the probability of the observations given that the null-hypothesis is true is the p-value. But it's not what you want. What you want is the probability that the null-hypothesis is true given the observations.

https://www.youtube.com/watch?v=agC-SG5-Qyk#t=11m07

The first statement is a deductive process; the second process is inductive and that's where the problems lie. So these probabilities can be hugely different and transposing the conditional simply doesn't work.

The False Positive Risk

The false positive risk avoids these problems, if you define the false positive risk as follows:

If you declare a result to be “significant” based on a p-value after doing a single unbiased experiment, the False Positive Risk is the probability that your result is in fact a false positive.

time index 11m46

That, I maintain, is what you need to know. The problem is that in order to get it, you need Bayes theorem and as soon as that's mentioned, contention immediately follows.

The Likelihood Ratio

Suppose we call the null-hypothesis H0, and the alternative hypothesis H1. H0 can be that the true effect size is zero and H1 can be the hypothesis that there's a real effect, not just chance. So the odds on H1 after you've done the experiment are equal to the likelihood ratio times the odds on there being a real effect before the experiment:

Odds on H1 after experiment = Likelihood Ratio * Odds on H1 before experiment

In general we would want a Bayes factor rather than the likelihood ratio, but that's the complication that my assumptions get round: we can use the likelihood ratio, which is a much simpler thing.

The likelihood ratio represents the evidence supplied by the experiment. It's what converts the prior odds to the posterior odds, in the language of Bayes' theorem. The likelihood ratio is a purely deductive quantity and therefore uncontentious. It's the probability of the observations if there's a real effect divided by the probability of the observations if there's no effect:

Likelihood Ratio= Pr[ observations | H1 ] / Pr [ observations | H0 ]

Notice a simplification you can make: if the prior odds equal 1, then the posterior odds are simply equal to the likelihood ratio. “Prior odds of 1” means that it's equally likely before the experiment that there was an effect or that there's no effect. That's probably the nearest you can get to declaring equipoise.

Comparison: Consider Screening Tests

I wrote a statistics textbook in 1971 which by and large has stood the test of time, but the one thing I got completely wrong was the limitations of p-values. Like many other people I came to see these through thinking about screening tests. These are very much in the news at the moment, and this is an illustration of the problems they pose which is now quite commonplace.

Suppose you test 10,000 people, and that 1 in 100 of those people have the condition, say Covid, and 99 in 100 don't. So the prevalence in the population you're testing is 1 in 100: you have 100 people with the condition and 9,900 who don't. If the specificity of the test is 95%, you get 5% false positives among those 9,900. This is very much like a null-hypothesis test: it's a test of significance.

The Test of Significance: “there are only 5% false-positives”??

But you can't get the answer without considering the alternative hypothesis, which null-hypothesis significance tests don't do. You've got 1% (that's 100 people) who have the condition. If the sensitivity of the test is 80% (that's like the power of a significance test), you get 80 true positives, so the total number of positive tests is 80 plus 495, and the proportion of positives that are false is 495 divided by that total, which is 86%. 86% false positives is pretty disastrous. It's not 5%! Most people are quite surprised by that when they first come across it.

But that 5% means that 86% of all positives are false positives!
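The screening arithmetic can be laid out explicitly:

```python
# Tree-diagram arithmetic for the screening example above.
N           = 10_000
prevalence  = 1 / 100
sensitivity = 0.80   # like the power of a significance test
specificity = 0.95   # so 5% of the unaffected test positive

have   = N * prevalence              # 100 people with the condition
havent = N - have                    # 9,900 without it
tp     = sensitivity * have          # 80 true positives
fp     = (1 - specificity) * havent  # 495 false positives
fdr    = fp / (tp + fp)              # fraction of positives that are false
print(round(fdr, 2))                 # 0.86
```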

Now Look at Significance Tests In a Similar Way

Now we can do something similar to significance tests though the parallel as I'll explain is not exact.

Suppose we do 1,000 tests, and in 10% of them there's a real effect, while in 90% of them there is no effect. If the significance level, so-called, is 0.05, then we get 5% false positives among the 900 null tests, which is 45 false positives.

Are only 5% of results false positives?

But that's as far as you can go with a null-hypothesis significance test. You can't tell what's going on unless you consider the other arm. If the power is 80%, then we get 80 true positive tests and 20 false negative tests, so the total number of positive tests is 80 plus 45, and the false discovery risk is the number of false positives divided by the total number of positives, which is 36 percent.

No. In the full picture, 36% of positives are false positives

So the p-value is not the false positive risk. The type-1 error rate is not the false positive risk.

The difference lies not in the numerator, it lies in the denominator. In that example of the 900 tests in which the null-hypothesis was true, there were 45 false positives. So looking at it from the classical point of view, the false positive risk would turn out to be 45 over 900 which is 0.05 but that's not what you want. What you want is the total number of false positives divided by the total number of positives which is 0.36.

The p-value is NOT the probability that your results occurred by chance. The false positive risk is.

As I said, the false positive risk in that case is 36%.
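The corresponding tree-diagram arithmetic for the 1,000 significance tests:

```python
# Same tree-diagram arithmetic, applied to 1,000 significance tests.
n_tests = 1_000
prior   = 0.10   # 10% of the tests examine a real effect
alpha   = 0.05
power   = 0.80

real = n_tests * prior   # 100 tests with a real effect
null = n_tests - real    # 900 with no effect
fp   = alpha * null      # 45 false positives
tp   = power * real      # 80 true positives
fpr  = fp / (fp + tp)    # false positive risk, "p-less-than" sense
print(round(fpr, 2))     # 0.36
```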

Complication: “p-equals” vs “p-less-than”

But now we have to come to a slightly subtle complication. It's been around since the 1930s and it was made very explicit by Dennis Lindley in the 1950s. It is unknown to most people which is very weird. The point is that there are two different ways in which we can calculate the likelihood ratio and therefore two different ways of getting the false positive risk.

A lot of writers including Ioannidis and Wacholder and many others use the “p less than” approach. That's what that tree diagram gives you. But it is not what is appropriate for interpretation of a single experiment. It underestimates the false positive risk.

What we need is the “p equals” approach, and I'll try and explain that now.

Suppose we do a test and we observe p=0.047; then all we are interested in is how tests behave that come out with p=0.047. We aren't interested in any other p-value. That p-value is now part of the data. The tree diagram approach we've just been through gave a false positive risk of only 6%, if you assume that the prevalence of true effects was 0.5 (prior odds of 1). 6% isn't much different from 5%, so it might seem okay.

But the tree diagram approach, although it is very simple, still asks the wrong questions. It looks at all tests that gives p ≤ 0.05, the “p-less-than” case. If we observe p=0.047 then we should look only at tests that give p=0.047 rather than looking at all tests which could be equal to or less than 0.05. If you're doing it with simulations of course as in my 2014 paper then you can't expect any tests to give exactly 0.047; what you can do is look at all the tests that come out with p in a narrow band around there, say 0.045 ≤ p ≤ 0.05.

This approach gives a way of looking at the problem. It gives a different answer from the tree diagram approach. If you look at only tests that give p-values between 0.045 and 0.05, the false positive risk turns out to be not 6% but at least 26%. I say at least, because that assumes a prior probability of there being a real effect of 50:50.

If only 10% of the experiments had a real effect (a prior of 0.1), then in the tree diagram this rises to a disastrous 76% false positives. That really is pretty disastrous. Now of course the problem is you don't know this prior probability.

The numbers that I got in the 2014 simulation paper had the p-less-than approach claiming a false positive risk of about 6%, but a p-equals analysis tells us that, under the null-hypothesis, the chance of getting a p-value between 0.045 and 0.05 is much smaller, and that the false positive risk is more like 26%.
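The gap between the two calculations is easy to check by simulation. This is a stdlib sketch using a normal approximation to the two-independent-sample test (16 per group, true effect of 1 SD, prior odds of 1); the talk's exact figures come from t-tests, so the numbers differ a little, but the ordering is clear:

```python
import random
from statistics import NormalDist

random.seed(1)                      # fixed seed: a deterministic sketch
norm = NormalDist()
n_sims = 100_000
n_per_group, effect_sd = 16, 1.0
# Expected z-statistic when the effect is real (normal approximation).
ncp = effect_sd * (n_per_group / 2) ** 0.5

def p_two_sided(z):
    return 2 * (1 - norm.cdf(abs(z)))

# Half the simulated experiments test a real effect (prior odds of 1).
null_p = [p_two_sided(random.gauss(0.0, 1.0)) for _ in range(n_sims)]
real_p = [p_two_sided(random.gauss(ncp, 1.0)) for _ in range(n_sims)]

# "p-less-than": count every test with p <= 0.05.
fp = sum(p <= 0.05 for p in null_p)
tp = sum(p <= 0.05 for p in real_p)
fpr_less = fp / (fp + tp)

# "p-equals": count only tests landing in a narrow band, 0.045 <= p <= 0.05.
fp_eq = sum(0.045 <= p <= 0.05 for p in null_p)
tp_eq = sum(0.045 <= p <= 0.05 for p in real_p)
fpr_equals = fp_eq / (fp_eq + tp_eq)

print(fpr_less, fpr_equals)  # fpr_equals comes out several times larger
```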

The likelihood-ratio approach with two hypotheses

The problem with Bayes theorem is that there exists an infinite number of answers. Not everyone agrees with my approach, but it is one of the simplest.

I look at the likelihood ratio—that is to say, the relative probabilities of observing the data given two different hypotheses. One hypothesis is the zero effect (that's the null-hypothesis) and the other hypothesis is that there's a real effect of the observed size. That's the maximum likelihood estimate of the real effect size. Notice that we are not saying that the effect is exactly zero; but rather we are asking whether a zero effect explains the observations better than a real effect.

Now this amounts to putting a “lump” of probability on there being a zero effect. If you put a prior probability of 0.5 for there being a zero effect, you're saying the prior odds are 1. If you are willing to put a lump of probability on the null-hypothesis, then there are several methods of doing that. They all give similar results to mine within a factor of two or so.

EJ Wagenmakers sums it up in a tweet: “at least Bayesians attempt to find an approximate answer to the right question instead of struggling to interpret an exact answer to the wrong question” — that being the p-value.

Some actual results. First, the false positive risk as a function of the observed p-value.

The slide at time index 26m05 is designed to show the difference between the “p-equals” and the “p-less than” cases:

The top row shows what you get with a prior probability of 0.1.
The bottom row shows the prior probability of 0.5.
False positive risk is plotted against the p-value.

On each diagram the dashed red line is the “line of equality”: that's where the points would lie if the p-value were the same as the false positive risk. You can see that in every case the blue lines (the false positive risk) are greater than the p-value. For a prior probability of 0.5, the false positive risk is about 26% when p=0.05.

So from now on I shall use only the “p-equals” calculation, which is clearly more relevant to a test of significance.

Now another set of graphs at time 27m46, for the false positive risk as a function of the observed p-value, but this time we'll vary the number in each sample. These are all for comparing two independent samples:

Focus on the log-log plot for prior of 0.5 in the bottom-right corner

Let's just concentrate on this log-log plot for a prior of 0.5. The curves are red for n=4 ; green for n=8 ; blue for n=16.

The top row is for an implausible hypothesis with a prior of 0.1, the bottom row for a plausible hypothesis with a prior of 0.5.

The power these lines correspond to is:
- n=4 has power 22%
- n=8 has power 46%
- n=16 that's the blue one has power 78%

Now you can see these behave in a slightly curious way. For most of the range it's what you'd expect: n=4 gives you a higher false positive risk than n=8, and that is still higher than n=16, the blue line. They behave in an odd way around 0.05; they actually begin to cross. But the point is that in every case they're above the line of equality, so the false positive risk is much bigger than the p-value in any circumstance.

Now the really interesting one: When I first did the simulation study I was challenged by the fact that the false positive risk actually becomes 1 if the experiment is a very powerful one. That seems a bit odd.

This is the false positive risk FPR50, by which I mean “the false positive risk for prior odds of 1”, i.e. a 50:50 chance of being a real effect or not a real effect.

Let's just concentrate on the p=0.05 curve. Because the number per sample is changing, the power changes throughout the curve. For example, on this p=0.05 curve for n=4 (that's the lowest one) the power is 22%, but if we go to the other end of the curve, n=64, the power is 99.99%, something not achieved very often in practice.

But how is it that p=0.05 can give you a false positive risk which approaches 100%? The false positive risk will eventually approach 100% even if p=0.01, and even if p=0.001 (and the same with other p-values), though it does so later and more slowly.
In fact this has been known for donkey's years. It's called the Jeffreys-Lindley paradox, though there's nothing paradoxical about it: if the power is 99%, then you expect almost every p-value to be very low. Everything is detected if we have a high power like that. In fact it would be very rare, with that very high power, to get a p-value as big as 0.05; almost every p-value would be much less than 0.05, and that's why p=0.05 would in that case provide strong evidence for the null-hypothesis. Even p=0.01 would provide strong evidence, because almost every p-value would be much less than 0.01. This is a direct consequence of using the “p-equals” definition, which I think is what one ought to do. So the Jeffreys-Lindley phenomenon makes absolute sense.

Example

Now let's consider an actual practical example. This is a study of transcranial electromagnetic stimulation published in Science magazine (so a bit suspect to begin with), and it concludes that an improved associated memory performance was produced by transcranial electromagnetic stimulation, p=0.043. In order to find out how big the sample sizes were I had to dig right into the supplementary material. It was only 8. Nonetheless let's assume that they had an adequate power and see what we make of it:

In fact it wasn't done in a proper parallel-group way; it was done as before-and-after comparisons of the stimulation and a sham stimulation, and it produces one lousy asterisk. In fact most of the paper was about functional magnetic resonance imaging; this was only mentioned as a subsection of figure 1, but it is what was tweeted out, because it sounds more dramatic than other things, and it got a vast number of retweets. Now according to my calculations p=0.043 means there's at least a 23% chance that it's a false positive.

How better might we express the result of this experiment?

We could say that the increase in memory performance was 1.88 ± 0.85 (SEM) with confidence interval 0.055 to 3.7 (extra words recalled on a baseline of about 10). Thus p = 0.043. This implies a false positive risk (i.e. the probability that the results occurred by chance only) of at least 18%, so the result is no more than suggestive.

There are several other ways you can put the same idea. I don't like them as much, but you could say that the increase in performance gave p=0.043, and that in order to reduce the false positive risk to 0.05 it would be necessary to assume that the prior probability of there being a real effect was 81%. You'd have to be almost certain that there was a real effect before you did the experiment for that result to be convincing. Since there's no independent evidence that that's true, the result is no more than suggestive.

Or you could put it this way: the increase in performance gave p=0.043. In order to reduce the false positive risk to 0.05 it would have been necessary to observe p=0.0043, so the result is no more than suggestive.

The reason I now prefer the first of these possibilities is that the other two involve an implicit threshold of 0.05 for the false positive risk, and that's just as daft as assuming a threshold of 0.05 for the p-value.

You can calculate these very easily with our web calculator [for latest links please go to http://www.onemol.org.uk/?page_id=456]. There are three options: if you want to calculate the false positive risk for a p-value and prior, you put in the observed p-value (0.049), the prior probability of a real effect (0.5), the normalized effect size (1 standard deviation) and the number in each sample, and the output panel updates itself automatically.

We see that the false positive risk for the “p-equals” case is 0.26 and the likelihood ratio is 2.8 (I'll come back to that in a minute). This sort of table can be very quickly calculated using the calculator or using the R programs which are provided with the papers:

If we observe p=0.05, then it implies all the other things on this line: the prior probability that you would need to postulate to get a 5% false positive risk would be 87%. You'd have to be almost ninety percent sure there was a real effect before the experiment in order to get a 5% false positive risk. The likelihood ratio comes out to be about 3; what that means is that your observations would be about 3 times more likely if there were a real effect than if there were no effect. 3:1 is very low odds compared with the 19:1 odds which you might incorrectly infer from p=0.05. The false positive risk for a prior of 0.5 (the sort of default value, which I call the FPR50) would be 27% when you observe p=0.05.

In fact these are directly related to each other. Since the likelihood ratio is a purely deductive quantity, we can regard FPR50 as just a transformation of the likelihood ratio, and so as also a purely deductive quantity. For example, 1 - 2.8/(1+2.8) = 0.263, the FPR50. Although in order to interpret it as a posterior probability you do have to go into Bayes' theorem. If the prior probability of a real effect was only 0.1, then that would correspond to a 76% false positive risk when you'd observed p=0.05. If we go to the other extreme, when we observe p=0.001 the likelihood ratio is 100 (notice: not 1000, but 100) and the false positive risk would be 1%. That sounds okay, but if it was an implausible hypothesis with only a 10% prior chance of being true, then the false positive risk would still be above 5%: it would be 8% even when you observe p=0.001. In fact, to get it down to 0.05 you'd have to observe p=0.00043, and that's good food for thought.
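The arithmetic in this paragraph is easy to reproduce. A minimal sketch (the function name is mine; it assumes the likelihood ratio is the probability of the observations given a real effect divided by their probability given the null, so that Bayes' theorem gives FPR = (1-prior)/((1-prior) + LR·prior)):

```python
def false_positive_risk(likelihood_ratio, prior):
    """False positive risk, via Bayes' theorem, given the likelihood ratio
    (how much more likely the data are under a real effect than under the
    null) and the prior probability that a real effect exists."""
    return (1 - prior) / ((1 - prior) + likelihood_ratio * prior)

# Likelihood ratio of about 2.8 when p = 0.05 (the "p-equals" case):
print(round(false_positive_risk(2.8, 0.5), 3))    # 0.263 -- the FPR50
print(round(false_positive_risk(2.8, 0.1), 3))    # 0.763 -- implausible hypothesis

# Likelihood ratio of about 100 when p = 0.001:
print(round(false_positive_risk(100, 0.5), 4))    # 0.0099 -- about 1%
print(round(false_positive_risk(100, 0.1), 4))    # 0.0826 -- still above 5%
```

With a 50:50 prior this reduces to 1/(1 + likelihood ratio), which is why FPR50 can be regarded as just a transformation of the likelihood ratio.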

So what do you do to prevent making a fool of yourself?

  1. Never use the words significant or non-significant, and don't use those pesky asterisks please; it makes no sense to have a magic cut-off. Just give a p-value.
  2. Don't use bar graphs. Show the data as a series of dots.
  3. Always remember, it's a fundamental assumption of all significance tests that the treatments are randomized. When this isn't the case, you can't expect an accurate result from a test.
  4. So I think you should still state the p-value and an estimate of the effect size with confidence intervals but be aware that this tells you nothing very direct about the false positive risk. The p-value should be accompanied by an indication of the likely false positive risk. It won't be exact but it doesn't really need to be; it does answer the right question. You can for example specify the FPR50, the false positive risk based on a prior probability of 0.5. That's really just a more comprehensible way of specifying the likelihood ratio. You can use other methods, but they all involve an implicit threshold of 0.05 for the false positive risk. That isn't desirable.

So p=0.04 doesn't mean you discovered something, it means it might be worth another look. In fact even p=0.005 can under some circumstances be more compatible with the null-hypothesis than with there being a real effect.

So this doesn't leave Fisher looking very good. Matthews (1998) said, “the plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs and flukes into funding”. Though it's not quite fair to blame R.A. Fisher because he himself described the 5% point as quite a low standard of significance.

Q&A

Q: “There are lots of competing ideas about how best to deal with the issue of statistical testing. For the non-statistician it is very hard to evaluate them and decide on what is the best approach. Is there any empirical evidence about what works best in practice? For example, training people to do analysis in different ways, and then getting them to analyze data with known characteristics. If not, why not? It feels like we wouldn't rely so heavily on theory in e.g. drug development, so why do we in stats?”

A: The gist: why do we rely on theory in statistics? Well, we might as well ask, why do we rely on theory in mathematics? That's what it is! You have concrete theories and concrete postulates, which you don't have in drug testing; that's just empirical.

Q: Is there any empirical evidence about what works best in practice, so for example training people to do analysis in different ways? and then getting them to analyze data with known characteristics and if not why not?

A: Why not: because you never actually know unless you're doing simulations what the answer should be. So no, it's not known which works best in practice. I think you can rely on the fact that a lot of the alternative methods give similar answers. That's why I felt justified in using rather simple assumptions for mine, because they're easier to understand and the answers you get don't differ greatly from much more complicated methods.

In my 2019 or 2017 paper there's a comparison of three different methods, all of which assume that it's reasonable to test a point (or small interval) null-hypothesis (one that says that treatment effect is exactly zero), but given that assumption, all the alternative methods give similar answers within a factor of two or so. A factor of two is all you need: it doesn't matter if it's 26% or 52% or 13%, the conclusions in real life are much the same.

So I think you might as well use a simple method. There is an even simpler one than mine actually, proposed by Berger and Sellke, that gives a very simple calculation from the p-value: I think it gives a false positive risk of 29 percent when you observe p=0.05. My method gives 26%, so there's no essential difference between them. It doesn't matter which you use really.
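The very simple calculation alluded to here appears to be the minimum-Bayes-factor calibration, the "-e·p·ln(p)" bound, usually associated with Sellke and Berger; treat that attribution as my assumption. A sketch under it (function name is mine):

```python
import math

def fpr50_minimum_bayes_factor(p):
    """False positive risk for a 50:50 prior from the '-e * p * ln(p)'
    minimum-Bayes-factor calibration (valid for p < 1/e)."""
    bf_null = -math.e * p * math.log(p)   # minimum Bayes factor for the null
    return bf_null / (1 + bf_null)

print(round(fpr50_minimum_bayes_factor(0.05), 2))   # 0.29
```

At p=0.05 this gives about 0.29, matching the 29 percent quoted above, against 26% from the p-equals method.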

Q: The last question gave an example of training people, so maybe he was touching on how we teach people to analyze their data and interpret it accurately. Reporting effect sizes and confidence intervals alongside p-values has been shown to improve interpretation in teaching contexts. I wonder whether in your own experience you have found that this helps as well? Or can you suggest any ways to help educators, teachers, and lecturers to train the next generation of researchers properly?

A: Yes, I think you should always report the observed effect size and confidence limits for it. But be aware that confidence intervals tell you exactly the same thing as p-values and are therefore just as suspect. There's a simple one-to-one correspondence between p-values and confidence limits. So if you use the criterion “the confidence limits exclude zero difference” to judge whether there's a real effect, you're making exactly the same mistake as if you use p≤0.05 to make the judgment. So they should be given for sure, because they're sort of familiar, but you do need, separately, some sort of a rough estimate of the false positive risk too.

Q: I'm struggling a bit with the “p-equals” intuition. How do you decide the band around 0.047 to use for the simulations? Presumably the results are very sensitive to this band. If you are using an exact p-value in a calculation rather than a simulation, the probability of exactly that p-value to many decimal places will presumably become infinitely small. Any clarification would be appreciated.

A: Yes, that's not too difficult to deal with: you've got to use a band which is wide enough to get a decent number in. But the result is not at all sensitive to that: if you make it wider, you'll get larger numbers but the result will be much the same. In fact, that's only a problem if you do it by simulation. If you do it by exact calculation it's easier (I have a spare slide I can show, if you want, that illustrates that). To do 100,000 or a million t-tests with my R script in simulation doesn't take long. But it doesn't depend at all critically on the width of the interval; and in any case it's not necessary to do simulations, you can do the exact calculation.

Q: Even if an exact calculation can't be done—it probably can—you can get a better and better approximation by doing more simulations and using narrower and narrower bands around 0.047?

A: Yes, that's absolutely true. I did check it with a million occasionally. The slide at time 53m17.5 shows how you do the exact calculation:


• The Student's t-value along the bottom
• Probability density at the side
• The blue line is the distribution you get under the null-hypothesis, with a mean of 0 and a standard deviation of 1 in this case.
• So the red areas are the rejection areas for a t-test.
• The green curve is the t-distribution (it’s a non-central t-distribution which is what you need in this case) for the alternative hypothesis. So it's centred round … (inaudible)
• The yellow area is the power, which here is 78%
• The orange area is (1-power) so 22%

Then they consider all values in the red area or in the yellow area as being positives. For the p-equals hypothesis, what you want is not the areas but the ordinates, the probability densities. The probability of getting a t-value of 0.04 under the null-hypothesis is that intercept, and similarly under the alternative (…inaudible…); so for the p-equals hypothesis, the likelihood ratio would be Y1/(2 * Y0) (2 because of the two red tails), and that gives you a likelihood ratio of 2.8. That corresponds to an FPR50 of 26% as we explained. Now it's true that the probability of being in a narrow band around Y0 or Y1 would be zero, but the ratio of the two is perfectly well defined. It's not infinitesimal; in fact it's 2.8. So you can by calculation get the same result as you get from simulation. I hope that was reasonably clear. It may not have been if you aren't familiar with looking at these sorts of things.

Q: To calculate FPR50—false positive risk for a 50:50 prior—I need to assume an effect size. Which one do you use in the calculator? Would it make sense to calculate FPR50 for a range of effect sizes?

A: Yes, if you use the web calculator or the R scripts then you want to specify what the normalized effect size is. You can use your observed one. If you're trying to interpret real data, you've got an estimated effect size. It's true (this was shown in the 2014 simulations and the 2017 exact calculations) that the likelihood ratio I gave of 2.8 when you observe p=0.05 is calculated using the true effect size. All you've got is the observed effect size, so they're not the same, of course. But you can easily show with simulations that if you use the observed effect size in place of the true effect size (which you don't generally know) then that likelihood ratio goes up from about 2.8 to 3.6 (from memory); it's around 3 either way. You can plug your observed normalised effect size into the calculator and you won't be led far astray.

Q: Consider hypothesis H1 versus H2: which interpretation should we go with?

A: Well, I'm not quite clear what the two interpretations alluded to are, but I wouldn't rely on the p-value interpretation. You can do a full Bayesian analysis. Some forms of Bayesian analysis can give results that are quite similar to the p-values. (That can't possibly be generally true because they make quite different assumptions.) Stephen Senn produced an example where there was essentially no problem with the p-value, but that was for a one-sided test with a fairly bizarre prior distribution.

In general in Bayes, you specify a prior distribution of effect sizes, what you believe before the experiment. Now, unless you have empirical data for what that distribution is, which is very rare, then I just can't see the justification for that. It's bad enough making up the probability that there's a real effect compared with there being no real effect. To make up a whole distribution just seems to be a bit like fantasy.

Mine is simpler because by considering a point null-hypothesis and a point alternative hypothesis, what in general would be called Bayes factors become likelihood ratios. Likelihood ratios are much easier to understand than Bayes factors because they just give you the relative probability of observing your data under two different hypotheses. This is a special case of Bayes' theorem. But as I mentioned, any approach to Bayes' theorem which assumes a point null hypothesis gives pretty similar answers, so it doesn't really matter which you use.

There was an edition of the American Statistician last year which had 44 different contributions about the world beyond p=0.05, and I found it a pretty disappointing edition because there was no agreement among people, and a lot of people didn't get around to making any recommendation. They said what was wrong but didn't say what you should do in response. The one paper that I did like was the one by James Berger and Sellke. They recommended their false positive risk estimate (as I would call it; they called it something different but that's what it amounts to) and that's even simpler to calculate than mine. It's a little more pessimistic; it can give a bigger false positive risk for a given p-value, but apart from that detail their recommendations are much the same as mine. It doesn't really matter which you choose.

Q: If people want a procedure that does not too often lead them to draw wrong conclusions, is it fine if they use a p-value?

A: No, that maximises your wrong conclusions, among the available methods! The whole point is, that the false positive risk is a lot bigger than the p-value under almost all circumstances. Some people refer to this as the p-value exaggerating the evidence; but it only does so if you incorrectly interpret the p value as being the probability that you're wrong. It certainly is not.

Q: Your thoughts on, there's lots of recommendations about practical alternatives to p-values. Most notably the Nature piece that was published last year—something like 400 signatories—that said that we should retire the p value. Their alternative was to just report effect sizes and confidence intervals. Now you've said you're not against anything that should be standard practice, but I wonder whether this alternative is actually useful, to retire the p-value?

A: I don't think the 400-author thing in Nature recommended ditching p-values at all; it recommended ditching the 0.05 threshold and just stating a p-value. That would mean abandoning the term “statistically significant”, which is so shockingly misleading for the reasons I've been talking about. But it didn't say that you shouldn't give p-values, and I don't think it really recommended an alternative. I would be against not giving p-values because it's the p-value which enables you to calculate the equivalent false positive risk, which would be much harder work if people didn't give the p-value.

If you use the false positive risk, you'll inevitably get a larger false negative rate. So, if you're using it to make a decision, other things come into it than the false positive risk and the p-value: namely, the cost of missing an effect which is real, and the cost of getting a false positive. They both matter. If you can estimate the costs associated with either of them, then you can draw some sort of optimal conclusion. But the costs of getting false positives are rather low for most people. In fact, there may be a great advantage to your career in publishing a lot of false positives, unfortunately. This is the problem that the RIOT Science Club is dealing with, I guess.

Q: What about changing the alpha level? Tinkering with the alpha level has been popular in the light of the replication crisis, to make the test even more difficult to pass when testing your hypothesis. Some people have said that 0.005 should be the threshold and…

A: … Daniel Benjamin said that and a lot of other authors. I wrote to them about that and they said, they didn't really think it was very satisfactory but it would be better than the present practice. They regarded it as a sort of interim thing.

It's true that you would have fewer false positives if you did that, but it's a very crude way of treating the false positive risk problem. I would much prefer to make a direct estimate, even though it's rough, of the false positive risk rather than just crudely reducing the threshold to p=0.005. I do have a long paragraph in one of the papers discussing this particular thing. If, for example, you are testing a hypothesis about teleportation or mind-reading or homeopathy, then you probably wouldn't be willing to give a prior of 50% to that being right before the experiment. In which case 0.005 would still be wrong; you would have to have a much lower p-value than that to get the false positive risk down to an acceptable level.

Simpson’s Paradox

Alice and Bob compete. Bob wins convincingly both times. But overall, Alice is better. How come?

This happens for real in medical trials and cases where you don't have much control over the size of groups you test on:


                  Alice      Bob
Trial 1 Score     80         10
 – Out Of         /100       /10
 –– Percentage    => 80%     => 100%
Trial 2 Score     2          40
 – Out Of         /10        /100
 –– Percentage    => 20%     => 40%
Total Score       82         50
 – Total Trials   /110       /110
 –– Percentage    => 75%     => 45%

The points to notice:

• In each trial, they got different sized groups assigned to them. There can be good reasons for this. One procedure may be known (or thought) to be better for specific circumstances, so it would be unethical to assign a procedure based on “we want to simplify our statistical analysis” rather than on the best possible outcome. Or you may simply be combining statistics about things not in your control.

• Alice's score for her largest group is better than Bob's score for his largest group, but those ‘largest groups’ were in different trials so they only ever get compared in the overall figure.

More on wikipedia: https://en.wikipedia.org/wiki/Simpson%27s_paradox.
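The table's arithmetic, and the paradox itself, can be verified in a few lines (a sketch; variable names are mine, numbers are from the table above):

```python
alice = [(80, 100), (2, 10)]     # (score, out of) for trials 1 and 2
bob   = [(10, 10), (40, 100)]

def pct(score, total):
    return 100 * score / total

# Bob wins each trial taken on its own...
for (a_score, a_n), (b_score, b_n) in zip(alice, bob):
    assert pct(b_score, b_n) > pct(a_score, a_n)

# ...but Alice wins overall, because the group sizes differ between trials.
alice_total = pct(sum(s for s, n in alice), sum(n for s, n in alice))
bob_total = pct(sum(s for s, n in bob), sum(n for s, n in bob))
print(round(alice_total), round(bob_total))   # 75 45
assert alice_total > bob_total
```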

Why use -y⋅log(ŷ) - (1-y)⋅log(1-ŷ) as a loss function?

Why use -y⋅log(ŷ) - (1-y)⋅log(1-ŷ) as the loss function for training logistic regression and sigmoid outputs?

The purpose of this article is to explain the reasoning that leads to a choice of loss function. There are four areas you will want to understand:

  1. Basic Probability: interpreting numbers from 0 to 1 as probability
  2. Statistical Insight : what is the best understanding of “best”?
  3. Mathematical Tricks to Make Life Easier: things you can do to simplify the calculations
  4. Limitations of Computing Hardware: what kind of maths will existing computers get wrong?

Understanding “why this loss function?” should help you understand why not : when should you not use it, and how you might generate a correct alternative.

Background

When training a neural network, for instance in supervised learning, we typically start with a network of random matrices. We then use sample data to repeatedly adjust them until we have shaped them to give the right answers. This requires three things:

  1. Sample data. We need hundreds or thousands or millions of examples that are already labelled with the correct answer.
  2. A cost function. The cost function gives a number to express how wrong our current set of matrices are. “Training a neural network” means “minimise that wrongness by adjusting the matrices to reduce the value of the cost function.”
  3. An algorithm for how exactly you do this minimising. Gradient descent using back-propagation is currently the one-size-fits-all choice and a consequence is that your cost function must be differentiable.

The example we'll work towards is a cost function you commonly see in neural network introductions:
-1/m * ∑(for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )
where

  • m is the number of examples in your sample data
  • y(i) means the correct result for a given input x(i)
  • ŷ(i) (pronounced y-hat) means your network’s output for a given input x(i)
  • ∑ (pronounced sigma) is the mathematical symbol for sum, or adding up
  • log is the mathematical logarithm function found in log tables

1. Basic Probability: Interpreting numbers from 0 to 1 as probability

How certain are you that you are reading this article? 100% certain is taken to mean completely certain, and 0% certain to mean certainly not: not even a little bit certain. Rather than using percentages, we just use the numbers 0 for certainly not, 1 for completely certain, and numbers in between for a scale of probability.

The probability of two independent events both happening is found by multiplying the probabilities of each individual event. Consider rolling some fair six-sided dice. The probability of rolling a 6 is ⅙, which can also be written as 16.66667% or as 0.166667. The probability of rolling two 6s is ⅙ * ⅙, which is 1/36, which is about 2.77778% or 0.0277778.

The probability of something not happening is (1 - probability of it happening). The probability of not rolling a six is 1-⅙, which is ⅚, or 83.3333% or 0.8333333. It's the same as the probability of rolling one of 1, 2, 3, 4, 5. The probability of not rolling two sixes is (1 - 1/36), which is 35/36, or 97.2222% or 0.972222.

We abbreviate the expression “the probability of …” to p(…) . So if we call the die roll d, then

p( d = 6 )

is read as “the probability that die roll d is 6.”

Conditional probability is “The probability of … given … already happened” and is written p(… | …). So if we call our two dice rolls d1 and d2 then

p( d1+d2 = 12 | d2=6)

can be read as “The probability of d1 + d2 adding to 12 given that we already rolled d2 = 6”
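These readings can be checked by brute-force enumeration of all 36 outcomes of two dice (a sketch; the helper names are mine):

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))    # all 36 outcomes (d1, d2)

def prob(event, outcomes):
    """Fraction of the given outcomes for which the event holds."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

assert prob(lambda o: o[1] == 6, rolls) == Fraction(1, 6)       # p(d2 = 6)
assert prob(lambda o: o == (6, 6), rolls) == Fraction(1, 36)    # p(two sixes)
assert prob(lambda o: o != (6, 6), rolls) == Fraction(35, 36)   # p(not two sixes)

# Conditional probability: restrict to the outcomes where d2 = 6 already happened
given_d2_is_6 = [o for o in rolls if o[1] == 6]
assert prob(lambda o: sum(o) == 12, given_d2_is_6) == Fraction(1, 6)
```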

Further reading: https://en.wikipedia.org/wiki/Probability#Mathematical_treatment

2. Statistical Insight : What is the best understanding of “Best”?

So you started with some data, that is some examples for which you already know the right answer. Let’s say you have m examples and call this set X.
X = { x(1),...,x(m) }

and the matching set of correct answers, or labels:
Y = {y(1),...,y(m)}.

You might think that the best setting for your network, given these examples, is:

  • “Best” is the one that most often returns the correct result y(i) for a given input x(i).

But then you find that your network almost never returns exactly the right answer. For instance in binary classification the right answers are integers 0 and 1; but your network returns answers like 0.997254 or 0.0023.

So next you think about ‘being closest to the correct answer’ and consider:

  • “Best” is the one that gets closest to the correct y(i)s

But there is no single definition of ‘closest’ for multiple points. If you know about averages and also learnt Pythagoras’ theorem at school you might think of Mean Squared Error as a good definition of closest:

  • “Best” is the one that reduces mean squared error for your example data.

And that would work. This is how linear regression—finding the best straight line through points plotted on graph paper—is often done. (And, if the input data is normally distributed, mean squared error will give very similar results to the cost function we are deriving here.)

But if you've studied even more statistics, or probability, or information theory, you may know that “closest to the correct answer” for a distribution of data which need not be normal is actually really tricky, and you might think of:

  • “Best” is the one that maximises the expected probability of the known correct answers.

The new idea here is “expected value.” It is worth your while to study enough statistics and probability to understand it.

And indeed, statisticians take this as often being the best meaning of “best”. It’s called the Maximum Likelihood Estimator, and reasons for considering it “best” include: it is consistent, and in large samples it is as efficient as any other estimator. So it’s a good default choice. Minimising the mean squared error, surprisingly, can sometimes result in a biased estimator.

What the Maximum Likelihood Estimator looks like depends on the detail of what problem you are trying to solve.

Specific Example: Binary Classification with a 0..1 output network

So let’s take the case of

  • A binary classification task : the correct answer is one of two possible labels. Think of them as two buckets.
  • A network where the output is a single number in the range from 0 to 1

A concrete example would be a cat recogniser : the task is to identify photos of cats. The input data are photographs. The sample data we start with are a set of photographs already correctly labelled as ‘cat’ and ‘not-cat’. The final layer of the network is a single sigmoid unit.

After we have seen some mathematical tricks, we will be able to write down a mathematical formula for “expected probability of the known correct answers” for the case of binary classification using a network with a single output in the range 0 to 1.

3. Mathematical Tricks to Make Life Easier: Things you can do to simplify the calculations

1st Trick: Use 0 and 1 as the bucket labels.

You can’t easily do maths with words like ‘cat’ and ’not-cat’. Instead, let’s use 0 and 1 as the names of our buckets. We’ll use 1 to mean ‘it’s a cat’ and 0 to mean ‘it’s not a cat’.

2nd Trick: Interpret a 0-1 output as a probability

This is a neat trick: since the network output is in the range 0 to 1, we decide to interpret the output as “our estimate of the probability that 1 is the correct bucket”. To clarify, we do this not because of any deep insight but just because it makes the maths simpler.

With this interpretation we can turn our statistical phrase “expected probability of the known correct answer” into a mathematical formula:

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

which you should read aloud as “The probability that our estimate y-hat-i equals the correct value y-i, given that the current input is x-i and given that the matrices of our network are currently set to 𝜃.”

We used the letter 𝜃 here as an abbreviation for “all of the current values in the all of the matrices in the network.” 𝜃 is in fact a very long list of hundreds or thousands of numbers.

We will have found the Maximum Likelihood Estimator if we can find the values for the network that maximise this probability.

Before we move on: if the output means “the probability of 1 being correct”, what about the probability of 0 being correct?

In binary classification, everything must go into either bucket 0 or bucket 1. The chance of you being in one or other of the buckets is 100%. But if you’re in 1, the chances of being in 0 are none at all; and vice versa.

If you’re in between – “I estimate the chance of this being a cat photo is 80%” — then remember that in probability, all possibilities together must add up to 100% (It’s 100% certain that a photo goes into one of the buckets). So “I estimate the chance of this being a cat photo is 80%” automatically means, “I estimate the chance of this being a not-cat photo is 20%”.

In general:

  • if output means “the probability of 1 being correct”
  • then (100% - output) or ( 1 - output ) must mean “the probability of 0 being correct”

2½th Trick: Put tricks 1 and 2 together

Here’s the clever bit. If you put tricks 1 and 2 together you can re-write our formula from (2):

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

as just

  • ŷ(i) when the correct answer is 1, and
  • 1 - ŷ(i) when the correct answer is 0.

Which is a lot simpler. I mean, really a lot simpler.

How does that work? Remember that the set of y(i)s were the correct answers for our sample data. So y(i)=1 if the correct answer is 1, and y(i)=0 if the correct answer is 0. Now try putting these two sentences from earlier together, using the word output to substitute the 2nd line into the first:

  • ŷ(i) means the output given the current input is x(i) and given the matrices of our network are currently set to 𝜃.
  • We interpret the output as our estimate of the probability that the correct answer is 1.

And you get:

“ŷ(i) is our estimate of the probability that the correct answer is 1 given the current input is x(i) and given the matrices of our network are currently set to 𝜃.”

But notice that when the correct answer y(i) is 1, this sentence which defines ŷ(i) is exactly what we meant by

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

In the case when the correct answer is y(i) = 0 (remember the rule that “the probability of the result being 0 is (1.0 - probability of the result being 1)”), that sentence in quotes is what we mean by

p( ŷ(i) = 1-y(i) | x(i) ; 𝜃 )

With some high school algebra and our basic knowledge of probability and again the (1-…) rule, we realise that

p( ŷ(i)=1-y(i) ) is the same as p( 1-ŷ(i)=1 ) which is the same as 1- ŷ(i)

Conclusion: We have turned our definition of maximum likelihood into a pair of very simple formulae. For each example datum x(i):

  • if the correct answer is 1, then our maximum likelihood estimate is ŷ(i)
  • if the correct answer is 0, then our maximum likelihood estimate is
    1-ŷ(i)

We need just one more trick. Remember that to use this with back propagation, we need a single differentiable function. We must combine those two formulae into a single formula, and it has to be differentiable.

3rd mathematical trick: Invent a differentiable if(…,then …, else …) function

Let’s try to invent a differentiable if(…,then …, else …) function that can combine the two formulae into one:

if( y(i)=1, then ŷ(i), else (1 - ŷ(i)) )

To keep our maths simple, we want some kind of arithmetic version of this if( y=1, then ŷ(i) , else (1-ŷ(i) )) function. Similar to when we noticed that “the probability of not x is 1-x”, we will use a 1-x trick.

There are a couple of simple options. Try these two definitions:

if(y,a,b) = a*y + b*(1-y)

if(y,a,b) = a^y * b^(1-y)

Both work. We could call the first one the ‘times & add’ version of if and the second one the ’exponent & times’ version. You could invent more. The constraints are:

  1. It must return a when y=1 and return b when y=0
  2. It should have a (preferably simple!) derivative.

Both of my suggestions meet the need, but we go with the second option and define:

if(y,a,b) = a^y * b^(1-y)

This choice is not some profound insight; again, it is only because it will make the maths easier further down the page.
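Both candidate definitions are easy to check numerically (a sketch; the function names are mine):

```python
def if_times_add(y, a, b):
    """The 'times & add' version: a*y + b*(1-y)."""
    return a * y + b * (1 - y)

def if_exp_times(y, a, b):
    """The 'exponent & times' version we chose: a^y * b^(1-y)."""
    return a ** y * b ** (1 - y)

y_hat = 0.8
for branch in (if_times_add, if_exp_times):
    assert branch(1, y_hat, 1 - y_hat) == y_hat        # y=1 selects the first argument
    assert branch(0, y_hat, 1 - y_hat) == 1 - y_hat    # y=0 selects the second
```

The exponent form pays off shortly: taking its log turns the exponents into multiplications, giving y·log(a) + (1-y)·log(b).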

This new if() function combines our two separate formulae for “expected probability of the known correct answer” into a single, simple formula:

ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))

Maximum Likelihood Estimator for all m examples

So far, we’ve only considered one sample datum, x(i), at a time. What about all m samples? You recall that the probability of m independent events is the probability of each of the individual events all multiplied together. We use the capital Greek letter pi — ∏ — meaning product, to show lots of terms being multiplied together, and write it as:

∏ (for i=1 to m) ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))

Which you read aloud as “The product, for every i from 1 to m, of y-hat-i to the power y-i, times one minus y-hat-i to the power one minus y-i”. Or as “multiply all the m individual maximum likelihood estimates together.”

At this point, we have done enough work to get started on our back-propagation algorithm and train our network. We have worked out a cost function that will guide us to a maximum likelihood estimator for all the data we have.

Why don’t we go with it?

4. Limitations of Computing Hardware: What kind of maths will existing computers get wrong?

Well. Suppose you have a small training set of only 1,000 examples. Then this product will be 1,000 small numbers multiplied together. Suppose our typical ŷ(i) is about 0.5; then the result will be somewhere around 2^-1000, or smaller.

The smallest normal number that the IEEE 754-2008 standard for 64-bit floating point arithmetic can represent is 2^-1022 (subnormal numbers stretch that, with reduced precision, down to about 2^-1074). With only a thousand training examples we are already within a hair’s breadth of having the cost function calculation underflow and round to zero! That’s before we’ve thought about rounding errors from doing 1,000 multiplications. If we used single precision 32-bit arithmetic—which we might want to because it's about twice as fast—we hit the underflow problem with only about a hundred training examples.
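The underflow is easy to demonstrate with ordinary 64-bit Python floats (a sketch):

```python
# Multiply 1,500 'typical' per-example probabilities of about 0.5 together:
product = 1.0
for _ in range(1500):
    product *= 0.5

# The true value, 2^-1500, is far below what a 64-bit float can hold,
# so every scrap of information is lost:
print(product)    # 0.0
```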

If only we could use additions instead of multiplications. That would avoid underflow and dramatically reduce rounding errors. If only…

Computing trick: Use log() to avoid underflow and rounding errors

Those of you old enough to have used log tables at school will recall that the log() function neatly replaces multiplication with addition:

log(a * b * c) = log(a) + log(b) + log(c)

log also replaces exponentiation with multiplication:
log(a^b) = log(a) * b.
And, the logs of very small numbers are very big (bigly negative, that is). And log() is differentiable. The derivative of log(x) is 1/x.

Sounds perfect. What if, instead of using the product-of-probabilities for our cost function we could use the sum-of-logs-of-probabilities instead?

Important to the success of this trick is that the log() function is monotonic. That is, when something goes up or down, log(something) goes up or down exactly in step. So when you increase or maximise log(something), you simultaneously increase or maximise something. And vice versa.
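Monotonicity is the whole point: taking logs moves the values around, but the location of the maximum stays put. A tiny sketch (the toy values are mine):

```python
import math

xs = [0.001, 0.02, 0.5, 0.9, 0.3]
logs = [math.log(x) for x in xs]

# log preserves ordering, so the argmax is the same before and after.
argmax = max(range(len(xs)), key=lambda i: xs[i])
argmax_log = max(range(len(logs)), key=lambda i: logs[i])
print(argmax == argmax_log)  # True
```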

What this means is, we can use logs. If we find the value of 𝜃 that improves or maximises the log of

∏ (for i=1 to m) ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))

then we know for sure that that same value of 𝜃 simultaneously improves or maximises this product itself.

Let's do it. The log of the product is the sum of the logs:

∑ (for i=1 to m) log( ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i)) )

and remembering your high school grasp of log() you can simplify even further to,

The Maximum Likelihood Estimator is the network that maximises
∑ (for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )
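You can watch the trick pay off numerically. In this sketch the labels y and predictions ŷ are made-up random data, not a trained network; the point is that the naive product dies while the sum of logs does not:

```python
import math
import random

random.seed(0)
m = 1000
y = [random.randint(0, 1) for _ in range(m)]
# Hypothetical predictions, nudged away from exactly 0 and 1.
y_hat = [min(max(random.random(), 1e-6), 1 - 1e-6) for _ in range(m)]

# The naive product of per-example likelihoods underflows to zero...
product = 1.0
for yi, yh in zip(y, y_hat):
    product *= yh ** yi * (1 - yh) ** (1 - yi)
print(product)         # 0.0

# ...while the sum of logs is a perfectly ordinary negative number.
log_likelihood = sum(yi * math.log(yh) + (1 - yi) * math.log(1 - yh)
                     for yi, yh in zip(y, y_hat))
print(log_likelihood)
```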

You say maximise, I say minimise

For no good reason whatsoever, we often think of optimisation problems as a minimisation challenge, not a maximisation one. For the same reason we prefer to divide by m so that the cost function is sort of averagey.

It's tradition. Whatever. So we stick a minus sign in front and divide by m and, ta-da:

-1/m * ∑ (for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )
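Put into code, the formula is a one-liner (a minimal sketch; the function name and toy values are mine, not a library API):

```python
import math

def cross_entropy_cost(y, y_hat):
    """Average negative log-likelihood for a sigmoid/logistic output."""
    m = len(y)
    return -sum(yi * math.log(yh) + (1 - yi) * math.log(1 - yh)
                for yi, yh in zip(y, y_hat)) / m

# Three toy examples: confident-and-right predictions give a small cost.
print(round(cross_entropy_cost([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # 0.1446
```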

That is how you derive the familiar formula for the cost function for optimising a sigmoid or logistic output with back propagation and gradient descent.

Recap

The ideas that lead to this formula are:

  • Basic Probability: Interpret numbers from 0 to 1 as probability
  • Statistical Insight: The best understanding of “best” is usually the Maximum Likelihood Estimator.
  • Mathematical Tricks to Make Life Easier: A series of tricks you can do to simplify the calculations
  • Limitations of Computing Hardware: What kind of maths will existing computers get wrong?

So my recommendations, in priority order, are:

  1. Take a course on probability and statistics. Seriously. You can’t do good machine learning without a good grasp of statistics.
  2. Practise your high school maths.
  3. Learn enough about your computer hardware and platform to know the gotchas.