Showing Software Architectures in 2+1 Views with UML

  • How can I show & explain an application or service architecture?
  • How can I draw “correct” (and what does that mean?) UML diagrams?

Let me show you how I document smallish software systems with, typically, no more than four UML diagrams. We’ll cover the questions “what is worth diagramming?” and “what can I do with these diagrams anyway?”, and we’ll cover enough UML to be useful.

I will use the example of developing a “lead generation” website, which should be simple enough to not get bogged down, but complex enough to answer some of the “but how do I …?” questions that more complex systems raise.

Why Diagrams?

Use diagrams as a starting point to discuss, describe and explain your system. A diagram can give an instant overview of some aspect of the system. It can show important relationships on a single sheet of paper. It can raise important questions and show design decisions.

Why UML?

The UML is a visual modelling language. It has a vocabulary and grammar for diagramming software such that the diagram is a precise statement. So it can be used to show and explain software architecture.

The UML is the only diagramming standard left standing (with, perhaps, one exception that we’ll see later). You may be tempted to compete in this uncrowded field of standards for software diagrams. I comment only that the UML contains several decades of work by several thoughtful people and if you can produce a usable standard that is simpler, quicker to learn, and not unhelpfully imprecise, then I shall be impressed.

You may also be tempted — as many, many of your colleagues have been — to just do without a standard. I hope these few examples will persuade you that learning a standard is less work than you fear and more useful than you expect.

Why 2+1 views?

The UML tells you how to diagram once you’ve decided what to diagram. “2+1” is a minimal decision about what to diagram. It suggests a couple of diagrams for the logical view of your software and one for the physical view (we’ll explain what those words mean, too). That’s the 2. But it starts with the 1, which is the context.

The System Context: Shown with a UML Use Case Diagram

Always start with context. Always start with simple.

The context diagram should make sense to your non-technical customer and can be so simple you wouldn’t even give it a second glance. It shows:

  • Who and what will use the system
  • What the system will do
  • Who and what the system will rely on

In the UML, anyone or anything who uses the system, or is relied on by the system to do its thing, is an Actor. What the system will do is a UseCase. A UseCase is an instance of the system being used to do something. It is shown on a diagram just by its name, which is a single descriptive phrase, in an ellipse.

  • Anything inside the box is what the system does.
  • Anything outside the box is the context of the system.

How to Use a Context / Use Case Diagram?

Use the diagram to discuss scope, expectations and dependencies with the system’s customers and with the development team who will build it. Its simplicity and brevity should also clarify what is not (or not yet) expected of the system. Like user stories, the brevity calls for conversation to fill in the detail. Complex Use Cases call for detailed, careful business analysis too, but that is done with words, not pictures.

For your developers, this diagram is the high level overview of what they’re delivering — everything inside the box — and what the system will need for it to work — everything outside the box.

I really like having a “hand drawn” look when first showing diagrams, because it says “work in progress” and invites participation. A precisely drawn diagram risks the impression of being a final decision.

I try to draw diagrams to be read from top left to bottom right. So I put “active” Actors—the users—on the left, and “dependency” actors on the right. That’s not a part of UML, but it’s part of how I will talk through the diagram.

Here’s the same diagram later on in the project. In phase two, we’re giving visitors SMS feedback, and adding a whole new bit of functionality to read events from the customer service team and integrate them with web analytics.

[Figure: Use Case diagram with several use cases, 2 user actors and 5 machine actors.]


What if you have lots of use cases? Don’t put more on one diagram than you can usefully discuss in a single session. Pick out the main use cases & those that show the main external dependencies (which probably means, the ones that pose the highest risk to your project). Optionally, have a second diagram for all use cases, but only if you have an audience for it.

UML Definitions

There are 3 things you need to know for this diagram:

Actor, UseCase, and Subject aka System Boundary

Here are their definitions, which I’ve abbreviated from the UML 2.5 spec, section 18 Use Cases.

UseCases are a means to capture the requirements of systems, i.e., what systems are supposed to do. The key concepts specified in this clause are Actors, UseCases, and subjects. Each UseCase’s subject represents a system under consideration to which the UseCase applies. Users and any other systems that may interact with a subject are represented as Actors.

A UseCase is a specification of behavior. An instance of a UseCase refers to an occurrence of the emergent behavior that conforms to the corresponding UseCase. Such instances are often described by Interactions.

An Actor models a type of role played by an entity that interacts with the subjects of its associated UseCases (e.g., by exchanging signals and data). Actors may represent roles played by human users, external hardware, or other systems.

NOTE. An Actor does not necessarily represent a specific physical entity but instead a particular role of some entity that is relevant to the specification of its associated UseCases. Thus, a single physical instance may play the role of several different Actors and, conversely, a given Actor may be played by multiple different instances.

NOTE. The term “role” is used informally here and does not imply any technical definition of that term found elsewhere in this specification.

A UseCase is shown as an ellipse, either containing the name of the UseCase or with the name of the UseCase placed below the ellipse. An optional stereotype keyword may be placed above the name.

A subject for a set of UseCases (sometimes called a system boundary) may be shown as a rectangle with its name in the top-left corner, with the UseCase ellipses visually located inside this rectangle.

An Actor is represented by a “stick man” icon with the name of the Actor in the vicinity of the icon (usually above or below it). An Actor may also be shown as a Classifier rectangle with the keyword «actor».

Other icons that convey the kind of Actor may also be used to denote an Actor, such as using a separate icon for non-human Actors.


The spec: https://www.omg.org/spec/UML

I have used 3 more UML features to add explanations:

  • A Comment is “a textual annotation that can be attached to a set of Elements”. A Comment is shown as a rectangle with the upper right corner bent (this is also known as a “note symbol”). The rectangle contains the body of the Comment. The connection to each annotatedElement is shown by a separate dashed line. The dashed line connecting the note symbol to the annotatedElement(s) may be suppressed if it is clear from the context, or not important in this diagram.
  • A Dependency “implies that the semantics of the clients are not complete without the suppliers”. A Dependency is shown as a dashed arrow between two model Elements. The model Element at the tail of the arrow (the client) depends on the model Element at the arrowhead (the supplier).
  • InformationFlows “describe circulation of information through a system in a general manner. They do not specify the nature of the information, mechanisms by which it is conveyed, sequences of exchange or any control conditions”. An InformationFlow is represented using the same notation as Dependency, with the keyword «flow» adorning its dashed line.
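
If you want a textual, version-controllable way of drawing these diagrams, PlantUML is one well-known option. Here is a minimal sketch of a context diagram in it; the actor and use case names are hypothetical, invented for illustration rather than taken from the lead generation project above:

@startuml
left to right direction

' "active" actors on the left, "dependency" actors on the right
actor Visitor
actor "Email Service" as Email

' the subject, aka system boundary
rectangle "Lead Generation Website" {
  usecase "Request a call back" as UC1
  usecase "Send lead to sales team" as UC2
}

Visitor --> UC1
UC2 ..> Email : <<flow>>

note right of UC2
  A Comment, attached
  by a dashed line.
end note
@enduml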

HowTo: Linux, Nginx, Mono, Asp.Net Mvc

These are notes I made for getting an application written for Asp.Net Mvc (it was probably Mvc 3-ish) to run on a CentOS server.

Note that some of the issues/solutions may vary across Linux variants. The interface between Nginx and Mono is FastCGI, which they both support. The Mono webserver is xsp4.

Of course, running cross platform is all much easier these days if you use .Net Core and Kestrel 🙂 And even for .Net Framework running on Mono, the work done since Microsoft bought Xamarin has been really helpful for e.g. Mvc compatibility.

The Checklist

Ensure nginx has read and execute permissions on your application directory and on all parent directories.
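
A quick way to check and fix this, as a sketch (the path is the application directory used in the systemd unit below; the chmod grants read/execute to “other”, which covers the nginx user):

# show owner and permissions for every directory along the path
namei -l /usr/share/nginx/hdmbreas/hdm-fe-web

# grant read/execute where it is missing
chmod o+rx /usr/share/nginx /usr/share/nginx/hdmbreas /usr/share/nginx/hdmbreas/hdm-fe-web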

nginx conf section for webapp

With the Asp.Net website listening on port 9001

server {
    listen      8080;
    server_name your-public-facing-nginx-server-name.com;

    # use /smoketest/ to confirm that nginx is reading this config
    # and has read/execute access to the
    # /usr/share/nginx directories and files:

    location /smoketest/ {
        root  /usr/share/nginx/smoketestredir;
        index index.html;
    }

    # pass requests to the XSP FastCGI server listening on 127.0.0.1:9001
    # use a different port for each ASP.NET site you create
    # (port 9000 is often taken by PHP on default webserver setups)

    location / {

        # to confirm that nginx can read your files, replace the two
        # fastcgi directives below with:
        # root /usr/share/nginx/mywebapplication;
        fastcgi_pass  127.0.0.1:9001;
        fastcgi_index index.html;

        include /etc/nginx/fastcgi_params;
        fastcgi_param  PATH_INFO          "";
        fastcgi_param  SCRIPT_FILENAME    $document_root$fastcgi_script_name;

        fastcgi_buffers 16 16k;
        fastcgi_buffer_size 32k;
    }
}

Systemd Startup Script

vim /etc/systemd/system/fastcgi-mono-hdmbreas.service

[Unit]
Description=FastCgi-mono-server4 for Application
After=syslog.target

[Service]
Type=simple
ExecStart=/bin/fastcgi-mono-server4 --applications /:/usr/share/nginx/hdmbreas/hdm-fe-web --socket=tcp:127.0.0.1:9001
Restart=always

systemctl daemon-reload
systemctl start fastcgi-mono-hdmbreas.service
systemctl status fastcgi-mono-hdmbreas.service

Error: [error] 7216#0: *119 upstream sent too big header while reading response header

  • Consider setting the buffer size
  • Consider setting the busy buffer size
  • e.g. nginx config: fastcgi_buffers 16 16k; fastcgi_buffer_size 32k;

Watching Errors

Watch errors with tail -F /var/log/nginx/error.log

Mathematics: a Definition

I propose a definition of mathematics:

Mathematics is the deductive analysis of structures

  • Deductive, because empirical data does not generate mathematical results; only logical deduction does so.
  • Analysis, because mathematicians tease out the consequences of the definitions of structures, rather than merely admiring, or using, or playing with them.
  • Structures because … well, this is the question: what do mathematicians study?

What do Mathematicians Study?

The OED definition of mathematics has

“The abstract deductive science of space, number, quantity, and arrangement, including geometry, arithmetic, algebra, etc., studied in its own right (more fully pure mathematics), or as applied to various branches of physics and other sciences (more fully applied mathematics).”

Shorter OED 2007

but this is clearly an ad hoc list, as the “etc.” acknowledges. Are there limits to what mathematicians study? If so, the limits must change every time a new area is invented.

But every branch of mathematics straightforwardly has this in common: it takes a particular structure, or family of structures (that is, a structure of structures), and deductively analyses it, that is to say, draws out its properties and its relationships to other structures.

There are favoured structures. Numbers, of course. Then the Euclidean plane, which is the structure of lines and points on a flat surface. These favoured structures define the familiar major areas of mathematics: number theory, algebra, geometry, analysis. “Progress” in mathematics divides into discovering new things about known structures on the one hand, and choosing new structures to study on the other.

New structures may be chosen precisely for the light they shed on old ones: complex numbers, for instance, shone a new light on algebra, as did topology on geometry. Whenever a new structure is found to have interesting properties it becomes, inevitably, a part of mathematics and even, if there is work enough in exploring it, is dubbed a branch of mathematics, which is perhaps the ultimate mathematical status.

The virtue of naming structure as the subject of mathematics is that it becomes easy to say whether something is or is not mathematics: anywhere there is a structure that can be analysed deductively, there is a subject of mathematics. The ad-hoc element of the definition is banished.

And it reminds us not to be surprised every time a new branch of mathematics opens up. If it moves, or even if it doesn’t, it’s fair game to a mathematician.


Why use -y⋅log(ŷ) - (1-y)⋅log(1-ŷ) as a loss function?

Why use -y⋅log(ŷ) - (1-y)⋅log(1-ŷ) as the loss function for training logistic regression and sigmoid outputs?

The purpose of this article is to explain the reasoning that leads to a choice of loss function. There are four areas you will want to understand:

  1. Basic Probability: interpreting numbers from 0 to 1 as probability
  2. Statistical Insight : what is the best understanding of “best”?
  3. Mathematical Tricks to Make Life Easier: things you can do to simplify the calculations
  4. Limitations of Computing Hardware: what kind of maths will existing computers get wrong?

Understanding “why this loss function?” should help you understand why not: when should you not use it, and how might you generate a correct alternative?

Background

When training a neural network, for instance in supervised learning, we typically start with a network of random matrices. We then use sample data to repeatedly adjust them until we have shaped them to give the right answers. This requires three things:

  1. Sample data. We need hundreds or thousands or millions of examples that are already labelled with the correct answer.
  2. A cost function. The cost function gives a number to express how wrong our current set of matrices is. “Training a neural network” means “minimise that wrongness by adjusting the matrices to reduce the value of the cost function.”
  3. An algorithm for how exactly you do this minimising. Gradient descent using back-propagation is currently the one-size-fits-all choice and a consequence is that your cost function must be differentiable.

The example we’ll work towards is a cost function you commonly see in neural network introductions:
-1/m * ∑ (for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )
where

  • m is the number of examples in your sample data
  • y(i) means the correct result for a given input x(i)
  • ŷ(i) (pronounce y-hat-i) means your network’s output for a given input x(i)
  • ∑ (pronounce sigma) is the mathematical symbol for sum, or adding up
  • log is the mathematical logarithm function found in log tables.

1. Basic Probability: Interpreting numbers from 0 to 1 as probability

How certain are you that you are reading this article? 100% certain is taken to mean completely certain, and 0% certain to mean not even a little bit certain: in fact, certainly not. Rather than using percentages, we just use the numbers 0 for certainly not, 1 for completely certain, and numbers in between for a scale of probability.

The probability of two independent events both happening is found by multiplying the probability of each individual event. Consider rolling some fair six sided dice. The probability of rolling a 6 is ⅙, which can also be written as 16.6667% or as 0.166667. The probability of rolling two 6s is ⅙ * ⅙, which is 1/36, which is about 2.77778% or 0.0277778.

The probability of something not happening is (1 - probability of it happening). The probability of not rolling a six is 1-⅙, which is ⅚, or 83.3333% or 0.833333. It’s the same as the probability of rolling one of 1, 2, 3, 4, 5. The probability of not rolling two sixes is (1 - 1/36), which is 35/36, or 97.2222% or 0.972222.
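
These dice numbers are easy to check for yourself. A short Python sketch, using exact fractions to keep rounding out of the way:

from fractions import Fraction

p_six = Fraction(1, 6)
p_two_sixes = p_six * p_six        # independent events multiply: 1/36
p_not_six = 1 - p_six              # the (1 - ...) rule: 5/6
p_not_two_sixes = 1 - p_two_sixes  # 35/36

print(float(p_six))            # 0.16666666666666666
print(float(p_two_sixes))      # 0.027777777777777776
print(float(p_not_six))        # 0.8333333333333334
print(float(p_not_two_sixes))  # 0.9722222222222222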

We abbreviate the expression “the probability of …” to p(…). So if we call the die roll d, then

p( d = 6 )

is read as “the probability that die roll d is 6.”

Conditional probability is “The probability of … given … already happened” and is written p(… | …). So if we call our two dice rolls d1 and d2 then

p( d1+d2 = 12 | d2=6)

can be read as “The probability of d1 + d2 adding to 12 given that we already rolled d2 = 6”
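
You can verify this conditional probability by brute force, enumerating all 36 equally likely rolls of two dice. A small Python sketch:

import itertools

rolls = list(itertools.product(range(1, 7), repeat=2))

# condition on d2 = 6, then count how often d1 + d2 = 12
given = [(d1, d2) for (d1, d2) in rolls if d2 == 6]
favourable = [(d1, d2) for (d1, d2) in given if d1 + d2 == 12]

print(len(favourable) / len(given))  # 0.1666..., that is, 1/6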

Further reading: https://en.wikipedia.org/wiki/Probability#Mathematical_treatment

2. Statistical Insight : What is the best understanding of “Best”?

So you started with some data, that is, some examples for which you already know the right answer. Let’s say you have m examples and call this set X.
X = { x(1),...,x(m) }

and the matching set of correct answers, or labels:
Y = {y(1),...,y(m)}.

You might think that the best setting for your network, given these examples, is:

  • “Best” is the one that most often returns the correct result y(i) for a given input x(i).

But then you find that your network almost never returns exactly the right answer. For instance in binary classification the right answers are integers 0 and 1; but your network returns answers like 0.997254 or 0.0023.

So next you think about ‘being closest to the correct answer’ and consider:

  • “Best” is the one that gets closest to the correct y(i)s

But there is no single definition of ‘closest’ for multiple points. If you know about averages and also learnt Pythagoras’ theorem at school you might think of Mean Squared Error as a good definition of closest:

  • “Best” is the one that reduces mean squared error for your example data.

And that would work. This is how linear regression—finding the best straight line through points plotted on graph paper—is often done. (And, if the input data is normally distributed, mean squared error will give very similar results to the cost function we are deriving here.)

But if you’ve studied even more statistics, or probability, or information theory, you may know that “closest to the correct answer” for a distribution of data which need not be normal is actually really tricky, and you might think of:

  • “Best” is the one that maximises the expected probability of the known correct answers.

The new idea here is “expected value.” It is worth your while to study enough statistics and probability to understand it.

And indeed, statisticians take this as often being the best meaning of “best”. It’s called the Maximum Likelihood Estimator and reasons for considering it “best” include: it is consistent, it is asymptotically efficient, and it does not depend on the data being normally distributed. So it’s a good default choice. Minimising the mean squared error, surprisingly, can sometimes result in a biased estimator.

What the Maximum Likelihood Estimator looks like depends on the detail of what problem you are trying to solve.

Specific Example: Binary Classification with a 0..1 output network

So let’s take the case of

  • A binary classification task : the correct answer is one of two possible labels. Think of them as two buckets.
  • A network where the output is a single number in the range from 0 to 1

A concrete example would be a cat recogniser : the task is to identify photos of cats. The input data are photographs. The sample data we start with are a set of photographs already correctly labelled as ‘cat’ and ‘not-cat’. The final layer of the network is a single sigmoid unit.

After we have seen some mathematical tricks, we will be able to write down a mathematical formula for “expected probability of the known correct answers” for the case of binary classification using a network with a single output in the range 0 to 1.

3. Mathematical Tricks to Make Life Easier: Things you can do to simplify the calculations

1st Trick: Use 0 and 1 as the bucket labels.

You can’t easily do maths with words like ‘cat’ and ‘not-cat’. Instead, let’s use 0 and 1 as the names of our buckets. We’ll use 1 to mean ‘it’s a cat’ and 0 to mean ‘it’s not a cat’.

2nd Trick: Interpret a 0-1 output as a probability

This is a neat trick: since the network output is in the range 0 to 1, we decide to interpret the output as “our estimate of the probability that 1 is the correct bucket”. To clarify, we do this not because of any deep insight but just because it makes the maths simpler.

With this interpretation we can turn our statistical phrase “expected probability of the known correct answer” into a mathematical formula:

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

which you should read aloud as “The probability that our estimate y-hat-i equals the correct value y-i, given that the current input is x-i and given that the matrices of our network are currently set to 𝜃.”

We used the letter 𝜃 here as an abbreviation for “all of the current values in all of the matrices in the network.” 𝜃 is in fact a very long list of hundreds or thousands of numbers.

We will have found the Maximum Likelihood Estimator if we can find the values for the network that maximise this probability.

Before we move on: if the output means “the probability of 1 being correct”, what about the probability of 0 being correct?

In binary classification, everything must go into either bucket 0 or bucket 1. The chance of being in one or the other of the buckets is 100%. But if you’re in 1, the chances of being in 0 are none at all; and vice versa.

If you’re in between – “I estimate the chance of this being a cat photo is 80%” — then remember that in probability, all possibilities together must add up to 100% (It’s 100% certain that a photo goes into one of the buckets). So “I estimate the chance of this being a cat photo is 80%” automatically means, “I estimate the chance of this being a not-cat photo is 20%”.

In general:

  • if output means “the probability of 1 being correct”
  • then (100% - output) or ( 1 - output ) must mean “the probability of 0 being correct”

2½th Trick: Put tricks 1 and 2 together

Here’s the clever bit. If you put tricks 1 and 2 together you can re-write our formula from (2):

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

as just

  • ŷ(i) when the correct answer is 1, and
  • 1 - ŷ(i) when the correct answer is 0.

Which is a lot simpler. I mean, really a lot simpler.

How does that work? Remember that the set of y(i)s were the correct answers for our sample data. So y(i)=1 if the correct answer is 1, and y(i)=0 if the correct answer is 0. Now try putting these two sentences from earlier together, using the word output to substitute the 2nd line into the first:

  • ŷ(i) means the output, given that the current input is x(i) and the matrices of our network are currently set to 𝜃.
  • We interpret the output as our estimate of the probability that the correct answer is 1.

And you get:

“ŷ(i) is our estimate of the probability that the correct answer is 1, given that the current input is x(i) and the matrices of our network are currently set to 𝜃.”

But notice that when the correct answer y(i) is 1, this sentence which defines ŷ(i) is exactly what we meant by

p( ŷ(i)= y(i) | x(i) ; 𝜃 )

In the case when the correct answer is y(i) = 0, remember the rule that “the probability of the result being 0 is (1 - probability of the result being 1)”. Our estimate of the probability that the correct answer is 0 is therefore 1 - ŷ(i). So, in the case when y(i) = 0, that sentence in quotes is what we mean by

p( ŷ(i) = y(i) | x(i) ; 𝜃 )

and its value is just 1 - ŷ(i).

Conclusion: We have turned our definition of maximum likelihood into a pair of very simple formulae. For each example datum x(i):

  • if the correct answer is 1, then our maximum likelihood estimate is ŷ(i)
  • if the correct answer is 0, then our maximum likelihood estimate is
    1-ŷ(i)

We need just one more trick. Remember that to use this with back propagation, we need a single differentiable function. We must combine those two formulae into a single formula, and it has to be differentiable.

3rd mathematical trick: Invent a differentiable if(…,then …, else …) function

Let’s try to invent a differentiable if(…,then …, else …) function that can combine the two formulae into one:

if( y(i)=1, then ŷ(i), else (1 - ŷ(i)) )

To keep our maths simple, we want some kind of arithmetic version of this if( y=1, then ŷ(i) , else (1-ŷ(i) )) function. Similar to when we noticed that “the probability of not x is 1-x”, we will use a 1-x trick.

There are a couple of simple options. Try these two definitions:

if(y,a,b) = a*y + b*(1-y)

if(y,a,b) = a^y * b^(1-y)

Both work. We could call the first one the ‘times & add’ version of if and the second one the ‘exponent & times’ version. You could invent more. The constraints are:

  1. It must return a when y=1 and return b when y=0
  2. It should have a (preferably simple!) derivative.

Both of my suggestions meet the need, but we go with the second option and define:

if(y,a,b) = a^y * b^(1-y)

This choice is not some profound insight; again, it is only because it will make the maths easier further down the page.

This new if() function combines our two separate formulae for “expected probability of the known correct answer” into a single, simple formula:

ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))
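
You can convince yourself that this formula really does behave as an if(…) by trying it. A tiny Python sketch (the function name is mine, not from any library):

def likelihood(y, y_hat):
    # the 'exponent & times' if-trick:
    # returns y_hat when y = 1, and 1 - y_hat when y = 0
    return y_hat**y * (1 - y_hat)**(1 - y)

print(likelihood(1, 0.8))  # 0.8, i.e. y_hat
print(likelihood(0, 0.8))  # approximately 0.2, i.e. 1 - y_hat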

Maximum Likelihood Estimator for all m examples

So far, we’ve only considered one sample datum, x(i), at a time. What about all m samples? You recall that the probability of m independent events is the probability of each of the individual events all multiplied together. We use the capital Greek letter pi — ∏ — meaning product, to show lots of terms being multiplied together, and write it as:

∏ (for i=1 to m) ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))

Which you read aloud as “The product, for every i from 1 to m, of y-hat-i to the power y-i, times one minus y-hat-i to the power one minus y-i”. Or as “multiply all the m individual maximum likelihood estimates together.”

At this point, we have done enough work to get started on our back-propagation algorithm and train our network. We have worked out a cost function that will guide us to a maximum likelihood estimator for all the data we have.

Why don’t we go with it?

4. Limitations of Computing Hardware: What kind of maths will existing computers get wrong?

Well. Suppose you have a small training set of only 1,000 examples. Then this product will be 1,000 small numbers multiplied together. Suppose our typical ŷ(i) is about 0.5; then the result will be somewhere around 2^-1000.

The smallest positive normal number that the IEEE 754-2008 standard for 64 bit floating point arithmetic can represent is 2^-1022 (subnormals stretch a little further, down to 2^-1074). With only a thousand training examples we are already within a hair’s breadth of having the cost function calculation underflow and round to zero! That’s before we’ve thought about rounding errors from doing 1,000 multiplications. If we used single precision 32 bit arithmetic—which we might want to because it’s about twice as fast—the smallest normal number is 2^-126, and we hit the underflow problem with only about a hundred training examples.

If only we could use additions instead of multiplications. That would avoid underflow and dramatically reduce rounding errors. If only…
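
You can watch the underflow happen in a few lines of Python (math.prod needs Python 3.8 or later):

import math

# 1,000 'typical' per-example likelihoods of about 0.5 each
likelihoods = [0.5] * 1000

print(math.prod(likelihoods))               # about 9.33e-302: perilously tiny
print(math.prod(likelihoods + [0.5] * 80))  # 0.0: underflow, all information lost

# the sum of logs stays comfortably in range
print(sum(math.log(p) for p in likelihoods))  # about -693.15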

Computing trick: Use log() to avoid underflow and rounding errors

Those of you old enough to have used log tables at school will recall that the log() function neatly replaces multiplication with addition:

log(a * b * c) = log(a) + log(b) + log(c)

log also replaces exponentiation with multiplication:
log(a^b) = log(a) * b.
And, the logs of very small numbers are very big (bigly negative, that is). And log() is differentiable. The derivative of log(x) is 1/x.
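
All three facts are easy to spot-check in Python:

import math

a, b = 0.03, 0.002
print(math.isclose(math.log(a * b), math.log(a) + math.log(b)))  # True
print(math.isclose(math.log(a**5), 5 * math.log(a)))             # True
print(math.log(1e-300))  # -690.77...: a very small number, a bigly negative log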

Sounds perfect. What if, instead of using the product-of-probabilities for our cost function we could use the sum-of-logs-of-probabilities instead?

Important to the success of this trick is that the log() function is monotonic. That is, when something goes up or down, log(something) goes up or down exactly in step. So when you increase or maximise log(something), you simultaneously increase or maximise something. And vice versa.

What this means is, we can use logs. If we find the value of 𝜃 that improves or maximises the log of

∏ (for i=1 to m) ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i))

then we know for sure that that same value of 𝜃 simultaneously improves or maximises this product itself.

Let’s do it. The log of the product is the sum of the logs:

∑ (for i=1 to m) log( ŷ(i)^y(i) * (1-ŷ(i))^(1-y(i)) )

and remembering your high school grasp of log() you can simplify even further to,

The Maximum Likelihood Estimator is the network that maximises
∑ (for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )

You say maximise, I say minimise

For no good reason whatsoever, we often think of optimisation problems as a minimisation challenge, not a maximisation one. For the same reason we prefer to divide by m so that the cost function is sort of averagey.

It’s tradition. Whatever. So we stick a minus sign in front and divide by m and, ta-da:

-1/m * ∑ (for i=1 to m) y(i)*log( ŷ(i) ) + (1-y(i))*log( 1-ŷ(i) )

That is how you derive the familiar formula for the cost function for optimising a sigmoid or logistic output with back propagation and gradient descent.
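
To make the result concrete, here is a minimal NumPy sketch of that cost function (my own illustration, not from any particular framework; the eps clamp guards against log(0), a practical detail the derivation above glosses over):

import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # -1/m * sum( y*log(y_hat) + (1-y)*log(1-y_hat) )
    y_hat = np.clip(y_hat, eps, 1 - eps)  # keep log() away from 0
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

y = np.array([1.0, 0.0, 1.0, 1.0])      # correct labels
y_hat = np.array([0.9, 0.1, 0.8, 0.3])  # network outputs
print(binary_cross_entropy(y, y_hat))   # about 0.409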

Recap

The ideas that led to this formula are:

  • Basic Probability: Interpret numbers from 0 to 1 as probability
  • Statistical Insight : The best understanding of “best” is usually the Maximum Likelihood Estimator.
  • Mathematical Tricks to Make Life Easier: A series of tricks you can do to simplify the calculations
  • Limitations of Computing Hardware: What kind of maths will existing computers get wrong?

So my recommendations, in priority order, are:

  1. Take a course on probability and statistics. Seriously. You can’t do good machine learning without a good grasp of statistics.
  2. Practise your high school maths
  3. Learn enough about your computer hardware and platform to know the gotchas.