Input-Output Maps are Strongly Biased Towards Simple Outputs, Kamaludin Dingle, Chico Q. Camargo and Ard A. Louis, *Nature Communications* **9**, 761 (2018).

On Saturday I went to my alma mater’s Morning of Theoretical Physics, which was actually on “the Physics of Life” (or Active Matter as theoretical physicists seem to call it). Professor Louis presented this work in relation to RNA folding, but it’s the relevance to neural networks that interested me.

The assertion made in this paper is that if you have a map of a lot of inputs to a (significantly smaller, but still large) collection of outputs, the outputs are not equally likely to occur. Instead, the simpler outputs are preferentially selected.

A quick demonstration of the intuition behind this argument: imagine randomly assembling a fixed number of Lego bricks into a shape. Some particular unique shape with weird branches can only be formed by one individual configuration of the bricks. On the other hand, a simpler shape with a large degree of symmetry can be formed from many different configurations. Therefore the process of randomly selecting a shape will preferentially pick the symmetric shape.

The complexity metric that’s useful here is called Kolmogorov complexity: roughly speaking, the length of the shortest Universal Turing Machine program needed to describe the shape (or other object). Consider strings. A random string of 40 characters, say `a56579418dc7908ce5f0b24b05c78e085cb863dc`, may not be representable in any more efficient way than by its own characters. But the string `aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`, which is also 40 characters, can be written with the Python program:

```
'a'*40
```

which is seven characters long including the newline. Assuming eight bits per character, the random string needs 40*8=320 bits to be represented. The forty `a`s can instead be represented by the Python program, which is 56 bits. The assertion is that a “find a program that generates character sequences of length 40” algorithm (with some particular assumptions in place) will find the `a56579…` string with probability 2^-320, but will find the `aaa…` string with probability 2^-56, which is much, much more likely.

In fact, this paper shows that the upper *and* lower bounds on the probability of a map yielding a particular output for random input are both dependent on the Kolmogorov complexity of the output. It happens that due to the halting problem, you can’t calculate Kolmogorov complexity for arbitrary outputs. But you can approximate it, for example using Lempel-Ziv complexity (i.e. the length of the input to a lossless compression algorithm needed to recover the same output).
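That compression-based approximation is easy to sketch. Here is a minimal Python version using the standard library’s `zlib` (a Lempel-Ziv-family compressor) as a stand-in for Kolmogorov complexity, applied to the two 40-character strings from above; the function name is mine, for illustration:

```python
import zlib

def approx_complexity(s: str) -> int:
    """Approximate Kolmogorov complexity by compressed length, in bits."""
    return 8 * len(zlib.compress(s.encode()))

random_ish = "a56579418dc7908ce5f0b24b05c78e085cb863dc"
repetitive = "a" * 40

print(approx_complexity(random_ish), approx_complexity(repetitive))
```

Note that `zlib` adds a fixed header, so the absolute numbers overestimate the complexity of short strings; it’s the comparison that matters, and the repetitive string comes out far smaller.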

Where does this meet neural networks? In a preprint of a paper selected for ICLR 2019, with two of the same authors as this paper. There, a neural network is treated as a map from its inputs and weights to the function that does the inference.

Typically neural network architectures have *lots* more parameters than there are points in the training set, so how is it that they manage to generalise so well? And why is it that different training techniques, including stochastic gradient descent and genetic algorithms, result in networks with comparable performance?

The authors argue that a generalising function is much less complex than an overfitting function, using the same idea of complexity shown above. And that as the training process for the network is sampling the map of inputs->functions, it is more likely to hit on the simple functions than the complex ones. Therefore the fact that neural networks generalise well is *intrinsic* to the way they select functions from a wealth of possibilities.

My hope is that this is a step toward a predictive theory of neural network architectures: that by knowing something of the function we want to approximate, we can set a lower bound on the complexity of a network needed to discover sufficiently generalisable functions. This would be huge both for reducing the training effort needed for networks and for reducing the evaluation runtime. That, in turn, would make it easier to use pretrained networks on mobile and IoT devices.

Quick, a software engineer mentions a “performance” problem to you. What do they mean?

This is, of course, an unfair question. There are too many different ideas that all get branded “performance” for us to know what we are trying to solve. This paper is simultaneously about two different flavours of performance.

On the one hand, the “performance” of a neural network is related to its ability to perform its task: the correctness of its inferences. There isn’t really a good way to know which neural network configuration will perform well for the (unknown) function you want it to approximate. Therefore, the rule of thumb is to find a network that’s too complex, and stop training it when it begins to overfit (when its performance starts to degrade because it’s being too specific about whether an input looks like an example from the training set, rather than whether it shares important features).

Now we meet the other kind of performance: the amount of resources consumed to do the work. A large neural network needs to do a lot of computations with a lot of numbers to classify an input, and that means using a lot of processor cycles and a lot of memory. Because our approach to designing the network was to overspecify it, we are using more computer than we need. But if that computer is relatively low-specification and battery operated—a mobile phone for example—this may render our solution unusable.

So, how can we turn a complex neural network into a simpler neural network? While this isn’t totally satisfying, the answer is: “guess”. Turn off bits of the network (i.e. set weights to zero), and see whether it still classifies (performs) well. This act of turning bits of the network off is called pruning.
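At its crudest, that guess can be sketched in a few lines of Python: zero every weight whose magnitude falls below some threshold, then re-measure how well the network classifies. The function name, the fraction, and the example weights here are illustrative, not from the paper:

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the smallest-magnitude fraction of weights: a toy sketch of pruning."""
    threshold = sorted(abs(w) for w in weights)[int(len(weights) * fraction)]
    return [w if abs(w) >= threshold else 0.0 for w in weights]

weights = [0.02, -0.51, 0.30, -0.07, 0.94, 0.11, -0.26, 0.01]
pruned = prune_by_magnitude(weights, 0.5)  # half the weights survive
print(pruned)
```

In a real network you would do this per layer (or per kernel), retrain briefly, and check the misclassification rate before keeping the pruned version.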

Ironically some previous work in this space has actually not been great for performance (the resource kind). You can “unroll” convolutional layers (commonly found in image-classifying networks) into matrix multiplications, and you can turn that into a sparse matrix by approximating all small weights with zero and only storing the non-zero values and their locations. But now, even though you have fewer calculations, you may have *more* memory accesses in trying to solve where the weights should be used. And that could be slower than not reducing the network.
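A toy illustration of that trade-off in plain Python (the representation and names here are illustrative, not from the papers discussed): storing only the non-zero values with their locations saves arithmetic, but every multiply-accumulate now carries an extra indexed lookup.

```python
def to_sparse(matrix, eps=1e-3):
    """Keep only entries with |v| > eps, stored as (row, col, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix)
            for j, v in enumerate(row) if abs(v) > eps]

def sparse_matvec(triples, x, n_rows):
    """Multiply a sparse matrix by a vector; each non-zero costs an extra indirection."""
    y = [0.0] * n_rows
    for i, j, v in triples:
        y[i] += v * x[j]  # the indexed lookup is the memory-access cost
    return y

m = [[0.0, 2.0], [3.0, 0.0]]
print(sparse_matvec(to_sparse(m), [1.0, 1.0], 2))
```

Fewer multiplications, but each one now touches the triple list and the vector at a computed offset, which is exactly where the memory-access penalty comes from.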

The work in this paper takes a structured approach to pruning the network. Whole feature maps (scores indicating whether particular characteristics of an image were found, and where, in the input image) can be removed, the network retrained, and the performance (ability to classify) measured afterwards. At smaller scales, the kernels can be pruned in particular deterministic ways, replacing a full weights matrix with a start index, a “stride” (the gap between each non-zero value), and the list of non-zero weights. The different possibilities are explored using a combination of random generation and evolutionary iteration; networks that have a misclassification rate within a given tolerance of the original are kept for subsequent generations.
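That start/stride representation can be sketched in a few lines of Python. The function names and the example kernel are mine, for illustration; only the weights at strided positions survive, and everything else is treated as zero:

```python
def compress_kernel(flat_kernel, start, stride):
    """Represent a kernel as (start, stride, kept weights): only strided positions survive."""
    return start, stride, flat_kernel[start::stride]

def expand_kernel(length, start, stride, kept):
    """Rebuild the full, mostly-zero kernel from the compact form."""
    flat = [0.0] * length
    flat[start::stride] = kept
    return flat

kernel = [0.9, 0.1, -0.4, 0.2, 0.7, 0.0, -0.3, 0.5, 0.6]
start, stride, kept = compress_kernel(kernel, 0, 2)
print(kept)
print(expand_kernel(9, 0, 2, kept))
```

The point of the deterministic layout is that, unlike a general sparse matrix, you never have to store or chase per-weight locations: two integers describe where every kept weight lives.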

The results seem promising. With pruning at both levels of abstraction, the resulting network is just as deep (it contains as many layers) but it has fewer nodes at each layer and fewer connections between nodes. The systematic pruning approach means that the resulting networks are smaller in memory and faster in use: CPU time measurements are down approximately two thirds when compared with the initial, unpruned network.

However, be careful when interpreting the graph: the authors are showing the reduced execution time of the *unrolled matrix multiplication* for *one layer* of *one network configuration*. It is not clear what this means for overall behaviour of the network, what the misclassification rate of this network was (they show a tolerance cutoff at 4%, which may be too high for a given use case), or in general how the CPU time savings vary with network topology. In other words, we have a single graph, and don’t know how to generalise it.

I hope that at some point a sound theoretical basis for choosing the architecture for a neural network to solve a given problem will be developed. In fact, I sort of hope that it exists now, and that I haven’t found it. I don’t think so: for the moment, whack-a-mole is the state of the art, and this paper shows you can whack quite a lot of moles and still get reasonable results.

This sounds like a fairly niche market, but in fact it’s quite useful, because there are a whole lot of other functions that can be *approximated* by polynomial equations. The approach, which is based in calculus, generates a Taylor series (or a Maclaurin series, if the approximation is for input values near zero).

Now, it happens that this collection of other functions includes logarithms:

\(\ln(1+x) \approx x - x^2/2 + x^3/3 - x^4/4 + \ldots\)

and exponents:

\(e^x \approx 1 + x + x^2/2! + x^3/3! + x^4/4! + \ldots\)

and so, given a difference engine, you can make tables of logarithms and exponents.
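Both series are easy to check numerically. A quick Python sketch comparing the partial sums against the standard library functions (the term count of 20 is an arbitrary choice that gives plenty of accuracy for small x):

```python
import math

def ln1p_taylor(x, terms=20):
    """Taylor series ln(1+x) = x - x^2/2 + x^3/3 - ... (converges for |x| < 1)."""
    return sum((-1) ** (n + 1) * x ** n / n for n in range(1, terms + 1))

def exp_taylor(x, terms=20):
    """Maclaurin series e^x = 1 + x + x^2/2! + x^3/3! + ..."""
    return sum(x ** n / math.factorial(n) for n in range(terms))

print(ln1p_taylor(0.1), math.log(1.1))
print(exp_taylor(1.0), math.e)
```

For inputs near the expansion point the partial sums agree with `math.log` and `math.e` to machine precision, which is why truncated polynomials like these are good enough for real math libraries.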

In fact, your computer is probably using exactly this approach to calculate those functions. Here’s how glibc calculates `ln(x)` for x roughly equal to 1:

```
r = x - 1.0;
r2 = r * r;
r3 = r * r2;
y = r3 * (B[1] + r * B[2] + r2 * B[3]
+ r3 * (B[4] + r * B[5] + r2 * B[6]
+ r3 * (B[7] + r * B[8] + r2 * B[9] + r3 * B[10])));
// some more twiddling that add terms in r and r*r, then return y
```

In other words, it works out `r` so that it is calculating `ln(1+r)` instead of `ln(x)`. Then it adds together `r + a*r^2 + b*r^3 + c*r^4 + d*r^5 + ... + k*r^12`: it does the Taylor series for `ln(1+r)`!

Now, given these approximations, we can combine numbers into probabilities (using the sigmoid function, which is in terms of `e^x`) and find the errors on those probabilities (using the cross entropy, which is in terms of `ln(x)`). We can build a learning neural network!
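For concreteness, here are both pieces in a few lines of Python, written with the library `exp` and `log` that the polynomial approximations above ultimately implement:

```python
import math

def sigmoid(x):
    """Squash any real-valued score into a probability, using e^x."""
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(p, label):
    """Error of a predicted probability p against a 0-or-1 label, using ln."""
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

p = sigmoid(2.0)
print(p, cross_entropy(p, 1))
```

Everything a basic classifier needs at its output layer reduces to additions, multiplications, exponentials and logarithms: operations a difference engine’s polynomial tables could supply.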

And, more than a century after it was designed, our technique could still do it using the Difference Engine.

Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function), Peter Eckersley. Submitted to arXiv on December 31, 2018.

Ethical considerations in artificial intelligence applications have arguably been present since the birth of the field, if not earlier. Karel Čapek wrote R.U.R., the play that introduced both the word “robot” to our lexicon and the question of whether such creations would *want* to work. Isaac Asimov and Mary Shelley both asked how a society with “natural” and “artificial” people would function.

More recently, as we’ve been able to create things that we can both identify as solving problems and market under the term Artificial Intelligence, the question of AI ethics has reappeared with different specialisations. Should we apply AI techniques to particular fields of endeavour at all, such as autonomous weaponry? Who should be responsible for a decision made, for an example, by an autonomous vehicle? How should such a vehicle make a decision when faced by the classic trolley problem?

Eckersley opens this paper with an aside, important to those who consider AI as an engineering discipline (which is not a clear statement in itself): autonomous vehicle research is a long, long way away from even *contemplating* the trolley problem. Currently, reducing the uncertainty with which obstacles are detected and identified from noisy input signals is a much more pressing issue.

However, *human* drivers are better at object identification, so may be sufficiently advanced to address the trolley problem should it arise in piloting a vehicle. And this is really an important point in thinking philosophically about how AIs should act: we already *have* intelligent actors, and already think about how they act.

Indeed, we already have artificial intelligences, including society-level AIs. A government bureaucracy; an enterprise; a policy document; all of these are abstractions of human intelligence behind machines, rules, and assumptions of how to work in given contexts. Any time that someone relies on external input to make a decision, whether that input is a checklist, a company handbook, a ready reckoner tool or a complex software-built decision support tool, you could replace that tool with an AI system with no change in the ethics of the situation.

Anywhere you could consider an AI making choices that are suboptimal to the welfare of the affected people, you could imagine those choices being suggested by the checklist or ready reckoner. None of this is to minimise the impact or applicability of AI ethics, quite the contrary: if we are only considering these problems now because we think AI has become useful, we are late to the party. Very late: what was the Soviet Gosplan other than a big decision support system trying to optimise a society along utilitarian lines?

The early sections of this paper summarise earlier proofs that it is impossible to define an objective utility function that will optimise for multiple, conflicting output dimensions. Not because it is hard to discover or compute the function, but because an acceptable trade-off may not exist. The proofs take this form:

- imagine a guess at a solution, s.
- there is a rule r1 which, applied to s, yields a better solution, s1.
- there is a rule r2 which, applied to s1, yields a better solution, s2.
- …
- there is a rule rN which, applied to s(n-1), yields a better solution: the original s, closing the cycle.
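Such a cycle is easy to exhibit concretely. Here is a hypothetical Condorcet-style example in Python (the outcomes and rankings are made up for illustration, not taken from the paper): three ranking criteria, each pairwise majority comparison strictly preferring one outcome to another, yet no outcome beats all the rest.

```python
# Three criteria, each ranking three outcomes best-to-worst (a Condorcet cycle):
rankings = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x, y):
    """True if a majority of the rankings place x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

print(majority_prefers("A", "B"))  # A beats B
print(majority_prefers("B", "C"))  # B beats C
print(majority_prefers("C", "A"))  # C beats A: the cycle closes
```

Each comparison looks like an improvement, but following the “better than” arrows forever never reaches a best outcome: exactly the situation the impossibility proofs describe.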

In other words, the different states that represent solutions *cannot* be “totally ordered” from “best” to “worst”; a cycle exists such that the question “which is the best” is paradoxical. While that has clear impact on AIs that use objective utility functions to choose solutions to their inputs, the wider implication is that *no utilitarian system* can objectively choose the “best” arrangement for welfare.

If you’re building the Multivac from Isaac Asimov’s *Franchise*, then I’m going to have to disappoint you now: it’s impossible. But so is having a room full of people with slide rules and 383 different folders containing next year’s production targets. It’s not *building an AI to optimise for the utility function* that cannot be done; it’s *finding a maximum on the utility function*.

A few techniques for addressing this problem are discussed and rejected: we could ignore the problem on the basis that our AIs are not currently important enough to have huge impact on social welfare. But AI *is* being used in welfare systems, in criminal justice systems, in financial management. And even where *AI* is not used, *corporations* and *bureaucracies* are already being applied, and these will have their decision support tools, which will eventually include AI.

Similarly, simplifying the problem is possible, but undesirable. You could define some kind of moral exchange rate to reduce the dimensionality of your problem: this much wealth in retirement is worth this much poverty today; this much social equality can be traded for this much value of statistical life. Or you could accept one or more of the undesirable properties of a solution as axiomatic; we can solve for inequality if we accept poverty, perhaps. Neither of these approaches is ethically unambiguous.

Ultimately, Eckersley concludes that the only feasible approach is to relax the requirement for objective, total ordering of the possible outcomes. One way is to introduce partial ordering of at least two of the comparison dimensions: to go from saying that a solution is { better than, worse than, the same as } another along some axis to saying it is { better than, worse than, the same as, incomparable to } the other solution. While this breaks the deadlock it’s difficult to work with, and domain-specific interpretations of partial ordering will be needed. And now we’re back at the trolley problem: there are multiple outcomes to choose from, we don’t know (because it’s impossible to say) which is better, and yet a decision must be taken. Which?
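The four-valued comparison can be sketched as a Pareto-style partial order. This toy Python model (my construction, not the paper’s) compares outcomes along two welfare dimensions and only declares one better when it is at least as good on both:

```python
def compare(a, b):
    """Partial-order comparison of two outcomes scored on two dimensions.
    'better' only when better-or-equal on both axes; otherwise 'incomparable'."""
    if a == b:
        return "same"
    if a[0] >= b[0] and a[1] >= b[1]:
        return "better"
    if a[0] <= b[0] and a[1] <= b[1]:
        return "worse"
    return "incomparable"

print(compare((3, 1), (2, 0)))  # dominates on both axes
print(compare((3, 1), (1, 4)))  # better on one axis, worse on the other
```

The `"incomparable"` branch is precisely the deadlock-breaker, and precisely the case where a decision still has to be taken somehow.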

A second model replaces the definite ordering of solutions with probabilistic ordering. Rather than the impossible statement “outcome B is better than outcome A”, the pair of statements “outcome B has probability p of being better than outcome A” and “outcome A has probability (1-p) of being better than outcome B” are given. Now a system can find the outcome with the *highest likelihood* of being best, or pick from some with *sufficient likelihood* of being best, even though it is still impossible to find *the* best. There will always be some chance of violating a constraint, but those should at least be free of systematic bias.
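A minimal sketch of that probabilistic model in Python (the pairwise probabilities here are made up for illustration): score each outcome by its expected number of pairwise wins, then either take the top scorer or keep everything above a threshold.

```python
# Hypothetical pairwise probabilities: p[(x, y)] is the chance x is better than y.
p = {("A", "B"): 0.7, ("B", "A"): 0.3,
     ("A", "C"): 0.6, ("C", "A"): 0.4,
     ("B", "C"): 0.8, ("C", "B"): 0.2}
outcomes = ["A", "B", "C"]

def expected_wins(x):
    """Expected number of pairwise comparisons that outcome x wins."""
    return sum(p[(x, y)] for y in outcomes if y != x)

# Pick the outcome with the highest likelihood of being best...
best = max(outcomes, key=expected_wins)
print(best)
# ...or keep every outcome with a 'sufficient likelihood' of being best.
good_enough = [o for o in outcomes if expected_wins(o) >= 1.0]
print(good_enough)
```

Choosing randomly among the `good_enough` set, rather than always taking `best`, is one way to keep the residual constraint violations free of systematic bias.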

If your AI strategy is based on the idea that your domain can be “solved numerically”, by finding some utility function or by training an AI to approximate the utility function, then you need to ask yourself whether this is true without introducing some biases or unethical choices into the solutions. Consider whether your system should find what is “best” and do it, or find actions that are “good enough” and either choose between them, or present them for human supervision.
