Structured Pruning of Deep Convolutional Neural Networks, Sajid Anwar et al. In the ACM Journal on Emerging Technologies in Computing special issue on hardware and algorithms for learning-on-a-chip, May 2017.

## Notes

Quick, a software engineer mentions a “performance” problem to you. What do they mean?

This is, of course, an unfair question. There are too many different ideas that all get branded “performance” for us to know what we are trying to solve. This paper is simultaneously about two different flavours of performance.

On the one hand, the “performance” of a neural network is related to its ability to perform its task: the correctness of its inferences. There isn’t really a good way to know what neural network configuration will perform efficiently for the (unknown) function you want it to approximate. Therefore, the rule of thumb is find a network that’s too complex, and stop training it when it begins to overfit (when its performance starts to degrade, because it’s being too specific about whether an input looks like an example from the training set rather than whether it shares important features).

Now we meet the other kind of performance: the amount of resources consumed to do the work. A large neural network needs to do a lot of computations with a lot of numbers to classify an input, and that means using a lot of processor cycles and a lot of memory. Because our approach to designing the network was to overspecify it, we are using more computer than we need. But if that computer is relatively low-specification and battery operated—a mobile phone for example—this may render our solution unusable.

So, how can we turn a complex neural network into a simpler neural network? While this isn’t totally satisfying, the answer is: “guess”. Turn off bits of the network (i.e. set weights to zero), and see whether it still classifies (performs) well. This act of turning bits of the network off is called pruning.

Ironically some previous work in this space has actually not been great for performance (the resource kind). You can “unroll” convolutional layers (commonly found in image-classifying networks) into matrix multiplications, and you can turn that into a sparse matrix by approximating all small weights with zero and only storing the non-zero values and their locations. But now, even though you have fewer calculations, you may have *more* memory accesses in trying to solve where the weights should be used. And that could be slower than not reducing the network.

The work in this paper takes a structured approach to pruning the network. Whole feature maps (scores indicating whether particular characteristics of an image were found, and where, in the input image) can be removed, the network retrained, and the performance (ability to classify) measured afterwards. At smaller scales, the kernels can be pruned in particular deterministic ways, replacing a full weights matrix with a start index, a “stride” (gap between each non-zero value) and the list of non-zero weights. The different possibilities are explored using a combination of random generation and evolutionary iteration; networks that have a misclassification rate within given tolerance the original are kept into subsequent generations.

The results seem promising. With pruning at both levels of abstraction, the resulting network is just as deep (it contains as many layers) but it has fewer nodes at each layer and fewer connections between nodes. The systematic pruning approach means that the resulting networks are smaller in memory and faster in use: CPU time measurements are down approximately two thirds when compared with the initial, unpruned network.

However, be careful when interpreting the graph: the authors are showing the reduced execution time of the *unrolled matrix multiplication* for *one layer* of *one network configuration*. It is not clear what this means for overall behaviour of the network, what the misclassification rate of this network was (they show a tolerance cutoff at 4%, which may be too high for a given use case), or in general how the CPU time savings vary with network topology. In other words, we have a single graph, and don’t know how to generalise it.

I hope that at some point a sound theoretical basis for choosing the architecture for a neural network to solve a given problem will be developed. In fact, I sort of hope that it exists now, and that I haven’t found it. I don’t think so: for the moment, whack-a-mole is the state of the art, and this paper shows you can whack quite a lot of moles and still get reasonable results.