On squeezing out that last ounce of performance

As I get confused by a component of an application that should be network-bound actually being limited by CPU availability, I get reminded of the times in my career that I’ve dealt with application performance.

I used to work on a platform for distributing MMS and SMS messages, written using GNUstep, Linux and PostgreSQL. I had a duplicate of the production hardware stack sitting in a data centre in our office, which ran its own copy of the production software and even had a copy of the production data. I could use this to run simulations or even replays of real events, finding the locations of the slowdowns and trying various hypotheses to remove the bottlenecks.

My next job was working on antivirus. The world of antivirus evaluations is dominated by independent testers, who produce huge bake-off articles comparing the various products. For at least one decade the business of actually detecting the stuff has been routine, so awards like VB100 are meaningless. In addition to detection stats, analysts like VB and AV-comparative measure resource consumption, and readers take those measurements seriously.

That’s because they don’t want to use anti-virus, and they didn’t pay for their RAM and CPUs to waste it on software they don’t want to use. So given a bunch of apps that all do the same thing, they’ll look at which does it with less impact on everything else. This means that performance is an important requirement of new projects in AV software: on the product I worked on we had a defined set of performance tests. A new project release could not— regardless of how shiny the new features were—ship if the tests took 5% or more time or RAM than the current shipping version. On like hardware. What that really means is that due to developments in hardware, AV software was getting monotonically faster up until a few years ago.

Since then, my relationship with performance optimisation has been more sporadic. I’ve worked on contracts to speed up iOS apps, and even almost took a performance analysis and improvement job on a mobile phone operating system team. But what I usually do is make software work (and make it secure), with making it work in such a way that people can actually use it being a part of that. Here, then, are my Reflections On Making Efficient Software™.

Start at the beginning.

You may have heard the joke about a man on a driving holiday who gets lost and asks a local for directions. The local thinks for a bit, and says “well to get to where you’re going, I wouldn’t start from here if I were you”.

Performance analysis can be like that, too. If you build up all of the functionality first, and optimise it later, you will almost certainly not get a well-performing product. Furthermore, fixing it will be very expensive. You may be able to squeeze a few kilobytes out here, or get a couple of percent speed increase there, but basically the resources used by your app will not change much.

The reason is simple: most of the performance characteristics of your app are baked into the top-level architecture. You need to start thinking about how your app will perform when you start to design what the various parts do and how they fit together. If you don’t design out the architectural bottlenecks, you’ll be stuck with them no matter how good your implementation is.

I’ve been involved with projects like this. It gets to near the ship date, and the app works but is useless because it consumes all of the RAM and/or takes too long to do any work. The project manager gets a developer in (hello!) to address the performance issues. After a few weeks, the developer has managed to understand the code, do some analysis, and improved things by a couple of percent. It’s gone from “sucky” to “quite sucky”, and the ship date isn’t any further away.

An example: If you build a component that can process, say, ten objects per second, then hands them on to another component that can display results at 100 objects per second, you’re always limited to 10 Hz give or take. You might get 12, you’ll never get 100 without replacing the first component or the whole app. Both of these options are more expensive after you’ve written the component than before. Which leads me on to the next top tip…

Simulate, simulate, simulate

So how are you supposed to know how fast your putative architecture is going to run on paper? That’s easy: simulate each component. If you believe that you’ll usually get data from the network at, say, 100 objects/sec, then write a driver that sends fake objects at about 100 Hz. If you think that might spike at 10,000 objects/sec, then simulate that spike too. You’ll be able to see what it takes to develop an app that can respond to those demands.

What’s more, you’ll be able to drop your real components into the simulated environment, and see how they really handle the situations you cook up. You can even use these harnesses as an integration test framework at the intermediate level (i.e. larger than classes, smaller than the whole app).

Your simulated components should use the same interfaces to the filesystem and each other that the real code will use, and the same frameworks or libraries. But they shouldn’t do any real work. E.g if you’ve got a component that should read a JSON stream from the network, break it into objects, do around 10ms of work on each object and post a notification after each one is finished, you can write a simulation to do that using the JSON library you plan on deploying, the sleep() function and NSNotificationCenter. Then you can play around with its innards, trying out operation queues, dispatch queues, caching and other techniques to see how the system responds.

Performance isn’t all about threads

Yes, Apple has a good concurrency programming guide. Yes, dispatch queues are new and cool. But no, not everything is sped up by addition of threads. I’m not going to do the usual meaningless micrometrics of adding ten million objects to a set, because no-one ever does that in real code.

The point is that doing stuff in the background is great for exactly one thing: getting that stuff off of the UI thread. For any other putative benefits, you need to measure. Perhaps threading will speed it up. Perhaps there aren’t any good concurrent algorithms. Perhaps the scheduling overhead will get in the way.

And of course you need to measure performance in an approximation to the customer’s environment. Your 12-core Mac Pro probably runs your multithreaded code in a different way than your user’s MacBook Air (especially if the Pro has a spinning disk). And the iPad is nothing like the simulator, of course.

Speed, memory, work: choose any two

You can make it faster and use less RAM by not doing as much. You can make it faster and do the same amount of work by caching. You can use less RAM and do the same amount of work by reducing the working set. Only very broken code can gain advances in speed and memory use while not changing the outcome.

Measure early and measure often

As I said at the top, there’s no point baking the app and then sprinkling with performance fairy dust at the end. You need to know what you’re aiming for, whether it’s achievable, and how it’s going throughout development.

You must have some idea of what constitutes acceptable performance, so devise tests that discover whether the app is meeting those performance requirements. Run these tests periodically throughout development. If you find that a recent change slowed things down, or caused too much memory to be used, now is a good time to fix that. This is where that simulation comes in useful, so you can get an idea about the system’s overall performance even before you’ve written it all.

Start at the beginning.

Simulate, simulate, simulate

Performance isn’t all about threads

Speed, memory, work: choose any two

Measure early and measure often

About Graham

OOP the Easy Way

APPropriate Behaviour

APPosite Concerns

FSF