NanoToolkit Blog

The Way to Keep in Touch.

Achieving Software Reliability That Parallels a Computer Fan

There is something to be said for a low-voltage fan that can run for years without any maintenance. These fans often come in 12 to 24 volt flavors. They cool the computer power supply that converts AC power to DC, they cool the CPU and GPU, and they keep air flowing through the computer case. They usually cost no more than a dollar, and their mean time to failure is three to four years. If there were a Reliability Hall of Fame, I think an electric motor that runs at a near-constant RPM would deserve to be the first inductee.

Image: a typical computer fan you would find if you opened your computer case.

Historically, we have not seen many software packages that do this: software with a simple interface that other components can talk to, that is reliable, and that runs for years without any kind of babysitting.

Perhaps the only way to come close to this sort of reliability in software, with a mean time to failure measured in years, is to limit how many things the software package does. Software should be designed to do one thing and one thing only, and later be integrated with other software components to achieve more.

Can we point to pieces of software that achieve this high reliability by doing as little as possible? Think about an operating system boot loader. It is built around one function: a small piece of code is placed in a reserved section of the hard drive, and the hardware loads that code into a specific section of memory and runs it. As long as the underlying hard drive does not fail, the boot loader continues to work.

An example of less reliable software would be a mail server. It can stop functioning when too many people connect to it, when the throughput from spammers increases, when the system runs out of space …

Think about the piece of software that puts your computer or phone display into low-power mode, either by dimming it or by putting it on standby. This software is highly reliable because it does one specific thing. When your system has been inactive for a prolonged period (no keystrokes or mouse activity), it activates itself and partially shuts off the screen.

I think a lesson we can draw is that software expected to do a lot, such as implementing accounting business logic, has a hard time achieving robustness. Most pieces of software are in fact reliable. That is, they behave correctly under expected conditions, which for accounting software would mean that you enter all the figures correctly. Unfortunately, most pieces of software are not exactly robust, which means that when some unexpected condition occurs they misbehave. The worse news is that they stay in a persistent mode of misbehaving once a failure occurs. For instance, our mail server will definitely fail to deliver mail if it is out of space. But can our software at least remain reliable when it has failed to be robust? What do I mean? Well, my mail server has run out of space and it can’t deliver SMTP traffic. But it should still support the IMAP, POP or ActiveSync protocols so I can access my existing mail.
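To make the mail server point concrete, here is a minimal sketch in Python. It is a toy model, not how any real mail server is written: the class name MailServer, its methods, and the byte-counting notion of "space" are all assumptions made up for illustration. The idea it shows is that a full disk should only disable the part that needs space (delivery), while reading existing mail keeps working.

```python
class MailServer:
    """Toy model (hypothetical): delivery and mailbox access fail independently."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes  # pretend disk size
        self.used_bytes = 0
        self.mailbox = []

    def deliver(self, message):
        """Stand-in for the SMTP side. Returns False when there is no room."""
        size = len(message.encode("utf-8"))
        if self.used_bytes + size > self.capacity_bytes:
            # Refuse this one delivery, but leave the server in a usable state.
            return False
        self.mailbox.append(message)
        self.used_bytes += size
        return True

    def read_all(self):
        """Stand-in for the IMAP/POP side. Needs no free space, so it never
        depends on whether delivery is currently possible."""
        return list(self.mailbox)

    def delete(self, index):
        """Removing mail frees space, which lets delivery succeed again."""
        removed = self.mailbox.pop(index)
        self.used_bytes -= len(removed.encode("utf-8"))
```

With a capacity of 10 bytes, delivering "hello" succeeds and delivering "world!" is refused, yet read_all() still returns the stored message. Once the first message is deleted, the same delivery goes through without restarting anything.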

Going back to the accounting software example: if I entered an incorrect figure somewhere in the third quarter, it should not affect my ability to edit my third-quarter data.

So one thing we like to shoot for is this: if our software enters a state where it is malfunctioning, it should easily come out of that state once the condition that caused the error is removed.

For instance, if a browser crashes when it loads “www.someBadSite.com”, it would make sense for it to have logic that does not load the crashed site the next time it starts. The browser should have a built-in warning system that notifies me that this site has previously caused a malfunction.
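One way such logic could look is sketched below in Python. This is only an illustration under simplifying assumptions: the CrashGuard class, its method names, and the JSON state file are hypothetical, and it pretends the browser loads one page at a time. The trick is to write down which site is being loaded before loading it, and to clear that note when the load finishes. If the note is still there at the next startup, the previous run must have died while loading that site.

```python
import json
import os


class CrashGuard:
    """Hypothetical helper that remembers which site was loading during a crash."""

    def __init__(self, path):
        self.path = path
        self.state = {"pending": None, "bad": []}
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.state = json.load(f)
        # A leftover "pending" entry means the last run never finished that load.
        pending = self.state["pending"]
        if pending is not None:
            if pending not in self.state["bad"]:
                self.state["bad"].append(pending)
            self.state["pending"] = None
            self._save()

    def _save(self):
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.state, f)

    def is_suspect(self, url):
        """True if this site was being loaded when an earlier run crashed."""
        return url in self.state["bad"]

    def begin_load(self, url):
        """Record the site on disk before attempting to load it."""
        self.state["pending"] = url
        self._save()

    def end_load(self):
        """Clear the record once the load completed without crashing."""
        self.state["pending"] = None
        self._save()

    def forgive(self, url):
        """Let the user clear the warning once the site is known to be fixed."""
        if url in self.state["bad"]:
            self.state["bad"].remove(url)
            self._save()
```

At startup the browser would check is_suspect(url) and show a warning instead of reloading the page automatically. The forgive method matters as much as the rest: it is what lets the software come back out of its cautious mode when the cause of the failure is gone.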