February 22, 2008
Is There a Services Mentality?
There is a lot of discussion at Microsoft these days about the "services mentality", meaning the way programmers who work on services need to think. The implication is that it is different from the packaged software mentality that most developers at Microsoft have grown up with.
Certainly writing a service is different from writing packaged software, but within the realm of packaged software there are also huge differences. Writing a Windows kernel driver is very different from writing an Xbox game, but is it more different to work on a service? As with any software, when working on a service you have a layer above you and a layer below, and you stitch them together while implementing your requirements. You may have some different sorts of requirements, for example that the service can be upgraded without bringing it down, but packaged software also has a wide range of requirements.
I've been asking people about this and I haven't got back much that really felt "different". BUT there is one thing that seems to stand out that really is a change in thinking when working on a service: you have to expect and deal with failure.
If you're working on an Office app and you get a weird error from the video display driver when trying to draw a graph, you will in all likelihood expose that error to the user: either by mysteriously failing, or by popping up an incomprehensible error code. You may realize, in the back of your mind, that some very very small percentage of the time this will actually happen (a one-in-a-million Vista issue will still affect about 100 people), but you still view it as an unnatural event that you can't be expected to accommodate in your code (some errors, of course, are common enough to be handled, such as running out of disk space, but those are the exception, not the rule). So you let the customer figure out what to do (reboot, restart the app, whatever). I don't mean to make us seem callous or uncaring about user-visible errors; certainly you try to handle the common ones, but if you think of all the possible error codes that a Windows API can return, your code can only include specific responses for a small subset.
On a service, given the number of machines you may have in your data center, the distance that your data has to travel and the relative imperfection of the path (network vs. bus), and the knowledge that the customer has no ability to poke your computers, you know you have to handle failures gracefully--and that means ALL failures, not just the ones you can think of.
I consider this a mind-shift because it colors how you solve other problems also. A packaged software person who is given the problem of designing a system that can be updated on the fly may come up with a centrally managed system in which the remaining computers are notified when a subset of the data center is going to be offline, adjust so that they do not even try to contact those machines, and are then told when the machines are back. So the machines go offline but in a controlled way and nothing unexpected ever happens. Meanwhile, somebody designing with the services "expect failure" mentality is going to architect a system that can adjust for machines disappearing and reappearing at any time, so their in-place upgrade solution can be to have the operators hit the power switch on the machines they want to upgrade, upgrade them, and then put them back in service.
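To make the "expect failure" design a bit more concrete, here is a minimal Python sketch of a caller that assumes any machine can vanish at any time. Everything in it is hypothetical: `call_with_failover` and `ReplicaDown` are names invented for illustration, and each "replica" is just a callable standing in for a remote machine.

```python
class ReplicaDown(Exception):
    """Raised by a hypothetical replica that is offline or unreachable."""


def call_with_failover(replicas, request):
    """Try each replica in turn, treating any failure as a routine event.

    `replicas` is a list of callables standing in for remote machines
    (an assumption made for this sketch, not a real service API).
    """
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as exc:  # ALL failures, not just the anticipated ones
            last_error = exc      # remember it and move on to the next machine
    raise RuntimeError("all replicas failed") from last_error


def powered_off(request):
    # Simulates a machine an operator switched off in order to upgrade it.
    raise ReplicaDown("machine is offline")


def healthy(request):
    return "handled " + request


if __name__ == "__main__":
    print(call_with_failover([powered_off, healthy], "request-1"))
```

Nothing here needs to be told in advance that a machine is going away. The caller finds out when the call fails and moves on, which is what lets the operators simply hit the power switch.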
I was thinking about this and I realized that my formative years were spent writing code for this kind of environment. I worked on network transports, and the entire point of a network transport is to recover from mysterious failures: dropped packets, delayed packets, duplicated packets, misordered packets, etc. The mainline path is easy; it's handling the errors that is all the work (I once wrote a sample network transport for the device driver kit that didn't handle any errors; it was about one-tenth the size of the real one). In retrospect I think this was great training as a developer, because it got me thinking in the "services mentality" way back when. And in the future, ideally all packaged software developers would borrow that approach from services, and use the expectation of failure as a way to engineer reliability into software that runs on just one machine.
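As a rough illustration of the transport point, here is a small Python sketch of a receiver that copes with duplicated and misordered packets by tracking sequence numbers. It is a toy under stated assumptions, not the transport I worked on: `InOrderReceiver` is a made-up name, sequence numbers are assumed to start at 0, and dropped packets, retransmission, and timeouts are left out entirely.

```python
class InOrderReceiver:
    """Delivers payloads in sequence order, ignoring duplicates."""

    def __init__(self):
        self.next_seq = 0    # next sequence number we expect to deliver
        self.pending = {}    # out-of-order packets buffered by sequence number
        self.delivered = []  # payloads handed to the layer above, in order

    def receive(self, seq, payload):
        # Duplicate: already delivered, or already buffered and waiting.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver as long a contiguous run as we now have.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1


if __name__ == "__main__":
    receiver = InOrderReceiver()
    # Packets arrive misordered (1 before 0) and duplicated (1 twice).
    for seq, payload in [(1, "b"), (0, "a"), (1, "b"), (2, "c")]:
        receiver.receive(seq, payload)
    print(receiver.delivered)
```

Even in this toy, the mainline path is a single append. The buffering and duplicate checks around it are all error handling.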
Posted by AdamBa at February 22, 2008 07:35 PM
Handling a failure in software typically involves adding new state transitions and often whole new states as well. Handling more than a trivial number of errors typically causes an explosion in state space as compared to a completely error-free link.
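A toy Python sketch shows the growth. The states and events below are hypothetical, invented for a simple send-and-acknowledge link rather than taken from any real protocol. The first table covers only the error-free path; the second adds handling for just two kinds of failure, a timeout and a corrupt acknowledgement.

```python
# Error-free link: two states, two transitions.
HAPPY_PATH = {
    ("idle", "send"): "waiting_ack",
    ("waiting_ack", "ack"): "idle",
}

# The same link once timeouts and corrupt acks are handled.
WITH_ERRORS = {
    **HAPPY_PATH,
    ("waiting_ack", "timeout"): "retrying",
    ("waiting_ack", "corrupt_ack"): "retrying",
    ("retrying", "ack"): "idle",
    ("retrying", "timeout"): "failed",
    ("retrying", "corrupt_ack"): "failed",
    ("failed", "reset"): "idle",
}


def states_of(table):
    """Every state that appears as a source or a destination in the table."""
    return {state for (state, _event) in table} | set(table.values())


def run(table, state, events):
    """Drive the table from a starting state through a list of events."""
    for event in events:
        state = table[(state, event)]
    return state


if __name__ == "__main__":
    print(len(states_of(HAPPY_PATH)), len(HAPPY_PATH))
    print(len(states_of(WITH_ERRORS)), len(WITH_ERRORS))
    print(run(WITH_ERRORS, "idle", ["send", "timeout", "ack"]))
```

Two error types take the table from 2 states and 2 transitions to 4 states and 8 transitions, and a real link has many more than two ways to fail.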
State transitions typically involve non-local flow control in most implementations and if there is one thing that no computer programming language has fully figured out it's how to represent multiple non-local flow control paths in a way that humans can easily understand and hold in their head. This is why software engineers are taught from a very early age to write code that eliminates as much non-local flow control as possible ("goto considered harmful" etc. etc.).
Some languages, such as SDL (Specification and Description Language), tried to solve the state machine problem graphically and there is, of course, the famous State design pattern. However, none of them really succeeded in improving developer productivity all that much. State machines that drive communication links are inherently complex and therefore require complex code to implement and lots of testing to verify.
As Fred Brooks pointed out, there is no silver bullet for inherent complexity. The best way is to avoid it altogether and use components that somebody else has already implemented and tested. Unfortunately, this isn't typically taught to CS undergrads (although it is to EEs!) and has to be learned slowly and painfully.
Posted by: Andrew at February 24, 2008 01:27 PM
I agree state machines are too complex. I once had to debug a protocol that was written entirely as a state machine -- a giant table of state transitions and an engine to drive it. It was a huge nightmare to track down a problem. It seems like inherently specifying it this way *should* be clearer...and maybe it is when you implement it...but debugging and fixing it is another matter.
Posted by: Adam Barr at February 26, 2008 09:30 PM