
October 03, 2007

Fear of a Broken Build

One of the things that drives me slightly bananas at Microsoft is our obsession with not "breaking the build".

Backstory: Ever since a time before I started at Microsoft, projects at Microsoft have used a central server to hold the master copy of the source code (this is variously known as a source code server or source repository). Individual developers will also have a copy of the code on their own machines, so that they can build private copies as they work on changes, but at some point they "check in" their changes to the repository, and those changes then become part of the master copy. Developers will also, as they see fit, perform a "sync" in which they update their local copy of the source with the current version from the repository. And almost every team will create an official build every day, where an official build server syncs a copy of the current master source and builds it. People can run tests on "private" builds (that they built on their own machines) but it is these official builds that wind up being tested heavily, and eventually one of them (the last one) will be the one that we ship.

While it is perfectly normal to have a build fail on your own machine, as you fix typos and whatnot in your code, it is considered a cardinal sin to check in code that doesn't build properly. It will break the official daily build, and it will also break any other developers who happen to sync at a time when your bad changes are still polluting the repository. So it is bad bad bad to check in broken code, and teams have come up with various "incentives" to prevent this. For example, on some teams if you break the build you have to be the person who starts and babysits the official daily build, continuing to do so every day until some other sap breaks it (occasionally this would be made more punitive by requiring that the daily build be started at, say, 6 am; the severity of this punishment was unpredictable, since you never knew when somebody else was going to break the build and replace you. At this point such builds are almost always completely automated and don't require human coddling). In other cases people had to wear funny hats if they broke the build (I'm 100% serious).

This is sort-of OK, since you don't want people to break the build. But somehow we got on this obsession with not breaking the build. The phrase "breaking" makes it sound like you are breaking a crystal vase or something, that once broken a build might never be repairable. That's not true at all: these are just electronic bits that are breaking, and if somebody checks something in that doesn't build, the worst case is you revert their change (and possibly other related changes) and then poke them with a stick a few times and tell them to be more careful next time. More commonly, you just fix the break; a typical break involves one person checking in a change to the parameters to an API around the same time somebody else checks in a new call to that API. Such changes are usually obvious to fix and any developer or tester who notices the problem can fix them.
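
To make that "typical break" concrete, here is a toy sketch in Python. It is only an illustration: the Repo class, the open_file API, and the idea of "building" by comparing argument counts are all made up for this example and do not correspond to any real build system. It shows two changes that each build fine against the same starting point but break once both are checked in.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Hypothetical toy repository: API name -> parameter count, plus call sites."""
    apis: dict = field(default_factory=dict)
    calls: list = field(default_factory=list)  # (api_name, argument_count) pairs

    def copy(self):
        return Repo(dict(self.apis), list(self.calls))


def builds(repo):
    """Stand-in for compiling: every call must match its API's parameter count."""
    return all(repo.apis.get(name) == argc for name, argc in repo.calls)


def change_api(repo):
    """Developer 1: add a parameter to open_file and update the callers they can see."""
    new = repo.copy()
    new.apis["open_file"] = 2
    new.calls = [(n, 2 if n == "open_file" else a) for n, a in new.calls]
    return new


def add_call(repo):
    """Developer 2: add a new call written against the old one-parameter API."""
    new = repo.copy()
    new.calls.append(("open_file", 1))
    return new


base = Repo(apis={"open_file": 1}, calls=[("open_file", 1)])

builds_alone_1 = builds(change_api(base))            # developer 1's private build
builds_alone_2 = builds(add_call(base))              # developer 2's private build
builds_combined = builds(add_call(change_api(base)))  # both changes in the repository

print(builds_alone_1, builds_alone_2, builds_combined)
```

Neither developer did anything careless here; each private build was clean. The fix is the obvious one (pass the second argument in the new call), which is why anybody who notices the break can usually repair it.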

Yes, some breaks are due to carelessness, like just not compiling something, or failing to add a new source file to the repository at all (this is a classic: since the file is present on your machine, your local build will still work, but anybody else's will fail). And people did clever things like work for weeks, make a massive checkin (that didn't build), then immediately go on vacation, leaving others to divine their intent and pick up the pieces. But a lot of build breaks are somewhat predestined to happen and can be cleaned up pretty easily.

Yet we developed a culture at Microsoft to treat build breaks as so terrible that we had to go to great lengths to prevent them--lengths that when added up for every checkin on a project wound up far outweighing the damage from the occasional build break. Furthermore, the set of what constituted a "build break" grew as the definition of what an official build machine did grew larger. Many teams now build for x86 and x64, run static analysis, have a set of Build Verification Tests, etc. It's not that these are bad, but they are also not necessarily as deadly (in the short term) as a classic build break where the thing just won't compile and every developer who syncs is dead in the water until fixed. Yet what happened was all this stuff got added to the definition of "build break", and when you combine that with a prohibition on build breaks in the master copy, what you wind up with is every individual developer going through a lengthy process before each checkin, where everybody has to build for x86 and x64 (and a complete build of the whole system, not just what they have changed), run tests on their own machine, etc. Some teams even instituted a non-deterministic process in which you have to do a sync, then a clean build, then run tests...and then if somebody else checks in during that time, you have to start over again. The cure for build breaks winds up being much worse than the illness. And I mean just in terms of time spent, ignoring silly ideas like making the build breaker come in at 6 am every day until somebody else messed up.
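
A rough way to see why the "start over if somebody else checks in" process is so expensive is the back-of-the-envelope model below. It rests on assumptions that are mine and not measurements: other people's checkins arrive at random at a steady average rate (a Poisson process), and you only discover that you need to restart at the end of a full sync/build/test pass. The rate and duration in the example are hypothetical numbers, not data from any team.

```python
import math


def expected_attempts(checkins_per_hour, validation_hours):
    """Expected number of full sync/build/test passes before one finishes
    with no intervening checkin, assuming Poisson arrivals (an assumption)."""
    # Probability that no one else checks in during one pass.
    p_clean = math.exp(-checkins_per_hour * validation_hours)
    # The number of passes is geometric with success probability p_clean.
    return 1.0 / p_clean


def expected_hours(checkins_per_hour, validation_hours):
    """Expected total time spent, assuming each failed pass costs the full duration."""
    return validation_hours * expected_attempts(checkins_per_hour, validation_hours)


# Hypothetical inputs: a 2-hour pass, one outside checkin every 2 hours on average.
attempts = expected_attempts(0.5, 2.0)
hours = expected_hours(0.5, 2.0)
print(f"{attempts:.2f} passes, {hours:.2f} hours per checkin on average")
```

Under these assumptions the expected cost grows exponentially with both the length of the pass and the checkin rate, so every extra step added to the pre-checkin ritual costs more than its own running time.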

It turns out there is a movement afoot in the industry that addresses some of this, under the somewhat misleading name Continuous Integration. CI is actually more about having a central repository and checking in often, which people at Microsoft already do, but along with it comes the notion of having a central server which does builds and tests and all that very frequently (on every checkin, if you want) and immediately notifies the team if there is a break. This is the part of CI that I'd like to see more of at Microsoft: people are free to check in without a massive amount of work, secure in the knowledge that they will find out quickly if they broke something. It's not that they just check in whatever they want without verifying that it builds; it's more an attitude about empowering developers to decide on their own how much building and testing they need to do individually before they check in. A break is still treated as bad; in fact the entire team is supposed to stop whatever they are doing until it is fixed. But it is recognized as something that is not worth knocking yourself out to avoid, because in the end it's not too painful when it happens.
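
The central-server part of this can be sketched in a few lines of Python. This is a minimal sketch and not how any particular CI product works: run_ci, build_and_test, and notify are hypothetical names, and the checkin numbers and results are invented for the example. The point it illustrates is that the checkin has already landed when the server builds it, and the team hears about a break (and about the fix) right away.

```python
def run_ci(checkins, build_and_test, notify):
    """Build after every checkin and tell the team when the state changes.

    checkins: checkin identifiers in the order they landed in the repository.
    build_and_test: callable returning True if the tree is good at that checkin.
    notify: callable taking a message string (for example, something that sends mail).
    Returns the checkin that broke the build if it is still broken, else None.
    """
    broken_since = None
    for checkin in checkins:
        ok = build_and_test(checkin)
        if not ok and broken_since is None:
            # First failing checkin: everybody stops until this is fixed.
            broken_since = checkin
            notify(f"BROKEN by checkin {checkin}: stop and fix")
        elif ok and broken_since is not None:
            notify(f"FIXED at checkin {checkin} (broken since {broken_since})")
            broken_since = None
    return broken_since


# Invented history: checkin 102 breaks the build, 104 repairs it.
results = {101: True, 102: False, 103: False, 104: True}
messages = []
still_broken = run_ci([101, 102, 103, 104], results.__getitem__, messages.append)
print(messages, still_broken)
```

Note that nothing in the loop rejects a checkin; deciding how much to build and test beforehand is left to the developer, and the server's only job is to make a break visible quickly.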

This is something of a return to the old days; if you read my story about breaking the build in front of Dave Cutler you will see that back then we had complete freedom to decide how much testing we did on our checkins. Obviously in that case I went too far (nobody would seriously suggest checking something in without at least compiling the changed files), and in some sense the overburdened checkin processes of today are paying for the sins of the past, committed by goofballs like myself. But technology is also helping here; back then there was a single master repository, so my build break (for the 30 seconds it existed) would have hit everybody on the team who did a sync then. Nowadays we have levels of repository, so when you check in you are likely only checking in to a repository that affects a handful of developers. So in that situation, certainly, I think the Continuous Integration mindset is the way to go.

Posted by AdamBa at October 3, 2007 04:31 PM

Comments

In the good/bad old days, I think the culture was a bit different. If one developer (let's call him Steve) broke the build, another developer (Let's call him Mark) would often dive in and clean up the mess. Even if Mark didn't know anything about what Steve's checkin was meant to do, he would, as you say, divine the intent, fix it, and check it in. Mark gets to be a cowboy hero, Steve gets a lot of ribbing, and the project lurches forward.

Nowadays I see less of that. If Steve breaks the build, then Steve better fix it. Not Mark's fault, he has other work to do. Developers feel responsibility for their individual feature (or part of a feature) rather than owning the whole product. And the cowboy/heroes have often moved on since The Process frustrated them.

The other problem is that the impact of build breaks is different at different parts of the product cycle. Early on, developers are writing lots of code, things are changing rapidly. Tests are being written and rewritten. Things are bound to break but it's not that big a deal. As you say, it gets fixed and we move on. Later in the project it does become a big deal. Deadlines are looming, the build is broken, testers can't run their tests, developers can't validate their fixes, PMs can't set up their demos. Precious hours and days are wasted due to build breaks, the product is slipping, and of course somebody imposes The Process to prevent this and keep the project crawling steadily to the finish line. And it is rare that somebody invents a Process flexible enough to account for the different phases of the product. Once a Process is in place, it has a natural tendency to grow over time.

Posted by: John Vert at October 3, 2007 07:07 PM

But do you agree that in the old days, having "Mark" feel he could fix the build was better than just waiting for "Steve" to do it? That is one of the ideas of Continuous Integration, that anybody can fix the build. I mean, yes it was irresponsible of "Steve" to check in that massive change and then go on vacation, but the question of how to stop "Steve" from doing that is separate from what to do once he does.

I don't know if it's a bigger deal to break the build later in the process. I mean, at the very endgame of course you are very careful with checkins--but at that point, when you are taking one checkin at a time, everybody is waiting for the checkin/build/test to be done, so it doesn't really matter much where it is built and tested. What I hate is seeing people early in the cycle stuck with 2-4 hours (or more) of process overhead for every checkin--lots of teams work like that rather than saying "it gets fixed and we move on".

And I don't think a build break will block the project for days--maybe hours, but probably not if you have a quick way to figure out if it is broken (like a central server doing builds). Late in the game checkins are more critical but they are also much more likely to get watched carefully and breaks detected quickly.

- adam

Posted by: Adam Barr at October 3, 2007 10:12 PM

Can't you get rid of the garbage in the previous "comment"?

Posted by: marble chair at October 4, 2007 03:53 PM

Yes. Although now since I deleted it, it looks like you were referring to MY comment.

- adam

Posted by: Adam Barr at October 4, 2007 07:42 PM

I think the build should always be in a known good state. And when it is not in a known good state, everybody should feel personally empowered (and responsible) to fix it by any means necessary. If that's what "Continuous Integration" means, I'm all for it.

I also think there is no single process which makes sense throughout the entire project cycle. You need flexibility and you need a process lever you can adjust.

I think we are totally in agreement on this - you just have fancy names like "Continuous Integration" :-)

Posted by: John Vert at October 4, 2007 10:51 PM

Most groups in Microsoft are actually using an automated system which runs build (multiple flavors) / static analysis / unit tests / BVTs / checks other quality "gates" prior to checkin. The developer executes something like "snap submit", fills in a form that describes the change (what's it about, what bugs does it resolve, who code-reviewed it, etc.), and the system does all the rest. If the change is "good" the code gets checked in automatically.

One of these systems actually won an EE award -- search msw for gauntlet or SNAP.

Posted by: Ziv Caspi at October 14, 2007 02:29 PM

Those can be good, but the problem is that the checkin is not really "done" until all those central tests pass (unlike CI where it really is checked in and then we decide if it is OK later). So you have to be prepared for the checkin to get kicked back to you. Also, a bunch of checkins are tested together, so your checkin might wind up getting rejected due to someone else's mistake...which leads to it being considered "very bad" to break the central pre-checkin build...which means people do lots of work on their own machine before each checkin...which leads you back to the original problem of it being so time-consuming for every checkin.

- adam

Posted by: Adam Barr at October 20, 2007 08:54 AM