A couple of days after the now infamous Zune disaster struck, the dogpile began. I don't mind critical bloggers, and I certainly don't mind customers complaining about product issues. As a software quality professional, I encourage both. But I tell... Esther Schindler's recent blog post on the Zune disaster really rubs me the wrong way!
Schindler's stance is that this is an obvious bug which everyone should have caught. That's simply wrong, and it's outrageously unprofessional to make that statement! Someone as experienced as she purports to be should have the courtesy to be honest in their analysis. We've been building software since, what, the sixties? Since 1970, there have been ten leap years, and there were five in the past eighteen years. What this means should be obvious: if this were a low-hanging, "obvious" bug, it would have popped up already. It would be on our common radar and we would be testing for it.
Another issue I take with Schindler's rant is her claim that the developer who wrote this bug is 'still out there, in the wild, writing more bugs of the same nature'. Yeah, I will grant you that there are many developers writing stupid bugs like this. In my last organization, no fewer than three developers wrote the SAME bug in three different releases (it was a URL information disclosure/escalation bug). It was a silly bug and should never have been written once, let alone thrice. But in almost thirteen years at Microsoft, I've never seen an engineer last who has made repeat mistakes, period. Fool me once, shame on me. Fool me twice, this way is the door! Her concerns that a similar bug will be lurking in Microsoft's next software are just hype to sell columns. There's little or no substance to them.
She also complains that the bug is simple date math. I think she's jumping to inexperienced conclusions here--first of all, I see little in her resume to lead me to believe she's an active developer qualified to typecast a bug as a date math defect. Secondly, all we know is that, due to being the last day in a leap year, the Zune froze. One driver from one component used in one Zune release froze. Is it date math, or something else? Only the Zune team really knows and it would be presumptuous for anyone else to claim they knew the cause.
The first lesson, she says, is that this is a failure of the development and QA process. Uh, duh. No kidding! The second lesson she learned is that we need to learn from history (of course, there's no real history to learn from here except that people make mistakes). The third lesson is that the developer is still working. Uh, duh... If we fired every developer who wrote a bug, we'd just have testers left looking at their hands! If we were to lose our jobs over mistakes, we'd all be unemployed! I'm not sure exactly what Schindler's looking for here, but I can tell you I don't like the direction she's going.
Schindler's post is mindless rabble-rousing, aimed at stirring up a crowd (and keeping her name in print). It's also rife with wrong conclusions and, if her assertions were taken to heart, the result would be drastically unrealistic.
But let's ask the question: should Microsoft, and the engineering community within and without, be worried about this defect? Yes! What are the real take-aways from this issue? Read on...
First: Update the Test Repository
The first take-away here is that this is a bug with high potential repeatability. While this is the first known instance of this defect, it does seem possible that it could happen again. So lesson learned - update the test repository. Next time you test leap year handling, be sure to cover not just Feb 28, 29, March 1, 2 but also December 31.
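To make that concrete, here is a minimal sketch of how the expanded date list could be generated. It is an illustration only: the helper name leap_boundary_dates is my own invention, and nothing here comes from the Zune team's actual test repository.

```python
import calendar
from datetime import date, timedelta


def leap_boundary_dates(year):
    """Return the dates around the leap-day and year-end boundaries of `year`.

    Hypothetical helper for building a test list; not taken from any real
    test repository.
    """
    # Last day of February depends on whether the year is a leap year.
    last_feb = 29 if calendar.isleap(year) else 28
    anchors = [date(year, 2, last_feb), date(year, 12, 31)]

    dates = set()
    for anchor in anchors:
        # One day before each boundary, the boundary itself, and two days after.
        for offset in (-1, 0, 1, 2):
            dates.add(anchor + timedelta(days=offset))
    return sorted(dates)


if __name__ == "__main__":
    for d in leap_boundary_dates(2008):
        # tm_yday is the day of the year (1-based).
        print(d.isoformat(), d.timetuple().tm_yday)
```

For 2008 the list covers Feb 28, Feb 29, March 1 and 2, plus December 30 and 31 and the first two days of 2009. December 31, 2008 is day 366 of the year, which is exactly the value a non-leap year never produces.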
More than updating the repository with this specific test case, testers need to extrapolate a series of test cases. There's something happening here. Schindler might, in fact, be right that it's a math bug. It might also just be a buffer overrun (365 days were used up by Dec 30) or something else. The point is, there's a class of bug to be researched and investigated. It's around date management, date accounting, infrequent and semi-regular change and similar topics. And you can bet that both the developer and the feature tester on this component are working on this!
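One way to go after the whole class rather than a single date is to sweep every day in a range of years and compare the component's answer against a trusted oracle. The sketch below assumes a component that turns a day count into a (year, day-of-year) pair. The epoch, the function names and the deliberately naive converter are all hypothetical, made up for illustration; I am not claiming this is what the Zune driver does.

```python
from datetime import date, timedelta

EPOCH = date(1980, 1, 1)  # hypothetical epoch, chosen only for illustration


def reference_convert(days_since_epoch):
    """Oracle: let the standard library do the calendar math."""
    d = EPOCH + timedelta(days=days_since_epoch)
    return d.year, d.timetuple().tm_yday


def naive_convert(days_since_epoch):
    """Hypothetical component under test that assumes every year has 365 days."""
    return EPOCH.year + days_since_epoch // 365, days_since_epoch % 365 + 1


def sweep(convert, first_year, last_year):
    """Check convert() against the oracle for every day from Jan 1 of
    first_year through Dec 31 of last_year; return the failing day counts."""
    start = (date(first_year, 1, 1) - EPOCH).days
    stop = (date(last_year + 1, 1, 1) - EPOCH).days
    return [n for n in range(start, stop) if convert(n) != reference_convert(n)]


if __name__ == "__main__":
    failures = sweep(naive_convert, 1980, 1980)
    for n in failures:
        print("mismatch on", EPOCH + timedelta(days=n))
```

With this made-up converter, the only mismatch in 1980 falls on December 31, the 366th day of a leap year. A sweep like this is cheap to run and does not depend on a tester guessing the right date in advance.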
Second: Component Testing
Bill Gates' vision for Trustworthy Computing has always included the concept of distributed programming, where the application model is heavily object-oriented and componentized. The mistake many testers and organizations make is they only want to perform system or functional testing on the completed product. One of the key lessons of Agile and XP development methodologies is that you break things up into pieces and test them. According to the press release, this bug is at the component level and that's where it should be caught.
Organizations need to reconsider their risk-based approach, and ask if component-level testing (unit testing, functional testing: different names for the same activity) would be more effective. Component-level testing, combined with rapid feedback (i.e., daily builds with testing involved), will prevent defects from creeping in at the component level, and might have caught an issue like this one.
Third: Take Time to Think
This third point is a personal mission for me. In the drive to minimize investment, IT organizations are constantly squeezing test. Development can overrun dates willy-nilly, but it's always a requirement for the test org to 'make it up'. Hey, it's business and we need to hit deadlines. But applying the same effort at the development level (in this case, the effort of increasing discipline to stay on schedule) as the test team is expected to apply (in this case, the effort of catching up when dev has blown the unrealistic schedule) could prevent the need for herculean test effort, and would yield more time for the test organization to think about and improve upon strategy.
Why does this matter? Finding defects like this requires creativity. It requires looking beyond functional test cases and thinking about 'what could go wrong'. Schindler's making the same mistake many people make - overlooking the fact that testers have to catch all of the bugs, whereas the user only needs to experience one. Yes, we sign up for this when we enter a career in quality, but that doesn't change the fact that there's a disparity in expectations. A highly successful, high-quality release cannot be achieved if you force-march your test team and cut corners.
Most IT organizations get away with this approach because their customers are internal and, therefore, are more forgiving. But this kind of bug is very similar to security defects. They're difficult to find, require a lot of creativity, and take time to root out.
So, a great big raspberry to Esther Schindler for a poorly-written article on the Zune incident. True, customers were served poorly. More importantly, the engineering community needs to take lessons here. Her lessons? I question the logic used in drawing her conclusions and I really feel her posting was a populist knee-jerk lacking any substantive value for the engineering community.
At the same time, there is plenty to be learned, and learn we'd better! Let's build off this new class of defect, let's approach our testing from a more component-level perspective, and let's think about how much time testing is given to be creative.
Full disclosure: yes, I work at Microsoft as a Senior SDET Lead. But I'm critical of my employer where it makes sense. If I were the manager of the Zune test org (and I know several people in the org), I'd be having the same conversation with them that I wrote in this blog posting.