Tuesday, January 6, 2009

The three REAL software testing/software quality lessons learned from the Zune incident

A couple of days after the now infamous Zune disaster struck, the dogpile began. I don't mind critical bloggers, and I certainly don't mind customers complaining about product issues. As a software quality professional, I encourage both. But I tell... Esther Schindler's recent blog post on the Zune disaster really rubs me the wrong way!

Schindler's stance is that this is an obvious bug which everyone should have caught. That's simply wrong, and it's outrageously unprofessional to make that statement! Someone as experienced as she purports to be should have the courtesy to be honest in her analysis. We've been building software since, what, the sixties? Since 1970, there have been ten leap years, and there were five in the past eighteen years. What this means should be obvious: if this were a low-hanging, "obvious" bug, it would have popped up already. It would be on our common radar and we would be testing for it.

Another issue I take with Schindler's rant is that the developer who wrote this bug is 'still out there, in the wild, writing more bugs of the same nature'. Yeah, I will grant you that there are many developers writing stupid bugs like this. In my last organization, no fewer than three developers wrote the SAME bug in three different releases (it was a URL information disclosure/escalation bug). It was a silly bug and should never have been written once, let alone thrice. But in almost thirteen years at Microsoft, I've never seen an engineer last who has made repeat mistakes, period. Fool me once, shame on me. Fool me twice, this way is the door! Her concerns that a similar bug will be lurking in Microsoft's next software are just hype to sell columns. There's little or no substance to them.

She also complains that the bug is simple date math. I think she's jumping to inexperienced conclusions here--first of all, I see little in her resume to lead me to believe she's an active developer qualified to typecast a bug as a date math defect. Secondly, all we know is that, due to being the last day in a leap year, the Zune froze. One driver from one component used in one Zune release froze. Is it date math, or something else? Only the Zune team really knows and it would be presumptuous for anyone else to claim they knew the cause.

Her first lesson is that it's a failure of the development and QA process. Uh, duh. No kidding! The second lesson she learned is that we need to learn from history (of course, there's no real history to learn from here except that people make mistakes). The third lesson is that the developer is still working. Uh, duh... If we fired every developer who wrote a bug, we'd just have testers left looking at their hands! If we were to lose our jobs over mistakes, we'd all be unemployed! I'm not sure exactly what Schindler's looking for here, but I can tell you I don't like the direction she's going.

Schindler's post is mindless crowd-mongering, aimed at stirring up a crowd (and keeping her name in print). It's also rife with wrong conclusions and, if her assertions were taken to heart, the result would be drastically unrealistic.

But let's ask the question: should Microsoft, and the engineering community within and without, be worried about this defect? Yes! What are the real take-aways from this issue? Read on...

First: Update the Test Repository

The first take-away here is that this is a bug with high potential repeatability. While this is the first known instance of this defect, it does seem possible that it could happen again. So lesson learned - update the test repository. Next time you test leap year handling, be sure to cover not just Feb 28, 29, March 1, 2 but also December 31.

More than updating the repository with this specific test case, testers need to extrapolate a series of test cases. There's something happening here. Schindler might, in fact, be right that it's a math bug. It might also just be a buffer overrun (365 days were used up by Dec 30) or something else. The point is, there's a class of bug to be researched and investigated. It's around date management, date accounting, infrequent and semi-regular change and similar topics. And you can bet that both the developer and the feature tester on this component are working on this!
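To make the idea of extrapolating test cases a little more concrete, here is a minimal sketch in Python of how a test repository entry might generate the dates worth exercising for any given year. The function name boundary_dates and the particular choice of dates are my own assumptions for illustration; nothing here comes from the Zune team's actual test suite.

```python
import calendar
from datetime import date


def boundary_dates(year):
    """Return (date, day_of_year) pairs worth exercising for the given year.

    Covers the usual February/March leap-day transitions plus the last
    two days of the year, since Dec 31 is day 366 only in a leap year.
    """
    candidates = [(2, 28), (3, 1), (3, 2), (12, 30), (12, 31)]
    if calendar.isleap(year):
        # Feb 29 only exists in leap years, so add it conditionally.
        candidates.append((2, 29))
    days = sorted(date(year, month, day) for month, day in candidates)
    return [(d, d.timetuple().tm_yday) for d in days]


# Print the candidate dates for a leap year and for the year after it.
for test_year in (2008, 2009):
    for d, day_of_year in boundary_dates(test_year):
        print(d.isoformat(), "day of year:", day_of_year)
```

The day-of-year value is included because it makes the interesting case visible: the same calendar date, December 31, lands on a different day number depending on the year, and that is exactly the kind of infrequent change a test plan can easily overlook.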

Second: Component Testing

Bill Gates' vision for Trustworthy Computing has always included the concept of distributed programming, where the application model is heavily object-oriented and componentized. The mistake many testers and organizations make is they only want to perform system or functional testing on the completed product. One of the key lessons of Agile and XP development methodologies is that you break things up into pieces and test them. According to the press release, this bug is at the component level and that's where it should be caught.

Organizations need to reconsider their risk-based approach, and ask if component-level testing (unit testing, functional testing: different names for the same activity) would be more effective. Component-level testing, combined with rapid feedback (i.e., daily builds with testing involved) will prevent defects from creeping in at the component level, and might have caught an issue like this one.
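As a rough illustration of what a component-level test around dates could look like, here is a sketch in Python. To be clear, this is not the Zune driver's code, which I don't have; the function year_and_day, the constant EPOCH_YEAR, and the check function are all hypothetical names invented for this example. The idea is simply to compare a small date-conversion component against a trusted reference for every single day across several decades, so that every year end, including the last day of each leap year, gets exercised.

```python
import calendar
from datetime import date, timedelta

EPOCH_YEAR = 1980  # assumed epoch, chosen only for this sketch


def year_and_day(day_number):
    """Convert a 1-based day count from Jan 1 of EPOCH_YEAR to (year, day_of_year)."""
    if day_number < 1:
        raise ValueError("day_number must be 1 or greater")
    year = EPOCH_YEAR
    while True:
        days_in_year = 366 if calendar.isleap(year) else 365
        if day_number <= days_in_year:
            # Every pass either returns here or consumes a whole year below,
            # so the loop always makes progress and always terminates.
            return year, day_number
        day_number -= days_in_year
        year += 1


def check_against_reference(total_days=366 * 40):
    """Compare the component with the standard library for every day in range."""
    epoch = date(EPOCH_YEAR, 1, 1)
    for offset in range(total_days):
        expected = epoch + timedelta(days=offset)
        actual = year_and_day(offset + 1)
        assert actual == (expected.year, expected.timetuple().tm_yday), expected


check_against_reference()
print("component matches the reference for every day checked")
```

A sweep like this is cheap to run in a daily build, and it doesn't depend on anyone guessing in advance which specific date is the dangerous one.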

Third: Take Time to Think

This third point is a personal mission for me. In the drive to minimize investment, IT organizations are constantly squeezing test. Development can overrun dates willy-nilly, but it's always a requirement for the test org to 'make it up'. Hey, it's business and we need to hit deadlines. But applying the same effort at the development level (in this case, the effort of increasing discipline to stay on schedule) as the test team is expected to apply (in this case, the effort of catching up when dev has blown the unrealistic schedule) could prevent the need for herculean test effort, and would yield more time for the test organization to think about and improve upon strategy.

Why does this matter? Finding defects like this requires creativity. It requires looking beyond functional test cases and thinking about 'what could go wrong'. Schindler's making the same mistake many people make - overlooking the fact that testers have to catch all of the bugs, whereas the user only needs to experience one. Yes, we sign up for this when we enter a career in quality, but that doesn't change the fact that there's a disparity in expectations. A highly successful, high-quality release cannot be achieved if you force-march your test team and cut corners.

Most IT organizations get away with this approach because their customers are internal and, therefore, are more forgiving. But this kind of bug is very similar to security defects. They're difficult to find, require a lot of creativity, and take time to root out.

So, a great big raspberry to Esther Schindler for a poorly-written article on the Zune incident. True, customers were served poorly. More importantly, the engineering community needs to take lessons here. Her lessons? I question the logic used in drawing her conclusions and I really feel her posting was a populist knee-jerk lacking any substantive value for the engineering community.

At the same time, there is plenty to be learned, and learn we better! Let's build off this new class of defect, let's approach our testing from a more component-level perspective, and let's think about how much time testing is given to be creative.

Full disclosure: yes, I work at Microsoft as a Senior SDET Lead. But I'm critical of my employer where it makes sense. If I were the manager of the Zune test org (and I know several people in the org), I'd be having the same conversation with them that I wrote in this blog posting.

8 comments:

  1. Did you read the offending code before you wrote this article?

    Perhaps doing so would have changed your mind a bit.

    -joe

  2. Perhaps, but who has access to the source code? Is the offending code publicly available? It's not like we all sit around accessing one another's build servers at MS, so I haven't dug around to find the source.

    Regardless, I feel there are lessons to be learned about quality, about thinking deeper, and about testing at a component level. To me the takeaways from any mistake need to be positive - mistakes are an effective (but costly) way to learn. To fire someone who makes a mistake just means your competitor gains.

    There are developers who are serial defect generators. They need to be managed up or out - up meaning they overcome their tendencies, out meaning there's no room in an organization for that behavior. But you don't fire someone over one mistake like this.

    Quality begins with the mindset of each engineer and with management. It requires commitment at all phases of the lifecycle. This defect needs to be a lesson to all about how we value and where we expect quality.

  3. "Perhaps, but who has access to the source code? Is the offending code publicly available? It's not like we all sit around accessing one another's build servers at MS, so I haven't dug around to find the source."

    Apparently, others have done so:

    http://www.aeroxp.org/2009/01/lesson-on-infinite-loops/

  4. "She also complains that the bug is simple date math. I think she's jumping to inexperienced conclusions here--first of all, I see little in her resume to lead me to believe she's an active developer qualified to typecast a bug as a date math defect. Secondly, all we know is that, due to being the last day in a leap year, the Zune froze. One driver from one component used in one Zune release froze. Is it date math, or something else? "

    So, do you want to revise anything here?

  5. It's not so much the bug it's the effect it had, out of all proportion to the issue of not being able to deal with an extra day at the end of the year.

    It could easily have manifested itself as something quirky such as the 32nd December or 0th January which would have been amusing but would not have affected functionality.

    But then it wouldn't have been the news...

    It's easy for me to comment because it's not my mistake, but perhaps two things we can learn are:

    * Not to structure code in a way that can ever give rise to an infinite loop. Always provide an exit of sorts.
    * Test anything that depends on a date in any way by exercising it around date transitions such as year ends and daylight savings time adjustments.

  6. "It's not so much the bug it's the effect it had, out of all proportion to the issue of not being able to deal with an extra day at the end of the year."

    I agree that the effect on the end user wasn't significant. But just look at some of the headlines generated in response to this bug, and consider the effect on Microsoft's reputation in this field:

    -30GB Zunes Failing En Masse
    -30GB Zune apocalypse arrives as devices enter digital coma
    -Zune 30s all freezing up at once. -Ack! Aliens!
    -Zune plagued by massive freeze
    -Did the Y2K failure come to Zune 30 devices 9 years later?
    -Microsoft Zunes spontaneously dying all over the place
    -Microsoft Zunes Hit By Rash of Lock-Up Bugs
    -Microsoft Zune 30GB users reporting freezing problems
    -Zunesday
    -30GB Zunes Everywhere Are Frozen. -Z2K9?
    -Worldwide Zune suicide?
    -30GB Zunes Fail Simultaneously Everywhere
    -30GB Zunes Killing Themselves In Droves
    -Microsoft's Latest Global Problem
    -30GB Zunes hibernating for 2009?
    -Z-Day Hits Zune 30s
    -Some Zunes Expire Along with 2008
    -Z2K for Zune music players?

  7. BTW: Joe may be misreading my comments. I'm not trying to defend anyone here (well, maybe help the poor dev keep his job, whom Schindler would gladly axe). It's a bug! And it's had huge impact on Microsoft.

    My key takeaway is that Schindler's lessons learned miss the point. The point is, we need to learn and grow from mistakes, we need to be testing more at the component level, and we need to continually reflect on our strategy.

    If it pleases you Joe, fine. I was wrong about the nature of the bug.

  8. I think people just like to pick on the No.1 guy or the bad guy, Microsoft. If things like this happened on iPod, would people react differently? They might or they might not. But I do think people are making a huge deal out of it, like no one has ever seen a bug before.

    And yes, we need to do a better job as programmers.
