Google Testing Blog: Minimizing Unreproducible Bugs

Minimizing Unreproducible Bugs

Monday, February 03, 2014

by Anthony ValloneAvoid and test for race conditions, deadlocks, timing issues, memory corruption, uninitialized memory access, memory leaks, and resource issues Guidelines for development:

Simplify your synchronization logic. If it’s too hard to understand, it will be difficult to reproduce and debug complex concurrency problems.

Always obtain locks in the same order. This is a tried-and-true guideline to avoid deadlocks, but I still see code that breaks it periodically. Define an order for obtaining multiple locks and never change that order.

Don’t optimize by creating many fine-grained locks, unless you have verified that they are needed. Extra locks increase concurrency complexity.

Avoid shared memory, unless you truly need it. Shared memory access is very easy to get wrong, and the bugs may be quite difficult to reproduce.

Guidelines for testing:

Stress test your system regularly. You don't want to be surprised by unexpected failures when your system is under heavy load.

Test timeouts. Create tests that mock/fake dependencies to test timeout code. If your timeout code does something bad, it may cause a bug that only occurs under certain system conditions.

Test with debug and optimized builds. You may find that a well behaved debug build works fine, but the system fails in strange ways once optimized.

Test under constrained resources. Try reducing the number of data centers, machines, processes, threads, available disk space, or available memory. Also try simulating reduced network bandwidth.

Test for longevity. Some bugs require a long period of time to reveal themselves. For example, persistent data may become corrupt over time.

Use dynamic analysis tools like memory debuggers, ASan, TSan, and MSan regularly. They can help identify many categories of unreproducible memory/threading issues.

Enforce preconditions void ScheduleEvent(int timeDurationMilliseconds) { if (timeDurationMilliseconds <= 0) { timeDurationMilliseconds = 1; } ... } Guidelines for development:

Enforce preconditions in your functions unless you have a good reason not to.

Use defensive programming Defensive programmingdouble GetMonthlyLoanPayment() { double rate = GetTodaysInterestRateFromExternalSystem(); if (rate < 0.001 || rate > 0.5) { throw BadInterestRate(rate); } ... } Guidelines for development:

When possible, use defensive programming to verify the work of your dependencies with known risks of failure like user-provided data, I/O operations, and RPC calls.

Guidelines for testing:

Use fuzz testing to test your systems hardiness when enduring bad data.

Don’t hide all errors from the user Guidelines for development:

Only hide errors from the user when you are certain that there is no impact to system state or the user.

Any error with impact to the user should be reported to the user with instructions for how to proceed. The information shown to the user, combined with data available to an engineer, should be enough to determine what went wrong.

Test error handling Guidelines for testing:

Always test your error handling code. This is usually best accomplished by mocking or faking the component triggering the error.

It’s also a good practice to examine your log quality for all types of error handling.

Check for duplicate keys Guidelines for development:

Try to guarantee uniqueness of all keys.

When not possible to guarantee unique keys, check if the recently generated key is already in use before using it.

Watch out for potential race conditions here and avoid them with synchronization.

Test for concurrent data access Guidelines for testing:

Always test for concurrent data access if it’s a feature of the system. Actually, even if it’s not a feature, verify that the system rejects it. Testing concurrency can be challenging. An approach that usually works for me is to create many worker threads that simultaneously attempt access and a master thread that monitors and verifies that some number of attempts were indeed concurrent, blocked or allowed as expected, and all were successful. Programmatic post-analysis of all attempts and changing system state may also be necessary to ensure that the system behaved well.

Steer clear of undefined behavior and non-deterministic access to data undefined behaviorJava’s SetGuidelines for development:

Understand when the APIs and operations you use might have undefined behavior and prevent those conditions.

Do not depend on data structure iteration order unless it is guaranteed. It is a common mistake to depend on the ordering of sets or associative arrays.

Log the details for errors or test failures Guidelines for development:

Follow good logging practices, especially in your error handling code.

If logs are stored on a user’s machine, create an easy way for them to provide you the logs.

Guidelines for testing:

Save your test logs for potential analysis later.

Anything to add?

10 comments :

UnknownFebruary 4, 2014 at 2:04:00 AM PST
Simply stated and enlightening. I am certainly going to incorporate these as soon much as possible. Thanks for sharing your thoughts.
ReplyDelete
Replies
UnknownFebruary 4, 2014 at 10:42:00 AM PST
I wrote code that enabled me to read SQL 2000 deadlocks from the log by converting it into text that showed the tables and indexes involved and the process that was running. Certainly was a bug we couldn't reproduce but with the log we were able to figure out the odd locking order that was causing it with a read and an update.
The update locked the table page and then tried to lock the index. The read had tried to use the index to read the table. The update couldn't lock the index, the read couldn't look at the data page.
Altered the read to retrieve just the clustered PK field values. (A covering index read) Then read the table using the PK values so it is delayed by the update, not deadlocked.
ReplyDelete
Replies
UnknownFebruary 4, 2014 at 10:49:00 AM PST
PS Your sign-in procedure exposed a bug. I wrote a comment, did not have a Google account, so signed up for it. Google sign-up procedure was successful, but then blew up when I tried to return to submit my comment. Had to rewrite the comment after going back in web history to your article.
So, did I just find an unreproducible bug in your blog?
ReplyDelete
Replies
Michael Kingsford GrayFebruary 7, 2014 at 10:26:00 PM PST
Most "unreproducible bugs" are caused by using the execrable "C" language, with its inbuilt design flaws.
Anyone programming in "C" (for a safety critical application) should be indicted for crimes against humanity.
The total cost to the world for avoidable bugs directly as a result of employing "C" code or libraries possibly exceeds several billions of dollars.
C++ is a willing accessory to these crimes against humanity.
ReplyDelete
Replies
Amandeep SinghFebruary 10, 2014 at 7:08:00 AM PST
Good one Ken :D
ReplyDelete
Replies
Amandeep SinghFebruary 10, 2014 at 7:11:00 AM PST
Well written Anthony - thought provoking... I have been in testing for more than 6 years now and it happens all the time when a reported bug doesn't reproduce especially when a developer updates you about it...

This is a great list of guidelines to follow... thanks for sharing...

Regards,
Aman
ReplyDelete
Replies
kimkullingMarch 12, 2014 at 3:57:00 AM PDT
Great article, thanks for sharing it!
ReplyDelete
Replies
AliMarch 27, 2014 at 4:24:00 PM PDT
Interesting. We recently conducted an empirical study to characterize non-reproducible bugs. If you're interested, the paper is available here: http://salt.ece.ubc.ca/publications/docs/mona-msr14.pdf
ReplyDelete
Replies
Conrad BraamApril 22, 2014 at 6:21:00 AM PDT
I'm not going to agree with Kingsford Gray on the C language bias standpoint at all. That's not the topic. But it is the paradigm! Some parts of a testers job do involve understanding the specific vulnerabilities the software under test has, based not only on historical knowledge of the kind of bugs your dev team inject, but also on the kind of bugs the software production process itself injects. As well as the compiler-toolchain and finally the language chosen to implement the software. So back to taking out locks and synchronization, when it comes to test-code I'm going to disagree with the author Mr Vallone.
Test code and instrumentation code introduces synchronisation windows in the software being tested a lot of the time. Normally the "window" is quite wide like when turning on logging all threads end up having to access one log file. The tester who knows the exact affect this has on the system as well as about all the other synchronization points in the software will be able to exploit them to find timing bugs.
And on this point I am your keen follower Anthony, being able to test timeouts properly as if they are simply another axis to your test driven approaches or simply another table-driven test test-case can uncover lots of bugs the devs might take years to find. Great guide, packed with nuggets :)
ReplyDelete
Replies

Add comment

The comments you read and contribute here belong only to the person who posted them. We reserve the right to remove off-topic comments.

Testing Blog

Minimizing Unreproducible Bugs

10 comments :

Labels

Archive

Feed