Reason for Flakiness: Improper initialization or cleanup.
Tips for Triaging: Look for compiler warnings about uninitialized variables. Inspect initialization and cleanup code. Check that the environment is set up and torn down correctly. Verify that test data is correct.
Type of Remedy: Explicitly initialize all variables with proper values before their use. Properly set up and tear down the testing environment. Consider an initial test that verifies the state of the environment.

Reason for Flakiness: Invalid assumptions about the state of test data.
Tips for Triaging: Rerun test(s) independently.
Type of Remedy: Make tests independent of any state from other tests and previous runs.

Reason for Flakiness: Invalid assumptions about the state of the system, such as the system time.
Tips for Triaging: Explicitly check for system dependency assumptions.
Type of Remedy: Remove or isolate the SUT dependencies on aspects of the environment that you do not control.

Reason for Flakiness: Dependencies on execution time, expecting asynchronous events to occur in a specific order, waiting without timeouts, or race conditions between the tests and the application.
Tips for Triaging: Log the times when accesses to the application are made. As part of debugging, introduce delays in the application to check for differences in test results.
Type of Remedy: Add synchronization elements to the tests so that they wait for specific application states (see the sketch following this list). Disable unnecessary caching to have a predictable timeline for the application responses. Note: do NOT add arbitrary delays, as these can become flaky again over time and slow down the test unnecessarily.

Reason for Flakiness: Dependencies on the order in which the tests are run. (Similar to the second case above.)
Tips for Triaging: Rerun test(s) independently.
Type of Remedy: Make tests independent of each other and of any state from previous runs.

Reason for Flakiness: Failure to allocate enough resources for the SUT, thus preventing it from running.
Tips for Triaging: Check logs to see if the SUT came up.
Type of Remedy: Allocate sufficient resources.

Reason for Flakiness: Improper scheduling of the tests so they "collide" and cause each other to fail.
Tips for Triaging: Explicitly run tests independently and in different orders.
Type of Remedy: Make tests runnable independently of each other.

Reason for Flakiness: Insufficient system resources to satisfy the test requirements. (Similar to the first case, but here resources are consumed while running the workflow.)
Tips for Triaging: Check system logs to see if the SUT ran out of resources.
Type of Remedy: Fix memory leaks or similar resource "bleeding." Allocate sufficient resources to run tests.

Reason for Flakiness: Race conditions.
Tips for Triaging: Log accesses of shared resources.
Type of Remedy: Add synchronization elements to the tests so that they wait for specific application states. Note: do NOT add arbitrary delays, as these can become flaky again over time.

Reason for Flakiness: Uninitialized variables.
Tips for Triaging: Look for compiler warnings about uninitialized variables.
Type of Remedy: Explicitly initialize all variables with proper values before their use.

Reason for Flakiness: Being slow to respond or being unresponsive to the stimuli from the tests.
Tips for Triaging: Log the times when requests and responses are made.
Type of Remedy: Check for and remove any causes of delays.

Reason for Flakiness: Memory leaks.
Tips for Triaging: Look at memory consumption during test runs. Use tools such as Valgrind to detect them.
Type of Remedy: Fix the programming error causing the memory leak. The Wikipedia article on memory leaks has an excellent discussion of these types of errors.

Reason for Flakiness: Oversubscription of resources.

Reason for Flakiness: Changes to the application (or dependent services) out of sync with the corresponding tests.
Tips for Triaging: Examine the revision history.
Type of Remedy: Institute a policy requiring code changes to be accompanied by tests.

Reason for Flakiness: Networking failures or instability.
Tips for Triaging: Check for hardware errors in the system logs.
Type of Remedy: Fix the hardware errors or run the tests on different hardware.

Reason for Flakiness: Disk errors.
Tips for Triaging: Check for hardware errors in the system logs.
Type of Remedy: Fix the hardware errors or run the tests on different hardware.

Reason for Flakiness: Resources being consumed by other tasks/services not related to the tests being run.
Tips for Triaging: Examine system process activity.
Type of Remedy: Reduce the activity of other processes on the test system(s).
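The "synchronization elements, not arbitrary delays" remedy above is the one that most often needs a concrete shape. Below is a minimal Python sketch of a polling helper that waits for a specific application state with an explicit timeout; the job-submission calls shown in the comments are hypothetical placeholders, not part of any particular framework.

```python
import time


def wait_for(condition, timeout_s=30.0, poll_interval_s=0.5):
    """Poll `condition` until it returns True or the timeout expires.

    Unlike a fixed sleep, this returns as soon as the application reaches
    the expected state, and fails loudly (instead of flaking) if it never does.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(poll_interval_s)
    raise TimeoutError(f"Condition not met within {timeout_s}s")


# Example usage in a test (the job_* calls are hypothetical):
#
#   submit_job(job_id)
#   wait_for(lambda: job_status(job_id) == "DONE", timeout_s=60)
#   assert fetch_result(job_id) == expected_result
#
# compared with the flaky alternative:
#
#   submit_job(job_id)
#   time.sleep(10)   # arbitrary delay: too short on a slow day, wasteful otherwise
#   assert fetch_result(job_id) == expected_result
```

When the application never reaches the expected state, the helper fails with a clear timeout error, which turns an intermittent pass/fail into an actionable diagnosis.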
Posted by George Pirocanac, Test Engineering Manager

Earlier blog entries described the strategy and methodology for testing the functionality of various kinds of applications. The basic approach is to isolate the logic of the application from the external API calls it makes, using constructs such as mocks, fakes, and dummy routines. Depending on how the application is designed and written, this can lead to smaller, simpler tests that cover more code, execute more quickly, and allow quicker diagnosis of problems than the larger end-to-end or system tests. On the other hand, they are not a complete replacement for end-to-end testing. By their very nature, the small tests don't exercise the assumptions and interactions between the application and the APIs that it calls. As a result, a diversified application testing strategy includes small, medium, and large tests. (See Copeland's GTAC video; fast-forward about five minutes in for a brief description of developer testing and small, medium, and large tests.)
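To make the isolation idea concrete, here is a minimal Python sketch of a small test that replaces an external API call with a fake. The GreetingService and FakeUserApi names are hypothetical illustrations invented for this example, not code from any of the earlier posts.

```python
import unittest


class GreetingService:
    """Hypothetical application code: the external API client is injected so
    the logic can be exercised without the real service."""

    def __init__(self, user_api):
        self._user_api = user_api  # real client in production, fake in small tests

    def greeting_for(self, user_id):
        profile = self._user_api.get_profile(user_id)
        name = profile.get("display_name") or "there"
        return f"Hello, {name}!"


class FakeUserApi:
    """In-memory stand-in for the external user-profile API."""

    def __init__(self, profiles):
        self._profiles = profiles

    def get_profile(self, user_id):
        return self._profiles.get(user_id, {})


class GreetingServiceTest(unittest.TestCase):
    def test_uses_display_name_when_present(self):
        service = GreetingService(FakeUserApi({"u1": {"display_name": "Ada"}}))
        self.assertEqual(service.greeting_for("u1"), "Hello, Ada!")

    def test_falls_back_when_profile_is_missing(self):
        service = GreetingService(FakeUserApi({}))
        self.assertEqual(service.greeting_for("unknown"), "Hello, there!")


if __name__ == "__main__":
    unittest.main()
```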
What about testing the APIs themselves? What, if anything, is different? The first approach mirrors the small-test approach: each API call is exercised with a variety of inputs, and the outputs are verified against the specification. For isolated, stateless APIs (math library functions come to mind), this can be very effective by itself. However, many APIs are not isolated or stateless, and their results can vary according to the *combinations* of calls that were made. One way to deal with this is to analyze the dependencies between the calls and create mini-applications to exercise and verify these combinations. Often these fall into so-called typical usage patterns or user scenarios. While good, this first approach gives only limited confidence; we also need to test what happens when not-so-typical sequences of calls are made. Application writers often introduce usage patterns that the spec didn't anticipate.
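As a rough illustration of this first approach, the Python sketch below checks a stateless library call against its specified outputs for a range of inputs, and then exercises a *combination* of calls against a toy stateful API; the KeyValueStore class is a hypothetical stand-in invented for the example.

```python
import math
import unittest


class StatelessApiTest(unittest.TestCase):
    """Per-call tests: each input is checked against the specified output."""

    def test_sqrt_over_a_range_of_inputs(self):
        cases = [(0, 0.0), (1, 1.0), (4, 2.0), (2, 1.4142135623730951)]
        for value, expected in cases:
            with self.subTest(value=value):
                self.assertAlmostEqual(math.sqrt(value), expected)


class KeyValueStore:
    """Hypothetical stateful API: results depend on the combination of calls made."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def get(self, key):
        return self._data.get(key)


class StatefulCombinationTest(unittest.TestCase):
    """A 'mini-application': a sequence of calls a per-call spec test would miss."""

    def test_put_delete_put_sequence(self):
        store = KeyValueStore()
        store.put("k", "v1")
        store.delete("k")
        store.put("k", "v2")
        self.assertEqual(store.get("k"), "v2")


if __name__ == "__main__":
    unittest.main()
```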
Another approach is to capture the API calls made by real applications under controlled situations and then replay just those calls under the same controlled situations. These tests fall into the medium category. Again, the idea is to test series and combinations of calls, but the difficulty lies in recreating the exact environment. In addition, this approach is prone to producing tests that traverse the same paths in the code. Adding fuzz to the parameters and call patterns can reduce, but not eliminate, this problem.
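A bare-bones sketch of what such capture and replay might look like is shown below, assuming a Python API client whose calls can be intercepted; the recorder, the JSON-lines log format, and the numeric fuzzing are all illustrative choices rather than a prescribed tool.

```python
import json
import random


def record_calls(api, log_path):
    """Wrap an API client so every call is appended to a JSON-lines log."""

    class Recorder:
        def __getattr__(self, name):
            method = getattr(api, name)

            def wrapper(*args, **kwargs):
                result = method(*args, **kwargs)
                with open(log_path, "a") as f:
                    f.write(json.dumps({"method": name, "args": list(args),
                                        "kwargs": kwargs}) + "\n")
                return result

            return wrapper

    return Recorder()


def replay_calls(api, log_path, fuzz=0.0, seed=0):
    """Replay a recorded call log, optionally perturbing float arguments."""
    rng = random.Random(seed)
    with open(log_path) as f:
        for line in f:
            call = json.loads(line)
            args = [a * (1 + rng.uniform(-fuzz, fuzz)) if isinstance(a, float) else a
                    for a in call["args"]]
            getattr(api, call["method"])(*args, **call["kwargs"])


# Usage sketch: record during a controlled run driven by the real application,
# then replay in a medium test against the same controlled environment:
#
#   instrumented = record_calls(real_client, "calls.jsonl")
#   run_application_scenario(instrumented)      # hypothetical driver
#   replay_calls(fresh_client, "calls.jsonl", fuzz=0.05)
```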
The third approach is to pull out the big hammer: does it make sense to test the APIs with large applications? After all, if something goes wrong, you need specific knowledge about the application to triage the problem, and you have to be familiar with the techniques for testing that application. Testing a map-based application can be quite different from testing a calendar-based one, even if they share a common subset of APIs. The strongest case for testing APIs with large applications is compatibility testing. APIs not only have to return correct results, they have to do so in the same manner from revision to revision. It's a sort of contract between the API writer and the application writer. When the API is private, only a relatively small number of parties have to agree on a change to the contract, but when it is public, even a small change can break a lot of applications.
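One common way to express this kind of compatibility contract is a golden-file test: responses recorded at one revision are compared against the responses produced by the next. The sketch below is a hypothetical Python example; search_api and the golden-file handling are stand-ins for a real client and a real data-management policy.

```python
import json
import os
import unittest

GOLDEN_PATH = "search_api_golden.json"  # responses recorded at a previous revision


def search_api(query):
    """Hypothetical API under test; in a real suite this would call the service."""
    return {"query": query,
            "results": sorted(["apple", "apricot"]) if query.startswith("ap") else []}


class SearchApiCompatibilityTest(unittest.TestCase):
    """The API must keep returning results 'in the same manner' across revisions."""

    def test_responses_match_previous_revision(self):
        if not os.path.exists(GOLDEN_PATH):
            # First run (or an intentional contract change): record new goldens.
            golden = {q: search_api(q) for q in ["ap", "zz"]}
            with open(GOLDEN_PATH, "w") as f:
                json.dump(golden, f, indent=2)
        with open(GOLDEN_PATH) as f:
            golden = json.load(f)
        for query, expected in golden.items():
            with self.subTest(query=query):
                self.assertEqual(search_api(query), expected)


if __name__ == "__main__":
    unittest.main()
```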
So when it comes to API testing, it seems we are back to small, medium, and large approaches after all. Just as application testing cannot completely divorce the application from the APIs it calls, API testing cannot completely divorce the APIs from the applications that use them.