I hear you. Same issues, same solutions. But we have another tool up our sleeve: we have a section called Reservoir that runs all newly added tests in a loop for a week to determine whether there is any flakiness in them; during that time they are not yet part of the critical CI path.
Happy to hear we are not alone.
Good day.
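(As an illustration of the "reservoir" idea described above - not the commenter's actual system - here is a minimal JUnit 4 sketch that soaks a newly added test class outside the critical CI path and only suggests promotion once its observed pass rate stays clean. The class name, soak duration, pause, and threshold are all hypothetical.)

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

// Hypothetical sketch: soak a newly added test class outside the critical CI
// path for a week and only promote it once no flakiness has been observed.
public class ReservoirSoak {
    public static void main(String[] args) throws Exception {
        Class<?> newTest = Class.forName(args[0]); // e.g. "com.example.NewFeatureTest"
        long deadline = System.currentTimeMillis() + 7L * 24 * 60 * 60 * 1000; // one week
        int runs = 0;
        int passes = 0;

        while (System.currentTimeMillis() < deadline) {
            Result result = JUnitCore.runClasses(newTest); // one full run of the class
            runs++;
            if (result.wasSuccessful()) {
                passes++;
            }
            Thread.sleep(60_000); // pause between iterations to spread the load
        }

        double passRate = runs == 0 ? 0.0 : (double) passes / runs;
        System.out.printf("%s: %d/%d runs passed (%.4f)%n",
                newTest.getName(), passes, runs, passRate);
        // The promotion criterion is arbitrary; any observed failure could veto promotion.
        System.out.println(passRate == 1.0 ? "PROMOTE to CI" : "KEEP in reservoir");
    }
}
```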
Thanks for the great blog post!
It seems that you categorize flakiness as a test issue, but the cause of a flaky test result could be in the production code and therefore be a real issue.
Did you investigate how many of the flaky tests are due to a real issue? And do you give flaky test results lower priority than tests that fail every time?
We do not currently keep an accurate count of the number of times that flaky tests are really masking bugs in the code. We see it as a testing issue mostly because it makes it more difficult to use the tests for their intended purpose - finding problems with the code. From the testing system's point of view, a test that fails reliably is far better than a test that is flaky! A persistently failing test gives a clear signal about what to do - even if that means fixing the test.
Thanks John Micco for sharing your experience.
Are those GUI-level tests you're struggling with? They're usually considered flaky.
Do you use a rerun mechanism for tests that fail?
Regards,
Sławek
Flaky tests appear everywhere in our corpus, but there is probably some skew toward UI testing that we observe - although I have not quantified this.
Our rerun mechanism is only used for tests that are marked as flaky or when users specifically request it.
Agreed. UI tests are definitely flaky because of how test harnesses interact with the UI, timing issues, handshaking, and extraction of state. See more here: http://comet.unl.edu/tutorial.php
Thanks John - wildly enough, these are pretty common issues in large functional automation implementations. I know you have a heavy investment in Selenium; do you use another tool for service virtualization?
Today at Google, test authors and test infrastructure developers throughout the organization are responsible for creating/using service virtualization in their tests. We do not have a central framework - other than providing generic mocking frameworks like Mockito.
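(For readers who have not used the mocking approach mentioned in this reply, here is a minimal Mockito/JUnit 4 sketch of replacing a flaky network dependency with a deterministic stub. The ExchangeRateClient and BillingService names are invented for illustration; only the Mockito calls themselves reflect the real library.)

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class BillingServiceTest {
    // Hypothetical external dependency that would normally go over the network.
    interface ExchangeRateClient {
        double rateFor(String currency);
    }

    // Hypothetical class under test.
    static class BillingService {
        private final ExchangeRateClient rates;
        BillingService(ExchangeRateClient rates) { this.rates = rates; }
        double toUsd(double amount, String currency) { return amount * rates.rateFor(currency); }
    }

    @Test
    public void convertsUsingStubbedRate() {
        // Stub the dependency so the test never touches the real, flaky service.
        ExchangeRateClient stub = mock(ExchangeRateClient.class);
        when(stub.rateFor("EUR")).thenReturn(1.10);

        BillingService billing = new BillingService(stub);
        assertEquals(11.0, billing.toUsd(10.0, "EUR"), 1e-9);
    }
}
```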
You touched on a very important and common problem here.
Have you tried to track the wasted time caused by flaky tests (developers unable to submit changes, CI runs that require additional cycles)?
We are currently working to better analyze the cost to developer workflows caused by test flakiness - we do not yet have anything publishable out of that effort.
I repeat this almost every day: do not write many UI system tests - they should be rare. You need to build a pyramid (http://qala.io/blog/test-pyramid.html). There is almost always a way to write tests at a lower level.
Often it's the separation of AQA and dev teams that leads to flakiness, since AQA engineers usually write system tests only. Let devs write all(!) the tests and the proportion of flaky tests would drop to 1:1000.
It's not only GUI tests. There are many sources of flakiness, some of which Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and I analyzed in this paper: http://mir.cs.illinois.edu/marinov/publications/LuoETAL14FlakyTestsAnalysis.pdf
Great post. I guess we all experience the same issues when it comes to large-scale automation processes.
@John Micco - do you consider versions/experiments/configurations between test cycles when deciding whether a test passed beta or should be marked as flaky?
Thanks,
Elad - WIX.com
Good point - we definitely track each test together with all of its flags and configuration values, and we determine whether it is flaky based on the flag/configuration combination being tested.
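(One hedged way to picture what "differentiating by flag/configuration" could look like in code - this is an illustrative sketch, not Google's system - is to key pass/fail statistics on the (test, configuration) pair rather than on the test name alone.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: track pass/fail counts per (test name, configuration) pair,
// so the same test under different flags is judged independently.
public class FlakinessTracker {
    static final class Key {
        final String testName;
        final String configuration; // e.g. a canonicalized flag string

        Key(String testName, String configuration) {
            this.testName = testName;
            this.configuration = configuration;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return testName.equals(k.testName) && configuration.equals(k.configuration);
        }
        @Override public int hashCode() { return Objects.hash(testName, configuration); }
    }

    static final class Stats { int passes; int failures; }

    private final Map<Key, Stats> stats = new HashMap<>();

    public void record(String testName, String configuration, boolean passed) {
        Stats s = stats.computeIfAbsent(new Key(testName, configuration), k -> new Stats());
        if (passed) s.passes++; else s.failures++;
    }

    // A test+configuration pair is considered flaky here if it has both passed and failed.
    public boolean isFlaky(String testName, String configuration) {
        Stats s = stats.get(new Key(testName, configuration));
        return s != null && s.passes > 0 && s.failures > 0;
    }
}
```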
Nice article, John! We have this kind of issue too, and we came up with a rerun concept in which a failed test is rerun up to 3 times. If the test fails on all of those runs, we mark it as a failure. We have a configuration option to set the rerun count.
And sometimes flaky tests are also a result of the way the tests are written.
-Surya
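(For anyone who wants to reproduce the rerun idea Surya describes with plain JUnit 4, a minimal retry rule might look like the sketch below; the retry count would come from your own configuration, and the class is illustrative rather than a reference implementation.)

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Minimal JUnit 4 retry rule: rerun a failing test up to `maxAttempts` times
// and only report a failure if every attempt fails.
public class RetryRule implements TestRule {
    private final int maxAttempts;

    public RetryRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable lastFailure = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate(); // run the test body once
                        return;          // first pass wins
                    } catch (Throwable t) {
                        lastFailure = t;
                        System.err.println(description + ": attempt " + attempt + " failed");
                    }
                }
                throw lastFailure; // all attempts failed: report the last failure
            }
        };
    }
}
```

Attach it in a test class with a field such as "@Rule public RetryRule retry = new RetryRule(3);". Note that blanket retries trade away signal, which is exactly the concern raised in the next comment.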
How will you distinguish between a failure due to environmental problems and a failure due to a software bug? A test hitting a real software bug can still pass 4 out of 5 times.
Hi John,
Thanks for the post, great read! Can I ask what tool you are using to monitor the flakiness of your tests?
Thanks,
Sarah
Thanks John. Do you also classify your flakiness data by test size (small, medium, and large)?
It's great to hear that you're working on this. I've been fighting against flaky tests in our C++ projects as well. I've noticed that some projects are adding flaky-test information to the JUnit XML results used by Jenkins, but the googletest framework doesn't yet support this (https://github.com/google/googletest/issues/727). For projects that see lots of flaky test failures, we currently re-run failing tests one time and only report a failure if a test fails twice in a row.
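(Since googletest does not emit flaky-test information itself, one simplified way to get "rerun once, fail only on two consecutive failures" behaviour is a small wrapper around the test binary. The sketch below is Java for consistency with the other examples, works at whole-binary granularity, and uses a placeholder binary path; a real setup would rerun only the failed tests, e.g. via --gtest_filter, and merge the result XML.)

```java
import java.io.IOException;

// Simplified sketch: run a googletest binary, and if it fails, run it once more.
// Only report failure if both runs fail.
public class RerunOnce {
    public static void main(String[] args) throws IOException, InterruptedException {
        String binary = args.length > 0 ? args[0] : "./my_tests"; // placeholder path
        int first = runOnce(binary);
        if (first == 0) {
            System.out.println("PASS");
            return;
        }
        int second = runOnce(binary);
        if (second == 0) {
            System.out.println("FLAKY: failed once, passed on rerun");
        } else {
            System.out.println("FAIL: failed twice in a row");
            System.exit(1);
        }
    }

    private static int runOnce(String binary) throws IOException, InterruptedException {
        // googletest binaries exit with a non-zero status if any test fails.
        Process p = new ProcessBuilder(binary).inheritIO().start();
        return p.waitFor();
    }
}
```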
Marking tests as flaky is addressing the problem from the wrong direction, and it will lose potentially valuable information.
Instead, have a test monitor itself for what it does. If it fails, look at the root cause using the available information. Then, depending on what failed (for example, an external dependency), do a smart retry. Is the failure reproduced? Then fail the test!
"Marking a test as flaky" gives one permission to ignore failures, but there is potentially important and potentially actionable information there.
Instead, *use* the information to manage quality risk and/or improve the quality of the product.
MetaAutomation has patterns that describe at a high level how to do this. Don't drop information on the floor that can have value for the team and for the product!
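(To make the "smart retry" idea concrete, here is one hedged JUnit 4 sketch: a failure is retried only when it matches a caller-supplied transient-cause predicate - treating IOException as transient here is just an example assumption - and the test still fails if the failure reproduces.)

```java
import java.io.IOException;
import java.util.function.Predicate;

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Sketch of a "smart retry": retry once only when the failure looks transient
// (e.g. an external dependency error), and fail if the failure reproduces.
public class SmartRetryRule implements TestRule {
    private final Predicate<Throwable> isTransient;

    public SmartRetryRule(Predicate<Throwable> isTransient) {
        this.isTransient = isTransient;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                try {
                    base.evaluate();
                } catch (Throwable first) {
                    if (!isTransient.test(first)) {
                        throw first; // not a known transient cause: fail immediately
                    }
                    System.err.println(description + ": transient failure, retrying once: " + first);
                    base.evaluate(); // if the failure reproduces, this throws and the test fails
                }
            }
        };
    }

    // Example predicate: treat I/O errors as transient external-dependency failures.
    public static SmartRetryRule forIoErrors() {
        return new SmartRetryRule(t -> t instanceof IOException);
    }
}
```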
We have been dealing with this same phenomenon for years.
Currently, we execute reliability runs of all of the CI tests (we try for hundreds of executions, but it depends on automation system load levels) per build to generate consistency rates. Using those numbers, we push product teams to move any test that falls below a certain consistency level out of the CI tests. We keep such tests in the reliability suite for the sake of coverage and issue discovery, but do not use them to gate submission into the main code branch.
We likewise have difficulty accounting for the costs, but ballpark estimates show it is very expensive. I have done prior analysis demonstrating that intermittent failures cause engineers to take longer to submit. Intermittent failures have a high duplicate-bug rate, and ad hoc estimates from engineers are that we lose ~20 per duplicate bug for an engineer to determine there is duplication. The costs go way beyond all of that, though, particularly as process gates shut down team productivity (failing CI tests lock a branch against changes until the failure is resolved), but also through legitimate bug escapes that were ignored because of the noise.
It is my own opinion that even after tons of effort to reduce noise from tests, flaky tests are inevitable once the test conditions reach a certain complexity. There are more stable coding patterns (mostly in product, but also in test) which stabilize the test results, but they can only take you so far. Once you have moved the tests (e.g., converting end-to-end tests to unit tests, moving pre-release tests to TIP methodologies) you still have a core set of problems only discoverable in an integrated end-to-end system. And those tests will be flaky. If they are not flaky, they tend to never find bugs. This is not because the test is bad. It is because the conditions of the test, the thing that makes it flaky, are EXACTLY the same thing that caused the bug to be introduced in the first place. These bugs are scarier, riskier, and harder to find. The secret, then, is to appropriately manage them. I prefer to rely more on repetition, statistics, and runs that do not block the CI process. I prefer to data-mine the test results and feed the work backlog.
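(As a concrete illustration of the consistency-rate gating described in this comment - with an invented threshold and invented numbers - the bookkeeping can be as simple as the following sketch.)

```java
// Hypothetical sketch of consistency-rate gating: run each CI candidate test many
// times per build, compute its pass rate, and only let sufficiently consistent
// tests gate submission. The 99.5% threshold is an arbitrary example.
public class ConsistencyGate {
    public static final double THRESHOLD = 0.995;

    public static double consistencyRate(int passes, int runs) {
        return runs == 0 ? 0.0 : (double) passes / runs;
    }

    public static boolean belongsInCiGate(int passes, int runs) {
        return consistencyRate(passes, runs) >= THRESHOLD;
    }

    public static void main(String[] args) {
        // Example: 497 passes out of 500 reliability runs -> 99.4%, below threshold,
        // so the test stays in the reliability suite but stops gating submission.
        System.out.println(belongsInCiGate(497, 500)); // false
        System.out.println(belongsInCiGate(500, 500)); // true
    }
}
```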
Did you look at correlated unreliability? We have a number of tests that are stable in themselves, but use some form of global state (/tmp files, other global state) that causes them to fail if run together with another test. We also use a test environment that preferentially runs failing tests first, with the rest after them. That leads to the situation that if such tests ever fail, they are then run first together with other failing tests, making them more likely to fail again, and when they succeed they are run later, making them less likely to fail again.
Of course, this makes it even harder to know if you broke something, as the test will reliably fail on your machine, but only for you - even after you revert any changes you made.
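(One way to surface this kind of correlated, order-dependent failure is to run the same suite repeatedly in shuffled orders and flag classes whose outcome depends on their position. The rough JUnit 4 sketch below uses placeholder test classes; it illustrates the idea and is not the commenter's setup.)

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

// Rough sketch: run the same test classes in several random orders.
// A class that passes in some orders and fails in others likely depends on
// shared global state (temp files, static fields, databases, ...).
public class ShuffledSuiteRunner {
    public static void main(String[] args) {
        // Placeholder test classes; substitute your own.
        List<Class<?>> classes = Arrays.asList(FooTest.class, BarTest.class, BazTest.class);

        for (int i = 0; i < 20; i++) {
            Collections.shuffle(classes);
            Result result = JUnitCore.runClasses(classes.toArray(new Class<?>[0]));
            System.out.printf("order %s -> %d failure(s)%n", classes, result.getFailureCount());
        }
    }

    // Dummy placeholders so the sketch compiles; replace with real test classes.
    public static class FooTest { @org.junit.Test public void ok() {} }
    public static class BarTest { @org.junit.Test public void ok() {} }
    public static class BazTest { @org.junit.Test public void ok() {} }
}
```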
Any chance these tools are open source?
There is a flaky test handler plugin for Jenkins:
https://github.com/jenkinsci/flaky-test-handler-plugin
We have also written a script for merging JUnit XML files to mark flaky tests so that the Jenkins plugin can parse them.
https://bitbucket.org/osrf/release-tools/src/default/jenkins-scripts/tools/
That's all that I'm familiar with.
Cloudera's tool, which uses a distributed cluster for repeating tests and reproducing flaky ones: https://github.com/cloudera/dist_test
Another tool (mine) for controlling non-determinism so as to reproduce flaky tests: https://github.com/osrg/namazu
I have created an sbt plugin to detect flaky tests in our Java/Scala projects: https://github.com/otrebski/sbt-flaky. It runs the tests many times and analyzes the JUnit reports. It can also calculate trends for tests. You can check an example HTML report: http://sbt-flaky-demo.bitballoon.com
@John Micco,
I do not have the background on the 'process' approach for categorizing, grouping, and prioritizing your tests.... Still, I would like to know whether a combination of exploratory testing and CI has been considered.
One of the basic premises for automation is to choose software candidates that are stable and are not changed too often.
Please, let me know.
Thanks,
Devasena.
We solved this problem by re-running the failed test cases three times and checking them to find what is causing the flakiness. It is not very time consuming, but it is a more robust solution, since 90% pass on the first try.
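(A small back-of-the-envelope illustration of the trade-off, under the simplifying assumption that runs are independent with per-run pass probability p: with an initial run plus up to three retries, a test is reported red only if all four runs fail, so an intermittent real bug can easily slip through - which connects to the earlier question about bugs that still pass 4 runs out of 5.)

```java
// Illustration only, assuming independent runs with per-run pass probability p:
// with an initial run plus up to 3 retries, the test is reported red only if
// all 4 runs fail, i.e. with probability (1 - p)^4.
public class RetryEscapeOdds {
    public static void main(String[] args) {
        double p = 0.8; // example: a real bug that still lets the test pass 80% of the time
        double reportedRed = Math.pow(1 - p, 4);
        System.out.printf("P(reported red) = %.4f%n", reportedRed);          // 0.0016
        System.out.printf("P(bug slips through) = %.4f%n", 1 - reportedRed); // 0.9984
    }
}
```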
This is pretty interesting - I'll definitely be investigating the flaky test handler for Jenkins that was posted in the comments.
ReplyDeleteTestProject conducted a survey that compares AngularJS VS. ReactJS, exposes current front end development technologies and unit testing tool preferences of software professionals! See the results here:
ReplyDeletehttp://blog.testproject.io/2016/09/01/front-end-development-unit-test-automation-trends2/
How do you mark a test as flaky? Do you annotate the test in source code in some way, or does this information reside elsewhere (perhaps in a database)?
Summary: flaky tests are bugs either in the tests or in the production code. When in doubt, doubt the tests first. Do you whitelist tests that are known to be robust for a long time?
ReplyDeleteAs dangerous "flaky" tests that give false negatives are, giving false positives is even more dangerous. Writing test cases for misunderstood requirements could lead of incorrect validation of production code and potentially dangerous bugs to remain undetected for a long time.
ReplyDeleteI experienced such problem while working at Nokia's manufacturing facility in Fort Worth, TX in the late 90s. An incorrect calibration (adjustment) of a camera lead to a number of low-quality displays to be assembled on mobile phones. The problem was discovered by QC auditing and tedious examination of test data logged by the production test stations. The "false positive" lead to an unusually high prime pass yield of the test station in question which wasn't detected because it is almost impossible to sense a problem when all the tests are passing.
What are the plugins that you use to identify and handle flaky tests on Jenkins?
So far I have found the following two:
* Flaky Test Handler Plugin: This plugin is designed to handle flaky tests, including re-running failed tests, aggregating and reporting flaky-test statistics, and so on. https://wiki.jenkins-ci.org/display/JENKINS/Flaky+Test+Handler+Plugin
* Test Results Analyzer Plugin: Displays a matrix of subsequent runs of the same tests, so you can identify which tests are occasionally red. https://wiki.jenkins-ci.org/display/JENKINS/Test+Results+Analyzer+Plugin
Flaky tests? I think that, frequently, test environments are to blame and are taken for granted. It is important to be able to run tests of "components" in an isolated environment where you control the inputs. In other words, besides the "component" under test, you control everything else so as to provide known inputs (this includes data transfer rates). Otherwise, you are setting yourself up to get inconsistent results. Per the scientific process, only one thing should change at a time in order to learn something new. If more than one thing changes at a time, the result is ambiguous.
If you need more bandwidth to test more changes at the same time, you need to be able to stand up multiple identical environments that can each test one change at a time.