I hear you. Same issues, same solutions. But we have another tool up our sleeve: we have a section called Reservoir that runs all newly added tests in a loop for a week to determine whether there is any flakiness in them; during that time they are not yet part of the critical CI path.
Happy to hear we are not alone.
Good day.
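(As an illustration of the "reservoir" idea described above - not the commenter's actual system - here is a minimal JUnit 4 sketch that soaks a newly added test class outside the critical CI path and only suggests promotion once its observed pass rate stays clean. The class name, soak duration, pause, and threshold are all hypothetical.)

```java
import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

// Hypothetical sketch: soak a newly added test class outside the critical CI
// path for a week and only promote it once no flakiness has been observed.
public class ReservoirSoak {
    public static void main(String[] args) throws Exception {
        Class<?> newTest = Class.forName(args[0]); // e.g. "com.example.NewFeatureTest"
        long deadline = System.currentTimeMillis() + 7L * 24 * 60 * 60 * 1000; // one week
        int runs = 0;
        int passes = 0;

        while (System.currentTimeMillis() < deadline) {
            Result result = JUnitCore.runClasses(newTest); // one full run of the class
            runs++;
            if (result.wasSuccessful()) {
                passes++;
            }
            Thread.sleep(60_000); // pause between iterations to spread the load
        }

        double passRate = runs == 0 ? 0.0 : (double) passes / runs;
        System.out.printf("%s: %d/%d runs passed (%.4f)%n",
                newTest.getName(), passes, runs, passRate);
        // The promotion criterion is arbitrary; any observed failure could veto promotion.
        System.out.println(passRate == 1.0 ? "PROMOTE to CI" : "KEEP in reservoir");
    }
}
```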
Thanks for the great blog post!
It seems that you categorize flakiness as a test issue, but the cause of a flaky test result could be in the production code and therefore be a real issue.
Did you investigate how many of the flaky tests are due to a real issue? And do you give flaky test results lower priority than tests that fail every time?
We do not currently keep an accurate count of the number of times that flaky tests are really masking bugs in the code. We see it as a testing issue mostly because it makes it more difficult to use the tests for their intended purpose - finding problems with the code. From the testing system's point of view, a test that fails reliably is far better than a test that is flaky! A persistently failing test gives a clear signal about what to do - even if that means fixing the test.
Thanks John Micco for sharing your experience.
Are those GUI-level tests you're struggling with? They're usually considered flaky.
Do you use a rerun mechanism for tests that fail?
Regards,
Sławek
Flaky tests appear everywhere in our corpus, but there is probably some skew toward UI testing that we observe - although I have not quantified this.
Our rerun mechanism is only used for tests that are marked as flaky or when users specifically request it.
Agreed. UI tests are definitely flaky because of how test harnesses interact with the UI, timing issues, handshaking, and extraction of state. See more here: http://comet.unl.edu/tutorial.php
Thanks John - wildly enough, these are pretty common issues in large functional automation implementations. I know you have a heavy investment in Selenium; do you use another tool for service virtualization?
Today at Google, test authors and test infrastructure developers throughout the organization are responsible for creating/using service virtualization in their tests. We do not have a central framework - other than providing generic mocking frameworks like Mockito.
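(For readers who have not used the mocking approach mentioned in this reply, here is a minimal Mockito/JUnit 4 sketch of replacing a flaky network dependency with a deterministic stub. The ExchangeRateClient and BillingService names are invented for illustration; only the Mockito calls themselves reflect the real library.)

```java
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class BillingServiceTest {
    // Hypothetical external dependency that would normally go over the network.
    interface ExchangeRateClient {
        double rateFor(String currency);
    }

    // Hypothetical class under test.
    static class BillingService {
        private final ExchangeRateClient rates;
        BillingService(ExchangeRateClient rates) { this.rates = rates; }
        double toUsd(double amount, String currency) { return amount * rates.rateFor(currency); }
    }

    @Test
    public void convertsUsingStubbedRate() {
        // Stub the dependency so the test never touches the real, flaky service.
        ExchangeRateClient stub = mock(ExchangeRateClient.class);
        when(stub.rateFor("EUR")).thenReturn(1.10);

        BillingService billing = new BillingService(stub);
        assertEquals(11.0, billing.toUsd(10.0, "EUR"), 1e-9);
    }
}
```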
You touched on a very important and common problem here.
Have you tried to track the wasted time caused by flaky tests (developers unable to submit changes, CI runs that require additional cycles)?
We are currently working to better analyze the cost to developer workflows caused by test flakiness - we do not yet have anything publishable out of that effort.
I repeat this almost every day: do not write many UI system tests - they should be rare. You need to build a pyramid (http://qala.io/blog/test-pyramid.html). There is almost always a way to write tests at a lower level.
Often it's the separation of AQA and dev teams that leads to flakiness, since AQA engineers usually write system tests only. Let devs write all(!) the tests and the proportion of flaky tests would drop to 1:1000.
It's not only GUI tests. There are many sources of flakiness, some of which Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and I analyzed in this paper: http://mir.cs.illinois.edu/marinov/publications/LuoETAL14FlakyTestsAnalysis.pdf
Great post. I guess we all experience the same issues when it comes to large-scale automation processes.
@John Micco - do you consider versions/experiments/configurations between test cycles when deciding whether a test passed beta or should be marked as flaky?
Thanks,
Elad - WIX.com
Good point - we definitely track each test together with all of its flags and configuration values, and we determine whether it is flaky based on the flag/configuration combination being tested.
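(One hedged way to picture what "differentiating by flag/configuration" could look like in code - this is an illustrative sketch, not Google's system - is to key pass/fail statistics on the (test, configuration) pair rather than on the test name alone.)

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: track pass/fail counts per (test name, configuration) pair,
// so the same test under different flags is judged independently.
public class FlakinessTracker {
    static final class Key {
        final String testName;
        final String configuration; // e.g. a canonicalized flag string

        Key(String testName, String configuration) {
            this.testName = testName;
            this.configuration = configuration;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return testName.equals(k.testName) && configuration.equals(k.configuration);
        }
        @Override public int hashCode() { return Objects.hash(testName, configuration); }
    }

    static final class Stats { int passes; int failures; }

    private final Map<Key, Stats> stats = new HashMap<>();

    public void record(String testName, String configuration, boolean passed) {
        Stats s = stats.computeIfAbsent(new Key(testName, configuration), k -> new Stats());
        if (passed) s.passes++; else s.failures++;
    }

    // A test+configuration pair is considered flaky here if it has both passed and failed.
    public boolean isFlaky(String testName, String configuration) {
        Stats s = stats.get(new Key(testName, configuration));
        return s != null && s.passes > 0 && s.failures > 0;
    }
}
```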
Nice article, John! We have this kind of issue too, and we came up with a rerun concept in which a failed test is rerun up to 3 times. If the test fails on all of those runs, we mark it as a failure. We have a configuration option to set the rerun count.
And sometimes flaky tests are also a result of the way the tests are written.
-Surya
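(For anyone who wants to reproduce the rerun idea Surya describes with plain JUnit 4, a minimal retry rule might look like the sketch below; the retry count would come from your own configuration, and the class is illustrative rather than a reference implementation.)

```java
import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Minimal JUnit 4 retry rule: rerun a failing test up to `maxAttempts` times
// and only report a failure if every attempt fails.
public class RetryRule implements TestRule {
    private final int maxAttempts;

    public RetryRule(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                Throwable lastFailure = null;
                for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                    try {
                        base.evaluate(); // run the test body once
                        return;          // first pass wins
                    } catch (Throwable t) {
                        lastFailure = t;
                        System.err.println(description + ": attempt " + attempt + " failed");
                    }
                }
                throw lastFailure; // all attempts failed: report the last failure
            }
        };
    }
}
```

Attach it in a test class with a field such as "@Rule public RetryRule retry = new RetryRule(3);". Note that blanket retries trade away signal, which is exactly the concern raised in the next comment.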
How will you distinguish between a failure due to environmental problems and a failure due to a software bug? A test hitting a real software bug can still pass 4 out of 5 times.
Hi John,
Thanks for the post, great read! Can I ask what tool you are using to monitor the flakiness of your tests?
Thanks,
Sarah
Thanks John. Do you also classify your flakiness data by test size (small, medium, and large)?
It's great to hear that you're working on this. I've been fighting against flaky tests in our C++ projects as well. I've noticed that some projects are adding flaky-test information to the JUnit XML results used by Jenkins, but the googletest framework doesn't yet support this (https://github.com/google/googletest/issues/727). For projects that see lots of flaky test failures, we currently re-run failing tests one time and only report a failure if a test fails twice in a row.
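(Since googletest does not emit flaky-test information itself, one simplified way to get "rerun once, fail only on two consecutive failures" behaviour is a small wrapper around the test binary. The sketch below is Java for consistency with the other examples, works at whole-binary granularity, and uses a placeholder binary path; a real setup would rerun only the failed tests, e.g. via --gtest_filter, and merge the result XML.)

```java
import java.io.IOException;

// Simplified sketch: run a googletest binary, and if it fails, run it once more.
// Only report failure if both runs fail.
public class RerunOnce {
    public static void main(String[] args) throws IOException, InterruptedException {
        String binary = args.length > 0 ? args[0] : "./my_tests"; // placeholder path
        int first = runOnce(binary);
        if (first == 0) {
            System.out.println("PASS");
            return;
        }
        int second = runOnce(binary);
        if (second == 0) {
            System.out.println("FLAKY: failed once, passed on rerun");
        } else {
            System.out.println("FAIL: failed twice in a row");
            System.exit(1);
        }
    }

    private static int runOnce(String binary) throws IOException, InterruptedException {
        // googletest binaries exit with a non-zero status if any test fails.
        Process p = new ProcessBuilder(binary).inheritIO().start();
        return p.waitFor();
    }
}
```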
Marking tests as flaky is addressing the problem from the wrong direction, and it will lose potentially valuable information.
Instead, have a test monitor itself for what it does. If it fails, look at the root cause using the available information. Then, depending on what failed (for example, an external dependency), do a smart retry. Is the failure reproduced? Then fail the test!
"Marking a test as flaky" gives one permission to ignore failures, but there is potentially important and potentially actionable information there.
Instead, *use* the information to manage quality risk and/or improve the quality of the product.
MetaAutomation has patterns that describe at a high level how to do this. Don't drop information on the floor that can have value for the team and for the product!
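(To make the "smart retry" idea concrete, here is one hedged JUnit 4 sketch: a failure is retried only when it matches a caller-supplied transient-cause predicate - treating IOException as transient here is just an example assumption - and the test still fails if the failure reproduces.)

```java
import java.io.IOException;
import java.util.function.Predicate;

import org.junit.rules.TestRule;
import org.junit.runner.Description;
import org.junit.runners.model.Statement;

// Sketch of a "smart retry": retry once only when the failure looks transient
// (e.g. an external dependency error), and fail if the failure reproduces.
public class SmartRetryRule implements TestRule {
    private final Predicate<Throwable> isTransient;

    public SmartRetryRule(Predicate<Throwable> isTransient) {
        this.isTransient = isTransient;
    }

    @Override
    public Statement apply(Statement base, Description description) {
        return new Statement() {
            @Override
            public void evaluate() throws Throwable {
                try {
                    base.evaluate();
                } catch (Throwable first) {
                    if (!isTransient.test(first)) {
                        throw first; // not a known transient cause: fail immediately
                    }
                    System.err.println(description + ": transient failure, retrying once: " + first);
                    base.evaluate(); // if the failure reproduces, this throws and the test fails
                }
            }
        };
    }

    // Example predicate: treat I/O errors as transient external-dependency failures.
    public static SmartRetryRule forIoErrors() {
        return new SmartRetryRule(t -> t instanceof IOException);
    }
}
```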
We have been dealing with this same phenomenon for years.
Currently, we execute reliability runs of all of the CI tests (we try for hundreds of executions, but it depends on automation system load levels) per build to generate consistency rates. Using those numbers, we push product teams to move any test that falls below a certain consistency level out of the CI tests. We keep such tests in the reliability suite for the sake of coverage and issue discovery, but do not use them to gate submission into the main code branch.
We likewise have difficulty accounting for the costs, but ballpark estimates show it is very expensive. I have done prior analysis demonstrating that intermittent failures cause engineers to take longer to submit. Intermittent failures have a high duplicate-bug rate, and ad hoc estimates from engineers are that we lose ~20 per duplicate bug for an engineer to determine there is duplication. The costs go way beyond all of that, though, particularly as process gates shut down team productivity (failing CI tests lock a branch against changes until the failure is resolved), but also through legitimate bug escapes that were ignored because of the noise.
It is my own opinion that even after tons of effort to reduce noise from tests, flaky tests are inevitable once the test conditions reach a certain complexity. There are more stable coding patterns (mostly in product, but also in test) which stabilize the test results, but they can only take you so far. Once you have moved the tests (e.g., converting end-to-end tests to unit tests, moving pre-release tests to TIP methodologies) you still have a core set of problems only discoverable in an integrated end-to-end system. And those tests will be flaky. If they are not flaky, they tend to never find bugs. This is not because the test is bad. It is because the conditions of the test, the thing that makes it flaky, are EXACTLY the same thing that caused the bug to be introduced in the first place. These bugs are scarier, riskier, and harder to find. The secret, then, is to appropriately manage them. I prefer to rely more on repetition, statistics, and runs that do not block the CI process. I prefer to data-mine the test results and feed the work backlog.
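(As a concrete illustration of the consistency-rate gating described in this comment - with an invented threshold and invented numbers - the bookkeeping can be as simple as the following sketch.)

```java
// Hypothetical sketch of consistency-rate gating: run each CI candidate test many
// times per build, compute its pass rate, and only let sufficiently consistent
// tests gate submission. The 99.5% threshold is an arbitrary example.
public class ConsistencyGate {
    public static final double THRESHOLD = 0.995;

    public static double consistencyRate(int passes, int runs) {
        return runs == 0 ? 0.0 : (double) passes / runs;
    }

    public static boolean belongsInCiGate(int passes, int runs) {
        return consistencyRate(passes, runs) >= THRESHOLD;
    }

    public static void main(String[] args) {
        // Example: 497 passes out of 500 reliability runs -> 99.4%, below threshold,
        // so the test stays in the reliability suite but stops gating submission.
        System.out.println(belongsInCiGate(497, 500)); // false
        System.out.println(belongsInCiGate(500, 500)); // true
    }
}
```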
Did you look at correlated unreliability? We have a number of tests that are stable in themselves, but use some form of global state (/tmp files, other global state) that causes them to fail if run together with another test. We also use a test environment that preferentially runs failing tests first, with the rest after them. That leads to the situation that if such tests ever fail, they are then run first together with other failing tests, making them more likely to fail again, and when they succeed they are run later, making them less likely to fail again.
Of course, this makes it even harder to know if you broke something, as the test will reliably fail on your machine, but only for you - even after you revert any changes you made.
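(One way to surface this kind of correlated, order-dependent failure is to run the same suite repeatedly in shuffled orders and flag classes whose outcome depends on their position. The rough JUnit 4 sketch below uses placeholder test classes; it illustrates the idea and is not the commenter's setup.)

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;

// Rough sketch: run the same test classes in several random orders.
// A class that passes in some orders and fails in others likely depends on
// shared global state (temp files, static fields, databases, ...).
public class ShuffledSuiteRunner {
    public static void main(String[] args) {
        // Placeholder test classes; substitute your own.
        List<Class<?>> classes = Arrays.asList(FooTest.class, BarTest.class, BazTest.class);

        for (int i = 0; i < 20; i++) {
            Collections.shuffle(classes);
            Result result = JUnitCore.runClasses(classes.toArray(new Class<?>[0]));
            System.out.printf("order %s -> %d failure(s)%n", classes, result.getFailureCount());
        }
    }

    // Dummy placeholders so the sketch compiles; replace with real test classes.
    public static class FooTest { @org.junit.Test public void ok() {} }
    public static class BarTest { @org.junit.Test public void ok() {} }
    public static class BazTest { @org.junit.Test public void ok() {} }
}
```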
Any chance these tools are open source?
There is a flaky test handler plugin for Jenkins:
https://github.com/jenkinsci/flaky-test-handler-plugin
We have also written a script for merging JUnit XML files to mark flaky tests so that the Jenkins plugin can parse them.
https://bitbucket.org/osrf/release-tools/src/default/jenkins-scripts/tools/
That's all that I'm familiar with.
Cloudera's tool, which uses a distributed cluster for repeating tests and reproducing flaky ones: https://github.com/cloudera/dist_test
Another tool (mine) for controlling non-determinism so as to reproduce flaky tests: https://github.com/osrg/namazu
I have created an sbt plugin to detect flaky tests in our Java/Scala projects: https://github.com/otrebski/sbt-flaky. It runs the tests many times and analyzes the JUnit reports. It can also calculate trends for tests. You can check an example HTML report: http://sbt-flaky-demo.bitballoon.com
@John Micco,
I do not have the background on the 'process' approach for categorizing, grouping, and prioritizing your tests.... Still, I would like to know whether a combination of exploratory testing and CI has been considered.
One of the basic premises for automation is to choose software candidates that are stable and are not changed too often.
Please, let me know.
Thanks,
Devasena.
We solved this problem by re-running the failed test cases three times and checking them to find what is causing the flakiness. It is not very time consuming, but it is a more robust solution, since 90% pass on the first try.
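(A small back-of-the-envelope illustration of the trade-off, under the simplifying assumption that runs are independent with per-run pass probability p: with an initial run plus up to three retries, a test is reported red only if all four runs fail, so an intermittent real bug can easily slip through - which connects to the earlier question about bugs that still pass 4 runs out of 5.)

```java
// Illustration only, assuming independent runs with per-run pass probability p:
// with an initial run plus up to 3 retries, the test is reported red only if
// all 4 runs fail, i.e. with probability (1 - p)^4.
public class RetryEscapeOdds {
    public static void main(String[] args) {
        double p = 0.8; // example: a real bug that still lets the test pass 80% of the time
        double reportedRed = Math.pow(1 - p, 4);
        System.out.printf("P(reported red) = %.4f%n", reportedRed);          // 0.0016
        System.out.printf("P(bug slips through) = %.4f%n", 1 - reportedRed); // 0.9984
    }
}
```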
This is pretty interesting - I'll definitely be investigating the flaky test handler for Jenkins that was posted in the comments.
ReplyDeleteTestProject conducted a survey that compares AngularJS VS. ReactJS, exposes current front end development technologies and unit testing tool preferences of software professionals! See the results here:
ReplyDeletehttp://blog.testproject.io/2016/09/01/front-end-development-unit-test-automation-trends2/
How do you mark a test as flaky? Do you annotate the test in source code in some way, or does this information reside elsewhere (perhaps in a database)?
Summary: flaky tests are bugs either in the tests or in the production code. When in doubt, doubt the tests first. Do you whitelist tests that are known to be robust for a long time?
ReplyDeleteAs dangerous "flaky" tests that give false negatives are, giving false positives is even more dangerous. Writing test cases for misunderstood requirements could lead of incorrect validation of production code and potentially dangerous bugs to remain undetected for a long time.
ReplyDeleteI experienced such problem while working at Nokia's manufacturing facility in Fort Worth, TX in the late 90s. An incorrect calibration (adjustment) of a camera lead to a number of low-quality displays to be assembled on mobile phones. The problem was discovered by QC auditing and tedious examination of test data logged by the production test stations. The "false positive" lead to an unusually high prime pass yield of the test station in question which wasn't detected because it is almost impossible to sense a problem when all the tests are passing.
What are the plugins that you use to identify and handle flaky tests on Jenkins?
So far I have found the following two:
* Flaky Test Handler Plugin: This plugin is designed to handle flaky tests, including re-running failed tests, aggregating and reporting flaky-test statistics, and so on. https://wiki.jenkins-ci.org/display/JENKINS/Flaky+Test+Handler+Plugin
* Test Results Analyzer Plugin: Displays a matrix of subsequent runs of the same tests, so you can identify which tests are occasionally red. https://wiki.jenkins-ci.org/display/JENKINS/Test+Results+Analyzer+Plugin
Flaky tests? I think that, frequently, test environments are to blame and are taken for granted. It is important to be able to run tests of "components" in an isolated environment where you control the inputs. In other words, besides the "component" under test, you control everything else so as to provide known inputs (this includes data transfer rates). Otherwise, you are setting yourself up to get inconsistent results. Per the scientific process, only one thing should change at a time in order to learn something new. If more than one thing changes at a time, the result is ambiguous.
If you need more bandwidth to test more changes at the same time, you need to be able to stand up multiple identical environments that can each test one change at a time.