Interesting, yet not so surprising that larger tests are more flaky. Here are a couple of quick thoughts:
1) I'm not so sure that the linear trend is a good fit, given the large variance for larger tests (looks rather heteroscedastic to me).
2) It's surprising to me that very small tests (unit tests) are flaky. Is there a pattern in these small flaky tests?
3) The analysis of Android emulator is interesting. I'm not too familiar with the emulator: does it exhibit non-determinism, e.g. in the form of scheduling?
1) The large variance for larger tests is actually variance due to the buckets I've put the tests into. Buckets for the larger tests contain many fewer tests than those for smaller tests, so they are more likely to deviate - and to deviate by a greater amount. The graph with the smallest 96% of tests doesn't have this issue.
You bring up a good point. The linear trend is just a default, though I tried a few others and nothing looked significantly better. Without knowing the mechanism, it's hard to say what it really should be.
2) There are a few patterns - some tests rely on random numbers, some occasionally hit timeouts in test and infra code, etc. I believe that many of these tests can and do get fixed quickly, but we have enough tests that if you take a snapshot at any given moment there will be a few that have issues.
3) I'm not too familiar with the emulator either, so I don't have a good answer here. It's certainly worth looking into further.
Jeff, thanks very much for your detailed response. As for the visualization, my first thought was explicitly representing uncertainty, e.g. as done by the lmplot function in seaborn (or ggplot, ...) - for an example, see the first plot at http://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot
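For readers who haven't used that style of plot, here is a minimal sketch of what the commenter is suggesting, assuming per-test data in a CSV with hypothetical columns binary_size_mb and is_flaky; the point is only the regression line plus its bootstrapped confidence band.

```python
# Minimal sketch: flakiness vs. binary size with an explicit uncertainty band,
# as seaborn's lmplot draws by default (95% CI via bootstrapping).
# The file name and column names are hypothetical placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

tests = pd.read_csv("test_results.csv")  # one row per test

sns.lmplot(
    x="binary_size_mb",          # size of the compiled test binary
    y="is_flaky",                # 1 if the test was flaky, 0 otherwise
    data=tests,
    ci=95,                       # shaded 95% confidence interval on the fit
    scatter_kws={"alpha": 0.2},  # de-emphasize the individual points
)
plt.show()
```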
Thanks for the post. I agree that the conclusions confirm that bigger tests are more flaky.
It would be very interesting to read how you deal with flaky tests. Maybe some special techniques, or some infrastructure solutions?
That could be its own blog post. (Or multiple.)
Mentally, I split it into four areas - identification, notification, triage, prevention.
Identification - You need to be able to identify which tests are flaky, how flaky they are, and the effect this has (a minimal sketch of this step follows after this list).
Notification - You need to tell people their test is flaky, or their infrastructure is flaky.
Triage - You need to provide tools to make it easy to debug the problem and find a solution.
Prevention - You need to try to prevent developers from writing flaky tests in the first place.
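To make the identification step concrete, here is a minimal sketch under an assumed data model - a flat list of per-run records; the schema and test names are hypothetical, not Google's infrastructure. A test is counted as flaky when it both passed and failed at the same code version.

```python
# Minimal sketch of "identification": find tests that both passed and failed
# at the same code version. The record format and names are hypothetical.
from collections import defaultdict

def find_flaky_tests(results):
    """results: iterable of (test_name, code_version, passed) tuples."""
    outcomes = defaultdict(set)          # (test, version) -> outcomes seen
    for test, version, passed in results:
        outcomes[(test, version)].add(passed)

    flaky = defaultdict(int)             # test -> number of versions where it flaked
    for (test, version), seen in outcomes.items():
        if len(seen) == 2:               # both True and False observed
            flaky[test] += 1
    return dict(flaky)

runs = [
    ("LoginTest", "abc123", True),
    ("LoginTest", "abc123", False),      # mixed results at one version -> flaky
    ("CheckoutTest", "abc123", True),
    ("CheckoutTest", "abc123", True),
]
print(find_flaky_tests(runs))            # {'LoginTest': 1}
```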
Have you drawn any correlations between test behavior and flakiness? For example, capture product telemetry during the test. Compare metrics like number of events during test, number of distinct events during test, degree of action variance between iterations of test (in terms of which events show per iteration and in what order), and then see if any of those correlate with test flakiness?
I have not done any research along those lines. I think that would be interesting, though choosing the correct set of events and measuring them across a large enough set of tests would be tricky. Are there specific events you had in mind? I could see certain things related to threads and locking being correlated.
In this particular case, the events are application specific. They are part of the product's own telemetry. They track things like which commands a user chose, what activities they are doing, whether specific application events were triggered for things like timers, operating system events, etc. So you get a different set of events for a word processor than you might for a spreadsheet, but you also get a core set of events common to shared code.
My own observation is that events at this level exhibit a surprisingly high level of variance. A test doing the same thing every time at the command level will typically yield different sequences of events at the application layer. About 90% of the events happen every time, but the remaining 10% or so vary. And as expected, this is highly application dependent. The more a given application is multi-threaded, asynchronous, or event-driven, the more variance you see in the actual telemetry signature.
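As an illustration of how such variance could be quantified - not part of the commenter's setup; the event names and logs below are hypothetical - one simple metric is the fraction of distinct telemetry events that show up in every iteration of the test:

```python
# Minimal sketch: what fraction of distinct events appeared in every run?
# Event names are hypothetical.
def event_stability(iterations):
    """iterations: list of sets of event names, one set per test run."""
    all_events = set().union(*iterations)
    always_seen = set.intersection(*iterations)
    return len(always_seen) / len(all_events)

runs = [
    {"OpenDoc", "RenderPage", "AutosaveTimer", "SpellCheck"},
    {"OpenDoc", "RenderPage", "SpellCheck"},
    {"OpenDoc", "RenderPage", "AutosaveTimer", "SpellCheck", "GcPause"},
]
print(f"{event_stability(runs):.0%} of distinct events occurred in every run")
```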
Additional question: have you drawn any comparisons of bug discovery rates and bug priority as relates to test flakiness?
In my own analysis, I have found that for test suites where the tests have extremely low flakiness (between .1% and 1%), bug discovery is likewise very low. We get bugs, for certain, but not nearly as many as we get for tests with much higher flake rates. Certainly, product crashes and unexpected exits are much higher on tests with higher flake rates.
As expensive as flaky tests are, I am finding that there is a trade-off between low flake rates and discovering product bugs. Stabilizing the test behavior seems to sanitize, or, if one prefers a more colorful metaphor, neuter, the test. My current hypothesis is that once you get past the obvious, bad test code problems, you are left with an amount of flakiness that is not only unavoidable, but possibly even desirable. It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests.
I think it depends on how you stabilize the test and whether you then lose some of the coverage you thought you had.
I think your last point is the key: "It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests".
You're right, a flake and a bug are different manifestations of the same problem. If a test consistently fails and the problem lies in production code, we consider it a bug. If a test flakily fails and the problem lies in production code, we should also consider it a bug. Code that is highly complicated and changes rapidly is likely to lead to both bugs and flakes.
The key part of writing and testing that code is to break it down into manageable chunks to reduce both of those (unpleasant) outcomes.
The topic is enlightening. We currently use JIRA/Zephyr in manual mode, and I have worked in other, more automated environments; I'm now prototyping an automated test system for 'systems testing.' Preventing flaky tests certainly is a priority.
I could not determine, however, whether these stats applied to "white box" testing - unit testing of the code done by developers - or to system-type testing, where the product (a server-client architecture with a web app interface to the user) is exercised by systems testers.
I also feel that the size-to-flakiness relationship is a good indicator, as the data has shown, but is size related to complexity? Is it test code growth (meaning algorithm complexity) and/or simple code but bigger data?
So the test logic is sound, but the test runs out of time because the data set is 10 times bigger and the timeout is set too low? It doesn't complete, so there is (in a sense) an incomplete result. Yes, it's a test bug, but the test logic itself hasn't changed - only how it is run has.
Versus the Google situation, with a system already in place that has to be dealt with: what is the best way to minimize risk when building a new system? My thought is to write smaller tests with single purposes while watching data growth and processing.
The stats apply to continuously run, automated tests which are generally hermetic. Some of these are unit tests, others are system or even integration type tests.
The overall size is often due to the complexity of the code rather than the size of the data, but I'm sure there are exceptions. If a timer is set too low and the test becomes flaky due to organic growth of test data, it should be an easy problem to fix - though that doesn't mean that it is fixed immediately.
When building a system I think you always want to start with the smallest tests you can. You will need to add larger tests at some point, but you need to understand why they're necessary and think about whether you can test it in a different, smaller way.
I agree it's important to re-assess your tests as code changes. Breaking down a test into smaller pieces is key to having clean, well-maintained tests.
ReplyDelete"What's the goal of this test?" is a great question to answer. An equally important question is: "What code are we running for this test?" I think many system / integration tests actually have a fairly narrow answer to the first question but a very large answer to the second.
Excellent post; thank you for sharing these findings with the larger community. As a future post, I would be very interested in hearing about any correlation between fixing flaky tests and finding bugs in production code (as opposed to the tests themselves). No doubt flaky tests are a drain on engineering resources, but dedicating scarce engineering resources to fix large (and complex) flaky tests can be a challenge. If it were established that there is a real payback in finding bugs in production code by fixing flaky tests, it would help motivate teams to dedicate time to this exercise.
Also, are you aware of any tools/scripts that can help categorize test results over time? For instance, if a test is run daily, then over a 14-day period it may exhibit one of the following patterns: a) pass consistently; b) fail consistently; c) pass consistently then fail consistently -> new consistent failure; d) pass consistently, then fail intermittently -> new flaky test; e) etc.
Categorizing test results over time to identify patterns (e.g. finding new consistent failures) is currently a manual exercise for us; having this automated would eliminate another manual step and would help us identify new problems as they occur. If anybody has pointers on automating the test result categorization, that would be appreciated.
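In the absence of an off-the-shelf tool, here is a minimal sketch of how that categorization could be automated, assuming a chronological list of daily pass/fail results per test; the category names mirror the ones listed above, but the rules are one possible interpretation rather than an established tool.

```python
# Minimal sketch: label a test's recent history with one of the patterns
# listed above. Input is a chronological list of booleans (True = pass).
def categorize(history):
    if all(history):
        return "consistent pass"
    if not any(history):
        return "consistent fail"
    first_fail = history.index(False)
    if not any(history[first_fail:]):
        return "new consistent failure"   # passed, then failed ever since
    return "new flaky test"               # passes and failures intermixed

print(categorize([True] * 14))                       # consistent pass
print(categorize([True] * 10 + [False] * 4))         # new consistent failure
print(categorize([True, True, False, True, False]))  # new flaky test
```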
It seems you would need a shared test execution and result repository system to have any shared tools or scripts for measurement. Test result schema is all too often specific to the toolset and sometimes the system under test.
That said, I am not so sure this is an arduous problem. Assuming your system stores test results in some kind of database, establishing consistency is a matter of querying passing executions over all executions on a per-test basis. If one wants to be even more precise, also keep track of the number of distinct failures per test (some tests may fail more than once but in different ways). My own definition: if a test either always fails (with the same failure) or always passes, it is consistent - otherwise it is inconsistent. This is usually a trivial query in any repository.
I do recommend changing the execution practices. It is insufficient to have a daily run of a test (I assume "daily" approximates "new build"). You ought to have multiple iterations per test per build, ideally hundreds, so that you can establish granularity to at least the .01 level. In my own experience, I find that product teams can chase a test's flake factor down to .001 or better when needed (or feasible, depending on the type of test), and to establish that you need many iterations. We started what we call a "reliability run" several years ago, and the information value of that investment has paid back many times over.
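A quick back-of-envelope calculation (not from the comment itself) shows why hundreds of iterations are needed: the probability of seeing at least one failure from a test with flake rate p over n runs is 1 - (1 - p)^n, so surfacing a 1% flake with 95% confidence takes roughly 300 runs.

```python
# Back-of-envelope: runs needed to see at least one failure with the given
# confidence, using P(at least one failure) = 1 - (1 - p)^n.
import math

def runs_needed(flake_rate, confidence=0.95):
    return math.ceil(math.log(1 - confidence) / math.log(1 - flake_rate))

print(runs_needed(0.01))    # ~299 runs for a 1% flake
print(runs_needed(0.001))   # ~2995 runs for a 0.1% flake
```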
Regarding flaky tests vs. bugs in production...
One team within Google gathered some data regarding this. They found that when a stable test became flaky, and we could track it to a specific code change, the problem was a bug in production code 1/6th of the time.
If the default is to ignore the flaky tests then you will eventually be ignoring a real bug.
Wayne - Thanks for your reply. Our tests are generally system-level tests with custom hardware; thus, there are multiple variables at play which can lead to flaky tests. We have multiple strategies in place to help tackle this problem: a) Testers need to ensure their tests pass consistently (run at least 10 times in succession without error); b) A failure in regression is automatically re-run 3 times to determine whether the failure is consistent; c) All test results from our regression system are logged into a CSV file; d) Automated tests typically run against each check-in, but the full regression takes at least half a day, so hundreds of test runs is simply not possible. That being said, I like what was said about having "reliability" runs, and I think our system will allow these to run as "best effort" to fill up spare capacity.
Jeff - thanks for the data point. I agree that flaky tests need to be fixed; it's just a matter of the priority given to addressing the failure. As pointed out elsewhere, release velocity is important too and thus trade-offs are required.
Great to see all the discussion on this topic!
Nice post. I also agree that as tests get larger, they get more flaky.
I understand the whole concept of this post, but I am kind of unfamiliar with some vocabulary. I'm not sure of the exact meaning of 'binary size' - is it the size of the test? And second, I'm not sure of the meaning of 'bucket' - it seems like a measure of something, but I am a bit confused. I'd be thankful if somebody explained it to me :)
The tests I looked at get compiled to a single binary that gets run. This contains all the code and data needed to run the test. Binary size is the overall size of that executable.
A bucket is a grouping of tests of similar size. Every test is either flaky (1) or not (0). By putting tests into a bucket with other tests of similar size, we can calculate the percentage of tests within that bucket that are flaky and come up with a continuous number between 0 and 1.
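Here is a minimal sketch of that bucketing, using a small made-up table (the column names and values are hypothetical) and pandas to split tests into equal-count size buckets and take the flaky fraction per bucket:

```python
# Minimal sketch: bucket tests by binary size and compute the flaky fraction
# per bucket. The data and column names are hypothetical.
import pandas as pd

tests = pd.DataFrame({
    "binary_size_mb": [1, 2, 3, 40, 55, 60, 300, 450, 500],
    "is_flaky":       [0, 0, 1,  0,  1,  0,   1,   1,   0],
})

tests["bucket"] = pd.qcut(tests["binary_size_mb"], q=3)   # 3 equal-count buckets
flaky_rate = tests.groupby("bucket", observed=True)["is_flaky"].mean()
print(flaky_rate)   # flakiness rate per size bucket, between 0 and 1
```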
It's bizarre to me to read an article about "testing" that includes no insight into the relationship between testers and the automated fact checks that you are calling "tests." Testing is what people do. The closest our tools come to testing is the fact check.
It's as if I'm reading an article about noisy coffins, and you are wondering about trends and patterns of muffled screams coming from some buried coffins, but not asking what those screams might mean in any individual case. Surely the testers at Google either know the answer or think that answer is very important to discover for each and every "flaky" program you are referring to?
I write software that helps me test. When I do that, my code may behave in "flaky" ways. If so, then what I need to do is ask what that says about the product I am testing as well as my test strategy as a whole. I do not mind if one of my programs behaves in a flaky way, if I feel I am getting good information about the product-under-test by using it. If I am not getting good information, I need to tear that program down and rethink what I'm doing.
This goes back to the goal of testing. The goal is not certainty. The goal is not determinism. That's the way people think who actually hate testing and want to get past it as fast as they can, even if it means doing lousy fake testing. My goal, instead, is insight. I craft my tools to convey important information. If a "flaky" check is providing that, then my testing (which is a human process that accumulates and integrates the facts I glean from my automation) may be just fine.
I wholeheartedly agree, but we have to remember that not all insights are the same or serve the same purpose.
Some purposes have a low noise tolerance, and a higher escape rate tolerance. Some purposes have a higher noise tolerance and a lower escape rate tolerance. It is my belief that one cannot generally optimize for both with the same test. Thus we use different tests with different levels of noise and different levels of ability to discover based on the purpose.
The trend I see right now is that release velocity is dominating everybody's psychology, which puts pressure on the low-noise requirements. This is creating back pressure on the "oops, you missed something" problem (higher discovery requirements paid for with more noise), but velocity pressure is receiving more attention. Further, it is easier for people to just think of "tests" as a simple activity with only one purpose, so the entire focus yields to high-precision, low-noise tests. Not enough time has passed to force the balance to swing back.
I'm not quite sure I understand the point here, but maybe we're talking about different things. These tests that are flaky are automated tests. By design, there is no human involved in running them. There are engineers who write the tests, maintain the tests, and need to diagnose them when they fail. If you're testing that 1 + 1 = 2 and occasionally your program tells you the answer is 1, that's a problem that needs to be solved.
If you're talking about manual testing, there is some extra leeway for flakiness in the test tools. But even then, doesn't the flakiness at some point become an issue that needs to be fixed? If your test tools don't give you a clear signal then how do you trust them?
I cannot account for James' story, but in my case I am talking about automated tests.
The SUT, the environment, and the test conditions are too complex in end-to-end integrated tests to drive inconsistency to zero. In fact, variance is very much the true state of the system, and bugs derive from the complexity that derives from that variance. Some automated tests should be designed to exacerbate that variance, but my experience is that even tests that are invariant and simple in their own behavior will manifest underlying variance that is intrinsic to the SUT, and out of that come the "flaky" results.
This question is very rich in its implications:
Delete"If your test tools don't give you a clear signal then how do you trust them?"
I worry we are conflating correctness and consistency. For both, though, it is a matter of probability and what you can afford. What we do is measure the tests for consistency and separate them into groups: some are for fast/automatic decisions (gating and release procedures), and others have more tolerance for noise that must be filtered (discovering bugs).
The tests which drift farther from 100% consistent (and which also tend to have correctness issues - so while the variables are independent, they often have a relationship) need more statistical analysis to "trust" the signal. Teams usually adopt frequency as their rule of thumb, although some attributes of the reported failure motivate more fixes (e.g. if the entire call stack at the time the failure was reported is in test code that manages generic UI and web page navigation, there is a tendency to punt...).
I have been putting a lot of my recent work into using customer telemetry analysis to mine further value from intermittent test failures. Something that may get ignored because it doesn't occur consistently enough in test may have other signals we can relate to something customers do that deserves more attention. It is new ground still, so I will have to come out of the mineshaft sometime later and talk about discoveries.
James, it seems like you are conflating verification and validation, which are two very different testing activities with very different goals.
It is interesting to see a statistic on where the flaky tests come from, based on code analysis. I would argue that there is another layer: I tend to see more flaky tests from juniors or people less experienced with unit testing, as they tend to write the larger type of tests. And the discussion can go on and on.
I used to think about this a lot when I was doing TDD and was in charge of a relatively small team. My thinking was that if only I could teach people how to write proper code and proper tests (that's another dimension you can make a statistic on - usually if the code under test is large then the test is large as well, and that tends to lead to more "flakiness"), then we could avoid this altogether.
However, in recent years, being in charge of larger teams, I learned that flaky tests tend to be a fact of life. I've not given up hope that we can eliminate them, but I realized that we also need a way to live with them. Training and getting better takes time (and by the time you train those guys, they leave and other fresh colleagues come in and start making the same mistakes). My advice for people in similar positions would be:
1. Isolate the effects of flaky tests on your builds by re-running failed tests a few more times at the end (a minimal sketch of this rerun step follows after this list). This is different from just marking the flaky tests as ignored in two ways. First, it is automatically maintained: the process just picks up whatever test failed and runs it again. And second, even flaky tests can catch real problems, in which case they will fail 100% of the time and point to an issue that might have otherwise gone unnoticed.
2. Have a couple of processes around these tests: one for getting better and not writing them in the first place, and one for dealing with the ones already in your codebase - these tests are trying to tell you something.
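For item 1, here is a minimal sketch of the rerun step under stated assumptions: the failing tests are plain Python callables and the retry count is arbitrary; a real setup would hook into the test runner instead (for example via a rerun-failures plugin).

```python
# Minimal sketch of item 1: re-run each failing test a few times and separate
# consistent failures (likely real regressions) from intermittent ones.
# The test callables and names below are hypothetical.
import random

def run_quietly(test_fn):
    try:
        test_fn()
        return True
    except Exception:
        return False

def rerun_failures(failed_tests, retries=3):
    consistent, intermittent = [], []
    for name, test_fn in failed_tests:
        # A test that passes at least once on retry is reported as flaky,
        # not silently ignored.
        if any(run_quietly(test_fn) for _ in range(retries)):
            intermittent.append(name)
        else:
            consistent.append(name)
    return consistent, intermittent

flaky_example = lambda: None if random.random() < 0.5 else 1 / 0  # fails about half the time
always_broken = lambda: 1 / 0                                     # fails every time
print(rerun_failures([("flaky_example", flaky_example),
                      ("always_broken", always_broken)]))
```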
You're getting at issues with scaling - specifically the development team. Training and education is certainly important. We put a large emphasis on this and new hires go through training on testing. They also learn from senior engineers who hopefully already have the right practices.
Your item #1 should be a short-term solution for any one test. Tests can be flaky due to production issues and isolating/ignoring them for any length of time means possibly missing a bug.
Item #2 is the long-term fix. Avoid writing new flaky tests, and get better at fixing the ones you have. You'll still need to follow item #1 but don't treat that as the only choice. Many people do.
"We put a large emphasis on this and new hires go through training on testing."
I would love to know your curriculum.
Great post! I'm curious though, which tools you guys find to be the least flaky?
Jeff,
Thanks for sharing detailed information.
Can you please let us know how you are executing all 4.2 million tests?
Are you keeping the entire test suite in a CI system like Jenkins? Can you tell me what the best approach is here?
I have a similar requirement from one of my clients, but I don't have any pointers. So please let me know how these test suites are executed; it would be great if you could share any pointers.
Thanks,
Uday
This is great data, I'm curious how you have been acting on it? Was an effort made to break the biggest tests up into smaller ones? If so did that reduce the flakiness?
Great post, thanks. Would you say that the majority of the WebDriver tests are against web apps? What do you see from a mobile app flakiness perspective? I know that mobile has bigger challenges when it comes to testing, especially when executing across devices, OS versions, etc. I've just written a blog post about it to try to address one of the test optimization pains I see in the market lately - I would appreciate your comments on this blog or other ideas around it. https://mobiletestingblog.com/2017/05/30/optimizing-android-test-automation-development/
Looking at the plots, I have the following questions:
- The plot line of "likelihood of being flaky" versus "binary size" looks like a pearl necklace. Why? I would expect a more irregular distribution around the linear correlation line. It looks like the different tests are not independent of each other.
- It's surprising to find a (more or less) linear correlation, as large systems become chaotic. Why? For chaotic systems one would expect an exponential dependency and not a linear one.
- There are some large test setups that perform significantly better than others. Why? What do they do better than the others?
re: pearl necklace - They aren't always independent. Some tests are similar - they test the same system, or are using much of the same framework. This may also cause them to have similar binary sizes and flakiness rates. I don't know if that's the entire explanation, but it's some part of it.
re: linear correlation - I'm not sure a linear correlation is correct, but it is the simplest way to view this. Exponential had slightly better R² values in some cases, but not enough for me to say it's right (see the sketch after these replies for the kind of comparison). It seems like without knowing the true mechanism, it's hard to make a claim one way or another.
re: Some better/worse - Aside from android, most of the differences look fairly small when you factor in relative sizes. My belief is that much of the flakiness in these tests comes from absolute timing (you expect something to complete in XX time and it doesn't) or relative timing (one thread occasionally executes faster than another). To some extent, it's hard for the test framework / setup to deal with all cases where that can occur.
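To illustrate the kind of fit comparison mentioned in the reply above, here is a minimal sketch on synthetic bucketed data (the numbers are placeholders, not the blog's data): fit a linear and an exponential model and compare their R² on the original scale.

```python
# Minimal sketch: compare linear vs. exponential fits of flaky rate against
# binary size by R^2. The data points below are synthetic placeholders.
import numpy as np

size = np.array([1, 5, 20, 50, 100, 200, 400, 800], dtype=float)   # MB per bucket
rate = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35])  # flaky rate per bucket

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear model: rate ~ a * size + b
a, b = np.polyfit(size, rate, 1)
r2_linear = r_squared(rate, a * size + b)

# Exponential model: rate ~ exp(c * size + d), fit as a line in log space
c, d = np.polyfit(size, np.log(rate), 1)
r2_exp = r_squared(rate, np.exp(c * size + d))

print(f"linear R^2 = {r2_linear:.3f}, exponential R^2 = {r2_exp:.3f}")
```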
Thank you! Do you have metrics on what counts as a long test? Is 30 seconds or 50 seconds long?