Interesting, yet not so surprising that larger tests are more flaky. Here are a couple of quick thoughts:
1) I'm not so sure that the linear trend is a good fit, given the large variance for larger tests (looks rather heteroscedastic to me).
2) It's surprising to me that very small tests (unit tests) are flaky. Is there a pattern in these small flaky tests?
3) The analysis of Android emulator is interesting. I'm not too familiar with the emulator: does it exhibit non-determinism, e.g. in the form of scheduling?
1) The large variance for larger tests is actually variance due to the buckets I've put the tests into. Buckets for the larger tests contain many fewer tests than those for smaller tests, so they are more likely to deviate - and to deviate by a greater amount. The graph with the smallest 96% of tests doesn't have this issue.
You bring up a good point. The linear trend is just a default, though I tried a few others and nothing looked significantly better. Without knowing the mechanism, it's hard to say what it really should be.
2) There are a few patterns - some tests rely on random numbers, some occasionally hit timeouts in test and infra code, etc. I believe that many of these tests can and do get fixed quickly, but we have enough tests that if you take a snapshot at any given moment there will be a few that have issues.
3) I'm not too familiar with the emulator either, so I don't have a good answer here. It's certainly worth looking into further.
Jeff, thanks very much for your detailed response. As for the visualization, my first thought was explicitly representing uncertainty, e.g. as done by the lmplot function in seaborn (or ggplot, ...) - for an example, see the first plot at http://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot
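For readers who haven't used that style of plot, here is a minimal sketch of what the commenter is suggesting, assuming per-test data in a CSV with hypothetical columns binary_size_mb and is_flaky; the point is only the regression line plus its bootstrapped confidence band.

```python
# Minimal sketch: flakiness vs. binary size with an explicit uncertainty band,
# as seaborn's lmplot draws by default (95% CI via bootstrapping).
# The file name and column names are hypothetical placeholders.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

tests = pd.read_csv("test_results.csv")  # one row per test

sns.lmplot(
    x="binary_size_mb",          # size of the compiled test binary
    y="is_flaky",                # 1 if the test was flaky, 0 otherwise
    data=tests,
    ci=95,                       # shaded 95% confidence interval on the fit
    scatter_kws={"alpha": 0.2},  # de-emphasize the individual points
)
plt.show()
```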
Thanks for the post. I agree that the conclusions confirm that bigger tests are more flaky.
It would be very interesting to read how you deal with flaky tests. Maybe some special techniques, or some infrastructure solutions?
That could be its own blog post. (Or multiple.)
Mentally, I split it into four areas - identification, notification, triage, prevention.
Identification - You need to be able to identify which tests are flaky, how flaky they are, and the effect this has (a minimal sketch of this step follows after this list).
Notification - You need to tell people their test is flaky, or their infrastructure is flaky.
Triage - You need to provide tools to make it easy to debug the problem and find a solution.
Prevention - You need to try to prevent developers from writing flaky tests in the first place.
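To make the identification step concrete, here is a minimal sketch under an assumed data model - a flat list of per-run records; the schema and test names are hypothetical, not Google's infrastructure. A test is counted as flaky when it both passed and failed at the same code version.

```python
# Minimal sketch of "identification": find tests that both passed and failed
# at the same code version. The record format and names are hypothetical.
from collections import defaultdict

def find_flaky_tests(results):
    """results: iterable of (test_name, code_version, passed) tuples."""
    outcomes = defaultdict(set)          # (test, version) -> outcomes seen
    for test, version, passed in results:
        outcomes[(test, version)].add(passed)

    flaky = defaultdict(int)             # test -> number of versions where it flaked
    for (test, version), seen in outcomes.items():
        if len(seen) == 2:               # both True and False observed
            flaky[test] += 1
    return dict(flaky)

runs = [
    ("LoginTest", "abc123", True),
    ("LoginTest", "abc123", False),      # mixed results at one version -> flaky
    ("CheckoutTest", "abc123", True),
    ("CheckoutTest", "abc123", True),
]
print(find_flaky_tests(runs))            # {'LoginTest': 1}
```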
Have you drawn any correlations between test behavior and flakiness? For example, capture product telemetry during the test. Compare metrics like number of events during test, number of distinct events during test, degree of action variance between iterations of test (in terms of which events show per iteration and in what order), and then see if any of those correlate with test flakiness?
I have not done any research along those lines. I think that would be interesting, though choosing the correct set of events and measuring them across a large enough set of tests would be tricky. Are there specific events you had in mind? I could see certain things related to threads and locking being correlated.
In this particular case, the events are application specific. They are part of the product's own telemetry. They track things like which commands a user chose, what activities they are doing, whether specific application events were triggered for things like timers, operating system events, etc. So you get a different set of events for a word processor than you might for a spreadsheet, but you also get a core set of events common to shared code.
My own observation is that events at this level exhibit a surprisingly high level of variance. A test doing the same thing every time at the command level will typically yield different sequences of events at the application layer. About 90% of the events happen every time, but the remaining 10% or so vary. And as expected, this is highly application dependent. The more a given application is multi-threaded, asynchronous, or event-driven, the more variance you see in the actual telemetry signature.
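As an illustration of how such variance could be quantified - not part of the commenter's setup; the event names and logs below are hypothetical - one simple metric is the fraction of distinct telemetry events that show up in every iteration of the test:

```python
# Minimal sketch: what fraction of distinct events appeared in every run?
# Event names are hypothetical.
def event_stability(iterations):
    """iterations: list of sets of event names, one set per test run."""
    all_events = set().union(*iterations)
    always_seen = set.intersection(*iterations)
    return len(always_seen) / len(all_events)

runs = [
    {"OpenDoc", "RenderPage", "AutosaveTimer", "SpellCheck"},
    {"OpenDoc", "RenderPage", "SpellCheck"},
    {"OpenDoc", "RenderPage", "AutosaveTimer", "SpellCheck", "GcPause"},
]
print(f"{event_stability(runs):.0%} of distinct events occurred in every run")
```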
Additional question: have you drawn any comparisons of bug discovery rates and bug priority as relates to test flakiness?
In my own analysis, I have found that for test suites where the tests have extremely low flakiness (between .1% and 1%), bug discovery is likewise very low. We get bugs, for certain, but not nearly as many as we get for tests with much higher flake rates. Certainly, product crashes and unexpected exits are much higher on tests with higher flake rates.
As expensive as flaky tests are, I am finding that there is a trade-off between low flake rates and discovering product bugs. Stabilizing the test behavior seems to sanitize, or, if one prefers a more colorful metaphor, neuter, the test. My current hypothesis is that once you get past the obvious, bad test code problems, you are left with an amount of flakiness that is not only unavoidable, but possibly even desirable. It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests.
I think it depends on how you stabilize the test and whether you then lose some of the coverage you thought you had.
I think your last point is the key: "It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests".
You're right, a flake and a bug are different manifestations of the same problem. If a test consistently fails and the problem lies in production code, we consider it a bug. If a test flakily fails and the problem lies in production code, we should also consider it a bug. Code that is highly complicated and changes rapidly is likely to lead to both bugs and flakes.
The key part of writing and testing that code is to break it down into manageable chunks to reduce both of those (unpleasant) outcomes.
The topic is enlightening. We currently use JIRA/Zephyr in manual mode, and I have worked in other, more automated environments; I'm now prototyping an automated test system for 'systems testing.' Preventing flaky tests certainly is a priority.
I could not determine, however, whether these stats applied to "white box" testing - unit testing of the code done by developers - or to system-type testing, where the product (a server-client architecture with a web app interface to the user) is exercised by systems testers.
I also feel that the size-to-flakiness relationship is a good indicator, as the data has shown, but is size related to complexity? Is it test code growth (meaning algorithm complexity) and/or simple code but bigger data?
So the test logic is sound, but the test runs out of time because the data set is 10 times bigger and the timeout is set too low? It doesn't complete, so there is (in a sense) an incomplete result. Yes, it's a test bug, but the test logic itself hasn't changed - only how it is run has.
Versus the Google situation, with a system already in place that has to be dealt with: what is the best way to minimize risk when building a new system? My thought is to write smaller tests with single purposes while watching data growth and processing.
The stats apply to continuously run, automated tests which are generally hermetic. Some of these are unit tests, others are system or even integration type tests.
The overall size is often due to the complexity of the code rather than the size of the data, but I'm sure there are exceptions. If a timer is set too low and the test becomes flaky due to organic growth of test data, it should be an easy problem to fix - though that doesn't mean that it is fixed immediately.
When building a system I think you always want to start with the smallest tests you can. You will need to add larger tests at some point, but you need to understand why they're necessary and think about whether you can test it in a different, smaller way.
I agree it's important to re-assess your tests as code changes. Breaking down a test into smaller pieces is key to having clean, well-maintained tests.
ReplyDelete"What's the goal of this test?" is a great question to answer. An equally important question is: "What code are we running for this test?" I think many system / integration tests actually have a fairly narrow answer to the first question but a very large answer to the second.
Excellent post; thank you for sharing these findings with the larger community. As a future post, I would be very interested in hearing about any correlation between fixing flaky tests and finding bugs in production code (as opposed to the tests themselves). No doubt flaky tests are a drain on engineering resources, but dedicating scarce engineering resources to fix large (and complex) flaky tests can be a challenge. If it were established that there is a real payback in finding bugs in production code by fixing flaky tests, it would help motivate teams to dedicate time to this exercise.
Also, are you aware of any tools/scripts that can help categorize test results over time? For instance, if a test is run daily, then over a 14-day period it may exhibit one of the following patterns: a) pass consistently; b) fail consistently; c) pass consistently then fail consistently -> new consistent failure; d) pass consistently, then fail intermittently -> new flaky test; e) etc.
Categorizing test results over time to identify patterns (e.g. finding new consistent failures) is currently a manual exercise for us; having this automated would eliminate another manual step and would help us identify new problems as they occur. If anybody has pointers on automating the test result categorization, that would be appreciated.
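In the absence of an off-the-shelf tool, here is a minimal sketch of how that categorization could be automated, assuming a chronological list of daily pass/fail results per test; the category names mirror the ones listed above, but the rules are one possible interpretation rather than an established tool.

```python
# Minimal sketch: label a test's recent history with one of the patterns
# listed above. Input is a chronological list of booleans (True = pass).
def categorize(history):
    if all(history):
        return "consistent pass"
    if not any(history):
        return "consistent fail"
    first_fail = history.index(False)
    if not any(history[first_fail:]):
        return "new consistent failure"   # passed, then failed ever since
    return "new flaky test"               # passes and failures intermixed

print(categorize([True] * 14))                       # consistent pass
print(categorize([True] * 10 + [False] * 4))         # new consistent failure
print(categorize([True, True, False, True, False]))  # new flaky test
```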
It seems you would need a shared test execution and result repository system to have any shared tools or scripts for measurement. Test result schema is all too often specific to the toolset and sometimes the system under test.
That said, I am not so sure this is an arduous problem. Assuming your system stores test results in some kind of database, establishing consistency is a matter of querying passing executions over all executions on a per-test basis. If one wants to be even more precise, also keep track of the number of distinct failures per test (some tests may fail more than once but in different ways). My own definition: if a test either always fails (with the same failure) or always passes, it is consistent - otherwise it is inconsistent. This is usually a trivial query in any repository.
I do recommend changing the execution practices. It is insufficient to have a daily run of a test (I assume "daily" approximates "new build"). You ought to have multiple iterations per test per build, ideally hundreds, so that you can establish granularity to at least the .01 level. In my own experience, I find that product teams can chase a test's flake factor down to .001 or better when needed (or feasible, depending on the type of test), and to establish that you need many iterations. We started what we call a "reliability run" several years ago, and the information value of that investment has paid back many times over.
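A quick back-of-envelope calculation (not from the comment itself) shows why hundreds of iterations are needed: the probability of seeing at least one failure from a test with flake rate p over n runs is 1 - (1 - p)^n, so surfacing a 1% flake with 95% confidence takes roughly 300 runs.

```python
# Back-of-envelope: runs needed to see at least one failure with the given
# confidence, using P(at least one failure) = 1 - (1 - p)^n.
import math

def runs_needed(flake_rate, confidence=0.95):
    return math.ceil(math.log(1 - confidence) / math.log(1 - flake_rate))

print(runs_needed(0.01))    # ~299 runs for a 1% flake
print(runs_needed(0.001))   # ~2995 runs for a 0.1% flake
```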
Regarding flaky tests vs. bugs in production...
One team within Google gathered some data regarding this. They found that when a stable test became flaky, and we could track it to a specific code change, the problem was a bug in production code 1/6th of the time.
If the default is to ignore the flaky tests then you will eventually be ignoring a real bug.
Wayne - Thanks for your reply. Our tests are generally system-level tests with custom hardware; thus, there are multiple variables at play which can lead to flaky tests. We have multiple strategies in place to help tackle this problem: a) Testers need to ensure their tests pass consistently (run at least 10 times in succession without error); b) A failure in regression is automatically re-run 3 times to determine whether the failure is consistent; c) All test results from our regression system are logged into a CSV file; d) Automated tests typically run against each check-in, but the full regression takes at least half a day, so hundreds of test runs is simply not possible. That being said, I like what was said about having "reliability" runs, and I think our system will allow these to run as "best effort" to fill up spare capacity.
Jeff - thanks for the data point. I agree that flaky tests need to be fixed; it's just a matter of the priority given to addressing the failure. As pointed out elsewhere, release velocity is important too and thus trade-offs are required.
Great to see all the discussion on this topic!
Nice post. I also agree that as tests get larger, they get more flaky.
I understand the whole concept of this post, but I am kind of unfamiliar with some vocabulary. I'm not sure of the exact meaning of 'binary size' - is it the size of the test? And second, I'm not sure of the meaning of 'bucket' - it seems like a measure of something, but I am a bit confused. I'd be thankful if somebody explained it to me :)
The tests I looked at get compiled to a single binary that gets run. This contains all the code and data needed to run the test. Binary size is the overall size of that executable.
A bucket is a grouping of tests of similar size. Every test is either flaky (1) or not (0). By putting tests into a bucket with other tests of similar size, we can calculate the percentage of tests within that bucket that are flaky and come up with a continuous number between 0 and 1.
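Here is a minimal sketch of that bucketing, using a small made-up table (the column names and values are hypothetical) and pandas to split tests into equal-count size buckets and take the flaky fraction per bucket:

```python
# Minimal sketch: bucket tests by binary size and compute the flaky fraction
# per bucket. The data and column names are hypothetical.
import pandas as pd

tests = pd.DataFrame({
    "binary_size_mb": [1, 2, 3, 40, 55, 60, 300, 450, 500],
    "is_flaky":       [0, 0, 1,  0,  1,  0,   1,   1,   0],
})

tests["bucket"] = pd.qcut(tests["binary_size_mb"], q=3)   # 3 equal-count buckets
flaky_rate = tests.groupby("bucket", observed=True)["is_flaky"].mean()
print(flaky_rate)   # flakiness rate per size bucket, between 0 and 1
```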
It's bizarre to me to read an article about "testing" that includes no insight into the relationship between testers and the automated fact checks that you are calling "tests." Testing is what people do. The closest our tools come to testing is the fact check.
It's as if I'm reading an article about noisy coffins, and you are wondering about trends and patterns of muffled screams coming from some buried coffins, but not asking what those screams might mean in any individual case. Surely the testers at Google either know the answer or think that answer is very important to discover for each and every "flaky" program you are referring to?
I write software that helps me test. When I do that, my code may behave in "flaky" ways. If so, then what I need to do is ask what that says about the product I am testing as well as my test strategy as a whole. I do not mind if one of my programs behaves in a flaky way, if I feel I am getting good information about the product-under-test by using it. If I am not getting good information, I need to tear that program down and rethink what I'm doing.
This goes back to the goal of testing. The goal is not certainty. The goal is not determinism. That's the way people think who actually hate testing and want to get past it as fast as they can, even if it means doing lousy fake testing. My goal, instead, is insight. I craft my tools to convey important information. If a "flaky" check is providing that, then my testing (which is a human process that accumulates and integrates the facts I glean from my automation) may be just fine.
I wholeheartedly agree, but we have to remember that not all insights are the same or serve the same purpose.
Some purposes have a low noise tolerance, and a higher escape rate tolerance. Some purposes have a higher noise tolerance and a lower escape rate tolerance. It is my belief that one cannot generally optimize for both with the same test. Thus we use different tests with different levels of noise and different levels of ability to discover based on the purpose.
The trend I see right now is that release velocity is dominating everybody's psychology, which puts pressure on the low-noise requirements. This is creating back pressure on the "oops, you missed something" problem (higher discovery requirements paid for with more noise), but velocity pressure is receiving more attention. Further, it is easier for people to just think of "tests" as a simple activity with only one purpose, so the entire focus yields to high-precision, low-noise tests. Not enough time has passed to force the balance to swing back.
I'm not quite sure I understand the point here, but maybe we're talking about different things. These tests that are flaky are automated tests. By design, there is no human involved in running them. There are engineers who write the tests, maintain the tests, and need to diagnose them when they fail. If you're testing that 1 + 1 = 2 and occasionally your program tells you the answer is 1, that's a problem that needs to be solved.
If you're talking about manual testing, there is some extra leeway for flakiness in the test tools. But even then, doesn't the flakiness at some point become an issue that needs to be fixed? If your test tools don't give you a clear signal then how do you trust them?
I cannot account for James' story, but in my case I am talking about automated tests.
The SUT, the environment, and the test conditions are too complex in end-to-end integrated tests to drive inconsistency to zero. In fact, variance is very much the true state of the system, and bugs derive from the complexity that derives from that variance. Some automated tests should be designed to exacerbate that variance, but my experience is that even tests that are invariant and simple in their own behavior will manifest underlying variance that is intrinsic to the SUT, and out of that come the "flaky" results.
This question is very rich in its implications:
Delete"If your test tools don't give you a clear signal then how do you trust them?"
I worry we are conflating correctness and consistency. For both, though, it is a matter of probability and what you can afford. What we do is measure the tests for consistency and separate them into groups: some are for fast/automatic decisions (gating and release procedures), and others have more tolerance for noise that must be filtered (discovering bugs).
The tests which drift farther from 100% consistent (and which also tend to have correctness issues - so while the variables are independent, they often have a relationship) need more statistical analysis to "trust" the signal. Teams usually adopt frequency as their rule of thumb, although some attributes of the reported failure motivate more fixes (e.g. if the entire call stack at the time the failure was reported is in test code that manages generic UI and web page navigation, there is a tendency to punt...).
I have been putting a lot of my recent work into using customer telemetry analysis to mine further value from intermittent test failures. Something that may get ignored because it doesn't occur consistently enough in test may have other signals we can relate to something customers do that deserves more attention. It is new ground still, so I will have to come out of the mineshaft sometime later and talk about discoveries.
James, it seems like you are conflating verification and validation, which are two very different testing activities with very different goals.
It is interesting to see a statistic on where the flaky tests come from, based on code analysis. I would argue that there is another layer: I tend to see more flaky tests from juniors or people less experienced with unit testing, as they tend to write the larger type of tests. And the discussion can go on and on.
I used to think about this a lot when I was doing TDD and was in charge of a relatively small team. My thinking was that if only I could teach people how to write proper code and proper tests (that's another dimension you can make a statistic on - usually if the code under test is large then the test is large as well, and that tends to lead to more "flakiness"), then we could avoid this altogether.
However, in recent years, being in charge of larger teams, I learned that flaky tests tend to be a fact of life. I've not given up hope that we can eliminate them, but I realized that we also need a way to live with them. Training and getting better takes time (and by the time you train those guys, they leave and other fresh colleagues come in and start making the same mistakes). My advice for people in similar positions would be:
1. Isolate the effects of flaky tests on your builds by re-running failed tests a few more times at the end (a minimal sketch of this rerun step follows after this list). This is different from just marking the flaky tests as ignored in two ways. First, it is automatically maintained: the process just picks up whatever test failed and runs it again. And second, even flaky tests can catch real problems, in which case they will fail 100% of the time and point to an issue that might have otherwise gone unnoticed.
2. Have a couple of processes around these tests: one for getting better and not writing them in the first place, and one for dealing with the ones already in your codebase - these tests are trying to tell you something.
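For item 1, here is a minimal sketch of the rerun step under stated assumptions: the failing tests are plain Python callables and the retry count is arbitrary; a real setup would hook into the test runner instead (for example via a rerun-failures plugin).

```python
# Minimal sketch of item 1: re-run each failing test a few times and separate
# consistent failures (likely real regressions) from intermittent ones.
# The test callables and names below are hypothetical.
import random

def run_quietly(test_fn):
    try:
        test_fn()
        return True
    except Exception:
        return False

def rerun_failures(failed_tests, retries=3):
    consistent, intermittent = [], []
    for name, test_fn in failed_tests:
        # A test that passes at least once on retry is reported as flaky,
        # not silently ignored.
        if any(run_quietly(test_fn) for _ in range(retries)):
            intermittent.append(name)
        else:
            consistent.append(name)
    return consistent, intermittent

flaky_example = lambda: None if random.random() < 0.5 else 1 / 0  # fails about half the time
always_broken = lambda: 1 / 0                                     # fails every time
print(rerun_failures([("flaky_example", flaky_example),
                      ("always_broken", always_broken)]))
```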
You're getting at issues with scaling - specifically the development team. Training and education is certainly important. We put a large emphasis on this and new hires go through training on testing. They also learn from senior engineers who hopefully already have the right practices.
Your item #1 should be a short-term solution for any one test. Tests can be flaky due to production issues and isolating/ignoring them for any length of time means possibly missing a bug.
Item #2 is the long-term fix. Avoid writing new flaky tests, and get better at fixing the ones you have. You'll still need to follow item #1 but don't treat that as the only choice. Many people do.
"We put a large emphasis on this and new hires go through training on testing."
I would love to know your curriculum.
Great post! I'm curious though, which tools you guys find to be the least flaky?
Jeff,
Thanks for sharing detailed information.
Can you please let us know how you are executing all 4.2 million tests?
Are you keeping the entire test suite in a CI system like Jenkins? Can you tell me what the best approach is here?
I have a similar requirement from one of my clients, but I don't have any pointers. So please let me know how these test suites are executed; it would be great if you could share any pointers.
Thanks,
Uday
This is great data, I'm curious how you have been acting on it? Was an effort made to break the biggest tests up into smaller ones? If so did that reduce the flakiness?
Great post, thanks. Would you say that the majority of the WebDriver tests are against web apps? What do you see from a mobile app flakiness perspective? I know that mobile has bigger challenges when it comes to testing, especially when executing across devices, OS versions, etc. I've just written a blog post about it to try to address one of the test optimization pains I see in the market lately - I would appreciate your comments on this blog or other ideas around it. https://mobiletestingblog.com/2017/05/30/optimizing-android-test-automation-development/
Looking at the plots, I have the following questions:
- The plot line of "likelihood of being flaky" versus "binary size" looks like a pearl necklace. Why? I would expect a more irregular distribution around the linear correlation line. It looks like the different tests are not independent of each other.
- It's surprising to find a (more or less) linear correlation, as large systems become chaotic. Why? For chaotic systems one would expect an exponential dependency and not a linear one.
- There are some large test setups that perform significantly better than others. Why? What do they do better than the others?
re: pearl necklace - They aren't always independent. Some tests are similar - they test the same system, or are using much of the same framework. This may also cause them to have similar binary sizes and flakiness rates. I don't know if that's the entire explanation, but it's some part of it.
re: linear correlation - I'm not sure a linear correlation is correct, but it is the simplest way to view this. Exponential had slightly better R² values in some cases, but not enough for me to say it's right (see the sketch after these replies for the kind of comparison). It seems like without knowing the true mechanism, it's hard to make a claim one way or another.
re: Some better/worse - Aside from android, most of the differences look fairly small when you factor in relative sizes. My belief is that much of the flakiness in these tests comes from absolute timing (you expect something to complete in XX time and it doesn't) or relative timing (one thread occasionally executes faster than another). To some extent, it's hard for the test framework / setup to deal with all cases where that can occur.
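To illustrate the kind of fit comparison mentioned in the reply above, here is a minimal sketch on synthetic bucketed data (the numbers are placeholders, not the blog's data): fit a linear and an exponential model and compare their R² on the original scale.

```python
# Minimal sketch: compare linear vs. exponential fits of flaky rate against
# binary size by R^2. The data points below are synthetic placeholders.
import numpy as np

size = np.array([1, 5, 20, 50, 100, 200, 400, 800], dtype=float)   # MB per bucket
rate = np.array([0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.20, 0.35])  # flaky rate per bucket

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Linear model: rate ~ a * size + b
a, b = np.polyfit(size, rate, 1)
r2_linear = r_squared(rate, a * size + b)

# Exponential model: rate ~ exp(c * size + d), fit as a line in log space
c, d = np.polyfit(size, np.log(rate), 1)
r2_exp = r_squared(rate, np.exp(c * size + d))

print(f"linear R^2 = {r2_linear:.3f}, exponential R^2 = {r2_exp:.3f}")
```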
Thank you! Do you have metrics on what counts as a long test? Is 30 seconds or 50 seconds long?