A very nice way to reduce time and get maximum results while delivering a quality product. I do have a question: if some tests fail in stage 2 (presubmit), but they fail for some reason other than a product defect, do you still roll back the changes? And what happens if tests pass in stage 2 but fail in stage 3?
Hi, good questions. For your first question, there is a threshold on the number of failing tests for rolling back automatically. If only one test fails, it is up to the team in question to roll back manually and evaluate whether the failure was "real".
For your second question: if the tests pass in stage 2 (presubmit) but truly fail in stage 3 (continuous build), something is wrong. This may be an issue of flakiness, or of the test not being hermetic. Fortunately this is quite rare.
That does not seem so rare when we are talking about regression bugs.
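To make the threshold rule described in the reply above concrete, here is a minimal sketch of a count-based rollback decision. The function name, the threshold value, and the action labels are hypothetical illustrations, not the actual internal implementation.

```python
# Illustrative sketch only: a count-based rollback policy like the one
# described in the reply above. All names and the threshold value are
# hypothetical, not the actual internal implementation.

AUTO_ROLLBACK_THRESHOLD = 3  # assumed value, for illustration only


def decide_rollback(failing_tests: list[str]) -> str:
    """Return the action to take for a change, given its failing tests."""
    if len(failing_tests) >= AUTO_ROLLBACK_THRESHOLD:
        # Many failures: roll the change back automatically.
        return "auto_rollback"
    if failing_tests:
        # Few failures (e.g. a single test): leave it to the owning team to
        # judge whether the failure is "real" and roll back manually.
        return "manual_review"
    return "keep"


if __name__ == "__main__":
    print(decide_rollback(["//foo:bar_test"]))        # manual_review
    print(decide_rollback(["t1", "t2", "t3", "t4"]))  # auto_rollback
```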
This is interesting research. I wonder whether your team has any intention of productionizing such a system for teams/products outside of Google to use?
Thanks! We have had very preliminary conversations about this. The difficulty would be in standardizing the data to the point that the product could be applicable to different external products. Internally we have the advantage of uniform data and relatively streamlined data pipelines.
The company Appsurify (appsurify.com) has come out with a commercial product that does something very similar, using machine learning to find and run the relevant tests on every commit, and analyzing the test results to separate out flaky failures. It's integrated with most common development environments. Microsoft offers something similar called Test Impact Analysis as part of Azure DevOps, but it is only for .NET and C# files.
It really sounds great and would drastically reduce the testing effort. Any plans to make the data and approach open to the community?
Thank you.
Thanks! The data itself will most likely not become public but the approach could potentially apply elsewhere. As written above, we have had very preliminary conversations about helping people outside of Google do this.
Very nice article, Peter. The dimensions you have combined are really interesting: code-to-test distance and failure history. But the greatest challenge here is the labelling of data, and an even bigger one would be dynamically updating it for every run. Picking up failure history may be simpler, as you may already have it in some form or other, such as in test management tools, but how do you manage to get the code-to-test distance unless you have predefined traceability between tests and code modules? Having a threshold for failure probability is nice, but do you not feel that the ML problem you have framed is drifting more towards a rule-based approach?
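As a rough illustration of how the two signals mentioned in this comment could feed a thresholded failure-probability score, here is a hypothetical sketch; the logistic form, feature weights, and threshold are invented for illustration and are not the system described in the post.

```python
# Hypothetical sketch of scoring tests by failure probability from two
# signals: code-to-test distance (e.g. hops in the build dependency graph
# between the changed target and the test) and recent failure history.
# Weights and threshold are made-up illustrative values.
import math


def failure_probability(distance: int, recent_failure_rate: float) -> float:
    """Logistic combination of the two features; purely illustrative."""
    score = 1.5 - 0.8 * distance + 3.0 * recent_failure_rate
    return 1.0 / (1.0 + math.exp(-score))


def select_tests(candidates, threshold=0.2):
    """Keep only tests whose estimated failure probability clears the threshold."""
    return [
        name
        for name, distance, fail_rate in candidates
        if failure_probability(distance, fail_rate) >= threshold
    ]


if __name__ == "__main__":
    candidates = [
        ("close_flaky_test", 1, 0.30),  # near the change, fails often
        ("far_stable_test", 6, 0.01),   # far from the change, rarely fails
    ]
    print(select_tests(candidates))     # ['close_flaky_test']
```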
Sounds like a very efficient and interesting solution. I am wondering about the following points, though:
1) Does the system require a lot of maintenance?
2) Do you have numbers regarding how much more efficient the system has become? Maybe a comparison to "the old" system, taking into account the amount of effort that was used to setup/maintain this system?
Sounds good. But how do you link your code and its tests? I think this is the key point, since an underlying code change may have an effect on countless tests.
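One common way to establish that link, assuming the build system records dependencies between targets (as Bazel-style builds do), is to walk the reverse dependency graph from the changed code to the tests that transitively depend on it. The graph and target names below are hard-coded purely for illustration:

```python
# Illustrative sketch of linking changed code to affected tests through a
# build dependency graph. The graph here is hard-coded; in practice it
# would come from the build system's dependency metadata.
from collections import deque

# Maps each target to the targets that depend on it (reverse dependencies).
REVERSE_DEPS = {
    "//lib:parser": ["//lib:compiler", "//lib:parser_test"],
    "//lib:compiler": ["//app:frontend", "//lib:compiler_test"],
    "//app:frontend": ["//app:frontend_test"],
}


def affected_tests(changed_target: str) -> set[str]:
    """BFS over reverse deps, collecting every test target reachable from the change."""
    seen, queue, tests = {changed_target}, deque([changed_target]), set()
    while queue:
        target = queue.popleft()
        if target.endswith("_test"):
            tests.add(target)
        for dependant in REVERSE_DEPS.get(target, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return tests


if __name__ == "__main__":
    print(sorted(affected_tests("//lib:parser")))
    # ['//app:frontend_test', '//lib:compiler_test', '//lib:parser_test']
```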
Nice work! This effort seems similar to the test minimization effort that my team and I did with the Windows USB Team and Device Compatibility Lab to expedite testing Windows against USB devices. We used structural and temporal patterns in USB stack traffic to identify similar devices (tests) and expedite testing cycles. See the "Embrace Dynamic Artifacts" chapter in "Perspectives on Data Science for Software Engineering" https://www.elsevier.com/books/perspectives-on-data-science-for-software-engineering/menzies/978-0-12-804206-9 for more info and pointers.
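As a toy illustration of that kind of similarity-based test minimization, the sketch below groups tests by simplified feature vectors standing in for structural/temporal traffic patterns and keeps one representative per group; all names, values, and the grouping rule are invented.

```python
# Toy sketch of test minimization: group similar tests and run one
# representative per group. The feature vectors stand in for the
# structural/temporal traffic patterns mentioned above.
from collections import defaultdict


def representatives(test_features: dict[str, tuple], precision: int = 1):
    """Group tests whose rounded feature vectors match; keep one per group."""
    groups = defaultdict(list)
    for name, features in test_features.items():
        key = tuple(round(value, precision) for value in features)
        groups[key].append(name)
    return [sorted(names)[0] for names in groups.values()]


if __name__ == "__main__":
    features = {
        "device_a": (0.91, 0.12),
        "device_b": (0.93, 0.08),  # behaves like device_a -> same group
        "device_c": (0.10, 0.88),  # distinct behaviour -> its own group
    }
    print(sorted(representatives(features)))  # ['device_a', 'device_c']
```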