A very nice way to reduce time and get maximum results while delivering a quality product. I do have a question: if some tests fail in stage 2 (presubmit), but they fail for some reason other than a product defect, do you still roll back the changes? And what happens if tests pass in stage 2 but fail in stage 3?
Hi, good questions. For your first question, there is a threshold on the number of failing tests for rolling back automatically. If only one test fails, it is up to the team in question to roll back manually and evaluate whether the failure was "real".
For your second question: if the tests pass in stage 2 (presubmit) but truly fail in stage 3 (continuous build), something is wrong. This may be an issue of flakiness, or of the test not being hermetic. Fortunately this is quite rare.
That does not seem so rare when we are talking about regression bugs.
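To make the threshold rule described in the reply above concrete, here is a minimal sketch of a count-based rollback decision. The function name, the threshold value, and the action labels are hypothetical illustrations, not the actual internal implementation.

```python
# Illustrative sketch only: a count-based rollback policy like the one
# described in the reply above. All names and the threshold value are
# hypothetical, not the actual internal implementation.

AUTO_ROLLBACK_THRESHOLD = 3  # assumed value, for illustration only


def decide_rollback(failing_tests: list[str]) -> str:
    """Return the action to take for a change, given its failing tests."""
    if len(failing_tests) >= AUTO_ROLLBACK_THRESHOLD:
        # Many failures: roll the change back automatically.
        return "auto_rollback"
    if failing_tests:
        # Few failures (e.g. a single test): leave it to the owning team to
        # judge whether the failure is "real" and roll back manually.
        return "manual_review"
    return "keep"


if __name__ == "__main__":
    print(decide_rollback(["//foo:bar_test"]))        # manual_review
    print(decide_rollback(["t1", "t2", "t3", "t4"]))  # auto_rollback
```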
This is interesting research. I wonder whether your team has any intention of productionizing such a system for teams/products outside of Google to use?
Thanks! We have had very preliminary conversations about this. The difficulty would be in standardizing the data to the point that the product could be applicable to different external products. Internally we have the advantage of uniform data and relatively streamlined data pipelines.
The company Appsurify (appsurify.com) has come out with a commercial product that does something very similar, using machine learning to find and run the relevant tests on every commit, and analyzing the test results to separate out flaky failures. It's integrated with most common development environments. Microsoft offers something similar called Test Impact Analysis as part of Azure DevOps, but it is only for .NET and C# files.
It really sounds great and would drastically reduce the testing effort. Any plans to make the data and approach open to the community?
Thank you.
Thanks! The data itself will most likely not become public but the approach could potentially apply elsewhere. As written above, we have had very preliminary conversations about helping people outside of Google do this.
Very nice article, Peter. The dimensions you have combined are really interesting: code-to-test distance and failure history. But the greatest challenge here is the labelling of data, and an even bigger one would be dynamically updating it for every run. Picking up failure history may be simpler, as you may already have it in some form or other, such as in test management tools, but how do you manage to get the code-to-test distance unless you have predefined traceability between tests and code modules? Having a threshold for failure probability is nice, but do you not feel that the ML problem you have framed is drifting more towards a rule-based approach?
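As a rough illustration of how the two signals mentioned in this comment could feed a thresholded failure-probability score, here is a hypothetical sketch; the logistic form, feature weights, and threshold are invented for illustration and are not the system described in the post.

```python
# Hypothetical sketch of scoring tests by failure probability from two
# signals: code-to-test distance (e.g. hops in the build dependency graph
# between the changed target and the test) and recent failure history.
# Weights and threshold are made-up illustrative values.
import math


def failure_probability(distance: int, recent_failure_rate: float) -> float:
    """Logistic combination of the two features; purely illustrative."""
    score = 1.5 - 0.8 * distance + 3.0 * recent_failure_rate
    return 1.0 / (1.0 + math.exp(-score))


def select_tests(candidates, threshold=0.2):
    """Keep only tests whose estimated failure probability clears the threshold."""
    return [
        name
        for name, distance, fail_rate in candidates
        if failure_probability(distance, fail_rate) >= threshold
    ]


if __name__ == "__main__":
    candidates = [
        ("close_flaky_test", 1, 0.30),  # near the change, fails often
        ("far_stable_test", 6, 0.01),   # far from the change, rarely fails
    ]
    print(select_tests(candidates))     # ['close_flaky_test']
```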
Sounds like a very efficient and interesting solution. I am wondering about the following points, though:
1) Does the system require a lot of maintenance?
2) Do you have numbers regarding how much more efficient the system has become? Maybe a comparison to "the old" system, taking into account the amount of effort that was used to setup/maintain this system?
Sounds good. But how do you link your code and its tests? I think this is the key point, since an underlying code change may have an effect on countless tests.
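One common way to establish that link, assuming the build system records dependencies between targets (as Bazel-style builds do), is to walk the reverse dependency graph from the changed code to the tests that transitively depend on it. The graph and target names below are hard-coded purely for illustration:

```python
# Illustrative sketch of linking changed code to affected tests through a
# build dependency graph. The graph here is hard-coded; in practice it
# would come from the build system's dependency metadata.
from collections import deque

# Maps each target to the targets that depend on it (reverse dependencies).
REVERSE_DEPS = {
    "//lib:parser": ["//lib:compiler", "//lib:parser_test"],
    "//lib:compiler": ["//app:frontend", "//lib:compiler_test"],
    "//app:frontend": ["//app:frontend_test"],
}


def affected_tests(changed_target: str) -> set[str]:
    """BFS over reverse deps, collecting every test target reachable from the change."""
    seen, queue, tests = {changed_target}, deque([changed_target]), set()
    while queue:
        target = queue.popleft()
        if target.endswith("_test"):
            tests.add(target)
        for dependant in REVERSE_DEPS.get(target, []):
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return tests


if __name__ == "__main__":
    print(sorted(affected_tests("//lib:parser")))
    # ['//app:frontend_test', '//lib:compiler_test', '//lib:parser_test']
```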
Nice work! This effort seems similar to the test minimization effort that my team and I did with the Windows USB Team and Device Compatibility Lab to expedite testing Windows against USB devices. We used structural and temporal patterns in USB stack traffic to identify similar devices (tests) and expedite testing cycles. See the "Embrace Dynamic Artifacts" chapter in "Perspectives on Data Science for Software Engineering" https://www.elsevier.com/books/perspectives-on-data-science-for-software-engineering/menzies/978-0-12-804206-9 for more info and pointers.
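As a toy illustration of that kind of similarity-based test minimization, the sketch below groups tests by simplified feature vectors standing in for structural/temporal traffic patterns and keeps one representative per group; all names, values, and the grouping rule are invented.

```python
# Toy sketch of test minimization: group similar tests and run one
# representative per group. The feature vectors stand in for the
# structural/temporal traffic patterns mentioned above.
from collections import defaultdict


def representatives(test_features: dict[str, tuple], precision: int = 1):
    """Group tests whose rounded feature vectors match; keep one per group."""
    groups = defaultdict(list)
    for name, features in test_features.items():
        key = tuple(round(value, precision) for value in features)
        groups[key].append(name)
    return [sorted(names)[0] for names in groups.values()]


if __name__ == "__main__":
    features = {
        "device_a": (0.91, 0.12),
        "device_b": (0.93, 0.08),  # behaves like device_a -> same group
        "device_c": (0.10, 0.88),  # distinct behaviour -> its own group
    }
    print(sorted(representatives(features)))  # ['device_a', 'device_c']
```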