What Test Engineers do at Google: Building Test Infrastructure
Friday, November 18, 2016
Author: Jochen Wuttke
In a recent post, we broadly talked about What Test Engineers do at Google. In this post, I talk about one aspect of the work TEs may do: building and improving test infrastructure to make engineers more productive.
Refurbishing legacy systems makes new tools necessary
A few years ago, I joined an engineering team that was working on replacing a legacy system with a new implementation. Because building the replacement would take several years, we had to keep the legacy system operational and even add features, while building the replacement so there would be no impact on our external users.
The legacy system was so complex and brittle that the engineers spent most of their time triaging and fixing bugs and flaky tests, but had little time to implement new features. The goal for the rewrite was to learn from the legacy system and to build something that was easier to maintain and extend. As the team's TE, my job was to understand what caused the high maintenance cost and how to improve on it. I found two main causes:
- Tight coupling and insufficient abstraction made unit testing very hard, and as a consequence, a lot of end-to-end tests served as functional tests of that code.
- The infrastructure used for the end-to-end tests had no good way to create and inject fakes or mocks for these services. As a result, the tests had to run the large number of servers for all these external dependencies. This led to very large and brittle tests that our existing test execution infrastructure was not able to handle reliably.
At first, I explored if I could split the large tests into smaller ones that would test specific functionality and depend on fewer external services. This proved impossible, because of the poorly structured legacy code. Making this approach work would have required refactoring the entire system and its dependencies, not just the parts my team owned.
In my second approach, I also focused on large tests and tried to mock services that were not required for the functionality under test. This also proved very difficult, because dependencies changed often and individual dependencies were hard to trace in a graph of over 200 services. Ultimately, this approach just shifted the required effort from maintaining test code to maintaining test dependencies and mocks.
My third and final approach, illustrated in the figure below, made small tests more powerful. In the typical end-to-end test we faced, the client made RPC calls to several services, which in turn made RPC calls to other services. Together the client and the transitive closure over all backend services formed a large graph (not tree!) of dependencies, which all had to be up and running for the end-to-end test. The new model changes how we test client and service integration. Instead of running the client on inputs that will somehow trigger RPC calls, we write unit tests for the code making method calls to the RPC stub. The stub itself is mocked with a common mocking framework like Mockito in Java. For each such test, a second test verifies that the data used to drive that mock "makes sense" to the actual service. This is also done with a unit test, where a replay client uses the same data the RPC mock uses to call the RPC handler method of the service.
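To make the client side of this concrete, here is a minimal sketch of what such a unit test could look like, using Mockito as mentioned above. All of the names (AccountServiceStub, AccountClient, GetAccountRequest, GetAccountResponse) are hypothetical stand-ins for illustration, not the team's actual APIs:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.junit.Test;

public class AccountClientTest {
  // Hypothetical request/response messages; a real system would generate
  // these from the service definition.
  private final GetAccountRequest request =
      GetAccountRequest.newBuilder().setAccountId("123").build();
  private final GetAccountResponse cannedResponse =
      GetAccountResponse.newBuilder().setDisplayName("Ada").build();

  @Test
  public void displaysAccountNameFromRpcResponse() {
    // Mock the RPC stub instead of standing up the real backend service.
    AccountServiceStub stub = mock(AccountServiceStub.class);
    when(stub.getAccount(request)).thenReturn(cannedResponse);

    AccountClient client = new AccountClient(stub);

    assertEquals("Ada", client.displayNameFor("123"));

    // The (request, cannedResponse) pair is exactly the data the companion
    // verification test replays against the real service handler.
  }
}
```

No backend runs at all; the only thing this test pins down is the request/response pair the client code depends on.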
This pattern of integration testing applies to any RPC call, so the RPC calls made by a backend server to another backend can be tested just as well as front-end client calls. When we apply this approach consistently, we benefit from smaller tests that still test correct integration behavior, and make sure that the behavior we are testing is "real".
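The companion verification test could then look roughly like the sketch below, again with hypothetical names (RecordedRpcs, AccountServiceImpl, InMemoryAccountStore). It replays the stored request directly against the real RPC handler method, which is what keeps the data driving the client-side mock honest:

```java
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class AccountServiceReplayTest {
  @Test
  public void recordedRequestStillMatchesTheRealService() {
    // Load the same (request, response) pair the client-side test used to
    // drive its mock. RecordedRpcs is a hypothetical storage helper.
    RecordedRpc recorded =
        RecordedRpcs.load("AccountService.GetAccount", "displaysAccountNameFromRpcResponse");

    // Call the real RPC handler method directly, as a unit test, with only
    // lightweight fakes for the handler's own dependencies.
    AccountServiceImpl service = new AccountServiceImpl(new InMemoryAccountStore());
    GetAccountResponse actual = service.getAccount(recorded.request());

    // If the service has drifted, this test fails and the stored data
    // (and therefore the client-side mock) has to be updated.
    assertEquals(recorded.response(), actual);
  }
}
```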
To arrive at this solution, I had to build, evaluate, and discard several prototypes. While it took a day to build a proof-of-concept for this approach, it took me and another engineer a year to implement a finished tool developers could use.
Adoption
The engineers embraced the new solution very quickly when they saw that the new framework removed large amounts of boilerplate code from their tests. To further drive its adoption, I organized multi-day events with the engineering team where we focused on migrating test cases. It took a few months to migrate all existing unit tests to the new framework, close gaps in coverage, and create the new tests that validate the mocks. Once we had converted about 80% of the tests, we started comparing the efficacy of the new tests and the existing end-to-end tests.
The results are very good:
- The new tests are as effective in finding bugs as the end-to-end tests are.
- The new tests run in about 3 minutes instead of 30 minutes for the end-to-end tests.
- The client-side tests are 0% flaky. The verification tests are usually less flaky than the end-to-end tests, and never more so.
Building and improving test infrastructure to help engineers be more productive is one of the many things test engineers do at Google. Running this project from requirements gathering all the way to a finished product gave me the opportunity to design and implement several prototypes, drive the full implementation of one solution, lead engineering teams to adoption of the new framework, and integrate feedback from engineers and actual measurements into the continuous refinement of the tool.
How do you mark a test failure as flaky? Do you have an automated/intelligent system that flags a test run failure as flaky or do you do it manually?
Yes, pacts were a strong influence on what we did.
However, we never went quite as far as they did, and cut out some of the stuff that makes pacts very powerful in theory, but hard to write in practice. Most importantly, instead of writing the contracts in code, we simply store the exchanged data as protocol buffers (https://developers.google.com/protocol-buffers/). That has the advantage of being far simpler, but also restricts what contracts can do, since you have a "passive" contract, instead of code that gets executed.
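As a rough illustration of such a "passive" contract, the recorded exchange might be persisted as a serialized protocol buffer rather than as executable contract code. The RecordedExchange message and the helper below are hypothetical; the post does not describe the actual storage format:

```java
import com.google.protobuf.ByteString;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public final class ContractWriter {
  /**
   * Stores one recorded RPC exchange as serialized protocol buffer bytes.
   * RecordedExchange is a hypothetical message with a string `method` field
   * and bytes `request`/`response` fields; it stands in for whatever schema
   * the real tool uses.
   */
  public static void store(String path, String method,
                           GetAccountRequest request, GetAccountResponse response)
      throws IOException {
    RecordedExchange exchange = RecordedExchange.newBuilder()
        .setMethod(method)
        .setRequest(ByteString.copyFrom(request.toByteArray()))
        .setResponse(ByteString.copyFrom(response.toByteArray()))
        .build();
    try (OutputStream out = new FileOutputStream(path)) {
      exchange.writeTo(out);  // the stored bytes are the "passive" contract
    }
  }
}
```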
This is really interesting, thanks for sharing. Will you be open sourcing your tool?
Understood that integration testing is now carried out as part of unit testing. Just wondering, is functional testing also covered as part of unit tests? Wouldn't functional testing require some of the E2E tests to be retained?
Yes, functional and system testing do require some E2E tests to be retained. But those tests do not have to run during the developer cycle, whereas basic integration tests are quite important in SOA systems that change rapidly.
Any chance this will be open sourced? We would love to contribute!
This is an interesting approach. We are in nearly the same situation, but at the beginning. Could you share how you solved connections to databases (and other services with different protocols)? Do you start up DBs for tests, or do you mock them as well?
Another thing we need to cope with is the order of returned data. Some of our methods are allowed to return items in an array in random order (this random order originates from the DB without an order specification).
Did you see such problems?
So, generally speaking, at Google DBs are also services that "speak" protocol buffer. But for most tests and languages, we also have very lightweight in-memory implementations that are more convenient to use where a DB is needed.
For sorted/unsorted data in arrays, that's a common problem. In the final version we opted for always treating arrays as unsorted, so our matching algorithm just checks whether each element (and its duplicates) occurs, but not in which position. In a previous version we tried to add a markup language to the stored data to modify the way things are matched, but in this particular case it turned out that plain unordered lists work well practically always.
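The matching rule described here, where every element (including duplicates) must occur somewhere but position does not matter, is essentially a multiset comparison. A minimal sketch of that idea, not the actual matcher used by the tool:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class UnorderedMatcher {
  /**
   * Returns true if both lists contain the same elements with the same
   * multiplicities, ignoring order.
   */
  public static <T> boolean matchesIgnoringOrder(List<T> expected, List<T> actual) {
    if (expected.size() != actual.size()) {
      return false;
    }
    Map<T, Integer> remaining = new HashMap<>();
    for (T e : expected) {
      remaining.merge(e, 1, Integer::sum);  // count expected occurrences
    }
    for (T a : actual) {
      Integer count = remaining.get(a);
      if (count == null) {
        return false;                       // element was not expected at all
      }
      if (count == 1) {
        remaining.remove(a);                // last expected occurrence consumed
      } else {
        remaining.put(a, count - 1);
      }
    }
    return remaining.isEmpty();             // every expected occurrence matched
  }
}
```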
There have been several questions about open-sourcing the implementation of the library we built.
There are currently no plans to do that. The two main reasons are:
* A lot of what we did in the implementation is Google specific. Once we split off the parts that make sense in open source, there wouldn't be much left.
* There are very good implementations of these principles out there that work well with common languages and OS stacks. For example https://docs.pact.io/.
Hi, thank you for this post. I agree with you that tight coupling and insufficient abstraction made unit testing very hard and, as a consequence, a lot of end-to-end tests served as functional tests of that code. Very useful information.
This is really interesting, thanks for sharing. Automation engineers design, program, simulate, and test automated machinery and processes in order to complete exact tasks. They typically are employed in industries such as car manufacturing or food processing plants, where robots or machines are used to perform specific functions. I agree that a test engineer is required to fully test the product or system to ensure it functions properly and meets the business needs. Worth reading.
Thanks for sharing. If an integration test needs to prepare some data, for example to get an account by id it may first need to create one, the question is: how do you handle that preparation during replay?