Google Testing Blog: Fixing a Test Hourglass

Fixing a Test Hourglass

Monday, November 09, 2020

Automated tests make it safer and faster to create new features, fix bugs, and refactor code. When planning the automated tests, we envision a pyramid with a strong foundation of small unit tests, some well designed integration tests, and a few large end-to-end tests. From Just Say No to More End-to-End Tests, tests should be fast, reliable, and specific; end-to-end tests, however, are often slow, unreliable, and difficult to debug.

As software projects grow, often the shape of our test distribution becomes undesirable, either top heavy (no unit or medium integration tests), or like an hourglass.

The hourglass test distribution has a large set of unit tests, a large set of end-to-end tests, and few or no medium integration tests.

To transform the hourglass back into a pyramid — so that you can test the integration of components in a reliable, sustainable way — you need to figure out how to architect the system under test and test infrastructure and make system testability improvements and test-code improvements.

I worked on a project with a web UI, a server, and many backends. There were unit tests at all levels with good coverage and a quickly increasing set of end-to-end tests.

The end-to-end tests found issues that the unit tests missed, but they ran slowly, and environmental issues caused spurious failures, including test data corruption. In addition, some functional areas were difficult to test because they covered more than the unit but required state within the system that was hard to set up.

We eventually found a good test architecture for faster, more reliable integration tests, but with some missteps along the way.

An example UI-level end-to-end test, written in protractor, looked something like this:

describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept')
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});

This test logs on as a user, sees the terms of service dialog that the user needs to accept, accepts it, then logs off and logs back on to ensure the user is not prompted again.

This terms of service test was a challenge to run reliably, because once an agreement was accepted, the backend server had no RPC method to reverse the operation and “un-accept” the TOS. We could create a new user with each test, but that was time consuming and hard to clean up.

The first attempt to make the terms of service feature testable without end-to-end testing was to hook the server RPC method and set the expectations within the test. The hook intercepts the RPC call and provides expected results instead of calling the backend API.

This approach worked. The test interacted with the backend RPC without really calling it, but it cluttered the test with extra logic.

describe('Terms of service are handled', () => {
  it('accepts terms of service', async () => {
    const user = getUser('someUser');
    await hook('TermsOfService.Get()', true);
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept')
    await logoff();
    await hook('TermsOfService.Get()', false);
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});

The test met the goal of testing the integration of the web UI and server, but it was unreliable. As the system scaled under load, there were several server processes and no guarantee that the UI would access the same server for all RPC calls, so the hook might be set in one server process and the UI accessed in another.

The hook also wasn't at a natural system boundary, which made it require more maintenance as the system evolved and code was refactored.

The next design of the test architecture was to fake the backend that eventually processes the terms of service call.

The fake implementation can be quite simple:

public class FakeTermsOfService implements TermsOfService.Service {
  private static final Map<String, Boolean> accepted = new ConcurrentHashMap<>();

  @Override
  public TosGetResponse get(TosGetRequest req) {
    return accepted.getOrDefault(req.UserID(), Boolean.FALSE);
  }

  @Override
  public void accept(TosAcceptRequest req) {
    accepted.put(req.UserID(), Boolean.TRUE);
  }
}

describe('Terms of service are handled', () => {

  it('accepts terms of service', async () => {
    const user = getUser('termsNotAccepted');
    await login(user);
    await see(termsOfServiceDialog());
    await click('Accept')
    await logoff();
    await login(user);
    await not.see(termsOfServiceDialog());
  });
});

Because the fake stores the accepted state in memory, there is no need to reset the state for the next test iteration; it is enough just to restart the fake server.

This worked but was problematic when there was a mix of fake and real backends. This was because there was state between the real backends that was now out of sync with the fake backend.

Our final, successful integration test architecture was to provide fake implementations for all except one of the backends, all sharing the same in-memory state. One real backend was included in the system under test because it was tightly coupled with the Web UI. Its dependencies were all wired to fake backends. These are integration tests over the entire system under test, but they remove the backend dependencies. These tests expand the medium size tests in the test hourglass, allowing us to have fewer end-to-end tests with real backends.

Note that these integration tests are not only the option. For logic in the Web UI, we can write page level unit tests, which allow the tests to run faster and more reliably. For the terms of service feature, however, we want to test the Web UI and server logic together, so integration tests are a good solution.

This resulted in UI tests that ran, unmodified, on both the real and fake backend systems.

When run with fake backends the tests were faster and more reliable. This made it easier to add test scenarios that would have been more challenging to set up with the real backends. We also deleted end-to-end tests that were well duplicated by the integration tests, resulting in more integration tests than end-to-end tests.

By iterating, we arrived at a sustainable test architecture for the integration tests.

If you're facing a test hourglass the test architecture to devise medium tests may not be obvious. I'd recommend experimenting, dividing the system on well defined interfaces, and making sure the new tests are providing value by running faster and more reliably or by unlocking hard to test areas.

References

Just Say No to More End-to-End Tests, Mike Wacker, https://testing.googleblog.com/2015/04/just-say-no-to-more-end-to-end-tests.html
Test Pyramid & Antipatterns, Khushi, https://khushiy.com/2019/02/07/test-pyramid-antipatterns/
Testing on the Toilet: Fake Your Way to Better Tests, Jonathan Rockway and Andrew Trenk, https://testing.googleblog.com/2013/06/testing-on-toilet-fake-your-way-to.html
Testing on the Toilet: Know Your Test Doubles, Andrew Trenk, https://testing.googleblog.com/2013/07/testing-on-toilet-know-your-test-doubles.html
Hermetic Servers, Chaitali Narla and Diego Salas, https://testing.googleblog.com/2012/10/hermetic-servers.html
Software Engineering at Google, Titus Winters, Tom Manshreck, Hyrum Wright, https://www.oreilly.com/library/view/software-engineering-at/9781492082781/

4 comments :

brabenetzNovember 9, 2020 at 11:23:00 PM PST
Sorry,
but I stopped reading after the first paragraph!
If your end-to-end tests, are often slow, unreliable, and difficult to debug, then just fix the f****g Problem (Hint: it's most likely not the e2e tests).

OK, split it up:

- slow: There could be two reason for that: Either the test-code is slow, or the production code is slow. In either way it's a bug: Just fix it.
Also parallelize the test gets a big performance boost, but for that, your application needs a correlationId (or something similar like orderId, requestId etc..).
Parallelizing the tests has also the benefit that the tests points directly to mutli-threading bugs (Sure it's annoying if tests shows you that your code crap).

- unreliable: And again: Just fix it, because you don't know if it is a real problem of the production-code or 'only' the test-code (It is easy to say: the "test" is unreliable ).

- difficult to debug: How will you analyze Production-Bugs? You (normally) cannot "debug" the Production system.
End to end tests prepares you for the real world and you have the change to prepare the code for this problems!

On the other hand there are benefits:

- black-box testing: As long as your application-API and DB doesn't change, you can refactor the code without touching the tests.
This is not the case for Unit-Tests. For Unit-Test, even moving a function from one class to another, often means that you can throw away your test and write new one.
Too much Unit-Tests is often the death of an application, because it prevents important refactorings.

- Testing the "orchestration" BETWEEN all libraries, frameworks, (micro-)Services, infrastructure Many end-user-application doesn't need many custom code where unit-test would make sense.

- Documentation: Yes, tests can be written that also NON-Developers understand the Scenarios. Look into Cucumber-Tests and Serenity-Reports.

Important: End-2-End tests can have different scopes: eg. testing a singe (Micro-)service or a whole system composed by many services.
This must be defined per project and communicated to the developers (less well defined Scopes is more).

Do not understand me wrong, Unit-Testing is important in many situations. Especially for Libraries and Frameworks.
But out in the real world there are also End-User Applications.
On this Applications also the "orchestration" BETWEEN all libraries, frameworks, (micro-)Services, infrastructure must be tested.
Such application has not much code where Unit-Tests makes even sense.
Conclusion: An upside down "Test-Pyramide" makes more sense in that cases.

At the end you must first master all variants of tests (unit, integration, e2e). Only after that you can decide which test serves you the most.
ReplyDelete
Replies
minhhoangNovember 10, 2020 at 5:14:00 PM PST
I have read the article and I have a little concern. As far as I know about Cypress, it allows users to return the stub a response request by intercepting in the browser requests. So can I know the reasons you don't use that features instead of implement the fake backends.
ReplyDelete
Replies
UnknownDecember 1, 2020 at 7:39:00 AM PST
I think you missed the point of this post.
"End-to-end tests, however, are often slow, unreliable, and difficult to debug."
This is well known and well documented. E2E tests are the last automated tests that get ran in a typical CI/CD pipeline. They are high dependency tests (no this is not a f****g problem, this is the nature of these tests) as they require the test environment to be setup with the latest version under test.
You: "slow: There could be two reason for that: Either the test-code is slow, or the production code is slow. In either way it's a bug: Just fix it."
Wrong - E2E tests are the slowest for feeding back issues to the development team. Not because of problems with the test or production code. Because in order to run an E2E test a feature/fix has to first of all have been: Developed > Unit tested > Code reviewed > Merged > Built > Test environment teared down > Deployed > Environment setup > Run E2E tests.
Simplified steps, but they summarise the typical stages in getting a feature/fix into an environment that is ready for your E2E tests to run.
E2E tests simulate the paths a user take through a system, logging in, navigation, CRUD actions etc...
We need confidence that the critical paths through a system work, so we automate them, but because we are now simulating a users actions, we can't take programmatic shortcuts and need to strictly model the behaviour of the user, this is another reason the tests take longer to run.
Another example - you just finish deploying a change to run your E2E tests and the developer realises their last commit never went it. In order to run the tests you now need the latest change and have to go through all the stages in your pipeline again to run the E2E tests.... slooooowww.
You: "unreliable: And again: Just fix it, because you don't know if it is a real problem of the production-code or 'only' the test-code (It is easy to say: the "test" is unreliable )"
E2E are heavily dependant on the many things, and therefore have multiple potential points of failure. The deployment server may not have started properly, somebody else may have been testing something on the environment when your tests were running, the deployment script may have been setup wrong. All of these could cause E2E tests to fail whereby the failure was due to neither a problem with the test or production code. UI automated tests are notoriously flaky, your test script tries to click a button and the page hadn't loaded in time...... unreliable!
You: "difficult to debug: How will you analyze Production-Bugs? You (normally) cannot "debug" the Production system"
Not sure if you disagree of agree here. E2E tests are difficult to debug - your system is a black box to your E2E test code, so all test failures have to be investigated and determined if the reason for the failure was due to dependency issues, problems in test code, problems in production code, flaky tests.
You: "End to end tests prepares you for the real world and you have the change to prepare the code for this problems!"
Agree! E2E are extremely valuable and provide a great confidence measure that the latest code changes aren't going to break anything. But the point of this post is to discuss how you can reduce an over reliance on E2E tests, not get rid of them completely. E2E tests ARE slow, unreliable, difficult to debug etc....
Give me 10 E2E tests over a 1000 E2E tests any day of the week!
If you don't have any unit or integration tests, then your regression checks responsibility sits with either E2E tests or manual testing. Necessary if you don't have any unit/integration tests, but not a position you really want to be in - aka upside down pyramid or ice cream cone.
I would suggest reading the entire blog post (if you actually did stop at the first paragraph) it’s an excellent piece on how a team has approached migrating some of their tests further down the pyramid into the integration test layer.
ReplyDelete
Replies
oleksandr.podoliakoFebruary 8, 2021 at 1:18:00 PM PST
Thank you for an interesting article. Faking backend is a great idea. I am sure that someday it will be a common approach in automation testing.

Can you give more technical details? Did you have a separate environment where fake backend was deployed? How switching from real to fake backend was implemented? With different branches or was set up with build parameters?
ReplyDelete
Replies

Add comment

New comments are not allowed.

Testing Blog

Fixing a Test Hourglass

4 comments :

Labels

Archive

Feed