Testing Blog
Just Say No to More End-to-End Tests
Wednesday, April 22, 2015
by Mike Wacker
At some point in your life, you can probably recall a movie that you and your friends all wanted to see, and that you and your friends all regretted watching afterwards. Or maybe you remember that time your team thought they’d found the next "killer feature" for their product, only to see that feature bomb after it was released.
Good ideas often fail in practice, and in the world of testing, one pervasive good idea that often fails in practice is a testing strategy built around end-to-end tests.
Testers can invest their time in writing many types of automated tests, including unit tests, integration tests, and end-to-end tests, but this strategy invests mostly in end-to-end tests that verify the product or service as a whole. Typically, these tests simulate real user scenarios.
End-to-End Tests in Theory
While relying primarily on end-to-end tests is a bad idea, one could certainly convince a reasonable person that the idea makes sense in theory.
To start, number one on Google's list of
ten things we know to be true
is: "Focus on the user and all else will follow." Thus, end-to-end tests that focus on real user scenarios sound like a great idea. Additionally, this strategy broadly appeals to many constituencies:
Developers
like it because it offloads most, if not all, of the testing to others.
Managers and decision-makers
like it because tests that simulate real user scenarios can help them easily determine how a failing test would impact the user.
Testers
like it because they often worry about missing a bug or writing a test that does not verify real-world behavior; writing tests from the user's perspective often avoids both problems and gives the tester a greater sense of accomplishment.
End-to-End Tests in Practice
So if this testing strategy sounds so good in theory, then where does it go wrong in practice? To demonstrate, I present the following composite sketch based on a collection of real experiences familiar to both myself and other testers. In this sketch, a team is building a service for editing documents online (e.g., Google Docs).
Let's assume the team already has some fantastic test infrastructure in place. Every night:
The latest version of the service is built.
This version is then deployed to the team's testing environment.
All end-to-end tests then run against this testing environment.
An email report summarizing the test results is sent to the team.
The deadline is approaching fast as our team codes new features for their next release. To maintain a high bar for product quality, they also require that at least 90% of their end-to-end tests pass before features are considered complete. Currently, that deadline is one day away:
Days Left
Pass %
Notes
1
5%
Everything is broken! Signing in to the service is broken. Almost all tests sign in a user, so almost all tests failed.
0
4%
A partner team we rely on deployed a bad build to their testing environment yesterday.
-1
54%
A dev broke the save scenario yesterday (or the day before?). Half the tests save a document at some point in time. Devs spent most of the day determining if it's a frontend bug or a backend bug.
-2
54%
It's a frontend bug, devs spent half of today figuring out where.
-3
54%
A bad fix was checked in yesterday. The mistake was pretty easy to spot, though, and a correct fix was checked in today.
-4
1%
Hardware failures occurred in the lab for our testing environment.
-5
84%
Many small bugs hiding behind the big bugs (e.g., sign-in broken, save broken). Still working on the small bugs.
-6
87%
We should be above 90%, but are not for some reason.
-7
89.54%
(Rounds up to 90%, close enough.) No fixes were checked in yesterday, so the tests must have been flaky yesterday.
Analysis
Despite numerous problems, the tests ultimately did catch real bugs.
What Went Well
Customer-impacting bugs were identified and fixed before they reached the customer.
What Went Wrong
The team completed their coding milestone a week late (and worked a lot of overtime).
Finding the root cause for a failing end-to-end test is painful and can take a long time.
Partner failures and lab failures ruined the test results on multiple days.
Many smaller bugs were hidden behind bigger bugs.
End-to-end tests were flaky at times.
Developers had to wait until the following day to know if a fix worked or not.
So now that we know what went wrong with the end-to-end strategy, we need to change our approach to testing to avoid many of these problems. But what is the right approach?
The True Value of Tests
Typically, a tester's job ends once they have a failing test. A bug is filed, and then it's the developer's job to fix the bug. To identify where the end-to-end strategy breaks down, however, we need to think outside this box and approach the problem from first principles. If we "focus on the user (and all else will follow)," we have to ask ourselves how a failing test benefits the user. Here is the answer:
A failing test does not directly benefit the user.
While this statement seems shocking at first, it is true. If a product works, it works, whether a test says it works or not. If a product is broken, it is broken, whether a test says it is broken or not. So, if failing tests do not benefit the user, then what does benefit the user?
A bug fix directly benefits the user.
The user will only be happy when that unintended behavior - the bug - goes away. Obviously, to fix a bug, you must know the bug exists. To know the bug exists, ideally you have a test that catches the bug (because the user will find the bug if the test does not). But in that entire process, from failing test to bug fix, value is only added at the very last step.
Stage
Failing Test
Bug Opened
Bug Fixed
Value Added
No
No
Yes
Thus, to evaluate any testing strategy, you cannot just evaluate how it finds bugs. You also must evaluate how it enables developers to fix (and even prevent) bugs.
Building the Right Feedback Loop
Tests create a feedback loop that informs the developer whether the product is working or not. The ideal feedback loop has several properties:
It's fast
. No developer wants to wait hours or days to find out if their change works. Sometimes the change does not work - nobody is perfect - and the feedback loop needs to run multiple times. A faster feedback loop leads to faster fixes. If the loop is fast enough, developers may even run tests before checking in a change.
It's reliable
. No developer wants to spend hours debugging a test, only to find out it was a flaky test. Flaky tests reduce the developer's trust in the test, and as a result flaky tests are often ignored, even when they find real product issues.
It isolates failures
. To fix a bug, developers need to find the specific lines of code causing the bug. When a product contains millions of lines of codes, and the bug could be anywhere, it's like trying to find a needle in a haystack.
Think Smaller, Not Larger
So how do we create that ideal feedback loop? By thinking smaller, not larger.
Unit Tests
Unit tests take a small piece of the product and test that piece in isolation. They tend to create that ideal feedback loop:
Unit tests are fast
. We only need to build a small unit to test it, and the tests also tend to be rather small. In fact, one tenth of a second is considered slow for unit tests.
Unit tests are reliable
. Simple systems and small units in general tend to suffer much less from flakiness. Furthermore, best practices for unit testing - in particular practices related to hermetic tests - will remove flakiness entirely.
Unit tests isolate failures
. Even if a product contains millions of lines of code, if a unit test fails, you only need to search that small unit under test to find the bug.
Writing effective unit tests requires skills in areas such as dependency management, mocking, and hermetic testing. I won't cover these skills here, but as a start, the typical example offered to new Googlers (or Nooglers) is how Google
builds
and
tests
a stopwatch.
Unit Tests vs. End-to-End Tests
With end-to-end tests, you have to wait: first for the entire product to be built, then for it to be deployed, and finally for all end-to-end tests to run. When the tests do run, flaky tests tend to be a fact of life. And even if a test finds a bug, that bug could be anywhere in the product.
Although end-to-end tests do a better job of simulating real user scenarios, this advantage quickly becomes outweighed by all the disadvantages of the end-to-end feedback loop:
Unit
End-toEnd
Fast
Reliable
Isolates Failures
Simulates a Real User
Integration Tests
Unit tests do have one major disadvantage: even if the units work well in isolation, you do not know if they work well together. But even then, you do not necessarily need end-to-end tests. For that, you can use an integration test. An integration test takes a small group of units, often two units, and tests their behavior as a whole, verifying that they coherently work together.
If two units do not integrate properly, why write an end-to-end test when you can write a much smaller, more focused integration test that will detect the same bug? While you do need to think larger, you only need to think a little larger to verify that units work together.
Testing Pyramid
Even with both unit tests and integration tests, you probably still will want a small number of end-to-end tests to verify the system as a whole. To find the right balance between all three test types, the best visual aid to use is the testing pyramid. Here is a simplified version of the
testing pyramid
from the opening keynote of the
2014 Google Test Automation Conference
:
The bulk of your tests are unit tests at the bottom of the pyramid. As you move up the pyramid, your tests gets larger, but at the same time the number of tests (the width of your pyramid) gets smaller.
As a good first guess, Google often suggests a 70/20/10 split: 70% unit tests, 20% integration tests, and 10% end-to-end tests. The exact mix will be different for each team, but in general, it should retain that pyramid shape. Try to avoid these anti-patterns:
Inverted pyramid/ice cream cone
. The team relies primarily on end-to-end tests, using few integration tests and even fewer unit tests.
Hourglass
. The team starts with a lot of unit tests, then uses end-to-end tests where integration tests should be used. The hourglass has many unit tests at the bottom and many end-to-end tests at the top, but few integration tests in the middle.
Just like a regular pyramid tends to be the most stable structure in real life, the testing pyramid also tends to be the most stable testing strategy.
84 comments
Quantum Quality
Wednesday, April 01, 2015
UPDATE
: Hey, this was an April fool's joke but in fact we wished we could
have realized this idea and we are looking forward to the day this has
been worked out and becomes a reality.
by Kevin Graney
Here at Google we have a long history of capitalizing on the latest research and technology to improve the quality of our software. Over our past 16+ years as a company, what started with some humble unit tests has grown into a massive operation. As our software complexity increased, ever larger and more complex tests were dreamed up by our Software Engineers in Test (SETs).
What we have come to realize is that our love of testing is a double-edged sword. On the one hand, large-scale testing keeps us honest and gives us confidence. It ensures our products remain reliable, our users' data is kept safe, and our engineers are able to work productively without fear of breaking things. On the other hand, it's expensive in both engineer and machine time. Our SETs have been working tirelessly to reduce the expense and latency of software tests at Google, while continuing to increase their quality.
Today, we're excited to reveal how Google is tackling this challenge. In collaboration with the
Quantum AI Lab
, SETs at Google have been busy revolutionizing how software is tested. The theory is relatively simple: bits in a traditional computer are either zero or one, but bits in a quantum computer can be both one and zero at the same time. This is known as superposition, and the classic example is
Schrodinger's cat
. Through some clever math and cutting edge electrical engineering, researchers at Google have figured out how to utilize superposition to vastly improve the quality of our software testing and the speed at which our tests run.
Figure 1
Some qubits inside a Google quantum device.
With superposition, tests at Google are now able to simultaneously model every possible state of the application under test. The state of the application can be thought of as an
n
bit sequential memory buffer, consistent with the traditional
Von Neuman architecture
of computing. Because each bit under superposition is simultaneously a 0 and a 1, these tests can simulate 2
n
different application states at any given instant in time in
O(n)
space. Each of these application states can be mutated by application logic to another state in constant time using quantum algorithms developed by Google researchers. These two properties together allow us to build a state transition graph of the application under test that shows every possible application state and all possible transitions to other application states. Using traditional computing methods this problem has intractable time complexity, but after leveraging superposition and our quantum algorithms it becomes relatively fast and cheap.
Figure 2
The application state graph for a demonstrative 3-bit application. If the start state is 001 then 000, 110, 111, and 011 are all unreachable states. States 010 and 100 both result in deadlock.
Once we have the state transition graph for the application under test, testing it becomes almost trivial. Given the initial startup state of the application, i.e. the executable bits of the application stored on disk, we can find from the application's state transition graph all reachable states. Assertions that ensure proper behavior are then written against the reachable subset of the transition graph. This paradigm of test writing allows both Google's security engineers and software engineers to work more productively. A security engineer can write a test, for example, that asserts "no executable memory regions become mutated in any reachable state". This one test effectively eliminates the potential for security flaws that result from
memory safety violations
. A test engineer can write higher level assertions using graph traversal methods that ensure data integrity is maintained across a subset of application state transitions. Tests of this nature can detect data corruption bugs.
We're excited about the work our team has done so far to push the envelope in the field of quantum software quality. We're just getting started, but based on early dogfood results among Googlers we believe the potential of this work is huge. Stay tuned!
2 comments
Labels
TotT
101
GTAC
61
James Whittaker
42
Misko Hevery
32
Code Health
30
Anthony Vallone
27
Patrick Copeland
23
Jobs
18
Andrew Trenk
12
C++
11
Patrik Höglund
8
JavaScript
7
Allen Hutchison
6
George Pirocanac
6
Zhanyong Wan
6
Harry Robinson
5
Java
5
Julian Harty
5
Alberto Savoia
4
Ben Yu
4
Erik Kuefler
4
Philip Zembrod
4
Shyam Seshadri
4
Adam Bender
3
Chrome
3
Dillon Bly
3
John Thomas
3
Lesley Katzen
3
Marc Kaplan
3
Markus Clermont
3
Max Kanat-Alexander
3
Sonal Shah
3
APIs
2
Abhishek Arya
2
Alan Myrvold
2
Alek Icev
2
Android
2
April Fools
2
Chaitali Narla
2
Chris Lewis
2
Chrome OS
2
Diego Salas
2
Dori Reuveni
2
Jason Arbon
2
Jochen Wuttke
2
Kostya Serebryany
2
Marc Eaddy
2
Marko Ivanković
2
Mobile
2
Oliver Chang
2
Simon Stewart
2
Stefan Kennedy
2
Test Flakiness
2
Titus Winters
2
Tony Voellm
2
WebRTC
2
Yiming Sun
2
Yvette Nameth
2
Zuri Kemp
2
Aaron Jacobs
1
Adam Porter
1
Adam Raider
1
Adel Saoud
1
Alan Faulkner
1
Alex Eagle
1
Amy Fu
1
Anantha Keesara
1
Antoine Picard
1
App Engine
1
Ari Shamash
1
Arif Sukoco
1
Benjamin Pick
1
Bob Nystrom
1
Bruce Leban
1
Carlos Arguelles
1
Carlos Israel Ortiz García
1
Cathal Weakliam
1
Christopher Semturs
1
Clay Murphy
1
Dagang Wei
1
Dan Maksimovich
1
Dan Shi
1
Dan Willemsen
1
Dave Chen
1
Dave Gladfelter
1
David Bendory
1
David Mandelberg
1
Derek Snyder
1
Diego Cavalcanti
1
Dmitry Vyukov
1
Eduardo Bravo Ortiz
1
Ekaterina Kamenskaya
1
Elliott Karpilovsky
1
Elliotte Rusty Harold
1
Espresso
1
Felipe Sodré
1
Francois Aube
1
Gene Volovich
1
Google+
1
Goran Petrovic
1
Goranka Bjedov
1
Hank Duan
1
Havard Rast Blok
1
Hongfei Ding
1
Jason Elbaum
1
Jason Huggins
1
Jay Han
1
Jeff Hoy
1
Jeff Listfield
1
Jessica Tomechak
1
Jim Reardon
1
Joe Allan Muharsky
1
Joel Hynoski
1
John Micco
1
John Penix
1
Jonathan Rockway
1
Jonathan Velasquez
1
Josh Armour
1
Julie Ralph
1
Kai Kent
1
Karin Lundberg
1
Kaue Silveira
1
Kevin Bourrillion
1
Kevin Graney
1
Kirkland
1
Kurt Alfred Kluever
1
Manjusha Parvathaneni
1
Marek Kiszkis
1
Marius Latinis
1
Mark Ivey
1
Mark Manley
1
Mark Striebeck
1
Matt Lowrie
1
Meredith Whittaker
1
Michael Bachman
1
Michael Klepikov
1
Mike Aizatsky
1
Mike Wacker
1
Mona El Mahdy
1
Noel Yap
1
Palak Bansal
1
Patricia Legaspi
1
Per Jacobsson
1
Peter Arrenbrecht
1
Peter Spragins
1
Phil Norman
1
Phil Rollet
1
Pooja Gupta
1
Project Showcase
1
Radoslav Vasilev
1
Rajat Dewan
1
Rajat Jain
1
Rich Martin
1
Richard Bustamante
1
Roshan Sembacuttiaratchy
1
Ruslan Khamitov
1
Sam Lee
1
Sean Jordan
1
Sharon Zhou
1
Shiva Garg
1
Siddartha Janga
1
Simran Basi
1
Stan Chan
1
Stephen Ng
1
Tejas Shah
1
Test Analytics
1
Test Engineer
1
Tim Lyakhovetskiy
1
Tom O'Neill
1
Vojta Jína
1
automation
1
dead code
1
iOS
1
mutation testing
1
Archive
►
2024
(11)
►
Sep
(1)
►
Aug
(1)
►
Jul
(1)
►
May
(3)
►
Apr
(3)
►
Mar
(1)
►
Feb
(1)
►
2023
(14)
►
Dec
(2)
►
Nov
(2)
►
Oct
(5)
►
Sep
(3)
►
Aug
(1)
►
Apr
(1)
►
2022
(2)
►
Feb
(2)
►
2021
(3)
►
Jun
(1)
►
Apr
(1)
►
Mar
(1)
►
2020
(8)
►
Dec
(2)
►
Nov
(1)
►
Oct
(1)
►
Aug
(2)
►
Jul
(1)
►
May
(1)
►
2019
(4)
►
Dec
(1)
►
Nov
(1)
►
Jul
(1)
►
Jan
(1)
►
2018
(7)
►
Nov
(1)
►
Sep
(1)
►
Jul
(1)
►
Jun
(2)
►
May
(1)
►
Feb
(1)
►
2017
(17)
►
Dec
(1)
►
Nov
(1)
►
Oct
(1)
►
Sep
(1)
►
Aug
(1)
►
Jul
(2)
►
Jun
(2)
►
May
(3)
►
Apr
(2)
►
Feb
(1)
►
Jan
(2)
►
2016
(15)
►
Dec
(1)
►
Nov
(2)
►
Oct
(1)
►
Sep
(2)
►
Aug
(1)
►
Jun
(2)
►
May
(3)
►
Apr
(1)
►
Mar
(1)
►
Feb
(1)
▼
2015
(14)
►
Dec
(1)
►
Nov
(1)
►
Oct
(2)
►
Aug
(1)
►
Jun
(1)
►
May
(2)
▼
Apr
(2)
Just Say No to More End-to-End Tests
Quantum Quality
►
Mar
(1)
►
Feb
(1)
►
Jan
(2)
►
2014
(24)
►
Dec
(2)
►
Nov
(1)
►
Oct
(2)
►
Sep
(2)
►
Aug
(2)
►
Jul
(3)
►
Jun
(3)
►
May
(2)
►
Apr
(2)
►
Mar
(2)
►
Feb
(1)
►
Jan
(2)
►
2013
(16)
►
Dec
(1)
►
Nov
(1)
►
Oct
(1)
►
Aug
(2)
►
Jul
(1)
►
Jun
(2)
►
May
(2)
►
Apr
(2)
►
Mar
(2)
►
Jan
(2)
►
2012
(11)
►
Dec
(1)
►
Nov
(2)
►
Oct
(3)
►
Sep
(1)
►
Aug
(4)
►
2011
(39)
►
Nov
(2)
►
Oct
(5)
►
Sep
(2)
►
Aug
(4)
►
Jul
(2)
►
Jun
(5)
►
May
(4)
►
Apr
(3)
►
Mar
(4)
►
Feb
(5)
►
Jan
(3)
►
2010
(37)
►
Dec
(3)
►
Nov
(3)
►
Oct
(4)
►
Sep
(8)
►
Aug
(3)
►
Jul
(3)
►
Jun
(2)
►
May
(2)
►
Apr
(3)
►
Mar
(3)
►
Feb
(2)
►
Jan
(1)
►
2009
(54)
►
Dec
(3)
►
Nov
(2)
►
Oct
(3)
►
Sep
(5)
►
Aug
(4)
►
Jul
(15)
►
Jun
(8)
►
May
(3)
►
Apr
(2)
►
Feb
(5)
►
Jan
(4)
►
2008
(75)
►
Dec
(6)
►
Nov
(8)
►
Oct
(9)
►
Sep
(8)
►
Aug
(9)
►
Jul
(9)
►
Jun
(6)
►
May
(6)
►
Apr
(4)
►
Mar
(4)
►
Feb
(4)
►
Jan
(2)
►
2007
(41)
►
Oct
(6)
►
Sep
(5)
►
Aug
(3)
►
Jul
(2)
►
Jun
(2)
►
May
(2)
►
Apr
(7)
►
Mar
(5)
►
Feb
(5)
►
Jan
(4)
Feed
Follow @googletesting