Testing Blog
A Tale of Two Features
Tuesday, February 01, 2022
By George Pirocanac
I have often been asked, “What is the most memorable bug that you have encountered in your testing career?” For me, it is hands down a bug that happened quite a few years ago. I was leading an Engineering Productivity team that supported Google App Engine. At that time App Engine was still in its early stages, and there were many challenges associated with testing rapidly evolving features. Our testing frameworks and processes were also evolving, so it was an exciting time to be on the team.
What makes this bug so memorable is that I spent so much time developing a comprehensive suite of test scenarios, yet a failure occurred during such an obvious use case that it left me shaking my head and wondering how I had missed it. Even with many years of testing experience it can be very humbling to construct scenarios that adequately mirror what will happen in the field.
I’ll try to provide enough background for the reader to play along and see if they can determine the anomalous scenario. As a further hint, the problem resulted from the interaction of two App Engine features, so I like calling this story
A Tale of Two Features.
Feature 1 - Datastore Admin (backup, restore, delete)
Google App Engine was released 13 years ago as Google’s first Cloud product. It allowed users to build and deploy highly scalable web applications in the Cloud. To support this, it had its own scalable database called the Datastore. An administration console allowed users to manage the application and its Datastore through a web interface. Users wrote applications that consisted of request handlers that App Engine invoked according to the URL that was specified. The handlers could call App Engine services like Datastore through a remote procedure call (RPC) mechanism. Figure 1 illustrates this flow.
The first feature in this
Tale of Two Features
resided in the administration console, providing the ability to back up, restore, and delete selected or all of an application’s entities in the Datastore. It was implemented in a clever way that incorporated it directly into the application, rather than as an independent utility. As part of the application it could freely operate on the Datastore and incur the same billing charges as other datastore operations within the application. When the feature was invoked, traffic would be sent to its handler and the application would process it. Figure 2 illustrates this request flow.
By the time this memorable bug occurred, this Datastore administration feature was mature, well tested, and stable. No changes were being made to it.
Feature 2 - Utilities for Migrating to the HR - Datastore
The second feature (or more accurately, set of features) came at least a year after the first feature was released and stable. It helped users migrate their applications to a new High Replication (HR) Datastore. The HR Datastore was more reliable than its predecessor, but using it meant creating a new application and copying over all the data from the old Datastore. To support such migrations, App Engine developers added two new features to the administration console. The first copied all the data from the Datastore of one application to another, and the second redirected all traffic from one application to another. The latter was particularly useful because it meant the new application would seamlessly receive the traffic after a migration. This set of features was written by another team, and we in Engineering Productivity supported them by creating processes for testing various Datastore migrations. The migration-support features were thoroughly tested and released to developers. Figure 3 illustrates the request flow of the redirection feature.
What Could Possibly Go Wrong?
So this was the situation when we released these utilities for migrating to the new Datastore. We were very confident that they worked, as we had tested migrations of many different types and sizes of Datastore entities. We had also tested that a migration could be interrupted without data loss. All checked out fine. I was confident that this new feature would work, yet soon after we released it, we started getting problem reports.
If you have been playing along, now is the time to ask yourself, “What could possibly go wrong?” As an added hint, the problem reports claimed that all the data in the newly migrated application was disappearing.
What Did Go Wrong
As mentioned above, developers began to report that data was disappearing from their newly migrated applications. It wasn’t at all common, yet of course it is most disconcerting when data “just disappears.” We had to investigate how this could occur. Our standard processes ensured that we had internal backups of the data, which were promptly restored. In parallel we tried to reproduce the problem, but we couldn’t—at least until we figured out what was happening. As I mentioned earlier, once we understood it, it was quite obvious, but that only made it all the more painful that we missed it.
What was happening was that, after migrating and automatically redirecting traffic to the new application, a number of customers thought they still needed to delete the data from their old application, so they used the first Datastore admin feature to do that. As expected, the feature sent traffic to that application to delete the entities from the Datastore. But that traffic was now being automatically redirected to the new application, and voila—all the data that had been copied earlier was now deleted there. Since only a handful of developers tried to delete the data from their old applications, this explained why the problem occurred only rarely. Figure 4 illustrates this request flow.
Obvious, isn’t it, once you know what is happening.
Lessons Learned
This all occurred years ago, and App Engine is based on a far different and more robust framework today. Datastore migrations are but a memory from the past, yet this experience made a great impression on me.
The most important thing I learned from this experience is that, while it is important to test features for their functionality, it’s also important to think of them as part of workflows. In performing our testing we exercised a very limited number of steps in the migration process workflow and omitted a very reasonable step at the end: trying to delete the data from the old application. Our focus was in testing the variability of contents in the Datastore rather than different steps in the migration process. It was this focus that kept our eyes away from the relatively obvious failure case.
Another thing I learned was that this bug might have been caught if the developer of the first feature had been in the design review for the second set of migration features (particularly the feature that automatically redirects traffic). Unfortunately, that person had already joined a new team. A key step in reducing bugs can occur at the design stage if “what-if” questions are being asked.
Finally, I was enormously impressed that we were able to recover so quickly. Protecting against data loss is one of the most important aspects of Cloud management, and being able to recover from mistakes is at least as important as trying to prevent them. I have the utmost respect for my coworkers in Site Reliability Engineering (SRE).
References
An Overview of Google App Engine
Labels
TotT
103
GTAC
61
James Whittaker
42
Misko Hevery
32
Code Health
31
Anthony Vallone
27
Patrick Copeland
23
Jobs
18
Andrew Trenk
13
C++
11
Patrik Höglund
8
JavaScript
7
Allen Hutchison
6
George Pirocanac
6
Zhanyong Wan
6
Harry Robinson
5
Java
5
Julian Harty
5
Adam Bender
4
Alberto Savoia
4
Ben Yu
4
Erik Kuefler
4
Philip Zembrod
4
Shyam Seshadri
4
Chrome
3
Dillon Bly
3
John Thomas
3
Lesley Katzen
3
Marc Kaplan
3
Markus Clermont
3
Max Kanat-Alexander
3
Sonal Shah
3
APIs
2
Abhishek Arya
2
Alan Myrvold
2
Alek Icev
2
Android
2
April Fools
2
Chaitali Narla
2
Chris Lewis
2
Chrome OS
2
Diego Salas
2
Dori Reuveni
2
Jason Arbon
2
Jochen Wuttke
2
Kostya Serebryany
2
Marc Eaddy
2
Marko Ivanković
2
Mobile
2
Oliver Chang
2
Simon Stewart
2
Stefan Kennedy
2
Test Flakiness
2
Titus Winters
2
Tony Voellm
2
WebRTC
2
Yiming Sun
2
Yvette Nameth
2
Zuri Kemp
2
Aaron Jacobs
1
Adam Porter
1
Adam Raider
1
Adel Saoud
1
Alan Faulkner
1
Alex Eagle
1
Amy Fu
1
Anantha Keesara
1
Antoine Picard
1
App Engine
1
Ari Shamash
1
Arif Sukoco
1
Benjamin Pick
1
Bob Nystrom
1
Bruce Leban
1
Carlos Arguelles
1
Carlos Israel Ortiz García
1
Cathal Weakliam
1
Christopher Semturs
1
Clay Murphy
1
Dagang Wei
1
Dan Maksimovich
1
Dan Shi
1
Dan Willemsen
1
Dave Chen
1
Dave Gladfelter
1
David Bendory
1
David Mandelberg
1
Derek Snyder
1
Diego Cavalcanti
1
Dmitry Vyukov
1
Eduardo Bravo Ortiz
1
Ekaterina Kamenskaya
1
Elliott Karpilovsky
1
Elliotte Rusty Harold
1
Espresso
1
Felipe Sodré
1
Francois Aube
1
Gene Volovich
1
Google+
1
Goran Petrovic
1
Goranka Bjedov
1
Hank Duan
1
Havard Rast Blok
1
Hongfei Ding
1
Jason Elbaum
1
Jason Huggins
1
Jay Han
1
Jeff Hoy
1
Jeff Listfield
1
Jessica Tomechak
1
Jim Reardon
1
Joe Allan Muharsky
1
Joel Hynoski
1
John Micco
1
John Penix
1
Jonathan Rockway
1
Jonathan Velasquez
1
Josh Armour
1
Julie Ralph
1
Kai Kent
1
Kanu Tewary
1
Karin Lundberg
1
Kaue Silveira
1
Kevin Bourrillion
1
Kevin Graney
1
Kirkland
1
Kurt Alfred Kluever
1
Manjusha Parvathaneni
1
Marek Kiszkis
1
Marius Latinis
1
Mark Ivey
1
Mark Manley
1
Mark Striebeck
1
Matt Lowrie
1
Meredith Whittaker
1
Michael Bachman
1
Michael Klepikov
1
Mike Aizatsky
1
Mike Wacker
1
Mona El Mahdy
1
Noel Yap
1
Palak Bansal
1
Patricia Legaspi
1
Per Jacobsson
1
Peter Arrenbrecht
1
Peter Spragins
1
Phil Norman
1
Phil Rollet
1
Pooja Gupta
1
Project Showcase
1
Radoslav Vasilev
1
Rajat Dewan
1
Rajat Jain
1
Rich Martin
1
Richard Bustamante
1
Roshan Sembacuttiaratchy
1
Ruslan Khamitov
1
Sam Lee
1
Sean Jordan
1
Sebastian Dörner
1
Sharon Zhou
1
Shiva Garg
1
Siddartha Janga
1
Simran Basi
1
Stan Chan
1
Stephen Ng
1
Tejas Shah
1
Test Analytics
1
Test Engineer
1
Tim Lyakhovetskiy
1
Tom O'Neill
1
Vojta Jína
1
automation
1
dead code
1
iOS
1
mutation testing
1
Archive
►
2025
(1)
►
Jan
(1)
►
2024
(13)
►
Dec
(1)
►
Oct
(1)
►
Sep
(1)
►
Aug
(1)
►
Jul
(1)
►
May
(3)
►
Apr
(3)
►
Mar
(1)
►
Feb
(1)
►
2023
(14)
►
Dec
(2)
►
Nov
(2)
►
Oct
(5)
►
Sep
(3)
►
Aug
(1)
►
Apr
(1)
▼
2022
(2)
▼
Feb
(2)
Code Health: Now You're Thinking With Functions
A Tale of Two Features
►
2021
(3)
►
Jun
(1)
►
Apr
(1)
►
Mar
(1)
►
2020
(8)
►
Dec
(2)
►
Nov
(1)
►
Oct
(1)
►
Aug
(2)
►
Jul
(1)
►
May
(1)
►
2019
(4)
►
Dec
(1)
►
Nov
(1)
►
Jul
(1)
►
Jan
(1)
►
2018
(7)
►
Nov
(1)
►
Sep
(1)
►
Jul
(1)
►
Jun
(2)
►
May
(1)
►
Feb
(1)
►
2017
(17)
►
Dec
(1)
►
Nov
(1)
►
Oct
(1)
►
Sep
(1)
►
Aug
(1)
►
Jul
(2)
►
Jun
(2)
►
May
(3)
►
Apr
(2)
►
Feb
(1)
►
Jan
(2)
►
2016
(15)
►
Dec
(1)
►
Nov
(2)
►
Oct
(1)
►
Sep
(2)
►
Aug
(1)
►
Jun
(2)
►
May
(3)
►
Apr
(1)
►
Mar
(1)
►
Feb
(1)
►
2015
(14)
►
Dec
(1)
►
Nov
(1)
►
Oct
(2)
►
Aug
(1)
►
Jun
(1)
►
May
(2)
►
Apr
(2)
►
Mar
(1)
►
Feb
(1)
►
Jan
(2)
►
2014
(24)
►
Dec
(2)
►
Nov
(1)
►
Oct
(2)
►
Sep
(2)
►
Aug
(2)
►
Jul
(3)
►
Jun
(3)
►
May
(2)
►
Apr
(2)
►
Mar
(2)
►
Feb
(1)
►
Jan
(2)
►
2013
(16)
►
Dec
(1)
►
Nov
(1)
►
Oct
(1)
►
Aug
(2)
►
Jul
(1)
►
Jun
(2)
►
May
(2)
►
Apr
(2)
►
Mar
(2)
►
Jan
(2)
►
2012
(11)
►
Dec
(1)
►
Nov
(2)
►
Oct
(3)
►
Sep
(1)
►
Aug
(4)
►
2011
(39)
►
Nov
(2)
►
Oct
(5)
►
Sep
(2)
►
Aug
(4)
►
Jul
(2)
►
Jun
(5)
►
May
(4)
►
Apr
(3)
►
Mar
(4)
►
Feb
(5)
►
Jan
(3)
►
2010
(37)
►
Dec
(3)
►
Nov
(3)
►
Oct
(4)
►
Sep
(8)
►
Aug
(3)
►
Jul
(3)
►
Jun
(2)
►
May
(2)
►
Apr
(3)
►
Mar
(3)
►
Feb
(2)
►
Jan
(1)
►
2009
(54)
►
Dec
(3)
►
Nov
(2)
►
Oct
(3)
►
Sep
(5)
►
Aug
(4)
►
Jul
(15)
►
Jun
(8)
►
May
(3)
►
Apr
(2)
►
Feb
(5)
►
Jan
(4)
►
2008
(75)
►
Dec
(6)
►
Nov
(8)
►
Oct
(9)
►
Sep
(8)
►
Aug
(9)
►
Jul
(9)
►
Jun
(6)
►
May
(6)
►
Apr
(4)
►
Mar
(4)
►
Feb
(4)
►
Jan
(2)
►
2007
(41)
►
Oct
(6)
►
Sep
(5)
►
Aug
(3)
►
Jul
(2)
►
Jun
(2)
►
May
(2)
►
Apr
(7)
►
Mar
(5)
►
Feb
(5)
►
Jan
(4)
Feed
Follow @googletesting