How We Tested Google Instant Pages
Wednesday, July 27, 2011
By Jason Arbon and Tejas Shah
Google Instant Pages is a cool new way for Google to speed up your search experience. When Google thinks it knows which result you are likely to click, it preloads that page in the background, so when you click the result the page renders instantly, saving the user about 5 seconds. Five seconds is significant when you think of how many searches are performed each day--and especially when you consider that the rest of the search experience is optimized for sub-second performance.
The testing problem here is interesting. This feature requires client and server coordination, and since we are preloading and rendering the pages in an invisible background page, we wanted to make sure that nothing major was broken in the page rendering.
The original idea was for developers to test out a few pages as they went. But this doesn't scale to a large number of sites and is very expensive to repeat. Also, how do you know what the pages should look like? Writing Selenium tests to functionally validate thousands of sites would take forever--the product would ship first. The solution was to perform one automated test run that loads these pages from search results with Instant Pages turned on, and another run with Instant Pages turned off. The page renderings from each run were then compared.
How did we compare the two runs? How do you compare pages when the content and ads on web pages are constantly changing and you don't know what the expected behavior is? We could have used cached versions of these pages, but that wouldn't be the real-world experience we were testing, it would have taken time to set up, and the timing would have been different. We opted to leverage some other work that compares pages using the Document Object Model (DOM). We automatically scan each page, pixel by pixel, but look at which element is visible at each point on the page, not the color/RGB values. We then do a simple measure of how closely these pixel measurements match. These so-called "quality bots" generate a score of 0-100%, where 100% means all measurements were identical.
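To make that concrete, here is a minimal sketch of this kind of grid-based DOM comparison, written in Python against Selenium WebDriver. The grid step, the element "signature" (tag name plus id), and the example URL are illustrative choices, and the sketch does not show how Instant Pages is toggled between the two runs; the real quality bots are considerably more sophisticated.

# Illustrative sketch: sample a grid of points on a rendered page, record which
# DOM element is visible at each point, and score how closely two runs match.
from selenium import webdriver

GRID_STEP = 20          # sample every 20 CSS pixels (coarse, for speed)
WIDTH, HEIGHT = 1024, 2000

SIGNATURE_SCRIPT = """
var step = arguments[0], w = arguments[1], h = arguments[2];
var cells = [];
for (var y = 0; y < h; y += step) {
  for (var x = 0; x < w; x += step) {
    var el = document.elementFromPoint(x, y);
    cells.push(el ? el.tagName + '#' + (el.id || '') : null);
  }
}
return cells;
"""

def element_grid(driver, url):
    """Load a page and return a list of element signatures, one per grid point."""
    driver.set_window_size(WIDTH, HEIGHT)   # tall window so sample points sit in the viewport
    driver.get(url)
    return driver.execute_script(SIGNATURE_SCRIPT, GRID_STEP, WIDTH, HEIGHT)

def similarity(cells_a, cells_b):
    """Fraction of grid points whose element signatures match, 0.0 to 1.0."""
    matches = sum(1 for a, b in zip(cells_a, cells_b) if a == b)
    return matches / max(len(cells_a), 1)

if __name__ == "__main__":
    driver = webdriver.Chrome()
    try:
        run_a = element_grid(driver, "https://example.com/")   # e.g. the Instant Pages "on" run
        run_b = element_grid(driver, "https://example.com/")   # e.g. the Instant Pages "off" run
        print("score: %.1f%%" % (100.0 * similarity(run_a, run_b)))
    finally:
        driver.quit()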
When we performed the runs, the vast majority (~95%) of all comparisons were almost identical, as we had hoped. Where the pages were different, we built a web page that showed the differences between the two runs by rendering both images and highlighting the differences. It was quick and easy for the developers to visually verify that the differences were only due to content or other non-structural differences in the rendering. Any time test automation is scalable, repeatable, and quantified, and developers can validate the results without us, it is a good thing!
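The report itself can be very simple. The following is only a rough sketch of a side-by-side page, assuming each run saved a screenshot per URL (the file names here are made up); the team's actual report also highlighted the differing regions.

# Illustrative sketch: emit a bare-bones side-by-side comparison page for one URL.
import html

def write_report(path, url, score, on_png, off_png):
    page = f"""<!DOCTYPE html>
<html><body>
<h3>{html.escape(url)} (match: {score * 100.0:.1f}%)</h3>
<img src="{on_png}" width="45%" alt="Instant Pages on">
<img src="{off_png}" width="45%" alt="Instant Pages off">
</body></html>"""
    with open(path, "w") as f:
        f.write(page)

write_report("diff_report.html", "https://example.com/", 0.97,
             "run_on/example.png", "run_off/example.png")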
How did this testing get organized? As with many things in testing at Google, it came down to people chatting and realizing their work could be helpful for other engineers. This was bottom up, not top down. Tejas Shah was working on a general quality bot solution for compatibility (more on that in later posts) between Chrome and other browsers. He chatted with the Instant Pages developers when he was visiting their building, and they agreed his bot might be able to help. He then spent the next couple of weeks pulling it all together and sharing the results with the team.
And now more applications of the quality bot are surfacing. What if we kept the browser version fixed, and only varied the version of the application? Could this help validate web applications independent of a functional spec and without custom validation script development and maintenance? Stay tuned...
Comments:
How extensively does Google use Selenium for test automation and in what ways? Thanks
Chris
Why couldn't you just preload the HTML and all associated files? Does the rendering take that much?
It sounds like you are hinting at setting up regression tests of application-generated DOMs. The differences are screened by humans. The keys would be (1) creating/choosing test scripts that generate plenty of coverage, maximizing true positives and minimizing false positives, (2) possibly masking out parts of the DOM that are likely to change most of the time, and (3) making it easy and fast for humans to review the possible regressions and, of course, report the true regressions.
Are there any plans to OS this utility as I would be interested to see how I could use something like this in my work.
Hello.
That quality bot you mention sounds really cool. Do you know if this would ever be made available? Sounds like it would make a good companion to selenium/webdriver
Python + Selenium (screenshots!) + ImageMagick.
1) Get the page, save it to a db/archive.
2) Compare against last/base.
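For readers wondering what that suggested pipeline might look like, here is a rough sketch using Selenium screenshots and ImageMagick's compare command-line tool (assumed to be installed). The paths and the pixel threshold are invented for illustration, and note this is a pixel-level approach rather than the DOM-based comparison described in the post.

# Illustrative sketch of the screenshot + ImageMagick pipeline suggested above.
import subprocess
from selenium import webdriver

def screenshot(url, path):
    driver = webdriver.Chrome()
    try:
        driver.set_window_size(1024, 768)
        driver.get(url)
        driver.save_screenshot(path)
    finally:
        driver.quit()

def differing_pixels(baseline_png, new_png, diff_png):
    """Run ImageMagick's `compare -metric AE`, which prints the count of
    differing pixels on stderr and writes a visual diff image."""
    result = subprocess.run(
        ["compare", "-metric", "AE", baseline_png, new_png, diff_png],
        capture_output=True, text=True)
    return float(result.stderr.split()[0])

screenshot("https://example.com/", "new.png")
if differing_pixels("baseline.png", "new.png", "diff.png") > 1000:  # arbitrary threshold
    print("page changed more than expected; see diff.png")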
Chris, Selenium (and Webdriver) are used very heavily at Google, and we have a centralized farm of Selenium machines to execute these tests around the clock.
BlackTigerX, rendering does take time and every millisecond is interesting :) Also, for some of the larger, script-driven and AJAXy sites, they need the full DOM loaded to complete rendering.
Kazumatan, you are right. We are also working to make it easy for the human raters to label the non-interesting but constantly changing portions of the DOM (think Google Feedback-style region selection) for later filtering.
Dojann, Open Sourcing is definitely on the map, only the timing is a question. We've designed most of it so that it could run outside of Google's infrastructure for just this reason :) We are also looking at hosting options that let folks easily run on hosted machines they own, with VPN access to their staging environments. I can't speak to the timing, but it is partly dependent on the level of interest from the community in these options.
Ben, yup, we are hoping to share the service and code 'soon'. The more interest we see, the faster this will happen.
cheers!
Expect major updates and perhaps even OS at GTAC in October.
I have two questions:
1. Comparing page rendering with Instant Pages turned off and on seems cool. But there should have been some tool/automation that verified the rendering of pages before the Instant Pages feature was even introduced. How was that being done, and why wasn't that used here?
2. How did the pixel/DOM comparison solve the problem of dynamically generated ads? Did you just verify the placeholders/DOM elements and not the content?
Great questions Raghav... Chrome and many internal teams at Google use a variety of tools, including "quality bots", to automatically verify rendering and catch layout issues. The reason we had to use quality bots here is that they work at scale automagically, whereas most traditional automation tools require a custom test for each page, which is hard to scale. Also, we need to keep in mind that the page is hidden until it is made visible, and the only way to know the page was prerendered is to have injected JS while the page was in the prerendered state.
Re dynamic ads: You are right. In general, the bots verify information about the elements, but not the content. On top of that, we have an ad detection mechanism in place to detect and ignore ads while comparing.
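As a toy illustration of that kind of filtering, one could drop any sampled grid point that landed on an ad-like element before scoring. The marker strings below are invented for the example; the real ad detection is more involved.

# Illustrative sketch: ignore grid points that fall on ad-like elements when
# comparing the two element grids from the earlier sketch.
AD_MARKERS = ("google_ads", "doubleclick", "ad_container")   # invented examples

def looks_like_ad(signature):
    return signature is not None and any(m in signature.lower() for m in AD_MARKERS)

def similarity_ignoring_ads(cells_a, cells_b):
    """Score only the points where neither run landed on an ad."""
    pairs = [(a, b) for a, b in zip(cells_a, cells_b)
             if not (looks_like_ad(a) or looks_like_ad(b))]
    if not pairs:
        return 1.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)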
Will there be any tutorials on building these kinds of bots?
Could you use Sikuli to compare the images?
Regards,
Thiago Peçanha
We use pixel comparison (against a defined baseline) at our company in our automated regression testing. We have found that dynamically generated ads have caused a lot of problems. To start with we just set the success threshold lower, but this was not satisfactory. So now we only do the pixel comparison for pages with web ads.
I'm thinking about solving this problem by asserting that partial images can be found within the page being tested.
Photographer-Anairda, likely no tutorials soon; we hope to do one better and just open source it, document it, and let folks re-host their own instances if they like. I'm happy to chat about the details. The crawler is relatively easy--think lots of elementFromPoint() calls; the problems are in scale, reporting, and rendering the data.
Hi Tiago. Sikuli is very cool. Some folks have used it at Google, some have built similar approaches, and there are even some commercial products that work this way. We have fundamentally focused on the DOM diff instead of the pixel diff, for three reasons. 1. When you detect and file a 'bug' based on a screenshot, it is a significant amount of work to repro and debug the underlying issue that caused the pixels to be off, so why not just grab that data while you are on the site? 2. If you know the structure of the web page, you don't need fancy, probabilistic approaches to identify elements that have scaled, translated, or failed to appear--you know exactly which DOM elements have failed. 3. We are building a corpus of which elements tend to cause differences, so we can hopefully correlate failures across many sites/runs to determine if there are underlying issues in the browsers, tooling, or DOM usage--that's the ultimate goal. Great question--this is fundamental to the what and why of Bots.
Hi Cithan. False positives and noise from ads were the reason a lot of people avoided this area, thinking it couldn't be useful data :) We use an 'ignore' filter for data from common ad-like sites during our crawl. We are also working on a way for our first-line crowd-sourced evaluators to mark page areas as 'don't care' on a per-site basis, to add to the filter set. Most significantly though, we also have the notion of a 'baseline' for a site. If the site permutes all the time, but within a range, you can choose to only flag sites as they go outside of that normal range. Data for many top portal sites looks like this... the URLs and divs shift around a bit day to day, but the amount of entropy day over day stays within a normal range/band.
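As a toy illustration of that "normal range" idea, one could keep a history of comparison scores per site and only flag a run when it falls well outside the usual band. The numbers and threshold below are invented, not the bots' actual model.

# Illustrative sketch: flag a run only when its score leaves the site's usual band.
from statistics import mean, stdev

def is_outside_normal_band(history, new_score, k=3.0):
    """Flag a run only if its score falls well below the site's usual range."""
    if len(history) < 5:          # not enough history to know what "normal" is
        return new_score < 0.90   # fall back to a fixed threshold
    mu, sigma = mean(history), stdev(history)
    return new_score < mu - k * max(sigma, 0.01)

# A noisy portal-style site: scores bounce around but stay within a band.
history = [0.91, 0.88, 0.93, 0.90, 0.89, 0.92]
print(is_outside_normal_band(history, 0.90))  # False: within the usual band
print(is_outside_normal_band(history, 0.60))  # True: worth a human look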