Google Explains How It Chooses Canonical Webpages


Google's Gary Illyes describes the signals Google uses to choose canonical pages and shares why duplicate pages can be important for SEO.

How Google chooses canonical webpages

In a Google Search Central video, Google’s Gary Illyes explained the part of webpage indexing that involves selecting canonicals. He described what a canonical means to Google, gave a thumbnail explanation of webpage signals, mentioned the centerpiece of a page, and explained what Google does with duplicates, which implies a new way of thinking about them.

What Is A Canonical Webpage?

There are several ways of considering what canonical means: the publisher’s and the SEO’s viewpoints from our side of the search box, and what canonical means from Google’s side.

Publishers identify what they feel is the “original” webpage, while the SEO’s conception of canonicals is about choosing the “strongest” version of a webpage for ranking purposes.

Canonicalization for Google is an entirely different thing from what publishers and SEOs think it is, so it’s good to hear it from a Googler like Gary Illyes.

Google’s official documentation about canonicalization uses the word deduplication to refer to the process of choosing a canonical, and it lists five typical reasons why a site might have duplicate pages.

Five Reasons For Duplicate Pages

  1. “Region variants: for example, a piece of content for the USA and the UK, accessible from different URLs, but essentially the same content in the same language
  2. Device variants: for example, a page with both a mobile and a desktop version
  3. Protocol variants: for example, the HTTP and HTTPS versions of a site
  4. Site functions: for example, the results of sorting and filtering functions of a category page
  5. Accidental variants: for example, the demo version of the site is accidentally left accessible to crawlers”
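
As a rough illustration of the protocol and site-function variants in that list, the sketch below reduces a URL to a comparison key by ignoring the scheme and a few sorting and filtering parameters. The parameter names (`sort`, `order`, `filter`) and the example.com URLs are assumptions made for illustration. This is not how Google detects duplicates, only one way a site owner might spot likely variants on their own site.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical query parameters that only re-sort or filter a listing.
IGNORED_PARAMS = {"sort", "order", "filter"}


def variant_key(url: str) -> str:
    """Return a key that is equal for URLs differing only by scheme
    or by the ignored sorting/filtering parameters."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS]
    query = urlencode(sorted(kept))
    path = parts.path or "/"
    key = parts.netloc.lower() + path
    return key + ("?" + query if query else "")


urls = [
    "http://example.com/shoes",              # protocol variant
    "https://example.com/shoes?sort=price",  # site-function variant
    "https://example.com/shoes?color=red",   # kept: not in the ignored set
]
keys = [variant_key(u) for u in urls]
```

The first two URLs produce the same key, while the third keeps its `color` parameter because it may change the content of the page.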

Canonicals can be considered in three different ways and there are at least five reasons for duplicate pages.

Gary describes one more way to think of canonicals.

Signals Are Used For Choosing Canonicals

Illyes shares one more definition of a canonical, this time from the indexing point of view, and talks about the signals that are used for selecting canonicals.

Gary explains:

“Google determines if the page is a duplicate of another already known page and which version should be kept in the index, the canonical version.

But in this context, the canonical version is the page from a group of duplicate pages that best represents the group according to the signals we’ve collected about each version.”

Gary stops to explain duplicate clustering and then returns to talking about signals a short while later.

He continued:

“For the most part, only canonical pages appear in Search results. But how do we know which page is canonical?

So once Google has the content of your page, or more specifically the main content or centerpiece of a page, it will group it with one or more pages featuring similar content, if any. This is duplicate clustering.”

I just want to stop here to note that Gary refers to the main content as the “centerpiece of a page,” which is interesting because there’s a concept introduced by Google’s Martin Splitt called the Centerpiece Annotation. Splitt didn’t really explain what the Centerpiece Annotation is, but this bit that Gary shared helps.
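
The duplicate clustering step Gary describes can be sketched in a few lines. This is a simplified illustration, not Google's method: it assumes the main content has already been extracted, and it treats pages as duplicates only when their whitespace- and case-normalized text is identical, whereas Gary speaks of "similar" content. The `fingerprint` and `cluster_duplicates` names and the sample pages are hypothetical.

```python
import hashlib
from collections import defaultdict


def fingerprint(main_content: str) -> str:
    """Hash the main content after collapsing whitespace and case."""
    normalized = " ".join(main_content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cluster_duplicates(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose main content has the same fingerprint."""
    groups = defaultdict(list)
    for url, main_content in pages.items():
        groups[fingerprint(main_content)].append(url)
    return list(groups.values())


# Assumed input: URL -> already-extracted main content ("centerpiece").
pages = {
    "https://example.com/widget": "Blue widget. Ships in two days.",
    "http://example.com/widget": "Blue  widget.  Ships in two days.",
    "https://example.com/about": "About our company.",
}
clusters = cluster_duplicates(pages)
```

Here the two widget URLs land in one cluster and the about page sits alone in its own cluster.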

The following is the part of the video where Illyes explains what “signals” actually are:

“Then it compares a handful of signals it has already calculated for each page to select a canonical version.

Signals are pieces of information that the search engine collects about pages and websites, which are used for further processing.

Some signals are very straightforward, such as site owner annotations in HTML like rel="canonical", while others, like the importance of an individual page on the internet, are less straightforward.”
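
To make the rel="canonical" annotation concrete, here is a minimal sketch that reads it out of a page's HTML using Python's standard `html.parser` module. The `CanonicalFinder` class name and the sample markup are assumptions for illustration. It shows how a site owner could check what a page declares, not how Google processes the annotation.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel can hold several space-separated values.
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


html = """
<html><head>
<link rel="stylesheet" href="/style.css">
<link rel="canonical" href="https://example.com/widget">
</head><body>Blue widget</body></html>
"""
finder = CanonicalFinder()
finder.feed(html)
```

After `feed()` runs, `finder.canonicals` holds the single declared canonical URL, and the stylesheet link is ignored.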

Duplicate Clusters Have One Canonical

Gary next explains that one page is chosen as the canonical to represent each cluster of duplicate pages in the search results. Every cluster of duplicates has one canonical.

He continues:

“Each of the duplicate clusters will have a single version of the content selected as canonical.

This version will represent the content in Search results for all the other versions.

The other versions in the cluster become alternate versions that may be served in different contexts, like if the user is searching for a very specific page from the cluster.”
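
The idea of one canonical per cluster, with the rest kept as alternates, can be sketched as follows. Google does not publish which signals it combines or how it weighs them, so the two fields (`rel_canonical_votes`, `importance`) and the weights in `score` are purely hypothetical stand-ins for the two signals Gary names.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    url: str
    rel_canonical_votes: int  # pages in the cluster whose rel="canonical" points here
    importance: float         # stand-in for the page's importance, 0..1


def score(page: PageSignals) -> float:
    # Hypothetical weights; the real combination of signals is not public.
    return 1.0 * page.rel_canonical_votes + 2.0 * page.importance


def split_cluster(cluster: list[PageSignals]) -> tuple[PageSignals, list[PageSignals]]:
    """Return (canonical, alternates) for one duplicate cluster."""
    canonical = max(cluster, key=score)
    alternates = [p for p in cluster if p is not canonical]
    return canonical, alternates


cluster = [
    PageSignals("https://example.com/shirt", rel_canonical_votes=2, importance=0.6),
    PageSignals("https://example.com/shirt?color=red", rel_canonical_votes=0, importance=0.3),
    PageSignals("https://example.com/shirt?color=blue", rel_canonical_votes=0, importance=0.2),
]
canonical, alternates = split_cluster(cluster)
```

The alternates are kept rather than discarded, which mirrors Gary's point that other versions in the cluster may still be served in different contexts.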

Alternate Versions Of Webpages

That last part is really interesting and important to consider because it can help a site rank for multiple variations of a keyword, particularly for ecommerce webpages.

Sometimes the content management system (CMS) creates duplicate webpages to account for variations of a product, like its size or color, which can then affect the description. Google can choose those variations to rank in the search results when a variant page is a closer match for a search query.

This is important to think about because it might be tempting to redirect or noindex variant webpages to keep them out of the search index out of fear of the (non-existent) keyword cannibalization problem. Adding a noindex to pages that are variants of one page can backfire because there are scenarios where those variant pages are the best ones to rank for a more nuanced search query that contains colors, sizes or version numbers that are different from those on the canonical page.

Top Takeaways About Canonicals (And More) To Remember

There is a lot of information packed in Gary’s discussion of canonicals, including some side topics about the main content.

Here are seven takeaways to consider:

  1. The main content is referred to as the Centerpiece.
  2. Google calculates a “handful of signals” for each page it discovers.
  3. Signals are data that are used for “further processing” after webpages are discovered.
  4. Some signals are within the control of the publisher, like hints (and presumably directives). The hint that Illyes mentioned is the rel=canonical link attribute.
  5. Other signals are outside of the control of the publisher, like the importance of the page in the context of the Internet.
  6. Some duplicate pages can serve as alternate versions.
  7. Alternate versions of webpages can still rank and are useful for Google (and the publisher) for ranking purposes.

Watch the Search Central Episode about indexing:

How Google Search indexes pages


Sandra Santeyian
