Canonical Tags and Duplicate Content | The basics explained

May, 2020

It comes up often: “What does canonical mean? And what is a canonical tag?” And if not that, then the question that follows from getting this wrong: “What is duplicate content, and why is my site penalised for it?”

Canonical more than likely comes from the idea of having a ‘canon’ – a canon of law, the canon of Scripture, the whole canon of English Literature – implying that it is the authoritative collection or item within a subject. Without delving too much into linguistics and etymology (which I’d love to do, trust me), with regard to your website it essentially means this: the canonical version of something is the original, preferred, best, authentic, “look-at-this-one-primarily-please-Google” page on a website. This has two main applications:

A page on your website, as compared to the others on it – more on that later. And,
Your website’s page, as compared to all the others on the interwebs. This works the other way round, too: if you are linking to something on another person’s website and reproducing half the text from that page, you’d want to add a canonical tag pointing to it. Not doing so is basically (and fraudulently) pretending the content is your own. So you reference it (much like a citation when writing essays at university) and add a canonical tag to show Google you are aware of where it came from and are not trying to claim it as your own original content.

A canon of philosophy literature? Also, these leather-bound books make you think I am well researched and intelligent.

To save you from reading too much about it: having “duplicate content” is a huge no-no for Google, and it can cost you heavily in the rankings. The truth is that they actually reward those with unique content, so it’s a kind of deductive penalty, my dear Watson. So, simply put, you want to avoid that by using canonical tags to indicate which page is the authoritative one, whether among the many on your own website or among all of them across the internet. There are a few other ways to indicate this outside of these tags (HTTP header responses, 301 redirects, etc) but we are going to use the tag in this instructional, informative, and lovely article.

It’s also important to point out now that there is both a “canonical tag” (a tag on a page saying which url is the canonical version) and a “canonical url”, which is, as Google puts it so sublimely:

“A canonical URL is the URL of the page that Google thinks is most representative from a set of duplicate pages on your site.” (Google)

From a coding perspective, the tag sits between the <head> and </head> tags of the page’s code (a later tutorial will cover how to inspect your website’s code in Chrome; until then, Google it). If you are not comfortable implementing it yourself, most platforms have easy methods of doing it, or you can ask your developer to do it for you. The example I will give is on WordPress (the ever-popular CMS), using the best SEO plugin: Yoast SEO.
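For those who like to see it in code, here is a minimal Python sketch of what the tag looks like and how you might check a page for it. The function and class names (canonical_tag, CanonicalFinder) and the sample page are my own illustrations, not part of WordPress or Yoast; the url is the example post used later in this article.

```python
from html import escape
from html.parser import HTMLParser


def canonical_tag(url):
    """Return the <link> element that marks `url` as the canonical version."""
    return f'<link rel="canonical" href="{escape(url, quote=True)}" />'


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> found in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel_values:
            self.hrefs.append(attributes.get("href"))


# Hypothetical page, used only for illustration.
page = f"""<html>
<head>
  <title>Ten tips for doing SEO business</title>
  {canonical_tag("https://theda.co.za/ten-tips-for-doing-SEO-business/")}
</head>
<body>...</body>
</html>"""

finder = CanonicalFinder()
finder.feed(page)
print(finder.hrefs)
```

A page should normally end up with exactly one entry in that list; zero or several is a hint that something needs attention.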

If you have Yoast SEO installed on your site and edit a page or post, scroll down to the Yoast bar, click on Advanced, and you will see the ‘Canonical’ option. There you insert the url that page should point to or, if the page is authoritative in itself, just self-reference it.

Why does the need for canonicalisation arise? Well, let’s use WordPress again as a great example. As it started as a blogging platform, it logically was (and still is) used to write a flurry of blog posts. Then, you categorise those posts in a multitude of ways. There can be different authors, they can be published in different months, they can sit within certain categories (eg. business, tech, Europe, advice, etc), and you can even then tag them. Wonderful.

Now, as the blog grows it would obviously be super helpful to just click on one category or author and see all the entries from that person/category. So you click on ‘Business’ and then you see the ten articles underneath. Brilliant for user experience! But sadly… duplicate content is created, because when you navigate to theda.co.za/business/ and see the articles nested underneath (with some excerpted text) you are serving the content again, but from a separate url. (The original url would perhaps be theda.co.za/ten-tips-for-doing-SEO-business/ – and the second, problematic url would then be something like theda.co.za/business/ten-tips-for-doing-SEO-business/).

Solution: You make the original one canonical, and the extra one that is automatically created should then point to the canonical version. Problem solved.

But, problem #1: That category page houses ten blog posts (and 100 down the line), so you would want it to point to all of those blog posts, which is not really feasible or logical. So one option is to deindex the whole category. The other, because Google is going to crawl the whole site and try to work out which is the canonical version anyway, is to save them time (and they will reward you) by telling them which one it is. Which is easiest? Well, to be honest, I find it best not to have hundreds of different ways for these new urls to be spat out. So, first, I try not to have all these extra methods (eg. tag categories, author categories) and, second, if they are there, I deindex them outright.

A little bit lost in what I am saying? Let’s use a worked example…

Here are some urls for a potential ‘Travel Company’

All their tours in February 2019

https://travel-company.com/tours/tours-by-date-february-2019/

All their tours for the whole of 2019

https://travel-company.com/tours/tours-by-date-2019/

You can imagine how there will be duplicates in both those categories, right?

Now imagine how the tours may be filtered by country

https://travel-company.com/tours/jamaica/

And again, we likely have tours that are both in 2019 and to Jamaica, so a tour (let’s call it ‘Best of Jamaica in Ten days’) is now appearing three times. This can go on and on and on. It makes your site “heavy” in Google’s eyes, it uses up more of your “crawl budget”, and you are likely to get cannibalised pages: like it sounds, it means the pages “eat each other” in competition for the search results. So if I searched for ‘Best short tour of Jamaica’, maybe I would see one of the date-category urls and visit that, but next time I (or another user) would see another result (maybe the February one) and visit that. Simply put: this is not great, for Google or for users.
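To make the travel example a bit more tangible, here is a rough Python sketch that spots which tours show up under more than one listing url and what each one’s canonical url would be. The three listing urls are the ones above; the tour slugs and the “one home url per tour” convention are assumptions made up for the illustration.

```python
from collections import defaultdict

BASE = "https://travel-company.com/tours/"

# Listing urls come from the example above; the tour slugs are made up.
listings = {
    BASE + "tours-by-date-february-2019/": ["best-of-jamaica-in-ten-days"],
    BASE + "tours-by-date-2019/": ["best-of-jamaica-in-ten-days", "cuba-in-a-week"],
    BASE + "jamaica/": ["best-of-jamaica-in-ten-days"],
}


def canonical_url(slug):
    """Assumed convention: every tour has exactly one 'home' url."""
    return f"{BASE}{slug}/"


def duplicated_tours(listings):
    """Map each tour to the listings it appears under, keeping only
    the tours that appear under more than one listing."""
    appearances = defaultdict(list)
    for listing_url, slugs in listings.items():
        for slug in slugs:
            appearances[slug].append(listing_url)
    return {slug: urls for slug, urls in appearances.items() if len(urls) > 1}


duplicates = duplicated_tours(listings)
for slug, urls in duplicates.items():
    print(f"{slug} appears under {len(urls)} listings; canonical: {canonical_url(slug)}")
```

Every url in that duplicates list is a candidate for either pointing at the canonical url or being deindexed.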

Then, problem #2: There are actually a plethora of ways these duplicate methods of serving data are created. Here are the most common…

Http vs Https – Most folks don’t take the five minutes to ensure their now-SSL-encrypted website is redirecting all “http” requests to “https”. Those two protocols are seen as two separate sites by Google.
www. vs non-www – Again, like the above, the version of your website with the “www.” prefix (eg. www.theda.co.za/) is a separate site to the one without “www”. In fact, the “www” is what is called a subdomain of the pure, naked, wonderful theda.co.za/ – and you can get tonnes of these – it’s easy to make blog.theda.co.za/ or mobile.theda.co.za/ or backup.theda.co.za/ etc etc.
The trailing slash – Would you believe it, that little slash (“/”) at the end of your url is another annoying way of Google seeing it as a separate url. Luckily, not another website, but “theda.co.za/SEO/” and “theda.co.za/SEO” are two different pages (to Google at least). This one is at least a lot easier to solve.
There are likely other ways – But if you have got this far, and fixed these issues (or your developer has) you have likely done 95% of the work.
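The three culprits above can be sketched as a single Python function that turns any variant of a url into the one preferred form. This is only an illustration of the idea, using the standard library’s urllib.parse; the function name normalise and its two options are my own, and on a real site you would enforce the same choices with 301 redirects on the server rather than in a script.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url, use_www=False, trailing_slash=True):
    """Return the preferred form of `url`: always https, one consistent
    choice on www, and one consistent choice on the trailing slash.

    Simplification: every path is treated like a page, so a file such as
    /brochure.pdf would also get a slash appended.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare_host = host[4:] if host.startswith("www.") else host
    host = "www." + bare_host if use_www else bare_host

    path = parts.path or "/"
    if trailing_slash:
        if not path.endswith("/"):
            path += "/"
    else:
        path = path.rstrip("/") or "/"

    return urlunsplit(("https", host, path, parts.query, parts.fragment))


variants = [
    "http://theda.co.za/SEO",
    "http://www.theda.co.za/SEO/",
    "https://www.theda.co.za/SEO",
    "https://theda.co.za/SEO/",
]
# All four variants collapse to a single url.
print({normalise(v) for v in variants})
```

Four urls that Google would treat as four pages come out as one, which is exactly what the redirects and canonical tags are meant to achieve.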

The sites that I have seen produce these errors the worst are e-commerce sites, because there are myriad ways to segment/categorise products. Imagine a toaster that can be sorted by price, popularity, colour, in/out of stock, etc. It becomes a nightmare, as the site (in a sense) uses all those parameters to make a new url each time a combination is … combined. Pure torture, from an SEO perspective. But great for users wanting to segment for summers on end.
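As a rough illustration of the toaster problem, the Python sketch below strips the sorting and filtering parameters off a product-listing url to get back to the one url you would want as canonical. The shop domain and the parameter names (sort, colour, in_stock and so on) are hypothetical; every shop platform names these differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only re-sort or filter the same products.
NON_CANONICAL_PARAMS = {"sort", "order", "colour", "in_stock", "price_min", "price_max"}


def canonical_product_url(url):
    """Drop sorting/filtering parameters and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


filtered = [
    "https://shop.example.com/toasters/?sort=price",
    "https://shop.example.com/toasters/?colour=red&sort=popularity",
    "https://shop.example.com/toasters/?in_stock=1&colour=red&sort=price",
]
# Every filtered view maps back to the same listing page.
print({canonical_product_url(u) for u in filtered})
```

Each filtered view would then carry a canonical tag pointing at that one clean url, so users can keep segmenting while Google sees a single page.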

So why is this important, and why can’t Google figure it out for themselves? Well, they can, and they do, but rather help them help you. Much like with submitting a sitemap on Google Search Console, if you make life easier for them, they will reward you. They have what is called ‘crawl budget’ for each site (their crawlers don’t trawl the internet at no cost; it costs them money to run those spider bots) and you want them to spend it as economically and efficiently as possible. Hence, you try to get them to crawl only your most important pages (so deindex all the useless ones) and to recognise which pages are canonical and which ones merely point to them and are in themselves just there to make the site easier for your customers to use. So give the bots as little work to do as possible and they will love you for it!

So, key things to do for your website if you are starting from scratch are:

Plan the site’s architecture in advance with your developer and decide on the above up front
Choose http vs https (pretty obvious which one to go for)
www vs non-www (then force redirect the one to the other, across the board)
Trailing slash vs not (I prefer the trailing slash, as it can save a redirect on some server setups; the main thing is to pick one and stick to it)
What categories to have (don’t get carried away!)
How to set up your permalinks (from day one; I personally like “/category/blog-post-name/”)
Getting your sitemaps done right (after the above is all sorted)
What to deindex (at the end)
Then I’d also say to go and check if you are getting any errors (404s and others) using a tool that works well for you.

If it is too late and you already have 1,543 errors, then good luck doing all the hard work of going and changing all the already live content you have to the correct way. It is certainly not impossible, just very tedious! Lastly, just make sure to use the absolute url in your canonical tag. So for instance do this:

Instead of just:
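To illustrate the difference between the two, here is a small Python sketch. The page and its urls are assumptions borrowed from the trailing-slash example earlier, and the helper names are my own. The first tag uses an absolute url (protocol and domain included), the second only a relative path, and the helpers show how a relative one can be detected and converted with the standard library.

```python
from urllib.parse import urljoin, urlsplit

# Assumed example tags, built from the theda.co.za/SEO/ url used earlier.
preferred = '<link rel="canonical" href="https://theda.co.za/SEO/" />'
avoid = '<link rel="canonical" href="/SEO/" />'


def is_absolute(href):
    """True only when the href carries both a protocol and a domain."""
    parts = urlsplit(href)
    return bool(parts.scheme and parts.netloc)


def make_absolute(page_url, href):
    """Resolve a relative canonical href against the page it sits on."""
    return urljoin(page_url, href)


print(is_absolute("https://theda.co.za/SEO/"))
print(is_absolute("/SEO/"))
print(make_absolute("https://theda.co.za/business/", "/SEO/"))
```

The absolute form leaves no doubt about protocol, subdomain or path, which are exactly the things that cause duplicates in the first place.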

Couldn’t be bothered? Then get in touch and I can likely help you and, teaming up with a developer, can hopefully iron this all out for you in record time.