The Biggest Revelations From The Google Search Leak 2024

A disclaimer right at the beginning: no one has leaked Google’s official search algorithm. What leaked instead was a collection of thousands (14,014, to be exact) of previously unreleased internal Google factors, or attributes, that gave many people an insight into how Google Search “really” works.

One of the most widely accepted conclusions from this leak is that Google representatives have, over the years, misled people about how Google actually ranks pages and what website owners and publishers must do to rank better on Google search engine results pages (SERPs).

We look at some of the most significant disclosures from this leak, how they differ from what Google has been saying publicly all along, and what they mean for website owners and publishers.

The Highlights of the Google Search Document Leak in Short

If you don’t have the time to go through everything that these documents reveal, just take a look below at some of the most important points the documents cover.

Ranking attributes: The Google Search document leak contains 2,596 modules, which include 14,014 attributes. However, these documents don’t mention how much weight each attribute gets.
Identifying the type of business: The documents reveal Google has processes in place to identify various business models such as news, YMYL, small websites / blogs, ecommerce and video sites. But the documents don’t reveal the purpose of this segmentation.
Measuring clicks: Google uses metrics like badClicks, goodClicks, lastLongestClicks, and unsquashedClicks to understand user interactions.
Brand matters:
Links: Links are a vital factor that Google considers while ranking pages. The more diverse and fresh the links, the better the page’s chances of ranking higher. The documents also suggest that anchor text matters less than previously believed.
Google tracks Chrome browser data: Although Google has repeatedly denied this, the documents suggest that Google does indeed use data from the Chrome browser to understand user behavior.
Site authority score: This is another aspect Google has denied in the past. But the latest leak reveals that Google does in fact maintain a site authority score.
Sandbox: Yet another feature Google has denied. Not only does Google have an attribute called “hostAge” that identifies a site’s age, but the documentation also uses the actual term “sandbox” in connection with “fresh spam.”
Google keeps a record of older web page versions: While the documents don’t reveal this directly, it is logical to conclude that Google stores earlier versions of every web page on a website. The maximum limit is 20, which means that if you update a particular page more than 20 times, you could make the earlier versions disappear from Google’s archive.
AI Overviews absent: There’s no mention of AI Overviews anywhere in the document. This lends more credence to what many SEO experts believe — the documents leaked are not exactly new.
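The 20-version limit mentioned in the list above behaves like a fixed-size buffer. The sketch below is only an illustration of that idea, not Google’s implementation: the class name PageHistory, the constant MAX_VERSIONS and the assumption that the oldest version is dropped first are all hypothetical.

```python
from collections import deque
from typing import List

MAX_VERSIONS = 20  # cap taken from the article; the eviction order is an assumption


class PageHistory:
    """Hypothetical store that keeps only the most recent versions of a page."""

    def __init__(self, max_versions: int = MAX_VERSIONS) -> None:
        # deque(maxlen=n) discards the oldest entry once n entries are exceeded
        self._versions = deque(maxlen=max_versions)

    def record(self, content: str) -> None:
        """Store a new version of the page."""
        self._versions.append(content)

    def versions(self) -> List[str]:
        """Return the stored versions, oldest first."""
        return list(self._versions)


history = PageHistory()
for revision in range(1, 26):  # simulate 25 updates to the same page
    history.record(f"revision {revision}")

print(len(history.versions()))  # 20
print(history.versions()[0])    # revision 6
```

Under these assumptions, the first five revisions are no longer retrievable after the 25th update, which is the effect the article describes.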

Is the Google Search Document Leak Genuine?

Yes! Google has officially confirmed that the leaked search documentation is genuine. That said, note that in an email reply to The Verge on 29th May 2024, Google spokesperson Davis Thompson warned against drawing incorrect conclusions from information that could be incomplete, out of context or no longer true.

In the statement, Google’s spokesperson went on to say that “We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation.”

The leaked documentation was first discovered by Rand Fishkin (of SparkToro) and Mike King, who explained what the documents actually meant in a detailed article. Some reports suggest that it was Erfan Azimi of EA Digital Eagle who first spotted the leak.

It’s believed that the leak occurred due to an error at Google’s end on March 13. The leaked documents describe a previous (undated) version of Google’s Content Warehouse API, and offer a treasure trove of information on how search engine rankings work.

The documents don’t contain any code, but repeatedly refer to the in-house systems and projects and look like internal documents detailing these 14,000 attributes that Google uses to rank pages using its ultra-secret algorithm.

We look at some of these attributes and how Google uses them to identify and rank pages.

1. NavBoost Tracks Users’ Clicks

NavBoost is a feature that Google uses to rank pages based on a user’s click behavior. Google had been denying that they use such a parameter to rank pages, but this leak proves otherwise.

What’s even more interesting is that Google uses data from its own Chrome browser. According to the documents, Google assigns each link a specific ratio based on the number of clicks. The documents also mention that links, topics and clicks are interconnected.

With this information coming to the fore, it’s not much of a stretch to conclude that Google may extend such usage to Android devices as well to track clicks.

2. Not Just Links, The Right Type of Links

A lot has been said over the years about links being a vital part of increasing your site’s ranking even though Google’s own John Mueller in a recent tweet suggested that links don’t carry as much value.

According to the leaks, while links from established sources boost your website reputation in Google’s eyes, the reverse is also true — that is, links from spammy websites may lead to Google distrusting your own site.

That’s why it’s important to ensure that referring domains to your site are genuine and relevant ones.

Another attribute in the leak that makes this conversation about links more interesting is the one mentioned in the “AnchorAnchorSource” module. According to this module, local relevance may be a factor that Google considers when it weighs links.

The specific attribute mentioned in the module is “localCountryCodes,” which stores the countries the page is local to or most relevant for.

A hot topic of debate in many digital PR circles has been the importance of links coming from different regions, and this leak goes a long way toward settling that debate.

In addition to the above points, the leaks also suggest that Google may weigh links from newer pages more heavily compared to those from older content. The document covers this aspect under the “sourceType” attribute, where the algorithm supposedly ranks the quality of a link’s source page by relating it to the page’s index tier.

Google specifically references the attribute “TYPE_FRESHDOCS” and considers it to be the same as “high quality” links. One important point to note is that Google can consider a page to be “high quality” without it being a fresh page.

A final point about links: the more Google trusts the homepage of a specific website, the more weight Google gives to links from that website. This can be traced to the “homePageInfo” attribute.

3. Google Sees More Than Anchor Text to Check Links

For the longest time, those within the SEO industry have shared differing views about the best practices when using anchor text — the in-content text — to link to other pages. According to Google’s official Search Central documentation, Google’s algorithm uses anchor text data to understand something about the page that the text links to.

However, according to the latest leaks, Google does not use only the anchor text to understand such context. The content around the anchor text also plays a vital role to help Google understand more about the link.

The documents refer to the attributes “context2,” “fullLeftContext,” and “fullRightContext,” which hold the terms near the link, and suggest that Google uses these words to decide the relevance of the link. This also supports what we saw before: links from relevant content or pages get more weight than those from content that’s not as relevant.
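To make the idea of “left” and “right” context concrete, here is a minimal sketch that pulls the words on either side of each link out of a snippet of HTML. It is only an illustration: the leak does not say how Google builds these fields, and the function link_contexts, the window size and the keys left_context and right_context are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class AnchorContextParser(HTMLParser):
    """Collects page words and remembers where each link's anchor text sits."""

    def __init__(self) -> None:
        super().__init__()
        self.words = []    # every text word, in document order
        self.anchors = []  # (href, start_index, end_index) for each link
        self._open = None  # (href, start_index) of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = (dict(attrs).get("href"), len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.anchors.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(data.split())


def link_contexts(html, window=5):
    """Return anchor text plus up to `window` words on each side of every link."""
    parser = AnchorContextParser()
    parser.feed(html)
    parser.close()
    results = []
    for href, start, end in parser.anchors:
        results.append({
            "href": href,
            "anchor": " ".join(parser.words[start:end]),
            "left_context": " ".join(parser.words[max(0, start - window):start]),
            "right_context": " ".join(parser.words[end:end + window]),
        })
    return results


sample = (
    "<p>Our guide to trail running shoes recommends "
    '<a href="https://example.com/review">this review</a> '
    "for runners with wide feet.</p>"
)
for link in link_contexts(sample, window=4):
    print(link)
```

In this sample the anchor text (“this review”) says little on its own, while the surrounding words (“trail running shoes recommends” and “for runners with wide”) carry the topic. That is the kind of signal the context attributes appear to capture.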

4. Yes! Site Age Matters & Google Tracks This

Something that’s been mentioned repeatedly in different ways is that site age matters. To put it more precisely, this refers to the age of the site, the content and the domain. But not all of this points in only one direction.

While Google does have a thing for websites that have been around a while, it wants websites to offer fresh content. Google is more likely to trust your site as it becomes older, but it would prefer that you put out fresher, up-to-date content.

Not only this, Google also encourages you to make it clear that said content is fresh by showing when it was last updated. Also, make sure that the information is consistent across your page by avoiding different dates in the URL (if that’s something you typically do) and the web page as this will end up confusing Google’s algorithm and users who visit your site.
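If your URLs include dates, the consistency check described above can be automated. The following is a minimal sketch, assuming a /YYYY/MM/DD/ URL pattern; the helper names date_from_url and dates_conflict are hypothetical, and the leak does not describe how Google itself reconciles conflicting dates.

```python
import re
from datetime import date

# Matches a /YYYY/MM/DD segment in a URL path (an assumed URL convention)
URL_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return the date embedded in the URL, or None if there is no valid one."""
    match = URL_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # e.g. /2024/13/45/ is not a real date
        return None


def dates_conflict(url, last_updated):
    """True when the URL carries a date that differs from the on-page date."""
    url_date = date_from_url(url)
    return url_date is not None and url_date != last_updated


print(dates_conflict("https://example.com/2021/03/15/seo-tips", date(2024, 6, 1)))  # True
print(dates_conflict("https://example.com/blog/seo-tips", date(2024, 6, 1)))        # False
```

A page flagged this way shows one date in its URL and another in its “last updated” label, which is the mismatch the advice above warns against.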

Another interesting fact concerns the attribute “hostAge,” which refers to the age of the domain. While the documents don’t directly say that a website’s age is used to rank it, “hostAge” has been linked to the sandbox.

For those unfamiliar with the term, the sandbox is a practice several SEOs have described over the years: a kind of “cooling off period” before Google ranks a new website. Google has denied this practice, but the presence of the exact term in the leaked documentation raises more than a few eyebrows.

5. Things to Avoid

While there are several things to do to rank on Google SERPs, it’s also worthwhile to note what you should avoid so that Google’s algorithm doesn’t penalize you. The recent leak mentions several such factors that website owners can look into:

Poor navigational experience on your website can hurt your rankings.
Google has a score that measures “user click dissatisfaction.” Google doesn’t just consider the amount of time a user spends on your website. If the user continues to search for the same information after visiting your site, that could also count against you.
If you’re trying to rank a page for a location that’s not linked to your location identity, it could count against you.
Google tracks an attribute aptly named “gibberishScores” to identify fluff, AI-generated and downright senseless content on a website.
Google also has a “keywordStuffingScore,” similar to the “gibberishScore,” that flags unwarranted keyword stuffing practices.
Google’s algorithm maintains a “spamRank” that calculates the likelihood of a document linking to known spammers.
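The leak doesn’t say how scores such as “keywordStuffingScore” are calculated, so you can’t reproduce them. You can, however, run a rough self-audit of your own copy. The sketch below simply measures how much of a text its most repeated words take up; the function keyword_density is a hypothetical helper, and the result is not Google’s score or an approximation of it.

```python
import re
from collections import Counter


def keyword_density(text, top_n=3):
    """Return the top_n most frequent words with their share of all words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return []
    counts = Counter(words)
    return [(word, count / len(words)) for word, count in counts.most_common(top_n)]


copy = "cheap shoes cheap shoes buy cheap shoes online"
for word, share in keyword_density(copy, top_n=2):
    print(f"{word}: {share:.1%}")
```

Here two words account for three quarters of the text, which is a strong hint that the copy was written for a search engine rather than for a reader.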

6. Google Quantifies Sitewide Authority

One of the biggest (and most surprising) revelations in the leak, especially for SEOs, is the presence of an attribute called “siteAuthority.” Several SEO tools, such as Moz (Domain Authority, or DA) and Ahrefs (Domain Rating, or DR), use comparable metrics to score websites, but Google has time and again denied ever using such a metric to rank pages.

That said, Google’s statements on several aspects have long left a lot to be desired and have on many occasions confused, rather than helped, those who rely on its advice. One such example is when Google’s John Mueller stated in 2020 that “Just to be clear, Google doesn’t use Domain Authority *at all* when it comes to Search crawling, indexing, or ranking.”

But later that same year, he said about domain authority “I don’t know if I’d call it authority like that, but we do have some metrics that are more on a site level, some metrics that are more on a page level, and some of those site-wide level metrics might kind of map into similar things.”

Such statements from Google’s representatives, combined with the appearance of such a term in the leaked documentation, leave users utterly confused.

It’s not exactly clear what parameters Google uses to determine the “siteAuthority” attribute. Both Moz and Ahrefs use backlink data — both quality and quantity — to determine their DA and DR scores respectively. But it looks like Google uses a combination of page-level quality scores that involve click data and other NavBoost signals to arrive at its own score.

7. Google Identifies Links From High Quality Websites

Google saves more than just the usual amount of information for specific links. This could mean links from websites like The Wall Street Journal, The New York Times, or The LA Times carry more weight than links from other known but less important websites.

The attribute “encodedNewsAnchorData” tracks information about the newsiness of the anchor text by determining whether the website is a “newsy, high quality” source. While it’s not exactly clear how much more weight such links may have, it does suggest that using digital PR to get links from such news websites could be incredibly valuable.

What to Make of the Google Search Leak?

Over the years, while Google regularly laid out the best practices to rank, it also held back some information, and rightly so. Putting all the information in the public domain would make things easier for those trying to game the system, which is why you can’t fault Google for holding some of it back.

What does the documentation leak mean for website owners, publishers and SEO specialists? Well, for one, what was true before still remains true. While Google was putting out information it deemed helpful to those who were using their search products, there were also those who were testing things out for themselves — and this still is a great way to get things done.

One thing’s for sure, the leaked documentation has certainly given those that test the waters consistently lots to test for! The documentation has given a mother lode of information, but it should be considered as just one part of the puzzle.

One thing that many in the industry agree on is that creating content to help the users who search for the information (and not Google) should be the constant guiding light. There’s a continuous learning curve that you should be ready to embrace in order to sail through the ever-changing waters of Google search.