How Google uses data to train AI & index content in a granular manner

We discussed Google’s announcement of passage and sub-topic indexing and how to optimise for it. In the same announcement, they also mentioned that they can now understand different segments of a video and return specific timestamps of relevant content to the user. How did Google get to this amazing stage? How did it train its AI to achieve this level of granularity? A discussion.

We will not discuss the technical aspects of the AI algorithm. Our focus here is how Google got to this stage of discerning content. The first step is data and, as we well know, we are the source!

Let us consider just a handful of examples to appreciate how granular indexing was achieved. First, what is granular indexing? This just refers to Google’s ability to tell site A and site B apart; video A and video B apart; passage 1 and passage 10 in the same article apart; timestamps 3:20 and 4:50 in the same video apart, and so on.

1 The disavow tool

The first example I can think of is the disavow tool in Google Search Console. What was once an integral part of Search Console (when it was called Google Webmaster Tools, prior to May 2015) has now been moved to obscurity because of AI capabilities.

If we suspect that a spammy or suspicious website has linked to our site, we can inform Google about this by uploading a list of all such spammy sites in a text file. Google happily collected this information from tens of thousands of webmasters for several years and fed it to their AI algorithm.
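For illustration, a disavow file is a plain text file with one entry per line: either a full URL, or a whole domain prefixed with `domain:`. Lines starting with `#` are comments. The minimal sketch below builds such a file in Python. The helper name `build_disavow_file` and the `.example` domains are hypothetical and are not part of any Google tooling.

```python
def build_disavow_file(domains, urls=(), note=None):
    """Return the text of a disavow file (hypothetical helper, for illustration)."""
    lines = []
    if note:
        # Lines beginning with '#' are treated as comments.
        lines.append(f"# {note}")
    # 'domain:' entries disavow every link coming from that domain.
    lines.extend(f"domain:{d}" for d in sorted(set(domains)))
    # Bare URLs disavow links from those individual pages only.
    lines.extend(sorted(set(urls)))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up domains, used purely as placeholders.
    text = build_disavow_file(
        domains=["spammy-links.example", "cheap-seo.example"],
        urls=["http://blog.example/bad-page.html"],
        note="Links we could not get removed",
    )
    print(text, end="")
```

The resulting text would be saved as a `.txt` file and uploaded through the disavow tool. Each such upload is effectively a labelled list of sites that a human considers spammy.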

With access to so much data, they could tinker with their AI code until it could tell a good site from a spammy one with near perfection. At that point they happily told webmasters that manual disavowal was no longer necessary (although still available) and “AI would take care of it”.

The granular indexing formula: Get data from users, use it as input, perfect the code and let it loose on the web.

This Twitter poll says it all!

How many of you have disavowed links in GSC this year?

- Tim Soulo (@timsoulo), October 8, 2020

2 Employing Quality Raters

The data they got from searchers and users was not enough! Google actually employed thousands of quality raters, armed with extensive guidelines, to test how the AI was coughing up results. The raters were trained to identify sites with authentic information, especially in YMYL (your money or your life) niches like health and finance.

For example, if a website claims that the Earth is flat, it would never feature in scientific searches; if a website claims (to borrow an example from the actual guidelines) that carrots can cure cancer, the site could even be removed entirely from the index.

Quality raters are used to evaluate author profiles, site privacy policies, conflicts of interest, etc. Once they got enough information to create a search pattern, Google made an extraordinary announcement in late 2019.

3 Removing Manual Submission to Google News

This announcement shocked those who wanted to “get into” Google News – a huge source of traffic and authority. Previously, site owners had to apply for inclusion in Google News. Now the AI decides which is a news site and which is not. In fact, the AI is so good that it can differentiate between articles A and B from the same site: A could feature in Google News while B does not. Here are two examples of this.

Screenshot of freefincal articles appearing on Google News

Brief Google News appearance of freefincal articles as seen from search console screenshot

A search for “freefincal” on the Google News tab occasionally shows actual articles instead of references to freefincal or its author in news sites already on Google News. This is a precise example of granular indexing.

Other examples

Google Maps understands the nature of a business from its reviews.
YouTube can now understand comments and offers a choice of automated responses to the channel owner.
They study likes, dislikes and comments to understand which videos would keep viewers engaged on YouTube longer.
Gmail can now understand what emails are about and offers automated responses and autocomplete.

Passage indexing, video key moment indexing and sub-topic indexing are all part of this progression arc. The future is exciting but also scary. This is the reason why they say “data is the new oil”. Google and Facebook stockholders must be laughing each time they check their demat accounts!