My last post detailed the frustration I now feel when searching Google, using a search for “twitter client mac” as an example.
This post will explain the solution. I don’t think it is too complicated and certainly it is simpler than the original PageRank algorithm itself. Unfortunately, the solution involves people, which means it is messy.
-
Fix the PageRank algorithm by considerably expanding the concept of a “link”. That’s right…by expanding the concept of a link, the opposite of what Google has tried to accomplish with the “nofollow” attribute. In fact, Google has already expanded the concept of a link somewhat by trying to follow JavaScript and other tricky redirects. Now Google (or a new competitor!) needs to follow links that are purely conceptual or textual. For example, the name of this site is “twitmenulet”, a series of characters that only ever appear together in one place: here. Whenever those characters appear on the web, Google should recognize that the person writing them is talking about this place. That discussion should constitute a link. I can’t tell you how many people have written on Twitter or in blog postings that they liked Twit Menulet but then forgot to include a link. The presence or absence of an anchor in the HTML code shouldn’t matter a bit. The same goes for mentions of Twitter, Google, and other very specific names.
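To make the idea concrete, here is a rough sketch of how a crawler might turn a bare mention into a link. Everything in it is an assumption made for illustration: the `KNOWN_NAMES` registry, the placeholder domain `twitmenulet.example`, and the `implicit_links` function are hypothetical, and a real system would need a far better way of deciding which names are distinctive enough to count.

```python
import re

# Hypothetical registry: distinctive names mapped to the site they refer to.
# The domain is a placeholder, not a real address.
KNOWN_NAMES = {
    "twitmenulet": "twitmenulet.example",
    "twit menulet": "twitmenulet.example",
}

def implicit_links(page_text, explicit_targets=()):
    """Return sites that page_text mentions by name but does not link to."""
    text = page_text.lower()
    found = set()
    for name, site in KNOWN_NAMES.items():
        # Word boundaries keep the name from matching inside a longer word.
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.add(site)
    # Sites that already have a real anchor need no implicit link.
    return found - set(explicit_targets)

print(implicit_links("I really like Twit Menulet!"))
```

A tweet that praises Twit Menulet without an anchor would then still contribute a link, while a page that already links to the site would not be counted twice.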
In addition, and more subtly, discussion of identical topics should constitute a link or partial link. For example, sites or stories containing the words “pagerank matrix eigenvector” are clearly linked to each other conceptually and should therefore be considered linked in the PageRank algorithm. You may ask: what about spam? Wouldn’t this lead to a proliferation of thousands of blog comments containing nonsense words? There is only one answer to this question and it is the human answer, the Yahoo answer, the Alexa answer, which Google refuses to engage with:
-
Links must be weighted by their apparent value to real humans. The original PageRank algorithm is mathematically equivalent to a “random-surfer” model and yields estimates of how much time a random surfer would spend on each web page. Plainly, this is ridiculous. Spam links, by definition, could never fool a human surfer and therefore shouldn’t be considered real links. Google needs to buy Alexa or get equivalent data on where people actually spend their time–which links are actually followed–and use that information to tweak the PageRank matrix. It makes no sense to rank a site highly because a random surfer would spend an enormous chunk of time on the site, if a human surfer would leave immediately.
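Here is a minimal sketch of what weighting links by human behavior could look like, assuming we somehow have click counts for each link. The `pagerank` function, the damping value of 0.85, and the toy page names are my own illustrative choices, not anything Google has published about its production system. With equal weights it behaves like the plain random surfer; with click counts, the surfer follows links in proportion to how often real people do.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Power iteration over weighted links.

    weights[src][dst] is the observed click count for that link
    (use 1 everywhere to recover the plain random-surfer model).
    """
    pages = set(weights)
    for targets in weights.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets the "teleport" share first.
        new = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            out = weights.get(src, {})
            total = sum(out.values())
            if total == 0:
                # Page with no followed links: spread its rank evenly.
                for p in pages:
                    new[p] += damping * rank[src] / n
            else:
                # Pass rank along each link in proportion to its weight.
                for dst, w in out.items():
                    new[dst] += damping * rank[src] * w / total
        rank = new
    return rank

# Toy web: a hub links to a useful page and a spam page.
uniform = {"hub": {"useful": 1, "spam": 1}, "useful": {"hub": 1}, "spam": {"hub": 1}}
clicks = {"hub": {"useful": 95, "spam": 5}, "useful": {"hub": 1}, "spam": {"hub": 1}}

by_links = pagerank(uniform)
by_clicks = pagerank(clicks)
print(by_links["useful"], by_links["spam"])
print(by_clicks["useful"], by_clicks["spam"])
```

In the uniform case the useful page and the spam page come out tied, which is exactly the complaint. Once the hub's links are weighted by the made-up click counts, the spam page drops well below the useful one.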
I believe that this information on human behavior should come partly from Google’s own site. Sites could be cycled within search results (occasionally giving the site higher visibility than it normally has) to detect the level of human interest in the site. Sites that get a click for a given search would then be linked within the PageRank matrix to other sites that tend to get clicks for that search.
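The co-click idea can be sketched in a few lines, assuming a log of (query, clicked site) pairs is available. The log format, the `co_click_links` function, and the choice of the smaller click count as the link weight are all assumptions for illustration, and the `.example` domains are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def co_click_links(click_log, min_clicks=1):
    """Link sites that get clicked for the same search.

    click_log is an iterable of (query, site) pairs.
    Returns {(site_a, site_b): weight} with site_a < site_b.
    """
    per_query = defaultdict(lambda: defaultdict(int))
    for query, site in click_log:
        per_query[query][site] += 1

    links = defaultdict(int)
    for sites in per_query.values():
        clicked = sorted(s for s, c in sites.items() if c >= min_clicks)
        for a, b in combinations(clicked, 2):
            # Assumed weighting: the pair is only as strong as its weaker member.
            links[(a, b)] += min(sites[a], sites[b])
    return dict(links)

log = [
    ("twitter client mac", "a.example"),
    ("twitter client mac", "b.example"),
    ("twitter client mac", "a.example"),
    ("pagerank", "c.example"),
]
print(co_click_links(log))
```

The resulting pairs could then be added as extra entries in the link matrix, alongside ordinary hyperlinks.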
I believe that these two changes could fix the PageRank algorithm. Fix #1 would, at a stroke, bring Wikipedia, Twitter, TechCrunch, and other PageRank hoarders back into the connected web. It’s called the internet for a reason, folks. Fix #2 would begin to eliminate the spam problem by finding out what real people consider to be the correct “answers” to their search problems. And, if both of these things happen, then perhaps we’d all be able to find a nice Twitter client more easily!