Google should open source what actually matters: their search ranking algorithm

Websites live or die based on how a small group of programmers at Google decide their sites should rank in Google’s main search results. As the “router” of the vast majority of traffic on the internet, Google’s secret ranking algorithm is probably the most powerful piece of software code on the planet.

Google talks a lot about openness and their commitment to open source software. What they are really doing is practicing a classic business strategy known as “commoditizing the complement.”*

Google makes 99% of their revenue by selling text ads for things like plane tickets, DVD players and malpractice lawyers. Many of these ads are syndicated to non-Google properties. But the anchor that gives Google their best “inventory” is the main search engine at Google.com. And the secret sauce behind Google.com is the algorithm for ranking search results. If Google is really committed to openness, it is this algorithm that they need to open source.

The standard argument against doing so is that search spammers would learn from the algorithm and improve their spamming methods. In the security community, this line of reasoning is known as “security through obscurity.” It is an approach generally associated with companies like Microsoft, and security experts widely regard it as ineffective and risky. When you open source something, you give the bad guys more information, but you also enlist an army of good guys to help you fight them.

Until Google open sources what really matters – their search ranking algorithm – you should dismiss all their other open-source talk as empty posturing. And millions of websites will have to continue blindly relying on a small group of anonymous engineers in charge of the secret algorithm that determines their fate.

* You can understand a large portion of technology business strategy by understanding strategies around complements. One major point: companies generally try to reduce the price of their products’ complements (Joel Spolsky has an excellent discussion of the topic here). If you think of the consumer as having a fixed willingness to pay, N, for product A plus its complement B, then the makers of A and B are each fighting for a bigger piece of that fixed pie. This is why, for example, cable companies and content companies are constantly battling. It is also why Google wants open source operating systems to win, and wants broadband to be cheap and ubiquitous. [link to full post]

Anatomy of a bad search result

In a post last week, Paul Kedrosky noted his frustration when looking for a new dishwasher using Google.  I thought it might be interesting to do some forensics to see which sites rank highly and why.

Paul started by querying Google with the phrase dishwasher reviews:

[Screenshot: Google results for “dishwasher reviews”]

Pretty much every link on this page has an interesting story to tell about the state of the web.  I’ll just focus here on the top organic (non-sponsored) result:

http://www.consumersearch.com/dishwasher-reviews

Clicking through this link takes you here:

[Screenshot: the Consumersearch dishwasher reviews page]

Consumersearch is owned by About.com, which in turn is owned by the New York Times.

So how did consumersearch.com get the top organic spot? Most SEO experts I talk to (e.g. SEOmoz’s Rand Fishkin) think inbound links from a large number of domains still matter far more than any other factor. One of the best tools for finding inbound links is Yahoo Site Explorer (which, sadly, is supposed to be killed soon). Using this tool, here’s one of the sites linking to the dishwasher section of Consumersearch:

http://www.whirlpooldishwasher.net/

[Screenshot: the whirlpooldishwasher.net home page]

(Yes, this site’s CSS looks scarily like my own blog – that’s because we both use a generic WordPress template).
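To make the “links from a large number of domains” signal Rand mentions a bit more concrete, here is a toy sketch in Python that just counts how many distinct domains link to each page. It is purely illustrative – nothing like Google’s actual algorithm – and the crawl data in it is made up:

```python
from urllib.parse import urlparse
from collections import defaultdict

def unique_linking_domains(links):
    """Given (source_url, target_url) pairs, count distinct linking domains per target.

    A crude stand-in for the 'links from many domains' signal --
    real search engines weight links by far more than raw counts.
    """
    domains_per_target = defaultdict(set)
    for source_url, target_url in links:
        source_domain = urlparse(source_url).netloc
        domains_per_target[target_url].add(source_domain)
    return {target: len(domains) for target, domains in domains_per_target.items()}

# Hypothetical crawl data, for illustration only
links = [
    ("http://www.whirlpooldishwasher.net/", "http://www.consumersearch.com/dishwasher-reviews"),
    ("http://another-fake-blog.example/", "http://www.consumersearch.com/dishwasher-reviews"),
    ("http://some-forum.example/thread/42", "http://www.consumersearch.com/dishwasher-reviews"),
]
print(unique_linking_domains(links))
# {'http://www.consumersearch.com/dishwasher-reviews': 3}
```

Even this crude version makes clear why a network of cheap, throwaway domains linking to one target is valuable to a spammer.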

This site appears to have two goals: 1) fool Google into thinking it’s a blog about dishwashers and 2) link to consumersearch.com.

Who owns this site?  The Whois records are private. (Supposedly the reason Google became a domain registrar a few years ago was to peer behind the domain name privacy veil and weed out sites like this.)

I spent a little time analyzing the “blog” text (it’s actually pretty funny – I encourage you to read it).  It looks like the “blog posts” are fragments from places like Wikipedia run through some obfuscator (perhaps by machine translating from English to another language and back?).  The site was impressively assembled from various sources. For example, the “comments” to the “blog entries” were extracted from Yahoo Answers:

[Screenshot: a “comment” on whirlpooldishwasher.net]

Here is the source of this text on Yahoo Answers:

[Screenshot: the same text in a Yahoo Answers thread]
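For what it’s worth, the round-trip translation trick I speculated about above is simple to sketch. The `translate()` helper below is hypothetical – in practice a spammer would plug in some machine translation service – but the structure of the trick is just this:

```python
def translate(text, source_lang, target_lang):
    """Hypothetical stand-in for a call to a machine translation service."""
    raise NotImplementedError("plug in a real translation API here")

def obfuscate(text, pivot_lang="de"):
    """Round-trip a passage through another language so the result is
    topically identical but no longer a verbatim copy of the source."""
    foreign = translate(text, source_lang="en", target_lang=pivot_lang)
    return translate(foreign, source_lang=pivot_lang, target_lang="en")

# e.g. obfuscate(wikipedia_paragraph_about_dishwashers)
```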

The key is to have enough dishwasher-related text to look like a blog about dishwashers, while also having enough text diversity to avoid being detected by Google as duplicative or automatically generated content.
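Duplicate detection itself is no mystery in broad strokes: the textbook approach is to compare documents by their overlapping word “shingles.” Here is a minimal sketch of that idea – Google’s actual duplicate and spam detection is, of course, far more sophisticated and entirely undisclosed:

```python
def shingles(text, k=4):
    """Return the set of k-word sequences ('shingles') in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_similarity(text_a, text_b, k=4):
    """Fraction of shingles shared by two texts; close to 1.0 means near-duplicate."""
    a, b = shingles(text_a, k), shingles(text_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# A page that copies Wikipedia verbatim scores close to 1.0 against the source;
# a round-tripped, reworded version scores much lower while staying on topic.
```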

So who created this fake blog? It could have been Consumersearch, or a “black hat” SEO consultant, or someone in an affiliate program Consumersearch doesn’t even know about. I’m not trying to imply that Consumersearch did anything wrong. The problem is systemic. When you have a multibillion-dollar economy built around keywords and links, the ultimate “products” optimize for just that: keywords and links. The incentive to create quality content diminishes.

Google’s feature creep

Microsoft used to be considered the king of feature creep. Here’s what Microsoft Word looked like at its most cluttered:

[Image: Microsoft Word with the Office paperclip assistant in the way]

I don’t use any of Microsoft’s software anymore, but from what I hear they’ve toned down the feature creep a lot in recent versions of Windows and Word.

Google has been adding so many new features to its results page that it’s starting to feel like the new Microsoft. Here’s an approximation of what Google used to look like (I couldn’t find an image of an actual 1998 Google search results page – anyone have one?):

[Image: an approximation of an early Google results page]

And here is Google today:

[Screenshot: a Google results page from December 2009]

Options on the left, ads on top and on the right, news results up top, images, and buttons to vote results up/down and annotate them.  But worst of all are the new scrolling “real time” results.  The static image I’ve embedded doesn’t do justice to how annoying this is. Random, out-of-context, and mostly asinine fragments of conversations scrolling by.  I think it might be Google’s Clippy.

Search and the social graph

Google has created a multibillion-dollar economy based on keywords.  We use keywords to find things and advertisers use keywords to find customers.  As Michael Arrington points out, this is leading to increasing amounts of low quality, keyword-stuffed content. The end result is a very spammy internet. (It was depressing to see Tim Armstrong cite Demand Media, a giant domain-name owner and robotic content factory, as a model for the new AOL.)

Some people hope the social web – link sharing via Twitter, Facebook, etc. – will save us. Fred Wilson argues that “social beats search” because it’s harder to game people’s social graph. Cody Brown tweeted:

On Twitter you have to ‘game’ people, not algorithms. Look how many followers @demandmedia has. A lot less then you guys: @arrington @jason

These are both sound points. Lost amid this discussion, however, is that the links people tend to share on social networks – news, blog posts, videos – are in categories Google barely makes money on. (The same point also seems lost on Rupert Murdoch and news organizations who accuse Google of profiting off their misery).

Searches related to news, blog posts, funny videos, and so on are mostly loss leaders for Google. Google’s real business is selling ads for plane tickets, DVD players, and malpractice lawyers. (I realize this might be depressing to some internet idealists, but it’s a reality.) Online advertising revenue is directly correlated with finding users who have purchasing intent. Google’s primary competitive threats are product-related sites, especially Amazon. As it gets harder to find a washing machine on Google, people will skip search and go directly to Amazon and other product-related sites.

This is not to say that the links shared on social networks can’t be extremely valuable. But most likely they will be valuable as critical inputs to better search-ranking algorithms. Cody’s point that it’s harder to game humans than machines is true, but remember that Google’s algorithm was always meant to be based on human-created links. As spammers have become more sophisticated, the good guys need new mechanisms to determine which links come from trustworthy humans. Social networks might be those new mechanisms, but that doesn’t mean they’ll displace search as the primary method for navigating the web.
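To make that idea concrete, here is a toy sketch of what “social signals as a ranking input” might look like: weight each shared link by how much the sharer is trusted. It’s purely illustrative – the data, the trust scores, and the weighting scheme are all hypothetical, and nobody outside Google (or Twitter, or Facebook) knows how, or whether, such signals are actually used:

```python
def trust_weighted_link_score(shares, trust):
    """Score each URL by summing the trust of the people who shared it.

    `shares` is a list of (username, url) pairs; `trust` maps usernames to a
    trust score (e.g. derived from account age, follower quality, past spam).
    Both the inputs and the weighting scheme are hypothetical.
    """
    scores = {}
    for user, url in shares:
        scores[url] = scores.get(url, 0.0) + trust.get(user, 0.0)
    return scores

# Made-up example data
shares = [("trusted_human", "http://example.com/a"),
          ("spambot123", "http://example.com/b"),
          ("spambot456", "http://example.com/b")]
trust = {"trusted_human": 0.9, "spambot123": 0.01, "spambot456": 0.01}
print(trust_weighted_link_score(shares, trust))
# {'http://example.com/a': 0.9, 'http://example.com/b': 0.02}
```

The point of the sketch: a thousand shares from throwaway accounts can still be worth less than one share from a trusted human, which is exactly why social signals are interesting to search engines.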

Some thoughts on SEO

“SEO” (search engine optimization) is a term widely used to mean “getting users to your site via organic search traffic.” I don’t like the term at all. For one thing, it’s frequently associated with illicit techniques like link trading and search engine spamming. It’s also associated with consultants who don’t do much beyond very basic things your own developers should be able to do. But the most pernicious aspect of the phrase is the word “optimization,” which suggests that SEO is a finishing touch, something you bolt on, rather than something central to the design and development of your site. Unfortunately, the term is so widespread that we’re stuck with it.

SEO is extremely important because normal users – those who don’t live and breathe technology – type only a few favorite websites directly into the URL bar and go to a search engine, most likely Google*, for everything else. In the 90s, people talked a lot about “home pages” and “site flow.” That matters if you are getting most of your traffic from people typing in your URL directly. For most startups, however, this isn’t the case, at least for the first few years. Instead, the flow you should be thinking about is users going to Google, typing in a keyphrase, and landing on one of your internal pages.
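One practical way to see this flow is in your own referrer logs: Google referrer URLs currently carry the query in a `q=` parameter, so you can map keyphrases to the internal pages they land on. A minimal sketch – the log entries below are made up for illustration:

```python
from urllib.parse import urlparse, parse_qs
from collections import Counter

def keyphrase_landings(referrer_page_pairs):
    """Map (referrer_url, landing_page) pairs to (keyphrase, landing_page) counts.

    Assumes the referrer is a Google search URL with the query in its 'q' parameter.
    """
    counts = Counter()
    for referrer, landing_page in referrer_page_pairs:
        parsed = urlparse(referrer)
        if "google." in parsed.netloc:
            query = parse_qs(parsed.query).get("q", [""])[0]
            if query:
                counts[(query, landing_page)] += 1
    return counts

# Hypothetical log entries
logs = [
    ("http://www.google.com/search?q=dishwasher+reviews", "/reviews/dishwashers"),
    ("http://www.google.com/search?q=dishwasher+reviews", "/reviews/dishwashers"),
    ("http://www.google.com/search?q=best+quiet+dishwasher", "/guides/quiet-dishwashers"),
]
print(keyphrase_landings(logs).most_common())
```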

The biggest choice you have to make when approaching SEO is whether you want to be a Google optimist or a Google pessimist**. Being an optimist means trusting that the smart people in the core algorithm team in Mountain View are doing their job well – that, in general, good content rises to the top.

The best way to be a Google optimist is to think of search engines as information marketplaces – matchmakers between users “demanding” information and websites “supplying” it. This means thinking hard about what users are looking for today, what they will be looking for in the future, how they express those intentions through keyphrases, where there are gaps in the supply of that information, and how you can create content and an experience to fill those gaps.
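If you want to make that “marketplace” framing concrete, the crudest version is a gap analysis: take the keyphrases your users might demand and check which ones your site doesn’t yet supply a page for. A toy sketch, with made-up volumes (real demand estimates would come from a keyword research tool):

```python
def keyword_gaps(demand, supplied_phrases):
    """Return demanded keyphrases with no matching page, biggest opportunities first.

    `demand` maps keyphrases to an estimated search volume (hypothetical numbers);
    `supplied_phrases` is the set of keyphrases your existing pages already target.
    """
    gaps = {phrase: volume for phrase, volume in demand.items()
            if phrase not in supplied_phrases}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

demand = {
    "dishwasher reviews": 50000,   # illustrative volumes, not real data
    "quiet dishwasher": 12000,
    "dishwasher repair": 8000,
}
supplied = {"dishwasher reviews"}
print(keyword_gaps(demand, supplied))
# [('quiet dishwasher', 12000), ('dishwasher repair', 8000)]
```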

All this said, there does remain a technical, “optimization” side to SEO. Internal URL structure, the text on your landing pages, and all the other things SEO consultants discuss do matter. Luckily, most good SEO practices are also good UI/UX practices. Personally, I like to handle all of this in house by asking our programmers and designers to include sites like SEOmoz, Search Engine Land, and Matt Cutts’s blog in their daily reading lists.
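As one small example of the “technical” side, clean internal URLs are mostly mechanical: turn a page title into a short, readable, keyword-bearing slug rather than an opaque query string. A minimal sketch:

```python
import re

def slugify(title):
    """Turn a page title into a clean, keyword-bearing URL slug.

    e.g. "The 10 Best Quiet Dishwashers (2009)" -> "the-10-best-quiet-dishwashers-2009"
    """
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)   # non-alphanumerics become hyphens
    return slug.strip("-")

print(slugify("The 10 Best Quiet Dishwashers (2009)"))
# the-10-best-quiet-dishwashers-2009
```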

* I’m just going to drop the illusion here that most people optimize for anything besides Google. ComScore says Google has ~70% market share, but everyone I know gets >90% of their search traffic from Google. At any rate, in my experience, if you optimize for Google, Bing/Yahoo will give you SEO love about one to six months later.

** Even if you choose to be a pessimist, I strongly recommend you stay far away from so-called black hat techniques, especially schemes like link trading and paid text ads that are meant to trick crawlers.  Among other things, this can get your site banned for life from Google.