What’s not evil: ranking content fairly *and* letting public content get indexed

Please see update at bottom

Most websites spend massive amounts of time and money to get any of their pages indexed and ranked by Google’s search engine. Indeed, there is an entire billion-dollar industry (SEO) devoted to helping companies get their content indexed and ranked.

Twitter and Facebook have decided to disallow Google from indexing 99.9% of their content. Twitter won’t let Google index tweets and Facebook won’t let Google index status updates and most other user and brand generated content. In Facebook’s case this makes sense for content that users have designated as non-public. In Twitter’s case, the vast majority of the blocked content is designated by users as public. Furthermore, Twitter’s own search function rarely works for tweets older than a week (from Twitter’s search documentation, they return “6-9 days of Tweets”).
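
How does a site block indexing in the first place? Via the robots exclusion protocol: the site publishes a robots.txt file, and compliant crawlers like Googlebot consult it before fetching each page. Here’s a minimal sketch using Python’s standard library; the rules below are illustrative, not the actual file served by Twitter or Facebook.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt rules -- NOT any site's actual file.
fake_robots_txt = """\
User-agent: Googlebot
Disallow: /statuses/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(fake_robots_txt.splitlines())

# A compliant crawler asks permission before fetching each URL.
for url in ("http://example.com/statuses/12345", "http://example.com/about"):
    print(url, "->", parser.can_fetch("Googlebot", url))
# http://example.com/statuses/12345 -> False
# http://example.com/about -> True
```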

There is a debate going on today in the tech world: Facebook and Twitter are upset that Google won’t highly rank the 0.1% of their content they do make indexable. Facebook and Twitter even created something they call the “Don’t be evil” toolbar, which reranks Google search results the way they’d like them to be ranked. The clear implication is that Google is violating its famous credo and acting “evil”.

The vast majority of websites would dream of having the problem of being able to block Google from 99.9% of their content and have the remaining 0.1% rank at the top of results. What would be best for users – and least “evil” – would be to let all public content get indexed and have Google rank that content “fairly” without favoring their own content. Facebook and Twitter are right about Google’s rankings, but Google is right about Facebook and Twitter blocking public content from being indexed.

Update: after posting this I got a bunch of emails, tweets and comments telling me that Twitter does in fact allow Google to index all their tweets, and that any missing tweets are the fault of Google, not Twitter. A few people suggested that without firehose access Google can’t be expected to index all tweets. At any rate, I think the “Why aren’t all tweets indexed?” issue is more nuanced than I argued above.

The TripAdvisor IPO

Great startup story. Raised a total of $4.2M in venture capital, sold to IAC/Expedia for $210M, and had some interesting adventures and pivots along the way. They started out by trying to aggregate reviews from other websites and white-label their product to Expedia and other large travel websites. TripAdvisor.com was just a showcase that accidentally became a destination site. As of today TripAdvisor is an independent public company, trading at a market cap of $3.5B.

Great for Boston. Fairly or not, Boston is often typecast as an infrastructure, B2B, hardware, and biotech town. Between TripAdvisor and Kayak, Boston now has at least two very important consumer internet companies.

Big win for the “golden age of SEO.” By this I mean roughly 2001-2008, when “demand” for content (people typing in search queries) far outpaced supply (good content). Companies like Yelp and TripAdvisor (along with Wikipedia, IMDB, etc.) grew huge during this period, almost entirely through SEO. They did this by getting highly defensible flywheels spinning, where more content meant more SEO, which meant more users, which meant more content. It is now far more difficult to grow a startup primarily through SEO. Almost all monetizable search categories have vast excesses of SEO’d content. Moreover, Google is creating its own content (e.g. Google Places) which, at least at times, it has favored in its search results.

The user experience should improve. MG Siegler and others have criticized TripAdvisor for an excess of ads. I don’t disagree with MG, but I also think this is largely the result of the broken online ad attribution system that punishes intent generators and rewards intent harvesters. Travel reviews serve users at the beginning of the travel research process (which on average takes weeks), but CPA and CPC ad programs pay only for the last click, which usually happens when users are purchasing tickets or making reservations. Hence review sites are forced to saturate their website real estate with purchasing widgets and display ads. Hopefully, as online ad attribution improves, this will no longer be necessary.
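
To make the incentive problem concrete, here is a toy sketch (with a made-up purchase journey and made-up numbers) of how last-click attribution assigns credit compared to a simple multi-touch model. Under last-click, the review site that generated the intent gets nothing:

```python
# A made-up booking journey: weeks of research, then a purchase.
journey = ["tripadvisor_review", "google_search", "expedia_checkout"]
booking_value = 200.0  # hypothetical commission-bearing booking

def last_click(touches, value):
    # All credit goes to whoever harvested the final click.
    return {t: (value if i == len(touches) - 1 else 0.0)
            for i, t in enumerate(touches)}

def linear(touches, value):
    # Credit split evenly, so intent generators get paid too.
    share = value / len(touches)
    return {t: share for t in touches}

print(last_click(journey, booking_value))
# {'tripadvisor_review': 0.0, 'google_search': 0.0, 'expedia_checkout': 200.0}
print(linear(journey, booking_value))
# every touch, including the review, gets ~$66.67
```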

It’s weird how little coverage this IPO got and how the financial press missed the interesting stories. TripAdvisor ended the day at ~$3.5B in market cap, making it the second most valuable East Coast consumer internet company (after Priceline). Every story I saw focused on the share price drop over the day. The fact that the price dropped from its opening price simply means the bankers mispriced the stock and therefore insiders didn’t get the sweetheart deal they thought they were getting.

Update: I interviewed the CEO/founder of TripAdvisor on TechCrunch yesterday. Topics include the company’s origins, relationship with Google, SOPA, and advice to fledgling entrepreneurs.

SEO is no longer a viable marketing strategy for startups

Many of today’s most successful informational sites such as Yelp, Wikipedia and TripAdvisor relied heavily on SEO for their initial growth. Their marketing strategy (whether deliberate or not) was roughly: 1) build a community of contributors that created high-quality content, 2) become the definitive place to link to for the topics they covered, 3) rank highly in organic search results. This led to a virtuous cycle where SEO drew more users, leading to more contributors and more inbound links, leading to more SEO, and so on. From roughly 2001-2008, SEO was the most effective marketing channel for high-quality informational sites.

I talk to lots of startups, and almost none that I know of post-2008 have gained significant traction through SEO (the rare exceptions tend to be focused on content areas that were previously un-monetizable). Google keeps its ranking algorithms secret, but it is widely believed that inbound links are the preeminent ranking factor. This ends up rewarding sites that are 1) older and have built up years of inbound links, or 2) willing to engage in aggressive link building, or what is known as black-hat SEO. (It is also very likely that Google rewards sites for the simple fact that they are older. For educated guesses on which factors matter most for SEO, see SEOMoz’s excellent search engine ranking factors survey.)
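
To see why link-based ranking favors older sites, here is a toy version of the idea behind PageRank (Google’s actual algorithm is secret and far more complex; the link graph below is made up):

```python
# Made-up link graph: each key links out to the listed pages.
links = {
    "old-site.com":  ["wikipedia.org", "yelp.com"],
    "wikipedia.org": ["yelp.com"],
    "yelp.com":      ["wikipedia.org"],
    "new-site.com":  ["wikipedia.org"],  # new sites have few inbound links
}

def pagerank(graph, damping=0.85, iterations=50):
    pages = list(graph)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Sum the rank flowing in from every page that links here.
            inbound = sum(rank[src] / len(outs)
                          for src, outs in graph.items() if page in outs)
            new_rank[page] = (1 - damping) / len(pages) + damping * inbound
        rank = new_rank
    return rank

for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page:15s} {score:.3f}")
# wikipedia.org and yelp.com, with many inbound links, rank highest;
# new-site.com ranks last regardless of its content quality.
```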

Consider, for example, the extremely lucrative category of hotel searches. Search Google for “Four Seasons New York” and this ad-riddled TripAdvisor page ranks highly:

(TechCrunch had a very good article on TripAdvisor’s decline in quality.)

In contrast, this cleaner and more informative page from the relatively new website Oyster ranks much lower in Google results:

As a result, web users have a worse experience, and startups are incentivized to clutter their pages with ads and use aggressive tactics to boost their SEO when they should be focused on creating great user experiences.

The web economy (ecommerce + advertising) is a multi-hundred-billion-dollar market. Much of its revenue depends on traffic that arrives via organic search. This has led to a multibillion-dollar SEO industry. Some of the SEO industry is “white hat,” which generally means consultants giving benign advice for making websites search-engine friendly. But there is also a huge industry of black-hat SEO consultants who trade and sell links, along with companies like content farms that promote their own low-quality content through aggressive SEO tactics.

Google seems to be doing everything it can to improve its algorithms so that the best content rises to the top (the recent “panda” update seems to be a step forward). But there are many billions of dollars and tens of thousands of people working to game SEO. And for now, at least, high-quality content seems to be losing. Until that changes, startups – who generally have small teams, small budgets, and the scruples to avoid black-hat tactics – should no longer consider SEO a viable marketing strategy.

Anatomy of a bad search result

In a post last week, Paul Kedrosky noted his frustration when looking for a new dishwasher using Google.  I thought it might be interesting to do some forensics to see which sites rank highly and why.

Paul started by querying Google with the phrase “dishwasher reviews”:

[Screenshot: Google search results for “dishwasher reviews”]

Pretty much every link on this page has an interesting story to tell about the state of the web.  I’ll just focus here on the top organic (non-sponsored) result:

http://www.consumersearch.com/dishwasher-reviews

Clicking through this link takes you here:

[Screenshot: the Consumersearch dishwasher reviews page]

Consumersearch is owned by About.com, which in turn is owned by the New York Times.

So how did consumersearch.com get the top organic spot? Most SEO experts I talk to (e.g. SEOMoz’s Rand Fishkin) think inbound links from a large number of domains still matter far more than other factors. One of the best tools for finding inbound links is Yahoo Site Explorer (which, sadly, is supposed to be killed soon). Using this tool, here’s one of the sites linking to the dishwasher section of Consumersearch:

http://www.whirlpooldishwasher.net/

[Screenshot: whirlpooldishwasher.net, a fake blog built on a generic WordPress template]

(Yes, this site’s CSS looks scarily like my own blog – that’s because we both use a generic WordPress template).

This site appears to have two goals: 1) fool Google into thinking it’s a blog about dishwashers, and 2) link to consumersearch.com.

Who owns this site?  The Whois records are private. (Supposedly the reason Google became a domain registrar a few years ago was to peer behind the domain name privacy veil and weed out sites like this.)

I spent a little time analyzing the “blog” text (it’s actually pretty funny – I encourage you to read it).  It looks like the “blog posts” are fragments from places like Wikipedia run through some obfuscator (perhaps by machine translating from English to another language and back?).  The site was impressively assembled from various sources. For example, the “comments” to the “blog entries” were extracted from Yahoo Answers:

[Screenshot: a “comment” on the fake blog, copied from Yahoo Answers]

Here is the source of this text on Yahoo Answers:

[Screenshot: the original text on Yahoo Answers]

The key is to have enough dishwasher-related text to look like a blog about dishwashers, while also having enough text diversity to avoid being detected by Google as duplicative or automatically generated content.
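
One standard technique for catching duplicated text is “shingling”: compare the sets of overlapping word sequences in two documents. Here’s a toy sketch of why spun text slips past a naive version of that check (what Google actually uses is, of course, not public):

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences ("shingles")."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Overlap between two shingle sets: 1.0 = identical, 0.0 = disjoint."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

original = "this dishwasher has a stainless steel tub and a quiet motor"
spun = "this dishwasher offers a steel tub of stainless plus a motor that is quiet"

print(jaccard(original, original))  # 1.0 -- an exact copy is easy to flag
print(jaccard(original, spun))      # 0.0 -- spun text evades the naive check
```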

So who created this fake blog? It could have been Consumersearch, or a “black hat” SEO consultant, or someone in an affiliate program that Consumersearch doesn’t even know about. I’m not trying to imply that Consumersearch did anything wrong. The problem is systemic. When you have a multibillion-dollar economy built around keywords and links, the ultimate “products” optimize for just that: keywords and links. The incentive to create quality content diminishes.

Some thoughts on SEO

“SEO” (==”Search Engine Optimization”) is a term widely used to mean “getting users to your site via organic search traffic.”  I don’t like the term at all.  For one thing, it’s been frequently associated with illicit techniques like link trading and search engine spamming.  It is also associated with consultants who don’t do much beyond very basic stuff your own developers should be able to do.   But the most pernicious aspect to the phrase is that the word “optimization” suggests that SEO is a finishing touch, something you bolt on, instead of central to the design and development of your site. Unfortunately, I think the term is so widespread that we are stuck with it.

SEO is extremely important because normal users – those who don’t live and breathe technology – only type a few of their favorite websites directly into the URL bar and go to search engines, most likely Google*, for everything else. In the 90s, people talked a lot about “home pages” and “site flow.” That matters if you are getting most of your traffic from people typing in your URL directly. For most startups, however, this isn’t the case, at least for the first few years. Instead, the flow you should be thinking about is users going to Google, typing in a keyphrase, and landing on one of your internal pages.

The biggest choice you have to make when approaching SEO is whether you want to be a Google optimist or a Google pessimist**. Being an optimist means trusting that the smart people in the core algorithm team in Mountain View are doing their job well – that, in general, good content rises to the top.

The best way to be a Google optimist is to think of search engines as information marketplaces – matchmakers between users “demanding” information and websites “supplying” it. This means thinking hard about what users are looking for today, what they will be looking for in the future, how they express those intentions through keyphrases, where there are gaps in the supply of that information, and how you can create content and an experience to fill those gaps.
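
This framing can even be made crudely quantitative. A toy sketch with entirely made-up numbers: estimate demand (query volume) and supply (competing quality pages) for each keyphrase, then look for the biggest gaps:

```python
# Made-up numbers: monthly searches vs. competing quality pages.
keyphrases = {
    "dishwasher reviews":           {"demand": 90_000, "supply": 400},
    "four seasons new york":        {"demand": 40_000, "supply": 900},
    "quiet dishwasher under $500":  {"demand": 6_000,  "supply": 15},
}

# Rank opportunities by demand per competing page; high ratios mark
# the gaps in supply worth filling with great content.
ranked = sorted(keyphrases.items(),
                key=lambda kv: -kv[1]["demand"] / kv[1]["supply"])
for phrase, stats in ranked:
    print(f"{phrase:30s} demand/supply = {stats['demand'] / stats['supply']:,.0f}")
```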

All this said, there does remain a technical, “optimization” side to SEO. Internal URL structure, text on your landing pages, and all those other things discussed by SEO consultants do matter. Luckily, most good SEO practices are also good UI/UX practices. Personally, I like to do all of these things in house by asking our programmers and designers to include sites like SEOMoz, Search Engine Land, and Matt Cutts’s blog in their daily reading list.
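
As one concrete example of where the “optimization” side and good UX overlap: descriptive internal URLs help search engines and users alike. A minimal sketch (my own illustration, not any consultant’s prescription):

```python
import re

def slugify(title):
    """Turn a page title into a descriptive, keyword-bearing URL slug."""
    slug = title.lower()
    slug = re.sub(r"[^a-z0-9]+", "-", slug)  # collapse punctuation and spaces
    return slug.strip("-")

print("/hotel-reviews/" + slugify("Four Seasons New York - Hotel Review"))
# /hotel-reviews/four-seasons-new-york-hotel-review
# versus an opaque URL like /page?id=83721, which tells search
# engines (and users) nothing about the page's content
```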

* I’m just going to drop the illusion here that most people optimize for anything besides Google. ComScore says Google has ~70% market share, but everyone I know gets >90% of their search traffic from Google. At any rate, in my experience, if you optimize for Google, Bing/Yahoo will give you SEO love about 1-6 months later.

** Even if you choose to be a pessimist, I strongly recommend you stay far away from so-called black hat techniques, especially schemes like link trading and paid text ads that are meant to trick crawlers.  Among other things, this can get your site banned for life from Google.