Chris Dixon

What’s not evil: ranking content fairly *and* letting public content get indexed

Please see update at bottom

Most websites spend massive amounts of time and money to get any of their pages indexed and ranked by Google’s search engine. Indeed, there is an entire billion-dollar industry (SEO) devoted to helping companies get their content indexed and ranked.

Twitter and Facebook have decided to disallow Google from indexing 99.9% of their content. Twitter won’t let Google index tweets and Facebook won’t let Google index status updates and most other user and brand generated content. In Facebook’s case this makes sense for content that users have designated as non-public. In Twitter’s case, the vast majority of the blocked content is designated by users as public. Furthermore, Twitter’s own search function rarely works for tweets older than a week (from Twitter’s search documentation, they return “6-9 days of Tweets”).

There is a debate going on today in the tech world: Facebook and Twitter are upset that Google won’t highly rank the 0.1% of their content they make indexable. Facebook and Twitter even created something they call the “Don’t be evil” toolbar that reranks Google search results the way they’d like them to be ranked. The clear implication is that Google is violating its famous credo and acting “evil”.

The vast majority of websites would dream of having this problem: being able to block Google from 99.9% of their content and still having the remaining 0.1% rank at the top of results. What would be best for users – and least “evil” – would be to let all public content get indexed and have Google rank that content “fairly” without favoring its own content. Facebook and Twitter are right about Google’s rankings, but Google is right about Facebook and Twitter blocking public content from being indexed.

Update: after posting this I got a bunch of emails, tweets and comments telling me that Twitter does in fact allow Google to index all their tweets, and that any missing tweets are the fault of Google, not Twitter. A few people suggested that without firehose access Google can’t be expected to index all tweets. At any rate, I think the “Why aren’t all tweets indexed?” issue is more nuanced than I argued above.

  • http://harryh.org harryh

    What do you mean Twitter won’t let Google index tweets?

    https://www.google.com/#q=site:twitter.com+cdixon

    And also:

    https://twitter.com/robots.txt (hat tip: @jorgeo)

    • Anonymous

      Indeed: https://twitter.com/robots.txt

    • http://www.cdixon.org chris dixon

      that’s a tiny minority of tweets. i believe from when they had a search deal. i’ve searched for old tweets many times on google and not found them. also heard this claim directly from google execs.

      • http://harryh.org harryh

        You’re wrong about “from when they had a search deal.”

        Note how the 2nd result from the google link above is from a tweet of yours from 10 days ago. The search deal expired a long time ago.

        The first result is your profile page on twitter (as it should be).

        • http://www.cdixon.org chris dixon

          go try to find specific tweets in google’s index. you’ll find a tiny fraction. and then go talk to people at google and twitter. this is a well known fight between the companies.

          • http://harryh.org harryh

            Oh it’s clearly a well known fight, that much is obvious.  But you are not correctly explaining the parameters of the fight.

            • http://www.cdixon.org chris dixon

              According to google there have been 24m tweets with “4sq.com” in the content.

              That is about 24 days of content indexed.

              Do you think Google is deciding to throw away that content, Twitter isn’t allowing it, or something else?

              • http://harryh.org harryh

                It’s not technically feasible for Google to index all the tweets without specialized access.

          • Matt McGee

            Chris, Google can and does index tweets. No deal is required for that. The reason that not ALL tweets are indexed is likely due to sheer amount of tweets produced every day – it’s just not feasible for every tweet to be indexed. (Imagine if Google indexed all the replies that are nothing more than “LOL” or a smiley face.)

            Google’s indexing of Twitter is probably managed like any extremely large site — URLs that don’t have any inbound link equity aren’t going to be crawled. I formerly managed a website with about 15 million pages and Google didn’t come close to indexing all of the deep content, simply because its algorithm saw no value in indexing pages that no one else had deemed important (by virtue of links, mentions, etc.).

            The deal that Google and Twitter had was, as I understand, for the “firehose” of tweets, or perhaps for a portion of the firehose (more likely). With firehose access, Google had direct access to content prior to no-follow attributes being added to links (as Kevin Marks mentioned above). But the lack of a deal for this has nothing to do with this.

            As Twitter Communications tweeted earlier, Google crawls Twitter 120 million times per day:

            https://twitter.com/#!/twittercomms/status/161578517698580481

            • http://parislemon.com MG Siegler

              This is my understanding of the situation as well. It’s a technical challenge, not a content one. Which is to say it’s complicated, naturally. 

              • http://www.cdixon.org chris dixon

                Fair points. Perhaps this is just a Google infrastructure failing which seems strange given their crawling prowess but sounds like you are better informed than me.

                • http://twitter.com/mattmcknight matt mcknight

                  Not necessarily a Google infrastructure failure, but also a Twitter infrastructure failure: creating 8000 new URLs per second means they’d have to support a crawl rate above that level, which is why they created the firehose to avoid the need to crawl.

                  Twitter can’t try to charge Google to access their data and then complain when not all their content shows up. Facebook has even less of an argument…a lot of public facebook content is that way by accident.
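The two figures quoted in this thread (120 million crawl hits per day from Twitter Comms, and roughly 8000 new URLs per second from the comment above) can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch: both numbers are commenters' claims rather than measurements, and it assumes, generously, that every crawl hit fetches a distinct new tweet URL.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures quoted in this thread (claims by commenters, not measurements).
crawls_per_day = 120_000_000   # Twitter Comms: Google crawl hits per day
new_urls_per_second = 8_000    # commenter's estimate of new tweet URLs per second

crawls_per_second = crawls_per_day / SECONDS_PER_DAY
new_urls_per_day = new_urls_per_second * SECONDS_PER_DAY

# Best case: every crawl hit lands on a distinct, newly created URL.
best_case_coverage = crawls_per_day / new_urls_per_day

print(f"crawls per second: {crawls_per_second:,.0f}")          # 1,389
print(f"new URLs per day:  {new_urls_per_day:,}")              # 691,200,000
print(f"best-case share of new URLs: {best_case_coverage:.1%}")  # 17.4%
```

Under these assumptions, even a crawler that spent every request on fresh tweets would reach under a fifth of them, which is consistent with the "technical challenge" reading several commenters give.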

                • http://twitter.com/TheNextCorner Dennis Goedegebuure

                  Having experience with SEO for an ultra large site (eBay) I can assure you that Google crawlers are missing a lot of newly created URL’s on a daily basis.
                  It’s challenging for both the search engine and the product teams to get all URL’s indexed.

                  What is more interesting is that all of the comments are now focused on the technicalities of Twitter or Facebook getting indexed, going past the argument made by Sarah Lacy here: http://pandodaily.com/2012/01/23/google-do-yourself-a-favor-and-just-come-clean-already/.

                  Just read all the Google statements from years ago, where Google claimed to be “different” than all other companies, putting customer needs and experience before Google’s.

                  From the outcry of the tech industry, it seems Google has lost trust.

                  Not so much with people like my mother or other average users, as these folks will not see the difference immediately.

                  • http://andybeard.eu AndyBeard

                    Dennis don’t forget crawl efficiency, canonicalization etc. Google could easily crawl 400% less pages, and potentially less Twitter home page hits if it could be assured to eventually get the same content later.
                    That is especially true if the crawler understands the site structure.
                    They don’t need to crawl individual tweets.

            • http://andybeard.eu AndyBeard

              Matt, Twitter blocks crawling beyond the first page, so whilst individual tweets may get indexed, the site internal linking is broken thus preventing a natural crawl.

      • http://twitter.com/kevinmarks Kevin Marks

        Google now respects the rel=”nofollow” on all links inside twitter, which means it can’t surface things linked to from twitter. When they had firehose access, they could

        • http://www.facebook.com/blake Blake Ross

          Tweets are not “linked” from twitter. They are right on a user’s Twitter profile, which is crawled and indexed by Google.

          • http://www.cdixon.org chris dixon

            I think he’s talking about links in tweets, not on user’s profiles.

            • http://www.facebook.com/blake Blake Ross

              Then what does that have to do with Google’s ability to crawl, index and surface tweets in its search results?

        • http://autoaccessoriesgarage.com Jared McKiernan

          Nofollow doesn’t mean google doesn’t follow the link for its index, it means google doesn’t flow any PageRank from the linking site through the link. For googlebot to not “click on” the link, you’d use rel=”noindex” and I’m not even sure that googlebot wouldn’t still follow this link and just not add it to the index.

          • http://andybeard.eu AndyBeard

            There is no such thing as a noindex link directive.
            Noindex is only page based using a number of mechanisms but those work only after Google has crawled the page to see the directive.
            Even when it is a page directive, Google can still cache the page, and use the links on the page in its calculations.
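The distinction drawn in this sub-thread (nofollow is a value of a link's rel attribute, while noindex is a page-level directive a crawler can only see after fetching the page) can be illustrated with a small parser. This is a minimal sketch using Python's standard `html.parser`; the class name `RobotsHintParser` and the sample markup are made up for illustration, and it only looks at the robots meta tag, not at other mechanisms for signalling noindex.

```python
from html.parser import HTMLParser


class RobotsHintParser(HTMLParser):
    """Collects the page-level noindex hint and per-link nofollow hints."""

    def __init__(self):
        super().__init__()
        self.page_noindex = False
        self.links = []  # list of (href, is_nofollow)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            # Page-level directive: only visible once the page has been fetched.
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            if "noindex" in directives:
                self.page_noindex = True
        elif tag == "a" and "href" in attrs:
            # Link-level hint: rel is a space-separated list of tokens.
            rel_tokens = (attrs.get("rel") or "").lower().split()
            self.links.append((attrs["href"], "nofollow" in rel_tokens))


# Hypothetical sample page, not taken from twitter.com.
sample = """
<html><head><meta name="robots" content="noindex, follow"></head>
<body>
<a href="http://example.com/a" rel="nofollow">a</a>
<a href="http://example.com/b" rel="me nofollow">b</a>
<a href="http://example.com/c">c</a>
</body></html>
"""

parser = RobotsHintParser()
parser.feed(sample)
print(parser.page_noindex)
print(parser.links)
```

What a search engine then does with these hints (whether it fetches a nofollow link, whether it passes PageRank) is its own policy; the sketch only shows where each hint lives in the markup.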

        • http://andybeard.eu AndyBeard

          Google respects nofollow as described in this guideline
          http://support.google.com/webmasters/bin/answer.py?hl=en&answer=96569 

          There are some details there of situations where they might crawl a link, such as when it includes rel=”me” but that doesn’t mean the link passes PageRank.

    • http://andybeard.eu AndyBeard

      I don’t need to hattip people for wrong information
      Disallow: /*?
      That line has been blocking natural crawling of the site from the pagination at the bottom of the pages for at least the last 2 years that I have been monitoring it.
      Current indexation of my own 5100 tweets is around 5%.
      I have achieved better than that on a site with hardly any juice that archives them.
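To see why a rule like `Disallow: /*?` would cut a crawler off from paginated URLs, here is a minimal sketch of wildcard matching for robots.txt path patterns. It assumes the common wildcard extensions (`*` matches any run of characters, a trailing `$` anchors the end of the URL), ignores `Allow` rules and rule precedence, and uses hypothetical example paths rather than real Twitter URLs. The helper names `robots_pattern_to_regex` and `is_blocked` are invented for this example.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern with '*' and trailing '$' into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_pattern: str) -> bool:
    # Patterns are matched from the start of the path (prefix match).
    return robots_pattern_to_regex(disallow_pattern).match(path) is not None


rule = "/*?"
# Hypothetical paths: a profile page, a paginated view, and a search URL.
for path in ["/cdixon", "/cdixon?page=2", "/search?q=4sq.com"]:
    print(path, "blocked" if is_blocked(path, rule) else "allowed")
```

With this rule, any path containing a query string is blocked while the bare profile path is allowed, which matches the claim that the first page can be crawled but pagination links cannot.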

  • http://twitter.com/ScotchGuyDan Dan Bowen

    World War amongst these three means nothing but opportunity for entrepreneurs who build meaningful solutions outside of the cold war power mentality.

  • http://www.kidmercuryblog.com kidmercury

    what’s evil and what’s not is subjective. people need to align themselves with the platform that espouses their own values. 

    and i’m confident for most people that will be google. google is way more open than facebook and twitter, and that’s what this issue boils down to; the morality of openness. google has the moral trump card and it’s not even close. as if any of the bubblized kids in silicon valley could ever drop something like android. please. they’ll take android, customize it, and then whine when google tries to profit in some way from it. pfft. amateurs. 

  • http://twitter.com/BrentHurley Brent Hurley

    I absolutely agree. Twitter hinges on the premise that real-time data is extremely valuable…so much so that they believe search engines should pay them for it.  However, the true value of this data is still TBD.

  • http://www.facebook.com/blake Blake Ross

    This is incorrect.

    Clearly Google indexes tweets. Here are plenty of yours: https://www.google.com/search?sourceid=chrome&ie=UTF-8&q=twitter+chris+dixon#sclient=psy-ab&hl=en&source=hp&q=site:twitter.com%2Fcdixon&pbx=1&oq=site:twitter.com%2Fcdixon&aq=f&aqi=&aql=&gs_sm=e&gs_upl=4927l5182l1l5355l2l2l0l0l0l0l156l240l1.1l2l0&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=8e36662740888547&biw=1243&bih=627

    And what is the 99.9% of *public* Facebook content that is not indexable? You don’t explain this assertion.

    And what about Quora, Pinterest, Tumblr, Instagram and the dozens of other social networks?

    Lastly, the toolbar doesn’t “rerank Google search results the way they’d like them to be ranked” — it reranks Google results the way *Google’s algorithms themselves* would rank them if they were not hardcoded to Google+.

    I would urge you to spend some time and read through the site to see how the tool actually works.

    • http://www.cdixon.org chris dixon

      I don’t have any hard data on number of tweets indexed (does anyone besides Google/Twitter?) but I’ve found many times when I can’t find tweets I know exist in Google. All of the testing I’ve done seems to show Google is missing many Tweets. But since I posted this post I’ve gotten emails from smart industry people saying this is Google’s fault so maybe I’m wrong.

      • FAKE GRIMLOCK

        INDEX THAT VOLUME NOT EASY.

        TWITTER NOT AVOID IT BECAUSE LAZY, AVOID BECAUSE NO CAN FIGURE OUT HOW TO DO RIGHT.

        SAME FOR GOOGLE.

        • http://cynthiaschames.tumblr.com/ Cynthia Schames

          Twitter breaks during a football game, for godssakes.

  • http://twitter.com/mgrooves Matt Graves

    Chris: your statement that Twitter has “decided to disallow Google from indexing 99.9% of their content” is flat wrong. Google can and does crawl Tweets, and as Danny Sullivan pointed out two weeks ago (http://marketingland.com/schmidt-google-not-favored-happy-to-talk-twitter-facebook-integration-3151) they have more than 3 billion pages indexed. 

    And, as we tweeted earlier today (https://twitter.com/#!/twittercomms/status/161578517698580481), Google’s web crawlers are hitting our servers at a rate of more than 1,400 queries per second; that’s more than *120 million* times a day. 

    Presumably, they are not throwing that data away. But they’re also choosing not to use it to provide more relevant results. And as the bookmarklet that Blake released today amply demonstrates, it *is* a choice.

    I see from the comments that many people are correcting you, here and by mail. I hope you’ll consider amending your post. 

    • http://www.cdixon.org chris dixon

      yes, a bunch of people said the same thing, and I updated my post to reflect this. i’d love to hear from google more specifics of their complaint against twitter.

      thanks.

    • http://andybeard.eu AndyBeard

      Matt I really hope he doesn’t amend it.

      The don’t be evil bookmarklet is actually very evil as it links to a Mark Zuckerberg topic page on Quora, not a profile that he has verified as his.

      You also block natural crawling, at least you are no longer forcing juice to list pages and spam accounts as was the case a while back.

      I will be responding with yet another blog post on your indexation issues as Twitter has pretty much ignored all the pro bono advice it has received from the SEO community up until now.

    • FAKE GRIMLOCK

      GOOGLE DEFINE “EVIL” AS “ANYTHING NOT GOOD FOR GOOGLE” AND THEN NOT DO IT.

  • http://natwelch.com Nat

    The thing about tweets is no one links to them right? Because of this, I imagine most tweets have a pretty low ranking in Google’s algorithms. So while the points the other commenters are making are accurate for the most part, aren’t we ignoring the basis of pagerank?

    • http://www.cdixon.org chris dixon

      I think we are all assuming the pagerank is just a tiny portion of the modern google ranking algorithm.

      • http://andybeard.eu AndyBeard

        PageRank with a huge fudge factor for perceived quality are the primary factors in indexation still.

  • http://twitter.com/sacca Chris Sacca

    It comes down to what each company has promised its users. Facebook promised its users their stuff would be private, which is why users rightfully get pissed when that line blurs. Twitter has promised users, well, that it will stay up, and that is why users rightfully get pissed when the whale is back. 

    Google has promised its users and the entire tech community, again and again, that it would put their interests first, and that is why Google users, rightfully get pissed when their results are deprecated to try to promote a lesser Google product instead.

    (Disclaimer: I own Twitter and Facebook stock.)

    • http://www.kidmercuryblog.com kidmercury

      who says putting goog+ posts isn’t in the best interests of users? who is in charge of product development, the scientists at google who spend all day solving these problems, or the user? on some level trust is needed, if google has lost that trust then users should migrate elsewhere. if they trust facebook or twitter more i feel sorry for them. 

      • http://cynthiaschames.tumblr.com/ Cynthia Schames

        Wait, so we should trust Google to feed us whatever they’re cooking, but 9/11 was a government coverup? 

        • http://www.kidmercuryblog.com kidmercury

          no, people should decide for themselves who they want to believe. for me, google isn’t perfect, but a lot better than the alternatives. as for the bubble kids trying to take the moral high ground in their beef with google…..lol. they should look in the mirror first. 

          regarding 9/11, plenty of government officials already admit there was a cover up. patriotsquestion911 dot com. add to that there has never been a criminal investigation of 9/11, that FBI agent coleen rowley says attempted investigations were thwarted…..i’m simply believing what those government officials believe. if others want to believe dick cheney, george bush, the current president, and those who believe several cave dwellers defied the laws of physics to take down three buildings with two airplanes, that is their prerogative. 

          • http://cynthiaschames.tumblr.com/ Cynthia Schames

            I believe what I saw with my own two eyes and experienced. I won’t go into the personal details of my experience that day, but suffice it to say that I was more than close enough to the situation.
            I’m no fan of *any* President we’ve had in my lifetime, believe me, and trust politicians in general about as far as I can throw them.
            But the same goes for Google. Don’t Be Evil is a joke.

            • http://www.kidmercuryblog.com kidmercury

              you obviously did not see building 7 fall, because that fell late in the afternoon without being struck by a plane. anyone who saw that fall definitely had a wtf moment. cast aside your emotions and look at the situation rationally and scientifically — the truth is obvious.  

              • http://cynthiaschames.tumblr.com/ Cynthia Schames

                I’m not debating with you on this. I respect your right to have an opinion, and hopefully you will respect my sensitivity to this topic.

    • JamesHRH

      Your user based approach is a good one.

      Who are Google’s users, in your view?

    • fjpoblam

      Google made a different sort of promise oh so many years ago. Did you read this “op-ed”?

      Google: do yourself a favor and just come clean already http://pandodaily.com/2012/01/23/google-do-yourself-a-favor-and-just-come-clean-already/

    • http://daleallyn.com Dale Allyn

      Good relationship fundamentals, Chris.

  • http://www.kleemi.com Bruce Wayne

    …. Interesting that most of the discussion seems to be within the framework that Facebook, Google, and Twitter have somehow been given the “right” to represent the “Communities” that have given them value….Based on most of the current events surrounding these companies it does not seem that putting value creating “Communities” first is part of their agenda….Most if not all of the decisions have been about what is good for the “Companies” and in most cases this means that the “Community” that creates real value will be commoditized and marginalized….

  • http://www.victusspiritus.com/ Mark Essel

    Business deals.
    That’s pretty much it, companies aggregating and selling attention, interest, and information. It’s all good.

  • http://evan.status.net/ Evan Prodromou

    I think the market can decide this.

    Bing has access to Facebook and Twitter firehoses. It uses that data to guide search results. If users prefer having FB and Twitter data guide their search results, we should see an uptick in usage of Bing.

    If, on the other hand, search engine loyalty trumps the social graph, we should see more growth of Google+ to the detriment of Facebook and Twitter.

    Frankly, I think it’s far too early to tell.
    I don’t think FB and Twitter give a crap about this, however. The whole “Don’t Be Evil” thing is a cynical effort to manipulate anti-trust regulators to keep Google out of social sites.

    The sad part is that these crocodile tears will probably work. The incumbents in the social networking space will use wolf-crying to force out the first viable third party in years, and the rest of us will suffer.

    I’d be pretty ashamed to be part of this effort at FB or Twitter right now.

  • http://daleallyn.com Dale Allyn

    The last thing I want in search results is a bunch of Twitter (or FB) noise (not suggesting that all tweets are noise, but in search they would become that to me). For Twitter content indexing and presentation in search results to be of any use to me, they’d need to be under a separate tab. There’s so much useless info on results pages already, adding tweets would just make it worse.

    This is not a commentary on Twitter or FB, just on useful search. That said, search tools on Twitter could be improved if it could provide a deeper history element IMO. 

    • http://www.dailybathroom.com Albert

      Indexing tweets would be not only noise,  but also spammy, because I think what people post on twitter (and FB) are mostly spammy.

  • http://technbiz.blogspot.com paramendra

    Twitter Should Open Up Its API —- To Google http://bit.ly/zLxkkx

  • http://arnoldwaldstein.com/ awaldstein

    I wonder sometimes whether this issue, as important as it is, isn’t being played like some Rambo  ’who drew first blood?’ childlike macho spitting contest.

    I don’t know the answer to the indexing question posed and am anxious to know the true facts. And I respect businesses’ right to do as they see fit and let the market sort it out. But I do feel leaned on heavily and incorrectly by Google with the forced action of having to participate in G+ in order to get ranked better.

    The loser is the small business and the startup for now…and eventually the consumer. Google rankings matter. Facebook matters. Twitter matters. They all do to businesses getting found by the right customers. This mess just spreads resources and reduces focus. 

    Do you advise your portfolio companies to forego the G+ resource allocation? It’s a tough one. An interesting debate.  A challenging situation for the small biz. 

    Thanks for this post Chris. 
