Chris Dixon

There are two ways to make large datasets useful

I’ve spent the majority of my career building technologies that try to do useful things with large datasets.*

One of the most important lessons I’ve learned is that there are only two ways to make useful products out of large data sets. Algorithms that deal with large data sets tend to be accurate at best 80%-90% of the time (an old “joke” about machine learning is that it’s really good at partially solving any problem). Consequently, you either need to accept you’ll have some errors but deploy the system in a fault-tolerant context, or you need to figure out how to get the remaining accuracy through manual labor.

What do I mean by fault-tolerant context? If a search engine shows the most relevant result as the 2nd or 3rd result, users are still pretty happy. The same goes for recommendation systems that show multiple results (e.g. Netflix). Trading systems that hedge funds use are also often fault tolerant: if you make money 80% of the time and lose it 20% of the time, you can still usually have a profitable system.
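As a rough illustration of the trading arithmetic: winning 80% of the time is only profitable if the average loss is not too large relative to the average win. The sketch below uses invented payoff numbers and hypothetical function names; it is not taken from any real trading system.

```python
def expected_profit_per_trade(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected profit of a single trade, given the win probability and average payoff sizes."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def break_even_loss(win_rate: float, avg_win: float) -> float:
    """Largest average loss at which the system still breaks even."""
    return win_rate * avg_win / (1.0 - win_rate)


# Assumed numbers: win $100 on 80% of trades, lose $300 on the other 20%.
profit = expected_profit_per_trade(0.8, 100.0, 300.0)   # roughly +20 per trade
limit = break_even_loss(0.8, 100.0)                     # roughly 400
```

With these assumed numbers the system stays profitable as long as the average loss is below about four times the average win, which is why the "usually" in the paragraph above matters.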

For fault-intolerant contexts, you need to figure out how to scalably and cost-effectively produce the remaining accuracy through manual labor. When we were building SiteAdvisor, we knew that any inaccuracies would be a big problem: incorrectly rating a website as unsafe hurts the website, and incorrectly rating a website as safe hurts the user. Because we knew automation would only get us 80-90% accuracy, we built 1) systems to estimate confidence levels in our ratings so we would know what to manually review, and 2) a workflow system so that our staff, an offshore team we hired, and users could flag or fix inaccuracies.
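The two pieces described above (confidence estimates, plus a queue for manual review) can be sketched roughly as follows. This is a minimal illustration, not SiteAdvisor's actual system: the `Rating` fields, the `route` function, and the 0.95 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    site: str
    label: str         # e.g. "safe" or "unsafe", as produced by the automated system
    confidence: float  # estimated probability that the label is correct, in [0, 1]


def route(ratings, threshold=0.95):
    """Split ratings into auto-published ones and a manual review queue."""
    publish, review = [], []
    for r in ratings:
        (publish if r.confidence >= threshold else review).append(r)
    # Put the least certain ratings at the front of the review queue.
    review.sort(key=lambda r: r.confidence)
    return publish, review


ratings = [
    Rating("a.example", "safe", 0.99),
    Rating("b.example", "unsafe", 0.60),
    Rating("c.example", "safe", 0.90),
]
publish, review = route(ratings)
```

Raising the threshold sends more ratings to people and fewer straight to users, so the threshold is effectively a dial between labor cost and error rate.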

* My first job was as a programmer at a hedge fund, where we built systems that analyzed large data sets to trade stock options. Later, I cofounded SiteAdvisor where the goal was to build a system to assign security safety ratings to tens of millions of websites. Then I cofounded Hunch, which was acquired by eBay – we are now working on new recommendation technologies for ebay.com and other eBay websites.

  • http://twitter.com/mordyk Mordy Kaplinsky

    I would add that even in fault-tolerant environments, the best practice would be to put in place a workflow for manual amelioration. The data submitted through that workflow can actually teach the machine and improve its accuracy.

    • http://www.cdixon.org chris dixon

      Absolutely. Google does this with a large human QA team and also features for “regular” users to flag stuff, etc.

      • http://twitter.com/mordyk Mordy Kaplinsky

        Re: the regular users. In most cases, such as the Google example, there isn’t any real incentive for the user to try to correct the system. But when it’s a system the user returns to for the same or similar needs, there is a greater incentive to take a moment to improve it. The same would obviously apply to traders, where taking a moment to improve a trade can improve the average of a later transaction.

        • http://www.cdixon.org chris dixon

          The impression I get is that the Googles etc. of the world rely more on paid manual review than on user feedback. It is hard to give regular users the right incentives. You also need to fight people gaming the system, etc.

          • http://twitter.com/mordyk Mordy Kaplinsky

            So true on both counts. The problem with paid reviewers, from where I sit, is that they can’t really take the user’s intent into account to explain why a particular suggestion didn’t cut it, so if you could figure out a way to get the user to take the time and provide feedback, the upside is immense. This is aside from the obvious fact that you remove, or at a minimum reduce, the cost of paid reviewers.

          • http://fraserharris.tumblr.com/ Fraser Harris

            An interesting side effect of Google’s move to one combined identity/account is that it should be a lot easier to detect fraudulent accounts.

          • Anonymous

            As you touched upon below, in case of recommendations, regular users have a direct incentive to give their feedback: they choose what they want among the recommendation set. 
            Improvements in UI have made this even more valuable. LinkedIn now presents all suggestions to connect on a single page, and connections can be made with just one click, without leaving the page. UX improvements may seem a small thing, but I think they increase the user value of algorithmic suggestions by an order of magnitude in the fault-intolerant case.

  • http://venturatis.wordpress.com/ Guillermo Ramos Venturatis.com

    Excellent post!

    No clue about how Google’s algorithm works, but I think the user feedback is implicit in the click-through, right?

  • Anonymous

    Thanks for the post. I’m curious about your experience with / thoughts on the following two potential additions to the points you made:

    (1) Sometimes different types of errors have different costs associated with them, and the costs may be suffered by different entities. The question of fault (in)tolerance may then need to be evaluated separately for each type of error, weighting by the cost of making that type of error. For example, in spam classification, it’s much worse to classify a real email as spam than to let a couple of spam emails through occasionally. In finance, depending on what you’re doing, there can be asymmetry in the value of upside potential  vs the cost of downside risk. So you may be able to adjust the method to reduce the expensive error at the cost of an increased rate of the cheaper one. In your SiteAdvisor example, both types of errors are bad, but the cost is suffered by a different entity in each case (the website vs the user). So you need to fix both, but the method for boosting accuracy may be different depending on who is primarily affected by the error.

    (2) In some practical contexts, there is some leeway in choosing how to set up the problem, and those different formulations might not be equivalent from a fault (in)tolerance perspective. For example, Amazon’s recommender system mostly tries to find a few things you might be interested in buying every so often. As long as you find at least one thing interesting each time they draw your attention to a list of items, you probably don’t mind if the others are not of interest. Netflix, on the other hand, seems to try to predict how you would rate every item in their database, rather than simply trying to find a few movies you might want to watch sometimes. Intuitively, it seems the latter may be more error prone, because the user is somehow more exposed to potential errors made by the algorithm. (I guess another way to put this is the methods may make different tradeoffs in precision vs recall.) Given your work in recommendation systems in particular, I’m curious how you think about this kind of thing.
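Point (1) can be made concrete with the standard expected-cost decision rule: if a classifier outputs a probability, the cutoff for acting on it should depend on the relative cost of each error type. This is a sketch under assumed costs; the 9-to-1 ratio and the function names are illustrative, not drawn from any particular spam filter.

```python
def decision_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability above which predicting "positive" has the lower expected cost.

    Predicting positive costs (1 - p) * cost_false_positive in expectation;
    predicting negative costs p * cost_false_negative. Setting them equal
    and solving for p gives the threshold below.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def classify(p_positive: float, cost_false_positive: float, cost_false_negative: float) -> bool:
    """Predict positive only when doing so has the lower expected cost."""
    return p_positive > decision_threshold(cost_false_positive, cost_false_negative)


# Assumption for illustration: losing a real email is 9x as bad as letting one spam through.
spam_cutoff = decision_threshold(cost_false_positive=9.0, cost_false_negative=1.0)
flag_as_spam = classify(0.85, 9.0, 1.0)  # False: 85% sure is not sure enough at these costs
```

With equal costs the cutoff is the familiar 0.5; making false positives more expensive pushes it toward 1, trading more of the cheap error for fewer of the expensive one.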

    • http://www.cdixon.org chris dixon

      Great points.

      1) Agree, in spam false positives are more damaging than false negatives, and the algorithms should be weighted accordingly. For SiteAdvisor, once we got a significant user base, websites would let us know how they felt about ratings they considered incorrect (sometimes through scary legal threats), whereas users could have their computers destroyed by false negatives. So we tended to err on the side of user caution. We also, after long debates, added a middle rating (“caution” – a yellow rating as opposed to green or red).

      2) Netflix is an interesting case. The story I’ve heard is when they were DVD focused the recommendation algorithm was partly about better user experience and partly about getting people to watch from the “long tail” to better capacity-utilize Netflix’s inventory. I think it has changed since then. Amazon and eBay, on the other hand, just want to suggest more stuff you’d like so you’ll buy more. The other trade off is between revenue (you buy a John Grisham novel, you will likely buy another John Grisham novel), vs “delight” – helping the user find something they otherwise never would have found. Personally I think Amazon weights far too much on the revenue maximizing side. 

      The big unsolved problem in recommendation systems is providing a justification for the recommendations that are compelling to users. Social networks try it by showing your friends’ activities. Apple Genius tries it by showing other apps you’ve downloaded. But they are all gross simplifications of complicated algorithms. This was something we found frustrating at Hunch. We felt like we built a system that provided provably good recommendations but without showing justification people didn’t believe it.

      • Anonymous

        Yes, I now remember hearing once about Netflix taking into account capacity utilization as well as user interest; that’s interesting. One way to think about at least some of these problems is as exploration/exploitation tradeoffs; in the Amazon case, “exploration” being “delight” (i.e. try something risky and off the beaten track) and “exploitation” (i.e. go for safe bets) being going for revenue. I agree many systems are heavily skewed to the exploitation side, probably because the financial incentives seem to heavily favor exploitation; put differently, the rewards don’t seem to scale with the risk. Also, exploration is going to be more error-prone, while exploitation is going to be less error-prone, so in a fault intolerant context the costs may also get too high.

        In light of your last comment about Hunch’s system, it also seems like it is much easier to justify “exploitation” recommendations than “exploration” ones (to the end user, which I assume is what you meant). For example, “we are recommending this John Grisham novel because you’ve bought 8 John Grisham novels” should be compelling to the user. But while the explanation is fine, the actual recommendation is boring. The further you get into exploration, the more interesting things get (potentially), but the harder it gets to justify the recommendations to users, especially when the system is probably making many more errors. So I guess Hunch’s system provides much more interesting recommendations, but then it is also much harder to explain why you’re recommending various things (?).
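One common way to express the exploration/exploitation dial in code is an epsilon-greedy rule: take the safe bet most of the time and try something else with a small probability. The sketch below is a generic illustration, not a description of how Amazon, Netflix, or Hunch work; the item names and scores are made up.

```python
import random


def recommend(scores, epsilon, rng):
    """Pick one item: usually the top-scored one, occasionally a random one.

    scores  -- dict mapping item name to predicted score (higher is better)
    epsilon -- probability of exploring instead of exploiting
    rng     -- a random.Random instance, passed in so runs are reproducible
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(scores))   # explore: any item, uniformly at random
    return max(scores, key=scores.get)      # exploit: the highest-scored item


# Hypothetical scores for one user.
scores = {"grisham_novel": 0.9, "obscure_memoir": 0.4, "poetry_collection": 0.2}
rng = random.Random(0)
pick = recommend(scores, epsilon=0.1, rng=rng)
```

A larger epsilon means more surprising picks but also more misses, which matches the point above that exploration is the more error-prone side of the tradeoff.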

        What do you mean by saying that Hunch’s recommendations were “provably” good, though?

        Thanks.

        • http://www.cdixon.org chris dixon

          Provable in the sense that a very high percentage of the time the top result would be the same result as if the user had gone and done 4 hours of internet research.

  • http://needforair.com/ Louis Chatriot

    This thinking applies to text summarization: while we do have algorithms (like TextRank) that can do a decent job of summarizing overall, there is still a significant portion of texts where they fail. That is why we still need some human involvement if we want quality summaries.

    • http://www.cdixon.org chris dixon

      Agree. I suppose I could have phrased this post to apply more broadly to any AI/machine learning system applied to real world problems.

  • Anonymous

    One usually has some leeway in choosing where to sit between fault-tolerant and fault-intolerant… I’m sure you’re familiar with the ROC curve (http://en.wikipedia.org/wiki/Receiver_operating_characteristic).

    In the datasets I study (genomics), there is no free lunch, as is usually the case. You can get more true positives (quantity) if you are willing to tolerate greater error, or you can get fewer answers, each one more accurate (quality). Moving the needle usually takes either algorithmic innovation (which usually revolves around finding clever ways of detecting and removing systematic errors) or better-quality data.
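The quantity-versus-quality tradeoff behind an ROC curve can be seen by sweeping a score threshold and counting outcomes at each setting. This is a minimal sketch on invented toy data; it assumes the labels contain at least one positive and one negative, and the function name is hypothetical.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, true_positive_rate, false_positive_rate) tuples.

    scores -- predicted scores, where higher means "more likely positive"
    labels -- matching ground-truth booleans
    Assumes labels contain at least one True and one False.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        # Everything scoring at or above the threshold is called positive.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((t, tp / positives, fp / negatives))
    return points


# Toy data, invented for illustration.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False]
points = roc_points(scores, labels, thresholds=[0.7, 0.3])
```

On this toy data the strict threshold finds two of the three positives with no false alarms, while the loose one finds all three at the price of flagging half the negatives.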

  • Anonymous

    Interesting thoughts based on experience.  What do you mean by “The big unsolved problem in recommendation systems is providing a justification for the recommendations that are compelling to users.”? For example, does Amazon provide a justification for its recommendations?  Curious to understand what is behind the statement.

    • http://www.cdixon.org chris dixon

      I may be getting ahead of myself – that’s just my strong personal bias after working on the problem for a while. Amazon provides weak justifications (because you liked X book). Justifications would make recommendations much more compelling, and no one has really figured it out from a design and algorithm perspective.

      • Anonymous

        (from someone also in this biz). Why do you want to provide a justification for the suggestions to the consumer? Search engine results are recommendations of sorts but come with no justification. Why should a recommendation for a book, movie, app or eBay item and so on come with a justification? I understand that recommendation technology per se can deliver poor results either through the algorithms deployed and/or poor data.

  • http://www.andyinsandiego.com/ Andy

    “* My first job was as a programmer at a hedge fund, where we built systems that analyzed large data sets to trade stock options. Later, I cofounded SiteAdvisor where the goal was to build a system to assign security safety ratings to tens of millions of websites. Then I cofounded Hunch, which was acquired by eBay – we are now working on new recommendation technologies for ebay.com and other eBay websites.”

    Chris, do you ever get tired of crushing life?

    • http://www.cdixon.org chris dixon

      not sure what that means…?

      • http://www.andyinsandiego.com/ Andy

        It was a compliment.
