There are two ways to make large datasets useful

I’ve spent the majority of my career building technologies that try to do useful things with large datasets.*

One of the most important lessons I’ve learned is that there are only two ways to make useful products out of large data sets. Algorithms that deal with large data sets tend to be accurate at best 80%-90% of the time (an old “joke” about machine learning is that it’s really good at partially solving any problem). Consequently, you either need to accept you’ll have some errors but deploy the system in a fault-tolerant context, or you need to figure out how to get the remaining accuracy through manual labor.

What do I mean by fault-tolerant context? If a search engine shows the most relevant result as the 2nd or 3rd result, users are still pretty happy. The same goes for recommendation systems that show multiple results (e.g. Netflix). Trading systems that hedge funds use are also often fault tolerant: if you make money 80% of the time and lose it 20% of the time, you can still usually have a profitable system.

For fault-intolerant contexts, you need to figure out how to scalably and cost-effectively produce the remaining accuracy through manual labor. When we were building SiteAdvisor, we knew that any inaccuracies would be a big problem: incorrectly rating a website as unsafe hurts the website, and incorrectly rating a website as safe hurts the user. Because we knew automation would only get us 80-90% accuracy, we built 1) systems to estimate confidence levels in our ratings so we would know what to manually review, and 2) a workflow system so that our staff, an offshore team we hired, and users could flag or fix inaccuracies.

* My first job was as a programmer at a hedge fund, where we built systems that analyzed large data sets to trade stock options. Later, I cofounded SiteAdvisor where the goal was to build a system to assign security safety ratings to tens of millions of websites. Then I cofounded Hunch, which was acquired by eBay – we are now working on new recommendation technologies for and other eBay websites.

Machine learning is really good at partially solving just about any problem

There’s a saying in artificial intelligence circles that techniques like machine learning (and NLP) can very quickly get you, say, 80% of the way to solving just about any (real world) problem, but going beyond 80% is extremely hard, maybe even impossible.  The Netflix Challenge is a case in point: hundreds of the best researchers in the world worked on the problem for 2 years and the (apparent) winning team got a 10% improvement over Netflix’s in-house algorithm.  This is consistent with my own experience, having spent many years and dollars on machine learning projects.

This doesn’t mean machine learning isn’t useful – it just means you need to apply it to contexts that are fault tolerant:  for example, online ad targeting, ranking search results, recommendations, and spam filtering.  Areas where people aren’t so fault tolerant and machine learning usually disappoints include machine translation, speech recognition, and image recognition.

That’s not to say you can’t use machine learning to attack these non-fault tolorant problems, but just that you need to realize the limits of automation and build mechanisms to compensate for those limits.  One great thing about most machine learning algorithms is you can infer confidence levels and then, say, ship low confidence results to a manual process.

A corollary of all of the above is that it is very rare for startup companies to ever have a competitive advantage because of their machine learning algorithms.  If a worldwide concerted effort can only improve Netflix’s algorithm by 10%, how likely are 4 people in an R+D department in a startup going to have a significant breakthrough.  Modern ML algorithms are the product of thousands of academics and billions of dollars of R+D and are generally only improved upon at the margins by individual companies.