To make smarter systems, it’s all about the data

As this article by Alex Wright in the New York Times last week reminded me, when the mainstream press talks about artificial intelligence – machine learning, natural language processing, sentiment analysis, and so on – it talks as if it's all about algorithmic breakthroughs.  The implication is that it's primarily a matter of developing new equations or techniques in order to build systems that are significantly smarter than the status quo.

What I think this view misses (but I suspect the companies covered in the article understand) is that significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.

Google’s PageRank was probably the greatest AI-related invention ever brought to market by a startup.  It was one of very few cases where a new system was really an order of magnitude smarter than existing ones.  The Google founders are widely recognized for their algorithmic work.  Their most important insight, however, in my opinion, was to identify a previously untapped and incredibly valuable data source – links – and then build a (brilliant) algorithm to optimally harness that new data source.
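To see how simple the core idea is once you have the data, here's a minimal sketch of PageRank as power iteration on a toy link graph. The three-page graph is invented for illustration; the 0.85 damping factor is the value used in the original paper.

```python
# Simplified PageRank via power iteration on a toy link graph.
# Keys are pages; values are the pages they link out to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

damping = 0.85  # damping factor from the original PageRank paper
n = len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # iterate until the ranks stabilize
    new_rank = {page: (1 - damping) / n for page in links}
    for page, outlinks in links.items():
        # Each page splits its rank evenly among the pages it links to.
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank
```

After iterating, "c" ranks highest because both other pages link to it – the rank is coming entirely from the structure of the links, which is the data-source insight.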

Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success.  The Netflix Challenge demonstrated that a massive, worldwide effort only improves on an in-house algorithm by approximately 10%. Studies have shown that naive Bayes is as good or better than fancy algorithms in a surprising number of real world cases.  It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.
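To illustrate how little machinery naive Bayes actually requires, here's a minimal text classifier with add-one (Laplace) smoothing. The tiny spam/ham training set is invented for the example:

```python
import math
from collections import Counter

# Toy training data: (label, words) pairs -- invented for illustration.
train = [
    ("spam", "buy cheap pills now".split()),
    ("spam", "cheap pills cheap offer".split()),
    ("ham",  "meeting schedule for tomorrow".split()),
    ("ham",  "lunch tomorrow with the team".split()),
]

labels = {label for label, _ in train}
doc_counts = Counter(label for label, _ in train)
word_counts = {label: Counter() for label in labels}
for label, words in train:
    word_counts[label].update(words)

vocab = {w for _, words in train for w in words}

def predict(words):
    # Pick the label maximizing log P(label) + sum of log P(word | label),
    # with add-one smoothing so unseen words don't zero out a class.
    best_label, best_score = None, float("-inf")
    for label in labels:
        total = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

With this toy data, `predict("cheap pills".split())` comes back `"spam"` – a handful of counts and logarithms, which is the point: the algorithm is the easy part.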

Algorithms are, as they say in business school, “commoditized.”  The order of magnitude breakthroughs (and companies with real competitive advantages) are going to come from those who identify or create new data sources.

Machine learning is really good at partially solving just about any problem

There’s a saying in artificial intelligence circles that techniques like machine learning (and NLP) can very quickly get you, say, 80% of the way to solving just about any (real world) problem, but going beyond 80% is extremely hard, maybe even impossible.  The Netflix Challenge is a case in point: hundreds of the best researchers in the world worked on the problem for 2 years and the (apparent) winning team got a 10% improvement over Netflix’s in-house algorithm.  This is consistent with my own experience, having spent many years and dollars on machine learning projects.

This doesn’t mean machine learning isn’t useful – it just means you need to apply it to contexts that are fault tolerant:  for example, online ad targeting, ranking search results, recommendations, and spam filtering.  Areas where people are less fault tolerant, and where machine learning usually disappoints, include machine translation, speech recognition, and image recognition.

That’s not to say you can’t use machine learning to attack these non-fault-tolerant problems, but just that you need to realize the limits of automation and build mechanisms to compensate for those limits.  One great thing about most machine learning algorithms is that you can infer confidence levels and then, say, ship low confidence results to a manual process.
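The routing pattern described above can be sketched in a few lines. The `classify` stand-in and the 0.9 threshold are hypothetical – in practice `classify` would be a trained model returning a label and a probability, and the threshold would be tuned to your error tolerance:

```python
def classify(item):
    # Stand-in for a real model: here we just read a precomputed
    # label and confidence score off the item (hypothetical fields).
    return item["guess"], item["score"]

def route(items, threshold=0.9):
    """Accept high-confidence predictions; queue the rest for humans."""
    automated, manual_queue = [], []
    for item in items:
        label, confidence = classify(item)
        if confidence >= threshold:
            automated.append((item["id"], label))
        else:
            manual_queue.append(item["id"])  # send to human review
    return automated, manual_queue

items = [
    {"id": 1, "guess": "cat", "score": 0.97},
    {"id": 2, "guess": "dog", "score": 0.55},
]
automated, manual_queue = route(items)
```

Here item 1 is handled automatically and item 2 lands in the manual queue, so the overall system stays accurate even though the model alone would have been wrong some of the time.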

A corollary of all of the above is that it is very rare for startup companies to have a competitive advantage because of their machine learning algorithms.  If a worldwide concerted effort can only improve Netflix’s algorithm by 10%, how likely is it that 4 people in a startup’s R+D department will have a significant breakthrough?  Modern ML algorithms are the product of thousands of academics and billions of dollars of R+D and are generally only improved upon at the margins by individual companies.