Machine learning is really good at partially solving just about any problem

There’s a saying in artificial intelligence circles that techniques like machine learning (and NLP) can very quickly get you, say, 80% of the way to solving just about any (real world) problem, but going beyond 80% is extremely hard, maybe even impossible.  The Netflix Challenge is a case in point: hundreds of the best researchers in the world worked on the problem for 2 years and the (apparent) winning team got a 10% improvement over Netflix’s in-house algorithm.  This is consistent with my own experience, having spent many years and dollars on machine learning projects.

This doesn’t mean machine learning isn’t useful – it just means you need to apply it to contexts that are fault tolerant:  for example, online ad targeting, ranking search results, recommendations, and spam filtering.  Areas where people aren’t so fault tolerant and machine learning usually disappoints include machine translation, speech recognition, and image recognition.

That’s not to say you can’t use machine learning to attack these non-fault-tolerant problems, just that you need to recognize the limits of automation and build mechanisms to compensate for them.  One great thing about most machine learning algorithms is that you can infer confidence levels and then, say, ship low-confidence results to a manual process.
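To make the confidence idea concrete, here is a minimal sketch of what that kind of routing could look like. It assumes your model hands back a label plus a confidence score between 0 and 1. The `Prediction` class, the `route_by_confidence` function, and the 0.8 cutoff are all hypothetical names and values chosen for illustration, not part of any particular library, and the right threshold will depend on how costly mistakes are in your application.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Prediction:
    """One model output. Hypothetical structure, for illustration only."""
    item_id: str
    label: str
    confidence: float  # model's estimated confidence in `label`, in [0, 1]


def route_by_confidence(
    predictions: List[Prediction], threshold: float = 0.8
) -> Tuple[List[Prediction], List[Prediction]]:
    """Split predictions into (automated, manual_review).

    Predictions at or above `threshold` are accepted automatically;
    everything else is queued for a person to check. The default of 0.8
    is an assumed value, not a recommendation.
    """
    automated: List[Prediction] = []
    manual_review: List[Prediction] = []
    for p in predictions:
        if p.confidence >= threshold:
            automated.append(p)
        else:
            manual_review.append(p)
    return automated, manual_review


if __name__ == "__main__":
    # Made-up example data.
    preds = [
        Prediction("msg-1", "spam", 0.97),
        Prediction("msg-2", "not_spam", 0.55),
        Prediction("msg-3", "spam", 0.81),
    ]
    auto, manual = route_by_confidence(preds)
    print("automated:", [p.item_id for p in auto])
    print("manual review:", [p.item_id for p in manual])
```

In practice the scores a model reports are not always well calibrated, so it is worth checking on held-out data that a reported confidence of, say, 0.9 really corresponds to being right about 90% of the time before trusting a cutoff like this.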

A corollary of all of the above is that it is very rare for startup companies to have a competitive advantage because of their machine learning algorithms.  If a worldwide concerted effort can only improve Netflix’s algorithm by 10%, how likely is it that 4 people in a startup’s R+D department will have a significant breakthrough?  Modern ML algorithms are the product of thousands of academics and billions of dollars of R+D, and are generally only improved upon at the margins by individual companies.

View Comments

#2 Andres Burgos on 08.20.09 at 8:29 pm

It’s the equivalent of wanting cars to drive themselves. It’s not going to happen for a very very long time. For now let’s focus on building better roads and drivers.

#3 Twitter Trackbacks for cdixon.org / Machine learning is really good at partially solving just about any problem [cdixon.org] on Topsy.com on 08.20.09 at 10:18 pm

[...] First Tweet 4 hours ago cdixon chris dixon Highly Influential Machine Learning is really good at partially solving just about any problem (and bad at fully solving them) http://www.cdixon.org/?p=342 view retweet [...]

#4 Martin on 08.21.09 at 7:57 am

Quite interesting, but I would argue it depends on how you use ML. If it is applied to a problem where nobody has ever used ML before, you can have quite an advantage.

Especially in less mathematical areas, people tend to stay away from advanced methods due to a lack of understanding (on both sides: the application side does not know about the powers of ML, and the ML community has no clue what the application side needs).

#5 chris on 08.21.09 at 8:18 am

Martin – good point. I guess I’m coming from the perspective of the tech startup world where people are generally familiar with ML techniques.

If you have any examples of areas where ML was freshly applied to create an advantage I’d be really interested to hear about them.

#6 Daniel Tunkelang on 08.24.09 at 12:49 pm

Chris, I’m with you. Machine learning is great, but one of the lessons I derive from the Netflix Challenge is that it quickly hits a point of diminishing returns. Rather than focus exclusively on automated methods, it might be a good idea to develop interfaces that draw the most useful information out of people.

More here:

http://thenoisychannel.com/2008/11/21/the-napoleon-dynamite-problem/

#7 Rathan Haran on 08.24.09 at 1:28 pm

What about attending to the problem from an entirely new perspective? It seems like the Netflix challenge participants used a lot of techniques in data mining rather than approaching it from a completely clean slate. I wonder how many teams started in this fashion instead of jumping right into the data.

#8 Ian Ma on 08.29.09 at 4:45 pm

Hi Chris. I was just starting a conversation about that in my forum. I’m building a home for ML folks and I’d love to have you in conversation with us. Please take a look at http://machine-learning.eggsprout.com and think about joining. Thanks for the article — I hope you don’t mind me sharing it on our thread!

–Ian

#9 cdixon.org / To make smarter systems, it’s all about the data on 08.30.09 at 7:07 am

[...] surprising number of real world cases.  It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond [...]

#10 Ramaseshan on 08.31.09 at 2:45 am

I agree with you that the insight lies in the data itself. Most of the time we solve problems using sparse data, or the known data available to solve the problem is limited. Known contexts, social or otherwise, and ML algos may help us put the puzzle pieces together.

#11 Stef Damianakis snd@ne on 09.25.09 at 9:30 am

@ Chris…

I humbly submit Netrics as an example of freshly applying Machine Learning to deliver value and create market advantage.

Also, using the Netflix Challenge as a single data point to flog all of ML is not exactly fair.

The impact and importance of ML will only grow – these are very exciting times!

#12 INDEX // mb - Against Forecasting: A Case for More Agility in Book Publishing on 10.04.09 at 11:56 pm

[...] number of real world cases.  He says “it’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.” This post is well worth your [...]

#13 irene on 02.07.10 at 6:00 pm

In collaboration with an art history PhD candidate, I used various ML suites to see if they could correctly classify ancient Mesopotamian ivory sculptures. They did well. A bit better than 80%.

Interestingly enough, it was the mistakes the algorithms made that led to the most significant discovery. I took all the misclassifications and examined them myself in an Excel spreadsheet. In doing so, I found an intriguing pattern which ended up adding a lot of value to our study.

#14 stealth_reader on 04.19.10 at 4:27 pm

Chris,

WAY wrong directions.

‘Machine learning’ is a junk field because it has no solid rational foundation and no powerful methodology. About all the field is is heuristics and bad applications of cookbook statistics 101.

I worked in the field at the Watson lab in Yorktown Heights and each day had to hold my nose not to upchuck from the stench. Finally I took our central problem, found a solid solution, and published it.

E.g., ‘machine learning’ keeps looking for an ‘algorithm’. They are already lost, digging in the wrong place. By analogy they would look for an ‘algorithm’ to say how to navigate a spacecraft from Earth to a selected spot on a selected moon of Jupiter. Laughable. Instead, start with Newton’s law of gravitation and laws of motion and some ordinary differential equations. The software and computing are there just to do the arithmetic. There’s no ‘algorithm’ (unless you want to count some numerical techniques for solving the differential equations). What the computer science people are missing in ‘machine learning’ is analogous to Newton’s laws.

It’s possible to do much better on the problems being attacked by machine learning, but the computer science community doesn’t know how to proceed. The needed techniques are rock solid but they are quite advanced. Nearly no one, even at the top of research computer science, has the prerequisites because they didn’t take the right courses in grad school. The fields that understand the needed techniques believe that, as research, ‘machine learning’ problems are trivial and that too much is already known.
