There are two ways to make large datasets useful

I’ve spent the majority of my career building technologies that try to do useful things with large datasets.*

One of the most important lessons I’ve learned is that there are only two ways to make useful products out of large datasets. Algorithms that deal with large datasets tend to be accurate 80%-90% of the time at best (an old “joke” about machine learning is that it’s really good at partially solving any problem). Consequently, you either need to accept that you’ll have some errors and deploy the system in a fault-tolerant context, or you need to figure out how to get the remaining accuracy through manual labor.

What do I mean by a fault-tolerant context? If a search engine shows the most relevant result as the 2nd or 3rd result, users are still pretty happy. The same goes for recommendation systems that show multiple results (e.g. Netflix). Trading systems that hedge funds use are also often fault-tolerant: if you make money 80% of the time and lose it 20% of the time, you can still usually have a profitable system.
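To make the trading arithmetic concrete, here is a minimal sketch in Python. The function names and the win and loss sizes are made up for illustration; the only figure taken from the text is the 80% win rate.

```python
def expected_profit_per_trade(win_rate, avg_win, avg_loss):
    """Expected profit per trade, given the win probability and average win/loss sizes."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


def breakeven_loss(win_rate, avg_win):
    """Largest average loss the system can absorb before expected profit reaches zero."""
    return win_rate * avg_win / (1.0 - win_rate)


# Hypothetical numbers: win 1 unit 80% of the time, lose 2 units 20% of the time.
ev = expected_profit_per_trade(win_rate=0.8, avg_win=1.0, avg_loss=2.0)
limit = breakeven_loss(win_rate=0.8, avg_win=1.0)
print(f"expected profit per trade: {ev:.2f}")
print(f"break-even average loss:   {limit:.2f}")
```

With these made-up numbers the expected profit is about 0.4 units per trade. The "usually" matters: at an 80% win rate the system stops being profitable once the average loss grows to roughly 4 times the average win.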

For fault-intolerant contexts, you need to figure out how to scalably and cost-effectively produce the remaining accuracy through manual labor. When we were building SiteAdvisor, we knew that any inaccuracies would be a big problem: incorrectly rating a website as unsafe hurts the website, and incorrectly rating a website as safe hurts the user. Because we knew automation would only get us 80-90% accuracy, we built 1) systems to estimate confidence levels in our ratings so we would know what to manually review, and 2) a workflow system so that our staff, an offshore team we hired, and users could flag or fix inaccuracies.
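The first of those two pieces, using confidence estimates to decide what to review by hand, can be sketched roughly as follows. This is not SiteAdvisor's actual code: the `Rating` class, the `split_for_review` function, the 0.9 threshold, and the example sites are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    site: str
    label: str         # "safe" or "unsafe"
    confidence: float  # 0.0 to 1.0, as estimated by the automated system


def split_for_review(ratings, threshold=0.9):
    """Split ratings into (auto_published, needs_review).

    Ratings at or above the threshold are published automatically; the rest
    go to a manual review queue, least confident first.
    """
    auto = [r for r in ratings if r.confidence >= threshold]
    review = sorted(
        (r for r in ratings if r.confidence < threshold),
        key=lambda r: r.confidence,
    )
    return auto, review


# Hypothetical ratings produced by the automated system.
ratings = [
    Rating("a.example", "safe", 0.99),
    Rating("b.example", "unsafe", 0.62),
    Rating("c.example", "safe", 0.85),
]
auto, review = split_for_review(ratings)
print("published automatically:", [r.site for r in auto])
print("queued for manual review:", [r.site for r in review])
```

The threshold is the knob that trades manual labor against accuracy: raising it sends more ratings to reviewers, lowering it lets more automated errors through.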

* My first job was as a programmer at a hedge fund, where we built systems that analyzed large data sets to trade stock options. Later, I cofounded SiteAdvisor where the goal was to build a system to assign security safety ratings to tens of millions of websites. Then I cofounded Hunch, which was acquired by eBay – we are now working on new recommendation technologies for ebay.com and other eBay websites.

The problem with investing based on pattern recognition

A famous story in artificial intelligence is how the US military developed algorithms to determine whether an image had a tank in it. They used a standard machine learning method: feed the computer a “training set” of photos, some of which had tanks in them and some of which didn’t, and let algorithms identify which features in the photos correlated with a tank being present.

This method worked for a while but then mysteriously stopped working. Since the features the computer identified were embedded in complicated mathematical equations, no one could figure out what it was really doing and therefore why it stopped working. Eventually someone realized that in the training set, all of the images with tanks were taken on a cloudy day, and all the images without tanks were taken on a sunny day. The algorithms had fixated on the most obvious pattern – the color of the sky. When the algorithm was tested on new photos where the weather varied, it was completely flummoxed.
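The failure is easy to reproduce in miniature. The sketch below is a toy, not the military's system: each "photo" is reduced to two invented numbers (sky brightness and a noisy tank-shape score), and the learner is a one-rule classifier that keeps whichever single feature and threshold best fits the training set. All of the data is made up.

```python
def predict(stump, x):
    """Predict True (tank) or False (no tank) from a one-feature threshold rule."""
    feature, threshold, tank_if_below = stump
    return (x[feature] < threshold) == tank_if_below


def accuracy(stump, samples):
    """Fraction of samples the rule classifies correctly."""
    return sum(predict(stump, x) == y for x, y in samples) / len(samples)


def train_stump(samples):
    """Return the (feature, threshold, tank_if_below) rule with the best training accuracy."""
    best, best_acc = None, -1.0
    for feature in range(len(samples[0][0])):
        values = sorted({x[feature] for x, _ in samples})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            for tank_if_below in (True, False):
                stump = (feature, (lo + hi) / 2, tank_if_below)
                acc = accuracy(stump, samples)
                if acc > best_acc:
                    best, best_acc = stump, acc
    return best


# Each sample: ((sky_brightness, tank_shape_score), has_tank). Invented data.
# Training set: every tank photo is cloudy (dark), every non-tank photo is sunny.
train = [
    ((0.20, 0.70), True), ((0.30, 0.40), True),
    ((0.25, 0.80), True), ((0.35, 0.60), True),
    ((0.80, 0.30), False), ((0.90, 0.50), False),
    ((0.85, 0.20), False), ((0.75, 0.45), False),
]
# New photos where the weather varies independently of the tanks.
test = [
    ((0.80, 0.70), True), ((0.90, 0.75), True), ((0.25, 0.65), True),
    ((0.20, 0.30), False), ((0.30, 0.25), False), ((0.85, 0.35), False),
]

stump = train_stump(train)
print("feature chosen:", ["sky_brightness", "tank_shape_score"][stump[0]])
print("training accuracy:", accuracy(stump, train))
print("accuracy on new photos:", accuracy(stump, test))
```

Because brightness separates the training photos perfectly while the tank-shape score is only mostly right, the learner picks brightness. It then gets most of the new photos wrong, since the weather no longer lines up with the tanks.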

It is commonly said that good startup investors develop “pattern recognition” that allows them to identify great entrepreneurs and companies. If you look at the hugely successful startups of the last decade, the founders have many similarities that are easy to observe. When they started, many were male, young, unmarried, computer programmers, dropouts of elite universities, etc. As a result, a lot of investors look for founders with these characteristics. But without an understanding of the deeper reasons these founders succeeded, these observable characteristics could just as well be the color of the sky and not the tanks.

At the level of individual investors, pattern recognition can lead to bad investments and missed opportunities. In the context of markets, it can cause companies and sectors with the “right patterns” to be overvalued, and ones with the “wrong patterns” to be undervalued. In the broader cultural context, it can cause large groups of talented entrepreneurs to be denied access to capital.

The classic scientific method provides a better model for investing. Scientists observe data, notice patterns, develop hypotheses, and then test those hypotheses. Pattern recognition is only a step along the way to developing hypotheses about the underlying cause.

Perhaps dropping out of college shows a strong level of commitment. Knowing computer science was probably a necessary condition for starting a tech company in the past, but no longer is. Being young could mean you are inexperienced enough to pursue bold ideas that more experienced people would consider crazy. I am just speculating – I don’t know why these characteristics are common among past successful founders. But the mere repetition of patterns shouldn’t be satisfactory to anyone who wants to understand and predict the success of startups.

To make smarter systems, it’s all about the data

As this article by Alex Wright in the New York Times last week reminded me, when the mainstream press talks about artificial intelligence – machine learning, natural language processing, sentiment analysis, and so on – it talks as if it’s all about algorithmic breakthroughs. The implication is that building systems significantly smarter than the status quo is primarily a matter of developing new equations or techniques.

What I think this view misses (but I suspect the companies covered in the article understand) is that significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.

Google’s PageRank was probably the greatest AI-related invention ever brought to market by a startup. It was one of very few cases where a new system was really an order of magnitude smarter than existing ones. The Google founders are widely recognized for their algorithmic work. In my opinion, however, their most important insight was to identify a previously untapped and incredibly valuable data source – links – and then build a (brilliant) algorithm to optimally harness that new data source.
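For readers who haven't seen how links become a ranking signal, here is a minimal sketch of the simplified, textbook form of PageRank computed by repeated iteration. It is not Google's production algorithm, and the four-page link graph, the 0.85 damping factor, and the iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank; ranks sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: three pages link to "c", and "c" links back to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Note how little algorithm there is. Nearly all of the signal comes from the link data itself: "c" ranks highest because three pages point to it, and "a" comes next only because "c" links to it.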

Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success. The Netflix Challenge demonstrated that a massive, worldwide effort only improves on an in-house algorithm by approximately 10%. Studies have shown that Naive Bayes is as good as or better than fancy algorithms in a surprising number of real-world cases. It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.

Algorithms are, as they say in business school, “commoditized.” The order-of-magnitude breakthroughs (and companies with real competitive advantages) are going to come from those who identify or create new data sources.