
The farther away the better

A Harvard Business School study out today found that:

VC firms based in San Francisco, Boston, and New York generally return more money on investments outside their local geographies than on investments close to home.

The academics offer an explanation for this:

The paper suggests that this differential could be caused by VC firms using higher hurdle rates for long-distance deals. Such portfolio companies may require a higher level of managerial/monitoring effort, so more thought is given before offering up a term sheet.

Another explanation that many entrepreneurs would give: when you are far away, it’s harder to meddle with the company – telling it to outsource to India, build a Facebook app, or chase whatever the trend du jour is.

Categories: tech

45 replies

  1. You’re right on, Chris. This is exactly what I thought the other day when one of my colleagues pointed me to this post from some smart person out there about what could be done with a Kindle API – an example of a new technology (digital book reading) that all of a sudden delivers all sorts of new data and business opportunities. Check it out:

    http://drop.io/swl/asset/kindle-api-in-pictures-download

  2. Yep, Kindle reader usage data would be quite interesting (assuming it doesn’t violate the privacy policy, etc. – it seems like in aggregate it would be safe to use). I bet authors would love the data too.

  3. Awesome. The insight is that the web is the database – albeit one full of unstructured data. Adding structure to that (we call them washing machine businesses – take in dirty data, spit out clean data) results in many new business opportunities.

  4. That’s the big problem though: the data is unstructured. All these semantic web companies claim that there is this vast amount of info, but then they fall into the trap of trying to invent more algorithms to parse that unstructured data, which seems to be a fool’s game.

    At least with PageRank, links were very structured and easy to parse. Not so with general data on the internet. In my experience, you can get 90% of the power of the internet by simply sucking down the Wikipedia XML dump, which is highly structured compared to a normal web site (a rough sketch of this follows below). Everything else has hugely diminishing returns, especially when that time could instead be put into a really smart interface for your new product that would get it early traction.

    It would be great if there were a new markup in which we used tags to help identify every little bit of content structurally on the web, but I don’t think it’ll ever happen: it’s just too much of a pain in the butt to think about when adding the content, and there has yet to be a killer app that would make me want to spend the time to put “rel” attributes on all my links and sprinkle in invisible semantic tags for the benefit of someone other than my direct audience.
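
    A minimal sketch of the “suck down the Wikipedia XML dump” idea above, in Python with only the standard library; the dump filename is a placeholder and the parsing is simplified:

        # Minimal sketch: stream pages out of a Wikipedia XML dump without
        # loading the whole multi-gigabyte file into memory.
        # Assumes a local copy of the pages-articles dump (path is a placeholder).
        import bz2
        import xml.etree.ElementTree as ET

        DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # placeholder path

        def iter_pages(path):
            """Yield (title, wikitext) pairs from a MediaWiki export dump."""
            with bz2.open(path, "rb") as f:
                title, text = None, None
                for event, elem in ET.iterparse(f, events=("end",)):
                    tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                    if tag == "title":
                        title = elem.text
                    elif tag == "text":
                        text = elem.text or ""
                    elif tag == "page":
                        yield title, text
                        elem.clear()  # free memory as we stream

        for i, (title, text) in enumerate(iter_pages(DUMP_PATH)):
            print(title, len(text))
            if i >= 4:  # just peek at the first few pages
                break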

  5. Kindle is a good one. Also, a friend of mine who works at Mozilla on Firefox was telling me about the crazy amount of data they are capturing/will capture when you use Firefox. Much more advanced than basic URL history. That could be a great feed for systems that automatically determine your interests and try to solve the “firehose” problem with most social networks today.

  6. Actually, I don’t think the main impediment to smarter systems is more data. The real problem is the terrible interface.

    Right now, the interface for using machine learning is to go talk to a scientist. BAAD interface. Business people are not sure what technologies are applicable to their problem, or how machine learning can empower them. Users are confused by the interface to the artificial intelligence, which is either daunting or oversimplified.

    This is likely to remain the case for the foreseeable future. Until natural language processing improves enough that natural language communication with computers is a viable interface, the real gains will be seen by companies that can creatively and naturally present machine learning to everyday users.

    For example, look at the WSJ story “My TiVo Thinks I’m Gay”: (http://online.wsj.com/article_email/SB103826193…)

    Mr. Iwanyk, 32 years old, first suspected that his TiVo thought he was gay, since it inexplicably kept recording programs with gay themes. A film studio executive in Los Angeles and the self-described “straightest guy on earth,” he tried to tame TiVo’s gay fixation by recording war movies and other “guy stuff.”

    “The problem was, I overcompensated,” he says. “It started giving me documentaries on Joseph Goebbels and Adolf Eichmann. It stopped thinking I was gay and decided I was a crazy guy reminiscing about the Third Reich.”

    The problem is that the workings of the system are opaque. This example underscores the importance of having an interface that communicates with the user and attempts to convey its interpretation of the user’s intent. The companies with the real competitive advantages are not those that can eke out 5% more accuracy in topic models; they’re the companies that can figure out how to make topic models useful and simple for everyday users.
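
    A toy sketch of the “make topic models legible” point, using scikit-learn; the viewing-history documents below are invented for illustration:

        # Minimal sketch: surface a topic model's internal state as something a
        # user can read, instead of an opaque recommendation score.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        viewing_history = [
            "war documentary tanks battle history",
            "history documentary third reich archive",
            "comedy sitcom romance wedding",
            "romance comedy date wedding sitcom",
        ]

        vec = CountVectorizer()
        X = vec.fit_transform(viewing_history)

        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topics = lda.fit_transform(X)

        words = vec.get_feature_names_out()
        for k, topic in enumerate(lda.components_):
            top = [words[i] for i in topic.argsort()[-4:][::-1]]
            # Showing the top words is the "explanation": the user can see *why*
            # the system thinks they like a theme, and correct it.
            print(f"Topic {k}: {', '.join(top)}")

        print("Latest recording leans toward topic",
              doc_topics[-1].argmax(), "-", doc_topics[-1].round(2))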

  7. Yes. But I like Chris’s corollary to BW’s axiom:

    1. Axiom: structure unstructured data.
    2. Corollary: collect new pools of data to create competitive advantage (Chris, maybe you can improve on this).

    Chris, good move to get Disqus rolling.

  8. Patrick – I agree that the lack of structure on the web is an impediment, and I’m skeptical that new algorithms by themselves will fix this. We either need more structure or complementary data sources to help structure it.

  9. “What I think this view misses (but I suspect the companies covered in the article understand) is that significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.”

    Hi Chris, the companies are indeed very aware of the centrality of data. You and your readers may want to check out this recent article by three of Google’s research scientists:

    “The Unreasonable Effectiveness of Data”

    http://googleresearch.blogspot.com/2009/03/unreasonable-effectiveness-of-data.html

    http://www.computer.org/portal/cms_docs_intelligent/intelligent/homepage/2009/x2exp.pdf

  10. Fascinating, thanks. I will definitely read it. I should mention that my views are heavily influenced by my good friend Michael Kearns, who I know is friends with Fernando, among others, so much of what I’m saying might be coming indirectly from these authors.

  11. In addition to UI (which is the user experience with the ML), there is I think also the business interface with the ML.

    Improved solutions to existing problems might not have nearly as large an impact as figuring out “new” problems that can be solved using ML.

  12. That’s a fair point. I’m coming from the perspective of the technology startup world, where you often see companies attacking the same problems and sometimes claiming they’ve made a purely algorithmic breakthrough. I’m skeptical of that.

  13. In many applications, it is not clear how structuring data can be monetized. Many consumers of your structured data will not initially be sure that they can monetize and pay for whatever they do with your information. So if you charge them up front, you lose many potential customers. But if you delay charging them, you might fail to monetize your offering. How do you solve this conundrum?

  14. Chris – totally agree with you on this. There’s only one thing that really bothers me about the Netflix competition (which I imagine is not a unique question): what was it about the 10% threshold? So many people and teams got to 8 and 9% relatively quickly, but the last year or two of the competition was dominated by incremental progress toward the 10% mark. Do you think the results would have been different if they had set the threshold at 15%? Do you think the Netflix team had some unique insight that there was some magic boundary around 10%?

  15. Right on, Chris. We’re clearly witnessing an increase in the volume of data (doubling every 18 months), local data structuring (social graphs, geo graphs, genome graphs, brain graphs, body system graphs, energy graphs, real-time robot vision / environment graphs, etc.), and combinatorial/macro data structuring (e.g. Twitter + Google Maps mashups), which is clearly adding to the capabilities of what we’ve come to label AI. AI exists for a purpose, a specific function or task. Just as a basic lifeform needs relevant environmental information to increase its chances of success, AI functions best with access to the richest, most rapidly computable, system/task-relevant data; e.g. robots that can navigate the DARPA road challenge need maps, real-time road/environment sensors, the ability to sense and determine the meaning of signs, etc.

    Circling back to the original point, the algorithm is just part of the AI – the other part is an environment of structured data. Intelligence arises from the interplay of the two and depends on the system context. So we can expect the algorithms that most effectively draw on the best data available to them for given tasks to be most successful – that means an ongoing rise of AI-ish bots tailored for, and carefully tuned to, new data environments, increasingly capable of performing more complex tasks. Clearly there’s an expanding market for these (search being a huge part of that), as Norvig and company have realized.

    When considering complementary data sources and the drive to increase intelligence in the system, it’s occurred to me that we generally appear to be trending toward the super-structuring of all data (the everything graph), or total system quantification. By cross-referencing different rich data sets, we can interpolate value and push toward quantification / state closure, generating much value and “intelligence” along the way. If it becomes understood that this process is making our system smarter, then data may continue to centralize and be drawn together for certain higher uses, thus further commoditizing data structures, algorithms, and combinations thereof.

    Related articles that explore these thoughts:

    http://www.memebox.com/futureblogger/show/1591-

    http://memebox.com/futureblogger/show/1518-inte

    http://socialnode.blogspot.com/2009/06/simulati

  16. I have my doubts, Chris. What you say is invariably true for a certain class of problems and tasks (finding popular recommendations on Amazon, finding home pages on Google). But by biasing your algorithms to large data, you might make other classes of problems even more difficult. Rather than repeat all the arguments, let me point you to a couple of places where I wrote about it a few months ago:

    http://irgupf.com/2009/04/09/retrievability/

    http://irgupf.com/2009/04/23/retrievability-and-prague-cafes/

    http://irgupf.com/2009/04/09/large-data-versus-limited-applicability/

    In a nutshell, large data allows you to solve certain types of problems well, but may end up making other types of problems much more difficult, if all you have is naive Bayes on top of that data making your inferences.

  17. I think data is like the height of NBA players. It’s very hard to be a pro at 5 feet tall, but that doesn’t mean the tallest player is the best; in fact it’s seldom the case. Same with data. You need enough of it to make things interesting, but the idea that Google, for example, is dominant because it has more data than numbers 2 through 10 is absurd. At some point it’s not who has more, it’s what you do with it.

  18. A very clever and useful insight. Almost a perfect blog post to me. An archetype for the form. The Google example is great. Got me thinking.

  19. Eran – I was speaking about Google circa 1998. At the time, the insight of including links and anchor text really did make their search engine vastly better. All search engines use that data today, so that advantage is gone. Today the biggest advantages in search probably come from years of devising “bags of tricks” – lots of little algorithms that collectively yield a better experience.
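
    For reference, the link data alone already supports a simple algorithm: a toy PageRank power iteration over a made-up four-page link graph, as a sketch.

        # Minimal sketch of the 1998-era insight: treat links themselves as data.
        # A tiny PageRank power iteration over a hypothetical four-page graph.
        links = {
            "a": ["b", "c"],
            "b": ["c"],
            "c": ["a"],
            "d": ["c"],
        }

        damping = 0.85
        pages = list(links)
        rank = {p: 1.0 / len(pages) for p in pages}

        for _ in range(50):  # power iteration until (roughly) converged
            new_rank = {p: (1 - damping) / len(pages) for p in pages}
            for page, outlinks in links.items():
                share = rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += damping * share
            rank = new_rank

        for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
            print(page, round(score, 3))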

  20. Isn’t there a false dichotomy in the post? Better algorithms will confer functional benefits, while new data sources will increase the range of their coverage – depth and breadth, respectively. One or both approaches may be useful for different use cases. Indeed, data and code can be interdependent. If the quantity of data increases, say, while its quality (according to some specified requirement) simultaneously deteriorates, the net gain could be negative unless the algorithm can be altered to compensate.

    It is good to hark back to Google in 1998, or to the nascent WWW ten years earlier, and to think about what it would take to make another radical improvement in information management. My view is that the next generational shift will be ubiquitous semantic tagging of public data by the publisher. These tags will be interpreted using consistent, open algorithms, but they will be interpreted subjectively by each subscriber, according to private data unevenly distributed across the system.
    The high cost of creating tags is an empirical observation: true in respect of the Semantic Web and no doubt other systems, but not a universal law. When the requirement for objectivity is dropped, semantic tags with good-enough efficacy can be created at very low marginal cost.

  21. Right on, Chris. Because many startups believe in changing the world, there’s a tendency toward believing that there’s gotta be a breakthrough algorithm to structure data in a useful way. The problem is… there’s really not much to apply the algorithms against.

    Whereas in Web 1.0 it was onerous for publishers to generate data, Web 2.0 catapulted the rate of data generation. The big outstanding problem is that user-generated data cannot really be structured, due to 1) lack of structure, 2) lack of cleanliness, and 3) lack of relevance – at least if the goal is to systematically digest what human beings are really trying to convey. Much user content is indeed useful, but there isn’t much money floating around to buy it.

    The semantic web never delivered on its promise because people keep banging their heads against structured vs. unstructured data, when the real problem is that structuring inconsistent, meaningless, and sometimes garbage data doesn’t really create value for the ultimate data consumer.

    If I were to aggregate all the world’s information (cost aside) and structure the data somehow, would I be able to answer all questions in the universe? I have doubts. Who knows?

  22. I agree it’s a bit of a slippery slope between data and algorithms. You could create an algorithm that creates a new data source from an existing one. But I bet that if someone has a breakthrough doing it, the algorithm itself won’t be as interesting as the data source they identified.

    Re semantic tagging – if publishers were to ubiquitously start doing so, that would qualify as a massive new data source in my way of thinking. People have been talking about that for years, but right now there is no real incentive for publishers to do it. Maybe if Google made it help your SEO or something, people would start to care.
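
    A tiny sketch of what harvesting that hypothetical data source could look like if rel attributes were widespread; the sample HTML is invented, and only Python’s standard library is used:

        # Minimal sketch: if publishers did add rel/semantic attributes broadly,
        # harvesting them as a data source would look something like this.
        from html.parser import HTMLParser

        class RelHarvester(HTMLParser):
            """Collect (rel, href) pairs from anchor tags."""
            def __init__(self):
                super().__init__()
                self.tagged_links = []

            def handle_starttag(self, tag, attrs):
                if tag != "a":
                    return
                attrs = dict(attrs)
                if "rel" in attrs and "href" in attrs:
                    self.tagged_links.append((attrs["rel"], attrs["href"]))

        sample_html = """
        <p>Read my <a rel="tag" href="/topics/machine-learning">ML posts</a>
        or my <a rel="author" href="/about">bio</a>.</p>
        """

        harvester = RelHarvester()
        harvester.feed(sample_html)
        print(harvester.tagged_links)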

  23. “If I were to aggregate all the world’s information (cost aside) and structure the data somehow.” One problem is that 99% of the “information” (speaking in the broadest sense) is in people’s heads, out in nature, etc. – not in digitally accessible form.

  24. Interesting point about the links. I never really thought about it before. More and more metadata will begin to appear around the web, which means that the systems that “understand” the data will be able to do new and more powerful things. Similar to how last.fm can know which person is most like me – a “musical neighbor.” There was never a source/database of listened and liked tracks before, but once you have it you can do things like this. Very interesting post.
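
    A toy sketch of the “musical neighbor” computation once listen counts exist as a data source; the play counts below are invented:

        # Minimal sketch: with per-user listen counts available, finding the
        # most similar user is a few lines of cosine similarity.
        import math

        plays = {
            "me":    {"radiohead": 40, "bjork": 25, "aphex twin": 10},
            "alice": {"radiohead": 35, "bjork": 30, "portishead": 5},
            "bob":   {"metallica": 50, "slayer": 20},
        }

        def cosine(u, v):
            common = set(u) & set(v)
            dot = sum(u[t] * v[t] for t in common)
            norm_u = math.sqrt(sum(x * x for x in u.values()))
            norm_v = math.sqrt(sum(x * x for x in v.values()))
            return dot / (norm_u * norm_v)

        me = plays["me"]
        neighbor = max(
            (name for name in plays if name != "me"),
            key=lambda name: cosine(me, plays[name]),
        )
        print("musical neighbor:", neighbor)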

  25. To illustrate the incredible subtlety in the interplay of system components, let me paint a metaphor from nature. It is not the only possible mapping between these two domains, but it serves my current purpose.
    The algorithm is a copying mechanism; data encoding is DNA / RNA; information is the array of working combinations of encoded data; application processes are organisms; applications are species; communication (pub/sub) is natural selection; the ecosystem is the ecosystem.
    Living organisms are incented to survive and replicate. Likewise the aim of a publisher is to communicate – deeply, broadly and for a long time. SEO happens to be a powerful form of communication at present. Certainly, a breakthrough in this area will need to get established in some existing niche. Long term though it need not be sustained by currently extant forms of communication.

  26. Well said. The big fact about “data” is that if it is not “whole” then it tends to be dangerous in terms of the predictions it produces – the predictions could look awesome on the face of it, but have a propensity to be as wacky as having no data at all.

    I have seen organizations make big mistakes in trying to solve major problems with machine learning (and then resting on their laurels) without considering the fact that not all of the data that influences the outcomes is being sourced, or even thought about.

  27. Chris, first of all, congrats on the blog. It is terrific. And based on the breadth and intelligence of the comments, this has already become a very exciting ecosystem in which to participate.

    While I agree with the thrust of your post, you’ve taken a very horizontal view of the problem. I do not spend time on Google-scale problems, but on much more targeted, vertical solutions to the “big data” problem. By layering domain specificity onto the problem of semantic analysis, many of the pitfalls of NLP and AI become far more manageable. I’m not saying they’re a panacea, and certainly not when trying to solve problems in real time, but they can take you a lot farther than when applying them to horizontal data sets. And yes, tagging rich data and creating additional metadata for analysis holds many of the keys to extracting true meaning from unstructured data sets. I could write on this topic for hours. Thanks for the post.

    Roger

  28. Access to data was one of the things overlooked in the Msft/YHOO search deal. There was a lot of talk about revenue shares and upfront payments, but people forget that Msft now gets a larger source of data to improve its product. Without that query stream (i.e. data) they would never be able to build such intelligent tools for spelling correction, query intent, auto-complete, etc. One (of many) reasons Google has knocked the ball out of the park is access to this data. Their bucket tests in a week probably provide more insights than the other guys get in a quarter.
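
    A toy sketch of how a query stream turns into a spelling corrector – a simplified, Norvig-style approach, not any search engine’s actual system; the query counts below are invented:

        # Minimal sketch: candidates within one edit of the query are ranked by
        # how often users actually typed them. More log data -> better counts.
        from collections import Counter

        query_log = ["britney spears"] * 900 + ["brittany spears"] * 40 + ["britni spears"] * 5
        counts = Counter(query_log)

        def edits1(word):
            """All strings one insert/delete/replace away (simplified, no transposes)."""
            letters = "abcdefghijklmnopqrstuvwxyz "
            splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
            deletes = [a + b[1:] for a, b in splits if b]
            replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
            inserts = [a + c + b for a, b in splits for c in letters]
            return set(deletes + replaces + inserts)

        def correct(query):
            candidates = edits1(query) | {query}
            known = [q for q in candidates if q in counts]
            return max(known, key=counts.get) if known else query

        print(correct("britny spears"))  # -> "britney spears"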

  29. Hey Roger, thanks! Glad to see you here.

    Re domain-specific – I agree, but I think the reason what you say works is precisely that the domain becomes small enough that you can do all sorts of things to fill in the gaps in the data. I think of my last company, SiteAdvisor, as a data company. The way we got from 80% to near 100% was through all sorts of techniques, from hacks to manual processes to integrating other data sets. We couldn’t have used those techniques in a horizontal setting.

  30. Yeah, from what I hear in the rumor mill, search engines today use click data, bounce rates, etc. much more than people suspect. With such a long tail of key phrases entered into search engines, they must have an almost unlimited appetite for more user data to get statistically meaningful tests.

  31. Sure, MS now has all this Yahoo data. And Google has plenty of data too. But what is that data? It’s known-item, factoid retrieval data. There is no exploratory search data in there. There is no recall-oriented search there. So the only way the data can be used is to improve known-item-oriented searching. But that in turn feeds back on itself: when Bing gets better at known-item searching, more people will use it for known-item searching, and then the data they collect pigeon-holes them further into that one, narrow, Google-like information-seeking behavior.

    So it seems to me that the only way out of the constraints imposed upon Bing search is for Bing to come up with cleverer algorithms that do something different and better, despite the gradient the data is pointing it toward.

    If one relies on the data alone, one will not solve a very large range of AI problems. Intelligent algorithms are needed to make those breakthroughs.

  32. “I think the reason what you say works is precisely that the domain becomes small enough that you can do all sorts of things to fill in the gaps in the data.”

    But isn’t that the point? At the end of the day, no matter what the reason, you resorted to clever algorithms, rather than large data. So the thing about data being the only good source of future AI breakthroughs just ain’t true. Relying on large data isn’t “wrong”. It’s just not the universal panacea that you make it out to be.

    The way I see it, there is a small head of large-scale, breadth-loving domains (e.g. the web) and/or tasks (e.g. known-item finding, rather than exploratory search) in which large data is very appropriate.

    At the same time, there is a long tail of medium and small-scale, depth-loving domains (e.g. content-based music search) and tasks (e.g. exploratory search) in which large data does not give you as much as an intelligently-constructed algorithm.

    So what if the only reason you can construct those algorithms is because the domain is well-enough constrained? We know from power-law distributions that the volume (usage, whatever) of tasks and problems in the tail sums up to be equal in magnitude to the head.

    So at the most, you can say that large data will help you make AI breakthroughs in *half* of the open problems. Intelligent algorithms will still be necessary for the other half. imho.

  33. Good points, Jeremy. I think this is also where the line between data and algorithms starts to blur. If you are doing, say, vertical music search, a lot of your “algorithm” will come from ML on music-related corpora, which might include having humans label the data at various points.

  34. Actually, I was driving at something different from having humans label the data. What I was talking about was adjusting the machine learning so that domain-specific knowledge is built into the learning algorithm.

    For example, I had some work a few years ago that used HMMs to label chord information on a set of Beatles tunes. However, instead of using massive amounts of training data on the HMMs, I used zero training data. Zero. Instead, I initialized the HMM using musicologically sensible initial conditions, and then I adjusted the standard HMM EM algorithm so that the B (output probability) matrix did NOT get updated; it stayed fixed (a rough sketch of this setup appears below).

    All of the intelligence was in the algorithm. There was no human labeled data. And the algorithm performed best in class — better than solutions that had been trained on lots of human-labeled data.

    So to say we can’t make smarter algorithms, or that breakthroughs will only come via data, simply doesn’t sit right with me.

    I think what people tend to mean when they say “it’s all about the data” is that there is no point in coming up with better general-purpose machine learning algorithms, i.e. SVMs vs. Gaussian mixture models vs. Markov random fields vs. whatever. If that’s your main point, then I agree.

    But we can also create more intelligent, specialized algorithms by building our own smarts into a general-purpose ML algorithm, thereby making the algorithm smarter. And we can do it without the need for massive amounts of data. Again, imho.

    So I just don’t buy that.
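
    A rough sketch of the frozen-emission-matrix idea described in this comment – not the commenter’s actual code. It assumes hmmlearn’s CategoricalHMM and uses a toy two-state, three-symbol setup in place of real chord templates and audio features:

        # Rough sketch (not the commenter's code): initialize everything from
        # domain knowledge, then let fit() re-estimate only the start and
        # transition probabilities, keeping the emission matrix frozen.
        import numpy as np
        from hmmlearn import hmm

        model = hmm.CategoricalHMM(
            n_components=2,
            params="st",      # EM updates startprob ('s') and transmat ('t') only
            init_params="",   # fit() re-initializes nothing; our values are kept
        )
        model.startprob_ = np.array([0.5, 0.5])
        model.transmat_ = np.array([[0.9, 0.1],
                                    [0.1, 0.9]])
        # Hand-built, "musicologically sensible" emission matrix: never updated.
        model.emissionprob_ = np.array([[0.8, 0.1, 0.1],
                                        [0.1, 0.1, 0.8]])

        # Toy observation sequence of symbol indices (no labels anywhere).
        X = np.array([[0, 0, 1, 0, 2, 2, 2, 1, 2, 0, 0, 0]]).T

        model.fit(X)             # zero labeled training data, as in the comment
        print(model.predict(X))  # decoded hidden-state ("chord") sequence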

  35. Hi Chris, I agree. You might like to check out FluidDB, which is all about using a better data representation to change how we work with information. See http://fluidinfo.com and http://blogs.fuidinfo.com/fluidDB

    I’ve also written about this exact subject a few times. A starting link:

    http://blogs.fluidinfo.com/terry/2007/03/19/why…

    And I talk a little about Alex Wright at http://blogs.fluidinfo.com/terry/2008/01/04/tag…

    Please feel free to get in touch, I’m terry fluidinfo com, and I would be happy to go into more depth / hear more from you, etc.

  36. I’d like to add an orthogonal viewpoint here. Yes, there’s tremendous value in building databases. But it is in the leveraging of existing processing techniques where great value explosions will occur. Take, for instance, the coupling of a dozen software services that each independently produce a 10% improvement in customer purchase action. Combining these independent sources in a novel way can give a multiplicative value add of 1.1^12, or roughly a factor-of-3 improvement (a quick arithmetic check appears below). Now expand this to hundreds or even thousands of independent techniques or connections within a network and you can reveal massive improvements in quality.

    A brain with only a couple of nodes is pretty weak. A mind with billions of nodes and hundreds of billions of connections is capable of advanced conscious connections, creativity, and unpredictable value advancement. The potent software applications of the future will exist and thrive by utilizing the network of APIs optimally in the construction of their databases and decision architectures.

    If your curiosity is piqued but you’re still not convinced, check out some of Kevin Kelly’s swarm and emergence concepts. I really enjoy some of his far-out predictions, and they’re closer than most folks would guess (in the 10–20 year horizon).

    10th popular post, and I give it a 7/10 only because you didn’t tie in network effects.
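
    A quick check of the compounding arithmetic in the comment above:

        # Twelve independent 10% improvements multiply out to roughly a 3x gain.
        n_services, lift = 12, 1.10
        print(lift ** n_services)  # ~3.14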
