
Collective knowledge systems

I think you could make a strong argument that the most important technologies developed over the last decade are a set of systems that are sometimes called “collective knowledge systems”.

The most successful collective knowledge system is the combination of Google plus the web. Of course Google was originally intended to be just a search engine, and the web just a collection of interlinked documents. But together they provide a very efficient system for surfacing the smartest thoughts on almost any topic from almost any person.

The second most successful collective knowledge system is Wikipedia. Back in 2001, most people thought Wikipedia was a wacky project that would at best end up being a quirky “toy” encyclopedia. Instead it has become a remarkably comprehensive and accurate resource that most internet users access every day.

Other well-known and mostly successful collective knowledge systems include “answer” sites like Yahoo Answers, review sites like Yelp, and link sharing sites like Delicious.  My own company Hunch is a collective knowledge system for recommendations, building on ideas originally developed by “collaborative filtering” pioneer Firefly and the recommendation systems built into Amazon and Netflix.

Dealing with information overload

It has been widely noted that the amount of information in the world and in digital form has been growing exponentially. One way to make sense of all this information is to try to structure it after it is created. This method has proven to be, at best, partially effective (for a state-of-the-art attempt at doing simple information classification, try Google Squared).

It turns out that imposing even minimal structure on information, especially as it is being created, goes a long way. This is what successful collective knowledge systems do. Google would be vastly less effective if the web didn’t have tags and links. Wikipedia is highly structured, with an extensive organizational hierarchy and set of rules and norms. Yahoo Answers has a reputation and voting system that allows good answers to bubble up. Flickr and Delicious encourage users to explicitly tag items instead of trying to infer tags later via image recognition and text classification.

Importance of collective knowledge systems

There are very practical, pressing needs for better collective knowledge systems. For example, noted security researcher Bruce Schneier argues that the United States’ biggest anti-terrorism intelligence challenge is to build a collective knowledge system across disconnected agencies:

What we need is an intelligence community that shares ideas and hunches and facts on their versions of Facebook, Twitter and wikis. We need the bottom-up organization that has made the Internet the greatest collection of human knowledge and ideas ever assembled.

The same could be said of every organization, large and small, formal and informal, that wants to get maximum value from the knowledge of its members.

Collective knowledge systems also have pure academic value. When Artificial Intelligence was first being seriously developed in the 1950s, experts optimistically predicted they’d create machines that were as intelligent as humans in the near future. In 1965, AI expert Herbert Simon predicted that “machines will be capable, within twenty years, of doing any work a man can do.”

While AI has had notable victories (e.g. chess), and produced an excellent set of tools that laid the groundwork for things like web search, it is nowhere close to achieving its goal of matching – let alone surpassing – human intelligence. If machines are ever going to be smart (and eventually try to destroy humanity?), collective knowledge systems are the best bet.

Design principles

Should the US government just try putting up a wiki or micro-messaging service and see what happens? How should such a system be structured? Should users be assigned reputations and tagged by expertise? What is the unit of a “contribution”? How much structure should those contributions be required to have? Should there be incentives to contribute? How can the system be structured to “learn” most efficiently? How do you balance requiring up front structure with ease of use?

These are the kinds of questions you might think are being researched by academic computer scientists. Unfortunately, academic computer scientists still seem to model their field after the “hard sciences” instead of what they should be modeling it after: social sciences like economics or sociology. As a result, computer scientists spend a lot of time dreaming up new programming languages, operating system architectures, and encryption schemes that, for the most part, sadly, nobody will ever use.

Meanwhile the really important questions related to information and computer science are mostly being ignored (there are notable exceptions, such as MIT’s Center for Collective Intelligence). Instead most of the work is being done informally and unsystematically by startups, research groups at large companies like Google, and a small group of multi-disciplinary academics like Clay Shirky and Duncan Watts.

Most popular posts

I’ve been trying to set up a “Popular Posts” widget on the sidebar of this blog but have somehow repeatedly failed. So instead I’ll just post them here:

The most important question to ask before taking seed money

The challenge of creating a new category

Man and superman

The new economy

Why content sites are getting ripped off

Software patents should be abolished

Climbing the wrong hill

Google and newspapers: the false choice of opting out

New York City is poised for a tech revival

To make smarter systems, it’s all about the data

The one number you should know about your equity grant

Why you shouldn’t keep your startup idea secret

Ideal first round funding terms

To make smarter systems, it’s all about the data

As this article by Alex Wright in the New York Times last week reminded me, when the mainstream press talks about artificial intelligence – machine learning, natural language processing, sentiment analysis, and so on – they talk as if it’s all about algorithmic breakthroughs.  The implication is it’s primarily a matter of developing new equations or techniques in order to build systems that are significantly smarter than the status quo.

What I think this view misses (but I suspect the companies covered in the article understand) is that significant AI breakthroughs come from identifying or creating new sources of data, not inventing new algorithms.

Google’s PageRank was probably the greatest AI-related invention ever brought to market by a startup.  It was one of very few cases where a new system was really an order of magnitude smarter than existing ones.  The Google founders are widely recognized for their algorithmic work.  Their most important insight, however, in my opinion, was to identify a previously untapped and incredibly valuable data source – links – and then build a (brilliant) algorithm to optimally harness that new data source.
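To make the "links as data" point concrete, here is a minimal sketch of the textbook, simplified form of PageRank computed by repeated iteration over a link graph. It is an illustration only, not Google's production system: the `pagerank` function, the `toy_web` graph, the iteration count, and the way dangling pages are handled are all assumptions made for this example, and the 0.85 damping factor is just the commonly cited default.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Notice that the only input is who links to whom. No page content is examined at all, which is the sense in which the link graph itself is the valuable new data source.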

Modern AI algorithms are very powerful, but the reality is there are thousands of programmers/researchers who can implement them with about the same level of success. The Netflix Challenge demonstrated that a massive, world-wide effort only improves on an in-house algorithm by approximately 10%. Studies have shown that naive Bayes is as good as or better than fancy algorithms in a surprising number of real world cases. It’s relatively easy to build systems that are right 80% of the time, but very hard to go beyond that.
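As a rough illustration of how little code a baseline like naive Bayes needs, here is a minimal sketch of a word-count (multinomial) naive Bayes text classifier with add-one smoothing, using only the Python standard library. The `train` and `predict` helpers and the four toy training examples are made up for this example; a real system would more likely use an existing library and far more data.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label words from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
examples = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch meeting tomorrow", "ham"),
]
model = train(examples)
print(predict(model, "buy cheap money"))
print(predict(model, "meeting tomorrow"))
```

The algorithm is a few dozen lines that anyone can write; what the classifier can actually do depends almost entirely on the examples it is given.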

Algorithms are, as they say in business school, “commoditized.”  The order of magnitude breakthroughs (and companies with real competitive advantages) are going to come from those who identify or create new data sources.

Machine learning is really good at partially solving just about any problem

There’s a saying in artificial intelligence circles that techniques like machine learning (and NLP) can very quickly get you, say, 80% of the way to solving just about any (real world) problem, but going beyond 80% is extremely hard, maybe even impossible.  The Netflix Challenge is a case in point: hundreds of the best researchers in the world worked on the problem for 2 years and the (apparent) winning team got a 10% improvement over Netflix’s in-house algorithm.  This is consistent with my own experience, having spent many years and dollars on machine learning projects.

This doesn’t mean machine learning isn’t useful – it just means you need to apply it to contexts that are fault tolerant:  for example, online ad targeting, ranking search results, recommendations, and spam filtering.  Areas where people aren’t so fault tolerant and machine learning usually disappoints include machine translation, speech recognition, and image recognition.

That’s not to say you can’t use machine learning to attack these non-fault-tolerant problems, but just that you need to realize the limits of automation and build mechanisms to compensate for those limits. One great thing about most machine learning algorithms is you can infer confidence levels and then, say, ship low confidence results to a manual process.
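The idea of shipping low confidence results to a manual process can be sketched in a few lines. This is only an illustration: the `Prediction` record, the `route` function, the 0.9 threshold, and the sample batch are all hypothetical, and in practice the threshold would be tuned against how costly a wrong automatic answer is.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's estimated probability for `label`, 0.0 to 1.0


def route(predictions, threshold=0.9):
    """Split predictions into those handled automatically and those
    sent to a human reviewer, based on the model's confidence."""
    automatic, manual = [], []
    for p in predictions:
        if p.confidence >= threshold:
            automatic.append(p)
        else:
            manual.append(p)
    return automatic, manual


# Hypothetical batch of classifier outputs.
batch = [
    Prediction("msg-1", "spam", 0.99),
    Prediction("msg-2", "not_spam", 0.62),
    Prediction("msg-3", "spam", 0.91),
]
auto, review = route(batch)
print("automatic:", [p.item_id for p in auto])
print("manual review:", [p.item_id for p in review])
```

Raising the threshold sends more work to people and lowers the error rate of what is automated; lowering it does the opposite.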

A corollary of all of the above is that it is very rare for startup companies to ever have a competitive advantage because of their machine learning algorithms. If a worldwide concerted effort can only improve Netflix’s algorithm by 10%, how likely is it that 4 people in an R+D department at a startup are going to have a significant breakthrough? Modern ML algorithms are the product of thousands of academics and billions of dollars of R+D and are generally only improved upon at the margins by individual companies.

password hints, security questions etc are a bad idea, reason #723

As I’ve said before, security questions, password hints etc are a really bad idea.

Today, I was on gap.com and forgot my password. When you put in an email on their login page and click “I forgot my password” they show you your password hint. You can put in anyone’s email address and see that person’s password hint this way. This is a great way for hackers to figure out your password. (How many people just use the password itself as their hint? I bet a lot.)

When I saw my own hint, which I put in a long time ago, I had to chuckle at my obnoxious former self:

[Image: picture-2]

NYTimes gets computer security wrong again

I love the NYTimes and read it every day.

But almost every computer security related article I read in it is just dead wrong. As someone who started and successfully sold a computer security company (SiteAdvisor to McAfee), I feel like this is one area I know something about. (Scary thought: does the NYTimes just happen to be wrong about my area of expertise, or are they wrong about a lot more and this is the only area where I’m able to detect it?)

Today’s poorly researched and flat-out wrong security article claims Macs Aren’t Safer, Just a Smaller Target. The sole piece of evidence comes from a study by Symantec, a company that sells Mac anti-virus software. When your only source has a significant business interest in the “results” of the study, shouldn’t the “newspaper of record” get a second opinion? For example, maybe talk to an operating systems expert, most if not all of whom will tell you the Mac’s Unix-based OS is just a vastly better architecture from a security perspective.

Moreover, as comments on the article point out, the Mac’s market share is big enough now (~10%) that it certainly seems like a reasonable target. In fact, with all the talk of how Macs don’t get viruses, if I were a virus writer today looking to make my name, I’d imagine targeting the Mac would be a far more interesting way to go.

I literally can’t remember the last time I met a techie in CA or NYC who used a PC. At this point using a Mac versus PC in the tech world has become an IQ test, not a preference.

Bad trend of the week: security questions

I’ve noticed more and more websites ask you to enter answers to security questions like “what is your mother’s maiden name” or “where were you born”. This is a very bad trend. Here’s why:

1. Answers to security questions are far more easily guessed than passwords. For example, just guess the biggest US cities for “where were you born?”

2. Security questions are far more easily figured out by clever hackers. See e.g. “Messin’ with Texas, Deriving Mother’s Maiden Names Using Public Records.” V. Griffith and M. Jakobsson. ACNS ’05, 2005 and CryptoBytes Winter ’07.

3. Every time you answer a question like “What city were you born in?”, you spread that important piece of information more widely, thereby increasing the chance that a rogue employee or a data breach exposes valuable information about you. This is particularly bad when the same security questions are shared by sites with no real need for security (e.g. a casual game site) and sites with a strong need for security (e.g. your bank).

My advice the next time you are asked for an answer to a security question is not to answer truthfully but instead to use a “strong” password (“strong” means, roughly, 8 or more letters (with mixed case), numbers and, if allowed, symbols). So when they ask “What city were you born in?”, answer “5ght11YT” or something, effectively turning their security question into a conventional password. And then I’d write that password down on a piece of paper (not on your computer).
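If you would rather not make up the random answer by hand, here is a minimal sketch that generates one with Python’s standard `secrets` module and compares, very roughly, how many guesses each kind of answer allows. The `random_answer` and `entropy_bits` helpers are names invented for this example, the default length of 12 is an arbitrary choice within the “8 or more” rule of thumb, and the figure of 100 candidate cities is purely an assumption to make the comparison concrete.

```python
import math
import secrets
import string

# Mixed-case letters plus digits: 62 possible characters per position.
ALPHABET = string.ascii_letters + string.digits


def random_answer(length=12):
    """Generate a random string to use in place of a truthful answer."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def entropy_bits(alphabet_size, length):
    """Bits of entropy for a uniformly random string of the given length."""
    return length * math.log2(alphabet_size)


answer = random_answer()
print("use this as the answer:", answer)

# Rough comparison, under the stated assumptions.
bits_random = entropy_bits(len(ALPHABET), 8)  # 8 random characters
bits_city = math.log2(100)                    # assumed: one of 100 likely cities
print(f"8 random characters: about {bits_random:.1f} bits")
print(f"truthful birth city: about {bits_city:.1f} bits")
```

Each extra bit doubles the number of guesses an attacker needs, so the gap between the two numbers is far larger than it looks.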

Update:  David Weinberger has some other reasons security questions are dumb.