“Natural languages are adequate, but that doesn’t mean they’re optimal”

Languages are something of a mess. They evolve over centuries through an unplanned, democratic process that leaves them teeming with irregularities, quirks, and words like “knight.” No one who set out to design a form of communication would ever end up with anything like English, Mandarin, or any of the more than six thousand languages spoken today.

“Natural languages are adequate, but that doesn’t mean they’re optimal,” John Quijada, a fifty-three-year-old former employee of the California State Department of Motor Vehicles, told me. In 2004, he published a monograph on the Internet that was titled “Ithkuil: A Philosophical Design for a Hypothetical Language.” Written like a linguistics textbook, the fourteen-page Web site ran to almost a hundred and sixty thousand words. It documented the grammar, syntax, and lexicon of a language that Quijada had spent three decades inventing in his spare time. Ithkuil had never been spoken by anyone other than Quijada, and he assumed that it never would be.

From Utopian for Beginners, an excellent 2012 New Yorker article about constructed human languages.

There are two ways to make large datasets useful

I’ve spent the majority of my career building technologies that try to do useful things with large datasets.*

One of the most important lessons I’ve learned is that there are only two ways to make useful products out of large datasets. Algorithms that deal with large datasets tend to be accurate 80-90% of the time at best (an old “joke” about machine learning is that it’s really good at partially solving any problem). Consequently, you either need to accept that you’ll have some errors and deploy the system in a fault-tolerant context, or you need to figure out how to get the remaining accuracy through manual labor.

What do I mean by fault-tolerant context? If a search engine shows the most relevant result as the 2nd or 3rd result, users are still pretty happy. The same goes for recommendation systems that show multiple results (e.g. Netflix). Trading systems that hedge funds use are also often fault tolerant: if you make money 80% of the time and lose it 20% of the time, you can still usually have a profitable system.
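To make the trading example concrete, here is a rough sketch of the arithmetic. The function name and the sample numbers are made up for illustration; the point is only that an 80% hit rate is profitable as long as the losing trades are not too large compared with the winning ones.

```python
def expected_profit_per_trade(win_rate, avg_win, avg_loss):
    """Average profit per trade, given how often trades win and
    the typical size of a winning and a losing trade."""
    return win_rate * avg_win - (1 - win_rate) * avg_loss


# Hypothetical numbers: wins and losses of equal size, 80% win rate.
equal_sized = expected_profit_per_trade(0.8, avg_win=1.0, avg_loss=1.0)

# Same win rate, but each loss is five times the size of a win.
big_losses = expected_profit_per_trade(0.8, avg_win=1.0, avg_loss=5.0)

print(equal_sized > 0)  # the system makes money on average
print(big_losses > 0)   # the system loses money on average
```

With equal-sized wins and losses, the system earns about 0.6 units per trade on average. At an 80% win rate the break-even point is a loss four times the size of a win, which is why the word “usually” matters.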

For fault-intolerant contexts, you need to figure out how to scalably and cost-effectively produce the remaining accuracy through manual labor. When we were building SiteAdvisor, we knew that any inaccuracies would be a big problem: incorrectly rating a website as unsafe hurts the website, and incorrectly rating a website as safe hurts the user. Because we knew automation would only get us 80-90% accuracy, we built 1) systems to estimate confidence levels in our ratings so we would know what to manually review, and 2) a workflow system so that our staff, an offshore team we hired, and users could flag or fix inaccuracies.
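The confidence-based review idea can be sketched in a few lines. This is not SiteAdvisor’s actual code: the `Rating` class, the `triage` function, and the 0.95 threshold are all hypothetical names and values chosen for illustration. Ratings the system is confident about get published automatically, and everything else goes into a manual review queue with the least certain ratings first.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    site: str
    label: str         # "safe" or "unsafe"
    confidence: float  # estimated chance the label is right, 0.0 to 1.0


def triage(ratings, threshold=0.95):
    """Split ratings into those published automatically and those
    queued for manual review (least confident first)."""
    publish = [r for r in ratings if r.confidence >= threshold]
    review = sorted(
        (r for r in ratings if r.confidence < threshold),
        key=lambda r: r.confidence,
    )
    return publish, review


# Hypothetical sample data.
ratings = [
    Rating("a.example", "safe", 0.99),
    Rating("b.example", "unsafe", 0.70),
    Rating("c.example", "safe", 0.90),
]
publish, review = triage(ratings)
print([r.site for r in publish])
print([r.site for r in review])
```

The threshold is the knob that trades automation against labor: raising it sends more ratings to human reviewers, lowering it publishes more ratings unchecked.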

* My first job was as a programmer at a hedge fund, where we built systems that analyzed large data sets to trade stock options. Later, I cofounded SiteAdvisor where the goal was to build a system to assign security safety ratings to tens of millions of websites. Then I cofounded Hunch, which was acquired by eBay – we are now working on new recommendation technologies for ebay.com and other eBay websites.

The P vs NP problem

One of the great unsolved questions in computer science is the P vs NP problem. It is one of the seven Millennium Prize Problems – if you solve one of them, you get $1 million and become really famous among mathematicians and computer scientists.

Here’s my non-technical interpretation of the essence of the P vs NP problem:

Can every answer that can be feasibly verified also be feasibly calculated?

What I am calling “feasible” is what computer scientists call algorithms that can run in “polynomial” as opposed to “exponential” time.
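A small sketch may help show the gap between verifying and calculating. It uses the subset sum problem (given a list of numbers, find some of them that add up to a target), which is a standard NP-complete problem. The function names and sample numbers are my own choices for illustration. Checking a proposed answer takes a quick pass over it, while the obvious way to find an answer tries every subset, and the number of subsets doubles with each number added to the list.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Check a proposed answer: every chosen number must come from
    the list, and the choices must add up to the target."""
    pool = list(numbers)
    for x in candidate:
        if x not in pool:
            return False
        pool.remove(x)  # each number in the list can be used once
    return sum(candidate) == target


def solve(numbers, target):
    """Brute-force search: try subsets of every size. There are
    2 ** len(numbers) subsets, so this blows up quickly."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return list(subset)
    return None


numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, 9, answer))
```

Brute force is only the naive approach, not a proof that nothing better exists. Whether some polynomial-time `solve` exists for problems like this one is exactly what P vs NP asks.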

There are at least four possible outcomes to the attempts to solve this problem: 1) the current situation continues – no proof of anything is found, 2) P=NP is proved true, 3) P=NP is proved false, 4) it is proved that it’s impossible to prove P=NP to be true or false.

If P=NP were proved true, there would be many serious real-world consequences. Widely used public-key encryption schemes such as RSA rely on the assumption that the prime factors of large numbers can be feasibly verified but not feasibly calculated. If P=NP, that means there would also be feasible ways to calculate prime factors, and hence decrypt codes without their private keys. So if someone does prove P=NP, he or she should probably inform authorities before publishing the proof and all hell breaks loose (thanks Matt for this observation – you could also imagine a lot of conspiracy theories about what happens to scientists who try to prove P=NP..!)
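The factoring asymmetry is easy to see in miniature. The sketch below is a toy, with function names and numbers picked for illustration: checking a claimed factorization is a single multiplication, while the simple search shown here (trial division) has to step through candidate divisors one at a time. Real attacks use much cleverer algorithms than trial division, but no publicly known method factors the several-hundred-digit numbers used in practice in feasible time on ordinary computers.

```python
def verify_factors(n, p, q):
    """Check a claimed factorization with one multiplication."""
    return 1 < p < n and 1 < q < n and p * q == n


def find_factor(n):
    """Trial division: return the smallest factor of n greater
    than 1, or None if n is prime. The number of steps grows with
    the square root of n, which is exponential in n's digit count."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None


n = 101 * 103  # a tiny stand-in for a product of two large primes
p = find_factor(n)
print(p, n // p, verify_factors(n, p, n // p))
```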

Most computer scientists seem to suspect P does not equal NP. MIT computer scientist Scott Aaronson gives informal arguments against P=NP in this entertaining blog post, including this philosophical argument:

If P=NP, then the world would be a profoundly different place than we usually assume it to be. There would be no special value in “creative leaps,” no fundamental gap between solving a problem and recognizing the solution once it’s found. Everyone who could appreciate a symphony would be Mozart; everyone who could follow a step-by-step argument would be Gauss; everyone who could recognize a good investment strategy would be Warren Buffett. It’s possible to put the point in Darwinian terms: if this is the sort of universe we inhabited, why wouldn’t we already have evolved to take advantage of it?

He follows up with a much longer essay (which I found really interesting but ultimately unconvincing) on the philosophical implications of computational complexity (the field of computer science that studies questions like P vs NP).


Bedrock programming

“Bedrock programming” is a phrase used to describe a style of programming that favors building code from the ground up versus reusing existing open-source or proprietary code.

In my first programming job out of college our bosses told us to entirely rebuild our product. The person in charge of the networking layer decided the best way to do this was to write our own low-level networking toolkit, using some new, relatively untested networking techniques. We also wrote our own versions of core Java libraries (because, it was said, the existing ones weren’t sufficiently thread safe). This decision ended up leading to repeated delays and bugs, and a codebase that most of the other employees didn’t understand. It also made it much harder to train new hires and find replacements for departed employees.

A related issue is what is usually called the “bleeding edge” tendency: the desire to use the shiny & new over the older & battle-tested. Lately, I’ve personally been programming with MongoDB and love it. But I’m also an investor in a startup that made Mongo their main production database, and when their Mongo expert left unexpectedly it took them far longer to find a replacement than it would have to find a MySQL expert.

Great programmers are intensely curious and inventive. They love to improve code and try new things. There will always be bedrock and bleeding edge tendencies within strong engineering teams. The key is to have a great VP Engineering/CTO who can balance those tendencies with the reality that talent, money, and time are scarce, especially in startups.

Who should learn to program?

Recently, there’s been a lot of talk in the tech world and beyond about getting more people to learn computer programming. I think this is a worthy goal*, but the question should be considered from various angles.

1. Jobs & the economy. Businesses all over the world need more programmers. Every company I know is hiring engineers (e.g. see this list of NY tech startups). Top programmers can make $100K+ right out of college. Yet there were only about 14,000 computer science (CS) majors last year. Meanwhile about 40,000 people got law degrees even though demand for lawyers has been shrinking. America is suffering from what economists call structural unemployment:  jobs are available but our labor force isn’t trained for those jobs.

2. Programming is a great foundation for a tech/startup career. CS is a great foundation for doing other things in the tech industry, like starting a tech company (although I’d argue that design is an increasingly valuable foundation for web startups). I suspect one of the reasons for the low number of CS majors is that people don’t realize all the non-programming opportunities that are opened up by a background in programming.

3. Programming is an important part of being “culturally literate.” Algorithmic thinking is as fundamental a type of thinking as mathematical thinking. For example, Daniel Dennett convincingly argues that the best way to understand Darwin’s theory of evolution is by thinking of it as an algorithm. (I haven’t read it yet but I’m told the premise of Stephen Wolfram’s A New Kind of Science is that algorithmic methods should be applied much more broadly across the sciences). Teaching algorithmic thinking – which is what CS does – should be a core part of a liberal arts education.

4. Programming is a great activity.  Most people who program describe themselves as entering a mental flow state where they are intensely immersed and time seems to fly by. It feels similar to reading a great book. You also feel great afterwards – it is the mental equivalent of going to the gym.

5. Should non-technical people at tech startups learn to code? This is where I disagree with some of the conventional wisdom. Certainly it is worthwhile learning programming, at least for reasons 3 & 4 above. You should realize, however, that becoming a good programmer takes thousands of hours of practice. I’d also argue that if you are a non-technical person working at a web company, the first thing you should learn is internet architecture (DNS, HTTP, HTML, web servers, databases, TCP/UDP, IP, etc.). Learning some programming is good too, to help relate to technical colleagues. But if your goal is to build a large-scale web service, your time as a non-technical person is better spent recruiting people who have been coding for years.

* Disclosure: I’m an investor via Founder Collective in two companies related to teaching programming:  Codecademy and Hacker School.