The need for wikidata

“We need a wikipedia for data” is a very spot-on post about something every code monkey has spent countless hours struggling with: the massive data sets required to do things even as simple as answering “what’s your address?”.

Even when data is available under a reasonable license, it often suffers from serious quality or discoverability problems. The US Census Bureau publishes map data, but it includes only a small subset of the attributes required for a real mapping product. The Reuters corpus, a standard body of text used in data mining and information retrieval research, requires you to sign two agreements, mail them to some organization, and receive the corpus on CDs via snail mail (what century is this, folks?).

I think all of these barriers to data are holding back innovation at a scale few people realize. The most important part of an environment that encourages innovation is a low barrier to entry. The moment contracts and lawyers are involved, you inherently restrict the set of people who can work on a problem to well-funded companies with a profitable product. Likewise, companies that sell data have to protect their investments, so permitted uses are almost always explicitly enumerated in contracts. The entire system is designed to restrict use of the data to product categories that already exist.