Crunchbase is a great data source, but there are considerable issues with it not covered by the OP. It's not just self-reporting bias; there are also outright mistakes in the data (including typos), miscategorizations, and an inherently limited structure.
What follows may read like way too harsh a criticism of the Crunchbase maintainers...none of its problems are TC's fault; they come from trying to build a service that encompasses such a massive universe. This is not "Oh, what terrible data designers TC has" but "Hey, CB is a great service, but you must know about these limitations before doing an analysis."
(I last scoured the data in October, so I apologize if I bring up anything that has already been fixed.)
One of the overarching problems is that Crunchbase is targeted specifically at startups, with a focus on investment rounds, and yet there's a considerable number of entries for well-established companies, for which there is no relevant data. I suppose it's easy enough to filter the complete database with an inner join (return all companies for which there's at least one investment round).
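Here's a minimal sketch of that inner-join filter, using Python's built-in sqlite3. The table and column names (companies, funding_rounds, company_id, raised_usd) and the sample rows are made up for illustration; the real Crunchbase export is almost certainly laid out differently.

```python
import sqlite3

# Hypothetical two-table layout with made-up rows; not the real Crunchbase schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE funding_rounds (
        id INTEGER PRIMARY KEY,
        company_id INTEGER REFERENCES companies(id),
        raised_usd INTEGER
    );
    INSERT INTO companies VALUES (1, 'StartupA'), (2, 'BigCo'), (3, 'StartupB');
    INSERT INTO funding_rounds VALUES
        (1, 1, 500000), (2, 1, 2000000), (3, 3, 750000);
""")

def companies_with_rounds(conn):
    """Return names of companies that have at least one funding round."""
    rows = conn.execute("""
        SELECT DISTINCT c.name
        FROM companies AS c
        INNER JOIN funding_rounds AS r ON r.company_id = c.id
        ORDER BY c.name
    """).fetchall()
    return [name for (name,) in rows]

print(companies_with_rounds(conn))  # ['StartupA', 'StartupB']
```

BigCo has no rows in funding_rounds, so the inner join drops it; DISTINCT keeps StartupA from showing up once per round.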
Related to that: without itemized valuation rounds, the company valuation field (which is a column in the companies listing) is not very helpful...because it depends on users actively updating that field with each quarterly financial report or what have you. That's clearly not going to happen, so perhaps that valuation field should be moved to an events table where users can list the valuation for the company on a per-quarter/year basis.
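To make the events-table idea concrete, here's a rough sketch, again with sqlite3. The valuation_events table, its columns, and the numbers are all hypothetical; this is just one way such a table could look, not anything CB actually has.

```python
import sqlite3

# Hypothetical events table: one row per (company, date) valuation report.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE valuation_events (
        company_id INTEGER,
        as_of TEXT,            -- ISO date, so string order matches date order
        valuation_usd INTEGER
    );
    INSERT INTO valuation_events VALUES
        (1, '2011-12-31', 10000000),
        (1, '2012-06-30', 25000000),
        (2, '2012-03-31', 4000000);
""")

def latest_valuations(conn):
    """Return (company_id, as_of, valuation_usd) for each company's newest event."""
    return conn.execute("""
        SELECT v.company_id, v.as_of, v.valuation_usd
        FROM valuation_events AS v
        JOIN (
            SELECT company_id, MAX(as_of) AS latest
            FROM valuation_events
            GROUP BY company_id
        ) AS m ON m.company_id = v.company_id AND m.latest = v.as_of
        ORDER BY v.company_id
    """).fetchall()

print(latest_valuations(conn))
```

The upside of this layout is that a stale valuation is visibly stale (you can see its as_of date) instead of sitting in a single column with no history.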
And so on the topic of inflexible structure...the category field is woefully limited...small single-focus companies aren't consistently categorized, and then when you get to the big multifaceted companies...Google and Microsoft...how do you categorize them with a single word? "Internet"? "Software"? "Enterprise"? Tags would be a better fit here.
So the above problems are difficult to solve both internally and for submitters...but here's an example that made me give up (for now) on doing a thorough analysis of CB data:
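A quick sketch of what tags buy you over a single category field. The company names and tags here are invented placeholders, and the dict-of-sets layout is just one assumption about how tags might be stored.

```python
from collections import defaultdict

# Made-up companies and tags, purely for illustration.
company_tags = {
    "MegaCorp": {"internet", "software", "enterprise"},
    "TinyApp": {"mobile", "photo-sharing"},
    "CloudCo": {"enterprise", "software"},
}

def index_by_tag(company_tags):
    """Invert company -> tags into tag -> companies for lookup by tag."""
    index = defaultdict(set)
    for company, tags in company_tags.items():
        for tag in tags:
            index[tag].add(company)
    return index

tag_index = index_by_tag(company_tags)
print(sorted(tag_index["enterprise"]))  # ['CloudCo', 'MegaCorp']
```

A multifaceted company simply shows up under every tag that applies, so nobody has to pick the one "right" word for it.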
Color: http://www.crunchbase.com/company/color-labs
Color is listed as "closing". Yet, AFAIK, it was "acquired"...even TechCrunch's own reporting says so:
http://techcrunch.com/2012/11/19/sources-apple-paid-7-millio...
So I'm not arguing whether Color's acquisition is a "fail" on the same level as shutting down; the point is that in the startup world there is a difference between being acquired and closing, and CB needs to formalize that definition, because such distinctions are among the most important datapoints that a startup DB can provide. The ambiguity/error in Color's listing is worth pointing out because it was a highly watched company, highly mocked, and highly discussed...that its CB status wasn't noticed or fixed is a sort of bellwether for what the case may be for many of the other companies in the DB.
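One way to formalize the distinction, as a sketch only: make status an explicit enum and flag records whose status contradicts their other fields. The Status values, the acquired_by field, and the sample records are all my assumptions, not CB's actual data model.

```python
from enum import Enum

class Status(Enum):
    # Hypothetical status values; CB's real vocabulary may differ.
    OPERATING = "operating"
    ACQUIRED = "acquired"
    CLOSED = "closed"

def inconsistent_statuses(records):
    """Return names of records marked CLOSED that also name an acquirer."""
    return [
        r["name"] for r in records
        if r["status"] is Status.CLOSED and r.get("acquired_by")
    ]

# Made-up records for illustration.
records = [
    {"name": "ExampleCo", "status": Status.CLOSED, "acquired_by": "BigBuyer"},
    {"name": "GoneCo", "status": Status.CLOSED, "acquired_by": None},
    {"name": "BoughtCo", "status": Status.ACQUIRED, "acquired_by": "BigBuyer"},
]

print(inconsistent_statuses(records))  # ['ExampleCo']
```

A check like this wouldn't settle what the right status is, but it would surface the listings that need a human to look at them.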
For a fun data-digging project, do a query for recently started companies with large investment rounds from firms that seem to be one-offs...they are either very interesting companies (i.e. under-the-radar companies that have attracted large sums of money)...or it's just made-up data.
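A rough sketch of that query in plain Python. I'm treating "one-off" as "an investor that appears in exactly one round in the whole dataset", which is my own working definition; the field names, thresholds, and rows are invented.

```python
from collections import Counter

# Made-up rounds; field names are assumptions, not the real CB layout.
rounds = [
    {"company": "AlphaCo", "founded": 2012, "raised_usd": 30_000_000,
     "investors": ["Obscure Capital"]},
    {"company": "BetaCo", "founded": 2012, "raised_usd": 40_000_000,
     "investors": ["Known Ventures"]},
    {"company": "GammaCo", "founded": 2011, "raised_usd": 5_000_000,
     "investors": ["Known Ventures"]},
    {"company": "DeltaCo", "founded": 2005, "raised_usd": 50_000_000,
     "investors": ["Solo Fund"]},
]

def suspicious_rounds(rounds, min_year, min_usd):
    """Recent, large rounds whose investors each appear only once overall."""
    appearances = Counter(inv for r in rounds for inv in r["investors"])
    return [
        r["company"] for r in rounds
        if r["founded"] >= min_year
        and r["raised_usd"] >= min_usd
        and bool(r["investors"])  # skip rounds with no listed investors
        and all(appearances[inv] == 1 for inv in r["investors"])
    ]

print(suspicious_rounds(rounds, min_year=2011, min_usd=10_000_000))  # ['AlphaCo']
```

Whatever comes out of this is a list of things to eyeball, not a verdict; an investor can look like a one-off simply because its other deals were never entered.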
OK, so what's great about CB data, besides its ambition and its generous terms of service? Doing analysis based on investment rounds. I think it's safe to say that it's in startups' best interest to tell the world how much big firms have invested in them so far. And the structure of the investments table forces the inputter to provide useful data, at least compared to the other tables.
To reiterate, this is not a criticism of CB's efforts...just pointing out natural limitations that exist so far. Would love to do a hackathon...or rather, a "cleanathon" to make the data more uniform.