Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grunt work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grunt work is the real work.
When I think about it, in my data-programming work, I'd say about 5% is doing analysis or executing statistical routines. And 95% of my time is spent on finding, cleaning, and properly normalizing data. This applies whether you're a solo researcher or Facebook...think about it: Facebook is a pretty good website, but what it does better than just about anyone is being a platform to collect personal data in a way that...well, causes you to quite willingly give it your personal data.
There was a presentation where Peter Norvig pointed out a data routine that someone had implemented with a naive Bayes classifier, with a comment saying that they'd think of something better...and years later, no one had realized it was still a todo. Norvig said something like "You don't have to be very smart when you have a lot of data."
>Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grunt work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grunt work is the real work.
It's called data munging. Good short article on dataspora about it a while back: