Pretty decent list of links. However, I feel the importance of SQL has been completely downplayed...there are more Hadoop-oriented links than there are SQL ones. Data retrieval and manipulation is where a data scientist will spend 95% of her time, and SQL is still more ubiquitous by far.
Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grudge work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grudge work is the real work.
When I think about it, in my data-programming-related work, I'd say about 5% is doing analysis or executing statistical routines. And 95% of my time is spent on finding, cleaning, and properly normalizing data. This applies whether you're a solo researcher or Facebook...think about it: Facebook is a pretty good website, but where it excels more than just about anyone is as a platform for collecting personal data in a way that...well, causes you to quite willingly give it your personal data.
There was a presentation where Peter Norvig pointed out a data routine that someone had implemented with a naive Bayesian classifier, along with a comment saying that they'd think of something better...and years later, no one had realized it was still a todo. Norvig said something like "You don't have to be very smart when you have a lot of data."
>Dang, I wish I could find the link to this...an HP data scientist wrote a short essay (something like "Intro to Data Science") and said that the proper collection and cleaning of data is often seen as dirty grudge work that has to be done (by someone else, hopefully) before the real groundbreaking work can be done. However, the author said, this dirty grudge work is the real work.
It's called data munging. Good short article on dataspora about it a while back:
Ryan, co-founder of Zipfian Academy here. Completely agree -- data scientists can spend up to 90% of their time cleaning and getting their data in the proper format for analysis. This, plus the emergence of Hive and Pig as dominant higher-level abstractions on top of Hadoop, has made robust SQL skills more important than ever. We have an upcoming blog post specifically focusing on learning SQL and the differences between SQL and HiveQL.