
AFAIK, the data lake is the next step in the evolution of the data warehouse. Instead of storing data in a data mart/data warehouse, the concept of the data lake (as a design pattern) is that you don't need upfront schemas (so unstructured data is supported), and that you get better support for auditing, data governance/democratization, and schema (?) evolution.


I wouldn't say you don't have schemas; rather, you have schema-on-read instead of schema-on-write, and you use an extract-load-transform pattern instead of extract-transform-load. The data is replicated as-is into the data lake and only then do you figure out what to do with it.
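To make schema-on-read concrete, here is a minimal sketch in plain Python. The sample records, the `load` and `read_with_schema` helpers, and the two "views" are all made up for illustration; a real lake would sit on object storage or HDFS rather than an in-memory list.

```python
import json

# Hypothetical raw events, landed in the lake exactly as the sources sent them.
RAW_LINES = [
    '{"user": "a", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "b", "amount": 3, "coupon": "X1"}',
    '{"user": "c"}',
]

def load(lines):
    """The 'L' in ELT: keep every record untouched, no schema enforced."""
    return [json.loads(line) for line in lines]

def read_with_schema(records, schema):
    """Schema-on-read: project and cast fields only at query time."""
    rows = []
    for rec in records:
        row = {}
        for field, cast in schema.items():
            value = rec.get(field)
            # Missing fields become None instead of failing the load.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows

lake = load(RAW_LINES)

# Two consumers apply two different schemas to the same raw records.
billing_view = read_with_schema(lake, {"user": str, "amount": float})
marketing_view = read_with_schema(lake, {"user": str, "coupon": str})
```

With schema-on-write, the second and third records would likely have been rejected or truncated at ingestion. Here they are stored as-is and each reader decides what to do with the gaps.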


Yes, in my mind this is the key to a data lake. Take all your raw data and store it somewhere, then provide ways for people to access and query the raw data.

This means ingestion is faster (no transformation) and you don't throw away any data that you might want later. If multiple teams want to query the same data in different ways, they have the ability to do so. And ideally it prevents data silos, because everyone can stuff their raw data into a master data lake, and each team has access to all the data but is responsible for doing the work to make it look the way they want.

The reality obviously doesn't always match the theory, but schema-on-read and ELT are the easiest ways to get there. Typically this involves some kind of Hadoop-style technology, like Hive or Spark SQL for SQL-based querying, Spark for non-SQL work, etc. But you've always got the raw data and can go back and re-ELT it from the data lake if your needs change.
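A rough sketch of the "re-ELT" point, in plain Python rather than Hive or Spark so it runs anywhere. The event fields and both transform functions are hypothetical; the only thing that matters is that the second transform runs over the same untouched raw data as the first.

```python
import json

# Hypothetical raw page-view events kept in the lake, never modified.
RAW_EVENTS = [
    '{"page": "/home", "ms": 120, "browser": "firefox"}',
    '{"page": "/home", "ms": 80, "browser": "chrome"}',
    '{"page": "/buy", "ms": 300, "browser": "chrome"}',
]

def hits_per_page(raw_lines):
    """First requirement: count hits per page."""
    counts = {}
    for line in raw_lines:
        event = json.loads(line)
        counts[event["page"]] = counts.get(event["page"], 0) + 1
    return counts

def avg_ms_per_browser(raw_lines):
    """Later requirement: average latency per browser, from the same raw data."""
    totals = {}
    for line in raw_lines:
        event = json.loads(line)
        total, n = totals.get(event["browser"], (0, 0))
        totals[event["browser"]] = (total + event["ms"], n + 1)
    return {browser: total / n for browser, (total, n) in totals.items()}

v1 = hits_per_page(RAW_EVENTS)       # the original transform
v2 = avg_ms_per_browser(RAW_EVENTS)  # needs changed: re-run over the raw data
```

Had the pipeline been ETL and stored only the per-page counts, the `ms` and `browser` fields would already be gone by the time someone asked the second question.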


I don't know much about the strict definition, but that's how I use them. I have had several clients that want to analyze data they didn't capture in their schema. I'd say: disk is cheap. Throw everything in there (medical records, events, etc.). If we need it later, we'll fish it out. Ugly, but simple.


Note that a data lake does not necessarily replace the data warehouse, but rather often complements it. As such, you store your raw data from various sources in a centralised data store (Hadoop-like, NoSQL, etc.). From there, you prune, clean, select, and potentially aggregate data that you would like to provide in a quality-controlled way to your business users, in a data warehouse. This data warehouse most often will be a more traditional relational data store (usually some flavour of SQL database), which allows users to select data from a curated, pre-selected slice of the overall data stored in the data lake, and which enables easier integration with common reporting tools, whether more traditional standard reporting tools or self-service BI tools.
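As a rough illustration of that lake-to-warehouse flow, the sketch below cleans some raw records and loads the curated slice into a relational table. The order records, the `curate` helper, and the `orders` table layout are assumptions made up for the example, with an in-memory SQLite database standing in for the warehouse.

```python
import json
import sqlite3

# Hypothetical raw orders as they sit in the lake: inconsistent case,
# a missing total, and an extra field nobody asked for.
RAW_ORDERS = [
    '{"order_id": 1, "region": "EU", "total": "19.90"}',
    '{"order_id": 2, "region": "eu", "total": "5.10"}',
    '{"order_id": 3, "region": "US", "total": null}',
    '{"order_id": 4, "region": "US", "total": "7.00", "debug": true}',
]

def curate(raw_lines):
    """Prune, clean, and select: keep only rows fit for business users."""
    rows = []
    for line in raw_lines:
        rec = json.loads(line)
        if rec.get("total") is None:
            continue  # prune records that fail the quality check
        # Clean the region code and select only the columns we publish.
        rows.append((rec["order_id"], rec["region"].upper(), float(rec["total"])))
    return rows

# The warehouse side: a fixed relational schema, enforced on write.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, total REAL NOT NULL)"
)
warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", curate(RAW_ORDERS))

# What a reporting tool would run against the curated slice.
revenue = dict(
    warehouse.execute("SELECT region, SUM(total) FROM orders GROUP BY region")
)
```

The raw records, including the dropped order and the `debug` field, stay in the lake, so the curation rules can be changed and the warehouse rebuilt later.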



