However, since Singer is built around piping data between applications, your suggestion - to code something that sits between taps and targets - makes perfect sense. The whole "flow" would look like: tap | transform | target.
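For concreteness, here is a minimal sketch of such an in-between step, assuming it just reads Singer messages line by line from stdin, tweaks RECORD messages, and re-emits everything on stdout (the "email" field and the cleanup rule are made up for illustration):

```python
#!/usr/bin/env python3
"""Toy Singer "transformer": sits between a tap and a target on a Unix pipe."""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    msg = json.loads(line)
    # Only touch RECORD messages; SCHEMA and STATE messages pass through untouched.
    if msg.get("type") == "RECORD":
        record = msg.get("record", {})
        # Hypothetical cleanup step: normalize an "email" field if present.
        if isinstance(record.get("email"), str):
            record["email"] = record["email"].strip().lower()
    # Re-emit every message so the downstream target sees the full stream.
    sys.stdout.write(json.dumps(msg) + "\n")
```

You'd then run it as something like tap-something | python transform.py | target-something (command names hypothetical).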
The only thing I'd add to Chris's blog post is that in the workflow we tend to see, most of the transformations are done after loading into the destination. For example, in Redshift the transformations could be defined in SQL or in Python UDFs.
There are a couple reasons why we included schemas in the spec:
- JSON doesn't have a robust set of data types, and specifically lacks a datetime/timestamp type. With a schema, Taps can, for example, denote fields in the JSON that contain datetimes represented as strings, and targets can then convert those to proper datetimes and handle them accordingly (there's a small example after this list).
- Dealing with unstructured or flexibly structured data is hard. Requiring a schema forces a Tap author to think about the structure of the data up front. By validating each data point against a schema, the Tap author should be able to more quickly identify nuances in the data set - like missing fields, nullable fields, mixed-type fields, etc. - and either decide to clean them out of the data (if appropriate) or provide the right schema to inform downstream applications about them. Identifying and handling these problems requires an understanding of the source data set, so it is best done as close to the data source as possible.
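As a rough illustration of both points, here is roughly what a tap's SCHEMA and RECORD messages could look like; the "orders" stream and its fields are invented for the example, and details may differ from any given tap:

```python
import json
from datetime import datetime, timezone

# Hypothetical "orders" stream; field names are invented for the example.
schema_msg = {
    "type": "SCHEMA",
    "stream": "orders",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            # JSON has no timestamp type, so the schema flags this string
            # as a date-time; a target can then parse it into a real timestamp.
            "created_at": {"type": "string", "format": "date-time"},
            # A field the source sometimes omits is declared nullable up front.
            "discount": {"type": ["null", "number"]},
        },
    },
}

record_msg = {
    "type": "RECORD",
    "stream": "orders",
    "record": {
        "id": 1,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "discount": None,
    },
}

print(json.dumps(schema_msg))
print(json.dumps(record_msg))
```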
It's accurate to say that we're more focused on the extraction and loading parts of ETL. In our experience, almost all useful data analysis requires data to be transformed at multiple stages - for example, once to cleanse and normalize the data, and again to aggregate it. Our goal is to leave the data in the rawest form possible without losing accuracy or precision, so that further transformation can occur, likely using SQL. We recognize this isn't perfect for every use case, but our customers love it for getting rapid access to data that would otherwise be locked away in SaaS applications and transactional databases.
I'm an engineer at Stitch. Our approach to transformation is to do just enough to move data from one system to another without losing precision or fidelity. So, we transform datatypes and structures into more appropriate forms for the target system, but we don't have any transformation operators like aggregation or windowing.
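For example, a target might use the schema to turn a date-time string into a native timestamp before loading. A small hypothetical sketch of that kind of coercion (not Stitch's actual code):

```python
from datetime import datetime

def coerce_value(value, field_schema):
    """Convert a raw JSON value into a more target-friendly Python type.

    Only handles the date-time case discussed above; everything else
    passes through unchanged. The schema shape follows JSON Schema conventions.
    """
    if value is None:
        return None
    if field_schema.get("format") == "date-time":
        # Parse the ISO 8601 string the tap emitted into a real timestamp
        # ("Z" is rewritten because older Pythons don't accept it directly).
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return value

# Example: the schema says created_at is a date-time string.
print(coerce_value("2017-01-17T20:32:05Z", {"type": "string", "format": "date-time"}))
```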
We have found that this approach works well both for our users, who prefer to get the rawest possible data, and for the systems we target, like Redshift, which are themselves powerful transformation engines. This gives the user unlimited flexibility for defining transformations, and a full audit trail for understanding how their data has changed.
We are always evolving, though, so if there's a use case that you think requires this approach, I would be eager to hear more about it.
I have no idea what you're talking about. Scanning your docs, I'm no more illuminated.
I've done a lot of ETL, mostly for healthcare.
Yes, engineers should be doing ETL work. Any "workflow engine" that promises patch-cord or visual programming is hooey. At the end of the day, someone somewhere is gonna be writing some code. And it's not the "business analyst" or "subject area expert". No, it's a dev. And all that clever framework stuff is just an angry 800 lb gorilla sitting between her and her work.
ETL is just fancy talk for data processing. Input, processing, output. Copy a string from a source, maybe mangle it a bit, paste that string somewhere else. Extra credit for type awareness, e.g. "oh! that string's a date!". Trophies for logging, alerts, and services which heal themselves.
Do you have any sort of SDK for adding integrations that you do not support?
While this looks super useful if you support all of the integrations someone needs, it seems like the moment that's not the case, they need to maintain a complete ETL pipeline for the data sources you don't support, and their load is only reduced in that they have fewer data sources to maintain.
I work with robertjmoore. I, along with a few other colleagues at our small office, voted for his past three blog posts. Many of those votes probably came from our single office IP address. Many were probably also placed after clicking a direct link to the HN posting. There couldn't have been more than about 10 votes like this, because we don't have many employees. And, in the interest of full disclosure, I don't vote for many HN posts besides those written by authors I know.
Is this the behavior that HN's vote ring detector is trying to discourage? I understand that these things are a slippery slope - but if so, it's too bad, because quality content like robertjmoore's last three posts is getting lost, and I would imagine that other authors at small companies like ours are unwittingly falling into the same trap.
If there have been other posts explaining the DOs and DON'Ts of the HN vote ring detector, I apologize in advance for not having read them.
>Is this the behavior that HN's vote ring detector is trying to discourage?
I hope so. It's great and all that you have 10 people to vote up an article as soon as it's posted... but I don't. Shouldn't the content be voted up based on its own merits vs. how many people you know? An instant 10 votes is a huge unfair advantage.
> if so, it's too bad, because quality content like robertjmoore's last three posts is getting lost
If the problem is the content getting "lost" because you guys are voting up the articles... then stop doing that. If the content is vote-worthy it will get votes.
Additional comment: sure, I get that this wasn't clear to you guys... but come on. On some level you can see how this would be unfair, right?
I would like to continue seeing high-quality content on the front page too. However, upvoting submissions based solely on the author is poor form. Submissions should live or die on their quality alone.
Giving a "boost" to your friend or colleague, even though your intentions are good, is unfair to other submitters.
You're neglecting to factor in bandwidth and EBS usage (because, keep in mind, EC2 micro instances MUST be EBS-backed). 10 GB of outbound data transfer is another $1.10, plus another $0.50-$1.00 in EBS costs per instance.