
How would this work for my 18th century agrarian business?


It's all the same principles


Congrats on the launch! Your site is loading now, albeit a bit slowly :)


Thanks, we're working on it :)


We don't currently have use cases that require heavy transformations (see this blog post I wrote to explain why: https://blog.stitchdata.com/why-our-etl-tool-doesnt-do-trans...).

However, since Singer is built around piping data between applications, your suggestion - to code something that sits between taps and targets - makes perfect sense. The whole "flow" would look like:

$ tap-mydatasource | do-aggregations | target-mytarget

We'd be eager to hear from anyone who tries this approach!
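
For anyone curious, here's a rough sketch in Python of what a do-aggregations step could look like. It reads Singer messages (one JSON object per line) from stdin, passes SCHEMA and STATE messages through untouched, and exposes a hook for each RECORD. The cleanup rule shown is just a placeholder, not anything from the spec:

    #!/usr/bin/env python3
    import json
    import sys

    def transform(record):
        # Placeholder logic: drop null-valued fields. A real aggregation
        # step would buffer records per stream and emit summarized
        # RECORD messages (plus a matching SCHEMA) instead.
        return {k: v for k, v in record.items() if v is not None}

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message.get("type") == "RECORD":
            message["record"] = transform(message["record"])
        # SCHEMA, STATE, and anything else pass through unchanged.
        sys.stdout.write(json.dumps(message) + "\n")

One caveat: if the filter changes the shape of the records, it should also rewrite the corresponding SCHEMA message so the target knows what to expect.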


The only thing I'd add to Chris's blog post is that in the workflow we tend to see, most of the transformations are done after loading into the destination. For example, in Redshift the transformations could be defined in SQL or Python UDFs.


There are a couple reasons why we included schemas in the spec:

- JSON doesn't have a robust set of data types, and specifically lacks a datetime/timestamp type. With a schema, Taps can, for example, denote fields in the JSON that contain datetimes represented as strings, and then targets can convert those to proper datetimes and handle them accordingly (see the sketch after this list).

- Dealing with un-structured or flexibly-structured data is hard. Requiring a schema forces a Tap author to think about the structure of the data up front. By validating each data point against a schema, the Tap author should be able to more quickly identify nuances in the data set - like missing fields, nullable fields, mixed-type fields, etc - and either decide to clean them out of the data (if appropriate), or provide the right schema to inform downstream applications about them. Identifying and handling these problems requires an understanding of the source data set, so it is best done as close to the data source as possible.
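
To make the first point concrete, here's a simplified sketch of how a target might use a schema's "format": "date-time" annotation to turn string fields back into real datetimes. The field names and values are made up for illustration:

    from datetime import datetime

    # A Singer SCHEMA message carries a JSON Schema like this one; the
    # "format" annotation tells the target this string holds a timestamp.
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "updated_at": {"type": "string", "format": "date-time"},
        },
    }

    def coerce_datetimes(record, schema):
        # Convert string fields marked as date-time into datetime objects.
        out = dict(record)
        for field, spec in schema.get("properties", {}).items():
            if spec.get("format") == "date-time" and isinstance(out.get(field), str):
                # Handle the trailing "Z" that RFC 3339 timestamps often carry
                # (datetime.fromisoformat is Python 3.7+).
                out[field] = datetime.fromisoformat(out[field].replace("Z", "+00:00"))
        return out

    row = coerce_datetimes({"id": 1, "updated_at": "2017-01-05T12:34:56Z"}, schema)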


It's accurate to say that we're more focused on the extraction and loading parts of ETL. In our experience, almost all useful data analysis requires data to be transformed at multiple stages - for example, once to cleanse and normalize the data, and again to aggregate it. Our goal is to leave the data in the rawest form possible without losing accuracy or precision, so that further transformation can occur, likely using SQL. We recognize this isn't perfect for every use case, but our customers love it for getting rapid access to data that would otherwise be locked away in SaaS applications and transactional databases.


I'm an engineer at Stitch. Our approach to transformation is to do just enough to move data from one system to another without losing precision or fidelity. So, we transform datatypes and structures into more appropriate forms for the target system, but we don't have any transformation operators like aggregation or windowing.

We have found that this approach works well for our users, who prefer to get the rawest possible data, and the systems we target like Redshift that are themselves powerful transformation engines. This gives the user unlimited flexibility for defining transformations, and a full audit trail for understanding how their data has changed.

We are always evolving, though, so if there's a use case that you think requires this approach, I would be eager to hear more about it.


I have no idea what you're talking about. Scanning your docs, I'm no more illuminated.

I've done a lot of ETL, mostly for healthcare.

Yes, engineers should be doing ETL work. Any "workflow engine" that promises patch-cord or visual programming is hooey. At the end of the day, someone somewhere is gonna be writing some code. And it's not the "business analyst" or "subject area expert". No, it's a dev. And all that clever framework stuff is just an angry 800lb gorilla sitting between her and her work.

ETL is just fancy talk for data processing. Input, processing, output. Copy a string from a source, maybe mangle it a bit, paste that string somewhere else. Extra credit for type awareness, e.g. "oh! that string's a date!". Trophies for logging, alerts, and services which heal themselves.


Do you have any sort of SDK for adding integrations that you do not support?

While this looks super useful if you support all of the integrations someone needs, the moment that's not the case they'd still need to maintain a complete ETL pipeline for the data sources you don't support - their load is only reduced by having fewer sources to maintain.


We do have an API for sending data into the Pipeline; documentation for it can be found here: https://docs.stitchdata.com/hc/en-us/categories/203326787-Im...

Additionally, we'll be releasing a Java client library any day now, with other languages and platforms to follow.
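
If it helps to see the shape of it, here's a rough sketch of a push from Python. The endpoint URL, auth header, and message fields below are assumptions for illustration only - the docs linked above are the authoritative reference:

    import requests

    # All of these values are placeholders; see the Import API docs above
    # for the real endpoint, auth scheme, and message format.
    API_URL = "https://api.stitchdata.com/v2/import/push"  # assumed endpoint
    API_TOKEN = "YOUR_IMPORT_TOKEN"

    message = {
        "client_id": 1234,                # your Stitch client id
        "table_name": "orders",           # destination table
        "sequence": 1483622400,           # monotonically increasing per key
        "action": "upsert",
        "key_names": ["order_id"],
        "data": {"order_id": 42, "amount": 19.99},
    }

    resp = requests.post(
        API_URL,
        json=[message],                   # assume the endpoint accepts a batch
        headers={"Authorization": "Bearer {}".format(API_TOKEN)},
    )
    resp.raise_for_status()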


I added some explanation about this - thanks for the feedback


I fail to see where...


I work with robertjmoore. I, along with a few other colleagues at our small office, voted for his past three blog posts. Many of those votes probably came from our single office IP address. Many were probably also placed after clicking a direct link to the HN posting. There couldn't have been more than about 10 votes like this, because we don't have many employees. And, in the interest of full disclosure, I don't vote for many HN posts besides those written by authors I know.

Is this the behavior that HN's vote ring detector is trying to discourage? I understand that these things are a slippery slope - but if so, it's too bad, because quality content like robertjmoore's last three posts is getting lost, and I would imagine that other authors at small companies like ours are unwittingly falling into the same trap.

If there have been other posts explaining the DOs and DON'Ts of the HN vote ring detector, I apologize in advance for not having read them.


Just my opinions here:

>Is this the behavior that HN's vote ring detector is trying to discourage?

I hope so. It's great and all that you have 10 people to vote up an article as soon as it's posted... but I don't. Shouldn't the content be voted up based on its own merits vs. how many people you know? An instant 10 votes is a huge unfair advantage.

> if so, it's too bad, because quality content like robertjmoore's last three posts is getting lost

If the problem is the content getting "lost" because you guys are voting up the articles... then stop doing that. If the content is vote-worthy it will get votes.

Additional comment: sure, I get that this wasn't clear to you guys... but come on. On some level you can see how this would be unfair, right?


I would like to continue seeing high-quality content on the front page too. However, upvoting submissions based solely on the author is poor form. Submissions should live or die on their quality alone.

Giving a "boost" to your friend or colleague, even though your intentions are good, is unfair to other submitters.


RJMetrics is hiring software development interns in Philly

http://www.rjmetrics.com/jobs


You're neglecting to factor in bandwidth and EBS usage (because, keep in mind, EC2 micro instances MUST be EBS-backed). 10GB of outbound data transfer is another $1.10, plus another $0.50-$1.00 in EBS costs per instance.
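
Back-of-the-envelope, using per-GB rates consistent with those figures (the exact rates here are assumptions):

    # Rough monthly add-ons for one EBS-backed micro instance.
    outbound_gb = 10
    transfer_rate = 0.11        # $/GB outbound, implied by the $1.10 figure
    ebs_volume_gb = 8           # typical root volume size
    ebs_storage_rate = 0.10     # $/GB-month for standard EBS, before I/O charges

    transfer_cost = outbound_gb * transfer_rate      # $1.10
    ebs_cost = ebs_volume_gb * ebs_storage_rate      # $0.80
    print(transfer_cost + ebs_cost)                  # roughly $1.90 extra per instance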

