
Thank you for this article!

I wrote a CSV parser in JavaScript for my site https://www.csvplot.com.

Even the "industry standard" PapaParse JS library was a lot slower than my no-frills implementation, so I thought I was onto something.

Then I read "So You Want To Write Your Own CSV Code" [0]. I realized that if I supported every corner case I would lose my entire speed advantage, which in hindsight is obvious. The fact that newlines can exist inside a column if it is quoted (something like "my \n value with newline") is what caused the most issues. Incidentally, the article also points out that this ruins the possibility of processing each row independently.
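For anyone curious what supporting that corner case ends up looking like, here is a rough sketch of the kind of state machine you need once quoted newlines are allowed (simplified: doubled quotes only, no CRLF handling, no error reporting; the names are mine):

    // Minimal quote-aware CSV tokenizer (sketch). A field wrapped in
    // double quotes may contain commas and newlines, and a literal
    // quote inside it is written as "".
    function parseCsv(text) {
      const rows = [];
      let row = [];
      let field = '';
      let inQuotes = false;
      for (let i = 0; i < text.length; i++) {
        const c = text[i];
        if (inQuotes) {
          if (c === '"') {
            if (text[i + 1] === '"') { field += '"'; i++; } // escaped quote
            else { inQuotes = false; }                      // closing quote
          } else {
            field += c; // newlines and commas are just data in here
          }
        } else if (c === '"') {
          inQuotes = true;
        } else if (c === ',') {
          row.push(field); field = '';
        } else if (c === '\n') {
          row.push(field); rows.push(row); row = []; field = '';
        } else {
          field += c;
        }
      }
      if (field !== '' || row.length > 0) { row.push(field); rows.push(row); }
      return rows;
    }

    // 'a,"multi\nline",c\n' is one row of three fields, not two rows,
    // which is exactly why you can't just split the input on '\n'.
    console.log(parseCsv('a,"multi\nline",c\n'));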

I started reading about highly efficient text parsing, but most of what I could find was dedicated to JSON parsing, with simdjson as one example. I briefly considered using WebAssembly to leverage existing fast implementations, but that never happened. My motivation for this specific side project ran out the minute it was fast enough for my own needs.

[0] http://thomasburette.com/blog/2014/05/25/so-you-want-to-writ...



I wonder if you could speculatively assume there are no line endings inside quoted blocks, go fast on that assumption, and then fall back to a slower method if you detect that the assumption was wrong.

Sort of like branch prediction, where a failed prediction is costly but on average you are right.
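To be concrete, the crudest version of that speculation in JavaScript might look something like this (just a sketch; parseSpeculative and slowParse are made-up names, and a realistic check would need to be smarter than "any quote anywhere"):

    // Speculate: if the input contains no double quotes at all, then no
    // field can be quoted, so every '\n' really ends a row and every ','
    // really ends a field. Otherwise give up and pay for a full parser.
    function parseSpeculative(text, slowParse) {
      if (!text.includes('"')) {
        // fast path: splitting on newlines and commas is guaranteed correct
        return text
          .split('\n')
          .filter(line => line.length > 0)
          .map(line => line.split(','));
      }
      return slowParse(text); // misprediction: fall back to the slow path
    }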


How do you detect it without ruining the performance of the fast path? It seems like you need to make some kind of assumption, perhaps about the number of columns in each row. But then what happens if you need to parse data with a different number of columns in some rows? (Which is not that uncommon.)

The problem with CSV is that many people consuming it don't actually choose that format. They have to parse whatever they are given.


Assume no linebreaks are part of quoted strings.

Split the data on linebreaks.

Process each line in parallel (massive speedup; you can use many cores or even a GPU).

Record "errors", i.e. places where a quoted block doesn't end before the newline.

Remove and reprocess all records containing errors, as well as any records following a record with errors. Repeat until no errors remain. (Rough sketch of the idea below.)
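Here is one way the detection and reprocessing could look in JavaScript, as a single-threaded sketch (the per-line quote counting in pass 1 is the part that could be farmed out to workers or a GPU; speculativeParse is a made-up name, and slowParse is assumed to be any full quote-aware CSV parser that returns an array of rows):

    function speculativeParse(text, slowParse) {
      const lines = text.split('\n');        // assume no quoted linebreaks
      // Pass 1 (parallelizable): count double quotes per line. An odd
      // count means a quoted field opens or closes without its partner,
      // i.e. the speculation failed at that line boundary.
      const counts = lines.map(line => (line.match(/"/g) || []).length);

      // Pass 2 (cheap, sequential): stitch the failed boundaries back
      // together and hand those logical records to the slow parser.
      const rows = [];
      let open = false;   // currently inside an unterminated quoted field?
      let pending = [];   // physical lines that form one logical record
      for (let i = 0; i < lines.length; i++) {
        pending.push(lines[i]);
        if (counts[i] % 2 !== 0) open = !open;
        if (open) continue;                   // record spills past this newline
        const record = pending.join('\n');
        pending = [];
        if (record === '' && i === lines.length - 1) continue; // trailing '\n'
        rows.push(record.includes('"')
          ? slowParse(record)[0]    // quoted fields: reprocess carefully
          : record.split(','));     // fast path for the common case
      }
      return rows;
    }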


Yeah, you kind of fell right into the trap I was talking about. Your approach requires either memory-mapping the CSV data or holding all of it in memory. And to be honest, it's not clear to me that it would end up being faster anyway: there's a lot of overhead buried in there, and it sounds like at least two passes over the data in common cases. And in cases where there are line breaks inside the data, I think your approach will perform very poorly.



