I wonder if you can speculatively assume there are no line ends in quoted blocks and then try to go fast and then fall back to a slower method if you detect that to be the case.
Sort of like branch prediction where a failed prediction is costly but on average you are right.
How do you detect it without ruining the performance of the fast path? Seems like you need to make some kind of assumption — perhaps an assumption about the number of columns in each row. But then what happens if you need to parse data where some row has a different number of columns? (Which is not that uncommon.)
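A minimal sketch of what this might look like, using Python's stdlib `csv` module as the slow path. The assumption here (my own, not something either commenter specified) is that the caller knows the expected column count up front: the fast path treats every physical line as one record, and any row whose column count doesn't match triggers a full quote-aware re-parse — the "misprediction" cost.

```python
import csv
import io

def parse_speculative(data: str, expected_cols: int):
    # Fast path: speculate that quoted fields never contain newlines,
    # so every physical line is exactly one record.
    rows = []
    for line in data.split("\n"):
        if not line:
            continue
        if '"' in line:
            # Quotes on this line: parse just this line quote-aware
            # (still valid under the no-embedded-newline assumption).
            fields = next(csv.reader([line]))
        else:
            fields = line.split(",")
        if len(fields) != expected_cols:
            # Prediction failed (likely a newline inside a quoted
            # field): fall back to a full quote-aware parse of the
            # whole input. This is the costly misprediction path.
            return list(csv.reader(io.StringIO(data)))
        rows.append(fields)
    return rows
```

Note this is exactly the trap discussed below: the fallback re-reads the entire input from the start, so it needs the whole buffer in memory (or memory-mapped), and a misprediction means two passes over the data.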
The problem with csv is that many people consuming it don't actually choose that format. They have to parse what they are given.
Yeah, you kind of fell right into the trap I was talking about. Your approach either requires being able to memory map the CSV data or storing the CSV data in memory. And to be honest, it's not clear to me that your approach would end up being faster anyway. There's a lot of overhead buried in there, and it sounds like it's at least two passes over the data in common cases. And in cases where there are line breaks in the data, your approach will perform very poorly I think.