The end results you get out of Neo4j are great; the problem is the ingestion pipeline (especially for unstructured data), which is very hard to make general purpose.
The ICIJ used a combination of Apache Tika, Nuix, Tesseract, and a bunch of other components when loading data into Neo4j before interrogating it within Linkurious.
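For a sense of the plumbing involved, here's a minimal sketch of what that kind of pipeline looks like, assuming the `tika` and `pytesseract` Python bindings and the official `neo4j` driver; the regex "entity extraction" is a deliberately dumb placeholder, not anything the ICIJ actually ran.

```python
# Illustrative sketch only: roughly the shape of a document-ingestion pipeline.
# Assumes the `tika`, `pytesseract`, `Pillow`, and `neo4j` packages; the URI,
# credentials, and graph schema here are invented for the example.
import re

import pytesseract
from PIL import Image
from neo4j import GraphDatabase
from tika import parser

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def extract_text(path: str) -> str:
    """Pull text out of a document, falling back to OCR for scanned images."""
    if path.lower().endswith((".png", ".jpg", ".tiff")):
        return pytesseract.image_to_string(Image.open(path))
    parsed = parser.from_file(path)  # Tika handles PDFs, Office docs, email, ...
    return parsed.get("content") or ""

def load_document(path: str) -> None:
    """Create a Document node and naively link it to company-like names."""
    text = extract_text(path)
    # Placeholder "entity extraction": anything capitalized ending in Ltd/Inc/S.A.
    companies = set(re.findall(r"\b[A-Z][\w&.\- ]+?(?:Ltd|Inc|S\.A\.)", text))
    with driver.session() as session:
        for name in companies:
            session.run(
                "MERGE (d:Document {path: $path}) "
                "MERGE (c:Company {name: $name}) "
                "MERGE (d)-[:MENTIONS]->(c)",
                path=path, name=name,
            )
```

Everything interesting, and everything that resists being made general purpose, lives in the part that placeholder regex stands in for: per-source parsing quirks, entity extraction, entity resolution, and deduplication.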
It's also worth noting that the Panama Papers data set is riddled with data quality issues (even if that's understandable given the size of the team compared to the scale of the problem).
My first job out of university was with a Palantir competitor (Detica at the time I was hired, then BAE Systems Detica, then finally BAE Systems Analytics or something like that[1]). There's nothing general purpose about either company's platform (nor about the similar offering from SAS). Companies like that just throw a bunch of fresh graduates at the data, who hand-write loads of custom ETL code for every data set. A lot of the time, even the "analytics" are just shitty little pattern matches over tiny subgraphs formed from the data, the vast majority of which are also coded anew by the "analysts"[2] for each data set. Data quality issues are handled with a massive case analysis during ETL.
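To make that concrete, here's a hypothetical example of the genre, run through the Python `neo4j` driver. The labels, relationship types, and property names are invented for illustration, but the shape (a hand-written match over a tiny, fixed subgraph, rewritten from scratch per data set) is faithful to the kind of thing I mean.

```python
# Hypothetical example of such an "analytic": a hand-coded match over a small,
# fixed subgraph shape. Schema, property names, and credentials are invented.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# "Find officers who direct two different companies registered at the same
# address" -- a three-node pattern, typically rewritten for every data set
# because every data set spells :Officer, :Company, and :Address differently.
QUERY = """
MATCH (o:Officer)-[:OFFICER_OF]->(c1:Company)-[:REGISTERED_AT]->(a:Address),
      (o)-[:OFFICER_OF]->(c2:Company)-[:REGISTERED_AT]->(a)
WHERE c1 <> c2
RETURN o.name AS officer, a.line1 AS address,
       collect(DISTINCT c1.name) AS companies
"""

with driver.session() as session:
    for record in session.run(QUERY):
        print(record["officer"], record["address"], record["companies"])
```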
There's really no magic going on in such products — the only part that's really general purpose is the GUI used to view the end results. The rest of it is just a bunch of lowly peons doing a ton of gruntwork to hammer the data into a form that said GUI will accept.
[1]: That last name change was after I'd left. I didn't even make it a full year at the company before my conscience got the better of me and I quit.
[2]: An "analytics" job at Detica was really just a half-step above data entry. It was mind-numbing and soul-eating. There was a very high turnover rate because even new graduates were overqualified for the position, and almost everyone was miserable.
https://neo4j.com/blog/icij-neo4j-unravel-panama-papers/