Yes, the dataset also has three entries for Virginia Giuffre: "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)".
I recently read an observation that people subject to discovery often deliberately misspell key names so that the communication stays under the radar.
LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.
I didn't go down the route of using LLMs for the cleanup, because you run into scale and context issues with larger datasets.
I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and the isolated subgraphs (connected components) that remain are your merge candidates.
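A minimal sketch of the idea, not the library's actual code. To keep it dependency-free it makes two loud substitutions: character-trigram counts stand in for real semantic embeddings, and a brute-force all-pairs loop stands in for Annoy's approximate nearest-neighbour search. The function names, the 0.45 cutoff, and the "Some Other Person" entry are all hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def trigram_vector(name):
    """Bag of character trigrams: a crude stand-in for a real embedding."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def merge_candidates(names, threshold=0.45):
    """Link names whose similarity meets the cutoff, then return each
    connected component with more than one member as a merge candidate."""
    vectors = [trigram_vector(n) for n in names]
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force pairwise comparison (assumption: Annoy would replace this
    # loop with nearest-neighbour queries on a larger dataset).
    for i, j in combinations(range(len(names)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


names = [
    "Virginia L. Giuffre",
    "Virginia Roberts Giuffre",
    "Jane Doe Number 3 (Virginia Roberts)",
    "Some Other Person",  # hypothetical unrelated entry
]
for group in merge_candidates(names):
    print(group)
```

The point of working with components rather than raw pairs is that two variants can end up in the same group through an intermediate name even when they are not directly similar enough to each other. The cutoff needs tuning per dataset: too low and unrelated entities get chained together, too high and obvious variants stay split.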
I wrapped up my code in a little library if you're into this sort of thing.