
"Brad Edwards" and "Bradley Edwards" might be the same individual.


Yes, the dataset also has three entries for Virginia Giuffre, "Virginia L. Giuffre", "Virginia Roberts Giuffre", and "Jane Doe Number 3 (Virginia Roberts)"


I read a recent observation that people subject to discovery often make deliberate typos in key names so that the communication stays under the radar.


Everyone is potentially subject to discovery. Some people are just more aware of it.


Likewise for instances of "Larry" and "Lawrence" Summers... probably a lot of those.


I’m sure some developer/archivist is working on a name authority as we speak.


Great use case for AI: suggesting merges and cleaning up the data.


LLMs are awful for this. I've got a project that's doing structured extraction and half the work is deduplication.

I didn't go down the LLM route for the cleanup, since you run into scale and context issues with larger datasets.

I got into semantic similarity networks for this use case. You can do efficient pairwise matching with Annoy, set a cutoff threshold, and your isolated subgraphs are merger candidates.
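For anyone who wants to see the shape of that idea, here is a rough sketch, not the semnet API. To keep it dependency-free it swaps in brute-force character similarity from the standard library's difflib where the approach above uses embeddings and Annoy's approximate nearest-neighbour search. The function names and the 0.8 cutoff are made up for illustration. Pairs at or above the cutoff become edges, and each connected component with more than one name is a merger candidate.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_candidates(names, threshold=0.8):
    """Return groups of names linked by pairwise similarity >= threshold.

    threshold is an illustrative cutoff, not a recommended value.
    """
    # Union-find over name indices: each root identifies one component.
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; an ANN index such as Annoy would
    # replace this loop on larger datasets.
    for i, j in combinations(range(len(names)), 2):
        if similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect components and keep only those with more than one member.
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Brad Edwards",
        "Bradley Edwards",
        "Virginia Giuffre",
        "Virginia L. Giuffre",
        "Larry Summers",
    ]
    for group in merge_candidates(sample):
        print(group)
```

Plain string similarity may miss nickname pairs like "Larry" and "Lawrence", which is one reason to prefer semantic embeddings for the edge weights.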

I wrapped up my code in a little library if you're into this sort of thing.

github.com/specialprocedures/semnet


Nice looking library! Might try it for one of my own projects.



