However, had the author normalized the instances.json with something like "jq -S", one would end up with a more reasonable 1736 textual changes, which GitHub would have almost certainly rendered.
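Something along these lines (hypothetical filename, assuming jq is available), run before each commit, keeps the serialization stable so only real changes show up:

    # sort keys so re-serialization doesn't produce spurious diffs
    jq -S . instances.json > instances.sorted.json
    mv instances.sorted.json instances.json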
It doesn't help fix GitHub's UI views, but you can use the --tool option to git difftool and configure alternative diff tools in your git config, including something like piping through a pretty-printer or using a (generally much slower) character-based diff rather than a line-based one.
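As a sketch (assuming jq is installed; the "json" driver name is just something I made up), a textconv driver pretty-prints JSON before diffing it locally, and --word-diff gives a rough character-level view with no config at all:

    # .gitattributes
    *.json diff=json

    # .gitconfig
    [diff "json"]
        textconv = jq -S .

    # or, without any config, an approximate character-level diff:
    git diff --word-diff=color --word-diff-regex=.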
Hey, I have been doing the same thing as you for some types of online resources I am interested in for a long time, too. Really nice work! One small thing you might find interesting: in the beginning, I would push the automated commits via my GitHub identity, so they would all be associated with my account as my activity. This annoyed me, but I also couldn't accept the idea of the commits coming from a non-GitHub account (so the username wouldn't be clickable), nor the idea of creating and maintaining a separate set of credentials to push the changes under. I thought about what would be a good default identity I could use, and after some experimentation I found that if you use these credentials, the commits appear as if the native GitHub Actions bot pushed them:
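    # widely shared values for GitHub's own Actions bot identity
    # (41898282 is the github-actions[bot] account's user ID)
    git config user.name "github-actions[bot]"
    git config user.email "41898282+github-actions[bot]@users.noreply.github.com"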
They have the right icon and a clickable username, and it is as simple as just using this email and name. You or someone else might like to do this too, so here's me sharing this neat trick I found.
I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months of data on the quality of the model and forecast published here: https://carbonintensity.org.uk/. Thanks for the inspiration!
One thing I notice is that the diff still requires pretty deep analysis. You need to be able to compare XML or JSON over time.
I keep thinking the real power of git-based data-over-time storage is flatter data structures. Rather than one file, or a dozen or so files, scraped & stored, we could synthesize some kind of directory structure of simple value files - like a Plan 9 / 9P system - that expresses the data, but where changes are more structurally apparent.
I don't know enough about plan9 to understand what you're getting at there.
There's definitely a LOT of scope for innovation around how the values are compared over time. So far my explorations have been around loading the deltas into SQLite in various ways, see https://simonwillison.net/2021/Dec/7/git-history/
Sysfs or procfs on Linux are similar. Rather than having deeply structured data files, let the file system itself express the hierarchy.
Rather than a JSON file with a bunch of weather stations, make a directory with one subdirectory per station, each with lat, long, temp, and humidity files. Let the fs express the structure.
Then when we watch it in git, we can filter by changes to one of these subdirs, for example, or see every time the humidity changes in one. I don't have a good name for the general practice, but trying to use the filesystem to express the structure is the essence.
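A rough sketch of what I mean (the filenames and payload shape here are made up, not from any of the projects above): the scraper explodes a stations JSON blob into one small file per value before committing:

    import json
    from pathlib import Path

    # hypothetical scraped payload:
    # {"YVR": {"lat": 49.19, "long": -123.18, "temp": 11.3, "humidity": 87}, ...}
    stations = json.loads(Path("stations.json").read_text())

    root = Path("stations")
    for name, props in stations.items():
        station_dir = root / name
        station_dir.mkdir(parents=True, exist_ok=True)
        # one small file per property, so `git log -- stations/YVR/humidity`
        # shows exactly when that one value changed
        for key, value in props.items():
            (station_dir / key).write_text(f"{value}\n")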
Yeah, there's definitely a lot to be said for breaking up your scraped data into separate files rather than having it all in a single file.
I have a few projects where I do that kind of thing. My best example is probably this one, where I scrape the "--help" output for the AWS CLI commands and write that into a separate file for each command:
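The general shape of it, very roughly (this is a simplified sketch rather than the real scraper, and it assumes the help text can be captured from stdout), is something like:

    import os
    import pathlib
    import subprocess

    out = pathlib.Path("aws-help")
    out.mkdir(exist_ok=True)

    # stand-in list; the real thing enumerates every subcommand
    for command in ["s3", "ec2", "lambda"]:
        result = subprocess.run(
            ["aws", command, "--help"],
            capture_output=True, text=True,
            # best effort at keeping the help text out of a pager
            env={**os.environ, "AWS_PAGER": "", "PAGER": "cat"},
        )
        (out / f"{command}.txt").write_text(result.stdout)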
I have to agree. I've been playing around with this for a while at https://github.com/outages/ (which includes the bchydro-outages mentioned in the original comment).
While it's easy to gather the data, the friction of analyzing it has always pushed its priority below that of the other datasets I've gathered.
Yeah, this is great. I'm sure many, including me, have used a similar but half-arsed tool for internal use. For me it was tracking changes to a website where all the changes were made through a GUI. I was asked to provide backup / rollback, which I did with git scraping.
I also started, but never finished, a terms-of-service / privacy-statement tracker. I stopped at the boring part where you'd have to find the URLs for thousands of companies and/or engage others to do it.
A fun way to track how people are using this is with the git-scraping topic on GitHub:
https://github.com/topics/git-scraping?o=desc&s=updated
That page sorts repos tagged git-scraping by when they were last updated, which shows which scrapers have run most recently.
As I write this, repos that updated within just the last minute include:
queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions
bbcrss: https://github.com/jasoncartwright/bbcrss
metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history
bchydro-outages: https://github.com/outages/bchydro-outages