
I've been promoting this idea for a few years now, and I've seen an increasing number of people put it into action.

A fun way to track how people are using this is with the git-scraping topic on GitHub:

https://github.com/topics/git-scraping?o=desc&s=updated

That page orders repos tagged git-scraping by most-recently-updated, which shows which scrapers have run most recently.

As I write this, repos that have updated within just the last minute include:

queensland-traffic-conditions: https://github.com/drzax/queensland-traffic-conditions

bbcrss: https://github.com/jasoncartwright/bbcrss

metrobus-timetrack-history: https://github.com/jackharrhy/metrobus-timetrack-history

bchydro-outages: https://github.com/outages/bchydro-outages



Thanks for linking to the topic, that was interesting

As a heads up to anyone trying this stunt, please be mindful that git diff is ultimately a line-oriented operation (yeah, yeah, "git stores snapshots")

For example https://github.com/pmc-ss/mastodon-scraping/commit/2a15ce1b2... is all :fu: because git sees basically the "first line" changed

However, had the author normalized the instances.json with something like "jq -S", one would end up with a more reasonable 1736 textual changes, which GitHub would almost certainly have rendered:

  diff -u \
    <(git ls-tree HEAD^1 -- instances.json | cut -d' ' -f3 | xargs git show --pretty=raw | jq -S) \
    <(git ls-tree HEAD   -- instances.json | cut -d' ' -f3 | xargs git show --pretty=raw | jq -S)
  --- /dev/fd/63 2023-08-10 19:31:03.000000000 -0700
  +++ /dev/fd/62 2023-08-10 19:31:03.000000000 -0700
  @@ -1,6 +1,6 @@
   [
     {
  -    "connections": 5088,
  +    "connections": 5089,


It doesn't help fix the GitHub UI views, but locally you can run git difftool with the --tool option and configure alternative diff tools in your git config, including something that pipes through a pretty-printer or a (generally much slower) character-based diff tool rather than a line-based one.
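
Another option along those lines is a textconv diff driver, which pretty-prints the file before git computes the textual diff (a sketch; the "json" driver name is arbitrary):

  # register a diff driver for *.json files
  echo '*.json diff=json' >> .gitattributes

  # have git run the file through jq (sorted keys) before diffing
  git config diff.json.textconv 'jq -S .'

After that, plain git diff and git log -p show key-level changes instead of one enormous line.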


Hey, I have been doing the same thing for a long time for some types of online resources I am interested in. Really nice work! One small thing you might find interesting: in the beginning, I would push the automated commits via my GitHub identity, so they would all be associated with my account as my activity. This annoyed me, but I also couldn't accept the idea of the commits coming from a non-GitHub account (so the username wouldn't be clickable), nor the idea of creating and maintaining a separate set of credentials to push the changes under. I thought about what would be a good default identity to use, and after some experimentation I found that if you use these credentials, the commits appear as if the native GitHub Actions bot pushed them:

        git config --global user.email "41898282+github-actions[bot]@users.noreply.github.com"
        git config --global user.name "github-actions[bot]"
The commits get the right icon and a clickable username, and it is as simple as just using this email and name. You or someone else might like to do this too, so here's me sharing this neat trick I found.

https://github.com/TomasHubelbauer/github-actions#write-work...
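
For what it's worth, once that identity is set, the whole commit step in a scraping workflow boils down to a few lines (a generic sketch, not any specific repo's workflow):

        git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
        git config user.name "github-actions[bot]"
        git add -A
        # only create a commit when the scraped data actually changed
        git diff --cached --quiet || git commit -m "Latest data"
        git push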


I've been doing this to track the UK's "carbon intensity" forecast and compare it with what is actually measured. I now have several months' data on the quality of the model and forecast published here: https://carbonintensity.org.uk/ . Thanks for the inspiration!

https://github.com/nmpowell/carbon-intensity-forecast-tracki...


One thing I notice is that the diff still requires pretty deep analysis. You need to be able to compare XML or JSON over time.

I keep thinking the real power of git-based data-over-time storage is flatter data structures. Rather than one or a dozen or so files, scraped & stored, we could synthesize some kind of directory structure & simple value files, akin to a plan9/9p system, that express the data but where changes are more structurally apparent.

Thoughts?


I don't know enough about plan9 to understand what you're getting at there.

There's definitely a LOT of scope for innovation around how the values are compared over time. So far my explorations have been around loading the deltas into SQLite in various ways, see https://simonwillison.net/2021/Dec/7/git-history/
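
The basic shape of that approach, using the git-history tool from that post, looks something like this (the database name, file name and ID column here are illustrative, not from a real repo):

  # walk every commit that touched incidents.json and load each version
  # into a SQLite database, tracking rows across versions by their "id" key
  git-history file incidents.db incidents.json --id id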


Perhaps Jaunty is referring to https://en.wikipedia.org/wiki/Venti

Which was also one of the inspirations for Git.


Sysfs or procfs on Linux are similar: rather than having deeply structured data files, let the file system provide the hierarchy.

Rather than a JSON with a bunch of weather stations, make a directory with a bunch of stations as subdirectories, each with lat, long, temp, humidity properties. Let the fs express the structure.

Then when we watch it in git, we can filter by changes to one of these subdirectories, for example, or see every time the humidity changes in one. I don't have a good name for the general practice, but using the filesystem to express the structure is the essence of it.
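
Concretely, something like this could split a scraped stations file into that kind of tree (the stations.json input and the field names are hypothetical):

  # one directory per station, one tiny file per property
  jq -c '.[]' stations.json | while read -r station; do
    id=$(jq -r '.id' <<< "$station")
    mkdir -p "stations/$id"
    for field in lat long temp humidity; do
      jq -r ".$field" <<< "$station" > "stations/$id/$field"
    done
  done

Then git log -p stations/<id>/humidity shows the history of just that one value.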


Oh, I see what you mean.

Yeah, there's definitely a lot to be said for breaking up your scraped data into separate files rather than having it all in a single file.

I have a few projects where I do that kind of thing. My best example is probably this one, where I scrape the "--help" output for the AWS CLI commands and write that into a separate file for each command:

help-scraper/tree/main/aws: https://github.com/simonw/help-scraper/tree/main/aws
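
The scraping side is conceptually very simple; roughly this kind of loop (a sketch of the idea, not the actual script, and with an abbreviated command list):

  # dump the rendered help text for each AWS CLI command into its own file
  for cmd in s3 ec2 lambda iam; do
    aws "$cmd" help | col -b > "aws/$cmd.txt"
  done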

This is fantastically useful for keeping track of which AWS features were added or changed at what point.


I have to agree. I've been playing around with this for a while at https://github.com/outages/ (which includes the bchydro-outages mentioned in the original comment).

While it's easy to gather the data, the friction involved in analyzing it has always pushed it down my priority list, below other datasets I've gathered.


i do this as a demo: https://github.com/swyxio/gh-action-data-scraping

but conveniently it also serves as a way to track GitHub Actions downtime, which used to be bad but seems to have been fine for the last couple of months: https://github.com/swyxio/gh-action-data-scraping/assets/676...


Yeah, this is great. I'm sure many people, including me, have built a half-arsed version of this for internal use. For me it was tracking changes to a website where all the edits were made through a GUI; I was asked to provide backup/rollback, which I did with git scraping.

I also started but never finished a terms-of-service / privacy-statement tracker. I stopped at the boring part, where you'd have to find the URLs for thousands of companies and/or engage others to do it.


wow, this is one of those things where I've thought of the problem many times and the solution makes me go, oh duh.



