Love this! Would be super interested in any details the author could share on the data engineering needed to make this work. The vis is super impressive but I suspect the data is the harder thing to get working.
The most time and energy has gone into getting my head around the source data [0] and its industry-specific nuances.
In terms of stack I have a self-hosted Dagster [1] data pipeline that periodically dumps the data onto Cloudflare R2 as parquet files. I then have a self-hosted NodeJS API that uses DuckDB to crunch the raw data and output everything you see on the map.
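The DuckDB part is roughly this shape (a simplified Python sketch rather than the actual NodeJS API; the bucket, credentials and column names below are placeholders, not the real ones):

```python
# Simplified sketch: DuckDB reading parquet dumps from Cloudflare R2 over its S3 API.
# Endpoint, keys, bucket and columns are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_endpoint = '<account-id>.r2.cloudflarestorage.com'")
con.execute("SET s3_url_style = 'path'")
con.execute("SET s3_access_key_id = '<key-id>'")
con.execute("SET s3_secret_access_key = '<secret>'")

# Crunch the raw parquet dumps into the aggregated rows the map needs.
summary = con.execute("""
    SELECT region, metric, SUM(value) AS total
    FROM read_parquet('s3://my-bucket/dumps/*.parquet')
    GROUP BY region, metric
""").df()
print(summary.head())
```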
My wife's old company, a fairly significant engineering consultancy, ran its entire time/job management and invoicing system from a company-wide, custom-developed Microsoft Access app called 'Time'.
It was developed by a single guy in the IT department and she liked it.
About 5 years ago the company was acquired, and they had to move to their COTS 'enterprise' system (Maconomy).
All staff from the old company had to do a week-long (!) training course on how to use it, and she hates it.
In future I think there will be more things like 'Time' (though presumably not MS Access based!)
> In future I think there will be more things like 'Time' (though presumably not MS Access based!)
That's my assertion - things like 'Time' can be developed by an AI primarily because there's no requirement for an existing community of people to hire from.
It's an example of a small ERP system - no consultants, no changes, no community, etc.
Large systems (Sage, SAP, Syspro, etc) are purchased based on the existing pool of contractors that can be hired.
Right now, if you had a freshly developed system competing with SAP/Syspro, with all the integrations a customer needs, how on earth would they deploy it if they cannot hire people to deploy it?
Not to mention Sage's midline - Sage100 is incredibly cheap and effective for its cost. I mean, it's ridiculous what mature software can do. Everything under the sun, basically, for a pittance.
It's certainly not "SAP 10 million dollar deployments". We rarely see implementations run into six figures for SMB distributors and manufacturing firms. That's less than most of their yearly budget for buying new fleet vehicles or equipment.
I still think MS Access was awesome. In the small companies I worked at, it was used successfully by moderately tech-savvy directors and support employees to manage ERP, license generation, invoices, etc.
The most common gripe was concurrent access to the database file, but I think that was solved by backing the forms with a database accessed over ODBC.
It looked terrible but also was highly functional.
Agreed! The first piece of software I built was a simple inventory and sales management system, around 2000. I was 16 and it was just about my first experience programming.
It was for school, and I recently found the write up and was surprised how well the system worked.
Ever since, I've marvelled at how easy it was to build something highly functional that could incorporate complex business logic, and wished there was a more modern equivalent.
Grist [1] is great for this stuff. At first glance it's a spreadsheet, but that spreadsheet is backed by a SQLite database and you can put an actual UI on top of it without leaving the tool, or write full-blown plugins in JavaScript and HTML if you need to go further than that.
Just another yay for Grist here! I've been looking for an Access alternative for quite a while and nothing really comes close. You can try hacking it together with various BI tools, but nothing really feels as accessible as the original Access. While it's not a 1:1 mapping and the graphical report building is not really there, you can still achieve what you need. It's like Access 2.0 to me.
Access as a front end for MS SQL Server ran great in a small shop. Seems like there was a wizard that imported the Access tables easily into SQL Server.
I've not seen anything as easy to use as the Access visual query builder and drag-n-drop report builder thing.
Agree. Much of the value of devs is understanding the thing they're working on so they know what to do when it breaks, and knows what new features it can easily support. Doesn't matter whether they wrote the code, a colleague wrote it, or an AI.
At the very least, it can quickly build throwaway productivity enhancing tools.
Some examples from building a small education game:
- I needed to record sound clips for a game. I vibe coded a webapp in <15 mins that had a record button and keyboard shortcuts to progress through the list of clips I needed, output the audio for over 100 separate files in the folder structure and with the file names I needed, and wrote the ffmpeg script to post-process the files (a sketch of that kind of script is below)
- I needed JSON files for the path of each letter. Gemini 3 converted images to JSON, and then Codex built me an interactive editor to tidy up by hand the bits Gemini got wrong
The quality of the code didn't matter because all I needed was the outputs.
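The post-processing step was nothing clever; roughly this shape (directory names, output format and filter choices here are illustrative, not the actual script):

```python
# Rough sketch of the kind of ffmpeg post-processing described above.
# Paths, output format and filters are placeholders.
import subprocess
from pathlib import Path

RAW = Path("recordings/raw")
OUT = Path("recordings/processed")

for wav in sorted(RAW.rglob("*.wav")):
    target = OUT / wav.relative_to(RAW).with_suffix(".ogg")
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(wav),
            # Trim leading silence and normalise loudness.
            "-af", "silenceremove=start_periods=1:start_threshold=-50dB,loudnorm",
            str(target),
        ],
        check=True,
    )
```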
Does anyone have direct experience with Claude making damaging mistakes in dangerously-skip-permissions mode? It'd be great to have a sense of what the real-world risk is.
Claude is very happy to wipe remote dbs, particularly if you're using something like Supabase's MCP server. Sometimes it goes down rabbit holes and tries to clean itself up with `rm -rf`.
There is definitely a real world risk. You should browse the ai coding subreddits. The regularity of `rm -rf` disasters is, sadly, a great source of entertainment for me.
I was once playing around, having Claude Code (Agent A) control another instance of Claude Code (Agent B) within a tmux session using tmux's scripting. Within that session, I messed around with Agent B to make it output text that made Agent A think Agent B had rm -rf'd the entire codebase. It was such a stupid "prank", but seeing Agent A's frantic and worried reaction to Agent B's mistake was the loudest and only time I've laughed because of an LLM.
Everywhere I’ve ever worked, there was always some way to access a production system even if it required multiple approvals and short-lived credentials for something like AWS SSM. If the user has access, the agent has access, no matter how briefly.
Supabase virtually encouraged it last year haha. I tried it once and noped out after an hour, when Claude tried to run a bunch of migrations on prod instead of dev.
Claude has twice now thought that deleting the database was the right thing to do. It didn't matter, as it was a local one created with fixtures in the Docker container (in anticipation of exactly such a scenario), but it was an inappropriate way of handling Django migration issues.
One recent example: for some reason, Claude has recently preferred to write scripts in the root /tmp folder. I don't like this behavior at all. It's nothing destructive, but it should be out of scope by default. I notice they keep adding more safeguards, which is great, e.g. asking for permissions, but it seems to be case by case.
If you're not using .claude/instructions.md yet, I highly recommend it; for moments like this one, you can tell it where to shove scripts. The trick with the instructions file is that Claude only reads it at the start of a new prompt, so any time you update it, or Claude "forgets" the instructions, ask it to re-read the file. That usually does the trick for me.
Claude, I noticed you rm -rf'd my entire system. Your .instructions.md file specifically prohibits this. Please re-read your .instructions.md file and comply with it for all further work.
I think that's totally fine for individual work, but in larger data engineering teams it's less good to switch between tools because other people may have to maintain your code.
That said, polars is good, and if the team agree to standardise on it then that's a totally reasonable choice.
I guess one of my reservations is that I've historically been burned by decisions within data eng teams to use pandas, causing all sorts of problems with data typing and memory, and eventually having to rewrite it all. But I accept that polars doesn't suffer from the same problems (and some of them are even mitigated in more recent versions of pandas).
Worse in some ways, better in others. DuckDB is often an excellent tool for this kind of task. Since it can run parallelized reads, I imagine it's often faster than the command-line tools, and with easier-to-understand syntax.
More importantly, you have your data in a structured format that can be easily inspected at any stage of the pipeline using a familiar tool: SQL.
I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. With a code-based solution (reading data into objects in memory), it's much harder to view the data; debugging tools for inspecting objects on the heap are painful compared to being able to JOIN/WHERE/GROUP BY your data.
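A contrived sketch of the pattern (table and column names are hypothetical): each stage is materialised as a DuckDB table you can interrogate with plain SQL while you build or debug.

```python
# Each pipeline stage lands in a DuckDB table that stays inspectable with SQL.
# Table/column names are hypothetical.
import duckdb

con = duckdb.connect("pipeline.duckdb")

# Stage 1: load the raw files.
con.execute("""
    CREATE OR REPLACE TABLE raw_orders AS
    SELECT * FROM read_csv_auto('orders/*.csv')
""")

# Stage 2: clean/transform.
con.execute("""
    CREATE OR REPLACE TABLE clean_orders AS
    SELECT order_id, customer_id, CAST(amount AS DOUBLE) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")

# When something looks wrong, investigate the intermediate stage directly.
print(con.execute("""
    SELECT customer_id, COUNT(*) AS n, SUM(amount) AS total
    FROM clean_orders
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").df())
```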
Yep. It's literally what SQL was designed for. Your business website can be running it… then you write a shell script to also pull some data on a cron. It's beautiful.
Pipes are parallelized when you have unidirectional data flow between stages. They really kind of suck for fan-out and joining though. I do love a good long pipeline of do-one-thing-well utilities, but that design still has major limits. To me, the main advantage of pipelines is not so much the parallelism, but being streams that process "lazily".
On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
Pipelines are indeed one flow, and that works most of the time, but shell scripts make parallel tasks easy too. The shell provides tools to spawn subshells in the background and wait for their completion. Then there are utilities like xargs -P and make -j.
UNIX provides the Makefile as the go-to tool if a simple pipeline is not enough. GNU Make makes this even more powerful by being able to generate rules on the fly.
If the tool of interest works with files (like the UNIX tools do) it fits very well.
If the tool doesn't work with single files, I have had some success using Makefiles for generic processing tasks by making the target a marker file that records that a given task is complete.
Found myself nodding along. I think increasingly it's useful to think of PRs from unknown external people as more like an issue than a PR (kind of like the 'issue first' policy described in the article).
There's actually something very valuable about a user specifying what they want using a working solution, even if the code is not mergeable.
Author here. I wouldn't argue SQL or DuckDB is _more_ testable than polars. But I think historically people have criticised SQL as being hard to test. DuckDB changes that.
I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines. I've seen this through the past five years of writing and maintaining a record linkage library. It generates SQL that can be executed against multiple backends. My library gets faster and faster year after year without me having to do anything, due to improvements in the SQL backends that handle things like vectorisation and parallelization for me. I imagine if I were to try and program the routines by hand, it would be significantly slower since so much work has gone into optimising SQL engines.
In terms of future proof - yes in the sense that the code will still be easy to run in 20 years time.
> I disagree that SQL has nothing to do with fast. One of the most amazing things to me about SQL is that, since it's declarative, the same code has got faster and faster to execute as we've gone through better and better SQL engines.
Yeah, but SQL isn't really portable between all query engines; you always have to be speaking the same dialect. Also, SQL isn't the only "declarative" DSL: polars's lazyframe API is similarly declarative, and technically Ibis's dataframe DSL also works as a multi-frontend declarative query language, or even Substrait.
Anyway, my point is that SQL is not inherently a faster paradigm than "dataframes", and that you're conflating declarative query planning with SQL.
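To make that concrete, here's a small polars lazy query (made-up file and column names): it's just as declarative as the equivalent SQL, nothing runs until collect(), and the optimiser decides the actual plan.

```python
# Made-up file/column names; the point is that the lazy API builds a query
# plan rather than executing eagerly, much like declarative SQL.
import polars as pl

query = (
    pl.scan_parquet("events/*.parquet")        # nothing is read yet
    .filter(pl.col("country") == "NZ")
    .group_by("user_id")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)

print(query.explain())    # inspect the optimised plan, like EXPLAIN in SQL
result = query.collect()  # only now does the engine plan and execute
```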