Baba Is You is a great game, part of a broader family of 2D grid puzzle games.
(Shameless plug: I'm one of the developers of Thinky.gg (https://thinky.gg), a puzzle game site with a 'shortest path'-style game [Pathology] and a Sokoban variant [Sokoath].)
These games are typically NP-hard, so the techniques solvers have employed for Sokoban (or Pathology) have typically been brute-force search with various heuristics (BFS, deadlock detection, Zobrist hashing). However, once levels get beyond a certain size with enough movable blocks, you exhaust memory pretty quickly.
These types of games are still "AI proof" so far, in that LLMs are absolutely awful at solving them while humans are very good (so they seem reasonable candidates for ARC-AGI benchmarks). Whenever a new reasoning model gets released, I typically try it on some basic Pathology levels (like 'One at a Time' https://pathology.thinky.gg/level/ybbun/one-at-a-time) and they fail miserably.
Simple level code for the above level (0 is empty floor, 1 is a wall, 2 is a movable block, 3 is the exit, 4 is the starting position):
000
020
023
041
Similar to OP, I've found Claude can't manage rule dynamics, blocked paths, or game objectives well, and spits out essentially random moves.
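Levels this small are trivial to brute force, though. Here's a minimal BFS sketch over the level code above, assuming simplified Pathology-like rules (the player pushes a block one cell at a time; a block can't be pushed into a wall, into another block, off the grid, or onto the exit); the real game's rules may differ:

```python
from collections import deque

def parse(level_str):
    grid = level_str.split()
    walls, blocks = set(), set()
    player = exit_pos = None
    for r, row in enumerate(grid):
        for c, ch in enumerate(row):
            if ch == "1": walls.add((r, c))
            elif ch == "2": blocks.add((r, c))
            elif ch == "3": exit_pos = (r, c)
            elif ch == "4": player = (r, c)
    return len(grid), len(grid[0]), walls, blocks, player, exit_pos

def solve(level_str):
    rows, cols, walls, blocks, player, exit_pos = parse(level_str)
    inside = lambda p: 0 <= p[0] < rows and 0 <= p[1] < cols
    start = (player, frozenset(blocks))
    seen, q = {start}, deque([(start, 0)])
    while q:
        (pos, blks), dist = q.popleft()
        if pos == exit_pos:
            return dist  # number of moves in a shortest solution
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            npos = (pos[0] + dr, pos[1] + dc)
            if not inside(npos) or npos in walls:
                continue
            nblks = blks
            if npos in blks:  # stepping into a block pushes it
                dest = (npos[0] + dr, npos[1] + dc)
                if not inside(dest) or dest in walls or dest in blks or dest == exit_pos:
                    continue  # push target must be free floor
                nblks = (blks - {npos}) | {dest}
            state = (npos, nblks)
            if state not in seen:
                seen.add(state)
                q.append((state, dist + 1))
    return None  # unsolvable under these rules

level = "000\n020\n023\n041"
print(solve(level))
```

The state here is (player position, set of block positions), which is exactly where memory blows up on bigger levels: the state count grows combinatorially with the number of movable blocks.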
NP-hardness isn't much of a problem, because the levels are fairly small, and instances aren't chosen to be worst-case hard but to be entertaining for humans to solve.
SMT/SAT solvers or integer linear programming can get you pretty far. Many classic puzzle games like Minesweeper are NP-hard, yet you can solve any instance a human could solve in their lifetime fairly quickly on a computer.
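As a toy illustration of the constraint-solving idea, here's exhaustive model enumeration on made-up Minesweeper-style clues; a SAT solver reaches the same deductions without the exponential enumeration:

```python
from itertools import product

def deduce(cells, constraints):
    """Enumerate every mine assignment consistent with all count
    constraints, then report cells whose status is forced in every model."""
    models = []
    for bits in product((0, 1), repeat=len(cells)):
        asn = dict(zip(cells, bits))
        if all(sum(asn[c] for c in group) == count for group, count in constraints):
            models.append(asn)
    if not models:
        return set(), set()  # contradictory clues
    forced_mines = {c for c in cells if all(m[c] for m in models)}
    forced_safe = {c for c in cells if not any(m[c] for m in models)}
    return forced_mines, forced_safe

# Hypothetical clues: exactly 1 mine among {A, B}, exactly 2 among {A, B, C}.
mines, safe = deduce(["A", "B", "C"], [({"A", "B"}, 1), ({"A", "B", "C"}, 2)])
print(mines, safe)  # C is forced to be a mine; A and B stay ambiguous
```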
As a developer of a grid-based puzzle game site (https://thinky.gg; one of its games, Pathology, has you go from point A to point B in the fewest steps), I've found A* fascinating not just for the optimization but for the various heuristics that can be built on top of it to generalize it to other types of pathfinding.
Some devs have built solvers that use techniques like bidirectional search, precomputed pattern databases, and deadlock detection.
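For reference, here's a bare-bones A* sketch on a 4-connected grid using a Manhattan-distance heuristic (admissible here since every step costs 1); the grid and walls are made up for illustration:

```python
import heapq

def astar(grid, start, goal):
    """A* shortest path on a 4-connected grid; '#' cells are walls.
    Returns the number of steps, or None if the goal is unreachable."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan
    open_heap = [(h(start), 0, start)]
    best = {start: 0}
    while open_heap:
        f, g, pos = heapq.heappop(open_heap)
        if pos == goal:
            return g
        if g > best.get(pos, float("inf")):
            continue  # stale heap entry
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = pos[0] + dr, pos[1] + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None

grid = ["....",
        ".##.",
        "...."]
print(astar(grid, (0, 0), (2, 3)))
```

Swapping the heuristic is the generalization point: any admissible estimate (never overestimating the true remaining cost) keeps the result optimal while pruning more of the search.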
> The least-likely part of the work is behind us; the scientific insights that got us to systems like GPT-4 and o3 were hard-won, but will take us very far.
> 2026 will likely see the arrival of systems that can figure out novel insights
The level of confidence is interesting compared to recent comments by Sundar [1]; Satya [2] is also a bit more reserved in his optimism.
I've done something similar for learning about controversial topics. I ask it to act as 'Bob', a well-informed supporter of one side (like Ukraine), then as 'Alice', a well-informed supporter of the other side (Russia), and have them debate each other over a few prompts with a moderator named 'Sue'.
Then, after a few rounds of the debate where Sue asks a bunch of questions, I send it to the judges: Mark, Phil, and Sarah (I give each of them a personality; sometimes I pretend they're famous moral philosophers). Each judge comes up with a rubric and decides who the winner is.
Really fun, and helps me understand different sides of issues.
That seems like a terrible idea. At best it seems likely to help you make a false but convincing-sounding case.
I really hope no one is using that to understand controversial topics, much less to determine their stances on them.
I'd recommend looking into actual human experts who are trustworthy and reading them. Trying to get an LLM to argue the case will just get you a lot of false information presented in a more convincing fashion.
I've been a lukewarm user of Supabase for my side projects. Unfortunately the amount of work to get off of it has been too high for me to leave.
The major issue is cost. It's way more expensive than I realized, because they have so many little ways of charging you; it's almost death by a thousand paper cuts. My bill for my app with just a few thousand users was $70 last month.
I do like the tooling and all, but the pricing has been very confusing.
Kind of the same feeling. I don't use all the services they offer either, and when I looked at self-hosting, it all seemed kind of heavy and fragile. I ended up replicating the parts I used with a small API layer connected to a managed Postgres DB, for roughly a tenth of the cost. I'd say it's pretty handy for prototyping, but I'm not sure I'd want to build a business on the back of it.
A few thousand!? That sounds very reasonable to me. Monetize just two of those users at $35 per month and your server costs are covered. Or run it yourself; there are a lot of moving parts, but it's all open source.
That's quite challenging to do. I've spent quite a few hours looking into it myself and concluded that they make it their goal to complicate self-hosting through a lack of detailed docs. For example, I recall a warning in their docs along the lines of "for production, don't use this default setup", and it felt more like "tough luck, figure it out or fork out $ome ca$h". Perfectly fine business model, but not 100% self-hostable at a production level (even for a very basic app).
This is exactly the reason I’ve been avoiding it despite seeing it mentioned all the time. I’m sure I’m missing out on some conveniences but it’s just too cheap to host my own pg DB. I can deal with backups and auth if it means saving a not so insignificant amount of money per month.
There is a similar issue with image and video generation. Prompts like "Generate an image of a man holding a pencil with his left hand" or "Generate a clock showing the time 5 minutes past 6 o'clock" often fail because so many images in the training set look alike (almost all clock images show 10:10: https://generativeai.pub/in-the-ai-art-world-the-time-is-alm...).