Valve became serious about software quality in Dota 2 around 2017, about 7 years after launch. Before that, game updates were accompanied by lots of bugs that would take weeks to fix. These days there are still plenty of bugs, but it's much better than before. They just released one of the biggest updates in the game's history this week, and there are hardly any bugs being reported.
I am pretty sure there is some sort of automated testing happening that is catching these bugs before release.
Reminds me of a 2016 article about the testing infrastructure of League of Legends [1]: 5,500 tests per build, running in 1 to 2 hours.
Games are extremely hard to test. For me they fall into the same category as GUI testing frameworks, which imho are extremely annoying and brittle. Except that games are comparable to a user interface consisting of many buttons which you can short-press, long-press and drag around, while at the same time other bots are pressing the same buttons, all sharing the same state influenced by a physics engine.
How do you test such a ball of mud, which is also constantly changed by devs trying to follow the fun? Yes, you can unit test individual, reusable parts. But integration tests, which require large, time-sensitive modules, all strapped together and running at the same time? It's mind-bogglingly hard.
Moreover, if you're in the conceptual phase of development and prototyping an idea, tests make no sense. The requirements change all the time and complex tests hold you back. But the funny thing is that game development stays in that phase most of the time. And when the game is done, you start a new one with a completely different set of requirements.
There are exceptions, like League of Legends. The game left the conceptual phase many years ago and its rules are set in stone. And a game which runs successfully for that long is super rare.
I recall some Minecraft tests being saved worlds with redstone logic that light a beacon green if things are working or red if not. That's useful for games like that.
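Roughly, the harness for that could look like the sketch below. The world-loading and beacon APIs here are invented stand-ins, not any real Minecraft interface:

```python
# Hypothetical harness: load a saved test world, run the simulation for a fixed
# number of ticks, and read back the beacon colour that the contraption's own
# redstone logic sets. load_world and the world methods are invented stand-ins.
import unittest

class RedstoneScenarioTests(unittest.TestCase):
    SCENARIOS = ["piston_door", "item_sorter", "pulse_extender"]  # invented names

    def test_saved_worlds_light_green(self):
        for name in self.SCENARIOS:
            with self.subTest(scenario=name):
                world = load_world(f"test_worlds/{name}.save")  # hypothetical loader
                for _ in range(200):  # give the contraption time to settle
                    world.tick()
                # The test world itself decides pass/fail and signals it in-game.
                self.assertEqual(world.beacon_color("result"), "green")
```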
For games like Starcraft 2 with replay functionality, you could probably record several matches and test that the replayed behaviour matches the recorded behaviour. If you can make your game support a replay feature, you can make use of this even if you don't ship that replay code.
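In miniature, the pattern could look something like this; the "game" below is a toy deterministic simulation, but the shape is the real point: store the seed and inputs, re-run them, and compare a hash of the resulting state against the one captured when the replay was recorded:

```python
import hashlib
import json

def simulate(seed: int, inputs: list[dict]) -> dict:
    """Toy deterministic simulation: integer-only state, no wall-clock time."""
    state = {"x": seed % 97, "y": 0, "hp": 100}
    for cmd in inputs:
        if cmd["type"] == "move":
            state["x"] += cmd["dx"]
            state["y"] += cmd["dy"]
        elif cmd["type"] == "hit":
            state["hp"] -= cmd["damage"]
    return state

def state_hash(state: dict) -> str:
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()

def test_replay_matches_recording():
    inputs = [{"type": "move", "dx": 3, "dy": 1}, {"type": "hit", "damage": 7}]
    # In a real suite the seed, inputs and final hash would be loaded from a
    # recorded replay file; here they are computed inline to keep this runnable.
    recorded_hash = state_hash(simulate(42, inputs))
    assert state_hash(simulate(42, inputs)) == recorded_hash
```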
For things like CYOA-type games or decision trees, you could have a logging mechanism that prints out the choices, player stats, hidden stats, etc., have a way to run through the decisions, and then check the actual log output against the expected output. -- I've done something similar when writing parsers by printing out the parse tree (for AST parser APIs) or the parse events (for reader/SAX parser APIs).
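A tiny self-contained illustration of that golden-log idea (the story graph is obviously invented):

```python
# Drive the decision tree with a scripted list of choices, log every transition
# and stat change, and compare the log against an expected transcript.
STORY = {
    "start":    {"text": "A fork in the road", "choices": {"left": "cave", "right": "village"}},
    "cave":     {"text": "A dark cave", "choices": {"enter": "treasure"}, "stat": ("courage", +1)},
    "village":  {"text": "A quiet village", "choices": {}},
    "treasure": {"text": "You found treasure", "choices": {}, "stat": ("gold", +10)},
}

def play(choices: list[str]) -> list[str]:
    node, stats, log = "start", {"courage": 0, "gold": 0}, []
    for choice in choices:
        node = STORY[node]["choices"][choice]
        if "stat" in STORY[node]:
            key, delta = STORY[node]["stat"]
            stats[key] += delta
        log.append(f"{choice} -> {node} | stats={stats}")
    return log

def test_cave_route_matches_golden_log():
    expected = [
        "left -> cave | stats={'courage': 1, 'gold': 0}",
        "enter -> treasure | stats={'courage': 1, 'gold': 10}",
    ]
    assert play(["left", "enter"]) == expected
```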
I'm sure there are other techniques for testing other parts of the system. For example, you could test the rendering by saving the render to an image and comparing it against an expected image. IIRC, Firefox does something similar for some systems like the SVG renderer and the HTML paint code.
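For the image comparison, a sketch using Pillow could look like the following; the paths are placeholders for a render produced by a headless run plus a checked-in reference, and the tolerance absorbs tiny rasterisation differences between machines:

```python
from PIL import Image, ImageChops

def max_pixel_difference(actual_path: str, expected_path: str) -> int:
    actual = Image.open(actual_path).convert("RGB")
    expected = Image.open(expected_path).convert("RGB")
    if actual.size != expected.size:
        return 255  # treat a size mismatch as a maximal difference
    diff = ImageChops.difference(actual, expected)
    return max(upper for _, upper in diff.getextrema())  # worst per-channel deviation

def test_main_menu_render(tolerance: int = 8):
    # "renders/main_menu.png" would be produced by a headless run of the game.
    assert max_pixel_difference("renders/main_menu.png",
                                "golden/main_menu.png") <= tolerance
```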
Several of these features (replay, screenshots) are useful to have in the main game anyway.
You're right about the parts that are mostly state machines. They have defined inputs and outputs, and tests are straightforward to implement and adjust.
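For those parts it can be as simple as this made-up example:

```python
# An invented example of the easy case: a small game state machine with defined
# inputs and outputs, which is trivial to unit test in isolation.
def next_state(state: str, event: str) -> str:
    transitions = {
        ("idle", "enemy_spotted"): "chase",
        ("chase", "in_range"): "attack",
        ("chase", "lost_sight"): "idle",
        ("attack", "target_dead"): "idle",
    }
    return transitions.get((state, event), state)  # unknown events keep the state

def test_ai_state_machine():
    assert next_state("idle", "enemy_spotted") == "chase"
    assert next_state("chase", "in_range") == "attack"
    assert next_state("attack", "target_dead") == "idle"
    assert next_state("idle", "in_range") == "idle"  # invalid transition is a no-op
```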
But recording and replaying matches? Taking screenshots and comparing the output? Just think about it: if you have recorded a match and then change the hitpoints of a single creature, the test could fail. And then? Re-record the match?
The same applies to screenshots: What happens if models, sprites or colors change?
In my experience, tests like this are annoying, because:
1) They take a long time to create and adjust/recreate.
2) They fail for minor reasons.
3) It takes time to understand what such tests even measure if someone else made them.
4) You need a large, self-made framework to support such tests.
5) They take a long time to run, because they are time-dependent.
6) They discourage you from making large changes.
7) It's cheaper to have some low-wage game testers play your game. Or better, release the game in early access and let thousands of players test it for free, while even making money off them.
Yes, when you are trying to intentionally change the output, you simply regenerate the gold file to be used as the reference (and yes, it should be easy). It’s brittle for sure, but it does catch unintentional changes and should be used where relevant (if sparingly). There are definitely existing frameworks that do this (e.g. Jest calls this snapshot testing and has tooling to make it easy).
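A hand-rolled sketch of that workflow (Jest's --updateSnapshot flag automates the same idea); the UPDATE_GOLDEN variable, the file layout and render_price_table are my own invention:

```python
import os
from pathlib import Path

def check_against_golden(name: str, actual: str, golden_dir: str = "tests/golden") -> None:
    golden_path = Path(golden_dir) / f"{name}.txt"
    if os.environ.get("UPDATE_GOLDEN") == "1":
        golden_path.parent.mkdir(parents=True, exist_ok=True)
        golden_path.write_text(actual)  # intentional change: bless the new output
        return
    expected = golden_path.read_text()
    assert actual == expected, f"{name} changed unintentionally; rerun with UPDATE_GOLDEN=1 to accept"

def test_shop_price_table():
    # render_price_table is a hypothetical stand-in for whatever produces the output.
    check_against_golden("shop_prices", render_price_table())
```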
I’m sorry your experiences with this kind of stuff have been bad. I’ve generally had good experiences in the machine learning space, where we used it judiciously and didn’t overdo it.
I don’t see how it can ever hinder you, though - you can always choose to go “I don’t care that the output has changed dramatically - it’s the new ground truth”, as long as you communicate that that’s what’s happening in your commit. What it doesn’t let you do is have the output be different every time you run it, but that’s generally a positive (randomness should be injected deterministically).
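By "injected deterministically" I mean something like the sketch below: the logic takes an explicit RNG instead of reaching for global randomness, so a test (or a replay) can pass a fixed seed and get identical output every run. The crit roll is just an illustrative example:

```python
import random

def roll_damage(base: int, crit_chance: float, rng: random.Random) -> int:
    # Double damage on a critical hit, decided by the injected RNG.
    return base * 2 if rng.random() < crit_chance else base

def test_damage_is_reproducible():
    rng_a, rng_b = random.Random(1234), random.Random(1234)
    a = [roll_damage(50, 0.25, rng_a) for _ in range(10)]
    b = [roll_damage(50, 0.25, rng_b) for _ in range(10)]
    assert a == b  # same seed, same sequence of crits
```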
I doubt the Dota 2 devs are writing code like this to test. The game is far too complicated, even more so than League, and changes too much over the years for this to be viable.
Dota 2 and OpenAI had a collaboration around 2018, and during this time the Dota 2 bot system was reworked completely. They already can generate videos of every spell in action [1], and I would assume this is done by asking AI bots to demonstrate the spell. My guess is that before pushing out an update, a human looks at these videos and other, more complex interaction videos for every major change, along with the relevant numbers (damage, healing, movement speed), and sees whether everything makes sense.
I think this because, a lot of times recently, changes to one hero have caused an un-updated hero to break due to some backend similarity. And the patch is released with the bug.
Then again, there is no public info, so all of the above is wild speculation.
> They already can generate videos of every spell in action [1]
I'm fairly certain those videos are all handmade. (Yes, all 500+ of them.) Notice that the videos for each hero are recorded in different locations on the map, and the "victim" hero isn't always the same.
In my experience (full-stack web development), unit tests are mostly useless and it is the high-level system tests that add real value. Unfortunately it can take a fair amount of work or skill to architect the test suite in the first place, but once it’s working you can write elegant tests that verify large swathes of code with fairly few lines of test code.
I think UI testing in general is hard though, and given how large a part of games involves UI, that’s probably the real reason games don’t have many tests.
Agreed. We don't have a lot of UI unit tests in the software at our day job (almost none), but we have extensive tests for utility and data-processing functions.
And it's pretty much the same for me in the game I'm making in my spare time. I have no unit tests for UI (it's not worth it; I can easily see when something in the UI isn't working, and it's more important for me to just log the bug so I don't forget about it).
But for game logic, like verifying that calculations for the A.I. happen as expected, or functions that manipulate numbers on the screen in different ways (scores, power adjustments, etc.), yeah, I write unit tests for those. And to the article's point, sometimes I have to significantly adjust, redo or even scrap them because I happened to think of a different mechanism that seems to play better.
There was a long time (over about a dozen games I made) where I never bothered to write a unit test, and I still might not for a tiny game. But for my most recent, bigger game (which I started a few years ago), I finally decided to write a few for some tricky numeric logic, and it immediately helped me resolve a logic bug I was seeing periodically but was having a hard time pinning down with breakpoints and logs. So I do try to do it more often for checking those things.
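For a flavour of what those tests look like, here is an invented example of the kind of numeric helper that is easy to get subtly wrong, with a few asserts pinning down the intended behaviour:

```python
def damage_with_falloff(base: int, distance: float, falloff_start: float, max_range: float) -> int:
    """Full damage up to falloff_start, linearly down to 0 at max_range."""
    if distance <= falloff_start:
        return base
    if distance >= max_range:
        return 0
    fraction = (max_range - distance) / (max_range - falloff_start)
    return round(base * fraction)

def test_damage_falloff():
    assert damage_with_falloff(100, 0.0, 10.0, 30.0) == 100   # point blank
    assert damage_with_falloff(100, 10.0, 10.0, 30.0) == 100  # boundary: still full damage
    assert damage_with_falloff(100, 20.0, 10.0, 30.0) == 50   # halfway through the falloff
    assert damage_with_falloff(100, 30.0, 10.0, 30.0) == 0    # at max range
    assert damage_with_falloff(100, 99.0, 10.0, 30.0) == 0    # far beyond range stays clamped
```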
Part of it is terminology. You get "unit testing" and "functional testing" and "integration testing" and "system testing" thrown around, often with people meaning different things by these, and vague definitions that partially or wholly overlap.
My rule of thumb is really simple: a test should always be defined in terms of what the user expects. Thus, for most apps you should, at a minimum, have tests corresponding to their functional specification. In addition, if the app contains functionality that is consistently reused internally (i.e. embedded libraries), then the users of those libraries are the code that calls into them, and so there should also be tests at that boundary (but only after you've written the high-level tests). Repeat recursively until you get to the bottom.
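As an invented illustration of both boundaries: the first test below is written against what the user sees (the functional spec), the second against an internal helper whose "users" are the calling code, and which only earns its own tests because it is reused:

```python
def rank_players(scores: dict[str, int]) -> list[str]:
    """Internal helper reused across screens: highest score first, ties broken by name."""
    return sorted(scores, key=lambda name: (-scores[name], name))

def render_scoreboard(scores: dict[str, int]) -> str:
    return "\n".join(f"{i}. {name} ({scores[name]})"
                     for i, name in enumerate(rank_players(scores), start=1))

def test_scoreboard_matches_spec():  # boundary: what the user expects
    assert render_scoreboard({"ana": 30, "bo": 50}) == "1. bo (50)\n2. ana (30)"

def test_ranking_helper():  # boundary: the internal library's callers
    assert rank_players({"ana": 10, "bo": 10, "cy": 20}) == ["cy", "ana", "bo"]
```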