tl;dr: players get attacked based on monster targeting RNG that's supposed to take an interval [0, num_players], assign players to subintervals that are shorter or longer based on a bunch of factors, and then roll a random number somewhere in that full interval. Whoever's subinterval the number lands in gets targeted.
Instead, the code assigned subintervals and then rolled a number between 0 and 1, instead of 0 and num_players. If your player happened to sort to the top of the list for subinterval assignment, you'd be it. You'd always be it.
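Schematically, in Python for illustration rather than the game's actual code (the names and weights here are made up):

    import random

    def pick_target_correct(players, weights):
        # Roll across the whole weighted range, so each player's
        # subinterval is hit in proportion to its width.
        total = sum(weights)
        roll = random.random() * total
        cumulative = 0.0
        for player, weight in zip(players, weights):
            cumulative += weight
            if roll < cumulative:
                return player
        return players[-1]

    def pick_target_buggy(players, weights):
        # The subintervals still span [0, total], but the roll only ever
        # lands in [0, 1), so whoever sorted to the front of the list
        # soaks up nearly every hit.
        roll = random.random()
        cumulative = 0.0
        for player, weight in zip(players, weights):
            cumulative += weight
            if roll < cumulative:
                return player
        return players[-1]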
Someone had to think to look at this, then confirm the maths, before it was found, years after the symptoms were first reported. A unit test would have caught this, but none existed. Writing tests is annoying, especially in a codebase that keeps changing, but it's so important that this should count as one of those "this is what happens when you don't" lessons. Money was lost here.
I've always wondered the best way to write tests for "This event should happen x% of the time."
Obviously we could re-run the test 100 times and see if it happened close to x% of the time, but not only is that inefficient, how close is "close"? You'll get a bell curve (or similar), and most of the time you'll be close to x, but sometimes you'll be legitimately far from x and your test will fail anyway.
You could start from a known seed, but then are you really testing the percentages, or just checking that the RNG gives you the same output for the same seed, which you already know?
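The "how close is close" part at least has a textbook answer from the binomial distribution. A rough sketch, assuming independent trials:

    import math

    # If the event truly happens with probability p, the observed frequency
    # over n trials is approximately normal with standard deviation
    # sqrt(p * (1 - p) / n).  A tolerance of ~6 sigma makes a flaky false
    # failure astronomically unlikely.
    def tolerance(p, n, sigmas=6):
        return sigmas * math.sqrt(p * (1 - p) / n)

    print(tolerance(0.25, 100))      # ~0.26: 100 runs tells you almost nothing
    print(tolerance(0.25, 100_000))  # ~0.008: 100k runs pins the rate down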
In similar situations I've run into, what I've often done is:
1. Start with a known seed.
2. Run the test once, like you describe, and verify the result by hand.
3. Freeze the test in that state, i.e., assert that you get exactly that result every time with the given seed.
What this creates is not what I would strictly speaking call a "unit test", but it does sort of pin the algorithm to your examined and verified output. In this case, a human would quite likely have caught this problem on a decent test set. Obviously, there are other pathologies that would slip right by a human; the human being careful only raises the bar for such pathologies, it doesn't completely eliminate them.
But at least freezing it solves the problem where a change you didn't realize was a change slips by unnoticed and the function suddenly has a completely different outcome.
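A minimal sketch of that shape of test, assuming Python's random module and a made-up pick_target function (the frozen expectation below is a placeholder; in practice you paste in the output you generated and verified by hand):

    import random

    def pick_target(players, rng):
        # Stand-in for the function under test; uses the injected RNG.
        return players[int(rng.random() * len(players))]

    # Generated once with the seed below, inspected by hand, then frozen.
    # (Placeholder here -- regenerate against your own implementation.)
    FROZEN_PICKS = ["..."]

    def test_frozen_seed():
        rng = random.Random(1234)                 # fixed, known seed
        players = ["alice", "bob", "carol"]
        picks = [pick_target(players, rng) for _ in range(10)]
        # Any change to the algorithm, or to how it consumes the RNG,
        # breaks this exact-match assertion.
        assert picks == FROZEN_PICKS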
This has worked for me, in the sense it has caught a couple of bugs that would have had non-trivial customer implications. But I've never worked on an MMORPG or anything else where randomness was intrinsic to my problem; it has always been incidental, like, is my password generation algorithm correct and does this sample of my data look like what I expect, not the core of my system.
I'm not a fan of the set-seed solution. In the past, when I've tested PRNG implementations (Erlang used to not have the ability to have multiple, independently seeded RNGs), my approach was to decide on an acceptable rate of false negatives, and design my test around that. I figured I'd run the test suite no more than 10,000 times, and I wanted a 1/1,000,000 chance of ever seeing a false negative.
I can't remember the exact math I used at the time (I had to crack open a stats textbook), but ultimately it boiled down to generating a bunch of (seed, number of values previously pulled from that seed, value) tuples, running a linear regression against them, and defining a maximum acceptable R^2 value based on my 10^-10 acceptable probability of a false fail.
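Roughly that shape of test, reconstructed from the description above rather than the original code (the iteration counts and the R^2 threshold here are placeholders, not derived from a target false-failure rate):

    import numpy as np

    def test_rng_output_uncorrelated_with_seed_and_pull_index():
        # Build (seed, pull index, value) rows from many independently
        # seeded generators.
        rows = []
        for seed in range(200):
            rng = np.random.default_rng(seed)
            rows.extend((seed, i, rng.random()) for i in range(100))
        data = np.array(rows)
        X = np.column_stack([data[:, 0], data[:, 1], np.ones(len(data))])
        y = data[:, 2]
        coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        # A healthy RNG's output shouldn't be predictable from the seed or
        # from how many values were pulled before it, so R^2 stays tiny.
        assert r_squared < 1e-3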
When the RNG is not the thing being tested, mocking the RNG to do a sampled sweep through the RNG's output range is typically the correct move.
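For example (a sketch; pick_target here is a stand-in for the real code that calls random.random() internally):

    import random
    from unittest.mock import patch

    def pick_target(players):
        # Stand-in for the function under test.
        roll = random.random()
        return players[min(int(roll * len(players)), len(players) - 1)]

    def test_sweep_reaches_every_player():
        players = ["alice", "bob", "carol"]
        seen = set()
        # Mock the RNG to sweep evenly through [0, 1) so every subinterval
        # gets exercised deterministically.
        for step in range(100):
            with patch("random.random", return_value=step / 100):
                seen.add(pick_target(players))
        assert seen == set(players)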
> I've always wondered the best way to write tests for "This event should happen x% of the time."
I use https://github.com/pkhuong/csm/blob/master/csm.py to set a known false alarm rate (e.g., one in a billion) and test that some event happens with probability in a range [lo, hi], with lo < expected < hi (e.g., expected +/- 0.01). The statistical test will run more iterations as needed. If that's too slow, you can either widen the range (most effective) or increase the acceptable rate of false alarms (not as effective, because the number of iterations scales logarithmically with respect to the false alarm rate).
You would probably not test "happens x% of the time", but rather that, for a given set of inputs, passing in 0.1, 0.2, 0.3, etc. as the roll gives you the expected outcome at each point in the distribution. So you're testing that you can reach each outcome across the spectrum with a pre-selected "random" number.
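Something like this, with the "random" roll injected as an argument (choose and the weights are invented for the sketch):

    import bisect
    import itertools
    import pytest

    def choose(weights, roll):
        # Weighted pick: scale the injected roll (in [0, 1)) up to the sum
        # of the weights, then find which subinterval it lands in.
        cumulative = list(itertools.accumulate(weights))
        return bisect.bisect_right(cumulative, roll * cumulative[-1])

    # With weights [5, 3, 2] the subintervals are [0, 5), [5, 8) and [8, 10).
    @pytest.mark.parametrize("roll, expected_index", [
        (0.10, 0),
        (0.40, 0),
        (0.50, 1),   # exactly on a boundary goes to the next bucket
        (0.75, 1),
        (0.85, 2),
        (0.99, 2),
    ])
    def test_pre_selected_rolls_hit_expected_buckets(roll, expected_index):
        assert choose([5, 3, 2], roll) == expected_index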
Unit tests are notoriously bad about testing n>2. I just ran into a problem with sorting recently that was caused by buggy comparators, but it didn't bubble up to the tests until the runtime switched to a different sort implementation. Most of the tests used n=2, so the bug wasn't visible under the old runtime version but was under the new one. It had been broken in production for months if not years. If the tests had used n=5, I'm pretty sure the old ones would have failed as well.
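A toy version of the failure mode (not my actual code; the comparator bug here is the classic "never returns negative" variety):

    from functools import cmp_to_key

    def buggy_cmp(a, b):
        # Bug: "a sorts before b" is silently reported as "a equals b"
        # because the -1 branch is missing.
        return 1 if a > b else 0

    def test_n2_hides_the_bug():
        # Passes only because the two elements already arrive in order,
        # so the sort never needs the missing "less than" answer.
        assert sorted([1, 3], key=cmp_to_key(buggy_cmp)) == [1, 3]

    def test_n5_exposes_the_bug():
        # Fails: with more elements the missing ordering information
        # leaves the list essentially unsorted.
        assert sorted([3, 1, 4, 1, 5], key=cmp_to_key(buggy_cmp)) == [1, 1, 3, 4, 5]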
100? Let's make that 100,000 instead, and then check the resulting distribution. It's a unit test, not an integration test: it will run very quickly, even on 1999 hardware.
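Something along these lines (a sketch; pick is a stand-in for the real weighted selection, and 0.01 is a deliberately generous tolerance):

    import random
    from collections import Counter

    def pick(weights, rng):
        # Stand-in weighted pick; substitute the real function under test.
        total = sum(weights)
        roll = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if roll < cumulative:
                return i
        return len(weights) - 1

    def test_distribution_over_100k_trials():
        rng = random.Random(42)              # seeded so reruns are stable
        weights = [5.0, 3.0, 2.0]
        n = 100_000
        counts = Counter(pick(weights, rng) for _ in range(n))
        for i, w in enumerate(weights):
            expected = w / sum(weights)
            observed = counts[i] / n
            # 0.01 is 6+ standard deviations at this n, so it won't flake,
            # but a "roll between 0 and 1" bug blows way past it.
            assert abs(observed - expected) < 0.01, (i, observed, expected)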
One of the first unit testing libraries, JUnit (Java), was released in 1997.
Asheron's Call was released in 1999 after four years of development. It's quite possible that this bug was introduced before the concept of unit testing even existed or was widely known.
My startup integrates with numerous third-party data sources, and they send quotes in all shapes and colors, usually providing both a total and a complicated breakdown that we need to parse and reshape/classify into line items in our system. Since it's very easy to introduce a bug that accidentally skips over a fee, part of our code review checklist is to ensure we always explicitly check that summing over our final line items reproduces the actual ground-truth total; a runtime assertion logs an error to Sentry and, depending on business requirements, shows a (friendly) error rather than incorrect pricing. It's saved us many times from silly bugs where we're non-exhaustively handling parts of the breakdown data structure.
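In spirit, the check looks something like this (a simplified sketch; the names and tolerance are invented, and sentry_sdk.capture_message is just one way to report it):

    import sentry_sdk

    def line_items_match_total(line_items, quoted_total, tolerance=0.01):
        # Ground-truth check: however we reshaped and classified the
        # breakdown, our line items must still sum to the provider's total.
        our_total = sum(item.amount for item in line_items)
        if abs(our_total - quoted_total) > tolerance:
            sentry_sdk.capture_message(
                f"Line items sum to {our_total}, but the quote total is {quoted_total}",
                level="error",
            )
            return False  # caller shows a friendly error instead of bad pricing
        return True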
To generalize, it's vital to have something like Sentry from day one - a low-cost abstraction that lets you monitor broken assumptions asynchronously. Though of course, these kinds of tools didn't exist in 2002!
> Writing tests is annoying, especially in a codebase that keeps changing
Only if you don't have a functional core.
One of the psychology lessons I've learned about unit testing is just how badly the sunk cost fallacy affects some people all the time, and most people some of the time. I've witnessed people pairing for two days to fix a couple of nasty integration tests. That's 3 person-days of work lost on a couple of tests. How many release cycles would it take for those tests to pay for themselves versus manual testing?
You want the tests at the bottom of your pyramid to be so simple that people think of them as disposable. They shouldn't feel guilty deleting one test and writing a replacement if the requirements change. Elaborate mocks or poorly organized suites take that away, which is hard to explain to people who don't see the problem with their tests.
You also want the sides of that pyramid to be pretty shallow, especially if you keep changing your design.
The RNG was supposed to generate a random number in [0, sum of weights] not [0, num_players].
It's also possible to write a test for this that faithfully pins down the wrong behavior. The error is pretty subtle. There was even a "bug" in your summary of the bug ;)
Unit tests for things with probability are hard. I've written one recently, and among other reasons, it convinced me to write the selection differently so it was both more 'fair' and more easily tested.
It's slightly more complex than that: if you were farther away or otherwise in a better position than average, your portion of the interval would be <1, so even if you sorted first you still weren't guaranteed to be hit.
Is it so important? I think this made the community and game world more interesting. It's emergent behavior. As a game developer, I hope I can make things complicated enough to have emergent behavior, even if that's just from my bugs.
You know what would have made it even better though? Not having a bug that made your character cursed through no fault of your own without any recourse.