tl;dr: players get attacked based on monster targeting RNG that's supposed to take an interval [0, num_players], assign players to subintervals that are shorter or longer based on a bunch of factors, and then roll a random number somewhere in that full interval. Whoever's subinterval the number lands in gets targeted.
Instead, the code assigned subintervals and then rolled a number between 0 and 1, instead of 0 and num_players. If your player happened to sort to the top of the list for subinterval assignment, you'd be it. You'd always be it.
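Schematically, in Python for illustration rather than the game's actual code (the names and weights here are made up):

    import random

    def pick_target_correct(players, weights):
        # Roll across the whole weighted range, so each player's
        # subinterval is hit in proportion to its width.
        total = sum(weights)
        roll = random.random() * total
        cumulative = 0.0
        for player, weight in zip(players, weights):
            cumulative += weight
            if roll < cumulative:
                return player
        return players[-1]

    def pick_target_buggy(players, weights):
        # The subintervals still span [0, total], but the roll only ever
        # lands in [0, 1), so whoever sorted to the front of the list
        # soaks up nearly every hit.
        roll = random.random()
        cumulative = 0.0
        for player, weight in zip(players, weights):
            cumulative += weight
            if roll < cumulative:
                return player
        return players[-1]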
Someone had to think to look at this, then confirm the maths, before it was found, years after the symptoms were first reported. A unit test would have caught this, but none existed. Writing tests is annoying, especially in a codebase that keeps changing, but it's so important that this should count as one of those "this is what happens when you don't" lessons. Money was lost here.
I've always wondered the best way to write tests for "This event should happen x% of the time."
Obviously we could re-run the test 100 times and see if it happened close to x% of the time, but not only is that inefficient, how close is "close"? You'll get a bell curve (or similar), and most of the time you'll be close to x, but sometimes you'll be legitimately far from x and your test will fail anyway.
You could start from a known seed, but then are you really testing the percentages, or just checking that the RNG gives you the same output for the same seed, which you already know?
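The "how close is close" part at least has a textbook answer from the binomial distribution. A rough sketch, assuming independent trials:

    import math

    # If the event truly happens with probability p, the observed frequency
    # over n trials is approximately normal with standard deviation
    # sqrt(p * (1 - p) / n).  A tolerance of ~6 sigma makes a flaky false
    # failure astronomically unlikely.
    def tolerance(p, n, sigmas=6):
        return sigmas * math.sqrt(p * (1 - p) / n)

    print(tolerance(0.25, 100))      # ~0.26: 100 runs tells you almost nothing
    print(tolerance(0.25, 100_000))  # ~0.008: 100k runs pins the rate down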
In similar situations I've run into, what I've often done is:
1. Start with a known seed.
2. Run the test once, like you describe, and verify the result by hand.
3. Freeze the test in that state, i.e., assert that you get exactly that result every time with the given seed.
What this creates is not what I would strictly speaking call a "unit test", but it does sort of pin the algorithm to your examined and verified output. In this case, a human would quite likely have caught this problem on a decent test set. Obviously, there are other pathologies that would slip right by a human; the human being careful only raises the bar for such pathologies, it doesn't completely eliminate them.
But at least freezing it solves the problem where a change you didn't realize was a change slips by unnoticed and the function suddenly has a completely different outcome.
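A minimal sketch of that shape of test, assuming Python's random module and a made-up pick_target function (the frozen expectation below is a placeholder; in practice you paste in the output you generated and verified by hand):

    import random

    def pick_target(players, rng):
        # Stand-in for the function under test; uses the injected RNG.
        return players[int(rng.random() * len(players))]

    # Generated once with the seed below, inspected by hand, then frozen.
    # (Placeholder here -- regenerate against your own implementation.)
    FROZEN_PICKS = ["..."]

    def test_frozen_seed():
        rng = random.Random(1234)                 # fixed, known seed
        players = ["alice", "bob", "carol"]
        picks = [pick_target(players, rng) for _ in range(10)]
        # Any change to the algorithm, or to how it consumes the RNG,
        # breaks this exact-match assertion.
        assert picks == FROZEN_PICKS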
This has worked for me, in the sense it has caught a couple of bugs that would have had non-trivial customer implications. But I've never worked on an MMORPG or anything else where randomness was intrinsic to my problem; it has always been incidental, like, is my password generation algorithm correct and does this sample of my data look like what I expect, not the core of my system.
I'm not a fan of the set-seed solution. In the past, when I've tested PRNG implementations (Erlang used to not have the ability to have multiple, independently seeded RNGs), my approach was to decide on an acceptable rate of false negatives, and design my test around that. I figured I'd run the test suite no more than 10,000 times, and I wanted a 1/1,000,000 chance of ever seeing a false negative.
I can't remember the exact math I used at the time (I had to crack open a stats textbook), but ultimately it boiled down to generating a bunch of (seed, number of values previously pulled from that seed, value) tuples, running a linear regression against them, and defining a maximum acceptable R^2 value based on my 10^-10 acceptable probability of a false fail.
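Roughly that shape of test, reconstructed from the description above rather than the original code (the iteration counts and the R^2 threshold here are placeholders, not derived from a target false-failure rate):

    import numpy as np

    def test_rng_output_uncorrelated_with_seed_and_pull_index():
        # Build (seed, pull index, value) rows from many independently
        # seeded generators.
        rows = []
        for seed in range(200):
            rng = np.random.default_rng(seed)
            rows.extend((seed, i, rng.random()) for i in range(100))
        data = np.array(rows)
        X = np.column_stack([data[:, 0], data[:, 1], np.ones(len(data))])
        y = data[:, 2]
        coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
        residuals = y - X @ coef
        r_squared = 1.0 - residuals.var() / y.var()
        # A healthy RNG's output shouldn't be predictable from the seed or
        # from how many values were pulled before it, so R^2 stays tiny.
        assert r_squared < 1e-3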
When the RNG is not the thing being tested, mocking the RNG to do a sampled sweep through the RNG's output range is typically the correct move.
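For example (a sketch; pick_target here is a stand-in for the real code that calls random.random() internally):

    import random
    from unittest.mock import patch

    def pick_target(players):
        # Stand-in for the function under test.
        roll = random.random()
        return players[min(int(roll * len(players)), len(players) - 1)]

    def test_sweep_reaches_every_player():
        players = ["alice", "bob", "carol"]
        seen = set()
        # Mock the RNG to sweep evenly through [0, 1) so every subinterval
        # gets exercised deterministically.
        for step in range(100):
            with patch("random.random", return_value=step / 100):
                seen.add(pick_target(players))
        assert seen == set(players)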
> I've always wondered the best way to write tests for "This event should happen x% of the time."
I use https://github.com/pkhuong/csm/blob/master/csm.py to set a known false alarm rate (e.g., one in a billion) and test that some event happens with probability in a range [lo, hi], with lo < expected < hi (e.g., expected +/- 0.01). The statistical test will run more iterations as needed. If that's too slow, you can either widen the range (most effective) or increase the acceptable rate of false alarms (not as effective, because the number of iterations scales logarithmically with respect to the false alarm rate).
You would probably not test "happens x% of the time", but rather that, for a given set of inputs, passing in 0.1, 0.2, 0.3, etc. as the roll gives you the expected outcome at each point in the distribution. So you're testing that you can reach each outcome across the spectrum with a pre-selected "random" number.
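Something like this, with the "random" roll injected as an argument (choose and the weights are invented for the sketch):

    import bisect
    import itertools
    import pytest

    def choose(weights, roll):
        # Weighted pick: scale the injected roll (in [0, 1)) up to the sum
        # of the weights, then find which subinterval it lands in.
        cumulative = list(itertools.accumulate(weights))
        return bisect.bisect_right(cumulative, roll * cumulative[-1])

    # With weights [5, 3, 2] the subintervals are [0, 5), [5, 8) and [8, 10).
    @pytest.mark.parametrize("roll, expected_index", [
        (0.10, 0),
        (0.40, 0),
        (0.50, 1),   # exactly on a boundary goes to the next bucket
        (0.75, 1),
        (0.85, 2),
        (0.99, 2),
    ])
    def test_pre_selected_rolls_hit_expected_buckets(roll, expected_index):
        assert choose([5, 3, 2], roll) == expected_index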
Unit tests are notoriously bad about testing n>2. I just ran into a problem with sorting recently that was caused by buggy comparators, but it didn't bubble up to the tests until the runtime switched to a different sort implementation. Most of the tests used n=2, so the bug wasn't visible under the old runtime version but was under the new one. It had been broken in production for months if not years. If the tests had used n=5, I'm pretty sure the old ones would have failed as well.
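A toy version of the failure mode (not my actual code; the comparator bug here is the classic "never returns negative" variety):

    from functools import cmp_to_key

    def buggy_cmp(a, b):
        # Bug: "a sorts before b" is silently reported as "a equals b"
        # because the -1 branch is missing.
        return 1 if a > b else 0

    def test_n2_hides_the_bug():
        # Passes only because the two elements already arrive in order,
        # so the sort never needs the missing "less than" answer.
        assert sorted([1, 3], key=cmp_to_key(buggy_cmp)) == [1, 3]

    def test_n5_exposes_the_bug():
        # Fails: with more elements the missing ordering information
        # leaves the list essentially unsorted.
        assert sorted([3, 1, 4, 1, 5], key=cmp_to_key(buggy_cmp)) == [1, 1, 3, 4, 5]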
100? Let's make that 100,000 instead, and then check the resulting distribution. It's a unit test, not an integration test: it will run very quickly, even on 1999 hardware.
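Something along these lines (a sketch; pick is a stand-in for the real weighted selection, and 0.01 is a deliberately generous tolerance):

    import random
    from collections import Counter

    def pick(weights, rng):
        # Stand-in weighted pick; substitute the real function under test.
        total = sum(weights)
        roll = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if roll < cumulative:
                return i
        return len(weights) - 1

    def test_distribution_over_100k_trials():
        rng = random.Random(42)              # seeded so reruns are stable
        weights = [5.0, 3.0, 2.0]
        n = 100_000
        counts = Counter(pick(weights, rng) for _ in range(n))
        for i, w in enumerate(weights):
            expected = w / sum(weights)
            observed = counts[i] / n
            # 0.01 is 6+ standard deviations at this n, so it won't flake,
            # but a "roll between 0 and 1" bug blows way past it.
            assert abs(observed - expected) < 0.01, (i, observed, expected)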
One of the first unit testing libraries, JUnit (Java), was released in 1997.
Asheron's Call was released in 1999 after four years of development. It's quite possible that this bug was introduced before the concept of unit testing even existed or was widely known.
My startup integrates with numerous third-party data sources, and they send quotes in all shapes and colors, usually providing both a total and a complicated breakdown that we need to parse and reshape/classify into line items in our system. Since it's very easy to introduce a bug that accidentally skips over a fee, part of our code review checklist is to ensure we always explicitly check that summing over our final line items reproduces the actual ground-truth total; a runtime assertion logs an error to Sentry and, depending on business requirements, shows a (friendly) error rather than incorrect pricing. It's saved us many times from silly bugs where we're non-exhaustively handling parts of the breakdown data structure.
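In spirit, the check looks something like this (a simplified sketch; the names and tolerance are invented, and sentry_sdk.capture_message is just one way to report it):

    import sentry_sdk

    def line_items_match_total(line_items, quoted_total, tolerance=0.01):
        # Ground-truth check: however we reshaped and classified the
        # breakdown, our line items must still sum to the provider's total.
        our_total = sum(item.amount for item in line_items)
        if abs(our_total - quoted_total) > tolerance:
            sentry_sdk.capture_message(
                f"Line items sum to {our_total}, but the quote total is {quoted_total}",
                level="error",
            )
            return False  # caller shows a friendly error instead of bad pricing
        return True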
To generalize, it's vital to have something like Sentry from day one - a low-cost abstraction that lets you monitor broken assumptions asynchronously. Though of course, these kinds of tools didn't exist in 2002!
> Writing tests is annoying, especially in a codebase that keeps changing
Only if you don't have a functional core.
One of the psychology lessons I've learned about unit testing is just how badly the sunk cost fallacy affects some people all the time, and most people some of the time. I've witnessed people pairing for two days to fix a couple of nasty integration tests. That's 3 person-days of work lost on a couple of tests. How many release cycles would it take for those tests to pay for themselves versus manual testing?
You want the tests at the bottom of your pyramid to be so simple that people think of them as disposable. They shouldn't feel guilty deleting one test and writing a replacement if the requirements change. Elaborate mocks or poorly organized suites take that away, which is hard to explain to people who don't see the problem with their tests.
You also want the sides of that pyramid to be pretty shallow, especially if you keep changing your design.
The RNG was supposed to generate a random number in [0, sum of weights] not [0, num_players].
It's also possible to write a test for this that faithfully pins down the wrong behavior. The error is pretty subtle. There was even a "bug" in your summary of the bug ;)
Unit tests for things with probability are hard. I've written one recently, and among other reasons, it convinced me to write the selection differently so it was both more 'fair' and more easily tested.
It's slightly more complex than that: if you were farther away or otherwise in a better position than average, your portion of the interval would be <1, so even if you sorted first you still weren't guaranteed to be hit.
Is it so important? I think this made the community and game world more interesting. It's emergent behavior. As a game developer, I hope I can make things complicated enough to have emergent behavior, even if that's just from my bugs.
You know what would have made it even better though? Not having a bug that made your character cursed through no fault of your own without any recourse.