I think you have some major misconceptions about how this stuff works. Most vision models output the current state of the world, which then gets fused with other sensor data like lidar and handed off to planning. There are models that try to predict things like pedestrian crossing intent, but even those just output a confidence that the pedestrian will cross (roughly the shape of the sketch below).
I work for a different self-driving company, but this high-level stuff is pretty much the same everywhere.
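To make that concrete, here's a minimal Python sketch of the data flow. All the names, fields, and numbers are made up for illustration; it just shows the shape of the interface between perception and planning, not any real stack:

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    label: str                      # from the vision model, e.g. "pedestrian"
    position: tuple[float, float]   # fused estimate (vision + lidar), meters
    velocity: tuple[float, float]   # meters per second
    crossing_confidence: float      # intent model output: P(will cross), 0..1

def plan(world_state: list[TrackedObject]) -> str:
    """Toy planner: perception hands over the current world state,
    and planning consumes it downstream. Here we just brake for any
    pedestrian the intent model judges likely to cross."""
    for obj in world_state:
        if obj.label == "pedestrian" and obj.crossing_confidence > 0.5:
            return "brake"
    return "proceed"

# Hypothetical frame: one pedestrian, 72% crossing confidence.
state = [TrackedObject("pedestrian", (12.0, 3.5), (0.0, -1.2), 0.72)]
print(plan(state))  # "brake"
```

Note the key point: the perception side outputs labels and confidences, nothing more. Everything downstream is conditioned on those labels.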
Exactly, that is how that stuff works. Patterns in the ML neural net, learned from massive amounts of data and training, try to figure out the label of things from the various sensors, and then, given that label and the event stream, try to predict what might happen next. These models can recognize patterns similar to ones they've seen, with a certain probability attached, but they have no way to cope with brand-new patterns. And even if those brand-new patterns only occur 1 out of 100 drives, say the 3 days a year when something fully novel happens on the road, that is exactly what will kill you if you are relying on FSD. Until these systems can think like humans, they simply won't handle it.
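A toy illustration of why novelty is the failure mode (purely schematic, not any real company's stack): a classifier's softmax spreads probability over the labels it was trained on, and those probabilities always sum to 1, so there is no built-in "none of the above". A truly novel object still gets mapped to the nearest known label, often with nontrivial confidence:

```python
import numpy as np

LABELS = ["car", "pedestrian", "cyclist", "traffic_cone"]

def classify(logits: np.ndarray) -> tuple[str, float]:
    """Softmax over the known classes and return the top label.
    The probabilities sum to 1 across the trained labels, so the
    model must pick one of them no matter what it is looking at."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[idx])

# A genuinely novel object (say, a couch in the middle of the highway)
# still produces logits, and argmax still picks a known label.
novel_object_logits = np.array([2.1, 0.3, 0.2, 1.9])  # hypothetical values
label, confidence = classify(novel_object_logits)
print(label, confidence)  # "car" with roughly 0.47 confidence
```

The net never says "I have no idea what this is"; it says "car, 47%", and planning acts on that.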