I tested o3, r1, 4o and other SOTA models against puzzles from an international math competition and compared their performance with my 11-year-old daughter's solutions. Full results include detailed conversations with each model and complete methodology.