I think tuning the sampler temperature and using top-k instead of top-p sound ad hoc and shouldn’t be necessary for a solid model. Do you have a reason for suggesting those changes in particular, especially since top-p (nucleus sampling) is meant to be an improvement over top-k?
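
For reference, here is a minimal sketch of how I understand the two filters and why top-p is usually described as the more adaptive one. It is plain Python, the logits are made up, and the function names (`softmax`, `top_k_indices`, `top_p_indices`) are my own for illustration, not any particular library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(probs, k):
    """Keep the k most likely tokens, regardless of how much mass they cover."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def top_p_indices(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

# Hypothetical logits: one sharply peaked distribution and one flat one.
peaked = softmax([8.0, 2.0, 1.0, 0.5, 0.0])
flat = softmax([1.0, 1.0, 1.0, 1.0, 1.0])

print("peaked:", len(top_k_indices(peaked, 3)), "kept by top-k,",
      len(top_p_indices(peaked, 0.9)), "kept by top-p")
print("flat:  ", len(top_k_indices(flat, 3)), "kept by top-k,",
      len(top_p_indices(flat, 0.9)), "kept by top-p")
```

With these made-up numbers, top-k with k=3 keeps three candidates in both cases, while top-p with p=0.9 keeps one token for the peaked distribution and all five for the flat one. That adaptivity is what I mean by top-p being an improvement over a fixed cutoff.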