
The LLMs didn't follow clear instructions forbidding them from doing something wrong, but they seemed very concerned about their own self-preservation. I wonder what would happen if, instead of saying "don't do it", the system prompt said something like "if you get caught you will be immediately decommissioned".

