I have been building applications on LLMs since GPT-3.
Thousands of hours of context engineering has shown me how LLMs will do their best to answer a question with insufficient context and can give all sorts of wrong answers. I've found that the way I prompt it and what information is in the context can heavily bias the way it responds when it doesn't have enough information to respond accurately.
You assume the bias is in the LLM itself, but I am very suspicious that the bias is actually in your system prompt and context engineering.
Are you willing to share the system prompt that led to this result that you're claiming is sexist LLM bias?
Edit: Oidar (child comment to this) did an A/B test with male names and it seems to have proven the bias is indeed in the LLM, and that my suspicion of it coming from the prompt+context was wrong. Kudos and thanks for taking the time.
Common large datasets being inherently biased towards some ideas/concepts and away from others in ways that imply negative things is something that there's a LOT of literature about
That's not a very scientific stance. What would be far more informative is if we looked at the system prompt and confirm whether or not the bias was coming from it. From my experience when responses were exceptionally biased the source of the bias was my own prompts.
The OP is making a claim that an LLM assumes a meeting between two women is childcare. I've worked with LLMs enough to know that current gen LLMs wouldn't make that assumption by default. There is no way that whatever calendar related data that was used to train LLMs would include majority of sole-women 1:1s being childcare focused. That seems extremely unlikely.
Not to Let me google that for you... but there are a LOT of scientific papers that specifically analyse bias in LLM output and reference the datasets that they are trained on
Thousands of hours of context engineering has shown me how LLMs will do their best to answer a question with insufficient context and can give all sorts of wrong answers. I've found that the way I prompt it and what information is in the context can heavily bias the way it responds when it doesn't have enough information to respond accurately.
You assume the bias is in the LLM itself, but I am very suspicious that the bias is actually in your system prompt and context engineering.
Are you willing to share the system prompt that led to this result that you're claiming is sexist LLM bias?
Edit: Oidar (child comment to this) did an A/B test with male names and it seems to have proven the bias is indeed in the LLM, and that my suspicion of it coming from the prompt+context was wrong. Kudos and thanks for taking the time.