The framework loading these is in Swift. I haven’t gotten around to the logic for the JSON/regex parsing yet, but ChatGPT seems to understand the regexes just fine.
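For anyone who wants to poke at the parsing themselves, here’s a rough Swift sketch of what loading could look like. The struct fields are my guesses at the schema, not the actual on-disk format:

    import Foundation

    // Hypothetical shape of an override file; field names are guesses.
    struct SafetyOverrides: Decodable {
        let reject: [String]           // regex patterns that block generation
        let replace: [String: String]  // literal string substitutions
    }

    func loadOverrides(from url: URL) throws -> ([NSRegularExpression], [String: String]) {
        let data = try Data(contentsOf: url)
        let overrides = try JSONDecoder().decode(SafetyOverrides.self, from: data)
        // Compile patterns up front; NSRegularExpression throws on bad syntax.
        let compiled = try overrides.reject.map {
            try NSRegularExpression(pattern: $0, options: [.caseInsensitive])
        }
        return (compiled, overrides.replace)
    }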
Nope. This is a separate system. It’s not even abstracted to handle arbitrary assets; it exists specifically for these overrides. The decryption is done in the ModelCatalog private framework.
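(To be clear about what’s knowable here: ModelCatalog is private and its scheme is undocumented. If it turns out to be something standard like AES-GCM, the decryption side would be no more than this CryptoKit sketch; the algorithm choice is purely an assumption on my part:)

    import Foundation
    import CryptoKit

    // Assumption: AES-GCM with a combined nonce+ciphertext+tag blob.
    // The real ModelCatalog scheme may differ entirely.
    func decryptOverrideAsset(_ encrypted: Data, key: SymmetricKey) throws -> Data {
        let box = try AES.GCM.SealedBox(combined: encrypted)
        return try AES.GCM.open(box, using: key)
    }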
Yep. These filters are applied before the safety model runs (I’m still figuring out the architecture, but I’m pretty confident it’s an LLM combined with some text classification).
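Roughly this ordering, in made-up code (not Apple’s, just to show why the regex pass is cheap enough to always run first):

    import Foundation

    // Cheap pre-filter: if any compiled pattern matches, refuse immediately,
    // without ever loading or invoking the safety model.
    func matchesBlocklist(_ text: String, patterns: [NSRegularExpression]) -> Bool {
        let range = NSRange(text.startIndex..., in: text)
        return patterns.contains { $0.firstMatch(in: text, range: range) != nil }
    }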
The safety filter appears on both ends, input and output (or on multiple ends, depending on the complexity of your application).
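As a hypothetical wrapper (no real API names here), that just means running the same screen twice:

    // Screen the prompt on the way in, then the model's output on the way
    // back; nil stands in for whatever refusal behavior the app shows.
    func guardedGenerate(_ prompt: String,
                         model: (String) -> String,
                         isBlocked: (String) -> Bool) -> String? {
        guard !isBlocked(prompt) else { return nil }   // input side
        let output = model(prompt)
        guard !isBlocked(output) else { return nil }   // output side
        return output
    }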
I can tell you from using Microsoft's products that safety filters appear in a bunch of places. In M365, for example, your prompts are never totally your prompts; every single one gets rewritten. It's detailed here: https://learn.microsoft.com/en-us/copilot/microsoft-365/micr...
The above appears to have been scrubbed, but it used to be available on the Learn page months ago. Your messages get additional context data from Microsoft's Graph, which powers the enterprise version of M365 Copilot. There are significant benefits to this, and downsides. And considering the way Microsoft wants to control things, you will get results overindexed toward what happens inside your organization rather than what's happening on the near-real-time web.
This is definitely an old test left in. But that word isn’t just a silly one; it’s offensive (google it). This is the v1 safety filter, which simply maps strings to other strings, in this case changing golliwog into “test complete”. Unless I missed some, the rest of the files use v2, which allows for more complex rules.
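Conceptually, v1 is nothing more than a dictionary walk. The one mapping above is the only real entry I’m quoting; the code shape is illustrative:

    import Foundation

    // v1 as described: a flat string-to-string substitution map.
    let v1Substitutions = ["golliwog": "test complete"]

    func applyV1(_ text: String, substitutions: [String: String]) -> String {
        substitutions.reduce(text) { result, pair in
            result.replacingOccurrences(of: pair.key, with: pair.value, options: .caseInsensitive)
        }
    }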
These are exactly the contents read by the Obfuscation functions. There still seems to be a lot of testing stuff in there, though; remember, these models are relatively recent. A true safety model is applied after these checks as well; this layer just catches things before the safety model needs to be loaded.
There is definitely some testing stuff in here (e.g. the “Granular Mango Serpent” one), but there are real rules. Also, if you test phrases matched by the regexes against generation (via Shortcuts or the Foundation Models framework), the blocklists are definitely applied.
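E.g. via the public API, something like this. LanguageModelSession and respond(to:) are the real Foundation Models entry points, but how exactly a blocked phrase surfaces (a guardrail error vs. a canned refusal) is my reading, so treat the catch branch as an assumption:

    import FoundationModels

    func probe(_ phrase: String) async {
        let session = LanguageModelSession()
        do {
            let response = try await session.respond(to: phrase)
            print("allowed:", response.content)
        } catch {
            // Blocklisted input appears to surface as a guardrail error.
            print("blocked:", error)
        }
    }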
This specific file you’ve referenced is the v1 format, which solely handles substitution. It substitutes the offensive term with “test complete”.
One additional note for everyone: this is an extra safety step on top of the safety model, so it isn’t exhaustive. There is plenty more that the actual safety model catches, and those rules can’t easily be extracted.