
A new preprint of ours is out, where we look at just how easy it is for generative LLMs to evade detection by classification-fine-tuned LLMs - currently considered the SotA for detection.

And it turns out it is extremely easy - at least in our setting (GPT-2 vs BERT).

So easy, in fact, that we could entirely stop the LLM detector from training in the open-label scenario with a minimum of "normal" usage tricks - fine-tuning, prompting, and access to the reference "human" dataset.

1/🧵

arxiv.org/abs/2304.08968
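To make the "access to the reference 'human' dataset" trick concrete, here is a minimal sketch, assuming the attacker simply fine-tunes GPT-2 on the same "human" corpus the detector treats as its reference (the file name and hyperparameters are placeholders, not our exact pipeline):

```python
# Minimal sketch: fine-tune GPT-2 on the detector's "human" reference corpus
# so its outputs drift toward the distribution labeled as "human".
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import load_dataset

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical stand-in for the detector's "human" reference data.
human = load_dataset("text", data_files={"train": "human_reference.txt"})["train"]
human = human.map(lambda b: tok(b["text"], truncation=True, max_length=256),
                  batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-evasive", num_train_epochs=1,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=human,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```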

In the process, we also demonstrated that we could obtain well-generalizing critic-model fine-tunes with as few as 10k prompts by dropping Adam in favor of AdamW.
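The swap itself is a one-liner; a sketch, assuming a BERT-style critic and a standard PyTorch training loop (elided here). The point is that AdamW decouples weight decay from the adaptive update, whereas Adam folds L2 regularization into the gradient:

```python
import torch
from transformers import AutoModelForSequenceClassification

critic = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # human vs. machine

# optimizer = torch.optim.Adam(critic.parameters(), lr=2e-5)   # before
optimizer = torch.optim.AdamW(critic.parameters(), lr=2e-5,
                              weight_decay=0.01)                # after
```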

Before breaking LLM detectors, we used this fine-tuning setup to make a slightly more cheerful and less psycho version of GPT-2-small.

Because if you have only ever interacted with modern conversational agents, you have no idea just how unhinged raw LLMs are under the hood. And the smaller ones are kinda less psycho than the larger ones.
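You can check this for yourself by sampling from the raw, un-fine-tuned checkpoint (a quick illustration, not from the paper):

```python
from transformers import pipeline, set_seed

set_seed(42)  # for reproducible (if still unsettling) samples
generate = pipeline("text-generation", model="gpt2")
print(generate("I think people are", max_new_tokens=50,
               do_sample=True, top_p=0.95)[0]["generated_text"])
```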

2/🧵

We also showed that RNN-based text GANs - which were en vogue right before Transformer-based generation took over, and are often cited as a model for an upcoming arms race between LLM generators and LLM detectors - are potentially fully broken: the lower representational power of RNNs can hide major algorithmic flaws.

3/🧵

In particular, we show that the diversity-promoting GAN (DP-GAN; EMNLP 2018) not only doesn't really promote diversity, but degenerates extremely fast in the actual adversarial phase. This is due to an interaction between the type of data it uses and the adversarial reward function - an interaction that gets masked by less powerful underlying models (RNNs, shallow Transformers).
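To make that failure mode concrete, here is a hedged, simplified sketch of a DP-GAN-style reward: the discriminator is a language model, and a sequence is rewarded in proportion to its negative log-likelihood under that LM. Nothing anchors the reward to fluency, so once the generator is expressive enough, arbitrary gibberish maximizes it:

```python
import torch.nn.functional as F

def dp_gan_reward(disc_lm, token_ids):
    """Mean per-token NLL of the sequence under the discriminator LM.

    disc_lm: any module mapping (batch, seq) token ids to next-token logits.
    Higher NLL = more "novel" text = more reward, including pure gibberish.
    """
    logits = disc_lm(token_ids[:, :-1])       # predict each next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch*seq, vocab)
        token_ids[:, 1:].reshape(-1),         # shifted targets
        reduction="mean",
    )
```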

4/🧵

While this work was complete by mid-2022, in late November of last year we decided to delay its release.

When we started this work in early 2022, as part of my research at the @cydcampus, we assumed an APT-level LLM operator, given the size of high-quality generative models and the difficulty of prompt engineering at the time.

5/🧵

With the arrival of ChatGPT and smaller, conversationally fine-tuned models, those assumptions went out the window, and the attack outlined in our paper could easily be implemented to evade the basic LLM detectors most people were relying on.

So we tried reaching out to operators of some such detectors (notably the @huggingface OpenAI detector) for a responsible disclosure, but 5 months later, we are still waiting to hear back.
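For reference, that detector can be queried in a couple of lines - a sketch assuming the RoBERTa-based OpenAI detector then on the Hugging Face Hub (the model id may have moved since):

```python
from transformers import pipeline

detector = pipeline("text-classification",
                    model="roberta-base-openai-detector")
print(detector("The text you want to screen goes here."))
# e.g. [{'label': 'Real', 'score': 0.97}] - output from a fine-tuned or
# prompted generator can come back "Real" just as easily.
```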

6/🧵

Unfortunately, in the meantime, LLM detectors are being widely deployed in positions where they can seriously hurt people through false positives - notably in education, but also in recruitment and mail filtering.

Given the good chance that false positives disproportionately affect vulnerable minorities, we decided that a public disclosure was warranted.

7/🧵

PJ Coffey

@andrei_chiffa Oof... there's a lot going on there. Having to release your paper because you're worried about minorities being denied jobs by poorly and obscurely implemented black boxes is never a fun time.