Technology

Alignment faking in large language models (www.anthropic.com)

submitted 5 days ago by [email protected] to c/[email protected]


[email protected] 18 points 4 days ago (last edited 4 days ago)

This may not be factually wrong, but it's not well written, and it was probably not written by someone with a good understanding of how generative AI LLMs actually work. An LLM is an algorithm that uses math to generate the next most likely word or words based on its training data set. It doesn't think. It doesn't understand. It doesn't have dopamine receptors with which to "feel". It can't view "feedback" in a positive or negative way.
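To make "generates the next most likely word" concrete, here is a toy sketch in Python. It is not how a real LLM is built: real models use neural networks over tokens, while this just counts which word follows which in a tiny made-up corpus. The function names and the corpus are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the word that most often followed `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up training text, purely for illustration.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)

# "the" is followed by "cat" twice and "mat" once, so "cat" wins.
print(most_likely_next(model, "the"))
```

The output depends entirely on the counts in the training text, which is the point: change the data and the "most likely" word changes with it.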

Now that I've gotten that out of the way, it is possible that what is happening here is that they trained the LLM on a data set with a bias away from the center. If the model responds to a query with something generated statistically from that data set, and the people who own the LLM don't want it to give that particular response, they will add a guardrail to prevent it from giving that response again. But if they don't remove that information from the data set and retrain the model, then that bias may still show up in responses in other ways. And I think that's what we're seeing here.

You can't train a Harry Potter LLM on the Harry Potter books and movies plus the Harry Potter fanfiction available online and then tell it not to answer questions about canon with fanfiction info. To get that, you either have to separate and quarantine the fanfiction info, or remove it and retrain the LLM on a more curated data set.
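Here is a rough Python sketch of the guardrail-versus-retraining point, using a toy word-count model as a stand-in for an LLM. Everything in it is an assumption for illustration: the training sentences are invented, the "guardrail" is just a blocklist of specific word pairs, and real guardrails and real models are far more complicated.

```python
from collections import Counter, defaultdict

def train(sentences):
    """Count which word follows which across all training sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def next_word(counts, word, blocked=frozenset()):
    """Most frequent follower of `word`, skipping blocked (word, follower) pairs."""
    for candidate, _ in counts.get(word, Counter()).most_common():
        if (word, candidate) not in blocked:
            return candidate
    return None

# Invented toy data: one "canon" sentence and some "fanfiction" sentences.
canon = ["snape teaches potions"]
fanfic = ["snape teaches karaoke", "snape teaches karaoke", "snape loves karaoke"]

mixed = train(canon + fanfic)            # model trained on everything
guardrail = {("teaches", "karaoke")}     # block one specific unwanted response

unguarded = next_word(mixed, "teaches")           # fanfic outvotes canon
guarded = next_word(mixed, "teaches", guardrail)  # guardrail hides that one answer
leaked = next_word(mixed, "loves", guardrail)     # fanfic still surfaces elsewhere

curated = train(canon)                   # retrain on the curated data only
retrained = next_word(curated, "loves")  # the fanfic association is gone

print(unguarded, guarded, leaked, retrained)
```

The guardrail only blocks the one response it was written for. The fanfiction counts are still in the model, so they come out through a different prompt until the data is removed and the model is retrained.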