submitted 1 year ago by [email protected] to c/[email protected]

Using model-generated content in training causes irreversible defects, a team of researchers says. "The tails of the original content distribution disappears," writes co-author Ross Anderson from the University of Cambridge in a blog post. "Within a few generations, text becomes garbage, as Gaussian distributions converge and may even become delta functions."

Here is the study: http://web.archive.org/web/20230614184632/https://arxiv.org/abs/2305.17493
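To get a feel for the "Gaussians become delta functions" remark, here is a toy sketch, not the study's actual experiment. It assumes a one-dimensional Gaussian "model" that is refit on a small batch of samples drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters (`generations`, `sample_size`, `seed`) are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Refit a Gaussian on its own samples, generation after generation.

    Returns the fitted standard deviation at each generation, starting
    with the "real data" distribution N(0, 1) at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the original distribution
    history = [sigma]
    for _ in range(generations):
        # The next "model" only ever sees samples from the current one.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Maximum-likelihood fit: sample mean and population std dev.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 300):
        print(f"generation {gen:3d}: sigma = {history[gen]:.3e}")
```

With small batches, each refit tends to lose a little spread, and the losses compound: the fitted sigma drifts toward zero, so the tails go first and the distribution ends up close to a spike. Larger `sample_size` values slow the drift but do not remove it in this toy setup.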

[-] [email protected] 8 points 1 year ago

But then you're training on more and more outdated data

[-] [email protected] 6 points 1 year ago

Both in terms of factual information, news, etc, and just in terms of language change. An LLM needs to be able to keep up with slang and other new words, both for understanding prompts and for producing passable results.

[-] [email protected] 1 points 1 year ago

Afaik, there are already solutions to that.

You first train the model on the outdated but correct data, to establish the correct "thought" patterns.

And then you can train the AI on the fresh but flawed data, without it tripping over the mistakes.

this post was submitted on 20 Jun 2023
49 points (100.0% liked)
