this post was submitted on 09 Jul 2023
Technology
I'm not shifting the goalposts--I have been consistent in my position that AI does not truly "learn" in the way that humans do, and is incapable of the comprehension required for actual human creativity. Tay spouting racist rhetoric because that's what was put into it supports that position, if anything; if it were capable of comprehending the language it was being fed, it wouldn't have done that.
You have stated that it's not infringing on copyright to train a model on published works, yes. I wholeheartedly disagree, because, as I have previously stated, AI models as they currently exist cannot produce new, derivative works based on their training data; they can only recombine that training data in various ways. This is important because one of the requirements for copyright protection, as per the US Copyright Office, is that the work be an independent creation, which "means that the author created the work without copying from other works." AI's inability to create its own work without copying from other works means that it cannot produce copyrightable material.
As a result, if you train an AI model on an infringing dataset, the resulting output is also infringing, because it is not, and cannot be, transformative to the level required to meet the minimal creativity threshold needed for copyright protection. At best, you can make an argument that the infringement in an AI's output is acceptable under the de minimis doctrine (i.e. that the amount of the copyrighted work contained in an infringing work is so trivial as to not warrant protection). However, my belief is that if a hypothetical composite work takes all of its source material from 100 different copyrighted sources, it wouldn't qualify for de minimis protection because the composite work is 100% infringing, even though each individual source only contributed 1% to the total work.
To summarize, my line of thinking is as follows:
Since the specific output of an AI model lacks any copyright protection, that output does not qualify for any related defenses such as fair use, because these defenses require significant transformation of the work in question. If something cannot be transformative, novel, or new enough to qualify for copyright protection in the first place, it's impossible for it to be transformative enough for a fair use defense. It also cannot qualify for copyright protection as a compilation or derivative work, as both must contain copyrightable subject matter--since the AI output is not copyrightable, it cannot be claimed as either a compilation or a derivative.
As a result, if the training dataset input to an AI model is infringing, then the output of that AI model is also infringing, since the output does not independently qualify for copyright protection, nor can it leverage related defenses.
Large corporations and open-source AI projects are scraping our IP without consent because they think they can get away with it, and because it's easier to steal it than to properly obtain consent from the people whose content they are using. And to be clear, I don't give a shit if preventing AI from stealing copyrighted content kills large open-source AI tools. If the only way they can be useful is by committing mass infringement, then they don't deserve to exist. They can use their own internally developed datasets, use datasets that draw only from the public domain, obtain consent (which may or may not include royalties) from creators, or wither on the vine. That applies to both open-source and commercial AI technology.
Finally, I want to make it 100% clear that I have no issues with AI models that do not use copyrighted material in their training datasets. My employer introduced an AI chatbot trained entirely on our internal and public knowledge bases, and I'm perfectly fine with that morally/ethically/legally. (Personally, I think it's a little useless, since the last time I used it the damn thing confidently gave me a false answer with fake links to nonexistent KB articles, but that's beside the point.) My entire issue with AI is centered around the unlicensed use of copyrighted material by AI models without the creator's consent, attribution, or compensation.