this post was submitted on 09 Jul 2023

500 points (97.0% liked)


2 authors say OpenAI 'ingested' their books to train ChatGPT. Now they're suing, and a 'wave' of similar court cases may follow. (www.businessinsider.com)

submitted 1 year ago by [email protected] to c/[email protected]


Two authors sued OpenAI, accusing the company of violating copyright law. They say OpenAI used their work to train ChatGPT without their consent.

(page 2) 50 comments


[–] [email protected] 2 points 1 year ago

Can’t reply directly to @[email protected] because of that “language” bug, as well. This is an interesting argument. I would imagine that the AI does not have the ability to follow plagiarism rules. Does it even credit sources? I've seen plenty of complaints from students getting in trouble because anti-cheating software flags their original work as plagiarism. More importantly, I really believe we need to take a firm stance on what is ethical to feed into ChatGPT. Right now it's the wild west.

[–] [email protected] 2 points 1 year ago (5 children)

To be honest, I hope they win. While my passion is technology, I am not a fan of artificial intelligence at all! Decision-making is best left up to the human being. I can see where AI has its place, like in gaming or some other things, but to mainstream it and use it to decide whose resume is going to be viewed and/or who will be hired; hell no.

[–] [email protected] 2 points 1 year ago

I got a degree with a sub focus in AI and I hate where this has gone extremely fast. It’s not exciting anymore, it’s just depressing. I’m trying to get out of tech sooner rather than later and go live off the grid somewhere.

AI will kill society long before it’ll save it

[–] [email protected] 1 points 1 year ago (7 children)

I'm not against artificial intelligence, it could be a very valuable tool, but that's nowhere near a valid reason to break laws as OpenAI has done, that's why I too hope authors win.


[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (3 children)

The only question I have for content creators of any kind who are worried about AI...do you go after every human who consumed your content when they create anything remotely connected to your work?

I feel like we have a bias towards humans, that unless you're actively trying to steal someone's idea or concepts we ignore the fact that your content is distilled into some neurons in their brain and a part of what they create from that point forward. Would someone with an eidetic memory be forbidden from consuming your work as they could internally reference your material when creating their own?

[–] [email protected] 2 points 1 year ago (13 children)

The problem with AI as it currently stands is that it has no actual comprehension of the prompt, or ability to make leaps of logic, nor does it have the ability to extend and build upon existing work to legitimately transform it, except by using other works already fed into its model. All it can do is blend a bunch of shit together to make something that meets a set of criteria. There's little actual fundamental difference between what ChatGPT does and what a procedurally generated game like most roguelikes does--the only real difference is that ChatGPT uses a prompt while a roguelike uses an RNG seed. In both cases, though, the resulting product is limited solely to the assets available to it, and if I made a roguelike that used assets ripped straight from Mario, Zelda, Mass Effect, Crash Bandicoot, Resident Evil, and Undertale, I'd be slapped with a cease and desist fast enough to make my head spin.
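To make the seed comparison concrete, here is a toy sketch (not how any real roguelike or ChatGPT works; the asset names and `generate_level` are made up for illustration). The same seed always yields the same level, and nothing can appear that isn't already in the asset pool:

```python
import random

# Hypothetical asset pool; a real game would load sprites, tiles, etc.
ASSETS = ["wall", "floor", "door", "chest", "monster"]

def generate_level(seed, size=8):
    """Build a level as a list of assets chosen by a seeded RNG."""
    rng = random.Random(seed)  # private RNG so the global state is untouched
    return [rng.choice(ASSETS) for _ in range(size)]

level_a = generate_level(42)
level_b = generate_level(42)
print(level_a == level_b)             # same seed, same level
print(set(level_a) <= set(ASSETS))    # output never leaves the asset pool
```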

The fact that OpenAI stole content from everybody in order to make its model doesn't make it less infringing.

[–] [email protected] 0 points 1 year ago (1 children)

The fact that OpenAI stole content from everybody in order to make its model doesn’t make it less infringing.

Totally in agreement with you here. They did something wrong and should have to deal with that.

But my question is more about...

The problem with AI as it currently stands is that it has no actual comprehension of the prompt, or ability to make leaps of logic, nor does it have the ability to extend and build upon existing work to legitimately transform it, except by using other works already fed into its model

Is comprehension necessary for copyright infringement? Is it really about a creator being able to be logical or to extend concepts?

I think we have a definition problem with exactly what the issue is. This may be a little too philosophical but what part of you isn't processing your historical experiences and generating derivative works? When I say "dog" the thing that pops into your head is an amalgamation of your past experiences and visuals of dogs. Is the only difference between you and a computer the fact that you had experiences with non-created works while the AI is explicitly fed created content?

AI could be created with a bit of randomness added in to make what it generates "creative" instead of derivative but I'm wondering what level of pure noise needs to be added to be considered created by AI? Can any of us truly create something that isn't in some part derivative?

There’s little actual fundamental difference between what ChatGPT does and what a procedurally generated game like most roguelikes does

Agreed. I think at this point we are in a strange place because most people think ChatGPT is a far bigger leap in technology than it truly is. Its biggest achievement was being able to process synthesized data fast enough to make it feel conversational.

What worries me is that we will set laws and legal precedent based on a fundamental misunderstanding of what the technology does. I fear that had all the sample data been acquired legally, people would still make the same argument, thinking their creations exist inside the AI in some full context when it's really just synthesized down to what is necessary to answer the question posed: "what's the statistically most likely next word of this sentence?"

[–] [email protected] 0 points 1 year ago (1 children)

Is comprehension necessary for copyright infringement? Is it really about a creator being able to be logical or to extend concepts?

I think we have a definition problem with exactly what the issue is. This may be a little too philosophical but what part of you isn’t processing your historical experiences and generating derivative works? When I say “dog” the thing that pops into your head is an amalgamation of your past experiences and visuals of dogs. Is the only difference between you and a computer the fact that you had experiences with non-created works while the AI is explicitly fed created content?

That's part of it, yes, but nowhere near the whole issue.

I think someone else summarized my issue with AI elsewhere in this thread--AI as it currently stands is fundamentally plagiaristic, because it cannot be anything more than the average of its inputs, and cannot be greater than the sum of its inputs. If you ask ChatGPT to summarize the plot of The Matrix and write a brief analysis of the themes and its opinions, ChatGPT doesn't watch the movie, do its own analysis, and give you its own summary; instead, it will pull up the part of the database that was fed into its learning model that relates to "The Matrix," "movie summaries," "movie analysis," find what parts of its training dataset match up to the prompt--likely an article written by Roger Ebert, maybe some scholarly articles, maybe some Metacritic reviews--and spit out a response that combines those parts together into something that sounds relatively coherent.

Another issue, in my opinion, is that ChatGPT can't take general concepts and extend them further. To go back to the movie summary example, if you asked a regular layperson human to analyze the themes in The Matrix, they would likely focus on the cool gun battles and neat special effects. If you had that same layperson attend a four-year college and receive a bachelor's in media studies, then asked them to do the exact same analysis of The Matrix, their answer would be drastically different, even if their entire degree did not discuss The Matrix even once. This is because that layperson is (or at least should be) capable of taking generalized concepts and applying them to specific scenarios--in other words, a layperson can take the media analysis concepts they learned while earning that four-year degree, and apply them to a specific thing, even if those concepts weren't explicitly applied to that thing. AI, as it currently stands, is incapable of this. As another example, let's say a brand-new computing language came out tomorrow that was entirely unrelated to any currently existing computing languages. AI would be nigh-useless at analyzing and helping produce new code for that language--even if it were dead simple to use and understand--until enough humans published code samples that could be fed into the AI's training model.

[–] [email protected] 1 points 1 year ago

Hmm that is an interesting take.

The movie summary question is interesting. For most people I doubt they have asked ChatGPT for its own personal views on the subject matter. Asking for a movie plot summary doesn't inherently require the one giving it to have experienced the movie. If this were the case then pretty much all papers written in a history class would fall under this category. No high schooler today went to war but could write about it because they are synthesizing others' writings about the topic. Granted we know this to be the case and the students are required to cite their sources even when not directly quoting them...would this resolve the first problem?

If we specifically asked ChatGPT "Can you give me your personal critique of the movie The Matrix?" and it returned something along the lines of "Well I can't view movies and only generate responses based on writings of others who have seen it." would that make the usage more clear? If it's required for someone to have the ability to have their own critical analysis, there would be a handful of kids from my high school who would fail at that task too and did so regularly.

I like your college example as that is getting better at a definition, but I think we need to find a very explicit way of describing what is happening. I agree current AI can't do any of this so we are very much talking about future tech.

With the idea of extending material, do we have a good enough understanding of how humans do it? I think it's interesting when we look at computer neural networks. One of the first ones we build in a programming class is an AI that can read single-digit, handwritten numbers. What eventually happens is the system generates a crazy huge and unreadable equation to convert bits of an image into a statistically likely answer. When you dissect it you'd think, "Oh to see the number 9 the equation must see a round top and a straight part on the right side below it." And that assumption would be wrong. Instead we find it's dozens of specific areas of the image that you and I wouldn't necessarily associate with a "9".

But then if we start to think about our own brains, do we actually process reading the way we think we do? Maybe for individual characters. But we know when we read words we focus specifically on the first and last character, the length of the word and any variation of the height of the text. We can literally scramble up the letters in the middle and still read the text.
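If you want to try the scrambled-letters effect yourself, here is a quick sketch (the function names are my own, and it treats anything between spaces as a word, so trailing punctuation counts as the "last letter"):

```python
import random

def scramble_middle(word, rng):
    """Shuffle the interior letters of a word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle with fewer than two interior letters
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text, seed=0):
    """Scramble every whitespace-separated word in the text."""
    rng = random.Random(seed)
    return " ".join(scramble_middle(w, rng) for w in text.split())

print(scramble_text("we can literally scramble the letters in the middle"))
```

Each output word keeps its length, its first and last character, and the same set of letters; only the interior order changes.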

The reason I bring this up is that we often focus on how humans can transform data using past history but we often fail to explain how this works. When asking ChatGPT a more vague concept it does pull from others' works but one thing it also does is create a statistical analysis of human speech. It literally figures out what is the most likely next word to be said in the given sentence. The way this calculation occurs is directly related to the material provided, the order in which it was provided, the weights programmed into it to make decisions, etc. I'd ask how this is fundamentally different than what humans do.
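A toy version of the "most likely next word" idea, just to show the shape of it. This is a simple word-pair counter I made up for illustration; ChatGPT itself uses a large neural network over tokens, not a lookup table like this:

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the dog chased the cat and the dog barked at the dog next door"
model = train_bigrams(corpus)
print(most_likely_next(model, "the"))  # prints "dog"
```

Even in this tiny case the answer depends entirely on the material provided: change the corpus and the "most likely" word changes with it.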

I'm a big fan of students learning a huge portion of the same literature when in high school. It creates a common dialog we can all use to understand concepts. I, in my 40s, have often referenced a character or event, statement or theme from classic literature and have noticed that only those older than me often get it. In less than a few words I've conveyed a huge amount of information that only occurs when the other side of the conversation gets the reference. I'm wondering if at some point AI is able to do this type of analysis would it be considered transformative?


[–] [email protected] 1 points 1 year ago (1 children)

By nature of a human creating something "connected" to another work, then the work is transformative. Copyright law places some value on human creativity modifying a work in a way that transforms it into something new.

Depending on your point of view, it's possible to argue that machine learning lacks the capacity for transformative work. It is all derivative of its source material, and therefore is infringing on that source material's copyright. This is especially true when learning models like ChatGPT reproduce their training material wholesale, as is mentioned elsewhere in the thread.

[–] [email protected] 0 points 1 year ago* (last edited 1 year ago) (1 children)

I'd argue that all human work is derivative as well. Not from the legal stance of copyright law but from a fundamental stance of how our brains work. The only difference is that humans have source material outside that which is created. You have seen an apple on a tree before, not all of your apple experiences are pictures someone drew, photos someone took or a poem someone wrote. At what point would you consider enough personal experience to qualify as being able to generate transformative work? If I were to put a camera in my head and record my life and donate it as public domain would that be enough data to allow an AI to be considered able to create transformative works? Or must the AI have genuine personal experiences?

Our brains can do some level of randomness but their current state is based on their previous state and the inputs they received. I wonder when trying to come up with something unique, what portion of our brains dive into memories versus pure noise generation. That's easily done on a computer.

As for whole cloth reproduction...I memorized many poems in school. Does that mean I can never generate something unique?

Don't get me wrong, they used stolen material, that's wrong. But had it been legally obtained I see less of an issue.

[–] [email protected] 1 points 1 year ago (1 children)

But derivative and transformative are legal terms with legal meanings. Arguing how you feel the word derivative applies to our brain chemistry is entirely irrelevant.

You've memorized poems, and (assuming the poem is not in the public domain) if you reproduce that poem housed in a collection of poems without any license from the copyright owner you've infringed on that copyright. It is not any different when ChatGPT reproduces a poem in its output.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (1 children)

I think it's very relevant because those laws were created at a time when there was no machine generated material. The law makes the assumption that one human being is creating material and another human being is stealing some material. In no part of these laws do they dictate rules on creating a non-human third party that would do the actual copying. There were specific rules added for things like photocopy machines and faxes where attempts are made to create exact facsimiles. But ChatGPT isn't doing what a photocopier does.

The current lawsuits, at least the ones I've read over, have not been explicitly about outputting copyrighted material. While ChatGPT could output the material just as I could recite a poem, the issue being brought up is that the training materials were copyrighted and that the AI system then "contains" said material. That is why I asked my initial question. My brain could contain your poem and as long as I don't write it down as my own, what violation is occurring? OpenAI could go to the library, rent every book and scan them in and all would be ok, right? At least from the recent lawsuits.

[–] [email protected] 1 points 1 year ago (1 children)

The current (at least in the US) laws do cover work that isn't created by a human. It's well-trodden legal ground. The highest profile case of it was a monkey taking a photograph: https://en.m.wikipedia.org/wiki/Monkey_selfie_copyright_dispute

Non-human third parties cannot hold copyright. They are not afforded protections by copyright. They cannot claim fair use of copyrighted material.

[–] [email protected] 1 points 1 year ago (1 children)

I meant in the opposite direction. If I teach an elephant to paint and then show him a Picasso and he paints something like it, am I the one violating copyright law? I think currently there are no explicit laws about this type of situation but if there was a case to be made MY intent would be the major factor.

The 3rd party copying we see laws around is human-driven intent to make exact replicas. Photocopy machines, Cassette/VHS/DVD duplication software/hardware, Faxes, etc. We have personal private fair use laws but all of this is about humans using tools to make near exact replicas.

The law needs to catch up to the concept of a human creating something that then goes out and makes non replica output triggered by someone other than the tool's creator. I see at least 3 parties in this whole process:

AI developer creating the system
AI teacher feeding it learning data
AI consumer creating the prompt

If the data fed to the AI was all gathered by legal means, let's say scanned library books, who is in violation if the content output were to violate copyright laws?

[–] [email protected] 1 points 1 year ago

These are questions that, again, are pretty well trodden in the copyright space. ChatGPT in this case acts more like a platform than a tool, because it hosts and can reproduce material that it is given. Again, US-only perspective, and perspective of a non-lawyer, the DMCA outlines requirements for platforms to be protected from being sued for hosting and reproducing copyrighted works. But part of the problem is that the owners of the platforms are the parties that are uploading, via training the LLM, copyrighted works. That automatically disqualifies a platform from any sort of safe harbor protections, and so the owners of the ChatGPT platform would be in violation.


[–] [email protected] 1 points 1 year ago

Good, hope they win.

[–] [email protected] 0 points 1 year ago* (last edited 1 year ago) (3 children)

I don't really understand why people are so upset by this. Except for people who train networks based on someone's stolen art style, people shouldn't be getting mad at this. OpenAI has practically the entire internet as its source, so GPT is going to have so much information that any specific author barely has an effect on the output. OpenAI isn't stealing people's art because they are not copying the artwork, they are using it to train models. Imagine getting sued for looking at reference artwork before creating artwork.
