this post was submitted on 08 Jan 2024

334 points (96.1% liked)

Technology


Microsoft, OpenAI sued for copyright infringement by nonfiction book authors in class action claim (www.cnbc.com)

submitted 10 months ago by [email protected] to c/[email protected]

62 comments

The new copyright infringement lawsuit against Microsoft and OpenAI comes a week after The New York Times filed a similar complaint in New York.


[–] [email protected] 18 points 10 months ago (1 children)

Already seeing people come in to defend these suits. I just see it like this: AI is a tool, much like a computer or a pencil are tools. You can use a computer to copyright infringe all day, just like a pencil can. To me, an AI is only going to be plagiarizing or infringing if you tell it to. How often does AI plagiarize without a user purposefully trying to get it to do so? That’s a genuine question.

You are misrepresenting the issue. The issue here is not whether a tool just happens to be usable for copyright infringement in the hands of a malicious entity. The issue here is whether LLM outputs are just derivative works of their training data. This is something you cannot compare to tools like pencils and PCs, which are much more general purpose and which are not built on stolen copyrighted works. Notice also how AI companies bring up "fair use" in their arguments. This means that they are not arguing that they are not using copyrighted works without permission, nor that the output of the LLM does not contain any copyrighted part of its training data (they can't do that because you can't trace the flow of data through an LLM), but rather that their use of the works is novel enough to be an exception. And that is a really shaky argument when their services are actually not novel at all. In fact they are designing services that are as close as possible to the services provided by the original work creators.

[+] [email protected] -7 points 10 months ago* (last edited 10 months ago) (3 children)

In fact they are designing services that are as close as possible to the services provided by the original work creators.

I disagree, and if I'm misrepresenting the issue then I feel like you are too. LLMs can do far more than simply write stories. They can write stories, but that is just one capability among many. Can it write stories in the style of GRRM? I suppose, but honestly doesn't GRRM also borrow a lot of inspiration from other authors? Any writer claiming to be so unique that they aren't borrowing from other writers is full of shit.

I'm not a lawyer or legal expert, I'm just giving a layman's opinion on a topic. I hope Sam Altman and his merry band get nailed to the wall, I really do. It's going to be a clusterfuck of endless legal battles for the foreseeable future, especially now that OpenAI isn't even pretending to be nonprofit anymore.

[–] [email protected] 13 points 10 months ago (1 children)

This story is about a non-fiction work.

What is the purpose of a non-fiction work? It's to give the reader further knowledge on a subject.

Why does an LLM manufacturer train their model on a non-fiction work? To be able to act as a substitute source of the knowledge.

End result is that

the original is made redundant.
the original author is no longer credited.

So, not only have they stolen their work, they've stolen their income and reputation.

[+] [email protected] -10 points 10 months ago* (last edited 10 months ago) (1 children)

If you're using an LLM as any form of authoritative source (and literally every LLM specifically warns NOT to do that), then you're going to have a bad time. No one is using them to learn in any serious capacity. Ideally, the AI should absolutely be citing its sources, and if someone is able to figure out how to do that reliably, they'll be made quite rich, I'd imagine. In my opinion, the fiction writers have a stronger case than non-fiction (I believe the fiction writers' class action against OpenAI in September is still ongoing).

[–] [email protected] 12 points 10 months ago (1 children)

For someone who claimed to not be a fan of OpenAI, you sure do know all the fan arguments against regulation for AI.

[–] [email protected] 7 points 10 months ago (1 children)

There's a big difference between borrowing inspiration and just using entire paragraphs of text or images wholesale. If GRRM uses entire paragraphs of JK Rowling with just the names changed and uses the same cover with a few different colors, you have the same fight. An LLM can do the first, but it also does the second.

The "in the style of" is a different issue that's being debated, as style isn't protected by law. But apparently if you ask for something in the style of an author, the LLM can get lazy and produce parts of the (copyrighted) source material instead of something original.

[–] [email protected] 4 points 10 months ago (1 children)

Just as with the right query you could get an LLM to output a paragraph of copyrighted material, you can with the right query get Google to give you a link to copyrighted material. Does that make all search engines illegal?

[–] [email protected] 7 points 10 months ago* (last edited 10 months ago) (1 children)

Legally it's very different. One is a link, the other content. It's the same difference as pointing someone to the street where the dealers hang out or opening your coat and asking how many grams they want.

[–] [email protected] 4 points 10 months ago (1 children)

Websites that provide links to copyrighted material are illegal in the US. It's why torrent sites are taken down and need to be hosted in countries with different copyright laws.

So Google can be used to pirate, but that's not its intention. It requires careful queries to get Google to show pirate links. Making a tool that could be used for unintentional copyright violation illegal makes all search engines illegal.

It could even make all programming languages illegal. I could use C to write a program to add two numbers or to crawl the web and return illegal movies.

[–] [email protected] 4 points 10 months ago

Oh. Linking and even downloading torrents is legal in my place. Hosting and sharing is not. My bad.

As I understand it, the copyright holders want the LLM to do at least what Google does against torrents: check that no parts of the source material are in the output.

[–] [email protected] 6 points 10 months ago

LLMs can do far more

What does this mean? I don't care what you claim your model "could" do, or what LLMs in general could do. What we've got are services trained on images that make images, services trained on code that write code, etc. If AI companies want me to judge the AI as if that is the product, then let them give us all equal and unrestricted access to it. Then maybe I would entertain the "transformative use" argument. But what we actually get are very narrow services, where the AI just happens to be a tool used in the backend and not part of the end product the user receives.

Can it write stories in the style of GRRM?

Talking about "style" is misleading because "style" cannot be copyrighted. It's probably impractical to even define "style" in a legal context. But an LLM doesn't copy styles, it copies patterns, whatever they happen to be. Some patterns are copyrightable, e.g. a character name and description. And it's not obvious what is ok to copy and what isn't. Is a character's action copyrightable? It depends: is the action opening a door or is it throwing a magical ring into a volcano? If you tell a human to do something in the style of GRRM, they would try to match the medieval fantasy setting and the mood, but they would know to make their own characters and story arcs. The LLM will parrot anything with no distinction.

Any writer claiming to be so unique that they aren’t borrowing from other writers is full of shit.

This is a false equivalence between how an LLM works and how a person works. The core ideas expressed here are that we should treat products and humans equivalently, and that how an LLM functions is basically how humans think. Both of these are objectively wrong.

For one, humans are living beings with feelings. The entire point of our legal system is to protect our rights. When we restrict human behavior it is justified because it protects others; at least that's the formal reasoning. We (mostly) judge people based on what they've done and not what we know they could do. This is not how we treat products and that makes sense. We regulate weapons because they could kill someone, but we only punish a person after they have committed a crime. Similarly a technology designed to copy can be regulated, whereas a person copying someone else's works could be (and often is) punished for it after it is proven that they did it. Even if you think that products and humans should be treated equally, it is a fact that our justice system doesn't work that way.

People also have many more functions and goals than an LLM. At this point it is important to remember that an LLM does literally one thing: for every word it writes it chooses the one that would "most likely" appear next based on its training data. I put "most likely" in quotes because it sounds like a form of prediction, but actually it is based on the occurrences of words in the training data only. It has nothing else to incorporate into its output, and it has no other need. It doesn't have ideas or a need to express them. An LLM can't build upon or meaningfully transform the works it copies; its only trick is mixing together enough data to make it hard for you to determine the sources. That can make it sometimes look original, but the math is clear: it is always trying to maximize the similarity to the training data, if you consider choosing the "most likely" word at every step to be a metric of similarity. Humans are generally not trying to maximize their works' similarity to other peoples' works. So when a creator is inspired by another creator's work, we don't automatically treat that as an infringement.
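
To make the "most likely next word" idea concrete, here is a toy sketch. It rests on heavy assumptions: it is a plain bigram counter over a made-up sentence, and the names `train_bigram` and `generate` are hypothetical. Real LLMs use neural networks over tokens with long contexts and often sample instead of always taking the top choice, so this only illustrates the greedy selection step described above, not how any actual product is built.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, start, max_words=10):
    """Greedy generation: always append the most frequent follower."""
    output = [start]
    for _ in range(max_words - 1):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last word never had a follower in the training text
        next_word, _count = followers.most_common(1)[0]
        output.append(next_word)
    return " ".join(output)


# Made-up training text, purely for illustration.
training_text = "the ring was thrown into the volcano and the ring was destroyed"
model = train_bigram(training_text)
print(generate(model, "the", max_words=6))
```

In this toy version every adjacent pair of words in the output already exists somewhere in the training text, which is the point being argued: the model has nothing else to draw on.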

But even though comparing human behavior to LLM behavior is wrong, I'll give you an example to consider. Imagine that you write a story "in the style of GRRM". GRRM reads this and thinks that some of the similarities are a violation of his copyright so he sues you. So far it hasn't been determined that you've done something wrong. But you go to court and say the following:

You pirated the entirety of GRRM's works.
You studied them only to gain the ability to replicate patterns in your own work. You have no other use for them, not even personal satisfaction gained from reading them.
You clarify that replicating the patterns is achieved by literally choosing your every word to be the one that you determined GRRM would most likely use next.
And just to be clear, you don't know who GRRM is or what he talks like. Your understanding of what word he would most likely use is based solely on the pirated works.
You had no original input of your own.

How do you think the courts would view any similarities between your works? You basically confessed that anything that looks like a copy is definitely a copy. Are these characters with similar names and descriptions to GRRM's characters just a coincidence? Of course not, you just explained that you chose those names specifically because they appear in GRRM's works.