Technology

37737 readers

424 users here now

A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.

Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.

Subcommunities on Beehaw:

This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.

founded 2 years ago

MODERATORS

[email protected]

219

Sarah Silverman and other authors are suing OpenAI and Meta for copyright infringement, alleging that they're training their LLMs on books via Library Genesis and Z-Library (www.thedailybeast.com)

submitted 1 year ago* (last edited 1 year ago) by [email protected] to c/[email protected]

132 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 11 points 1 year ago (1 children)

It’s more like reading a book and then charging people to ask you questions about it.

No, it's really nothing like reading at all. Your example requires a human element. This is just the consumption of data, not reading.

[–] [email protected] 7 points 1 year ago (1 children)

Humans are the ones making these models. It's not entirely the same thing, but you should read this article by the EFF.

[–] [email protected] 10 points 1 year ago (1 children)

I don't think that it is even remotely close to being the same thing. I'm sorry but we shouldn't be affording companies the ability to profit off other people's creations without their consent, regardless of how current copyright law works.

Acting as though a human writing a summary is the same thing as a vast network of computers processing data at a speed that is hundreds if not thousands times faster than a human is foolish. Perhaps it is also foolish to try and apply our current copyright laws (which already favour large corporations and not individual creators) to this slew of new technology, but just ignoring the fundamental difference between the two is no way of going about it. We need copyright reform, we need protections for creators, and we need to stop acting as though machine learning algorithms are remotely comparable to humans both in their capabilities, responsibilities and rights.

There is a perfectly reasonable way of doing this ethically, and that is using content that people have provided to the model of their own volition with their consent either volunteered or paid for, but not scraped from an epub, regardless of if you bought it or downloaded it from libgen.

There are already companies training machine learning models ethically in this manner, and if creators do not want their content used as training data, it should not be.

[–] [email protected] 2 points 1 year ago

Human writing and LLM output can be creative, original, informative, or useful, depending on the context and purpose. It is a tool to be used by humans, we are in control of the input and the output. What we say goes, no one ever has to see LLM output without people making those decisions. Restricting LLMs is restricting the people that use them. Mega-corporations will have their own models, no matter the price. What we say and do here will only affect our ability to catch up and stay competitive.

You also seem to be arguing a slippery slope argument, by implying that if LLMs are allowed to use copyrighted books as data, it will lead to negative consequences for creators and society, without explaining how or why this will happen, or providing any evidence. It's a one-sided look at the issue that ignores the positive outcomes from LLMs, like increasing accessibility, diversity, and quality of literature and thought. As well as inspiring new forms of expression and creativity.

Finally, you seem to be making a moralistic fallacy. You claim that there is a perfectly reasonable way of doing this ethically, by using content that people have provided. However, you don’t support this claim, or address its challenges. How would you ensure that the content providers are the original authors or have the rights to the content? How would you compensate them for their contribution? Is this a good way to get content that is diverse and representative of different perspectives and cultures? What about bias or manipulation in the data collection and processing?

I don't think we need any more expansions to copyright, but a better understanding of LLMs’ capabilities and responsibilities. I think we need to be open-minded and critical about the potential and challenges of LLMs, but also be on guard against fallacious arguments or emotional appeals.