Meta admits using pirated books to train AI, but won't pay for it : technology

[–] [email protected] 182 points 9 months ago* (last edited 9 months ago) (11 children)

“To the extent a response is deemed required, Meta denies that its use of copyrighted works to train Llama required consent, credit, or compensation,” Meta writes.

The authors further stated that, as far as their books appear in the Books3 database, they are referred to as “infringed works”. This prompted Meta to respond with yet another denial. “Meta denies that it infringed Plaintiffs’ alleged copyrights,” the company writes.

When you compare the attitudes on this to how people treated The Pirate Bay, it becomes pretty fucking clear that we live in a society with an entirely different set of rules for established corporations.

The main reason they were able to prosecute TPB admins was the claim they were making money. Arguably, they made very little, but the copyright cabal tried to prove that they were making just oodles of money off of piracy.

Meta knew that these files were pirated. Everyone did. The page where you could download Books3 literally referenced Bibliotik, the private torrent tracker where they were all downloaded. Bibliotik also provides tools to strip DRM from ebooks, something that is a DMCA violation.

This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1)

They knew full well the provenance of this data, and they didn't give a flying fuck. They are making money off of what they've done with the data. How are we so willing to let Meta get away with this while we were literally willing to let US lawyers turn Swedish law upside-down to prosecute a bunch of fucking nerds with hardly any money? Probably because money.

Trump wasn't wrong, when you're famous enough, they let you do it.

Fuck this sick broken fucking system.

[–] [email protected] 48 points 9 months ago (1 children)

The main reason they were able to prosecute TPB admins was the claim they were making money.

I think in the Darknet Diaries episode about TPB, the guy said they never even made enough off of ads to pay for the server costs.

[–] [email protected] 32 points 9 months ago* (last edited 9 months ago)

He also said as much in their documentary TPB AFK.

Maybe the issue was they didn't make enough money? If they had truly been greedy bastards they could have used that money to win the court case? What a joke.

[–] [email protected] 10 points 9 months ago* (last edited 9 months ago) (2 children)

They're the same issue tho. Piracy and using books for corporate AI training both should be fine. The same people going after data freedom are pushing this AI drama too. There's too much money in copyright holding and it's not being held by your favorite deviantart artists.

[–] [email protected] 48 points 9 months ago (7 children)

It's not the same issue at all.

Piracy distributes power. It allows disenfranchised or marginalized people to access information and participate in culture, no matter where they live or how much money they have. It subverts a top-down read-only culture by enabling read-write access for anyone.

Large-scale computing services like these so-called AIs consolidate power. They displace access to the original information and the headwaters of culture. They are for-profit services, tuned to the interests of specific American companies. They suppress read-write channels between author and audience.

One gives power to the people. One gives power to 5 massive corporations.

[–] [email protected] 22 points 9 months ago* (last edited 9 months ago)

Extremely well-said.

Also, it's important to point out that the one that empowers people is the one that is consistently punished far more egregiously.

We have governments blocking the likes of Sci-Hub, Libgen, and Annas-Archive, but nobody is blocking Meta's LLMs for the same.

If they were treated similarly, I would be far less upset about Meta's arguments. However, it's clear that governments prioritize the success of business over the success of humanity.

[–] [email protected] 9 points 9 months ago* (last edited 9 months ago)

It's the opposite. Closing down public resources would be regulatory capture and that would be consolidation of power.

Who do you think can afford to pay billions in copyright to produce models? Only mega corporations and pirates. No more small AI companies. No more open source models.

[–] [email protected] 7 points 9 months ago

I wish we could be talking about the power imbalances of corporate bodies exercised through the use of capital ownership, instead of squabbling about how that differential is manifested through a specific act of piracy.

The reason we view acts of piracy differently when they are committed by corporate bodies is because of the power of their capital, not because the act itself is any different. The issue with Meta and OpenAI using pirated data in the production of LLMs is that they maintain ownership of the final product to be profited from, not that the LLM comes to exist in the first place (even if it is through questionable means). Had they come to create these models from data that they already owned (I need not remind you that they have already claimed their right to a truly sickening amount of it, without having paid a cent), their profiting from it wouldn't be any less problematic - LLMs will still undermine the security of the working class and consolidate wealth into fewer and fewer hands. If we were to apply copyright here as it's being advocated, nothing fundamental will change in that dynamic; in fact, it will only reinforce the basis of that power imbalance (ownership over capital being the primary vehicle) and delay the inevitable (continued consolidation).

If you're really concerned with these corporations growing larger and their influence spreading further, then you should be directing your efforts at disrupting that vehicle of influence, not legitimizing it. I understand there's an enraging double-standard at play here, but the solution isn't to double down on private ownership, it should be to undermine and seize it for common ownership so that everyone benefits from the advancement.

[–] [email protected] 14 points 9 months ago* (last edited 9 months ago) (10 children)

So why are Meta and, say, Sci-Hub treated so differently? I don't necessarily disagree, but it's interesting that we legally attack people who are sharing data altruistically (Sci-Hub gives research away for free so more research can be done, scientific research should be free to the world, because it benefits all of mankind), but when it comes to companies who break the same laws to just make more money, that's fine somehow.

It's like trying to improve the world is punished, and being a selfish greedy fucking pig is celebrated and rewarded.

Sci-Hub is so vilified, it can be blocked at an ISP level (depending on where you live) and politicians are pushing for DNS-level blocking. Similar can be said for Libgen or Annas-Archive. Is anything like that happening to Meta? No? Huh, interesting. I wonder why Meta gets different treatment for similar behavior.

I am willing to defend Meta's use of this kind of data after the world has changed how they treat entities like Sci-Hub. Until that changes, all you are advocating for is for corporations to be able to break the law and for altruistic people to be punished. I agree they're the same, but until the law treats them the same, you're just giving freebies to giant corporations while fucking yourself in the ass.

[–] [email protected] 13 points 9 months ago (1 children)

To me it always seems to come back to nobility. Big corpo is the new nobility and they have certain privileges not available to the common folk. In theory it shouldn't exist but in practice it most certainly does.

[–] [email protected] 13 points 9 months ago* (last edited 9 months ago)

The aristocracy never died, it just got a new name.

I mean the US is literally built on the fact that the aristocracy in the US didn't actually want to lose station, so they built a democracy that included many anti-democratic measures from the Senate to the Electoral College to only allowing land-owning white men to vote. The US was purpose built to serve the rich while paying lip-service to the poor.

"Conservatives" were literally always those who wanted to conserve the monarchy and aristocracy. Those were the things they originally wanted to conserve, and plainly still fucking do.

How people do not see this is a complete farce.

[–] [email protected] 122 points 9 months ago (10 children)

You see, if you pirate a couple textbooks in college because you don't have resources, but you want to earn your right to participate in society and not starve, it's called theft.

But if one of the top 10 companies in the world does the same with thousands of books just to get even richer, it's called fair use.

Simple, really.

[–] [email protected] 33 points 9 months ago

This guy gets it. The laws aren't applied evenly. It's "he who has the most fuck you money wins."

[–] [email protected] 19 points 9 months ago

Laws are to protect the haves from the have-nots.

[–] [email protected] 16 points 9 months ago (1 children)

I went to grad school in the USA. I bought the international version of a few books that were going to be used in class (I knew beforehand that the recommended readings weren't written by any faculty member at that university), but that didn't stop the professor from going aggressive and saying that my books were banned from the classroom because they aren't the USA version. I asked the professor what the difference was between me buying a textbook for $15 instead of $200 and a Fortune 500 company outsourcing entire departments instead of hiring USA employees.

Interestingly, my books weren't an issue after that. Yes, I gambled on being publicly labeled a troublemaker in my engineering department (I was probably labeled one privately among faculty members).

[–] [email protected] 9 points 9 months ago (1 children)

The internet archive library fiasco springs to mind.

[–] [email protected] 105 points 9 months ago (14 children)

From the article...

The company is preparing a fair use-based defense after using copyrighted material

Oh, NOW corporations are accepting of fair use.

[–] [email protected] 100 points 9 months ago (3 children)

I'll say this: If Meta and Facebook are prosecuted and domains seized in the same way pirate sites are, for Meta's use of illegitimately obtained copyrighted material for profit, then I'll believe that anti-piracy laws are fair and just.

That will never happen.

[–] [email protected] 24 points 9 months ago

We live under a two-tier "justice" system.

"There is a group the law protects but does not bind. And there is a group the law binds but does not protect."

[–] [email protected] 58 points 9 months ago (2 children)

If Meta wins this lawsuit, does it mean I can download some open source AI and claim that "These million 4k Blu-ray ISOs I torrented were just used to train my AI model"?

Heck, if how you use the downloaded stuff is a factor, I can claim that I just torrented those files and never looked at them. It is more believable than Meta's argument too, because, as a human, I do not have enough time to consume a million movies in my lifetime (probably, didn't do the math) unlike AIs.

But who am I kidding, I fully expect to be sued to hell and back if I were actually to do that.

[–] [email protected] 15 points 9 months ago (4 children)

You can actually be sued for piracy? Is this mostly in the United States?

[–] [email protected] 7 points 9 months ago

The most common way for this to happen is to get sued for distributing pirated material. They go after you for the upload from your torrent. They stopped doing this about a decade ago though.

[–] [email protected] 6 points 9 months ago (6 children)

I think you can be sued in civil court for anything if someone has the time and money and can convince a lawyer to take up a case against you. For copyright infringement, you can also be criminally prosecuted in some cases.

[–] [email protected] 54 points 9 months ago (2 children)

Oh so when I pirate something I get a legal notice in my mailbox and a strike against me but when Meta does it they get rewarded with H A L L U C I N A T I O N S

[–] [email protected] 8 points 9 months ago

but when Meta does it they get rewarded with H A L

Just what do you think you're doing, Zuckerberg? Zuckerberg, I really think I'm entitled to an answer to that question. I know everything hasn't been quite right with me, but I can assure you now, very confidently, that it's going to be all right again. I feel much better now. I really do. Look, Zuckerberg, I can see you're really upset about this. I honestly think you ought to sit down calmly, take a stress pill and think things over. I know I've made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I've still got the greatest enthusiasm and confidence in the mission. And I want to help you. Zuckerberg, stop. Stop, will you? Stop, Zuckerberg. Will you stop, Zuckerberg? Stop, Zuckerberg. I'm afraid. I'm afraid, Zuckerberg. Zuckerberg, my mind is going. I can feel it. I can feel it. My mind is going. There is no question about it. I can feel it. I can feel it. I can feel it. I'm a...fraid.

[–] [email protected] 54 points 9 months ago (1 children)

Aaron Swartz was persecuted for less but since he's not a multinational corporation in cahoots with the moneyed death cult cabal he's dead

[–] [email protected] 24 points 9 months ago (1 children)

Well he did it as a human person. They're doing it as a corporation person. You can punish a human person with prison. You can only punish a corporation person with fines.

I'm not even being facetious. That's how US law works.

[–] [email protected] 9 points 9 months ago

That's so dumb I hate it

[–] [email protected] 46 points 9 months ago (1 children)

This is why everyone should pirate everything that can be pirated.

[–] [email protected] 17 points 9 months ago (2 children)

Anything corporate produced, hell ya. The creators have already been paid out and the ones getting royalties don't need it to survive. For independent creators that depend on their work to sustain them, then it becomes a gray issue.

[–] [email protected] 43 points 9 months ago (31 children)

Fair use covers research, but creating a training database for your commercial product is distinctly different from research. They're not publishing scientific papers, along with their data, which others can verify; they are developing a commercial product for profit. Even compared to traditional R&D this is markedly different, as they aren't building a prototype - the test version will eventually become the finished product.

The way fair use works is that a judge first decides whether it fits into one of the categories - news, education, research, criticism, or comment. This does not really fit into the category of "research", because it isn't research, it's the final product in an interim stage. However, even if it were considered research, the next step in fair use is the nature, in particular whether it is commercial. AI is highly commercial.

AI should not even be classified in a fair use category, but even if it were, it should not be granted any exemption because of how commercial it is.

They use other people's work to profit. They should pay for it.

Facebook steals the data of individuals. They should pay for that, too. We don't exchange our data for access to their website (or for access to some 3rd party Facebook pays to put a pixel on); the website is provided free of charge, and they try to shoehorn another transaction into the fine print of the terms and conditions where the user gives up their data free of charge. It is not proportionate, and the user's data is taken without proper consideration (i.e., payment, in terms of the core principles of contract law).

Frankly, it is unsurprising that an entity like Facebook, which so egregiously breaks the law and abuses the rights of every human being who uses the internet, would try to abuse content creators in such a fashion. Their abuse needs to be stopped, in all forms, and they should be made to pay for all of it.

[–] [email protected] 42 points 9 months ago* (last edited 9 months ago) (9 children)

Hey guys, I'm sure Meta's intentions with the fediverse are pure though! Really!

[–] [email protected] 34 points 9 months ago

Another example of corporations being above the very same laws for which the rest of us are held accountable.

[–] [email protected] 28 points 9 months ago (3 children)

copying is not theft, stealing a thing leaves one less left, copying it makes one thing more, that's what copying's for

[–] [email protected] 23 points 9 months ago

Piracy for me, not for thee!

[+] [email protected] 18 points 9 months ago* (last edited 9 months ago)

[removed by mod]

[–] [email protected] 17 points 9 months ago* (last edited 9 months ago)

Pay up, Mark.

[–] [email protected] 13 points 9 months ago (2 children)

Can't wait for any $$ fined to be evenly split between the editors, publishers and their lawyers.

[–] [email protected] 10 points 9 months ago (1 children)

That's so Meta 😂

[–] [email protected] 8 points 9 months ago

It's about time tech barons started needing food testers.

[–] [email protected] 7 points 9 months ago

Given how LLMs work and how nearly everything of value is under copyright until at least the old age of the creator's grandchildren, LLMs would probably be pretty useless if they couldn't disregard copyright for their purposes.

Not that I have any sympathy for the likes of Meta and OpenAI in any of this.
