this post was submitted on 22 Dec 2024
1589 points (97.5% liked)

Technology

It's all made from our data, anyway, so it should be ours to use as we want

[–] [email protected] 59 points 1 day ago (1 children)

I don't think it should be a "punishment." It should be done on principal.

[–] [email protected] 4 points 1 day ago* (last edited 1 day ago) (2 children)

Not sure making their LLMs public domain would really hurt their principal, their secret sauce is in the code around the model.

And yes, I do recognize that you meant "principle".

[–] [email protected] 2 points 18 hours ago (2 children)

That's not true though. The models themselves are hella intensive to train. We already have open source programs to run LLMs at home, but they are limited to smaller open-weights models. Having a full ChatGPT model that can be run by any service provider or home server enthusiast would be a boon. It would certainly make my research more effective.

[–] [email protected] 3 points 15 hours ago

This is a terrible idea. Very easy to circumvent, doesn't actually help the training sources.

[–] [email protected] 20 points 23 hours ago

I'd rather they were destroyed, but practically speaking that's impossible, and this sounds like the next best idea to me.

[–] [email protected] 8 points 20 hours ago* (last edited 20 hours ago) (1 children)

Calling something illegal in spite of or in absence of precedent is a time-honored tactic - though not a particularly persuasive one.

[–] [email protected] 2 points 18 hours ago

AI is just a plagiarism machine with thousands of copyrighted materials that "trained" it, which they paid nothing for.

[–] [email protected] 12 points 1 day ago (4 children)

I want to have a personal llm that learns all my interests from my files and websites visited. I just want to ask it stuff that I don't have to remember.

[–] [email protected] 5 points 22 hours ago

I'm working on something along these lines for myself. I think of it as using AI as a filter to create a bubble of good Internet around me.

[–] [email protected] 3 points 20 hours ago (1 children)

So basically Microsoft's Recall if it was actually good. I've wanted that for a long time https://lemmy.dbzer0.com/comment/12921637

[–] [email protected] 3 points 20 hours ago

Possibly but just not Microsoft anything ever.

[–] [email protected] 4 points 1 day ago

I think that'd be ok, even with this proposal, as long as you don't sell that LLM for public use. It's fine if I draw a picture of Mickey Mouse in my notebook, but if I try to sell that picture I could get in legal trouble.

[–] [email protected] 3 points 1 day ago (1 children)

Isn't that similar to what Recall is?

[–] [email protected] 17 points 1 day ago (1 children)

Yes, except without Microsoft spying on you

[–] [email protected] 2 points 1 day ago

Exactly. I don't want a service, I don't want to pay for a service, I don't want to send my files for free to get stuck for later ransom like Google did with email. I just want to purchase a product called a computer and load up a program in it that runs locally and gives me access to my data.

[–] [email protected] 34 points 1 day ago* (last edited 1 day ago) (6 children)

intellectual property doesn't really exist in most of the world. they don't give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore...

it's arbitrary law that is designed to protect corporations and it's generally unenforceable.

[–] [email protected] 12 points 1 day ago

it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

It's arbitrary, and although it was designed to protect individuals, it has been morphed to protect corporations. If we reset the law back to the original Copyright Act of 1790 w/ a 14-year duration, it would go a long way toward removing power from corporations. I think we should take it a step further and perhaps make it 10 years, with an optional extension for another 10 years if you can show need (e.g. you're an indie dev and your game is finally making a splash after 8 years).

[–] [email protected] 7 points 1 day ago* (last edited 1 day ago)

So true. IP only helps the corps and slows tech development. Contracts, NDAs, and trade secrets are all you really need to keep your ideas safe. If you want your country to develop fast, get rid of any IP laws.

[–] [email protected] 7 points 1 day ago (2 children)

they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

Unless it's their intellectual property, whereupon it's suddenly a whole different story. I'm sure you knew that.

[–] [email protected] 11 points 1 day ago

They don't mean your data, silly. They don't give a fuck about that.

They mean other huge corporations' data.

[–] [email protected] 19 points 1 day ago

I used Whisper to create subs for a video, and in a section with relaxing instrumental music it filled the subtitles, on repeat, with:

La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un'ipnosi non verbale (roughly: "Dr. Paret's school is a non-verbal hypnosis technology that is used for the results of a non-verbal hypnosis")

Clearly stolen from this Dr. Paret's YouTube channel, where he's selling hypnosis lessons in Italian. Probably in one or more videos he had subs stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.
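
For anyone hitting the same looping-subtitle problem, here is a minimal sketch of one way to clean it up after the fact. It assumes the transcript is already available as a list of dicts with a `"text"` key (the shape I believe Whisper's segment output has, but treat that as an assumption). The function name `collapse_repeats` and the `max_repeats` threshold are made up for illustration. It only hides the symptom by dropping long runs of identical consecutive lines; it does not stop the model from producing them.

```python
from itertools import groupby


def collapse_repeats(segments, max_repeats=2):
    """Drop runs of identical consecutive subtitle lines longer than max_repeats.

    `segments` is assumed to be a list of dicts with a "text" key
    (hypothetical input shape, not taken from the thread).
    """
    kept = []
    # Group consecutive segments whose normalized text is identical.
    for _, run in groupby(segments, key=lambda s: s["text"].strip().lower()):
        run = list(run)
        if len(run) > max_repeats:
            # A long run of the same line is treated as a hallucinated loop.
            continue
        kept.extend(run)
    return kept
```

Legitimately repeated lines (a chorus, for example) would also be dropped once they exceed the threshold, so `max_repeats` needs tuning per video.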

[–] [email protected] 33 points 1 day ago (26 children)

Although I'm a firm believer that most AI models should be public domain or open source by default, the premise of "illegally trained LLMs" is flawed, because there really is no assurance that LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deeply rooted freedom that provides a lot of positives to the world.

The idea of... well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology has more than enough good uses that banning it would simply cause it to flourish wherever it is not banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn't tip further in their favor, to the point that AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

[–] [email protected] 9 points 1 day ago* (last edited 1 day ago) (1 children)

As per torrentfreak

OpenAI hasn’t disclosed the datasets that ChatGPT is trained on, but in an older paper two databases are referenced; “Books1” and “Books2”. The first one contains roughly 63,000 titles and the latter around 294,000 titles.

These numbers are meaningless in isolation. However, the authors note that OpenAI must have used pirated resources, as legitimate databases with that many books don’t exist.

Should be easy to defend against, outright trivial: OpenAI, just tell us what those Books1 and Books2 databases are. Tell us where you got them from and which licensing contracts with publishers you signed to get access to such a gigantic library. No need to divulge details, just give us information that makes it believable that you licensed them.

...crickets. They pirated the lot of it; otherwise they would already have gotten that case thrown out. It's US startup culture, plain and simple: "move fast and break laws", get lots of money, and use that money to pay the best lawyers to abuse the shit out of the US court system.

[–] [email protected] 3 points 1 day ago

For OpenAI, I really wouldn't be surprised if that happened to be the case, considering they still call themselves "OpenAI" despite offering the most censored and closed-source AI models on the market.

But my comment was more aimed at AI models in general. If you are assuming they indeed used non-publicly posted or gathered material, and did so directly themselves, they would indeed not have a defense to that. Unfortunately, if a third party provided them the data, and did so under false pretenses, it would likely let them legally off the hook even if they had every ethical obligation to make sure it was publicly available. The third party that provided it to them would be the one infringing.

If that assumption turns out to be true (maybe through some kind of discovery in the trial), they should burn for that. Until then, even if it's a justified assumption, it's still an assumption, and most likely not true for most models, certainly not those trained recently.

[–] [email protected] 55 points 1 day ago (1 children)

It's not punishment. LLMs do not belong to them, they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI's private property

[–] [email protected] 87 points 1 day ago (7 children)

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

[–] [email protected] 3 points 21 hours ago (1 children)

Just FYI on the bank bailouts in the US: the banks paid the bailout back to the government, plus interest, meaning the govt actually made a profit off the bailout. There’s a lot of things wrong with both banks and the govt, but generally this is not one of them. https://www.propublica.org/article/the-bailout-was-11-years-ago-were-still-tracking-every-penny

[–] [email protected] 1 points 21 hours ago

Super interesting, learned something new today. Thanks!

[–] [email protected] 60 points 1 day ago (3 children)

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank's assets, and now effectively owns the bank.

[–] [email protected] 130 points 2 days ago* (last edited 1 day ago) (41 children)

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

[–] [email protected] 61 points 1 day ago (5 children)

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.
