  • @BetaDoggo_@lemmy.world
    English
    19
    4 months ago

    This article grossly overstates the findings of the paper. It’s true that bad generated data hurts model performance, but that’s true of bad human data as well. The paper used opt125M as its generator model, a very small research model with fairly low-quality and often incoherent outputs. The higher-quality generated data that makes up the majority of generated text online is far less of an issue. Using generated data to improve output consistency is common practice for both text and image models.

    • @yokonzo@lemmy.worldOP
      English
      8
      edit-2
      3 months ago

      Tbh I think you’re making a lot of assumptions and ignoring the point of the paper. The small model was used to quickly demonstrate generative degradation over iterations when the model was trained on its own output data. opt125M was chosen precisely because of its small size, so the phenomenon could be demonstrated in fewer iterations. The point still stands: this shows that data poisoning exists, and a much bigger model wouldn’t be immune to the effect, it would just take longer. I suspect that with companies continually scraping the web for training data - including Reddit, which this article mentions has struck a deal with Google to let its models train on Reddit content - the process won’t actually take that long, as more and more Reddit posts become AI-generated themselves.

      I think it’s a fallacy to assume that a giant model is therefore “higher quality” and resistant to data poisoning
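
      To illustrate the mechanism (a toy sketch of my own, not the paper’s actual setup - all the numbers here are made up for illustration): repeatedly fit a Gaussian to samples drawn from the previous generation’s fit. Each generation "trains" only on its predecessor’s output, the tails of the distribution vanish, and the fitted spread collapses toward zero - the same degradation in miniature.

```python
# Toy sketch of model collapse: generation N is a Gaussian fit to
# samples drawn from generation N-1's fit. Training only on your own
# output loses the distribution's tails, so the fitted spread shrinks
# toward zero over generations. (Illustrative only; the paper's actual
# setup fine-tuned the opt125M language model on its own text.)
import random
import statistics

random.seed(0)

mu, sigma = 0.0, 1.0   # generation 0: the "real" data distribution
n_samples = 10         # a small dataset exaggerates the effect
spreads = []

for generation in range(200):
    # "generate" a dataset from the current model...
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    # ...then "train" the next model by fitting it to that output
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    spreads.append(sigma)

print(f"fitted spread: gen 1 = {spreads[0]:.3f}, gen 200 = {spreads[-1]:.3f}")
```

      The same collapse still happens with a bigger sample per generation (a stand-in for a bigger model and dataset) - it just takes more generations, which is exactly the "it will take longer" point.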

      • @Grimy@lemmy.world
        English
        1
        3 months ago

        Is it being poisoned because the generated data is garbage or because the generated data is made by an AI?

        Using a small model lets this be shown faster, but it also means the outputs are seriously terrible. It’s common to fine-tune models on GPT-4 outputs, which directly goes against this.

        And there is a correlation between size and performance. It’s not a rule per se, and people are working hard on squeezing more and more out of small models, but it’s not a fallacy to assume bigger is better.

        • @Cyv_@lemmy.blahaj.zone
          English
          4
          edit-2
          3 months ago

          I think it’s also worth keeping in mind that some people use AI to generate “real-sounding” content for clicks or for scams, rather than making actually decent content. I’d argue shitty content will appear on a much worse scale as AI helps humans automate it. The other thing is, I worry AI can’t easily tell human- or AI-made bullshit from decent content. I may know the top 2 Google results are AI-generated clickbait, but whatever is scraping content en masse may not bother to differentiate. So it might become an exponential issue.

  • ferret
    English
    31
    4 months ago

    I am quite pleased the AI decided to take it to heart when I told it to kill itself

      • Lvxferre
        English
        9
        3 months ago

        Nah. It’s degrading the internet, for sure, but not killing it. We had a similar event in September 1993 - the Eternal September - and the internet survived fine.

          • Lvxferre
            English
            4
            3 months ago

            A lot of the newbies were simply clueless, not necessarily lacking intelligence. Still, they generated a sudden, huge influx of low-quality “content”, aka noise, lowering the ability of the [previous] userbase to find what it wanted, and that userbase got understandably pissed.

            And eventually this was solved - some platforms died, some became moribund, but the ones able to ride the new times thrived. More importantly, the internet as a whole found ways to contain and sort that noise.

            That’s a lot like what’s happening now, except that the agents are not a huge crowd of noobs - they’re a handful of shitty people using LLMs and Stable Diffusion to do so.

        • @spujb@lemmy.cafe
          English
          -1
          3 months ago

          “similar”

          lol. a massive growth in real, human users is not “similar” to a massive growth in fake, undependable data with zero to negative value.

  • @spujb@lemmy.cafe
    English
    -1
    edit-2
    3 months ago

    i miss when gpt was kept unpublished because it was “too dangerous”. i wish it could have been released in a more mature way.

    because we were right. we couldn’t be trusted, and we immediately ruined the biggest wonder of humanity by having it generate thousands to millions of articles for a quick buck. the toothpaste is out of the tube now and it can never go back in.

    • r3df0x ✡️✝☪️A
      English
      1
      edit-2
      3 months ago

      Someone would have made one eventually. Unless the government monitors every computer in existence, AI is inevitable.

      • @spujb@lemmy.cafe
        English
        3
        edit-2
        3 months ago

        it’s not the “making one” that’s a problem. it’s the making, optimizing and rabid marketing of one in the service of capital instead of humans.

        if only a bunch of open source, true non-profits released language models, the landscape might still suck but would be distinctly less toxic.

        and if the government (or even a decently sized ngo standards entity) had worked proactively with computer scientists on solutions like watermarking, labor-replacement protections, and copyright protections, things might arguably be fine. not one of those things happened, and so further into the hellscape we descend.

      • JackGreenEarth
        English
        4
        3 months ago

        And just to make it clear, we should not give the government the ability to monitor every computer in existence, or even any computer not owned by them.

        • @spujb@lemmy.cafe
          English
          1
          edit-2
          3 months ago

          also, there are absolutely other ways to regulate technology, especially since it’s a tech that’s being bought and sold.

          “monitor every computer” is emphatically not the only solution, and it’s weird that they suggested it lol