Wikipedia is one of the last genuine places on the Internet, and these rat bastards are trying to contaminate that, too - AOS for Lemmy.World - A generic Lemmy server for everyone to use.


4mon 28d ago by lemmy.world/u/destructdisc in fuck_ai

Seeing as OpenAI struggled to make its AI avoid the em dash and still hasn't entirely managed to do it, I'm not too worried.

TBF OpenAI are a bunch of idiots running the world's largest ponzi scheme. If DeepMind tried it and failed then...

Well I still wouldn't be surprised, but at least it would be worth citing.

I think the inherent issue is that current "AI" is non-deterministic, so it's impossible to fix these issues totally. You can feed an AI all the data on how to not sound like AI, but you need massive amounts of non-AI writing to reinforce that. With AI being so prevalent nowadays, you can't guarantee a dataset is AI-free, so you get the old "garbage in, garbage out" problem that AI companies cannot solve. I still think generative AI has its place as a tool (I use it for quick and dirty text manipulation), but it's being applied to every problem we have like it's a magic silver bullet. I'm ranting at this point and I'm going to stop here.

I honestly disagree that it has any use. Being a statistical model with high variance makes it a liability: no matter which task you use it for, it will produce worse results than a human being and will create new problems that didn't exist before.

The high variance is why I only use it for dead simple tasks, e.g. "create an array of US state abbreviations in JavaScript"; otherwise I'm in full agreement with you. If you can't verify the output is correct, then it's useless.

Why would you have this use for multi-billion dollar earth scorching torment nexus?

Wouldn’t that be slower to do, simply because checking it got all states, didn’t repeat any and didn’t make up any would be slower than copying a list from the web and quickly turning that into an array by hand with multiline cursors?
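For what the check described above might look like, here is a minimal Python sketch. It assumes you already have a trusted reference list (which is arguably the point being made: if you have one, you may not need the generated one). The function name `check_state_list` is made up for illustration.

```python
from collections import Counter

# Reference set: the 50 US state postal abbreviations (no DC or territories).
US_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)


def check_state_list(candidate):
    """Report missing, duplicated, and made-up entries in a candidate list."""
    counts = Counter(candidate)
    seen = set(counts)
    return {
        "missing": sorted(US_STATES - seen),
        "duplicated": sorted(s for s, n in counts.items() if n > 1),
        "made_up": sorted(seen - US_STATES),
    }


if __name__ == "__main__":
    # A deliberately bad candidate: one repeat and one invented code.
    report = check_state_list(["CA", "CA", "NY", "ZZ"])
    print(report["duplicated"], report["made_up"], len(report["missing"]))
```

A list passes only when all three entries in the report are empty.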

That’s like one web search and then one shell command. You can probably just copy paste a column of a table from wikipedia and then run a simple search/replace in your text editor. Why are you feeding the orphan crushing machine for this?

Because it's 0.01% easier to do this.

Also many people laugh at you if you try to say how ai is destroying the environment for no reason. Doesn't affect them, you go live in a cave you luddite!

If you're running it locally you can set how much variance it has. However, I mostly agree, in that it creates a bunch of trash. This doesn't mean it has no use though. It's like the monkeys on a typewriter thought experiment, but the monkey's output is fairly constrained so it takes much fewer attempts to create what you want. It depends on the complexity of the solution required whether it'll come up with a good solution in a reasonable amount of tries. If it's a novel solution, it probably never will, because it's constrained to solutions it's seen before.

I use it to put together study guides so that, instead of spending a bunch of time typing and formatting, I'm spending it studying. It's fed directly from my notes and slides and it rarely gets anything wrong (I read through it twice and cross-reference with my notes). If anything, I'm usually removing stuff for being unnecessary or rewording things here and there to be better suited to me. What took several hours now takes 30-45 minutes.

Don't take this as a defense of AI, it definitely isn't. If AI disappeared tomorrow the world would be better off. Formatting study guides is literally the only utility I've found in LLMs.

Any amount of using AI for learning is learning from Slop which will make you less informed.

Tell that to my 3.9 GPA lol. I'm not learning from slop, I'm using a program to format my notes and slides from my lectures, and then verifying that information before committing it to memory

Man have I got a solution for you, check out this formatted method to accomplish the same task:

I’m ~~using a program to format my notes and slides from my lectures, and then~~ verifying that information before committing it to memory

My deepest apologies, sir. You're clearly a man of deep intellect and wisdom beyond your years! You obviously know how I should do things and I'd be a buffoon not to follow your sagely demands. How silly was I, a lowly working full time student, to doubt you! I'll get to grinding out those study materials by hand in what precious little free time I have immediately! Executive dysfunction, ADHD, and depression be damned! My limited study time, already hampered by pre-existing conditions, would obviously be better spent fiddling with margins and bullet points on a word document and flashcard sets! The smug satisfaction and sense of superiority, knowing I painstakingly typed out every letter on those documents will surely yield me the straight A's I so desperately desire, but have yet failed to realize! Thank you, thank you, thank you FiniteBanjo, I don't know where I'd be without you!!!

AI is useful for sorting datasets and pulling relevant info in some cases, e.g. ProPublica has used it for articles.

Obviously that was simple sorting for them. Case law is too complicated for that kind of data sifting; it was trained on Reddit, after all.

And when, not if but when, it makes a mistake by pulling hallucinated info or data then it's going to be your fault, that's why it's a liability.

The simple stuff it can do. I'm trying to remember how ProPublica used it, but it was just like sifting through a database and pulling out all mentions of a word.

When you get into giving case law, it's way too complicated for it and it hallucinates.

sifting through a database and pulling out all mentions of a word.

You mean keyword search that has existed since the beginning of time?

Idk there are legitimate uses of it sorting through large data sets that keyword searches do not fulfill.

You're describing RAG, the others are describing LLMs.

I think the best use is "making filler" so like in a game, having some deep background shit that no one looks at, or making a fake advertisement in a cyberpunk type game. Something to fill the world out that reduces the work of real artists if they choose to

If you can't be bothered to write filler then it's an insult for you to expect others to read it. You're just wasting people's time.

I guess the point is for people to not read the filler.

I think of the text that's too small to read on a computer in the background. It's nice that it's slightly more real looking than a copy/paste screen.

Not even close to worth destroying the environment over, but it's a neat use case to me

I think of the text that’s too small to read on a computer in the background.

Lorem ipsum has been used in typesetting since the 60s. If it’s not meant to be read, it doesn’t matter if it’s lorem ipsum text.

Not trying to dogpile you, I just think even things that seem ‘useful’ for LLMs almost always have preexisting solutions that are decades old.

Fair enough, I'm trying pretty hard to devil's advocate the "it has zero use" commentary. I hate the AI hype and LLMs getting shoved down our throats. I try a little to imagine a world where it's somewhat helpful, cuz that type of tech would've been cool if it wasn't a dystopian nightmare socially.

I mean long and short of it is Fuck AI and especially the people pushing it

Datasets are not the only mechanism to train AI. You can also use reinforcement learning. This requires you to have a good fitness function. In some domains, that is not a problem. For LLMs, however, we do not have such a function. Instead, we can use a hybrid approach, where we train a model on a dataset and also optimize for fitness functions that address part of what we want (e.g. avoiding em dashes). In practice, this tends to be tricky, as ML tends to be a bit too good at optimizing for fitness functions, and will often do it in ways you don't want. This is why, if you want to develop a real AI product, you actually need AI engineers who know what they are doing, not prompt engineers who will try to find the magic incantation that makes someone else's AI do what they want.
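To make the "optimizes in ways you don't want" point concrete, here is a toy Python sketch. It is not how any real training pipeline works; the `fitness` function and the candidate strings are invented for illustration. The partial fitness function only counts em dashes, so it cannot tell a clean sentence from a lazy dash swap or from an empty reply.

```python
EM_DASH = "\u2014"  # written as an escape so the character is easy to spot


def fitness(text: str) -> int:
    """Partial fitness: 0 is the best score, each em dash costs one point."""
    return -text.count(EM_DASH)


def top_scorers(candidates: list[str]) -> list[str]:
    """Return every candidate that ties for the best fitness."""
    best = max(fitness(c) for c in candidates)
    return [c for c in candidates if fitness(c) == best]


# Hypothetical model outputs for the same request.
CANDIDATES = [
    f"Wikipedia is useful {EM_DASH} and free {EM_DASH} for everyone.",
    "Wikipedia is useful and free for everyone.",
    "Wikipedia is useful -- and free -- for everyone.",  # dash swapped, style unchanged
    "",  # says nothing at all, but contains no em dash
]

if __name__ == "__main__":
    for text in top_scorers(CANDIDATES):
        print(repr(text))
```

All three dash-free candidates tie for the top score, including the empty one, so an optimizer chasing this score alone has no reason to prefer the sentence you actually wanted.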

FWIW, LLMs are deterministic. Usually the commercial front-ends don't let you set the seed, but behind the scenes the only reason the output changes each time is that the seed changes. If you set a fixed seed, input X always leads to output Y.

From the user perspective: no? I think they called that "temperature" and even setting that to 0 didn't make the result the same the next day after cache cleared.
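The seed and temperature being discussed here are two different knobs. The Python sketch below shows how they interact in a toy sampler. It is only an illustration: `LOGITS` stands in for a model's scores over three possible next tokens, and nothing here reflects any vendor's actual API. Hosted services may also vary for reasons outside this sketch.

```python
import math
import random

# Hypothetical scores for three candidate next tokens.
LOGITS = [2.0, 1.0, 0.5]


def sample_next(logits, temperature, rng):
    """Pick an index from logits; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits, temperature, seed, steps=10):
    """Draw `steps` samples using a random generator fixed by `seed`."""
    rng = random.Random(seed)
    return [sample_next(logits, temperature, rng) for _ in range(steps)]


if __name__ == "__main__":
    # Same seed, same temperature: identical draws on every run.
    print(generate(LOGITS, 1.0, seed=42) == generate(LOGITS, 1.0, seed=42))
    # Temperature 0: the seed no longer matters, the top score always wins.
    print(generate(LOGITS, 0, seed=1) == generate(LOGITS, 0, seed=2))
```

In this toy, a fixed seed makes sampling repeatable at any temperature, while temperature 0 removes the randomness entirely.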

We should crowdsource a program to sniff out ai data crawlers, then poison the data they harvest without them knowing, for companies to employ.

i'm fine with LLM contributions to wikipedia as long as they have references and are human validated

it's actually something that LLMs can potentially do quite well

You have to understand that their public facing product is not the same as the one they allow enterprise or state actors to use.

They benefit from the public thinking they have these stupid limitations; it gives them more space to curate their product offerings where the real money is made.

I don't understand how the public thinking these are bad products is an incentive for especially state actors to use them. That seems counterintuitive.

You do understand this is more akin to white hat testing, right?

Those who want to exploit this will do it anyway, except they won't publish the result. By making the exploit public, the risk will be known if not mitigated.

I'm admittedly not knowledgeable in White Hat Hacking, but are you supposed to publicize the vulnerability, release a shortcut to exploit it telling people to 'enjoy', or even call the vulnerability handy ?

Responsible disclosure is what a white hat does. You report the bug to whomever is the party responsible for patching and give them time to fix it.

That sort of depends on the situation. Responsible disclosure is for if there is some relevant security hole that is an actual risk to businesses and people, while this here is just "haha look LLMs can now better pretend to write good text if you tell it to". That's not really responsible disclosurable. It's not even specific to one singular product.

Considering the "vulnerability" here is on the level of "don't use password as your password" - yeah, releasing it all is exactly the right step.

Wikipedia is one of the last genuine places on the Internet, and these rat bastards are trying to contaminate that, too

Wikipedia just sold the rights to use Wikipedia for AI training to Microsoft and openai....

It's getting scraped anyway. So why not get some money from it?

Imo this. Selling access also implies it's illegal to access without purchasing rights, which imho helps undermine AI's only monetary advantage.

They lose the right to sue them

They probably realized that it was a losing battle and they didn't want to pay legal fees.

Wouldn't it give them more rights? Before, anyone could scrape it and claim "Wikipedia's public, so it's fair game", but now Wikipedia can say "no, you must licence the content, as did OpenAI and Microsoft." That could give more protection against other AI companies scraping it for their models, wouldn't it?

Suing an ai company with the orange dipshit in office? Good luck...

This right here is the reason why companies that started out with good quality/intentions turn into companies with crappy mediocre products that now actually contribute to the opposite effect on the world than everything they once stood for.

How exactly does that work? Wikipedia does not "own" the content on the website, it's all CC-BY licensed.

The BY term is not respected by LLMs

So? Still doesn't make sense to me that wikipedia can sell anything meaningful here, but I'm also not a lawyer. Do they promise not to sue them or sell them some guarantee that contributors also can't sue them? Is it just some symbolic PR washing?

Yeah, they're selling the work of others. That's how the site always worked. This venture into "AI" is nothing new.

Why? Wikipedia has like a decade of operating expenses on hand, so they don't need the money

This number inflates every time I read it. First it was ten years of hosting cost. Then it's operating costs. Soon it will be ten years of the entire US GDP.

I'd believe they have ten years of hosting costs on hand.

My quick googling says they have 170m in assets and all 180m in annual operating costs. Give or take.

I just love how people just shit "facts" out of their ass while citing zero sources and people will just believe them and upvote because it confirms their bias.

Greed? It’s probably greed.

OK then why sell data rights to m$?

Well as mentioned Wikipedia seems to be in the red

They keep saying that... at least when they're asking for more money.

Is wiki in the red? Unclear. I mean, they ask for donations, but someone in this thread claims they are set for a decade, and I’ve seen people post something about how they are fine, and even donate a bunch themselves. I don’t know, and I guess it doesn’t matter.

Not sure where you are going with your second comment, and uninterested in engaging with your comparison as I don’t think it’s very good

On the traffic front, other than donations, if they don’t show ads, isn’t more traffic just more cost? So, I guess if copilot instead just shows info without the user going to wiki that might be good in a sense? But if they drove more traffic there, not so much? Unless they are donating….

I mean, I guess it’s better than, ahem… Grok with its fictitious information, but I don’t think this is the ai_lovers community either…

You maybe have an argument that at least the AI will be fed data with some basis in reality, which could be good.

Many conflicting feelings

I mean you’re right, it doesn’t, but it does feel a bit bad considering all that data is mostly the work of volunteers, who now get the intense privilege of becoming AI feed.

I hate this derivative AI slop fest we are driving towards, so I guess I’m a little sensitive to news like this.

If Microsoft is "buying access to training data", it makes what OpenAI is doing look illegal. I would encourage every data broker to sell "AI training data rights" because it undermines the only real advantage AI has and it helps pave the way to forcing AI companies to comply with open source licenses.

Essentially selling ai data rights is a trojan horse for the AI companies. Obviously it would be better to pass laws but until that happens this is imo a better strategy than doing nothing.

I mean, what open ai is doing and did should be illegal if it’s not, in my opinion.

I mean it's free money, why not?

These fuckin AI "enthusiasts" are just making the rest of the world hate AI more.

Losers who can't achieve anything without AI are just going to keep doing this shit.

Fr if they just let it go instead of forcing it on everyone people might even be enthusiastic.

Download an offline copy while you still can.

Here's a link to the Kiwix library download for all of Wikipedia. It's 111GB though, so you'll need a lot of space and also a lot of time to wait for it to download.

Note, you'll also need Kiwix in some manner to read the zim file once it's downloaded.

Kiwix library - All of Wikipedia - direct download link

Kiwix app download page

But this'll let you have a local copy you can reference should actual Wikipedia ever get ruined by GenAI, or worse, get taken down by hostile governments.

I really despise how Claude's creators and users are turning the definition of "skill" from "the ability to use [learned] knowledge to enhance execution" into "a blurb of text that [usefully] constrains a next-token-predictor".

I guess, if you squint, it's akin to how biologists will talk about species "evolving to fit a niche" amongst themselves or how physicists will talk about nature "abhorring a vacuum". At least they aren't talking about a fucking product that benefits from hype to get sold.

I can't help but get secondhand embarrassment whenever I see someone unironically call themselves a "prompt engineer". 🤮

Sloperator

Hey, they had to learn thermodynamics and spend 3 semesters in calculus to write those prompts

I'm a terrible procrastinator engineer.

Isn't this a thing that authoritarians do? They co-opt language. It's the same thing conservatives do. The Venn diagram of tech bros and the far right is too close to being a circle.

You can pretty much put any word out of the dictionary into a search engine and the first results are some tech company that took the word either as their company name or redefined it into some buzzword.

Skills were functions/frameworks built for Alexa, so they just appropriated the term from there.

If these "signs of AI writing" are merely linguistic, good for them. This is as accurate as a lie detector (i.e., not accurate) and nobody should use this for any real world decision-making.

The real signs of AI writing are not as easy to fix as just instructing an LLM to "read" an article to avoid them.

As a teacher, all of my grading is now based on in person performances, no tech allowed. Good luck faking that with an LLM. I do not mind if students use an LLM to better prepare for class and exams. But my impression so far is that any other medium (e.g., books, youtube explanation videos) leads to better results.

I sucked in oral exams and therefore hated them. Then again, if they had been mixed into regular school, it might not have sucked so much.

Doesn't need to be oral, I remember occasionally having exams that were essay questions that needed to be answered in class.

I do both of these as well as smaller but more frequent tests, group work, project work over several sessions etc... The only things I stopped doing are reports to write at home, paper summaries etc. Doesn't make sense anymore.

Fuck you, Siqi Chen.

Congrats on inventing what high school students figured out a year ago to skirt AI homework detectors.

In French, one of the ways to spot AI writing is that sentences will often miss articles or have bad grammar. Can this dude also ask the LLM to include more articles and make complete sentences in the language it's trying to imitate?

I was using the Discover feed on my phone but Google started to insert rewritten stories & headlines by AI and they were so annoyingly bad at making simple sentences in French that it made me stop using that thing.

We'd rather the dude kill the LLM entirely. No one needs that shit

Weird, in Italian they usually have impeccable grammar

The most recent one I saw was "Neige cause ralentissement de REM". It works translated in English, giving something like "Snow causes slowing down of REM", but in French it's missing enough articles to sound wrong. The cromulent sentence would have been "La neige cause le/un ralentissement du REM". Back into English this would add "The snow causes the/a slowing down of the REM". A missing "le/la", or using "de" instead of "du" doesn't change the meaning of a sentence, but it makes it obvious that it wasn't from a native speaker.

Maybe it depends on the model and the source but so far, in French, for news, most of what I've read becomes uncanny after a few sentences. They often read like something passed through a translation program but just a tad better. You can understand it fine but it's missing an article there, reversing the word order in another sentence, uses vocabulary that is just slightly off, and often ends up like something written by someone that learned French as a second language for most of their life. A very good learner, nearly native level, but not quite there yet and still a bit off.

Nice try, LLM. In French, it's probably spelled "IA" (for "intelligence artificielle"), like basically every other originally English acronym. /j

Nice try, LLM.

I think you mean GML

"just tell your LLM not to do that"

You ever ask an LLM to modify a picture and "don't change anything else"? It's going to change other things.

Case in point: https://youtu.be/XnWOVQ7Gtzw

That's why you always add "and no mistakes"

Also "don't hallucinate"

And "don't become self-aware"

You are mixing two kinds of AI, LLM and diffusion.
It's way harder for a diffusion model to not change the rest: the first step of a diffusion model is to use a lossy compression to transform the picture into a soup of digits that the diffusion model can understand.

And an LLM will convert a prompt into a bunch of tokens the model can understand.

Tokens are a lossless conversion, you can convert it back to the original text.

This isn't about saying "return the original text", this is about assuming LLMs understand language, and they don't. Telling an LLM "don't do these things" will be as effective as telling it "don't hallucinate" or asking it "how many 'r's in 'strawberry'".

To affirm or refute that, we'd need to define understanding.
The example you gave can be explained in ways other than "it doesn't understand".

For example, with "how many 'r's in 'strawberry'": LLMs see tokens, and the datasets they are trained on don't contain a lot of data about which letters are present in a token.
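Both claims in this exchange (tokens convert back to the original text, and the model does not see individual letters) can be illustrated with a toy tokenizer in Python. The vocabulary below is invented for the example and the greedy longest-match rule is a simplification; real tokenizers use much larger learned vocabularies.

```python
# Hypothetical vocabulary: two multi-letter pieces plus single-letter fallbacks.
VOCAB = ["straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text):
    """Greedy longest-match tokenization into a list of integer IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        matches = [t for t in VOCAB if text.startswith(t, pos)]
        if not matches:
            raise ValueError(f"no token covers {text[pos]!r}")
        longest = max(matches, key=len)
        ids.append(TOKEN_TO_ID[longest])
        pos += len(longest)
    return ids


def decode(ids):
    """Map IDs back to their text pieces and join them."""
    return "".join(VOCAB[i] for i in ids)


if __name__ == "__main__":
    ids = encode("strawberry")
    print(ids)                          # the model's view: two opaque IDs
    print(decode(ids))                  # round-trips to the original text
    print(decode(ids).count("r"))       # counting letters needs the decoded text
```

In this toy, "strawberry" becomes two IDs. The round trip is lossless, yet nothing in the ID sequence itself says how many times "r" appears.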

From the repo:

Have opinions. Don't just report facts - react to them. "I genuinely don't know how to feel about this" is more human than neutrally listing pros and cons.

That will at least be easy to spot in a Wikipedia entry.

lol brilliant

What is wrong in the techbrodude head that makes them only think of ruining things? Like it seems to me that they literally spend their days looking at things that are good and saying "what can I do to fuck this up for a profit?"

Should being a techie go into the DSM-V as a subheading under narcissistic personality disorder?

Gotta disrupt to peak bro. Just one more app bro

I am so goddamned tired of AI being shoved into every collective orifice of our society.

So they are using AI to make it so AI can't detect that they are using AI?

What kind of technological ouroboros of nonsense is this?

It gets better. Using LLMs to check if the output of an LLM is hallucinated or not! They call it a judge and it's funny as hell tbh

Magic

It can't avoid doing those things. That's the reason for the article.

It's an arms race, AI identification vs AI adaptation. I wonder which side the companies that own these LLMs want to win...

They don't want anyone to win. The arms race makes money.

Bro isn't even gonna check its output anyway.

How likely is the list to be AI generated as well?

And now you know how and why so many programmers are just fucking awful and literally responsible for the hell we’re living in

Kinda surprised they don’t get more hate. Programmers fucking suck

Wow, such programmer.

Especially that "investor" in twitter bio and all his posts about finance.

Hell even if he was a programmer, disney hires artists as well. Entire art community is transphobic now?

(I am sorry if comment was meant to be satirical)

Edit : He is apparently a CEO too.

I was about to defend the lack of contributions and then I kept reading. I have a handful of different accounts I use and some have the same look about them, but yea the investor thing is an obvious tell.

Imagine pinning that tweet and thinking, "oh yeah, this is the one" lol

And some lawyers defend the innocent but the profession as a whole is rotten to the core

Like how FB is VERY maliciously coded to make you comply. My friends won't believe me when I try to explain how fucking evil these corps are and that they do not care about you or user experience.

And all the programmers just following orders for massive paychecks to make it all possible

"Programmers" is a blanket term for everyone who does any sort of programming at all. The ones who heavily rely on AI and don't do programming well are called vibecoders, as far as I know.

Jesus Christ what a wretched twit of a man.

Finally! Now all of the "Scientology" histories will be safe! /s

Stuff like that doesn't always work though, at least on free versions in my experience. I use AI to write flowery emails to people to sound nice when I normally wouldn't bother, and I used it to negotiate buying my car. I would continually tell it not to use - dashes while writing emails. And inevitably, after one answer, it would go back to using them.

Maybe paid versions are different but on free ones you have to continually correct it.

Even the paid models I've tried do that. The style LLMs use seems deeply ingrained. Either companies do it on purpose, or it's just the result of all the companies using similar training data and techniques.

And it's the same reason this guy's solution to not sound like an LLM isn't a real solution, because no amount of prompting can fix the inherent issue that they don't actually know anything.

Wikipedia has already partnered with AI companies to help train their LLMs.

Honestly, I think that training on wikipedia is probably better than training from reddit. At least on wikipedia the llm might get some factual information

tan(Ai)

Well yeah AIs learn stuff. This is a neverending fight

llms do not learn, they are a spicy keyboard autocomplete

Even my keyboard autocomplete learns from my typing

you and I have very different definitions of learning

I guess so

Unlike llms, who will get all confused whenever you ask for a non-existent emoji, over and over and over again, instead of realizing that "there is no seahorse emoji"

It does have weird hangups and stuff it just stubbornly refuses to listen others on. Very lifelike lmao

Wikipedia is astroturfed BS for anything remotely politically related.
Useful if you want to learn about the Ivory-billed Woodpecker or a closed-cycle regenerative heat engine, etc.
So no politics and subjects with political implications such as history.

That's what prolewiki is for

I trust no one with my history or geopolitics but ML's.
Most informed, well read and thorough people on the planet

It's really "strange" how people have this delusional viewpoint of wikipedia as neutral, honest, etc.

I guess it seems fine if you've got lib politics completely within the hegemonic narrative.

The point is you can check the citations to validate. It can be poisoned by wuzzles but you can discover that with just a bit of effort checking the citations.

Wikipedia is not genuine, unless you live in the West and have no interest in Eastern perspective.

citation requested; I've seen it having various biases but not east/west.

is it, perhaps, a particular article that you could point to?

It's fairly obvious to anybody outside the dominant culture. There are many articles if you bother searching. Here's a quote from just one:

Wikipedia materialized through predominantly westernized cisgender male voices, opinions, and biases. The awareness in the community, at that time, illustrated a rather singular point-of-view and developed policies and practices accordingly. This foundation is difficult to break. Preference on Wikipedia concerning changes or inclusion is still very singular and causes diverse participants to have work within the dominant culture... This narrow and inflexible behavior functions within the Wikipedia community to oppress and exclude. Simply because experience and history have been traditionally told from a white, cisgender male perspective, these voices and perspectives within society are taken as fact when often they are opinions or interpretations. We all experience life from our lived experiences; the Wikipedia community is no different. By infusing homogeneous points-of-view into policies and practices of a community, a disservice is being done. Content and people are being removed and excluded if they do not fit the policies and practices designed by the existing cohort of contributors. - https://wikipedia20.mitpress.mit.edu/pub/u5vsaip5

That could be valid. I can see how it grew first in the male-dominated tech sphere, which was predominantly Western.

I will say, the vast majority of people I encounter looking for Wiki to be defunded are the ones upset by articles like this:

https://en.wikipedia.org/wiki/1989_Tiananmen_Square_protests_and_massacre

It upsets them that they can't censor history.

I see in other posts you reference Wiki....

I live in the eastern hemisphere and I don't really see what you mean. Are you sure your opinion is even genuine and not influenced by your political beliefs?

I hate AI as much as the next guy, but what the actual fuck?!

Who in the fuck is the they you are referring to?

Killing people en masse for being fucking morons is not as good of a take as you think it is.

Wikipedia is one of the last genuine places on the Internet,

Come on now. It's like 99% corporate propaganda if not worse. Literally run by a "libertarian".

TBH I don't think "AI" can make it any worse.

Does Maryana Iskander know she's a libertarian?