LLMs have a strong bias against use of African American English

BlackEco@lemmy.blackeco.com · 22 days ago

[email protected]@lemmy.federate.cc · edit-2 22 days ago

This kind of seems like a non-article to me. LLMs are trained on the corpus of written text that exists out in the world, which is overwhelmingly standard English. American dialects effectively only exist while spoken, be it a regional or city dialect, the black or chicano dialect, etc. So how would LLMs learn them? Seems like not a bias by AI models themselves, rather a reflection of the source material.

lily33@lemm.ee · edit-2 22 days ago

It’s not an article about LLMs not using dialects. In fact, they have learned said dialects and will use them if asked.

What they did was ask the LLM to suggest adjectives associated with sentences, and it would associate more aggressive or negative adjectives with sentences written in African American English.

Seems like not a bias by AI models themselves, rather a reflection of the source material.

All (racial) bias in AI models is actually a reflection of the training data, not of the modelling.

JohnEdwa@sopuli.xyz · edit-2 21 days ago

I would assume the small amount of training data written that way doesn’t contain that many professional research papers, corporate emails or calm poetry, but would consist mostly of social media posts and comments which have a rather heavy bias towards aggressive and negative.

BlackEco@lemmy.blackeco.com · edit-2 22 days ago

Seems like not a bias by AI models themselves, rather a reflection of the source material.

That’s what is usually meant by AI bias: a bias in the material used to train the model that reflects in its behavior

Melody Fwygon@lemmy.one · 22 days ago

Yeah this seems like a non-issue to me as well; the source material for the models is probably the cause of this bias.

I also don’t think there’s a lot of sources for this manner of speaking. Let’s also not forget that there’s oftentimes instructions given to the LLM that ask it to avoid certain topics which it will in fact do.

Toribor@corndog.social · 20 days ago

I’m from the Midwest US and I know there are words and sounds I pronounce with a Midwestern accent but I can still type and spell them correctly.

If’n I typ lik dis den o’course people gonna think I hev the big dumb or that I’m a mole from a Redwall book.

davehtaylor@beehaw.org · 22 days ago

All the people here saying “well of course because they weren’t trained on AAVE”:

THAT’S THE WHOLE POINT

It’s the same reason facial recognition and voice recognition software have a difficult time with anyone who isn’t white or a speaker of perfect, uninflected standard english. The bias is created by the developers, conscious or not, because they only train it on what’s in their own bubble. If you don’t have diverse teams behind the development and training, you will create this bias, whether you want to or not. This is well known.

GiveMemes@jlai.lu · 21 days ago

There’s also just the issue of the fact that there’s significantly more books, articles, etc. written in standard english vs AAVE so that’s gonna be a huuuge barrier to overcome regardless of diversity of development and training teams. Not to say diversity isn’t important, but also that there’s just certain challenges surrounding finding adequate amounts of high quality training data, especially for less mainstream concepts. It’s the same reason an AI couldn’t give a summary of a book that has almost no info abt it on the internet.

Moonrise2473@feddit.it · 22 days ago

The problem is that they trained the models using millions of pirated books in standard english.

AAE is mostly used when spoken: they also pirated millions of TV series and YouTube videos that can contain it, but as of now that was mostly for training voice recognition models

(proof that they pirated television content and youtube videos to train whisper: https://community.openai.com/t/subtitles-created-by-amara-org-qtss-etc/462561 - https://gist.github.com/riotbib/3b3c5f817b55b68801d14b8bdb02df09)

millie@beehaw.org · 21 days ago

Given the responses in this thread, it seems that the same bias exists even in ostensibly leftist spaces. Yikes.

Y’all need to get out more.

gnu@lemmy.zip · edit-2 22 days ago

It’d be interesting to see how much this changes if you were to restrict the training dataset to books written in the last twenty years. I suspect the model would be a lot less negative. Older books tend to include stuff which does not fit with modern ideals, and it’d be a real struggle to avoid this if such texts are used for training.

For example I was recently reading a couple of the sequels to The Thirty-Nine Steps (written during WW1) and they include multiple instances that really date them to an earlier era with the main character casually throwing out jarringly racist stuff about black South Africans, Germans, the Irish, and basically anyone else who wasn’t properly English. Train an AI on that and you’re introducing the chance for problematic output - and chances are most LLMs have been trained on this series since they’re now public domain and easily available.

shutz@lemmy.ca · 22 days ago

I don’t like the idea of restricting the model’s corpus further. Rather, I think it would be good if it used a bigger corpus, but added the date of origin for each element as further context.

Separately, I think it could be good to train another LLM to recognize biases in various content, and then use that to add further context for the main LLM when it ingests that content. I’m not sure how to avoid bias in that second LLM, though. Maybe complete lack of bias is an unattainable ideal that you can only approach without ever reaching it.

Kwakigra@beehaw.org · edit-2 22 days ago

I just tested out the classic “She working” vs “She be working,” and the machine got it backwards. It can’t translate to AAVE, but it can probably appear to well enough for people who wouldn’t know the difference. In terms of available written materials, just by population and historical access, it seems like there would be way more incorrect white imitations of AAVE to draw from than correct usage. Like a lot of LLM issues, it’s been a problem for a loooong time but is now being put into overdrive by being automated.
