Are Language Model "AI" Tools Heading for 🚽?


(header image)

You know the saying, garbage in, garbage out? We might be heading for peak GPT if the developers behind the tools are not careful.

Google is the latest to say that it has the right to scrape all publicly available information posted online for its AI projects. Its updated privacy policy specifically mentions that this includes building products and features like Google Translate, Bard, and Cloud AI capabilities.

Not only does this move raise privacy concerns (it gives Google licence to harvest and use data posted on any part of the public web - content that is public, but usually aimed at human consumption), but I can also see Google and others gobbling up masses of poor-quality content becoming a problem.

AI Poop Pipeline

It's kind of like a photocopy of a photocopy of a ... OK, for the younger folks, it's like those JPGs from Aunty that have been re-saved and re-shared so many times that all you can make out is the pixels.

Other companies like OpenAI have also admitted to scraping the internet to fuel their AI. Up to now that wasn't a problem from a quality standpoint, since most of what got scraped was written by humans, but those tools are now being used a LOT, especially for "programmatic" and automated SEO posts. Training your model on already-crap AI-generated text, and then processing the output again?
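If you want to see the analogy in code form, here is a toy sketch (purely illustrative - the file name and quality setting are made up) that re-encodes the same JPEG over and over, the way a model trained on model output keeps re-amplifying its own artefacts:

```python
# Toy illustration of the "photocopy of a photocopy" effect:
# repeatedly re-encode the same picture as a low-quality JPEG and
# the artefacts compound with every generation.
# Hypothetical file names; requires the Pillow package.
from PIL import Image

img = Image.open("aunty.jpg").convert("RGB")   # hypothetical starting image

for generation in range(1, 11):
    out = f"generation_{generation}.jpg"
    img.save(out, format="JPEG", quality=30)   # lossy re-encode
    img = Image.open(out).convert("RGB")       # next round starts from the copy

print("Compare generation_1.jpg with generation_10.jpg: same picture, far less signal.")
```

Each pass starts from the previous copy rather than the original, which is exactly the worry with training on AI-generated text.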

Reddit and Twitter Turn AI Racist?

Remember when Microsoft's AI chatbot, Tay, was corrupted by Twitter users within 24 hours of its launch?

Tay was designed to learn from and engage in "casual and playful conversation", but users began tweeting misogynistic, racist, and offensive remarks at it, and like a small, impressionable child, Tay started repeating these sentiments back to users.

"Hilarity" ensued.

It's one thing to "jailbreak" an AI to get it to say funny stuff, but some of the offensive remarks made by Tay were unprompted, showing that the bot's learning had genuinely gone astray.

Remember, this was the pre-$8-checkmark era, when Twitter on the whole surfaced the best comments rather than those of the $8 payers. Now imagine an AI trained on Reddit, or god forbid, 4chan ...

Open but not free

Even trickier: who owns the content once it is published? Search engines obviously NEED the content, but they should have to ask nicely before grabbing the stuff that puts food on people's tables, right?

The legality of the scrape-now-ask-forgiveness-later practice is already questionable, given the steps Australia and Canada have taken to restrict Google from slurping down their news and column content, and it is likely to be a topic of debate and a lot of litigation in the coming years.

Stop the Scrape or Freedom to Embed?

Another worry is freedom of automation - Twitter and Reddit have already made changes to their platforms to restrict access to their APIs, affecting third-party tools and causing controversy. Elon Musk has been vocal about his concerns regarding data scraping on Twitter, which led to limitations on the number of tweets users can view.

This effectively broke key uses of Twitter - you can no longer reliably embed a tweet, share it with friends, or find it on Google. Think back to the last time you looked up a celebrity on Google: did you see their latest tweets?

As marketers we used to hope for a "hey Martha" moment, where one person would turn to another and say "Look at this!". Twitter just lost that.

Reddit, notoriously, has seen mass protests by moderators over changes to API accessibility, which may do permanent damage to the platform.

Conclusion ...?

These tools are highly useful, but how they are trained is problematic.

Right now I am loving using the GPT API to help automate things I could not do before, but as well as programming, I also make my living from writing, either directly or indirectly.
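To give a flavour of the kind of automation I mean, here is a minimal sketch (assuming the openai Python package, an OPENAI_API_KEY in the environment, and an illustrative model name - not my actual scripts) that turns rough notes into a tidy summary:

```python
# Minimal sketch of a GPT-API automation: summarise rough notes.
# Assumes the `openai` package (v1+ client) and OPENAI_API_KEY set
# in the environment; the model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def summarise(notes: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarise the notes in three bullet points."},
            {"role": "user", "content": notes},
        ],
    )
    return response.choices[0].message.content

print(summarise("raw meeting notes, scraped headlines, whatever needs tidying ..."))
```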

Things are going to get interesting, and perhaps in the Chinese curse way ...

Comments

I'm looking at that header image, and I was thinking... is that Stable Diffusion? It does give off some of that energy when I zoom in, lol.

Anyway, imo, these automated training pipelines are probably not going to go very well unless they also include a model to determine whether the content they are learning from is any good. That itself is very challenging, because the internet is evolving every day. You wouldn't expect an AI model to guess the meaning of "fr" and "mid" out of context...

(on a side note, gen Z language is really, really odd for me. Perhaps I'm old.)

On the topic of freedom to access data, this is also difficult... I would say that scraping will prevail. It's the internet: if someone wants something, they will get it. Watching everyone trying to react is kinda interesting tho... My hope is that in the end there will be AI content detection tools for just about everything, available everywhere, so people won't get away with zero-effort crap. For now the tools seem to be working fine, but for how long, hmm.

is that Stable Diffusion?

Midjourney :)

You're spot on, garbage in, garbage out. I was saying the same when ChatGPT went viral. It will be a race to the bottom, destroying a free and open internet.

It also demonstrates the limitations of AI. It cannot create on its own and requires human input, as it has no will or motivation of its own; it is just an algorithm.

As you pointed out in your article, there are also the biases of the data it is trained on, plus the initial learning rules.