
How AI-generated content is displacing human-made content on the web

Date: 11th December 2024
Reading Time: 10 minutes
Author: GWS Team

OpenAI launched ChatGPT, the first widely used AI large language model chatbot, in November 2022, and other advanced AI chatbots have since followed. Gemini, formerly known as Bard, was launched by Google in March 2023; Claude from Anthropic also appeared that month; and Ernie from Baidu was released to the public in August 2023. Take-up was rapid, with huge growth in user numbers over 2023.

These large language models have impressed us and certainly exceeded previous expectations of what an AI chatbot could do. Content on almost any topic can now be generated within minutes, with the bots able to adapt their responses to feedback, adopt a variety of writing styles and draw on a significant quantity of source material. With detailed, descriptive prompts you can get a large language model to produce almost any written output you want, meaning content creation can now be almost fully automated. We believe it has never been easier or cheaper to obtain written content at scale.

The downside of this new way of creating content is that every chatbot's knowledge and understanding is limited (some would say illusory). Without human fact-checking and cross-referencing, AI responses may contain errors (often known as 'hallucinations'). If these responses are taken on trust and then shared or published, there is a risk of spreading misinformation - and of decisions being taken on the basis of incorrect information. Great efforts have also been made to prevent most of these systems from creating illegal content, and the safeguards in place may mean that certain subjects or content areas are restricted, or treated in a one-sided way. This potentially reflects the biases not just of the people who build the algorithms, supervise the training of the model and decide which areas are off-limits, but also biases in the source material itself. We also wonder whether, however much prompting a chatbot is given, it could ever fully take on the voice of a human writer it was asked to imitate, given the unpredictability that often marks interesting writing, and the fact that imitation tends to involve exaggeration and simplification.

The introduction of AI image generators has compounded the many ethical issues around AI-generated content. DALL-E, a text-to-image model developed by OpenAI, launched initially in January 2021, with a second version following in 2022. This paved the way for various other image generators: Midjourney from Midjourney Inc. and Stable Diffusion from Stability AI both launched in 2022, and Imagen from Google followed in 2023. These programs have been quickly adopted, both to generate images from scratch and to make changes to existing ones. Depending on the prompt, these generators can produce something that is not of obvious AI origin and might be mistaken for a real artist's image, just as the text output from AI programs can easily be mistaken for that of a real human writer. While this involves a value judgment about the importance of something being created by a real person in the traditional way, there are moral questions over the impact these 'simulacra' of human creativity will have. Another issue is the danger of these tools being used for ill, such as disseminating false information or creating photos and videos that are plausible but completely misleading. That is perhaps not a new problem - propagandists have relied on the power of the written word, images and video to manipulate opinion for as long as those media have existed - but it arguably takes things to a whole new level, which we will have to come to terms with as a society.

Thanks to the ease of use and speed of these tools, people are creating content at a rate far faster than would otherwise have been possible. That may exacerbate the digital overload and content fatigue that many people already experience, as well as making it harder for search engines to evaluate what is important and what isn't. The influx of AI-generated content may have unwanted consequences as it spreads: it effectively bombards the internet, using up storage space and energy in the data centres that host online content, and potentially relegates human-made writing and images to also-rans, drowned out by the sheer weight of imitations.

We have already noticed evidence of this trend in the image searches we conduct for some projects. In some cases, roughly 90% or more of the results returned were artificially generated images. This hindered our work, since we were looking for authentic human-made images, and meant we needed a filter to remove AI-generated results. It is striking just how much visibility and influence these AI images have gained in such a short space of time - they were swamping what we wanted to find. When we repeated an image search with the same prompt but added a filter to show only images submitted before December 2021 (before AI image generation became widely available), the difference was significant. AI-generated images have come to dominate search engine results in certain areas, and a time-filtered search is needed if you prefer not to see them. Whether you would describe traditional images as 'real' and AI-generated ones as 'fake', or take a more nuanced approach - there is perhaps more of a spectrum than a clear divide, since photographers have been manipulating images since the dawn of photography - the growing lack of clarity as AI models become more sophisticated is a cause for concern.

So just what is Google doing to combat the spread of AI-generated content?

In August 2023, Google DeepMind announced the beta launch of a technology called SynthID, in collaboration with Google Cloud. The tool was designed to identify AI-generated images by means of a watermark digitally embedded in the pixels of an image: invisible to the human eye, but detectable digitally. The technology can indicate whether a whole image, or just a part of it, was generated by Google's own AI tools. This information is now available to any user via the 'About this image' feature in Google Search or in the Google Chrome browser.

This digital tagging technology has since been developed further, and earlier this year Google DeepMind implemented SynthID Text in Gemini to watermark text generated by AI. In October this year, SynthID Text was made open source and released to developers looking to build a safeguard into their software. In an age when misinformation is prolific, the ability to identify AI-generated content is vital to maintaining trust online.
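To give a flavour of how statistical text watermarking can work in principle, here is a minimal Python sketch of a 'green list' scheme in the spirit of published academic watermarking research. It is not Google's actual SynthID algorithm, whose details differ: the vocabulary, key, hashing scheme and scoring rule below are purely illustrative assumptions. A keyed hash of the previous token selects a 'green' half of the vocabulary, generation is biased towards green tokens, and detection simply measures the fraction of green tokens.

```python
# Toy illustration of a statistical "green list" text watermark.
# This is NOT SynthID: the key, hashing scheme and scoring rule here
# are illustrative assumptions, not Google's algorithm.
import hashlib
import random

VOCAB = [f"tok{i}" for i in range(1000)]   # stand-in for a model's vocabulary
SECRET_KEY = "demo-key"                    # hypothetical key shared with the detector

def green_list(prev_token: str) -> set:
    """Deterministically select half the vocabulary, keyed on the previous token."""
    seed = hashlib.sha256((SECRET_KEY + prev_token).encode()).hexdigest()
    return set(random.Random(seed).sample(VOCAB, len(VOCAB) // 2))

def generate_watermarked(length: int = 200) -> list:
    """Stand-in for a biased sampler that always picks 'green' tokens."""
    rng = random.Random(42)
    tokens = ["tok0"]
    for _ in range(length):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_fraction(tokens: list) -> float:
    """Detector: fraction of tokens in the green list of their predecessor."""
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

watermarked = generate_watermarked()
rng = random.Random(7)
ordinary = [rng.choice(VOCAB) for _ in range(200)]
print("watermarked:", green_fraction(watermarked))   # close to 1.0
print("ordinary:   ", green_fraction(ordinary))      # close to 0.5
```

Ordinary text scores around 0.5 on this toy metric, while watermarked text scores close to 1.0 - which is what makes such a mark statistically detectable by software without being visible to a human reader.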

The SynthID tools can now be used to identify artificially generated images, text, audio and video, with a different watermarking method used for each category. All of them formally remain in beta as development continues.

In a DeepMind blog post, Google acknowledged:

While not a silver bullet for addressing problems such as misinformation or misattribution, SynthID is a suite of promising technical solutions to this pressing AI safety issue.

No doubt this toolkit is a step in the right direction; but unless the detection technology is made available to the public and the other major large language model providers adopt watermarking systems of their own, it does not fully address the issue.

A recent article in The Verge reports that OpenAI, the creator of ChatGPT, has an AI text watermarking tool of its own ready to go; however, the company is divided over whether to launch it, torn between ethical transparency and the impact on paying customers. Releasing it seems like a positive move, but it could deter OpenAI users who don't want their reliance on AI content creation to be public, and who might therefore move to other platforms if such encoding became mandatory for content produced with OpenAI's technology.

The universal roll-out of AI content watermarking tools, while it seems promising, isn’t a foregone conclusion. It will depend on the willingness of companies that control sophisticated AI content-generating tools to voluntarily adopt this practice. Perhaps in the future we will see international legislation enforcing a universal watermarking standard for all AI-generated content, but currently we think that universal adoption by all countries of such laws is a long way off, even if legislators in some countries are moving in this direction.

How can we deal with this?

Filtering your search results by date to avoid the work of AI content generators is generally not a realistic option, although it may work in some areas, such as searching for images or stock photography, or dated articles.
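As a concrete example - assuming you are using Google, whose web search supports the before: and after: date operators - a query restricted to pages dated before the AI image boom might look like this:

```
landscape photography before:2021-12-01
```

Similar date filters may be available in your search engine's image search tools, though the exact options vary between search engines.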

While it is frustrating that search engines increasingly show AI-generated content in their results whether you want it or not, there are ways to combat this influx at an individual level. Take time to evaluate your search results, and keep in mind that not everything on the web is written by a person, even when an article carries a by-line. AI-generated content can be hard to spot, but with time and practice it is often possible to recognise tell-tale signs - from unrealistic elements in AI images to the sentence construction and the characteristic way AI-generated text reads.

You may also wish to try browser extensions designed to block unwanted websites from search results. uBlacklist is one such tool: it can hide specific sites you flag from appearing in your search results. You can either add sites that you know produce AI-generated content yourself, or subscribe to third-party lists of such sites. This won't solve the underlying problem - you couldn't possibly block every website using AI-generated content, even with an extensive and constantly updated list - but filtering out the worst offenders will help to narrow down your search results, and hopefully encourage more sites not to use AI content.
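For illustration, uBlacklist rules are typically written as one match pattern per line, along these lines (the domains below are placeholders rather than real offenders, and it is worth checking the extension's documentation for the exact syntax your version supports):

```
*://*.example-ai-spam-site.com/*
*://*.another-placeholder-domain.net/*
```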

What are the consequences of flooding the web with AI-generated content?

A group of researchers believes that there could be further consequences, beyond the risks of spreading misinformation, if AI-generated content continues to be spread to all corners of the web.

An article published in the academic journal Nature earlier this year documents a phenomenon known as 'model collapse'. Today's large language models have been trained on pre-existing data found on the web. As next-generation models are created and begin to learn, the data sets they learn from will increasingly include content generated by the AI models that came before them, and the opportunity to learn from original, human-made content and data will steadily shrink. This distortion compounds with each generation, as subsequent models learn from data sets that already carry the losses and imperfections of earlier ones. Researchers believe this will lead to a progressively more distorted view of the world, and potentially to a 'model collapse' in which the model's output ceases to be useful.
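As a loose analogy (not the Nature paper's actual experiments), the Python sketch below fits a simple Gaussian 'model' to a small sample, then repeatedly retrains each new generation only on samples drawn from the previous generation's model. Over many generations the fitted spread tends to shrink and the mean drifts, crudely mirroring how rare, original information disappears first when models learn from their predecessors' output.

```python
# Toy analogue of 'model collapse': each generation is fitted only to
# data produced by the generation before it (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
sample_size = 20                       # deliberately small, to speed up the effect

# Generation 0: 'human' data from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=sample_size)

for generation in range(1, 101):
    mu, sigma = data.mean(), data.std()          # fit the next 'model'
    data = rng.normal(mu, sigma, sample_size)    # train only on synthetic output
    if generation % 20 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}, std={sigma:.3f}")

# Typically the std collapses towards zero and the mean wanders away from 0:
# the diversity of the original data is progressively lost.
```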

With each new version, understanding and accuracy are therefore at risk of being lost as the quality of the available source material declines. This could lead to misreporting, and to potentially catastrophic results where businesses or organisations rely on AI to guide their decision-making. How can future LLMs make predictions about upcoming business trends if the information they learned from doesn't provide a realistic, properly researched view of the business and economic landscape as it actually is?

It seems that an important step to address this will be to program future LLMs to rely only on credible primary and secondary human reporting. Ideally, they should be given access to a vast array of original human-made content to learn from; and in the case of factual material, it should preferably be drawn from trustworthy peer-reviewed publications as a matter of priority.

To what degree such carefully filtered sourcing is programmable and achievable remains to be seen. With more and more AI-generated content appearing on the web, it will need to be reliably distinguished from human content if quality standards in the sourcing of LLM training material are to be upheld, and if the use of AI is to remain more beneficial than harmful.

Even within the parameters of human-generated content online, some sources are more trustworthy than others. There is already a lot of unreliable human-written material on the web that is partly derivative of other sources and partly opinionated; and even this non-AI-generated content may be a poor source of learning for AI content-generation models.

Today's LLMs are still relatively free of the risk of model collapse from relying on AI-generated source material, raising the possibility that the current iterations may prove the most valuable, if newer ones enter a spiral of rehashing content already produced by several generations of AI.

When we factor in these future feedback loops from AI to AI, there is a risk that the already occasionally inaccurate claims made by large language models will become ever more common.

Conclusion

There is no doubt that the extent to which AI-generated content is appearing on the internet is concerning. Whether it takes the form of hindering users from finding useful search results, or spirals into the doom loop of model collapse that some have predicted, we have certainly not yet seen the full consequences of the influx of AI-generated content being created and published on the web.

Clearly there are many benefits to these large language models, such as facilitating research and brainstorming, and enhancing personal efficiency and productivity. Set against these are some of the negative consequences mentioned above - as well as potential consequences, as yet unknown, that lie in the future.

As with all new technology, society will ultimately need to adapt, change and learn to live with it. Humans tend to adopt whatever offers short-term benefits, even though there may be a price to pay later - and without that willingness to embrace new tools, progress and innovation would be hindered. Better regulation may be needed in this case more than in others, given the scope for AI to be harnessed for ill, or even to break its chains and decide we are there to serve it, rather than the other way around. The future isn't clear, but at least serious consideration is being given to the potential negative consequences.


If you would like more information on our services, or have any thoughts on this article you would like to share, please send us a message.
