Generative-AI content needs clear disclaimers for news readers
Smart Data Initiative Blog | 12 March 2023
GPT-3, the large language model that powers ChatGPT, was trained using a technology called neural networks. Neural networks are the big win of recent years in machine learning: They allow a large number of powerful computers to join up in digesting, processing, and rearranging vast amounts of data and, in so doing, to learn new patterns and rules.
The problem with this method is that it makes explainability (which we will talk about in a second) difficult, and it therefore makes it difficult or impossible to ascertain how something was actually generated.
Wherever machine learning occurs, a key question — one that will be familiar to any seasoned journalist — is asking where the information was acquired, and, importantly, where information was not acquired.
You’d ask a politician whether they considered all sides in building a policy. Before deciding to trust a music critic’s information and opinions, you’d ask whether they are informed about all genres of music and have taken stock of a deep bench within each genre.
With deep learning — the kind of Artificial Intelligence that doesn’t use human rules to go about building itself — the question of the underlying data set is crucial to appreciating where and how the resulting intelligence may have gaps.
The large language model GPT-2 (the predecessor of GPT-3) has this warning on its GitHub page:
“The dataset our GPT-2 models were trained on contains many texts with biases and factual inaccuracies, and thus GPT-2 models are likely to be biased and inaccurate as well.
“To avoid having samples mistaken as human-written, we recommend clearly labeling samples as synthetic before wide dissemination. Our models are often incoherent or inaccurate in subtle ways, which takes more than a quick read for a human to notice.”
This suggestion is a good one, and one we are familiar with in our industry. We have codified the way we handle certain changes (corrections). We have codified how certain types of content are labelled (advertorials, native advertising, reviews that used free products or services, and affiliate links that may generate income for the organisation).
The Partnership on AI, a coalition of organisations whose news media partners include the BBC, the CBC, and The New York Times, is working on broad recommendations for the adoption of AI-generated content across a range of industries and use cases. It has already recommended disclaimers as well, with suggestions ranging from watermarks for visual media to audio disclaimers for audio content and text indicators for text-based media.
Regardless of the specific lines that your organisation has drawn for its own practices, rare are the organisations that don’t have a code of conduct for disclosing the specific circumstances under which a piece of content was produced. And this is especially true wherever we realise our readers or users wouldn’t be able to recognise these distinctions.
News organisations have also long codified attribution: What is in quotes is not like what we paraphrase and attribute, and this too is different from material that is handed out, where the material is used as is but a copyright label provides the attribution.
Anyone who has sat in a newsroom legal seminar (like yours truly) will remember that beyond the usefulness of disclaimers and attribution for the sake of our readers, there is also a liability angle to consider. For example, content in quotation marks doesn’t carry the same libel weight for the publication reporting it as content outside of quotes. (“‘So-and-so is a thief and a liar,’ said Joe Schmoe” is not the same as your organisation calling So-and-so the same thing without attribution.)
This is not legal advice (because I’m not an attorney), but wherever we are taking the work of an AI unedited, we would do well to remind our readers of this particular origination. Whether from the perspective of intellectual honesty of authorship — a kind of byline or credit line — or from the perspective of disclaiming, it seems we’d never be wrong for erring on the side of informing our users about the origination of AI content.
Doing this means using plain language. Attributing something to Dall-E will be understood as automated content generation by only a minuscule fraction of your audience. You have to know what Dall-E is in the first place, and I can tell you that my mother has never heard of it (and she reads the newspaper). So this work of disclaimers really means educating our users so that they are properly aware of what distinguishes AI-generated content.
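For newsrooms that tag content in a CMS, the plain-language point could be sketched roughly as follows. This is only an illustration, not an established standard: the function name `disclosure_label`, the three media categories, and the wording of each label are all assumptions made up for this example. The categories loosely mirror the visual, audio, and text disclaimers mentioned earlier.

```python
# Hypothetical sketch: the names and label wording below are illustrative
# assumptions, not a standard or any organisation's actual policy.

# Plain-language templates, one per kind of media.
TEMPLATES = {
    "text": "This text was written by a computer program ({tool}), not by a journalist.",
    "image": "This image was created by a computer program ({tool}), not by a photographer or illustrator.",
    "audio": "This audio was produced by a computer program ({tool}), not recorded from a person speaking.",
}


def disclosure_label(media_type: str, tool: str, human_edited: bool = False) -> str:
    """Return a reader-facing disclosure that explains the origin in plain words.

    The tool name (e.g. "Dall-E") appears only in parentheses, so the label
    still makes sense to a reader who has never heard of the tool.
    """
    if media_type not in TEMPLATES:
        raise ValueError(f"No disclosure template for media type: {media_type!r}")
    label = TEMPLATES[media_type].format(tool=tool)
    if human_edited:
        # Tell readers when staff reviewed the machine-made content.
        label += " It was reviewed and edited by our staff."
    return label


if __name__ == "__main__":
    print(disclosure_label("image", "Dall-E"))
```

The design choice worth noting is that the explanation comes first and the product name second: a credit line reading only "Dall-E" assumes knowledge most readers don't have, whereas "created by a computer program" does not.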