Pragmatic approach to natural language generation delivers reliable robot journalism
Big Data For News Publishers | 30 August 2020
Every month, we send hundreds of thousands of robot-written articles to news publishers in Europe and North America. Almost all of them are published directly onto sites and apps without going through any editorial checks, so predictability is key when texts are generated. To ensure this reliable language quality, we have developed our own flavour of natural language generation, pragmatic NLG.
As the transformation to digital accelerates due to the current worldwide crisis, we’re quickly approaching a pivotal point in publishing. Publishers are looking for ways to leverage new technology to truly become digital, both in the way newsrooms work and in how they serve readers with their journalism.
Artificial Intelligence (AI) and NLG are key aspects of the new technology in publishing. AI, in particular, is now a topic at every virtual industry event and the focus of a plethora of projects exploring how it might be deployed to support that transformation.
In contrast, the core of the work at United Robots is focused on the here and now, on delivering automated content consistently and reliably, based on structured and regularly published data sets. We’re not new; we have been doing this since 2015, when we were born out of a project within Swedish local media group Bonnier News Local (formerly known as Mittmedia).
Our robots reflect the fact that we originated in journalism. So, what does that mean? Fundamentally, it means our focus is on language quality, variety, and reliability. To achieve this, we’ve built our NLG technology on rules and algorithms, developed in collaboration with journalists, and based around how they reason when they write.
We also leverage machine learning (ML) for a number of tasks. ML-based text generation relies on prediction and probability, often using recurrent neural networks. In a language context, that means that as sequences of words are passed through the network, the network uses them to calculate the probability of which words should come next as new text is generated. The network “teaches itself” as it is trained on more text.
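To make the idea of "calculating the probability of what words should come next" concrete, here is a minimal sketch in Python. It deliberately swaps the neural network for a much simpler technique, a bigram count model, and runs on a tiny made-up corpus, so it illustrates the principle only and is not how any production system works. The corpus and the function name `next_word_probabilities` are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus (an assumption for illustration only).
corpus = [
    "the home team won the match",
    "the home team lost the match",
    "the away team won the match",
]

# Count how often each word follows each preceding word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1


def next_word_probabilities(prev_word):
    """Return {word: probability} for the words seen after prev_word."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    # An unseen word has no counts, so this yields an empty dict.
    return {word: count / total for word, count in counts.items()}


probs = next_word_probabilities("team")
print(probs)
```

In this toy corpus, "won" follows "team" twice and "lost" once, so the model rates "won" as the more probable next word. The text that comes out depends entirely on what the model has seen, which is why the output is harder to predict and to correct than text produced by explicit rules.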
While this type of NLG works well for things like summarising texts or driving chatbots, at this point in its development, it’s not ideal when what you’re aiming to do is deliver consistently accurate editorial texts.
However, we take advantage of ML in our image selection process, for example with images from Google Street View, where we’ve taught the system to flag when the pictured property is obscured by something. We also deploy ML for tooling when we develop text algorithms, as well as for spell checks. And no doubt we’ll develop new ML-based systems in the future.
We build for variety and reliability. Our core job is to build NLG robust enough to deliver reliable and accurate texts which can be distributed directly to publishers’ customers. This is a key USP of our content products; they save editorial man hours. We can’t afford mistakes, which is why our NLG tech is based on rules and algorithms. This also means that on those rare occasions when an error does occur, we know why and how it happened and can correct it — something that is extremely tricky to do with language generated through ML.
ML models also strive for a single most probable outcome, based on the words that have come before. With our pragmatic NLG, the aim is the opposite: to put in the same three data points, for example, and generate many different texts, giving variation in the outcome.
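The contrast can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our production code: the match data, the template wording, and the names `TEMPLATES`, `pick_rule`, `all_variants` and `generate` are all hypothetical. It shows how a rule selects a group of hand-written templates, how the same data points yield several different texts, and how each text can be traced back to the rule and template that produced it.

```python
import random

# Hypothetical structured data for one match (made up for illustration).
data = {"home": "Northtown", "away": "Southport", "home_goals": 3, "away_goals": 1}

# Hand-written templates, grouped by the rule that selects them.
TEMPLATES = {
    "home_win": [
        "{home} beat {away} {home_goals}-{away_goals} at home.",
        "A {home_goals}-{away_goals} home win for {home} against {away}.",
        "{away} lost {home_goals}-{away_goals} away to {home}.",
    ],
    "away_win": [
        "{away} won {away_goals}-{home_goals} away to {home}.",
        "{home} lost {home_goals}-{away_goals} at home to {away}.",
    ],
    "draw": [
        "{home} and {away} drew {home_goals}-{away_goals}.",
        "It ended {home_goals}-{away_goals} between {home} and {away}.",
    ],
}


def pick_rule(d):
    """Explicit rule: the score decides which template group applies."""
    if d["home_goals"] > d["away_goals"]:
        return "home_win"
    if d["home_goals"] < d["away_goals"]:
        return "away_win"
    return "draw"


def all_variants(d):
    """Every text the rules can produce for this data."""
    return [template.format(**d) for template in TEMPLATES[pick_rule(d)]]


def generate(d, rng):
    """Pick one variant and return it with a trace of how it was made."""
    rule = pick_rule(d)
    index = rng.randrange(len(TEMPLATES[rule]))
    text = TEMPLATES[rule][index].format(**d)
    return text, (rule, index)


text, trace = generate(data, random.Random())
print(text)
print("produced by rule and template:", trace)
```

Because every sentence comes from a known rule and a known template, a faulty text can be traced to the exact template that produced it and corrected there.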
The beauty of pragmatic NLG is that it also allows us to create bespoke texts for each publisher, such as by incorporating editorial style guides into the robots. And, of course, while we currently generate text in six different languages, we can relatively easily add more within our structured approach.
Pragmatic NLG has been, and continues to be, developed out of a desire to help publishers with their newsroom challenges right now. While it will evolve over time, it’s already solid enough to reliably and automatically generate and publish hundreds of thousands of texts to publishers every single month.