Predictive text AI too effective for public release

Artificial intelligence company OpenAI has developed a text-writing AI that has left its researchers both impressed and concerned. 

OpenAI is a non-profit AI research company focused on “discovering and enacting the path to safe artificial general intelligence.”  

Co-founder Elon Musk is an outspoken advocate of safe AI development, having commented extensively, in public and online, on the risks such technology poses if not developed in a controlled environment.  

Citing ethical concerns about malicious applications of the technology, OpenAI has publicly released only a smaller model for researchers to use, alongside a technical paper describing its approach and findings thus far. The fully trained model will not be open-sourced at present. 

This is significant because AI has developed in an open-source culture, in which developers generally make their work publicly available so that other researchers and developers can test and improve on it.  

Musk has in the past expressed similar concerns over AI’s potential misuse.  

Perhaps related to such concerns, Musk recently tweeted that he left OpenAI, on good terms, due to internal disagreements.    

The AI, named GPT-2, was designed with the simple goal of predicting the next word in a sentence. This type of technology is commonly referred to as ‘predictive text’.

A user supplies a sample text, such as the opening paragraph of a book or a news snippet, and the AI then fills in the rest. The writing it produces is lengthy and often deceptively convincing. 
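
To make the idea of next-word prediction concrete, here is a deliberately tiny sketch. It is not how GPT-2 works internally (GPT-2 is a large neural network); it simply counts which word most often follows another in a small sample text and uses those counts to continue a prompt. The function names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    options = followers.get(word.lower())
    if not options:
        return None
    return options.most_common(1)[0][0]


def continue_text(followers, prompt, max_words=5):
    """Extend a non-empty prompt one predicted word at a time."""
    words = prompt.split()
    if not words:
        return prompt
    for _ in range(max_words):
        nxt = predict_next(followers, words[-1])
        if nxt is None:
            break  # the last word never appeared in the training text
        words.append(nxt)
    return " ".join(words)


# Hypothetical toy corpus, far smaller than any real training set.
corpus = "the cat sat on the mat and the cat ate the fish"
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(continue_text(model, "sat", max_words=2))
```

A real language model looks at far more than the previous word, which is why its continuations stay coherent over whole paragraphs rather than a few words.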

A dataset of 8 million webpages, sourced from outgoing links on Reddit, was used to train the AI.  

Only links with a sufficiently positive user rating were included, which imposed a basic quality standard on the sample data. 
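
The filtering step can be pictured with a short sketch. Everything here is an assumption made for illustration: the record layout, the example URLs, and the helper name `select_links` are hypothetical, and the threshold of 3 reflects the "at least 3 karma" figure as recalled from OpenAI's post rather than anything stated in this article. OpenAI's actual pipeline is not described here.

```python
MIN_KARMA = 3  # assumed threshold; adjust to whatever rating cut-off applies


def select_links(posts, min_karma=MIN_KARMA):
    """Keep each outgoing URL once, and only if its post met the rating bar."""
    seen = set()
    selected = []
    for post in posts:
        url = post["url"]
        if post["karma"] >= min_karma and url not in seen:
            seen.add(url)
            selected.append(url)
    return selected


# Hypothetical posts, each holding an outgoing link and a user rating.
posts = [
    {"url": "https://example.com/a", "karma": 12},
    {"url": "https://example.com/b", "karma": 1},
    {"url": "https://example.com/a", "karma": 5},   # duplicate link
    {"url": "https://example.org/c", "karma": 3},
]

print(select_links(posts))
```

Using reader ratings as a filter is a cheap stand-in for human curation: it costs nothing extra, but it also inherits whatever biases the voting community has.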

Because outgoing Reddit links lead to a multitude of text sources, this approach yields a far more diverse dataset than those of similar language models that sample from a single source, such as Wikipedia or books. 

OpenAI explains in a blog post that GPT-2 generates synthetic text samples in response to the model being primed with an arbitrary input. “The model is chameleon-like — it adapts to the style and content of the conditioning text. This allows the user to generate realistic and coherent continuations about a topic of their choosing…” 

An impressive and entertaining example of what the AI can produce was included in the blog post, which reads as follows: 


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. 

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow. 

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez. 

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns. 

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.” 

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America. 

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.” 

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

As is evident in the example above, GPT-2 can produce high-quality text that is, at its best, nearly indistinguishable from human writing.  

The blog post does acknowledge failure modes such as repetitive text, world modelling failures (“the model sometimes writes about fires happening under water”) and unnatural topic switching.  

This, coupled with the fact that it takes multiple attempts to produce a text of a convincing standard, means that GPT-2 is far from perfect, but it is still highly impressive and a step in the right direction for language model AI.  

“Overall, we find that it takes a few tries to get a good sample, with the number of tries depending on how familiar the model is with the context. When prompted with topics that are highly represented in the data (Brexit, Miley Cyrus, Lord of the Rings, and so on), it seems to be capable of generating reasonable samples about 50% of the time,” the blog explains.  
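
As a rough illustration of what "about 50% of the time" implies, the sketch below works out how quickly the odds of getting at least one reasonable sample rise with repeated tries. It assumes each try is independent and succeeds with the same probability, which is a simplification of this article's making, not a claim from OpenAI. The function names are hypothetical.

```python
def chance_of_good_sample(p_good, tries):
    """Probability that at least one of `tries` independent attempts succeeds."""
    if not 0.0 <= p_good <= 1.0 or tries < 0:
        raise ValueError("p_good must be in [0, 1] and tries must be >= 0")
    return 1.0 - (1.0 - p_good) ** tries


def expected_tries(p_good):
    """Average number of attempts until the first success (geometric mean)."""
    if not 0.0 < p_good <= 1.0:
        raise ValueError("p_good must be in (0, 1]")
    return 1.0 / p_good


for tries in (1, 2, 3):
    print(tries, chance_of_good_sample(0.5, tries))
print("expected tries:", expected_tries(0.5))
```

Under these assumptions, a 50% success rate means two tries on average, and three tries give a 87.5% chance of at least one reasonable sample. For less familiar topics the success rate is lower, so more attempts would be needed.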

However, the AI performs poorly on esoteric or highly technical content. This could be remedied in future by training the AI on topic-specific datasets to create specialized models.  

For instance, a system trained on datasets mostly concerning geographic information would create impressive texts on geography but fail at most other topics.  

“… large language models are becoming increasingly easy to steer towards scalable, customized, coherent text generation, which in turn could be used in a number of beneficial as well as malicious ways,” the blog notes. 

The AI can also answer questions regarding events and topics included in its datasets.  

An example is included in the blog post in which questions are asked regarding the 2008 Summer Olympics torch relay. It answered roughly half correctly.  

“We hypothesize that since these tasks are a subset of general language modeling, we can expect performance to increase further with more compute and data,” the post states.  

This technology is highly promising. Unsupervised translation between languages, highly efficient AI writing assistants and superior AI support systems for workplace sectors such as customer service and sales are just some examples of the AI’s potential.  

Conversely, malicious actors are likely to use this technology to produce fake news and advanced automated phishing and scam content.  

It is for these reasons that the OpenAI team is hesitant to release its fully trained GPT-2 model publicly.  

They assert that the AI community should engage in more nuanced discussion regarding responsible publication of new technologies to minimize malevolent usage. It is also noted that politicians may want to consider introducing legislation which penalizes the misuse of such AI technologies. 
