Why teaching A.I. to read is a lifelong endeavor

It’s not just tech giants that are using artificial intelligence to understand human language, so that products like digital assistants can respond to basic questions.

More conventional businesses are also increasingly using a subset of A.I. called natural language processing (NLP) to create more powerful software to help answer basic customer call center queries or create summaries of long, complicated documents. 

LexisNexis, for instance, has been using NLP to improve the legal research software that lawyers, journalists, and analysts use to find relevant court documents. It’s light-years ahead of the user-unfriendly Boolean search system I regularly used over a decade ago as a cub reporter.

With A.I., LexisNexis’ search interface is more intuitive. That’s partly because the company used Google’s free, open-source language model BERT as the foundation. The BERT model, trained on a vast amount of web data including Wikipedia pages, helps software understand that the same word can mean different things depending on the context in which it appears.

But LexisNexis can’t use BERT for all of its language needs because the company deals with information that is specific to the legal industry. This particular data can’t be found on the open web, which means the information doesn’t come baked into BERT.

Min Chen, vice president and chief technology officer for the LexisNexis Asia-Pacific and global search team, said that BERT “provides a good base model to start with.” But the company must fine-tune the technology with additional legal data so that it understands legal linguistics.

This fine-tuning is increasingly common for many companies operating in areas like finance or healthcare. Every industry has its own lingo that makes no sense in another context.

Chen said it took LexisNexis 12 months to train a version of BERT that understands case citations and even Latin. If someone wants to find a document showing that a case has been adjudicated, or closed, the technology knows to look for documents with the Latin term res judicata (claim preclusion, or a matter decided). 

As Amanda Stent, an NLP expert for financial news and information service Bloomberg, explained, technologies like BERT are important because they remove a lot of the grunt work required to train a language model from scratch. For a 10-word sentence, Stent said, “the combinations [of words] are astronomical,” and having a powerful language model like BERT as a starting point is very helpful.
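To get a rough sense of the scale Stent is describing, here is a back-of-the-envelope sketch. The 30,000-word vocabulary is an illustrative assumption (typical of BERT-style models), not a figure from her remarks:

```python
# Back-of-the-envelope illustration of why training a language model
# from scratch is so demanding: the space of possible word sequences
# explodes combinatorially with sentence length.

VOCAB_SIZE = 30_000   # assumed vocabulary size, typical for BERT-style models
SENTENCE_LENGTH = 10  # the 10-word sentence from Stent's example

combinations = VOCAB_SIZE ** SENTENCE_LENGTH
print(f"{combinations:.2e}")  # roughly 5.90e+44 possible sequences
```

No model could ever see more than a vanishing fraction of that space in training, which is why starting from a pretrained model like BERT, rather than from scratch, saves so much work.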

But as other A.I. researchers have pointed out, because language models are typically trained on Internet data, they sometimes parrot back the offensive text they’ve scanned. You’ll be happy to know that companies can take precautions to make this less likely.

Stent and her colleagues recently published a set of best practices that companies can follow when training A.I.-powered language models and other machine learning systems. They recommended using human subject-matter experts to help annotate and label the text used for training (to ensure data is labeled accurately) and ensuring that product managers and engineers coordinate on big projects (to help ensure that problems don’t slip through the cracks).

The goal is to catch such problems before companies introduce new products. After all, no user wants to be bombarded with vile language.

One thing companies should be prepared for is that data training projects are never done. There’s always room for improvement. 

Said Stent, “It never stops.”

Jonathan Vanian 
@JonathanVanian
jonathan.vanian@fortune.com
