CS520 Knowledge Graphs: What Should AI Know?

How to Create a Knowledge Graph from Text?


1. Introduction

Textual sources such as business and financial news, SEC filings, and websites such as the Wall Street Journal contain information of great value for numerous business tasks such as market research and business intelligence. We can use natural language processing (NLP) to process these sources and create knowledge graphs that support a variety of analytics. NLP, however, is a specialized and sophisticated technology, and our goal here is not to provide comprehensive and detailed coverage of it. Instead, we introduce some basic concepts of NLP that are useful to those whose primary goal is to create a knowledge graph. Several vendors have built businesses around processing natural language text and delivering structured data to others for consumption.

There are three NLP tasks that are directly relevant to knowledge graph construction: entity extraction, relation extraction, and entity resolution. Entity extraction is the task of identifying key entities of interest (e.g., organizations, people, places) in the text. These entities typically constitute the nodes of a knowledge graph. Relation extraction is the task of extracting, given two entities of interest and some text, the relations that hold between them (e.g., net sales, management team information); it is also sometimes used to extract properties of a given entity. The extracted relations and properties typically become relations or node properties in our knowledge graph. Entity resolution is the task of determining whether multiple mentions in a text refer to the same entity. For example, in a paragraph of text, "John Smith", "he", and "her father" may all refer to the same entity.

In this chapter, we give an overview of the techniques used for entity and relation extraction. We omit entity resolution from the discussion because it is an advanced topic for the scope of the present volume. Most currently popular techniques for entity and relation extraction rely on adapting a pre-trained language model to the task at hand. For our purposes, we treat the pre-trained language model and the machine learning techniques as black boxes, as both are now available as off-the-shelf commodities. This progress in NLP and machine learning allows knowledge graph creators to focus on the end product and on providing the training and evaluation data required to adapt the language models. We begin with an overview of language models, and then describe the entity and relation extraction tasks in greater detail.

2. Overview of Language Models

Language modeling is the task of predicting which word comes next in a text. For example, given the sentence fragment "students opened their", a language model predicts the next word that could complete the sentence; in this case, plausible next words are book, exam, laptop, etc. More formally, given a sequence of words x1, ..., xn-1, a language model predicts the probability P(xn | x1, ..., xn-1) for every word xn in the vocabulary. Language models are used extensively for auto-completing search queries on the web, auto-correcting words in word processors, and so on.

Modern language models are created by training a deep learning model, such as a recurrent neural network or a Transformer, on a large corpus of text. Numerous pre-trained language models are available as open-source products and can be adapted to the specific task at hand. As we discuss the techniques for entity and relation extraction, we will also describe how language models can be adapted for these tasks.
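
To make this concrete, the following minimal sketch queries an off-the-shelf pre-trained model for its next-word distribution. It assumes the Hugging Face transformers library and the publicly available GPT-2 checkpoint; any other causal language model could be substituted.

    # A minimal sketch of next-word prediction with a pre-trained language model.
    # Assumes the Hugging Face `transformers` library and the public GPT-2 checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    fragment = "students opened their"
    inputs = tokenizer(fragment, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits            # (1, num_tokens, vocab_size)

    # P(x_n | x_1, ..., x_{n-1}): distribution over the vocabulary for the
    # position that follows the fragment.
    next_token_probs = torch.softmax(logits[0, -1], dim=-1)
    top = torch.topk(next_token_probs, k=5)
    for prob, token_id in zip(top.values, top.indices):
        print(f"{tokenizer.decode(int(token_id)).strip():>10s}  {prob.item():.3f}")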

3. Entity Extraction

We will begin by considering a concrete example of entity extraction, and then give an overview of different approaches to entity extraction, and conclude the section by discussing some challenges in performing well at this task.

3.1 An Example of Entity Extraction

Let us consider the following piece of text from a news story:

    Cecilia Love, 52, a retired police investigator who lives in Massachusetts, said she paid around $370 a ticket with tax for nonstop United Airlines flights to Sacramento from Boston for her niece's high school graduation in June, 2020.    

We will consider the named entities in the above text. A named entity is anything that can be referred to with a proper name: a person, a location, an organization, etc. The definition of a named entity is commonly extended to include things that are not entities per se, such as dates, times, and other expressions involving time, and even numerical expressions such as prices. Here is the above text with the named entities marked:

    [PER Cecilia Love], 52, a retired police investigator who lives in [LOC Massachusetts], said she paid around [MONEY $370] a ticket with tax for nonstop [ORG United Airlines] flights to [LOC Sacramento] from [LOC Boston] for her niece's high school graduation in [TIME June, 2020].

The paragraph contains seven named entities: one person (indicated by PER), three locations (LOC), one money amount (MONEY), one organization (ORG), and one time expression (TIME). Depending on the domain of application, we may introduce more or fewer named entity types. For example, in the task of identifying key terms in a text, there is only one entity type, which captures a key term.

The entity extraction task is useful in many different applications. For example, in question answering, it can help pull out answers from a retrieved passage of text. In a word processing application, it can help automatically connect the entities appearing in the text with additional information about them (e.g., definitions, facts, etc.).

3.2 Approaches to Entity Extraction

First and foremost, we can view entity extraction as a labeling problem: we associate a label with each word, and the task becomes one of predicting the labels. Entity extraction can be performed using three broad approaches: sequence labeling, deep learning models, and rule-based approaches. We will briefly introduce each of these approaches.

To facilitate the labeling, we use a labeling scheme known as BIOES, in which the tags have the following meanings: B marks the beginning of an entity, I the interior of an entity, O a word that is not part of any entity, E the end of an entity, and S a single-word entity. As an example, the words in the text snippet shown above are labeled as shown below.

Cecilia/B Love/E ,/O 52/O ,/O
a/O retired/O police/O investigator/O who/O
lives/O in/O Massachusetts/S ,/O said/O
she/O paid/O around/O $370/S a/O
ticket/O with/O tax/O for/O nonstop/O
United/B Airlines/E flights/O to/O Sacramento/S
from/O Boston/S for/O her/O niece's/O
high/O school/O graduation/O in/O June/B
,/I 2020/E
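
As an illustration of the scheme, the following sketch converts token-level entity spans into BIOES tags; the tokenization and the span format are assumptions made for this example rather than a fixed standard.

    # Convert labeled entity spans into one BIOES tag per token.
    def bioes_tags(tokens, entity_spans):
        """entity_spans: list of (start, end) token indices, end inclusive."""
        tags = ["O"] * len(tokens)
        for start, end in entity_spans:
            if start == end:
                tags[start] = "S"              # single-word entity
            else:
                tags[start] = "B"              # beginning of the entity
                tags[end] = "E"                # end of the entity
                for i in range(start + 1, end):
                    tags[i] = "I"              # interior words
        return tags

    tokens = ["United", "Airlines", "flights", "to", "Sacramento"]
    print(list(zip(tokens, bioes_tags(tokens, [(0, 1), (4, 4)]))))
    # [('United', 'B'), ('Airlines', 'E'), ('flights', 'O'), ('to', 'O'), ('Sacramento', 'S')]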

In the sequence labeling approach, we train a machine learning algorithm (for example, Conditional Random Fields) using features such as part of speech, presence of the word in a list of standard words, word embeddings, the word's base form, whether the word contains a particular prefix or suffix, whether it is in all caps, etc. Significant effort is needed in feature engineering, because the performance of a particular choice of features and machine learning algorithm can vary by domain.
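
As a concrete illustration, the sketch below computes a few such hand-engineered features for each token; the particular feature set, and the mention of the sklearn-crfsuite library, are illustrative assumptions rather than recommendations.

    # Per-token features of the kind used in the sequence labeling approach.
    def token_features(tokens, i):
        word = tokens[i]
        return {
            "word.lower": word.lower(),
            "word.isupper": word.isupper(),     # all caps?
            "word.istitle": word.istitle(),     # capitalized?
            "word.isdigit": word.isdigit(),
            "prefix3": word[:3],
            "suffix3": word[-3:],
            "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",
            "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        }

    # With feature dicts and BIOES tags in hand, a CRF can be trained, e.g.:
    #   import sklearn_crfsuite
    #   crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
    #   crf.fit(X_train, y_train)  # X_train: lists of feature dicts per sentence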

In a deep learning approach, there is no feature engineering: we simply input word embeddings to a language model. Instead of predicting the next word, the language model now predicts one of the five tags (B, I, O, E, S) required for entity recognition. To adapt the language model to this new task, we first pre-train it on a corpus from the domain, and then fine-tune it for the task at hand. In the task-specific training, each input sentence is wrapped with a distinguished token [CLS] that marks the beginning of the input sequence and a second distinguished token [SEP] that marks its end, and the model is trained to predict a tag for each token in between. Such predictions are enough for us to produce one of the five required tags for each word.
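
The following sketch shows what such an adaptation might look like with an off-the-shelf pre-trained model. It assumes the Hugging Face transformers library and the bert-base-cased checkpoint; the checkpoint and the label set are illustrative choices, and the classification head still needs to be fine-tuned on BIOES-annotated sentences before its predictions are meaningful.

    # A sketch of adapting a pre-trained language model for BIOES tagging.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    labels = ["B", "I", "O", "E", "S"]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-cased", num_labels=len(labels)
    )

    # The tokenizer wraps the sentence with [CLS] and [SEP] automatically.
    inputs = tokenizer("United Airlines flights to Sacramento", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits        # (1, num_wordpieces, num_labels)
    predictions = logits.argmax(dim=-1)[0]     # one label index per wordpiece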

Finally, in a rule-based approach, one specifies labeling rules in a formal rule language. The rules can include regular expressions, references to dictionaries, and semantic constraints, and may also invoke automated extractors, reference table structures, or machine learning modules for specific subtasks. Rule application can be sequenced so that we first apply high-precision rules, then look up standard name lists, then apply language-based heuristics, and, when all else fails, resort to probabilistic machine learning techniques.
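
As a toy example of a high-precision rule, the following regular expression labels dollar amounts as MONEY entities; production rule languages layer dictionaries, semantic constraints, and rule sequencing on top of patterns like this.

    import re

    # A high-precision pattern for dollar amounts such as "$370" or "$1,299.99".
    MONEY = re.compile(r"\$\d[\d,]*(?:\.\d+)?")

    text = "she paid around $370 a ticket with tax"
    print([(m.group(), "MONEY") for m in MONEY.finditer(text)])
    # [('$370', 'MONEY')]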

3.3 Challenges in Entity Extraction

Even though entity extractors may exhibit precision and recall above 90% for specific tasks, obtaining good performance across all domains is challenging. In this section, we consider a few of the challenges faced during entity extraction.

While labeling entities with their classes, there are numerous cases of ambiguity. For example, the entity Louis Vuitton can refer to a person, an organization, or a commercial product. Resolving such ambiguities is not possible unless considerable context is taken into account.

To use a machine learning model, we require a tremendous amount of labeled data. In practice, either the data are not available or they are largely incomplete, and training a model on incomplete data seriously affects its performance.

One variation of the entity extraction task is to identify key phrases in the text. As key phrases are not limited to a few classes, the task of identifying the specific class corresponding to a key phrase becomes even more challenging. Sometimes the key phrases are overly complex (e.g., duplication of a cell by fission), and sometimes too general (e.g., attach), making it very challenging to apply a uniform technique across the board.

Entities can appear in many different surface forms, for example, synonyms, acronyms, plurals, and morphological variations. In general, entity extraction should draw on lexical knowledge, which usually does not exist when entering a new domain. As a result, lexicon extraction becomes an essential related task for improving the performance of entity extraction.

4. Relation Extraction

We will begin by considering several concrete examples of relation extraction, then give an overview of different approaches to relation extraction, and conclude the section by discussing some challenges in performing well at this task.

4.1 Examples of Relation Extraction

Considering the text snippet from the previous section, we can extract relations such as Cecilia Love lives in Massachusetts, United Airlines flies from Boston, and United Airlines flies to Sacramento. In a typical relation extraction task, the entities have been previously identified, and in this sense it extends the entity extraction task. The relations to be extracted are also specified ahead of time.

A popular instance of the relation extraction task is extracting information from Wikipedia infoboxes; an obvious application is improving web search results. Wikipedia infoboxes define relationships such as preceded by, succeeded by, children, spouse, etc. Achieving high accuracy on this task can be challenging because of numerous corner cases. For example, Larry King has been married multiple times, and the extractor must therefore take into account the time period during which each marriage existed.

There also exist domain-specific relationships. For example, the Unified Medical Language System (UMLS) supports relationships such as causes, treats, and disrupts. Apart from standard relationships such as subclass-of and has-part, the relationships to be extracted are domain-specific and usually require some up-front design work. There are approaches to relation extraction that do not require choosing the relationships ahead of time, but their usefulness in practice has been found to be limited.

4.2 Approaches to Relation Extraction

There are three broad approaches to relation extraction: syntactic patterns, various forms of supervised machine learning, and unsupervised machine learning. As noted in the previous section, unsupervised machine learning has limited use in practice; therefore, we primarily consider the use of syntactic patterns and supervised machine learning for relation extraction.

A classical approach to relation extraction relies on syntactic patterns known as Hearst patterns. For example, consider the following sentence:

    The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string.     

Even though we may never have heard of the Bambara ndang, we can still infer that it is a kind of bow lute. More generally, we can identify syntactic patterns that are strong indicators of the subclass-of relationship. The following five syntactic patterns have been in use for quite some time and have been found extremely effective in practice.

Pattern name    Example
such as         ... works by authors such as Herrick, Goldsmith, and Shakespeare ...
or other        Bruises, wounds, broken bones, or other injuries ...
and other       ... temples, treasuries, and other civic buildings ...
including       All common law countries including Canada and England ...
especially      Most European countries especially France, England, and Spain ...
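
To illustrate, the sketch below renders the "such as" pattern as a regular expression over raw text. Real implementations typically operate on part-of-speech tags or noun-phrase chunks rather than surface words, so this is only a rough approximation of the pattern.

    import re

    # A rough rendering of the "such as" Hearst pattern: a head noun followed
    # by a comma-separated list of examples.
    SUCH_AS = re.compile(r"(\w+) such as ([\w, ]+)")

    text = "... works by authors such as Herrick, Goldsmith, and Shakespeare ..."
    m = SUCH_AS.search(text)
    if m:
        hypernym = m.group(1)                  # "authors"
        hyponyms = [h.strip() for h in re.split(r",|\band\b", m.group(2)) if h.strip()]
        for h in hyponyms:
            print((h, "subclass-of", hypernym))
    # ('Herrick', 'subclass-of', 'authors')
    # ('Goldsmith', 'subclass-of', 'authors')
    # ('Shakespeare', 'subclass-of', 'authors')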

We can discover new syntactic patterns for any relationship of choice as follows. We first collect example entity pairs for which the relationship is known to hold. We then look for sentences in the corpus that mention both entities of a pair. By identifying commonalities across such sentences, we can define a new pattern, and we can then test the pattern against the corpus.

The supervised approaches to relation extraction require extensive training data. Whenever such training data is available, many off-the-shelf learning algorithms can be trained. In the absence of adequate data, weak supervision approaches are becoming a popular way to arrive at the required data. The basic idea of weak supervision is to write several approximate labeling functions that generate training labels automatically; these weak labels are then combined using a probabilistic model.

As an example of a weak labeling function, consider the has-part relation. For this relation, it has not been possible to develop syntactic patterns of the sort suggested above. A possible weak labeling function is to first produce a dependency parse of the sentence, and then look for two nodes in the parse connected by a path of length one that contains the verb has or have. For taxonomic relationships, an additional weak labeling function is that if two entities end with the same base word, but one has an additional modifier in front of it, this suggests a taxonomic relation (e.g., eukaryotic cell SUBCLASS cell).
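
The sketch below writes two such labeling functions in plain Python: the first mirrors the shared-base-word heuristic for taxonomic relations, and the second is a cruder surface-level stand-in for the dependency-path heuristic. In practice, frameworks in the spirit of Snorkel combine many noisy functions like these with a probabilistic model.

    SUBCLASS, HAS_PART, ABSTAIN = "SUBCLASS", "HAS_PART", None

    def lf_shared_head(term1, term2, sentence):
        # "eukaryotic cell" and "cell" share the same base word -> SUBCLASS.
        if term1 != term2 and term1.split()[-1] == term2.split()[-1]:
            return SUBCLASS
        return ABSTAIN

    def lf_has_verb(term1, term2, sentence):
        # Crude stand-in for the dependency-path heuristic: both terms occur
        # in a sentence that also contains the verb "has" or "have".
        lowered = f" {sentence.lower()} "
        has_verb = " has " in lowered or " have " in lowered
        if term1.lower() in lowered and term2.lower() in lowered and has_verb:
            return HAS_PART
        return ABSTAIN

    print(lf_shared_head("eukaryotic cell", "cell", ""))                            # SUBCLASS
    print(lf_has_verb("cell", "cell membrane", "All cells have a cell membrane."))  # HAS_PART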

To adapt a language model to the task of relation extraction, we change the input representation of a sentence so that each of the two terms is explicitly marked. For example, we input a sentence such as ["All", "[TERM1-START]", "cells", "[TERM1-END]", "have", "a", "[TERM2-START]", "cell", "membrane", "[TERM2-END]", "."], and expect the model to output a probability distribution over the different relationships that might exist between the two terms. The predicted relationship depends on the training data provided.
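
A minimal sketch of this setup is shown below, assuming the Hugging Face transformers library, the bert-base-cased checkpoint, and an illustrative label set; the marker tokens are added to the tokenizer's vocabulary, and the model would be fine-tuned on labeled sentence/relation pairs before its predictions carry any meaning.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    relations = ["has-part", "subclass-of", "no-relation"]   # illustrative label set
    markers = ["[TERM1-START]", "[TERM1-END]", "[TERM2-START]", "[TERM2-END]"]

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    tokenizer.add_special_tokens({"additional_special_tokens": markers})

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(relations)
    )
    model.resize_token_embeddings(len(tokenizer))  # make room for the marker tokens

    sentence = "All [TERM1-START] cells [TERM1-END] have a [TERM2-START] cell membrane [TERM2-END] ."
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    # `probs` is the predicted distribution over `relations` for the marked pair.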

4.3 Challenges in Relation Extraction

The primary challenge in relation extraction is obtaining the necessary training data. The weak supervision approach is quite promising because the training data need not be perfect. One can resort to sources such as Wikidata and WordNet for data with which to define weak labeling functions. Developing new approaches for weak labeling functions is a topic of ongoing research.

We also need a good workflow to validate the results of the extractors. Quite often the validation can be done through crowdsourcing, and one can prioritize the validation of those extracted relations where the confidence is low. Developing such active learning loops is another important current research challenge.

5. Summary

In this chapter, we considered the problem of creating a knowledge graph by processing text. We addressed the two fundamental problems of entity extraction and relation extraction. For both tasks, early work focused on manually defined syntactic rules for extraction; more recent popular approaches rely on adapting pre-trained language models and customizing them to the specific corpus at hand.

For both entity and relation extraction, the most prevalent current approach is to adapt a language model that has been previously created using deep learning. Existing syntactic and rule-based approaches play an important role in bootstrapping the training data required for the deep learning models. Validating the output of extraction remains an ongoing challenge.

The problem of entity linking, or entity resolution, is an equally important problem for creating knowledge graphs, but we have chosen to omit it from the discussion here for two reasons. First, we believe that getting good performance on entity and relation extraction is, by itself, a significant hurdle. Second, a prerequisite for good performance on entity linking is the availability of a good lexicon. For these reasons, entity resolution is an advanced technique, and it may or may not be the primary bottleneck in solving the business problem for which we are creating the knowledge graph.