1. Introduction
Textual sources—ranging from financial news and SEC filings to
digital publications like the Wall Street Journal—contain a
wealth of data critical for market research and business
intelligence. By leveraging Natural Language Processing (NLP), we can
transform this unstructured text into knowledge graphs to enable
advanced analytics.
Because NLP is a vast and evolving field, this section does not aim
to provide an exhaustive overview. Instead, we focus on the core NLP
concepts essential for building knowledge graphs. Understanding these
fundamentals is increasingly important regardless of the specific NLP
method used for information extraction. Many specialized data vendors
have created dedicated systems to extract structured data from the
vast sea of natural language text.
To construct a robust knowledge graph from unstructured text, we primarily rely on three fundamental NLP tasks:
- Entity Extraction: Also known as Named Entity Recognition
(NER), this is the process of identifying and classifying key entities
within the text—such as Organizations, People, or Locations. In the
context of a knowledge graph, these extracted entities serve as the
primary nodes.
- Relation Extraction: Once entities are identified, the next
step is to determine the associations between them. For example, from
a financial report, we might extract the relation "is CEO of" between
a Person entity and an Organization entity. This process is also used
to identify specific properties of an entity, such as "Net Sales" or
"Headquarters." These relations and properties form the edges or node
attributes within the graph.
- Entity Resolution: This task ensures that multiple mentions
across a text refer to the correct single entity. This involves
coreference resolution (linking "John Smith" to the pronoun "he" later
in a paragraph) and entity linking (recognizing that "Apple," "Apple
Inc.," and "the Cupertino-based tech giant" all refer to the same
node). Proper resolution prevents the creation of duplicate nodes and
ensures the graph remains accurate and connected.
In this chapter, we provide an overview of the techniques used for
entity and relation extraction. We have omitted entity resolution from
this discussion, as it is an advanced topic that falls outside the
scope of this volume.
Most modern extraction techniques rely on adapting pre-trained
language models to specific tasks. For our requirements, we treat
these models and their underlying machine learning architectures as
"black boxes," as they are now widely available as accessible,
off-the-shelf tools. This shift in the NLP landscape allows knowledge
graph creators to focus on the final product—the graph itself—rather
than the intricacies of model architecture. Instead, the primary
responsibility of the developer shifts toward providing high-quality
training and evaluation data to fine-tune these models.
We will begin with a high-level overview of language models before
diving into the specifics of the entity and relation extraction
tasks.
2. Overview of Language Models
Language modeling is the task of predicting the next word in a
sequence based on the words that preceded it. For example, given a
sentence fragment: "students opened their", a language model
predicts the most likely subsequent words, such as "book", "exam", or
"laptop".
More formally, given a sequence of words x_1, ..., x_{n-1}, a
language model calculates the probability P(x_n | x_1, ..., x_{n-1})
for every word x_n in its vocabulary. This technology is the backbone
of familiar tools like search engine autocomplete, smartphone
autocorrect, and modern generative AI.
To understand how these probabilities are calculated, consider the
following tiny corpus consisting of only three sentences.
- dogs chase cats
- cats love milk
- dogs love people
This corpus has six distinct words: dogs, chase, cats, love, milk,
and people, occurring in the corpus 2, 1, 2, 2, 1, and 1 times
respectively. We can also observe that every pair of adjacent words
that appears in the corpus occurs exactly once. For example, love is
followed by milk exactly once. To calculate the probability
P(milk|love), we take the ratio of count(love milk) to count(love),
where count(love milk) denotes the number of times love is followed
by milk in the corpus (i.e., 1), and count(love) denotes the number
of times love appears in the corpus (i.e., 2). The result is
1/2 = 0.5.
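These counts are simple enough to compute directly. The following
sketch builds the bigram counts for our tiny corpus and reproduces
the calculation of P(milk|love):

```python
from collections import Counter

corpus = ["dogs chase cats", "cats love milk", "dogs love people"]

# Count unigrams and bigrams (adjacent word pairs) across the corpus.
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_probability(prev_word, word):
    """P(word | prev_word) = count(prev_word word) / count(prev_word)."""
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_probability("love", "milk"))  # 1 / 2 = 0.5
```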
Modern language models are created by training a deep learning
model, such as a recurrent neural network or a Transformer, on a large corpus of text.
Numerous variations of pre-trained language models are available as
open source products that can be adapted for the purpose of the
specific task at hand. As we explore the techniques for entity and
relation extraction in the following sections, we will describe how
these models are adapted to transform raw sentences into the
triples required for a knowledge graph.
3. Entity Extraction
We begin with a concrete example of entity extraction, then give an
overview of different approaches to the task, and conclude the
section by discussing some of the challenges in performing it well.
3.1 An Example of Entity Extraction
A named entity is generally defined as anything that can be
referred to by a proper name, such as a person, location, or
organization. In practice, this definition is often extended to
include "entity-like" values such as dates, times, currencies, and
numerical expressions.
Consider the following sentence from a news story:
Cecilia Love, 52, a retired police investigator who lives in
Massachusetts, said she paid around $370 a ticket with tax for
nonstop United Airlines flights to Sacramento from Boston for her
niece's high school graduation in June, 2020.
To a computer, this sentence is just a string of characters. An
entity extraction model "extracts" the entities by tagging the
segments of text with their corresponding types:
[PER Cecilia Love], 52, a retired police investigator who lives in
[LOC Massachusetts], said she paid around [MONEY $370] a ticket
with tax for nonstop [ORG United Airlines] flights to
[LOC Sacramento] from [LOC Boston] for her niece's high school
graduation in [TIME June, 2020].
The paragraph contains seven named entities, one of which is a
person (indicated by PER), three are locations (indicated by LOC), one
is money (indicated by MONEY), one is an organization (indicated by
ORG), and one is a time (indicated by TIME). Depending on the domain
of application, we may introduce more or fewer named entity types. For
example, in the task of identifying key terms in a text, a single
entity type captures a key term.
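Off-the-shelf NER tools make it easy to experiment with this kind of
tagging. The sketch below uses the open-source spaCy library
(assuming its small English model, en_core_web_sm, has been
downloaded); note that its label set (PERSON, GPE, DATE, etc.)
differs slightly from the PER/LOC/TIME tags used above:

```python
import spacy

# Assumes the small English pipeline has been installed via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = ("Cecilia Love, 52, a retired police investigator who lives in "
        "Massachusetts, said she paid around $370 a ticket with tax for "
        "nonstop United Airlines flights to Sacramento from Boston for "
        "her niece's high school graduation in June, 2020.")

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g., "Cecilia Love PERSON", "Massachusetts GPE"
```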
Entity extraction is a versatile tool used across many modern
software applications.
- Question Answering (QA): When a user asks a specific question,
entity extraction helps the system isolate potential answers from a
retrieved passage. For instance, if a user asks, "Which airline did
Cecilia fly?", the system uses NER to identify [ORG United Airlines]
as the target answer.
- Semantic Augmentation: In word processing or web browsing, entity
extraction can identify entities in real time and provide
"hover-over" definitions, historical facts, or direct links to
external databases like Wikipedia or a corporate CRM.
- Knowledge Discovery: By extracting entities across thousands of
documents, researchers can identify hidden patterns, such as a
specific person appearing frequently in proximity to a particular
location or organization.
3.2 Approaches to Entity Extraction
At its core, entity extraction is framed as a sequence
labeling problem. In this framework, we associate a
label with every word (or token) in a sentence, and the
model's task is to predict the most likely label for each.
We can perform entity extraction using three broad approaches:
- Classical Sequence Labeling: These are statistical models, such as
Conditional Random Fields (CRF), that analyze features of a word and
its neighbors to determine the most probable sequence of labels.
- Deep Learning Models: Modern approaches use neural architectures
like Transformers to learn complex linguistic patterns. These models
represent the current state of the art and are highly effective at
understanding context.
- Rule-Based Approaches: These rely on predefined patterns, regular
expressions, or dictionaries (gazetteers). They are particularly
useful for highly structured data such as phone numbers, dates, or
standardized product codes.
To facilitate the labeling, we introduce a labeling scheme known as
BIOES, in which the tags have the following meanings: B marks the
beginning of an entity, I the interior of an entity, O a word that is
not part of any entity, E the end of an entity, and S a single-word
entity. As an example, the words in the text snippet shown above
would be labeled as shown below.
Cecilia/B Love/E ,/O 52/O ,/O a/O retired/O police/O investigator/O
who/O lives/O in/O Massachusetts/S ,/O said/O she/O paid/O around/O
$370/S a/O ticket/O with/O tax/O for/O nonstop/O United/B Airlines/E
flights/O to/O Sacramento/S from/O Boston/S for/O her/O niece's/O
high/O school/O graduation/O in/O June/B ,/I 2020/E
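Converting labeled entity spans into BIOES tags is a mechanical step
that is easy to script. A minimal helper (the token indices below are
illustrative, with inclusive span boundaries):

```python
def to_bioes(tokens, spans):
    """Convert entity spans, given as inclusive (start, end) token
    indices, into one BIOES tag per token."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if start == end:
            tags[start] = "S"          # single-word entity
        else:
            tags[start] = "B"          # beginning of entity
            tags[end] = "E"            # end of entity
            for i in range(start + 1, end):
                tags[i] = "I"          # interior of entity
    return tags

tokens = ["United", "Airlines", "flights", "to", "Sacramento"]
print(to_bioes(tokens, [(0, 1), (4, 4)]))
# ['B', 'E', 'O', 'O', 'S']
```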
In the sequence labeling approach, we train a
statistical model—such as conditional random fields (CRF)—to predict
the correct BIOES tag for each token. This method is characterized by
a heavy reliance on feature engineering, where developers must
manually identify and extract relevant attributes from the text to
guide the model's decisions. These features often include linguistic
attributes like part-of-speech tags and the base form of the word, as
well as orthographic patterns such as whether the word is in all-caps,
contains digits, or possesses specific prefixes and
suffixes. Furthermore, models may incorporate lexical lookups against
a gazetteer (a list of known entities) or leverage word
embeddings to capture semantic context. A significant challenge of
this approach is that the performance can vary drastically depending
on the application domain and the choice of features. As a result,
moving from a general news corpus to a specialized field like medicine
or law requires substantial effort to "hand-craft" and tune a new
feature set that can accommodate the specific terminologies and
structures of that domain.
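To make the notion of feature engineering concrete, the sketch below
shows the kind of per-token feature dictionary one might hand-craft
for a CRF tagger (the particular feature set is an illustrative
choice, in the dictionary style accepted by libraries such as
sklearn-crfsuite):

```python
def word_features(tokens, i):
    """Hand-crafted features for token i, of the kind a CRF consumes."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),       # base form of the word
        "word.istitle": word.istitle(),   # orthographic patterns
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],              # specific prefixes and suffixes
        "suffix3": word[-3:],
    }
    # Features of the neighboring words give the model local context.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features

tokens = ["Cecilia", "Love", "lives", "in", "Massachusetts"]
print(word_features(tokens, 0)["word.istitle"])  # True
```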
In a deep learning approach, there is no feature engineering; we
simply input word embeddings to a language model. Instead of
predicting the next word, the language model now predicts one of the
five tags (B, I, O, E, S) required for entity recognition. To adapt
the language model to this new task, we first pre-train it on a
corpus for the target domain, and then fine-tune it on the task at
hand. In the task-specific training, each input is wrapped in a
distinguished token [CLS], which marks the beginning of the input
sequence, and a second distinguished token [SEP], which marks its
end. Trained this way, the model predicts a tag for every token
between these markers, which is enough for us to produce one of the
five required tags for each word.
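As an illustration of this adaptation, the sketch below loads a
pre-trained BERT model with a five-label token classification head
using the Hugging Face transformers library; the head is untrained
here, so in practice the model would first be fine-tuned on
BIOES-labeled sentences:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Five BIOES labels; the classification head maps each token to one of them.
labels = ["B", "I", "O", "E", "S"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)

# The tokenizer wraps the input with [CLS] and [SEP] automatically; the
# model then scores every token position, including those markers.
inputs = tokenizer("Cecilia Love lives in Massachusetts .", return_tensors="pt")
logits = model(**inputs).logits    # shape: (1, sequence_length, 5)
predicted = logits.argmax(dim=-1)  # one label id per token position
print([labels[i] for i in predicted[0].tolist()])
```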
Finally, in a rule-based approach, one specifies labeling rules in a
formal query language. The rules can include regular expressions,
references to dictionaries, and semantic constraints, and may also
invoke automated extractors, reference table structures, or machine
learning modules for specific tasks. Rule application can be
sequenced so that we first apply high-precision rules, followed by
lookups in standard name lists, followed by language-based
heuristics, and, when all else fails, resort to probabilistic machine
learning techniques.
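A minimal sketch of such a rule cascade, with a high-precision
regular expression applied before a gazetteer lookup (the gazetteer
contents and the rules themselves are illustrative):

```python
import re

# A tiny gazetteer; real systems use much larger curated lists.
AIRLINES = {"United Airlines", "Delta Air Lines"}

def rule_based_tags(text):
    entities = []
    # High-precision rule first: currency amounts like $370 or $1,200.50.
    for m in re.finditer(r"\$\d[\d,]*(?:\.\d+)?", text):
        entities.append((m.group(), "MONEY"))
    # Then a dictionary lookup against the gazetteer.
    for name in AIRLINES:
        if name in text:
            entities.append((name, "ORG"))
    return entities

print(rule_based_tags("She paid around $370 for United Airlines flights."))
# [('$370', 'MONEY'), ('United Airlines', 'ORG')]
```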
3.3 Challenges in Entity Extraction
Although entity extractors can achieve precision and recall above
90% on specific tasks, maintaining strong performance across diverse
domains remains challenging. Differences in vocabulary, writing style,
and underlying assumptions often cause methods that work well in one
setting to degrade in another. In this section, we examine several key
challenges encountered in entity extraction.
When labeling entities with semantic classes, ambiguity often
arises. For example, the name Louis Vuitton can refer to a person, an
organization, or a commercial product.
Resolving such ambiguities typically requires analyzing surrounding
context, such as nearby words, sentence structure, and the broader
topic of the text.
Machine learning models typically require large amounts of labeled
training data. In practice, such data may be unavailable or
largely incomplete. Training models on incomplete or
biased datasets can substantially degrade their performance and limit
their ability to generalize.
A related variation of entity extraction is key phrase
identification, which aims to extract salient phrases from text rather
than instances of a small, fixed set of entity classes. Because key
phrases can belong to many possible categories—or may not fit cleanly
into any predefined class—identifying key phrases is more
difficult. In addition, key phrases can vary widely in complexity,
ranging from highly specific expressions (e.g., duplication of a cell
by fission) to very general terms (e.g., attach), making it difficult
to design a single technique that performs well in all cases.
Entities can appear in many different surface forms, including
synonyms, acronyms, plural forms, and other morphological
variations. For example, the organization Louis Vuitton may also be
referred to simply as LV in text, while key phrases such as
duplication of a cell by fission may appear in shortened or rephrased
forms like cell fission or cell duplication. Effective entity
extraction therefore requires access to lexical knowledge that
captures these variations—knowledge that is often unavailable when
working in a new domain. Consequently, lexicon extraction, the task of
automatically identifying relevant terms and their variants, becomes
an important complementary problem for improving entity extraction
performance.
4. Relation Extraction
In this section, we begin with several concrete examples of
relation extraction to build intuition for the task. We then provide
an overview of common approaches used to extract relations from text,
and conclude by discussing the key challenges involved in achieving
strong performance.
4.1 Examples of Relation Extraction
Consider again the text snippet from the previous section. From it,
we can extract relations such as Cecilia Love lives in
Massachusetts, United Airlines flies from Boston, and United
Airlines flies to Sacramento. In a typical relation extraction task, the
relevant entities are assumed to have already been identified;
relation extraction therefore builds directly on entity extraction. In
addition, the set of relations to be extracted (e.g., lives in, flies
from, flies to) is usually defined in advance.
A common example of a relation extraction task is extracting
information from Wikipedia infoboxes, which can be used to improve
web search results. Wikipedia infoboxes define relationships such as
preceded by, succeeded by, children, spouse, etc.
Achieving high accuracy on this task is challenging because of
numerous corner cases. For example, Larry King was married multiple
times, and therefore the extractor must take into account the time
period during which each marriage existed.
Relation extraction is often applied to extract domain-specific
relationships. For example, the Unified Medical Language System
supports relationships such as causes, treats, disrupts, etc. In
addition to standard relations like subclass-of and has part,
extracting domain-specific relationships requires careful design and
selection of the relationships ahead of time. Some approaches attempt
to extract relations without specifying them in advance, but in
practice these methods are generally less useful for producing
accurate and meaningful results.
4.2 Approaches to Relation Extraction
There are three broad approaches to relation extraction: syntactic
patterns, supervised machine learning, and unsupervised machine
learning. As discussed earlier, unsupervised machine learning has
limited use in practice. Therefore, we will primarily consider the
use of syntactic patterns and supervised machine learning for
relation extraction.
A classical approach to extracting relations relies on syntactic
patterns known as Hearst patterns, which are designed to identify
specific semantic relationships in the text. For example, consider
the following sentence:

The bow lute, such as the Bambara ndang, is plucked and has an
individual curved neck for each string.
Even though we may never have heard of the Bambara ndang, we can
still infer that it is a kind of bow lute. More generally, we can
identify syntactic patterns that are strong indicators of the
subclass-of relationship. The following five syntactic patterns for
identifying subclass-of relations are well established and have been
shown to be highly effective in practice.
| Pattern Name | Example |
| such as | ... works by authors such as Herrick, Goldsmith, and Shakespeare ... |
| or other | ... bruises, wounds, broken bones, or other injuries ... |
| and other | ... temples, treasuries, and other civic buildings ... |
| including | ... all common law countries including Canada and England ... |
| especially | ... most European countries, especially France, England, and Spain ... |
New syntactic patterns for extracting relationships can be
discovered using a bootstrapping approach. First, we collect a small
set of entity pairs for which the relationship is already known. We
then search for sentences in a corpus where these pairs co-occur. By
identifying common structures across such sentences, we can define new
patterns, which are then tested against the corpus to extract
additional entity pairs.
A well-known algorithm for this approach is Dual Iterative Pattern
Relation Expansion (DIPRE). Consider the task of extracting (author,
title) relationships. We start with a small set of known pairs
(the seed pairs) and locate all sentences containing these pairs. From these
sentences, we generate new syntactic patterns. The algorithm
recursively uses the newly discovered patterns to identify additional
entity pairs, which in turn are used to generate further patterns.
For example, given the seed pair (William Shakespeare, The Comedy of
Errors) and the following sentences,
- The Comedy of Errors, by William Shakespeare, was ...
- The Comedy of Errors, by William Shakespeare, is ...
- The Comedy of Errors, one of William Shakespeare's earliest attempts ...
- The Comedy of Errors, one of William Shakespeare's most ...
we can derive the following patterns:
- ?x, by ?y,
- ?x, one of ?y's
Using the newly derived patterns, the extraction process continues
recursively.
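The following sketch illustrates the core DIPRE loop on a toy corpus:
it derives patterns from a seed pair and then applies them to find
new pairs. A real implementation would also keep prefix and suffix
context and score pattern reliability; this simplified version keeps
only the text between the two mentions:

```python
import re

corpus = [
    "The Comedy of Errors, by William Shakespeare, was an early play.",
    "Hamlet, by William Shakespeare, is a tragedy.",
    "The Comedy of Errors, one of William Shakespeare's earliest attempts.",
]
seeds = {("The Comedy of Errors", "William Shakespeare")}  # (title, author)

def derive_patterns(corpus, pairs):
    """Collect the text between a known title (?x) and author (?y) mention."""
    patterns = set()
    for title, author in pairs:
        for sentence in corpus:
            i, j = sentence.find(title), sentence.find(author)
            if i != -1 and j != -1 and i < j:
                middle = sentence[i + len(title):j]
                patterns.add("?x" + middle + "?y")
    return patterns

def apply_patterns(patterns, corpus):
    """Match each pattern against the corpus to extract new (title, author) pairs."""
    pairs = set()
    for pattern in patterns:
        regex = re.escape(pattern).replace(r"\?x", "(.+?)").replace(r"\?y", r"([\w .]+)")
        for sentence in corpus:
            pairs.update(re.findall(regex, sentence))
    return pairs

patterns = derive_patterns(corpus, seeds)  # e.g., {'?x, by ?y', '?x, one of ?y'}
print(apply_patterns(patterns, corpus))
# picks up the new pair ('Hamlet', 'William Shakespeare') alongside the seed
```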
Supervised approaches to relation extraction require large amounts
of labeled training data. When such data is available, standard
machine learning algorithms can be trained to extract relationships
effectively. However, in many domains, obtaining sufficient labeled
data is difficult. To address this, weak supervision techniques have
become popular. The basic idea of weak supervision is to write several
approximate labeling functions that can automatically generate noisy
training data. These weak labels are then combined using a
probabilistic model to produce a final set of training labels, which
can be used to train a supervised relation extraction system.
As an example of a weak labeling function, consider the has
part relation. For this relation, it has been difficult to
develop reliable syntactic patterns of the sort suggested above. One
possible weak labeling function is to first generate
a parse tree of the sentence, and then look for two entity nodes
connected by a path of length one that contains the verbs has or
have. For example, consider the sentence: Most prokaryotes
have a cell wall located outside the cell membrane. In the parse
tree of this sentence, prokaryotes and cell wall are
connected through a path of length one that carries the label have,
indicating a has part relationship. For taxonomic
(subclass-of) relationships, another weak labeling function is based
on entity modifiers: if two entities share the same base word but
one includes an additional modifier, this can indicate a taxonomic
relationship. For example, eukaryotic cell can be identified as a
subclass of cell.
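Weak labeling functions are ordinarily just small programs. The
sketch below shows simplified versions of the two functions described
above, written in the style popularized by Snorkel; the first
approximates the parse-tree test by looking for has/have in the
surface text between the two terms:

```python
# Each labeling function votes has-part, subclass-of, or abstains (None).
HAS_PART, SUBCLASS_OF, ABSTAIN = "has-part", "subclass-of", None

def lf_have_between(sentence, term1, term2):
    """Vote has-part if 'has'/'have' appears between the two term mentions.
    (A crude surrogate for the parse-tree path test described above.)"""
    i, j = sentence.find(term1), sentence.find(term2)
    if i == -1 or j == -1 or i >= j:
        return ABSTAIN
    between = sentence[i + len(term1):j].split()
    return HAS_PART if {"has", "have"} & set(between) else ABSTAIN

def lf_shared_head(term1, term2):
    """Vote subclass-of if term1 is term2 plus an extra modifier."""
    return SUBCLASS_OF if term1.endswith(" " + term2) else ABSTAIN

sentence = "Most prokaryotes have a cell wall located outside the cell membrane."
print(lf_have_between(sentence, "prokaryotes", "cell wall"))  # has-part
print(lf_shared_head("eukaryotic cell", "cell"))              # subclass-of
```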
To adapt a language model for relation extraction, we modify the
input representation of a sentence so that each individual term is
explicitly marked. For example, consider the sentence: All cells have
a cell membrane. This sentence has two terms, cells and
cell membrane. We indicate the presence of these terms by enclosing
the first in the markers [TERM1-START] and [TERM1-END], and the
second in [TERM2-START] and [TERM2-END], denoting the start and end
of each term. The set of tokens in the resulting sentence will be:
["All", "[TERM1-START]", "cells", "[TERM1-END]", "have", "a",
"[TERM2-START]", "cell", "membrane", "[TERM2-END]", "."]. Our
training data then consists of sentences paired with the expected
relationship between the two terms. Once trained on that data, the
task for the model is to predict the relationships between the two
terms indicated in an input sentence. With this simple modification
to the input representation, a general-purpose language model can be
repurposed for relation extraction.
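Producing this marked-up input is a simple preprocessing step. A
minimal helper (span indices are illustrative and inclusive):

```python
def mark_terms(tokens, span1, span2):
    """Insert term markers around two spans, given as inclusive
    (start, end) token indices; span1 is assumed to precede span2."""
    out = []
    for i, token in enumerate(tokens):
        if i == span1[0]:
            out.append("[TERM1-START]")
        if i == span2[0]:
            out.append("[TERM2-START]")
        out.append(token)
        if i == span1[1]:
            out.append("[TERM1-END]")
        if i == span2[1]:
            out.append("[TERM2-END]")
    return out

tokens = ["All", "cells", "have", "a", "cell", "membrane", "."]
print(mark_terms(tokens, (1, 1), (4, 5)))
# ['All', '[TERM1-START]', 'cells', '[TERM1-END]', 'have', 'a',
#  '[TERM2-START]', 'cell', 'membrane', '[TERM2-END]', '.']
```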
4.3 Challenges in Relation Extraction
A major challenge in relation extraction is obtaining sufficient
labeled training data. Manually annotating text with relation labels
is time-consuming and expensive, which limits the scale of supervised
approaches. Weak supervision offers a promising alternative by
allowing models to be trained on data that contain noisy or imperfect
labels. In this approach, external knowledge bases such as Wikidata
and lexical resources like WordNet can be used to define labeling
functions—heuristic rules that automatically assign tentative labels
to text. Although these labels may be inaccurate for individual
examples, learning algorithms can often aggregate many such weak
signals to produce effective models. Designing better and more
expressive weak labeling functions, and methods for combining them,
remains an active area of research.
In addition to training, an effective workflow is needed to
validate the outputs of relation extraction systems. In many cases,
this validation can be performed through crowdsourcing, where human
annotators check whether extracted relations are correct. To reduce
annotation effort, validation can be prioritized for extracted
relations with low confidence scores, which are more likely to contain
errors. This selective validation process naturally leads to active
learning loops, in which human feedback is used to iteratively improve
the model. Designing efficient validation workflows and active
learning strategies for relation extraction remains an important and
active area of research.
5. Summary
In this chapter, we examined the problem of automatically
constructing a knowledge graph from text. We focused on two
fundamental tasks: entity extraction and relation extraction. Early
approaches to both tasks relied on manually defined rules and
syntactic patterns derived from linguistic analysis. In contrast, most
modern methods build on large pre-trained language models, which are
then fine-tuned or adapted to the specific text corpus and extraction
task of interest.
For both entity extraction and relation extraction, the most
prevalent current approach is to adapt pre-trained language models
learned using deep learning techniques. Earlier syntactic and
rule-based methods continue to play an important role, particularly in
bootstrapping the training data needed for these models through weak
or heuristic labeling. Despite these advances, validating the outputs
of entity and relation extraction systems at scale remains an
important and unresolved challenge.
Entity linking, also known as entity resolution, is another
important task in constructing a knowledge graph from text. It
involves mapping an entity mention in a document (such as a person,
organization, or location) to the corresponding unique node in a
knowledge graph. Accurate entity and relation extraction are
prerequisites for effective entity linking, since errors at earlier
stages propagate downstream. For this reason, entity linking is often
considered a more advanced technique. In practice, however, it may or
may not be the primary bottleneck in a given application, depending on
factors such as the quality of the extracted entities, the ambiguity
of entity names, and the goals of the underlying business problem.
6. Further Reading
The discussion on entity and relation extraction in this chapter
draws from the information extraction chapter of the NLP textbook by
Jurafsky and Martin [Jurafsky & Martin 2025], which is also an
excellent source for a more in-depth discussion of modern approaches
to natural language processing. The weak labeling approach was
pioneered in the Snorkel project at Stanford [Ratner et al. 2017].
The methodology described here was used in bootstrapping an ontology
graph from a textbook [Chaudhri et al. 2022].
Much recent work focuses on leveraging generative AI models for
knowledge graph construction. An overview of efforts in the knowledge
engineering community is available in [Shimizu & Hitzler 2025]. The
work, however, remains preliminary and aspirational, with ample room
for systematic studies [Walker et al. 2024]. A workshop on this topic
is being held as part of the International Semantic Web Conference
[LLMs4OL 2025].
[Jurafsky & Martin 2025] Jurafsky, D., & Martin,
J. H. (2025). Speech and Language Processing: An Introduction to
Natural Language Processing, Computational Linguistics, and Speech
Recognition with Language Models (3rd ed., online draft). Stanford
University. Retrieved from
https://web.stanford.edu/~jurafsky/slp3/
[Ratner et al. 2017] Ratner, A., Bach, S., Ehrenberg, H., Fries, J.,
Wu, S., & Ré, C. (2017). Snorkel: Rapid training data creation with
weak supervision. Proceedings of the VLDB Endowment, 11(3), 269–282.
https://doi.org/10.14778/3157794.3157797
[Chaudhri et al. 2022] Chaudhri, V. K., Boggess, M., Aung, H. L.,
Mallick, D. B., Waters, A. C., & Baraniuk, R. E. (2021). A case study
in bootstrapping ontology graphs from textbooks. In Proceedings of
the 3rd Conference on Automated Knowledge Base Construction (AKBC).
[Shimizu & Hitzler 2025] Shimizu, C., & Hitzler, P. (2025).
Accelerating knowledge graph and ontology engineering with large
language models. Journal of Web Semantics, 85, 100862.
https://doi.org/10.1016/j.websem.2025.100862
[Walker et al. 2024] Walker, J., Koutsiana, E., Nwachukwu, M.,
Meroño Peñuela, A., & Simperl, E. (2024). The promise and challenge
of large language models for knowledge engineering: Insights from a
hackathon. In Extended Abstracts of the 2024 CHI Conference on Human
Factors in Computing Systems (pp. 1–9). ACM.
https://doi.org/10.1145/3613905.3650844
[LLMs4OL 2025] LLMs4OL 2025: The 2nd Large Language Models for
Ontology Learning Challenge @ ISWC 2025. (2025). Retrieved December
14, 2025, from https://sites.google.com/view/llms4ol2025
Exercises
Exercise 5.1.
Using the concept of a language model on the following sentence corpus, answer the questions below:
- I love running.
- Good health can be achieved by those who love running.
- I love good health.
- I love those who love running.
(a) What is P(health|good)?
(b) What is P(running|love)?
(c) What is P(love|I)?
(d) What is P(good|love)?
(e) What is P(love|running)?
Exercise 5.2.
An important feature used for entity extraction is
Word shape: it represents the abstract letter pattern of the word by
mapping lower-case letters to 'x', upper-case to 'X', numbers to 'd',
and retaining punctuation. Thus, for example, C.I.A. would map to
X.X.X. and IRS-1040 would map to XXX-dddd. In a shorter version of
word shape, consecutive character types are collapsed into one. For
example, C.I.A. would still map to X.X.X., but IRS-1040 would map to
X-d. With these definitions, address the following questions.
(a) What is the shape of the word: Googenheim?
(b) What is the short shape of the word: Googenheim?
(c) What is the regular expression for the shape of the word Googenheim?
(d) What is the regular expression for the short shape of the word Googenheim?
(e) Is it true that the short shape of a word is always strictly shorter than its shape?
Exercise 5.3.
Which of the following may not be a good feature for learning entity extraction?
(a) word shape
(b) part of speech
(c) presence in a gazetteer
(d) presence in Wikipedia
(e) number of characters
Exercise 5.4.
Given the following sentence corpus, and the seed (Sacramento,California), what
patterns will be extracted by the DIPRE algorithm?
- The bill was signed in Sacramento, California.
- Sacramento is the capital of California.
- Sacramento is the capital of California, and its sixth largest city.
- California's Sacramento is home to the state legislature, but not the state supreme court.
- California Governor Jerry Brown signed the bill in Sacramento.
(a) in ?x, ?y
(b) ?x is the capital of ?y
(c) ?y's ?x
(d) ?x's ?y
(e) ?y Governor * in ?x
Exercise 5.5.
Which of the following would be a candidate for a weak labeling function to extract the parthood relationship between entities, i.e., an entity A has part entity B.
(a) ?xs have ?y
(b) ?x includes ?y
(c) ?x contains ?y
(d) ?y surrounds ?x
(e) ?x causes ?y
Exercise 5.6. The goal of this project is to extend
the companies database created in Exercise 4.7, and automatically
populate it using information extracted from earnings calls
transcripts. For this project, you can use the publicly available
earnings calls transcripts
from StruxData:
https://struxdata.github.io/. This
project will give you hands-on experience with text processing,
information extraction, and knowledge graph construction. Proceed in
the following steps.
- Select companies and time window
- Choose a small number of well-known companies to focus
on.
- Select a suitable time window so that you can work with
multiple transcripts for the same company across different
quarters.
- Extract relationships between companies
- For each earnings call transcript, identify mentions of
other companies and determine their relationship to the company
hosting the call.
- If your current database schema does not include these
relationships, extend it appropriately (e.g., "competitor,"
"partner," "customer").
- Identify financial headwinds and tailwinds
- Process each transcript to extract factors that may
positively (tailwinds) or negatively (headwinds) affect the
company’s stock performance.
- Consider categorizing these factors (e.g., market trends,
regulatory changes, supply chain issues) to make the database
more informative.
- Optional analysis
- Even though this task is meant for future chapters, once your
database is populated, you can begin exploring patterns across
companies and quarters, or visualizing the relationships and
factors affecting stock performance.
Exercise 5.7.
The goal of this project is to bootstrap a knowledge graph from a
textbook, following the approach described by
Chaudhri et al. (2022).
You may use any textbook from the
OpenStax textbook library.
While the previous work used BERT for information extraction, in this project you will use state-of-the-art language models such as Gemini or ChatGPT.
Additionally, you should design a suitable scheme to evaluate the
quality of your extraction. Consider metrics such as precision,
recall, or coverage of entities and relations, and think about how to
validate the extracted knowledge in a systematic way.