CS520 Knowledge Graphs: What Should AI Know?

How to Create a Knowledge Graph from Data?


1. Introduction

Large organizations generate huge amounts of internal data and also consume data from third-party providers. These providers often extract information from unstructured sources and invest considerable effort in delivering it in structured form. To use this external data effectively, it must be integrated with the company's internal data. Such integration supports many important applications, including a 360-degree view of the customer, fraud detection, risk assessment, and loan approval. In this chapter, we focus on building a knowledge graph by integrating data from structured sources. The next chapter addresses techniques for extracting structured data from unstructured inputs.

When integrating data from multiple sources into a knowledge graph, we can either design a schema upfront --- as discussed in the previous chapter --- or begin without a predefined schema and load external data directly as triples. In practice, the initial schema is usually guided by the specific business use case being addressed. When such a schema exists, we must determine how the data elements in a new source should be aligned with it. This task is known as schema mapping.

In addition to aligning the structures of the two sources, we must also determine whether entities in the new data correspond to entities already represented in the knowledge graph. Inferring whether two entity descriptions refer to the same real-world entity is known as the record linkage problem. Record linkage arises both when onboarding a new data source and when keeping the knowledge graph up to date as third-party providers deliver ongoing data feeds.

In this chapter, we examine current approaches for addressing both the schema mapping and record linkage problems. We outline state-of-the-art algorithms and evaluate their effectiveness in tackling contemporary industry challenges.

2. Schema Mapping

Schema mapping assumes the existence of a schema that will be used for storing new data coming from another source. The mapping defines which relations and attributes in the input database correspond to which relations and properties in the knowledge graph. Techniques exist for bootstrapping schema mappings, and the bootstrapped mappings can then be corrected through human intervention.

We begin our discussion of schema mapping by outlining some of the key challenges, emphasizing that the process is often manual and labor-intensive. Next, we describe an approach for specifying mappings between the schema of the input source and the schema of the knowledge graph. Finally, we conclude the section by examining techniques that can be used to bootstrap schema mapping and reduce the manual effort involved.

2.1 Challenges in Schema Mapping

Automating schema mapping presents several key challenges: understanding complex schemas, handling intricate mappings, and coping with limited training data.

Relational database schemas can be large, often containing thousands of tables and tens of thousands of attributes. Many attribute names, such as segment1 or segment2, provide little semantic meaning, making automatic mapping difficult.

Mappings between the input schema and the knowledge graph are rarely simple one-to-one correspondences. They may require calculations, the application of business rules, or special handling of missing values, which makes fully automated mapping challenging.

Finally, many automated approaches rely on machine learning, which requires substantial training data. Because schema information is inherently smaller and less abundant than the data itself, there are typically too few schema mappings available to train a reliable algorithm.


2.2 Specifying Schema Mapping

In this section, we explore one approach for specifying mappings between input data sources and a target knowledge graph schema. To illustrate the process, we use an example from the domain of cookware. Imagine multiple vendors providing product information that an e-commerce site wants to aggregate and present to its customers. We will examine two example sources and then introduce the schema of the knowledge graph, to which we will define the mappings.

Below is some sample data from the first data source, stored in a relational table called cookware. The table contains four attributes: name, type, material, and price.

cookware
name   type      material    price
c01    skillet   cast iron   50
c02    saucepan  steel       40
c03    skillet   steel       30
c04    saucepan  aluminium   20

The second database, shown below, lists the products of a manufacturer. In this case, the data is spread across multiple tables, each representing a different product attribute. The kind table specifies the type of each product. The base table specifies whether each product is made from a corrosible metal (iron or steel), a noncorrosible metal (aluminum or stainless), or something other than metal (glass or ceramic). The coated table lists products that have nonstick coatings. The price table gives the selling price. Note that there is no material information: the company has chosen not to disclose the metal used in each product. The coated table records only positive values; products without nonstick coatings are simply omitted.

kind
id    value
m01   skillet
m02   skillet
m03   saucepan
m04   saucepan

base
id    value
m01   corrosible
m02   noncorrosible
m03   noncorrosible
m04   nonmetallic

coated
id    value
m01   yes
m02   yes

price
id    value
m01   60
m02   50
m03   40
m04   20

Suppose the desired schema for the knowledge graph, expressed as a property graph, is as shown below. The graph has two different node types: one for products, and the other for suppliers. These nodes are connected by a relationship called has_supplier. Each product node has properties "type" and "price".

To specify the mappings, and to illustrate the process, we will use a triple notation, making the approach applicable whether the knowledge graph follows the RDF or property graph data model. In the case of an RDF knowledge graph, creating IRIs is required, but this step is orthogonal to schema mapping and is therefore omitted here. The desired triples in the target knowledge graph are listed below.

knowledge graph
subject   predicate     object
c01       type          skillet
c01       price         50
c01       has_supplier  vendor_1
c02       type          saucepan
c02       price         40
c02       has_supplier  vendor_1
c03       type          skillet
c03       price         30
c03       has_supplier  vendor_1
c04       type          saucepan
c04       price         20
c04       has_supplier  vendor_1
m01       type          skillet
m01       price         60
m01       has_supplier  vendor_2
m02       type          skillet
m02       price         50
m02       has_supplier  vendor_2
m03       type          saucepan
m03       price         40
m03       has_supplier  vendor_2
m04       type          saucepan
m04       price         20
m04       has_supplier  vendor_2

Any programming language can be used to express the mappings. In this example, we use Datalog to specify the mappings. The rules shown below are straightforward, with variables indicated by uppercase letters. The third rule introduces the constant vendor_1 to indicate the source of the data.

knowledge_graph(ID,type,TYPE) :- cookware(ID,TYPE,MATERIAL,PRICE)
knowledge_graph(ID,price,PRICE) :- cookware(ID,TYPE,MATERIAL,PRICE)
knowledge_graph(ID,has_supplier,vendor_1) :- cookware(ID,TYPE,MATERIAL,PRICE)

Next, we consider the rules for mapping the second source. These rules are similar to those for the first source, except that the information now comes from two different tables in the source data.

knowledge_graph(ID,type,TYPE) :- kind(ID,TYPE)
knowledge_graph(ID,price,PRICE) :- price(ID,PRICE)
knowledge_graph(ID,has_supplier,vendor_2) :- kind(ID,TYPE)
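To make the effect of these rules concrete, here is a minimal Python sketch that applies both sets of mappings to the sample rows shown earlier. The in-memory tuples stand in for the source tables; this is purely illustrative, not a full ETL pipeline.

```python
# Sample rows mirroring the example tables from the two sources.
cookware = [
    ("c01", "skillet", "cast iron", 50),
    ("c02", "saucepan", "steel", 40),
    ("c03", "skillet", "steel", 30),
    ("c04", "saucepan", "aluminium", 20),
]
kind = [("m01", "skillet"), ("m02", "skillet"), ("m03", "saucepan"), ("m04", "saucepan")]
price = [("m01", 60), ("m02", 50), ("m03", 40), ("m04", 20)]

triples = []
# Rules for the first source: each cookware row yields type, price, and supplier triples.
for pid, ptype, _material, pprice in cookware:
    triples.append((pid, "type", ptype))
    triples.append((pid, "price", pprice))
    triples.append((pid, "has_supplier", "vendor_1"))
# Rules for the second source: type and supplier come from kind, price from price.
for pid, ptype in kind:
    triples.append((pid, "type", ptype))
    triples.append((pid, "has_supplier", "vendor_2"))
for pid, pprice in price:
    triples.append((pid, "price", pprice))
```

Each Datalog rule corresponds to one append inside a loop, and the constants vendor_1 and vendor_2 play the same role as in the rules.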

In general, it may not be appropriate to reuse identifiers from the source databases, and new identifiers may need to be created for the knowledge graph. In some cases, the knowledge graph may already contain objects equivalent to those in the incoming data. We will consider this issue in the section on record linkage.

2.3 Bootstrapping Schema Mapping

As noted in Section 2.1, fully automated schema mapping faces numerous practical difficulties. Research has focused on bootstrapping the schema mappings based on a variety of techniques, with human input often used to validate the results. Bootstrapping techniques can be classified into three main categories: linguistic matching, instance-based matching, and constraint-based matching. We will examine examples of each technique in the following sections.

Linguistic techniques can be applied to attribute names or descriptions. Common approaches include:

  1. Exact Name Matching: Check whether the names of two attributes are identical. Confidence is higher if the names are expressed as IRIs.
  2. Canonicalization: Normalize names through techniques such as stemming or abbreviation expansion before checking for equality (e.g., expanding CName to Customer Name).
  3. Synonyms and Hypernyms: Identify mappings using synonyms (e.g., car and automobile) or hypernyms (e.g., book and publication).
  4. Substrings and Phonetic Similarity: Compare attributes based on common substrings, pronunciation, or how words sound.
  5. Semantic Similarity of Descriptions: Extract keywords from attribute descriptions and assess similarity using the techniques above.
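As an illustration of the first three techniques, a minimal name matcher might combine canonicalization with a synonym lookup. The synonym table and the attribute names below are invented for this example:

```python
import re

# Tiny invented synonym table, purely for illustration.
SYNONYMS = {"car": "automobile", "automobile": "car"}

def canonicalize(name: str) -> str:
    """Split CamelCase and snake_case names, lowercase, and drop punctuation."""
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)   # ProductPrice -> Product Price
    name = re.sub(r"[_\W]+", " ", name).lower().strip()  # product_price -> product price
    return name

def names_match(a: str, b: str) -> bool:
    ca, cb = canonicalize(a), canonicalize(b)
    if ca == cb:                 # exact match after canonicalization
        return True
    if SYNONYMS.get(ca) == cb:   # synonym match
        return True
    return False
```

For instance, names_match("ProductPrice", "product_price") and names_match("car", "Automobile") both succeed, while unrelated names do not.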

In instance-based matching, the actual data values are examined to determine possible mappings. For example, if an attribute contains date values, it can only be mapped to other attributes that also contain dates. Many standard data types can be inferred by analyzing the data itself.
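A rough sketch of instance-based type inference, assuming column values arrive as strings (the two date formats checked here are an arbitrary choice):

```python
from datetime import datetime

def infer_type(values):
    """Guess a coarse type for a column from its string values (illustrative only)."""
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        # Assumed formats; a real profiler would try many more.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    if all(is_int(v) for v in values):
        return "integer"
    if all(is_date(v) for v in values):
        return "date"
    return "string"
```

A column inferred as "date" would then be considered as a mapping candidate only for other date-valued attributes.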

In constraint-based matching, we rely on the schema to provide information about constraints. For example, if an attribute is defined as both unique and numeric, it is a potential match for identification fields such as an employee number or a social security number.

The techniques considered here are inexact and hence can only be used to bootstrap the schema mapping process. Any resulting mappings must be verified and validated by human experts.

3. Record Linkage

We begin our discussion by illustrating the record linkage problem with a concrete example. We then provide an overview of a typical approach used to address record linkage in practice.

3.1 A Sample Record Linkage Problem

Suppose we have data in the two tables shown below. The record linkage problem involves inferring that record a1 refers to the same real-world entity as record b1, and that record a3 refers to the same real-world entity as record b2. As in schema mapping, these are inexact inferences and need to undergo human validation.

Table A
      Company                   City        State
a1    AB Corporation            New York    NY
a2    Broadway Associates       Washington  WA
a3    Prolific Consulting Inc.  California  CA

Table B
      Company                   City        State
b1    ABC                       New York    NY
b2    Prolific Consulting       California  CA

A large knowledge graph may contain information for over 10 million companies, and may periodically receive new data feeds extracted from natural language text. Such feeds can contain hundreds of thousands of companies. Even if the knowledge graph uses standardized identifiers for companies, the incoming data extracted from text will typically lack such identifiers. The goal of record linkage is to relate the companies contained in this new data feed with the corresponding companies that already exist in the knowledge graph. Because of the large data volumes involved, performing this task efficiently is of paramount importance.

3.2 An Approach to Solve the Record Linkage Problem

Efficient record linkage typically consists of two stages: blocking and matching. The blocking stage performs a fast, coarse-grained computation to select subsets of records from the source and target datasets that are likely to match. This reduces the number of comparisons required in the subsequent, more expensive matching stage. During matching, records within each block are compared pairwise using more precise criteria.

In the example considered above, a simple blocking strategy might compare only records that share the same state. Under this strategy, only the pairs (a1, b1) and (a3, b2) need to be compared, significantly reducing the number of comparisons required.
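Assuming records are represented as dictionaries with a State field, this state-based blocking strategy can be sketched as follows:

```python
from collections import defaultdict
from itertools import product

def block_by_state(table_a, table_b):
    """Group records by state and emit only cross-source pairs within each block."""
    blocks_a, blocks_b = defaultdict(list), defaultdict(list)
    for rec in table_a:
        blocks_a[rec["State"]].append(rec)
    for rec in table_b:
        blocks_b[rec["State"]].append(rec)
    # Candidate pairs: one record from A and one from B sharing a state.
    for state in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[state], blocks_b[state])

table_a = [
    {"id": "a1", "Company": "AB Corporation", "State": "NY"},
    {"id": "a2", "Company": "Broadway Associates", "State": "WA"},
    {"id": "a3", "Company": "Prolific Consulting Inc.", "State": "CA"},
]
table_b = [
    {"id": "b1", "Company": "ABC", "State": "NY"},
    {"id": "b2", "Company": "Prolific Consulting", "State": "CA"},
]
pairs = [(a["id"], b["id"]) for a, b in block_by_state(table_a, table_b)]
```

Of the 3 × 2 = 6 possible pairs, only two survive blocking; a2 has no counterpart in its block and is never compared.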

Both the blocking and matching stages can be implemented using a random forest trained through an active learning process. A random forest is an ensemble of decision trees, where each tree makes a prediction and the final decision is determined by a majority vote. This approach is well suited to record linkage because it can combine many weak, noisy signals -- such as partial name matches or similar addresses -- into a robust overall decision.

Active learning further improves efficiency by focusing human effort where it is most valuable. Instead of labeling a large number of random training examples, the model actively identifies cases where it is uncertain and requests human input for those examples. These labeled examples are then used to iteratively refine the random forest. Together, random forests and active learning provide an effective and scalable approach to both blocking and matching.

3.3 Blocking

Blocking rules rely on standard similarity functions such as exact match, Jaccard similarity, overlap similarity, and cosine similarity. For example, to compare "Prolific Consulting" and "Prolific Consulting Inc.", we first tokenize each string and then count the tokens they have in common: two of the three distinct tokens are shared, giving a similarity score of 2/3.
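This token-based computation can be expressed with set operations; dividing the number of shared tokens by the size of the token union reproduces the 2/3 score from the example:

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta | tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Here token_jaccard("Prolific Consulting", "Prolific Consulting Inc.") shares the tokens "prolific" and "consulting" out of three distinct tokens overall, yielding 2/3.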

A random forest used for blocking can also be viewed as a collection of rule sets, one per tree, where the arguments of each predicate are the two values to be compared.
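As an invented illustration (the predicates, attributes, and thresholds below are assumptions, not rules from an actual system), a two-tree blocking forest might be represented as follows, where a candidate pair survives blocking if any rule set accepts it:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

# Each rule set (tree) is a conjunction of similarity predicates over the two records.
RULE_SETS = [
    # Rule set 1: company names overlap strongly AND states match exactly.
    lambda r1, r2: jaccard(r1["Company"], r2["Company"]) >= 0.5
                   and r1["State"] == r2["State"],
    # Rule set 2: cities and states both match exactly.
    lambda r1, r2: r1["City"] == r2["City"] and r1["State"] == r2["State"],
]

def passes_blocking(r1, r2) -> bool:
    """A pair becomes a matching candidate if any rule set accepts it."""
    return any(rule(r1, r2) for rule in RULE_SETS)
```

The disjunction over rule sets mirrors the majority-vote structure of the forest in a simplified form: each tree contributes an independent route by which a pair can survive.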

Several general principles guide the automatic selection of the similarity functions for blocking. For numerical attributes such as age, weight, and price, candidate similarity functions include exact match, absolute difference, relative difference, and Levenshtein distance. For string-valued attributes, typical choices include edit distance, cosine similarity, Jaccard similarity, and TF/IDF-based similarity measures.

3.4 Active Learning of Random Forests

We can learn the random forest for the blocking rules through the following active learning process. First, we randomly select a set of record pairs from the two datasets. By applying the similarity functions to each pair, we generate a set of features for each pair. Using these features, we train a random forest. The resulting rules are then applied to new record pairs, and their performance evaluated. If the performance falls below a specified threshold, the process is repeated with additional labeled examples until acceptable performance is obtained. We illustrate this process using the following example.

Suppose our first dataset contains three items: a, b, and c, and our second dataset contains two items: d and e. From these datasets, we select two pairs, (a, d) and (c, d), which are labeled by the user as similar and not similar, respectively.

For each pair, we apply the similarity functions to generate features, which are then used to train a random forest. The learned rules are applied to the remaining pairs not in the training set, and the user verifies the results. If the user identifies an incorrect prediction, such as for the pair (b, d), it is added to the training set for the next iteration.

Repeating this process over several iterations allows the random forest to converge to a model that can be effectively used in the blocking step.
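The loop can be sketched schematically as follows. To keep the sketch self-contained, the model is a deliberately simple stand-in (a single threshold over one token-overlap feature) rather than a real random forest, and the oracle function stands in for the human labeler:

```python
def similarity(pair):
    """Token-overlap feature for a pair of strings (Jaccard over tokens)."""
    ta, tb = set(pair[0].lower().split()), set(pair[1].lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def train(labeled):
    """Stand-in for forest training: pick the threshold that best separates the labels."""
    best_t, best_acc = 0.5, -1.0
    for t in (i / 10 for i in range(1, 10)):
        acc = sum((similarity(p) >= t) == lab for p, lab in labeled) / len(labeled)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def active_learning(pairs, oracle, initial, rounds=3):
    """Iteratively query the oracle on the most uncertain pair and retrain."""
    labeled = [(p, oracle(p)) for p in initial]
    for _ in range(rounds):
        threshold = train(labeled)
        seen = {p for p, _ in labeled}
        unlabeled = [p for p in pairs if p not in seen]
        if not unlabeled:
            break
        # Most uncertain pair: similarity closest to the current decision boundary.
        query = min(unlabeled, key=lambda p: abs(similarity(p) - threshold))
        labeled.append((query, oracle(query)))
    return train(labeled)
```

In a real system, `train` would fit a random forest over many similarity features, and `oracle` would be a human verifying each queried pair; the control flow of selecting uncertain examples and retraining is the same.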

Once a random forest has been learned, we present each of the learned rules to the user. Based on the user's verification, we choose the rules that will be used in the subsequent steps.

3.5 Applying the Rules

Once the blocking rules have been learned, the next step is to apply them to the actual data. When the dataset is very large, it is impractical to apply the rules to all possible pairs of objects. To address this, we use indexing.

For example, suppose one rule requires that the Jaccard index between two strings be greater than 0.7, and we want to match the movie Sound of Music. Since this movie name contains 3 tokens, we only need to consider movies whose token counts lie between ceil(3 × 0.7) = 3 and floor(3 / 0.7) = 4. If the dataset is indexed by string length, we can efficiently select only those movies with lengths in this range and then further filter them by applying the blocking rule.
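A sketch of this length-filter indexing, with lengths measured in tokens and the candidate range computed with ceiling and floor:

```python
import math
from collections import defaultdict

def build_length_index(names):
    """Index strings by their token count."""
    index = defaultdict(list)
    for name in names:
        index[len(name.split())].append(name)
    return index

def candidates(query, index, threshold):
    """Strings whose token counts could permit a Jaccard score >= threshold."""
    n = len(query.split())
    lo = math.ceil(n * threshold)    # a shorter string cannot reach the threshold
    hi = math.floor(n / threshold)   # nor can a much longer one
    result = []
    for length in range(lo, hi + 1):
        result.extend(index.get(length, []))
    return result
```

Note that the index only narrows the candidate set: a title like "My Fair Lady" has the right token count and survives the length filter, but would still be eliminated when the actual blocking rule is applied.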

3.6 Matching

The blocking step produces a much smaller set of record pairs that need to be evaluated for matches. The structure of the matching process is similar: we first define a set of features, train a random forest, and iteratively refine it through active learning. The primary difference between blocking and matching is that the matching step requires greater precision and often more computation. This is because matching is the final step in record linkage, and it is essential to have high confidence that the two records indeed refer to the same real-world entity.

4. Summary

In this chapter, we examined the problem of creating a knowledge graph by integrating data from structured sources. The integrated schema of the knowledge graph can be refined and evolved to meet changing business requirements. Mappings between the schemas of different sources can be bootstrapped using automated techniques, but they still require verification by human experts. Record linkage addresses integration at the instance level, where we must infer whether two records refer to the same real-world entity in the absence of unique identifiers. A common approach for record linkage is to learn a random forest through an active learning process. For applications that demand high accuracy, automatically computed record linkages may also require human validation.

5. Further Reading

The schema mapping problem has attracted the attention of both database researchers [Rahm & Bernstein 2001] and ontology researchers [Noy & Stuckenschmidt 2005], with each community producing its own survey of techniques. The Ontology Alignment Evaluation Initiative (OAEI) is an international effort to evaluate and compare ontology and schema matching techniques in a standardized way [OAEI]. Its primary goals are to assess the strengths and weaknesses of alignment systems, compare performance across methods, improve evaluation techniques, and foster communication among researchers.

The discussion of record linkage in this chapter is based on the Magellan system developed at the University of Wisconsin-Madison [Doan et. al. 2020]. Record linkage is referred to by several other names, including deduplication and entity resolution. For an appreciation of the complexities of real-world record linkage, refer to the explanation by Jeff Jonas [Jonas 21]. Recent trends in record linkage include leveraging deep learning architectures [Mudgal et. al. 2020], including the recent Large Language Models [Peeters & Bizer 2023].

[Rahm & Bernstein 2001] Rahm, Erhard, and Philip A. Bernstein. "A survey of approaches to automatic schema matching." the VLDB Journal 10.4 (2001): 334-350.

[Noy & Stuckenschmidt 2005] Noy, Natasha, and Heiner Stuckenschmidt. "Ontology alignment: An annotated bibliography." Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2005.

[OAEI] Ontology Alignment Evaluation Initiative. (n.d.). Ontology Alignment Evaluation Initiative. Retrieved December 12, 2025, from https://oaei.ontologymatching.org/

[Doan et. al. 2020] Doan, A., Konda, P., Paulsen, D., Chandrasekhar, K., Martinkus, P., Christie, M., Govind, Y., & Suganthan, P. C. (2020). Magellan: Toward building ecosystems of entity matching solutions. Communications of the ACM, 63(8), 83–91. https://doi.org/10.1145/3405476

[Jonas 21] Jonas, J. (2021, approx.). Entity resolution explained step by step [Video]. YouTube. https://www.youtube.com/watch?v=VFE3kGdoXzA

[Mudgal et. al. 2020] Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., & Raghavendra, V. (2018). Deep learning for entity matching: A design space exploration. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18) (pp. 19–34). ACM. https://doi.org/10.1145/3183713.319692

[Peeters & Bizer 2023] Peeters, R., & Bizer, C. (2023). Using ChatGPT for entity matching (arXiv:2305.03423) [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2305.03423

Exercises

Exercise 4.1. Two popular methods for calculating the similarity between two strings are edit distance (also known as Levenshtein distance) and the Jaccard measure.

We can define the Levenshtein distance between two strings a, b (of lengths |a| and |b| respectively), written lev(a,b), as follows:

  • lev(a,b) = |a| if |b| = 0
  • lev(a,b) = |b| if |a| = 0
  • lev(a,b) = lev(tail(a),tail(b)) if a[0] = b[0]
  • lev(a,b) = 1 + min{lev(tail(a),b), lev(a,tail(b)), lev(tail(a),tail(b))} otherwise.
where the tail of a string x is the string of all but the first character of x, and x[n] is the nth character of the string x, starting with character 0.

We can define the Jaccard measure between two strings a and b by treating each string as a set of tokens. It is the size of the intersection divided by the size of the union of the two sets:

J(a,b) = |a ∩ b| / |a ∪ b|

(a) Given the strings "JP Morgan Chase" and "JPMC Corporation", what is the edit distance between the two?
(b) Given the strings "JP Morgan Chase" and "JPMC Corporation", what is the Jaccard measure between the two?
(c) Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, which of these strings would be equated by the edit distance methods?
(d) Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, which of these strings will be equated by the Jaccard measure?
(e) Given three strings: x = Apple Corporation, CA, y = IBM Corporation, CA, and z = Apple Corp, what intuition you would use to ensure that x is equated to z?

Exercise 4.2. A commonly used approach to account for the importance of words is a measure known as TF/IDF. Term frequency (TF) is the number of times a term occurs in a document. Inverse document frequency (IDF) is inversely related to the number of documents containing the term: the fewer documents contain it, the higher its IDF. The TF/IDF score is calculated by taking the product of TF and IDF. Use your intuition to answer whether the following statements are true or false.

(a) The higher the TF/IDF score of a word, the rarer it is.
(b) In a general news corpus, TF/IDF for the word Apple is likely to be higher than the TF/IDF for the word Corporation
(c) Common words such as stop words will have a high TF/IDF.
(d) If a document contains words with high TF/IDF, it is likely to be ranked higher by the search engines.
(e) The concept of TF/IDF is not limited to words, and can also be applied to sequence of characters, for example, bigrams.

Exercise 4.3. To check if two names might refer to the same real-world organization, one strategy is to check the similarity between two documents that describe them. Cosine similarity is a measure of similarity between two vectors, and is defined as the cosine of the angle between them. Highly similar documents will have a cosine score closer to 1. Which of the following might be a viable approach to convert a document into a vector for calculating cosine similarity?

(a) Word embeddings of the words used in a document.
(b) TF/IDF scores of the words used in a document.
(c) TF/IDF of bigrams used in a document.
(d) None of the above.
(e) Any of (a), (b) or (c).

Exercise 4.4. Which of the following is true about schema mappings?

(a) Schema mapping is largely an automated process.
(b) It is usually straightforward to learn simple schema mappings.
(c) Complete schema documentation is almost never available.
(d) Examining the actual data in the two sources can provide important clues for schema mapping.
(e) Database constraints play no role in schema mapping.

Exercise 4.5. Which of the following is true about record linkage?

(a) Blocking can be an unnecessary and expensive step in the record linkage pipeline.
(b) A random forest is a set of rule sets.
(c) Blocking rules must always be authored manually.
(d) Active learning of blocking rules minimizes the training data we must provide.
(e) Matching rules are as expensive to apply as the blocking rules.

Exercise 4.6. Implement a schema mapper for the conference domain of the Ontology Alignment Evaluation Initiative. The domain consists of 16 different schemas describing conferences.

Exercise 4.7. Create a knowledge graph of companies by linking the records across Wikidata and the SEC data considered as part the problem 3.7. To connect the two data sets, you can either implement your own record linker, or use any available open-source tools. Report on the quality of the resulting data set in terms of the precision and recall of identifying the correct links across the SEC company data and the Wikidata company information.