1. Introduction
It is possible to get started with a knowledge graph without
designing its schema upfront. Where an upfront design is practical,
however, it can significantly improve the usefulness of the graph.
Such a design involves making a suitable choice of the nodes, node
labels, node properties, relations and relation properties.
The data used to build or update a knowledge graph can come from
many places: structured databases, semi-structured formats (like JSON
or XML), unstructured sources such as plain text or images, or even
information typed in directly by people.
When the data is structured or semi-structured, two main tasks are needed:
- Schema mapping: matching the fields in the input data (e.g.,
columns in a table) to the classes and properties in the knowledge
graph. For example, a table column called cust_name may
need to be mapped to the name property of the Person
node in a knowledge graph.
- Record linkage: deciding whether a new piece of data refers to
an existing entity in the graph or should be added as a new one.
For example, if the input says “John Smith, born 1999”, we must
decide whether this is the same John Smith already in the graph or
a different person who should be added as a new node.
These tasks are the same ones you see in general data integration,
except here the final result is stored in a graph format.
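The two tasks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended implementation: the birth_year column, the FIELD_MAP table, and the exact-match linkage rule on name and birth year are all invented for the example.

```python
# Hypothetical mapping from input column names to (node label, property) pairs.
FIELD_MAP = {
    "cust_name": ("Person", "name"),
    "birth_year": ("Person", "born"),  # assumed column, for illustration only
}


def map_record(row):
    """Schema mapping: rename input fields to knowledge graph properties."""
    node = {"label": "Person"}
    for column, value in row.items():
        if column in FIELD_MAP:
            _, prop = FIELD_MAP[column]
            node[prop] = value
    return node


def link_record(node, graph_nodes):
    """Record linkage: return the index of a matching node, or None.

    Assumed rule: two Person nodes match when name and birth year agree.
    Real systems typically use fuzzy matching over more attributes.
    """
    key = (node.get("name"), node.get("born"))
    for i, existing in enumerate(graph_nodes):
        if (existing.get("name"), existing.get("born")) == key:
            return i
    return None


graph_nodes = [{"label": "Person", "name": "John Smith", "born": 1999}]

same = map_record({"cust_name": "John Smith", "birth_year": 1999})
other = map_record({"cust_name": "John Smith", "birth_year": 1975})

for candidate in (same, other):
    if link_record(candidate, graph_nodes) is None:
        graph_nodes.append(candidate)  # no match found: add a new node
```

Under this rule the first record is linked to the existing node, while the second, which has a different birth year, is added as a new Person node.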
When the data is unstructured, such as natural-language text, we
instead need information extraction methods—mainly entity extraction
(finding people, places, organizations, etc.) and relation
extraction (figuring out how those entities are connected). For
example, from the sentence “Google acquired DeepMind in 2014,” the
entity extraction should detect Google and DeepMind as
organizations. The relation extraction should detect that the
relationship between Google and DeepMind is “has acquired”.
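To make the expected output concrete, the toy sketch below extracts the entities and the relation from the example sentence with a single hand-written pattern. The pattern, the function name, and the output layout are assumptions made for illustration; practical extractors rely on trained models rather than one regular expression.

```python
import re

# Toy pattern for sentences of the form "<Org> acquired <Org> in <year>".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) in (?P<year>\d{4})"
)


def extract(sentence):
    """Return (entities, relations) found by the toy pattern."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return [], []
    buyer, target = match["buyer"], match["target"]
    # Entity extraction: both arguments are typed as organizations.
    entities = [(buyer, "Organization"), (target, "Organization")]
    # Relation extraction: one typed edge, with the year kept as metadata.
    relations = [(buyer, "acquired", target, {"year": int(match["year"])})]
    return entities, relations


entities, relations = extract("Google acquired DeepMind in 2014.")
```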
The methods used to populate a knowledge graph depend on how
large the system is and how accurate the data needs to be. For very
large, web-scale knowledge graphs used in search engines or
information retrieval, it’s impossible to check every triple by
hand, and the system can tolerate some errors. In contrast,
enterprise knowledge graphs—such as those used inside a company for
compliance, finance, or operations—often require very high accuracy,
so human review becomes essential, even if it happens right before
the data is used. Because accuracy is always important but full
manual verification is expensive, many systems rely on crowdsourcing
or other low-cost ways of getting human input to balance quality,
cost, and scalability.
In this chapter, we will focus on knowledge graph schema design.
In the next two chapters we will discuss the problems that arise in
populating a knowledge graph from structured data, i.e., the
problems of record linkage and schema mapping, and the problems that
arise while populating from text, i.e., entity extraction and
relation extraction.
2. Knowledge Graph Design
Both property graphs and RDF graphs come with design challenges—some
shared by both models and others unique to each. For instance, both
models sometimes need to use reification (i.e., turning a statement
into a first-class object) when a fact cannot be expressed directly as
a simple triple.
However, the models also differ in important ways. RDF requires a
consistent scheme for creating and managing IRIs (global identifiers
for entities), while property graphs do not. On the other hand,
property graph designers must decide whether a piece of information
should be stored as a property on a node or as a separate node
connected by an edge—a choice that RDF largely avoids because RDF
treats almost everything as a node with relationships.
In the rest of this section, we outline the key design issues
encountered in each model, highlighting which ones are shared and
which ones are specific to RDF or property graphs.
2.1 Design of an RDF Graph
The standard guidelines for creating and publishing RDF data on the
Web are known as the Linked Data Principles. They describe how to name
things, how to make those names accessible, and how to connect data
across the Web. The principles are:
- Use URIs as names for things. (URIs uniquely identify resources;
modern systems also support IRIs—Internationalized Resource
Identifiers—which allow non-ASCII characters.) For example, the city
of Paris can be identified by the URI
http://dbpedia.org/resource/Paris
- Use HTTP URIs so that people and programs can look up those
names. For example, typing the URI above in a browser or querying it
with SPARQL returns useful information about Paris.
- When someone looks up a URI, provide useful information, using
the Web standards (RDF, SPARQL).
- Include links to other URIs to help users discover additional
related information. For example, the Paris URI might link to
http://dbpedia.org/resource/France to indicate that Paris is a city
in France.
We will consider each of these guidelines in greater detail.
2.1.1 Use URIs as names for things
To publish a knowledge graph on the Web, we first need to identify
the key items in our domain—these are the things whose properties and
relationships we want to capture in the graph. In Web terminology, all
such items are called resources. Resources can be divided into two
types: information resources and non-information resources.
- Information resources are resources that exist on the Web
itself, such as documents, images, videos, and other media
files. For example, a Wikipedia page about the city of Paris is an
information resource.
- Non-information resources are things that exist in the real world
but are not themselves web documents. These include people, physical
products, places, proteins, scientific concepts, and other real-world
objects. For example, the city of Paris or the Eiffel Tower are
non-information resources, even though they may be described on the
Web.
As a simple rule of thumb: anything that exists outside of the
Web—what we usually think of as “real-world objects”—is considered a
non-information resource.
Publishers of knowledge graphs should design URIs so that they are
simple, stable, and easy to manage. Short, meaningful (mnemonic) URIs
are less likely to break when shared—for example, in emails—and are
easier for people to remember. Once a URI has been assigned to a
resource, it should remain unchanged for as long as possible. To
support long-term persistence, it is best to avoid including
implementation-specific details such as “.php” or “.asp” in the
URI. Finally, URIs should be constructed in a way that allows the
publisher to control and maintain them over time.
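One way to follow these recommendations is to derive URIs from human-readable labels with a small helper. The sketch below is only an illustration: the example.org namespace, the function name, and the decision to fold accented characters to ASCII are assumptions, and a publisher that prefers IRIs could keep the original characters instead.

```python
import re
import unicodedata

# Assumed namespace that the publisher controls.
BASE = "http://example.org/resource/"


def mint_uri(label):
    """Build a short, mnemonic URI with no implementation-specific suffix."""
    # Fold accented characters to ASCII so the result is a plain URI.
    ascii_label = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single underscore.
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_label).strip("_")
    return BASE + slug


eiffel = mint_uri("Eiffel Tower")
sao_paulo = mint_uri("São Paulo")
```

Because the URI depends only on the label and the namespace, it does not change when the software that serves it changes.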
2.1.2 Use HTTP URIs so that people can look up those names
We identify resources using Uniform Resource Identifiers (URIs). In
practice, we use HTTP URIs exclusively and avoid other URI schemes,
such as Uniform Resource Names (URNs) or Digital Object Identifiers
(DOIs), because only HTTP URIs can be directly looked up
and dereferenced on the Web. This means that when someone accesses an
HTTP URI, the publisher can return useful information about the
resource, making it easier to integrate, link, and reuse data across
the Web.
For example, a DOI like doi:10.1000/xyz123 identifies a
resource but cannot be dereferenced without a special resolver
service. An HTTP URI like http://example.org/resource/Paris can
be opened directly in a browser or queried by a machine, and it can
return structured RDF data about the city of Paris.
The process of looking up a name on the Web is called URI
dereferencing. When we dereference a URI that identifies an
information resource, we expect to receive a direct representation of
that resource—for example, a text document, an image, or a video. In
contrast, when we dereference a URI that identifies a non-information
resource (such as a person, place, or physical object), we cannot
retrieve the object itself. Instead, we receive an RDF description
that provides structured information about that resource.
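A machine client can state that it prefers RDF over HTML through the HTTP Accept header. The sketch below only constructs such a request and does not send it. Whether a given server honours the header, and which RDF media types it offers, is an assumption that has to be checked for each publisher; the function name is hypothetical.

```python
from urllib.request import Request


def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    return Request(uri, headers={"Accept": media_type})


request = rdf_request("http://example.org/resource/Paris")
```

Passing the request to urllib.request.urlopen would perform the actual lookup.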
2.1.3 When someone looks up a URI, provide useful information using RDF and SPARQL
When someone looks up a URI, the publisher should return a knowledge
graph encoded in RDF. This data should make use of standardized
vocabularies so that the IRIs used in the RDF description are
consistent and interoperable. Many well-established vocabularies
exist for describing data catalogs, organizations, and
multidimensional datasets such as statistical data on the Web. In
addition, Schema.org, a large open-source community effort, provides
widely used vocabularies for describing people, places, products,
events, and many other Web resources. In the following section, we
review several examples of these vocabularies.
The following RDF data describes a snippet of the organizational
structure of the UK Cabinet Office.
| @prefix uk_cabinet: <http://reference.data.gov.uk/id/department/> |
| uk_cabinet:co rdf:type org:Organization |
| uk_cabinet:co skos:prefLabel "Cabinet Office" |
| uk_cabinet:co org:hasUnit uk_cabinet:cabinet-office-communications |
| uk_cabinet:cabinet-office-communications rdf:type org:OrganizationUnit |
| uk_cabinet:cabinet-office-communications skos:prefLabel "Cabinet Office Communications" |
| uk_cabinet:cabinet-office-communications org:hasPost uk_cabinet:post_246 |
| uk_cabinet:post_246 skos:prefLabel "Deputy Director, Deputy Prime Minister's Spokesperson" |
In the data above, the first triple uses the
class org:Organization from the Organization ontology. The
second triple uses the relation skos:prefLabel drawn from the
SKOS ontology. SKOS stands for Simple Knowledge Organization System,
and provides a few commonly useful relations, such
as skos:prefLabel, for describing data. In this
case, skos:prefLabel simply allows us to associate a text label
with uk_cabinet:co. The third triple uses the predicate
org:hasUnit from the Organization ontology to describe a unit
within the UK Cabinet Office. The next two triples make additional
assertions about this unit. The sixth triple uses
org:hasPost to describe a position within that unit, and the
final triple gives additional information about that position.
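Each prefixed name in the data above is shorthand for a full IRI. The sketch below expands the first three triples, using the published namespace IRIs for rdf, skos and org, which the listing leaves undeclared. The expand helper and the tuple layout are assumptions made for illustration.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "org": "http://www.w3.org/ns/org#",
    "uk_cabinet": "http://reference.data.gov.uk/id/department/",
}


def expand(term):
    """Expand a prefixed name to a full IRI; leave quoted literals alone."""
    if term.startswith('"'):
        return term
    prefix, local = term.split(":", 1)
    return PREFIXES[prefix] + local


triples = [
    ("uk_cabinet:co", "rdf:type", "org:Organization"),
    ("uk_cabinet:co", "skos:prefLabel", '"Cabinet Office"'),
    ("uk_cabinet:co", "org:hasUnit", "uk_cabinet:cabinet-office-communications"),
]

expanded = [tuple(expand(term) for term in triple) for triple in triples]
```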
It is not always possible to find existing vocabularies suitable
for creating an RDF dataset. When it becomes necessary to create a new
vocabulary, certain best practices should be followed. The vocabulary
should be well-documented, self-describing, have a versioning policy,
support multiple languages, and be published by a trusted source to
ensure that the URIs it defines remain stable over time. A vocabulary
is considered self-describing if each term or property includes a
label, a definition, and a comment explaining its meaning.
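The self-describing criterion lends itself to a simple automated check. The sketch below reports, for each term, which of the three documentation fields is missing. The dictionary layout and the two ex: terms are hypothetical and serve only as example data.

```python
REQUIRED_FIELDS = ("label", "definition", "comment")


def missing_documentation(vocabulary):
    """Return {term: [missing fields]} for terms that are not self-describing."""
    problems = {}
    for term, description in vocabulary.items():
        missing = [f for f in REQUIRED_FIELDS if not description.get(f)]
        if missing:
            problems[term] = missing
    return problems


# Hypothetical vocabulary entries, invented for this example.
vocabulary = {
    "ex:Supplier": {
        "label": "Supplier",
        "definition": "An organization that provides goods or services.",
        "comment": "Use for organizations, not for individual contacts.",
    },
    "ex:suppliesTo": {"label": "supplies to"},
}

report = missing_documentation(vocabulary)
```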
2.1.4 Include links to other URIs, so that they can discover more things
While publishing data in RDF, it is important to provide links to
other resources, as this significantly increases the usefulness and
interconnectedness of data. These links can be categorized into
three types: relationship links, identity links,
and vocabulary links. We will consider an example of each of
these kinds of links.
Relationship links connect resources to related things in
other datasets, such as people, places or organizations. These links
allow data to reference additional information, for example, linking a
person to background information about their city or to bibliographic
data about their publications. In the triple below, we illustrate a
relationship link in which a person from one dataset is asserted to be
based near a geographical location identified by a URI in another data
set.
| @prefix big: <http://biglynx.co.uk/people/> |
| @prefix dbpedia: <http://dbpedia.org/resource/> |
| big:dave-smith foaf:based_near dbpedia:Birmingham |
Identity links point at URI aliases used by other data
sources to identify the same real-world object or abstract
concept. Identity links enable clients to retrieve further
descriptions about an entity, and serve an important social function
as they enable different views of the world to be expressed on the
Web of Data. It is standard practice to use the link
type http://www.w3.org/2002/07/owl#sameAs to state that two URI
aliases refer to the same resource. For example, if Dave Smith
also maintained a private data homepage besides the data that Big Lynx
publishes about him, he could add a
http://www.w3.org/2002/07/owl#sameAs link to his private data
homepage, stating that the URI used to refer to him in this document
and the URI used by Big Lynx both refer to the same real-world
entity. A triple capturing this information is shown below.
| @prefix ds: <http://www.dave-smith.eg.uk#> |
| @prefix owl: <http://www.w3.org/2002/07/owl#> |
| @prefix big: <http://biglynx.co.uk/people/> |
| ds:me owl:sameAs big:dave-smith |
Vocabulary links connect data to the definitions of the vocabulary
terms used to describe it, and also link those definitions to related
terms in other vocabularies. These links make data self-descriptive
and enable applications to understand and integrate data across
vocabularies. In the example below, the
class SmallMediumEnterprise defined by Big Lynx is declared to be
a subclass of the class Company in DBpedia. By establishing
this link, it is possible to retrieve various assertions about the
class Company from DBpedia, and apply them to the
class SmallMediumEnterprise, facilitating richer data
integration and reasoning.
| @prefix dbpedia: <http://dbpedia.org/ontology/> |
| big:sme#SmallMediumEnterprise rdfs:subClassOf dbpedia:Company |
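The three kinds of links can be told apart, to a first approximation, by the predicate of the triple. The sketch below encodes that heuristic. The text names only owl:sameAs and rdfs:subClassOf; the other predicates in the VOCABULARY set are an assumption about which terms are typically used to relate vocabulary definitions, and the function treats every remaining predicate as a relationship link.

```python
OWL = "http://www.w3.org/2002/07/owl#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Predicates stating that two URIs name the same thing.
IDENTITY = {OWL + "sameAs"}

# Predicates relating vocabulary terms to each other (assumed, non-exhaustive).
VOCABULARY = {
    RDFS + "subClassOf",
    RDFS + "subPropertyOf",
    OWL + "equivalentClass",
    OWL + "equivalentProperty",
}


def link_type(predicate):
    """Classify a cross-dataset link by its predicate IRI."""
    if predicate in IDENTITY:
        return "identity"
    if predicate in VOCABULARY:
        return "vocabulary"
    return "relationship"


kinds = [
    link_type(FOAF + "based_near"),
    link_type(OWL + "sameAs"),
    link_type(RDFS + "subClassOf"),
]
```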
2.2 Design of a Property Graph
The design of a property graph involves choosing nodes, node
labels, node properties, edges and edge properties. The basic
design questions are whether to model a piece of information as a
property, a label or a separate object; when to introduce relation
properties; and how to handle higher-arity relationships. We will
illustrate the process of making these choices using examples.
2.2.1 Choosing Nodes, Labels and Properties
In a property graph model, the nodes usually represent entities in
the domain. If we are interested in representing information about
people, we create a node for each individual person (e.g., John),
and associate the label Person with that node.
When making further design decisions about node labels, node
properties, and edges in a property graph, several factors should be
considered. These include the naturalness and clarity of labels,
whether the labels are likely to change over time, the impact on
runtime query performance, and the cardinality of property values.
To illustrate the choice of whether to model a piece of information
as a label, property, or as a separate object, consider the task of
representing the gender of a person. We have three potential ways to
capture this information.
- As labels: Create :Male
and :Female as labels and associate them with the Person
nodes.
- As a property: Create a property called "gender", associate it
with Person nodes, and allow it to have the values "male" and
"female".
- As a separate object: Create a Gender object, associate it
with Person using a has_gender relationship, and give it
a property called "name" that can take "male" and "female" as values.
Each approach has different implications for flexibility, query simplicity, and data modeling clarity.
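The three options can be compared side by side with plain Python data structures standing in for a graph store. This is a sketch only: the dictionary layout, the node identifiers, and the lookup functions are assumptions made for illustration and do not correspond to any particular graph database API.

```python
# Design 1: gender as a label on the Person node.
design_labels = [
    {"id": "john", "labels": {"Person", "Male"}, "properties": {}},
]

# Design 2: gender as a node property.
design_property = [
    {"id": "john", "labels": {"Person"}, "properties": {"gender": "male"}},
]

# Design 3: gender as a separate object reached through has_gender.
design_object_nodes = {
    "john": {"labels": {"Person"}, "properties": {}},
    "gender_male": {"labels": {"Gender"}, "properties": {"name": "male"}},
}
design_object_edges = [("john", "has_gender", "gender_male")]


def males_by_label(nodes):
    """Membership test on the label set."""
    return [n["id"] for n in nodes if "Male" in n["labels"]]


def males_by_property(nodes):
    """Value test on the node property."""
    return [n["id"] for n in nodes if n["properties"].get("gender") == "male"]


def males_by_object(nodes, edges):
    """Follow has_gender edges and test the Gender node's name."""
    return [
        src
        for src, rel, dst in edges
        if rel == "has_gender" and nodes[dst]["properties"].get("name") == "male"
    ]
```

All three lookups return the same answer; what differs is where the information lives and how much of the graph has to be touched to find it.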
The labels in a property graph model are used to group nodes into
sets. All nodes labeled with the same label belong to the same
set. Queries can work with these sets instead of the whole graph,
making queries easier to write and more efficient. A node may be
labeled with any number of labels, including none, making labels an
optional addition to the graph. As a label groups nodes into a set, it
can be viewed as a class. The question of whether to introduce a new
label can be restated as whether to introduce a new class.
Creating new classes Male and Female conveys the same
information as introducing a node property "gender" that can take
the two values "male" and "female". In general, whenever a
phrase naturally occurring in language is frequently used in a
domain, it is a candidate to be introduced as a class, as long as
membership in the class does not change over time. Because some
implementations optimize retrieval based on node labels, using a
label can also speed up queries that filter results by class
membership. If class membership changes with time, neither a label
nor a node property value is an appropriate choice, and we need to
use a relation. We will consider this in the next section.
2.2.2 When to introduce Relationships between Objects
For situations that could be modeled either by using a node
property or by introducing a separate object connected by a
relationship, at least two considerations come into play. The first,
discussed in the previous section, is whether class membership changes
over time. The second consideration is query performance, as some
modeling choices can make queries faster or more efficient. We will
examine these considerations next.
Continuing the example from the previous section, if the gender
of a person could change over a period of time, then the information should
be represented as a separate Gender node
connected to the Person node via a has_gender
relationship. A relationship property can be used on
the has_gender edge to indicate the time duration
for which that particular gender value applies.
However, connecting every person to a separate Gender node
could lead to a very large number of edges, which is inefficient
because for most people the gender does not change. In such cases, a
hybrid approach may be preferable: for most people, the gender is
stored as a node property, while for the small subset of individuals
whose gender changes over time, it is represented via a relationship
to a Gender node.
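The hybrid approach can be sketched as a lookup that first consults time-scoped has_gender edges and falls back to the node property. The edge property names "from" and "to", the use of years as time points, and all identifiers below are assumptions made for the example.

```python
nodes = {
    "p1": {"labels": {"Person"}, "properties": {"gender": "female"}},
    "p2": {"labels": {"Person"}, "properties": {}},
    "g_male": {"labels": {"Gender"}, "properties": {"name": "male"}},
    "g_female": {"labels": {"Gender"}, "properties": {"name": "female"}},
}

# Only p2 needs edges; each edge carries the period for which it applies.
edges = [
    ("p2", "has_gender", "g_male", {"from": 1990, "to": 2015}),
    ("p2", "has_gender", "g_female", {"from": 2015, "to": None}),
]


def gender_at(person_id, year):
    """Return the gender that applies to a person in a given year."""
    for src, rel, dst, props in edges:
        if src != person_id or rel != "has_gender":
            continue
        end = props["to"]  # None means the edge is still current
        if props["from"] <= year and (end is None or year < end):
            return nodes[dst]["properties"]["name"]
    # No applicable edge: fall back to the plain node property.
    return nodes[person_id]["properties"].get("gender")
```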
Let us consider a situation where better query performance is a key
consideration. Suppose we wish to model movies and their genres.
In one design, for a node of type Movie, we can introduce a
property "genre" that can take values such as "Action", "SciFi",
etc. In another design, we can introduce a new node
type Genre with a "name" property that can take values such as
"Action" and "SciFi". We will then relate a node of
type Movie to a node of type Genre using
the has_genre relationship. In general, a movie can belong
to more than one genre. Suppose we wish to query for pairs of movies
that have at least one genre in common. In the first design, in which
we use the node property "genre" (holding a list of values), this
query would be stated in Cypher as follows:
| MATCH (m1:Movie), (m2:Movie) |
| WHERE any(x IN m1.genre WHERE x IN m2.genre) |
| AND m1 <> m2 |
| RETURN m1, m2 |
When we model genre as a separate object, the same query can be stated as follows:
| MATCH (m1:Movie)-[:has_genre]->(g:Genre), |
| (m2:Movie)-[:has_genre]->(g) |
| WHERE m1 <> m2 |
| RETURN m1, m2 |
In the second query above, we are able to more directly make use of
graph patterns, and in some graph engines, this query has a faster
runtime performance because of indexing on relations. Hence, in this
case, one has to choose between the two designs depending on the kind
of queries that will be expected.
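The difference between the two designs can be mimicked outside a graph engine. In the sketch below, the first function compares every pair of movies, as the property-based query does, while the second groups movies under each genre first, which is roughly what traversing from a shared Genre node amounts to. The movie identifiers and function names are hypothetical, and, unlike the Cypher queries, each pair is reported once rather than in both orders.

```python
from itertools import combinations

# Hypothetical data: movie identifier -> list of genre names.
movie_genres = {
    "m1": ["Action", "SciFi"],
    "m2": ["SciFi"],
    "m3": ["Drama"],
}


def pairs_by_property(movie_genres):
    """Compare the genre lists of every pair of movies."""
    result = set()
    for a, b in combinations(sorted(movie_genres), 2):
        if set(movie_genres[a]) & set(movie_genres[b]):
            result.add((a, b))
    return result


def pairs_by_genre_node(movie_genres):
    """Group movies under each genre, then pair movies within a group."""
    by_genre = {}
    for movie, genres in movie_genres.items():
        for genre in genres:
            by_genre.setdefault(genre, set()).add(movie)
    result = set()
    for movies in by_genre.values():
        result.update(combinations(sorted(movies), 2))
    return result


from_property = pairs_by_property(movie_genres)
from_nodes = pairs_by_genre_node(movie_genres)
```

Both functions return the same pairs; the second avoids examining pairs of movies that share no genre.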
2.2.3 When to introduce Relationship Properties
We have already seen an example of using a property on a
relationship to handle cases where the relationship changes over
time. Other common reasons to attach properties to relationships
include capturing weights or confidence scores, or recording
provenance and other metadata.
Some graph engines do not index relationship properties. If the use
case allows most queries to be evaluated without accessing these
properties—using them only for final filtering—then the lack of
indexing may not significantly impact performance. However, if
accessing relationship properties is critical for query performance,
it is better to reify the relationship as a separate node, which we
will discuss in the next section.
2.2.4 Handling non-binary Relationships
We often need to model relationships that involve more than two
entities. A common example is the between relationship which,
given objects A, B and C, captures that C
is between A and B. A standard approach to capturing
such higher-arity relationships in a graph is reification.
Although we previously discussed reification in the
context of RDF, this technique is equally useful for
property graphs. To capture the between relationship, we
introduce a new node type, Between_Relationship, that has two
properties: has_object (holding values A and B)
and has_between_object (holding value C). This approach
can be generalized to relationships of any arity: create a new node
type for the relationship and add a node property for each argument
of that relation.
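The general recipe can be sketched as a small helper that creates one node per relationship instance and stores each argument as a node property. The graph layout, the reify function, and the generated node identifiers are assumptions made for illustration.

```python
import itertools

_counter = itertools.count(1)


def reify(graph, relation_label, **arguments):
    """Add a node standing for one instance of an n-ary relationship.

    Each keyword argument becomes a node property holding one argument
    (or a list of arguments) of the relationship.
    """
    node_id = f"{relation_label}_{next(_counter)}"
    graph[node_id] = {"labels": {relation_label}, "properties": dict(arguments)}
    return node_id


graph = {}
between_id = reify(
    graph,
    "Between_Relationship",
    has_object=["A", "B"],      # the two outer objects
    has_between_object="C",     # the object lying between them
)
```

The same helper would handle a relationship of any arity by passing one keyword argument per participant.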
3. Summary
In this chapter, we considered the design of the graph data model
for both RDF and property graphs. Many design concerns, such as
whether to reify a relationship and how to handle non-binary
relationships, are common to both models. The choice of whether to
use a property or a relation is unique to the property graph data
model.
The data publishing guidelines for RDF encourage the use of
IRIs, reuse of existing vocabularies, and making links across
vocabularies. While such data linking practices are not intrinsic to
property graphs, adopting them can enhance a property graph’s
interoperability and usefulness in data integration.
4. Further Reading
Further elaboration of the guidelines for publishing RDF data on
the web is available in a monograph [Heath & Bizer 2011]. A similar
monograph is available for ontology design [Kendall & McGuinness 2019].
A course on the design of the Wikidata schema was recently
offered [Wikidata Ontology Course 2025]. The course introduces the
Wikidata ontology, and covers numerous design challenges in creating it.
For designing property graph schemas, extensive documentation is
available from graph database vendors such as
Neo4j [Neo4j Data Modeling] and
TigerGraph [TigerGraph Schema Design].
- [Heath & Bizer 2011] Heath, Tom, and Christian Bizer. Linked Data: Evolving
the Web into a Global Data Space. 1st ed. Synthesis Lectures on
the Semantic Web: Theory and Technology. Morgan & Claypool,
2011.
- [Kendall & McGuinness 2019] Kendall, Elisa F., and Deborah
L. McGuinness. Ontology Engineering. Springer International
Publishing, 2019. https://doi.org/10.1007/978-3-031-79486-5
- [Wikidata Ontology Course 2025] Peter F. Patel-Schneider and Ege Atacan
Doğan. “Wikidata: WikiProject Ontology/Ontology Course.” Last
modified July 10, 2025. Accessed December 5,
2025. https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology/Ontology_Course
- [Neo4j Data Modeling] Neo4j, Inc. n.d. “What Is Graph Data Modeling?”
Accessed December 5, 2025. https://neo4j.com/docs/getting-started/data-modeling/
- [TigerGraph Schema Design] TigerGraph. “Schema Design Guide.” GSQL
Language Reference, 4.2. Accessed December 5,
2025. https://docs.tigergraph.com/gsql-ref/4.2/tutorials/schema-design-guide
Exercises
Exercise 3.1. Which of the following statements about knowledge graph design are true?
(a) As knowledge graphs are schema free, no design of the schema is required.
(b) Knowledge graphs are always created using automatic techniques.
(c) For many knowledge graph applications, perfect accuracy is not a hard requirement.
(d) Knowledge graphs can contain undirected relationships.
(e) Knowledge graphs do not use keys and foreign keys as defined for relational database systems.
Exercise 3.2. Which of the following is a good choice of an IRI for an RDF knowledge graph?
(a) ISBN-13 : 978-1681737225
(b) http://fcvcz.abt.co/mckz/
(c) https://www.wikidata.org/wiki/Q6135847
(d) http://worksheets.stanford.edu/homepage/index.php
(e) http://dbpedia.org/resource/Frederick_Loewe
Exercise 3.3. What type of link is captured by each of the following RDF statements? (Assume the following prefixes have been defined.)
@prefix dbpedia: http://dbpedia.org/resource/
@prefix bbc: http://www.bbc.co.uk/nature/species/
@prefix umbel-rc: https://umbel.org/umbel/rc/
@prefix foaf: http://xmlns.com/foaf/0.1/
(a) dbpedia:Aardvark owl:sameAs bbc:Aardvark
(b) dbpedia:Lady_Gaga skos:broader_of dbpedia:Lady_Gaga_audio_samples
(c) dbpedia:Tetris foaf:isPrimaryTopicOf wikipedia-en:Tetris
(d) dbpedia:Person rdfs:subClassOf umbel-rc:Person
(e) dbpedia:Sky_Bank foaf:homepage http://www.skyebankng.com/
Exercise 3.4. Which of the following are good class labels in a knowledge graph?
(a) Customers with overdue accounts
(b) Australian Customers
(c) Customers with revenues between 5 to 10 million
(d) Customers who supply to recently funded startups
(e) High Networth Value Customers
Exercise 3.5. Which of the following requires reification for representing in a knowledge graph?
(a) John believes that life is good.
(b) John was referred to Peter by Mary.
(c) The effectiveness of a vaccine is 95%.
(d) Earth revolves around the Sun.
(e) On LinkedIn John rated Peter for being an expert in AI.
Exercise 3.6. Design a property graph schema for a
causal graph for investing. Your design should take into account
the following domain knowledge.
In investing, headwind and tailwind are metaphors
used to describe the factors that could affect the
performance of a stock.
A headwind is any specific company, market or economic
factor that could causally hinder a company's growth, or reduce
its profitability, in the near future. These could include things like
increased competition, regulatory changes, unfavorable economic
conditions, or any other causal factor that makes it more difficult
for the company to succeed. If a stock is facing headwinds, it means
it is encountering challenges that could potentially lower its value
in the near future.
On the other hand, a tailwind refers to any such factor that
could causally boost the company's growth or increase its
profitability. These could include things like favorable economic
conditions, beneficial regulatory changes, or a successful new product
launch. A stock with tailwinds is benefiting from positive conditions
or events that are causally responsible for an increase in its value
in the near future.
Headwinds and tailwinds should be unitary in nature, i.e., not
decomposable into more specific assertions. For example, “Company is
facing increasing competition, and is having difficulty hiring
critical talent” should be decomposed into two distinct
headwinds.
The factors that do not causally affect company performance cannot
be headwinds or tailwinds. Examples of factors that are not
headwinds or tailwinds:
- valuations (factors that lead to high or low valuations may
be candidate head/tail winds)
- observations of company performance improvement
- analyst perceptions -- e.g., uncertainty in an analyst's
predictions, or variances in their views
- one-off factors that affected performance in last quarter, but
will not have an impact in future quarters
A headwind / tailwind should have four well-defined attributes:
- materiality: possible values -- mild, medium, high. Measures
whether this factor is expected to have a material impact on
company performance. For instance, materiality is mild for
factors that will affect only a small portion of the company's
business.
- duration: possible values -- short (1-2 quarters), medium (3-5
quarters), long (more than 5 quarters). Measures the expected
duration of this factor.
- externality: possible values -- true, false or both. If the
factor is external to (not controlled by) the company, then the
externality is true. The externality is false if the company
controls the factor. Its value is both if the factor may be both
internally and externally controlled.
- obviousness: possible values -- low, medium, or high. Measures
how strongly an analyst believes in the effect of the
factor.
Exercise 3.7. Design a Company knowledge graph to support a business
intelligence dashboard.
The dashboard is to aggregate information from multiple sources
about a company to get better insight into its business, customers,
competitors, subsidiaries or parent organizations. Assume that the two
sources are Wikidata and Securities and Exchange Commission (SEC) filings.
Begin the process by reviewing the RDF schema for Company
(identifier Q783794) in Wikidata. Reuse as much of this schema as
necessary, and extend it as you see fit. Document your choices.
For SEC filings, use the following dataset from Kaggle:
https://www.kaggle.com/datasets/jamesglang/sec-edgar-company-facts-september2023. Extend
the schema you extracted from Wikidata in the previous step to
handle any new information that appears in this SEC dataset from
Kaggle.