CS520 Knowledge Graphs
What should AI Know?

How to Create a Knowledge Graph?


1. Introduction

It is possible to get started with a knowledge graph with no upfront design of its schema. To the degree an upfront design of a knowledge graph is practical, it can significantly improve its usefulness. Such a design involves making a suitable choice of the nodes, node labels, node properties, relations and relation properties.

The data used to build or update a knowledge graph can come from many places: structured databases, semi-structured formats (like JSON or XML), unstructured sources such as plain text or images, or even information typed in directly by people.

When the data is structured or semi-structured, two main tasks are needed:

  • Schema mapping: matching the fields in the input data (e.g., columns in a table) to the classes and properties in the knowledge graph. For example, a table column called cust_name may need to be mapped to the name property of the Person node in a knowledge graph.
  • Record linkage: deciding whether a new piece of data refers to an existing entity in the graph or should be added as a new one. For example, if the input says “John Smith, born 1999”, we must decide whether this is the same John Smith already in the graph or a different person who should be added as a new node.

These tasks are the same ones you see in general data integration, except here the final result is stored in a graph format.
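
The two tasks can be illustrated with a minimal Python sketch. It assumes a toy in-memory graph (a dictionary of nodes), a hypothetical column named birth_yr alongside cust_name, and exact matching on name and birth year as the linkage rule. Real record linkage usually relies on fuzzy or probabilistic matching, so this is only an illustration of the two steps.

```python
# Hypothetical column-to-property mapping for one input table.
SCHEMA_MAP = {"cust_name": "name", "birth_yr": "birth_year"}


def map_record(row):
    """Schema mapping: rename input columns to knowledge graph properties."""
    return {SCHEMA_MAP[col]: value for col, value in row.items() if col in SCHEMA_MAP}


def link_or_add(graph, person):
    """Record linkage: reuse a node with the same name and birth year, else add one.

    Exact matching is a deliberate simplification made for this sketch.
    """
    for node_id, node in graph.items():
        if node["name"] == person["name"] and node["birth_year"] == person["birth_year"]:
            return node_id
    node_id = f"person/{len(graph) + 1}"
    graph[node_id] = person
    return node_id


graph = {"person/1": {"name": "John Smith", "birth_year": 1999}}
same = link_or_add(graph, map_record({"cust_name": "John Smith", "birth_yr": 1999}))
other = link_or_add(graph, map_record({"cust_name": "John Smith", "birth_yr": 1975}))
print(same, other, len(graph))
```

Here the first record is linked to the existing node, while the second John Smith, with a different birth year, is added as a new node.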

When the data is unstructured, such as natural-language text, we instead need information extraction methods, mainly entity extraction (finding people, places, organizations, etc.) and relation extraction (figuring out how those entities are connected). For example, from the sentence “Google acquired DeepMind in 2014,” entity extraction should detect Google and DeepMind as organizations, and relation extraction should detect that the relationship between Google and DeepMind is has acquired.
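
To make the expected output of extraction concrete, here is a toy Python sketch. It assumes a single hand-written pattern for sentences of the form "X acquired Y in YEAR" and a hypothetical relation name has_acquired. Practical systems use trained models rather than one regular expression, so this only shows the shape of the result.

```python
import re

# Toy pattern for sentences of the form "<Org> acquired <Org> in <year>".
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+) in (?P<year>\d{4})"
)


def extract_acquisition(sentence):
    """Return the entities and one relation triple, or None if nothing matches."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    buyer, target = match["buyer"], match["target"]
    return {
        "entities": [(buyer, "Organization"), (target, "Organization")],
        "relation": (buyer, "has_acquired", target),
        "year": int(match["year"]),
    }


result = extract_acquisition("Google acquired DeepMind in 2014,")
print(result["relation"])
```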

The methods used to populate a knowledge graph depend on how large the system is and how accurate the data needs to be. For very large, web-scale knowledge graphs used in search engines or information retrieval, it’s impossible to check every triple by hand, and the system can tolerate some errors. In contrast, enterprise knowledge graphs—such as those used inside a company for compliance, finance, or operations—often require very high accuracy, so human review becomes essential, even if it happens right before the data is used. Because accuracy is always important but full manual verification is expensive, many systems rely on crowdsourcing or other low-cost ways of getting human input to balance quality, cost, and scalability.

In this chapter, we will focus on knowledge graph schema design. In the next two chapters we will discuss the problems that arise in populating a knowledge graph from structured data, i.e., the problems of record linkage and schema mapping, and the problems that arise while populating from text, i.e., entity extraction and relation extraction.

2. Knowledge Graph Design

Both property graphs and RDF graphs come with design challenges—some shared by both models and others unique to each. For instance, both models sometimes need to use reification (i.e., turning a statement into a first-class object) when a fact cannot be expressed directly as a simple triple.

However, the models also differ in important ways. RDF requires a consistent scheme for creating and managing IRIs (global identifiers for entities), while property graphs do not. On the other hand, property graph designers must decide whether a piece of information should be stored as a property on a node or as a separate node connected by an edge—a choice that RDF largely avoids because RDF treats almost everything as a node with relationships.

In the rest of this section, we outline the key design issues encountered in each model, highlighting which ones are shared and which ones are specific to RDF or property graphs.

2.1 Design of an RDF Graph

The standard guidelines for creating and publishing RDF data on the Web are known as the Linked Data Principles. They describe how to name things, how to make those names accessible, and how to connect data across the Web. The principles are:

  1. Use URIs as names for things. (URIs uniquely identify resources; modern systems also support IRIs—Internationalized Resource Identifiers—which allow non-ASCII characters.) For example, the city of Paris can be identified by the URI http://dbpedia.org/resource/Paris
  2. Use HTTP URIs so that people and programs can look up those names. For example, typing the URI above in a browser or querying it with SPARQL returns useful information about Paris.
  3. When someone looks up a URI, provide useful information, using the Web standards (RDF, SPARQL).
  4. Include links to other URIs to help users discover additional related information. For example, the Paris URI might link to http://dbpedia.org/resource/France to indicate that Paris is a city in France.

We will consider each of these guidelines in greater detail.

2.1.1 Use URIs as names for things

To publish a knowledge graph on the Web, we first need to identify the key items in our domain—these are the things whose properties and relationships we want to capture in the graph. In Web terminology, all such items are called resources. Resources can be divided into two types: information resources and non-information resources.

  • Information resources are resources that exist on the Web itself, such as documents, images, videos, and other media files. For example, a Wikipedia page about the city of Paris is an information resource.
  • Non-information resources are things that exist in the real world but are not themselves web documents. These include people, physical products, places, proteins, scientific concepts, and other real-world objects. For example, the city of Paris or the Eiffel Tower are non-information resources, even though they may be described on the Web.

As a simple rule of thumb: anything that exists outside of the Web—what we usually think of as “real-world objects”—is considered a non-information resource.

Publishers of knowledge graphs should design URIs so that they are simple, stable, and easy to manage. Short, meaningful (mnemonic) URIs are less likely to break when shared—for example, in emails—and are easier for people to remember. Once a URI has been assigned to a resource, it should remain unchanged for as long as possible. To support long-term persistence, it is best to avoid including implementation-specific details such as “.php” or “.asp” in the URI. Finally, URIs should be constructed in a way that allows the publisher to control and maintain them over time.
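
As one possible way to follow these recommendations, the Python sketch below mints a short, mnemonic URI from a human-readable label. The namespace http://example.org/resource/ is a placeholder, and reducing labels to ASCII is a simplifying choice for this sketch (IRIs would allow the original characters to be kept).

```python
import re
import unicodedata

BASE = "http://example.org/resource/"  # placeholder namespace controlled by the publisher


def mint_uri(label):
    """Build a short, mnemonic URI with no file extension or query string."""
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_label).strip("_")
    return BASE + slug


print(mint_uri("Eiffel Tower"))
print(mint_uri("São Paulo"))
```

Because the URI is derived only from the label and a publisher-controlled namespace, it carries no implementation detail such as ".php" that might later change.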

2.1.2 Use HTTP URIs so that people can look up those names

We identify resources using Uniform Resource Identifiers (URIs). In practice, we use HTTP URIs exclusively and avoid other URI schemes, such as Uniform Resource Names (URNs) or Digital Object Identifiers (DOIs), because only HTTP URIs can be directly looked up and dereferenced on the Web. This means that when someone accesses an HTTP URI, the publisher can return useful information about the resource, making it easier to integrate, link, and reuse data across the Web.

For example, a DOI like doi:10.1000/xyz123 identifies a resource but cannot be dereferenced without a special resolver service. An HTTP URI like http://example.org/resource/Paris can be opened directly in a browser or queried by a machine, and it can return structured RDF data about the city of Paris.

The process of looking up a name on the Web is called URI dereferencing. When we dereference a URI that identifies an information resource, we expect to receive a direct representation of that resource—for example, a text document, an image, or a video. In contrast, when we dereference a URI that identifies a non-information resource (such as a person, place, or physical object), we cannot retrieve the object itself. Instead, we receive an RDF description that provides structured information about that resource.
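
A client typically signals that it wants an RDF description by sending an HTTP Accept header. The Python sketch below only builds such a request and does not send it. The choice of the text/turtle media type is an assumption, and whether a given server honors it depends on that server.

```python
from urllib.request import Request


def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request asking for an RDF representation."""
    return Request(uri, headers={"Accept": media_type})


req = rdf_request("http://dbpedia.org/resource/Paris")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup over the network.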

2.1.3 When someone looks up a URI, provide useful information using RDF and SPARQL

When someone looks up a URI, the publisher should return a knowledge graph encoded in RDF. This data should make use of standardized vocabularies so that the IRIs used in the RDF description are consistent and interoperable. Many well-established vocabularies exist for describing data catalogs, organizations, and multidimensional datasets such as statistical data on the Web. In addition, Schema.org, a large open-source community effort, provides widely used vocabularies for describing people, places, products, events, and many other Web resources. In the following section, we review several examples of these vocabularies.

The following RDF data describes a snippet of the organizational structure of the UK Cabinet office.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix uk_cabinet: <http://reference.data.gov.uk/id/department/> .
uk_cabinet:co rdf:type org:Organization .
uk_cabinet:co skos:prefLabel "Cabinet Office" .
uk_cabinet:co org:hasUnit uk_cabinet:cabinet-office-communications .
uk_cabinet:cabinet-office-communications rdf:type org:OrganizationalUnit .
uk_cabinet:cabinet-office-communications skos:prefLabel "Cabinet Office Communications" .
uk_cabinet:cabinet-office-communications org:hasPost uk_cabinet:post_246 .
uk_cabinet:post_246 skos:prefLabel "Deputy Director, Deputy Prime Minister's Spokesperson" .

In the data above, the first triple uses the class org:Organization from the Organization ontology. The second triple uses the relation skos:prefLabel drawn from the SKOS ontology. SKOS stands for Simple Knowledge Organization System, and provides a few commonly useful relations such as skos:prefLabel for describing data. In this case, skos:prefLabel simply allows us to associate a text label with uk_cabinet:co. The third triple uses the predicate org:hasUnit from the Organization ontology to describe a unit within the UK Cabinet Office. The next two triples make additional assertions about this unit. The sixth triple uses org:hasPost to describe a position within a department, and the final triple gives additional information about that position.

It is not always possible to find existing vocabularies suitable for creating an RDF dataset. When it becomes necessary to create a new vocabulary, certain best practices should be followed. The vocabulary should be well-documented, self-describing, have a versioning policy, support multiple languages, and be published by a trusted source to ensure that the URIs it defines remain stable over time. A vocabulary is considered self-describing if each term or property includes a label, a definition, and a comment explaining its meaning.
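
The self-describing criterion can be checked mechanically. The Python sketch below assumes each vocabulary term is held as a dictionary with the hypothetical keys label, definition, and comment, and reports which of them are missing. The sample term and its wording are illustrative only.

```python
REQUIRED_FIELDS = ("label", "definition", "comment")


def missing_fields(term):
    """Return the documentation fields that are absent or empty for one term."""
    return [field for field in REQUIRED_FIELDS if not term.get(field)]


# Hypothetical term description; the field values are illustrative only.
has_unit = {"label": "has unit", "definition": "Links an organization to one of its units."}
print(missing_fields(has_unit))
```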

2.1.4 Include links to other URIs, so that they can discover more things

While publishing data in RDF, it is important to provide links to other resources, as this significantly increases the usefulness and interconnectedness of data. These links can be categorized into three types: relationship links, identity links, and vocabulary links. We will consider an example of each of these kinds of links.

Relationship links connect resources to related things in other datasets, such as people, places or organizations. These links allow data to reference additional information, for example, linking a person to background information about their city or to bibliographic data about their publications. In the triple below, we illustrate a relationship link in which a person from one dataset is asserted to be based near a geographical location identified by a URI in another dataset.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix big: <http://biglynx.co.uk/people/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
big:dave-smith foaf:based_near dbpedia:Birmingham .

Identity links point at URI aliases used by other data sources to identify the same real-world object or abstract concept. Identity links enable clients to retrieve further descriptions of an entity, and serve an important social function, as they enable different views of the world to be expressed on the Web of Data. It is standard practice to use the link type http://www.w3.org/2002/07/owl#sameAs to state that two URI aliases refer to the same resource. For example, if Dave Smith also maintained a private data homepage in addition to the data that Big Lynx publishes about him, he could add a http://www.w3.org/2002/07/owl#sameAs link to his private data homepage, stating that the URI used to refer to him in that document and the URI used by Big Lynx both refer to the same real-world entity. A triple capturing this information is shown below.

@prefix ds: <http://www.dave-smith.eg.uk#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix big: <http://biglynx.co.uk/people/> .
ds:me owl:sameAs big:dave-smith .

Vocabulary links connect data to the definitions of the vocabulary terms used to describe it, and also link those definitions to related terms in other vocabularies. These links make data self-descriptive and enable applications to understand and integrate data across vocabularies. In the example below, the class SmallMediumEnterprise defined by Big Lynx is declared to be a subclass of the class Company in DBpedia. By establishing this link, it is possible to retrieve various assertions about the class Company from DBpedia and apply them to the class SmallMediumEnterprise, facilitating richer data integration and reasoning.

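@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/ontology/> .
@prefix sme: <http://biglynx.co.uk/vocab/sme#> .   # Big Lynx vocabulary namespace (assumed)
sme:SmallMediumEnterprise rdfs:subClassOf dbpedia:Company .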
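
The three kinds of links can be told apart largely by their predicate. The Python sketch below is a simplified heuristic under that assumption: owl:sameAs marks an identity link, rdfs:subClassOf and rdfs:subPropertyOf mark vocabulary links, and every other predicate is treated as a relationship link. The function name link_type is hypothetical.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
VOCABULARY_PREDICATES = {
    "http://www.w3.org/2000/01/rdf-schema#subClassOf",
    "http://www.w3.org/2000/01/rdf-schema#subPropertyOf",
}


def link_type(predicate):
    """Classify a cross-dataset triple by its predicate (a simplified heuristic)."""
    if predicate == OWL_SAME_AS:
        return "identity"
    if predicate in VOCABULARY_PREDICATES:
        return "vocabulary"
    return "relationship"


print(link_type("http://xmlns.com/foaf/0.1/based_near"))
print(link_type(OWL_SAME_AS))
print(link_type("http://www.w3.org/2000/01/rdf-schema#subClassOf"))
```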

2.2 Design of a Property Graph

The design of a property graph involves choosing nodes, node labels, node properties, edges and edge properties. The basic design questions are whether to model a piece of information as a property, as a label, or as a separate object; when to introduce relation properties; and how to handle higher arity relationships. We will illustrate the process of making these choices using examples.

2.2.1 Choosing Nodes, Labels and Properties

In a property graph model, the nodes usually represent entities in the domain. If we are interested in representing information about people, we create a node for each individual person (e.g., John), and associate the label Person with that node.

When making further design decisions about node labels, node properties, and edges in a property graph, several factors should be considered. These include the naturalness and clarity of labels, whether the labels are likely to change over time, the impact on runtime query performance, and the cardinality of property values.

To illustrate the choice of whether to model a piece of information as a label, property, or as a separate object, consider the task of representing the gender of a person. We have three potential ways to capture this information.

  • As labels: Create :Male and :Female as labels and associate them with the Person nodes.
  • As a property: Create a property called "gender", associate it with Person nodes, and allow it to have the values "male" and "female".
  • As a separate object: Create a Gender object, associate it with Person using a has_gender relationship, and give it a property called "name" that can take "male" and "female" as values.

Each approach has different implications for flexibility, query simplicity, and data modeling clarity.
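
The three options can be compared side by side using plain Python dictionaries as a stand-in for a property graph. The dictionary layout and the helper names are assumptions made for this sketch and do not correspond to any particular graph engine.

```python
# Design 1: gender as an additional label on the Person node.
john_with_label = {"labels": {"Person", "Male"}, "properties": {"name": "John"}}

# Design 2: gender as a node property.
john_with_property = {"labels": {"Person"}, "properties": {"name": "John", "gender": "male"}}

# Design 3: gender as a separate Gender node reached through a has_gender edge.
nodes = {
    "john": {"labels": {"Person"}, "properties": {"name": "John"}},
    "g_male": {"labels": {"Gender"}, "properties": {"name": "male"}},
}
edges = [("john", "has_gender", "g_male")]


def gender_from_label(node):
    """Read the gender from the node's labels."""
    if "Male" in node["labels"]:
        return "male"
    return "female" if "Female" in node["labels"] else None


def gender_from_property(node):
    """Read the gender from the node's properties."""
    return node["properties"].get("gender")


def gender_from_edge(node_id):
    """Follow the has_gender edge and read the name of the Gender node."""
    for source, relation, target in edges:
        if source == node_id and relation == "has_gender":
            return nodes[target]["properties"]["name"]
    return None


print(gender_from_label(john_with_label), gender_from_property(john_with_property), gender_from_edge("john"))
```

All three lookups return the same value. What differs is how much structure has to be traversed and how easily the representation can be extended later.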

The labels in a property graph model are used to group nodes into sets. All nodes labeled with the same label belong to the same set. Queries can work with these sets instead of the whole graph, making queries easier to write and more efficient. A node may be labeled with any number of labels, including none, making labels an optional addition to the graph. As a label groups nodes into a set, it can be viewed as a class. The question of whether to introduce a new label can be restated as whether to introduce a new class.

Creating new classes Male and Female conveys the same information as introducing a node property "gender" that can take the two values "male" and "female". In general, whenever a phrase naturally occurring in language is frequently used in a domain, it is a candidate to be introduced as a class, as long as membership in the class does not change over time. Because some implementations optimize retrieval based on node labels, using a label can also speed up queries that filter results by class membership. If class membership changes with time, neither a label nor a node property value is an appropriate choice, and we need to use a relation. We will consider this in the next section.

2.2.2 When to introduce Relationships between Objects

For situations that could be modeled either by using a node property or by introducing a separate object connected by a relationship, at least two considerations come into play. The first, discussed in the previous section, is whether class membership changes over time. The second consideration is query performance, as some modeling choices can make queries faster or more efficient. We will examine these considerations next.

Continuing the example from the previous section, if the gender of a person could change over a period of time, then the information should be represented as a separate Gender node connected to the Person node via a has_gender relationship. A relationship property can be used on the has_gender edge to indicate the time duration for which that particular gender value applies.

However, connecting every person to a separate Gender node could lead to a very large number of edges, which is inefficient because for most people the gender does not change. In such cases, a hybrid approach may be preferable: for most people, the gender is stored as a node property, while for the small subset of individuals whose gender changes over time, it is represented via a relationship to a Gender node.
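
A minimal Python sketch of the hybrid approach follows. It assumes made-up in-memory data in which has_gender edges carry hypothetical valid_from and valid_to values, and a lookup that prefers a dated edge and otherwise falls back to the node property.

```python
from datetime import date

# Made-up data: most people use the node property; a few have dated has_gender edges.
person_props = {"p1": {"gender": "female"}, "p2": {}}
has_gender_edges = [
    # (person, gender value, valid_from, valid_to); valid_to=None means still current
    ("p2", "male", date(1990, 1, 1), date(2015, 6, 30)),
    ("p2", "female", date(2015, 7, 1), None),
]


def gender_on(person_id, day):
    """Prefer a dated has_gender edge covering `day`; else use the node property."""
    for pid, value, start, end in has_gender_edges:
        if pid == person_id and start <= day and (end is None or day <= end):
            return value
    return person_props[person_id].get("gender")


print(gender_on("p1", date(2000, 1, 1)))
print(gender_on("p2", date(2000, 1, 1)))
print(gender_on("p2", date(2020, 1, 1)))
```

The cost of the hybrid design is that every query must handle both representations, as the fallback in gender_on illustrates.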

Let us consider a situation where better query performance is a key consideration. Suppose we wish to model movies and their genres. In one design, for a node of type Movie, we can introduce a property "genre" that can take values such as "Action", "SciFi", etc. In another design, we can introduce a new node type Genre with a "name" property that can take values such as "Action" and "SciFi". We then relate a node of type Movie to a node of type Genre using the has_genre relationship. In general, a movie can belong to more than one genre. Suppose we wish to query for pairs of movies that have at least one genre in common. In the first design, in which the node property "genre" holds a list of values, this query would be stated in Cypher as follows:

MATCH (m1:Movie), (m2:Movie)
WHERE any(x IN m1.genre WHERE x IN m2.genre)
AND m1 <> m2
RETURN m1, m2

When we model genre as a separate object, the same query can be stated as follows:

MATCH (m1:Movie)-[:has_genre]->(g:Genre),
      (m2:Movie)-[:has_genre]->(g)
WHERE m1 <> m2
RETURN m1, m2

In the second query above, we are able to more directly make use of graph patterns, and in some graph engines, this query has a faster runtime performance because of indexing on relations. Hence, in this case, one has to choose between the two designs depending on the kind of queries that will be expected.
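
The difference between the two designs can also be mimicked outside a graph engine. The Python sketch below uses made-up movie identifiers and genres: the first function compares the genre lists of every pair of movies, while the second groups movies by genre first, which plays the role of following has_genre edges to shared Genre nodes. Unlike the Cypher queries, both functions return each pair only once.

```python
from collections import defaultdict
from itertools import combinations

# Made-up toy data: movie identifier -> list of genres.
movies = {"M1": ["Action", "SciFi"], "M2": ["SciFi"], "M3": ["Drama"]}


def pairs_by_property(movies):
    """Design 1: compare the genre lists of every pair of movies."""
    return {
        (a, b)
        for a, b in combinations(sorted(movies), 2)
        if set(movies[a]) & set(movies[b])
    }


def pairs_by_genre_node(movies):
    """Design 2: group movies under each genre, then pair the movies in each group."""
    by_genre = defaultdict(set)  # stands in for Genre nodes and their has_genre edges
    for movie, genres in movies.items():
        for genre in genres:
            by_genre[genre].add(movie)
    return {
        pair
        for members in by_genre.values()
        for pair in combinations(sorted(members), 2)
    }


print(pairs_by_property(movies))
print(pairs_by_genre_node(movies))
```

The first function always examines every pair of movies, whereas the second only pairs movies that already share a genre, which is the intuition behind the performance difference.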

2.2.3 When to introduce Relationship Properties

We have already seen an example of using a property on a relationship to handle cases where the relationship changes over time. Other common reasons to attach properties to relationships include capturing weights or confidence scores, or recording provenance and other metadata.

Some graph engines do not index relationship properties. If the use case allows most queries to be evaluated without accessing these properties—using them only for final filtering—then the lack of indexing may not significantly impact performance. However, if accessing relationship properties is critical for query performance, it is better to reify the relationship as a separate node, which we will discuss in the next section.

2.2.4 Handling non-binary Relationships

We often need to model relationships that involve more than two entities. A common example is the between relationship that given objects A, B and C captures that C is between A and B. A standard approach to capturing such higher arity relationships in a graph is reification.

Although we have previously discussed reification in the context of RDF, this technique is equally useful and desirable for property graphs. To capture the between relationship, we introduce a new node type, Between_Relationship, that has two properties: has_object (holding values A and B) and has_between_object (holding value C). This approach can be generalized to relationships of any arity: create a new node type for the relationship and add a node property for each argument of that relation.
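
As a minimal sketch, reification can be written in Python with dictionaries standing in for nodes. The node layout and the function names reify_between and reify are assumptions made for illustration.

```python
def reify_between(node_id, a, b, c):
    """Create a Between_Relationship node stating that `c` lies between `a` and `b`."""
    return {
        "id": node_id,
        "label": "Between_Relationship",
        "properties": {"has_object": [a, b], "has_between_object": c},
    }


def reify(label, node_id, **arguments):
    """Generalization: one node per relationship instance, one property per argument."""
    return {"id": node_id, "label": label, "properties": dict(arguments)}


between = reify_between("r1", "A", "B", "C")
print(between["properties"])
```

The generic reify function covers relationships of any arity, with each keyword argument becoming one property of the new node.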

3. Summary

In this chapter, we considered the design of the graph data model for both RDF and property graphs. Many design concerns, such as whether to reify a relationship and how to handle non-binary relationships, are common to both models. The choice of whether to use a property or a relation is unique to the property graph data model.

The data publishing guidelines for RDF encourage the use of IRIs, reuse of existing vocabularies, and making links across vocabularies. While such data linking practices are not intrinsic to property graphs, adopting them can enhance a property graph’s interoperability and usefulness in data integration.

4. Further Reading

Further elaboration of the guidelines for publishing RDF data on the Web is available in a monograph [Heath & Bizer 2011]. A similar monograph is available for ontology design [Kendall & McGuinness 2019]. A course on the design of the Wikidata schema was recently offered [Wikidata Ontology Course 2025]. The course introduces the Wikidata ontology and covers numerous design challenges in creating it.

For designing property graph schemas, extensive documentation is available from graph database vendors such as Neo4j [Neo4j Data Modeling] and TigerGraph [TigerGraph Schema Design].

  • [Heath & Bizer 2011] Heath, Tom, and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space. 1st ed. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool, 2011.
  • [Kendall & McGuinness 2019] Kendall, Elisa F., and Deborah L. McGuinness. Ontology Engineering. Springer International Publishing, 2019. https://doi.org/10.1007/978-3-031-79486-5
  • [Wikidata Ontology Course 2025] Peter F. Patel-Schneider and Ege Atacan Doğan. “Wikidata: WikiProject Ontology/Ontology Course.” Last modified July 10, 2025. Accessed December 5, 2025. https://www.wikidata.org/wiki/Wikidata:WikiProject_Ontology/Ontology_Course
  • [Neo4j Data Modeling] Neo4j, Inc. n.d. “What Is Graph Data Modeling?” Accessed December 5, 2025. https://neo4j.com/docs/getting-started/data-modeling/
  • [TigerGraph Schema Design] TigerGraph. “Schema Design Guide.” GSQL Language Reference, 4.2. Accessed December 5, 2025. https://docs.tigergraph.com/gsql-ref/4.2/tutorials/schema-design-guide

Exercises

Exercise 3.1. Which of the following statements about knowledge graph design are true?
(a) As knowledge graphs are schema free, no design of the schema is required.
(b) Knowledge graphs are always created using automatic techniques.
(c) For many knowledge graph applications, a perfect accuracy is not a hard requirement.
(d) Knowledge graphs can contain undirected relationships.
(e) Knowledge graphs do not use keys and foreign keys as defined for the relational database systems.

Exercise 3.2. Which of the following is a good choice of an IRI for an RDF knowledge graph?
(a) ISBN-13 : 978-1681737225
(b) http://fcvcz.abt.co/mckz/
(c) https://www.wikidata.org/wiki/Q6135847
(d) http://worksheets.stanford.edu/homepage/index.php
(e) http://dbpedia.org/resource/Frederick_Loewe

Exercise 3.3. What type of link is captured by each of the following RDF statements? (Assume the following prefixes have been defined.)

@prefix dbpedia: http://dbpedia.org/resource/
@prefix bbc: http://www.bbc.co.uk/nature/species/
@prefix umbel-rc: https://umbel.org/umbel/rc/
@prefix foaf: http://foaf.org/

(a) dbpedia:Aardvark owl:sameAs bbc:Aardvark
(b) dbpedia:Lady_Gaga skos:broader_of dbpedia:Lady_Gaga_audio_samples
(c) dbpedia:Tetris foaf:isPrimaryTopicOf wikipedia-en:Tetris
(d) dbpedia:Person rdfs:subClassOf umbel-rc:Person
(e) dbpedia:Sky_Bank foaf:homepage http://www.skyebankng.com/

Exercise 3.4. Which of the following are good class labels in a knowledge graph?
(a) Customers with overdue accounts
(b) Australian Customers
(c) Customers with revenues between 5 to 10 million
(d) Customers who supply to recently funded startups
(e) High Networth Value Customers

Exercise 3.5. Which of the following requires reification for representing in a knowledge graph?
(a) John believes that life is good.
(b) John was referred to Peter by Mary.
(c) The effectiveness of a vaccine is 95%.
(d) Earth revolves around the Sun.
(e) On LinkedIn, John rated Peter for being an expert in AI.

Exercise 3.6. Design a property graph schema for a causal graph for investing. Your design should take into account the following domain knowledge.

In investing, headwind and tailwind are metaphors used to describe the factors that could cause a difference to the performance of a stock.

A headwind is any specific company, market or economic factor that could causally hinder a company's growth, or reduce its profitability, in the near future. These could include things like increased competition, regulatory changes, unfavorable economic conditions, or any other causal factor that makes it more difficult for the company to succeed. If a stock is facing headwinds, it means it is encountering challenges that could potentially lower its value in the near future.

On the other hand, a tailwind refers to any such factor that could causally boost the company's growth or increase its profitability. These could include things like favorable economic conditions, beneficial regulatory changes, or a successful new product launch. A stock with tailwinds is benefiting from positive conditions or events that are causally responsible for an increase in its value in the near future.

Headwinds and tailwinds should be unitary in nature, i.e., not decomposable into more specific assertions. For example, "Company is facing increasing competition, and is having difficulty hiring critical talent" should be decomposed into two distinct headwinds.

The factors that do not causally affect company performance cannot be headwinds or tailwinds. Examples of factors that are not headwinds or tailwinds:

  • valuations (factors that lead to high or low valuations may be candidate head/tail winds)
  • observations of company performance improvement
  • analyst perceptions -- e.g., uncertainty in an analyst's predictions, or variances in their views
  • one-off factors that affected performance in last quarter, but will not have an impact in future quarters

A headwind / tailwind should have four well-defined attributes:

  • materiality: possible values -- mild, medium, high. Measures whether this factor is expected to have a material impact on company performance. For instance, materiality is mild for factors that will affect only a small portion of company business.
  • duration: possible values -- short (1-2 quarters), medium (3-5 quarters), long (more than 5 quarters). It measures expected duration of this factor.
  • externality: possible values -- true, false or both. If the factor is external to (not controlled by) the company, then the externality is true. The value is false if the company controls the factor. The value is both if the factor may be both internally and externally controlled.
  • obviousness: possible values -- low, medium, or high. Measures how strongly an analyst believes in the effect of the factor.

Exercise 3.7. Design a Company knowledge graph to support a business intelligence dashboard.

The dashboard is to aggregate information from multiple sources about a company to get better insight into its business, customers, competitors, subsidiaries or parent organizations. Assume that the two sources are Wikidata and Securities and Exchange Commission (SEC) filings.

Begin the process by reviewing the RDF schema for Company (identifier Q783794) in Wikidata. Reuse as much of this schema as necessary, and extend it as you see fit. Document your choices.

For SEC filings, use the following dataset from Kaggle: https://www.kaggle.com/datasets/jamesglang/sec-edgar-company-facts-september2023. Extend the schema you had extracted from Wikidata in the previous step to handle any new information that appears in this SEC dataset in Kaggle.