1. Introduction
It is possible to get started with a knowledge graph with no
upfront design of its schema, and populate both its schema and
instances during the creation process. To the degree an upfront
design of a knowledge graph is practical, it can significantly
improve its usefulness. Such a design involves making a suitable
choice of the nodes, node labels, node properties, relations and
relation properties.
The input to knowledge graph population can come from one or more
sources consisting of structured data, semi-structured data, free text
or images, or from direct authoring by humans. When we are working
with structured and semi-structured data sources, we have to perform
a schema mapping task (i.e., relating the schema in the input source with
the schema of the knowledge graph) and a record linkage task (i.e.,
relating new instances with the pre-existing instances in the
knowledge graph). The same tasks are also faced during data
integration, the only difference being that here the integrated data is
expressed in a graph data model. When we are working with
unstructured sources, we have to solve the information extraction
problems of entity extraction and relation extraction.
The choice of methods used in knowledge graph population depends on
the scale of the problem and the desired accuracy. If a knowledge
graph is to be used at web scale for information retrieval, the
accuracy need not be perfect, and it is infeasible to use human
verification for every triple of the graph. If a knowledge graph is
to be used within an enterprise where the accuracy needs to be
nearly perfect, human verification is essential, even if it is
performed just before the information is to be used. As accuracy is
desired in both the enterprise and the WWW settings, there is an
emphasis on crowdsourcing and other low-cost methods of obtaining
human input to keep verification cost effective and scalable.
In this chapter, we will focus on knowledge graph schema design.
In the next two chapters we will discuss the problems that arise in
populating a knowledge graph from structured data, i.e., the problems
of record linkage and schema mapping, and the problems that arise
while populating from text, i.e., entity extraction and relation
extraction.
2. Knowledge Graph Design
Both property graph and RDF data models have a set of design issues,
some of which are common across the two, while others are
unique. For example, both models need to use reification for
situations that cannot be directly modeled using triples. An RDF
model needs to adopt a scheme for IRIs, which is not necessary for
property graphs. In a property graph model, we need to decide
whether a value should be represented as a property or as a node,
while this distinction is unnecessary in an RDF model. In this
section, we will present an overview of such design issues that are
faced in each of these two models.
2.1 Design of an RDF Graph
The knowledge graph authoring guidelines for RDF data on the WWW
are known as the linked data principles, which are outlined below.
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information, using
the standards (RDF, SPARQL).
- Include links to other URIs, so that they can discover more
things.
We will consider each of these guidelines in greater detail.
2.1.1 Use URIs as names for things
To publish a knowledge graph on the WWW, we first have to identify
the items of interest in our domain. They are the things whose
properties and relationships we want to describe in the graph. In WWW
terminology, all items of interest are called resources. The resources
are of two kinds: information resources and non-information
resources. All the resources we find on the traditional WWW, such as
documents, images, and other media files, are information
resources. But many of the things we want in our knowledge graph are
not: People, physical products, places, proteins, scientific concepts,
etc. As a rule of thumb, all "real-world objects" that exist outside
of the WWW are non-information resources.
The publishers of knowledge graphs should construct the URIs to be
shared in a way that they are simple, stable and manageable. Short,
mnemonic URIs will not break as easily when sent in emails and are in
general easier to remember. Once we set up a URI to identify a certain
resource, it should remain this way as long as possible. To ensure
long-term persistence, it is best to keep implementation-specific bits
and pieces such as ".php" and ".asp" out of the URIs. Finally, the
URIs should be defined in a way that they can be fully managed by the
publisher.
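As a minimal sketch of these guidelines, the following Python function builds a short, mnemonic URI from a human-readable name. The base namespace http://example.org/id/, the function name mint_uri and the slug rules are assumptions made for illustration only; they are not prescribed by the linked data principles.

```python
import re

# Hypothetical namespace that the publisher fully controls (an assumption).
BASE = "http://example.org/id/"

def mint_uri(kind: str, name: str, base: str = BASE) -> str:
    """Build a short, mnemonic URI with no implementation-specific suffix."""
    # Lowercase the name and collapse each run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{base}{kind}/{slug}"

print(mint_uri("department", "Cabinet Office"))
```

Because the URI is derived only from the kind of resource and its name, it does not change if the publishing technology behind it changes.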
2.1.2 Use HTTP URIs so that people can look up those names
We identify resources using Uniform Resource Identifiers (URIs). We
restrict ourselves to using HTTP URIs only and avoid other URI schemes
such as Uniform Resource Names (URNs) and Digital Object Identifiers
(DOIs).
The process of looking up names is referred to as URI dereferencing.
When we dereference a URI for an information resource, we expect to get
a representation of its current state (e.g., a text document, an image,
a video, etc.). When we dereference a URI for a non-information
resource, we can obtain its description in RDF expressed in an XML notation.
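One way a client might ask for the RDF description is to state the desired format in the HTTP Accept header. The sketch below only prepares such a request with the Python standard library and does not send it; the helper name build_lookup is hypothetical, and whether a given server actually answers with RDF/XML depends on that server.

```python
import urllib.request

def build_lookup(uri: str, want_rdf: bool = True) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP GET that dereferences a URI."""
    # application/rdf+xml is the media type of RDF expressed in XML.
    accept = "application/rdf+xml" if want_rdf else "text/html"
    return urllib.request.Request(uri, headers={"Accept": accept})

req = build_lookup("http://dbpedia.org/resource/Birmingham")
print(req.get_method(), req.full_url, req.get_header("Accept"))
# Sending the request would be: urllib.request.urlopen(req).read()
```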
2.1.3 When someone looks up a URI, provide useful information using RDF and SPARQL
When someone looks up a URI, the provider should return a knowledge
graph in RDF. The data should reuse standardized vocabularies to
name the IRIs used in describing the RDF data. Several useful
vocabularies are available for describing data catalogs,
organizations, and multidimensional data, such as statistics on the
Web. An open source effort called Schema.org publishes
community-created open source vocabularies for open use over the web. We
consider a few examples of such vocabularies.
The following RDF data describes a snippet of the organizational
structure of the UK Cabinet office.
@prefix uk_cabinet: <http://reference.data.gov.uk/id/department/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
uk_cabinet:co rdf:type org:Organization .
uk_cabinet:co skos:prefLabel "Cabinet Office" .
uk_cabinet:co org:hasUnit uk_cabinet:cabinet-office-communications .
uk_cabinet:cabinet-office-communications rdf:type org:OrganizationUnit .
uk_cabinet:cabinet-office-communications skos:prefLabel "Cabinet Office Communications" .
uk_cabinet:cabinet-office-communications org:hasPost uk_cabinet:post_246 .
uk_cabinet:post_246 skos:prefLabel "Deputy Director, Deputy Prime Minister's Spokesperson" .
In the data above, the first triple uses the
class org:Organization from the Organization ontology. The
second triple uses the relation skos:prefLabel drawn from the
SKOS ontology. SKOS stands for a Simple Knowledge Organization System,
and provides a few commonly useful relations such
as skos:prefLabel for describing data. In this
case, skos:prefLabel simply allows us to associate a text label
with uk_cabinet:co. The third triple uses the relation
org:hasUnit from the Organization ontology to describe a unit
within the UK Cabinet office. The next two triples make additional
assertions about this unit. The sixth triple uses
the org:hasPost relation to describe a position within a
department, and the final triple gives additional information
about that position.
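To make the role of the prefixes concrete, the following sketch stores the first three triples as Python tuples and expands a prefixed name into a full IRI. The helper names expand and objects are hypothetical; the org, skos and rdf namespace IRIs are the ones published by the W3C.

```python
PREFIXES = {
    "uk_cabinet": "http://reference.data.gov.uk/id/department/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "org": "http://www.w3.org/ns/org#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

def expand(name: str) -> str:
    """Expand a prefixed name such as skos:prefLabel into a full IRI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local

triples = [
    ("uk_cabinet:co", "rdf:type", "org:Organization"),
    ("uk_cabinet:co", "skos:prefLabel", "Cabinet Office"),
    ("uk_cabinet:co", "org:hasUnit", "uk_cabinet:cabinet-office-communications"),
]

def objects(subject: str, predicate: str) -> list:
    """Return the objects of all triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(expand("skos:prefLabel"))
print(objects("uk_cabinet:co", "skos:prefLabel"))
```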
It may not always be possible to find pre-existing vocabularies
that can be used in creating an RDF data set. If creating a new
vocabulary becomes necessary, one should ensure that it is
documented, self-describing, has a versioning policy, is defined in
multiple languages, and is published by a trusted source so that the
URIs used in it persist for a long period of time. We say that a
vocabulary is self-describing if each property or term has a label,
definition and comment defined.
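The self-describing requirement can be checked mechanically. The sketch below assumes a vocabulary held as a plain dictionary from term names to their documentation fields; this representation, the function name missing_documentation and the sample entries are all hypothetical.

```python
REQUIRED = ("label", "definition", "comment")

def missing_documentation(vocabulary: dict) -> dict:
    """Map each term to the documentation fields it lacks (empty counts as missing)."""
    gaps = {}
    for term, doc in vocabulary.items():
        missing = [field for field in REQUIRED if not doc.get(field)]
        if missing:
            gaps[term] = missing
    return gaps

# Hypothetical vocabulary entries, for illustration only.
vocab = {
    "hasUnit": {
        "label": "has unit",
        "definition": "Links an organization to one of its units.",
        "comment": "Illustrative entry.",
    },
    "hasPost": {"label": "has post"},
}
print(missing_documentation(vocab))
```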
2.1.4 Include links to other URIs, so that they can discover more things
When publishing data using RDF, one should provide links to other
objects so that its usefulness increases. There can be three kinds
of links: relationship links, identity links,
and vocabulary links. We will consider an example of each of
these kinds of links.
Relationship links point at related things in other data
sources such as other people, places or genes. For example,
relationship links enable people to point to background information
about the place they live, or to bibliographic data about the
publications they have written. In the triple below, we show a link
in which a person in one data set is asserted to be based near
a geographical location that is specified using a URI in another data set.
@prefix big: <http://biglynx.co.uk/people/> .
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
big:dave-smith foaf:based_near dbpedia:Birmingham .
Identity links point at URI aliases used by other data
sources to identify the same real-world object or abstract
concept. Identity links enable clients to retrieve further
descriptions about an entity, and serve an important social function
as they enable different views of the world to be expressed on the Web
of Data. It is a standard practice to use the link
type http://www.w3.org/2002/07/owl#sameAs to state that two URI
aliases refer to the same resource. For example, if Dave Smith would
also maintain a private data homepage besides the data that Big Lynx
publishes about him, he could add a
http://www.w3.org/2002/07/owl#sameAs link to his private data
homepage, stating that the URI used to refer to him in this document
and the URI used by Big Lynx both refer to the same real-world
entity. A triple capturing this information is shown below.
@prefix ds: <http://www.dave-smith.eg.uk#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix big: <http://biglynx.co.uk/people/> .
ds:me owl:sameAs big:dave-smith .
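A client that collects identity links from several sources may want to group all the aliases of one entity. The sketch below does this with a small union-find structure; the function name same_as_clusters and the ex: names are hypothetical and serve only to show that the grouping is transitive.

```python
def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links into sets of aliases."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the two alias sets

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())

links = [
    ("ds:me", "big:dave-smith"),
    ("big:dave-smith", "ex:d-smith"),   # hypothetical third alias
    ("ex:other", "ex:other2"),          # hypothetical unrelated entity
]
clusters = same_as_clusters(links)
print(sorted(sorted(c) for c in clusters))
```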
Vocabulary links point from data to the definitions of the
vocabulary terms that are used to represent the data, as well as from
these definitions to the definitions of related terms in other
vocabularies. Vocabulary links make data self-descriptive and enable
Linked Data applications to understand and integrate data across
vocabularies. In the vocabulary link shown below, the class SmallMediumEnterprise defined by Big Lynx is
declared to be a subclass of the class Company in the DBpedia ontology. By making
such a link, it is possible to retrieve various assertions about the class Company from
DBpedia, and use them with the class SmallMediumEnterprise.
@prefix dbpedia: <http://dbpedia.org/ontology/>
big:sme#SmallMediumEnterprise rdfs:subClassOf dbpedia:Company
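The effect of such a subclass link can be sketched as a simple traversal: every instance of the subclass is also treated as an instance of its superclasses. In the code below, the class ex:LegalEntity, the instance big:acme and the helper names are hypothetical and are included only to show that the traversal is transitive.

```python
def superclasses(cls, subclass_of):
    """Return every class reachable from cls via rdfs:subClassOf links."""
    seen, stack = set(), [cls]
    while stack:
        for parent in subclass_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

subclass_of = {
    "big:SmallMediumEnterprise": ["dbpedia:Company"],
    "dbpedia:Company": ["ex:LegalEntity"],   # hypothetical link
}
direct_type = {"big:acme": "big:SmallMediumEnterprise"}  # hypothetical instance

def all_types(instance):
    """The direct class of an instance plus all inherited superclasses."""
    cls = direct_type[instance]
    return {cls} | superclasses(cls, subclass_of)

print(sorted(all_types("big:acme")))
```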
2.2 Design of a Property Graph
The design of a property graph involves choosing nodes, node
labels, node properties, edges and edge properties. The basic
design questions are whether to model a piece of information as a
property, a label or a separate object; when to introduce relation
properties; and how to handle higher arity relationships. We will
illustrate the process of making these choices using examples.
2.2.1 Choosing Nodes, Labels and Properties
In a property graph model, the nodes usually represent entities in
the domain. If we are interested in representing information about
people, we create a node for each individual person (e.g., John),
and associate the label Person with that node.
There are several considerations in making further choices of node
labels, node properties and edges. These considerations include:
naturalness of labels, whether the labels might change over a period
of time, runtime query performance, and the cardinality of
values.
To illustrate the choice of whether to model a piece of information
as a label, property, or as a separate object, consider the task of
representing the gender of a person. We have three potential ways to
capture this information: (1) we can create :Male
and :Female as labels and associate them with the Person
nodes; (2) we can create a property called "gender", associate it
with Person nodes, and allow it to have the values "male" and
"female"; (3) we can create a Gender object, associate it
with Person using a has_gender relationship, and give it
a property called "name" that can take "male" and "female" as values.
The labels in a property graph model are used to group nodes into
sets. All nodes labeled with the same label belong to the same
set. Queries can work with these sets instead of the whole graph,
making queries easier to write and more efficient. A node may be
labeled with any number of labels, including none, making labels an
optional addition to the graph. As a label groups nodes into a set, it
can be viewed as a class. The question of whether to introduce a new
label can thus be restated as whether to introduce a new class.
Creating new classes Male and Female captures the same information
as introducing a node property "gender" that can take the two values
"male" and "female". In general, whenever a
phrase naturally occurring in language is frequently used in a
domain, it is a candidate to be introduced as a class as long as the
membership in the class does not change with time. As some
implementations optimize retrieval based on labels,
the use of labels can result in faster performance on queries that need to
filter the results based on membership in the class. If class membership
changes with time, neither a label nor a node property value is an
appropriate choice, and we need to use a relation. We will consider this in
the next section.
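The difference between a label lookup and a property filter can be sketched with a tiny in-memory graph. The class TinyGraph below is hypothetical and not the API of any graph engine: it keeps an index from each label to its set of nodes, so a label query reads one set, while a property query scans every node.

```python
from collections import defaultdict

class TinyGraph:
    """A toy property graph that indexes nodes by label (illustration only)."""

    def __init__(self):
        self.nodes = {}                     # node id -> property dict
        self.by_label = defaultdict(set)    # label -> set of node ids

    def add_node(self, node_id, labels=(), **props):
        self.nodes[node_id] = props
        for label in labels:
            self.by_label[label].add(node_id)

    def with_label(self, label):
        # Direct lookup in the label index.
        return set(self.by_label.get(label, ()))

    def with_property(self, key, value):
        # Full scan over all nodes.
        return {n for n, p in self.nodes.items() if p.get(key) == value}

g = TinyGraph()
g.add_node("john", labels=("Person", "Male"), gender="male")
g.add_node("mary", labels=("Person", "Female"), gender="female")
print(g.with_label("Male"), g.with_property("gender", "male"))
```

Both queries return the same nodes here; the two designs differ in how the answer is found, not in what information is captured.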
2.2.2 When to introduce Relationships between Objects
For situations that could be modeled either by using a node
property or by introducing a separate object and relationship, there
are at least two different considerations. The first of those
considerations was introduced in the previous section: the membership
in the class changes with time. The second consideration arises when
we wish to achieve better query performance. We will consider these
situations in greater detail next.
Continuing the example from the previous section, when the gender
of a person could change over a period of time, then our only choice
is to capture the information as a separate Gender object
that is related to Person using the has_gender
relationship. We can then associate a relationship property with
the has_gender relationship that indicates the time duration
for which that particular value of gender holds. Creating a
separate Gender node would, however, lead to a huge number of
edges which is wasteful as for most people the gender does not
change. In such a situation, a combination of the two solutions
might be desired where for most people the gender is represented as
a node property value, but for a small fraction of people, it is
represented as a relation property value on a relation to a
separate Gender node.
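The combined design can be sketched as a lookup that first consults time-qualified has_gender edges and falls back to the plain node property. The data, the tuple layout of an edge and the function name gender_on are assumptions made for this illustration.

```python
from datetime import date

# Most people carry a plain node property (hypothetical data).
node_props = {"john": {"gender": "male"}}

# A few carry time-qualified has_gender edges:
# (person, value, valid_from, valid_to), where valid_to=None means still valid.
has_gender = [
    ("alex", "male", date(1990, 1, 1), date(2015, 6, 30)),
    ("alex", "female", date(2015, 7, 1), None),
]

def gender_on(person, day):
    """Return the value valid on the given day, preferring time-qualified edges."""
    for who, value, start, end in has_gender:
        if who == person and start <= day and (end is None or day <= end):
            return value
    # No matching edge: fall back to the node property, if any.
    return node_props.get(person, {}).get("gender")

print(gender_on("alex", date(2000, 1, 1)), gender_on("john", date(2020, 1, 1)))
```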
Let us consider a situation where better query performance is a key
consideration. Suppose we wish to model movies, and their genres.
In one design, for a node of type Movie, we can introduce a
property "genre" that can take values such as "Action", "SciFi",
etc. In another design, we can introduce a new node
type Genre that has a node property "name" that can take
values such as "Action" and "SciFi". We will then relate a node of
type Movie with a node of type Genre using
the has_genre relationship. In general, we can associate more
than one genre with a movie. Suppose we wish to query for those
movies that have at least one common genre. In the first solution in
which we use the node property "genre", this query would be stated
in Cypher as follows:
MATCH (m1:Movie), (m2:Movie)
WHERE any(x IN m1.genre WHERE x IN m2.genre)
  AND m1 <> m2
RETURN m1, m2
When we model genre as a separate object, the same query can be stated as follows:
MATCH (m1:Movie)-[:has_genre]->(g:Genre),
      (m2:Movie)-[:has_genre]->(g)
WHERE m1 <> m2
RETURN m1, m2
In the second query above, we are able to make more direct use of
graph patterns, and in some graph engines, this query runs
faster because of indexing on relations. Hence, in this
case, one has to choose between the two designs depending on the kind
of queries that will be expected.
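The two designs can also be contrasted outside of a graph engine. In the sketch below, the first function compares every pair of movies through their list-valued property, while the second groups movies by genre node and only pairs movies inside each group. The movie names and function names are hypothetical, and unlike the Cypher queries above, each pair is returned once rather than in both orders.

```python
from collections import defaultdict
from itertools import combinations

# Design 1: genre stored as a list-valued node property (hypothetical data).
genre_property = {
    "MovieA": ["Action", "SciFi"],
    "MovieB": ["SciFi"],
    "MovieC": ["Drama"],
}

# Design 2: the same information as has_genre edges to Genre nodes.
has_genre = [(m, g) for m, genres in genre_property.items() for g in genres]

def pairs_by_property(props):
    """Compare every pair of movies and keep those whose genre lists overlap."""
    return {
        frozenset((a, b))
        for a, b in combinations(props, 2)
        if set(props[a]) & set(props[b])
    }

def pairs_by_genre_node(edges):
    """Follow edges to each Genre node and pair the movies that meet there."""
    movies_of = defaultdict(set)
    for movie, genre in edges:
        movies_of[genre].add(movie)
    return {
        frozenset(pair)
        for movies in movies_of.values()
        for pair in combinations(sorted(movies), 2)
    }

print(pairs_by_property(genre_property) == pairs_by_genre_node(has_genre))
```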
2.2.3 When to introduce Relationship Properties
We have already seen an example of a property associated with a
relationship to deal with situations when the relationship changes
with time. Other situations in which it makes sense to introduce
relationship properties include associating weights or confidence
with a relationship, or associating provenance or other metadata with
a relationship.
Some graph engines do not index based on relationship
properties. If the use case is such that much of the query evaluation
can be done without using the relationship properties, and they are
required only for final filtering of the results, one may not pay
significant performance penalty because of the lack of indexing. If access
to relationship properties is central to query performance, it is
better to reify the relation as we will discuss in the next
section.
2.2.4 Handling non-binary Relationships
We often need to model relationships that are not binary. A common
example of such a relationship is the between relationship which,
given objects A, B and C, captures that C
is between A and B. A standard approach to capturing
such higher arity relationships in a graph is reification. We
have previously discussed reification in the context of RDF,
but this technique is equally useful and desirable for property
graphs. To capture the between relationship we introduce a new
node type, Between_Relationship, that has two
properties: has_object (with values A and B)
and has_between_object (with value C). We can use
reification for relationships with any arity by creating a new node
type for the relation, and by introducing node properties for
the different arguments of that relation.
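A minimal sketch of this reification, using plain dictionaries in place of a graph engine, is shown below. The node identifier r1 and the helper names reify_between and between are hypothetical; the property names follow the text above.

```python
nodes = {}  # node id -> node (label plus properties)

def reify_between(rel_id, endpoints, middle):
    """Create a Between_Relationship node that holds all three arguments."""
    nodes[rel_id] = {
        "label": "Between_Relationship",
        "has_object": list(endpoints),       # the two outer objects
        "has_between_object": middle,        # the object lying between them
    }

def between(a, b):
    """Return the objects asserted to lie between a and b, in either order."""
    return [
        n["has_between_object"]
        for n in nodes.values()
        if n["label"] == "Between_Relationship" and set(n["has_object"]) == {a, b}
    ]

reify_between("r1", ("A", "B"), "C")
print(between("B", "A"))
```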
3. Summary
In this chapter, we considered the design of the graph data model
for both RDF and property graphs. The data model design concerns such
as whether to reify a relationship, handling non-binary relationships,
etc., are common across the RDF and the property graph data
models. The choice of whether to use a property vs a relation is
unique to the property graph data model. The RDF model provides
explicit guidelines on the use of IRIs, reuse of existing
vocabularies, and making links across vocabularies. Even though
data linking considerations are not integral to the property graph
model, their use can make a property graph system more useful in
data integration.