CS520 Knowledge Graphs
What
should AI
Know ?
 

What are some Graph Data Models?


1. Introduction

Two popular graph data models are Resource Description Framework (RDF), and the property graph (PG) model. The query language for RDF is SPARQL, and the query language for the property graph model is Cypher. In this chapter, we present an informal overview of both of these data models and give example queries for them. This chapter introduces the two models without giving a comprehensive technical overview. We consider translation of data represented using one of the models to data represented in the other, and also compare these graph data models to using a conventional relational data model.

2. Resource Description Framework

RDF is a framework for representing information on the web. The RDF data model and its query language SPARQL have been standardized by the World Wide Web Consortium.

2.1 RDF Data Model

An RDF triple, the basic unit of representation in this model, consists of a subject, a predicate, and an object. A set of such triples is called an RDF graph. We can visualize an RDF triple as a node and a directed edge diagram in which each triple is represented as a node-edge-node graph as shown below.

An RDF graph with two nodes (Subject and Object) and a triple connecting them (Predicate)

There can be three kinds of nodes: IRIs, literals, and blank nodes. An IRI is an Internationalized Resource Identifier used to uniquely identify resources on the web. A literal is a value of a certain data type, for example, string, integer, etc. A blank node is a node that does not have an identifier, and is similar to an anonmyous or an existential variable.

As an example illustration of the information expressed using RDF, let us take the example of representing knows relationship between people. In this example, a person with name art is denoted by the IRI http://example.org/art. In the notation used below the IRIs can be abbreviated by defining a prefix. For example, we have defined foaf as a prefix for <http://xmlns.com/foaf/0.1/>. The relation knows is defined by the IRI http://xmlns.com/foaf/0.1/knows

@prefix foaf: <http://xmlns.com/foaf/0.1/>
@prefix ex: <http://example.org/>
ex:art foaf:knows ex:bob
ex:art foaf:knows ex:bea
ex:bob foaf:knows ex:cal
ex:bob foaf:knows ex:cam
ex:bea foaf:knows ex:coe
ex:bea foaf:knows ex:cory
ex:bea foaf:age 23
ex:bea foaf:based_near _:o1

The last two triples illustrate a literal node and a blank node. The value of foaf:age is the integer 23 which is an example of a literal. A string is another common literal data type. The value of foaf:based_near is an anonymous spatial thing which is indicated by an underscore as a node identifier. Here o1 is an internal data identifier which has no significance outside the context of the present graph.

An RDF vocabulary is a collection of IRIs intended for use in RDF graphs. The IRIs in an RDF vocabulary often begin with a common substring known as a namespace IRI. In the example above, <http://xmlns.com/foaf/0.1/> is a name space prefix. Some namespace IRIs are associated by convention with a short name known as a namespace prefix. In the example above, we defined foaf and ex as name space prefixes.

RDF graphs are atemporal in the sense that they provide a static snapshot of data. With suitable vocabulary extension, they can express information about events, or other dynamic properties of entities.

An RDF dataset is a collection of RDF graphs and comprises exactly one default graph that can be empty and does not need to have a name, and one or more named graphs. Each named graph consists of an IRI or a blank node that represents its name and the RDF graph.

2.2 SPARQL Query Language

SPARQL (pronounced "sparkle", a recursive acronym for Simple Protocol and RDF Query Language) is a query language to retrieve and manipulate data stored in the Resource Description Framework (RDF). SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware. SPARQL contains capabilities for querying required and optional graph patterns along with their conjunctions and disjunctions. SPARQL also supports extensible value testing and constraining queries by source RDF graph. The results of SPARQL queries can be sets or RDF graphs.

Most forms of a SPARQL query contain a set of triple patterns called a basic graph pattern. Triple patterns are like RDF triples except that each of the subject, predicate and object may be a variable. A basic graph pattern matches a subgraph of the RDF data when RDF terms from that subgraph may be substituted for the variables in the graph pattern and the result is the RDF graph equivalent to the sub graph.

The example below shows a SPARQL query against the data graph shown above and queries for the people known by a person. The query consists of two parts: the SELECT clause identifies the variables to appear in the query results, and the WHERE clause provides the graph pattern to match against the data graph. The graph pattern in this example consists of a single triple with a single variable (?person) in the object position.

SELECT ?person
WHERE
<http://example.org/art> <http://xmlns.com/foaf/0.1/knows> ?person

Above query returns the following result set on our data graph.

?person1
<http://example.org/bob>
<http://example.org/bea>

The graph pattern can contain multiple triples. For example, in the query below, we ask for art's friends of friends.

SELECT ?person ?person1
WHERE
<http://example.org/art> <http://xmlns.com/foaf/0.1/knows> ?person
?person <http://xmlns.com/foaf/0.1/knows> ?person1

Above query returns the following result set on our data graph.

?person1?person2
<http://example.org/bob><http://example.org/cal>
<http://example.org/bob><http://example.org/cam>
<http://example.org/bea><http://example.org/coe>
<http://example.org/bea><http://example.org/cory>

Each solution gives one way in which the selected variables can be bound to RDF terms so that the query pattern matches the data. The result set gives all the possible solutions. In the above example, two different subsets of the data provided the matches that resulted in the answers. Above examples illustrate a basic graph pattern match; all the variables used in the query pattern must be bound in every solution.

The SPARQL queries can return blank nodes in the result. The identifiers for blank nodes used in the result may not be the same as the one used in the original RDF graph. The WHERE class in a SPARQL query provides a way to match against specific literal types and to filter the results based on numerical constraints.

The SPARQL queries have various forms. The SELECT form that we have considered until now returns the variable bindings. The CONSTRUCT form can be used to create results that define an RDF graph. The queries can also specify more than one graph patterns such that all of them or some of them have to match against the RDF data. The query results can also be further processed by providing directives to order them, eliminate duplicate results, or to limit the total number results returned.

3. Property Graphs

Proeprty graphs data model is used by many popular graph database systems. Unlike RDF that was explicitly motivated by a need to model information on the web, the graph database systems handle general graph data. Graph database systems distinguish themselves from traditional relational databases with little reliance on a predefined schema, and the optimization of operations that involve graph traversals. In this section, we will consider the property graph data model and the Cypher language that is used to query it.

3.1 Property Graph Data Model

A property graph data model consists of nodes, relationships and properties. Each node has a label, and a set of properties in the form of arbitrary key-value pairs. The keys are strings and the values are arbitrary data types. A relationship is a directed edge between two nodes, has a label, and can have a set of properties.

In the property graph shown below art and bea are two nodes. The node for bea has two properties: age and based_near. These two nodes are connected by an edge labeled as knows. This edge has the property since that indicates the year from which art and bea have known each other.

While defining a property graph data model, one has to decide the nodes, the edges, and the properties. For example, one can question why not to represent a city also as a node, and create an edge labeled as based_near between a person and the city instead of making them a property of the node representing a person. In general, any value that may be related to multiple other nodes in the graph such that there is either an application requirement to traverse those relationships efficiently, or we need to associate addiational properties with how it is related to other nodes, should itself be represented as a node. In this example, if we intend to traverse the based_near relationships, the following design would be more appropriate. Furthermore, it allows us to associate any additional properties with the based_near relationship.

3.2 Cypher Query Language

Cypher is a language for querying data represented in a property graph data model. The design concepts from Cypher are being considered for adoption into an ISO standard for a graph query language. In addition to querying, Cypher can also be used to create, update and delete data from a graph database. In this section, we will introduce only the query capabilities of Cypher.

The example below shows the Cypher query against the data graph considered earlier and queries for the the persons known by art. The query consists of two parts: the MATCH clause specifies a graph pattern that should match against the data graph and the RETURN clause specifies what should the query return. The graph pattern is specified in an ASCII notation for graphs: each node is written in parentheses, and each edge is written as an arrow. Both node and relation specifications include their respective types, and any additional properties that should be matched.

MATCH (p1:Person {name: art}) -[:knows]-> (p2: Person)
RETURN p2

In the example below, we show the Cypher query that asks for all the friends of a person that have existed since 2010.

MATCH (p1:Person {name:art}) -[:knows {since: 2010}]-> (p2: Person)
RETURN p2

From the above query, we can see that it is equally easy to associate properties with relations as it is with nodes. A person may have friends from years before 2010, and if we wanted the query to include those friends as well, it can be done by adding a WHERE clause.

MATCH (p1:Person {name:art}) -[:knows {since: Y}]-> (p2: Person)
WHERE Y <= 2010
RETURN p2

Through the WHERE clause, it is possible to specify a variety of filtering constraints as well as patterns that can be used to restrict the query results. In addition, Cypher provides language constructs for counting results, grouping data by values, and finding minimum/maximum values, and other mathematical and aggregation operations.

4. Comparison of Data Models

In this section, we will start off by comparing the RDF and the property graph data models. We will then compare both of them to relational data model.

4.1 Comparison of RDF and Property Graph Data Models

Beyond the features of RDF considered in the previous section, it has several additional layers, for example, RDF schema, Web Ontology Language (OWL), etc. Our discussion here will not consider those advanced features. The primary differences between the basic RDF model and the property graph model are that: (a) the property graph model allows edges to have properties (b) the property graph model does not require IRIs and does not support blank nodes. To support the edge properties, the RDF model supports an extension known as reification. We will consider this extension, and then describe different ways in which the data represented in either data model can be converted into the other format.

To understand reification in RDF, consider a situation in which we need to represent the provenance of the triple shown below. This triple asserts the weight of an item. The literal "2.4"^^xsd:decimal denotes the number 2.4 which is of type xsd:decimal. We are interested in specifying the person who took this measurement.

exproducts:item10245 exterms:weight "2.4"^^xsd:decimal

We can associate provenance information with the above triple using the RDF reification vocabulary. The RDF reification vocabulary consists of the type rdf:Statement, and the properties rdf:subject, rdf:predicate, and rdf:object. Using the reification vocabulary, a reification of the statement about the weight of the item would be given by assigning the statement an IRI such as exproducts:triple12345 (so statements can be written describing it), and then describing the statement as shown below. The last triple in the list below specifies the desired provenance information by asserting the identifier for the person who created the original triple.

exproducts:triple12345 rdf:type rdf:Statement .
exproducts:triple12345 rdf:subject exproducts:item10245 .
exproducts:triple12345 rdf:predicate exterms:weight .
exproducts:triple12345 rdf:object "2.4"^^xsd:decimal .
exproducts:triple12345 dc:creator exstaff:85740 .

These statements say that the resource identified by the IRI exproducts:triple12345 is an RDF statement, that the subject of the statement refers to the resource identified by exproducts:item10245, the predicate of the statement refers to the resource identified by exterms:weight, and the object of the statement refers to the decimal value identified by the typed literal "2.4"^^xsd:decimal. The final statement asserts that exproducts:triple12345 was provided by the person with the IRI exstaff:8574.

With the above reification vocabulary, it becomes possible to mechanically translate the data in the property graph model to RDF. Each node and its property value in the property graph data becomes a triple. Each edge in property graph data also becomes an RDF triple. Every edge in the property graph data that has a property is reified, and the properties of the edge become the triples of the reified edge that use the reification vocabulary as explained above.

To translate data expressed in the RDF model to the property graph model, a straightforward approach is to map each node and an edge to the corresponding node and an edge in the property graph. A possible refinement is that we create new property nodes only for those nodes that are either IRIs or blanks nodes. For any triple in RDF in which the target is a literal, we make it a property of the node in the property graph data.

In addition to converting data between RDF and property graph models, we are also interested in converting the syntactic form of data and the queries. For property graph model, there is no syntatic standard for their expression, and therefore, a custom translator needs to be written for the format one is working with. Once a translation scheme is fixed between the two data models, the corresponding translation between SPARQL and Cypher is straightforward.

4.2 Comparison of Graph Models and Relational Data Model

We can define a translation to and from the data expressed using relational model to data expressed using the RDF model and the property graph model. Some argue that the graph models are easier for humans to understand and that the graph query languages are more compact for certain queries. In principle, we can implement a user interface to visualize the relational schemas, and implement a query compiler that can map a query written in a graph query language into an equivalent form that operates on the relational tables. If an application requires navigating relationships, a graph database has an edge as it is optimized for graph traversals. For the rest of the section, we will consider an example to illustrate how the graph queries can be more compact than the corresponding relational queries, and conclude by mentioning the systems that attempt to support graph processing on a relational system.

To understand the contrast between graph queries and relational queries, we will consider a simple example in which we have three tables: an Employee table, a Department table, and an Employee-Department join table. An employee can be associated with multiple departments because of which they are stored in separate tables. Two tables are related with a join table that contains their foreign keys employee id and department id. We show these tables below.

Employee
idnamessn
e01alice...
e02bob...
e03charlie...
e04dana...
Employee_Department
employee iddepartment id
e01d01
e01d02
e02d01
e03d02
e04d03
Department
idnamemanager
d01IT...
d02Finance...
d03HR...

Given the tables as shown above, suppose we wish to list the employees in the IT department. The SQL query to perform this task will first need to join the employee and the department tables, and then filter the results on the department name. The required query is shown below.

SELECT name FROM Employee
LEFT JOIN Employee_Department
  ON Employee.Id = Employee_Department.EmployeeId
LEFT JOIN Department
  ON Department.Id = Employee_Department.DepartmentId
WHERE Department.name = "IT"

If we were to represent the same information using a property graph data model, we will have a node for department and employee. The employee ssn and department name will be the node properties. The Employee_Department table will be captured using a relationship in the property graph representation. If the Employee_Department table had additional attributes, they will be represented as edge properties in the property graph data model. A sample node in such a property graph is shown below.

We can query this data using the following Cypher query:

MATCH (p:Employee) -[:works_in]-> (d:Department)
WHERE d = "IT"
RETURN p

The Cypher query above is much more compact than its SQL counterpart. This compactness stems from the fact that the joins are naturally captured using graph patterns.

There have been some recent systems that represent the relational data in a schema free manner by representing each node property as a triple in one table, and each edge property as a four tuple in a second table. Such systems provide a query planner that accepts queries in a language like Cypher that computes an efficient execution plan over the two relational tables. Such systems are able to leverage the existing relational technology, and are also able to perform optimizations when some of the legacy data is in a traditional relational table.

5. Limitations of a Graph Data Model

A graph data model is not the most appropriate choice when the application contains primarily numeric data, and the reliance on only binary relationships is limiting. For example, the relational model is more effective in capturing timeseries data such as evolution of the population of a country. Even though we can represent such data using a graph, but it results in a huge number of triples without necessarily giving us advantages of better conceptual understanding and/or faster query performance through graph traversals. There are many relationships that cannot be naturally represented using binary relations. For example, between relation that captures that an object A is between two other objects B and C is inherently a ternary relationship. A ternary relationship can be transformed into a set of binary relationships using the reification technique, but by doing so, we lose the advantage of better conceptual understanding that we get from the graph data model. Graphs are also not the most natural representation for mathematical equations and chemical reactions where easy to understand domain specific representations exist.

6. Summary

In this chapter, we reviewed two popular graph data models: RDF and property graphs. RDF was devised for representing information over the web, and makes an extensive use of IRIs. The property graph model is a popular choice in many graph database systems, and provides a direct support for associating properties with both nodes and edges. Even though there are small differences between the two models, it is possible to inter-translate the data represented in one to the other. SPARQL is the query language for accessing data in RDF, and Cypher is the corresponding language for the data represented in property graphs. In a graph query language, queries requiring traversals are much more compact in comparison to an equivalent formulation in a relational data model. A graph data model can also provide a better user understanding of the knowledge in the subject domain. There are some systems that use a relational database as the storage for graph data and provide query optimizers to still allow queries to be expressed in a graph query language. Finally, a graph data model offers significant advantages for application that have rich relationships between objects, and require extensive traversal of those relationships.