Open Credo

January 31, 2022 | Blog, Data Engineering

Making Sense of Data with RDF* vs. LPG

There are two camps of graph database. On one side is RDF, which is strict about its format and somewhat limited in its extensibility. On the other side is LPG, which lets users define labels on relationships. With its recent extension, RDF now allows users to add properties, thus becoming RDF*. In this blog, Ebru explores the structural and performance differences between LPG and RDF*.

WRITTEN BY

Ebru Cucen


We want machines not only to communicate with each other but also to understand each other correctly. To achieve this ambitious goal, the Resource Description Framework (RDF) and the Property Graph (PG) model emerged in the 2000s as two popular approaches to the graphical representation of data. We will focus on the Labelled Property Graph (LPG), a specific implementation of the Property Graph model, as it is widely used and accepted.

RDF is a product of the early-2000s standardisation of data exchange on the web, inspired by the notion of the Semantic Web. The Semantic Web was conceived as an extension of the standard web that keeps HTTP as the protocol for data exchange and adds RDF-type schemas for storing and retrieving information and for uniquely identifying relationships between documents. The goal of this standard was to enable the representation, storage, and exchange of resources, especially metadata, on the web in a graph-native way.

LPG was developed by a group of Swedish engineers as a graph model for storing Enterprise Content Management (ECM) system data. The main motivation behind the invention of LPGs was to enable efficient storage, fast querying, and fast traversal of graph data; it also served to extend data nodes with arbitrary attributes and fields for valuable information about that data. This approach has been popularised by graph database platforms such as Neo4j.

Throughout the 2000s and 2010s, LPG and RDF were regarded as two opposite approaches to graph modelling, the latter emphasising atomic decomposition of data suitable for ontology creation and data exchange, and the former focusing on efficient deep-data traversals, path analysis, and storage of arbitrary graph data. 

More recently, RDF-star (RDF*) has emerged as the extension of the RDF standard that seeks to resolve the limitations of the classic RDF model. The emergence of RDF* opens up new possibilities for graph design and new applications for RDF in data science and machine learning. 

In this article, we’ll compare the innovations of RDF* to the traditional RDF and LPG models and discuss key advantages of RDF* and the use cases it enables. We’ll also describe applications and use cases where LPG is still a preferable approach for graphical data design. 

RDF Approach

In the classic RDF model, data is represented as statements, each of which is a triple in the format <subject node> -> <predicate> -> <object node>. An example would begin <Bob> -> <sends>, followed by the object node. The <subject node> has a directional relationship with the <object node> via the <predicate>, which is sometimes called a property. Each resource is uniquely identified by an IRI, a generalisation of URLs that allows a wider range of Unicode characters, and an object node that contains a value may be represented as a literal.

Since RDF nodes are represented as simple IRIs, the RDF model lacks any internal structure, that is, there are no key-value pairs or attributes defined at the node level. All nodes in the RDF are globally unique, which makes it hard to model recurrent events, data provenance, and other attribute-rich representations. 
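To make the triple model concrete, here is a minimal Python sketch that treats a graph as a set of (subject, predicate, object) tuples. The `http://example.org/` namespace and the `Triple` and `objects` names are illustrative assumptions, not part of any RDF library. Because the graph is a set of globally unique statements, adding the same statement twice leaves it unchanged, which hints at why recurrent events are awkward to model.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # IRI of the resource being described
    predicate: str  # IRI of the property linking subject to object
    obj: str        # IRI of another resource, or a literal value


# Hypothetical namespace used only for this illustration.
EX = "http://example.org/"

graph = set()
graph.add(Triple(EX + "Elon_Musk", EX + "founded", EX + "SpaceX"))
# Adding an identical statement again has no effect: the set still holds one triple.
graph.add(Triple(EX + "Elon_Musk", EX + "founded", EX + "SpaceX"))


def objects(graph, subject, predicate):
    """Return every object linked to `subject` via `predicate`."""
    return {t.obj for t in graph if t.subject == subject and t.predicate == predicate}


founded = objects(graph, EX + "Elon_Musk", EX + "founded")
```

Note that the nodes here are bare strings: there is nowhere on the node itself to hang a key-value pair, which is the limitation described above.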

As you may see in the example below, to define a relationship attribute, we need to make a new statement for each of “startDate” and “source”. They are modelled as separate nodes in the RDF model. In contrast, in the LPG approach, they can be modelled as attributes of the relationship itself. This makes LPG much more compact than RDF. RDF practitioners have developed several workarounds for this limitation. For example, one way to add attributes to a subject or predicate is to create intermediate nodes pointing to other nodes that serve as attributes. This method works reasonably well but makes the graph structure more complex and less intuitive. It also increases the storage footprint of the RDF model.

 << <<:Elon_Musk :founded :SpaceX>> :startDate "2002-03-14"^^xsd:date >> :source <https://en.wikipedia.org/wiki/SpaceX> .

The same problem can also be addressed via the reification workaround, which involves creating a metagraph on top of the main graph to represent the attribute. In this case, updates have to be made to both the graph and the metagraph, which decreases write performance in RDF-based graphs. Also, as mentioned in “Foundations of an Alternative Approach to Reification in RDF”, adding extra reification triples is inefficient for RDF data exchange and management and requires more complicated queries. This is because each expression in the query has to be accompanied by another subexpression for the corresponding reification triples.
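The overhead of reification can be sketched in a few lines of Python. The example below uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`); the `http://example.org/` namespace, the statement identifier, and the `reify` helper are hypothetical names chosen for illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace


def reify(statement_id, triple, annotations):
    """Describe `triple` with four reification triples, then attach annotations."""
    s, p, o = triple
    triples = [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", s),
        (statement_id, RDF + "predicate", p),
        (statement_id, RDF + "object", o),
    ]
    # Every attribute is one more triple hanging off the statement node.
    triples += [(statement_id, key, value) for key, value in annotations.items()]
    return triples


base = (EX + "Elon_Musk", EX + "founded", EX + "SpaceX")
reified = reify(
    EX + "statement1",
    base,
    {
        EX + "startDate": '"2002-03-14"',
        EX + "source": "https://en.wikipedia.org/wiki/SpaceX",
    },
)
```

In this sketch, one statement with two attributes turns into six triples, and any query that wants the attributes has to match the four reification triples as well.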

Yet another solution is adding a graph name to each RDF triple, which turns it into a quad. The triples that share a name form a named graph, and a store that holds quads is known as a ‘quad store’. Quad stores allow the same statement to exist in different graphs, enabling important improvements over triple stores, such as data lineage and access control.
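A rough Python sketch of the quad idea follows. The graph names and the `graphs_containing` helper are made up for illustration and do not come from any particular quad store.

```python
EX = "http://example.org/"  # hypothetical namespace

statement = (EX + "Elon_Musk", EX + "founded", EX + "SpaceX")

# Each quad is a triple plus the name of the graph it belongs to.
# Both graph names below are invented for this example.
quads = {
    statement + (EX + "graph/wikipedia",),
    statement + (EX + "graph/internal_review",),
}


def graphs_containing(quads, triple):
    """Return the names of all graphs in which `triple` appears."""
    return {g for (s, p, o, g) in quads if (s, p, o) == triple}


sources = graphs_containing(quads, statement)
```

The same statement is stored twice, once per graph, so a lineage or access-control question becomes a lookup on the fourth element.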

In general, the common limitations of classic RDF graph models mentioned by graph practitioners are storage inefficiency, inability to represent complex relationships and recurrent events, and slow query and data-traversal times. At the same time, the RDF approach is great for data exchange, the incorporation of disparate datasets, and the creation of data ontologies, which we’ll discuss below. 

LPG Approach 

In contrast to the standard RDF, nodes in labelled property graphs have both IDs and a set of key-value pairs or attributes/properties. LPG edges/connections can have types and attributes (properties as the name suggests) natively, making the LPG data structure more dense, compact, and informative compared to RDF. 

The rich internal structure of LPGs results in more efficient storage and faster data traversals and queries. At the same time, however, due to the arbitrary data structure design, LPGs are not as practical for modelling ontologies and other structured data representations as RDF models. 

Another advantage of LPGs over RDF is their ability to uniquely identify instances of relationships, which is not possible in traditional RDF without workarounds. Together with their rich internal structure, this allows LPGs to better represent repeatable events and other entities that involve dynamic properties. 

CREATE (e:SPACE_ENTREPRENEUR {name: "Elon_Musk"})-[:FOUNDED {startDate: "2002-03-14", source: "https://en.wikipedia.org/wiki/SpaceX"}]->(c:COMPANY {name: "SpaceX"})
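The same structure can be sketched as plain Python data classes. This is only a mental model under simple assumptions (integer identifiers, dictionaries for properties), not how Neo4j or any other LPG engine stores data; the `Node` and `Relationship` classes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int                # internal identifier
    labels: frozenset      # e.g. {"COMPANY"}
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    id: int                # each relationship instance has its own identity
    type: str
    start: int             # id of the source node
    end: int               # id of the target node
    properties: dict = field(default_factory=dict)


elon = Node(1, frozenset({"SPACE_ENTREPRENEUR"}), {"name": "Elon_Musk"})
spacex = Node(2, frozenset({"COMPANY"}), {"name": "SpaceX"})
founded = Relationship(
    10, "FOUNDED", elon.id, spacex.id,
    {"startDate": "2002-03-14", "source": "https://en.wikipedia.org/wiki/SpaceX"},
)
```

Because the relationship carries its own `id`, a second relationship of the same type between the same two nodes would remain a distinct instance with its own properties.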

 

On the other hand, the arbitrary data structures of LPGs make them less interoperable among users and software. This also applies to the lack of standardisation in LPG query languages. There are several popular alternatives, including Cypher (used by Neo4j) and Gremlin (popularised by the Apache TinkerPop graph computing framework), but they are linked to specific vendors, which limits interoperability. Yet another graph language, PathQuery, was developed by Google for use in Google Knowledge Graph and its search products. PathQuery was designed to be as ‘graphy’ as possible and easy to use by generalist developers. However, it has certain limitations, such as the lack of full support for property graphs and the inability to query multiple graphs simultaneously. PathQuery is also implementation-dependent in certain cases. It is worth mentioning that there is also an effort to standardise the LPG query language, which resulted in the proposal of the Graph Query Language (GQL) in 2019.

RDF* Architecture

RDF* is the extension of the RDF standard that will be included in RDF 1.2. The main goal of this extension is to incorporate metadata and attributes into the classic RDF model, aligning it with the LPG approach while keeping all the benefits of traditional RDF. 

In RDF, a statement would be “a knows b”. To let the relationship itself carry attributes, RDF* introduces ‘statements about statements’, such as “<<a knows b>> since 2020”, also known as statement-level annotations. Quite similar in concept to LPG key-value pairs, they are designed to represent data-provenance information, temporal relationships, property context, and relations to other nodes more efficiently than LPGs do.
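One way to picture a statement about a statement is a tuple whose subject is itself a tuple. The Python sketch below is a toy model under that assumption; the `http://example.org/` namespace and the `annotations_of` helper are hypothetical and do not reflect how an RDF* store is implemented.

```python
EX = "http://example.org/"  # hypothetical namespace

knows = (EX + "a", EX + "knows", EX + "b")

graph = {
    knows,                             # the plain statement
    (knows, EX + "since", '"2020"'),   # a statement about that statement
}


def annotations_of(graph, triple):
    """Collect predicate -> object pairs whose subject is the quoted `triple`."""
    return {p: o for (s, p, o) in graph if s == triple}


since = annotations_of(graph, knows)
```

Compared with reification, the annotation needs one extra statement and no intermediate statement node.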

Difference Between RDF* Edge Properties and LPG Attributes

In property graphs, attribute values are just literal values, such as strings, that are not linked to any other nodes in the graph. They cannot represent relationships or things. In contrast, values of RDF* properties can be either literal values (RDF literals) or nodes connected to other nodes in the graph. In this way, the attribute value of one node can be linked to other nodes, providing additional context for understanding the attribute. Treating edge properties as just another node helps contextualise the connections between different nodes and attributes in the graph. Such an approach follows the original philosophy of the Semantic Web, conceived as the all-encompassing graph of connected resources.
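The contrast can be sketched in Python as follows. In the RDF*-style graph the `source` value is an IRI that has statements of its own, while in the LPG-style dictionary it is only a string. The `publishedBy` triple, the `http://example.org/` namespace, and the `describe` helper are hypothetical and exist only to show the link.

```python
EX = "http://example.org/"  # hypothetical namespace
WIKI = "https://en.wikipedia.org/wiki/SpaceX"

founded = (EX + "Elon_Musk", EX + "founded", EX + "SpaceX")

rdf_star_graph = {
    founded,
    (founded, EX + "source", WIKI),                # annotation value is a node
    (WIKI, EX + "publishedBy", EX + "Wikipedia"),  # hypothetical: the node has its own description
}

# LPG-style edge properties: the value is a plain string with no outgoing links.
lpg_edge_properties = {"source": WIKI}


def describe(graph, node):
    """Return the (predicate, object) pairs stated about `node`."""
    return {(p, o) for (s, p, o) in graph if s == node}


source_node = next(
    o for (s, p, o) in rdf_star_graph if s == founded and p == EX + "source"
)
context = describe(rdf_star_graph, source_node)
```

Following `source_node` leads to further statements in the same graph; the string in `lpg_edge_properties` offers nothing further to follow.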

RDF* Advantages

Statement-level annotations introduced in RDF* have several advantages over previous workarounds, such as the reification discussed above. These include the following:

  • More compact representation. This is achieved in RDF* by embedding triples directly into a subject or object position rather than adding extra triples, as in the reification approach. This innovation leads to the reduced size of RDF* documents and is thus more efficient for data exchange.
  • Better comprehensibility. Removing extra triples leads to improved comprehensibility for users inspecting RDF documents directly. 
  • Backward compatibility. Embedding triples into other triples is compatible with SPARQL queries and is implemented in the SPARQL* extension of the language.

Advantages of RDF* over LPGs

Along with several improvements to the standard RDF model, RDF* provides a number of advantages compared to the LPG model:

  • Capacity to represent edge attributes as nodes. As we mentioned above, unlike attributes in LPGs, RDF* edge attributes can be represented as nodes in the graph, allowing them to be treated as individual entities.
  • Better expressivity. RDF* allows every LPG to be efficiently converted into an RDF model. On the other hand, LPGs cannot fully represent RDF* because of the rich expressivity of the latter. 
  • Arbitrarily complex edge descriptions. With RDF*, graph developers can attach complex descriptions to edges that represent connections to other nodes, literal values (strings), and inter-attribute relationships. In contrast, with LPG, edge properties can only be represented as literal key-value pairs. 
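As a rough illustration of the conversion mentioned above, the following Python sketch maps a single LPG relationship and its properties to RDF*-style statements. It is a naive mapping written for this article, assuming a hypothetical `http://example.org/` namespace and string-valued properties; it is not the algorithm used by any specific conversion tool.

```python
EX = "http://example.org/"  # hypothetical namespace


def edge_to_rdf_star(start, rel_type, end, properties, ns=EX):
    """Naively map one LPG relationship to a list of RDF*-style statements."""
    base = (ns + start, ns + rel_type, ns + end)
    statements = [base]
    for key, value in properties.items():
        # Each edge property becomes an annotation on the quoted base triple.
        statements.append((base, ns + key, f'"{value}"'))
    return statements


statements = edge_to_rdf_star(
    "Elon_Musk", "founded", "SpaceX",
    {"startDate": "2002-03-14", "source": "https://en.wikipedia.org/wiki/SpaceX"},
)
```

Here every property value is emitted as a literal; a richer mapping could emit an IRI instead where the value should be a node in its own right.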

These RDF* enhancements solidify the dominance of RDF* over LPG in simple query performance and also help close the gap with LPG in storage and deep-query efficiency.

RDF* can be used with the SPARQL* query language, which is the extension of the SPARQL query language used for standard RDF queries.

Use Cases for RDF* and LPG

Below, we’ll discuss how the improvements achieved in RDF* make it more suitable than LPG for certain types of applications and use cases, while LPG remains preferable for other scenarios.

Application with Cross-Domain Knowledge: RDF*

Many advanced data-intensive applications like smart homes need to integrate data from different domains and sources and make real-time inferences based on this data. The same capabilities are required by many IoT applications used in smart urban environments and various industrial settings.

RDF and RDF* offer many advantages for such kinds of applications. In particular, RDF* graphs are more suitable than LPGs for modelling ontologies, that is, sets of properties, relations, and categories that represent a specific domain or subject of the ‘smart’ component. RDF* makes it simpler to model ontologies thanks to its atomic decomposition of subjects and relations, global uniqueness of nodes and edges, and built-in shareability of data.

In turn, ontologies implemented via RDF* enable cross-domain data sharing, data interoperability, expert systems, and the domain-based inference required by many IoT applications.

In contrast, the arbitrary data design used in LPG makes it harder to implement ontologies in LPG-based graphs, thus making the LPG architecture less suitable for cross-domain data sharing and domain-based inference. 

Applications that Require Fast Search: LPG

LPGs were originally designed for fast querying and data traversal. This is achieved by the dense key-value structure that can be modelled in a relational setting. In LPG, the cost of a traversal grows close to linearly with the number of nodes and edges it visits, whereas in RDF each traversal step typically adds an index lookup whose cost grows logarithmically with the size of the graph. The difference is most pronounced for complex queries. Therefore, LPG is still a preferred option for applications where read efficiency is important and where deep queries that involve sub-graphs are frequent.
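One common way to picture this difference is to compare two storage layouts: direct neighbour lists on the LPG side and a single sorted triple index on the RDF side. The Python sketch below rests on those two assumptions, which are simplifications and not a description of any particular engine; all names in it are hypothetical.

```python
import bisect
from collections import deque

# LPG-style storage: each node keeps direct references to its neighbours.
adjacency = {"a": ["b"], "b": ["c"], "c": []}

# Triple-store-style storage: one sorted (subject, predicate, object) index.
spo_index = sorted([("a", "linksTo", "b"), ("b", "linksTo", "c")])


def neighbours_adjacency(node):
    """One dictionary access per hop."""
    return adjacency.get(node, [])


def neighbours_index(node):
    """One binary search over the whole index per hop, then a short scan."""
    i = bisect.bisect_left(spo_index, (node,))
    found = []
    while i < len(spo_index) and spo_index[i][0] == node:
        found.append(spo_index[i][2])
        i += 1
    return found


def reachable(start, neighbours):
    """Breadth-first traversal using the supplied neighbour lookup."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in neighbours(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


via_adjacency = reachable("a", neighbours_adjacency)
via_index = reachable("a", neighbours_index)
```

Both lookups return the same neighbours; what differs is the work per hop, since the binary search depends on the total number of triples in the index while the adjacency access does not.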

It should be noted that the introduction of RDF* has helped bridge the gap in performance with LPGs. New performance benchmarks show that RDF* is significantly faster than previous approaches to adding metadata to graphs (see table below).

Modelling approach     Total statements   Loading time (min)   Repository image size (MB)
Standard reification   391,652,270        52.4                 36,768
N-ary relations        334,571,877        50.6                 34,519
Named graphs           277,478,521        56                   35,146
RDF*                   220,375,702        34                   22,465

Figure 1: RDF performance benchmarks (Source: GraphDB “RDF* and SPARQL*”)
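To read the benchmark figures as relative savings, the short Python sketch below computes the percentage reduction of RDF* against each of the other approaches. The numbers are copied from the table above; the `reduction` helper and the variable names are only illustrative.

```python
# (total statements, loading time in minutes, repository image size in MB)
benchmarks = {
    "Standard reification": (391_652_270, 52.4, 36_768),
    "N-ary relations": (334_571_877, 50.6, 34_519),
    "Named graphs": (277_478_521, 56.0, 35_146),
    "RDF*": (220_375_702, 34.0, 22_465),
}


def reduction(baseline, value):
    """Percentage by which `value` is smaller than `baseline`."""
    return 100 * (baseline - value) / baseline


rdf_star = benchmarks["RDF*"]
reductions = {
    name: tuple(reduction(b, v) for b, v in zip(figures, rdf_star))
    for name, figures in benchmarks.items()
    if name != "RDF*"
}

for name, (statements, loading, size) in reductions.items():
    print(f"{name}: {statements:.1f}% fewer statements, "
          f"{loading:.1f}% less loading time, {size:.1f}% smaller image")
```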

Notwithstanding these improvements, much work is yet to be done to optimise RDF* implementations for fast data traversals comparable to LPGs.

Data Provenance: RDF*

RDF* statement-level annotations allow incorporating data provenance—information about data origin and history—into the graph design. Data provenance/lineage is important because the same piece of information can mean different things depending on the data history and context (e.g., who asked the question and how it was processed).

Implementing data provenance in the classic RDF model would require quad stores combined with some of the workarounds discussed above. The introduction of statement-level annotations in RDF* makes implementing data provenance easier. As a result, RDF* is now more suitable for applications that require regulatory compliance and auditing and can be used to audit transformations of datasets, as well as assess confidence levels in the validity of data. 
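As a minimal sketch of how such an audit might look, the Python snippet below keeps only the statements whose `source` annotation appears in a set of trusted sources. It models quoted triples as nested tuples; the `http://example.org/` namespace, the second claim, its `unverified_feed` source, and the `statements_from` helper are all made up for illustration.

```python
EX = "http://example.org/"  # hypothetical namespace

claim_a = (EX + "Elon_Musk", EX + "founded", EX + "SpaceX")
claim_b = (EX + "a", EX + "knows", EX + "b")

graph = {
    claim_a,
    claim_b,
    (claim_a, EX + "source", "https://en.wikipedia.org/wiki/SpaceX"),
    (claim_b, EX + "source", EX + "unverified_feed"),  # made-up source
}


def statements_from(graph, trusted_sources):
    """Return base statements whose `source` annotation is in `trusted_sources`."""
    return {
        s for (s, p, o) in graph
        if isinstance(s, tuple) and p == EX + "source" and o in trusted_sources
    }


audited = statements_from(graph, {"https://en.wikipedia.org/wiki/SpaceX"})
```

The provenance check is a single pass over the annotations, with no separate metagraph or named graph to keep in sync.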

Representing Temporal Dimension: RDF*

Statement-level annotations in RDF* may be treated not just as literal values but as nodes connected to other nodes in the graph. This opens up the opportunity to represent the temporal dimension of entities and events. For example, a node describing an employee can include an annotation with the date of the last task, which is in turn connected to the node representing this task. This enriches the semantic description of nodes with temporal relationships, resulting in richer data queries.

Data Analytics: RDF*

RDF was originally designed for data exchange on the web. RDF graphs are very good for slowly and gradually changing datasets and can be easily extended with new data from disparate data sources thanks to their simple, atomised, decomposed, and shareable format. 

However, the use of RDF in data science applications has been limited by the lack of internal structure (properties and metadata) on the nodes. The introduction of the RDF* extension partially resolves this problem. In particular, the ability to add properties to semantic statements using statement-level annotations makes RDF* very useful for training datasets and data science applications. 

Also, the RDF model has been traditionally used for knowledge graphs—data structures that can be used for querying complicated questions, logical reasoning, rules management, and Machine Learning. RDF* further enhances the applicability of the RDF model to the construction of knowledge graphs. In particular, knowledge graphs and hypergraphs composed using the RDF* framework can be more detailed and data-rich thanks to statement-level annotations. This has additional benefits for NLP applications with RDF*, such as StarE, a GNN encoder motivated by RDF* for hyper-relational knowledge graphs.

Conclusion

As we’ve learned in this article, the introduction of the RDF* extension bridges the gap between the traditional RDF and LPG models. At the same time, RDF* inherits all the benefits of RDF that make it suitable for cross-domain data sharing, ontology-based inference, data analytics, and NLP tasks. 

RDF* has a lower loading time and requires fewer statements and a smaller repository size than traditional RDF models, thanks to its ability to use statement-level annotations without complicating the graph structure. These improvements make RDF* more efficient than RDF in scenarios where fast queries are important. Still, LPGs remain a great option for applications that require fast data traversal and queries that are yet to be implemented and optimised in the RDF standard. 

Statement-level annotations are similar to LPG attributes in that they allow for the creation of dense and rich internal data structures with attributes and properties similar to LPGs. However, RDF* annotations are not just a copy of the LPG approach. In fact, the new standard introduces a number of enhancements compared to LPG. 

In particular, it allows for the use of statement-level annotations as nodes, has better expressivity, and provides the ability to construct complex graph structures combining literal annotations and node annotations. One of the major advantages of RDF* is the ability to represent LPG graphs. For example, property graphs can now also be written in the RDF format using plugins such as Neosemantics, which enables the use of RDF in the Neo4j graph database platform.

 

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.
