Open Credo

June 3, 2011 | Neo4j

Neo4j: Super-Nodes and Indexed Relationships, Part I

Neo4J is one of the first graph databases to appear on the global market. Being open source, in addition to its power and simplicity in supporting graph data model it represents good choice for production-ready graph database.

However, there has been one area I have struggled to get good-enough performance from Neo4j recently – super nodes.

WRITTEN BY

Aleksa Vukotic

Neo4j: Super-Nodes and Indexed Relationships, Part I

Super nodes represent nodes with dense relationships (100K or more), which are quickly becoming bottlenecks in graph traversal algorithms when using Neo4J.

I have tried many different approaches to get around this problem, but introduction of auto indexing in Neo4j 1.4 gave me an idea that I had success with. The approach I took is to try to fetch relationships of the super nodes using Lucene indexes, instead of using standard Neo APIs. In this entry I’ll share what I managed to achieve and how.

To start, let’s generate a sample graph with super-nodes. Here is the amount of generated data:

  • 5 nodes representing cities/towns (for example “London”, “Manchester”, “Edinburgh”, “Cambridge” and “Oxford”)
  • Each city node has “IS_IN” relationship to a country node UK, (every city is related to a country node with IS_IN relationship, and we have only one country node – UK)
  • We have 100K users, and each of the users have a “HAS_VISITED” relationship to each of the five cities. I know this is not a likely scenario, but in a large data set (say among all Facebook users) you will find more then 5 cities that more then 100K users have visited
  • That’s in total 100,006 nodes (100K users+5 cities + 1 UK country node) and 500,005 relationships (100K users-HAS)_VISITED-5 cities + 5 cities IS_IN UK)

That’s not a huge amount of data, but it’s large enough to start with.
Let’s say that we want to load “IS_IN” relationships for a given city node. It’s quite simple thing to do using Neo API:

Iterable relationships = city.getRelationships(DynamicRelationshipType.withName("IS_IN"));

for (Relationship r : relationships) {
    cnt++;
}

 

Listing 1: Iterating through node relationships using standard Neo4j API

How long does this take for a single city node? I have run this code snipped repeatedly 10 times (to account for the cache hits in subsequent runs), and here are the results:

Aleksa S Technology Blog Neo4j Super Nodes And Indexed Relationships Part I

The first time the relationship is fetched, it takes 1.6 seconds. Each subsequent request remarkably takes less then 1 ms! We can thank Neo4j’s caching and mapped memory mechanism for this increase in speed (or the developers who implemented it to be more precise).

But I was still worried about the performance of the first run. For a single relationship fetched (there is only one “IS_IN” relationship for each city node), 1.6 seconds seems a lot. The reason for the poor performance is that a city node has 100,001 relationships in total, and only one of those is of type IS_IN. In order to fetch the right one, Neo4j engine will have to look at all the relationships, then compare their types, and return only those that match the requested type.

While the system would perform well when the cache warms up, this can still cause a problem if you often access “cold” parts of the graph. While Neo4j tries to keep most of the data in caches or mapped-memory, this cannot always be achieved. When the graph is to large to fit in entirely If the super-nodes cannot be kept in memory all the time (either because there is too many of them, or the data more frequently accessed takes all cache/mapped memory available), then the performance will be adversely affected.

In terms of large data systems 100K relationships per node is not an awful lot. What would happen if you have super nodes with 1M relationships? Let’s see for ourselves!

I’m going to keep 5 city nodes in the graph, with each one having one IS_IN relationship with the UK country node. But I’ll increase number of users to 1,000,000. And assume that each user visitied each of 5 cities. So we’ll have 1,000,001 relationships per city node. I repeated the code from listing 1 to fetch the relationships of type IS_IN (there will be only one), and repeat it 10 times to see the results when cache kicks in. Table 2 shows the results.

Table 2: Performance of fetching relationships from super-node with 1M relationships, using standard Neo4J API

Aleksa S Technology Blog Neo4j Super Nodes And Indexed Relationships Part I

The first time run is now close to 5 seconds, with subsequent calls which hit the cache still taking less then 1ms!
The performance degradation is not linear (with input increased 10 times, the time was only 3 times longer) – and that’s good. When its warm, the system works well with sub-ms responses.

But sometimes our requirements do not allow for the warming-up time. If you have a traversal that needs to pass through 10 super nodes, you may be getting close to 1 min a traversal. If you have a batch job that has to process 100K such traversals, and the data set is such that it cannot be stored in memory all the time, it may take 100K minutes, or more then a month to complete! What can we do to improve this?

One option can be to deploy HA Neo4j cluster, and partition calculation across the cluster. I haven’t tried this, and depending on your data set size and requirements it may be a viable solution.  But we’re going to concentrate on the simpler solution here: how can we fetch only relationships of the required type, ignoring the huge number of relationship of other types that we’re not interested in. We’re going to use Neo4j’s built-in Lucene indexing support to fetch the required relationships from the index directly. In order to do that, we’ll need to add the relationship type to the index whenever new relationship is created.

Firstly, we are going to configure Neo4J ‘s relationship auto-indexing capability by adding following properties to the neo4j.properties file, as listing 2 illustrates.

Listing 2: Configuring relationship auto indexing in neo4j.properties file

relationship_auto_indexing=true

relationship_keys_indexable=type

 

We are enabling the relationship auto-indexing using relationship_auto_indexing, property, and we are going to tell Neo to track and index property name type of every relationship created.

We have to make a slight change to our data generator and set the type property for every relationship we create:

Listing 3: Creating relationship with the type name as property, to enable relationship auto-indexing

DynamicRelationshipType relType = DynamicRelationshipType.withName("IS_IN");

Relationship isA = city.createRelationshipTo(ukCountry, relType);

isA.setProperty("type", relType.name());

DynamicRelationshipType hasVisited = DynamicRelationshipType.withName("HAS_VISITED");

Relationship rel = user.createRelationshipTo(city, hasVisited);

rel.setProperty("type", hasVisited.name());

 

Having inserted the data for all 1M users, we are going to fetch IS_IN relationships for a single city node using auto index:

Listing 4: Creating relationship with the type name as property, to enable relationship auto-indexing

Iterable relationships = autoIndex.query("type", "IS_IN", city, null);

for (Relationship r : relationships) {
  cnt++;
}

 

Now let’s measure how long loading the relationships from the index takes – as in the previous runs, we will fetch the relationships 10 times to allow the cache to wam up. Table 3 shows the results:

Table 3: Performance of fetching relationships from super-node with 1M relationships, using relationship auto-indexing

Aleksa S Technology Blog Neo4j Super Nodes And Indexed Relationships Part I

Thanks to the Lucene index, the “cold” fetch has taken only 12ms on this run. The subsequent calls are still very quick, although slightly slower than before, as now we see a few 1ms times among majority of sub ms calls.

In the previous example we saw how we can ensure a more efficient lookup of relationships of particular type. Where we are interested in traversal operations ensuring we retrieve only the relationships we are interested in traversing requires the use of a custom RelationshipExpander . The code in listing listing 5 shows an implementation of the relationship expander which ensures we use the Lucene index to retrieve relationships of a specified type.

Listing 5: Implementation of the IndexedRelationshipExpander that uses Lucene index to fetch relationships for node

public class IndexedRelationshipExpander implements RelationshipExpander{

private final GraphDatabaseService graphDatabaseService;

private final RelationshipType relationshipType;

private final Direction direction;
public class IndexedRelationshipExpander implements RelationshipExpander {
    private final GraphDatabaseService graphDatabaseService;
    private final RelationshipType relationshipType;
    private final Direction direction;

    private IndexedRelationshipExpander(GraphDatabaseService graphDatabaseService, RelationshipType relationshipType, Direction direction) {
        this.relationshipType = relationshipType;
        this.graphDatabaseService = graphDatabaseService;
        this.direction = direction;
    }

    public static IndexedRelationshipExpander create(GraphDatabaseService graphDatabaseService, RelationshipType relationshipType, Direction direction) {
        return new IndexedRelationshipExpander(graphDatabaseService, relationshipType, direction); #1
    }

    @Override
    public Iterable expand(Node node) {
        Iterable hits = null;
        ReadableRelationshipIndex autoIndex = this.graphDatabaseService.index().getRelationshipAutoIndexer().getAutoIndex();
        switch(this.direction) {
            case OUTGOING:
                hits = autoIndex.query("type", this.relationshipType.name(), node, null); #2 break;
            case INCOMING:
                hits = autoIndex.query("type", this.relationshipType.name(), null, node); #3 break;
            case BOTH:
                IndexHits out = autoIndex.query("type", this.relationshipType.name(), node, null);
                IndexHits in = autoIndex.query("type", this.relationshipType.name(), null, node);
                hits = Iterables.concat(out, in); #4 break;
        } return hits;
    }

    @Override
    public RelationshipExpander reversed() {
        return new IndexedRelationshipExpander(graphDatabaseService, relationshipType, direction.reverse());
    }
}

 

In order to access the index, the IndexedRelationshipExpander requires a reference to the GraphDatabaseService. In addition, we need to provide the relationship type and direction we are interested in to the IndexedRelationshipExpander (#1). The implementation of the expand() method simply checks the direction of the relationship, and queries the relationship auto-index (#2, #3 – the difference is only in the startNode/endNode arguments).

Where we need relationships in both directions, we have to perform two separate searches one for incoming and one for outgoing relationships. Then we need to concatenate the two results, in the example this is done using the Google Guava library (#4)
And that’s all. We can now traverse the relationships of the specified type using the new relationship expander:

Listing 6: Using custom IndexedRelationshipExpander in Neo4J Traversal API

RelationshipType relationshiptype = DynamicRelationshipType.withName("IS_IN");

Direction direction = Direction.OUTGOING;

TraversalDescription traversal = Traversal.description().

evaluator(Evaluators.atDepth(1)).expand(IndexedRelationhipExpande.create(this.embeddedGraphDatabase, relationshipType, direction));

traversal.traverse(city);

 

NOTE that you have to have indexed type property on every relationship, which contain the name of the relationship type so the IndexedRelationshipExpander works correctlyGiven this basic implementation of a RelationshipExpander, it’s easy to adapt it for you specific requirements.

For example, you can index the creation_date property for every relationship and then traverse only relationships created after a given date. You can include any indexed relationship property when creating a Lucene query, and implement it in the RelationshipExpander.expand() method.

We have now seen how we can use Lucene indexing to quickly fetch and traverse a small subset of the relationships on the super node. The effectiveness of this approach is however dependent on you being interested in only a small proportion of the relationships.

In the next blog entry we will explore how we can efficiently load a larger number of relationships from a supernode.

 

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.

RETURN TO BLOG

SHARE

Twitter LinkedIn Facebook Email

SIMILAR POSTS

Blog