
March 9, 2023 | Blog, Data Analysis, Neo4j

Ingesting Big Data into Neo4j – Part 3

Check out the last part of Ebru Cucen and Fahran Wallace’s blog series, in which they discuss their experience ingesting 400 million nodes and a billion relationships into Neo4j and what they discovered along the way.

WRITTEN BY

Fahran Wallace


Senior Consultant


Running the Neo4j-admin import tool

Following the first two articles in the series, which covered modelling and extraction, then transformation, this final part covers loading the data into Neo4j.

The load aspect of a Model Data, Extract, Transform, Load flow

Neo4j-admin import

There’s no shortage of ways to load data into Neo4j. Many ETL tools and data platforms provide connectors (though graph support is not ubiquitous), Apache Hop has very good native support (its maintainer now works for Neo4j), and Neo4j itself provides a variety of tools, including the LOAD CSV command for small loads, the apoc.load.json procedure, and the Neo4j ETL tool for importing directly from a relational database via a JDBC connection.
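For context, a small transactional load with LOAD CSV looks something like the sketch below (the file name, label and properties are hypothetical, and the CSV needs to live in Neo4j’s import directory):

LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
MERGE (c:Customer {id: row.id})
SET c.name = row.name;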

The Neo4j-admin import tool in particular excels when doing an initial data load. It writes an initial set of database backing files, neatly sidestepping the time required to do consistency and constraint checks during transactions. In our case, it took the estimated time down from many days to two hours. Updates can then be managed using one of the many tools that write to the database via transactions. Note:

  • This technique only works when creating a new database is viable. You can’t update an existing database in this way, even just to add some new nodes.
  • You need to do plenty of planning upfront, as updates that touch every node in a large dataset will take a long time. It may well be faster to throw away your original database and import a new version with the admin tool! This means you need to model your data well, in a way that is extensible later.

This constraint doesn’t jibe particularly well with iterative software engineering instincts. We strongly recommend starting out with a small, representative dataset and ingesting it many times, playing with different structures and data cleansing, before committing to the big load. Bear in mind that a larger dataset gives you more opportunities to encounter oddities in the data, so your transformation process might prove brittle. It’s a good idea to introduce checkpoints, allowing you to return to a failed sub-batch of data and re-process it. Many datasets are partitioned – ours was partitioned by the latest “updated date” of the records, so most of our refinement was done using just one day of data.

Setup

Firstly, this ingestion is going to take a while, and if the job fails, you have to trash everything and re-run it from the beginning. So we want to maximise our chances of success.

  • For the duration of the ingestion, the node that’s doing it is definitely a pet, not cattle. If you’re doing this in a cluster, make sure this pod is somewhere reasonably stable, and is unlikely to be evicted due to memory or CPU consumption.
  • Spot instances aren’t a great idea for the same reason – if you lose your spot, you’ll have to start again.

Initial Steps

First of all, run the ingestion command on a small but fully representative dataset. It should contain a smallish-but-still-sizeable sample of every node and every relationship type (say, 100,000 rows/nodes).

We did this and discovered some flaws in our flattening process (some fields contained unescaped newline characters, interfering with the jsonl). Frustratingly, this manifested as the import tool hanging indefinitely, and the debug logs did not indicate which node type contained the problem. A little binary searching, commenting out various combinations of relationships and nodes in turn, helped us find the issue.
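If you want to catch this class of problem before pointing the import tool at the data, a rough per-line check is sketched below (it assumes jq is installed, and customers.jsonl is a hypothetical file name):

i=0
while IFS= read -r line; do
  i=$((i+1))
  # a record split by an unescaped newline won't parse as standalone JSON
  printf '%s' "$line" | jq -e . > /dev/null 2>&1 || echo "invalid JSON on line $i"
done < customers.jsonl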

Tip: The Neo4j-admin import tool’s progress bar should advance fairly smoothly. If it makes visible progress for a while, then hangs for 10 minutes, something is probably wrong. If it hangs for multiple hours, something is definitely wrong.

How big a machine do I need?

Neo4j has an estimation tool to help you size your hardware. However, this calculation is for running the graph database itself, not the ingestion process. It’s also limited to 64 GB of memory and 4 cores, which is not really enough for big data ingestion. The import process also uses resources differently to the running database (heavier file I/O, theoretically less memory consumption), so the estimate isn’t particularly helpful for our usage.

For our 400 million+ nodes and 1 billion+ relationships, we used the largest E2 high-memory machine type available in GCP Compute Engine, e2-highmem-16, with 16 vCPUs and 128 GB of memory. The import tool was not above hanging when we used smaller machines, so whilst we could probably have managed with less, doing a short burst with a high-end machine was the right balance in this case.

The whole process took us a little over 2 hours and about £20, including dependent resources such as disk and network. Be sure to downsize afterwards though, as the majority of your queries won’t need anything like this!

Running the tool

The admin import tool is pretty configurable, especially in how it responds to errors. For a large ingest, there’s always going to be an unforeseen data issue, so we opted to log errors and skip those rows as we found them, rather than aborting the whole job. Depending on the magnitude of the issue, we could then either fix the issue and retry the import from scratch, or add the failed records in a separate step later.
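To give a flavour, an invocation along the lines of the sketch below (the database name, labels and file names are hypothetical, and the exact flags vary between Neo4j versions, so check neo4j-admin import --help for yours) logs problem rows to a report file rather than aborting:

neo4j-admin import \
  --database=customers \
  --nodes=Customer=customers_header.csv,customers.csv \
  --relationships=HAS_TRANSACTION=transactions_header.csv,transactions.csv \
  --skip-duplicate-nodes=true \
  --skip-bad-relationships=true \
  --report-file=import.report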

Note: At the time of writing, the tool lets you choose technically invalid database names. Read the guide, and stick to ASCII & hyphens, as a rule (no underscores). Happily, we discovered this early on, in our trials with small datasets!

It is a great idea to run this process in the background with nohup and/or &. SSH sessions get terminated prematurely all the time, and it’s sensible to take these precautions against having to restart your work.
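For example, assuming the import invocation is wrapped in a script called run_import.sh (a hypothetical name):

nohup ./run_import.sh > import.log 2>&1 &
tail -f import.log    # reattach to the progress output at any time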

Pre-run to see the estimated shape of the database

Although you won’t find this in the Neo4j guidelines, letting the import command run for a few seconds as a dry run was extremely useful. It allowed us to see and validate the estimated database size (nodes, relationships, disk usage) and the memory requirements.

It also gives useful information about memory usage: not only does it show the configuration of your Neo4j instance, but also the available memory. If you’re importing data on a machine that’s been running for a while, the memory usage of existing background tasks might be significant. If the available resources seem low, it may be sensible to move to a machine with a lower workload. In our case, the large machine we provisioned for the task had no issues.

Monitoring the execution

If you are running the import in a container, you can use docker container stats to check CPU and memory usage when the process is taking longer than it should. For example:
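docker container stats <your neo4j container>    # live CPU %, memory usage/limit, network and block I/O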

Assessing the output

  1. Have a look at $NEO4J_HOME/data/databases/<your db>. You should see the following files, and several will have significant size; for example: neostore.nodestore.db, neostore.propertystore.db and neostore.relationshipstore.db.

Screenshot of a terminal listing Neo4j’s store files: neostore.labelscanstore.db, neostore.labeltokenstore.db, neostore.nodestore.db, neostore.propertystore.db, neostore.relationshipgroupstore.db, neostore.relationshipstore.db, neostore.relationshiptypescanstore.db, neostore.relationshiptypestore.db and neostore.schemastore.db

  2. The admin import tool generates an import.report file, describing the various quirks and errors that occurred during execution. By default, it can be found in the directory the import command was executed from. Read through that file, store it for safekeeping, and make sure all the errors you see are expected.

It may mention duplicate nodes (a recoverable error when the --skip-duplicate-nodes flag is passed to the command):

Id '4214504617' is defined more than once in group 'Customer-ID'

or describe relationships that are missing one of their related nodes (a more major inconsistency that needs resolution):

2171667476 (Customer-ID)-[HAS_TRANSACTION]->34197711 (Transaction-ID) referring to missing node 4197711

Creating the database

 

Important: check the permissions of the database folder now, BEFORE you create the database in Neo4j. If the database user doesn’t have read and write permissions on these files, things can get nastily stuck, and you may have to re-import the data.
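As a minimal check, assuming a typical installation where the database runs as the neo4j user (adjust the user and paths to your own setup):

ls -la $NEO4J_HOME/data/databases/<your db>
sudo chown -R neo4j:neo4j $NEO4J_HOME/data/databases/<your db>
# depending on your version, the transaction logs under data/transactions/<your db> may need the same treatment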

Pick your favourite way of executing Cypher, and run the following commands:

:USE SYSTEM
CREATE DATABASE <YOUR DB> 

After a short pause, your database should appear online (run SHOW DATABASES to see it). If you get an error, and it appears as “offline”, it’s time to check the import.report and command logs more thoroughly.

Validating the database

A database that’s “online” might sound like proof of victory, but there are a few last things to validate. After our first “successful” ingest, we were disappointed to find that whilst our node-based queries were pretty fast, we got timeouts when retrieving relationships, even using simple queries. Not great for a database type that specialises in traversing relationships!

In our case, it transpired that our indexes were missing! By default, Neo4j creates two token-lookup indexes, one for nodes, and one for relationships. However, they didn’t appear when we ran SHOW INDEXES, hence the slow queries.
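If you hit the same issue, one option is to recreate the token lookup indexes by hand; a sketch in Cypher (syntax available from Neo4j 4.3 onwards, with index names of our own choosing):

CREATE LOOKUP INDEX node_label_lookup IF NOT EXISTS FOR (n) ON EACH labels(n);
CREATE LOOKUP INDEX rel_type_lookup IF NOT EXISTS FOR ()-[r]-() ON EACH type(r);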

Happily, the neo4j-admin tool also provides a consistency checker. We ran it through the following command:

neo4j-admin check-consistency --database=<your database>

Unfortunately, this revealed several relationships that apparently were lacking either their source or target nodes. We put the kettle on, and ran the import job again for another couple of hours while we played with an old dataset. Strangely, rerunning the import once more gave us a perfectly consistent database!

Optimising for future use and reads

Once you’ve got a consistent database, congratulations! Now it’s time to add the indexes you planned out in the data modelling phase. On a large dataset, it can take quite a while to populate them, but you can still run other queries while they’re building. You can watch the progress using the SHOW INDEXES command once more. In our case, a simple node property index took ~15 minutes to populate for a few hundred million rows.
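As a sketch, a node property index on a hypothetical Customer id property, followed by the progress check, looks like this:

CREATE INDEX customer_id IF NOT EXISTS FOR (c:Customer) ON (c.id);
SHOW INDEXES;    // the populationPercent column shows how far the build has got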

Backups are at the heart of running operations smoothly, and that is just as true in the graph database space. Until you have regular backups scheduled, it is not a bad idea to take a snapshot of the data disk.
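If you can tolerate a short window of downtime, one simple stopgap is an offline dump with the admin tool (the paths are hypothetical, and online backups need the Enterprise backup tooling):

neo4j-admin dump --database=<your database> --to=/backups/<your database>.dump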

If you have followed us this far, you should be able to get started loading millions of nodes into Neo4j. The golden rule is that planning is everything at this scale. At OpenCredo, we help our customers get the best out of their data by decentralising governance, streamlining their pipelines, and gaining insights from connected data with knowledge graphs. Please reach out if you have any questions on ingestion, and let us know if there is anything we can help you with.
