March 9, 2023 | Blog, Data Analysis, Neo4j
Check out the last part of Ebru Cucen and Fahran Wallace’s blog series, in which they discuss their experience ingesting 400 million nodes and a billion relationships into Neo4j and what they discovered along the way.
Following the first two articles in the series, which covered modelling and extraction, then transformation, this final part covers loading the data into Neo4j.
There’s no shortage of ways to load data into Neo4j. Many ETL tools and data platforms provide connectors (though graph support is not ubiquitous), Apache Hop has very good native support (the maintainer now works for Neo4j), and Neo4j themselves provide a variety of tools, including the LOAD CSV command for small loads, the apoc.load.json procedure, and the Neo4j ETL tool for importing directly from a relational database via a JDBC connection.
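As a rough sketch of the simplest of those routes (the file, label, and properties here are invented for illustration), a small CSV load through cypher-shell might look like this:

# Hypothetical small load: customers.csv sits in Neo4j's import directory and has id/name columns.
cypher-shell -u neo4j -p <password> -d <your db> \
  "LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
   MERGE (c:Customer {id: row.id})
   SET c.name = row.name;"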
The Neo4j-admin import tool in particular excels when doing an initial data load. It writes an initial set of database backing files, neatly sidestepping the time required to do consistency and constraint checks during transactions. In our case, it took the estimated time down from many days to two hours. Updates can then be managed using one of the many tools that write to the database via transactions. Note: the import tool only works for this very first load; it builds a brand-new store and can’t add to an existing database.
This constraint doesn’t jibe particularly well with iterative software engineering instincts. We strongly recommend starting out with a small, representative dataset and ingesting it many times, playing with different structures and data cleansing, before committing to the big load. Bear in mind that a larger dataset gives you more opportunities to hit oddities in the data, so your transformation process might turn out to be brittle. It’s a good idea to introduce checkpoints, allowing you to return to a failed sub-batch of data and re-process it. Many datasets are partitioned; ours was partitioned by the latest “updated date” of the records, so most of our refinement was done using just one day of data.
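For reference, this is roughly the shape of the command we’ll be iterating on. The file names, labels, and database name below are placeholders of our own, using the Neo4j 4.x syntax:

# Offline bulk load into a brand-new database; each flag takes a header file plus data file(s).
neo4j-admin import \
  --database=<your db> \
  --nodes=Customer=customers-header.csv,customers.csv \
  --nodes=Transaction=transactions-header.csv,transactions.csv \
  --relationships=HAS_TRANSACTION=has-transaction-header.csv,has-transaction.csv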
Firstly, this ingestion is going to take a while, and if the job fails, you have to trash everything and re-run it from the beginning. So we want to maximise our chances of success.
Run the ingestion command on a small but fully representative dataset first. It should contain a smallish-but-still-sizeable sample of every node and relationship type (say, 100,000 rows/nodes).
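If your inputs are plain CSV files with a header row (the file names below are hypothetical), one blunt way to carve out such a sample is just to take the top of each file:

# Keep the header line plus the first 100,000 data rows of each input.
for f in customers.csv transactions.csv has-transaction.csv; do
  head -n 100001 "$f" > "sample-$f"
done

A naive prefix like this can leave relationships pointing at nodes outside the sample, so expect to filter those out or skip them during the test import.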
We did this and discovered some flaws in our flattening process (some fields contained unescaped newline characters, breaking the JSONL). Frustratingly, this manifested as the import tool hanging indefinitely, and the debug logs did not indicate which node type contained the problem. A bit of binary search, commenting out various combinations of relationships and nodes in turn, helped us track down the culprit.
Tip: The Neo4j-admin import tool should have a fairly smooth progress bar. If it’s making smooth, visible progress for a while, then hangs for 10 minutes, something is probably wrong. If it hangs for multiple hours, something is definitely wrong.
Neo4j has an estimation tool to help you size your hardware. However, that calculation is for running the graph database itself, not for the ingestion process. It’s also capped at 64 GB of memory and 4 cores, which isn’t really enough for big data ingestion. The import process also uses different resources from a running database (heavier file I/O, theoretically less memory consumption), so the estimate isn’t particularly helpful for our usage.
For our 400 million+ nodes and 1 billion+ relationships, we used the largest E2 high-memory machine available in GCP, an e2-highmem-16 with 16 vCPUs and 128 GB of memory. The import tool was not above hanging when we used smaller machines, so whilst we could probably have managed with less, a short burst on a high-end machine was the right balance in this case.
The whole process took a little over 2 hours and cost about £20, including dependent resources such as disk and network. Be sure to downsize afterwards though, as the majority of your queries won’t need anything like this!
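If you’re also on GCP, provisioning the big machine and shrinking it afterwards is only a handful of gcloud commands; the instance name, zone, and smaller machine type below are illustrative:

# Create a large VM for the import, then downsize it once the load is finished.
gcloud compute instances create neo4j-import --zone=europe-west2-a --machine-type=e2-highmem-16
# ...run the import, then:
gcloud compute instances stop neo4j-import --zone=europe-west2-a
gcloud compute instances set-machine-type neo4j-import --zone=europe-west2-a --machine-type=e2-standard-4
gcloud compute instances start neo4j-import --zone=europe-west2-a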
The admin import tool is pretty configurable, especially in how it responds to errors. For a large ingest, there’s always going to be an unforeseen data issue, so we opted to log errors and skip those rows as we found them, rather than aborting the whole job. Depending on the magnitude of the issue, we could then either fix the issue and retry the import from scratch, or add the failed records in a separate step later.
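In the 4.x tool, the options that control this behaviour include --skip-duplicate-nodes, --skip-bad-relationships, --bad-tolerance and --report-file. A sketch of how they slot into the command (the tolerance value is illustrative):

# Log problem rows to the report file and carry on, rather than aborting the whole job.
neo4j-admin import \
  --database=<your db> \
  --skip-duplicate-nodes=true \
  --skip-bad-relationships=true \
  --bad-tolerance=100000 \
  --report-file=import.report \
  --nodes=... --relationships=...   # node and relationship flags as before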
Note: At the time of writing, the tool lets you choose technically invalid database names. Read the guide, and stick to ASCII & hyphens, as a rule (no underscores). Happily, we discovered this early on, in our trials with small datasets!
It is a great idea to run this process in the background with nohup and/or &. SSH sessions get terminated prematurely all the time, and it’s sensible to take these precautions against having to restart your work.
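In practice that can be as simple as the following (the log file name is our own choice):

# Detach the job from the SSH session and capture its output (import flags elided, as above).
nohup neo4j-admin import ... > import.log 2>&1 &
tail -f import.log    # follow progress from this or any later session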
Although you won’t find this in the Neo4j guidelines, executing a dry run of the import command for a few seconds was extremely useful. It allowed us to see and validate the estimated database size (nodes, relationships, disk usage) and the memory requirements.
It also gives useful information about memory usage: not only the configuration of your Neo4j instance, but also the memory actually available. If you’re importing on a machine that’s been running for a while, the memory consumed by existing background tasks might be significant, and if the available resources look low it may be sensible to switch to a machine with a lighter workload. In our case, the large machine we provisioned for the task had no issues.
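A quick sanity check of the host before kicking things off costs nothing; these are standard Linux tools, and the Neo4j path below assumes a default package install:

free -h                  # memory actually free, not just installed
nproc                    # CPU cores available to the import
df -h /var/lib/neo4j     # disk headroom for the store files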
If you are running the import in a container, you can use docker container stats to see CPU and memory usage when the process is taking longer than it should.
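For example, to take a one-off snapshot of just the import container (the container name and format string below are our own):

# Single snapshot of CPU and memory for a container named "neo4j".
docker container stats neo4j --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"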
The admin import tool generates an import.report file, describing the various quirks and errors that occurred during execution. By default, it can be found in the directory the import command was executed from. Read through that file, store it for safekeeping, and make sure all the errors you see are expected.
It may mention duplicate nodes (a recoverable error when the --skip-duplicate-nodes flag is passed to the command):
Id '4214504617' is defined more than once in group 'Customer-ID'
or describe relationships that are missing one of their related nodes (a more major inconsistency that needs resolution):
2171667476 (Customer-ID)-[HAS_TRANSACTION]->34197711 (Transaction-ID) referring to missing node 4197711
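On a big ingest the report can run to many thousands of lines, so it’s worth summarising rather than eyeballing it; the patterns below match the two messages above:

# Count each class of problem in the report, then sample a few of the nastier ones.
grep -c "is defined more than once" import.report
grep -c "referring to missing node" import.report
grep "referring to missing node" import.report | head -n 20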
Important: check the permissions of the database folder now, BEFORE you create the database in Neo4j. If the database user doesn’t have read and write permissions on these files, things can get nastily stuck, and you may have to re-import the data.
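On a standard Linux package install (the path and the neo4j user below assume that layout), the check and the fix look something like:

# Confirm who owns the freshly imported store files...
ls -ld /var/lib/neo4j/data/databases/<your db>
# ...and hand the data directory to the user the Neo4j service runs as, if needed.
chown -R neo4j:neo4j /var/lib/neo4j/data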
Pick your favourite way of executing Cypher, and run the following commands:
:USE SYSTEM
CREATE DATABASE <YOUR DB>
After a short pause, your database should appear online (run SHOW DATABASES to see it). If you get an error and it appears as “offline”, it’s time to check the import.report and command logs more thoroughly.
A database that’s “online” might sound like proof of victory, but there are a few last things to validate. After our first “successful” ingest, we were disappointed to find that whilst our node-based queries were pretty fast, we got timeouts when retrieving relationships, even using simple queries. Not great for a database type that specialises in traversing relationships!
In our case, it transpired that our indexes were missing! By default, Neo4j creates two token-lookup indexes, one for nodes and one for relationships. However, they didn’t appear when we ran SHOW INDEXES, hence the slow queries.
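If you find yourself in the same position, the two default lookup indexes can also be recreated by hand; the index names here are our own:

# Recreate the node and relationship token lookup indexes a fresh database would normally have.
cypher-shell -u neo4j -p <password> -d <your db> "CREATE LOOKUP INDEX node_label_lookup IF NOT EXISTS FOR (n) ON EACH labels(n);"
cypher-shell -u neo4j -p <password> -d <your db> "CREATE LOOKUP INDEX rel_type_lookup IF NOT EXISTS FOR ()-[r]-() ON EACH type(r);"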
Happily, the neo4j-admin tool also provides a consistency checker. We ran it through the following command:
neo4j-admin check-consistency --database=<your database>
Unfortunately, this revealed several relationships that were apparently missing either their source or their target node. We put the kettle on and ran the import job again for another couple of hours while we played with an old dataset. Strangely, rerunning the import once more gave us a perfectly consistent database!
Once you’ve got a consistent database, congratulations! Now it’s time to add the indexes you planned out in the data modelling phase. On a large dataset they can take quite a while to populate, but you can still run other queries while they’re building, and you can watch the progress using the SHOW INDEXES command once more. In our case, a simple node property index took ~15 minutes to populate for a few hundred million rows.
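As a concrete sketch (the label, property, and index name are placeholders), creating one of those indexes and polling its progress looks like this:

# Create a node property index, then check how far population has got.
cypher-shell -u neo4j -p <password> -d <your db> "CREATE INDEX customer_id_index IF NOT EXISTS FOR (c:Customer) ON (c.id);"
cypher-shell -u neo4j -p <password> -d <your db> "SHOW INDEXES YIELD name, state, populationPercent;"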
Backups are at the heart of running operations smoothly, and that’s just as true in the graph database space. Until you have regular backups scheduled, it is not a bad idea to take a snapshot of the data disk.
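On GCP that one-off snapshot is a single command; the disk name and zone below are placeholders, and it’s worth stopping Neo4j first (or accepting a crash-consistent snapshot):

# Snapshot the data disk while regular backups aren't yet in place.
gcloud compute disks snapshot neo4j-data-disk --zone=europe-west2-a --snapshot-names=neo4j-post-import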
If you have followed us to this point, you should be able to get started loading millions of nodes into Neo4j. The golden rule is that, at this scale, planning is everything. At OpenCredo, we help our customers get the best out of their data by decentralising governance, streamlining their pipelines, and gaining insights from connected data with knowledge graphs. Please reach out if you have any questions about ingestion, and let us know if there is anything we can help you with.