Open Credo

Consolidating Graph Data to drive future Data Analytics Capabilities

The National Journal has been a staple in the Washington media landscape for over 50 years. As a leading research and insights firm, it provides government affairs and communications professionals with the intelligence and tools needed to advocate effectively for new policies, and to do so with greater efficiency and precision. Within National Journal, the Network Science Initiative (NSI) division conducts deep research on people and organizations, maps the connections among them, and illuminates viable pathways for enhanced advocacy and relationship development.

www.nationaljournal.com

THE CHALLENGE

National Journal (NJ) conducts research across a wide spectrum of topics, from general politics to health care to tax reform. Different analysts focus on different areas, and multiple data sources are involved. The resulting research yields sets of connected data, captured and stored in a number of disparate, disconnected stores with differing formats.

The NSI team was looking to consolidate these disparate data sets, ultimately resulting in a single source of truth. The goal of this consolidation was to allow multiple analysts to share and gain insight into connections discovered across multiple clients, projects, and dimensions, whilst still allowing meaningful graph visualisations to be produced for end clients. Understanding the highly connected nature of the data involved, National Journal selected the graph database Neo4j to house the data itself, coupled with Linkurious for data exploration and visualisation.

Anticipating the requirement to handle data beyond the initial migration phase, the NSI team also wanted a simple pipeline or workflow capable of handling ongoing ingestion for two of their well-known data source formats. Crucial for both the initial migration and the ongoing workflow process was the ability to detect and deal with duplicated and similar data.

THE SOLUTION

The partnership began with a very lightweight discovery phase. Working collaboratively with National Journal, we established an understanding of analysts' typical working patterns, got into the detail of the data itself, and identified common queries and analysis techniques. Drawing on our skills and experience in graph and broader data analysis, we quickly arrived at the first version of the consolidated NSI graph data model, accommodating the initial requirements. This provided the basis upon which further iterations then evolved.

Cognizant of the desire for a simple solution that would not place a heavy operational burden on the engineering team responsible for looking after it, Google Cloud Platform (GCP) was collaboratively identified as the ideal platform to build upon. Moving into delivery, the broader workflow and ingestion solution started to take shape. The final ingestion workflow combined a number of GCP managed services with some Python-based development. The end result was a solution where analysts upload data into the system and receive initial validation and feedback, with the data ultimately landing in Neo4j after passing through a deduplication process along the way.
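The case study does not detail the validation step itself, but the "upload, validate, give feedback" stage of such a workflow might look like the following Python sketch. The column names and messages here are purely illustrative assumptions, not National Journal's actual schema:

```python
import csv
import io

# Hypothetical required columns for an analyst upload; the real
# NSI schemas are not described in the case study.
REQUIRED_COLUMNS = {"name", "organization", "connection_type"}

def validate_upload(raw_csv: str) -> tuple[list[dict], list[str]]:
    """Parse an uploaded CSV and return (valid_rows, feedback_messages).

    Rows missing required fields are rejected with a message, so the
    analyst gets immediate feedback before anything reaches the graph.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [], [f"missing columns: {sorted(missing)}"]

    valid, feedback = [], []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        if all(row[col].strip() for col in REQUIRED_COLUMNS):
            valid.append(row)
        else:
            feedback.append(f"line {line_no}: empty required field, row skipped")
    return valid, feedback
```

In a managed-services setup of the kind described, a function like this could run in response to a file upload, returning feedback to the analyst while passing valid rows on to the deduplication stage.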

With the end-to-end workflow in place, we then turned to evolving the deduplication and similarity detection logic. Simple business rules, as well as more advanced algorithmic techniques, were used to deliver this aspect of the work. By working continually with the NSI analysts, the process, rules and algorithms were adapted and tuned to better identify and classify the data and achieve the best possible outcome.
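As an illustration only, since the actual rules and algorithms are not detailed here, combining a simple business rule (normalising names before exact comparison) with an algorithmic fallback (a fuzzy similarity score) might look like this; the suffix list and threshold are assumptions:

```python
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    """Business rule: case-fold, trim, and strip common org suffixes."""
    name = name.lower().strip()
    for suffix in (" inc.", " inc", " llc", " ltd"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.strip()

def similarity(a: str, b: str) -> float:
    """Algorithmic fallback: fuzzy ratio between normalised names (0.0-1.0)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def classify(a: str, b: str, threshold: float = 0.85) -> str:
    """Classify a candidate pair as 'duplicate', 'similar', or 'distinct'."""
    if normalise(a) == normalise(b):
        return "duplicate"
    return "similar" if similarity(a, b) >= threshold else "distinct"
```

Pairs flagged as "similar" rather than "duplicate" could then be routed to an analyst for review, which is one way the rules and thresholds get tuned over successive iterations.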

“When we began our partnership with OpenCredo, we had done nearly 200 network research projects for 60+ clients and yet we had no way of tracking people, organizations, and the connections among them across all of that data. OpenCredo helped us integrate that data, seamlessly flow it into our data visualization tool, and deal with a massive amount of data duplication issues. Through the process, we came to think about our data differently and have already begun to use this new frame to deliver excellence to our clients. The experience of working with OpenCredo couldn’t have been better – they were highly professional, organized, and supremely competent in delivering this work to us.”

– Luke Hartig, Executive Director

Technologies employed included Neo4j, Linkurious, Python and various GCP services.

THE OUTCOME

Through a process of continuous and iterative feedback with the NSI team, National Journal acquired:

  • A consolidation of their disparate data sets into a unified single source of truth within Neo4j
  • A fully managed, elastic cloud-based data ingestion platform within GCP
  • A flexible deduplication and similarity detection process – able to evolve and incorporate more sophisticated machine learning and algorithmic methods as future data needs grow.