April 2, 2020 | Machine Learning
Recent years have seen many companies consolidate all their data into a data lake/warehouse of some sort. Once it’s all consolidated, what next?
Many companies consolidate data with a field of dreams mindset – “build it and they will come”. However, a comprehensive data strategy is needed if the ultimate goals of an organisation are to be realised; monetisation through Machine Learning and AI is an oft-cited goal. Before one rushes into the enticing world of machine learning, though, one should lay more mundane foundations. Indeed, estimates of the share of data science time devoted to so-called data wrangling vary between 50% and 80%. Further, engineers at Google have estimated that a mature ML system may consist of as little as 5% ML code and 95% “glue code”. If this is the reality we face, what foundations are required before one can dive headlong into ML?
“What good is data collection if we don’t know what we have?”
The fast track to creating a data graveyard is to diligently add any and all data to it, without keeping track of what data has been put in, what it means and what data types it comprises (strongly typed data is always preferable to unstructured collections of strings). A good data catalogue will feature:
“What good is data if we can’t keep it secure?”
Of course all data should be governed and permissioned, but there are several granularities to data governance, each more difficult – yet more powerful – than the last (and also less prevalent as we move down the list). Some of the approaches below have sometimes been implemented at the application layer, but as data becomes more centralised, it is necessary to remove this concern from the application in order to consistently govern data globally.
When data is the lifeblood of an organisation, the catalogue and its governance are the backbone that guarantees its integrity and protects your investment.
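To make the idea of governing data centrally, rather than in each application, a little more concrete, here is a minimal sketch of field-level permissioning. The `POLICY` table, the role names and the `read_record` function are all hypothetical and exist only for illustration; a real platform would hold the policy alongside the catalogue and enforce it in the data layer itself.

```python
# Hypothetical central policy: dataset -> field -> roles allowed to read it.
POLICY = {
    "customers": {
        "name": {"analyst", "support"},
        "email": {"support"},
        "lifetime_value": {"analyst"},
    }
}


def read_record(dataset, record, role):
    """Return only the fields that `role` may see; anything unlisted is denied."""
    allowed = POLICY.get(dataset, {})
    return {
        field: value
        for field, value in record.items()
        if role in allowed.get(field, set())
    }


record = {"name": "Ada", "email": "ada@example.com", "lifetime_value": 1200}
print(read_record("customers", record, "analyst"))
# {'name': 'Ada', 'lifetime_value': 1200}
```

Denying by default means that a field added to a dataset stays hidden until someone has consciously catalogued and permissioned it.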
“What good is data if we can’t monetise it?”
Once the basics of good governance are in place, it is then possible to track and audit access to the data. This should be done at a level as fine-grained as possible, ideally at a field level. Depending on the size of the organisation, this then allows for the relevant department to be charged for its usage. This may seem mercenary at first. However, in larger organisations the capital expenditure on on-prem hardware and the ongoing labour costs (or purely opex if hosting in the cloud) will need to be covered to secure ongoing investment in data availability.
Even without a chargeback model, it is advisable to gather statistics on access and usage (often called “exhaust data”) to better inform future decisions when planning operational or data model changes.
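As a rough illustration of field-level access tracking feeding both usage statistics and a chargeback model, consider the sketch below. The function names, the shape of the log and the rule of splitting cost in proportion to field reads are assumptions made for the example, not a prescribed method.

```python
from collections import Counter


def record_access(log, department, dataset, fields):
    """Append one audit entry per field read (hypothetical log shape)."""
    log.extend((department, dataset, field) for field in fields)


def field_usage(log):
    """Exhaust data: how often each (dataset, field) pair has been read."""
    return Counter((dataset, field) for _, dataset, field in log)


def chargeback(log, total_cost):
    """Split total_cost between departments in proportion to their field reads."""
    reads = Counter(department for department, _, _ in log)
    total_reads = sum(reads.values())
    if total_reads == 0:
        return {}
    return {dept: total_cost * n / total_reads for dept, n in reads.items()}


log = []
record_access(log, "marketing", "customers", ["name", "email", "segment"])
record_access(log, "finance", "customers", ["name"])

print(chargeback(log, 1000.0))          # {'marketing': 750.0, 'finance': 250.0}
print(field_usage(log).most_common(1))  # [(('customers', 'name'), 2)]
```

The same log answers both questions: who should pay, and which fields are actually used when a data model change is being planned.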
“What good is data if we don’t know where it came from?”
Central data stores by their very nature consume data from other sources and often perform ETL-like operations to sanitise and clean that data before serving it to other systems. As systems become more interconnected and interdependent, it will be necessary to track and ensure the provenance of data as it traverses the organisation (or is even published to external consumers).
Many applications can detail how different source systems are connected. However, this is not enough. It is necessary for individual rows/documents to be able to describe where their values came from, as this may easily change over time or vary from document to document within the same collection depending on how or when they were constructed.
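One way to picture row/document-level lineage is a document that carries a provenance trail for each of its fields. The sketch below is only an illustration under that assumption; the `Provenance` and `Document` classes and their field names are hypothetical rather than taken from any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    source_system: str  # where the value came from
    transform: str      # what was done to it on the way in
    recorded_at: str    # when this value was written (UTC, ISO 8601)


@dataclass
class Document:
    values: dict = field(default_factory=dict)
    lineage: dict = field(default_factory=dict)  # field name -> list of Provenance

    def set(self, name, value, source_system, transform="copy"):
        """Write a value and append to that field's provenance history."""
        self.values[name] = value
        stamp = datetime.now(timezone.utc).isoformat()
        self.lineage.setdefault(name, []).append(
            Provenance(source_system, transform, stamp)
        )

    def origin(self, name):
        """Source system of the current value of a field."""
        return self.lineage[name][-1].source_system


doc = Document()
doc.set("email", "Ada@Example.com", source_system="crm")
doc.set("email", "ada@example.com", source_system="billing", transform="lowercase")

print(doc.origin("email"))                             # billing
print([p.source_system for p in doc.lineage["email"]])  # ['crm', 'billing']
```

Because the history lives with the document, two documents in the same collection can report different origins for the same field.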
With the correct lineage recording in place, it is then possible to answer the following important questions of your data estate:
Data lineage is set to become even more important as machine learning models start to derive ever-more data points that are used to make decisions, and those decisions will need to be explained and reproduced to regulators and consumers alike.
“What good is data if we can’t get it to the right place at the right time?”
No data strategy would be complete without a model or mechanism for delivering data to consumers who need it at the right time. Now that Big Data is “part of the furniture”, it is making way for Fast Data. Streaming data platforms such as Kafka and Pulsar mean that data can be delivered in a reliable, scalable and immutable way. It should be noted that while there are designs/patterns/idioms/architectures for a digital nervous system with a streaming platform as its backbone, in today’s landscape it would require a great deal of work to also incorporate the other facets of data strategy outlined herein.
“What good is data without insight?”
“Water, water, everywhere, / Nor any drop to drink” – Samuel Taylor Coleridge, The Rime of the Ancient Mariner
It is not enough for an organisation to merely centrally store all the data it generates and collects. The data needs to be analysed, and new facts or insights generated from it. A comprehensive data science strategy is needed (possibly in conjunction with a team of citizen developers); its exact nature will align with specific business goals and profit/cost lines. Design thinking workshops can help here by ideating what kinds of insights might be of use to different groups of stakeholders, then devising strategies for achieving those goals (or finding out more where necessary).
It is imperative that a good data insight vision be supported by an industrialised data science process, with the appropriate tooling and reproducibility enforced and enabled by the platform. This will be explored in a forthcoming blog post.
“What good is insight if we can’t see/explain what we have?”
Hand in hand with insight comes data visualisation. Once the data is collected, it is important that it is displayed to the user in a way that assists their understanding of what is being shown. This is an artform in and of itself; however, David McCandless’s excellent book “Information is Beautiful” can be of use for developers, with many of the charts being available as D3.js add-on libraries.
“How can I rely on the data if I don’t know how good it is?”
In order to remain responsive to change, and to continue pouring data into the data lake, it is necessary to devise a data testing strategy that aligns with the approach to continuous delivery. This will ensure that data integrity is maintained – i.e. that the data is still usable, appropriate, discoverable and secure, with correct and queryable metadata continually gathered (for lineage and such) with little overhead on application development teams, while guaranteeing the consistency that centralising the data function brings.
When data was smaller and more manageable, database restores from production were commonplace. However, in the age of big data lakes and warehouses this may not always be possible. In cases where regression testing of new releases against production is necessary and data volumes make complete population comparisons prohibitively expensive, inserting an immutable message log (such as Apache Kafka or Pulsar) in between the data delivery and the ingestion processes can be invaluable. Sometimes named the “wiretap” or “tee” pattern, it enables constant regression testing and comparison with production data. If this is tested within a canary environment with accompanying release notes and comparison regression test packs, this can provide very early feedback to development teams and empower them to deliver releases more frequently into production.
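The “tee” idea can be sketched in a few lines. Here a simple in-memory `MessageLog` stands in for a Kafka or Pulsar topic, and the `tee` and `regression_diff` functions, along with the two ingestion functions, are hypothetical names invented for the example; it shows the shape of the pattern rather than a production implementation.

```python
class MessageLog:
    """Append-only in-memory log standing in for a Kafka or Pulsar topic."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the appended message

    def replay(self, from_offset=0):
        return iter(self._messages[from_offset:])


def tee(message, log, ingest):
    """Copy each delivered message to the log, then ingest it as usual."""
    log.append(message)
    return ingest(message)


def regression_diff(log, production, candidate):
    """Replay the log through both versions and report where results differ."""
    diffs = []
    for offset, message in enumerate(log.replay()):
        expected, actual = production(message), candidate(message)
        if expected != actual:
            diffs.append((offset, expected, actual))
    return diffs


def production_ingest(message):
    return {"total": message["qty"] * message["price"]}


def candidate_ingest(message):
    # Hypothetical new release that wrongly truncates totals to whole numbers.
    return {"total": int(message["qty"] * message["price"])}


log = MessageLog()
for msg in [{"qty": 2, "price": 5}, {"qty": 3, "price": 0.5}]:
    tee(msg, log, production_ingest)

print(regression_diff(log, production_ingest, candidate_ingest))
# [(1, {'total': 1.5}, {'total': 1})]
```

Because the log is never mutated, the same production traffic can be replayed against every candidate release in the canary environment without restoring a database.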