Showing all blogs in category Data Engineering
Check out Mateus Pimenta’s TL;DR video to learn how federated computational governance could be implemented using Open Policy Agent (OPA) and policy-as-code to support a successful Data Mesh architecture.
As data becomes ubiquitous and deeply interconnected, tracing where who or which system that data comes from – its lineage – will create bigger problems and opportunities for us on the horizon. Watch the recording of James Bowkett’s talk from Devoxx UK on ‘Tracing Your Data’s DNA’
February 10, 2022 | Data Engineering
In this lunch & learn session, Fahran Wallace looks at how Apache Hop and its still active parent project Pentaho Data Integration help tackle data pipeline.
There are two camps of Graph database, one side is RDF, where they are strict with their format, and somewhat limited for their extensibility. The other side is LPG, where they can define labels to the relationships. With its recent extension, RDF now allows users to add properties, thus becoming RDF*. In this blog, Ebru explores the structural and performance differences between LPG and RDF*.
Message and event-driven systems provide an array of benefits to organisations of all shapes and sizes. At their core, they help decouple producers and consumers so that each can work at their own pace without having to wait for the other – asynchronous processing at its best.
In fact, such systems enable a whole range of messaging patterns, offering varying levels of guarantees surrounding the processing and consumption options for clients. Take for example the publish/subscribe pattern, which enables one message to be broadcast and consumed by multiple consumers; or the competing consumer pattern, which enables a message to be processed once but with multiple concurrent consumers vying for the honour—essentially providing a way to distribute the load. The manner in which these patterns are actually realised however, depends a lot on the technology used, as each has its own approach and unique tradeoffs.
In this article we will explore how this all applies to RabbitMQ and Apache Kafka, and how these two technologies differ, specifically from a message consumer’s perspective.
Our recent client was a Fintech who had ambitions to build a Machine Learning platform for real-time decision making. The client had significant Kubernetes proficiency, ran on the cloud, and had a strong preference for using free, open-source software over cloud-native offerings that come with lock-in. Several components were spiked with success (feature preparation with Apache Beam and Seldon for model serving performed particularly strongly). Kubeflow was one of the next technologies on our list of spikes, showing significant promise at the research stage and seemingly a good match for our client’s priorities and skills.
That platform slipped down the client’s priority list before completing the research for Kubeflow, so I wanted to see how that project might have turned out. Would Kubeflow have made the cut?
Events are obviously the fundamental building block of event-sourced systems. Commands are equally a common concept in such systems although the distinction between events and commands, if any, is not always clear. There are certainly varying views on what role each one should play.
Machine Learning is a hot topic these days, as can be seen from search trends. It was the success of Deepmind and AlphaGo in 2016 that really brought machine learning to the attention of the wider community and the world at large.
January 11, 2018 | Data Engineering
The last few years have seen Python emerge as a lingua franca for data scientists. Alongside Python we have also witnessed the rise of Jupyter Notebooks, which are now considered a de facto data science productivity tool, especially in the Python community. Jupyter Notebooks started as a university side-project known as iPython in circa 2001 at UC Berkeley.
October 24, 2017 | Data Engineering
Cockroach Labs, the creators of CockroachDB are coming to London for the first time since their 1.0 GA Release in May 2017. They will be taking time to talk about “The Hows & Whys of a Distributed SQL Database” at the Applied Data Engineering meetup, hosted and run by us here at OpenCredo.
We have been interested in CockroachDB for a while now, including publishing our initial impressions of the release on our blog. We thought this would be the perfect time to do a bit of a Q&A before the event! I posed Raphael Poss, a core Software Engineer at Cockroach Labs a few questions.
June 15, 2017 | Data Engineering
CockroachDB is a distributed SQL (“NewSQL”) database developed by Cockroach Labs and has recently reached a major milestone: the first production-ready, 1.0 release. We at OpenCredo have been following the progress of CockroachDB for a while, and we think it’s a technology of great potential to become the go-to solution for a having a general-purpose database in the cloud.
My recent blogpost I explored a few cases where using Cassandra and Spark together can be useful. My focus was on the functional behaviour of such a stack and what you need to do as a developer to interact with it. However, it did not describe any details about the infrastructure setup that is capable of running such Spark code or any deployment considerations. In this post, I will explore this in more detail and show some practical advice in how to deploy Spark and Apache Cassandra.
Apache Spark is a powerful open source processing engine which is fast becoming our technology of choice for data analytic projects here at OpenCredo. For many years now we have been helping our clients to practically implement and take advantage of various big data technologies including the like of Apache Cassandra amongst others.
In recent years, Cassandra has become one of the most widely used NoSQL databases: many of our clients use Cassandra for a variety of different purposes. This is no accident as it is a great datastore with nice scalability and performance characteristics.
However, adopting Cassandra as a single, one size fits all database has several downsides. The partitioned/distributed data storage model makes it difficult (and often very inefficient) to do certain types of queries or data analytics that are much more straightforward in a relational database.
On previous blog posts we have provided examples of different types of acceptance tests coverage, UI, API and Performance. One area where automation is often lacking is around validating the security of the application under test. This has been discussed in the post on non functional testing You Are Ignoring Non-functional Testing. With this post we will enhance the automation framework to quickly check for some common security flaws.
February 13, 2017 | Data Engineering
One of the stated intentions behind the design of Java 8’s Streams API was to take better advantage of the multi-core processing power of modern computers. Operations that could be performed on a single, linear stream of values could also be run in parallel by splitting that stream into multiple sub-streams, and combining the results from processing each sub-stream as they became available.
January 26, 2017 | Data Engineering
Suppose you are given the task of writing code that fulfils the following contract:
This blog is written exclusively by the OpenCredo team. We do not accept external contributions.
If there is one thing to understand about Cassandra, it is the fact that it is optimised for writes. In Cassandra everything is a write including logical deletion of data which results in tombstones – special deletion records. We have noticed that lack of understanding of tombstones is often the root cause of production issues our clients experience with Cassandra. We have decided to share a compilation of the most common problems with Cassandra tombstones and some practical advice on solving them.
In this technical report, we present Concursus, a framework for developing distributed applications using CQRS and event sourcing patterns within a modern, Java 8-centric, programming model. Following a high-level survey of the trends leading towards the adoption of these patterns, we show how Concursus simplifies the task of programming event sourcing applications by providing a concise, intuitive API to systems composed of event processing middleware.
January 26, 2016 | Data Engineering
In this second post about Hazelcast and Spring, I’m integrating Hazelcast and Spring-managed transaction for a specific use case: A transactional Queue. More specifically, I want to make the message polling, of my sample chat application, transactional.
October 1, 2015 | Data Engineering
Akka Streams, the new experimental module under Akka project has been finally released in July after some months of development and several milestone and RC versions. In this series I hope to gently introduce concepts from the library and demonstrate how it can be used to address real-life stream processing challenges.
Akka Streams is an implementation of the Reactive Streams specification on top of Akka toolkit that uses actor based concurrency model. Reactive Streams specification has been created by the number of companies interested in asynchronous, non-blocking, event based data processing that can span across system boundaries and technology stacks.
A few weeks ago, we thought about building a Google analytics dashboard to give us easy access to certain elements of our Google Analytics web traffic. We saw some custom dashboards for bloggers, but nothing quite right for our goal, since we wanted the data on a big screen for everyone in the office to view.
Most of the important players in this space are large IT corporations like Oracle and IBM with their commercial (read expensive) offerings.
While most of CEP products offer some great features, it’s license model and close code policy doesn’t allow developers to play with them on pet projects, which would drive adoption and usage of CEP in every day programming.