Open Credo

Showing all blogs in category Data Engineering

GOTO Copenhagen 2023 – The 12 Factor App For Data (Recording)

March 14, 2024 | Blog, Data Engineering, Platform Engineering

GOTO Copenhagen 2023 – The 12 Factor App For Data (Recording)

Watch the recording of our Technical Delivery Director, James Bowkett from the GOTO Copenhagen 2023 conference for his talk ‘The 12 Factor App For Data’

Read More Read More

TL;DR – Computational Governance in Data Mesh with OPA

April 21, 2023 | Blog, Data Engineering, Data Governance, Data Mesh, OPA

TL;DR – Computational Governance in Data Mesh with OPA

Check out Mateus Pimenta’s TL;DR video to learn how federated computational governance could be implemented using Open Policy Agent (OPA) and policy-as-code to support a successful Data Mesh architecture.

Read More Read More

Lunch & Learn: Scala in 2022

June 16, 2022 | Data Analysis, Data Engineering

Lunch & Learn: Scala in 2022

In this lunch & learn session Matt Farrow shares the new developments in Scala and what it could all mean for Scala’s future.

Read More Read More

Devoxx UK 2022 – Tracing Your Data’s DNA

May 26, 2022 | Data Analysis, Data Engineering

Devoxx UK 2022 – Tracing Your Data’s DNA

As data becomes ubiquitous and deeply interconnected, tracing where who or which system that data comes from – its lineage – will create bigger problems and opportunities for us on the horizon. Watch the recording of James Bowkett’s talk from Devoxx UK on ‘Tracing Your Data’s DNA’

Read More Read More

Lunch & Learn: A Closer Look at Databricks

March 24, 2022 | Data Analysis, Data Engineering

Lunch & Learn: A Closer Look at Databricks

In this Lunch & Learn session our Senior Consultant, Seb Margineanu shares an overview of the Databricks Lakehouse Platform by exploring the evolution of Databricks.

Read More Read More

Lunch & Learn: GraphQL on Springboot

March 10, 2022 | Data Engineering, Open Source

Lunch & Learn: GraphQL on Springboot

In this lunch & learn session, Ebru Cucen and Alberto Faedda explore the historical background of GraphQL with case examples and a demo of how it can be used.

Read More Read More

Lunch & Learn: Data Pipelines with Apache Hop & Pentaho Data Integration

February 10, 2022 | Data Engineering

Lunch & Learn: Data Pipelines with Apache Hop & Pentaho Data Integration

In this lunch & learn session, Fahran Wallace looks at how Apache Hop and its still active parent project Pentaho Data Integration help tackle data pipeline.

Read More Read More

Making Sense of Data with RDF* vs. LPG

January 31, 2022 | Blog, Data Engineering

Making Sense of Data with RDF* vs. LPG

There are two camps of Graph database, one side is RDF, where they are strict with their format, and somewhat limited for their extensibility. The other side is LPG, where they can define labels to the relationships. With its recent extension, RDF now allows users to add properties, thus becoming RDF*. In this blog, Ebru explores the structural and performance differences between LPG and RDF*.

Read More Read More

Kafka vs RabbitMQ: The Consumer-Driven Choice

July 20, 2021 | Blog, Data Engineering, Kafka

Kafka vs RabbitMQ: The Consumer-Driven Choice

Message and event-driven systems provide an array of benefits to organisations of all shapes and sizes. At their core, they help decouple producers and consumers so that each can work at their own pace without having to wait for the other – asynchronous processing at its best.

In fact, such systems enable a whole range of messaging patterns, offering varying levels of guarantees surrounding the processing and consumption options for clients. Take for example the publish/subscribe pattern, which enables one message to be broadcast and consumed by multiple consumers; or the competing consumer pattern, which enables a message to be processed once but with multiple concurrent consumers vying for the honour—essentially providing a way to distribute the load. The manner in which these patterns are actually realised however, depends a lot on the technology used, as each has its own approach and unique tradeoffs. 

In this article we will explore how this all applies to RabbitMQ and Apache Kafka, and how these two technologies differ, specifically from a message consumer’s perspective.

Read More Read More

Machine Learning at scale: first impressions of Kubeflow

April 20, 2021 | Data Engineering, Machine Learning, Software Consultancy

Machine Learning at scale: first impressions of Kubeflow

Our recent client was a Fintech who had ambitions to build a Machine Learning platform for real-time decision making. The client had significant Kubernetes proficiency, ran on the cloud, and had a strong preference for using free, open-source software over cloud-native offerings that come with lock-in. Several components were spiked with success (feature preparation with Apache Beam and Seldon for model serving performed particularly strongly). Kubeflow was one of the next technologies on our list of spikes, showing significant promise at the research stage and seemingly a good match for our client’s priorities and skills.

That platform slipped down the client’s priority list before completing the research for Kubeflow, so I wanted to see how that project might have turned out. Would Kubeflow have made the cut?

 

Read More Read More

Events and Commands: Two Faces of the Same Coin?

March 14, 2018 | Data Engineering, Microservices

Events and Commands: Two Faces of the Same Coin?

Events are obviously the fundamental building block of event-sourced systems. Commands are equally a common concept in such systems although the distinction between events and commands, if any, is not always clear. There are certainly varying views on what role each one should play.

Read More Read More

A Pragmatic Introduction to Machine Learning for DevOps Engineers

January 23, 2018 | Data Engineering, DevOps

A Pragmatic Introduction to Machine Learning for DevOps Engineers

Machine Learning is a hot topic these days, as can be seen from search trends. It was the success of Deepmind and AlphaGo in 2016 that really brought machine learning to the attention of the wider community and the world at large.

Read More Read More

Writing a custom JupyterHub Spawner

January 11, 2018 | Data Engineering

Writing a custom JupyterHub Spawner

The last few years have seen Python emerge as a lingua franca for data scientists. Alongside Python we have also witnessed the rise of Jupyter Notebooks, which are now considered a de facto data science productivity tool, especially in the Python community. Jupyter Notebooks started as a university side-project known as iPython in circa 2001 at UC Berkeley.

Read More Read More

Q&A with Cockroach Labs – creators of CockroachDB

October 24, 2017 | Data Engineering

Q&A with Cockroach Labs – creators of CockroachDB

Cockroach Labs, the creators of CockroachDB are coming to London for the first time since their 1.0 GA Release in May 2017. They will be taking time to talk about “The Hows & Whys of a Distributed SQL Database” at the Applied Data Engineering meetup, hosted and run by us here at OpenCredo.
We have been interested in CockroachDB for a while now, including publishing our initial impressions of the release on our blog. We thought this would be the perfect time to do a bit of a Q&A before the event! I posed Raphael Poss, a core Software Engineer at Cockroach Labs a few questions.

Read More Read More

CockroachDB: First Impressions

June 15, 2017 | Data Engineering

CockroachDB: First Impressions

CockroachDB is a distributed SQL (“NewSQL”) database developed by Cockroach Labs and has recently reached a major milestone: the first production-ready, 1.0 release. We at OpenCredo have been following the progress of CockroachDB for a while, and we think it’s a technology of great potential to become the go-to solution for a having a general-purpose database in the cloud.

Read More Read More

Deploy Spark with an Apache Cassandra cluster

May 2, 2017 | Cassandra, Data Engineering

Deploy Spark with an Apache Cassandra cluster

My recent blogpost I explored a few cases where using Cassandra and Spark together can be useful. My focus was on the functional behaviour of such a stack and what you need to do as a developer to interact with it. However, it did not describe any details about the infrastructure setup that is capable of running such Spark code or any deployment considerations. In this post, I will explore this in more detail and show some practical advice in how to deploy Spark and Apache Cassandra.

Read More Read More

New Blog Series: Spark – The Pragmatic Bits

April 25, 2017 | Cassandra, Data Analysis, Data Engineering

New Blog Series: Spark – The Pragmatic Bits

Apache Spark is a powerful open source processing engine which is fast becoming our technology of choice for data analytic projects here at OpenCredo. For many years now we have been helping our clients to practically implement and take advantage of various big data technologies including the like of Apache Cassandra amongst others.

 

 

Read More Read More

Data Analytics using Cassandra and Spark

March 23, 2017 | Cassandra, Data Analysis, Data Engineering

Data Analytics using Cassandra and Spark

In recent years, Cassandra has become one of the most widely used NoSQL databases: many of our clients use Cassandra for a variety of different purposes. This is no accident as it is a great datastore with nice scalability and performance characteristics.

However, adopting Cassandra as a single, one size fits all database has several downsides. The partitioned/distributed data storage model makes it difficult (and often very inefficient) to do certain types of queries or data analytics that are much more straightforward in a relational database.

Read More Read More

Automating Your Security Acceptance Tests

March 23, 2017 | Data Engineering, Machine Learning

Automating Your Security Acceptance Tests

On previous blog posts we have provided examples of different types of acceptance tests coverage, UI, API and Performance. One area where automation is often lacking is around validating the security of the application under test. This has been discussed in the post on non functional testing You Are Ignoring Non-functional Testing. With this post we will enhance the automation framework to quickly check for some common security flaws.

Read More Read More

Event Replaying with Hazelcast Jet

February 13, 2017 | Data Engineering

Event Replaying with Hazelcast Jet

One of the stated intentions behind the design of Java 8’s Streams API was to take better advantage of the multi-core processing power of modern computers. Operations that could be performed on a single, linear stream of values could also be run in parallel by splitting that stream into multiple sub-streams, and combining the results from processing each sub-stream as they became available.

Read More Read More

Reactive event processing with Reactor Core: a first look

January 26, 2017 | Data Engineering

Reactive event processing with Reactor Core: a first look

Suppose you are given the task of writing code that fulfils the following contract:

  • You will be given a promise that, at some point in the future, some data – a series of values – will become available.
  • In return, you will supply a promise that, at some point in the future, some data representing the results of processing that data will become available.
  • There may be more values to process than you can fit in memory, or even an infinite series of values.
  • You are allowed to specify what will be done with each individual value, as and when it becomes available; this includes discarding some values.
  • Whenever you want to use some external service to do something with a value, that service can only return you a promise that, at some point in the future, some data representing the result of processing that value will become available.

 

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.

Read More Read More

Common Problems with Cassandra Tombstones

September 27, 2016 | Cassandra, Data Engineering

Common Problems with Cassandra Tombstones

If there is one thing to understand about Cassandra, it is the fact that it is optimised for writes. In Cassandra everything is a write including logical deletion of data which results in tombstones – special deletion records. We have noticed that lack of understanding of tombstones is often the root cause of production issues our clients experience with Cassandra. We have decided to share a compilation of the most common problems with Cassandra tombstones and some practical advice on solving them.

Read More Read More

Concursus: Event Sourcing for the Internet of Things

May 10, 2016 | Data Engineering, White Paper

Concursus: Event Sourcing for the Internet of Things

In this technical report, we present Concursus, a framework for developing distributed applications using CQRS and event sourcing patterns within a modern, Java 8-centric, programming model. Following a high-level survey of the trends leading towards the adoption of these patterns, we show how Concursus simplifies the task of programming event sourcing applications by providing a concise, intuitive API to systems composed of event processing middleware.

Read More Read More

Hazelcast and Spring-managed Transactions: A Sample Integration

January 26, 2016 | Data Engineering

Hazelcast and Spring-managed Transactions: A Sample Integration

In this second post about Hazelcast and Spring, I’m integrating Hazelcast and Spring-managed transaction for a specific use case: A transactional Queue. More specifically, I want to make the message polling, of my sample chat application, transactional.

Read More Read More

Introduction to Akka Streams – Getting started

October 1, 2015 | Data Engineering

Introduction to Akka Streams – Getting started

Going reactive

Akka Streams, the new experimental module under Akka project has been finally released in July after some months of development and several milestone and RC versions. In this series I hope to gently introduce concepts from the library and demonstrate how it can be used to address real-life stream processing challenges.

Akka Streams is an implementation of the Reactive Streams specification on top of Akka toolkit that uses actor based concurrency model. Reactive Streams specification has been created by the number of companies interested in asynchronous, non-blocking, event based data processing that can span across system boundaries and technology stacks.

Read More Read More

Building a Google analytics dashboard with Python3, Tornado and deploying it on OpenShift (for free)

August 5, 2015 | Data Analysis, Data Engineering

Building a Google analytics dashboard with Python3, Tornado and deploying it on OpenShift (for free)

A few weeks ago, we thought about building a Google analytics dashboard to give us easy access to certain elements of our Google Analytics web traffic. We saw some custom dashboards for bloggers, but nothing quite right for our goal, since we wanted the data on a big screen for everyone in the office to view.

Read More Read More

A Simple Introduction to Complex Event Processing – Stock Ticker End-to-End Sample

February 8, 2012 | Data Analysis, Data Engineering

A Simple Introduction to Complex Event Processing – Stock Ticker End-to-End Sample

Most of the important players in this space are large IT corporations like Oracle and IBM with their commercial (read expensive) offerings.

While most of CEP products offer some great features, it’s license model and close code policy doesn’t allow developers to play with them on pet projects, which would drive adoption and usage of CEP in every day programming.

Read More Read More