To identify service boundaries, it is not enough to consider (business) domains only. Other forces, such as organisational communication structures and – very importantly – time, strongly suggest that we should include several other criteria in our considerations. This post is an attempt to collect and document a list of heuristics for service decomposition that I have found useful over the years. The list is neither exhaustive nor authoritative and should only be used as a reference and guide. Because circumstances change over time, you should also regularly revisit your service boundaries, adapt them accordingly and always strive to lower the entry costs for such changes on both the organisational and the technical level.
The desire to break down large systems into smaller units didn’t start with computing. Every larger company has multiple departments that specialise in different areas but work together to fulfill the company’s purpose. I love the piece that Dan North wrote long ago in his post “Classic SOA”, explaining service concepts in the non-digital world. In IT we try to mimic such structures and have come up with terms like Modules, SOA and Microservices. Finding the right boundaries is not easy though, in particular at the beginning of a project when you know the least about it. I always wished I had a guide about when and where to introduce boundaries. This post contains my approach, distilled from the experience of having gone through this process numerous times over the years.
Services: A word on terminology
If you ask ten engineers, you will get eleven different definitions for the word service. For the purpose of this blog, a service ideally has the following attributes:
- Independent delivery pipeline (planning, implementation, deployment)
- Owned by one team
- Decentralized governance
- Independent data storage
- Ownership of communication interfaces
- Clear terminology (aka “ubiquitous language”)
- If user-facing, it comes with its own UI
I recommend reading Martin Fowler’s definition of Microservices. My favourite approach is “Self-Contained Systems” (SCS), to my knowledge first introduced by Stefan Tilkov during his work for otto.de. He talks about SCS on InfoQ (for German readers there’s also an excellent article). For general principles it is also always worth following “RFC 3439: Some Internet Architectural Guidelines and Philosophy”.
We will get it wrong – Time invalidates assumptions
The sad truth is, our systems are subject to erosion. Whatever boundaries we choose, we will get it wrong somewhere because time will inevitably invalidate some of the assumptions underlying our initially chosen approach to decomposing the system. Thus, we regularly need to revisit the assumptions that led to those boundaries. Even if we had the perfect requirements and found the perfect boundaries to support them, their context changes over time, and what is perfect today may be completely unfit tomorrow.
When shall we break down a system?
TL;DR: continuously! For greenfield projects we need to start somewhere, of course. I firmly believe that we will inevitably get it wrong, because at the beginning we know the least about the system we are supposed to design and build. Humans are evidently very bad at predicting the future. Hence I am a big fan of starting out with only a very coarse-grained decomposition of services (an extreme would be the “Monolith First” approach) and breaking it down further only later, when we feel we have learned enough to make informed decisions – a technique that nowadays is commonly known as “Evolutionary Architectures”. Of course, in order to apply an evolutionary approach we must be able to
- detect the need for further breaking up the system
- apply changes to our service boundaries
Developing capabilities to first and foremost detect the need for further breaking up a system, and then also to actually do so, is paramount to being able to respond to changes and new learnings in our architectures. We must strive to develop these capabilities as we go along and co-evolve them with the architecture.
So we are looking to break down our problem into smaller parts. What criteria should we consider? Over time, I have compiled a list of aspects that help me think better about where to draw the lines between systems. The list is not ordered by importance, as the weight of a criterion is highly context-dependent. If you have a small team, the criterion “Organisational Boundaries” won’t be a problem; in larger organisations it becomes a prime candidate. The nature of your domain also has a strong influence: it makes a huge difference whether a system is data-driven or process-driven. For better structure, criteria can be classified by their nature:
- Process & Data
Process & Data Criteria
Domains (aka “Bounded Context”)
Perhaps among the better-known candidates for introducing boundaries are “bounded contexts”, a term that became popular through Eric Evans’s seminal book “Domain-Driven Design”. These are areas where ubiquitous languages differ and the same term, say “Customer”, is associated with different concepts. A “customer” in your “browser preferences” domain certainly has a different meaning than in “billing”. We can consider it almost mandatory to split systems along different domains to avoid clashing requirements and “one model to fit all” problems.
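To make the “Customer” example a bit more tangible, here is a minimal sketch of how the two contexts might each keep their own model. All class and field names are made up for illustration; the only thing the two models share is an identifier.

```python
from dataclasses import dataclass


# Hypothetical "browser preferences" context:
# a customer is little more than a bag of UI settings.
@dataclass(frozen=True)
class PreferencesCustomer:
    customer_id: str
    language: str
    dark_mode: bool


# Hypothetical "billing" context:
# a customer is a legal entity that receives invoices.
@dataclass(frozen=True)
class BillingCustomer:
    customer_id: str
    legal_name: str
    vat_number: str
    payment_terms_days: int


def invoice_header(customer: BillingCustomer) -> str:
    """Billing logic only ever sees the billing model."""
    return (
        f"{customer.legal_name} (VAT {customer.vat_number}), "
        f"due in {customer.payment_terms_days} days"
    )
```

Neither context needs to know the other’s fields, so a change to invoicing rules never forces a change to the preferences model, and vice versa.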
Process Flows
Process flows are another great starting point. Why? Think about the nature of many requirement changes: they come in various disguises, but often they are changes to process flows. A shop introduces a new payment method, your colleagues from Customer Relations want to make it easier for customers to register their products, or changes in legislation require customers to agree to new “Terms & Conditions”. Given that coordination between teams and departments can slow down time-to-delivery by orders of magnitude, we want to keep a process flow within a single system to minimize the coordination effort needed to deliver changes to it. This is a major motivation for the fact that Self-Contained Systems include their own UI. On a related note, this is also a great argument against the popular practice of separating systems along domain entities (ever worked on a “Customer Service”?), which Michael Nygard covered in his post “Services by Lifecycle”. So “Change User Preferences” and “Checkout Shopping Cart” are great candidates to move into separate systems, including their UI.
Process Handoffs
The points where process flows enter a “wait” or “decision” state are good candidates for splitting flows into their constituent parts. E-commerce websites provide a good example: “browse product catalogue” and “check out” can (and probably should) be implemented as separate services to enable independent evolution.
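As a rough illustration of this heuristic, the sketch below cuts a flow into candidate segments wherever it reaches a “wait” or “decision” step. The step names and the three step kinds are assumptions I made up for the example; real flows are rarely this linear.

```python
def split_at_handoffs(steps):
    """Split a linear flow into segments that end at a wait or decision step.

    steps is a list of (name, kind) tuples, where kind is one of
    "action", "wait" or "decision" (hypothetical classification).
    """
    segments, current = [], []
    for name, kind in steps:
        current.append(name)
        if kind in ("wait", "decision"):
            # The flow pauses here, so this is a candidate boundary.
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


# Hypothetical e-commerce flow.
SHOP_FLOW = [
    ("browse product catalogue", "action"),
    ("add item to cart", "action"),
    ("customer decides to check out", "decision"),
    ("enter shipping address", "action"),
    ("wait for payment confirmation", "wait"),
    ("ship order", "action"),
]
```

Calling `split_at_handoffs(SHOP_FLOW)` yields three segments: browsing up to the decision to check out, the checkout up to the payment wait, and shipping. Each segment is only a candidate for a separate service; the other criteria in this list still apply.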
You can only respond to requests as fast as your nearest persistence store. If you need to serve consumers at multiple locations around the globe, it will be necessary to bring data and processing logic as close as possible to where they are needed.
More often than not, data consumption is constrained by corporate policies or legal requirements. Making this data available to other parts of the system can be challenging and may require separate persistence, distribution and access logic.
The nature and lifecycle of data often dictate how the architecture for handling this data needs to operate. Frequency, dimensions, volume, peaks and size of data must be considered, and it can be a good idea to move the processing of data with differing characteristics into separate services.
A website’s sitemap can be very helpful to start thinking about service/self-contained-system boundaries. Which features are likely to stay and live as long as the application, which features may come and go over time? Separate stable parts from volatile parts of the system so you don’t have to touch your stable services due to processes with shorter life-cycles (see “Volatility vs Stability”, “Process Flows” and “Process Handoffs”).
It can also be very helpful to group processes in your site-menu based on your service decomposition. This may require negotiations with the business team, but pays off when you need to introduce new or retire obsolete services and thus need to add or remove their entries from the website menu.
Organisational Boundaries
Developing a service across organisational silos and budgets is almost guaranteed to be orders of magnitude slower than within a single chain of command. This is why almost every definition of “service” contains the criterion “owned by one team” and demands independent delivery pipelines. This is not always easy to achieve, in particular in highly political organisations where “Data Ownership” can be a very strong force. However, due to the high impact of cross-silo communication on delivery performance, I can’t emphasise enough the importance of tackling this problem as soon as it surfaces. There are two fundamental ways we can deal with it:
- on the organisational level, we can try to re-organise teams and/or break down silos
- on the architectural level, we can apply one of the other criteria (e.g. “Process Handoffs”) to break up the cross-silo dependency
Teams & Skills
The teams and skills at your disposal can be a very strong driver. This can be traced straight back to Conway’s law (after all, any article about architecture would be incomplete without mentioning it). Developing resilient, decentralized systems with teams that are only used to RPC-style synchronous communication can be very challenging and certainly requires extra care.
While decentralization is king when it comes to developing resilient systems, you probably don’t want every team to develop their own login page, menu bar or API authorization layer. Depending on the size of your system it can make sense to treat these as separate services with their own lifecycle, owned by a dedicated team. However, take extra care to avoid single points of failure and cross-silo dependencies. You don’t want your service teams to submit paper forms to the API-gateway team for adding service routes. Thus, to make this work, such cross-cutting services must put a lot of emphasis on providing self-service capabilities to other service teams.
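One possible shape of such a self-service capability, sketched under heavy assumptions: each team declares its own routes (for instance in its own repository), and the gateway merges the declarations automatically, rejecting only genuine conflicts. The function, the data layout and the URLs are hypothetical and not taken from any particular gateway product.

```python
def merge_route_declarations(declarations):
    """Merge per-team route declarations into a single routing table.

    declarations maps a team name to {path_prefix: upstream_url}.
    Raises ValueError if two teams claim the same path prefix.
    """
    table = {}
    owners = {}
    # Sorting makes the result and the error messages deterministic.
    for team, routes in sorted(declarations.items()):
        for prefix, upstream in routes.items():
            if prefix in owners:
                raise ValueError(
                    f"{prefix} is claimed by both {owners[prefix]} and {team}"
                )
            owners[prefix] = team
            table[prefix] = upstream
    return table
```

With something like this in place, adding a route is a change a service team makes and ships on its own; the gateway team only owns the merge rules, not every individual entry.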
For certain domains it may also make a lot of sense to separate out the parts that can be handled by third-party products. This applies particularly to commodity services. An off-the-shelf (OTS) shop system, either an installed product or SaaS, can take over product purchases while customer support processes are implemented as added-value services. The Wardley Mapping technique can help with such decisions. However, be aware that customising existing products can be much more time-consuming and harder to maintain than implementing such commodity processes yourself.
If a system gets too large, the risk grows rapidly that its further evolution will suffer the same fate as most monoliths. Always keep an eye on the size of your services.
Volatility vs. Stability
Different parts of systems evolve at different velocities (sadly some systems tend to evolve only at fast speed, meaning they are lacking direction). Typically, older parts will have reached some degree of stability which newer ones are still lacking. You may want to quickly try (think: “lean startup experiment”) new features, and retire old ones if your users don’t accept them. At the same time you want to keep other parts of your system stable, for instance your checkout process. It can be a good idea to separate these so that you do not hold back rapid service delivery by constant regression testing of stable parts.
Communication Paths
If you already have stable asynchronous communication between different parts of your system – great! These parts are ideal candidates to be developed by different teams. Synchronous communication between two services, however, couples them very closely to each other, which makes it very difficult to disentangle them. The risk here is to end up with a “Distributed Monolith” (a term coined by my colleague Tareq Abedrabbo in his talk about “The 7 Deadly Sins of Microservices”), where you have to deploy all services at once because otherwise everything breaks down.
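The difference in coupling can be shown with a small in-memory sketch. It is deliberately simplified: a `deque` stands in for a real message broker, and the service and function names are invented for the example.

```python
from collections import deque


class InventoryService:
    """Hypothetical downstream service that may be unavailable."""

    def __init__(self):
        self.available = True
        self.reserved = []

    def reserve(self, order_id):
        if not self.available:
            raise ConnectionError("inventory service is down")
        self.reserved.append(order_id)


def place_order_sync(inventory, order_id):
    # Synchronous call: if inventory is down, placing the order fails too.
    inventory.reserve(order_id)


def place_order_async(outbox, order_id):
    # Asynchronous hand-off: the order is accepted whatever state inventory is in.
    outbox.append(order_id)


def deliver_pending(outbox, inventory):
    # Runs whenever inventory is reachable; an order is only removed
    # from the outbox after it has been reserved successfully.
    while outbox:
        inventory.reserve(outbox[0])
        outbox.popleft()
```

In the synchronous variant an outage of the inventory service immediately becomes an outage of order placement. In the asynchronous variant orders simply queue up and are delivered once the inventory service is back, which is what allows the two sides to be deployed and operated independently.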
Security Constraints (e.g. User Roles, Levels of Access Control)
In order to allow for the easier application of security perimeters, it can be advisable to split services with different security constraints. Administrative processes are often better physically moved to a separate service to isolate them. This kind of isolation is a lot less error-prone and fragile than implementing complicated access-control mechanisms to protect processes within the same service.
Physically separating operations based on their expected (or experienced) load is a powerful way to isolate and protect them from each other. It saves the development teams a lot of headaches caused by trying to throttle operations that run within the same process.
The argument here is very simple: a service with lots of external dependencies is incredibly hard to test and quite possibly a design smell. Very likely it is the result of packing too many processes into a single service. This often happens over time as systems grow. If the cost of creating a new service is too high, there is a tendency to stuff more and more logic into an existing one. We must always keep an eye on synchronous dependencies in particular to avoid creating a “Distributed Monolith” (see also “Communication Paths” above).
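Keeping an eye on synchronous dependencies can start very simply. The sketch below assumes you can export your service calls as (caller, callee, style) tuples, for example from tracing data or a hand-maintained list; the data format and the threshold of three are arbitrary assumptions, not established limits.

```python
def services_with_many_sync_dependencies(calls, threshold=3):
    """Return the names of services that call at least `threshold`
    distinct other services synchronously.

    calls is an iterable of (caller, callee, style) tuples,
    where style is "sync" or "async" (hypothetical format).
    """
    sync_deps = {}
    for caller, callee, style in calls:
        if style == "sync":
            # A set ignores repeated calls to the same callee.
            sync_deps.setdefault(caller, set()).add(callee)
    return sorted(
        name for name, deps in sync_deps.items() if len(deps) >= threshold
    )
```

A service that shows up in this list is not automatically wrong, but it is a good place to start asking whether it bundles too many processes.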
There are criteria I observed in the wild that turned out to be less successful, or perhaps better stated as: “led straight into disaster”.
For me this is one of the worst places to start. Entity-driven design is probably a relic from the times when we were taught to “look for nouns in your requirements”. I have seen my fair share of “customer service” and “product service” incarnations over the years. The problem is that a common reason for change is a change to a process flow (see “Process Flows”), and a flow usually touches several entities. With entity-based services, a new requirement therefore tends to affect several systems at once and requires a lot of coordination and synchronisation among multiple teams. This in turn leads not only to a significant increase in communication, but also to increased complexity in delivery, with synchronized multi-team deployments, feature toggles, feature branches and various other ways of treating the symptoms. As mentioned before, Michael Nygard’s “Services by Lifecycle” is a good read on this. In short: don’t!
Frontend and Backend
The argument against separating frontends from backends goes straight back to our argument about process flows: requirement changes tend to affect flows. Typically, changing a flow on the frontend will trigger supporting changes in the corresponding backend. Sam Newman’s description of the “Backends For Frontends” pattern nicely elaborates on the various flavours of this pattern. As a way to mitigate the close coupling of frontends to backends I recommend evaluating “Resource-Oriented Client Architecture (ROCA)” and REST (the real one!) approaches.
Conclusion: The times, they are a’ changin’
Over time, new requirements and learnings will likely shed a different light on your systems and make it desirable to move or re-group service functions so that your system can better digest new requirements. Software is never done and keeps evolving to adapt to an ever-changing world, hence our architectures don’t have an end state unless they go out of service. Grady Booch famously stated “Architecture represents the significant design decisions that shape a system, where ‘significant’ is measured by ‘cost of change’”. The past has taught us that more analysis does not improve those significant decisions: time ensures that we will always be chasing a moving target.
A much better approach is to lower the cost of architectural changes, which allows us to push the “last responsible moment” further down the road and make decisions about our service boundaries as late as possible, when we have learned more about the system. This leads us straight to ideas about aligning organisations around product delivery (aka “DevOps”) and programmable infrastructure (aka “The Cloud”). After all, it is hard to adapt your architecture if it takes three departments and six months to provision a new web server. But that’s a story for another day.
Do you have any additional criteria you have found useful? I would love to hear about them in the comments section!