An excerpt from Data Mesh in Action by Jacek Majchrzak, Sven Balnojan, and Marian Siwiak
This excerpt explores the four principles of the Data Mesh as a concept.
Read this article if you are interested in learning what a Data Mesh is and how it is used.
Zhamak Dehghani first described the current incarnation of the Data Mesh concept as a set of four principles: domain ownership, domain data as a product, federated computational governance, and self-serve data platform. From our perspective, however, the key to the success of a Data Mesh implementation lies in understanding that it is a socio-technical architecture, not a technical solution. In this article, we will present Dehghani’s four principles and explain the socio-technical aspects that are key to understanding the Data Mesh architecture.
Domain-oriented decentralized data ownership and architecture
The first principle is that data and its relevant components should be owned, maintained, and developed by the people closest to it, i.e. the people inside the data’s domain. This calls for the application of the concepts of domains and bounded contexts to the data world. This also means the application of decentralization to ownership and architecture.
The idea is that domain-internal engineers are responsible for developing data interfaces, allowing other users (e.g. data scientists, self-service BI users, data engineers, and system developers) to use the domain data. Data engineers of a data product are expected to be experts within the remit of a single domain, which minimizes communication problems and misinterpretations of data.
Data should have clear ownership, and that ownership should not sit at the centralized level of the organization; we should put the responsibility into the hands of the people closest to the data. That might be the domain team in the case of a source-oriented dataset, or it might be a data engineering team, an analytics engineering team, or a data science team for new datasets created out of other datasets. It could also be the organizational unit using the data if our dataset is a very consumer-oriented one.
The domain is an area or part of our business. It is a way of slicing and decomposing the company. Quite often, our organizational structure resembles business domains.
In any given business, you could find domains like Content Distribution, Content Production & Market, and Brand Development.
Figure 1. Simplified business domain sketch from a fictional business called Messflix LLC.
The Domain Ownership principle says that each of the teams or units that owns those domains should also own data that has been created within them. So the team developing software to support content production is also responsible for Content Production Data. But what does it mean to be responsible for data?
Being responsible for data means hosting and serving datasets to other parts of the organization, i.e. to other domains. Ownership spans more than the data itself: the team should also be responsible for the pipelines and software, and for the cleansing, deduplicating, and enriching of the data.
As with agile principles and agile teams, ownership on its own would not make much sense without autonomy. This is why teams should be able to release and deploy operational and analytical data systems autonomously.
Every business consists of many domains. Each of those domains can usually be further split into subdomains. By applying this principle, we will end up with a mesh of interconnected domain data nodes. Nodes of the mesh can and even should be connected. Data can be moved, duplicated, and change shape between the domains when needed. For example, data will usually move from source-oriented data domains (the output of the operational systems) to more consumer-oriented data domains, where raw data will typically be aggregated and transformed into a more consumer-friendly shape and format.
In such a mesh of interconnected domains and subdomains, only the people close to the data know it well enough to actually work with it. Take an example from figure 1: Does the word “content” mean the same thing in all three domains? Hopefully not, because in the “Produce content” domain, we will have both “draft versions and ideas” of content being created, as well as content that will then become truly productionized content. However, in the “Distribute content” domain, this will mean productionized final content.
Just imagine a developer from the “Distribute content” domain who is supposed to compile a list of “content pieces” from the “Produce content” domain. They would likely produce a list of produced content pieces. However, that would miss the point: the requirement likely asks for a list of all content pieces, including ideas, things that are still in production, and canceled pieces, together with the status of each piece.
People outside this domain will not even know that these are important pieces of data, which makes the people inside the domain the only ones truly able to work with this kind of data.
Source Data Domains usually serve data and information that represent facts about the business. Data should be exposed to other domains for operational purposes (e.g. via a REST API) as well as analytical purposes. A Source Data Domain should expose domain events and historical snapshots aggregated over time. For the latter, we should make sure that the underlying storage is optimized for Big Data analytical consumption (like Data Lakes). In the previous example, the “Produce content” domain becomes a source data domain when it exposes lists of produced content pieces. This is an original piece of data created by the business process of creating content.
Consumer Data Domains are aligned closely with consumption. An excellent example of such a domain could be Management Reporting and Predictions. In the case of Messflix LLC, it could be the Content Recommendations domain which might be a subdomain of Content Distribution. If now the marketing team takes the list of produced content pieces and enriches it with relevant marketing materials, the tweets it uses to promote it, etc., then the new list slips into a consumer data domain.
Parts of the data can be duplicated between Data Domains. Still, because the data serves a different purpose, it will also be modeled differently; this means that domain boundaries will usually also be data model boundaries. Because we want to give teams as much autonomy as possible, we are not trying to create a single canonical model for the whole organization. We are giving them the freedom to model the data in the way they need it. Besides source- and consumer-aligned domains, we could also encounter core data domains, which are used across the organization and usually represent key entities or objects.
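The idea that domain boundaries are also data model boundaries can be sketched in a few lines of Python. This is a hypothetical illustration, not code from the book: the record fields, the `to_marketing_view` function, and the tweet text are all invented for the example. The source-aligned “Produce content” model keeps production status as a fact; the consumer-aligned marketing view duplicates part of the data but reshapes it for its own purpose.

```python
# Hypothetical sketch: the same "content piece" data modeled differently on
# each side of a domain boundary.
source_records = [  # source-aligned model: facts from the production process
    {"id": 1, "title": "Pilot", "status": "produced"},
    {"id": 2, "title": "Sequel", "status": "in_production"},
    {"id": 3, "title": "Spinoff", "status": "cancelled"},
]

def to_marketing_view(records):
    """Consumer-aligned model: only promotable titles, enriched for marketing."""
    return [
        {"title": r["title"], "tweet": f"Watch {r['title']} now!"}
        for r in records
        if r["status"] == "produced"  # ideas and cancelled pieces are irrelevant here
    ]

marketing_view = to_marketing_view(source_records)
```

Note that neither model is “the” canonical one; each domain owns the shape that serves its purpose.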
When we share the responsibility for data across the organization, we gain tremendous scalability and maintainability. For example, when we want to add new datasets to the Mesh, we will be adding a new autonomous node. At the same time, teams that own datasets will be in a very comfortable situation. They will only own data that they truly understand.
Data as a product
The second principle is that data must be viewed as a product. This calls for an introduction of product thinking, integrated into data management.
Usually, when we talk about data, the first thing that comes to our mind is either a file or a table in a database. We often see a spreadsheet or a series of rows in a file with named columns when we think of data. Taking this perspective, it is easy to reduce data to technical details. But, instead, the more important question is: what gives value to the data from the organization’s point of view? Or reversing this question: what is stopping our data from being turned into valuable decisions?
Without a proper set of descriptions, even the best-prepared set of data will not be found, and thus no value can be extracted from it. Missing information on the recency or completeness of data might render it completely useless in cases that require these attributes. That’s why it’s worthwhile to think of data as a product, i.e. a larger whole that ultimately makes up the experience of the users who use it.
Data offered by a data team shall follow typical product features like:
- Viable quality – ensured by specialized domain experts.
- Anticipation of user needs – the team offering the data to the outside world should understand the enterprise’s business environment, e.g. present the data in a format easily digested by existing data pipelines to ensure the availability and ease of use of the provided data.
- Secured availability – ensuring the availability of the product to fulfill user needs whenever they arise.
- Focus on user goals – the focus on one domain should not mean a lack of communication between the data team and other users; on the contrary, the search for synergies and a shared toolset should create an opportunity to understand each other’s needs better.
- Findable – any data product should be discoverable by a simple means, something a random table in a database lacks.
- Interoperable – different data products should be combinable in a way that increases their value.
When can we call data a product?
In everyday life, we deal with many products. A product might be defined as, in general, “the result of a conscious action.” If we take a pair of jeans as an example, we are convinced that it meets certain predetermined conditions. For a product to be called a pair of jeans, it must have a suitable form and shape and be made of a certain material. In addition, it should have its unique name (especially if the manufacturer provides many products of a given type) because we buy a specific model and not a pair of jeans in general.
Treating data as a product will mean that someone has consciously designed the product and then created and released it, is responsible for it, and is mainly responsible for its quality. In the context of Data Mesh, this will be the responsibility of the Data Product Owner (who designs the Data Product) and the Data Product Developers who implement it. Just like a product on a shelf in a store, a Data Product has its unique name and established characteristics, including:
- Level of quality
- Level of availability
- Security rules
- Frequency of updates
- Specific content
When we think about products, it’s not enough to just expose data. We also need to make sure that we maximize its usability for the end-users. In this context, the role of the Data Product Owner is critical, as he or she is responsible for the final User Experience of the data.
Referring to the Messflix example mentioned earlier, we can imagine a few exemplary Data Products in the Produce Content domain:
- Cost statement
- Movie popularity
- Movie trends
Treating data as a product brings us straight to the Product Thinking approach: start with the problems your customers want solved and let them drive the Data Product design process. Then, as a Data Product Owner, you should deliver a Data Product addressing these problems based on predefined success criteria.
Data as a product also implies a degree of standardization that allows a single element to be incorporated into a larger Data Mesh ecosystem. To call a given set of data a Data Product, it should be:
- Self-described and discoverable – data should be described, and this description should be an integral part of the Data Product. A Data Product should be able to register itself in the Data Mesh ecosystem and, as a result, should be discoverable.
- Addressable – it should have a unique address (e.g. in the form of URI address) so that it can be referred to and relationships between Data Products can be created.
- Interoperable – a Data Product should be made according to predefined standards concerning the form of data sharing: standardized formats, vocabularies, terminology, and identifiers.
- Secure and trustworthy – a Data Product should meet the established and declared SLA and enable controlled access, ensuring data security from two perspectives: intellectual property and GDPR-type regulations.
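To make these conditions concrete, here is a minimal sketch of what a self-described, addressable, discoverable Data Product could look like. Everything in it is assumed for illustration: the `DataProductDescriptor` fields, the toy `Catalog` class, the `dataproduct://` URI scheme, and the Messflix product names are not prescribed by the Data Mesh concept itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProductDescriptor:
    """Self-description that travels as an integral part of the Data Product."""
    name: str         # unique product name, e.g. "movie-popularity"
    domain: str       # owning domain, e.g. "produce-content"
    address: str      # unique URI so relationships between products can be created
    description: str  # human-readable description that makes the product findable
    schema: dict      # column name -> type, published with the product
    sla: str          # declared service level the product promises to meet

class Catalog:
    """Toy Data Mesh catalog in which products register themselves."""
    def __init__(self):
        self._products = {}

    def register(self, product: DataProductDescriptor) -> None:
        # A product registers itself and becomes discoverable by address.
        self._products[product.address] = product

    def discover(self, keyword: str):
        # Discovery works on the published description, not on table internals.
        return [p for p in self._products.values()
                if keyword.lower() in p.description.lower()]

popularity = DataProductDescriptor(
    name="movie-popularity",
    domain="produce-content",
    address="dataproduct://messflix/produce-content/movie-popularity",
    description="Weekly popularity scores per movie title",
    schema={"title": "string", "week": "date", "score": "float"},
    sla="refreshed weekly",
)

catalog = Catalog()
catalog.register(popularity)
```

A random table in a database has none of this: no address to refer to, no description to search, no declared SLA, which is exactly what separates a Data Product from a mere dataset.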
A Data Product as an autonomous component
While fulfilling the previously established conditions, a Data Product should at the same time constitute an autonomous component so that it can be independently developed by the team responsible for a given Data Product. From the point of view of a technical solution, we can see a Data Product as an analog of a microservice in the data world. In addition to making data available, the Data Product also embeds code related to data transformation, cleaning, or harmonization. It also exposes interfaces for automatic integration with the Data Mesh environment and platform by providing, among other things:
- Input and output ports – the forms and protocols used to ingest data from sources and expose it to consumers.
- Operational metrics – number of users, throughput, amount of data fetched, etc.
- Data quality reports – quantity of incomplete data, format incompatibilities, statistical measures of outliers, etc.
- Metadata – specification and description of the data schema, domain description, business ownership, etc.
- Configuration endpoint – means to configure the Data Product in runtime, e.g. setting the security rules.
These are a few examples of what can constitute a Data Product as a technical component.
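As a rough analogy for the “microservice in the data world” described above, the following sketch bundles transformation code, an output port, operational metrics, a quality report, and a configuration endpoint into one autonomous component. The class name, the cleaning rule, and the `allow_anonymous_reads` setting are all invented for this example.

```python
# Hypothetical sketch of a Data Product as an autonomous component.
class MoviePopularityProduct:
    def __init__(self):
        self._rows = []
        self._dropped = 0
        self._reads = 0
        self._config = {"allow_anonymous_reads": False}  # an assumed security rule

    # Input port: ingest raw rows from an (assumed) source system.
    # Transformation is embedded in the product: incomplete rows are dropped.
    def ingest(self, raw_rows):
        self._rows = [r for r in raw_rows
                      if r.get("title") and r.get("score") is not None]
        self._dropped = len(raw_rows) - len(self._rows)

    # Output port: expose cleaned data to consumers.
    def read(self):
        self._reads += 1
        return list(self._rows)

    # Operational metrics: usage counters the platform can scrape.
    def metrics(self):
        return {"reads": self._reads, "rows_served": len(self._rows)}

    # Data quality report: quantity of incomplete data, etc.
    def quality_report(self):
        return {"incomplete_rows_dropped": self._dropped}

    # Configuration endpoint: adjust the product at runtime, e.g. security rules.
    def configure(self, **settings):
        self._config.update(settings)

product = MoviePopularityProduct()
product.ingest([{"title": "Pilot", "score": 0.9},
                {"title": None, "score": 0.5}])
```

In a real implementation each of these methods would be a network-facing interface rather than a Python call, but the shape of the component stays the same.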
Federated computational governance
The third principle is to federate and automate the data governance across all of the participants of the Data Mesh. It aims to provide a unified framework and interoperability to the ecosystem of largely independent data products. Its purpose is to make the autonomous data products work in an actual data mesh, not just as stand-alone products.
Federated computational governance requires two inseparable elements: a governing body and a means of enforcement.
Overarching rules and regulations shall be agreed upon by a body composed of Data Product Owners, the self-serve data infrastructure platform team, security specialists, and CDO/CIO/CTO representatives. The body would also serve as a place for discussion regarding, e.g., the development of new data products vs. the addition of new datasets to existing data products, methods of ingesting new external data sources, priorities for central platform development, etc.
Effective data governance is crucial, with data security being one of the main concerns of CDOs/CIOs in companies across all sectors and sizes. Also, most large enterprises need to introduce controls on data security and governance enforced by governmental or business regulations, e.g. GDPR, HIPAA, PCI DSS, or SOX.
Federalization of Data Governance
At first glance, a Data Mesh adds a new layer of complexity to the already vast scope of Data Governance, as it now also needs to address a shift of responsibilities to Data Products. Each Data Product, in turn, needs to be equipped with processes allowing safe and efficient ways of handling its infrastructure, code, and data (and metadata). Moreover, data governance processes need to balance company-wide cohesion of data solutions, usually achieved through standardization, against the autonomy-driven flexibility and creativity offered to Data Product teams.
It is imperative to understand that there is no silver bullet solution to balance central governance and local autonomy! This will always depend on the specifics of your organization. For example, what are the needs and the maturity of your Data Product Owners? What is the data-related risk appetite of the organization? What is the level of expertise of your central data governance team? How sensitive is the data you’re working with? You need to answer these and a lot more questions before you’ll be able to set the remits of central and local teams.
Another set of policies is required to ensure the interoperability of Data Products and the ability of data consumers to join them together readily. Finally, the procedures need to provide compatibility between different Data Products without explicitly enforcing an overarching data model, as such an approach has proved to create a bottleneck in data operations.
A Data Mesh tries to answer that need with the federalization of Data Governance. Federalization in this context means a governing structure operating at two distinct levels in parity – central and local. The central level of governance, executed by a data governance council, decides on a minimal set of global rules required to ensure the safe and secure discoverability and interoperability of data products.
Data Product teams, led by Data Product Owners, are responsible for developing their products with a high degree of autonomy, deciding on all technical and procedural issues within their remits (and within the boundaries of the enterprise technology stack).
There is no silver bullet solution as to the precise division of responsibilities, the structure of the data governance council, and exact rules governing the data world. Instead, each business will have its own set of global rules, leaving domain teams with different levels of control over their Data Products.
Computational elements of Data Governance
The name Federated Computational Governance was coined by Zhamak Dehghani in her de facto “Data Mesh manifesto” blog article. However, as Computational Governance is not defined anywhere else, we will assume that this term relates to elements of Data Governance automation, enabled by the existence of the Central Platform that serves as a medium connecting different Data Products.
Data Governance elements that can be automated and embedded into the Data Platform include, but are not limited to:
- Catalog, reference, and master data
- Lineage & provenance
- Validation & verification
- Storage and operations
- Security & privacy
Once again, there is no silver bullet solution for deciding which of these should be automated. It will require your teams’ hard work to identify which data governance tasks create bottlenecks in Data Mesh development in your organization and automate them. It will pay off, however. First, it will offer Data Product Owners the solution allowing them a frictionless connection of their Data Products. Second, it will enable users to make efficient use of the exposed data.
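One way to picture the “computational” part is as global rules, agreed by the governance council, encoded as automated checks that the central platform runs against every product’s metadata at registration time. The specific rules below (owner present, lineage recorded, PII flagged) are invented for illustration; your council would define its own.

```python
# Hypothetical sketch of computational governance: council-agreed global
# rules expressed as (name, predicate) pairs over product metadata.
GLOBAL_RULES = [
    ("has_owner",   lambda meta: bool(meta.get("owner"))),
    ("has_lineage", lambda meta: meta.get("upstream_sources") is not None),
    ("pii_flagged", lambda meta: "contains_pii" in meta),  # GDPR-style rule
]

def governance_check(meta):
    """Return the names of the global rules a product's metadata violates."""
    return [name for name, rule in GLOBAL_RULES if not rule(meta)]

compliant = {
    "owner": "produce-content-team",
    "upstream_sources": [],
    "contains_pii": False,
}
missing_owner = {"upstream_sources": [], "contains_pii": False}
```

Because the checks run automatically at registration, compliance stops being a review meeting and becomes part of the platform, which is exactly the bottleneck removal the principle is after.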
You will learn more about automating elements of data governance when we discuss the details of the self-serve data infrastructure-as-a-platform.
Self-serve data infrastructure as a platform
The fourth principle is to extract the duplicated efforts of the Data Mesh into a platform. It calls for the application of “platform thinking” to the data context. Platform thinking means that efforts that are duplicated throughout the company can, to a large extent, be packaged into a “platform” and thus done only once, then offered as a “service” to others.
Just as anyone can rent cloud resources from one of the major cloud providers and customize them to fit their needs, removing the duplicated effort of maintaining their own cloud, the same idea can be scaled down to efforts inside a company.
Building and maintaining data products is resource-intensive and requires a set of very specialized skills (ranging from computational environment management to security). Multiplying the required effort by the number of Data Products would endanger the feasibility of the whole Data Mesh idea. The idea behind the Central Platform is to centralize repeatable and generalizable actions to the degree necessary (again, depending on the context of the company!) and to offer a set of tools abstracting away specialized skills. This reduces entry and access barriers for both Data Product developers and Data Product consumers.
You may start with the requirements of your first Data Product Owners and iteratively build the setup up to meet your organization’s needs.
The infrastructural support can be offered on on-premise computing power or in the cloud, depending on the enterprise policies. Its elements can be available in IaaS, PaaS, SaaS, or hybrid models. A Data Mesh platform could support:
- Governability – all data computation related to Data Governance processes needs to be incorporated and automatically enforced on every Data Product connected to the Data Infrastructure.
- Security – the infrastructure solution shall ensure that all Data Products offer freedom of operation for users whose access rights allow them to meet their information requirements, while ensuring safety from unauthorized access. To do that, a set of ready-to-use processes, tools, and procedures shall be accessible to Data Product creators and users.
- Flexibility, adaptability, and elasticity – the infrastructure needs to support multiple types of business domain data. This means enabling different data storage solutions, ETL and query operations, deployments, data processing engines, and pipelines. All of that needs to be scalable to serve business needs as they arise.
- Resilience – the smooth operation of data-driven businesses requires high availability of data. Therefore, redundancies and disaster recovery protocols need to be ensured at the design level of each infrastructural element.
- Process automation – from metadata injection at Data Product registration to access control, data flow through the central infrastructure needs to be fully automated, possibly using machine learning and artificial intelligence to ensure efficient data processing, quality, and monitoring.
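The last two points above can be sketched together: a platform that automatically injects standard metadata at registration and enforces access control on every read, so that product teams do not rebuild these mechanics for each product. The `SelfServePlatform` class, its method names, and the grant model are all assumptions made for this example.

```python
import datetime

class SelfServePlatform:
    """Toy self-serve platform: automated metadata injection plus access control."""
    def __init__(self):
        self._registry = {}
        self._grants = set()  # (user, product) pairs allowed to read

    def register(self, name, data):
        # Process automation: metadata is injected at registration,
        # not hand-written by every product team.
        self._registry[name] = {
            "data": data,
            "metadata": {
                "registered_at": datetime.date.today().isoformat(),
                "row_count": len(data),
            },
        }

    def grant(self, user, name):
        self._grants.add((user, name))

    def read(self, user, name):
        # Security: access control is enforced once, by the platform,
        # for every product connected to it.
        if (user, name) not in self._grants:
            raise PermissionError(f"{user} may not read {name}")
        return self._registry[name]["data"]

platform = SelfServePlatform()
platform.register("movie-trends", [{"title": "Pilot", "trend": "up"}])
platform.grant("analyst", "movie-trends")
```

The design choice to centralize these concerns is what keeps the per-product extraneous load low: a product team writes its domain logic and inherits registration, metadata, and security from the platform.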
Now that you have a good grasp of the four key Data Mesh principles, let us talk about Data Mesh as a socio-technical architecture.
Data Mesh is not a technical solution for data problems. Data Mesh solves these problems using socio-technical architecture or socio-technical system design. We believe this is the biggest strength of Data Mesh as a paradigm, and we think you should regard the socio-technical architecture as a foundation of data mesh implementation. You need to apply socio-technical architecture consciously to make it successful.
But before doing this, we need a bit of background. For that, we look into Conway’s Law to understand why we won’t be able to change technology on its own, then into the Team Topologies framework, which is in essence about socio-technical architecture. Finally, we look into Cognitive Load as one of the main ideas leveraged by the Team Topologies framework.
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure.” (Conway’s Law)
Melvin Conway made this observation in 1967, just ten years after FORTRAN, the first widely used high-level language, was released. It didn’t take us long to see this strong, almost gravitational force in practice. This “force” will sooner or later transform our architecture so that it looks like our organizational structure. As architects, we need to be fully aware of the implied consequences.
Consider the martial art called Aikido. “Aiki” refers to the principle or tactic of blending with an attacker’s movements to control their actions with minimal effort. The masters of Aikido teach that you should not meet force with force; instead, you should take advantage of the force exerted by your opponent. In the same way, we should be wary of Conway’s Law and take advantage of it.
Conway’s Law simply tells us to take care of both the organizational side of things and the technical. Just changing technical things isn’t going to make an impact. The organization will warp or misuse the technical side of things.
In the socio-technical architecture approach, as the name suggests, we are trying to co-design both the social and the technical architecture elements simultaneously. This way, we are thinking upfront about all constraints and concerns.
This section is about the one main architecture method that is used today and within the Data Mesh community to do socio-technical architecture well.
“Team Topologies is a clear, easy-to-follow approach to modern software delivery with an emphasis on optimizing team interactions for flow.” (source: teamtopologies.com)
Matthew Skelton and Manuel Pais created Team Topologies, and it is starting to get traction in the community because of its simplicity and straightforward application. Team Topologies is an approach that helps you avoid falling into the trap of Conway’s Law. The framework is designed to optimize the teams inside a company for optimal flow. It relies heavily on the idea of cognitive load to properly split workloads across this flow, and it defines just a handful of team types and interaction modes designed to minimize cognitive load.
Since Team Topologies focuses on the optimal flow of the company, it will help us optimize the data-to-value flow inside the company.
Cognitive Load is a term from cognitive load theory; it means the amount of mental resources (working memory) used by a person. The theory was first applied in the teaching field, where it influenced how we write instructions so readers can easily digest them. It has since been used in more and more areas, including IT.
There are three types of Cognitive Load:
- Intrinsic – the skills and knowledge you need to build your product (like programming languages, frameworks, and patterns).
- Extraneous – activities that are not part of product development but are needed to release the product (like infrastructure, deployment, and monitoring).
- Germane – knowledge of the business and the problem space (like the types of movies, if you are a Messflix LLC developer).
As socio-technical architects, we should shape our teams so that the total cognitive load is limited. It is very similar to road traffic: you will not maximize the flow of cars by filling every square meter of road with cars; traffic moves fastest when there is enough space between vehicles for them to drive near the speed limit. It is the same with teams.
So, what can we do to make Data Mesh teams performant? Firstly, we should keep our technology stack simple to limit intrinsic load. Secondly, we should embrace platform thinking, so as not to force teams to reinvent the wheel every time with yet another monitoring and logging solution, which limits extraneous load. Doing both will allow the team to focus entirely on what is essential: the germane load, a.k.a. the business domain. But germane load should also be limited, and we do that by decomposing solutions into smaller, domain-oriented data products.
All of the principles are influenced by socio-technical ways of thinking.
Domain-oriented decentralized data ownership and architecture reduce the germane load by decentralizing responsibility for domains and corresponding data.
Data as a product reduces the collaboration between teams needed to access data by making data findable, accessible, interoperable, and reusable; it limits extraneous load by exposing data in formats and through ports expected and known by consuming teams.
Self-serve data infrastructure as a platform reduces extraneous load by providing a self-service platform.
Federated computational governance makes collaboration between teams easier by enforcing common standards and patterns; it also eases transfers of team members by reducing the possible technology stack within the company.
As you can see, at the heart of every principle we find the socio-technical way of thinking. Due to the socio-technical nature of this paradigm shift, it’s important to be clear about the benefits this shift can bring.
While Data Mesh is a socio-technical solution to many problems, there are some major challenges associated with it.
That’s all for now. Thanks for reading.