An excerpt from Data Mesh in Action by Jacek Majchrzak, Sven Balnojan, and Marian Siwiak
This excerpt covers
● What is a “Data Mesh”? Our definition of a Data Mesh
● What are the key concepts of the Data Mesh paradigm?
● What are the advantages of the Data Mesh?
Take 25% off Data Mesh in Action by entering fccsiwiak into the discount code box at checkout at manning.com.
The Data Mesh is a decentralization paradigm. It decentralizes the ownership of data, its transformation into information as well as its serving. It aims to increase the value extraction from data by removing bottlenecks in the data value stream.
The Data Mesh paradigm is disrupting the data space. Large and small companies are racing to showcase “their Data Mesh-like journey” all over the internet. It’s becoming the new “thing” to try out for any company that wants to extract more value from its data. We regard the Data Mesh paradigm as a socio-technical architecture with an emphasis on the socio. The main focus is on people, processes, and organization, not technology. Data Meshes can, but don’t have to, be implemented using the same technologies most current data systems run on.
But because the topic is still under debate, with best practices and standards only slowly emerging, we saw the need for an in-depth book that covers both the key principles that make Data Meshes work and the examples and variations needed to adapt them to any company.
To start off, we will look at the core ideas of the Data Mesh as well as the benefits and the challenges associated with it.
Data Mesh 101
The Data Mesh paradigm is all about decentralizing responsibility.
For instance, the development team for the “Customer Registration Component” of a company also creates a dataset of “registered customers” for analytical purposes. They ensure it is in an easy-to-digest format by transforming the data, e.g. to a CSV file, and serve it the way the consumers would like, e.g. on a central file-sharing system.
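As a minimal sketch of what that could look like, the snippet below has the development team itself publish the “registered customers” dataset as a CSV file to a shared location. The field names, sample records, and the `shared-storage` target are all hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path

# Hypothetical records produced by the "Customer Registration Component".
registered_customers = [
    {"customer_id": 1, "country": "DE", "registered_at": "2023-01-15"},
    {"customer_id": 2, "country": "PL", "registered_at": "2023-02-03"},
]

def publish_registered_customers(records, target_dir):
    """Write the dataset as CSV to the shared location consumers read from."""
    target = Path(target_dir) / "registered_customers.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["customer_id", "country", "registered_at"]
        )
        writer.writeheader()
        writer.writerows(records)
    return target

path = publish_registered_customers(registered_customers, "shared-storage")
```

The point is not the mechanics of writing a CSV; it is that the team that owns the component also owns the analytical output, in a format its consumers asked for.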
But this deceptively simple definition has a lot of implications because, in most companies, data is handled as a “by-product”. It is usually turned into value only after being output as a by-product into some form of storage, pulled into a central technology by a centralized data team, and then picked up again by decentralized actors, whether that is an analyst in the marketing department, a recommendation system used in a marketing campaign, or a display inside the frontend.
Figure 1 depicts a common form of data architecture, both organizational and technical. It hopefully also shows its pitfalls.
Figure 1 Decentralized data emission and central transformation causes problems for the users due to unclear ownership and responsibility for data and its quality, among other things.
We can see here two levels of centralization:
- The centralized technology in the form of storage and the usual data engineering, data science machinery.
- The organizational centralization of the data team.
Since the development team considers the data a “by-product”, ownership is implicitly assigned to the data team. But such central teams usually cannot keep up with the business domain knowledge of multiple data source domains. The developer responsible for Customer Registrations only needs to know the language, the updates, and the associated business of that one component. The central data team, however, must maintain the same level of understanding for each domain, multiplied by the number of domains. Such overload makes it unlikely that the central team will understand even a single domain to the degree the responsible development team does. As a result, the data team cannot tell whether the data is correct, what it actually means, or what specific metrics might mean.
The Data Mesh paradigm shift calls for decentralization of the responsibility for data, that is, for treating data as an actual product. The situation depicted in figure 1 can turn into a Data Mesh if the development team provides the data product straight to the analysts through some standardized data port. It could be something as simple as a plain CSV file hosted in the appropriate cloud storage spot, easy to access for the analyst. Take a look at figure 2 to see this shift in action.
Figure 2 Decentralized data transformation makes data consumers happy by offering simple access to well-described data.
A platform team could help provide a simple technology as a service to be used by development teams to deploy such data products, including the data ports, quickly.
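One way to picture such a platform service is a shared catalog that development teams register their data ports with. The descriptor fields, the `s3://` location, and the `register_data_product` helper below are all hypothetical; real platforms would use a proper metadata service, but the shape of the interaction would be similar.

```python
# A minimal, hypothetical "data port" descriptor a platform team might
# accept when a development team registers a new data product.
registered_customers_port = {
    "name": "registered_customers",
    "owner_team": "customer-registration",
    "format": "csv",
    "location": "s3://analytics-shared/registered_customers.csv",
    "schema": {"customer_id": "int", "country": "string", "registered_at": "date"},
    "update_frequency": "daily",
}

def register_data_product(mesh_catalog, port):
    """Add a data port to the mesh catalog so consumers can find it."""
    if port["name"] in mesh_catalog:
        raise ValueError(f"data product {port['name']!r} already registered")
    mesh_catalog[port["name"]] = port
    return mesh_catalog

catalog = register_data_product({}, registered_customers_port)
```

Crucially, the platform team owns the registration machinery, while each development team owns the descriptor and the data behind it.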
Data producers focus on developing data products, which, together with data consumers, start to form connections and compose a network. We call such a network the mesh, where the individual nodes are data products and consumers.
Even in our small example, we observe a significant operational paradigm change. It encompasses both a shift in the ownership responsibility (from a central data team to the development teams) and the technical challenge of making the new setup work.
Introducing changes in the operational paradigm will send ripples through many areas of your business. To keep this from descending into chaos, we need guiding principles.
Before we get to those, though, you should understand our definition of the “Data Mesh” and its non-technical aspects.
Definition of a Data Mesh
Zhamak Dehghani made an incredible effort in curating the idea of the Data Mesh starting in 2019. She provided us with all its critical elements and introduced a structured approach to the previously discussed paradigm shift.
Since Zhamak first introduced the “Data Mesh” approach, many “Data Mesh”-inspired, business-derived, and theoretical examples have appeared. A lot of this content might not perfectly fit the initial description of the Data Mesh framework as presented in this article, and many businesses seem somewhat unsure about what exactly conforms to the definition of a Data Mesh and what doesn’t.
For this reason, we opt for solutions that are first and foremost practical. Therefore, the Data Mesh definition we coin below aims to be broad and functional, and to emphasize decentralized efforts to maximize the value derived from data:
DEFINITION Data Mesh
The Data Mesh is a decentralization paradigm. It decentralizes the ownership for data, the transformation of data into information, and data serving.
It aims to increase the value extraction from data by removing bottlenecks in the data value stream.
The Data Mesh paradigm is guided by four principles, helping to make data operations efficient at scale: domain ownership, domain data as a product, federated computational governance, and self-serve data platform. Data Mesh implementations may differ in scope and degree of utilization of these principles.
The goal of implementing a Data Mesh is to extract more value from the company’s data assets. That is also the reason we keep this definition so lightweight and inclusive with regard to the level at which each of the principles is followed. The following non-technical use case of a Data Mesh will hopefully explain what we mean by that.
Why Data Mesh?
We see three main reasons why the data world is in need of decentralization in the form of the Data Mesh:
- With the proliferation of data sources and data consumers, a central team in the middle creates an organizational bottleneck.
- With multiple data-emitting and data-consuming technologies, central monolithic data storage creates a technological bottleneck, and much information is lost along the way.
- Both data quality and data ownership are only implicitly assigned, which causes confusion and a lack of control in both cases.
Over the past thirty years, most data architectures were designed to integrate multiple data sources, i.e. central data teams merged data from all kinds of source systems and provided harmonized sets to users, who in turn tried to use it to drive business value.
Yet, for over a decade now, the problem of big-data hangovers has plagued companies of all sizes. Data environments struggle with the scalability of the solutions, the completeness of the data, accessibility issues, and the like. This might be familiar to some of you. Some things simply don’t seem to work out. Dozens of reports & dashboards seem to be of no use compared to the costs of creating & maintaining them. A bunch of data science projects stays stuck in the “prototype” phase, and those running data-intensive applications probably face a bunch of data-related problems. Compared to the effort it takes to get a software component to run, something just doesn’t feel right yet.
One of the reasons for the scalability problem is the proliferation of data sources and data consumers. An obvious bottleneck emerges when one central team manages and owns data along its whole journey: from ingestion through transformation and harmonization to serving it to all potential users. Splitting the team along the data pipe does not help much either. When engineers working on data ingestion change anything, they need to inform the group responsible for transformation. Otherwise, the downstream systems may fail or will process the data incorrectly. This required close collaboration between the engineers leads to tight coupling of all data-related systems.
The other problem arises from the monolithic nature of data platforms, such as warehouses and lakes. As a result, they often lack the diversity to reflect the reality encoded in data derived from sources and domain-specific structures. Moreover, enforced flattening of data structures reduces the ability to generate valuable insights from the collected data, as crucial domain-specific knowledge gets lost in these centralized platforms. We observed this in one of the projects we worked on. A car parts manufacturing company was buying data related to failures of different parts. Even though the provider had information on part provenance, i.e. the model the part was installed in, the buyer had no data models allowing it to store this information. As a result, components were analyzed separately, hampering R&D’s attempts to understand the bigger picture.
Two more interwoven factors exacerbate the problems described above. One is an unclear data ownership structure; the other is the responsibility for data quality. Data traveling through different specialized teams loses its connection to its business meaning, which means that developers of centralized data processing systems and applications can’t and won’t fully understand its content. Yet data quality cannot be assessed in isolation from its meaning.
Similar problems have been recognized in other areas of software engineering and have resulted in the emergence (and success!) of Domain-Driven Design and microservices. Applying similar thinking (i.e. a focus on data ownership and shared tooling) to data engineering led to the development of the idea of the Data Mesh.
There are two main alternative models to Data Mesh’s decentralization of responsibility for data.
The first option is the centralization of both people and technology. This is the default setup for any start-up. And it’s a very decent default option, just like the monolith is a decent default option for any software component. In the beginning, the costs of decentralization outweigh its benefits. Working closely together inside one data team, with just one technology to use, makes things a lot easier.
INSIGHT Centralization is a sensible default option
Centralized data work, both organizational and technical, can make sense as a default option. Decentralization does carry costs, and centralization can mitigate them. That does assume, though, that the value derived from centralized and decentralized data work is roughly equal.
The second option is to split up the work not by business domains, as the Data Mesh suggests, but by technology. This usually results in one core data engineering team, responsible mostly for ingesting data and provisioning a data storage infrastructure, and multiple other teams (analytics teams, data science teams, analysts, you name it) that pick up the raw data and turn it into something meaningful down the road. You might first centralize your data system and then layer up with this option to increase the flow.
There is nothing wrong with these two options. They might be reasonable defaults, but both fail to align with value creation, which is deeply tied to business domains. Neither is able to address sudden changes in just one business domain. As with microservices, whose strength is the ability to quickly extract value from one specific service by scaling it up on its own, the Data Mesh can scale up value extraction in just one domain. All other options need to scale up everything to scale up value extraction in a single domain.
So in one way or another, both of these options will hit a wall at some point in time, in which adding the next data source, or adding the next data science project will feel increasingly complex & costly. That’s the point where you want to switch to a Data Mesh.
Data Warehouses & Data Lakes inside the Data Mesh
There is a misconception about the Data Mesh. It is sometimes perceived as an exclusive alternative to the central data lake or the central data warehouse.
But that does not take into account what the Data Mesh is, i.e. a combination of two things: technology & organization. The Data Mesh is an alternative to having one centralized data unit taking care of the data inside a central data storage.
That still leaves the option to have central data storage and decentralized units working & owning the data. Indeed that is a common implementation in companies that don’t need complete flexibility on the data producers’ side.
It also is a common approach to keep data lakes & data warehouses inside a business intelligence or data science team. The data lakes & data warehouses then become a node inside the Data Mesh.
Figure 3 Data Mesh can still use Data Lakes, e.g. a Data Science team building data products may use Data Lakes as nodes within the Data Mesh.
Data Meshes make heavy use of both data lakes and data warehouses in various formats; Data Meshes in general do not try to focus on any specific technology. Let’s take a look at the benefits of Data Mesh.
Data Mesh benefits
Let’s analyze the potential lying in Data Mesh implementation from two different perspectives: through the eyes of the business decision-makers and the technologist.
The business perspective
From the business perspective, data itself is of little value. Worse, it means incurred costs! Sounds like heresy? To understand that statement, and if needed, convey it to your business partners, you need to understand the different levels at which people perceive reality. A good approximation of this phenomenon is the so-called DIKW pyramid, derived from the 1934 play “The Rock” by T.S. Eliot. It represents data, information, knowledge, and wisdom as a hierarchical structure, where each successive element can be derived from the previous one.
The data in this context is just a set of values (which costs money to store). To derive the value from it, one needs to build up the context allowing for informed decision-making. The Data Mesh improves the robustness of the whole pyramid.
As we mentioned, having the raw data is of no use for decision-makers. One can argue that they can download it to their laptops and analyze it themselves. It is true! It has, however, two underlying assumptions:
- To download the data, it needs to be accessible.
- To ensure the value of any performed analysis, data needs to be as complete as possible.
To address the first assumption: as we mentioned already and will say repeatedly, the Data Mesh is very much focused on making data accessible. Not only accessible but findable, interoperable, and reusable as well! This is embedded in one of the four Data Mesh principles—Data as a Product—which is all about making sure data is there for the taking.
Completeness of the data is another issue where the Data Mesh shines. Unlike in most Data Warehouse or Data Lake architectures, Data Products and their data models are not developed by IT specialists in isolation from business concerns. Instead, it’s a joint effort, ensuring the data presented outside the domain is sufficient to derive meaningful conclusions from.
Data Mesh also helps add value to elements higher in the hierarchy. The teams transforming data into information, knowledge, and wisdom, which the business environment likes to call “insight,” gain instant access to multiple interoperable data sources.
Of course, in theory, it’s possible to make this happen in a Data Lake as well. However, in our experience, a single team managing the technical aspects of the environment, as well as data access and transfer rights, simply cannot keep up. And if the required bits of data are stored in two different Data Lakes (or four, which is not that unusual), getting them all to work together is next to impossible.
In short, having access to read-optimized Data Products enables quick prototyping of new analytical methods and opens a path for the rapid development of new business capabilities.
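To illustrate what such quick prototyping might look like, the sketch below joins two hypothetical read-optimized data products, “registered customers” and “orders”, into a simple revenue-by-country view. The record shapes and field names are invented for this example; the point is that a consuming team can combine well-described data products directly, without waiting on a central team.

```python
# Two hypothetical data products, already in a read-optimized form.
customers = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 2, "country": "PL"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 40.0},
    {"order_id": 11, "customer_id": 1, "amount": 15.0},
    {"order_id": 12, "customer_id": 2, "amount": 99.0},
]

def revenue_by_country(customers, orders):
    """Join the two data products on customer_id and aggregate revenue."""
    country_of = {c["customer_id"]: c["country"] for c in customers}
    totals = {}
    for order in orders:
        country = country_of[order["customer_id"]]
        totals[country] = totals.get(country, 0.0) + order["amount"]
    return totals

print(revenue_by_country(customers, orders))  # {'DE': 55.0, 'PL': 99.0}
```

Because both inputs are interoperable (here, they share the `customer_id` key), the join needs no harmonization work from a central integrations team.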
The technology perspective
The main benefit from the technological perspective is keeping up the speed of development as the organization grows. The Data Mesh is meant to address the shortcomings of other data architectures, like Data Warehouses or Data Lakes, by decentralizing data production and governance. Those architectures introduce a bottleneck: a central team responsible for harmonizing all the data for the whole of your company and making it ready for consumption. A single team cannot scale to accommodate the varied data needs of a growing organization. Both the technology and the team’s knowledge quickly become a scaling problem. Eventually, more time is spent on maintenance, and new projects become more and more delayed.
The other benefit of the Data Mesh is the clarity of data ownership right from the point of its creation. It flattens the data management structure, leaving just a thin layer, the Federated Governance Team. And even that team’s activities are limited to agreeing on standards across autonomous domains.
The increase in the speed of development also comes from empowering the implementation teams. Since producing and maintaining the Data Products lies on their shoulders, the speed of change is not limited by a single central integration team’s backlog. It means that both the evolution of and fixes to the Data Product happen more quickly. This is especially noticeable for bug fixes and downtimes. Furthermore, the team that owns a given Data Product is better equipped to react faster because there is no context switching, as is the case with a single central data team.
The other factor worth mentioning is data environment stability. With Data Products offering access to contracted versions of their datasets, pipelines built on top of them are much more robust and require much less maintenance.
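As a rough sketch of what such a contract might buy a consumer, the snippet below validates incoming records against a versioned schema. The `CONTRACT_V1` fields and the checking logic are hypothetical; real implementations would more likely use a schema registry or a validation library, but the idea is the same: downstream pipelines fail loudly at the contract boundary instead of silently processing malformed data.

```python
# Hypothetical contract: a data product promises this versioned schema,
# and a consumer pipeline checks incoming records against it.
CONTRACT_V1 = {"customer_id": int, "country": str, "registered_at": str}

def validate_against_contract(records, contract):
    """Raise if any record is missing a field or has the wrong type."""
    for i, record in enumerate(records):
        for field, expected_type in contract.items():
            if field not in record:
                raise ValueError(f"record {i}: missing field {field!r}")
            if not isinstance(record[field], expected_type):
                raise TypeError(
                    f"record {i}: field {field!r} is not {expected_type.__name__}"
                )
    return True

good = [{"customer_id": 1, "country": "DE", "registered_at": "2023-01-15"}]
validate_against_contract(good, CONTRACT_V1)
```

When the producing team needs a breaking change, it can publish a `CONTRACT_V2` alongside the old version, giving consumers time to migrate rather than breaking their pipelines overnight.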
That’s all for now. Thanks for reading.