An excerpt from Data Mesh in Action by Jacek Majchrzak, Sven Balnojan, and Marian Siwiak with Mariusz Sieraczkiewicz.

The term “Data Product” gets tossed around quite a bit, but what does it actually mean? This article explores what is (and is not) a data product and why.
Often when thinking about data, we focus on technical aspects such as schema or data relationships. However, we need to remember that other related elements, such as schema descriptions, domain descriptions, the definition of access rules, metrics, and quality checks, are equally important. Often these individual elements are handled by separate, dedicated, specialized data teams. It turns out that, in reality, these non-functional aspects of data do not form an easily managed, coherent whole with the data itself.
On the other hand, there are many datasets in any given organization that would be valuable to have available in an easy-to-use form. Unfortunately, they are often difficult to access (for example, stored on a private drive), undescribed, and usually unprepared for consumption (not cleaned or standardized).
Therefore, it makes sense to think about data as a product: a well-defined unit tailored to user needs.
Because the Data Mesh approach wants to treat data as a product, we need to make sure that the datasets:
- Have business value
- Can be easily found
- Are accessible
- Are able to work together
- Can be used in different contexts
We need to make sure that a given dataset is worth making into a product, because this activity comes at a price. Moreover, proper analysis allows us to prioritize the work on Data Products. Let’s define the term “Data Product”. It’s a concept that you have probably heard before, but its meaning varies among people and organizations.
While it is relatively easy to determine what is not a Data Product, deciding what is a Data Product is harder. Data that was created as a side effect of another activity, for example, is not a Data Product. If an employee created an Excel file that is freely available, it is not a Data Product because:
- It was not created with the intention of being a Data Product; that is, no conscious effort was made to prepare it as one.
- It does not address clearly defined user needs.
Data Product definition
We will define Data Product as follows:
A Data Product is an autonomous, read-optimized, standardized data unit containing at least one dataset (Domain Dataset), created for satisfying user needs.
This idea can be visualized in the form shown in figure 1.
Figure 1 Data Product and its context
In the figure, we see that a Data Product is an autonomous entity that makes its data available through Output ports, from which it can be used by other transaction systems, Data Products, or end users.
NOTE: Other usages of the term Data Product
It is also worth noting that the combination of the words “Data Product” already exists in the data world in a different sense (outside of the Data Mesh context). In this case, Data Product is defined as follows:
A data product is an application or tool that uses data to help businesses improve their decisions and processes.
According to this definition, Data Products are all the tools that facilitate data processing, especially statistical analysis of data.
Let’s look at Data Products in the Produce Content domain to explain what this definition means in practice:
- Autonomous data unit: Every Data Product is a self-contained system or dataset that can be treated as an independent unit. For example, for Messflix LLC, Movie Popularity is a separate component built in the form of a Data Lake. It independently retrieves data from an external API, processes it, stores it and makes it available to its consumers. It can be developed without affecting other systems and data, as shown in figure 2.
- Read-optimized data unit: The form of data storage is optimized for analytical functions. In the case of the Movie Popularity Data Product, the data is stored in the form of files containing the ranking of movies generated every week. Using Data Engineering tools (e.g. Python libraries) makes it easy to perform more complex analysis.
Figure 2 Movie Popularity Data Product as a sample of autonomous data unit
- Standardized data unit: The Data Product must be prepared in the right way, and it must be described with metadata, use standardized protocols and ports for data sharing, and use defined vocabularies and an identifier system.
- Node in a Data Mesh: The Data Product naturally becomes part of a larger ecosystem by, among other things, posting the appropriate Data Catalog metadata and allowing it to be used in a self-serve manner.
- Containing at least one Domain Dataset: It must have a dataset that itself has business value. Movie Popularity contains a dataset of movie rankings from a given week, Scripts contains a dataset of metadata about scripts, and Cost Statement contains a dataset of movie production costs.
- Satisfying user needs: A Data Product is created with Product Thinking in mind, and it has clearly defined users. In the case of Cost Statement, these are the accountants who financially control production costs.
Product rather than project
As you already know, the use of the term product in Data Mesh is not accidental. It is worth contrasting it with the term project, commonly used to name various kinds of initiatives carried out in organizations. It is worth emphasizing that the creation of a Data Product should not be treated as a project. A project is a time-limited activity that is supposed to lead to a specific goal. The project ends and may not be continued.
On the other hand, Product Thinking makes us look at a Data Product from a long-term perspective that goes beyond just making it available. It also assumes further development, adjusting the product based on feedback gathered from users. Product Thinking includes the idea of continuous improvement and long-term thinking about quality, as opposed to a one-time project.
What can be a Data Product?
What could be a candidate for a Data Product? In general, any data representation that has value for users can be a good candidate.
You can find a couple of examples in the following enumeration:
- Raw unstructured files: For example, images generated by geological sampling equipment or videos uploaded by users of a video portal; however, to be useful to end-users, they should be accompanied by sufficient metadata that describes them.
- Simple files: For example, results of a series of measurements taken on geological samples, Excel reports, or data in CSV format exported from an application. In this case, a description in the form of metadata can also be crucial for further use.
- A data set inside a database (or data storage in general): Containing a read-optimized representation of data from a source system (system of records).
- A data set built based on data retrieved from a COTS-type system (commercial off the shelf): For example, containing information about stock levels.
- REST API: Data exposed from applications to read transactional data, optimized for reading, optionally supporting HATEOAS to facilitate automated (machine-consumable) data consumption.
- Data stream representing the history of changes to the application: For example, events that relate to changes made within a billing account.
- Data stream representing snapshots of data entities from a transaction system: For example, information about a system user.
- Data mart: Representing data enabling multidimensional analysis of sales results.
- Denormalized table: Containing search-optimized information, e.g. about the rentals of a particular movie.
- Graph database: Reflecting complex relationships between data, for example, specialized for finding movie recommendations.
- Local data lake or part of a bigger data lake: Used to build a view that serves as a basis for analyzing movie-watching statistics.
- MDM-type database (master data management): Containing an integrated view of system users (personal information, viewing preferences, viewing time preferences, payment information, rating information, etc.).
As we can see from the examples, there are many possibilities. But, of course, these are just examples of the data itself, the core of a Data Product. Data Mesh does not dictate a particular form of data representation or a specific technology.
Conceptual Architecture of a Data Product
In the previous sections, we showed how to validate and define a Data Product, and how to assign a Product Owner along with a Data Product Development Team. In this section, we will focus on a high-level view of the Data Product structure from the perspective of the Data Product user (external architecture view) and the implementer (internal architecture view).
External architecture view
In the figure showing Data Products in the Produce Content domain, the key elements of a Data Product’s external architecture were presented: external and internal ports. The excerpt of that figure for the Production Cost Statement looks like Figure 3.
Figure 3 Cost Statement Data Product interfaces
To be fully functional, a Data Product requires more components to become an autonomous element of the architecture. We should consciously design the entire set of interfaces that will make up the full Data Product. For the Production Cost Statement, it will look like Figure 4.
Figure 4 The set of interfaces exposed by a Data Product
As part of the design process, we decided to provide the following interfaces:
- REST API Endpoint with OpenAPI documentation: API that will allow consumers to directly use the data.
- Operational metrics of REST API usage: REST API metrics, such as the number of active users, the amount of data retrieved, and the number of requests processed.
- Cost Statement files on a shared drive: Processed data exposed as files.
- Operational metrics of shared drive resource usage: Metrics from the file-sharing system, including the number of active users, downloads, and requests processed.
- Data Product metadata: Information describing the data being shared, such as business description, contact persons, and responsible persons.
- Overall metrics of Data Product usage: Metrics that are the sum of the metrics from individual ports.
- Data Quality reports: Reports on the quality of the data provided, such as when the data was last updated and any gaps in it.
- Security Access configuration: A tool for setting up access to data, mainly exposed for a self-serve platform.
The generalized external architecture of the Data Product is shown in Figure 5.
Figure 5 The main groups of interfaces of a Data Product
We can divide all of the Data Product interfaces into three main groups:
- Communication interfaces: Input interfaces and output ports, which define the format and protocol in which data can be read, for example, files, API, database, and stream.
- Information interfaces: Interfaces that can be used by users to monitor and retrieve additional information such as metadata, lineage, current metrics, or quality indicators.
- Configuration interfaces: Interfaces, mainly platform-specific, that enable the Data Product to be included in the Data Mesh ecosystem, e.g. security options and parameters, and Data Sharing Agreement statements.
Specifically, the self-serve platform uses configuration and information interfaces to configure, catalog, monitor, and visualize the relationships between Data Products. Interfaces must be standardized for interoperability to be possible. This standardization is part of computational governance.
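To make these groups concrete, here is a minimal sketch of how the Production Cost Statement’s interface set might be declared, grouped into the three categories above. It is written as a plain Python structure; every URL, path, and name is a hypothetical illustration, not the book’s actual implementation.

```python
# Illustrative descriptor of the Cost Statement Data Product's interfaces,
# grouped into the three categories described above. All URLs and paths
# are hypothetical assumptions.
cost_statement_interfaces = {
    "communication": {
        # Output ports exposing the data itself
        "rest_api": "https://api.example.com/cost-statements",  # documented with OpenAPI
        "files": "//shared-drive/cost-statements/",             # processed data as files
    },
    "information": {
        # Interfaces for monitoring and retrieving additional information
        "metadata": "https://catalog.example.com/cost-statements/metadata",
        "usage_metrics": "https://metrics.example.com/cost-statements",
        "quality_reports": "https://quality.example.com/cost-statements/quality",
    },
    "configuration": {
        # Platform-facing interfaces, mainly used by the self-serve platform
        "security_access": "https://platform.example.com/cost-statements/access",
    },
}
```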
In the next subsection, we will take a closer look at external ports, which are the Data Product’s primary means of exposing data.
Data Product ports
A Data Product is created with the end user in mind, and as we know, users have different needs and preferences. For example, a data scientist might prefer to have the same data about movies rented in Latin America in the form of a well-described file, a data engineer might like to have a bunch of data files to be processed with a crafted Python script, while a programmer might prefer to use a REST API.
A Data Product is not only a form in which data is stored but also the various ways that data can be accessed. In the case of a Data Product, we call these ports.
A Data Product can have one or more output ports, and the infrastructure should make it easy to expose the same data in different forms. Below we present examples of output port representations.
Database-like storage
One of the primary ports that will come into play when developing Data Products is database-like storage, i.e. SQL databases, NoSQL databases, or Data Warehouse Data Marts. In many cases, this may be the only type of port a Data Product provides. It offers excellent opportunities, especially for the analytical processing of large amounts of data, including the Big Data area.
This type of storage can be queried with SQL or an analogous language, giving practically unlimited possibilities for further processing. Of course, these ports are addressed mainly to technical users who can work with this type of storage and write SQL.
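As a small illustration, the sketch below queries a database-like port with plain SQL through Python’s built-in sqlite3 module. The database file and the cost_statements table (with movie_id and amount columns) are assumptions made for the example; a real port would more likely be a data warehouse or data mart.

```python
import sqlite3

# Connect to a hypothetical database-like output port. In practice this
# would be a warehouse or data mart rather than a local SQLite file.
conn = sqlite3.connect("cost_statement.db")

# A read-optimized port lets consumers run arbitrary analytical SQL,
# e.g. aggregating production costs per movie. Table and column names
# are assumptions for this sketch.
rows = conn.execute(
    """
    SELECT movie_id, SUM(amount) AS total_cost
    FROM cost_statements
    GROUP BY movie_id
    ORDER BY total_cost DESC
    """
).fetchall()

for movie_id, total_cost in rows:
    print(movie_id, total_cost)

conn.close()
```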
Files
The second type of port, which will be familiar to less technical users of a Data Product, is files that can be imported into tools such as MS Excel or Google Sheets, or used by data scientists. It is worth remembering that data engineers often feel very confident working with this type of data source. Files allow working with large amounts of data, and many algorithms implemented in libraries of languages such as R or Python are well optimized for file processing.
In the case of smaller sets, files can represent the whole dataset. For large amounts of data, in most cases a subset will first be selected, through an appropriate query or the configuration of various filters, and then exported to a file.
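For example, a consumer might load a weekly export from a files port with pandas; the file name and column names below are invented for the sketch.

```python
import pandas as pd

# Load a hypothetical weekly export from the files port.
rankings = pd.read_csv("movie_popularity_2022-W14.csv")

# Typical downstream use: sort and trim the read-optimized data.
top_ten = rankings.sort_values("popularity_score", ascending=False).head(10)
print(top_ten[["movie_title", "popularity_score"]])
```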
(REST) API
Another type of port is the REST API. We use HTTP REST here as an example because it is the most popular protocol at the time of writing this book. Nevertheless, what we mean is a type of interface that is mainly intended for consumption by other systems.
This type of port is mainly suitable for working on small subsets of data because it is usually inefficient for large collections. If we add GraphQL support to this type of port, we get the ability to perform quite complex and customized queries on the data. And with the addition of HATEOAS, you can gain additional capabilities for automating data browsing using ML/AI.
So this port is great for automating data processing and for relatively low-intensity analytics that does not require large amounts of data.
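As an illustration, here is a minimal sketch of a consumer paging through a REST API output port. The base URL, query parameters, and response shape (the items list and links.next field) are hypothetical assumptions, not an API defined in the book.

```python
import requests

# Hypothetical endpoint of the Cost Statement Data Product's REST API port.
BASE_URL = "https://api.example.com/cost-statements"

# Fetch a small, filtered subset of the data, which is what this port
# is best suited for.
response = requests.get(BASE_URL, params={"movie_id": 42, "page": 1})
response.raise_for_status()
page = response.json()

for statement in page["items"]:
    print(statement["cost_type"], statement["amount"])

# With HATEOAS, each response could carry navigation links, letting a
# client (or an ML/AI agent) walk the dataset without hard-coded URLs.
next_url = page.get("links", {}).get("next")
if next_url:
    page = requests.get(next_url).json()
```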
Streams
By streams, we mean messaging-based systems where individual pieces of information are stored as single messages (e.g. message queues). This type of port is not yet very popular in the data world, especially as a form of sharing data, because tools for its analytical processing are still under development. Messaging systems are great for distributed processing of large amounts of data, especially in situations where we want to process data in near real-time. An example is a system that continuously analyzes stock price changes and generates purchase recommendations. Near real-time analytical results may be one reason to consider using streams.
The second reason to consider streams in the data world is to use the idea of Event Sourcing known from the software development world. Event Sourcing is essentially storing the changes associated with an entity (e.g. a transaction) rather than the final state of the entity. Having all the changes from the very beginning provides tremendous opportunities for seeing an entity at any point in time, such as building any representation based on them, creating analyses that were not considered when the system was created, and time travel.
The stream in this form might be the basis for creating other Data Products.
It is worth noting that a stream can also be used to implement a Data Product—to transport the data needed to build the Data Product. However, it’s then considered an input interface.
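To make the Event Sourcing idea concrete, here is a minimal sketch, assuming a simplified billing account with made-up event types. Replaying the events up to a chosen date reconstructs the entity’s state at that point in time, which is the “time travel” mentioned above.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: str  # ISO date, kept as a string for simplicity
    kind: str       # "charged" or "credited" (made-up event types)
    amount: float

# The stored history of changes, not the final state.
events = [
    Event("2022-01-05", "charged", 12.99),
    Event("2022-02-05", "charged", 12.99),
    Event("2022-02-20", "credited", 5.00),
]

def balance_as_of(events: list[Event], date: str) -> float:
    """Replay all events up to `date` to reconstruct the account balance."""
    balance = 0.0
    for e in events:
        if e.timestamp <= date:
            balance += e.amount if e.kind == "charged" else -e.amount
    return balance

print(balance_as_of(events, "2022-01-31"))  # 12.99 -- the state back then
print(balance_as_of(events, "2022-02-28"))  # 20.98 -- the state today
```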
Visualizations
The last example of a port on our list is visualizations: various kinds of charts and dashboards representing data. They can be a good resource for non-technical users, especially if they are combined with a filtering and data-selection mechanism. This port provides visual representations of data in a relatively easy way. Such automatic visualizations can be supported by one of the platform capabilities.
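As a simple illustration of what such a port might render, the sketch below draws a weekly popularity chart with matplotlib; the data points are invented for the example, and a real visualization port would typically be a BI dashboard.

```python
import matplotlib.pyplot as plt

# Made-up weekly popularity scores for one movie.
weeks = ["W10", "W11", "W12", "W13"]
popularity = [71, 75, 83, 90]

plt.plot(weeks, popularity, marker="o")
plt.title("Movie popularity over time (illustrative data)")
plt.xlabel("Week")
plt.ylabel("Popularity score")
plt.show()
```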
The table below summarizes port applications for various situations.
| Type of Data Product port | Access type | Amount of data |
|---|---|---|
| Database-like storage | SQL-like interface, used programmatically | Large |
| Files | Scripting; import to BI tools; open in simple tools like MS Excel and Google Sheets | Small to large |
| API | Programmatically | Small to moderate |
| Streams | Programmatically | Huge amounts of data processed in near real-time |
| Visualizations | Dedicated tool/UI | Small to moderate |
Having decided what kind of port representation to use, we can focus more on detailed design of a Data Product.
Internal architecture view
While the external architecture is fairly self-explanatory and easy to standardize, the internal architecture of a Data Product will depend significantly on the specifics of the Data Product, how it is processed, and the technologies used.
Figure 6 shows how the internal architecture of the Production Cost Statement Data Product was designed; based on it, we will extract the main components of a Data Product’s internal architecture.
Figure 6 Example of a Data Product internal components
The data related to production costs comes from spreadsheet files, which are constantly updated by the production team according to the established rules. To enable more complex analysis of this data, it was made available in the form of a REST API and files that will be used for financial analysis. The whole solution will be implemented as a set of Python scripts, run once a day by the scheduler.
These scripts read the current spreadsheet files, save them in their raw form, and then perform data cleaning and transformation into the target form, which is written to the database used by the REST API and to files located on the shared drive. During data processing, processing logs, error logs, and a quality report are generated. The whole process is complemented by CI/CD pipelines and metadata describing the Data Product. All these elements constitute a Data Product; they are created and maintained by the Data Product Development Team.
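A minimal sketch of such a daily job is shown below. All file locations, column names, and quality rules are assumptions made for illustration (and reading Excel files with pandas requires an engine such as openpyxl); the real implementation would be a set of scripts run by the scheduler.

```python
import logging
import shutil
from pathlib import Path

import pandas as pd

logging.basicConfig(filename="processing.log", level=logging.INFO)

RAW_DIR = Path("raw")        # raw copies of the source spreadsheets
TARGET_DIR = Path("shared")  # files exposed on the shared drive

def run_daily_job(source_file: Path) -> None:
    RAW_DIR.mkdir(exist_ok=True)
    TARGET_DIR.mkdir(exist_ok=True)

    # 1. Preserve the source spreadsheet in its raw form.
    shutil.copy(source_file, RAW_DIR / source_file.name)

    # 2. Clean and transform into the target form. Column names are
    #    assumptions for this sketch.
    costs = pd.read_excel(source_file)
    costs = costs.dropna(subset=["movie_id", "amount"])
    costs["amount"] = costs["amount"].round(2)

    # 3. Write the target form for the file port (the same step would
    #    also populate the database behind the REST API).
    costs.to_csv(TARGET_DIR / "cost_statement.csv", index=False)

    # 4. Emit a simple quality report into the processing log.
    report = {"rows": len(costs), "missing_dates": int(costs["date"].isna().sum())}
    logging.info("Quality report: %s", report)

run_daily_job(Path("production_costs.xlsx"))
```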
The following sections describe the main elements of the Data Product’s internal architecture.
Datasets
One of the most obvious components of a Data Product are the datasets. They are what ultimately constitute the essence of a Data Product. As described in an earlier section, a dataset can be one or more related tables, a file, data in a stream, or successive versions of processed data. In many cases, the dataset will not physically reside where the other elements of the Data Product reside.
For example, it may be a remote file whose size or legal aspects prevent copying. In particular, the latter may be the case when, due to licensing, the file can only physically reside within the boundaries of a geographical region. Another example might be a dataset that is part of a database running as a cloud service.
Metadata
In the Data Mesh approach, metadata plays a particularly important role because it allows many processes to be automated. Much of the metadata, especially the descriptive and configuration metadata, will be part of the internal implementation of the Data Product: its name, business description, dataset schemas, responsible people, data quality metrics, and available ports. In its physical form, metadata can be represented, for example, as a JSON file (as described in more detail later).
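For illustration, here is a sketch of what such a metadata file might contain, written as a short Python script that serializes it to JSON. All field names and values are assumptions, not a schema prescribed by Data Mesh.

```python
import json

# Hypothetical descriptive metadata for the Cost Statement Data Product.
metadata = {
    "name": "Production Cost Statement",
    "business_description": "Production costs per movie, updated daily",
    "responsible": ["Data Product Owner", "Data Product Development Team"],
    "datasets": [
        {
            "name": "cost_statements",
            "schema": {
                "movie_id": "int",
                "cost_type": "string",
                "amount": "decimal",
                "date": "date",
            },
        }
    ],
    "ports": ["rest_api", "shared_drive_files"],
    "quality_metrics": ["last_updated", "row_count", "missing_value_ratio"],
}

# Persist the metadata as a JSON file, as suggested above.
with open("data_product_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```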
Code
An important part of Data Product implementation will be the code that enables the creation of the final product, covering, among other things:
- Cleansing and data wrangling: Cleaning, repairing, and harmonizing data.
- Transformational pipelines: ETL or ELT transformations, enrichment, and harmonization.
- Infrastructure as code: Configuration of the container and orchestration infrastructure layers.
- CI/CD pipelines: Processes that update the Data Product code after changes, verify its correctness, and apply the changes to the production environment if the entire process went correctly.
- Scheduling processing code: Configuration of systems that run regular data processing.
And now you know all about what makes a Data Product.
That’s all for this article. For more, you can check out the book here.