From Designing Cloud Data Platforms by Danil Zburivsky and Lynda Partner
In this article, we'll layer on some of the critical and more advanced functionality needed for most data platforms today. Without this added layer of sophistication, your data platform would work, but it wouldn't scale easily, nor would it meet growing data velocity challenges. It would also be limited in the types of data consumers (the people and systems who consume data from the platform) it supports, as they're also growing in both number and variety.
Let's take a deeper dive into a more complex cloud data platform architecture:
- Exploring which functional layers exist in modern platform architectures and the roles they play
- Introducing the concepts of fast/slow storage, streaming versus batch, metadata management, ETL overlays, and data consumers
In a world where new tools and services are announced almost daily, we'll also look at some of the tools and services available to bring into your data platform planning, including tooling that is independent of cloud providers, both open source and commercial.
We know that choosing a cloud provider is a decision that hinges on far more than what services they offer, and the truth is that you can build and operate a great data platform on any of the three cloud providers discussed here. That said, our recommendations on tools are intended to help narrow your search and free up some of your time to start working on your design.
NOTE: An ongoing debate rages about the tradeoffs between using cloud-vendor-specific services versus trying to build a platform that is cloud independent. To date, our experience has been that the benefits of going cloud specific and using PaaS to keep support costs low outweigh the benefits of being cloud vendor independent (the ability to more easily move your platform from one cloud vendor to another).
This is a lot in one article, but this is where your data platform design gets interesting.
Cloud data platform layered architecture
This section covers:
- The six layers of a modern data platform architecture
- The importance of a layered design
We'll discuss what they do and why, along with some tips we've learned by implementing these in real life. We'll also talk about why it's a good idea, from an architecture perspective, for these layers to be "loosely coupled": separate layers that communicate through a well-defined interface and don't depend on the internal implementation of a specific layer.
This is a high-level architecture of a data platform with four layers (ingestion, storage, processing and serving):
Figure 1 – The four-layer high level data platform architecture
Figure 2. Cloud data platform layered architecture
Figure 2 above shows a more sophisticated data platform architecture with six layers:
- In the ingestion layer, we show a distinction between batch and streaming ingest
- In the storage layer, we introduce the concept of slow and fast storage options
- In the processing layer, we discuss how it works with batch and streaming data and with fast and slow storage
- We add a new Metadata layer to enhance the processing layer
- We expand the Serving layer to go beyond a data warehouse to include other data consumers
A layer is a functional component that performs a specific task in the data platform system. In practical terms, a layer is either a cloud service, an open source or commercial tool, or an application component you have implemented yourself; often it's a combination of several such components. Let's go through each of these layers in more detail.
Data ingestion layer
The name of this layer is self-explanatory: it's responsible for connecting to the source systems and bringing data into the data platform. But much more is happening in this layer than the name suggests.
Figure 3. The data ingestion layer connects to the source system and brings data into the data platform
The data ingestion layer should be able to perform the following tasks:
- Securely connect to a wide variety of data sources, in streaming and/or batch modes.
- Transfer data from the source to the data platform without applying significant changes to the data itself or its format. Preserving raw data in the lake is important for cases where data needs to be re-processed later, without having to reach out to the source again.
- Register statistics and ingestion status in the metadata repository. For example, it’s important to know how much data has been ingested either in a given batch or within a specific time frame if it’s a streaming data source.
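To make the last task above more concrete, here's a minimal sketch of a batch ingestion job registering its statistics in a metadata repository. The MetadataStore class and its field names are hypothetical stand-ins, not a real service; in practice this would be a database table or a managed metadata service.

```python
import datetime


class MetadataStore:
    """Hypothetical client for the platform's metadata repository."""

    def __init__(self):
        self._records = []  # in-memory stand-in for a real database or service

    def save_ingestion_status(self, record: dict) -> None:
        self._records.append(record)


def register_batch_ingestion(metadata: MetadataStore, source: str,
                             batch_id: str, row_count: int, status: str) -> None:
    """Record what was ingested, from where, how much, and whether it succeeded."""
    metadata.save_ingestion_status({
        "source": source,
        "batch_id": batch_id,
        "row_count": row_count,
        "status": status,
        "ingested_at": datetime.datetime.utcnow().isoformat(),
    })


# Example: a batch job reports 10,000 rows ingested from a (hypothetical) CRM export.
store = MetadataStore()
register_batch_ingestion(store, source="crm_export", batch_id="2020-01-15",
                         row_count=10_000, status="success")
```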
You can see that our architecture diagram has both batch and streaming ingestion coming into the ingestion layer.
Figure 4: Ingestion Layer should support Streaming and Batch Ingestion
You may hear that the data processing world is moving (or has already moved, depending on who you talk to) to data streaming and real-time solutions. Although we agree that this is the direction the industry is heading, we can't ignore reality: from our experience, there are many existing data stores that only support batch data access. This is true for third-party data sources, which are often delivered as files on FTP in CSV, JSON or XML format, and for other systems with batch-only access patterns.
For the data sources you control, like your operational RDBMS for example, it’s possible to implement a full streaming solution. It’s significantly harder, if not impossible, to achieve the same with third-party sources. And as third-party sources often comprise a large portion of data that needs to be brought into the data platform, both batch and streaming ingestion are likely to be part of your data platform design.
We believe it's good architectural practice to build your data ingestion layer to support both batch and streaming ingestion as first-class citizens. This means using appropriate tools for different types of data sources. For example, it's possible to ingest streaming data as a series of small batches, but then we lose the ability to perform real-time analytics in the future; our design should avoid such technical debt. This way you will always be able to ingest any data source, no matter where it comes from, and this is one of the most important characteristics of a data platform.
NOTE: We often hear from companies that their data platform must be "real-time," but we've learned that it's important to unpack what real-time means when it comes to analytics. From our experience there are two common interpretations: making data available for analysis as soon as it's produced at the source, i.e. real-time ingestion, and immediately analyzing and taking action on data that has been ingested in real time, i.e. real-time analytics. Fraud detection and real-time recommendation systems are good examples of real-time analytics. Most of our customers don't need real-time analytics, at least not yet. They want to ensure that the data they use to produce insights is up to date, meaning as current as a few minutes or hours ago, even though the report or dashboard may only be looked at periodically. Given that real-time analytics is much more complex than real-time ingestion, it's often worth exploring user needs in detail to fully understand how to best architect your data platform.
Some of you may look at the cloud data platform architecture diagram in this article, see two ingestion paths, one for batch and one for streaming, and ask whether this is an example of a Lambda Architecture.
NOTE: For those of you who are not familiar with Lambda Architecture we recommend a great resource on the topic http://lambda-architecture.net/
In a nutshell, a Lambda Architecture suggests that, in order to provide accurate analytical results along with low-latency results, a data platform must support both batch and streaming data processing paths. The difference between the Lambda architecture and the cloud data platform architecture described here is that in Lambda the same data goes through two different pipelines, whereas in our data platform architecture batch data goes through one pipeline and streaming data through a different one.
The Lambda architecture was conceived in the early days of Hadoop implementations, when it was impossible to build a fully resilient and accurate real-time pipeline. The fast path provided low-latency results, but due to limitations of some of the streaming frameworks available in Hadoop-based platforms, these results were not always one hundred percent accurate. To reconcile the potential differences, the same data was also pushed through a batch layer, which produced a completely correct result at a later stage.
Today, with cloud services like Dataflow from Google or open source solutions like Kafka Streams, these limitations have largely been overcome. Other frameworks, like Spark Streaming, have also made significant improvements to accuracy and failure handling. In our proposed cloud data platform architecture, the batch and stream ingestion paths are intended for completely different data sources. If a data source doesn't support real-time data access, or the nature of the data is such that it arrives only periodically, it's easier and more efficient to push this data through the batch path. Sources that support one-event-at-a-time data access, i.e. streaming, should go through the real-time layer.
Delivering data into the lake efficiently and reliably requires a data ingestion layer to have four important properties:
- Pluggable architecture. New types of data sources are added all the time. It's unrealistic to expect that connectors for every data source are available in the ingestion tool or service that you choose. Make sure that your data ingestion layer allows you to add new connector types without significant effort (a minimal sketch of such a pluggable design follows this list).
- Scalability. The data ingestion layer should be able to handle large volumes of data and be able to scale beyond a single computer capacity. You may not need all this scale today, but you should always plan ahead and choose solutions that don’t require you to completely revamp your data ingestion layer when you need to grow.
- High availability. The data ingestion layer should be able to handle the failure of individual components, like disk, network or full virtual machine failures and still be able to deliver data to the data platform.
- Observability. The data ingestion layer should expose critical metrics, like data throughput and latency, to external monitoring tools. Most of these metrics should be stored in the central metadata repository. Some of the more technical metrics, like memory, CPU or disk use, might be exposed to the monitoring tools directly. It's important to make sure that the data ingestion layer doesn't act as a black box if you want visibility into the movement of your data into the data platform; this matters for monitoring and troubleshooting purposes.
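Here is the pluggable-design sketch promised above: a minimal Python example of a connector registry, assuming you are building the ingestion layer yourself rather than buying a tool. The SourceConnector interface, the registry, and the "csv_over_ftp" connector are illustrative names, not part of any specific product.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Type


class SourceConnector(ABC):
    """Base interface every ingestion connector implements."""

    @abstractmethod
    def read(self) -> Iterable[dict]:
        """Yield records from the source system."""


# A simple registry lets new connector types be added
# without touching the core ingestion code.
CONNECTORS: Dict[str, Type[SourceConnector]] = {}


def register_connector(name: str):
    def wrapper(cls: Type[SourceConnector]) -> Type[SourceConnector]:
        CONNECTORS[name] = cls
        return cls
    return wrapper


@register_connector("csv_over_ftp")
class CsvFtpConnector(SourceConnector):
    def __init__(self, path: str):
        self.path = path

    def read(self) -> Iterable[dict]:
        # A real implementation would download and parse the file.
        yield {"example": "record", "path": self.path}


def ingest(source_type: str, **config) -> Iterable[dict]:
    """Core ingestion code only knows about the registry, not concrete connectors."""
    return CONNECTORS[source_type](**config).read()


for record in ingest("csv_over_ftp", path="/exports/orders.csv"):
    print(record)
```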
Fast and slow storage
The data ingestion layer usually doesn't store any data (though it may use a transient cache), so once data has passed through it, the data must be stored reliably. The storage layer in the data platform architecture is responsible for persisting data for long-term consumption. It has two types of storage, fast and slow, as shown in the diagram below.
Figure 5. The data storage layer persists data for consumption using Fast and Slow storage
We'll use the terms slow and fast storage throughout this article to differentiate between cloud storage services that are optimized for larger files (tens of MBs and more) and those optimized for storing smaller bits of data (typically KBs) with much higher performance characteristics. The latter systems are also sometimes referred to as message buses, distributed logs, or queues with persistence. Fast and slow here isn't a reference to a specific hardware characteristic, like the difference between HDD and SSD drives, but rather to the design of the storage software and the use cases it's targeted at. In addition, frameworks that allow you to process data in real time (Dataflow, Kafka Streams, etc.) are tied to a specific storage system: if you want to do real-time processing, you need to work directly with the fast storage layer.
The storage layer in the data platform must perform the following tasks:
- Store data for both long term and short term
- Make data available for consumption either in batch or streaming modes
One of the benefits of cloud is that storage is inexpensive and storing data for years or even decades is feasible. Having all that data available gives you many options with the ability to repurpose it for new analytic use cases like machine learning at the top of the list.
In our cloud data platform architecture, the data storage layer is split into two distinct components: slow and fast storage. "Slow" storage is your main storage for archive and persistent data. This is where data is stored for days, months, and often years or even decades. In a cloud environment, this type of storage is available as a service from the cloud vendors as an object store that allows you to cost-effectively store all kinds of data and supports fast reading of large volumes of data.
The main benefit of using an object store for long term storage is that in the cloud you don’t have any compute associated directly with the storage. For example, you don’t need to provision and pay for a new virtual machine if you want to increase the capacity of your object store. Cloud vendors grow and shrink the capacity of your storage in response to the data you upload or delete. This makes this type of storage cost efficient.
The main downside of object stores is that they don't support low-latency access. This means that for streaming data, which operates on a single message or a single data point at a time, object stores don't provide the necessary response time. The difference between uploading a 1TB file with JSON data to the object store (batch) and uploading the same volume as a billion single JSON documents, one at a time (streaming), is significant.
For streaming use cases a cloud data platform requires a different type of storage. We call it “fast” storage because it can accommodate low latency read/write operations on a single message. Most associate storage of this type with Apache Kafka, but there are also services from cloud vendors which have similar characteristics.
Fast storage brings the low latency required for streaming data ingestion, but it usually means that some compute capacity is associated with the storage. For example, in a Kafka cluster you need to add new machines with RAM, CPU and disk if you want to increase your fast storage capacity. This means that the cost of fast cloud storage is significantly higher than the cost of slow storage. In practice you would configure a data retention policy, where your fast storage only stores a certain amount of data (one day, one week or one month, depending on your data volumes). The data is transferred to a permanent location on the slow storage and purged from the fast storage per the policy.
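As a rough illustration of this fast-to-slow handoff, the sketch below consumes messages from a Kafka topic and writes them to an object store in large batches, relying on the broker's own retention policy to purge the fast storage. The topic, bucket and group names are placeholders, and the snippet assumes the confluent-kafka and boto3 client libraries plus working credentials; it is not a production-ready offloader.

```python
import json

import boto3
from confluent_kafka import Consumer

# Assumed names: the "events" topic, the "datalake-raw" bucket and the
# consumer group id are placeholders for whatever your platform uses.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "lake-offloader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["events"])
s3 = boto3.client("s3")

batch, batch_size = [], 10_000
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    batch.append(json.loads(msg.value()))
    if len(batch) >= batch_size:
        # Write one large object instead of millions of tiny ones.
        key = f"raw/events/offset={msg.offset()}.json"
        body = "\n".join(json.dumps(record) for record in batch)
        s3.put_object(Bucket="datalake-raw", Key=key, Body=body.encode("utf-8"))
        batch.clear()
        # Kafka itself purges old messages per its retention policy
        # (e.g. retention.ms on the topic), so no explicit delete is needed here.
```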
The storage layer should have the following properties:
- Reliable. Both slow and fast storage should be able to persist data in the face of various failures.
- Scalable. You should be able to add extra storage capacity with minimal effort.
- Performant. You should be able to read large volumes of data with high enough throughput from slow storage or read/write single messages with low latency to the fast storage.
- Cost efficient. You should be able to apply a data retention policy to adjust the storage mix and optimize costs.
Processing layer
The processing layer, highlighted in the diagram below, is the heart of the data platform implementation. This is where all the required business logic is applied and all the data validations and data transformations take place. The processing layer also plays an important role in providing ad-hoc access to the data in the data platform.
Figure 6. The processing layer is where business logic is applied and all data validations and data transformations take place as well as providing ad-hoc access to data
The processing layer should be able to perform the following tasks:
- Read data in batch or streaming modes from storage and apply various types of business logic
- Provide a way for data analysts and data scientists to work with data in the data platform in an interactive fashion
The processing layer is responsible for reading data from storage, applying some calculations on it and then saving it back to storage for further consumption. This layer should be able to work with both slow and fast data storage. This means that the services or frameworks we choose to implement this layer with should have support for both batch processing of files stored in the slow storage as well as “one message at a time” processing from the fast storage.
Today, there are open source frameworks and cloud services that allow you to process data from both fast and slow storage. A good example is the open source Apache Beam project and Dataflow, a Google service that provides a managed serverless execution environment for Apache Beam jobs. Apache Beam supports both batch and real-time processing models within the same framework. Generally, a layer in the data platform doesn't have to be implemented using a single cloud service or software product; often you'll find that using specialized solutions for batch and streaming processing gives you better results than using a single multi-purpose tool.
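Below is a minimal Apache Beam sketch of a batch job reading raw JSON files from slow (object) storage, cleaning them, and writing the results back. The file paths and the cleaning logic are made up for illustration; the point is that the same Map transforms could be fed from a streaming source (for example Pub/Sub) and run on Dataflow by changing the pipeline options, without rewriting the business logic.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_and_clean(line: str) -> dict:
    """Business logic applied identically in batch or streaming mode."""
    record = json.loads(line)
    record["amount"] = float(record.get("amount", 0))
    return record


# Batch mode: read files from slow (object) storage. The same transforms
# could be fed from a streaming source instead, and the pipeline could run
# on Dataflow by choosing that runner in PipelineOptions; both are
# assumptions, not requirements.
options = PipelineOptions()  # defaults to the local DirectRunner
with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadRaw" >> beam.io.ReadFromText("gs://my-lake/raw/events/*.json")
        | "Clean" >> beam.Map(parse_and_clean)
        | "Serialize" >> beam.Map(json.dumps)
        | "WriteStaging" >> beam.io.WriteToText("gs://my-lake/staging/events")
    )
```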
The processing layer should have the following properties:
- Scale beyond a single computer. The data processing framework or cloud service should be able to work efficiently with data sizes ranging from megabytes to terabytes or petabytes.
- Support both batch and real-time streaming models. Sometimes it makes sense to use two different tools for this.
- Support most popular programming languages like Python, Java or Scala
- Provide an SQL interface. This is more of a "nice to have" requirement. A lot of analytics, especially in ad-hoc scenarios, is done using SQL, and a framework that supports SQL significantly increases the productivity of your analysts, data scientists and data engineers.
Technical Metadata layer
Technical metadata, as opposed to business metadata, typically includes (but isn't limited to) schema information from the data sources; the status of the ingestion and transformation pipelines (success, failure, error rates); statistics about ingested and processed data, like row counts; and lineage information for data transformation pipelines. As shown in the diagram below, the metadata layer is central to the data platform, and its contents are kept in a metadata store.
Figure 7. Metadata layer stores information about the status of the data platform layers, needed for automation, monitoring and alerting and developer productivity
A data platform metadata store performs the following tasks:
- Stores information about the activity and status of the different data platform layers
- Provides an interface for layers to fetch, add and update metadata in the store
This technical metadata is important for automation, monitoring and alerting, and developer productivity. Because our data platform design consists of multiple layers that sometimes don't communicate directly with each other, we need a repository that stores the state of these layers. This allows, for example, the data processing layer to know which data is available for processing by checking the metadata layer instead of trying to communicate with the ingestion layer directly. It lets us decouple the layers from each other, reducing the complexities associated with interdependencies.
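The toy sketch below shows what this decoupling can look like: the processing layer asks a hypothetical, in-memory metadata client which batches are ready and marks them processed, without ever calling the ingestion layer. The class and field names are illustrative only.

```python
class MetadataClient:
    """Hypothetical metadata client; the records here are hard-coded for illustration."""

    def __init__(self):
        self.batches = [
            {"source": "crm_export", "batch_id": "2020-01-14", "processed": True},
            {"source": "crm_export", "batch_id": "2020-01-15", "processed": False},
        ]

    def unprocessed_batches(self, source: str):
        return [b for b in self.batches
                if b["source"] == source and not b["processed"]]

    def mark_processed(self, source: str, batch_id: str) -> None:
        for b in self.batches:
            if b["source"] == source and b["batch_id"] == batch_id:
                b["processed"] = True


def run_processing_job(metadata: MetadataClient) -> None:
    # The processing layer never talks to the ingestion layer directly;
    # it only asks the metadata layer what is ready to be processed.
    for batch in metadata.unprocessed_batches("crm_export"):
        print(f"processing batch {batch['batch_id']}")  # transformation logic would go here
        metadata.mark_processed("crm_export", batch["batch_id"])


run_processing_job(MetadataClient())
```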
Another type of metadata that you might be familiar with is business metadata, which is often represented by a data catalog that stores information about what data means from a business perspective. Business metadata is an important component of an overall data strategy because it allows for easier data discovery and communication. Business metadata stores and data catalogs are well represented by multiple third-party products and can be plugged into the layered data platform design as another layer. Going into the details of business metadata solutions is outside the scope of this article.
A data platform metadata store should have the following properties:
- Scalable. Hundreds (or sometimes thousands!) of individual tasks can be running in the data platform environment. A metadata layer must be able to scale to provide fast responses to all of the tasks.
- Highly-available. The metadata layer can become a single point of failure in your data platform pipelines. If the processing layer needs to fetch information about which data is available for processing from the metadata layer and the metadata service isn’t responding, the processing pipeline will either fail or get stuck. This can trigger cascading failures if there are other pipelines depending on the one that failed.
- Extendable. No strict rules determine which metadata you should store in this layer. You may find that you often want to store some business-specific information in the metadata layer, such as how many rows with certain column values you have. Your metadata layer should allow you to easily store this extra information.
Technical metadata management in the data platform is a relatively new topic, and few existing solutions can fulfill all the tasks described above. For example, the Confluent Schema Registry allows you to store, fetch and update schema definitions, but it doesn't allow you to store any other types of metadata. Some metadata layer roles can be performed by ETL overlay services such as AWS Glue. As it stands today, you need a combination of different tools and services to implement a fully functional technical metadata layer.
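For example, here is a small sketch of registering and fetching a schema through the Confluent Schema Registry's REST API using the requests library. The registry URL, subject name, and schema are placeholders for your own environment.

```python
import json

import requests

# Assumed registry URL and subject name; adjust to your deployment.
REGISTRY = "http://schema-registry:8081"
SUBJECT = "crm_export-value"
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

avro_schema = {
    "type": "record",
    "name": "CrmExport",
    "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Register a new schema version for the subject.
resp = requests.post(
    f"{REGISTRY}/subjects/{SUBJECT}/versions",
    headers=HEADERS,
    data=json.dumps({"schema": json.dumps(avro_schema)}),
)
print("registered schema id:", resp.json()["id"])

# Fetch the latest registered version, e.g. from a processing job.
latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest").json()
print("latest version:", latest["version"])
```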
The Serving Layer and data consumers
The serving layer delivers the output of analytics processing to the various data consumers.
As shown in the diagram below, the Serving layer is responsible for:
- Serving data to consumers who expect a relational data structure and full SQL support via a data warehouse
- Serving data to consumers who want to access data from storage without going through a data warehouse
Figure 8. The serving layer delivers the output of analytics processing to data consumers
In most cases a cloud data platform doesn't replace the need for a data warehouse as a data consumption option. Data warehouses provide a data access point for data consumers that require full SQL support and expect data to be presented in a relational format. Such data consumers include various existing dashboarding and Business Intelligence applications, as well as data analysts or power business users familiar with SQL. The diagram below shows the data warehouse as the access point for these consumers.
Figure 9. A data warehouse is included in the serving layer
A serving layer almost always includes a data warehouse, which should have the following properties:
- Scalable and reliable. A cloud warehouse should work efficiently with both large and small data sets and scale beyond the capacity of a single computer. It also should be able to continue to serve data in the face of inevitable failures of individual components.
- NoOps. Preferably a cloud warehouse should require as little tuning or operational maintenance as possible
- Elastic cost model. Another highly desired property of a cloud warehouse is the ability to scale both up and down in response to the load. In many traditional BI workloads, the warehouse is mostly busy during business days and may experience only a small portion of the load during off-hours. A cloud cost model should be able to reflect this.
In a modern data architecture, although it's likely you have data consumers who want a relational data structure and SQL for data access, other ways of accessing data are becoming increasingly popular.
Some data consumers require direct access to the data in the lake, as shown in the diagram below.
Figure 10. Direct data lake access allows consumers to bypass the serving layer and work directly with raw, unprocessed data
Usually data science, data exploration and experimentation use cases fall into this category. Direct access to the data in the lake unlocks the ability to work with the raw, unprocessed data. It also moves experimentation workloads outside the warehouse to avoid performance impacts: your data warehouse may be serving critical business reports and dashboards, and the last thing you want is for it to suddenly slow down because a data scientist decided to read the last ten years of data from it. There are multiple ways to provide direct access to the data in the lake. Some cloud providers offer an SQL engine that can run queries directly on the files in cloud storage; in other cases you may use SparkSQL to achieve the same goal. Finally, it's not uncommon for data science workloads to copy the required files from the data platform into their experimentation environments, such as notebooks or dedicated data science VMs.
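As a sketch of the SparkSQL option, the snippet below reads raw Parquet files straight from object storage and queries them with SQL, never touching the warehouse. The bucket path and column names are assumptions, and the cluster is assumed to be configured with credentials for the object store (for example, via the s3a connector).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ad-hoc-exploration").getOrCreate()

# Read raw Parquet files directly from slow storage, bypassing the warehouse.
events = spark.read.parquet("s3a://datalake-raw/events/")
events.createOrReplaceTempView("raw_events")

# Data scientists can then explore with plain SQL without putting any load
# on the production data warehouse.
daily_counts = spark.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_date
    ORDER BY event_date
""")
daily_counts.show()
```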
NOTE: Although it is possible to have a data platform without a data warehouse, it's likely that every business has business users who want access to the data in the lake, and these users are best served via a data warehouse as opposed to direct access to the lake. As such, when we talk about data platforms, we always assume they feed a data warehouse for data consumption.
Data consumers aren't always humans. The results of real-time analytics, which are calculated as the data is received one message at a time, are rarely intended to be consumed by a human; no one spends their day watching metrics on a dashboard change every second. Outputs from a real-time analytics pipeline are usually consumed by other applications, such as marketing activation systems, ecommerce recommendation systems that decide which item to recommend to a user as they're shopping, or ad bidding systems where the balance between ad relevance and cost changes in milliseconds.
Figure 11. Consumers of real-time data are often other applications, not humans
Such programmatic data consumers require a dedicated API to consume data from the lake in real time. You can often use the APIs built into the real-time data processing engine of your choice, or you can implement a separate API layer that allows multiple programmatic consumers to access real-time data using the same interface. This separate API layer approach scales better when you have multiple programmatic data consumers, but it also requires significantly more engineering effort to implement and maintain.
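A minimal sketch of such a separate API layer might look like the Flask service below, which serves the latest recommendations that a real-time pipeline keeps up to date in a low-latency store. The endpoint shape, the in-memory dictionary standing in for that store, and the item IDs are all illustrative assumptions, not a recommended design.

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for a low-latency store (e.g. Redis or a key-value table)
# that the real-time processing pipeline keeps up to date.
latest_recommendations = {
    "user-42": ["sku-1001", "sku-2044", "sku-3310"],
}


@app.route("/recommendations/<user_id>")
def recommendations(user_id: str):
    """Programmatic consumers (e.g. an ecommerce frontend) call this endpoint
    instead of reading from fast storage directly."""
    return jsonify({
        "user_id": user_id,
        "items": latest_recommendations.get(user_id, []),
    })


if __name__ == "__main__":
    app.run(port=8080)
```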
Orchestration and ETL overlay layers
Two components of our cloud data platform architecture require special consideration, or rather a slightly different approach to thinking about them: the Orchestration and ETL overlay layers, highlighted in the diagram below. The reason these layers require special treatment is that in many cloud data platform implementations the responsibilities of these layers are spread across many different tools.
Figure 12. Orchestration and ETL overlay layers have responsibilities which are spread across many different tools
First let’s take a look at the orchestration layer.
In a cloud data platform architecture, the Orchestration layer is responsible for the following tasks:
- Coordinate multiple data processing jobs according to a dependency graph (a list of dependencies for each data processing job that includes which sources are required for each job and whether a job depends on other jobs)
- Handle job failures and retries
Figure 13. The orchestration overlay allows data engineers to construct complex data flows with multiple inter-dependencies
As we've now seen, a modern cloud data platform architecture includes multiple loosely coupled layers that communicate with each other via the Metadata layer. The missing piece in this design is a component that can coordinate work across multiple layers. While the Metadata layer acts as a repository for various status information and statistics about the data pipelines, the Orchestration layer is an action-oriented component. Its main function is to allow data engineers to construct complex data flows with multiple inter-dependencies.
Imagine the following scenario, outlined in the diagram below:
- We have three independent data sources to bring into our data platform: S1, S2 and S3
- We want a data transformation job that combines data from sources S1 and S2. Let’s call this job J1
- Finally, we have another transformation job, J2, that combines results from J1 with the data from source S3
Figure 14. Example of jobs and data dependency graph
As you can see from this example, jobs J1 and J2 can't run independently of each other. The three sources can deliver their latest data on different schedules: for example, sources S1 and S2 could be real-time sources delivering data as it becomes available, while S3 could be a third-party batch source that only makes new data available once a day. If we don't coordinate J1 and J2 somehow, our final data product may have incorrect or incomplete results.
There are several ways to address this challenge. One approach is to combine jobs J1 and J2 into one job and schedule it in such a way that it only runs when the latest data becomes available from all three sources (you can use the Metadata service for this!). This approach sounds straightforward, but what if you need to add more steps to this job? Or what happens if job J2 is a common task that should be shared across multiple different jobs? As the complexity of your data pipeline grows, developing and maintaining monolithic data processing jobs becomes a challenge. Combining J1 and J2 has all the downsides of a monolithic design: it's hard to make changes to specific components, hard to test, and a challenge for different teams to collaborate on.
An alternative approach is to coordinate jobs J1 and J2 using an external orchestration mechanism:
Figure 15. An orchestration layer coordinates multiple jobs and allows job implementations to remain independent of each other.
An orchestration layer, shown in the diagram above, is responsible for coordinating multiple jobs based on when the required input data is available from an external source, or when an upstream dependency is met (for example, J1 needs to complete before J2 can start). In this case, job implementations remain independent of each other; they can be developed, tested and changed separately, and the orchestration layer maintains what is called a dependency graph: a list of dependencies for each data processing job that includes which sources are required for each job and whether a job depends on other jobs. A dependency graph needs to be changed only when the logical flow of the data changes, for example when a new step in the processing is introduced. It doesn't have to be changed when the implementation of a certain step changes.
In a large data platform implementation, the dependency graph can contain hundreds and sometimes thousands of dependencies. In such implementations there are usually multiple teams involved in developing and maintaining the data processing pipelines. The logical separation of the jobs and the dependency graph makes it easier for these teams to make changes to parts of the system, without having to impact the larger data platform.
An Orchestration layer should have the following properties:
- Scale. It should be able to grow from a handful to thousands of tasks and handle large dependency graphs efficiently.
- High availability. If the orchestration layer is down or unresponsive, your data processing jobs won't run
- Maintainable. It should have a dependency graph which is easy to describe and maintain
- Transparency. It should provide visibility into jobs statuses, history of execution and other observability metrics. This is important for monitoring and debugging purposes
Several implementations of the Orchestration layer are available today. One of the most popular is Apache Airflow, an open source job scheduler and orchestration mechanism. Airflow satisfies most of the properties listed above, and it's available as a managed cloud service on Google Cloud Platform called Cloud Composer. Other tools, like Azkaban and Oozie, can serve the same purpose, but both were created specifically as job orchestration tools for Hadoop and don't fit into flexible cloud environments as well as Airflow does.
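To tie this back to the J1/J2 example above, here is a minimal Airflow sketch expressing that dependency graph. The DAG id, task names, and the daily schedule are assumptions (chosen because S3 delivers new data once a day), the callables are empty placeholders for the real transformations, and the import paths follow Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def combine_s1_s2(**context):
    """J1: join the latest data from sources S1 and S2 (placeholder)."""
    ...


def combine_j1_s3(**context):
    """J2: join J1's output with the latest batch from source S3 (placeholder)."""
    ...


with DAG(
    dag_id="s1_s2_s3_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",  # S3 only delivers new data once a day
    catchup=False,
) as dag:
    j1 = PythonOperator(task_id="j1_combine_s1_s2", python_callable=combine_s1_s2)
    j2 = PythonOperator(task_id="j2_combine_j1_s3", python_callable=combine_j1_s3)

    # The dependency graph: J2 runs only after J1 has completed successfully.
    j1 >> j2
```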
When it comes to native cloud services, different cloud providers approach the orchestration problem differently. As mentioned above, Google adopted Airflow and makes it available as a managed service, simplifying the operational aspect of managing the orchestration layer. Amazon and Microsoft include some of the orchestration features in their ETL overlay products.
An ETL tools overlay, highlighted in the diagram below, is a product or a suite of products whose main purpose is to make the implementation and maintenance of cloud data pipelines easier. These products absorb some of the responsibilities of the various data platform architecture layers and provide a simplified mechanism to develop and manage specific implementations. These tools have a user interface and allow data pipelines to be developed and deployed with little or no code.
Figure 16. An ETL Tools overlay can be used to implement many of the layer functions in a cloud data platform architecture
ETL overlay tools are usually responsible for:
- Adding and configuring data ingestions from multiple sources (Ingestion layer)
- Creating data processing pipelines (Processing layer)
- Storing some of the metadata about the pipelines (Metadata layer)
- Coordinating multiple jobs (Orchestration layer)
As you can see, an ETL overlay tool can implement almost all layers in the cloud data platform architecture.
Does this mean that you can implement a data platform using a single tool and not worry about implementing and managing separate layers? The answer, perhaps unsurprisingly, is that “it depends”. The main question you have to keep in mind when deciding to fully rely on an ETL overlay service from a cloud vendor (or a similar third-party solution) is “How easy is it to extend it?” Is it possible to add new components for data ingestion? Do you have to process the data using only ETL service facilities or can you call external data processing components? Finally, it’s important to understand whether you can integrate other third-party services or open source tools with this ETL service.
The reason these questions are important is because no system is static. You may find that using an ETL service is a great way to get you started and can provide significant time/cost savings. As your data platform grows you may find yourself in a situation where the ETL service or tool doesn’t allow you to easily implement some needed functionality. If this ETL service doesn’t provide any options for extending its functionality or integrating with other solutions, then your only choice is to build a workaround that bypasses the ETL layer completely.
In our experience, at some point these workarounds become as complex as the initial solution, and you end up with what we fondly call a "spaghetti architecture". We certainly don't like speaking poorly of one of our favorite meals, but a spaghetti architecture is the result of different components of the system becoming more and more tangled together, making the system harder to maintain. In a spaghetti architecture, workarounds exist not because they fit into the overall design but because they have to compensate for an ETL service's limitations.
An ETL overlay should have the following properties:
- Extensibility. It should be possible to add your own components to the system.
- Integrations. It's important for a modern ETL tool to be able to delegate some of its tasks to external systems. For example, if all data processing is done inside the ETL service using a black-box engine, then you won't be able to address any of the limitations or idiosyncrasies of that engine.
- Automation maturity. Many ETL solutions offer a "no code required" UI-driven experience. This is great for rapid prototyping, but when thinking about a production implementation, consider how the tool will fit into your organization's continuous integration / continuous delivery practices. For example, how will a change to a pipeline be tested and validated automatically before being promoted to a production environment? If the ETL tool doesn't have an API or an easy-to-use configuration language, then your automation options are limited.
- Cloud architecture fit. The ETL tools market is mature, but many open source and commercial tools were built in the age of on-prem solutions and monolithic data warehouses. Although all of them offer some form of cloud integration, you need to carefully evaluate whether a given solution allows you to use cloud capabilities to their fullest extent. For example, some existing ETL tools are only capable of processing data using SQL running in the warehouse, and others use their own processing engines, which are less scalable and robust than Spark or Beam.
When it comes to ETL overlay solutions that aren't part of cloud vendor services, there are many third-party options to choose from. Talend is one of the more popular solutions today. It uses an "open core" model, meaning that the core functionality of Talend is open source and free to use for development and prototyping; using Talend for production workloads requires a commercial license. Informatica is another ETL tool popular among larger organizations.
That’s all for this article.