From Spark in Action, Second Ed. by Jean Georges Perrin
In this article, you’ll learn what a dataframe is, how it’s organized, and about immutability.
Save 37% off Spark in Action, Second Ed. Just enter code fccperrin into the discount code box at checkout at manning.com.
The essential role of the dataframe in Spark
A dataframe is both a data structure and an API. In this article, we’re going to learn about what the dataframe is and how Spark uses it.
Spark’s dataframe API is used within Spark SQL, streaming, machine learning, and GraphX to manipulate graph-based data structure within Spark, drastically simplifying the access to those technologies via a unified API. You won’t need to learn an API for each sub-library.
Figure 1. Only having to learn one API to use Spark SQL, streaming, machine and deep learning, and graph-based analytics, makes developers happier!
It’s probably odd to call a dataframe majestic, but this qualifier suits it perfectly. Like majestic art work attracts curiosity, like a majestic oak dominates the forest, like majestic walls protect the castle, the dataframe is majestic in the world of Spark.
Organization of a dataframe
In this section, you’ll learn about how a dataframe organizes data. A dataframe is a set of records organized into named columns. It’s equivalent to a table in a relational database or a in Java.
Figure 2 illustrates a dataframe.
Figure 2. A full dataframe with its schema and data: the dataframe is implemented as a dataset of rows (Dataset<Row>). Each column is named and typed. The data itself are in a partition.
Dataframes can be constructed from a wide array of sources, such as files, databases, or custom data sources. The key concept of the dataframe is its API, which is available in Java, Python, Scala, and R. In Java, a dataframe is represented by a dataset of rows: .
Dataframes include the schema in a form of , which can be used for introspection. Dataframes also include a method to debug more quickly your dataframes. Enough theory, let’s practice!
Immutability isn’t a swear word
Dataframes, as well as datasets and RDDs (resilient distributed datasets), are considered immutable storage.
Immutability is defined as unchangeable. When applied to an object, it means that its state can’t be modified after it’s created.
I think this is counter intuitive: when I first started working with Spark, I had a difficult time embracing the concept: “let’s work with this brilliant piece of technology designed for data processing, but where the data are immutable. You expect me to process data, but they can’t be changed?”
Figure 3 gives an explanation: in its first state, data are immutable, then you start modifying them, but Spark only stores the steps of your transformation, not every step of the transformed data.
Let me rephrase that: Spark stores the initial state of the data, in an immutable way, and then keeps the recipe (the list the transformation). The intermediate data aren’t stored.
Figure 3. A typical flow: data are initially stored in an immutable way, after, what’s stored is the recipe for transformations, not the different stages of the data.
The “why” is easy to understand: figure 3 illustrates a typical Spark flow with one node, but figure 4 illustrates with more nodes.
Figure 4. As you add nodes, you can start thinking about the complexity of data synchronization. By keeping only the recipe (or list of transformations), you reduce your dependency on storage and increase your reliability (resilience). No data are stored in stage two.
Immutability finds its important role in a distributed way:
- either one stores data and each modification is done immediately on every node, like with a relational database does, or
- one keeps the data in sync over the nodes and shares the transformation recipe with the different nodes.
Spark uses the second solution, as it’s faster to sync a recipe that the whole data over each node. Catalyst is the kid in charge of optimization in Spark processing. Immutability and recipes are cornerstone to this optimization engine.
Although immutability is brilliantly used by Spark as the foundation to optimize data processing, you won’t need to think too much about it as you develop applications.
And that’s where we will stop for now.
If you want to learn more about the book, check it out on liveBook here and see this slide deck.