From Spark in Action, 2nd Ed. by Jean Georges Perrin

This is the third in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting an XML file.

Save 37% off Spark in Action, 2nd Ed. Just enter code fccperrin into the discount code box at checkout at manning.com.

In part 1 we dealt with ingesting data from a CSV file, and in part 2 we ingested from a JSON file. In this part we’re going to talk about ingesting data from an XML file.

Ingesting an XML file

In this section, we’ll ingest an XML document containing the NASA patents, then display a few patents and the dataframe’s schema. Note that, in this context, the schema isn’t an XML Schema (or XSD), but a dataframe schema. Quite a few years ago, when I discovered XML, I thought it could become a unified lingua franca of data exchange. XML is:

  • Structured.
  • Extensible.
  • Self-describing.
  • Able to embed validation rules through a DTD (Document Type Definition) or an XSD (XML Schema Definition).
  • A W3C standard. You can read more about XML at https://www.w3.org/XML/.
  • A markup language: like HTML and the other markup languages descended from SGML, it looks like this:
      
     <rootElement>
       <element attribute="attribute's value">
         Some payload in a text element
       </element>
       <element type="without sub nodes"/>
     </rootElement>
    

Unfortunately, XML is verbose and it’s harder to read than JSON. Nevertheless, XML is still widely used and Apache Spark ingests it nicely.
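
For comparison, here is one way the same generic fragment could look in JSON (a hypothetical mapping; there is no single canonical XML-to-JSON conversion). The closing tags disappear, but so does the distinction between attributes and child elements:

 {
   "element": [
     {
       "attribute": "attribute's value",
       "text": "Some payload in a text element"
     },
     { "type": "without sub nodes" }
   ]
 }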

Figure 1 shows a fragment of the XML file and illustrates the process.


Figure 1 Spark ingests an XML file containing NASA patents. Spark uses an external plugin, provided by Databricks, to perform the ingestion. Spark then displays records and the dataframe schema (not to be confused with an XML Schema).


For this XML example, you’re going to ingest the NASA patents. NASA offers various open datasets at https://data.nasa.gov. Listing 1 shows a record of this file.

You can download the NASA patents dataset from https://data.nasa.gov/Raw-Data/NASA-Patents/gquh-watm. For this example, I used Spark v2.2.0 on macOS v10.12.6 with Java 8, as well as Databricks’ XML parser v0.4.1. The dataset was downloaded in January 2018.

Listing 1 – NASA patents (excerpt)

  
 <response>                                                         ❶ 
   <row                                                             ❷ 
       _id="1"                                                      ❸ 
       _uuid="BAC69188-84A6-4D28-951E-FC687ACB6D4A"                 ❸ 
       _position="1"                                                ❸ 
       _address="https://data.nasa.gov/resource/nasa-patents/1">    ❸ 
     <center>NASA Ames Research Center</center>
     <status>Issued</status>
     <case_number>ARC-14048-1</case_number>
     <patent_number>5694939</patent_number>
     <application_sn>08/543,093</application_sn>
     <title>Autogenic-Feedback Training Exercise Method &amp; System</title>
     <patent_expiration_date>2015-10-03T00:00:00</patent_expiration_date>
   </row>
 …
 </response>
  

❶   The root element of your list of patents

❷   The element (or tag) designating our record.

❸   Attributes are prefixed by one underscore (_)

Desired output

Listing 2 shows the dataframe’s data and schema after ingesting the NASA patents as an XML document. You can see that attributes are prefixed by an underscore (_): they already had an underscore as a prefix in the original document, so they have two now (one way to clean this up is sketched right after the listing). Each element’s name is used as a column name.

Listing 2 – NASA patents in a dataframe

  
 +--------------------+----+----------+--------------------+--------------+…
 |           __address|__id|__position|              __uuid|application_sn|…
 +--------------------+----+----------+--------------------+--------------+…
 |https://data.nasa...| 407|       407|2311F785-C00F-422...|    13/033,085|…
 |https://data.nasa...|   1|         1|BAC69188-84A6-4D2...|    08/543,093|…
 |https://data.nasa...|   2|         2|23D6A5BD-26E2-42D...|    09/017,519|…
 |https://data.nasa...|   3|         3|F8052701-E520-43A...|    10/874,003|…
 |https://data.nasa...|   4|         4|20A4C4A9-EEB6-45D...|    09/652,299|…
 +--------------------+----+----------+--------------------+--------------+…
 only showing top 5 rows
  
 root
  |-- __address: string (nullable = true)
  |-- __id: long (nullable = true)
  |-- __position: long (nullable = true)
  |-- __uuid: string (nullable = true)
  |-- application_sn: string (nullable = true)
  |-- case_number: string (nullable = true)
  |-- center: string (nullable = true)
  |-- patent_expiration_date: string (nullable = true)
  |-- patent_number: string (nullable = true)
  |-- status: string (nullable = true)
  |-- title: string (nullable = true)
  
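If the double underscore bothers you, you can clean up the dataframe after ingestion. A minimal sketch, reusing the df from listing 3 and only standard Spark API calls:

 Dataset<Row> cleanDf = df
     .withColumnRenamed("__id", "id")     // rename the double-underscore columns
     .withColumnRenamed("__uuid", "uuid")
     .withColumn("patent_expiration_date", // cast the ISO date string
         df.col("patent_expiration_date").cast("timestamp"));
 cleanDf.printSchema();                    // patent_expiration_date is now a timestamp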

Code

As usual, our code starts with a main() method, which calls a start() method that creates a Spark session. Listing 3 shows the Java code needed to ingest the NASA XML file, then display five records and the dataframe’s schema.

Listing 3 – XmlToDataframeApp.java

  
 package net.jgp.books.sparkWithJava.ch07.lab_300.xml_ingestion;
  
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
  
 public class XmlToDataframeApp {
  
   public static void main(String[] args) {
     XmlToDataframeApp app = new XmlToDataframeApp();
     app.start();
   }
  
   private void start() {
     SparkSession spark = SparkSession.builder()
         .appName("XML to Dataframe")
         .master("local")
         .getOrCreate();
  
     Dataset<Row> df = spark.read().format("xml") ❶ 
         .option("rowTag", "row")                 ❷ 
         .load("data/nasa-patents.xml");
  
     df.show(5);
     df.printSchema();
   }
 }
  

❶   Specify XML as the format. Case doesn’t matter.

❷   The element or tag that indicates a record in the XML file.
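As an aside, the double underscore in listing 2 comes from spark-xml prepending its attribute prefix (an underscore by default) to attribute names that, in this dataset, already begin with one. The spark-xml README documents an attributePrefix option to change this prefix; a minimal sketch, assuming v0.4.1 behaves as documented:

 Dataset<Row> df = spark.read().format("xml")
     .option("rowTag", "row")
     .option("attributePrefix", "attr_") // columns become attr__id, attr__uuid, and so on
     .load("data/nasa-patents.xml");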

I had to modify the original NASA document, as it contained an element with the same name as the records (row) wrapping them. Unfortunately, as of now, Spark can’t untangle this nesting for us. The original structure was:

  
 <response>
   <row>
     <row _id="1" …>
       …
     </row>
     …
   </row>
 </response>
  

If the first child of response had been rows, or anything other than row, I wouldn’t have had to remove it (another option is to rename it, as shown in the sketch below).
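
If you’d rather not edit the file by hand, a small one-off program can rename the wrapper element. A minimal sketch, assuming the original file is saved as data/nasa-patents-original.xml (a hypothetical name) and fits comfortably in memory:

 import java.nio.charset.StandardCharsets;
 import java.nio.file.Files;
 import java.nio.file.Paths;

 public class RenameWrapperApp {
   public static void main(String[] args) throws Exception {
     String xml = new String(
         Files.readAllBytes(Paths.get("data/nasa-patents-original.xml")),
         StandardCharsets.UTF_8);

     // The wrapper opens as a bare <row>; the records all carry
     // attributes (<row _id="1" …>), so the first literal "<row>"
     // is the wrapper's opening tag.
     xml = xml.replaceFirst("<row>", "<rows>");

     // The wrapper's closing tag is the last </row> in the document.
     int last = xml.lastIndexOf("</row>");
     xml = xml.substring(0, last) + "</rows>"
         + xml.substring(last + "</row>".length());

     Files.write(Paths.get("data/nasa-patents.xml"),
         xml.getBytes(StandardCharsets.UTF_8));
   }
 }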

As the parser isn’t part of the standard Spark distribution, you need to add it to your pom.xml, as described in listing 4. To ingest XML, use the spark-xml_2.11 artifact, published by Databricks, in version 0.4.1.

Listing 4 – pom.xml to ingest XML (excerpt)

 …
   <properties>
     
     <scala.version>2.11</scala.version>                    ❶ 
     <spark-xml.version>0.4.1</spark-xml.version>           ❷ 
   </properties>
  
   <dependencies>
 …
     <dependency>
       <groupId>com.databricks</groupId>
       <artifactId>spark-xml_${scala.version}</artifactId>  ❸ 
       <version>${spark-xml.version}</version>              ❹ 
       <exclusions>                                         ❺ 
         <exclusion>
           <groupId>org.slf4j</groupId>
           <artifactId>slf4j-simple</artifactId>
         </exclusion>
       </exclusions>
     </dependency>
 …
   </dependencies>
 …

❶   Scala version against which the XML parser is built

❷   Version of the XML parser

❸   Equivalent to spark-xml_2.11

❹   Equivalent to 0.4.1

❺   Optional: I’ve made a habit of excluding loggers pulled in by other packages, to keep better control over the one I use
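
If you’re experimenting in spark-shell rather than building with Maven, the same dependency can be pulled in with Spark’s --packages flag (assuming access to Maven Central):

 spark-shell --packages com.databricks:spark-xml_2.11:0.4.1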

More details on spark-xml can be found at https://github.com/databricks/spark-xml. Not too complex! Stay tuned for part 4. If you want to learn more about the book, check it out on Manning’s liveBook platform.