Description: C:\Users\Chris\Desktop\Manning\Images\cover images\Spark-shadow.jpg

By Jean Georges Perrin

This is the second in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a JSON file.

Save 37% off Spark in Action: With examples in Java. Just enter code fccperrin into the discount code box at checkout at manning.com.

In part 1, we discussed ingesting data from a CSV file. In this part, we’re going to discuss ingesting data from a JSON file.

Ingesting a JSON file

Over the last few years, JSON (JavaScript Object Notation) has become the new cool kid in town in terms of data exchange, mainly after REST (Representational State Transfer) supplanted SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language, written in XML) in web services-oriented architecture.

JSON is easier to read, less verbose, and brings less constraints than XML. It supports nested constructs like arrays and objects. You can find out more about JSON at https://www.json.org.

A sub-format of JSON is called JSON Lines. JSON Lines (http://jsonlines.org) stores a record on one line, easing parsing and readability. Here’s a small example copied from their websites, as you can see it supports Unicode.

  
 {"name": "Gilbert", "wins": [["straight", "7"], ["one pair", "10"]]}
 {"name": "Alexa", "wins": [["two pair", "4"], ["two pair", "9"]]}
 {"name": "May", "wins": []}
 {"name": "Deloise", "wins": [["three of a kind", "5"]]}
  

Before Spark v2.2.0, JSON Lines was the only JSON format that Spark could read.


Figure 1 Spark is ingesting a JSON Lines file. Note that the records are in JSON but on one line each. After the ingestion, Spark displays some records and the schema.

For your first JSON ingestion, you’re going to use the foreclosure dataset from the city of Durham, NC from 2006 to 2016. You can freely download their datasets from their portal at https://OpenDurham.nc.gov.

Open Durham is the open data portal of the city and county of Durham, NC. They use OpenDataSoft’s solution, which provides data as JSON Lines. For this example, I used Spark v2.2.0 on MacOS X v 10.13.2 with Java 8. The dataset was downloaded in January 2018.

Listing 1 shows three records. Listing 2 shows a pretty print of the first record.

Listing 1 Foreclosure data: two first records and the last record

  
 [{"datasetid": "foreclosure-2006-2016", "recordid": "629979c85b1cc68c1d4ee8
 [CA]cc351050bfe3592c62", "fields": {"parcel_number": "110138", "geocode": [
 [CA]36.0013755, -78.8922549], "address": "217 E CORPORATION ST", "year": "2
 [CA]006"}, "geometry": {"type": "Point", "coordinates": [-78.8922549, 36.00
 [CA]13755]}, "record_timestamp": "2017-03-06T12:41:48-05:00"},
 {"datasetid": "foreclosure-2006-2016", "recordid": "e3cce8bbc3c9b804cbd87e2
 [CA]67a6ff121285274e0", "fields": {"parcel_number": "110535", "geocode": [3
 [CA]5.995797, -78.895396], "address": "401 N QUEEN ST", "year": "2006"}, "g
 [CA]eometry": {"type": "Point", "coordinates": [-78.895396, 35.995797]},
 …
 {"datasetid": "foreclosure-2006-2016", "recordid": "1d57ed470d533985d5a3c3d
 [CA]fb37c294eaa775ccf", "fields": {"parcel_number": "194912", "geocode": [3
 [CA]5.955832, -78.742107], "address": "2516 COLEY RD", "year": "2016"}, "ge
 [CA]ometry": {"type": "Point", "coordinates": [-78.742107, 35.955832]}, "re
 [CA]cord_timestamp": "2017-03-06T12:41:48-05:00"}]
  

Listing 2 shows an indented (pretty print via JSONLint.com and Eclipse) version of the first record, and you can see the structure: field names, arrays, and nested structure.

Listing 2 Foreclosure data: pretty print of the first record

  
 [
   {
     "datasetid": "foreclosure-2006-2016",
     "recordid": "629979c85b1cc68c1d4ee8cc351050bfe3592c62",
     "fields": {
       "parcel_number": "110138",
       "geocode": [
         36.0013755,
         -78.8922549
       ],
       "address": "217 E CORPORATION ST",
       "year": "2006"
     },
     "geometry": {
       "type": "Point",
       "coordinates": [
         -78.8922549,
         36.0013755
       ]
     },
     "record_timestamp": "2017-03-06T12:41:48-05:00"
   }
 …
 ]
  
  

Desired output

Listing 3 shows the output of a dataframes’ data and schema after ingesting a JSON Lines document.

Listing 3 Displaying foreclosure records and schema

  
 +-------------+-------------+-------------+----------------+-------------+
 |    datasetid|       fields|     geometry|record_timestamp|     recordid|
 +-------------+-------------+-------------+----------------+-------------+
 |foreclosur...|[217 E COR...|[WrappedAr...|   2017-03-06...|629979c85b...|
 |foreclosur...|[401 N QUE...|[WrappedAr...|   2017-03-06...|e3cce8bbc3...|
 |foreclosur...|[403 N QUE...|[WrappedAr...|   2017-03-06...|311559ebfe...|
 |foreclosur...|[918 GILBE...|[WrappedAr...|   2017-03-06...|7ec0761bd3...|
 |foreclosur...|[721 LIBER...|[WrappedAr...|   2017-03-06...|c81ae2921f...|
 +-------------+-------------+-------------+----------------+-------------+
 only showing top 5 rows
  
 root
  |-- datasetid: string (nullable = true)          
  |-- fields: struct (nullable = true)             
  |    |-- address: string (nullable = true)
  |    |-- geocode: array (nullable = true)        
  |    |    |-- element: double (containsNull = true)
  |    |-- parcel_number: string (nullable = true) 
  |    |-- year: string (nullable = true)          
  |-- geometry: struct (nullable = true)
  |    |-- coordinates: array (nullable = true)    
  |    |    |-- element: double (containsNull = true)
  |    |-- type: string (nullable = true)          
  |-- record_timestamp: string (nullable = true)   
  |-- recordid: string (nullable = true)           
  |-- year: string (nullable = true)               
  

The “fields” field is a structure with nested fields

The dataframe can contain arrays

Every field in which Spark can’t precisely identify the data type, it then uses a string.

When you see a piece of data like that, aren’t you tempted to group by the year to see the evolution of foreclosures or display each event on a map to see if there are areas more subject to foreclosures, and compare with average incomes in this area? This is good: let your inner-data scientist spirit come out!

Code

As you can imagine, reading JSON isn’t much more complex that ingesting a CSV file, as you’ll see in listing 4.

isting 4 – JsonLinesToDataframeApp.java

  
 package net.jgp.books.sparkWithJava.ch07.lab_200.json_ingestion;
  
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
  
 public class JsonLinesToDataframeApp {
  
   public static void main(String[] args) {
     JsonLinesToDataframeApp app =
         new JsonLinesToDataframeApp();
     app.start();
   }
  
   private void start() {
     SparkSession spark = SparkSession.builder()
         .appName("JSON Lines to Dataframe")
         .master("local")
         .getOrCreate();
  
     Dataset<Row> df = spark.read().format("json") 
         .load("data/durham-nc-foreclosure-2006-2016.json");
  
     df.show(5, 13);
     df.printSchema();
   }
 }
  

That’s it! This is the only change you have to do to ingest JSON.

Easier than CSV for sure! Stay tuned for part three. If you’re interested in some more general information about the book, check it out on liveBook and see this slide deck.