By Jean Georges Perrin

This is the second in a series of four articles on the topic of ingesting data from files with Spark. This section deals with ingesting a JSON file.
Save 37% off Spark in Action: With examples in Java. Just enter code fccperrin into the discount code box at checkout at manning.com.
In part 1, we discussed ingesting data from a CSV file. In this part, we’re going to discuss ingesting data from a JSON file.
Ingesting a JSON file
Over the last few years, JSON (JavaScript Object Notation) has become the new cool kid in town in terms of data exchange, mainly since REST (Representational State Transfer) supplanted SOAP (Simple Object Access Protocol) and WSDL (Web Services Description Language, written in XML) in web service-oriented architectures.
JSON is easier to read and less verbose than XML, and it imposes fewer constraints. It supports nested constructs such as arrays and objects. You can find out more about JSON at https://www.json.org.
A sub-format of JSON is called JSON Lines. JSON Lines (http://jsonlines.org) stores each record on one line, which eases parsing and readability. Here's a small example copied from their website; as you can see, it supports Unicode.
{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]} {"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]} {"name": "May", "wins": []} {"name": "Deloise", "wins": [["three of a kind", "5♣"]]}
Before Spark v2.2.0, JSON Lines was the only JSON format that Spark could read.
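As an aside, starting with v2.2.0, Spark can also ingest multi-line (pretty-printed) JSON documents through the multiLine read option. Here is a minimal sketch, assuming an existing SparkSession named spark; the file name data/multiline.json is a placeholder, not part of this article's dataset.

    // Multi-line JSON ingestion (Spark v2.2.0 and later): each record may
    // span several lines. data/multiline.json is a hypothetical file name.
    Dataset<Row> df = spark.read().format("json")
        .option("multiLine", true)
        .load("data/multiline.json");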
Figure 1 Spark is ingesting a JSON Lines file. Note that the records are in JSON but on one line each. After the ingestion, Spark displays some records and the schema.
For your first JSON ingestion, you’re going to use the foreclosure dataset from the city of Durham, NC from 2006 to 2016. You can freely download their datasets from their portal at https://OpenDurham.nc.gov.
Open Durham is the open data portal of the city and county of Durham, NC. They use OpenDataSoft's solution, which provides data as JSON Lines. For this example, I used Spark v2.2.0 on macOS v10.13.2 with Java 8. The dataset was downloaded in January 2018.
Listing 1 shows three records. Listing 2 shows a pretty print of the first record.
Listing 1 Foreclosure data: first two records and the last record
[{"datasetid": "foreclosure-2006-2016", "recordid": "629979c85b1cc68c1d4ee8 cc351050bfe3592c62", "fields": {"parcel_number": "110138", "geocode": [ 36.0013755, -78.8922549], "address": "217 E CORPORATION ST", "year": "2 006"}, "geometry": {"type": "Point", "coordinates": [-78.8922549, 36.00 13755]}, "record_timestamp": "2017-03-06T12:41:48-05:00"}, {"datasetid": "foreclosure-2006-2016", "recordid": "e3cce8bbc3c9b804cbd87e2 67a6ff121285274e0", "fields": {"parcel_number": "110535", "geocode": [3 5.995797, -78.895396], "address": "401 N QUEEN ST", "year": "2006"}, "g eometry": {"type": "Point", "coordinates": [-78.895396, 35.995797]}, … {"datasetid": "foreclosure-2006-2016", "recordid": "1d57ed470d533985d5a3c3d fb37c294eaa775ccf", "fields": {"parcel_number": "194912", "geocode": [3 5.955832, -78.742107], "address": "2516 COLEY RD", "year": "2016"}, "ge ometry": {"type": "Point", "coordinates": [-78.742107, 35.955832]}, "re cord_timestamp": "2017-03-06T12:41:48-05:00"}]
Listing 2 shows an indented version of the first record (pretty-printed via JSONLint.com and Eclipse), in which you can see the structure: field names, arrays, and nested structures.
Listing 2 Foreclosure data: pretty print of the first record
[ { "datasetid": "foreclosure-2006-2016", "recordid": "629979c85b1cc68c1d4ee8cc351050bfe3592c62", "fields": { "parcel_number": "110138", "geocode": [ 36.0013755, -78.8922549 ], "address": "217 E CORPORATION ST", "year": "2006" }, "geometry": { "type": "Point", "coordinates": [ -78.8922549, 36.0013755 ] }, "record_timestamp": "2017-03-06T12:41:48-05:00" } … ]
Desired output
Listing 3 shows the output of the dataframe's data and schema after ingesting a JSON Lines document.
Listing 3 Displaying foreclosure records and schema
+-------------+-------------+-------------+----------------+-------------+
|    datasetid|       fields|     geometry|record_timestamp|     recordid|
+-------------+-------------+-------------+----------------+-------------+
|foreclosur...|[217 E COR...|[WrappedAr...|   2017-03-06...|629979c85b...|
|foreclosur...|[401 N QUE...|[WrappedAr...|   2017-03-06...|e3cce8bbc3...|
|foreclosur...|[403 N QUE...|[WrappedAr...|   2017-03-06...|311559ebfe...|
|foreclosur...|[918 GILBE...|[WrappedAr...|   2017-03-06...|7ec0761bd3...|
|foreclosur...|[721 LIBER...|[WrappedAr...|   2017-03-06...|c81ae2921f...|
+-------------+-------------+-------------+----------------+-------------+
only showing top 5 rows

root
 |-- datasetid: string (nullable = true) ❸
 |-- fields: struct (nullable = true) ❶
 |    |-- address: string (nullable = true)
 |    |-- geocode: array (nullable = true) ❷
 |    |    |-- element: double (containsNull = true)
 |    |-- parcel_number: string (nullable = true) ❸
 |    |-- year: string (nullable = true) ❸
 |-- geometry: struct (nullable = true)
 |    |-- coordinates: array (nullable = true) ❷
 |    |    |-- element: double (containsNull = true)
 |    |-- type: string (nullable = true) ❸
 |-- record_timestamp: string (nullable = true) ❸
 |-- recordid: string (nullable = true) ❸
❶ The “fields” field is a structure with nested fields
❷ The dataframe can contain arrays
❸ For every field in which Spark can't precisely identify the data type, it falls back to a string.
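As a side note, nested fields like these are directly addressable with dot notation. Here is a minimal sketch, assuming df holds the ingested dataframe (the col function comes from org.apache.spark.sql.functions):

    import static org.apache.spark.sql.functions.col;

    // Reach into the nested structures: fields.address is a field of the
    // fields struct; getItem(0) picks the first element of the array.
    df.select(
        col("fields.address"),
        col("geometry.coordinates").getItem(0).alias("lon"))
      .show(5);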
When you see a piece of data like this, aren't you tempted to group the records by year to see the evolution of foreclosures, or to display each event on a map to see whether some areas are more subject to foreclosures than others, and then compare that with the average income in each area? This is good: let your inner data scientist come out!
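To give you a taste, here is a minimal sketch of the group-by-year idea, again assuming df holds the ingested dataframe:

    import static org.apache.spark.sql.functions.col;

    // Count foreclosures per year; the nested fields.year column is reached
    // via dot notation and aliased to keep the output column name simple.
    Dataset<Row> foreclosuresByYear = df
        .groupBy(col("fields.year").alias("year"))
        .count()
        .orderBy("year");
    foreclosuresByYear.show();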
Code
As you can imagine, reading JSON isn't much more complex than ingesting a CSV file, as you'll see in listing 4.
Listing 4 JsonLinesToDataframeApp.java
package net.jgp.books.sparkWithJava.ch07.lab_200.json_ingestion;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonLinesToDataframeApp {

  public static void main(String[] args) {
    JsonLinesToDataframeApp app = new JsonLinesToDataframeApp();
    app.start();
  }

  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("JSON Lines to Dataframe")
        .master("local")
        .getOrCreate();

    // Reads the JSON Lines file and stores it in a dataframe
    Dataset<Row> df = spark.read().format("json") ❶
        .load("data/durham-nc-foreclosure-2006-2016.json");

    // Shows 5 rows, truncating values to 13 characters, then the schema
    df.show(5, 13);
    df.printSchema();
  }
}
❶ That's it! This is the only change you have to make to ingest JSON.
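Note that the reader also offers a json() shorthand, which combines format("json") and load():

    // Equivalent shorthand for the read in listing 4.
    Dataset<Row> df = spark.read()
        .json("data/durham-nc-foreclosure-2006-2016.json");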
Easier than CSV for sure! Stay tuned for part three. If you’re interested in some more general information about the book, check it out on liveBook and see this slide deck.