
From Spark in Action, 2nd Ed. by Jean Georges Perrin

This is the last in a series of 4 articles on the topic of ingesting data from files with Spark. This section deals with ingesting a TXT file.




This is the last of our short articles on data ingestion. Part 1 covered CSV, part 2 JSON, and part 3 XML. In this section, we’re going to ingest data from a TXT (text) file.

Ingesting a text file

Text files are still in use and, although they’re less popular in enterprise applications, you still come across a few from time to time. The growing popularity of deep learning and artificial intelligence also drives more NLP (Natural Language Processing) activities. In this section, you won’t do any NLP, only ingest text files. To learn more about NLP, you can refer to Manning’s Natural Language Processing in Action.

The task is to ingest Shakespeare’s Romeo and Juliet. Project Gutenberg (http://www.gutenberg.org) hosts numerous books and resources in digital format, including this one.

Each line of the book becomes a record of our dataframe. The text isn’t split any further, by sentence or by word. Listing 1 shows an excerpt of the file you’re going to work on.

Getting the files: you can download Romeo and Juliet from http://www.gutenberg.org/cache/epub/1777/pg1777.txt. For this example, I used Spark v2.2.0 on macOS v10.12.6 with Java 8. The dataset was downloaded in January 2018.

Listing 1 Excerpt from Project Gutenberg’s version of Romeo and Juliet

  
 This Etext file is presented by Project Gutenberg, in
 cooperation with World Library, Inc., from their Library of the
 Future and Shakespeare CDROMS.  Project Gutenberg often releases
 Etexts that are NOT placed in the Public Domain!!
 …
 ACT I. Scene I.
 Verona. A public place.
  
 Enter Sampson and Gregory (with swords and bucklers) of the house
 of Capulet.
  
   Samp. Gregory, on my word, we'll not carry coals.
   Greg. No, for then we should be colliers.
   Samp. I mean, an we be in choler, we'll draw.
   Greg. Ay, while you live, draw your neck out of collar.
   Samp. I strike quickly, being moved.
   Greg. But thou art not quickly moved to strike.
   Samp. A dog of the house of Montague moves me.
 …
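
To make the line-per-record idea concrete, here is a minimal plain-Java sketch that doesn’t use Spark at all. It only illustrates the expected result, not how Spark reads the file internally. The class LinePerRecordSketch and the helper toRecords are hypothetical names introduced for this illustration. Note that empty lines are kept as empty records.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Illustration only: mimics "one line of text = one record" without Spark.
public class LinePerRecordSketch {

  // Hypothetical helper: splits the text on line breaks and keeps empty
  // lines, so every line of the input becomes one record.
  public static List<String> toRecords(String text) {
    return new BufferedReader(new StringReader(text))
        .lines()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // A short excerpt in the spirit of Listing 1, starting with an empty line
    String excerpt = "\nACT I. Scene I.\nVerona. A public place.\n";

    List<String> records = toRecords(excerpt);
    for (String record : records) {
      System.out.println("[" + record + "]");
    }
    System.out.println(records.size() + " records");
  }
}
```

Splitting those records into sentences or words would be a separate step that you’d have to write yourself.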
  

Desired output

Listing 2 shows the first five rows of Romeo and Juliet after it has been ingested by Spark and transformed into a dataframe.

Listing 2 Romeo and Juliet in a dataframe

  
 +--------------------+
 |               value|
 +--------------------+
 |                    |
 |This Etext file i...|
 |cooperation with ...|
 |Future and Shakes...|
 |Etexts that are N...|
 +--------------------+
 only showing top 5 rows
  
 root
  |-- value: string (nullable = true)
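
The values in the output above are cut off: anything longer than 20 characters appears as its first 17 characters followed by three dots. The plain-Java sketch below reproduces that display rule as it can be observed in the listing. It is an assumption-laden illustration rather than Spark’s own code, and the class ShowTruncationSketch and the method truncateForDisplay are hypothetical names.

```java
// Illustration only: reproduces the truncation visible in the show() output.
public class ShowTruncationSketch {

  // Assumption: a display width of 20 characters, as seen in the listing.
  private static final int DISPLAY_WIDTH = 20;

  // Hypothetical helper: values longer than the display width are cut to
  // (width - 3) characters and suffixed with "...".
  public static String truncateForDisplay(String value) {
    if (value.length() <= DISPLAY_WIDTH) {
      return value;
    }
    return value.substring(0, DISPLAY_WIDTH - 3) + "...";
  }

  public static void main(String[] args) {
    String line = "This Etext file is presented by Project Gutenberg, in";
    System.out.println(truncateForDisplay(line));
    System.out.println(truncateForDisplay("of Capulet."));
  }
}
```

If you’d rather see the full lines, the dataframe’s show() method also accepts a second argument: df.show(5, false) turns the truncation off.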
  

Code

Listing 3 is the Java code needed to turn Romeo and Juliet into a dataframe.

Listing 3 – TextToDataframeApp.java

  
 package net.jgp.books.sparkWithJava.ch07.lab_400.text_ingestion;
  
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SparkSession;
  
 public class TextToDataframeApp {
  
   public static void main(String[] args) {
     TextToDataframeApp app = new TextToDataframeApp();
     app.start();
   }
  
   private void start() {
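     // Create a session on a local master
     SparkSession spark = SparkSession.builder()
         .appName("Text to Dataframe")
         .master("local")
         .getOrCreate();
  
     // Read the file: each line becomes a row with a single "value" column
     Dataset<Row> df = spark.read().format("text")
         .load("data/romeo-juliet-pg1777.txt");
  
     // Display the first five rows, then the schema
     df.show(5);
     df.printSchema();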
   }
 }
  

In the call to format(), specify “text” when you want to ingest a text file.

Unlike with other formats, there are no options to set with text. It’s that easy! We’ve come to the end of our series and we hope that you’ve found it both enjoyable and informative. If you want to learn more about the book, check it out on liveBook and see the slide deck.