
From Haskell in Depth by Vitaly Bragilevsky

This article explores text processing in the functional programming style.

Take 37% off Haskell in Depth. Just enter fccbragilevsky into the discount code box at checkout at manning.com.

Suppose you want to write a program that analyzes the vocabulary of a given text. Many sophisticated methods exist for such an analysis, but you want something quick. The task is quite basic; namely, you should:

  • Extract all words from the given text file and print them.
  • Count the size of the vocabulary.
  • Find the most frequently used words.

Such a program could be part of a larger social media text analyzer, which could be used to mine various pieces of information (ranging from level of education or social position to the risk of financial default) by analyzing the texts people post on their social media pages.

Or, more likely, you’ve gotten up in the middle of the night with a desire to explore the size of Shakespeare’s vocabulary. How many different words did Shakespeare use in Hamlet?

First attempt: Exploring the problem in a REPL with Strings

Let’s start by exploring Shakespeare’s vocabulary. Suppose you have a file, texts/hamlet.txt, with the full text of Hamlet. How many different words are there? You can use the GHCi REPL (read-eval-print loop) to find out without writing a program. You only need to import two modules for processing lists and characters, read the file into a String, and then manipulate its content:

  
 ghci> :m + Data.List Data.Char
 ghci> text <- readFile "texts/hamlet.txt"
 ghci> ws = map head $ group $ sort $ map (map toLower) $ words text
 ghci> take 7 ws
 ["&","'em?","'gainst","'tane","'tis","'tis,","'twas"]
  

The idea is to split the file’s content into a list of words, make them all lowercase, sort them, and then remove repetitions by grouping equal words and taking the first word of each group. Here is the deduplication trick in isolation (a minimal example):
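
  
 ghci> map head $ group $ sort ["to","be","or","not","to","be"]
 ["be","not","or","to"]
  

The initial result in ws shows that we’ve forgotten about leading (and trailing) punctuation, and we need to do some cleanup: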

  
 ghci> ws1 = map (takeWhile isLetter . dropWhile (not . isLetter)) $ words text
 ghci> ws = map head $ group $ sort $ filter (not . null) $ map (map toLower) ws1
 ghci> take 7 ws
 ["a","abhominably","abhorred","abilitie","aboord","aboue","about"]
 ghci> length ws
 4633
  

This is better, although looking at the other words in the resulting list, ws, reveals that this still isn’t a correct way to clean words: for example, it has mangled all hyphenated words, because takeWhile isLetter cuts them off at the hyphen.
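
A quick GHCi check makes the problem visible (same session as above, with a made-up hyphenated word):

  
 ghci> (takeWhile isLetter . dropWhile (not . isLetter)) "self-slaughter"
 "self"
  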

Note

I’m using the ghci> prefix for the code executed in the GHCi REPL. You can get the same prompt by executing the following command:

  
 $ ghci
 GHCi, version 8.2.2: http://www.haskell.org/ghc/  :? for help
 Prelude> :set prompt "ghci> "
 ghci>
  

Alternatively, you can tweak your .ghci file to make this change permanent. The GHC User Guide gives more details about the location of the .ghci file; you can put any GHCi command there, for example to define values, import modules, or execute some Haskell code.
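
For example, a minimal .ghci might look like this (a sketch; the exact location of the file depends on your OS):

  
 -- .ghci: every command here runs at GHCi startup
 :set prompt "ghci> "
 import Data.List
  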

Moving away from REPL and Strings

The REPL approach to solving problems quickly becomes clumsy, so let’s write a complete program to solve the same problem: extracting a vocabulary (a list of used words) from the given text file and computing the size of that vocabulary.

In this attempt we’ll also switch to the Text datatype from the Data.Text module, which is much more suitable for handling textual data in terms of both performance and convenience. The resulting program is presented in the following listing.

Listing 1. Extracting a vocabulary: the second attempt (vocab1.hs)

  
 import Data.Char
 import Data.List
 import qualified Data.Text as T          -- <1>
 import qualified Data.Text.IO as TIO     -- <1>
 import System.Environment                -- <2>
  
 main = do
   [fname] <- getArgs                     -- <3>
   text <- TIO.readFile fname             -- <4>
   let ws = map head $ group $ sort $ map T.toCaseFold $ filter (not . T.null)
            $ map (T.dropAround $ not . isLetter) $ T.words text              -- <5>
   TIO.putStrLn $ T.unwords ws                                                -- <6>
   print $ length ws
  

  <1> Import the modules for working with Text

  <2> Import the module for reading command-line arguments

  <3> Read the command-line arguments into a list of Strings

  <4> Read the file content into a Text value

  <5> Process the text into a list of words

  <6> Print all the words, delimited by spaces

Now you can read the filename from the command line without worrying too much about incorrect user input. This is done in the first line of the main function. The modules Data.Text and Data.Text.IO are usually imported with qualifiers to avoid name clashes with Prelude (the module which is imported by default). They come with the text package, which should be installed prior to compiling this program (although it may be installed already, depending on your GHC distribution).
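
If the package is missing, one way to get it (assuming a reasonably recent cabal-install; the exact command depends on your toolchain) is:

  
 $ cabal install --lib text
  

In a full project you’d instead list text in the build-depends field of your .cabal file.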

The module Data.Text provides many functions analogous to the list functions from Prelude and Data.List, but it also adds specific text processing functions that were used in listing 1. For example:

  
 toCaseFold :: Text -> Text
 dropAround :: (Char -> Bool) -> Text -> Text
  

The function toCaseFold converts the whole Text value to folded case, and does it significantly faster (and in a more correct way) than mapping toLower over every character. The function dropAround removes leading and trailing characters that satisfy the given predicate (a function of type Char → Bool).
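
A quick GHCi session shows both functions at work (a minimal example; note how case folding handles the German ß, which per-character lowercasing doesn’t):

  
 ghci> import Data.Char (isLetter)
 ghci> import qualified Data.Text as T
 ghci> T.toCaseFold (T.pack "Straße")
 "strasse"
 ghci> T.dropAround (not . isLetter) (T.pack "'tis,")
 "tis"
  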

Tip

Finding the function you need

You can use the website https://hoogle.haskell.org to find functions by name or, even more usefully, by type annotation: for example, searching for “(Char → Bool) → Text → Text” gives you the dropAround function I’ve described, together with other functions for cleaning texts. You can also Google what you need, but adding the “hackage” or “stackoverflow haskell” keywords to your search query usually leads to much more relevant results.

Give it a try!

Running the program from listing 1 on the text of Shakespeare’s Hamlet results in something like this (with the part in the middle stripped away):

  
 $ vocab1 texts/hamlet.txt
 a a'th a-crosse a-downe a-downe-a a-dreames a-foot a-sleepe a-while a-worke
 ...
 4827
  

We won’t attempt to make the cleaning results better, because the main goal of this chapter is to discuss structuring functional programs. In fact, breaking a text into words can’t be done reliably without diving deep into the Unicode rules on text boundary positions. If you’re interested, look at the documentation for the module Data.Text.ICU from the text-icu package; you’ll find fascinating details there on what a word is. Even then, you’d need to add some sort of semantic analysis to come up with a bulletproof solution that works beyond the English language. Don’t forget to check your own solution with several text files.

Designing a functional program as a set of IO actions

The sort of programming presented in the previous subsections resembles scripting more than actual programming. In Haskell we tend to express our ideas in terms of types and functions first and then proceed with implementing them. Remember that our task is to explore the vocabulary of a given text file, but let’s be more specific here: from now on we’ll regard a vocabulary as consisting of entries with words and numbers of occurrences (they can be used later for determining the most frequent words, for example). Now that we’ve agreed on the types of an entry (a word and its number of occurrences) and a vocabulary (a list of entries), we are ready to write a type-based outline for the program:

  
 type Entry = (T.Text, Int)   -- one entry
 type Vocabulary = [Entry]    -- a list of entries
  
 extractVocab :: T.Text -> Vocabulary
 printAllWords :: Vocabulary -> IO ()
 processTextFile :: FilePath -> IO ()
 main :: IO ()
  

This outline clearly shows that our plan is to read and process the text file (processTextFile) by:

  1. Extracting a vocabulary from the file’s content
  2. Using the vocabulary to print all words

But there’s more than that in this outline: we plan to read and process command-line arguments (the name of the file, a value of type FilePath) in the main function and, if this is done correctly, proceed with processing the text file (using processTextFile). This function reads the content of the given file (the second component of the user input in this program, after the command-line arguments) into a value of type Text, extracts the vocabulary (with the extractVocab function), and finally prints it (with printAllWords).

Note the function extractVocab: it’s the only pure function in this program. We can see that from its type, T.Text → Vocabulary—there’s no IO there.

You can see this scenario visualized in the flowchart depicted in figure 1. I’ve tried to present all the components of the program there: user input, actions in the I/O part of the program, and their relations with the pure functions. Arrows in this flowchart represent moving data between the user and the program, and between functions within the program.


Figure 1. Extracting a vocabulary: program structure flowchart


The function extractVocab here does exactly what we did before in the let expression inside main: it takes a Text and returns a Vocabulary:

  
 extractVocab :: T.Text -> Vocabulary
 extractVocab t = map buildEntry $ group $ sort ws
   where
     ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t
     buildEntry ws@(w:_) = (w, length ws)
     cleanWord = T.dropAround (not . isLetter)
  
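
For example, with this definition loaded into GHCi (using T.pack because OverloadedStrings isn’t enabled here):

  
 ghci> extractVocab (T.pack "To be, or not to be")
 [("be",2),("not",1),("or",1),("to",2)]
  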

Once we have a Vocabulary we can print it as follows:

  
 printAllWords :: Vocabulary -> IO ()
 printAllWords vocab = do
   putStrLn "All words: "
   TIO.putStrLn $ T.unlines $ map fst vocab
  

I prefer to have a separate function for file processing to avoid many lines of code in the main function, where I’m going to read command-line arguments and check them for correctness:

  
 processTextFile :: FilePath -> IO ()
 processTextFile fname = do
   text <- TIO.readFile fname
   let vocab = extractVocab text
   printAllWords vocab
  
 main = do
   args <- getArgs
   case args of
     [fname] -> processTextFile fname
     _ -> putStrLn "Usage: vocab-builder filename"
  
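
With this structure, running the compiled program without a filename (assuming the binary is named vocab-builder, as in the usage string) prints the usage message instead of crashing on a failed pattern match:

  
 $ vocab-builder
 Usage: vocab-builder filename
  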

This program structure is flexible enough to accommodate several task changes. For example, it’s easy to print the total number of words in the text file and find the most frequently used words. These new goals can be expressed in types as follows:

  
 printWordsCount :: Vocabulary -> IO ()
 printFrequentWords :: Vocabulary -> Int -> IO ()
  

Unfortunately, there’s a problem with this approach: we tend to stick with IO, and almost every function in the program is an I/O action. Next, I’ll show you how to do the same task in a completely different way.

Embracing pure functions

The problem with functions like printAllWords, printWordsCount, or printFrequentWords is that they’re too tightly and unnecessarily coupled with I/O. Even their names suggest that the same functionality could be achieved by combining the impure print with pure computations.

It’s generally agreed within the Haskell community that we can get the most out of the advantages of functional programming when we use pure functions as much as possible: they’re easier to combine with other functions, they can’t break something in other parts of the program, and we can even reason about their correctness.

Let’s try to follow the path of purity as we solve our main problem in this section—extracting a vocabulary—to see how pure functions allow us to extend the functionality of the program easily.

Putting I/O apart with pure functions

Remember that your initial goal was to write a program that was able to:

  • Extract the vocabulary of a text file.
  • Print all words used in the text.
  • Count and print the number of different words.
  • Find the specified number of the most frequently used words.

For example, when running it on Shakespeare’s Hamlet you’d like to get something like the following output:

  
 All words:
 a
 ...
 zone
  
 Total number of words: 29575
  
 Frequent words:
 the - 993
 and - 862
 to - 683
 of - 610
 i - 547
  

The task is more ambitious than what we’ve programmed so far, but we can handle it easily thanks to the clear separation between the pure and impure parts of the program. This is the new outline, now with a big choice of pure functions:

  
 extractVocab :: T.Text -> Vocabulary
 allWordsReport :: Vocabulary -> T.Text
 wordsCountReport :: Vocabulary -> T.Text
 frequentWordsReport :: Vocabulary -> Int -> T.Text
 processTextFile :: FilePath -> IO ()
 main :: IO ()
  

Every *Report function here takes a Vocabulary (with additional arguments if needed) and returns the resulting text in the form of T.Text, ready for printing. We’ll use these functions in processTextFile. Preparing these reports can be done with two auxiliary pure functions:

  
 wordsCount :: Vocabulary -> Int
 wordsByFrequency :: Vocabulary -> Vocabulary
  

The functions in the I/O part of the program (I/O actions) now become almost trivial: the goal of processTextFile is reading the text file, calling extractVocab to extract the vocabulary, and printing the results of all the *Report functions. The complete flowchart is depicted in figure 2.


Figure 2. Extracting a vocabulary with pure functions: program structure flowchart


Nothing new happens when preparing the reports on the list of words used and their counts, but dealing with frequently occurring words is more interesting. Let’s discuss that now.

Computing the most frequent words by sorting them

Note that I’ve deliberately split the job between two functions, wordsByFrequency and frequentWordsReport:

  
 wordsByFrequency :: Vocabulary -> Vocabulary
 frequentWordsReport :: Vocabulary -> Int -> T.Text
  

Is this useful in practice? The function wordsByFrequency can be used later—its goal is to sort the vocabulary by number of occurrences of each word—but the goal of frequentWordsReport is only to prepare the report for printing. It isn’t a good idea to add the Int argument to wordsByFrequency, because this greatly limits its potential for reuse.

Let’s move on to the implementation of computing the most frequent words. First, you need to sort the vocabulary entries by descending number of occurrences. This can be achieved with the sortBy function from the Data.List module, which has the following type:

  
 sortBy :: (a -> a -> Ordering) -> [a] -> [a]
  

This function requires a comparison function returning the Ordering datatype (whose values can be LT, EQ, or GT, meaning less than, equal to, or greater than, respectively). By default, sortBy sorts the elements of the list in ascending order. To get the required descending order, you can construct the comparison function with comparing from the Data.Ord module, applied to values wrapped in the Down newtype. Down reverses the result of comparing two values of the given type, turning sortBy’s default ascending order into a descending one:

  
 wordsByFrequency :: Vocabulary -> Vocabulary
 wordsByFrequency = sortBy (comparing $ Down . snd)
  

Note also that applying the snd function to each vocabulary entry makes the comparison use the numbers of occurrences instead of the words themselves.
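
A short GHCi session illustrates the effect of Down (a minimal example):

  
 ghci> import Data.List (sortBy)
 ghci> import Data.Ord (comparing, Down (..))
 ghci> sortBy (comparing snd) [("the",993),("to",683),("and",862)]
 [("to",683),("and",862),("the",993)]
 ghci> sortBy (comparing $ Down . snd) [("the",993),("to",683),("and",862)]
 [("the",993),("and",862),("to",683)]
  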

I recommend reading Roman Cheplyaka’s blog post on descending sort in Haskell, https://ro-che.info/articles/2016-04-02-descending-sort-haskell, to see some of the difficulties behind this seemingly easy task. The post explores the behavior of sortBy and compares it with other sorting functions, such as sortOn and sortWith.

Now that you can sort the vocabulary entries as required, you can move on to preparing the corresponding report. One problem here is how to combine string literals with T.Text values. Instead of looking for specific functions to do that, you can teach the compiler to regard string literals as values of type T.Text by using the OverloadedStrings GHC extension. To enable this extension in the source code, start your file with the LANGUAGE pragma:

  
 {-# LANGUAGE OverloadedStrings #-}
  

Now you can use string literals directly as values of type Text without converting them:

  
 frequentWordsReport :: Vocabulary -> Int -> T.Text
 frequentWordsReport vocab n = T.append "\nFrequent words:\n"
                               $ T.unlines $ map showEntry $ take n
                               $ wordsByFrequency vocab
   where
     showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
  

The function T.pack is used here to convert a value (not a literal) of type String to type T.Text.
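
You can check the effect of the extension right in GHCi (a minimal example):

  
 ghci> :set -XOverloadedStrings
 ghci> import qualified Data.Text as T
 ghci> T.append "count: " (T.pack (show 42))
 "count: 42"
  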

Note

Using GHC extensions

The use of compiler extensions in other programming languages is somewhat controversial. In C or C++, for example, there’s the standard language definition, which is implemented by many different compilers, and there are specific compiler extensions. In Python or Ruby there are no standards, and that means that there are no extensions: there’s only one compiler, and you can legitimately use only what it implements.

In Haskell the situation is different: there’s a standard definition (the Haskell 2010 Report is the current version) and only one widely used compiler (the Glasgow Haskell Compiler), which implements many extensions beyond what’s defined in the Haskell Report. It’s generally considered good practice to use these extensions, because this gives the community the experience needed to decide whether to include them in the next standard or, in some cases, to exclude them from the compiler itself.

Exploring vocabulary: The overall solution

The overall solution is presented in the following listing. It features a LANGUAGE directive in the first line that enables the GHC extension OverloadedStrings in the source code; every other aspect of this code has already been discussed.

Listing 2. Analyzing words in a text file with a clearly divided pure part and I/O part (vocab3.hs)

  
 {-# LANGUAGE OverloadedStrings #-}                                              -- <1>
  
 import Data.Char                                                                -- <2>
 import Data.List
 import Data.Ord
 import qualified Data.Text as T
 import qualified Data.Text.IO as TIO
 import System.Environment
  
 type Entry = (T.Text, Int)                                                      -- <3>
 type Vocabulary = [Entry]
  
 extractVocab :: T.Text -> Vocabulary                                            -- <4>
 extractVocab t = map buildEntry $ group $ sort ws                               
   where
     ws = map T.toCaseFold $ filter (not . T.null) $ map cleanWord $ T.words t   
     buildEntry ws@(w:_) = (w, length ws)
     cleanWord = T.dropAround (not . isLetter)
  
 allWordsReport :: Vocabulary -> T.Text                                          -- <4>
 allWordsReport vocab = T.append "\nAll words:\n"                                
                        $ T.unlines $ map fst vocab
  
 wordsCount :: Vocabulary -> Int                                                 -- <4>
 wordsCount vocab = sum $ map snd vocab                                          
  
 wordsCountReport :: Vocabulary -> T.Text                                        -- <4>
 wordsCountReport vocab = T.append "\nTotal number of words: "                   
                          $ T.pack $ show $ wordsCount vocab
  
 wordsByFrequency :: Vocabulary -> Vocabulary                                    -- <4>
 wordsByFrequency = sortBy (comparing $ Down . snd)                              
  
 frequentWordsReport :: Vocabulary -> Int -> T.Text                              -- <4>
 frequentWordsReport vocab n = T.append "\nFrequent words:\n"
                               $ T.unlines $ map showEntry $ take n
                               $ wordsByFrequency vocab
   where
     showEntry (t, n) = T.append t $ T.pack $ " - " ++ show n
  
 processTextFile :: FilePath -> Int -> IO ()                                     -- <5>
 processTextFile fname n = do
   text <- TIO.readFile fname
   let vocab = extractVocab text
   TIO.putStrLn $ allWordsReport vocab
   TIO.putStrLn $ wordsCountReport vocab
   TIO.putStrLn $ frequentWordsReport vocab n
  
 main :: IO ()
 main = do
   args <- getArgs
   case args of
     [fname, n] -> processTextFile fname (read n)
     _ -> putStrLn "Usage: vocab-builder filename number_of_frequent_words"
  

  <1> Enable the GHC extension OverloadedStrings, which allows treating string literals as Text values

  <2> Import the modules required in this program

  <3> Define type synonyms to make type annotations for functions easier to read and understand

  <4> The group of pure functions that do almost all the work in the program

  <5> The I/O action that brings everything together

Preparing reports with this program isn’t particularly impressive: we build them trivially by appending strings.

That’s all for this article. If you want to learn more about the book, check it out on Manning’s liveBook platform.