Incremental Haskell III: Tokyo Drift
Prelude
This is part 3 in a series of articles designed to help those new to Haskell work through the highly-recommended book, Real World Haskell. In them, I try to keep things at a digestible pace and not to presuppose too much. Here are links to the previous articles. I hope you find these helpful, and welcome comments on the Haskell subreddit.
Mo’ problems, mo types
As we mentioned in the previous article, it would be nice to be able to customize our JSON output. Right now, we have a default implementation of printing output that isn’t very flexible, not supporting things like “word wrapping” (soft line breaks).
Certainly, being able to format output text in a flexible manner is not a problem specific to JSON (though that will be our current application of it). Because it’s a new “problem to solve,” we’ll start from the top with a new type. We want to represent a chunk of text that can be formatted.
data FormattedText =
The simplest type of formatting is no formatting at all (just like momma always said).
data FormattedText = Text String
We want the ability to break text over more than one line if it’s too wide. We’ll add a data constructor to our type that represents these line breaks. Then when we go to render the text, when we match this constructor, we can go to the next line.
data FormattedText = Text String | Linebreak
We also want to support a soft line break, which will be a space character if we aren’t out of space on a line, or a newline otherwise. So we’ll make a WithOptionalBreaks data constructor to represent this.
data FormattedText = Text String | Linebreak | WithOptionalBreaks FormattedText FormattedText
As you can imagine, a real document (JSON or otherwise) is going to consist of chunks of text combined together. Some chunks will have different formatting than others, but together they make up a whole document of formatted text, which we can call a formatted document. And since a formatted document is just formatted text, if we combine it with more formatted text, we still have a formatted document. I hope that makes sense, even if it does involve some “recursive thinking.” The concept translates very well into Haskell:
data FormattedText = Text String | Linebreak | WithOptionalBreaks FormattedText FormattedText | FormattedDocument FormattedText FormattedText
That is, we represent a whole document as two chunks of FormattedText, which can include plain Text, Linebreaks, even other FormattedDocuments that are themselves made up of the same thing. If you know anything about data structures, you’ll recognize a tree structure here.
Defining custom operators in Haskell
Almost every language has operators, which give abstract symbols meaning to make certain expressions read more smoothly. Of course, this concept originally comes from mathematics, where being able to tersely express things is important.
Languages with operators tend to at least include ASCII equivalents of the basic arithmetic symbols like + and -. Some language designers decided to “overload” operators, giving them different meanings depending on context. In Java, for example, you concatenate strings thus:
String newString = "hello " + "world";
Most people tend to be in favor of operators; having to write Plus(1, 3) every time you wanted to add 1 and 3 isn’t quite as convenient, and one could argue less readily apparent. Not everyone is in agreement on operator overloading, however, as it can introduce ambiguity when one symbol has more than one meaning.
Haskell programmers are big fans of operators. I’m sure this has a lot to do with the math culture that is quite prevalent in the community and user base. Haskell and its community are quite bold in using abstract concepts in everyday code, and often it makes more sense to define an operator than give an otherwise inadequate name to a function.
The language lets you define your own operators, including precedence. Functions that you name using only non-alphanumeric symbols are automatically considered operators and can be used just like you’d use + or -. Just like functions, operator names need to be unique and there is no overloading allowed (in the traditional sense). That is, you don’t see 1 + 1 on one line and “hello ” + “world” on the next. The (+) function is coded to accept only numeric inputs, and so there wouldn’t be a way to add Strings with (+) without String being numeric.
So what’s that have to do with what we’re doing?
The type we just wrote is perfectly suited to having a custom operator or two to help us out. Right now, if we want to build a document up out of text chunks, we have to do something like this:
FormattedDocument (Text "hi") (FormattedDocument (Text "howdy") (Text "hello"))
We can quickly and easily define an operator that makes the concept of combining text more palatable:
(<++>) :: FormattedText -> FormattedText -> FormattedText textChunk <++> anotherChunk = FormattedDocument textChunk anotherChunk
Since FormattedText concatenation is conceptually kinda like String concatenation, I chose an operator that looks like (++).
Now we can avoid “parentheses nesting hell”:
Text "Knights " <++> Text "who say" <++> Linebreak <++> Text "ni"
As you can see above, we can add text to other text, and add line breaks where we want. We can use this same approach for our optional line breaks. Let’s make another operator that represents a possible breaking spot in a line:
(</>) :: FormattedText -> FormattedText -> FormattedText textChunk </> anotherChunk = textChunk <++> softLineBreak <++> anotherChunk
So when we > two pieces of text, we join them together with a soft line break–the kind that ends up being a newline if anotherChunk didn’t fit on the current line. Otherwise, it’ll just become a space.
Our WithOptionalBreaks constructor handles these two possibilities (break or just space) for us:
softLineBreak = WithOptionalBreaks (Text " ") (Linebreak)
Back to rendering JSON
Now that we have our type and the ability to combine values of our type, let’s take a look at how we’ll use it to “pretty print” a JSON string value. This isn’t quite as trivial as it sounds because we need to pay attention to characters in the string that need to be escaped with backslashes.
renderJsonString :: String -> FormattedText renderJsonString string = Text "\"" <++> Text (escapeChars string) <++> Text "\"" where escapeChars chars = concatMap escapeChar chars escapeChar char = -- ??
We’re breaking the problem into chunks, keeping individual functions small. Our main function, renderJsonString, just surrounds the string in quotation marks, and then calls a helper function escapeChars on the string. escapeChars, in turn, is going to concatMap an escape function over each individual character in the string.
In case you still aren’t used to the map family of functions yet, they’re the ones that pass each element of a provided list to a specified function. A new list is returned containing the results of each function application. In this case, each result is going to be a String (which, don’t forget, is just a [Char]). We only want one String out of our escapeChars function, so we use the concatMap version of map. It just smooshes all the results together, which is great for scenarios like this.
In the previous installment, we talked about covering all our bases when pattern matching. If you are matching against a Maybe, it’s a good idea to think about what happens if the value is Nothing as well as if it is Just something. If our input is a list, what if it’s empty?
These are great things to think about to avoid errors. However, Haskell programmers feel very passionate about avoiding redundant code, so it’s okay to avoid special-casing a pattern if we won’t gain anything by it. The escapeChars function above is a good example, as it takes a list. We could have done the following:
escapeChars [] = "" escapeChars chars = concatMap escapeChar chars
But actually we don’t need to, because concatMap already handles the empty list case for us in the same way! (If the list passed to concatMap is empty, it returns an empty list. In Haskell this looks like:)
concatMap _ [] = []
(Remember, the underscore pattern is used when we don’t care about the value of an argument. In this case, it doesn’t matter what function is passed to concatMap, if the list is empty, it just returns [].)
Escaping special characters
Our escapeChar function will take a character from a string, find out if it needs to be escaped, and if so, return an escaped version. Otherwise it’ll just return the character.
-- ... escapeChar char = case lookup char asciiEscapeTable of Just escaped -> escaped Nothing -> -- ??
Here, we are pattern matching the result of our check. The check is done with the lookup function, a standard library function which searches a list of key-value pairs for a given key (in our case, the character we’re checking). Since lookup returns Maybe a result, we match against Just (meaning the character was found in the list of characters-needing-escape) and Nothing. For the first case, we grab the escaped string and return it.
Our list of escaped characters, asciiEscapeTable, contains the most common escape sequences. We will build it by feeding a table-creation function two lists: a list of characters that need escaping, and a list of their escaped versions.
asciiEscapeTable :: [(Char, String)] -- a list of key-value pairs, commonly called an 'association list' asciiEscapeTable = mysteryTableMakerFunction ['\b','\n','\f','\r','\t','\\','"','/'] ["\\b","\\n","\\f","\\r","\\t","\\\\","\\\"","\\/"]
So far, so good–assuming we can find a function that takes two lists and makes one list of pairs out of them (sequentially pairing one element from the first list with one from the second). If we think about what the type of such a function would be, we come up with:
[a] -> [b] -> [(a, b)]
Remember Hoogle? If we go there and search for a function with that type, it tells us straight away that the function we want is called zip, and it’s in the Prelude so we don’t even need to import it. (In fact, zip is a common functional programming building block.)
asciiEscapeTable = zip ['\b','\n','\f','\r','\t','\\','"','/'] ["\\b","\\n","\\f","\\r","\\t","\\\\","\\\"","\\/"]
Very true. Using those facts, we can write a version that’s easier to type, if we want:
asciiEscapeTable = zipWith addBackslash "\b\n\f\r\t\\\"/" "bnfrt\\\"/" where addBackslash input output = (input, ['\\', output])
The zipWith function lets us apply a helper function before it combines the pair into a list element. We use it to simply prepend a backslash to every element of our output list.
And if the lookup returns Nothing?
Unfortunately, we aren’t done yet. We can’t just return the character because it might be a Unicode character that requires escaping, and our nice little escape table earlier was ASCII only. So we need another check to see if this is the case:
-- ... Nothing | needsUnicodeEscape char -> unicodeEscape char | otherwise -> [char] needsUnicodeEscape :: Char -> Bool needsUnicodeEscape char = char < ' ' || char == '\x7f' {-- ASCII 'del' --} || char > '\xff'
So we changed the second pattern in our case statement to be more specific, using guards. Guards allow you to specify further conditions on a pattern match. Our first guard does a check to see if we need to escape the character in a more complex way (using RWH’s choice of escaping all non-printable ASCII characters). The other guard, the otherwise, just returns the character because it means it doesn’t need escaping. (We made a singleton list out of char because the other branches of the statement are returning a String, not just a single Char, and we need to be consistent.)
Now we need to write our unicodeEscape function. To do this, we need to turn the character into a string representing its numeric Unicode value. This means a backslash, a u, and a four-digit hex string. (The JSON standard requires this format.)
unicodeEscape :: Char -> String unicodeEscape char = "\\u" ++ padWithZeroes (toHex char) where padWithZeroes string = -- ?? toHex char = -- ??
We need to ensure our hex value is four digits, so we’ll pad it with zeroes if necessary. This means finding out the length of our hex, and throwing a 0 in front for each place less than four.
To help us, we’ll use a handy little function called replicate that (as you’d expect) replicates its argument. You specify how many times it should do this.
-- ... where padWithZeroes string = replicate (4 - length string) '0' ++ string
In order to convert a character to its hex value, we first need to get a numeric value of the Char. Hoogle can help us if we search for the type signature of such a function (that is, Char -> Int). The result that suits our needs is ord, in Data.Char (it gives the ordinal value of a character). We then need to turn our numeric value into a hex string. The Numeric module provides such a function: showHex. If we import the Numeric and Data.Char modules at the top of our file, we can finish up our toHex function snappily:
-- ... toHex char = showHex (ord char) ""
(showHex actually takes another argument after the input number–a String. This has uses, but we won’t bother with it here, so we just feed it an empty string.)
Though our function is pretty good, Unicode values actually go up higher than the JSON “\uxxxx” syntax allows. The standard describes how to express characters that have values higher than four digits. RWH has already implemented this; look for the astral function in chapter 5. Feel free to incorporate it if you want!
Now to pretty printing!
Now that we have our string rendering, we’re ready to get on to flexible output formatting. We’ll start with a type signature:
printWithWidth :: Int -> FormattedText -> String
The Int parameter will represent the maximum width of a line. Remember our soft line breaks? If one of those occurs in our text, we can use this width information to determine whether or not to begin a new line.
Our implementation will use the different constructors we defined for FormattedText to our advantage: we’ll use them as instructions about how to print.
printWithWidth width formattedText = processText 0 [formattedText] where processText column (textChunk : remainingText) = case textChunk of Text string -> string ++ processText (column + length string) remainingText Linebreak -> '\n' : processText 0 remainingText FormattedDocument part1 part2 -> processText column (part1 : part2 : remainingText) WithOptionalBreaks noBr withBr -> nicestFit column (processText column (noBr : remainingText)) (processText column (withBr : remainingText)) processText _ _ = ""
A line-by-line explanation might help.
Right away, we pass column number 0 and a list containing what our function was passed to a helper function. Then we define said helper function, and have it match against every possible sort of FormattedText. The column variable will track our current position in a line. So when processText gets a string of Text, it increments the column variable by the length of the string and emits the string. When it gets a Linebreak, it emits a newline character, and resets the column count because we’re now on a fresh line. In both cases, we recursively continue processing the [formattedText] list.
But that list only has one element, right?
Well yes, at the start. But if we pass in a FormattedDocument (which as you remember, contains other FormattedTexts), we grow our remainingText list by prepending both halves of the document “tree.” This action in and of itself doesn’t move our column position on the line; the recursive call to processText will take care of that later.
The most interesting piece
Remember we made the WithOptionalBreaks constructor to hold the two versions of a soft line break: the space version, and the newline version. The reason we did this will now become clear. What we’re going to do with our nicestFit function is compare the lengths of those two versions of text based on what column position we’re on in the line. If the “no breaks” version would be too long, we’ll emit the version with line breaks.
-- ... nicestFit column version1 version2 = if version1 `fitsIn` spaceRemaining then version1 else version2 where spaceRemaining = width - (min width column) fitsIn :: String -> Int -> Bool _ `fitsIn` remainingWidth | remainingWidth < 0 = False "" `fitsIn` remainingWidth = True ('\n':_) `fitsIn` remainingWidth = True (char : chars) `fitsIn` remainingWidth = chars `fitsIn` (remainingWidth - 1)
Hopefully those two functions are pretty straightforward. Hitting our WithOptionalBreaks constructor causes our nicestFit function to test both the versions it contains by seeing if there’s enough space for the “no breaks” version. It does this by testing one character at a time vs. the remaining width on the line. If we get to the end of the string (“”) or a newline, it fits. Otherwise we return False, which means we’ll emit the version with line breaks.
Though once you are used to it, you’ll be able to instantly figure out what recursive functions like the ones above do, you might becoming from an imperative language background, where recursion is uncommon at best. There’s nothing to be scared of, but I do recommend sitting down and working through all the steps in evaluating a recursive procedure every now and then. It might be helpful to sit down with GHCi and maybe even a pencil and work through a step-by-step evaluation of a call to printWithWidth.