Archive

Archive for the ‘Vaguely Instructional’ Category

Incremental Haskell III: Tokyo Drift

November 2nd, 2009

Prelude

This is part 3 in a series of articles designed to help those new to Haskell work through the highly-recommended book, Real World Haskell. In them, I try to keep things at a digestible pace and not to presuppose too much. Here are links to the previous articles. I hope you find these helpful, and welcome comments on the Haskell subreddit.

Mo’ problems, mo types

As we mentioned in the previous article, it would be nice to be able to customize our JSON output. Right now, we have a default implementation of printing output that isn’t very flexible, not supporting things like “word wrapping” (soft line breaks).

Certainly, being able to format output text in a flexible manner is not a problem specific to JSON (though that will be our current application of it). Because it’s a new “problem to solve,” we’ll start from the top with a new type. We want to represent a chunk of text that can be formatted.

data FormattedText =

The simplest type of formatting is no formatting at all (just like momma always said).

data FormattedText = Text String

We want the ability to break text over more than one line if it’s too wide. We’ll add a data constructor to our type that represents these line breaks. Then when we go to render the text, when we match this constructor, we can go to the next line.

data FormattedText =  Text String
                    | Linebreak

We also want to support a soft line break, which will be a space character if we aren’t out of space on a line, or a newline otherwise. So we’ll make a WithOptionalBreaks data constructor to represent this.

data FormattedText =  Text String
                    | Linebreak
                    | WithOptionalBreaks FormattedText FormattedText

As you can imagine, a real document (JSON or otherwise) is going to consist of chunks of text combined together. Some chunks will have different formatting than others, but together they make up a whole document of formatted text, which we can call a formatted document. And since a formatted document is just formatted text, if we combine it with more formatted text, we still have a formatted document. I hope that makes sense, even if it does involve some “recursive thinking.” The concept translates very well into Haskell:

data FormattedText =  Text String
                    | Linebreak
                    | WithOptionalBreaks FormattedText FormattedText
                    | FormattedDocument FormattedText FormattedText

That is, we represent a whole document as two chunks of FormattedText, which can include plain Text, Linebreaks, even other FormattedDocuments that are themselves made up of the same thing. If you know anything about data structures, you’ll recognize a tree structure here.

Defining custom operators in Haskell

Almost every language has operators, which give abstract symbols meaning to make certain expressions read more smoothly. Of course, this concept originally comes from mathematics, where being able to tersely express things is important.

Languages with operators tend to at least include ASCII equivalents of the basic arithmetic symbols like + and -. Some language designers decided to “overload” operators, giving them different meanings depending on context. In Java, for example, you concatenate strings thus:

String newString = "hello " + "world";

Most people tend to be in favor of operators; having to write Plus(1, 3) every time you wanted to add 1 and 3 isn’t quite as convenient, and one could argue less readily apparent. Not everyone is in agreement on operator overloading, however, as it can introduce ambiguity when one symbol has more than one meaning.

Haskell programmers are big fans of operators. I’m sure this has a lot to do with the math culture that is quite prevalent in the community and user base. Haskell and its community are quite bold in using abstract concepts in everyday code, and often it makes more sense to define an operator than give an otherwise inadequate name to a function.

The language lets you define your own operators, including precedence. Functions that you name using only non-alphanumeric symbols are automatically considered operators and can be used just like you’d use + or -. Just like functions, operator names need to be unique and there is no overloading allowed (in the traditional sense). That is, you don’t see 1 + 1 on one line and “hello ” + “world” on the next. The (+) function is coded to accept only numeric inputs, and so there wouldn’t be a way to add Strings with (+) without String being numeric.

So what’s that have to do with what we’re doing?

The type we just wrote is perfectly suited to having a custom operator or two to help us out. Right now, if we want to build a document up out of text chunks, we have to do something like this:

FormattedDocument (Text "hi") (FormattedDocument (Text "howdy") (Text "hello"))

We can quickly and easily define an operator that makes the concept of combining text more palatable:

(<++>) :: FormattedText -> FormattedText -> FormattedText
textChunk <++> anotherChunk = FormattedDocument textChunk anotherChunk

Since FormattedText concatenation is conceptually kinda like String concatenation, I chose an operator that looks like (++).

Now we can avoid “parentheses nesting hell”:

Text "Knights " <++> Text "who say" <++> Linebreak <++> Text "ni"

As you can see above, we can add text to other text, and add line breaks where we want. We can use this same approach for our optional line breaks. Let’s make another operator that represents a possible breaking spot in a line:

(</>) :: FormattedText -> FormattedText -> FormattedText
textChunk </> anotherChunk = textChunk <++> softLineBreak <++> anotherChunk

So when we two pieces of text, we join them together with a soft line break–the kind that ends up being a newline if anotherChunk didn’t fit on the current line. Otherwise, it’ll just become a space.

Our WithOptionalBreaks constructor handles these two possibilities (break or just space) for us:

softLineBreak = WithOptionalBreaks (Text " ") (Linebreak)

Back to rendering JSON

Now that we have our type and the ability to combine values of our type, let’s take a look at how we’ll use it to “pretty print” a JSON string value. This isn’t quite as trivial as it sounds because we need to pay attention to characters in the string that need to be escaped with backslashes.

renderJsonString :: String -> FormattedText
renderJsonString string = Text "\"" <++> Text (escapeChars string) <++> Text "\""
    where escapeChars chars = concatMap escapeChar chars
          escapeChar char   = -- ??

We’re breaking the problem into chunks, keeping individual functions small. Our main function, renderJsonString, just surrounds the string in quotation marks, and then calls a helper function escapeChars on the string. escapeChars, in turn, is going to concatMap an escape function over each individual character in the string.

In case you still aren’t used to the map family of functions yet, they’re the ones that pass each element of a provided list to a specified function. A new list is returned containing the results of each function application. In this case, each result is going to be a String (which, don’t forget, is just a [Char]). We only want one String out of our escapeChars function, so we use the concatMap version of map. It just smooshes all the results together, which is great for scenarios like this.

On patterns that need to be matched

In the previous installment, we talked about covering all our bases when pattern matching. If you are matching against a Maybe, it’s a good idea to think about what happens if the value is Nothing as well as if it is Just something. If our input is a list, what if it’s empty?

These are great things to think about to avoid errors. However, Haskell programmers feel very passionate about avoiding redundant code, so it’s okay to avoid special-casing a pattern if we won’t gain anything by it. The escapeChars function above is a good example, as it takes a list. We could have done the following:

escapeChars [] = ""
escapeChars chars = concatMap escapeChar chars

But actually we don’t need to, because concatMap already handles the empty list case for us in the same way! (If the list passed to concatMap is empty, it returns an empty list. In Haskell this looks like:)

concatMap _ [] = []

(Remember, the underscore pattern is used when we don’t care about the value of an argument. In this case, it doesn’t matter what function is passed to concatMap, if the list is empty, it just returns [].)

Escaping special characters

Our escapeChar function will take a character from a string, find out if it needs to be escaped, and if so, return an escaped version. Otherwise it’ll just return the character.

-- ...
          escapeChar char   = case lookup char asciiEscapeTable of
                                Just escaped -> escaped
                                Nothing      -> -- ??

Here, we are pattern matching the result of our check. The check is done with the lookup function, a standard library function which searches a list of key-value pairs for a given key (in our case, the character we’re checking). Since lookup returns Maybe a result, we match against Just (meaning the character was found in the list of characters-needing-escape) and Nothing. For the first case, we grab the escaped string and return it.

Our list of escaped characters, asciiEscapeTable, contains the most common escape sequences. We will build it by feeding a table-creation function two lists: a list of characters that need escaping, and a list of their escaped versions.

asciiEscapeTable :: [(Char, String)]  -- a list of key-value pairs, commonly called an 'association list'
asciiEscapeTable = mysteryTableMakerFunction ['\b','\n','\f','\r','\t','\\','"','/'] ["\\b","\\n","\\f","\\r","\\t","\\\\","\\\"","\\/"]

So far, so good–assuming we can find a function that takes two lists and makes one list of pairs out of them (sequentially pairing one element from the first list with one from the second). If we think about what the type of such a function would be, we come up with:

[a] -> [b] -> [(a, b)]

Remember Hoogle? If we go there and search for a function with that type, it tells us straight away that the function we want is called zip, and it’s in the Prelude so we don’t even need to import it. (In fact, zip is a common functional programming building block.)

asciiEscapeTable = zip ['\b','\n','\f','\r','\t','\\','"','/'] ["\\b","\\n","\\f","\\r","\\t","\\\\","\\\"","\\/"]
But hey, lists of Chars are Strings, you tell us that all the time. And all our escapes are done with backslashes!

Very true. Using those facts, we can write a version that’s easier to type, if we want:

asciiEscapeTable = zipWith addBackslash "\b\n\f\r\t\\\"/" "bnfrt\\\"/"
    where addBackslash input output = (input, ['\\', output])

The zipWith function lets us apply a helper function before it combines the pair into a list element. We use it to simply prepend a backslash to every element of our output list.

And if the lookup returns Nothing?

Unfortunately, we aren’t done yet. We can’t just return the character because it might be a Unicode character that requires escaping, and our nice little escape table earlier was ASCII only. So we need another check to see if this is the case:

-- ...
                                Nothing | needsUnicodeEscape char -> unicodeEscape char
                                        | otherwise               -> [char]
 
needsUnicodeEscape :: Char -> Bool
needsUnicodeEscape char = char < ' ' || char == '\x7f' {-- ASCII 'del' --} || char > '\xff'

So we changed the second pattern in our case statement to be more specific, using guards. Guards allow you to specify further conditions on a pattern match. Our first guard does a check to see if we need to escape the character in a more complex way (using RWH’s choice of escaping all non-printable ASCII characters). The other guard, the otherwise, just returns the character because it means it doesn’t need escaping. (We made a singleton list out of char because the other branches of the statement are returning a String, not just a single Char, and we need to be consistent.)

Now we need to write our unicodeEscape function. To do this, we need to turn the character into a string representing its numeric Unicode value. This means a backslash, a u, and a four-digit hex string. (The JSON standard requires this format.)

unicodeEscape :: Char -> String
unicodeEscape char = "\\u" ++ padWithZeroes (toHex char)
    where padWithZeroes string = -- ??
          toHex char = -- ??

We need to ensure our hex value is four digits, so we’ll pad it with zeroes if necessary. This means finding out the length of our hex, and throwing a 0 in front for each place less than four.

To help us, we’ll use a handy little function called replicate that (as you’d expect) replicates its argument. You specify how many times it should do this.

-- ...
    where padWithZeroes string = replicate (4 - length string) '0' ++ string

In order to convert a character to its hex value, we first need to get a numeric value of the Char. Hoogle can help us if we search for the type signature of such a function (that is, Char -> Int). The result that suits our needs is ord, in Data.Char (it gives the ordinal value of a character). We then need to turn our numeric value into a hex string. The Numeric module provides such a function: showHex. If we import the Numeric and Data.Char modules at the top of our file, we can finish up our toHex function snappily:

-- ...
          toHex char = showHex (ord char) ""

(showHex actually takes another argument after the input number–a String. This has uses, but we won’t bother with it here, so we just feed it an empty string.)

A note on correctness

Though our function is pretty good, Unicode values actually go up higher than the JSON “\uxxxx” syntax allows. The standard describes how to express characters that have values higher than four digits. RWH has already implemented this; look for the astral function in chapter 5. Feel free to incorporate it if you want!

Now to pretty printing!

Now that we have our string rendering, we’re ready to get on to flexible output formatting. We’ll start with a type signature:

printWithWidth :: Int -> FormattedText -> String

The Int parameter will represent the maximum width of a line. Remember our soft line breaks? If one of those occurs in our text, we can use this width information to determine whether or not to begin a new line.

Our implementation will use the different constructors we defined for FormattedText to our advantage: we’ll use them as instructions about how to print.

printWithWidth width formattedText = processText 0 [formattedText]
    where   processText column (textChunk : remainingText) =
                        case textChunk of
                            Text string                     -> string ++ processText (column + length string) remainingText
                            Linebreak                       -> '\n' : processText 0 remainingText
                            FormattedDocument part1 part2   -> processText column (part1 : part2 : remainingText)
                            WithOptionalBreaks noBr withBr  -> nicestFit column (processText column (noBr : remainingText))
                                                                                (processText column (withBr : remainingText))
            processText _ _ = ""

A line-by-line explanation might help.

Right away, we pass column number 0 and a list containing what our function was passed to a helper function. Then we define said helper function, and have it match against every possible sort of FormattedText. The column variable will track our current position in a line. So when processText gets a string of Text, it increments the column variable by the length of the string and emits the string. When it gets a Linebreak, it emits a newline character, and resets the column count because we’re now on a fresh line. In both cases, we recursively continue processing the [formattedText] list.

But that list only has one element, right?

Well yes, at the start. But if we pass in a FormattedDocument (which as you remember, contains other FormattedTexts), we grow our remainingText list by prepending both halves of the document “tree.” This action in and of itself doesn’t move our column position on the line; the recursive call to processText will take care of that later.

The most interesting piece

Remember we made the WithOptionalBreaks constructor to hold the two versions of a soft line break: the space version, and the newline version. The reason we did this will now become clear. What we’re going to do with our nicestFit function is compare the lengths of those two versions of text based on what column position we’re on in the line. If the “no breaks” version would be too long, we’ll emit the version with line breaks.

-- ...
            nicestFit column version1 version2 = if version1 `fitsIn` spaceRemaining
                                                 then version1
                                                 else version2
                        where spaceRemaining = width - (min width column)
 
fitsIn :: String -> Int -> Bool
_              `fitsIn` remainingWidth | remainingWidth < 0 = False
""             `fitsIn` remainingWidth                      = True
('\n':_)       `fitsIn` remainingWidth                      = True
(char : chars) `fitsIn` remainingWidth                      = chars `fitsIn` (remainingWidth - 1)

Hopefully those two functions are pretty straightforward. Hitting our WithOptionalBreaks constructor causes our nicestFit function to test both the versions it contains by seeing if there’s enough space for the “no breaks” version. It does this by testing one character at a time vs. the remaining width on the line. If we get to the end of the string (“”) or a newline, it fits. Otherwise we return False, which means we’ll emit the version with line breaks.

A side note on self-instruction

Though once you are used to it, you’ll be able to instantly figure out what recursive functions like the ones above do, you might becoming from an imperative language background, where recursion is uncommon at best. There’s nothing to be scared of, but I do recommend sitting down and working through all the steps in evaluating a recursive procedure every now and then. It might be helpful to sit down with GHCi and maybe even a pencil and work through a step-by-step evaluation of a call to printWithWidth.

Vaguely Instructional , , ,

Incremental Haskell, Part 2

October 13th, 2009

Prelude

This is part 2 in a polyonymous series that seeks to be a companion guide for those learning Haskell through reading Real World Haskell. In case you missed it, here’s the full intro and part 1.

When we last finished, we ended up with a swell type for representing JSON data and some functions for converting from existing types like String to our new type and back again. We also thought about a couple more functions we’d need to complete our library, including one for converting our Haskell representation of a JSON type to real JSON. We tentatively gave this function the name toJson, with the logical type JsonType -> String. After all, the goal is to get from our custom type to something we can print to the screen or to a file, and we’d like our data in String form to do that.

On separation of concerns

There are a couple of advantages to keeping functions small and focused. Sure, we could write a function that takes a JsonType and prints it right away, instead of returning a String. But what if we later decide we want to compress the string before printing? That way it’d be smaller and thus take up less bandwidth if we need to send it over a network (a common practice with JSON-serialized data). What if we also want to provide an option for legibly-formatted, human-readable output?

If we try to account for these (and more) possibilities, our function soon becomes bloated. We’ll likely end up tearing pieces of it out to reuse anyway, so it’s better to keep in mind ahead of time that functions should do one thing. This is sound advice in any language, but there are a couple more especially enticing reasons to do it in Haskell:

  • The type signature of a function can tell you a lot about it. We’ll see in this article how that gives you an advantage that even something like Visual Studio’s Intellisense is hard-pressed to match. If your function becomes bloated, trying to do too much, then you sacrifice this nicety.
  • Haskell makes a bigger deal about side effects than most programming languages. Side effects are the parts of a program that affect global state–that is, when a function interacts with something it didn’t create. This sort of code can be tough to reason about, as the function interacting with global state cannot tell what else might interact with it. Successful Haskellers have found that properly controlling and segregating code that has side effects (like I/O) is a boon to maintainability. The general principle is, code that doesn’t have to have side effects shouldn’t. This would include our example of getting our parsed JSON data into a string suitable for printing: eventually we do need to print it out, which is I/O, but the actual construction of the appropriate string doesn’t involve I/O, and can be performed separately. We’ll continue to focus on this concept.

How to say toString() in Haskell

Many mainstream object-oriented languages have a hierarchy of types such that all types are subtypes of a base “Object” class. Very frequently, this Object class (and thus all classes) has a method that is used to print a string representation of itself. In such a language, our JsonType might override this method so that when printed, JsonType values would come out looking like real chunks of JSON.

Haskell has similar functionality, but for a different purpose. Whereas the typical toString() method mentioned above is generally not of much use until overridden, and the format of its output unspecified, Haskell has a function called show that provides a more consistent approach. Specifically, calling show on a value will return a string that, if read back into Haskell, would yield the value passed. That is, its output is designed to be not just human-readable, but compiler-readable. For example, passing a String value to show will return it surrounded in double quotes and containing properly escaped special characters.

We get a default implementation for show when we put “deriving Show” in the declaration of a new type. If you recall, we indeed did this for JsonType. That’s why GHCi is able to print out our JsonType values–GHCi always calls show on the result of an expression!

If for some reason we aren’t satisfied with the show that Haskell gives a type automatically, we can define it ourselves. It’s recommended, though, that if we do so, it should behave as we described above–the output should be valid Haskell. Since we want our toJson function to produce valid JSON, not valid Haskell, we aren’t going to redefine JsonType’s show.

What the heck, man? Then why did you just tell me about show?

Fear not–show will still help us out! A couple of our JsonTypes are really Strings and Doubles, and so once we extract them out of the JsonString and JsonNumber constructors respectively, we can call show on the values, and Haskell will format them to strings representing valid Haskell syntax. For strings and numbers, JSON’s syntax and Haskell’s are the same, so we get these cases taken care of for free:

toJson :: JsonType -> String
 
toJson (JsonString string) = show string
toJson (JsonNumber number) = show number

However, if we tried to do the same thing with JsonBool, we’d be in trouble: if you try show True in GHCi, you’ll see it returns “True”, and JSON wants lowercase booleans. Thus:

toJson (JsonBool True)  = "true"
toJson (JsonBool False) = "false"

Likewise, trying show JsonNull in GHCi gives “JsonNull”, and that’s not what we want.

toJson JsonNull = "null"

Printing a JSON “object” is also a bit more complicated. In JSON, an object looks like this:

{ "some key": value, "another key": anotherValue }

In other words, curly braces surround the object, pairs are delimited by commas, and a pair consists of a string followed by a colon followed by a value (which can be of any JSON type). Another wrinkle is that there can be any number of pairs, including zero. This means we’re going to use a common functional programming pattern:

  • Our primary function does the stuff that will happen for sure, and passes the processing that can vary to a helper function.
  • The helper function needs to handle the cases of what to do when there’s useful input, and what to do when there isn’t. Commonly, input of this sort is a list, so we at least need to handle the empty list vs. a list with elements.

In our case:

-- Remember, we defined JsonObject like this: JsonObject [(String, JsonType)]
toJson (JsonObject object) = "{" ++ toKeyValuePairs object ++ "}"
    where toKeyValuePairs []    = ""
          toKeyValuePairs pairs = -- ??

We’re off to a good start. Our top-level function is doing the part that will always be done: surrounding the object with curly braces. Our helper function to handle the rest is called toKeyValuePairs, and we define it using the common Haskellism of putting it in a where clause. When a function is supplementary like this, it doesn’t need to be available to the whole module, and it automatically has access to values introduced in the function it’s attached to. Perhaps most importantly, it allows our functions to read quite nicely, allowing a reader to digest a function a chunk at a time. It might look strange at first, but it’s a handy feature you’ll soon wish was in more languages.

Our helper function is primed and ready to handle any or no input. In fact, no input was easy: just return the empty string and we’re done. The other case takes a bit more work. Thinking of our algorithm, we need to turn our list of pairs into strings with the format “key”: value. Then we need to join those strings with commas.

-- ...
        toKeyValuePairs pairs     = joinWithCommas (map pairToString pairs)
        pairToString (key, value) = show key ++ ": " ++ toJson value
 
joinWithCommas :: [String] -> String
joinWithCommas stringList = -- ??

There’s a bit going on here, but it’s hopefully not too hard to follow. The part in parentheses comes first–the “turn list of pairs into list of properly formatted strings” part. For this, we use map, which you’ve probably come across before–it’s a functional programming building block. It applies a function to everything in a list, spitting out a new list containing the results of each application. We want each of our pairs turned into a string, so we’ll pass a function–let’s call it pairToString–and our list of pairs to map.

We then write pairToString. Remember, map will be passing it one element of the list of pairs at a time. We use pattern matching to break up the passed-in pair into its key and value parts. The key part needs to be printed as a properly escaped string in double quotes, so we’ll use the same trick we did earlier and use the show function. We then append the colon. Now we need to print out the value. The problem is, value can be any valid JSON type, and different types get printed out in different ways. If only we had a function that handled the printing of every JSON type! Oh wait, we’re writing it, and it’s called toJson :)

Now we need to join our resulting list of formatted strings with commas. It’s reasonable to assume that joining a list, delimiting elements with a certain thing, is already in the standard library (as it is in many languages, for strings at least). Problem is, we may not know exactly what that function is called. We could just write it ourselves, but it’s of course better not to have to replicate effort if we don’t have to.

Earlier I mentioned how Haskell’s cool type system can help us know what a function does merely by knowing its type. In fact, an entire method of searching the Haskell API is built upon this concept. One website that uses this method to allow us to find functions is Hoogle.

So let’s go to Hoogle, and search for the type of the function we’re looking for. We want a function that takes a list of Strings to be joined, a String to join them with, and returns a single joined String as the result. In Haskell, we write that type as:

[String] -> String -> String

So type that into the search box in Hoogle.

The first result of the search is the function we want! It’s called intercalate, and the description says it “inserts the list xs in between the lists in xss and concatenates the result.” If you think about it, that’s exactly what we’re doing. Note that Hoogle found this function for us even though we didn’t even type in the type exactly. We had the arguments swapped, and we specified String even though intercalate can really use any type of list, not just [Char] (which is all a String is). Cool, eh? Use Hoogle a lot.

Before we go using the function we found in our code, we should take note of where this function is defined. Unless it’s in the Prelude (the name of the Haskell library that’s imported by default), we have to explicitly import the specified library so that the function name can be found. This practice is common in many languages, as the standard library tends to be divided up into namespaces/modules/whatever. (An apology to PHP programmers, who are used to having every standard library function ever written smooshed into the default namespace.)

Hoogle notes in soothing green text that intercalate resides in the module Data.List, so add the following to the top of your source file:

import Data.List (intercalate)

Now we can finish up our joinWithCommas function:

joinWithCommas stringList = intercalate ", " stringList

Now that printing JSON objects is taken care of, we’re just left with JSON arrays. The process will be almost exactly the same as for objects, except we don’t have to print keys, only values. Oh, and we use square brackets instead of curly braces.

toJson (JsonArray array) = "[" ++ toValues array ++ "]"
    where toValues []     = ""
          toValues values = joinWithCommas (map toJson values)

Again we use map, this time to apply our toJson function over all the values in the underlying list we’re using to represent our JSON array. Then we join the resulting properly formatted strings with the joinWithCommas function we just fleshed out.

Here, then, is the full toJson function:

toJson :: JsonType -> String
 
toJson (JsonString string) = show string
toJson (JsonNumber number) = show number
toJson (JsonBool True)     = "true"
toJson (JsonBool False)    = "false"
toJson JsonNull            = "null"
toJson (JsonObject object) = "{" ++ toKeyValuePairs object ++ "}"
    where toKeyValuePairs []        = ""
          toKeyValuePairs pairs     = joinWithCommas (map pairToString pairs)
          pairToString (key, value) = show key ++ ": " ++ toJson value
toJson (JsonArray array)   = "[" ++ toValues array ++ "]"
    where toValues []     = ""
          toValues values = joinWithCommas (map toJson values)

Not bad! Now that we’ve taken care of formatting the JSON data, we just need to actually print it. For the time being, we’ll write a quick function that will print to the screen:

printJson jsonValue = putStrLn (toJson jsonValue)

In the next article, we’ll play with different ways of rendering our JSON data.

Vaguely Instructional , , ,