= Parsing Forth == Parsing = String -> AST We defined a representation of what a Forth program looks like. The AST is just a bit of data that is easier to manipulate than strings. Both strings (i.e., files or snippets of Forth code) and the AST represent forth programs that can be executed. The reason we have both is these humans involved in programming. Forth programmers don't like writing ASTs - it's tedious, and annoying. Compiler writers on the other hand don't like manipulating strings - it's tricky, error prone, and tends to mix content and representation too much. The process to convert from strings to ASTs is not very trivial though, otherwise no one would care for ASTs. Writing grammars, on the other hand, is rather simple. Grammars are a description of how a string is allowed to be shaped, and they also embed code to create an AST by matching parts of the grammar. Haskell has a library for a very simple style of parsing, called Parsec. Lets look at `src/Harrorth/Parser.hs`, and see what we can make of it. == Introducing Parsec Parsec is a monadic parser combinator library. Sounds tough? It's those properties that make it fun to use. Lets see what they mean by writing our parser. == `src/Harrorth/Parser.hs` First and foremost we declare that this file provides the module `Harrorth.Parser`. Next we see two `import` clauses. These give us the Parsec API, and the AST we defined in the previous chapter. module Harrorth.Parser where import Text.ParserCombinators.Parsec import Harrorth.AST On with the good stuff. == The root of the grammar Next we see the definition of a parser for a forth program. This is where the "simple" part comes in - the one about parser combinator blah blah blah. We define the parser for a forth program by combining two parsers: * A parser for forth expressions * A parser for EOF forthProgram :: Parser Forth forthProgram = do ast <- forth eof return ast It's type, `forthProgram :: Parser Forth` means that it's a function that returns the type `Forth` (recall, `[Exp]`), within the Parser monad. The parser monad is the "monadic" part of what Parsec is. Monads are, as you probably may have head, representations of computations, and err, things that can fail, and uhm, more stuff. Forget about that for now. You can go try to play around with the "Meet the Monads" tutorial: http://www.nomaware.com/monads/html/meet.html The parsing process of a combinator parser is combining parsers, and making sure they can parse the input they got in the order they were defined. This process is provided by Parser's implementation of `>>=`, the monadic bind operator. == Do notation and monads Do notation, as you can see mentioned in a billion and one monad tutorials, is just short hand for gluing stuff together with monadic binds. When translate the above from do notation, here's what it looks like: forthProgram :: Parser Forth forthProgram = do forth >>= \ast -> -- parse forth, feed result into a function that takes `ast` eof >>= \_ -> -- and parses eof, feed result into a function that return ast -- ignores it's params, and returns `ast` What Parsec's `>>=` gives us is a routine that will try to parse some input, and then try the next parser. If something doesn't work out, it gives up. Do notation lets us say it more clearly: * first parse forth, collect result * then parse eof * then return result The Parser monad makes sure that the combination of `forth` and `eof` works out. `return` is also monadic nonsense. It puts `ast` back into the Parser monad. This will allow the value coming out of `forthProgram` to be combined with other parsers. Monads, as I see them now, are a way to intermix something that must happen in a sequence, cleanly with other data. Parsing takes stuff out of a string, and then tries to apply functions that build a result. By using a monadic library, we can leverage types to get the application of the parsing logic automatically. All we have to do is declare it properly. === Our combined parsers `eof` is a builtin parser, and it should be easy to guess what it does ;-). Let's look at `forth` instead: forth :: Parser Forth forth = sepEndBy forthExp sep It, like `forthProgram` returns a `Forth`. It uses `sepEndBy` to parse a series of `forthExp`s, separated by `sep`. `sep` is just white space, thrown away. `forthExp` is of type `Parser Exp`, meaning that it returns `Exp`s in the parser monad. Recall that an `Exp` is either a definition of a word, a literal, or a word. `forthExp` uses Parsec's `<|>` operator to denote alternatives. The last thing we will look at is `newWord`. You should be able to figure the rest out for yourself. `newWord` takes the definition of a forth word, which looks like : foo bar ; meaning that `foo` is now defined to be `bar`. Lets examine it: newWord :: Parser Exp newWord = do char ':' maybeSep name <- wordName sep body <- forth maybeSep char ';' return $ NewWord name body First it matches a single colon, then an optional separator, then a wordName - just some letters. It extracts the name and puts it into a variable. After that it matches a separator, and then the body of the definition - some forth code! We put the AST we parsed from the body of the definition into `body`, and then match the closing semicolon. The thing `newWord` returns is a `data Exp`. We'll need to use one of `Exp`s constructors to return it. Since it returns the expression to define a new word, we'll use the `NewWord` constructor, which takes a name and a definition: return $ NewWord name body The '$' means that everything on the right of it is a single value. It can be read like this: return (NewWord name body) So `NewWord name body` creates a new `Exp`, with the data we extracted from the string we just parsed. `return` is the way to insert our `Exp` into the Parser monad. We need this because otherwise our parser for new words is not a part of the Parser monad, and cannot mean anything to Parsec. == The Rest literal :: Parser Exp literal = do lit <- many1 digit return $ Push (read lit) word :: Parser Exp word = do name <- wordName return $ Invoke name wordName :: Parser String wordName = do name <- many1 letter return name sep :: Parser () sep = skipMany1 space maybeSep:: Parser () maybeSep = skipMany space == Playing with the parser This is the fun part: % ghci -issrc src/Harrorth/Parser.hs *Harrorth.Parser> parseTest forthProgram "2 foo : bar 4 blah ; ding" [Push 2,Invoke "foo",NewWord "bar" [Push 4,Invoke "blah"],Invoke "ding"] WOOO!