Part 2: Tokenizing and Reading
Every interpreter starts the same way: raw text goes in, structured data comes out. This phase has two stages — tokenizing (breaking text into tokens) and reading (assembling tokens into nested structures). Together they constitute the front end of our interpreter.
The Front-End Pipeline
%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
flowchart TB
src["(+ 1 2)\nraw string"]
tok["Tokenizer\nlexical analysis"]
tokens["LPAREN · PLUS · 1 · 2 · RPAREN\ntoken list"]
par["Parser\nrecursive descent"]
ast["List{Symbol+, Number 1, Number 2}\nLispVal tree"]
src --> tok --> tokens --> par --> ast
classDef blue fill:#0173B2,color:#fff,stroke:#0173B2
classDef orange fill:#DE8F05,color:#fff,stroke:#DE8F05
classDef teal fill:#029E73,color:#fff,stroke:#029E73
class src blue
class tok,par orange
class tokens,ast teal
CS Concept: Lexical Analysis
Lexical analysis (or lexing, or tokenizing) is the process of grouping a stream of characters into meaningful units called tokens. A token is the smallest unit of syntax — a number, a string, a symbol, a parenthesis.
Lexical analysis answers: "what are the words?" Parsing answers: "what do the words mean structurally?"
For Scheme, the token types are:
| Token | Examples |
|---|---|
| Left paren | ( |
| Right paren | ) |
| Number | 42, -7, 3.14 |
| String | "hello", "world" |
| Boolean | #t, #f |
| Symbol | +, define, x, my-var |
No keywords — define, if, lambda are just symbols. The interpreter, not the lexer, gives them special meaning.
CS Concept: Context-Free Grammars
The structure of S-expressions can be described as a context-free grammar (CFG):
s-expr ::= atom | list
list ::= '(' s-expr* ')'
atom ::= number | string | boolean | symbol
%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
flowchart LR
SE["s-expr"]
A["atom"]
L["list\n'(' s-expr* ')'"]
N["number"]
S["symbol"]
ST["string"]
B["boolean"]
SE -->|"is either"| A
SE -->|"or"| L
L -->|"contains zero or more"| SE
A --> N
A --> S
A --> ST
A --> B
classDef blue fill:#0173B2,color:#fff,stroke:#0173B2
classDef orange fill:#DE8F05,color:#fff,stroke:#DE8F05
classDef teal fill:#029E73,color:#fff,stroke:#029E73
class SE blue
class A,L orange
class N,S,ST,B teal
This grammar is recursive: a list contains zero or more S-expressions, each of which may itself be a list. This recursive structure is what makes a recursive descent parser the natural implementation strategy.
Context-free means the rule for expanding a non-terminal does not depend on what surrounds it. This is less expressive power than natural languages require, but it is enough for the syntax of most programming languages, and certainly for Scheme's.
The LispVal Type
Before parsing, we define the type that represents all Lisp values throughout the interpreter. In Go, we use an interface with a marker method — a technique for creating closed sum types without language-level discriminated unions.
// LispVal is the sum type for all Scheme values.
// The lispVal() marker method prevents accidental implementations.
type LispVal interface {
	lispVal()
}

type Number struct{ Value float64 }
type Symbol struct{ Value string }
type Str struct{ Value string }
type Bool struct{ Value bool }
type List struct{ Values []LispVal }

type Lambda struct {
	Params []string
	Body   LispVal
	Env    *Env
}

type Builtin struct{ Fn func([]LispVal) (LispVal, error) }
type Nil struct{}

func (Number) lispVal()  {}
func (Symbol) lispVal()  {}
func (Str) lispVal()     {}
func (Bool) lispVal()    {}
func (List) lispVal()    {}
func (Lambda) lispVal()  {}
func (Builtin) lispVal() {}
func (Nil) lispVal()     {}

%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
graph LR
LV["LispVal\ninterface"]
N["Number\nValue float64"]
ST["Str\nValue string"]
B["Bool\nValue bool"]
SY["Symbol\nValue string"]
LI["List\nValues []LispVal"]
LA["Lambda\nParams []string\nBody LispVal\nEnv *Env"]
BU["Builtin\nFn func([]LispVal) (LispVal, error)"]
NL["Nil\nempty list ()"]
LV --> N
LV --> ST
LV --> B
LV --> SY
LV --> LI
LV --> LA
LV --> BU
LV --> NL
classDef blue fill:#0173B2,color:#fff,stroke:#0173B2
classDef orange fill:#DE8F05,color:#fff,stroke:#DE8F05
classDef teal fill:#029E73,color:#fff,stroke:#029E73
classDef purple fill:#CC78BC,color:#fff,stroke:#CC78BC
class LV blue
class N,ST,B,SY orange
class LI,LA teal
class BU,NL purple
Why an interface with a marker method? It restricts the type switch in eval to a known set of cases. No type can satisfy LispVal by accident: the only way in is to implement lispVal(), which is an intentional act. In F#, this exhaustiveness is enforced by the compiler on discriminated unions; in Go, we enforce it by convention.
Tokenizer
The tokenizer takes a string and produces a flat list of token strings. Go's strings package makes this concise:
func tokenize(input string) []string {
	// Insert spaces around parens so splitting on whitespace works uniformly
	s := strings.NewReplacer("(", " ( ", ")", " ) ").Replace(input)
	return strings.Fields(s)
}

strings.Fields splits on any whitespace and discards empty strings — it handles tabs, newlines, and multiple spaces automatically.
What the tokenizer does NOT do: it does not assign types to tokens. The string "42" comes out as the string "42", not as a number. Typing happens in the next stage.
For string literals with spaces (e.g., "hello world"), a more sophisticated tokenizer is needed. The version above handles the core Scheme subset covered in this series.
Parser: Recursive Descent
The parser converts a flat list of token strings into a nested LispVal. The algorithm is a classic recursive descent parser — each grammar rule corresponds to a function.
func parseAtom(token string) LispVal {
	if token == "#t" {
		return Bool{Value: true}
	}
	if token == "#f" {
		return Bool{Value: false}
	}
	if f, err := strconv.ParseFloat(token, 64); err == nil {
		return Number{Value: f}
	}
	return Symbol{Value: token}
}

func parseExpr(tokens []string, pos int) (LispVal, int, error) {
	if pos >= len(tokens) {
		return nil, pos, fmt.Errorf("unexpected end of input")
	}
	tok := tokens[pos]
	switch tok {
	case "(":
		return parseList(tokens, pos+1)
	case ")":
		return nil, pos, fmt.Errorf("unexpected ')'")
	default:
		return parseAtom(tok), pos + 1, nil
	}
}

func parseList(tokens []string, pos int) (LispVal, int, error) {
	var values []LispVal
	for pos < len(tokens) && tokens[pos] != ")" {
		val, newPos, err := parseExpr(tokens, pos)
		if err != nil {
			return nil, newPos, err
		}
		values = append(values, val)
		pos = newPos
	}
	if pos >= len(tokens) {
		return nil, pos, fmt.Errorf("missing closing ')'")
	}
	return List{Values: values}, pos + 1, nil
}

Unlike the F# version, which returns (LispVal, remaining tokens), the Go version passes a positional index through the token slice. This avoids allocating new slices at each recursive step.
Parsing (+ 1 2) Step by Step
%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
sequenceDiagram
participant T as Token Stream
participant PE as parseExpr
participant PL as parseList
participant PA as parseAtom
T->>PE: LPAREN PLUS 1 2 RPAREN
PE->>PL: sees LPAREN, delegate: PLUS 1 2 RPAREN
PL->>PE: token PLUS (not RPAREN)
PE->>PA: PLUS
PA-->>PE: Symbol plus
PE-->>PL: Symbol plus, pos advances
PL->>PE: token 1 (not RPAREN)
PE->>PA: 1
PA-->>PE: Number 1.0
PE-->>PL: Number 1.0, pos advances
PL->>PE: token 2 (not RPAREN)
PE->>PA: 2
PA-->>PE: Number 2.0
PE-->>PL: Number 2.0, pos advances
PL->>PE: token RPAREN, done
PL-->>PE: List of Symbol-plus Number-1 Number-2
PE-->>T: List of Symbol-plus Number-1 Number-2
Recursive Descent = Grammar as Code
The mutual recursion between parseExpr and parseList mirrors the grammar's mutual recursion between s-expr and list:
%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
flowchart TB
subgraph Grammar
SE2["s-expr"]
L2["list"]
A2["atom"]
SE2 -->|"is a"| L2
SE2 -->|"or"| A2
L2 -->|"contains"| SE2
end
subgraph Code
PE["parseExpr"]
PL["parseList"]
PA["parseAtom"]
PE -->|"calls"| PL
PE -->|"calls"| PA
PL -->|"calls"| PE
end
SE2 -. "implemented by" .-> PE
L2 -. "implemented by" .-> PL
A2 -. "implemented by" .-> PA
classDef blue fill:#0173B2,color:#fff,stroke:#0173B2
classDef teal fill:#029E73,color:#fff,stroke:#029E73
class SE2,L2,A2 blue
class PE,PL,PA teal
Putting It Together: The Read Function
func read(input string) (LispVal, error) {
	tokens := tokenize(input)
	if len(tokens) == 0 {
		return nil, fmt.Errorf("empty input")
	}
	expr, pos, err := parseExpr(tokens, 0)
	if err != nil {
		return nil, err
	}
	if pos != len(tokens) {
		return nil, fmt.Errorf("unexpected tokens after expression")
	}
	return expr, nil
}

Test it:

v, _ := read("(+ 1 2)")
// → List{Values: []LispVal{Symbol{Value: "+"}, Number{Value: 1}, Number{Value: 2}}}

v, _ = read("(define x 10)")
// → List{Values: []LispVal{Symbol{"define"}, Symbol{"x"}, Number{10}}}

v, _ = read("(lambda (x y) (* x y))")
// → List{Values: []LispVal{Symbol{"lambda"}, List{[]LispVal{Symbol{"x"}, Symbol{"y"}}}, ...}}

The parser has no knowledge of what define or lambda mean — those are just symbols. The evaluator (Part 3) gives them meaning.
CS Concept: Why Recursive Descent?
Recursive descent is not the only parsing algorithm. There are table-driven parsers (LL, LR, LALR) used by most production compiler generators. Why recursive descent here?
- Simplicity — each grammar rule is a function. The code structure mirrors the grammar exactly.
- Scheme's grammar is LL(1) — at every point, one token of lookahead is sufficient to decide which rule applies.
- Good error messages — the call stack tells you exactly where in the grammar parsing failed.
- Hand-written = transparent — reading a hand-written recursive descent parser is much more instructive than reading a generated parser.
What We Have
After Part 2, we can transform any valid Scheme expression from text into a structured LispVal tree:
%% Color palette: Blue #0173B2, Orange #DE8F05, Teal #029E73, Purple #CC78BC, Brown #CA9161, Gray #808080
%% Parses: (if (> x 0) x (- 0 x))
graph LR
root["List"]
if_sym["Symbol: if"]
test["List - test"]
gt_sym["Symbol: gt"]
x1["Symbol: x"]
zero1["Number: 0"]
conseq["Symbol: x"]
alt["List - alternate"]
minus["Symbol: minus"]
zero2["Number: 0"]
x3["Symbol: x"]
root --> if_sym
root --> test
root --> conseq
root --> alt
test --> gt_sym
test --> x1
test --> zero1
alt --> minus
alt --> zero2
alt --> x3
classDef blue fill:#0173B2,color:#fff,stroke:#0173B2
classDef orange fill:#DE8F05,color:#fff,stroke:#DE8F05
classDef teal fill:#029E73,color:#fff,stroke:#029E73
class root blue
class test,alt orange
class if_sym,gt_sym,x1,conseq,x3,minus,zero1,zero2 teal
In Part 3, we implement the environment model and the core eval/apply loop that gives these trees meaning.
Last updated April 19, 2026