LLVM教程（二）语法分析器和AST抽象语法树

17 Dec 2019

Reading time ~12 minutes

本文为LLVM教程的笔记（二）。

解析kaleidoscope时，可以运用递归下降解析+优先级解析于解析二元表达式及其他部分。

1.抽象语法树（AST）

语言中的每种结构都与一种AST相对应，kaleidoscope语言中的语法结构包括表达式/函数原型/函数对象。

1.1表达式AST节点

定义基类和数值常量子类，将数值常量的值存着成员变量中：

//===----------------------------------------------------------------------===//
// Abstract Syntax Tree (aka Parse Tree)
//===----------------------------------------------------------------------===//

/// ExprAST - 所有表达式节点的基类
class ExprAST {
public:
virtual ~ExprAST() {}
};

/// NumberExprAST - 数值常量子类
class NumberExprAST : public ExprAST {
double Val;
public:
NumberExprAST(double val) : Val(val) {}
};

1.2基本部分的其他表达式AST节点

为了保存变量名，定义VariableExprAST；保存运算符用BinaryExprAST；保存函数名和用作参数的表达式列表用CallExprAST。

/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
std::string Name;
public:
VariableExprAST(const std::string &name) : Name(name) {}
};

/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
char Op;
ExprAST *LHS, *RHS;
public:
BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
: Op(op), LHS(lhs), RHS(rhs) {}
};

/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
std::string Callee;
std::vector<ExprAST*> Args;
public:
CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
: Callee(callee), Args(args) {}
};

1.3函数接口与函数本身AST

kaleidoscope中，函数类型取决于参数个数（因为所有值都是双精度浮点数，所以不用保存参数类型）：

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
std::string Name;
std::vector<std::string> Args;
public:
PrototypeAST(const std::string &name, const std::vector<std::string> &args)
: Name(name), Args(args) {}

};

/// FunctionAST - This class represents a function definition itself.
/// 函数本身的定义
class FunctionAST {
PrototypeAST *Proto;
ExprAST *Body;
public:
FunctionAST(PrototypeAST *proto, ExprAST *body)
: Proto(proto), Body(body) {}

};

2.语法解析器基础

2.1准备用于构造AST的语法解析器

语法解析器需要能把输入分解为AST，比如把用户输入的“x+y”分解为：

ExprAST *X = new VariableExprAST("x");
ExprAST *Y = new VariableExprAST("y");
ExprAST *Result = new BinaryExprAST('+', X, Y);

为了达成这个目标，需要以下步骤：

2.1.1定义辅助函数

实现语元缓冲，预读next token，将CurTok视作当前待解析的语元：

/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
/// token the parser is looking at.  getNextToken reads another token from the
/// lexer and updates CurTok with its results.
static int CurTok;
static int getNextToken() {
return CurTok = gettok();
}

2.1.2报错函数

下面三个函数用于报错，解析处理过程中的错误，简单返回NULL：

/// Error* - These are little helper functions for error handling.
ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }

3.解析基本表达式

kaleidoscope中每条生成规则都需要对应的解析函数。

3.1数值常量

将语元tok_number构造为一个NumberExprAST节点，which has been represented in 1.1，令词法分析器读取下一个语元并返回这个节点。

/// numberexpr ::= number
static ExprAST *ParseNumberExpr() {
ExprAST *Result = new NumberExprAST(NumVal);
getNextToken(); // consume the number
return Result;
}

这个函数递归下降地消化了与当前生成规则相关的语元，它把下一个与本规则无关的待解析语元放进了词法解析器的语元缓冲。

3.2括号运算符

这个函数表现了语法分析器的几个特点：

展示Error函数用法。当左括号和子表达式解析完后，紧跟的符号不是右括号，这是语法解析器就会报错，这里报错方式比较简陋就是返回NULL。

递归调用ParseExpression，简化递归语法的处理。

 /// parenexpr ::= '(' expression ')'
 static ExprAST *ParseParenExpr() {
 getNextToken();  // eat (.
 ExprAST *V = ParseExpression();
 if (!V) return 0;

 if (CurTok != ')')
 return Error("expected ')'");
 getNextToken();  // eat ).
 return V;
 }

P.S. 不必为括号构造AST节点，因为括号就是为了引导表达式的分组解析过程，当语法解析器构造AST之后括号又没用了，所以其实可以不构造。

3.3处理变量引用和函数调用

这个函数和其他ExprAST函数差不多，主要是调用这个函数时，当前语元得是tok_identifier。它既有递归又有错误处理。

但是这里额外采用了“预读”，它检查标识符之后的是不是（，来决定是构造VariableExprAST还是CallExprAST。

/// identifierexpr
///   ::= identifier
///   ::= identifier '(' expression* ')'
static ExprAST *ParseIdentifierExpr() {
  std::string IdName = IdentifierStr;

  getNextToken();  // eat identifier.

  if (CurTok != '(') // Simple variable ref.
    return new VariableExprAST(IdName);

  // Call.
  getNextToken();  // eat (
  std::vector<ExprAST*> Args;
  if (CurTok != ')') {
    while (1) {
      ExprAST *Arg = ParseExpression();
      if (!Arg) return 0;
      Args.push_back(Arg);

      if (CurTok == ')') break;

      if (CurTok != ',')
        return Error("Expected ')' or ',' in argument list");
      getNextToken();
    }
  }

3.4统一入口

构造辅助函数，先判定待解析表达式的类别，预先判断待解析表达式的类型，然后调用相应函数解析：

/// primary
///   ::= identifierexpr
///   ::= numberexpr
///   ::= parenexpr
static ExprAST *ParsePrimary() {
  switch (CurTok) {
  default: return Error("unknown token when expecting an expression");
  case tok_identifier: return ParseIdentifierExpr();
  case tok_number:     return ParseNumberExpr();
  case '(':            return ParseParenExpr();
  }
}

4.解析二元表达式

二元表达式往往伴随着二义性，比如x+y*z，语法解析器可以解析为先加后乘或者先乘后加，解决这个问题需要利用运算符优先级解析。

4.1制定优先级表

kaleidoscope支持4种二元运算符，下列代码用map简化了新运算符的添加：

/// BinopPrecedence - This holds the precedence for each binary operator that is
/// defined.
static std::map<char, int> BinopPrecedence;

/// GetTokPrecedence - Get the precedence of the pending binary operator token.
static int GetTokPrecedence() {
  if (!isascii(CurTok))
    return -1;

  // Make sure it's a declared binop.
  int TokPrec = BinopPrecedence[CurTok];
  if (TokPrec <= 0) return -1;
  return TokPrec;
}

int main() {
  // Install standard binary operators.
  // 1 is lowest precedence.
  BinopPrecedence['<'] = 10;
  BinopPrecedence['+'] = 20;
  BinopPrecedence['-'] = 20;
  BinopPrecedence['*'] = 40;  // highest.
  ...
}

接下来以a+b+(c+d)*e*f+g为例子，进行运算符优先级解析（（c+d）视作主表达式之一），解析出来的第一个主表达式应该是“a”，紧跟着是若干个有序对，即：[+, b]、[+, (c+d)]、[*, e]、[*, f]和[+, g]。

4.2ParseExpression

有序对格式[binop, primaryexpr]， ParseBinOpRHS用于解析有序对列表，参数包括一个整数和一个指针，其中整数代表运算符优先级，指针则指向已经解析出来的部分表达式。

/// expression
///   ::= primary binoprhs
///
static ExprAST *ParseExpression() {
  ExprAST *LHS = ParsePrimary();
  if (!LHS) return 0;

  return ParseBinOpRHS(0, LHS);
}

4.3ParseBinOpRHS

传入ParseBinOpRHS处理的优先级是该函数所能处理的最低运算符优先级，假设语元流中的下一对是“[+,x]”，且传入ParseBinOpRHS的优先级是40，那么该函数将直接返回，因为“+”的优先级是20。

下面函数的开头先检查了优先级，如果小于就直接返回，按例子中应该是传入表达式a和当前语元+：

/// binoprhs
///   ::= ('+' primary)*
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
  // If this is a binop, find its precedence.
  while (1) {
    int TokPrec = GetTokPrecedence();

    // If this is a binop that binds at least as tightly as the current binop,
    // consume it, otherwise we are done.
    if (TokPrec < ExprPrec)
      return LHS;

当检查通过，语元是二元运算符，纳入当前表达式，保存完二元运算符之后解析主表达式，构造有序对[+,b]

// Okay, we know this is a binop.
int BinOp = CurTok;
getNextToken();  // eat binop

// Parse the primary expression after the binary operator.
ExprAST *RHS = ParsePrimary();
if (!RHS) return 0;

继续考虑表达式的结合次序，是“(a+b) binop unparsed”或“a + (b binop unparsed)”。语法解析器将先预读binop，检查优先级，然后再与上面保存的BinOp（+）作比较，发现 “binop位于“RHS”的右侧，如果binop的优先级低于或等于当前运算符的优先级，则可知括号应该加在前面，即按‘(a+b) binop …’处理。在本例中，当前运算符是‘+’，下一个运算符也是‘+’，二者的优先级相同。既然如此，理应按照“a+b”来构造AST节点”：

// If BinOp binds less tightly with RHS than the operator after RHS, let
// the pending operator take RHS as its LHS.
int NextPrec = GetTokPrecedence();
if (TokPrec < NextPrec) {

现在a+b+的前半段被解析为（a+b）+，+成为当前语元，进入下一轮迭代，上述代码将继续构造有序对[+, (c+d)]。

现在，主表达式右边的binop是*，*优先级又高于+，检查运算符优先级的if判断通过，进入if内部。

if的内部判断出它的优先级比当前解析的binop高，所以如果binop右边的若干个连续有序对都高于+，那么就该全部解析，拼成RHS并返回

    // If BinOp binds less tightly with RHS than the operator after RHS, let
    // the pending operator take RHS as its LHS.
    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec+1, RHS);
      if (RHS == 0) return 0;
    }

    // Merge LHS/RHS.
    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }
}

5.解析其余结构

5.1解析函数原型

两处函数原型：extern声明与函数定义

/// prototype
///   ::= id '(' id* ')'
static PrototypeAST *ParsePrototype() {
  if (CurTok != tok_identifier)
    return ErrorP("Expected function name in prototype");

  std::string FnName = IdentifierStr;
  getNextToken();

  if (CurTok != '(')
    return ErrorP("Expected '(' in prototype");

  std::vector<std::string> ArgNames;
  while (getNextToken() == tok_identifier)
    ArgNames.push_back(IdentifierStr);
  if (CurTok != ')')
    return ErrorP("Expected ')' in prototype");

  // success.
  getNextToken();  // eat ')'.

  return new PrototypeAST(FnName, ArgNames);
}

5.2函数定义

函数定义=函数原型+函数体表达式

/// definition ::= 'def' prototype expression
static FunctionAST *ParseDefinition() {
  getNextToken();  // eat def.
  PrototypeAST *Proto = ParsePrototype();
  if (Proto == 0) return 0;

  if (ExprAST *E = ParseExpression())
    return new FunctionAST(Proto, E);
  return 0;
}

5.3ParseExtern

除了用户自定义的前置声明，extern还可以用来声明c标准库的一些函数。extern语句可以看作不带函数体的函数原型。

/// external ::= 'extern' prototype
static PrototypeAST *ParseExtern() {
getNextToken();  // eat extern.
return ParsePrototype();
}

5.4在顶层输入任意表达式求值

/// toplevelexpr ::= expression
static FunctionAST *ParseTopLevelExpr() {
  if (ExprAST *E = ParseExpression()) {
    // Make an anonymous proto.
    PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
    return new FunctionAST(Proto, E);
  }
  return 0;
}

6.引导代码

在最外层的循环中按当前语元的类型选定相应的解析函数：

/// top ::= definition | external | expression | ';'
static void MainLoop() {
  while (1) {
    fprintf(stderr, "ready> ");
    switch (CurTok) {
    case tok_eof:    return;
    case ';':        getNextToken(); break;  // ignore top-level semicolons.
    case tok_def:    HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default:         HandleTopLevelExpression(); break;
    }
  }
}

7.总结

输入示例：

$ ./a.out
ready> def foo(x y) x+foo(y, 4.0);
Parsed a function definition.
ready> def foo(x y) x+y y;
Parsed a function definition.
Parsed a top-level expr
ready> def foo(x y) x+y );
Parsed a function definition.
Error: unknown token when expecting an expression
ready> extern sin(a);
ready> Parsed an extern
ready> ^D
$

8.完整源码

#include <cstdio>
#include <cstdlib>
#include <string>
#include <map>
#include <vector>

//===----------------------------------------------------------------------===//
// Lexer
//===----------------------------------------------------------------------===//

// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
// of these for known things.
enum Token {
tok_eof = -1,

// commands
tok_def = -2, tok_extern = -3,

// primary
tok_identifier = -4, tok_number = -5
};

static std::string IdentifierStr;  // Filled in if tok_identifier
static double NumVal;              // Filled in if tok_number

/// gettok - Return the next token from standard input.
static int gettok() {
static int LastChar = ' ';

// Skip any whitespace.
while (isspace(LastChar))
LastChar = getchar();

if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
IdentifierStr = LastChar;
while (isalnum((LastChar = getchar())))
IdentifierStr += LastChar;

if (IdentifierStr == "def") return tok_def;
if (IdentifierStr == "extern") return tok_extern;
return tok_identifier;
}

if (isdigit(LastChar) || LastChar == '.') {   // Number: [0-9.]+
std::string NumStr;
do {
NumStr += LastChar;
LastChar = getchar();
} while (isdigit(LastChar) || LastChar == '.');

NumVal = strtod(NumStr.c_str(), 0);
return tok_number;
}

if (LastChar == '#') {
// Comment until end of line.
do LastChar = getchar();
while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

if (LastChar != EOF)
return gettok();
}

// Check for end of file.  Don't eat the EOF.
if (LastChar == EOF)
return tok_eof;

// Otherwise, just return the character as its ascii value.
int ThisChar = LastChar;
LastChar = getchar();
return ThisChar;
}

//===----------------------------------------------------------------------===//
// Abstract Syntax Tree (aka Parse Tree)
//===----------------------------------------------------------------------===//

/// ExprAST - Base class for all expression nodes.
class ExprAST {
public:
virtual ~ExprAST() {}
};

/// NumberExprAST - Expression class for numeric literals like "1.0".
class NumberExprAST : public ExprAST {
double Val;
public:
NumberExprAST(double val) : Val(val) {}
};

/// VariableExprAST - Expression class for referencing a variable, like "a".
class VariableExprAST : public ExprAST {
std::string Name;
public:
VariableExprAST(const std::string &name) : Name(name) {}
};

/// BinaryExprAST - Expression class for a binary operator.
class BinaryExprAST : public ExprAST {
char Op;
ExprAST *LHS, *RHS;
public:
BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
: Op(op), LHS(lhs), RHS(rhs) {}
};

/// CallExprAST - Expression class for function calls.
class CallExprAST : public ExprAST {
std::string Callee;
std::vector<ExprAST*> Args;
public:
CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
: Callee(callee), Args(args) {}
};

/// PrototypeAST - This class represents the "prototype" for a function,
/// which captures its name, and its argument names (thus implicitly the number
/// of arguments the function takes).
class PrototypeAST {
std::string Name;
std::vector<std::string> Args;
public:
PrototypeAST(const std::string &name, const std::vector<std::string> &args)
: Name(name), Args(args) {}

};

/// FunctionAST - This class represents a function definition itself.
class FunctionAST {
PrototypeAST *Proto;
ExprAST *Body;
public:
FunctionAST(PrototypeAST *proto, ExprAST *body)
: Proto(proto), Body(body) {}

};

//===----------------------------------------------------------------------===//
// Parser
//===----------------------------------------------------------------------===//

/// CurTok/getNextToken - Provide a simple token buffer.  CurTok is the current
/// token the parser is looking at.  getNextToken reads another token from the
/// lexer and updates CurTok with its results.
static int CurTok;
static int getNextToken() {
return CurTok = gettok();
}

/// BinopPrecedence - This holds the precedence for each binary operator that is
/// defined.
static std::map<char, int> BinopPrecedence;

/// GetTokPrecedence - Get the precedence of the pending binary operator token.
static int GetTokPrecedence() {
if (!isascii(CurTok))
return -1;

// Make sure it's a declared binop.
int TokPrec = BinopPrecedence[CurTok];
if (TokPrec <= 0) return -1;
return TokPrec;
}

/// Error* - These are little helper functions for error handling.
ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }

static ExprAST *ParseExpression();

/// identifierexpr
///   ::= identifier
///   ::= identifier '(' expression* ')'
static ExprAST *ParseIdentifierExpr() {
std::string IdName = IdentifierStr;

getNextToken();  // eat identifier.

if (CurTok != '(') // Simple variable ref.
return new VariableExprAST(IdName);

// Call.
getNextToken();  // eat (
std::vector<ExprAST*> Args;
if (CurTok != ')') {
while (1) {
ExprAST *Arg = ParseExpression();
if (!Arg) return 0;
Args.push_back(Arg);

if (CurTok == ')') break;

if (CurTok != ',')
return Error("Expected ')' or ',' in argument list");
getNextToken();
}
}

// Eat the ')'.
getNextToken();

return new CallExprAST(IdName, Args);
}

/// numberexpr ::= number
static ExprAST *ParseNumberExpr() {
ExprAST *Result = new NumberExprAST(NumVal);
getNextToken(); // consume the number
return Result;
}

/// parenexpr ::= '(' expression ')'
static ExprAST *ParseParenExpr() {
getNextToken();  // eat (.
ExprAST *V = ParseExpression();
if (!V) return 0;

if (CurTok != ')')
return Error("expected ')'");
getNextToken();  // eat ).
return V;
}

/// primary
///   ::= identifierexpr
///   ::= numberexpr
///   ::= parenexpr
static ExprAST *ParsePrimary() {
switch (CurTok) {
default: return Error("unknown token when expecting an expression");
case tok_identifier: return ParseIdentifierExpr();
case tok_number:     return ParseNumberExpr();
case '(':            return ParseParenExpr();
}
}

/// binoprhs
///   ::= ('+' primary)*
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
// If this is a binop, find its precedence.
while (1) {
int TokPrec = GetTokPrecedence();

// If this is a binop that binds at least as tightly as the current binop,
// consume it, otherwise we are done.
if (TokPrec < ExprPrec)
return LHS;

// Okay, we know this is a binop.
int BinOp = CurTok;
getNextToken();  // eat binop

// Parse the primary expression after the binary operator.
ExprAST *RHS = ParsePrimary();
if (!RHS) return 0;

// If BinOp binds less tightly with RHS than the operator after RHS, let
// the pending operator take RHS as its LHS.
int NextPrec = GetTokPrecedence();
if (TokPrec < NextPrec) {
RHS = ParseBinOpRHS(TokPrec+1, RHS);
if (RHS == 0) return 0;
}

// Merge LHS/RHS.
LHS = new BinaryExprAST(BinOp, LHS, RHS);
}
}

/// expression
///   ::= primary binoprhs
///
static ExprAST *ParseExpression() {
ExprAST *LHS = ParsePrimary();
if (!LHS) return 0;

return ParseBinOpRHS(0, LHS);
}

/// prototype
///   ::= id '(' id* ')'
static PrototypeAST *ParsePrototype() {
if (CurTok != tok_identifier)
return ErrorP("Expected function name in prototype");

std::string FnName = IdentifierStr;
getNextToken();

if (CurTok != '(')
return ErrorP("Expected '(' in prototype");

std::vector<std::string> ArgNames;
while (getNextToken() == tok_identifier)
ArgNames.push_back(IdentifierStr);
if (CurTok != ')')
return ErrorP("Expected ')' in prototype");

// success.
getNextToken();  // eat ')'.

return new PrototypeAST(FnName, ArgNames);
}

/// definition ::= 'def' prototype expression
static FunctionAST *ParseDefinition() {
getNextToken();  // eat def.
PrototypeAST *Proto = ParsePrototype();
if (Proto == 0) return 0;

if (ExprAST *E = ParseExpression())
return new FunctionAST(Proto, E);
return 0;
}

/// toplevelexpr ::= expression
static FunctionAST *ParseTopLevelExpr() {
if (ExprAST *E = ParseExpression()) {
// Make an anonymous proto.
PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
return new FunctionAST(Proto, E);
}
return 0;
}

/// external ::= 'extern' prototype
static PrototypeAST *ParseExtern() {
getNextToken();  // eat extern.
return ParsePrototype();
}

//===----------------------------------------------------------------------===//
// Top-Level parsing
//===----------------------------------------------------------------------===//

static void HandleDefinition() {
if (ParseDefinition()) {
fprintf(stderr, "Parsed a function definition.\n");
} else {
// Skip token for error recovery.
getNextToken();
}
}

static void HandleExtern() {
if (ParseExtern()) {
fprintf(stderr, "Parsed an extern\n");
} else {
// Skip token for error recovery.
getNextToken();
}
}

static void HandleTopLevelExpression() {
// Evaluate a top-level expression into an anonymous function.
if (ParseTopLevelExpr()) {
fprintf(stderr, "Parsed a top-level expr\n");
} else {
// Skip token for error recovery.
getNextToken();
}
}

/// top ::= definition | external | expression | ';'
static void MainLoop() {
while (1) {
fprintf(stderr, "ready> ");
switch (CurTok) {
case tok_eof:    return;
case ';':        getNextToken(); break;  // ignore top-level semicolons.
case tok_def:    HandleDefinition(); break;
case tok_extern: HandleExtern(); break;
default:         HandleTopLevelExpression(); break;
}
}
}

//===----------------------------------------------------------------------===//
// Main driver code.
//===----------------------------------------------------------------------===//

int main() {
// Install standard binary operators.
// 1 is lowest precedence.
BinopPrecedence['<'] = 10;
BinopPrecedence['+'] = 20;
BinopPrecedence['-'] = 20;
BinopPrecedence['*'] = 40;  // highest.

// Prime the first token.
fprintf(stderr, "ready> ");
getNextToken();

// Run the main "interpreter loop" now.
MainLoop();

return 0;
}

Reference:

LLVM Tutorial

编译器