bison parser深入分析

GNU Bison Parser 深入分析

一些约定

bison对我们定义好的文法进行了增广(augmentation),添加了$accept$end符号表示接收和终止,并且增加了一条规则用于判断是否完成语法分析:

1
$accept : <start-symbol> $end

除此之外还增加了$undefined来表示未出现在文法中的symbol。增加了error用来生成错误。

一个简单的例子

就像flex一样,bison也是表驱动的,为了理解bison如何parse,我们用这样一个文法去生成一个parser:

1
2
3
4
5
6
7
8
(1) L → L;E
(2) L → E
(3) E → E,P
(4) E → P
(5) P → a
(6) P → (M)
(7) M → ε
(8) M → L

对应到bison里,就是这样:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
%%
L : L ';' E
| E
;
E : E ',' P
| P

;P : 'a'
| (M)

;M : /* nothing */
| L
;
%%

下面是这个文法的LALR(1)算法生成的自动机表(增广后的):

action GOTO
state ; , a ( ) $end L E P M
0 s1 s2 3 4 5
1 r5 r5 r5 r5 r5 r5
2 r7 r7 s1 s2 r7 r7 6 4 5 7
3 s9 s8
4 r2 s10 r2 r2 r2 r2
5 r4 r4 r4 r4 r4 r4
6 s9 r8 r8 r8 r8 r8
7 s11
8* acc acc acc acc acc acc
9 s1 s2 12 5
10 s1 s2 s11 13
11 r6 r6 r6 r6 r6 r6
12 r1 s10 r1 r1 r1 r1
13 r3 r3 r3 r3 r3 r3

这个例子的符号表如下:

Symbol Number
$end 0
error 1
$undefined 2
';' 3
',' 4
'a' 5
'(' 6
')' 7
$accept 8
L 9
E 10
P 11
M 12

注意,$enderror, $undefined对应的number永远是0,1,2,然后是terminal symbol,再然后是$accept,再然后是non-terminal。

如此简单的语法都要生成一个庞大的表,更何况一门完备的语言,并且注意到了表中有很多地方是空的(对应error),因此bison对这个表进行了压缩。

数据压缩

default action

首先是对action的部分进行压缩,注意到state 1,2,4,5,6,11,12,13仅仅只有1种reduction操作(有的还有shift,但它不是reduction),因此就可以先单独考量这样的行,对它们定义一个default action(reduction),也就是只要碰到这些state,在排除shift操作后一律使用reduction操作。对于空白的error部分即使进行了reduction,也仅仅是推迟了错误的发生。

形式上这个就应该是这个样子:

1
default_reductions[] = {none,r5,r7,none,r2,r4,r8,none,none,none,none,r6,r1,r3}

在bison生成的文件种就对应了yydefact[],这个表的index表示的是state number,里面的值表示reduction的rule number,0表示error:

1
static const yytype_uint8 yydefact[] ={0,6,8,0,3,5,9,0,1,0,0,7,2,4};

注意第一条规则是bison的默认增广规则,因此所有规则号都加了1。

default GOTO

接下来要压缩GOTO的部分,bison的压缩策略是这样的:将每一个non-terminal对应的GOTO列中最多的那一个单独拿出来(L:3,E:4,P:5,M:7)构成一个default GOTO:

1
default_gotos[] = { 3, 4, 5, 7 }

在生成文件中就对应了yydefgoto[]

1
static const yytype_int8 yydefgoto[] ={ -1,3,4,5,7 };

它的index是用non-terminal的symbol number减去terminal个数得到的(0处对应的-1好像没有什么用)。

压缩non-default action

在选出default reduction表之后,去除它们后action剩下的部分是:

action
state 3 4 5 6 7 0
0 1 2
2 1 2
3 9 8
4 10
6 9
7 11
9 1 2
10 1 2 11
12 10

可以看出有很多空白,接下来的操作就是要对这些空白进行压缩。

bison会将这个表变成两个一维的表,一个用来存储这些非空白的信息,一个用来存储它们的dimension。一般表达式就是:

1
action number = T[D[state number] + symbol number]

例如,我们想要得到state 4在遇到symbol号为4时的操作,那就像这样进行索引:

1
T[D[4] + 4] = 10

根据这个压缩规则,bison分别在yytable[]yypact[]存储这个‘T’和‘D’:

1
static const yytype_uint8 yytable[] ={8,1,2,9,11,10,9,6,12,0,0,0,13};static const yytype_int8 yypact[] ={-4,-5,-4,0,1,-5,3,-3,-5,-4,-4,-5,1,-5};

我们可以验证一下,state 4遇到规则4时对应的是s10:

1
yytable[yypact[4]+4] = yytable[5] = 10

注意,尽管去除了default reduction,这里面的操作也是可能有reduction的,为此,bison使用正负号来区别它们,yytable[]里正号代表shift,负号代表reduction(当然这个例子里没有),如果是0就执行default action。

压缩non-default GOTO

和non-default action一样,采用相同的方法去压缩non-default GOTO。只是在indexing的时候,公式变为了:

1
action number = T[ND[symbol number - num_terminal] + state number]

在这个例子中,non-default GOTO的表是这样的:

GOTO
state 9 10 11
2 6
9 12
10 13

压缩后的index在bison生成的表中对应了yypgoto[],value仍然用yytable[]

1
static const yytype_int8 yypgoto[] ={      -5,     5,    -1,     2,    -5};

例如,我们要symbol 9在state 2中的GOTO:

1
yytable[yypgoto[9-8]+2] = yytable[7] = 6

算法的实现

一些static数据

yytranslate

yytranslate[]定义了文法中符号到symbol number的映射过程,假如这些符号是non-terminal,那index就对应non-terminal %token声明时的值,如果这些符号是ASCII码中的terminal,那么index就对应了ASCII值。

1
static const yytype_uint8 yytranslate[] ={       0,     2,     2,     2,     2, ...};

比如说这个例子中出现的’a’,对应的ASCII值是97,根据yytranslate[]的索引,可以得到符号a对应的symbol number是5:

1
yytranslate[97] = 5;

在这里面有很多的2(undefined),因为很多ASCII表中的符号并没有出现在文法中。

yyr1

yyr1[]是每条文法规则的LHS符号的symbol number:

1
static const yytype_uint8 yyr1[] ={       0,     8,     9,     9,    10,    10,    11,    11,    12,    12};

0并没有用作文法的number,所以yyr1[0]是0.

yyr2

yyr2[]是每条文法规则的RHS符号的个数:

1
static const yytype_uint8 yyr2[] ={       0,     2,     3,     1,     3,     1,     1,     3,     0,     1};

同上,yyr2[0]`是0.

yycheck

yycheck[]是用来判断使用default还是non-default的,在生成文件的代码中表现为:

1
2
3
4
5
6
7
8
9
10
11
12
/* If the proper action on seeing token YYTOKEN is to reduce or to   detect an error, take that action.  */
yyn += yytoken;
if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken)
goto yydefault;

/* Now 'shift' the result of the reduction. Determine what state that goes to, based on the state we popped back to and the rule number reduced by. */
yyn = yyr1[yyn];
yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
if (0 <= yystate && yystate <= YYLAST && yycheck[yystate] == *yyssp)
yystate = yytable[yystate];
else yystate = yydefgoto[yyn - YYNTOKENS];
goto yynewstate;

在这个例子中,yycheck[]的值如下:

1
static const yytype_int8 yycheck[] ={       0,     5,     6,     3,     7,     4,     3,     2,     9,    -1,      -1,    -1,    10};

其他不重要的表:

1
2
3
4
5
yyrhs: yyrhs[n] = 第n条规则的RHS第一个symbol
yyprhs: yyprhs[n] = yyrhs[n]symbol的index
yyrline: yyrline[n] = .y文法源文件中第n条规则定义时对应的行
yytname: yytname[n] = symbol number n对应的symbol string
yytoknum: yytoknum[n] = token n的token number

yyparse()

整个算法的主程序是yyparse(),具体的框架如下:

首先我们要定义一个state stack:

1
2
3
4
/* The state stack.  */
yytype_int16 yyssa[YYINITDEPTH];
yytype_int16 *yyss;
yytype_int16 *yyssp;

然后定义符号的stack:

1
2
3
4
/* The semantic value stack.  */
YYSTYPE yyvsa[YYINITDEPTH];
YYSTYPE *yyvs;
YYSTYPE *yyvsp;

这里的YYSTYPE是文法的non-terminal所允许的type,可以自己定义,在这个例子里默认为了int

然后定义用来输出错误的一个buffer:

1
2
3
/* Buffer for error messages, and its allocated size.  */
char yymsgbuf[128];
char *yymsg = yymsgbuf;

定义将new state push到栈里的操作:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
yynewstate:
/* In all cases, when you get here, the value and location stacks
have just been pushed. So pushing a state here evens the stacks. */
yyssp++;

yysetstate:
*yyssp = yystate;

if (yyss + yystacksize - 1 <= yyssp)
{
/* Get the current used size of the three stacks, in elements. */
YYSIZE_T yysize = yyssp - yyss + 1;

#ifdef yyoverflow
{
/* Give user a chance to reallocate the stack. Use copies of
these so that the &'s don't force the real ones into
memory. */
YYSTYPE *yyvs1 = yyvs;
yytype_int16 *yyss1 = yyss;

/* Each stack pointer address is followed by the size of the
data in use in that stack, in bytes. This used to be a
conditional around just the two extra args, but that might
be undefined if yyoverflow is a macro. */
yyoverflow (YY_("memory exhausted"),
&yyss1, yysize * sizeof (*yyssp),
&yyvs1, yysize * sizeof (*yyvsp),
&yystacksize);

yyss = yyss1;
yyvs = yyvs1;
}
#else /* no yyoverflow */
# ifndef YYSTACK_RELOCATE
goto yyexhaustedlab;
# else
/* Extend the stack our own way. */
if (YYMAXDEPTH <= yystacksize)
goto yyexhaustedlab;
yystacksize *= 2;
if (YYMAXDEPTH < yystacksize)
yystacksize = YYMAXDEPTH;

{
yytype_int16 *yyss1 = yyss;
union yyalloc *yyptr =
(union yyalloc *) YYSTACK_ALLOC (YYSTACK_BYTES (yystacksize));
if (! yyptr)
goto yyexhaustedlab;
YYSTACK_RELOCATE (yyss_alloc, yyss);
YYSTACK_RELOCATE (yyvs_alloc, yyvs);
# undef YYSTACK_RELOCATE
if (yyss1 != yyssa)
YYSTACK_FREE (yyss1);
}
# endif
#endif /* no yyoverflow */

yyssp = yyss + yysize - 1;
yyvsp = yyvs + yysize - 1;

YYDPRINTF ((stderr, "Stack size increased to %lu\n",
(unsigned long int) yystacksize));

if (yyss + yystacksize - 1 <= yyssp)
YYABORT;
}

YYDPRINTF ((stderr, "Entering state %d\n", yystate));

if (yystate == YYFINAL)
YYACCEPT;

goto yybackup;

核心部分只是前面和后面的几行代码,中间都是在处理overflow的,用一些宏来决定是将stack size 扩张两倍还是直接扔出memory exhausted错误。

然后定义读取lookahead token的操作yybackup,这里就要决定是使用default还是non-default,假如是non-default,是shift还是reduce还是报错了:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
yybackup:

/* Do appropriate processing given the current state. Read a
lookahead token if we need one and don't already have one. */

/* First try to decide what to do without reference to lookahead token. */
yyn = yypact[yystate];
if (yypact_value_is_default (yyn))
goto yydefault;

/* Not known => get a lookahead token if don't already have one. */

/* YYCHAR is either YYEMPTY or YYEOF or a valid lookahead symbol. */
if (yychar == YYEMPTY)
{
YYDPRINTF ((stderr, "Reading a token: "));
yychar = yylex ();
}

if (yychar <= YYEOF)
{
yychar = yytoken = YYEOF;
YYDPRINTF ((stderr, "Now at end of input.\n"));
}
else
{
yytoken = YYTRANSLATE (yychar);
YY_SYMBOL_PRINT ("Next token is", yytoken, &yylval, &yylloc);
}

/* If the proper action on seeing token YYTOKEN is to reduce or to
detect an error, take that action. */
yyn += yytoken;
if (yyn < 0 || YYLAST < yyn || yycheck[yyn] != yytoken)
goto yydefault;
yyn = yytable[yyn];
if (yyn <= 0)
{
if (yytable_value_is_error (yyn))
goto yyerrlab;
yyn = -yyn;
goto yyreduce;
}

/* Count tokens shifted since error; after three, turn off error
status. */
if (yyerrstatus)
yyerrstatus--;

/* Shift the lookahead token. */
YY_SYMBOL_PRINT ("Shifting", yytoken, &yylval, &yylloc);

/* Discard the shifted token. */
yychar = YYEMPTY;

yystate = yyn;
YY_IGNORE_MAYBE_UNINITIALIZED_BEGIN
*++yyvsp = yylval;
YY_IGNORE_MAYBE_UNINITIALIZED_END

goto yynewstate;

然后定义default操作,很简单,读取yydefault[]表,如果是0就报错,不是0就进入reduce:

1
2
3
4
5
yydefault:
yyn = yydefact[yystate];
if (yyn == 0)
goto yyerrlab;
goto yyreduce;

接下来是reduce操作:根据yyr2[]找到对应规则的reduce的RHS符号个数以用来决定pop多少,然后根据yyr1[]找到对应规则的LHS值push进去:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
yyreduce:
/* yyn is the number of a rule to reduce with. */
yylen = yyr2[yyn];

/* If YYLEN is nonzero, implement the default value of the action:
'$$ = $1'.

Otherwise, the following line sets YYVAL to garbage.
This behavior is undocumented and Bison
users should not rely upon it. Assigning to YYVAL
unconditionally makes the parser a bit smaller, and it avoids a
GCC warning that YYVAL may be used uninitialized. */
yyval = yyvsp[1-yylen];


YY_REDUCE_PRINT (yyn);
switch (yyn)
{

#line 1179 "eg.tab.c" /* yacc.c:1646 */
default: break;
}
/* User semantic actions sometimes alter yychar, and that requires
that yytoken be updated with the new translation. We take the
approach of translating immediately before every use of yytoken.
One alternative is translating here after every semantic action,
but that translation would be missed if the semantic action invokes
YYABORT, YYACCEPT, or YYERROR immediately after altering yychar or
if it invokes YYBACKUP. In the case of YYABORT or YYACCEPT, an
incorrect destructor might then be invoked immediately. In the
case of YYERROR or YYBACKUP, subsequent parser actions might lead
to an incorrect destructor call or verbose syntax error message
before the lookahead is translated. */
YY_SYMBOL_PRINT ("-> $$ =", yyr1[yyn], &yyval, &yyloc);

YYPOPSTACK (yylen);
yylen = 0;
YY_STACK_PRINT (yyss, yyssp);

*++yyvsp = yyval;

/* Now 'shift' the result of the reduction. Determine what state
that goes to, based on the state we popped back to and the rule
number reduced by. */

yyn = yyr1[yyn];

yystate = yypgoto[yyn - YYNTOKENS] + *yyssp;
if (0 <= yystate && yystate <= YYLAST && yycheck[yystate] == *yyssp)
yystate = yytable[yystate];
else
yystate = yydefgoto[yyn - YYNTOKENS];

goto yynewstate;

然后是三个用来检测error的操作:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
yyerrlab:
/* Make sure we have latest lookahead translation. See comments at
user semantic actions for why this is necessary. */
yytoken = yychar == YYEMPTY ? YYEMPTY : YYTRANSLATE (yychar);

/* If not already recovering from an error, report this error. */
if (!yyerrstatus)
{
++yynerrs;
#if ! YYERROR_VERBOSE
yyerror (YY_("syntax error"));
#else
# define YYSYNTAX_ERROR yysyntax_error (&yymsg_alloc, &yymsg, \
yyssp, yytoken)
{
char const *yymsgp = YY_("syntax error");
int yysyntax_error_status;
yysyntax_error_status = YYSYNTAX_ERROR;
if (yysyntax_error_status == 0)
yymsgp = yymsg;
else if (yysyntax_error_status == 1)
{
if (yymsg != yymsgbuf)
YYSTACK_FREE (yymsg);
yymsg = (char *) YYSTACK_ALLOC (yymsg_alloc);
if (!yymsg)
{
yymsg = yymsgbuf;
yymsg_alloc = sizeof yymsgbuf;
yysyntax_error_status = 2;
}
else
{
yysyntax_error_status = YYSYNTAX_ERROR;
yymsgp = yymsg;
}
}
yyerror (yymsgp);
if (yysyntax_error_status == 2)
goto yyexhaustedlab;
}
# undef YYSYNTAX_ERROR
#endif
}



if (yyerrstatus == 3)
{
/* If just tried and failed to reuse lookahead token after an
error, discard it. */

if (yychar <= YYEOF)
{
/* Return failure if at end of input. */
if (yychar == YYEOF)
YYABORT;
}
else
{
yydestruct ("Error: discarding",
yytoken, &yylval);
yychar = YYEMPTY;
}
}

/* Else will try to reuse lookahead token after shifting the error
token. */
goto yyerrlab1;


/*---------------------------------------------------.
| yyerrorlab -- error raised explicitly by YYERROR. |
`---------------------------------------------------*/
yyerrorlab:

/* Pacify compilers like GCC when the user code never invokes
YYERROR and the label yyerrorlab therefore never appears in user
code. */
if (/*CONSTCOND*/ 0)
goto yyerrorlab;

/* Do not reclaim the symbols of the rule whose action triggered
this YYERROR. */
YYPOPSTACK (yylen);
yylen = 0;
YY_STACK_PRINT (yyss, yyssp);
yystate = *yyssp;
goto yyerrlab1;


/*-------------------------------------------------------------.
| yyerrlab1 -- common code for both syntax error and YYERROR. |
`-------------------------------------------------------------*/
yyerrlab1:
yyerrstatus = 3; /* Each real token shifted decrements this. */

for (;;)
{
yyn = yypact[yystate];
if (!yypact_value_is_default (yyn))
{
yyn += YYTERROR;
if (0 <= yyn && yyn <= YYLAST && yycheck[yyn] == YYTERROR)
{
yyn = yytable[yyn];
if (0 < yyn)
break;
}
}

/* Pop the current state because it cannot handle the error token. */
if (yyssp == yyss)
YYABORT;


yydestruct ("Error: popping",
yystos[yystate], yyvsp);
YYPOPSTACK (1);
yystate = *yyssp;
YY_STACK_PRINT (yyss, yyssp);
}

YY_IGNORE_MAYBE_UNINITIALIZED_BEGIN
*++yyvsp = yylval;
YY_IGNORE_MAYBE_UNINITIALIZED_END


/* Shift the error token. */
YY_SYMBOL_PRINT ("Shifting", yystos[yyn], yyvsp, yylsp);

yystate = yyn;
goto yynewstate;

然后定义accept,abort,overflow时yyparse()分别返回什么值:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
yyacceptlab:
yyresult = 0;
goto yyreturn;

/*-----------------------------------.
| yyabortlab -- YYABORT comes here. |
`-----------------------------------*/
yyabortlab:
yyresult = 1;
goto yyreturn;

#if !defined yyoverflow || YYERROR_VERBOSE
/*-------------------------------------------------.
| yyexhaustedlab -- memory exhaustion comes here. |
`-------------------------------------------------*/
yyexhaustedlab:
yyerror (YY_("memory exhausted"));
yyresult = 2;
/* Fall through. */
#endif

最后一步:yyreturn,返回时清空stack,error message的buf:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
yyreturn:
if (yychar != YYEMPTY)
{
/* Make sure we have latest lookahead translation. See comments at
user semantic actions for why this is necessary. */
yytoken = YYTRANSLATE (yychar);
yydestruct ("Cleanup: discarding lookahead",
yytoken, &yylval);
}
/* Do not reclaim the symbols of the rule whose action triggered
this YYABORT or YYACCEPT. */
YYPOPSTACK (yylen);
YY_STACK_PRINT (yyss, yyssp);
while (yyssp != yyss)
{
yydestruct ("Cleanup: popping",
yystos[*yyssp], yyvsp);
YYPOPSTACK (1);
}
#ifndef yyoverflow
if (yyss != yyssa)
YYSTACK_FREE (yyss);
#endif
#if YYERROR_VERBOSE
if (yymsg != yymsgbuf)
YYSTACK_FREE (yymsg);
#endif
return yyresult;
}
 wechat
scan and following my wechat official account.