Proper way to handle ambiguous tokens in PLY

I am implementing an existing scripting language, partly as a toy project and partly so that I can write my own implementation of the program that uses the language. One of the issues I’m running into is that several of my constructs overlap in their specifications, even though they are clear in actual use:

 Variables - r'[A-Za-z0-9_]+'      # Yes, '456' is a valid variable name
 Numbers - r'-?[0-9]+(\.[0-9]+)?'
 Macros - r'\#[A-Za-z0-9_]+'
 Field Reference - r'(this\.)?([A-Za-z]+\.)*[A-Za-z]+'
 Tag reference - r'[A-Za-z0-9_]+\.[A-Za-z0-9_]*?'

This mostly works, but, for example, “456” could be a number or a variable. “34.567” could be a number or a tag reference (the documentation for the scripting language says that it’s a bad idea to start identifiers with numbers, but doesn’t outright forbid it). Is there a good way to handle the potential ambiguity of the tokens? Currently, I’m tokenizing the former as variable, and the latter as a number, and handling it later in the parser, but it feels very clumsy.


Answer

Is there any need for the tokenizer to distinguish between variables, numbers, field references and tag references? Presumably, the parser will be able to decide which of those categories a particular token falls into, by consulting its symbol table of declared variables and possibly by considering the context in which the token was used. If that’s the case, then you can just return a single token for all four cases, which will simplify your lexer and probably your grammar.
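For instance, a PLY lexer along these lines collapses the four overlapping categories into a single catch-all token (a minimal sketch; the token names WORD and MACRO are illustrative, not from the question, and the leading `-?` is carried over from the question's number pattern, though unary minus is often better left to the parser):

```python
import ply.lex as lex

tokens = ('WORD', 'MACRO')

# One token covers numbers, variables, and dotted references alike:
# "456", "34.567", "this.a.b", and plain names all match WORD.
# The parser/semantic pass decides which category each one is.
t_WORD  = r'-?[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*'
# Note the escaped '#': PLY compiles token regexes with re.VERBOSE,
# where an unescaped '#' would start a comment.
t_MACRO = r'\#[A-Za-z0-9_]+'

t_ignore = ' \t'

def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('x 456 34.567 this.a.b #loop')
toks = [(tok.type, tok.value) for tok in lexer]
print(toks)
```

With one token type, the grammar no longer has to anticipate which of the four interpretations the lexer guessed.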

There’s a general principle of parser design, which is never sufficiently emphasised, so I’ll put it in bold here:

Every parser component should do the absolute minimum amount of work necessary to distinguish between correct inputs.

In other words, if the only possibilities are a unique correct parse and an input error, and it’s at all difficult to decide at that point which applies, then just pass the decision on to the next phases, where more information is available. Only do the work necessary to distinguish between two or more different correct inputs.
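Concretely, the deferred decision might look like this, made once the symbol table exists. This is a hypothetical sketch: `classify` and `symtab` are made-up names, and the ordering of the checks is just one possible policy for your language:

```python
import re

def classify(word, symtab):
    """Decide what a catch-all token actually is, using the symbol table."""
    if word in symtab:                              # a declared variable wins
        return 'VARIABLE'
    if re.fullmatch(r'-?[0-9]+(\.[0-9]+)?', word):  # looks like a number
        return 'NUMBER'
    if '.' in word:                                 # dotted, undeclared name
        return 'FIELD_REF'
    return 'UNKNOWN'                                # report during semantics

symtab = {'456'}   # suppose '456' really was declared as a variable
print(classify('456', symtab))       # VARIABLE
print(classify('34.567', symtab))    # NUMBER
print(classify('this.a.b', symtab))  # FIELD_REF
```

The lexer never had enough information to make these calls; here they are trivial.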

This applies, for example, to trying to do type-checking in the parser. That’s a losing proposition; there isn’t enough information to do it correctly until semantic analysis is complete and you know what all of the identifiers refer to. More importantly, it adds no benefit to the parser (or the lexer) because it does not affect how a correct input is parsed; all it does is let you identify certain (not all) incorrect inputs. By the above principle, you shouldn’t try.

This principle comes up over and over again in parsing. There is always the temptation to try to make error detection “more precise” too early in the parse. Resist! Do error detection only when you have enough information to do it reliably. You’ll have to do it at that point anyway, so you’re not saving anything by trying to do some of it earlier. Early detection might shave a few microseconds off of a failed parse, but the speed of parsing incorrect inputs is not very important. Always optimise for correct inputs.

This also applies to writing grammars for syntaxes which are not easy to shoehorn precisely into a one-token-lookahead grammar. It’s OK to let an incorrect input sneak through the parse and then detect it during semantic analysis. For example, you could try to detect whether built-in function calls have the correct number of arguments. But why bother? Letting a call with too many or too few arguments go through to semantic analysis does not create any ambiguities. There are lots of other examples.
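As a sketch of that idea: let the parser build a call node with however many arguments it finds, and have a semantic pass check the arity, where producing a good message is easy. `BUILTINS` and `check_call` here are hypothetical names, not part of PLY:

```python
BUILTINS = {'abs': 1, 'min': 2}   # hypothetical built-in arity table

def check_call(name, args, errors):
    """Semantic-analysis arity check; appends a message instead of failing."""
    expected = BUILTINS.get(name)
    if expected is not None and len(args) != expected:
        errors.append(f"{name} expects {expected} argument(s), got {len(args)}")

errors = []
check_call('abs', ['x', 'y'], errors)   # wrong arity: recorded, not fatal
check_call('min', ['a', 'b'], errors)   # correct arity: no message
print(errors)
```

Because the check merely appends to an error list, one run can report every bad call at once, which ties into the error-recovery point below.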

Other big benefits of letting errors trickle down to the semantic analysis are that it’s much easier to generate accurate error messages, which are useful for the end user, and that it’s much easier to do error recovery, so you can continue processing the input and provide multiple errors and warnings in a single run, another feature your users will appreciate.

There are exceptions to every guideline, so I’m not saying this is an absolute rule. In COBOL, for example, some operators have different parsing precedences depending on their datatype. (No sensible language designer would commit that barbarity today, I hope, but you do need to take it into account for legacy parsers.) You can only pass a decision down the line if it doesn’t create ambiguities between correct inputs. But you should always try to keep this guideline in mind.

User contributions licensed under: CC BY-SA