I am parsing a domain specific language which has several type definitions. Primitive types have been relatively straight-forward to parse, but moving on to the complex types in the DSL have proven more challenging. I am looking for a strategy to recursively match a definition that can contain a reference to itself.
For example, the map data type can look like this at a high level:
map<primitive_type, complex_type>
Which can appear in the input as a recursive definition:
map<string, map<int, map<string, ...>>>=
This map of maps can theoretically be of arbitrary depth.
To make matters more complicated, these nested complex types could also be other complex types as well.
map<string, map<int, map<string, array<array<string>>>>>
I’m looking for a strategy to handle parsing this so that I get an accurate type containing the nested map type parsed out. I considered just using a regex, but this doesn’t feel optimal to me. I also know there exists Forward for recursive grammars, but I haven’t quite been able to figure out how to apply it to this scenario. If there are any good patterns or references for this type of parsing I would really appreciate any guidance!
Advertisement
Answer
Forward is the way to do this. Since type declarations may themselves contain type declarations, this is probably a good choice for where the Forward will go:
import pyparsing as pp type_decl = pp.Forward()
Looking at your examples, I can see we will need some punctuation, we can define some suppressed expressions for these so that they parse, but won’t junk up the parsed results:
LANGLE, RANGLE, COMMA = map(pp.Suppress, "<>,") # if we get an ellipsis, we want to keep it, so use Literal instead of Suppress ELLIPSIS = pp.Literal("...")
Next we can do the simple types:
simple_type = pp.oneOf("string int float", asKeyword=True)
Now the two complex types, map
and array
. For these, we will use the Forward wherever a complex type might occur:
MAP, ARRAY = map(pp.Keyword, ["map", "array"]) map_type = MAP - LANGLE + pp.Group(simple_type + COMMA + (type_decl | ELLIPSIS)) + RANGLE array_type = ARRAY - LANGLE + type_decl + RANGLE complex_type = pp.Group(map_type | array_type)
I use the ‘-‘ operator here since, if there is a syntax error after seeing a valid “map” or “array” keyword, pyparsing will give us the error within that expression.
At this point we have our complex and simple types, so we can “inject” them into the previously-defined Forward, using the <<=
operator:
type_decl <<= (complex_type | simple_type)
Check using runTests
:
type_decl.runTests(""" string int float map<string, map<int, map<string, ...>>> map<string, map<int, map<string, array<array<string>>>>> """, fullDump=False)
Gives:
string ['string'] int ['int'] float ['float'] map<string, map<int, map<string, ...>>> [['map', ['string', ['map', ['int', ['map', ['string', '...']]]]]]] map<string, map<int, map<string, array<array<string>>>>> [['map', ['string', ['map', ['int', ['map', ['string', ['array', ['array', 'string']]]]]]]]]