Using pyparsing, how can I group expressions that are matched by OneOrMore(expr1|expr2)?

My website allows users to post a string that contains several questions, each followed by multiple-choice answers. An enforced style guide allows the input to be parsed by regex, and the questions and MCQ choices are then stored in a database, to be returned later in randomized practice exams.

I want to transition to pyparsing because the regex is not immediately readable and I feel a little locked in with it. I would like to be able to expand the functionality of my question parser easily, and with regex that feels very cumbersome.

User input is in the form of:

quiz = [<question-answer>, <q-start>]
<question-answer> = <question> + <answer>
<question> = [<q-text>, n] ?!= <a-start>
<answer> = [<answer>, <a-start>]  ?!= <q-start>
<q-start> = <nums> + "." | ")"
<a-start> = <alphas> + "." | ")" 

The long user-input string is separated into question-answer groups, each delimited by the next group’s q-start. A question is all text between a q-start and the first a-start. The answers are a list of all text between one a-start and the next a-start or the following q-start.

Sample text:

3. A lesion that affects N. Solitarius will result in the patient having problems related to:
a. taste and blood pressure regulation
c. swallowing and respiration
b. smell and taste
d. voice quality and taste
e. whistling and chewing

4. A patient comes to your office complaining of weakness on the right side of their body. You notice that their head is
turned slightly to the left and their right shoulder droops. When asked to protrude their tongue, it deviates to the right. Eye
movements and eye-related reflexes appear to be normal. The lesion most likely is located in the:
c. left ventral medulla
a. left ventral midbrain
b. right dorsal medulla
d. left ventral pons
e. right ventral pons

5. A colleague {...}

Regex I have been using:

# matches a question-answer block. Matching q-start until an empty line.
regex1 = r"(^[\t ]*[0-9]+[).][\t ]+[\s\S]*?(?=^[\n\r]))"

# Within question-answer block, matches everything that does not start with a-start
regex6 = r"(^(?!(^[a-fA-F][).]\s+[\s\S]+)).*)"

# Matches all text between a-start and the following a-start, or until the question-answer substring block ends.
regex5 = r"(^[a-fA-F][).]\s+[\s\S]+)"

Then a little Python with re trims away the question numbers and MCQ letters, joins any broken lines in the question, and appends the MCQs to a list.
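For anyone following along, that post-processing might look roughly like the sketch below. It uses only the standard re module and is not the code running on my site: parse_quiz, Q_START, A_START and the sample text are hypothetical names and data chosen for illustration, and the patterns are simplified versions of the ones above.

```python
import re

# Simplified markers (assumptions): a number or a single letter at the start
# of a line, followed by "." or ")" and at least one space or tab.
Q_START = re.compile(r"^[ \t]*\d+[.)][ \t]+", re.MULTILINE)
A_START = re.compile(r"^[ \t]*[A-Za-z][.)][ \t]+", re.MULTILINE)


def parse_quiz(text):
    """Return a list of {"question": str, "answers": [str, ...]} dicts."""
    quiz = []
    # Each block runs from one question marker to the next (or to the end).
    starts = [m.start() for m in Q_START.finditer(text)]
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        block = text[begin:end]
        # Splitting on answer markers leaves the question text first,
        # followed by one piece per answer.
        parts = A_START.split(block)
        question = Q_START.sub("", parts[0], count=1)
        quiz.append({
            # split()/join() collapses the line breaks inside wrapped text.
            "question": " ".join(question.split()),
            "answers": [" ".join(part.split()) for part in parts[1:]],
        })
    return quiz


sample = (
    "3. A lesion that affects N. Solitarius will result in problems related to:\n"
    "a. taste and blood pressure regulation\n"
    "c. swallowing and respiration\n"
    "\n"
    "4. A patient comes to your office. Their head is\n"
    "turned slightly to the left. The lesion is located in the:\n"
    "c. left ventral medulla\n"
    "a. left ventral midbrain\n"
)

quiz = parse_quiz(sample)
for item in quiz:
    print(item["question"], item["answers"])
```

Because both markers are anchored to the start of a line, "N. Solitarius" in the middle of a question is not mistaken for an answer. A wrapped line that happens to begin with something like "12. " would still be misread, which is part of why this feels fragile.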

In pyparsing I have tried this:

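from pyparsing import (Char, Group, LineEnd, LineStart, OneOrMore, Optional, SkipTo,
                       StringEnd, Suppress, Word, alphas, nums, oneOf)

EOL = Suppress(LineEnd())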
delim = oneOf(". )")
q_start = LineStart() + Word(nums) + delim
a_start = LineStart() + Char(alphas) + delim

question = Optional(EOL) + Group(Suppress(q_start) + OneOrMore(SkipTo(LineEnd()) + EOL, stopOn=a_start)).setResultsName('question', listAllMatches=True)

answer = Optional(EOL) + Group(Suppress(a_start) + OneOrMore( SkipTo(LineEnd()) + EOL, stopOn=(a_start | q_start | StringEnd()))).setResultsName('answer', listAllMatches=True)



qi = Group(OneOrMore(question|answer)).setResultsName('group', listAllMatches=True)
t = qi.parseString(test)  # test is the quiz text to parse
print(t.dump())

Results:

[[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
- group: [[['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]]
  [0]:
    [['The tectum of the midbrain comprises the:'], ['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['Damage to the dorsal columns on one side of the spinal cord would results in:'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
    - answer: [['superior and inferior colliculi'], ['reticular formation'], ['internal arcuate fibers'], ['cerebellar peduncles'], ['pyramids'], ['loss of MVP ipsilaterally below the level of the lesion'], ['hypertonicity of the contralateral limbs'], ['loss of pain and temperature contralaterally below the level of the lesion'], ['loss of MVP contralaterally above the level of the lesion'], ['loss of pain and temperature ipsilaterally above the level of the lesion']]
      [0]:
        ['superior and inferior colliculi']
      [1]:
        ['reticular formation']
      [2]:
        ['internal arcuate fibers']
      [3]:
        ['cerebellar peduncles']
      [4]:
        ['pyramids']
      [5]:
        ['loss of MVP ipsilaterally below the level of the lesion']
      [6]:
        ['hypertonicity of the contralateral limbs']
      [7]:
        ['loss of pain and temperature contralaterally below the level of the lesion']
      [8]:
        ['loss of MVP contralaterally above the level of the lesion']
      [9]:
        ['loss of pain and temperature ipsilaterally above the level of the lesion']
    - question: [['The tectum of the midbrain comprises the:'], ['Damage to the dorsal columns on one side of the spinal cord would results in:']]
      [0]:
        ['The tectum of the midbrain comprises the:']
      [1]:
        ['Damage to the dorsal columns on one side of the spinal cord would results in:']

This does match questions and answers, and it properly handles line breaks that interrupt a question or an answer. The problem is that the results are not grouped the way I expected. I was expecting something along the lines of group[0] = question, answer[1:4]; group[2] = question, answer[1:4].

Does anyone have any advice?

Thanks!

Answer

I think you were on the right track. I took a separate pass at your parser and came up with very similar constructs, with just a few differences.

# Reuses EOL, q_start and a_start from the question; Combine must also be imported from pyparsing
question = Combine(q_start.suppress() + SkipTo(EOL + a_start))
answer = Combine(a_start.suppress() + SkipTo(EOL + (a_start | q_start | StringEnd())))
q_a = Group(question("question") + answer[1, ...]("answers"))

for t in q_a[...].parseString(test):
    print(t.dump())

The biggest difference is that my expression does not just use OneOrMore(question | answer); it defines Group(question + OneOrMore(answer)), which creates one group for each question and its related answers. In your parser, listAllMatches just collects every question under one results name and every answer under another, and loses all the associations between them. Grouping “question + one or more answers” keeps those associations intact.
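To see the difference in isolation, here is a minimal sketch that is separate from the parser above. It assumes every question and answer fits on one line, and the Regex-based q_start, a_start and line_text are hypothetical stand-ins for your expressions. It also wraps the answers in their own Group before naming them, which is a slight variation on the code above.

```python
import pyparsing as pp

# Hypothetical, simplified markers: no LineStart handling, one item per line.
q_start = pp.Suppress(pp.Regex(r"\d+[.)]"))
a_start = pp.Suppress(pp.Regex(r"[A-Za-z][.)]"))
line_text = pp.Regex(r"[^\n]+")

question = q_start + line_text("question")
answer = a_start + line_text

sample = """\
1. First question?
a. alpha
b. beta

2. Second question?
a. gamma
b. delta
"""

# Flat: questions and answers arrive as one undifferentiated sequence.
flat = pp.OneOrMore(question | answer)
flat_result = flat.parseString(sample, parseAll=True).asList()
print(flat_result)
# ['First question?', 'alpha', 'beta', 'Second question?', 'gamma', 'delta']

# Grouped: each question is kept together with its own answers.
q_a = pp.Group(question + pp.Group(pp.OneOrMore(answer))("answers"))
grouped_result = pp.OneOrMore(q_a).parseString(sample, parseAll=True)
for item in grouped_result:
    print(item["question"], item["answers"].asList())
# First question? ['alpha', 'beta']
# Second question? ['gamma', 'delta']
```

In the flat result there is nothing to tell you where one question's answers stop and the next question begins; the grouped result carries that structure for you.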

If you want to remove the ‘\n’s, you can do that more easily with a parse action than with the EOL business.
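A parse action for that could look something like the sketch below. The names join_lines and q_text are hypothetical, and the Regex used to capture a multi-line question is a simplified assumption rather than the SkipTo expression shown earlier; the point is only that the cleanup happens in one small function attached to the expression.

```python
import pyparsing as pp


def join_lines(tokens):
    """Collapse line breaks and repeated spaces in the matched text."""
    return " ".join(tokens[0].split())


# Assumed pattern: consume everything up to (but not including) a newline
# that is followed by an answer marker such as "c. " or "c) ".
q_text = pp.Regex(r"(?:(?!\n[A-Za-z][.)]\s)[\s\S])+")
q_text.setParseAction(join_lines)

question = pp.Suppress(pp.Regex(r"\d+[.)]")) + q_text

sample = (
    "4. A patient comes to your office. Their head is\n"
    "turned slightly to the left. The lesion is located in the:\n"
    "c. left ventral medulla\n"
)

cleaned = question.parseString(sample)[0]
print(cleaned)
```

The returned string replaces the original token, so everything downstream of the question expression only ever sees the single-line version.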

User contributions licensed under: CC BY-SA