Skip to content
Advertisement

Split a nested XML string to get a string using parser

I have this string :

'<Section xml:space="preserve" HasTrailingParagraphBreakOnPaste="False" xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"><Paragraph FontSize="11" FontFamily="Portable User Interface" Foreground="#FF000000" FontWeight="Normal" FontStyle="Normal" FontStretch="Normal" CharacterSpacing="0" Typography.AnnotationAlternates="0" Typography.EastAsianExpertForms="False" Typography.EastAsianLanguage="Normal" Typography.EastAsianWidths="Normal" Typography.StandardLigatures="True" Typography.ContextualLigatures="True" Typography.DiscretionaryLigatures="False" Typography.HistoricalLigatures="False" Typography.StandardSwashes="0" Typography.ContextualSwashes="0" Typography.ContextualAlternates="True" Typography.StylisticAlternates="0" Typography.StylisticSet1="False" Typography.StylisticSet2="False" Typography.StylisticSet3="False" Typography.StylisticSet4="False" Typography.StylisticSet5="False" Typography.StylisticSet6="False" Typography.StylisticSet7="False" Typography.StylisticSet8="False" Typography.StylisticSet9="False" Typography.StylisticSet10="False" Typography.StylisticSet11="False" Typography.StylisticSet12="False" Typography.StylisticSet13="False" Typography.StylisticSet14="False" Typography.StylisticSet15="False" Typography.StylisticSet16="False" Typography.StylisticSet17="False" Typography.StylisticSet18="False" Typography.StylisticSet19="False" Typography.StylisticSet20="False" Typography.Capitals="Normal" Typography.CapitalSpacing="False" Typography.Kerning="True" Typography.CaseSensitiveForms="False" Typography.HistoricalForms="False" Typography.Fraction="Normal" Typography.NumeralStyle="Normal" Typography.NumeralAlignment="Normal" Typography.SlashedZero="False" Typography.MathematicalGreek="False" Typography.Variants="Normal" TextOptions.TextHintingMode="Fixed" TextOptions.TextFormattingMode="Ideal" TextOptions.TextRenderingMode="Auto" TextAlignment="Left" LineHeight="0" LineStackingStrategy="MaxHeight"><Run>Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour unnm² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1.</Run></Paragraph></Section>'

My goal is to extract Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour unnm² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1. so the text between <run> and </run>

I did it with regular expression but it doesn’t work with some xml string so I tried with xml.etree.ElementTree but I didn’t succeed to access the sting nested in <run></run>

How can I extract this text using XML parser?

Advertisement

Answer

Here a simple way to get the data:

xmlstr = '<Section xml:space="preserve" HasTrailingParagraphBreakOnPaste="False" xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"><Paragraph FontSize="11" FontFamily="Portable User Interface" Foreground="#FF000000" FontWeight="Normal" FontStyle="Normal" FontStretch="Normal" CharacterSpacing="0" Typography.AnnotationAlternates="0" Typography.EastAsianExpertForms="False" Typography.EastAsianLanguage="Normal" Typography.EastAsianWidths="Normal" Typography.StandardLigatures="True" Typography.ContextualLigatures="True" Typography.DiscretionaryLigatures="False" Typography.HistoricalLigatures="False" Typography.StandardSwashes="0" Typography.ContextualSwashes="0" Typography.ContextualAlternates="True" Typography.StylisticAlternates="0" Typography.StylisticSet1="False" Typography.StylisticSet2="False" Typography.StylisticSet3="False" Typography.StylisticSet4="False" Typography.StylisticSet5="False" Typography.StylisticSet6="False" Typography.StylisticSet7="False" Typography.StylisticSet8="False" Typography.StylisticSet9="False" Typography.StylisticSet10="False" Typography.StylisticSet11="False" Typography.StylisticSet12="False" Typography.StylisticSet13="False" Typography.StylisticSet14="False" Typography.StylisticSet15="False" Typography.StylisticSet16="False" Typography.StylisticSet17="False" Typography.StylisticSet18="False" Typography.StylisticSet19="False" Typography.StylisticSet20="False" Typography.Capitals="Normal" Typography.CapitalSpacing="False" Typography.Kerning="True" Typography.CaseSensitiveForms="False" Typography.HistoricalForms="False" Typography.Fraction="Normal" Typography.NumeralStyle="Normal" Typography.NumeralAlignment="Normal" Typography.SlashedZero="False" Typography.MathematicalGreek="False" Typography.Variants="Normal" TextOptions.TextHintingMode="Fixed" TextOptions.TextFormattingMode="Ideal" TextOptions.TextRenderingMode="Auto" TextAlignment="Left" LineHeight="0" LineStackingStrategy="MaxHeight"><Run>Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour unnm² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1.</Run></Paragraph></Section>'
from xml.etree import cElementTree as ET
results = []
root = ET.fromstring(xmlstr)
for p in list(root):
 for r in list(p):
  print(r.text)
  results.append(r.text)

Result:

Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour un m² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1.

If you run the code in a python interactive prompt, at the end you can use results:

>>> results
['Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour unnm² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1.']
>>> results[0]
'Panneau de polystyrène expansé (PSE) ignifugé réalisant une fonction d’isolation thermique pour unnm² de surface en assurant la résistance thermique de R = 3.55 K.m².W-1.'
User contributions licensed under: CC BY-SA
6 People found this is helpful
Advertisement