I have the following data, which I receive via an SSH session to a switch. I want to convert this text input to a dict, for easy access and so that I can monitor certain values.
I cannot extract the data without a ton of splits and regexes, and I still get stuck.
    Port : 1
        Media Type : SF+_SR
        Vendor Name : VENDORX
        Part Number : SFP-10G-SR
        Serial Number : Gxxxxxxxx
        Wavelength: 850 nm
        Temp (Celsius) : 37.00    Status : Normal
              Low Warn Threshold : -40.00    High Warn Threshold : 85.00
              Low Alarm Threshold : -50.00    High Alarm Threshold : 100.00
        Voltage AUX-1/Vcc (Volts) : 3.27    Status : Normal
              Low Warn Threshold : 3.10    High Warn Threshold : 3.50
              Low Alarm Threshold : 3.00    High Alarm Threshold : 3.60
        Tx Power (dBm) : -3.11    Status : Normal
              Low Warn Threshold : -7.30    High Warn Threshold : 2.00
              Low Alarm Threshold : -9.30    High Alarm Threshold : 3.00
        Rx Power (dBm) : -4.68    Status : Normal
              Low Warn Threshold : -11.10    High Warn Threshold : 2.00
              Low Alarm Threshold : -13.10    High Alarm Threshold : 3.00
        Tx Bias Current (mA): 6.27    Status : Normal
              Low Warn Threshold : 0.00    High Warn Threshold : 12.00
              Low Alarm Threshold : 0.00    High Alarm Threshold : 15.00
    Port : 2
        Media Type : SF+_SR
        Vendor Name : VENDORY
        Part Number : SFP-10G-SR
        Serial Number : Gxxxxxxxx
        Wavelength : 850 nm
        Temp (Celsius) : 37.00    Status : Normal
    ..... etc - till port 48
Which I want to convert to:
    [
      {
        "port": "1",
        "vendor": "VENDORX",
        "media_type": "SF+_SR",
        "part_number": "SFP-10G-SR",
        "serial_number": "Gxxxxxxxx",
        "wavelength": "850 nm",
        "temp": {
          "value": "37.00",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
        "voltage_aux-1": {
          "value": "3.27",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
        "tx_power": {
          "value": "-3.11",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
        "rx_power": {
          "value": "-4.68",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
        "tx_bias_current": {
          "value": "6.27",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
      },
      {
        "port": "2",
        "vendor": "VENDORY",
        "media_type": "SF+_SR",
        "part_number": "SFP-10G-SR",
        "serial_number": "Gxxxxxxxx",
        "wavelength": "850 nm",
        "temp": {
          "value": "37.00",
          "status": "normal",
          # alarm threshold and warn threshold may be ignored
        },
        ...... etc
      }
    ]
Answer
Updated (Complete rewrite and simplification).
Here are some ideas for you — adjust to taste.
The solution herein tries to avoid using “domain specific knowledge” as much as possible. The only assumptions are:
- Empty lines don’t matter.
- Indentation is meaningful.
- Keys are transformed to lowercase, and some content is removed (stuff in parentheses, 'name', 'threshold', and '/...').
- When a line has multiple “key : value” pairs or is followed by an indented group of lines, that is a block of information pertaining to the first key.
Ultimately, when a key has multiple values (e.g. 'port'), these values are put together as a list. When a key has a value that is a single dict (like for 'temp'), the first key of that dict (the same as the key itself) is replaced by 'value'. Thus, we will see {'port': [{'port': 1, ...}, {'port': 2, ...}, ...]}, but {'temp': {'value': 37, ...}}.
Records
We start by splitting each line into (key, value) pairs and note the indentation of the line. The result is a list of records, each containing (indent, [(k0, v0), ...]):
    import re

    def proc_kv(k, v):
        # normalize the key: lowercase, drop '(...)', ' name', ' threshold', '/...'
        k = re.sub(r'\(.*\)', '', k.lower())
        k = re.sub(r' (?:name|threshold)', '', k)
        k = re.sub(r'/\S+', '', k)
        k = '_'.join(k.strip().split())
        # convert the value to int or float when possible
        for typ in (int, float):
            try:
                v = typ(v)
                break
            except ValueError:
                pass
        return k, v

    def proc_line(s):
        s = re.sub(r'\t', ' ' * 4, s)  # handle tabs if any
        # split into one or more key-value pairs
        p = [e.strip() for e in re.split(r':', s)]
        if len(p) < 2:
            return None
        # if there are several pairs, use the largest space
        # to split '{v[i]} {k[i+1]}'
        p = [p[0]] + [
            e for x in p[1:-1]
            for e in x.split(max(re.split(r'( +)', x)[1::2]), maxsplit=1)
        ] + [p[-1]]
        kv_pairs = [proc_kv(k, v) for k, v in zip(p[::2], p[1::2])]
        # figure out the indentation of that line
        indent = len(s) - len(s.lstrip(' '))
        return indent, kv_pairs
Example on your text:
    records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]

    >>> records
    [(0, [('port', 1)]),
     (4, [('media_type', 'SF+_SR')]),
     (4, [('vendor', 'VENDORX')]),
     (4, [('part_number', 'SFP-10G-SR')]),
     (4, [('serial_number', 'Gxxxxxxxx')]),
     (4, [('wavelength', '850 nm')]),
     (4, [('temp', 37.0), ('status', 'Normal')]),
     (10, [('low_warn', -40.0), ('high_warn', 85.0)]),
     ...
Note that not only keys but also values may contain spaces (e.g. 'Wavelength : 850 nm'). We decided to use the largest run of spaces to split the intermediary '{v[i]} {k[i+1]}' substrings. Thus:
    >>> proc_line('  a b : 34 nm    c d : 4 ft')
    (2, [('a_b', '34 nm'), ('c_d', '4 ft')])

    # but
    >>> proc_line('  a b : 34    nm c d : 4 ft')
    (2, [('a_b', 34), ('nm_c_d', '4 ft')])
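The largest-gap rule can also be looked at in isolation. The sketch below uses a hypothetical helper, `split_on_widest_gap` (not part of the code above), that splits a "value, then next key" fragment on its widest run of spaces. It returns the fragment unchanged when there is no space at all, a case in which the `max(...)` call in `proc_line` would raise a `ValueError`.

```python
import re

def split_on_widest_gap(fragment):
    """Split 'value   next key' on its widest run of spaces.

    Hypothetical helper for illustration only. Returns (value, next_key),
    or (fragment, None) when there is no space to split on.
    """
    gaps = re.findall(r' +', fragment)
    if not gaps:
        return fragment, None
    widest = max(gaps, key=len)  # first widest run wins on ties
    value, next_key = fragment.split(widest, maxsplit=1)
    return value, next_key

print(split_on_widest_gap('34 nm    c d'))   # ('34 nm', 'c d')
print(split_on_widest_gap('34    nm c d'))   # ('34', 'nm c d')
print(split_on_widest_gap('37.00'))          # ('37.00', None)
```

Whether to adopt such a guard depends on your input: it only matters if a line can contain two colons with nothing but a single token between them.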
Blocks
We then construct a hierarchical representation of the records in a way that takes indentation into account:
    def get_blocks(records, parent=None):
        indent, _ = records[0]
        starts = [i for i, (o_indent, _) in enumerate(records) if o_indent == indent]
        block = [] if parent is None else parent.copy()
        continuation_block = len(block) > 1
        for i, j in zip(starts, starts[1:] + [len(records)]):
            _, kv = records[i]
            continuation_block &= (single_line := i + 1 == j)
            if continuation_block:
                block += kv
            elif single_line:
                block += [(kv[0][0], kv)] if len(kv) > 1 else kv
            else:
                block.append((kv[0][0], get_blocks(records[i+1:j], parent=kv)))
        return block
Example on the records above (obtained from your txt):
    blocks = get_blocks(records)

    >>> blocks
    [('port',
      [('port', 1),
       ('media_type', 'SF+_SR'),
       ('vendor', 'VENDORX'),
       ('part_number', 'SFP-10G-SR'),
       ('serial_number', 'Gxxxxxxxx'),
       ('wavelength', '850 nm'),
       ('temp',
        [('temp', 37.0),
         ...
Note the repeated first key in sub-blocks (e.g. ('port', [('port', 1), ...]) and ('temp', [('temp', 37.0), ...])).
Final structure
We then transform the hierarchical blocks structure into a dict, with some ad-hoc logic (no clobbering of (k, v) pairs that have the same key, etc.). Finally, we put all the pieces together in a proc_txt() function:
    def reshape(a):
        if isinstance(a, list) and len(a) == 1:
            a = a[0]
        if isinstance(a, dict):
            a = {'value' if i == 0 else k: v for i, (k, v) in enumerate(a.items())}
        return a

    def to_dict(blocks):
        if not isinstance(blocks, list):
            return blocks
        d = {}
        for k, v in blocks:
            d[k] = d.get(k, []) + [to_dict(v)]
        return {k: reshape(v) for k, v in d.items()}

    def proc_txt(txt):
        records = [r for r in [proc_line(s) for s in txt.splitlines()] if r]
        blocks = get_blocks(records)
        d = to_dict(blocks)
        return d
Example on your text:
    >>> proc_txt(txt)
    {'port': [{'port': 1,
       'media_type': 'SF+_SR',
       'vendor': 'VENDORX',
       'part_number': 'SFP-10G-SR',
       'serial_number': 'Gxxxxxxxx',
       'wavelength': '850 nm',
       'temp': {'value': 37.0,
        'status': 'Normal',
        'low_warn': -40.0,
        'high_warn': 85.0,
        'low_alarm': -50.0,
        'high_alarm': 100.0},
       ...
     ]}
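If you only need the value and status of each measurement, as in the desired output of the question, the result of `proc_txt()` can be trimmed afterwards. The following is a minimal sketch, assuming the structure shown above with at least two ports (so that `d['port']` is a list). `simplify_port` and `abnormal` are hypothetical helper names, and the sample dict is hand-written in that shape rather than real switch output; in particular, the 'High Warn' status string is invented for the example.

```python
def simplify_port(port):
    """Keep scalar fields as-is; reduce nested measurements to value and status."""
    out = {}
    for key, val in port.items():
        if isinstance(val, dict):
            out[key] = {'value': val.get('value'),
                        'status': str(val.get('status', '')).lower()}
        else:
            out[key] = val
    return out

def abnormal(ports):
    """Return (port, measurement, status) for every status other than 'normal'."""
    return [
        (p['port'], key, val['status'])
        for p in ports
        for key, val in p.items()
        if isinstance(val, dict) and val['status'] != 'normal'
    ]

# Hand-written sample in the shape returned by proc_txt() (assumption).
d = {'port': [
    {'port': 1, 'vendor': 'VENDORX',
     'temp': {'value': 37.0, 'status': 'Normal', 'low_warn': -40.0, 'high_warn': 85.0}},
    {'port': 2, 'vendor': 'VENDORY',
     'temp': {'value': 90.0, 'status': 'High Warn', 'low_warn': -40.0, 'high_warn': 85.0}},
]}

ports = [simplify_port(p) for p in d['port']]
print(ports[0]['temp'])   # {'value': 37.0, 'status': 'normal'}
print(abnormal(ports))    # [(2, 'temp', 'high warn')]
```

Keeping this as a separate pass leaves the parser free of domain-specific knowledge, which was the point of the approach above.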