Convert strings with an unknown number of hex strings embedded in them to strings using regex

Question

So I have a list of strings (content from Snort rules), and I am trying to convert the hex portions of them to UTF-8/ASCII so I can send the content over netcat. The method I have now works fine for strings with single hex bytes (e.g. 3A), but breaks when there's a series of hex bytes (e.g. 3A 4B 00

Accepted Answer

It looks like a substring is either always hex, i.e. (?:[A-Fa-f0-9]{2}\s)+[A-Fa-f0-9]{2}, or not hex at all between | symbols?

This works:

    for string in strings:
        for i in range(len(string)):
            string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)

(The extra parentheses form capturing group 1; you could leave out one pair of parentheses and change your function to act on group(0) instead.)

But it breaks on your example |08 00 00 00 27 C7 CC 6B C2 FD 13 0E|, as that doesn't appear to be a valid UTF-8 encoding. The resulting error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 5: invalid continuation byte

However, a valid UTF-8 encoded multi-byte string like '|74 65 73 74 20 f0 9f 98 80|' works just fine:

    import re
    import codecs

    def convert_hex(match):
        string = match.group(1)
        string = string.replace(" ", "")
        decode_hex = codecs.getdecoder("hex_codec")
        try:
            result = decode_hex(string)[0]
        except:
            result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
        return result.decode("utf-8")

    strings = ['|74 65 73 74 20 f0 9f 98 80|']
    converted_strings = []
    for string in strings:
        for i in range(len(string)):
            string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)
    print(converted_strings)

Result:

    ['|test 😀|']

If you don't really need a printable representation of the data, you could just have your function return the bytes object and only apply the function to matching parts, instead of constructing a new string.

Based on what @Selcuk was saying, perhaps a result with byte-strings makes more sense. This works on all three types of input:

    import re
    import codecs

    def convert_hex(match):
        string = match.group(1)
        string = string.replace(b" ", b"")
        decode_hex = codecs.getdecoder("hex_codec")
        try:
            result = decode_hex(string)[0]
        except:
            result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
        return result

    strings = ['|0A|Referer|3A| res|3A|/C|3A|', '|74 65 73 74 20 f0 9f 98 80|', '|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|']
    converted_strings = []
    for string in strings:
        string = re.sub(rb"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string.encode())
        converted_strings.append(string)
    print(converted_strings)

Result:

    [b'|\n|Referer|:| res|:|/C|:|', b'|test \xf0\x9f\x98\x80|', b"|\x08\x00\x00\x00'\xc7\xcck\xc2\xfd\x13\x0e|"]

No encoding issues, because no encoding is chosen. (Note that I didn't attempt to change convert_hex too much; there's some encoding juggling in there that you may need to look at. I just got it to work for bytes.)
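As a possible simplification of the bytes-based approach above, here is a minimal sketch that replaces the codecs decoder and the fallback branch with bytes.fromhex. The names HEX_RUN, hex_to_bytes and convert_content are made up for this illustration and are not from the question or the answer. Like the answer's code, it leaves the | delimiters in place and never picks a text encoding.

```python
import re

# Same pattern as in the answer, compiled once as a bytes pattern:
# one or more hex byte pairs, separated by single whitespace characters,
# sitting directly between two | symbols (which are not consumed).
HEX_RUN = re.compile(rb"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)")


def hex_to_bytes(match):
    """Turn a matched run such as b'3A 4B 00' into the raw bytes b':K\\x00'."""
    # Drop the whitespace between pairs, then let bytes.fromhex do the decoding.
    # bytes.fromhex expects a str, hence the ASCII decode of the hex digits.
    hex_digits = b"".join(match.group(1).split())
    return bytes.fromhex(hex_digits.decode("ascii"))


def convert_content(content):
    """Hypothetical helper: str in, bytes out, hex runs replaced by raw bytes."""
    return HEX_RUN.sub(hex_to_bytes, content.encode())


if __name__ == "__main__":
    strings = [
        "|0A|Referer|3A| res|3A|/C|3A|",
        "|74 65 73 74 20 f0 9f 98 80|",
        "|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|",
    ]
    for s in strings:
        print(convert_content(s))
```

Because the result stays a bytes object, it can be written to a socket or a binary file unchanged. If a printable form is needed for logging, something like convert_content(s).decode("utf-8", errors="replace") avoids the UnicodeDecodeError at the cost of replacing undecodable bytes.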


Answer