So I have a list of strings (content from Snort rules), and I am trying to convert the hex portions of them to UTF-8/ASCII, so I can send the content over netcat.
The method I have now works fine for strings with single hex characters (e.g. 3A), but breaks when there's a series of hex characters (e.g. 3A 4B 00 FF).
My current solution is:
    import re
    import codecs

    def convert_hex(match):
        string = match.group(1)
        string = string.replace(" ", "")
        decode_hex = codecs.getdecoder("hex_codec")
        try:
            result = decode_hex(string)[0]
        except:
            result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
        return result.decode("utf-8")

    strings = ['|0A|Referer|3A| res|3A|/C|3A|',
               'RemoteNC Control Password|3A|',
               '/bbs/search.asp',
               'User-Agent|3A| Mozilla/4.0 |28|compatible|3B| MSIE 5.0|3B| Windows NT 5.0|29|']
    converted_strings = []
    for string in strings:
        for i in range(len(string)):
            string = re.sub(r"\|(.{2})\|", convert_hex, string)
        converted_strings.append(string)
For the strings in strings, this works, but for a string like:
|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|
it breaks.
I tried changing the regex to:
    re.sub(r"\|.*([A-Fa-f0-9]{2}).*\|")
but that only converts the last hex.
I need this solution to work for strings like Hello|3A|World, |3A 00 FF|, and Hello|3A 00|World.
I know it’s an issue with the regexp, but I’m not sure what exactly.
Any help would be much appreciated.
Answer
It looks like a substring between | symbols is either entirely hex, i.e. it matches (?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2}, or not hex at all?
This works:
    for string in strings:
        for i in range(len(string)):
            string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)
(The extra parentheses form capturing group 1. You could leave out one pair of parentheses and change your function to act on group(0) instead.)
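To see what that pattern actually picks up, here is a minimal sketch that only lists the matches without converting anything. The name HEX_RUN and the sample list are mine, not from the question. One caveat worth knowing: because the pipes are only checked by lookarounds and never consumed, plain text that sits between two hex blocks and happens to look like hex (such as "be" in the last sample) is matched as well.

```python
import re

# Same pattern as above: two-digit hex bytes separated by single whitespace
# characters, directly preceded and followed by a pipe (pipes are not consumed).
HEX_RUN = re.compile(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)")

# Illustrative inputs (the last one is a made-up edge case)
samples = ["Hello|3A|World", "|3A 00 FF|", "Hello|3A 00|World",
           "/bbs/search.asp", "|3A|be|3B|"]

# findall returns the contents of group 1 for every match
matches = {s: HEX_RUN.findall(s) for s in samples}
for s, found in matches.items():
    print(s, "->", found)
```

If that last case can occur in your rules, a pattern that consumes the pipes instead of looking around them avoids it.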
But the substitution breaks on your example |08 00 00 00 27 C7 CC 6B C2 FD 13 0E|, as that doesn't appear to be a valid UTF-8 encoding. The resulting error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 5: invalid continuation byte
However, a valid UTF-8 encoded multi-byte string like '|74 65 73 74 20 f0 9f 98 80|' works just fine:
    import re
    import codecs

    def convert_hex(match):
        string = match.group(1)
        string = string.replace(" ", "")
        decode_hex = codecs.getdecoder("hex_codec")
        try:
            result = decode_hex(string)[0]
        except:
            result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
        return result.decode("utf-8")

    strings = ['|74 65 73 74 20 f0 9f 98 80|']
    converted_strings = []
    for string in strings:
        for i in range(len(string)):
            string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)
    print(converted_strings)
Result:
['|test 😀|']
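The difference between the two inputs can be checked in isolation, without any regex. This is just a small sketch using bytes.fromhex (which skips the spaces between the hex pairs); the variable names are mine.

```python
# The bytes from the failing example: 0xC7 starts a two-byte UTF-8 sequence,
# but the next byte (0xCC) is not a continuation byte, so decoding fails.
raw = bytes.fromhex("08 00 00 00 27 C7 CC 6B C2 FD 13 0E")
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    error_position = exc.start
    print(exc)

# The bytes from the working example are well-formed UTF-8.
emoji_text = bytes.fromhex("74 65 73 74 20 f0 9f 98 80").decode("utf-8")
print(emoji_text)
```

So the regex is doing its job in both cases; it is only the final decode("utf-8") that cannot cope with arbitrary binary content.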
If you don't really need a printable representation of the data, you could just have your function return the bytes object and only apply the function to matching parts, instead of constructing a new string.
Based on what @Selcuk was saying, perhaps a result with byte-strings makes more sense – this works on all three types of input:
    import re
    import codecs

    def convert_hex(match):
        string = match.group(1)
        string = string.replace(b" ", b"")
        decode_hex = codecs.getdecoder("hex_codec")
        try:
            result = decode_hex(string)[0]
        except:
            result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
        return result

    strings = ['|0A|Referer|3A| res|3A|/C|3A|',
               '|74 65 73 74 20 f0 9f 98 80|',
               '|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|']
    converted_strings = []
    for string in strings:
        string = re.sub(rb"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string.encode())
        converted_strings.append(string)
    print(converted_strings)
Result:
    [b'|\n|Referer|:| res|:|/C|:|', b'|test \xf0\x9f\x98\x80|', b"|\x08\x00\x00\x00'\xc7\xcck\xc2\xfd\x13\x0e|"]
No encoding issues, because no encoding is chosen. (Note that I didn't attempt to change convert_hex too much. There's some encoding juggling in there that you may need to look at; I just got it to work for bytes.)
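If you do want to tidy up the encoding juggling, here is one possible sketch, not a drop-in replacement for the code above. It assumes you want the pipes removed from the output and that the non-hex parts of the content can be encoded as UTF-8 (plain ASCII in the examples). The names HEX_BLOCK and snort_content_to_bytes are made up for illustration.

```python
import re

# Hypothetical pattern: consumes the surrounding pipes too, so they are
# dropped from the output rather than kept as literal '|' bytes.
HEX_BLOCK = re.compile(rb"\|((?:[A-Fa-f0-9]{2}\s*)+)\|")

def snort_content_to_bytes(content: str) -> bytes:
    """Hypothetical helper: replace each |hex| block with its raw bytes."""
    data = content.encode()  # UTF-8 by default; the samples are plain ASCII
    # bytes.fromhex wants a str and skips the whitespace between hex pairs
    return HEX_BLOCK.sub(lambda m: bytes.fromhex(m.group(1).decode("ascii")), data)

for sample in ["Hello|3A|World", "|3A 00 FF|", "Hello|3A 00|World",
               "|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|"]:
    print(repr(snort_content_to_bytes(sample)))
```

The result is a bytes object throughout, so it can be written to a socket or piped to netcat as-is without ever picking a text encoding for the binary parts.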