How can I read a byte array file of strings?

There is a file with the following contents:

b'prefix:input_text'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'

This is my attempt to read the lines and convert them to readable UTF-8 characters, but the output file still shows the same strings:

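import pandas as pd
from tqdm import tqdm

# input_file, pred_file, target_file, path and predfile are defined elsewhere
inpcol, predcol, targcol = [], [], []

# Input lines: read as bytes and decode as UTF-8
with open(input_file, "rb") as f:
    for x in f:
        inpcol.append(x.decode('utf-8'))

with open(pred_file, "r") as f:
    for x in f:
        predcol.append(x)

with open(target_file, "r") as f:
    for x in f:
        targcol.append(x)

data = []
for i in tqdm(range(len(targcol))):
    data.append([inpcol[i], targcol[i], predcol[i]])

pd.DataFrame(data, columns=["input_text", "target_text", "pred_text"]).to_csv(f"{path}/merge_{predfile}.csv", encoding="utf-8")
print("Done!")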

The output file is:

,input_text,target_text,pred_text
0,"b'prefix:input_text'
","target_text
","ﺏﺭﺎﯾ ﺩﺮﮐ ﻮﻀﻌﯿﺗ
"
1,"b'xNeed:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'
","ﺞﻨﮕﯾﺪﻧ
","ﺏﺭﺎﯾ ﭗﯾﺩﺍ ﮎﺭﺪﻧ ﯽﮐ ﺖﯿﻣ
"

As you can see, the problem exists for the input lines but not for the target and prediction lines (those are scrambled, but that's okay).

Answer

It seems someone wrote the bytes the wrong way: they used str(bytes) instead of bytes.decode('utf-8'). Or maybe the code was written for Python 2, which treats bytes and strings differently than Python 3.
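To illustrate the difference, here is a minimal sketch of how such a file could have been produced. The sample word and the variable names are only for illustration; the actual writing code is not shown in the question.

```python
# A short Persian word encoded as UTF-8 bytes
data = "در".encode("utf-8")

# str() on bytes gives the printable representation, including the b'...' wrapper
# and the escape codes as literal backslash-x text
wrong = str(data)

# decode() gives the actual text
right = data.decode("utf-8")

print(wrong)
print(right)
```

Writing `wrong` to a file produces lines like the ones in the question, while writing `right` produces readable text.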


If you can't correct the code that writes the file, then you have to fix the text after reading it.

text = "b'oEffect:PersonX \xd8\xaf\xd8\xb1 \xd8\xac\xd9\x86\xda\xaf ___ \xd8\xa8\xd8\xa7\xd8\xb2\xdb\x8c \xd9\x85\xdb\x8c \xda\xa9\xd9\x86\xd8\xaf'"

Crop the leading b' and the trailing ':

text = text[2:-1]

Convert back to bytes using the special encoding 'raw_unicode_escape':

text = text.encode('raw_unicode_escape')

and decode to a string correctly (decode() uses UTF-8 by default):

text = text.decode()

And now

print(text)

gives me

oEffect:PersonX در جنگ ___ بازی می کند

EDIT:

It seems the file contains the codes as literal text with a real backslash, which has to be written as a double backslash in Python source (like "\\xd8"). print() displays it with a single backslash, but print(repr()) shows it with double backslashes.

In that case it needs one more decode/encode round to convert it correctly.

text = "b'xNeed:PersonX \\xd8\\xaf\\xd8\\xb1 \\xd8\\xac\\xd9\\x86\\xda\\xaf'"
print(repr(text))
print(text)

text = text[2:-1]
text = text.encode('raw_unicode_escape')
text = text.decode('unicode_escape')
text = text.encode('raw_unicode_escape')
text = text.decode()
print(text)
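
As an alternative sketch (not part of the steps above): if every line in the file is a valid Python bytes literal, `ast.literal_eval` can parse it directly, which avoids the manual cropping and the chained encode/decode calls. The helper name `decode_bytes_repr` and the sample line are hypothetical; the sample mimics what a line read from the file would contain, including the trailing newline.

```python
import ast


def decode_bytes_repr(line: str) -> str:
    """Convert the text form of a bytes literal into the string it encodes."""
    # strip() removes the trailing newline; literal_eval safely parses the literal
    value = ast.literal_eval(line.strip())
    if isinstance(value, bytes):
        return value.decode("utf-8")
    # Assumption: anything that is not bytes is returned as plain text
    return str(value)


# Double backslashes here stand for the single literal backslashes stored in the file
line = "b'oEffect:PersonX \\xd8\\xaf\\xd8\\xb1 \\xd8\\xac\\xd9\\x86\\xda\\xaf'\n"
fixed = decode_bytes_repr(line)
print(fixed)
```

In the question's loop this could replace x.decode('utf-8') if the input file is opened in text mode. A line that is not a valid Python literal makes `ast.literal_eval` raise an exception (usually `SyntaxError` or `ValueError`), so real data may need a fallback.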
User contributions licensed under: CC BY-SA