Python UTF-16 unicode conversion

I'm using the code below to convert Arabic text to its UTF-16 Unicode escape form.

For example, I have the Arabic text مرحبا.

unicode = ''.join([hex(ord(i)) for i in t.text])

This code produces the string 0x6450x6310x62d0x6280x627.

The format I need is \u0645\u0631\u062d\u0628\u0627.

I want to replicate this website

With the above method I use replace to turn the 0x prefix into \u0, but this does not convert special characters (spaces, punctuation, digits) as expected, so I have to patch each of them with a further replace call:

    unicode = str(unicode).replace('0x', '\\u0')
    unicode = str(unicode).replace('\\u020', ' ') #For Space
    unicode = str(unicode).replace('\\u02e', '\\u002e') #For .
    unicode = str(unicode).replace('\\u022', '\\u0022') #For "
    unicode = str(unicode).replace('\\u07d', '\\u007d') #For }
    unicode = str(unicode).replace('\\u030', '\\u0030') #For 0
    unicode = str(unicode).replace('\\u07b', '\\u007b') #For {
    unicode = str(unicode).replace('\\u031', '\\u0031') #For 1

Python's built-in UTF-16 encoding doesn't give the \u format either; it returns raw bytes:

print("مرحبا".encode('utf-16'))
b"\xff\xfeE\x061\x06-\x06(\x06'\x06"

How can I get the result in \u format, the way this website produces it for UTF-16?

Thanks.

Answer

This problem is just about how you represent the hex value. To get the string in the representation you want, you can use

In [84]: text = "مرحبا"

In [85]: print(''.join([f'\\u{ord(c):0>4x}' for c in text]))
\u0645\u0631\u062d\u0628\u0627
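
As a quick sanity check, the escaped string can be decoded back to the original text. This is only a sketch: it relies on Python's built-in unicode_escape codec and assumes the text contains no characters outside the Basic Multilingual Plane.

```python
text = "مرحبا"

# Build the escaped form: one \uXXXX group per character.
escaped = ''.join(f'\\u{ord(c):04x}' for c in text)

# The escaped string is pure ASCII, so it can be turned into bytes and
# decoded with the unicode_escape codec to recover the original text.
restored = escaped.encode('ascii').decode('unicode_escape')

print(escaped)
print(restored == text)
```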

Short explanation

Consider the first character of the text:

In [86]: ord(text[0])
Out[86]: 1605

It has integer (decimal) value 1605. This in hex is 645:

In [87]: hex(ord(text[0]))
Out[87]: '0x645'
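
Note that hex() does not pad its result. That appears to be why the replace approach in the question needs a special case for every character below U+0100. A small comparison, with sample characters chosen here purely for illustration:

```python
# Sample characters (picked for this example): space, full stop, Arabic meem.
samples = [" ", ".", "م"]

# For each character: the unpadded hex() form and the zero-padded 4-digit form.
rows = [(hex(ord(ch)), format(ord(ch), "04x")) for ch in samples]

for ch, (short, padded) in zip(samples, rows):
    print(repr(ch), short, padded)
```

The space comes out as 0x20 from hex() but as 0020 with the padded format, so replacing 0x with \u0 alone leaves it one digit short.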

You can also use string formatting (for example f-strings in Python 3.6+) to show it as \u0645. The doubled backslash below is just how the interpreter displays a single literal backslash:

In [88]: f'\\u{ord(text[0]):0>4x}'
Out[88]: '\\u0645'

The x in the format spec means “hex”. The 0>4 means: right-align the number in a field 4 characters wide and pad it on the left with zeros.
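
One caveat, since the question mentions UTF-16: ord() returns the code point, which equals the UTF-16 code unit only for characters in the Basic Multilingual Plane (this covers Arabic). Characters above U+FFFF, such as most emoji, are stored in UTF-16 as a surrogate pair. A minimal sketch that emits one escape per UTF-16 code unit, assuming that is what the target website does (the function name to_utf16_escapes is made up for this example):

```python
import struct


def to_utf16_escapes(text: str) -> str:
    """Return text as \\uXXXX escapes, one per UTF-16 code unit."""
    # 'utf-16-be' encodes big-endian and, unlike plain 'utf-16', adds no BOM.
    data = text.encode('utf-16-be')
    # Every UTF-16 code unit is 2 bytes; unpack them as unsigned 16-bit ints.
    units = struct.unpack(f'>{len(data) // 2}H', data)
    return ''.join(f'\\u{unit:04x}' for unit in units)


print(to_utf16_escapes("مرحبا"))
print(to_utf16_escapes("😀"))  # a character outside the BMP
```

For plain Arabic text this gives the same result as the ord()-based one-liner; the two only differ once a character outside the BMP appears.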
