
Python str.format with UTF-8 characters that take more than 1 position

I am trying to print Japanese characters in Python, aligned in columns. It seems that each Japanese character has a width equivalent to two spaces, so the alignment doesn’t work.

Here is the code:

def print_kanji(s, k):
    print('{:<20}{:<10}{:<10}{:<10}'
        .format(s, k['reading'][0], k['reading'][1], k['kanji']))

# 's' is an input string and 'k' is a map that contains the readings in the 3 different Japanese writing systems.

The output I obtain is the following:

decir               いう        イウ        言う        

pequeño             すくない      スクナイ      少ない       

niño                こども       コドモ       子供        

ya [ha hecho X]     もう        モウ

The left column is Spanish, but that’s not important. What matters is that the three columns on the right are not aligned. I have counted the positions and the count is correct: the first Japanese column is always 10 ‘positions’ long. The problem is that each Japanese character is two positions wide, while a blank is only one.

I have also checked that a blank typed with the Japanese input method is two positions wide, so I should be able to fix the problem by replacing the ‘Latin’ space (one position wide) with the Japanese one.

How can I change the character that format will use to align strings?

EDIT

I have found that the str.format format specification accepts a fill character. I tried setting it to the Japanese blank (two positions wide), and the result is worse.

EDIT 2

I have solved it by implementing this function:

def get_formatted_kanji(h, k, kn):
    # Assumes every character is two columns wide; pads each string to 10 columns
    h2 = h + ' ' * (10 - 2 * len(h))
    k2 = k + ' ' * (10 - 2 * len(k))
    kn2 = kn + ' ' * (10 - 2 * len(kn))
    return h2 + k2 + kn2

# h, k and kn are the three 'Japanese strings' to be formatted in columns

However, is there a better (built-in) way of achieving this?


Answer

In a terminal, it’s common for certain characters to take up two columns and other characters to take up one column. You can find out which characters are which by using Python’s unicodedata module, whose east_asian_width() function returns the East Asian Width class of a character.
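As a quick illustration of what east_asian_width() returns, here is a small sketch that looks up a few sample characters. The samples dictionary, its labels, and the width_classes helper are my own illustration and are not part of the question.

```python
import unicodedata

# One sample character per width class; the labels are informal descriptions.
samples = {
    'a': 'Latin letter',
    'あ': 'hiragana',
    'ｱ': 'halfwidth katakana',
    'Ａ': 'fullwidth Latin letter',
    'α': 'Greek letter',
}

def width_classes(chars):
    """Map each character to its East Asian Width class."""
    return {ch: unicodedata.east_asian_width(ch) for ch in chars}

classes = width_classes(samples)
for ch, cls in classes.items():
    print(repr(ch), samples[ch], cls)
```

The classes 'W' (wide) and 'F' (fullwidth) are the ones that normally occupy two terminal columns. 'A' (ambiguous) depends on the terminal and its settings.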

Here is an example of how you can use it to pad your text:

import unicodedata
table = [
    ('decir', 'いう', 'イウ', '言う'), 
    ('pequeño', 'すくない', 'スクナイ', '少ない'), 
    ('niño', 'こども', 'コドモ', '子供'), 
    ('ya [ha hecho X]', 'もう', 'モウ', ''),
]

WIDTHS = {
    'F': 2,
    'H': 1,
    'W': 2,
    'N': 1,
    'A': 1, # Ambiguous: not really correct, since it can be 1 or 2 depending on the terminal
    'Na': 1,
}

def pad(text, width):
    text_width = 0
    for ch in text:
        width_class = unicodedata.east_asian_width(ch)
        text_width += WIDTHS[width_class]
    if width <= text_width:
        return text
    return text + ' ' * (width - text_width)

for s, reading1, reading2, kanji in table:
    print('{}{}{}{}'.format(
        pad(s, 20),
        pad(reading1, 10),
        pad(reading2, 10),
        pad(kanji, 10),
    ))

On my system (macOS), the output looks like this:

[Screenshot: the same table, with columns lined up visually.]
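If you would rather keep using str.format for the padding itself, one possible variation is to shrink the field width by the number of extra columns the wide characters occupy. This is only a sketch under the same assumptions as above: display_width and ljust_display are names I made up for illustration, and ambiguous-width characters are counted as one column.

```python
import unicodedata

def display_width(text):
    """Approximate terminal columns: 2 for wide/fullwidth, 1 otherwise."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)

def ljust_display(text, width):
    # str.format pads by character count, so reduce the field width by the
    # extra columns that the wide characters take up.
    field = width - (display_width(text) - len(text))
    if field <= len(text):
        return text  # already as wide as the column, nothing to pad
    return '{:<{w}}'.format(text, w=field)

row = ('decir', 'いう', 'イウ', '言う')
line = ''.join(ljust_display(cell, w)
               for cell, w in zip(row, (20, 10, 10, 10)))
print(line)
```

The nested {w} field lets the width be computed per string, so the rest of the format call stays the same as in the question.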

Limitations

The above code does not handle Unicode combining characters. A more complete implementation would perform Unicode text segmentation, and then figure out the width of each grapheme cluster. There are libraries that do this for you, I’m sure.
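As a partial improvement that stays within the standard library, you could skip characters that unicodedata.combining() reports as combining marks, so they add no width. This is a rough sketch, not real grapheme segmentation: display_width is a hypothetical helper, and zero-width characters that are not combining marks (joiners, variation selectors) are still miscounted.

```python
import unicodedata

def display_width(text):
    """Approximate terminal columns, treating combining marks as zero width."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # drawn on top of the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
    return width

composed = 'ni\u00f1o'       # "niño" with a precomposed ñ
decomposed = 'nin\u0303o'    # "niño" as n + COMBINING TILDE
voiced = '\u304b\u3099'      # か + COMBINING VOICED SOUND MARK, rendered as が

print(display_width(composed), display_width(decomposed), display_width(voiced))
```

Without the combining() check, the decomposed forms would be counted as wider than they appear on screen.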

Language Note

As a note, I don’t think the words “少ない” and “pequeño” are likely equivalents. The Spanish word “pequeño” refers to the size of something, and “少ない” refers to the quantity.

I think these pairings are more likely:

  • poco: 少ない
  • pequeño: 小さい
User contributions licensed under: CC BY-SA