git-p4 message and author encoding

Today I am migrating some fairly old Perforce repositories to Git. While this is really interesting, one thing caught my eye: the special characters in the commit messages, and even in the author names, are not in the correct encoding.

So I tried to investigate where the problem comes from.

  • First of all, the Perforce server does not support Unicode, so setting P4CHARSET has no effect other than the error "Unicode clients require a unicode enabled server."
  • Then I checked the output of simple commands like p4 users, which was indeed in ANSI according to Notepad++, or ISO-8859-1 according to file -bi on the redirected output.
  • The locale command reports LANG=en_US.UTF-8 …

So my guess is that all p4 client output is in ISO-8859-1, but git-p4 assumes UTF-8 instead.
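To illustrate that guess, here is a minimal Python 3 sketch (git-p4 itself ran on Python 2 at the time). The sample name is made up, and the assumption that p4 emits ISO-8859-1 is the one from the question, not something confirmed by Perforce documentation. It shows that such bytes are not valid UTF-8, which is why a UTF-8 consumer displays them incorrectly.

```python
# Python 3 sketch; the sample author name is invented for illustration.
name = "Jörg Müller"

latin1_bytes = name.encode("iso-8859-1")  # what the p4 client is assumed to emit
utf8_bytes = name.encode("utf-8")         # what a UTF-8 consumer expects


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a valid UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(latin1_bytes))  # the ISO-8859-1 bytes are rejected
print(decodes_as_utf8(utf8_bytes))    # the UTF-8 bytes are accepted
# Lenient decoding swaps each invalid byte for U+FFFD, the usual "broken umlaut".
print(latin1_bytes.decode("utf-8", errors="replace"))
```

The same check can be run on redirected p4 output to confirm whether a particular file is already valid UTF-8.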

I tried rewriting the commit messages with

git filter-branch --msg-filter 'iconv -f iso-8859-1 -t utf-8' -- --all

but that doesn't fix the issue, especially since --msg-filter is not intended to rewrite the author names.

Does anyone have an idea how to force the output to be translated to UTF-8 before git-p4 receives it?

Update:

I tried to “override” the default p4 command's output with a simple shell script that I prepended to PATH:

/usr/bin/p4 "$@" | iconv -f iso-8859-1 -t utf-8

but that destroys the marshalled Python objects that are evidently used:

  File "/usr/local/bin/git-p4", line 2467, in getBranchMapping
    for info in p4CmdList(command):
  File "/usr/local/bin/git-p4", line 480, in p4CmdList
    entry = marshal.load(p4.stdout)
ValueError: bad marshal data
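A plausible reason for this failure: marshalled strings carry a byte-length prefix, so re-encoding the whole stream makes every non-ASCII string longer than its prefix claims. The Python 3 sketch below reproduces the effect on a made-up record. The helper names are hypothetical, and marshal version 0 is used here only as an assumption to keep the layout close to the old Python 2 format.

```python
import marshal


def reencode_like_iconv(raw: bytes) -> bytes:
    """Mimic `iconv -f iso-8859-1 -t utf-8` applied to a raw byte stream."""
    return raw.decode("iso-8859-1").encode("utf-8")


def survives_reencoding(record: dict) -> bool:
    """Return True if the record still loads unchanged after re-encoding."""
    payload = marshal.dumps(record, 0)  # version 0: plain length-prefixed strings
    try:
        return marshal.loads(reencode_like_iconv(payload)) == record
    except (ValueError, EOFError, TypeError):
        # The length prefix no longer matches the data, so loading fails.
        return False


# Pure ASCII passes through untouched; a single 0xC4 byte is enough to break it.
print(survives_reencoding({b"desc": b"plain ascii"}))
print(survives_reencoding({b"desc": b"\xc4nderung"}))
```

So any conversion has to happen on the individual string values after unmarshalling, not on the raw stream.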

Update 2:

As seen in "Changing default encoding of Python?", I tried to set the Python encoding to ASCII:

export PYTHONIOENCODING="ascii"
python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

Output:

('ascii', 'ascii')

but the messages and authors are still not migrated correctly.

Update 3:

Even patching the commit(self, details, files, branch, parent = "") function in git-p4.py did not help. Changing

self.gitStream.write(details["desc"])

to one of these

self.gitStream.write(details["desc"].encode('utf8', 'replace'))
self.gitStream.write(unicode(details["desc"], 'utf8'))

just raised:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 29: ordinal not in range(128)
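A likely explanation for the first variant, assuming Python 2 and a byte string: calling .encode() on a byte str first decodes it implicitly with the ASCII codec, and that hidden step is what fails on 0xC4. The Python 3 sketch below makes the hidden step explicit. The sample bytes and the helper name are invented for illustration.

```python
desc = b"Fix \xc4nderung"  # made-up ISO-8859-1 bytes containing 0xC4 ("Ä")


def implicit_py2_encode(raw: bytes) -> bytes:
    """Emulate Python 2's str.encode('utf8'): ASCII-decode first, then encode."""
    return raw.decode("ascii").encode("utf-8")


error = None
try:
    implicit_py2_encode(desc)
except UnicodeDecodeError as exc:
    error = exc  # same error class and codec as in the traceback above
    print(exc)
```

The error names the ASCII codec even though 'utf8' was requested, because the failure happens in the implicit decode and not in the encode.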

As I am not a Python developer, I have no idea what to try next.

Answer

I suspect that details["desc"] is a byte string (str in Python 2).

Therefore you need to decode it to Unicode before you encode it. To find out the type, use

print type(details["desc"])

details["desc"].decode("iso-8859-1").encode("UTF-8")

might help to convert the description from ISO-8859-1 to UTF-8.
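As a slightly more defensive variant of the same idea, here is a Python 3 sketch with a hypothetical helper, to_utf8, which is not part of git-p4. It leaves bytes that are already valid UTF-8 alone and transcodes everything else from an assumed source encoding. The ISO-8859-1 default reflects the guess in the question.

```python
def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Return raw as UTF-8 bytes, transcoding only when it is not valid UTF-8."""
    try:
        raw.decode("utf-8")  # already valid UTF-8 (this includes plain ASCII)
        return raw
    except UnicodeDecodeError:
        # Assumption: anything else is in source_encoding.
        return raw.decode(source_encoding).encode("utf-8")


print(to_utf8(b"\xc4nderung"))                 # ISO-8859-1 input gets transcoded
print(to_utf8("Änderung".encode("utf-8")))     # UTF-8 input is passed through
```

Checking for valid UTF-8 first avoids double-encoding descriptions that were already submitted as UTF-8. The same conversion would apply to the author fields, not only to details["desc"].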

User contributions licensed under: CC BY-SA