Today I am migrating some pretty old Perforce repositories to Git. While this is really interesting, one thing caught my eye: all special characters in the commit messages, and even in the author names, are in the wrong encoding.
So I tried to investigate where the problem comes from.
- First of all, the Perforce server does not support Unicode, so setting P4CHARSET has no effect other than producing the error
Unicode clients require a unicode enabled server.
- Then I checked the output of simple commands like
p4 users
which was indeed in ANSI according to Notepad++, or ISO-8859-1 according to file -bi
on the redirected output.
- The
locale
command says LANG=en_US.UTF-8 …
After all that, my guess is that all p4 client output is in ISO-8859-1, but git-p4 assumes UTF-8 instead.
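The suspected mismatch can be reproduced in a few lines of Python. This is only a sketch: the sample string and the helper name try_decode are made up for illustration and are not part of git-p4. It shows that bytes which are valid ISO-8859-1 are not necessarily valid UTF-8.

```python
# Made-up sample: text as a client might emit it in ISO-8859-1.
# The first character becomes the single byte 0xc4.
raw = u"\u00c4rger".encode("iso-8859-1")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the encoding the bytes were written in recovers the text.
print(repr(try_decode(raw, "iso-8859-1")))
# Decoding the same bytes as UTF-8 fails, because 0xc4 starts a
# two-byte UTF-8 sequence and the following byte is not a continuation byte.
print(repr(try_decode(raw, "utf-8")))
```

Any tool that reads such bytes as UTF-8 either fails or has to substitute replacement characters, which would match the broken messages and author names described above.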
I tried rewriting the commit messages with
git filter-branch --msg-filter 'iconv -f iso-8859-1 -t utf-8' -- --all
but that doesn't fix the issue, especially as it is not intended to rewrite the author names.
Does anyone have an idea how to force the output to be translated to UTF-8 before git-p4 receives it?
Update:
I tried to “overwrite” the default p4 command's output with a simple shell script that I prepended to PATH:
/usr/bin/p4 $@ | iconv -f iso-8859-1 -t utf-8
but that destroys the marshalled Python objects that are evidently used:
File "/usr/local/bin/git-p4", line 2467, in getBranchMapping
    for info in p4CmdList(command):
File "/usr/local/bin/git-p4", line 480, in p4CmdList
    entry = marshal.load(p4.stdout)
ValueError: bad marshal data
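A likely reason for the failure is that marshalled data is a binary format with length-prefixed strings, so re-encoding the whole stream changes byte counts without updating the prefixes. The sketch below illustrates this under assumptions: transcode and survives_transcoding are hypothetical helpers, the dictionaries are made-up samples, and marshal format version 0 is only assumed to resemble what git-p4 reads from p4.

```python
import marshal


def transcode(data):
    """Mimic `iconv -f iso-8859-1 -t utf-8` applied to a raw byte stream."""
    return data.decode("iso-8859-1").encode("utf-8")


def survives_transcoding(obj):
    """Return True if obj can still be unmarshalled unchanged after transcoding."""
    # Version 0 is the oldest marshal format (an assumption about the p4 output).
    stream = marshal.dumps(obj, 0)
    try:
        return marshal.loads(transcode(stream)) == obj
    except (ValueError, EOFError, TypeError):
        # The stream no longer parses as marshal data.
        return False


# Pure ASCII content: every byte stays the same, so the stream is untouched.
print(survives_transcoding({b"desc": b"plain ascii"}))
# One byte above 0x7f becomes two bytes, but the stored length is not updated.
print(survives_transcoding({b"desc": b"\xc4rger"}))
```

If this is what happens, the conversion has to be applied to the individual strings after unmarshalling, not to the byte stream that carries them.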
Update2:
As suggested in “Changing default encoding of Python?”, I tried to set the Python I/O encoding to ASCII:
export PYTHONIOENCODING="ascii"
python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'
Output:
('ascii', 'ascii')
but the messages and authors are still not migrated correctly.
Update 3:
Even patching the function
def commit(self, details, files, branch, parent = "")
in git-p4.py did not help:
Changing
self.gitStream.write(details["desc"])
to one of those
self.gitStream.write(details["desc"].encode('utf8', 'replace'))
self.gitStream.write(unicode(details["desc"], 'utf8'))
just raised:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 29: ordinal not in range(128)
As I am not a Python developer, I have no idea what to try next.
Answer
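I suspect the type of details["desc"]
is a byte string (str in Python 2).
In Python 2, calling encode on a byte string first decodes it implicitly with the ASCII codec, which is what raises the UnicodeDecodeError.
Therefore you need to decode
it to Unicode explicitly before you encode
it. Use
print type(details["desc"])
to find out the type.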
details["desc"].decode("iso-8859-1").encode("UTF-8")
might help to convert from iso-8859-1 to UTF-8.
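As a rough sketch of that conversion, assuming the values really are ISO-8859-1 byte strings: the helper name latin1_to_utf8 and the sample dictionary are made up for illustration, and the snippet is not a tested patch for git-p4.

```python
def latin1_to_utf8(raw):
    """Re-encode an ISO-8859-1 byte string as UTF-8 bytes."""
    text = raw.decode("iso-8859-1")  # bytes -> Unicode text
    return text.encode("utf-8")      # Unicode text -> UTF-8 bytes


# Made-up sample standing in for the details dictionary.
details = {"desc": b"\xc4nderung", "user": b"J\xf6rg"}

# Decoding with the ASCII codec fails on the first byte above 0x7f,
# which mirrors the implicit step Python 2 performs in str.encode.
try:
    details["desc"].decode("ascii")
except UnicodeDecodeError as err:
    print(err)

# Decoding explicitly with the real source encoding avoids the error.
converted = {key: latin1_to_utf8(value) for key, value in details.items()}
print(converted)
```

One caveat: decoding as ISO-8859-1 never fails, so applying this to data that is already UTF-8 silently produces garbled text. It is worth checking the actual encoding of the p4 output first.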