NumPy dtype issues in genfromtxt(), reads string in as bytestring

I want to read a standard ASCII CSV file, which consists of float and string columns, into NumPy.

Whatever I tried, the string fields in the resulting array came back as byte strings, displayed as b'C.3' rather than 'C.3'.

However, I want to avoid a separate byte-string conversion step, and was wondering how I can read the string columns in as regular strings directly.

I tried several things from the numpy.genfromtxt() documentation, e.g., dtype='S,S,S,f,S' or dtype='a25,a25,a25,f,a25', but nothing really helped here.

I'm afraid I just don't understand how the dtype conversion really works. It would be nice if you could give me a hint here!

Thanks

Answer

The same genfromtxt() call displays its result differently in Python 2.7 and in Python 3.

The 'regular' strings in Python 3 are unicode, but your text file has byte strings. all_data is the same in both cases (136 bytes), but Python 3 displays a byte string as b'C.3', not just 'C.3'.

What kinds of operations do you plan on doing with these strings? 'ZIN' in all_data['f0'][1] works with the 2.7 version, but in Python 3 you have to use b'ZIN' in all_data['f0'][1].
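To make the difference concrete, here is a minimal Python 3 sketch. The array col is a hypothetical stand-in for one bytes column such as all_data['f0'], and its values are invented for illustration; it is not the original data.

```python
import numpy as np

# Hypothetical stand-in for a string column that was read as bytes (dtype 'S12').
col = np.array([b'ZINC00043096', b'ZINC00090377'], dtype='S12')

item = col[1]                  # a numpy.bytes_ scalar, which behaves like bytes
bytes_hit = b'ZIN' in item     # bytes pattern against bytes data works

try:
    'ZIN' in item              # str pattern against bytes data
    str_error = None
except TypeError as exc:       # Python 3 refuses to mix str and bytes
    str_error = exc

# Casting the column to a unicode dtype gives ordinary Python 3 strings.
col_u = col.astype('U12')
str_hit = 'ZIN' in col_u[1]

print(bytes_hit, type(str_error).__name__, str_hit)
```

The cast at the end is the extra conversion step the question wants to avoid; it is shown here only to contrast the two string types.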

The question "Variable/unknown length string/unicode dtype in numpy" reminds me that you can specify a unicode string type in the dtype. However, this becomes more complicated if you don't know the lengths of the strings beforehand.
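As a rough sketch of that idea in Python 3: the rows below are invented sample data, the field widths (U12, U3, U2, U9) are assumptions sized to fit those rows, and passing encoding explicitly assumes NumPy 1.14 or later.

```python
import io
import numpy as np

# Hypothetical rows, invented for illustration only.
text = (
    "ZINC00043096,C.3,C1,-0.154,methyl\n"
    "ZINC00090377,C.3,C8,0.0187,methylene\n"
)

# Each 'U<n>' field holds up to n unicode characters; the widths must be
# chosen in advance, which is the complication mentioned above.
all_data_u = np.genfromtxt(
    io.StringIO(text),
    delimiter=",",
    dtype="U12,U3,U2,f8,U9",
    encoding="utf-8",
)

print(all_data_u.dtype)
print('ZIN' in all_data_u['f0'][1])   # a plain str pattern now works
```

With unicode fields, the membership test no longer needs a b'' prefix in Python 3.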

In Python 2.7, the strings in all_data_u display with a u prefix, for example u'C.3'.

all_data_u is 448 bytes, because numpy allocates 4 bytes for each unicode character. Each U4 item is 16 bytes long.
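The byte counts can be checked with a little arithmetic. The record layout and the row count of 8 below are guesses, chosen only because they reproduce the 136 and 448 byte totals quoted above; the dtype actually used is not shown here.

```python
import numpy as np

# One byte per character for 'S', four bytes per character for 'U'.
s4_size = np.dtype('S4').itemsize    # 4
u4_size = np.dtype('U4').itemsize    # 16

# Assumed layout: 13 characters of string data plus one 4-byte float per row.
bytes_rec = np.dtype('S4,S3,S2,f4,S4')   # 4 + 3 + 2 + 4 + 4 = 17 bytes
unicode_rec = np.dtype('U4,U3,U2,f4,U4') # 16 + 12 + 8 + 4 + 16 = 56 bytes

n_rows = 8                               # assumed row count
bytes_total = np.zeros(n_rows, dtype=bytes_rec).nbytes
unicode_total = np.zeros(n_rows, dtype=unicode_rec).nbytes

print(s4_size, u4_size, bytes_total, unicode_total)
```

Whatever the real layout was, the string portion of each record grows fourfold when S fields become U fields, while the float field stays the same size.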


Changes in v1.14, which added an encoding argument to the text I/O functions: https://docs.scipy.org/doc/numpy/release.html#encoding-argument-for-text-io-functions
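A minimal sketch of how the encoding argument can be used, assuming NumPy 1.14 or later and Python 3. The rows are invented sample data, and dtype=None asks genfromtxt to infer each column's type and string width.

```python
import io
import numpy as np

# Hypothetical rows, invented for illustration only.
text = (
    "ZINC00043096,C.3,C1,-0.154,methyl\n"
    "ZINC00090377,C.3,C8,0.0187,methylene\n"
)

# With an explicit text encoding, string columns are inferred as unicode
# ('U') fields, so no widths need to be known beforehand.
data = np.genfromtxt(
    io.StringIO(text),
    delimiter=",",
    dtype=None,
    encoding="utf-8",
)

print(data.dtype)
print(data['f0'][1])
```

This reads the string columns as regular strings directly, without a separate byte-string conversion.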

User contributions licensed under: CC BY-SA