
How can I read in a binary file from hdfs into a Spark dataframe?

I am trying to port some code from pandas to (py)Spark. Unfortunately, I am already failing at the input step, where I want to read in binary data and put it into a Spark DataFrame.

So far I am using numpy's fromfile:

import numpy as np
import pandas as pd

dt = np.dtype([('val1', '<i4'), ('val2', '<i4'), ('val3', '<i4'), ('val4', 'f8')])
data = np.fromfile('binary_file.bin', dtype=dt)
data = data[1:]                                         # throw away header
df_bin = pd.DataFrame(data, columns=data.dtype.names)

But for Spark I couldn't find out how to do it. My workaround so far was to use CSV files instead of the binary file, but that is not an ideal solution. I am aware that I shouldn't use numpy's fromfile with Spark. How can I read in a binary file that is already loaded into HDFS?

I tried something like

fileRDD = sc.parallelize(['hdfs:///user/bin_file1.bin', 'hdfs:///user/bin_file2.bin'])
fileRDD.map(lambda x: ???)

But it gives me a "No such file or directory" error.

I have seen this question: spark in python: creating an rdd by loading binary data with numpy.fromfile, but that only works if I have the files stored in the home directory of the driver node.


Answer

So, for anyone who, like me, is starting out with Spark and stumbles upon binary files: here is how I solved it.

import numpy as np
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

dt = np.dtype([('idx_metric', '>i4'), ('idx_resource', '>i4'), ('date', '>i4'),
               ('value', '>f8'), ('pollID', '>i2')])
schema = StructType([StructField('idx_metric', IntegerType(), False),
                     StructField('idx_resource', IntegerType(), False),
                     StructField('date', IntegerType(), False),
                     StructField('value', DoubleType(), False),
                     StructField('pollID', IntegerType(), False)])

filenameRdd = sc.binaryFiles('hdfs://nameservice1:8020/user/*.binary')

def read_array(rdd):
    # rdd is a (filename, bytes) pair produced by sc.binaryFiles
    # output = zlib.decompress(bytes(rdd[1]), 15 + 32)  # in case the file is also zipped (needs: import zlib)
    array = np.frombuffer(bytes(rdd[1])[20:], dtype=dt)  # skip the 20-byte header
    array = array.byteswap().newbyteorder()              # convert big-endian data to native byte order
    return array.tolist()                                # list of tuples, one per record

unzipped = filenameRdd.flatMap(read_array)
bin_df = sqlContext.createDataFrame(unzipped, schema)

And now you can do whatever fancy stuff you want with your DataFrame in Spark.
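For example, as a quick sanity check on the bin_df built above, you can inspect the schema and a few parsed records:

bin_df.printSchema()                       # should list the five fields from the StructType
bin_df.show(5)                             # first few decoded records
bin_df.filter(bin_df.value > 0).count()    # example: count records with a positive value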
