Tag: hdfs

How to use pyarrow parquet with multiprocessing

hdfs parquet pyarrow python python-multiprocessing

I want to read multiple hdfs files simultaneously using pyarrow and multiprocessing. The simple python script works (see below), but if I try to do the same thing with multiprocessing, then it hangs indefinitely. My only guess is that env is different somehow, but all the environment variable should be the same in the child process and parent process. I’ve

How to use multistep mrjob with json file

hadoop hdfs json mrjob python

I’m trying to use hadoop to get some statistics from a json file like average number of stars for a category or language with most reviews. To do this I am using mrjob, I found this code: It allows to find the most used word, but I am not sure how to do this with json attributes instead of words.

PySpark not able to move file from local to HDFS

hadoop hdfs python

I am running hadoop in my local machine on port 8020. My name nodes exist under path /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name. I have setup a pySpark project using Conda env and installed pyspark and hdfs3 dependencies. The following is my code: I am trying to copy the file from my local file system to HDFS but I am getting the following error: But