PySpark not able to move file from local to HDFS

I am running Hadoop on my local machine on port 8020. My NameNode data lives under /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name. I have set up a PySpark project using a Conda env and installed the pyspark and hdfs3 dependencies.

The following is my code:

from pyspark.sql import SparkSession
from hdfs3 import HDFileSystem

spark = SparkSession.builder.appName('First Project').getOrCreate()

# connect to the local HDFS namenode
hdfs = HDFileSystem(host="localhost", port=8020)
# copy test.csv from the local working directory into HDFS
hdfs.put("test.csv", "/usr/local/Cellar/hadoop/hdfs/tmp/dfs/name/test.csv")

I am trying to copy the file from my local file system to HDFS, but I am getting the following error:

OSError: Could not open file: /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name/test.csv, mode: wb Parent directory doesn't exist: /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name

But I can cd into that same directory and it exists. I am not sure why I get this error.

Also, when I try to do hdfs.mv with the same params, I get the following error:

FileNotFoundError: test.csv


Answer

If you want to upload a local CSV using Spark, you need to actually use Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# file:// forces a read from the local filesystem rather than HDFS
df = spark.read.csv('file:///path/to/file.csv')
# a bare output path resolves against fs.defaultFS, i.e. your HDFS instance
df.write.csv(output_path)
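Here output_path is left as a placeholder; given the setup in the question, it would presumably be something like hdfs://localhost:8020/some/dir, assuming fs.defaultFS points at that namenode.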

Otherwise, you cannot "put" into your Homebrew location, since that path doesn't exist on HDFS (at least, not unless you ran hadoop fs -mkdir -p /usr/local/Cellar/... for some reason). That directory is where the NameNode stores its metadata on your local disk; it is not a directory inside the HDFS namespace.
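For completeness, a minimal hdfs3 sketch of the same idea; the /data destination and the absolute local path are assumptions for illustration, not paths from the question:

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host="localhost", port=8020)

# create a destination directory inside the HDFS namespace first
# ("/data" is just an example name)
hdfs.mkdir("/data")
# first argument is a local path, second is the HDFS target
hdfs.put("/full/path/to/test.csv", "/data/test.csv")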

when I try to do hdfs.mv with the same params … FileNotFoundError

Because the local source path is resolved relative to your current working directory, you need to cd to the directory containing the local CSV first. Otherwise, specify the full path, as in the sketch below.
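A sketch of the full-path variant; ~/projects is a hypothetical location for test.csv:

import os

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host="localhost", port=8020)

# build an absolute path so the lookup no longer depends on the cwd
# (~/projects is an assumed location, substitute your own)
local_path = os.path.expanduser("~/projects/test.csv")
hdfs.put(local_path, "/data/test.csv")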
