I am running Hadoop on my local machine on port 8020. My NameNode data lives under the path /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name. I have set up a PySpark project using a Conda env and installed the pyspark and hdfs3 dependencies.
The following is my code:
from pyspark.sql import SparkSession
from hdfs3 import HDFileSystem

spark = SparkSession.builder.appName('First Project').getOrCreate()
hdfs = HDFileSystem(host="localhost", port=8020)
hdfs.put("test.csv", "/usr/local/Cellar/hadoop/hdfs/tmp/dfs/name/test.csv")
I am trying to copy the file from my local file system to HDFS but I am getting the following error:
OSError: Could not open file: /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name/test.csv, mode: wb
Parent directory doesn't exist: /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name
But I can cd into that same directory locally and it exists. I am not sure why I get this error.
Also, when I try to do hdfs.mv with the same params, I get the following error:
FileNotFoundError: test.csv
Answer
If you want to upload a local CSV using Spark, you'd need to actually use Spark:
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv('file:///path/to/file.csv')
df.write.csv(output_path)
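As a minimal sketch of that (the local path and the HDFS target directory below are placeholders; I'm assuming your NameNode is on localhost:8020 and that the target directory on HDFS is one you own):

from pyspark.sql import SparkSession

# Read the local CSV and write it out to HDFS.
# 'file:///path/to/test.csv' and '/user/yourname/test_csv' are placeholder
# paths; adjust them to your actual local file and an HDFS directory you own.
spark = SparkSession.builder.appName('First Project').getOrCreate()
df = spark.read.csv('file:///path/to/test.csv', header=True)
df.write.csv('hdfs://localhost:8020/user/yourname/test_csv')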
Otherwise, you cannot "put" into your Homebrew location since that path doesn't exist on HDFS (at least, not unless you ran hadoop fs -mkdir -p /usr/local/Cellar/... for some reason).
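If you do want to stick with hdfs3, a sketch of the same idea is to create (or reuse) a directory that actually exists on HDFS and put the file there; /user/yourname below is an assumed placeholder, not a path from your setup:

from hdfs3 import HDFileSystem

# Assumes the NameNode is reachable on localhost:8020.
hdfs = HDFileSystem(host="localhost", port=8020)
hdfs.mkdir("/user/yourname")                     # create the HDFS target directory
hdfs.put("test.csv", "/user/yourname/test.csv")  # copy the local file into HDFS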
when I try to do hdfs.mv with the same params … FileNotFoundError
That's because you need to cd into the directory containing the local CSV first. Otherwise, specify the full path to the file.
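For example, with an absolute local path (both paths below are hypothetical placeholders):

from hdfs3 import HDFileSystem

hdfs = HDFileSystem(host="localhost", port=8020)
# "/full/path/to/test.csv" stands in for wherever test.csv actually lives locally.
hdfs.put("/full/path/to/test.csv", "/user/yourname/test.csv")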