Tag: hadoop

How to use multistep mrjob with json file

I’m trying to use Hadoop to get some statistics from a JSON file, such as the average number of stars for a category or the language with the most reviews. To do this I am using mrjob, and I found this code: It finds the most used word, but I am not sure how to do this with JSON attributes instead of words.
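A minimal sketch of the two-step idea, in plain Python rather than mrjob itself: the first step parses each JSON line and averages stars per category, and the second step picks the category with the highest average. The field names `category` and `stars`, the sample records, and the `run_local` driver are assumptions made for illustration; the generator functions only mirror the shape of mrjob's mapper and reducer methods.

```python
import json
from collections import defaultdict


def mapper(line):
    """Step 1 map: parse one JSON line and emit (category, (stars, 1))."""
    record = json.loads(line)
    # "category" and "stars" are assumed attribute names.
    yield record["category"], (record["stars"], 1)


def reducer_average(category, values):
    """Step 1 reduce: average the stars seen for one category."""
    total = 0.0
    count = 0
    for stars, n in values:
        total += stars
        count += n
    # Emit under a single key (None) so step 2 sees every category together.
    yield None, (total / count, category)


def reducer_max(_, avg_category_pairs):
    """Step 2 reduce: keep the (average, category) pair with the highest average."""
    yield max(avg_category_pairs)


def run_local(lines):
    """Hypothetical local driver standing in for the Hadoop shuffle between steps."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    step1_output = []
    for key, values in grouped.items():
        step1_output.extend(reducer_average(key, values))

    # Every step 1 output shares the key None, so one reducer call handles them all.
    return next(reducer_max(None, (value for _, value in step1_output)))


# Made-up sample records, one JSON object per line.
sample = [
    '{"category": "books", "stars": 4, "language": "en"}',
    '{"category": "books", "stars": 2, "language": "de"}',
    '{"category": "toys", "stars": 5, "language": "en"}',
]

if __name__ == "__main__":
    print(run_local(sample))
```

Swapping `record["category"]` for `record["language"]` and emitting a count of 1 instead of the star value would give the language with the most reviews under the same two-step layout.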

PySpark not able to move file from local to HDFS

hadoop hdfs python

I am running Hadoop on my local machine on port 8020. My name nodes exist under the path /usr/local/Cellar/hadoop/hdfs/tmp/dfs/name. I have set up a PySpark project using a Conda env and installed the pyspark and hdfs3 dependencies. The following is my code: I am trying to copy the file from my local file system to HDFS but I am getting the following error: But

Read shapefile from HDFS with geopandas

geopandas hadoop python

I have a shapefile on my HDFS and I would like to import it into my Jupyter Notebook with geopandas (version 0.8.1). I tried the standard read_file() method, but it does not recognize the HDFS directory; instead, I believe it searches my local directory, since a test with a local directory reads the shapefile correctly. This

How can I read in a binary file from hdfs into a Spark dataframe?

apache-spark apache-spark-sql hadoop numpy python

I am trying to port some code from pandas to (Py)Spark. Unfortunately I am already failing at the input step, where I want to read in binary data and put it in a Spark DataFrame. So far I am using fromfile from numpy: But for Spark I couldn’t find how to do it. My workaround so far was to use