I have a shapefile on HDFS and I would like to import it into my Jupyter Notebook with geopandas (version 0.8.1).
I tried the standard read_file() method, but it does not recognize the HDFS path; it seems to look in my local directory instead, since the same call reads the shapefile correctly when I point it at a local copy.
This is the code I used:
```python
import geopandas as gpd

shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')
```
and the error I obtained:
```
---------------------------------------------------------------------------
CPLE_OpenFailedError                      Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()

CPLE_OpenFailedError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory

During handling of the above exception, another exception occurred:

DriverError                               Traceback (most recent call last)
<ipython-input-17-3118e740e4a9> in <module>
----> 2 shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')
      3 print(shp.shape)
      4 shp.head(3)

/opt/venv/geocoding/lib/python3.6/site-packages/geopandas/io/file.py in _read_file(filename, bbox, mask, rows, **kwargs)
     94
     95     with fiona_env():
---> 96         with reader(path_or_bytes, **kwargs) as features:
     97
     98             # In a future Fiona release the crs attribute of features will

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/env.py in wrapper(*args, **kwargs)
    398     def wrapper(*args, **kwargs):
    399         if local._env:
--> 400             return f(*args, **kwargs)
    401         else:
    402             if isinstance(args[0], str):

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/__init__.py in open(fp, mode, driver, schema, crs, encoding, layer, vfs, enabled_drivers, crs_wkt, **kwargs)
    255     if mode in ('a', 'r'):
    256         c = Collection(path, mode, driver=driver, encoding=encoding,
--> 257                        layer=layer, enabled_drivers=enabled_drivers, **kwargs)
    258     elif mode == 'w':
    259         if schema:

/opt/venv/geocoding/lib/python3.6/site-packages/fiona/collection.py in __init__(self, path, mode, driver, schema, crs, encoding, layer, vsi, archive, enabled_drivers, crs_wkt, ignore_fields, ignore_geometry, **kwargs)
    160         if self.mode == 'r':
    161             self.session = Session()
--> 162             self.session.start(self, **kwargs)
    163         elif self.mode in ('a', 'w'):
    164             self.session = WritingSession()

fiona/ogrext.pyx in fiona.ogrext.Session.start()
fiona/_shim.pyx in fiona._shim.gdal_open_vector()

DriverError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory
```
So, I was wondering whether it is actually possible to read a shapefile, stored in HDFS, with geopandas. If yes, how?
Answer
If someone is still looking for an answer to this question, I managed to find a workaround.
First of all, you need a .zip file that contains all the files making up your shapefile (.shp, .shx, .dbf, …). Then we use pyarrow to establish a connection to HDFS and fiona to read the zipped shapefile.
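A shapefile is really a bundle of sidecar files, so the zipping step just has to collect them into one archive. A minimal sketch of that step, assuming the components sit next to each other on local disk (the `zip_shapefile` helper and the file names are illustrative, not part of the original answer):

```python
import os
import zipfile

def zip_shapefile(base_path, out_zip):
    """Bundle a shapefile's sidecar files into a single zip archive.

    base_path is the path without extension, e.g. 'data/my_shapefile'.
    """
    # mandatory (.shp, .shx, .dbf) and common optional components
    exts = [".shp", ".shx", ".dbf", ".prj", ".cpg"]
    with zipfile.ZipFile(out_zip, "w") as zf:
        for ext in exts:
            part = base_path + ext
            if os.path.exists(part):  # .prj/.cpg may legitimately be absent
                zf.write(part, arcname=os.path.basename(part))
```

The resulting archive can then be copied to HDFS, e.g. with `hdfs dfs -put my_zipped_shapefile.zip /my_hdfs_directory/`.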
Package versions I’m using:
pyarrow==2.0.0
fiona==1.8.18
The code:
```python
# import packages
import pandas as pd
import geopandas as gpd
import fiona
import pyarrow

# establish a connection to HDFS
fs = pyarrow.hdfs.connect()

# read the zipped shapefile
with fiona.io.ZipMemoryFile(fs.open('hdfs://my_hdfs_directory/my_zipped_shapefile.zip')) as z:
    with z.open('my_shp_file_within_zip.shp') as collection:
        gdf = gpd.GeoDataFrame.from_features(collection)

print(gdf.shape)
```
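Note that newer pyarrow releases deprecate `pyarrow.hdfs.connect()` in favour of the `pyarrow.fs` filesystem API. A hedged sketch of the same workaround with that interface, assuming your cluster's NameNode is resolvable via the Hadoop configuration (the host string and paths are placeholders):

```python
import fiona
import geopandas as gpd
from pyarrow import fs

# "default" resolves the NameNode from the cluster's fs.defaultFS setting
hdfs = fs.HadoopFileSystem("default")

# read the whole zipped shapefile into memory
with hdfs.open_input_file("/my_hdfs_directory/my_zipped_shapefile.zip") as f:
    data = f.read()

with fiona.io.ZipMemoryFile(data) as z:
    with z.open("my_shp_file_within_zip.shp") as collection:
        gdf = gpd.GeoDataFrame.from_features(collection)
```

This requires a working Hadoop client installation (libhdfs), same as the pyarrow.hdfs version, so it cannot be run outside the cluster.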