
File metadata such as creation time in Azure Storage from Databricks

I'm trying to get file creation metadata.

File is in: Azure Storage
Accessing data through: Databricks

Right now I'm using:

   file_path = my_storage_path
   dbutils.fs.ls(file_path)

but it returns

[FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940)]

I don't get any information about the creation time. Is there a way to get that information?

Other solutions on Stack Overflow refer to files that are already in Databricks (e.g. "Does databricks dbfs support file metadata such as file/folder create date or modified date"). In my case we access the data from Databricks, but the data is in Azure Storage.

Answer

It really depends on the version of Databricks Runtime (DBR) that you're using. For example, the modification timestamp is available if you use DBR 10.2 (I didn't test with 10.0/10.1, but it's definitely not available on 9.1):

(Screenshot: dbutils.fs.ls output on DBR 10.2, where each FileInfo entry includes a modification timestamp.)
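For illustration, this is roughly what that looks like. A sketch assuming a runtime where the field is exposed (it appears as a modificationTime attribute, in epoch milliseconds, on the returned FileInfo objects); the path and values are placeholders:

files = dbutils.fs.ls("abfss://container@account.dfs.core.windows.net/folder/")  # placeholder path
print(files[0])
# Expected shape on a recent DBR (values are illustrative):
# FileInfo(path='path_myFile.csv', name='fileName.csv', size=437940, modificationTime=1639500000000)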

If you need to get that information, you can use the Hadoop FileSystem API via the Py4j gateway, like this:

# Access the JVM classes through the SparkContext's Py4j gateway
URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration

# Get a FileSystem handle for the given URI (point this at your storage path instead of /tmp)
fs = FileSystem.get(URI("/tmp"), Configuration())

# listStatus returns Hadoop FileStatus objects, which expose the modification
# time (as epoch milliseconds) in addition to the path and size
status = fs.listStatus(Path('/tmp/'))
for fileStatus in status:
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={fileStatus.getModificationTime()}")
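Applied to the question, the same API can be pointed at the Azure Storage location. A minimal sketch, assuming the cluster is already configured with credentials for the storage account (the container/account names are placeholders) and reusing the cluster's Hadoop configuration so those credentials are picked up; getModificationTime() returns epoch milliseconds, so it is converted for readability. Note that Hadoop's FileStatus exposes a modification time, not a creation time:

from datetime import datetime, timezone

# Placeholder path -- replace with your actual container/account/folder
abfss_path = "abfss://mycontainer@myaccount.dfs.core.windows.net/my/folder"

# Reuse the cluster's Hadoop configuration so the storage credentials are picked up
fs = FileSystem.get(URI(abfss_path), sc._jsc.hadoopConfiguration())

for fileStatus in fs.listStatus(Path(abfss_path)):
    # getModificationTime() returns milliseconds since the Unix epoch
    mod_time = datetime.fromtimestamp(fileStatus.getModificationTime() / 1000, tz=timezone.utc)
    print(f"path={fileStatus.getPath()}, size={fileStatus.getLen()}, mod_time={mod_time}")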