Compare CSV files content with filecmp and ignore metadata

I want to compare all CSV files kept on my local machine to files kept on a server. The folder structure is the same for both of them. I only want to do a data comparison and not metadata (like time of creation, etc). I am using filecmp but it seems to perform metadata comparison. Is there a way to do what I want?

Answer

There are multiple ways to compare the .csv files between the 2 repositories (server file system and local file system).


Method 1: using hashlib

This method uses the Python module hashlib. I used the sha256 hashing algorithm to compute a digest for each file, then compared the digests of files that have the same name in both directories. This method works well, but it overlooks any file that exists in only one of the two directories.
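
A minimal sketch of this approach, assuming both trees are reachable as local paths. The names `sha256_of_file` and `compare_by_hash` are hypothetical helpers chosen for illustration. Files are read in chunks so large CSVs are not loaded into memory at once.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_by_hash(local_dir, server_dir):
    """Map each CSV's relative path to True (same bytes) or False (different).

    Only files present under both roots are compared; files that exist on
    one side only are skipped.
    """
    local_dir, server_dir = Path(local_dir), Path(server_dir)
    results = {}
    for local_file in sorted(local_dir.rglob("*.csv")):
        relative = local_file.relative_to(local_dir)
        server_file = server_dir / relative
        if server_file.is_file():
            results[relative.as_posix()] = (
                sha256_of_file(local_file) == sha256_of_file(server_file)
            )
    return results
```

Calling `compare_by_hash("local_root", "server_root")` (placeholder paths) returns a dictionary keyed by relative path.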

Method 2: using os st_size

This method uses the Python module os. In this example, I compared the sizes of the files. This method works ok, but it reports two files as identical whenever a data change leaves the file size unchanged.
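
A minimal sketch of the size check, where `same_size` is a hypothetical helper name. It reads only the `st_size` field returned by `os.stat`, so the file contents are never opened.

```python
import os


def same_size(file_a, file_b):
    """Return True when both files have the same size in bytes."""
    return os.stat(file_a).st_size == os.stat(file_b).st_size
```

Two files containing `1,2` and `1,3` have the same size, so this check would report them as equal even though the data differs.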

Method 3: using os st_size and st_mtime

This method also uses the Python module os. In this example, I compared not only the size of the file, but also the last modification time. This method works, but it can misclassify identical files as different. In testing, I saved a file without modifying its data and the st_mtime check flagged the file as different, even though its content had not changed.
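
A minimal sketch of the combined check, where `same_size_and_mtime` is a hypothetical helper name. It compares the `st_size` and `st_mtime` fields from `os.stat` and never reads the file contents.

```python
import os


def same_size_and_mtime(file_a, file_b):
    """Return True when size and last-modification time both match."""
    stat_a, stat_b = os.stat(file_a), os.stat(file_b)
    return (stat_a.st_size, stat_a.st_mtime) == (stat_b.st_size, stat_b.st_mtime)
```

Because copying or re-saving a file usually changes its modification time, this check is a poor fit when only the data matters.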

Method 4: using set()

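This example uses Python set() to determine the line-to-line differences between 2 csv files with the same name. This method outputs the lines that appear in one file but not in the other. Because sets are unordered and drop duplicates, it does not report line positions or repeated lines.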
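
A minimal sketch of the set-based comparison, assuming the files are UTF-8 text. The name `line_differences` is a hypothetical helper. Line endings are stripped so that a `\r\n` versus `\n` mismatch alone is not reported as a difference.

```python
def line_differences(file_a, file_b, encoding="utf-8"):
    """Return (lines only in file_a, lines only in file_b) as two sets."""
    with open(file_a, encoding=encoding, newline="") as handle_a:
        lines_a = {line.rstrip("\r\n") for line in handle_a}
    with open(file_b, encoding=encoding, newline="") as handle_b:
        lines_b = {line.rstrip("\r\n") for line in handle_b}
    # Set difference keeps the lines that are missing from the other file.
    return lines_a - lines_b, lines_b - lines_a
```

Two empty sets mean both files contain the same set of distinct lines, although their order and repetition counts may still differ.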

Method 5: using filecmp.cmp

This method uses the Python module filecmp. In this example I used filecmp.cmp with shallow set to False. By default (shallow=True), filecmp.cmp treats two files as equal when their os.stat() signatures (file type, size, and modification time) match, without reading them. Setting shallow to False instructs filecmp to compare the contents of the files instead. This method works as well as Method 1, which used hashlib.
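
A minimal sketch of a content-only comparison over two directory trees, assuming both are reachable as local paths. The name `changed_csvs` is a hypothetical helper. Passing `shallow=False` makes `filecmp.cmp` compare the bytes of each pair.

```python
import filecmp
from pathlib import Path


def changed_csvs(local_dir, server_dir):
    """Return relative paths of CSV files whose contents differ.

    Only files present under both roots are compared.
    """
    local_dir, server_dir = Path(local_dir), Path(server_dir)
    changed = []
    for local_file in sorted(local_dir.rglob("*.csv")):
        relative = local_file.relative_to(local_dir)
        server_file = server_dir / relative
        if not server_file.is_file():
            continue  # exists on one side only, nothing to compare
        # shallow=False forces a byte-by-byte comparison of the contents.
        if not filecmp.cmp(local_file, server_file, shallow=False):
            changed.append(relative.as_posix())
    return changed
```

A file that was re-saved without data changes has a new modification time but identical bytes, so it does not appear in the returned list.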

Method 6: using filecmp.dircmp

This method also uses the Python module filecmp. In this example I used filecmp.dircmp, which allows me to identify not only the files that exist in only one of the 2 directories, but also the files that have the same name but different content.
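
A minimal sketch of a recursive report built on `filecmp.dircmp`, where `compare_trees` is a hypothetical helper name. One design choice here is an assumption on my part: `dircmp.diff_files` relies on a shallow comparison by default, so the sketch passes the common CSV names to `filecmp.cmpfiles` with `shallow=False` to compare contents instead.

```python
import filecmp
import os


def compare_trees(left, right, prefix=""):
    """Recursively report one-sided entries and CSVs whose contents differ."""
    comparison = filecmp.dircmp(left, right)
    report = {
        # Entries (files or directories) that exist on one side only.
        "left_only": [os.path.join(prefix, n) for n in comparison.left_only],
        "right_only": [os.path.join(prefix, n) for n in comparison.right_only],
        "different": [],
        "errors": [],
    }
    csv_names = [n for n in comparison.common_files if n.lower().endswith(".csv")]
    # cmpfiles returns (match, mismatch, errors); shallow=False compares bytes.
    _, mismatch, errors = filecmp.cmpfiles(left, right, csv_names, shallow=False)
    report["different"] = [os.path.join(prefix, n) for n in mismatch]
    report["errors"] = [os.path.join(prefix, n) for n in errors]
    # Descend into directories that exist on both sides.
    for name in comparison.common_dirs:
        child = compare_trees(
            os.path.join(left, name),
            os.path.join(right, name),
            os.path.join(prefix, name),
        )
        for key in report:
            report[key].extend(child[key])
    return report
```

The `errors` list holds files that could not be compared, for example because of a permissions problem.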

Method 7: line-by-line comparison

This example does a line-by-line comparison of 2 csv files and outputs the lines that are different. The output can be added to either a Python dictionary or a JSON file for further processing.
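
A minimal sketch of a positional comparison, assuming UTF-8 text files. The name `diff_lines` and the shape of the returned dictionary are hypothetical choices for illustration. `itertools.zip_longest` pads the shorter file with `None`, so extra trailing lines are reported too.

```python
import json
from itertools import zip_longest


def _clean(line):
    """Strip the line ending; keep None for a line that does not exist."""
    return None if line is None else line.rstrip("\r\n")


def diff_lines(file_a, file_b, encoding="utf-8"):
    """Return {line_number: {"a": text, "b": text}} for lines that differ."""
    differences = {}
    with open(file_a, encoding=encoding, newline="") as handle_a, \
            open(file_b, encoding=encoding, newline="") as handle_b:
        pairs = zip_longest(handle_a, handle_b)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            text_a, text_b = _clean(line_a), _clean(line_b)
            if text_a != text_b:
                differences[number] = {"a": text_a, "b": text_b}
    return differences


def differences_as_json(file_a, file_b):
    """Serialize the differences; JSON turns the integer keys into strings."""
    return json.dumps(diff_lines(file_a, file_b), indent=2)
```

A single inserted row shifts every later line, so this positional comparison then reports all of the following lines as different.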

Local file system to S3 bucket using hashlib

The example below is a real-world use case for comparing files between a local file system and a remote S3 bucket. I was originally going to use the object.e_tag that AWS S3 creates, but that tag is not always a hash of the object's content (for example, for multipart uploads), so it shouldn’t be used in a hashing comparison operation. I decided to query S3 and load an individual file into a memory file system that could be queried and emptied during each comparison operation. This method worked very well and had no adverse impact on my system performance.
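
A minimal sketch of the S3 comparison, with one swap named plainly: instead of loading each object into a memory file system, it streams the object body through SHA-256 in chunks. The names `sha256_of_stream` and `local_matches_s3` are hypothetical helpers, and `s3_client` is assumed to be a boto3 S3 client (for example one created with `boto3.client("s3")`) with credentials already configured.

```python
import hashlib


def sha256_of_stream(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary file-like object."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def local_matches_s3(s3_client, bucket, key, local_path):
    """Return True when the local file and the S3 object hold the same bytes.

    s3_client is assumed to be a boto3 S3 client; bucket and key are
    placeholders supplied by the caller.
    """
    with open(local_path, "rb") as handle:
        local_digest = sha256_of_stream(handle)
    # get_object returns a dict whose "Body" is a readable stream.
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    return local_digest == sha256_of_stream(body)
```

Streaming keeps memory use bounded by the chunk size, at the cost of downloading every object that is compared.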

Local file system to S3 bucket using filecmp

This example is the same as the one above except I use filecmp.cmp instead of hashlib for the comparison operation.
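
A minimal sketch of the filecmp variant, assuming `s3_client` is a boto3 S3 client created elsewhere. The name `local_matches_s3_filecmp` is a hypothetical helper. Since `filecmp.cmp` works on file paths, the sketch downloads the object into a temporary directory first, which is an assumption about how the remote copy is staged.

```python
import filecmp
import os
import tempfile


def local_matches_s3_filecmp(s3_client, bucket, key, local_path):
    """Download one S3 object to a scratch file and compare contents.

    s3_client is assumed to be a boto3 S3 client; bucket and key are
    placeholders supplied by the caller.
    """
    with tempfile.TemporaryDirectory() as scratch:
        remote_copy = os.path.join(scratch, "remote.csv")
        with open(remote_copy, "wb") as handle:
            s3_client.download_fileobj(Bucket=bucket, Key=key, Fileobj=handle)
        # shallow=False compares bytes, ignoring the download's fresh mtime.
        return filecmp.cmp(local_path, remote_copy, shallow=False)
```

The scratch directory is removed when the `with` block exits, so nothing accumulates between comparisons.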

Local file system to Google Cloud storage bucket using hashlib

This example is similar to the S3 hashlib code example above, but it uses a Google Cloud storage bucket.
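
A minimal sketch for Google Cloud Storage, assuming `bucket` is a `google.cloud.storage` bucket object (for example from `storage.Client().bucket(...)`) with credentials already configured. The name `local_matches_gcs` is a hypothetical helper. The blob is loaded into memory with `download_as_bytes`, so this suits files that fit comfortably in RAM.

```python
import hashlib


def local_matches_gcs(bucket, blob_name, local_path):
    """Return True when the local file and the blob hold the same bytes.

    bucket is assumed to be a google.cloud.storage Bucket; blob_name is a
    placeholder supplied by the caller.
    """
    # Load the whole blob into memory, then hash it.
    remote_bytes = bucket.blob(blob_name).download_as_bytes()
    with open(local_path, "rb") as handle:
        local_bytes = handle.read()
    local_digest = hashlib.sha256(local_bytes).hexdigest()
    remote_digest = hashlib.sha256(remote_bytes).hexdigest()
    return local_digest == remote_digest
```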

Local file system to Google Cloud storage bucket using filecmp

This example is similar to the S3 filecmp code example above, but it uses a Google Cloud storage bucket.
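
A minimal sketch of the filecmp variant for Google Cloud Storage, assuming `bucket` is a `google.cloud.storage` bucket object created elsewhere. The name `local_matches_gcs_filecmp` is a hypothetical helper. The blob is written to a temporary file with `download_to_filename` because `filecmp.cmp` compares paths on disk.

```python
import filecmp
import os
import tempfile


def local_matches_gcs_filecmp(bucket, blob_name, local_path):
    """Download one blob to a scratch file and compare contents.

    bucket is assumed to be a google.cloud.storage Bucket; blob_name is a
    placeholder supplied by the caller.
    """
    with tempfile.TemporaryDirectory() as scratch:
        remote_copy = os.path.join(scratch, "remote.csv")
        bucket.blob(blob_name).download_to_filename(remote_copy)
        # shallow=False compares bytes rather than os.stat() signatures.
        return filecmp.cmp(local_path, remote_copy, shallow=False)
```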
