How to write a csv file via pandas and read it in R at regular intervals?

Question

Background A driving simulator PC in my lab generates data that I receive via python socket. The data is generated every 1/60th of a second. I continuously save it to a csv file called position.csv. I also want to read position.csv in R to use in a shiny app. I read it every 0.2 seconds. Problem When I run th…

Accepted Answer

Without even looking at the shiny portion of this, reaching out to the filesystem for a CSV file every 0.2 seconds has got to be a huge bottleneck and unlikely to be the best way to go with performance in mind.The most likely reason you&#8217;re getting permission denied may (I haven&#8217;t tested) be due to file-locking, where pandas has temporarily locked the file while writing to it, and R is trying to read it too soon. Frankly, even if you were not getting a &#8220;denied&#8221; error, it is certainly feasible to try to read the file while it is mid-write, meaning incomplete data. There should be some coordination of the write and read events such that this cannot happen.Some thoughts, without testing (but with experience in mutual-file-access):Not Filesystem-basedOne alternative is to use some form of streaming-data mechanism such as Redis. This can be a simple &#8220;topic&#8221; (fifo queue) or with a little more thought (and depending on your needs) a Pub/Sub setup. With this, pandas would push its new data to the topic or pubsub topic, and one (if a vanilla topic) or one-or-more (if pubsub) consumers would get the data in its entirety.Advantages:Not using a filesystem, so the biggest bottleneck will be network bandwidth, likely much lower latency than filesystems, and writes and reads are always atomic (meaning no read-while-being-written problems like you&#8217;re facing);With pub/sub, any client can start &#8220;late&#8221; and get all past data (if desired) without impacting any other consumers. Realize that &#8220;another consumer&#8221; may be just you monitoring things, it doesn&#8217;t necessarily have to be a full-time processing program.Disadvantages:Requires Redis (or Apache Kafka or RabbitMQ or something similar) as a service somewhere on the network, the closer (topologically the better).Requires a little more thought into the architecture of cooperation between pandas and R. This will reap benefits.This is actually quite easy to do on your dev-computer using Docker: the Redis image is free and performs very well, I use it regularly for purposes similar to this. (Docker is not required, Redis installs just fine without it, over to you.)(Python has redis-py, R has redux.)Filesystem-basedIf you must go with file-based, then you need to use a method that completely mitigates the risk of reading-while-writing. While file-writing is not atomic (which is why you have problems), file-renaming is. Write the file to a temporary file (on the same filesystem, but not in a place or with a name that R will read) and then, once written/closed, rename it so that R will see it.For instance, let&#8217;s assume that your convention is to use /some/path/file1234.csv where perhaps the 1234 increments with each write. (You might instead have time, it doesn&#8217;t matter.) Let&#8217;s restrict R so that it only sees files that end in the literal .csv (not difficult). In pandas, write to /some/path/file1234.csv.temp, and when complete (and you close() in python!), rename it to /some/path/file1234.csv. As soon as the file is renamed, R should be able to read it without distraction.Advantages:No change in architecture, likely the fastest to implement and test.Disadvantages:Still based on the filesystem, which means compounding latency from: network (if using a network filesystem), OS, HDD, etc.If you&#8217;re really curious, MailDir is a directory structure that I use for identical purposes, though it&#8217;s working off of a huge GPFS (NFS-like) where latency of file-creation can be upwards of 10-15 seconds, file-locking is not supported (reliably), and without the file-renaming atomicity I mentioned above, I would be sunk. Surely you do not need the &#8220;complexity&#8221; (not much, but relatively more complex) of a maildir structure for passing files between pandas and R, but &#8230; the premise of atomic file-renaming has precedence and has had a lot of people work on it. (Maildir scales very well, afaict. The only thing I haven&#8217;t attempted to figure out yet is filesystem-based pubsub in maildir &#8230;)(Python and R both do file-renaming atomically, no non-standard modules/packages required.)Possible change to python (untested):+ import os  # ...      dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}      my_data = pd.DataFrame([dict], columns=dict.keys())-     open("position.csv", "w")-     my_data.to_csv("position.csv", index=False)+     my_data.to_csv("position.csv.temp", index=False)+     try:+       os.remove("position.csv")+     except:+       pass+     os.rename("position.csv.temp", "position.csv")It might be good enough to os.remove("position.csv") before renaming instead of using .old, I haven&#8217;t tested to see what works best. I&#8217;m not as concerned about reading-processes, since on most systems the file itself (regardless of the inode on the filesystem) should allow R to continue reading even if the filename has been deleted. Again, not well tested.Other ConsiderationsFile format: while CSV is standard and easy, you may want to consider something that reads and writes faster such as feather. Both python and R have modules/packages that will support this. I don&#8217;t have experience with it, but perhaps https://rstudio-pubs-static.s3.amazonaws.com/207316_edcc0ea0a7c04ea5a63833aaea7051fb.html is a useful start. (There&#8217;s also &#8220;parquet&#8221;, I have no experience with this either.)Down-sampling: do you really need the data to be read every 0.2 seconds? Shiny (R!) is not going to guarantee reading every file right away, so there will be some &#8220;slop&#8221; here anyway. I recommend down-sampling to perhaps 1/sec or once per 2-3 seconds, depending on how you really intend to use it. I recognize some use-cases are not compatible with this recommendation, over to you.Redis option:Redis running on my laptop using the default port 6379. I&#8217;m using docker, so I started it with this. (Note that if you have special networking setup and/or not able to use host-mode for listening ports, this might need adjusting. It should just work, though, no other configuration required.)$ docker run -p "6379:6379" --name some-redis -d redisIn python:import pandas as pdimport redisr = redis.Redis(host='localhost', port=6379, db=0)df1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})df2 = pd.DataFrame(data={'col1': [11, 12], 'col2': [13, 14]})r.rpush('carsim', df1.to_json(orient='records'))r.rpush('carsim', df2.to_json(orient='records'))In R:R <- redux::hiredis()popped <- R$LPOP("carsim")popped# [1] "[{"col1":1,"col2":3},{"col1":2,"col2":4}]"jsonlite::fromJSON(popped)#   col1 col2# 1    1    3# 2    2    4popped <- R$LPOP("carsim")jsonlite::fromJSON(popped)#   col1 col2# 1   11   13# 2   12   14popped <- R$LPOP("carsim")popped# NULLAdaptation of maildir-like directoriesThis trick uses a premise of maildirs with a little bit of compromise (which should be fine so long as there is only one file-writer, the car-sim).In python:-     my_data.to_csv("position.csv", index=False)+     filename = '{:.3f}.csv'.format(time.time())+     my_data.to_csv('tmp/' + filename, index=False)+     os.rename('tmp/' + filename, 'new/' + filename)And in R:files <- list.files("new/", full.names = TRUE)dat <- rbindlist(lapply(files, fread))file.rename(files, "cur/")

How to write a csv file via pandas and read it in R at regular intervals?

Background

Problem

Python script for saving data to a csv file:

R script for reading the csv file and use in an app:

Question

Advertisement

Answer

Not Filesystem-based

Filesystem-based

Other Considerations

Redis option:

Adaptation of maildir-like directories