Background
A driving simulator PC in my lab generates data that I receive via a Python socket. The data is generated every 1/60th of a second. I continuously save it to a CSV file called position.csv. I also want to read position.csv in R to use in a Shiny app. I read it every 0.2 seconds.
Problem
When I run the Shiny app in R, Python throws PermissionError: [Errno 13] Permission denied: 'position.csv'.
Python script for saving data to a CSV file:
import socket
import struct
import pandas as pd

UDP_IP = "127.0.0.1"
UDP_PORT = 9000

sock = socket.socket(socket.AF_INET,     # Internet
                     socket.SOCK_DGRAM)  # UDP
sock.bind((UDP_IP, UDP_PORT))

while True:
    data, addr = sock.recvfrom(1024)  # buffer size is 1024 bytes
    fields = struct.unpack_from('=ddd', data)
    print(fields[0], fields[1], fields[2])
    dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
    my_data = pd.DataFrame([dict], columns=dict.keys())
    open("position.csv", "w")
    my_data.to_csv("position.csv", index=False)
R script for reading the CSV file and using it in an app:
library(shinydashboard)
library(dplyr)
library(ggplot2)
library(shiny)
library(data.table)

# ui----
ui <- dashboardPage(
  skin = "black",
  dashboardHeader(title = "Dashboard"),
  dashboardSidebar(
    sidebarMenu(
      menuItem("Maps", tabName = "navigation", icon = icon("compass"))
    )
  ),
  dashboardBody(
    tabItems(
      # First tab content
      tabItem(tabName = "navigation",
        fluidRow(
          tags$style(type = "text/css", ".recalculating {opacity: 1.0;}"),
          plotOutput("plot1")
        )
      )
    )
  )
)

# server----
server <- function(input, output, session) {
  position <- reactivePoll(
    200, session,
    # This function returns the time that log_file was last modified
    checkFunc = function() {
      if (file.exists("position.csv")) file.info("position.csv")$mtime[1] else ""
    },
    # This function returns the content of log_file
    valueFunc = function() {
      data.table::fread("position.csv")
    }
  )

  xl1 <- reactive({ position()$x - 1000 })
  xl2 <- reactive({ position()$x + 1000 })
  yl1 <- reactive({ position()$y - 800 })
  yl2 <- reactive({ position()$y + 800 })

  output$plot1 <- renderPlot({
    ggplot() +
      geom_point(data = position(), aes(x, y), color = "red", size = 5) +
      coord_equal(xlim = c(xl1(), xl2()), ylim = c(yl1(), yl2())) +
      theme_void()
  })

  cancel.onSessionEnded <- session$onSessionEnded(function() {
    stopApp()
  })
  cancel.onSessionEnded()
}

shinyApp(ui, server)
Question
How do I successfully read from and write to the position.csv file?
Answer
Without even looking at the Shiny portion of this: reaching out to the filesystem for a CSV file every 0.2 seconds is bound to be a huge bottleneck and is unlikely to be the best approach if performance is a concern.
The most likely reason you’re getting “permission denied” (I haven’t tested this) is file-locking: pandas temporarily locks the file while writing to it, and R tries to read it too soon. Frankly, even if you were not getting a “denied” error, it is entirely possible to read the file while it is mid-write, which means incomplete data. The write and read events should be coordinated so that this cannot happen.
Some thoughts, without testing (but with experience in mutual-file-access):
Not Filesystem-based
One alternative is to use some form of streaming-data mechanism such as Redis. This can be a simple “topic” (a FIFO queue) or, with a little more thought (and depending on your needs), a pub/sub setup. With this, pandas would push its new data to the queue or pub/sub channel, and one consumer (if a vanilla queue) or one or more consumers (if pub/sub) would get the data in its entirety.
Advantages:
- Not using a filesystem, so the biggest bottleneck will be network bandwidth, with likely much lower latency than a filesystem, and writes and reads are always atomic (meaning no read-while-being-written problems like the one you’re facing);
- With pub/sub, any client can start “late” and get all past data (if desired) without impacting any other consumers (note this depends on the broker: Kafka retains history, while classic Redis pub/sub does not replay missed messages). Realize that “another consumer” may be just you monitoring things; it doesn’t necessarily have to be a full-time processing program.
Disadvantages:
- Requires Redis (or Apache Kafka or RabbitMQ or something similar) as a service somewhere on the network; the closer (topologically), the better.
- Requires a little more thought about the architecture of cooperation between pandas and R. This will reap benefits.
This is actually quite easy to do on your dev computer using Docker: the Redis image is free and performs very well; I use it regularly for purposes similar to this. (Docker is not required; Redis installs just fine without it. Over to you.)
(Python has redis-py, R has redux.)
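For the pub/sub flavor, a minimal (untested) sketch with redis-py might look like the one below; the channel name carsim and the JSON payload are illustrative assumptions, and the R side could subscribe via redux just as well as another Python process.

import json
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Publisher side (the car-sim loop): every currently-subscribed
# client receives every message.
r.publish('carsim', json.dumps({'y': 1.0, 'x': 2.0, 'z': 3.0}))

# Subscriber side (another process):
p = r.pubsub()
p.subscribe('carsim')
for message in p.listen():  # blocks, yielding one dict per event
    if message['type'] == 'message':
        print(json.loads(message['data']))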
Filesystem-based
If you must go with a file-based approach, then you need a method that completely mitigates the risk of reading-while-writing. While file-writing is not atomic (which is why you have problems), file-renaming is. Write the data to a temporary file (on the same filesystem, but not in a place or with a name that R will read) and then, once it is written and closed, rename it so that R will see it.
For instance, let’s assume that your convention is to use /some/path/file1234.csv, where perhaps the 1234 increments with each write. (You might instead use a timestamp; it doesn’t matter.) Let’s restrict R so that it only sees files that end in the literal .csv (not difficult). In pandas, write to /some/path/file1234.csv.temp, and when complete (and you close() it in Python!), rename it to /some/path/file1234.csv. As soon as the file is renamed, R should be able to read it without interference.
Advantages:
- No change in architecture, likely the fastest to implement and test.
Disadvantages:
- Still based on the filesystem, which means compounding latency from: network (if using a network filesystem), OS, HDD, etc.
If you’re really curious, Maildir is a directory structure that I use for identical purposes, though in my case it works off of a huge GPFS (NFS-like) filesystem where the latency of file-creation can be upwards of 10-15 seconds and file-locking is not (reliably) supported; without the file-renaming atomicity I mentioned above, I would be sunk. Surely you do not need the “complexity” (not much, but relatively more complex) of a maildir structure for passing files between pandas and R, but the premise of atomic file-renaming has precedent and has had a lot of people work on it. (Maildir scales very well, afaict. The only thing I haven’t attempted to figure out yet is filesystem-based pubsub in maildir …)
(Python and R both do file-renaming atomically, no non-standard modules/packages required.)
Possible change to the Python script (untested):
+ import os
  # ...
  dict = {'y': fields[0], 'x': fields[1], 'z': fields[2]}
  my_data = pd.DataFrame([dict], columns=dict.keys())
- open("position.csv", "w")
- my_data.to_csv("position.csv", index=False)
+ my_data.to_csv("position.csv.temp", index=False)
+ try:
+     os.remove("position.csv")
+ except OSError:
+     pass
+ os.rename("position.csv.temp", "position.csv")
It might be good enough to os.remove("position.csv") before renaming (as above) instead of renaming the old file out of the way (e.g., to position.csv.old); I haven’t tested which works best. I’m not as concerned about reading processes, since on most (POSIX) systems an already-open file handle remains valid even after its directory entry has been deleted, so R should be able to continue a read in progress. Again, not well tested.
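Relatedly, on Python 3.3+ there is os.replace(), which does the remove-and-rename in a single atomic step (and, unlike os.rename(), also overwrites an existing target on Windows), so the try/except dance above can be skipped entirely. A minimal sketch; the sample DataFrame just stands in for the loop’s my_data:

import os
import pandas as pd

my_data = pd.DataFrame([{'y': 1.0, 'x': 2.0, 'z': 3.0}])  # stand-in for one UDP sample
my_data.to_csv("position.csv.temp", index=False)
os.replace("position.csv.temp", "position.csv")  # atomic rename; overwrites any existing file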
Other Considerations
- File format: while CSV is standard and easy, you may want to consider something that reads and writes faster, such as feather. Both Python and R have modules/packages that support it. I don’t have experience with it, but perhaps https://rstudio-pubs-static.s3.amazonaws.com/207316_edcc0ea0a7c04ea5a63833aaea7051fb.html is a useful start; see the sketch after this list. (There’s also “parquet”; I have no experience with this either.)
- Down-sampling: do you really need the data to be read every 0.2 seconds? Shiny (R!) is not going to guarantee reading every file right away, so there will be some “slop” here anyway. I recommend down-sampling to perhaps once per second or once per 2-3 seconds, depending on how you really intend to use it. I recognize some use-cases are not compatible with this recommendation; over to you.
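For the feather idea, an untested sketch on the Python side (this assumes the pyarrow package is installed; on the R side, arrow::read_feather() or the feather package reads the same file):

import os
import pandas as pd  # feather support requires the pyarrow package

df = pd.DataFrame([{'y': 1.0, 'x': 2.0, 'z': 3.0}])      # stand-in for one sample
df.to_feather("position.feather.temp")                   # columnar binary; faster to parse than CSV
os.replace("position.feather.temp", "position.feather")  # same atomic-rename trick as above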
Redis option:
Redis is running on my laptop using the default port 6379. I’m using Docker, so I started it with the command below. (Note that if you have a special networking setup and/or cannot use host-mode listening ports, this might need adjusting. It should just work, though; no other configuration required.)
$ docker run -p "6379:6379" --name some-redis -d redis
In Python:
import pandas as pd
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

df1 = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame(data={'col1': [11, 12], 'col2': [13, 14]})

r.rpush('carsim', df1.to_json(orient='records'))
r.rpush('carsim', df2.to_json(orient='records'))
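Untested, but wiring this into the original UDP loop might look like the following; the queue name carsim matches the toy example above, and the json module is used instead of building a DataFrame per sample to skip the pandas overhead:

import json
import socket
import struct

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 9000))

while True:
    data, addr = sock.recvfrom(1024)
    y, x, z = struct.unpack_from('=ddd', data)
    # one JSON array per sample, in the same shape the R example parses
    r.rpush('carsim', json.dumps([{'y': y, 'x': x, 'z': z}]))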
In R:
R <- redux::hiredis()
popped <- R$LPOP("carsim")
popped
# [1] "[{\"col1\":1,\"col2\":3},{\"col1\":2,\"col2\":4}]"
jsonlite::fromJSON(popped)
#   col1 col2
# 1    1    3
# 2    2    4
popped <- R$LPOP("carsim")
jsonlite::fromJSON(popped)
#   col1 col2
# 1   11   13
# 2   12   14
popped <- R$LPOP("carsim")
popped
# NULL
Adaptation of maildir-like directories
This trick uses a premise of maildirs with a little bit of compromise (which should be fine so long as there is only one file-writer, the car-sim).
In Python:
- my_data.to_csv("position.csv", index=False)
+ filename = '{:.3f}.csv'.format(time.time())
+ my_data.to_csv('tmp/' + filename, index=False)
+ os.rename('tmp/' + filename, 'new/' + filename)
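(This needs import time and import os at the top, and the directories must exist. A fuller, untested sketch, with the directory names tmp/, new/, and cur/ taken from the maildir convention:)

import os
import time

import pandas as pd

# tmp/ and new/ must be on the same filesystem for the rename to be atomic
for d in ('tmp', 'new', 'cur'):
    os.makedirs(d, exist_ok=True)

my_data = pd.DataFrame([{'y': 1.0, 'x': 2.0, 'z': 3.0}])    # stand-in for one sample
filename = '{:.3f}.csv'.format(time.time())                 # timestamped, so names never collide
my_data.to_csv(os.path.join('tmp', filename), index=False)  # write where R cannot see it ...
os.rename(os.path.join('tmp', filename),                    # ... then atomically move it into view
          os.path.join('new', filename))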
And in R:
files <- list.files("new/", full.names = TRUE)
dat <- data.table::rbindlist(lapply(files, data.table::fread))
# file.rename() needs one destination path per source file, not a bare directory,
# so build the cur/ paths explicitly; moving the files ensures each is read only once
file.rename(files, file.path("cur", basename(files)))