I have a model that is served using TorchServe, and I'm communicating with the TorchServe server over gRPC. The final postprocess
method of the custom handler returns a list, which is converted into bytes for transfer over the network.
The postprocess method:

def postprocess(self, data):
    # data type  - torch.Tensor
    # data shape - [1, 17, 80, 64]
    # data dtype - torch.float32
    return data.tolist()
The main issue is on the client side: converting the received bytes from TorchServe into a torch.Tensor is done inefficiently via ast.literal_eval:
# This takes 0.3 seconds
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))

# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
    response.prediction.decode('utf-8')))
Using numpy.frombuffer or torch.frombuffer returns the following errors.
import numpy as np

np.frombuffer(response.prediction)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size

np.frombuffer(response.prediction, dtype=np.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer size must be a multiple of element size
Using torch.frombuffer:
import torch

torch.frombuffer(response.prediction, dtype=torch.float32)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
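(Editor's note, not part of the original question: postprocess returns data.tolist(), so the bytes arriving at the client are the UTF-8 text of a nested Python list rather than raw float32 values, which is why the buffer length (2601542 bytes) is not a multiple of 4. A minimal sketch of the difference, using a stand-in tensor:)

import numpy as np
import torch

t = torch.rand(1, 17, 80, 64)                    # stand-in for the model output
text_payload = str(t.tolist()).encode('utf-8')   # roughly what the current pipeline ships
raw_payload = t.numpy().tobytes()                # raw float32 bytes

# The text payload is ASCII digits and brackets, which frombuffer cannot interpret
# as floats; the raw payload round-trips cleanly.
print(len(text_payload), len(raw_payload))
restored = np.frombuffer(raw_payload, dtype=np.float32).reshape(t.shape)
print(np.allclose(restored, t.numpy()))          # True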
Is there an alternative, more efficient way of converting the received bytes into a torch.Tensor?
Answer
One hack I've found that significantly increases performance when sending large tensors is to return the output as JSON (a list containing a dict).
In your handler’s postprocess function:
def postprocess(self, data):
    output_data = {}
    output_data['data'] = data.tolist()
    return [output_data]
On the client side, when you receive the gRPC response, decode it using json.loads:
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))

decoded_output = response.prediction.decode('utf-8')
preds = torch.as_tensor(json.loads(decoded_output))
preds should now hold the output tensor. json.loads is also noticeably faster here than ast.literal_eval, which has to build and evaluate a full Python AST for the entire literal.
Update:
There's an even faster method that should completely remove the bottleneck: use tf.io.serialize_tensor from TensorFlow to serialize your tensor inside postprocess.
import tensorflow as tf

def postprocess(self, data):
    return [tf.io.serialize_tensor(data.cpu()).numpy()]
On the client, decode it using tf.io.parse_tensor:
response = self.inference_stub.Predictions(
    inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))

prediction = response.prediction
torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())
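If you want to avoid a TensorFlow dependency purely for serialization, a similar zero-parse approach (my sketch, not part of the original answer) is to ship the raw tensor bytes and rebuild them with numpy.frombuffer on the client. This assumes the shape ([1, 17, 80, 64]) and dtype (float32) are fixed and known on both sides; serialize_tensor/parse_tensor have the advantage that dtype and shape travel inside the serialized TensorProto.

# Handler side: return the raw bytes of the tensor
def postprocess(self, data):
    # contiguous() ensures tobytes() emits the values in row-major order
    return [data.cpu().contiguous().numpy().tobytes()]

# Client side: rebuild the tensor from the received bytes
import numpy as np
import torch

arr = np.frombuffer(response.prediction, dtype=np.float32).reshape(1, 17, 80, 64)
preds = torch.from_numpy(arr.copy())   # copy() because frombuffer returns a read-only view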