I have a model that is served using TorchServe. I’m communicating with the TorchServe server using gRPC. The final postprocess method of the custom handler defined returns a list which is converted into bytes for transfer over the network.
The post process method
def postprocess(self, data):
# data type - torch.Tensor
# data shape - [1, 17, 80, 64] and data dtype - torch.float32
return data.tolist()
The main issue is at the client where converting the received bytes from TorchServe to a torch Tensor is inefficiently done via ast.literal_eval
# This takes 0.3 seconds
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
# This takes 0.84 seconds
predictions = torch.as_tensor(literal_eval(
response.prediction.decode('utf-8')))
Using numpy.frombuffer or torch.frombuffer return the following error.
import numpy as np np.frombuffer(response.prediction) Traceback (most recent call last): File "<string>", line 1, in <module> ValueError: buffer size must be a multiple of element size np.frombuffer(response.prediction, dtype=np.float32) Traceback (most recent call last): File "<string>", line 1, in <module> ValueError: buffer size must be a multiple of element size
Using torch
import torch torch.frombuffer(response.prediction, dtype = torch.float32) Traceback (most recent call last): File "<string>", line 1, in <module> ValueError: buffer length (2601542 bytes) after offset (0 bytes) must be a multiple of element size (4)
Is there an alternative, more efficient solution of converting the received bytes into torch.Tensor?
Advertisement
Answer
One hack I’ve found that has significantly increased the performance while sending large tensors is to return a list of json.
In your handler’s postprocess function:
def postprocess(self, data):
output_data = {}
output_data['data'] = data.tolist()
return [output_data]
At the clients side when you receive the grpc response, decode it using json.loads
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
decoded_output = response.prediction.decode('utf-8')
preds = torch.as_tensor(json.loads(decoded_output))
preds should have the output tensor
Update:
There’s an even faster method and should completely solve the bottleneck. Use tf.io.serialize_tensor from tensorflow to serialize your tensor inside postprocess
def postprocess(self, data):
return [tf.io.serialize_tensor(data.cpu()).numpy()]
Decode it using tf.io.parse_tensor
response = self.inference_stub.Predictions(
inference_pb2.PredictionsRequest(model_name=model_name, input=input_data))
prediction = response.prediction
torch.as_tensor(tf.io.parse_tensor(prediction, out_type=tf.float32).numpy())