I use tf.data.TextLineDataset to read 4 large files, and I use tf.data.Dataset.zip to zip these 4 files and create "dataset". However, I cannot pass "dataset" to dataset.map to use tf.compat.v1.string_split and split with the \t separator, and then batch, prefetch, and finally feed it into my model.
This is my code:
d1 = tf.data.TextLineDataset("File1.raw")
d2 = tf.data.TextLineDataset("File2.raw")
d3 = tf.data.TextLineDataset("File3.raw")
d4 = tf.data.TextLineDataset("File4.raw")
dataset = tf.data.Dataset.zip((d1, d2, d3, d4))
dataset = dataset.map(lambda string: tf.compat.v1.string_split([string], sep='\t').values)
This is the error message:
packages/tensorflow/python/autograph/impl/api.py", line 339, in _call_unconverted
    return f(*args, **kwargs)
TypeError: <lambda>() takes 1 positional argument but 4 were given
What should I do?
Answer
The tf.data.Dataset.zip function iterates over an arbitrary number of dataset objects at the same time. In other words, if you zip four datasets, you will get four items at each iteration (one from each dataset). This also explains the error the OP received:
TypeError: <lambda>() takes 1 positional argument but 4 were given
The function being mapped needs to be able to handle four arguments, because it is being applied to a zip of four datasets. The code below includes a function that takes four arguments (one per dataset) and splits each of them by \t. You can map this over the zipped dataset. I substituted the tf.data.TextLineDataset objects with sample datasets.
import tensorflow as tf

# Sample datasets standing in for the TextLineDatasets;
# each element is a tab-separated string.
d1 = tf.data.Dataset.from_tensors(["foo\t1"])
d2 = tf.data.Dataset.from_tensors(["foo\t2"])
d3 = tf.data.Dataset.from_tensors(["foo\t3"])
d4 = tf.data.Dataset.from_tensors(["foo\t4"])

def split_by_tab(text1, text2, text3, text4):
    # One argument per zipped dataset; split each line on the tab character.
    sep = "\t"
    return (
        tf.strings.split(text1, sep=sep),
        tf.strings.split(text2, sep=sep),
        tf.strings.split(text3, sep=sep),
        tf.strings.split(text4, sep=sep),
    )

dataset = tf.data.Dataset.zip((d1, d2, d3, d4))
dataset = dataset.map(split_by_tab)
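Since the question also asks about batching and prefetching, a minimal continuation might look like this (the batch size is an arbitrary example value):

# tf.strings.split yields RaggedTensors; Dataset.batch supports ragged
# components in TF 2.3+. On older TF 2.x, use tf.data.experimental.AUTOTUNE.
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

for batch in dataset.take(1):
    print(batch)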
As an alternative, I could merge these files and create one very large file, and then shuffle, batch and prefetch rows from it. Right? Any other solution?
The files could be merged, but if they are large, it’s probably not worth doing. I did not realize that the features were split across multiple files. In this case, zipping is a reasonable thing to do.
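For completeness, if each row were a self-contained example rather than one quarter of a feature set, a single pipeline over all four files could look roughly like this sketch (buffer and batch sizes are placeholder choices):

# tf.data.TextLineDataset accepts a list of files and reads them in sequence,
# so no merged file is needed in that scenario.
files = ["File1.raw", "File2.raw", "File3.raw", "File4.raw"]
merged = tf.data.TextLineDataset(files)
merged = merged.map(lambda line: tf.strings.split(line, sep="\t"))
merged = merged.shuffle(buffer_size=10_000).batch(32).prefetch(tf.data.AUTOTUNE)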
There is also a library, tensorflow_text, that may be relevant to this question. It might be worth checking out.
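As a rough illustration of what that library offers, here is a small sketch using its whitespace tokenizer. Note this is not a drop-in replacement for tab splitting, since it splits on any whitespace:

import tensorflow_text as tf_text  # pip install tensorflow-text

# WhitespaceTokenizer splits on spaces, tabs and newlines, so it only
# matches string_split(sep='\t') when the fields themselves contain no spaces.
tokenizer = tf_text.WhitespaceTokenizer()
tokens = tokenizer.tokenize(["foo\tbar\tbaz"])
print(tokens)  # <tf.RaggedTensor [[b'foo', b'bar', b'baz']]>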