Is there any way to split a .tfrecords file into many .tfrecords files directly, without writing back each Dataset example?
Answer
You can use a function like this:
```python
import tensorflow as tf

def split_tfrecord(tfrecord_path, split_size):
    with tf.Graph().as_default(), tf.Session() as sess:
        # Each batch holds up to split_size serialized (unparsed) records.
        ds = tf.data.TFRecordDataset(tfrecord_path).batch(split_size)
        batch = ds.make_one_shot_iterator().get_next()
        part_num = 0
        while True:
            try:
                records = sess.run(batch)
                # Write each batch to its own numbered part file.
                part_path = tfrecord_path + '.{:03d}'.format(part_num)
                with tf.python_io.TFRecordWriter(part_path) as writer:
                    for record in records:
                        writer.write(record)
                part_num += 1
            # Raised once the input file is exhausted.
            except tf.errors.OutOfRangeError: break
```
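The function above relies on TensorFlow 1.x APIs (`tf.Session`, `make_one_shot_iterator`, `tf.python_io`). On TensorFlow 2.x, where eager execution is the default, the same idea could be sketched roughly as follows. The name `split_tfrecord_tf2` and the returned list of part paths are assumptions added for illustration, not part of the original answer.

```python
import tensorflow as tf


def split_tfrecord_tf2(tfrecord_path, split_size):
    """Split a TFRecord file into parts of up to split_size records each.

    Hypothetical TF 2.x variant; returns the list of part paths written.
    """
    # Batching groups the raw serialized records; nothing is parsed.
    ds = tf.data.TFRecordDataset(tfrecord_path).batch(split_size)
    part_paths = []
    for part_num, records in enumerate(ds):
        part_path = tfrecord_path + '.{:03d}'.format(part_num)
        with tf.io.TFRecordWriter(part_path) as writer:
            # records is a 1-D string tensor; .numpy() yields bytes objects.
            for record in records.numpy():
                writer.write(record)
        part_paths.append(part_path)
    return part_paths
```

It takes the same arguments as `split_tfrecord` and produces the same file naming scheme.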
For example, to split the file my_records.tfrecord into parts of 100 records each, you would do:
```python
split_tfrecord('my_records.tfrecord', 100)
```
This would create multiple smaller record files: my_records.tfrecord.000, my_records.tfrecord.001, etc.
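If the goal is to avoid going through TensorFlow at all, the file can also be split at the byte level. The TFRecord format stores each record as an 8-byte little-endian length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload, so whole records can be copied verbatim. The following is only a sketch: `split_tfrecord_raw` is a hypothetical helper, it assumes an uncompressed file, and it copies the checksums without verifying them.

```python
import struct


def split_tfrecord_raw(tfrecord_path, split_size):
    """Copy framed records into numbered part files without TensorFlow.

    Hypothetical helper. Assumes an uncompressed TFRecord file; checksums
    are copied as-is and not verified. Returns the part paths written.
    """
    part_paths = []
    out = None
    count = 0
    try:
        with open(tfrecord_path, 'rb') as src:
            while True:
                header = src.read(12)  # 8-byte length + 4-byte length CRC
                if not header:
                    break  # clean end of file
                if len(header) < 12:
                    raise ValueError('truncated record header')
                (length,) = struct.unpack('<Q', header[:8])
                body = src.read(length + 4)  # payload + 4-byte payload CRC
                if len(body) < length + 4:
                    raise ValueError('truncated record body')
                if out is None:
                    # Start the next part lazily, so no empty file is created.
                    part_path = '{}.{:03d}'.format(tfrecord_path, len(part_paths))
                    out = open(part_path, 'wb')
                    part_paths.append(part_path)
                out.write(header)
                out.write(body)
                count += 1
                if count == split_size:
                    out.close()
                    out = None
                    count = 0
    finally:
        if out is not None:
            out.close()
    return part_paths
```

Because the checksums are stored per record, each part should still be readable by `tf.data.TFRecordDataset`, and no example is ever decoded or re-serialized.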