How to correctly pass a split function to TextVectorization layer

Question

I'm defining a custom split callable for TextVectorization like this:

resulting in:

As seen above, the split function works correctly outside of the TextVectorization layer but fails when passed as a callable.

Accepted Answer

Your split_slash function does not seem to properly tokenize the phrases: each phrase ends up as a single token in the vocabulary.

    print(f"Vocabulary:\t\t{input_text_processor.get_vocabulary()}")

    '''
    Vocabulary:     ['',
                '[UNK]',
                'textthathasa',
                'lotofslashesinside',
                'fortestingpurposesfoo']
    '''

This is probably because your TextVectorization layer, with its default standardize="lower_and_strip_punctuation", strips all punctuation from your phrases, including /, before your split_slash function is called. Setting standardize=None in your TextVectorization layer will do the trick for you.

Alternatively, you could replace the slashes with spaces in a custom standardization function and rely on the default whitespace split:

    import tensorflow as tf

    def custom_standardization(input_data):
      # Turn every '/' into a space so the default whitespace split can tokenize.
      return tf.strings.regex_replace(input_data, '/', ' ')

    inputs = [
      "text/that/has/a",
      "lot/of/slashes/inside",
      "for/testing/purposes/foo"
    ]

    input_text_processor = tf.keras.layers.TextVectorization(
        max_tokens=13, standardize=custom_standardization)  # split=split_slash
    input_text_processor.adapt(inputs)

    print(f"Vocabulary:\t\t{input_text_processor.get_vocabulary()}")

    example_tokens = input_text_processor(inputs)
    print(example_tokens)

    # split_slash is the function defined in the question.
    for x in inputs:
      print(split_slash(x))

Note that your phrases are split on whitespace by default after the slashes have been replaced.

    '''
    Vocabulary:     ['', '[UNK]', 'that', 'text', 'testing', 'slashes', 'purposes', 'of', 'lot', 'inside', 'has', 'for', 'foo']
    tf.Tensor(
    [[ 3  2 10  1]
     [ 8  7  5  9]
     [11  4  6 12]], shape=(3, 4), dtype=int64)
    tf.Tensor([b'text' b'that' b'has' b'a'], shape=(4,), dtype=string)
    tf.Tensor([b'lot' b'of' b'slashes' b'inside'], shape=(4,), dtype=string)
    tf.Tensor([b'for' b'testing' b'purposes' b'foo'], shape=(4,), dtype=string)
    '''

For more information, check out the documentation.
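To make the order of operations concrete, here is a minimal plain-Python sketch of why the slashes vanish. It is only a model of "standardize first, then split", not TensorFlow's implementation. Every function name in it is hypothetical, and the punctuation stripping only approximates the layer's default.

```python
import string

# Plain-Python model of the preprocessing order: standardize, then split.
# Not TensorFlow code; all names here are hypothetical illustrations.

def default_standardize(text):
    """Approximates the default: lowercase, then strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def slash_to_space(text):
    """Mirrors custom_standardization from the answer: '/' becomes ' '."""
    return text.replace("/", " ")

def split_on_slash(text):
    return text.split("/")

def split_on_whitespace(text):
    return text.split()

def preprocess(text, standardize, split):
    """Apply standardize (if any) before split."""
    if standardize is not None:
        text = standardize(text)
    return split(text)

phrase = "text/that/has/a"

# Default standardization removes '/' before the split ever sees it.
broken = preprocess(phrase, default_standardize, split_on_slash)
# With no standardization, the slashes survive for the custom split.
fixed = preprocess(phrase, None, split_on_slash)
# Replacing '/' with ' ' lets a whitespace split do the work instead.
alternative = preprocess(phrase, slash_to_space, split_on_whitespace)

print(broken)       # ['textthathasa']
print(fixed)        # ['text', 'that', 'has', 'a']
print(alternative)  # ['text', 'that', 'has', 'a']
```

The first result matches the 'textthathasa' entry in the broken vocabulary above. The other two correspond to the standardize=None fix and the custom_standardization alternative.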

