I’m defining a custom split callable for TextVectorization like this:
```python
import tensorflow as tf
from tensorflow import keras

@tf.function
def split_slash(input_str):
    return tf.strings.split(input_str, sep="/")

inputs = ["text/that/has/a", "lot/of/slashes/inside", "for/testing/purposes/foo"]

input_text_processor = keras.layers.TextVectorization(max_tokens=13, split=split_slash)
input_text_processor.adapt(inputs)
example_tokens = input_text_processor(inputs)
print(example_tokens)

for x in inputs:
    print(split_slash(x))
```
resulting in:
```
tf.Tensor(
[[2]
 [3]
 [4]], shape=(3, 1), dtype=int64)
tf.Tensor([b'text' b'that' b'has' b'a'], shape=(4,), dtype=string)
tf.Tensor([b'lot' b'of' b'slashes' b'inside'], shape=(4,), dtype=string)
tf.Tensor([b'for' b'testing' b'purposes' b'foo'], shape=(4,), dtype=string)
```
As seen above, the split function works correctly outside of the TextVectorization layer but fails when passed as a callable.
Answer
Your `split_slash` function does not seem to properly tokenize the phrases.
```python
print(f"Vocabulary:\t\t{input_text_processor.get_vocabulary()}")
'''
Vocabulary: ['', '[UNK]', 'textthathasa', 'lotofslashesinside', 'fortestingpurposesfoo']
'''
```
This is probably because your `TextVectorization` layer strips your phrases of all punctuation, including `/`, by default before your `split_slash` function is called. Setting `standardize=None` on your `TextVectorization` layer will do the trick for you.
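For completeness, here is a minimal sketch of that fix, reusing the `split_slash` function from the question; the only change relative to your original code is the `standardize=None` argument:

```python
import tensorflow as tf
from tensorflow import keras

def split_slash(input_str):
    # Split each phrase on '/' instead of whitespace.
    return tf.strings.split(input_str, sep="/")

inputs = ["text/that/has/a", "lot/of/slashes/inside", "for/testing/purposes/foo"]

# standardize=None disables the default lowercasing/punctuation stripping,
# so the '/' characters survive long enough for split_slash to see them.
input_text_processor = keras.layers.TextVectorization(
    max_tokens=13,
    standardize=None,
    split=split_slash,
)
input_text_processor.adapt(inputs)
print(input_text_processor.get_vocabulary())
print(input_text_processor(inputs))
```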
Alternatively, you could also try the following snippet.
```python
import tensorflow as tf

# Defined in the question; included here so the snippet is self-contained.
def split_slash(input_str):
    return tf.strings.split(input_str, sep="/")

def custom_standardization(input_data):
    # Replace slashes with spaces so the default whitespace split applies.
    return tf.strings.regex_replace(input_data, '/', ' ')

inputs = ["text/that/has/a", "lot/of/slashes/inside", "for/testing/purposes/foo"]

input_text_processor = tf.keras.layers.TextVectorization(max_tokens=13, standardize=custom_standardization)  # split=split_slash no longer needed

input_text_processor.adapt(inputs)
print(f"Vocabulary:\t\t{input_text_processor.get_vocabulary()}")
example_tokens = input_text_processor(inputs)
print(example_tokens)

for x in inputs:
    print(split_slash(x))
```
Note that your phrases are split on whitespace by default after removing your slashes.
```
'''
Vocabulary: ['', '[UNK]', 'that', 'text', 'testing', 'slashes', 'purposes', 'of', 'lot', 'inside', 'has', 'for', 'foo']
tf.Tensor(
[[ 3  2 10  1]
 [ 8  7  5  9]
 [11  4  6 12]], shape=(3, 4), dtype=int64)
tf.Tensor([b'text' b'that' b'has' b'a'], shape=(4,), dtype=string)
tf.Tensor([b'lot' b'of' b'slashes' b'inside'], shape=(4,), dtype=string)
tf.Tensor([b'for' b'testing' b'purposes' b'foo'], shape=(4,), dtype=string)
'''
```
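If you want to see the default behaviour in isolation, a small check along these lines (a sketch, not from the original answer) makes the problem visible: with the stock standardizer, the slashes are stripped before any splitting happens, so each phrase collapses into a single token.

```python
import tensorflow as tf

# Default arguments: standardize='lower_and_strip_punctuation', split='whitespace'.
layer = tf.keras.layers.TextVectorization()
layer.adapt(["text/that/has/a"])

# Expect something like ['', '[UNK]', 'textthathasa']: the '/' characters are
# removed during standardization, so the whitespace split finds a single token.
print(layer.get_vocabulary())
```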
For more information, check out the `TextVectorization` documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization