
Feature request: Modify text.regex_split_with_offsets() behavior to be in line with tf.strings.length() #1245

Open
briango28 opened this issue Jan 19, 2024 · 1 comment

Comments

@briango28

text.regex_split_with_offsets() currently returns begin and end as tf.int64 tensors whose values are byte offsets.

tf.strings.length(), on the other hand, returns a tf.int32 tensor that counts length in either bytes or UTF-8 characters depending on the value of its unit parameter.

So this would actually be two separate requests:

  1. Change the return types of text.regex_split_with_offsets() to tf.int32, removing the need for a cast when comparing with tf.strings.length(). I doubt there will be a use case for strings longer than INT32_MAX in the foreseeable future.
  2. Add a parameter unit: Literal["BYTE", "UTF8_CHAR"] = "BYTE" matching the behavior of tf.strings.length() and tf.strings.substr(). Seeing that the regular expressions are already interpreted as UTF-8, I think it would make sense to add a layer of abstraction to facilitate slicing by UTF-8 character index.
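For context, here is a minimal pure-Python sketch (no TensorFlow required) of why the BYTE vs UTF8_CHAR distinction matters: byte offsets and character offsets diverge as soon as the string contains a multi-byte UTF-8 character, which is exactly the mismatch a cast alone cannot fix.

```python
# Pure-Python illustration of the BYTE vs UTF8_CHAR distinction.
s = "héllo"  # 'é' occupies 2 bytes in UTF-8

print(len(s))                  # 5 characters (UTF8_CHAR unit)
print(len(s.encode("utf-8")))  # 6 bytes (BYTE unit)

# A match located in the encoded byte string yields a byte offset,
# which differs from the character offset past the first non-ASCII char.
byte_begin = s.encode("utf-8").find(b"llo")  # 3 (byte offset)
char_begin = s.find("llo")                   # 2 (character offset)
print(byte_begin, char_begin)
```

Under unit="BYTE" a splitter would report 3 for this match, while unit="UTF8_CHAR" would report 2, matching what tf.strings.substr(..., unit='UTF8_CHAR') expects.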
@briango28
Author

Follow-up

Having converted begin & end indices from BYTE to UTF8_CHAR with

offsets = tf.strings.unicode_decode_with_offsets(txt, 'UTF-8')[1]
begin = tf.map_fn(lambda indices: tf.where(tf.expand_dims(indices, 1) == offsets)[:, 1], begin)

where tf.strings.unicode_decode_with_offsets() itself returns offsets as tf.int64, I'm not so sure about no. 1 anymore :/
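The conversion above can be sketched in plain Python to show the underlying logic (a hedged sketch, not TensorFlow: the char_byte_offsets list stands in for the offsets returned by tf.strings.unicode_decode_with_offsets(), and the index lookup mirrors the tf.where equality comparison, one string at a time):

```python
def byte_to_char_offsets(text: str, byte_offsets: list) -> list:
    # Build the starting byte offset of each character, analogous to
    # the second return value of tf.strings.unicode_decode_with_offsets().
    char_byte_offsets = []
    pos = 0
    for ch in text:
        char_byte_offsets.append(pos)
        pos += len(ch.encode("utf-8"))
    # Map each byte offset to its character index, as the
    # tf.where(expand_dims(indices, 1) == offsets) comparison does.
    return [char_byte_offsets.index(b) for b in byte_offsets]

print(byte_to_char_offsets("héllo", [0, 3]))  # [0, 2]
```

This only works because every begin/end offset from the splitter lands on a character boundary; an offset inside a multi-byte character would have no matching entry, which is also true of the tf.where-based version.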
