Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deserialize-utf8-lossy feature to always deserialize using lossy UTF-8 conversion #1187

Closed
wants to merge 1 commit into from

Conversation

tyilo
Copy link

@tyilo tyilo commented Aug 20, 2024

Useful if you need to read from a collection created by a driver for another programming language.

See #799

@tyilo
Copy link
Author

tyilo commented Aug 20, 2024

Also consider making this the default to improve interoperability with other drivers.

@abr-egn abr-egn self-assigned this Aug 22, 2024
@abr-egn
Copy link
Contributor

abr-egn commented Aug 22, 2024

Thanks for the contribution! I don't think we'll want to make this the default to avoid breaking anyone relying on the existing validation, but it seems quite helpful as an opt-in feature.

If you're willing, could you share more information about the situation that motivates this? The bson spec says that strings are UTF-8, so at least in theory drivers shouldn't be writing values that require this.

@abr-egn abr-egn self-requested a review August 22, 2024 14:31
@tyilo
Copy link
Author

tyilo commented Aug 22, 2024

If you're willing, could you share more information about the situation that motivates this? The bson spec says that strings are UTF-8, so at least in theory drivers shouldn't be writing values that require this.

Yes, I agree that in a perfect world this would be rejected by all drivers and also the MongoDB server itself.

However, at least the Java driver doesn't always produce valid UTF-8 when writing to a collection.
See https://jira.mongodb.org/projects/JAVA/issues/JAVA-5575 which has been closed as "Won't Fix".

Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@isabelatkinson
Copy link
Contributor

isabelatkinson commented Sep 5, 2024

Hi @tyilo, apologies for the delay here! UTF-8 lossy deserialization is a concept that we'd like to keep contained to the bson library if possible. Would the wrapper type added in mongodb/bson-rust#497 work for you rather than exposing this feature flag?

Here's the basic idea of what using this type would look like:

struct HasLossyString {
    // this string might have invalid utf-8
    s: Utf8LossyDeserialization<String>,
}

let collection: Collection<HasLossyString> = client.database("db").collection("coll");
// documents with invalid utf-8 strings in field s will not error
let cursor = collection.find(filter).await?;

// both s1 and s2 might have invalid utf-8
struct HasLossyStrings {
    s1: String,
    s2: String,
}

let collection: Collection<Utf8LossyDeserialization<HasLossyStrings>> = client.database("db").collection("coll");
// documents with any invalid utf-8 string values will not error
let cursor = collection.find(filter).await?;

You could also use this type with Document/RawDocumentBuf.

@tyilo
Copy link
Author

tyilo commented Sep 6, 2024

@isabelatkinson Seems to work great.

@isabelatkinson
Copy link
Contributor

@tyilo great to hear. I just merged in the addition to the BSON library, and it will be included in that crate's next release. Going to close this out - thanks for bringing this issue to our attention!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants