Querying for a document and getting "invalid utf-8 sequence of 1 bytes from index 119" despite not even trying to deserialize anything #799

Closed
LukasDeco opened this issue Dec 31, 2022 · 8 comments

@LukasDeco

Versions/Environment

  1. What version of Rust are you using? - 1.65.0
  2. What operating system are you using? - WSL (Linux on windows)
  3. What versions of the driver and its dependencies are you using? (Run cargo pkgid mongodb & cargo pkgid bson) - [email protected] & [email protected]
  4. What version of MongoDB are you using? (Check with the MongoDB shell using db.version()) - 5.0.14
  5. What is your MongoDB topology (standalone, replica set, sharded cluster, serverless)? - Replica Set - 3 nodes

Describe the bug

I am attempting to query for a document and am getting an error about an "invalid utf-8 sequence". I've done a lot of googling on this error but no luck so far - nothing related to MongoDB. :(

The document is quite large, which might be a factor, but I'm able to query another similarly large document from the same collection without issue, so I don't think size alone explains it.

I have removed all the properties from the struct so I'm not trying to deserialize anything at this point, just get the document successfully - and I still get this error. 😢

My next move is to manually delete much of the data out of the document or query for a different document... but obviously none of this is ideal. I'd just like to get someone to point me in the right direction on what the cause of this error might be.

Also important to note: I use MongoDB App Services (formerly Realm, formerly Stitch), and I ran schema validation across these documents and everything passes.

Here's the code, but I don't think it helps much:

// manually setting the ID for the query
let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
let filter_document = doc! { "userId": id };
let find_one_options = FindOneOptions::builder().build();
let profile = self
    .profiles_repository
    .find_one::<Profile>(filter_document, find_one_options)
    .await;
// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    let collection = self
        .client
        .database("Occasionally")
        .collection::<T>("Profiles");

    collection.find_one(filter, find_one_options).await
}
// Profile Struct, commented out all the props :(
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct Profile {
    // pub other_id: Option<ObjectId>,
    // pub complex_prop: Vec<OtherStruct>,
    // pub user_id: Option<ObjectId>,
}

Any help is greatly appreciated and please let me know if I can provide any other information.

@LukasDeco
Author

I am wondering if disabling utf8 validation could help (https://www.mongodb.com/docs/drivers/node/current/fundamentals/utf8-validation/), but I don't see such an option on the Rust driver.

@bajanam bajanam removed the triage label Jan 3, 2023
@abr-egn
Contributor

abr-egn commented Jan 3, 2023

Unfortunately, Rust strings are required to be utf8, so we can't disable validation. We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring; I don't know if that would be helpful for your use case.

You could try loading the document as a RawBson value instead of your record type; that should defer deserialization of strings until you access that specific field.
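As a rough sketch of that approach (assuming driver 2.x; the database and collection names are taken from the snippet above, and the function name and the field mentioned in the comment are just placeholders):

use std::str::FromStr;

use mongodb::{
    bson::{doc, oid::ObjectId, RawDocumentBuf},
    error::Error,
    Client,
};

// Fetch the document as raw BSON so string fields are not decoded eagerly.
async fn find_one_raw(client: &Client) -> Result<Option<RawDocumentBuf>, Error> {
    let collection = client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");
    let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
    // No UTF-8 validation happens at this point; it is deferred until a
    // specific string field is accessed on the returned raw document.
    collection.find_one(doc! { "userId": id }, None).await
}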

It is certainly unexpected that an empty record type still causes the error - I'm looking into that! Is it possible for you to share the document that triggers this behavior?

@LukasDeco
Author

@abr-egn I've attached the document here. Fair warning: it is very large, ~12,000 lines. I also excluded a field that basically contains content very similar to what is in the "recommendationsByCategory" field but less specifically organized.

I'm hoping I don't have to use RawBson because it sounds like at some point there would still be an issue? I might try that anyway though.
Thanks so much for your help!
gift-profile-utf8-issue.txt

@abr-egn
Contributor

abr-egn commented Jan 4, 2023

Thanks! I've reproduced this issue with a minimal test case; I'm looking to see why we're eagerly deserializing here, and what mitigations are possible.

Note that the BSON spec does say that strings are utf8 (https://bsonspec.org/spec.html), so having string values that aren't is likely to cause issues in other places as well.
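For reference, a minimal reproduction along those lines might look something like this (the BSON bytes are hand-assembled for the example and the struct and field name are made up; the behavior matches what this thread describes - even an empty record type hits the validation error):

use serde::Deserialize;

// Empty record type, mirroring the stripped-down Profile struct.
#[derive(Debug, Deserialize)]
struct Empty {}

fn main() {
    // Hand-assembled BSON for { "name": <string containing the invalid byte 0xFF> }.
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&19i32.to_le_bytes()); // total document length
    bytes.push(0x02);                              // element type: string
    bytes.extend_from_slice(b"name\0");            // key
    bytes.extend_from_slice(&4i32.to_le_bytes());  // string length incl. trailing NUL
    bytes.extend_from_slice(b"a\xFFb\0");          // invalid UTF-8 payload
    bytes.push(0x00);                              // document terminator

    // Even though Empty declares no fields, serde still walks every element,
    // so the invalid string gets validated and this returns an error.
    let result = bson::from_slice::<Empty>(&bytes);
    assert!(result.is_err());
}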

@abr-egn
Contributor

abr-egn commented Jan 5, 2023

Unfortunately, it turns out that deserializing using Serde requires iterating over all of the fields of the incoming data, so avoiding the error by dropping fields isn't really feasible.

If you're okay with lossy decoding, you can use that by loading the data into a RawDocumentBuf and parsing your record type out of that with from_slice_utf8_lossy, e.g.

// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    // Fetch the raw BSON bytes rather than deserializing straight into T.
    let collection = self
        .client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");

    let found = collection.find_one(filter, find_one_options).await?;
    match found {
        None => Ok(None),
        Some(raw) => {
            // Lossy decode: invalid UTF-8 sequences become placeholder characters.
            let lossy = bson::from_slice_utf8_lossy(raw.as_bytes())?;
            Ok(Some(lossy))
        }
    }
}

Any strings in the returned value that were invalid utf8 will have the invalid sequences replaced with placeholder characters.
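(For reference, the placeholder is presumably the U+FFFD replacement character, the same substitution std's String::from_utf8_lossy makes:)

fn main() {
    // Illustrative only: std's lossy conversion replaces the invalid byte 0xFF.
    let fixed = String::from_utf8_lossy(b"a\xFFb");
    assert_eq!(fixed, "a\u{FFFD}b"); // invalid byte -> U+FFFD placeholder
}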

Can I ask how this data was inserted? If it was via the Rust driver, that points to another bug that we'll need to look into :)

@LukasDeco
Author

Okay thank you! I'm about to give this a shot...

@LukasDeco
Author

It works great! Thank you so much.

@clarkmcc

clarkmcc commented May 3, 2024

We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring.

Any chance support for this option out-of-the-box would be considered in the future?
