Querying for a document and getting "invalid utf-8 sequence of 1 bytes from index 119" despite not even trying to deserialize anything #799

Closed
LukasDeco opened this issue Dec 31, 2022 · 8 comments

@LukasDeco

Versions/Environment

  1. What version of Rust are you using? - 1.65.0
  2. What operating system are you using? - WSL (Linux on windows)
  3. What versions of the driver and its dependencies are you using? (Run cargo pkgid mongodb & cargo pkgid bson) - [email protected] & [email protected]
  4. What version of MongoDB are you using? (Check with the MongoDB shell using db.version()) - 5.0.14
  5. What is your MongoDB topology (standalone, replica set, sharded cluster, serverless)? - Replica Set - 3 nodes

Describe the bug

I am attempting to query for a document and am getting an error about an "invalid utf-8 sequence". I've done a lot of googling on this error but no luck so far - nothing related to MongoDB. :(

The document is quite large, which might be a factor, but I'm able to query another similarly large document from the same collection without issue, so I don't think size alone explains it.

I have removed all the properties from the struct so I'm not trying to deserialize anything at this point, just get the document successfully - and I still get this error. 😢

My next move is to manually delete much of the data out of the document or query for a different document... but obviously none of this is ideal. I'd just like to get someone to point me in the right direction on what the cause of this error might be.

Also important to note: I use MongoDB App Services (formerly Realm, formerly Stitch), and I ran schema validation across these documents and everything passes.

Here's the code, but I don't think it helps much:

// manually setting the ID for the query
let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
let filter_document = doc! { "userId": id };
let find_one_options = FindOneOptions::builder().build();
let profile = self
    .profiles_repository
    .find_one::<Profile>(filter_document, find_one_options)
    .await;
// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    let collection = self
        .client
        .database("Occasionally")
        .collection::<T>("Profiles");

    collection.find_one(filter, find_one_options).await
}
// Profile Struct, commented out all the props :(
#[derive(Default, Debug, Clone, PartialEq, Serialize, Deserialize)]
#[serde(rename_all = "camelCase")]
pub struct Profile {
    // pub other_id: Option<ObjectId>,
    // pub complex_prop: Vec<OtherStruct>,
    // pub user_id: Option<ObjectId>,
}

Any help is greatly appreciated and please let me know if I can provide any other information.

@LukasDeco
Author

I am wondering if disabling utf8 validation could help (https://www.mongodb.com/docs/drivers/node/current/fundamentals/utf8-validation/), but I don't see such an option on the Rust driver.

@bajanam bajanam removed the triage label Jan 3, 2023
@abr-egn
Contributor

abr-egn commented Jan 3, 2023

Unfortunately, Rust strings are required to be utf8, so we can't disable validation. We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring; I don't know if that would be helpful for your use case.

You could try loading the document as a RawBson value instead of your record type; that should defer deserialization of strings until you access that specific field.
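As a rough sketch of that approach (assuming driver 2.x; the database and collection names are taken from the snippet above, and the function name and the field mentioned in the comment are just placeholders):

use std::str::FromStr;

use mongodb::{
    bson::{doc, oid::ObjectId, RawDocumentBuf},
    error::Error,
    Client,
};

// Fetch the document as raw BSON so string fields are not decoded eagerly.
async fn find_one_raw(client: &Client) -> Result<Option<RawDocumentBuf>, Error> {
    let collection = client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");
    let id = ObjectId::from_str("6116dc1633616dc8924e1050").unwrap();
    // No UTF-8 validation happens at this point; it is deferred until a
    // specific string field is accessed on the returned raw document.
    collection.find_one(doc! { "userId": id }, None).await
}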

It is certainly unexpected that an empty record type still causes the error - I'm looking into that! Is it possible for you to share the document that triggers this behavior?

@LukasDeco
Author

@abr-egn I've attached the document here. Fair warning: it is very large, ~12,000 lines. I also excluded a field that basically contains content very similar to what is in the "recommendationsByCategory" field but less specifically organized.

I'm hoping I don't have to use RawBson because it sounds like at some point there would still be an issue? I might try that anyway though.
Thanks so much for your help!
gift-profile-utf8-issue.txt

@abr-egn
Contributor

abr-egn commented Jan 4, 2023

Thanks! I've reproduced this issue with a minimal test case; I'm looking to see why we're eagerly deserializing here, and what mitigations are possible.

Note that the BSON spec does say that strings are utf8 (https://bsonspec.org/spec.html), so having string values that aren't is likely to cause issues in other places as well.
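For reference, a minimal reproduction along those lines might look something like this (the BSON bytes are hand-assembled for the example and the struct and field name are made up; the behavior matches what this thread describes - even an empty record type hits the validation error):

use serde::Deserialize;

// Empty record type, mirroring the stripped-down Profile struct.
#[derive(Debug, Deserialize)]
struct Empty {}

fn main() {
    // Hand-assembled BSON for { "name": <string containing the invalid byte 0xFF> }.
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&19i32.to_le_bytes()); // total document length
    bytes.push(0x02);                              // element type: string
    bytes.extend_from_slice(b"name\0");            // key
    bytes.extend_from_slice(&4i32.to_le_bytes());  // string length incl. trailing NUL
    bytes.extend_from_slice(b"a\xFFb\0");          // invalid UTF-8 payload
    bytes.push(0x00);                              // document terminator

    // Even though Empty declares no fields, serde still walks every element,
    // so the invalid string gets validated and this returns an error.
    let result = bson::from_slice::<Empty>(&bytes);
    assert!(result.is_err());
}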

@abr-egn
Contributor

abr-egn commented Jan 5, 2023

Unfortunately, it turns out that deserializing using Serde requires iterating over all of the fields of the incoming data, so avoiding the error by dropping fields isn't really feasible.

If you're okay with lossy decoding, you can use that by loading the data into a RawDocumentBuf and parsing your record type out of that with from_slice_utf8_lossy, e.g.

// inside find_one
async fn find_one<'b, T: DeserializeOwned + Sync + Send + Unpin>(
    &self,
    filter: Document,
    find_one_options: FindOneOptions,
) -> Result<Option<T>, Error> {
    // Fetch the raw BSON bytes rather than deserializing straight into T.
    let collection = self
        .client
        .database("Occasionally")
        .collection::<RawDocumentBuf>("Profiles");

    let found = collection.find_one(filter, find_one_options).await?;
    match found {
        None => Ok(None),
        Some(raw) => {
            // Lossy decode: invalid UTF-8 sequences become placeholder characters.
            let lossy = bson::from_slice_utf8_lossy(raw.as_bytes())?;
            Ok(Some(lossy))
        }
    }
}

Any strings in the returned value that were invalid utf8 will have the invalid sequences replaced with placeholder characters.
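(For reference, the placeholder is presumably the U+FFFD replacement character, the same substitution std's String::from_utf8_lossy makes:)

fn main() {
    // Illustrative only: std's lossy conversion replaces the invalid byte 0xFF.
    let fixed = String::from_utf8_lossy(b"a\xFFb");
    assert_eq!(fixed, "a\u{FFFD}b"); // invalid byte -> U+FFFD placeholder
}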

Can I ask how this data was inserted? If it was via the Rust driver, that points to another bug that we'll need to look into :)

@LukasDeco
Author

Okay thank you! I'm about to give this a shot...

@LukasDeco
Author

It works great! Thank you so much.

@clarkmcc

clarkmcc commented May 3, 2024

We have considered providing a way to opt in to lossy validation so that invalid sequences are replaced with placeholders rather than erroring.

Any chance support for this option out-of-the-box would be considered in the future?
