-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating mnemonic language determination #142
Updating mnemonic language determination #142
Conversation
Every test failed in the same manner: This happens because some mnemonics, Prior to this change the resulting word languages would have only contained Chinese Traditional as all instances of Chinese Simplified would have been overwritten. With this change both are now returned and both result in a valid mnemonic which then raises the error. I'm not sure how we want to handle this. Though this is a little spooky, I don't believe it is possible for the underlying private key to differ if there is duplication between languages but having that confirmed by someone more knowledgeable than I would be preferred. |
Do we know that this failure case is possible? I want to be sure we are solving something that can actually happen. Did the original report give us a sample mnemonic that causes a failure, that is, two different (but valid) keys in Simplified and Traditional? |
I'm trying to wrap my head around this. In the This would leave the
What do you think @valefar-on-discord ? Is this a good picture of the whole situation? Am I missing something here? |
It would still make sense to fix the |
I just noticed that this would also apply to the following additional commands, and not just
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I think we can fix the failing unit tests with assuming that the validatorerror will be raised, ie expecting it to be raised. As a better overall fix for this whole issue, I think we should create another PR with one of the solutions in #142 (comment)
I talked to a cryptographic expert and they let me know that the mnemonic word is used as a lookup to determine the index in the word list. As long as the matching words have the identical index in the corresponding word list, the underlying private key and cryptographic results will be identical. I did a comparison of Chinese simplified/traditional and, though there are 1275 shared words, each one are of the same index. My guess is this is a non-issue and the creation of these word lists has this consideration in mind. That's not to say we should be complicit. I am a fan of Remy's prompt solution and it's not that big of a lift. |
Does prompting improve UX? It may be simpler to just make an assumption either way, if the resulting keys are the same anyway. |
Here’s what we have from the reporting user. Let’s see how this comes out in simplified and traditional Chinese. If it’s valid in both and conflicts, we need to prompt. If it’s valid in only one but our current code guesses the wrong language, we want to fix the guessing algorithm.
|
Is the user trying to recover the private key with another tool other than the deposit-cli? Because with For mnemonic like: 的 的 的 的 的 的 的 的 的 的 的 在 where the mnemonic is valid for both languages CHS and CHT (edit: I assume it is valid on CHT too, haven't tested this as deposit-cli can't recover CHT), the resulting private key should be the same. I remember it is something like: 12 words --> 12 indices (12 numbers from 0-2047) --> using these 12 numbers to generate the keys As @valefar-on-discord mentioned, each word corresponds to an index in the word list. I have also checked and can confirm that when CHS and CHT are the same word, the indices are the same. Therefore, the resulting key should be the same, because they essentially have the same 12 indices/numbers. If there is some tools to verify this, then we can make an assumption either way as @yorickdowne says. Curious what tools did @anupsv used that encounter this issue? |
I’d like to verify that this mnemonic works with deposit cli as is and generates the same secret key. We know it should; let me see it does. If it does, then @valefar-on-discord and @remyroy , I am not convinced we should change the existing logic. Unless we risk detecting the wrong language in some cases, as in, a language that doesn’t contain all words. For example, if we have a mnemonic that is all Spanish, and matches Portuguese/Spanish only on the last word, would it correctly use Spanish? Or the other way around … you get the drift. |
@chong-he The prompt you are seeing is for the language used for the tool's prompts and outputs. This allows for 12 languages that are partially translated. When generating a mnemonic, you are given the mnemonic language option of @yorickdowne In your example, because multiple languages are determined for the list of words, each language will be attempted and ultimately output:
Based on how this logic currently exists, Portuguese will not overwrite Spanish but vice-versa, so lets say you generated a Portuguese mnemonic and a single word existed in Spanish: Your word map would be 23 entries containing Portuguese and 1 of Spanish. Two languages will be determined and both iterated upon. Portuguese would result in a valid mnemonic but Spanish would not and the Portuguese full word mnemonic would be returned. I ran a script to check the indices of each word across each language to see if there are any that contain mis-matched indices and there are specifically between English and French:
Given the amount of duplicates I think this is an issue we want to account for. |
Thanks for doing that work! Ok, so we could have a valid mnemonic in either English or French, but obviously only one language gives the key the user wants. In that case yes: If the mnemonic is valid in multiple languages - we can check for that first - then we should prompt. If the words exist in multiple languages but only one language gives us a mnemonic with correct checksum, we want to silently choose the language that has a working mnemonic. Better UX that way. How’s that sound? |
Edit to remove the quote as it's too long Thanks for the explanation @valefar-on-discord ! Now I get that the As from the above, it seems now that CHS and CHT is not an issue already, right? Rather, it is now English or French issue. And Yorick's proposal sounds great - only prompt when detecting valid mnemonic |
To expand this a bit more further on what I tried, the for loop implementations could be more optimized and I had implemented that. But the test vectors started to fail and that's when i noticed the issue. WIth the current way of how the dict mapping is generated between the word and language, there are many cases where keys are overwritten when collisions happened which I encountered first with CHT and CHS primarily because that is the first test case in the mnemonic test vectors. |
Yeah that's exactly what I am seeing as well and the problem is substantially more concerning with English/French. I think we have an agreed path forward on how to resolve this though it is going to be a bit of a lift. Great catch and thanks for reporting it! |
Perfect! Happy to help as well since i've been looking at this for a bit now. If someone is working on it already, thats cool too. |
ea8a86f
to
99b0473
Compare
Test Coverage: Download HTML Report
|
Test Coverage: Download HTML Report
|
Added the ability to prompt the user to select a language if multiple are valid for the same mnemonic. I tried to manipulate the existing mnemonic logic as little as I could but this is a solution. I'm not 100% thrilled with it so would love some feedback on how this can be improved. If we can get to a place where we are happy with the solution I'll wrap it in some unit tests. But if you run I haven't done a thorough round of testing so might be some edge cases |
Test Coverage: Download HTML Report
|
Everything removed in A solution is good enough for me, tbh. I am not sure we are getting to "a very elegant / beautiful" solution - it's a somewhat messy problem by its nature. |
@valefar-on-discord , we should probably honor |
|
I am thinking about --non_interactive scripting, where a prompt is undesirable. We may want the --mnemonic_language here as an optional parameter, because we can’t always reliably detect the language. |
Did some testing about CHS and CHT mnemonic:
For case 1, I think the language prompt is fine solution, it will only prompts when the mnemonic is valid for >1 language (which I think is not often the case) |
Thanks for helping us test, @chong-he ! We cannot reliably detect the language when:
For CHS and CHT the indices are the same, so it might be the same key regardless and arguably we don’t need to prompt. But then, why did we get a report - I am cautious whether it’s really the same key. For English and French the indices are not the same. This should mean that again only one version is correct, as the other one will fail checksum. But, do we know that for certain? Can I ask, in your testing: You are saying the pub key for CHS is different from the one in CHT, for the same mnemonic? If so, that’s a clear conflict where we need to prompt, exactly what the person originally reporting this was concerned about. |
I never think of If we want to add the |
That might be cleaner code-wise but isn’t it better UX to prompt if we’re interactive? Overall the promise is that new-mnemonic and existing-mnemonic will work without any additional arguments and prompt for everything they need. |
962d607
to
5e36c1b
Compare
I agree it is a better UX but I question the benefit with the additional code complexity for a use case that, as far as I can tell, is incredibly unlikely to trigger. I am still not sure if there is such a mnemonic with english and french that would result in different cryptographic values as the underlying mathematics behind it are beyond me. I created an alternative version that is very similar to this PR but doesn't prioritize UX: a35d5fd But if we do want to maximize UX then I can easily add the |
Test Coverage: Download HTML Report
|
Example: Each word is the same, except that CHS that have their corresponding CHT characters are changed to the CHT characters Both mnemonic are valid, but they should be considered different? Because the resulting public keys are different. The original report was, there is a bug with CHS and CHT where when CHS and CHT are having the same characters, the CHS one is being overwritten So I am thinking, it should be the case that, when a mnemonic that have the exact same words and valid for 2 or more languages, such as: Selecting CHS or CHT during key generation/recovery should result in different keys. I think that's what is meant by the collision mentioned in the original report. Because if the resulting keys are the same, there is no need to prompt/to have an additional flag (except for the possibility that the keys are different (e.g., English and French), even though the probability of this happening can be very low). For now, the public keys generated by ``的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 的 性` when selecting CHS and CHT during key recovery are the same. Should they be different in the first place? Because they are the same |
5e36c1b
to
55922b6
Compare
55922b6
to
d882e14
Compare
Test Coverage: Download HTML Report
|
Test Coverage: Download HTML Report
|
Test Coverage: Download HTML Report
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would do some minor changes before merging this. Let me know what you think.
Test Coverage: Download HTML Report
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything is fine. I'm merging.
* Updating mnemonic language determination * Removing duplicate mnemonic values * Adding test for multi-language mnemonics * Prompt user for language selection on MultiLanguageError * Adding optional mnemonic_language argument * Adding tests for determine_mnemonic_language * Resolving UX concerns
Fixes Closes Resolves #119
Changes
Updating the language determination of mnemonics to allow a list of languages per word
Types of changes
What types of changes does your code introduce?
Testing
Requires testing
If yes, did you write tests?
Notes on testing
Creating mnemonics with Chinese Simplified or Chinese Traditional can easily reproduce the issue with a set_trace in the
determine_mnemonic_language
logicDocumentation
Requires documentation update
Requires explanation in Release Notes
Remarks
The issue here is the word language map only allows for a single language per word. Because there is so much overlap between Chinese Simplified and Chinese Traditional, you can run into a situation where, after generating a mnemonic in Chinese Simplified, your word map could look like this:
But even in this situation, the logic will go through both discovered languages and provide the correct mnemonic.
The failure case would be if you generate a mnemonic with Chinese Simplified, but each word is also in Chinese Traditional and results in an invalid mnemonic.
If that scenario is even possible in the bounds of how mnemonics work then it would have to be from a mnemonic created outside this tool as there is a verification step in the
new-mnemonic
flow.Either way, the logic has been updated to allow for multiple languages per word resulting in this for the same test mnemonic: