Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #55 : Refactor ISO code usage using Python langcodes #60

Merged
merged 5 commits into from
Jan 27, 2024

Conversation

shashank-iitbhu
Copy link
Contributor

Contributor checklist


Description

  • Refactor ISO Code Usage: Replaces the existing ISO code and language retrieval mechanism in Scribe with the langcodes library for improved language code handling. Refer to the functions get_language_iso and get_language_from_iso in src/scribe_data/utlis.py.
  • Incorporate langcodes and language_data: Adds langcodes and language_data as dependencies to environment.yaml file.
  • Uses LanguageTagError attribute of langcodes with try and except for error handling.
  • Tested these functions manually using the tests given in test_update_utils.py file.

Related issue

Copy link

github-actions bot commented Jan 16, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@shashank-iitbhu shashank-iitbhu changed the title fixes #55 fixes #55 : Refactor ISO code usage using Python langcodes Jan 16, 2024
@andrewtavis
Copy link
Member

Hey @shashank-iitbhu! Really happy to get this so quickly! Thanks so much :)

A quick note on this is that it looks like your GitHub email isn't set up properly. See the note in the maintainer checklist:

  • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo

Do you want to check that and maybe reopen this PR after? Big thing is that your account likely won't get credit for the commit as it is now.

Signed-off-by: Shashank Mittal <[email protected]>
@shashank-iitbhu
Copy link
Contributor Author

Hey @shashank-iitbhu! Really happy to get this so quickly! Thanks so much :)

A quick note on this is that it looks like your GitHub email isn't set up properly. See the note in the maintainer checklist:

  • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo

Do you want to check that and maybe reopen this PR after? Big thing is that your account likely won't get credit for the commit as it is now.

I have configured my email with this PR. Let me know if this works.

@andrewtavis
Copy link
Member

Looks great now @shashank-iitbhu! Thanks for the quick reaction :)

@andrewtavis andrewtavis self-requested a review January 16, 2024 23:30
@andrewtavis
Copy link
Member

@wkyoshida, do you have an idea on the tensorflow version error (just installed 2.11 for work earlier today). Before the version in the env file and requirements were different, but post fix in my commit it's still saying:

ERROR: Could not find a version that satisfies the requirement tensorflow>=2.11.0 (from versions: none)
ERROR: No matching distribution found for tensorflow>=2.11.0

Comment on lines +24 to +25
- langcodes>=3.0.0
- language_data>=1.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to get added in the requirements.txt too?
Thinking it might be the cause for this error (logs) in the CI runs on Ubuntu runners at least?

Copy link
Contributor Author

@shashank-iitbhu shashank-iitbhu Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, forgot to add them in requirements.txt, Let me push another commit.

@wkyoshida
Copy link
Member

@wkyoshida, do you have an idea on the tensorflow version error

Still not sure yet what the issue is.. 🤔
Briefly looked at suggestions for resolving tensorflow versions, but things like using a high enough pip version or 64-bit don't seem like they'd be the case.. the search goes on

Signed-off-by: Shashank Mittal <[email protected]>
…/Scribe-Data into langcodes-implementation merge
@andrewtavis
Copy link
Member

Still all are failing... 🤔

@shashank-iitbhu
Copy link
Contributor Author

Still all are failing... 🤔

Screenshot 2024-01-19 at 11 54 07 PM Can't we just upgrade the tensorflow version?

@andrewtavis
Copy link
Member

I'll give that a try, and if it doesn't work I'll try to remove it all together. We're not really using it for anything major anyway.

@wkyoshida
Copy link
Member

Still all are failing... 🤔

The failures in the Ubuntu runners (logs) appear now to be due to failing tests actually. I believe some of the tests might need updating given the new changes.

Could try running the pytest command locally to run the tests and see their outcome to determine which need fixing.

@shashank-iitbhu
Copy link
Contributor Author

Still all are failing... 🤔

The failures in the Ubuntu runners (logs) appear now to be due to failing tests actually. I believe some of the tests might need updating given the new changes.

Could try running the pytest command locally to run the tests and see their outcome to determine which need fixing.

The tests were failing because langcodes.find was returning a LookupError when provided with the string "gibberish" and langcodes.make was returning a custom string instead of a ValueError. With the latest commit, I have modified both the functions in src/scribe_data/utils.py to return ValueError. All the tests passed.

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the changes here, @shashank-iitbhu! Really appreciate the work here 😊 Hopefully you can make it to the sync today and we can discuss #57 (if that's still of interest). I'd be happy to set up the baseline docs this week and then you can take it from there!

@andrewtavis andrewtavis merged commit c660feb into scribe-org:main Jan 27, 2024
3 of 6 checks passed
@andrewtavis
Copy link
Member

FYI you two I decided to update the data download process for generating autosuggestions to remove Tensorflow as it looks like there will be problems from now on given M1 vs. not M1 macs. We're now at least not getting that error anymore, but now we're looking at:

ERROR: Could not build wheels for marisa-trie, which is required to install pyproject.toml-based projects
Successfully built PyICU sentencepiece
Failed to build marisa-trie

Looks like the last hurdle for a good Mac build though, so progress 😊

@shashank-iitbhu shashank-iitbhu deleted the langcodes-implementation branch January 27, 2024 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants