Add a query Arabic adjectives and their plural and dual. #255

Merged
merged 11 commits into from
Oct 9, 2024

Conversation

OmarAI2003
Contributor

Contributor checklist


Create SPARQL Query for Arabic Adjectives

This Pull Request modifies the SPARQL query to include the dual form representation of adjectives in addition to their singular and plural forms.

Key Changes:

  • Created a new SPARQL query file to extract the following:
    • Adjective labels: The Arabic text for adjectives.
    • Plural forms: The plural representations of adjectives.
    • Dual forms: The dual representations of adjectives.
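A minimal sketch of what a query with these three outputs might look like, assembled in Python so it can be inspected before being sent to the Wikidata Query Service. This is not the query merged in this PR. The QIDs (Q13955 for Arabic, Q34698 for adjective, Q146786 for plural, Q110022 for dual) are assumptions that should be verified on Wikidata, and `build_adjective_query` is a hypothetical helper.

```python
ARABIC = "Q13955"     # assumed QID for the Arabic language
ADJECTIVE = "Q34698"  # assumed QID for the lexical category "adjective"
FORM_FEATURES = {"plural": "Q146786", "dual": "Q110022"}  # assumed QIDs


def build_adjective_query(features=FORM_FEATURES):
    """Return a SPARQL string selecting each adjective lemma plus optional forms."""
    select_vars = " ".join(f"?{name}" for name in features)
    optional_blocks = []
    for name, qid in features.items():
        # One OPTIONAL block per form, so lexemes missing a form are still returned.
        optional_blocks.append(
            "  OPTIONAL {\n"
            f"    ?lexeme ontolex:lexicalForm ?{name}Form .\n"
            f"    ?{name}Form ontolex:representation ?{name} ;\n"
            f"      wikibase:grammaticalFeature wd:{qid} .\n"
            "  }"
        )
    return (
        "SELECT\n"
        "  (REPLACE(STR(?lexeme), \"http://www.wikidata.org/entity/\", \"\") AS ?lexemeID)\n"
        f"  ?adjective {select_vars}\n"
        "WHERE {\n"
        f"  ?lexeme dct:language wd:{ARABIC} ;\n"
        f"    wikibase:lexicalCategory wd:{ADJECTIVE} ;\n"
        "    wikibase:lemma ?adjective .\n"
        + "\n".join(optional_blocks)
        + "\n}"
    )


print(build_adjective_query())
```

Because each form sits in its own `OPTIONAL` block, a lexeme with several plural forms would appear in several rows, which is the duplication discussed later in this thread.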

Linguistic Context:

In Arabic:

  • Plural and Dual Forms: In Arabic, adjectives can take plural forms, which, as with nouns, may be sound (formed by adding regular endings to the singular) or broken (formed by changing the internal structure of the word). Additionally, there is a distinct dual form for adjectives used to refer specifically to two items, typically formed by adding specific endings to the singular form.

This PR accurately reflects the grammatical nuances of Arabic adjectives, making it more useful for users querying this linguistic data.

- Modified the SPARQL query to extract the dual form of adjectives along with their labels and plural forms.
- Ensured that the output includes the lexeme ID, adjective, plural, and dual forms consistently in Arabic text.

github-actions bot commented Oct 5, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@OmarAI2003 OmarAI2003 changed the title Add query adjectives and their plural and dual. Add a query Arabic adjectives and their plural and dual. Oct 5, 2024
@andrewtavis
Member

Could you remove the extra comments and put MARK: comments for sections of the query as is done in other queries, @OmarAI2003? Aside from this, really nice 😊 Thanks so much!

@andrewtavis andrewtavis self-requested a review October 6, 2024 13:49
- Removed extra comments.
- Replaced with MARK: comments to match the style used in other queries.
@OmarAI2003
Contributor Author

Thank you for the feedback! 😊 I've updated the query to use MARK: comments as requested and removed the extra comments

@OmarAI2003
Contributor Author

Hi @andrewtavis

I've noticed some failing tests related to the addition of the "adjectives" data type in the list_data_types function. The expected output in the tests seems to be missing the "adjectives" entry, while the actual output includes it.

Could you please clarify whether the tests need to be updated to reflect the new data type, or should I investigate further to ensure everything is functioning correctly?

Thanks for your guidance!

@andrewtavis
Member

The tests should be updated, and that's coming in another PR :)

@andrewtavis
Member

Thanks @OmarAI2003!

@andrewtavis
Member

For this one as well, @OmarAI2003, would be great if we add enough properties so that each of the query results is unique :) Are there some forms that could be added or properties to the existing form so there's one form returned per property set?

@OmarAI2003
Contributor Author

Thanks, @andrewtavis, for your feedback on the PR.
I wanted to clarify your suggestion about ensuring that query results are unique.

In Arabic, diacritical marks play a crucial role in changing the meaning of words, allowing two words to have the same letters but different meanings. This can lead to situations where words may appear similar but are indeed distinct due to these marks.

Additionally, when it comes to plurals, there are different forms for masculine and feminine, which means we can have multiple entries that share the singular and dual forms but differ in their pluralization. Therefore, I believe these variations wouldn’t necessarily count as duplicates.

The main issue I see regarding potential duplication in this query might stem from the addition of the definite article (ال) to nouns, which can affect their uniqueness.

Could you please clarify if you believe there are actual duplicate rows in the current results? Your insight would really help me understand your perspective better!

@andrewtavis
Member

I do know a bit of Arabic from my bachelor's, @OmarAI2003, which is why I'm so happy you're working on it 😇

Big thing is that in the end we want just one row returned per lexeme ID. Let's add properties and forms such that each of the forms is returned in a unique way, so not just dual, but dual feminine, dual masculine, etc, and the same for plural and any other forms. If there's a definite version as well with ال at the start, then this can also be its own form :)
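One way to check whether a result set meets the one-row-per-lexeme goal is to count how often each ID appears. A minimal sketch, assuming the query results are already loaded as a list of dicts with a `lexemeID` key (the function name and the sample rows are hypothetical, not part of Scribe-Data):

```python
from collections import Counter


def duplicated_lexeme_ids(rows, id_key="lexemeID"):
    """Return the lexeme IDs that appear in more than one result row, sorted."""
    counts = Counter(row[id_key] for row in rows)
    return sorted(lexeme_id for lexeme_id, n in counts.items() if n > 1)


# Hypothetical result rows: L2 matches two plural forms, so it is returned twice.
rows = [
    {"lexemeID": "L1", "plural": "form-a"},
    {"lexemeID": "L2", "plural": "form-b"},
    {"lexemeID": "L2", "plural": "form-c"},
]
print(duplicated_lexeme_ids(rows))  # ['L2']
```

An empty list means every lexeme ID is returned exactly once. Any ID in the list points at a form whose grammatical features are not yet specific enough in the query.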

@OmarAI2003
Contributor Author

Thank you for your insights regarding the lexeme IDs.

I wanted to clarify whether having one row returned per lexeme ID is a strict requirement. For instance, with the ID L1131459 linked here, achieving this goal seems almost impossible. I have been experimenting with various approaches and adding numerous properties. For one query (simplified by ignoring certain forms such as context and pausal), I ended up combining all of the following properties, and I still encounter duplicates in my results:

  • Masculine, nominative case, singular, indefinite
  • Masculine, accusative case, singular, indefinite
  • Masculine, genitive case, singular, indefinite
  • Masculine, nominative case, singular, definite
  • Masculine, accusative case, singular, definite
  • Masculine, genitive case, singular, definite
  • Feminine, nominative case, singular, indefinite
  • Feminine, accusative case, singular, indefinite
  • Feminine, genitive case, singular, indefinite
  • Feminine, nominative case, singular, definite
  • Feminine, accusative case, singular, definite
  • Feminine, genitive case, singular, definite
  • Masculine, nominative case, plural, indefinite
  • Masculine, accusative case, plural, indefinite
  • Masculine, genitive case, plural, indefinite
  • Masculine, nominative case, plural, definite
  • Masculine, accusative case, plural, definite
  • Masculine, genitive case, plural, definite
  • Feminine, nominative case, plural, indefinite
  • Feminine, accusative case, plural, indefinite
  • Feminine, genitive case, plural, indefinite
  • Feminine, nominative case, plural, definite
  • Feminine, accusative case, plural, definite
  • Feminine, genitive case, plural, definite
  • Masculine, nominative case, dual, indefinite
  • Masculine, accusative case, dual, indefinite
  • Masculine, genitive case, dual, indefinite
  • Masculine, accusative case, dual, definite
  • Masculine, genitive case, dual, definite
  • Feminine, nominative case, dual, indefinite
  • Feminine, accusative case, dual, indefinite
  • Feminine, genitive case, dual, indefinite
  • Feminine, accusative case, dual, definite
  • Feminine, genitive case, dual, definite
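To give a sense of scale for this approach, the sketch below enumerates every combination of gender, case, number and definiteness and turns each into a column name. The naming scheme and helper names are assumptions for illustration only. The full grid has 2 × 3 × 3 × 2 = 36 combinations, each of which would need its own `OPTIONAL` block in the query.

```python
from itertools import product

GENDERS = ("masculine", "feminine")
CASES = ("nominative", "accusative", "genitive")
NUMBERS = ("singular", "plural", "dual")
DEFINITENESS = ("indefinite", "definite")


def column_name(gender, case, number, definiteness):
    """Build a camelCase column name, e.g. masculineNominativeSingularIndefinite."""
    parts = (gender, case, number, definiteness)
    return parts[0] + "".join(part.capitalize() for part in parts[1:])


def all_form_columns():
    """Return one column name per combination, grouped by number, then gender."""
    return [
        column_name(gender, case, number, definiteness)
        for number, gender, definiteness, case in product(
            NUMBERS, GENDERS, DEFINITENESS, CASES
        )
    ]


columns = all_form_columns()
print(len(columns))  # 36
print(columns[0])    # masculineNominativeSingularIndefinite
```

Generating the names this way keeps the columns unique by construction, but it does nothing about lexemes whose forms are not annotated with all four features on Wikidata.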

The worst part is that after adding all these forms, I get the fewest possible duplicates for this particular outlier ID, but the other IDs return nothing, because normal lexemes are not structured this way in Wikidata; they typically have at most 6 or 7 forms, like this one.

Should I disregard the duplicate requirements for this particular ID and focus on the more standard cases?

@OmarAI2003
Contributor Author

I apologize if this message seems too long or if I'm annoying you. Thank you for your understanding!

@andrewtavis
Member

Not annoying me at all! Maybe what we can do is bring in the current query with the seven forms and then in the future the modeling will be more consistent? What do you think?

@OmarAI2003
Contributor Author

I’m still excited to work on this and am open to any format you prefer to make the project more consistent and improve language data extraction. I could have added only a form for singular adjectives, as in other languages, to remove duplicates, but I’m all for whatever helps the project in the long run.

@andrewtavis
Member

As with the nouns, maybe it makes sense to model the full possible forms and just have empty responses for some :) Then hopefully the queries will work on the final modeled version of the data once that's made by the Wikidata community 😊

Let me know what you think, @OmarAI2003!

@andrewtavis andrewtavis added the hacktoberfest-accepted Accepted as a part of Hacktoberfest label Oct 7, 2024
@OmarAI2003
Contributor Author

OmarAI2003 commented Oct 8, 2024

As with the nouns, maybe it makes sense to model the full possible forms and just have empty responses for some :) Then hopefully the queries will work on the final modeled version of the data once that's made by the Wikidata community 😊

Let me know what you think, @OmarAI2003!

Adding all possible forms sounds reasonable, but after some trials, I found it technically challenging because Arabic forms and plurals are not standardized. Not every noun follows a predictable pattern (e.g., "مذكر", i.e., "masculine", won’t have a feminine plural form), and irregular or broken plurals add more complexity, so we might end up with something like 100 columns, which is computationally insane.

To address this, I created a query that aggregates the grammatical features for each lexeme and outputs the available information. This ensures that the query returns meaningful data even if the expected forms are incomplete or missing.
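The aggregated query itself is not shown in this thread. In SPARQL this kind of result is usually produced with `GROUP BY` and `GROUP_CONCAT`. The sketch below shows the same idea as a post-processing step in Python, assuming flat rows with `lexemeID`, `form` and `feature` keys (the key names, the function name and the sample rows are all hypothetical):

```python
from collections import defaultdict


def aggregate_features(rows):
    """Collapse (lexemeID, form, feature) rows into one entry per lexeme.

    Returns {lexemeID: {form: "feature1, feature2"}} with features sorted.
    """
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        grouped[row["lexemeID"]][row["form"]].add(row["feature"])
    return {
        lexeme_id: {
            form: ", ".join(sorted(features)) for form, features in forms.items()
        }
        for lexeme_id, forms in grouped.items()
    }


# Hypothetical rows: one lexeme with two forms, one of which has two features.
rows = [
    {"lexemeID": "L1", "form": "form-a", "feature": "plural"},
    {"lexemeID": "L1", "form": "form-a", "feature": "masculine"},
    {"lexemeID": "L1", "form": "form-b", "feature": "dual"},
]
print(aggregate_features(rows))
```

The output lists whatever features each form actually carries, so a lexeme with only a handful of annotated forms still produces a useful entry instead of dozens of empty columns.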

Let me know your thoughts!

Member

@andrewtavis andrewtavis left a comment

I think that this will be enough for now, @OmarAI2003, as the amount of data isn't so much that we need to be too exact. The most important thing is that we add enough forms for unique results. Maybe doing a separate query for context form at some point could be interesting, but let's not worry about it for now. Same for definite, though what we're getting here is just the definite article prepended :)

@andrewtavis andrewtavis merged commit cae689e into scribe-org:main Oct 9, 2024
3 checks passed
Labels
hacktoberfest-accepted Accepted as a part of Hacktoberfest
2 participants