Add a query Arabic adjectives and their plural and dual. #255
…nguage extraction
- Modified the SPARQL query to extract the dual form of adjectives along with their labels and plural forms. - Ensured that the output includes the lexeme ID, adjective, plural, and dual forms consistently in Arabic text.
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!
Maintainer checklist
Could you remove the extra comments and put |
- Removed extra comments. - Replaced with MARK: comments to match the style used in other queries.
Thank you for the feedback! 😊 I've updated the query to use |
Hi @andrewtavis I've noticed some failing tests related to the addition of the "adjectives" data type in the Could you please clarify whether the tests need to be updated to reflect the new data type, or should I investigate further to ensure everything is functioning correctly? Thanks for your guidance! |
The tests should be updated, and that's coming in another PR :) |
Thanks @OmarAI2003! |
For this one as well, @OmarAI2003, would be great if we add enough properties so that each of the query results is unique :) Are there some forms that could be added or properties to the existing form so there's one form returned per property set? |
Thanks, @andrewtavis, for your feedback on the PR! In Arabic, diacritical marks play a crucial role in changing the meaning of words, so two words can share the same letters yet have different meanings. Words may therefore look alike but be distinct because of these marks. Additionally, plurals have different masculine and feminine forms, so multiple entries can share the singular and dual forms but differ in their plural. For these reasons, I believe such variations wouldn't necessarily count as duplicates. The main source of potential duplication I see in this query is the definite article (ال) added to nouns, which can affect their uniqueness. Could you please clarify whether you believe there are actual duplicate rows in the current results? Your insight would really help me understand your perspective better!
I do know a bit of Arabic from my bachelor's, @OmarAI2003, which is why I'm so happy you're working on it 😇 Big thing is that in the end we want just one row returned per lexeme ID. Let's add properties and forms such that each of the forms is returned in a unique way, so not just dual, but dual feminine, dual masculine, etc, and the same for plural and any other forms. If there's a definite version as well with |
Thank you for your insights regarding the lexeme IDs. I wanted to clarify whether having one row returned per lexeme ID is a strict requirement. For instance, with the ID L1131459 linked here, achieving this goal seems almost impossible. I have been experimenting with various approaches and adding numerous properties. Even for a simplified query (ignoring certain forms such as context and pausal) that combines all of those properties, I still encounter duplicates.
The worst part is that after adding all these forms, I get the fewest possible duplicates for this particular outlier ID, but the other IDs return nothing, since normal lexemes are not structured this way in Wikidata; they typically have at most 6 or 7 forms, like this one. Should I disregard the duplicate requirement for this particular ID and focus on the more standard cases?
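As a rough way to check the "one row per lexeme ID" goal discussed above, the following Python sketch counts rows per lexeme ID in a set of query results. The row shape (lexeme ID, adjective, plural, dual) follows the columns named in this PR; the IDs and values are placeholders rather than real Wikidata results, and `duplicated_lexemes` is a hypothetical helper, not part of Scribe.

```python
from collections import Counter

# Placeholder result rows shaped like (lexeme ID, adjective, plural, dual).
# These are illustrative values, not real Wikidata output.
rows = [
    ("L1", "adj-a", "plural-a-masc", "dual-a"),
    ("L1", "adj-a", "plural-a-fem", "dual-a"),
    ("L2", "adj-b", "plural-b", "dual-b"),
]


def duplicated_lexemes(rows):
    """Return {lexeme_id: row_count} for every ID that appears more than once."""
    counts = Counter(row[0] for row in rows)
    return {lexeme_id: n for lexeme_id, n in counts.items() if n > 1}


print(duplicated_lexemes(rows))
```

An empty result would mean every lexeme ID maps to exactly one row; any ID listed still needs more distinguishing forms or properties in the query.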
I apologize if this message seems too long or if I'm annoying you. Thank you for your understanding!
Not annoying me at all! Maybe what we can do is bring in the current query with the seven forms and then in the future the modeling will be more consistent? What do you think?
I'm still excited to work on this and am open to any format you prefer to make the project more consistent and improve language data extraction. I could have added only the form for singular adjectives, as in other languages, to remove duplicates, but I'm all for whatever helps the project in the long run.
As with the nouns, maybe it makes sense to model the full possible forms and just have empty responses for some :) Then hopefully the queries will work on the final modeled version of the data once that's made by the Wikidata community 😊 Let me know what you think, @OmarAI2003!
Adding all possible forms sounds reasonable, but after some trials, I found it technically challenging because Arabic forms and plurals are not standardized. Not every noun follows a predictable pattern (e.g., "مذكر", i.e., "masculine", won’t have a feminine plural form), and irregular or broken plurals add more complexity, so we might end up with something like 100 columns, which is computationally insane. To address this, I created a query that aggregates the grammatical features for each lexeme and outputs the available information. This ensures that the query returns meaningful data even if the expected forms are incomplete or missing. Let me know your thoughts! |
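The aggregation idea described above can be illustrated outside SPARQL with a small Python sketch. It assumes each form arrives as a (lexeme ID, form text, grammatical features) triple and groups the forms under a key built from their features, so a lexeme reports only the combinations it actually has instead of filling a fixed set of columns. The input data and the `aggregate_forms` helper are hypothetical and do not reproduce the query in this PR.

```python
from collections import defaultdict

# Placeholder (lexeme ID, form text, grammatical features) triples.
forms = [
    ("L1", "form-1", ("singular", "masculine")),
    ("L1", "form-2", ("plural", "feminine")),
    ("L2", "form-3", ("singular", "masculine")),
]


def aggregate_forms(forms):
    """Group forms per lexeme, keyed by their sorted grammatical features."""
    grouped = defaultdict(dict)
    for lexeme_id, text, features in forms:
        # Sorting makes the key independent of the order features arrive in.
        key = " ".join(sorted(features))
        grouped[lexeme_id].setdefault(key, []).append(text)
    return {lexeme_id: dict(by_feature) for lexeme_id, by_feature in grouped.items()}


for lexeme_id, by_feature in aggregate_forms(forms).items():
    print(lexeme_id, by_feature)
```

Because missing feature combinations simply do not appear as keys, incomplete lexemes still yield one entry each without requiring a column for every possible form.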
I think that this will be enough for now, @OmarAI2003, as the amount of data isn't so large that we need to be too exact. The most important thing is that we add enough forms for unique results. Maybe doing a separate query for context form at some point could be interesting, but let's not worry about it for now. Same for definite, though what we're getting here is just the definite article prepended :)
Contributor checklist
Create SPARQL Query for Arabic Adjectives
This Pull Request modifies the SPARQL query to include the dual form representation of adjectives in addition to their singular and plural forms.
Key Changes:
Linguistic Context:
In Arabic:
This PR accurately reflects the grammatical nuances of Arabic adjectives, making it more useful for users querying this linguistic data.