Add a query Arabic adjectives and their plural and dual. #255

OmarAI2003 · 2024-10-05T20:15:15Z

Contributor checklist

This pull request is on a separate branch and not the main branch

Create SPARQL Query for Arabic Adjectives

This Pull Request modifies the SPARQL query to include the dual form representation of adjectives in addition to their singular and plural forms.

Key Changes:

Created a new SPARQL query file to extract the following:
- Adjective labels: The Arabic text for adjectives.
- Plural forms: The plural representations of adjectives.
- Dual forms: The dual representations of adjectives.

Linguistic Context:

In Arabic:

Plural and Dual Forms: Adjectives, like nouns, can take plural forms, which may be sound (formed by adding a regular suffix to the singular) or broken (formed by changing the internal structure of the word). There is also a distinct dual form used to refer specifically to two items, typically formed by adding specific endings to the singular form.

This PR reflects the grammatical nuances of Arabic adjectives, making the query more useful for users querying this linguistic data.

…nguage extraction

- Modified the SPARQL query to extract the dual form of adjectives along with their labels and plural forms.
- Ensured that the output includes the lexeme ID, adjective, plural, and dual forms consistently in Arabic text.

github-actions · 2024-10-05T20:15:36Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution
- The contributor's name and icon in remote commits should be the same as what appears in the PR
- If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

andrewtavis · 2024-10-06T13:49:22Z

Could you remove the extra comments and put MARK: comments for sections of the query as is done in other queries, @OmarAI2003? Aside from this, really nice 😊 Thanks so much!

- Removed extra comments.
- Replaced with MARK: comments to match the style used in other queries.

OmarAI2003 · 2024-10-06T15:37:26Z

Thank you for the feedback! 😊 I've updated the query to use MARK: comments as requested and removed the extra comments.

…Scribe-Data into add-query-adjectives

OmarAI2003 · 2024-10-06T16:25:55Z

Hi @andrewtavis

I've noticed some failing tests related to the addition of the "adjectives" data type in the list_data_types function. The expected output in the tests seems to be missing the "adjectives" entry, while the actual output includes it.

Could you please clarify whether the tests need to be updated to reflect the new data type, or should I investigate further to ensure everything is functioning correctly?

Thanks for your guidance!

andrewtavis · 2024-10-07T00:08:24Z

The tests should be updated, and that's coming in another PR :)

andrewtavis · 2024-10-07T00:08:32Z

Thanks @OmarAI2003!

andrewtavis · 2024-10-07T00:25:07Z

For this one as well, @OmarAI2003, would be great if we add enough properties so that each of the query results is unique :) Are there some forms that could be added or properties to the existing form so there's one form returned per property set?

OmarAI2003 · 2024-10-07T01:22:38Z

Thanks @andrewtavis for your feedback on the PR.
I wanted to clarify your suggestion about ensuring that query results are unique.

In Arabic, diacritical marks play a crucial role in changing the meaning of words, allowing two words to have the same letters but different meanings. This can lead to situations where words may appear similar but are indeed distinct due to these marks.

Additionally, when it comes to plurals, there are different forms for masculine and feminine, which means we can have multiple entries that share the singular and dual forms but differ in their pluralization. Therefore, I believe these variations wouldn’t necessarily count as duplicates.

The main issue I see regarding potential duplication in this query might stem from the addition of the definite article (ال) to nouns, which can affect their uniqueness.

Could you please clarify if you believe there are actual duplicate rows in the current results? Your insight would really help me understand your perspective better!

andrewtavis · 2024-10-07T08:29:47Z

I do know a bit of Arabic from my bachelor's, @OmarAI2003, which is why I'm so happy you're working on it 😇

Big thing is that in the end we want just one row returned per lexeme ID. Let's add properties and forms such that each of the forms is returned in a unique way, so not just dual, but dual feminine, dual masculine, etc, and the same for plural and any other forms. If there's a definite version as well with ال at the start, then this can also be its own form :)
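The "one row per lexeme ID" goal can be sketched in Python as a pivot over forms, where each output column is identified by a complete feature set (for example gender plus number). This is only a minimal illustration under stated assumptions: the lexeme IDs, the placeholder form strings, the column names, and the `pivot_forms` helper are all hypothetical and are not taken from Scribe-Data or Wikidata.

```python
from collections import defaultdict

# Hypothetical sample data: (lexeme_id, form_text, grammatical features).
# The IDs and placeholder strings are illustrative, not real Wikidata content.
FORMS = [
    ("L1", "form-a", frozenset({"masculine", "singular"})),
    ("L1", "form-b", frozenset({"feminine", "singular"})),
    ("L1", "form-c", frozenset({"masculine", "dual"})),
    ("L2", "form-d", frozenset({"masculine", "singular"})),
]

# Each output column is identified by one complete feature set (assumed names).
COLUMNS = {
    "masculineSingular": frozenset({"masculine", "singular"}),
    "feminineSingular": frozenset({"feminine", "singular"}),
    "masculineDual": frozenset({"masculine", "dual"}),
    "feminineDual": frozenset({"feminine", "dual"}),
}


def pivot_forms(forms):
    """Return one row per lexeme ID, with one column per feature set."""
    rows = defaultdict(lambda: {name: "" for name in COLUMNS})
    for lexeme_id, text, features in forms:
        for name, wanted in COLUMNS.items():
            # Only an exact feature-set match fills a column.
            if features == wanted:
                rows[lexeme_id][name] = text
    return dict(rows)


if __name__ == "__main__":
    for lexeme_id, row in pivot_forms(FORMS).items():
        print(lexeme_id, row)
```

Columns with no matching form stay empty. If two forms carry the same feature set, this sketch silently keeps the last one, whereas a SPARQL query built from OPTIONAL blocks would return an extra row for that lexeme, which is the duplication discussed in this thread.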

OmarAI2003 · 2024-10-07T12:08:30Z

Thank you for your insights regarding the lexeme IDs.

I wanted to clarify whether having one row returned per lexeme ID is a strict requirement. For instance, with the ID L1131459 linked here, achieving this goal seems almost impossible. I have been experimenting with various approaches and adding numerous properties. For one query (simplified by ignoring certain forms such as context and pausal), I combined all of the properties below and still encountered duplicates:

Masculine, nominative case, singular, indefinite
Masculine, accusative case, singular, indefinite
Masculine, genitive case, singular, indefinite
Masculine, nominative case, singular, definite
Masculine, accusative case, singular, definite
Masculine, genitive case, singular, definite
Feminine, nominative case, singular, indefinite
Feminine, accusative case, singular, indefinite
Feminine, genitive case, singular, indefinite
Feminine, nominative case, singular, definite
Feminine, accusative case, singular, definite
Feminine, genitive case, singular, definite
Masculine, nominative case, plural, indefinite
Masculine, accusative case, plural, indefinite
Masculine, genitive case, plural, indefinite
Masculine, nominative case, plural, definite
Masculine, accusative case, plural, definite
Masculine, genitive case, plural, definite
Feminine, nominative case, plural, indefinite
Feminine, accusative case, plural, indefinite
Feminine, genitive case, plural, indefinite
Feminine, nominative case, plural, definite
Feminine, accusative case, plural, definite
Feminine, genitive case, plural, definite
Masculine, nominative case, dual, indefinite
Masculine, accusative case, dual, indefinite
Masculine, genitive case, dual, indefinite
Masculine, accusative case, dual, definite
Masculine, genitive case, dual, definite
Feminine, nominative case, dual, indefinite
Feminine, accusative case, dual, indefinite
Feminine, genitive case, dual, indefinite
Feminine, accusative case, dual, definite
Feminine, genitive case, dual, definite
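As a rough illustration of how quickly these combinations grow, the Python sketch below enumerates the cross product of the four feature dimensions used in the list above. The constant and function names are hypothetical and not part of Scribe-Data. The full cross product has 36 entries; the list above contains 34 of them, since the two nominative dual definite combinations are not listed.

```python
from itertools import product

# Feature values as they appear in the list above.
GENDERS = ("Masculine", "Feminine")
NUMBERS = ("singular", "plural", "dual")
DEFINITENESS = ("indefinite", "definite")
CASES = ("nominative case", "accusative case", "genitive case")


def feature_combinations():
    """Enumerate every gender/case/number/definiteness combination."""
    return [
        f"{gender}, {case}, {number}, {definiteness}"
        for gender, number, definiteness, case in product(
            GENDERS, NUMBERS, DEFINITENESS, CASES
        )
    ]


if __name__ == "__main__":
    combos = feature_combinations()
    print(len(combos))
    for combo in combos:
        print(combo)
```

Each additional dimension (for example context or pausal forms) multiplies the number of columns again, which is why a fixed column per combination becomes unwieldy.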

The worst part is that even after adding all these forms, I only get down to the smallest possible number of duplicates for this particular outlier ID. Meanwhile, the other IDs return nothing, because typical lexemes are not structured this way in Wikidata; they usually have only 6 or 7 forms, like this one.

Should I disregard the duplicate requirements for this particular ID and focus on the more standard cases?

OmarAI2003 · 2024-10-07T12:10:18Z

I apologize if this message seems too long or if I'm annoying you. Thank you for your understanding!

andrewtavis · 2024-10-07T12:15:34Z

Not annoying me at all! Maybe what we can do is bring in the current query with the seven forms and then in the future the modeling will be more consistent? What do you think?

OmarAI2003 · 2024-10-07T12:29:33Z

I’m still excited to work on this and am open to any format you prefer to make the project more consistent and improve language data extraction. I could have added only the form for singular adjectives, as in other languages, to remove duplicates, but I’m all for whatever helps the project in the long run.

andrewtavis · 2024-10-07T14:40:31Z

As with the nouns, maybe it makes sense to model the full possible forms and just have empty responses for some :) Then hopefully the queries will work on the final modeled version of the data once that's made by the Wikidata community 😊

Let me know what you think, @OmarAI2003!

OmarAI2003 · 2024-10-08T03:08:45Z

> As with the nouns, maybe it makes sense to model the full possible forms and just have empty responses for some :) Then hopefully the queries will work on the final modeled version of the data once that's made by the Wikidata community 😊
>
> Let me know what you think, @OmarAI2003!

Adding all possible forms sounds reasonable, but after some trials, I found it technically challenging because Arabic forms and plurals are not standardized. Not every noun follows a predictable pattern (e.g., "مذكر", i.e., "masculine", won’t have a feminine plural form), and irregular or broken plurals add more complexity, so we might end up with something like 100 columns, which is computationally insane.

To address this, I created a query that aggregates the grammatical features for each lexeme and outputs the available information. This ensures that the query returns meaningful data even if the expected forms are incomplete or missing.
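The aggregation idea can be sketched in Python as grouping per-form rows and joining their feature labels into a single field. This is an illustration only: the row layout, the placeholder values, and the `aggregate_features` helper are hypothetical, and the thread does not show how the actual query performs the aggregation (in SPARQL, this kind of grouping is commonly written with GROUP BY and GROUP_CONCAT).

```python
from collections import defaultdict

# Hypothetical rows, one per (form, feature) pair, as a per-form query might return them.
ROWS = [
    {"lexemeID": "L1", "form": "form-a", "feature": "masculine"},
    {"lexemeID": "L1", "form": "form-a", "feature": "singular"},
    {"lexemeID": "L1", "form": "form-b", "feature": "feminine"},
    {"lexemeID": "L1", "form": "form-b", "feature": "plural"},
    {"lexemeID": "L2", "form": "form-c", "feature": "masculine"},
]


def aggregate_features(rows):
    """Collapse (form, feature) rows into one row per form with joined features."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[(row["lexemeID"], row["form"])].add(row["feature"])
    return [
        {
            "lexemeID": lexeme_id,
            "form": form,
            # Sort the labels so the joined string is deterministic.
            "features": ", ".join(sorted(features)),
        }
        for (lexeme_id, form), features in sorted(grouped.items())
    ]


if __name__ == "__main__":
    for row in aggregate_features(ROWS):
        print(row)
```

Because the features are joined into one value rather than spread over fixed columns, a lexeme with only a few forms and an outlier with dozens can both be returned by the same query shape.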

Let me know your thoughts!

…features.

andrewtavis

I think that this will be enough for now, @OmarAI2003, as the amount of data isn't so large that we need to be too exact. The most important thing is that we add enough forms for unique results. Maybe doing a separate query for the context form at some point could be interesting, but let's not worry about it for now. Same for definite, but then what we're getting here is the form with the definite article prepended :)

OmarAI2003 added 2 commits October 5, 2024 21:58

Add query_adjectives.sparql to support adjective queries in Arabic la…

a2bf1da

…nguage extraction

OmarAI2003 changed the title ~~Add query adjectives and their plural and dual.~~ Add a query Arabic adjectives and their plural and dual. Oct 5, 2024

andrewtavis self-requested a review October 6, 2024 13:49

Refactor query comments to use MARK: format for consistency

8cc098b

- Removed extra comments.
- Replaced with MARK: comments to match the style used in other queries.

OmarAI2003 and others added 3 commits October 6, 2024 19:00

Merge branch 'main' into add-query-adjectives

6d46b1f

Merge remote-tracking branch 'upstream/main' into add-query-adjectives

86b43a8

Merge branch 'add-query-adjectives' of https://github.com/OmarAI2003/…

33e9ea0

…Scribe-Data into add-query-adjectives

Remove label service and minor query cleanup

9c34003

andrewtavis added the hacktoberfest-accepted label Oct 7, 2024

OmarAI2003 added 3 commits October 8, 2024 07:38

Merge remote-tracking branch 'upstream/main' into add-query-adjectives

2ba26f1

Add SPARQL query to retrieve Arabic adjectives and their grammatical …

404004c

…features.

Resolved merge conflicts and finalized query_adjectives.sparql

c865ca5

OmarAI2003 mentioned this pull request Oct 8, 2024

Expand Arabic Nouns SPARQL Query to Include Additional Forms #256

Merged

1 task

Return forms themselves for Arabic adjectives

74f2eee

andrewtavis approved these changes Oct 9, 2024

andrewtavis merged commit cae689e into scribe-org:main Oct 9, 2024
3 checks passed
