Note: as described in issues#2, this file, README.en.md, is the result of machine translation. No attempt is made to correct translation errors, so that future volunteers are not kept from getting involved because they do not know English.
[work in progress] Permanent project to coordinate the creation and updating of linguistic data sets (such as those that can be used to detect discrimination and hate speech), preferably validated by representatives of the affected groups or by subject matter experts. Dedicated to the public domain.
NOTE: at the moment, 2020-12-01, the content made available here is not ready for end use. It primarily serves to test strategies for how to collect data and which HXL hashtags to use to classify information.
Unlike EticaAI/linguistic-datasets-portuguese (which is a list of different data sets in Portuguese from different sources), this repository is the reference for the data sets themselves, with Etica.AI serving as the organization that allows collaboration on an ongoing basis.
Linguistic data sets in Portuguese are rare and not very complete. When they exist, they are often under a restricted-use license or depend on access to proprietary APIs, even if free of charge. Our work here, which goes as far as allowing commercial use, has the potential to help with automation (such as detection of verbal attacks).
Not only is HXL (the Humanitarian Exchange Language) our main form of data storage in this project, but there is also an exchange of help with people who already work in the information technology areas of international humanitarian organizations.
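For readers new to HXL, the following is a minimal sketch of how an HXL-tagged table can be read. It assumes the usual HXL layout, where a row of hashtags sits between the human-readable headers and the data. The column names, hashtags, and rows are hypothetical placeholders and are not taken from this repository, and `read_hxl` is an illustrative helper, not part of any HXL library.

```python
import csv
import io

# Hypothetical sample: headers, hashtags, and rows are invented for
# illustration only and do not reflect this project's actual schema.
SAMPLE = """\
Term,Language,Note
#item +term,#meta +language,#description
exemplo,pt,placeholder row
example,en,placeholder row
"""


def read_hxl(text):
    """Return the data rows as dicts keyed by HXL hashtag."""
    rows = list(csv.reader(io.StringIO(text)))
    # The hashtag row is the first row whose non-empty cells all start with '#'.
    for index, row in enumerate(rows):
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells and all(cell.startswith("#") for cell in cells):
            tags = [cell.strip() for cell in row]
            # Every row after the hashtag row is treated as data.
            return [dict(zip(tags, data)) for data in rows[index + 1:]]
    raise ValueError("no HXL hashtag row found")


records = read_hxl(SAMPLE)
for record in records:
    print(record["#item +term"], record["#meta +language"])
```

Keying the records by hashtag rather than by header text means the human-readable headers can be translated or reworded without breaking the code that consumes the data.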
Your feedback on how to improve collaboration processes can have an impact even outside Portuguese-speaking countries. Whether you are a software developer or a member of a typically affected community (even without knowing English or having an affinity with computers), if you are interested, we can help you prepare to collaborate beyond your home country.
For the purposes of this project, both Etica.AI and HXL-CPLP people should be seen as facilitators, not creators. People from the affected communities, even if they are not specialists with an academic doctorate (but who still have the courage to help assemble initial content that can be revised in the future), are the main enablers of every idea.
One of the implications of dedicating data sets to the public domain is that the final result may not contain the names of individuals (not even Etica.AI / HXL-CPLP). As far as possible, we will look for alternative ways to give special recognition to people who help to coordinate or revalidate the work of others, or who created meaningful initial content, even if they prefer not to claim authorship of their contributions for fear of retaliation.
To the extent possible under law, Etica.AI has waived all copyright and related or neighboring rights to this work, dedicating it to the [Public Domain](UNLICENSE).