The UD_French-GSD was converted in 2015 from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). It is updated since 2015 independently from the previous source.
The UD_French-GSD is converted from the content head version of the universal dependency treebank v2.0 (https://github.com/ryanmcd/uni-dep-tb). The README for the original project is available below.
The version 2.7 of UD_French-GSD data consists of 400,399 words (16,341 sentences).
No sentence id were available in the original resource, so new sent_id
were automatically introduced in the converted corpus with prefixes fr-ud-train
, fr-ud-dev
and fr-ud-test
on the corresponding original files, followed by a 5 digit number following the order of sentences in the original files.
- file
fr-ud-train.conll
: 14,449 sentences; 354,662 words fr-ud-train_00001
tofr-ud-train_14554
- file
fr-ud-dev.conll
: 1,476 sentences; 35,718 words fr-ud-dev_00001
tofr-ud-dev_01478
- file
fr-ud-test.conll
: 416 sentences; 10,019 words fr_ud-test_00001
tofr_ud-test_00298
fr-ud-dev_01479
tofr-ud-dev_01596
Sentences are shuffled and there is no way to know what is the source or the genre of a given sentence.
Probably due to some bug in a conversion program, version 1.2 contains many truncated sentences (date missing for instance). Almost every truncated sentence are from Wikipedia, so it was possible to recover the original text. Most of the truncated sentences were completed in version 1.3. Some sentences were completed later. There are probably still some truncated sentences.
The latest version of the corpus was produced by Marie-Catherine de Marneffe, Bruno Guillaume, Matias Grioni, Carly Dickerson and Guy Perrier. Automatic modifications and consistency checking were partly done using the Grew software.
See below for references and acknowledgments concerning the original corpus.
[In French] Bruno Guillaume, Marie-Catherine de Marneffe, Guy Perrier. Conversion et améliorations de corpus du français annotés en Universal Dependencies. Traitement Automatique des Langues, ATALA, 2019, 60 (2), pp.71-95. hal-02267418
-
2024-11-15 v2.15
-
Construction annotations in the UCxn framework added to MISC
This release adds rule-based annotations of Interrogatives, Conditionals, Existentials, and NPN (noun-preposition-noun) constructions on the head of the respective phrase, plus construction elements. The UCxn v1 notation and categories are documented here.
-
-
2024-05-15 v2.14
- Fix validation warnings on flat:foreign
- Fix many NOUN wrongly tagged as PROPN
-
2020-11-15 v2.7 New conversion from SUD data.
-
2020-05-15 v2.6
- UD_French-GSD is now maintained in the SUD format and converter to UD with the Grew-based conversion system SUD_to_UD.
-
2019-11-15 v2.5
- Google gave permission to drop the "NC" restriction from the license. This applies to the UD annotations (not the underlying content, of which Google claims no ownership or copyright).
- Many corrections:
- expletive annotation with relations
expl:subj
,expl:comp
andexpl:pass
aux
->aux:tense
MWEPOS
->EXTPOS
- systematic review of non-projective trees
- improved ellipsis annotations
- expletive annotation with relations
-
2019-05-15 v2.4
- Change annotations to make the corpus valid
- Change MWE annotations (see here)
-
2018-11-15 v2.3
- add subtyping for preposition "à"
- remove duplicate sentences
- fix errors found with conversion to SUD
-
2018-04-15 v2.2
- Partial subtyping of
obl
into the 3 relationsobl:agent
,obl:mod
andobl:arg
(This is done form 43,5% of the total number of theobl
relations; this is completely done from PP introduced by prepositions de, par, dans and avec). - Many error corrections
- Partial subtyping of
-
2017-11-15 v2.1
- Only "être", "avoir" or "faire" with the POS tag
AUX
- modifications taking into account new decisions taken for harmonisation of several French Treebanks (causative, copulas, auxiliaries, change of
nmod:poss
indet
)
- Only "être", "avoir" or "faire" with the POS tag
-
2017-03-01 v2.0
- Conversion of data to follow new guidelines (automatic conversions with manual verifications and manual annotation of
nmod
VSobl
in ambiguous cases)
- Conversion of data to follow new guidelines (automatic conversions with manual verifications and manual annotation of
-
2016-11-15 v.1.4
- POS annotations were made more consistent
- Several dependency paths were also made more consistent across similar expressions
- Only "avoir" and "être" are considered copula
-
2016-05-15 v1.3
- Add lemmas and features
- Complete truncated sentences
-
2015-11-15 v1.2
- Add sent_id and full text in metadata
- Removed duplicate sentences (overlaps with train) from dev and test data.
-
2015-05-15 v1.1
- Manual corrections of full text
- change of
det
innmod:poss
for possessive determiners
-
2015-01-15 v1.0
- Initial release
===================================
Universal Dependency Treebanks v2.0
(legacy information)
===================================
=========================
Licenses and terms-of-use
=========================
For the following languages
German, Spanish, French, Indonesian, Italian, Japanese, Korean and Brazilian
Portuguese
we will distinguish between two portions of the data.
1. The underlying text for sentences that were annotated. This data Google
asserts no ownership over and no copyright over. Some or all of these
sentences may be copyrighted in some jurisdictions. Where copyrighted,
Google collected these sentences under exceptions to copyright or implied
license rights. GOOGLE MAKES THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY
WARRANTY OF ANY KIND, WHETHER EXPRESS OR IMPLIED.
2. The annotations -- part-of-speech tags and dependency annotations. These are
made available under a CC BY-SA 4.0. GOOGLE MAKES
THEM AVAILABLE TO YOU 'AS IS', WITHOUT ANY WARRANTY OF ANY KIND, WHETHER
EXPRESS OR IMPLIED. See attached LICENSE file for the text of CC BY-NC-SA.
Portions of the German data were sampled from the CoNLL 2006 Tiger Treebank
data. Hans Uszkoreit graciously gave permission to use the underlying
sentences in this data as part of this release.
Any use of the data should reference the above plus:
Universal Dependency Annotation for Multilingual Parsing
Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg,
Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang,
Oscar Tackstrom, Claudia Bedini, Nuria Bertomeu Castello and Jungmee Lee
Proceedings of ACL 2013
=======
Contact
=======
[email protected]
[email protected]
[email protected]
See https://github.com/ryanmcd/uni-dep-tb for more details