Paris Stories is a corpus of spoken French collected and transcribed by linguistics students from Sorbonne Nouvelle and corrected by students from the Plurital Master's Degree of Computational Linguistics (Inalco, Paris Nanterre, Sorbonne Nouvelle) between 2017 and 2021. It contains monologues and dialogues from speakers living in the Paris region.
For an assignment, students had to record a friend or a relative sharing an anecdote on a given theme (meaningful encounters, vacations, interesting stories, etc.). The corpus was created for the study of contemporary spoken French and to train a syntactic parser for spoken French. All data has been morpho-syntactically annotated following the SUD (Surface Syntactic Universal Dependencies) guidelines.
See the SUD guidelines: https://surfacesyntacticud.github.io/guidelines/u/
The treebank can be found here: http://universal.grew.fr/?corpus=SUD_French-ParisStories@latest
The recordings can be downloaded via the url given in the '# sound_url' metadata.
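As a rough illustration of how the recording URLs could be collected, here is a minimal Python sketch. It assumes the metadata follows the usual CoNLL-U comment convention, `# sound_url = <URL>`, with one such line per sentence; the function name `sound_urls` and the example.org URLs are hypothetical and only serve the example.

```python
import re
from typing import Iterable, List

# Assumed comment format: "# sound_url = <URL>" (standard CoNLL-U "# key = value").
SOUND_URL = re.compile(r"^#\s*sound_url\s*=\s*(\S+)")


def sound_urls(lines: Iterable[str]) -> List[str]:
    """Return the distinct sound_url values, in order of first appearance."""
    seen = {}  # dict keeps insertion order, so it doubles as an ordered set
    for line in lines:
        match = SOUND_URL.match(line.strip())
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)


if __name__ == "__main__":
    # Hypothetical in-memory sample; real input would be the lines of a .conllu file.
    sample = [
        "# sent_id = example_1",
        "# sound_url = https://example.org/recording_a.wav",
        "1\tbonjour\tbonjour\tINTJ\t_\t_\t0\troot\t_\t_",
        "",
        "# sent_id = example_2",
        "# sound_url = https://example.org/recording_a.wav",
    ]
    for url in sound_urls(sample):
        print(url)
```

To run it on a real file, pass an open file handle, for example `sound_urls(open(path, encoding="utf-8"))`, where `path` points to one of the treebank's `.conllu` files.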
-- Paris Stories 2019 --
Creation Year : 2017
Annotation Year : 2019
Size :
- 19 samples
- 13951 tokens
- 709 sentences
- approx. 1 hour of recordings
Topics : travels, funny/unusual stories
-- Paris Stories 2020 --
Creation Year : 2018
Annotation Year : 2020
Size :
- 16 samples
- 9064 tokens
- 553 sentences
- approx. 30 minutes of recordings
Topics : vacation stories, funny/unusual stories
-- Paris Stories 2021 --
Creation Year : 2020
Annotation Year : 2021
Size :
- 14 samples
- 7825 tokens
- 499 sentences
- approx. 45 minutes of recordings
Topics : first encounters, funny/unusual stories
The corpus is maintained here in the SUD framework and automatically converted into UD_French-ParisStories using the Grew software with the conversion rules described here.
Annotation : Sylvain Kahane, Bruno Guillaume, Mariam Nakhlé, Vanessa Gaudray-Bouju, Menel Mahamdi
Annotation tools development : Kim Gerdes, Marine Courtin, Gaël Guibon
Conversion and handling of data validation : Bruno Guillaume
Direction of data collection : Cédric Gendrot, Kim Gerdes, Marine Courtin
We would like to thank all the students who participated in this project.
Sylvain Kahane, Bernard Caron, Emmett Strickland, Kim Gerdes. Annotation guidelines of UD and SUD treebanks for spoken corpora: A proposal. Proceedings of the 20th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2021).