GitHub - AI-team-UoA/GeoQuestions1089: Crowdsourced Geospatial Question-Answering dataset containing triples of question-queries-answers.

A crowdsourced geospatial question-answering dataset that contains 1089 triples of natural language questions, SPARQL/GeoSPARQL queries, and their answers over YAGO2geo.

Overview

GeoQuestions1089 is a crowdsourced geospatial question-answering dataset that targets the Knowledge Graph YAGO2geo. It contains 1089 triples of geospatial questions, their answers, and the respective SPARQL/GeoSPARQL queries.

It has been used to benchmark two state-of-the-art question answering engines, GeoQA2 and the engine of Hamzei et al.

Repository information

The repository is organized as follows:

engines/: contains the versions of the engines that were used in the benchmark.
GoST/: contains a transpiler that rewrites queries to use materialized relations.
results/generated-queries/: contains the queries generated by the engines for the benchmark.
GeoQuestions1089.json: contains 1089 natural language questions and their queries.
GeoQuestions1089_answers.json: contains the results of the queries.
GeoQuestions1089.csv: the entire dataset in CSV format.

Dataset

The dataset is described in the following paper (also used to cite the dataset):

@inproceedings{10.1007/978-3-031-47243-5_15,
  title = {Benchmarking Geospatial Question Answering Engines Using the Dataset GeoQuestions1089},
  author = {Sergios-Anestis Kefalidis, Dharmen Punjani, Eleni Tsalapati, 
         Konstantinos Plas, Mariangela Pollali, Michail Mitsios, 
         Myrto Tsokanaridou, Manolis Koubarakis and Pierre Maret},
  booktitle = {The Semantic Web - {ISWC} 2023 - 22nd International Semantic Web Conference,
            Athens, Greece, November 6-10, 2023, Proceedings, Part {II}},
  year = {2023}
}

In short, the GeoQuestions1089 dataset consists of two parts, which we refer to as GeoQuestions_c and GeoQuestions_w, both of which target the union of YAGO2 and YAGO2geo.

GeoQuestions_c consists of 1017 entries and GeoQuestions_w of 72 entries. The difference between the two is that the natural language questions of GeoQuestions_w contain grammatical, syntactical and spelling mistakes.

Description	Range
Triples targeting YAGO2geo (GeoQuestions_c)	1-895
Triples targeting YAGO2 + YAGO2geo (GeoQuestions_c)	896-1017
Triples with questions that contain mistakes (GeoQuestions_w)	1018-1089
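
The ranges above can be turned into a small lookup. The following is a minimal sketch, assuming triples are numbered from 1 to 1089 in dataset order; the names RANGES, describe_triple and part_size are hypothetical and not part of the repository.

```python
from typing import Tuple

# Index ranges copied from the table above (1-based, inclusive).
RANGES = [
    (1, 895, "GeoQuestions_c", "YAGO2geo"),
    (896, 1017, "GeoQuestions_c", "YAGO2 + YAGO2geo"),
    (1018, 1089, "GeoQuestions_w", "questions that contain mistakes"),
]


def describe_triple(index: int) -> Tuple[str, str]:
    """Return (dataset part, description) for a 1-based triple index."""
    for first, last, part, description in RANGES:
        if first <= index <= last:
            return part, description
    raise ValueError(f"index {index} is outside 1-1089")


def part_size(part: str) -> int:
    """Count the triples that belong to one part of the dataset."""
    return sum(last - first + 1 for first, last, p, _ in RANGES if p == part)


print(describe_triple(900))
print(part_size("GeoQuestions_c"), part_size("GeoQuestions_w"))
```

The sizes derived from the ranges agree with the entry counts stated above (1017 and 72).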

Current version of the dataset

The aforementioned paper describes version 1.0. The latest available version is 1.1.

Version 1.1 includes several enhancements:

Uniform query format and variable naming
Fixes in natural language capitalization
Corrections in query categorization
Replacement of stSPARQL functions with GeoSPARQL functions where applicable
Minor improvements in query correctness of existing queries
A few triples that were erroneous (resulting from incorrect file modifications and text editing) have been replaced by correct ones.

These updates ensure greater consistency and accuracy in the dataset, making it a more reliable resource for geospatial QA research.

Benchmark (Version 1.1)

We have used the dataset to evaluate GeoQA2 and the engine of Hamzei et al. The results of the evaluation are presented below:

GeoQA2

Combined Table: Evaluation of GeoQA2 over GeoQuestions_C and GeoQuestions_W

Category	Executable Queries (C)	Correct Answers (C)	Correct Answers*(1) (C)	Executable Queries (W)	Correct Answers (W)	Correct Answers*(1) (W)
A	83.81%	50.86%	60.68%	75.00%	50.00%	66.67%
B	74.82%	60.43%	80.76%	81.81%	45.45%	55.56%
C	81.25%	45.45%	55.94%	85.71%	50.00%	58.34%
D	54.54%	9.09%	16.67%	100.00%	0.00%	0.00%
E	76.08%	24.63%	32.38%	50.00%	33.33%	66.67%
F	58.33%	25.00%	42.85%	50.00%	0.00%	0.00%
G	73.56%	33.33%	45.31%	36.36%	0.00%	0.00%
H	66.89%	18.62%	27.83%	66.67%	0.00%	0.00%
I	80.76%	19.23%	23.80%	50.00%	0.00%	0.00%
Total	75.61%	37.75%	49.93%	68.05%	30.55%	44.89%

(1) Correct Answers* is the percentage of correct answers calculated over the number of Executable Queries generated by the engines.
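
As a worked example of the note above, Correct Answers* can be recomputed from the other two columns. This is only a sketch, assuming both input percentages are taken over the same set of questions; the function name correct_over_executable is hypothetical.

```python
def correct_over_executable(correct_pct: float, executable_pct: float) -> float:
    """Correct Answers* in percent: correct answers relative to executable queries.

    Both arguments are percentages over all questions of a category.
    """
    if executable_pct <= 0:
        raise ValueError("no executable queries, the ratio is undefined")
    return 100.0 * correct_pct / executable_pct


# Category A of GeoQA2 over GeoQuestions_C, values taken from the table above.
value = correct_over_executable(50.86, 83.81)
print(f"{value:.2f}%")
```

For category A this reproduces the 60.68% reported in the table. Other rows may differ in the last digit, since the published percentages are themselves rounded.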

System of Hamzei et al.

Combined Table: Evaluation of the system of Hamzei et al. over GeoQuestions_C and GeoQuestions_W

Category	Executable Queries (C)	Correct Answers (C)	Correct Answers* (C)	Executable Queries (W)	Correct Answers (W)	Correct Answers* (W)
A	82.08%	23.12%	28.16%	93.75%	6.25%	6.67%
B	94.96%	53.23%	56.06%	100.00%	54.54%	54.54%
C	81.81%	26.13%	31.94%	100.00%	14.28%	14.28%
D	81.81%	4.54%	5.55%	100.00%	0.00%	0.00%
E	92.75%	6.52%	7.03%	83.34%	0.00%	0.00%
F	62.50%	12.50%	20.00%	90.90%	0.00%	0.00%
G	80.45%	10.34%	12.85%	100.00%	0.00%	0.00%
H	77.93%	26.89%	34.51%	77.78%	0.00%	0.00%
I	84.61%	7.96%	9.09%	50.00%	0.00%	0.00%
Total	83.97%	22.81%	27.28%	93.05%	12.50%	13.43%

Additional benchmark results exist and we are working on publishing them. Until then, if you want to see more, please send a message to:

s[dot]kefalidis[at]di[dot]uoa[dot]gr

Tools

Materialization and Transpiler

To improve the time performance of query execution, we pre-computed and materialized certain relations between entities in the YAGO2geo KG.

The geospatial relations within, crosses, intersects and borders (and their extensions, e.g., overlaps and covers) are the most expensive ones to compute, whereas north, south, east and west are easily computed. Hence, we materialized the expensive relations.

To ease the transformation of GeoSPARQL/stSPARQL FILTERs to materialized triples, we have developed and publicly provide a transpiler that rewrites queries to use the materialized triples where possible.

To use the provided binary, run the command:

java -cp PATH/TO/GOST_EXECUTABLE gr.uoa.di.ai.Transpiler QUERY
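
The command above can also be invoked from a script. The sketch below only assembles the same argument list and hands it to the Java runtime. It assumes that java is on the PATH and that the transpiler writes the rewritten query to standard output, which is not confirmed here; the function names are hypothetical.

```python
import subprocess
from typing import List

# Main class name taken from the command above.
TRANSPILER_CLASS = "gr.uoa.di.ai.Transpiler"


def transpiler_command(classpath: str, query: str) -> List[str]:
    """Build the argument list: java -cp CLASSPATH gr.uoa.di.ai.Transpiler QUERY."""
    return ["java", "-cp", classpath, TRANSPILER_CLASS, query]


def run_transpiler(classpath: str, query: str) -> str:
    """Run the transpiler and return whatever it prints.

    Assumption: the rewritten query is printed to standard output.
    """
    completed = subprocess.run(
        transpiler_command(classpath, query),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Passing the query as a single list element avoids shell quoting problems with the braces and question marks that SPARQL queries contain.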

RDF Store

To run the experiments and generate the answers for the gold and generated queries we used GraphDB. Because GraphDB does not support stSPARQL functions, we have extended the GeoSPARQL plugin of GraphDB.

Notes

About the definition of near for distance calculations

We define near according to the concept the question refers to (for example, a city or a restaurant). This is consistent with the definition of near in GeoQuestions201.

Near to	Distance
Near to a City	5 km
Near to a Town	5 km
Near to a Bay	1 km
Near to a Beach	1 km
Near to a Forest	1 km
Near to a Hotel	1 km
Near to a Lake	1 km
Near to a Landmark	1 km
Near to a Village	1 km
Near to a Restaurant	500 m
Near to a Park	500 m
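
The thresholds above can be kept in a lookup table when writing or checking queries. The sketch below is illustrative only: the names NEAR_DISTANCE_METERS, near_threshold and near_filter and the default variable names are hypothetical, and the generated FILTER uses the standard GeoSPARQL function geof:distance with uom:metre, which may differ from how the dataset's own queries express near.

```python
# Distances from the table above, converted to meters.
NEAR_DISTANCE_METERS = {
    "City": 5000,
    "Town": 5000,
    "Bay": 1000,
    "Beach": 1000,
    "Forest": 1000,
    "Hotel": 1000,
    "Lake": 1000,
    "Landmark": 1000,
    "Village": 1000,
    "Restaurant": 500,
    "Park": 500,
}


def near_threshold(concept: str) -> int:
    """Return the 'near' distance in meters for a concept such as 'Lake'."""
    try:
        return NEAR_DISTANCE_METERS[concept]
    except KeyError:
        raise ValueError(f"no 'near' distance defined for {concept!r}") from None


def near_filter(concept: str, geom_a: str = "?wkt1", geom_b: str = "?wkt2") -> str:
    """Build a GeoSPARQL distance FILTER for the given concept (illustrative)."""
    return (
        f"FILTER(geof:distance({geom_a}, {geom_b}, uom:metre) "
        f"< {near_threshold(concept)})"
    )


print(near_filter("Restaurant"))
```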

Prefixes used in GeoQuestions1089:

PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX yago: <http://yago-knowledge.org/resource/>
PREFIX y2geor: <http://kr.di.uoa.gr/yago2geo/resource/>
PREFIX y2geoo: <http://kr.di.uoa.gr/yago2geo/ontology/>
PREFIX strdf: <http://strdf.di.uoa.gr/ontology#>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

Knowledge Graph data:

You can download the complete YAGO2+YAGO2geo knowledge graph, with the materialized relations, here.

Team & Authors

Sergios-Anestis Kefalidis, Research Associate at the University of Athens, Greece
Dharmen Punjani, Research Associate at Université Jean Monnet Saint-Etienne, France
Eleni Tsalapati, Senior Researcher at the University of Athens, Greece
Kostas Plas, Research Associate at the University of Athens, Greece
Mariangela Pollali, Research Associate at the University of Athens, Greece
Michail Mitsios, Research Associate at the University of Athens, Greece
Myrto Tsokanaridou, Research Associate at the University of Athens, Greece
Manolis Koubarakis, Professor at the University of Athens, Greece
Pierre Maret, Professor at Université Jean Monnet Saint-Etienne, France

This is a research project by the AI-Team of the Department of Informatics and Telecommunications at the University of Athens.

Funding

This project is being/has been funded in the context of:

the first call for H.F.R.I. Research Projects to support faculty members and researchers and the procurement of high-cost research equipment grant (HFRI-FM17-2351)
the ESA project DA4DTE (subcontract 202320239)
the Horizon 2020 project AI4Copernicus (GA No. 101016798)
the Marie Skłodowska-Curie project QuAre (GA No. 101032307)

License

Released under the CC0 Attribution 4.0 International license (see LICENSE).

Categories

Category	GeoQuestions1089_c	GeoQuestions1089_w
A	173	16
B	139	11
C	176	14
D	22	1
E	138	6
F	24	2
G	174	11
H	145	9
I	26	2
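
The per-category counts above can be cross-checked against the sizes of the two dataset parts, and they allow a rough conversion of benchmark percentages into absolute numbers. This is a sketch under the assumption that each benchmark percentage is computed over the number of questions in its category; estimated_count is a hypothetical helper.

```python
# (GeoQuestions_c, GeoQuestions_w) question counts per category, from the table above.
QUESTIONS_PER_CATEGORY = {
    "A": (173, 16),
    "B": (139, 11),
    "C": (176, 14),
    "D": (22, 1),
    "E": (138, 6),
    "F": (24, 2),
    "G": (174, 11),
    "H": (145, 9),
    "I": (26, 2),
}

total_c = sum(c for c, _ in QUESTIONS_PER_CATEGORY.values())
total_w = sum(w for _, w in QUESTIONS_PER_CATEGORY.values())
print(total_c, total_w, total_c + total_w)


def estimated_count(percentage: float, category_size: int) -> int:
    """Approximate number of questions behind a reported percentage."""
    return round(percentage / 100 * category_size)


# GeoQA2, category A over GeoQuestions_c: 83.81% executable queries out of 173.
print(estimated_count(83.81, QUESTIONS_PER_CATEGORY["A"][0]))
```

The totals come out to 1017 and 72, which sum to the 1089 triples of the dataset.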
