Add SaltXML importer #261

Merged 68 commits on Aug 7, 2024
55c5a3c
Add empty SaltXML importer
thomaskrause Jun 19, 2024
4e17f80
Add empty SaltXML exporter
thomaskrause Jun 25, 2024
373471f
Merge branch 'main' into feature/saltxml
thomaskrause Jun 26, 2024
88b1622
Apply automatic changes
thomaskrause Jun 26, 2024
2012b68
Update the snapshot test to include the new SaltXML modules
thomaskrause Jun 26, 2024
91e94e3
Update quick-xml dependency
thomaskrause Jun 26, 2024
dc1ae07
Start mapping the Salt corpus structure.
thomaskrause Jun 26, 2024
a194e61
Map features on documents and corpora as annotations
thomaskrause Jun 26, 2024
c49132c
Apply automatic changes
thomaskrause Jun 26, 2024
3489ce2
Fix clippy warnings
thomaskrause Jun 26, 2024
eb41dd7
Use DOM parser for SaltXML documents
thomaskrause Jun 27, 2024
59a0c4d
Map document and corpus relations
thomaskrause Jun 27, 2024
b87b883
Map annis:doc annotation and actually return the document IDs
thomaskrause Jun 27, 2024
e27213f
Refactor the progress reporting in SaltXML
thomaskrause Jun 27, 2024
5ad718b
Preparations for mapping the documents
thomaskrause Jun 27, 2024
e9006e3
Merge branch 'main' into feature/saltxml
thomaskrause Jul 1, 2024
707ccfd
Start to map the textual datasource of a document
thomaskrause Jul 1, 2024
5614958
Add token nodes and store the document in the mapper struct
thomaskrause Jul 1, 2024
ba01b93
Apply automatic changes
thomaskrause Jul 1, 2024
1c56dba
Store nodes, edges and layers in the mapper struct
thomaskrause Jul 1, 2024
a87d6f4
Restructure code to fail if source/target can't be resolved
thomaskrause Jul 1, 2024
d060a6e
Get token value from textual relations
thomaskrause Jul 2, 2024
2988822
Add ordering edges between the token
thomaskrause Jul 2, 2024
66227c0
Add whitespace before/after token
thomaskrause Jul 2, 2024
58c7f2b
Add layer information for tokens
thomaskrause Jul 2, 2024
eba61fb
Fix some clippy issues
thomaskrause Jul 2, 2024
8308990
Decouple CLI compilation and running the documentation generation to …
thomaskrause Jul 2, 2024
b5896f3
Add token annotations
thomaskrause Jul 2, 2024
51f1dfc
Add spans and spanning relations
thomaskrause Jul 3, 2024
7476269
Merge branch 'main' into feature/saltxml
thomaskrause Jul 3, 2024
7d17c27
Add dominance edge with empty name as well
thomaskrause Jul 3, 2024
eb5b27b
Do not map "salt::SNAME" feature if this is not a document.
thomaskrause Jul 4, 2024
1be7b97
Allow to configure how empty annotation namespaces are handled
thomaskrause Jul 4, 2024
6fb8321
Use "default_ns" as fallback for annotation namespaces instead of emp…
thomaskrause Jul 4, 2024
d3ad396
Do not hide token in grid visualizer if there is only one tokenisation
thomaskrause Jul 4, 2024
05ba5a0
Fix clippy issue
thomaskrause Jul 5, 2024
1cb6d7b
Merge branch 'main' into feature/saltxml
thomaskrause Jul 5, 2024
688f616
Map pointing relations
thomaskrause Jul 5, 2024
4b5a5a7
Remove SaltXML exporter for now
thomaskrause Jul 5, 2024
9232fdc
Add example Salt corpus with timeline
thomaskrause Jul 5, 2024
faccdd0
Use code points as reference and not byte positions
thomaskrause Jul 5, 2024
c9bbe89
Add PartOf edges for all created nodes and start mapping a timeline
thomaskrause Jul 5, 2024
bc5a802
Consider the textual DS name when sorting the token relations
thomaskrause Jul 5, 2024
e24b11d
Use the segmentation name for the ordering component if there is a ti…
thomaskrause Jul 5, 2024
8823e7e
Map timeline relations as coverage between segmentation nodes and the…
thomaskrause Jul 5, 2024
d42575d
Move document node to mapper struct
thomaskrause Jul 5, 2024
789912b
Add coverage to indirectly covered TLI token
thomaskrause Jul 5, 2024
8002cf2
Add tok annotation to TLIs
thomaskrause Jul 5, 2024
193ec86
Merge branch 'main' into feature/saltxml
thomaskrause Jul 10, 2024
848f1c0
Map and visualize datasources, but do not connect tokens to them
thomaskrause Jul 10, 2024
9208005
Update test snapshots
thomaskrause Jul 10, 2024
d1eecc5
Merge branch 'main' into feature/saltxml
thomaskrause Jul 12, 2024
129ffd7
Add whitespace after the TLI token, so that they don't cover all the …
thomaskrause Jul 12, 2024
c6867a9
Use single space for TLI token instead
thomaskrause Jul 12, 2024
e8acd13
Update test snapshot
thomaskrause Jul 12, 2024
2122e8e
Start to map the audio file
thomaskrause Jul 12, 2024
2b4d8fa
Merge branch 'main' into feature/saltxml
thomaskrause Aug 5, 2024
299e3d8
Resolve the URL of the linked media/audio file
thomaskrause Aug 6, 2024
f5ca1f7
Map corpus/document annotations but do not map the ones of the "salt"…
thomaskrause Aug 6, 2024
bedf582
Map time information from SaltXML
thomaskrause Aug 6, 2024
f817efd
Use the file name as node ID
thomaskrause Aug 6, 2024
33c8a95
Update changelog
thomaskrause Aug 6, 2024
072d081
Map meta annotation
thomaskrause Aug 7, 2024
6ecbbad
Video and audio visualizers need the special "preloaded" visibility t…
thomaskrause Aug 7, 2024
eb0244f
Add more complex pointing edges with annotations to one of the exampl…
thomaskrause Aug 7, 2024
35b6a8b
Add test for some of the features of the GraphML guess_vis function
thomaskrause Aug 7, 2024
cab5573
Ensure stable order in GraphML for test
thomaskrause Aug 7, 2024
d0b87ba
Forget to commit argument
thomaskrause Aug 7, 2024
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -12,6 +12,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `map` manipulator can now add annotated spans and copy values from existing
annotations. The copied values can be manipulated using regular expressions and
replacement values.
- Added `saltxml` import format

### Fixed

28 changes: 19 additions & 9 deletions Cargo.toml
@@ -1,5 +1,8 @@
[package]
authors = ["Thomas Krause <[email protected]>", "Martin Klotz <[email protected]>"]
authors = [
"Thomas Krause <[email protected]>",
"Martin Klotz <[email protected]>",
]
description = "Converts linguistic data formats based on the graphANNIS data model as intermediate representation and can apply consistency tests."
edition = "2018"
homepage = "https://github.com/korpling/annatto/"
@@ -11,7 +14,7 @@ version = "0.14.0"
[dependencies]
ansi_term = "0.12"
anyhow = "1.0"
clap = {version = "4.0", features = ["derive", "env"]}
clap = { version = "4.0", features = ["derive", "env"] }
console = "0.15"
csv = "1.1"
documented = "0.3.0"
@@ -27,32 +30,34 @@ lazy_static = "1.4.0"
linked-hash-map = "0.5.6"
log = "0.4"
normpath = "1.1"
ordered-float = {version = "4.1", default-features = false}
ordered-float = { version = "4.1", default-features = false }
pathdiff = "0.2"
percent-encoding = "2.3.1"
pest = "2.7"
pest_derive = "2.0"
quick-xml = "0.31"
quick-xml = "0.34"
rayon = "1.1"
regex = "1.10"
roxmltree = "0.20.0"
serde = "1.0"
serde_derive = "1.0"
struct-field-names-as-array = "0.3.0"
strum = {version = "0.26.2", features = ["derive"]}
tabled = {version = "0.15", features = ["ansi"]}
strum = { version = "0.26.2", features = ["derive"] }
tabled = { version = "0.15", features = ["ansi"] }
tempfile = "3"
termimad = "0.29.1"
text-splitter = "0.6.3"
thiserror = "1.0"
toml = "0.8.0"
tracing-subscriber = {version = "0.3", features = ["env-filter"]}
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
umya-spreadsheet = "~1.1.1"
url = "2.5.2"
xml-rs = "0.8"
zip = "0.6.6"

[dev-dependencies]
assert_cmd = "2.0.11"
insta = {version = "1.26.0", features = ["toml", "filters"]}
insta = { version = "1.26.0", features = ["toml", "filters"] }
pretty_assertions = "1.3"

# Compile some of the dependencies in release mode if when we are ourself in
@@ -82,7 +87,12 @@ ci = "github"
# The installers to generate for each app
installers = []
# Target platforms to build apps for (Rust target-triple syntax)
targets = ["aarch64-apple-darwin", "x86_64-apple-darwin", "x86_64-unknown-linux-gnu", "x86_64-pc-windows-msvc"]
targets = [
"aarch64-apple-darwin",
"x86_64-apple-darwin",
"x86_64-unknown-linux-gnu",
"x86_64-pc-windows-msvc",
]
# The preferred cargo-dist version to use in CI (Cargo.toml SemVer syntax)
cargo-dist-version = "0.16.0"
# Publish jobs to run in CI
10 changes: 5 additions & 5 deletions docs/README.md
@@ -1,5 +1,5 @@
| Type | Modules |
|------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Import formats | [conllu](importers/conllu.md), [exmaralda](importers/exmaralda.md), [graphml](importers/graphml.md), [meta](importers/meta.md), [none](importers/none.md), [opus](importers/opus.md), [path](importers/path.md), [ptb](importers/ptb.md), [relannis](importers/relannis.md), [textgrid](importers/textgrid.md), [toolbox](importers/toolbox.md), [treetagger](importers/treetagger.md), [xlsx](importers/xlsx.md), [xml](importers/xml.md) |
| Export formats | [graphml](exporters/graphml.md), [exmaralda](exporters/exmaralda.md), [sequence](exporters/sequence.md), [textgrid](exporters/textgrid.md), [xlsx](exporters/xlsx.md) |
| Graph operations | [check](graph_ops/check.md), [collapse](graph_ops/collapse.md), [visualize](graph_ops/visualize.md), [enumerate](graph_ops/enumerate.md), [link](graph_ops/link.md), [map](graph_ops/map.md), [revise](graph_ops/revise.md), [chunk](graph_ops/chunk.md), [split](graph_ops/split.md), [none](graph_ops/none.md) |
| Type | Modules |
|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Import formats | [conllu](importers/conllu.md), [exmaralda](importers/exmaralda.md), [graphml](importers/graphml.md), [meta](importers/meta.md), [none](importers/none.md), [opus](importers/opus.md), [path](importers/path.md), [ptb](importers/ptb.md), [relannis](importers/relannis.md), [saltxml](importers/saltxml.md), [textgrid](importers/textgrid.md), [toolbox](importers/toolbox.md), [treetagger](importers/treetagger.md), [xlsx](importers/xlsx.md), [xml](importers/xml.md) |
| Export formats | [graphml](exporters/graphml.md), [exmaralda](exporters/exmaralda.md), [sequence](exporters/sequence.md), [textgrid](exporters/textgrid.md), [xlsx](exporters/xlsx.md) |
| Graph operations | [check](graph_ops/check.md), [collapse](graph_ops/collapse.md), [visualize](graph_ops/visualize.md), [enumerate](graph_ops/enumerate.md), [link](graph_ops/link.md), [map](graph_ops/map.md), [revise](graph_ops/revise.md), [chunk](graph_ops/chunk.md), [split](graph_ops/split.md), [none](graph_ops/none.md) |
6 changes: 6 additions & 0 deletions docs/exporters/saltxml.md
@@ -0,0 +1,6 @@
# saltxml (exporter)

Exports Excel Spreadsheets where each line is a token, the other columns are
spans and merged cells can be used for spans that cover more than one token.

*No Configuration*
14 changes: 14 additions & 0 deletions docs/importers/saltxml.md
@@ -0,0 +1,14 @@
# saltxml (importer)

Imports the SaltXML format used by Pepper (<https://corpus-tools.org/pepper/>).
SaltXML is an XMI serialization of the [Salt model](https://raw.githubusercontent.com/korpling/salt/master/gh-site/doc/salt_modelGuide.pdf).

## Configuration

### missing_anno_ns_from_layer

If `true`, use the layer name as the fallback namespace for annotations
that have no namespace. This is consistent with how the ANNIS tree visualizer
handles annotations without any namespace. If `false`, use the
`default_ns` namespace as fallback.
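
Reviewer note: the fallback rule described for `missing_anno_ns_from_layer` can be sketched in a few lines of Rust. This is a minimal illustration and not the importer's actual code. The function name `resolve_anno_ns` and its signature are hypothetical, and falling back to `default_ns` when the option is `true` but no layer is available is an assumption that the documentation does not spell out.

```rust
/// Hypothetical helper (not part of the PR): chooses the namespace for an
/// annotation following the rule documented for `missing_anno_ns_from_layer`.
fn resolve_anno_ns(
    explicit_ns: Option<&str>,
    layer_name: Option<&str>,
    missing_anno_ns_from_layer: bool,
) -> String {
    match explicit_ns {
        // A non-empty namespace given in the SaltXML file always wins.
        Some(ns) if !ns.is_empty() => ns.to_string(),
        // No namespace: use the layer name if configured to do so.
        // Assumption: without a usable layer name we still use "default_ns".
        _ if missing_anno_ns_from_layer => layer_name
            .filter(|l| !l.is_empty())
            .unwrap_or("default_ns")
            .to_string(),
        // No namespace and the option is disabled.
        _ => "default_ns".to_string(),
    }
}
```

Calling `resolve_anno_ns(None, Some("syntax"), true)` would yield the layer name, while the same call with `false` yields `default_ns`.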

57 changes: 29 additions & 28 deletions src/exporter/graphml.rs
@@ -238,17 +238,17 @@ fn media_vis(graph: &AnnotationGraph) -> Result<Vec<Visualizer>, Box<dyn std::er
layer: None,
vis_type: "audio".to_string(),
display_name: "audio".to_string(),
visibility: "hidden".to_string(),
visibility: "preloaded".to_string(),
mappings: None,
});
}
"mp4" | "avi" | "mov" => {
"mp4" | "avi" | "mov" | "webm" => {
vis.push(Visualizer {
element: "node".to_string(),
layer: None,
vis_type: "video".to_string(),
display_name: "video".to_string(),
visibility: "hidden".to_string(),
visibility: "preloaded".to_string(),
mappings: None,
});
}
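
Reviewer note: the effect of this hunk on video files can be illustrated with a small, self-contained sketch. It is not the code from `media_vis`. The `Visualizer` struct below is a simplified stand-in with only the fields visible in the hunk, the function `video_visualizer_for` is hypothetical, and matching on the file name's extension via `std::path::Path` is an assumption about how the extension is obtained.

```rust
use std::path::Path;

/// Simplified stand-in for the `Visualizer` struct shown in the hunk.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct Visualizer {
    element: String,
    layer: Option<String>,
    vis_type: String,
    display_name: String,
    visibility: String,
}

/// Hypothetical helper: returns a video visualizer entry for file names with
/// one of the extensions listed in the hunk, including the new "webm".
fn video_visualizer_for(file_name: &str) -> Option<Visualizer> {
    // `extension()` is `None` for names without a dot-separated suffix.
    let ext = Path::new(file_name).extension()?.to_str()?;
    match ext {
        "mp4" | "avi" | "mov" | "webm" => Some(Visualizer {
            element: "node".to_string(),
            layer: None,
            vis_type: "video".to_string(),
            display_name: "video".to_string(),
            // Changed from "hidden" to "preloaded" in this PR.
            visibility: "preloaded".to_string(),
        }),
        _ => None,
    }
}
```

According to the commit message, video and audio visualizers need the special "preloaded" visibility, which is why both arms in the hunk switch away from "hidden".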
@@ -350,35 +350,36 @@ fn node_annos_vis(graph: &AnnotationGraph) -> Result<Visualizer, Box<dyn std::er
mappings.insert("annos".to_string(), node_names);
mappings.insert("escape_html".to_string(), "false".to_string());

let more_than_one_ordering = order_names.len() > 1;
let ordered_nodes_are_identical = {
more_than_one_ordering && {
let ordering_components =
graph.get_all_components(Some(AnnotationComponentType::Ordering), None);
let node_sets = ordering_components
.iter()
.map(|c| {
if let Some(strge) = graph.get_graphstorage(c) {
strge
.source_nodes()
.filter_map(|r| if let Ok(n) = r { Some(n) } else { None })
.collect::<BTreeSet<u64>>()
} else {
BTreeSet::default()
}
})
.collect_vec();
let mut all_same = true;
//for i in 1..node_sets.len()
for (a, b) in node_sets.into_iter().tuple_windows() {
all_same &= matches!(a.cmp(&b), Ordering::Equal);
}
all_same
let ordered_components_contain_identical_nodes = if order_names.len() > 1 {
let ordering_components =
graph.get_all_components(Some(AnnotationComponentType::Ordering), None);
let node_sets = ordering_components
.iter()
.map(|c| {
if let Some(strge) = graph.get_graphstorage(c) {
strge
.source_nodes()
.filter_map(|r| if let Ok(n) = r { Some(n) } else { None })
.collect::<BTreeSet<u64>>()
} else {
BTreeSet::default()
}
})
.collect_vec();
let mut all_same = true;
//for i in 1..node_sets.len()
for (a, b) in node_sets.into_iter().tuple_windows() {
all_same &= matches!(a.cmp(&b), Ordering::Equal);
}
all_same
} else {
// There is only one ordering component
true
};

mappings.insert(
"hide_tok".to_string(),
(!ordered_nodes_are_identical).to_string(),
(!ordered_components_contain_identical_nodes).to_string(),
);
mappings.insert("show_ns".to_string(), "false".to_string());
Ok(Visualizer {
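
Reviewer note: the new `hide_tok` logic can be reduced to a standard-library sketch. It is an illustration under assumptions, not the code from `node_annos_vis`. The function names are hypothetical, the input is a plain slice of node-ID sets instead of graph storages, and the "only one ordering component" case is decided by the number of sets rather than by `order_names.len()`.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: true if every ordering component covers the same set
/// of source nodes. With zero or one component the answer is `true`, which
/// mirrors the new `else` branch in the hunk.
fn components_share_nodes(node_sets: &[BTreeSet<u64>]) -> bool {
    if node_sets.len() > 1 {
        // Compare each neighbouring pair; equality is transitive, so this is
        // enough to show that all sets are identical.
        node_sets.windows(2).all(|pair| pair[0] == pair[1])
    } else {
        true
    }
}

/// The token row is hidden only when the ordering components differ.
fn hide_tok(node_sets: &[BTreeSet<u64>]) -> bool {
    !components_share_nodes(node_sets)
}
```

With the old code a single ordering component produced `hide_tok = true`. The restructured expression returns `true` for the identity check in that case, so the tokens stay visible when there is only one tokenisation.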