Metashare_import
Have the option to define different language resources, linguality info, etc., as they have e.g. at http://metashare.csc.fi/repo2/search2/
If we plan to have editors (a person other than the submitter who, e.g., improves the metadata quality), we should extend the metadataInfo component. If this won’t be the case, we should at least return a more reasonable creation date, to get something more like the image on the left instead of what we have now (right).
But the supplied minimum is more or less the same (there are differences, but they are “resource dependent”, i.e. for some they have iprHolder or usageInfo.forseenUse or resourceDocumentationInfo …). Have a look at these two items.
The metadata are available through OAI-PMH here. It is an XML document with an element for each resource. The content of these elements should already adhere to the metashare schema. To accomplish that, some values had to be slightly tweaked and some missing values had to be added (see Metashare_import#The_minimum below for what has to be there); the Metashare_import#Additional_fields are also included.
- The organization info is moved under contactPerson -> affiliation. The one provided email is used both as the personal email and as the email of the organization
- Metadata creation date is fixed at 1970-01-01
- metaShareId is set to NOT_DEFINED_FOR_V2
- If there is no funding project name, it is set to N/A
- All resources are of type ‘corpus’
- Media type is either text or audio
- All resources are marked as monolingual
- If the size info was missing, size is set to 0 bytes
- Language name is set to N/A

h2. Metadata definition

h3. The minimum

The minimal schema consists of five “top level” components:
- identificationInfo
Groups together at least the name of the resource, its description and metaShareId (set to a fixed value)
- distributionInfo
Groups availability and licence information
- contactPerson
Groups information on the person who is responsible for giving information about the resource: at least his/her surname and a way to communicate with him/her (at least an email)
- metadataInfo
Information on the metadata record, at least the date of creation of the metadata
- resourceComponentType
Which distinguishes between the four resource types used in metashare (corpus, tool/service, language description, lexical/conceptual resource).
For corpora we need to supply the media info (text, audio, video, image, textNumerical, textNgram) and for almost all media types the linguality type (mono/bi/multi lingual), language id(s) and name(s) and finally some size information (number and unit - bytes, words, etc.).
Tools or services have just the type information and an indicator of whether they are language dependent.
Language descriptions again have info on the type of description (grammar/other) and a media type (similar to corpus media types; the media types again come with linguality, size…)
Lexical/Conceptual resources - again resource type (wordlist, ontology, wordnet…) and media type
So the minimal resourceInfo boils down to this sequence
See example to get a more concrete idea of what the resource looks like.
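As a quick sanity check before full schema validation, the five top-level components listed above can be searched for in a file. This is only a sketch: the function name `missing_components` is hypothetical, it does a plain text search for opening tags (with an optional namespace prefix), and it says nothing about element order or content, so it does not replace validation against the schema.

```shell
# Hypothetical helper: print the name of every top-level component
# (from the list above) that has no opening tag in the given file.
# This is a text search only, not schema validation.
missing_components() {
    file=$1
    for name in identificationInfo distributionInfo contactPerson metadataInfo resourceComponentType; do
        # Match <name>, <name/>, <name attr=...> or <prefix:name ...>
        if ! grep -Eq "<([A-Za-z0-9_]+:)?${name}([[:space:]/>]|\$)" "$file"; then
            echo "$name"
        fi
    done
}
```

Calling `missing_components some-resource.xml` prints nothing when all five element names are present; any printed name is a component worth looking at in that file.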
Following items are not mandatory but either someone considered them important or they were mandatory in previous version of the schema:
- given name
- organization info (name and email)
- funding info (type of funding and project title)
- validation info (validated: whether the resource was validated in some way)
Use splitOAI.pl to create one XML file per resourceInfo in the current directory. In addition to just splitting the oai_metasharev2 output, it “sets” the correct namespaces and adds the identifier field (currently it points to …/xmlui/handle/…, but can easily be pointed to hdl.handle…).
mkdir resources
cd resources
perl ../splitOAI.pl
check that the created files are valid, as importing files that don’t adhere to the schema is discouraged
for i in *.xml; do xmllint --noout --schema /opt/metashare/META-SHARE/misc/schema/v2.0/META-SHARE-Resource.xsd "$i"; done
hopefully everything is valid (the crosswalk should fill all the needed values, put them in right order…)
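With many files, the validation output of the loop above can be hard to scan. A possible variant, given here only as a sketch, collects the names of the files that fail. The function names `validate_one` and `list_invalid` are hypothetical; the schema path is the one used above and can be overridden through the `SCHEMA` variable. The only assumption about xmllint is that it exits with a non-zero status when a file does not validate.

```shell
# Schema path as used in the command above; override by setting SCHEMA.
SCHEMA=${SCHEMA:-/opt/metashare/META-SHARE/misc/schema/v2.0/META-SHARE-Resource.xsd}

# Validate a single file. Kept separate so another validator can be swapped in.
validate_one() {
    xmllint --noout --schema "$SCHEMA" "$1"
}

# Print the name of every file that fails validation on stdout.
# Returns non-zero if at least one file failed.
list_invalid() {
    status=0
    for f in "$@"; do
        if ! validate_one "$f"; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

For example, `list_invalid *.xml > invalid.txt` leaves the validator messages on the terminal and writes only the failing file names to `invalid.txt`, so the import can be skipped or limited when that file is not empty.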
set these so that the correct version of python is used
export PATH=/opt/metashare/META-SHARE-v2.0/opt/bin:$PATH
export PYTHONPATH=/opt/metashare/META-SHARE-v2.0/lib/python2.7/site-packages:$PYTHONPATH
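Before importing, it may be worth confirming that the exports above took effect. The sketch below uses a hypothetical helper, `python_is_under`, which succeeds only if the first `python` found on `PATH` lives under a given directory; it assumes `python` is a regular executable and not a shell alias or function.

```shell
# Hypothetical check: succeed only if the first `python` on PATH
# is located under the directory given as the first argument.
python_is_under() {
    found=$(command -v python) || return 1
    case "$found" in
        "$1"/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With the paths used above, this could be called as `python_is_under /opt/metashare/META-SHARE-v2.0/opt/bin || echo "unexpected python on PATH" >&2`.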
now as root
cd /opt/metashare/META-SHARE/metashare
python import_xml.py /path/to/previously/created/resources/*.xml
and that should be all
Even though we are able to produce a valid output someone should probably review the resources and fill the correct values. Going through the licences/restrictions would also be a good idea.