These modules allow using lucene-gosen for processing Japanese text with Apache Stanbol.
This project contains the following modules:
- Bundle that provides Lucene Gosen support to the Stanbol Commons Solr Core module
- LabelTokenizer implementation for the EntityLinking engine
- Stanbol NLP processing Engine
- Bundlelist for users to include in their custom Stanbol Launcher configurations.
See the README files of those modules for more information.
As the Lucene-Gosen modules are not available on Maven Central (see Issue 20), users need to download the Gosen artifacts from the project homepage and install them manually to their local Maven repository.
Lucene-Gosen 2.0.2 with the Naist Chasen dictionary can be downloaded from here. After that you need to call
:::text
mvn install:install-file -Dfile=lucene-gosen-2.0.2-naist-chasen.jar \
    -DgroupId=com.google.code -DartifactId=lucene-gosen-naist-chasen \
    -Dversion=2.0.2 -Dpackaging=jar
The used groupId and artifactId are based on those used by lucene-gosen-ipadic:1.2.1, which is available on Maven Central.
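Once installed, the artifact can be referenced from a pom.xml using exactly the coordinates from the install command above:

:::xml
<dependency>
  <groupId>com.google.code</groupId>
  <artifactId>lucene-gosen-naist-chasen</artifactId>
  <version>2.0.2</version>
</dependency>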
After that you can build this project by calling

:::text
mvn install
This section provides information on how to configure a SolrCore to correctly index Japanese text using Lucene-Gosen.
To use the Gosen analyzers for Japanese you first need to define a fieldType. The following configuration is based on the recommendations of the Lucene-Gosen project:
:::xml
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-japanese.txt"/>
    <tokenizer class="solr.GosenTokenizerFactory"/> <!-- compositePOS="compositePOS.txt" dictionaryDir="dictionary/naist-chasen" -->
    <filter class="solr.GosenWidthFilterFactory"/>
    <filter class="solr.GosenPunctuationFilterFactory" enablePositionIncrements="true"/>
    <filter class="solr.GosenPartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" ignoreCase="false"/>
    <filter class="solr.GosenBasicFormFilterFactory"/>
    <filter class="solr.GosenKatakanaStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-japanese.txt"/>
    <tokenizer class="solr.GosenTokenizerFactory"/> <!-- compositePOS="compositePOS.txt" dictionaryDir="dictionary/naist-chasen" -->
    <filter class="solr.GosenWidthFilterFactory"/>
    <filter class="solr.GosenPunctuationFilterFactory" enablePositionIncrements="true"/>
    <filter class="solr.GosenPartOfSpeechStopFilterFactory" tags="stoptags_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt" enablePositionIncrements="true"/>
    <filter class="solr.KeywordMarkerFilterFactory" ignoreCase="false"/>
    <filter class="solr.GosenBasicFormFilterFactory"/>
    <filter class="solr.GosenKatakanaStemFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
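If you need a custom dictionary directory or composite POS handling, the attributes hinted at in the XML comments above can be set directly on the tokenizer. A minimal sketch using the values suggested by those comments:

:::xml
<tokenizer class="solr.GosenTokenizerFactory"
           compositePOS="compositePOS.txt"
           dictionaryDir="dictionary/naist-chasen"/>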
In addition you need to use this fieldType for Japanese texts. In the case of the Stanbol Entityhub this is done by adding a dynamicField definition for fields starting with @ja:
:::xml
<!-- Dynamic field for the Japanese language -->
<dynamicField name="@ja*" type="text_ja" indexed="true" stored="true" multiValued="true" omitNorms="false"/>
If you want a special field configuration for some properties, you also need to define field configurations for those. The following example shows how to enable termVectors (as used by Solr MLT queries) for the rdfs:comment field:

:::xml
<field name="@ja/rdfs:comment/" type="text_ja" indexed="true" stored="true" multiValued="true" omitNorms="false" termVectors="true"/>
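Note that MLT queries also require a MoreLikeThis request handler on the SolrCore. If your solrconfig.xml does not already register one, a minimal registration would look like the following (standard Solr configuration, not specific to this project):

:::xml
<requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>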
Typically it is recommended to start from the default.solrindex.zip and apply the desired changes.
- Extract the default.solrindex.zip to the "indexing/config" directory.
- Copy the lucene-gosen-2.0.2-naist-chasen.jar (downloaded during the installation step) to the lib directory of the SolrCore configuration ("indexing/config/default/lib"). Solr includes all jar files within this directory in the classpath, so the Gosen analyzers will be available during indexing.
- Rename the "indexing/config/default" directory to the {site-name} (the value of the "name" property of the "indexing/config/indexing.properties" file). As an alternative it is also possible to explicitly define the name of the configuration for the SolrYardIndexingDestination:

:::text
indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:{config-dir-name},boosts:fieldboosts
After that the Entityhub Indexing Tool will use the custom SolrCore configuration including Gosen for indexing.
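Putting both together, the relevant lines of "indexing/config/indexing.properties" could look like the following sketch, where the site name dbpedia-ja is a hypothetical example:

:::text
name=dbpedia-ja
indexingDestination=org.apache.stanbol.entityhub.indexing.destination.solryard.SolrYardIndexingDestination,solrConf:dbpedia-ja,boosts:fieldboosts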
If you want to create an empty SolrYard instance that uses a SolrCore configuration with Gosen, you need to:

- create a SolrCore configuration that includes the required fieldType and field definitions (see the start of this section)
- ZIP the edited SolrCore configuration and name the resulting archive {name}.solrindex.zip (see the sketch after this list)
- copy the {name}.solrindex.zip to the datafiles directory of your Stanbol instance ({working-dir}/stanbol/datafiles)
- create the SolrYard instance and configure the "Solr Index/Core" (org.apache.stanbol.entityhub.yard.solr.solrUri) to {name}. Make sure the "Use default SolrCore configuration" option (org.apache.stanbol.entityhub.yard.solr.useDefaultConfig) is disabled.
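A minimal packaging sketch, assuming the edited SolrCore configuration lives in a directory named gosen (a hypothetical name) and that the archive layout matches that of the default.solrindex.zip (verify this against the original archive):

:::text
zip -r gosen.solrindex.zip gosen
cp gosen.solrindex.zip {working-dir}/stanbol/datafiles/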
If you want to use your SolrCore configuration as the default for creating SolrYards, you need to use default as {name} (i.e. name the file default.solrindex.zip) and keep this file in the datafiles folder.
See also the documentation on how to configure a managed site.
The typical Enhancement Chain for Japanese texts will include the following engines:
- Language Detection: By default this engine uses the name langdetect. Make sure NOT to use the Apache Tika based language detection engine, as it does not support the detection of Japanese.
- Gosen NLP Engine: This engine is provided by this project. It is based on the Stanbol NLP processing module and supports tokenizing, sentence detection, Part of Speech (POS) tagging as well as Named Entity Recognition (NER).
- Entity Linking Engine: The Stanbol EntityLinking engine can consume NLP processing results and use them to look up Entities from a controlled vocabulary. If you have indexed Entities with Japanese labels in an Entityhub Site, you need to create and configure an Entityhub Linking Engine.
- For proper processing of Japanese labels of Entities, the Gosen LabelTokenizer module (provided by this project) needs to be installed in the Stanbol instance.
Note: As of now the Named Entity Linking Engine SHOULD NOT be used for Japanese Texts as it does not use Japanese specific processing of the Entity Labels. It is recommended to use the Entityhub Linking Engine and filter results based on the types of the Entities.
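As a sketch, such a chain can be configured as a List Chain via the Apache Felix Webconsole. The engine names other than langdetect are hypothetical placeholders for whatever names are configured in your instance:

:::text
stanbol.enhancer.chain.name=japanese
stanbol.enhancer.chain.list.enginelist=langdetect,gosen-nlp,entityhubLinking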
There are several options for how this can be done.
- If you want to create your own Stanbol Launcher, you simply need to add the gosen-bundlelist as a dependency to the pom.xml file of your launcher configuration:

:::xml
<dependency>
  <groupId>at.salzburgresearch.stanbol</groupId>
  <artifactId>at.salzburgresearch.stanbol.launchers.bundlelists.languageextras.gosen</artifactId>
  <version>0.10.0-SNAPSHOT</version>
  <type>partialbundlelist</type>
  <scope>provided</scope>
</dependency>

Doing so will ensure that the modules referenced by this bundle list will be included in your Stanbol Launcher. For more information on how to build custom launchers please see the documentation on the Stanbol webpage.
- Manually add the required modules to a Stanbol instance. For that you need to install the following modules to your Stanbol instance:
    - at.salzburgresearch.stanbol.commons.solr.extras.gosen:0.11.0-SNAPSHOT
    - at.salzburgresearch.stanbol.enhancer.engines.gosennlp:0.10.0-SNAPSHOT
    - at.salzburgresearch.stanbol.enhancer.engines.entitylinking.labeltokenizer.gosen

This is typically done either by using the Apache Felix Webconsole or by copying the according jar files to the stanbol/fileinstall directory of your Stanbol instance. If this directory does not yet exist, you need to create it. There are also other possibilities to install the required bundles to the Stanbol OSGi environment. The implementation of the modules is independent of the used start levels.
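For example, copying locally built bundles (the jar file names below are illustrative and depend on your build):

:::text
mkdir -p {working-dir}/stanbol/fileinstall
cp at.salzburgresearch.stanbol.commons.solr.extras.gosen-0.11.0-SNAPSHOT.jar \
   at.salzburgresearch.stanbol.enhancer.engines.gosennlp-0.10.0-SNAPSHOT.jar \
   {working-dir}/stanbol/fileinstall/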
Please also see the Stanbol Production Mode section for additional information.
These modules are dual-licensed under the GNU Lesser General Public License (as used by the Lucene-Gosen project) and the Apache Software License, Version 2.0.