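Focuses on several core problems in natural language processing: lexical analysis, syntactic analysis, semantic analysis, language detection, information extraction, text clustering, and text classification. Provides developers in related fields with a complete general-purpose design and reference implementation. Covers a variety of natural language processing algorithms and adapts to multiple natural language processing frameworks. Compatible with Lucene/Solr/ElasticSearch plugins.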

HongZhaoHua/jstarcraft-nlp

JStarCraft NLP



If you are passing by, please consider giving the JStarCraft framework a Star as encouragement for the author.




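Introduction

JStarCraft NLP is a lightweight engine for natural language processing, released under the Apache 2.0 license.

It focuses on several core problems in natural language processing:

  • Lexical analysis
  • Syntactic analysis
  • Semantic analysis
  • Information extraction
  • Text clustering
  • Text classification

It covers a variety of natural language processing algorithms and integrates several natural language processing frameworks. It provides developers in related fields with a general-purpose design and reference implementation that meet industrial-grade requirements, and promotes the use of natural language processing in the Java ecosystem.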


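Features

  • 1. Text relevance
    • Word relevance
    • Phrase relevance
    • Sentence relevance
    • Document relevance
  • 2. Text hashing
    • Locality-sensitive hashing
    • Bloom filter
  • 3. Lexical analysis
    • Word segmentation
    • Part-of-speech tagging
  • 4. Syntactic analysis
    • Constituency (phrase structure) parsing
    • Dependency parsing
  • 5. Semantic analysis
  • 6. Information extraction
  • 7. Text clustering
  • 8. Text classification
  • 9. Compatible with Lucene, Solr, and ElasticSearch
  • 10. Integration with third-party frameworks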
    • Ansj
    • Stanford CoreNLP
    • HanLP
    • IK
    • Jcseg
    • jieba
    • MMSEG
    • MYNLP
    • THULAC
    • word

Installation

JStarCraft NLP requires the following environment:

  • JDK 8 or later
  • Maven 3

Install the JStarCraft-Core framework

git clone https://github.com/HongZhaoHua/jstarcraft-core.git

mvn -f jstarcraft-core/pom.xml install -Dmaven.test.skip=true

Install the JStarCraft-AI framework

git clone https://github.com/HongZhaoHua/jstarcraft-ai.git

mvn -f jstarcraft-ai/pom.xml install -Dmaven.test.skip=true

Install the JStarCraft-NLP engine

git clone https://github.com/HongZhaoHua/jstarcraft-nlp.git

mvn -f jstarcraft-nlp/pom.xml install -Dmaven.test.skip=true

Usage

Dependencies

  • Maven dependency
<dependency>
    <groupId>com.jstarcraft</groupId>
    <artifactId>nlp</artifactId>
    <version>1.0</version>
</dependency>
  • Gradle dependency
implementation group: 'com.jstarcraft', name: 'nlp', version: '1.0'

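Configuring third-party frameworks

Configuring Ansj
Name Function Default
Configuring Stanford-CoreNLP
Name Function Default
Configuring HanLP
Name Function Default
Configuring IK
Name Function Default
Configuring Jcseg
Name Function Default
Configuring jieba
Name Function Default
Configuring MMSEG
Name Function Default
Configuring MYNLP
Name Function Default
Configuring THULAC
Name Function Default
Configuring word
Name Function Default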

Architecture


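Concepts

Information entropy

**Information entropy** measures how varied the external collocations of a fragment are: the more distinct neighbors appear on its left and right, the higher the entropy.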
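One common way to make this concrete, for example in new-word discovery, is to compute the Shannon entropy of the characters that appear immediately to the left and to the right of a fragment. The sketch below illustrates that idea only; it is not taken from the JStarCraft NLP code base, and the class `BoundaryEntropy`, its methods, and the toy corpus are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of JStarCraft NLP.
class BoundaryEntropy {

    /** Shannon entropy, in bits, of a table of neighbor counts. */
    static double entropy(Map<Character, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        double entropy = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / total;
            entropy -= p * Math.log(p) / Math.log(2);
        }
        return entropy;
    }

    /** Counts the characters directly to the right of every occurrence of fragment. */
    static Map<Character, Integer> rightNeighbors(String corpus, String fragment) {
        return neighbors(corpus, fragment, true);
    }

    /** Counts the characters directly to the left of every occurrence of fragment. */
    static Map<Character, Integer> leftNeighbors(String corpus, String fragment) {
        return neighbors(corpus, fragment, false);
    }

    private static Map<Character, Integer> neighbors(String corpus, String fragment, boolean right) {
        if (fragment.isEmpty()) {
            throw new IllegalArgumentException("fragment must not be empty");
        }
        Map<Character, Integer> counts = new HashMap<>();
        int index = corpus.indexOf(fragment);
        while (index >= 0) {
            int position = right ? index + fragment.length() : index - 1;
            if (position >= 0 && position < corpus.length()) {
                counts.merge(corpus.charAt(position), 1, Integer::sum);
            }
            index = corpus.indexOf(fragment, index + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy corpus: "AB" appears in four different contexts.
        String corpus = "1AB2 3AB4 1AB2 5AB4";
        // "AB" has varied neighbors on both sides, so both entropies are positive.
        System.out.println(entropy(leftNeighbors(corpus, "AB")));
        System.out.println(entropy(rightNeighbors(corpus, "AB")));
        // "A" is always followed by "B", so its right entropy is zero:
        // "A" alone is unlikely to be a complete unit.
        System.out.println(entropy(rightNeighbors(corpus, "A")));
    }
}
```

A fragment is usually treated as a plausible standalone unit only when both its left and right entropies are high, so taking the smaller of the two is a reasonable single score.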

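Mutual information

**Mutual information** measures how fixed the internal collocation of a fragment is: the more often its parts occur together rather than apart, the higher the mutual information.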
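Internal cohesion of this kind is often estimated with pointwise mutual information (PMI), log2(p(xy) / (p(x) p(y))), taking the weakest split point of the fragment as its score. The sketch below assumes that formulation and estimates probabilities by dividing raw counts by the corpus length, which is a simplification. It is not taken from the JStarCraft NLP code base; the class `InternalCohesion` and its methods are hypothetical.

```java
// Hypothetical helper, not part of JStarCraft NLP.
class InternalCohesion {

    /** Counts occurrences (overlaps included) of a non-empty fragment in the corpus. */
    static int count(String corpus, String fragment) {
        if (fragment.isEmpty()) {
            throw new IllegalArgumentException("fragment must not be empty");
        }
        int occurrences = 0;
        int index = corpus.indexOf(fragment);
        while (index >= 0) {
            occurrences++;
            index = corpus.indexOf(fragment, index + 1);
        }
        return occurrences;
    }

    /** Pointwise mutual information, in bits, between two adjacent parts. */
    static double pmi(String corpus, String left, String right) {
        int joint = count(corpus, left + right);
        if (joint == 0) {
            // The parts never occur together in this corpus.
            return Double.NEGATIVE_INFINITY;
        }
        // Simplification: every probability is a count divided by the corpus length.
        double n = corpus.length();
        double pJoint = joint / n;
        double pLeft = count(corpus, left) / n;
        double pRight = count(corpus, right) / n;
        return Math.log(pJoint / (pLeft * pRight)) / Math.log(2);
    }

    /** Cohesion of a fragment: the lowest PMI over all of its split points. */
    static double cohesion(String corpus, String fragment) {
        if (fragment.length() < 2) {
            throw new IllegalArgumentException("fragment needs at least two characters");
        }
        double weakest = Double.POSITIVE_INFINITY;
        for (int split = 1; split < fragment.length(); split++) {
            String left = fragment.substring(0, split);
            String right = fragment.substring(split);
            weakest = Math.min(weakest, pmi(corpus, left, right));
        }
        return weakest;
    }

    public static void main(String[] args) {
        String corpus = "ABxxAByy";
        // "A" and "B" always occur together, so "AB" is more cohesive than "Bx".
        System.out.println(pmi(corpus, "A", "B"));
        System.out.println(pmi(corpus, "B", "x"));
        System.out.println(cohesion(corpus, "ABx"));
    }
}
```

Using the minimum over split points means a fragment scores well only if no internal boundary is weak, which matches the notion of a "fixed" internal collocation.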


Examples


Comparison


Versions


Reference

Part-of-speech tag set

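| Code | Name | Word class | Origin of the code |
| --- | --- | --- | --- |
| A | Adjective | Content word | 1st letter of "adjective" |
| C | Conjunction | Function word | 1st letter of "conjunction" |
| D | Adverb | Function word | 2nd letter of "adverb" |
| E | Interjection | Function word | 1st letter of "exclamation" |
| M | Numeral | Content word | 3rd letter of "numeral" |
| N | Noun | Content word | 1st letter of "noun" |
| O | Onomatopoeia | Function word | 1st letter of "onomatopoeia" |
| P | Preposition | Function word | 1st letter of "preposition" |
| Q | Quantifier (measure word) | Content word | 1st letter of "quantity" |
| R | Pronoun | Content word | 2nd letter of "pronoun" |
| T | Article | Function word | 3rd letter of "article" |
| U | Auxiliary (particle) | Function word | 2nd letter of "auxiliary" |
| V | Verb | Content word | 1st letter of "verb" |
| W | Punctuation | | |
| X | Unknown | | |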
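For readers who want to work with these codes in Java, the tag set can be modeled as an enum. The sketch below is only an illustration: the names `PosTag`, `WordClass`, and `fromCode` are hypothetical and do not come from the JStarCraft NLP API, and the `OTHER` class for W and X is an assumption, since the tag set gives no word class for them.

```java
import java.util.Locale;

// Hypothetical types, not part of JStarCraft NLP.
enum WordClass { CONTENT, FUNCTION, OTHER }

enum PosTag {
    A("adjective", WordClass.CONTENT),
    C("conjunction", WordClass.FUNCTION),
    D("adverb", WordClass.FUNCTION),
    E("interjection", WordClass.FUNCTION),
    M("numeral", WordClass.CONTENT),
    N("noun", WordClass.CONTENT),
    O("onomatopoeia", WordClass.FUNCTION),
    P("preposition", WordClass.FUNCTION),
    Q("quantifier", WordClass.CONTENT),
    R("pronoun", WordClass.CONTENT),
    T("article", WordClass.FUNCTION),
    U("auxiliary", WordClass.FUNCTION),
    V("verb", WordClass.CONTENT),
    // Assumption: W and X have no word class in the tag set, so OTHER is used here.
    W("punctuation", WordClass.OTHER),
    X("unknown", WordClass.OTHER);

    final String label;
    final WordClass wordClass;

    PosTag(String label, WordClass wordClass) {
        this.label = label;
        this.wordClass = wordClass;
    }

    /** Resolves a one-letter code, ignoring case; unrecognized codes map to X. */
    static PosTag fromCode(String code) {
        try {
            return valueOf(code.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException exception) {
            return X;
        }
    }
}

class PosTagDemo {
    public static void main(String[] args) {
        PosTag tag = PosTag.fromCode("n");
        System.out.println(tag + " " + tag.label + " " + tag.wordClass);
    }
}
```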

Language detection

Code Name
cmn Mandarin Chinese
spa Spanish
eng English
rus Russian
arb Standard Arabic
ben Bengali
hin Hindi
por Portuguese
ind Indonesian
jpn Japanese
fra French
deu German
jav Javanese
kor Korean
tel Telugu
vie Vietnamese
mar Marathi
ita Italian
tam Tamil
tur Turkish
urd Urdu
guj Gujarati
pol Polish
ukr Ukrainian
fas Persian
kan Kannada
mai Maithili
mal Malayalam
mya Burmese
ori Oriya (macrolanguage)
gax Borana-Arsi-Guji Oromo
swh Swahili (individual language)
sun Sundanese
ron Romanian
pan Panjabi
bho Bhojpuri
amh Amharic
hau Hausa
fuv Nigerian Fulfulde
bos Bosnian (Cyrillic)
bos Bosnian (Latin)
hrv Croatian
nld Dutch
srp Serbian (Cyrillic)
srp Serbian (Latin)
tha Thai
ckb Central Kurdish
yor Yoruba
uzn Northern Uzbek (Cyrillic)
uzn Northern Uzbek (Latin)
zlm Malay (individual language) (Arabic)
zlm Malay (individual language) (Latin)
ibo Igbo
nep Nepali (macrolanguage)
ceb Cebuano
skr Saraiki
tgl Tagalog
hun Hungarian
azj North Azerbaijani (Cyrillic)
azj North Azerbaijani (Latin)
sin Sinhala
koi Komi-Permyak
ell Modern Greek (1453-)
ces Czech
run Rundi
bel Belarusian
plt Plateau Malagasy
qug Chimborazo Highland Quichua
mad Madurese
nya Nyanja
zyb Yongbei Zhuang
pbu Northern Pashto
kin Kinyarwanda
zul Zulu
bul Bulgarian
swe Swedish
lin Lingala
som Somali
hms Southern Qiandong Miao
hnj Hmong Njua
ilo Iloko
kaz Kazakh
uig Uighur (Arabic)
uig Uighur (Latin)
hat Haitian
khm Khmer
aka Akan
hil Hiligaynon
sna Shona
tat Tatar
xho Xhosa
hye Armenian
min Minangkabau
afr Afrikaans
lua Luba-Lulua
sat Santali
bod Tibetan
tir Tigrinya
fin Finnish
slk Slovak
tuk Turkmen (Cyrillic)
tuk Turkmen (Latin)
dan Danish
nob Norwegian Bokmål
suk Sukuma
als Tosk Albanian
sag Sango
nno Norwegian Nynorsk
heb Hebrew
mos Mossi
tgk Tajik
cat Catalan
sot Southern Sotho
kat Georgian
bcl Central Bikol
glg Galician
lao Lao
lit Lithuanian
umb Umbundu
tsn Tswana
vec Venetian
nso Pedi
ban Balinese
bug Buginese
knc Central Kanuri
kng Koongo
ibb Ibibio
lug Ganda
ace Achinese
bam Bambara
tzm Central Atlas Tamazight
ydd Eastern Yiddish
kmb Kimbundu
lun Lunda
shn Shan
war Waray (Philippines)
dyu Dyula
wol Wolof
kir Kirghiz
nds Low German
fuf Pular
mkd Macedonian
vmw Makhuwa
zgh Standard Moroccan Tamazight
ewe Ewe
khk Halh Mongolian
slv Slovenian
ayr Central Aymara
bem Bemba (Zambia)
emk Eastern Maninkakan
bci Baoulé
bum Bulu (Cameroon)
epo Esperanto
pam Pampanga
tiv Tiv
tpi Tok Pisin
ven Venda
ssw Swati
nyn Nyankole
kbd Kabardian
iii Sichuan Yi
yao Yao
lav Latvian
quz Cusco Quechua
src Logudorese Sardinian
sco Scots
tso Tsonga
rmy Vlax Romani
men Mende (Sierra Leone)
fon Fon
nhn Central Nahuatl
dip Northeastern Dinka
kde Makonde
snn Siona
kbp Kabiyè
tem Timne
toi Tonga (Zambia)
est Estonian
snk Soninke
cjk Chokwe
ada Adangme
aii Assyrian Neo-Aramaic
quy Ayacucho Quechua
rmn Balkan Romani
bin Bini
gaa Ga
ndo Ndonga
nym Nyamwezi
sus Susu
tly Talysh
srr Serer
kha Khasi
hea Northern Qiandong Miao
gkp Guinea Kpelle
hni Hani
fry Western Frisian
yua Yucateco
fij Fijian
fur Friulian
tet Tetum
wln Walloon
eus Basque
oss Ossetian
nbl South Ndebele
pov Upper Guinea Crioulo
cym Welsh
lus Lushai
dag Dagbani
dga Southern Dagaare
bre Breton
kek Kekchí
lij Ligurian
pcd Picard
roh Romansh
bfa Bari
kri Krio
cnh Hakha Chin
lob Lobi
arn Mapudungun
bba Baatonum
dzo Dzongkha
kea Kabuverdianu
sah Yakut
smo Samoan
koo Konzo
nzi Nzima
maz Central Mazahua
pis Pijin
ctd Tedim Chin
cos Corsican
ltz Luxembourgish
lia West-Central Limba
mlt Maltese
hna Mina (Cameroon)
zdj Ngazidja Comorian
guc Wayuu
qwh Huaylas Ancash Quechua
quc K'iche'
div Dhivehi
isl Icelandic
kqn Kaonde
pap Papiamento
gle Irish
dyo Jola-Fonyi
hns Caribbean Hindustani
gjn Gonja
njo Ao Naga
hus Huastec
mag Magahi
xsm Kasem
ote Mezquital Otomi
qxn Northern Conchucos Ancash Quechua
tyv Tuvinian
gag Gagauz
san Sanskrit
shk Shilluk
nba Nyemba
miq Mískito
mam Mam
tah Tahitian
nav Navajo
ami Amis
lot Otuho
cak Kaqchikel
tzh Tzeltal
tzo Tzotzil
lns Lamnso'
ton Tonga (Tonga Islands)
tbz Ditammari
lad Ladino
vai Vai
mto Totontepec Mixe
ady Adyghe
abk Abkhazian
ast Asturian
tsz Purepecha
swb Maore Comorian
cab Garifuna
krl Karelian
zam Miahuatlán Zapotec
top Papantla Totonac
cha Chamorro
crs Seselwa Creole French
ddn Dendi (Benin)
loz Lozi
mri Maori
hsb Upper Sorbian
cri Sãotomense
pbb Páez
alt Southern Altai
qva Ambo-Pasco Quechua
mxv Metlatónoc Mixtec
gla Scottish Gaelic
kjh Khakas
csw Swampy Cree
qvm Margos-Yarowilca-Lauricocha Quechua
fao Faroese
kal Kalaallisut
cni Asháninka
chk Chuukese
mah Marshallese
rar Rarotongan
evn Evenki
qvn North Junín Quechua
wwa Waama
buc Bushi
qvh Huamalíes-Dos de Mayo Huánuco Quechua
toj Tojolabal
lue Luvale
qvc Cajamarca Quechua
ojb Northwestern Ojibwa
jiv Shuar
qud Calderón Highland Quichua
lld Ladin
hlt Matu Chin
que Quechua
pon Pohnpeian
agr Aguaruna
qxa Chiquián Ancash Quechua
quh South Bolivian Quechua
tca Ticuna
chj Ojitlán Chinantec
ike Eastern Canadian Inuktitut
kwi Awa-Cuaiquer
rgn Romagnol
oki Okiek
tob Toba
guu Yanomamö
qxu Arequipa-La Unión Quechua
pau Palauan
shp Shipibo-Conibo
gld Nanai
gug Paraguayan Guaraní
mzi Ixcatlán Mazatec
cjs Shor
mic Mi'kmaq
haw Hawaiian
eve Even
yap Yapese
cbt Chayahuita
ame Yanesha'
gyr Guarayu
vep Veps
cpu Pichis Ashéninka
acu Achuar-Shiwiar
not Nomatsiguenga
sme Northern Sami
yad Yagua
ura Urarina
cbu Candoshi-Shapra
huu Murui Huitoto
cof Colorado
boa Bora
ztu Güilá Zapotec
piu Pintupi-Luritja
cbr Cashibo-Cacataibo
mcf Matsés
bis Bislama
orh Oroqen
ykg Northern Yukaghir
ese Ese Ejja
nio Nganasan
cic Chickasaw
csa Chiltepec Chinantec
mcd Sharanahua
amc Amahuaca
amr Amarakaeri
cot Caquinte
oaa Orok
ajg Aja (Benin)
arl Arabela
ppl Pipil
bax Bamun
nku Bouna Kulango
cbi Chachi
ccp Chakma
chr Cherokee (Cherokee)
duu Drung
cfm Falam Chin
fat Fanti
ido Ido
ina Interlingua (International Auxiliary Language Association)
kkh Khün
ktu Kituba (Democratic Republic of Congo)
fkv Kven Finnish
lat Latin
glv Manx
mfq Moba
mnw Mon
mxi Mozarabic
pcm Nigerian Pidgin
niu Niuean
kqs Northern Kissi
sey Secoya
ekk Standard Estonian
lvs Standard Latvian
blt Tai Dam
kdh Tem
tdt Tetun Dili
twi Twi (Latin)
auc Waorani
gaz West Central Oromo
pnb Western Panjabi
zro Záparo

License

JStarCraft NLP is released under the Apache 2.0 license. Any derivative work based on it belongs to the author of that derivative work.


Author

Author 洪钊桦 (Hong Zhaohua)
E-mail [email protected], [email protected]

Acknowledgements

