fastText has an upper limit on the number of labels #44

Open
yuyoyth opened this issue May 18, 2022 · 5 comments

Comments

@yuyoyth

yuyoyth commented May 18, 2022

This is the output printed when the train file is read:

Number of words:  2148
Number of labels: 185898
Max threshold count: 2
Number of wordHash2Id: 250728

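As shown, the number of labels read tops out at 185898, whereas the train file I supplied contains more than 1,300,000 labels. To rule out a data problem, I split the original train file into 9 files in chunks of 150,000 and read each one in turn. Every file returned the expected label count, so a problem with the data file can essentially be ruled out.
Does fastText deliberately set this upper limit, or is there a cap on how much of the file is read? The original train file is 480 MB; after splitting, the largest file is 52 MB.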
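One way to cross-check the reported figure independently of the library is to count the distinct labels in the train file directly. The following is a minimal sketch, not part of this project: the class name `LabelCounter` is hypothetical, and it assumes the usual fastText convention that labels are whitespace-separated tokens starting with `__label__`. It also reports how many labels occur exactly once, since the log above prints a threshold count; whether this library actually prunes labels by frequency is an assumption that this thread does not confirm.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

/** Hypothetical helper: counts distinct __label__ tokens in a train file. */
class LabelCounter {
    static final String LABEL_PREFIX = "__label__";

    /** Returns label -> number of occurrences across all lines. */
    static Map<String, Integer> countLabels(Stream<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        lines.forEach(line -> {
            for (String token : line.trim().split("\\s+")) {
                if (token.startsWith(LABEL_PREFIX)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        });
        return counts;
    }

    /** Number of labels that occur exactly once. */
    static long countSingletons(Map<String, Integer> counts) {
        return counts.values().stream().filter(c -> c == 1).count();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java LabelCounter <trainFile>");
            return;
        }
        // Files.lines streams the file lazily, so a 480 MB file is not loaded at once.
        try (Stream<String> lines = Files.lines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            Map<String, Integer> counts = countLabels(lines);
            System.out.println("Distinct labels:  " + counts.size());
            System.out.println("Labels seen once: " + countSingletons(counts));
        }
    }
}
```

If the distinct count here is 1,300,000+ while the library reports 185898, the gap arises while the dictionary is built rather than in the data.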

@yuyoyth
Author

yuyoyth commented May 18, 2022

The training code is:

InputArgs inputArgs = new InputArgs();
inputArgs.setLoss(LossName.ns);
inputArgs.setThread(15);
inputArgs.setEpoch(100);
inputArgs.setLr(0.5);
inputArgs.setDim(100);
FastText model = FastText.trainSupervised(trainFile, inputArgs);

@jimichan
Member

Are you sure this is a classification problem? That is a very large number of labels.

@yuyoyth
Author

yuyoyth commented May 18, 2022

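I want to map fuzzy text to a unique id, so that it still matches as closely as possible even when characters are missing or extra. I built a dedicated encoding for Chinese characters for this purpose, hoping that similar characters would also match.
Below is one line from the train file for reference: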

__label__00004e937c254cef906f24ae819ed540   78542508029 AE010320006 GG032906029 FC42168327G 4D012106046 F7022402279 AE010320006 F0012304046 K702C430145 FD442777327 FJ542102273 G401127754A 5A02137120C GE04184781F F803134117C FJ51130127C 3G041342107 6C018717144 E0042101002 5E031271128 7 2 9A042600275

@jimichan
Member

For this you should use an approach such as word vectors or simhash, not text classification.
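As a rough illustration of the simhash idea, here is a minimal sketch. It is not code from this library: the class name `SimHashSketch`, the choice of 64-bit FNV-1a as the per-token hash, and the use of the whitespace-separated encoded tokens from the train line as features are all assumptions. Each token votes on every bit of a 64-bit fingerprint, and two texts are compared by the Hamming distance between their fingerprints.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

/** Hypothetical 64-bit SimHash over a list of tokens. */
class SimHashSketch {

    /** 64-bit FNV-1a hash of a token; any well-mixed 64-bit hash would do. */
    static long fnv1a64(String token) {
        long hash = 0xcbf29ce484222325L;          // FNV offset basis
        for (byte b : token.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;               // FNV prime
        }
        return hash;
    }

    /** Each token votes +1/-1 per bit; a positive tally sets that fingerprint bit. */
    static long simHash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = fnv1a64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= (1L << bit);
            }
        }
        return fingerprint;
    }

    /** Number of differing bits between two fingerprints. */
    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        // Tokens taken from the sample train line; the query drops one token.
        List<String> stored = Arrays.asList(
                "78542508029", "AE010320006", "GG032906029", "FC42168327G", "4D012106046");
        List<String> query = Arrays.asList(
                "78542508029", "AE010320006", "GG032906029", "4D012106046");
        long a = simHash(stored);
        long b = simHash(query);
        System.out.println("Hamming distance: " + hammingDistance(a, b));
    }
}
```

With this approach, each unique id would store one fingerprint, and a lookup would return the id whose fingerprint has the smallest Hamming distance to the query. No label dictionary is involved, so the label count stops being a constraint.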

@yuyoyth
Author

yuyoyth commented May 18, 2022

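For this you should use an approach such as word vectors or simhash, not text classification.

Thanks for the suggestion. I'll try switching to a different approach.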
