Incorrect values get_sentiment and get_nrc_sentiment with Swedish text #39

Open
lisagy opened this issue Aug 27, 2021 · 1 comment

lisagy commented Aug 27, 2021

Hi! I am very new to R, GitHub, and coding in general, so I apologize in advance for any mistakes!

I am trying to run a sentiment analysis of a Swedish novel with the syuzhet package, but I noticed that the get_sentiment and get_nrc_sentiment functions assign incorrect values to certain words. I first noticed it with my custom lexicon, then ran a test with the NRC lexicon as well and saw that both give incorrect values for words containing the letters ö, ä, and å. Most of the time these words get a value of 0 (when they should get 1 or -1), but I have also seen a case where a word was assigned a positive value (1) when it should have been negative (-1).

I've changed RStudio's default encoding to UTF-8 and my system's locale to Swedish, but neither has helped.
How can I solve this problem? This is the code I use to get my results:

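# For the NRC lexicon

library(readr)    # provides read_file()
library(syuzhet)  # provides get_tokens(), get_nrc_sentiment(), get_sentiment()

binas_historia <- read_file(file.choose())
bina_words <- get_tokens(binas_historia, pattern = "\\W")
sentiment_b_nrc <- get_nrc_sentiment(bina_words, language = "swedish")
# Put each token next to its NRC scores
overzichtje_nrc <- data.frame(bina_words, sentiment_b_nrc)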

# For the Swedish (custom) lexicon

binas_historia <- read_file(file.choose())
bina_words <- get_tokens(binas_historia, pattern = "\\W")
sensaldo_lexicon <- read.table("HP/Thesis/sensaldo-fullform.txt",
                               header = FALSE,
                               col.names = c("word", "category", "value"),
                               colClasses = c("character", "character", "numeric"),
                               encoding = "UTF-8")
sentiment_b_s <- get_sentiment(bina_words, method = "custom", lexicon = sensaldo_lexicon)
overzichtje_sensaldo <- data.frame(bina_words, sentiment_b_s)

@mjockers
Owner

Since leaving academia, I rarely find time to work on this package anymore. Support for non-English languages is weak. I encourage you to develop a solution and submit it as a PR.
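
One possible place to start looking, offered as an unverified guess rather than a diagnosis: if words are split or matched with an ASCII-only pattern anywhere along the way, every ö, ä, or å acts as a word boundary, so the fragments either miss the lexicon (value 0) or hit an unrelated entry (wrong sign). The base R sketch below illustrates that effect and contrasts it with a direct `match()` lookup that keeps tokens whole. The toy lexicon, the tokens, and the helper `lookup_value()` are hypothetical and are not part of syuzhet or SenSALDO.

```r
# Hypothetical toy lexicon, for illustration only (not real SenSALDO data).
toy_lexicon <- data.frame(
  word  = c("k\u00e4rlek", "d\u00f6d", "gl\u00e4dje"),  # kärlek, död, glädje
  value = c(1, -1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical tokens, as get_tokens() might return them.
tokens <- c("k\u00e4rlek", "och", "d\u00f6d")

# An ASCII-only pattern treats "ä" as a separator and breaks the word apart.
ascii_split <- strsplit("k\u00e4rlek", "[^A-Za-z']+", perl = TRUE)[[1]]
print(ascii_split)  # "k" "rlek"

# Direct lookup that never re-splits the tokens (hypothetical helper).
# Tokens missing from the lexicon get 0.
lookup_value <- function(tokens, lexicon) {
  idx <- match(enc2utf8(tokens), enc2utf8(lexicon$word))
  ifelse(is.na(idx), 0, lexicon$value[idx])
}

direct_values <- lookup_value(tokens, toy_lexicon)
print(data.frame(tokens, direct_values))
```

If the direct lookup returns the expected values on the real tokens and lexicon, the lexicon file and its encoding are probably fine, and the discrepancy more likely comes from how the words are split or matched afterwards.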
