Example needed for tidy approach for stm modeling with covariates #173

Open
jooyoungseo opened this issue May 3, 2020 · 3 comments

@jooyoungseo

The current tidytext documentation on the tidy approach to stm objects has no specific example of how to add covariates.

I tried this with the stm::gadarian data and the covariate formula prevalence = ~treatment + s(pid_rep), but ran into an error.

Would you mind adding an example of this kind of model to the tidytext package documentation?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(tidytext)

glimpse(gadarian)
#> Rows: 341
#> Columns: 4
#> $ MetaID              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ treatment           <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,...
#> $ pid_rep             <dbl> 1.00000, 1.00000, 0.33300, 0.50000, 0.66667, 0....
#> $ open.ended.response <chr> "problems caused by the influx of illegal immig...

gadarian2 <- gadarian %>%
  mutate(document = row_number())

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian2,
  verbose = FALSE
)
#> Error in stm(gadarian_sparse, K = 3, init.type = "Spectral", prevalence = ~treatment + : number of observations in content covariate (335) prevalence covariate (341) and documents (335) are not all equal.

Created on 2020-05-03 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-05-03                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                              
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                      
#>  backports     1.1.6      2020-04-05 [1] CRAN (R 4.0.0)                      
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                      
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                      
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                      
#>  data.table    1.12.8     2019-12-09 [1] CRAN (R 4.0.0)                      
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                      
#>  devtools      2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)     
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                      
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 4.0.0)                      
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 4.0.0)                      
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                      
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                      
#>  fs            1.4.1      2020-04-04 [1] CRAN (R 4.0.0)                      
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                      
#>  glue          1.4.0      2020-04-03 [1] CRAN (R 4.0.0)                      
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                      
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 4.0.0)                      
#>  janeaustenr   0.1.5      2017-06-10 [1] CRAN (R 4.0.0)                      
#>  knitr         1.28.5     2020-04-28 [1] Github (yihui/knitr@93b46ba)        
#>  lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.0)                      
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                      
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                      
#>  Matrix        1.2-18     2019-11-27 [1] CRAN (R 4.0.0)                      
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                      
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 4.0.0)                      
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)                      
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                      
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)                      
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                      
#>  processx      3.4.2      2020-02-09 [1] CRAN (R 4.0.0)                      
#>  ps            1.3.2      2020-02-13 [1] CRAN (R 4.0.0)                      
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                      
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                      
#>  Rcpp          1.0.4.6    2020-04-09 [1] CRAN (R 4.0.0)                      
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                      
#>  rlang         0.4.6      2020-05-02 [1] CRAN (R 4.0.0)                      
#>  rmarkdown     2.1.3      2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)  
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                      
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                      
#>  SnowballC     0.7.0      2020-04-01 [1] CRAN (R 4.0.0)                      
#>  stm         * 1.3.5      2020-04-28 [1] Github (bstewart/stm@c95ef0b)       
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                      
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                      
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                      
#>  tibble        3.0.1      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 4.0.0)                      
#>  tidytext    * 0.2.4      2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#>  tokenizers    0.2.1      2018-03-29 [1] CRAN (R 4.0.0)                      
#>  usethis       1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)      
#>  utf8          1.1.4      2018-05-24 [1] CRAN (R 4.0.0)                      
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 4.0.0)                      
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  xfun          0.13.1     2020-04-30 [1] Github (yihui/xfun@bf8afdd)         
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                      
#> 
#> [1] C:/Program Files/R/R-4.0.0/library
@juliasilge
Owner

The main problem is that removing stop words removes some documents entirely, namely the responses that consist only of stop words. When you then use the data argument in the stm() function for the prevalence and/or content covariates, the number of observations doesn't line up: gadarian has 341 observations, but gadarian_sparse has only 335 documents. You can get this to work if you don't remove stop words:

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

gadarian_sparse <- gadarian %>%
  mutate(document = row_number()) %>%
  unnest_tokens(word, open.ended.response) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

topic_model <- stm(
  gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = gadarian,
  verbose = FALSE
)

summary(topic_model)
#> A topic model with 3 topics, 341 documents and a 1512 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: the, to, of, people, is, in, country 
#>       FREX: from, come, coming, if, entering, illegally, united 
#>       Lift: afraid, if, mean, unsecured, been, entering, from 
#>       Score: the, to, from, coming, people, come, it 
#> Topic 2 Top Words:
#>       Highest Prob: that, and, a, i, they, not, our 
#>       FREX: that, they, we, have, pay, so, usa 
#>       Lift: asians, east, indians, usa, bums, contibution, goverment 
#>       Score: that, we, they, not, our, have, here 
#> Topic 3 Top Words:
#>       Highest Prob: for, immigrants, illegal, of, and, jobs, our 
#>       FREX: security, social, job, health, mexico, workers, loss 
#>       Lift: caused, ducation, hospitals, lowering, quality, bombings, killing 
#>       Score: illegal, for, security, jobs, immigrants, loss, our

Created on 2020-05-04 by the reprex package (v0.3.0)

Another option, if removing stop words is important for your topic model, is to create a new data frame of covariates that contains only the observations in gadarian_sparse.
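
To see which responses disappear when stop words are removed, a minimal sketch along these lines may help. It assumes the same gadarian data as above; the names tidy_words and dropped are hypothetical and introduced only for illustration.

```r
library(dplyr)
library(tidytext)
library(stm)  # provides the gadarian data

gadarian2 <- gadarian %>%
  mutate(document = row_number())

# One row per token that survives stop-word removal.
tidy_words <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words, by = "word")

# Responses with no tokens left; these never reach cast_sparse().
dropped <- gadarian2 %>%
  anti_join(tidy_words, by = "document")

nrow(dropped)
dropped$open.ended.response
```

The rows of dropped are the observations that the covariate data frame would also need to leave out before it is passed to stm().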

I think a good option would be to rewrite / expand the topic modeling vignette to use stm throughout and add a section for document-level covariates. It needs some updating anyway.

@jooyoungseo
Author

Thank you very much for your kind explanation, @juliasilge!

Following your advice, I got it to work. What do you think of my approach below?

library(tidytext)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stm)
#> stm v1.3.5 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com

gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words) %>%
  count(document, word) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2)
#> Joining, by = "document"

topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = covariate_df,
  verbose = FALSE
)

summary(topic_model)
#> A topic model with 3 topics, 335 documents and a 1160 word dictionary.
#> Topic 1 Top Words:
#>       Highest Prob: taxes, security, illegals, immigrants, english, language, social 
#>       FREX: 1, law, taxes, terrorists, due, lost, 3 
#>       Lift: extent, fined, fullest, ileagles, on't, sneack, buttons 
#>       Score: 1, assimilate, security, english, law, 3, recieve 
#> Topic 2 Top Words:
#>       Highest Prob: jobs, illegal, immigration, welfare, country, care, americans 
#>       FREX: healthcare, cost, hospitals, strain, welfare, lack, im 
#>       Lift: crowding, hospitals, cheap, draining, allowing, immigrates, sealing 
#>       Score: jobs, im, cost, loss, welfare, capitalist, question 
#> Topic 3 Top Words:
#>       Highest Prob: people, immigrants, illegal, country, immigration, coming, border 
#>       FREX: people, coming, live, illegally, process, means, support 
#>       Lift: live, coming, term, false, process, required, people 
#>       Score: people, coming, process, illegally, stop, businesses, suffering

Created on 2020-05-04 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  ctype    English_United States.1252  
#>  tz       America/New_York            
#>  date     2020-05-04                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package     * version    date       lib source                              
#>  assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.0.0)                      
#>  backports     1.1.6      2020-04-05 [1] CRAN (R 4.0.0)                      
#>  callr         3.4.3      2020-03-28 [1] CRAN (R 4.0.0)                      
#>  cli           2.0.2      2020-02-28 [1] CRAN (R 4.0.0)                      
#>  crayon        1.3.4      2017-09-16 [1] CRAN (R 4.0.0)                      
#>  data.table    1.12.8     2019-12-09 [1] CRAN (R 4.0.0)                      
#>  desc          1.2.0      2018-05-01 [1] CRAN (R 4.0.0)                      
#>  devtools      2.2.2.9000 2020-05-01 [1] Github (r-lib/devtools@b166195)     
#>  digest        0.6.25     2020-02-23 [1] CRAN (R 4.0.0)                      
#>  dplyr       * 0.8.5      2020-03-07 [1] CRAN (R 4.0.0)                      
#>  ellipsis      0.3.0      2019-09-20 [1] CRAN (R 4.0.0)                      
#>  evaluate      0.14       2019-05-28 [1] CRAN (R 4.0.0)                      
#>  fansi         0.4.1      2020-01-08 [1] CRAN (R 4.0.0)                      
#>  fs            1.4.1      2020-04-04 [1] CRAN (R 4.0.0)                      
#>  generics      0.0.2      2018-11-29 [1] CRAN (R 4.0.0)                      
#>  glue          1.4.0      2020-04-03 [1] CRAN (R 4.0.0)                      
#>  highr         0.8        2019-03-20 [1] CRAN (R 4.0.0)                      
#>  htmltools     0.4.0      2019-10-04 [1] CRAN (R 4.0.0)                      
#>  janeaustenr   0.1.5      2017-06-10 [1] CRAN (R 4.0.0)                      
#>  knitr         1.28.5     2020-04-28 [1] Github (yihui/knitr@93b46ba)        
#>  lattice       0.20-41    2020-04-02 [1] CRAN (R 4.0.0)                      
#>  lifecycle     0.2.0      2020-03-06 [1] CRAN (R 4.0.0)                      
#>  magrittr      1.5        2014-11-22 [1] CRAN (R 4.0.0)                      
#>  Matrix        1.2-18     2019-11-27 [1] CRAN (R 4.0.0)                      
#>  matrixStats   0.56.0     2020-03-13 [1] CRAN (R 4.0.0)                      
#>  memoise       1.1.0      2017-04-21 [1] CRAN (R 4.0.0)                      
#>  pillar        1.4.3      2019-12-20 [1] CRAN (R 4.0.0)                      
#>  pkgbuild      1.0.7      2020-04-25 [1] CRAN (R 4.0.0)                      
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.0.0)                      
#>  pkgload       1.0.2      2018-10-29 [1] CRAN (R 4.0.0)                      
#>  prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.0.0)                      
#>  processx      3.4.2      2020-02-09 [1] CRAN (R 4.0.0)                      
#>  ps            1.3.2      2020-02-13 [1] CRAN (R 4.0.0)                      
#>  purrr         0.3.4      2020-04-17 [1] CRAN (R 4.0.0)                      
#>  R6            2.4.1      2019-11-12 [1] CRAN (R 4.0.0)                      
#>  Rcpp          1.0.4.6    2020-04-09 [1] CRAN (R 4.0.0)                      
#>  remotes       2.1.1      2020-02-15 [1] CRAN (R 4.0.0)                      
#>  rlang         0.4.6      2020-05-02 [1] CRAN (R 4.0.0)                      
#>  rmarkdown     2.1.3      2020-05-03 [1] Github (rstudio/rmarkdown@d7e1bda)  
#>  rprojroot     1.3-2      2018-01-03 [1] CRAN (R 4.0.0)                      
#>  sessioninfo   1.1.1      2018-11-05 [1] CRAN (R 4.0.0)                      
#>  SnowballC     0.7.0      2020-04-01 [1] CRAN (R 4.0.0)                      
#>  stm         * 1.3.5      2020-04-28 [1] Github (bstewart/stm@c95ef0b)       
#>  stringi       1.4.6      2020-02-17 [1] CRAN (R 4.0.0)                      
#>  stringr       1.4.0      2019-02-10 [1] CRAN (R 4.0.0)                      
#>  testthat      2.3.2      2020-03-02 [1] CRAN (R 4.0.0)                      
#>  tibble        3.0.1      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  tidyselect    1.0.0      2020-01-27 [1] CRAN (R 4.0.0)                      
#>  tidytext    * 0.2.4      2020-04-28 [1] Github (juliasilge/tidytext@a1c0220)
#>  tokenizers    0.2.1      2018-03-29 [1] CRAN (R 4.0.0)                      
#>  usethis       1.6.1.9000 2020-05-01 [1] Github (r-lib/usethis@4487260)      
#>  vctrs         0.2.4      2020-03-10 [1] CRAN (R 4.0.0)                      
#>  withr         2.2.0      2020-04-20 [1] CRAN (R 4.0.0)                      
#>  xfun          0.13.1     2020-04-30 [1] Github (yihui/xfun@bf8afdd)         
#>  yaml          2.2.1      2020-02-01 [1] CRAN (R 4.0.0)                      
#> 
#> [1] C:/Program Files/R/R-4.0.0/library

@juliasilge
Owner

Yep, that is what I would do! 🙌
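
Once the covariates are aligned this way, the fitted model can be tidied and joined back to them. The following is a minimal sketch, not part of the original reprex: it assumes the document_names argument of tidytext's tidy() method for stm models is available in the installed version, and doc_topics and gamma_by_treatment are hypothetical names used only for illustration.

```r
library(dplyr)
library(tidytext)
library(stm)

gadarian2 <- gadarian %>%
  mutate(document = as.character(row_number()))

gadarian_sparse <- gadarian2 %>%
  unnest_tokens(word, open.ended.response) %>%
  anti_join(stop_words, by = "word") %>%
  count(document, word) %>%
  cast_sparse(document, word, n)

# Keep only the covariate rows for documents that are still in the matrix,
# in the same order as the matrix rows.
covariate_df <- tibble(document = rownames(gadarian_sparse)) %>%
  inner_join(gadarian2, by = "document")

# Row i of the covariates should describe row i of the sparse matrix.
stopifnot(identical(covariate_df$document, rownames(gadarian_sparse)))

topic_model <- stm(gadarian_sparse,
  K = 3, init.type = "Spectral",
  prevalence = ~ treatment + s(pid_rep),
  data = covariate_df,
  verbose = FALSE
)

# Per-document topic proportions, labelled with the matrix row names
# (assumes tidy() accepts document_names for stm models).
doc_topics <- tidy(topic_model,
  matrix = "gamma",
  document_names = rownames(gadarian_sparse)
)

# Average topic proportion within each treatment group.
gamma_by_treatment <- doc_topics %>%
  inner_join(covariate_df, by = "document") %>%
  group_by(topic, treatment) %>%
  summarise(mean_gamma = mean(gamma)) %>%
  ungroup()

gamma_by_treatment
```

Joining on the document id rather than on row position keeps the summary correct even if the row order of either table changes later.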
