You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A couple of things which I do not understand happen
when I run stm with 9 topics, some of them appear to yield duplicated results (at least in the top 10 keywords per topic, see the plot generated in the reprex). Any idea why?
when I try using the searchK() function from stm to determine the optimal number of topics, I get an error message I cannot decipher.
Maybe this is not a bug, but I am banging my head against the wall and any help is appreciated.
library(dplyr)
#> #> Attaching package: 'dplyr'#> The following objects are masked from 'package:stats':#> #> filter, lag#> The following objects are masked from 'package:base':#> #> intersect, setdiff, setequal, union
library(quanteda)
#> Package version: 3.3.1#> Unicode version: 15.0#> ICU version: 72.1#> Parallel computing: 4 of 4 threads used.#> See https://quanteda.io for tutorials and examples.
library(stm)
#> stm v1.3.6.1 successfully loaded. See ?stm for help. #> Papers, resources, and other materials at structuraltopicmodel.com
library(RCurl)
library(readtext)
#> #> Attaching package: 'readtext'#> The following object is masked from 'package:quanteda':#> #> texts
library(tidytext)
library(ggplot2)
## Download the dfm matrix from## https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7dfm_mat<- readRDS("dfm_mat.RDS")
## see https://rstudio-pubs-static.s3.amazonaws.com/406792_9287b832dd9e413f97243628cb2f7ddb.html## convert the dfm to a format suitable to stm.dfm2stm<- convert(dfm_mat, to="stm")
model.stm<- stm(dfm2stm$documents, dfm2stm$vocab, K=9, data=dfm2stm$meta,
init.type="Spectral")
#> Beginning Spectral Initialization #> Calculating the gram matrix...#> Finding anchor words...#> .........#> Recovering initialization...#> ...........................#> Initialization complete.#> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 1 (approx. per word bound = -6.780) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 2 (approx. per word bound = -6.762, relative change = 2.715e-03) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 4.260e-05) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 4 (approx. per word bound = -6.761, relative change = 1.602e-05) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 5 (approx. per word bound = -6.761, relative change = 1.024e-05) #> Topic 1: europe, can, european, new, need #> Topic 2: union, need, europe, today, us #> Topic 3: europe, union, work, european, need #> Topic 4: union, need, europe, today, us #> Topic 5: europe, can, european, new, need #> Topic 6: europe, union, work, european, need #> Topic 7: union, need, europe, today, us #> Topic 8: europe, can, european, new, need #> Topic 9: accelerate, union, need, europe, us #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Model Converged## I make the model tidy.## See https://juliasilge.com/blog/sherlock-holmes-stm/stm_tidy<- tidy(model.stm)
gpl<-stm_tidy|>
group_by(topic) |>
top_n(10, beta) |>
ungroup() |>
mutate(topic= paste0("Topic ", topic),
term= reorder_within(term, beta, topic)) |>
ggplot(aes(term, beta, fill= as.factor(topic))) +
geom_col(alpha=0.8, show.legend=FALSE) +
facet_wrap(~topic, scales="free_y") +
coord_flip() +
scale_x_reordered() +
labs(x=NULL, y= expression(beta),
title="Highest word probabilities for each topic",
subtitle="Different words are associated with different topics")
gpl
## I can fit a model by stm with a chosen number of topics to the data### Now I try determining the optimal number of topics using the searchK function### See https://stackoverflow.com/questions/64989642/use-dfm-in-searchk-calcuation
set.seed(02138)
K<-5:15model_search<- searchK(dfm2stm$documents, dfm2stm$vocab, K,
data=dfm2stm$meta)
#> Beginning Spectral Initialization #> Calculating the gram matrix...#> Finding anchor words...#> .....#> Recovering initialization...#> ...........................#> Initialization complete.#> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 1 (approx. per word bound = -6.781) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 2 (approx. per word bound = -6.761, relative change = 2.956e-03) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 2.235e-05) #> ...#> Completed E-Step (0 seconds). #> Completed M-Step. #> Model Converged#> Error in missing$docs[[i]]: subscript out of bounds## This fails but I do not understand why....
sessionInfo()
#> R version 4.3.2 (2023-10-31)#> Platform: x86_64-pc-linux-gnu (64-bit)#> Running under: Debian GNU/Linux 12 (bookworm)#> #> Matrix products: default#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 #> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0#> #> locale:#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 #> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 #> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C #> #> time zone: Europe/Brussels#> tzcode source: system (glibc)#> #> attached base packages:#> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages:#> [1] ggplot2_3.4.4 tidytext_0.4.1 readtext_0.90 RCurl_1.98-1.13#> [5] stm_1.3.6.1 quanteda_3.3.1 dplyr_1.1.3 #> #> loaded via a namespace (and not attached):#> [1] janeaustenr_1.0.0 utf8_1.2.4 generics_0.1.3 slam_0.1-50 #> [5] bitops_1.0-7 stringi_1.7.12 lattice_0.22-5 digest_0.6.33 #> [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 fastmap_1.1.1 #> [13] plyr_1.8.9 Matrix_1.6-2 httr_1.4.7 stopwords_2.3 #> [17] fansi_1.0.5 scales_1.2.1 cli_3.6.1 rlang_1.1.2 #> [21] tokenizers_0.3.0 munsell_0.5.0 reprex_2.0.2 withr_2.5.2 #> [25] yaml_2.3.7 tools_4.3.2 reshape2_1.4.4 colorspace_2.1-0 #> [29] fastmatch_1.1-4 vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.4 #> [33] stringr_1.5.0 fs_1.6.3 pkgconfig_2.0.3 RcppParallel_5.1.7#> [37] pillar_1.9.0 gtable_0.3.4 data.table_1.14.8 glue_1.6.2 #> [41] Rcpp_1.0.11 xfun_0.41 tibble_3.2.1 tidyselect_1.2.0 #> [45] knitr_1.45 farver_2.1.1 htmltools_0.5.7 SnowballC_0.7.1 #> [49] rmarkdown_2.25 labeling_0.4.3 compiler_4.3.2
The text was updated successfully, but these errors were encountered:
larry77
changed the title
SearchK Fails and Trouble with Topic Detection
SearchK Fails and Trouble with Topic Detection on a Set with a Small Number of Documents
Nov 14, 2023
Please have a look at the reprex below.
You need to download the document feature matrix from
https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7
A couple of things which I do not understand happen
Maybe this is not a bug, but I am banging my head against the wall and any help is appreciated.
Created on 2023-11-14 with reprex v2.0.2
The text was updated successfully, but these errors were encountered: