
SearchK Fails and Trouble with Topic Detection on a Set with a Small Number of Documents #284

Open
@larry77

Description


Please have a look at the reprex below.
You need to download the document-feature matrix from

https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7

A couple of things happen that I do not understand:

  1. When I run stm with 9 topics, several of them appear to be duplicates (at least in their top 10 keywords; see the plot generated in the reprex). Any idea why?
  2. When I try the searchK() function from stm to determine the optimal number of topics, I get an error message I cannot decipher.

Maybe this is not a bug, but I am banging my head against the wall and any help is appreciated.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
library(stm)
#> stm v1.3.6.1 successfully loaded. See ?stm for help. 
#>  Papers, resources, and other materials at structuraltopicmodel.com
library(RCurl)
library(readtext)
#> 
#> Attaching package: 'readtext'
#> The following object is masked from 'package:quanteda':
#> 
#>     texts
library(tidytext)
library(ggplot2)

## Download the dfm matrix from

## https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7


dfm_mat <- readRDS("dfm_mat.RDS")

## see https://rstudio-pubs-static.s3.amazonaws.com/406792_9287b832dd9e413f97243628cb2f7ddb.html

## Convert the dfm to a format suitable for stm.

dfm2stm <- convert(dfm_mat, to = "stm")
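
## For context (the issue is about a small corpus), it may help to record how
## many documents and features actually go into stm. This is just a quick
## diagnostic sketch; the exact counts are not shown in this reprex.

length(dfm2stm$documents)  # number of documents handed to stm
length(dfm2stm$vocab)      # vocabulary size after conversion
ndoc(dfm_mat)              # same checks on the quanteda dfm
nfeat(dfm_mat)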



model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = 9, data = dfm2stm$meta,
                 init.type = "Spectral") 
#> Beginning Spectral Initialization 
#>   Calculating the gram matrix...
#>   Finding anchor words...
#>      .........
#>   Recovering initialization...
#>      ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 1 (approx. per word bound = -6.780) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 2 (approx. per word bound = -6.762, relative change = 2.715e-03) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 4.260e-05) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 4 (approx. per word bound = -6.761, relative change = 1.602e-05) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 5 (approx. per word bound = -6.761, relative change = 1.024e-05) 
#> Topic 1: europe, can, european, new, need 
#>  Topic 2: union, need, europe, today, us 
#>  Topic 3: europe, union, work, european, need 
#>  Topic 4: union, need, europe, today, us 
#>  Topic 5: europe, can, european, new, need 
#>  Topic 6: europe, union, work, european, need 
#>  Topic 7: union, need, europe, today, us 
#>  Topic 8: europe, can, european, new, need 
#>  Topic 9: accelerate, union, need, europe, us 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Model Converged

## I make the model tidy.
## See  https://juliasilge.com/blog/sherlock-holmes-stm/

stm_tidy <- tidy(model.stm)

gpl <- stm_tidy  |> 
    group_by(topic)  |> 
    top_n(10, beta)  |> 
    ungroup()  |> 
    mutate(topic = paste0("Topic ", topic),
           term = reorder_within(term, beta, topic))  |> 
    ggplot(aes(term, beta, fill = as.factor(topic))) +
    geom_col(alpha = 0.8, show.legend = FALSE) +
    facet_wrap(~ topic, scales = "free_y") +
    coord_flip() +
    scale_x_reordered() +
    labs(x = NULL, y = expression(beta),
         title = "Highest word probabilities for each topic",
         subtitle = "Different words are associated with different topics")


gpl

## So an stm model with a chosen number of topics can be fitted to the data,
## but the output above shows several topics sharing the same top words (point 1 above).
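
## As a sanity check on point 1 (just a guess at a diagnostic, not something
## the stm documentation prescribes), the correlations between the estimated
## topic-word distributions show whether the "duplicated" topics are
## numerically identical or merely share their top keywords. This assumes the
## usual stm object layout, where model.stm$beta$logbeta[[1]] is the K x V
## matrix of log word probabilities.

beta_mat <- exp(model.stm$beta$logbeta[[1]])  # K x V word probabilities per topic
round(cor(t(beta_mat)), 2)                    # off-diagonal values near 1 flag duplicate topics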



### Now I try to determine the optimal number of topics using the searchK() function

### See https://stackoverflow.com/questions/64989642/use-dfm-in-searchk-calcuation

set.seed(02138)

K <- 5:15

model_search <- searchK(dfm2stm$documents, dfm2stm$vocab, K,
                        data = dfm2stm$meta)
#> Beginning Spectral Initialization 
#>   Calculating the gram matrix...
#>   Finding anchor words...
#>      .....
#>   Recovering initialization...
#>      ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 1 (approx. per word bound = -6.781) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 2 (approx. per word bound = -6.761, relative change = 2.956e-03) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 2.235e-05) 
#> ...
#> Completed E-Step (0 seconds). 
#> Completed M-Step. 
#> Model Converged
#> Error in missing$docs[[i]]: subscript out of bounds

## This fails, but I do not understand why.
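
## One guess (untested): with so few documents, searchK's default heldout
## size, N = floor(0.1 * length(documents)), may round down to only a handful
## of documents (or zero), so the heldout set built internally by
## make.heldout() ends up smaller than expected and eval.heldout() runs off
## the end of missing$docs. Under that assumption, a sketch that forces a
## non-trivial heldout set would look like this (N, proportion and
## heldout.seed are documented searchK arguments):

model_search <- searchK(dfm2stm$documents, dfm2stm$vocab, K,
                        data = dfm2stm$meta,
                        N = max(1, floor(0.2 * length(dfm2stm$documents))),
                        proportion = 0.5,
                        heldout.seed = 02138)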

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
#>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
#>  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.4.4   tidytext_0.4.1  readtext_0.90   RCurl_1.98-1.13
#> [5] stm_1.3.6.1     quanteda_3.3.1  dplyr_1.1.3    
#> 
#> loaded via a namespace (and not attached):
#>  [1] janeaustenr_1.0.0  utf8_1.2.4         generics_0.1.3     slam_0.1-50       
#>  [5] bitops_1.0-7       stringi_1.7.12     lattice_0.22-5     digest_0.6.33     
#>  [9] magrittr_2.0.3     evaluate_0.23      grid_4.3.2         fastmap_1.1.1     
#> [13] plyr_1.8.9         Matrix_1.6-2       httr_1.4.7         stopwords_2.3     
#> [17] fansi_1.0.5        scales_1.2.1       cli_3.6.1          rlang_1.1.2       
#> [21] tokenizers_0.3.0   munsell_0.5.0      reprex_2.0.2       withr_2.5.2       
#> [25] yaml_2.3.7         tools_4.3.2        reshape2_1.4.4     colorspace_2.1-0  
#> [29] fastmatch_1.1-4    vctrs_0.6.4        R6_2.5.1           lifecycle_1.0.4   
#> [33] stringr_1.5.0      fs_1.6.3           pkgconfig_2.0.3    RcppParallel_5.1.7
#> [37] pillar_1.9.0       gtable_0.3.4       data.table_1.14.8  glue_1.6.2        
#> [41] Rcpp_1.0.11        xfun_0.41          tibble_3.2.1       tidyselect_1.2.0  
#> [45] knitr_1.45         farver_2.1.1       htmltools_0.5.7    SnowballC_0.7.1   
#> [49] rmarkdown_2.25     labeling_0.4.3     compiler_4.3.2

Created on 2023-11-14 with reprex v2.0.2
