Skip to content

Ideas Scrapyard

Menshikh Ivan edited this page Feb 15, 2018 · 4 revisions

This page created only as "history artifact", contains many ideas that non-relevant now.

Supervised Latent Dirichlet Allocation

Note: Consider integration with existing Python sLDA

Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It is used in predicting the number of "Likes" for a post or the number of stars in a movie review.

In the vanilla LDA we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add our target variable to the LDA model. For example, the number of stars assigned in a movie review or number of "Likes" of a post.

While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.

  2. Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent on the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [5, 6] on github. [7] Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

  2. Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8] following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.

Resources:

[1] Mcauliffe, Jon D., and David M. Blei. "Supervised topic models." Advances in neural information processing systems. 2008.

[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John, ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022

[3] sLDA implementation in C++

[4] Implementation of sLDA in R

[5] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[6] Gensim github issue #121.

[7] Gensim on github

[8] Movie Review Dataset from Cornell NLP group

[9] Ramage, Daniel, et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 2009.

[10] Labelled LDA in Python

[11] Jagarlamudi, Jagadeesh, Hal Daumé III, and Raghavendra Udupa. "Incorporating lexical priors into topic models." Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012

Distributed similarity queries

Background: gensim implements fast routines for similarity retrieval ("give me documents similar to this one, using Latent Semantic Analysis"). The routines can make use of multiple cores (using BLAS), but not multiple machines. For large datasets, it is desirable to store shards in a distributed manner, across a cluster of computers. During querying, collect and merge results from all shards.

To do: Extend the sharding already present in gensim, so that different shards can reside on different computers. Design an API to make the concept of "shards" flexible, so that similarity classes that use different implementations (see k-NN above) can plug into it easily.

The network communication must use a fast protocol (Pyro? ØMQ?), so as to not increase query latency too much.

Resources: gensim mailing list.

Dynamic Topic Models improvements

Background: Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents. DTM has already been implemented in Gensim by Google Summer of Code student @bhargavvader

To do: Implement a distributed cluster version of DTM, and a version that can use multiple cores on the same machine. Implement DIM in gensim and evaluate.

Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching in low-level (ideally, BLAS) routines as possible.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Gensim doesn't include any support for "timed streams", or time tags, at the moment. So part of this project will be engineering a clean API for this new functionality.

Resources: Dynamic Topic Models.

Original Blei&Lafferty article PDF.

Wang&Blei&Heckerman article on Continuous Time Dynamic Topic Model PDF.

Wang&McCallum: "Topics over time" PDF.

Academic implementation of DTM on David Blei's page.

Gensim implementation of DTM.

Nested Hierarchical Dirichlet Processes

Background: Paisley, Wang, Blei, Jordan recently developed a stochastic variational version of nested HDP. It reportedly preforms better than HDP etc. (of course!).

To do: Implement this model (probably extending / replacing the existing online HDP implementation in gensim) and evaluate it. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.

Implementation must accept data in stream format (sequence of document vectors), to allow large inputs.

We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.

Resources: "Nested Hierarchical Dirichlet Processes" by John Paisley, Chong Wang, David M. Blei and Michael I. Jordan PDF.

Pachinko Allocation Model

Background Li, McCallum developed a hierarchical LDA-like model for document classification. They report 2-5% accuracy improvements over an LDA model on a test corpus. (http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf)

An implementation of this model may provide additional alternatives in choice of model, which in some cases may be helpful.

An implementation must be heavily unit tested and and production-ready. It would use many of the same classes and methods as the LDA, which is a bonus in terms of a first pass at implementation.

Resources Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.

Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society B, 46, 193–227.

Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation.

Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML.

Minka, T. (2000). Estimating a Dirichlet distribution. Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. UAI.

Wallach, H. M. (2006). Topic modeling: beyond bag-ofwords. ICML.

Glove word-embedding integration

Integrate or re-write in an optimized way the glove word-embedding code by Maciej Kula (https://github.com/maciejkula/glove-python). Next step would be adding Swivel algorithm support

Wrapper for BigARTM Topic Modeling package

See BigARTM github page

Spectral LDA in Python

Much better performance than current variational inference way to fit LDA.

Either implement in Python or find a way to load the model trained on Spark.

Code and paper references

Intrinsic evaluation of word2vec models with QVec

Shows how good your word2vec model is on specific syntactic and semantic tasks. Wrapper around this code https://github.com/ytsvetko/qvec

Compare tensorflow and gensim summarizations

https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html

Implement Wordspace in Python

Translate from R into Python using existing Gensim code. Medium difficulty.

From gensim issue suggestion: "Hi, it seems that wordspace model is very useful (http://infomap-nlp.sourceforge.net/doc/algorithm.html and https://cran.r-project.org/web/packages/wordspace/index.html). It is similar to the lsa model except that wordspace decomposes a co-occurrence matrix instead of term-document matrix."

Word sense embedding

A sense embedding is able to learn multiple representations per word capturing different word meanings.

Integrate one of existing word sense embeddings into gensim. Adagram is the best one currently.

Low priority as rarely appears in production.

Consider:

Adagram

Sensegram

Add cuckoo hashing to dictionary to avoid collisions

Change HashDictionary to use cuckoo hashing.

Hat-tip to A. Mueller

Add word embedding from "Learning Distributed Word Representations For

Bidirectional LSTM Recurrent Neural Network" paper

See paper

Joint embedding

Slight modification of word2vec for the purpose of sponsored advertising. See this paper "Joint Embedding of Query and Ad by Leveraging Implicit Feedback"

Implement OHDOCLUS – Online and Hierarchical Document Clustering

See https://github.com/ruiEnca/ohDoclus

Integrate with PhraseMachine - phrase detection based on part of speech tagging

See code and paper at https://github.com/slanglab/phrasemachine

Factor analysis algorithms and code

See how it works, compare to existing techniques, maybe get in touch for inclusion / robust reimplementation in gensim.

Code from 2014: https://github.com/cangermueller/vbmfa

Unsupervised Seq2seq

Implement in tensorflow. Reproduce this paper or try another GAN model

Unsupervised segmentation into blocks/sentences/words

See discussion in https://github.com/RaRe-Technologies/gensim/issues/1135#issuecomment-277529491