TechNotes: Topic Modeling and Text Clustering

Topics and Transformations

In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

To bring out hidden structure in the corpus, discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.

To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html
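As a rough illustration of moving documents from one vector representation to another, the sketch below turns bag-of-words counts into TF-IDF weights. It uses only the Python standard library rather than Gensim's own API, so the function name `tfidf_transform`, the base-2 logarithm, the unit-length normalization, and the toy corpus are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_transform(tokenized_docs):
    """Map each tokenized document to a unit-length dict of TF-IDF weights."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)  # the bag-of-words representation
        weights = {}
        for term, count in counts.items():
            weight = count * math.log2(n_docs / doc_freq[term])
            if weight > 0:  # terms found in every document carry no signal
                weights[term] = weight
        # Scale the vector to unit Euclidean length (an assumed convention).
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {term: w / norm for term, w in weights.items()}
        vectors.append(weights)
    return vectors


# Hypothetical toy corpus, already tokenized.
docs = [
    ["human", "computer", "interface", "system"],
    ["computer", "survey", "system"],
    ["graph", "trees", "system"],
]
vectors = tfidf_transform(docs)
```

In this toy corpus "system" occurs in every document, so it drops out of the transformed vectors, while rarer terms such as "human" end up weighted above the more common "computer".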

Topic modeling for surveys or how we solved the problem of text clustering with LDA

In order to research the data, the following methods and approaches were used:

Sentiment analysis for the text responses of the survey.

Word Clouds generation for each survey question separately and for the entire dataset.

Unsupervised machine learning applied to text data for each survey question separately.

https://medium.com/exness-blog/topic-modeling-for-surveys-or-how-we-solved-the-problem-of-text-clustering-with-lda-ef4896e6f905

The Ultimate Guide to Clustering Algorithms and Topic Modeling

The series of articles aims to provide readers with a thorough view of two common but very different clustering algorithms, K-Means and Latent Dirichlet Allocation (LDA), and their applications in topic modeling.

https://towardsdatascience.com/wthe-ultimate-guide-to-clustering-algorithms-and-topic-modeling-4f7757c115
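To make the K-Means side of that comparison concrete, here is a minimal sketch of the algorithm in plain Python. Unlike LDA, which describes each document as a mixture of topics, K-Means assigns every point to exactly one cluster. The function names, the fixed iteration count, the seeded random initialization, and the 2-D toy points are assumptions for illustration and are not taken from the articles.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as a start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(
                    sum(column) / len(members) for column in zip(*members)
                )
    return labels, centroids


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

For text, each point would be a document vector (for example TF-IDF weights) rather than a 2-D coordinate; the assignment and update steps stay the same.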

An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit

In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison.

https://www.sciencedirect.com/science/article/abs/pii/S0306457318307805

Evaluation of clustering and topic modeling methods over health-related tweets and emails

We trained eight state-of-the-art topic modeling and clustering algorithms on short texts from two health-related datasets (tweets and emails): Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), LDA with Gibbs Sampling (GibbsLDA), Online LDA, Biterm Model (BTM), Online Twitter LDA, and Gibbs Sampling for Dirichlet Multinomial Mixture (GSDMM), as well as the k-means clustering algorithm with two different feature representations: TF-IDF and Doc2Vec. We used cluster validity indices to evaluate the performance of topic modeling and clustering: two internal indices (i.e. assessing the goodness of a clustering structure without external information) and five external indices (i.e. comparing the results of a cluster analysis to externally provided class labels).

Overall, for the number of clusters (k) from 2 to 50, Online Twitter LDA and GSDMM achieved the best performance in terms of internal indices, while LSI and k-means with TF-IDF had the highest external indices. Also, of all tweets (N=286,971; HPV represents 94.6% of tweets and lynch syndrome represents 5.4%), for k = 2, most of the methods could respect this initial clustering distribution. However, we found model performance varies with the source of data and hyper-parameters such as the number of topics and the number of iterations used to train the models. We also conducted an error analysis using the Hamming loss metric, for which the poorest value was obtained by GSDMM on both datasets.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9040385
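To show what an external validity index looks like in practice, the sketch below computes the Rand index, which compares a clustering against known class labels by counting the pairs of items on which the two agree. The Rand index is chosen here only as a simple, well-known example; the excerpt does not say which five external indices the study used, and the function name and label lists are hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Fraction of item pairs on which two labelings agree (0.0 to 1.0)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(true_labels)), 2))
    if not pairs:
        raise ValueError("need at least two items to form a pair")

    agreements = 0
    for i, j in pairs:
        same_in_truth = true_labels[i] == true_labels[j]
        same_in_prediction = predicted_labels[i] == predicted_labels[j]
        # A pair agrees when both labelings put the two items together,
        # or when both labelings keep them apart.
        if same_in_truth == same_in_prediction:
            agreements += 1
    return agreements / len(pairs)


# Cluster ids are arbitrary, so a relabeled but identical grouping is perfect.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that cuts across the true classes agrees on few pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index looks only at pairs, it does not matter which numeric id a cluster receives, which is what makes it usable for comparing unsupervised output with class labels.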

Integrating Document Clustering and Topic Modeling

In this paper, we propose a multi-grain clustering topic model (MGCTM) which integrates document clustering and topic modeling into a unified framework and jointly performs the two tasks to achieve the overall best performance. Our model tightly couples two components: a mixture component used for discovering latent groups in document collection and a topic model component used for mining multi-grain topics including local topics specific to each cluster and global topics shared across clusters.

https://arxiv.org/ftp/arxiv/papers/1309/1309.6874.pdf

Document Clustering vs Topic Models: A Case Study

In this paper, we report experiments on the observed relationship between clusters and topic models in a preliminary study of a large text collection. Both produce results that appear cohesive in their own right, but surprisingly – given the very different ways in which they are formed – the descriptions of the collections that they generate are strongly similar. This unexpected mutual reinforcement creates confidence in both approaches as tools for annotating and describing the contents of document collections.

https://dl.acm.org/doi/fullHtml/10.1145/3503516.3503527

Text Clustering: Grouping News Articles in Python

In this tutorial, we are going to cover the following topics:

1 Text Clustering

2 K-Means Clustering

3 Perform clustering on the News dataset

4 Evaluate Clustering Performance

5 Evaluate Clustering Performance using WordCloud

https://machinelearninggeek.com/text-clustering-clustering-news-articles/

Clustering sentence embeddings to identify intents in short text

This post will provide an approach I learned that can automatically cluster short-text message data to identify and extract intents.

https://towardsdatascience.com/clustering-sentence-embeddings-to-identify-intents-in-short-text-48d22d3bf02e