Guiding Questions
Develop your answers to the following guiding questions while watching the video lectures throughout the week.
- What is clustering? What are some applications of clustering in text mining and analysis?
- How can we use a mixture model to do document clustering? How many parameters are there in such a model?
- How is the mixture model for document clustering related to a topic model such as PLSA? In what way are they similar? Where are they different?
- How do we determine the cluster for each document after estimating all the parameters of a mixture model?
- How does hierarchical agglomerative clustering work? How do single-link, complete-link, and average-link work for computing group similarity? Which of these three ways of computing group similarity is least sensitive to outliers in the data?
- How do we evaluate clustering results?
- What is text categorization? What are some applications of text categorization?
- What does the training data for categorization look like?
- How does the Naïve Bayes classifier work?
- Why do we often use logarithm in the scoring function for Naïve Bayes?
4.1 Text Clustering: Motivation
4.2 Text Clustering: Generative Probabilistic Models Part 1
Clustering with this model is only possible under the assumption that each document has exactly one topic.
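- For the words in a document: the Cluster Model chooses a word distribution only once per document; the Topic Model makes that choice once for every word.
- Cluster Model: the chosen word distribution generates every word in the document. Topic Model: a single word distribution does not have to generate all the words in the document; some words can be generated by other topics.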
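The contrast in the two bullets above can be sketched as a toy generative process. This is only an illustration: the two word distributions, their probabilities, the priors, and the function names are all made up for the example and do not come from the lecture.

```python
import random

# Hypothetical toy word distributions, one per cluster/topic (values are made up).
DISTRIBUTIONS = [
    {"game": 0.5, "team": 0.3, "score": 0.2},     # theta_1
    {"market": 0.6, "stock": 0.3, "trade": 0.1},  # theta_2
]
PRIORS = [0.5, 0.5]  # assumed p(theta_1), p(theta_2)


def draw_index(rng):
    """Pick which distribution to use, according to PRIORS."""
    return rng.choices(range(len(DISTRIBUTIONS)), weights=PRIORS, k=1)[0]


def draw_word(dist, rng):
    """Sample one word from a word distribution."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=1)[0]


def generate_cluster_doc(length, rng):
    """Cluster model: choose the distribution ONCE, then draw every word from it."""
    idx = draw_index(rng)
    return idx, [draw_word(DISTRIBUTIONS[idx], rng) for _ in range(length)]


def generate_topic_doc(length, rng):
    """Topic model: choose a distribution again for EACH word."""
    return [draw_word(DISTRIBUTIONS[draw_index(rng)], rng) for _ in range(length)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_cluster_doc(8, rng))  # all words share one distribution
    print(generate_topic_doc(8, rng))    # words may mix both distributions
```

Because the two toy vocabularies are disjoint, a document from `generate_cluster_doc` only ever contains words from one of them, while a document from `generate_topic_doc` can mix both.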
L: the number of words in the document
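For reference, the two-component likelihoods of the two models can be written side by side. The notation is an assumption made for this note: a document is d = x_1 ... x_L, with L the number of words as defined above, and θ_1, θ_2 are the two word distributions. The only difference is whether the sum over distributions sits outside the product (one choice per document) or inside it (one choice per word).

```latex
% Mixture model for clustering: choose a distribution once per document
p(d) = p(\theta_1)\prod_{i=1}^{L} p(x_i \mid \theta_1)
     + p(\theta_2)\prod_{i=1}^{L} p(x_i \mid \theta_2)

% Topic model: choose a distribution once per word
p(d) = \prod_{i=1}^{L} \left[ p(\theta_1)\, p(x_i \mid \theta_1)
     + p(\theta_2)\, p(x_i \mid \theta_2) \right]
```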
4.3 Text Clustering: Generative Probabilistic Models Part 2
How to generalize from 2 clusters to N clusters
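With N clusters the document likelihood becomes a sum of N terms, one per cluster: the cluster prior times the product of the word probabilities under that cluster. The sketch below computes it in log space and also returns the posterior of each cluster given the document. It assumes the parameters are already estimated and that every word has a nonzero probability in every cluster (for example after smoothing). The function name and the toy numbers are hypothetical.

```python
import math


def mixture_log_likelihood(doc, priors, word_dists):
    """Return (log p(d), posteriors) for a mixture of N unigram models.

    doc        -- list of words x_1 ... x_L
    priors     -- list of N cluster priors p(theta_i), summing to 1
    word_dists -- list of N dicts mapping word -> p(word | theta_i)
    Assumes every word of doc has nonzero probability in every cluster.
    """
    # Per-cluster score: log p(theta_i) + sum_j log p(x_j | theta_i)
    scores = []
    for prior, dist in zip(priors, word_dists):
        score = math.log(prior)
        for word in doc:
            score += math.log(dist[word])
        scores.append(score)

    # log-sum-exp over clusters keeps the sum stable for long documents
    top = max(scores)
    log_p_doc = top + math.log(sum(math.exp(s - top) for s in scores))

    # Posterior p(theta_i | d) is each term's share of the total
    posteriors = [math.exp(s - log_p_doc) for s in scores]
    return log_p_doc, posteriors


if __name__ == "__main__":
    # Made-up parameters for N = 3 clusters over a 3-word vocabulary.
    priors = [0.5, 0.3, 0.2]
    word_dists = [
        {"text": 0.7, "food": 0.2, "sport": 0.1},
        {"text": 0.1, "food": 0.8, "sport": 0.1},
        {"text": 0.2, "food": 0.1, "sport": 0.7},
    ]
    log_p, post = mixture_log_likelihood(["food", "food", "text"], priors, word_dists)
    best = max(range(len(post)), key=lambda i: post[i])
    print(log_p, post, best)
```

One way to turn this into a clustering decision is to assign the document to the cluster with the largest posterior. Working with sums of logarithms rather than products of probabilities avoids underflow when L is large.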