Monday, September 26, 2011

What is a good explanation of Latent Dirichlet Allocation?

I was browsing Quora this evening and came across a question about LDA, asking for a good explanation of it.

The answers together add up to a very nice explanation. Here I translate the most intuitive one (partially; some parts read better left in English), and along the way I also reviewed some probability concepts (which I had studied but almost entirely forgotten, leaving only a vague impression).

Scenario


Suppose you have the following set of sentences:



  • I ate a banana and spinach smoothie for breakfast.

  • I like to eat broccoli and bananas.

  • Chinchillas and kittens are cute.

  • My sister adopted a kitten yesterday.

  • Look at this cute hamster munching on a piece of broccoli.


LDA is a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for two topics, LDA might produce something like the following (a small sketch of running an off-the-shelf LDA implementation on these sentences comes after the list):




  • Sentences 1 and 2: 100% Topic A

  • Sentences 3 and 4: 100% Topic B

  • Sentence 5: 60% Topic A, 40% Topic B

  • Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)

  • Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)
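
To make this concrete, here is a rough sketch (my own addition, not part of the Quora answer) of running an off-the-shelf LDA implementation on the five sentences, using scikit-learn's LatentDirichletAllocation. Note that scikit-learn uses variational inference rather than the Gibbs sampling approach discussed later in this post, and with so little data the exact mixtures and top words will vary with the random seed; the point is only the shape of the input and output.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I ate a banana and spinach smoothie for breakfast.",
    "I like to eat broccoli and bananas.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)                # word counts per sentence
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                      # rows: sentences, columns: topic proportions
print(doc_topics)

words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):            # topic-word weights
    top = topic.argsort()[::-1][:4]
    print("Topic", k, ":", [words[i] for i in top])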


The question is: how does LDA discover these topics?


LDA Model


In more detail, LDA represents documents as mixtures of topics, where each topic assigns probabilities to words. LDA makes the following assumption: when you write a document, you go through this process (a minimal code sketch of the generative process follows the list):



  • Decide on the number of words N the document will have (say, according to a Poisson distribution).



  • Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.

  • Generate each word in the document by:

  • ....First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).

  • ....Then using the topic to generate the word itself (according to the topic's multinomial distribution). For instance, the food topic might output the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
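
Here is a minimal sketch of this generative process. It is my own toy code: the vocabulary, the two topics, their word probabilities, the Poisson mean, and the Dirichlet parameters are all made up for illustration and are not taken from the answer.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["broccoli", "bananas", "breakfast", "munching",
         "chinchillas", "kittens", "cute", "hamster"]
# topic-word distributions (each row sums to 1); topic 0 ~ food, topic 1 ~ cute animals
topics = np.array([
    [0.35, 0.30, 0.20, 0.15, 0.00, 0.00, 0.00, 0.00],   # "food" topic
    [0.00, 0.00, 0.00, 0.00, 0.25, 0.25, 0.30, 0.20],   # "cute animals" topic
])
K = len(topics)

def generate_document(alpha=(1.0, 1.0), mean_length=8):
    n_words = rng.poisson(mean_length)                 # 1) choose the number of words N
    theta = rng.dirichlet(alpha)                       # 2) choose a topic mixture for the document
    words = []
    for _ in range(n_words):
        k = rng.choice(K, p=theta)                     # 3a) pick a topic from the mixture
        words.append(rng.choice(vocab, p=topics[k]))   # 3b) pick a word from that topic
    return theta, words

print(generate_document())

Running generate_document() a few times produces short "documents" whose word choices reflect whatever topic mixture theta was drawn for them.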


Given this model of how documents are generated, LDA tries to backtrack from the documents to a set of topics that are likely to have generated them.


Example


For example, according to the model above, when generating a particular document D, you might:



  • Decide that D will be 1/2 about food and 1/2 about cute animals.

  • Pick 5 to be the number of words in D.

  • Pick the first word to come from the food topic, which then gives you the word "broccoli".

  • Pick the second word to come from the cute animals topic, which gives you "panda".

  • Pick the third word to come from the cute animals topic, giving you "adorable".

  • Pick the fourth word to come from the food topic, giving you "cherries".

  • Pick the fifth word to come from the food topic, giving you "eating".


The process above simply simulates how a person writes a document; the document generated by the LDA model here is "broccoli panda adorable cherries eating" (LDA models a document as just a collection of words, so word order does not matter).


Learning


Now suppose you have a set of documents and have chosen some fixed number K of topics to discover, and you want to use LDA to learn the topic representation of each document as well as the words associated with each topic. How can you do this? One way is collapsed Gibbs sampling (a rough code sketch follows the list):



  • Go through each document, and randomly assign each word in the document to one of the K topics. (For each document, randomly assign each word to one of the topics.)

  • Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones). (This random assignment already counts as a result of learning, just a poor one; we improve it by iterating.)

  • So to improve on them, for each document d...

  • ....Go through each word w in d...

  • ........And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word's topic with this probability). (Also, I'm glossing over a couple of things here, such as the use of priors/pseudocounts in these probabilities.) (Compute the probability p(t|d) that document d uses topic t and the probability p(w|t) that topic t generates word w, then reassign word w to a topic sampled with probability proportional to p(t|d) * p(w|t).)

  • ........In other words, in this step, we're assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated. (Whenever we compute the topic probabilities for a word, we treat the topic assignments of all the other words as fixed.)

  • After repeating the previous step a large number of times, you'll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall). (Iterate until the assignments roughly converge, then use the final values as each document's topic representation and each topic's word distribution.)
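
Here is a rough, unoptimized sketch of the collapsed Gibbs sampling loop described above. It is my own code, not the answer's: it assumes each document is already a list of word ids in the range 0..V-1, and the pseudocounts alpha and beta (the priors glossed over above) and all variable names are my own choices.

import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=500):
    rng = np.random.default_rng(0)
    n_dk = np.zeros((len(docs), K))   # n_dk[d, k]: words in document d currently assigned to topic k
    n_kw = np.zeros((K, V))           # n_kw[k, w]: times word w is assigned to topic k, over all documents
    n_k = np.zeros(K)                 # n_k[k]: total words assigned to topic k
    # step 1: randomly assign every word in every document to one of the K topics
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # step 2: repeatedly resample each word's topic, holding all other assignments fixed
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1      # remove the current assignment
                # p(topic t | document d) * p(word w | topic t), smoothed by the pseudocounts
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())                   # resample a topic for this word
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    # step 3: read off per-document topic mixtures and per-topic word distributions
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta, phi

Here theta is the estimated topic mixture of each document and phi is the estimated word distribution of each topic, read off from the final assignment counts exactly as the last bullet describes.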



The learning procedure is a bit long, so a brief remark: it is quite similar in spirit to PageRank, which also starts from random values, then updates them and iterates until convergence.

Related probability background


While reading I realized I had forgotten nearly all of my probability, so I looked the relevant concepts up on Wikipedia as I went. I have not fully digested them, but at least I now have an intuitive understanding.


The Dirichlet distribution is a continuous multivariate probability distribution; understanding it requires the concepts below.
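
Before the definitions, a tiny illustration (mine, not from the original answer): a single draw from a Dirichlet distribution is itself a probability vector, for example a topic mixture over three topics.

import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(alpha=[0.5, 0.5, 0.5])   # 3 topics, symmetric alpha = 0.5
print(theta, theta.sum())                      # a random point on the simplex; components sum to 1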


Prior probability: usually just called the prior. The prior of an uncertain quantity p is the probability distribution expressing one's uncertainty about p before any data is taken into account. A simple example: if p is the proportion of voters who will vote for candidate A, then the prior of p is one's belief about p before consulting any opinion polls, i.e. without reference to any data that could influence the estimate.

Correspondingly, the posterior probability is as follows:


The posterior probability is the conditional probability obtained once the relevant data has been taken into account: the posterior distribution of an uncertain quantity θ is its distribution conditioned on the evidence x obtained from an experiment, written p(θ|x) (by Bayes' rule, p(θ|x) ∝ p(x|θ) p(θ)). Going one step further, we arrive at the conjugate prior.


If the prior distribution p(θ) and the posterior distribution p(θ|x) belong to the same family of distributions, the two are called conjugate distributions, and the prior is called a conjugate prior. For example, if p(θ) is Gaussian and p(θ|x) is also Gaussian, then the two distributions are conjugate.
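
A standard example relevant to LDA (my own illustration, not from the post): the Dirichlet prior is conjugate to the multinomial likelihood, so after observing category counts the posterior is again a Dirichlet, with the counts simply added to the prior's parameters.

import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over 3 categories
counts = np.array([5, 0, 2])        # observed multinomial counts
posterior_alpha = alpha + counts    # posterior is Dirichlet(alpha + counts)
print(posterior_alpha)              # [6. 1. 3.]

This is also why counts plus pseudocounts appear directly in the Gibbs sampling probabilities above.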

All of the concepts above belong to Bayesian statistics.


Bayesian statistics is a branch of statistics in which evidence about the true state of the world is expressed in terms of Bayesian probability (http://en.wikipedia.org/wiki/Bayesian_probability). Bayesian probability is one interpretation of the concept of probability, belonging to the family of degree-of-belief interpretations; this interpretation can be seen as an extension of logic that makes it possible to reason about uncertain propositions.

References: