不死虫的古堡: September 2011

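Wandering around Quora this evening, I came across a question asking for a good explanation of LDA.

Several people together arrived at an excellent explanation. Here I translate the most intuitive one (only in part; some passages read better in the original English), and along the way I reviewed some probability (I had studied it, yet had forgotten nearly everything beyond a vague impression).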

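Suppose you have the following set of sentences:

LDA is a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for two topics, LDA might produce something like the following: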

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)

Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)

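The question is, how does LDA perform this discovery?

In more detail, LDA represents each document as a mixture of topics, and each topic emits words with certain probabilities. LDA assumes that when you write a document, you do the following: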

Decide on the number of words N the document will have (say, according to a Poisson distribution).

Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.

Generate each word in the document by:

....First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).

....Then using the topic to generate the word itself (according to the topic's multinomial distribution). For instance, the food topic might output the word "broccoli" with 30% probability, "bananas" with 15% probability, and so on.
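The generative steps above can be sketched in a few lines of Python with NumPy. This is only an illustration under assumptions: the function name `generate_document`, the eight-word vocabulary, the per-topic word probabilities (the percentages quoted earlier are partial, so these are made-up numbers that sum to 1), the Dirichlet parameter `alpha`, and the mean length are all hypothetical.

```python
import numpy as np

def generate_document(topic_word, vocab, alpha, mean_length, rng):
    """Sample one document following the LDA generative story."""
    num_topics = topic_word.shape[0]
    # Step 1: choose the number of words N from a Poisson distribution
    # (forced to be at least 1 so the toy document is never empty).
    n_words = max(1, int(rng.poisson(mean_length)))
    # Step 2: choose the document's topic mixture from a Dirichlet distribution.
    theta = rng.dirichlet(np.full(num_topics, alpha))
    words, topics = [], []
    for _ in range(n_words):
        # Step 3a: pick a topic according to the document's mixture.
        z = int(rng.choice(num_topics, p=theta))
        # Step 3b: pick a word according to that topic's word distribution.
        w = int(rng.choice(len(vocab), p=topic_word[z]))
        topics.append(z)
        words.append(vocab[w])
    return words, topics, theta

# Hypothetical toy setup: topic 0 is "food", topic 1 is "cute animals".
vocab = ["broccoli", "bananas", "breakfast", "munching",
         "chinchillas", "kittens", "cute", "hamster"]
topic_word = np.array([
    [0.40, 0.25, 0.20, 0.15, 0.00, 0.00, 0.00, 0.00],  # food
    [0.00, 0.00, 0.00, 0.00, 0.25, 0.25, 0.25, 0.25],  # cute animals
])

rng = np.random.default_rng(0)
words, topics, theta = generate_document(topic_word, vocab, alpha=1.0,
                                         mean_length=5, rng=rng)
print(theta, list(zip(words, topics)))
```

Each printed pair shows a word together with the topic that produced it, which is exactly the hidden information LDA later has to infer.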

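Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated them.

For example, under the model above, when generating a particular document D, you would do the following: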

Pick the first word to come from the food topic, which then gives you the word "broccoli".

Pick the second word to come from the cute animals topic, which gives you "panda".

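Continuing in the same way for the remaining words, the process above simply mimics how a person writes a document. The document generated by the LDA model here is "broccoli panda adorable cherries eating" (note that LDA is a bag-of-words model).

Now suppose you have a set of documents and have chosen some fixed number K of topics to discover, and you want to use LDA to learn the topic representation of each document and the words associated with each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following: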

Go through each document, and randomly assign each word in the document to one of the K topics.

Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones). In other words, the random assignment is already a learned result of sorts; we improve it by iterating.

So to improve on them, for each document d:

....Go through each word w in d:

........And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t over all documents that come from this word w. Reassign w a new topic, where you choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word's topic with this probability). (Also, I'm glossing over a couple of things here, such as the use of priors/pseudocounts in these probabilities.)

........In other words, in this step, we're assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
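For reference, the priors/pseudocounts glossed over above are commonly written as follows. This assumes symmetric Dirichlet priors with parameters α (over topics per document) and β (over words per topic); those symbols, V for the vocabulary size, and the count notation are introduced here for illustration and do not appear in the original explanation.

```latex
% Collapsed Gibbs update for the topic z_i of the i-th word (which is word w in document d).
% The superscript -i means "counted with the current word's own assignment removed".
p(z_i = t \mid z_{-i}, \mathbf{w})
  \;\propto\;
  \underbrace{\frac{n_{d,t}^{-i} + \alpha}{n_{d}^{-i} + K\alpha}}_{p(\text{topic } t \,\mid\, \text{document } d)}
  \cdot
  \underbrace{\frac{n_{t,w}^{-i} + \beta}{n_{t}^{-i} + V\beta}}_{p(\text{word } w \,\mid\, \text{topic } t)}
% n_{d,t}: words in document d assigned to topic t;  n_d: all words in document d
% n_{t,w}: times word w is assigned to topic t;      n_t: all words assigned to topic t
```

With α = β = 0 this reduces to the plain proportions described in the text; the pseudocounts mainly keep a topic from getting probability zero just because it currently has no words in the document.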

After repeating the previous step a large number of times, you'll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated with each topic (by counting the proportion of words assigned to each topic overall).
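The whole procedure can be sketched as a small Python function. This is a minimal sketch rather than a reference implementation: the name `lda_gibbs`, the pseudocount values `alpha` and `beta`, the fixed iteration count (used in place of a real convergence check), and the toy documents are all assumptions made for illustration. Documents are given as lists of integer word ids.

```python
import numpy as np

def lda_gibbs(docs, num_topics, vocab_size, iterations=200,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling sketch for LDA; docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))     # words in doc d assigned to topic t
    topic_word = np.zeros((num_topics, vocab_size))   # times word w is assigned to topic t
    topic_total = np.zeros(num_topics)                # all words assigned to topic t

    # Step 1: randomly assign every word to one of the K topics.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = rng.integers(num_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

    # Step 2: repeatedly resample each word's topic, holding all others fixed.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current word from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # p(topic t | document d) * p(word w | topic t), with pseudocounts.
                # The document-side denominator is the same for every t, so it is dropped.
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) \
                    / (topic_total + vocab_size * beta)
                z = int(rng.choice(num_topics, p=p / p.sum()))
                # Record the new assignment and restore the counts.
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1

    # Step 3: turn the final counts into per-document and per-topic proportions.
    theta = (doc_topic + alpha) / (doc_topic.sum(axis=1, keepdims=True) + num_topics * alpha)
    phi = (topic_word + beta) / (topic_total[:, None] + vocab_size * beta)
    return theta, phi, assignments

# Hypothetical toy corpus over the vocabulary used in the topic examples.
vocab = ["broccoli", "bananas", "breakfast", "munching",
         "chinchillas", "kittens", "cute", "hamster"]
docs = [[0, 1, 2, 3, 0], [4, 5, 6, 7, 6], [0, 1, 5, 6, 3]]
theta, phi, _ = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
print(np.round(theta, 2))  # one row of topic proportions per document
```

Here `theta` holds the estimated topic mixture of each document and `phi` the word distribution of each topic. On a corpus this small the result depends heavily on the random seed, so the output is best read as a demonstration of the mechanics only.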