The popular tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.
tf: term frequency, the word’s frequency in a document.
idf: inverse document frequency, based on the inverse of the document frequency: if a word appears in N documents out of M total documents, its idf is log(M/N) (the raw ratio M/N is also sometimes used).
So if a word has a high frequency (tf) within a small number of documents (high idf), it can be used effectively to discriminate a class of documents. The end result is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents in the corpus.
Pros: It reduces each document to a fixed-length vector of size |V| (the vocabulary size).
Cons: If the vocabulary is large, the matrix is also large, and it grows with the number of documents; at the same time, tf-idf does not capture the inter- or intra-document statistical structure.
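The scheme above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the toy corpus is made up, and it uses the log(M/N) form of idf with a plain relative-frequency tf.

```python
import math
from collections import Counter

# Hypothetical toy corpus of M = 3 documents (pre-tokenized).
docs = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "apple cherry cherry".split(),
]

def tfidf(docs):
    M = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # df[w]: the number of documents containing w (the N in the text above)
    df = {w: sum(w in d for d in docs) for w in vocab}
    X = []  # one fixed-length (|V|) row of tf-idf values per document
    for d in docs:
        counts = Counter(d)
        n = len(d)
        # tf = count/n (word frequency in this document); idf = log(M/N)
        row = [(counts[w] / n) * math.log(M / df[w]) for w in vocab]
        X.append(row)
    return vocab, X

vocab, X = tfidf(docs)
```

Note that every document, whatever its length, maps to a vector of length |V| = 3 here, and a word appearing in all M documents gets idf = log(1) = 0, i.e. no discriminative weight.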
LSI (Latent semantic indexing)
LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. Meanwhile, it can capture some aspects of basic linguistic notions such as synonymy and polysemy.
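The linear subspace LSI identifies can be sketched with a truncated SVD. The matrix below is a made-up stand-in for a term-by-document tf-idf matrix X; the rank k is an assumed choice.

```python
import numpy as np

# Hypothetical term-by-document tf-idf matrix X (4 terms x 3 documents).
X = np.array([
    [0.8, 0.0, 0.7],
    [0.5, 0.6, 0.0],
    [0.0, 0.9, 0.4],
    [0.3, 0.0, 0.0],
])

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the top-k singular values: a rank-k subspace that captures
# most of the variance in the collection (the "latent semantic" space).
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Residual error of the rank-k approximation in Frobenius norm; by the
# Eckart-Young theorem this equals the norm of the dropped singular values.
err = np.linalg.norm(X - X_k)
```

Documents are then compared in the k-dimensional space spanned by the top singular vectors rather than in the full |V|-dimensional tf-idf space, which is what lets LSI conflate terms that co-occur (synonymy).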
pLSI (aspect model)
LDA (Latent Dirichlet Allocation)
A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution—in general an infinite mixture. This line of thinking leads to the LDA model.
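Concretely, de Finetti's theorem says that for an infinitely exchangeable sequence of random variables there exists a latent parameter \(\theta\) with some distribution \(p(\theta)\) such that the joint distribution is a mixture of i.i.d. draws:

\[
p(x_1, \dots, x_N) = \int p(\theta) \left( \prod_{i=1}^{N} p(x_i \mid \theta) \right) d\theta
\]

In LDA, the exchangeable variables are the words of a document, and the latent \(\theta\) plays the role of the document's topic proportions, drawn from a Dirichlet distribution.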