# TF-IDF Scheme

The popular tf-idf scheme reduces documents of arbitrary length to fixed-length lists of numbers.

tf: term frequency, the word's frequency within a document.

idf: inverse document frequency, the inverse of the document frequency: if a word appears in N documents out of M total documents, its idf is M/N (in practice usually log-scaled as log(M/N)).

So if a word appears frequently (high tf) in a small number of documents (high idf), it can be used effectively to discriminate a class of documents. The end result is a term-by-document matrix X whose columns contain the tf-idf values for each document in the corpus.

Pros: It reduces each document to a fixed-length vector of size |V|, the vocabulary size.

Cons: If the vocabulary is large, the matrix will also be very large, and it grows with the number of documents; at the same time, the scheme does not consider inter- or intra-document statistical structure.
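The scheme above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the toy corpus, whitespace tokenization, and the log-scaled idf variant are assumptions not fixed by the text.

```python
import math

# Toy corpus (an assumption for illustration)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

docs = [doc.split() for doc in corpus]          # naive whitespace tokenization
vocab = sorted({w for doc in docs for w in doc})
M = len(docs)                                    # total number of documents

def tf(word, doc):
    # term frequency: count of the word, normalized by document length
    return doc.count(word) / len(doc)

def idf(word):
    # inverse document frequency: log(M / N), where N is the number of
    # documents containing the word (log-scaling is the common variant)
    N = sum(1 for doc in docs if word in doc)
    return math.log(M / N)

# term-by-document matrix X: one row per vocabulary word,
# one column per document, holding tf-idf values
X = [[tf(w, doc) * idf(w) for doc in docs] for w in vocab]
```

Note that a word occurring in every document gets idf = log(1) = 0, so it contributes nothing to any column, which is exactly the "common words are uninformative" intuition.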

# LSI (Latent semantic indexing)

LSI uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection. In doing so, it can capture some aspects of basic linguistic notions such as synonymy and polysemy.
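A minimal SVD sketch of this idea, assuming a small made-up term-by-document matrix X and a latent dimensionality k chosen by hand:

```python
import numpy as np

# Made-up 5-term x 4-document tf-idf matrix (an assumption for illustration)
X = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
])

# Singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # keep only the top-k singular directions (the latent subspace)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of X

# Each document is now a point in the k-dimensional latent space
doc_latent = np.diag(s[:k]) @ Vt[:k, :]       # shape: k x (number of documents)
```

Truncating to rank k is what "captures most of the variance": the discarded singular values account for the approximation error, so documents using different words for the same concept can end up close together in the latent space.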

# pLSI (aspect model)

# LDA (Latent Dirichlet Allocation)

A classic representation theorem due to de Finetti (1990) establishes that any collection of exchangeable random variables has a representation as a mixture distribution—in general an infinite mixture. This line of thinking leads to the LDA model.
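The mixture view can be made concrete by simulating LDA's generative process. Everything below (the toy vocabulary, the number of topics, and the Dirichlet hyperparameters) is an assumption for illustration: per document, topic proportions theta are drawn from a Dirichlet, then each word is generated by first picking a topic from theta and then a word from that topic's distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and model sizes (assumptions, not from the text)
vocab = ["ball", "game", "vote", "party", "model", "data"]
K, V = 3, len(vocab)            # number of topics, vocabulary size
alpha = np.full(K, 0.1)         # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.full(V, 0.1), size=K)  # per-topic word distributions

def generate_document(n_words):
    # Per-document topic mixture: theta ~ Dirichlet(alpha).
    # This is the exchangeable-mixture structure de Finetti's theorem points to.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)     # draw a topic for this word
        w = rng.choice(V, p=beta[z])   # draw a word from that topic
        words.append(vocab[w])
    return words

doc = generate_document(10)
```

Because words are drawn independently given theta, they are exchangeable within a document, and integrating theta out yields exactly the (continuous) mixture distribution the theorem describes.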

## So, what do you think?