TF-IDF

TF : Term Frequency
IDF : Inverse Document Frequency : The formula is log(N / number of documents in which the word appears)
Here N is the total number of documents.
The intuition behind IDF is as follows. Suppose we have 100 documents, and the words “insurance” and “try” each occur 100 times in total across the corpus. “try” appears once in every document, whereas “insurance” appears in only 50 of the documents, multiple times in some of them.
IDF for each term (using log base 10):
insurance = log(100/50) = log(2) = 0.30
try = log(100/100) = log(1) = 0
Thus, although both words have the same frequency in the corpus, “insurance” gets more weight than “try”, because it is concentrated in fewer documents.
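The IDF numbers above can be checked with a few lines of Python. This is only a minimal sketch: the function name idf and its parameter names are made up for illustration, and it assumes a base-10 logarithm, since that is what gives log(2) = 0.30.

```python
import math

def idf(total_docs, docs_with_term):
    # Base-10 log is an assumption; it matches log(2) = 0.30 in the example.
    # docs_with_term must be at least 1, otherwise this divides by zero.
    return math.log10(total_docs / docs_with_term)

idf_insurance = idf(100, 50)   # "insurance" appears in 50 of 100 documents
idf_try = idf(100, 100)        # "try" appears in all 100 documents

print(round(idf_insurance, 2))  # 0.3
print(round(idf_try, 2))        # 0.0
```

A word that appears in every document always gets an IDF of 0, no matter how often it occurs overall.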
TF : Term Frequency : The weight of a term within a document is its term frequency. It is calculated as follows:
Number of times the word appears in the document / total number of words in the document
Consider a document : “How are you today , yes today”
term frequency of “today” = 2/6 = 0.33
term frequency of “yes” = 1/6 = 0.17
Hence the weight of the term “today” is greater than the weight of the term “yes”, as it occurs more often in the document.
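The same calculation can be sketched in Python. The helper term_frequency is a hypothetical name, and the tokenization is an assumption: the text is lowercased, split on whitespace, and punctuation-only tokens such as the stray comma are dropped so that the example document has 6 words.

```python
def term_frequency(term, document):
    # Assumed tokenization: lowercase, split on whitespace,
    # and keep only tokens that contain a letter or digit.
    tokens = [t for t in document.lower().split()
              if any(c.isalnum() for c in t)]
    return tokens.count(term.lower()) / len(tokens)

doc = "How are you today , yes today"

print(round(term_frequency("today", doc), 2))  # 0.33
print(round(term_frequency("yes", doc), 2))    # 0.17
```

A real tokenizer would handle punctuation attached to words as well; this one is only enough for the example sentence.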
TF-IDF is the product of the two: tf-idf = tf * idf
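Putting the two parts together, a rough end-to-end sketch could look like this. The two-document corpus and the function names tf, idf and tf_idf are invented for illustration; the log is base 10 and every queried term is assumed to appear in at least one document.

```python
import math

def tf(term, tokens):
    # Share of the document's words that are this term.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Number of documents that contain the term at least once.
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    # Assumes docs_with_term >= 1; base-10 log is an assumption.
    return math.log10(len(corpus) / docs_with_term)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Hypothetical toy corpus, already split into words.
corpus = [
    "try insurance insurance".split(),
    "try again".split(),
]

print(round(tf_idf("insurance", corpus[0], corpus), 2))  # 0.2
print(round(tf_idf("try", corpus[0], corpus), 2))        # 0.0
```

Here “try” scores 0 because it appears in every document, while “insurance” scores higher because it is frequent in one document and absent from the other.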