Generally speaking, the starting point for many text mining projects is the creation of a matrix of document word counts (i.e. the number of times a particular word appears in each document). How these frequencies are interpreted is then up to the analyst working on the project. Inverse document frequencies ($idf$) provide a useful way of measuring the importance of the occurrence of words in a document. In essence they give greater prominence to words that occur frequently in just a few documents within a corpus. For example, the analysis of medical interviews might reveal the common use of the word ‘unwell’ in many documents, and given the nature of the topic this fact does not add much new information. However, when the word ‘headache’ appears frequently in a small subset of documents, it is probably providing rather more information about those particular documents.
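As a rough illustration of such a matrix of document word counts, here is a minimal Python sketch. The toy corpus, the whitespace tokenisation and the function name `count_matrix` are all assumptions made for the example; a real project would normally use a proper tokeniser and a sparse matrix.

```python
from collections import Counter

# Hypothetical toy corpus (not real interview data); one string per document.
docs = [
    "patient felt unwell with a headache and the headache persisted",
    "patient felt unwell after lunch",
    "patient reported feeling unwell again",
]

def count_matrix(documents):
    """Return (vocab, matrix), where matrix[j][i] is the count of word i in document j."""
    # Naive tokenisation: split on whitespace only (an assumption for this sketch).
    counts = [Counter(doc.split()) for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocab = sorted(set().union(*counts))
    # A Counter returns 0 for words that do not occur in a document.
    matrix = [[c[word] for word in vocab] for c in counts]
    return vocab, matrix

vocab, matrix = count_matrix(docs)
```

Each row of `matrix` corresponds to a document and each column to a word in `vocab`, so the column for ‘unwell’ is non-zero in every row while the column for ‘headache’ is non-zero only in the first.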
The full expression for inverse document frequencies is:
[latexpage]
\[
idf(i,j) = (1 + \log(wf_{i,j})) \log(N/df_i)
\]
where $N$ is the total number of documents, $i$ is an index on words and $j$ is an index on documents. $wf_{i,j}$ is the frequency of the $i$th word in the $j$th document, and $df_i$ is the document frequency of the $i$th word, that is, the number of documents in which it appears. The name ‘inverse document frequency’ comes from the fact that the document frequency enters the formula through its inverse ($N/df_i$).
The $1 + \log(wf_{i,j})$ term ensures that documents with a large number of instances of a given word do not dwarf the $\log(N/df_i)$ term, which is zero in any case if $df_i$ is equal to $N$. Words that appear in every document are thus given an $idf$ of zero, which may or may not be an important fact. Words that appear with a high frequency in a single document are given the highest rating. On the whole this provides useful input for data mining algorithms.
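The behaviour described above can be checked with a small Python sketch of the formula. The function name `weight`, the use of the natural logarithm, and the convention that a word absent from a document gets a weight of zero are assumptions of this example; the formula itself does not fix the base of the logarithm or say what happens when the word frequency is zero.

```python
import math

def weight(wf, df, n_docs):
    """Compute (1 + log(wf)) * log(N / df) for one word in one document.

    wf     -- frequency of the word in the document
    df     -- number of documents containing the word
    n_docs -- total number of documents, N
    """
    # log(0) is undefined, so an absent word is given weight 0 (an assumption).
    if wf == 0:
        return 0.0
    # Natural logarithm assumed; another base only rescales the weights.
    return (1 + math.log(wf)) * math.log(n_docs / df)

# A word found in all 100 documents scores zero however often it occurs.
common = weight(3, 100, 100)

# A word found in only 5 of 100 documents scores above zero, and a tenfold
# increase in its count raises the weight by far less than tenfold.
rare_low = weight(2, 5, 100)
rare_high = weight(20, 5, 100)
```

Comparing `rare_high` with `rare_low` shows the damping effect of the logarithm on the word frequency, while `common` shows the document frequency term cancelling the weight entirely when $df_i$ equals $N$.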
This metric is supported in the open-source search platforms Solr and Lucene and is also a feature of data mining tools such as Statistica, RapidMiner and SAS.