What are some examples of how TF-IDF works?

Here are two examples of Term Frequency-Inverse Document Frequency (TF-IDF).

Example 1: Suppose we have a collection of documents that includes a news article about the Olympic Games.

The article mentions the words “gold,” “medal,” and “winner” several times, while the words “bronze,” “third,” and “place” each appear only once. Using TF-IDF, we can determine which words are most important in this document relative to the rest of the collection.

TF (term frequency) is calculated as the number of times a word appears in the document, divided by the total number of words in the document.
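The TF definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `term_frequency`, the whitespace tokenizer, and the ten-word sample text are all assumptions made for the example.

```python
from collections import Counter

def term_frequency(document: str) -> dict:
    """Return each word's count divided by the total number of words."""
    # Naive lowercase/whitespace tokenizer (an assumption for this sketch).
    words = document.lower().split()
    if not words:
        return {}
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

# Hypothetical miniature "article", invented for illustration only.
article = "gold medal winner gold medal winner gold bronze third place"
tf = term_frequency(article)
print(tf["gold"])    # "gold" is 3 of the 10 words
print(tf["bronze"])  # "bronze" is 1 of the 10 words
```

Real text would need a better tokenizer (punctuation, stop words, stemming), but the ratio itself stays the same.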

In this case, the TF for the word “gold” is high because it appears frequently in the article. Conversely, the TF for the word “bronze” is low because it appears only once.

IDF (inverse document frequency) is based on the total number of documents in the collection divided by the number of documents that contain the word. In practice, the logarithm of this ratio is usually taken so that very rare words do not dominate the score. Suppose that “gold” appears in many documents in the collection, while “bronze” appears in only a few. The IDF for “gold” is then relatively low, and the IDF for “bronze” is higher.
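As a rough sketch of the IDF calculation, the snippet below uses the log-scaled form of the ratio, which is a common convention rather than something fixed by the description above. The function name `inverse_document_frequency`, the four-document corpus, and the choice to return 0 for unseen terms are assumptions for illustration.

```python
import math

def inverse_document_frequency(term: str, documents: list) -> float:
    """Return log(N / df), where N is the number of documents and
    df is the number of documents that contain the term."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc.lower().split())
    if doc_freq == 0:
        return 0.0  # convention chosen here for terms that never occur
    return math.log(n_docs / doc_freq)

# Hypothetical corpus, invented for illustration only.
corpus = [
    "gold medal winner gold bronze",
    "gold prices rise again",
    "gold rush history",
    "city council meeting",
]
idf_gold = inverse_document_frequency("gold", corpus)      # in 3 of 4 documents
idf_bronze = inverse_document_frequency("bronze", corpus)  # in 1 of 4 documents
print(idf_gold < idf_bronze)  # the rarer word gets the larger IDF
```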

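By multiplying the TF and IDF values, we can calculate the TF-IDF score for each word. This score indicates the importance of the word in the context of the document and the entire collection. Because the two factors pull in opposite directions, the result depends on their balance. In this example, assuming the high TF of “gold” outweighs its lower IDF, the TF-IDF score for “gold” would be relatively high, while the score for “bronze,” which appears only once in the article, would be relatively low.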
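The multiplication step can be sketched as follows. The corpus is invented so that “gold” appears in two of four documents and “bronze” in one, the IDF is log-scaled (an assumed, common variant), and `tf_idf_scores` is a hypothetical helper name.

```python
import math
from collections import Counter

def tf_idf_scores(corpus: list, index: int) -> dict:
    """Score every word of corpus[index] as TF * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    words = tokenized[index]
    if not words:
        return {}
    scores = {}
    for word, count in Counter(words).items():
        tf = count / len(words)
        # df is at least 1 because the scored document is part of the corpus.
        df = sum(1 for doc_words in tokenized if word in doc_words)
        scores[word] = tf * math.log(len(corpus) / df)
    return scores

# Hypothetical corpus; the first entry plays the role of the Olympics article.
corpus = [
    "gold medal winner gold medal winner gold bronze third place",
    "gold prices climbed on monday",
    "the council approved the budget",
    "rain is expected this weekend",
]
scores = tf_idf_scores(corpus, 0)
print(scores["gold"] > scores["bronze"])
```

With a different corpus, for instance one where “gold” appears in nearly every document, the ordering could flip, which is why the outcome depends on the balance between TF and IDF.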

Example 2: Consider a collection of customer reviews for a restaurant. Suppose we want to identify which words are most characteristic of how customers describe the food. Using TF-IDF, we can calculate the importance of each word in the context of the entire collection.

For a given document (i.e., a single review), the TF value for a word is calculated as the number of times the word appears in the review, divided by the total number of words in the review. The IDF value for a word is based on the total number of reviews in the collection, divided by the number of reviews that contain the word (usually log-scaled).

By multiplying the TF and IDF values, we can obtain the TF-IDF score for each word in the review. We can then aggregate the TF-IDF scores for each word across all reviews to identify the most distinctive words in the collection. Words that appear in nearly every review, such as “the,” receive low scores, so the words with the highest TF-IDF scores are likely to be the most important and relevant to the topic at hand (i.e., the restaurant’s food).
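One possible way to aggregate scores across reviews is to sum each word's TF-IDF over every review, as sketched below. Summation is an assumption (averaging or taking the maximum are equally plausible), and the three reviews, the log-scaled IDF, and the name `aggregate_tf_idf` are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def aggregate_tf_idf(reviews: list) -> dict:
    """Sum each word's TF-IDF score over all reviews."""
    tokenized = [review.lower().split() for review in reviews]
    n_reviews = len(tokenized)
    # Document frequency: number of reviews that contain each word.
    doc_freq = Counter(word for words in tokenized for word in set(words))
    totals = defaultdict(float)
    for words in tokenized:
        if not words:
            continue
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_reviews / doc_freq[word])
            totals[word] += tf * idf
    return dict(totals)

# Hypothetical reviews, invented for illustration only.
reviews = [
    "the pasta was delicious",
    "the service was slow",
    "the pasta was fresh and the sauce was delicious",
]
totals = aggregate_tf_idf(reviews)
top_words = sorted(totals, key=totals.get, reverse=True)[:3]
print(top_words)
print(totals["the"])  # "the" occurs in every review, so its IDF is log(1) = 0
```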

Term Frequency-Inverse Document Frequency (TF-IDF)

TF-IDF is a numerical statistic used to measure how important a word is to a document in a collection of documents.

It is commonly used in information retrieval and text mining.

The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
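One common way to write this down is shown below. It is a sketch of a widely used formulation, not the only one: here f_{t,d} is the number of times term t occurs in document d, N is the number of documents in the corpus D, and the logarithm is an assumed (though typical) choice of scaling.

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t, D) = \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert}, \qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
\]
```

The first factor grows with the number of occurrences in the document, and the second shrinks as the term appears in more documents of the corpus.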

Example 1: The cat jumped over the moon

TF-IDF scores:

  • Cat: 0.50
  • Moon: 0.50
  • Jumped: 0.33
  • Over: 0.33

Example 2: The dog chased the cat

TF-IDF scores:

  • Dog: 0.50
  • Cat: 0.50
  • Chased: 0.33
  • The: 0.33