Term Frequency ([[TF]]) is a fundamental concept in natural language processing and information retrieval that measures how often a particular word (or term) appears within a document.
**What is Term Frequency?**
- **Raw Count or Normalized:** TF can be expressed either as a raw count of how many times a term appears in a document or as a normalized value. Normalization accounts for document length, so that longer documents do not get higher TF values simply because they contain more words.
- **Part of the TF-IDF Metric:** Term Frequency is one essential component of the TF-IDF (Term Frequency-Inverse Document Frequency) metric, widely used in determining how important a word is to a document within a larger collection.
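
As a rough illustration of how TF fits into TF-IDF, one common textbook formulation is sketched below. The symbols are assumptions introduced here, not taken from this note: $t$ is a term, $d$ a document, $D$ the collection, and $N$ the number of documents in $D$. Libraries differ in smoothing and logarithm base, so treat this as one variant rather than the definition.

$$
\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D), \qquad \text{idf}(t, D) = \log \frac{N}{\left|\{\, d' \in D : t \in d' \,\}\right|}
$$
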
**How to Calculate Term Frequency**
1. **Raw Count:** Count the number of times the chosen term appears in a document.
2. **Normalization:** A common normalization technique is to divide the raw count by the total number of words in the document. This gives a proportion representing how frequently the term appears relative to the document's length.
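
The two steps above can be sketched in Python. This is a minimal sketch, not a reference implementation: the helper names `tokenize`, `raw_tf`, and `normalized_tf` are hypothetical, and the tokenizer assumes lowercasing, whitespace splitting, and stripping of surrounding punctuation, which is only one of many possible choices.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase, split on whitespace, strip surrounding punctuation.
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def raw_tf(term, tokens):
    # Step 1: raw count of the term in the token list.
    return Counter(tokens)[term]


def normalized_tf(term, tokens):
    # Step 2: raw count divided by the total number of tokens.
    if not tokens:
        return 0.0  # avoid division by zero for an empty document
    return raw_tf(term, tokens) / len(tokens)


tokens = tokenize("A dog chased a ball.")
print(raw_tf("a", tokens))         # 2
print(normalized_tf("a", tokens))  # 0.4
```

Returning `0.0` for an empty document is a design choice made for this sketch; raising an error would be equally defensible.
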
**Why Term Frequency Matters**
- **Intuitive Measure:** It provides a basic understanding of which words are prominent within a document.
- **Building Block for TF-IDF:** TF is one of the two factors in TF-IDF, so the TF-IDF value cannot be computed without it.
- **Identifying Keywords:** Words with a high TF within a document are often, though not always, relevant keywords or important topics.
**Important Considerations**
- **Common Words:** Terms like "the" or "and" will have high TFs but carry little meaning. This is why TF is balanced against IDF in the TF-IDF calculation.
- **Different Terms:** TF can be calculated for single words (unigrams), or combinations of words (bigrams, n-grams).
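
To illustrate counting TF over n-grams rather than single words, here is a small Python sketch. The function names `ngrams` and `ngram_tf` are hypothetical, and the input is assumed to be already tokenized and lowercased.

```python
from collections import Counter


def ngrams(tokens, n):
    # Slide a window of size n over the tokens; each n-gram is a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_tf(tokens, n):
    # Raw-count TF for every n-gram in the token list.
    return Counter(ngrams(tokens, n))


tokens = "the cat sat on the mat".split()
bigram_counts = ngram_tf(tokens, 2)
print(bigram_counts[("the", "cat")])  # 1
```

With `n = 1` the same function gives ordinary unigram counts, keyed by one-element tuples such as `("the",)`.
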
**Example**
In the sentence "The cat sat on the mat", treating "The" and "the" as the same term (case-insensitive matching), the raw-count TF for some words is:
- "the": TF = 2
- "cat": TF = 1
- "on": TF = 1
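
For comparison, the normalized values for the same sentence can be worked out by dividing each raw count by the sentence length. This assumes the sentence is split into six word tokens and that matching is case-insensitive; the notation $\text{tf}(\cdot)$ is introduced here for illustration.

$$
\text{tf}(\text{the}) = \frac{2}{6} \approx 0.33, \qquad \text{tf}(\text{cat}) = \frac{1}{6} \approx 0.17, \qquad \text{tf}(\text{on}) = \frac{1}{6} \approx 0.17
$$
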
# References
```dataview
Table title as Title, authors as Authors
where contains(subject, "Term Frequency") or contains(subject, "TF")
sort modified desc, authors, title
```