You can try all the analytics and layout implementations live in IPython/Jupyter: not a static notebook, but a fully interactive one on Google Colab.
- Topic: Exploring two fascinating statistical patterns: Benford's Law and Zipf's Law.
- Context: Applying these laws to the `movies_subtitles.csv` dataset.
- Goal: Understand how these non-intuitive laws manifest in real-world data.
- Concept: Benford's Law describes the frequency distribution of leading digits in many real-life sets of numerical data.
- Prediction: The law states that the digit '1' appears as the leading digit about 30% of the time, '2' about 17.6%, and so on, with higher digits appearing less frequently.
- Formula: P(d) = log10(1 + 1/d) for d ∈ {1, 2, ..., 9}
- Intuition: It often applies to data that spans several orders of magnitude.
- Question: What is the probability that the leading digit of a number (in a naturally occurring dataset) is 1?
- Benford's Answer: Approximately 30.1%
- Contrast: If digits were uniformly distributed, we'd expect ~11.1% (1/9).
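The formula above is easy to check numerically. The following is a minimal sketch that evaluates P(d) for each digit; the function name `benford_probability` is an illustrative choice, not something taken from the original notebook.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that d (1-9) is the leading digit under Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Expected share of each leading digit, as a fraction of 1.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"digit {d}: {p:.1%}")
```

The nine probabilities sum to 1 because the terms log10((d + 1) / d) telescope to log10(10).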
- Hypothesis: Numerical data within `movies_subtitles.csv` (e.g., start/end timestamps, potentially other numeric fields if available) might follow Benford's Law.
- Data Column(s) for Analysis: `start_time` (subtitle start times in seconds)
- Analysis:
  - Extract the `start_time` numerical data.
  - Isolate the first significant digit of each number.
  - Calculate the frequency distribution of these leading digits (1-9).
  - Compare the observed distribution to Benford's predicted distribution.
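The analysis steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names and the sample values are hypothetical, and loading the column from the CSV is left out.

```python
from collections import Counter


def leading_digit(x):
    """Return the first significant digit (1-9) of x, or None if there is none."""
    # repr() gives the shortest exact decimal form, e.g. '0.0042' or '1e-05';
    # dropping the exponent leaves only the significant digits and zeros.
    mantissa = repr(abs(x)).split("e")[0]
    for ch in mantissa:
        if ch in "123456789":
            return int(ch)
    return None  # zero, NaN, and infinity have no leading digit


def leading_digit_distribution(values):
    """Share of each leading digit 1-9 among the values that have one."""
    counts = Counter(d for d in map(leading_digit, values) if d is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Hypothetical start times in seconds, for illustration only.
sample_start_times = [0.5, 12.0, 135.2, 1804.75, 2999.9, 45.1, 0.0]
observed = leading_digit_distribution(sample_start_times)
print(observed)
```

In a notebook, the values might come from something like `pandas.read_csv("movies_subtitles.csv")["start_time"].dropna()`; that loading step is an assumption. The resulting shares can then be compared digit by digit with log10(1 + 1/d).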
- Dataset Used: `movies_subtitles.csv`
- Column Analyzed: `start_time`
- Observed vs. Expected Frequencies (from 10,358,275 valid start times):
  - Digit 1: 20.1% (Expected: 30.1%)
  - Digit 2: 19.5% (Expected: 17.6%)
  - Digit 3: 18.8% (Expected: 12.5%)
  - Digit 4: 16.9% (Expected: 9.7%)
  - Digit 5: 11.1% (Expected: 7.9%)
  - Digit 6: 5.7% (Expected: 6.7%)
  - Digit 7: 3.3% (Expected: 5.8%)
  - Digit 8: 2.5% (Expected: 5.1%)
  - Digit 9: 2.2% (Expected: 4.6%)
- Visualization:

      pie title Observed Frequencies
        "1 (Exp: 30.1%)" : 20.1
        "2 (Exp: 17.6%)" : 19.5
        "3 (Exp: 12.5%)" : 18.8
        "4 (Exp: 9.7%)" : 16.9
        "5 (Exp: 7.9%)" : 11.1
        "6 (Exp: 6.7%)" : 5.7
        "7 (Exp: 5.8%)" : 3.3
        "8 (Exp: 5.1%)" : 2.5
        "9 (Exp: 4.6%)" : 2.2
- Conclusion: The `start_time` data does not conform closely to Benford's Law. Frequencies do fall as the digit rises, but the shape is wrong: digit '1' is heavily underrepresented (20.1% observed vs. 30.1% expected), digits 2-5 are overrepresented, and digits 6-9 are underrepresented. A likely cause is the nature of subtitle timings: start times are bounded by each film's running time, so they do not spread evenly across several orders of magnitude.
[Figure: observed leading-digit distribution compared with the expected Benford's Law distribution]
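To put a rough number on the deviation, the reported percentages can be compared with the Benford expectations directly. The sketch below uses the observed values listed earlier. The mean absolute deviation is chosen here only as a simple illustrative summary; the original analysis does not state which measure, if any, it used.

```python
import math

# Observed leading-digit shares (percent) for start_time, as reported above.
observed = {1: 20.1, 2: 19.5, 3: 18.8, 4: 16.9, 5: 11.1,
            6: 5.7, 7: 3.3, 8: 2.5, 9: 2.2}

# Expected shares (percent) from P(d) = log10(1 + 1/d).
expected = {d: 100 * math.log10(1 + 1 / d) for d in range(1, 10)}

# Positive difference: digit is overrepresented; negative: underrepresented.
differences = {d: observed[d] - expected[d] for d in range(1, 10)}

# Mean absolute deviation in percentage points (illustrative measure only).
mad = sum(abs(v) for v in differences.values()) / len(differences)

for d in range(1, 10):
    print(f"digit {d}: observed {observed[d]:5.1f}%  "
          f"expected {expected[d]:5.1f}%  diff {differences[d]:+5.1f}")
print(f"mean absolute deviation: {mad:.2f} percentage points")
```

The signs of the differences make the pattern explicit: negative for digit 1, positive for digits 2-5, and negative again for digits 6-9.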
- Concept: Zipf's Law relates the frequency of a word in a text corpus to its rank in the frequency table.
- Prediction: The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
- Formula: Frequency ≈ Constant / Rank
- Observation: A few words are used very frequently, while many words are used rarely.
- Question: How does the frequency of words relate to their rank in a large body of text?
- Zipf's Answer: The frequency is inversely proportional to the rank.
- Example: If "the" is the 1st most common word, "of" (2nd) should appear about 1/2 as often, and "and" (3rd) about 1/3 as often.
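The inverse relationship in the formula can be illustrated with a few lines of code. This sketch assumes the idealized form f(r) = C / r with C set to the frequency of the top-ranked word. The function name and the count of 60,000 are hypothetical and are not taken from the dataset.

```python
def zipf_expected(top_frequency: float, max_rank: int) -> list[float]:
    """Idealized Zipf frequencies f(r) = C / r for ranks 1..max_rank, with C = top_frequency."""
    return [top_frequency / rank for rank in range(1, max_rank + 1)]


# Hypothetical count for the rank-1 word; not a figure from the dataset.
expected = zipf_expected(60_000, 3)

for rank, freq in enumerate(expected, start=1):
    print(f"rank {rank}: {freq:.0f}")
```

Under this idealization, the second-ranked word gets half the top count and the third-ranked word gets a third of it.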
- Hypothesis: The frequency of words used in the movie subtitles text should follow Zipf's Law.
- Data Column for Analysis: `text` (the subtitle text content)
- Analysis:
  - Extract all text from the `text` column.
  - Tokenize the text into words (lowercase, remove punctuation/numbers, remove common English stop words and subtitle artifacts like 'sighs').
  - Count the frequency of each unique word.
  - Rank the words from most frequent to least frequent.
  - Plot word frequency against word rank using log-log scales.
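The tokenizing, counting, and ranking steps above might look something like the sketch below. It is a sketch under stated assumptions, not the pipeline actually used: the stop-word list is a tiny placeholder (the real list is not given), and the sample lines are invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word/artifact list; the list used in the analysis is not given.
STOP_WORDS = {"the", "a", "and", "of", "to", "i", "you", "it", "is", "sighs"}


def tokenize(text: str) -> list[str]:
    """Lowercase, drop apostrophes, keep alphabetic runs, remove stop words."""
    text = re.sub(r"['’]", "", text.lower())  # "Don't" -> "dont"
    words = re.findall(r"[a-z]+", text)       # discards punctuation and numbers
    return [w for w in words if w not in STOP_WORDS]


def rank_words(lines):
    """Return (rank, word, frequency) tuples, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]


# Invented subtitle lines, for illustration only.
sample_lines = ["I don't know.", "You know, I don't like it!", "(sighs) Don't go."]
ranked = rank_words(sample_lines)

for rank, word, freq in ranked:
    print(rank, word, freq)
```

Removing apostrophes before splitting would yield tokens such as `dont` and `im`, which matches the form of the words in the reported results. Whether the original tokenizer worked exactly this way is an assumption.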
- Dataset Used: `movies_subtitles.csv`
- Column Analyzed: `text`
- Analysis Summary: Analyzed 34,611,884 word occurrences after filtering, covering 306,477 unique words.
- Top 10 Words & Frequencies:

  | Rank | Word | Frequency |
  | --- | --- | --- |
  | 1 | dont | 417556 |
  | 2 | im | 406630 |
  | 3 | know | 384426 |
  | 4 | get | 308548 |
  | 5 | like | 286525 |
  | 6 | go | 284509 |
  | 7 | right | 266279 |
  | 8 | come | 253165 |
  | 9 | youre | 243768 |
  | 10 | well | 234644 |

- Visualization (Frequency Share for Top 10 Words):

      pie title Frequency Share of Top 10 Words
        "dont (418k)" : 417556
        "im (407k)" : 406630
        "know (384k)" : 384426
        "get (309k)" : 308548
        "like (287k)" : 286525
        "go (285k)" : 284509
        "right (266k)" : 266279
        "come (253k)" : 253165
        "youre (244k)" : 243768
        "well (235k)" : 234644
- Conclusion: The word distribution in the `text` column conforms well to Zipf's Law. The log-log plot shows a clear linear relationship between log(rank) and log(frequency) with a negative slope, as predicted by the law. This indicates a typical natural language distribution where a few words dominate in frequency.
The relationship between word rank and frequency is often visualized on a log-log scale, where Zipf's Law predicts a near-linear downward trend.
- Observation: The plot shows the top words clustering along a roughly linear path, consistent with Zipf's Law. The calculated trendline slope is close to the ideal value of -1.
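A trendline slope of this kind can be estimated with an ordinary least-squares fit of log(frequency) against log(rank). The sketch below is not the trendline from the original plot: it fits only the ten frequencies listed in the table, using a hand-written helper whose name is hypothetical. A slope from ten ranks, taken after stop-word removal, need not match the slope of the full rank-frequency curve.

```python
import math

# Frequencies of the ten most common words, in rank order, from the table above.
top10 = [417556, 406630, 384426, 308548, 286525,
         284509, 266279, 253165, 243768, 234644]


def loglog_slope(frequencies):
    """Least-squares slope and intercept of log10(frequency) vs. log10(rank)."""
    xs = [math.log10(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log10(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = loglog_slope(top10)
print(f"slope over the top 10 ranks: {slope:.3f}")

# Sanity check: frequencies that follow C / rank exactly give a slope of -1.
ideal_slope, _ = loglog_slope([1000 / rank for rank in range(1, 51)])
print(f"slope for an ideal Zipf series: {ideal_slope:.3f}")
```

Running the same helper over all 306,477 ranked words, or over a chosen range of ranks, would give a figure comparable to a trendline over the whole plot.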
- Summary: We investigated the applicability of Benford's Law to subtitle start times and Zipf's Law to word frequencies within the `movies_subtitles.csv` dataset.
- Key Findings:
  - Benford's Law & Start Times: The distribution of leading digits in `start_time` deviated significantly from Benford's Law expectations, as the generated plot shows. This suggests that subtitle timings might not represent the kind of naturally occurring, logarithmically distributed data spanning multiple orders of magnitude where Benford's Law typically holds (e.g., they might be clustered or follow a different pattern).
  - Zipf's Law & Word Frequency: In contrast, the `text` data closely followed Zipf's Law. The analysis confirmed the characteristic inverse relationship in which a few words dominate in frequency and frequency drops rapidly with rank, which is typical for natural language corpora.
- Broader Implications:
  - This comparison highlights the specific conditions under which these statistical laws apply. Benford's Law is a powerful tool for detecting anomalies or fraud in certain numerical datasets (like financial records), but its applicability depends heavily on the data generation process.
  - Zipf's Law's robustness in natural language underpins many Natural Language Processing (NLP) techniques, from search engine algorithms and text compression to keyword analysis and language modeling. Our findings reinforce its relevance even in the specific context of movie subtitles.
- Avenues for Further Exploration:
  - Could Benford's Law apply more effectively to other potential numerical data related to movies (e.g., budget, runtime, box office gross)?
  - How does the Zipfian distribution change if we analyze dialogue separately from descriptive text within the subtitles?
  - Do word frequency patterns (Zipf's Law parameters) differ significantly across different movie genres present in the dataset?
  - Could the deviation from Benford's Law in `start_time` itself reveal interesting patterns about editing or pacing conventions in film?