The data this week comes from Vincent Arel-Bundock's [Rdatasets package](https://vincentarelbundock.github.io/Rdatasets/index.html).
Rdatasets is a collection of 2246 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
We're working with the spam email dataset, a subset of the spam e-mail database collected at Hewlett-Packard Labs by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt and shared with the UCI Machine Learning Repository. The dataset classifies 4601 e-mails as spam or non-spam, with additional variables indicating the frequency of certain words and characters in the e-mail.
```r
# Option 1: tidytuesdayR package
# install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2023-08-15')
## OR
tuesdata <- tidytuesdayR::tt_load(2023, week = 33)

spam <- tuesdata$spam

# Option 2: Read directly from GitHub
spam <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-15/spam.csv')
```
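Either option should leave you with a `spam` tibble. A quick sanity check, as a minimal sketch (it assumes you have already run one of the loading options above and have dplyr installed):

```r
library(dplyr)

# Confirm the expected shape: 4601 rows, 7 columns
glimpse(spam)

# How many messages fall into each class?
spam |>
  count(yesno)
```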
|variable |class     |description |
|:--------|:---------|:-----------|
|crl.tot  |double    |Total length of uninterrupted sequences of capitals |
|dollar   |double    |Occurrences of the dollar sign, as percent of total number of characters |
|bang     |double    |Occurrences of '!', as percent of total number of characters |
|money    |double    |Occurrences of 'money', as percent of total number of characters |
|n000     |double    |Occurrences of the string '000', as percent of total number of words |
|make     |double    |Occurrences of 'make', as a percent of total number of words |
|yesno    |character |Outcome variable, a factor with levels 'n' not spam, 'y' spam |
The first column of the original CSV was removed during cleaning.
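A minimal sketch of that cleaning step, assuming the raw file is the CSV export from Rdatasets, whose first column is an automatically added row-name index (the source URL and the column position are assumptions for illustration, not taken from the original post):

```r
library(readr)
library(dplyr)

# Raw CSV export from Rdatasets (URL is an assumption for illustration)
spam_raw <- read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/DAAG/spam7.csv")

# Drop the first column (the row-name index added by the CSV export)
spam <- spam_raw |>
  select(-1)

write_csv(spam, "spam.csv")
```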
- Explore the data, watching out for interesting relationships. We would like to emphasize that you should not draw conclusions about causation from these data. There are various moderating variables that affect all data, many of which might not have been captured in these datasets. As such, our suggestion is to use the data provided to practice your data tidying and plotting techniques, and to consider for yourself what nuances might underlie these relationships. (A starter plotting sketch follows this list.)
- Create a visualization, a model, a shiny app, or some other piece of data-science-related output, using R or another programming language.
- Share your output and the code used to generate it on social media with the #TidyTuesday hashtag.
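As a starting point for the exploration bullet above, here is a minimal ggplot2 sketch. The choice of the `bang` variable, the small offset, and the log scale are my own assumptions for illustration, not part of the original prompt:

```r
library(dplyr)
library(ggplot2)

# Compare exclamation-mark frequency between spam and non-spam messages.
# A small offset is added before the log scale because many values are zero.
spam |>
  mutate(yesno = factor(yesno, levels = c("n", "y"), labels = c("not spam", "spam"))) |>
  ggplot(aes(x = yesno, y = bang + 0.01)) +
  geom_boxplot() +
  scale_y_log10() +
  labs(
    x = NULL,
    y = "Occurrences of '!' (% of characters, log scale)",
    title = "Distribution of '!' frequency by class"
  )
```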