The cleansema package is designed to perform the initial cleaning of data collected using the SEMA smartphone application. All you need to do is use the clean_sema function in R by pointing it at the directory containing your raw data, and it will give clean, processed data as output.
You'll first want to install R and RStudio (see here for a quick walkthrough of that process, which is quite simple).
Once you have RStudio installed and open, using the cleansema package is simple. The first time you want to use it, you'll need to run the following code to install it. Paste this code into the RStudio console and press Enter:
install.packages('devtools')
library(devtools)
install_github("seanchrismurphy/cleansema")
After that, you're ready to use the clean_sema function on your data. This function needs only one input: the folder that contains your raw data.
NOTE: SEMA exports a zipped file. Once you unzip that file into a folder, the files in that folder are the raw data files that clean_sema expects. Nothing more needs to be done to them, except deleting any you don't want read in.
For example, running the code below will take any files in the 'Raw Sema Data' folder, process them, and load them in R as a dataset called clean_data. Adjust the input parameter as needed (note that Windows users will need to use forward slashes, not the backslashes that Windows uses by default):
require(cleansema)
clean_data <- clean_sema(input = 'Users/Sean/Raw Sema Data/')
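For Windows users, the slash conversion mentioned above can be done in R itself. The snippet below is a small sketch in base R; the path is a made-up example, and note that backslashes have to be doubled inside an ordinary R string.

```r
# Hypothetical Windows-style path, as it might be copied from File Explorer.
# Each backslash is written twice because "\" is the escape character in R.
windows_path <- "C:\\Users\\Sean\\Raw Sema Data"

# Replace every backslash with a forward slash (fixed = TRUE treats the
# pattern as a literal backslash rather than a regular expression).
input_dir <- gsub("\\", "/", windows_path, fixed = TRUE)
print(input_dir)
```

The resulting string can then be passed as the input argument of clean_sema.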
Once the data is loaded into R, you can either work on it there, or export it as a clean .csv file, suitable for import into your statistical package of choice, using the code below. You'll want to give the full file path and the name of the .csv file you want to save. The row.names = FALSE argument simply stops R from writing its own row numbers as an extra column:
write.csv(clean_data, 'Users/Sean/Clean Data/cleaned data.csv', row.names = FALSE)
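To see what row.names = FALSE does, here is a small self-contained comparison. It uses a toy data frame rather than real SEMA output (the values are made up) and writes to temporary files so nothing on your machine is overwritten.

```r
# Toy stand-in for cleaned data; the values are illustrative only.
toy_data <- data.frame(sema_id = c("a1", "a1", "b2"), rownr = c(1, 2, 1))

with_row_names    <- tempfile(fileext = ".csv")
without_row_names <- tempfile(fileext = ".csv")

write.csv(toy_data, with_row_names)                        # default: row names are written
write.csv(toy_data, without_row_names, row.names = FALSE)  # row names are left out

# Reading the files back shows the extra leading column in the first file.
cols_with    <- names(read.csv(with_row_names))
cols_without <- names(read.csv(without_row_names))
print(cols_with)
print(cols_without)
```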
There are a few optional extras available - you can have clean_sema set data to missing based on a certain reaction time threshold, and you can choose a specific value for missing data if you'd like it to be something other than NA. To see how to use these, just type ?clean_sema at the R console after you've run the require(cleansema) line of code.
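As a rough sketch, a call with reaction time trimming switched on might look like the code below. The argument names rt.trim, rt.min and rt.threshold are taken from the description of the cleaning steps in this document, and the values shown are the defaults it mentions; treat the exact usage as an assumption and check ?clean_sema for the authoritative version. The folder path is a placeholder, and the call is skipped if the package or folder is not available.

```r
# Placeholder folder; replace with the location of your unzipped SEMA export.
input_dir <- "Users/Sean/Raw Sema Data/"

# Only attempt the call when the package is installed and the folder exists.
if (requireNamespace("cleansema", quietly = TRUE) && dir.exists(input_dir)) {
  trimmed_data <- cleansema::clean_sema(
    input        = input_dir,
    rt.trim      = TRUE,  # switch on reaction time trimming
    rt.min       = 500,   # responses faster than 500 ms count as too fast
    rt.threshold = 0.5    # surveys with more than 50% too-fast responses are set to missing
  )
}
```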
And that's all there is to it!
The clean_sema function works in several steps, outlined below.
- The individual .csv files (that sema creates for each version of the study) are joined together.
- Column names are changed to lowercase, and the participant_id column is renamed to sema_id.
- The has_answers column, which indicates if a survey was responded to, is recoded to 0s (for no) and 1s (for yes).
- Reaction time variables have 0s (which indicate missing data) recoded to explicitly missing (NA in R).
- The reaction time fields for multiple choice variables, which were spread out and duplicated across columns, are neatened up into a single column for each.
- The rownr variable is created, which indicates what number the survey is for each participant (i.e. first, second). This variable indexes all surveys the participant received, regardless of whether/how they responded.
- The surveys_received and surveys_responded variables are created for each participant, indicating the total number of surveys each received and responded to.
- The responsecount and fastrtcount variables are created. These index, for each survey, how many questions were answered, and how many response times were below the rt.min threshold, respectively.
- If rt.trim is set to TRUE, surveys containing more than rt.threshold (.5, or 50%, by default) responses that are below the rt.min threshold (500ms by default) are replaced with missing values, and has_answers is set to 0 for these surveys. The number of surveys removed is printed when the function is run, and is also saved in the surveys_removed variable for each participant, so that problematic participants can be traced.
- Additionally, if rt.trim is set to TRUE, remaining responses below the rt.min threshold are replaced with missing data, as are the corresponding reaction times. The number of responses removed is printed when the function is run.
- The datanr variable is calculated. This is similar to rownr, but indexes only surveys where has_answers is 1. This can be used to create lagged variables such that all available data is used, though note this may cause the time interval between lagged responses to vary considerably.
- Various date and time variables are calculated from delivered, the timestamp from the participant's phone indicating when the survey was received.
  - First, datedlv and timedlv are calculated - these index the date and the time (in 24 hour time) that the participant began responding.
  - The interval variable is calculated. This represents the time (in minutes) since the previous survey. It may prove useful both in ensuring that prompts were delivered at the correct time intervals, but also as a moderator of time-lagged effects.
  - The daynr variable is calculated. This represents, for each participant, which 'day' of the survey this is for them, beginning from 1 for ease of interpretation (though you will often want to subtract 1 from this to ensure the baseline is 0 for analytic purposes).
  - The day_of_week and weekend variables are created. These label the day of the week (e.g. Mon, Sun) and whether or not it was a weekend, respectively.
  - The survey_start and day_start variables are calculated. These are simply the date and time the participant received their first survey of the study, or of the date, respectively. These are then used to calculate the minutes_since_survey_start and minutes_since_day_start variables for each response (these are both specific to each participant, and may be useful for measuring diurnal trends or fatigue effects separate to the actual date or time).
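To make the counting and trimming steps above more concrete, here is a sketch of the same logic on toy data in base R. It is an illustration under stated assumptions, not the package's actual code: the q1_rt and q2_rt columns are hypothetical, a question is treated as answered when it has a non-missing reaction time, the proportion of fast responses is taken over answered questions, and datanr is left as NA for surveys without answers.

```r
# Toy data: one row per survey, two hypothetical reaction time columns (ms).
surveys <- data.frame(
  sema_id     = c("a1", "a1", "a1", "b2", "b2"),
  has_answers = c(1, 1, 0, 1, 1),
  q1_rt       = c(1200, 300, 0, 450, 900),
  q2_rt       = c(800, 250, 0, 2000, 0)
)
rt_cols      <- c("q1_rt", "q2_rt")
rt.min       <- 500   # fastest acceptable reaction time, in ms
rt.threshold <- 0.5   # maximum tolerated proportion of too-fast responses

# Reaction times of 0 indicate missing data, so recode them to NA.
rts <- as.matrix(surveys[rt_cols])
rts[rts == 0] <- NA

# rownr: running survey number within each participant.
surveys$rownr <- ave(seq_len(nrow(surveys)), surveys$sema_id, FUN = seq_along)

# Totals per participant (computed before any trimming, following the list order).
surveys$surveys_received  <- ave(surveys$has_answers, surveys$sema_id, FUN = length)
surveys$surveys_responded <- ave(surveys$has_answers, surveys$sema_id, FUN = sum)

# Per-survey counts of answered questions and of too-fast responses.
surveys$responsecount <- rowSums(!is.na(rts))
surveys$fastrtcount   <- rowSums(rts < rt.min, na.rm = TRUE)

# Surveys where more than rt.threshold of the answered questions were too fast.
too_fast <- surveys$responsecount > 0 &
  surveys$fastrtcount / surveys$responsecount > rt.threshold
rts[too_fast, ] <- NA
surveys$has_answers[too_fast] <- 0
surveys$surveys_removed <- ave(as.numeric(too_fast), surveys$sema_id, FUN = sum)

# Remaining individual responses below rt.min are also set to missing.
rts[!is.na(rts) & rts < rt.min] <- NA
surveys[rt_cols] <- as.data.frame(rts)

# datanr: running count of surveys with answers (NA elsewhere, by assumption).
surveys$datanr <- ave(surveys$has_answers, surveys$sema_id, FUN = cumsum)
surveys$datanr[surveys$has_answers == 0] <- NA

print(surveys)
```

In this toy example the second survey of participant a1 is removed because both of its responses are faster than 500 ms, while the fourth row keeps its slow response and loses only the fast one.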
The function will also print some output when run - giving you basic descriptive counts on your data, and also warning you if individual participants have data from more than one timezone.
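The date and time variables described in the list of cleaning steps can likewise be sketched in base R. Again, this is a toy illustration rather than the package's own code: the timestamps are made up, everything is kept in UTC for simplicity, and daynr is assumed to count calendar days from each participant's first survey.

```r
# Toy delivery timestamps for two participants (values are made up).
surveys <- data.frame(
  sema_id   = c("a1", "a1", "a1", "b2", "b2"),
  delivered = as.POSIXct(
    c("2020-03-06 09:00:00", "2020-03-06 11:30:00", "2020-03-07 10:00:00",
      "2020-03-08 14:00:00", "2020-03-08 15:15:00"),
    tz = "UTC"
  )
)
secs <- as.numeric(surveys$delivered)  # seconds since 1970-01-01

# Date and 24 hour time of delivery.
surveys$datedlv <- as.Date(surveys$delivered, tz = "UTC")
surveys$timedlv <- format(surveys$delivered, "%H:%M:%S")

# interval: minutes since the participant's previous survey (NA for the first).
surveys$interval <- ave(secs, surveys$sema_id,
                        FUN = function(x) c(NA, diff(x)) / 60)

# daynr: day of the study for each participant, starting from 1.
first_day <- ave(as.numeric(surveys$datedlv), surveys$sema_id, FUN = min)
surveys$daynr <- as.numeric(surveys$datedlv) - first_day + 1

# day_of_week and weekend, built from the locale-independent weekday number
# (0 = Sunday, ..., 6 = Saturday).
wday <- as.POSIXlt(surveys$delivered)$wday
surveys$day_of_week <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")[wday + 1]
surveys$weekend     <- wday %in% c(0, 6)

# Minutes since the participant's first survey overall, and first survey that day.
day_key <- paste(surveys$sema_id, surveys$datedlv)
surveys$minutes_since_survey_start <- (secs - ave(secs, surveys$sema_id, FUN = min)) / 60
surveys$minutes_since_day_start    <- (secs - ave(secs, day_key, FUN = min)) / 60

print(surveys)
```

For analysis, subtracting 1 from daynr gives a version with a baseline of 0, as suggested in the description of that variable.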