Apply multiclass AUC approaches to some trial datasets to get a better understanding of how to apply it and the advantages/disadvantages

rgayler/multiclass_AUC

Multiclass AUC

This is an open, shareable, reproducible, computational research project on using multiclass AUC as a performance metric for multiclass classification.

It is the work of Ross W. Gayler

It is a common data science task to classify inputs by assigning each input to one of a fixed number of categories. Given the existence of a set of inputs with known categories we want to apply the classification model and assess its performance by comparing the assigned categories to the known categories.

Accuracy is a commonly used performance metric. However, accuracy assumes that all misclassifications are equally costly and that the base rates of the categories are fixed. Signal Detection Theory (SDT) explicitly addresses these assumptions, and Area Under the Curve (AUC) is a common SDT metric of discriminability. AUC is most commonly used in the context of binary classification. AUC for multiclass classification is encountered much less often in practice, and there is no generally accepted view on how best to apply it.
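To make the idea concrete, here is a minimal base R sketch of binary AUC computed with the rank (Mann-Whitney) formula, followed by a one-vs-rest macro average as one possible multiclass extension. The function names `auc_binary` and `auc_ovr_macro` and the toy data are made up for illustration. This is not necessarily the approach taken in the project notebooks, and other extensions (for example, averaging over pairs of classes) exist.

```r
# Binary AUC via the Mann-Whitney rank formula (base R only).
# scores: numeric classifier scores; is_pos: logical, TRUE for the positive class.
# Returns NaN if either class is empty.
auc_binary <- function(scores, is_pos) {
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(scores)  # the default tie handling assigns average ranks
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One-vs-rest macro average (an assumption for illustration, not the project's method).
# prob: matrix of class scores, one column per class, column names are class labels.
# truth: vector of true class labels, one per row of prob.
auc_ovr_macro <- function(prob, truth) {
  classes <- colnames(prob)
  aucs <- vapply(
    classes,
    function(k) auc_binary(prob[, k], truth == k),
    numeric(1)
  )
  mean(aucs)
}

# Toy data, invented purely for illustration
truth <- c("a", "a", "b", "b", "c", "c")
prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.5, 0.3, 0.2,
    0.2, 0.6, 0.2,
    0.5, 0.3, 0.2,
    0.1, 0.2, 0.7,
    0.2, 0.3, 0.5),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("a", "b", "c"))
)

auc_ovr_macro(prob, truth)
```

Unlike accuracy, this value depends only on how the scores rank the cases within each one-vs-rest comparison, not on a decision threshold.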

The objective of this project is to apply multiclass AUC approaches to some trial datasets to get a better understanding of how to apply it and the advantages/disadvantages.

The point of making it an open, shareable, reproducible project is that anyone should be able to copy it, reproduce the analyses, and try out modifications.

Multiclass AUC by Ross W. Gayler is licensed under CC BY 4.0

All materials other than code are released under the Creative Commons Attribution 4.0 International License.

Code is released under the MIT license.

Project organisation

This is an open, shareable, reproducible, computational research project.

  • All the computational work and document preparation is done with the R statistical computing environment and the RStudio integrated development environment.

  • The entire research project is contained in a single directory that corresponds to an RStudio R project.

  • The renv package is used to manage the R package versions used by the project.

  • The workflowr package is used to structure the project so that all the materials and outputs are available via an openly accessible, automatically generated website.

  • The analyses are organised as notebooks which are largely free-standing. Any dependencies are noted in each notebook. The dependencies are managed manually.

  • The project code and documents are shared publicly on GitHub at https://github.com/rgayler/multiclass_AUC

  • The website automatically generated by workflowr from the rendered project documents is at https://rgayler.github.io/multiclass_AUC/

Project directory structure

workflowr directories

workflowr creates a set of standard directories. See the package documentation for details on how these directories are used. The brief purposes are:

  • analysis - rmarkdown analysis notebooks
  • code - R code not in analysis notebooks
  • data - raw data and associated metadata
  • docs - automatically generated website
  • output - generated data and other objects

workflowr only manages the subset of files that it knows about, so you will need to manually stage and commit any other files that need to be mirrored on GitHub.

If any files in data and output are more than trivially small, they will not be shared via Git and GitHub.

  • .gitignore will be used to keep them out of Git.
  • There will be a separate mechanism (e.g. Zenodo) for sharing those large files.
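As a rough aid for deciding which files to list in .gitignore, the following base R sketch finds files in data and output that exceed a size threshold. The helper name `find_large_files` and the 1 MB default are hypothetical choices made for this example. The project does not define a specific size limit.

```r
# List files under the given directories that are larger than max_bytes.
# The 1 MB default is an arbitrary illustrative threshold, not a project rule.
find_large_files <- function(dirs = c("data", "output"), max_bytes = 1e6) {
  dirs <- dirs[dir.exists(dirs)]  # silently skip directories that do not exist
  files <- list.files(dirs, recursive = TRUE, full.names = TRUE)
  sizes <- file.size(files)
  files[!is.na(sizes) & sizes > max_bytes]
}

# Run from the project root directory
find_large_files()
```

Any paths it returns are candidates for a manual entry in the root .gitignore.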

renv directory

The renv package keeps track of the R packages (and their versions) used by the project. It allows anyone to reinstate the same packages and versions in their local copy of the project. renv does not address system dependencies (e.g. R version, system libraries), so if these become critical I will need to use something like Docker, but not yet.

The renv directory contains the information needed by renv to reinstate the local package environment.

.gitignore

.gitignore in the R project root directory is used for all the manually created entries, so that all the manual rules are in one place. Packages, such as renv, may create their own .gitignore files in subdirectories that they manage.

Browsing the automatically generated website

The static website automatically generated by workflowr is stored in the docs directory.

The key document is docs/index.html. Open this file with a browser to get access to the website. docs/index.html allows you to navigate to all the generated content.

This index page is mirrored on the internet at https://rgayler.github.io/multiclass_AUC/index.html
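If you prefer to open the local copy from the R console, a minimal sketch follows. The helper name `open_local_site` is hypothetical. It only uses base R and assumes the website has already been built into docs. workflowr also provides `wflow_view()` for the same purpose.

```r
# Open the locally built website in the default browser.
# project_dir is assumed to be the project root (the directory containing docs/).
open_local_site <- function(project_dir = ".") {
  index <- file.path(project_dir, "docs", "index.html")
  if (!file.exists(index)) {
    stop("No docs/index.html found; build the website first.")
  }
  index <- normalizePath(index)
  if (interactive()) {
    utils::browseURL(index)  # only launch a browser in interactive sessions
  }
  invisible(index)
}
```

Call `open_local_site()` from the project root directory.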

How to do things

  • All detailed setup instructions and notes go in this project-level README.md file.
  • The README.md files in the subdirectories only state the purpose of each subdirectory and the files in that directory.

Installation

This assumes that you already have current versions of R and RStudio installed.

  1. Clone the project repository https://github.com/rgayler/multiclass_AUC from GitHub

  2. Open the cloned repository as an RStudio project

You can combine steps 1 and 2 using RStudio by creating a new project from the GitHub repository:
File | New Project... | Version Control | Git | Create Project

When you open the project you will get warning messages about packages not being installed. This is because you need to use the renv package to reinstate the packages that are used by the project.

  1. Install renv in that project if it is not already installed

  2. Use renv::restore() to install all the needed packages in the project-specific library:

    renv::restore()
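The two renv steps above can be combined into one guarded call, sketched below. The wrapper name `restore_project_packages` is made up for this example. In many renv projects the project's startup script bootstraps renv automatically when the project is opened, in which case the install step is simply skipped.

```r
# Install renv only if it is missing, then restore the project library
# from renv.lock. Run this from the project root directory.
restore_project_packages <- function() {
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  renv::restore()
}
```

Call `restore_project_packages()` once after cloning. renv may ask for confirmation before installing packages.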
    

Run analyses

The analyses are specified by notebooks in the analysis directory. You can, if you wish, ignore all the workflowr aspects and simply run the analysis notebooks locally.

If you wish to take advantage of the features of workflowr you will need to learn the workflowr workflow. See the workflowr getting started vignette for an introduction.

  • Create a new analysis notebook:

    workflowr::wflow_open("analysis/new_notebook_name.Rmd")
    
  • Build the website locally (either manually or indirectly via targets):

    workflowr::wflow_build()
    
  • Publish the website online (manually). This will only work if you have push authorisation for the GitHub remote repository.

    workflowr::wflow_publish("analysis/*.Rmd", "A commit message")
    
  • Add mathjax = "local" as an argument to workflowr::wflow_html in analysis/_site.yml so that the MathJax JavaScript library is bundled with the website in docs/ rather than being loaded from a remote server when the website is viewed. This removes the dependency on the remote server being available. See workflowr/workflowr#211

    output:
      workflowr::wflow_html:
        mathjax: "local"
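The build and publish steps listed above can be chained in a small helper, sketched here under some assumptions. The function name `publish_notebooks` and the default commit message are hypothetical. The sketch assumes the workflowr package is installed and the working directory is inside the project. `wflow_publish()` commits the notebooks and the rendered HTML locally, and `wflow_git_push()` is one way to send those commits to GitHub.

```r
# Check status, build locally, publish, then push (hypothetical helper).
publish_notebooks <- function(files = "analysis/*.Rmd",
                              message = "Update analyses",
                              push = FALSE) {
  stopifnot(requireNamespace("workflowr", quietly = TRUE))
  print(workflowr::wflow_status())          # show which notebooks need attention
  workflowr::wflow_build(files)             # render locally to check for errors
  workflowr::wflow_publish(files, message)  # commit the Rmd files and rebuilt HTML
  if (push) {
    workflowr::wflow_git_push()             # requires push authorisation on the remote
  }
  invisible(TRUE)
}
```

For example, `publish_notebooks(push = TRUE)` would publish all notebooks and push them in one step.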
    

renv collaboration

The renv package is used to keep track of the installed packages and their versions. See the renv collaboration guide or the workflow for synchronising package environments between collaborators.
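A minimal sketch of the usual renv round trip between collaborators follows. The helper name `sync_package_environment` and its `record_changes` argument are hypothetical. The renv functions it calls (`status()`, `snapshot()`, `restore()`) are part of renv, and the sketch assumes it is run from the project root directory.

```r
# Synchronise the local package library with renv.lock (hypothetical helper).
# record_changes = TRUE:  write the packages you are using into renv.lock.
# record_changes = FALSE: install the versions recorded in renv.lock.
sync_package_environment <- function(record_changes = FALSE) {
  stopifnot(requireNamespace("renv", quietly = TRUE))
  renv::status()  # report differences between renv.lock and the library
  if (record_changes) {
    renv::snapshot()
  } else {
    renv::restore()
  }
}
```

After `sync_package_environment(record_changes = TRUE)`, commit the updated renv.lock so that collaborators can pick up the change with `sync_package_environment()`.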
