Convex and Nonconvex Risk-based Linear Regression at Scale

This archive is distributed in association with the INFORMS Journal on Computing under the General Public License v2.0.

The software in this repository is a snapshot of the software that was used in the research reported on in the paper Convex and Nonconvex Risk-based Linear Regression at Scale by Can Wu, Ying Cui, Donghui Li and Defeng Sun. The snapshot corresponds to this release in the development repository.

Cite

To cite this material, please cite this repository, using the following DOI.

Below is the BibTex for citing this version of the code.

@article{Risk202Xregression,
  author =        {Can Wu, Ying Cui, Donghui Li and Defeng Sun},
  publisher =     {INFORMS Journal on Computing},
  title =         {Convex and Nonconvex Risk-based Linear Regression at Scale},
  year =          {2022},
  doi =           {DOI: 10.5281/zenodo.7483279},
  note =          {https://github.com/INFORMSJoC/JoCTemplate},
}

Description

The goal of this software is to solve high-dimensional sparse linear regression problems under either the VaR or the CVaR risk measures. it is written in MATLAB and compares with the publically available Gurobi optimizer and the PSG program. For all of them, free academic licenses are available.

Three types of optimization problems and the corresponding solvers

CVaR regression with a fixed lambda: the convex CVaR-based sparse linear regression with a fixed value of lambda.
- N-ALM: the semismooth Newton based on the proximal augmented Lagrangian method
- ADMM: the alternating direction method of multipliers
- Gurobi: the barrier method in Gurobi
- S-IRPN: the smoothing method based on the inexact regularized proximal Newton method
- PSG solvers: seven solvers incuding VAN, TANK, CAR, BULDOZER, VANGRB, CARGRB and HELI in the PSG package
CVaR regression with a sequence of lambda: the convex CVaR-based sparse linear regression with a given sequence of grid points lambda.
- AS+N-ALM: the adaptive sieving strategy combined with N-ALM
- Warm+N-ALM: the warm-strated N-ALM
- Pure N-ALM: the pure N-ALM
Truncated CVaR regression: the nonconvex truncated CVaR-based sparse linear regression.
- MM+N-PPA: the majorization-minimization algorithm combined with the semismooth Newton method based on the proximal point algorithm
- MM+Gurobi: the majorization-minimization algorithm combined with the barrier method in Gurobi

Two kinds of data sets

Real data: eleven real data sets in LIBSVM format are all available on LIBSVM Data.
Random data: generated randomly under three different heavy-tailed errors according to Table 1 or three different contamination schemes according to Table 3 in the supplementary materials.

Ten folders in the software

The main functions of all above solvers and their subfunctions are included in
- mainfun: all main functions for the above solvers;
- subfun: all the subfunctions called by the above main functions;
- mexfun: two C source files and a script to build the MEX files.
The scripts for replicating all numerical results in the paper or supplementary materials by calling the main functions are included in
- Test1_fixed_lambda: test the performance of N-ALM, ADMM, Gurobi, S-IRPN and PSG solvers for CVaR regression with a fixed lambda;
- Test2_Solution_path: test the performance of AS+N-ALM, Warm+N-ALM and N-ALM for CVaR regression with a sequence of lambda;
- Test3_truncated_CVaR_MM: test the performance of MM+N-PPA and MM+Gurobi for truncated CVaR regression;
- Test4_Risk-averse_model: test the performance of the risk-sensitive and risk-neutral regression models.
The real data sets will be downloaded and generated in
- genUCIdatafun: convert the UCI data from LIBSVM format to MAT format;
- UCIdata: all real data sets used in the paper and the high-accurate objective values.
Numerical results as an example are included in
- Results: numerical results and their corresponding running records in Table 2 of the paper.

Usage

All numerical results in the paper and supplementary materials were generated by this software that had been carried out using MATLAB R2020b on a desktop computer (16-core, Intel(R) Core(TM) i7-10700 @ 2.90GHz, 32G RAM). Please follow the steps below to obtain the corresponding results:

Step 1. Unpack the software and run Matlab in its directory.

Step 2. In the MATLAB command window, type:

    >> startup

Step 3. Generate the expanded UCI data sets as follows:

Download and unpack eleven real data sets: abalone_scale, bodyfat_scale, housing_scale, mpg_scale, pyrim_scale, space_ga_scale, triazines_scale, E2006.test, E2006.train, log1p.E2006.train and log1p.E2006.test;
Create a new folder UCIdataorg in genUCIdatafun and then put all eleven real data sets into the new folder;
Make the function libsvmread available:
- download and unpack the LIBSVM package in genUCIdatafun;
- run make.m in \genUCIdatafun\libsvm-3.3\matlab;
Run genUCIdata.m in \genUCIdatafun, then the expanded UCI data sets in MAT format will be generated in the folder UCIdata.

Step 4. Run BuildMex.m under the mexfun folder, then the C source files mexAx.c and OnceCD.c will be compiled into MEX files mexAx.mex and OnceCD.mex. (If BuildMex.m does not work, type 'mex -setup' in the MATLAB command window and choose a C++ compiler, then try to run BuildMex.m again.)

Step 5. Run the scripts for generating the corresponding results on UCI data in the paper according the following table:

Results	Folders	Scripts	INPUT in Scripts
Table 2	`Test1_fixed_lambda`	Test_NALM_UCI	prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1];
		Test_SIRPN_UCI	prob=[1:11];
		Test_ADMM_UCI	prob=[1:11]; flag_tol=1;
Table 3	`Test1_fixed_lambda`	Test_NALM_UCI	prob=[7:9]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1];
Table 3	`Test1_fixed_lambda`	Test_PSG_UCI	prob=[7:9]; SOLVER='VAN';
Table 4	`Test1_fixed_lambda`	Test_NALM_UCI	prob=[1:11]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1];
Table 4	`Test1_fixed_lambda`	Test_Barrier_Gurobi_UCI	prob=[1:11];
Table 5	`Test2_Solution_path`	Test_AS_NALM_path_UCI	prob=[9, 4, 2];
		Test_warm_NALM_path_UCI	prob=[9, 4, 2];
		Test_NALM_path_UCI	prob=[9, 4, 2];
Figures 1 & 2	`Test2_Solution_path/ Result_solution_path_figure_UCI`	Test_path_figure_UCI	flag_time=1; flag_nnz=1;
Table 6	`Test3_truncated_CVaR_MM`	Test_MM_NPPA_UCI	prob=[10 11 5 8 9];
Table 6	`Test3_truncated_CVaR_MM`	Test_MM_Gurobi_UCI	prob=[10 11 5 8 9];

Step 6. Run the scripts for generating the corresponding results on random data in the supplementary materials according the following table:

Results	Folders	Scripts	INPUT in Scripts
Table 2	`Test4_Risk-averse_model`	Test_convex_CVaR_model_random	flag_err=1 or 2 or 3;
Table 4	`Test4_Risk-averse_model`	Test_truncated_CVaR_model_random	conf=1;
Table 5	`Test4_Risk-averse_model`	Test_truncated_CVaR_model_random	conf=2;
Table 6	`Test1_fixed_lambda`	Test_NALM_random	prob=[5:7]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
		Test_SIRPN_random	prob=[5:7];
		Test_ADMM_random	prob=[5:7]; flag_tol=1;
Table 7	`Test1_fixed_lambda`	Test_NALM_random	prob=[1 2]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Table 7	`Test1_fixed_lambda`	Test_PSG_random	prob=[1 2]; solvers={all seven solvers};
Table 8	`Test1_fixed_lambda`	Test_NALM_random	prob=[3 4]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Table 8	`Test1_fixed_lambda`	Test_PSG_random	prob=[3 4]; solvers={'VAN'};
Table 9	`Test1_fixed_lambda`	Test_NALM_random	prob=[5:7]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Table 9	`Test1_fixed_lambda`	Test_Barrier_Gurobi_random	m_vec=[3e2, 1e3, 3e3]; n_vec=[5e4, 1e5, 5e5];
Table 10	`Test1_fixed_lambda`	Test_NALM_UCI	prob=6; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.[1 5], 1e-2.[1:2:9, 10:10:90]];
Table 11	`Test1_fixed_lambda`	Test_NALM_random	prob=7; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.[1 5], 1e-2.[1:2:9, 10:10:90]]; flag_J=0;
Figure 1	`Test1_fixed_lambda`	Test_NALM_random	prob=7; flag_tol=0; tol=1e-6; alpha_vec=[0.4:-0.1:0.1]; flag_J=1;
Figure 1	`Test1_fixed_lambda`	Figure_indexJ	iter_0=6; iter_1=20; m=3e3; n=5e5;
Table 12	`Test2_Solution_path`	Test_AS_NALM_path_random	m_vec=1e3.[0.5, 1, 3]; n_vec=1e5.[1, 2, 5]; edgp_mat=[15 30 45; 15 30 45; 30 60 90];
		Test_warm_NALM_path_random
		Test_NALM_path_random
Figures 2 & 3	`Test2_Solution_path/ Result_solution_path_figure_random`	Test_path_figure_random	flag_time=1; flag_nnz=1;
Table 13	`Test3_truncated_CVaR_MM`	Test_MM_NPPA_random	m_vec=1e3.[0.5, 0.5, 1, 1.5, 3]; n_vec=1e5.[0.03, 0.5, 1, 2, 5]; lamc_vec=[0.15 0.1 0.05];
Table 13	`Test3_truncated_CVaR_MM`	Test_MM_Gurobi_random

Remark.

Before running the scripts Test_PSG_UCI, Test_PSG_random, Test_Barrier_Gurobi_UCI, Test_Barrier_Gurobi_random, Test_MM_Gurobi_UCI and Test_MM_Gurobi_random, please assure that Gurobi optimizer and the PSG program are downloaded and installed.
Before replicating the results of Table 2 with the INPUT flag_err = 3 in Step 6 , please download the function randl.m and place it in the folder subfun.

Example

If you want to see the performance of N-ALM, S-IRPN and ADMM on UCI data with epsilon=1e-3 in Table 2 of the paper, you only need to execute the following three operations in Test1_fixed_lambda according to Step 5:

Modify the INPUT part of the script Test_NALM_UCI.m to
```
  prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1];
```
then run this script.
Modify the INPUT part of the script Test_SIRPN_UCI.m to
```
  prob=[1:11]; 
```
then run this script.
Modify the INPUT part of the script Test_ADMM_UCI.m to
```
  prob=[1:11]; flag_tol=1;
```
then run this script.

During the above three operations, you will see the information for each iteration of N-ALM, S-IRPN and ADMM in the current command window (see Diary_NALM_UCI_flagtol_1_tol_1e-03.txt, Diary_SIRPN_UCI.txt and Diary_ADMM_UCI.txt in Results folder), respectively. Finally, you will obtain the files Result_NALM_UCI_flagtol_1_tol_1e-03.mat, Result_SIRPN_UCI.mat and Result_ADMM_UCI.mat in the current folder (see these three files inResults folder), which include all the information required in Table 2.

Replicating

To replicate all the results on UCI data in the paper, modify and run the scripts in the corresponding folders Test1_fixed_lambda, Test2_Solution_path and Test3_truncated_CVaR_MM according to Step 5, respectively.
To replicate all the results on random data in the supplementary materials, modify and run the scripts in the corresponding folders Test1_fixed_lambda, Test2_Solution_path, Test3_truncated_CVaR_MM and Test4_Risk-averse_model according to Step 6, respectively.

Remark

As mentioned in our paper, once two algorithms are stopped under different criteria, we use the objective values obtained from Gurobi or N-ALM under the tolerance 1e-9 as benchmarks to test the quality of the computed objective values by both algorithms. For convenience, the corresponding high-accurate objective values are available in the UCIdata folder.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Convex and Nonconvex Risk-based Linear Regression at Scale

Cite

Description

Three types of optimization problems and the corresponding solvers

Two kinds of data sets

Ten folders in the software

Usage

Example

Replicating

Remark

About

Releases 1

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 197 Commits
Results		Results
Test1_fixed_lambda		Test1_fixed_lambda
Test2_Solution_path		Test2_Solution_path
Test3_truncated_CVaR_MM		Test3_truncated_CVaR_MM
Test4_Risk-averse_model		Test4_Risk-averse_model
UCIdata		UCIdata
genUCIdatafun		genUCIdatafun
mainfun		mainfun
mexfun		mexfun
subfun		subfun
AUTHORS		AUTHORS
LICENSE		LICENSE
README.md		README.md
startup.m		startup.m

License

INFORMSJoC/2022.0012

Folders and files

Latest commit

History

Repository files navigation

Convex and Nonconvex Risk-based Linear Regression at Scale

Cite

Description

Three types of optimization problems and the corresponding solvers

Two kinds of data sets

Ten folders in the software

Usage

Example

Replicating

Remark

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 3

Languages

Packages