This archive is distributed in association with the INFORMS Journal on Computing under the General Public License v2.0.
The software in this repository is a snapshot of the software that was used in the research reported on in the paper Convex and Nonconvex Risk-based Linear Regression at Scale by Can Wu, Ying Cui, Donghui Li and Defeng Sun. The snapshot corresponds to this release in the development repository.
To cite this material, please cite this repository, using the following DOI.
Below is the BibTex for citing this version of the code.
@article{Risk202Xregression,
author = {Can Wu, Ying Cui, Donghui Li and Defeng Sun},
publisher = {INFORMS Journal on Computing},
title = {Convex and Nonconvex Risk-based Linear Regression at Scale},
year = {2022},
doi = {DOI: 10.5281/zenodo.7483279},
note = {https://github.com/INFORMSJoC/JoCTemplate},
}
The goal of this software is to solve high-dimensional sparse linear regression problems under either the VaR or the CVaR risk measures. it is written in MATLAB and compares with the publically available Gurobi optimizer and the PSG program. For all of them, free academic licenses are available.
- CVaR regression with a fixed lambda: the convex CVaR-based sparse linear regression with a fixed value of lambda.
- N-ALM: the semismooth Newton based on the proximal augmented Lagrangian method
- ADMM: the alternating direction method of multipliers
- Gurobi: the barrier method in Gurobi
- S-IRPN: the smoothing method based on the inexact regularized proximal Newton method
- PSG solvers: seven solvers incuding VAN, TANK, CAR, BULDOZER, VANGRB, CARGRB and HELI in the PSG package
- CVaR regression with a sequence of lambda: the convex CVaR-based sparse linear regression with a given sequence of grid points lambda.
- AS+N-ALM: the adaptive sieving strategy combined with N-ALM
- Warm+N-ALM: the warm-strated N-ALM
- Pure N-ALM: the pure N-ALM
- Truncated CVaR regression: the nonconvex truncated CVaR-based sparse linear regression.
- MM+N-PPA: the majorization-minimization algorithm combined with the semismooth Newton method based on the proximal point algorithm
- MM+Gurobi: the majorization-minimization algorithm combined with the barrier method in Gurobi
-
Real data: eleven real data sets in LIBSVM format are all available on LIBSVM Data.
-
Random data: generated randomly under three different heavy-tailed errors according to Table 1 or three different contamination schemes according to Table 3 in the supplementary materials.
- The main functions of all above solvers and their subfunctions are included in
mainfun
: all main functions for the above solvers;subfun
: all the subfunctions called by the above main functions;mexfun
: two C source files and a script to build the MEX files.
- The scripts for replicating all numerical results in the paper or supplementary materials by calling the main functions are included in
Test1_fixed_lambda
: test the performance of N-ALM, ADMM, Gurobi, S-IRPN and PSG solvers for CVaR regression with a fixed lambda;Test2_Solution_path
: test the performance of AS+N-ALM, Warm+N-ALM and N-ALM for CVaR regression with a sequence of lambda;Test3_truncated_CVaR_MM
: test the performance of MM+N-PPA and MM+Gurobi for truncated CVaR regression;Test4_Risk-averse_model
: test the performance of the risk-sensitive and risk-neutral regression models.
- The real data sets will be downloaded and generated in
genUCIdatafun
: convert the UCI data from LIBSVM format to MAT format;UCIdata
: all real data sets used in the paper and the high-accurate objective values.
- Numerical results as an example are included in
Results
: numerical results and their corresponding running records in Table 2 of the paper.
All numerical results in the paper and supplementary materials were generated by this software that had been carried out using MATLAB R2020b on a desktop computer (16-core, Intel(R) Core(TM) i7-10700 @ 2.90GHz, 32G RAM). Please follow the steps below to obtain the corresponding results:
Step 1. Unpack the software and run Matlab in its directory.
Step 2. In the MATLAB command window, type:
>> startup
Step 3. Generate the expanded UCI data sets as follows:
- Download and unpack eleven real data sets: abalone_scale, bodyfat_scale, housing_scale, mpg_scale, pyrim_scale, space_ga_scale, triazines_scale, E2006.test, E2006.train, log1p.E2006.train and log1p.E2006.test;
- Create a new folder
UCIdataorg
ingenUCIdatafun
and then put all eleven real data sets into the new folder; - Make the function libsvmread available:
- download and unpack the LIBSVM package in
genUCIdatafun
; - run make.m in
\genUCIdatafun\libsvm-3.3\matlab
;
- download and unpack the LIBSVM package in
- Run genUCIdata.m in
\genUCIdatafun
, then the expanded UCI data sets in MAT format will be generated in the folderUCIdata
.
Step 4. Run BuildMex.m under the mexfun
folder, then the C source files mexAx.c and OnceCD.c will be compiled into MEX files mexAx.mex and OnceCD.mex. (If BuildMex.m does not work, type 'mex -setup' in the MATLAB command window and choose a C++ compiler, then try to run BuildMex.m again.)
Step 5. Run the scripts for generating the corresponding results on UCI data in the paper according the following table:
Results | Folders | Scripts | INPUT in Scripts |
---|---|---|---|
Table 2 | `Test1_fixed_lambda` | Test_NALM_UCI | prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1]; |
Test_SIRPN_UCI | prob=[1:11]; | ||
Test_ADMM_UCI | prob=[1:11]; flag_tol=1; | ||
Table 3 | `Test1_fixed_lambda` | Test_NALM_UCI | prob=[7:9]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; |
Test_PSG_UCI | prob=[7:9]; SOLVER='VAN'; | ||
Table 4 | `Test1_fixed_lambda` | Test_NALM_UCI | prob=[1:11]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1]; |
Test_Barrier_Gurobi_UCI | prob=[1:11]; | ||
Table 5 | `Test2_Solution_path` | Test_AS_NALM_path_UCI | prob=[9, 4, 2]; |
Test_warm_NALM_path_UCI | prob=[9, 4, 2]; | ||
Test_NALM_path_UCI | prob=[9, 4, 2]; | ||
Figures 1 & 2 | `Test2_Solution_path/ Result_solution_path_figure_UCI` | Test_path_figure_UCI | flag_time=1; flag_nnz=1; |
Table 6 | `Test3_truncated_CVaR_MM` | Test_MM_NPPA_UCI | prob=[10 11 5 8 9]; |
Test_MM_Gurobi_UCI | prob=[10 11 5 8 9]; |
Step 6. Run the scripts for generating the corresponding results on random data in the supplementary materials according the following table:
Results | Folders | Scripts | INPUT in Scripts |
---|---|---|---|
Table 2 | `Test4_Risk-averse_model` | Test_convex_CVaR_model_random | flag_err=1 or 2 or 3; |
Table 4 | `Test4_Risk-averse_model` | Test_truncated_CVaR_model_random | conf=1; |
Table 5 | `Test4_Risk-averse_model` | Test_truncated_CVaR_model_random | conf=2; |
Table 6 | `Test1_fixed_lambda` | Test_NALM_random | prob=[5:7]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0; |
Test_SIRPN_random | prob=[5:7]; | ||
Test_ADMM_random | prob=[5:7]; flag_tol=1; | ||
Table 7 | `Test1_fixed_lambda` | Test_NALM_random | prob=[1 2]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0; |
Test_PSG_random | prob=[1 2]; solvers={all seven solvers}; | ||
Table 8 | `Test1_fixed_lambda` | Test_NALM_random | prob=[3 4]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0; |
Test_PSG_random | prob=[3 4]; solvers={'VAN'}; | ||
Table 9 | `Test1_fixed_lambda` | Test_NALM_random | prob=[5:7]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0; |
Test_Barrier_Gurobi_random | m_vec=[3e2, 1e3, 3e3]; n_vec=[5e4, 1e5, 5e5]; |
||
Table 10 | `Test1_fixed_lambda` | Test_NALM_UCI | prob=6; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.*[1 5], 1e-2.*[1:2:9, 10:10:90]]; |
Table 11 | `Test1_fixed_lambda` | Test_NALM_random | prob=7; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.*[1 5], 1e-2.*[1:2:9, 10:10:90]]; flag_J=0; |
Figure 1 | `Test1_fixed_lambda` | Test_NALM_random | prob=7; flag_tol=0; tol=1e-6; alpha_vec=[0.4:-0.1:0.1]; flag_J=1; |
Figure_indexJ | iter_0=6; iter_1=20; m=3e3; n=5e5; | ||
Table 12 | `Test2_Solution_path` | Test_AS_NALM_path_random | m_vec=1e3.*[0.5, 1, 3]; n_vec=1e5.*[1, 2, 5]; edgp_mat=[15 30 45; 15 30 45; 30 60 90]; |
Test_warm_NALM_path_random | |||
Test_NALM_path_random | |||
Figures 2 & 3 | `Test2_Solution_path/ Result_solution_path_figure_random` | Test_path_figure_random | flag_time=1; flag_nnz=1; |
Table 13 | `Test3_truncated_CVaR_MM` | Test_MM_NPPA_random | m_vec=1e3.*[0.5, 0.5, 1, 1.5, 3]; n_vec=1e5.*[0.03, 0.5, 1, 2, 5]; lamc_vec=[0.15 0.1 0.05]; |
Test_MM_Gurobi_random |
Remark.
- Before running the scripts Test_PSG_UCI, Test_PSG_random, Test_Barrier_Gurobi_UCI, Test_Barrier_Gurobi_random, Test_MM_Gurobi_UCI and Test_MM_Gurobi_random, please assure that Gurobi optimizer and the PSG program are downloaded and installed.
- Before replicating the results of Table 2 with the INPUT flag_err = 3 in Step 6 , please download the function randl.m and place it in the folder
subfun
.
If you want to see the performance of N-ALM, S-IRPN and ADMM on UCI data with epsilon=1e-3 in Table 2 of the paper, you only need to execute the following three operations in Test1_fixed_lambda
according to Step 5:
-
Modify the INPUT part of the script Test_NALM_UCI.m to
prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1];
then run this script.
-
Modify the INPUT part of the script Test_SIRPN_UCI.m to
prob=[1:11];
then run this script.
-
Modify the INPUT part of the script Test_ADMM_UCI.m to
prob=[1:11]; flag_tol=1;
then run this script.
During the above three operations, you will see the information for each iteration of N-ALM, S-IRPN and ADMM in the current command window (see Diary_NALM_UCI_flagtol_1_tol_1e-03.txt, Diary_SIRPN_UCI.txt and Diary_ADMM_UCI.txt in Results
folder), respectively. Finally, you will obtain the files Result_NALM_UCI_flagtol_1_tol_1e-03.mat, Result_SIRPN_UCI.mat and Result_ADMM_UCI.mat in the current folder (see these three files inResults
folder), which include all the information required in Table 2.
- To replicate all the results on UCI data in the paper, modify and run the scripts in the corresponding folders
Test1_fixed_lambda
,Test2_Solution_path
andTest3_truncated_CVaR_MM
according to Step 5, respectively. - To replicate all the results on random data in the supplementary materials, modify and run the scripts in the corresponding folders
Test1_fixed_lambda
,Test2_Solution_path
,Test3_truncated_CVaR_MM
andTest4_Risk-averse_model
according to Step 6, respectively.
As mentioned in our paper, once two algorithms are stopped under different criteria, we use the objective values obtained from Gurobi or N-ALM under the tolerance 1e-9 as benchmarks to test the quality of the computed objective values by both algorithms. For convenience, the corresponding high-accurate objective values are available in the UCIdata
folder.