Skip to content

INFORMSJoC/2022.0012

Repository files navigation

INFORMS Journal on Computing Logo

This archive is distributed in association with the INFORMS Journal on Computing under the General Public License v2.0.

The software in this repository is a snapshot of the software that was used in the research reported on in the paper Convex and Nonconvex Risk-based Linear Regression at Scale by Can Wu, Ying Cui, Donghui Li and Defeng Sun. The snapshot corresponds to this release in the development repository.

Cite

To cite this material, please cite this repository, using the following DOI.

DOI

Below is the BibTex for citing this version of the code.

@article{Risk202Xregression,
  author =        {Can Wu, Ying Cui, Donghui Li and Defeng Sun},
  publisher =     {INFORMS Journal on Computing},
  title =         {Convex and Nonconvex Risk-based Linear Regression at Scale},
  year =          {2022},
  doi =           {DOI: 10.5281/zenodo.7483279},
  note =          {https://github.com/INFORMSJoC/JoCTemplate},
}  

Description

The goal of this software is to solve high-dimensional sparse linear regression problems under either the VaR or the CVaR risk measures. it is written in MATLAB and compares with the publically available Gurobi optimizer and the PSG program. For all of them, free academic licenses are available.

Three types of optimization problems and the corresponding solvers

  • CVaR regression with a fixed lambda: the convex CVaR-based sparse linear regression with a fixed value of lambda.
    • N-ALM: the semismooth Newton based on the proximal augmented Lagrangian method
    • ADMM: the alternating direction method of multipliers
    • Gurobi: the barrier method in Gurobi
    • S-IRPN: the smoothing method based on the inexact regularized proximal Newton method
    • PSG solvers: seven solvers incuding VAN, TANK, CAR, BULDOZER, VANGRB, CARGRB and HELI in the PSG package
  • CVaR regression with a sequence of lambda: the convex CVaR-based sparse linear regression with a given sequence of grid points lambda.
    • AS+N-ALM: the adaptive sieving strategy combined with N-ALM
    • Warm+N-ALM: the warm-strated N-ALM
    • Pure N-ALM: the pure N-ALM
  • Truncated CVaR regression: the nonconvex truncated CVaR-based sparse linear regression.
    • MM+N-PPA: the majorization-minimization algorithm combined with the semismooth Newton method based on the proximal point algorithm
    • MM+Gurobi: the majorization-minimization algorithm combined with the barrier method in Gurobi

Two kinds of data sets

  • Real data: eleven real data sets in LIBSVM format are all available on LIBSVM Data.

  • Random data: generated randomly under three different heavy-tailed errors according to Table 1 or three different contamination schemes according to Table 3 in the supplementary materials.

Ten folders in the software

  • The main functions of all above solvers and their subfunctions are included in
    • mainfun: all main functions for the above solvers;
    • subfun: all the subfunctions called by the above main functions;
    • mexfun: two C source files and a script to build the MEX files.
  • The scripts for replicating all numerical results in the paper or supplementary materials by calling the main functions are included in
    • Test1_fixed_lambda: test the performance of N-ALM, ADMM, Gurobi, S-IRPN and PSG solvers for CVaR regression with a fixed lambda;
    • Test2_Solution_path: test the performance of AS+N-ALM, Warm+N-ALM and N-ALM for CVaR regression with a sequence of lambda;
    • Test3_truncated_CVaR_MM: test the performance of MM+N-PPA and MM+Gurobi for truncated CVaR regression;
    • Test4_Risk-averse_model: test the performance of the risk-sensitive and risk-neutral regression models.
  • The real data sets will be downloaded and generated in
    • genUCIdatafun: convert the UCI data from LIBSVM format to MAT format;
    • UCIdata: all real data sets used in the paper and the high-accurate objective values.
  • Numerical results as an example are included in
    • Results: numerical results and their corresponding running records in Table 2 of the paper.

Usage

All numerical results in the paper and supplementary materials were generated by this software that had been carried out using MATLAB R2020b on a desktop computer (16-core, Intel(R) Core(TM) i7-10700 @ 2.90GHz, 32G RAM). Please follow the steps below to obtain the corresponding results:

Step 1. Unpack the software and run Matlab in its directory.

Step 2. In the MATLAB command window, type:

    >> startup 

Step 3. Generate the expanded UCI data sets as follows:

Step 4. Run BuildMex.m under the mexfun folder, then the C source files mexAx.c and OnceCD.c will be compiled into MEX files mexAx.mex and OnceCD.mex. (If BuildMex.m does not work, type 'mex -setup' in the MATLAB command window and choose a C++ compiler, then try to run BuildMex.m again.)

Step 5. Run the scripts for generating the corresponding results on UCI data in the paper according the following table:

Results Folders Scripts INPUT in Scripts
Table 2 `Test1_fixed_lambda` Test_NALM_UCI prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1];
Test_SIRPN_UCI prob=[1:11];
Test_ADMM_UCI prob=[1:11]; flag_tol=1;
Table 3 `Test1_fixed_lambda` Test_NALM_UCI prob=[7:9]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1];
Test_PSG_UCI prob=[7:9]; SOLVER='VAN';
Table 4 `Test1_fixed_lambda` Test_NALM_UCI prob=[1:11]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1];
Test_Barrier_Gurobi_UCI prob=[1:11];
Table 5 `Test2_Solution_path` Test_AS_NALM_path_UCI prob=[9, 4, 2];
Test_warm_NALM_path_UCI prob=[9, 4, 2];
Test_NALM_path_UCI prob=[9, 4, 2];
Figures 1 & 2 `Test2_Solution_path/ Result_solution_path_figure_UCI` Test_path_figure_UCI flag_time=1; flag_nnz=1;
Table 6 `Test3_truncated_CVaR_MM` Test_MM_NPPA_UCI prob=[10 11 5 8 9];
Test_MM_Gurobi_UCI prob=[10 11 5 8 9];

Step 6. Run the scripts for generating the corresponding results on random data in the supplementary materials according the following table:

Results Folders Scripts INPUT in Scripts
Table 2 `Test4_Risk-averse_model` Test_convex_CVaR_model_random flag_err=1 or 2 or 3;
Table 4 `Test4_Risk-averse_model` Test_truncated_CVaR_model_random conf=1;
Table 5 `Test4_Risk-averse_model` Test_truncated_CVaR_model_random conf=2;
Table 6 `Test1_fixed_lambda` Test_NALM_random prob=[5:7]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Test_SIRPN_random prob=[5:7];
Test_ADMM_random prob=[5:7]; flag_tol=1;
Table 7 `Test1_fixed_lambda` Test_NALM_random prob=[1 2]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Test_PSG_random prob=[1 2]; solvers={all seven solvers};
Table 8 `Test1_fixed_lambda` Test_NALM_random prob=[3 4]; flag_tol=0; tol=1e-6; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Test_PSG_random prob=[3 4]; solvers={'VAN'};
Table 9 `Test1_fixed_lambda` Test_NALM_random prob=[5:7]; flag_tol=0; tol=1e-8; alpha_vec=[0.9, 0.5, 0.1]; flag_J=0;
Test_Barrier_Gurobi_random m_vec=[3e2, 1e3, 3e3];
n_vec=[5e4, 1e5, 5e5];
Table 10 `Test1_fixed_lambda` Test_NALM_UCI prob=6; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.*[1 5], 1e-2.*[1:2:9, 10:10:90]];
Table 11 `Test1_fixed_lambda` Test_NALM_random prob=7; flag_tol=0; tol=1e-6; alpha_vec=1-[1e-3.*[1 5], 1e-2.*[1:2:9, 10:10:90]]; flag_J=0;
Figure 1 `Test1_fixed_lambda` Test_NALM_random prob=7; flag_tol=0; tol=1e-6; alpha_vec=[0.4:-0.1:0.1]; flag_J=1;
Figure_indexJ iter_0=6; iter_1=20; m=3e3; n=5e5;
Table 12 `Test2_Solution_path` Test_AS_NALM_path_random m_vec=1e3.*[0.5, 1, 3]; n_vec=1e5.*[1, 2, 5]; edgp_mat=[15 30 45; 15 30 45; 30 60 90];
Test_warm_NALM_path_random
Test_NALM_path_random
Figures 2 & 3 `Test2_Solution_path/ Result_solution_path_figure_random` Test_path_figure_random flag_time=1; flag_nnz=1;
Table 13 `Test3_truncated_CVaR_MM` Test_MM_NPPA_random m_vec=1e3.*[0.5, 0.5, 1, 1.5, 3];
n_vec=1e5.*[0.03, 0.5, 1, 2, 5]; lamc_vec=[0.15 0.1 0.05];
Test_MM_Gurobi_random

Remark.

  • Before running the scripts Test_PSG_UCI, Test_PSG_random, Test_Barrier_Gurobi_UCI, Test_Barrier_Gurobi_random, Test_MM_Gurobi_UCI and Test_MM_Gurobi_random, please assure that Gurobi optimizer and the PSG program are downloaded and installed.
  • Before replicating the results of Table 2 with the INPUT flag_err = 3 in Step 6 , please download the function randl.m and place it in the folder subfun.

Example

If you want to see the performance of N-ALM, S-IRPN and ADMM on UCI data with epsilon=1e-3 in Table 2 of the paper, you only need to execute the following three operations in Test1_fixed_lambda according to Step 5:

  • Modify the INPUT part of the script Test_NALM_UCI.m to

      prob=[1:11]; flag_tol=1; tol=1e-3; alpha_vec=[0.9, 0.5, 0.1];
    

    then run this script.

  • Modify the INPUT part of the script Test_SIRPN_UCI.m to

      prob=[1:11]; 
    

    then run this script.

  • Modify the INPUT part of the script Test_ADMM_UCI.m to

      prob=[1:11]; flag_tol=1;
    

    then run this script.

During the above three operations, you will see the information for each iteration of N-ALM, S-IRPN and ADMM in the current command window (see Diary_NALM_UCI_flagtol_1_tol_1e-03.txt, Diary_SIRPN_UCI.txt and Diary_ADMM_UCI.txt in Results folder), respectively. Finally, you will obtain the files Result_NALM_UCI_flagtol_1_tol_1e-03.mat, Result_SIRPN_UCI.mat and Result_ADMM_UCI.mat in the current folder (see these three files inResults folder), which include all the information required in Table 2.

Replicating

  • To replicate all the results on UCI data in the paper, modify and run the scripts in the corresponding folders Test1_fixed_lambda, Test2_Solution_path and Test3_truncated_CVaR_MM according to Step 5, respectively.
  • To replicate all the results on random data in the supplementary materials, modify and run the scripts in the corresponding folders Test1_fixed_lambda, Test2_Solution_path, Test3_truncated_CVaR_MM and Test4_Risk-averse_model according to Step 6, respectively.

Remark

As mentioned in our paper, once two algorithms are stopped under different criteria, we use the objective values obtained from Gurobi or N-ALM under the tolerance 1e-9 as benchmarks to test the quality of the computed objective values by both algorithms. For convenience, the corresponding high-accurate objective values are available in the UCIdata folder.