Synchronization exercise

Introduction

In the synchronization exercise we want to compare our analysis framework (FW) to other groups' FWs at the Ntuple level. We consider two types of synchronization:

  • synchronization at the object level: to check how compatible the reconstructed objects are
    • do we apply the same energy corrections (to jets, leptons);
    • are we using the correct tau ID discriminant;
    • how are the objects cleaned;
    • are the object-level variables computed the same way in all implementations;
    • etc.
  • synchronization at the event level: to check if the event-level cuts are implemented correctly.

The trees and branches in the output Ntuple should be named according to a nomenclature that all groups have agreed to follow. In the 2016+2017+2018 analysis we decided on the nomenclature as explained here and here.

For the object level synchronization we have a dedicated analysis executable, analyze_inclusive, which doesn't select any events but computes the objects used in the analyses. The standard analyses -- 0l+2tau, 1l+1tau etc. all the way up to 4l, including the four CRs (ttW, ttZ, WZ, ZZ) -- are all capable of producing event-level sync Ntuples. The sync Ntuples are produced not only for the signal regions (SRs) but also for the fake application regions (ARs), flip ARs and MC closure regions (when applicable). We also have the capability to produce the sync Ntuples with the following shape uncertainties: central, JES, JER, tauES, UnclusteredEn, btag.

The following examples are based on the 2017 era.

Sync Ntuple production

Ntupelization

The starting point for all groups is a ttH signal MC MINIAODSIM file, which is Ntupelized by each group individually. In our framework we use our custom nanoAOD fork to Ntupelize the MINIAOD file. The Ntupelization is carried out with the cmsRun executable, which requires a Python config file as input. The config files are generated with a script called launchall_nanoaod.sh, which is inspired by the launchall.sh script from the old VHbb days. Even though the script is globally available, it should always be executed in $CMSSW_BASE/src/tthAnalysis/NanoAOD for it to function properly. The help message of this script is currently:

Usage: launchall_nanoaod.sh -e <era>
                            -j <type>
                            [-d]
                            [-g]
                            [-f <dataset file>]
                            [-v version]
                            [-w whitelist]
                            [-n <job events = 50000>]
                            [-N <cfg events = -1>]
                            [-r <frequency = 1000>]
                            [-t <threads = 1>]
                            [ -p <publish: 0|1 = 1> ]
Available eras: 2016v2, 2016v3, 2017v1, 2017v2, 2018, 2018prompt
Available job types: data, mc, fast, sync

And the explanation of each option or flag:

  • -e <era>: mandatory option that specifies the MiniAOD production campaign you want to Ntupelize. Available eras are: 2016v2 (RunIISummer16MiniAODv2), 2016v3 (RunIISummer16MiniAODv3), 2017v1 (RunIIFall17MiniAOD), 2017v2 (RunIIFall17MiniAODv2), 2018 (RunIIAutumn18MiniAOD) and 2018prompt (only for processing 2018RunD data files).
  • -j <type>: mandatory option that specifies which type of files you want to Ntupelize. Available options are: data, mc (FullSim MC), fast (FastSim MC) and sync (MINIAODSIM for the synchronization exercise);
  • -d: flag that submits CRAB jobs with the --dryrun option. Useful when you want to validate CRAB submission;
  • -g: flag that tells the script to only generate config files, not prepare any CRAB jobs;
  • -f <dataset file>: optional. Specifies the location of the text file that lists all datasets the user wants to Ntupelize. If this option is not provided, then the location is automatically guessed from the job type and campaign era. In specialized analyses like the multilepton or bbWW HH analysis the option is mandatory, because valid guesses are made only when you Ntupelize MINIAOD files in the ttH analysis;
  • -v <version>: optional. Specifies the directory name of the CRAB jobs in /store/cms/user/. When the option is not provided, a default one is generated (consists of era and date of submission);
  • -w whitelist: optional comma-separated list of sites that the user wants their jobs to run on;
  • -n <job events = 50000>: (optional) number of events to process per CRAB job. Defaults to 50'000 events per CRAB job;
  • -N <cfg events = -1>: (optional) number of events to process per cmsRun task. Should never be changed, unless the user wants to test sync Ntuple production on a few events;
  • -r <frequency = 1000>: optional setting that tells how often the job should inform the user about the progress. Defaults to 1000, which means that after every 1000th event a statement is made about the progress;
  • -t <threads = 1>: option that specifies the number of concurrent threads in the job. Setting it to a higher value is reasonable when the Ntupelization job is run locally (e.g. when Ntupelizing for the synchronization), but should never be touched when submitting CRAB jobs;
  • -p <publish: 0|1 = 1>: option that specifies whether the processed datasets should be published on DAS (1: the default) or not (0). Has no effect when -g flag is enabled.

When running Ntuple mass production, you need to open a grid proxy for a long time (~weeks) before running the script. However, in case of the synchronization exercise we don't need the massive computing resources that grid computing provides -- we want the results as soon as possible. So, the plan of attack is to generate only the config files (use the -g option), increase the number of concurrent threads to something reasonable (with the -t option) and run the cmsRun job locally. If the job is run locally, there's no need to open the grid proxy at all.

Example. Let's say we want to produce a synchronization Ntuple for the 2017 legacy ttH analysis. In order to do that, we need to generate the cmsRun config file with the following command:

launchall_nanoaod.sh -e 2017v2 -j sync -g -r 1

The command tells the script that we only want to generate the config files (-g) from RunIIFall17MiniAODv2 (2017v2) MiniAOD for the synchronization exercise (-j sync), with the progress reported after every event (-r 1). The user is first prompted with the following message:

Sure you want to run NanoAOD production on samples defined in $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/datasets/txt/datasets_sync_2017_RunIIFall17MiniAODv2.txt? [y/N]

In other words, the script has made an educated guess for the location of the file that contains the list of datasets subject to processing. After pressing the y key, the user is prompted with another question:

Sure you want to use this config file: $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/cfgs/nano_sync_RunIIFall17MiniAODv2_cfg.py? [y/N]

The script made another educated guess for the location of the generated config file. After entering y, the script proceeds to generate the config file for the Ntupelization job.

Finally, the Ntupelization job can be run anywhere, but the general advice is to create a directory somewhere in your $HOME with a name that is descriptive enough for the job you're about to run, and then execute the following:

cmsRun $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/cfgs/nano_sync_RunIIFall17MiniAODv2_cfg.py &> log.txt

The standard output and standard error streams are redirected to the file log.txt, which is placed in the same directory where you executed the above command. You can use tail -F log.txt to track the progress of the job in another shell session.

Example #2. Let's say we want to Ntupelize a MINIAODSIM file for the synchronization exercise in the HH bbWW analysis. We start with the same command as in the previous example, but provide the -f option (and, in this example, also increase the number of threads with -t 4):

launchall_nanoaod.sh -e 2017v2 -j sync -g -r 1 -t 4 -f test/datasets/txt/datasets_hh_bbww_sync_2017_RunIIFall17MiniAODv2.txt

The script prompts the following question:

Sure you want to use this config file: $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/cfgs/nano_sync_RunIIFall17MiniAODv2_cfg.py? [y/N]

After pressing y the config file for this job is generated. Notice that there's no question about the dataset file anymore since the user has already provided one.

Sample dictionaries

The locations of the Ntuples are managed via sample dictionaries, which are built in multiple stages.

JSON files

The only part that's written by hand are the JSON files:

  • datasets.json -- specifies all MC samples used in ttH analysis;
  • datasets_sync.json -- specifies MC samples used in ttH synchronization exercise;
  • datasets_hh_multilepton.json -- specifies all MC signal samples used in HH multilepton analysis;
  • datasets_hh_bbww.json -- specifies all MC signal samples used in HH bbWW analysis;
  • datasets_hh_bbww_sync.json -- specifies MC samples used in HH bbWW synchronization exercise;

These JSON files follow certain rules:

  • samples are grouped into an array of categories;
  • each category is split into samples that share the same physics process and cross section information;
  • each sample is further split into production campaigns;
  • all DAS names (dbs) or locations (loc) of each individual sample or file name (file) are defined for every production campaign. It may be the case that there are multiple samples covering the same phase space (e.g. samples with parton shower weights or extended samples), so they must be defined in the same production campaign but must be given a different name via the alt option.

For instance, datasets_hh_bbww_sync.json currently reads:

[
  {
    "category": "signal",
    "comment": "",
    "samples": [
      {
        "name": "signal_ggf_spin0_750_hh_2b2v",
        "enabled": 1,
        "use_case": "signal extraction",
        "process": "HH -> bbWW, WW -> 2l 2v (GGF)",
        "datasets": {
          "RunIISummer16MiniAODv2": [],
          "RunIISummer16MiniAODv3": [],
          "RunIIFall17MiniAOD": [],
          "RunIIFall17MiniAODv2": [
            {
              "dbs": "/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM",
              "file": "/local/karl/store/mc/RunIIFall17MiniAODv2/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/10000/F86CC95D-A1B0-E811-B516-0242AC130002.root"
            }
          ],
          "RunIIAutumn18MiniAOD": []
        },
        "xs": {
          "value": 0.026422,
          "order": "",
          "references": [
            "https://twiki.cern.ch/twiki/bin/view/LHCPhysics/CERNYellowReportPageBR",
            "http://pdglive.lbl.gov/BranchingRatio.action?desig=7&parCode=S043"
          ],
          "comment": "normalized to 1 pb: 2*BR(H->bb)*BR(H->WW)*BR(W->lnu)^2=2*0.5824*0.2137*0.3258^2"
        }
      }
    ]
  }
]

From this JSON, it's clear that the sample from RunIIFall17MiniAODv2 campaign is considered for the HH bbWW synchronization exercise.
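
The structure is easy to navigate programmatically as well. Below is a short standalone Python sketch (not part of the FW; the input file name is just an example) that walks such a JSON file and prints the DAS name and local copy of every sample, grouped by production campaign:

#!/usr/bin/env python
# Sketch: print the DAS name (dbs) and local copy (file) of every sample in a
# datasets_*.json file, per production campaign. The input path is an example.

import json

with open('test/datasets/json/datasets_hh_bbww_sync.json') as f:
  categories = json.load(f)

for category in categories:
  for sample in category['samples']:
    for campaign, datasets in sample['datasets'].items():
      for dataset in datasets:
        print('%s | %s | dbs: %s | file: %s' % (
          campaign, sample['name'], dataset['dbs'], dataset.get('file', '-')
        ))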

The JSON files give rise to so-called dataset tables and sample sum tables that are grouped by production campaign (which is clear from the file names). They're generated with the generate_dataset_table.py script like so:

generate_dataset_table.py -i test/datasets/json/datasets.json

The output will be stored in $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/datasets/txt. For instance, generate_dataset_table.py -i test/datasets/json/datasets_hh_bbww_sync.json creates only one file $CMSSW_BASE/src/tthAnalysis/NanoAOD/test/datasets/txt/datasets_hh_bbww_sync_2017_RunIIFall17MiniAODv2.txt which has the following content:

# file generated at 2019-05-12 02:26:43 with the following command:
# generate_dataset_table.py -i test/datasets/json/datasets_hh_bbww_sync.json

# HH -> bbWW, WW -> 2l 2v (GGF)
/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM 1 signal signal_ggf_spin0_750_hh_2b2v 0.026422 /local/karl/store/mc/RunIIFall17MiniAODv2/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/MINIAODSIM/PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/10000/F86CC95D-A1B0-E811-B516-0242AC130002.root  #  [1][2]; normalized to 1 pb: 2*BR(H->bb)*BR(H->WW)*BR(W->lnu)^2=2*0.5824*0.2137*0.3258^2

# References:
# [1] https://twiki.cern.ch/twiki/bin/view/LHCPhysics/CERNYellowReportPageBR
# [2] http://pdglive.lbl.gov/BranchingRatio.action?desig=7&parCode=S043

As is evident from the first line of this file, each dataset table records the actual command that was used to generate it. This is useful information because

  1. the user doesn't have to memorize the commands that were used to generate the dataset tables;
  2. when there's a mistake in the dataset tables, it's easier to track down the bug in the original JSON file.

The dataset files serve two purposes:

  1. they are read by launchall_nanoaod.sh script that generates config files for CRAB jobs or sync Ntuple production jobs;
  2. they are instrumental in building the so-called meta dictionaries, from which the sample dictionaries are generated.

NB! Unless you find new datasets that are missing from the JSON files, you don't need to generate the dataset tables. But if you do, please make sure that you push the newly generated tables to the repository.

Meta dictionaries

The so-called meta-dictionaries contain basic information about the MINIAOD datasets: the sample name, category name, cross section, the number of MINIAOD files in the dataset, the number of (unweighted) events it contains, the CMSSW release the MINIAOD files were produced with, and the status of the dataset according to DBS. All this information can be fetched only if you have opened a grid proxy for a sufficient amount of time.

Examples of meta dictionaries can be found in $CMSSW_BASE/src/tthAnalysis/HiggsToTauTau/python/samples/metaDict_2017_sync.py and in $CMSSW_BASE/src/hhAnalysis/bbww/python/samples/metaDict_2017_hh_sync.py. Each meta-dictionary contains the exact command that was used to generate it. For instance, metaDict_2017_hh_sync.py was generated in $CMSSW_BASE/src/hhAnalysis/bbww with

find_samples.py -V -i ../../tthAnalysis/NanoAOD/test/datasets/txt/datasets_hh_bbww_sync_2017_RunIIFall17MiniAODv2.txt -m python/samples/metaDict_2017_hh_sync.py

as is evident from the content of this meta-dictionary:

from collections import OrderedDict as OD

# file generated at 2019-05-02 01:40:57 with the following command:
# find_samples.py -V -i ../../tthAnalysis/NanoAOD/test/datasets/txt/datasets_hh_bbww_sync_2017_RunIIFall17MiniAODv2.txt -m python/samples/metaDict_2017_hh_sync.py

meta_dictionary = OD()


### event sums

sum_events = { 
}


meta_dictionary["/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM"] =  OD([
  ("crab_string",           ""),
  ("sample_category",       "signal"),
  ("process_name_specific", "signal_ggf_spin0_750_hh_2b2v"),
  ("nof_db_events",         200000),
  ("nof_db_files",          11),
  ("fsize_db",              11931037531),
  ("xsection",              0.026422),
  ("use_it",                True),
  ("genWeight",             True),
  ("comment",               "status: VALID; size: 11.93GB; nevents: 200.00k; release: 9_4_7; last modified: 2018-10-06 03:20:04"),
])


# event statistics by sample category:
# signal: 200.00k

Unlike the JSON files and dataset tables, which belong to the tth-nanoAOD repository, the meta-dictionaries are part of a specific analysis. The production campaign is now dropped from the file name and only the year of the production is kept.

NB! There's no need to generate any meta-dictionaries unless something has changed in the dataset tables (which in turn implies changes in the JSON files).

Sample dictionaries

Sample dictionaries define the locations of input NanoAOD Ntuples in an analysis. There are two or three types of sample dictionaries:

  1. sample dictionaries for Ntuples that haven't been post-processed. Needed before the post-processing begins;
  2. sample dictionaries for Ntuples that have been post-processed. Needed before we can run actual analysis jobs;
  3. sample dictionaries for Ntuples that have been post-processed and further skimmed for analyzing them with systematic uncertainties. Needed before we can run analysis jobs with systematic uncertainties.

Similarly to the dataset tables and meta-dictionaries, the sample dictionaries always contain the actual command that was used to generate them. Here's an example from the sample dictionary that corresponds to the bbWW sync Ntuple in 2017 ($CMSSW_BASE/src/hhAnalysis/bbww/python/samples/hhAnalyzeSamples_2017_nanoAOD_sync.py):

from collections import OrderedDict as OD

# file generated at 2019-05-02 01:46:10 with the following command:
# create_dictionary.py -m python/samples/metaDict_2017_hh_sync.py -p /hdfs/local/karl/sync_ntuples/2017/nanoAODproduction/2019May01 -N samples_2017 -E 2017 -o python/samples -g hhAnalyzeSamples_2017_nanoAOD_sync.py -M

samples_2017 = OD()
samples_2017["/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM"] = OD([
  ("type",                            "mc"),
  ("sample_category",                 "signal"),
  ("process_name_specific",           "signal_ggf_spin0_750_hh_2b2v"),
  ("nof_files",                       1),
  ("nof_db_files",                    11),
  ("nof_events",                      {
  }),
  ("nof_tree_events",                 52000),
  ("nof_db_events",                   200000),
  ("fsize_local",                     141613135), # 141.61MB, avg file size 141.61MB
  ("fsize_db",                        11931037531), # 11.93GB, avg file size 1.08GB
  ("use_it",                          True),
  ("xsection",                        0.026422),
  ("genWeight",                       True),
  ("triggers",                        ['1e', '1mu', '2e', '2mu', '1e1mu', '3e', '3mu', '2e1mu', '1e2mu', '1e1tau', '1mu1tau', '2tau']),
  ("has_LHE",                         True),
  ("LHE_set",                         "LHA IDs 306000 - 306102 -> NNPDF31_nnlo_hessian_pdfas PDF set, expecting 103 weights (counted 103 weights)"),
  ("local_paths",
    [
      OD([
        ("path",      "/hdfs/local/karl/sync_ntuples/2017/nanoAODproduction/2019May01/signal_ggf_spin0_750_hh_2b2v"),
        ("selection", "*"),
        ("blacklist", []),
      ]),
    ]
  ),
  ("missing_from_superset",           [
    # not computed
  ]),
  ("missing_hlt_paths",               [

  ]),
  ("hlt_paths",               [
    # not computed
  ]),
])

samples_2017["sum_events"] = [
]

NB! The sample dictionaries for Ntuples that haven't been post-processed must be regenerated every time you run the Ntupelization (i.e. cmsRun) job.

After the post-production job a new sample dictionary is needed for the sync Ntuple. It can be generated with the same command as the previous sample dictionary, but the output file name (-g) and the location of the Ntuples (-p) must be changed accordingly:

from collections import OrderedDict as OD

# file generated at 2019-05-16 20:02:57 with the following command:
# create_dictionary.py -m python/samples/metaDict_2017_hh_sync.py -p /hdfs/local/karl/ttHNtupleProduction/2017/2019May02_woPresel_nonNom_hh_bbww_sync/ntuples -N samples_2017 -E 2017 -o python/samples -g hhAnalyzeSamples_2017_sync.py -M

samples_2017 = OD()
samples_2017["/GluGluToRadionToHHTo2B2VTo2L2Nu_M-750_narrow_13TeV-madgraph_correctedcfg/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM"] = OD([
  ("type",                            "mc"),
  ("sample_category",                 "signal"),
  ("process_name_specific",           "signal_ggf_spin0_750_hh_2b2v"),
  ("nof_files",                       1),
  ("nof_db_files",                    11),
  ("nof_events",                      {
    'Count'                                            : [        52000, ],
    'CountWeighted'                                    : [        51972,        51953,        51984, ],
    'CountWeightedNoPU'                                : [        51992, ],
    'CountFullWeighted'                                : [        51972,        51953,        51984, ],
    'CountFullWeightedNoPU'                            : [        51992, ],
    'CountWeightedL1PrefireNom'                        : [        50011,        49988,        50026, ],
    'CountWeightedL1Prefire'                           : [        50011,        49562,        50455, ],
    'CountWeightedNoPUL1PrefireNom'                    : [        50031, ],
    'CountFullWeightedL1PrefireNom'                    : [        50011,        49988,        50026, ],
    'CountFullWeightedL1Prefire'                       : [        50011,        49562,        50455, ],
    'CountFullWeightedNoPUL1PrefireNom'                : [        50031, ],
  }),
  ("nof_tree_events",                 52000),
  ("nof_db_events",                   200000),
  ("fsize_local",                     203322842), # 203.32MB, avg file size 203.32MB
  ("fsize_db",                        11931037531), # 11.93GB, avg file size 1.08GB
  ("use_it",                          True),
  ("xsection",                        0.026422),
  ("genWeight",                       True),
  ("triggers",                        ['1e', '1mu', '2e', '2mu', '1e1mu', '3e', '3mu', '2e1mu', '1e2mu', '1e1tau', '1mu1tau', '2tau']),
  ("has_LHE",                         True),
  ("LHE_set",                         "LHA IDs 306000 - 306102 -> NNPDF31_nnlo_hessian_pdfas PDF set, expecting 103 weights (counted 103 weights)"),
  ("local_paths",
    [
      OD([
        ("path",      "/hdfs/local/karl/ttHNtupleProduction/2017/2019May02_woPresel_nonNom_hh_bbww_sync/ntuples/signal_ggf_spin0_750_hh_2b2v"),
        ("selection", "*"),
        ("blacklist", []),
      ]),
    ]
  ),
  ("missing_from_superset",           [
    # not computed
  ]),
  ("missing_hlt_paths",               [

  ]),
  ("hlt_paths",               [
    # not computed
  ]),
])

samples_2017["sum_events"] = [
]

NB! A sample dictionary for the post-processed Ntuples must always be generated after the post-processing step, because the scripts that submit the analysis jobs read the Ntuple locations from these sample dictionaries.

Post-processing

Copying the file

Once the cmsRun job is done, you need to copy the output file tree.root to /hdfs/local/$USER. However, the script that generates sample dictionaries (create_dictionary.py) assumes that the Ntuples are stored in a subdirectory that has the same name as the process name. Furthermore, in order to maintain compatibility with the directory structure produced by CRAB jobs, the Ntuple must be placed in another subdirectory called 0000 and renamed to tree_1.root. Good examples of locations of NanoAOD Ntuples that haven't been post-processed are:

/hdfs/local/$USER/sync_ntuples/2017/nanoAODproduction/2019May04/ttHJetToNonbb_M125_amcatnlo/0000/tree_1.root
/hdfs/local/$USER/sync_ntuples/2017/nanoAODproduction/2019May01/signal_ggf_spin0_750_hh_2b2v/0000/tree_1.root
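
The copy step can be scripted; here is a minimal sketch, assuming the cmsRun job was run in the current directory and reusing the bbWW example above (the destination base path, date stamp and process name are placeholders to adjust):

#!/usr/bin/env python
# Sketch: place the cmsRun output where create_dictionary.py expects to find it.
# The destination path and process name below are examples only.

import os
import shutil
import getpass

process_name = 'signal_ggf_spin0_750_hh_2b2v'
dst_dir = os.path.join(
  '/hdfs/local', getpass.getuser(),
  'sync_ntuples/2017/nanoAODproduction/2019May01', process_name, '0000'
)
if not os.path.isdir(dst_dir):
  os.makedirs(dst_dir)
shutil.copy('tree.root', os.path.join(dst_dir, 'tree_1.root'))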

Running the jobs

You can now proceed with post-processing the Ntuple by executing

./test/tthProdNtuple.py -e 2017 -v 2018Nov07 -m sync -O

NB! The Ntuples used in the regular analysis are not produced with the -O flag. The flag disables jet smearing and, consequently, MET is not recomputed with the smeared jets, i.e. we use "non-nominal" jets. We use this option only in the synchronization because the other groups do not smear the jets; if we were to enable this feature during the synchronization exercise, it would be difficult to disentangle this effect from other (potentially more serious) problems that may arise in the synchronization.
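
To illustrate why disabling the smearing also leaves MET untouched: schematically (this is not the actual post-processing code, and the function and variable names are made up for the illustration), the difference between the smeared and the original jet momenta is propagated to the MET vector, so smeared jets imply a re-computed MET:

import math

def propagate_smearing_to_met(met_pt, met_phi, jets):
  # jets: list of dicts with the original pt, the smeared pt and the phi of each jet
  px = met_pt * math.cos(met_phi)
  py = met_pt * math.sin(met_phi)
  for jet in jets:
    dpt = jet['pt_smeared'] - jet['pt']
    px -= dpt * math.cos(jet['phi'])
    py -= dpt * math.sin(jet['phi'])
  return math.hypot(px, py), math.atan2(py, px)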

However, in some cases it is useful to see how large an impact smearing the jets would have. In that case it makes sense to post-process the sync Ntuple again without the -O flag, put the output Ntuple into a different sample dictionary (python/samples/tthAnalyzeSamples_2017_sync.py) and perform the synchronization exercise against ourselves.

NBB! In case you want to run sync Ntuple post-production for another analysis, you need to specify this with the -m/--mode option. For instance, to run Ntuple post-production for the bbWW sync, you need to use the hh_bbww_sync mode like so:

./test/tthProdNtuple.py -e 2017 -v 2019May02 -m hh_bbww_sync -O

Note that in case it's not possible to run the job on the cluster, you have two options:

  1. either run the makefile in "local" mode by adding -R makefile at the end of the above command; or
  2. run the Ntuple post-production interactively by refusing to submit the jobs to SLURM (which can be done automatically with the -E option) and running the produceNtuple.sh command directly on the generated config file.

However, if it's not possible to create any files in /hdfs, you have to set outputDir to configDir in ./test/tthProdNtuple.py. The same tips apply to the analysis jobs as well.

Producing the sync Ntuple

Copying the file

Similarly to the previous step, a sample dictionary must be generated -- this time for the post-processed NanoAOD Ntuple. However, there's no need to copy the output file explicitly, unless you ran the post-production job interactively (and even in that case you don't have to create the directory structure yourself -- it has already been created by the tthProdNtuple.py script). Continuing the above examples, the post-processed Ntuple should end up at:

/hdfs/local/$USER/ttHNtupleProduction/2017/2018Nov07_woPresel_nonNom_sync/ntuples/ttHJetToNonbb_M125_amcatnlo/0000/tree_1.root

in ttH analysis and

/hdfs/local/$USER/ttHNtupleProduction/2017/2019May02_woPresel_nonNom_hh_bbww_sync/ntuples/signal_ggf_spin0_750_hh_2b2v/0000/tree_1.root

in HH bbWW analysis.

The respective sample dictionaries can be created with:

# in $CMSSW_BASE/src/tthAnalysis/HiggsToTauTau
create_dictionary.py \
  -m python/samples/metaDict_2017_sync.py \
  -p /hdfs/local/$USER/ttHNtupleProduction/2017/2018Nov07_woPresel_nonNom_sync/ntuples \
  -N samples_2017 \
  -E 2017 \
  -o python/samples \
  -g tthAnalyzeSamples_2017_sync.py \
  -M

# in $CMSSW_BASE/src/hhAnalysis/bbww
create_dictionary.py \
  -m python/samples/metaDict_2017_hh_sync.py \
  -p /hdfs/local/$USER/ttHNtupleProduction/2017/2019May02_woPresel_nonNom_hh_bbww_sync/ntuples \
  -N samples_2017 \
  -E 2017 \
  -o python/samples \
  -g hhAnalyzeSamples_2017_sync.py \
  -M

Running the jobs

To get the final sync Ntuple that can be compared to that provided by other groups, you need to run the following command:

./test/tthSyncNtuple.py -e 2017 -v 2018Nov07 -O -o sync_Tallinn_v30.root

NB!! Notice again the -O flag! It has the same meaning as before: the jets in the input Ntuple are not smeared and MET is not recomputed.

This script (tthSyncNtuple.py) is actually a wrapper for running multiple analyses at once, where each analysis job is told to produce a sync Ntuple from a single signal Ntuple. The workflow then hadds the files together into the file specified by the -o option.

The output file can be found in /hdfs/local/$USER/ttHAnalysis/2017/2018Nov07/sync_ntuple/sync_Tallinn_v30.root.

This file contains many trees, including syncTree, which is used for the object-level synchronization; the other, event-level trees are named after the convention syncTree_$CHANNEL_$REGION. The tree structure is the same regardless of the tree. As mentioned earlier, the workflow also supports the production of sync Ntuples for various shape uncertainties. Although we are currently (as of writing) the only group that has implemented this feature, and at first glance it may seem like an unnecessary use case, it's actually a useful way of quantifying the effects of various shape uncertainties at the sync Ntuple level. Synchronizing with ourselves on central vs shape uncertainties serves as a preliminary sanity check, because running whole analyses with various shape uncertainties is very expensive in terms of computing and human time, and we want to catch mistakes as early as possible.
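
To quickly see which trees (and thus which channels and regions) ended up in the output file, the keys of the file can be listed with pyROOT; a minimal sketch, assuming the output path from above:

import getpass
import ROOT

fn = '/hdfs/local/%s/ttHAnalysis/2017/2018Nov07/sync_ntuple/sync_Tallinn_v30.root' % getpass.getuser()
f = ROOT.TFile.Open(fn)
for key in f.GetListOfKeys():
  obj = key.ReadObj()
  if obj.InheritsFrom('TTree'):
    print('%-40s %d' % (obj.GetName(), obj.GetEntries()))
f.Close()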

Note that in the sync exercise of HH bbWW analysis the sync Ntuple is generated with ./test/hhSyncNtuple.py in $CMSSW_BASE/src/hhAnalysis/bbww.

Comparing the sync Ntuples

There are three or four major tools available in our FW that make it relatively easy to find discrepancies in the sync Ntuples:

  • compare_sync_objects.py which compares object level sync Ntuples by counting the number of dR-matched objects;
  • compareRootRLENumbers.py which compares event level sync Ntuples by performing various set operations based on the run, lumi and event numbers in each channel and region;
  • compareSyncNtuples.C which is a macro for producing plots of variables taken from object level sync Ntuple;
  • tthSyncNtuple.py to rerun the sync Ntuple production on the events that other groups select but we reject.

Comparing at the object level

There are two main tools to tackle the object level synchronization: compareSyncNtuples.C and compare_sync_objects.py. The former is used to produce a bunch of plots that compare all branches in each tree between any two groups. It's a very useful tool that shows how well each and every object- and event-level variable agrees in all channels and regions. However, the downside is that one needs to sift through a huge number of plots by hand. For instance, if 5 other groups have each provided 20 sync trees containing 150 variables, comparing every pair of Ntuples yields tens of thousands of plots. Although, to be fair, we're only interested in how well we are in sync with each of the other groups (5 pairwise comparisons instead of all pairs), and in the object-level sync we should only look at the inclusive tree (called syncTree: 20 -> 1), which leaves only 5 * 150 = 750 plots.
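
For orientation, the kind of comparison that compareSyncNtuples.C automates boils down to overlaying one variable from two groups' object-level trees and checking the bin-by-bin ratio. A minimal pyROOT sketch of that idea (the file names, the branch name mu1_pt and the binning are placeholders, and this is not the macro itself):

import ROOT

f_ref  = ROOT.TFile.Open('sync_Tallinn_v30.root')
f_test = ROOT.TFile.Open('sync_OtherGroup.root')
t_ref  = f_ref.Get('syncTree')
t_test = f_test.Get('syncTree')

# fill one histogram per group for the chosen variable
t_ref.Draw('mu1_pt >> h_ref(50, 0, 250)', '', 'goff')
t_test.Draw('mu1_pt >> h_test(50, 0, 250)', '', 'goff')
h_ref  = ROOT.gDirectory.Get('h_ref')
h_test = ROOT.gDirectory.Get('h_test')

# bin-by-bin ratio, i.e. the quantity shown in the bottom pad of the sync plots
ratio = h_test.Clone('ratio')
ratio.Divide(h_ref)
for b in range(1, ratio.GetNbinsX() + 1):
  print('bin %2d: ref %6.0f, test %6.0f, test/ref %.3f' % (
    b, h_ref.GetBinContent(b), h_test.GetBinContent(b), ratio.GetBinContent(b)
  ))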

When looking at the sync plots, you should pay attention to the following:

  • are the variables cut off differently between the two groups?
    • if one or the other group cuts on the variable being synced, it becomes apparent in the sync plot
  • is the bottom plot, which does a relative bin-by-bin comparison of the event counts, shifted up or down a bit?
    • this indicates that the two groups select a different number of objects. If the normalized distributions match in the main plot, look more carefully at either the first or the last bin: there must be some kind of excess of events there.
  • do the distributions match, but not quite exactly?
    • if the variables are plotted for objects that depend on the cleaning (electrons, taus, jets), then you need to take this effect into account when looking at the plots
  • if only the positive range is filled, then it's likely that the absolute value is taken before filling the sync tree
  • if the distributions look nothing alike, then it's a major problem that needs to be sorted out

However, these plots tell only half-truths: the two groups may at times select completely different objects, yet the overall distributions of the object-level variables still remain the same, either because the variables naturally follow the same trends or because the number of such cases is small relative to the sample size.

This is where compare_sync_objects.py comes into play. The program has only two prerequisites -- pyROOT and matplotlib -- so in principle it can be run on any platform.

According to its help message, the script has three functions:

$ compare_sync_objects.py -h
usage: compare_sync_objects.py [-h] {count,inspect,plot} ...

optional arguments:
  -h, --help            show this help message and exit

commands:
  {count,inspect,plot}

Each of these functions has its own help message, viz.

$ compare_sync_objects.py count -h
usage: compare_sync_objects.py count [-h] -i path [path ...] [-t name] [-o]

optional arguments:
  -h, --help                                   show this help message and exit
  -i path [path ...], --input path [path ...]  Input files (default: None)
  -t name, --tree name                         TTree name (default: syncTree)
  -o, --count-objects                          Count the number of preselected objects (default: False)
  -a analysis, --analysis analysis             Type of analysis the sync Ntuple was produced in (default: tth)

Unfortunately, the plot function is a bit buggy, so you should avoid it.

Counting the events and objects

The counting function is useful when updating the general sync table that is also filled in by the other groups:

$ compare_sync_objects.py count -i /some/path/to/sync_Tallinn_v30.root
/some/path/to/sync_Tallinn_v30.root:
  syncTree:                      56465
  syncTree_0l2tau_Fake:          38
  syncTree_0l2tau_SR:            37
  syncTree_0l2tau_mcClosure_t:   75
  ...
  syncTree_ttZctrl_SR:           43
  syncTree_ttZctrl_mcClosure_e:  51
  syncTree_ttZctrl_mcClosure_m:  54

When adding -o to the previous command, you'll also get the object counts at the very end:

  n_presel_mu:  18464
  n_presel_ele: 17660
  n_presel_tau: 14458
  n_presel_jet: 56440

In order to compare sync Ntuples produced in the HH bbWW analysis, you have to add -a hh_bbww to the above command. This is needed because the nomenclature of the branch names in the sync Ntuples is a bit different between the two analyses.

Comparing the objects

The second function, inspect, takes a so-called reference Ntuple (-i) and a test Ntuple (-j) as inputs and compares the objects in the sync tree specified by -t (defaults to syncTree, aka the object-level sync tree) by dR-matching the objects with the cone size given by -d (defaults to 0.01). It's also possible to limit the number of events in this test, either by setting the maximum number of events with -n or by giving a list of RLE numbers (or a file containing the RLE numbers) as an argument to -r. The flag -v increases the verbosity level as usual. Finally, the option -a tells in which analysis the sync Ntuple was produced (defaults to tth). Here's the full help message:

$ compare_sync_objects.py inspect -h
usage: compare_sync_objects.py inspect [-h] -i path -j path [-t name]
                                       [-r path/list [path/list ...]]
                                       [-n number] [-d cone size] [-v]
                                       [-a analysis]

optional arguments:
  -h, --help                                           show this help message
                                                       and exit
  -i path, --input-ref path                            Input reference file (default: None)
  -j path, --input-test path                           Input test file (default: None)
  -t name, --tree name                                 TTree name (default: syncTree)
  -r path/list [path/list ...], --rle path/list [path/list ...]
                                                       Path to the list of run:lumi:event numbers, or explicit space-separated list of those (default: [])
  -n number, --max-events number                       Maximum number of events to be considered (default: -1, i.e. all dR-matched objects) (default: -1)
  -d cone size, --dr cone size                         Maximum cone size used in object dR-matching (default: 0.01)
  -v, --verbose                                        Enable verbose output (default: False)
  -a analysis, --analysis analysis                     Type of analysis the sync Ntuple was produced in (default: tth)
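
For context, the dR-matching referred to above is the usual Delta R criterion between an object in the reference Ntuple and an object in the test Ntuple; a generic sketch of such a criterion (this is not the script's exact implementation, and the attribute names are made up for the illustration):

import math

def delta_r(eta1, phi1, eta2, phi2):
  # wrap the phi difference into [-pi, pi] before combining with the eta difference
  dphi = math.fmod(phi1 - phi2, 2 * math.pi)
  if dphi > math.pi:
    dphi -= 2 * math.pi
  elif dphi < -math.pi:
    dphi += 2 * math.pi
  return math.sqrt((eta1 - eta2) ** 2 + dphi ** 2)

def is_matched(ref_obj, test_obj, cone_size = 0.01):
  # an object pair counts as dR-matched if it falls within the chosen cone size
  return delta_r(ref_obj.eta, ref_obj.phi, test_obj.eta, test_obj.phi) < cone_size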

Example:

$ compare_sync_objects.py inspect -i sync_Tallinn_v30.root -j other_object_ntuple.root 
Total number of events considered: 56465
ispresel       mu1:     18464 (      7) in ref,   18464 (      7) in test,   18457 dR-matched
isfakeablesel  mu1:     13847 (      4) in ref,   13848 (      5) in test,   13843 dR-matched
ismvasel       mu1:     11790 (      4) in ref,   11791 (      5) in test,   11786 dR-matched
ispresel       mu2:      2890 (      6) in ref,    2889 (      5) in test,    2884 dR-matched
isfakeablesel  mu2:      1982 (      3) in ref,    1981 (      2) in test,    1979 dR-matched
ismvasel       mu2:      1522 (      4) in ref,    1519 (      1) in test,    1518 dR-matched
ispresel       ele1:    17660 (     27) in ref,   17765 (    132) in test,   17633 dR-matched
isfakeablesel  ele1:    10945 (     18) in ref,   11000 (     73) in test,   10927 dR-matched
ismvasel       ele1:     9345 (     17) in ref,    9395 (     67) in test,    9328 dR-matched
ispresel       ele2:     2721 (      8) in ref,    2750 (     37) in test,    2713 dR-matched
isfakeablesel  ele2:     1562 (      3) in ref,    1580 (     21) in test,    1559 dR-matched
ismvasel       ele2:     1192 (      2) in ref,    1210 (     20) in test,    1190 dR-matched
ispresel       tau1:    14458 (      2) in ref,   14458 (      2) in test,   14456 dR-matched
ispresel       tau2:     2262 (      1) in ref,    2261 (      0) in test,    2261 dR-matched
ispresel       jet1:    56440 (   1509) in ref,   56454 (   1523) in test,   54931 dR-matched
ispresel       jet2:    56170 (   3593) in ref,   56307 (   3730) in test,   52577 dR-matched
ispresel       jet3:    54997 (   5664) in ref,   55604 (   6271) in test,   49333 dR-matched
ispresel       jet4:    51532 (   6915) in ref,   53125 (   8508) in test,   44617 dR-matched

Here, the Tallinn Ntuple is used as the reference; the other group's object Ntuple is the test. The meaning of the columns:

  • the 1st column says which level of object selection the object has passed
    • ispresel means loose
    • isfakeablesel means fakeable
    • ismvasel means tight
  • 2nd column tells which object we are dealing with
    • mu, ele, tau, jet stand for muon, electron, hadronic tau and jet, respectively
    • the number tells the order of the object (1 for leading, 2 for subleading, etc.)
  • 3rd (4th) column tells how many objects the reference group selected (how many objects the reference group selected but aren't dR-matched with the objects of the same class from the test Ntuple)
  • 5th (6th) column tells how many objects the test group selected (how many objects the test group selected but aren't dR-matched with the objects of the same class from the reference Ntuple)
  • the last column tells how many objects selected by both reference and test group were actually dR-matched.

From the above example we can tell the following:

  • both groups select roughly the same objects for leading and subleading muons, electrons and taus
    • some discrepancy (< 1%) is expected and acceptable because the frameworks may use different precision for the variables
    • the number of unmatched electrons and taus looks relatively high, which may arise from a different cone size used in the cleaning of electrons and taus (but it can also mean nothing)
  • there are serious discrepancies in the way the jets are selected, though. This high level of disagreement is likely due to a different strategy of cleaning the jets, due to JECs or to smearing of the jets, or simply because the jets were not ordered by pT. The sync plots are probably very noisy/fuzzy in all jet variables.

By modifying the main loop of compare_sync_objects.py we can gain some insight into the matter:

  # Modify only between these long lines
  ##################################################################################################
  if not evt.jet1.is_matched or not evt.jet2.is_matched or not evt.jet3.is_matched or not evt.jet4.is_matched:
    print('RLE: %s' % rle)
    evt.jet1.printVars(['pt', 'E', 'eta', 'phi'])
    evt.jet2.printVars(['pt', 'E', 'eta', 'phi'])
    evt.jet3.printVars(['pt', 'E', 'eta', 'phi'])
    evt.jet4.printVars(['pt', 'E', 'eta', 'phi'])
    evt.tau1.printVars(['pt', 'eta', 'phi'])
    evt.tau2.printVars(['pt', 'eta', 'phi'])

  ##################################################################################################

Sure enough, after running the inspection on the first 20 events we see the following:

RLE: 1:8009:13579612
  jet1  pt       149.500000  vs     149.528900  =>     -0.028900
  jet1  E        452.318207  vs     452.394928  =>     -0.076721
  jet1  eta        1.771240  vs       1.771216  =>      0.000025
  jet1  phi        1.146484  vs       1.146584  =>     -0.000099
  jet2  pt        58.656250  vs      67.665901  =>     -9.009651
  jet2  E        209.465302  vs      68.779884  =>    140.685417
  jet2  eta        1.945557  vs      -0.174771  =>      2.120327
  jet2  phi        1.736572  vs       2.937042  =>     -1.200469
  jet3  pt        53.437500  vs      58.651024  =>     -5.213524
  jet3  E         55.312141  vs     209.422897  =>   -154.110756
  jet3  eta        0.110275  vs       1.945438  =>     -1.835163
  jet3  phi       -1.876709  vs       1.736583  =>     -3.613292
  jet4  pt          -        vs      53.436680  =>       -      
  jet4  E           -        vs      55.311844  =>       -      
  jet4  eta         -        vs       0.110281  =>       -      
  jet4  phi         -        vs      -1.876784  =>       -      
  tau1  pt        60.447723  vs      60.447723  =>      0.000000
  tau1  eta       -0.177399  vs      -0.177404  =>      0.000005
  tau1  phi        2.938477  vs       2.938482  =>     -0.000005
  tau2  pt        21.484795  vs      21.484795  =>      0.000000
  tau2  eta        0.717651  vs       0.717695  =>     -0.000043
  tau2  phi       -1.442871  vs      -1.442788  =>     -0.000083

So, it looks like the other group hasn't cleaned their jets with respect to the hadronic taus. Further investigation indicates that the group hasn't cleaned their jets at all.
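
For reference, jet cleaning of this kind simply drops every jet that overlaps with a selected lepton or hadronic tau within some cone. A generic sketch of the idea (the 0.4 cone size is a typical choice, not necessarily the one used in our FW, and the object attributes are made up for the illustration):

import math

def delta_r(eta1, phi1, eta2, phi2):
  # wrap the phi difference into [0, pi]
  dphi = abs(phi1 - phi2)
  if dphi > math.pi:
    dphi = 2 * math.pi - dphi
  return math.sqrt((eta1 - eta2) ** 2 + dphi ** 2)

def clean_jets(jets, leptons_and_taus, dr_min = 0.4):
  # keep only the jets that are farther than dr_min from every selected lepton/tau
  return [
    jet for jet in jets
    if all(delta_r(jet.eta, jet.phi, obj.eta, obj.phi) > dr_min for obj in leptons_and_taus)
  ]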

Comparing at the event level

The synchronization at the event level is based on the run, lumi and event (RLE) numbers in each channel and analysis region. If a group provides only the RLE numbers and nothing else, you can still use this information to build a minimal sync Ntuple for the event selection; example:

#!/usr/bin/env python

import ROOT
import array

input_map = {
  'syncTree_2lSS_SR'    : '2lss_sr.txt',
  'syncTree_3l_SR'      : '3l_sr.txt',
  'syncTree_ttWctrl_SR' : 'ttW_CR.txt',
  'syncTree_ttZctrl_SR' : 'ttZ_CR.txt',
  'syncTree_WZctrl_SR'  : 'wz_CR.txt',
}

fn = 'sync_tree.root'
f = ROOT.TFile.Open(fn, 'recreate')
for tree in input_map:
  tree_obj = ROOT.TTree(tree, tree)
  
  run = array.array('I', [0])
  lumi = array.array('I', [0])
  evt = array.array('L', [0])
  
  tree_obj.Branch("run", run, "run/i")
  tree_obj.Branch("ls", lumi, "ls/i")
  tree_obj.Branch("nEvent", evt, "nEvent/l")
  
  input_rles = input_map[tree]
  with open(input_rles, 'r') as input_f:
    for line in input_f:
      rles = list(map(int, line.rstrip('\n').split(':')))
      if len(rles) != 3:
        continue
      run[0] = rles[0]
      lumi[0] = rles[1]
      evt[0] = rles[2]
      
      tree_obj.Fill()
  tree_obj.Write()

f.Close()

Generating the sync tables

With the following command:

compareRootRLENumbers.py \
  -i group_1.root group_3.root group_2.root \
  -n Group1 Group3 Group2 \
  -T -v -f \
  -o ~/path/to/results \
  -t syncTree_2lSS_SR syncTree_2lSS_Fake syncTree_WZctrl_SR syncTree_ttWctrl_SR syncTree_ttZctrl_SR

you'll generate various tables which are helpful for pinning down any discrepancies between two groups. The meaning of the flags and options is the following:

  • -i takes a list of full paths to the (event-level) sync Ntuples;
  • -n takes a complementary list of group names separated by a space (the order in which you pass these labels must match the order in which you specify the input ROOT files)
  • -T generates various cross-tables
  • -v increases the on-screen verbosity level
  • -o tells where to store the tables
  • -f creates the output directory specified by -o if it doesn't exist
  • -t lists the tree names you want to synchronize on; if this option is not used, all trees are taken into account

For more options, see compareRootRLENumbers.py -h. Note that there is no upper limit on how many sync Ntuples you want to perform the synchronization on; the minimum number of input files is obviously two.
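
Conceptually, all the tables described below boil down to set operations on (run, lumi, event) tuples read from each tree. A minimal pyROOT sketch of that idea (this is not the script itself; the two input files and the tree name are taken from the example above, and the branch names follow the run/ls/nEvent convention):

import ROOT

def read_rles(file_name, tree_name):
  # collect the (run, lumi, event) tuples of one tree into a set
  f = ROOT.TFile.Open(file_name)
  tree = f.Get(tree_name)
  rles = set()
  for event in tree:
    rles.add((int(event.run), int(event.ls), int(event.nEvent)))
  f.Close()
  return rles

rles_1 = read_rles('group_1.root', 'syncTree_2lSS_SR')
rles_2 = read_rles('group_2.root', 'syncTree_2lSS_SR')

print('selected by both:      %d' % len(rles_1 & rles_2))
print('only in group_1.root:  %d' % len(rles_1 - rles_2))
print('only in group_2.root:  %d' % len(rles_2 - rles_1))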

The tables are generated in two or three file formats:

  • .txt files are human-readable tables
  • .csv files are tables in CSV (comma-separated values) format
  • .xls files are Excel tables, which are produced only if the unoconv program is installed

It should be noted that this script is standalone and has only the following prerequisites:

  • pyROOT
  • prettytable module
  • unoconv program (optional)

The program works on any platform as long as those two (or three) requirements are satisfied. The first two requirements are automatically satisfied in any (recent) CMSSW release.

Analyzing the sync tables

The command produces 5 types of files as a result:

  1. cross_Group1.*, cross_Group2.*, cross_Group3.* etc. show how many events in one channel & region are shared by another channel & region of the same group. You don't want to see any overlaps between two different SRs, or between the fake/flip AR and the SR of the same channel. So, these tables serve as a cross-check for the mutual exclusivity of the SRs, and of the SRs and fake/flip ARs of the same channel.

    Example 1.1: cross_Group1.txt

    +------------+---------+-----------+-----------+------------+------------+
    |   Group1   | 2lSS_SR | 2lSS_Fake | WZctrl_SR | ttWctrl_SR | ttZctrl_SR |
    +------------+---------+-----------+-----------+------------+------------+
    |  2lSS_SR   |   463   |     0     |     0     |     0      |     0      |
    | 2lSS_Fake  |    0    |    157    |     0     |     0      |     0      |
    | WZctrl_SR  |    0    |     0     |     9     |     0      |     2      |
    | ttWctrl_SR |    0    |     0     |     0     |     79     |     0      |
    | ttZctrl_SR |    0    |     0     |     2     |     0      |     43     |
    +------------+---------+-----------+-----------+------------+------------+
    

    According to this table, Group1 has 9 events in the SR of the WZ CR, and 2 of those events also enter the SR of the ttZ CR. This is a bad sign because the SRs of two different channels should not overlap.

    Example 1.2: cross_Group2.txt

    +------------+---------+-----------+-----------+------------+------------+
    |   Group2   | 2lSS_SR | 2lSS_Fake | WZctrl_SR | ttWctrl_SR | ttZctrl_SR |
    +------------+---------+-----------+-----------+------------+------------+
    |  2lSS_SR   |   453   |     10    |     0     |     0      |     0      |
    | 2lSS_Fake  |    10   |    167    |     1     |     0      |     0      |
    | WZctrl_SR  |    0    |     1     |     14    |     0      |     0      |
    | ttWctrl_SR |    0    |     0     |     0     |     79     |     0      |
    | ttZctrl_SR |    0    |     0     |     0     |     0      |     43     |
    +------------+---------+-----------+-----------+------------+------------+
    

    Here we see that Group2 has selected 167 events in the fake AR of the 2lSS channel. However, 10 events from this region also enter the SR of the same channel, which is wrong. There is 1 event selected in both the fake AR of the 2lSS channel and the SR of the WZ CR, which is actually fine. Some minor overlaps between the SR of one channel and the AR of another channel are expected and acceptable.

  2. Files cross_Group1_Group2.*, cross_Group2_Group3.*, cross_Group1_Group3.* show the overlaps of all channels and regions common between the two groups. The tables don't tell who has implemented the event selection incorrectly, but they do indicate that some kind of event migration is going on between the two groups.

    Example 2.1:

    +-----------------+---------+-----------+-----------+------------+------------+------+-------+
    | Group1 v Group2 | 2lSS_SR | 2lSS_Fake | WZctrl_SR | ttWctrl_SR | ttZctrl_SR | none | total |
    +-----------------+---------+-----------+-----------+------------+------------+------+-------+
    |    2lSS_SR      |   433   |     2     |     0     |     0      |     0      |  18  |  453  |
    |   2lSS_Fake     |    0    |    142    |     0     |     0      |     0      |  25  |  167  |
    |   WZctrl_SR     |    0    |     0     |     5     |     0      |     0      |  2   |   7   |
    |   ttWctrl_SR    |    11   |     1     |     0     |     70     |     0      |  5   |   87  |
    |   ttZctrl_SR    |    2    |     0     |     0     |     0      |     0      |  5   |    7  |
    |      none       |    17   |     12    |     4     |     9      |     43     |  x   |   76  |
    |     total       |   463   |    157    |     9     |     79     |     43     |  62  |   x   |
    +-----------------+---------+-----------+-----------+------------+------------+------+-------+
    

    Here rows (columns) correspond to channels & regions by Group2 (Group1).

    • the same 433 events are selected in 2lSS SR by both Group1 and Group2
    • 11 events selected in 2lSS SR by Group1 are actually selected in ttW SR by Group2
    • 2 events selected in 2lSS SR by Group2 are actually selected in 2lSS fake AR by Group1
    • etc etc
    • none means that the events selected in one category by one group are not selected at all by some other group
  3. For each cell in the tables of type 1 there are text files containing the RLE numbers of these events:

    • cross_rle_Group1_syncTree_2lSS_SR_syncTree_2lSS_SR.txt
    • cross_rle_Group1_syncTree_2lSS_SR_syncTree_2lSS_Fake.txt
    • cross_rle_Group1_syncTree_2lSS_SR_syncTree_WZctrl_SR.txt
    • ...
    • cross_rle_Group3_syncTree_ttZctrl_SR_syncTree_ttZctrl_SR.txt
  4. table_syncTree_$CHANNEL_$REGION.txt shows the same information as the tables of type 2, but compares multiple groups at once. This is extremely useful when multiple groups have covered the same channel and SR -- it's easier to see who to "blame".

    Example 4.1: table_syncTree_2lSS_SR.txt

    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |     syncTree_2lSS_SR     | none | Group2 | Group3 | Group1 | Group2 & Group3 | Group1 & Group2 | Group1 & Group3 | total |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |          Group2          |      |        |   19   |   20   |                 |                 |        19       |  453  |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |          Group3          |      |   24   |        |   1    |                 |        0        |                 |  458  |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |          Group1          |      |   30   |   6    |        |        6        |                 |                 |  463  |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |     Group2 & Group3      | 434  |        |        |   1    |                 |                 |                 |       |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |     Group1 & Group2      | 433  |        |   0    |        |                 |                 |                 |       |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    |     Group1 & Group3      | 457  |   24   |        |        |                 |                 |                 |       |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+
    | Group1 & Group2 & Group3 | 433  |        |        |        |                 |                 |                 |       |
    +--------------------------+------+--------+--------+--------+-----------------+-----------------+-----------------+-------+

Explanation:

  • the same 433 events are selected by Group1, Group2 and Group3
  • Group1 and Group3 select the same 457 events but 24 of those events are rejected by Group2
  • Group2 and Group3 select the same 434 events but 1 of those events is rejected by Group1
  • Group1 selects
    • 30 events that are rejected by Group2
    • 6 events that are rejected by Group2 and Group3
  • Group2 selects
    • 20 events that are rejected by Group1
    • 19 events that are rejected by Group1 and Group3
  • Group3 selects
    • 1 event that is rejected by Group1
    • 24 events that are rejected by Group2
  • conclusion: Group1 and Group3 agree the most and Group2 has some catching up to do. Although in practice there have been instances where the group that disagreed the most was actually right...
  5. For each filled cell in the tables of type 4 there are files containing the corresponding RLE numbers:

    • syncTree_2lSS_SR_Group1_Group2_Group3_select.txt
    • syncTree_2lSS_SR_Group1_Group2_select_Group3_reject.txt
    • syncTree_2lSS_SR_Group1_Group2_select.txt
    • syncTree_2lSS_SR_Group1_Group3_select_Group2_reject.txt
    • syncTree_2lSS_SR_Group1_Group3_select.txt
    • syncTree_2lSS_SR_Group1_select_Group2_Group3_reject.txt
    • syncTree_2lSS_SR_Group1_select_Group2_reject.txt
    • syncTree_2lSS_SR_Group1_select_Group3_reject.txt
    • syncTree_2lSS_SR_Group2_Group3_select_Group1_reject.txt
    • syncTree_2lSS_SR_Group2_Group3_select.txt
    • syncTree_2lSS_SR_Group2_select_Group1_Group3_reject.txt
    • syncTree_2lSS_SR_Group2_select_Group1_reject.txt
    • syncTree_2lSS_SR_Group2_select_Group3_reject.txt
    • syncTree_2lSS_SR_Group3_select_Group1_Group2_reject.txt
    • syncTree_2lSS_SR_Group3_select_Group1_reject.txt
    • syncTree_2lSS_SR_Group3_select_Group2_reject.txt

Further action

What can be done with all this information:

  • fix our code if we see an overlap between two SRs, or between the SR and the fake/flip AR of the same channel
  • send the RLE numbers (syncTree_$CHANNEL_$REGION_Tallinn_select_*_reject.txt) that the other group rejects but we select in the same channel & region
  • figure out why we reject the events that the other group selects

The last point can be carried out automatically:

./test/tthSyncNtuple.py \
  -e 2017 -v 2018Nov07_debug -O -o sync_Tallinn_v30.root \
  -S "~/path/to/results/%s_Group1_Group2_select_Tallinn_reject.txt" \
  -D -c 2lss ttWctrl ttZctrl WZctrl

The new options are:

  • -S tells each analysis job where the RLE numbers are ('%s' is a placeholder for the tree name, e.g. syncTree_2lSS_SR)
  • -D enables debugging messages in each analysis job
  • -c lists the channels (to save time on waiting)

Note that this command won't produce any sync Ntuples because the jobs are run on events that we reject but the other groups (Group1 and Group2 in this example) select. The important output of this command is actually the log files, which include the cutflow table and detailed information about each event (only if the -D flag is supplied).