Performing a Skim


Introduction

The following code fragments are meant as examples. All scripts offer meaningful help via -h/--help (or sometimes when called without arguments). The checkout recipe for skimming can be found in the Skimming section.

First, examples are given of how to test the skimming configuration on single files of data and MC samples. These examples can be run locally in your working directory, which is useful for checking whether the skim will run properly. After this, you will learn how to manage existing and new datasets needed for your analysis; the relevant information on these datasets is stored in a database. Finally, instructions are given for a large-scale skim, since an analysis might need about 100 different datasets as input. These skims must be performed with grid-submission tools in order to run the same task on several files in parallel.

Interactive Running of a Kappa Skim

Before submitting large numbers of grid jobs and during development phases, you can/should check your skim interactively. The python configuration for cmsRun is found in Kappa/Skimming/higgsTauTau/kSkimming_run2_cfg.py. It is steered by program arguments and is called via

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/
cmsRun kSkimming_run2_cfg.py [<argument1>=<value1> <argument2>=<value2> ...]

Example: skimming of one file of a di-muon data sample

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/
cmsRun kSkimming_run2_cfg.py \
testfile=root://xrootd.unl.edu//store/data/Run2015C_25ns/DoubleMuon/MINIAOD/16Dec2015-v1/20000/081A3AE2-ABB5-E511-9A0D-7845C4FC368C.root \
nickname=DoubleMuon_Run2015C25ns_16Dec2015v1_13TeV_MINIAOD \
globalTag=76X_dataRun2_16Dec2015_v0

An xrootd path to a testfile for a given sample can be found by searching for the sample in DAS, clicking on Files and then clicking on Download.
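If you prefer the command line and have a valid VOMS proxy, a testfile can also be looked up with dasgoclient. This is only a sketch; the dataset name below is reconstructed from the example file path above. Prepend root://cms-xrd-global.cern.ch/ to the returned /store path to obtain the xrootd path.

# query one file of the dataset from DAS (dataset name derived from the example above)
dasgoclient --query="file dataset=/DoubleMuon/Run2015C_25ns-16Dec2015-v1/MINIAOD" --limit=1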

The nickname can be looked up in Kappa/Skimming/higgsTauTau/samples/13TeV and is essential for the steering of the skim (see also the section New Datasets).

The global tag is different for simulation and data. The most recently used versions can be looked up in Kappa/Skimming/higgsTauTau/crabConfig.py. A complete list of global tags is found in the SWGuideFrontierConditions TWiki.
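For a quick look at the global tags currently set in the skimming configuration, a simple grep is usually enough (assuming the tags appear as plain strings in that file):

# show all lines assigning a global tag in the crab configuration
grep -i globaltag $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/crabConfig.py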

Example: skimming of one file of a DY MC sample

miniAOD, Fall15:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/mc/RunIIFall15MiniAODv2/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext1-v1/10000/004544CB-6DD8-E511-97E4-0026189438F6.root \
nickname=DYJetsToLLM50_RunIIFall15MiniAODv2_PU25nsData2015v1_13TeV_MINIAOD_madgraph-pythia8 \
globalTag=76X_mcRun2_asymptotic_RunIIFall15DR76_v1

miniAOD, Spring16:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/mc/RunIISpring16MiniAODv2/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PUSpring16_80X_mcRun2_asymptotic_2016_miniAODv2_v0_ext1-v1/00000/00F0B3DC-211B-E611-A6A0-001E67248A39.root \
nickname=DYJetsToLLM50_RunIISpring16MiniAODv2_PUSpring16_13TeV_MINIAOD_madgraph-pythia8_ext1 \
globalTag=80X_mcRun2_asymptotic_2016_miniAODv2_v1

miniAOD, Summer16:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/mc/RunIISummer16MiniAODv2/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/60000/CEAB3688-1CC7-E611-8BC3-C4346BBCB6A8.root \
nickname=DYJetsToLLM50_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext1 \
globalTag=80X_mcRun2_asymptotic_2016_TrancheIV_v8

Example: skimming of one file of a SM HTT MC sample

miniAOD, Summer16:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/mc/RunIISummer16MiniAODv2/GluGluHToTauTau_M125_13TeV_powheg_pythia8/MINIAODSIM/PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/120000/1C6F3F7F-96C8-E611-A0D7-0025905A4964.root \
nickname=GluGluHToTauTauM125_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_powheg-pythia8 \
globalTag=80X_mcRun2_asymptotic_2016_TrancheIV_v8

Example: skimming of one file of a SingleMuon Data sample

miniAOD, Summer16:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/data/Run2016H/SingleMuon/MINIAOD/03Feb2017_ver3-v1/80000/0040ECBB-76EA-E611-8FE7-A0000420FE80.root \
nickname=SingleMuon_Run2016H_03Feb2017ver3v1_13TeV_MINIAOD \
globalTag=80X_dataRun2_Prompt_v16

Example: skimming of one file of a DoubleEG Data sample

miniAOD, 2017:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/data/Run2017C/DoubleEG/MINIAOD/PromptReco-v3/000/300/742/00000/00970603-987E-E711-A3BE-02163E01A5C6.root \
nickname=DoubleEG_Run2017C_PromptRecov3_13TeV_MINIAOD \
globalTag=92X_dataRun2_Prompt_v8

Example: skimming of one file of a DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8 MC sample

miniAOD, 2017:

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/mc/PhaseISpring17MiniAOD/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/MINIAODSIM/FlatPU28to62_902_90X_upgrade2017_realistic_v20_ext1-v1/00000/0837196E-D728-E711-89CD-A4BF0102A5BD.root \
nickname=DYJetsToLLM50_PhaseISpring17MiniAOD_FlatPU28to62_13TeV_MINIAOD_madgraph-pythia8_ext1 \
globalTag=90X_upgrade2017_realistic_v20

Example: skimming of one file of an embedding ElMu sample

cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/ ;\
cmsRun kSkimming_run2_cfg.py \
testfile=root://cms-xrd-global.cern.ch//store/user/jbechtel/gc_storage/ElMu_data_2016_CMSSW826_freiburg/PromptReco/TauEmbedding_ElMu_data_2016_CMSSW826_Run2016H/8/merged_7307.root \
nickname=Embedding2016H_ElMuFinalState_imputPromptDoubleMumirrorminiAODv4_13TeV_USER \
globalTag=80X_dataRun2_Prompt_v16

In most cases, the output of this step is a file named kappaTuple.root in the current directory. It can be used as an input for the Artus step, e.g.

HiggsToTauTauAnalysis.py -i <path/to/kappaTuple.root> --nick <nickname> ...

It is essential to specify --nick here: by default the nickname is parsed from the filename, which does not work for a file like kappaTuple.root, so the nickname will not be found and a wrong Artus configuration might be picked up. Also note that the executable HiggsToTauTauAnalysis.py is not available in the skimming setup. Make sure that the analysis setup uses the same Kappa version that was used for producing the skim kappaTuple.root and that it is correctly compiled (using make -BC $CMSSW_BASE/src/Kappa/DataFormats/test).
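A minimal sequence in the analysis setup might then look as follows; the nickname is taken from the di-muon data example above and serves only as an illustration, and any further analysis options are omitted:

# recompile the Kappa DataFormats in the analysis setup (must match the Kappa version used for skimming)
make -BC $CMSSW_BASE/src/Kappa/DataFormats/test
# run Artus on the freshly produced test skim, passing the nickname explicitly
HiggsToTauTauAnalysis.py -i kappaTuple.root --nick DoubleMuon_Run2015C25ns_16Dec2015v1_13TeV_MINIAOD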

Scripts and Code to Use

Three python files and one .json database are the starting point for the skimming.

The main database contains the information on datasets which we want to use in our analysis. For each dataset, we store the information that can be derived from DAS queries and from the dataset name itself to distinguish each dataset. This information is used to create unique nicknames for each dataset, which are used later in the analysis to apply dataset-dependent analysis steps.

Additional information can also be added to each dataset, for example the cross-section of the process simulated in an MC dataset, the global tag which should be used to process the dataset with Kappa, or additional skim-specific tags, e.g. a tag for the "Moriond17" skim.

The dataset helper implements all functionality to manipulate the database, to query datasets that fulfil desired constraints, and to obtain the values of requested properties for one or more datasets.

The dataset manager is used to perform these database manipulations and to extend the database with new datasets.

The skim manager is the main tool to perform a skim. It starts with the definition of a skim database derived from the main database mentioned above, continues with the submission of tasks to the grid and their maintenance (job status determination and resubmission requests), and finishes with the creation of file lists for the successfully finished tasks, which can be used in the further analysis steps.

In the following chapters, the usage of these tools is explained in more detail.

Adding Datasets to the Database

Before starting the actual skim, you should make sure that the information on all relevant datasets available at that time is stored in the database. If this is not the case, you can add the missing datasets as follows:

python Kappa/Skimming/scripts/DatasetManager.py -i datasets.json --addDatasets "/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/MINIAODSIM" --globaltag "80X_mcRun2_asymptotic_2016_TrancheIV_v7" --xsec 5765.4 --overwrite

Make sure that your VOMS proxy is valid to run the command properly. In this example, the inclusive Drell-Yan MC dataset from the RunII Summer16 campaign is added to the database. The syntax of the dataset string follows the conventions used in DAS. Since this is a single sample with a known cross-section (--xsec) and the global tag (--globaltag) with which it should be processed is also known, both quantities are added to the database in one go. With the option --overwrite you explicitly demand that the database used as input (-i) is overwritten by the version updated with the added dataset. If you first only want to inspect the information that is about to be added by this command, omit this option.

In case the dataset is new, the corresponding output (without --overwrite) looks similar to this:

Using input database at /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/datasets.json
/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/MINIAODSIM
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/datasets?dataset=%2FDYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8%2FRunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2%2FMINIAODSIM
Adding missing datasets queried from the pattern: 
/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/MINIAODSIM
---------------------------------------------
https://cmsweb.cern.ch/dbs/prod/global/DBSReader/blocksummaries?dataset=%2FDYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8%2FRunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2%2FMINIAODSIM
only NEW:
{
  "null": {
    "DYJetsToLLM50_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext1": {
      "campaign": "RunIISummer16MiniAODv2", 
      "data": false, 
      "dbs": "/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/MINIAODSIM", 
      "energy": "13", 
      "extension": "ext1", 
      "format": "MINIAOD", 
      "generator": "madgraph-pythia8", 
      "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v7", 
      "inputDBS": "global", 
      "n_events_generated": "49144274", 
      "n_files": "477", 
      "process": "DYJetsToLL_M50", 
      "scenario": "PUMoriond17", 
      "xsec": 5765.4
    }
  }
}
------------------
only OLD:
{}
I do not overwrite  /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json

Otherwise, a warning is given that the dataset is already in the database, and nothing is added in that case.

Information like "n_events_generated" and "n_files" are directly queried from DAS. In some rare cases, the information delivered by the underlying DAS tool is different from the one shown on the web page. So you are strongly advised to check, whether the information in the output matches to the DAS query on the web page before you add the dataset to the database.

You can also add multiple datasets matching a DAS pattern:

python Kappa/Skimming/scripts/DatasetManager.py -i datasets.json --addDatasets "/SUSYGluGluToHToTauTau_M-*_TuneCUETP8M1_13TeV-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM" --globaltag "80X_mcRun2_asymptotic_2016_TrancheIV_v7" --xsec 1.0 --overwrite

In this example, several SUSY Higgs datasets with different masses are added to the database. The cross-section is set to the default value 1.0.

Query and Extend Database Information

You may also want to extend existing information in the database or query datasets with specific properties before you start a skim. The information is stored in the form of a .json dictionary for each dataset, so the database is nothing more than one big dictionary containing dataset dictionaries, allowing the usual dictionary manipulations commonly used in Python.
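Since the database is plain JSON, you can also inspect it directly, for example with jq if it is available; this sketch assumes the database is keyed by the dataset nickname at the top level, as the outputs in this section suggest:

# list all dataset nicknames stored in the database
jq 'keys' Kappa/Skimming/data/datasets.json
# dump the full dictionary of one dataset (nickname taken from the example above)
jq '."DYJetsToLLM50_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext1"' Kappa/Skimming/data/datasets.json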

There are the following functionalities to extend or to change the existing information on one or more datasets:

--addentry: With this option you can add a new entry like "xsec" or "globalTag" with the corresponding value to existing datasets or change the value of an already existing entry of one or more datasets.

--addtag: This option can be used to add additional information on the datasets that is analysis dependent, e.g. a "Moriond17" tag. It's up to the user to define a meaningful tag.

--addtagvalues: If a tag is already there, e.g. "Moriond17", then a list of tag values can be added to it to provide even more specific information: which analysis is this skim for (e.g. "Skim_MSSM" or "Skim_Base")? What kind of dataset is this (e.g. "MC_Summer16" or "MC_Spring16")? And so on. The definition of the tag and the corresponding tagvalues is free, so it can also be used to specify who from the analysis group skims the sample, if more than one person does this. One side-remark: this option should be used together with --addtag.

These user-defined tags and tagvalues can help later to query the right datasets for a skim. If you intend to make changes only to one specific dataset, it is equally fine to edit the database file directly. The options above are meant for the case where changes to several datasets have to be made.

There are three possibilities to query the information on datasets:

--nicks: This option can be used to query the dataset nicknames in the database that match the regular expression given to this option, e.g. "DY.*JetsToLL.*RunIISummer16" would deliver all Drell-Yan + jets samples from the RunIISummer16 campaign.

--query: This option can be used to query several properties of datasets, e.g. a query with '{"scenario" : "PUMoriond17", "process" : "W.*JetsToLNu"}' as argument would deliver all W + jets datasets with the Moriond17 pile-up scenario.

--tag and --tagvalues: These two options can be used to search for user-defined tags and tagvalues. For now, if you pass a list of tagvalues, e.g. "MC_Summer16,Skim_Base", only one of the list elements has to match, i.e. the list is interpreted with logical ORs. This could be upgraded to combinations of ORs and ANDs if desired.

You now know how to query the desired datasets. In order to print information on the queried datasets, you can use the following options:

--print: A flag indicating whether a print-out is desired. If --printkeys is not specified, the nick and the corresponding DAS name are printed out.

--printkeys: With this option you can specify the properties of a dataset which should be printed out besides the nick, e.g. "xsec,globalTag,campaign"

In the following, some examples are provided which combine the usage of the presented options of the dataset manager.

Example 1

python Kappa/Skimming/scripts/DatasetManager.py -i datasets.json --nicks "WJetsToLNu.*RunIISummer16" --addentry '{"globalTag" : "80X_mcRun2_asymptotic_2016_TrancheIV_v6"}' --addtag "testTag" --addtagvalues "testTagVal_1,testTagVal_2"

In this example the Summer16 W + Jets datasets are queried by their nicknames, their global tag value is changed, and additional tags and tagvalues are added. The output looks similar to this:

Using input database at /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json
--------------WARNING--------------
globalTag is already in Dataset.
{globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v7} will be overwritten by new entry {globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v6}
-----------------------------------
--------------WARNING--------------
globalTag is already in Dataset.
{globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v7} will be overwritten by new entry {globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v6}
-----------------------------------
--------------WARNING--------------
globalTag is already in Dataset.
{globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v7} will be overwritten by new entry {globalTag : 80X_mcRun2_asymptotic_2016_TrancheIV_v6}
-----------------------------------
only NEW:
{
  "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_amcatnlo-pythia8": {
    "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v6", 
    "testTag": [
      "testTagVal_1", 
      "testTagVal_2"
    ]
  }, 
  "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8": {
    "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v6", 
    "testTag": [
      "testTagVal_1", 
      "testTagVal_2"
    ]
  }, 
  "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext2": {
    "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v6", 
    "testTag": [
      "testTagVal_1", 
      "testTagVal_2"
    ]
  }
}
------------------
only OLD:
{
   "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_amcatnlo-pythia8": {
      "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v7"
   }, 
   "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8": {
      "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v7"
   }, 
   "WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext2": {
      "globalTag": "80X_mcRun2_asymptotic_2016_TrancheIV_v7"
   }
}
I do not overwrite  /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json

You are warned that the globalTag already exists and will be overwritten, and the difference between the old and the new dataset entries is shown at the end. If you intend to keep the changes and write them to the input database, don't forget to use --overwrite.

Example 2

python Kappa/Skimming/scripts/DatasetManager.py -i datasets.json --query '{"scenario" : "PUMoriond17", "process" : "WJetsToLNu"}' --print --printkeys "dbs" "globalTag" "xsec"

In this example the inclusive W + Jets datasets with the Moriond17 pile-up scenario are queried, and for these datasets the DAS name, the global tag and the cross-section are printed out. The output looks similar to this:

Using input database at /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json
only NEW:
{}
------------------
only OLD:
{}
WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8_ext2
dbs     :       /WJetsToLNu_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext2-v1/MINIAODSIM
---------------------------------------------------------
globalTag       :       80X_mcRun2_asymptotic_2016_TrancheIV_v7
---------------------------------------------------------
xsec    :       61526.7
---------------------------------------------------------
WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_amcatnlo-pythia8
dbs     :       /WJetsToLNu_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM
---------------------------------------------------------
globalTag       :       80X_mcRun2_asymptotic_2016_TrancheIV_v7
---------------------------------------------------------
xsec    :       61526.7
---------------------------------------------------------
WJetsToLNu_RunIISummer16MiniAODv2_PUMoriond17_13TeV_MINIAOD_madgraph-pythia8
dbs     :       /WJetsToLNu_TuneCUETP8M1_13TeV-madgraphMLM-pythia8/RunIISummer16MiniAODv2-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM
---------------------------------------------------------
globalTag       :       80X_mcRun2_asymptotic_2016_TrancheIV_v7
---------------------------------------------------------
xsec    :       61526.7
---------------------------------------------------------
I do not overwrite  /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json

Doing a Skim

Now that you are familiar enough with the dataset management, you are finally able to start a skim. Some preparation needs to be done before actually starting.

Preparations for skim

First, the tools needed for grid submission must be prepared. For crab3, source the following working version:

source /cvmfs/cms.cern.ch/crab3/crab.sh

Make sure that you use a grid-control version that works with the batch systems allowed in your set-up. The following grid-control version was tested at KIT and was found to have all needed features:

git clone https://github.com/janekbechtel/grid-control.git

Also be sure to set up the grid-control paths accordingly. The following command assumes that grid-control is installed in $CMSSW_BASE/src/:

export PATH=$PATH:$CMSSW_BASE/src/grid-control:$CMSSW_BASE/src/grid-control/scripts

Make sure that your VOMS proxy is valid (check with voms-proxy-info) and initialize it with voms-proxy-init if needed. The variable $X509_USER_PROXY should be set to the right path. To set it to a fixed path, you may define an alias in your ~/.bashrc, which you have to execute before you start with submissions. Example:

alias set_voms='export X509_USER_PROXY=~/k5-ca-proxy.pem'
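A typical proxy setup could then look like this (the proxy file location is only an example):

# point all grid tools to a fixed proxy location
export X509_USER_PROXY=~/k5-ca-proxy.pem
# create a proxy for the CMS VO, valid for up to 192 hours
voms-proxy-init --voms cms --valid 192:00
# check that the proxy is valid and how much lifetime is left
voms-proxy-info --all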

Furthermore, also check whether your grid certificate is mapped in SiteDB. You can check this by searching for your name in https://cmsweb.cern.ch/sitedb/prod/people. For instructions on how to map a certificate in SiteDB, see https://twiki.cern.ch/twiki/bin/viewauth/CMS/SiteDBForCRAB.

The path to the basic working directory for all your different skims can be stored in the environment variable SKIM_WORK_BASE. If it is not set, the path is chosen depending on the machine you are using. For the ekpbms1 machine at KIT, it's for example /portal/ekpbms1/home/<your-user-name>/kappa_skim_workdir/. If you wish to have a different basic working directory, specify SKIM_WORK_BASE in your ~/.bashrc.

In this basic directory, you should create a skimming working directory for the particular skim you want to perform (mkdir <path-to-your-skimming-workdir>) before starting the skim manager. Please choose a meaningful name for this directory to be able to find it easily later, e.g. "MC_Summer16_version_10-01-2017". This directory is used to store all information relevant for the skim (crab3 task folders, grid-control configs, grid-control workdirs, skim database, ...) and should be passed to the skim manager via the option -w. You can pass either the name of the directory (MC_Summer16_version_10-01-2017) or its full path (/portal/ekpbms1/home/<your-user-name>/kappa_skim_workdir/MC_Summer16_version_10-01-2017).
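Putting the last two paragraphs together, the preparation could look like this (path and directory name are only examples):

# optionally fix the base working directory, e.g. in your ~/.bashrc
export SKIM_WORK_BASE=/portal/ekpbms1/home/<your-user-name>/kappa_skim_workdir
# create the working directory for this particular skim campaign
mkdir -p $SKIM_WORK_BASE/MC_Summer16_version_10-01-2017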

Starting a Skim

Finally you can start! Since you want to skim only a subset of the datasets stored in the main database (used as input database via -i), you are strongly advised to choose the appropriate subset with a corresponding query, e.g. a certain tag and a certain tagvalue. As soon as you are sure about your query (test it with the dataset manager first), you can initialize a skim with a command similar to the following:

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017 --init

In case you are working on the NAF, you need to specify the backend you are using by adding the option -b naf to the above command.

During the skim initialization, crab tasks are created for each dataset with the folder syntax crab_<dataset_nickname> in the skimming directory. Additionally, grid-control configs are created in case you want to resubmit some of the tasks in parallel with grid-control (see next chapter). They are put into the gc_cfg folder in your skimming directory. After the initialization, the skim status for both crab3 and grid-control is set to "INIT".

The templates for the crab3 and grid-control configs are created by functions defined directly in the skim manager. This decision was made to have everything relevant for skimming in one place. Feel free to adapt the following functions in order to have grid submission configs suitable for your system:

https://github.com/KappaAnalysis/Kappa/blob/master/Skimming/scripts/SkimManager.py#L75

https://github.com/KappaAnalysis/Kappa/blob/master/Skimming/scripts/SkimManager.py#L233

https://github.com/KappaAnalysis/Kappa/blob/master/Skimming/scripts/SkimManager.py#L277

https://github.com/KappaAnalysis/Kappa/blob/master/Skimming/scripts/SkimManager.py#L296

As a guideline for crab3 configs, you can read the following twiki:

https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3ConfigurationFile

For grid-control, it is helpful to look into its documentation file:

https://github.com/janekbechtel/grid-control/blob/master/docs/documentation.conf

After the initialization step started with the command above, the skim manager checks the status of the crab tasks and updates it for the individual datasets from "INIT" to "SUBMITTED", "QUEUED", "FAILED", "COMPLETED" and so on. To check and update the status of the still running crab tasks, just execute the command above with --init omitted. As soon as a dataset task reaches the "COMPLETED" status, its status is not checked any more, which speeds up the check for the remaining tasks.

If you want to start a skim for a newly added dataset in the main database within the same skim campaign, first make sure that the dataset fulfils your skim query (in the example above: --tag "Moriond17" --tagvalues "MC_Summer16") and simply rerun the full command above. The grid-control configs of grid-control tasks in status "INIT" will be recreated. If a grid-control working directory is still found for these tasks, you will be asked whether you want to remove this directory.

Examples for currently supported analyses

CMSSW_7_6_3 Fall15 Data

mkdir $SKIM_WORK_BASE/Fall15_Data
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Fall15" --tagvalues "Skim_Fall15_Data" -w Fall15_Data --init

CMSSW_7_6_3 Fall15 MC Background

mkdir $SKIM_WORK_BASE/Fall15_MC_BG
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Fall15" --tagvalues "Skim_Fall15_MC_BG" -w Fall15_MC_BG --init

CMSSW_7_6_3 Fall15 MC Signal

mkdir $SKIM_WORK_BASE/Fall15_MC_Signal
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Fall15" --tagvalues "Skim_Fall15_MC_Signal" -w Fall15_MC_Signal --init

CMSSW_8_0_26_patch1 Moriond17 MC SM and MSSM Signal

mkdir $SKIM_WORK_BASE/Moriond17_MC_Signal
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "Skim_Moriond17_MC_Signal" -w Moriond17_MC_Signal --init

CMSSW_8_0_26_patch1 Moriond17 MC Background

mkdir $SKIM_WORK_BASE/Moriond17_MC_BG
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "Skim_Moriond17_MC_BG" -w Moriond17_MC_BG --init

CMSSW_8_0_26_patch1 Moriond17 Data re-Reco

mkdir $SKIM_WORK_BASE/Moriond17_Data_re-Reco
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "Skim_Moriond17_Data_re-Reco" -w Moriond17_Data_re-Reco --init

CMSSW_8_0_26_patch1 Moriond17 Data re-MiniAOD

mkdir $SKIM_WORK_BASE/Moriond17_Data_re-MiniAOD
cd $CMSSW_BASE/src
SkimManager.py -i Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "Skim_Moriond17_Data_re-MiniAOD" -w Moriond17_Data_re-MiniAOD --init

Automatic splitting of jobs

When the automatic splitting of jobs option is used, there are three stages. More information about the three stages and resubmission can be found in https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/3774/1.html. Do not blindly trust the dashboard, but check with --summary instead of --init:

SkimManager.py -i Kappa/Skimming/data/datasets.json <usual commands> --summary

and, if needed, via crab status:

crab status <skim_dir>/<crab_dir> --verboseErrors --long

The state "rescheduled" means that the job is being resubmitted in a tail job.

Skim Database

All information relevant during the skim is stored in a database derived from the main database, called skim_dataset.json.

In addition to the information stored in the main database, the current skim status is stored for each dataset under two dictionary keys: "SKIM_STATUS" for the crab task and "GCSKIM_STATUS" for the grid-control task.

Further supplementary information is the dictionary output of the crab status command. From this information, the tasks that must be resubmitted are determined, the status delivered by crab is read out, and the percentage of jobs done within a crab task is calculated.

You are encouraged to get familiar with the skim database, since in some cases you have to make changes in it by hand.
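A quick way to look at these entries is to inspect the JSON directly, for example with jq. This is only a sketch: it assumes the skim database sits in your skimming working directory and is keyed by the dataset nickname, so adapt it to the actual structure you find.

cd $SKIM_WORK_BASE/<Name-of-your-skim-folder>/
# show the crab and grid-control status entries of one dataset (adapt the nickname)
jq '."<dataset_nickname>" | {SKIM_STATUS, GCSKIM_STATUS}' skim_dataset.json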

Resubmission Strategies

During a skimming campaign you will find out pretty fast that almost nothing runs ideally and finishes on the first attempt. But don't give up! There are several strategies to restart failing jobs and tasks, depending on the type of problem.

Requested Memory Exceeded

The most frequent and perhaps most annoying error for the crab tasks is the famous error 50660 - your job exceeds the memory requested for it on the worker node. Such errors mostly point to memory leaks in the running software. The long-term solution would be to fix such bugs - but you still want to get your skim done with the current set-up, whatever it costs. Therefore, you keep resubmitting the failing jobs with a higher memory threshold. Within the skim manager, this can be done as follows:

First, get the latest crab status for running tasks:

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017/

After this, you resubmit the crab tasks which have failed jobs, thereby increasing the maximum memory (given in MB):

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017/ --resubmit-with-options '{"maxmemory" : "6000"}'

Jobs with a maximum memory of up to 10000 MB still got slots to run on with crab, so feel free to push the threshold further via trial and error and document the larger threshold here in case of success. As you can see, this is far above the guaranteed threshold of 2500 MB, which is taken as the default in the skim configuration.

Requested Walltime Exceeded

The next common error is that the processing of a job takes too long and the job is aborted. This can happen if a job gets big input files and slow worker nodes. Just proceed as before - first a status update, then a crab resubmission - with the changed option --resubmit-with-options '{"maxjobruntime" : "1440"}'. This corresponds to 24 hours, a reasonable walltime which is larger than the default value.

Bad Sites

Sometimes the sites you send your jobs to can also have a bad day, for example because they are overloaded. You can blacklist these sites for this period of time with a corresponding resubmission option, e.g. --resubmit-with-options '{"siteblacklist" : "T3_FR_IPNL,T3_US_UCR,T2_BR_SPRACE,T1_RU_*,T2_RU_*,T3_US_UMiss"}'. If you find out that a site is permanently bad, feel free to add it to the default blacklist in order to profit from this in future skimming campaigns. To get an overview of the failing jobs on such sites, search for your crab tasks on the task monitoring website.

All crab resubmit options discussed so far are processed as a dictionary, and therefore you can combine them. For example, to increase both the memory and the walltime thresholds for the next resubmission, use: --resubmit-with-options '{"maxmemory" : "6000", "maxjobruntime" : "1440"}'.

Resubmission with grid-control

Sometimes you have to resubmit a crab task very often. In that case you can consider resubmitting the still running tasks with grid-control, which will run in parallel. To prepare the grid-control resubmission, execute the following command:

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017/ --resubmit-with-gc

A shell script named while.sh will be created in your current directory, which runs grid-control for all tasks to be submitted in a while loop. This loop can be interrupted by removing the .lock file from the directory where while.sh is executed. Using this shell script allows you to monitor several grid-control tasks in one bash shell. You may want to consider running this script (and therefore the GC monitor interface) in a screen session.
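A possible workflow, directly following the description above (the screen session name is just an example):

# start the grid-control loop inside a screen session so it survives a logout
cd <directory-where-while.sh-was-created>
screen -S gc-skim
./while.sh
# later, from another shell in the same directory, interrupt the loop gracefully
rm .lock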

If you open the while.sh script, you can see which configs are currently running. It is intended that the individual grid-control configs are modified and maintained manually to improve the submission to the batch system you use. You are advised to change the low default value of the "in flight" option from 50 to a desired value in case your batch system and the storage site you use allow this. The management of the individual grid-control tasks is left to the user, so please get familiar with useful grid-control commands and configurations. Similarly to the crab tasks, you can change the memory and the walltime in the config and then resubmit the failing jobs, for example via the --reset FAILED option of the grid-control executable. To be on the safe side, stop the while.sh shell script right before doing this.

To update the status of the grid-control tasks in the skim database you can perform the following command:

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017/ --status-gc

After a status update, some of your grid-control tasks may be completed. You can then stop the while.sh script, rerun the grid-control resubmission command and thereby update the while.sh script, which will then have fewer grid-control tasks left. Restart the script and you will notice that the successfully finished tasks are not monitored anymore.

To adapt the number of jobs submitted in flight by grid-control, the memory or the maximum retry number, you can use the command line tools provided by common Linux distributions. First move into the folder where the grid-control configs for your skim are created:

cd  $SKIM_WORK_BASE/<Name-of-your-skim-folder>/gc_cfg/

To change for example the number of jobs in flight to 500 (default is 50) for all grid-control configs, execute in the gc_cfg folder:

find ./ -name "*.conf" -exec "sed" -i "s/in flight =.*/in flight = 500/g" {} \;

To modify only a subset of these configs, change the shell wildcard expression *.conf correspondingly.
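In the same spirit, other settings can be changed in bulk. For example, assuming your grid-control configs contain a plain "memory =" line (check the files in gc_cfg first):

# raise the requested memory to 4000 MB in all grid-control configs
find ./ -name "*.conf" -exec "sed" -i "s/memory =.*/memory = 4000/g" {} \;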

Trouble-shooting

Impossible to retrieve proxy from myproxy.cern.ch

Sometimes CRAB does magic. If you encounter this error, try to play around with your proxy until it works (you may want to use only one dataset from the datasets you want to skim, in order to quickly check if CRAB is working properly). One approach that worked is:

  • Change L137 in SkimManager.py to submit_dict = {"config": config} (so just remove the part , "proxy" : self.voms_proxy)

Sometimes this step is enough to make it work. If it still doesn't:

  • Change the path of the variable $X509_USER_PROXY: export X509_USER_PROXY=path-to-anywhere. 'anywhere' means any location you want, except your .globus/ folder. In this way, CRAB should not be able to find a valid proxy and should ask you to set it.
  • If it works, discard the changes in SkimManager.py and go back to the default instructions.

Told you that CRAB does magic...

CRAB cache full

If you encounter an error like:

File "/cvmfs/cms.cern.ch/crab3/slc6_amd64_gcc493/cms/crabclient/3.3.1704-comp/lib/python2.7/site-packages/CRABClient/JobType/Analysis.py", line 180, in run
configArguments['cachefilename'] = "%s.tar.gz" % uploadResult
UnboundLocalError: local variable 'uploadResult' referenced before assignment

this means that the CRAB cache is full (about 80 MB per job, so it gets full pretty quickly). Once a task is submitted, you can safely clean the corresponding crab cache, as explained in the twiki. Copy and paste the given script into a file (e.g. cleancrabcache.py) and execute it (i.e. python cleancrabcache.py). Then do the following:

  • In skim_dataset.json manually substitute the SKIM_STATUS: INIT with SUBMITTED
  • Update the SKIM_STATUS by running the usual command with or without the --summary
  • Run the command again with --remake

Tasks simultaneously on CRAB and GC

Let's say you have submitted a subset of tasks (e.g. taskA, taskB, taskC) also on GC, while they are still running on CRAB. And let's say you have:

  • taskA completed in CRAB, running in GC
  • taskB completed in GC
  • taskC running in CRAB and GC

If you stop the while.sh script as explained before, because you do not want it to monitor taskB anymore (since it is completed), and you follow the instructions given before to update it, you will end up with only taskC in the script. This is because the SKIM_STATUS of taskA is COMPLETED. You have two possibilities:
  1. Change SKIM_STATUS of taskA from COMPLETED to SUBMITTED in skim_dataset.json and run the command to resubmit with gc (--resubmit-with-gc) or
  2. Add the missing dataset by hand in while.sh

Also, next time, if the number of tasks is not huge, you may want to update while.sh by hand (by removing the completed GC tasks) in order to avoid this conflict between the CRAB status and the GC status.

Creation of File-lists

Once a task on a dataset is completed, you may start with the creation of a file-list for this dataset, which contains the paths to the job outputs written to the storage element you have configured for your tasks. The default storage element is the dCache of T2_DE_DESY, and the output files of this site are accessed via the dcap protocol. The way a storage element of a certain site should be accessed can be configured here:

self.voms_proxy = None
self.site_storage_access_dict = {
	"T2_DE_DESY" : {
		"dcap" : "dcap://dcache-cms-dcap.desy.de//pnfs/desy.de/cms/tier2/",
		"srm" : "srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/",
		"xrootd" : "root://dcache-cms-xrootd.desy.de:1094/",
	},
	"T2_DE_RWTH" : {
		"dcap" : "dcap://grid-dcap-extern.physik.rwth-aachen.de/pnfs/physik.rwth-aachen.de/cms/",
		"srm" : "srm://grid-srm.physik.rwth-aachen.de:8443/srm/managerv2\?SFN=/pnfs/physik.rwth-aachen.de/cms/",
		"xrootd" : "root://grid-vo-cms.physik.rwth-aachen.de:1094/",
	}
}

For crab tasks, the paths are obtained from the output of the crab getoutput tool. The main advantage is that the retrieval of the file-list is independent of the storage element site, and you are neither required to have a working directory on the NAF nor tied to a certain folder structure to create file-lists. The main disadvantage: retrieving the information for one task can take quite long, depending on how overloaded the crab server is. But since you usually create a file-list for a task only once, this is not a major problem.

For grid-control tasks, the information on the output files is retrieved from the job.info files. First it is checked whether all jobs completed successfully by counting the number of jobs with a zero exit code. If this matches the number of started jobs, a file-list creation is initiated. To create these file-lists you should be on the machine where you started the skim. Again, no particular folder structure is required.

For now, you always receive the crab file-list first; a grid-control file-list is only created if the corresponding grid-control task was completed before the crab task. To perform the list creation, execute the following command in the analysis setup, so that the filelists are copied directly to the right place in $CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau/data/Samples/:

 SkimManager.py -i /portal/ekpbms1/home/akhmet/workdir/CMSSW_8_0_25/src/Kappa/Skimming/data/datasets.json --tag "Moriond17" --tagvalues "MC_Summer16" -w /portal/ekpbms1/home/akhmet/kappa_skim_workdir/MC_Summer16_version_10-01-2017/ --create-filelist -d [<date>] [-r]

After a file-list is created for a certain dataset, the status of the task (grid-control or crab) is set to "LISTED". This dataset is then excluded from repeated file-list creation. In case you want to create a file-list from scratch, use the option --reset-filelist before executing the command above.

All file-lists are copied to the following folder:

$CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau/data/Samples/

"Recent" symlinks are automatically created with the -r option.

If you wish, you can also create a collection file, in which the various file-lists needed for a certain analysis, e.g. SM HTT, are listed.
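A minimal sketch, assuming a collection file is simply a text file listing the per-sample filelists (check an existing *_collection_*_recent.txt for the exact conventions used by the group; the collection name below is hypothetical):

cd $CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau/data/Samples/
# collect the recent per-sample filelists needed for the analysis into one collection file
ls XROOTD_sample_*_recent.txt > XROOTD_collection_Example_recent.txt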

Now everything is ready for an Artus run.

Trouble-shooting

Getting output from crab exited with error. Try again later.

This error is also due to some proxy magic done by CRAB. To work around it, try the following:

  • First, destroy your proxy with the instructions given in the twiki
  • Then run the command
crab getoutput -d <path-to-crab-task> --xrootd --jobids 1-10

CRAB will then create your proxy at this stage.

  • After this, the usual command using the SkimManager should work.

Summary for the Skim Manager

At the end of this tutorial a summary diagram is presented for the skim manager. Enjoy it ;)

Skim Manager Workflow

Skimming with CRAB3

The samples to be skimmed are defined in the following line in Kappa/Skimming/higgsTauTau/crabConfig.py

nicknames = read_grid_control_includes(["samples/13TeV/..."])

To submit your jobs, run the commands listed in the block below. Be aware that the sourcing should only be done when you want to run crab, because afterwards some other commands (like some git commands) no longer work in the same terminal. The script is prepared to be run on the NAF.

source /cvmfs/cms.cern.ch/crab3/crab.sh
cd $CMSSW_BASE/src/Kappa/Skimming/higgsTauTau/
python crabConfig.py submit

If you want to run it elsewhere, you should adjust the paths in crabConfig.py starting with /nfs/dust/cms/user/ as prepared in this commit for running in Aachen.

The outputs should be written to your personal DESY dCache directory:

srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/store/user/$USER/higgs-kit/skimming/<date>...

Normally, there is no reason to change this location. It has the advantage that you can access it easily from the NAF, where /pnfs is mounted. Additionally, most Artus jobs processing these outputs are run on the NAF, which ensures the fastest way to read these files.

After your jobs are created, you can monitor them using crab status or use the Task Monitoring website, where you need to search for your name.

There is also a script checkOnCrabJobs.py available for ease-of-use when performing a skim of many samples. This script outputs the number of submitted, failed and completed jobs as well as jobs with status codes that the script currently does not understand (other). It can also automatically resubmit failed jobs using specific options (e.g., --maxmemory 6000) if desired.

Useful information

Trouble-shooting

CRAB writes its log output to the same place as the ROOT output files (which is different from Grid-Control). You should log in to the NAF (see Getting Started - Accounts and Bash and SSH) and digest the log files there. The most convenient way might be to use Midnight Commander (mc).

cd /pnfs/desy.de/cms/tier2/store/user/$USER/higgs-kit/skimming/<date>...
mc

Note: Since you may have used your CERN username for submitting the crab jobs, $USER here should again be your CERN username. If your NAF username and CERN username differ, you have to manually substitute $USER (= NAF username) with your CERN username in the line above.

The directories containing the ROOT outputs also contain a sub-directory called failed in case jobs of the given sample have failed. Inside the corresponding log directory you find three log files that might give you a hint about possible problems.

In cases where just a few of many similar jobs fail, or the log files indicate a reason that is most probably due to a problem at the site where the job has been processed (e.g. the input file could not be read, the CMSSW software could not be found, or the results could not be written out), the failed jobs can simply be resubmitted:

crab resubmit </nfs/dust/cms/user/$USER/kappa/crab_kappa_skim-<date>/<subdir>> [--siteblacklist=...]

In many cases, CRAB makes a limited number of automatic attempts to resubmit jobs. Blacklisting sites is useful in case jobs are constantly sent to sites that are known to have problems.

Jobs keep failing after resubmission

One of the reasons can be due to Error 50660: 'Not retrying job due to excessive memory use (job automatically killed on the worker node)'.

If this is the case, try to resubmit your jobs with the option --maxmemory 4500 if you use the crab command directly, or run:

checkOnCrabJobs.py --dir <path-to-dir> --resubmit --resubmit-args "--maxmemory 4500"

If your jobs still keep failing after N resubmissions, check the following cases (see the sections below) before restarting the task completely from scratch (i.e. submitting all the jobs, failed and finished).

Memory limits offered by each site are accessible on the GlideinWMS VO Factory Monitor page, http://glidein.grid.iu.edu/factory/monitor/ (choose "Current Status of the Factory" and click on a site CE listed under "Entry Name" in the table).

Jobs keep failing: number of jobs greater than number of files on DAS

Sometimes it happens that one or more files of a given dataset are somehow missing from DAS (not clear why this happens) and you will notice something like:

number of files on DAS for a specific dataset = N

number of jobs in the corresponding crab task = M

with M > N (assuming 1 job/file, otherwise it's just algebra...)

In such a case, if the number of jobs that keep failing is M-N, those jobs do not find the file they are supposed to open (indeed, no cmsRun_X.log is created for them).
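To check whether you are in this situation, compare the number of files DAS actually knows about with the number of jobs in the crab task (dataset and task path are placeholders):

# number of files registered in DAS for the dataset
dasgoclient --query="file dataset=<your-dataset>" | wc -l
# number of jobs crab created for the corresponding task
crab status -d <path-to/crab_task> --long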

Long tasks

For very long tasks (a high number of jobs), the probability that some Grid nodes have problems before all the jobs complete is very high (unscheduled downtime, for example). If you are very unlucky, the Grid scheduler can mess up the status of your jobs. One symptom: running crab status -d path-to/crab_taskX --long gives Error in retrieving task status. This can sometimes happen and, if there are no serious problems, the command should work again after a few minutes. If it doesn't, a good option is to create a recovery task (see next section).

Recovery task

Let's say you have a task, e.g. taskA, in which, out of 478 jobs in total, 436 finished and 42 keep failing (or are stuck in the idle state). If crab resubmit (or crab kill and then crab resubmit, in case of the idle state) does not work anymore, the best solution is to create a recovery task (Twiki:Recovery_task). In this way, you create a new task in which you run only the unfinished jobs of taskA.

In order to create the recovery task, do the following:

  • First of all, you need to get the list of lumi sections of the dataset (inputDatasetLumis.json) and the list of lumi sections already processed (processedLumis.json). These two files are created by crab report:
crab report -d path-to/crab_taskA

and they are saved in the folder path-to/crab_taskA/results/.

  • Then, you need to create the list of missing lumi sections (missingLumis.json), which is basically the diff between inputDatasetLumis.json and processedLumis.json. In principle, crab report should provide you with a lumiToProcess.json (or notFinished.json if you use the option --recovery notFinished), but sometimes this doesn't work. In this case, you can just use the script findMissingLumis.py (in Kappa/Skimming/scripts):
findMissingLumis.py -d path-to/crab_taskA

The script creates the missingLumis.json and saves it in path-to/crab_taskA/results.

  • The next step is to modify crabConfig.py in order to run over the missing lumi sections of taskA: set the nickname of taskA (in order to run only over this dataset):
nicknames = ['nickname_of_taskA']

and uncomment (and set the correct path) the following line:

config.Data.lumiMask = 'path-to/crab_taskA/results/missingLumis.json'

in order to run only over the missing lumi sections.

  • The last step is to run python crabConfig.py submit.

When you check the status of the recovery task, you will see that the number of jobs created equals the number of unfinished jobs of taskA (42 in the example above).

When you later create the dataset lists for Artus (using createInputFilelists.py), remember to run it with the option --no-strict-checking, since your taskA is now split into two folders.

Note: it is better to have taskA and its corresponding recovery task in two different crab folders to avoid any unwanted overwriting (the job id is reset in the recovery task, which means that jobid=1 in the recovery task may not correspond to jobid=1 in the original taskA). If you create the recovery task on a different day from the first crab submission, you don't have to worry about this (the crab folder is of the form 80X_crab_YYYY_MM_DD, and therefore the recovery task will automatically end up in a new crab folder). Otherwise, you can either rename the crab folder (in order to have 80X_crab_YYYY_MM_DD_old and 80X_crab_YYYY_MM_DD) or modify config.General.requestName in crabConfig.py (in order to have nickname_taskA and nickname_taskA_recovery, but this could create problems when running createInputFilelists.py).

Preparations for Running Artus on new Skims

Creating Filelists

File lists for Artus need to be created by createInputFilelists.py

createInputFilelists.py -s /pnfs/desy.de/cms/tier2/store/user/tmuller/higgs-kit/skimming/<date>_<campaign> -d <date> [-r]

These filelists allow running over multiple files with Artus, since paths containing wildcards cannot be used on dCache filesystems. We keep lists for various skimming campaigns, indicated by the date (see option -d), in order to be able to compare them. Normally, we use the symlinks ending with *_recent.txt, which are created with the -r option. These (*_sample_*.txt) files we usually put into collections (*_collection_*_recent.txt), which collect different samples (and collections) that we often process together. In case you are sure that your outputs are the most recent ones to be used by the group, also create these symlinks with the -r option. The resulting filelists need to be committed. As soon as they are committed, the skim is accessible to everybody.

Numbers of Generated Events

In case not all files of an MC sample have been skimmed successfully, you can manually check how many events have actually been processed with getNumberOfGeneratedEvents.py. Example usage:

getNumberOfGeneratedEvents.py /pnfs/desy.de/cms/tier2/store/user/<USERNAME>/higgs-kit/skimming/<CAMPAIGN>/<SAMPLE-YOU-ARE-INTERESTED-IN>/*/*/*/*.root

Depending on the number of files, this may take a while.

NLO Generator Weights

When using NLO samples, the proper normalization has to be applied. This can be done with getGeneratorWeight.py. Do e.g.

getGeneratorWeight.py `cat <$CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau/data/Samples/XROOTD_sample_..._recent.txt>`
for filelist in <$CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau/data/Samples/XROOTD_sample_..._recent.txt>; do getGeneratorWeight.py $filelist; done
getGeneratorWeight.py </pnfs/.../*.root> 

to update all entries in the Kappa database. Remember to commit your changes afterwards. Note: you should pass the collection filelist (the list that includes the filelists of the samples), not the sample filelist.

Jet Energy Correction Parameters

JEC parameters can be downloaded using getJecParameters.py:

getJecParameters.py

The retrieved files have to be configured in Artus.

Pile-up Weights

Pile-up weights are determined using puWeightCalc.py:

puWeightCalc.py -h

Cross Sections

Deleting old skims

There is a script which not only deletes the files on dCache but also removes the corresponding filelists:

cd $CMSSW_BASE/src/HiggsAnalysis/KITHiggsToTauTau
deleteInputFilelists.py data/Samples/XROOTD_sample_<sample>_<date>.txt <...>

This will git-remove the filelists and should therefore be called from within the KITHiggsToTauTau repository. Run git status afterwards and commit/push the changes the script prepared.

Only skims that are superseded by a newer version/date should be removed. Check the recent symlink with:

ls -l data/Samples/XROOTD_sample_<sample>_recent.txt

If the last copy of a skim is to be deleted, make sure that none of your colleagues will need it again (as old skims might be complicated to re-skim). In this case, the collection and recent filelists must be deleted by hand.
