JLab Computing

From PREX Wiki
Revision as of 00:10, 6 August 2021 by Cameronc (talk | contribs) (→‎Checking the Status and Resubmitting)

Using ifarm Scientific Computing Resources

The JLab computer center manages the various ifarm clusters and data storage. These are some of the resources offered there (general guide):

  • To log into the JLab servers, first ssh into login.jlab.org, which requires a JLab CUE computer account. From there you can ssh directly to another node on the JLab network, such as the jlabl1 workstations or the ifarm, using its hostname.
  • Standard nuclear physics computing programs and analysis software can be initialized into your coding environment by executing the "production" scripts. To execute these scripts upon logging into the jlab servers place the following code or something similar in a file called .login or any .*rc file you prefer:
source /site/12gev_phys/softenv.csh 2.3
  • If you have problems then make sure you are running the /bin/tcsh shell (execute "echo $SHELL" on the command line to see what shell you are running)
  • Recently the production scripts have been updated to "softenv" scripts, and the newest version is 2.3
  • The swif workflow program can manage running batch jobs on the Auger batch farm system for you.
  • See the "job management" section of Ciprian Gal's prexSim simulation readme for specific details on how to get a swif workflow up and running for you.

Getting Account Access Permissions

  • Register as a JLab user, undergrad, or grad student (here, or register from the "online" link in here).
    • After registration, you have to "Register New Visit" under the user group "Remote Access," even though you aren't necessarily visiting, and you will need to call the JLab helpdesk at some point.
    • While filling out the Registration form you can request an account on the JLab Common User Environment (CUE). You must include Bob Michaels (rom@jlab.org) as your JLab sponsor for the account - be sure to request access to a-parity and moller12gev user groups (here is a good starting link).
    • To set up your computing environment on the ifarm see above.
    • To get access to swif and scientific computing resources follow the instructions here.
    • Then to use swif see the guide below or readmes in relevant repositories.
  • Jefferson Lab github access - Send an email with the following (and if this doesn't work ask one of the senior members of the collaboration to add you themselves):
Subject: Please add me to the JeffersonLab github organization
To: <helpdesk@jlab.org>
Hello,
I'm a JLab user and my JLab user name is _______.
Could you please add me to the JeffersonLab github organization?
My github username is ______ and account id is ______

Introduction to SWIF

To use the ifarm's batch submission system (online monitoring and documentation here), one option is the Auger batch system manager called "swif" (documented somewhat here). Junhao Chen has written an auto-swif script that may be useful to people in the future here.

  • To use swif you first need access to the ifarm (see above); then you need to create a certificate.
  • To create the certificate, execute
/site/bin/jcert -create
  • To create a workflow on swif run
swif create -workflow WorkFlowName (where WorkFlowName is an identifier you give to it to monitor its progress)
  • To monitor the workflow run
swif status -workflow Name
  • To delete a workflow run
swif cancel -workflow Name
swif cancel -delete -workflow Name
  • To add a job run
swif add-jsub -workflow Name -script jobScript.xml
swif run -workflow Name
  • To create a script .xml file for running jobs see the description of its function and the python wrapper code included in Ciprian's prexSim code (https://github.com/cipriangal/prexSim) or Cameron's updated one to work with new remoll v2.0.0 data structures (jlabSubmit.py and its relatives)
  • A suggested .login file for ifarm use (one that allows for batch job submission) is:
source /site/env/syslogin
source /site/env/sysapps
if ( `hostname` !~ "jlabl"* && `hostname` !~ "adaq"* )  then
source /site/12gev_phys/softenv.csh 2.3 
endif
  • A sample .tcshrc file for using the default ifarm tcsh shell is here:
# ~/.tcshrc: executed by tcsh(1) for non-login shells.
setenv PATH $PATH\:/site/bin 
set savehist = 100000
set histfile = ~/.tcsh_hist
alias root root -l
alias gits git status
alias swif /site/bin/swif
alias swifs swif status -workflow
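
The create / add-jsub / run sequence from this section can be collected in a small Python helper, in the spirit of the wrapper scripts mentioned above. This is only a sketch: the function name swif_commands is hypothetical, and the helper just builds the command lines already shown on this page without running them.

```python
import shlex


def swif_commands(workflow, job_script):
    """Return the swif command lines to create a workflow, add one job, and start it.

    Only the subcommands and flags shown on this page are used; nothing is executed.
    """
    steps = [
        ["swif", "create", "-workflow", workflow],
        ["swif", "add-jsub", "-workflow", workflow, "-script", job_script],
        ["swif", "run", "-workflow", workflow],
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    # Placeholder names taken from the examples in this section.
    for line in swif_commands("WorkFlowName", "jobScript.xml"):
        print(line)
```

Printing the commands first makes it easy to check the workflow name and script path before pasting them into an ifarm terminal.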


Checking the Status and Resubmitting

(Taken from Hall D wiki here)

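1. The status of jobs can be checked on the terminal with the following command (gxprojN is the account name used in the Hall D instructions; substitute your own username):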

jobstat -u gxprojN

For the status of jobs on Auger see http://scicomp.jlab.org/scicomp/#/auger/jobs and for SWIF use

swif list

or for more information,

swif status [workflow] -summary

Note that "swif status" is sometimes out of date, so don't panic if your workflow/jobs aren't showing up right away. Also see the Auger job website.

2. For failed jobs, SWIF can resubmit jobs based on the problem. For resubmission of failed jobs with the same resources,

swif retry-jobs [workflow] -problems [problem name]

can be used, and for jobs to be submitted with more resources, e.g., use

swif modify-jobs [workflow] -ram add 2gb -problems AUGER-OVER_RLIMIT

This only re-stages the jobs; be sure to resubmit them with:

swif run -workflow [workflow] -errorlimit none

You can wait until almost all jobs finish before resubmitting failed jobs since the number should be relatively small. Even if jobs are resubmitted for one type of failure, jobs that later fail with that failure will not be automatically resubmitted.

3. For information on swif, use the "swif help" commands or see the documentation at https://halldsvn.jlab.org/repos/trunk/scripts/monitoring/hdswif/manual_hdswif.pdf

4. Below is a table describing the various errors that can occur.

Each entry below gives the ERROR NAME, followed by its Description and Resolution.

AUGER-SUBMIT
  Description: SWIF’s attempt to submit jobs to Auger failed. Includes server-side problems as well as user failing to provide valid job parameters (e.g. incorrect project name, too many resources, etc.)
  Resolution: If requested resources are known to be correct, resubmit. Otherwise modify job resources using swif directly.

AUGER-FAILED
  Description: Auger reports the job FAILED with no specific details.
  Resolution: Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.

AUGER-OUTPUT-FAIL
  Description: Failure to copy one or more output files. Can be due to permission problem, quota problem, system error, etc.
  Resolution: Check if output files will exist after job execution and that output directory exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.

AUGER-INPUT-FAIL
  Description: Auger failed to copy one or more of the requested input files, similar to output failures. Can also happen if tape file is unavailable (e.g. missing/damaged tape)
  Resolution: Check if input file exists, resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.

AUGER-TIMEOUT
  Description: Job timed out.
  Resolution: If more time is needed for job, add more resources. Default is to add 2 hrs of processing time. Also check whether code is hanging.

AUGER-OVER_RLIMIT
  Description: Not enough resources, RAM or disk space.
  Resolution: Add more resources for job.

SWIF-MISSING-OUTPUT
  Description: Output file specified by user was not found.
  Resolution: Check if output file exists at end of job.

SWIF-USER-NON-ZERO
  Description: User script exited with non-zero status code.
  Resolution: Your script exited with non-zero status. Check the code you are running.

SWIF-SYSTEM-ERROR
  Description: Job failed owing to a problem with swif (e.g. network connection timeout)
  Resolution: Resubmit jobs. If problem persists, contact Chris Larrieu or SciComp.
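
The resolutions above can be condensed into a small lookup that suggests which of the swif commands from this section to run for a given problem name. This is a rough sketch under stated assumptions: the names RETRY_PROBLEMS and resubmit_commands are hypothetical, only the command forms shown on this page are used, and problems whose resolution requires inspecting your own code or files return no command at all.

```python
# Problems for which the resolution above is (possibly after a manual check) a plain resubmission.
# AUGER-SUBMIT is included on the assumption that the requested resources are already correct.
RETRY_PROBLEMS = {
    "AUGER-SUBMIT",
    "AUGER-FAILED",
    "AUGER-OUTPUT-FAIL",
    "AUGER-INPUT-FAIL",
    "SWIF-SYSTEM-ERROR",
}


def resubmit_commands(workflow, problem):
    """Return suggested swif command lines for one problem type.

    An empty list means the problem needs manual attention first
    (for example a timeout or a non-zero exit from the user script).
    """
    if problem in RETRY_PROBLEMS:
        return [f"swif retry-jobs {workflow} -problems {problem}"]
    if problem == "AUGER-OVER_RLIMIT":
        # modify-jobs only re-stages the jobs, so it is followed by "swif run".
        return [
            f"swif modify-jobs {workflow} -ram add 2gb -problems {problem}",
            f"swif run -workflow {workflow} -errorlimit none",
        ]
    return []


if __name__ == "__main__":
    for name in ("AUGER-FAILED", "AUGER-OVER_RLIMIT", "SWIF-USER-NON-ZERO"):
        print(name, resubmit_commands("WorkFlowName", name))
```

Whether a "swif run" is also needed after retry-jobs is not stated here, so the sketch leaves it out for that case.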


Transferring Huge Files

  • Do not scp or copy raw data files (see the instructions directly below)
  • How to copy raw data files from tape
(email from Brad Sawatzky)
Do not use 'jget' -- that makes a copy to an arbitrary file system (kind of like 'cp' on regular files).

You want to use 'jcache get ...' which will have the system pull the file to /cache/hallc/spring17/raw/ as you wish. Run 'jcache -h' to get some help text and/or the online docs here: https://scicomp.jlab.org/docs/%20

So, something like this should work:
jcache get /mss/hallc/spring17/raw/shms_all_02895.dat
The file will show up here as soon as a tape drive is ready to pull the run (can be anywhere from minutes to 24 hours depending on load):
/cache/hallc/spring17/raw/shms_all_02895.dat

Wildcards work too:
jcache get /mss/hallc/spring17/raw/shms_all_028*.dat
(But be careful, don't do 'jcache get /mss/hallc/spring17/raw/*' :-)

You can find the status of your request from: https://scicomp.jlab.org/scicomp/index.html#/cache/request

Example for PREX counting daq data:

ifarm1402.jlab.org> jcache get /mss/halla/happexsp/raw/prexRHRS_20982.dat.0
get request: 21382881
status: pending
/cache/halla/happexsp/raw/prexRHRS_20982.dat.0 -> pending
  • Brad's CH Tips and Tricks [1]
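
In both examples above, the cached copy appears at the same path as the tape file with /mss/ replaced by /cache/. A minimal Python sketch of that mapping follows; the function name cache_path is hypothetical, and the rule is inferred only from the two examples on this page.

```python
def cache_path(mss_path):
    """Map a tape-library path under /mss/ to the matching path under /cache/."""
    prefix = "/mss/"
    if not mss_path.startswith(prefix):
        raise ValueError(f"not a /mss/ path: {mss_path}")
    # Keep everything after the /mss/ prefix unchanged.
    return "/cache/" + mss_path[len(prefix):]


if __name__ == "__main__":
    print(cache_path("/mss/halla/happexsp/raw/prexRHRS_20982.dat.0"))
```

This can be handy in job scripts that need to know where to look for a run once the jcache request completes.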


Use scp

A quick and dirty way to copy files from JLab work disks is something like the following:

 scp -rp yez@ftp.jlab.org:/work/halla/triton/yez/file .

Or a better command, which sometimes works for me but most of the time doesn't (I do, however, recommend rsync for copying between different locations on the same PC or on the ifarm):

 rsync -av yez@ftp.jlab.org:/work/halla/triton/yez/file .

Use Globus

Globus is a very nice tool for transferring a large number of big data files from JLab to any other place at a much faster connection speed (the JLab server allows a maximum speed of 5GBps). It also has an easy web-based interface. However, the instructions on how to use Globus are somewhat misleading, so I have simplified them based on my painful experience of setting up the connections.

Here is the step-by-step How-to:

  • 1) Go to https://www.globus.org/ and sign in by choosing Jefferson Lab or your own institution's account if there is one (I used my Argonne account). I haven't tried a personal account. Please let me know how it works.
  • 2) If your institution already has an endpoint server ready, use it. Here, I chose to create my personal endpoint on an Argonne PC (it can also be on your own laptop if you have enough space, or on an external hard drive).
    *Find a button "add Globus Connect Personal endpoint Endpoint List", 
    *Type in the name of the endpoint (for you to manage later)
    *Create the Setup Key and copy it. 
  • 3) Download the scripts to your local computer (or computer that you want to save the files to)
    https://docs.globus.org/how-to/globus-connect-personal-linux/ 
  • 4) *Ignore* the instructions on the webpage unless you want to install the Globus graphical user interface, which is not really needed. Simply unpack the downloaded archive: globusconnectpersonal-x.x.x/
  • 5) Inside the folder, run the following command:
       ./globusconnect -setup <key>
   where <key> is the Setup Key you generated and copied from the Globus webpage in step 2)
  • 6) Go to the directory ~/.globusonline/lta/, and edit or create a text-file named "config-paths". Inside this file, add lines to specify where you want to let the Globus get access to (so you can save files to or copy files from), e.g:
     /data/yez/Tritium,1,1
     /home/work/marathon,1,1
  where the first "1" means you allow Globus to visit this folder ("0" to turn off). The second "1" means you allow Globus to write to this folder ("0" to set "read-only"). For more details, see this: https://docs.globus.org/faq/globus-connect-endpoints/#how_do_i_configure_accessible_directories_on_globus_connect_personal_for_linux
  • 7) Now you can start the endpoint-server by running:
    ./globusconnect -start &
  To make changes (such as adding or removing paths or changing permissions), stop the server (./globusconnect -stop), edit "config-paths", and restart it.
   a) Connect to the JLab endpoint by searching for "jlab#scifiles" (see https://scicomp.jlab.org/docs/node/11).
   b) Specify the endpoint that you just created (under "Administrated by Me"), and it will show the folders that you allowed in config-paths.
   c) Choose the files or entire folders on the JLab endpoint (e.g., /cache/halla/triton/raw)
   d) Choose where you want to save these files on your personal endpoint PC.
   e) Then click the big blue button between the two endpoints and wait for your files to be transferred.
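
The "config-paths" entries from step 6 follow a simple path,flag,flag pattern, which can be generated with a short Python helper. This is a sketch only: the function name config_paths_line and its parameter names are hypothetical, and the meaning of the two flags follows the description on this page rather than the Globus documentation.

```python
def config_paths_line(path, visible=True, writable=False):
    """Format one config-paths entry as 'path,<visible>,<writable>'.

    Flag meanings follow this page: the first flag lets Globus visit the
    folder, the second lets Globus write to it (0 means read-only).
    """
    return f"{path},{int(visible)},{int(writable)}"


if __name__ == "__main__":
    # Example folders taken from step 6; both are visible and writable.
    for folder in ("/data/yez/Tritium", "/home/work/marathon"):
        print(config_paths_line(folder, visible=True, writable=True))
```

Defaulting writable to False keeps a folder read-only unless write access is requested explicitly.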