
CRDS Affected Datasets Query Script Cheat Sheet

Overview

These notes cover setting up and using the client script that downloads the IDs of datasets affected by a CRDS context transition. The B6 delivery contains stored results for only the last two transitions; the other transitions predate the availability of stable dataset parameters and CRDS rules. Even the current transitions demonstrate the complexity of reconciling CRDS rules with dataset parameter values so that the system runs error free.

The CRDS client "reference script" for accessing the affected datasets web service is named:

query_affected_datasets

It handles nuances of the reprocessing system and web service and is based on a subclassable script which can be customized. An example customized script is:

custom_query_affected_datasets
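
A customization along these lines might look like the following minimal sketch. The real base class is not reproduced on this page, so QueryAffectedDatasetsScriptStandIn below is a hypothetical stand-in that only mimics the hook names mentioned in the built-in help (use_affected, use_all_ids). The actual class location, constructor, and method signatures are assumptions and should be checked against the installed CRDS package. The dataset IDs are taken from the B6 transcripts on this page.

```python
class QueryAffectedDatasetsScriptStandIn:
    """Hypothetical stand-in for the real QAD base class (signatures assumed)."""

    def use_affected(self, context_switch, ids):
        # Default behavior per the help text: print the IDs for one switch.
        for dataset_id in ids:
            print(dataset_id)

    def use_all_ids(self, ids):
        # This stand-in does nothing extra with the combined list.
        pass

    def process(self, results):
        """Drive the hooks from {context_switch: [ids]} (illustration only)."""
        all_ids = []
        for context_switch, ids in results.items():
            self.use_affected(context_switch, ids)
            all_ids.extend(ids)
        self.use_all_ids(all_ids)


class CollectingQuery(QueryAffectedDatasetsScriptStandIn):
    """Example customization: collect IDs per switch instead of printing."""

    def __init__(self):
        self.by_switch = {}
        self.combined = []

    def use_affected(self, context_switch, ids):
        self.by_switch[context_switch] = list(ids)

    def use_all_ids(self, ids):
        self.combined = sorted(set(ids))


script = CollectingQuery()
script.process({
    "jwst_0175.pmap --> jwst_0183.pmap": [
        "jw80600016001_02101_00001.mirifulong",
    ],
    "jwst_0183.pmap --> jwst_0191.pmap": [
        "jw80600016001_02101_00001.mirifulong",
        "jw80600009001_02101_00001.mirimage",
    ],
})
```

Keeping the per-switch and combined hooks separate lets a pipeline choose between reacting to each transition and acting once on the merged list.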

Since querying the service is an incremental process, the client script maintains a record of the last context processed in the config area of a CRDS cache. It is a normal CRDS script, so by default it updates the CRDS cache, which is not correct in the highly concurrent pipeline environment. In theory, pipeline accounts are configured with CRDS_CACHE_READONLY=1 to prevent cache updates, but that variable does not control the last processed status, so that one piece of state will still update as required. It is an error to run more than one copy of query_affected_datasets concurrently.
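
Because concurrent runs are an error, a wrapper can guard the call with an exclusive lock file. The following is a minimal sketch and not part of CRDS: the lock path and the name single_instance are hypothetical, and it assumes a POSIX system where fcntl.flock is available.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def single_instance(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path for the with-block.

    Raises BlockingIOError if the lock is already held elsewhere.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


# Hypothetical lock location; any path writable by the pipeline account works.
lock_file = os.path.join(tempfile.gettempdir(), "query_affected_datasets.lock")

with single_instance(lock_file):
    # A real wrapper would run query_affected_datasets here.
    try:
        with single_instance(lock_file):
            second_run_allowed = True
    except BlockingIOError:
        second_run_allowed = False
```

The nested attempt only demonstrates that a second holder is refused while the first lock is held.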

I&T Initialization

Since cache updates are incremental, a cache baseline has to be established when the pipeline setup is integrated. The baseline client query environment is initialized like this:

% setenv CRDS_PATH  (some path; it does not have to hold references and could be the pipeline's normal cache)

% setenv CRDS_SERVER_URL https://jwst-crds-b6it.stsci.edu   (must be connected to CRDS server)

The B6 CRDS server is initialized with only two reprocessing entries, for context switches 175 -> 183 and 183 -> 191. Reprocessing results are driven by evolving CRDS rules and dataset parameter sets, so only the most recent transitions are fruitful for demonstration; the other transitions all predate the availability of corresponding historical dataset parameters.

% query_affected_datasets -x 1 -y 41 --ignore-missing-results -q
CRDS  : INFO     Fetching effects for (1, '2013-04-13 00:00:00', 'jwst_0001.pmap', 'Linearity and dark files.')
CRDS  : INFO     No results for jwst_0000.pmap --> jwst_0001.pmap ignoring and proceeding.
...  log info about the first N unsupported contexts

CRDS  : INFO     Fetching effects for (40, '2016-05-05 09:44:40', 'jwst_0183.pmap', 'Delivery of new MIRI MRS distortion, regions, specwcs, v2v3 and wavelengthrange reference files derived from CDP5 data and converted to the right format by SSB. Delivery of new NIRSpec camera, collimator, disperser, fore, fpa, ifufore, ifupost, ifuslicer, msa, ote, and wavelengthrange reference files derived from CDP4.  Delivery of FGS pixel area maps and photometric reference files. This delivery corrects the headers of the pixel area files.  Delivery of NIRISS pixel area and photom reference files. The pixel area only corrects header keywords. Changed READPATT=ANY to READPATT=N/A for FGS SUPERBIAS.  Delivery FGS rmaps for photom and pixel area reference files to remove old files that had a different useafter date and were not replaced with the last delivery when they should have.  Delivery of NIRISS rmaps for pixel area map and photom to remove old file that had different useafter and that were not removed with the new delivery.  Changes to NIRSPEC CAMERA, COLLIMATOR, DISPERSER, FORE, FPA, IFUFORE, IFUPOST, IFUSLICER, MSA, OTE, WAVELENGTHRANGE changing handling of various EXP_TYPE.')

... suppressed log info about this crds.bestrefs run due to -q parameter ....

CRDS  : ERROR    CRDS server-side errors for 40 2016-05-21-11:36:42_0175_0183


... dataset ids on stdout for transition from contexts 175 to 183

jw80600016001_02101_00001.mirifulong
jw93135334001_02107_00001.mirifushort
jw93135334001_02108_00001.mirifulong

As of the build-6 delivery, CRDS has ~43 history entries (42 transitions) covering all the historical changes in default context. The -x and -y parameters specify the range of history used to initialize the CRDS cache. The init step therefore only outputs log and dataset ID info for the second-to-last transition, the switch from jwst_0175.pmap to jwst_0183.pmap.
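
The relationship between history indices and transitions can be sketched as follows. This is an inference from the transcripts on this page (for example, the -x 1 -y 41 run ends with the entry at index 40), not a description of the script's internals. The stopping index is assumed to be exclusive, and the entry at index i is assumed to describe the switch from the context at index i-1. The function name and the shortened history are illustrative; index 39 is inferred from the 175 -> 183 transition.

```python
def transitions(history, start, stop):
    """Return (old_context, new_context) pairs for indices start..stop-1.

    history maps a history index to a context name. Entry i is assumed to
    record the switch from history[i - 1] to history[i].
    """
    pairs = []
    for index in range(start, stop):
        if index - 1 in history and index in history:
            pairs.append((history[index - 1], history[index]))
    return pairs


# Shortened history built from the B6 examples on this page.
history = {
    39: "jwst_0175.pmap",  # inferred predecessor of index 40
    40: "jwst_0183.pmap",
    41: "jwst_0191.pmap",
}

for old, new in transitions(history, 40, 42):
    print(old, "-->", new)
```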

After basic initialization, an initially empty CRDS cache contains a minimum of config files. It does not necessarily have to contain rules or references, or be the same as the pipeline's CRDS cache, although that should work.

% find $HOME/crds_cache_b6it/
/Users/jmiller/crds_cache_b6it/
/Users/jmiller/crds_cache_b6it//config
/Users/jmiller/crds_cache_b6it//config/jwst
/Users/jmiller/crds_cache_b6it//config/jwst/ad_last_processed
/Users/jmiller/crds_cache_b6it//config/jwst/bad_files.txt
/Users/jmiller/crds_cache_b6it//config/jwst/server_config

The contents of the ad_last_processed file:

(41, '2016-05-05 09:44:40', 'jwst_0183.pmap', 'Delivery of new MIRI MRS distortion, regions, specwcs, v2v3 and wavelengthrange reference files derived from CDP5 data and converted to the right format by SSB. Delivery of new NIRSpec camera, collimator, disperser, fore, fpa, ifufore, ifupost, ifuslicer, msa, ote, and wavelengthrange reference files derived from CDP4.  Delivery of FGS pixel area maps and photometric reference files. This delivery corrects the headers of the pixel area files.  Delivery of NIRISS pixel area and photom reference files. The pixel area only corrects header keywords. Changed READPATT=ANY to READPATT=N/A for FGS SUPERBIAS.  Delivery FGS rmaps for photom and pixel area reference files to remove old files that had a different useafter date and were not replaced with the last delivery when they should have.  Delivery of NIRISS rmaps for pixel area map and photom to remove old file that had different useafter and that were not removed with the new delivery.  Changes to NIRSPEC CAMERA, COLLIMATOR, DISPERSER, FORE, FPA, IFUFORE, IFUPOST, IFUSLICER, MSA, OTE, WAVELENGTHRANGE changing handling of various EXP_TYPE.')

This tuple is the record of the last context for which reprocessing was queried.
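
Because the file holds a single Python tuple literal, as shown above, its fields can be read back with ast.literal_eval. This is a minimal sketch that assumes the file always has this four-field shape. The function name is illustrative and not part of CRDS, and the description string is abbreviated in the demonstration.

```python
import ast
import os
import tempfile


def read_last_processed(path):
    """Parse (index, date, context, description) from ad_last_processed."""
    with open(path) as handle:
        index, date, context, description = ast.literal_eval(
            handle.read().strip())
    return {"index": index, "date": date, "context": context,
            "description": description}


# Demonstration with a temporary file; the description is abbreviated here.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "ad_last_processed")
    with open(sample, "w") as handle:
        handle.write("(41, '2016-05-05 09:44:40', 'jwst_0183.pmap', "
                     "'Delivery of new MIRI MRS ...')\n")
    last = read_last_processed(sample)
```

ast.literal_eval only evaluates literals, so a malformed or unexpected file raises an exception instead of executing code.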

Real Life Pipeline Initialization

Initializing the query system in real life presumes that past transitions and dataset lists are "old news". The procedure for initializing the query state to act as if the last transition has already been seen is this:

% query_affected_datasets --reset

Given that initialization, after the next default context is selected with Set Context on the CRDS server, the next call to query_affected_datasets should output information about that transition (which was still in the future at the time of --reset).

Routine Queries

Once a CRDS cache has been initialized for the query script, establishing a baseline state, the script can be called periodically without specifying the starting context; the starting context is assumed to be the last one processed.

As a demo of routine operation, query_affected_datasets is run again without the -x and -y parameters defining which transitions to produce results for. Without -x and -y, query_affected_datasets nominally produces output for the last context transition, the transition between the entry in ad_last_processed and the current default context.

% query_affected_datasets -q

CRDS  : INFO     Fetching effects for (41, '2016-05-18 09:40:19', 'jwst_0191.pmap', 'Delivered MIRI CDP5 files and corrections to several MIRI RMAPs to correct issues with selection of the reference files. Also, updated NIRSpec IMAP.')
CRDS  : ERROR    CRDS server-side errors for 41 2016-05-21-11:40:11_0183_0191

jw80600009001_02101_00001.mirimage
jw80600016001_02101_00001.mirifulong
jw80600018001_02101_00001.mirifulong
....

The -q switch suppresses most of the bestrefs output about errors. query_affected_datasets can be used to replay arbitrary stored results if further error investigation is desired.
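
For automation, the separation of streams (dataset IDs on stdout, log output on stderr) makes the script easy to wrap. The sketch below is one possible wrapper using only the standard library. run_query is a hypothetical helper, and the demonstration substitutes a small Python child process for the real command so that it runs without a CRDS server.

```python
import os
import subprocess
import sys


def run_query(command, extra_env=None):
    """Run command and return (dataset_ids, log_text, exit_status).

    Dataset IDs are the non-blank stdout lines; log output is stderr.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    done = subprocess.run(command, env=env, capture_output=True, text=True)
    ids = [line.strip() for line in done.stdout.splitlines() if line.strip()]
    return ids, done.stderr, done.returncode


# Stand-in child process that imitates the two output streams.
fake_command = [
    sys.executable, "-c",
    "import sys; "
    "print('jw80600009001_02101_00001.mirimage'); "
    "print('CRDS  : INFO     0 errors', file=sys.stderr)",
]
ids, log_text, status = run_query(
    fake_command,
    extra_env={"CRDS_SERVER_URL": "https://jwst-crds-b6it.stsci.edu"},
)
# Real use would pass something like ["query_affected_datasets", "-q"].
```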

It should be noted that the current errors produced by the CRDS affected datasets system are a result of a complex interplay between CRDS rules and archive dataset parameters received from the archive web service. Those errors will be beaten down during the normal evolution of the JWST software. Handling them is actually part of the normal functional envelope of this system so other than review and commentary the errors should not prevent testing of the CRDS affected datasets system in general.

A further point to make is that the CRDS affected datasets *system* is used to produce and permanently store results related to official context transitions only. The crds.bestrefs tool can compute context-to-context differences for arbitrary transitions, but the client/server automatic system only produces results for a simple linear progression of contexts.

Help

The query_affected_datasets script has built-in help accessible with --help:

% query_affected_datasets --help

usage: /Users/jmiller/anaconda/bin/query_affected_datasets [-h] [-l]
                                                           [-x INT_OR_NAME]
                                                           [-y INT_OR_NAME]
                                                           [-s] [-i] [-k] [-z]
                                                           [-f LAST_PROCESSED_FILE]
                                                           [-q] [-r] [-v]
                                                           [--verbosity VERBOSITY]
                                                           [-R] [-I] [-V] [-J]
                                                           [-H] [--stats]
                                                           [--profile PROFILE]
                                                           [--log-time]
                                                           [--pdb]
                                                           [--debug-traps]

query_affected_datasets (QAD) queries the CRDS server for datasets affected by the specified 
CONTEXT(s), history INDICES, or DATES.  QAD relies on pre-computed results on the server,
so QAD queries are only valid for historical context transitions from the context history.
When no history interval is specified,  QAD uses the last processed context as the starting
point,  and the end of the current history as the stopping point.

Before going too far, a couple of points are in order:

    1. QueryAffectedDatasetsScript is intended to be sub-classed.   By default it just prints dataset
    IDs on STDOUT as well as log and affected datasets computation output to STDERR.   Override the
    use_affected() and use_all_ids() methods to do something custom with the affected dataset IDs,
    either switch-by-switch, or potentially multiple contexts at once, respectively.   Override the
    log_affected() or log_all_ids() methods to customize most logging.   QAD provides basic interaction
    with the context history, affected datasets service, recording state, and error handling.
    
    2. crds.bestrefs can be used to compute the datasets affected by arbitrary context transitions
    when run locally.   The computation can require minutes to hours depending on number of instruments,
    types, datasets, and tables potentially affected.    For this, the --affected-datasets switch 
    configures crds.bestrefs to perform an affected datasets computation using the bundle of standard
    options run on the CRDS servers.
    
Due to processing order on the server, new contexts appear in the context history as operational 
before the datasets affected have been computed.   This framework helps resolve that race condition
and provides options for handling affected datasets computations which contained errors.

To support interactive experimentation, QAD supports listing the context history:

    % query_affected_datasets --list
    (0, '2013-07-02 15:44:53', 'hst.pmap', 'set by system')
    (1, '2013-09-10 18:23:06', 'hst_0003.pmap', 'Updated hst.pmap with new references (known to reffile_ops_rep on harpo) up to 09-10-2013')
    ...
    (85, '2014-09-23 16:48:17', 'hst_0287.pmap', 'Delivery of new WFC3 darks.')
    (86, '2014-10-13 11:26:40', 'hst_0288.pmap', 'Delivery of a new ACS WFC1-1K bias.')
    (87, '2014-10-14 09:34:29', 'hst_0289.pmap', 'Delivery of a new COS FUV BPIXTAB.')

See also the -x and -y parameters below for customizing interactive query ranges.
    
With no history range specified, QAD selects the last history item processed as the starting point 
and the end of the current history as the stopping point.  Normally there's nothing new to report,
the last thing processed was the end of the history and nothing has changed on the server.

    % query_affected_datasets
    CRDS  : INFO     No new results available.
    CRDS  : INFO     0 errors
    CRDS  : INFO     0 warnings
    CRDS  : INFO     1 infos

Following a context change,  by default QAD will notice a difference between the last saved context 
and the new last context in the history.  QAD is designed to be sub-classed but by default prints log 
information and recorded affected datasets output to STDERR.   It prints affected dataset IDs to STDOUT.
Run periodically,  QAD will typically see at most a single context switch.

    % query_affected_datasets > ids
    CRDS  : INFO     Fetching effects for (96, '2014-11-25 15:48:40', 'hst_0300.pmap', 'Delivery of a new COS HVTAB for association LCIX02080.')
    ####################################################################################################
    --------------------------------------------------------------------------------------------------------------
    CRDS hst ops datasets affected hst_0299.pmap --> hst_0300.pmap on 2014-11-25-15:50:09
    --------------------------------------------------------------------------------------------------------------
    CRDS  : INFO     [2014-11-25 15:50:12,563]  Mapping differences from 'hst_0299.pmap' --> 'hst_0300.pmap' affect:
     {'cos': ['hvtab']}
    CRDS  : INFO     [2014-11-25 15:50:12,731]  Possibly affected --datasets-since dates determined by 'hst_0299.pmap' --> 'hst_0300.pmap' are:
     {'cos': '2009-05-11 00:00:00'}
    CRDS  : INFO     [2014-11-25 15:50:12,731]  Computing bestrefs for db datasets for ['cos']
    CRDS  : INFO     [2014-11-25 15:50:12,731]  Dumping dataset parameters for 'cos' from CRDS server at 'https://hst-crds.stsci.edu' since '2009-05-11 00:00:00'
    CRDS  : INFO     [2014-11-25 15:50:16,457]  Downloaded  19592 dataset ids for 'cos' since '2009-05-11 00:00:00'
    CRDS  : INFO     [2014-11-25 15:51:16,293]  Updated exposure counts:
     {'COS': {'hvtab': 13696}}
    CRDS  : INFO     [2014-11-25 15:51:16,309]  Affected products = 9494
    CRDS  : INFO     [2014-11-25 15:51:16,319]  Unique error types: 0
    CRDS  : INFO     [2014-11-25 15:51:16,319]  STARTED 2014-11-25 15:50:10.67
    CRDS  : INFO     [2014-11-25 15:51:16,319]  STOPPED 2014-11-25 15:51:16.31
    CRDS  : INFO     [2014-11-25 15:51:16,319]  ELAPSED 0:01:05.64
    CRDS  : INFO     [2014-11-25 15:51:16,319]  19.59 K datasets at 298.47  datasets-per-second
    CRDS  : INFO     [2014-11-25 15:51:16,319]  0 errors
    CRDS  : INFO     [2014-11-25 15:51:16,319]  0 warnings
    CRDS  : INFO     [2014-11-25 15:51:16,319]  12 infos
    --------------------------------------------------------------------------------------------------------------
    OK: CRDS hst ops datasets affected hst_0299.pmap --> hst_0300.pmap on 2014-11-25-15:50:09 : 9494 affected
    --------------------------------------------------------------------------------------------------------------
    ####################################################################################################
    CRDS  : INFO     Contributing context switches = 1
    CRDS  : INFO     Total products affected = 9494
    CRDS  : INFO     0 errors
    CRDS  : INFO     0 warnings
    CRDS  : INFO     3 infos
    
    % cat ids 
    i9zf01010
    i9zf02010
    i9zf03010
    i9zf04010
    i9zf05010
    ...

STDERR output from multiple context switches is delimited by ########## lines.

For the sake of custom queries,  QAD supports --starting-context (-x) and --stopping-context (-y) parameters to define the 
history starting and stopping points.  -x and -y can be specified as CONTEXTS, HISTORY INDICES, or DATES.

    % query_affected_datasets -x 92 -y 94 > ids
    CRDS  : INFO     Fetching effects for (92, '2014-11-05 13:53:15', 'hst_0296.pmap', 'Delivery of new ACS bias, dark, and cte_corrected dark reference files.')
    CRDS  : INFO     Fetching effects for (93, '2014-11-10 13:11:02', 'hst_0297.pmap', 'The WFC3 Team delivered 3 new dark reference files.')
    ####################################################################################################
    --------------------------------------------------------------------------------------------------------------
    CRDS hst ops datasets affected hst_0295.pmap --> hst_0296.pmap on 2014-11-05-13:55:08
    --------------------------------------------------------------------------------------------------------------
    CRDS  : INFO     [2014-11-05 13:55:12,922]  Mapping differences from 'hst_0295.pmap' --> 'hst_0296.pmap' affect:
     {'acs': ['biasfile', 'drkcfile', 'darkfile']}
    CRDS  : INFO     [2014-11-05 13:55:13,576]  Possibly affected --datasets-since dates determined by 'hst_0295.pmap' --> 'hst_0296.pmap' are:
     {'acs': '2014-08-26 09:55:53'}
    CRDS  : INFO     [2014-11-05 13:55:13,576]  Computing bestrefs for db datasets for ['acs']
    CRDS  : INFO     [2014-11-05 13:55:13,576]  Dumping dataset parameters for 'acs' from CRDS server at 'https://hst-crds.stsci.edu' since '2014-08-26 09:55:53'
    CRDS  : INFO     [2014-11-05 13:55:39,323]  Downloaded  1431 dataset ids for 'acs' since '2014-08-26 09:55:53'
    CRDS  : INFO     [2014-11-05 13:55:52,999]  Updated exposure counts:
     {'ACS': {'biasfile': 1374, 'darkfile': 1386, 'drkcfile': 1386}}
    CRDS  : INFO     [2014-11-05 13:55:53,001]  Affected products = 670
    CRDS  : INFO     [2014-11-05 13:55:53,001]  Unique error types: 0
    CRDS  : INFO     [2014-11-05 13:55:53,001]  STARTED 2014-11-05 13:55:10.62
    CRDS  : INFO     [2014-11-05 13:55:53,002]  STOPPED 2014-11-05 13:55:53.00
    CRDS  : INFO     [2014-11-05 13:55:53,002]  ELAPSED 0:00:42.37
    CRDS  : INFO     [2014-11-05 13:55:53,002]  1.43 K datasets at 33.77  datasets-per-second
    CRDS  : INFO     [2014-11-05 13:55:53,002]  0 errors
    CRDS  : INFO     [2014-11-05 13:55:53,002]  0 warnings
    CRDS  : INFO     [2014-11-05 13:55:53,002]  12 infos
    --------------------------------------------------------------------------------------------------------------
    OK: CRDS hst ops datasets affected hst_0295.pmap --> hst_0296.pmap on 2014-11-05-13:55:08 : 670 affected
    --------------------------------------------------------------------------------------------------------------
    ####################################################################################################
    --------------------------------------------------------------------------------------------------------------
    CRDS hst ops datasets affected hst_0296.pmap --> hst_0297.pmap on 2014-11-10-13:15:07
    --------------------------------------------------------------------------------------------------------------
    CRDS  : INFO     [2014-11-10 13:15:10,393]  Mapping differences from 'hst_0296.pmap' --> 'hst_0297.pmap' affect:
     {'wfc3': ['darkfile']}
    CRDS  : INFO     [2014-11-10 13:15:10,574]  Possibly affected --datasets-since dates determined by 'hst_0296.pmap' --> 'hst_0297.pmap' are:
     {'wfc3': '2014-10-27 00:30:40'}
    CRDS  : INFO     [2014-11-10 13:15:10,574]  Computing bestrefs for db datasets for ['wfc3']
    CRDS  : INFO     [2014-11-10 13:15:10,574]  Dumping dataset parameters for 'wfc3' from CRDS server at 'https://hst-crds.stsci.edu' since '2014-10-27 00:30:40'
    CRDS  : INFO     [2014-11-10 13:15:33,868]  Downloaded  714 dataset ids for 'wfc3' since '2014-10-27 00:30:40'
    CRDS  : INFO     [2014-11-10 13:15:43,290]  Updated exposure counts:
     {'WFC3': {'darkfile': 369}}
    CRDS  : INFO     [2014-11-10 13:15:43,291]  Affected products = 292
    CRDS  : INFO     [2014-11-10 13:15:43,291]  Unique error types: 0
    CRDS  : INFO     [2014-11-10 13:15:43,291]  STARTED 2014-11-10 13:15:08.58
    CRDS  : INFO     [2014-11-10 13:15:43,291]  STOPPED 2014-11-10 13:15:43.29
    CRDS  : INFO     [2014-11-10 13:15:43,291]  ELAPSED 0:00:34.70
    CRDS  : INFO     [2014-11-10 13:15:43,291]  714 datasets at 20.57  datasets-per-second
    CRDS  : INFO     [2014-11-10 13:15:43,292]  0 errors
    CRDS  : INFO     [2014-11-10 13:15:43,292]  0 warnings
    CRDS  : INFO     [2014-11-10 13:15:43,292]  12 infos
    --------------------------------------------------------------------------------------------------------------
    OK: CRDS hst ops datasets affected hst_0296.pmap --> hst_0297.pmap on 2014-11-10-13:15:07 : 292 affected
    --------------------------------------------------------------------------------------------------------------
    ####################################################################################################
    CRDS  : INFO     Contributing context switches = 2
    CRDS  : INFO     Total products affected = 962
    CRDS  : INFO     0 errors
    CRDS  : INFO     0 warnings
    CRDS  : INFO     4 infos

QAD is designed to be run periodically,  say every 30 minutes,  to check with the CRDS server for 
context updates.  If multiple context switches occur during one polling interval,  by default QAD
includes IDs from all of them.   This is also true of interactive queries using -x and/or -y,  so
it's possible to combine affected datasets from multiple switches when repeated IDs are expected.
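
When IDs from several switches are combined, the same dataset can appear more than once. Whether QAD itself removes repeats is not stated here, so a consumer that wants each ID once can de-duplicate while preserving first-seen order. The following is a small sketch; the function name is illustrative, and the sample IDs are taken from the JWST transcripts earlier on this page.

```python
def unique_ids(lines):
    """Return dataset IDs in first-seen order, dropping blanks and repeats."""
    seen = set()
    ordered = []
    for line in lines:
        dataset_id = line.strip()
        if dataset_id and dataset_id not in seen:
            seen.add(dataset_id)
            ordered.append(dataset_id)
    return ordered


combined = [
    "jw80600016001_02101_00001.mirifulong",  # from the 175 -> 183 switch
    "jw80600009001_02101_00001.mirimage",    # from the 183 -> 191 switch
    "jw80600016001_02101_00001.mirifulong",  # repeated in 183 -> 191
]
deduplicated = unique_ids(combined)
```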

QAD also supports a --single-context-switch (-s) mode for printing results 1-by-1 in the event of 
multiple context switches in one polling interval.

Since there is a race condition between when a context is made operational and when affected datasets results
are available on the CRDS server,  QAD also supports a -i switch for ignoring unavailable results.  Otherwise,
missing results are considered an error,  the main difference being the exit status.

When initializing,  specify -i to ignore any missing computations since QAD will attempt to process the 
entire history the first time it is run and precomputed results don't exist for all historical context
switches.

It's possible for precomputed results to contain bestrefs errors of some sort,  most likely due to invalid
bestrefs selection parameters in the HST DADSOPS catalog.   By default the datasets from a computation which
contained errors are excluded from the overall results.   Use -k to include the dataset IDs from computations 
which included errors.

Conversely, to abort processing when an affected datasets computation included errors,  use -z to fail
and quit.
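
The three error-handling modes described above can be summarized as a small decision function. This is a model of the documented behavior only, not code from CRDS. The function name and its return values are hypothetical, and giving -z precedence when both switches are set is an assumption.

```python
def handle_computation(had_errors, include_errant=False, fail_on_errant=False):
    """Model how the IDs from one stored computation are treated.

    include_errant mirrors -k and fail_on_errant mirrors -z.
    Returns "include", "exclude", or "abort".
    """
    if not had_errors:
        return "include"
    if fail_on_errant:
        return "abort"      # assumed to take precedence over -k
    if include_errant:
        return "include"
    return "exclude"        # documented default for errant computations


for flags in ({}, {"include_errant": True}, {"fail_on_errant": True}):
    print(flags, "->", handle_computation(True, **flags))
```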

The --quiet (-q) parameter suppresses the recorded log output from the affected datasets computations:

    % query_affected_datasets -x 94 -y 97 -q > ids
    CRDS  : INFO     Fetching effects for (94, '2014-11-18 17:15:35', 'hst_0298.pmap', 'Delivery of new ACS DKC, DRK, and BIA files.')
    CRDS  : INFO     Fetching effects for (95, '2014-11-20 16:12:34', 'hst_0299.pmap', 'Delivery of new WFC3 UVIS darks.')
    CRDS  : INFO     Fetching effects for (96, '2014-11-25 15:48:40', 'hst_0300.pmap', 'Delivery of a new COS HVTAB for association LCIX02080.')
    CRDS  : INFO     Contributing context switches = 3
    CRDS  : INFO     Total products affected = 10356
    CRDS  : INFO     0 errors
    CRDS  : INFO     0 warnings
    CRDS  : INFO     5 infos
    
NOTE:  CRDS logging is used in both query_affected_datasets and the original server-side affected datasets computations.  The
final errors count shown above only applies to the client-side computing in query_affected_datasets,  so server-side errors are
not *counted*.   However,  server-side errors are tracked and reduced to a single client-side error for each server-side 
bestrefs run with errors.
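
A client that wants its own tally of log levels can count them from the captured STDERR text. The sketch assumes every CRDS log line has the "CRDS  : LEVEL    message" shape seen in the transcripts on this page. The function name is illustrative, and the description field in the sample is abbreviated.

```python
import collections
import re

# Matches lines such as "CRDS  : ERROR    ..." and captures the level name.
LOG_LINE = re.compile(r"^\s*CRDS\s*:\s*([A-Z]+)\b")


def count_levels(log_text):
    """Count CRDS log lines by level in captured stderr text."""
    counts = collections.Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines from the routine query transcript; description abbreviated.
sample_log = (
    "CRDS  : INFO     Fetching effects for (41, '2016-05-18 09:40:19', "
    "'jwst_0191.pmap', '...')\n"
    "CRDS  : ERROR    CRDS server-side errors for 41 "
    "2016-05-21-11:40:11_0183_0191\n"
)
levels = count_levels(sample_log)
```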

The --verbose parameter includes debug output in excess of normal application logging,  possibly useful
for debugging subclasses of the QueryAffectedDatasetsScript skeletal framework.

optional arguments:
  -h, --help            show this help message and exit
  -l, --list-history    Print out the context history and exit.
  -x INT_OR_NAME, --starting-context INT_OR_NAME
                        Use the affected datasets computation starting with this history index integer, date, or context name. Defaults to last processed.
  -y INT_OR_NAME, --stopping-context INT_OR_NAME
                        Use the affected datasets computation stopping with this history index integer, date, or context name. Defaults to end of history.
  -s, --single-context-switch
                        For default indexing, if multiple new contexts are available,  just process one new context and stop.
  -i, --ignore-missing-results
                        Skip over any requested context switch which has no pre-computed results on the CRDS server.  Otherwise fatal.
  -k, --ignore-errant-history
                        If bestrefs status indicates errors occurred, issue an error message but include the dataset ids in results.
  -z, --fail-on-errant-history
                        If bestrefs status indicates errors occurred, quit processing.  (fix the server and rerun).
  -f LAST_PROCESSED_FILE, --last-processed-file LAST_PROCESSED_FILE
                        File containing the tuple of the last history item successfully processed. Defaults to file in CRDS cache.
  -q, --quiet           Terser log output.
  -r, --reset           Reset the last-context-processed file to the end of the current history.  Useful for init and reinit.
  -v, --verbose         Set log verbosity to True,  nominal debug level.
  --verbosity VERBOSITY
                        Set log verbosity to a specific level: 0..100.
  -R, --readonly-cache  Don't modify the CRDS cache.  Not compatible with options which implicitly modify the cache.
  -I, --ignore-cache    Download required files even if they're already in the cache.
  -V, --version         Print the software version and exit.
  -J, --jwst            Force observatory to JWST for determining header conventions.
  -H, --hst             Force observatory to HST for determining header conventions.
  --stats               Track and print timing statistics.
  --profile PROFILE     Output profile stats to the specified file.
  --log-time            Add date/time to log messages.
  --pdb                 Run under pdb.
  --debug-traps         Bypass exception error message traps and re-raise exception.


Last modified on 05/25/16 14:17:37