Settle on filename standard
We should settle on a filename standard that both MicadoWISE and the MICADO-Pipeline follow.
Implementing the first recipes in C highlighted the following problems with the current filenames created by the MicadoWISE prototypes:
- The names were too long for ESO-compliant FITS headers, so they could not be used properly in ESOREX.
- The names could not easily be generated by the C recipes, resulting in discrepancies between the prototype recipes and the C recipes.
- The names could not be predicted, making it impossible to create e.g. SOF files for the full pipeline in advance.
However, we can resolve all these problems.
Conditions
There are several considerations:
- Filenames should preferably be somewhat understandable for humans. E.g. a MasterDark should preferably have something like MasterDark in the filename (and MICADO).
- Filenames should preferably be unique. That is, two different files should have different names, certainly for the BasicMicadoWISE database/dataserver and for the raw data, where the filename is essentially a primary key.
- Filenames should preferably be predictable. That is, we can generate the SOF files for the entire pipeline from just a directory of files if the filenames of all intermediate processed data are known in advance.
- Filenames should be relatively short. That is, the filename of a dependency should fit in a FITS header that starts with ESO PRO REC1 RAW1 NAME =, which leaves only 55 characters for the name.
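The length condition is easy to check mechanically. Below is a minimal sketch, assuming the 55-character budget stated above; the function name fits_in_header is hypothetical and not part of MicadoWISE or the pipeline.

```python
# Budget taken from the text: characters left for the name in the header.
MAX_NAME_LENGTH = 55


def fits_in_header(filename: str, limit: int = MAX_NAME_LENGTH) -> bool:
    """Return True if the filename fits within the character budget."""
    return len(filename) <= limit


if __name__ == "__main__":
    candidates = [
        "MICADO.2021-12-01T14:17:57.123.fits",
        "MICADO.2021-12-01T14:17:57.123_depersisted.fits",
        "MICADO.2021-12-01T14:17:57.123_depersisted.283f9abc.fits",
    ]
    for name in candidates:
        # Print the length, the verdict, and the name itself.
        print(len(name), fits_in_header(name), name)
```

A check like this could run in the prototypes' test suite, so that a naming scheme that exceeds the budget is caught before it reaches ESOREX.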
Raw
For raw data, we do not really have a choice I believe. It seems that this is entirely dictated by ESO. So for raw data I propose to just use that:
- MICADO.YYYY-MM-DDThh:mm:ss.sss.fits, where YYYY is the year, etc. of the observation, up to the millisecond.
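As an illustration of this pattern, the sketch below builds a raw filename from an observation time and parses it back. The function names are hypothetical, and the observation time is assumed to be a naive datetime (no timezone offset), since an offset would otherwise be appended to the timestamp.

```python
from datetime import datetime

RAW_PATTERN = "MICADO.%Y-%m-%dT%H:%M:%S.%f.fits"


def raw_filename(obs_time: datetime) -> str:
    """Format a naive observation time as MICADO.YYYY-MM-DDThh:mm:ss.sss.fits."""
    # timespec="milliseconds" truncates the fractional seconds to three digits.
    stamp = obs_time.isoformat(timespec="milliseconds")
    return f"MICADO.{stamp}.fits"


def observation_time(filename: str) -> datetime:
    """Recover the observation time from a raw filename."""
    # %f accepts one to six digits, so the three-digit fraction parses fine.
    return datetime.strptime(filename, RAW_PATTERN)


if __name__ == "__main__":
    t = datetime(2021, 12, 1, 14, 17, 57, 123000)
    name = raw_filename(t)
    print(name)
    print(observation_time(name))
```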
Drawbacks:
- Violation of condition 1, readability. It is not possible to see what kind of data it is by just looking at the filename. This forces us to use the DPR keywords and OCA classification rules, and we are just about there in our implementations.
- Violation of conditions 2 and 3, uniqueness and predictability. The way data is currently simulated would not lead to unique and predictable filenames. In particular, the observation date is not used at all. One way to resolve that would be to actually use the observation date as a seed to the random number generator. I'll experiment with that.
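The seeding idea can be sketched as follows. This is only an illustration under assumptions: the standard-library random module stands in for whatever generator the simulator actually uses, and seed_from_observation_date and simulate_noise are hypothetical names.

```python
import random
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def seed_from_observation_date(obs_time: datetime) -> int:
    """Derive an integer seed: whole milliseconds since 1970-01-01."""
    # Integer division of timedeltas avoids floating-point rounding.
    return (obs_time - EPOCH) // timedelta(milliseconds=1)


def simulate_noise(obs_time: datetime, n: int = 4) -> list:
    """Draw n Gaussian samples that depend only on the observation date."""
    rng = random.Random(seed_from_observation_date(obs_time))
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    t = datetime(2021, 12, 1, 14, 17, 57, 123000)
    # The same observation date reproduces the same simulated values.
    print(simulate_noise(t) == simulate_noise(t))
```

With this approach the observation date in the filename determines the simulated content, so rerunning the simulation for the same date reproduces the same file.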
Processed
Requirements 2 (uniqueness) and 3 (predictability) conflict with each other for processed data. In particular, predictable unique filenames require that the full input to a recipe influences the filename of the output product. That is, the contents of the input SOF file and the process parameters need to be reflected in the name of the output FITS file in a predictable (but short) way.
One way to achieve such unique predictability is to include a hash of the input (e.g. a hash of the SOF file and the process parameters) in the filename. However, this would require a canonical representation of this input, which we currently do not have. But we do need uniqueness for the database anyway. So I propose a three-step process.
- Step 1. For the processing (in C and prototypes), we use the suffix method that is currently used in the depersist recipe. That is, we take the filename of the primary raw data product and add the name of the data product to it as a suffix. E.g.
  - MICADO.2021-12-01T14:17:57.123.fits as raw data.
  - MICADO.2021-12-01T14:17:57.123_depersisted.fits as depersisted data.
  - MICADO.2021-12-01T14:17:57.123_detrended.fits as detrended data.
  - etc.
  The drawback of this is that we do not fully uniquely determine our filenames. However, at this point in our development this should be fine, because uniqueness only matters for experimentation so far.
- Step 2. For the database/dataserver we do need a unique name. For this I will add a hash to the filename. That way we can go from the filename in step 1 to this filename and back. E.g.
  - MICADO.2021-12-01T14:17:57.123_depersisted.283f9abc.fits (already 56 characters!)
  We can do this in two sub-steps:
  - Step 2a. I'll use a hash of the content (that is, a hash of the output of the recipe), similar to what MicadoWISE does now.
  - Step 2b. I'll experiment with using a hash of the data lineage (that is, a hash of the input of the recipe). (Or maybe both.)
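The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MicadoWISE or C-recipe implementation: all function names are hypothetical, SHA-256 truncated to 8 hex characters is an arbitrary choice that matches the length of the 283f9abc example, and the canonical form used for the lineage hash (sorted SOF lines followed by sorted key=value parameters) is invented here, since no canonical representation exists yet.

```python
import hashlib

SUFFIX = ".fits"
HASH_LENGTH = 8  # hex characters, same length as the 283f9abc example


def _stem(name: str) -> str:
    """Strip the .fits extension, rejecting anything else."""
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a FITS filename: {name}")
    return name[: -len(SUFFIX)]


def processed_name(raw_name: str, product: str) -> str:
    """Step 1: append the data product name to the raw filename."""
    return f"{_stem(raw_name)}_{product}{SUFFIX}"


def content_hash(data: bytes) -> str:
    """Step 2a: short hash of the recipe output."""
    return hashlib.sha256(data).hexdigest()[:HASH_LENGTH]


def lineage_hash(sof_lines, parameters) -> str:
    """Step 2b: short hash of the recipe input.

    Assumed canonical form: sorted SOF lines, then sorted key=value
    parameters, joined with newlines.
    """
    canonical = sorted(line.strip() for line in sof_lines)
    canonical += [f"{key}={parameters[key]}" for key in sorted(parameters)]
    text = "\n".join(canonical)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:HASH_LENGTH]


def add_hash(name: str, digest: str) -> str:
    """Go from the step 1 filename to the step 2 filename."""
    return f"{_stem(name)}.{digest}{SUFFIX}"


def strip_hash(name: str) -> str:
    """Go from the step 2 filename back to the step 1 filename."""
    stem, _, digest = _stem(name).rpartition(".")
    is_hex = all(c in "0123456789abcdef" for c in digest)
    if len(digest) != HASH_LENGTH or not is_hex:
        raise ValueError(f"no hash found in: {name}")
    return stem + SUFFIX


if __name__ == "__main__":
    step1 = processed_name("MICADO.2021-12-01T14:17:57.123.fits", "depersisted")
    step2 = add_hash(step1, content_hash(b"placeholder for recipe output"))
    print(step1)
    print(step2, len(step2))
    print(strip_hash(step2) == step1)
```

Note that a content hash can only be computed after the recipe has run, so it gives uniqueness but not predictability; the lineage hash is predictable only once a canonical representation of the input is agreed on. Truncating to 8 hex characters keeps the name short at the cost of a higher collision probability.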
Based on the experience from steps 1 and 2, we will settle on a filename scheme that hopefully will
- meet all 4 criteria above,
- be easily implementable in the C-recipes,
- be easily implementable in MicadoWISE.