Imsi Tracking
=============

``imsi`` has been built to facilitate the setup, configuration, and running of
complex physics-based models on different HPC platforms. For these problems,
there are countless degrees of freedom involved:

1. model code version
2. physical parameters/settings
3. compiler configuration (e.g. compiler used, optimization settings)
4. technical parameters/settings (e.g. MPI layout)
5. sequencing configuration
6. machine-specific settings
7. etc.

All of these settings can make reproducibility hard to ensure, as it is easy
to lose track of which settings have been activated. As such, a tracking
toolkit has been added to ``imsi`` to help with this. Specifically, it tracks:

1. What ``imsi`` commands were executed to set up and manipulate the run
2. What config files are actually used by the simulation and how they have
   changed throughout it
3. What version of the source code has been used for the simulation and
   whether any changes occurred during it

What is tracked?
----------------

At a high level, there are three distinct items that need to be considered to
reproduce a run:

1. what version of the source repo was used?
2. what version of the config files was used?
3. what machine was the simulation run on?

Item 3 is implicitly tracked in the config files. As such, the majority of the
``imsi`` tracking system is devoted to tracking the status/version of the
source repo under ``src/`` and the config files under ``config/``. It is
important to note that, to facilitate easy tracking of the files under
``config/``, ``imsi`` initializes it as a local ``git`` repo.

For each of these directories, ``imsi`` tracks:

1. the commit hash
2. the status of the repo
3. any ``diffs`` found in the repos

The details are stored under ``.imsi/states`` within the run's setup
directory.
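As an illustration of what these three items correspond to, the following
minimal sketch captures them by hand with plain ``git`` commands. It is not
the ``imsi`` implementation: the ``snapshot_repo`` helper and the simplified
output file names are hypothetical and chosen only for this example.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of imsi): record the commit hash, status,
   # and diff of one git repo into an output directory.
   snapshot_repo() {
       local repo_dir="$1"   # repo to inspect, e.g. src/ or config/
       local out_dir="$2"    # directory that receives the three files
       local label="$3"      # prefix for the file names, e.g. "src"

       mkdir -p "$out_dir"
       # 1. the commit hash that is currently checked out
       git -C "$repo_dir" rev-parse HEAD > "$out_dir/${label}_rev.txt"
       # 2. the status of the repo (modified and untracked files)
       git -C "$repo_dir" status --porcelain > "$out_dir/${label}_status.txt"
       # 3. uncommitted changes to tracked files, relative to that commit
       git -C "$repo_dir" diff HEAD > "$out_dir/${label}_diff.diff"
   }

For example, ``snapshot_repo src/ snapshots/ src`` would write
``src_rev.txt``, ``src_status.txt``, and ``src_diff.diff`` into a
``snapshots/`` directory. An empty diff file together with an empty status
file indicates a clean checkout of the recorded commit.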
The ``imsi`` ``states`` directory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To track the uniqueness of the ``config`` and ``src`` directories, ``imsi``
relies on ``md5sum`` to checksum the contents and produce one unique hash that
represents the status of these directories. These unique hashes are what users
will find under

.. code-block:: bash

   .imsi/states

Under each hash directory, you can find:

.. code-block:: bash

   src_*_rev.txt
   src_*_status.txt
   src_*_diff.diff
   config_*_rev.txt
   config_*_status.txt
   config_*_diff.diff

where (for each relevant repo):

* ``*_rev.txt`` files contain the current git commit hashes
* ``*_status.txt`` files contain information on which files have changes
* ``*_diff.diff`` files contain the actual ``git diff`` output

When does tracking occur
------------------------

By default, ``imsi`` only logs the above information for certain ``cli``
commands, specifically:

* ``imsi config``
* ``imsi reload``
* ``imsi override``
* ``imsi set``
* ``imsi build``
* ``imsi submit``
* ``imsi save-restarts``

In addition to tracking the ``config/`` and ``src/`` repos, ``imsi`` also
stores a cli command log at ``.imsi-cli.log`` in the setup directory.

.. note::

   Due to the implementation of ``imsi ensemble``, if the above commands are
   executed using ``imsi ensemble``, ``imsi`` will still log the necessary
   information for each member of the ensemble.

How to add tracking points
^^^^^^^^^^^^^^^^^^^^^^^^^^

While the log points mentioned above provide a good default state-logging
framework, users might wish to have explicit state-logging at other points
throughout their job scripts. For example, some groups might wish to
explicitly track the state of things right before the model launches, to
ensure that no local user changes go unnoticed. To do this, users can
instrument their scripting with

.. code-block:: bash

   imsi log-state -m "USEFUL LOGGING MESSAGE" -p /path/to/runid/setup/directory

This will then make ``imsi`` track the state of the various directories at
that exact point.

What to do with tracking artifacts?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you are on an HPC system where you can keep runs on disk for a *long* time,
simply relying on the various directory structures might be enough for you.
*However*, in most cases users will need to clean up simulations after they
are completed, and so the necessary reproducibility information might be lost.
As such, if you have access to an archiving system, it is recommended that
users set up a job to dump

* the local ``config/`` directory and
* the local ``.imsi/states`` directory

to whatever archive system their machines have access to. With this, users
should be able to determine all the necessary details to *potentially* re-run
past simulations.
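The archiving step described above could be sketched as follows. This is only
an illustration under stated assumptions: the ``archive_run_state`` helper,
the tarball naming, and the staging directory are hypothetical, and the sketch
assumes the setup directory's name is a suitable label for the run. The final
transfer to your site's archive system is left out because it is machine
specific.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helper (not part of imsi): bundle the reproducibility
   # artifacts of one run into a single compressed tarball.
   archive_run_state() {
       local setup_dir="$1"   # the run's setup directory
       local dest_dir="$2"    # assumed staging location for the tarball
       local run_label
       # Assumption: the setup directory name identifies the run.
       run_label="$(basename "$setup_dir")"

       mkdir -p "$dest_dir"
       # -C stores paths relative to the setup directory, so the tarball
       # unpacks as config/ and .imsi/states/
       tar -czf "$dest_dir/${run_label}_tracking.tar.gz" \
           -C "$setup_dir" config .imsi/states
   }

Calling ``archive_run_state /path/to/runid/setup/directory /path/to/staging``
from a post-processing job would then leave one file to hand over to the
archive system. Because ``config/`` is a local ``git`` repo, archiving the
whole directory also keeps its ``.git`` history.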