workflow="hpc" - HPC Cluster Optimization
=========================================

The ``hpc`` workflow is designed for long-running optimization jobs on High
Performance Computing (HPC) clusters. It wraps ``auto_global`` with automatic
checkpointing for fault tolerance and crash recovery.

When to Use
-----------

Use ``hpc`` workflow when:

- Running on HPC clusters (PBS, SLURM, etc.)
- Jobs may take hours or days to complete
- You need crash recovery via checkpoints
- Running on shared/preemptible resources

.. important::

   ``hpc`` **requires bounds** (same as ``auto_global``).

Basic Usage
-----------

.. code-block:: python

   from nlsq import fit
   import jax.numpy as jnp


   def model(x, a, b, c):
       return a * jnp.exp(-b * x) + c


   # HPC workflow with checkpointing
   popt, pcov = fit(
       model,
       xdata,
       ydata,
       p0=[1.0, 0.5, 0.0],
       workflow="hpc",
       bounds=([0, 0, -1], [10, 5, 1]),
       checkpoint_dir="/scratch/my_job/checkpoints",
   )

Checkpointing
-------------

Checkpoints are saved periodically during optimization:

.. code-block:: python

   popt, pcov = fit(
       model,
       x,
       y,
       p0=[...],
       workflow="hpc",
       bounds=bounds,
       checkpoint_dir="/scratch/checkpoints",
       checkpoint_interval=10,
   )  # Save every 10 iterations

**Checkpoint contents:**

- Current best parameters
- Optimization state
- Iteration number
- All explored starting points

**Automatic recovery:**

If a job crashes and restarts, NLSQ automatically detects existing checkpoints
and resumes from the last saved state.

Cluster Detection
-----------------

NLSQ automatically detects HPC environments:

**PBS/Torque:**

.. code-block:: bash

   # Detected via $PBS_NODEFILE
   export PBS_NODEFILE=/var/spool/pbs/aux/12345.node1

**SLURM:**

.. code-block:: bash

   # Detected via SLURM environment variables
   export SLURM_JOB_ID=12345
   export SLURM_NNODES=4

**Multi-GPU:**

.. code-block:: bash

   # Detected via JAX device count
   python -c "import jax; print(jax.device_count())"

HPC Job Script Example
----------------------

**PBS script:**

.. code-block:: bash

   #!/bin/bash
   #PBS -N nlsq_fit
   #PBS -l nodes=1:ppn=8:gpus=2
   #PBS -l walltime=24:00:00
   #PBS -q gpu

   cd $PBS_O_WORKDIR
   source activate nlsq_env

   python fit_job.py

**SLURM script:**

.. code-block:: bash

   #!/bin/bash
   #SBATCH --job-name=nlsq_fit
   #SBATCH --nodes=1
   #SBATCH --ntasks-per-node=1
   #SBATCH --gres=gpu:2
   #SBATCH --time=24:00:00

   module load cuda
   source activate nlsq_env

   python fit_job.py

**fit_job.py:**

.. code-block:: python

   from nlsq import fit
   import jax.numpy as jnp
   import numpy as np


   def model(x, a, b, c):
       return a * jnp.exp(-b * x) + c


   # Load your data
   data = np.load("/data/experiment.npz")
   x, y = data["x"], data["y"]

   # Run HPC optimization
   popt, pcov = fit(
       model,
       x,
       y,
       p0=[1, 0.5, 0],
       workflow="hpc",
       bounds=([0, 0, -1], [10, 5, 1]),
       checkpoint_dir="/scratch/$SLURM_JOB_ID/checkpoints",
       n_starts=50,
   )

   # Save results
   np.savez("/results/fit_result.npz", popt=popt, pcov=pcov)
   print(f"Fitted: {popt}")

Multi-GPU Configuration
-----------------------

For jobs with multiple GPUs:

.. code-block:: python

   popt, pcov = fit(
       model,
       x,
       y,
       p0=[...],
       workflow="hpc",
       bounds=bounds,
       n_starts=100,  # More starts for multi-GPU
       checkpoint_dir="/scratch/ckpts",
   )

NLSQ automatically distributes starting points across available GPUs.

Best Practices for HPC
----------------------

**1. Use scratch storage for checkpoints:**

.. code-block:: python

   # Good: fast local storage
   checkpoint_dir = "/scratch/user/job_123/ckpts"

   # Bad: network filesystem
   checkpoint_dir = "/home/user/checkpoints"

**2. Request appropriate walltime:**

Estimate based on:
- Dataset size
- Number of starts
- Complexity of model

**3. Handle preemption:**

For preemptible queues, use frequent checkpoints:

.. code-block:: python

   popt, pcov = fit(
       model, x, y, p0=[...], workflow="hpc", bounds=bounds, checkpoint_interval=5
   )  # More frequent saves

**4. Clean up checkpoints:**

After successful completion:

.. code-block:: python

   import shutil

   if fit_succeeded:
       shutil.rmtree(checkpoint_dir)

Complete HPC Example
--------------------

.. code-block:: python

   #!/usr/bin/env python
   """HPC curve fitting job with checkpointing."""

   import os
   import numpy as np
   import jax.numpy as jnp
   from nlsq import fit


   # Model definition
   def complex_model(x, a, b, c, d, e):
       return a * jnp.exp(-b * x) * jnp.cos(c * x + d) + e


   def main():
       # Setup paths
       job_id = os.environ.get("SLURM_JOB_ID", os.environ.get("PBS_JOBID", "local"))
       checkpoint_dir = f"/scratch/{job_id}/checkpoints"
       os.makedirs(checkpoint_dir, exist_ok=True)

       # Load data
       data = np.load("experiment_data.npz")
       x, y, sigma = data["x"], data["y"], data["sigma"]

       # Define bounds
       bounds = (
           [0, 0, 0, -np.pi, -10],  # Lower bounds
           [100, 10, 20, np.pi, 10],  # Upper bounds
       )

       # Run HPC fit
       print(f"Starting HPC fit with job ID: {job_id}")
       popt, pcov = fit(
           complex_model,
           x,
           y,
           p0=[10, 1, 5, 0, 0],
           sigma=sigma,
           workflow="hpc",
           bounds=bounds,
           n_starts=100,
           checkpoint_dir=checkpoint_dir,
           checkpoint_interval=10,
       )

       # Save results
       perr = np.sqrt(np.diag(pcov))
       np.savez("fit_results.npz", popt=popt, pcov=pcov, perr=perr)

       # Print summary
       names = ["a", "b", "c", "d", "e"]
       print("\nFit Results:")
       for name, val, err in zip(names, popt, perr):
           print(f"  {name} = {val:.4f} +/- {err:.4f}")


   if __name__ == "__main__":
       main()

Comparison: auto_global vs hpc
------------------------------

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - Feature
     - ``auto_global``
     - ``hpc``
   * - Checkpointing
     - No
     - Yes
   * - Crash recovery
     - No
     - Yes
   * - Cluster detection
     - No
     - Yes
   * - Overhead
     - Lower
     - Slightly higher
   * - Best for
     - Interactive use
     - Batch jobs

Troubleshooting HPC
-------------------

**Job times out before completion:**

- Increase walltime
- Reduce ``n_starts``
- Enable checkpointing for resume

**Checkpoint corruption:**

- Use atomic writes (NLSQ does this automatically)
- Check disk space on scratch

**Multi-GPU not detected:**

.. code-block:: python

   import jax

   print(f"Devices: {jax.devices()}")
   print(f"Device count: {jax.device_count()}")

**Memory errors on GPU:**

- Reduce batch size via ``memory_limit_gb``
- Use streaming for very large datasets

Next Steps
----------

- :doc:`../gpu_acceleration/multi_gpu` - Multi-GPU configuration
- :doc:`../troubleshooting/common_issues` - General troubleshooting
- :doc:`/reference/configuration` - Configuration reference