GPU Architecture and Acceleration
==================================

This guide explains how NLSQ leverages GPU hardware for massive speedups
and when GPU acceleration is most beneficial.

Why GPUs for Curve Fitting?
---------------------------

GPUs have thousands of cores optimized for parallel computation:

.. list-table::
   :header-rows: 1

   * - Hardware
     - Cores
     - Best For
   * - CPU (typical)
     - 4-16
     - Sequential, complex logic
   * - GPU (NVIDIA V100)
     - 5120
     - Parallel, simple operations
   * - GPU (NVIDIA A100)
     - 6912
     - Parallel, simple operations at larger scale

Curve fitting benefits because:

1. **Data parallelism**: Same operation on many points
2. **Matrix operations**: Jacobian computation, SVD
3. **Batch processing**: Multiple function evaluations

Where Speedups Come From
------------------------

**Residual computation**:

.. code-block:: text

   CPU: Process each point sequentially
   for i in range(1_000_000):
       r[i] = y[i] - f(x[i])

   GPU: Process all points in parallel
   r = y - f(x)  # All 1M points at once
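
The vectorized form can be sketched in JAX as follows. The exponential model
and its parameter values are illustrative assumptions, not part of NLSQ. The
point is that one array expression replaces the per-point loop, and
``jax.jit`` compiles it for whichever device JAX is using.

.. code-block:: python

   import jax
   import jax.numpy as jnp


   def model(x, a, b):
       # Hypothetical exponential decay, used only for illustration
       return a * jnp.exp(-b * x)


   @jax.jit
   def residuals(params, x, y):
       a, b = params
       # One array expression covers every data point
       return y - model(x, a, b)


   x = jnp.linspace(0.0, 4.0, 1000)
   y = model(x, 2.0, 0.5)  # noise-free synthetic data
   r = residuals(jnp.array([2.0, 0.5]), x, y)
   print(r.shape)  # (1000,)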

**Jacobian computation**:

.. code-block:: text

   CPU (SciPy default): Finite differences (2m function evaluations for m parameters)
   GPU (NLSQ): Automatic differentiation through JAX, compiled and evaluated in parallel
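
As a rough illustration of the difference, the sketch below computes the same
Jacobian with automatic differentiation and with central finite differences.
The model is a made-up exponential decay. ``jax.jacfwd`` is chosen here
because a curve-fitting Jacobian is tall (many points, few parameters); that
is a choice for this sketch and not necessarily what NLSQ does internally.

.. code-block:: python

   import jax
   import jax.numpy as jnp

   jax.config.update("jax_enable_x64", True)  # float64 for a fair comparison


   def model(x, params):
       a, b = params
       return a * jnp.exp(-b * x)  # hypothetical model


   x = jnp.linspace(0.0, 4.0, 50)
   p = jnp.array([2.0, 0.5])

   # Automatic differentiation: Jacobian of shape (n_points, n_params)
   J_ad = jax.jacfwd(lambda q: model(x, q))(p)


   def jac_fd(q, h=1e-6):
       # Central differences: two model evaluations per parameter
       cols = []
       for k in range(q.size):
           step = jnp.zeros_like(q).at[k].set(h)
           cols.append((model(x, q + step) - model(x, q - step)) / (2 * h))
       return jnp.stack(cols, axis=1)


   J_fd = jac_fd(p)
   print(J_ad.shape)  # (50, 2)
   print(float(jnp.max(jnp.abs(J_ad - J_fd))))  # small, limited by the step size h

The finite-difference version needs a step size and loses accuracy to
round-off, while the AD result is exact up to floating-point arithmetic.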

**Matrix operations**:

.. code-block:: text

   CPU: Sequential SVD
   GPU: Parallel SVD with cuBLAS/cuSOLVER

Performance Scaling
-------------------

Expected speedups by dataset size:

.. list-table::
   :header-rows: 1
   :widths: 25 25 25 25

   * - Dataset Size
     - SciPy (CPU)
     - NLSQ (V100)
     - Speedup
   * - 1,000
     - 0.05s
     - 0.43s (JIT)
     - 0.1x (JIT cost)
   * - 10,000
     - 0.18s
     - 0.04s
     - 4.5x
   * - 100,000
     - 2.1s
     - 0.09s
     - 23x
   * - 1,000,000
     - 40.5s
     - 0.15s
     - 270x
   * - 10,000,000
     - ~7 min
     - 1.5s
     - ~280x

Key observations:

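1. **JIT compilation overhead**: The first call includes compilation, so a
   one-off fit on 1,000 points is slower than SciPy
2. **Crossover point**: GPU wins at ~5,000 points
3. **Scaling advantage**: GPU speedup grows with size and levels off near
   270-280x beyond 1,000,000 points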
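
The speedup column is simply the ratio of the two timing columns. A short
check, with the timings copied from the table above and "~7 min" taken as
420 seconds (an assumption):

.. code-block:: python

   # (dataset size, SciPy seconds, NLSQ seconds) from the table above
   timings = [
       (1_000, 0.05, 0.43),
       (10_000, 0.18, 0.04),
       (100_000, 2.1, 0.09),
       (1_000_000, 40.5, 0.15),
       (10_000_000, 7 * 60, 1.5),  # "~7 min" taken as 420 s
   ]

   speedups = {n: cpu / gpu for n, cpu, gpu in timings}
   for n, s in speedups.items():
       print(f"{n:>10,d} points: {s:6.1f}x")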

Memory Considerations
---------------------

GPU memory is limited (16-80 GB typical). NLSQ handles this with:

**Automatic chunking**:

.. code-block:: python

   from nlsq import curve_fit_large

   # Automatically chunks if data exceeds GPU memory
   popt, pcov = curve_fit_large(model, x, y, memory_limit_gb=8.0)  # Match your GPU
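
To judge whether a dataset is likely to need chunking, a back-of-the-envelope
estimate of the dense Jacobian size can help. The helpers below are
hypothetical and assume float64 values. Real memory use also includes the
data arrays and intermediate buffers, and NLSQ's own chunking heuristic may
differ.

.. code-block:: python

   import math


   def jacobian_gb(n_points, n_params, bytes_per_value=8):
       # Dense Jacobian: one float64 entry per (point, parameter) pair
       return n_points * n_params * bytes_per_value / 1e9


   def chunks_needed(n_points, n_params, memory_limit_gb):
       # Hypothetical helper: equal chunks that keep each Jacobian under the limit
       return max(1, math.ceil(jacobian_gb(n_points, n_params) / memory_limit_gb))


   print(jacobian_gb(10_000_000, 5))            # 0.4
   print(chunks_needed(1_000_000_000, 5, 8.0))  # 5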

**Streaming optimization**:

.. code-block:: python

   from nlsq import AdaptiveHybridStreamingOptimizer, HybridStreamingConfig

   # Process data in chunks with bounded memory
   config = HybridStreamingConfig(chunk_size=50000)
   optimizer = AdaptiveHybridStreamingOptimizer(config)
   result = optimizer.fit((x, y), model, p0=p0)

Multi-GPU Usage
---------------

For systems with multiple GPUs:

.. code-block:: python

   import os

   # Select GPU 0 only. Set this before importing JAX: the variable is
   # read when the CUDA backend initializes and ignored afterwards.
   os.environ["CUDA_VISIBLE_DEVICES"] = "0"

   import jax

   # See all devices visible to JAX
   devices = jax.devices()
   print(f"Available devices: {devices}")

   # NLSQ uses the default device
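
Within a single process, plain JAX can also place an array on a chosen device
with ``jax.device_put``. The sketch below uses only JAX calls and runs on a
CPU-only machine as well. Whether NLSQ keeps its inputs on the device they
were committed to is an assumption to verify for your version.

.. code-block:: python

   import jax
   import jax.numpy as jnp

   devices = jax.devices()  # all devices of the default backend
   target = devices[-1]     # e.g. the last GPU; the only device on a CPU-only machine

   x = jax.device_put(jnp.arange(8.0), target)  # commit the array to that device
   y = jnp.sum(x**2)                            # computation follows the committed input
   print(target, float(y))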

Data Transfer Overhead
----------------------

Moving data between CPU and GPU has a cost:

.. code-block:: text

   CPU RAM ←──[PCIe bus]──→ GPU VRAM
                ~12 GB/s

For small datasets, this overhead can dominate. That's why:

- Small data (<1K points): Use CPU
- Large data (>10K points): Use GPU

NLSQ minimizes transfers by:

1. Keeping data on GPU throughout optimization
2. Only transferring results at the end
3. Using JAX's asynchronous dispatch, so the host does not block between GPU operations
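
To put the PCIe figure above in perspective, the sketch below estimates the
time to move ``x`` and ``y`` once at ~12 GB/s. It assumes float64 arrays and
models bandwidth only. The fixed per-transfer latency and kernel-launch
costs, which matter most for small arrays, are not included.

.. code-block:: python

   def transfer_ms(n_points, n_arrays=2, bytes_per_value=8, bandwidth_gb_s=12.0):
       # x and y as float64 arrays, moved once over the PCIe link
       n_bytes = n_points * n_arrays * bytes_per_value
       return n_bytes / (bandwidth_gb_s * 1e9) * 1e3


   for n in (1_000, 1_000_000, 100_000_000):
       print(f"{n:>11,d} points: {transfer_ms(n):8.3f} ms")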

When GPU Doesn't Help
---------------------

GPU may not be beneficial when:

1. **Small datasets** (<1K points): JIT overhead dominates
2. **Simple models**: Not enough computation to parallelize
3. **Single fit**: JIT compilation cost not amortized
4. **Memory-limited**: Data larger than GPU memory (use streaming)

CPU may be preferred when:

1. Testing and development
2. Laptops without discrete GPU
3. Need maximum numerical precision
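
When a GPU is present but the CPU is the better choice, JAX can be told to
place new arrays on the CPU for a block of code. This sketch uses only JAX's
own ``jax.default_device`` context manager and assumes that NLSQ follows
JAX's default device. The fit call itself is omitted.

.. code-block:: python

   import jax
   import jax.numpy as jnp

   cpu = jax.devices("cpu")[0]  # the CPU backend is always available

   with jax.default_device(cpu):
       # Arrays created here live on the CPU, even if a GPU is present
       x = jnp.linspace(0.0, 1.0, 500)
       y = 3.0 * x + 1.0

   print(x.devices())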

Optimizing GPU Performance
--------------------------

**1. Warm up JIT**

.. code-block:: python

   from nlsq import curve_fit

   # Compile on small data first
   _ = curve_fit(model, x[:100], y[:100], p0=p0)

   # Then run on full data (uses cached compilation)
   popt, pcov = curve_fit(model, x, y, p0=p0)
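
Plain ``jax.jit`` caches compiled code per input shape and dtype, so a
warm-up call pays off most when later calls reuse the same shapes. The sketch
below shows that behaviour with a made-up function, ``sum_sq``. Whether NLSQ
pads or buckets array sizes internally to widen cache reuse is not covered
here.

.. code-block:: python

   import jax
   import jax.numpy as jnp

   traced_shapes = []


   @jax.jit
   def sum_sq(r):
       # Python side effect: runs only while JAX traces the function
       traced_shapes.append(r.shape)
       return jnp.sum(r**2)


   sum_sq(jnp.ones(100))   # first call: trace and compile
   sum_sq(jnp.ones(100))   # same shape: served from the cache
   sum_sq(jnp.ones(1000))  # new shape: traced and compiled again
   print(traced_shapes)    # [(100,), (1000,)]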

**2. Use CurveFit class for repeated fits**

.. code-block:: python

   from nlsq import CurveFit

   fitter = CurveFit()  # Create once; compilation happens on the first fit

   for dataset in datasets:
       popt, pcov = fitter.curve_fit(model, dataset.x, dataset.y)
       # All calls after first are fast

**3. Batch similar models**

.. code-block:: python

   # If fitting many similar datasets, batch them
   import jax

   batched_fit = jax.vmap(single_fit, in_axes=(0, 0))
   results = batched_fit(x_batch, y_batch)
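
A self-contained version of the same pattern is sketched below.
``single_fit`` here is a hypothetical closed-form straight-line fit standing
in for a real per-dataset fit. Whether NLSQ's own ``curve_fit`` can be
wrapped in ``jax.vmap`` is not assumed.

.. code-block:: python

   import jax
   import jax.numpy as jnp


   def single_fit(x, y):
       # Closed-form least-squares line: returns [slope, intercept]
       xm, ym = jnp.mean(x), jnp.mean(y)
       slope = jnp.sum((x - xm) * (y - ym)) / jnp.sum((x - xm) ** 2)
       return jnp.stack([slope, ym - slope * xm])


   x_batch = jnp.tile(jnp.linspace(0.0, 1.0, 50), (3, 1))  # shape (3, 50)
   true_slopes = jnp.array([1.0, 2.0, 3.0])[:, None]
   y_batch = true_slopes * x_batch + 0.5                   # three noise-free lines

   batched_fit = jax.vmap(single_fit, in_axes=(0, 0))
   results = batched_fit(x_batch, y_batch)                 # shape (3, 2)
   print(results)

Each row of ``results`` holds the slope and intercept for one dataset, all
computed in a single vectorized call.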

**4. Profile to find bottlenecks**

.. code-block:: python

   import jax
   from nlsq import curve_fit

   # Record a trace with the JAX profiler
   with jax.profiler.trace("/tmp/jax-trace"):
       popt, pcov = curve_fit(model, x, y)

   # View with TensorBoard: tensorboard --logdir /tmp/jax-trace

Hardware Requirements
---------------------

Minimum:

- NVIDIA GPU with CUDA Compute Capability 5.0+
- 4 GB VRAM
- CUDA 11.x or 12.x

Recommended:

- NVIDIA GPU with Tensor Cores (V100, A100, RTX series)
- 16+ GB VRAM
- CUDA 12.x
- cuDNN 8.x

Cloud Options:

- Google Colab (free T4 GPU)
- AWS EC2 (p3, p4 instances)
- GCP Compute Engine (A100, T4)

Summary
-------

GPU acceleration in NLSQ provides:

1. **Massive parallelism**: Thousands of cores for data operations
2. **Automatic optimization**: JAX handles GPU placement
3. **Scaling advantage**: Speedup grows with dataset size
4. **Memory management**: Automatic chunking and streaming

Best for:

- Large datasets (>10K points)
- Repeated fits (amortize JIT)
- Production pipelines

See Also
--------

- :doc:`jax_autodiff` - How JAX enables GPU
- :doc:`/tutorials/routine/gpu_acceleration/gpu_usage` - GPU setup tutorial
- :doc:`/howto/optimize_performance` - Performance guide