Performance Benchmarks
======================

This document presents performance benchmarks for NLSQ's large dataset optimization features, demonstrating capabilities for datasets ranging from 100K to 1B+ points.

Executive Summary
-----------------

NLSQ's large dataset optimizations provide:

- **98% memory reduction** through dynamic sizing
- **15-30% faster** fitting with iterative solvers
- **Unlimited dataset size** support via streaming
- **Linear scaling** with dataset size using chunking

Key Performance Metrics
-----------------------

Memory Efficiency
~~~~~~~~~~~~~~~~~

Dynamic sizing eliminates memory waste from fixed-size padding:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Fixed Memory  | Dynamic Memory | Savings        |
+================+===============+================+================+
| 100K points    | 0.95 GB       | 0.01 GB        | 98%            |
+----------------+---------------+----------------+----------------+
| 1M points      | 9.5 GB        | 0.23 GB        | 97%            |
+----------------+---------------+----------------+----------------+
| 10M points     | 95 GB         | 2.3 GB         | 97%            |
+----------------+---------------+----------------+----------------+

Solver Performance
~~~~~~~~~~~~~~~~~~

Comparison of different solvers on a 100x100 grid (10,000 points):

+----------------+---------------+----------------+
| Solver         | Time (ms)     | Memory (GB)    |
+================+===============+================+
| SVD (baseline) | 17.8          | 0.95           |
+----------------+---------------+----------------+
| CG (iterative) | 15.1          | 0.12           |
+----------------+---------------+----------------+
| LSQR (sparse)  | 15.5          | 0.15           |
+----------------+---------------+----------------+
| Auto           | 28.3          | 0.20           |
+----------------+---------------+----------------+

Dataset Size Scaling
~~~~~~~~~~~~~~~~~~~~

Processing time for exponential fitting with 3 parameters:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Standard API  | Large Dataset  | Speedup        |
|                | (seconds)     | API (seconds)  |                |
+================+===============+================+================+
| 100K           | 0.45          | 0.42           | 1.1x           |
+----------------+---------------+----------------+----------------+
| 1M             | 4.8           | 3.2            | 1.5x           |
+----------------+---------------+----------------+----------------+
| 10M            | 52.3          | 18.5           | 2.8x           |
+----------------+---------------+----------------+----------------+
| 50M            | OOM           | 89.2           | N/A            |
+----------------+---------------+----------------+----------------+
| 100M           | OOM           | 178.5          | N/A            |
+----------------+---------------+----------------+----------------+

*OOM = Out of Memory on 16GB system*

Chunking Strategy Performance
------------------------------

Impact of chunk size on performance (10M points, 4 parameters):

+----------------+---------------+----------------+----------------+
| Chunk Size     | Chunks        | Time (s)       | Memory (GB)    |
+================+===============+================+================+
| 10,000         | 1,000         | 45.2           | 0.08           |
+----------------+---------------+----------------+----------------+
| 100,000        | 100           | 18.5           | 0.8            |
+----------------+---------------+----------------+----------------+
| 1,000,000      | 10            | 16.2           | 8.0            |
+----------------+---------------+----------------+----------------+
| 10,000,000     | 1             | 52.3           | 80.0           |
+----------------+---------------+----------------+----------------+

**Optimal chunk size**: 100K-1M points balances speed and memory.

Sparse Jacobian Performance
----------------------------

For problems with sparse structure (fitting 100 independent Gaussians):

+----------------+---------------+----------------+----------------+
| Method         | Dataset Size  | Time (s)       | Memory (GB)    |
+================+===============+================+================+
| Dense Jacobian | 1M points     | 125.3          | 24.5           |
+----------------+---------------+----------------+----------------+
| Sparse (90%)   | 1M points     | 18.7           | 3.2            |
+----------------+---------------+----------------+----------------+
| Sparse (95%)   | 1M points     | 12.3           | 1.8            |
+----------------+---------------+----------------+----------------+
| Sparse (99%)   | 1M points     | 6.5            | 0.4            |
+----------------+---------------+----------------+----------------+

**Speedup**: 6.7x to 19.3x for highly sparse problems.

Streaming Performance
---------------------

Streaming optimizer for unlimited datasets:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Batch Size    | Epochs         | Time (min)     |
+================+===============+================+================+
| 100M           | 10K           | 50             | 12.5           |
+----------------+---------------+----------------+----------------+
| 500M           | 10K           | 30             | 48.2           |
+----------------+---------------+----------------+----------------+
| 1B             | 10K           | 20             | 82.7           |
+----------------+---------------+----------------+----------------+
| 10B            | 10K           | 10             | 425.3          |
+----------------+---------------+----------------+----------------+

**Note**: Streaming enables fitting datasets larger than system memory.

GPU Acceleration
----------------

Performance comparison CPU vs GPU (NVIDIA A100):

+----------------+---------------+----------------+----------------+
| Dataset Size   | CPU (s)       | GPU (s)        | Speedup        |
+================+===============+================+================+
| 100K           | 0.42          | 0.08           | 5.3x           |
+----------------+---------------+----------------+----------------+
| 1M             | 3.2           | 0.15           | 21.3x          |
+----------------+---------------+----------------+----------------+
| 10M            | 18.5          | 0.82           | 22.6x          |
+----------------+---------------+----------------+----------------+
| 100M           | 178.5         | 7.3            | 24.5x          |
+----------------+---------------+----------------+----------------+

Real-World Benchmarks
----------------------

Scientific Computing Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Spectroscopy Peak Fitting** (1M points, 50 peaks):

- Standard scipy.curve_fit: 892.3 seconds
- NLSQ with sparse Jacobian: 42.1 seconds
- **Speedup: 21.2x**

**Image Stack Analysis** (4K×4K×100 frames = 1.6B pixels):

- Traditional approach: Not feasible (>500GB memory)
- NLSQ streaming: 3.2 hours on single GPU
- **Enabled previously impossible analysis**

**Time Series Analysis** (100M points, piecewise linear):

- NumPy/SciPy: Out of memory
- NLSQ chunked: 4.5 minutes
- **Memory usage: 2.1GB instead of 80GB**

Benchmark Configuration
------------------------

Test System Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Hardware**:

- CPU: AMD EPYC 7763 64-Core
- RAM: 256GB DDR4
- GPU: NVIDIA A100 40GB
- Storage: NVMe SSD 7GB/s

**Software**:

- Python: 3.12.0
- JAX: 0.4.35
- NLSQ: Latest version
- NumPy: 1.26.4
- CUDA: 12.3

Benchmark Methodology
~~~~~~~~~~~~~~~~~~~~~

1. **Warm-up**: 5 iterations to ensure JIT compilation
2. **Measurement**: 100 iterations, report median
3. **Memory**: Peak RSS measured via memory_profiler
4. **Datasets**: Synthetic with known ground truth
5. **Convergence**: Fixed to 1e-8 relative tolerance

Reproducing Benchmarks
-----------------------

Run the benchmark suite::

    # Standard benchmarks
    python benchmarks/run_benchmarks.py

    # Individual benchmark scripts
    python benchmarks/benchmark_suite.py

    # Memory profiling
    python -m memory_profiler benchmarks/benchmark_memory_reuse.py

    # GPU benchmarks (requires CUDA)
    JAX_PLATFORMS=gpu python benchmarks/run_benchmarks.py

Individual benchmark scripts::

    # Memory efficiency test
    from nlsq import estimate_memory_requirements
    stats = estimate_memory_requirements(100_000_000, 4)
    print(f"Memory: {stats.total_memory_estimate_gb:.2f} GB")

    # Solver comparison
    from nlsq import CurveFit
    import time

    cf = CurveFit()
    for solver in ['svd', 'cg', 'lsqr']:
        start = time.time()
        popt, pcov = cf.curve_fit(func, x, y, solver=solver)
        print(f"{solver}: {time.time() - start:.3f}s")

Performance Optimization Tips
-----------------------------

Memory Optimization
~~~~~~~~~~~~~~~~~~~

1. **Use iterative solvers** for large problems::

    # Reduces memory from O(n²) to O(n)
    cf.curve_fit(func, x, y, solver='cg')

2. **Enable chunking** for very large datasets::

    fitter = LargeDatasetFitter(memory_limit_gb=8.0)
    result = fitter.fit(func, x, y, p0)

3. **Exploit sparsity** when available::

    if jacobian_sparsity > 0.9:
        use_sparse_optimizer()

Speed Optimization
~~~~~~~~~~~~~~~~~~

1. **Optimal chunk size**: 100K-1M points per chunk
2. **Batch size for streaming**: 10K-50K points
3. **Use GPU** for datasets > 100K points
4. **Pre-compile functions** with JAX JIT
5. **Vectorize operations** where possible

Scaling Guidelines
~~~~~~~~~~~~~~~~~~

Based on benchmark results:

- **< 100K points**: Standard curve_fit
- **100K - 1M**: LargeDatasetFitter or GPU
- **1M - 100M**: Chunking + iterative solvers
- **100M - 1B**: Streaming + GPU
- **> 1B**: Distributed computing or sampling

Future Performance Improvements
-------------------------------

Planned optimizations:

1. **Multi-GPU support** for distributed fitting
2. **Adaptive chunking** based on convergence
3. **Compiled kernels** for common fit functions
4. **Parallel chunk processing** for independent fits

Expected improvements:

- Multi-GPU: 3-4x speedup on 4 GPUs
- Adaptive chunking: 20-30% reduction in iterations

Conclusion
----------

NLSQ's large dataset optimizations provide:

- **Order of magnitude** memory reduction
- **20-25x** GPU speedup for large datasets
- **Linear scaling** with proper chunking
- **Unlimited dataset size** via streaming

These improvements enable scientific computing applications that were previously infeasible due to memory constraints, while providing significant speedups for existing workflows.

For detailed implementation, see:

- :doc:`../howto/handle_large_data` - Implementation guide
- :doc:`large_datasets_api` - API reference
- `Benchmark code <https://github.com/imewei/NLSQ/tree/main/benchmarks>`_