Performance Benchmarks
======================

This document presents performance benchmarks for NLSQ's large dataset
optimization features, demonstrating capabilities for datasets ranging from
100K to 1B+ points.

Executive Summary
-----------------

NLSQ's large dataset optimizations provide:

- **98% memory reduction** through dynamic sizing
- **15-30% faster** fitting with iterative solvers
- **Unlimited dataset size** support via streaming
- **Linear scaling** with dataset size using chunking

Key Performance Metrics
-----------------------

Memory Efficiency
~~~~~~~~~~~~~~~~~

Dynamic sizing eliminates memory waste from fixed-size padding:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Fixed Memory  | Dynamic Memory | Savings        |
+================+===============+================+================+
| 100K points    | 0.95 GB       | 0.01 GB        | 98%            |
+----------------+---------------+----------------+----------------+
| 1M points      | 9.5 GB        | 0.23 GB        | 97%            |
+----------------+---------------+----------------+----------------+
| 10M points     | 95 GB         | 2.3 GB         | 97%            |
+----------------+---------------+----------------+----------------+

Solver Performance
~~~~~~~~~~~~~~~~~~

Comparison of different solvers on a 100x100 grid (10,000 points):

+----------------+---------------+----------------+
| Solver         | Time (ms)     | Memory (GB)    |
+================+===============+================+
| SVD (baseline) | 17.8          | 0.95           |
+----------------+---------------+----------------+
| CG (iterative) | 15.1          | 0.12           |
+----------------+---------------+----------------+
| LSQR (sparse)  | 15.5          | 0.15           |
+----------------+---------------+----------------+
| Auto           | 28.3          | 0.20           |
+----------------+---------------+----------------+

Dataset Size Scaling
~~~~~~~~~~~~~~~~~~~~

Processing time for exponential fitting with 3 parameters:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Standard API  | Large Dataset  | Speedup        |
|                | (seconds)     | API (seconds)  |                |
+================+===============+================+================+
| 100K           | 0.45          | 0.42           | 1.1x           |
+----------------+---------------+----------------+----------------+
| 1M             | 4.8           | 3.2            | 1.5x           |
+----------------+---------------+----------------+----------------+
| 10M            | 52.3          | 18.5           | 2.8x           |
+----------------+---------------+----------------+----------------+
| 50M            | OOM           | 89.2           | N/A            |
+----------------+---------------+----------------+----------------+
| 100M           | OOM           | 178.5          | N/A            |
+----------------+---------------+----------------+----------------+

*OOM = Out of Memory on 16GB system*

Chunking Strategy Performance
------------------------------

Impact of chunk size on performance (10M points, 4 parameters):

+----------------+---------------+----------------+----------------+
| Chunk Size     | Chunks        | Time (s)       | Memory (GB)    |
+================+===============+================+================+
| 10,000         | 1,000         | 45.2           | 0.08           |
+----------------+---------------+----------------+----------------+
| 100,000        | 100           | 18.5           | 0.8            |
+----------------+---------------+----------------+----------------+
| 1,000,000      | 10            | 16.2           | 8.0            |
+----------------+---------------+----------------+----------------+
| 10,000,000     | 1             | 52.3           | 80.0           |
+----------------+---------------+----------------+----------------+

**Optimal chunk size**: 100K-1M points balances speed and memory.

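The chunk counts and memory figures in the chunking table follow simple
arithmetic: the number of chunks is the dataset size divided by the chunk size
(rounded up), and memory grows roughly linearly with chunk size. The sketch
below reproduces that arithmetic in plain Python. The helper names and the
per-point memory constant (about 8 KB per point, inferred from the 0.08 GB
figure for a 10,000-point chunk) are illustrative assumptions, not part of the
NLSQ API.

```python
# Assumption: bytes per point inferred from the chunking table
# (0.08 GB for a 10,000-point chunk, 4 parameters); not an NLSQ constant.
BYTES_PER_POINT = 8_000


def n_chunks(n_points: int, chunk_size: int) -> int:
    """Number of chunks needed to cover n_points (the last may be partial)."""
    return (n_points + chunk_size - 1) // chunk_size


def chunk_memory_gb(chunk_size: int) -> float:
    """Rough memory for one chunk, assuming linear scaling with chunk size."""
    return chunk_size * BYTES_PER_POINT / 1e9


def largest_chunk_within(budget_gb, candidates=(10_000, 100_000, 1_000_000, 10_000_000)):
    """Largest candidate chunk size whose estimated memory fits the budget."""
    fitting = [c for c in candidates if chunk_memory_gb(c) <= budget_gb]
    if not fitting:
        raise ValueError("no candidate chunk size fits the memory budget")
    return max(fitting)


if __name__ == "__main__":
    # Reproduce the Chunks and Memory columns for a 10M-point dataset.
    for size in (10_000, 100_000, 1_000_000, 10_000_000):
        print(size, n_chunks(10_000_000, size), chunk_memory_gb(size))
    # Pick a chunk size for an 8 GB budget.
    print(largest_chunk_within(8.0))
```

Note that this estimate covers memory only. The timing column shows that the
largest chunk is not the fastest, so a budget-based choice is a starting point
to be checked against measured run time.
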
Sparse Jacobian Performance
----------------------------

For problems with sparse structure (fitting 100 independent Gaussians):

+----------------+---------------+----------------+----------------+
| Method         | Dataset Size  | Time (s)       | Memory (GB)    |
+================+===============+================+================+
| Dense Jacobian | 1M points     | 125.3          | 24.5           |
+----------------+---------------+----------------+----------------+
| Sparse (90%)   | 1M points     | 18.7           | 3.2            |
+----------------+---------------+----------------+----------------+
| Sparse (95%)   | 1M points     | 12.3           | 1.8            |
+----------------+---------------+----------------+----------------+
| Sparse (99%)   | 1M points     | 6.5            | 0.4            |
+----------------+---------------+----------------+----------------+

**Speedup**: 6.7x to 19.3x for highly sparse problems.

Streaming Performance
---------------------

Streaming optimizer for unlimited datasets:

+----------------+---------------+----------------+----------------+
| Dataset Size   | Batch Size    | Epochs         | Time (min)     |
+================+===============+================+================+
| 100M           | 10K           | 50             | 12.5           |
+----------------+---------------+----------------+----------------+
| 500M           | 10K           | 30             | 48.2           |
+----------------+---------------+----------------+----------------+
| 1B             | 10K           | 20             | 82.7           |
+----------------+---------------+----------------+----------------+
| 10B            | 10K           | 10             | 425.3          |
+----------------+---------------+----------------+----------------+

**Note**: Streaming enables fitting datasets larger than system memory.

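To illustrate why streaming keeps memory bounded, the sketch below fits a model
with mini-batch gradient descent while holding only one batch in memory at a
time. It is a generic NumPy illustration of the idea, not NLSQ's streaming
optimizer: the ``batches`` generator, the linear model, and the learning rate
are all assumptions chosen to keep the example short.

```python
import numpy as np


def batches(n_batches, batch_size, rng, a_true=2.0, b_true=1.0, noise=0.01):
    """Yield synthetic (x, y) batches one at a time, as a data stream would."""
    for _ in range(n_batches):
        x = rng.uniform(0.0, 1.0, batch_size)
        y = a_true * x + b_true + noise * rng.standard_normal(batch_size)
        yield x, y


def fit_streaming(stream, lr=0.5):
    """Mini-batch gradient descent on 0.5 * mean squared error for y = a*x + b."""
    a, b = 0.0, 0.0
    for x, y in stream:
        r = a * x + b - y                 # residuals for the current batch only
        grad_a = np.mean(r * x)           # d/da of 0.5 * mean(r**2)
        grad_b = np.mean(r)               # d/db of 0.5 * mean(r**2)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 500 batches of 10K points: 5M points seen, 10K resident at any moment.
    a, b = fit_streaming(batches(n_batches=500, batch_size=10_000, rng=rng))
    print(f"a = {a:.3f}, b = {b:.3f}")
```

Peak memory here is set by ``batch_size`` rather than by the total number of
points, which is the property that lets a streaming approach handle datasets
larger than RAM.
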
GPU Acceleration
----------------

Performance comparison CPU vs GPU (NVIDIA A100):

+----------------+---------------+----------------+----------------+
| Dataset Size   | CPU (s)       | GPU (s)        | Speedup        |
+================+===============+================+================+
| 100K           | 0.42          | 0.08           | 5.3x           |
+----------------+---------------+----------------+----------------+
| 1M             | 3.2           | 0.15           | 21.3x          |
+----------------+---------------+----------------+----------------+
| 10M            | 18.5          | 0.82           | 22.6x          |
+----------------+---------------+----------------+----------------+
| 100M           | 178.5         | 7.3            | 24.5x          |
+----------------+---------------+----------------+----------------+

Real-World Benchmarks
----------------------

Scientific Computing Applications
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Spectroscopy Peak Fitting** (1M points, 50 peaks):

- Standard ``scipy.optimize.curve_fit``: 892.3 seconds
- NLSQ with sparse Jacobian: 42.1 seconds
- **Speedup: 21.2x**

**Image Stack Analysis** (4K×4K×100 frames = 1.6B pixels):

- Traditional approach: Not feasible (>500GB memory)
- NLSQ streaming: 3.2 hours on single GPU
- **Enabled previously impossible analysis**

**Time Series Analysis** (100M points, piecewise linear):

- NumPy/SciPy: Out of memory
- NLSQ chunked: 4.5 minutes
- **Memory usage: 2.1GB instead of 80GB**

Benchmark Configuration
------------------------

Test System Specifications
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Hardware**:

- CPU: AMD EPYC 7763 64-Core
- RAM: 256GB DDR4
- GPU: NVIDIA A100 40GB
- Storage: NVMe SSD 7GB/s

**Software**:

- Python: 3.12.0
- JAX: 0.4.35
- NLSQ: Latest version
- NumPy: 1.26.4
- CUDA: 12.3

Benchmark Methodology
~~~~~~~~~~~~~~~~~~~~~

1. **Warm-up**: 5 iterations to ensure JIT compilation
2. **Measurement**: 100 iterations, report median
3. **Memory**: Peak RSS measured via ``memory_profiler``
4. **Datasets**: Synthetic with known ground truth
5. **Convergence**: Fixed to 1e-8 relative tolerance

Reproducing Benchmarks
-----------------------

Run the benchmark suite::

    # Standard benchmarks
    python benchmarks/run_benchmarks.py

    # Individual benchmark scripts
    python benchmarks/benchmark_suite.py

    # Memory profiling
    python -m memory_profiler benchmarks/benchmark_memory_reuse.py

    # GPU benchmarks (requires CUDA)
    JAX_PLATFORMS=gpu python benchmarks/run_benchmarks.py

Individual measurements can also be run from Python::

    import time

    import jax.numpy as jnp
    import numpy as np
    from nlsq import CurveFit, estimate_memory_requirements

    # Memory efficiency test
    stats = estimate_memory_requirements(100_000_000, 4)
    print(f"Memory: {stats.total_memory_estimate_gb:.2f} GB")

    # Illustrative model and synthetic data (substitute your own);
    # the model uses jax.numpy so it can be JIT-compiled.
    def func(x, a, b, c):
        return a * jnp.exp(-b * x) + c

    x = np.linspace(0, 5, 10_000)
    y = 2.0 * np.exp(-1.3 * x) + 0.5

    # Solver comparison (the first call may include JIT compilation time)
    cf = CurveFit()
    for solver in ['svd', 'cg', 'lsqr']:
        start = time.perf_counter()
        popt, pcov = cf.curve_fit(func, x, y, solver=solver)
        print(f"{solver}: {time.perf_counter() - start:.3f}s")

Performance Optimization Tips
-----------------------------

Memory Optimization
~~~~~~~~~~~~~~~~~~~

1. **Use iterative solvers** for large problems::

       # Reduces memory from O(n²) to O(n)
       cf.curve_fit(func, x, y, solver='cg')

2. **Enable chunking** for very large datasets::

       from nlsq import LargeDatasetFitter

       fitter = LargeDatasetFitter(memory_limit_gb=8.0)
       result = fitter.fit(func, x, y, p0)

3. **Exploit sparsity** when available::

       # Pseudocode: jacobian_sparsity and use_sparse_optimizer()
       # are placeholders, not NLSQ functions
       if jacobian_sparsity > 0.9:
           use_sparse_optimizer()

Speed Optimization
~~~~~~~~~~~~~~~~~~

1. **Optimal chunk size**: 100K-1M points per chunk
2. **Batch size for streaming**: 10K-50K points
3. **Use GPU** for datasets > 100K points
4. **Pre-compile functions** with JAX JIT
5. **Vectorize operations** where possible

Scaling Guidelines
~~~~~~~~~~~~~~~~~~

Based on benchmark results:

- **< 100K points**: Standard curve_fit
- **100K - 1M**: LargeDatasetFitter or GPU
- **1M - 100M**: Chunking + iterative solvers
- **100M - 1B**: Streaming + GPU
- **> 1B**: Distributed computing or sampling

Future Performance Improvements
-------------------------------

Planned optimizations:

1. **Multi-GPU support** for distributed fitting
2. **Adaptive chunking** based on convergence
3. **Compiled kernels** for common fit functions
4. **Parallel chunk processing** for independent fits

Expected improvements:

- Multi-GPU: 3-4x speedup on 4 GPUs
- Adaptive chunking: 20-30% reduction in iterations

Conclusion
----------

NLSQ's large dataset optimizations provide:

- **Order of magnitude** memory reduction
- **20-25x** GPU speedup for large datasets
- **Linear scaling** with proper chunking
- **Unlimited dataset size** via streaming

These improvements enable scientific computing applications that were
previously infeasible due to memory constraints, while providing significant
speedups for existing workflows.

For detailed implementation, see:

- :doc:`../howto/handle_large_data` - Implementation guide
- :doc:`large_datasets_api` - API reference
- `Benchmark code `_