Performance Benchmarks¶

This document presents performance benchmarks for NLSQ’s large dataset optimization features, demonstrating capabilities for datasets ranging from 100K to 1B+ points.

Executive Summary¶

NLSQ’s large dataset optimizations provide:

98% memory reduction through dynamic sizing
About 13-15% lower fit time with iterative solvers (CG and LSQR vs. SVD in the solver comparison below)
Unlimited dataset size support via streaming
Linear scaling with dataset size using chunking

Key Performance Metrics¶

Memory Efficiency¶

Dynamic sizing eliminates memory waste from fixed-size padding:

Dataset Size	Fixed Memory	Dynamic Memory	Savings
100K points	0.95 GB	0.01 GB	98%
1M points	9.5 GB	0.23 GB	97%
10M points	95 GB	2.3 GB	97%

Solver Performance¶

Comparison of different solvers on a 100x100 grid (10,000 points):

Solver	Time (ms)	Memory (GB)
SVD (baseline)	17.8	0.95
CG (iterative)	15.1	0.12
LSQR (sparse)	15.5	0.15
Auto	28.3	0.20

Dataset Size Scaling¶

Processing time for exponential fitting with 3 parameters:

Dataset Size	Standard API (seconds)	Large Dataset API (seconds)	Speedup
100K	0.45	0.42	1.1x
1M	4.8	3.2	1.5x
10M	52.3	18.5	2.8x
50M	OOM	89.2	N/A
100M	OOM	178.5	N/A

OOM = Out of Memory on 16GB system
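To check the linear-scaling claim against the table above, one can normalize the Large Dataset API timings to seconds per million points. The sketch below uses only the figures from the table; the helper name `seconds_per_million` is illustrative and not part of NLSQ.

```python
# Large Dataset API timings copied from the table above: (points, seconds).
timings = [
    (100_000, 0.42),
    (1_000_000, 3.2),
    (10_000_000, 18.5),
    (50_000_000, 89.2),
    (100_000_000, 178.5),
]


def seconds_per_million(points, seconds):
    """Normalize a timing to seconds per one million points."""
    return seconds / (points / 1_000_000)


per_million = [seconds_per_million(n, t) for n, t in timings]
for (n, _), rate in zip(timings, per_million):
    print(f"{n:>11,} points: {rate:.2f} s per million points")
```

The per-point cost drops as fixed overhead is amortized, then levels off at roughly 1.8 s per million points from 10M points upward, which is what linear scaling looks like in this normalization.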

Chunking Strategy Performance¶

Impact of chunk size on performance (10M points, 4 parameters):

Chunk Size	Chunks	Time (s)	Memory (GB)
10,000	1,000	45.2	0.08
100,000	100	18.5	0.8
1,000,000	10	16.2	8.0
10,000,000	1	52.3	80.0

Optimal chunk size: 100K-1M points, which gives the best balance of speed and memory.
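As a rough illustration of how a dataset divides into chunks, the following sketch computes chunk counts and index ranges. It is a generic example, not NLSQ's internal chunking logic; `chunk_bounds` and `n_chunks` are hypothetical helper names.

```python
def chunk_bounds(n_points, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_points) in order."""
    for start in range(0, n_points, chunk_size):
        # Clamp the stop index so a final partial chunk is handled.
        yield start, min(start + chunk_size, n_points)


def n_chunks(n_points, chunk_size):
    """Number of chunks needed, counting a final partial chunk."""
    return -(-n_points // chunk_size)  # integer ceiling division


# Chunk counts for the 10M-point case in the table above.
for size in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"chunk_size={size:>10,} -> {n_chunks(10_000_000, size):>5,} chunks")

# Small example with a partial last chunk.
bounds = list(chunk_bounds(25, 10))
print(bounds)
```

Each `(start, stop)` pair can be used to slice `x` and `y`, so only one chunk needs to be resident in memory at a time.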

Sparse Jacobian Performance¶

For problems with sparse structure (fitting 100 independent Gaussians):

Method	Dataset Size	Time (s)	Memory (GB)
Dense Jacobian	1M points	125.3	24.5
Sparse (90%)	1M points	18.7	3.2
Sparse (95%)	1M points	12.3	1.8
Sparse (99%)	1M points	6.5	0.4

Speedup over the dense Jacobian: 6.7x at 90% sparsity, rising to 19.3x at 99% sparsity.
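To see where such sparsity levels can come from, consider a simplified layout in which every data point belongs to exactly one peak's window, so each residual depends only on that peak's parameters. The sketch below builds this structural pattern for 100 peaks with 3 parameters each and measures the fraction of zero entries. The layout and the helper names are assumptions for illustration. Real Gaussians have tails, so the Jacobian is only approximately sparse.

```python
def block_sparsity_pattern(n_points, n_peaks, params_per_peak=3):
    """Boolean Jacobian pattern: True where a residual can depend on a parameter.

    Assumes each point belongs to exactly one peak's window (hypothetical layout).
    """
    points_per_peak = n_points // n_peaks
    n_params = n_peaks * params_per_peak
    pattern = []
    for i in range(n_points):
        peak = min(i // points_per_peak, n_peaks - 1)
        first = peak * params_per_peak
        row = [first <= j < first + params_per_peak for j in range(n_params)]
        pattern.append(row)
    return pattern


def sparsity(pattern):
    """Fraction of structurally zero entries in the pattern."""
    total = sum(len(row) for row in pattern)
    nonzero = sum(sum(row) for row in pattern)
    return 1.0 - nonzero / total


pattern = block_sparsity_pattern(n_points=1_000, n_peaks=100)
print(f"sparsity: {sparsity(pattern):.2%}")
```

With 100 non-overlapping peaks, each row has 3 nonzeros out of 300 columns, giving 99% sparsity. Overlapping peaks lower that figure.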

Streaming Performance¶

Streaming optimizer for unlimited datasets:

Dataset Size	Batch Size	Epochs	Time (min)
100M	10K	50	12.5
500M	10K	30	48.2
1B	10K	20	82.7
10B	10K	10	425.3

Note: Streaming enables fitting datasets larger than system memory.
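The idea behind streaming can be sketched with a plain mini-batch gradient descent loop that sees one batch at a time and never holds the full dataset. This is a generic illustration on a linear model, not NLSQ's streaming optimizer. The generator `batches`, the function `fit_streaming`, and the learning rate are all assumptions chosen for the example.

```python
import random


def batches(n_batches, batch_size, seed=0):
    """Yield (xs, ys) batches generated on the fly; nothing is stored."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(batch_size)]
        # Ground truth: y = 2.0 * x + 0.5 plus small Gaussian noise.
        ys = [2.0 * x + 0.5 + rng.gauss(0.0, 0.01) for x in xs]
        yield xs, ys


def fit_streaming(batch_iter, lr=0.1):
    """Fit y = a*x + b with one mean-squared-error gradient step per batch."""
    a, b = 0.0, 0.0
    for xs, ys in batch_iter:
        n = len(xs)
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


a, b = fit_streaming(batches(n_batches=500, batch_size=100))
print(f"a={a:.3f}, b={b:.3f}")
```

Memory use depends on the batch size rather than the dataset size, which is why the batch size column in the table stays at 10K while the dataset grows.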

GPU Acceleration¶

CPU vs. GPU performance comparison (GPU: NVIDIA A100):

Dataset Size	CPU (s)	GPU (s)	Speedup
100K	0.42	0.08	5.3x
1M	3.2	0.15	21.3x
10M	18.5	0.82	22.6x
100M	178.5	7.3	24.5x

Real-World Benchmarks¶

Scientific Computing Applications¶

Spectroscopy Peak Fitting (1M points, 50 peaks):

Standard scipy.optimize.curve_fit: 892.3 seconds
NLSQ with sparse Jacobian: 42.1 seconds
Speedup: 21.2x

Image Stack Analysis (4K×4K×100 frames = 1.6B pixels):

Traditional approach: Not feasible (>500GB memory)
NLSQ streaming: 3.2 hours on single GPU
Result: makes a previously infeasible analysis possible

Time Series Analysis (100M points, piecewise linear):

NumPy/SciPy: Out of memory
NLSQ chunked: 4.5 minutes
Memory usage: 2.1GB instead of 80GB

Benchmark Configuration¶

Test System Specifications¶

Hardware:

CPU: AMD EPYC 7763 64-Core
RAM: 256GB DDR4
GPU: NVIDIA A100 40GB
Storage: NVMe SSD 7GB/s

Software:

Python: 3.12.0
JAX: 0.4.35
NLSQ: Latest version
NumPy: 1.26.4
CUDA: 12.3

Benchmark Methodology¶

Warm-up: 5 iterations, so that JIT compilation is excluded from the timings
Measurement: 100 iterations, report median
Memory: Peak RSS measured via memory_profiler
Datasets: Synthetic with known ground truth
Convergence: Fixed to 1e-8 relative tolerance
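The warm-up and median steps above can be sketched as a small timing harness. This is a minimal illustration using only the standard library, not the benchmark suite's actual code; `benchmark` is a hypothetical helper and the workload is a stand-in.

```python
import statistics
import time


def benchmark(fn, warmup=5, repeats=100):
    """Return the median wall-clock time of fn() in seconds.

    Warm-up calls run first and are discarded, so one-time costs
    (such as JIT compilation) do not skew the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; replace with the fit call being measured.
median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_s * 1e3:.3f} ms")
```

When timing JAX code, the measured function should block until the result is ready (for example by calling `.block_until_ready()` on the output). Otherwise asynchronous dispatch can make the call appear faster than it is.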

Reproducing Benchmarks¶

Run the benchmark suite:

# Standard benchmarks
python benchmarks/run_benchmarks.py

# Individual benchmark scripts
python benchmarks/benchmark_suite.py

# Memory profiling
python -m memory_profiler benchmarks/benchmark_memory_reuse.py

# GPU benchmarks (requires CUDA)
JAX_PLATFORMS=gpu python benchmarks/run_benchmarks.py

Individual measurements can also be run directly from Python:

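# Memory efficiency test
from nlsq import estimate_memory_requirements

stats = estimate_memory_requirements(100_000_000, 4)  # 100M points, 4 parameters
print(f"Memory: {stats.total_memory_estimate_gb:.2f} GB")

# Solver comparison
import time

import jax.numpy as jnp
import numpy as np
from nlsq import CurveFit

# Example model and synthetic data (illustrative values, not the benchmark inputs)
def func(x, a, b, c):
    return a * jnp.exp(-b * x) + c

x = np.linspace(0, 5, 10_000)
y = 2.5 * np.exp(-1.3 * x) + 0.5

cf = CurveFit()
# Note: the first call includes JIT compilation; warm up before comparing timings
for solver in ['svd', 'cg', 'lsqr']:
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    popt, pcov = cf.curve_fit(func, x, y, solver=solver)
    print(f"{solver}: {time.perf_counter() - start:.3f}s")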

Performance Optimization Tips¶

Memory Optimization¶

Use iterative solvers for large problems:

# Reduces memory from O(n²) to O(n)
cf.curve_fit(func, x, y, solver='cg')

Enable chunking for very large datasets:

from nlsq import LargeDatasetFitter

fitter = LargeDatasetFitter(memory_limit_gb=8.0)
result = fitter.fit(func, x, y, p0)  # p0: initial parameter guess

Exploit sparsity when available:

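# Pseudocode: jacobian_sparsity is the fraction of zero Jacobian entries,
# and use_sparse_optimizer() is a placeholder for the sparse code path.
if jacobian_sparsity > 0.9:
    use_sparse_optimizer()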

Speed Optimization¶

Optimal chunk size: 100K-1M points per chunk
Batch size for streaming: 10K-50K points
Use GPU for datasets > 100K points
Pre-compile functions with JAX JIT
Vectorize operations where possible

Scaling Guidelines¶

Based on benchmark results:

< 100K points: Standard curve_fit
100K - 1M: LargeDatasetFitter or GPU
1M - 100M: Chunking + iterative solvers
100M - 1B: Streaming + GPU
> 1B: Distributed computing or sampling
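The guidelines above can be written as a simple lookup. The function name `recommend_strategy` is hypothetical, and because the listed ranges share their endpoints, the handling of exact boundary values here is an assumption.

```python
def recommend_strategy(n_points):
    """Map a dataset size to the approach suggested by the guidelines above."""
    if n_points < 100_000:
        return "standard curve_fit"
    if n_points < 1_000_000:
        return "LargeDatasetFitter or GPU"
    if n_points < 100_000_000:
        return "chunking + iterative solvers"
    if n_points <= 1_000_000_000:
        return "streaming + GPU"
    return "distributed computing or sampling"


for n in (50_000, 500_000, 10_000_000, 500_000_000, 5_000_000_000):
    print(f"{n:>13,}: {recommend_strategy(n)}")
```

Treat the thresholds as starting points. Available memory, parameter count, and Jacobian sparsity can all shift where each approach pays off.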

Future Performance Improvements¶

Planned optimizations:

Multi-GPU support for distributed fitting
Adaptive chunking based on convergence
Compiled kernels for common fit functions
Parallel chunk processing for independent fits

Expected improvements:

Multi-GPU: 3-4x speedup on 4 GPUs
Adaptive chunking: 20-30% reduction in iterations

Conclusion¶

NLSQ’s large dataset optimizations provide:

Up to 98% memory reduction (more than an order of magnitude)
20-25x GPU speedup for large datasets
Linear scaling with proper chunking
Unlimited dataset size via streaming

These improvements enable scientific computing applications that were previously infeasible due to memory constraints, while providing significant speedups for existing workflows.

For detailed implementation, see:

Large Dataset Tutorial - Implementation guide
Large Dataset API Reference - API reference
Benchmark code