Performance Benchmarks

This document presents performance benchmarks for NLSQ’s large dataset optimization features, demonstrating capabilities for datasets ranging from 100K to 1B+ points.

Executive Summary

NLSQ’s large dataset optimizations provide:

  • Up to 98% memory reduction through dynamic sizing

  • 15-30% faster fitting with iterative solvers

  • Unlimited dataset size support via streaming

  • Linear scaling with dataset size using chunking

Key Performance Metrics

Memory Efficiency

Dynamic sizing eliminates memory waste from fixed-size padding:

Dataset Size    Fixed Memory    Dynamic Memory    Savings
------------    ------------    --------------    -------
100K points     0.95 GB         0.01 GB           98%
1M points       9.5 GB          0.23 GB           97%
10M points      95 GB           2.3 GB            97%

Solver Performance

Comparison of different solvers on a 100x100 grid (10,000 points):

Solver            Time (ms)    Memory (GB)
--------------    ---------    -----------
SVD (baseline)    17.8         0.95
CG (iterative)    15.1         0.12
LSQR (sparse)     15.5         0.15
Auto              28.3         0.20

Dataset Size Scaling

Processing time for exponential fitting with 3 parameters:

Dataset Size    Standard API (seconds)    Large Dataset API (seconds)    Speedup
------------    ----------------------    ---------------------------    -------
100K            0.45                      0.42                           1.1x
1M              4.8                       3.2                            1.5x
10M             52.3                      18.5                           2.8x
50M             OOM                       89.2                           N/A
100M            OOM                       178.5                          N/A

OOM = Out of Memory on 16GB system
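The speedup column and the near-linear scaling of the Large Dataset API can be checked directly from the reported timings. The sketch below is plain arithmetic on the numbers in the table; the variable names are illustrative and not part of NLSQ.

```python
# Timings (seconds) copied from the scaling table above.
# None marks runs where the standard API ran out of memory.
timings = {
    100_000: (0.45, 0.42),
    1_000_000: (4.8, 3.2),
    10_000_000: (52.3, 18.5),
    50_000_000: (None, 89.2),
    100_000_000: (None, 178.5),
}

for n_points, (standard_s, large_s) in timings.items():
    # Speedup is only defined when the standard API finished.
    speedup = f"{standard_s / large_s:.1f}x" if standard_s is not None else "N/A"
    # Seconds per million points: a roughly constant value means linear scaling.
    per_million = large_s / (n_points / 1_000_000)
    print(f"{n_points:>11,}  speedup={speedup:>5}  {per_million:.2f} s per 1M points")
```

From 10M points upward the cost settles at roughly 1.8 s per million points, which is the signature of linear scaling. The smaller runs cost more per point, presumably because fixed overhead is a larger share of the total.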

Chunking Strategy Performance

Impact of chunk size on performance (10M points, 4 parameters):

Chunk Size    Chunks    Time (s)    Memory (GB)
----------    ------    --------    -----------
10,000        1,000     45.2        0.08
100,000       100       18.5        0.8
1,000,000     10        16.2        8.0
10,000,000    1         52.3        80.0

An optimal chunk size of 100K-1M points balances speed and memory.
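One way to turn these measurements into a chunk-size choice is to pick the largest chunk that fits a memory budget. The sketch below assumes memory grows linearly at about 8 KB per point, a figure read off the table (0.08 GB for a 10,000-point chunk) for this particular benchmark. `plan_chunks` and `bytes_per_point` are hypothetical names, not NLSQ API.

```python
def plan_chunks(n_points, memory_limit_gb, bytes_per_point=8_000):
    """Return (chunk_size, n_chunks) for a memory budget.

    bytes_per_point is an assumption taken from the table above
    (0.08 GB for a 10,000-point chunk); measure it for your own model.
    """
    budget_bytes = round(memory_limit_gb * 1e9)
    # Largest chunk that fits the budget, capped at the dataset size.
    chunk_size = max(1, min(n_points, budget_bytes // bytes_per_point))
    # Ceiling division: the last chunk may be smaller than the rest.
    n_chunks = -(-n_points // chunk_size)
    return chunk_size, n_chunks


for limit_gb in (0.08, 0.8, 8.0):
    size, count = plan_chunks(10_000_000, limit_gb)
    print(f"{limit_gb:>5} GB -> {size:,} points per chunk, {count:,} chunks")
```

The budget only bounds memory. The timing column still has to be measured, since very small chunks pay per-chunk overhead and a single huge chunk was the slowest configuration in the table.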

Sparse Jacobian Performance

For problems with sparse structure (fitting 100 independent Gaussians):

Method            Dataset Size    Time (s)    Memory (GB)
--------------    ------------    --------    -----------
Dense Jacobian    1M points       125.3       24.5
Sparse (90%)      1M points       18.7        3.2
Sparse (95%)      1M points       12.3        1.8
Sparse (99%)      1M points       6.5         0.4

Speedup: 6.7x to 19.3x for highly sparse problems.
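Where a sparsity percentage comes from can be illustrated by counting Jacobian entries. The sketch assumes each Gaussian has three parameters (amplitude, center, width) and that each data point depends on only one Gaussian, which gives a block-structured Jacobian. Both are illustrative assumptions rather than details stated by the benchmark, and `jacobian_sparsity` is a hypothetical helper.

```python
def jacobian_sparsity(n_points, n_peaks, params_per_peak=3):
    """Fraction of structurally zero Jacobian entries.

    Assumes every data point is influenced by exactly one peak, so each
    row has params_per_peak nonzeros out of n_peaks * params_per_peak.
    """
    total = n_points * n_peaks * params_per_peak
    nonzero = n_points * params_per_peak
    return 1 - nonzero / total


sparsity = jacobian_sparsity(n_points=1_000_000, n_peaks=100)
print(f"sparsity: {sparsity:.0%}")
```

Under these assumptions the Jacobian for 100 independent peaks is 99% zeros. Peaks that overlap add nonzero entries to each row and lower the figure.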

Streaming Performance

Streaming optimizer for unlimited datasets:

Dataset Size    Batch Size    Epochs    Time (min)
------------    ----------    ------    ----------
100M            10K           50        12.5
500M            10K           30        48.2
1B              10K           20        82.7
10B             10K           10        425.3

Note: Streaming enables fitting datasets larger than system memory.
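The batch and epoch columns correspond to a simple iteration pattern: only one batch is live at a time, so memory use is set by the batch size rather than the dataset size. The generator below is a minimal sketch of that pattern, not NLSQ's streaming implementation, and `iter_batches` is a hypothetical helper.

```python
def iter_batches(n_points, batch_size, epochs):
    """Yield (epoch, start, stop) index ranges, one batch at a time."""
    for epoch in range(epochs):
        for start in range(0, n_points, batch_size):
            # The final batch of an epoch may be shorter than batch_size.
            yield epoch, start, min(start + batch_size, n_points)


# Small stand-in sizes so the loop finishes instantly.
n_updates = 0
points_seen = 0
for epoch, start, stop in iter_batches(n_points=25_000, batch_size=10_000, epochs=2):
    # A real fit would load x[start:stop] and y[start:stop] here and
    # apply one parameter update before discarding the batch.
    n_updates += 1
    points_seen += stop - start

print(n_updates, points_seen)
```

For the first table row (100M points, 10K batch size, 50 epochs) the same loop would run 10,000 batches per epoch and 500,000 batches in total.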

GPU Acceleration

Performance comparison CPU vs GPU (NVIDIA A100):

Dataset Size    CPU (s)    GPU (s)    Speedup
------------    -------    -------    -------
100K            0.42       0.08       5.3x
1M              3.2        0.15       21.3x
10M             18.5       0.82       22.6x
100M            178.5      7.3        24.5x

Real-World Benchmarks

Scientific Computing Applications

Spectroscopy Peak Fitting (1M points, 50 peaks):

  • Standard scipy.optimize.curve_fit: 892.3 seconds

  • NLSQ with sparse Jacobian: 42.1 seconds

  • Speedup: 21.2x

Image Stack Analysis (4K×4K×100 frames = 1.6B pixels):

  • Traditional approach: Not feasible (>500GB memory)

  • NLSQ streaming: 3.2 hours on single GPU

  • Enabled previously impossible analysis

Time Series Analysis (100M points, piecewise linear):

  • NumPy/SciPy: Out of memory

  • NLSQ chunked: 4.5 minutes

  • Memory usage: 2.1GB instead of 80GB

Benchmark Configuration

Test System Specifications

Hardware:

  • CPU: AMD EPYC 7763 64-Core

  • RAM: 256GB DDR4

  • GPU: NVIDIA A100 40GB

  • Storage: NVMe SSD 7GB/s

Software:

  • Python: 3.12.0

  • JAX: 0.4.35

  • NLSQ: Latest version

  • NumPy: 1.26.4

  • CUDA: 12.3

Benchmark Methodology

  1. Warm-up: 5 iterations to ensure JIT compilation

  2. Measurement: 100 iterations, report median

  3. Memory: Peak RSS measured via memory_profiler

  4. Datasets: Synthetic with known ground truth

  5. Convergence: Fixed to 1e-8 relative tolerance
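The warm-up and median steps above can be sketched with the standard library. This is an illustrative harness, not the project's benchmark code. `benchmark` and its parameters are hypothetical names, and the defaults mirror the iteration counts listed above.

```python
import statistics
import time


def benchmark(fn, warmup=5, repeats=100):
    """Return the median wall-clock time of fn() in seconds."""
    # Warm-up runs are discarded so one-time costs such as JIT
    # compilation do not distort the measurement.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # The median is less sensitive to outliers than the mean.
    return statistics.median(samples)


median_s = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_s * 1e3:.3f} ms")
```

When timing JAX code, the timed callable should wait for the result (for example by calling `block_until_ready()` on the output), because JAX dispatches work asynchronously and the call can return before the computation finishes.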

Reproducing Benchmarks

Run the benchmark suite:

# Standard benchmarks
python benchmarks/run_benchmarks.py

# Individual benchmark scripts
python benchmarks/benchmark_suite.py

# Memory profiling
python -m memory_profiler benchmarks/benchmark_memory_reuse.py

# GPU benchmarks (requires CUDA)
JAX_PLATFORMS=gpu python benchmarks/run_benchmarks.py

Individual measurements can also be run from Python:

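import time

import jax.numpy as jnp
import numpy as np
from nlsq import CurveFit, estimate_memory_requirements

# Memory efficiency test
stats = estimate_memory_requirements(100_000_000, 4)
print(f"Memory: {stats.total_memory_estimate_gb:.2f} GB")

# Illustrative model and synthetic data (stand-ins for your own func, x, y)
def func(x, a, b, c):
    return a * jnp.exp(-b * x) + c

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 10_000)
y = 2.5 * np.exp(-1.3 * x) + 0.5 + 0.01 * rng.normal(size=x.size)

# Solver comparison (the first call also includes JIT compilation time)
cf = CurveFit()
for solver in ["svd", "cg", "lsqr"]:
    start = time.perf_counter()
    popt, pcov = cf.curve_fit(func, x, y, solver=solver)
    print(f"{solver}: {time.perf_counter() - start:.3f}s")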

Performance Optimization Tips

Memory Optimization

  1. Use iterative solvers for large problems:

    # Reduces memory from O(n²) to O(n)
    cf.curve_fit(func, x, y, solver='cg')
    
  2. Enable chunking for very large datasets:

    from nlsq import LargeDatasetFitter

    fitter = LargeDatasetFitter(memory_limit_gb=8.0)
    result = fitter.fit(func, x, y, p0)
    
  3. Exploit sparsity when available:

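    # Pseudocode: jacobian_sparsity is the fraction of zero Jacobian entries;
    # use_sparse_optimizer() stands for whichever sparse solver path you enable.
    if jacobian_sparsity > 0.9:
        use_sparse_optimizer()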
    

Speed Optimization

  1. Optimal chunk size: 100K-1M points per chunk

  2. Batch size for streaming: 10K-50K points

  3. Use GPU for datasets > 100K points

  4. Pre-compile functions with JAX JIT

  5. Vectorize operations where possible

Scaling Guidelines

Based on benchmark results:

  • < 100K points: Standard curve_fit

  • 100K - 1M: LargeDatasetFitter or GPU

  • 1M - 100M: Chunking + iterative solvers

  • 100M - 1B: Streaming + GPU

  • > 1B: Distributed computing or sampling
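The thresholds above can be encoded as a small lookup. `recommend_strategy` is a hypothetical helper written for illustration. Where two brackets share an endpoint, the sketch assigns it to the larger bracket, which is an arbitrary choice.

```python
def recommend_strategy(n_points):
    """Map a dataset size to the guideline bracket it falls in."""
    if n_points < 100_000:
        return "Standard curve_fit"
    if n_points < 1_000_000:
        return "LargeDatasetFitter or GPU"
    if n_points < 100_000_000:
        return "Chunking + iterative solvers"
    if n_points <= 1_000_000_000:
        return "Streaming + GPU"
    return "Distributed computing or sampling"


for n in (50_000, 500_000, 10_000_000, 500_000_000, 10_000_000_000):
    print(f"{n:>14,}: {recommend_strategy(n)}")
```

The brackets are starting points. Available memory, model complexity, and Jacobian sparsity can all shift the best choice for a given problem.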

Future Performance Improvements

Planned optimizations:

  1. Multi-GPU support for distributed fitting

  2. Adaptive chunking based on convergence

  3. Compiled kernels for common fit functions

  4. Parallel chunk processing for independent fits

Expected improvements:

  • Multi-GPU: 3-4x speedup on 4 GPUs

  • Adaptive chunking: 20-30% reduction in iterations

Conclusion

NLSQ’s large dataset optimizations provide:

  • 97-98% memory reduction through dynamic sizing

  • 20-25x GPU speedup for large datasets

  • Linear scaling with proper chunking

  • Unlimited dataset size via streaming

These improvements enable scientific computing applications that were previously infeasible due to memory constraints, while providing significant speedups for existing workflows.

For detailed implementation, see: