Performance Benchmarks
This document presents performance benchmarks for NLSQ’s large dataset optimization features, demonstrating capabilities for datasets ranging from 100K to 1B+ points.
Executive Summary
NLSQ’s large dataset optimizations provide:
97-98% memory reduction through dynamic sizing
15-30% faster fitting with iterative solvers
Support for datasets larger than system memory via streaming
Linear scaling with dataset size using chunking
Key Performance Metrics
Memory Efficiency
Dynamic sizing eliminates memory waste from fixed-size padding:
| Dataset Size | Fixed Memory | Dynamic Memory | Savings |
|---|---|---|---|
| 100K points | 0.95 GB | 0.01 GB | 98% |
| 1M points | 9.5 GB | 0.23 GB | 97% |
| 10M points | 95 GB | 2.3 GB | 97% |
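The Savings column follows from the two memory columns as 1 - dynamic/fixed. The short check below recomputes it from the rounded table values. The variable names are illustrative, nothing here calls NLSQ, and truncating to a whole percent is an assumption that happens to reproduce the published figures.

```python
# Recompute the "Savings" column from the rounded values in the table above.
# Names are illustrative; this does not use NLSQ.
rows = {
    "100K points": (0.95, 0.01),  # (fixed GB, dynamic GB)
    "1M points": (9.5, 0.23),
    "10M points": (95.0, 2.3),
}

savings_percent = {}
for label, (fixed_gb, dynamic_gb) in rows.items():
    fraction_saved = 1.0 - dynamic_gb / fixed_gb
    # Assumption: the table truncates to a whole percent.
    savings_percent[label] = int(fraction_saved * 100)
    print(f"{label}: {fraction_saved:.1%} saved")
```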
Solver Performance
Comparison of different solvers on a 100x100 grid (10,000 points):
| Solver | Time (ms) | Memory (GB) |
|---|---|---|
| SVD (baseline) | 17.8 | 0.95 |
| CG (iterative) | 15.1 | 0.12 |
| LSQR (sparse) | 15.5 | 0.15 |
| Auto | 28.3 | 0.20 |
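To illustrate why an iterative solver can need less working memory than an SVD, the sketch below solves the Gauss-Newton normal equations with a hand-written conjugate gradient loop that only multiplies by J and its transpose. It is a NumPy illustration under that assumption, not NLSQ's solver; the function name `cg_normal_equations` and the random test problem are made up for the example.

```python
import numpy as np

def cg_normal_equations(J, r, n_iter=50, tol=1e-12):
    """Solve (J^T J) delta = -J^T r by conjugate gradients.

    Only the products J @ v and J.T @ u are used; J^T J is never formed
    and no factorization of J is stored.
    """
    b = -J.T @ r
    b_norm = np.linalg.norm(b)
    delta = np.zeros(J.shape[1])
    res = b.copy()            # residual b - A @ delta with delta = 0
    p = res.copy()            # first search direction
    rs_old = res @ res
    for _ in range(n_iter):
        if np.sqrt(rs_old) <= tol * b_norm:
            break             # converged (also covers b == 0)
        Ap = J.T @ (J @ p)    # A @ p without building A = J^T J
        alpha = rs_old / (p @ Ap)
        delta += alpha * p
        res -= alpha * Ap
        rs_new = res @ res
        p = res + (rs_new / rs_old) * p
        rs_old = rs_new
    return delta

# Small random problem (hypothetical data): 200 residuals, 3 parameters.
rng = np.random.default_rng(0)
J = rng.normal(size=(200, 3))
r = rng.normal(size=200)

delta_cg = cg_normal_equations(J, r)
delta_svd = np.linalg.lstsq(J, -r, rcond=None)[0]  # SVD-based reference
print(np.max(np.abs(delta_cg - delta_svd)))
```

In a fully matrix-free setting the two products would come from Jacobian-vector and vector-Jacobian products, so J itself would not need to be stored either.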
Dataset Size Scaling
Processing time for exponential fitting with 3 parameters:
| Dataset Size | Standard API (seconds) | Large Dataset API (seconds) | Speedup |
|---|---|---|---|
| 100K | 0.45 | 0.42 | 1.1x |
| 1M | 4.8 | 3.2 | 1.5x |
| 10M | 52.3 | 18.5 | 2.8x |
| 50M | OOM | 89.2 | N/A |
| 100M | OOM | 178.5 | N/A |
OOM = out of memory on a 16 GB system.
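One way to read the scaling claim from the table is to normalize the Large Dataset API timings to seconds per million points. The snippet below does only that arithmetic on the published numbers; the dictionary names are illustrative. The per-point cost drops as fixed overhead is amortized and then levels off at roughly 1.78 s per million points between 50M and 100M, which is the signature of linear scaling.

```python
# Large Dataset API timings from the table above: points -> seconds.
timings = {
    100_000: 0.42,
    1_000_000: 3.2,
    10_000_000: 18.5,
    50_000_000: 89.2,
    100_000_000: 178.5,
}

# Normalize each timing to seconds per one million points.
seconds_per_million = {n: t / (n / 1_000_000) for n, t in timings.items()}

for n, rate in seconds_per_million.items():
    print(f"{n:>11,} points: {rate:.2f} s per million points")
```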
Chunking Strategy Performance
Impact of chunk size on performance (10M points, 4 parameters):
| Chunk Size | Chunks | Time (s) | Memory (GB) |
|---|---|---|---|
| 10,000 | 1,000 | 45.2 | 0.08 |
| 100,000 | 100 | 18.5 | 0.8 |
| 1,000,000 | 10 | 16.2 | 8.0 |
| 10,000,000 | 1 | 52.3 | 80.0 |
A chunk size of 100K-1M points gives the best balance of speed and memory: smaller chunks add per-chunk overhead, while a single 10M-point chunk needs 80 GB.
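In the table, memory grows in direct proportion to chunk size, at about 8 KB per point (0.08 GB for 10,000 points). Under that assumption, a chunk size can be picked from a memory budget and the chunk boundaries planned up front. The helpers below are hypothetical and are not part of the NLSQ API; they use integer arithmetic with decimal units (1 MB = 1,000 KB) to avoid rounding surprises.

```python
# Hypothetical helpers, not part of NLSQ.
KB_PER_POINT = 8  # read off the table: 0.08 GB / 10,000 points = 8 KB per point

def max_chunk_size(memory_limit_mb, kb_per_point=KB_PER_POINT):
    """Largest chunk (in points) whose estimated peak memory fits the limit."""
    return memory_limit_mb * 1000 // kb_per_point

def plan_chunks(n_points, chunk_size):
    """Return (start, stop) index pairs that cover range(n_points) in order."""
    return [
        (start, min(start + chunk_size, n_points))
        for start in range(0, n_points, chunk_size)
    ]

chunk_size = max_chunk_size(800)             # 0.8 GB budget
chunks = plan_chunks(10_000_000, chunk_size)
print(f"{chunk_size:,} points per chunk, {len(chunks)} chunks")
```

Each `(start, stop)` pair can then be used to slice the input arrays, so only one chunk needs to be resident at a time.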
Sparse Jacobian Performance
For problems with sparse structure (fitting 100 independent Gaussians):
| Method | Dataset Size | Time (s) | Memory (GB) |
|---|---|---|---|
| Dense Jacobian | 1M points | 125.3 | 24.5 |
| Sparse (90%) | 1M points | 18.7 | 3.2 |
| Sparse (95%) | 1M points | 12.3 | 1.8 |
| Sparse (99%) | 1M points | 6.5 | 0.4 |
Speedup over the dense Jacobian: 6.7x at 90% sparsity, rising to 19.3x at 99% sparsity.
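The sparsity in this kind of problem comes from structure: if each data point is influenced by only one Gaussian, its Jacobian row is nonzero only in that Gaussian's three columns (amplitude, center, width). The sketch below computes the resulting fraction of structural zeros. The function name is hypothetical, and strict independence between peaks is an assumption of the example.

```python
def block_sparsity(n_peaks, params_per_peak=3):
    """Fraction of structurally zero Jacobian entries per row.

    Assumes every data point depends on exactly one peak, so each row has
    params_per_peak nonzeros out of n_peaks * params_per_peak columns.
    """
    total_columns = n_peaks * params_per_peak
    nonzero_columns_per_row = params_per_peak
    return 1.0 - nonzero_columns_per_row / total_columns

sparsity_100 = block_sparsity(100)
print(f"100 independent peaks: {sparsity_100:.0%} of Jacobian entries are zero")
```

Peaks that overlap would add nonzeros to each row and lower the sparsity accordingly.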
Streaming Performance
Streaming optimizer results for datasets too large to hold in memory:
| Dataset Size | Batch Size | Epochs | Time (min) |
|---|---|---|---|
| 100M | 10K | 50 | 12.5 |
| 500M | 10K | 30 | 48.2 |
| 1B | 10K | 20 | 82.7 |
| 10B | 10K | 10 | 425.3 |
Note: Streaming enables fitting datasets larger than system memory.
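The general idea behind streaming fits can be sketched with plain mini-batch gradient descent: batches arrive from a generator, parameters are updated after each one, and the full dataset is never materialized. This is a NumPy illustration on a linear model with synthetic data, not NLSQ's streaming optimizer; `batch_stream`, `fit_streaming`, the learning rate, and the true parameters are all assumptions of the example.

```python
import numpy as np

def batch_stream(n_batches, batch_size, rng):
    """Yield (x, y) batches one at a time; the full dataset is never stored."""
    for _ in range(n_batches):
        x = rng.uniform(0.0, 1.0, batch_size)
        # Synthetic ground truth: a = 2, b = 1, with small Gaussian noise.
        y = 2.0 * x + 1.0 + rng.normal(0.0, 0.01, batch_size)
        yield x, y

def fit_streaming(stream, learning_rate=0.5):
    """Mini-batch gradient descent on the mean squared error of y = a*x + b."""
    a, b = 0.0, 0.0
    for x, y in stream:
        residual = a * x + b - y
        grad_a = 2.0 * np.mean(residual * x)  # d(MSE)/da
        grad_b = 2.0 * np.mean(residual)      # d(MSE)/db
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b

rng = np.random.default_rng(0)
stream = batch_stream(n_batches=500, batch_size=10_000, rng=rng)
a_hat, b_hat = fit_streaming(stream)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

Peak memory here is set by the batch size alone, which is why the dataset size in the table can exceed system memory.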
GPU Acceleration
CPU versus GPU (NVIDIA A100) fitting time:
| Dataset Size | CPU (s) | GPU (s) | Speedup |
|---|---|---|---|
| 100K | 0.42 | 0.08 | 5.3x |
| 1M | 3.2 | 0.15 | 21.3x |
| 10M | 18.5 | 0.82 | 22.6x |
| 100M | 178.5 | 7.3 | 24.5x |
Real-World Benchmarks
Scientific Computing Applications
Spectroscopy Peak Fitting (1M points, 50 peaks):
Standard scipy.optimize.curve_fit: 892.3 seconds
NLSQ with sparse Jacobian: 42.1 seconds
Speedup: 21.2x
Image Stack Analysis (4K×4K×100 frames = 1.6B pixels):
Traditional approach: Not feasible (>500GB memory)
NLSQ streaming: 3.2 hours on single GPU
Result: an analysis that was previously infeasible becomes possible
Time Series Analysis (100M points, piecewise linear):
NumPy/SciPy: Out of memory
NLSQ chunked: 4.5 minutes
Memory usage: 2.1GB instead of 80GB
Benchmark Configuration
Test System Specifications
Hardware:
CPU: AMD EPYC 7763 64-Core
RAM: 256GB DDR4
GPU: NVIDIA A100 40GB
Storage: NVMe SSD 7GB/s
Software:
Python: 3.12.0
JAX: 0.4.35
NLSQ: Latest version
NumPy: 1.26.4
CUDA: 12.3
Benchmark Methodology
Warm-up: 5 untimed iterations so that JIT compilation is finished before measurement
Measurement: 100 timed iterations, median reported
Memory: Peak RSS measured via memory_profiler
Datasets: Synthetic with known ground truth
Convergence: Relative tolerance fixed at 1e-8
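The warm-up and median steps above can be sketched with the standard library alone. The `benchmark` helper below is a hypothetical illustration of that protocol, not the code in the benchmark suite, and the timed function is a trivial stand-in.

```python
import statistics
import time

def benchmark(fn, warmup=5, repeats=100):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # untimed: lets JIT compilation and caches settle
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

calls = []  # records every invocation so the call count can be checked
median_seconds = benchmark(lambda: calls.append(sum(range(1000))))
print(f"median: {median_seconds * 1e3:.4f} ms ({len(calls)} calls in total)")
```

When the timed function returns JAX arrays, it should also call `block_until_ready()` on the result, because JAX dispatches work asynchronously and the timer would otherwise stop before the computation finishes.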
Reproducing Benchmarks
Run the benchmark suite:
# Standard benchmarks
python benchmarks/run_benchmarks.py
# Individual benchmark scripts
python benchmarks/benchmark_suite.py
# Memory profiling
python -m memory_profiler benchmarks/benchmark_memory_reuse.py
# GPU benchmarks (requires CUDA)
JAX_PLATFORMS=gpu python benchmarks/run_benchmarks.py
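Individual benchmarks can also be run from Python:
# Memory efficiency test
from nlsq import estimate_memory_requirements
stats = estimate_memory_requirements(100_000_000, 4)  # 100M points, 4 parameters
print(f"Memory: {stats.total_memory_estimate_gb:.2f} GB")
# Solver comparison
import time
import jax.numpy as jnp
import numpy as np
from nlsq import CurveFit
def func(x, a, b, c):
    # Placeholder model for illustration: three-parameter exponential decay
    return a * jnp.exp(-b * x) + c
x = np.linspace(0, 5, 10_000)
y = 2.5 * np.exp(-1.3 * x) + 0.5  # noise-free placeholder data
cf = CurveFit()
for solver in ['svd', 'cg', 'lsqr']:
    # The first call also pays for JIT compilation
    start = time.perf_counter()
    popt, pcov = cf.curve_fit(func, x, y, solver=solver)
    print(f"{solver}: {time.perf_counter() - start:.3f}s")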
Performance Optimization Tips
Memory Optimization
Use iterative solvers for large problems:
# Reduces memory from O(n²) to O(n)
cf.curve_fit(func, x, y, solver='cg')
Enable chunking for very large datasets:
from nlsq import LargeDatasetFitter
fitter = LargeDatasetFitter(memory_limit_gb=8.0)
result = fitter.fit(func, x, y, p0)
Exploit sparsity when available:
if jacobian_sparsity > 0.9:  # pseudocode: more than 90% of Jacobian entries are zero
    use_sparse_optimizer()  # placeholder for the sparse code path
Speed Optimization
Optimal chunk size: 100K-1M points per chunk
Batch size for streaming: 10K-50K points
Use GPU for datasets > 100K points
Pre-compile functions with JAX JIT
Vectorize operations where possible
Scaling Guidelines
Based on benchmark results:
< 100K points: Standard curve_fit
100K - 1M: LargeDatasetFitter or GPU
1M - 100M: Chunking + iterative solvers
100M - 1B: Streaming + GPU
> 1B: Distributed computing or sampling
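The guidelines above amount to a lookup on dataset size, which can be written as a small function. `recommend_strategy` is a hypothetical helper, not an NLSQ function, and the handling of exact boundary values is an assumption, since the list does not say which range a boundary belongs to.

```python
def recommend_strategy(n_points):
    """Map a dataset size to the guideline list above.

    Assumption: each range includes its lower bound; exactly 1B points
    still counts as 'Streaming + GPU' because the last range is '> 1B'.
    """
    if n_points < 100_000:
        return "Standard curve_fit"
    if n_points < 1_000_000:
        return "LargeDatasetFitter or GPU"
    if n_points < 100_000_000:
        return "Chunking + iterative solvers"
    if n_points <= 1_000_000_000:
        return "Streaming + GPU"
    return "Distributed computing or sampling"

for n in (50_000, 500_000, 10_000_000, 500_000_000, 10_000_000_000):
    print(f"{n:>14,} points -> {recommend_strategy(n)}")
```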
Future Performance Improvements
Planned optimizations:
Multi-GPU support for distributed fitting
Adaptive chunking based on convergence
Compiled kernels for common fit functions
Parallel chunk processing for independent fits
Expected improvements:
Multi-GPU: 3-4x speedup on 4 GPUs
Adaptive chunking: 20-30% reduction in iterations
Conclusion
NLSQ’s large dataset optimizations provide:
97-98% memory reduction through dynamic sizing (more than an order of magnitude)
20-25x GPU speedup for large datasets
Linear scaling with proper chunking
Support for datasets larger than system memory via streaming
These improvements enable scientific computing applications that were previously infeasible due to memory constraints, while providing significant speedups for existing workflows.
For detailed implementation, see:
Large Dataset Tutorial - Implementation guide
Large Dataset API Reference - API reference