# NLSQ Performance Tuning Guide **For Users**: How to get the best performance from NLSQ **Last Updated**: December 2025 --- ## Recent Optimizations (v0.4.2+) NLSQ has received significant performance improvements: ### Lazy Imports (43% Faster Cold Start) Specialty modules are now lazily imported, reducing initial import time from ~1084ms to ~620ms: ```python # These modules only load when first accessed: # - nlsq.global_optimization # - nlsq.streaming.adaptive_hybrid # - nlsq.profiler_visualization # - nlsq.gui import nlsq # Fast (~620ms) nlsq.curve_fit(...) # Core functionality loads immediately # Streaming loads only when needed nlsq.AdaptiveHybridStreamingOptimizer(...) # Lazy load happens here ``` ### Vectorized Sparse Jacobian (37-50x Speedup) Sparse Jacobian construction now uses vectorized NumPy operations: ```python # Old: O(nm) nested loop - slow for large matrices # New: O(nnz) COO sparse construction - much faster # 100k x 50 matrix: ~200ms → ~5ms (40x speedup) ``` ### LRU Memory Pool Memory pool now uses LRU eviction with adaptive TTL: ```python from nlsq.caching.memory_manager import MemoryManager manager = MemoryManager() # Arrays are cached and reused # LRU eviction when pool exceeds max_arrays manager.optimize_memory_pool(max_arrays=10) ``` --- ## Quick Start NLSQ is already highly optimized and should provide excellent performance out of the box. In most cases, **no tuning is needed**. **Typical Performance**: - 100-point fit: ~30ms (after initial JIT compilation) - 1000-point fit: ~110ms - 10000-point fit: ~134ms - 50000-point fit: ~120ms **Scaling**: 50x more data → only 1.2x slower [PASS] --- ## Understanding NLSQ Performance ### First Run vs Subsequent Runs **First run includes JIT compilation**: ```python from nlsq import curve_fit # First call: ~430ms (includes ~400ms JIT compilation) popt1, pcov1 = curve_fit(model, x, y, p0=[1, 1]) # Second call: ~30ms (uses cached compiled function) popt2, pcov2 = curve_fit(model, x2, y2, p0=[1, 1]) ``` **Solution**: JIT compilation is one-time cost, subsequent calls are much faster. ### GPU vs CPU **Automatic Backend Selection**: ```python import jax print(jax.devices()) # Check which devices are available # NLSQ automatically uses GPU/TPU if available popt, pcov = curve_fit(model, x, y) # Runs on GPU automatically ``` **Force CPU** (for debugging or small problems): ```bash JAX_PLATFORM_NAME=cpu python your_script.py ``` **GPU Benefits**: - Most noticeable for large problems (>10,000 points) - Parallel computation of Jacobians - Faster linear algebra operations --- ## Optimization Techniques ### 1. Reuse Compiled Functions (Highest Impact) **Problem**: Creating new `curve_fit` calls triggers recompilation **Solution**: Use `CurveFit` class to reuse compiled functions ```python from nlsq import CurveFit # BAD: Recompiles for each fit for dataset in datasets: popt, pcov = curve_fit(model, dataset.x, dataset.y) # Slow! # GOOD: Compile once, reuse many times cf = CurveFit() for dataset in datasets: popt, pcov = cf.curve_fit(model, dataset.x, dataset.y) # Fast! ``` **Speedup**: 10-100x for batch fitting (avoids repeated JIT compilation) ### 2. Batch Processing **Problem**: Fitting curves one at a time in a loop **Solution**: Process multiple fits efficiently ```python # BAD: Sequential processing results = [] for i in range(n_curves): popt, pcov = cf.curve_fit(model, x_data[i], y_data[i]) results.append(popt) # BETTER: Reuse CurveFit instance (as shown above) cf = CurveFit() results = [] for i in range(n_curves): popt, pcov = cf.curve_fit(model, x_data[i], y_data[i]) results.append(popt) # BEST: Use large_dataset module for very large batches from nlsq.streaming.large_dataset import LargeDatasetFitter fitter = LargeDatasetFitter() results = fitter.fit_multiple(model, x_data, y_data, p0_list) ``` ### 3. Provide Good Initial Guesses **Problem**: Poor initial guess → more iterations → slower convergence **Solution**: Provide reasonable `p0` parameter ```python # BAD: No initial guess (uses zeros) popt, pcov = curve_fit(exponential, x, y) # May take many iterations # GOOD: Reasonable initial guess p0 = [max(y), 1.0, min(y)] # Amplitude, decay rate, offset popt, pcov = curve_fit(exponential, x, y, p0=p0) # Faster convergence ``` **Speedup**: 2-5x for well-conditioned problems ### 4. Use Bounds When Appropriate **Problem**: Unbounded optimization may explore unrealistic parameter space **Solution**: Provide reasonable bounds ```python # Example: Exponential decay # y = a * exp(-b * x) + c # We know: a > 0, b > 0, c >= 0 bounds = ([0, 0, 0], [np.inf, np.inf, np.inf]) popt, pcov = curve_fit(exponential, x, y, p0=p0, bounds=bounds) ``` **Benefits**: - Faster convergence (avoids unrealistic regions) - More robust (prevents numerical issues) ### 5. Choose Appropriate Algorithm **TRF (default)**: Best for bounded problems ```python popt, pcov = curve_fit(model, x, y, method="trf", bounds=bounds) ``` **LM (Levenberg-Marquardt)**: Best for unbounded problems ```python popt, pcov = curve_fit(model, x, y, method="lm") # Slightly faster for unconstrained ``` **Dogbox**: Alternative for bounded problems ```python popt, pcov = curve_fit(model, x, y, method="dogbox", bounds=bounds) ``` ### 6. Reduce Data When Possible **Problem**: Fitting millions of data points when thousands would suffice **Solution**: Downsample if appropriate for your problem ```python # If you have 1M points but only fitting 5 parameters if len(x) > 10000: # Downsample intelligently indices = np.linspace(0, len(x) - 1, 10000, dtype=int) x_reduced = x[indices] y_reduced = y[indices] sigma_reduced = sigma[indices] if sigma is not None else None popt, pcov = curve_fit(model, x_reduced, y_reduced, sigma=sigma_reduced) ``` **Note**: Only do this if statistically valid for your application! --- ## Profiling Your Workload ### Basic Timing ```python import time from nlsq import CurveFit cf = CurveFit() # Time first call (includes JIT) start = time.time() popt1, pcov1 = cf.curve_fit(model, x, y, p0=p0) first_call = time.time() - start # Time second call (cached) start = time.time() popt2, pcov2 = cf.curve_fit(model, x2, y2, p0=p0) second_call = time.time() - start print(f"First call (with JIT): {first_call*1000:.1f}ms") print(f"Second call (cached): {second_call*1000:.1f}ms") print(f"Speedup: {first_call/second_call:.1f}x") ``` ### Detailed Profiling ```python import cProfile import pstats # Profile your code profiler = cProfile.Profile() profiler.enable() # Your fitting code here popt, pcov = curve_fit(model, x, y, p0=p0) profiler.disable() # Analyze results stats = pstats.Stats(profiler) stats.sort_stats("cumulative") stats.print_stats(20) # Top 20 functions ``` ### Using pytest-benchmark ```python # In your test file def test_fitting_performance(benchmark): """Benchmark curve fitting performance""" x = np.linspace(0, 10, 1000) y = 2.0 * np.exp(-0.5 * x) + 0.3 + 0.05 * np.random.randn(len(x)) p0 = [2.0, 0.5, 0.3] result = benchmark(curve_fit, exponential, x, y, p0=p0) popt, pcov = result assert len(popt) == 3 ``` Run with: ```bash pytest test_performance.py --benchmark-only ``` --- ## Common Performance Issues ### Issue 1: Slow First Call **Symptom**: First `curve_fit` call takes 200-500ms **Cause**: JIT compilation overhead **Solution**: [PASS] This is normal and expected - Subsequent calls will be much faster (~10-50ms) - Use `CurveFit` class to reuse compiled functions - Consider warming up the JIT cache on startup ```python # Warm up JIT cache cf = CurveFit() _ = cf.curve_fit(model, x_dummy, y_dummy, p0=p0_dummy) # Now real fits will be fast ``` ### Issue 2: Each Fit Is Slow **Symptom**: Every call to `curve_fit` takes 200+ ms **Diagnosis**: 1. Are you recreating the function each time? 2. Are you using different model functions? 3. Is your model function slow? **Solutions**: ```python # Make sure you're reusing CurveFit instance cf = CurveFit() # Create ONCE for data in datasets: popt, pcov = cf.curve_fit(model, data.x, data.y) # Reuse # Profile your model function import jax.numpy as jnp @jit # JIT compile your model def fast_model(x, a, b, c): return a * jnp.exp(-b * x) + c # Use jnp, not np! ``` ### Issue 3: Large Dataset Performance **Symptom**: Fitting >100,000 points is very slow **Solution**: Use large dataset optimization features ```python from nlsq.streaming.large_dataset import LargeDatasetFitter fitter = LargeDatasetFitter(chunk_size=10000) # Process in chunks popt, pcov = fitter.fit(model, x, y, p0=p0) ``` ### Issue 4: Fitting Doesn't Converge **Symptom**: Function takes very long, doesn't converge **Cause**: Poor initial guess or ill-conditioned problem **Solutions**: 1. Provide better initial guess 2. Use bounds to constrain search 3. Scale your data 4. Increase max iterations (if needed) ```python # Scale data to reasonable range x_scaled = (x - x.mean()) / x.std() y_scaled = (y - y.mean()) / y.std() # Fit on scaled data popt_scaled, pcov_scaled = curve_fit(model, x_scaled, y_scaled, p0=p0) # Transform parameters back to original scale popt = transform_params(popt_scaled, x.mean(), x.std(), y.mean(), y.std()) ``` --- ## Advanced Optimization ### Sparse Jacobian (For Specific Problems) If your Jacobian has sparse structure, exploit it: ```python from nlsq.sparse_jacobian import SparseCurveFit # Define sparsity pattern # (only if you know your Jacobian is sparse!) scf = SparseCurveFit(sparsity_pattern=pattern) popt, pcov = scf.curve_fit(model, x, y, p0=p0) ``` **Speedup**: 2-10x for problems with sparse Jacobians ### Custom Jacobian If you can provide analytical Jacobian: ```python def jac_analytical(x, a, b, c): """Analytical Jacobian for a*exp(-b*x) + c""" J = np.zeros((len(x), 3)) exp_term = np.exp(-b * x) J[:, 0] = exp_term # d/da J[:, 1] = -a * x * exp_term # d/db J[:, 2] = 1.0 # d/dc return J popt, pcov = curve_fit(model, x, y, p0=p0, jac=jac_analytical) ``` **Note**: JAX's autodiff is usually fast enough. Only provide custom Jacobian if: - You have analytical form - It's significantly simpler than automatic differentiation - Profiling shows Jacobian computation is bottleneck --- ## Benchmarking Checklist Before claiming "NLSQ is slow": - [ ] Are you using `CurveFit` class for multiple fits? - [ ] Have you excluded JIT compilation time from measurements? - [ ] Is your model function JIT-compiled and using JAX operations? - [ ] Are you providing reasonable initial guesses? - [ ] Is your problem well-conditioned? - [ ] Have you profiled to identify the actual bottleneck? - [ ] Are you comparing fair to fair (NLSQ on CPU vs SciPy on CPU)? --- ## Performance Expectations ### What is Fast? For reference, here are typical performance numbers on modern CPU: | Problem Size | Points | Parameters | Expected Time (after JIT) | |--------------|--------|------------|---------------------------| | Small | 100 | 2-5 | 10-30ms | | Medium | 1,000 | 2-5 | 50-150ms | | Large | 10,000 | 2-5 | 100-200ms | | XLarge | 50,000 | 2-5 | 100-300ms | | Huge | 100,000+ | 2-5 | Use large_dataset module | **GPU acceleration** can provide 2-10x additional speedup for large problems. ### When to Use GPU **GPU is beneficial when**: - Problem size > 10,000 points - Batch fitting many curves - Complex model functions - Large Jacobian matrices **GPU may not help when**: - Problem size < 1,000 points (overhead dominates) - Simple model functions - JIT compilation dominates (first run) --- ## Getting Help If you're experiencing performance issues: 1. **Profile first**: Identify the actual bottleneck 2. **Check the basics**: CurveFit class, good initial guess, etc. 3. **Review case study**: `docs/optimization_case_study.md` 4. **Open an issue**: With profiling data and minimal reproducible example **Template for performance issues**: ```python import numpy as np from nlsq import CurveFit import time # Your model def model(x, a, b): return a * x + b # Your data x = np.linspace(0, 10, 1000) y = 2.0 * x + 1.0 + 0.1 * np.random.randn(len(x)) # Timing cf = CurveFit() # First call (with JIT) start = time.time() popt1, pcov1 = cf.curve_fit(model, x, y, p0=[1, 0]) first = time.time() - start # Second call (cached) start = time.time() popt2, pcov2 = cf.curve_fit(model, x, y, p0=[1, 0]) second = time.time() - start print(f"First: {first*1000:.1f}ms, Second: {second*1000:.1f}ms") print(f"Expected: First ~400ms, Second ~30ms") ``` --- ## Summary **Key Takeaways**: 1. [PASS] **NLSQ is already fast** - Well-optimized, excellent scaling 2. [PASS] **Use CurveFit class** - Reuse compiled functions (biggest impact) 3. [PASS] **Good initial guesses** - Faster convergence 4. [PASS] **Profile before optimizing** - Identify actual bottlenecks 5. [PASS] **GPU for large problems** - Automatic acceleration when beneficial **Remember**: Premature optimization is the root of all evil. Profile first, optimize only what matters. --- **For More Information**: - Optimization case study: `docs/optimization_case_study.md` - Benchmark suite: `benchmarks/test_performance_regression.py` - Examples: `examples/` directory **Last Updated**: December 2025