Large Dataset Tutorial ====================== Learn how to efficiently handle datasets with millions to billions of points using NLSQ's advanced large dataset features. Learning Objectives ------------------- After completing this tutorial, you will: - Understand when and how to use ``curve_fit_large`` - Know how to estimate memory requirements before fitting - Be able to configure chunking and streaming strategies - Understand sparse Jacobian optimization - Master streaming optimization for unlimited-size data () Introduction to Large Dataset Challenges ----------------------------------------- Traditional curve fitting algorithms face several challenges with large datasets: - **Memory limitations**: Cannot load entire dataset into RAM - **Computational complexity**: O(n²) or O(n³) scaling with data size - **Numerical stability**: Large matrices can become ill-conditioned - **Processing time**: Sequential algorithms don't utilize modern hardware efficiently NLSQ addresses these challenges through: - Automatic chunking and memory management - Streaming data processing with zero data loss () - GPU/TPU acceleration via JAX - Sparse matrix optimizations - Mini-batch gradient descent for unlimited datasets Automatic Large Dataset Detection --------------------------------- The simplest approach is using ``curve_fit_large``, which automatically detects dataset size and chooses appropriate strategies: .. code-block:: python import numpy as np import jax.numpy as jnp from nlsq import curve_fit_large, estimate_memory_requirements # First, let's estimate memory requirements n_points = 10_000_000 # 10 million points n_params = 3 stats = estimate_memory_requirements(n_points, n_params) print(f"Dataset: {n_points:,} points, {n_params} parameters") print(f"Estimated memory: {stats.total_memory_estimate_gb:.2f} GB") print(f"Recommended chunks: {stats.n_chunks}") print(f"Chunk size: {stats.recommended_chunk_size:,}") print(f"Processing strategy: {stats.processing_strategy}") **Example Output:** .. code-block:: Dataset: 10,000,000 points, 3 parameters Estimated memory: 1.34 GB Recommended chunks: 4 Chunk size: 2,500,000 Processing strategy: chunked Now let's generate and fit this large dataset: .. code-block:: python # Generate large synthetic dataset print("Generating data...") x = np.linspace(0, 5, n_points) # True parameters true_params = [2.5, 0.8, 0.3] # Add realistic noise y_true = true_params[0] * np.exp(-true_params[1] * x) + true_params[2] noise = np.random.normal(0, 0.05, n_points) y = y_true + noise print(f"Data size: x={x.shape}, y={y.shape}") print(f"Memory usage: ~{(x.nbytes + y.nbytes) / 1e6:.1f} MB") # Define model function using JAX def exponential_model(x, a, b, c): return a * jnp.exp(-b * x) + c # Fit using automatic large dataset handling print("Starting fit...") popt, pcov = curve_fit_large( exponential_model, x, y, p0=[2.0, 0.5, 0.2], memory_limit_gb=2.0, # Limit memory usage show_progress=True, # Show progress bar ) # Display results param_errors = np.sqrt(np.diag(pcov)) print("\nFitting Results:") print("=" * 40) param_names = ["Amplitude (a)", "Decay rate (b)", "Offset (c)"] for name, true_val, fit_val, error in zip(param_names, true_params, popt, param_errors): percent_error = 100 * abs(fit_val - true_val) / true_val print(f"{name}: {fit_val:.6f} ± {error:.6f}") print(f" True value: {true_val}") print(f" Error: {percent_error:.3f}%") print() Manual Configuration with LargeDatasetFitter --------------------------------------------- For more control over the fitting process, use the ``LargeDatasetFitter`` class: .. code-block:: python from nlsq import LargeDatasetFitter from nlsq.streaming.large_dataset import LDMemoryConfig # Create custom configuration config = LDMemoryConfig( memory_limit_gb=4.0, # Maximum memory usage min_chunk_size=50000, # Minimum points per chunk max_chunk_size=2000000, # Maximum points per chunk # : Streaming optimization automatically handles very large datasets use_streaming=True, # Enable streaming for unlimited data streaming_batch_size=100000, # Mini-batch size for streaming ) # Create fitter with custom configuration fitter = LargeDatasetFitter(config=config) # Get processing recommendations recommendations = fitter.get_memory_recommendations(n_points, n_params) print("Processing Strategy Recommendations:") print(f" Strategy: {recommendations['processing_strategy']}") print(f" Memory estimate: {recommendations['memory_estimate_gb']:.2f} GB") print(f" Recommended chunks: {recommendations['n_chunks']}") print(f" Chunk size: {recommendations['chunk_size']:,}") # Perform fit with detailed progress reporting result = fitter.fit_with_progress( exponential_model, x, y, p0=[2.0, 0.5, 0.2], ) # Examine detailed results print(f"\nDetailed Results:") print(f" Success: {result.success}") print(f" Message: {result.message}") # Note: n_chunks only available for multi-chunk fits print(f" Fitted parameters: {result.popt}") print(f" Total function evaluations: {result.nfev}") Adaptive Hybrid Streaming for Unlimited Datasets ------------------------------------------------ For datasets too large to fit in memory, NLSQ uses adaptive hybrid streaming with L-BFGS warmup and streaming Gauss-Newton updates. This processes 100% of the data with bounded memory. .. code-block:: python from nlsq import AdaptiveHybridStreamingOptimizer, HybridStreamingConfig # Simulate billion-point dataset (or load from HDF5) n_huge = 1_000_000_000 # 1 billion points # Check memory requirements huge_stats = estimate_memory_requirements(n_huge, 3) print(f"Billion-point dataset:") print(f" Memory estimate: {huge_stats.total_memory_estimate_gb:.1f} GB") print(f" Processing strategy: streaming (processes ALL data)") # For demonstration, we'll use a smaller dataset n_demo = 5_000_000 # 5 million points x_demo = np.linspace(0, 10, n_demo) y_demo = 3.2 * np.exp(-0.4 * x_demo) + 0.8 + np.random.normal(0, 0.1, n_demo) # Use LargeDatasetFitter with adaptive hybrid streaming enabled config = LDMemoryConfig( memory_limit_gb=2.0, use_streaming=True, # Enable streaming for very large datasets streaming_batch_size=50000, streaming_max_epochs=10, ) fitter = LargeDatasetFitter(config=config) print(f"\nFitting {n_demo:,} points with adaptive hybrid streaming...") stream_result = fitter.fit_with_progress( exponential_model, x_demo, y_demo, p0=[3.0, 0.3, 0.5], ) print(f"Streaming fit parameters: {stream_result.popt}") print(f"Points processed: {n_demo:,} (ALL data, no loss)") print(f"Convergence: {stream_result.success}") Sparse Jacobian Optimization ----------------------------- Many large-scale problems have sparse Jacobian structures. NLSQ can detect and exploit this: .. code-block:: python from nlsq import SparseJacobianComputer # Create a problem with sparse structure # Example: Multiple independent exponential components def multi_exponential(x, *params): """Sum of multiple independent exponential decays.""" n_components = len(params) // 3 # Each component has 3 parameters result = jnp.zeros_like(x) for i in range(n_components): a = params[3 * i] # amplitude b = params[3 * i + 1] # decay rate c = params[3 * i + 2] # offset result += a * jnp.exp(-b * x) + c return result # Generate data with 5 components (15 parameters total) n_components = 5 n_points_sparse = 100000 x_sparse = np.linspace(0, 3, n_points_sparse) # True parameters for 5 components true_sparse_params = [] for i in range(n_components): true_sparse_params.extend( [2.0 + 0.5 * i, 0.5 + 0.2 * i, 0.1 * i] # amplitude # decay rate # offset ) y_sparse = multi_exponential(x_sparse, *true_sparse_params) y_sparse += 0.02 * np.random.normal(size=len(x_sparse)) # Detect sparsity sparse_computer = SparseJacobianComputer(sparsity_threshold=0.1) # Use a sample to detect sparsity pattern sample_size = min(1000, len(x_sparse)) sample_indices = np.random.choice(len(x_sparse), sample_size, replace=False) x_sample = x_sparse[sample_indices] p0_sparse = [1.8 + 0.4 * i for i in range(n_components * 3)] # Initial guess # Detect sparsity pattern pattern, sparsity = sparse_computer.detect_sparsity_pattern( multi_exponential, p0_sparse, x_sample ) print(f"Jacobian Analysis:") print(f" Matrix size: {pattern.shape}") print(f" Sparsity ratio: {sparsity:.1%}") print(f" Is sparse: {sparsity > 0.1}") if sparsity > 0.1: # If more than 10% sparse print(" -> Using sparse optimization algorithms") else: print(" -> Using dense optimization algorithms") Adaptive Hybrid Streaming for Unlimited Data ---------------------------------------------- For datasets that cannot fit in memory, use adaptive hybrid streaming with chunked Gauss-Newton updates: .. code-block:: python from nlsq import AdaptiveHybridStreamingOptimizer, HybridStreamingConfig # Configure adaptive hybrid streaming config = HybridStreamingConfig( chunk_size=50000, gauss_newton_max_iterations=20, enable_checkpoints=True, checkpoint_frequency=100, ) optimizer = AdaptiveHybridStreamingOptimizer(config) # Fit from in-memory arrays (chunked internally) stream_result = optimizer.fit( (x_demo, y_demo), exponential_model, p0=np.array([2.5, 0.4, 0.3]), verbose=1, ) print("Streaming Results:") print(f" Converged: {stream_result['success']}") print(f" Final parameters: {stream_result['x']}") print( f" Final cost: {stream_result['streaming_diagnostics']['gauss_newton_diagnostics']['final_cost']}" ) Performance Comparison ---------------------- Let's compare different strategies for the same large dataset: .. code-block:: python import time # Test dataset n_test = 2_000_000 # 2 million points x_test = np.linspace(0, 4, n_test) y_test = 1.8 * np.exp(-0.7 * x_test) + 0.2 + np.random.normal(0, 0.03, n_test) strategies = [ ( "Standard curve_fit_large", lambda: curve_fit_large(exponential_model, x_test, y_test, p0=[1.5, 0.5, 0.1]), ), ( "Chunked (4 chunks)", lambda: curve_fit_large( exponential_model, x_test, y_test, p0=[1.5, 0.5, 0.1], memory_limit_gb=0.5, # Force chunking ), ), ( "Adaptive hybrid streaming", lambda: curve_fit_large( exponential_model, x_test, y_test, p0=[1.5, 0.5, 0.1], memory_limit_gb=2.0, # Use adaptive hybrid streaming for huge data # Streaming tiers use AdaptiveHybridStreamingOptimizer ), ), ] results = {} print(f"Performance Comparison ({n_test:,} points)") print("=" * 60) for name, strategy in strategies: print(f"\nTesting: {name}") start_time = time.time() try: popt, pcov = strategy() duration = time.time() - start_time error = np.sqrt(np.mean((y_test - exponential_model(x_test, *popt)) ** 2)) results[name] = { "time": duration, "params": popt, "rms_error": error, "success": True, } print(f" Time: {duration:.2f} seconds") print(f" Parameters: [{popt[0]:.3f}, {popt[1]:.3f}, {popt[2]:.3f}]") print(f" RMS Error: {error:.5f}") except Exception as e: print(f" Failed: {e}") results[name] = {"success": False, "error": str(e)} # Summary print("\nSummary:") print("-" * 40) successful_results = {k: v for k, v in results.items() if v.get("success", False)} if successful_results: fastest = min(successful_results, key=lambda k: successful_results[k]["time"]) most_accurate = min( successful_results, key=lambda k: successful_results[k]["rms_error"] ) print(f"Fastest: {fastest} ({successful_results[fastest]['time']:.2f}s)") print( f"Most accurate: {most_accurate} (RMS: {successful_results[most_accurate]['rms_error']:.6f})" ) Best Practices for Large Datasets ---------------------------------- **1. Estimate Memory First** Always check memory requirements before fitting: .. code-block:: python # Check before processing stats = estimate_memory_requirements(len(x), n_parameters) if stats.total_memory_estimate_gb > available_memory_gb: print("Consider using chunking or adaptive hybrid streaming") **2. Choose Appropriate Strategies** - **< 1M points**: Use standard ``curve_fit`` - **1M - 10M points**: Use ``curve_fit_large`` with default settings - **10M - 100M points**: Use chunking with progress monitoring - **> 100M points**: Use adaptive hybrid streaming (processes 100% of data) **3. Optimize for Your Hardware** .. code-block:: python # Check available devices import jax print(f"Available devices: {jax.devices()}") # GPU memory is typically more limited if jax.devices()[0].device_kind == "gpu": memory_limit_gb = 2.0 # Conservative for GPU else: memory_limit_gb = 8.0 # More generous for CPU **4. Monitor Progress for Long Fits** .. code-block:: python # Always use progress bars for large datasets popt, pcov = curve_fit_large( func, x, y, show_progress=True, memory_limit_gb=4.0 # Essential for user experience ) **5. Validate Results** .. code-block:: python # Check residuals and parameter uncertainties residuals = y - func(x, *popt) rms_residual = np.sqrt(np.mean(residuals**2)) param_errors = np.sqrt(np.diag(pcov)) print(f"RMS residual: {rms_residual:.6f}") print(f"Max parameter uncertainty: {np.max(param_errors / np.abs(popt)):.2%}") Troubleshooting Large Dataset Issues ------------------------------------- **Memory Errors** .. code-block:: python # Use chunking or streaming for large datasets try: popt, pcov = curve_fit_large(func, x, y) except MemoryError: print("Using streaming optimization to handle unlimited data...") # : Streaming processes 100% of data with zero accuracy loss popt, pcov = curve_fit_large(func, x, y, memory_limit_gb=2.0, chunk_size=100000) **Convergence Issues** .. code-block:: python # Try different initial guesses or increase tolerances popt, pcov = curve_fit_large( func, x, y, p0=better_initial_guess, ftol=1e-6, xtol=1e-6 # Looser tolerance ) **Performance Issues** .. code-block:: python # Profile your function for JAX compatibility import jax # Test function compilation compiled_func = jax.jit(func) test_result = compiled_func(x[:100], *p0) # Should not raise errors Interactive Notebooks --------------------- Hands-on tutorials for large dataset handling: **Core Tutorials:** - `Large Dataset Demo `_ - Handle millions of data points with automatic chunking **Streaming and Fault Tolerance:** - `Hybrid Streaming API `_ - Parameter normalization and L-BFGS warmup **HPC and Cluster Computing:** - `HPC and Checkpointing `_ - PBS Pro, fault tolerance, and cluster computing Next Steps ---------- Congratulations! You now have the tools to handle datasets of any size. Continue with: 1. :doc:`../api/large_datasets_api` - Advanced fitting APIs and parameter constraints 2. :doc:`../api/performance_benchmarks` - Performance analysis and optimization 3. Browse the `examples directory `_ for more complex scenarios Further Reading --------------- - :doc:`../api/nlsq.large_dataset` - Comprehensive technical details - :doc:`../api/large_datasets_api` - Complete function documentation - `JAX Documentation `_ - Understanding JAX transformations