nlsq.streaming.large_dataset.LargeDatasetFitter
- class nlsq.streaming.large_dataset.LargeDatasetFitter(memory_limit_gb=8.0, config=None, curve_fit_class=None, logger=None, multistart=False, n_starts=10, sampler='lhs')
Bases: object
Large dataset curve fitting with automatic memory management and chunking.
This class handles datasets with millions to billions of points that exceed available memory through automatic chunking, progressive parameter refinement, and streaming optimization. It maintains fitting accuracy while preventing memory overflow through dynamic memory monitoring and chunk size optimization.
Core Capabilities
Automatic memory estimation based on data size and parameter count
Dynamic chunk size calculation considering available system memory
Sequential parameter refinement across data chunks with convergence tracking
Streaming optimization for unlimited datasets (processes all data, no subsampling)
Real-time progress monitoring with ETA for long-running fits
Full integration with NLSQ optimization algorithms and GPU acceleration
Multi-start optimization for global search (uses full data)
Memory Management Algorithm
Estimates total memory requirements from dataset size and parameter count
Calculates optimal chunk sizes considering available memory and safety margins
Monitors actual memory usage during processing to prevent overflow
Uses streaming optimization for extremely large datasets (processes all data)
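The estimation and chunk-sizing steps above can be sketched as follows. This is a minimal illustration, not NLSQ's actual formula: the function `estimate_chunking`, the per-point byte count, the `safety_factor`, and the chunk-size limits are all assumptions chosen for the example.

```python
import math


def estimate_chunking(n_points, n_params, memory_limit_gb=8.0,
                      bytes_per_value=8, safety_factor=0.5,
                      min_chunk_size=1_000, max_chunk_size=1_000_000):
    """Hypothetical memory estimate and chunk-size calculation."""
    # Assumed footprint per data point: x, y, one residual, and one
    # Jacobian row with n_params entries, all stored as float64.
    bytes_per_point = bytes_per_value * (3 + n_params)
    total_bytes = n_points * bytes_per_point

    # Keep a safety margin below the configured limit.
    budget_bytes = memory_limit_gb * 1024**3 * safety_factor

    if total_bytes <= budget_bytes:
        # Everything fits: a single pass is enough.
        chunk_size = n_points
    else:
        # Largest chunk that fits the budget, clamped to the allowed range.
        chunk_size = int(budget_bytes // bytes_per_point)
        chunk_size = max(min_chunk_size, min(max_chunk_size, chunk_size))

    return {
        "memory_estimate_gb": total_bytes / 1024**3,
        "chunk_size": chunk_size,
        "n_chunks": math.ceil(n_points / chunk_size),
    }


small = estimate_chunking(n_points=1_000, n_params=3)
large = estimate_chunking(n_points=10_000_000, n_params=3, memory_limit_gb=0.01)
print(small["n_chunks"], large["n_chunks"], large["chunk_size"])
```

For real workloads, `get_memory_recommendations` is the documented way to obtain this analysis; the sketch only shows the kind of arithmetic involved.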
Processing Strategies
Single Pass: For datasets fitting within memory limits
Sequential Chunking: Processes data in optimal-sized chunks with parameter propagation
Streaming Optimization: Mini-batch gradient descent for unlimited datasets (no subsampling)
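To make the sequential chunking strategy concrete, the sketch below fits an exponential decay chunk by chunk and passes each chunk's result on as the starting point for the next. It is an assumption-laden illustration: the helper names (`gauss_newton`, `fit_in_chunks`) are hypothetical, and a plain NumPy Gauss-Newton loop stands in for NLSQ's own optimizer.

```python
import numpy as np


def model(x, a, b):
    return a * np.exp(-b * x)


def jacobian(x, a, b):
    # Partial derivatives of the model with respect to a and b.
    e = np.exp(-b * x)
    return np.column_stack([e, -a * x * e])


def gauss_newton(x, y, p0, n_iter=20):
    """Undamped Gauss-Newton on one chunk (stand-in for the real solver)."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        residual = y - model(x, *p)
        step, *_ = np.linalg.lstsq(jacobian(x, *p), residual, rcond=None)
        p = p + step
    return p


def fit_in_chunks(x, y, p0, chunk_size):
    """Process the data chunk by chunk, propagating the parameters."""
    p = np.asarray(p0, dtype=float)
    for start in range(0, len(x), chunk_size):
        chunk = slice(start, start + chunk_size)
        # The previous chunk's result is the initial guess for this chunk.
        p = gauss_newton(x[chunk], y[chunk], p)
    return p


x = np.linspace(0.0, 4.0, 4000)
y = model(x, 2.5, 1.3)  # noise-free data keeps the example deterministic
print(fit_in_chunks(x, y, p0=[2.0, 1.0], chunk_size=1000))
```

Only one chunk is held by the solver at a time, which is what bounds memory use; the cost is that each chunk sees only part of the data, so a reasonable starting point matters.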
Multi-Start Optimization
For medium-sized datasets (1M-100M points), multi-start optimization explores multiple starting points on full data, and the best starting point is then used for the full chunked optimization.
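As a rough illustration of how starting points might be generated with the 'lhs' sampler and then ranked, the sketch below draws a Latin hypercube sample inside the parameter bounds and picks the candidate with the lowest cost. The functions `latin_hypercube_starts` and `pick_best_start` are hypothetical and written in plain NumPy; NLSQ's internal sampler may differ.

```python
import numpy as np


def latin_hypercube_starts(lower, upper, n_starts, seed=0):
    """Draw n_starts points so each dimension has one point per stratum."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_params = lower.size

    # One jittered point inside each of n_starts equal-width strata.
    unit = (np.arange(n_starts)[:, None]
            + rng.random((n_starts, n_params))) / n_starts

    # Shuffle the strata independently in every dimension.
    for j in range(n_params):
        unit[:, j] = rng.permutation(unit[:, j])

    # Scale from the unit cube to the parameter bounds.
    return lower + unit * (upper - lower)


def pick_best_start(cost, starts):
    """Return the starting point with the lowest cost."""
    costs = [cost(p) for p in starts]
    return starts[int(np.argmin(costs))]


starts = latin_hypercube_starts([0, 0, 0], [10, 5, 10], n_starts=10)
target = np.array([2.5, 1.3, 0.1])
best = pick_best_start(lambda p: float(np.sum((p - target) ** 2)), starts)
print(best)
```

In a real fit the cost would be the sum of squared residuals of the model on the data; a simple distance to a known target is used here only to keep the example short.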
Performance Characteristics
Maintains <1% parameter error for well-conditioned problems using chunking
Achieves 5-50x speedup over naive approaches through memory optimization
Scales to datasets of unlimited size using streaming (processes all data)
Runtime scales linearly with the number of chunks
Model Validation Caching (Task Group 7 - 5.1a)
Model functions are validated once per unique function identity using a cache keyed by (id(func), id(func.__code__)). This avoids redundant validation across chunks, providing 1-5% performance gain in chunked processing.
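A cache keyed this way can be sketched in a few lines. The names `_validated` and `validate_once` are hypothetical and only illustrate the idea; they are not taken from the NLSQ source.

```python
# Hypothetical module-level cache of already-validated model functions.
_validated = set()


def validate_once(func, validator):
    """Run validator(func) only the first time this function is seen.

    Returns True if validation ran, False if the cached result was reused.
    """
    # Same key as described above: the function object and its code object.
    key = (id(func), id(func.__code__))
    if key in _validated:
        return False
    validator(func)  # expected to raise if the model is invalid
    _validated.add(key)
    return True


def exponential(x, a, b):
    return a * x + b


checked = []
for _chunk in range(5):
    # Validation runs on the first chunk only; later chunks hit the cache.
    validate_once(exponential, checked.append)
print(len(checked))
```

One caveat with any `id()`-based key: Python may reuse an id after the original object has been garbage collected, so such a cache is only reliable while the function object stays alive.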
- param memory_limit_gb:
Maximum memory usage in GB. System memory is auto-detected if None.
- type memory_limit_gb:
float, default 8.0
- param config:
Advanced configuration for fine-tuning memory management behavior.
- type config:
LDMemoryConfig, optional
- param curve_fit_class:
Custom CurveFit instance for specialized fitting requirements.
- type curve_fit_class:
nlsq.minpack.CurveFit, optional
- param multistart:
Enable multi-start optimization for global search.
- type multistart:
bool, default False
- param n_starts:
Number of starting points for multi-start optimization.
- type n_starts:
int, default 10
- param sampler:
Sampling strategy for multi-start: ‘lhs’, ‘sobol’, or ‘halton’.
- type sampler:
str, default ‘lhs’
- config
Active memory management configuration
- Type:
LDMemoryConfig
- curve_fitter
Internal curve fitting engine with JAX acceleration
- Type:
nlsq.minpack.CurveFit
- logger
Internal logging for performance monitoring and debugging
- Type:
Logger
- fit : Main fitting method with automatic memory management
- fit_with_progress : Fitting with real-time progress reporting and ETA
- get_memory_recommendations : Pre-fitting memory analysis and strategy recommendations
Important: Chunking-Compatible Model Functions
When using chunked processing (for datasets > memory limit), your model function MUST respect the size of xdata. During chunking, xdata will be a subset of the full dataset, and your model must return output matching that subset size.
INCORRECT: Model ignores xdata size (will cause shape mismatch errors):
>>> def bad_model(xdata, a, b):
...     # WRONG: Always returns full array, ignoring xdata size
...     t_full = jnp.arange(10_000_000)  # Fixed size!
...     return a * jnp.exp(-b * t_full)  # Shape mismatch during chunking
CORRECT: Model respects xdata size:
>>> def good_model(xdata, a, b):
...     # CORRECT: Uses xdata as indices to return only requested subset
...     indices = xdata.astype(jnp.int32)
...     return a * jnp.exp(-b * indices)  # Shape matches xdata
Alternative: Direct computation on xdata:
>>> def direct_model(xdata, a, b):
...     # CORRECT: Operates directly on xdata
...     return a * jnp.exp(-b * xdata)  # Shape automatically matches
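Before starting a long fit, it can help to check that a model honors this contract by calling it on a small slice of the data. The helper below, `is_chunk_compatible`, is a hypothetical sketch written with NumPy for brevity and is not part of NLSQ.

```python
import numpy as np


def is_chunk_compatible(model, xdata, params, chunk_size=1000):
    """Return True if the model's output shape matches a chunk of xdata."""
    chunk = xdata[:chunk_size]
    try:
        out = np.asarray(model(chunk, *params))
    except Exception:
        # A model that fails on a subset cannot be used with chunking.
        return False
    return out.shape == chunk.shape


def bad_model(xdata, a, b):
    t_full = np.arange(10_000)  # fixed size, ignores xdata
    return a * np.exp(-b * t_full)


def direct_model(xdata, a, b):
    return a * np.exp(-b * xdata)  # shape follows xdata


x = np.linspace(0.0, 10.0, 10_000)
print(is_chunk_compatible(bad_model, x, (2.0, 1.0)))
print(is_chunk_compatible(direct_model, x, (2.0, 1.0)))
```

A check like this is cheap because it evaluates the model on one small slice only, yet it catches the fixed-size mistake before any chunked processing starts.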
Examples
Basic usage with automatic configuration:
>>> import numpy as np
>>> import jax.numpy as jnp
>>> from nlsq.streaming.large_dataset import LargeDatasetFitter
>>>
>>> # 10 million data points
>>> x = np.linspace(0, 10, 10_000_000)
>>> y = 2.5 * jnp.exp(-1.3 * x) + 0.1 + np.random.normal(0, 0.05, len(x))
>>>
>>> fitter = LargeDatasetFitter(memory_limit_gb=4.0)
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0]
... )
>>> print(f"Parameters: {result.popt}")
>>> print(f"Chunks used: {result.n_chunks}")
Multi-start optimization:
>>> fitter = LargeDatasetFitter(
...     memory_limit_gb=4.0,
...     multistart=True,
...     n_starts=10,
...     sampler='lhs',
... )
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0],
...     bounds=([0, 0, 0], [10, 5, 10])
... )
Advanced configuration with progress monitoring:
>>> config = LDMemoryConfig(
...     memory_limit_gb=8.0,
...     min_chunk_size=10000,
...     max_chunk_size=1000000,
...     use_streaming=True,
...     streaming_batch_size=50000
... )
>>> fitter = LargeDatasetFitter(config=config)
>>>
>>> # Fit with progress bar for long-running operation
>>> result = fitter.fit_with_progress(
...     exponential_model, x_huge, y_huge, p0=[2, 1, 0]
... )
Memory analysis before processing:
>>> recommendations = fitter.get_memory_recommendations(len(x), n_params=3)
>>> print(f"Strategy: {recommendations['processing_strategy']}")
>>> print(f"Memory estimate: {recommendations['memory_estimate_gb']:.2f} GB")
>>> print(f"Recommended chunks: {recommendations['n_chunks']}")
See also
curve_fit_large : High-level function with automatic dataset size detection
LDMemoryConfig : Configuration class for memory management parameters
estimate_memory_requirements : Standalone function for memory estimation
Notes
The sequential chunking algorithm maintains parameter accuracy by using each chunk’s result as the initial guess for the next chunk. This approach typically maintains fitting accuracy within 0.1% of single-pass results for well-conditioned problems while enabling processing of arbitrarily large datasets.
For extremely large datasets, streaming optimization processes all data using mini-batch gradient descent with no subsampling. This avoids the accuracy loss of the subsampling approaches that were removed in v0.2.0.
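The mini-batch idea can be illustrated with a short NumPy sketch in which every data point is visited once per epoch and nothing is subsampled. The function `minibatch_gd`, its learning rate, and the linear test model are assumptions made for the example and do not reflect NLSQ's internal streaming optimizer.

```python
import numpy as np


def minibatch_gd(model, jac, x, y, p0, batch_size=100, lr=0.3,
                 n_epochs=50, seed=0):
    """Minimize the mean squared residual with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, dtype=float)
    n = len(x)
    for _ in range(n_epochs):
        # Shuffle once per epoch; every point lands in exactly one batch.
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = model(x[idx], *p) - y[idx]
            # Gradient of mean(residual**2) with respect to the parameters.
            grad = 2.0 * jac(x[idx], *p).T @ residual / len(idx)
            p = p - lr * grad
    return p


def line(x, a, b):
    return a * x + b


def line_jac(x, a, b):
    return np.column_stack([x, np.ones_like(x)])


x = np.linspace(0.0, 1.0, 2000)
y = line(x, 2.0, 0.5)  # noise-free so the result is deterministic
print(minibatch_gd(line, line_jac, x, y, p0=[0.0, 0.0]))
```

Because only one batch is in use at a time, memory stays bounded by `batch_size` regardless of the dataset size. A linear model is used here so a fixed learning rate is safe; nonlinear models generally need a more careful step-size choice.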
- __init__(memory_limit_gb=8.0, config=None, curve_fit_class=None, logger=None, multistart=False, n_starts=10, sampler='lhs')
Initialize LargeDatasetFitter.
- Parameters:
memory_limit_gb (float, optional) – Memory limit in GB (default: 8.0)
config (LDMemoryConfig, optional) – Custom memory configuration
curve_fit_class (nlsq.minpack.CurveFit, optional) – Custom CurveFit instance to use
logger (logging.Logger, optional) – External logger instance for integration with application logging. If None, uses NLSQ’s internal logger. This allows chunk failure warnings to appear in your application’s logs.
multistart (bool, optional) – Enable multi-start optimization for global search (default: False). When enabled, explores multiple starting points on full data before running the full chunked optimization.
n_starts (int, optional) – Number of starting points for multi-start optimization (default: 10). Set to 0 to disable multi-start even when multistart=True.
sampler (str, optional) – Sampling strategy for generating starting points (default: ‘lhs’). Options: ‘lhs’ (Latin Hypercube), ‘sobol’, ‘halton’.
- estimate_requirements(n_points, n_params)
Estimate memory requirements and processing strategy.
- fit(f, xdata, ydata, p0=None, bounds=(-inf, inf), method='trf', solver='auto', **kwargs)
Fit curve to large dataset with automatic memory management.
- Parameters:
f (callable) – The model function f(x, *params) -> y
xdata (np.ndarray) – Independent variable data
ydata (np.ndarray) – Dependent variable data
p0 (array-like, optional) – Initial parameter guess
bounds (tuple, optional) – Parameter bounds (lower, upper)
method (str, optional) – Optimization method (default: ‘trf’)
solver (str, optional) – Solver type (default: ‘auto’)
**kwargs – Additional arguments passed to curve_fit
- Returns:
Optimization result with fitted parameters and statistics
- Return type:
- fit_with_progress(f, xdata, ydata, p0=None, bounds=(-inf, inf), method='trf', solver='auto', **kwargs)
Fit curve with progress reporting for long-running fits.
- Parameters:
f (callable) – The model function f(x, *params) -> y
xdata (np.ndarray) – Independent variable data
ydata (np.ndarray) – Dependent variable data
p0 (array-like, optional) – Initial parameter guess
bounds (tuple, optional) – Parameter bounds (lower, upper)
method (str, optional) – Optimization method (default: ‘trf’)
solver (str, optional) – Solver type (default: ‘auto’)
**kwargs – Additional arguments passed to curve_fit
- Returns:
Optimization result with fitted parameters and statistics
- Return type: