nlsq.streaming.large_dataset.LargeDatasetFitter

class nlsq.streaming.large_dataset.LargeDatasetFitter(memory_limit_gb=8.0, config=None, curve_fit_class=None, logger=None, multistart=False, n_starts=10, sampler='lhs')

Bases: object

Large dataset curve fitting with automatic memory management and chunking.

This class fits datasets with millions to billions of points, including datasets that exceed available memory. It combines automatic chunking, progressive parameter refinement, and streaming optimization, and it uses dynamic memory monitoring and chunk size optimization to prevent memory overflow while maintaining fitting accuracy.

Core Capabilities

Automatic memory estimation based on data size and parameter count
Dynamic chunk size calculation considering available system memory
Sequential parameter refinement across data chunks with convergence tracking
Streaming optimization for unlimited datasets (processes all data, no subsampling)
Real-time progress monitoring with ETA for long-running fits
Full integration with NLSQ optimization algorithms and GPU acceleration
Multi-start optimization for global search (uses full data)

Memory Management Algorithm

Estimates total memory requirements from dataset size and parameter count
Calculates optimal chunk sizes considering available memory and safety margins
Monitors actual memory usage during processing to prevent overflow
Uses streaming optimization for extremely large datasets (processes all data)
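
The first two steps above can be sketched roughly as follows. This is a minimal illustration, not NLSQ's actual implementation: `estimate_memory_gb` and `choose_chunk_size` are hypothetical helpers, and both the per-point accounting (x, y, residual, plus one float64 Jacobian entry per parameter) and the 0.8 safety factor are assumptions. The `min_chunk_size` and `max_chunk_size` names mirror the `LDMemoryConfig` fields used in the examples further down.

```python
import math

BYTES_PER_FLOAT64 = 8


def estimate_memory_gb(n_points, n_params):
    """Rough footprint: x, y, residual, plus one Jacobian entry per parameter (assumed)."""
    floats_per_point = 3 + n_params
    return n_points * floats_per_point * BYTES_PER_FLOAT64 / 1024**3


def choose_chunk_size(n_points, n_params, memory_limit_gb, safety_factor=0.8,
                      min_chunk_size=10_000, max_chunk_size=1_000_000):
    """Largest chunk whose estimated footprint fits the budget, clamped to the limits."""
    budget_bytes = memory_limit_gb * safety_factor * 1024**3
    bytes_per_point = (3 + n_params) * BYTES_PER_FLOAT64
    points_that_fit = int(budget_bytes // bytes_per_point)
    chunk = max(min_chunk_size, min(max_chunk_size, points_that_fit))
    # Never ask for a chunk larger than the dataset itself.
    return min(chunk, n_points)


n_points, n_params = 10_000_000, 3
chunk_size = choose_chunk_size(n_points, n_params, memory_limit_gb=4.0)
n_chunks = math.ceil(n_points / chunk_size)
print(f"estimate: {estimate_memory_gb(n_points, n_params):.2f} GB, "
      f"chunk size: {chunk_size}, chunks: {n_chunks}")
```

For real numbers on a given machine, `get_memory_recommendations` (listed below) is the documented entry point; the sketch only shows the shape of the calculation.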

Processing Strategies

Single Pass: For datasets fitting within memory limits
Sequential Chunking: Processes data in optimal-sized chunks with parameter propagation
Streaming Optimization: Mini-batch gradient descent for unlimited datasets (no subsampling)
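
To illustrate the streaming strategy, here is a dependency-free sketch of mini-batch gradient descent in which every point is visited once per epoch (no subsampling) and the parameters carry over from one batch to the next. It is not NLSQ's optimizer: the linear model, the fixed learning rate, and the helper names `batches` and `stream_fit_line` are assumptions chosen to keep the example short.

```python
import random


def batches(xs, ys, batch_size):
    """Yield consecutive mini-batches that together cover every point exactly once."""
    for start in range(0, len(xs), batch_size):
        yield xs[start:start + batch_size], ys[start:start + batch_size]


def stream_fit_line(xs, ys, p0, batch_size=100, epochs=200, lr=0.2):
    """Fit y = a*x + b by mini-batch gradient descent on the mean squared error."""
    a, b = p0
    for _ in range(epochs):
        for bx, by in batches(xs, ys, batch_size):
            n = len(bx)
            # Gradient of mean((a*x + b - y)**2) over the current batch only.
            grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(bx, by)) / n
            grad_b = sum(2 * (a * x + b - y) for x, y in zip(bx, by)) / n
            # Parameters are updated in place and reused by the next batch.
            a -= lr * grad_a
            b -= lr * grad_b
    return a, b


rng = random.Random(0)
xs = [rng.random() for _ in range(1000)]   # noise-free synthetic data
ys = [3.0 * x + 0.5 for x in xs]
a_fit, b_fit = stream_fit_line(xs, ys, p0=(0.0, 0.0))
print(f"a = {a_fit:.4f}, b = {b_fit:.4f}")
```

Only one batch is resident at a time, so peak memory depends on `batch_size` rather than on the dataset size.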

Multi-Start Optimization

For medium-sized datasets (1M-100M points), multi-start optimization evaluates multiple starting points on the full data. The best starting point then seeds the full chunked optimization.
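
The idea can be sketched with a small Latin hypercube sampler over the parameter bounds. This is an illustration only: `latin_hypercube_starts` is a hypothetical helper written with the standard library, and ranking candidates by their initial sum of squared residuals is an assumption about how "best" is decided, not a description of NLSQ's internals.

```python
import math
import random


def latin_hypercube_starts(n_starts, lower, upper, seed=0):
    """Draw n_starts points so each dimension has exactly one point per stratum."""
    rng = random.Random(seed)
    columns = []
    for lo, hi in zip(lower, upper):
        # One uniform draw inside each of n_starts equal-width strata ...
        unit = [(i + rng.random()) / n_starts for i in range(n_starts)]
        # ... then shuffle so strata are paired randomly across dimensions.
        rng.shuffle(unit)
        columns.append([lo + u * (hi - lo) for u in unit])
    return [tuple(col[i] for col in columns) for i in range(n_starts)]


def model(x, a, b, c):
    return a * math.exp(-b * x) + c


xs = [i * 0.01 for i in range(1000)]
ys = [model(x, 2.5, 1.3, 0.1) for x in xs]


def cost(params):
    """Sum of squared residuals on the full data for one candidate start."""
    return sum((model(x, *params) - y) ** 2 for x, y in zip(xs, ys))


starts = latin_hypercube_starts(10, lower=[0, 0, 0], upper=[10, 5, 10])
best_start = min(starts, key=cost)
print("best starting point:", best_start)
```

In practice the winning candidate would be passed as `p0` to the chunked fit, which is why the multi-start example below also supplies `bounds`.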

Performance Characteristics

Maintains <1% parameter error for well-conditioned problems using chunking
Achieves 5-50x speedup over naive approaches through memory optimization
Scales to datasets of unlimited size using streaming (processes all data)
Runtime scales linearly with the number of chunks

Model Validation Caching (Task Group 7 - 5.1a)

Model functions are validated once per unique function identity, using a cache keyed by (id(func), id(func.__code__)). This avoids redundant validation across chunks and provides a 1-5% performance gain in chunked processing.
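
A minimal sketch of such a cache is shown here. Only the key, (id(func), id(func.__code__)), comes from the description above; the helper names `cache_key`, `validate_once`, and `check_signature` are hypothetical, and the validation rule itself is a placeholder.

```python
_validated_keys = set()


def cache_key(func):
    """Identity-based key; requires a plain Python function or lambda."""
    return (id(func), id(func.__code__))


def validate_once(func, validator):
    """Run validator(func) only the first time this function identity is seen."""
    key = cache_key(func)
    if key in _validated_keys:
        return False  # cache hit: validation is skipped
    validator(func)
    _validated_keys.add(key)
    return True


calls = []


def check_signature(func):
    """Placeholder validation: xdata plus at least one fit parameter."""
    calls.append(func.__name__)
    if func.__code__.co_argcount < 2:
        raise ValueError("model needs xdata plus at least one parameter")


def model(x, a, b):
    return a * x + b


# Simulate five chunks that all use the same model function.
results = [validate_once(model, check_signature) for _ in range(5)]
print(results, calls)
```

Two caveats of identity-based keys in general: callables without a `__code__` attribute (for example `functools.partial` objects) cannot be keyed this way, and CPython may reuse an `id` once the original object has been garbage-collected.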

Parameters:
memory_limit_gb (float, default 8.0) – Maximum memory usage in GB. System memory is auto-detected if None.
config (LDMemoryConfig, optional) – Advanced configuration for fine-tuning memory management behavior.
curve_fit_class (nlsq.minpack.CurveFit, optional) – Custom CurveFit instance for specialized fitting requirements.
multistart (bool, default False) – Enable multi-start optimization for global search.
n_starts (int, default 10) – Number of starting points for multi-start optimization.
sampler (str, default 'lhs') – Sampling strategy for multi-start: 'lhs', 'sobol', or 'halton'.

config

Active memory management configuration

Type: LDMemoryConfig

curve_fitter

Internal curve fitting engine with JAX acceleration

Type: nlsq.minpack.CurveFit

logger

Internal logging for performance monitoring and debugging

Type: Logger

fit : Main fitting method with automatic memory management

fit_with_progress : Fitting with real-time progress reporting and ETA

get_memory_recommendations : Pre-fitting memory analysis and strategy recommendations

Important: Chunking-Compatible Model Functions

When using chunked processing (for datasets larger than the memory limit), your model function MUST respect the size of xdata. During chunking, xdata is a subset of the full dataset, and your model must return output whose size matches that subset.

INCORRECT - Model ignores xdata size (will cause shape mismatch errors):

>>> import jax.numpy as jnp
>>> def bad_model(xdata, a, b):
...     # WRONG: Always returns the full array, ignoring xdata size
...     t_full = jnp.arange(10_000_000)  # Fixed size!
...     return a * jnp.exp(-b * t_full)  # Shape mismatch during chunking

CORRECT - Model respects xdata size:

>>> def good_model(xdata, a, b):
...     # CORRECT: Uses xdata as indices to return only the requested subset
...     indices = xdata.astype(jnp.int32)
...     return a * jnp.exp(-b * indices)  # Shape matches xdata

Alternative - Direct computation on xdata:

>>> def direct_model(xdata, a, b):
...     # CORRECT: Operates directly on xdata
...     return a * jnp.exp(-b * xdata)  # Shape automatically matches
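
One way to catch an incompatible model before a long fit is to call it on a small slice and compare lengths. The helper below, `respects_chunk_size`, is hypothetical and not part of NLSQ. It is written against plain Python lists so it runs without extra dependencies, and it relies only on slicing and `len()`, which NumPy and JAX arrays also support.

```python
import math

FULL_SIZE = 1000  # stands in for the size of the full dataset


def respects_chunk_size(model, xdata, params, chunk_size):
    """Return True if the model's output length matches a chunk of xdata."""
    chunk = xdata[:chunk_size]
    return len(model(chunk, *params)) == len(chunk)


def bad_model(xdata, a, b):
    # Ignores xdata and always returns FULL_SIZE values.
    return [a * math.exp(-b * t) for t in range(FULL_SIZE)]


def good_model(xdata, a, b):
    # One output value per element of xdata.
    return [a * math.exp(-b * x) for x in xdata]


x = [i * 0.01 for i in range(FULL_SIZE)]
print(respects_chunk_size(bad_model, x, (2.5, 1.3), chunk_size=100))
print(respects_chunk_size(good_model, x, (2.5, 1.3), chunk_size=100))
```

Note that the bad model passes this check when `chunk_size` equals the full dataset size, which is why the problem only surfaces once chunking starts.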

Examples

Basic usage with automatic configuration:

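>>> import numpy as np
>>> import jax.numpy as jnp
>>> from nlsq.streaming.large_dataset import LargeDatasetFitter
>>>
>>> # 10 million data points (synthetic data generated with NumPy)
>>> x = np.linspace(0, 10, 10_000_000)
>>> y = 2.5 * np.exp(-1.3 * x) + 0.1 + np.random.normal(0, 0.05, len(x))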
>>>
>>> fitter = LargeDatasetFitter(memory_limit_gb=4.0)
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0]
... )
>>> print(f"Parameters: {result.popt}")
>>> print(f"Chunks used: {result.n_chunks}")

Multi-start optimization:

>>> fitter = LargeDatasetFitter(
...     memory_limit_gb=4.0,
...     multistart=True,
...     n_starts=10,
...     sampler='lhs',
... )
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0],
...     bounds=([0, 0, 0], [10, 5, 10])
... )

Advanced configuration with progress monitoring:

>>> config = LDMemoryConfig(
...     memory_limit_gb=8.0,
...     min_chunk_size=10000,
...     max_chunk_size=1000000,
...     use_streaming=True,
...     streaming_batch_size=50000
... )
>>> fitter = LargeDatasetFitter(config=config)
>>>
>>> # Fit with progress bar for long-running operation
>>> # (exponential_model, x_huge, and y_huge are assumed to be defined elsewhere)
>>> result = fitter.fit_with_progress(
...     exponential_model, x_huge, y_huge, p0=[2, 1, 0]
... )

Memory analysis before processing:

>>> recommendations = fitter.get_memory_recommendations(len(x), n_params=3)
>>> print(f"Strategy: {recommendations['processing_strategy']}")
>>> print(f"Memory estimate: {recommendations['memory_estimate_gb']:.2f} GB")
>>> print(f"Recommended chunks: {recommendations['n_chunks']}")