nlsq.streaming.large_dataset.LargeDatasetFitter

class nlsq.streaming.large_dataset.LargeDatasetFitter(memory_limit_gb=8.0, config=None, curve_fit_class=None, logger=None, multistart=False, n_starts=10, sampler='lhs')[source]

Bases: object

Large dataset curve fitting with automatic memory management and chunking.

This class handles datasets with millions to billions of points that exceed available memory through automatic chunking, progressive parameter refinement, and streaming optimization. It maintains fitting accuracy while preventing memory overflow through dynamic memory monitoring and chunk size optimization.

Core Capabilities

  • Automatic memory estimation based on data size and parameter count

  • Dynamic chunk size calculation considering available system memory

  • Sequential parameter refinement across data chunks with convergence tracking

  • Streaming optimization for datasets of any size (all data is processed; no subsampling)

  • Real-time progress monitoring with ETA for long-running fits

  • Full integration with NLSQ optimization algorithms and GPU acceleration

  • Multi-start optimization for global search (uses full data)

Memory Management Algorithm

  1. Estimates total memory requirements from dataset size and parameter count

  2. Calculates optimal chunk sizes considering available memory and safety margins

  3. Monitors actual memory usage during processing to prevent overflow

  4. Uses streaming optimization for extremely large datasets (processes all data)
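Steps 1 and 2 above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not NLSQ's actual estimator: the function names, `overhead_factor`, and `safety_margin` are hypothetical, and the memory model (three float64 vectors plus a dense Jacobian) is a simplification. `min_chunk_size` and `max_chunk_size` borrow their names from the LDMemoryConfig example on this page.

```python
import math

BYTES_PER_FLOAT64 = 8


def estimate_memory_gb(n_points, n_params, overhead_factor=3.0):
    """Rough memory estimate in GB (assumed model, not NLSQ's formula).

    Counts x, y, and residual vectors plus an n_points x n_params Jacobian,
    then multiplies by a hypothetical overhead factor for temporaries.
    """
    n_floats = n_points * (3 + n_params)
    return n_floats * BYTES_PER_FLOAT64 * overhead_factor / 1024**3


def plan_chunks(n_points, n_params, memory_limit_gb, safety_margin=0.8,
                min_chunk_size=10_000, max_chunk_size=1_000_000):
    """Return (chunk_size, n_chunks) so that one chunk fits the usable budget."""
    usable_gb = memory_limit_gb * safety_margin
    total_gb = estimate_memory_gb(n_points, n_params)
    if total_gb <= usable_gb:
        return n_points, 1  # everything fits: single pass
    gb_per_point = total_gb / n_points
    chunk_size = int(usable_gb / gb_per_point)
    # Clamp to the configured range, and never exceed the dataset itself.
    chunk_size = max(min_chunk_size, min(chunk_size, max_chunk_size))
    chunk_size = min(chunk_size, n_points)
    return chunk_size, math.ceil(n_points / chunk_size)


small_plan = plan_chunks(1_000, 3, memory_limit_gb=8.0)
large_plan = plan_chunks(1_000_000_000, 3, memory_limit_gb=4.0)
```

Under these assumptions the small dataset is handled in a single pass, while the billion-point dataset is capped at `max_chunk_size` points per chunk. Step 3 (monitoring actual usage at run time) is not covered by this sketch.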

Processing Strategies

  • Single Pass: For datasets fitting within memory limits

  • Sequential Chunking: Processes data in optimal-sized chunks with parameter propagation

  • Streaming Optimization: Mini-batch gradient descent for unlimited datasets (no subsampling)

Multi-Start Optimization

For medium-sized datasets (1M-100M points), multi-start optimization evaluates multiple starting points on the full dataset. The best starting point then seeds the full chunked optimization.
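To make the idea concrete, here is a small NumPy-only sketch of Latin Hypercube sampling of starting points inside parameter bounds, followed by a simple selection step. Both helper names are hypothetical. Scoring each candidate by its sum of squared errors is a simplification chosen for illustration; how NLSQ ranks starting points internally is not specified here.

```python
import numpy as np


def latin_hypercube_starts(lower, upper, n_starts, seed=0):
    """Draw n_starts points so that each parameter's range is split into
    n_starts equal strata and every stratum is hit exactly once."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n_params = lower.size
    # Independent random ordering of the strata for each parameter.
    strata = np.column_stack(
        [rng.permutation(n_starts) for _ in range(n_params)]
    )
    # Jitter each point uniformly inside its stratum, then scale to bounds.
    unit = (strata + rng.random((n_starts, n_params))) / n_starts
    return lower + unit * (upper - lower)


def pick_best_start(model, x, y, starts):
    """Return the candidate with the lowest sum of squared errors."""
    sse = [np.sum((model(x, *p) - y) ** 2) for p in starts]
    return starts[int(np.argmin(sse))]


def model(x, a, b, c):
    return a * np.exp(-b * x) + c


x = np.linspace(0, 5, 200)
y = model(x, 2.5, 1.3, 0.1)
starts = latin_hypercube_starts([0, 0, 0], [10, 5, 10], n_starts=10)
best = pick_best_start(model, x, y, starts)
```

The bounds match the multi-start example in the Examples section. Finite bounds are needed for this kind of sampling, since the strata are defined by the lower and upper limits.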

Performance Characteristics

  • Maintains <1% parameter error for well-conditioned problems using chunking

  • Achieves 5-50x speedup over naive approaches through memory optimization

  • Scales to datasets of unlimited size using streaming (processes all data)

  • Provides linear time complexity with respect to chunk count

Model Validation Caching

Model functions are validated once per unique function identity using a cache keyed by (id(func), id(func.__code__)). This avoids redundant validation across chunks, providing 1-5% performance gain in chunked processing.
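A cache of this kind might look like the sketch below. The class and method names are hypothetical, and the validation itself (a signature check) is only a stand-in for whatever checks NLSQ actually performs. The cache key is the one described above.

```python
import inspect


class ModelValidationCache:
    """Validate each model function once, keyed by function identity."""

    def __init__(self):
        self._seen = set()
        self.validations_run = 0

    def validate_once(self, func):
        """Return True if validation ran, False on a cache hit."""
        key = (id(func), id(func.__code__))
        if key in self._seen:
            return False
        self._validate(func)
        self._seen.add(key)
        return True

    def _validate(self, func):
        # Stand-in check: the model needs xdata plus at least one parameter.
        self.validations_run += 1
        if len(inspect.signature(func).parameters) < 2:
            raise ValueError("model must accept xdata and at least one parameter")


def model(x, a, b):
    return a * x + b


cache = ModelValidationCache()
# Simulates validating the same model once per chunk across five chunks.
ran = [cache.validate_once(model) for _ in range(5)]
```

Two caveats apply to any `id`-based key: Python may reuse an `id` after the original object is garbage collected, and callables without a `__code__` attribute (for example `functools.partial` objects) need separate handling.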

Parameters:
  • memory_limit_gb (float, default 8.0) – Maximum memory usage in GB. System memory is auto-detected if None.

  • config (LDMemoryConfig, optional) – Advanced configuration for fine-tuning memory management behavior.

  • curve_fit_class (nlsq.minpack.CurveFit, optional) – Custom CurveFit instance for specialized fitting requirements.

  • multistart (bool, default False) – Enable multi-start optimization for global search.

  • n_starts (int, default 10) – Number of starting points for multi-start optimization.

  • sampler (str, default ‘lhs’) – Sampling strategy for multi-start: ‘lhs’, ‘sobol’, or ‘halton’.

config

Active memory management configuration

Type:

LDMemoryConfig

curve_fitter

Internal curve fitting engine with JAX acceleration

Type:

nlsq.minpack.CurveFit

logger

Internal logging for performance monitoring and debugging

Type:

Logger

Methods

fit : Main fitting method with automatic memory management

fit_with_progress : Fitting with real-time progress reporting and ETA

get_memory_recommendations : Pre-fitting memory analysis and strategy recommendations

Important: Chunking-Compatible Model Functions

When using chunked processing (for datasets larger than the memory limit), your model function MUST respect the size of xdata. During chunking, xdata will be a subset of the full dataset, and your model must return output matching that subset size.

INCORRECT - Model ignores xdata size (will cause shape mismatch errors):

>>> def bad_model(xdata, a, b):
...     # WRONG: Always returns full array, ignoring xdata size
...     t_full = jnp.arange(10_000_000)  # Fixed size!
...     return a * jnp.exp(-b * t_full)  # Shape mismatch during chunking

CORRECT - Model respects xdata size:

>>> def good_model(xdata, a, b):
...     # CORRECT: Uses xdata as indices to return only requested subset
...     indices = xdata.astype(jnp.int32)
...     return a * jnp.exp(-b * indices)  # Shape matches xdata

Alternative - Direct computation on xdata:

>>> def direct_model(xdata, a, b):
...     # CORRECT: Operates directly on xdata
...     return a * jnp.exp(-b * xdata)  # Shape automatically matches
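
A quick way to catch an incompatible model before a long fit is to call it on a small slice of xdata and compare shapes. The helper below is a hypothetical pre-flight check, not part of NLSQ, and it uses NumPy rather than `jax.numpy` so that it runs without JAX installed.

```python
import numpy as np


def respects_xdata_size(model, xdata, params, chunk_size=100):
    """Return True if the model output shape matches a chunk of xdata."""
    chunk = np.asarray(xdata)[:chunk_size]
    out = np.asarray(model(chunk, *params))
    return out.shape == chunk.shape


def good_model(xdata, a, b):
    # Operates directly on xdata, so the output follows the chunk size.
    return a * np.exp(-b * xdata)


def bad_model(xdata, a, b):
    # Ignores xdata and always returns a fixed-size array.
    t_full = np.arange(1000)
    return a * np.exp(-b * t_full)


x = np.linspace(0, 10, 1000)
good_ok = respects_xdata_size(good_model, x, (2.5, 0.001))
bad_ok = respects_xdata_size(bad_model, x, (2.5, 0.001))
```

Note that `bad_model` would pass a test on the full array, because its fixed size happens to equal `len(x)`. The mismatch only shows up on a subset, which is why the check slices first.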

Examples

Basic usage with automatic configuration:

>>> import numpy as np
>>> import jax.numpy as jnp
>>> from nlsq.streaming.large_dataset import LargeDatasetFitter
>>>
>>> # 10 million data points
>>> x = np.linspace(0, 10, 10_000_000)
>>> y = 2.5 * np.exp(-1.3 * x) + 0.1 + np.random.normal(0, 0.05, len(x))
>>>
>>> fitter = LargeDatasetFitter(memory_limit_gb=4.0)
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0]
... )
>>> print(f"Parameters: {result.popt}")
>>> print(f"Chunks used: {result.n_chunks}")

Multi-start optimization:

>>> fitter = LargeDatasetFitter(
...     memory_limit_gb=4.0,
...     multistart=True,
...     n_starts=10,
...     sampler='lhs',
... )
>>> result = fitter.fit(
...     lambda x, a, b, c: a * jnp.exp(-b * x) + c,
...     x, y, p0=[2, 1, 0],
...     bounds=([0, 0, 0], [10, 5, 10])
... )

Advanced configuration with progress monitoring:

>>> config = LDMemoryConfig(
...     memory_limit_gb=8.0,
...     min_chunk_size=10000,
...     max_chunk_size=1000000,
...     use_streaming=True,
...     streaming_batch_size=50000
... )
>>> fitter = LargeDatasetFitter(config=config)
>>>
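>>> # Fit with progress bar for long-running operation.
>>> # exponential_model, x_huge, and y_huge are placeholders for your
>>> # own model function and data arrays.
>>> result = fitter.fit_with_progress(
...     exponential_model, x_huge, y_huge, p0=[2, 1, 0]
... )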

Memory analysis before processing:

>>> recommendations = fitter.get_memory_recommendations(len(x), n_params=3)
>>> print(f"Strategy: {recommendations['processing_strategy']}")
>>> print(f"Memory estimate: {recommendations['memory_estimate_gb']:.2f} GB")
>>> print(f"Recommended chunks: {recommendations['n_chunks']}")

See also

curve_fit_large

High-level function with automatic dataset size detection

LDMemoryConfig

Configuration class for memory management parameters

estimate_memory_requirements

Standalone function for memory estimation

Notes

The sequential chunking algorithm maintains parameter accuracy by using each chunk’s result as the initial guess for the next chunk. This approach typically maintains fitting accuracy within 0.1% of single-pass results for well-conditioned problems while enabling processing of arbitrarily large datasets.
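The warm-start idea can be illustrated with a small NumPy-only sketch. It is not NLSQ's implementation: a plain Gauss-Newton loop stands in for the library's solver, all function names are hypothetical, and the data is assumed to be in random order so that every chunk spans the full x range.

```python
import numpy as np


def model(x, a, b):
    return a * np.exp(-b * x)


def jacobian(x, a, b):
    e = np.exp(-b * x)
    return np.column_stack([e, -a * x * e])  # d/da, d/db


def gauss_newton(x, y, p0, n_iter=20):
    """Undamped Gauss-Newton; adequate for this well-conditioned toy problem."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        residual = y - model(x, *p)
        step, *_ = np.linalg.lstsq(jacobian(x, *p), residual, rcond=None)
        p = p + step
    return p


def fit_sequential_chunks(x, y, p0, chunk_size):
    """Fit chunk by chunk, passing each result on as the next initial guess."""
    p = np.asarray(p0, dtype=float)
    for start in range(0, len(x), chunk_size):
        stop = start + chunk_size
        p = gauss_newton(x[start:stop], y[start:stop], p)
    return p


rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 20_000)  # unsorted, so chunks are representative
y = model(x, 2.5, 1.3) + rng.normal(0, 0.01, x.size)

p_chunked = fit_sequential_chunks(x, y, p0=[2.0, 1.0], chunk_size=5_000)
p_single = gauss_newton(x, y, p0=[2.0, 1.0])
```

In this sketch the final estimate comes from the last chunk alone, warm-started by the earlier ones. If the data were sorted by x, each chunk would see only a narrow range and the propagated estimate could drift, which is why the sketch assumes shuffled input.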

For extremely large datasets, streaming optimization processes all data using mini-batch gradient descent. Because no data is discarded, it avoids the accuracy loss of the subsampling approaches that were removed in v0.2.0.
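As a rough illustration of mini-batch gradient descent over a full dataset, the sketch below visits every point once per epoch and updates the parameters after each batch. It is a NumPy-only toy with hand-written gradients; the function names, learning rate, batch size, and epoch count are assumptions chosen for this example and do not reflect NLSQ's streaming optimizer.

```python
import numpy as np


def iter_batches(x, y, batch_size):
    """Yield consecutive batches; every data point is visited once per epoch."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size], y[start:start + batch_size]


def streaming_fit(x, y, p0, batch_size=1000, lr=1.0, n_epochs=5):
    """Mini-batch gradient descent on 0.5 * mean(residual**2)
    for the model a * exp(-b * x)."""
    a, b = (float(v) for v in p0)
    for _ in range(n_epochs):
        for xb, yb in iter_batches(x, y, batch_size):
            e = np.exp(-b * xb)
            residual = a * e - yb
            grad_a = np.mean(residual * e)
            grad_b = np.mean(residual * (-a * xb * e))
            a -= lr * grad_a
            b -= lr * grad_b
    return np.array([a, b])


rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 200_000)  # unsorted, so each batch is representative
y = 2.5 * np.exp(-1.3 * x) + rng.normal(0, 0.01, x.size)

p_stream = streaming_fit(x, y, p0=[2.0, 1.0])
```

Only one batch is held in working memory at a time, so the same loop would work with batches read from disk. A fixed learning rate is used here for brevity; it suits this toy problem but generally needs tuning.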

__init__(memory_limit_gb=8.0, config=None, curve_fit_class=None, logger=None, multistart=False, n_starts=10, sampler='lhs')[source]

Initialize LargeDatasetFitter.

Parameters:
  • memory_limit_gb (float, optional) – Memory limit in GB (default: 8.0)

  • config (LDMemoryConfig, optional) – Custom memory configuration

  • curve_fit_class (nlsq.minpack.CurveFit, optional) – Custom CurveFit instance to use

  • logger (logging.Logger, optional) – External logger instance for integration with application logging. If None, uses NLSQ’s internal logger. This allows chunk failure warnings to appear in your application’s logs.

  • multistart (bool, optional) – Enable multi-start optimization for global search (default: False). When enabled, explores multiple starting points on full data before running the full chunked optimization.

  • n_starts (int, optional) – Number of starting points for multi-start optimization (default: 10). Set to 0 to disable multi-start even when multistart=True.

  • sampler (str, optional) – Sampling strategy for generating starting points (default: ‘lhs’). Options: ‘lhs’ (Latin Hypercube), ‘sobol’, ‘halton’.

estimate_requirements(n_points, n_params)[source]

Estimate memory requirements and processing strategy.

Parameters:
  • n_points (int) – Number of data points

  • n_params (int) – Number of parameters to fit

Returns:

Detailed statistics and recommendations

Return type:

DatasetStats

fit(f, xdata, ydata, p0=None, bounds=(-inf, inf), method='trf', solver='auto', **kwargs)[source]

Fit curve to large dataset with automatic memory management.

Parameters:
  • f (callable) – The model function f(x, *params) -> y

  • xdata (np.ndarray) – Independent variable data

  • ydata (np.ndarray) – Dependent variable data

  • p0 (array-like, optional) – Initial parameter guess

  • bounds (tuple, optional) – Parameter bounds (lower, upper)

  • method (str, optional) – Optimization method (default: ‘trf’)

  • solver (str, optional) – Solver type (default: ‘auto’)

  • **kwargs – Additional arguments passed to curve_fit

Returns:

Optimization result with fitted parameters and statistics

Return type:

OptimizeResult

fit_with_progress(f, xdata, ydata, p0=None, bounds=(-inf, inf), method='trf', solver='auto', **kwargs)[source]

Fit curve with progress reporting for long-running fits.

Parameters:
  • f (callable) – The model function f(x, *params) -> y

  • xdata (np.ndarray) – Independent variable data

  • ydata (np.ndarray) – Dependent variable data

  • p0 (array-like, optional) – Initial parameter guess

  • bounds (tuple, optional) – Parameter bounds (lower, upper)

  • method (str, optional) – Optimization method (default: ‘trf’)

  • solver (str, optional) – Solver type (default: ‘auto’)

  • **kwargs – Additional arguments passed to curve_fit

Returns:

Optimization result with fitted parameters and statistics

Return type:

OptimizeResult

memory_monitor()[source]

Context manager for monitoring memory usage during fits.
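
For readers unfamiliar with the pattern, a memory-monitoring context manager can be sketched with the standard library alone. The helper below, `track_python_memory`, is hypothetical and unrelated to how `memory_monitor()` is implemented. It uses `tracemalloc`, which only sees allocations made through Python's allocators and does not measure GPU memory.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def track_python_memory(label="fit", report=print):
    """Record current and peak traced memory (in MB) for the enclosed block."""
    stats = {}
    tracemalloc.start()
    try:
        yield stats
    finally:
        current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        stats["current_mb"] = current / 1024**2
        stats["peak_mb"] = peak / 1024**2
        report(f"{label}: peak {stats['peak_mb']:.1f} MB")


with track_python_memory("demo") as stats:
    buffer = bytearray(10 * 1024 * 1024)  # allocate roughly 10 MB
    del buffer
```

The `stats` dictionary is filled in when the block exits, so it should be read after the `with` statement rather than inside it.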

get_memory_recommendations(n_points, n_params)[source]

Get memory usage recommendations for a dataset.

Parameters:
  • n_points (int) – Number of data points

  • n_params (int) – Number of parameters

Returns:

Recommendations and memory analysis

Return type:

dict