Large Dataset API Reference

This page documents the API for NLSQ’s large dataset handling features, designed for datasets with 20M+ points.

Memory Estimation

nlsq.estimate_memory_requirements(n_points, n_params)

Estimate memory requirements for a dataset.

Parameters:
  • n_points (int) – Number of data points

  • n_params (int) – Number of parameters

Returns:

Memory requirements and processing recommendations

Return type:

DatasetStats

Examples

>>> from nlsq.streaming.large_dataset import estimate_memory_requirements
>>>
>>> # Estimate requirements for 50M points, 3 parameters
>>> stats = estimate_memory_requirements(50_000_000, 3)
>>> print(f"Estimated memory: {stats.total_memory_estimate_gb:.2f} GB")
>>> print(f"Recommended chunk size: {stats.recommended_chunk_size:,}")
>>> print(f"Number of chunks: {stats.n_chunks}")

The estimate_memory_requirements function returns a DatasetStats object with the following attributes:

  • n_points: Number of data points

  • n_params: Number of parameters

  • total_memory_estimate_gb: Estimated memory requirement in GB

  • recommended_chunk_size: Recommended chunk size for processing

  • n_chunks: Number of chunks needed

Example:

from nlsq import estimate_memory_requirements

stats = estimate_memory_requirements(100_000_000, 4)
print(f"Total memory: {stats.total_memory_estimate_gb:.2f} GB")
print(f"Process in {stats.n_chunks} chunks of {stats.recommended_chunk_size} points")

LargeDatasetFitter

The main class for handling large datasets with automatic memory management.

For complete API documentation, see nlsq.LargeDatasetFitter in the nlsq.large_dataset module.

Key Features:

  • Automatic memory management and chunking

  • Progress reporting for long-running fits

  • Configurable memory limits and chunk sizes

  • Integration with existing NLSQ optimization algorithms

Constructor Parameters:

  • memory_limit_gb (float): Maximum memory to use (default: 4.0)

  • config (LDMemoryConfig, optional): Advanced configuration object

Example:

from nlsq import LargeDatasetFitter

fitter = LargeDatasetFitter(memory_limit_gb=8.0)

# Get recommendations
recs = fitter.get_memory_recommendations(50_000_000, 3)
print(recs['processing_strategy'])

# Fit with progress
result = fitter.fit_with_progress(func, x, y, p0)

Convenience Functions

curve_fit_large Function

Primary large dataset fitting function with automatic dataset size detection.

For complete API documentation, see nlsq.curve_fit_large() in the nlsq.large_dataset module.

This function provides a drop-in replacement for curve_fit with automatic detection and handling of large datasets. For small datasets (< 1M points), it behaves identically to curve_fit. For larger datasets, it automatically switches to memory-efficient processing with chunking and progress reporting.

Parameters:
  • func: Model function f(x, *params) -> y

  • xdata: Independent variable data

  • ydata: Dependent variable data

  • p0: Initial parameter guess

  • memory_limit_gb: Memory limit in GB (default: auto-detect)

  • auto_size_detection: Automatically detect dataset size (default: True)

  • size_threshold: Threshold for switching to large dataset processing (default: 1M)

  • show_progress: Show progress bar for large datasets (default: False)

  • **kwargs: Additional fitting options

Returns:

popt, pcov tuple (same as scipy.optimize.curve_fit)

Example:

from nlsq import curve_fit_large
import jax.numpy as jnp

# Automatic handling - uses standard curve_fit for small datasets
popt, pcov = curve_fit_large(func, x_small, y_small, p0=[1, 0.5])

# Automatic chunking for large datasets
popt, pcov = curve_fit_large(
    func, x_large, y_large,
    p0=[1, 0.5],
    memory_limit_gb=8.0,
    show_progress=True
)
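
Because the return value follows the scipy.optimize.curve_fit convention, the diagonal of pcov holds the parameter variances, so one-sigma uncertainties are the square roots of the diagonal entries. The sketch below illustrates this with a hand-written, hypothetical covariance matrix (pcov_example) rather than the output of a real fit; parameter_std_errors is an illustrative helper, not part of NLSQ.

```python
import math


def parameter_std_errors(pcov):
    """Return one-sigma uncertainties: sqrt of each diagonal entry of pcov."""
    return [math.sqrt(pcov[i][i]) for i in range(len(pcov))]


# Hypothetical 2x2 covariance matrix standing in for the pcov of a real fit.
pcov_example = [
    [0.04, 0.001],
    [0.001, 0.0009],
]

perr = parameter_std_errors(pcov_example)
print(perr)
```

The same indexing works on a 2-D array returned by a fit, since pcov[i][i] reads the i-th diagonal element in both cases.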

fit_large_dataset Function

Advanced large dataset fitting with OptimizeResult return format.

For complete API documentation, see nlsq.fit_large_dataset() in the nlsq.large_dataset module.

Parameters:
  • func: Model function

  • xdata: Independent variable

  • ydata: Dependent variable

  • p0: Initial parameters

  • memory_limit_gb: Memory limit (default: 4.0)

  • show_progress: Show progress bar (default: False)

  • **kwargs: Additional fitting options

Returns:

OptimizeResult object with detailed fitting information

Example:

from nlsq import fit_large_dataset

result = fit_large_dataset(
    exponential, x_data, y_data,
    p0=[1.0, 0.5, 0.1],
    memory_limit_gb=4.0,
    show_progress=True
)
print(f"Success: {result.success}")
print(f"Parameters: {result.popt}")
# Note: n_chunks only available for multi-chunk fits
if hasattr(result, 'n_chunks'):
    print(f"Chunks used: {result.n_chunks}")

Advanced Features

Sparse Jacobian Support

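For problems with sparse Jacobian structures, NLSQ provides the SparseJacobianComputer class, which detects the sparsity pattern of the Jacobian from a sample of the data.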

See nlsq.large_dataset module for usage examples.

Example:

from nlsq import SparseJacobianComputer

# Detect sparsity automatically
sparse_computer = SparseJacobianComputer(sparsity_threshold=0.01)
pattern, sparsity = sparse_computer.detect_sparsity_pattern(func, p0, x_sample)

if sparsity > 0.1:  # If more than 10% sparse
    print(f"Jacobian is {sparsity:.1%} sparse")
    # NLSQ will automatically use sparse optimization
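
To make the notion of a sparsity fraction concrete, the sketch below estimates it in plain Python from a forward-difference Jacobian. It is an illustration of the idea only, not NLSQ's implementation: finite_difference_jacobian, estimate_sparsity, the piecewise model, and the interpretation of the 0.01 threshold as an absolute cutoff on Jacobian entries are all assumptions made for this example.

```python
def finite_difference_jacobian(func, params, x_sample, step=1e-6):
    """Forward-difference Jacobian: one row per sample point, one column per parameter."""
    base = [func(x, params) for x in x_sample]
    jac = [[0.0] * len(params) for _ in x_sample]
    for j in range(len(params)):
        shifted = list(params)
        shifted[j] += step
        for i, x in enumerate(x_sample):
            jac[i][j] = (func(x, shifted) - base[i]) / step
    return jac


def estimate_sparsity(jac, threshold=0.01):
    """Fraction of Jacobian entries whose magnitude is below the threshold."""
    entries = [abs(v) for row in jac for v in row]
    return sum(1 for v in entries if v < threshold) / len(entries)


def piecewise(x, params):
    # Hypothetical model: each parameter only influences one half of the x axis.
    return params[0] * x if x < 0 else params[1] * x


jac = finite_difference_jacobian(piecewise, [1.0, 2.0], [-2.0, -1.0, 1.0, 2.0])
sparsity = estimate_sparsity(jac)
print(f"Jacobian is {sparsity:.1%} sparse")
```

In this toy model every data point depends on exactly one of the two parameters, so half of the Jacobian entries are zero.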

Adaptive Hybrid Streaming

For very large datasets (100M+ points), use the adaptive hybrid streaming optimizer, which combines:

  • L-BFGS warmup with defense layers

  • Streaming Gauss-Newton for accurate covariance

  • Chunked processing with bounded memory

Example:

from nlsq import AdaptiveHybridStreamingOptimizer, HybridStreamingConfig

config = HybridStreamingConfig(chunk_size=10000, gauss_newton_max_iterations=10)
optimizer = AdaptiveHybridStreamingOptimizer(config)
result = optimizer.fit((x, y), func, p0=p0)
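
The idea behind streaming Gauss-Newton with bounded memory can be sketched by accumulating the normal equations (J^T J and J^T r) one chunk at a time, so that only a small fixed-size accumulator is held regardless of dataset size. The code below is a simplified illustration for a two-parameter linear model y = a*x + b, not NLSQ's implementation; iter_xy_chunks and streaming_gauss_newton_step are hypothetical names.

```python
def iter_xy_chunks(xs, ys, chunk_size):
    """Yield consecutive (x_chunk, y_chunk) slices of the data."""
    for start in range(0, len(xs), chunk_size):
        yield xs[start:start + chunk_size], ys[start:start + chunk_size]


def streaming_gauss_newton_step(chunks, a, b):
    """One Gauss-Newton step for y = a*x + b, accumulating J^T J and J^T r per chunk."""
    s_xx = s_x = s_1 = 0.0  # entries of the 2x2 matrix J^T J
    g_a = g_b = 0.0         # entries of the 2-vector J^T r
    for x_chunk, y_chunk in chunks:
        for x, y in zip(x_chunk, y_chunk):
            r = y - (a * x + b)  # residual at the current parameters
            # The Jacobian row of the model with respect to (a, b) is (x, 1).
            s_xx += x * x
            s_x += x
            s_1 += 1.0
            g_a += x * r
            g_b += r
    # Solve the 2x2 system (J^T J) [da, db] = J^T r in closed form.
    det = s_xx * s_1 - s_x * s_x
    da = (s_1 * g_a - s_x * g_b) / det
    db = (s_xx * g_b - s_x * g_a) / det
    return a + da, b + db


xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
a_fit, b_fit = streaming_gauss_newton_step(iter_xy_chunks(xs, ys, 4), a=0.0, b=0.0)
print(a_fit, b_fit)
```

Because this model is linear in its parameters, a single step recovers the least-squares solution; a nonlinear model would repeat the step until convergence.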

Memory Configuration

Advanced memory configuration options.

For complete API documentation, see nlsq.large_dataset.LDMemoryConfig in the nlsq.large_dataset module.

Parameters:

  • memory_limit_gb: Maximum memory in GB

  • safety_factor: Safety factor for memory calculations (default: 0.8)

  • min_chunk_size: Minimum chunk size (default: 1000)

  • max_chunk_size: Maximum chunk size (default: 1000000)

  • min_success_rate: Minimum success rate for chunked fitting (default: 0.5)

Example:

from nlsq import LargeDatasetFitter
from nlsq.streaming.large_dataset import LDMemoryConfig

# Custom configuration
config = LDMemoryConfig(
    memory_limit_gb=8.0,
    safety_factor=0.9,
    min_chunk_size=10000,
    max_chunk_size=1000000,
    min_success_rate=0.8,  # Require 80% of chunks to succeed
)

fitter = LargeDatasetFitter(config=config)
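
How these settings might interact can be sketched in plain Python. The functions below are hypothetical and do not reproduce NLSQ's internal logic: they assume the per-point cost from this page's approximate memory formula, scale the memory limit by the safety factor, clamp the result to the configured chunk size bounds, and compare the fraction of successful chunks against min_success_rate.

```python
import math


def bytes_per_point(n_params):
    # Per-point cost from this page's approximate formula:
    # (3 * n_params + 5) float64 values of 8 bytes each.
    return (3 * n_params + 5) * 8


def sketch_chunk_size(n_params, memory_limit_gb, safety_factor=0.8,
                      min_chunk_size=1000, max_chunk_size=1_000_000):
    """Hypothetical: points that fit in the usable budget, clamped to the bounds."""
    usable_bytes = memory_limit_gb * safety_factor * 1e9
    raw = int(usable_bytes // bytes_per_point(n_params))
    return max(min_chunk_size, min(max_chunk_size, raw))


def sketch_n_chunks(n_points, chunk_size):
    """Number of chunks needed to cover n_points (the last chunk may be partial)."""
    return math.ceil(n_points / chunk_size)


def enough_chunks_succeeded(n_ok, n_total, min_success_rate=0.5):
    """True when the fraction of successful chunks meets the required rate."""
    return n_ok / n_total >= min_success_rate


chunk = sketch_chunk_size(n_params=3, memory_limit_gb=8.0, safety_factor=0.9,
                          min_chunk_size=10_000, max_chunk_size=1_000_000)
print(chunk, sketch_n_chunks(50_000_000, chunk))
```

With an 8 GB limit the memory budget allows far more than max_chunk_size points, so the upper bound is what determines the chunk size in this example.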

Data Chunking

Utility class for chunking large arrays.

For complete API documentation, see nlsq.large_dataset.DataChunker in the nlsq.large_dataset module.

Returns:

Iterator yielding (x_chunk, y_chunk, indices) tuples

Example:

from nlsq.streaming.large_dataset import DataChunker

chunker = DataChunker(chunk_size=100000)

for x_chunk, y_chunk, idx in chunker.create_chunks(x, y):
    # Process chunk
    result = process_chunk(x_chunk, y_chunk)
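
The shape of the yielded tuples can be illustrated with a small stand-in generator. This is a sketch, not the DataChunker implementation: iter_chunks is a hypothetical name, and it assumes that indices holds the positions of the chunk's elements in the original arrays.

```python
def iter_chunks(x, y, chunk_size):
    """Yield (x_chunk, y_chunk, indices) tuples covering the data in order."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for start in range(0, len(x), chunk_size):
        stop = min(start + chunk_size, len(x))
        # indices: positions of this chunk's elements in the original data
        yield x[start:stop], y[start:stop], range(start, stop)


x = list(range(10))
y = [2 * v for v in x]
sizes = [len(idx) for _, _, idx in iter_chunks(x, y, chunk_size=4)]
print(sizes)  # [4, 4, 2]
```

The final chunk is shorter whenever the data length is not a multiple of the chunk size, so downstream code should not assume equal-sized chunks.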

Performance Considerations

Memory Usage Guidelines

Dataset sizes and recommended approaches:

  • < 1M points: Use standard curve_fit

  • 1M - 10M points: Use LargeDatasetFitter with default settings

  • 10M - 100M points: Use LargeDatasetFitter with chunking

  • 100M - 1B points: Use AdaptiveHybridStreamingOptimizer with chunked streaming

  • > 1B points: Use sampling strategies or distributed computing
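
The size ranges above can be expressed as a small lookup function. The helper below is hypothetical (recommend_approach is not part of NLSQ), and how the exact boundary values such as 1M or 10M are assigned is an assumption, since the list does not say which side they fall on.

```python
def recommend_approach(n_points):
    """Map a dataset size to the approach suggested in the guidelines."""
    if n_points < 1_000_000:
        return "curve_fit"
    if n_points < 10_000_000:
        return "LargeDatasetFitter (default settings)"
    if n_points < 100_000_000:
        return "LargeDatasetFitter (chunking)"
    if n_points <= 1_000_000_000:
        return "AdaptiveHybridStreamingOptimizer (chunked streaming)"
    return "sampling strategies or distributed computing"


for n in (500_000, 5_000_000, 50_000_000, 500_000_000, 5_000_000_000):
    print(f"{n:>13,} points -> {recommend_approach(n)}")
```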

Memory Estimation Formula

Approximate memory usage:

memory_gb = n_points * (3 * n_params + 5) * 8 / 1e9

Where:

  • 3 factors: x data, y data, residuals

  • n_params: Jacobian columns

  • 5: Working arrays

  • 8: Bytes per float64
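
As a worked example, the formula can be evaluated directly. The helper below (approximate_memory_gb, a hypothetical name) implements only the approximation stated on this page; the value reported by estimate_memory_requirements may differ.

```python
def approximate_memory_gb(n_points, n_params):
    """Approximate memory in GB: n_points * (3 * n_params + 5) * 8 bytes."""
    return n_points * (3 * n_params + 5) * 8 / 1e9


# 50M points, 3 parameters: 50e6 * 14 * 8 bytes
print(f"{approximate_memory_gb(50_000_000, 3):.2f} GB")   # 5.60 GB
# 100M points, 4 parameters: 100e6 * 17 * 8 bytes
print(f"{approximate_memory_gb(100_000_000, 4):.2f} GB")  # 13.60 GB
```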

Optimization Tips

  1. Check sparsity first: Many large problems have sparse Jacobians

  2. Use iterative solvers: CG and LSQR use less memory than SVD

  3. Enable sampling: For exploratory analysis on very large datasets

  4. Stream from disk: Use HDF5 for datasets larger than RAM

  5. Monitor progress: Use fit_with_progress for long fits
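
For tip 3, one simple form of sampling is a uniform random subsample drawn without replacement, which can then be passed to an ordinary fit for a quick exploratory estimate. The sketch below uses only the standard library; subsample is a hypothetical helper and does not correspond to a specific NLSQ sampling option.

```python
import random


def subsample(x, y, n_samples, seed=0):
    """Uniform random subsample without replacement, preserving the original order."""
    rng = random.Random(seed)  # seeded for reproducible exploratory runs
    idx = sorted(rng.sample(range(len(x)), min(n_samples, len(x))))
    return [x[i] for i in idx], [y[i] for i in idx]


x = list(range(1000))
y = [3 * v + 1 for v in x]
x_sub, y_sub = subsample(x, y, n_samples=100)
print(len(x_sub), len(y_sub))
```

Sampling x and y with the same indices keeps each pair aligned, which is what a subsequent fit on the subsample relies on.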

Best Practices

  1. Always estimate memory first:

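    # available_memory_gb: memory budget in GB, chosen by the caller
    stats = estimate_memory_requirements(n_points, n_params)
    if stats.total_memory_estimate_gb > available_memory_gb:
        # Dataset does not fit in the budget: fit in chunks instead
        fitter = LargeDatasetFitter(memory_limit_gb=available_memory_gb)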
    
  2. Use appropriate chunk sizes:

    # Chunk size affects performance
    # Too small: overhead from many iterations
    # Too large: memory issues
    # Per-point cost follows the memory estimation formula above
    optimal_chunk = int(available_memory_gb * 1e9 / (8 * (3 * n_params + 5)))
    
  3. Leverage sparsity when available:

    # Many scientific problems have sparse Jacobians
    # (e.g., fitting multiple peaks, piecewise functions)
    if expected_sparsity > 0.9:
        use_sparse_optimizer()
    
  4. Use streaming for very large datasets:

    # For datasets >100M points, streaming optimization processes
    # all data in chunks with zero accuracy loss
    if n_points > 100_000_000:
        # Streaming is automatic in curve_fit_large
        popt, pcov = curve_fit_large(func, xdata, ydata, p0=p0,
                                      memory_limit_gb=8.0, show_progress=True)
    

See Also