ADR-003: Streaming Optimization Over Subsampling

Status: Accepted

Date: 2025-10-17

Deciders: Wei Chen (Maintainer), Code Quality Review

Context

NLSQ v0.1.x included a subsampling feature for large datasets that randomly sampled data when datasets exceeded a threshold. This approach had several issues:

Problems with Subsampling

  1. Accuracy Loss: Random sampling reduced data from 85-95% accuracy to potential information loss

  2. Non-deterministic Results: Different runs could produce different results even with same seed

  3. Complexity: Added ~250 lines of code with complex chunking logic

  4. User Confusion: Parameters like enable_sampling, sampling_threshold, max_sampled_size were poorly understood

  5. False Economy: Tried to save memory but lost scientific accuracy

Alternative Considered

Streaming Optimization: Process 100% of data in chunks using online optimization algorithms, integrated with existing chunked fitting infrastructure.

Decision

Remove subsampling entirely in favor of streaming optimization.

Key Changes

  1. Removed ~250 lines of subsampling code from large_dataset.py

  2. Removed parameters: enable_sampling, sampling_threshold, max_sampled_size

    • Previously deprecated, now fully removed

  3. Removed multi-start subsampling (multistart_subsample_size parameter)

    • Multi-start exploration now uses 100% of data

  4. Integrated streaming optimizer for datasets that don’t fit in memory

  5. Updated LargeDatasetFitter to use streaming by default

Migration Path

  • Remove any usage of enable_sampling, sampling_threshold, max_sampled_size, multistart_subsample_size

  • These parameters are no longer accepted and will raise TypeError

  • Streaming optimization is now the only strategy for large datasets

Consequences

Positive

[PASS] 100% Data Utilization: No accuracy loss from random sampling [PASS] Deterministic Results: Same data always produces same fit [PASS] Simpler Code: 250 fewer lines to maintain [PASS] Better Science: Processes all data for maximum statistical power [PASS] Streaming Integration: Reuses existing chunked fitting infrastructure [PASS] Clear API: Fewer confusing parameters

Negative

[FAIL] Breaking Change: Old parameters now raise TypeError

  • Mitigation: Clear migration path documented above [FAIL] Slightly Slower: Processing 100% of data takes longer than sampling 85%

  • Mitigation: Minimal impact due to efficient streaming implementation [FAIL] Requires h5py: Now a required dependency instead of optional

  • Mitigation: h5py is standard in scientific Python ecosystem

Performance Impact

  • Before (subsampling): 85-95% of data, faster but less accurate

  • After (streaming): 100% of data, slightly slower but scientifically correct

  • Typical overhead: 10-20% longer runtime for 100% accuracy

References

Status Updates

  • 2025-10-17: Accepted and parameters deprecated

  • 2025-10-18: Verified with 1241 tests passing, 100% success rate

  • 2025-12-21: Multi-start subsampling (multistart_subsample_size) removed

  • 2025-12-21: Deprecated enable_sampling, sampling_threshold, max_sampled_size fully removed