ADR-003: Streaming Optimization Over Subsampling

Status: Accepted

Date: 2025-10-17

Deciders: Wei Chen (Maintainer), Code Quality Review

Context

NLSQ v0.1.x included a subsampling feature for large datasets that randomly sampled data when datasets exceeded a threshold. This approach had several issues:

Problems with Subsampling

Accuracy Loss: Random sampling kept only 85-95% of the data, so information in the discarded points was lost
Non-deterministic Results: Different runs could produce different results, even with the same seed
Complexity: Added ~250 lines of code with complex chunking logic
User Confusion: Parameters like enable_sampling, sampling_threshold, max_sampled_size were poorly understood
False Economy: Tried to save memory but lost scientific accuracy
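To put a rough number on the accuracy cost listed above, the sketch below uses a standard statistical rule of thumb rather than anything measured in NLSQ: assuming independent, identically distributed noise, parameter standard errors scale as 1/sqrt(n), so keeping only a fraction of the data inflates them by 1/sqrt(fraction). The function name `stderr_inflation` is hypothetical and exists only for this illustration.

```python
import math


def stderr_inflation(fraction_kept: float) -> float:
    """Factor by which standard errors grow when only `fraction_kept` of the
    data is used. Assumes i.i.d. noise, so that error scales as 1/sqrt(n)."""
    if not 0.0 < fraction_kept <= 1.0:
        raise ValueError("fraction_kept must be in (0, 1]")
    return 1.0 / math.sqrt(fraction_kept)


# The two ends of the 85-95% range quoted in this ADR.
inflation_85 = stderr_inflation(0.85)  # roughly 8% larger standard errors
inflation_95 = stderr_inflation(0.95)  # roughly 3% larger standard errors
inflation_full = stderr_inflation(1.0)  # no inflation when all data is used
```

Under that assumption the loss is modest but never zero, and real datasets with rare features or correlated noise can fare worse than this estimate suggests.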

Alternative Considered

Streaming Optimization: Process 100% of data in chunks using online optimization algorithms, integrated with existing chunked fitting infrastructure.
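The idea of fitting 100% of the data in chunks can be illustrated with a deliberately simplified case. The sketch below is not NLSQ's streaming optimizer: it fits a straight line (a linear problem, unlike the nonlinear fits NLSQ targets) by accumulating running sums one chunk at a time, so only a single chunk needs to be in memory. The names `iter_chunks` and `streaming_line_fit` are hypothetical.

```python
from typing import Iterable, Iterator, Sequence, Tuple

Chunk = Tuple[Sequence[float], Sequence[float]]


def iter_chunks(x: Sequence[float], y: Sequence[float], chunk_size: int) -> Iterator[Chunk]:
    """Yield consecutive (x, y) slices; a stand-in for reading blocks from disk."""
    for start in range(0, len(x), chunk_size):
        yield x[start:start + chunk_size], y[start:start + chunk_size]


def streaming_line_fit(chunks: Iterable[Chunk]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept, visiting every data point exactly once."""
    n = sx = sy = sxx = sxy = 0.0
    for xs, ys in chunks:
        # Only running sums are kept, so memory use is independent of dataset size.
        for xi, yi in zip(xs, ys):
            n += 1
            sx += xi
            sy += yi
            sxx += xi * xi
            sxy += xi * yi
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Noise-free synthetic data on the line y = 2x + 1.
x = [float(i) for i in range(1000)]
y = [2.0 * xi + 1.0 for xi in x]

slope, intercept = streaming_line_fit(iter_chunks(x, y, chunk_size=128))
slope_small, intercept_small = streaming_line_fit(iter_chunks(x, y, chunk_size=7))
```

Because every point contributes to the sums, the result does not depend on the chunk size or on any random draw, which is the property that subsampling could not offer.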

Decision

Remove subsampling entirely in favor of streaming optimization.

Key Changes

Removed ~250 lines of subsampling code from large_dataset.py
Removed parameters: enable_sampling, sampling_threshold, max_sampled_size
- Previously deprecated, now fully removed
Removed multi-start subsampling (multistart_subsample_size parameter)
- Multi-start exploration now uses 100% of data
Integrated streaming optimizer for datasets that don’t fit in memory
Updated LargeDatasetFitter to use streaming by default

Migration Path

Remove any usage of enable_sampling, sampling_threshold, max_sampled_size, multistart_subsample_size
These parameters are no longer accepted and will raise TypeError
Streaming optimization is now the only strategy for large datasets
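For code that builds its keyword arguments from a shared configuration dictionary, the migration steps above might look like the following sketch. Only the four removed parameter names come from this ADR. `strip_removed_params`, `fit_stub`, and the `memory_limit_gb` keyword are hypothetical placeholders and not part of the NLSQ API; `fit_stub` simply mimics a call that no longer accepts the sampling arguments.

```python
REMOVED_PARAMS = frozenset({
    "enable_sampling",
    "sampling_threshold",
    "max_sampled_size",
    "multistart_subsample_size",
})


def strip_removed_params(kwargs: dict) -> dict:
    """Return a copy of kwargs without the removed subsampling parameters."""
    return {k: v for k, v in kwargs.items() if k not in REMOVED_PARAMS}


def fit_stub(*, memory_limit_gb: float = 8.0) -> str:
    """Hypothetical stand-in for a call that rejects the old sampling kwargs."""
    return f"fit with memory_limit_gb={memory_limit_gb}"


old_kwargs = {
    "memory_limit_gb": 4.0,
    "enable_sampling": True,
    "sampling_threshold": 1_000_000,
}

# Passing the old configuration unchanged fails, as the migration notes describe.
try:
    fit_stub(**old_kwargs)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# After filtering, the same configuration is accepted.
result = fit_stub(**strip_removed_params(old_kwargs))
```

Filtering like this is a stopgap for legacy configuration files; deleting the parameters at their source is the cleaner fix.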

Consequences

Positive

[PASS] 100% Data Utilization: No accuracy loss from random sampling
[PASS] Deterministic Results: Same data always produces same fit
[PASS] Simpler Code: 250 fewer lines to maintain
[PASS] Better Science: Processes all data for maximum statistical power
[PASS] Streaming Integration: Reuses existing chunked fitting infrastructure
[PASS] Clear API: Fewer confusing parameters

Negative

[FAIL] Breaking Change: Old parameters now raise TypeError
Mitigation: Clear migration path documented above
[FAIL] Slightly Slower: Processing 100% of data takes longer than sampling 85%
Mitigation: Minimal impact due to efficient streaming implementation
[FAIL] Requires h5py: Now a required dependency instead of optional
Mitigation: h5py is standard in scientific Python ecosystem

Performance Impact

Before (subsampling): 85-95% of data, faster but less accurate
After (streaming): 100% of data, slightly slower but scientifically correct
Typical overhead: 10-20% longer runtime in exchange for fitting 100% of the data

Status Updates

2025-10-17: Accepted and parameters deprecated
2025-10-18: Verified with 1241 tests passing, 100% success rate
2025-12-21: Multi-start subsampling (multistart_subsample_size) removed
2025-12-21: Deprecated enable_sampling, sampling_threshold, max_sampled_size fully removed