ADR-003: Streaming Optimization Over Subsampling
Status: Accepted
Date: 2025-10-17
Deciders: Wei Chen (Maintainer), Code Quality Review
Context
NLSQ v0.1.x included a subsampling feature for large datasets that randomly sampled data when datasets exceeded a threshold. This approach had several issues:
Problems with Subsampling
Accuracy Loss: Random sampling fit only 85-95% of the data, so information in the discarded points was lost
Non-deterministic Results: Different runs could produce different results even with the same seed
Complexity: Added ~250 lines of code with complex chunking logic
User Confusion: Parameters like enable_sampling, sampling_threshold, max_sampled_size were poorly understood
False Economy: Tried to save memory but lost scientific accuracy
Alternative Considered
Streaming Optimization: Process 100% of data in chunks using online optimization algorithms, integrated with existing chunked fitting infrastructure.
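The chunked idea can be illustrated with a minimal sketch. This is not NLSQ's implementation: the exponential model and the names iter_chunks and fit_streaming are hypothetical, and Gauss-Newton with chunk-wise accumulation of the normal equations stands in for whatever online algorithm the library actually uses. It only shows that every data point can contribute to the fit while a single chunk is processed at a time.

```python
import math


def iter_chunks(xs, ys, chunk_size):
    """Yield consecutive slices; a real pipeline might read these from disk."""
    for start in range(0, len(xs), chunk_size):
        yield xs[start:start + chunk_size], ys[start:start + chunk_size]


def fit_streaming(xs, ys, a, b, chunk_size=1000, n_iter=20):
    """Fit y = a * exp(-b * x) by Gauss-Newton, touching one chunk at a time."""
    for _ in range(n_iter):
        # Running sums for the 2x2 normal equations (J^T J) d = J^T r.
        s_aa = s_ab = s_bb = g_a = g_b = 0.0
        for xc, yc in iter_chunks(xs, ys, chunk_size):
            for x, y in zip(xc, yc):
                e = math.exp(-b * x)
                r = y - a * e      # residual at this point
                ja = e             # d(model)/da
                jb = -a * x * e    # d(model)/db
                s_aa += ja * ja
                s_ab += ja * jb
                s_bb += jb * jb
                g_a += ja * r
                g_b += jb * r
        # Solve the 2x2 system by Cramer's rule and apply the step.
        det = s_aa * s_bb - s_ab * s_ab
        a += (s_bb * g_a - s_ab * g_b) / det
        b += (s_aa * g_b - s_ab * g_a) / det
    return a, b


# Noiseless synthetic data generated with a = 2.5, b = 1.3.
xs = [4.0 * i / 4999 for i in range(5000)]
ys = [2.5 * math.exp(-1.3 * x) for x in xs]

p_chunked = fit_streaming(xs, ys, a=2.0, b=1.0, chunk_size=1000)
p_single = fit_streaming(xs, ys, a=2.0, b=1.0, chunk_size=len(xs))
```

Because the sums cover all chunks before each parameter update, the chunk size changes only peak memory use, not the fitted values, which is the determinism property this ADR relies on.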
Decision
Remove subsampling entirely in favor of streaming optimization.
Key Changes
Removed ~250 lines of subsampling code from large_dataset.py
Removed parameters: enable_sampling, sampling_threshold, max_sampled_size (previously deprecated, now fully removed)
Removed multi-start subsampling (multistart_subsample_size parameter); multi-start exploration now uses 100% of data
Integrated streaming optimizer for datasets that don’t fit in memory
Updated LargeDatasetFitter to use streaming by default
Migration Path
Remove any usage of enable_sampling, sampling_threshold, max_sampled_size, multistart_subsample_size
These parameters are no longer accepted and will raise TypeError
Streaming optimization is now the only strategy for large datasets
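For code that builds fitter options from a stored configuration, one way to migrate is to filter out the removed names before the call. The sketch below is an assumption-laden illustration: strip_removed_params and stand_in_fitter are hypothetical helpers, and memory_limit_gb is a made-up option that does not reflect the real LargeDatasetFitter signature. Only the four removed parameter names come from this ADR.

```python
# Parameter names removed by this ADR.
REMOVED_PARAMS = (
    "enable_sampling",
    "sampling_threshold",
    "max_sampled_size",
    "multistart_subsample_size",
)


def strip_removed_params(kwargs):
    """Return a copy of kwargs without the removed subsampling parameters."""
    return {k: v for k, v in kwargs.items() if k not in REMOVED_PARAMS}


def stand_in_fitter(memory_limit_gb=8.0):
    """Hypothetical stand-in; NOT the real LargeDatasetFitter signature."""
    return {"memory_limit_gb": memory_limit_gb}


legacy_config = {
    "memory_limit_gb": 4.0,
    "enable_sampling": True,
    "sampling_threshold": 100_000_000,
}

# Passing the legacy options unchanged fails, because Python rejects
# keyword arguments that the callable no longer declares.
try:
    stand_in_fitter(**legacy_config)
    legacy_error = None
except TypeError as exc:
    legacy_error = exc

# After filtering, the same configuration is accepted.
migrated = stand_in_fitter(**strip_removed_params(legacy_config))
```

Filtering is a stopgap for shared configuration files; deleting the options at their source remains the cleaner fix.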
Consequences
Positive
[PASS] 100% Data Utilization: No accuracy loss from random sampling
[PASS] Deterministic Results: Same data always produces same fit
[PASS] Simpler Code: 250 fewer lines to maintain
[PASS] Better Science: Processes all data for maximum statistical power
[PASS] Streaming Integration: Reuses existing chunked fitting infrastructure
[PASS] Clear API: Fewer confusing parameters
Negative
[FAIL] Breaking Change: Old parameters now raise TypeError
Mitigation: Clear migration path documented above
[FAIL] Slightly Slower: Processing 100% of data takes longer than sampling 85-95%
Mitigation: Minimal impact due to efficient streaming implementation
[FAIL] Requires h5py: Now a required dependency instead of optional
Mitigation: h5py is standard in the scientific Python ecosystem
Performance Impact
Before (subsampling): 85-95% of data, faster but less accurate
After (streaming): 100% of data, slightly slower but scientifically correct
Typical overhead: 10-20% longer runtime in exchange for using 100% of the data
Status Updates
2025-10-17: Accepted and parameters deprecated
2025-10-18: Verified with 1241 tests passing, 100% success rate
2025-12-21: Multi-start subsampling (multistart_subsample_size) removed
2025-12-21: Deprecated enable_sampling, sampling_threshold, max_sampled_size fully removed