# ADR-003: Streaming Optimization Over Subsampling

**Status**: Accepted

**Date**: 2025-10-17

**Deciders**: Wei Chen (Maintainer), Code Quality Review

## Context

NLSQ v0.1.x included a subsampling feature that randomly sampled the data when a dataset exceeded a size threshold. This approach had several issues:

### Problems with Subsampling

1. **Accuracy Loss**: Random sampling kept only 85-95% of the data, so information in the discarded points could be lost
2. **Non-deterministic Results**: Different runs could produce different results even with the same seed
3. **Complexity**: Added ~250 lines of code with complex chunking logic
4. **User Confusion**: Parameters like `enable_sampling`, `sampling_threshold`, `max_sampled_size` were poorly understood
5. **False Economy**: Tried to save memory but lost scientific accuracy

### Alternative Considered

**Streaming Optimization**: Process 100% of data in chunks using online optimization algorithms, integrated with existing chunked fitting infrastructure.

## Decision

**Remove subsampling entirely in favor of streaming optimization.**

### Key Changes

1. Removed ~250 lines of subsampling code from `large_dataset.py`
2. Removed parameters: `enable_sampling`, `sampling_threshold`, `max_sampled_size`
   - Previously deprecated, now fully removed
3. Removed multi-start subsampling (`multistart_subsample_size` parameter)
   - Multi-start exploration now uses 100% of data
4. Integrated streaming optimizer for datasets that don't fit in memory
5. Updated `LargeDatasetFitter` to use streaming by default

### Migration Path

- Remove any usage of `enable_sampling`, `sampling_threshold`, `max_sampled_size`, `multistart_subsample_size`
- These parameters are no longer accepted and will raise `TypeError`
- Streaming optimization is now the only strategy for large datasets

## Consequences

### Positive

- [PASS] **100% Data Utilization**: No accuracy loss from random sampling
- [PASS] **Deterministic Results**: Same data always produces same fit
- [PASS] **Simpler Code**: 250 fewer lines to maintain
- [PASS] **Better Science**: Processes all data for maximum statistical power
- [PASS] **Streaming Integration**: Reuses existing chunked fitting infrastructure
- [PASS] **Clear API**: Fewer confusing parameters

### Negative

- [FAIL] **Breaking Change**: Old parameters now raise `TypeError`
  - **Mitigation**: Clear migration path documented above
- [FAIL] **Slightly Slower**: Processing 100% of data takes longer than sampling 85-95%
  - **Mitigation**: Minimal impact due to efficient streaming implementation
- [FAIL] **Requires h5py**: Now a required dependency instead of optional
  - **Mitigation**: h5py is standard in the scientific Python ecosystem

### Performance Impact

- **Before** (subsampling): 85-95% of data, faster but less accurate
- **After** (streaming): 100% of data, slightly slower but with no sampling-induced accuracy loss
- **Typical overhead**: 10-20% longer runtime in exchange for using 100% of the data

## References

- [Large Dataset Implementation](../../../nlsq/streaming/large_dataset.py)
- [Streaming Optimizer](../../../nlsq/streaming/adaptive_hybrid.py)
- [Large Dataset Guide](../../howto/handle_large_data.rst)

## Status Updates

- **2025-10-17**: Accepted and parameters deprecated
- **2025-10-18**: Verified with 1241 tests passing, 100% success rate
- **2025-12-21**: Multi-start subsampling (`multistart_subsample_size`) removed
- **2025-12-21**: Deprecated `enable_sampling`, `sampling_threshold`, `max_sampled_size` fully removed
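
## Appendix: Migration Sketch

The migration path in this ADR amounts to dropping four keyword arguments before they reach the fitter. A minimal sketch of that cleanup is shown below. The helper name `strip_removed_sampling_options` and the `chunk_size` option are hypothetical and not part of NLSQ; only the four removed parameter names come from this ADR.

```python
# Hypothetical migration helper; not part of the NLSQ API.
# The four names below are the parameters this ADR removes.
REMOVED_PARAMS = frozenset({
    "enable_sampling",
    "sampling_threshold",
    "max_sampled_size",
    "multistart_subsample_size",
})


def strip_removed_sampling_options(options):
    """Return (cleaned_options, removed_names) without mutating the input."""
    cleaned = {k: v for k, v in options.items() if k not in REMOVED_PARAMS}
    removed = sorted(k for k in options if k in REMOVED_PARAMS)
    return cleaned, removed


# Example legacy configuration; `chunk_size` is an illustrative placeholder.
legacy_options = {
    "enable_sampling": True,
    "sampling_threshold": 1_000_000,
    "chunk_size": 50_000,
}
cleaned, removed = strip_removed_sampling_options(legacy_options)
print("kept:", cleaned)
print("dropped:", removed)
```

Passing only the cleaned options forward avoids the `TypeError` that the removed parameters now trigger, and the list of dropped names can be logged so callers know which settings no longer have any effect.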
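
## Appendix: Chunked Processing Sketch

To illustrate why processing 100% of the data in chunks is deterministic and loses nothing relative to a single full pass, here is a minimal sketch using a straight-line least-squares fit whose sums are accumulated chunk by chunk. This is not NLSQ's streaming optimizer, which targets nonlinear models; the function names `iter_chunks` and `fit_line_streaming` are assumptions made for illustration.

```python
def iter_chunks(xs, ys, chunk_size):
    """Yield consecutive (x, y) chunks that cover every data point exactly once."""
    for start in range(0, len(xs), chunk_size):
        yield xs[start:start + chunk_size], ys[start:start + chunk_size]


def fit_line_streaming(chunks):
    """Fit y = slope * x + intercept from chunks using accumulated sums."""
    n = sx = sy = sxx = sxy = 0.0
    for cx, cy in chunks:
        # Only the running sums are kept; each chunk can be discarded afterwards.
        for x, y in zip(cx, cy):
            n += 1.0
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    denom = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Synthetic noise-free data on the line y = 2x + 1.
xs = [float(i) for i in range(1000)]
ys = [2.0 * x + 1.0 for x in xs]

chunked = fit_line_streaming(iter_chunks(xs, ys, chunk_size=128))
single_pass = fit_line_streaming([(xs, ys)])
print(chunked, single_pass)
```

Because every point is visited once and in a fixed order, the chunked fit matches the single-pass fit and repeats identically across runs, which is the property a random subsample cannot guarantee.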