Homodyne NLSQ Fitting Architecture

Complete documentation of the NLSQ (Nonlinear Least Squares) fitting system in homodyne.

Version: 2.23.2 Last Updated: May 2026

Table of Contents

  1. High-Level Architecture

  2. Setup Phase

  3. Global Optimization Selection

  4. CMA-ES Global Optimization

  5. Adapter Selection

  6. Memory & Strategy Selection

  7. Stratification Decision

  8. Residual Function Setup

  9. Anti-Degeneracy Defense System

  10. Strategy Execution

  11. Error Recovery

  12. Result Building

  13. Quick Reference Tables

  14. Key Files Reference


High-Level Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
                              USER ENTRY POINTS                                   
                                                                                  
   fit_nlsq_jax(data, config)     fit_nlsq_multistart()     fit_with_cmaes()     
         (core.py)                    (core.py)             (cmaes_wrapper.py)   
                                   [line ~1402]                                  
                              (run_multistart_nlsq()                             
                               is in multistart.py)                              
                                                                               
                                Latin Hypercube             CMA-ES Global        
                                N starting points           Optimization         
                                                                               
              └───────────────────────────┴────────────────────────┘              
                                                                                 
├─────────────────────────────────────────────────────────────────────────────────┤
   NLSQ ALSO SERVES AS WARM-START FOR CMC (v2.20.0)                              
                                                                                  
   CLI: homodyne --method cmc  Automatically runs NLSQ first                    
   API: fit_mcmc_jax(data, config, nlsq_result=fit_nlsq_jax(...))               
                                                                                  
   Impact: Reduces CMC divergence rate from ~28% to <5%                          
└─────────────────────────────────────────────────────────────────────────────────┘
                                           
═══════════════════════════════════════════╪═══════════════════════════════════════
                                           
┌─────────────────────────────────────────────────────────────────────────────────┐
                            1. SETUP PHASE                                        
                               (core.py)                                          
└─────────────────────────────────────────────────────────────────────────────────┘
                                           
═══════════════════════════════════════════╪═══════════════════════════════════════
                                           
┌─────────────────────────────────────────────────────────────────────────────────┐
                    2. GLOBAL OPTIMIZATION SELECTION                              
                                                                                  
        enable_cmaes=True? ────YES───▶ CMA-ES Global Search + NLSQ Refinement    
                                            (cmaes_wrapper.py)                   
             NO                                                                   
                                                                                 
        enable_multi_start=True? YES─▶ Multi-Start Latin Hypercube              
                                            (multistart.py)                      
             NO                                                                   
                                                                                 
        Local Optimization (continue to Adapter Selection)                        
└─────────────────────────────────────────────────────────────────────────────────┘
                                           
═══════════════════════════════════════════╪═══════════════════════════════════════
                                           
┌─────────────────────────────────────────────────────────────────────────────────┐
                      3. ADAPTER SELECTION (T021-T024)                            
                                                                                  
                    NLSQAdapter ◄──────────► NLSQWrapper                          
                   (with fallback)         (stable fallback)                      
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                     4. MEMORY & STRATEGY SELECTION                               
                             (memory.py)                                          
                                                                                  
              STANDARD ◄────► OUT_OF_CORE ◄────► HYBRID_STREAMING                 
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                      5. STRATIFICATION DECISION                                  
                       (strategies/chunking.py)                                   
                                                                                  
                    Angle-Stratified Chunking (if needed)                         
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                       6. RESIDUAL FUNCTION SETUP                                 
                                                                                  
         StratifiedResidualFunction ◄────► StratifiedResidualFunctionJIT          
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
               7. ANTI-DEGENERACY DEFENSE SYSTEM (laminar_flow only)              
                                                                                  
     Layer 1: Fourier Reparameterization                                          
     Layer 2: Hierarchical Optimization                                           
     Layer 3: Adaptive CV Regularization                                          
     Layer 4: Gradient Collapse Monitor                                           
     Layer 5: Shear-Sensitivity Weighting                                         
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                         8. STRATEGY EXECUTION                                    
                        (strategies/executors.py)                                 
                                                                                  
     StandardExecutor  LargeDatasetExecutor  StreamingExecutor                  
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                          9. ERROR RECOVERY                                       
                            (wrapper.py)                                          
                                                                                  
                    3-Attempt Recovery with Perturbation                          
└─────────────────────────────────────────────────────────────────────────────────┘
                                    
════════════════════════════════════╪══════════════════════════════════════════════
                                    
┌─────────────────────────────────────────────────────────────────────────────────┐
                          10. RESULT BUILDING                                     
                          (result_builder.py)                                     
                                                                                  
                         OptimizationResult                                       
└─────────────────────────────────────────────────────────────────────────────────┘

1. Setup Phase

File: core.py

Input Validation & Conversion

┌───────────────────────────────────────────────────────────────────────────┐
 Input Validation & Conversion                                              
                                                                            
  Validate analysis_mode: static_isotropic (3 phys) | laminar_flow (7)    
  Data format conversion:                                                  
     phi_angles_list  phi                                                  
     c2_exp  g2                                                            
     wavevector_q_list[0]  q (scalar)                                      
  Extract 1D time vectors from 2D meshgrids:                               
     t1_2d[:, 0]  t1_1d                                                    
     t2_2d[0, :]  t2_1d                                                    
  Generate default sigma (1%) if missing                                   
  Load L (stator_rotor_gap or sample_detector_distance) from config        
└───────────────────────────────────────────────────────────────────────────┘

Parameter Initialization

┌───────────────────────────────────────────────────────────────────────────┐
 Parameter Initialization                                                   
                                                                            
  Load initial params from config OR estimate from data                    
  Setup parameter bounds via ParameterManager                              
  Parameter structure:                                                     
                                                                            
   static_isotropic:  [contrast, offset, D0, alpha, D_offset]              
   laminar_flow:      [contrast, offset, D0, alpha, D_offset,              
                       gamma_dot_t0, beta, gamma_dot_t_offset, phi0]       
                                                                            
   With per-angle scaling (n_phi angles):                                   
     [c0, c1, ..., c_{n-1}, o0, o1, ..., o_{n-1}, *physical_params]        
└───────────────────────────────────────────────────────────────────────────┘

2. Global Optimization Selection

File: core.py

Decision Flow

┌───────────────────────────────────────────────────────────────────────────┐
 Global Optimization Selection                                              
                                                                            
   NLSQConfig fields control optimization strategy:                         
                                                                            
   enable_cmaes = True?                                                     
                                                                           
      YES ──▶ Check scale ratio via CMAESWrapper.should_use_cmaes()        
                                                                          
                ├─ scale_ratio > threshold (1000) ──▶ CMA-ES Global       
                                                                          
                └─ scale_ratio  threshold ──▶ Fall back to multi-start   
                                                                           
      NO                                                                    
                                                                           
   enable_multi_start = True?                                               
                                                                           
      YES ──▶ Multi-start with Latin Hypercube Sampling                    
                                                                           
      NO ──▶ Local Optimization (Adapter Selection)                        
└───────────────────────────────────────────────────────────────────────────┘

Scale Ratio Computation

# CMAESWrapper.compute_scale_ratio()
scale_ratio = max(param_ranges) / min(param_ranges)

# Example: laminar_flow mode
D bounds:    [100, 50000]      range = 49,900
γ̇ bounds:   [1e-6, 0.01]      range = 0.00999
scale_ratio = 49,900 / 0.00999  5 × 10   ──▶ CMA-ES recommended

Global Optimization Comparison

| Method | Best For | Convergence | Memory | |——–|———-|————-|——–| | CMA-ES | Multi-scale problems (ratio > 1000), complex landscapes | Global (covariance adaptation) | Bounded | | Multi-start | Multiple local minima, degeneracy detection | Local from N starting points | N × single fit | | Local (TRF) | Well-behaved problems, good initial guess | Local (quadratic) | Jacobian-based |


3. CMA-ES Global Optimization

File: cmaes_wrapper.py

When to Use CMA-ES

CMA-ES (Covariance Matrix Adaptation Evolution Strategy) excels when:

  • Multi-scale parameters: D₀ ~ 10⁴ vs γ̇₀ ~ 10⁻³ (scale ratio > 10⁶)

  • Complex loss landscapes: Multiple local minima, saddle points

  • Poor initial guess: CMA-ES explores globally, not just locally

  • laminar_flow mode: 7 physical params with vastly different scales

Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
                        CMA-ES GLOBAL OPTIMIZATION                                
                          (cmaes_wrapper.py)                                      
╠═════════════════════════════════════════════════════════════════════════════════╣
                                                                                  
   ┌─────────────────────────────────────────────────────────────────────────┐   
    PHASE 1: CMA-ES GLOBAL SEARCH                                               
             (NLSQ CMAESOptimizer with evosax backend)                          
                                                                                
       Population-based evolutionary optimization                              
       Covariance matrix adapts to parameter scales                            
       BIPOP restart strategy (alternating large/small populations)            
       Memory batching: population_batch_size, data_chunk_size                 
                                                                                
      Presets:                                                                  
        cmaes-fast:   50 generations (quick exploration)                        
        cmaes:       100 generations (balanced, default)                        
        cmaes-global: 200 generations (thorough search)                         
                                                                                
      Config: preset, max_generations, sigma, tol_fun, tol_x                    
              restart_strategy="bipop", max_restarts=9                          
   └─────────────────────────────────────────────────────────────────────────┘   
                                                                                 
                                                                                 
   ┌─────────────────────────────────────────────────────────────────────────┐   
    PHASE 2: NLSQ TRF REFINEMENT (if refine_with_nlsq=True)                     
             (explicit refinement via _run_nlsq_refinement)                     
                                                                                
       Uses NLSQ curve_fit with workflow="auto"                                
       Memory-aware: auto-selects standard/chunked/streaming                   
       Tighter tolerances than CMA-ES (ftol=1e-10 vs 1e-8)                     
       Provides proper covariance matrix via Jacobian                          
                                                                                
      Config: refinement_workflow="auto"                                        
              refinement_ftol=1e-10, refinement_xtol=1e-10                      
              refinement_gtol=1e-10, refinement_max_nfev=500                    
              refinement_loss="linear"                                          
   └─────────────────────────────────────────────────────────────────────────┘   
                                                                                 
                                                                                 
   ┌─────────────────────────────────────────────────────────────────────────┐   
    RESULT: CMAESResult                                                         
                                                                                
       parameters: optimized values                                            
       covariance: from NLSQ refinement (proper uncertainties)                 
       chi_squared: final fit quality                                          
       diagnostics: generations, evaluations, restarts, improvement %          
       nlsq_refined: True if refinement succeeded                              
   └─────────────────────────────────────────────────────────────────────────┘   
└─────────────────────────────────────────────────────────────────────────────────┘

CMA-ES vs NLSQ Internal Refinement

| Aspect | NLSQ Internal | Homodyne Explicit | |——–|—————|——————-| | workflow | "auto" | "auto" (configurable) | | ftol | ~1.49e-8 (default) | 1e-10 (tighter) | | xtol | ~1.49e-8 (default) | 1e-10 (tighter) | | gtol | ~1.49e-8 (default) | 1e-10 (tighter) | | max_nfev | Unbounded | 500 (bounded) | | Configurability | None | Full YAML config |

Configuration

optimization:
  nlsq:
    cmaes:
      enable: true                      # Enable CMA-ES global optimization
      preset: "cmaes"                   # "cmaes-fast", "cmaes", "cmaes-global"
      max_generations: 100              # Maximum CMA-ES generations
      sigma: 0.5                        # Initial step size (global search)
      sigma_warmstart: 0.05             # Reduced step size when warm-start is active
      nlsq_warmstart: true              # Run NLSQ first, use as CMA-ES starting point
      warmstart_auto_skip: true         # Skip CMA-ES if warm-start is good enough
      warmstart_skip_threshold: 5.0     # Skip if warm-start reduced chi2 < threshold
      tol_fun: 1.0e-8                   # Function tolerance
      tol_x: 1.0e-8                     # Parameter tolerance
      restart_strategy: "bipop"         # "none" or "bipop"
      max_restarts: 9                   # Maximum BIPOP restarts
      population_batch_size: null       # null = auto, or explicit batch size
      data_chunk_size: null             # null = auto, or explicit chunk size
      refine_with_nlsq: true            # Refine with NLSQ TRF after CMA-ES
      auto_select: true                 # Auto-select CMA-ES vs multi-start
      scale_threshold: 1000.0           # Scale ratio threshold for auto-selection
      memory_limit_gb: 8.0              # Memory limit for auto-configuration

      # NLSQ TRF Refinement Settings
      refinement_workflow: "auto"       # "auto", "standard", "streaming"
      refinement_ftol: 1.0e-10          # Tighter for local refinement
      refinement_xtol: 1.0e-10
      refinement_gtol: 1.0e-10
      refinement_max_nfev: 500          # Bounded iterations
      refinement_loss: "linear"         # "linear", "soft_l1", "huber"

Usage Example

from homodyne.optimization.nlsq.cmaes_wrapper import CMAESWrapper, CMAESWrapperConfig

# Create wrapper with custom config
wrapper = CMAESWrapper(CMAESWrapperConfig(
    preset="cmaes",
    refine_with_nlsq=True,
    refinement_ftol=1e-10,
))

# Check if CMA-ES is appropriate for this problem
if wrapper.should_use_cmaes(bounds, scale_threshold=1000):
    result = wrapper.fit(model_func, xdata, ydata, p0, bounds)
    print(f"Chi²: {result.chi_squared:.4e}")
    print(f"Refined: {result.nlsq_refined}")
    print(f"Improvement: {result.diagnostics.get('chi_squared_improvement', 0):.2%}")

4. Adapter Selection

Decision Point: T021-T024

┌─────────────────────────────────────────────────────────────────────────────────┐
                      ADAPTER SELECTION (T021-T024)                               
                                                                                  
                           use_adapter=True?                                      
                                                                                 
                    ┌─────────────┴─────────────┐                                 
                   YES                         NO                                 
                                                                                
                                                                                
   ┌────────────────────────────────┐                                            
      Try NLSQAdapter (adapter.py)                                             
                                                                               
     Model caching via                                                        
      WeakValueDictionary                                                      
      (3-5× speedup for multistart)                                            
     JIT compilation via                                                      
      NLSQ CurveFit class          │           │                                 │
      (2-3× per-iteration speedup)                                             
     Native NLSQ stability                                                    
   └────────────────────────────────┘                                            
                                                                                
               Success?                                                          
                                                                                
          ┌────────┴────────┐                                                    
         YES               NO                                                    
                                                                               
                                                                               
   ┌────────────┐    ┌─────────────────────────────────────────────────────────┐ 
     Return                    NLSQWrapper (wrapper.py)                       
     Result                                                                   
                      Full anti-degeneracy defense system (5 layers)        
    fallback          3-attempt error recovery with parameter perturbation  
    =False            Streaming mode support (AdaptiveHybridStreaming)      
   └────────────┘       Per-angle consistent initialization (v2.7.1 fix)      
                        Angle-stratified chunking for per-angle scaling       
                                                                                
                       Sets: fallback_occurred=True (if came from adapter)     
                     └─────────────────────────────────────────────────────────┘ 
└─────────────────────────────────────────────────────────────────────────────────┘

Adapter Comparison

| Feature | NLSQAdapter | NLSQWrapper | |———|————-|————-| | Model caching | Yes (3-5× speedup) | No | | JIT compilation | Yes (2-3× speedup) | No | | Anti-degeneracy | No | Yes (5 layers) | | Error recovery | Basic | 3-attempt with perturbation | | Streaming mode | No | Yes | | Best for | Standard fits, multistart | Complex laminar_flow, large datasets |


5. Memory & Strategy Selection

File: memory.py

Memory Estimation

peak_memory_gib = n_points × n_params × 8 bytes × 6.5 / 1024**3

6.5× Jacobian Overhead Factor (v2.14.0):
├─ 1.0× Base Jacobian matrix
├─ 2.0× Autodiff intermediates (JAX gradient computation)
├─ 1.5× Stratified array padding (copy + padding overhead)
├─ 1.5× JIT compilation buffers (XLA, remat traces)
└─ 0.5× Optimizer working memory (QR, trust-region storage)

Example: 23M points × 53 params × 8 × 6.5 / 1024**3  59.2 GiB

Adaptive Threshold

threshold = total_system_RAM × memory_fraction

memory_fraction = 0.75 (default, override via NLSQ_MEMORY_FRACTION env var)
Detection priority: psutil  sysconf  fallback 16 GB

Example: 64 GB system × 0.75 = 48 GB threshold

Strategy Decision Tree

┌───────────────────────────────────────────────────────────────────────────┐
                     STRATEGY DECISION TREE                                 
                    select_nlsq_strategy()                                  
                                                                            
   Decision is purely memory-based (no n_points branching):                
                                                                            
   1. index_memory > threshold?                                             
                                                                           
             YES ──────────────────────────────▶ HYBRID_STREAMING          
                                                (extreme scale)           
             NO                                                             
                                                                           
   2. peak_memory > threshold?                                              
                                                                           
             YES ──────────────────────────────▶ OUT_OF_CORE               
                                                (large scale)             
             NO                                                             
                                                                           
                                                                           
           STANDARD                                                         
           (in-memory)                                                      
└───────────────────────────────────────────────────────────────────────────┘

Strategy to Executor Mapping

NLSQStrategy Enum  Executor Mapping:
├─ STANDARD          StandardExecutor       curve_fit()
├─ OUT_OF_CORE       LargeDatasetExecutor   curve_fit_large()
└─ HYBRID_STREAMING  StreamingExecutor      AdaptiveHybridStreamingOptimizer

6. Stratification Decision

File: strategies/chunking.py

Decision Logic

┌───────────────────────────────────────────────────────────────────────────┐
 should_use_stratification() Decision Tree                                  
                                                                            
   Dataset < 100k points?     ──YES──▶ NO  (STANDARD, no chunking)         
                                                                           
         NO                                                                 
                                                                           
   No per-angle scaling?      ──YES──▶ NO  (regular chunking works)        
                                                                           
         NO                                                                 
                                                                           
   Single phi angle?          ──YES──▶ NO  (no stratification needed)      
                                                                           
         NO                                                                 
                                                                           
   Imbalance ratio > 5.0?     ──YES──▶ NO  (use sequential per-angle)      
                                                                           
         NO                                                                 
                                                                           
        YES ──▶ Use angle-stratified chunking                              
└───────────────────────────────────────────────────────────────────────────┘

Angle-Stratified Chunking

Problem Solved:

NLSQ’s default chunking splits data ARBITRARILY without angle awareness:

  • Some chunks may have NO points for angle[i]

  • Gradient w.r.t. contrast[i] = ZERO

  • Optimization fails silently (no convergence)

Solution: Round-Robin Distribution

1. Group data by phi angle
2. Distribute points from EACH angle across ALL chunks
3. Cyclic stratification ensures ALL data used (Jan 2026 fix)

Example (3 angles, 1.2M points, target 100k chunks):
  Angle A: 600k pts  distributed as 50k to each of 12 chunks
  Angle B: 400k pts  distributed as 33k to each of 12 chunks
  Angle C: 200k pts  distributed as 17k to each of 12 chunks
  ──────────────────────────────────────────────────────────
  Result: 12 chunks of ~100k each, ALL angles in EACH chunk

Key Functions:

  • create_angle_stratified_data() - Full copy approach

  • create_angle_stratified_indices() - Zero-copy approach (~1% memory overhead)


7. Residual Function Setup

Comparison

| Feature | StratifiedResidualFunction | StratifiedResidualFunctionJIT | |———|—————————|——————————| | File | strategies/residual.py | strategies/residual_jit.py | | Callable | Python | JIT-compiled | | Shapes | Dynamic | Static (padded) | | Parallelization | Vectorized indexing | vmap over chunks | | Memory | Lower | Slightly higher (padding) | | Speed | 15-20% speedup | 2-3× after JIT compile |

Residual Computation Flow

1. Compute g2_theory on global (phi, t1, t2) grid ONCE (not per-chunk)
2. Extract values via pre-computed flat indices (vectorized)
3. residual = (g2_data - g2_theory) / sigma
4. Mask diagonals: jnp.where(t1_idx != t2_idx, residual, 0.0)
5. Return flattened residuals for least-squares optimizer

Per-angle parameter handling:
  params = [c0, c1, ..., c_n, o0, o1, ..., o_n, *physical]
  Each angle uses its corresponding contrast[i], offset[i]

Optimizations

StratifiedResidualFunction:

  • Pre-compute global unique (phi, t1, t2) at init

  • Pre-compute flat indices (avoid searchsorted in loop)

  • Vectorized indexing (15-20% per-iter speedup)

StratifiedResidualFunctionJIT:

  • Padded arrays to max_chunk_size (static shapes for JIT)

  • vmap parallelization over chunks

  • Boolean mask for real vs padded data


8. Anti-Degeneracy Defense System

Activation Conditions (ALL must be true):

  • analysis_mode = laminar_flow

  • per_angle_scaling = True

  • enable = True (AntiDegeneracyConfig.enable, default True)

Note: n_phi >= constant_scaling_threshold (default: 3) determines the per-angle mode (auto_averaged vs individual), not whether the system activates. The system activates for any laminar_flow run with per-angle scaling enabled.

Root Problem: Structural Degeneracy

With 23 angles: 46 per-angle params vs 7 physical params
Per-angle params have larger, more consistent gradients
 Optimizer "absorbs" angle-dependent shear signal into per-angle params
 Physical params (especially γ̇) collapse to bounds

Additional: Shear gradient L/γ̇  Σcos(φ-φ) suffers 94.6% cancellation
(11 positive + 12 negative cos contributions with 23 angles)

Layer 1: Fourier Reparameterization

File: fourier_reparam.py

Problem: Too many per-angle degrees of freedom

Solution: Replace n_phi independent params with truncated Fourier series
  contrast(φ) = c + Σₖ[cₖ×cos() + sₖ×sin()]  for k=1..order
  offset(φ)   = o + Σₖ[oₖ×cos() + tₖ×sin()]

Parameter Reduction (order=2, 5 coeffs per group):
  n_phi=10:   20  10 params (50% reduction)
  n_phi=23:   46  10 params (78% reduction)
  n_phi=100: 200  10 params (95% reduction)

Config: per_angle_mode="auto", fourier_order=2, fourier_auto_threshold=6

Layer 2: Hierarchical Optimization

File: hierarchical.py

Problem: Physical and per-angle params compete; gradients cancel

Solution: Alternating two-stage optimization

  for outer_iteration in range(max_outer):     # default 5
    Stage 1: Freeze per-angle  optimize physical (L-BFGS, 100 iters)
    Stage 2: Freeze physical  optimize per-angle (L-BFGS, 50 iters)
    Check convergence on physical params (tol=1e-6)

Why it works: Physical params can't be absorbed when per-angle frozen

Config: hierarchical.enable=true, max_outer_iterations=5

Layer 3: Adaptive CV Regularization

File: adaptive_regularization.py

Problem: Absolute variance regularization (λ=0.01) was too weak
         Contributed only ~0.05% to total loss

Solution: CV-based (Coefficient of Variation) relative regularization

  CV = std(params) / |mean(params)|
  L_reg = λ × CV² × MSE × n_points

  Auto-tunes λ to contribute ~10% to loss (100× stronger than v2.8)
  λ = target_contribution / target_cv²

Config: mode="relative", lambda=1.0, target_cv=0.10, target_contrib=0.10

Layer 4: Gradient Collapse Monitor

File: gradient_monitor.py

Problem: Need runtime detection of physical param gradient loss

Solution: Monitor gradient ratio with automatic response

  ratio = ‖∇_physical / ‖∇_per_angle

  if ratio < threshold for N consecutive iterations:
     trigger response action

Response Actions:
  "warn"         Log warning only
  "hierarchical" Switch to hierarchical mode (recommended)
  "reset"        Reset per-angle params to mean
  "abort"        Abort optimization

Config: enable=true, ratio_threshold=0.01, consecutive_triggers=5

Layer 5: Shear-Sensitivity Weighting

File: shear_weighting.py

Problem: Shear gradient L/γ̇  Σcos(φ-φ) suffers 94.6% cancellation
         With 23 angles spanning 360°: sum0, gradient0

Solution: Angle-dependent residual weighting

  L = Σ_φ w(φ) × Σ_τ (g2_model - g2_exp)²
  w(φ) = w_min + (1 - w_min) × |cos(φ - φ)|^α

Effect: Emphasizes shear-sensitive angles (where cos is large)
        Prevents gradient cancellation across angle sum

Config: enable=true, min_weight=0.3, alpha=1.0, normalize=true

9. Strategy Execution

File: strategies/executors.py

STANDARD Strategy (StandardExecutor)

╔═══════════════════════════════════════════════════════════════════════════╗
                        STANDARD STRATEGY                                   
                       (StandardExecutor)                                   
╠═══════════════════════════════════════════════════════════════════════════╣
                                                                            
   Uses: NLSQ curve_fit()                                                   
   Best for: < 1M points with sufficient RAM                                
                                                                            
   Memory Pattern: Full Jacobian J (n_points × n_params) in RAM             
                                                                            
   Algorithm:                                                               
     1. Compute residuals r(p) for current params p                         
     2. Compute full Jacobian J = r/p via JAX autodiff                    
     3. Solve (J^T J + λI) Δp = -J^T r  (trust-region)                      
     4. Update p  p + Δp                                                   
     5. Repeat until convergence                                            
                                                                            
   No chunking, no progress bar, fastest for small datasets                 
╚═══════════════════════════════════════════════════════════════════════════╝

OUT_OF_CORE Strategy (LargeDatasetExecutor)

╔═══════════════════════════════════════════════════════════════════════════╗
                       OUT_OF_CORE STRATEGY                                 
                     (LargeDatasetExecutor)                                 
╠═══════════════════════════════════════════════════════════════════════════╣
                                                                            
   Uses: NLSQ curve_fit_large()                                             
   Best for: 1M - 100M points                                               
                                                                            
   Memory Pattern: Chunked J^T J accumulation (never full J in RAM)         
                   Uses angle-stratified chunking if per_angle_scaling      
                                                                            
   Algorithm:                                                               
     for each iteration:                                                    
       J^T_J = 0, J^T_r = 0                                                 
       for each chunk (with progress bar):                                  
         J_chunk = autodiff(residual_fn, chunk_data)  # small J             ║
         r_chunk = residual_fn(chunk_data)                                  
         J^T_J += J_chunk^T @ J_chunk                 # accumulate          ║
         J^T_r += J_chunk^T @ r_chunk                 # accumulate          ║
       solve: (J^T_J + λI) Δp = -J^T_r                                      
       trust-region update                                                  
                                                                            
   Key insight: J^T J is only (n_params × n_params)  small                 
                Never need full J (n_points × n_params) in RAM              
╚═══════════════════════════════════════════════════════════════════════════╝

HYBRID_STREAMING Strategy (StreamingExecutor)

╔═══════════════════════════════════════════════════════════════════════════╗
                     HYBRID_STREAMING STRATEGY                              
                      (StreamingExecutor)                                   
╠═══════════════════════════════════════════════════════════════════════════╣
                                                                            
   Uses: NLSQ AdaptiveHybridStreamingOptimizer                              
   Best for: 100M+ points, memory-constrained, many phi angles              
                                                                            
   Memory Pattern: Bounded ~2 GB regardless of dataset size                 
                                                                            
   ┌─────────────────────────────────────────────────────────────────────┐ 
    PHASE 0: PARAMETER NORMALIZATION                                     
                                                                         
      Problem: Gradient magnitudes differ wildly (D~10 vs γ̇~10⁻³)    
      Solution: Bounds-based normalization to [0, 1]                     
        p_norm = (p - lower) / (upper - lower)                           
                                                                         
      Equalizes gradient magnitudes for balanced optimization            
   └─────────────────────────────────────────────────────────────────────┘ 
                                                                           
                                                                           
   ┌─────────────────────────────────────────────────────────────────────┐ 
    PHASE 1: L-BFGS WARMUP (100-500 iterations)                          
                                                                         
       Fast initial exploration of parameter space                      
       Gradient-only (no Jacobian storage required)                     
       Memory: O(n_params²) for inverse Hessian approx                  
       Adaptive switching to Phase 2 when progress stalls               
                                                                         
      If shear_weighting enabled:                                        
        Creates HierarchicalOptimizer for weighted L-BFGS                
   └─────────────────────────────────────────────────────────────────────┘ 
                                                                           
                                                                           
   ┌─────────────────────────────────────────────────────────────────────┐ 
    PHASE 2: STREAMING GAUSS-NEWTON (50 iterations)                      
                                                                         
       Quadratic convergence near minimum                               
       Exact J^T J accumulation via streaming (50k points/chunk)        
       Trust-region refinement for robustness                           
                                                                         
      for each chunk:                                                    
        stream chunk to device                                           
        compute J_chunk, r_chunk                                         
        accumulate J^T J, J^T r                                          
        free chunk memory                                                
      solve normal equations                                             
   └─────────────────────────────────────────────────────────────────────┘ 
                                                                           
                                                                           
   ┌─────────────────────────────────────────────────────────────────────┐ 
    PHASE 3: DENORMALIZATION + COVARIANCE                                
                                                                         
       Transform params back to physical scales:                        
          p = p_norm × (upper - lower) + lower                           
                                                                         
       Transform covariance to original parameter space:                
          Cov_orig = S @ Cov_norm @ S^T                                  
          where S = diag(upper - lower)                                  
                                                                         
       Proper uncertainties in physical units                           
   └─────────────────────────────────────────────────────────────────────┘ 
╚═══════════════════════════════════════════════════════════════════════════╝

Parallel Chunk Accumulation

File: nlsq/parallel_accumulator.py

The OOC and HYBRID_STREAMING strategies share a parallel acceleration layer for dispatching chunk computations to persistent workers and reducing their results.

┌───────────────────────────────────────────────────────────────────────────┐
 Parallel Accumulator Architecture                                         
                                                                           
  Two-tier parallelism:                                                    
                                                                           
  Tier 1  OOCComputePool (compute-level parallelism)                      
    Persistent process pool that computes JtJ, Jtr, chi2 per chunk.        
    Workers share flat data arrays via shared memory (zero-copy).           
    Each worker caches JIT kernels across L-M iterations  compile once.    
                                                                           
  Tier 2  accumulate_chunks_parallel (reduction-level parallelism)        
    Reduces pre-computed (JtJ, Jtr, chi2) tuples across workers.           
    Exploits associativity of matrix addition for safe partitioning.        
                                                                           
  Both tiers fall back to sequential when n_chunks < 10 or on failure.     
└───────────────────────────────────────────────────────────────────────────┘

Shared Memory Layout (OOCSharedArrays)

┌───────────────────────────────────────────────────────────────────────────┐
 OOCSharedArrays  Zero-Copy Shared Memory                                 
                                                                           
  Parent process:                                                          
    Allocates POSIX shared memory blocks for 5 flat arrays:                
      phi_flat, t1_flat, t2_flat, g2_flat, sigma_flat (optional)           
    Stores picklable refs: { shm_name, shape, dtype } per array            
                                                                           
  Worker processes:                                                        
    Attach to named shared memory blocks (create=False)                    
    Construct np.ndarray views over shm.buf (zero-copy)                    
    Workers read chunk slices via chunk_boundaries                         
                                                                           
  Lifecycle:                                                               
    with OOCSharedArrays(...) as shared:                                   
        pool = OOCComputePool(n_workers, shared, ...)                      
        ...  # iterate                                                     │
    # shared.__exit__ -> close + unlink all shm blocks                     │
    # pool.__exit__ -> shutdown executor                                   │
└───────────────────────────────────────────────────────────────────────────┘

OOCComputePool — Persistent Worker Pool

┌───────────────────────────────────────────────────────────────────────────┐
 OOCComputePool Lifecycle                                                  
                                                                           
  1. Create ProcessPoolExecutor (spawn context)                            
     initializer = _ooc_worker_init                                        
     initargs = (shm_refs, physics_config, chunk_boundaries,               
                 threads_per_worker)                                        
                                                                           
  2. Worker initialization (_ooc_worker_init):                             
     a. Thread pinning (OMP/MKL/OPENBLAS_NUM_THREADS)                      
     b. Float64 before JAX import (JAX_ENABLE_X64=true)                    
     c. Persistent JIT compilation cache                                   
        jax.config.update("jax_compilation_cache_dir", cache_dir)          
        jax.config.update("jax_persistent_cache_min_compile_time_secs", 0) 
     d. Attach to shared memory arrays (zero-copy np.ndarray views)        
     e. create_ooc_kernels() -> JIT kernel pair:                           
        compute_chunk_accumulators: (p, phi, t1, t2, g2, sigma) -> JtJ,    
                                    Jtr, chi2                              
        compute_chunk_chi2:         (p, ...) -> chi2 (no Jacobian)         
     f. Register atexit cleanup for shm handles                            
                                                                           
  3. Per L-M iteration:                                                    
     pool.compute_accumulators(params)                                     
       -> submit _ooc_compute_chunk(params, chunk_id) to all chunks        
       -> collect (JtJ, Jtr, chi2) tuples via as_completed                 
       -> reduce via accumulate_chunks_parallel or sequential              
                                                                           
  4. Shutdown: executor.shutdown(wait=True, cancel_futures=True)           
└───────────────────────────────────────────────────────────────────────────┘

JIT Kernel Factory (create_ooc_kernels)

┌───────────────────────────────────────────────────────────────────────────┐
 create_ooc_kernels()                                                      
                                                                           
  Inputs (physics constants, closed over):                                 
    per_angle_scaling, n_phi, phi_unique, t1_unique, t2_unique,            
    n_t1, n_t2, q_val, L_val, dt_val                                      
                                                                           
  Returns two @jax.jit kernels:                                            
                                                                           
    compute_chunk_accumulators(p, phi_c, t1_c, t2_c, g2_c, sigma):         
      1. residual_fn(p) -> r  (uses compute_g2_scaled from physics_nlsq)   
      2. J = jax.jacobian(residual_fn)(p)                                  
      3. JtJ = J^T @ J,  Jtr = J^T @ r                                    
      4. chi2 = sum(r^2 / sigma^2)  (or unweighted)                        
      -> (JtJ, Jtr, chi2)                                                  
                                                                           
    compute_chunk_chi2(p, phi_c, t1_c, t2_c, g2_c, sigma):                 
      -> chi2 only (no Jacobian, used for cost probes)                     
                                                                           
  Handles both per_angle_scaling (vmapped per phi) and scalar scaling      
  Single source of truth for sequential and parallel OOC paths             
└───────────────────────────────────────────────────────────────────────────┘

Parallel Reduction (accumulate_chunks_parallel)

┌───────────────────────────────────────────────────────────────────────────┐
 accumulate_chunks_parallel(chunks, n_workers)                             
                                                                           
  Gate: n_chunks >= 10 (_MIN_CHUNKS_FOR_PARALLEL)                          
        Falls back to sequential below threshold                           
                                                                           
  Algorithm:                                                               
    1. Round-robin partition chunks across n_workers                        
    2. Each worker: accumulate_chunks_sequential(partition)                 
       -> (partial_JtJ, partial_Jtr, partial_chi2, count)                  
    3. Main process: reduce partial sums via as_completed                   
       total_JtJ += partial_JtJ  (matrix addition is associative)          
       total_Jtr += partial_Jtr                                            
       total_chi2 += partial_chi2                                          
                                                                           
  Error handling:                                                          
    OSError, RuntimeError, PicklingError, TimeoutError (300s)              
    -> fallback to accumulate_chunks_sequential                            
                                                                           
  Mathematical correctness:                                                
    J^T J = sum_i(J_i^T J_i) regardless of summation order                 
    Parallel and sequential produce identical results                      
└───────────────────────────────────────────────────────────────────────────┘

10. Error Recovery

File: wrapper.py

3-Attempt Recovery System

┌───────────────────────────────────────────────────────────────────────────┐
 3-Attempt Recovery System                                                  
                                                                            
   Attempt 1: Primary optimization                                          
                                                                           
       └─ Failure? (bounds violation, NaN, convergence failure)             
                                                                           
                                                                           
   Attempt 2: Reset params to bounds center                                 
             p_reset = (lower + upper) / 2                                 
             Log recovery action                                           
                                                                           
       └─ Failure? (numerical instability)                                  
                                                                           
                                                                           
   Attempt 3: Reset to geometric mean                                       
             p_reset = sqrt(lower × upper)                                 
             Comprehensive logging                                         
                                                                           
       └─ Return result with recovery_actions in device_info                
                                                                            
   Recovery actions logged: ["bounds_reset", "geometric_mean_reset", ...]  
└───────────────────────────────────────────────────────────────────────────┘

11. Result Building

File: result_builder.py

OptimizationResult Structure

@dataclass
class OptimizationResult:
    # Core results
    parameters: np.ndarray        # Optimized parameter values
    uncertainties: np.ndarray     # sqrt(diag(covariance))
    covariance: np.ndarray        # Full covariance matrix
    chi_squared: float            # Sum of squared residuals
    reduced_chi_squared: float    # χ² / (n_points - n_params_effective)
    # NOTE (post-Round-10): n_params_effective = 2 * n_phi + n_physical for
    # auto_averaged mode (e.g., 53 for 23-angle laminar_flow: 2*23 + 7).
    # The compressed optimizer vector has only 2 + n_physical params (9),
    # but effective DOF for s² uses the expanded per-angle count (53).

    # Status
    success: bool                 # Optimization succeeded
    message: str                  # Detailed status message
    iterations: int               # Number of iterations
    execution_time: float         # Wall clock time (seconds)
    convergence_status: str       # "converged"|"partial"|"failed"
    quality_flag: str             # "good"|"marginal"|"poor"

    # Metadata
    device_info: dict
        adapter: str              # "NLSQAdapter"|"NLSQWrapper"
        fallback_occurred: bool   # True if adapter→wrapper fallback
        strategy: str             # "standard"|"out_of_core"|"hybrid"
        recovery_actions: list    # List of recovery attempts
        peak_memory_gb: float     # Estimated peak memory
        n_chunks: int             # Number of chunks (if chunked)

# Alias used in some contexts (same type)
NLSQResult = OptimizationResult

ResultBuilder Fluent Interface

result_builder.py provides a fluent builder that populates OptimizationResult across different pipeline stages without a giant constructor call:

result = (
    ResultBuilder()
    .set_parameters(params, uncertainties, covariance)
    .set_chi_squared(chi_sq, n_points, n_params_effective)
    .set_status(success=True, message="...", iterations=42)
    .set_start_time(t0)              # execution_time computed on build()
    .set_recovery_actions(["bounds_reset"])
    .set_optimization_info({"adapter": "NLSQWrapper", "strategy": "out_of_core"})
    .set_stratification_diagnostics(diag_dict)
    .build()                         # → OptimizationResult
)

build() validates that required fields are set and computes derived fields (execution_time, quality_flag, convergence_status). Raises ValueError if mandatory fields are missing.

Quality Flag Determination

├─ "good":     reduced_χ² < 2.0 AND n_at_bounds == 0
├─ "marginal": reduced_χ² < 5.0 AND n_at_bounds <= 2
└─ "poor":     otherwise

Quick Reference Tables

Strategy Selection

| Dataset Size | Peak Memory | Strategy | Executor | NLSQ Function | Memory Pattern | |————–|————-|———-|———-|—————|—————-| | < 1M pts | < threshold | STANDARD | StandardExecutor | curve_fit() | Full J in RAM | | 1M - 100M | < threshold | OUT_OF_CORE | LargeDatasetExecutor | curve_fit_large() | Chunked J^T J | | Any | > threshold | HYBRID_STREAMING | StreamingExecutor | AdaptiveHybridStreaming | Bounded ~2 GB |

Mode-Specific Parameters

| Mode | Physical Params | With 23-angle scaling | Anti-degeneracy | |——|—————-|———————-|—————–| | static_isotropic | 3: D₀, α, D_offset | 3 + 46 = 49 total | No | | laminar_flow | 7: + γ̇₀, β, γ̇_offset, φ₀ | 7 + 46 = 53 total | Yes (5 layers) |

Jacobian Overhead Factor Evolution

| Version | Factor | Justification | |———|——–|—————| | v2.7.0 | 3.0× | Initial estimate | | v2.11.0 | 4.0× | Added JAX overhead | | v2.13.0 | 5.0× | Padding + stratification | | v2.14.0 | 6.5× | Empirical validation on 23M dataset |


Key Files Reference

File

Purpose

core.py

Entry points: fit_nlsq_jax(), fit_nlsq_multistart() (line ~1402)

multistart.py

Lower-level run_multistart_nlsq() (called by fit_nlsq_multistart())

cmaes_wrapper.py

CMA-ES global optimization: CMAESWrapper, fit_with_cmaes()

config.py

NLSQConfig with CMA-ES and refinement settings

memory.py

select_nlsq_strategy(), memory estimation (6.5× factor)

adapter.py

NLSQAdapter with model caching via get_or_create_model()

wrapper.py

NLSQWrapper with full feature set + 3-attempt recovery

result_builder.py

ResultBuilder fluent API → OptimizationResult; computes quality_flag, convergence_status

results.py

OptimizationResult, NLSQResult alias, FallbackInfo, FunctionEvaluationCounter

parallel_accumulator.py

OOCComputePool, OOCSharedArrays, accumulate_chunks_parallel()

strategies/chunking.py

create_angle_stratified_data(), should_use_stratification()

strategies/residual.py

StratifiedResidualFunction (Python-callable)

strategies/residual_jit.py

StratifiedResidualFunctionJIT (JIT-compiled via vmap)

strategies/executors.py

StandardExecutor, LargeDatasetExecutor, StreamingExecutor

anti_degeneracy_controller.py

5-layer defense orchestration

fourier_reparam.py

Layer 1: Fourier parameter compression

hierarchical.py

Layer 2: Alternating physical/per-angle stages

adaptive_regularization.py

Layer 3: CV-based regularization

gradient_monitor.py

Layer 4: Runtime collapse detection

shear_weighting.py

Layer 5: Angle-dependent loss weighting


NLSQ as CMC Warm-Start Provider (v2.20.0)

The Problem: Cold-Start CMC Divergence

When CMC (Consensus Monte Carlo) starts from configuration initial values without NLSQ warm-start:

| Metric | Cold-Start CMC | NLSQ Warm-Start CMC | |——–|—————-|———————| | Divergence Rate | ~28% | <5% | | D0 Accuracy | -37% from NLSQ | Within 10% | | D_offset Accuracy | -92% from NLSQ | Within 20% | | Uncertainty Quality | Artificially small | Proper magnitude |

Root cause: NUTS adaptation wastes warmup iterations searching a 6+ order-of-magnitude parameter space when initial values are far from the posterior mode.

Automatic Warm-Start via CLI (v2.20.0)

┌───────────────────────────────────────────────────────────────────────────┐
 CLI AUTOMATIC NLSQCMC WORKFLOW                                           
                                                                           
   User runs: homodyne --method cmc --config my_config.yaml               
                                                                           
   CLI internally:                                                        
   1. Runs fit_nlsq_jax(data, config) FIRST                               
   2. Extracts optimized parameters from NLSQ result                      
   3. Passes to fit_mcmc_jax(data, config, nlsq_result=nlsq_result)       
   4. CMC uses NLSQ parameters as initial values                          
                                                                           
   To disable (NOT RECOMMENDED):                                          
     homodyne --method cmc --no-nlsq-warmstart                            
└───────────────────────────────────────────────────────────────────────────┘

API Usage

from homodyne.optimization.nlsq import fit_nlsq_jax
from homodyne.optimization.cmc import fit_mcmc_jax

# Step 1: Run NLSQ first
nlsq_result = fit_nlsq_jax(
    data=pooled_data,
    t1=t1_pooled,
    t2=t2_pooled,
    phi=phi_pooled,
    q=q,
    L=L,
    analysis_mode="laminar_flow",
    config=config.optimization.nlsq,
)

# Step 2: Pass NLSQ result to CMC for warm-start
cmc_result = fit_mcmc_jax(
    data=pooled_data,
    t1=t1_pooled,
    t2=t2_pooled,
    phi=phi_pooled,
    q=q,
    L=L,
    analysis_mode="laminar_flow",
    cmc_config=config.optimization.cmc,
    nlsq_result=nlsq_result,  # CRITICAL: Provides warm-start
)

Benefits of NLSQ Warm-Start

  1. Reduced Divergences: ~28% → <5% divergence rate

  2. Accurate Parameter Estimates: CMC results within 10-20% of NLSQ

  3. Proper Uncertainties: No artificial precision from biased combination

  4. Faster Convergence: NUTS adapts from near-mode instead of searching

  5. Consistent Parameterization: NLSQ’s per_angle_mode propagates to CMC

NLSQ and Bimodal CMC Posteriors (v2.22.0)

When per-shard CMC posteriors are bimodal (e.g., D₀ ~19K and ~32K), NLSQ typically converges to one mode. Standard consensus MC averages both modes into a density trough, producing a z-score > 100 disagreement with NLSQ. Mode-aware consensus (v2.22.0) decomposes bimodal shards into per-component GMM statistics and runs consensus separately per mode cluster, preserving both physical solutions. The NLSQ result then matches one of the two consensus modes rather than disagreeing with a corrupted average.