Benchmarking

The first intended public benchmark is NOD-MEG/NOD-EEG. NeuRepTrace does not download large public datasets automatically; stage the relevant subject epochs and metadata locally first.

Paper-facing benchmark manifest snapshots are stored in FlorianPfaff/NeuRepTrace-Paper. The commands below use paths under NeuRepTrace-Paper/benchmarks/, matching the GitHub workflow checkout path. If the paper repository is a sibling of this checkout, adjust the paths to ../NeuRepTrace-Paper/benchmarks/.

NOD Animate/Inanimate Pilot

Use the NOD preprocessed epochs file and the matching detailed events CSV. If the metadata already contains stim_is_animate, pass it directly to neureptrace.mne_time_decode or derive a named condition column first:

python -m neureptrace.metadata \
  --events-csv data/nod/sub-01_events.csv \
  --source-column stim_is_animate \
  --positive-pattern "True" \
  --label-column condition \
  --positive-label animate \
  --negative-label inanimate \
  --out data/nod/sub-01_metadata_animate.csv
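The derivation step can be pictured with a minimal standard-library sketch. It is not the neureptrace.metadata implementation: the function name add_condition_column is hypothetical, and it assumes the positive pattern is compared by exact string match, which the real CLI may handle differently.

```python
import csv
import io


def add_condition_column(rows, source_column="stim_is_animate",
                         positive_pattern="True", label_column="condition",
                         positive_label="animate", negative_label="inanimate"):
    """Return copies of the rows with a named binary label column added."""
    labelled = []
    for row in rows:
        new_row = dict(row)
        # Assumption: a row is positive when the source value equals the pattern.
        is_positive = str(row[source_column]).strip() == positive_pattern
        new_row[label_column] = positive_label if is_positive else negative_label
        labelled.append(new_row)
    return labelled


# Tiny in-memory stand-in for an events CSV (illustrative values only).
events_csv = io.StringIO("trial,stim_is_animate\n0,True\n1,False\n2,True\n")
labelled_rows = add_condition_column(list(csv.DictReader(events_csv)))
print([row["condition"] for row in labelled_rows])
```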

Then run the decoder:

python -m neureptrace.mne_time_decode \
  --epochs data/nod/sub-01_epo.fif \
  --metadata-csv data/nod/sub-01_metadata_animate.csv \
  --label-column condition \
  --group-column session \
  --tmin -0.1 \
  --tmax 0.8 \
  --window-ms 20 \
  --step-ms 10 \
  --out results/nod_sub-01_animate.csv \
  --observations-out results/nod_sub-01_animate_observations.csv \
  --emission-mode both

The output CSV contains fold-wise accuracy, log loss, Brier score, and expected calibration error for each time window.
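For orientation, the four metrics can be sketched for a binary task in plain Python. This is an illustration under stated assumptions, not the package's code: the function name binary_metrics is hypothetical, ECE uses ten equal-width confidence bins, and predictions threshold the positive-class probability at 0.5.

```python
import math


def binary_metrics(y_true, p_pos, n_bins=10):
    """Accuracy, log loss, Brier score, and ECE for binary probabilities.

    y_true holds 0/1 labels; p_pos holds predicted probabilities of class 1.
    """
    n = len(y_true)
    eps = 1e-12  # clip probabilities to avoid log(0)
    hits = 0
    log_loss = 0.0
    brier = 0.0
    # Per bin: [count, summed confidence, summed correctness].
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pos):
        hit = 1 if (1 if p >= 0.5 else 0) == y else 0
        hits += hit
        clipped = min(max(p, eps), 1.0 - eps)
        log_loss -= y * math.log(clipped) + (1 - y) * math.log(1.0 - clipped)
        brier += (p - y) ** 2
        confidence = max(p, 1.0 - p)  # probability of the predicted class
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index][0] += 1
        bins[index][1] += confidence
        bins[index][2] += hit
    # ECE: bin-size-weighted gap between mean confidence and accuracy.
    ece = sum(abs(conf - acc) for count, conf, acc in bins if count) / n
    return {"accuracy": hits / n, "log_loss": log_loss / n,
            "brier": brier / n, "ece": ece}


metrics = binary_metrics([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7])
print(metrics)
```

Accuracy only sees the thresholded prediction, while the other three metrics change whenever the probabilities change, which is why the output keeps all four per time window.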

For probability-driven model selection, tune inside each outer training fold against a probability objective, such as log loss or the Brier score (both proper scoring rules), instead of accuracy:

python -m neureptrace.mne_time_decode ... --tune-hyperparameters --tuning-scoring neg_log_loss
python -m neureptrace.mne_time_decode ... --tune-hyperparameters --tuning-scoring neg_brier

The optional observations CSV keeps the held-out decoder probabilities before they are reduced to accuracy or calibration summaries. Each row is one trial/time-window observation with the fold, time, sample index, true class, predicted class, confidence, probability assigned to the true class, and one prob_class_* column per class. This is the output to use when treating decoder traces as probabilistic evidence streams for HMMs or other temporal state models.
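A minimal sketch of turning such a file into per-trial probability time series follows. The header names sample_index and time are assumptions made for illustration; only the prob_class_ prefix is taken from the description above, so check the actual CSV header before reusing this.

```python
import csv
import io
from collections import defaultdict


def load_trial_traces(handle, trial_column="sample_index", time_column="time"):
    """Group observation rows into one time-ordered probability trace per trial."""
    reader = csv.DictReader(handle)
    prob_columns = [name for name in reader.fieldnames
                    if name.startswith("prob_class_")]
    traces = defaultdict(list)
    for row in reader:
        probs = [float(row[name]) for name in prob_columns]
        traces[row[trial_column]].append((float(row[time_column]), probs))
    for steps in traces.values():
        steps.sort(key=lambda step: step[0])  # order each trial by time
    return prob_columns, dict(traces)


# In-memory stand-in for an observations CSV (illustrative values only).
example = io.StringIO(
    "sample_index,time,prob_class_animate,prob_class_inanimate\n"
    "0,0.10,0.7,0.3\n"
    "0,0.00,0.5,0.5\n"
    "1,0.00,0.4,0.6\n"
)
class_columns, trial_traces = load_trial_traces(example)
print(class_columns, trial_traces["0"])
```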

Fit the first conservative temporal model from those observations:

python -m neureptrace.temporal_model \
  results/nod_sub-01_animate_observations.csv \
  --out-summary results/nod_sub-01_animate_temporal_model.csv \
  --out-states results/nod_sub-01_animate_state_trace.csv \
  --effect-window 0.1 0.8 \
  --baseline-window -0.1 0.0 \
  --n-permutations 100

This command treats each trial as a probability time series. The hidden states are the decoder classes, and the fitted parameter is the probability that the latent state persists between adjacent time bins. The summary reports persistence gain relative to a uniform-memory baseline, plus controls that shuffle time order, shuffle probability-label columns, and fit the same model in the pre-stimulus baseline window. The optional state trace CSV contains Viterbi states and posterior state probabilities for downstream sequence analyses.
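The persistence idea can be sketched as a one-parameter sticky hidden Markov model. The sketch below is an interpretation, not the neureptrace.temporal_model code: it assumes decoder probabilities act as emission likelihoods, a uniform initial state, a grid search over the stay probability, and a baseline in which the stay probability equals 1/K so that adjacent bins carry no memory. The function names are hypothetical.

```python
import math


def sticky_log_likelihood(trace, stay):
    """Forward-algorithm log-likelihood of one trial under a sticky transition model."""
    n_states = len(trace[0])
    switch = (1.0 - stay) / (n_states - 1)
    alpha = [1.0 / n_states] * n_states  # uniform initial state distribution
    total = 0.0
    for step, probs in enumerate(trace):
        if step > 0:
            mass = sum(alpha)
            # Predict: keep the state with probability `stay`, else move elsewhere.
            alpha = [stay * a + switch * (mass - a) for a in alpha]
        alpha = [a * p for a, p in zip(alpha, probs)]
        norm = sum(alpha)
        total += math.log(norm)
        alpha = [a / norm for a in alpha]
    return total


def fit_stay_probability(traces, grid_size=50):
    """Grid-search the stay probability; return it with its gain over no memory."""
    n_states = len(traces[0][0])
    uniform = 1.0 / n_states
    grid = [uniform + (0.999 - uniform) * i / (grid_size - 1)
            for i in range(grid_size)]
    scores = [(sum(sticky_log_likelihood(t, s) for t in traces), s) for s in grid]
    best_ll, best_stay = max(scores)
    baseline_ll = scores[0][0]  # stay = 1/K: transitions ignore the previous state
    return best_stay, best_ll - baseline_ll


# Two illustrative trials whose evidence favours one class throughout.
example_traces = [
    [[0.8, 0.2], [0.7, 0.3], [0.9, 0.1], [0.8, 0.2]],
    [[0.3, 0.7], [0.2, 0.8], [0.25, 0.75]],
]
best_stay, persistence_gain = fit_stay_probability(example_traces)
print(best_stay, persistence_gain)
```

Shuffling the time order of a trace before fitting should shrink this gain when the evidence really is temporally structured, which is the logic behind the shuffled-time control.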

When the observations contain both calibrated and uncalibrated emissions, compare which emission mode gives cleaner state inference:

python -m neureptrace.emission_compare \
  results/nod_sub-01_animate_temporal_model.csv \
  --out-csv results/nod_sub-01_animate_emission_compare.csv \
  --out-report results/nod_sub-01_animate_emission_compare.md

The comparison uses the temporal model's persistence gain. Its main value is the control margin: observed effect-window gain minus the strongest baseline-window, shuffled-time, or shuffled-label control gain. A positive calibrated-minus-uncalibrated margin is evidence that calibrated probabilities give cleaner state inference, not merely nicer reliability plots.
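The control margin arithmetic is small enough to sketch directly. The numbers below are made up for illustration and are not results; the dictionary keys and the function name control_margin are hypothetical.

```python
def control_margin(effect_gain, control_gains):
    """Effect-window gain minus the strongest control gain."""
    return effect_gain - max(control_gains.values())


# Made-up persistence gains for illustration only.
calibrated_margin = control_margin(
    0.30, {"baseline_window": 0.02, "shuffled_time": 0.05, "shuffled_labels": 0.01}
)
uncalibrated_margin = control_margin(
    0.22, {"baseline_window": 0.03, "shuffled_time": 0.08, "shuffled_labels": 0.02}
)
margin_difference = calibrated_margin - uncalibrated_margin
print(calibrated_margin, uncalibrated_margin, margin_difference)
```

Subtracting the strongest control rather than the average one makes the margin conservative: a single control that reproduces the effect is enough to cancel it.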

Ask the first NOD neuroscience question from the state traces:

python -m neureptrace.semantic_stages \
  results/nod_sub-01_animate_state_trace.csv \
  --out-time results/nod_sub-01_animate_semantic_stage_time.csv \
  --out-stages results/nod_sub-01_animate_semantic_stages.csv \
  --out-report results/nod_sub-01_animate_semantic_stages.md \
  --posterior-threshold 0.6 \
  --match-threshold 0.6 \
  --min-duration 0.04

This asks whether the decoded semantic category for each trial becomes a stable latent state over contiguous time ranges. For NOD this corresponds to category staging, such as animate or inanimate evidence emerging over a post-stimulus interval. For navigation or planning data, use the same output shape with spatial bins or task states in place of semantic classes; the resulting stable stages become candidate trajectory segments that still need the temporal-model controls above.
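One way to picture stage extraction for a single trial is as a run-length scan over the state trace. This sketch rests on assumptions: a stage is a contiguous run of one state whose posterior stays at or above the threshold, and duration is measured from the first to the last bin time. The function name stable_stages is hypothetical and the match threshold is not modelled.

```python
def stable_stages(times, states, posteriors,
                  posterior_threshold=0.6, min_duration=0.04):
    """Return (state, start, stop) runs where one state stays confidently decoded."""
    stages = []
    run_state, run_start, run_stop = None, None, None

    def close_run():
        # Keep the finished run only if it lasted long enough.
        if run_state is not None and run_stop - run_start >= min_duration - 1e-9:
            stages.append((run_state, run_start, run_stop))

    for time, state, posterior in zip(times, states, posteriors):
        confident = posterior >= posterior_threshold
        if confident and state == run_state:
            run_stop = time  # extend the current run
            continue
        close_run()
        if confident:
            run_state, run_start, run_stop = state, time, time
        else:
            run_state, run_start, run_stop = None, None, None
    close_run()
    return stages


# Illustrative single-trial trace, not real data.
example_stages = stable_stages(
    times=[0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12],
    states=["inanimate", "animate", "animate", "animate",
            "animate", "inanimate", "animate"],
    posteriors=[0.55, 0.70, 0.80, 0.90, 0.65, 0.70, 0.50],
)
print(example_stages)
```

In this example only the animate run from 0.02 s to 0.08 s survives; the single confident inanimate bin at 0.10 s is too short to count as a stage.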

Plot the single-subject result:

python -m neureptrace.plot_time_decode \
  results/nod_sub-01_animate.csv \
  --chance 0.5 \
  --title "NOD sub-01 animate/inanimate" \
  --out results/nod_sub-01_animate.png

After running several subjects, aggregate across subjects:

python -m neureptrace.results \
  results/nod_sub-01_animate.csv \
  results/nod_sub-02_animate.csv \
  results/nod_sub-03_animate.csv \
  --out results/nod_animate_summary.csv

Then plot the aggregate:

python -m neureptrace.plot_time_decode \
  results/nod_animate_summary.csv \
  --chance 0.5 \
  --title "NOD animate/inanimate summary" \
  --out results/nod_animate_summary.png

Manifest Runner

The same workflow can be run from a manifest:

python -m neureptrace.validate_manifest \
  NeuRepTrace-Paper/benchmarks/nod_animate_sub01.csv \
  --report-out results/nod_animate_sub01_validation.csv

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_sub01.csv \
  --out-dir results/nod_animate_sub01 \
  --aggregate-out results/nod_animate_sub01_summary.csv \
  --plot-out results/nod_animate_sub01_summary.png \
  --observation-dir results/nod_animate_sub01/observations \
  --emission-mode both \
  --chance 0.5

Manifest paths are resolved relative to the manifest file. The example manifest expects staged files under data/nod/.
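The resolution rule can be sketched with pathlib. The manifest entry below is hypothetical, and treating absolute entries as unchanged is an assumption rather than documented behaviour.

```python
from pathlib import PurePosixPath


def resolve_manifest_entry(manifest_file, entry):
    """Resolve a manifest path entry against the manifest file's directory."""
    entry_path = PurePosixPath(entry)
    if entry_path.is_absolute():
        return entry_path  # assumption: absolute entries are used unchanged
    return PurePosixPath(manifest_file).parent / entry_path


resolved = resolve_manifest_entry(
    "NeuRepTrace-Paper/benchmarks/nod_animate_sub01.csv",
    "data/nod/sub-01_epo.fif",  # hypothetical manifest entry
)
print(resolved)
```

The practical consequence is that the working directory of the command does not matter for manifest entries; only the manifest location does.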

Five-Subject Pilot

For a paper-ready first pass, use the same animate/inanimate task and run five subjects at once from a single manifest:

python -m neureptrace.validate_manifest \
  NeuRepTrace-Paper/benchmarks/nod_animate_first5.csv \
  --report-out results/nod_animate_first5_validation.csv

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_first5.csv \
  --out-dir results/nod_animate_first5 \
  --aggregate-out results/nod_animate_first5_summary.csv \
  --plot-out results/nod_animate_first5_summary.png \
  --chance 0.5

This keeps the experiment scope fixed (same preprocessing, same target labels, same window/grid parameters) and changes only the subject set.

Generate a compact Markdown report from the aggregate and subject-level result CSVs:

python -m neureptrace.report \
  results/nod_animate_first5_summary.csv \
  "results/nod_animate_first5/sub-*_time_decode.csv" \
  --chance 0.5 \
  --out results/nod_animate_first5/report.md

The report records the aggregate peak, baseline-window accuracy, effect-window accuracy, calibration metrics at the peak, and per-subject peaks.

Full NOD-EEG Pilot

After staging all available NOD-EEG preprocessed epoch files and detailed event files, validate the 19-subject manifest. This manifest uses 2 grouped folds because several subjects have only 2 unique session groups:

python -m neureptrace.validate_manifest \
  NeuRepTrace-Paper/benchmarks/nod_animate_all.csv \
  --report-out results/nod_animate_all_validation.csv

Then run the same animate/inanimate benchmark over every staged NOD-EEG subject:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_all.csv \
  --out-dir results/nod_animate_all \
  --aggregate-out results/nod_animate_all/summary.csv \
  --plot-out results/nod_animate_all/nod_animate_all_summary.png \
  --chance 0.5

Make calibration explicit in the benchmark report:

python -m neureptrace.calibration \
  results/nod_animate_all/summary.csv \
  --out-report results/nod_animate_all/calibration_report.md

The calibration report orders models by effect-window ECE, then Brier score and log loss. Accuracy is included as context, but the report is designed to keep probability quality visible rather than treating it as a secondary metric.
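The ordering can be read as a lexicographic sort, sketched below with made-up values. Treating Brier score and log loss as successive tie-breakers is an assumption about how the report ranks rows.

```python
# Made-up effect-window metrics for illustration only.
rows = [
    {"decoder": "linear_svm", "ece": 0.05, "brier": 0.22, "log_loss": 0.64},
    {"decoder": "lda", "ece": 0.03, "brier": 0.24, "log_loss": 0.66},
    {"decoder": "logistic", "ece": 0.03, "brier": 0.23, "log_loss": 0.65},
]

# Lower is better for all three metrics, so an ascending tuple sort suffices.
ranked = sorted(rows, key=lambda row: (row["ece"], row["brier"], row["log_loss"]))
print([row["decoder"] for row in ranked])
```

Here linear_svm has the best Brier score and log loss but still ranks last, because ECE is compared first.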

Run subject-level inference on the resulting subject CSVs:

python -m neureptrace.inference \
  "results/nod_animate_all/sub-*_time_decode.csv" \
  --chance 0.5 \
  --n-permutations 10000 \
  --cluster-alpha 0.05 \
  --out-time results/nod_animate_all/inference_time.csv \
  --out-clusters results/nod_animate_all/inference_clusters.csv

The inference command first averages folds within each subject, then runs a one-sided subject-level sign-flip test against chance at each time point. It also reports max-cluster-mass corrected p-values for contiguous above-threshold periods.
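The sign-flip logic with a max-cluster-mass correction can be sketched in plain Python. This is a simplified illustration, not the neureptrace.inference implementation: it uses a fixed t threshold instead of deriving one from a cluster alpha, reports only the largest cluster, and all function names are hypothetical. The accuracies are made up.

```python
import random
from statistics import mean, stdev


def t_values(rows):
    """One-sample t statistic per time point for subjects x times data."""
    n = len(rows)
    stats = []
    for column in zip(*rows):
        spread = stdev(column)
        stats.append(mean(column) / (spread / n ** 0.5) if spread > 0 else 0.0)
    return stats


def max_cluster_mass(stats, threshold):
    """Largest summed statistic over contiguous above-threshold time points."""
    best = current = 0.0
    for value in stats:
        current = current + value if value > threshold else 0.0
        best = max(best, current)
    return best


def sign_flip_cluster_test(accuracy, chance=0.5, threshold=2.0,
                           n_permutations=1000, seed=0):
    """One-sided test: flip each subject's whole time course around chance."""
    rng = random.Random(seed)
    centered = [[value - chance for value in subject] for subject in accuracy]
    observed = max_cluster_mass(t_values(centered), threshold)
    exceed = 0
    for _ in range(n_permutations):
        flipped = []
        for subject in centered:
            sign = rng.choice((-1.0, 1.0))  # one sign per subject keeps time structure
            flipped.append([sign * value for value in subject])
        if max_cluster_mass(t_values(flipped), threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)


# Made-up fold-averaged accuracies: 6 subjects x 5 time points.
subject_accuracy = [
    [0.50, 0.49, 0.58, 0.61, 0.57],
    [0.51, 0.50, 0.56, 0.60, 0.55],
    [0.49, 0.52, 0.59, 0.63, 0.58],
    [0.50, 0.48, 0.55, 0.59, 0.56],
    [0.52, 0.51, 0.57, 0.62, 0.54],
    [0.48, 0.50, 0.60, 0.64, 0.59],
]
observed_mass, p_value = sign_flip_cluster_test(subject_accuracy)
print(observed_mass, p_value)
```

With N subjects there are only 2^N distinct sign patterns, so the smallest attainable p-value shrinks with the number of subjects. This is one reason a 5-subject pilot is better suited to smoke testing than to reported claims.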

This larger run is the minimum useful scale for subject-level statistical testing. The 5-subject pilot is useful for smoke testing and early signal checking; reported claims should use the full staged manifest.

Second NOD-EEG Task

Use NeuRepTrace-Paper/benchmarks/nod_superclass_canine_device_all.csv for a second public task within the same staged NOD-EEG data. This task decodes ImageNet superclass labels canine versus device, using only trials whose super_class exactly matches one of those labels. The full staged set contains 7,293 canine trials and 6,950 device trials across 19 subjects.

python -m neureptrace.validate_manifest \
  NeuRepTrace-Paper/benchmarks/nod_superclass_canine_device_all.csv \
  --report-out results/nod_superclass_canine_device_all_validation.csv

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_superclass_canine_device_all.csv \
  --out-dir results/nod_superclass_canine_device_all \
  --aggregate-out results/nod_superclass_canine_device_all/summary.csv \
  --plot-out results/nod_superclass_canine_device_all/summary.png \
  --calibration-dir results/nod_superclass_canine_device_all/calibration \
  --chance 0.5

python -m neureptrace.report \
  results/nod_superclass_canine_device_all/summary.csv \
  "results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
  --chance 0.5 \
  --out results/nod_superclass_canine_device_all/report.md

python -m neureptrace.inference \
  "results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
  --chance 0.5 \
  --n-permutations 10000 \
  --cluster-alpha 0.05 \
  --out-time results/nod_superclass_canine_device_all/inference_time.csv \
  --out-clusters results/nod_superclass_canine_device_all/inference_clusters.csv

python -m neureptrace.calibration \
  results/nod_superclass_canine_device_all/summary.csv \
  "results/nod_superclass_canine_device_all/calibration/*_calibration_bins.csv" \
  --out-report results/nod_superclass_canine_device_all/calibration_report.md \
  --out-bins results/nod_superclass_canine_device_all/reliability_bins.csv

This gives the paper a second semantic benchmark without changing dataset, preprocessing, CV logic, or reporting machinery.

Next NOD-EEG Task

Use NeuRepTrace-Paper/benchmarks/nod_superclass_container_covering_all.csv for the next staged task. This contrast decodes ImageNet superclass labels container versus covering, using only trials whose super_class exactly matches one of those labels. Both labels are inanimate, so this task tests category decoding beyond the animate/inanimate distinction. The full staged set contains 5,215 container trials and 4,809 covering trials across all 19 subjects.

python -m neureptrace.validate_manifest \
  NeuRepTrace-Paper/benchmarks/nod_superclass_container_covering_all.csv \
  --report-out results/nod_superclass_container_covering_all_validation.csv

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_superclass_container_covering_all.csv \
  --out-dir results/nod_superclass_container_covering_all \
  --aggregate-out results/nod_superclass_container_covering_all/summary.csv \
  --plot-out results/nod_superclass_container_covering_all/summary.png \
  --calibration-dir results/nod_superclass_container_covering_all/calibration \
  --chance 0.5 \
  --resume

After the benchmark finishes, generate the same report, inference, calibration, and reliability outputs as for the canine/device superclass task:

python -m neureptrace.report \
  results/nod_superclass_container_covering_all/summary.csv \
  "results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
  --out results/nod_superclass_container_covering_all/report.md \
  --chance 0.5

python -m neureptrace.inference \
  "results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
  --chance 0.5 \
  --n-permutations 10000 \
  --out-time results/nod_superclass_container_covering_all/inference_time.csv \
  --out-clusters results/nod_superclass_container_covering_all/inference_clusters.csv

python -m neureptrace.calibration \
  results/nod_superclass_container_covering_all/summary.csv \
  "results/nod_superclass_container_covering_all/calibration/*_calibration_bins.csv" \
  --out-report results/nod_superclass_container_covering_all/calibration_report.md \
  --out-bins results/nod_superclass_container_covering_all/reliability_bins.csv

python -m neureptrace.plot_calibration \
  results/nod_superclass_container_covering_all/reliability_bins.csv \
  --out results/nod_superclass_container_covering_all/reliability.png \
  --time-window 0.1 0.8 \
  --title "NOD container/covering calibration"

Decoder Comparison

NeuRepTrace supports the following standard probability-producing decoders, selected through the decoder manifest column or the --decoder CLI option:

logistic: balanced multinomial logistic regression;
sparse_logistic: L1-regularized balanced logistic regression with the SAGA solver;
elastic_net_logistic: balanced logistic regression with SAGA elastic-net regularization;
ridge: calibrated balanced ridge classifier;
gaussian_nb: Gaussian naive Bayes;
lda: linear discriminant analysis;
shrinkage_lda: LDA with LSQR covariance shrinkage estimated inside each training fold;
linear_svm: calibrated balanced linear support vector machine.

Run the first-five-subject decoder comparison:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_decoders_first5.csv \
  --out-dir results/nod_animate_decoders_first5 \
  --aggregate-out results/nod_animate_decoders_first5/summary.csv \
  --plot-out results/nod_animate_decoders_first5/summary.png \
  --calibration-dir results/nod_animate_decoders_first5/calibration \
  --chance 0.5

When a manifest contains a decoder column, result files are named like sub-01_logistic_time_decode.csv, and aggregate summaries preserve the decoder column rather than averaging decoders together.

Manifests can also pin fold-local feature preprocessing and nested decoder tuning. The relevant columns are feature_preprocessor, pca_components, tune_hyperparameters, tuning_cv_splits, tuning_scoring, and tuning_c_grid. Supported feature preprocessors are none, pca, pca-whiten, and anova-select. For anova-select, pca_components is the percentage of highest-scoring ANOVA F-test features kept inside each training fold. These settings are preserved in aggregate summaries, so tuned and untuned variants are never averaged together accidentally.
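Why preserved settings prevent accidental averaging can be shown with a grouped mean. The row layout, the accuracy column, and the function name are assumptions for illustration, and the values are made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_condition(rows, keys=("decoder", "feature_preprocessor",
                                       "tune_hyperparameters", "time")):
    """Average accuracy across subjects separately for every run condition."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row["accuracy"])
    return {condition: mean(values) for condition, values in groups.items()}


# Made-up subject-level rows: an untuned and a tuned logistic variant.
subject_rows = [
    {"subject": "sub-01", "decoder": "logistic", "feature_preprocessor": "none",
     "tune_hyperparameters": "false", "time": "0.1", "accuracy": 0.60},
    {"subject": "sub-02", "decoder": "logistic", "feature_preprocessor": "none",
     "tune_hyperparameters": "false", "time": "0.1", "accuracy": 0.70},
    {"subject": "sub-01", "decoder": "logistic", "feature_preprocessor": "pca-whiten",
     "tune_hyperparameters": "true", "time": "0.1", "accuracy": 0.75},
]
aggregated = aggregate_by_condition(subject_rows)
print(aggregated)
```

Because the settings are part of the grouping key, the tuned pca-whiten row never enters the mean of the untuned rows.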

Each benchmark also writes provenance.csv. This table has one row per run condition and records decoder, emission mode, PCA mode/components, tuning grid, compact selected-parameter counts, temporal mode/train window, and selected plus effect-window accuracy, log_loss, brier, and ece values. Use it to check whether an apparent gain shows up in accuracy as well as in the calibration metrics.

Run the tuned PCA-whitened logistic variant over all 19 staged subjects:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_pca_whiten_all.csv \
  --out-dir results/nod_animate_logistic_tuned_pca_whiten_all \
  --aggregate-out results/nod_animate_logistic_tuned_pca_whiten_all/summary.csv \
  --plot-out results/nod_animate_logistic_tuned_pca_whiten_all/summary.png \
  --calibration-dir results/nod_animate_logistic_tuned_pca_whiten_all/calibration \
  --chance 0.5 \
  --resume

That manifest uses feature_preprocessor=pca-whiten, pca_components=0.95, tune_hyperparameters=true, a 2-fold inner CV, balanced-accuracy scoring, and the C grid 0.01,0.1,1,10,100. PCA whitening and C tuning are fitted only on the training split for each outer fold.

Run the tuned ANOVA feature-selection logistic variant over all 19 staged subjects:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_anova_select_all.csv \
  --out-dir results/nod_animate_logistic_tuned_anova_select_all \
  --aggregate-out results/nod_animate_logistic_tuned_anova_select_all/summary.csv \
  --plot-out results/nod_animate_logistic_tuned_anova_select_all/summary.png \
  --calibration-dir results/nod_animate_logistic_tuned_anova_select_all/calibration \
  --chance 0.5 \
  --resume

That manifest uses anova-select with an initial 20 percent setting, then tunes both the selected feature percentile (10,20,40,60) and logistic C with 2-fold inner CV. This tests whether supervised fold-local denoising helps the main animate/inanimate task without changing the outer held-out folds.

Run an explicit shrinkage-LDA variant over all 19 staged subjects:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_shrinkage_lda_all.csv \
  --out-dir results/nod_animate_shrinkage_lda_all \
  --aggregate-out results/nod_animate_shrinkage_lda_all/summary.csv \
  --plot-out results/nod_animate_shrinkage_lda_all/summary.png \
  --calibration-dir results/nod_animate_shrinkage_lda_all/calibration \
  --chance 0.5 \
  --resume

shrinkage_lda uses LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto"). This is still fitted independently in each outer training fold, but it regularizes covariance estimates that can be unstable in short high-dimensional MEG or EEG windows.

Run the elastic-net logistic variant when dense logistic regression may be using too many weak noisy features but pure feature selection would be too aggressive:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_elastic_net_logistic_all.csv \
  --out-dir results/nod_animate_elastic_net_logistic_all \
  --aggregate-out results/nod_animate_elastic_net_logistic_all/summary.csv \
  --plot-out results/nod_animate_elastic_net_logistic_all/summary.png \
  --calibration-dir results/nod_animate_elastic_net_logistic_all/calibration \
  --chance 0.5 \
  --resume

The untuned manifest uses a fixed 50/50 L1/L2 mix. If tune_hyperparameters=true is enabled for elastic_net_logistic, nested CV searches both the C grid and the L1/L2 mixing grid 0.15,0.5,0.85.

Run the ridge classifier variant over all 19 staged subjects:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_ridge_all.csv \
  --out-dir results/nod_animate_ridge_all \
  --aggregate-out results/nod_animate_ridge_all/summary.csv \
  --plot-out results/nod_animate_ridge_all/summary.png \
  --calibration-dir results/nod_animate_ridge_all/calibration \
  --chance 0.5 \
  --resume

ridge uses a balanced RidgeClassifier with sigmoid calibration by default. When tune_hyperparameters=true, nested CV searches alpha values 0.01,0.1,1,10,100.

Run the Gaussian naive Bayes variant over all 19 staged subjects:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_gaussian_nb_all.csv \
  --out-dir results/nod_animate_gaussian_nb_all \
  --aggregate-out results/nod_animate_gaussian_nb_all/summary.csv \
  --plot-out results/nod_animate_gaussian_nb_all/summary.png \
  --calibration-dir results/nod_animate_gaussian_nb_all/calibration \
  --chance 0.5 \
  --resume

gaussian_nb estimates class-conditional feature distributions independently. When tune_hyperparameters=true, nested CV searches variance smoothing values 1e-12,1e-10,1e-9,1e-8,1e-6.

Run the slower tuned temporal train-window ensemble:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_temporal_ensemble_all.csv \
  --out-dir results/nod_animate_logistic_tuned_temporal_ensemble_all \
  --aggregate-out results/nod_animate_logistic_tuned_temporal_ensemble_all/summary.csv \
  --plot-out results/nod_animate_logistic_tuned_temporal_ensemble_all/summary.png \
  --calibration-dir results/nod_animate_logistic_tuned_temporal_ensemble_all/calibration \
  --chance 0.5 \
  --resume

This manifest combines --temporal-train-window 0.12 0.25 with --tune-hyperparameters. For each outer fold, every model in the temporal train-window ensemble is fitted on the outer training split and tunes C with inner CV before its probabilities are averaged across train-window centers.
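The final averaging step can be sketched as follows. An unweighted arithmetic mean over ensemble members is an assumption, the function name is hypothetical, and the probabilities are made up.

```python
def average_probabilities(per_model_probs):
    """Average class-probability vectors from several train-window models."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


# Made-up held-out probabilities for one trial from three train-window centers.
ensemble_probs = average_probabilities([[0.6, 0.4], [0.7, 0.3], [0.8, 0.2]])
print(ensemble_probs)
```

Averaging vectors that each sum to one yields a vector that also sums to one, so the ensemble output can be scored with the same probability metrics as a single decoder.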

Compare raw decoder probabilities with temporal posterior smoothing without changing the decoder:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_logistic_temporal_smoothing_all.csv \
  --out-dir results/nod_animate_logistic_temporal_smoothing_all \
  --aggregate-out results/nod_animate_logistic_temporal_smoothing_all/summary.csv \
  --plot-out results/nod_animate_logistic_temporal_smoothing_all/summary.png \
  --calibration-dir results/nod_animate_logistic_temporal_smoothing_all/calibration \
  --temporal-smoothing-dir results/nod_animate_logistic_temporal_smoothing_all/temporal_smoothing \
  --temporal-smoothing-fit-window 0.1 0.8 \
  --chance 0.5 \
  --resume

When --temporal-smoothing-dir is supplied, the benchmark runner exports the held-out probability observations, fits sticky forward-backward smoothing on those same observations, writes smoothed posterior metrics, and aggregates raw and smoothed rows into the same summary. The comparison appears as emission_mode=calibrated versus emission_mode=calibrated_temporal_posterior in both summary.csv and provenance.csv.
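Sticky forward-backward smoothing of one trial can be sketched as below. This is an illustration under assumptions, not the runner's code: decoder probabilities are treated as emission likelihoods, the initial state is uniform, and the stay probability is given rather than fitted. The function name is hypothetical.

```python
def sticky_forward_backward(trace, stay):
    """Smoothed state posteriors for one trial under a sticky transition model."""
    n_states = len(trace[0])
    switch = (1.0 - stay) / (n_states - 1)

    def propagate(vector):
        # Multiply by the symmetric sticky transition matrix.
        total = sum(vector)
        return [stay * v + switch * (total - v) for v in vector]

    def normalize(vector):
        total = sum(vector)
        return [v / total for v in vector]

    # Forward pass: beliefs given observations up to and including each bin.
    forward = []
    belief = [1.0 / n_states] * n_states
    for step, probs in enumerate(trace):
        if step > 0:
            belief = propagate(belief)
        belief = normalize([b * p for b, p in zip(belief, probs)])
        forward.append(belief)

    # Backward pass: evidence from later bins, rescaled for numerical stability.
    backward = [[1.0] * n_states for _ in trace]
    for step in range(len(trace) - 2, -1, -1):
        weighted = [b * p for b, p in zip(backward[step + 1], trace[step + 1])]
        backward[step] = normalize(propagate(weighted))

    return [normalize([f * b for f, b in zip(fwd, bwd)])
            for fwd, bwd in zip(forward, backward)]


# Made-up trace: the middle bin weakly favours class 1, its neighbours class 0.
raw_trace = [[0.8, 0.2], [0.45, 0.55], [0.8, 0.2]]
smoothed = sticky_forward_backward(raw_trace, stay=0.9)
print(smoothed)
```

In this example the smoothed posterior of the middle bin switches to class 0 because both neighbouring bins support it. Whether such corrections improve held-out metrics is what the raw-versus-smoothed comparison measures.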

Run the sparse logistic variant when dense logistic regression may be using too many noisy sensor-time features:

python -m neureptrace.benchmark \
  NeuRepTrace-Paper/benchmarks/nod_animate_sparse_logistic_all.csv \
  --out-dir results/nod_animate_sparse_logistic_all \
  --aggregate-out results/nod_animate_sparse_logistic_all/summary.csv \
  --plot-out results/nod_animate_sparse_logistic_all/summary.png \
  --calibration-dir results/nod_animate_sparse_logistic_all/calibration \
  --chance 0.5 \
  --resume

Generate a decoder comparison report:

python -m neureptrace.report \
  results/nod_animate_decoders_first5/summary.csv \
  --out results/nod_animate_decoders_first5/report.md

For decoder comparisons, the report includes both raw effect-window accuracy and effect minus baseline-window accuracy. The baseline-corrected value is the more relevant reported number when a decoder shows pre-stimulus bias.
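The baseline correction amounts to subtracting two window means. The sketch assumes inclusive window bounds and uses made-up accuracies; the function name window_mean is hypothetical.

```python
def window_mean(times, values, start, stop):
    """Mean of the values whose time falls inside [start, stop]."""
    selected = [v for t, v in zip(times, values) if start <= t <= stop]
    return sum(selected) / len(selected)


# Made-up accuracy trace for a decoder with a pre-stimulus bias.
times = [-0.1, -0.05, 0.0, 0.1, 0.2, 0.3]
accuracy = [0.53, 0.55, 0.54, 0.60, 0.66, 0.63]

baseline_accuracy = window_mean(times, accuracy, -0.1, 0.0)
effect_accuracy = window_mean(times, accuracy, 0.1, 0.8)
effect_minus_baseline = effect_accuracy - baseline_accuracy
print(baseline_accuracy, effect_accuracy, effect_minus_baseline)
```

Measured against chance (0.5), this decoder appears to gain 0.13, but 0.04 of that is already present before stimulus onset, leaving a baseline-corrected effect of 0.09.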

Create a calibration-aware decoder report and aggregate reliability bins:

python -m neureptrace.calibration \
  results/nod_animate_decoders_first5/summary.csv \
  "results/nod_animate_decoders_first5/calibration/*_calibration_bins.csv" \
  --out-report results/nod_animate_decoders_first5/calibration_report.md \
  --out-bins results/nod_animate_decoders_first5/reliability_bins.csv

Plot an effect-window reliability diagram from aggregate reliability bins:

python -m neureptrace.plot_calibration \
  results/nod_animate_decoders_first5/reliability_bins.csv \
  --out results/nod_animate_decoders_first5/reliability.png \
  --time-window 0.1 0.8 \
  --title "NOD animate/inanimate decoder calibration"

Run paired subject-level decoder statistics:

python -m neureptrace.paired_stats \
  "results/nod_animate_decoders_first5/sub-*_time_decode.csv" \
  --out-csv results/nod_animate_decoders_first5/paired_stats.csv \
  --out-report results/nod_animate_decoders_first5/paired_stats.md \
  --chance 0.5

Run the full 19-subject decoder comparison on a self-hosted GitHub Actions runner:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_decoders_all.csv \
  -f output_dir=results/nod_animate_decoders_all \
  -f n_permutations=10000

The workflow rewrites the committed manifest to use the supplied data_root, runs logistic regression, LDA, and calibrated linear SVM across all 19 staged NOD-EEG subjects, then uploads only compact summary, calibration, and inference artifacts. The benchmark step uses --resume, so a rerun in the same output directory skips completed subject-decoder rows whose result and calibration-bin CSVs already exist. Use an absolute data_root when the self-hosted runner keeps the NOD files outside the repository workspace.

The same workflow can run the tuned PCA-whitened logistic manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_pca_whiten_all.csv \
  -f output_dir=results/nod_animate_logistic_tuned_pca_whiten_all \
  -f n_permutations=10000

Or run the tuned ANOVA feature-selection logistic manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_anova_select_all.csv \
  -f output_dir=results/nod_animate_logistic_tuned_anova_select_all \
  -f n_permutations=10000

Or run the tuned temporal train-window ensemble:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_logistic_tuned_temporal_ensemble_all.csv \
  -f output_dir=results/nod_animate_logistic_tuned_temporal_ensemble_all \
  -f n_permutations=10000

Or run the shrinkage-LDA manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_shrinkage_lda_all.csv \
  -f output_dir=results/nod_animate_shrinkage_lda_all \
  -f n_permutations=10000

Or run the elastic-net logistic manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_elastic_net_logistic_all.csv \
  -f output_dir=results/nod_animate_elastic_net_logistic_all \
  -f n_permutations=10000

Or run the ridge manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_ridge_all.csv \
  -f output_dir=results/nod_animate_ridge_all \
  -f n_permutations=10000

Or run the Gaussian naive Bayes manifest:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_gaussian_nb_all.csv \
  -f output_dir=results/nod_animate_gaussian_nb_all \
  -f n_permutations=10000

Or run the raw-versus-smoothed posterior comparison:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_logistic_temporal_smoothing_all.csv \
  -f output_dir=results/nod_animate_logistic_temporal_smoothing_all \
  -f temporal_smoothing=true \
  -f temporal_smoothing_fit_window_start=0.1 \
  -f temporal_smoothing_fit_window_stop=0.8 \
  -f temporal_smoothing_stay_grid_size=200 \
  -f n_permutations=10000

Or run the sparse logistic L1 decoder:

gh workflow run nod-decoder-all.yml \
  --repo IPS-Stuttgart/NeuRepTrace \
  --ref main \
  -f data_root=../data/nod \
  -f manifest_csv=NeuRepTrace-Paper/benchmarks/nod_animate_sparse_logistic_all.csv \
  -f output_dir=results/nod_animate_sparse_logistic_all \
  -f n_permutations=10000

After downloading or locating the workflow output directory, export only paper-safe artifacts into the compact export directory:

python scripts/export_paper_results.py \
  results/nod_animate_decoders_all \
  ../NeuRepTrace-Paper/results/nod_animate_decoders_all \
  --max-mb 50 \
  --plot-reliability \
  --reliability-window 0.1 0.8

Acceptance Target

The first useful milestone is not just above-chance accuracy. The benchmark should produce stable probability traces and calibration metrics that can be compared across subjects, sessions, and decoder variants.

For interrupted runs, rerun the same command with --resume. NeuRepTrace will keep complete existing rows, regenerate missing rows, and rebuild the aggregate summary and plot from the combined output set.

When --observation-dir is requested, resume mode also requires the matching subject observation CSV before skipping a manifest row. This prevents a run from appearing complete when metric summaries exist but the probability traces needed for downstream state modeling are missing.
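The skip decision can be sketched as a completeness check over the files a row is expected to produce. The observation file name and the non-empty requirement are assumptions for illustration, and row_is_complete is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def row_is_complete(required_files):
    """True when every required output exists and is non-empty."""
    return all(Path(path).is_file() and Path(path).stat().st_size > 0
               for path in required_files)


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    result_csv = out_dir / "sub-01_logistic_time_decode.csv"
    # Hypothetical observation file name; it is deliberately never created.
    observation_csv = out_dir / "observations" / "sub-01_logistic_observations.csv"
    result_csv.write_text("time,accuracy\n0.1,0.6\n")

    complete_without_observations = row_is_complete([result_csv])
    complete_with_observations = row_is_complete([result_csv, observation_csv])

print(complete_without_observations, complete_with_observations)
```

The same row counts as finished when only metrics are requested but is rerun once observations are required, which matches the behaviour described above.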

After a manifest run with --observation-dir, fit the same temporal model across all exported subject observations:

python -m neureptrace.temporal_model \
  "results/nod_animate_all/observations/*_observations.csv" \
  --out-summary results/nod_animate_all/temporal_model.csv \
  --out-states results/nod_animate_all/state_trace.csv \
  --n-permutations 100

Compare calibrated versus uncalibrated emissions:

python -m neureptrace.emission_compare \
  results/nod_animate_all/temporal_model.csv \
  --out-csv results/nod_animate_all/emission_compare.csv \
  --out-report results/nod_animate_all/emission_compare.md

Then summarize category-conditioned stages:

python -m neureptrace.semantic_stages \
  results/nod_animate_all/state_trace.csv \
  --out-time results/nod_animate_all/semantic_stage_time.csv \
  --out-stages results/nod_animate_all/semantic_stages.csv \
  --out-report results/nod_animate_all/semantic_stages.md

Calibration-Aware Temporal-State Workflow

Use neureptrace.temporal_state_workflow to run the calibration-aware temporal-state pass across the three staged NOD tasks: animate/inanimate, canine/device, and container/covering. The workflow prepares runner-local manifests, validates all 19 NOD-EEG subjects, runs matched calibrated and uncalibrated emissions in the same folds, fits sticky temporal models, compares controls, summarizes semantic stages, and writes compact artifacts to the export directory given by --compact-export-dir.

python -m neureptrace.temporal_state_workflow \
  --data-root data/nod \
  --out-dir results/temporal_state_inference \
  --compact-export-dir ../NeuRepTrace-Compact-Results/results/temporal_state_inference \
  --decoders logistic linear_svm \
  --n-permutations 100

The top-level outputs are temporal_state_summary.csv, temporal_state_reliability.png, temporal_state_evidence.md, and temporal_state_commands.md. The compact export intentionally excludes large probability observation and state-trace CSVs.

For a smoke test, run one task and one subject with fewer permutations:

python -m neureptrace.temporal_state_workflow \
  --data-root data/nod \
  --out-dir results/temporal_state_smoke \
  --task nod_animate \
  --decoders linear_svm \
  --max-subjects 1 \
  --n-permutations 5 \
  --stay-grid-size 20