The first intended public benchmark is NOD-MEG/NOD-EEG. RepTrace does not download large public datasets automatically; stage the relevant subject epochs and metadata locally first.
Use the NOD preprocessed epochs file and the matching detailed events CSV. If
the metadata already contains a usable stim_is_animate column, pass it directly
to reptrace.mne_time_decode; otherwise derive a named condition column first:
python -m reptrace.metadata \
--events-csv data/nod/sub-01_events.csv \
--source-column stim_is_animate \
--positive-pattern "True" \
--label-column condition \
--positive-label animate \
--negative-label inanimate \
--out data/nod/sub-01_metadata_animate.csv
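The derivation above can be sketched in plain pandas. The column and label names come from the command; the mapping logic is a simplified assumption about what reptrace.metadata does, not its implementation.

```python
import pandas as pd

def derive_condition(events, source_column="stim_is_animate",
                     positive_pattern="True", label_column="condition",
                     positive_label="animate", negative_label="inanimate"):
    """Map a raw event column onto a named two-class condition column."""
    out = events.copy()
    is_positive = out[source_column].astype(str).str.fullmatch(positive_pattern)
    out[label_column] = is_positive.map({True: positive_label,
                                         False: negative_label})
    return out

events = pd.DataFrame({"stim_is_animate": [True, False, True]})
labeled = derive_condition(events)
print(labeled["condition"].tolist())  # ['animate', 'inanimate', 'animate']
```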
Then run the decoder:
python -m reptrace.mne_time_decode \
--epochs data/nod/sub-01_epo.fif \
--metadata-csv data/nod/sub-01_metadata_animate.csv \
--label-column condition \
--group-column session \
--tmin -0.1 \
--tmax 0.8 \
--window-ms 20 \
--step-ms 10 \
--out results/nod_sub-01_animate.csv \
--observations-out results/nod_sub-01_animate_observations.csv \
--emission-mode both
The output CSV contains fold-wise accuracy, log loss, Brier score, and expected calibration error for each time window.
The optional observations CSV keeps the held-out decoder probabilities before
they are reduced to accuracy or calibration summaries. Each row is one
trial/time-window observation with the fold, time, sample index, true class,
predicted class, confidence, probability assigned to the true class, and one
prob_class_* column per class. This is the output to use when treating decoder
traces as probabilistic evidence streams for HMMs or other temporal state
models.
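Those observation rows can be reshaped into per-trial probability traces, the input shape temporal state models expect. The miniature table below is illustrative: `prob_true_class` mirrors the "probability assigned to the true class" column described above, and the `trial` index is an assumed grouping key, not the exact CSV schema.

```python
import pandas as pd

# Miniature stand-in for the observations CSV.
obs = pd.DataFrame({
    "trial": [0, 0, 1, 1],
    "time":  [0.10, 0.12, 0.10, 0.12],
    "prob_true_class": [0.55, 0.70, 0.48, 0.60],
})

# One probability time series per trial: rows are trials, columns are windows.
traces = obs.pivot(index="trial", columns="time", values="prob_true_class")
print(traces.shape)  # (2, 2)
```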
Fit the first conservative temporal model from those observations:
python -m reptrace.temporal_model \
results/nod_sub-01_animate_observations.csv \
--out-summary results/nod_sub-01_animate_temporal_model.csv \
--out-states results/nod_sub-01_animate_state_trace.csv \
--effect-window 0.1 0.8 \
--baseline-window -0.1 0.0 \
--n-permutations 100
This command treats each trial as a probability time series. The hidden states are the decoder classes, and the fitted parameter is the probability that the latent state persists between adjacent time bins. The summary reports persistence gain relative to a uniform-memory baseline, plus controls that shuffle time order, shuffle probability-label columns, and fit the same model in the pre-stimulus baseline window. The optional state trace CSV contains Viterbi states and posterior state probabilities for downstream sequence analyses.
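The stay-probability idea can be sketched with a small forward pass. The two-class setup, uniform prior, and parameter names here are illustrative assumptions, not RepTrace's implementation.

```python
import numpy as np

def sequence_loglik(probs, stay):
    """Log-likelihood of one trial's decoder probabilities under a chain
    whose latent class persists with probability `stay` per time step."""
    n_t, n_k = probs.shape
    trans = np.full((n_k, n_k), (1.0 - stay) / (n_k - 1))
    np.fill_diagonal(trans, stay)
    alpha = np.full(n_k, 1.0 / n_k) * probs[0]   # uniform prior, first emission
    norm = alpha.sum()
    loglik = np.log(norm)
    alpha = alpha / norm
    for t in range(1, n_t):
        alpha = (alpha @ trans) * probs[t]       # forward recursion
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = alpha / norm
    return float(loglik)

probs = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]])
# Persistent dynamics should beat the memoryless stay=0.5 baseline here.
print(sequence_loglik(probs, 0.9) > sequence_loglik(probs, 0.5))  # True
```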
When the observations contain both calibrated and uncalibrated emissions, compare which emission mode gives cleaner state inference:
python -m reptrace.emission_compare \
results/nod_sub-01_animate_temporal_model.csv \
--out-csv results/nod_sub-01_animate_emission_compare.csv \
--out-report results/nod_sub-01_animate_emission_compare.md
The comparison uses the temporal model’s persistence gain. Its main value is the control margin: observed effect-window gain minus the strongest of the baseline-window, shuffled-time, and shuffled-label control gains. A positive calibrated-minus-uncalibrated margin is evidence that calibrated probabilities give cleaner state inference, not merely nicer reliability plots.
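Assuming the gains are available as plain numbers, the control margin described above reduces to a one-line computation; the dictionary keys and values here are illustrative.

```python
def control_margin(effect_gain, control_gains):
    """Observed effect-window persistence gain minus the strongest control gain."""
    return effect_gain - max(control_gains.values())

gains = {"baseline_window": 0.02, "shuffled_time": 0.01, "shuffled_labels": 0.03}
print(round(control_margin(0.12, gains), 4))  # 0.09
```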
Ask the first NOD neuroscience question from the state traces:
python -m reptrace.semantic_stages \
results/nod_sub-01_animate_state_trace.csv \
--out-time results/nod_sub-01_animate_semantic_stage_time.csv \
--out-stages results/nod_sub-01_animate_semantic_stages.csv \
--out-report results/nod_sub-01_animate_semantic_stages.md \
--posterior-threshold 0.6 \
--match-threshold 0.6 \
--min-duration 0.04
This asks whether the decoded semantic category for each trial becomes a stable latent state over contiguous time ranges. For NOD this corresponds to category staging, such as animate or inanimate evidence emerging over a post-stimulus interval. For navigation or planning data, use the same output shape with spatial bins or task states in place of semantic classes; the resulting stable stages become candidate trajectory segments that still need the temporal-model controls above.
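The stage-extraction rule above (posterior threshold plus minimum duration) can be sketched as follows; the time grid and posterior values are illustrative, and the actual command applies further matching logic not shown here.

```python
import numpy as np

def stable_stages(posterior, times, threshold=0.6, min_duration=0.04):
    """Contiguous time ranges where the posterior for the decoded category
    stays at or above `threshold` for at least `min_duration` seconds."""
    stages, start = [], None
    for i, above in enumerate(posterior >= threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            if times[i - 1] - times[start] >= min_duration:
                stages.append((float(times[start]), float(times[i - 1])))
            start = None
    if start is not None and times[-1] - times[start] >= min_duration:
        stages.append((float(times[start]), float(times[-1])))
    return stages

times = np.arange(0.0, 0.2, 0.01)                       # 10 ms grid
posterior = np.where((times >= 0.05) & (times < 0.15), 0.8, 0.4)
stages = stable_stages(posterior, times)
print([(round(a, 2), round(b, 2)) for a, b in stages])  # [(0.05, 0.14)]
```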
Plot the single-subject result:
python -m reptrace.plot_time_decode \
results/nod_sub-01_animate.csv \
--chance 0.5 \
--title "NOD sub-01 animate/inanimate" \
--out results/nod_sub-01_animate.png
After running several subjects, aggregate across subjects:
python -m reptrace.results \
results/nod_sub-01_animate.csv \
results/nod_sub-02_animate.csv \
results/nod_sub-03_animate.csv \
--out results/nod_animate_summary.csv
Then plot the aggregate:
python -m reptrace.plot_time_decode \
results/nod_animate_summary.csv \
--chance 0.5 \
--title "NOD animate/inanimate summary" \
--out results/nod_animate_summary.png
The same workflow can be run from a manifest:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_sub01.csv \
--report-out results/nod_animate_sub01_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_animate_sub01.csv \
--out-dir results/nod_animate_sub01 \
--aggregate-out results/nod_animate_sub01_summary.csv \
--plot-out results/nod_animate_sub01_summary.png \
--observation-dir results/nod_animate_sub01/observations \
--emission-mode both \
--chance 0.5
Manifest paths are resolved relative to the manifest file. The example manifest
expects staged files under data/nod/.
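The resolution rule can be sketched with pathlib; the ../data/nod entry below is illustrative, not the actual manifest content.

```python
from pathlib import Path

def resolve_manifest_path(manifest, entry):
    """Resolve a manifest entry against the manifest file's directory;
    absolute entries are kept as-is."""
    entry_path = Path(entry)
    if entry_path.is_absolute():
        return entry_path
    return (Path(manifest).parent / entry_path).resolve()

# A manifest in benchmarks/ pointing one level up into data/nod/.
p = resolve_manifest_path("benchmarks/nod_animate_sub01.csv",
                          "../data/nod/sub-01_epo.fif")
print(p.as_posix().endswith("data/nod/sub-01_epo.fif"))  # True
```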
For a paper-ready first pass, use the same animate/inanimate task and run five subjects at once from a single manifest:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_first5.csv \
--report-out results/nod_animate_first5_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_animate_first5.csv \
--out-dir results/nod_animate_first5 \
--aggregate-out results/nod_animate_first5_summary.csv \
--plot-out results/nod_animate_first5_summary.png \
--chance 0.5
This keeps the experiment scope fixed (same preprocessing, same target labels, same window/grid parameters) and changes only the subject set.
Generate a compact Markdown report from the aggregate and subject-level result CSVs:
python -m reptrace.report \
results/nod_animate_first5/summary.csv \
"results/nod_animate_first5/sub-*_time_decode.csv" \
--chance 0.5 \
--out results/nod_animate_first5/report.md
The report records the aggregate peak, baseline-window accuracy, effect-window accuracy, calibration metrics at the peak, and per-subject peaks.
After staging all available NOD-EEG preprocessed epoch files and detailed event files, validate the 19-subject manifest. This manifest uses 2 grouped folds because several subjects have only 2 unique session groups:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_all.csv \
--report-out results/nod_animate_all_validation.csv
Then run the same animate/inanimate benchmark over every staged NOD-EEG subject:
python -m reptrace.benchmark \
benchmarks/nod_animate_all.csv \
--out-dir results/nod_animate_all \
--aggregate-out results/nod_animate_all/summary.csv \
--plot-out results/nod_animate_all/nod_animate_all_summary.png \
--chance 0.5
Make calibration explicit in the benchmark report:
python -m reptrace.calibration \
results/nod_animate_all/summary.csv \
--out-report results/nod_animate_all/calibration_report.md
The calibration report orders models by effect-window ECE, then Brier score and log loss. Accuracy is included as context, but the report is designed to keep probability quality visible rather than treating it as a secondary metric.
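For reference, a generic ECE computation looks like the sketch below; the equal-width binning is a common convention and an assumption here, not necessarily RepTrace's exact implementation.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin predictions by confidence, then sum |mean confidence -
    empirical accuracy| per bin, weighted by the fraction of trials in it."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece

# A well-calibrated toy decoder: 0.7-confidence predictions are 70% correct.
conf = [0.7] * 10
corr = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(round(expected_calibration_error(conf, corr), 3))  # 0.0
```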
Run subject-level inference on the resulting subject CSVs:
python -m reptrace.inference \
"results/nod_animate_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--cluster-alpha 0.05 \
--out-time results/nod_animate_all/inference_time.csv \
--out-clusters results/nod_animate_all/inference_clusters.csv
The inference command first averages folds within each subject, then runs a one-sided subject-level sign-flip test against chance at each time point. It also reports max-cluster-mass corrected p-values for contiguous above-threshold periods.
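The sign-flip logic can be sketched as follows for a single time point; the subject effect values and the add-one correction are illustrative assumptions.

```python
import numpy as np

def sign_flip_pvalue(subject_effects, n_permutations=10000, seed=0):
    """One-sided sign-flip test: how often does randomly flipping the sign of
    each subject's (accuracy - chance) effect produce a mean at least as
    large as the observed mean?"""
    rng = np.random.default_rng(seed)
    effects = np.asarray(subject_effects, dtype=float)
    observed = effects.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_permutations, effects.size))
    null_means = (flips * effects).mean(axis=1)
    # Add-one correction keeps the p-value away from exactly zero.
    return (1 + np.sum(null_means >= observed)) / (n_permutations + 1)

# Five subjects all above chance by a similar margin.
p = sign_flip_pvalue([0.06, 0.08, 0.05, 0.07, 0.09])
print(p < 0.05)  # True
```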
This larger run is the minimum useful scale for subject-level statistical testing. The 5-subject pilot is useful for smoke testing and early signal checking; reported claims should use the full staged manifest.
Use benchmarks/nod_superclass_canine_device_all.csv for a second public task
within the same staged NOD-EEG data. This task decodes ImageNet superclass
labels canine versus device, using only trials whose super_class exactly
matches one of those labels. The full staged set contains 7,293 canine trials
and 6,950 device trials across 19 subjects.
python -m reptrace.validate_manifest \
benchmarks/nod_superclass_canine_device_all.csv \
--report-out results/nod_superclass_canine_device_all_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_superclass_canine_device_all.csv \
--out-dir results/nod_superclass_canine_device_all \
--aggregate-out results/nod_superclass_canine_device_all/summary.csv \
--plot-out results/nod_superclass_canine_device_all/summary.png \
--calibration-dir results/nod_superclass_canine_device_all/calibration \
--chance 0.5
python -m reptrace.report \
results/nod_superclass_canine_device_all/summary.csv \
"results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
--chance 0.5 \
--out results/nod_superclass_canine_device_all/report.md
python -m reptrace.inference \
"results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--cluster-alpha 0.05 \
--out-time results/nod_superclass_canine_device_all/inference_time.csv \
--out-clusters results/nod_superclass_canine_device_all/inference_clusters.csv
python -m reptrace.calibration \
results/nod_superclass_canine_device_all/summary.csv \
"results/nod_superclass_canine_device_all/calibration/*_calibration_bins.csv" \
--out-report results/nod_superclass_canine_device_all/calibration_report.md \
--out-bins results/nod_superclass_canine_device_all/reliability_bins.csv
This gives the paper a second semantic benchmark without changing dataset, preprocessing, CV logic, or reporting machinery.
Use benchmarks/nod_superclass_container_covering_all.csv for the next staged
task. This contrast decodes ImageNet superclass labels container versus
covering, using only trials whose super_class exactly matches one of those
labels. Both labels are inanimate, so this task tests category decoding beyond
the animate/inanimate distinction. The full staged set contains 5,215 container
trials and 4,809 covering trials across all 19 subjects.
python -m reptrace.validate_manifest \
benchmarks/nod_superclass_container_covering_all.csv \
--report-out results/nod_superclass_container_covering_all_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_superclass_container_covering_all.csv \
--out-dir results/nod_superclass_container_covering_all \
--aggregate-out results/nod_superclass_container_covering_all/summary.csv \
--plot-out results/nod_superclass_container_covering_all/summary.png \
--calibration-dir results/nod_superclass_container_covering_all/calibration \
--chance 0.5 \
--resume
After the benchmark finishes, generate the same report, inference, calibration, and reliability outputs as for the canine/device superclass task:
python -m reptrace.report \
results/nod_superclass_container_covering_all/summary.csv \
"results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
--out results/nod_superclass_container_covering_all/report.md \
--chance 0.5
python -m reptrace.inference \
"results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--out-time results/nod_superclass_container_covering_all/inference_time.csv \
--out-clusters results/nod_superclass_container_covering_all/inference_clusters.csv
python -m reptrace.calibration \
results/nod_superclass_container_covering_all/summary.csv \
"results/nod_superclass_container_covering_all/calibration/*_calibration_bins.csv" \
--out-report results/nod_superclass_container_covering_all/calibration_report.md \
--out-bins results/nod_superclass_container_covering_all/reliability_bins.csv
python -m reptrace.plot_calibration \
results/nod_superclass_container_covering_all/reliability_bins.csv \
--out results/nod_superclass_container_covering_all/reliability.png \
--time-window 0.1 0.8 \
--title "NOD container/covering calibration"
RepTrace supports standard probability-producing decoders via the decoder
manifest column or the --decoder CLI option:
- logistic: balanced multinomial logistic regression
- lda: linear discriminant analysis
- linear_svm: calibrated balanced linear support vector machine
Run the first-five-subject decoder comparison:
python -m reptrace.benchmark \
benchmarks/nod_animate_decoders_first5.csv \
--out-dir results/nod_animate_decoders_first5 \
--aggregate-out results/nod_animate_decoders_first5/summary.csv \
--plot-out results/nod_animate_decoders_first5/summary.png \
--calibration-dir results/nod_animate_decoders_first5/calibration \
--chance 0.5
When a manifest contains a decoder column, result files are named like
sub-01_logistic_time_decode.csv, and aggregate summaries preserve the
decoder column rather than averaging decoders together.
Generate a decoder comparison report:
python -m reptrace.report \
results/nod_animate_decoders_first5/summary.csv \
--out results/nod_animate_decoders_first5/report.md
For decoder comparisons, the report includes both raw effect-window accuracy and effect minus baseline-window accuracy. The baseline-corrected value is the more relevant reported number when a decoder shows pre-stimulus bias.
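Assuming a subject CSV exposes time and accuracy columns (the exact column names are an assumption here), the baseline-corrected value reduces to:

```python
import pandas as pd

def baseline_corrected(df, baseline=(-0.1, 0.0), effect=(0.1, 0.8)):
    """Effect-window accuracy minus baseline-window accuracy for one
    decoder's time-resolved results."""
    in_base = df["time"].between(*baseline)
    in_eff = df["time"].between(*effect)
    return df.loc[in_eff, "accuracy"].mean() - df.loc[in_base, "accuracy"].mean()

df = pd.DataFrame({"time": [-0.05, 0.0, 0.2, 0.4],
                   "accuracy": [0.52, 0.54, 0.70, 0.66]})
print(round(baseline_corrected(df), 3))  # 0.15
```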
Create a calibration-aware decoder report and aggregate reliability bins:
python -m reptrace.calibration \
results/nod_animate_decoders_first5/summary.csv \
"results/nod_animate_decoders_first5/calibration/*_calibration_bins.csv" \
--out-report results/nod_animate_decoders_first5/calibration_report.md \
--out-bins results/nod_animate_decoders_first5/reliability_bins.csv
Plot an effect-window reliability diagram from aggregate reliability bins:
python -m reptrace.plot_calibration \
results/nod_animate_decoders_first5/reliability_bins.csv \
--out results/nod_animate_decoders_first5/reliability.png \
--time-window 0.1 0.8 \
--title "NOD animate/inanimate decoder calibration"
Run paired subject-level decoder statistics:
python -m reptrace.paired_stats \
"results/nod_animate_decoders_first5/sub-*_time_decode.csv" \
--out-csv results/nod_animate_decoders_first5/paired_stats.csv \
--out-report results/nod_animate_decoders_first5/paired_stats.md \
--chance 0.5
Run the full 19-subject decoder comparison on a self-hosted GitHub Actions runner:
gh workflow run nod-decoder-all.yml \
--repo IPS-Stuttgart/RepTrace \
--ref main \
-f data_root=../data/nod \
-f output_dir=results/nod_animate_decoders_all \
-f n_permutations=10000
The workflow rewrites the committed manifest to use the supplied data_root,
runs logistic regression, LDA, and calibrated linear SVM across all 19 staged
NOD-EEG subjects, then uploads only compact summary, calibration, and inference
artifacts. The benchmark step uses --resume, so a rerun in the same output
directory skips completed subject-decoder rows whose result and calibration-bin
CSVs already exist. Use an absolute data_root when the self-hosted runner
keeps the NOD files outside the repository workspace.
After downloading or locating the workflow output directory, export only paper-safe artifacts into the compact export directory:
python scripts/export_paper_results.py \
results/nod_animate_decoders_all \
../2026-05-RepTrace-Paper/results/nod_animate_decoders_all \
--max-mb 50 \
--plot-reliability \
--reliability-window 0.1 0.8
The first useful milestone is not just above-chance accuracy. The benchmark should produce stable probability traces and calibration metrics that can be compared across subjects, sessions, and decoder variants.
For interrupted runs, rerun the same command with --resume. RepTrace will keep
complete existing rows, regenerate missing rows, and rebuild the aggregate
summary and plot from the combined output set.
When --observation-dir is requested, resume mode also requires the matching
subject observation CSV before skipping a manifest row. This prevents a run from
appearing complete when metric summaries exist but the probability traces needed
for downstream state modeling are missing.
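A simplified model of that resume check, with illustrative file names:

```python
import tempfile
from pathlib import Path

def can_skip(result_csv, obs_csv=None):
    """Resume check: skip a manifest row only if the result CSV exists and,
    when observations were requested, the observation CSV exists too."""
    if not Path(result_csv).is_file():
        return False
    return obs_csv is None or Path(obs_csv).is_file()

with tempfile.TemporaryDirectory() as tmp:
    result = Path(tmp) / "sub-01_time_decode.csv"
    result.touch()
    ok_without_obs = can_skip(result)  # no observations requested
    ok_missing_obs = can_skip(result, Path(tmp) / "sub-01_observations.csv")
print(ok_without_obs, ok_missing_obs)  # True False
```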
After a manifest run with --observation-dir, fit the same temporal model across
all exported subject observations:
python -m reptrace.temporal_model \
"results/nod_animate_all/observations/*_observations.csv" \
--out-summary results/nod_animate_all/temporal_model.csv \
--out-states results/nod_animate_all/state_trace.csv \
--n-permutations 100
Compare calibrated versus uncalibrated emissions:
python -m reptrace.emission_compare \
results/nod_animate_all/temporal_model.csv \
--out-csv results/nod_animate_all/emission_compare.csv \
--out-report results/nod_animate_all/emission_compare.md
Then summarize category-conditioned stages:
python -m reptrace.semantic_stages \
results/nod_animate_all/state_trace.csv \
--out-time results/nod_animate_all/semantic_stage_time.csv \
--out-stages results/nod_animate_all/semantic_stages.csv \
--out-report results/nod_animate_all/semantic_stages.md
Use reptrace.temporal_state_workflow to run the calibration-aware temporal-state pass
across the three staged NOD tasks: animate/inanimate, canine/device, and
container/covering. The workflow prepares runner-local manifests, validates all
19 NOD-EEG subjects, runs matched calibrated and uncalibrated emissions in the
same folds, fits sticky temporal models, compares controls, summarizes semantic
stages, and writes compact artifacts to the compact export directory.
python -m reptrace.temporal_state_workflow \
--data-root data/nod \
--out-dir results/temporal_state_inference \
--compact-export-dir ../RepTrace-Compact-Results/results/temporal_state_inference \
--decoders logistic linear_svm \
--n-permutations 100
The top-level outputs are temporal_state_summary.csv, temporal_state_reliability.png,
temporal_state_evidence.md, and temporal_state_commands.md. The compact export intentionally
excludes large probability observation and state-trace CSVs.
For a smoke test, run one task and one subject with fewer permutations:
python -m reptrace.temporal_state_workflow \
--data-root data/nod \
--out-dir results/temporal_state_smoke \
--task nod_animate \
--decoders linear_svm \
--max-subjects 1 \
--n-permutations 5 \
--stay-grid-size 20