The first intended public benchmark is NOD-MEG/NOD-EEG. RepTrace does not download large public datasets automatically; stage the relevant subject epochs and metadata locally first.
Use the NOD preprocessed epochs file and the matching detailed events CSV. If
the metadata already contains a usable stim_is_animate column, pass it directly
to reptrace.mne_time_decode; otherwise derive a named condition column first:
python -m reptrace.metadata \
--events-csv data/nod/sub-01_events.csv \
--source-column stim_is_animate \
--positive-pattern "True" \
--label-column condition \
--positive-label animate \
--negative-label inanimate \
--out data/nod/sub-01_metadata_animate.csv
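The derivation above can be sketched in plain pandas. The column and label names come from the command; the mapping logic is a simplified assumption about what reptrace.metadata does, not its implementation.

```python
import pandas as pd

def derive_condition(events, source_column="stim_is_animate",
                     positive_pattern="True", label_column="condition",
                     positive_label="animate", negative_label="inanimate"):
    """Map a raw event column onto a named two-class condition column."""
    out = events.copy()
    is_positive = out[source_column].astype(str).str.fullmatch(positive_pattern)
    out[label_column] = is_positive.map({True: positive_label,
                                         False: negative_label})
    return out

events = pd.DataFrame({"stim_is_animate": [True, False, True]})
labeled = derive_condition(events)
print(labeled["condition"].tolist())  # ['animate', 'inanimate', 'animate']
```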
Then run the decoder:
python -m reptrace.mne_time_decode \
--epochs data/nod/sub-01_epo.fif \
--metadata-csv data/nod/sub-01_metadata_animate.csv \
--label-column condition \
--group-column session \
--tmin -0.1 \
--tmax 0.8 \
--window-ms 20 \
--step-ms 10 \
--out results/nod_sub-01_animate.csv \
--observations-out results/nod_sub-01_animate_observations.csv \
--emission-mode both
The output CSV contains fold-wise accuracy, log loss, Brier score, and expected calibration error for each time window.
The optional observations CSV keeps the held-out decoder probabilities before
they are reduced to accuracy or calibration summaries. Each row is one
trial/time-window observation with the fold, time, sample index, true class,
predicted class, confidence, probability assigned to the true class, and one
prob_class_* column per class. This is the output to use when treating decoder
traces as probabilistic evidence streams for HMMs or other temporal state
models.
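Those observation rows can be reshaped into per-trial probability traces, the input shape temporal state models expect. The miniature table below is illustrative: `prob_true_class` mirrors the "probability assigned to the true class" column described above, and the `trial` index is an assumed grouping key, not the exact CSV schema.

```python
import pandas as pd

# Miniature stand-in for the observations CSV.
obs = pd.DataFrame({
    "trial": [0, 0, 1, 1],
    "time":  [0.10, 0.12, 0.10, 0.12],
    "prob_true_class": [0.55, 0.70, 0.48, 0.60],
})

# One probability time series per trial: rows are trials, columns are windows.
traces = obs.pivot(index="trial", columns="time", values="prob_true_class")
print(traces.shape)  # (2, 2)
```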
Fit the first conservative temporal model from those observations:
python -m reptrace.temporal_model \
results/nod_sub-01_animate_observations.csv \
--out-summary results/nod_sub-01_animate_temporal_model.csv \
--out-states results/nod_sub-01_animate_state_trace.csv \
--effect-window 0.1 0.8 \
--baseline-window -0.1 0.0 \
--n-permutations 100
This command treats each trial as a probability time series. The hidden states are the decoder classes, and the fitted parameter is the probability that the latent state persists between adjacent time bins. The summary reports persistence gain relative to a uniform-memory baseline, plus controls that shuffle time order, shuffle probability-label columns, and fit the same model in the pre-stimulus baseline window. The optional state trace CSV contains Viterbi states and posterior state probabilities for downstream sequence analyses.
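The stay-probability idea can be sketched with a small forward pass. The two-class setup, uniform prior, and parameter names here are illustrative assumptions, not RepTrace's implementation.

```python
import numpy as np

def sequence_loglik(probs, stay):
    """Log-likelihood of one trial's decoder probabilities under a chain
    whose latent class persists with probability `stay` per time step."""
    n_t, n_k = probs.shape
    trans = np.full((n_k, n_k), (1.0 - stay) / (n_k - 1))
    np.fill_diagonal(trans, stay)
    alpha = np.full(n_k, 1.0 / n_k) * probs[0]   # uniform prior, first emission
    norm = alpha.sum()
    loglik = np.log(norm)
    alpha = alpha / norm
    for t in range(1, n_t):
        alpha = (alpha @ trans) * probs[t]       # forward recursion
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha = alpha / norm
    return float(loglik)

probs = np.array([[0.8, 0.2], [0.7, 0.3], [0.9, 0.1]])
# Persistent dynamics should beat the memoryless stay=0.5 baseline here.
print(sequence_loglik(probs, 0.9) > sequence_loglik(probs, 0.5))  # True
```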
When the observations contain both calibrated and uncalibrated emissions, compare which emission mode gives cleaner state inference:
python -m reptrace.emission_compare \
results/nod_sub-01_animate_temporal_model.csv \
--out-csv results/nod_sub-01_animate_emission_compare.csv \
--out-report results/nod_sub-01_animate_emission_compare.md
The comparison uses the temporal model’s persistence gain. Its main value is the control margin: observed effect-window gain minus the strongest of the baseline-window, shuffled-time, and shuffled-label control gains. A positive calibrated-minus-uncalibrated margin is evidence that calibrated probabilities give cleaner state inference, not merely nicer reliability plots.
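Assuming the gains are available as plain numbers, the control margin described above reduces to a one-line computation; the dictionary keys and values here are illustrative.

```python
def control_margin(effect_gain, control_gains):
    """Observed effect-window persistence gain minus the strongest control gain."""
    return effect_gain - max(control_gains.values())

gains = {"baseline_window": 0.02, "shuffled_time": 0.01, "shuffled_labels": 0.03}
print(round(control_margin(0.12, gains), 4))  # 0.09
```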
Ask the first NOD neuroscience question from the state traces:
python -m reptrace.semantic_stages \
results/nod_sub-01_animate_state_trace.csv \
--out-time results/nod_sub-01_animate_semantic_stage_time.csv \
--out-stages results/nod_sub-01_animate_semantic_stages.csv \
--out-report results/nod_sub-01_animate_semantic_stages.md \
--posterior-threshold 0.6 \
--match-threshold 0.6 \
--min-duration 0.04
This asks whether the decoded semantic category for each trial becomes a stable latent state over contiguous time ranges. For NOD this corresponds to category staging, such as animate or inanimate evidence emerging over a post-stimulus interval. For navigation or planning data, use the same output shape with spatial bins or task states in place of semantic classes; the resulting stable stages become candidate trajectory segments that still need the temporal-model controls above.
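The stage-extraction rule above (posterior threshold plus minimum duration) can be sketched as follows; the time grid and posterior values are illustrative, and the actual command applies further matching logic not shown here.

```python
import numpy as np

def stable_stages(posterior, times, threshold=0.6, min_duration=0.04):
    """Contiguous time ranges where the posterior for the decoded category
    stays at or above `threshold` for at least `min_duration` seconds."""
    stages, start = [], None
    for i, above in enumerate(posterior >= threshold):
        if above and start is None:
            start = i
        elif not above and start is not None:
            if times[i - 1] - times[start] >= min_duration:
                stages.append((float(times[start]), float(times[i - 1])))
            start = None
    if start is not None and times[-1] - times[start] >= min_duration:
        stages.append((float(times[start]), float(times[-1])))
    return stages

times = np.arange(0.0, 0.2, 0.01)                       # 10 ms grid
posterior = np.where((times >= 0.05) & (times < 0.15), 0.8, 0.4)
stages = stable_stages(posterior, times)
print([(round(a, 2), round(b, 2)) for a, b in stages])  # [(0.05, 0.14)]
```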
Plot the single-subject result:
python -m reptrace.plot_time_decode \
results/nod_sub-01_animate.csv \
--chance 0.5 \
--title "NOD sub-01 animate/inanimate" \
--out results/nod_sub-01_animate.png
After running several subjects, aggregate across subjects:
python -m reptrace.results \
results/nod_sub-01_animate.csv \
results/nod_sub-02_animate.csv \
results/nod_sub-03_animate.csv \
--out results/nod_animate_summary.csv
Then plot the aggregate:
python -m reptrace.plot_time_decode \
results/nod_animate_summary.csv \
--chance 0.5 \
--title "NOD animate/inanimate summary" \
--out results/nod_animate_summary.png
The same workflow can be run from a manifest:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_sub01.csv \
--report-out results/nod_animate_sub01_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_animate_sub01.csv \
--out-dir results/nod_animate_sub01 \
--aggregate-out results/nod_animate_sub01_summary.csv \
--plot-out results/nod_animate_sub01_summary.png \
--observation-dir results/nod_animate_sub01/observations \
--emission-mode both \
--chance 0.5
Manifest paths are resolved relative to the manifest file. The example manifest
expects staged files under data/nod/.
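The resolution rule can be sketched with pathlib; the ../data/nod entry below is illustrative, not the actual manifest content.

```python
from pathlib import Path

def resolve_manifest_path(manifest, entry):
    """Resolve a manifest entry against the manifest file's directory;
    absolute entries are kept as-is."""
    entry_path = Path(entry)
    if entry_path.is_absolute():
        return entry_path
    return (Path(manifest).parent / entry_path).resolve()

# A manifest in benchmarks/ pointing one level up into data/nod/.
p = resolve_manifest_path("benchmarks/nod_animate_sub01.csv",
                          "../data/nod/sub-01_epo.fif")
print(p.as_posix().endswith("data/nod/sub-01_epo.fif"))  # True
```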
For a paper-ready first pass, use the same animate/inanimate task and run five subjects at once from a single manifest:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_first5.csv \
--report-out results/nod_animate_first5_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_animate_first5.csv \
--out-dir results/nod_animate_first5 \
--aggregate-out results/nod_animate_first5_summary.csv \
--plot-out results/nod_animate_first5_summary.png \
--chance 0.5
This keeps the experiment scope fixed (same preprocessing, same target labels, same window/grid parameters) and changes only the subject set.
Generate a compact Markdown report from the aggregate and subject-level result CSVs:
python -m reptrace.report \
results/nod_animate_first5/summary.csv \
"results/nod_animate_first5/sub-*_time_decode.csv" \
--chance 0.5 \
--out results/nod_animate_first5/report.md
The report records the aggregate peak, baseline-window accuracy, effect-window accuracy, calibration metrics at the peak, and per-subject peaks.
After staging all available NOD-EEG preprocessed epoch files and detailed event files, validate the 19-subject manifest. This manifest uses 2 grouped folds because several subjects have only 2 unique session groups:
python -m reptrace.validate_manifest \
benchmarks/nod_animate_all.csv \
--report-out results/nod_animate_all_validation.csv
Then run the same animate/inanimate benchmark over every staged NOD-EEG subject:
python -m reptrace.benchmark \
benchmarks/nod_animate_all.csv \
--out-dir results/nod_animate_all \
--aggregate-out results/nod_animate_all/summary.csv \
--plot-out results/nod_animate_all/nod_animate_all_summary.png \
--chance 0.5
Make calibration explicit in the benchmark report:
python -m reptrace.calibration \
results/nod_animate_all/summary.csv \
--out-report results/nod_animate_all/calibration_report.md
The calibration report orders models by effect-window ECE, then Brier score and log loss. Accuracy is included as context, but the report is designed to keep probability quality visible rather than treating it as a secondary metric.
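For reference, a generic ECE computation looks like the sketch below; the equal-width binning is a common convention and an assumption here, not necessarily RepTrace's exact implementation.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    """ECE: bin predictions by confidence, then sum |mean confidence -
    empirical accuracy| per bin, weighted by the fraction of trials in it."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidence > lo) & (confidence <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidence[mask].mean() - correct[mask].mean())
    return ece

# A well-calibrated toy decoder: 0.7-confidence predictions are 70% correct.
conf = [0.7] * 10
corr = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(round(expected_calibration_error(conf, corr), 3))  # 0.0
```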
Run subject-level inference on the resulting subject CSVs:
python -m reptrace.inference \
"results/nod_animate_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--cluster-alpha 0.05 \
--out-time results/nod_animate_all/inference_time.csv \
--out-clusters results/nod_animate_all/inference_clusters.csv
The inference command first averages folds within each subject, then runs a one-sided subject-level sign-flip test against chance at each time point. It also reports max-cluster-mass corrected p-values for contiguous above-threshold periods.
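The sign-flip logic can be sketched as follows for a single time point; the subject effect values and the add-one correction are illustrative assumptions.

```python
import numpy as np

def sign_flip_pvalue(subject_effects, n_permutations=10000, seed=0):
    """One-sided sign-flip test: how often does randomly flipping the sign of
    each subject's (accuracy - chance) effect produce a mean at least as
    large as the observed mean?"""
    rng = np.random.default_rng(seed)
    effects = np.asarray(subject_effects, dtype=float)
    observed = effects.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_permutations, effects.size))
    null_means = (flips * effects).mean(axis=1)
    # Add-one correction keeps the p-value away from exactly zero.
    return (1 + np.sum(null_means >= observed)) / (n_permutations + 1)

# Five subjects all above chance by a similar margin.
p = sign_flip_pvalue([0.06, 0.08, 0.05, 0.07, 0.09])
print(p < 0.05)  # True
```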
This larger run is the minimum useful scale for subject-level statistical testing. The 5-subject pilot is useful for smoke testing and early signal checking; reported claims should use the full staged manifest.
Use benchmarks/nod_superclass_canine_device_all.csv for a second public task
within the same staged NOD-EEG data. This task decodes ImageNet superclass
labels canine versus device, using only trials whose super_class exactly
matches one of those labels. The full staged set contains 7,293 canine trials
and 6,950 device trials across 19 subjects.
python -m reptrace.validate_manifest \
benchmarks/nod_superclass_canine_device_all.csv \
--report-out results/nod_superclass_canine_device_all_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_superclass_canine_device_all.csv \
--out-dir results/nod_superclass_canine_device_all \
--aggregate-out results/nod_superclass_canine_device_all/summary.csv \
--plot-out results/nod_superclass_canine_device_all/summary.png \
--calibration-dir results/nod_superclass_canine_device_all/calibration \
--chance 0.5
python -m reptrace.report \
results/nod_superclass_canine_device_all/summary.csv \
"results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
--chance 0.5 \
--out results/nod_superclass_canine_device_all/report.md
python -m reptrace.inference \
"results/nod_superclass_canine_device_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--cluster-alpha 0.05 \
--out-time results/nod_superclass_canine_device_all/inference_time.csv \
--out-clusters results/nod_superclass_canine_device_all/inference_clusters.csv
python -m reptrace.calibration \
results/nod_superclass_canine_device_all/summary.csv \
"results/nod_superclass_canine_device_all/calibration/*_calibration_bins.csv" \
--out-report results/nod_superclass_canine_device_all/calibration_report.md \
--out-bins results/nod_superclass_canine_device_all/reliability_bins.csv
This gives the paper a second semantic benchmark without changing dataset, preprocessing, CV logic, or reporting machinery.
Use benchmarks/nod_superclass_container_covering_all.csv for the next staged
task. This contrast decodes ImageNet superclass labels container versus
covering, using only trials whose super_class exactly matches one of those
labels. Both labels are inanimate, so this task tests category decoding beyond
the animate/inanimate distinction. The full staged set contains 5,215 container
trials and 4,809 covering trials across all 19 subjects.
python -m reptrace.validate_manifest \
benchmarks/nod_superclass_container_covering_all.csv \
--report-out results/nod_superclass_container_covering_all_validation.csv
python -m reptrace.benchmark \
benchmarks/nod_superclass_container_covering_all.csv \
--out-dir results/nod_superclass_container_covering_all \
--aggregate-out results/nod_superclass_container_covering_all/summary.csv \
--plot-out results/nod_superclass_container_covering_all/summary.png \
--calibration-dir results/nod_superclass_container_covering_all/calibration \
--chance 0.5 \
--resume
After the benchmark finishes, generate the same report, inference, calibration, and reliability outputs as for the canine/device superclass task:
python -m reptrace.report \
results/nod_superclass_container_covering_all/summary.csv \
"results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
--out results/nod_superclass_container_covering_all/report.md \
--chance 0.5
python -m reptrace.inference \
"results/nod_superclass_container_covering_all/sub-*_time_decode.csv" \
--chance 0.5 \
--n-permutations 10000 \
--out-time results/nod_superclass_container_covering_all/inference_time.csv \
--out-clusters results/nod_superclass_container_covering_all/inference_clusters.csv
python -m reptrace.calibration \
results/nod_superclass_container_covering_all/summary.csv \
"results/nod_superclass_container_covering_all/calibration/*_calibration_bins.csv" \
--out-report results/nod_superclass_container_covering_all/calibration_report.md \
--out-bins results/nod_superclass_container_covering_all/reliability_bins.csv
python -m reptrace.plot_calibration \
results/nod_superclass_container_covering_all/reliability_bins.csv \
--out results/nod_superclass_container_covering_all/reliability.png \
--time-window 0.1 0.8 \
--title "NOD container/covering calibration"
RepTrace supports standard probability-producing decoders via the decoder
manifest column or the --decoder CLI option:
- logistic: balanced multinomial logistic regression
- lda: linear discriminant analysis
- linear_svm: calibrated balanced linear support vector machine
Run the first-five-subject decoder comparison:
python -m reptrace.benchmark \
benchmarks/nod_animate_decoders_first5.csv \
--out-dir results/nod_animate_decoders_first5 \
--aggregate-out results/nod_animate_decoders_first5/summary.csv \
--plot-out results/nod_animate_decoders_first5/summary.png \
--calibration-dir results/nod_animate_decoders_first5/calibration \
--chance 0.5
When a manifest contains a decoder column, result files are named like
sub-01_logistic_time_decode.csv, and aggregate summaries preserve the
decoder column rather than averaging decoders together.
Generate a decoder comparison report:
python -m reptrace.report \
results/nod_animate_decoders_first5/summary.csv \
--out results/nod_animate_decoders_first5/report.md
For decoder comparisons, the report includes both raw effect-window accuracy and effect minus baseline-window accuracy. The baseline-corrected value is the more relevant reported number when a decoder shows pre-stimulus bias.
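Assuming a subject CSV exposes time and accuracy columns (the exact column names are an assumption here), the baseline-corrected value reduces to:

```python
import pandas as pd

def baseline_corrected(df, baseline=(-0.1, 0.0), effect=(0.1, 0.8)):
    """Effect-window accuracy minus baseline-window accuracy for one
    decoder's time-resolved results."""
    in_base = df["time"].between(*baseline)
    in_eff = df["time"].between(*effect)
    return df.loc[in_eff, "accuracy"].mean() - df.loc[in_base, "accuracy"].mean()

df = pd.DataFrame({"time": [-0.05, 0.0, 0.2, 0.4],
                   "accuracy": [0.52, 0.54, 0.70, 0.66]})
print(round(baseline_corrected(df), 3))  # 0.15
```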
Create a calibration-aware decoder report and aggregate reliability bins:
python -m reptrace.calibration \
results/nod_animate_decoders_first5/summary.csv \
"results/nod_animate_decoders_first5/calibration/*_calibration_bins.csv" \
--out-report results/nod_animate_decoders_first5/calibration_report.md \
--out-bins results/nod_animate_decoders_first5/reliability_bins.csv
Plot an effect-window reliability diagram from aggregate reliability bins:
python -m reptrace.plot_calibration \
results/nod_animate_decoders_first5/reliability_bins.csv \
--out results/nod_animate_decoders_first5/reliability.png \
--time-window 0.1 0.8 \
--title "NOD animate/inanimate decoder calibration"
Run paired subject-level decoder statistics:
python -m reptrace.paired_stats \
"results/nod_animate_decoders_first5/sub-*_time_decode.csv" \
--out-csv results/nod_animate_decoders_first5/paired_stats.csv \
--out-report results/nod_animate_decoders_first5/paired_stats.md \
--chance 0.5
Run the full 19-subject decoder comparison on a self-hosted GitHub Actions runner:
gh workflow run nod-decoder-all.yml \
--repo IPS-Stuttgart/RepTrace \
--ref main \
-f data_root=../data/nod \
-f output_dir=results/nod_animate_decoders_all \
-f n_permutations=10000
The workflow rewrites the committed manifest to use the supplied data_root,
runs logistic regression, LDA, and calibrated linear SVM across all 19 staged
NOD-EEG subjects, then uploads only compact summary, calibration, and inference
artifacts. The benchmark step uses --resume, so a rerun in the same output
directory skips completed subject-decoder rows whose result and calibration-bin
CSVs already exist. Use an absolute data_root when the self-hosted runner
keeps the NOD files outside the repository workspace.
After downloading or locating the workflow output directory, export only paper-safe artifacts into the compact export directory:
python scripts/export_paper_results.py \
results/nod_animate_decoders_all \
../2026-05-RepTrace-Paper/results/nod_animate_decoders_all \
--max-mb 50 \
--plot-reliability \
--reliability-window 0.1 0.8
The first useful milestone is not just above-chance accuracy. The benchmark should produce stable probability traces and calibration metrics that can be compared across subjects, sessions, and decoder variants.
For interrupted runs, rerun the same command with --resume. RepTrace will keep
complete existing rows, regenerate missing rows, and rebuild the aggregate
summary and plot from the combined output set.
When --observation-dir is requested, resume mode also requires the matching
subject observation CSV before skipping a manifest row. This prevents a run from
appearing complete when metric summaries exist but the probability traces needed
for downstream state modeling are missing.
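A simplified model of that resume check, with illustrative file names:

```python
import tempfile
from pathlib import Path

def can_skip(result_csv, obs_csv=None):
    """Resume check: skip a manifest row only if the result CSV exists and,
    when observations were requested, the observation CSV exists too."""
    if not Path(result_csv).is_file():
        return False
    return obs_csv is None or Path(obs_csv).is_file()

with tempfile.TemporaryDirectory() as tmp:
    result = Path(tmp) / "sub-01_time_decode.csv"
    result.touch()
    ok_without_obs = can_skip(result)  # no observations requested
    ok_missing_obs = can_skip(result, Path(tmp) / "sub-01_observations.csv")
print(ok_without_obs, ok_missing_obs)  # True False
```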
After a manifest run with --observation-dir, fit the same temporal model across
all exported subject observations:
python -m reptrace.temporal_model \
"results/nod_animate_all/observations/*_observations.csv" \
--out-summary results/nod_animate_all/temporal_model.csv \
--out-states results/nod_animate_all/state_trace.csv \
--n-permutations 100
Compare calibrated versus uncalibrated emissions:
python -m reptrace.emission_compare \
results/nod_animate_all/temporal_model.csv \
--out-csv results/nod_animate_all/emission_compare.csv \
--out-report results/nod_animate_all/emission_compare.md
Then summarize category-conditioned stages:
python -m reptrace.semantic_stages \
results/nod_animate_all/state_trace.csv \
--out-time results/nod_animate_all/semantic_stage_time.csv \
--out-stages results/nod_animate_all/semantic_stages.csv \
--out-report results/nod_animate_all/semantic_stages.md
Use reptrace.temporal_state_workflow to run the calibration-aware temporal-state pass
across the three staged NOD tasks: animate/inanimate, canine/device, and
container/covering. The workflow prepares runner-local manifests, validates all
19 NOD-EEG subjects, runs matched calibrated and uncalibrated emissions in the
same folds, fits sticky temporal models, compares controls, summarizes semantic
stages, and writes compact artifacts to the compact export directory.
python -m reptrace.temporal_state_workflow \
--data-root data/nod \
--out-dir results/temporal_state_inference \
--compact-export-dir ../RepTrace-Compact-Results/results/temporal_state_inference \
--decoders logistic linear_svm \
--n-permutations 100
The top-level outputs are temporal_state_summary.csv, temporal_state_reliability.png,
temporal_state_evidence.md, and temporal_state_commands.md. The compact export intentionally
excludes large probability observation and state-trace CSVs.
For a smoke test, run one task and one subject with fewer permutations:
python -m reptrace.temporal_state_workflow \
--data-root data/nod \
--out-dir results/temporal_state_smoke \
--task nod_animate \
--decoders linear_svm \
--max-subjects 1 \
--n-permutations 5 \
--stay-grid-size 20