Build an end-to-end pipeline that finds the relevant predictor–outcome effect size(s) for a research question, converts them to Pearson's r when needed, and outputs a single aggregate effect size per study.
Conducting meta-analyses in psychology is time-consuming because manually coding effect sizes from articles is complex and open-ended. Studies often report many scales that can map onto multiple constructs, so meta-analysts must carefully determine which reported relationship matches a given research question and then extract the correct effect size.
This year's SIOP Machine Learning Competition focuses on automating the article coding process: given a set of articles and a corresponding research question, your pipeline must extract the relevant predictor–dependent variable relationship effect size(s), convert them to Pearson’s r if needed, and report a single aggregated effect size per paper.
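When a study reports a different effect size metric, standard conversion formulas apply; for example, Cohen's $d$ converts to $r$ via $r = d/\sqrt{d^2 + a}$, where $a = (n_1+n_2)^2/(n_1 n_2)$, reducing to $a = 4$ for equal group sizes. A minimal sketch in Python (the function name is illustrative, not part of any competition API):

```python
import math

def cohens_d_to_r(d: float, n1: int | None = None, n2: int | None = None) -> float:
    """Convert Cohen's d to Pearson's r.

    Uses a = (n1 + n2)**2 / (n1 * n2) when group sizes are known,
    and the equal-group-size default a = 4 otherwise.
    """
    a = 4.0 if n1 is None or n2 is None else (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)

print(cohens_d_to_r(0.5))          # equal groups assumed -> ~0.243
print(cohens_d_to_r(0.5, 40, 20))  # unequal groups (a = 4.5) -> ~0.229
```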
Competition format: Competitors receive a development dataset to build and test their pipeline and, during the final week, a test dataset to generate official predictions.
| Date | Event | Submission Limits |
|---|---|---|
| 3/7 | Competition begins @ 5pm Eastern; Dev dataset released | 5 per day, 100 total |
| 4/4 | Test dataset released @ 5pm Eastern | 3 total |
| 4/11 | Competition ends @ 11:59pm Eastern | — |
| 4/12 | Winners notified via email | — |
| 4/13 | Winning solution verification begins | — |
| 4/17 | Winning solution verification completes | — |
| 4/30 | Winning solutions presented at the 2026 SIOP conference | — |
Each study might have multiple effect sizes relevant to the research question. For example, suppose Study 1 used Scale A to measure the predictor and Scales B, C, and D to measure the criterion; the reported effect sizes are $r_{AB}$, $r_{AC}$, and $r_{AD}$, and the true aggregate score is the average of those observed correlations. Your pipeline predicts the true aggregate score by averaging across the effects it extracts. Pipelines are evaluated as a whole using the mean squared error (MSE) between predicted and true aggregate scores:
$$ \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{r}_i - r_i\right)^2 $$
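As a concrete illustration of the scoring (all values below are made up), aggregating one study's extracted correlations and computing MSE over a set of predictions might look like:

```python
# Aggregate one study's extracted effect sizes by simple averaging.
extracted = [0.30, 0.22, 0.26]  # e.g., r_AB, r_AC, r_AD (illustrative values)
aggregate = sum(extracted) / len(extracted)

# Score predicted aggregates against the true aggregates with MSE.
def mse(predicted: list[float], true: list[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(predicted, true)) / len(true)

print(aggregate)                           # 0.26
print(mse([0.26, -0.10], [0.25, -0.11]))   # 0.0001
```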
Submissions must be uploaded as a CSV file containing exactly two columns: `studyid` and `aggregateeffectsize`. Each row should correspond to one study, where `aggregateeffectsize` is your predicted aggregate Pearson's r for that study.
Example CSV:

```csv
studyid,aggregateeffectsize
study1,0.23
study2,-0.11
study3,0.00
```
Tip: Ensure the header names match exactly (`studyid`, `aggregateeffectsize`) and that every study in the dataset appears once.
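A minimal sketch of writing a valid submission file using only the standard library (the `predictions` dict stands in for your pipeline's output):

```python
import csv

predictions = {"study1": 0.23, "study2": -0.11, "study3": 0.00}  # placeholder output

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["studyid", "aggregateeffectsize"])  # headers must match exactly
    for study_id, r in predictions.items():
        writer.writerow([study_id, r])
```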
Participants design a function (in a programming language of their choice) that accepts:
- the article(s) associated with a study, and
- the corresponding research question.

The function must then (fully computationally and automatically):
- identify the reported relationship(s) that match the research question,
- extract the associated effect size(s), converting them to Pearson's r where needed, and
- aggregate them into a single effect size for the study.
Any approach is allowed as long as it runs fully computationally and automatically, with no manual coding of effect sizes. A minimal skeleton of such a function is sketched below.
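In this skeleton, articles are assumed to be plain text, and `extract_candidate_effects` is a hypothetical stand-in (here a naive regex) for whatever matching and extraction logic your pipeline implements:

```python
import re

def extract_candidate_effects(article_text: str, research_question: str) -> list[float]:
    """Hypothetical stand-in: a real pipeline would match reported relationships
    to the research question (e.g., via an LLM or NLP rules). This naive version
    simply collects every 'r = .XX' value in the text."""
    return [float(m) for m in re.findall(r"r\s*=\s*(-?\.\d+)", article_text)]

def predict_aggregate_effect_size(article_text: str, research_question: str) -> float:
    """Return the predicted aggregate Pearson's r for one study."""
    rs = extract_candidate_effects(article_text, research_question)
    return sum(rs) / len(rs) if rs else 0.0  # average the matched correlations

print(predict_aggregate_effect_size(
    "Scale A with B: r = .30; with C: r = .22.", "A-criterion relationship"))  # 0.26
```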
Enter your team token to download the Dev dataset (and, during the final week, the Test dataset).
If you have issues accessing any of the articles, please send a message to ivanhernandez@vt.edu.