project-predictions-and-pmids
Sysrev creates machine learning models for every review. Sometimes it is useful to combine model predictions with article data to create new reviews. Here we do that for PubMed articles, but the concepts generalize to other data sources. We will use the Sysrev project jian-jiang/AOP updating vub steatosis for this example; the code below refers to it by its project ID, 103067.
The `external_id` column tracks external document identifiers for Sysrev articles. For articles imported from PubMed it holds the PMID:
library(dplyr)  # provides select(), group_by(), filter(), inner_join()

aop <- rsr::get_articles(103067) |>
  select(project_id, aid, datasource_name, external_id)
knitr::kable(head(aop))
project_id | aid | datasource_name | external_id |
---|---|---|---|
103067 | 13591692 | pubmed | “24151358” |
103067 | 13591690 | pubmed | “23236639” |
103067 | 13591691 | pubmed | “23978457” |
103067 | 13591698 | pubmed | “25560223” |
103067 | 13591699 | pubmed | “26276582” |
103067 | 13591697 | pubmed | “24523126” |
The first article can be found at sysrev.com/p/103067/article/13591692 and also at https://pubmed.ncbi.nlm.nih.gov/24151358/.
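As a minimal sketch, both links can be built from `project_id`, `aid`, and `external_id`. The rows below are copied from the table above. The `https://` prefix on the Sysrev link and the digit-only cleanup of `external_id` (the values appear wrapped in quotation marks) are assumptions, and the column names `pmid`, `sysrev_url`, and `pubmed_url` are illustrative.

```r
# Two article rows copied from the table above.
articles <- data.frame(
  project_id  = c(103067L, 103067L),
  aid         = c(13591692L, 13591690L),
  external_id = c("\"24151358\"", "\"23236639\""),
  stringsAsFactors = FALSE
)

# Keep only the digits of external_id, dropping any surrounding quotes.
articles$pmid <- gsub("[^0-9]", "", articles$external_id)

# URL patterns follow the two example links given in the text.
articles$sysrev_url <- paste0(
  "https://sysrev.com/p/", articles$project_id, "/article/", articles$aid
)
articles$pubmed_url <- paste0(
  "https://pubmed.ncbi.nlm.nih.gov/", articles$pmid, "/"
)

print(articles[, c("aid", "pmid", "pubmed_url")])
```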
Filtering with machine learning
This project has predictions generated by Sysrev machine learning, which we can download with `rsr::get_predictions`.
aop.pred.raw <- rsr::get_predictions(103067)

# get the most recent predictions
aop.pred <- aop.pred.raw |>
  group_by(project_id) |>
  filter(predict_run_id == max(predict_run_id))

aop.pred |>
  select(project_id, predict_run_id, lid, value, probability) |>
  print(n = 2)
# A tibble: 164,300 × 5
# Groups: project_id [1]
project_id predict_run_id lid value probability
<int> <int> <chr> <chr> <dbl>
1 103067 27277 412b8a66-7b92-4554-b543… Dire… 0.502
2 103067 27277 412b8a66-7b92-4554-b543… Dire… 0.511
# … with 164,298 more rows
Filtering with `predict_run_id == max(predict_run_id)` within each `project_id` group keeps only the rows from the most recent prediction run of each project. Now we can join with the labels defined in each project:
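The grouped filter is easier to see on a small table. The sketch below uses made-up project IDs, run IDs, and probabilities (none of them come from Sysrev) and keeps only the rows belonging to the latest run of each project.

```r
library(dplyr)

# Toy predictions: every value here is invented for illustration.
toy_pred <- tibble(
  project_id     = c(1L, 1L, 1L, 2L, 2L),
  predict_run_id = c(10L, 11L, 11L, 7L, 9L),
  aid            = c(101L, 101L, 102L, 201L, 201L),
  probability    = c(0.40, 0.55, 0.30, 0.80, 0.65)
)

# Within each project, keep the rows whose run id equals the group maximum.
latest <- toy_pred |>
  group_by(project_id) |>
  filter(predict_run_id == max(predict_run_id)) |>
  ungroup()

print(latest)
```

Project 1 keeps its two rows from run 11 and project 2 keeps its single row from run 9; the older runs are dropped.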
aop.lbl <- rbind(
  rsr::get_labels(103067),
  rsr::get_labels(113583)) |>
  select(lid, short_label)

# TODO, this vignette is incomplete come back and finish it
aop.pred |>
  inner_join(aop.lbl, by = "lid") |>
  filter(short_label == "Include") |>
  select(project_id, aid, short_label, value, probability)
# A tibble: 13,144 × 5
# Groups: project_id [1]
project_id aid short_label value probability
<int> <int> <chr> <chr> <dbl>
1 103067 13592066 Include TRUE 0.391
2 103067 13592227 Include TRUE 0.512
3 103067 13591910 Include TRUE 0.431
4 103067 13592023 Include TRUE 0.351
5 103067 13592090 Include TRUE 0.575
6 103067 13592080 Include TRUE 0.448
7 103067 13592069 Include TRUE 0.428
8 103067 13592079 Include TRUE 0.536
9 103067 13592077 Include TRUE 0.544
10 103067 13592070 Include TRUE 0.594
# … with 13,134 more rows
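One possible next step, sketched here under stated assumptions, is to join the "Include" predictions back to the article table on `aid` so that each probability sits next to a PMID. The article rows are copied from the first table in this vignette; the probabilities and the 0.5 cutoff are hypothetical placeholders, not Sysrev output, and `likely_includes` is an illustrative name.

```r
library(dplyr)

# Article rows copied from the first table (external_id holds the PMID).
articles <- tibble(
  aid         = c(13591692L, 13591690L, 13591691L),
  external_id = c("24151358", "23236639", "23978457")
)

# Hypothetical "Include" probabilities; NOT real Sysrev predictions.
include_pred <- tibble(
  aid         = c(13591692L, 13591690L, 13591691L),
  probability = c(0.62, 0.35, 0.51)
)

threshold <- 0.5  # assumed cutoff, chosen only for illustration

# Attach PMIDs to predictions and keep the articles at or above the cutoff.
likely_includes <- include_pred |>
  inner_join(articles, by = "aid") |>
  filter(probability >= threshold) |>
  arrange(desc(probability)) |>
  select(aid, pmid = external_id, probability)

print(likely_includes)
```

The resulting `pmid` column could then serve as the starting list of PubMed articles for a new review.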