Ontox-Dev: Predictions and pmids

Sysrev creates machine learning models for every review. Sometimes it is useful to combine model predictions with article data to create new reviews. Here we do that for PubMed, but the concepts generalize to other data sources. We will use the Sysrev project jian-jiang/AOP updating vub steatosis for this example.

How to link Sysrev article IDs to PubMed IDs?

`external_id` is used to track external document identifiers for sysrev articles:

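library(dplyr)  # provides select(), group_by(), filter(), inner_join()

# Fetch the project's articles and keep the identifier columns
aop <- rsr::get_articles(103067) |>
  select(project_id, aid, datasource_name, external_id)

knitr::kable(head(aop))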

project_id	aid	datasource_name	external_id
103067	13591692	pubmed	“24151358”
103067	13591690	pubmed	“23236639”
103067	13591691	pubmed	“23978457”
103067	13591698	pubmed	“25560223”
103067	13591699	pubmed	“26276582”
103067	13591697	pubmed	“24523126”

The first article can be found at sysrev.com/p/103067/article/13591692 and also at https://pubmed.ncbi.nlm.nih.gov/24151358/.
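
As a minimal sketch of that mapping, the snippet below builds both URLs from `aid` and `external_id`. It uses a toy data frame (`aop_demo`, a hypothetical name) that copies the first three rows of the table above rather than calling the Sysrev API. It assumes that `external_id` may carry literal quote characters, as the table rendering suggests, and that the Sysrev address takes an `https://` prefix.

```r
# Toy copy of the first three rows shown above; in practice `aop`
# would come from rsr::get_articles(103067).
aop_demo <- data.frame(
  project_id  = c(103067L, 103067L, 103067L),
  aid         = c(13591692L, 13591690L, 13591691L),
  external_id = c('"24151358"', '"23236639"', '"23978457"'),
  stringsAsFactors = FALSE
)

# Assumption: external_id may include quote characters, so keep digits only.
aop_demo$pmid <- gsub("[^0-9]", "", aop_demo$external_id)

# Build the two URLs that point at the same article.
aop_demo$sysrev_url <- sprintf("https://sysrev.com/p/%d/article/%d",
                               aop_demo$project_id, aop_demo$aid)
aop_demo$pubmed_url <- sprintf("https://pubmed.ncbi.nlm.nih.gov/%s/",
                               aop_demo$pmid)

print(aop_demo[, c("aid", "pmid", "pubmed_url")])
```

Stripping non-digits is harmless when the identifier is already a bare number, so the same step works whether or not the quotes are present.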

Filtering with machine learning
Sysrev machine learning generates predictions for this project.

aop.pred.raw <- rsr::get_predictions(103067)

# Keep only the most recent prediction run in each project
aop.pred <- aop.pred.raw |>
  group_by(project_id) |>
  filter(predict_run_id == max(predict_run_id))

aop.pred |>
  select(project_id, predict_run_id, lid, value, probability) |>
  print(n = 2)

# A tibble: 164,300 × 5
# Groups:   project_id [1]
  project_id predict_run_id lid                      value probability
       <int>          <int> <chr>                    <chr>       <dbl>
1     103067          27277 412b8a66-7b92-4554-b543… Dire…       0.502
2     103067          27277 412b8a66-7b92-4554-b543… Dire…       0.511
# … with 164,298 more rows
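
To make the per-project behaviour of that filter concrete, here is a small sketch on made-up data with two hypothetical projects. It uses base R (`ave()`) instead of dplyr so that it runs without extra packages; the table `pred_demo` and all of its values are invented for illustration.

```r
# Invented predictions: project 1 has runs 10 and 11, project 2 has runs 20 and 21.
pred_demo <- data.frame(
  project_id     = c(1L, 1L, 1L, 2L, 2L),
  predict_run_id = c(10L, 11L, 11L, 20L, 21L),
  probability    = c(0.40, 0.55, 0.61, 0.30, 0.72)
)

# For every row, look up the largest predict_run_id within its project.
latest_run <- ave(pred_demo$predict_run_id, pred_demo$project_id, FUN = max)

# Keep only rows that belong to that latest run.
latest <- pred_demo[pred_demo$predict_run_id == latest_run, ]
print(latest)
```

Because the maximum is taken within each project, project 1 keeps run 11 and project 2 keeps run 21, even though the run IDs differ between projects.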

Filtering with predict_run_id==max(predict_run_id) within each project group keeps only the most recent prediction run for that project. Now we can join with the labels defined in each project:

# Label definitions (lid -> short_label) from both projects
aop.lbl <- rbind(
  rsr::get_labels(103067),
  rsr::get_labels(113583)) |>
  select(lid, short_label)

# TODO: this vignette is incomplete; come back and finish it
# Attach label names to the predictions and keep the "Include" label
aop.pred |>
  inner_join(aop.lbl, by = "lid") |>
  filter(short_label == "Include") |>
  select(project_id, aid, short_label, value, probability)

# A tibble: 13,144 × 5
# Groups:   project_id [1]
   project_id      aid short_label value probability
        <int>    <int> <chr>       <chr>       <dbl>
 1     103067 13592066 Include     TRUE        0.391
 2     103067 13592227 Include     TRUE        0.512
 3     103067 13591910 Include     TRUE        0.431
 4     103067 13592023 Include     TRUE        0.351
 5     103067 13592090 Include     TRUE        0.575
 6     103067 13592080 Include     TRUE        0.448
 7     103067 13592069 Include     TRUE        0.428
 8     103067 13592079 Include     TRUE        0.536
 9     103067 13592077 Include     TRUE        0.544
10     103067 13592070 Include     TRUE        0.594
# … with 13,134 more rows
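
One possible next step, sketched here under stated assumptions, is to join the "Include" predictions back to the article table by `aid` and keep the PubMed IDs of likely includes. The probabilities below are invented, the 0.5 cutoff is an arbitrary illustration rather than a recommended threshold, and the names `articles`, `include_pred` and `likely` are hypothetical. Base R `merge()` stands in for `inner_join()` so the snippet runs without extra packages.

```r
# Article identifiers copied from the first table in this vignette.
articles <- data.frame(
  aid         = c(13591692L, 13591690L, 13591691L),
  external_id = c('"24151358"', '"23236639"', '"23978457"'),
  stringsAsFactors = FALSE
)

# Hypothetical "Include" probabilities for those articles (not real output).
include_pred <- data.frame(
  aid         = c(13591692L, 13591690L, 13591691L),
  probability = c(0.62, 0.35, 0.51)
)

# Join predictions to articles on the Sysrev article id.
joined <- merge(include_pred, articles, by = "aid")

# Assumption: external_id may include quote characters, so keep digits only.
joined$pmid <- gsub("[^0-9]", "", joined$external_id)

# Keep articles at or above an illustrative cutoff, most probable first.
threshold <- 0.5
likely <- joined[joined$probability >= threshold, c("aid", "pmid", "probability")]
likely <- likely[order(-likely$probability), ]
print(likely)
```

The resulting `pmid` column could then seed a new review or a PubMed search, which is the kind of combination of predictions and article data described at the top of this page.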