Predictions and pmids

Thomas Luechtefeld
2022-02-28

Sysrev creates machine learning models for every review. Sometimes it is useful to combine model predictions with article data to create new reviews. Here we do that for PubMed, but the concepts generalize to other data sources. We will use the Sysrev project jian-jiang/AOP updating vub steatosis (project id 103067) for this example.

How to link Sysrev article ids to PubMed ids?

The `external_id` column tracks external document identifiers for Sysrev articles. For articles imported from PubMed it holds the PubMed id:

library(dplyr)

# fetch the project's articles and keep the identifier columns
aop <- rsr::get_articles(103067) |>
  select(project_id, aid, datasource_name, external_id)

knitr::kable(head(aop))

project_id  aid       datasource_name  external_id
103067      13591692  pubmed           "24151358"
103067      13591690  pubmed           "23236639"
103067      13591691  pubmed           "23978457"
103067      13591698  pubmed           "25560223"
103067      13591699  pubmed           "26276582"
103067      13591697  pubmed           "24523126"

The first article can be found at sysrev.com/p/103067/article/13591692 and also at https://pubmed.ncbi.nlm.nih.gov/24151358/.
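
The mapping between the two URLs above can be sketched in a few lines of base R. This is a minimal illustration, not part of `rsr`: the `aop_demo` data frame is a hypothetical stand-in for the first rows of `aop`, and it assumes the quotes shown around `external_id` in the table are part of the stored value (the code also works if they are not).

```r
# Hypothetical stand-in for the first two rows of `aop` (values copied from the table above)
aop_demo <- data.frame(
  aid         = c(13591692L, 13591690L),
  external_id = c('"24151358"', '"23236639"'),
  stringsAsFactors = FALSE
)

# Assumption: external_id may carry literal double quotes; strip them to get a bare PMID
aop_demo$pmid <- gsub('"', "", aop_demo$external_id, fixed = TRUE)

# Build the Sysrev and PubMed URLs following the patterns shown in the text
aop_demo$sysrev_url <- paste0("https://sysrev.com/p/103067/article/", aop_demo$aid)
aop_demo$pubmed_url <- paste0("https://pubmed.ncbi.nlm.nih.gov/", aop_demo$pmid, "/")

print(aop_demo[, c("aid", "pmid", "pubmed_url")])
```

Keeping the cleaned id in its own `pmid` column leaves `external_id` untouched, so the original value is still available if the quoting assumption turns out to be wrong.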

Filtering with machine learning

The project has predictions generated by Sysrev machine learning, which `rsr::get_predictions` retrieves:

aop.pred.raw <- rsr::get_predictions(103067)

# get the most recent predictions
aop.pred <- aop.pred.raw |>
  group_by(project_id) |>
  filter(predict_run_id == max(predict_run_id))

aop.pred |> 
  select(project_id,predict_run_id,lid,value,probability) |> 
  print(n=2)
# A tibble: 164,300 × 5
# Groups:   project_id [1]
  project_id predict_run_id lid                      value probability
       <int>          <int> <chr>                    <chr>       <dbl>
1     103067          27277 412b8a66-7b92-4554-b543… Dire…       0.502
2     103067          27277 412b8a66-7b92-4554-b543… Dire…       0.511
# … with 164,298 more rows
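
To see why the filter is applied per project, here is a small sketch on made-up data. The `toy_pred` table and its values are hypothetical; only the `group_by` and `filter` pattern is taken from the code above.

```r
library(dplyr)

# Hypothetical toy predictions: project 1 has two runs, project 2 has one older run
toy_pred <- tibble(
  project_id     = c(1L, 1L, 1L, 1L, 2L, 2L),
  predict_run_id = c(10L, 10L, 11L, 11L, 7L, 7L),
  aid            = c(101L, 102L, 101L, 102L, 201L, 202L),
  probability    = c(0.20, 0.70, 0.25, 0.65, 0.40, 0.90)
)

# Keep the latest run within each project
latest <- toy_pred |>
  group_by(project_id) |>
  filter(predict_run_id == max(predict_run_id)) |>
  ungroup()

# Without group_by, the global maximum (run 11) would drop project 2 entirely
latest_global <- toy_pred |>
  filter(predict_run_id == max(predict_run_id))

print(latest)
print(latest_global)
```

With a single project the grouping makes no difference, but it keeps the code correct if predictions from several projects are bound together first.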

Grouping by project_id and filtering with predict_run_id == max(predict_run_id) keeps only the most recent prediction run for each project. Now we can join these predictions with the labels defined in each project:

aop.lbl <- rbind(
  rsr::get_labels(103067),
  rsr::get_labels(113583)) |> 
  select(lid,short_label)

# TODO: this vignette is incomplete; come back and finish it
aop.pred |> 
  inner_join(aop.lbl,by="lid") |> 
  filter(short_label=="Include") |> 
  select(project_id,aid,short_label,value,probability)
# A tibble: 13,144 × 5
# Groups:   project_id [1]
   project_id      aid short_label value probability
        <int>    <int> <chr>       <chr>       <dbl>
 1     103067 13592066 Include     TRUE        0.391
 2     103067 13592227 Include     TRUE        0.512
 3     103067 13591910 Include     TRUE        0.431
 4     103067 13592023 Include     TRUE        0.351
 5     103067 13592090 Include     TRUE        0.575
 6     103067 13592080 Include     TRUE        0.448
 7     103067 13592069 Include     TRUE        0.428
 8     103067 13592079 Include     TRUE        0.536
 9     103067 13592077 Include     TRUE        0.544
10     103067 13592070 Include     TRUE        0.594
# … with 13,134 more rows
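
One possible next step, given the stated goal of combining predictions with article data, is to join the "Include" predictions back to the article table and collect the PubMed ids of likely includes. The sketch below is an assumption about how that could look, not the author's method: `toy_articles` and `toy_pred` are hypothetical stand-ins (article ids and PubMed ids are copied from the first table, the probabilities are made up), and the 0.5 cutoff is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for `aop` (aid / external_id pairs from the first table)
toy_articles <- tibble(
  aid         = c(13591692L, 13591690L, 13591691L),
  external_id = c('"24151358"', '"23236639"', '"23978457"')
)

# Hypothetical stand-in for the joined "Include" predictions; probabilities are made up
toy_pred <- tibble(
  aid         = c(13591692L, 13591690L, 13591691L),
  short_label = "Include",
  value       = "TRUE",
  probability = c(0.62, 0.35, 0.58)
)

threshold <- 0.5  # arbitrary cutoff chosen only for illustration

likely_pmids <- toy_pred |>
  filter(short_label == "Include", value == "TRUE", probability >= threshold) |>
  inner_join(toy_articles, by = "aid") |>
  # Assumption: external_id may carry literal quotes; strip them to get a bare PMID
  mutate(pmid = gsub('"', "", external_id, fixed = TRUE)) |>
  pull(pmid)

print(likely_pmids)
```

The resulting character vector of PubMed ids could then seed a new review; how those ids are imported into a new Sysrev project is outside the scope of this sketch.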