---
title: "Label Propagation"
output:
    rmarkdown::html_vignette:
        toc: true
description: >
  Validate or extend cluster insights to new observations through semi-supervised label propagation.
vignette: >
  %\VignetteIndexEntry{Label Propagation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

<style>
div.aside { background-color:#fff2e6; }
</style>

```{r, include = FALSE}
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```

Download a copy of the vignette to follow along here: [label_propagation.Rmd](https://raw.githubusercontent.com/BRANCHlab/metasnf/main/vignettes/label_propagation.Rmd)

In this vignette, we will walk through label propagation in the metasnf package.
Code from this vignette is largely taken from the end of the [less simple example vignette](https://branchlab.github.io/metasnf/articles/a_complete_example.html).

The label propagation procedure can be used to predict cluster membership for new, unlabeled observations based on their similarity to previously labeled observations.
These unlabeled observations could be a held out test set from your original sample or a new sample entirely.

The process involves the following steps:

1. Assign clusters to some group of observations
2. Calculate all the pairwise similarities amongst all the already clustered and to-be-labeled observations
3. Run the label propagation algorithm to predict cluster membership in the to-be-labeled observations

There is a lot of room for flexibility in how steps 1 and 2 are conducted.
SNF is not necessary at any part of the process.
For example, step one could be done by assigning clusters in your training set manually or by a simple clustering method like k-means.
Step two could be done just by calculating the euclidean distances across all the training and testing subjects for a small subset of features.
The features used to calculate the similarities in step 2 don't necessarily need to be the same ones used to derive the cluster solution in the training set either.

All that aside, we show here a simple approach that involves assigning the clusters by a call to `batch_snf`, assembling a data list that has the training and testing set subjects, and feeding the results into a simple label propagating function, `lp_solutions_matrix`.

```{r}
library(metasnf)

# Start by making a data list containing all our dataframes to more easily
# identify subjects without missing data
all_subjects <- generate_data_list(
    list(cort_t, "cort_t", "neuroimaging", "continuous"),
    list(cort_sa, "cort_sa", "neuroimaging", "continuous"),
    list(subc_v, "subc_v", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    list(anxiety, "anxiety", "behaviour", "ordinal"),
    list(depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

# Get a vector of all the subjects
all_subjects <- get_dl_subjects(all_subjects)

# Dataframe assigning 80% of subjects to train and 20% to test
train_test_split <- train_test_assign(
    train_frac = 0.8,
    subjects = all_subjects
)

# Pulling the training and testing subjects specifically
train_subs <- train_test_split$"train"
test_subs <- train_test_split$"test"

# Partition a training set
train_cort_t <- cort_t[cort_t$"unique_id" %in% train_subs, ]
train_cort_sa <- cort_sa[cort_sa$"unique_id" %in% train_subs, ]
train_subc_v <- subc_v[subc_v$"unique_id" %in% train_subs, ]
train_income <- income[income$"unique_id" %in% train_subs, ]
train_pubertal <- pubertal[pubertal$"unique_id" %in% train_subs, ]
train_anxiety <- anxiety[anxiety$"unique_id" %in% train_subs, ]
train_depress <- depress[depress$"unique_id" %in% train_subs, ]

# Partition a test set
test_cort_t <- cort_t[cort_t$"unique_id" %in% test_subs, ]
test_cort_sa <- cort_sa[cort_sa$"unique_id" %in% test_subs, ]
test_subc_v <- subc_v[subc_v$"unique_id" %in% test_subs, ]
test_income <- income[income$"unique_id" %in% test_subs, ]
test_pubertal <- pubertal[pubertal$"unique_id" %in% test_subs, ]
test_anxiety <- anxiety[anxiety$"unique_id" %in% test_subs, ]
test_depress <- depress[depress$"unique_id" %in% test_subs, ]

# Find cluster solutions in the training set
train_data_list <- generate_data_list(
    list(train_cort_t, "cort_t", "neuroimaging", "continuous"),
    list(train_cort_sa, "cortical_sa", "neuroimaging", "continuous"),
    list(train_subc_v, "subc_v", "neuroimaging", "continuous"),
    list(train_income, "household_income", "demographics", "continuous"),
    list(train_pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

# We'll pick a solution that has good separation over our target features
train_target_list <- generate_data_list(
    list(train_anxiety, "anxiety", "behaviour", "ordinal"),
    list(train_depress, "depressed", "behaviour", "ordinal"),
    uid = "unique_id"
)

set.seed(42)
settings_matrix <- generate_settings_matrix(
    train_data_list,
    nrow = 5,
    min_k = 10,
    max_k = 30
)

train_solutions_matrix <- batch_snf(
    train_data_list,
    settings_matrix
)

extended_solutions_matrix <- extend_solutions(
    train_solutions_matrix,
    train_target_list
)

# Determining solution with the lowest minimum p-value
lowest_min_pval <- min(extended_solutions_matrix$"min_pval")
which(extended_solutions_matrix$"min_pval" == lowest_min_pval)
top_row <- extended_solutions_matrix[4, ]

# Propagate that solution to the subjects in the test set
# data list below has both training and testing subjects
full_data_list <- generate_data_list(
    list(cort_t, "cort_t", "neuroimaging", "continuous"),
    list(cort_sa, "cort_sa", "neuroimaging", "continuous"),
    list(subc_v, "subc_v", "neuroimaging", "continuous"),
    list(income, "household_income", "demographics", "continuous"),
    list(pubertal, "pubertal_status", "demographics", "continuous"),
    uid = "unique_id"
)

# Use the solutions matrix from the training subjects and the data list from
# the training and testing subjects to propagate labels to the test subjects
propagated_labels <- lp_solutions_matrix(top_row, full_data_list)

head(propagated_labels)
tail(propagated_labels)
```

You could, if you wanted, see how *all* of your clustering solutions propagate to the test set, but that would mean reusing your test set and removing much of the protection against overfitting provided by this procedure.

```{r}
propagated_labels_all <- lp_solutions_matrix(
    extended_solutions_matrix,
    full_data_list
)

head(propagated_labels_all)
tail(propagated_labels_all)
```