# Using OmicLearn

**OmicLearn** enables researchers and scientists to explore the latest machine learning (ML) algorithms for use in proteomics/transcriptomics.

The core steps of the pipeline are `Preprocessing`, `Missing Value Imputation`, `Feature Selection`, `Classification`, and `Validation` of the selected methods and algorithms. They are shown in the flowchart below:

![OmicLearn Workflow](images/workflow.png)

_**Figure 1:** Main steps for the workflow of OmicLearn at a glance_

## Uploading data

Your own data can be uploaded by dragging and dropping a file onto the file menu or by clicking the link.

The data should be formatted according to the following conventions:

> - The file format should be `.xlsx (Excel)`, `.csv (Comma-separated values)` or `.tsv (tab-separated values)`. For `.csv`, the separator should be either `comma (,)` or `semicolon (;)`.
>
> - Maximum file size is 200 MB.
>
> - 'Identifiers' such as protein IDs, gene names, lipids or miRNA IDs should be uppercase.
>
> - Each row corresponds to a sample, each column to a feature.
>
> - Additional features should be marked with a leading underscore (`_`).

![DATA_UPLOAD/SELECTION](images/upload.png)

_**Figure 2:** Uploading a dataset or selecting a sample file_

The data will be checked for consistency, and if your dataset contains missing values (`NaNs`), a notification will appear. In that case, consider using one of the imputation methods listed in the left sidebar.

![NAN_WARNING](images/nan_warning.png)

_**Figure 3:** Missing value warning_

### Sample Datasets

OmicLearn includes several sample [datasets](https://github.com/MannLabs/OmicLearn/tree/master/data) for exploring the analysis. They can be selected from the dropdown menu.

Here is the list of sample datasets available:

**`1.
Alzheimer Dataset`**

> 📁 **File Name:** Alzheimer.xlsx
>
> 📖 **Description:** Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease
>
> 🔗 **Source:** Bader, J., Geyer, P., Müller, J., Strauss, M., Koch, M., & Leypoldt, F. et al. (2020). Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer's disease. Molecular Systems Biology, 16(6). doi: [10.15252/msb.20199356](http://doi.org/10.15252/msb.20199356).

**`2. Sample Dataset`**

> 📁 **File Name:** Sample.xlsx
>
> 📖 **Description:** Sample dataset for testing the tool
>
> 🔗 **Source:** -

## Sidebar: Selecting Parameters

OmicLearn offers a wide range of options, which are detailed in the [methods](METHODS.md). The parameters can be selected in the sidebar.

After changing the parameters, you will be asked to re-run the analysis. Each analysis result will be stored in the [`Session History` section](#checking-the-session-history).

![OmicLearn SideBar](images/sidebar.png)

_**Figure 4:** OmicLearn sidebar options_

## Main Window: Selecting data, defining the workflow, and exploring results

### Data Selection

After the data is uploaded, it is displayed within the OmicLearn window and can be explored. The dropdown menu `Subset` allows you to specify a subset of the data based on values within a column. This way, you can exclude data that should not be used at all. For example, if you collected data from different sites, you could exclude one of the sites.

![Subset](images/subset.png)

_**Figure 5:** Example usage for `Subset` section_

Within `Features`, you should select the target column. This is the variable that the classifier should learn to distinguish. Because OmicLearn performs binary classification, the classifier has only two possible outcomes. By assigning multiple values to a class, multiple combinations of classifications can be tested.
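The idea of assigning several target values to one class can be pictured with a small, self-contained sketch. It is not OmicLearn's own code: the function name `assign_binary_classes`, the column values, and the choice to mark unassigned values as `None` are assumptions made purely for illustration.

```python
from typing import Hashable, Iterable, List, Optional


def assign_binary_classes(
    values: Iterable[Hashable],
    class_0: Iterable[Hashable],
    class_1: Iterable[Hashable],
) -> List[Optional[int]]:
    """Map raw target values to 0 or 1; values in neither class become None."""
    set_0, set_1 = set(class_0), set(class_1)
    overlap = set_0 & set_1
    if overlap:
        # A value cannot belong to both classes in a binary task.
        raise ValueError(f"Values assigned to both classes: {sorted(map(str, overlap))}")
    labels: List[Optional[int]] = []
    for value in values:
        if value in set_0:
            labels.append(0)
        elif value in set_1:
            labels.append(1)
        else:
            labels.append(None)  # not assigned to any class
    return labels


# Hypothetical target column with three distinct values.
diagnosis = ["control", "mild", "severe", "control", "severe"]

# Grouping "mild" and "severe" into one class yields a two-class problem.
labels = assign_binary_classes(diagnosis, class_0=["control"], class_1=["mild", "severe"])
print(labels)  # [0, 1, 1, 0, 1]
```

Regrouping the same values differently (for example, `class_0=["control", "mild"]` and `class_1=["severe"]`) gives another classification task on the same column, which is what testing multiple combinations means here.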
![Classification target](images/target.png)

_**Figure 6:** `Classification target` section for selecting the target columns and `Define classes` section for assigning the classes_

Furthermore, `Additional Features` can be selected. This refers to columns that are not identifiers such as protein IDs, gene names, lipids or miRNA IDs; they are not uppercase and have a leading underscore (`_`).

There is also an option to upload a file with multiple additional features that should be included for training. To use this option, upload a CSV (comma `,` separated) file where each row corresponds to an additional feature to be included for training. The file does not need to include a header (title row).

> The example file should look like this:
>
> ```
> _additional_feature_1
> _additional_feature_2
> ```

![Add Features](images/additional.png)

_**Figure 7:** Sample case for `Additional Features` option_

The section `Exclude features` enables users to exclude selected features manually. This can be useful, e.g., to assess performance without a top feature.

There is also an option to upload a file with multiple features that should be excluded. To use this option, upload a CSV (comma `,` separated) file where each row corresponds to a feature to be excluded. The file does not need to include a header (title row).

> The example file should look like this:
>
> ```
> protein_A
> protein_B
> ```

![exclude_identifiers](images/exclude.png)

_**Figure 8:** Selections on the dataset_

The option `Cohort comparison` allows comparing results over different cohorts (i.e., train on one cohort and predict on another).

![dataselections](images/cohort.png)

_**Figure 9:** Selections on the dataset_

### Running the Workflow

After selecting all parameters, you can run the workflow by clicking the `Run Analysis` button.
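Before running, a feature-list file of the kind described above (one feature name per row, no header) can be sanity-checked offline. The following is a minimal sketch using only the Python standard library; it is not part of OmicLearn, and the helper names `read_feature_list` and `split_known_unknown` as well as the dataset column names are hypothetical.

```python
import csv
import io
from typing import Iterable, List, TextIO, Tuple


def read_feature_list(handle: TextIO) -> List[str]:
    """Return the feature names from a one-column CSV without a header."""
    names = []
    for row in csv.reader(handle):
        # Skip blank lines and rows whose first cell is empty.
        if row and row[0].strip():
            names.append(row[0].strip())
    return names


def split_known_unknown(
    names: Iterable[str], columns: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Separate names that exist as dataset columns from those that do not."""
    column_set = set(columns)
    known, unknown = [], []
    for name in names:
        (known if name in column_set else unknown).append(name)
    return known, unknown


# In-memory stand-in for a file listing features to exclude (assumed content).
exclude_file = io.StringIO("protein_A\nprotein_B\n\nprotein_X\n")

# Hypothetical column names of the uploaded dataset.
dataset_columns = ["protein_A", "protein_B", "protein_C", "_age"]

names = read_feature_list(exclude_file)
known, unknown = split_known_unknown(names, dataset_columns)
print(known)    # ['protein_A', 'protein_B']
print(unknown)  # ['protein_X']
```

With a real file, `open("exclude.csv", newline="")` would take the place of the `io.StringIO` object; the file name is only an example. Names reported as unknown usually point to a typo or to a column that is missing from the dataset.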
![runninginfo](images/runninginfo.png)

_**Figure 10:** Running info_

### Analysis results and plots

Once the analysis is completed, OmicLearn automatically generates the plots together with a table showing the results of each validation run. The plots can be downloaded in `.pdf` and `.svg` format, in addition to the `.png` format provided by Plotly.

![FeatAtt_Chart](images/feature_importance.png)

![FeatAtt_Table](images/feature_importance_table.png)

_**Figure 11:** Bar chart of the feature importance values received from the classifier after all cross-validation runs, the corresponding table containing links to NCBI search, and download options_

![ROC Curve](images/roc_curve.png)

![PR Curve](images/pr_curve.png)

_**Figure 12:** Receiver operating characteristic (ROC) Curve, Precision-Recall (PR) Curve and download options_

![CONF-MATRIX](images/confusion.png)

_**Figure 13:** Confusion matrix, slider for looking at the other matrix tables and download options_

OmicLearn generates a `Summary` that describes the method. This text can be used for the methods section of a publication.

![Results table](images/summary.png)

![summary text](images/summary_text.png)

_**Figure 14:** Results table of the analysis, its download option, and auto-generated `Summary` text_

### Checking the Session History

Each analysis run is appended to the `Session History` so that you can compare the results obtained with different parameter sets.

![session](images/session_history.png)

_**Figure 15:** Session history table and download option_