Comp Bio – Data

Project Data as of: Jul 8, 2024

Be aware that we make an effort to put all data we are using on the project Google Drive: https://drive.google.com/drive/folders/1aYzIjcohDQVnnBnSZYuYsjoPOLVO_dvN?usp=drive_link
Mostly in Google Drive\2022 Hack4NF\Project Files\Data\Challenge #3

Note that we worked with two datasets: the older one is MIPE3.0, the newer one is DepMap. I’ll answer your questions separately for each dataset:

DEPMAP
Reproducing data collection: go to: https://depmap.org/portal/data_page/?tab=allData -> Click the drop down for “Select a file set to view:” -> Find the “Drug Screens” section. -> Look at PRISM Repurposing 19Q4

Note that other files in that list, like SANGER, are things for us to look at as well – and are what the high school team is working on.

 

  1. The table with the cell lines – medication that has the dose numbers:

  • For our calculations we did not start with raw data. Raw data is not always published, and I’m unsure if we have it for DepMap (we do for MIPE3.0 for sure).
    • This might be: 2022 Hack4NF\Project Files\Data\Challenge #3\streamlit data reference\data\DepMap\Prism19Q4\secondary-screen-logfold-change.csv – but we have not spent time parsing this file, and it’s not immediately obvious how to get the dose numbers out of it

 

  2. The table with the cell lines – medication that has the computed values like AUC and low/high number – I think Paul and I plugged it into a database
  • \2022 Hack4NF\Project Files\Data\Challenge #3\streamlit data reference\data\DepMap\Prism19Q4\secondary-screen-dose-response-curve-parameters.csv

 

  3. The table where rows are medications and columns are cell lines, with values 0, 1, or 2 that show mutations.
  4. A table for approved medication. This I have not seen and don’t know if it will have all the things we need – yet I need to look at it
  • No table exists, but we have an XML file that represents what you see on drugbank.org, under a limited-use license (don’t share this file):
  • 2022 Hack4NF\Project Files\Data\Challenge #3\DrugBank (non-commercial)\drugbank_df.pkl
    • This is a dataframe of DrugBank stored as a Python pickle file. Fundamentally, DrugBank gives each drug an ID, and each “drug” carries a status such as approved, investigational, or experimental – where investigational drugs are in clinical trials, and experimental compounds are still at the preclinical research stage. In a past investigation, only 33% of the compounds in the MIPE3.0 dataset also occurred in DrugBank.
  5. The table that connects cell lines to a specific cancer – I know it’s in one of the columns, yet I cannot understand those and need something more substantial
  • This is also in the “secondary-screen-dose-response-curve-parameters.csv” file – note that the third column is: ccle_name
    • It is a composite value that looks like “CAL27_UPPER_AERODIGESTIVE_TRACT”, which we split at the first underscore. To the left of the first underscore is the cell line name (different from the DepMap cell line IDs, which look like ACH-000832). To the right is the tissue sample area, which we are using for ‘cancer type’: UPPER_AERODIGESTIVE_TRACT
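
The ccle_name split described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the project: the function name split_ccle_name is hypothetical, and returning an empty tissue string when no underscore is present is an assumption made here.

```python
from typing import Tuple


def split_ccle_name(ccle_name: str) -> Tuple[str, str]:
    """Split a composite ccle_name at the FIRST underscore.

    Returns (cell_line_name, tissue). If there is no underscore, the
    tissue part is an empty string (an assumption of this sketch).
    """
    cell_line, _sep, tissue = ccle_name.partition("_")
    return cell_line, tissue


cell_line, tissue = split_ccle_name("CAL27_UPPER_AERODIGESTIVE_TRACT")
print(cell_line)  # CAL27
print(tissue)     # UPPER_AERODIGESTIVE_TRACT
```

Using partition instead of a plain split keeps every later underscore inside the tissue name, which matters for multi-word tissues like the one above.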

Also be aware that we have a manually curated ontology of “GENES” that places a compound’s GENE target into a group such as “MEK Inhibitor”, which is useful for looking at compounds by group (by their gene ‘targets’): 2022 Hack4NF\Project Files\Data\Challenge #3\streamlit data reference\data\ManualOntology\Manual_ontology.csv

  • This data set was manually curated by Paul, and has full coverage for genes in MIPE3.0 – and we are actively evaluating how much coverage it has in DepMap (likely not enough).
  • We are interested in identifying and leveraging new ontologies
    • Active research is needed (@pyousefi@gmail.com @’Sam Keating’) into what datasets we can work with to bring in pathways or GENE/disease relationships – where a GENE is targeted by a compound through a mechanism of action useful to a disease.
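
One way to put a number on the ontology coverage question above is to compare the set of gene targets in a dataset against the set of genes in the ontology. The sketch below is only an illustration: the function names are hypothetical, it does not read Manual_ontology.csv (whose column names are not given here), and the gene symbols in the example are placeholders, not project data.

```python
from typing import Iterable, List, Set, Tuple


def _normalize(genes: Iterable[str]) -> Set[str]:
    """Upper-case and strip gene symbols, dropping empty entries."""
    return {g.strip().upper() for g in genes if g and g.strip()}


def ontology_coverage(
    dataset_genes: Iterable[str], ontology_genes: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (fraction of dataset genes found in the ontology, missing genes)."""
    dataset = _normalize(dataset_genes)
    ontology = _normalize(ontology_genes)
    if not dataset:
        return 0.0, []
    missing = sorted(dataset - ontology)
    covered = len(dataset) - len(missing)
    return covered / len(dataset), missing


# Illustrative gene symbols only; not taken from the project files.
fraction, missing = ontology_coverage(
    ["MAP2K1", "MAP2K2", "EGFR", "MTOR"],
    ["map2k1", "MAP2K2", "BRAF"],
)
print(f"{fraction:.0%} covered; missing: {missing}")
# 50% covered; missing: ['EGFR', 'MTOR']
```

The list of missing genes is returned alongside the fraction so that whoever extends the ontology can see exactly which targets still need a group.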

MIPE3.0

Reproducing data collection: See the section “Get the data” on the GitHub – https://github.com/mocomakers/nf_streamlit – it involves the NF Data Portal – https://www.synapse.org/#!Synapse:syn5522627

  1. The table with the cell lines – medication that has the dose numbers:
  • 2022 Hack4NF\Project Files\Data\Challenge #3\streamlit data reference\data\syn5522627\ [VARIOUS Cell lines data by cell line name]qhts.csv

 

  2. The table with the cell lines – medication that has the computed values like AUC and low/high number – I think Paul and I plugged it into a database
  • These are the same files as above. Note that we use the LAC50, INF, and ZERO values
    • We ‘unlog’ the Log AC50 values before using them. The concentration was discovered to already be in micromolar units – so we factored that in.
    • Note that activity values on the Y axis of the dose response curve, like INF, are % response above a DMSO control (that is, light intensity as a percent above a well containing liquid only)
  3. The table where rows are medications and columns are cell lines, with values 0, 1, or 2 that show mutations.
  • To our knowledge, this does not exist for the 5 cell lines used in the paper
  4. A table for approved medication. This I have not seen and don’t know if it will have all the things we need – yet I need to look at it
  • See the DrugBank data above
  5. The table that connects cell lines to a specific cancer – I know it’s in one of the columns, yet I cannot understand those and need something more substantial
  • This style of evaluation was not historically done – but all cells are for ‘plexiform neurofibromas’, where the reference was peripheral nerve tissue and all four test lines were Schwann cell tumors (Schwann cells produce myelin).
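
The ‘unlog’ step mentioned under item 2 above can be sketched as follows. This is a hedged illustration, not the project’s code: the function name unlog_ac50 is hypothetical, the logarithm is assumed to be base 10, and “already in micromolar” is read here as meaning no molar-to-micromolar scaling is applied after unlogging.

```python
def unlog_ac50(lac50: float) -> float:
    """Convert a Log AC50 value back to a linear concentration.

    Assumptions of this sketch: the log is base 10, and the result is
    already in micromolar, so no further unit conversion is applied.
    """
    return 10.0 ** lac50


print(unlog_ac50(0.0))  # 1.0   (1 micromolar under the assumptions above)
print(unlog_ac50(2.0))  # 100.0 (100 micromolar under the assumptions above)
```

If the values had instead been log10 of molar concentrations, the result would need to be multiplied by 1e6 to reach micromolar, which is the kind of unit check the note above refers to.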

 

See also the three videos in our new ‘spin-up’ guide for new team members, which cover how we ‘used’ the data: https://www.mocomakers.com/wiki/comp-bio/ – video 2 focuses on MIPE3.0 and delta S, video 3 focuses on DepMap and delta S prime (ΔS’)