Cell line: depmap, 2024-Q2

!lamin load laminlabs/bionty-assets
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!

if you see this message repeatedly, run: import bionty; bionty.base.reset_sources()
💡 connected lamindb: laminlabs/bionty-assets
import lamindb as ln
import bionty as bt
import pandas as pd

ln.context.uid = "GOgp5sRkbin90000"
ln.track()

new_ontology = ln.ULabel.filter(name="new_ontology").one()
ln.context.run.transform.ulabels.add(new_ontology)
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!

if you see this message repeatedly, run: import bionty; bionty.base.reset_sources()
💡 connected lamindb: laminlabs/bionty-assets
💡 notebook imports: bionty==0.48.2 lamindb==0.76.0 pandas==2.2.2
WARNING: Skipping /home/zeth/miniconda3/envs/lamindb/lib/python3.11/site-packages/jupyterlab_widgets-3.0.10.dist-info due to invalid metadata entry 'name'

💡 loaded Transform('GOgp5sRkbin90000') & loaded Run('2024-08-21 12:49:45.216289+00:00')

Curate source

We downloaded the model.csv file of the DepMap 24Q2 release from https://depmap.org/portal/data_page/?tab=allData and saved it locally as depmap_q2_model.csv.

depmap_df = pd.read_csv("depmap_q2_model.csv", sep=",")
depmap_df.head(3)
  | ModelID | PatientID | CellLineName | StrippedCellLineName | DepmapModelType | OncotreeLineage | OncotreePrimaryDisease | OncotreeSubtype | OncotreeCode | LegacyMolecularSubtype | ... | EngineeredModel | TissueOrigin | ModelDerivationMaterial | PublicComments | CCLEName | HCMIID | WTSIMasterCellID | SangerModelID | COSMICID | DateSharedIndbGaP
0 | ACH-000001 | PT-gj46wT | NIH:OVCAR-3 | NIHOVCAR3 | HGSOC | Ovary/Fallopian Tube | Ovarian Epithelial Tumor | High-Grade Serous Ovarian Cancer | HGSOC | NaN | ... | NaN | NaN | NaN | NaN | NIHOVCAR3_OVARY | NaN | 2201.0 | SIDM00105 | 905933.0 | NaN
1 | ACH-000002 | PT-5qa3uk | HL-60 | HL60 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | NaN | ... | NaN | NaN | NaN | NaN | HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | NaN | 55.0 | SIDM00829 | 905938.0 | NaN
2 | ACH-000003 | PT-puKIyc | CACO2 | CACO2 | COAD | Bowel | Colorectal Adenocarcinoma | Colon Adenocarcinoma | COAD | NaN | ... | NaN | NaN | NaN | NaN | CACO2_LARGE_INTESTINE | NaN | NaN | SIDM00891 | NaN | NaN

3 rows × 43 columns

completeness_per_column = depmap_df.notna().mean() * 100
completeness_per_column
ModelID                     100.000000
PatientID                   100.000000
CellLineName                100.000000
StrippedCellLineName        100.000000
DepmapModelType             100.000000
OncotreeLineage              99.744768
OncotreePrimaryDisease      100.000000
OncotreeSubtype             100.000000
OncotreeCode                 92.802450
LegacyMolecularSubtype        7.708014
LegacySubSubtype             42.419602
PatientMolecularSubtype       7.044410
RRID                         96.324655
Age                          79.428280
AgeCategory                 100.000000
Sex                          98.723839
PatientRace                  32.363451
PrimaryOrMetastasis          83.665135
SampleCollectionSite         99.744768
SourceType                   97.498724
SourceDetail                 91.475242
CatalogNumber                52.935171
PatientTreatmentStatus        3.215926
PatientTreatmentType          0.153139
PatientTreatmentDetails       0.153139
Stage                         0.357325
StagingSystem                 0.000000
PatientTumorGrade             0.000000
PatientTreatmentResponse      0.204186
GrowthPattern               100.000000
OnboardedMedia               78.611536
FormulationID                78.611536
PlateCoating                  0.000000
EngineeredModel               0.714650
TissueOrigin                  0.000000
ModelDerivationMaterial       0.102093
PublicComments                4.338948
CCLEName                     97.090352
HCMIID                        0.612557
WTSIMasterCellID             49.821337
SangerModelID                62.072486
COSMICID                     49.872384
DateSharedIndbGaP             0.000000
dtype: float64
# Drop all columns with less than 70% completeness
depmap_df = depmap_df.loc[:, completeness_per_column >= 70]
depmap_df
  | ModelID | PatientID | CellLineName | StrippedCellLineName | DepmapModelType | OncotreeLineage | OncotreePrimaryDisease | OncotreeSubtype | OncotreeCode | RRID | ... | AgeCategory | Sex | PrimaryOrMetastasis | SampleCollectionSite | SourceType | SourceDetail | GrowthPattern | OnboardedMedia | FormulationID | CCLEName
0 | ACH-000001 | PT-gj46wT | NIH:OVCAR-3 | NIHOVCAR3 | HGSOC | Ovary/Fallopian Tube | Ovarian Epithelial Tumor | High-Grade Serous Ovarian Cancer | HGSOC | CVCL_0465 | ... | Adult | Female | Metastatic | ascites | ATCC | ATCC | Adherent | MF-001-041 | RPMI + 20% FBS + 0.01 mg/ml insulin | NIHOVCAR3_OVARY
1 | ACH-000002 | PT-5qa3uk | HL-60 | HL60 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_0002 | ... | Adult | Female | Primary | haematopoietic_and_lymphoid_tissue | ATCC | ATCC | Suspension | MF-005-001 | IMDM + 10% FBS | HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
2 | ACH-000003 | PT-puKIyc | CACO2 | CACO2 | COAD | Bowel | Colorectal Adenocarcinoma | Colon Adenocarcinoma | COAD | CVCL_0025 | ... | Adult | Male | Primary | Colon | ATCC | ATCC | Adherent | MF-015-009 | EMEM + 20% FBS | CACO2_LARGE_INTESTINE
3 | ACH-000004 | PT-q4K2cp | HEL | HEL | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_0001 | ... | Adult | Male | Primary | haematopoietic_and_lymphoid_tissue | DSMZ | DSMZ | Suspension | MF-001-001 | RPMI + 10% FBS | HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
4 | ACH-000005 | PT-q4K2cp | HEL 92.1.7 | HEL9217 | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_2481 | ... | Adult | Male | NaN | bone_marrow | ATCC | ATCC | Mixed | MF-001-001 | RPMI + 10% FBS | HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1954 | ACH-003161 | PT-or1hkT | ABM-T9430 | ABMT9430 | ZIMMPSC | Pancreas | Non-Cancerous | Immortalized Pancreatic Stromal Cells | NaN | NaN | ... | Unknown | NaN | NaN | pancreas | ABM | ABM | Adherent | MF-043-001 | PriGrow I (TM001) + 25 μg/ml BPE + 0.15 ng/ml ... | NaN
1955 | ACH-003181 | PT-W75e4m | NRH-LMS1 | NRHLMS1 | LMS | Soft Tissue | Leiomyosarcoma | Leiomyosarcoma | LMS | NaN | ... | Adult | Female | Metastatic | soft_tissue | Academic lab | Oslo University Hospital-The Norwegian Radium ... | Mixed | MF-001-014 | RPMI + 5% FBS | NRH-LMS1
1956 | ACH-003183 | PT-BqidXH | NRH-MFS3 | NRHMFS3 | MFS | Soft Tissue | Myxofibrosarcoma | Myxofibrosarcoma | MFS | NaN | ... | Adult | Male | Primary | soft_tissue | Academic lab | Oslo University Hospital-The Norwegian Radium ... | Mixed | MF-001-014 | RPMI + 5% FBS | NRH-MFS3
1957 | ACH-003184 | PT-21NMVa | NRH-LMS2 | NRHLMS2 | LMS | Soft Tissue | Leiomyosarcoma | Leiomyosarcoma | LMS | NaN | ... | Adult | Female | Primary | soft_tissue | Academic lab | Oslo University Hospital-The Norwegian Radium ... | Mixed | MF-001-014 | RPMI + 5% FBS | NRH-LMS2
1958 | ACH-003191 | PT-B8KJKw | NRH-GCT2 | NRHGCT2 | GCTB | Bone | Giant Cell Tumor of Bone | Giant Cell Tumor of Bone | GCTB | NaN | ... | Adult | Male | Primary | soft_tissue | Academic lab | Oslo University Hospital-The Norwegian Radium ... | Mixed | MF-001-014 | RPMI + 5% FBS | NRH-GCT2

1959 rows × 21 columns
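
The completeness filter used above can be tried in isolation on a small frame. The following is a minimal sketch, not part of the original notebook: the helper name drop_sparse_columns, its min_percent parameter, and the toy values are illustrative assumptions, while the notna().mean() logic and the 70% threshold mirror the cell above.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, min_percent: float = 70.0) -> pd.DataFrame:
    """Keep only columns whose share of non-missing values is >= min_percent.

    Hypothetical helper: it wraps the same notna().mean() logic as the notebook cell.
    """
    completeness = df.notna().mean() * 100  # percent of non-missing values per column
    return df.loc[:, completeness >= min_percent]


# Toy data (made up for illustration, not taken from DepMap)
toy = pd.DataFrame(
    {
        "ModelID": ["ACH-1", "ACH-2", "ACH-3", "ACH-4"],  # 100% complete
        "Sex": ["Female", "Male", np.nan, "Male"],  # 75% complete
        "Stage": [np.nan, np.nan, np.nan, "IV"],  # 25% complete
    }
)

filtered = drop_sparse_columns(toy)
print(list(filtered.columns))
```

Only columns are removed here; the row count stays the same, which matches the 1959 rows reported before and after the filter.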

# Unfortunately, there is no reasonable 'definitions' column
depmap_df = depmap_df.rename(columns={"ModelID": "ontology_id", "CellLineName": "name"})
# The table carries no hierarchy, so every record gets an empty parents list
depmap_df["parents"] = "[]"
# Note: string concatenation propagates NaN, so synonyms is NaN wherever CCLEName is missing
depmap_df["synonyms"] = depmap_df["StrippedCellLineName"] + "|" + depmap_df["CCLEName"]
depmap_df = depmap_df.drop(["StrippedCellLineName", "CCLEName"], axis=1)
# Move name and synonyms to the front
cols = ["name", "synonyms"] + [col for col in depmap_df.columns if col not in ["name", "synonyms"]]
depmap_df = depmap_df[cols]
depmap_df = depmap_df.set_index("ontology_id")
# Drop the patient- and source-level columns
depmap_df = depmap_df.drop(
    ["PatientID", "Age", "AgeCategory", "Sex", "SourceType", "SourceDetail"], axis=1
)
depmap_df
ontology_id | name | synonyms | DepmapModelType | OncotreeLineage | OncotreePrimaryDisease | OncotreeSubtype | OncotreeCode | RRID | PrimaryOrMetastasis | SampleCollectionSite | GrowthPattern | OnboardedMedia | FormulationID | parents
ACH-000001 | NIH:OVCAR-3 | NIHOVCAR3|NIHOVCAR3_OVARY | HGSOC | Ovary/Fallopian Tube | Ovarian Epithelial Tumor | High-Grade Serous Ovarian Cancer | HGSOC | CVCL_0465 | Metastatic | ascites | Adherent | MF-001-041 | RPMI + 20% FBS + 0.01 mg/ml insulin | []
ACH-000002 | HL-60 | HL60|HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_0002 | Primary | haematopoietic_and_lymphoid_tissue | Suspension | MF-005-001 | IMDM + 10% FBS | []
ACH-000003 | CACO2 | CACO2|CACO2_LARGE_INTESTINE | COAD | Bowel | Colorectal Adenocarcinoma | Colon Adenocarcinoma | COAD | CVCL_0025 | Primary | Colon | Adherent | MF-015-009 | EMEM + 20% FBS | []
ACH-000004 | HEL | HEL|HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_0001 | Primary | haematopoietic_and_lymphoid_tissue | Suspension | MF-001-001 | RPMI + 10% FBS | []
ACH-000005 | HEL 92.1.7 | HEL9217|HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TI... | AML | Myeloid | Acute Myeloid Leukemia | Acute Myeloid Leukemia | AML | CVCL_2481 | NaN | bone_marrow | Mixed | MF-001-001 | RPMI + 10% FBS | []
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
ACH-003161 | ABM-T9430 | NaN | ZIMMPSC | Pancreas | Non-Cancerous | Immortalized Pancreatic Stromal Cells | NaN | NaN | NaN | pancreas | Adherent | MF-043-001 | PriGrow I (TM001) + 25 μg/ml BPE + 0.15 ng/ml ... | []
ACH-003181 | NRH-LMS1 | NRHLMS1|NRH-LMS1 | LMS | Soft Tissue | Leiomyosarcoma | Leiomyosarcoma | LMS | NaN | Metastatic | soft_tissue | Mixed | MF-001-014 | RPMI + 5% FBS | []
ACH-003183 | NRH-MFS3 | NRHMFS3|NRH-MFS3 | MFS | Soft Tissue | Myxofibrosarcoma | Myxofibrosarcoma | MFS | NaN | Primary | soft_tissue | Mixed | MF-001-014 | RPMI + 5% FBS | []
ACH-003184 | NRH-LMS2 | NRHLMS2|NRH-LMS2 | LMS | Soft Tissue | Leiomyosarcoma | Leiomyosarcoma | LMS | NaN | Primary | soft_tissue | Mixed | MF-001-014 | RPMI + 5% FBS | []
ACH-003191 | NRH-GCT2 | NRHGCT2|NRH-GCT2 | GCTB | Bone | Giant Cell Tumor of Bone | Giant Cell Tumor of Bone | GCTB | NaN | Primary | soft_tissue | Mixed | MF-001-014 | RPMI + 5% FBS | []

1959 rows × 14 columns
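
The row for ACH-003161 above has NaN in synonyms even though its StrippedCellLineName (ABMT9430) is known, because adding a string to a missing CCLEName yields NaN. If keeping the stripped name in such cases is desired, a NaN-safe join is one option. The sketch below is an assumption about how that could look, not what the notebook did; join_synonyms is a hypothetical helper, and the three toy rows reuse values shown in the tables above.

```python
import numpy as np
import pandas as pd


def join_synonyms(values):
    """Join the non-missing values with '|'; return NaN if none are present.

    Hypothetical helper, not part of bionty or lamindb.
    """
    present = [v for v in values if not pd.isna(v)]
    return "|".join(present) if present else np.nan


toy = pd.DataFrame(
    {
        "name": ["HL-60", "ABM-T9430", "NRH-LMS1"],
        "StrippedCellLineName": ["HL60", "ABMT9430", "NRHLMS1"],
        "CCLEName": ["HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE", np.nan, "NRH-LMS1"],
    }
)

# Plain concatenation, as in the notebook: NaN in either operand gives NaN
naive = toy["StrippedCellLineName"] + "|" + toy["CCLEName"]

# NaN-safe variant: rows with a missing CCLEName keep the stripped name
safe = toy.apply(
    lambda row: join_synonyms([row["StrippedCellLineName"], row["CCLEName"]]),
    axis=1,
)

print(naive.tolist())
print(safe.tolist())
```

Rows where both columns are present are identical under both approaches, so switching would only affect the records whose CCLEName is missing.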

depmap_df.to_parquet("df_all__depmap__2024-Q2__CellLine.parquet")
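
Before a table like this is written and registered, a few structural checks can catch obvious problems. The following is a rough sketch under stated assumptions: find_table_problems and EXPECTED_COLUMNS are hypothetical names, and the expected layout (an index called ontology_id plus name, synonyms and parents columns) is inferred from the curation cell above rather than from any documented bionty requirement.

```python
import pandas as pd

# Assumption: these are the columns the curation step above is meant to produce
EXPECTED_COLUMNS = ("name", "synonyms", "parents")


def find_table_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if df.index.name != "ontology_id":
        problems.append("index is not named 'ontology_id'")
    if not df.index.is_unique:
        problems.append("duplicate ontology IDs")
    for col in EXPECTED_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing column '{col}'")
    if "name" in df.columns and df["name"].isna().any():
        problems.append("missing names")
    return problems


# Two rows copied from the table above
good = pd.DataFrame(
    {
        "name": ["NIH:OVCAR-3", "HL-60"],
        "synonyms": ["NIHOVCAR3|NIHOVCAR3_OVARY", "HL60|HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE"],
        "parents": ["[]", "[]"],
    },
    index=pd.Index(["ACH-000001", "ACH-000002"], name="ontology_id"),
)

# A deliberately broken copy: duplicated ID and no parents column
bad = good.drop(columns="parents")
bad.index = pd.Index(["ACH-000001", "ACH-000001"], name="ontology_id")

print(find_table_problems(good))
print(find_table_problems(bad))
```

Missing synonyms are deliberately not flagged, since the curated table above legitimately contains NaN in that column.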

Register in laminlabs/bionty-assets

from bionty.core._bionty import register_source_in_bionty_assets
source_record = bt.Source.filter(name="depmap", organism="all", version="2024-Q2", entity="bionty.CellLine").one()
register_source_in_bionty_assets(filepath="df_all__depmap__2024-Q2__CellLine.parquet", source=source_record)
... uploading df_all__depmap__2024-Q2__CellLine.parquet: 100.0%
registered Source(uid='2zHO', entity='bionty.CellLine', organism='all', name='depmap', version='2024-Q2', in_db=False, currently_used=True, description='Dependency Map', url='s3://bionty-assets/df_all__depmap__2024-Q2__CellLine.parquet', md5='', source_website='https://depmap.org/portal/', created_by_id=2, dataframe_artifact_id=180, updated_at='2024-08-21 13:28:33 UTC') with dataframe Artifact(uid='ImWoMC3V3jU2WFCztblE', is_latest=True, key='df_all__depmap__2024-Q2__CellLine.parquet', suffix='.parquet', size=110099, hash='Ic6bd5W9ImRj0ZAbM_1zww', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=2, storage_id=1, transform_id=10, run_id=11, updated_at='2024-08-21 13:28:27 UTC')
Artifact(uid='ImWoMC3V3jU2WFCztblE', is_latest=True, key='df_all__depmap__2024-Q2__CellLine.parquet', suffix='.parquet', size=110099, hash='Ic6bd5W9ImRj0ZAbM_1zww', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=2, storage_id=1, transform_id=10, run_id=11, updated_at='2024-08-21 13:28:27 UTC')
ln.finish()
❗ cells [(9, 11), (11, 14)] were not run consecutively