Drug: chebi, 2024-07-27
!lamin load laminlabs/bionty-assets
💡 connected lamindb: laminlabs/bionty-assets
import lamindb as ln
import bionty as bt
import pandas as pd
ln.context.uid = "fQpBV2oEQUFi0000"
ln.track()
new_ontology = ln.ULabel.filter(name="new_ontology").one()
ln.context.run.transform.ulabels.add(new_ontology)
💡 connected lamindb: laminlabs/bionty-assets
💡 notebook imports: bionty==0.48.2 lamindb==0.76.0 pandas==2.2.2
WARNING: Skipping /home/zeth/miniconda3/envs/lamindb/lib/python3.11/site-packages/jupyterlab_widgets-3.0.10.dist-info due to invalid metadata entry 'name'
💡 loaded Transform('fQpBV2oEQUFi0000') & created Run('2024-08-20 10:05:14.415182+00:00')
Curate source
The ChEBI OWL file contains only ChEBI IDs. UniChem provides mappings between ChEBI and ChEMBL, which we add to the ChEBI DataFrame as an extra column. The source listing at https://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/table_dumps/ shows that source 1 corresponds to ChEMBL and source 7 to ChEBI. We therefore take the src1-to-src7 mapping from https://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/data/wholeSourceMapping/.
# The parquet file was obtained by loading http://purl.obolibrary.org/obo/chebi/236/chebi.owl with Bionty Drug
drug_df = pd.read_parquet("chebi_2024-07-27.parquet")
drug_df.head()
ontology_id | name | definition | synonyms | parents |
---|---|---|---|---|
CHEBI:10 | (+)-Atherospermoline | None | (+)-Atherospermoline | [CHEBI:133004] |
CHEBI:100 | (-)-medicarpin | The (-)-Enantiomer Of Medicarpin. | (-)-Medicarpin|(-)-medicarpin|(6aR,11aR)-9-met... | [CHEBI:16114] |
CHEBI:10000 | Vismione D | None | Vismione D | [CHEBI:46955] |
CHEBI:100000 | (2S,3S,4R)-3-[4-(3-cyclopentylprop-1-ynyl)phen... | None | None | [CHEBI:36820, CHEBI:22712, CHEBI:38777] |
CHEBI:100001 | N-[(2R,3S,6R)-2-(hydroxymethyl)-6-[2-[[oxo-[4-... | None | None | [CHEBI:20857] |
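def read_mapping_file(file_path: str) -> dict[str, str]:
    """Map ChEBI ontology IDs to ChEMBL IDs from a src1-to-src7 mapping file."""
    chembl_dict = {}
    with open(file_path, "r") as file:
        next(file)  # skip the header line
        for line in file:
            # each row holds the ChEMBL ID (src1) followed by the ChEBI number (src7)
            fromsrc1, tosrc7 = line.strip().split()
            chembl_dict[f"CHEBI:{tosrc7}"] = fromsrc1
    return chembl_dict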
src_mapping = read_mapping_file("src1src7.txt")
first_key = next(iter(src_mapping))
print(f"First element of the mapping: {first_key}: {src_mapping[first_key]}")
First element of the mapping: CHEBI:16273: CHEMBL46810
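Because read_mapping_file stores one value per key, a ChEBI ID that appears on several rows would silently keep only the last ChEMBL ID. Whether src1src7.txt contains such rows is not checked in this notebook. The following is a minimal sketch of how one might look for them, assuming the same layout as above (one header line, then whitespace-separated ChEMBL ID and ChEBI number). The helper name `collect_mappings` and the toy rows are hypothetical, and unlike the original function this sketch skips malformed rows instead of raising.

```python
import io
from collections import defaultdict


def collect_mappings(lines) -> dict[str, list[str]]:
    """Group all ChEMBL IDs by ChEBI ID instead of keeping only the last one."""
    grouped = defaultdict(list)
    rows = iter(lines)
    next(rows, None)  # skip the header line, as read_mapping_file does
    for line in rows:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed rows in this sketch
        chembl_id, chebi_number = parts
        grouped[f"CHEBI:{chebi_number}"].append(chembl_id)
    return dict(grouped)


# Made-up rows for illustration; these are not taken from src1src7.txt.
toy_file = io.StringIO(
    "from_src1\tto_src7\n"
    "CHEMBL_A\t1\n"
    "CHEMBL_B\t2\n"
    "CHEMBL_C\t2\n"
)

grouped = collect_mappings(toy_file)
# Keep only the ChEBI IDs that map to more than one ChEMBL ID.
ambiguous = {key: ids for key, ids in grouped.items() if len(ids) > 1}
print(ambiguous)
```

Passing an open file handle for the real mapping file instead of `toy_file` would apply the same check to the full data.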
drug_df["chembl_id"] = drug_df.index.map(src_mapping.get)
drug_df
ontology_id | name | definition | synonyms | parents | chembl_id |
---|---|---|---|---|---|
CHEBI:10 | (+)-Atherospermoline | None | (+)-Atherospermoline | [CHEBI:133004] | CHEMBL500609 |
CHEBI:100 | (-)-medicarpin | The (-)-Enantiomer Of Medicarpin. | (-)-Medicarpin|(-)-medicarpin|(6aR,11aR)-9-met... | [CHEBI:16114] | CHEMBL238845 |
CHEBI:10000 | Vismione D | None | Vismione D | [CHEBI:46955] | CHEMBL487795 |
CHEBI:100000 | (2S,3S,4R)-3-[4-(3-cyclopentylprop-1-ynyl)phen... | None | None | [CHEBI:36820, CHEBI:22712, CHEBI:38777] | None |
CHEBI:100001 | N-[(2R,3S,6R)-2-(hydroxymethyl)-6-[2-[[oxo-[4-... | None | None | [CHEBI:20857] | None |
... | ... | ... | ... | ... | ... |
CHEBI:99995 | 2-[(2S,4aS,12aS)-5-methyl-6-oxo-8-[(1-oxo-2-ph... | None | None | [CHEBI:22160] | None |
CHEBI:99996 | N-[(1S,3S,4aR,9aS)-3-[2-[(2,5-difluorophenyl)m... | None | None | [CHEBI:74927] | None |
CHEBI:99997 | N-[(2S,4aS,12aS)-2-[2-(cyclohexylmethylamino)-... | None | None | [CHEBI:17792, CHEBI:36586] | None |
CHEBI:99998 | N-[[(3S,9S,10R)-16-(dimethylamino)-12-[(2S)-1-... | None | None | [CHEBI:52898, CHEBI:24995] | CHEMBL1903737 |
CHEBI:99999 | N-[(5S,6S,9S)-5-methoxy-3,6,9-trimethyl-2-oxo-... | None | None | [CHEBI:52898, CHEBI:24995] | None |
200981 rows × 5 columns
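Many rows end up with chembl_id set to None, because `dict.get` returns None for ChEBI IDs that are absent from the mapping. A quick way to quantify this is to count the non-missing values. The sketch below uses a made-up four-row DataFrame and a hypothetical `toy_mapping` dict in place of drug_df and src_mapping; it only illustrates the pattern and does not reproduce numbers from the real data.

```python
import pandas as pd

# Toy stand-ins for drug_df and src_mapping (hypothetical values).
toy_df = pd.DataFrame(
    {"name": ["a", "b", "c", "d"]},
    index=pd.Index(["CHEBI:1", "CHEBI:2", "CHEBI:3", "CHEBI:4"], name="ontology_id"),
)
toy_mapping = {"CHEBI:1": "CHEMBL_A", "CHEBI:3": "CHEMBL_C"}

# Same pattern as in the notebook: unmapped IDs become missing values.
toy_df["chembl_id"] = toy_df.index.map(toy_mapping.get)

# Count mapped rows and express them as a share of all rows.
n_mapped = int(toy_df["chembl_id"].notna().sum())
coverage = n_mapped / len(toy_df)
print(f"{n_mapped} of {len(toy_df)} entries mapped ({coverage:.1%})")
```

Running the last three lines against drug_df would report how many of the ChEBI entries received a ChEMBL ID.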
drug_df.to_parquet("df_all__chebi__2024-07-27__Drug.parquet")
Register in laminlabs/bionty-assets
from bionty.core._bionty import register_source_in_bionty_assets
source_record = bt.Source.filter(name="chebi", organism="all", version="2024-07-27", entity="Drug").one()
register_source_in_bionty_assets(filepath="df_all__chebi__2024-07-27__Drug.parquet", source=source_record)
... uploading df_all__chebi__2024-07-27__Drug.parquet: 100.0%
registered Source(uid='1atB', entity='Drug', organism='all', name='chebi', version='2024-07-27', in_db=False, currently_used=False, description='', url='http://purl.obolibrary.org/obo/chebi/236/chebi.owl', md5='', source_website='', created_by_id=3, dataframe_artifact_id=176, updated_at='2024-08-20 10:05:33 UTC') with dataframe Artifact(uid='FeIg71WrUn9HBeS1VbtA', is_latest=True, key='df_all__chebi__2024-07-27__Drug.parquet', suffix='.parquet', size=13901923, hash='0MdXAAAHwLqglrfW55lEhw', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=2, storage_id=1, transform_id=9, run_id=10, updated_at='2024-08-20 10:05:22 UTC')
Artifact(uid='FeIg71WrUn9HBeS1VbtA', is_latest=True, key='df_all__chebi__2024-07-27__Drug.parquet', suffix='.parquet', size=13901923, hash='0MdXAAAHwLqglrfW55lEhw', _hash_type='md5', visibility=1, _key_is_virtual=False, created_by_id=2, storage_id=1, transform_id=9, run_id=10, updated_at='2024-08-20 10:05:22 UTC')
ln.finish()
✅ cell execution numbers increase consecutively
💡 go to: https://lamin.ai/laminlabs/bionty-assets/transform/fQpBV2oEQUFi0000
💡 if you want to update your notebook without re-running it, use `lamin save notebook.ipynb`