Gene: ensembl, release-107

!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
import lamindb as ln
import pandas as pd
from lnschema_bionty import id

ln.nb.header()
2022-10-26 11:38:03,833:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-26 11:38:03,834:INFO - NumExpr defaulting to 8 threads.
author: Sunny Sun (sunnyosun)
id: z2WNjjvFuzwf
version: 1
time_init: 2022-09-27 11:04
time_run: 2022-10-26 09:40
consecutive_cells: True
pypackage: lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0

Ensembl download

Every curated table has a version column whose value is Ens107, identifying the Ensembl release.

The tables were downloaded from the Ensembl BioMart database (Ensembl Genes 107) and contain the following columns for every species:

  • Gene stable ID

  • Transcript stable ID

  • Protein stable ID

  • Gene name

  • Gene Synonym

  • Gene type

  • Gene description (new column, added in v2)

  • NCBI gene (formerly Entrezgene) ID

Additional species-specific columns are present for:

  • human: HGNC ID, MIM gene accession

  • mouse: MGI ID
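The column lists above can be turned into a quick sanity check before curation. The following is a minimal sketch, not part of the original notebook: the names COMMON_COLUMNS, SPECIES_COLUMNS and missing_columns are hypothetical helpers, and the column labels are copied from the lists above.

```python
import pandas as pd

# Column labels as listed above; the constant names are hypothetical.
COMMON_COLUMNS = [
    "Gene stable ID",
    "Transcript stable ID",
    "Protein stable ID",
    "Gene name",
    "Gene Synonym",
    "Gene type",
    "Gene description",
    "NCBI gene (formerly Entrezgene) ID",
]
SPECIES_COLUMNS = {
    "human": ["HGNC ID", "MIM gene accession"],
    "mouse": ["MGI ID"],
}


def missing_columns(df: pd.DataFrame, species: str) -> list:
    """Return the expected columns that are absent from `df`."""
    expected = COMMON_COLUMNS + SPECIES_COLUMNS.get(species, [])
    return [col for col in expected if col not in df.columns]


# Example with an empty frame that lacks one human-specific column.
example = pd.DataFrame(columns=COMMON_COLUMNS + ["HGNC ID"])
print(missing_columns(example, "human"))
```

With these lists, a complete human table has 10 columns and a complete mouse table has 9.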

# Downloaded on 2022-09-27

dfs = {
    "human": "https://bionty-assets.s3.amazonaws.com/mart_export-human.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.external_synonym|hsapiens_gene_ensembl.default.feature_page.gene_biotype|hsapiens_gene_ensembl.default.feature_page.entrezgene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_id|hsapiens_gene_ensembl.default.feature_page.mim_gene_accession&FILTERS=&VISIBLEPANEL=resultspanel
    "mouse": "https://bionty-assets.s3.amazonaws.com/mart_export-mouse.txt",  # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_peptide_id|mmusculus_gene_ensembl.default.feature_page.external_gene_name|mmusculus_gene_ensembl.default.feature_page.external_synonym|mmusculus_gene_ensembl.default.feature_page.gene_biotype|mmusculus_gene_ensembl.default.feature_page.entrezgene_id|mmusculus_gene_ensembl.default.feature_page.mgi_id&FILTERS=&VISIBLEPANEL=resultspanel
}
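The BioMart URLs kept in the comments above encode the exported attributes in their ATTRIBUTES query parameter, separated by "|". As a rough sketch (the helper name biomart_attributes is hypothetical, and the example URL is an abbreviated form of the human one with only two attributes), the attribute names could be recovered like this:

```python
from urllib.parse import parse_qs, urlparse


def biomart_attributes(martview_url: str) -> list:
    """Extract the short attribute names from a martview URL."""
    query = parse_qs(urlparse(martview_url).query)
    raw = query.get("ATTRIBUTES", [""])[0]
    # Each item looks like "<dataset>.default.feature_page.<attribute>";
    # keep only the part after the last dot.
    return [item.rsplit(".", 1)[-1] for item in raw.split("|") if item]


# Abbreviated example URL (two attributes only), for illustration.
url = (
    "http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755"
    "?VIRTUALSCHEMANAME=default"
    "&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id"
    "|hsapiens_gene_ensembl.default.feature_page.external_gene_name"
    "&FILTERS=&VISIBLEPANEL=resultspanel"
)
print(biomart_attributes(url))
```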

Curate the tables

allids = []

for species, path in dfs.items():
    print(f"----------{species}----------")
    df = pd.read_csv(path, dtype=str)
    print(f"Initial shape: {df.shape}")

    # Aggregate the `Gene Synonym` column
    df_alias = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    df_alias = df_alias.groupby("Gene name").agg("|".join)
    del df["Gene Synonym"]
    df = df.drop_duplicates()
    df = pd.merge(df, df_alias, on="Gene name", how="left")

    # add the version column
    df["version"] = "Ens107"

    display(df.head())
    print(f"All ids shape: {df.shape}")

    # save all ids to a parquet file
    df.to_parquet(f"ensembl-ids-{species}.parquet")
    print(f"Saved as ensembl-ids-{species}.parquet.")

    # subset to genes only
    df = df.loc[:, ~df.columns.isin(["Transcript stable ID", "Protein stable ID"])]
    df = df.drop_duplicates()

    # assign a newly generated id to each entry
    ids = [id.gene() for _ in range(len(df))]
    df.index = ids
    df.index.name = "id"

    display(df.head())
    print(f"Final shape: {df.shape}")

    # save the gene-level table to a parquet file
    df.to_parquet(f"gene-{species}.parquet")
    print(f"Saved as gene-{species}.parquet.")

    # all ids across species
    allids += ids

# make sure ids are unique
assert len(set(allids)) == len(allids)
----------human----------
Initial shape: (620902, 10)
   Gene stable ID   Transcript stable ID  Protein stable ID  Gene name  Gene type  Gene description                                   NCBI gene (formerly Entrezgene) ID  HGNC ID    MIM gene accession  Gene Synonym       version
0  ENSG00000210049  ENST00000387314       NaN                MT-TF      Mt_tRNA    mitochondrially encoded tRNA-Phe (UUU/C) [Sour...  NaN                                 HGNC:7481  NaN                 MTTF|trnF          Ens107
1  ENSG00000211459  ENST00000389680       NaN                MT-RNR1    Mt_rRNA    mitochondrially encoded 12S rRNA [Source:HGNC ...  NaN                                 HGNC:7470  NaN                 12S|MOTS-c|MTRNR1  Ens107
2  ENSG00000210077  ENST00000387342       NaN                MT-TV      Mt_tRNA    mitochondrially encoded tRNA-Val (GUN) [Source...  NaN                                 HGNC:7500  NaN                 MTTV|trnV          Ens107
3  ENSG00000210082  ENST00000387347       NaN                MT-RNR2    Mt_rRNA    mitochondrially encoded 16S rRNA [Source:HGNC ...  NaN                                 HGNC:7471  NaN                 16S|HN|MTRNR2      Ens107
4  ENSG00000209082  ENST00000386347       NaN                MT-TL1     Mt_tRNA    mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...  NaN                                 HGNC:7490  NaN                 MTTL1|TRNL1        Ens107
All ids shape: (276652, 11)
Saved as ensembl-ids-human.parquet.
id      Gene stable ID   Gene name  Gene type  Gene description                                   NCBI gene (formerly Entrezgene) ID  HGNC ID    MIM gene accession  Gene Synonym       version
Lzl9xt  ENSG00000210049  MT-TF      Mt_tRNA    mitochondrially encoded tRNA-Phe (UUU/C) [Sour...  NaN                                 HGNC:7481  NaN                 MTTF|trnF          Ens107
ILAWa7  ENSG00000211459  MT-RNR1    Mt_rRNA    mitochondrially encoded 12S rRNA [Source:HGNC ...  NaN                                 HGNC:7470  NaN                 12S|MOTS-c|MTRNR1  Ens107
XkyeQz  ENSG00000210077  MT-TV      Mt_tRNA    mitochondrially encoded tRNA-Val (GUN) [Source...  NaN                                 HGNC:7500  NaN                 MTTV|trnV          Ens107
jDD2jW  ENSG00000210082  MT-RNR2    Mt_rRNA    mitochondrially encoded 16S rRNA [Source:HGNC ...  NaN                                 HGNC:7471  NaN                 16S|HN|MTRNR2      Ens107
J58H9b  ENSG00000209082  MT-TL1     Mt_tRNA    mitochondrially encoded tRNA-Leu (UUA/G) 1 [So...  NaN                                 HGNC:7490  NaN                 MTTL1|TRNL1        Ens107
Final shape: (68856, 9)
Saved as gene-human.parquet.
----------mouse----------
Initial shape: (296054, 9)
   Gene stable ID      Transcript stable ID  Protein stable ID  Gene name  Gene type  Gene description                                   NCBI gene (formerly Entrezgene) ID  MGI ID      Gene Synonym                                       version
0  ENSMUSG00000064336  ENSMUST00000082387    NaN                mt-Tf      Mt_tRNA    mitochondrially encoded tRNA phenylalanine [So...  NaN                                 MGI:102487  tRNA|tRNA-Phe|TrnF tRNA                            Ens107
1  ENSMUSG00000064337  ENSMUST00000082388    NaN                mt-Rnr1    Mt_rRNA    mitochondrially encoded 12S rRNA [Source:MGI S...  NaN                                 MGI:102493  12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA     Ens107
2  ENSMUSG00000064338  ENSMUST00000082389    NaN                mt-Tv      Mt_tRNA    mitochondrially encoded tRNA valine [Source:MG...  NaN                                 MGI:102472  tRNA|tRNA-Val|TrnaV tRNA                           Ens107
3  ENSMUSG00000064339  ENSMUST00000082390    NaN                mt-Rnr2    Mt_rRNA    mitochondrially encoded 16S rRNA [Source:MGI S...  NaN                                 MGI:102492  16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri...  Ens107
4  ENSMUSG00000064340  ENSMUST00000082391    NaN                mt-Tl1     Mt_tRNA    mitochondrially encoded tRNA leucine 1 [Source...  NaN                                 MGI:102482  tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA               Ens107
All ids shape: (150702, 10)
Saved as ensembl-ids-mouse.parquet.
id      Gene stable ID      Gene name  Gene type  Gene description                                   NCBI gene (formerly Entrezgene) ID  MGI ID      Gene Synonym                                       version
Epd98t  ENSMUSG00000064336  mt-Tf      Mt_tRNA    mitochondrially encoded tRNA phenylalanine [So...  NaN                                 MGI:102487  tRNA|tRNA-Phe|TrnF tRNA                            Ens107
RiOxA6  ENSMUSG00000064337  mt-Rnr1    Mt_rRNA    mitochondrially encoded 12S rRNA [Source:MGI S...  NaN                                 MGI:102493  12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA     Ens107
cMIElg  ENSMUSG00000064338  mt-Tv      Mt_tRNA    mitochondrially encoded tRNA valine [Source:MG...  NaN                                 MGI:102472  tRNA|tRNA-Val|TrnaV tRNA                           Ens107
DbiNNA  ENSMUSG00000064339  mt-Rnr2    Mt_rRNA    mitochondrially encoded 16S rRNA [Source:MGI S...  NaN                                 MGI:102492  16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri...  Ens107
NO6NBF  ENSMUSG00000064340  mt-Tl1     Mt_tRNA    mitochondrially encoded tRNA leucine 1 [Source...  NaN                                 MGI:102482  tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA               Ens107
Final shape: (57110, 8)
Saved as gene-mouse.parquet.
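
The synonym aggregation in the curation loop is the step that shrinks the tables most (for human, from 620902 to 276652 rows). To illustrate what it does, here is a small sketch on made-up data. The function name aggregate_synonyms and the toy values (G1, T1, a1, ...) are hypothetical; the pandas calls mirror the ones in the loop above.

```python
import pandas as pd


def aggregate_synonyms(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse one-row-per-synonym into one "|"-joined string per gene name."""
    # Unique (gene name, synonym) pairs, ignoring genes without a synonym.
    aliases = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    aliases = aliases.groupby("Gene name").agg("|".join)
    # Drop the per-row synonym column, deduplicate, then merge the joined one back.
    rest = df.drop(columns="Gene Synonym").drop_duplicates()
    return pd.merge(rest, aliases, on="Gene name", how="left")


# Toy input: gene A has two synonyms and two transcripts, gene B has no synonym.
toy = pd.DataFrame(
    {
        "Gene stable ID": ["G1", "G1", "G1", "G2"],
        "Transcript stable ID": ["T1", "T1", "T2", "T3"],
        "Gene name": ["A", "A", "A", "B"],
        "Gene Synonym": ["a1", "a2", "a1", None],
    }
)
merged = aggregate_synonyms(toy)
print(merged)
```

The four input rows collapse to three: both transcripts of gene A carry the synonym string a1|a2, and gene B keeps a missing value because of the left merge.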

Push to bionty-assets.lndb

ingest = ln.db.Ingest()
ingest.add("ensembl-ids-human.parquet")
ingest.add("ensembl-ids-mouse.parquet")

ingest.add("gene-human.parquet")
ingest.add("gene-mouse.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-26 11:38:51,646:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-mouse.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-mouse.parquet: 1.00
ℹ️ Added notebook 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
|   | dobject                                           | jupynb                                                 | user                 |
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
| 0 | ensembl-ids-human.parquet (eS3P7zGVRniwrYQoAlIO4) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 1 | ensembl-ids-mouse.parquet (aTfPJhiYUeTiohzHArIXM) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 2 | gene-human.parquet (KJ1HgB695AqbVWvfit8sl)        | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 3 | gene-mouse.parquet (xaBDkhBYLXWHq6gJYnedD)        | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.

Now on S3:

  • human genes: https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet

  • mouse genes: https://bionty-assets.s3.amazonaws.com/xaBDkhBYLXWHq6gJYnedD.parquet
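
As a usage sketch for the published gene tables, the snippet below builds a lookup from gene name to Ensembl gene ID. The helper symbol_to_ensembl is hypothetical, and the two-row frame stands in for a real table; its values are copied from the human output above.

```python
import pandas as pd


def symbol_to_ensembl(genes: pd.DataFrame) -> dict:
    """Map each gene name to the sorted, unique Ensembl gene IDs that carry it."""
    named = genes.dropna(subset=["Gene name"])
    grouped = named.groupby("Gene name")["Gene stable ID"]
    return {name: sorted(set(ids)) for name, ids in grouped}


# Stand-in for gene-human.parquet, using two rows shown in the output above.
genes = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG00000210049", "ENSG00000211459"],
        "Gene name": ["MT-TF", "MT-RNR1"],
    },
    index=pd.Index(["Lzl9xt", "ILAWa7"], name="id"),
)
lookup = symbol_to_ensembl(genes)
print(lookup["MT-TF"])
```

To run this on the real data, the frame could be loaded with pd.read_parquet("https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet"), assuming the file is still publicly readable and a parquet engine such as pyarrow is installed.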