Gene: Ensembl, release-107
!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
import lamindb as ln
import pandas as pd
from lnschema_bionty import id
ln.nb.header()
2022-10-26 11:38:03,833:INFO - Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2022-10-26 11:38:03,834:INFO - NumExpr defaulting to 8 threads.
author | Sunny Sun (sunnyosun) |
id | z2WNjjvFuzwf |
version | 1 |
time_init | 2022-09-27 11:04 |
time_run | 2022-10-26 09:40 |
consecutive_cells | True |
pypackage | lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0 |
Ensembl download
Each curated table has a version column with the value Ens107.
The source tables were downloaded from the BioMart database (Ensembl Genes 107) and contain the following columns for every species:
Gene stable ID
Transcript stable ID
Protein stable ID
Gene name
Gene Synonym
Gene type
Gene description
NCBI gene (formerly Entrezgene) ID (a new column added in v2)
Additional species-specific columns are also present for:
human: HGNC ID, MIM gene accession
mouse: MGI ID
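The column lists above can be turned into a quick sanity check before curation. The following is a minimal sketch, not part of the original notebook: the names COMMON_COLUMNS, SPECIES_COLUMNS and missing_columns are hypothetical helpers introduced here for illustration, and the column names are copied from the lists above.

```python
import pandas as pd

# Columns expected for every species (taken from the list above).
COMMON_COLUMNS = [
    "Gene stable ID",
    "Transcript stable ID",
    "Protein stable ID",
    "Gene name",
    "Gene Synonym",
    "Gene type",
    "Gene description",
    "NCBI gene (formerly Entrezgene) ID",
]

# Species-specific columns (taken from the list above).
SPECIES_COLUMNS = {
    "human": ["HGNC ID", "MIM gene accession"],
    "mouse": ["MGI ID"],
}


def missing_columns(df: pd.DataFrame, species: str) -> list:
    """Return the expected columns that are absent from df (hypothetical helper)."""
    expected = COMMON_COLUMNS + SPECIES_COLUMNS[species]
    return [col for col in expected if col not in df.columns]


# Toy example: an empty table that carries the mouse columns only.
toy = pd.DataFrame(columns=COMMON_COLUMNS + ["MGI ID"])
print(missing_columns(toy, "mouse"))  # []
print(missing_columns(toy, "human"))  # ['HGNC ID', 'MIM gene accession']
```

The expected counts (10 columns for human, 9 for mouse) agree with the initial shapes printed by the curation cell.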
# Downloaded on 2022-09-27
dfs = {
"human": "https://bionty-assets.s3.amazonaws.com/mart_export-human.txt", # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.ensembl_transcript_id|hsapiens_gene_ensembl.default.feature_page.ensembl_peptide_id|hsapiens_gene_ensembl.default.feature_page.external_gene_name|hsapiens_gene_ensembl.default.feature_page.external_synonym|hsapiens_gene_ensembl.default.feature_page.gene_biotype|hsapiens_gene_ensembl.default.feature_page.entrezgene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_id|hsapiens_gene_ensembl.default.feature_page.mim_gene_accession&FILTERS=&VISIBLEPANEL=resultspanel
"mouse": "https://bionty-assets.s3.amazonaws.com/mart_export-mouse.txt", # http://www.ensembl.org/biomart/martview/4d75e3d44de27e6ed58cfce974f0a755?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_peptide_id|mmusculus_gene_ensembl.default.feature_page.external_gene_name|mmusculus_gene_ensembl.default.feature_page.external_synonym|mmusculus_gene_ensembl.default.feature_page.gene_biotype|mmusculus_gene_ensembl.default.feature_page.entrezgene_id|mmusculus_gene_ensembl.default.feature_page.mgi_id&FILTERS=&VISIBLEPANEL=resultspanel
}
Curate the tables
allids = []
for species, path in dfs.items():
    print(f"----------{species}----------")
    df = pd.read_csv(path, dtype=str)
    print(f"Initial shape: {df.shape}")
    # aggregate the `Gene Synonym` column into one "|"-separated string per gene name
    df_alias = df[["Gene name", "Gene Synonym"]].drop_duplicates().dropna()
    df_alias = df_alias.groupby("Gene name").agg("|".join)
    del df["Gene Synonym"]
    df = df.drop_duplicates()
    df = pd.merge(df, df_alias, on="Gene name", how="left")
    # add the version column
    df["version"] = "Ens107"
    display(df.head())
    print(f"All ids shape: {df.shape}")
    # save all ids to a parquet file
    df.to_parquet(f"ensembl-ids-{species}.parquet")
    print(f"Saved as ensembl-ids-{species}.parquet.")
    # subset to genes only
    df = df.loc[:, ~df.columns.isin(["Transcript stable ID", "Protein stable ID"])]
    df = df.drop_duplicates()
    # assign a newly generated id to each entry
    ids = [id.gene() for _ in df.index]
    df.index = ids
    df.index.name = "id"
    display(df.head())
    print(f"Final shape: {df.shape}")
    # save the gene-level table to a parquet file
    df.to_parquet(f"gene-{species}.parquet")
    print(f"Saved as gene-{species}.parquet.")
    # collect ids across species
    allids += ids
# make sure ids are unique across species
assert len(set(allids)) == len(allids)
----------human----------
Initial shape: (620902, 10)
Gene stable ID | Transcript stable ID | Protein stable ID | Gene name | Gene type | Gene description | NCBI gene (formerly Entrezgene) ID | HGNC ID | MIM gene accession | Gene Synonym | version | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | ENSG00000210049 | ENST00000387314 | NaN | MT-TF | Mt_tRNA | mitochondrially encoded tRNA-Phe (UUU/C) [Sour... | NaN | HGNC:7481 | NaN | MTTF|trnF | Ens107 |
1 | ENSG00000211459 | ENST00000389680 | NaN | MT-RNR1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:HGNC ... | NaN | HGNC:7470 | NaN | 12S|MOTS-c|MTRNR1 | Ens107 |
2 | ENSG00000210077 | ENST00000387342 | NaN | MT-TV | Mt_tRNA | mitochondrially encoded tRNA-Val (GUN) [Source... | NaN | HGNC:7500 | NaN | MTTV|trnV | Ens107 |
3 | ENSG00000210082 | ENST00000387347 | NaN | MT-RNR2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:HGNC ... | NaN | HGNC:7471 | NaN | 16S|HN|MTRNR2 | Ens107 |
4 | ENSG00000209082 | ENST00000386347 | NaN | MT-TL1 | Mt_tRNA | mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... | NaN | HGNC:7490 | NaN | MTTL1|TRNL1 | Ens107 |
All ids shape: (276652, 11)
Saved as ensembl-ids-human.parquet.
id | Gene stable ID | Gene name | Gene type | Gene description | NCBI gene (formerly Entrezgene) ID | HGNC ID | MIM gene accession | Gene Synonym | version |
---|---|---|---|---|---|---|---|---|---|
Lzl9xt | ENSG00000210049 | MT-TF | Mt_tRNA | mitochondrially encoded tRNA-Phe (UUU/C) [Sour... | NaN | HGNC:7481 | NaN | MTTF|trnF | Ens107 |
ILAWa7 | ENSG00000211459 | MT-RNR1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:HGNC ... | NaN | HGNC:7470 | NaN | 12S|MOTS-c|MTRNR1 | Ens107 |
XkyeQz | ENSG00000210077 | MT-TV | Mt_tRNA | mitochondrially encoded tRNA-Val (GUN) [Source... | NaN | HGNC:7500 | NaN | MTTV|trnV | Ens107 |
jDD2jW | ENSG00000210082 | MT-RNR2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:HGNC ... | NaN | HGNC:7471 | NaN | 16S|HN|MTRNR2 | Ens107 |
J58H9b | ENSG00000209082 | MT-TL1 | Mt_tRNA | mitochondrially encoded tRNA-Leu (UUA/G) 1 [So... | NaN | HGNC:7490 | NaN | MTTL1|TRNL1 | Ens107 |
Final shape: (68856, 9)
Saved as gene-human.parquet.
----------mouse----------
Initial shape: (296054, 9)
Gene stable ID | Transcript stable ID | Protein stable ID | Gene name | Gene type | Gene description | NCBI gene (formerly Entrezgene) ID | MGI ID | Gene Synonym | version | |
---|---|---|---|---|---|---|---|---|---|---|
0 | ENSMUSG00000064336 | ENSMUST00000082387 | NaN | mt-Tf | Mt_tRNA | mitochondrially encoded tRNA phenylalanine [So... | NaN | MGI:102487 | tRNA|tRNA-Phe|TrnF tRNA | Ens107 |
1 | ENSMUSG00000064337 | ENSMUST00000082388 | NaN | mt-Rnr1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:MGI S... | NaN | MGI:102493 | 12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA | Ens107 |
2 | ENSMUSG00000064338 | ENSMUST00000082389 | NaN | mt-Tv | Mt_tRNA | mitochondrially encoded tRNA valine [Source:MG... | NaN | MGI:102472 | tRNA|tRNA-Val|TrnaV tRNA | Ens107 |
3 | ENSMUSG00000064339 | ENSMUST00000082390 | NaN | mt-Rnr2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:MGI S... | NaN | MGI:102492 | 16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri... | Ens107 |
4 | ENSMUSG00000064340 | ENSMUST00000082391 | NaN | mt-Tl1 | Mt_tRNA | mitochondrially encoded tRNA leucine 1 [Source... | NaN | MGI:102482 | tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA | Ens107 |
All ids shape: (150702, 10)
Saved as ensembl-ids-mouse.parquet.
id | Gene stable ID | Gene name | Gene type | Gene description | NCBI gene (formerly Entrezgene) ID | MGI ID | Gene Synonym | version |
---|---|---|---|---|---|---|---|---|
Epd98t | ENSMUSG00000064336 | mt-Tf | Mt_tRNA | mitochondrially encoded tRNA phenylalanine [So... | NaN | MGI:102487 | tRNA|tRNA-Phe|TrnF tRNA | Ens107 |
RiOxA6 | ENSMUSG00000064337 | mt-Rnr1 | Mt_rRNA | mitochondrially encoded 12S rRNA [Source:MGI S... | NaN | MGI:102493 | 12S ribosomal RNA|12S rRNA|12SrRNA|Rnr1 s-rRNA | Ens107 |
cMIElg | ENSMUSG00000064338 | mt-Tv | Mt_tRNA | mitochondrially encoded tRNA valine [Source:MG... | NaN | MGI:102472 | tRNA|tRNA-Val|TrnaV tRNA | Ens107 |
DbiNNA | ENSMUSG00000064339 | mt-Rnr2 | Mt_rRNA | mitochondrially encoded 16S rRNA [Source:MGI S... | NaN | MGI:102492 | 16S ribosomal RNA|16S rRNA|16SrRNA|Rnr2 16S ri... | Ens107 |
NO6NBF | ENSMUSG00000064340 | mt-Tl1 | Mt_tRNA | mitochondrially encoded tRNA leucine 1 [Source... | NaN | MGI:102482 | tRNA|tRNA Leu|tRNA Leu_1|TrnrL1 tRNA | Ens107 |
Final shape: (57110, 8)
Saved as gene-mouse.parquet.
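To make the shape changes above easier to follow, here is a minimal sketch of the same curation steps on a toy table. The data is invented for illustration (it is not real Ensembl content), and the sketch calls reset_index() before merging for explicitness, which the notebook itself does not do.

```python
import pandas as pd

# Toy stand-in for a BioMart export: one row per (transcript, synonym) pair.
raw = pd.DataFrame(
    {
        "Gene stable ID": ["G1", "G1", "G1", "G1", "G2"],
        "Transcript stable ID": ["T1", "T1", "T2", "T2", "T3"],
        "Gene name": ["A", "A", "A", "A", "B"],
        "Gene Synonym": ["a1", "a2", "a1", "a2", None],
    }
)

# Collapse all synonyms of a gene name into one "|"-separated string.
synonyms = (
    raw[["Gene name", "Gene Synonym"]]
    .drop_duplicates()
    .dropna()
    .groupby("Gene name")
    .agg("|".join)
    .reset_index()
)

# Drop the per-row synonym column, deduplicate, and attach the aggregated synonyms.
all_ids = raw.drop(columns="Gene Synonym").drop_duplicates()
all_ids = all_ids.merge(synonyms, on="Gene name", how="left")

# Gene-level table: remove the transcript column and deduplicate again.
genes = all_ids.drop(columns="Transcript stable ID").drop_duplicates()

print(all_ids.shape)  # (3, 4): one row per transcript
print(genes.shape)    # (2, 3): one row per gene
```

Gene B has no synonyms, so the left merge leaves its Gene Synonym value missing rather than dropping the gene.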
Push to bionty-assets.lndb
ingest = ln.db.Ingest()
ingest.add("ensembl-ids-human.parquet")
ingest.add("ensembl-ids-mouse.parquet")
ingest.add("gene-human.parquet")
ingest.add("gene-mouse.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-26 11:38:51,646:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/ensembl-ids-mouse.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/gene-mouse.parquet: 1.00
ℹ️ Added notebook 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
| | dobject | jupynb | user |
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
| 0 | ensembl-ids-human.parquet (eS3P7zGVRniwrYQoAlIO4) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 1 | ensembl-ids-mouse.parquet (aTfPJhiYUeTiohzHArIXM) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 2 | gene-human.parquet (KJ1HgB695AqbVWvfit8sl) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
| 3 | gene-mouse.parquet (xaBDkhBYLXWHq6gJYnedD) | 'Ensembl gene -> `bionty.Gene().df`' (z2WNjjvFuzwf, 1) | sunnyosun (kmvZDIX9) |
+---+---------------------------------------------------+--------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.
Now on S3:
human genes: https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet
mouse genes: https://bionty-assets.s3.amazonaws.com/xaBDkhBYLXWHq6gJYnedD.parquet
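A minimal sketch of how such a gene table could be consumed once downloaded, assuming it keeps the layout shown in the output above (generated id as the index, Gene stable ID as a column). The helper index_by_ensembl_id is hypothetical and not part of lamindb or bionty; the toy rows are copied from the human output above.

```python
import pandas as pd


def index_by_ensembl_id(genes: pd.DataFrame) -> pd.DataFrame:
    """Return a copy indexed by Ensembl gene ID, keeping the generated id as a column."""
    return genes.reset_index().set_index("Gene stable ID")


# Toy stand-in with two rows from the human gene table.
toy = pd.DataFrame(
    {
        "Gene stable ID": ["ENSG00000210049", "ENSG00000211459"],
        "Gene name": ["MT-TF", "MT-RNR1"],
    },
    index=pd.Index(["Lzl9xt", "ILAWa7"], name="id"),
)

by_ensembl = index_by_ensembl_id(toy)
print(by_ensembl.loc["ENSG00000210049", "Gene name"])  # MT-TF

# With network access, the real table could be loaded in place of the toy one
# (not run here; assumes the S3 object is still available):
# genes = pd.read_parquet(
#     "https://bionty-assets.s3.amazonaws.com/KJ1HgB695AqbVWvfit8sl.parquet"
# )
```

Gene stable ID values are not guaranteed to be unique in the gene table, so a lookup with .loc may return several rows for one gene.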