UniProtKB table -> bionty.Protein().df

import pandas as pd
import lamindb as ln
from lnschema_bionty import id

ln.nb.header()
author: Sunny Sun (sunnyosun)
id: uV9o7RZmv6rG
version: 1
time_init: 2022-09-26 21:17
time_run: 2022-10-25 16:23
consecutive_cells: True
pypackage: lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0

Files are downloaded from: https://www.uniprot.org/uniprotkb

# Downloaded on 2022-09-26

filepaths = {
    "human": "https://bionty-assets.s3.amazonaws.com/uniprot-human.tsv.gz",
    "mouse": "https://bionty-assets.s3.amazonaws.com/uniprot-mouse.tsv.gz",
}

Curate the tables

allids = []

for species, filepath in filepaths.items():
    print(f"Loading {species} data...")

    df = pd.read_csv(filepath, sep="\t")

    # assign a newly generated protein id to each entry
    ids = [id.protein() for _ in range(len(df))]
    df.index = ids
    df.index.name = "id"

    allids += ids

    print(f"shape: {df.shape}")
    display(df.head())

    filename = f"uniprot-{species}.parquet"
    df.to_parquet(filename)

    print(f"Wrote {filename}.")
    print("------------------------------------------------")

assert len(allids) == len(set(allids))
Loading human data...
shape: (204961, 9)
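+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+----------------------+---------+--------+
| id      | Entry      | Entry Name       | Protein names                                     | Length | Organism (ID) | Gene Names (primary) | Gene Names (synonym) | Ensembl | GeneID |
+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+----------------------+---------+--------+
| 1zrr8Wy | A0A024QZ08 | A0A024QZ08_HUMAN | Intraflagellar transport 20 homolog (Chlamydom... | 132    | 9606          | IFT20                | NaN                  | NaN     | 90410; |
| xNgxtFu | A0A024QZ86 | A0A024QZ86_HUMAN | T-box 2, isoform CRA_a                            | 712    | 9606          | TBX2                 | NaN                  | NaN     | 6909;  |
| X9K8OgK | A0A024QZA8 | A0A024QZA8_HUMAN | Receptor protein-tyrosine kinase, EC 2.7.10.1     | 976    | 9606          | EPHA2                | NaN                  | NaN     | 1969;  |
| 8jW9Ci4 | A0A024QZB8 | A0A024QZB8_HUMAN | Battenin                                          | 438    | 9606          | CLN3                 | NaN                  | NaN     | 1201;  |
| nZNsA6F | A0A024QZQ1 | A0A024QZQ1_HUMAN | Sirtuin (Silent mating type information regula... | 747    | 9606          | SIRT1                | NaN                  | NaN     | 23411; |
+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+----------------------+---------+--------+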
Wrote uniprot-human.parquet.
------------------------------------------------
Loading mouse data...
shape: (86436, 9)
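+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+---------------------------------------------------+---------------------------------------------------+---------------------------------------------------+
| id      | Entry      | Entry Name       | Protein names                                     | Length | Organism (ID) | Gene Names (primary) | Gene Names (synonym)                              | Ensembl                                           | GeneID                                            |
+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+---------------------------------------------------+---------------------------------------------------+---------------------------------------------------+
| oWysKQr | A0A075F5C6 | A0A075F5C6_MOUSE | Heat shock factor protein 1 (Heat shock transc... | 531    | 10090         | Hsf1                 | NaN                                               | ENSMUST00000228371.2;                             | 15499;                                            |
| IGupwHD | A0A087WPF7 | AUTS2_MOUSE      | Autism susceptibility gene 2 protein homolog      | 1261   | 10090         | Auts2                | Kiaa0442                                          | ENSMUST00000161226 [A0A087WPF7-1];ENSMUST00000... | NaN                                               |
| XrEF1mC | A0A087WPT2 | A0A087WPT2_MOUSE | Prostaglandin G/H synthase 2                      | 62     | 10090         | Ptgs2                | NaN                                               | ENSMUST00000190784.2;                             | NaN                                               |
| qACNsPf | A0A087WPU4 | A0A087WPU4_MOUSE | FAT atypical cadherin 1                           | 159    | 10090         | Fat1                 | NaN                                               | ENSMUST00000186342.3;                             | NaN                                               |
| izgkQbe | A0A087WRK1 | A0A087WRK1_MOUSE | Predicted gene, 20814 (Predicted gene, 20850) ... | 222    | 10090         | Gm20850              | Gm20814 Gm20835 Gm20855 Gm20869 Gm20870 Gm2088... | ENSMUST00000185240.2;ENSMUST00000185245.2;ENSM... | 100042201;100042279;100042594;100861691;108167... |
+---------+------------+------------------+---------------------------------------------------+--------+---------------+----------------------+---------------------------------------------------+---------------------------------------------------+---------------------------------------------------+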
Wrote uniprot-mouse.parquet.
------------------------------------------------
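
The final assert in the curation cell only detects an id collision after both parquet files have been written. As a minimal sketch, ids could instead be drawn so that collisions are redrawn up front. This assumes the ids are uniformly random 7-character base62 strings, which matches the look of the ids printed above but is not confirmed by lnschema_bionty. Under that assumption, the roughly 291,000 ids drawn here would collide with a probability on the order of 1%. `random_id` and `unique_ids` are hypothetical helpers, not part of lamindb or lnschema_bionty.

```python
import secrets
import string
from typing import List, Optional, Set

# 62 alphanumeric characters; ASSUMPTION: same alphabet as the ids shown above
BASE62 = string.digits + string.ascii_letters


def random_id(n_char: int = 7) -> str:
    """Return a random base62 string (hypothetical stand-in for id.protein())."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def unique_ids(n: int, taken: Optional[Set[str]] = None, n_char: int = 7) -> List[str]:
    """Draw n ids that collide neither with each other nor with `taken`."""
    seen = set(taken) if taken is not None else set()
    ids: List[str] = []
    while len(ids) < n:
        candidate = random_id(n_char)
        if candidate in seen:
            continue  # collision: redraw instead of failing at the end
        seen.add(candidate)
        ids.append(candidate)
    return ids
```

In the loop above, this would be used as `ids = unique_ids(len(df), taken=set(allids))`, so that ids stay unique across species as well as within one table.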

Push to bionty-assets.lndb

!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
ingest = ln.db.Ingest()
ingest.add("uniprot-human.parquet")
ingest.add("uniprot-mouse.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-25 18:22:19,238:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/uniprot-human.parquet: 1.00
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/uniprot-mouse.parquet: 1.00
ℹ️ Added notebook 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
|   | dobject                                       | jupynb                                                       | user                 |
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
| 0 | uniprot-human.parquet (5WBmdkTO4JCFzPzBcDOJ3) | 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) | sunnyosun (kmvZDIX9) |
| 1 | uniprot-mouse.parquet (6vgntdGiAbz5bEYP53sma) | 'UniProtKB table -> `bionty.Protein().df`' (uV9o7RZmv6rG, 1) | sunnyosun (kmvZDIX9) |
+---+-----------------------------------------------+--------------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.

Now on S3:

  • human: https://bionty-assets.s3.amazonaws.com/5WBmdkTO4JCFzPzBcDOJ3.parquet

  • mouse: https://bionty-assets.s3.amazonaws.com/6vgntdGiAbz5bEYP53sma.parquet
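
A minimal sketch of how the ingested tables could be read back from these URLs. It assumes the bucket objects are publicly readable over HTTPS and that a parquet engine such as pyarrow is installed. `protein_table_url` and `load_protein_table` are hypothetical helper names, not part of bionty or lamindb.

```python
BASE_URL = "https://bionty-assets.s3.amazonaws.com"

# dobject ids as reported in the ingest output above
DOBJECT_IDS = {
    "human": "5WBmdkTO4JCFzPzBcDOJ3",
    "mouse": "6vgntdGiAbz5bEYP53sma",
}


def protein_table_url(species: str) -> str:
    """Build the S3 URL of the curated UniProt table for a species."""
    return f"{BASE_URL}/{DOBJECT_IDS[species]}.parquet"


def load_protein_table(species: str):
    """Download one curated table as a DataFrame indexed by the generated ids."""
    # Imported here so the URL helper stays usable without pandas installed.
    import pandas as pd

    # ASSUMPTION: the object is publicly readable; otherwise this raises an HTTP error.
    return pd.read_parquet(protein_table_url(species))
```

For example, `load_protein_table("human")` should return the table written as `uniprot-human.parquet` above, if the download succeeds.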