Human cell markers -> bionty.CellMarker().df

import pandas as pd
from lnschema_bionty import id
import lamindb as ln

ln.nb.header()
author: Sunny Sun (sunnyosun)
id: wxPc3RWmRp2a
version: 1
time_init: 2022-09-26 15:37
time_run: 2022-10-25 14:27
consecutive_cells: True
pypackage: lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0

Curate the human cell marker table

url = "http://xteam.xbio.top/CellMarker/download/Human_cell_markers.txt"
df = pd.read_csv(url, sep="\t", dtype=str)

df.shape
(2868, 15)
df.head()
|   | speciesType | tissueType | UberonOntologyID | cancerType | cellType | cellName | CellOntologyID | cellMarker | geneSymbol | geneID | proteinName | proteinID | markerResource | PMID | Company |
| 0 | Human | Kidney | UBERON_0002113 | Normal | Normal cell | Proximal tubular cell | NaN | Intestinal Alkaline Phosphatase | ALPI | 248 | PPBI | P09923 | Experiment | 9263997 | NaN |
| 1 | Human | Liver | UBERON_0002107 | Normal | Normal cell | Ito cell (hepatic stellate cell) | CL_0000632 | Synaptophysin | SYP | 6855 | SYPH | P08247 | Experiment | 10595912 | NaN |
| 2 | Human | Endometrium | UBERON_0001295 | Normal | Normal cell | Trophoblast cell | CL_0000351 | CEACAM1 | CEACAM1 | 634 | CEAM1 | P13688 | Experiment | 10751340 | NaN |
| 3 | Human | Germ | UBERON_0000923 | Normal | Normal cell | Primordial germ cell | CL_0000670 | VASA | DDX4 | 54514 | DDX4 | Q9NQI0 | Experiment | 10920202 | NaN |
| 4 | Human | Corneal epithelium | UBERON_0001772 | Normal | Normal cell | Epithelial cell | CL_0000066 | KLF6 | KLF6 | 1316 | KLF6 | Q99612 | Experiment | 12407152 | NaN |
import re


def _split_list(string):
    """Protect bracketed groups from a later split on ", ".

    Rewrites "a, b, [c, d]" as "a, b, [c; d]".
    """
    in_bracket = re.findall(r"\[(.*?)\]", string)
    for lst in in_bracket:
        lst = f"[{lst}]"
        new_lst = lst.replace(", ", "; ")
        string = string.replace(lst, new_lst)
    return string


markers = []
genes = []
gene_ids = []
proteins = []
protein_ids = []

problem_rows = []
for i, row in df.iterrows():
    if ", " in row["cellMarker"]:
        marker = row["cellMarker"].rstrip(", ").split(", ")
        gene = _split_list(row["geneSymbol"].rstrip(", ")).split(", ")
        gene_id = _split_list(row["geneID"].rstrip(", ")).split(", ")
        protein = _split_list(row["proteinName"].rstrip(", ")).split(", ")
        protein_id = _split_list(row["proteinID"].rstrip(", ")).split(", ")

        try:
            assert (
                len(marker)
                == len(gene)
                == len(gene_id)
                == len(protein)
                == len(protein_id)
            )
            markers += marker
            genes += gene
            gene_ids += gene_id
            proteins += protein
            protein_ids += protein_id
        except AssertionError:
            problem_rows.append(row)
            continue
        assert (
            len(markers)
            == len(genes)
            == len(gene_ids)
            == len(proteins)
            == len(protein_ids)
        )
    else:
        markers.append(row["cellMarker"])
        genes.append(row["geneSymbol"])
        gene_ids.append(row["geneID"])
        proteins.append(row["proteinName"])
        protein_ids.append(row["proteinID"])
        assert (
            len(markers)
            == len(genes)
            == len(gene_ids)
            == len(proteins)
            == len(protein_ids)
        )
# These 11 rows were skipped because the number of markers differs from the number of genes or proteins

len(problem_rows)
11
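
The splitting logic in the loop above can be tried on a single made-up row. The sketch below is a simplified illustration, not the notebook's code: `protect_bracketed_commas` and `split_row` are hypothetical names, the input strings are invented, and the bracket protection is applied to every field (the loop above skips it for `cellMarker`).

```python
import re


def protect_bracketed_commas(text):
    """Turn ", " into "; " inside [...] so a split on ", " keeps each group whole."""
    for inner in re.findall(r"\[(.*?)\]", text):
        group = f"[{inner}]"
        text = text.replace(group, group.replace(", ", "; "))
    return text


def split_row(fields):
    """Split every field on ", "; return None when the resulting lengths disagree."""
    parts = [protect_bracketed_commas(f.rstrip(", ")).split(", ") for f in fields]
    if len({len(p) for p in parts}) != 1:
        return None  # mirrors a row that ends up in problem_rows
    return parts


# Invented inputs: two markers, the second mapping to a bracketed pair of genes
ok = split_row(["CD3, CD8", "CD3D, [CD8A, CD8B]"])
# Invented inputs: two markers but only one gene, so the row is rejected
bad = split_row(["CD3, CD8", "CD3D"])

print(ok)
print(bad)
```

Returning `None` instead of raising keeps the caller free to collect rejected rows, which is what the `try`/`except AssertionError` block does above.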
mapper = pd.DataFrame()
mapper["cell_marker"] = markers
mapper["gene_symbols"] = genes
mapper["ncbi_gene_ids"] = gene_ids
mapper["protein_names"] = proteins
mapper["uniprotkb_ids"] = protein_ids

mapper = mapper.drop_duplicates().dropna()
markers_df = mapper.groupby("cell_marker").agg("|".join)
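
To see what the `drop_duplicates().dropna()` and `groupby(...).agg("|".join)` steps do, here is a small sketch on an invented table. The rows are toy values chosen for illustration only and are not taken from the CellMarker file.

```python
import pandas as pd

# Toy table: one exact duplicate row, one row with a missing value
toy = pd.DataFrame(
    {
        "cell_marker": ["CD8", "CD8", "CD8", "CD4", "CD19"],
        "gene_symbols": ["CD8A", "CD8A", "CD8B", "CD4", None],
    }
)

# Remove exact duplicates and incomplete rows, then join the remaining
# values of each marker into one "|"-separated string per column
collapsed = toy.drop_duplicates().dropna().groupby("cell_marker").agg("|".join)

print(collapsed)
```

After this step each marker appears once, and a marker that mapped to several genes carries them all in a single string.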
def _contain_digits(string):
    return any(char.isdigit() for char in string)


for i, row in markers_df.iterrows():
    for k, v in row.items():
        values = [j for j in set(v.split("|")) if j != "NA"]
        if len(values) == 0:
            markers_df.loc[i, k] = ""
            continue
        else:
            if k == "uniprotkb_ids":
                values = [j for j in values if _contain_digits(j)]
            markers_df.loc[i, k] = "|".join(values)
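
The per-cell cleanup performed by the loop above can also be written as a small function, which makes it easier to check on its own. This is a sketch under assumptions: `clean_joined` is a hypothetical helper, the inputs are invented, and it preserves first-seen order, whereas the `set` used above gives no ordering guarantee.

```python
def clean_joined(value, require_digit=False):
    """Deduplicate a "|"-joined string, drop "NA", optionally keep only entries with a digit."""
    # dict.fromkeys removes duplicates while keeping first-seen order
    values = [v for v in dict.fromkeys(value.split("|")) if v != "NA"]
    if require_digit:
        values = [v for v in values if any(ch.isdigit() for ch in v)]
    return "|".join(values)


print(clean_joined("P01732|NA|P01732"))
print(clean_joined("NA|NA"))
print(clean_joined("P08575|PTPRC", require_digit=True))
```

With `require_digit=True` the function mimics the extra filter applied to the `uniprotkb_ids` column, where entries without any digit are discarded.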
markers_df.iloc[100:105]
gene_symbols ncbi_gene_ids protein_names uniprotkb_ids
cell_marker
ACAD11 ACAD11 84129 ACD11 Q709F0
ACADS ACADS 35 ACADS P16219
ACADSB ACADSB 36 ACDSB P45954
ACAP2 ACAP2 23527 ACAP2 Q15057
ACAT1 ACAT1 38 THIL P24752
markers_df.loc["CD8"]
gene_symbols       CD8A
ncbi_gene_ids       925
protein_names      CD8A
uniprotkb_ids    P01732
Name: CD8, dtype: object
markers_df.loc["CD45RO"]
gene_symbols      PTPRC
ncbi_gene_ids      5788
protein_names     PTPRC
uniprotkb_ids    P08575
Name: CD45RO, dtype: object
markers_df.shape
(11079, 4)

Generate dobject ids

markers_df = markers_df.reset_index()

ids = [id.cell_marker() for _ in markers_df.index]
markers_df.index = ids
markers_df.index.name = "id"

assert markers_df.index.is_unique
markers_df.to_parquet("CellMarker-human.parquet")
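
The id assignment above relies on `id.cell_marker()` from `lnschema_bionty`, whose implementation is not shown here. As a rough stand-in, the sketch below assigns random alphanumeric ids to a toy table and checks uniqueness the same way. The function `random_id`, its alphabet, and its length of 12 are assumptions made for this illustration and may differ from what `lnschema_bionty` does.

```python
import secrets
import string

import pandas as pd

ALPHABET = string.ascii_letters + string.digits  # assumed alphabet, for illustration


def random_id(n_char=12):
    """Hypothetical id generator: n_char random alphanumeric characters."""
    return "".join(secrets.choice(ALPHABET) for _ in range(n_char))


# Toy table standing in for markers_df after reset_index()
toy = pd.DataFrame({"cell_marker": ["CD8", "CD4", "CD45RO"]})
toy.index = pd.Index([random_id() for _ in range(len(toy))], name="id")

# Same safeguard as above: fail early if two rows received the same id
assert toy.index.is_unique
print(toy)
```

Checking `is_unique` before writing the file is cheap and guards against the small chance of a collision among randomly generated ids.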

Push to bionty-assets.lndb

!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
ingest = ln.db.Ingest()
ingest.add("CellMarker-human.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-25 16:27:19,161:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/CellMarker-human.parquet: 1.00
ℹ️ Added notebook 'Human cell markers -> `bionty.CellMarker().df`' (wxPc3RWmRp2a, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
|   | dobject                                          | jupynb                                                             | user                 |
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
| 0 | CellMarker-human.parquet (GbC3D7dKnsomHB7ZMeUpC) | 'Human cell markers -> `bionty.CellMarker().df`' (wxPc3RWmRp2a, 1) | sunnyosun (kmvZDIX9) |
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.

Now on S3:

  • human: https://bionty-assets.s3.amazonaws.com/GbC3D7dKnsomHB7ZMeUpC.parquet