Human cell markers -> bionty.CellMarker().df
import pandas as pd
from lnschema_bionty import id
import lamindb as ln
ln.nb.header()
author | Sunny Sun (sunnyosun) |
id | wxPc3RWmRp2a |
version | 1 |
time_init | 2022-09-26 15:37 |
time_run | 2022-10-25 14:27 |
consecutive_cells | True |
pypackage | lamindb==0.6.0 lnschema_bionty==0.4.3 pandas==1.5.0 |
Curate the human cell marker table
url = "http://xteam.xbio.top/CellMarker/download/Human_cell_markers.txt"
df = pd.read_csv(url, sep="\t", dtype=str)
df.shape
(2868, 15)
df.head()
speciesType | tissueType | UberonOntologyID | cancerType | cellType | cellName | CellOntologyID | cellMarker | geneSymbol | geneID | proteinName | proteinID | markerResource | PMID | Company | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Human | Kidney | UBERON_0002113 | Normal | Normal cell | Proximal tubular cell | NaN | Intestinal Alkaline Phosphatase | ALPI | 248 | PPBI | P09923 | Experiment | 9263997 | NaN |
1 | Human | Liver | UBERON_0002107 | Normal | Normal cell | Ito cell (hepatic stellate cell) | CL_0000632 | Synaptophysin | SYP | 6855 | SYPH | P08247 | Experiment | 10595912 | NaN |
2 | Human | Endometrium | UBERON_0001295 | Normal | Normal cell | Trophoblast cell | CL_0000351 | CEACAM1 | CEACAM1 | 634 | CEAM1 | P13688 | Experiment | 10751340 | NaN |
3 | Human | Germ | UBERON_0000923 | Normal | Normal cell | Primordial germ cell | CL_0000670 | VASA | DDX4 | 54514 | DDX4 | Q9NQI0 | Experiment | 10920202 | NaN |
4 | Human | Corneal epithelium | UBERON_0001772 | Normal | Normal cell | Epithelial cell | CL_0000066 | KLF6 | KLF6 | 1316 | KLF6 | Q99612 | Experiment | 12407152 | NaN |
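import re


def _split_list(string):
    """Replace ", " inside [...] with "; " so that "a, b, [c, d]" splits into a, b, [c; d]."""
    # Raw string avoids the invalid-escape warning for "\[" in a normal string literal.
    in_bracket = re.findall(r"\[(.*?)\]", string)
    for lst in in_bracket:
        lst = f"[{lst}]"
        new_lst = lst.replace(", ", "; ")
        string = string.replace(lst, new_lst)
    return string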
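To make the purpose of `_split_list` concrete, here is a minimal, self-contained sketch. The input string is a made-up example, not a value taken from the CellMarker table, and the function body is repeated only so that the snippet runs on its own.

```python
import re


def _split_list(string):
    """Replace ', ' inside [...] with '; ' so bracketed groups survive split(', ')."""
    for inner in re.findall(r"\[(.*?)\]", string):
        group = f"[{inner}]"
        string = string.replace(group, group.replace(", ", "; "))
    return string


# Hypothetical value: three entries, the second one is a bracketed group.
raw = "G1, [G2a, G2b], G3"

naive_parts = raw.split(", ")          # splits the bracketed group in two
protected = _split_list(raw)           # 'G1, [G2a; G2b], G3'
parts = protected.split(", ")          # keeps the group as one entry

print(naive_parts)
print(parts)
```

Without the bracket protection the naive split yields four entries instead of three, which would make the per-row length check fail for rows that are actually consistent.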
markers = []
genes = []
gene_ids = []
proteins = []
protein_ids = []
problem_rows = []

for i, row in df.iterrows():
    if ", " in row["cellMarker"]:
        # Row lists several markers: split every column into parallel lists.
        marker = row["cellMarker"].rstrip(", ").split(", ")
        gene = _split_list(row["geneSymbol"].rstrip(", ")).split(", ")
        gene_id = _split_list(row["geneID"].rstrip(", ")).split(", ")
        protein = _split_list(row["proteinName"].rstrip(", ")).split(", ")
        protein_id = _split_list(row["proteinID"].rstrip(", ")).split(", ")
        try:
            # All columns must contain the same number of entries.
            assert (
                len(marker)
                == len(gene)
                == len(gene_id)
                == len(protein)
                == len(protein_id)
            )
            markers += marker
            genes += gene
            gene_ids += gene_id
            proteins += protein
            protein_ids += protein_id
        except AssertionError:
            # Keep the row for later inspection and skip it.
            problem_rows.append(row)
            continue
        assert (
            len(markers)
            == len(genes)
            == len(gene_ids)
            == len(proteins)
            == len(protein_ids)
        )
    else:
        # Row lists a single marker: append the values as they are.
        markers.append(row["cellMarker"])
        genes.append(row["geneSymbol"])
        gene_ids.append(row["geneID"])
        proteins.append(row["proteinName"])
        protein_ids.append(row["proteinID"])
        assert (
            len(markers)
            == len(genes)
            == len(gene_ids)
            == len(proteins)
            == len(protein_ids)
        )
# These 11 rows were skipped because their number of markers differs from their number of genes or proteins.
len(problem_rows)
11
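To see why a row ends up in `problem_rows`, it can help to count the entries per column. The following is a rough sketch under stated assumptions: `count_entries` and `_protect_brackets` are hypothetical helpers (not part of the notebook or of any library), and `toy_row` is an invented row with placeholder values, not one of the 11 real problem rows.

```python
import re

# Columns that are split in parallel by the parsing loop.
COLUMNS = ["cellMarker", "geneSymbol", "geneID", "proteinName", "proteinID"]


def _protect_brackets(string):
    """Replace ', ' inside [...] with '; ' so bracketed groups count as one entry."""
    for inner in re.findall(r"\[(.*?)\]", string):
        group = f"[{inner}]"
        string = string.replace(group, group.replace(", ", "; "))
    return string


def count_entries(row):
    """Hypothetical helper: number of ', '-separated entries in each column of a row."""
    # The marker column is split directly, as in the parsing loop.
    counts = {"cellMarker": len(row["cellMarker"].rstrip(", ").split(", "))}
    for col in COLUMNS[1:]:
        counts[col] = len(_protect_brackets(row[col].rstrip(", ")).split(", "))
    return counts


# Invented row: three markers, but only two gene/protein entries.
toy_row = {
    "cellMarker": "M1, M2, M3",
    "geneSymbol": "G1, [G2a, G2b]",
    "geneID": "1, [2, 3]",
    "proteinName": "P1, [P2a, P2b]",
    "proteinID": "X1, [X2a, X2b]",
}

counts = count_entries(toy_row)
is_problem = len(set(counts.values())) > 1  # True when the columns disagree

print(counts)
print(is_problem)
```

With the real data, the same helper could be applied to each entry of `problem_rows`, since a pandas row supports the same `row[col]` lookup as the dictionary used here.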
mapper = pd.DataFrame()
mapper["cell_marker"] = markers
mapper["gene_symbols"] = genes
mapper["ncbi_gene_ids"] = gene_ids
mapper["protein_names"] = proteins
mapper["uniprotkb_ids"] = protein_ids
mapper = mapper.drop_duplicates().dropna()
markers_df = mapper.groupby("cell_marker").agg("|".join)
def _contain_digits(string):
    return any(char.isdigit() for char in string)


for i, row in markers_df.iterrows():
    for k, v in row.items():
        values = [j for j in set(v.split("|")) if j != "NA"]
        if len(values) == 0:
            markers_df.loc[i, k] = ""
            continue
        else:
            if k == "uniprotkb_ids":
                values = [j for j in values if _contain_digits(j)]
            markers_df.loc[i, k] = "|".join(values)
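The cleanup step above can be illustrated on plain strings. This is a sketch, not the notebook's exact code: `clean_joined` is a hypothetical helper, the inputs are invented, and it deliberately differs from the loop above in one respect, namely that it keeps values in order of first appearance instead of going through `set()`, whose iteration order for strings can vary between Python runs.

```python
def _contain_digits(string):
    return any(char.isdigit() for char in string)


def clean_joined(value, column):
    """Hypothetical helper: deduplicate a '|'-joined string and drop 'NA' entries.

    For the 'uniprotkb_ids' column, entries without any digit are dropped as well.
    Order of first appearance is preserved (an assumption of this sketch).
    """
    kept = []
    for item in value.split("|"):
        if item == "NA" or item in kept:
            continue
        if column == "uniprotkb_ids" and not _contain_digits(item):
            continue
        kept.append(item)
    return "|".join(kept)


# Invented inputs that mimic the aggregated columns.
symbols = clean_joined("PTPRC|NA|PTPRC", "gene_symbols")
uniprot = clean_joined("P08575|NA|P08575|unknown", "uniprotkb_ids")
empty = clean_joined("NA|NA", "protein_names")

print(repr(symbols), repr(uniprot), repr(empty))
```

Preserving the order makes the resulting table reproducible across runs, which may matter if the parquet file is regenerated and compared with an earlier version.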
markers_df.iloc[100:105]
cell_marker | gene_symbols | ncbi_gene_ids | protein_names | uniprotkb_ids |
---|---|---|---|---|
ACAD11 | ACAD11 | 84129 | ACD11 | Q709F0 |
ACADS | ACADS | 35 | ACADS | P16219 |
ACADSB | ACADSB | 36 | ACDSB | P45954 |
ACAP2 | ACAP2 | 23527 | ACAP2 | Q15057 |
ACAT1 | ACAT1 | 38 | THIL | P24752 |
markers_df.loc["CD8"]
gene_symbols CD8A
ncbi_gene_ids 925
protein_names CD8A
uniprotkb_ids P01732
Name: CD8, dtype: object
markers_df.loc["CD45RO"]
gene_symbols PTPRC
ncbi_gene_ids 5788
protein_names PTPRC
uniprotkb_ids P08575
Name: CD45RO, dtype: object
markers_df.shape
(11079, 4)
Generate dobject ids
markers_df = markers_df.reset_index()
ids = []
for i in markers_df.index:
    ids.append(id.cell_marker())
markers_df.index = ids
markers_df.index.name = "id"
assert markers_df.index.is_unique
markers_df.to_parquet("CellMarker-human.parquet")
Push to bionty-assets.lndb
!lndb load bionty-assets
migrate-unnecessary
!lndb login sunnyosun
ingest = ln.db.Ingest()
ingest.add("CellMarker-human.parquet");
ingest.commit()
✅ Cell numbers increase consecutively: Awesome!
2022-10-25 16:27:19,161:INFO - Found credentials in shared credentials file: ~/.aws/credentials
Upload /Users/sunnysun/Documents/repos.nosync/bionty-assets/docs/ingest/CellMarker-human.parquet: 1.00
ℹ️ Added notebook 'Human cell markers -> `bionty.CellMarker().df`' (wxPc3RWmRp2a, 1) by user sunnyosun.
✅ Ingested the following dobjects:
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
| | dobject | jupynb | user |
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
| 0 | CellMarker-human.parquet (GbC3D7dKnsomHB7ZMeUpC) | 'Human cell markers -> `bionty.CellMarker().df`' (wxPc3RWmRp2a, 1) | sunnyosun (kmvZDIX9) |
+---+--------------------------------------------------+--------------------------------------------------------------------+----------------------+
ℹ️ Set notebook version to 1 & wrote pypackages.
Now on S3:
human: https://bionty-assets.s3.amazonaws.com/GbC3D7dKnsomHB7ZMeUpC.parquet