ICD 11
The data was obtained on 2024-01-22 (version 01/2023) by clicking on info -> spreadsheet file on https://icd.who.int/browse11/l-m/en.
The spreadsheet has no proper ontology_id column, so we use the numeric ID at the end of the linearization URI instead.
Note that a URI can end in other or unspecified, which mark alternative variants of the same entity. We keep these entries but shorten the suffixes to o and u, respectively, and append them to the numeric ID.
import pandas as pd
import re
df = pd.read_excel("icd_11.xlsx")
df = df[["Linearization (release) URI", "Code", "Title"]]
df.head()
| | Linearization (release) URI | Code | Title |
|---|---|---|---|
| 0 | http://id.who.int/icd/release/11/2023-01/mms/1... | NaN | Certain infectious or parasitic diseases |
| 1 | http://id.who.int/icd/release/11/2023-01/mms/5... | NaN | - Gastroenteritis or colitis of infectious origin |
| 2 | http://id.who.int/icd/release/11/2023-01/mms/1... | NaN | - - Bacterial intestinal infections |
| 3 | http://id.who.int/icd/release/11/2023-01/mms/2... | 1A00 | - - - Cholera |
| 4 | http://id.who.int/icd/release/11/2023-01/mms/4... | 1A01 | - - - Intestinal infection due to other Vibrio |
df.rename(
    columns={"Code": "code", "Title": "name", "Linearization (release) URI": "URI"},
    inplace=True,
)
def extract_code(url: str) -> str:
    # Capture the trailing numeric ID and an optional other/unspecified suffix
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if match:
        code = match.group(1)
        suffix = match.group(2)
        if suffix == "other":
            code += "o"
        elif suffix == "unspecified":
            code += "u"
        return code
    else:
        return "No code found"
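To make the suffix handling concrete, here is a minimal sketch that applies the same regular expression to a few URIs. The entity ID 123456789 is made up for illustration and does not refer to a real ICD-11 entry; the function is a compact restatement of extract_code, not a replacement for it.

```python
import re

# Release prefix as it appears in the spreadsheet; the entity ID below is hypothetical.
BASE = "http://id.who.int/icd/release/11/2023-01/mms"


def extract_code(url: str) -> str:
    """Return the trailing numeric ID, with 'o' or 'u' appended for other/unspecified."""
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if match is None:
        return "No code found"
    suffix = {"other": "o", "unspecified": "u", None: ""}[match.group(2)]
    return match.group(1) + suffix


print(extract_code(f"{BASE}/123456789"))              # 123456789
print(extract_code(f"{BASE}/123456789/other"))        # 123456789o
print(extract_code(f"{BASE}/123456789/unspecified"))  # 123456789u
print(extract_code(BASE))                             # No code found
```

Because the pattern is anchored with `$`, the numbers in the release part of the URI (11, 2023-01) are never picked up as the ID.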
# Finding the parent for each term
def find_parent(term, all_terms):
    depth = term.count("-")
    parent_depth = depth - 1
    term_index = all_terms.index(term)
    # Search upwards for the nearest term with one less dash
    for previous_term in reversed(all_terms[:term_index]):
        if previous_term.count("-") == parent_depth:
            return previous_term.strip("- ").strip()
    return None
df["parents"] = df["name"].apply(lambda x: find_parent(x, df["name"].tolist()))
df["ontology_id"] = df["URI"].apply(extract_code)
df.drop("URI", inplace=True, axis=1)
df["name"] = df["name"].str.replace("-", "").str.strip()
title_to_ontology = dict(zip(df["name"], df["ontology_id"]))
df["parents"] = df["parents"].apply(title_to_ontology.get)
df.set_index("ontology_id", inplace=True)
df
| ontology_id | code | name | parents |
|---|---|---|---|
| 1435254666 | NaN | Certain infectious or parasitic diseases | None |
| 588616678 | NaN | Gastroenteritis or colitis of infectious origin | 1435254666 |
| 135352227 | NaN | Bacterial intestinal infections | 588616678 |
| 257068234 | 1A00 | Cholera | 135352227 |
| 416025325 | 1A01 | Intestinal infection due to other Vibrio | 135352227 |
| ... | ... | ... | ... |
| 1956913761 | XD36Q1 | Infusion Pumps, Syringe | 1529373361 |
| 783787054 | XD1N14 | Infusion Pumps, Syringe, Nuclear Magnetic Reso... | 1529373361 |
| 1524741217 | XD80Z7 | Medical/medicinal gas systems and relative acc... | 1838822834 |
| 280385798 | XD4U38 | General purpose electrocardiographs | 1838822834 |
| 1799393163 | XD6UU3 | Oxygen Concentrators | 1838822834 |
35574 rows × 3 columns
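The parent lookup above counts every hyphen in a title and locates each row with list.index, so a hyphen inside a title would shift the computed depth, and a title that occurs twice would always resolve to its first occurrence. It also rescans the list for every row. The sketch below is one possible alternative, not the method used to produce the table above: it reads the depth only from the leading "- " prefix and tracks ancestors by row position in a single pass. The names split_depth and parent_positions are hypothetical, and the last sample title is invented to show the hyphen case.

```python
import re
from typing import List, Optional, Tuple


def split_depth(title: str) -> Tuple[int, str]:
    """Return (depth, clean_name), counting only the leading '- ' markers."""
    match = re.match(r"^((?:- )*)(.*)$", title)
    prefix, name = match.group(1), match.group(2)
    return len(prefix) // 2, name.strip()


def parent_positions(titles: List[str]) -> List[Optional[int]]:
    """For each row, return the row position of its parent (None for roots)."""
    stack: List[Tuple[int, int]] = []  # (depth, position) of the open ancestors
    parents: List[Optional[int]] = []
    for pos, title in enumerate(titles):
        depth, _ = split_depth(title)
        # Drop entries that are at the same depth or deeper than the current row
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parents.append(stack[-1][1] if stack else None)
        stack.append((depth, pos))
    return parents


titles = [
    "Certain infectious or parasitic diseases",
    "- Gastroenteritis or colitis of infectious origin",
    "- - Bacterial intestinal infections",
    "- - - Cholera",
    "- - - Intestinal infection due to other Vibrio",
    "- - - Made-up hyphenated title",  # invented row, not from the spreadsheet
]
print(parent_positions(titles))  # [None, 0, 1, 2, 2, 2]
```

Working with row positions instead of titles means the parent's ontology_id can be looked up directly by position, without going through a title-to-ID dictionary in which duplicate titles would overwrite each other.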
df.to_parquet("df_human__icd__icd-11-2023__Disease.parquet.parquet")
from bionty.dev._md5 import calculate_md5
calculate_md5("df_human__icd__icd-11-2023__Disease.parquet.parquet")
❗ You are running 3.11.5
Only python versions 3.8~3.10 are currently tested, use at your own risk.
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!
if you see this message repeatedly, run: bt.reset_sources()
'16263aef644d2c62c47b7b1ecfbad9d6'
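If the bionty helper is not available, a file checksum can also be computed with the standard library. This is a minimal sketch; file_md5 is a hypothetical name, and whether it reproduces the value returned by calculate_md5 depends on how that helper reads the file, which is assumed here rather than confirmed.

```python
import hashlib
import os
import tempfile


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file instead of the real parquet file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_digest = file_md5(tmp.name)
os.remove(tmp.name)
print(demo_digest)  # 5d41402abc4b2a76b9719d911017c592
```

Calling file_md5 on the parquet file written above would give a digest to compare against the one printed by calculate_md5.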