ICD 11

The data was obtained on 2024-01-22 (version 01/2023) by clicking on info -> spreadsheet file on https://icd.who.int/browse11/l-m/en.

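The spreadsheet has no dedicated ontology_id column, so we derive one from the numeric entity ID at the end of the linearization URI. Some URIs end in /other or /unspecified, which mark the "other" and "unspecified" variants of an entity. We keep these rows and encode the variant by appending o or u to the ID, respectively.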

import pandas as pd
import re
df = pd.read_excel("icd_11.xlsx")
df = df[["Linearization (release) URI", "Code", "Title"]]
df.head()
   Linearization (release) URI                        Code  Title
0  http://id.who.int/icd/release/11/2023-01/mms/1...  NaN   Certain infectious or parasitic diseases
1  http://id.who.int/icd/release/11/2023-01/mms/5...  NaN   - Gastroenteritis or colitis of infectious origin
2  http://id.who.int/icd/release/11/2023-01/mms/1...  NaN   - - Bacterial intestinal infections
3  http://id.who.int/icd/release/11/2023-01/mms/2...  1A00  - - - Cholera
4  http://id.who.int/icd/release/11/2023-01/mms/4...  1A01  - - - Intestinal infection due to other Vibrio
df.rename(
    columns={"Code": "code", "Title": "name", "Linearization (release) URI": "URI"},
    inplace=True,
)
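def extract_code(url: str) -> str:
    """Return the numeric entity ID at the end of a linearization URI.

    A trailing /other or /unspecified is encoded as an "o" or "u" suffix.
    """
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if not match:
        # Sentinel for URIs that do not end in a numeric ID
        return "No code found"
    code = match.group(1)
    suffix = match.group(2)
    if suffix == "other":
        code += "o"
    elif suffix == "unspecified":
        code += "u"
    return code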
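As a quick check of the suffix handling, the sketch below applies the same regular expression to a few URIs. The numeric ID 123456789 is a made-up placeholder rather than a real ICD-11 entity, and uri_to_id is a hypothetical name for a compact restatement of the function above.

```python
import re


def uri_to_id(url: str) -> str:
    # Same pattern as above: trailing numeric ID, optional /other or /unspecified
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if not match:
        return "No code found"
    suffix = {"other": "o", "unspecified": "u", None: ""}[match.group(2)]
    return match.group(1) + suffix


# Hypothetical entity ID, used only for illustration
base = "http://id.who.int/icd/release/11/2023-01/mms/123456789"
samples = [base, base + "/other", base + "/unspecified", "not-a-uri"]
results = {uri: uri_to_id(uri) for uri in samples}
for uri, code in results.items():
    print(uri, "->", code)
```

Note that the sentinel string is shared by every URI that fails to match, so it is worth checking that it does not appear in the final index.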
# Find the parent of each term: the nearest preceding term one level up.
# Depth is the number of leading "-" markers; hyphens inside a title are ignored.
def find_parents(all_terms):
    last_at_depth = {}  # depth -> most recent term (markers stripped) at that depth
    parents = []
    for term in all_terms:
        prefix = term[: len(term) - len(term.lstrip("- "))]
        depth = prefix.count("-")
        parents.append(last_at_depth.get(depth - 1))
        last_at_depth[depth] = term.strip("- ").strip()
    return parents


# Single pass over the titles, which are in the spreadsheet's hierarchy order
df["parents"] = find_parents(df["name"].tolist())
df["ontology_id"] = df["URI"].apply(extract_code)
df.drop("URI", inplace=True, axis=1)
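# Strip only the depth markers; hyphens inside a title are kept so that
# names still match the parent titles computed above
df["name"] = df["name"].str.strip("- ")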
title_to_ontology = dict(zip(df["name"], df["ontology_id"]))

df["parents"] = df["parents"].apply(title_to_ontology.get)
df.set_index("ontology_id", inplace=True)
df
ontology_id  code    name                                               parents
1435254666   NaN     Certain infectious or parasitic diseases           None
588616678    NaN     Gastroenteritis or colitis of infectious origin    1435254666
135352227    NaN     Bacterial intestinal infections                    588616678
257068234    1A00    Cholera                                            135352227
416025325    1A01    Intestinal infection due to other Vibrio           135352227
...          ...     ...                                                ...
1956913761   XD36Q1  Infusion Pumps, Syringe                            1529373361
783787054    XD1N14  Infusion Pumps, Syringe, Nuclear Magnetic Reso...  1529373361
1524741217   XD80Z7  Medical/medicinal gas systems and relative acc...  1838822834
280385798    XD4U38  General purpose electrocardiographs                1838822834
1799393163   XD6UU3  Oxygen Concentrators                               1838822834

35574 rows × 3 columns
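
The parent lookup above goes through titles, so if two entities share a title, the title-to-ID dictionary keeps only the last one and both sets of children resolve to the same parent. A minimal sketch of an alternative, assuming the rows are in the spreadsheet's hierarchy order, is to track the most recent ID at each depth and assign parents by ID directly. The toy rows, the made-up IDs, and the helper name assign_parents are illustrative and not part of the original pipeline.

```python
def assign_parents(rows):
    """Map each ontology_id to its parent's ontology_id (or None).

    rows: list of (ontology_id, dash-prefixed title) in hierarchy order.
    """
    last_at_depth = {}  # depth -> ID of the most recent row at that depth
    parents = {}
    for ontology_id, title in rows:
        # Only the leading markers count towards depth
        prefix = title[: len(title) - len(title.lstrip("- "))]
        depth = prefix.count("-")
        parents[ontology_id] = last_at_depth.get(depth - 1)
        last_at_depth[depth] = ontology_id
    return parents


# Toy data with made-up IDs
rows = [
    ("1", "Chapter"),
    ("2", "- Block"),
    ("3", "- - Sub-type A"),  # inner hyphen must not change the depth
    ("4", "- - Sub-type B"),
    ("5", "- Block"),         # duplicate title, different entity
    ("6", "- - Leaf"),
]
parent_of = assign_parents(rows)
print(parent_of)
```

Because parents are keyed by ID, the second "Block" gets its own children even though its title is not unique.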

df.to_parquet("df_human__icd__icd-11-2023__Disease.parquet.parquet")
from bionty.dev._md5 import calculate_md5

calculate_md5("df_human__icd__icd-11-2023__Disease.parquet.parquet")
❗ You are running 3.11.5
Only python versions 3.8~3.10 are currently tested, use at your own risk.
✅ wrote new records from public sources.yaml to /home/zeth/.lamin/bionty/versions/sources_local.yaml!

if you see this message repeatedly, run: bt.reset_sources()
'16263aef644d2c62c47b7b1ecfbad9d6'
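
For reference, a file checksum of this kind can be reproduced with the standard library alone. This is a sketch that assumes calculate_md5 returns the plain MD5 hex digest of the file's bytes, which nothing above confirms. file_md5 is a hypothetical helper name.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file rather than the real parquet output
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_md5 = file_md5(tmp.name)
os.remove(tmp.name)
print(demo_md5)
```

Reading in chunks keeps memory use flat for large files; the digest is the same as hashing the whole content at once.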