ICD 11

The data was obtained on 2024-01-22 (version 01/2023) from https://icd.who.int/browse11/l-m/en by clicking Info -> Spreadsheet file.

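There isn't a proper ontology_id, so we'll use the numeric ID at the end of the linearization URI. Note that a URI can additionally end in /other or /unspecified, which mark the "other specified" and "unspecified" variants of an entity. We keep these entries but shorten the suffixes to o and u, respectively.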

import pandas as pd
import re

df = pd.read_excel("icd_11.xlsx")

df = df[["Linearization (release) URI", "Code", "Title"]]
df.head()

	Linearization (release) URI	Code	Title
0	http://id.who.int/icd/release/11/2023-01/mms/1...	NaN	Certain infectious or parasitic diseases
1	http://id.who.int/icd/release/11/2023-01/mms/5...	NaN	- Gastroenteritis or colitis of infectious origin
2	http://id.who.int/icd/release/11/2023-01/mms/1...	NaN	- - Bacterial intestinal infections
3	http://id.who.int/icd/release/11/2023-01/mms/2...	1A00	- - - Cholera
4	http://id.who.int/icd/release/11/2023-01/mms/4...	1A01	- - - Intestinal infection due to other Vibrio

df.rename(
    columns={"Code": "code", "Title": "name", "Linearization (release) URI": "URI"},
    inplace=True,
)

def extract_code(url: str) -> str:
    match = re.search(r"/(\d+)(?:/(other|unspecified))?$", url)
    if match:
        code = match.group(1)
        suffix = match.group(2)
        if suffix == "other":
            code += "o"
        elif suffix == "unspecified":
            code += "u"
        return code
    else:
        return "No code found"
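
As a quick sanity check of the suffix handling, the sketch below applies the same regular expression to three URIs. The numeric ID 123456789 is made up for illustration and is not a real ICD-11 entity; the names pattern and suffix_map are likewise only for this example.

```python
import re

# Same pattern as in extract_code; the numeric ID below is made up for illustration
pattern = re.compile(r"/(\d+)(?:/(other|unspecified))?$")
suffix_map = {None: "", "other": "o", "unspecified": "u"}

examples = [
    "http://id.who.int/icd/release/11/2023-01/mms/123456789",
    "http://id.who.int/icd/release/11/2023-01/mms/123456789/other",
    "http://id.who.int/icd/release/11/2023-01/mms/123456789/unspecified",
]

ids = []
for uri in examples:
    m = pattern.search(uri)
    # group(2) is None when the URI has no /other or /unspecified suffix
    ids.append(m.group(1) + suffix_map[m.group(2)])

print(ids)
```

The three URIs map to the plain ID, the ID followed by o, and the ID followed by u.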

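# Finding the parent for each term
def leading_depth(title: str) -> int:
    # Depth is encoded as repeated "- " markers at the start of the title;
    # hyphens inside the title itself (e.g. "Non-...") must not be counted
    return len(re.match(r"(?:- )*", title).group(0)) // 2


def find_parents(titles):
    parents = []
    last_seen = {}  # depth -> most recent cleaned title at that depth
    for title in titles:
        depth = leading_depth(title)
        last_seen[depth] = title[2 * depth:].strip()
        # The parent is the nearest preceding term that is one level shallower
        parents.append(last_seen.get(depth - 1))
    return parents


# Single pass over the titles in row order, so duplicate titles are handled by position
df["parents"] = find_parents(df["name"].tolist())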
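
To illustrate how the dash prefix encodes the hierarchy, here is a small standalone sketch on made-up titles (not real ICD-11 rows). The helper depth_of and the dictionaries latest and parent_of are hypothetical names used only here. One title deliberately contains a hyphen to show why only the leading markers should count towards the depth.

```python
import re

# Made-up titles in the spreadsheet's "- " prefix style (not real ICD-11 rows)
titles = [
    "Chapter A",
    "- Block A1",
    "- - Non-specific category",
    "- - Another category",
    "- Block A2",
    "- - Category under A2",
]


def depth_of(title):
    # Count only the leading "- " markers, not hyphens inside the title
    return len(re.match(r"(?:- )*", title).group(0)) // 2


latest = {}     # depth -> most recent title seen at that depth
parent_of = {}  # cleaned title -> cleaned parent title (None for top level)
for title in titles:
    d = depth_of(title)
    name = title[2 * d:]
    latest[d] = name
    parent_of[name] = latest.get(d - 1)

for name, parent in parent_of.items():
    print(f"{name!r} -> {parent!r}")
```

Each term is paired with the closest preceding term that sits one level higher, and top-level terms get None.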

df["ontology_id"] = df["URI"].apply(extract_code)

df.drop("URI", inplace=True, axis=1)

# Remove only the leading "- " depth markers; hyphens inside a title are kept
df["name"] = df["name"].str.replace(r"^(?:- )+", "", regex=True).str.strip()

title_to_ontology = dict(zip(df["name"], df["ontology_id"]))

df["parents"] = df["parents"].apply(title_to_ontology.get)

df.set_index("ontology_id", inplace=True)

df

	code	name	parents
ontology_id
1435254666	NaN	Certain infectious or parasitic diseases	None
588616678	NaN	Gastroenteritis or colitis of infectious origin	1435254666
135352227	NaN	Bacterial intestinal infections	588616678
257068234	1A00	Cholera	135352227
416025325	1A01	Intestinal infection due to other Vibrio	135352227
...	...	...	...
1956913761	XD36Q1	Infusion Pumps, Syringe	1529373361
783787054	XD1N14	Infusion Pumps, Syringe, Nuclear Magnetic Reso...	1529373361
1524741217	XD80Z7	Medical/medicinal gas systems and relative acc...	1838822834
280385798	XD4U38	General purpose electrocardiographs	1838822834
1799393163	XD6UU3	Oxygen Concentrators	1838822834

35574 rows × 3 columns

df.to_parquet("df_human__icd__icd-11-2023__Disease.parquet")

from bionty.dev._md5 import calculate_md5

calculate_md5("df_human__icd__icd-11-2023__Disease.parquet")

'16263aef644d2c62c47b7b1ecfbad9d6'
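
If bionty is not available, a checksum can also be computed with the standard library alone. The sketch below assumes that a plain MD5 over the file's bytes is what is wanted; whether it matches calculate_md5 exactly has not been verified here. The helper name md5_of_file is hypothetical, and the demo hashes a throwaway temporary file rather than the parquet file.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    # Hypothetical helper: read in chunks so large files need not fit in memory
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the sketch runs without the parquet file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"icd-11 demo bytes")
demo_digest = md5_of_file(tmp.name)
os.remove(tmp.name)

print(demo_digest)
```

Passing the path of the parquet file written above to md5_of_file would give its checksum in the same way.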