Identifiable Genes and Genetic Disorders

Motivation:

Problem

As scientific advancements accelerate, a greater focus is being placed on the genetic basis of everything we know, and don't know, about humans and medicine. Many life-altering disorders are caused by a mutation in DNA or a change in chromosomal structure. With a growing population and more widespread detection tools, the number of diagnosed genetic disorders is also increasing rapidly. Low awareness of genetic testing makes the problem worse, because disorders may go undetected.

Solution

Advances in genetic testing resources and technology mean that a library of genes can be connected to patterns of genetic disorders associated with particular combinations of genetic mutations and "activation." Using the available gene information combined with other patient information, the goal is to predict which genetic disorder a patient is likely to have.

Impact

If successful, genetic testing can help predict the likely genetic disorder and its subclass, which can be a first step towards therapies or research advancement. It can also help reveal patterns that need more samples or that have not been considered yet, widening the testing range.

Dataset

Detail

We will use a Kaggle dataset of genetic disorders and the genes behind them, which records:

  • Patient Age
  • Genes inherited on the mother's side
  • Genes inherited on the father's side
  • Maternal/Paternal Gene (Yes/No)
  • Blood Cell Count
  • Patient Name
  • Family Information (Parent age, etc.)
  • Status
  • Respiratory/Heart Rate
  • Gender
  • H/O serious maternal illness
  • Assisted Conception
  • Anomaly history of other pregnancies
  • Birth Defects
  • White Blood Cell Count
  • More...
In [1]:
import pandas as pd
pd.read_csv('train_genetic_disorders.csv')
Out[1]:
| | Patient Id | Patient Age | Genes in mother's side | Inherited from father | Maternal gene | Paternal gene | Blood cell count (mcL) | Patient First Name | Family Name | Father's name | ... | Birth defects | White Blood cell count (thousand per microliter) | Blood test result | Symptom 1 | Symptom 2 | Symptom 3 | Symptom 4 | Symptom 5 | Genetic Disorder | Disorder Subclass |
| 0 | PID0x6418 | 2.0 | Yes | No | Yes | No | 4.760603 | Richard | NaN | Larre | ... | NaN | 9.857562 | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | Mitochondrial genetic inheritance disorders | Leber's hereditary optic neuropathy |
| 1 | PID0x25d5 | 4.0 | Yes | Yes | No | No | 4.910669 | Mike | NaN | Brycen | ... | Multiple | 5.522560 | normal | 1.0 | NaN | 1.0 | 1.0 | 0.0 | NaN | Cystic fibrosis |
| 2 | PID0x4a82 | 6.0 | Yes | No | No | No | 4.893297 | Kimberly | NaN | Nashon | ... | Singular | NaN | normal | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | Multifactorial genetic inheritance disorders | Diabetes |
| 3 | PID0x4ac8 | 12.0 | Yes | No | Yes | No | 4.705280 | Jeffery | Hoelscher | Aayaan | ... | Singular | 7.919321 | inconclusive | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | Mitochondrial genetic inheritance disorders | Leigh syndrome |
| 4 | PID0x1bf7 | 11.0 | Yes | No | NaN | Yes | 4.720703 | Johanna | Stutzman | Suave | ... | Multiple | 4.098210 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | Multifactorial genetic inheritance disorders | Cancer |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 22078 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22079 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22080 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22081 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22082 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

22083 rows × 45 columns
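The preview above ends in rows where every displayed column is NaN. A minimal sketch of how such rows could be dropped before modelling is shown below. It assumes those rows are empty in all 45 columns, not only the displayed ones; the helper name `drop_empty_rows` and the toy frame are hypothetical and stand in for the real file.

```python
import pandas as pd


def drop_empty_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Remove rows in which every column is missing, then renumber the index."""
    return df.dropna(how="all").reset_index(drop=True)


# Toy frame (made-up stand-in for train_genetic_disorders.csv):
# the last row is entirely missing, like the trailing rows in the preview.
toy = pd.DataFrame(
    {
        "Patient Id": ["PID0x6418", "PID0x25d5", None],
        "Patient Age": [2.0, 4.0, None],
    }
)

cleaned = drop_empty_rows(toy)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

On the real data the same call would be `drop_empty_rows(pd.read_csv('train_genetic_disorders.csv'))`. Rows that are only partly missing are kept, since `how="all"` removes a row only when every value in it is missing.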

In [2]:
pd.read_csv('test_genetic_disorders.csv')
Out[2]:
| | Patient Id | Patient Age | Genes in mother's side | Inherited from father | Maternal gene | Paternal gene | Blood cell count (mcL) | Patient First Name | Family Name | Father's name | ... | History of anomalies in previous pregnancies | No. of previous abortion | Birth defects | White Blood cell count (thousand per microliter) | Blood test result | Symptom 1 | Symptom 2 | Symptom 3 | Symptom 4 | Symptom 5 |
| 0 | PID0x4175 | 6.0 | No | Yes | No | No | 4.981655 | Charles | NaN | Kore | ... | -99 | 2.0 | Multiple | -99.000000 | slightly abnormal | True | True | True | True | True |
| 1 | PID0x21f5 | 10.0 | Yes | No | NaN | Yes | 5.118890 | Catherine | NaN | Homero | ... | Yes | -99.0 | Multiple | 8.179584 | normal | False | False | False | True | False |
| 2 | PID0x49b8 | 5.0 | No | NaN | No | No | 4.876204 | James | NaN | Danield | ... | No | 0.0 | Singular | -99.000000 | slightly abnormal | False | False | True | True | False |
| 3 | PID0x2d97 | 13.0 | No | Yes | Yes | No | 4.687767 | Brian | NaN | Orville | ... | Yes | -99.0 | Singular | 6.884071 | normal | True | False | True | False | True |
| 4 | PID0x58da | 5.0 | No | NaN | NaN | Yes | 5.152362 | Gary | NaN | Issiah | ... | No | -99.0 | Multiple | 6.195178 | normal | True | True | True | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9458 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9459 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9460 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9461 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9462 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

9463 rows × 43 columns
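The test preview shows the value -99 in several columns, both as text (`-99`) and as a number (`-99.0`). The sketch below assumes, without confirmation from the dataset description, that -99 is a placeholder for a missing value, and converts it to NaN so it is not read as a real count or measurement. The names `SENTINELS` and `mask_sentinels` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: these spellings of -99 all mean "value not recorded".
SENTINELS = [-99, "-99", "-99.0"]


def mask_sentinels(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the assumed -99 placeholders replaced by NaN."""
    return df.replace(SENTINELS, np.nan)


# Toy frame (made-up stand-in for test_genetic_disorders.csv),
# using two column names visible in the preview.
toy = pd.DataFrame(
    {
        "History of anomalies in previous pregnancies": ["-99", "Yes", "No"],
        "No. of previous abortion": [2.0, -99.0, 0.0],
    }
)

masked = mask_sentinels(toy)
print(masked.isna().sum())
```

If -99 turns out to carry a different meaning in some column, that column should be excluded from the replacement.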

We will use the training CSV to establish patterns and predict genetic disorders for the test data. Furthermore, we could cluster this data into different disorder types and provide first steps towards either therapies or research advancement based on this clustering.
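As a rough illustration of the prediction step, the sketch below fits a baseline classifier on a few columns and predicts the `Genetic Disorder` label. The notebook does not name a model, so the random forest, the choice of three feature columns, the helper names `build_model` and `fit_baseline`, and the toy rows are all assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

TARGET = "Genetic Disorder"
NUMERIC = ["Patient Age", "Blood cell count (mcL)"]  # assumed feature subset
CATEGORICAL = ["Maternal gene"]                      # assumed feature subset


def build_model(numeric_cols, categorical_cols):
    """Impute missing values, one-hot encode categories, then classify."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("classify", RandomForestClassifier(n_estimators=50, random_state=0)),
    ])


def fit_baseline(train_df):
    """Fit on rows that have a label; unlabelled rows cannot be used for training."""
    labelled = train_df.dropna(subset=[TARGET])
    model = build_model(NUMERIC, CATEGORICAL)
    model.fit(labelled[NUMERIC + CATEGORICAL], labelled[TARGET])
    return model


# Toy rows (made up) standing in for the real CSV files.
toy_train = pd.DataFrame({
    "Patient Age": [2.0, 4.0, 6.0, 12.0, 11.0, np.nan],
    "Blood cell count (mcL)": [4.76, 4.91, 4.89, 4.71, 4.72, 4.80],
    "Maternal gene": ["Yes", "No", "No", "Yes", np.nan, "Yes"],
    TARGET: [
        "Mitochondrial genetic inheritance disorders",
        np.nan,
        "Multifactorial genetic inheritance disorders",
        "Mitochondrial genetic inheritance disorders",
        "Multifactorial genetic inheritance disorders",
        "Mitochondrial genetic inheritance disorders",
    ],
})
toy_test = pd.DataFrame({
    "Patient Age": [6.0, np.nan],
    "Blood cell count (mcL)": [4.98, 5.12],
    "Maternal gene": ["No", np.nan],
})

model = fit_baseline(toy_train)
predictions = model.predict(toy_test[NUMERIC + CATEGORICAL])
print(predictions)
```

Training rows without a label are dropped because the preview shows at least one row with a missing `Genetic Disorder` value. Before relying on predictions from the real data, accuracy should be checked on a held-out part of the training set.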