This repository contains the code and dataset for Task 1 of the Advanced Machine Learning course at ETH. The goal of this task is to predict a person’s age based on brain MRI data.
Magnetic Resonance Imaging (MRI) is a key technology in medical imaging. It uses a magnetic field and radio waves to produce detailed 3D anatomical images. MRI is non-invasive and does not use ionizing radiation, making it a preferred method for investigating sensitive organs like the brain.
Aging does not affect everyone uniformly. Individual rates of aging are influenced by environmental, genetic, and epigenetic factors. Brain MRI studies have shown a relationship between accelerated aging and brain atrophy. Predicting brain age can improve early diagnosis and risk assessments for neurodegenerative and neuropsychiatric diseases such as Alzheimer’s, Parkinson’s, and Huntington’s diseases.
- Feature count: approximately 830 features
- Dataset Size: Approximately 1200 samples
The features for this project are around 800 anatomical measurements derived from the MRI image data using FreeSurfer. The dataset includes:
- X_train.csv: Training features
- y_train.csv: Training labels (ages)
- X_test.csv: Testing features
- sample.csv: Sample submission file
- Background: The dataset contains missing values represented as NaNs.
- Requirement: Fill the missing values in the training and test sets using suitable imputation methods (mean, median, most frequent, etc.).
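The imputation step above could be sketched as follows. This is a minimal example, not the notebook's actual method: it uses synthetic arrays in place of `X_train.csv` and `X_test.csv`, and median imputation is only one of the listed options. The statistics are fitted on the training set alone and then reused for the test set, so no test information leaks into preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Synthetic stand-ins for the real feature matrices (assumption: the actual
# data would be loaded from X_train.csv and X_test.csv instead).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
X_test = rng.normal(size=(40, 5))

# Knock out roughly 10% of the entries to mimic the NaNs in the dataset.
X_train[rng.random(X_train.shape) < 0.1] = np.nan
X_test[rng.random(X_test.shape) < 0.1] = np.nan

# Learn per-feature medians from the training set only, then apply the
# same medians to both the training and the test set.
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
```

Swapping `strategy="median"` for `"mean"` or `"most_frequent"` gives the other imputation methods mentioned above.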
- Background: The training set contains outliers.
- Requirement: Build an outlier detection model to classify samples in the training set as outliers or non-outliers.
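One possible way to meet the outlier detection requirement is scikit-learn's `IsolationForest`. The sketch below is illustrative only: the data is synthetic, the choice of detector is an assumption, and the `contamination` value of 0.05 is a guess that would need tuning on the real data. It assumes missing values have already been imputed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training data: 200 ordinary samples plus 10 samples placed far
# from the rest (assumption: stands in for the imputed X_train / y_train).
rng = np.random.default_rng(0)
X_regular = rng.normal(size=(200, 5))
X_far = rng.normal(loc=8.0, size=(10, 5))
X_train = np.vstack([X_regular, X_far])
y_train = rng.uniform(40, 90, size=len(X_train))

# fit_predict returns +1 for inliers and -1 for outliers.
detector = IsolationForest(contamination=0.05, random_state=0)
labels = detector.fit_predict(X_train)
inlier_mask = labels == 1

# Keep only the samples classified as non-outliers, in features and labels.
X_train_clean = X_train[inlier_mask]
y_train_clean = y_train[inlier_mask]
```

Only the training set is filtered; the test set must be kept whole so that every test sample still receives a prediction.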
- Background: The dataset includes additional manual features, so not every feature is useful for prediction.
- Requirement: Use feature selection methods to label each feature as selected or unselected, where unselected features are the irrelevant or redundant ones.
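A simple way to produce such selected/unselected labels is univariate selection with `SelectKBest`. The sketch below uses synthetic data in which only the first three features carry signal; the choice of `f_regression` and of `k=5` are assumptions for illustration. Note that univariate scoring targets irrelevant features and does not by itself remove redundant (highly correlated) ones.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption: stands in for the cleaned training set).
# Only features 0, 1 and 2 influence the target; the rest are noise.
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X_train = rng.normal(size=(n_samples, n_features))
y_train = (
    60
    + 5 * X_train[:, 0]
    - 3 * X_train[:, 1]
    + 2 * X_train[:, 2]
    + rng.normal(scale=0.5, size=n_samples)
)

# Score every feature against the target and keep the k highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=5)
X_train_selected = selector.fit_transform(X_train, y_train)

# Boolean label per feature: True = selected, False = unselected.
selected_mask = selector.get_support()
```

The same fitted `selector` can then be applied to the test features with `selector.transform(...)` so that both sets keep identical columns.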
- Background: After preprocessing and dimensionality reduction, the main task is regression-based age prediction.
- Requirement: Use suitable regression methods to predict the age from brain MRI features.
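The regression step could be prototyped as a single scikit-learn pipeline evaluated with cross-validated R². This is a sketch under assumptions: the data is synthetic, and ridge regression is just a baseline choice, not necessarily the model used in `task1.ipynb`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and ages (assumption: stand-ins for the real CSV data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 10))
y_train = 65 + 4 * X_train[:, 0] - 2 * X_train[:, 1] + rng.normal(size=300)
X_test = rng.normal(size=(50, 10))

# Imputation and scaling live inside the pipeline, so they are re-fitted on
# each cross-validation training fold and never see the held-out fold.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    Ridge(alpha=1.0),
)

# 5-fold cross-validation with R^2 as the score.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")

# Fit on all training data and predict ages for the test set.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```

The mean of `cv_scores` gives an estimate of how the model would score on unseen data before any submission is made.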
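The main evaluation metric is the Coefficient of Determination (R²). It measures the proportion of variance in the dependent variable (age) that is predictable from the independent variables (the MRI features). Scores range from 1 (perfect prediction) down to negative infinity; a model that always predicts the mean of the true ages scores 0.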
from sklearn.metrics import r2_score
score = r2_score(y_true, y_pred)
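To make the metric concrete, R² equals 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The helper `r2_manual` below is a hypothetical name introduced for illustration; it reproduces that formula on a toy example and shows how a poor prediction yields a negative score.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot (illustrative helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [60, 70, 80]

close_fit = r2_manual(y_true, [62, 70, 77])    # 1 - 13/200 = 0.935
mean_only = r2_manual(y_true, [70, 70, 70])    # always predicting the mean: 0.0
reversed_fit = r2_manual(y_true, [80, 70, 60])  # 1 - 800/200 = -3.0

# The manual value agrees with scikit-learn's implementation.
sklearn_value = r2_score(y_true, [62, 70, 77])
```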
- X_train.csv: Training features
- y_train.csv: Training labels (ages)
- X_test.csv: Testing features
- sample.csv: Sample submission file
- task1.ipynb: Main task solution file
- requirements.txt: Python dependencies required to run the notebook