7 MedTech Datasets to Diagnose Diseases with Machine Learning

Healthcare Hackathon Ideas

Kicking off our dataset series, we will explore some open, public datasets related to healthcare. These datasets are perfect for anyone interested in health who is looking for a new machine learning project. Whether you are a beginner, intermediate, or expert, there is a dataset here that you will find interesting.

For each dataset, I will also suggest a task and a potential approach to consider. This way, even if you need to do some additional learning, you have an initial direction. The datasets I share are perfect for any kind of data science or healthcare hackathon and will let you explore practical applications of machine learning.

MedTech is one of the coolest applications of machine learning in my opinion. We are already seeing an explosion of machine learning tools in the health field. For example, huge companies like Google and IBM have their own subteams dedicated to using machine learning for cancer diagnosis. Google Health has lofty goals including averting blindness and developing natural human-computer interfaces.

Disease Datasets

Numerical Datasets

Heart Attack Analysis & Prediction Dataset: This dataset includes age, sex, chest pain type, resting blood pressure, and serum cholesterol, along with other factors that can be used to predict a given participant's heart disease diagnosis.

Task: Predict whether each participant is prone to heart disease (angiographic disease status)

Approach: Beginners can try using a decision tree with the library scikit-learn. For a more involved approach to the classification problem, try writing a neural network in Keras for binary classification. If you are already comfortable with those, you can train gradient boosting models such as XGBoost and LightGBM and compare them to see which has the best results.
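Here is a minimal sketch of the beginner route with scikit-learn. I am assuming a table of numeric features and a 0/1 diagnosis label; the data below is synthetic (generated with make_classification) and only stands in for the real CSV, so the printed accuracy says nothing about the actual dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real table: 300 rows, 10 numeric features,
# binary target (1 = disease, 0 = no disease). Replace with the actual data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Hold out 25% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A shallow tree is easier to read and less likely to memorize the training set.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

The max_depth value of 4 is an arbitrary starting point; trying a few depths and comparing held-out accuracy is a reasonable first experiment.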

Stroke Prediction Dataset: This dataset includes 11 predictors of a stroke including various diseases and smoking status. Since stroke is the 2nd leading cause of death globally, this is super relevant from a medical perspective.

Task: Predict which patients will have a stroke and which ones will not.

Approach: Beginners can try using a decision tree or logistic regression. Intermediate readers who are familiar with supervised learning but want practice with feature engineering and data preprocessing will enjoy this dataset.
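To give a feel for the preprocessing side, here is a rough sketch of a scikit-learn pipeline that imputes missing values, scales numeric columns, one-hot encodes a categorical column, and then fits a logistic regression. The column names (age, bmi, smoking_status, stroke) and all the values are made up for illustration; check the real CSV for its actual columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical column names; swap in the real data.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "bmi": rng.normal(28, 5, size=n),
    "smoking_status": rng.choice(["never", "formerly", "smokes"], size=n),
})
df.loc[rng.choice(n, size=10, replace=False), "bmi"] = np.nan  # simulate gaps
# Made-up target loosely tied to age so the model has something to learn.
df["stroke"] = (df["age"] + rng.normal(0, 15, size=n) > 70).astype(int)

numeric_cols = ["age", "bmi"]
categorical_cols = ["smoking_status"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df[numeric_cols + categorical_cols]
y = df["stroke"]
model.fit(X, y)

# Predicted probability of the positive class for each row.
probabilities = model.predict_proba(X)[:, 1]
print(probabilities[:5])
```

For brevity this fits and predicts on the same rows. On the real data you would split off a test set first, and keeping the preprocessing inside the pipeline means it is fit only on the training portion.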

MRI and Alzheimers: This dataset contains MRI comparisons between patients with and without dementia. The features include Estimated Total Intracranial Volume, Whole Brain Volume, and many more.

Task: Predict which patients have dementia and which do not

Approach: This problem is well suited for beginners. A decision tree or a random forest would perform quite well on this dataset. Additionally, more experienced readers can try to implement the decision tree from scratch, or look into methods for visualizing the results of a scikit-learn decision tree.
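If you want to try the from-scratch route, the core of a decision tree is choosing the split that leaves the purest groups. Below is a small sketch of that one step using Gini impurity on a single feature. The function names and the tiny example values are mine, not from the dataset, and a full tree would apply this search recursively across all features.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: chance that two draws (with replacement) have different labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best `value <= threshold` split."""
    best_threshold, best_impurity = None, gini(labels)
    n = len(labels)
    # Skip the largest value: splitting there would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Made-up example: one hypothetical volume measurement and a binary label.
volumes = [0.68, 0.70, 0.71, 0.74, 0.76, 0.79]
has_dementia = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(volumes, has_dementia)
print(threshold, impurity)  # 0.71 0.0
```

This sketch uses observed values as thresholds to keep things short; library implementations typically split at midpoints between neighboring values instead.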

Picture Datasets

Chest X-Ray Images (Pneumonia): More than 5,000 labeled chest X-ray images of patients with and without pneumonia.

Task: Perform image classification to decide which patients have pneumonia.

Approach: A convolutional neural network would perform best here. However, beginners may have a hard time working with image data since scikit-learn does not offer support for CNNs. This is a good dataset for intermediate and advanced readers who are interested in computer vision.
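As a starting point, here is a sketch of a small CNN for binary image classification in Keras. I am assuming the X-rays have been resized to 128x128 grayscale; that size, the layer widths, and the function name build_pneumonia_cnn are illustrative choices of mine rather than tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_pneumonia_cnn(input_shape=(128, 128, 1)):
    """Small CNN for binary classification; sizes are illustrative, not tuned."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Rescaling(1.0 / 255),             # map pixels from [0, 255] to [0, 1]
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),         # one value per feature map
        layers.Dropout(0.3),                     # light regularization
        layers.Dense(1, activation="sigmoid"),   # probability of pneumonia
    ])
    model.compile(
        optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"]
    )
    return model


model = build_pneumonia_cnn()
model.summary()
```

To train it, you would pass batches of images and 0/1 labels to model.fit. If the images sit in one folder per class, tf.keras.utils.image_dataset_from_directory with color_mode="grayscale" is one way to load them.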

Skin Cancer MNIST: HAM10000: Sticking with the image theme, this dataset has labeled images of seven types of pigmented skin lesions, including melanoma, basal cell carcinoma, and vascular lesions. Tons of practical applications here, but maybe not for the squeamish.

Task: Perform image classification to decide which type of lesion each image shows.

Approach: This is a multi-class classification problem using images. A convolutional neural network is again the best fit. Try implementing one in Keras (TensorFlow) and see if you can beat 70% accuracy.

Breast Histopathology Images: This dataset includes labeled images of regions with and without Invasive Ductal Carcinoma (IDC), the most common type of breast cancer.

Task: Classify each image tile of the tissue sample as containing IDC or not.

Approach: Advanced readers will enjoy this dataset. The dataset is relatively small, so overfitting is a real possibility. A potential approach to this problem is to use transfer learning with a pre-trained CNN such as ResNet. Transfer learning reuses the weights of a network that was already trained on a large, general dataset as the starting point for a new task. The general features learned in the early layers are preserved, while the later layers are retrained to learn features specific to the new problem.
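Here is one way the transfer learning idea could look in Keras: a frozen ResNet50 backbone with a new binary head on top. The 50x50 RGB input size, the head layers, and the function name build_transfer_model are assumptions on my part. The sketch also assumes images have already been run through tf.keras.applications.resnet50.preprocess_input in your data pipeline.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_transfer_model(input_shape=(50, 50, 3), weights="imagenet"):
    """Frozen ResNet50 backbone plus a new binary head.

    weights="imagenet" downloads pre-trained weights on first use;
    pass weights=None to build the same architecture without them.
    """
    backbone = tf.keras.applications.ResNet50(
        include_top=False, weights=weights, input_shape=input_shape
    )
    backbone.trainable = False  # freeze pre-trained layers; only the head learns

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs, training=False)  # keep batch-norm statistics fixed
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # probability of IDC

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling build_transfer_model() gives you a model where only the final Dense layer is trainable. A common next step, once the head has converged, is to unfreeze some of the later backbone layers and continue training with a much smaller learning rate.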

For ML(G) Pros

Brain MRI segmentation: This dataset includes images of brain MRI scans along with manually determined FLAIR abnormality segmentation masks. This dataset can be used to train a machine learning model to diagnose brain cancer. In fact, it was used in a paper from 2019 to do just that.

Task: Produce a mask that highlights the cancerous regions in an image of a brain MRI.

Approach: This task is not as straightforward as running the image through a CNN classifier, since we need to produce an image as output. We must use a more complex architecture. If you're reading this, you must like a challenge. So, look into an architecture called U-Net.
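To make the idea concrete, here is a deliberately tiny U-Net-style model in Keras. It has an encoder that shrinks the image, a decoder that grows it back, and skip connections that pass fine detail across. The 128x128x3 input size, the filter counts, and the two-level depth are my own simplifications; published U-Nets are deeper and wider.

```python
import tensorflow as tf
from tensorflow.keras import layers


def conv_block(x, filters):
    """Two 3x3 convolutions, the basic U-Net building block."""
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)


def build_tiny_unet(input_shape=(128, 128, 3)):
    inputs = tf.keras.Input(shape=input_shape)

    # Encoder: halve the resolution twice while adding feature channels.
    e1 = conv_block(inputs, 16)
    e2 = conv_block(layers.MaxPooling2D()(e1), 32)
    bottleneck = conv_block(layers.MaxPooling2D()(e2), 64)

    # Decoder: upsample, then concatenate the matching encoder output
    # (the skip connection) so fine spatial detail is not lost.
    d2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(bottleneck)
    d2 = conv_block(layers.Concatenate()([d2, e2]), 32)
    d1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(d2)
    d1 = conv_block(layers.Concatenate()([d1, e1]), 16)

    # One sigmoid value per pixel: probability the pixel is in the mask.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model


unet = build_tiny_unet()
print(unet.output_shape)  # (None, 128, 128, 1)
```

The output has the same height and width as the input, so each training example is an image paired with its mask. Input height and width need to be divisible by 4 here because of the two pooling steps.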

Avi Arora

Avi is a Computer Science student at the Georgia Institute of Technology pursuing a Master's in Machine Learning. He is a software engineer at Capital One and the co-founder of Octtone, a company that creates software products in the Health & Wellness space.
