Brain Tumour Evaluation
AI is being applied across many fields around the world, and medicine is one of the biggest potential beneficiaries. Machine learning models can quickly and accurately identify and classify brain tumours from MRI images. I developed a CNN that classifies brain tumours into four categories: glioma, meningioma, pituitary, and notumor. In this post I am not going to cover the design of the model itself; instead I will focus on evaluating it, highlighting its strengths, its weaknesses, and the implications of using AI in the medical sector.
Dataset Breakdown
The dataset is made up of 7023 MRI images, each labelled as pituitary, glioma, meningioma, or notumor.
The dataset is split into training and testing sets: the training set contains 5712 images (~81% of the total) and the testing set contains 1311 images (~19% of the total).
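For anyone wanting to reproduce a similar setup, the sketch below shows one way a dataset organised into class-named folders could be loaded with Keras. The folder paths, image size, and batch size are assumptions for illustration, not my exact pipeline.

```python
# Minimal loading sketch (assumed folder layout: one sub-folder per class).
import tensorflow as tf

IMG_SIZE = (224, 224)   # assumed input resolution
BATCH_SIZE = 32

train_ds = tf.keras.utils.image_dataset_from_directory(
    "brain_mri/Training",        # hypothetical path
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode="categorical",    # one-hot labels for the four classes
)

test_ds = tf.keras.utils.image_dataset_from_directory(
    "brain_mri/Testing",         # hypothetical path
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode="categorical",
    shuffle=False,               # keep order fixed so predictions align with labels
)

print(train_ds.class_names)      # e.g. ['glioma', 'meningioma', 'notumor', 'pituitary']
```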
Understanding the Evaluation Metrics
First, I am going to explain the metrics I have used to evaluate the model's performance.
• Precision: measures how many of the predicted positive cases are actually positive. In medical diagnostics, high precision helps avoid false positives, preventing unnecessary treatments or patient distress.
• Recall: indicates how many actual positive cases are correctly identified. High recall is vital to ensure no tumours are missed.
• F1 Score: the harmonic mean of precision and recall, offering a single metric that accounts for both false positives and false negatives.
• Accuracy: the overall percentage of correct predictions across all classes, offering a broad view of model reliability.
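For reference, the sketch below shows how these per-class metrics can be computed with scikit-learn; the labels in it are placeholders, not my actual predictions.

```python
# Minimal metrics sketch (assumes scikit-learn is installed).
from sklearn.metrics import accuracy_score, classification_report

class_names = ["glioma", "meningioma", "notumor", "pituitary"]

# Placeholder labels; in practice these come from the test set and the
# model's argmax predictions.
y_true = [0, 1, 1, 2, 3, 2]
y_pred = [0, 2, 1, 2, 3, 2]

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y_true, y_pred, target_names=class_names))
print("Overall accuracy:", accuracy_score(y_true, y_pred))
```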
These metrics were calculated for each class, giving a detailed view of how the model performs across different tumour types. Evaluating the model on the test dataset gave the following results:
Per-Class Performance Metrics
• Glioma:
  • Precision: 0.8619
  • Recall: 0.7700
  • F1 Score: 0.8134
  • Accuracy: 0.7700
• Meningioma:
  • Precision: 0.7582
  • Recall: 0.2255
  • F1 Score: 0.3476
  • Accuracy: 0.2255
• Notumor:
  • Precision: 0.6672
  • Recall: 1.0000
  • F1 Score: 0.8004
  • Accuracy: 1.0000
• Pituitary:
  • Precision: 0.8435
  • Recall: 0.9700
  • F1 Score: 0.9023
  • Accuracy: 0.9700
Overall Accuracy: 0.7597 (75.97% of predictions were correct)
Analysis of the Metrics
• Notumor: Perfect Recall and Accuracy. The model achieves a perfect recall (1.0000) and accuracy (1.0000) for the notumor class, meaning it correctly identifies every instance where no tumour is present. This is critical for confidently ruling out tumours, reducing unnecessary interventions. However, the precision (0.6672) is lower, indicating some false positives (other classes misclassified as notumor), which could lead to missed tumour diagnoses if not addressed.
• Pituitary: Strong Performance. For pituitary tumours, the model performs exceptionally well, with a precision of 0.8435, recall of 0.9700, F1 score of 0.9023, and an accuracy of 0.9700. The high recall ensures that nearly all pituitary tumours are detected, and the strong precision minimises false positives, making this a reliable class for the model.
• Glioma: Good Precision, Moderate Recall. The glioma class shows strong precision (0.8619), meaning most predicted gliomas are correct. However, the recall and accuracy (both 0.7700) indicate that 23% of actual gliomas are missed, which is concerning in a medical context where missing a tumour can delay treatment and potentially lead to fatal consequences. The F1 score (0.8134) reflects a balanced but imperfect performance.
• Meningioma: Significant Challenges. The meningioma class has the weakest performance, with a precision of 0.7582 but a very low recall and accuracy (both 0.2255) and an F1 score of 0.3476. This suggests the model struggles to detect meningiomas, missing approximately 77% of actual cases. The low recall is particularly problematic, as it indicates a high rate of false negatives.
• Overall Accuracy in Context. The overall accuracy of 75.97% is a reasonable starting point, but it is nowhere near ready for clinical use. The significant variation in per-class performance, particularly the poor recall for meningiomas, shows the need for targeted improvements.
Confusion Matrix
The above 4x4 confusion matrix, visualised as a heatmap, shows how often the model confuses one class for another. Just as the other metrics showed, my model struggles most with meningioma tumours. We can also see that the model classifies most meningiomas as notumor. This helps me understand where the model is going wrong and gives guidance for future improvements.
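For illustration, a confusion matrix like this can be computed and drawn with scikit-learn and matplotlib; the sketch below uses placeholder labels rather than my actual test predictions.

```python
# Minimal confusion-matrix heatmap sketch (assumes scikit-learn and matplotlib).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

class_names = ["glioma", "meningioma", "notumor", "pituitary"]

# Placeholder labels; in practice these are the test labels and predictions.
y_true = [0, 1, 1, 2, 3, 2]
y_pred = [0, 2, 1, 2, 3, 2]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap="Blues")      # heatmap-style rendering
plt.title("Brain tumour classification: confusion matrix")
plt.show()
```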
Implications and Future Directions
My model demonstrates promising capabilities, particularly in identifying non-tumour cases and pituitary tumours. However, several areas require attention.
• Improving Glioma Detection: The recall of 0.7700 for gliomas indicates missed diagnoses, which is unacceptable in clinical settings. Increasing training data for gliomas could help the model better recognise these tumours.
• Reducing False Positives for Notumor: The lower precision (0.6672) for notumor suggests other tumour types are being misclassified as non-tumours. Techniques like feature visualisation (e.g., Grad-CAM) could reveal whether the model is focusing on irrelevant image features.
• Enhancing Overall Accuracy: At 75.97%, the accuracy is a starting point but falls short of clinical standards. Fine-tuning the CNN architecture, perhaps by adding layers or experimenting with transfer learning, could boost performance; a small transfer-learning sketch follows this list.
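As a rough illustration of the transfer-learning idea above, here is a minimal sketch that puts a new four-class head on a pretrained ResNet50 backbone. The choice of backbone, input size, and hyperparameters are assumptions for illustration, not my current architecture.

```python
# Minimal transfer-learning sketch: frozen ImageNet backbone + new 4-class head.
import tensorflow as tf

NUM_CLASSES = 4
IMG_SIZE = (224, 224)

base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_shape=IMG_SIZE + (3,)
)
base.trainable = False                       # freeze the backbone initially

inputs = tf.keras.Input(shape=IMG_SIZE + (3,))
x = tf.keras.applications.resnet50.preprocess_input(inputs)
x = base(x, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
x = tf.keras.layers.Dropout(0.3)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# Once the new head has converged, the top of the backbone could be unfrozen
# and fine-tuned with a much smaller learning rate.
```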
Moving Towards Clinical Integration
While my model shows potential, it is not yet ready for clinical use. A model like this should be used as a decision-support tool to assist radiologists, not replace them. Before deploying such a model, I would need to address the following:
• Validation on Larger Datasets: Testing on a more diverse MRI dataset to ensure generalisability, and extending the model to detect more tumour types, as there are over 130 different types of brain tumour.
• Interpretability: Implementing tools like Grad-CAM to show which MRI regions drive predictions (a minimal Grad-CAM sketch follows this list).
• Collaboration: Partnering with radiologists to validate predictions and ensure the model aligns with clinical needs.
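To illustrate the interpretability point, here is a minimal Grad-CAM sketch assuming a Keras functional model; the layer name "last_conv" is a placeholder for whatever the final convolutional layer is actually called.

```python
# Minimal Grad-CAM sketch (assumes a Keras functional model).
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="last_conv", class_index=None):
    """Return a [0, 1] heatmap showing which regions drive the prediction."""
    # Model that outputs both the final conv feature maps and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))   # highest-scoring class
        class_score = preds[:, class_index]

    grads = tape.gradient(class_score, conv_out)            # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pool the grads
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)  # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                   # keep positive influence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```

The resulting heatmap can be resized to the MRI's resolution and overlaid on the scan to show which regions influenced the classification.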
Conclusion
My brain tumour classification model is a step forward in leveraging AI for medical diagnostics, with strong performance in identifying non-tumour cases and pituitary tumours. However, challenges with glioma recall and meningioma metrics highlight areas for improvement.