Does Machine Learning Struggle with Explainability?
It is quite common to hear the phrase "AI/ML models are black boxes". In this article, let's analyze how true this is and whether the state of affairs can be improved.
Why seek explainability in ML models?
You might be tempted to ask why it matters that ML/AI models are difficult to explain as long as they work. Let's start by answering this question: why do we seek explainability in machine learning?
Satisfying natural human curiosity - Why do ML models work where traditional methods fail? Why do classical ML algorithms such as random forests or SVMs outperform deep neural networks in certain areas? All such questions stem from our natural curiosity to understand things.
Adding to scientific knowledge - If something works and we cannot explain why, we may be adding nothing to existing scientific knowledge. However, if a model works better than traditional approaches, it has evidently learned something that those approaches have missed. Extracting this information can greatly increase our understanding of the scientific problem.
Improving existing models - Understanding the inner workings of an ML model allows us to identify its weaknesses and thus build a better model.
Trust, Fairness and Privacy - With the ever-increasing integration of AI into our everyday lives, questions of fairness naturally arise. Even today, when humans make the decisions, it is frustrating to be denied a loan when you are in dire need of it. Now imagine that a machine is doing the job and you want to know why your loan request was denied. Since ML models are prone to the biases that exist in their training data, can we trust such a model? How can fairness be programmed into the model? Similarly, privacy becomes a bigger concern if the data you provide can be used in a way that negatively affects you.
Interpretable Machine Learning
I hope you are now convinced that ML models should be more than just black-box algorithms that score well on some metric such as classification accuracy. A single metric is often an incomplete description of most real-world tasks. The opposite of a black-box model might be a model that is said to be interpretable. Interpretable Machine Learning refers to methods and models that make the behavior and predictions of machine learning systems understandable to humans. It is currently an area of active research (Doshi-Velez and Kim, 2017). But what does it actually mean to say that a model is interpretable? There is no mathematical definition of interpretability. However, when comparing two models, we can say that one is more interpretable than the other if its decisions are easier for a human to understand. Now, can this goal be achieved?
Algorithm Transparency. How does the algorithm learn a model from the data, and what kinds of relationships can it learn? People with experience in Convolutional Neural Networks (CNNs) know that the lowest layers of a CNN learn edge detectors. Least squares, the method used to fit linear models, is another algorithm whose inner workings are well understood.
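To make this concrete, here is a minimal sketch (on synthetic data) of why least squares is considered transparent: the coefficients come from a closed-form formula that we can write down and compute ourselves.

```python
# A minimal sketch of algorithm transparency: ordinary least squares has a
# closed-form solution, so we know exactly how the coefficients are obtained.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Add an intercept column and solve the normal equations: beta = (X'X)^-1 X'y
X_b = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print("intercept and coefficients:", beta)
```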
Global Model Interpretability. This involves understanding the entire model at once. Explaining the model's output on a global level requires knowledge of the trained model, the learning algorithm, and the data involved.
Global Model Interpretability on a Modular Level
A Naive Bayes model with many hundreds of features would be too big for me and you to keep in working memory. Predicting the output for a given data point would be nearly impossible without actual computation. But what you can do is try to comprehend the impact of a single weight.
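As a small, hedged illustration of what interpretability on a modular level might look like in practice, here is a Gaussian Naive Bayes sketch on the Iris data: because the model treats features independently, one feature's class-conditional parameters can be inspected in isolation. (The attribute name depends on the scikit-learn version.)

```python
# A sketch of interpretability on a modular level: Gaussian Naive Bayes treats
# features independently, so one feature's class-conditional parameters can be
# inspected in isolation. (In scikit-learn < 1.0 the variance attribute is
# named `sigma_` instead of `var_`.)
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

data = load_iris()
nb = GaussianNB().fit(data.data, data.target)

feature_idx = 2  # petal length; any single feature can be examined on its own
for cls, name in enumerate(data.target_names):
    mean = nb.theta_[cls, feature_idx]
    var = nb.var_[cls, feature_idx]
    print(f"{name}: {data.feature_names[feature_idx]} ~ N(mean={mean:.2f}, var={var:.2f})")
```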
Local Interpretability for a Single Prediction
You can zoom in on a single instance and examine what the model predicts for this input, and explain why.
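For example, with a plain linear model the local explanation of a single prediction can be read off directly: each feature contributes its coefficient times its value. Here is a small, purely illustrative sketch on scikit-learn's diabetes dataset.

```python
# A sketch of a local explanation for one prediction of a linear model: each
# feature's contribution to this particular prediction is coefficient * value.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression().fit(X, y)

instance = X.iloc[0]                         # the single prediction to explain
contributions = model.coef_ * instance       # per-feature effect for this row
print("prediction:", model.intercept_ + contributions.sum())
print(contributions.sort_values(key=abs, ascending=False).head())
```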
A survey of existing interpretability methods
Interpretable Models. Linear regression, logistic regression, decision trees and (linear) support vector machines are some of the more easily interpretable machine learning models. Several interpretability techniques focus on such inherently interpretable, often linear, models.
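As a quick illustration, here is a sketch of an intrinsically interpretable model: a shallow decision tree on the Iris data whose learned rules can be printed and read directly (the depth limit is an arbitrary choice for readability).

```python
# A sketch of an intrinsically interpretable model: a shallow decision tree
# whose learned rules can be printed and read directly.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# export_text renders the whole model as human-readable if/else rules
print(export_text(tree, feature_names=list(data.feature_names)))
```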
Model-Agnostic Methods. Such methods have several advantages over methods that are designed for a specific model class. Comparisons are easier because the same method is used to evaluate different models, and model-specific methods are much more rigid than model-agnostic ones. The alternative to using model-agnostic methods is to use only interpretable models, which may not be ideal in all scenarios. Let us take a high-level look at model-agnostic interpretability. We capture information about the world by collecting data. A black-box ML model then abstracts this information by learning to predict the data for a specific task. Interpretability is another layer on top of this that helps humans understand the black-box model. There are several model-agnostic methods worth mentioning. LIME, or Local Interpretable Model-Agnostic Explanations, is one such method. Another is analyzing Shapley values. We will go into detail on both of these methods in an upcoming issue.
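To give a flavour of what a model-agnostic explanation looks like in code, here is a hedged sketch using the `lime` package to explain one prediction of a random forest; the exact API may differ slightly between versions, and the dataset and model are just stand-ins for a black box.

```python
# A sketch of a model-agnostic explanation with the `lime` package
# (pip install lime); the random forest and Iris data are placeholders.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=data.target_names,
    mode="classification",
)
# Explain a single prediction of the black-box model
exp = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(exp.as_list())   # (feature condition, local weight) pairs
```

Note that the explainer only needs the model's predict_proba function, not its internals, which is exactly what makes the approach model-agnostic.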
Example-Based Methods. These can be considered model-agnostic since they make any machine learning model more interpretable. They differ from the other methods in that they select data instances rather than summarizing features or computing feature importances. Such methods can be motivated with examples from daily life. Doctors in clinical practice often recall patients who had similar symptoms in order to make a diagnosis. A gamer might make decisions based on similar situations encountered in past gameplay. The template for such explanations is: thing B is similar to thing A, and A caused Y, so I predict that B will cause Y as well. Some machine learning models, such as decision trees and k-nearest neighbors, are implicitly example-based. For a new instance, a kNN model locates the k nearest neighbors and returns the average of their outcomes as the prediction.
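Here is a small sketch of that idea with scikit-learn's k-nearest-neighbors classifier: the neighbors that produced a prediction can be retrieved and shown as the explanation (the query instance chosen here is arbitrary).

```python
# A sketch of an implicitly example-based model: k-nearest neighbors "explains"
# a prediction by the stored training instances it was averaged over.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

query = X[75:76]                      # one (arbitrary) instance to classify
print("prediction:", knn.predict(query))

# The neighbors themselves are the explanation: which stored examples drove it?
distances, indices = knn.kneighbors(query)
print("nearest training examples:", indices[0], "with labels:", y[indices[0]])
```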
There are several example-based methods that can be used.
- Counterfactual explanations tell us how an instance has to change to significantly change its prediction (a toy search is sketched after this list).
- Adversarial examples are counterfactuals used to fool machine learning models. The emphasis is on flipping the prediction and not explaining it.
- Prototypes are a selection of representative instances from the data and criticisms are instances that are not well represented by those prototypes.
- Influential instances are the training data points that were the most influential for the parameters of a prediction model or the predictions themselves.
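As promised above, here is a toy, purely illustrative counterfactual search on a logistic regression model: a single feature is nudged in the direction that pushes the decision function toward the other class until the prediction flips. Real counterfactual methods optimize over all features with sparsity and plausibility constraints; the dataset, model and step size here are arbitrary choices.

```python
# A toy counterfactual search (illustrative only): nudge one feature until the
# classifier's prediction flips.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

x = X[0].copy()
original = clf.predict(x.reshape(1, -1))[0]

# Pick the feature with the largest coefficient magnitude and move it in the
# direction that pushes the decision function toward the other class.
w = clf.coef_[0]
feature = int(np.argmax(np.abs(w)))
direction = np.sign(w[feature]) if original == 0 else -np.sign(w[feature])
step_size = 0.05 * X[:, feature].std()

for step in range(1, 1001):
    x[feature] += direction * step_size
    if clf.predict(x.reshape(1, -1))[0] != original:
        print(f"prediction flipped after {step} steps: feature {feature} "
              f"moved from {X[0, feature]:.3f} to {x[feature]:.3f}")
        break
else:
    print("no counterfactual found within the search budget")
```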
Neural Network Interpretation Methods. There are several interpretation methods that are specific to neural networks.
The different categories of techniques that fall under this are:
- Feature Visualization: Visualizing what features the network has learned.
- Concepts: Which abstract concepts has the neural network learned?
- Feature Attribution: These try to explain how each input feature contributed to a particular prediction.
- Model distillation: This attempts to explain a neural network using a simpler model.
Pixel Attribution can be seen as a special case of feature attribution but for images. It is known under various names such as sensitivity map, saliency map, pixel attribution map, gradient-based attribution methods, feature relevance, feature attribution, and feature contribution. Pixel attribution methods can be classified based on several criteria.
- Occlusion- or perturbation-based methods, such as SHAP and LIME, manipulate parts of the image to generate explanations.
- Gradient-based methods compute the gradient of the prediction with respect to the input features. There are several gradient based methods that differ in the way the gradient is computed.
What both approaches have in common is that the explanation has the same size and shape as the input image, and each pixel is assigned a value that reflects its importance for the prediction. Another distinction can be made within pixel attribution methods based on the baseline question.
- Gradient-only methods tell us whether a change in a pixel would change the model's prediction. Examples are Vanilla Gradient and Grad-CAM (a minimal Vanilla Gradient sketch follows this list).
- Path-attribution methods compare the input image to a reference or baseline image, usually a black ("zero") image. This includes integrated gradients as well as model-agnostic methods such as LIME and SHAP. Some path-attribution methods are "complete", meaning that the relevance values of all input features sum to the difference between the prediction for the image and the prediction for the reference image; in other words, the difference in classification scores is fully attributed to the pixels. The choice of the reference image (or distribution of reference images) has a big effect on the explanation.
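To ground the gradient-only idea, here is a hedged sketch of Vanilla Gradient (a saliency map) in PyTorch. The pretrained model and the random "image" are placeholders, and the weights argument depends on the torchvision version (older releases use pretrained=True).

```python
# A minimal Vanilla Gradient sketch: the saliency map is the absolute gradient
# of the predicted class score with respect to the input pixels.
import torch
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real image

scores = model(image)                       # forward pass: class scores
target = scores.argmax(dim=1).item()        # explain the top predicted class
scores[0, target].backward()                # gradient of that score w.r.t. pixels

saliency = image.grad.abs().max(dim=1).values   # collapse RGB channels -> (1, 224, 224)
print(saliency.shape)
```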
References:
- Finale Doshi-Velez and Been Kim. Towards A Rigorous Science of Interpretable Machine Learning, March 2017.
- Christoph Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Interpretable.
- Zachary C. Lipton. The Mythos of Model Interpretability, March 2017.
- Tim Miller. Explanation in Artificial Intelligence: Insights from the Social Sciences, August 2018.