Responsible AI
It is an approach to evaluating, developing, and deploying AI
systems in a safe, reliable, and ethical manner, and to making responsible
decisions and taking responsible actions.
Generally speaking, Responsible AI is the practice of upholding a set of guiding principles when designing, building,
and using artificial intelligence systems.
Differential privacy adds noise so that the impact of any individual
on the outcome of an aggregated analysis is bounded by the privacy parameter epsilon (ϵ)
• The incremental privacy risk between opting out and participating
is governed by ϵ for any individual
• Lower ϵ values result in greater privacy but lower accuracy
• Higher ϵ values result in greater accuracy with higher risk of individual
identification
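To make the ϵ trade-off concrete, here is a minimal conceptual sketch of the Laplace mechanism in plain NumPy (not tied to any specific differential privacy library); the salary values and the clamping bounds are made-up illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical salaries (in thousands) for 10 study participants
salaries = np.array([52, 61, 48, 95, 70, 66, 58, 83, 74, 60], dtype=float)

def dp_mean(values, epsilon, lower=0.0, upper=150.0):
    """Differentially private mean via the Laplace mechanism.

    Values are clamped to [lower, upper] so the sensitivity of the
    mean query is bounded by (upper - lower) / n.
    """
    clamped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clamped)
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clamped.mean() + noise

print("True mean:             ", salaries.mean())
print("High privacy (eps=0.1):", dp_mean(salaries, epsilon=0.1))  # noisier, more private
print("Low privacy (eps=5.0): ", dp_mean(salaries, epsilon=5.0))  # closer to the true mean
```

Because the noise is drawn fresh on every call, repeating the same aggregation gives slightly different, non-deterministic results.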
2. Fairness
Absence of negative impact on groups based on:
Ethnicity
Gender
Age
Physical disability
Other sensitive features
Mitigating Unfairness
Create models with parity constraints:
Algorithms:
• Exponentiated Gradient - A *reduction* technique that applies a cost-
minimization approach to learning the optimal trade-off of overall predictive
performance and fairness disparity (Binary classification and regression)
• Grid Search - A simplified version of the Exponentiated Gradient algorithm
that works efficiently with small numbers of constraints (Binary classification
and regression)
• Threshold Optimizer - A *post-processing* technique that applies a
constraint to an existing classifier, transforming the prediction as
appropriate (Binary classification)
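As a quick illustration of the post-processing approach, the sketch below wraps an already-trained scikit-learn classifier in Fairlearn's ThresholdOptimizer; the data here is synthetic and the constructor arguments follow recent Fairlearn releases, so check them against your installed version.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.postprocessing import ThresholdOptimizer

rng = np.random.default_rng(0)

# Synthetic data: two features, a binary label, and a binary sensitive feature
X = rng.normal(size=(500, 2))
sensitive = rng.integers(0, 2, size=500)   # e.g. an age-group flag
y = (X[:, 0] + 0.5 * sensitive + rng.normal(scale=0.5, size=500) > 0).astype(int)

# An existing classifier trained without any fairness constraint
base_model = LogisticRegression().fit(X, y)

# Post-process the classifier so its predictions satisfy equalized odds
mitigated = ThresholdOptimizer(
    estimator=base_model,
    constraints="equalized_odds",
    prefit=True,
)
mitigated.fit(X, y, sensitive_features=sensitive)

y_pred = mitigated.predict(X, sensitive_features=sensitive, random_state=0)
```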
Mitigating Unfairness
Constraints:
• Demographic parity: Minimize disparity in the selection rate across sensitive
feature groups.
• True positive rate parity: Minimize disparity in true positive rate across
sensitive feature groups
• False positive rate parity: Minimize disparity in false positive rate across
sensitive feature groups
• Equalized odds: Minimize disparity in combined true positive rate and false
positive rate across sensitive feature groups
• Error rate parity: Ensure that the error for each sensitive feature group does
not deviate from the overall error rate by more than a specified amount
• Bounded group loss: Restrict the loss for each sensitive feature group in a
regression model
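In Fairlearn, each of these constraints corresponds to a class in fairlearn.reductions that can be passed to the mitigation algorithms above. A minimal sketch, assuming class names from recent Fairlearn releases:

```python
from fairlearn.reductions import (
    BoundedGroupLoss,
    DemographicParity,
    EqualizedOdds,
    ErrorRateParity,
    FalsePositiveRateParity,
    SquareLoss,
    TruePositiveRateParity,
)

# Classification parity constraints
demographic_parity = DemographicParity()
tpr_parity = TruePositiveRateParity()
fpr_parity = FalsePositiveRateParity()
equalized_odds = EqualizedOdds()
error_rate_parity = ErrorRateParity()  # can be bounded relative to the overall error rate

# Regression constraint: bound the loss for every sensitive feature group
bounded_group_loss = BoundedGroupLoss(SquareLoss(min_val=0, max_val=1), upper_bound=0.1)
```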
Building responsible AI models in
Azure Machine Learning
Luis Beltrán
luis@luisbeltran.mx
Thank you for your attention!
Speaker notes
When we talk about AI, we usually refer to a machine learning model that is used within a system to automate something. For example, a self-driving car can take images using sensors. A machine learning model can use these images to make predictions (for example, the object in the image is a tree). These predictions are used by the car to make decisions (for example, turn left to avoid the tree). We refer to this whole system as AI.
When AI is developed, there are risks that it will be unfair or seen as a black box that makes decisions for humans.
For example, consider another model that analyzes a person's information (such as their salary, nationality, age, etc.) and decides whether or not to grant them a loan. Human participation is limited in those decisions made by the system. This can lead to many potential problems, and companies need to define a clear approach to the use of AI. Responsible AI is a governance framework meant to do exactly that.
Responsible AI is the practice of designing, developing, and deploying AI with good intent to empower employees and businesses, and impact customers and society fairly, safely, and ethically, enabling organizations to build trust and scale AI more securely.
AI systems are the product of many decisions made by those who develop and implement them. From the purpose of the system to the way people interact with AI systems, responsible AI can help proactively guide decisions toward more beneficial and equitable outcomes. That means keeping people and their goals at the center of system design decisions and respecting enduring values like fairness, reliability, and transparency.
Evaluating and researching ML models before their implementation remains at the core of reliable and responsible AI development.
Microsoft has developed a Responsible AI Standard. It's a framework for building AI systems according to six key principles: fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. For Microsoft, these principles are the foundations of a responsible and trustworthy approach to AI, especially as intelligent technology becomes more prevalent in products and services that people use every day.
Let’s talk about some of the principles
AI systems like facial recognition or voice tagging can definitely be used to breach an individual's privacy and threaten security. How an individual's online footprint is used to track, deduce and influence someone's preferences or perspectives is a serious concern that needs to be addressed. The way in which "fake news" or "deep fakes" influence public opinion also represents a threat to individual or social security. AI systems are increasingly misused in this domain. There is a pertinent need to establish a framework that protects an individual's privacy and security.
Private data is any data that can identify an individual and/or their location, activities, and interests. Such data is generally subject to strict privacy and compliance laws, for example GDPR in Europe. AI systems must comply with privacy laws that require transparency about the collection, use, and storage of data, and they should give consumers adequate control over how their data is used.
Data science projects, including machine learning projects, involve analysis of data; and often that data includes sensitive personal details that should be kept private.
In practice, most reports that are published from the data include aggregations of the data, which you may think would provide some privacy – after all, the aggregated results do not reveal the individual data values.
However, consider a case where multiple analyses of the data result in reported aggregations that when combined, could be used to work out information about individuals in the source dataset. In the example on the slide, 10 participants share data about their location and salary. The aggregated salary data tells us the average salary in Seattle; and the location data tells us that 10% of the study participants (in other words, a single person) is based in Seattle – so we can easily determine the specific salary of the Seattle-based participant.
Anyone reviewing both studies who happens to know a person from Seattle who participated, now knows that person's salary.
Differential privacy seeks to protect individual data values by adding statistical "noise" to the analysis process. The math involved in adding the noise is quite complex, but the principle is fairly intuitive – the noise ensures that data aggregations stay statistically consistent with the actual data values allowing for some random variation, but make it impossible to work out the individual values from the aggregated data. In addition, the noise is different for each analysis, so the results are non-deterministic – in other words, two analyses that perform the same aggregation may produce slightly different results.
Two open source packages that can enable further implementation of privacy and security principles:
Counterfit: Counterfit is an open-source project comprising a command-line tool and generic automation layer to enable developers to simulate cyberattacks against AI systems and verify their security.
SmartNoise: SmartNoise is a project (co-developed by Microsoft) that contains components for building global differentially private systems.
Built-in support for training simple machine learning models like linear and logistic regression
Compatible with open-source training libraries such as TensorFlow Privacy
You can use SmartNoise to create an analysis in which noise is added to the source data. The underlying mathematics of how the noise is added can be quite complex, but SmartNoise takes care of most of the details for you
Epsilon: The amount of variation caused by adding noise is configurable through a parameter called epsilon. This value governs the additional risk that your personal data can be identified, and the guarantee applies to every member in the data. A low epsilon value provides the most privacy, at the expense of less accuracy when aggregating the data. A higher epsilon value results in aggregations that are more true to the actual data distribution, but in which the contribution of a single individual to the aggregated value is less obscured by noise.
However, there are a few concepts it's useful to be aware of.
Upper and lower bounds: Clamping is used to set upper and lower bounds on values for a variable. This is required to ensure that the noise generated by SmartNoise is consistent with the expected distribution of the original data.
Sample size: To generate consistent differentially private data for some aggregations, SmartNoise needs to know the size of the data sample to be generated.
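As a rough sketch of what this looks like in code, the following is based on the older opendp.smartnoise.core API used in the Azure Machine Learning lab notebooks; module paths and argument names may differ in current SmartNoise releases, and the data path and column names are assumptions.

```python
import pandas as pd
import opendp.smartnoise.core as sn

data_path = 'data/diabetes.csv'          # assumed path to the diabetes data file
diabetes = pd.read_csv(data_path)
cols = list(diabetes.columns)

with sn.Analysis() as analysis:
    # Load the raw data into the analysis
    data = sn.Dataset(path=data_path, column_names=cols)

    # Differentially private mean of Age:
    # - data_lower / data_upper clamp values to the expected range
    # - data_rows tells SmartNoise the sample size for consistent aggregation
    age_mean = sn.dp_mean(
        data=sn.to_float(data['Age']),
        privacy_usage={'epsilon': 0.50},
        data_lower=0.0,
        data_upper=120.0,
        data_rows=len(diabetes),
    )

analysis.release()

print('Private mean age:', age_mean.value)
print('Actual mean age: ', diabetes['Age'].mean())
```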
It's common when analyzing data to examine the distribution of a variable using a histogram.
For example, let's look at the true distribution of ages in the diabetes dataset.
Now let's compare that with a differentially private histogram of Age.
The histograms are similar enough to ensure that reports based on the differentially private data provide the same insights as reports from the raw data.
Another common goal of analysis is to establish relationships between variables. SmartNoise provides a differentially private covariance function that can help with this.
In this case, the covariance between Age and DiastolicBloodPressure is positive, indicating that older patients tend to have higher blood pressure.
In addition to the Analysis functionality, SmartNoise enables you to use SQL queries against data sources to retrieve differentially private aggregated results.
First, you need to define the metadata for the tables in your data schema. You can do this in a .yml file, such as the diabetes.yml file in the /metadata folder. The metadata describes the fields in the tables, including data types and minimum and maximum values for numeric fields.
With the metadata defined, you can create readers that you can query. In the following example, we'll create a PandasReader to read the raw data from a Pandas dataframe, and a PrivateReader that adds a differential privacy layer to the PandasReader.
Now you can submit a SQL query that returns an aggregated resultset to the private reader.
Let's compare the result to the same aggregation from the raw data.
You can customize the behavior of a PrivateReader with the epsilon_per_column parameter.
Let's try a reader with a high epsilon (low privacy) value, and another with a low epsilon (high privacy) value.
Note that the results of the high epsilon (low privacy) reader are closer to the true results from the raw data than the results from the low epsilon (high privacy) reader.
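Putting the SQL pieces together, here is a sketch based on the same older opendp.smartnoise API; the data path, the metadata file, and the diabetes.diabetes table name defined in that .yml schema are assumptions, and newer smartnoise-sql releases expose a different interface.

```python
import pandas as pd
from opendp.smartnoise.metadata import CollectionMetadata
from opendp.smartnoise.sql import PandasReader, PrivateReader

diabetes = pd.read_csv('data/diabetes.csv')                   # assumed data file
meta = CollectionMetadata.from_file('metadata/diabetes.yml')  # assumed .yml schema

# A reader over the raw data, plus a private reader layered on top of it
reader = PandasReader(diabetes, meta)
private_reader = PrivateReader(reader=reader, metadata=meta, epsilon_per_column=0.7)

query = 'SELECT Diabetic, AVG(Age) AS AvgAge FROM diabetes.diabetes GROUP BY Diabetic'

print('Private result:', private_reader.execute(query))
print('Raw result:    ', reader.execute(query))

# Lower epsilon = more privacy but less accuracy; higher epsilon = the reverse
low_privacy_reader = PrivateReader(reader=reader, metadata=meta, epsilon_per_column=5.0)
high_privacy_reader = PrivateReader(reader=reader, metadata=meta, epsilon_per_column=0.1)
```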
Machine learning models are increasingly used to inform decisions that affect people's lives. For example, a prediction made by a machine learning model might influence:
- Approval for a loan, insurance, or other financial service.
- Acceptance into a school or college course.
- Eligibility for a medical trial or experimental treatment.
- Inclusion in a marketing promotion.
- Selection for employment or promotion.
With such critical decisions in the balance, it's important to have confidence that the machine learning models we rely on predict fairly, and don't discriminate for or against subsets of the population based on ethnicity, gender, age, or other factors.
Fairness and inclusiveness in Azure Machine Learning: The fairness assessment component of the Responsible AI dashboard enables data scientists and developers to assess model fairness across sensitive groups defined in terms of gender, ethnicity, age, and other characteristics.
The Responsible AI dashboard provides a single interface to help you implement Responsible AI in practice effectively and efficiently. It brings together several mature Responsible AI tools in the areas of:
Model performance and fairness assessment
Data exploration
Machine learning interpretability
Error analysis
Counterfactual analysis and perturbations
Causal inference
The dashboard offers a holistic assessment and debugging of models so you can make informed data-driven decisions. Having access to all of these tools in one interface empowers you to:
Evaluate and debug your machine learning models by identifying model errors and fairness issues, diagnosing why those errors are happening, and informing your mitigation steps.
Boost your data-driven decision-making abilities by addressing questions such as:
"What is the minimum change that users can apply to their features to get a different outcome from the model?"
"What is the causal effect of reducing or increasing a feature (for example, red meat consumption) on a real-world outcome (for example, diabetes progression)?"
You'll use the Fairlearn package to analyze a model and explore disparity in prediction performance for different subsets of data based on specific features, such as age.
To use the Fairlearn package with Azure Machine Learning, you need the Azure Machine Learning and Fairlearn Python packages, so run the following cell to verify that the azureml-contrib-fairness package is installed.
Train model
After that, you can use the Fairlearn package to compare the model's behavior for different sensitive feature values.
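As a reference point for the steps that follow, here is a minimal training sketch on a hypothetical diabetes-style dataset; the column names, the 50-year age split, and the choice of a decision tree are assumptions standing in for the lab's real data and model.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Hypothetical diabetes-style data
df = pd.DataFrame({
    'Age': rng.integers(20, 80, size=1000),
    'BMI': rng.normal(28, 5, size=1000),
    'PlasmaGlucose': rng.normal(105, 25, size=1000),
})
df['Diabetic'] = ((df['Age'] > 50) & (df['PlasmaGlucose'] > 110)).astype(int)

features = ['Age', 'BMI', 'PlasmaGlucose']
X, y = df[features].values, df['Diabetic'].values

# Sensitive feature: age group, carried along through the train/test split
S = np.where(df['Age'] > 50, 'Over 50', '50 or younger')

X_train, X_test, y_train, y_test, S_train, S_test = train_test_split(
    X, y, S, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)
```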
A mix of fairlearn and scikit-learn metric functions is used to calculate the performance values.
Use scikit-learn metric functions to calculate overall accuracy, recall, and precision metrics.
Use the fairlearn selection_rate function to return the selection rate (percentage of positive predictions) for the overall population.
Use a MetricFrame to calculate selection rate, accuracy, recall, and precision for each age group in the Age sensitive feature.
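Using the model, test split, and sensitive feature from the training sketch above, the metric calculations might look like this (fairlearn.metrics names per recent releases):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score
from fairlearn.metrics import MetricFrame, selection_rate

# Overall metrics for the whole test population
print('Overall accuracy:      ', accuracy_score(y_test, y_pred))
print('Overall recall:        ', recall_score(y_test, y_pred))
print('Overall precision:     ', precision_score(y_test, y_pred))
print('Overall selection rate:', selection_rate(y_test, y_pred))

# The same metrics broken down by the Age sensitive feature groups
metrics = {
    'selection_rate': selection_rate,
    'accuracy': accuracy_score,
    'recall': recall_score,
    'precision': precision_score,
}
frame = MetricFrame(metrics=metrics, y_true=y_test, y_pred=y_pred,
                    sensitive_features=S_test)
print(frame.by_group)
```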
From these metrics, you should be able to discern that a larger proportion of the older patients are predicted to be diabetic. Accuracy should be more or less equal for the two groups, but a closer inspection of precision and recall indicates some disparity in how well the model predicts for each age group.
The model does a better job of this for patients in the older age group than for younger patients.
It's often easier to compare metrics visually. To do this, you'll use the Fairlearn fairness dashboard:
When the widget is displayed, use the Get started link to start configuring your visualization.
Select the sensitive features you want to compare (in this case, there's only one: Age).
Select the model performance metric you want to compare (in this case, it's a binary classification model so the options are Accuracy, Balanced accuracy, Precision, and Recall). Start with Recall.
Select the type of fairness comparison you want to view. Start with Demographic parity difference.
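The widget that these steps configure can be launched from the notebook as sketched below, using the raiwidgets package; the lab may instead use the older fairlearn.widget.FairlearnDashboard, so adapt the import to whichever package is installed. The y_test, y_pred, and S_test values are those from the earlier training and metrics sketches.

```python
from raiwidgets import FairnessDashboard

# Compare the model's predictions across the Age sensitive feature groups
FairnessDashboard(
    sensitive_features=S_test,
    y_true=y_test,
    y_pred={'diabetes_model': y_pred},
)
```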
The choice of parity constraint depends on the technique being used and the specific fairness criteria you want to apply. Constraints include Demographic parity: use this constraint with any of the mitigation algorithms to minimize disparity in the selection rate across sensitive feature groups. For example, in a binary classification scenario, this constraint tries to ensure that a comparable proportion of positive predictions is made in each group.
View the dashboard charts, which show:
Selection rate - A comparison of the number of positive cases per subpopulation.
False positive and false negative rates - how the selected performance metric compares for the subpopulations, including underprediction (false negatives) and overprediction (false positives).
Edit the configuration to compare the predictions based on different performance and fairness metrics.
The results show a much higher selection rate for patients over 50 than for younger patients. However, in reality, age is a genuine factor in diabetes, so you would expect more positive cases among older patients.
If we base model performance on accuracy (in other words, the percentage of predictions the model gets right), then it seems to work more or less equally for both subpopulations. However, based on the precision and recall metrics, the model tends to perform better for patients who are over 50 years old.
A common approach to mitigation is to use one of the algorithms and constraints to train multiple models, and then compare their performance, selection rate, and disparity metrics to find the optimal model for your needs. Often, the choice of model involves a trade-off between raw predictive performance and fairness. Generally, fairness is measured by a reduction in the disparity of the selection rate or of a performance metric across sensitive feature groups.
To train the models for comparison, you use mitigation algorithms to create alternative models that apply parity constraints to produce comparable metrics across sensitive feature groups. Common algorithms used to optimize models for fairness are listed below.
GridSearch trains multiple models in an attempt to minimize the disparity of predictive performance for the sensitive features in the dataset (in this case, the age groups).
- Exponentiated Gradient - A *reduction* technique that applies a cost-minimization approach to learning the optimal trade-off of overall predictive performance and fairness disparity (Binary classification and regression)
- Grid Search - A simplified version of the Exponentiated Gradient algorithm that works efficiently with small numbers of constraints (Binary classification and regression)
- Threshold Optimizer - A *post-processing* technique that applies a constraint to an existing classifier, transforming the prediction as appropriate (Binary classification)
The choice of parity constraint depends on the technique being used and the specific fairness criteria you want to apply.
The EqualizedOdds parity constraint tries to ensure that models exhibit similar true and false positive rates for each sensitive feature group.
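Continuing from the earlier training sketch, here is a sketch of the mitigation sweep with Fairlearn's GridSearch and the EqualizedOdds constraint, followed by the kind of per-model comparison the dashboard visualizes; the grid_size value and the metric choices are assumptions.

```python
from fairlearn.reductions import EqualizedOdds, GridSearch
from fairlearn.metrics import MetricFrame
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

# Train a sweep of candidate models that trade accuracy against equalized odds
sweep = GridSearch(
    estimator=DecisionTreeClassifier(),
    constraints=EqualizedOdds(),
    grid_size=20,
)
sweep.fit(X_train, y_train, sensitive_features=S_train)

# Compare each candidate's overall recall and its recall disparity across groups
for i, candidate in enumerate(sweep.predictors_):
    preds = candidate.predict(X_test)
    frame = MetricFrame(metrics=recall_score, y_true=y_test, y_pred=preds,
                        sensitive_features=S_test)
    print(f'Model {i}: recall={frame.overall:.3f}, '
          f'recall disparity={frame.difference():.3f}')
```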
The models are shown on a scatter plot. You can compare the models by measuring the disparity in predictions (in other words, the selection rate) or the disparity in the selected performance metric (in this case, recall). In this scenario, we expect disparity in selection rates (because we know that age is a factor in diabetes, with more positive cases in the older age group). What we're interested in is the disparity in predictive performance, so select the option to measure Disparity in recall.
The chart shows clusters of models with the overall recall metric on the X axis, and the disparity in recall on the Y axis. Therefore, the ideal model (with high recall and low disparity) would be at the bottom right corner of the plot. You can choose the right balance of predictive performance and fairness for your particular needs, and select an appropriate model to see its details.
An important point to reinforce is that applying fairness mitigation to a model is a trade-off between overall predictive performance and disparity across sensitive feature groups - generally you must sacrifice some overall predictive performance to ensure that the model predicts fairly for all segments of the population.
To conclude, recall the principles that are recommended for developing responsible AI:
Reliability: We need to make sure that the systems we develop are consistent with the ideas, values, and design principles so that they don't create any harm in the world.
Privacy: AI systems are complex and require ever more data, and our software must ensure that this data is protected and is neither leaked nor disclosed.
Inclusiveness: Empower and engage people by making sure no one is left out. Consider inclusion and diversity in your models so that the entire spectrum of communities is covered.
Transparency: Transparency means that the people creating AI systems must be open about how and why they are using AI, and also open about the limitations of their systems. Transparency also means interpretability: people must be able to understand the behavior of AI systems. As a result, transparency helps gain more trust from users.
Accountability: Define best practices and processes that AI professionals can follow, such as commitment to equity, to consider at every step of the AI lifecycle.