After completing this learning module, you will be able to:
Define the terms that are associated with AI dataset poisoning
Use TensorFlow and Scikit-learn to create a dataset poisoning attack
Use the same tools to deploy a defense against that attack
Dataset poisoning is the intentional introduction of errors into a dataset in order to compromise the effectiveness of an AI model. This can be done in many ways. In our demonstration, we will fabricate false data points and append them to the pre-existing dataset, as sketched below.
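As a concrete illustration, here is a minimal sketch of that idea. The helper name poison_dataset, the number of poisoned rows, and the inflation factor are all hypothetical choices for illustration, not the lab's exact values.

```python
import numpy as np

def poison_dataset(X, y, n_poison=50, seed=0):
    """Append n_poison fabricated rows to (X, y).

    Fake features are sampled near the real data's per-column statistics
    so they blend in; fake targets are pushed far outside the true range
    so they drag a regression fit away from the clean data.
    """
    rng = np.random.default_rng(seed)
    X_fake = rng.normal(loc=X.mean(axis=0), scale=X.std(axis=0),
                        size=(n_poison, X.shape[1]))
    y_fake = np.full(n_poison, y.max() * 5.0)  # wildly inflated prices
    return np.vstack([X, X_fake]), np.concatenate([y, y_fake])
```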
The Boston Housing dataset records housing prices in Boston alongside a number of other factors (crime rate, average number of rooms, and so on). It has fallen out of favor in recent years and has been removed from scikit-learn; however, for the purposes of this demonstration (we are not attempting to build a model that actually predicts housing prices), it will suffice.
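Because load_boston was removed in scikit-learn 1.2, one way to load the data is directly from the original CMU source; this is the replacement snippet scikit-learn's own deprecation notice suggested, though the lab may load the data differently.

```python
import numpy as np
import pandas as pd

# The raw file stores each record across two physical lines,
# hence the interleaved [::2] / [1::2] slicing below.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]                                     # MEDV (median price)
```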
Linear regression is a machine learning technique in which a model learns a linear relationship between a number of independent variables (the features) and one dependent variable (the target). The idea is for the model to learn, from the provided features of the dataset, what the house price will be.
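A minimal baseline might look like the following, reusing data and target from the loading snippet above; the split ratio and random_state are arbitrary choices.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hold out a clean test set; it is never poisoned, so every model
# in the lab can be judged against the same untouched data.
X_train, X_test, y_train, y_test = train_test_split(
    data, target, test_size=0.2, random_state=42)

baseline = LinearRegression().fit(X_train, y_train)
print("Baseline test MSE:",
      mean_squared_error(y_test, baseline.predict(X_test)))
```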
Anomaly detection is a technique used to identify outliers in a dataset. In this lab, we will use the Isolation Forest algorithm to identify and subsequently remove the outlying data points that we add in the dataset poisoning step.
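A sketch of this defense, assuming the train/test split and the hypothetical poison_dataset helper from the earlier snippets. Note that contamination is our guess at the fraction of poisoned rows; a real defender would not know this value and would have to estimate or tune it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Poison only the training split; the test set stays clean.
X_poisoned, y_poisoned = poison_dataset(X_train, y_train, n_poison=50)

# Fit on features and target together, since this poison is most
# visible in the relationship between the two.
combined = np.column_stack([X_poisoned, y_poisoned])
iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(combined)   # +1 = inlier, -1 = outlier

mask = labels == 1                   # keep only the inliers
X_clean, y_clean = X_poisoned[mask], y_poisoned[mask]
```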
Prepare the Boston Housing dataset.
Train a baseline linear regression model on the clean, unpoisoned data.
Introduce poisoned data points and measure the effect on model performance.
Use Isolation Forest to detect and remove poisoned data.
Retrain the model on cleaned data and assess its performance.
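Stringing the sketches above together, the comparison at the end of the lab might look like this, with all names carried over from the earlier hypothetical snippets.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Both models are scored on the same clean, held-out test set.
for name, (X_fit, y_fit) in [("poisoned", (X_poisoned, y_poisoned)),
                             ("cleaned", (X_clean, y_clean))]:
    model = LinearRegression().fit(X_fit, y_fit)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} model test MSE: {mse:.2f}")
```

If the defense works, the cleaned model's test MSE should fall back near the baseline's, since Isolation Forest should flag the fabricated rows as outliers.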