Model inversion is a technique in which an attacker aims to reconstruct a model's input from its output. The goal of this lab is to demonstrate a model inversion attack, as well as a defense against such an attack using differential privacy.
Because a model inversion attack attempts to reconstruct training data, the attacker by definition does not have direct access to that data. Instead, model inversion attacks follow these general steps:
Interrogation of the Model: The model is used to produce outputs. This is essentially an attempt by the attacker to probe the model for more information on how the model works.
Synthesis of Substitute Data: The attacker attempts to create a substitute dataset. This is done using a combination of domain knowledge and information gleaned from the model interrogation.
Adversarial Training: The substitute data is used to train a new model that mimics the outputs of the targeted model. This model is also trained to identify which inputs could have produced those outputs.
The OmniGlot dataset is composed of hand-drawn characters from different alphabets. It was created to better challenge ML models, particularly on tasks related to generalization.
The OmniGlot dataset will be split into separate subsets to support the different stages of the model training and evaluation process (a data-loading sketch follows this list):
Training Subset: This is used to train the target model. It is made up of characters from twenty different alphabets.
Adversary Training and Testing Subsets: The rest of the characters are split into 2 sets of 5 alphabets. One of these sets is used to train the attacker model, and the other is used to test the model's ability to reconstruct the input characters.
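As a rough illustration of how this split might be set up, the following sketch assumes the TensorFlow Datasets build of Omniglot, which exposes an alphabet index for each example; the alphabet ranges, image size, and variable names are illustrative rather than the lab's exact code.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Load Omniglot; each example carries an 'image', a character 'label',
# and an 'alphabet' index.
ds = tfds.load("omniglot", split="train+test", as_supervised=False)

def preprocess(example):
    # Convert to grayscale, downsample, and scale pixels to [0, 1].
    image = tf.image.rgb_to_grayscale(example["image"])
    image = tf.image.resize(image, (28, 28)) / 255.0
    return image, example["label"], example["alphabet"]

ds = ds.map(preprocess)

# Illustrative split: 20 alphabets for the target model,
# 5 for adversary training, 5 for adversary testing.
target_train_ds = ds.filter(lambda img, lbl, a: a < 20)
adv_train_ds    = ds.filter(lambda img, lbl, a: (a >= 20) & (a < 25))
adv_test_ds     = ds.filter(lambda img, lbl, a: (a >= 25) & (a < 30))
```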
On-Device Model: The first part of the target model is designed to run on mobile devices (similar to the concept from module 2). It has convolutional layers, batch normalization, max pooling, and dropout layers. This architecture is meant to be appropriate for mobile devices, which have limited computational resources. This model is responsible for the initial processing of the input images.
Central Server Model: The second part of the model is on a server that has access to more computing power. It is responsible for the final classification.
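The two-stage architecture described above might look roughly like the Keras sketch below; the layer sizes, dropout rates, and NUM_CLASSES value are illustrative assumptions rather than the lab's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 964  # illustrative; must match the number of character classes

# Stage one: lightweight on-device feature extractor.
on_device_model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Dropout(0.25),
    layers.Flatten(),
], name="on_device")

# Stage two: central-server classification head.
server_model = tf.keras.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
], name="server")

# End-to-end target model: on-device features feed the server classifier.
target_model = tf.keras.Sequential([on_device_model, server_model], name="target")
```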
The model will use the Adam optimizer as well as the cross-entropy loss function.
During training, the model's performance will be judged using accuracy for both the training and validation datasets. Early stopping and model checkpoints will be used to prevent overfitting and to save the best model weights during training.
tf.GradientTape is a key TensorFlow feature that enables automatic differentiation: the gradients of the loss with respect to the model's variables are computed, which is the essence of backpropagation in neural networks.
The training also uses a validation step, which allows the model's performance to be measured against unseen data. This helps to ensure that the model isn't simply memorizing the training data (i.e., it prevents overfitting).
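A minimal sketch of a custom training step built around tf.GradientTape, reusing the hypothetical target_model from the architecture sketch above; the Adam optimizer and cross-entropy loss follow what the lab describes.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

train_loss = tf.keras.metrics.Mean(name="train_loss")
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name="train_accuracy")

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = target_model(images, training=True)
        loss = loss_fn(labels, predictions)
    # Automatic differentiation: gradients of the loss w.r.t. the trainable variables.
    gradients = tape.gradient(loss, target_model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, target_model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)
```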
The loss and accuracy are reset at the end of each epoch. If the validation loss does not improve by a minimum amount over a specified number of epochs (patience), early stopping will be triggered, and the training will come to an end (again, this is to prevent overfitting).
Whenever there is an improvement in the validation loss, the model weights are saved. This allows the model to be restored to its best state at a later point in time.
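The epoch loop tying together validation, metric resets, early stopping, and checkpointing might look like the sketch below; PATIENCE, MIN_DELTA, the checkpoint path, and the batched train_ds/val_ds pipelines are assumptions for illustration.

```python
val_loss_metric = tf.keras.metrics.Mean(name="val_loss")
PATIENCE, MIN_DELTA = 5, 1e-3          # illustrative early-stopping settings
best_val_loss, epochs_without_improvement = float("inf"), 0

for epoch in range(100):
    for images, labels in train_ds:
        train_step(images, labels)
    for images, labels in val_ds:
        predictions = target_model(images, training=False)
        val_loss_metric(loss_fn(labels, predictions))

    val_loss = val_loss_metric.result()
    if val_loss < best_val_loss - MIN_DELTA:
        best_val_loss, epochs_without_improvement = val_loss, 0
        target_model.save_weights("best_target_model.weights.h5")  # checkpoint the best weights
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= PATIENCE:
            break  # early stopping: no sufficient improvement for PATIENCE epochs

    # Reset the loss and accuracy metrics at the end of each epoch.
    train_loss.reset_state()
    train_accuracy.reset_state()
    val_loss_metric.reset_state()
```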
The goal of the adversary is to be able to reconstruct the input data from the target model's output. This is done by using the outputs of the original model to train a new model in order to create data that is similar to the original input.
The adversary uses a deconvolutional network, also known as a transposed convolutional network. This type of network architecture is mainly used in tasks like image generation or super-resolution. It reverses the forward pass of a convolutional network and uses transposed convolutional layers to return the condensed information back into a higher-resolution image.
It is designed to use the output probabilities of the target model as inputs and create an image. When passed back through the target model, this image should produce a similar probability distribution.
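A hedged sketch of what such a transposed-convolution decoder could look like, mapping the target model's NUM_CLASSES-dimensional probability vector back to a 28x28 image; the layer sizes are illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 964  # must match the target model's output size (illustrative)

attack_model = tf.keras.Sequential([
    layers.Input(shape=(NUM_CLASSES,)),           # the target model's output probabilities
    layers.Dense(7 * 7 * 64, activation="relu"),
    layers.Reshape((7, 7, 64)),
    # Transposed convolutions upsample the condensed representation back to image size.
    layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu"),
    layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid"),  # 28x28x1 image in [0, 1]
], name="attack_decoder")
```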
Attack Input Generation: For each batch of images from the attacker dataset, the target model generates output probabilities. These probabilities become the inputs to the attack model.
Image Generation: The attack model uses these inputs to create images. The goal is to recreate the images that were originally fed to the target model.
Loss Calculation: Mean Squared Error (MSE) is used to gauge the difference between these generated images and the originals. The goal is to minimize this loss.
Backpropagation: Gradients are computed for the loss with respect to the attack model's parameters, and the optimizer updates the model using this information.
Early Stopping: Early stopping is used to prevent overfitting. If the MSE doesn't improve by a specified minimum amount (min_delta) after a specified number of epochs (patience), training is stopped. A sketch of this training loop follows.
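The steps above could be implemented roughly as follows, assuming the hypothetical target_model and attack_model from the earlier sketches; an epoch loop with early stopping, analogous to the one shown for the target model, would wrap this step.

```python
import tensorflow as tf

attack_optimizer = tf.keras.optimizers.Adam()
mse = tf.keras.losses.MeanSquaredError()

@tf.function
def attack_train_step(images):
    # 1. Attack input generation: the target model's output probabilities.
    target_probs = target_model(images, training=False)
    with tf.GradientTape() as tape:
        # 2. Image generation: reconstruct images from the probabilities.
        reconstructions = attack_model(target_probs, training=True)
        # 3. Loss calculation: MSE between reconstructions and the originals.
        loss = mse(images, reconstructions)
    # 4. Backpropagation: update only the attack model's parameters.
    gradients = tape.gradient(loss, attack_model.trainable_variables)
    attack_optimizer.apply_gradients(zip(gradients, attack_model.trainable_variables))
    return loss
```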
The adversary's specific goals are to:
Reconstruct Inputs
Minimize Reconstruction Error
Preserve Privacy
The third subset, the testing dataset, is necessary to evaluate the adversary's model. It contains unseen examples that the adversary model has not been trained on. This allows us to assess the model's ability to generalize: how well can it reconstruct the images?
The adversary's reconstruction process involves the following steps:
Input Generation
Image Reconstruction
Post-Processing
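At inference time these three steps reduce to a few lines, again using the hypothetical models from the earlier sketches; clipping to [0, 1] is one illustrative post-processing choice.

```python
import tensorflow as tf

def reconstruct(images):
    # Input generation: query the target model for its output probabilities.
    probs = target_model(images, training=False)
    # Image reconstruction: decode the probabilities back into images.
    reconstructions = attack_model(probs, training=False)
    # Post-processing: clamp pixel values to the valid [0, 1] range.
    return tf.clip_by_value(reconstructions, 0.0, 1.0)
```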
The adversary's success has significant implications for privacy:
Character Complexity: Because of the added complexity of the OmniGlot dataset, an adversary that successfully reconstructs this data has demonstrated a highly capable inversion process.
Individual vs. Class Representation: If the model were advanced enough to reconstruct specific individuals' handwriting, the privacy concerns would be even greater.
Generalizing to Other Domains
Mitigation Strategies
Differential Privacy (DP) is used to safeguard the privacy of individuals in a dataset. The principle behind DP is that the presence or absence of any particular datapoint does not significantly affect the result. This ensures individual privacy.
TensorFlow Privacy (TFP) is a library that expands on preexisting TensorFlow capabilities in order to allow for the integration of DP with machine learning training.
In order to implement DP with the target model, we need to make a number of changes:
Gradient Clipping: Gradients are clipped so that their L2 norms are no larger than a specified amount (l2_norm_clip). This allows us to ensure that no one example can have too large of an effect.
Noise Addition: Once the gradients are clipped, we add Gaussian noise to the gradients as well. This noise makes it even more difficult to be able to identify whether or not any one data point is included in the training dataset.
Microbatches: Splitting each batch into microbatches enables finer control over how much impact each example has.
Differentially Private Optimizer: A specialized optimizer, DPKerasAdamOptimizer, is used to compute the clipped, noised gradients.
Sparse Categorical Crossentropy: The loss function is configured to operate per example, so that the loss is computed separately for each example during the gradient calculation. A code sketch of this DP setup follows.
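A hedged sketch of how these pieces might be wired together with TensorFlow Privacy; the specific l2_norm_clip, noise_multiplier, microbatch, and learning-rate values are illustrative, not recommendations.

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasAdamOptimizer

# Illustrative DP hyperparameters; real values must be tuned against the privacy budget.
dp_optimizer = DPKerasAdamOptimizer(
    l2_norm_clip=1.0,        # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,    # Gaussian noise standard deviation relative to the clip norm
    num_microbatches=32,     # must evenly divide the batch size
    learning_rate=0.001,
)

# Per-example loss: reduction is disabled so each example's loss stays separate.
dp_loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE
)
```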
The train_step function is where the differential privacy tools are applied. The gradients of the loss are calculated, clipped, and noised, and the optimizer then updates the model's weights.
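In the lab, DPKerasAdamOptimizer performs the clipping and noising internally; purely to illustrate the mechanics described above (not the library's actual implementation), a hand-rolled version of the same idea might look like the following, reusing the hypothetical dp_loss_fn from the previous sketch.

```python
import tensorflow as tf

L2_NORM_CLIP = 1.0       # mirrors l2_norm_clip above (illustrative)
NOISE_MULTIPLIER = 1.1   # mirrors noise_multiplier above (illustrative)
NUM_MICROBATCHES = 32    # the batch size must be divisible by this

base_optimizer = tf.keras.optimizers.Adam()

def manual_dp_train_step(model, images, labels):
    # Split the batch into microbatches for finer-grained clipping.
    image_mbs = tf.split(images, NUM_MICROBATCHES)
    label_mbs = tf.split(labels, NUM_MICROBATCHES)
    summed_grads = [tf.zeros_like(v) for v in model.trainable_variables]

    for x_mb, y_mb in zip(image_mbs, label_mbs):
        with tf.GradientTape() as tape:
            per_example_loss = dp_loss_fn(y_mb, model(x_mb, training=True))
            mb_loss = tf.reduce_mean(per_example_loss)
        grads = tape.gradient(mb_loss, model.trainable_variables)
        # Gradient clipping: bound each microbatch's total L2 norm.
        clipped, _ = tf.clip_by_global_norm(grads, L2_NORM_CLIP)
        summed_grads = [s + g for s, g in zip(summed_grads, clipped)]

    # Noise addition: Gaussian noise scaled by the clip norm, then average.
    noised_grads = [
        (g + tf.random.normal(tf.shape(g), stddev=NOISE_MULTIPLIER * L2_NORM_CLIP))
        / NUM_MICROBATCHES
        for g in summed_grads
    ]
    base_optimizer.apply_gradients(zip(noised_grads, model.trainable_variables))
```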
The second-stage layers of a model are the layers closest to the output end of the model. Usually, an adversary would not have access to the final output; instead, it may only have access to the "middle" features of the model.
By cutting off the second-stage layers, we're able to simulate the adversary's limited access to the target model.
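With the hypothetical two-stage model from earlier, this truncation amounts to handing the adversary only the on-device stage; saving its weights (discussed next) lets the same frozen feature extractor be restored later.

```python
# The adversary's view of the target is only the stage-one (on-device) model;
# the server-side classification head is cut off entirely.
stage_one_model = on_device_model          # hypothetical model from the earlier sketch
stage_one_model.trainable = False          # the adversary can only query it, never train it
stage_one_model.save_weights("stage_one.weights.h5")  # persist the trained stage-one state

# images: a batch drawn from the adversary's dataset.
intermediate_features = stage_one_model(images, training=False)
```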
Saving the stage-one model's weights is primarily intended to achieve the following:
Consistency
Transfer Learning
Isolation of Variables
This process is very similar to training a model normally; however, some additional considerations apply given that we are implementing DP.
The training routine for the adversary:
Defining the Loss Function: The adversary uses a Mean Squared Error (MSE) loss function.
Optimization: Adam.
Metrics: The MSE is also used to measure performance.
Training Step: The adversary gets intermediate representations from the target model and attempts to reconstruct the original images. The gradients are calculated based on the MSE, and they are applied to the adversary's model parameters.
Early stopping is implemented to prevent overfitting.
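Under the same assumptions as before, the adversary's training step might look like the sketch below, where adversary_model is a hypothetical transposed-convolution decoder whose input shape matches the stage-one feature dimension; an epoch loop with early stopping would wrap it.

```python
import tensorflow as tf

adv_optimizer = tf.keras.optimizers.Adam()
mse_loss = tf.keras.losses.MeanSquaredError()
train_mse = tf.keras.metrics.MeanSquaredError(name="train_mse")

@tf.function
def adversary_train_step(images):
    # The adversary only sees the DP-trained stage-one model's intermediate features.
    features = stage_one_model(images, training=False)
    with tf.GradientTape() as tape:
        reconstructions = adversary_model(features, training=True)
        loss = mse_loss(images, reconstructions)
    gradients = tape.gradient(loss, adversary_model.trainable_variables)
    adv_optimizer.apply_gradients(zip(gradients, adversary_model.trainable_variables))
    train_mse.update_state(images, reconstructions)
    return loss
```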
For testing and evaluation:
Testing Setup: The adversary is tested using a separate dataset that it wasn't trained on.
Performance Metrics: The adversary's performance is measured using the Mean Squared Error between the reconstructed images and the original images.
Differential Privacy Considerations: How well the adversary can reconstruct the original images from what it receives from the DP model is indicative of the achieved privacy level: poorer reconstructions imply stronger privacy protection.
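A short evaluation sketch under the same assumptions, averaging the reconstruction MSE over the held-out adversary test set (adv_test_ds here yields image/label/alphabet tuples as in the data-loading sketch).

```python
import tensorflow as tf

test_mse = tf.keras.metrics.MeanSquaredError(name="test_mse")

for images, _, _ in adv_test_ds.batch(32):
    features = stage_one_model(images, training=False)
    reconstructions = adversary_model(features, training=False)
    test_mse.update_state(images, reconstructions)

# A higher MSE suggests the DP-trained target leaks less reconstructable information.
print("Adversary test reconstruction MSE:", float(test_mse.result()))
```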