Protecting Customer Data with Linear Transformation • Data Fukuro

Project details:

This project develops a secure data-transformation method that protects sensitive customer information while keeping model performance intact. A linear regression model was trained on the original data and then trained and validated again on the same data after multiplying the features by an invertible matrix. Identical R² scores confirmed that the transformation preserves predictive power, while the original values stay hidden from anyone who does not hold the matrix.

Date: April 2022

Link: GitHub Repository

Tags: Data Cleaning, Feature Engineering, Linear Algebra, Model Validation, NumPy, Pandas, Regression Modeling, Scikit-learn

Description

Business Context & Problem

An insurance company must protect clients’ personal information while still using the data for analytical tasks. Traditional anonymisation can sometimes strip away useful patterns, so the company needs a transformation method that hides the original values but keeps the relationships between features intact. This project explores whether linear transformations can achieve both goals.

Data & Analytical Approach

The dataset included several customer-level features relevant for insurance modelling. After cleaning and preparing the data, an invertible transformation matrix was generated and the feature matrix was multiplied by it, so every transformed column is a mix of all the original columns. The transformation is designed so that the original personal information cannot be reconstructed without knowing the matrix, while the linear structure of the data is preserved.
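The transformation step can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the project's actual code: the feature matrix `X` is synthetic stand-in data, and the helper `random_invertible_matrix`, the symbol `P`, and the conditioning threshold are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical stand-in for the cleaned feature matrix (rows = customers).
X = rng.normal(size=(100, 4))


def random_invertible_matrix(n, rng, max_cond=1e6):
    """Draw random n x n matrices until one is comfortably invertible.

    A random matrix is almost always invertible, but a badly conditioned
    one would make the inverse numerically unreliable, so the condition
    number is checked explicitly. The threshold is an assumed value.
    """
    while True:
        candidate = rng.random(size=(n, n))
        if np.linalg.cond(candidate) < max_cond:
            return candidate


P = random_invertible_matrix(X.shape[1], rng)

# Obfuscate: each transformed column is a linear mix of all original columns.
X_transformed = X @ P

# Recovery is only possible for someone who holds P (and hence its inverse).
X_recovered = X_transformed @ np.linalg.inv(P)
recovered_ok = np.allclose(X, X_recovered)
```

Checking the condition number rather than only the determinant is a design choice: a matrix can be technically invertible yet so close to singular that the recovered values drift from the originals.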

Statistical / ML Analysis

A linear regression model was fitted on the original dataset, and a second one was fitted on the transformed dataset. To validate that the transformation preserves the model’s behaviour, both models were evaluated using R². The scores were identical, which shows that the transformation does not distort the linear relationships between features and target or reduce predictive quality. Additional checks confirmed that multiplying the transformed features by the inverse matrix recovers the original values. The transformation is therefore reversible, but only for someone who holds the matrix.
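The reason the scores match is that an invertible matrix does not change the space spanned by the feature columns, so least squares finds the same predictions; the fitted weights on the transformed data are simply the inverse matrix applied to the original weights. The sketch below checks this on synthetic data. It is an illustration under assumptions: plain NumPy least squares stands in for scikit-learn's `LinearRegression`, and the data, the matrix `P`, and the helpers `fit_ols` and `r2` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data: 200 customers, 4 features, noisy linear target.
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + 0.7 + rng.normal(scale=0.5, size=200)

# Random matrix, redrawn until it is comfortably invertible.
P = rng.random(size=(4, 4))
while np.linalg.cond(P) > 1e6:
    P = rng.random(size=(4, 4))
X_t = X @ P


def fit_ols(features, target):
    """Least-squares fit with an intercept; returns (intercept, weights)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[0], coef[1:]


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


b_orig, w_orig = fit_ols(X, y)
b_trans, w_trans = fit_ols(X_t, y)

r2_orig = r2(y, X @ w_orig + b_orig)
r2_trans = r2(y, X_t @ w_trans + b_trans)

# Same predictive quality on both versions of the data.
scores_match = np.isclose(r2_orig, r2_trans)
# Weights on transformed data equal inv(P) applied to the original weights.
weights_match = np.allclose(w_trans, np.linalg.inv(P) @ w_orig)
```

The match holds up to floating-point rounding, which is why the comparison uses `np.isclose` and `np.allclose` rather than exact equality.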

Key Insights & Final Recommendations

The experiment confirmed that an invertible linear transformation can hide sensitive customer attributes while keeping the data fully usable for linear modelling. This approach allows the company to share or store transformed data without compromising analytical accuracy, provided the transformation matrix itself is kept secret.
The method can be integrated into the company’s data-processing pipeline to secure personal information before it reaches modelling teams or external partners.