Joint work with Pedro Esperança and Chris Holmes.
The prevalence of big data presents challenges not only of computational intractability, but also of ensuring the privacy of an ever-growing amount of potentially sensitive data, which is increasingly stored with third-party 'cloud' providers. The ideal solution is to store only encrypted versions of the data, but this appears to preclude any analysis without first decrypting the data and so risking its disclosure.
Recent advances in cryptography enable limited computational operations to be performed without first decrypting, opening up the prospect of fully encrypted data analysis without compromising security. However, the constraints associated with these cryptographic schemes mean that many traditional statistical models cannot simply be fitted on encrypted data without modification. Furthermore, the substantial computational burden of encrypted calculations makes scalability an important constraint.
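To make the idea of computing without decrypting concrete, here is a minimal sketch of a textbook Paillier cryptosystem in Python. It is an illustration only and not the scheme used in this work: Paillier is additively (not fully) homomorphic, the primes are toy-sized and insecure, and the function names (`keygen`, `encrypt`, `decrypt`, `add`, `scalar_mul`) are hypothetical names chosen for this example.

```python
import math
import random

def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)          # modular inverse of phi mod n (Python 3.8+)
    return n, (phi, mu)           # public key n, secret key (phi, mu)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n, using g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:   # r must be invertible mod n
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sec, c):
    """Recover the plaintext using the secret key."""
    phi, mu = sec
    u = pow(c, phi, n * n)
    return ((u - 1) // n * mu) % n

def add(n, c1, c2):
    """Ciphertext of m1 + m2 (mod n), computed without decrypting."""
    return (c1 * c2) % (n * n)

def scalar_mul(n, c, k):
    """Ciphertext of k * m (mod n) for a plaintext constant k."""
    return pow(c, k, n * n)

n, sec = keygen()
c1, c2 = encrypt(n, 12), encrypt(n, 30)
total = decrypt(n, sec, add(n, c1, c2))        # sum computed on ciphertexts
scaled = decrypt(n, sec, scalar_mul(n, c1, 5)) # constant multiple of a ciphertext
```

Only sums and multiplication by known constants are available here; operations such as comparisons or division have no direct encrypted counterpart, which hints at why standard statistical procedures generally need to be reformulated before they can run on encrypted data.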
Here we present new statistical machine learning methods designed to learn on such fully homomorphically encrypted (FHE) data. In this talk we give an overview of one of the two tailored machine learning algorithms we have recently proposed, completely random forests. We demonstrate that this technique performs competitively on a variety of classification data sets and discuss the computational practicalities of this and other FHE methods. We also briefly describe our most recent work on a method for fitting linear models, which also admits ridge penalties.
All our illustrations are run using an open-source R package, with all cryptographic functions coded in high-performance parallelised C++ to ameliorate some of the computational cost of performing homomorphic operations on encrypted data.