In this article, author Brendon Machado discusses how data owners and data scientists can work together to create models on privatized data using federated learning, and shows how to apply the technique to a loan risk prediction use case.
Federated Machine Learning for Loan Risk Prediction
Brendon Machado
Srini Penchikala
A model is only as strong as the data it’s provided, but what happens when data isn’t readily accessible or contains personally identifying information? In this case, can data owners and data scientists work together to create models on privatized data? Federated learning shows that it is indeed possible to pursue advanced models while still keeping data in the hands of data owners.
This new technology is readily applicable to financial services, as banks have extremely sensitive information ranging from transaction history to demographic information for customers. In general, it’s very risky to give data to a third party to perform analytical tasks. However, through federated learning, the data can be kept in the hands of financial institutions and the intellectual property of data scientists can also be preserved. In this article, we will demystify the technology of federated learning and touch upon one of the many use cases in finance: loan risk prediction.
Federated Learning, in short, is a method to train machine learning (ML) models securely via decentralization. That is, instead of aggregating all the data necessary to train a model, the model is instead sent to each individual data owner. Then, after models are trained on each subset of the data, the updated weights are sent back to the coordinator and averaged together for a final model. Through this approach, the data never leaves the hands of its original owner, ensuring a higher level of security and trust between data owner and data scientist without a compromise in model performance.
Currently, federated learning carries a slight additional computational cost, and the most common federated learning frameworks mainly support neural networks. Despite this, federated learning has great potential to transform the way models are trained, thanks to its vast improvements in data privacy and security.
Federated learning as a methodology is effective, yet it still has some flaws by itself. How can the intellectual property in a data scientist’s AI models be kept private? What techniques can a data scientist use to explore a private dataset? PySyft, an open-source library created by OpenMined, enables fully private AI by combining federated learning with two other key concepts: Secure Multi-Party Computation (SMPC) and Differential Privacy. Google’s TensorFlow Federated also provides federated learning capabilities, as it integrates with TensorFlow and Keras for a deep learning backend.
Let us say we have some data which we want to perform an operation on. In this case, let’s say our data is the number 6 and the operation is multiplication by 2. How can we get a third party to complete this operation without knowledge of the data? We use Secure Multi-Party Computation, of course! Instead of dealing with the data as a whole, we can split it into multiple parts, perform the operation, and combine each part’s result back together.
Notice, each individual only has a part of the data (colloquially referred to as “shared encryption”), but we are still able to obtain the correct result of the operation. In the case of performing multiplication, SMPC is relatively simple; more advanced algorithms like backpropagation or linear algebra are much more difficult to compute. Luckily, PySyft integrates with PyTorch in a process known as “hooking” and brings SMPC into deep learning, allowing models to be trained by data owners without knowledge of the weights or updates.
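The multiplication example can be sketched with additive secret sharing, one common building block of SMPC. This is a minimal illustration rather than PySyft's implementation: the modulus `Q`, the party count, and the helper names `share` and `reconstruct` are assumptions made for the example.

```python
import secrets

# Public modulus for the shares (assumption: any sufficiently large integer works).
Q = 2**61 - 1

def share(secret, n_parties=3):
    """Split `secret` into additive shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recombine shares; only the holder of all shares can do this."""
    return sum(shares) % Q

shares = share(6)

# Each party multiplies its own share by the public constant 2.
# No single party ever sees the number 6.
doubled = [(2 * s) % Q for s in shares]

result = reconstruct(doubled)
print(result)  # 12
```

Each individual share is a uniformly random number, so it reveals nothing about the secret on its own. Multiplying by a public constant works share by share; multiplying two secret values together needs extra protocol steps that this sketch does not cover.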
As a data scientist, it’s imperative to look at several statistics to gain an elementary understanding of a dataset. However, any individual record of the dataset will affect these statistics in some manner. Differential Privacy is a rigorous, mathematical definition of privacy that measures how much a statistic changes when individual rows are included or excluded from the dataset.
The most common way to measure privacy today is known as ε-differential privacy. The concept behind ε-differential privacy is that an individual whose record is not included in some data has perfect privacy. Thus, we want to limit the dependence of any statistical function on any single individual.
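For reference, the standard formal statement of ε-differential privacy is given below. The symbols are introduced here for illustration and do not appear in the original text: $\mathcal{M}$ is a randomized algorithm, $D$ and $D'$ are any two datasets that differ in a single record, and $S$ is any set of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
```

A smaller ε forces the output distributions with and without any one record to be closer together, which means the result reveals less about that record.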
For example, the mean equally reflects all entries in a group, so by observing the change in mean when removing or adding an entry it’s possible to calculate the exact value of an entry.
As shown in the figure above, when removing one individual from the statistical calculation it’s possible to understand exactly what the removed individual’s value is — therefore the mean is not a differentially private statistic.
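The arithmetic behind this leak is short enough to sketch directly. The income values below are hypothetical and chosen only for illustration; the last entry plays the role of the individual who is removed.

```python
from statistics import mean

# Hypothetical private values; the last one belongs to the targeted individual.
incomes = [52_000, 61_000, 47_000, 58_000, 250_000]

mean_with = mean(incomes)          # statistic published with the individual
mean_without = mean(incomes[:-1])  # statistic published without the individual

# An observer who sees both means and knows the group size can solve for
# the missing value: n * mean_with - (n - 1) * mean_without.
n = len(incomes)
recovered = n * mean_with - (n - 1) * mean_without

print(recovered == incomes[-1])  # True
```

Nothing about the removed record stays hidden, which is exactly what a differentially private statistic is meant to prevent.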
Randomized response is one example of a differentially private algorithm. Imagine we are trying to find out if individuals have an iPhone or an Android phone. The procedure of collecting data is as follows:
Through this process (in particular the second coin toss in step 2), each individual has plausible deniability as to how they answered the question. However, most people who have an iPhone will still answer as such, and the same holds for those who have an Android phone. Therefore, it is possible to estimate roughly what percentage of the population has an iPhone or an Android phone.
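A small simulation shows how the estimate can be recovered from the noisy answers. It assumes the textbook two-coin variant of randomized response: on heads the person answers truthfully, and on tails a second coin decides the answer. The procedure described above may differ in its details, and the 70% iPhone rate and the function names are hypothetical.

```python
import random

def randomized_response(has_iphone, rng):
    """Return the reported answer (True means 'iPhone') for one person."""
    if rng.random() < 0.5:       # first coin is heads: answer truthfully
        return has_iphone
    return rng.random() < 0.5    # first coin is tails: second coin decides

def estimate_true_rate(reports):
    """Invert the noise: observed = 0.5 * true_rate + 0.25."""
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5

rng = random.Random(0)           # fixed seed so the run is repeatable
true_rate = 0.7                  # hypothetical share of iPhone owners
population = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomized_response(person, rng) for person in population]

estimate = estimate_true_rate(reports)
print(round(estimate, 2))
```

No single report can be trusted, since half of the answers are produced by a coin, yet the aggregate estimate lands close to the true rate once enough people respond.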
PySyft aims to track how much privacy is being lost through all analysis-related operations; data scientists and data owners can now collaborate to establish “privacy budgets” and ensure not too much PII is leaked.
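One simple way to picture a privacy budget is as a running total of ε values, using the basic composition rule that the privacy losses of successive queries add up. The `PrivacyBudget` class below is a hypothetical illustration of that bookkeeping and is not PySyft's API.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total_epsilon - self.spent

    def charge(self, epsilon):
        """Record the cost of one query, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon

# The data owner and data scientist agree on a total budget up front.
budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)   # first analysis
budget.charge(0.5)   # second analysis

try:
    budget.charge(0.2)   # would bring the total to 1.1
    refused = False
except RuntimeError:
    refused = True

print(refused)  # True
```

Once the agreed total is used up, further queries are refused, which gives the data owner a hard limit on how much can be learned about any individual.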
The financial services industry, where applications manage very sensitive data, is a natural target for Federated Learning. Loan Risk Prediction is one specific example. Below, we will see how to get a basic Federated Learning application up and running. In doing so, we’ll be able to see the benefits of using PySyft to create private AI.
Now that a model has been trained, the process to get predictions to the data owner is very simple. This process is again facilitated by PySyft.
Now, let’s go in depth to see the essential steps of building a sample application using Federated Learning.
To initialize PySyft and ensure all deep learning operations are made secure, PySyft is “hooked” to the PyTorch library.
On the data owner’s end, the data now needs to be deployed via a WebsocketServerWorker, which enables federated learning. Different arguments are specified, such as host IP, the PySyft hook, and the specific port that Finastra’s PySyft client will connect to.
Now that a dataset is deployed from the Data Owner, we data scientists can connect via a WebsocketClientWorker defined in the PySyft library.
Next, we can specify the model and its training configuration. These will be sent to each WebsocketServerWorker (in this case there’s only one). Since PySyft sits on top of PyTorch, these are defined via Torch syntax.
Now, the function fit_model_on_worker handles several steps. Via Syft, the model and training configuration are encrypted, the model is trained securely on the data owner’s side, and lastly the updated model and its training loss are retrieved. All that needs to happen now is to call the function for each of the Data Owners. Although the computation is happening on the Data Owner side, all of the training is initiated by the data scientist!
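The round trip described here can be pictured without any federated learning library. The sketch below is not PySyft code and performs no encryption: the `DataOwner` class, its `fit` method, the logistic-regression model, and the toy loan records are all assumptions made for illustration. What it does show is the contract: weights go in, updated weights and a loss come out, and the records never leave the owner.

```python
import math

def predict(weights, x):
    """Logistic-regression probability of default for one feature vector."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(weights, features, labels):
    """Mean cross-entropy loss over a dataset."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in zip(features, labels):
        p = min(max(predict(weights, x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

class DataOwner:
    """Holds private records; only weights and a loss value leave it."""

    def __init__(self, features, labels):
        self._features, self._labels = features, labels

    def fit(self, weights, lr=0.5, epochs=200):
        w = list(weights)  # train a local copy of the received model
        n = len(self._labels)
        for _ in range(epochs):
            grad = [0.0] * len(w)
            for x, y in zip(self._features, self._labels):
                err = predict(w, x) - y
                for j, xj in enumerate(x):
                    grad[j] += err * xj / n
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
        return w, log_loss(w, self._features, self._labels)

# Toy records: [bias term, debt ratio] with label 1 meaning "defaulted".
bank = DataOwner(
    features=[[1.0, 0.1], [1.0, 0.2], [1.0, 0.4], [1.0, 0.6], [1.0, 0.8], [1.0, 0.9]],
    labels=[0, 0, 0, 1, 1, 1],
)

# The data scientist starts training and receives only weights and loss.
updated, loss = bank.fit([0.0, 0.0])
print(loss < math.log(2))  # True: better than the untrained model
```

In a real deployment the call would travel over the network to the remote worker, and the model would be protected in transit, as the article describes.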
Now, we have obtained potentially several models from each remote worker. All that remains is to combine all the models into a final model via averaging all weights and parameters together. Naturally, this is also handled by PySyft.
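Conceptually, the averaging step takes the same parameter from every returned model and averages it element by element. The following plain-Python sketch illustrates that idea under simple assumptions: each model is a dictionary of parameter lists, every model counts equally, and the name `federated_average` is hypothetical. PySyft provides its own helper for this step.

```python
def federated_average(models):
    """Average each named parameter element-wise across the given models."""
    n = len(models)
    return {
        name: [sum(values) / n for values in zip(*(m[name] for m in models))]
        for name in models[0]
    }

# Hypothetical parameters returned by two data owners.
model_a = {"weights": [0.2, -0.4], "bias": [0.1]}
model_b = {"weights": [0.6, 0.0], "bias": [0.3]}

averaged = federated_average([model_a, model_b])
print(averaged)
```

A common refinement is to weight each model by the number of records it was trained on, so that larger data owners have proportionally more influence on the final model.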
After running the training and averaging loop for the desired number of iterations, training is complete. The final model can then be sent to banks to predict the loan risk of new customers. In this scenario, because there is only one data owner, the model performance would be identical to that of a model trained in a typical ML pipeline. However, what happens to the performance when a model is trained on multiple data owners?
These code snippets are based on the Asynchronous Federated Learning Tutorial from the OpenMined GitHub repository, which trains a model on MNIST handwritten digit recognition data. In this example, the data is split up by digit. Multiple models are trained on subsets of the data and later combined. Let’s compare the performance of Federated Learning to the best performance in research.
Given the guarantee of 100% privacy, Federated Learning achieves very similar performance to traditional deep learning techniques.
We see two main benefits through this process. The first is the main goal of federated learning: privacy. No bank is able to learn the model architecture, and we are unable to gain access to any data. The second benefit is enhanced intelligence. Banks now gain access to more robust models that have been trained on a wide variety of data sources, because the model is the culmination of information from several other financial institutions. The applications of private AI and federated learning are immense: any use case where personal data is involved, from loans to credit risk, can reap the benefits of these new methodologies.
For more information about the framework, be sure to check out PySyft on GitHub.
Brendon Machado is a Data Scientist at Finastra’s Innovation Lab keen on bridging the gap between new developments in AI/ML and the Financial Services industry. Brendon holds a Master’s degree in Computer Science with a specialization in Machine Learning from the Georgia Institute of Technology.
InfoQ.com and all content copyright © 2006-2020 C4Media Inc.
SOURCE: https://www.w24news.com/news/federated-machine-learning-for-loan-risk-prediction/?remotepost=264223