Impacts of training data and model bias
Does the sheer volume of big data overcome the treacherous reality of garbage in, garbage out? Not by itself. Messy data is acceptable only if enough data from a variety of sources and distributions can be learned without biasing the outcomes of the learning. Training on big data in a centralized location takes a lot of time and huge computational and storage resources. We would also need methods to measure and reduce model bias without directly collecting and accessing sensitive and private data, because doing so would conflict with some of the privacy regulations discussed previously. FL is a form of distributed and collaborative learning, which becomes critical for reducing data and model bias and for absorbing the messiness of the data. With collaborative and distributed learning, we could significantly increase data accessibility and the efficiency of an entire learning process that is often very expensive and time-consuming. It gives us a chance to break through the limitations that big data training used to have, as discussed in the following sections.
Expensive training of big data
According to Flexera's 2022 State of the Cloud Report (https://www.flexera.com/blog/cloud/cloud-computing-trends-2022-state-of-the-cloud-report), 37% of enterprises spend more than $12 million a year and 80% spend more than $1.2 million a year on public cloud. Training in the cloud is not cheap, and this cost can be expected to rise significantly as demand for AI and ML increases. Sometimes, big data cannot be fully used for ML training because of the following issues:
- Big data storage: Big data storage is a compute and storage architecture that collects and manages large datasets for AI applications or real-time analytics. Enterprises worldwide are paying more than $100 billion just for cloud storage and data center costs (https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/). While some of the datasets are critical for the applications these enterprises provide, what the enterprises really want is often the business intelligence that can be extracted from the data, not the data itself.
- Significant training time: Building and training an ML model that can be delivered as a real product takes a significant amount of time, not only for the training process itself but also for the preparation of the ML pipelines. Therefore, in many cases, the true value of the intelligence is lost by the time the ML model is delivered.
- Huge computation: Training an ML model often consumes significant computational resources. For example, an ML task such as manipulating a Rubik's Cube with a robotic hand could require more than 1,000 computers, plus a dozen machines running specialized graphics chips for several months.
- Communications latency: To form big data, especially in the cloud, a significant amount of data needs to be transferred to the server, which in itself causes communications latency. In most use cases, FL requires much less data to be transferred from local devices or learning environments to the server, called an aggregator, whose role is to synthesize the local ML models collected from those devices.
- Scalability: In traditional centralized systems, scalability becomes an issue because of the complexity of big data and its costly infrastructure, such as huge storage and computing resources in the cloud server environment. An FL server only performs aggregation: it synthesizes the trained local models to update the global model. Therefore, both system and learning scalability increase significantly, because ML training is conducted on edge devices in a distributed manner rather than only on a single centralized learning server.
FL effectively utilizes distributed computational resources that can be used for lightweight training of ML models. Whether training happens on physical devices or on virtual instances in a cloud system, parallelizing the model training process across distributed environments often speeds up learning.
In addition, once the trained models are collected, the FL system can quickly synthesize them into an updated ML model, called a global model, that absorbs the learnings from the edge. This makes it possible to deliver the intelligence in near real time.
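The synthesis step can be illustrated with a minimal sketch. The text does not say which aggregation rule the aggregator uses, so this example assumes a sample-weighted average of the local parameters (in the style of federated averaging), which is one common choice. The function name `aggregate`, the dictionary layout of the models, and the sample counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical model format: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def aggregate(local_models: Sequence[Weights],
              sample_counts: Sequence[int]) -> Weights:
    """Return the sample-weighted average of local model parameters.

    Assumption: every local model has the same parameter names and shapes.
    """
    if not local_models or len(local_models) != len(sample_counts):
        raise ValueError("need exactly one sample count per local model")
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    global_model: Weights = {}
    for name in local_models[0]:
        averaged = [0.0] * len(local_models[0][name])
        for model, count in zip(local_models, sample_counts):
            for i, value in enumerate(model[name]):
                # Each model contributes in proportion to its data size.
                averaged[i] += value * count / total
        global_model[name] = averaged
    return global_model


# Two hypothetical edge devices send only their trained parameters,
# never their raw training data.
local_models = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
sample_counts = [100, 300]

global_model = aggregate(local_models, sample_counts)
print(global_model)
```

Only the model parameters and the sample counts cross the network here, which is why the server-side work stays light compared with centralized training on the raw data.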
Model bias and training data
ML bias happens when an ML algorithm generates results that are systemically prejudiced because of erroneous assumptions in the ML process. ML bias is also sometimes called algorithm bias or AI bias.
Yann LeCun, the 2018 Turing Award winner for his outstanding contribution to the development of DL, says “ML systems are biased when data is biased” (https://twitter.com/ylecun/status/1274782757907030016). The remark concerned a computer vision (CV) model trained on the Flickr-Faces-HQ dataset compiled by a team at Nvidia. The face upsampling system rendered many people as white because the network was pre-trained on Flickr-Faces-HQ data, which mainly contains pictures of white people. The model architecture is not what dictates this output. Hence, the conclusion is that a racially skewed dataset caused a neutral model to produce biased outcomes.
Productive conversations about AI and ML biases have been led by the former lead of AI Ethics at Google. The 2018 publication of the Gender Shades paper demonstrated race and gender bias in major facial recognition models, and lawmakers in Congress have sought to prohibit the use of the technology by the US federal government. Tech companies including Amazon, IBM, and Microsoft also agreed to suspend or terminate sales of facial recognition models to the police. Scientists and engineers are encouraged to take an interventionist approach to data collection: specify the objectives of model development, form a strict policy for data collection, and conduct a thorough appraisal of the collected data to avoid biases. Details are available on the FATE/CV website (https://sites.google.com/view/fatecv-tutorial/home).
FL could be one of the most promising ML technologies for overcoming data-silo issues. Very often, the data is not even accessible or usable for training, which causes significant bias in data and models. FL helps to overcome bias by resolving the data privacy and silo issues that are the bottleneck to fundamentally avoiding data bias. In this context, FL is becoming a breakthrough in the implementation of big data services and applications, as thoroughly investigated in https://arxiv.org/pdf/2110.04160.pdf.
Also, there are several techniques that try to mitigate model bias in FL itself, such as Reweighing and Prejudice Remover, both detailed in https://arxiv.org/pdf/2012.02447.pdf.
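As an illustration of the first technique, the following is a minimal sketch of reweighing as a preprocessing step. It assumes the standard formulation, in which each training example with sensitive group s and label y receives the weight P(s) * P(y) / P(s, y), so that group and label look statistically independent in the weighted data. The function name and the toy data are hypothetical, and applying it locally at each FL participant before training is an assumption made here, not a description of the cited paper's exact procedure.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def reweighing_weights(groups: Sequence[Hashable],
                       labels: Sequence[Hashable]) -> List[float]:
    """Return one weight per example: P(group) * P(label) / P(group, label)."""
    if len(groups) != len(labels):
        raise ValueError("groups and labels must have the same length")
    n = len(groups)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    # With counts, the ratio of probabilities simplifies to
    # count(group) * count(label) / (n * count(group, label)).
    return [
        group_counts[g] * label_counts[y] / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Hypothetical local dataset at one participant: group "a" receives the
# positive label far more often than group "b".
groups = ["a", "a", "a", "b", "b", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group: str) -> float:
    """Weighted share of positive labels within one group."""
    rows = [(w, y) for w, g, y in zip(weights, groups, labels) if g == group]
    return sum(w for w, y in rows if y == 1) / sum(w for w, _ in rows)


print(weighted_positive_rate("a"), weighted_positive_rate("b"))
```

The weights would then be passed to a training routine that accepts per-example weights; over-represented (group, label) combinations get weights below 1 and under-represented ones get weights above 1.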