FL is often said to be one of the most popular privacy-preserving AI technologies because private data does not have to be collected or shared with third-party entities to generate high-quality intelligence. In this section, therefore, we discuss data privacy, the bottleneck that FL tries to resolve in creating high-quality intelligence.
What is data privacy? In May 2021, HCA Healthcare announced that the company had struck a deal to share its patient records and real-time medical data with Google. Various media outlets quickly responded by warning the public about the deal, as Google had already drawn scrutiny for Project Nightingale, in which the tech giant allegedly exploited the sensitive data of millions of American patients. Given that over 80% of the public believes the potential risks of data collection by companies outweigh the benefits, according to a 2019 poll by the Pew Research Center, data sharing projects of such a scale are naturally seen as a threat to people’s data privacy.
Data privacy, also known as information privacy, is the right of individuals to control how their personal information is used; it requires third parties to handle, process, store, and use such information properly and in accordance with the law. It is often confused with data security, which ensures that data is accurate, reliable, and accessible only to authorized users. In the case of Google accounts, data privacy regulates how the company can use account holders’ information, while data security requires it to deploy measures such as password protection and 2-step verification. In explaining these two concepts, data privacy managers often use the analogy of a window for security and a curtain for privacy: data security is a prerequisite for data privacy. Put together, they comprise data protection, as shown in the following diagram:
Figure 1.2 – Data security versus data privacy
We can see from the preceding diagram that while data security limits who can access data, data privacy limits what can be in the data and how it can be used. Understanding this distinction is important because data privacy can multiply the consequences of failures in data security. Let’s look into how.
Risks in handling private data
Failing at data protection is costly. According to IBM’s Cost of a Data Breach Report 2021, the global average cost of a data breach that year was 4.24 million US dollars (USD), considerably higher than the $3.86 million of a year earlier and the highest figure in the 17-year history of the report; the increased number of people working remotely in the aftermath of the COVID-19 outbreak is considered a major reason for the spike. The top five industries by average total cost were healthcare, finance, pharmaceuticals, technology, and energy. Nearly half of the breaches that year involved customer personally identifiable information (PII), which cost $180 per record on average. Once customer PII is breached, consequences such as system downtime during the response, loss of customers, the need to acquire new customers, reputation damage, and diminished goodwill ensue; hence the hefty cost.
The IBM study also found that failing to comply with regulations for data protection was top among the factors that amplify data breach costs (https://www.ibm.com/downloads/cas/ojdvqgry).
Increased data protection regulations
As technology advances, the need to protect customer data has become more critical. Consumers require and expect privacy protection in every transaction, and many everyday activities, whether online banking or using a phone app, can put personal data at risk.
Governments worldwide were initially slow to react with laws and regulations to protect personal data from identity theft, cybercrime, and data privacy violations. Times are now changing, however, as data protection laws take shape globally.
There are several drivers for the increase in regulation, including the growth of enormous amounts of data and the corresponding need for stronger data security and privacy to protect users from nefarious activities such as identity theft.
Let’s look at some of the measures taken toward data privacy in the following sub-sections.
General Data Protection Regulation (GDPR)
The General Data Protection Regulation (GDPR) of the European Union is regarded as the first data protection regulation in the modern data economy and has been emulated by many countries crafting their own. GDPR was proposed in 2012, adopted by the EU Council and Parliament in 2016, and came into force in May 2018. It superseded the Data Protection Directive that had been adopted in 1995.
What makes GDPR epoch-making is its stress on the protection of PII, including people’s names, locations, racial or ethnic origin, political or sexual orientation, religious beliefs, association memberships, and genetic/biometric/health information. Organizations and individuals both inside and outside the EU have to follow the regulation when dealing with the personal data of EU residents. There are seven principles of GDPR, six of which were inherited from the Data Protection Directive; the new principle is accountability, which demands that data users maintain documentation about the purpose and procedure of their personal data usage.
GDPR has shown the public what the consequences of its violation can be. Depending on the severity of non-compliance, a GDPR fine can range from 2% of global annual turnover or €10 million, whichever is higher, up to 4% of global annual turnover or €20 million, whichever is higher, for the most serious infringements. In May 2018, thousands of Europeans filed a complaint against Amazon.com Inc. through the French organization La Quadrature du Net (Squaring the Net in English), accusing the company of running its advertisement targeting system without customer consent. After 3 years of investigation, Luxembourg’s National Commission for Data Protection (CNPD) made headlines around the world: it issued Amazon a €746 million fine. Similarly, WhatsApp was fined by Ireland’s Data Protection Commission in September 2021 for GDPR infringement; again, the investigation had taken 3 years, and the fine amounted to €225 million.
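To make the two-tier, whichever-is-higher rule concrete, here is a minimal sketch of the fine ceiling calculation; the function name and turnover figures are illustrative assumptions, not part of the regulation’s text:

```python
def gdpr_fine_ceiling(global_annual_turnover_eur: float, severe: bool) -> float:
    """Upper bound of a GDPR fine under the two-tier 'whichever is higher' rule."""
    if severe:
        return max(0.04 * global_annual_turnover_eur, 20_000_000)
    return max(0.02 * global_annual_turnover_eur, 10_000_000)

# Hypothetical figures: a very large company versus a smaller one.
print(gdpr_fine_ceiling(30e9, severe=True))    # 1.2e9 -> up to EUR 1.2 billion
print(gdpr_fine_ceiling(200e6, severe=False))  # 10_000_000 -> the floor applies
```

For large companies, the percentage of turnover dominates, which is how fines such as Amazon’s €746 million become possible.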
Currently, in the US, a majority of states have privacy protections in place or soon will. Several states, such as California, Colorado, and Virginia, have also strengthened existing regulations. Let’s look at each to get an idea of these changes.
California Consumer Privacy Act (CCPA)
The state of California followed suit. The California Consumer Privacy Act (CCPA) became effective on January 1, 2020. As the name suggests, the aim of the regulation is to protect consumers’ PII, just as GDPR does. Compared to GDPR, however, the scope of the CCPA is significantly limited: it applies only to for-profit organizations that collect data on more than 50,000 consumers, households, or devices in the state in a year, generate annual revenue over $25 million, or make half of their annual revenue by selling such information. Even so, CCPA infringement can be much more costly than GDPR infringement, since the former has no ceiling on its fines ($2,500 per record for each unintentional violation; $7,500 per record for each intentional violation).
Colorado Privacy Act (CPA)
Under the Colorado Privacy Act (CPA), starting July 1, 2024, data collectors and controllers will have to honor the universal opt-out signals that users have selected against targeted advertising and the sale of their data. This rule protects Colorado residents from targeted sales and advertising as well as certain types of profiling.
Virginia Consumer Data Protection Act (CDPA)
Virginia’s Consumer Data Protection Act (CDPA) will introduce several changes to increase security and privacy on January 1, 2023. These changes will apply to organizations that do business in Virginia or with Virginia residents. Data collectors will need to obtain consumers’ approval to use their private data. The changes also aim to determine the adequacy of the privacy and security practices of AI vendors, which may require the removal of data from vendors found inadequate.
These are just a few simple examples of how data regulations will take shape in the US. What does this look like for the rest of the world? Some estimate that by 2024, 75% of the global population will have personal data covered by privacy regulations of one type or another.
Another example of major data protection regulation is Brazil’s Lei Geral de Proteção de Dados Pessoais (LGPD), which has been in force since September 2020 and replaced dozens of laws in the country related to data privacy. LGPD was modeled after GDPR, and their contents are almost identical. In Asia, Japan was the first country to introduce a data protection regulation: the Act on the Protection of Personal Information (APPI) was adopted in 2003 and amended in 2015. In April 2022, the latest version of APPI came into force to address modern concerns over data privacy.
FL has been identified as a critical technology that can work well with privacy regulations and regulatory compliance in different domains.
From privacy by design to data minimalism
Organizations have been acclimatizing to these regulations. TrustArc’s Global Privacy Benchmarks Survey 2021 found that the number of enterprises with a dedicated privacy office is increasing: 83% of respondents in the survey had a privacy office, up from only 67% in 2020. Of the respondents, 85% had a strategic and reportable privacy management program in place, yet 73% believed that they could do more to protect privacy. Their eagerness is hardly surprising, as 34% of the respondents claimed that they had faced a data breach in the previous 3 years, the costly consequences of which were mentioned previously in this chapter. A privacy office is typically led by a data protection officer (DPO), who is responsible for the company’s Data Protection Impact Assessment (DPIA), required for compliance with regulations such as GDPR that demand accountability and documentation of personal data handling. DPOs are also responsible for monitoring and ensuring that their organizations treat personal data in compliance with the law, and top management and the board are expected to provide the necessary support and resources for DPOs to complete their task.
In the face of GDPR, the current trend in data protection is shifting toward data minimalism. Data minimalism in this context does not necessarily mean minimizing the size of data; it pertains more directly to minimizing the PII factors in data so that individuals cannot be identified from its data points. Data minimalism therefore affects the AI sector’s ability to create high-performing AI applications, because a shortage of data variety in the ML process generates ML model biases and unsatisfying predictive performance.
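As a minimal sketch of what data minimalism can look like in practice (the table, its column names, and the salting scheme are illustrative assumptions, not a prescribed method), the following Python snippet drops direct identifiers and replaces the record key with a salted one-way hash, keeping only the fields an ML task actually needs:

```python
import hashlib

import pandas as pd

# Hypothetical patient table; the column names are illustrative only.
records = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "email": ["alice@example.com", "bob@example.com"],
    "patient_id": ["P-001", "P-002"],
    "age": [34, 57],
    "diagnosis_code": ["E11", "I10"],
})

SALT = "replace-with-a-secret-salt"  # stored separately from the data

def pseudonymize(value: str) -> str:
    """Replace an identifier with a salted one-way hash."""
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

# Data minimalism: drop direct identifiers and pseudonymize the record key,
# keeping only the fields the ML task actually needs.
minimized = records.drop(columns=["name", "email"]).assign(
    patient_id=records["patient_id"].map(pseudonymize)
)
print(minimized)
```

Records can still be joined on the hashed key for analysis, but the fields that identify a person never leave the source system.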
The abundance mindset for big data introduced at the beginning of the chapter has thus been disciplined by public concern over data privacy. The risk of being fined for violating data protection regulations, coupled with the wasteful cost of maintaining a data graveyard, calls for practicing data minimalism rather than data abundance.
That is why FL is becoming a must-have solution for many AI providers, such as those in the medical sector, that struggle with public concern over data privacy, an issue that arises whenever a third-party entity must collect private data to improve the quality of ML models and their applications. As mentioned, FL is a promising framework for privacy-preserving AI because learning can happen where the data resides; even when the data itself is unavailable to the AI service provider, it is enough to collect and aggregate the locally trained ML models in a consistent way.
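To illustrate that aggregation step, here is a minimal sketch of weighted model averaging in the spirit of the widely used FedAvg algorithm; the layer shapes, client counts, and sample sizes are made-up assumptions:

```python
import numpy as np

def federated_average(client_weights, client_sample_counts):
    """Aggregate locally trained model parameters, layer by layer.

    client_weights: one list of np.ndarray per client (that client's
        model parameters); raw training data never leaves the client.
    client_sample_counts: how many samples each client trained on, so
        clients with more data contribute proportionally more.
    """
    total = sum(client_sample_counts)
    num_layers = len(client_weights[0])
    return [
        sum(
            (n / total) * weights[layer]
            for weights, n in zip(client_weights, client_sample_counts)
        )
        for layer in range(num_layers)
    ]

# Illustrative setup: three clients, a model with two parameter tensors.
rng = np.random.default_rng(0)
clients = [[rng.normal(size=(4, 2)), rng.normal(size=(2,))] for _ in range(3)]
samples = [100, 300, 600]  # only these counts are shared, not the data

global_model = federated_average(clients, samples)
print([w.shape for w in global_model])  # [(4, 2), (2,)]
```

The key point is what crosses the network: model parameters and sample counts, never the private records themselves.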
Now, let’s consider another facet of the Triple-A mindset for big data being challenged: acceptance of messy data.