Data privacy is one of the biggest concerns with emerging artificial intelligence technology. The most prominent AI models—large language models (LLMs) like OpenAI's ChatGPT, Google's Bard, and Meta's LLaMa—are trained on vast quantities of data. The more data these AI models consume, the better they become at simulating human thought and conversation.
People naturally wonder how much data these AI models have access to. How much should they have access to? And what are the risks if they have our personal information?
Lawmakers in and outside the U.S. have started to regulate how the data—including, importantly, personal information—used to train AI is collected, stored, processed, and delivered.
AI models are trained on large swaths of text, mostly from websites, books, and newspapers. But where exactly does that data come from? It's a straightforward question that usually gets a less-than-forthcoming answer. Generally, the owners of popular LLMs vaguely declare that their data is from public sources. For example:
Publicly sourced data usually contains personal information like names, email addresses, and birthdates. This information can be taken from databases, articles, blogs, forums, and social media. And people whose personal data is being fed to AI models often don't know that what they've shared online is being used in these training sets. Again, AI developers haven't given details about what kinds of personal data have been collected and from whom.
Should the fact that data is publicly available mean that anyone is allowed to use it for any reason? Many say no, worrying that training data can and inevitably will be revealed to anyone who asks the AI the right questions.
Data privacy is an area of law that has to do with access to our personal information. Lawmakers try to protect people's personal information through consumer protection and privacy laws. The idea is to stop businesses from using consumer data in ways that would be unfair, deceptive, or harmful.
In data privacy laws, "personal information" or "personal data" usually refers to information that can directly or indirectly identify someone. Typically, personal information or data can include:
In the U.S., the Federal Trade Commission (FTC) is tasked with protecting consumers' privacy and security. Under Section 5 of the FTC Act, the FTC is responsible for preventing people and businesses from using "unfair or deceptive acts or practices" while conducting business in the U.S. (15 U.S.C. § 45 (2023).)
The FTC can specify the kinds of business practices that are considered unfair or deceptive, such as the unreasonable collection and processing of personal information. The federal agency can also launch investigations and charge businesses with violating consumer protection laws. Businesses that violate the law can be forced to pay civil penalties and restitution to consumers.
Although it has the FTC, the U.S. doesn't have a comprehensive data privacy law. But some lawmakers are trying to change that.
In June 2022, the House Energy and Commerce Committee introduced the American Data Privacy and Protection Act (ADPPA), a bill that would provide rights and protections to consumers. These rights would include the rights to access, correct, and delete personal data and to consent to its collection and processing. As of August 2023, the ADPPA hasn't been passed in either the House or Senate.
Whereas U.S. laws apply throughout the country, state laws apply to businesses that operate within those states and to consumers who reside there. States have approached data privacy in varying ways. Some have no consumer data privacy laws. A handful have comprehensive privacy laws.
For example, California has the California Privacy Rights Act (CPRA), a law that took effect on January 1, 2023 and that expands on the California Consumer Privacy Act of 2018. The CPRA is one of the most protective state measures for consumer privacy. It includes the rights to:
(Cal. Civ. Code §§ 1798.140 and following (2023).)
Despite California's stricter regulations and the FTC's investigation into ChatGPT, the U.S., in general, is considered behind other nations when it comes to consumer protection and data privacy laws.
Perhaps the most widely known data protection law is the General Data Protection Regulation (GDPR). The GDPR is a relatively strict European Union (EU) law that protects personal data and privacy. (It went into effect in May 2018.) While the law directly binds only EU member states, it also reaches companies outside the EU that process EU residents' data, and many countries have used it as a model and put similar regulations into place.
The GDPR applies to most businesses that process personal data. Under the GDPR, companies can collect and process personal data only under limited circumstances and have to follow strict protocols for collecting, storing, and processing that data.
The GDPR allows companies to process—for example, collect, record, store, organize, or use—personal data only if one of the following is true:
Most of these situations, except for the last one, are relatively easy to identify. However, proving you have a legitimate interest in processing personal data is tricky. To determine whether you have a legitimate interest, you must:
(Article 6 of the GDPR (2023).)
When it comes to Bard, Google has cited "legitimate interests" as its basis for collecting personal data from EU users. In its Privacy Help Hub for Bard, Google says that it's collecting personal information so it can "provide, maintain, improve, and develop Google products, services, and machine learning technologies."
Whatever the justification for processing personal data, the GDPR requires that companies make sure the data is accurate, up to date, and secure. Companies also need to be transparent with the data subject about the processing of their personal data. For example, the company generally must let the person know why their personal data was collected and how long it will be stored.
Some EU member countries have taken action against AI companies to enforce consumer rights under the GDPR. Here are a couple of examples.
Italy temporarily banned ChatGPT. In March 2023, Italy banned ChatGPT due to concerns about the chatbot's potential GDPR violations. Italy took issue with how OpenAI was collecting its training data from Italian consumers and the fact that inappropriate data could reach underage users. Italy gave OpenAI 20 days to develop an action plan to address these issues and to fully comply with the GDPR. By the end of April 2023, OpenAI had made changes such as verifying users' ages when they sign up and providing a way for people to remove their personal information from ChatGPT's training data. In response to OpenAI's improvements, Italy lifted its ban.
Ireland stalled Google's EU launch of Bard. Before launching its AI model in the EU in July 2023, Google worked to create stricter privacy policies to satisfy the demands of the Irish Data Protection Commissioner. In an attempt to comply with GDPR rules and to provide more transparency, Google made various changes to Bard pre-launch, including requiring users to create a Google account to use Bard and to verify that they're 18 years of age or older.
The United Kingdom, on the other hand, is taking a more relaxed approach to AI regulation. Even though the UK is no longer an EU member state, it incorporates the GDPR into its Data Protection Act. The UK has said that it doesn't plan to create new data privacy laws geared toward AI but will instead offer voluntary guidance on existing laws. For example, the UK Information Commissioner's Office has provided companies with best practices and principles to consider when adopting AI in their industries.
In the U.S., the FTC's investigation of OpenAI and ChatGPT could be an indicator of how serious the government will get in regulating the way that AI companies gather, use, and share our personal information. If lawmakers decide to get serious about the issue, they could create data protection laws that provide ways for Americans to better control their personal information. If the U.S. and other countries follow the EU's lead, companies will have to reconsider how they use personal information to train AI.