OpenAI’s regulatory troubles are only just beginning

OpenAI managed to appease Italian data authorities and lift the country’s effective ban on ChatGPT last week, but its fight against European regulators is far from over.

Earlier this year, OpenAI’s popular and controversial ChatGPT chatbot hit a big legal snag: an effective ban in Italy.

The Italian Data Protection Authority (GPDP) accused OpenAI of violating EU data protection rules, and the company agreed to restrict access to the service in Italy while it attempted to fix the problem. On April 28th, ChatGPT returned to the country, with OpenAI lightly addressing GPDP’s concerns without making major changes to its service — an apparent victory.

The GPDP has said it “welcomes” the changes ChatGPT made. However, the firm’s legal issues — and those of companies building similar chatbots — are likely just beginning. Regulators in several countries are investigating how these AI tools collect and produce information, citing a range of concerns from companies’ collection of unlicensed training data to chatbots’ tendency to spew misinformation. In the EU, they’re applying the General Data Protection Regulation (GDPR), one of the world’s strongest legal privacy frameworks, the effects of which will likely reach far outside Europe. Meanwhile, lawmakers in the bloc are putting together a law that will address AI specifically — likely ushering in a new era of regulation for systems like ChatGPT.

ChatGPT’s various issues with misinformation, copyright, and data protection have placed a target on its back

ChatGPT is one of the most popular examples of generative AI — a blanket term covering tools that produce text, image, video, and audio based on user prompts. The service reportedly became one of the fastest-growing consumer applications in history after reaching 100 million monthly active users just two months after launching in November 2022 (OpenAI has never confirmed these figures). People use it to translate text into different languages, write college essays, and generate code. But critics — including regulators — have highlighted ChatGPT’s unreliable output, confusing copyright issues, and murky data protection practices.

Italy was the first country to make a move. On March 31st, it highlighted four ways it believed OpenAI was breaking GDPR: allowing ChatGPT to provide inaccurate or misleading information, failing to notify users of its data collection practices, failing to meet any of the six possible legal justifications for processing personal data, and failing to adequately prevent children under 13 years old using the service. It ordered OpenAI to immediately stop using personal information collected from Italian citizens in its training data for ChatGPT.

No other country has taken such action. But since March, at least three EU nations — Germany, France, and Spain — have launched their own investigations into ChatGPT. Meanwhile, across the Atlantic, Canada is evaluating privacy concerns under its Personal Information Protection and Electronic Documents Act, or PIPEDA. The European Data Protection Board (EDPB) has even established a dedicated task force to help coordinate investigations. And if these agencies demand changes from OpenAI, they could affect how the service runs for users across the globe.

Regulators’ concerns can be broadly split into two categories: where ChatGPT’s training data comes from and how OpenAI is delivering information to its users.

ChatGPT uses either OpenAI’s GPT-3.5 and GPT-4 large language models (LLMs), which are trained on vast quantities of human-produced text. OpenAI is cagey about exactly what training text is used but says it draws on “a variety of licensed, created, and publicly available data sources, which may include publicly available personal information.”

This potentially poses huge problems under GDPR. The law was enacted in 2018 and covers every service that collects or processes data from EU citizens — no matter where the organization responsible is based. GDPR rules require companies to have explicit consent before collecting personal data, to have legal justification for why it’s being collected, and to be transparent about how it’s being used and stored.

European regulators claim that the secrecy around OpenAI’s training data means there’s no way to confirm if the personal information swept into it was initially given with user consent, and the GPDP specifically argued that OpenAI had “no legal basis” for collecting it in the first place. OpenAI and others have gotten away with little scrutiny so far, but this claim adds a big question mark to future data scraping efforts.

Then there’s GDPR’s “right to be forgotten,” which lets users demand that companies correct their personal information or remove it entirely. OpenAI preemptively updated its privacy policy to facilitate those requests, but there’s been debate about whether it’s technically possible to handle them, given how complex it can be to separate specific data once it’s churned into these large language models.

OpenAI also gathers information directly from users. Like any internet platform, it collects a range of standard user data (e.g., name, contact info, card details, etc). But, more significantly, it records interactions users have with ChatGPT. As stated in an FAQ, this data can be reviewed by OpenAI’s employees and is used to train future versions of its model. Given the intimate questions people ask ChatGPT — using the bot as a therapist or a doctor — this means the company is scooping up all sorts of sensitive data.

At least some of this data may have been collected from minors, as while OpenAI’s policy states that it “does not knowingly collect personal information from children under the age of 13,” there’s no strict age verification gate. That doesn’t play well with EU rules, which ban collecting data from people under 13 and (in some countries) require parental consent for minors under 16. On the output side, the GPDP claimed that ChatGPT’s lack of age filters exposes minors to “absolutely unsuitable responses with respect to their degree of development and self-awareness.”

OpenAI maintains broad latitude to use that data, which has worried some regulators, and storing it presents a security risk. Companies like Samsung and JPMorgan have banned employees from using generative AI tools over fears they’ll upload sensitive data. And, in fact, Italy announced its ban soon after ChatGPT suffered a serious data leak, exposing users’ chat history and email addresses.

ChatGPT’s propensity for providing false information may also pose a problem. GDPR regulations stipulate that all personal data must be accurate, something the GPDP highlighted in its announcement. Depending on how that’s defined, it could spell trouble for most AI text generators, which are prone to “hallucinations”: a cutesy industry term for factually incorrect or irrelevant responses to a query. This has already seen some real-world repercussions elsewhere, as a regional Australian mayor has threatened to sue OpenAI for defamation after ChatGPT falsely claimed he had served time in prison for bribery.

ChatGPT’s popularity and current dominance over the AI market make it a particularly attractive target, but there’s no reason why its competitors and collaborators, like Google with Bard or Microsoft with its OpenAI-powered Azure AI, won’t face scrutiny, too. Before ChatGPT, Italy banned the chatbot platform Replika for collecting information on minors — and so far, it’s stayed banned.

While GDPR is a powerful set of laws, it wasn’t made to address AI-specific issues. Rules that do, however,may be on the horizon.

In 2021, the EU submitted its first draft of the Artificial Intelligence Act (AIA), legislation that will work alongside GDPR. The act governs AI tools according to their perceived risk, from “minimal” (things like spam filters) to “high” (AI tools for law enforcement or education) or “unacceptable” and therefore banned (like a social credit system). After the explosion of large language models like ChatGPT last year, lawmakers are now racing to add rules for “foundation models” and “General Purpose AI Systems (GPAIs)” — two terms for large-scale AI systems that include LLMs — and potentially classing them as “high risk” services.

The AIA’s provisions go beyond data protection. A recently proposed amendment would force companies to disclose any copyrighted material used to develop generative AI tools. That could expose once-secret datasets and leave more companies vulnerable to infringement lawsuits, which are already hitting some services.

Laws specifically designed to regulate AI may not come into effect in Europe until late 2024

But passing it may take a while. EU lawmakers reached a provisional AI Act deal on April 27th. A committee will vote on the draft on May 11th, and the final proposal is expected by mid-June. Then, the European Council, Parliament, and Commission will have to resolve any remaining disputes before implementing the law. If everything goes smoothly, it could be adopted by the second half of 2024, a little behind the official target of Europe’s May 2024 elections.

For now, Italy and OpenAI’s spat offers an early look at how regulators and AI companies might negotiate. The GPDP offered to lift its ban if OpenAI met several proposed resolutions by April 30th. That included informing users how ChatGPT stores and processes their data, asking for explicit consent to use said data, facilitating requests to correct or remove false personal information generated by ChatGPT, and requiring Italian users to confirm they’re over 18 when registering for an account. OpenAI didn’t hit all of those stipulations, but it met enough to appease Italian regulators and get access to ChatGPT in Italy restored.

OpenAI still has targets to meet. It has until September 30th to create a harder age-gate to keep out minors under 13 and require parental consent for older underage teens. If it fails, it could see itself blocked again. But it’s provided an example of what Europe considers acceptable behavior for an AI company — at least until new laws are on the books.