AI and Privacy

Originally written as La IA y Tu Privacidad. Santiago Luis Rivas Betancourt. SIA - ITBA. June 21, 2024.

Introduction

With the rise of new artificial intelligence technologies and their broad public availability, many people have incorporated these tools into their personal and professional routines, relying on LLMs to support their work. This creates conflicts around data privacy, not only at a personal level but also across the business ecosystem. Companies such as Apple, Wired, and Verizon Communications have introduced restrictions on employee use of ChatGPT and other AI systems.

Many programmers have started using AI as a complement to their education and their work. From code generation to code correction, LLMs have succeeded in many software development use cases. With GitHub Copilot and ChatGPT, a programmer can generate large amounts of code and can also ask for help optimizing or improving existing code.

There is room to debate how useful these tools are, and how good the code generated by LLMs really is, but that is not the focus here. Like any tool, artificial intelligence can be used correctly or incorrectly, and it may fit some workflows better than others. The focus of this article is how a programmer, or a team of programmers, can incorporate AI into their workflow privately while protecting both personal data and company data.

The Problem

In this era of social networks and web applications, people often do not think carefully about how to protect their personal data. Many freely hand over data to companies such as Meta, Google, Microsoft, and now OpenAI. Most people, programmers included, do not give enough weight to the privacy of their information.

From posting about personal trips to using cloud systems such as Google Drive to store documents, users end up trusting the security of the system where they publish that information. They trust that these companies use good encryption practices and secure communication protocols to keep their data safe. They also assume these companies do not sell that information, or that if they do, they only do so anonymously.

Companies face this problem from another angle. The disclosure of private company information over the internet is a serious issue that every corporation must address. It is normal for a company to monitor what its employees disclose on social networks or other platforms.

One example comes from Samsung, which allowed its employees to use ChatGPT. Some of them sent confidential information to ChatGPT, including the source code of a new program. ChatGPT now offers private chats, but a company that relies on them still carries risk, because it must take OpenAI's promise on trust. Nobody outside OpenAI can fully verify that promise, just as nobody can fully verify the data handling practices of any other online platform. Software that does not run on your own computer is untrusted software.

These problems already existed, but since the appearance of ChatGPT, people think even less before sharing preferences, tastes, personal information, and work information. Each person can take stock of how much personal data they have already confided to OpenAI through ChatGPT. For many, however, there seems to be no alternative. Leaving the world of corporate software usually takes more effort, and whether that effort is worthwhile depends on how much one values privacy.

The Solution

Large companies have addressed this by building their own AI products. Google has Gemini (formerly Bard), and Microsoft has Copilot. But the same issue also applies to small startups and independent programmers. A startup with a revolutionary or innovative idea could end up disclosing secrets that are central to its success. The exposure is not limited to source code: sensitive configuration files may contain passwords or access keys for critical services.
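To make the configuration file risk concrete, here is a minimal sketch that flags lines that look like credentials before a file is pasted into a hosted chat. The function name and the keyword list are hypothetical choices made for illustration. The heuristic is deliberately simple and will miss many real secrets, so it should not be read as a complete safeguard.

```python
import re

# Hypothetical heuristic: key names that often hold credentials.
# The keyword list is an assumption and is far from exhaustive.
SENSITIVE_KEY = re.compile(
    r"^\s*(?:export\s+)?"
    r"([A-Za-z0-9_.-]*"
    r"(?:password|passwd|secret|token|api[_-]?key|access[_-]?key)"
    r"[A-Za-z0-9_.-]*)"
    r"\s*[:=]\s*\S+",
    re.IGNORECASE,
)


def find_sensitive_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_name) for lines that look like credentials."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = SENSITIVE_KEY.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


if __name__ == "__main__":
    sample = "host = db.internal\nDB_PASSWORD=hunter2\nport: 5432\n"
    for number, key in find_sensitive_lines(sample):
        print(f"line {number}: '{key}' looks like a credential")
```

A check like this only reduces accidental leaks. It does not change the fact that anything sent to a hosted service leaves the user's control.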

A possible solution to this problem is self-hosting, which addresses the privacy concern and brings other benefits to users as well. For self-hosted AI, there are several alternatives, including Ollama and TabbyML, platforms that allow users to run LLMs on their own systems.

TabbyML focuses on code completion and also provides a graphical interface for analyzing metrics generated by a programming team. Ollama handles the download and execution of different predefined LLMs, such as Llama or Mistral, and can also create new ones from models hosted on Hugging Face. Hugging Face, in turn, encourages the development of open source AI that competes with products from OpenAI and other large companies.
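As an illustration of what running a model locally looks like from a programmer's side, the sketch below sends a prompt to Ollama's local HTTP API using only the Python standard library. It assumes an Ollama server is already running on its default port (11434) and that a model has been downloaded beforehand, for example with `ollama pull mistral`. The helper names `build_request` and `ask_local_model` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build a POST request for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "mistral") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as reply:
        return json.loads(reply.read())["response"]


if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(ask_local_model("Explain what a mutex is in one sentence."))
```

The request goes to `localhost`, so the prompt is handled by the server running on the same machine rather than by a hosted service.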

Advantages and Disadvantages

The first benefit is that self-hosting solves the problem of protecting data. Everything entered into these models remains on the local system running Ollama or TabbyML. This ensures that code or credentials that might be read by a code analyzer or an AI chat are not placed under the custody of third parties.
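The guarantee only holds if the tools are actually configured to talk to the local machine. As a rough sketch, a team could check that an endpoint URL points at a loopback address before sending code to it. The function name is hypothetical, and the check assumes that the name `localhost` resolves to the local machine, which is the usual default but depends on system configuration. It also covers only the single-machine case, not a self-hosted server elsewhere on a private network.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_endpoint(url: str) -> bool:
    """Return True when the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        # Assumption: localhost resolves to this machine.
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False


if __name__ == "__main__":
    for url in ("http://localhost:11434/api/generate",
                "https://api.openai.com/v1/chat/completions"):
        print(url, "->", "local" if is_loopback_endpoint(url) else "remote")
```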

A second benefit is fine-tuning. Although it is possible to fine-tune the hosted models behind ChatGPT, doing so carries an additional cost. With a local system, fine-tuning can be done independently, provided the necessary time and resources are invested. A local setup is also the more appropriate choice when the fine-tuning relies on a private dataset or codebase, since that data never has to leave the organization.

Finally, because these models can run locally, the required investment can be lower. If the hardware programmers already use can support the models, no additional investment is needed. If a more powerful system is needed, perhaps requiring purchased or rented hardware, then the decision becomes a matter of cost analysis. For an independent programmer this can be a limitation, but consumer hardware has already reached the level required to run these models, especially devices aimed at programmers.
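Whether existing hardware can support a model can be roughly estimated from the model's size. The sketch below uses a common rule of thumb: the memory needed is about the number of parameters times the bytes stored per weight, plus some headroom for context and runtime buffers. The 1.2 overhead factor is an assumption chosen for illustration, not a measured value, and real usage varies with the runtime and context length.

```python
def estimate_model_memory_gb(
    parameters_billions: float,
    bits_per_weight: int,
    overhead: float = 1.2,  # assumed headroom for context and buffers
) -> float:
    """Roughly estimate the memory (in GB) needed to load a model."""
    weight_bytes = parameters_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9


if __name__ == "__main__":
    # Compare a 7-billion-parameter model at two precisions.
    for bits in (16, 4):
        needed = estimate_model_memory_gb(7, bits)
        print(f"7B parameters at {bits}-bit: about {needed:.1f} GB")
```

Under these assumptions, a 7-billion-parameter model quantized to 4 bits needs roughly 4 GB, which is within reach of many current laptops, while the same model at 16-bit precision needs around 17 GB.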

Conclusion

The open source ecosystem, along with the ecosystem built around data privacy, has always offered options for a wide range of use cases. With the progress of artificial intelligence and of hardware, it is becoming increasingly realistic to use these technologies locally and privately. For companies of all kinds, as well as for independent programmers, this possibility already exists and should be considered when developing and evaluating a project.

That said, this should also be a call to guide the creation of AI-based software toward data privacy. This includes privacy at the enterprise level, but also the privacy of end users who do not necessarily have the knowledge required to understand what happens to the information they provide to large technology companies through their services.
