Article

AI and Privacy

Originally written as La IA y Tu Privacidad. Santiago Luis Rivas Betancourt. SIA - ITBA. June 21, 2024.

Introduction

With the rise of new artificial intelligence technologies and their broad public availability, many people have started using them in everyday life, in both personal and professional routines, relying on large language models (LLMs) to support their work. This creates conflicts around data privacy, not only at a personal level but also across the business ecosystem. Companies such as Apple, Wired, and Verizon Communications have introduced restrictions on employee use of ChatGPT and other AI systems.

Many programmers have started using AI as a complement to their education and their work. From code generation to code correction, LLMs have proved useful in many software development use cases. With GitHub Copilot and ChatGPT, a programmer can generate large amounts of code and can also ask for help optimizing or improving existing code.

There is room to debate how useful these tools are, and how good the code generated by LLMs really is, but that is not the focus here. Like any tool, artificial intelligence can be used correctly or incorrectly, and it may fit some workflows better than others. The focus of this article is how a programmer, or a team of programmers, can incorporate AI into their workflow privately while protecting both personal data and company data.

The Problem

In this era of social networks and web applications, people often do not think deeply about how to protect their personal data. Many freely hand over data to companies such as Meta, Google, Microsoft, and now OpenAI. Most people, including programmers, do not attach enough importance to the privacy of their information.

From posting about personal trips to using cloud systems such as Google Drive to store documents, users end up trusting the security of the system where they publish that information. They trust that these companies use good encryption practices and secure communication protocols to keep their data safe. They also assume these companies do not sell that information, or that if they do, they sell it only in anonymized form.

Companies face this problem from another angle. The disclosure of private company information over the internet is a serious issue that every corporation must address. It is normal for a company to monitor what its employees disclose on social networks or other platforms.

One example comes from Samsung, which allowed its employees to use ChatGPT. Some of them sent it confidential information, including the source code of a new program. ChatGPT now offers private chats, but a company that relies on them still takes on risk, because it must trust the promise OpenAI makes to the world. Nobody outside OpenAI can fully verify that promise, just as nobody can fully verify the data handling practices of any online platform. Software that does not run on your own computer is untrusted software.

These problems already existed, but with the appearance of ChatGPT, people now think even less before sharing preferences, tastes, personal information, and work information. Each person can evaluate what personal data they have handed to OpenAI through the great GPT confidant. For many, however, there seems to be no alternative. Leaving the world of corporate software often requires more effort, but the decision depends on how much one values privacy.

The Solution

Large companies have addressed this by creating their own models. Google has Bard (renamed Gemini in 2024), and Microsoft has Copilot. But the same issue applies to small startups and independent programmers. A startup with a revolutionary or innovative idea risks disclosing secrets that might be central to its success. Even routine material carries risk: sensitive configuration files may contain passwords or access keys for critical services.

A possible solution to this problem is self-hosting, that is, running the models on hardware the user or company controls. For the issue analyzed here, self-hosting addresses the privacy problem and brings other benefits to users as well. Self-hosted AI options include Ollama and TabbyML, platforms that let users run LLMs on their own systems.

TabbyML focuses on code completion and also provides a graphical interface for analyzing metrics generated by a programming team. Ollama handles the download and execution of different predefined LLMs, such as Llama or Mistral, and can also create new ones from models hosted on Hugging Face. Hugging Face, in turn, encourages the development of open source AI that competes with products from OpenAI and other large companies.
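As a rough illustration of what running a model "inside your own system" looks like in practice, the sketch below sends a prompt to an Ollama server on the same machine over its local HTTP API. It assumes Ollama is installed and listening on its default local address, and that a model has already been downloaded; the model name `mistral` and the helper names `build_request` and `ask_local_model` are illustrative choices, not part of the original essay.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; adjust if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build a POST request asking for a single, non-streamed completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "mistral") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With streaming disabled, the generated text is in the "response" field.
    return body["response"]
```

Calling `ask_local_model("Explain what a mutex is in one sentence.")` requires a running Ollama server. The prompt travels only to `localhost`, which is the property the rest of this article relies on.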

Advantages and Disadvantages

The first benefit is that self-hosting keeps data under the user's control. Everything entered into these models remains on the local system running Ollama or TabbyML, so code or credentials read by a code analyzer or an AI chat never pass into the custody of third parties.
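One way a team might enforce this in its own tooling is to refuse to send prompts anywhere except the local machine. The helper below is a hypothetical guard, not a feature of Ollama or TabbyML: it accepts a URL only when its host is `localhost` or a loopback IP address. It is a minimal sketch; a team that self-hosts on a shared internal server would need a different rule, such as an allowlist of internal hosts.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False
```

A client could call `is_local_endpoint` on its configured model URL at startup and abort if the check fails, so that a misconfigured endpoint does not silently send code to a third party.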

A second benefit is fine-tuning. Although OpenAI allows fine-tuning of some of the models behind ChatGPT, doing so carries an additional cost. With a local system, fine-tuning can be done independently, provided the necessary time and resources are invested. A local setup is also the more appropriate choice when the fine-tuning data is a private dataset or codebase, since that data never leaves the system.

Finally, because these models can run locally, the required investment can be lower. If the hardware programmers already use can support the models, no additional investment is needed. If a more powerful system is needed, whether purchased or rented, the decision becomes a matter of cost analysis. For an independent programmer this can be a limitation, but consumer hardware has already reached the level required to run these models, especially devices aimed at programmers.
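Whether existing hardware "can support the models" is largely a question of memory. A back-of-the-envelope estimate multiplies the number of parameters by the bytes used per weight. The sketch below does that arithmetic; the function names and the 20% overhead factor are assumptions made for illustration, and real usage also depends on context length and the runtime used.

```python
def weight_memory_gib(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_memory(
    params_billions: float,
    bits_per_weight: int,
    available_gib: float,
    overhead_factor: float = 1.2,  # assumed allowance for runtime overhead
) -> bool:
    """Rough check: do the weights plus an assumed overhead fit in memory?"""
    needed = weight_memory_gib(params_billions, bits_per_weight) * overhead_factor
    return needed <= available_gib
```

Under these assumptions, a 7-billion-parameter model quantized to 4 bits needs about 3.3 GiB for its weights, which fits on a machine with 8 GiB free, while the same model at 16 bits needs about 13 GiB and does not.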

Conclusion

The open source ecosystem, and the privacy-oriented ecosystem alongside it, has always had room for a wide range of use cases. With the current state of artificial intelligence and the evolution of hardware, it is becoming increasingly realistic to use these technologies locally and privately. For companies of all kinds, as well as for independent programmers, this possibility already exists and should be considered when developing and evaluating a project.

That said, this should also be a call to guide the creation of AI-based software toward data privacy. This includes privacy at the enterprise level, but also the privacy of end users who do not necessarily have the knowledge required to understand what happens to the information they provide to large technology companies through their services.

References

Jonathan Gillham, Company AI Policy Examples and Templates - Who Has Banned ChatGPT. originality.ai.
Lewis Maddison, Samsung workers made a major error by using ChatGPT. TechRadar.
OpenAI, New ways to manage your data in ChatGPT. OpenAI.
Ollama.
TabbyML.
Rucy, Self-hosted LLMs: Are they worth it? Medium.
Michiel De Koninck, Should we fine-tune a LLM for this use case? Or consider other techniques? ML6.
