News Analysis

Microsoft's AI Data Leak Isn't the Last One We'll See

News that Microsoft's AI research team leaked 38TB of data raises questions about the security of LLMs — for all companies building generative AI-driven applications.

Microsoft has spent most of 2023 advocating for the use of generative AI and positioning itself as a trusted leader in the space. But recent developments revealed that the company's AI research team had accidentally leaked 38TB of confidential data through its GitHub page.

So much for trust.

According to cloud security firm Wiz, the data, which included backups of two former employees' workstations as well as passwords, encryption keys and internal Teams messages, was accidentally exposed through an overly permissive Azure storage access token shared in Microsoft's AI GitHub repository, where the team was publishing a tranche of open-source training data.
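For teams publishing their own datasets, even a crude pre-release audit can catch files that have no business in a public repository. The Python sketch below flags backup images, key material and unusually large binaries in a release directory; the extensions, directory name and size threshold are illustrative assumptions, not a standard.

```python
import os

# Illustrative deny-list and size threshold; tune to your own environment.
# Backup images and key material have no business in a public data release.
SUSPECT_EXTENSIONS = {".vhd", ".vhdx", ".bak", ".pst", ".pem", ".pfx", ".key"}
MAX_FILE_MB = 500  # arbitrary cutoff: very large binaries deserve a second look


def audit_publish_dir(root: str) -> list[str]:
    """Return paths that look out of place in a public dataset release."""
    findings = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            size_mb = os.path.getsize(path) / 1_000_000
            if ext in SUSPECT_EXTENSIONS:
                findings.append(f"{path}: suspicious extension {ext}")
            elif size_mb > MAX_FILE_MB:
                findings.append(f"{path}: unusually large ({size_mb:.0f} MB)")
    return findings


if __name__ == "__main__":
    # Hypothetical release directory name.
    for finding in audit_publish_dir("./dataset-release"):
        print("REVIEW:", finding)
```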

Given the amount of data that is needed to train Large Language Models (LLMs), the idea that such an incident could happen isn't all that surprising. But there is still a lesson to take away from this event: this could happen to any company working with LLMs.

Protecting LLM Data

Protecting large datasets used in the development of LLMs poses a considerable challenge, said Anurag Gurtu, co-founder and COO of StrikeReady. And the Microsoft incident highlights the complexity of safeguarding such data.

While it's essential to implement robust security measures, including access controls and encryption, the sheer size of these datasets can make them susceptible to accidental exposure, Gurtu said.

One solution to this is to implement strict access controls and employ encryption techniques to protect sensitive data, he said. Additionally, organizations should invest in comprehensive data governance and monitoring solutions that can detect unusual data access patterns and trigger alerts when data is accessed or shared inappropriately. Regular security audits and employee training can further enhance data protection efforts.
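As a rough illustration of the kind of monitoring Gurtu describes, the sketch below flags any account that pulls an unusually large volume of data within a short window. The log format, field names and thresholds are assumptions made for the example, not a prescribed standard.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed access-log record shape: (timestamp, principal, bytes_downloaded).
# The 50 GB / one-hour threshold is illustrative, not a recommendation.
ALERT_BYTES = 50 * 1024**3
WINDOW = timedelta(hours=1)


def unusual_downloads(events):
    """Yield (principal, total_bytes) pairs that exceed the threshold in any window."""
    by_principal = defaultdict(list)
    for ts, principal, size in sorted(events):
        bucket = by_principal[principal]
        bucket.append((ts, size))
        # Drop events that have fallen outside the sliding window.
        while bucket and ts - bucket[0][0] > WINDOW:
            bucket.pop(0)
        total = sum(s for _, s in bucket)
        if total > ALERT_BYTES:
            yield principal, total


if __name__ == "__main__":
    now = datetime.utcnow()
    sample = [
        (now, "svc-train", 10 * 1024**3),
        (now + timedelta(minutes=5), "svc-train", 45 * 1024**3),
    ]
    for who, volume in unusual_downloads(sample):
        print(f"ALERT: {who} downloaded {volume / 1024**3:.0f} GB in under an hour")
```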

It is also important, he added, for organizations to consider the ethical aspects of using large datasets, ensuring that privacy and consent are respected throughout the data collection and usage process. 

If anything, the Microsoft leak, Gurtu said, serves as a stark reminder of the critical importance of data security. As organizations harness the power of AI and LLMs to drive innovation, they must also prioritize security. Data protection should be an integral part of AI development and deployment strategies.

"Ultimately, while the incident is a sobering reminder of the challenges in protecting large datasets, it should also serve as a catalyst for organizations to continually improve their data security practices and enhance their data governance frameworks," he said.

Related Article: Data Mesh or Data Fabric as a Foundation for Data Management Strategy

A Flaw in the Design

Even though Microsoft's leak was accidental, it's important to remember that the key to developing this kind of technology is how data is managed and stored, said Davi Ottenheimer, VP of trust and digital ethics at Boston-based Inrupt, a firm co-founded by world wide web inventor Sir Tim Berners-Lee.

“Microsoft's simple error brings focus to data storage methods for everyone, which should be personally dedicated and easily protected instead of over-centralized and easily over-shared,” he said.

Ottenheimer believes this is a far bigger story than AI infrastructure, since any knowledge service should adhere to data owner rights such as controlled sharing.

One of the crucial flaws in Microsoft's case, Ottenheimer said, is a design that assigns trust in ways the rest of the internet treats more like a hidden backdoor. He believes a proper multi-user, distributed open standard for data-sharing controls greatly reduces the risk of the wrong token ending up in the wrong hands.
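Wiz traced the Microsoft exposure to an overly permissive storage token, so one practical countermeasure is to issue tokens that are narrowly scoped and short-lived. The sketch below uses the azure-storage-blob Python SDK to mint a read-only, container-level token that expires in hours; the account, container and key values are placeholders.

```python
from datetime import datetime, timedelta, timezone

# Requires: pip install azure-storage-blob
from azure.storage.blob import ContainerSasPermissions, generate_container_sas

# Placeholder values; in practice these come from a secrets manager.
ACCOUNT_NAME = "exampleaccount"
CONTAINER_NAME = "public-training-data"
ACCOUNT_KEY = "<account-key-from-key-vault>"


def issue_read_only_sas(hours_valid: int = 4) -> str:
    """Return a SAS token scoped to read/list on a single container, expiring soon.

    Contrast this with a token that grants write-level access across a whole
    storage account with a far-future expiry, which makes any accidental
    disclosure far more damaging.
    """
    return generate_container_sas(
        account_name=ACCOUNT_NAME,
        container_name=CONTAINER_NAME,
        account_key=ACCOUNT_KEY,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.now(timezone.utc) + timedelta(hours=hours_valid),
    )


if __name__ == "__main__":
    token = issue_read_only_sas()
    print(f"https://{ACCOUNT_NAME}.blob.core.windows.net/{CONTAINER_NAME}?{token}")
```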

Of note in the Microsoft case, he said, is how its centralized services lacked some basic monitoring capabilities.

Ottenheimer said that anyone looking to build LLMs on high-scale data safely should be looking at Solid, a W3C protocol driven by Berners-Lee that has been in the spotlight since the leap in knowledge-sharing capabilities that came with the emergence of generative AI. LLMs like the Generative Pre-trained Transformer (GPT) family, Ottenheimer said, typically need massive datasets drawn from diverse text sources, in both volume and variety.

The Solid Protocol uses existing W3C standards to define a user-centric method of storing and sharing data. The objective is to decentralize the web and put data ownership back in the hands of its creators, rather than data warehouses or the companies that control them.
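In code, the Solid pattern looks less like bulk copying and more like fetching individual resources from a user-controlled pod, with the pod enforcing the owner's access rules. The Python sketch below is hypothetical: the pod URL and bearer token are placeholders, and a real Solid client would authenticate via Solid-OIDC rather than a hard-coded token.

```python
import requests  # pip install requests

# Hypothetical pod resource and token; a real client would obtain credentials
# through Solid-OIDC and discover resource URLs from the user's WebID profile.
RESOURCE_URL = "https://alice.example-pod.org/research/notes.ttl"
ACCESS_TOKEN = "<token-granted-by-the-data-owner>"


def read_pod_resource(url: str, token: str) -> str:
    """Fetch one resource from a Solid pod; the pod enforces the owner's ACL."""
    response = requests.get(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "text/turtle",  # Solid resources are typically RDF
        },
        timeout=10,
    )
    response.raise_for_status()  # a 401/403 means the owner has not granted access
    return response.text


if __name__ == "__main__":
    print(read_pod_resource(RESOURCE_URL, ACCESS_TOKEN))
```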

Related Article: Why Web3 and Web 3.0 Are Not the Same

The Human Problem

For security firm Dasera's CEO and co-founder, Ani Chaudhuri, the Microsoft leak also highlights one of the most unpredictable elements in the security perimeter around LLMs and their creation: human fallibility.


“It is a stark reminder that even the most fortified entities can falter due to internal oversights," he said. "In this dynamic data landscape, meticulous attention to every data fragment is the sine qua non, overshadowing even the most advanced security protocols and frameworks if neglected."

Chaudhuri said that any organization thinking about deploying generative AI must recognize that the human element can be both the weakest link and the strongest defense in the security apparatus.

This, in turn, highlights the need for holistic, granular and proactive data governance, enabling organizations to discern, monitor and secure every shard of information, ensuring that the sanctity of data is uncompromised, even in the face of internal inadvertences.

“Training, awareness and a culture of security are paramount," he said. "Every individual within an organization must be a vigilant data custodian, understanding the repercussions of inadvertent exposures."

LLMs operate on colossal datasets, and any lapse, any diminution in vigilance, can cascade into extensive, irreversible ramifications. “It's unwise to assume infallibility," he said. "Even the Titans can stumble. This incident reveals that data security is not a destination but a perpetual, evolving journey marked by continuous learning, adaptation and vigilance."

Yet, this is not just about creating security or governance roles; it’s about building a security and governance mindset across the entire organization.

“[The incident] tells us that security is not a siloed responsibility but a collective endeavor, where every interaction, every piece of information is treated with the utmost sanctity and guarded with unwavering diligence," Chaudhuri said.

Related Article: AI Governance Is a Challenge That Can't Be Ignored

The Need for Governance

We will continue to see companies integrate AI and LLM-based tools like Bard and ChatGPT, along with internal private models, into their digital transformation strategies, said Arti Raman, CEO and founder of AI data security firm Titaniam.

But as companies advance in the space, one critical aspect of this integration remains data governance. Microsoft's incident involved an AI research team that presumably had oversight of the tools it was using; that likely will not be the case everywhere, she said.

“We are currently seeing terms like ‘Shadow AI,’ the unmonitored and unsanctioned use of AI tools. At this time, any accessible generative AI application can be freely used by employees, unknown to and unmonitored by security teams," she said.

This new phenomenon poses a risk similar to that of its twin, Shadow IT, which Cisco describes as "the use of IT-related hardware or software by a department or individual without the knowledge of the IT or security group within the organization." Shadow IT increases the risk of data breaches, which can cost organizations more than a million dollars per incident.
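Surfacing shadow AI usually starts with the plumbing security teams already have. The sketch below, for example, counts requests to a watch list of generative AI endpoints in a proxy log; the CSV format, column names and domain list are illustrative assumptions, not an exhaustive inventory.

```python
import csv
from collections import Counter

# Illustrative watch list; real programs maintain a much longer, updated one.
GEN_AI_DOMAINS = {
    "api.openai.com",
    "chat.openai.com",
    "bard.google.com",
    "claude.ai",
}


def shadow_ai_usage(proxy_log_csv: str) -> Counter:
    """Count requests per (user, domain) for domains on the watch list.

    Assumes a CSV proxy log with 'user' and 'host' columns; adjust to the
    fields your proxy or firewall actually exports.
    """
    hits = Counter()
    with open(proxy_log_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            host = row["host"].lower()
            if host in GEN_AI_DOMAINS:
                hits[(row["user"], host)] += 1
    return hits


if __name__ == "__main__":
    for (user, host), count in shadow_ai_usage("proxy_log.csv").most_common():
        print(f"{user} -> {host}: {count} requests")
```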

Ungoverned LLMs and generative AI in the enterprise put company IP and other sensitive information inside internal business networks at risk. The urgency stems from how quickly these tools are gaining popularity: employees are likely to keep using them across the enterprise, and companies need a defense.

Security teams and business executives need to consider possible guardrails that allow the secure and functional use of these generative AI tools because one way or another, once they are introduced into the enterprise, workers are going to use them.
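One such guardrail is a thin internal gateway that screens prompts for obvious secrets and identifiers before they ever leave the network. The sketch below shows the idea with a few regex rules; the patterns are illustrative, and no production redactor should rely on pattern matching alone.

```python
import re

# Illustrative patterns only; real deployments pair pattern matching with
# classifiers, allow-lists and human review.
SENSITIVE_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def redact_prompt(prompt: str) -> tuple[str, list[str]]:
    """Return the prompt with matches masked, plus the names of the rules that fired."""
    findings = []
    for name, pattern in SENSITIVE_PATTERNS.items():
        if pattern.search(prompt):
            findings.append(name)
            prompt = pattern.sub(f"[REDACTED:{name}]", prompt)
    return prompt, findings


if __name__ == "__main__":
    safe, flagged = redact_prompt(
        "Summarize this config. key=AKIAABCDEFGHIJKLMNOP contact bob@example.com"
    )
    print(safe)
    print("flagged:", flagged)
```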

About the Author

David Barry

David is a Europe-based journalist with 35 years of experience who has spent the last 15 following the development of workplace technologies, from the early days of document management through enterprise content management and content services. Now, with the rise of new remote and hybrid work models, he covers the evolution of technologies that enable collaboration, communications and work, and has recently spent a great deal of time exploring the far reaches of AI, generative AI and general AI.

Main image: Walter Randlehoff | Unsplash