Microsoft Launches Research to Track AI Training Data and Credit Creators


Microsoft is developing a new research initiative aimed at tracing the influence of specific data used in the training of artificial intelligence (AI) models — a move that could one day enable creators to receive credit and possibly compensation when their work is used to develop generative AI systems.

According to a job listing originally posted in December and recently recirculated on LinkedIn, Microsoft is hiring a research intern to explore what it calls “training-time provenance.” The goal of the project is to estimate the impact of individual training examples, such as books, images, or code, on the outputs of AI tools like chatbots, coding assistants, or image generators.

“Current neural network architectures are opaque in terms of providing sources for their generations, and there are good reasons to change this,” the job listing reads. It adds that the project is interested in developing tools that could lay the groundwork for “incentives, recognition, and potentially pay” for people who contributed valuable data to future AI systems.
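Microsoft has not published the method behind the project, but the underlying idea, attributing a model's behavior to individual training examples, has an existing research lineage. As a rough illustration only, the Python sketch below implements a simplified version of TracIn (Pruthi et al., 2020), one published gradient-based attribution technique; it is not drawn from Microsoft's job listing, and the toy model, examples, and learning rate are placeholders.

    # Illustrative only: a simplified TracIn-style influence score. TracIn
    # approximates the influence of a training example on a test example as
    # the dot product of their loss gradients, summed over model checkpoints
    # saved during training and scaled by the learning rate.
    import torch
    import torch.nn as nn

    def example_gradient(model, loss_fn, x, y):
        """Flattened gradient of the per-example loss w.r.t. all parameters."""
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])

    def tracin_influence(checkpoints, loss_fn, train_ex, test_ex, lr=0.01):
        """Sum over checkpoints of lr * <grad(train_ex), grad(test_ex)>.
        A large positive score suggests train_ex pushed the model toward
        the behavior observed on test_ex."""
        score = 0.0
        for model in checkpoints:  # models saved at different training steps
            g_train = example_gradient(model, loss_fn, *train_ex)
            g_test = example_gradient(model, loss_fn, *test_ex)
            score += lr * torch.dot(g_train, g_test).item()
        return score

    # Toy usage: one checkpoint, a tiny linear classifier, random data.
    model = nn.Linear(4, 2)
    loss_fn = nn.CrossEntropyLoss()
    train_ex = (torch.randn(1, 4), torch.tensor([0]))
    test_ex = (torch.randn(1, 4), torch.tensor([1]))
    print(tracin_influence([model], loss_fn, train_ex, test_ex))

Scaling estimates like this from a toy classifier to a frontier-scale generative model is the hard, open part of the problem, which is presumably why Microsoft is framing it as research rather than a product.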

The idea is gaining traction amid growing legal and ethical scrutiny of how generative AI models are built. Many of these systems are trained on vast amounts of publicly available online content, much of it copyrighted. While AI companies often argue that the fair use doctrine protects their training practices, creators across industries, including artists, authors, musicians, and programmers, have pushed back.

Microsoft itself is currently facing at least two lawsuits over alleged copyright infringement. In December, The New York Times filed suit against Microsoft and its AI partner OpenAI, alleging that millions of its articles were used without permission to train language models. Separately, a group of software developers sued Microsoft over GitHub Copilot, claiming that the coding assistant was trained on copyrighted code repositories.

The new research effort appears to be partly inspired by the ideas of Jaron Lanier, a renowned technologist and interdisciplinary scientist at Microsoft Research. Lanier has long advocated for a concept he calls “data dignity” — the notion that digital content should remain connected to the people who created it.

In an April 2023 op-ed published in The New Yorker, Lanier outlined a scenario in which AI-generated content — such as a custom animated film featuring a user’s children in a fantasy world — could trace its creative roots back to specific artists, voice actors, and writers. These contributors, or their estates, could then be acknowledged and even compensated based on their influence on the output.

The Microsoft project builds on this vision by exploring whether training data can be tracked and quantified in a meaningful way. If successful, it could lead to systems that reward content creators directly when their data is used to power AI tools.

Other companies are exploring similar territory. Bria, an AI startup that recently secured $40 million in venture capital, claims to “programmatically” compensate data owners based on how influential their contributions are to model outputs. Meanwhile, Adobe and Shutterstock pay contributors to their datasets, though the specifics of their payout models remain largely opaque.

For most large AI labs, the current standard approach is to sign licensing agreements with publishers or offer opt-out mechanisms that allow creators to exclude their content from future training. However, these opt-out systems don’t apply to models that have already been trained, and many creators say the process of opting out is burdensome and unclear.

Microsoft’s research could represent a shift toward a more creator-centric model. The company’s emphasis on provenance — the ability to trace outputs back to individual training samples — could also prove useful in addressing legal challenges related to copyright and content attribution.

Still, it remains to be seen whether Microsoft’s project will move beyond the research phase. OpenAI announced similar plans nearly a year ago, promising tools that would let creators control how their work is used in training data. But as of today, those tools have not been released.

Critics have also warned that such initiatives may amount to little more than “ethics washing” — attempts to portray AI companies as socially responsible while continuing controversial practices behind the scenes. At the same time, several top AI labs, including Google and OpenAI, have lobbied the U.S. government to ease copyright protections to accommodate AI development. OpenAI has explicitly called for fair use to be codified as a legal shield for model training.

Against this backdrop, Microsoft’s willingness to investigate ways to measure and credit the use of training data stands out. Whether motivated by legal pressure, ethical responsibility, or public relations, the company appears to be taking the idea of data attribution seriously — at least in theory.

If successful, the research could pave the way for a new AI ecosystem where creators are not just data sources but active participants in a value chain. This could change how artists, writers, and developers engage with AI platforms and shift the conversation from exploitation to collaboration.

For now, Microsoft’s training-time provenance project remains a work in progress. But in an industry grappling with copyright lawsuits, creator backlash, and regulatory uncertainty, it could be the beginning of a more transparent and accountable future for AI.
