AI Will Not Fix Publishing Metadata (Until Publishers Know What They Are Fixing)

Henry Marsden
Jun 10

For all the excitement around AI in music, one of the most important questions for publishers is also one of the least glamorous.

What exactly is AI being asked to solve?

In recent months, much of the industry conversation has understandably focused on the bigger strategic questions. Licensing. Training data. Consent. Creator compensation. New revenue models. Whether publishers should participate in AI partnerships, and on what terms. These are important conversations, and they will naturally shape the next phase of the industry.

Yet underneath them sits a more immediate, practical and operational challenge. If publishers want to use AI meaningfully, they first need to know whether their own data is in a fit state to support it.

It sounds obvious, but in music publishing it is far from simple. Publishing data is not a neat database of facts waiting to be queried. It is a living, fragmented, territory-specific web of claims, registrations, identifiers, historical decisions, local practices, contractual assumptions and human judgement. Anyone who has spent any length of time inside the operational machinery of publishing knows this instinctively, and this is the world AI is entering.

Start with a Clean House

There is a lot AI can do in a Big Data world, but what if the underlying data itself is poor? AI won’t magically make it reliable. It simply gives the poor data more reach, more speed and more confidence, and that confidence is misplaced.

A publisher with inconsistent internal data, duplicate works, incomplete writer information, misaligned shares, missing identifiers and unclear source hierarchy cannot simply layer AI over the top and expect a generalised solution to emerge. The AI will be able to surface patterns, but not necessarily the right patterns. It can infer relationships, but not necessarily the relationships that matter. It will extrapolate from the available data, but if that available data is structurally flawed, the output may become a faster and more elegant version of the same problem. AI is the great amplifier: if the source data is poor, that is what is going to be amplified.

In music publishing, bad metadata rarely stays contained, particularly with automated workflows like CWR (Common Works Registration) deliveries. This is why data cleanup has to happen in-house first, at least in the sense that a rights holder needs to establish its own high-confidence view of a catalog before automating broadly across it.

AI can absolutely help with that process, and in many cases will add great value. But the direction of travel matters: AI should be used to support the creation of a more reliable internal foundation, not to compensate for the absence of one.

… are we even talking about the same thing?

I've worked on a significant number of artist catalogs. With some of the biggest, it often becomes apparent that the first task is not to directly “fix” anything, at least in the initial phase. The first task is simply to understand what everyone else thinks the catalog actually comprises.

That means collating repertoire reports from the creator’s PRO(s), their publisher(s), and the key licensing hubs and collection sources relevant to each of the creator’s revenue partners. Depending on the repertoire and writer affiliations, that might include any of the MLC, BMI, ASCAP, SACEM (for certain UMPG-administered writers) or ICE (for Sony-administered repertoire). The list goes on! One should never assume that any single source has the truth (despite their claims).

Once all the data is gathered, the real work begins as the gaps become obvious. Which works are genuine? Which are duplicates? Which are erroneous? Which are derivatives, translations, medleys, live versions, arrangements, alternate titles, or local registrations that represent the same underlying work in slightly different ways?

If four out of six sources agree, does that mean the four are correct? Possibly, but not always. The two dissenting sources may represent newer, updated claims or a more accurate split for a particular territory. Equally, they may simply be wrong. If all six sources disagree on most data points, the question becomes more fundamental: are they even talking about the same work? If only a handful of data points align, it is hard even to establish whether the records concern the same composition.
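As a rough illustration of the "four out of six" question, here is a minimal Python sketch that tallies what each source reports for a given field and lists the dissenters for human review. The source names, field names and values are all made up for the example, and the thresholds in `enough_overlap_to_compare` are arbitrary assumptions rather than any industry rule. The important design choice is that the majority value is reported, not adopted.

```python
from collections import Counter

# Hypothetical claims from six sources about what may be the same work.
claims = {
    "pro_a":     {"title": "MY SONG", "writer_share": 50.0},
    "pro_b":     {"title": "MY SONG", "writer_share": 50.0},
    "publisher": {"title": "MY SONG", "writer_share": 50.0},
    "hub_x":     {"title": "MY SONG (LIVE)", "writer_share": 50.0},
    "society_c": {"title": "MY SONG", "writer_share": 33.33},
    "society_d": {"title": "MY SONG", "writer_share": 33.33},
}

def field_agreement(claims, field):
    """Summarise how the sources agree on one field; does not pick a winner."""
    counts = Counter(
        record[field] for record in claims.values() if record.get(field) is not None
    )
    if not counts:
        return {"values": {}, "agreement": 0.0, "dissenting": []}
    top_value, top_count = counts.most_common(1)[0]
    dissenting = sorted(
        source
        for source, record in claims.items()
        if record.get(field) not in (None, top_value)
    )
    return {
        "values": dict(counts),
        "agreement": top_count / sum(counts.values()),
        "dissenting": dissenting,  # candidates for review, not "wrong" by default
    }

def enough_overlap_to_compare(claims, fields, min_agreement=0.5, min_fields=2):
    """Crude check for 'are these even the same work?' (thresholds are assumptions)."""
    agreeing = [
        f for f in fields if field_agreement(claims, f)["agreement"] >= min_agreement
    ]
    return len(agreeing) >= min_fields

share_report = field_agreement(claims, "writer_share")
```

Here `share_report["dissenting"]` names the two societies whose split differs. Whether they hold a newer claim or an error is exactly the question the code cannot answer.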

Discernment Can’t Be Replaced

This is where publishing data needs discerning eyes. An experienced operator can look at a spectrum of registrations and understand the likely context behind them. They know what 200% shares mean. They understand how “publisher’s share” and “writer’s share” mean different things in different contexts. They can interpret the difference between Anglo-American conventions and BIEM-influenced practices. They can spot when Nordic registrations appear unusual to a UK eye, but make sense locally. They can tell when a record looks like a duplicate, when it looks like a derivative, and when it needs to be kept distinct rather than automatically collapsed into another entry.

This kind of discernment matters because not all metadata issues are equally important.

Some issues block revenue and some create audit risk. Some cause confusion but have little practical economic impact. Some are worth fixing immediately, whereas some should be logged but not prioritised. Some look messy but do not materially affect collection, whilst others look minor but are precisely the thing preventing revenue from being allocated.

Aligning Internally, then Going to Market

In the superstar catalog examples mentioned above, once we had established the likely universe of works, we built a process that used CISAC systems to retrieve or allocate Preferred ISWCs. Until every work can be uniquely identified (and by all parties), conversations with external partners are incredibly vulnerable to misunderstanding.

This is one of the most common causes of friction in publishing data projects. Different parties think they are discussing the same work, but their systems are pointing to overlapping, duplicated or slightly different records. A publisher may refer to one version of a title, a society may refer to another, a hub will almost certainly have multiple registrations and a partner may be looking at a derivative rather than the original composition. One licensing hub recently told me that they tend to see 20 to 30 individual registrations on average for a single unique work! The volume of data flow and the likelihood of misalignment are staggering, and unsurprising given the fragmented nature of publishing rights.

Once a Preferred ISWC is established, everyone has a better chance of at least talking about the same specific composition. Only then does it make sense to build out an “authoritative” works view, or more accurately, a “highest-confidence” view. This includes alternate titles, writer names, IPIs, roles, controlled shares, collection shares, publisher details and any relevant notes on territorial treatment.
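To make the idea of a highest-confidence view concrete, here is one possible shape for it in Python, assuming every stored value keeps a note of which sources assert it and how far it is trusted. The class names, the `confidence` scale and the `weak_claims` helper are all hypothetical, and the ISWC, IPI and share values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Claim:
    """One asserted value, with its provenance and a trust score."""
    value: object
    sources: List[str]
    confidence: float  # 0.0 to 1.0; how this is scored is left open here

@dataclass
class WriterEntry:
    name: Claim
    ipi: Claim
    role: Claim
    controlled_share: Claim
    collection_share: Claim

@dataclass
class WorkView:
    preferred_iswc: str
    title: Claim
    alternate_titles: List[Claim] = field(default_factory=list)
    writers: List[WriterEntry] = field(default_factory=list)
    publishers: List[Claim] = field(default_factory=list)
    territory_notes: Dict[str, str] = field(default_factory=dict)

    def weak_claims(self, threshold: float = 0.8) -> List[str]:
        """Label every claim whose confidence falls below the threshold."""
        labelled = [("title", self.title)]
        labelled += [(f"alternate_titles[{i}]", c) for i, c in enumerate(self.alternate_titles)]
        for i, writer in enumerate(self.writers):
            for name in ("name", "ipi", "role", "controlled_share", "collection_share"):
                labelled.append((f"writers[{i}].{name}", getattr(writer, name)))
        labelled += [(f"publishers[{i}]", c) for i, c in enumerate(self.publishers)]
        return [label for label, claim in labelled if claim.confidence < threshold]

# Placeholder data for illustration only.
work = WorkView(
    preferred_iswc="T-000.000.001-0",
    title=Claim("MY SONG", ["pro_a", "publisher"], 0.95),
    writers=[
        WriterEntry(
            name=Claim("A WRITER", ["pro_a", "publisher"], 0.9),
            ipi=Claim("00000000000", ["pro_a"], 0.9),
            role=Claim("CA", ["pro_a"], 0.9),
            controlled_share=Claim(50.0, ["publisher"], 0.85),
            collection_share=Claim(50.0, ["hub_x"], 0.6),
        )
    ],
    territory_notes={"SE": "Local registration differs; keep distinct pending review."},
)
```

Keeping the sources alongside each value means the view stays a collection of claims with provenance, instead of quietly turning into a statement of fact.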

That distinction between “authoritative” and “high confidence” matters. As my friend Dan Fowler puts it, there is no “truth” to be found in music publishing data sets, only claims. The job is to understand which claims are most credible, most current, most commercially relevant and fit together in the tidiest way.

First Works, then Recordings

Once the works layer is in place, recordings can then (and really only then) be matched properly.

The order is often overlooked. There is understandable pressure to jump straight to recording-to-work matching because that is where so much revenue leakage is visible. But if the works layer is not clean, recording matching becomes much harder, if not a total misnomer. A recording cannot be confidently attached to the right composition if the composition itself is duplicated, misidentified or sitting in multiple conflicting forms.

This becomes even more important where derivatives, medleys, translations, remixes and adaptations are prevalent. A recording may appear to match a title, but the correct underlying work could be a specific version, arrangement or translation. Matching it to the wrong work may appear superficially successful while creating deeper downstream problems.

Again, AI can be enormously useful here. It can compare ISRCs, titles, artist names, release dates, contributors, audio metadata and external identifiers at a surface level, and can generate likely match candidates and confidence scores. It can highlight where a recording appears to sit against the wrong derivative and reduce the manual workload dramatically, but the final judgement still needs publishing experience.

The reason is simple. The decision is not just whether two strings look similar. It is whether a specific recording should be linked to a specific work for the purposes of licensing, royalty processing and future operational confidence. That is a publishing judgement rather than a simple “likely” data science match.
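A small sketch shows where the line between candidate generation and judgement can sit. It uses Python's standard `difflib.SequenceMatcher` for title similarity, which is plain string matching standing in for whatever model a real system would use. The record layout, the 0.7/0.3 weighting and the 0.6 cut-off are assumptions made for the example, and the ISWCs are placeholders. Every candidate is returned flagged for review; nothing is linked automatically.

```python
from difflib import SequenceMatcher

def normalise(text):
    """Uppercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.upper().split())

def title_similarity(a, b):
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()

def candidate_matches(recording, works, min_score=0.6):
    """Rank works that might underlie a recording; a human makes the final call."""
    rec_writers = {normalise(n) for n in recording.get("writers", [])}
    candidates = []
    for work in works:
        titles = [work["title"], *work.get("alternate_titles", [])]
        title_score = max(title_similarity(recording["title"], t) for t in titles)
        work_writers = {normalise(n) for n in work.get("writers", [])}
        union = rec_writers | work_writers
        # Jaccard overlap of writer names; 0.0 when neither side lists any.
        writer_score = len(rec_writers & work_writers) / len(union) if union else 0.0
        score = 0.7 * title_score + 0.3 * writer_score  # weights are an assumption
        if score >= min_score:
            candidates.append(
                {"iswc": work["iswc"], "score": round(score, 3), "needs_review": True}
            )
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

# Placeholder data for illustration only.
recording = {"title": "My Song (Live)", "writers": ["A Writer"]}
works = [
    {"iswc": "T-000.000.001-0", "title": "My Song", "writers": ["A Writer"]},
    {"iswc": "T-000.000.002-1", "title": "Mi Cancion", "writers": ["A Writer", "B Translator"]},
]
matches = candidate_matches(recording, works)
```

In this example the original work ranks first, but the score says nothing about whether a live recording should sit against the original or against a separately registered arrangement. That part remains the publishing judgement.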

Bringing in AI

Once the internal high-confidence data set exists, AI becomes much more powerful. At this point it can be used to compare external sources against the internal reference view. It can flag where CMOs, licensing hubs, DSPs or internal databases diverge, or at a minimum be used to build tooling that performs this comparatively simple data comparison task at scale. It can help prepare registration updates, prioritise conflicts by likely revenue impact and monitor whether corrections have propagated. It can create a feedback loop where operational improvements are tracked rather than disappearing into spreadsheets and email threads.
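The comparison task itself is simple enough to sketch. The Python below assumes both the internal reference view and an external source can be keyed by Preferred ISWC, and that the publisher holds some revenue estimate per work. The `est_annual_revenue` field and all of the values are hypothetical. It lists missing works and field mismatches, ordered so that the largest likely revenue impact comes first.

```python
def find_divergences(reference, external, fields):
    """Compare an external source against the internal reference, work by work."""
    issues = []
    for iswc, ref in reference.items():
        revenue = ref.get("est_annual_revenue", 0.0)  # hypothetical internal estimate
        ext = external.get(iswc)
        if ext is None:
            issues.append(
                {"iswc": iswc, "kind": "missing_at_source", "field": None, "revenue": revenue}
            )
            continue
        for name in fields:
            if ref.get(name) != ext.get(name):
                issues.append(
                    {
                        "iswc": iswc,
                        "kind": "mismatch",
                        "field": name,
                        "expected": ref.get(name),
                        "found": ext.get(name),
                        "revenue": revenue,
                    }
                )
    # Highest estimated revenue first, so review time goes where it matters most.
    return sorted(issues, key=lambda issue: issue["revenue"], reverse=True)

# Placeholder data for illustration only.
reference = {
    "T-000.000.001-0": {"title": "MY SONG", "controlled_share": 50.0, "est_annual_revenue": 1200.0},
    "T-000.000.002-1": {"title": "OTHER SONG", "controlled_share": 100.0, "est_annual_revenue": 90000.0},
}
external = {
    "T-000.000.001-0": {"title": "MY SONG", "controlled_share": 25.0},
}
issues = find_divergences(reference, external, ["title", "controlled_share"])
```

Re-running the same comparison after corrections have been submitted gives a basic way to monitor whether they have propagated: an issue that disappears from the list has been picked up by that source.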

This is where I think the most immediate value sits. AI should not be viewed first as a magic layer that sits above publishing data and makes sense of it. It should be viewed as a practical accelerator for workflows that publishers already understand but struggle to execute at scale.

Catalogs are too large to manage by hand, even before accounting for the fragmentation of data sources and the sheer volume of recordings, versions and uses across platforms and territories. Manual processes cannot keep up, yet AI alone cannot be trusted as a silver bullet to resolve that tension.

The starting point has to be the publisher’s own understanding of its catalog. Without that foundation AI risks accelerating the wrong work. With that foundation, it can become genuinely useful.

A Next Generation Rights Holder

The next phase of publishing operations will not simply belong to the companies that adopt AI quickest. It will belong to the companies that understand their data and workflows well enough to apply AI responsibly, practically and economically.

That means investing in cleanup before automation. It means involving experienced catalog and copyright people before assuming the machine has solved the problem. It means knowing the difference between a cosmetic metadata issue and a revenue-blocking issue. It means recognising that local registration nuance is not noise to be normalised away, but context to be interpreted.

Most importantly, it means treating AI as part of the publishing operation, not a replacement for publishing expertise.

The opportunity has always been real for rights holders that build strong internal data foundations. That opportunity has suddenly been sharpened, and exponentially so, by the application of AI. However, good foundations come first.

In music publishing, the quality of catalog management has always depended on the quality of the data underneath it. AI does not change that, but it is making the consequences more visible than ever before.