In short: do not wait for a clean data warehouse, because nobody has one. Pick one decision a commercial leader makes every cycle, assemble only the data that decision needs onto one consistent map of your territories, and build from there. Readiness is not a state you reach before starting; it is a side effect of shipping the first useful thing.
Almost every commercial leader in Indian pharma who has looked at AI has run into the same wall: the IQVIA and SMSRC numbers do not line up, the secondary feeds are messy, and there are three definitions of a territory in active use. The instinct is to fix all of that first, stand up a clean warehouse, and then do AI. That instinct is why the project is still not live two years later. There is a better order of operations.
The short answer
You do not need ready data. You need ready-enough data for one decision. The teams that get the most out of AI did not begin with a pristine data lake; they began with a single workflow a leader cared about, assembled the minimum data that workflow required, got it onto one consistent view, and used the act of shipping to surface and fix the data problems that actually mattered. The problems that did not block that workflow were left alone, correctly.
A two-year "get the data ready" programme is the most expensive way to discover which data was never going to matter.
What 'ready' actually means
"Ready" is not a property of your data warehouse. It is a property of a specific question. Your primary sales are already clean enough to answer "what did we ship." They are nowhere near ready to answer "did this brand move in this territory because of the call plan," because that question needs secondary, IQVIA, and SMSRC reconciled to the same geography, and those rarely are.
So the useful definition is narrow: data is ready for a workload when the few sources that workload depends on agree on the same territories, the same SKUs, and the same time periods, well enough that a Brand Manager would trust the answer. Everything else, the long tail of feeds and fields you are not using yet, can stay messy. Readiness is per-workload, not per-company. This is the same reason the build-or-buy decision is per-layer, not per-platform.
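A per-workload readiness check can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the `Source` type, the `overlap` measure, and the 0.9 threshold are all assumptions made for the example, and the real bar is whatever level of agreement a Brand Manager would actually trust.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """One feed a workload depends on, reduced to the keys it reports on."""
    name: str
    territories: frozenset
    skus: frozenset
    periods: frozenset


def overlap(sources, key):
    """Share of all values of `key` (across sources) that every source has in common."""
    sets = [getattr(s, key) for s in sources]
    union = frozenset().union(*sets)
    if not union:
        return 0.0
    common = frozenset.intersection(*sets)
    return len(common) / len(union)


def is_ready(sources, threshold=0.9):
    """Return (ready, scores). The threshold is a placeholder, not a recommendation."""
    scores = {key: overlap(sources, key) for key in ("territories", "skus", "periods")}
    return all(v >= threshold for v in scores.values()), scores


# Illustrative data only: two feeds that agree on SKUs and periods but not territories.
secondary = Source("secondary", frozenset({"T01", "T02", "T03"}),
                   frozenset({"SKU-A"}), frozenset({"2024-01", "2024-02"}))
iqvia = Source("iqvia", frozenset({"T01", "T02", "T04"}),
               frozenset({"SKU-A"}), frozenset({"2024-01", "2024-02"}))

ready, scores = is_ready([secondary, iqvia])
print(ready, scores)
```

The check only ever looks at the sources passed in, which is the point: feeds the workload does not use never enter the calculation.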
Start with one decision, not one warehouse
Pick a decision a commercial leader already makes on a cycle and currently makes on gut plus a spreadsheet. Good candidates: which territories are under-indexing on a brand relative to its market share; which doctors a Territory Manager should re-prioritise next cycle; where secondary movement and prescription movement disagree, and what that is telling you.
Then assemble only what that decision needs. A territory under-indexing question needs IQVIA market share and your own secondary, on the same territory map. It does not need your full SKU master cleaned, your entire stockist network reconciled, or your CRM migrated. Scope the data to the decision, not the decision to the data. The first shipped workload then becomes the forcing function: the data problems it hits are, by definition, the data problems worth fixing first.
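To make the under-indexing example concrete, here is one possible way to score it, assuming both inputs are already keyed by the same territory ids. The index definition (a territory's share of your own secondary divided by its share of the market), the 0.8 cutoff, and the function name are assumptions for illustration; a real team would agree its own definition with the leader who owns the decision.

```python
def under_indexing(own_secondary, market_size, cutoff=0.8):
    """Return {territory: index} for territories whose index falls below cutoff.

    index = (territory share of own secondary) / (territory share of market).
    Only territories present in both inputs are scored; the rest are ignored.
    """
    shared = sorted(set(own_secondary) & set(market_size))
    own_total = sum(own_secondary[t] for t in shared)
    market_total = sum(market_size[t] for t in shared)
    if own_total == 0 or market_total == 0:
        return {}
    flagged = {}
    for t in shared:
        market_share = market_size[t] / market_total
        if market_share == 0:
            continue  # no market reported here, so no index can be computed
        index = (own_secondary[t] / own_total) / market_share
        if index < cutoff:
            flagged[t] = round(index, 2)
    return flagged


# Illustrative numbers only.
own = {"T01": 50.0, "T02": 30.0, "T03": 20.0}
market = {"T01": 400.0, "T02": 200.0, "T03": 400.0}
flagged = under_indexing(own, market)
print(flagged)
```

Nothing here touches the SKU master, the stockist network, or the CRM. The only precondition is that the two inputs share one territory key.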
The territory map is the unlock, and the work
There is one piece of plumbing nearly every commercial workload depends on, and it is the one most companies have not done: a single, consistent definition of a territory that your sales, secondary, IQVIA, and SMSRC data all map to. Most Indian pharma companies carry at least three live definitions, because the field structure has been reorganised repeatedly and has absorbed acquired businesses over the years. Until those definitions agree, every attribution analysis is quietly running on incompatible geographies, and the conclusions do not survive the next month's data.
Building that shared map is unglamorous and it is most of the work; a workload that quietly normalises your feeds onto one territory view has done the majority of the engineering the production deployment will ever need. It is also the layer no vendor can hand you off the shelf, which is why it sits on the build side of the build-or-buy line. Start it small, scoped to the territories your first decision touches, and extend it as the next workloads demand. Do not try to map the whole country before you ship anything.
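A partial territory map can be as simple as a crosswalk table from each source's own codes to one canonical id. The sketch below assumes a hypothetical crosswalk and made-up codes; real feeds will carry their own identifiers, and the mapping itself is the part that takes the months.

```python
# Hypothetical crosswalk: (source, source_code) -> canonical territory id.
# Only the territories the first decision touches are mapped.
CROSSWALK = {
    ("sales", "MUM-N"): "T01",
    ("secondary", "Mumbai North"): "T01",
    ("iqvia", "P-4001"): "T01",
    ("sales", "PUN-C"): "T02",
    ("secondary", "Pune City"): "T02",
}


def normalise(rows, source, crosswalk=CROSSWALK):
    """Map rows of (source_code, value) onto canonical territories.

    Returns (totals, unmapped): totals sums values per canonical territory;
    unmapped lists source codes with no crosswalk entry, left alone for now.
    """
    totals, unmapped = {}, []
    for code, value in rows:
        territory = crosswalk.get((source, code))
        if territory is None:
            unmapped.append(code)
            continue
        totals[territory] = totals.get(territory, 0.0) + value
    return totals, unmapped


# Illustrative rows only.
totals, unmapped = normalise(
    [("Mumbai North", 120.0), ("Pune City", 80.0), ("Nagpur", 15.0)],
    source="secondary",
)
print(totals, unmapped)
```

Returning the unmapped codes rather than failing on them keeps the map honest about being partial, and gives the next workload a ready list of what to extend.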
A ninety-day path that does not boil the ocean
A pragmatic sequence, drawn from teams that have done it:
First, name the decision and the leader who owns it. If no commercial leader will use the output, the data work has no anchor and will drift back into an IT backlog. The role that owns the workflow matters more than the tooling.
Second, assemble the two or three sources that decision needs onto one territory definition, scoped to the territories in play. Accept that the map is partial. Partial and consistent beats complete and contradictory.
Third, ship a first answer the leader can sanity-check against what they already believe, and trace every number back to its source. The goal of the first cycle is trust, not accuracy to the third decimal place. Where the model and the leader's gut disagree, that disagreement is the most valuable output you will get; it is either a data bug to fix or an insight to act on.
Fourth, let the workload pull the next data-cleanup item into priority. The feeds it stumbles on are the feeds worth fixing. The ones it never touches were never the bottleneck.
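The fourth step can be made mechanical with very little tooling. As a rough sketch, assuming the team logs each data problem the workload hits as a hypothetical `(feed, description)` pair, a simple tally already gives a defensible cleanup order:

```python
from collections import Counter


def cleanup_priority(issues):
    """Rank feeds by how many issues the shipped workload actually hit in them."""
    counts = Counter(feed for feed, _description in issues)
    return counts.most_common()


# Illustrative issue log only.
issues = [
    ("secondary", "stockist code missing from territory map"),
    ("iqvia", "patch split across two territories"),
    ("secondary", "duplicate invoice rows"),
]
ranking = cleanup_priority(issues)
print(ranking)
```

Feeds that never appear in the log never appear in the ranking, which is the argument of this section in one line of output.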
Why this belongs in the consortium
"Where do I start" is a question with a real answer, and the answer is mostly tacit: which corners to cut, which feeds to fix first, which definition of a territory to standardise on, how clean is clean enough for the first cycle. None of that is in a vendor deck. It lives with the people who have already built one of these pipelines at a peer company and remember exactly which part took the months.
That is the conversation AI for Pharma is convened to host, on a recurring schedule, between commercial leaders and the engineers who have done the unglamorous 80%. If your data is a mess and you are trying to figure out where to start, that is not a disqualification; it is the most common starting point in the room. Apply.
