The Hidden Costs of Tableau Extracts in Your Snowflake and Databricks Reporting

The Monday meeting

It's nine on a Monday morning and the call has three numbers for last week's revenue.

Marketing pulled their number from a Tableau extract that refreshed at 2am. Finance's report ran directly against Databricks at 7am. Product uses a Power BI Import that refreshes hourly and carries a filter someone added in Q2, one the team doesn't really know is part of the import. The three numbers are all within a few percent of each other. But they aren't the same.

The next forty minutes aren't a decision meeting. They're a conversation trying to reconcile (or defend) the numbers. By the time the room agrees which number is the real number… or, rather, agrees to run them again and look again tomorrow… the meeting is over and nothing has actually been decided.

This is what data islands cost you. And nobody put it in the business case.

Why we built extracts in the first place

Tableau extracts aren't a bad thing, but they are just that — extracts. They are a perfectly sensible answer to a perfectly real problem: querying Snowflake or Databricks directly for every dashboard refresh is slow and expensive, and someone is paying that bill.

The extract pulls a snapshot of the warehouse data into Tableau's local format, scopes it to the columns and filters the dashboard actually needs, and refreshes on a schedule you control. Dashboards open instantly. Filters apply instantly. The warehouse doesn't have to be awake for the analyst to be productive.

This is why every BI tool has the same feature under a different name — Power BI calls it Import mode. Looker's persistent derived tables and aggregate awareness solve a piece of it. Sigma's materialized data… Mode's dataset caching… The whole BI ecosystem has independently arrived at the same answer: don't hit the warehouse if you don't have to.

The trouble is that the answer is a local answer, scoped to one BI tool, refreshed on its own schedule, owned by whoever set it up. Multiply that by every team, every BI tool, every dashboard, every refresh cadence, and you get the Monday meeting.

The hidden costs nobody puts in the business case

No line item in the budget covers data islands. Most people don't even realize they're creating them, along with the costs that come attached, or what it means to own your own little copy of the enterprise data.

Cost in time… Cost in trust… Cost in real money

Every extract is a sync job. Sync jobs fail. Schemas change upstream and the extract breaks. The person who built it leaves. The new owner doesn't know what filter is baked into row three of the data source. The extract keeps running because nobody wants to be the one who broke the executive dashboard.

Over a few years a mid-size data org accumulates extracts the way an old garage accumulates extension cords. You can't remember which one powers what. You're afraid to unplug any of them. The data engineering team spends a steadily growing share of its time keeping extracts in sync, not analyzing data.

And then we start seeing the inconsistencies. Three teams pulling from the same warehouse table at three different times produce three different snapshots. If a transaction lands at 3am, Marketing's 2am extract doesn't see it and Finance's 7am query does.
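The snapshot effect is easy to reproduce. Here is a minimal sketch in Python, with made-up transactions and a hypothetical `revenue_as_of` helper standing in for "whatever the extract or query saw at the moment it ran". The amounts and timestamps are illustrative, not real data.

```python
from datetime import datetime

# Hypothetical rows in one warehouse table: (landed_at, amount).
TRANSACTIONS = [
    (datetime(2024, 1, 8, 1, 30), 100.0),
    (datetime(2024, 1, 8, 3, 0), 40.0),   # lands after the 2am extract
    (datetime(2024, 1, 8, 7, 30), 25.0),  # lands after the 7am query
]


def revenue_as_of(snapshot_time):
    """Sum every transaction that had landed when the snapshot was taken."""
    return sum(
        amount for landed_at, amount in TRANSACTIONS if landed_at <= snapshot_time
    )


marketing = revenue_as_of(datetime(2024, 1, 8, 2, 0))  # 2am Tableau extract
finance = revenue_as_of(datetime(2024, 1, 8, 7, 0))    # 7am direct query
product = revenue_as_of(datetime(2024, 1, 8, 8, 0))    # most recent hourly import

print(marketing, finance, product)
```

Same table, same logic, three different totals. Nobody made a mistake; they just asked at different times.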

The discrepancy is answerable, but someone has to stop and think it through: why are the numbers different… oh right, different snapshots… now, where were we? That derailment, in a meeting with a dozen people, slows everyone down and adds up to real money over time.

This is the cost that hurts most. Not in dollars — in trust. Once a leadership team has been burned by two dashboards disagreeing, they stop trusting every dashboard. The next quarter's "let me get back to you on that number" replaces the next quarter's decision.

The proliferation problem

The worst part is that this is fractal. Marketing has their extract… Finance has their direct queries… Product has its Power BI Import. The customer-facing analytics product has a per-tenant Postgres replica because Tableau extracts don't work for embedded use cases. The ML team has a feature store fed by reverse-ETL. Each one is rational in isolation. Together they are an archipelago of slightly-different versions of the same underlying truth.

Every island has a sync schedule. Every island drifts at its own rate. Every island is one schema change away from breaking. The data team spends more time keeping the islands in sync than they spend on actual analysis. The warehouse bill keeps growing anyway, because each island has to pull from it.

Extracts drift. Three teams, three extracts of the same table, three different numbers in Monday's meeting.

Extracts can't answer new questions. The moment someone needs a new breakdown, the warehouse wakes up anyway.

Extracts go stale silently. Between refreshes, the dashboard is wrong — just not obviously enough to notice.

Extracts proliferate. Every team, every tool, every dashboard. The engineering hours add up to more than the warehouse bill they were supposed to reduce.

Why people put up with it

Because the alternative they know is caching the warehouse, and traditional caches are too blunt to trust.

The classic problem with a cache in front of a database is invalidation. When the underlying data changes, what do you throw away? Most caches answer: all of it. Or worse: we don't know, so we serve stale answers and hope nobody notices. A cache you can't trust isn't a cache. It's a bug with a TTL on it.
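To make the two blunt answers concrete, here is a small Python sketch. `FlushAllCache`, `TTLCache`, and the toy `run_query` are illustrative names, not any real product's API, and the "SQL" is just a table name so the example stays short.

```python
class FlushAllCache:
    """Blunt answer 1: a write to any table discards every cached result."""

    def __init__(self):
        self.entries = {}  # sql -> result

    def get(self, sql, run_query):
        if sql not in self.entries:
            self.entries[sql] = run_query(sql)
        return self.entries[sql]

    def on_write(self, table):
        # No idea which entries depend on `table`, so drop all of them.
        self.entries.clear()


class TTLCache:
    """Blunt answer 2: ignore writes and serve the cached value until it expires."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # sql -> (result, cached_at)

    def get(self, sql, run_query, now):
        hit = self.entries.get(sql)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]  # possibly stale; the cache cannot tell
        result = run_query(sql)
        self.entries[sql] = (result, now)
        return result


# Toy warehouse: the "query" simply looks up a value by table name.
warehouse = {"orders": 100, "users": 7}


def run_query(sql):
    return warehouse[sql]


flush = FlushAllCache()
flush.get("orders", run_query)
flush.get("users", run_query)
flush.on_write("orders")           # only orders changed...
entries_left = len(flush.entries)  # ...but the users entry is gone too

ttl = TTLCache(ttl_seconds=3600)
before = ttl.get("orders", run_query, now=0)
warehouse["orders"] = 140                        # data changes upstream
stale = ttl.get("orders", run_query, now=1800)   # inside the TTL: old answer
fresh = ttl.get("orders", run_query, now=3600)   # TTL expired: re-queried
```

The first cache throws away work it could have kept. The second serves the old number for half an hour without knowing it is old.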

So the rational response is to skip the cache and build an extract instead — at least with an extract you know when it refreshed, even if it drifts the rest of the day. Predictable staleness beats unpredictable staleness. Every team in your org has independently made this calculation and arrived at the same answer.

The reason we keep building islands is that nobody has shipped a cache surgical enough to be trusted as the single source of truth.

The third option

A cache between the BI tools and the warehouse can answer the dashboard refresh without waking Snowflake or Databricks. Same SQL. Same drivers. Same credentials. The BI tool doesn't know it's talking to a cache — it thinks it's talking to the warehouse.

The catch — the reason this hasn't worked before — is invalidation. If the cache can't bust precisely when data changes, you're back to the extract problem: stale answers, drifting numbers, the Monday meeting.

We solve this with fine-grained cache busting: invalidation that knows exactly what changed and clears only the cache entries that actually depend on it. The mechanics are involved enough to deserve their own piece; what matters here is what they make possible. Precise invalidation changes the math. The cache stops being a "best effort" optimization and starts being the single source of truth for every consumer of the data: Tableau, Power BI, the embedded analytics product, the ad-hoc analyst, the scheduled report. One cache. Same numbers everywhere, by construction.
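As a rough illustration of the idea, and not a description of the actual mechanism, here is a Python sketch of a cache that remembers which tables each cached query read. Everything in it is an assumption made for the example: the `DependencyCache` name, table-level granularity, and the caller passing the table list instead of the cache parsing the SQL.

```python
from collections import defaultdict


class DependencyCache:
    """Cache keyed by SQL text that tracks which tables each entry depends on."""

    def __init__(self):
        self.results = {}                 # sql -> cached result
        self.by_table = defaultdict(set)  # table -> SQL strings that read it

    def get(self, sql, tables, run_query):
        if sql not in self.results:
            self.results[sql] = run_query(sql)
            for table in tables:
                self.by_table[table].add(sql)
        return self.results[sql]

    def on_write(self, table):
        # Clear only the entries that read the table that changed.
        for sql in self.by_table.pop(table, set()):
            self.results.pop(sql, None)


calls = []  # records every query that actually reaches the "warehouse"


def run_query(sql):
    calls.append(sql)
    return f"result of {sql}"


cache = DependencyCache()
cache.get("SELECT SUM(amount) FROM orders", ["orders"], run_query)
cache.get("SELECT COUNT(*) FROM users", ["users"], run_query)

cache.on_write("orders")  # new orders land

cache.get("SELECT SUM(amount) FROM orders", ["orders"], run_query)  # re-run
cache.get("SELECT COUNT(*) FROM users", ["users"], run_query)       # still cached
```

Four dashboard requests, three warehouse queries: the users count survives the write to orders, and the orders total is recomputed instead of served stale.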

The Monday meeting ends with agreement because there is nothing to reconcile.

What the gateway doesn't replace

Extracts have legitimate uses outside this story. Offline analysis, disconnected presentations, Power BI Premium features that require imported data — extracts still solve those, and nothing here changes that.

The gateway (the cache sitting in front of the warehouse) is not your warehouse. It does not run your ETL. It does not replace dbt, your data catalog, your identity provider, or your governance layer. The warehouse is still the system of record; the gateway is the layer between your tools and the warehouse that keeps the warehouse asleep for the questions whose answers are already known.

It also doesn't replace good data modeling. A cache cannot fix a query that should never have been written. It can only stop the warehouse from running it twice.

Velocity and lower costs

What you get back, when the islands collapse, is two things.

Velocity. Decisions move at the speed of the question, not the speed of the next extract refresh. The PM's Tuesday-afternoon question gets a Tuesday-afternoon answer. The exec who wants to break down last quarter by a dimension nobody anticipated doesn't have to wait for someone to rebuild the extract. The Monday meeting becomes a decision meeting again, because everyone in it is looking at the same numbers.

Lower costs. The warehouse doesn't wake up just because someone had a question. The dashboard refresh hits a cached answer. The ad-hoc breakdown hits a cached answer for the parts that are already known, and only forwards the truly new computation. Snowflake credits and Databricks DBUs are both billed for the time compute is running, so a suspended warehouse costs nothing in compute.
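A back-of-the-envelope sketch of the awake-time argument, in Python. The `awake_seconds` helper and every number fed into it are hypothetical, and the model is deliberately simplified: the warehouse resumes when a query arrives, runs it, then suspends after an idle timeout. Real billing has minimums and size multipliers that this ignores.

```python
def awake_seconds(query_starts, query_seconds, auto_suspend_seconds):
    """Total seconds the warehouse is running for queries at the given start times.

    Simplified model: each query keeps the warehouse awake for its run time
    plus the auto-suspend idle timeout; overlapping windows are merged.
    """
    total = 0
    awake_until = None
    for start in sorted(query_starts):
        end = start + query_seconds + auto_suspend_seconds
        if awake_until is None or start > awake_until:
            total += end - start                 # warehouse had to resume
            awake_until = end
        elif end > awake_until:
            total += end - awake_until           # extends the current window
            awake_until = end
    return total


# Hypothetical workload: one dashboard refreshing every 15 minutes for 8 hours,
# 30-second queries, 5-minute auto-suspend.
refreshes = range(0, 8 * 3600, 900)

without_cache = awake_seconds(refreshes, 30, 300)  # every refresh wakes it
with_cache = awake_seconds([0], 30, 300)           # only the first one does

print(without_cache, with_cache)
```

Under these assumptions the warehouse is awake for 10,560 seconds without the cache and 330 with it, provided the underlying data doesn't change during the day. Every write that invalidates the cached answer adds one more wake-up, which is still far fewer than one per refresh.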

Both come from the same property: a cache surgical enough to be the single source of truth.