Published On Aug 28, 2025
Updated On Aug 28, 2025
What Is On-Chain Data and Why Is It Hard to Trust at Scale?

On-chain data is often described as the ultimate source of truth in blockchain. Every block is permanent, every transaction is transparent, and every state change is recorded.
But when you look across the system and try to build analytics, risk systems, or governance dashboards on top of it, fault lines in trust appear.
The challenge isn’t that data doesn’t exist; it’s that at scale, it becomes fragmented, inconsistent, and sometimes misleading.
What appears to be clarity at the protocol level can quickly turn into noise when billions of dollars depend on the numbers being accurate.
This blog explores that tension: why on-chain data, despite its transparency, is difficult to trust at scale.
We’ll define what actually counts as on-chain data, examine the practical reasons trust breaks down, and outline the real-world costs of getting it wrong, as well as frameworks and tools that help rebuild reliability.
Let’s get started.
TL;DR
On-chain data includes blocks, transactions, state, logs, rollup proofs, and bridge records. It’s often treated as a single source of truth, but in 2025, it’s fragmented across L1s, L2s, and DA layers.
Billions move daily through DeFi and DAOs. When analytics depend on partial or inconsistent data, trust breaks, leading to misreported TVL, treasury leaks, or governance disputes.
Six recurring gaps drive drift:
- Ephemerality & Finality: blobs and proofs expire, finality differs across chains.
- RPC Inconsistencies: providers prune differently, queries return mismatched results.
- Event Blind Spots: rebases, fake events, and ERC-4337 quirks distort metrics.
- Cross-Chain Drift: assets exist in multiple forms; liquidity gets double-counted.
- Operational Realities: third-party infra gaps and outages corrupt history.
- Shifting Assumptions: “on-chain” means different things at L1s, rollups, and DA layers.
In 2025, hacks topped $1.7B in four months, nearly half of DeFi projects report unverifiable TVL, and bridges move $24B monthly while remaining the leading vector for systemic loss.
To build reliable on-chain analytics, teams need trust-first pipelines with layered controls: multiple RPCs for consistency, metric contracts for clarity, lineage proofs for verifiability, blob capture for retention, and continuous verification to keep drift in check.
On-chain data doesn’t fail at the chain level; it fails in how we capture, store, and interpret it. In a modular, multi-chain world, only trust-first analytics will stand up to scrutiny.
Let’s deep dive into what on-chain data is, where its limitations lie, and how layered controls can make it provable, not just available.
What Is On-Chain Data?
On-chain data refers to all information that is permanently recorded and verifiable within a blockchain network.
This includes the blocks themselves, the transactions they contain, and the resulting changes in state. Unlike off-chain data, which relies on third-party reporting or external oracles, on-chain data is embedded in the consensus process of the chain.
But in 2025, that definition needs more precision. The blockchain stack has expanded far beyond base-layer transactions on Ethereum or Bitcoin. Today, “on-chain data” spans:
- Core primitives: blocks, headers, transactions, receipts, and logs.
- State data: account balances, contract storage, and Merkle proofs.
- Rollup artefacts: commitments, validity or fraud proofs, and data availability blobs.
- Cross-chain records: bridge messages, wrapped asset representations, and checkpoint attestations.
Each of these carries different guarantees. For example, Ethereum’s execution layer data is final once the chain reaches consensus, but rollup transaction data may only be trustworthy once its proof window has closed.
Similarly, Ethereum “blob” data introduced with proto-danksharding (EIP-4844) is only guaranteed to be available for a limited retention period, which introduces new risks for teams building historical analytics pipelines.
This makes even the basic act of defining “on-chain data” a non-trivial decision. What one team considers canonical, e.g., ERC-20 transfer events, may in practice be incomplete or even misleading, while another team might define “on-chain data” more strictly as raw state transitions.
In short, on-chain data is not a single dataset; it’s a layered set of records whose trustworthiness depends on where you draw the boundary and how you access it.
At scale, those layers overlap and conflict across chains, rollups, and DA systems, turning what looks like a single source of truth into inconsistent and incomplete signals.
This is where trust begins to break.
Limitations of On-Chain Data at Scale
On-chain data looks absolute in principle: every block is immutable, every transaction recorded, every state change transparent.
But scale changes the picture. As ecosystems stretch across rollups, bridges, and modular DA layers, the guarantees that look solid at the block level begin to weaken.
Trust breaks not because the blockchain itself stops working, but because the way data is stored, accessed, and interpreted introduces gaps.
Here are the six most common ways those gaps surface:

Ephemerality and Finality
Blockchains are often treated as permanent ledgers, but not all data is kept forever.
Ethereum’s blobs (introduced in EIP-4844) are pruned after roughly 18 days. Celestia retains about 30 days of light client access, while EigenDA ties retention to economic incentives.
For anyone building analytics, this means history can literally vanish unless you capture it yourself. Finality adds another layer: optimistic rollups can take up to two weeks to finalise, while ZK rollups are instant only if provers are available.
Why it matters: Teams basing audits or governance on “available” data risk losing historical completeness overnight. Without archiving, you may not even be able to re-create your own metrics.
RPC Inconsistencies
Most developers and analysts interact with blockchains through RPC endpoints, not raw nodes. But RPC providers don’t all return the same answers.
Some prune old state aggressively, others sync more slowly, and each has its own policy for handling reorgs. Even a simple “eth_getLogs” call may produce different results depending on where you query.
On top of that, mempool data is increasingly redacted to reduce MEV exposure, meaning what you see is a filtered version of reality.
Why it matters: Two teams can query the same block and get different results. Metrics built on one provider may not reproduce elsewhere, undermining credibility in governance or reporting.
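One practical mitigation is a quorum rule over independent providers: accept a queried value only when a majority agrees. Here is a minimal sketch in Python; the provider responses are illustrative placeholders, not real RPC output:

```python
from collections import Counter

def quorum_value(responses, min_agree=2):
    """Accept a queried value only when at least `min_agree` providers return it.

    `responses` is a list of raw results (e.g. block hashes or serialised log
    sets) from independent RPC providers. Returns the agreed value, or None
    when no quorum exists and the query should be retried or flagged.
    """
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count >= min_agree else None
```

When two of three providers agree, the shared answer wins; when all three disagree, the function returns None, and the pipeline should freeze the affected metric rather than silently pick a winner.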
Why Events Mislead Blockchain Analytics
It’s convenient to build dashboards by indexing contract events like “Transfer”. The problem is, events aren’t consensus-critical.
A token can rebase balances without emitting an event, or charge fees on transfers that never appear in logs.
Attackers have even spoofed fake events to inflate token volumes. And as ERC-4337 adoption grows, account abstraction introduces new event structures that older pipelines weren’t built to capture.
Why it matters: If your pipeline trusts events as “truth,” you will miscount balances, volumes, and liquidity. In some cases, attackers exploit these blind spots to drain incentives or manipulate market perception.
Cross-Chain Identity Drift
Assets no longer live in one place. A single token like USDC now exists as a native asset, as a bridged version, and as wrapped derivatives across Ethereum, Arbitrum, Optimism, and Base.
Unless those forms are reconciled, analytics will double-count liquidity or overstate supply. The problem intensifies on Orbit chains and appchains, where custom bridges create even more divergence.
Without strict mapping rules, the same “asset” quickly becomes multiple, conflicting records.
Why it matters: Misstating supply or liquidity leads directly to inflated TVL, incorrect risk assessments, and governance disputes over “phantom” assets.
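One way to guard against double-counting is to treat bridged representations as claims on escrowed collateral rather than as fresh supply: count the native supply once and verify that every bridged form is fully backed. A minimal sketch, with chain names and amounts purely illustrative:

```python
def reconcile_bridged_supply(native_supply, bridge_locked, bridged_supplies):
    """Count native supply once and verify bridged forms are fully backed.

    `native_supply`: total supply on the asset's home chain.
    `bridge_locked`: amount escrowed in bridge contracts on the home chain.
    `bridged_supplies`: {chain: minted amount} for each bridged representation.

    Returns (effective_supply, discrepancy). A non-zero discrepancy means
    bridged supply does not match escrowed collateral and should be flagged,
    never summed into TVL.
    """
    total_bridged = sum(bridged_supplies.values())
    discrepancy = total_bridged - bridge_locked
    return native_supply, discrepancy
```

The key design choice is that bridged balances never add to effective supply; they only either reconcile against the escrow or raise an alert.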
Costs of Running Blockchain Data Infrastructure
The cost of keeping perfect history is high. Running an archive node for full historical access can cost thousands per month, so most teams rely on third-party RPCs or indexers.
That means inheriting their pruning decisions, reorg handling, and outages. Even subtleties like block timestamp manipulation, where validators nudge block times, can distort time-series analytics.
Backfills often happen in chunks, and interruptions leave silent gaps that corrupt datasets without anyone noticing until incentives or governance decisions go wrong.
Why it matters: Silent data gaps erode trust. By the time a treasury overpays incentives or a DAO disputes voter balances, the underlying hole is nearly impossible to patch retroactively.
Shifting Trust Assumptions
Finally, what “on-chain” means has changed. Ethereum’s execution layer offers one set of guarantees, while rollups introduce others, and DA layers like Celestia, Avail, and EigenDA add their own.
Each defines finality, retention, and verifiability differently. A dataset pulled from one layer can’t be assumed to carry the same guarantees as another.
Why it matters: Without recognising these differing assumptions, teams build analytics that look correct but rest on mismatched definitions of “truth.”
None of these six gaps means the blockchain itself is broken; together they mean that raw availability can never be mistaken for reliability.
The closer you get to production-grade analytics, the clearer it becomes: on-chain data is not inherently trustworthy; it has to be made trustworthy.
And when it isn’t, the cost appears quickly, in financial losses, misreported metrics, and governance disputes. Let’s examine what that failure has actually entailed.
The Real-World Cost of Getting It Wrong
When teams build analytics, dashboards, governance, or risk systems atop on-chain data, even subtle inconsistencies can create cascading failures.
Below, the stakes are illustrated with vivid 2025 examples that highlight not just that failures occur, but how they multiply in impact as scale grows.
A Security Crisis Unfolding in Real Time
- In the first four months of 2025, crypto losses hit $1.742 billion, eclipsing the entire 2024 total of $1.49 billion and marking a 4× increase over the same period in 2024.
- April alone accounted for $92.45 million lost across 15 DeFi hacks, doubling from March’s $41.4 million and marking a 27.3% year-over-year increase.
- These incidents were concentrated: Ethereum and BNB Chain accounted for roughly 60% of April’s losses, with “newer” Layer‑2s like Base entering attackers' crosshairs too.
Why it matters: As you scale analytics across L1s, L2s, and bridges, reliance on incomplete or untimely data can mask early exploit patterns. By the time anomalies surface, it’s too late.
TVL: A Metric That Looks Clean, But Isn’t Verifiably So
- Total Value Locked (TVL) across DeFi protocols soared from under $1 billion in early 2020 to over $116 billion by May 2025.
- But a 2025 academic study found only 46.5% of DeFi projects report TVL that aligns with verifiable on-chain balance queries; the rest depend on external servers or non-standard calculations.
- Previous research exposed massive double-counting: during the DeFi peak in late 2021, the gap between TVL and a verifiable measure (“TVR”) reached $139.8 billion, meaning reported TVL was nearly double what could be verified on-chain.
Why it matters: Decisions based on inflated or unverifiable TVL, whether for governance, investments, or incentive design, are built on shifting sands. Without rigorous, on-chain-only baselining, trust fades.
Explosive Growth of L2s, With Fragile Foundations
- DeFi TVL hit a three-year high of $140 billion by mid‑2025, buoyed by Ethereum L1 and its L2s.
Why it matters: With immense volumes flowing through rollups and modular DA layers, any trust gap, whether in rollup finality, blob retention, or cross-chain reconciliation, can amplify into large-scale financial, reputational, or governance damage.
Bridges: High Volume, High Risk, Low Trust
- Bridges now facilitate over $24 billion monthly in cross-chain flows.
- A major 2025 academic survey names bridge mechanisms as the leading source of financial loss in Web3, citing structural flaws in validation, access control, and messaging.
Why it matters: Bridges distribute trust assumptions across disparate chains. Without robust verification of cross-chain state, analytics misstate liquidity, transfers, or TVL, creating misaligned user experiences, financial misreporting, and governance blind spots.
These aren’t abstract risks. They’re escalating threats in 2025 as ecosystems multiply, chains diversify, and data complexity explodes.
If analytics pipelines trust data that’s partial, delayed, or reinvented, the consequences are immediate: financial loss, governance breakdown, and erosion of credibility.
Next, we’ll build a blueprint for a trust architecture for on-chain data: a structure of controls, observability, and lineage that restores confidence and scales with complexity.
How to Make On-Chain Data Trustworthy
On-chain data is often assumed to be “the truth you can build on.” But as we’ve seen, the reality is more complex. Data gets pruned, events can be spoofed, and bridges create multiple versions of the same asset.
If a protocol team, DAO, or analyst treats raw data as perfect, downstream metrics like TVL, liquidity, or supply quickly drift from reality.
That’s why serious teams now design trust architecture: a set of controls and processes that turn messy blockchain records into datasets that can actually be relied on.
Think of it as moving from “data is available” to “data is verifiable, reproducible, and auditable.”
With this framing in mind, the real question becomes: how do you architect your systems so that the numbers you publish remain consistent, defensible, and trusted even under scale?
Acquisition: Ensuring the Right Data Enters the System
- Multiple sources, not one
- A single RPC provider can’t be treated as gospel.
- Different providers prune differently, lag in syncing, or miss logs during reorgs.
- Pulling data from three independent providers and accepting results only when at least two agree dramatically reduces silent drift.
- Respecting finality
- On Ethereum, finality is reached when blocks are marked as “safe” and “finalised.”
- On optimistic rollups, transactions can be challenged for 7–14 days. On ZK rollups, finality depends on the prover submission.
- Indexing without accounting for these windows means you may treat “pending” data as permanent.
- Reorg resilience
- Reorganisations, even if shallow, can erase or reorder recent blocks.
- Pipelines need idempotent write strategies that can roll back and replay small ranges without breaking downstream metrics.
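These acquisition rules can be collapsed into a finality watermark: a block is handed to downstream transforms only once it is buried past its chain’s finality depth. A sketch follows; the depths are illustrative placeholders, not protocol constants, and real values depend on each chain’s consensus rules and challenge windows:

```python
# Illustrative finality depths in blocks; real values must be configured
# per chain from its consensus rules and, for optimistic rollups, the
# length of the fraud-proof challenge window.
FINALITY_DEPTH = {
    "ethereum": 64,              # roughly two epochs to "finalised"
    "optimistic_rollup": 50400,  # ~7-day challenge window at 12s blocks
    "zk_rollup": 1,              # final once the validity proof is accepted
}

def safe_to_index(chain, block_number, head_number):
    """True once `block_number` is buried deeper than the chain's finality depth."""
    depth = FINALITY_DEPTH.get(chain)
    if depth is None:
        raise ValueError(f"no finality rule configured for chain: {chain}")
    return head_number - block_number >= depth
```

Blocks above the watermark stay in a provisional buffer that can be rolled back and replayed on a reorg without touching published metrics.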
Semantics: Defining What Counts as Truth
- Metric contracts
- Before you report TVL or liquidity, you need a precise definition of how it’s calculated: what assets are counted, which contracts are included, and what exclusions apply.
- Otherwise, different teams produce “the same metric” with different rules.
- Token behaviour awareness
- Tokens don’t behave uniformly. Rebasing tokens change balances without emitting transfer events.
- Fee-on-transfer tokens silently reduce balances mid-transfer. Wrapped or bridged assets carry supply on one chain but value on another.
- Without a registry that encodes these quirks, your pipeline misstates supply and volumes.
- Cross-chain normalisation
- USDC, for example, exists as native, bridged, and wrapped forms across Ethereum, Arbitrum, Base, and other L2s.
- Treating them as one asset requires explicit mapping; otherwise, liquidity and supply get double-counted.
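A token registry that encodes these behaviours can then gate which accounting method the pipeline uses for each asset. A minimal sketch, with placeholder addresses rather than real deployments:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TokenProfile:
    symbol: str
    rebasing: bool = False
    fee_on_transfer: bool = False
    bridged_from: Optional[str] = None  # canonical home chain, if bridged

# Placeholder registry entries for illustration only.
REGISTRY = {
    "0xrebasing": TokenProfile("stETH-like", rebasing=True),
    "0xfee": TokenProfile("FEE-like", fee_on_transfer=True),
    "0xbridged": TokenProfile("USDC-like", bridged_from="ethereum"),
}

def needs_state_accounting(address):
    """Event-only accounting is unsafe for rebasing or fee-on-transfer tokens;
    such assets must be measured from state diffs or storage reads instead."""
    p = REGISTRY.get(address)
    return p is not None and (p.rebasing or p.fee_on_transfer)
```

Unknown addresses fall through to the safe default of "not trusted until profiled", which keeps new tokens from silently corrupting metrics.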
Lineage: Proving Where Numbers Come From
Trust doesn’t end at producing a number; you need to be able to show your work.
- Content addressing
- Store raw block slices or event batches with cryptographic hashes so their integrity can be independently verified.
- Signed manifests
- Publish dataset snapshots that prove which blocks and code produced the output. If governance debates erupt, the dataset can be independently checked.
- Versioning transformations
- Every metric should carry the code version that produced it. This ensures future changes don’t overwrite past results without explanation.
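The three lineage controls above compose naturally: content-address the raw slice, record it in a manifest alongside the code version, and let anyone re-verify later. A sketch using SHA-256; signing is omitted here, though in practice the manifest would also be signed or anchored as an attestation:

```python
import hashlib

def content_address(raw_bytes):
    """Content-address a raw dataset slice by its SHA-256 digest."""
    return hashlib.sha256(raw_bytes).hexdigest()

def build_manifest(block_range, dataset_bytes, code_version):
    """Tie a dataset snapshot to the blocks and code version that produced it."""
    return {
        "block_range": block_range,
        "dataset_hash": content_address(dataset_bytes),
        "code_version": code_version,
    }

def verify_manifest(manifest, dataset_bytes):
    """Recompute the hash to confirm a dataset still matches its manifest."""
    return manifest["dataset_hash"] == content_address(dataset_bytes)
```

If a governance debate erupts months later, anyone holding the raw slice can recompute the digest and confirm the published number came from exactly this data and this code version.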
Availability: Surviving Pruned Data
Ethereum’s blobs introduced in EIP-4844 are only retained for about 18 days, Celestia for ~30 days on light clients, and EigenDA according to its economic settings.
If you don’t capture and store this data locally, history literally disappears.
The solution is to treat availability as part of your trust model:
- Continuous blob capture into long-term storage.
- DA manifests that prove not only that data was posted, but that it was verifiable at the time of capture.
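A capture job can be driven by a simple deadline check against the retention window. A sketch follows, with the roughly 18-day window and the safety margin treated as configurable assumptions rather than protocol guarantees:

```python
BLOB_RETENTION_SECONDS = 18 * 24 * 3600  # approximate EIP-4844 pruning window

def capture_overdue(posted_at, now, safety_margin=24 * 3600):
    """True when a blob is within `safety_margin` seconds of being pruned.

    Run this over all tracked blobs on a schedule; anything overdue must be
    copied into long-term storage immediately, or its history is gone.
    """
    return now - posted_at >= BLOB_RETENTION_SECONDS - safety_margin
```

In practice the scheduler would capture blobs well before this deadline; the check exists as a last-line alarm for anything the normal pipeline missed.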
Verification: Continuous Proof, Not One-Time QA
Even with controls in place, pipelines drift unless they’re constantly tested.
- Cross-provider diffs
- Randomly recheck blocks across different providers nightly; freeze metrics if discrepancies appear.
- Invariant checks
- Supply must equal the sum of balances after rebases; AMM reserves must match inflows minus fees; bridged assets must equal locked minus unlocked.
- Drift alarms
- Run both a fast dashboard pipeline and a slow, authoritative pipeline. If outputs diverge beyond tolerance, investigate before publishing.
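Two of these checks fit in a few lines each. Below is a sketch of a supply invariant and a fast-versus-slow drift alarm; the tolerance values are illustrative defaults, not standards:

```python
def supply_invariant_holds(total_supply, balances, tolerance=0):
    """Total supply must equal the sum of all balances, within a rounding tolerance."""
    return abs(total_supply - sum(balances.values())) <= tolerance

def drift_exceeded(fast_value, slow_value, rel_tolerance=0.001):
    """True when the fast dashboard pipeline diverges from the slow,
    authoritative pipeline by more than `rel_tolerance` (relative)."""
    if slow_value == 0:
        return fast_value != 0
    return abs(fast_value - slow_value) / abs(slow_value) > rel_tolerance
```

A failing invariant or tripped drift alarm should freeze publication of the affected metric until the discrepancy is explained, not merely log a warning.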
This architecture doesn’t make data perfect; it makes it provable.
Every number can be traced back to its origin, verified against multiple sources, and reproduced later. That is the standard protocols, treasuries, and DAOs will increasingly need as billions of dollars flow through modular chains and bridges.
To reach that standard, teams now rely on a new wave of infrastructure. In 2025, several tools have emerged that make trust-first data pipelines practical in production.
Best On-Chain Data Tools 2025
Designing trust architecture in theory is one thing, and operating it at scale is another.
Fortunately, the last two years have seen a new generation of data infrastructure tools emerge, purpose-built for modular, multi-chain ecosystems.
They fall into five main patterns: deterministic indexing and ingestion, data availability, provenance and attestation layers, observability frameworks, and privacy-safe access.
Deterministic Indexing and Ingestion
- Substreams (by StreamingFast)
- Strength: Parallel, reorg-aware block processing built directly on Firehose. Ideal for high-throughput chains and backfills.
- Best for teams needing reproducible pipelines at scale. By treating extraction, transforms, and sinks as modular WASM code, Substreams makes ingestion deterministic, perfect for trust-first workflows.
- Limitation: Requires technical overhead; not as “plug-and-play” as managed indexing.
- Subsquid
- Strength: EVM-first indexing with batch-based architecture. High throughput and cost efficiency for querying both historical and live data.
- Excellent when you need structured datasets fast, but want control over how those datasets are materialised.
- Limitation: Event-focused pipelines still require added logic for handling rebasing tokens, fee-on-transfer, and DA retention.
- Goldsky
- Strength: Managed real-time pipelines, API endpoints, and alerting.
- Ideal for teams who don’t want to run infra themselves but need production-grade ingestion with SLAs.
- Limitation: You trade off transparency into internals, so pair it with your own verification layer.
Data Availability-Aware Storage
- Ethereum Blobs (EIP-4844)
- Blobs are cheap to post but pruned after ~18 days. Any analytics layer depending on blobs must run continuous capture jobs into long-term object storage.
- Celestia, EigenDA, and Avail
- Celestia: Sampled availability proofs with ~30-day retention for light nodes.
- EigenDA: Economic guarantees via EigenLayer restaking; retention is configurable.
- Avail: Fraud-proof aligned DA with long-term retrieval incentives.
- For modular chains and Orbit rollups, DA layers now define what is retrievable when. Trust pipelines must integrate DA proofs, not just data bodies.
Provenance and Attestation Layers
- Ethereum Attestation Service (EAS)
- Strength: Anchors dataset snapshots as verifiable attestations.
- Perfect for publishing governance-relevant metrics (TVL, participation, treasury reports). Communities can verify not just the number, but the dataset manifest behind it.
- Content-addressed storage (IPFS, Arweave, Filecoin)
- Strength: Cryptographic addressing ensures reproducibility.
- Use for raw dataset slices, then layer attestations over manifests for lineage.
Observability and Verification
- Differential Queries
- Tools like Tenderly or Blocknative can be used to cross-verify traces, gas profiles, and mempool data against your pipeline.
- Custom invariant engines
- Many teams now run nightly invariant tests: total supply vs sum of balances, AMM reserve reconciliation, bridge lock/unlock deltas. These aren’t off-the-shelf but must be baked into internal systems.
Privacy-Safe Access
- Self-hosted RPC clusters
- As MEV filtering grows, public RPCs often redact mempool or historical details. Self-hosting or hybrid setups using Erigon, Nethermind, or Reth ensure your data isn’t pre-filtered by someone else’s policies.
- Managed RPC with audit logs
- When self-hosting isn’t practical, prefer managed providers that expose audit logs, so you can verify after the fact what was served, when, and under which policy.
Putting It Together
A trust-first pipeline in 2025 typically looks like this:
- Ingest: Substreams or Subsquid for deterministic ingestion; Goldsky for managed real-time feeds.
- Store: Long-term blob capture + DA manifests (Celestia/EigenDA/Avail) in content-addressed storage.
- Prove: Anchor dataset manifests with EAS attestations.
- Verify: Cross-provider diffs + invariant checks nightly.
- Publish: Expose two tiers of metrics: fast, real-time dashboards, and slower, verified, governance-grade reports.
This pattern doesn’t eliminate complexity, but it transforms raw blockchain data into something communities can actually rely on.
The next step is applying these patterns to real metrics, like balances, liquidity, rollup health, and governance, to see how a trust-first approach works in practice.
Building Reliable On-Chain Analytics
It’s one thing to design architecture; it’s another to apply it to the metrics everyone uses daily.
Below are examples of how teams in 2025 can apply trust-first principles across balances, DEX volumes, rollup health, and governance analytics.
Balances and Supply Hygiene
- Most dashboards use “Transfer” events to calculate balances. But rebasing tokens, fee-on-transfer tokens, and mint/burn mechanics make this misleading.
- Trust-First Approach:
- Derive balances from state diffs or execution traces, not just events.
- Maintain a token registry with flags for rebasing and transfer fees.
- Reconcile balances against “totalSupply” invariants at the end of every block.
- Example: For rebasing tokens like stETH, your pipeline must re-calculate balances from storage values each epoch to avoid under- or over-reporting supply.
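For stETH-style tokens, balances are a function of shares held in contract storage, so a state-based pipeline recomputes them from storage values rather than summing Transfer events. A sketch of the share formula, using integer division to mirror on-chain arithmetic:

```python
def rebasing_balance(shares, total_pooled_ether, total_shares):
    """stETH-style balance: shares * totalPooledEther / totalShares.

    Reading these three storage values each epoch keeps balances correct
    even though rebases emit no per-holder Transfer events.
    """
    if total_shares == 0:
        return 0
    return shares * total_pooled_ether // total_shares
```

When staking rewards accrue, `total_pooled_ether` rises while `shares` stay fixed, so every holder’s balance grows with no event for an indexer to catch, which is exactly why event-only pipelines under-report supply.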
DEX Liquidity and Volume Metrics
- Fake trades, wash volume, and pool migrations distort liquidity signals.
- Trust-First Approach:
- Use pool reserve deltas as the base for volume, rather than summing swaps. This catches fee-on-transfer discrepancies and eliminates spoofed “Swap” events.
- Track liquidity across migrated pools, e.g., Uniswap v3 → v4 by canonical pool IDs.
- Apply MEV-aware filters to exclude transactions that revert but still emit logs in some clients.
- Example: When Uniswap v4 introduced hooks in 2024, reserve changes became more complex. Trust-first analytics require both reserve deltas and hook event context to avoid misreporting.
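A reserve-delta volume measure can be sketched from per-block reserve snapshots. Spoofed Swap events never move reserves, so they drop out automatically; the snapshots below are illustrative (reserve0, reserve1) pairs:

```python
def volume_from_reserves(snapshots):
    """Sum absolute changes in reserve0 across consecutive snapshots.

    `snapshots` is a time-ordered list of (reserve0, reserve1) pairs read
    from pool storage. A full implementation would also net out liquidity
    adds/removes; this sketch assumes swap-only intervals.
    """
    return sum(
        abs(nxt[0] - prev[0])
        for prev, nxt in zip(snapshots, snapshots[1:])
    )
```

Comparing this figure against event-derived volume for the same window is itself a useful invariant: a large gap signals spoofed events, fee-on-transfer slippage, or missed logs.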
Rollup Health and Liveness
- Users and DAOs often treat rollup state as final the moment it’s sequenced. In reality, optimistic rollups can be challenged for days, and blob availability can lapse.
- Trust-First Approach:
- Track batch posting latency from sequencer → L1 contract.
- Monitor proof inclusion times (fraud/validity) and flag delays beyond historical baselines.
- Store blob capture manifests locally to defend against missing data after retention windows.
- Example: Arbitrum’s average batch posting time is minutes, but significant delays have occurred. A trust-first pipeline surfaces these anomalies instead of treating sequencer output as unquestionable truth.
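Surfacing those anomalies can start with a gap detector over batch-posting timestamps. A sketch follows; the baseline and multiplier are tunable assumptions, ideally derived from the rollup’s own history rather than hard-coded:

```python
def posting_anomalies(post_times, baseline_seconds=600, factor=3):
    """Flag gaps between consecutive batch postings that exceed
    `factor` times the historical baseline.

    `post_times` is a sorted list of Unix timestamps at which the
    sequencer's batches landed in the L1 inbox contract.
    """
    threshold = factor * baseline_seconds
    return [
        (prev, cur, cur - prev)
        for prev, cur in zip(post_times, post_times[1:])
        if cur - prev > threshold
    ]
```

Each flagged tuple records the two posting times and the gap between them, giving an incident responder a direct pointer into the affected block range.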
Governance and Incentives
- Voting weights, incentive payouts, and distributions often use snapshot data that may later diverge from chain reality. Flash loans or Sybil clusters distort participation metrics.
- Trust-First Approach:
- Always compute voting weights from snapshot-at-block state using storage diffs, not explorer APIs.
- Cross-check governance token supply vs. balances to catch miscounts.
- Run anomaly detection for suspicious voter concentration (flash loan–amplified votes, clustered wallets).
- Example: In 2024, several DAOs overpaid liquidity mining incentives because fake transfer events inflated pool activity. A trust-first design would reconcile payouts against verified reserve deltas before treasury execution.
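A first-pass anomaly check compares each voter’s snapshot balance against their balance at an earlier block; sudden spikes suggest flash-loan amplification or just-in-time accumulation. A sketch, where the spike factor is an illustrative threshold rather than any standard:

```python
def flag_spiking_voters(balance_earlier, balance_at_snapshot, spike_factor=10):
    """Flag voters whose balance jumped by more than `spike_factor`x
    between an earlier reference block and the snapshot block.

    The +1 guards against division-free comparison when the earlier
    balance was zero (a fresh wallet appearing with large weight).
    """
    return [
        voter
        for voter, snap in balance_at_snapshot.items()
        if snap >= spike_factor * (balance_earlier.get(voter, 0) + 1)
    ]
```

Flagged voters aren’t automatically disqualified; the list feeds a manual or rules-based review before vote weights are finalised.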
Distribution and Incentive Monitoring
- Teams often trust their distribution scripts without backtesting. If the indexer misses logs or misinterprets balances, treasuries leak funds.
- Trust-First Approach:
- Compare intended distribution outputs against state diffs post-distribution.
- Automate anomaly alerts if payouts deviate from expected curves.
- Example: Stablecoin protocols running yield incentives must validate that incentives reach unique addresses rather than routing back to Sybil clusters.
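Backtesting a distribution reduces to diffing intended payouts against observed balance changes. A minimal sketch:

```python
def reconcile_distribution(intended, observed_deltas, tolerance=0):
    """Return {address: (intended, observed)} for every mismatch.

    Catches both shortfalls on planned recipients and payouts to addresses
    that were never in the plan, a common Sybil-leak signature.
    """
    mismatches = {}
    for addr, amount in intended.items():
        got = observed_deltas.get(addr, 0)
        if abs(got - amount) > tolerance:
            mismatches[addr] = (amount, got)
    for addr, got in observed_deltas.items():
        if addr not in intended and got != 0:
            mismatches[addr] = (0, got)
    return mismatches
```

Here `observed_deltas` would come from post-distribution state diffs, not from Transfer events, so the check stays immune to the event blind spots described earlier.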
Each of these controls moves pipelines away from fragile event-only assumptions and towards state-verified, reproducible, and auditable metrics.
By treating metrics like products with schemas, invariants, and continuous verification, teams can prevent errors that would otherwise drain treasuries, mislead governance, or distort markets.
To make this actionable, here’s a practical checklist of the controls every team can apply when building trust-first pipelines.
On-Chain Data Implementation Checklist
Building trust into on-chain data pipelines isn’t about adding one new tool or patch. It’s about layering controls so that every number is defensible.
Here’s a practical checklist teams can adopt:
Acquisition
- Query multiple RPC providers; reject data that doesn’t match.
- Respect finality rules for each chain type: Ethereum L1, optimistic rollups, and ZK rollups.
- Handle reorgs with rollback-and-replay logic.
Semantics
- Define metric contracts, e.g., TVL, circulating supply, with explicit rules.
- Maintain a token registry with flags for rebasing, fee-on-transfer, and bridged assets.
- Normalise cross-chain identities to prevent double-counting.
Lineage
- Store raw datasets with cryptographic hashes (content addressing).
- Publish signed dataset manifests or attestations for key metrics.
- Version transformations and tie them to code commits.
Availability
- Capture Ethereum blobs (EIP-4844) before the ~18-day pruning window.
- Store DA layer data (Celestia, EigenDA, Avail) with proofs and manifests.
- Replicate to multiple regions for redundancy.
Verification
- Run nightly cross-provider diffs to catch inconsistencies.
- Apply invariant tests: token supply vs balances, AMM reserves vs flows, bridge lock vs unlock.
- Set up metric drift alarms to flag discrepancies between “fast” and “slow” pipelines.
By applying this checklist, teams shift from dashboards that merely look complete to infrastructures that can be proven complete.
In a multi-chain, modular world, that is the only path to data that communities, investors, and regulators will actually trust.
From Transparency to Trust in Blockchain Data
On-chain data is often described as transparent and final.
But at scale, transparency without verification creates as many problems as it solves. Data fragments across rollups, bridges, and DA layers. Blobs expire, RPCs disagree, and token mechanics break naïve assumptions.
The result is dashboards that look precise but conceal silent errors; errors that drain treasuries, distort governance, or hide security risks until it’s too late.
The shift that protocols, DAOs, and investors need is from “data is available” to “data is provable.” That means designing pipelines with explicit acquisition rules, semantic contracts, lineage tracking, availability safeguards, and continuous verification.
In 2025’s modular, multi-chain world, these aren’t nice-to-haves; they’re survival requirements.
Teams that treat on-chain data as infrastructure, not an afterthought, will be the ones whose analytics still stand up to scrutiny years later. And for ecosystems built on trust, that difference is existential.
At Lampros Tech, we help blockchain teams build trust-first analytics.
Whether you need real-time dashboards, governance-grade reporting, or cross-chain risk monitoring, our Web3 Data Analytics services turn raw blockchain data into reliable, verifiable insights.
We design pipelines with the accuracy, transparency, and resilience that today’s protocols demand.

Astha Baheti
Growth Lead