Datasets

Datasets are pre-built, high-level abstractions that make it easy to extract common blockchain data. They are implemented as helper functions that construct complete Pipeline objects with predefined schemas and transformations.

Available Datasets

Ethereum (EVM) Datasets

Dataset | Description | Use Cases
------- | ----------- | ---------
Blocks | Extract block headers and metadata | Block analysis, network statistics, gas analysis
Address Appearances | Track all address appearances in traces | Contract interactions, address relationships, contract creation tracking
All Contracts | Extract information about all contracts | Contract deployment analysis

Solana (SVM) Datasets

Dataset | Description | Use Cases
------- | ----------- | ---------
Token Balances | Track token account balances | Token holdings, transfers, token program analysis
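Solana datasets follow the same construction pattern as the EVM ones, but ranges are expressed in slots rather than blocks (see Usage Pattern below). A minimal sketch; the `datasets.svm` module path mirrors the `datasets.evm` naming and is an assumption here, as are the `provider` and `writer` objects, which are assumed to be configured as described further down:

from cherry_etl import datasets

# Sketch only: `provider` and `writer` are assumed to be configured
# as shown under Data Providers / Output Formats below.
pipeline = datasets.svm.token_balances(  # assumed path, mirroring datasets.evm
    provider=provider,
    writer=writer,
    from_slot=310000000,  # Solana ranges use slots, not block numbers
    to_slot=310000100,
)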

Usage Pattern

All datasets follow a similar usage pattern:

from cherry_etl import datasets
from cherry_etl.pipeline import run_pipeline

# Create a pipeline using a dataset.
# `provider` and `writer` are assumed to be configured already;
# a fuller runnable sketch follows below.
pipeline = datasets.evm.blocks(  # or any other dataset
    provider=provider,
    writer=writer,
    from_block=18123123,  # or from_slot for Solana
    to_block=18123200     # or to_slot for Solana
)

# Run the pipeline
await run_pipeline(pipeline_name="dataset_name", pipeline=pipeline)
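In a complete script, `provider` and `writer` are configuration objects built before the pipeline. The sketch below assumes a HyperSync provider and a DuckDB writer; the config class and enum names used here (`ingest.ProviderConfig`, `ingest.ProviderKind`, `cc.Writer`, `cc.WriterKind`, `cc.DuckdbWriterConfig`) are assumptions based on common cherry_etl usage, so verify them against the library reference:

import asyncio
import duckdb
from cherry_core import ingest
from cherry_etl import config as cc, datasets
from cherry_etl.pipeline import run_pipeline

async def main():
    # Provider config: HyperSync for an EVM chain (assumed class/enum names)
    provider = ingest.ProviderConfig(kind=ingest.ProviderKind.HYPERSYNC)

    # Writer config: DuckDB, writing into an in-memory database (assumed names)
    connection = duckdb.connect()
    writer = cc.Writer(
        kind=cc.WriterKind.DUCKDB,
        config=cc.DuckdbWriterConfig(connection=connection.cursor()),
    )

    pipeline = datasets.evm.blocks(
        provider=provider,
        writer=writer,
        from_block=18123123,
        to_block=18123200,
    )
    await run_pipeline(pipeline_name="blocks", pipeline=pipeline)

asyncio.run(main())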

Common Features

All datasets share these common features:

  • Predefined Schemas: Each dataset has a well-defined output schema
  • Optimized Performance: Leverages Rust-based core components
  • Parallel Processing: Data ingestion and processing happen in parallel
  • Crash Resistance: Built-in support for crash recovery
  • Continuous Ingestion: Can keep datasets fresh with continuous updates (see the sketch after this list)
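The continuous ingestion sketch below assumes that leaving out the end bound keeps the pipeline following new blocks as they are produced; this is an assumption about the API, not something this page specifies, so check the library reference:

from cherry_etl import datasets

# ASSUMPTION: omitting to_block keeps the pipeline ingesting new blocks
# as they arrive, rather than stopping at a fixed height.
pipeline = datasets.evm.blocks(
    provider=provider,
    writer=writer,
    from_block=18123123,
)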

Data Providers

Datasets work with any supported data provider:

  • EVM Chains: HyperSync, SQD
  • Solana: SQD (beta), Yellowstone-GRPC
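Choosing a provider is a matter of configuration. A sketch, reusing the `ProviderConfig`/`ProviderKind` names assumed in the example above; the `SQD` enum variant and the URL shown are likewise assumptions for illustration:

from cherry_core import ingest

# EVM: HyperSync or SQD (assumed enum variants)
evm_provider = ingest.ProviderConfig(kind=ingest.ProviderKind.HYPERSYNC)

# Solana: SQD portal (beta); the url is an illustrative placeholder
svm_provider = ingest.ProviderConfig(
    kind=ingest.ProviderKind.SQD,
    url="https://portal.sqd.dev/datasets/solana-mainnet",  # illustrative
)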

Output Formats

Datasets can write to any supported output format:

  • ClickHouse
  • Iceberg
  • Delta Lake
  • DuckDB
  • Arrow Datasets
  • Parquet

Writing Custom Datasets

While the built-in datasets cover common use cases, you can also create custom datasets by:

  1. Defining your schema
  2. Creating transformation steps
  3. Building a pipeline configuration

See the Writing Custom Pipelines section for more details.
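As a shape reference only, a custom dataset helper can mirror the built-in ones: take a provider, a writer, and a range, and return a configured pipeline. The query and pipeline class names below (`ingest.Query`, `ingest.evm.Fields`, `cc.Pipeline`, and so on) are assumptions for illustration, not confirmed API:

from cherry_core import ingest
from cherry_etl import config as cc

def my_blocks_dataset(provider, writer, from_block, to_block):
    # 1. Define your schema: select the tables/fields to request
    #    (query and field class names are assumed for illustration)
    query = ingest.Query(
        kind=ingest.QueryKind.EVM,
        params=ingest.evm.Query(
            from_block=from_block,
            to_block=to_block,
            include_all_blocks=True,
            fields=ingest.evm.Fields(
                block=ingest.evm.BlockFields(number=True, timestamp=True),
            ),
        ),
    )
    # 2. Create transformation steps, applied to each batch in order
    steps = []  # e.g. decoding, casting, or joining steps
    # 3. Build the pipeline configuration (assumed constructor)
    return cc.Pipeline(provider=provider, query=query, writer=writer, steps=steps)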

Notes

  • Datasets are inspired by cryo
  • Each dataset is optimized for its specific use case
  • Datasets handle all the complexity of data extraction and transformation
  • You can combine datasets with custom pipeline steps for advanced use cases