Designing Data Pipelines for a New Asset Class
When expanding a systematic trading framework into a new asset class, most attention initially goes to strategies and models. Traders naturally ask questions like:
- What signals work in this market?
- What features might be predictive?
- How does volatility behave?
But before any of those questions can be answered reliably, a deeper engineering challenge must be addressed:
How should the data pipeline be designed for the new market?
Data pipelines are the foundation of systematic trading research. If the pipeline encodes incorrect assumptions about the market, the entire research stack inherits those errors. Models may appear to work during development while quietly learning artifacts of the data processing layer instead of real market behavior.
This is why expanding into a new asset class should begin with careful pipeline design rather than model experimentation.
Market Structure Drives Data Design
Every asset class carries its own structural characteristics. These characteristics directly affect how data must be processed and interpreted.
For example, index futures typically have:
- centralized exchange matching
- relatively stable tick sizes
- highly liquid continuous trading sessions
- standardized contract structures
Equities introduce different considerations:
- fragmented trading venues
- pre-market and after-hours sessions
- large overnight gaps
- corporate actions such as splits and dividends
- highly uneven liquidity across symbols
A pipeline designed around futures assumptions may unintentionally distort equity data. The reverse can also happen.
For this reason, the first step in designing a new pipeline is not coding. It is identifying which properties of the original market were implicitly assumed by the existing infrastructure.
Separating Raw Data From Research Data
A common mistake in trading systems is combining raw market data with research-ready data too early in the pipeline.
A more robust architecture separates these layers clearly.
A useful structure looks like this:
Raw Market Data
↓
Normalized Market Data
↓
Feature Tables
↓
Strategy State
Each layer serves a different purpose.
Raw Market Data
This layer should represent the market exactly as it occurred. It should be as close as possible to the source data and should avoid interpretation.
Examples include:
- trades
- quotes
- order book snapshots
- raw bar data
- exchange timestamps
Raw data tables should be append-only whenever possible. They act as the permanent historical record of what the system observed.
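The append-only property can be enforced at the storage interface rather than by convention. Here is a minimal sketch using only the standard library; the class name `RawTickStore` and the JSON-lines layout are illustrative assumptions, not a prescription for any particular storage engine.

```python
import json
from pathlib import Path


class RawTickStore:
    """Append-only log of raw market events, one JSON line per record.

    Illustrative sketch: records are never mutated or deleted, so the
    file remains a permanent record of what the system observed.
    Corrections would be written as new records, not in-place edits.
    """

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def append(self, record: dict) -> None:
        # Opening in "a" mode guarantees we only ever add to the end.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")

    def read_all(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

Exposing only `append` and `read_all` makes the immutability contract explicit: downstream layers can consume the raw log but have no API through which to rewrite history.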
Normalized Market Data
The normalization layer transforms raw data into a format suitable for consistent analysis.
Examples of normalization tasks include:
- converting timestamps to a unified timezone
- handling contract rollover
- adjusting prices for splits and dividends
- identifying regular trading hours
- reconstructing consistent bar intervals
This layer ensures that research tools operate on a stable and comparable dataset.
Importantly, normalization should not introduce predictive features. It should only correct structural inconsistencies.
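Two of the normalization tasks above, timezone unification and regular-hours identification, can be sketched together. The 09:30 to 16:00 New York session window is an assumption for US equities; other markets would need their own definitions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def normalize_bar(ts_utc: datetime) -> dict:
    """Convert an exchange timestamp to a unified timezone and flag RTH.

    Structural corrections only: nothing predictive is added here.
    Regular trading hours are assumed to be 09:30-16:00 New York time,
    and zoneinfo handles daylight-saving transitions for us.
    """
    local = ts_utc.astimezone(NY)
    minutes = local.hour * 60 + local.minute
    is_rth = 9 * 60 + 30 <= minutes < 16 * 60
    return {"ts_utc": ts_utc, "ts_local": local, "is_rth": is_rth}
```

Keeping both the UTC and local timestamps in the output lets later layers choose whichever view they need without re-deriving the conversion.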
Feature Tables
Once normalized data exists, feature generation becomes possible.
Feature tables may include:
- volatility measures such as Average True Range (ATR)
- rolling averages
- volume statistics
- relative strength measures
- time-of-day context
- regime indicators
These features represent hypotheses about market behavior.
Separating them from normalization logic prevents accidental mixing of structural corrections with predictive assumptions.
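As one concrete feature, ATR can be computed directly from normalized (high, low, close) bars. This sketch uses a simple average rather than Wilder's smoothing, an assumption chosen for brevity.

```python
def true_range(high: float, low: float, prev_close: float) -> float:
    """True range: the bar's span, extended by any gap from the prior close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def atr(bars: list[tuple[float, float, float]], period: int = 14) -> float:
    """Simple-average ATR over the last `period` bars.

    `bars` is a list of (high, low, close) tuples from the normalized
    layer. The first bar has no prior close, so it contributes no TR.
    """
    trs = [
        true_range(high, low, bars[i - 1][2])
        for i, (high, low, _) in enumerate(bars)
        if i > 0
    ]
    window = trs[-period:]
    return sum(window) / len(window)
```

Because the function takes normalized bars as input, it stays cleanly on the feature side of the boundary: it assumes structural corrections (sessions, adjustments) have already happened upstream.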
Strategy State
The final layer represents the live state of the strategy itself.
Examples include:
- signal outputs
- regime classifications
- trailing stop levels
- execution state
- trade outcome labeling
This layer is where research and live trading converge.
Because it depends on the earlier layers, errors in raw ingestion or normalization propagate directly into strategy behavior.
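Trade outcome labeling, one item from the list above, can be kept as a small pure function of strategy state so that live trading and research apply identical logic. The three-way win/loss/flat label is an illustrative choice, not the only sensible scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeOutcome:
    """Outcome record for a closed trade; side is +1 long, -1 short."""

    entry_px: float
    exit_px: float
    side: int

    @property
    def pnl(self) -> float:
        # Signed per-unit P&L; a short profits when price falls.
        return self.side * (self.exit_px - self.entry_px)

    @property
    def label(self) -> str:
        # Three-way outcome label for later supervised analysis.
        if self.pnl > 0:
            return "win"
        if self.pnl < 0:
            return "loss"
        return "flat"
```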
Session Awareness
One of the most important differences between asset classes involves trading sessions.
Futures markets often trade nearly around the clock, with only short maintenance breaks. Many research workflows treat the trading day as a continuous series of bars.
Equities behave differently.
Equity markets include:
- pre-market trading
- regular trading hours (RTH)
- after-hours trading
Liquidity and volatility vary dramatically between these periods.
A well-designed pipeline should explicitly mark session state for every bar. This allows research code to distinguish between:
- overnight gaps
- opening auction dynamics
- midday liquidity shifts
- closing auction effects
Ignoring these differences can lead to misleading statistical conclusions.
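Marking session state per bar can be as simple as a lookup on local clock time. The boundaries below (04:00 pre-market open, 09:30 RTH open, 16:00 close, 20:00 after-hours close) reflect typical US equity hours and are assumptions for illustration.

```python
from datetime import time


def session_label(local_time: time) -> str:
    """Tag a bar with its equity session, given its exchange-local time.

    Assumed boundaries (typical US equities):
      pre-market 04:00-09:30, RTH 09:30-16:00, after-hours 16:00-20:00.
    Everything else is labeled "closed".
    """
    if time(4, 0) <= local_time < time(9, 30):
        return "pre_market"
    if time(9, 30) <= local_time < time(16, 0):
        return "rth"
    if time(16, 0) <= local_time < time(20, 0):
        return "after_hours"
    return "closed"
```

With this label stored on every bar, research code can filter to RTH-only data, or study overnight gaps explicitly, without re-deriving session logic in each notebook.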
Handling Corporate Actions
Corporate actions are another major source of complexity in equities pipelines.
Events such as stock splits and dividends change the historical price series. If these adjustments are not handled properly, features based on historical price comparisons may break.
For example, a stock that undergoes a 2-for-1 split will appear to lose half its value overnight unless historical prices are adjusted accordingly.
A robust pipeline typically maintains:
- raw price history
- adjusted price history
Research features should usually be computed on adjusted prices, while execution systems must remain aware of raw prices.
Separating these views avoids confusion during analysis.
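The 2-for-1 split example translates into a small back-adjustment: all prices strictly before the split effective date are divided by the split ratio, while later prices are left untouched. This sketch handles a single split on a list of closes; a production pipeline would apply a full chain of corporate actions.

```python
def adjust_for_split(
    raw_closes: list[float], split_index: int, ratio: float
) -> list[float]:
    """Back-adjust closes before a split so the series is comparable.

    `ratio` is new shares per old share (2.0 for a 2-for-1 split).
    Prices at indices before `split_index` (the first post-split bar)
    are divided by the ratio; later prices are already on the new basis.
    """
    return [
        px / ratio if i < split_index else px
        for i, px in enumerate(raw_closes)
    ]
```

The raw list stays untouched, matching the architecture above: the adjusted series is a derived view for research, while execution continues to see raw prices.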
Symbol Identity and Instrument Metadata
In futures trading, instrument identity is often defined by a contract code and expiration cycle.
Equities rely on ticker symbols, which carry their own complexities:
- ticker changes
- delistings
- mergers and acquisitions
- exchange migrations
Maintaining a clean instrument master table becomes essential. This table typically stores metadata such as:
- symbol
- exchange
- asset class
- sector classification
- tick size
- trading session definition
The pipeline should treat this metadata as the authoritative source of instrument properties.
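A record in the instrument master might look like the dataclass below; the field set mirrors the metadata list above, and the tick-rounding helper shows one way downstream code can consume it. Field names and the `round_to_tick` helper are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    """One row of the instrument master table (illustrative fields)."""

    symbol: str
    exchange: str
    asset_class: str
    sector: str
    tick_size: float
    session: str  # key into a table of trading-session definitions


def round_to_tick(price: float, inst: Instrument) -> float:
    """Snap a price onto the instrument's tick grid using its metadata."""
    ticks = round(price / inst.tick_size)
    # Re-round to suppress binary floating-point noise like 101.24000000000001.
    return round(ticks * inst.tick_size, 10)
```

Because every consumer reads tick size (and session, sector, and so on) from this one table, there is a single authoritative place to correct a property when an exchange changes it.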
Time Alignment
Many systematic trading models rely on multiple timeframes simultaneously.
For example:
- 1-minute bars for microstructure analysis
- 5-minute bars for regime detection
- daily bars for higher-level context
Aligning these timeframes consistently requires careful timestamp handling.
Key design questions include:
- When exactly does a bar close?
- How are missing bars represented?
- How are partial sessions handled?
- Are timestamps aligned to exchange time or UTC?
Answering these questions early prevents downstream confusion in both research and live execution.
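The first question, when a bar closes, is worth pinning down in code. The sketch below assumes one particular convention: bars are half-open intervals labeled by their close time, so a tick at 09:30:30 belongs to the 09:31 one-minute bar. The convention itself is a design choice; what matters is that it is stated once and applied everywhere.

```python
from datetime import datetime, timedelta, timezone


def bar_close(ts: datetime, interval: timedelta) -> datetime:
    """Map a timestamp to the close of the bar that contains it.

    Assumed convention: bars are half-open intervals (start, close],
    labeled by close time. A timestamp exactly on a boundary belongs
    to the bar that closes at that instant.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    elapsed = ts - epoch
    # Ceiling division onto the bar grid: floor-divide the negation.
    n = -((-elapsed) // interval)
    return epoch + n * interval
```

Anchoring the grid to a fixed epoch in UTC keeps bar boundaries stable across sessions and timeframes; whether research then displays them in exchange time is a presentation decision.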
Reproducibility
One of the most important qualities of a research pipeline is reproducibility.
Given the same historical data and code version, the pipeline should always produce identical results.
Achieving this requires attention to:
- deterministic transformations
- versioned feature definitions
- explicit data dependencies
- immutable raw datasets
Without reproducibility, debugging model behavior becomes extremely difficult.
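One lightweight way to make these properties checkable is to record a deterministic fingerprint of the inputs and code version next to every research output. The function below is a minimal sketch; in practice the `code_version` argument might be a git commit hash, which is an assumption here.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict], code_version: str) -> str:
    """Deterministic fingerprint of an input dataset plus code version.

    Stored alongside research outputs so any result can be traced back
    to the exact data and transformation code that produced it. Two
    runs over the same rows and version always yield the same digest.
    """
    h = hashlib.sha256()
    h.update(code_version.encode("utf-8"))
    for row in rows:
        # sort_keys makes serialization independent of dict insertion order.
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return h.hexdigest()
```

If a backtest cannot be reproduced, comparing fingerprints immediately distinguishes "the data changed" from "the code changed", which is usually the first debugging question.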
Preparing for Model Development
Once the pipeline produces stable normalized data and features, model development can begin with confidence.
At this stage, researchers can focus on questions such as:
- Which features contain predictive signal?
- How stable are signals across regimes?
- Do models generalize across symbols?
- How sensitive are results to session boundaries?
Because the pipeline was designed carefully, model results are more likely to reflect real market behavior rather than artifacts of data processing.
Closing Thought
Systematic trading is often portrayed as a modeling problem. In reality, it is just as much a data engineering problem.
Markets differ not only in how they move, but also in how their data must be interpreted and structured. Expanding into a new asset class requires respecting those differences at the pipeline level.
When the data architecture is designed thoughtfully, research becomes clearer, models become more trustworthy, and the trading system gains a stronger foundation for long-term evolution.
In the next article in this series, we will explore how modeling approaches must adapt when signals are evaluated across multiple symbols and asset classes.