Designing Data Pipelines for a New Asset Class
When expanding a systematic trading framework into a new asset class, most attention initially goes to strategies and models. Traders naturally ask questions like:
- What signals work in this market?
- What features might be predictive?
- How does volatility behave?
But before any of those questions can be answered reliably, a deeper engineering challenge must be addressed:
How should the data pipeline be designed for the new market?
Data pipelines are the foundation of systematic trading research. If the pipeline encodes incorrect assumptions about the market, the entire research stack inherits those errors. Models may appear to work during development while quietly learning artifacts of the data processing layer instead of real market behavior.
This is why expanding into a new asset class should begin with careful pipeline design rather than model experimentation.
Market Structure Drives Data Design
Every asset class carries its own structural characteristics. These characteristics directly affect how data must be processed and interpreted.
For example, index futures typically have:
- centralized exchange matching
- relatively stable tick sizes
- highly liquid continuous trading sessions
- standardized contract structures
Equities introduce different considerations:
- fragmented trading venues
- pre-market and after-hours sessions
- large overnight gaps
- corporate actions such as splits and dividends
- highly uneven liquidity across symbols
A pipeline designed around futures assumptions may unintentionally distort equity data. The reverse can also happen.
For this reason, the first step in designing a new pipeline is not coding. It is identifying which properties of the original market were implicitly assumed by the existing infrastructure.
Separating Raw Data From Research Data
A common mistake in trading systems is combining raw market data with research-ready data too early in the pipeline.
A more robust architecture separates these layers clearly.
A useful structure looks like this:
Raw Market Data
↓
Normalized Market Data
↓
Feature Tables
↓
Strategy State
Each layer serves a different purpose.
Raw Market Data
This layer should represent the market exactly as it occurred. It should be as close as possible to the source data and should avoid interpretation.
Examples include:
- trades
- quotes
- order book snapshots
- raw bar data
- exchange timestamps
Raw data tables should be append-only whenever possible. They act as the permanent historical record of what the system observed.
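The append-only property can be enforced at the storage interface rather than by convention. Here is a minimal sketch using only the standard library; the class name `RawTickStore` and the JSON-lines layout are illustrative assumptions, not a prescription for any particular storage engine.

```python
import json
from pathlib import Path


class RawTickStore:
    """Append-only log of raw market events, one JSON line per record.

    Illustrative sketch: records are never mutated or deleted, so the
    file remains a permanent record of what the system observed.
    Corrections would be written as new records, not in-place edits.
    """

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def append(self, record: dict) -> None:
        # Opening in "a" mode guarantees we only ever add to the end.
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record, sort_keys=True) + "\n")

    def read_all(self) -> list[dict]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

Exposing only `append` and `read_all` makes the immutability contract explicit: downstream layers can consume the raw log but have no API through which to rewrite history.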
Normalized Market Data
The normalization layer transforms raw data into a format suitable for consistent analysis.
Examples of normalization tasks include:
- converting timestamps to a unified timezone
- handling contract rollover
- adjusting prices for splits and dividends
- identifying regular trading hours
- reconstructing consistent bar intervals
This layer ensures that research tools operate on a stable and comparable dataset.
Importantly, normalization should not introduce predictive features. It should only correct structural inconsistencies.
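Two of the normalization tasks above, timezone unification and regular-hours identification, can be sketched together. The 09:30 to 16:00 New York session window is an assumption for US equities; other markets would need their own definitions.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")


def normalize_bar(ts_utc: datetime) -> dict:
    """Convert an exchange timestamp to a unified timezone and flag RTH.

    Structural corrections only: nothing predictive is added here.
    Regular trading hours are assumed to be 09:30-16:00 New York time,
    and zoneinfo handles daylight-saving transitions for us.
    """
    local = ts_utc.astimezone(NY)
    minutes = local.hour * 60 + local.minute
    is_rth = 9 * 60 + 30 <= minutes < 16 * 60
    return {"ts_utc": ts_utc, "ts_local": local, "is_rth": is_rth}
```

Keeping both the UTC and local timestamps in the output lets later layers choose whichever view they need without re-deriving the conversion.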
Feature Tables
Once normalized data exists, feature generation becomes possible.
Feature tables may include:
- volatility measures such as Average True Range (ATR)
- rolling averages
- volume statistics
- relative strength measures
- time-of-day context
- regime indicators
These features represent hypotheses about market behavior.
Separating them from normalization logic prevents accidental mixing of structural corrections with predictive assumptions.
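As one concrete feature, ATR can be computed directly from normalized (high, low, close) bars. This sketch uses a simple average rather than Wilder's smoothing, an assumption chosen for brevity.

```python
def true_range(high: float, low: float, prev_close: float) -> float:
    """True range: the bar's span, extended by any gap from the prior close."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))


def atr(bars: list[tuple[float, float, float]], period: int = 14) -> float:
    """Simple-average ATR over the last `period` bars.

    `bars` is a list of (high, low, close) tuples from the normalized
    layer. The first bar has no prior close, so it contributes no TR.
    """
    trs = [
        true_range(high, low, bars[i - 1][2])
        for i, (high, low, _) in enumerate(bars)
        if i > 0
    ]
    window = trs[-period:]
    return sum(window) / len(window)
```

Because the function takes normalized bars as input, it stays cleanly on the feature side of the boundary: it assumes structural corrections (sessions, adjustments) have already happened upstream.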
Strategy State
The final layer represents the live state of the strategy itself.
Examples include:
- signal outputs
- regime classifications
- trailing stop levels
- execution state
- trade outcome labeling
This layer is where research and live trading converge.
Because it depends on the earlier layers, errors in raw ingestion or normalization propagate directly into strategy behavior.
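Trade outcome labeling, one item from the list above, can be kept as a small pure function of strategy state so that live trading and research apply identical logic. The three-way win/loss/flat label is an illustrative choice, not the only sensible scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TradeOutcome:
    """Outcome record for a closed trade; side is +1 long, -1 short."""

    entry_px: float
    exit_px: float
    side: int

    @property
    def pnl(self) -> float:
        # Signed per-unit P&L; a short profits when price falls.
        return self.side * (self.exit_px - self.entry_px)

    @property
    def label(self) -> str:
        # Three-way outcome label for later supervised analysis.
        if self.pnl > 0:
            return "win"
        if self.pnl < 0:
            return "loss"
        return "flat"
```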
Session Awareness
One of the most important differences between asset classes involves trading sessions.
Futures markets often trade nearly around the clock, with only short maintenance breaks. Many research workflows treat the trading day as a continuous series of bars.
Equities behave differently.
Equity markets include:
- pre-market trading
- regular trading hours (RTH)
- after-hours trading
Liquidity and volatility vary dramatically between these periods.
A well-designed pipeline should explicitly mark session state for every bar. This allows research code to distinguish between:
- overnight gaps
- opening auction dynamics
- midday liquidity shifts
- closing auction effects
Ignoring these differences can lead to misleading statistical conclusions.
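Marking session state per bar can be as simple as a lookup on local clock time. The boundaries below (04:00 pre-market open, 09:30 RTH open, 16:00 close, 20:00 after-hours close) reflect typical US equity hours and are assumptions for illustration.

```python
from datetime import time


def session_label(local_time: time) -> str:
    """Tag a bar with its equity session, given its exchange-local time.

    Assumed boundaries (typical US equities):
      pre-market 04:00-09:30, RTH 09:30-16:00, after-hours 16:00-20:00.
    Everything else is labeled "closed".
    """
    if time(4, 0) <= local_time < time(9, 30):
        return "pre_market"
    if time(9, 30) <= local_time < time(16, 0):
        return "rth"
    if time(16, 0) <= local_time < time(20, 0):
        return "after_hours"
    return "closed"
```

With this label stored on every bar, research code can filter to RTH-only data, or study overnight gaps explicitly, without re-deriving session logic in each notebook.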
Handling Corporate Actions
Corporate actions are another major source of complexity in equities pipelines.
Events such as stock splits and dividends change the historical price series. If these adjustments are not handled properly, features based on historical price comparisons may break.
For example, a stock that undergoes a 2-for-1 split will appear to lose half its value overnight unless historical prices are adjusted accordingly.
A robust pipeline typically maintains:
- raw price history
- adjusted price history
Research features should usually be computed on adjusted prices, while execution systems must remain aware of raw prices.
Separating these views avoids confusion during analysis.
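The 2-for-1 split example translates into a small back-adjustment: all prices strictly before the split effective date are divided by the split ratio, while later prices are left untouched. This sketch handles a single split on a list of closes; a production pipeline would apply a full chain of corporate actions.

```python
def adjust_for_split(
    raw_closes: list[float], split_index: int, ratio: float
) -> list[float]:
    """Back-adjust closes before a split so the series is comparable.

    `ratio` is new shares per old share (2.0 for a 2-for-1 split).
    Prices at indices before `split_index` (the first post-split bar)
    are divided by the ratio; later prices are already on the new basis.
    """
    return [
        px / ratio if i < split_index else px
        for i, px in enumerate(raw_closes)
    ]
```

The raw list stays untouched, matching the architecture above: the adjusted series is a derived view for research, while execution continues to see raw prices.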
Symbol Identity and Instrument Metadata
In futures trading, instrument identity is often defined by a contract code and expiration cycle.
Equities rely on ticker symbols, which carry their own complexities:
- ticker changes
- delistings
- mergers and acquisitions
- exchange migrations
Maintaining a clean instrument master table becomes essential. This table typically stores metadata such as:
- symbol
- exchange
- asset class
- sector classification
- tick size
- trading session definition
The pipeline should treat this metadata as the authoritative source of instrument properties.
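A record in the instrument master might look like the dataclass below; the field set mirrors the metadata list above, and the tick-rounding helper shows one way downstream code can consume it. Field names and the `round_to_tick` helper are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instrument:
    """One row of the instrument master table (illustrative fields)."""

    symbol: str
    exchange: str
    asset_class: str
    sector: str
    tick_size: float
    session: str  # key into a table of trading-session definitions


def round_to_tick(price: float, inst: Instrument) -> float:
    """Snap a price onto the instrument's tick grid using its metadata."""
    ticks = round(price / inst.tick_size)
    # Re-round to suppress binary floating-point noise like 101.24000000000001.
    return round(ticks * inst.tick_size, 10)
```

Because every consumer reads tick size (and session, sector, and so on) from this one table, there is a single authoritative place to correct a property when an exchange changes it.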
Time Alignment
Many systematic trading models rely on multiple timeframes simultaneously.
For example:
- 1-minute bars for microstructure analysis
- 5-minute bars for regime detection
- daily bars for higher-level context
Aligning these timeframes consistently requires careful timestamp handling.
Key design questions include:
- When exactly does a bar close?
- How are missing bars represented?
- How are partial sessions handled?
- Are timestamps aligned to exchange time or UTC?
Answering these questions early prevents downstream confusion in both research and live execution.
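The first question, when a bar closes, is worth pinning down in code. The sketch below assumes one particular convention: bars are half-open intervals labeled by their close time, so a tick at 09:30:30 belongs to the 09:31 one-minute bar. The convention itself is a design choice; what matters is that it is stated once and applied everywhere.

```python
from datetime import datetime, timedelta, timezone


def bar_close(ts: datetime, interval: timedelta) -> datetime:
    """Map a timestamp to the close of the bar that contains it.

    Assumed convention: bars are half-open intervals (start, close],
    labeled by close time. A timestamp exactly on a boundary belongs
    to the bar that closes at that instant.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    elapsed = ts - epoch
    # Ceiling division onto the bar grid: floor-divide the negation.
    n = -((-elapsed) // interval)
    return epoch + n * interval
```

Anchoring the grid to a fixed epoch in UTC keeps bar boundaries stable across sessions and timeframes; whether research then displays them in exchange time is a presentation decision.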
Reproducibility
One of the most important qualities of a research pipeline is reproducibility.
Given the same historical data and code version, the pipeline should always produce identical results.
Achieving this requires attention to:
- deterministic transformations
- versioned feature definitions
- explicit data dependencies
- immutable raw datasets
Without reproducibility, debugging model behavior becomes extremely difficult.
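One lightweight way to make these properties checkable is to record a deterministic fingerprint of the inputs and code version next to every research output. The function below is a minimal sketch; in practice the `code_version` argument might be a git commit hash, which is an assumption here.

```python
import hashlib
import json


def dataset_fingerprint(rows: list[dict], code_version: str) -> str:
    """Deterministic fingerprint of an input dataset plus code version.

    Stored alongside research outputs so any result can be traced back
    to the exact data and transformation code that produced it. Two
    runs over the same rows and version always yield the same digest.
    """
    h = hashlib.sha256()
    h.update(code_version.encode("utf-8"))
    for row in rows:
        # sort_keys makes serialization independent of dict insertion order.
        h.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return h.hexdigest()
```

If a backtest cannot be reproduced, comparing fingerprints immediately distinguishes "the data changed" from "the code changed", which is usually the first debugging question.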
Preparing for Model Development
Once the pipeline produces stable normalized data and features, model development can begin with confidence.
At this stage, researchers can focus on questions such as:
- Which features contain predictive signal?
- How stable are signals across regimes?
- Do models generalize across symbols?
- How sensitive are results to session boundaries?
Because the pipeline was designed carefully, model results are more likely to reflect real market behavior rather than artifacts of data processing.
Closing Thought
Systematic trading is often portrayed as a modeling problem. In reality, it is just as much a data engineering problem.
Markets differ not only in how they move, but also in how their data must be interpreted and structured. Expanding into a new asset class requires respecting those differences at the pipeline level.
When the data architecture is designed thoughtfully, research becomes clearer, models become more trustworthy, and the trading system gains a stronger foundation for long-term evolution.
In the next article in this series, we will explore how modeling approaches must adapt when signals are evaluated across multiple symbols and asset classes.