Chapter 1: Machine Learning for Trading 8

How to read this book 9

What to expect 10

Who should read this book 10

How the book is organized 11

Part 1 – the framework – from data to strategy design 11

Part 2 – ML fundamentals 12

Part 3 – natural language processing 13

Part 4 – deep and reinforcement learning 13

What you need to succeed 14

Data sources 14

GitHub repository 15

Python libraries 15

The rise of ML in the investment industry 15

From electronic to high-frequency trading 16

Factor investing and smart beta funds 18

Algorithmic pioneers outperform humans at scale 20

ML-driven funds attract $1 trillion AUM 21

The emergence of quantamental funds 22

Investments in strategic capabilities 23

ML and alternative data 23

Crowdsourcing of trading algorithms 25

Design and execution of a trading strategy 25

Sourcing and managing data 26

Alpha factor research and evaluation 27

Portfolio optimization and risk management 28

Strategy backtesting 28

ML and algorithmic trading strategies 29

Use cases of ML for trading 30

Data mining for feature extraction 30

Supervised learning for alpha factor creation and aggregation 31

Asset allocation 31

Testing trade ideas 32

Reinforcement learning 32

Summary 32

Chapter 2: Market and Fundamental Data 33

How to work with market data 34

Market microstructure 34

Marketplaces 34

Types of orders 36

Working with order book data 36

The FIX protocol 37

Nasdaq TotalView-ITCH order book data 38

Parsing binary ITCH messages 38

Reconstructing trades and the order book 42

Regularizing tick data 45

Tick bars 46

Time bars 47

Volume bars 49

Dollar bars 50

API access to market data 50

Remote data access using pandas 51

Reading HTML tables 51

pandas-datareader for market data 51

The Investors Exchange 52

Quantopian 53

Zipline 54

Quandl 55

Other market-data providers 56

How to work with fundamental data 57

Financial statement data 57

Automated processing – XBRL 57

Building a fundamental data time series 58

Extracting the financial statements and notes dataset 58

Retrieving all quarterly Apple filings 60

Building a price/earnings time series 61

Other fundamental data sources 62

pandas_datareader – macro and industry data 63

Efficient data storage with pandas 63

Summary 64

Chapter 3: Alternative Data for Finance 65

The alternative data revolution 66

Sources of alternative data 67

Individuals 68

Business processes 68

Sensors 69

Satellites 70

Geolocation data 70

Evaluating alternative datasets 71

Evaluation criteria 72

Quality of the signal content 72

Asset classes 72

Investment style 72

Risk premiums 72

Alpha content and quality 73

Quality of the data 73

Legal and reputational risks 73

Exclusivity 74

Time horizon 74

Frequency 74

Reliability 75

Technical aspects 75

Latency 75

Format 75

The market for alternative data 75

Data providers and use cases 77

Social sentiment data 77

Dataminr 78

StockTwits 78

RavenPack 78

Satellite data 78

Geolocation data 79

Email receipt data 79

Working with alternative data 79

Scraping OpenTable data 79

Extracting data from HTML using requests and BeautifulSoup 80

Introducing Selenium – using browser automation 81

Building a dataset of restaurant bookings 82

One step further – Scrapy and Splash 83

Earnings call transcripts 84

Parsing HTML using regular expressions 85

Summary 87

Chapter 4: Alpha Factor Research 88

Engineering alpha factors 89

Important factor categories 90

Momentum and sentiment factors 90

Rationale 91

Key metrics 92

Value factors 93

Rationale 94

Key metrics 95

Volatility and size factors 96

Rationale 96

Key metrics 97

Quality factors 97

Rationale 98

Key metrics 98

How to transform data into factors 99

Useful pandas and NumPy methods 100

Loading the data 100

Resampling from daily to monthly frequency 100

Computing momentum factors 101

Using lagged returns and different holding periods 102

Computing factor betas 102

Built-in Quantopian factors 103

TA-Lib 103

Seeking signals – how to use zipline 104

The architecture – event-driven trading simulation 105

A single alpha factor from market data 106

Combining factors from diverse data sources 108

Separating signal and noise – how to use alphalens 110

Creating forward returns and factor quantiles 110

Predictive performance by factor quantiles 112

The information coefficient 114

Factor turnover 117

Alpha factor resources 117

Alternative algorithmic trading libraries 117

Summary 118

Chapter 5: Strategy Evaluation 119

How to build and test a portfolio with zipline 120

Scheduled trading and portfolio rebalancing 120

How to measure performance with pyfolio 122

The Sharpe ratio 122

The fundamental law of active management 123

In- and out-of-sample performance with pyfolio 124

Getting pyfolio input from alphalens 125

Getting pyfolio input from a zipline backtest 125

Walk-forward testing out-of-sample returns 126

Summary performance statistics 127

Drawdown periods and factor exposure 128

Modeling event risk 129

How to avoid the pitfalls of backtesting 129

Data challenges 130

Look-ahead bias 130

Survivorship bias 130

Outlier control 131

Unrepresentative period 131

Implementation issues 131

Mark-to-market performance 131

Trading costs 132

Timing of trades 132

Data-snooping and backtest-overfitting 132

The minimum backtest length and the deflated SR 133

Optimal stopping for backtests 133

How to manage portfolio risk and return 134

Mean-variance optimization 135

How it works 136

The efficient frontier in Python 136

Challenges and shortcomings 139

Alternatives to mean-variance optimization 140

The 1/n portfolio 140

The minimum-variance portfolio 141

Global portfolio optimization – the Black-Litterman approach 141

How to size your bets – the Kelly rule 142

The optimal size of a bet 142

Optimal investment – single asset 143

Optimal investment – multiple assets 144

Risk parity 144

Risk factor investment 145

Hierarchical risk parity 145

Summary 146

Chapter 6: The Machine Learning Process 147

Learning from data 148

Supervised learning 150

Unsupervised learning 150

Applications 151

Cluster algorithms 151

Dimensionality reduction 152

Reinforcement learning 152

The machine learning workflow 153

Basic walkthrough – k-nearest neighbors 154

Frame the problem – goals and metrics 154

Prediction versus inference 155

Causal inference 155

Regression problems 156

Classification problems 158

Receiver operating characteristics and the area under the curve 159

Precision-recall curves 159

Collecting and preparing the data 160

Explore, extract, and engineer features 161

Using information theory to evaluate features 161

Selecting an ML algorithm 162

Design and tune the model 162

The bias-variance trade-off 163

Underfitting versus overfitting 163

Managing the trade-off 164

Learning curves 165

How to use cross-validation for model selection 166

How to implement cross-validation in Python 167

Basic train-test split 167

Cross-validation 168

Using a hold-out test set 168

KFold iterator 169

Leave-one-out CV 169

Leave-P-Out CV 170

ShuffleSplit 170

Parameter tuning with scikit-learn 170

Validation curves with yellowbrick 171

Learning curves 171

Parameter tuning using GridSearchCV and pipeline 172

Challenges with cross-validation in finance 172

Time series cross-validation with sklearn 173

Purging, embargoing, and combinatorial CV 173

Summary 174

Chapter 7: Linear Models 175

Linear regression for inference and prediction 176

The multiple linear regression model 177

How to formulate the model 177

How to train the model 178

Least squares 178

Maximum likelihood estimation 179

Gradient descent 180

The Gauss-Markov theorem 181

How to conduct statistical inference 182

How to diagnose and remedy problems 184

Goodness of fit 184

Heteroskedasticity 185

Serial correlation 186

Multicollinearity 187

How to run linear regression in practice 187

OLS with statsmodels 187

Stochastic gradient descent with sklearn 190

How to build a linear factor model 190

From the CAPM to the Fama-French five-factor model 191

Obtaining the risk factors 193

Fama-MacBeth regression 194

Shrinkage methods: regularization for linear regression 198

How to hedge against overfitting 198

How ridge regression works 199

How lasso regression works 201

How to use linear regression to predict returns 201

Prepare the data 201

Universe creation and time horizon 202

Target return computation 202

Alpha factor selection and transformation 203

Data cleaning – missing data 203

Data exploration 204

Dummy encoding of categorical variables 204

Creating forward returns 205

Linear OLS regression using statsmodels 206

Diagnostic statistics 206

Linear OLS regression using sklearn 207

Custom time series cross-validation 207

Select features and target 207

Cross-validating the model 208

Test results – information coefficient and RMSE 209

Ridge regression using sklearn 210

Tuning the regularization parameters using cross-validation 211

Cross-validation results and ridge coefficient paths 212

Top 10 coefficients 212

Lasso regression using sklearn 213