• Test the factor using within-industry ranking, and compare it against the regression-based version and the raw-factor-value version. This section follows QEPM, p. 117.
  • Set the DB_URI environment variable to point to your database.

Parameter setup



In [1]:
%matplotlib inline
import os
import pandas as pd
import numpy as np
from PyFin.api import *
from alphamind.api import *

factor = 'CFO2EV'                    # operating cash flow to enterprise value
universe = Universe('zz800')         # CSI 800 universe
start_date = '2010-01-01'
end_date = '2018-04-26'
freq = '20b'                         # rebalance every 20 business days
category = 'sw_adj'                  # adjusted Shenwan industry classification
level = 1
horizon = map_freq(freq)             # forward-return horizon matching the rebalance frequency

engine = SqlEngine(os.environ['DB_URI'])

ref_dates = makeSchedule(start_date, end_date, freq, 'china.sse')
sample_date = '2018-01-04'
sample_codes = engine.fetch_codes(sample_date, universe)

sample_industry = engine.fetch_industry(sample_date, sample_codes, category=category, level=level)

In [2]:
sample_industry.head()


Out[2]:
code industry_code industry
0 1 103032101 银行
1 2 1030320 房地产
2 6 1030320 房地产
3 8 1030309 机械设备
4 9 1030328 综合

Sample factor


Below we compare three approaches and examine how well each one avoids industry concentration:

  • ranking on the raw factor values;
  • ranking the raw factor within each industry;
  • ranking the residuals from regressing the raw factor on industry dummies.

1. Raw factor ranking



In [3]:
factor1 = {'f1': CSQuantiles(factor)}
sample_factor1 = engine.fetch_factor(sample_date, factor1, sample_codes)
sample_factor1 = pd.merge(sample_factor1, sample_industry[['code', 'industry']], on='code')

In [4]:
sample_factor1.sort_values('f1', ascending=False).head(15)


Out[4]:
f1 code chgPct secShortName industry
760 1.00000 601988 0.0000 中国银行 银行
660 0.99875 600919 0.0093 江苏银行 银行
764 0.99750 601997 0.0175 贵阳银行 银行
707 0.99625 601288 -0.0026 农业银行 银行
329 0.99500 2807 0.0181 江阴银行 银行
645 0.99375 600875 -0.0027 东方电气 电气设备
716 0.99250 601398 -0.0146 工商银行 银行
657 0.99125 600908 0.0297 无锡银行 银行
710 0.99000 601328 -0.0016 交通银行 银行
675 0.98875 601001 0.0302 大同煤业 采掘
631 0.98750 600823 0.0000 世茂股份 房地产
590 0.98625 600657 0.0017 信达地产 房地产
209 0.98500 2244 0.0025 滨江集团 房地产
67 0.98375 627 0.0012 天茂集团 保险
410 0.98250 600050 -0.0030 中国联通 通信

Without any industry adjustment, the stocks with the largest values of our chosen alpha factor, CFO2EV, are concentrated in banks and the broader financial sector.
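
As a quick sanity check on this concentration, we can count the industries represented in, say, the top decile. A minimal sketch reusing the sample_factor1 frame built above:

# Count industry membership among the top 10% of names ranked by the raw-factor quantile
top_decile = sample_factor1[sample_factor1['f1'] >= 0.9]
print(top_decile['industry'].value_counts())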

2. Within-industry ranking


Here we use the adjusted Shenwan (SW) industry classification as the industry label:


In [5]:
factor2 = {'f2': CSQuantiles(factor, groups='sw1_adj')}
sample_factor2 = engine.fetch_factor(sample_date, factor2, sample_codes)
sample_factor2 = pd.merge(sample_factor2, sample_industry[['code', 'industry']], on='code')
sample_factor2.sort_values('f2', ascending=False).head(15)


Out[5]:
f2 code chgPct secShortName industry
35 1.0 415 -0.0066 渤海金控 多元金融
564 1.0 600584 -0.0068 长电科技 电子
760 1.0 601988 0.0000 中国银行 银行
666 1.0 600967 0.0025 内蒙一机 国防军工
675 1.0 601001 0.0302 大同煤业 采掘
45 1.0 528 0.0069 柳工 机械设备
615 1.0 600754 0.0165 锦江股份 休闲服务
452 1.0 600170 0.0080 上海建工 建筑装饰
27 1.0 158 -0.0202 常山北明 计算机
694 1.0 601168 0.0097 西部矿业 有色金属
670 1.0 600978 0.0043 宜华生活 轻工制造
669 1.0 600977 -0.0051 中国电影 传媒
30 1.0 338 0.0058 潍柴动力 汽车
608 1.0 600737 -0.0049 中粮糖业 农林牧渔
410 1.0 600050 -0.0030 中国联通 通信

With within-industry ranking, the industry distribution of the top-ranked stocks is much more even.
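
The same check as above, applied to the within-industry quantiles (a sketch reusing sample_factor2):

# With within-industry quantiles, the top decile draws names from many industries
top_decile2 = sample_factor2[sample_factor2['f2'] >= 0.9]
print(top_decile2['industry'].value_counts())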

3. Industry neutralization via regression


Another approach is to run a linear regression with industry dummies as regressors and use the residuals as a replacement for the factor, making it industry neutral:


In [6]:
factor3 = {'f3': factor}
sample_factor3 = engine.fetch_factor(sample_date, factor3, sample_codes)
risk_cov, risk_exp = engine.fetch_risk_model(sample_date, sample_codes)
sample_factor3 = pd.merge(sample_factor3, sample_industry[['code', 'industry']], on='code')
sample_factor3 = pd.merge(sample_factor3, risk_exp, on='code')

In [7]:
raw_factors = sample_factor3['f3'].values
industry_exp = sample_factor3[industry_styles + ['COUNTRY']].values.astype(float)
# Regress the factor on the industry dummies (plus the country factor), keep the
# residuals, then map them to cross-sectional quantiles
processed_values = factor_processing(raw_factors, pre_process=[], risk_factors=industry_exp, post_process=[percentile])
sample_factor3['f3'] = processed_values
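
For intuition, the neutralization above is, up to the final quantile mapping, just the OLS residual of the factor against the industry exposures. A minimal numpy sketch (not the alphamind implementation), reusing raw_factors and industry_exp from the cell above:

# Project the factor onto the industry dummies and keep the orthogonal residual
beta, *_ = np.linalg.lstsq(industry_exp, raw_factors, rcond=None)
residuals = raw_factors - industry_exp @ beta   # industry-neutral factor values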

In [8]:
sample_factor3 = sample_factor3[['code', 'f3', 'industry']]
sample_factor3.sort_values('f3', ascending=False).head(15)


Out[8]:
code f3 industry
760 601988 1.000000 银行
660 600919 0.998748 银行
764 601997 0.997497 银行
707 601288 0.996245 银行
329 2807 0.994994 银行
716 601398 0.993742 银行
657 600908 0.992491 银行
710 601328 0.991239 银行
333 2839 0.989987 银行
405 600036 0.988736 银行
188 2142 0.987484 银行
0 1 0.986233 银行
755 601939 0.984981 银行
645 600875 0.983730 电气设备
704 601229 0.982478 银行

We find that this method does not work particularly well here: the adjustment is small, and the concentration in the financial sector remains.

Backtest results


We evaluate the three methods with a simple equal-weight scheme that goes long the top 20% of stocks and shorts the bottom 20%:
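
As a toy illustration of the weighting rule used below (hypothetical quantile scores, not real data):

import numpy as np

q = np.array([0.9, 0.85, 0.5, 0.15, 0.1])   # hypothetical quantile scores for 5 stocks
w = (q >= 0.8) * 1.                          # long names in the top quantile bucket...
w[q <= 0.2] = -1.                            # ...short names in the bottom bucket
w /= np.abs(w).sum()                         # unit gross exposure: [0.25, 0.25, 0, -0.25, -0.25]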


In [9]:
factors = {
    'raw': CSQuantiles(factor),                           # cross-sectional quantile of the raw factor
    'peer quantile': CSQuantiles(factor, groups='sw1'),   # quantile within each SW level-1 industry
    'risk neutral': LAST(factor)                          # raw value; neutralized by regression inside the loop
}

In [10]:
df_ret = pd.DataFrame(columns=['raw', 'peer quantile', 'risk neutral'])
df_ic = pd.DataFrame(columns=['raw', 'peer quantile', 'risk neutral'])

for date in ref_dates:
    ref_date = date.strftime('%Y-%m-%d')
    codes = engine.fetch_codes(ref_date, universe)

    total_factor = engine.fetch_factor(ref_date, factors, codes)
    risk_cov, risk_exp = engine.fetch_risk_model(ref_date, codes)
    industry = engine.fetch_industry(ref_date, codes, category=category, level=level)
    rets = engine.fetch_dx_return(ref_date, codes, horizon=horizon, offset=1)
    total_factor = pd.merge(total_factor, industry[['code', 'industry']], on='code')
    total_factor = pd.merge(total_factor, risk_exp, on='code')
    total_factor = pd.merge(total_factor, rets, on='code').dropna()

    # Neutralize the raw factor against the industry dummies and map the residuals to quantiles
    raw_factors = total_factor['risk neutral'].values
    industry_exp = total_factor[industry_styles + ['COUNTRY']].values.astype(float)
    processed_values = factor_processing(raw_factors, pre_process=[], risk_factors=industry_exp, post_process=[percentile])
    total_factor['risk neutral'] = processed_values

    # Long the top 20% (+1), short the bottom 20% (-1), then scale to unit gross exposure
    total_factor[['f1_d', 'f2_d', 'f3_d']] = (total_factor[['raw', 'peer quantile', 'risk neutral']] >= 0.8) * 1.
    total_factor.loc[total_factor['raw'] <= 0.2, 'f1_d'] = -1.
    total_factor.loc[total_factor['peer quantile'] <= 0.2, 'f2_d'] = -1.
    total_factor.loc[total_factor['risk neutral'] <= 0.2, 'f3_d'] = -1.
    total_factor[['f1_d', 'f2_d', 'f3_d']] /= np.abs(total_factor[['f1_d', 'f2_d', 'f3_d']]).sum(axis=0)

    # Per-period long/short return and IC (correlation of each factor with the forward return)
    ret_values = total_factor.dx.values @ total_factor[['f1_d', 'f2_d', 'f3_d']].values
    df_ret.loc[date] = ret_values

    ic_values = total_factor[['dx', 'raw', 'peer quantile', 'risk neutral']].corr().values[0, 1:]
    df_ic.loc[date] = ic_values
    print(f"{date} is finished")
    print(f"{date} is finished")


2010-01-04 00:00:00 is finished
2010-02-01 00:00:00 is finished
...
2018-03-28 00:00:00 is finished

In [11]:
df_ret.cumsum().plot(figsize=(14, 7))


Out[11]:
[figure: cumulative long/short return of the three methods]

In [12]:
df_ic.cumsum().plot(figsize=(14, 7))


Out[12]:
[figure: cumulative IC of the three methods]
