基于Caret和RandomForest包进行随机森林分析的一般步骤（1） - 服务器托管|北京服务器租用|机房托管租用|IDC托管租用|机房机柜带宽租用-价格及费用咨询

Caret构建机器学习流程的一般步骤

Caret依赖trainControl函数设置交叉验证参数，train函数具体训练和评估模型。首先是选择一系列需要评估的参数和参数值的组合，然后设置重采样评估方式，循环训练模型评估结果、计算模型的平均性能，根据设定的度量值选择最好的模型参数组合，使用全部训练集和最优参数组合完成模型的最终训练。

基于Caret和RandomForest包进行随机森林分析的一般步骤

createDataPartition是拆分数据为训练集和测试集的函数。对于分类数据，按照每个类的大小成比例拆分；如果是回归数据，则先把响应值分为n个区间，再成比例拆分。

# 拆分数据为测试集和训练集
seed if(file.exists('rda/rf_default.rda')){
  rf_default

## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##      2  0.7707937  0.09159664
##    118  0.8914286  0.62016807
##   7069  0.9390476  0.77675070
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.

精确性随默认调参的变化

plot(rf_default)

Kappa值随默认调参的变化

# ?plot.train查看更多使用信息plot(rf_default, metric="Kappa")

#str(rf_default)

Caret比较不同算法的性能

Caret包的流程统一性在这就体现出来了，我之前没有看过ranger和parRF包，也不知道他们的具体使用。但我可以借助Caret很快地用他们构建一个初步模型，并与randomForest的结果进行比较。

# Too slow
# RRF: Regularized Random Forest
if(file.exists('rda/RRF_default.rda')){
  RRF_default

## Regularized Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 53, 53, 54, 53, 53, 54, ... 
## Resampling results across tuning parameters:
## 
##   mtry  coefReg  coefImp  Accuracy   Kappa      
##      2  0.010    0.0      0.6612698   0.06472400
##      2  0.010    0.5      0.6300000  -0.02007385
##      2  0.010    1.0      0.7317460   0.26162083
##      2  0.505    0.0      0.6984127   0.08957347
##      2  0.505    0.5      0.7711111   0.33546603
##      2  0.505    1.0      0.7436508   0.28044267
##      2  1.000    0.0      0.8269841   0.48410091
##      2  1.000    0.5      0.8450794   0.47661766
##      2  1.000    1.0      0.7195238   0.22447669
##    118  0.010    0.0      0.6668254   0.06598972
##    118  0.010    0.5      0.7785714   0.36287284
##    118  0.010    1.0      0.8485714   0.55917306
##    118  0.505    0.0      0.8036508   0.42708196
##    118  0.505    0.5      0.8422222   0.52480669
##    118  0.505    1.0      0.8388889   0.53413165
##    118  1.000    0.0      0.9061905   0.70898014
##    118  1.000    0.5      0.8550794   0.55650040
##    118  1.000    1.0      0.8101587   0.49044569
##   7069  0.010    0.0      0.7260317   0.29956225
##   7069  0.010    0.5      0.7250794   0.24852437
##   7069  0.010    1.0      0.7390476   0.28045078
##   7069  0.505    0.0      0.7982540   0.43476213
##   7069  0.505    0.5      0.7560317   0.31467344
##   7069  0.505    1.0      0.7457143   0.28034256
##   7069  1.000    0.0      0.8950794   0.65746499
##   7069  1.000    0.5      0.8600000   0.55006709
##   7069  1.000    1.0      0.7715873   0.34952193
## 
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 118, coefReg = 1 and coefImp
##  = 0.

ranger是random forest的快速版本，特别适用于高维数据，支持分类、回归和生存分析。

# ranger
# ranger is a fast implementation of random forests (Breiman 2001) or recursive partitioning, particularly suited for high dimensional data. Classification, regression, and survival forests are supported. Classification and regression forests are implemented as in the original Random Forest (Breiman 2001), survival forests as in Random Survival Forests (Ishwaran et al. 2008).
if(file.exists('rda/ranger_default.rda')){
  ranger_default

## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 52, 52, 54, 53, 53, 53, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa     
##      2  gini        0.7626984  0.05238095
##      2  extratrees  0.7504762  0.00000000
##    118  gini        0.8777778  0.58852454
##    118  extratrees  0.8441270  0.42128852
##   7069  gini        0.9434921  0.78263305
##   7069  extratrees  0.9200000  0.73072829
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 7069, splitrule = gini
##  and min.node.size = 1.

parRF是并行随机森林算法。

# parRF: Parallel Random Forest
if(file.exists('rda/parRF_default.rda')){
  parRF_default

## Parallel Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 52, 54, 54, 53, 52, 53, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa     
##      2  0.7560317  0.01904762
##    118  0.8615873  0.53526865
##   7069  0.9161905  0.72790935
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 7069.

合并三个包的模型结果

resamps

## 
## Call:
## resamples.default(x = list(RRF = RRF_default, rf = rf_default, parRF
##  = parRF_default, ranger = ranger_default))
## 
## Models: RRF, rf, parRF, ranger 
## Number of resamples: 30 
## Performance metrics: Accuracy, Kappa 
## Time estimates for: everything, final model fit

比较绘制三个包的性能差异

theme1

这个结果跟输出的矩阵是吻合的

dotplot(resamps)

统计检验评估模型之间差异是否显著

diff.value

## 
## Call:
## summary.diff.resamples(object = diff.value)
## 
## p-value adjustment: bonferroni 
## Upper diagonal: estimates of the difference
## Lower diagonal: p-value for H0: difference = 0
## 
## Accuracy 
##        RRF    rf        parRF     ranger   
## RRF           -0.032857 -0.010000 -0.037302
## rf     0.5193            0.022857 -0.004444
## parRF  1.0000 1.0000              -0.027302
## ranger 1.0000 1.0000    1.0000             
## 
## Kappa 
##        RRF rf        parRF     ranger   
## RRF        -0.067771 -0.018929 -0.073653
## rf     1              0.048841 -0.005882
## parRF  1   1                   -0.054724
## ranger 1   1         1

Caret训练最终模型

if(file.exists('rda/rf_finaltest.rda')){
  rf_final

## Random Forest 
## 
##   59 samples
## 7070 predictors
##    2 classes: 'DLBCL', 'FL' 
## 
## No pre-processing
## Resampling: None

基于模型对测试集进行预测

# ?predict.train 
# type: raw (返回预测的类或回归的数值) or prob （返回分类到每个类的概率） 
predict(rf_final, newdata=head(test_data))

## [1] DLBCL DLBCL DLBCL DLBCL DLBCL DLBCL
## Levels: DLBCL FL

虽然都能被分类到DLBCL，但分类概率是不同的，如DLBCL_16就在边缘徘徊。

# ?predict.train 
# type: raw (返回预测的类或回归的数值) or prob （返回分类到每个类的概率） 
predict(rf_final, newdata=head(test_data), type="prob")

##          DLBCL    FL
## DLBCL_2  0.960 0.040
## DLBCL_5  0.934 0.066
## DLBCL_11 0.742 0.258
## DLBCL_13 0.892 0.108
## DLBCL_16 0.866 0.134
## DLBCL_17 0.838 0.162

获得模型结果评估矩阵(confusion matrix)

predictions

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction DLBCL FL
##      DLBCL    14  2
##      FL        0  2
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.6529, 0.9862)
##     No Information Rate : 0.7778          
##     P-Value [Acc > NIR] : 0.2022          
##                                           
##                   Kappa : 0.6087          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.8750          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7778          
##          Detection Rate : 0.7778          
##    Detection Prevalence : 0.8889          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : DLBCL           
##

绘制ROC曲线

prediction_prob

## 
## Call:
## roc.default(response = test_data_group, predictor = prediction_prob[,     1])
## 
## Data: prediction_prob[, 1] in 14 controls (test_data_group DLBCL) > 4 cases (test_data_group FL).
## Area under the curve: 0.9821

ROC_data

机器学习系列教程

从随机森林开始，一步步理解决策树、随机森林、ROC/AUC、数据集、交叉验证的概念和实践。

文字能说清的用文字、图片能展示的用、描述不清的用公式、公式还不清楚的写个简单代码，一步步理清各个环节和概念。

再到成熟代码应用、模型调参、模型比较、模型评估，学习整个机器学习需要用到的知识和技能。

服务器托管，北京服务器托管，服务器租用 http://www.fwqtg.net
机房租用，北京机房租用，IDC机房托管， http://www.e1idc.net

Caret构建机器学习流程的一般步骤

基于Caret和RandomForest包进行随机森林分析的一般步骤

Caret比较不同算法的性能

Caret训练最终模型

基于模型对测试集进行预测

服务器托管，北京服务器托管，服务器租用，机房机柜带宽租用