- the minimum of that function. There are several ways to find a function's minimum; Spark uses stochastic gradient descent (SGD). A one-line sketch of the update follows.
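As a sketch of the idea (notation mine, not quoted from the Spark docs): each SGD step moves the weight vector against the gradient of the loss evaluated on a sampled training example,

$$
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha\,\nabla_{\mathbf{w}} L\big(\mathbf{w}^{(t)};\ \mathbf{x}_i, y_i\big),
$$

where $\alpha$ is the step size and $(\mathbf{x}_i, y_i)$ is the sampled example.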
On regularization
Linear regression can also apply regularization; its main purpose is to prevent overfitting.
With L1 regularization the model becomes Lasso Regression; with L2 regularization it becomes Ridge Regression; plain linear regression applies no regularization. Generally speaking, regularization is recommended when training a model, especially when the training data is scarce: without it, overfitting can be severe. L2 regularization converges more easily than L1 (fewer iterations), but L1 can cope with the case where the number of training samples is smaller than the number of features (that is, a linear system in n unknowns with fewer than n equations, which has multiple or infinitely many solutions). Both objectives are sketched just below.
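For concreteness (standard formulations in my own notation, not quoted from the MLlib docs):

$$
\text{Ridge:}\quad \min_{\mathbf{w}}\ \frac{1}{n}\sum_{i=1}^{n}\big(\mathbf{w}^{T}\mathbf{x}_i - y_i\big)^2 + \lambda\,\tfrac{1}{2}\lVert\mathbf{w}\rVert_2^2,
\qquad
\text{Lasso:}\quad \min_{\mathbf{w}}\ \frac{1}{n}\sum_{i=1}^{n}\big(\mathbf{w}^{T}\mathbf{x}_i - y_i\big)^2 + \lambda\,\lVert\mathbf{w}\rVert_1,
$$

where $\lambda$ controls the regularization strength; $\lambda = 0$ recovers plain linear regression.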
MLlib provides L1, L2, and no regularization:
| | regularizer $R(\mathbf{w})$ | gradient or sub-gradient |
|---|---|---|
| zero (unregularized) | $0$ | $\mathbf{0}$ |
| L2 | $\tfrac{1}{2}\lVert\mathbf{w}\rVert_2^2$ | $\mathbf{w}$ |
| L1 | $\lVert\mathbf{w}\rVert_1$ | $\mathrm{sign}(\mathbf{w})$ |
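To connect the table to training (conceptually; notation mine): the (sub)gradient column is what the optimizer adds, scaled by $\lambda$, to the loss gradient at each step,

$$
\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \alpha\Big(\nabla_{\mathbf{w}} L\big(\mathbf{w}^{(t)}\big) + \lambda\,\partial R\big(\mathbf{w}^{(t)}\big)\Big).
$$

L2 thus contributes $\lambda\mathbf{w}$ (plain weight decay), while L1 contributes $\lambda\,\mathrm{sign}(\mathbf{w})$, a constant-magnitude push toward zero (MLlib's L1 updater actually implements this as a soft-thresholding step, but the effect is the same: small weights are driven exactly to zero).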
Implementing linear regression in Spark
Test data
```
-0.4307829,-1.63735562648104 -2.00621178480549 -1.86242597251066 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
-0.1625189,-1.98898046126935 -0.722008756122123 -0.787896192088153 -1.02470580167082 -0.522940888712441 -0.863171185425945 -1.04215728919298 -0.864466507337306
-0.1625189,-1.57881887548545 -2.1887840293994 1.36116336875686 -1.02470580167082 -0.522940888712441 -0.863171185425945 0.342627053981254 -0.155348103855541
...
```
Attachment download: lpsa
Data format: the value before the comma is the label; after it come 8 feature values separated by spaces.
Code implementation
```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.GeneralizedLinearModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LassoModel;
import org.apache.spark.mllib.regression.LassoWithSGD;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.regression.RidgeRegressionModel;
import org.apache.spark.mllib.regression.RidgeRegressionWithSGD;
import scala.Tuple2;

public class Regression {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setAppName("Regression")
                .setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);

        // parse each line into a LabeledPoint: "label,f1 f2 ... f8"
        JavaRDD<String> data = sc.textFile("/home/yurnom/lpsa.txt");
        JavaRDD<LabeledPoint> parsedData = data.map(line -> {
            String[] parts = line.split(",");
            double[] ds = Arrays.stream(parts[1].split(" "))
                    .mapToDouble(Double::parseDouble)
                    .toArray();
            return new LabeledPoint(Double.parseDouble(parts[0]), Vectors.dense(ds));
        }).cache();

        int numIterations = 100; // number of SGD iterations
        LinearRegressionModel model = LinearRegressionWithSGD.train(parsedData.rdd(), numIterations);
        RidgeRegressionModel model1 = RidgeRegressionWithSGD.train(parsedData.rdd(), numIterations);
        LassoModel model2 = LassoWithSGD.train(parsedData.rdd(), numIterations);

        print(parsedData, model);
        print(parsedData, model1);
        print(parsedData, model2);

        // predict a new data point with each model
        double[] d = new double[]{1.0, 1.0, 2.0, 1.0, 3.0, -1.0, 1.0, -2.0};
        Vector v = Vectors.dense(d);
        System.out.println("Prediction of linear: " + model.predict(v));
        System.out.println("Prediction of ridge: " + model1.predict(v));
        System.out.println("Prediction of lasso: " + model2.predict(v));

        sc.stop();
    }

    public static void print(JavaRDD<LabeledPoint> parsedData, GeneralizedLinearModel model) {
        JavaPairRDD<Double, Double> valuesAndPreds = parsedData.mapToPair(point -> {
            double prediction = model.predict(point.features()); // predict on the training data itself
            return new Tuple2<>(point.label(), prediction);
        });
        // mean of the squared differences between predictions and actual labels
        Double MSE = valuesAndPreds
                .mapToDouble((Tuple2<Double, Double> t) -> Math.pow(t._1() - t._2(), 2))
                .mean();
        System.out.println(model.getClass().getSimpleName()
                + " training Mean Squared Error = " + MSE);
    }
}
```
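The train(...) calls above rely on MLlib's default step size and regularization strength. Both can be passed explicitly via the overloads shown below; the concrete values here are illustrative placeholders, not tuned for this dataset:

```java
double stepSize = 0.1;   // SGD learning rate (illustrative, not tuned)
double regParam = 0.01;  // regularization strength lambda (illustrative, not tuned)

LinearRegressionModel m  = LinearRegressionWithSGD.train(parsedData.rdd(), numIterations, stepSize);
RidgeRegressionModel  m1 = RidgeRegressionWithSGD.train(parsedData.rdd(), numIterations, stepSize, regParam);
LassoModel            m2 = LassoWithSGD.train(parsedData.rdd(), numIterations, stepSize, regParam);
```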
Run results
```
LinearRegressionModel training Mean Squared Error = 6.206807793307759
RidgeRegressionModel training Mean Squared Error = 6.416002077543526
LassoModel training Mean Squared Error = 6.972349839013683
Prediction of linear: 0.805390219777772
Prediction of ridge: 1.0907608111865237
Prediction of lasso: 0.18652645118913225
```
Because regularization is applied, ridge and lasso show a somewhat larger training error than linear. During testing, when the number of iterations is reduced to 25, the output becomes:
```
LinearRegressionModel training Mean Squared Error = 50.57566692735476
RidgeRegressionModel training Mean Squared Error = 1.664723124099061E7
LassoModel training Mean Squared Error = 6.972196762562953
```
At 25 iterations linear has not yet converged to its final result, ridge has gone badly wrong (its training error has exploded, which looks more like divergence of the SGD updates than overfitting), while lasso has already converged to essentially its final value. As for why exactly this happens, I am not sure myself; I hope to write a follow-up article on the underlying theory.
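To reproduce this comparison, a small loop over the iteration count can replace the fixed numIterations in main; a sketch reusing parsedData and the print helper from the listing above:

```java
// Train all three models at several iteration counts and print training MSE for each.
for (int iters : new int[]{25, 50, 100}) {
    System.out.println("numIterations = " + iters);
    print(parsedData, LinearRegressionWithSGD.train(parsedData.rdd(), iters));
    print(parsedData, RidgeRegressionWithSGD.train(parsedData.rdd(), iters));
    print(parsedData, LassoWithSGD.train(parsedData.rdd(), iters));
}
```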