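spark.ml is built on DataFrames. It is the package Spark officially recommends today and the one that will keep receiving updates.
spark.mllib is built on RDDs. This original RDD-based API is in maintenance mode: it gets no new features, and at the time of the announcement quoted below it was expected to be removed in Spark 3.0.
From the official Spark documentation: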
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode.
The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
What are the implications?
- MLlib will still support the RDD-based API in spark.mllib with bug fixes.
- MLlib will not add new features to the RDD-based API.
- In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
- After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
- The RDD-based API is expected to be removed in Spark 3.0.