图计算
创建GraphFrame需要传入两个dataframe,其中一个包含顶点数据,另一个包含边数据,有以下要求。
(1)顶点dataframe需要有一个"id"列,可以是非数字类型的,这是GraphFrame相对于GraphX较大的优势。
(2)边dataframe需要包含列名为"src"和"dst"两列,与顶点的"id"列映射。
程序基本框架:
import org.apache.spark.sql.SparkSession import org.graphframes.GraphFrame object GraphFrameTest { def main(args: Array[String]) { val spark = SparkSession .builder() .appName("GraphFrameTest") .master("local[4]") .getOrCreate() val sqlContext = spark.sqlContext // 创建顶点DataFrame val v = sqlContext.createDataFrame(List( ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30), ("d", "David", 29), ("e", "Esther", 32), ("f", "Fanny", 36), ("g", "Gabby", 60) )).toDF("id", "name", "age") // 创建边DataFrame val e = sqlContext.createDataFrame(List( ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"), ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"), ("d", "a", "friend"), ("a", "e", "friend") )).toDF("src", "dst", "relationship") // 创建 GraphFrame 图对象 val g = GraphFrame(v, e) //TODO spark.stop() }
GraphFrame包含了顶点dataframe和边dataframe的引用,从以下GraphFrame构造函数可以看出:
class GraphFrame private( @transient private val _vertices: DataFrame, @transient private val _edges: DataFrame) extends Logging with Serializable
GraphFrame很多基本操作均是基于顶点和边的dataframe操作的,如GraphFrame的persist方法:
def persist(): this.type = { vertices.persist() edges.persist() this }