图计算
作者: 时海
创建图

创建GraphFrame需要传入两个dataframe,其中一个包含顶点数据,另一个包含边数据,有以下要求。

(1)顶点dataframe需要有一个"id"列,可以是非数字类型的,这是GraphFrame相对于GraphX较大的优势。

(2)边dataframe需要包含列名为"src"和"dst"两列,与顶点的"id"列映射。

程序基本框架:

import org.apache.spark.sql.SparkSession
import org.graphframes.GraphFrame
object GraphFrameTest {
  def main(args: Array[String]) {

    val spark = SparkSession
      .builder()
      .appName("GraphFrameTest")
      .master("local[4]")
      .getOrCreate()

    val sqlContext = spark.sqlContext

    // 创建顶点DataFrame
    val v = sqlContext.createDataFrame(List(
      ("a", "Alice", 34),
      ("b", "Bob", 36),
      ("c", "Charlie", 30),
      ("d", "David", 29),
      ("e", "Esther", 32),
      ("f", "Fanny", 36),
      ("g", "Gabby", 60)
    )).toDF("id", "name", "age")
    // 创建边DataFrame
    val e = sqlContext.createDataFrame(List(
      ("a", "b", "friend"),
      ("b", "c", "follow"),
      ("c", "b", "follow"),
      ("f", "c", "follow"),
      ("e", "f", "follow"),
      ("e", "d", "friend"),
      ("d", "a", "friend"),
      ("a", "e", "friend")
    )).toDF("src", "dst", "relationship")


    // 创建 GraphFrame 图对象
    val g = GraphFrame(v, e)

    //TODO

    spark.stop()

  }

GraphFrame包含了顶点dataframe和边dataframe的引用,从以下GraphFrame构造函数可以看出:

class GraphFrame private(
    @transient private val _vertices: DataFrame,
    @transient private val _edges: DataFrame) extends Logging with Serializable 

GraphFrame很多基本操作均是基于顶点和边的dataframe操作的,如GraphFrame的persist方法:

  def persist(): this.type = {
    vertices.persist()
    edges.persist()
    this
  }





标签: graphframe、dataframe、val、顶点、sqlcontext
一个创业中的苦逼程序员
  • 回复
隐藏