BD8 Spark


Spark

It is for full-DAG query processing. Its first-class citizen: RDD.

8 nodes, 16 cores per node and 128 GB of memory per node.

RDD

HDFS, S3 …creation ->RDD, RDD -Transformation->RDD, RDD -action->on screen

Features

  • lineage graph
  • lazy evaluation: each action triggers an evaluation

Transformation

  • on single RDD: filter, map, flatMap, distinct, sample (fraction+seed)
  • on two or more RDDs: union, intersection, subtract, cartesian product
  • on key-value pair RDDs: reduce by key, group by key, map values, keys, values, join, subtract by key

    Actions

  • collect, count, count by value, take, top, take ordered, take sample, reduce, fold, aggregate, for each
  • on pair RDDs: count by key, lookup

Physical layer

parallel execution, optimization: avoid expensive network communication, stage.

Stage

it is a sequence of parallelizable tasks performed on a single machine.

Dependency

  • narrow dependency: stays on the same machine.
  • wide dependency: cannot have a single stage, need to shuffle around. Need the network.

General DAG

partition the DAG into sub-graphs, which can be computed without the network communication.

Performance tuning

  • persisting RDDs
  • avoiding a wide dependency: pre-partition according to the keys.

DataFrames

easy to do mutual transformation between RDD and DataFrame.

Abbreviations

DAG: directed acyclic graphs
RDD: resilient distributed dataset


Author: Fululu
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Fululu !
  TOC