Fululu

Since 2024-09-02, this blog has been stopped to be updated. For newer contents, please go to fuguirong.de.

BD8 Spark

Learning Notes

Publish Date: 2019-01-25

Spark

It is for full-DAG query processing. Its first-class citizen: RDD.

8 nodes, 16 cores per node and 128 GB of memory per node.

RDD

HDFS, S3 …creation ->RDD, RDD -Transformation->RDD, RDD -action->on screen

Features

lineage graph
lazy evaluation: each action triggers an evaluation

Transformation

on single RDD: filter, map, flatMap, distinct, sample (fraction+seed)
on two or more RDDs: union, intersection, subtract, cartesian product
on key-value pair RDDs: reduce by key, group by key, map values, keys, values, join, subtract by key
Actions
collect, count, count by value, take, top, take ordered, take sample, reduce, fold, aggregate, for each
on pair RDDs: count by key, lookup

Physical layer

parallel execution, optimization: avoid expensive network communication, stage.

Stage

it is a sequence of parallelizable tasks performed on a single machine.

Dependency

narrow dependency: stays on the same machine.
wide dependency: cannot have a single stage, need to shuffle around. Need the network.

General DAG

partition the DAG into sub-graphs, which can be computed without the network communication.

Performance tuning

persisting RDDs
avoiding a wide dependency: pre-partition according to the keys.

DataFrames

easy to do mutual transformation between RDD and DataFrame.

Abbreviations

DAG: directed acyclic graphs
RDD: resilient distributed dataset

Fululu

https://fuguigui.github.io

All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Fululu !

System Big Data

BD6 Map reduce

2019-01-25 Learning Notes

System Big Data

BD9 Performance at large scale

2019-01-25 Learning Notes

System Big Data

BD8 Spark

Spark

RDD

Features

Transformation

Actions

Physical layer

Stage

Dependency

General DAG

Performance tuning

DataFrames

Abbreviations

你的赏识是我前进的动力