Use Cases
- Billions of files, TB scale in total: each file is relatively small, but the number of files is large.
- Object Storage (technology)
- Key-Value Model (model)
- Millions of files, PB scale in total: each file is large, but the number of files is relatively small.
- Block (file) storage (technology)
- File system (model)
GoogleFS (GFS)
It is the seminal large-scale DFS. The paper identifies the characteristics and requirements a DFS design has to address, including: fault tolerance, the file update model, and performance requirements.
Design a DFS
Fault tolerance
Even though we know nodes will fail, the system should continue to work. What does it take to realize fault tolerance?
- Monitoring
- Error Detection
- Automatic Recovery
File update model
There are two typical models: random access and upsert/append-only.
For GB-size files, append-only is chosen: written data is immutable, which is less flexible but simpler to make reliable. A tiny contrast between the two models in ordinary local file I/O follows below.
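A minimal sketch of the two update models using plain Python file I/O (the file name and record contents are made up):

```python
# Append-only model: new records only ever go at the end; existing
# bytes are never rewritten.
with open("events.log", "ab") as log:      # "ab" = append, binary
    log.write(b"user=42 action=click\n")

# Random access, which GFS deliberately avoids for data files:
# with open("events.log", "r+b") as f:
#     f.seek(8)            # jump into the middle of the file
#     f.write(b"patched")  # overwrite existing bytes in place
```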
Performance requirements
Top priority: throughput; secondary priority: latency.
The Model of DFS
Logical Layer
You can choose a key-value model or a file system.
Physical Storage
Object Storage or Block Storage. The latter is what the hard drive actually does: files are not stored contiguously but in blocks, and are accessed as blocks. The hard drive reads whole blocks, not single bits.
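A minimal sketch of block-level access: reading a file block by block, the way the drive itself serves data. `BLOCK_SIZE` and the file name are illustrative.

```python
BLOCK_SIZE = 4096  # 4 KB, a typical local file system block size

with open("demo.bin", "wb") as f:          # create a small demo file
    f.write(b"x" * 10_000)                 # spans three 4 KB blocks

def read_in_blocks(path):
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE): # one block per read call
            yield block

print([len(b) for b in read_in_blocks("demo.bin")])  # [4096, 4096, 1808]
```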
HDFS
The H stands for Hadoop. Initiated in 2006, inspired by GFS (2003), MapReduce (2004), and BigTable (2006).
Its primary features: 1. a DFS (HDFS), 2. MapReduce, 3. a wide-column store (HBase).
Blocks
The reasons to choose blocks:
- Files can be bigger than any single disk
- A simpler level of abstraction
Size
- Block in a local file system: 4 KB
- Block in a relational database: 4 KB-32 KB
- Block in a DFS: 64 MB-128 MB
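A back-of-the-envelope sketch of why DFS blocks are so much larger: the NameNode keeps one in-memory entry per block, so the block size directly controls how much metadata it must hold. The numbers are illustrative.

```python
file_size = 10**12                                  # a 1 TB file

for block_size in (4 * 1024, 128 * 1024 * 1024):    # 4 KB vs 128 MB
    n_blocks = -(-file_size // block_size)          # ceiling division
    print(f"{block_size:>11} B blocks -> {n_blocks:>11,} block entries")

# 4 KB blocks: ~244 million entries for one file; 128 MB blocks: ~7,500.
# Large blocks also amortize seek time over long sequential reads.
```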
Architecture
Master-Slave model.
- Master: Namenode
- Slave: Datanode
From a file's point of view: the file is split into 128 MB chunks, and each chunk is replicated several times. The chunks are stored on datanodes, and the namenode knows where each part of a file is stored. When clients want to read the file, they ask the namenode; the namenode returns the chunks' locations on the datanodes, and the clients then read directly from the corresponding datanodes.
NameNode
Functions: all system-wide activity (namespace and block management)
- In memory: 1. the file system hierarchy, 2. the file-to-block mapping, 3. the block locations
- In persistent storage: the namespace file (FsImage) + the edit log (block locations are not persisted; they are rebuilt from DataNode block reports)
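A toy sketch (not HDFS code) of the NameNode's in-memory state; all paths, IDs, and node names are made up:

```python
namespace = {"/": ["data"], "/data": ["logs.txt"]}   # directory tree
file_to_blocks = {"/data/logs.txt": [1001, 1002]}    # file -> BlockIDs
block_locations = {1001: ["dn1", "dn3", "dn7"],      # BlockID -> DataNodes
                   1002: ["dn2", "dn3", "dn9"]}      # (memory only)

def get_block_locations(path):
    """Serve a client's metadata request: block IDs plus their datanodes."""
    return [(b, block_locations[b]) for b in file_to_blocks[path]]

print(get_block_locations("/data/logs.txt"))
```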
DataNode
Blocks are stored on the datanodes' local disks. Each datanode knows its own hardware, so it can deal with disk failures itself. Blocks are identified by a 64-bit BlockID.
Communication
RPC
Client Protocol
The client sends metadata operations to a namenode, and the namenode returns DataNode locations and Block IDs.
DataNode Protocol
The DataNode (which always initiates the connection) registers with a NameNode. Every 3 s it sends a heartbeat to the namenode; every 6 h it sends a Block Report; it also sends a "BlockReceived" message whenever it stores a new block.
The NameNode sends block operations to the DataNode as replies to heartbeats.
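A sketch of the DataNode side of this protocol. The `namenode_rpc` stub and its methods (`register`, `heartbeat`, `block_report`) are hypothetical stand-ins, not the real Hadoop classes.

```python
import time

HEARTBEAT_INTERVAL = 3             # seconds
BLOCK_REPORT_INTERVAL = 6 * 3600   # seconds

def datanode_loop(namenode_rpc, my_blocks):
    namenode_rpc.register()                  # the DataNode always initiates
    last_report = 0.0
    while True:
        # The heartbeat reply piggybacks block operations (replicate,
        # delete, ...) -- the NameNode never opens its own connection.
        for command in namenode_rpc.heartbeat():
            command.execute()
        if time.time() - last_report >= BLOCK_REPORT_INTERVAL:
            namenode_rpc.block_report(list(my_blocks))
            last_report = time.time()
        time.sleep(HEARTBEAT_INTERVAL)
```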
DataTransfer Protocol
Between the client and the DataNodes. They transfer data blocks with each other as streams.
When a client writes data, it writes to a single DataNode, and that DataNode replicates the blocks to the other DataNodes through a pipeline (see the sketch after the write sequence below).
Metadata
Physical level
For each block there are two files on disk: a metadata file and the data file.
The metadata includes a checksum, a generation stamp, and so on.
- The checksum is calculated when the block is written, and is used to detect corruption (caused by disk errors, network faults, buggy software, and so on) when the block is read back.
- The generation stamp prevents reading stale replicas.
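A sketch of per-block integrity checking using CRC32 (real HDFS checksums small chunks inside each block; this is simplified, and the dict layout is illustrative):

```python
import zlib

def store_block(data):
    """Write path: compute the checksum and keep it in the metadata file."""
    return {"data": data, "checksum": zlib.crc32(data), "gen_stamp": 1}

def read_block(block):
    """Read path: recompute and compare before handing data back."""
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("checksum mismatch: this replica is corrupt")
    return block["data"]

blk = store_block(b"hello hdfs")
assert read_block(blk) == b"hello hdfs"
blk["data"] = b"hellO hdfs"        # simulate a flipped bit on disk
try:
    read_block(blk)
except IOError as e:
    print(e)                       # corruption is caught on read
```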
Functionality
- create/delete directory
- write/append to/read/delete file
Read a file
- client -> namenode: asks for a file
- namenode -> client: the block locations, with multiple datanodes per block, sorted by distance to the client.
- datanodes -> client: the blocks form one continuous input stream for the client to read.
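An end-to-end sketch of this read path (hypothetical stubs, not the HDFS client API): one metadata round trip, then data flows directly from the datanodes.

```python
def read_file(path, namenode, datanodes):
    # 1. client -> namenode: ask for the file's block locations
    blocks = namenode.get_block_locations(path)  # [(block_id, [dn, ...]), ...]
    data = []
    for block_id, locations in blocks:
        nearest = locations[0]                   # already sorted by distance
        # 2. client <-> datanode: stream the block (DataTransfer Protocol)
        data.append(datanodes[nearest].read_block(block_id))
    return b"".join(data)                        # one continuous input stream
```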
Write a file
- client -> namenode: create command.
- A loop, repeated per block:
- namenode -> client: the datanodes for the first block.
- client -> datanodes: the client builds a pipeline for streaming data to the DataNodes (sketched after this sequence).
- client -> datanode, datanode -> datanode: the client's DataStreamer sends a packet to DN1, DN1 streams it to DN2 the same way, and so on. If one node fails, the pipeline is rebuilt from the remaining nodes.
- datanodes -> client: Ack messages.
- Repeat from the top of the loop: namenode -> client: the datanodes for the second, third, ..., nth block.
- client -> NameNode: close/release the lock.
- DataNodes -> NN: via the DataNode Protocol; the NN checks for minimal replication.
- NN -> client: Ack.
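A runnable sketch of the write pipeline (illustrative classes, not the real DataStreamer): the client sends each packet once, to DN1 only; the datanodes forward downstream.

```python
class DataNodeStub:
    def __init__(self, name):
        self.name, self.stored = name, []

    def receive(self, packet, downstream):
        self.stored.append(packet)               # write locally
        if downstream:                           # then stream onward
            downstream[0].receive(packet, downstream[1:])

pipeline = [DataNodeStub("dn1"), DataNodeStub("dn2"), DataNodeStub("dn3")]
for packet in [b"packet-0", b"packet-1"]:
    pipeline[0].receive(packet, pipeline[1:])    # client talks to DN1 only

print([dn.stored for dn in pipeline])            # all three hold both packets
# On failure, the client rebuilds the pipeline from the surviving nodes;
# the NameNode restores the replication factor later.
```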
Replicas
The number of replicas: 3 by default.
Considerations: reliability, read/write bandwidth, block distribution.
Distance: three levels (node, rack, cluster), with a cost of 1 per hop between levels. E.g., from a node to another node on the same/different rack: 2/4.
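A sketch of this distance metric: each hop between adjacent levels costs 1 in each direction.

```python
def distance(a, b):
    """a, b: (cluster, rack, node) triples."""
    if a == b:
        return 0    # same node
    if a[:2] == b[:2]:
        return 2    # same rack: node -> rack -> node
    if a[0] == b[0]:
        return 4    # same cluster: node -> rack -> cluster -> rack -> node
    return 6        # different clusters

print(distance(("c1", "r1", "n1"), ("c1", "r1", "n2")))  # 2
print(distance(("c1", "r1", "n1"), ("c1", "r2", "n3")))  # 4
```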
Placement
- Replica 1: the same node as the client (or a random node), in rack A
- Replica 2: a node in a different rack, rack B
- Replica 3: a different node in the same rack as replica 2, rack B
- Replica 4 and beyond: random, but if possible at most one replica per node and at most two replicas per rack.
Placing replicas on different racks spreads the blocks more evenly (sketched below).
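A sketch of the default placement for three replicas (function and data layout are illustrative; replicas 4+ would be random under the "at most one per node, two per rack" rule):

```python
import random

def place_replicas(client_rack, client_node, nodes_by_rack):
    """nodes_by_rack: {rack: [node, ...]} -> [(rack, node), ...]"""
    other_racks = [r for r in nodes_by_rack if r != client_rack]
    rack_b = random.choice(other_racks)
    node2 = random.choice(nodes_by_rack[rack_b])                 # replica 2
    node3 = random.choice([n for n in nodes_by_rack[rack_b]
                           if n != node2])                       # replica 3
    return [(client_rack, client_node), (rack_b, node2), (rack_b, node3)]

racks = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2", "b3"]}
print(place_replicas("rackA", "a1", racks))
```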
Performance and Availability
The NameNode is a single node; when it fails, HDFS recovers by running Startup again.
Startup
It usually takes about 30 minutes.
- The NN reads its persistent storage and rebuilds the file system in memory from the namespace file and the edit log.
- The DNs report to the NN to recover the block locations.
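A sketch of step 1: load the last namespace file (FsImage) and replay the edit log over it. Data and operation names are illustrative.

```python
fsimage = {"/data/logs.txt": [1001]}                 # last checkpoint
edit_log = [("create", "/data/new.txt", None),       # (op, path, block)
            ("add_block", "/data/new.txt", 1002),
            ("add_block", "/data/logs.txt", 1003)]

for op, path, block in edit_log:                     # replay in order
    blocks = fsimage.setdefault(path, [])
    if op == "add_block":
        blocks.append(block)

print(fsimage)  # {'/data/logs.txt': [1001, 1003], '/data/new.txt': [1002]}
# Block locations are NOT in the image: they return only as the DNs
# report in (step 2), which dominates the long startup time.
```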
Other strategies
- Checkpoints with a secondary NN: the SNN merges the old namespace file and the edit log into a new namespace file, which reduces the time spent in step 1.
- High Availability (HA): backup NameNodes maintain the mappings and locations in memory just like the active namenode, ready to take over at any time.
- Federated DFS: different sub-directories are managed by different NNs, more like a regular file system.
GFS vs HDFS
| HDFS | GFS |
|---|---|
| NameNode | Master |
| DataNode | ChunkServer |
| Block | Chunk |
| FsImage | Checkpoint image |
| Edit log | Operation log |
Block size: Cloudera HDFS: 128 MB, GFS/Apache HDFS: 64 MB.
Abbreviations
DFS: Distributed File System
Further Reading
Hadoop: The Definitive Guide, 4th ed., Chapter 3