Learning Notes

Publish Date: 2019-01-22

Database prehistory:

Speaking/singing (expressing information)
Writing (recording information)
Accounting (processing information)
Printing (broadcasting on a large scale)

Database history: 1960s: file systems; 1970s: relational era; 1980s: object era; 2000s: NoSQL era

3Vs:

volume, variety, velocity

Volume

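Prefixes: K, M, G, T, P, E, Z, Y.
Decimal (SI): kilo (10^3), mega (10^6), giga (10^9), tera (10^12), peta (10^15), exa (10^18), zetta (10^21), yotta (10^24).
Binary (IEC): kibi (2^10), mebi (2^20), gibi (2^30), tebi (2^40), pebi (2^50), exbi (2^60), zebi (2^70), yobi (2^80).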
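To make the two prefix families concrete, here is a small Python sketch that computes the factor behind each prefix and shows how far the binary value drifts from the decimal one. The function names are my own, chosen for illustration.

```python
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
BINARY_PREFIXES = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]


def si_factor(prefix: str) -> int:
    """Decimal prefixes grow by a factor of 10^3 per step (kilo = 10^3)."""
    return 10 ** (3 * (SI_PREFIXES.index(prefix) + 1))


def binary_factor(prefix: str) -> int:
    """Binary prefixes grow by a factor of 2^10 per step (kibi = 2^10)."""
    return 2 ** (10 * (BINARY_PREFIXES.index(prefix) + 1))


for k, (si, binary) in enumerate(zip(SI_PREFIXES, BINARY_PREFIXES), start=1):
    # Relative gap between the binary and the decimal unit of the same rank.
    drift = binary_factor(binary) / si_factor(si) - 1
    print(f"{si:>5} = 10^{3 * k:<2}  {binary:>4} = 2^{10 * k:<2}  binary is {drift:.1%} larger")
```

The gap widens with every step, which is why the distinction matters more for terabytes and petabytes than for kilobytes.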

Variety

Data shapes: table, tree, graph, cube, text

Velocity

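The paramount factors for velocity: capacity, throughput, latency. Rule: logarithmic.
From capacity to throughput: parallelize (capacity has grown faster than throughput, so read from many machines at once).
From throughput to latency: batch processing (throughput has grown faster than latency has improved, so amortize each round trip over many records).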
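The two remedies above can be checked with back-of-the-envelope arithmetic. The Python sketch below is only an illustration: the function names and every number in it are assumptions picked to keep the arithmetic simple, not measurements from the course.

```python
def full_scan_seconds(capacity_bytes, throughput_bytes_per_s, machines=1):
    """Time to read everything once, with the data spread evenly over `machines`."""
    return capacity_bytes / (throughput_bytes_per_s * machines)


def seconds_per_record(latency_s, transfer_s_per_record, batch_size=1):
    """Average cost of one record when `batch_size` records share one round trip."""
    return latency_s / batch_size + transfer_s_per_record


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
CAPACITY = 10e12      # 10 TB of data
THROUGHPUT = 200e6    # 200 MB/s per machine
LATENCY = 0.01        # 10 ms per request
TRANSFER = 0.00001    # 10 microseconds to transfer one record

print(full_scan_seconds(CAPACITY, THROUGHPUT))        # one machine
print(full_scan_seconds(CAPACITY, THROUGHPUT, 100))   # 100 machines in parallel
print(seconds_per_record(LATENCY, TRANSFER))          # one record per request
print(seconds_per_record(LATENCY, TRANSFER, 1000))    # 1000 records per request
```

Under these assumed numbers, a full scan drops from 50,000 seconds to 500 seconds with 100 machines, and batching shrinks the share of latency paid by each record by a factor of 1000.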

Teacher’s definition of Big Data: technologies to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.

Course Overview

Data in the large

Key-value stores (S3)
Distributed file systems (HDFS)
Distributed query processing (MapReduce, Spark)
Resource management (YARN)
Column stores (HBase)

Data in the small

Document stores (MongoDB)
Syntax (XML, JSON)
Data models, schemas, querying

Data in the very small

Data warehouses (OLAP, ROLAP, XBRL)
Graph databases (RDF)

Learn from the past: data independence. The logical data model is independent of the physical storage.
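A toy Python sketch of data independence: the same logical table is stored in two physical layouts (row-wise and column-wise), and the same query runs unchanged against both. The class names, the `scan` method, and the sample data are all hypothetical, made up for this illustration.

```python
class RowStore:
    """Physical layout 1: one dict per record."""

    def __init__(self, records):
        self._rows = [dict(r) for r in records]

    def scan(self):
        yield from self._rows


class ColumnStore:
    """Physical layout 2: one list per column (assumes a non-empty table)."""

    def __init__(self, records):
        records = list(records)
        self._columns = {name: [r[name] for r in records] for name in records[0]}

    def scan(self):
        names = list(self._columns)
        # Stitch the columns back together into logical records.
        for values in zip(*self._columns.values()):
            yield dict(zip(names, values))


def total_quantity(store, product):
    """The 'query': it only sees logical records, never the layout."""
    return sum(r["quantity"] for r in store.scan() if r["product"] == product)


# Made-up sample data.
orders = [
    {"product": "pen", "quantity": 3},
    {"product": "ink", "quantity": 1},
    {"product": "pen", "quantity": 4},
]

print(total_quantity(RowStore(orders), "pen"))
print(total_quantity(ColumnStore(orders), "pen"))
```

Both calls return the same total, because the query is written against the logical model only.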

Data model: 1. what the data looks like; 2. what you can do with it.

Overall architecture: language/model/compute/storage.

The stack

from bottom to the top

Storage

local file system, NFS, GFS, HDFS, S3, Azure Blob Storage

Encoding

ASCII, ISO-8859-1, UTF-8, BSON (Binary JSON)
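The text encodings listed above differ in which characters they can represent and how many bytes each one takes. A minimal Python sketch, using a sample string of my own choosing:

```python
def encode_or_none(text: str, encoding: str):
    """Return the encoded bytes, or None if the encoding cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


sample = "caf\u00e9"  # "café", with the accented letter as a single code point

for name in ("ascii", "iso-8859-1", "utf-8"):
    encoded = encode_or_none(sample, name)
    if encoded is None:
        print(f"{name}: cannot represent {sample!r}")
    else:
        print(f"{name}: {len(encoded)} bytes, {encoded!r}")
```

ASCII has no code for the accented letter, ISO-8859-1 stores it in one byte, and UTF-8 needs two bytes for it.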

Syntax

Text, CSV, XML, JSON, RDF/XML, Turtle, XBRL
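To see how syntax sits on top of the same underlying data, the Python sketch below serializes one made-up record as CSV, JSON, and XML using only the standard library. The record, the helper names, and the XML element names are assumptions for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

record = {"id": 1, "name": "Ada"}  # hypothetical sample record


def to_json(rec: dict) -> str:
    return json.dumps(rec)


def to_csv(rec: dict) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()   # first line: the field names
    writer.writerow(rec)   # second line: the values
    return buffer.getvalue()


def to_xml(rec: dict) -> str:
    root = ET.Element("record")
    for key, value in rec.items():
        # XML has no native numbers: every value becomes text.
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_json(record))
print(to_csv(record), end="")
print(to_xml(record))
```

Note that JSON keeps the number 1 as a number, while CSV and XML flatten it to text. That loss is one reason a separate validation layer exists.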

Data Models

Table: relational model
Trees: XML Infoset, XDM
Graphs: RDF
Cubes: OLAP

Validation

XML Schema, JSON Schema, Relational schemas, XBRL taxonomies

Processing

two-phase: MapReduce
DAG (directed acyclic graph)-driven: Tez, Spark, Flink, Ray
Elastic computing: EC2
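The two-phase MapReduce model listed above can be sketched on a single machine in a few lines of Python. This is a toy word count, not the Hadoop API: the function names are my own, and the shuffle step that a real framework performs across the network is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for document in documents:
        for word in document.split():
            yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle: group all values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle(map_phase(documents)))


print(word_count(["big data", "Big stack"]))
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be spread over many machines without coordination inside a phase.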

Indexing

key-value stores, hash indices, B-trees, geographical indices, spatial indices
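A short Python sketch of why both hash indices and ordered indices appear in the list above. A dict plays the role of the hash index, and a sorted key list searched with `bisect` stands in for a B-tree. That stand-in is a simplification (a real B-tree is a balanced, disk-friendly tree), and the sample data is made up.

```python
import bisect

# Made-up (key, value) records.
records = [(17, "q"), (3, "a"), (42, "z"), (8, "c")]

hash_index = {key: value for key, value in records}   # good at exact matches
sorted_keys = sorted(hash_index)                      # ordered, like B-tree leaves


def point_lookup(key):
    """Exact-match lookup: one hash probe, no ordering needed."""
    return hash_index.get(key)


def range_lookup(low, high):
    """All records with low <= key <= high, found by binary search on the order."""
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_right(sorted_keys, high)
    return [(key, hash_index[key]) for key in sorted_keys[start:end]]


print(point_lookup(42))
print(range_lookup(5, 20))
```

A hash index alone cannot answer the range query without scanning every key, which is the usual argument for keeping an ordered index as well.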

Data Stores

RDBMS, MongoDB, Couchbase, Elasticsearch, Hive, HBase, MarkLogic, Cassandra.

Querying

SQL, XQuery, JSONiq, N1QL, MDX, SPARQL, REST APIs

UI

Excel, Access, Tableau …

Fululu

https://fuguigui.github.io

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce them, please credit the source: Fululu.
