
Date: 2020-01-05

1. Application Scenarios and Characteristics

HBase Architectural Components

The HBase architecture is also master-slave, built from three parts: HRegionServer, HBase Master, and ZooKeeper.

The RegionServers handle data reads and writes and interact with clients; operations on regions are handled by the HMaster; ZooKeeper maintains the set of live nodes.

Underneath, HBase stores its data in HDFS files, so the HDFS NameNode and DataNodes are involved. A RegionServer is collocated with an HDFS DataNode and can keep its data on the local DataNode, while the NameNode maintains the metadata for every physical data block.

For now, a rough idea of what each component does is enough; each one is covered in detail below.

Physically, HBase is composed of three types of servers in a master slave type of architecture. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a distributed coordination service, maintains a live cluster state.

The Hadoop DataNode stores the data that the Region Server is managing. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which enable data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

The NameNode maintains metadata information for all the physical data blocks that comprise the files.

[Figure 1]

HBase is a database that runs on a Hadoop cluster. Unlike a traditional database with strict ACID (atomicity, consistency, isolation, durability) requirements, HBase relaxes those requirements in exchange for better scalability, which makes it a better fit for unstructured and semi-structured data.

HBase => its advantages only show once the data volume is very large.

Regions

As the previous article described, an HBase table is split into chunks, each called a region. Each region holds the rows between its startKey and endKey. Regions are spread across the nodes of the cluster for storage; each such node is called a RegionServer and serves reads and writes. A single RegionServer can handle roughly 1,000 regions.

HBase Tables are divided horizontally by row key range into “Regions.” A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes. A region server can serve about 1,000 regions.
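
Because rows are sorted by row key and each region covers a contiguous key range, a range scan maps directly onto one or a few regions. Below is a minimal sketch with the HBase Java client; the table name user and its row-key scheme are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user"))) {
                // Scan the half-open key range [row-0100, row-0200); the client
                // routes the scan to the region(s) whose [startKey, endKey)
                // intervals overlap it.
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes("row-0100"))
                        .withStopRow(Bytes.toBytes("row-0200"));
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result result : scanner) {
                        System.out.println(Bytes.toString(result.getRow()));
                    }
                }
            }
        }
    }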

[Figure 2]

Apache HBase is a database that runs on a Hadoop cluster. HBase is not a traditional RDBMS, as it relaxes the ACID (Atomicity, Consistency, Isolation, and Durability) properties of traditional RDBMS systems in order to achieve much greater scalability. Data stored in HBase also does not need to fit into a rigid schema like with an RDBMS, making it ideal for storing unstructured or semi-structured data.

Characteristics:

HBase HMaster

The HMaster coordinates the regions. Specifically, it:

  1. Manages the RegionServers and balances load across them.
  2. Manages and assigns regions, e.g. assigning the new regions produced by a region split, and migrating regions off a departing RegionServer to other RegionServers.
  3. Monitors the state of every RegionServer in the cluster (via heartbeats and by listening to state in ZooKeeper).
  4. Handles schema change requests (creating, deleting, and altering table definitions).

Region assignment, DDL (create, delete tables) operations are handled by the HBase Master.

A master is responsible for:

  • Coordinating the region servers

    - Assigning regions on startup, re-assigning regions for recovery or load balancing

    - Monitoring all RegionServer instances in the cluster (listens for notifications from zookeeper)

  • Admin functions

    - Interface for creating, deleting, updating tables

[Figure 3]
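
Since DDL goes through the master, creating a table is an Admin operation. A minimal sketch, assuming the hypothetical blog table (column families article and author) that reappears as an example later in this article:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    public class CreateBlogTable {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName name = TableName.valueOf("blog");
                // Table creation is DDL: the client talks to the active
                // HMaster here, not to a RegionServer.
                TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("article"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("author"))
                        .build();
                if (!admin.tableExists(name)) {
                    admin.createTable(desc);
                }
            }
        }
    }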

ZooKeeper: The Coordinator

ZooKeeper is the coordinator: it tracks which nodes are running and which have died. Every node periodically sends it a heartbeat, so it always knows each node's state, and it raises a notification whenever a node misbehaves.

HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster. Zookeeper maintains which servers are alive and available, and provides server failure notification. Zookeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.

[Figure 4]

Traditional Databases vs. HBase

First, understand what NoSQL is; this primer is a good starting point: http://www.runoob.com/mongodb/nosql.html.

HBase is built on the NoSQL philosophy, so why do we need it? Let's look at traditional databases first:

Why do we need NoSQL/HBase? First, let's look at the pros of relational databases before we discuss their limitations:

  • Relational databases have provided a standard persistence model
  • SQL has become a de facto standard model of data manipulation
  • Relational databases manage concurrency for transactions
  • Relational databases have lots of tools

[Figure 5: Relational-Databases-vs-HBase]

Relational databases have been around for a long time, so why invent NoSQL or HBase? It comes down to data volume. With a few hundred rows in a relational table, a full-table scan finishes in acceptable time, but once tens of millions or billions of rows are involved, the query time becomes unacceptable. As data grows, the database must scale. One option is of course a better server, but the drawback is obvious: it's expensive! And as server capacity grows, other limits appear anyway.

Relational databases were the standard for years, so what changed? With more and more data came the need to scale. One way to scale is vertically with a bigger server, but this can get expensive, and there are limits as your size increases.

[Figure 6: Relational-Databases-vs-HBase-2]

Massive data storage => a single table can hold tens of billions of rows and millions of columns, i.e. there is effectively no limit on columns. => A relational table normally stays under about five million rows and thirty columns.

How the Components Work Together

So how do these components cooperate?

ZooKeeper, as the coordinator, tracks state information for the components of the whole system. The RegionServers and the active HMaster each hold a session with ZooKeeper, and ZooKeeper maintains an ephemeral node for every active session via heartbeats.

Each RegionServer creates an ephemeral node, and the HMaster watches these nodes to discover available RegionServers and to detect the ones that have died. HMasters race to create an ephemeral node of their own; ZooKeeper decides which becomes the active master and guarantees there is only one active HMaster at a time. The active HMaster sends heartbeats to ZooKeeper; the HMasters that lost the race stay inactive, forever hoping to replace the one active HMaster, so they keep listening for a ZooKeeper notification that it has died. It suddenly strikes me how much this system resembles human society: below anyone in a high position are countless people waiting for him to stumble so they can take his place.

If a RegionServer or the active HMaster fails to send heartbeats, its session is closed and all ephemeral nodes belonging to the failed node are deleted. Listeners watching those nodes are notified of the deletions. Since the active HMaster listens on the RegionServers' state, it starts recovery as soon as one of them fails. If it is the active HMaster itself that dies, the fix is simple: the inactive HMasters below are notified at once and start competing for the "post."

To summarize the process:

  1. After connecting to ZooKeeper, the HMaster and the RegionServers create ephemeral nodes and keep them alive via heartbeats; if an ephemeral node expires, the HMaster is notified and reacts accordingly.
  2. The HMaster watches the ephemeral nodes in ZooKeeper to detect RegionServers joining or crashing.
  3. The first HMaster to connect to ZooKeeper creates the ephemeral node that marks the active HMaster; HMasters joining later watch that node. If the active HMaster crashes, the node disappears, the other HMasters are notified, and one of them turns itself into the active HMaster. Before becoming active, each standby creates its own ephemeral node under /hbase/backup-masters/.

Zookeeper is used to coordinate shared state information for members of distributed systems. Region servers and the active HMaster connect with a session to ZooKeeper. The ZooKeeper maintains ephemeral nodes for active sessions via heartbeats.

[Figure 7]

Each Region Server creates an ephemeral node. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures. HMasters vie to create an ephemeral node. Zookeeper determines the first one and uses it to make sure that only one master is active. The active HMaster sends heartbeats to Zookeeper, and the inactive HMaster listens for notifications of the active HMaster failure.

If a region server or the active HMaster fails to send a heartbeat, the session is expired and the corresponding ephemeral node is deleted. Listeners for updates will be notified of the deleted nodes. The active HMaster listens for region servers, and will recover region servers on failure. The Inactive HMaster listens for active HMaster failure, and if an active HMaster fails, the inactive HMaster becomes active.

How NoSQL Differs from Traditional Databases

The other way to scale is horizontally, by adding machines. Compared with buying a better server, this doesn't cost much: you just buy more machines of the same kind. The data that used to sit on one machine is then distributed across all of them, partitioned by rows; with 10,000 rows, say, rows 1-2,500 go to machine A, rows 2,501-5,000 to machine B, and so on. You can't do that to a traditional database, though. And note that storing data this way also gives up some relational-database operations: a relational database is designed to run on a single node rather than a cluster, so the two inevitably target different workloads.

An alternative to vertical scaling is to scale horizontally with a cluster of machines, which can use commodity hardware. This can be cheaper and more reliable. To horizontally partition or shard a RDBMS, data is distributed on the basis of rows, with some rows residing on a single machine and the other rows residing on other machines. However, it's complicated to partition or shard a relational database, and it was not designed to do this automatically. In addition, you lose the querying, transactions, and consistency controls across shards. Relational databases were designed for a single node; they were not designed to be run on clusters.

[Figure 8: NoSQL-scale]

Column-oriented => columns are created on the fly as data is added, and each column can be operated on individually.

HBase First Read or Write

HBase has a special catalog table called the META table, which records which node in the cluster hosts each region. (Recall what a region is: one of the chunks a complete table is split into.) The location of the META table itself is kept in ZooKeeper; that is, the META table lives on some RegionServer, but only ZooKeeper knows which one.

When a client wants to read or write, the unavoidable question is: which node holds the data I want, or which node should the write go to? The answer lies in the META table.

Step 1: the client asks ZooKeeper which RegionServer hosts the META table.

Step 2: the client queries that RegionServer, using the row key it wants to access, to look up the owning RegionServer in META. It then caches this result together with the META table's location, so the next access to the same data skips these steps entirely and is served from the cache; on a cache miss, it consults the META table again at the cached META location.

Step 3 is simple: step 2 revealed which RegionServer holds the row for that key, so the client goes straight there.

There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.

This is what happens the first time a client reads or writes to HBase:

  1. The client gets the Region server that hosts the META table from ZooKeeper.
  2. The client will query the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.
  3. It will get the Row from the corresponding Region Server.

For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
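
All of this lookup and caching is hidden behind the client API: a plain Get triggers it transparently. A minimal sketch against the hypothetical blog table, reading the author:nickname cell used in the example later in this article:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FirstRead {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("blog"))) {
                // On the first access the client asks ZooKeeper for the META
                // location, queries META for the region hosting row "1",
                // caches both, and only then contacts the owning RegionServer.
                Get get = new Get(Bytes.toBytes("1"));
                Result result = table.get(get);
                byte[] nickname = result.getValue(Bytes.toBytes("author"), Bytes.toBytes("nickname"));
                System.out.println(Bytes.toString(nickname));
            }
        }
    }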

[Figure 9]

Limitations of the Traditional Data Model

In the database course we covered normalization of relational databases: by decomposing relation schemas it removes data redundancy, but the more normalized the design, the more relations it is split into, which means queries must join many tables, and that of course creates performance problems. HBase has no such normalization step: data that is likely to be accessed together is stored together, so it avoids the relational model's query-time costs. The difference between the two storage layouts is shown in the chart below:

Database normalization eliminates redundant data, which makes storage efficient. However, a normalized schema causes joins for queries, in order to bring the data back together again. While HBase does not support relationships and joins, data that is accessed together is stored together so it avoids the limitations associated with a relational model. See the difference in data storage models in the chart below:

[Figure 10: RDBMS-HBase-storage]

As mentioned above, data that may be accessed together is stored together, and that is precisely the idea HBase scales on. It takes a key-value approach: data is grouped by key, and the key decides which node the data is stored on. Every node then holds a share of the original data. HBase is, in fact, an implementation of Google's BigTable.

HBase was designed to scale due to the fact that data that is accessed together is stored together. Grouping the data by key is central to running on a cluster. In horizontal partitioning or sharding, the key range is used for sharding, which distributes different data across multiple servers. Each server is the source for a subset of data. Distributed data is accessed together, which makes it faster for scaling. HBase is actually an implementation of the BigTable storage architecture, which is a distributed storage system developed by Google that’s used to manage structured data that is designed to scale to a very large size.

Multiple versions

HBase Meta Table

So what exactly is the META table? Let's take a look: it holds a list of the locations of all regions, organized like a b-tree, with the keys and values shown below:

  • This META table is an HBase table that keeps a list of all regions in the system.

  • The .META. table is like a b-tree.

  • The .META. table structure is as follows:

    - Key: region start key, region id

    - Values: RegionServer

[Figure 11]

HBase characteristics: distributed, scalable, fast

HBase is column-family based! Every row has a key, and the whole database is indexed by that key: with a key you can look up one row of the database. So what is this column family thing, really? Just a group of columns. An Address column family, for instance, could group the Province, City, and Street columns. A row can thus have several column families; in effect, related fields from a relational schema are bundled together, so each row can be viewed as the combination of its column families.

HBase is referred to as a column family-oriented data store. It’s also row-oriented: each row is indexed by a key that you can use for lookup (for example, lookup a customer with the ID of 1234). Each column family groups like data (customer address, order) within rows. Think of a row as the join of all values in all column families.

[Figure 12: HBase-column]

HBase is also a distributed database: data is grouped by key, and the key is the atomic unit for updates.

HBase is also considered a distributed database. Grouping the data by key is central to running on a cluster and sharding. The key acts as the atomic unit for updates. Sharding distributes different data across multiple servers, and each server is the source for a subset of data.
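
Because the row is the unit of atomicity, the client API can offer atomic read-modify-write on a single row. A sketch using check-and-put; the table t, family cf, and the state column are all hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AtomicRowUpdate {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("t"))) {
                byte[] row = Bytes.toBytes("r1");
                byte[] cf = Bytes.toBytes("cf");
                // Atomically set cf:state=done only if cf:state is currently
                // "pending"; check and mutation apply to one row, the unit of
                // atomicity.
                boolean applied = table.checkAndMutate(row, cf)
                        .qualifier(Bytes.toBytes("state"))
                        .ifEquals(Bytes.toBytes("pending"))
                        .thenPut(new Put(row).addColumn(cf, Bytes.toBytes("state"), Bytes.toBytes("done")));
                System.out.println("applied = " + applied);
            }
        }
    }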

[Figure 13: HBase-distributed-database]

Sparse rows => empty columns consume no disk space. => In a relational database an empty column holds NULL and still takes disk space.

Region Server Components

Now for what's inside a RegionServer: a WAL, a BlockCache, MemStores, and HFiles.

WAL: short for Write Ahead Log, it is simply a log file stored on the distributed file system. It records data that has not yet been written to disk for permanent storage and is used for recovery. In other words, whatever is to be written must first be registered in this log. The reasoning is simple: without a log file, how would the database recover if something went wrong mid-write, say a power failure? With the log written first, it doesn't matter that the data never reached disk: the log shows which operations were performed, comparing it against the database's current contents shows where the failure happened, and replaying the log step by step from there restores the data.

BlockCache: the read cache. It holds frequently read data in memory and evicts by LRU, i.e. when the cache is full, the least recently used data is discarded.

MemStore: the write cache. It holds data that has not yet been written to disk, kept in sorted order, with one MemStore per column family (per region).

HFile: stores each row's data on disk in KeyValue format.

A Region Server runs on an HDFS data node and has the following components:

  • WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn't yet been persisted to permanent storage; it is used for recovery in the case of failure.
  • BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.
  • MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.
  • HFiles store the rows as sorted KeyValues on disk.

[Figure 14]

The HBase Data Model

After all this abstraction, we can finally look at how HBase actually stores data. As said before, every row has a key, called the rowkey; it acts a bit like a primary key in a relational database, and a row's data is found via its rowkey. Rows are stored sorted by rowkey; this is a fundamental storage model that HBase strictly follows.

Data stored in HBase is located by its “rowkey.” This is like a primary key from a relational database. Records in HBase are stored in sorted order, according to rowkey. This is a fundamental tenet of HBase and is also a critical semantic used in HBase schema design.

[Figure 15: HBase-data-model]

Since storage is distributed, we inevitably have to consider how to split a complete table into chunks to store separately. We already know the rows are sorted by key, so the split simply cuts along row order, say ten rows per chunk. The original data is thus cut into many chunks; each chunk is a region, and each region is assigned to a node for storage.

Tables are divided into sequences of rows, by key range, called regions. These regions are then assigned to the data nodes in the cluster called “RegionServers.” This scales read and write capacity by spreading regions across the cluster. This is done automatically and is how HBase was designed for horizontal sharding.

[Figure 16: HBase-tables]

Each region holds a number of rows, each row is a combination of column families, and each column family maps to one storage file (HFile).

The image below shows how column families are mapped to storage files. Column families are stored in separate files, which can be accessed separately.

[Figure 17: HBase-column-families]

Column families and rowkeys are covered; what's left is the actual stored data, which of course lives in the table's cells. In this model, how would you address a single cell? First the rowkey locates a row, then the column family locates one family within that row; a family contains several columns, so the column name then locates one column. That looks like enough to pin down a value, but in fact one more coordinate is needed: a timestamp.

The timestamp enters the picture with updates: when updating, HBase doesn't simply overwrite the old data; it keeps the old value and stores the new one alongside it, so a version-like attribute is needed to tell the two apart, and a timestamp fits the role perfectly.

These locating coordinates plus the value in the cell form a KeyValue structure: the key is rowkey + column family + column name + timestamp and is used to locate the data; the value is the stored data itself.

The data is stored in HBase table cells. The entire cell, with the added structural information, is called a KeyValue. The row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value; the key consists of the row key, column family name, column name, and timestamp.
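
To make the KeyValue coordinates concrete, this sketch prints the full key (row, family, qualifier, timestamp) and the value of every cell in one row of the hypothetical blog table:

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DumpCells {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("blog"))) {
                Result result = table.get(new Get(Bytes.toBytes("1")));
                for (Cell cell : result.rawCells()) {
                    // Each cell carries its complete coordinates.
                    System.out.printf("%s/%s:%s/%d -> %s%n",
                            Bytes.toString(CellUtil.cloneRow(cell)),
                            Bytes.toString(CellUtil.cloneFamily(cell)),
                            Bytes.toString(CellUtil.cloneQualifier(cell)),
                            cell.getTimestamp(),
                            Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }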

[Figure 18: Hbase-table-cells]

Logically, cells are stored in a table format, but physically, rows are stored as linear sets of cells containing all the key value information inside them.

In the image below, the top left shows the logical layout of the data, while the lower right section shows the physical storage in files. Column families are stored in separate files. The entire cell, the row key, column family name, column name, timestamp, and value are stored for every cell for which you have set a value.

[Figure 19: Logical-vs-physical-storage]

Many cells may simply never receive data; HBase does not store empty cells at all, so sparse tables waste no storage. Every value in the table carries a version; by default the timestamp serves as the version, though a custom versioning scheme can also be used. So the cell addressed by row + family + column may contain several versions of the data.

As mentioned before, the complete coordinates to a cell's value are: Table:Row:Family:Column:Timestamp ➔ Value. HBase tables are sparsely populated. If data doesn't exist at a column, it's not stored. Table cells are versioned uninterpreted arrays of bytes. You can use the timestamp or set up your own versioning system. For every coordinate row : family : column, there can be multiple versions of the value.

[Figure 20: sparse-data]

With versions in the picture, a put turns out to be both an insert and an update: on one hand it doesn't modify the existing data but creates a new value; on the other, since the new value carries the newest version, it also acts as the update.

Deleting a value attaches a tombstone marker to it; data marked with a tombstone is no longer returned by queries.

On reads, passing the appropriate parameters returns specific versions; with no version parameters, the newest version is returned by default.

How many versions each column family keeps is configurable; the default is to keep three. Past that limit the oldest version is evicted, so the newest three are always retained.

Versioning is built in. A put is both an insert (create) and an update, and each one gets its own version. Delete gets a tombstone marker. The tombstone marker prevents the data being returned in queries. Get requests return specific version(s) based on parameters. If you do not specify any parameters, the most recent version is returned. You can configure how many versions you want to keep and this is done per column family. The default is to keep up to three versions. When the max number of versions is exceeded, extra records will be eventually removed.
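
A sketch of versioning end to end, assuming a hypothetical table t whose family cf is created with a raised version limit; two puts write two versions of the same cell, and a Get reads them all back:

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class VersionsDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableName name = TableName.valueOf("t");
                admin.createTable(TableDescriptorBuilder.newBuilder(name)
                        .setColumnFamily(ColumnFamilyDescriptorBuilder
                                .newBuilder(Bytes.toBytes("cf"))
                                .setMaxVersions(5) // keep up to five versions per cell
                                .build())
                        .build());
                byte[] row = Bytes.toBytes("r1");
                byte[] cf = Bytes.toBytes("cf");
                byte[] col = Bytes.toBytes("c");
                try (Table table = conn.getTable(name)) {
                    // Two writes to the same cell with explicit timestamps:
                    // both versions are retained, neither overwrites the other.
                    table.put(new Put(row).addColumn(cf, col, 1L, Bytes.toBytes("v1")));
                    table.put(new Put(row).addColumn(cf, col, 2L, Bytes.toBytes("v2")));
                    // Ask for all stored versions instead of only the newest.
                    Result r = table.get(new Get(row).readVersions(5));
                    for (Cell cell : r.getColumnCells(cf, col)) { // newest first
                        System.out.println(cell.getTimestamp() + " -> "
                                + Bytes.toString(CellUtil.cloneValue(cell)));
                    }
                }
            }
        }
    }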

[Figure 21: versioned-data]

Scalability => built on HDFS underneath => when capacity runs short, just add machines on the fly.

HBase Write Steps

When a client issues a put request, the rowkey is resolved first: the META table is consulted to find which RegionServer the put's data should ultimately go to.

Then, as in the figure below, the put is first written to the RegionServer's WAL file.

Once the log write succeeds, the RegionServer locates the target region from the put's TableName and RowKey, then uses the column family to find that family's MemStore and writes the data into it.

Finally, once the data is written, an acknowledgement goes back to the client confirming the write.

HBase Write Steps (1)

When the client issues a Put request, the first step is to write the data to the write-ahead log, the WAL:

- Edits are appended to the end of the WAL file that is stored on disk.

- The WAL is used to recover not-yet-persisted data in case a server crashes.

[Figure 22]

HBase Write Steps (2)

Once the data is written to the WAL, it is placed in the MemStore. Then, the put request acknowledgement returns to the client.
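
A minimal Put sketch against the hypothetical blog table. The per-mutation durability setting controls how the WAL append in step 1 is synced before the MemStore write and the acknowledgement:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePath {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("blog"))) {
                Put put = new Put(Bytes.toBytes("1"))
                        .addColumn(Bytes.toBytes("author"), Bytes.toBytes("nickname"),
                                   Bytes.toBytes("yedu"));
                // SYNC_WAL (the effective default) makes the server sync the
                // WAL append before placing the edit in the MemStore and
                // acking the client.
                put.setDurability(Durability.SYNC_WAL);
                table.put(put);
            }
        }
    }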

[Figure 23]

Summary

Quoting a blog post as a summary (http://www.cnblogs.com/tgzhu/p/5857035.html):

Table: much like a traditional relational database, HBase organizes data into tables, and applications store their data in HBase tables.

Row: a row in an HBase table is uniquely identified by its RowKey. Whether numeric or string, it is ultimately stored as bytes, and rows are ordered lexicographically by RowKey.

Column Family: an HBase table is organized by rows and columns, with the added concept of a column family, which groups one or more columns together. Every HBase column must belong to some column family; creating a table requires only the table name and at least one column family.

Cell: the intersection of a row and a column is a cell; its content is the column's value, stored in binary form, and it is versioned.

Version: each cell's value can keep multiple versions of the data (exactly how many is specified at table creation), ordered newest first by time. The timestamp is a 64-bit integer that can be assigned when the data is written or assigned automatically by the RegionServer.

Notes

  • HBase has no data types; every column value is converted to a byte string for storage.
  • Unlike a relational database, where a table's columns and their types are fixed at creation, each row of an HBase table can have different columns.
  • Inserts with the same RowKey are treated as operations on the same row; that is, a second write with the same RowKey can be seen as an update of some of that row's columns.
  • A column is the column family joined to the column name with a colon separator, e.g. d:Name (d: the column family name, Name: the column name).

An example shows the respective designs of a relational table and an HBase table (the example: blog posts and their authors). The relational schema and its data are shown below:

[Figure 24]

The article's author is a foreign key pointing to the PK of the author table; sample data for the two tables follows:

[Figure 25]

Designed in HBase instead, it looks like this:

[Figure 26]

Recap

  • HBase does not support conditional queries, ORDER BY, and the like; records can be read only by Row key (or a range of Row keys) or by full-table scan.

  • Creating a table requires declaring only the table name and at least one column family name; each Column Family is a storage unit.

  • The example above designs an HBase table blog with two column families, article and author; in real applications a single column family is strongly recommended.

  • Columns need not be defined at table creation and can be added dynamically. Columns of the same Column Family cluster together in one storage unit and are sorted by Column key, so columns with the same I/O characteristics should be placed in one Column Family for performance. Note: columns can be added and deleted, a major difference from traditional databases, which is why HBase suits unstructured data.

  • HBase pins down a piece of data by row and column; that data's value may have multiple versions, ordered newest first by time, and queries return the newest version by default. In the example above, the author:nickname value of row key 1 has two versions: "一叶渡江" at 1317180070811 and "yedu" at 1317180718830 (in business terms: the nickname was changed to yedu at some moment, but the old value still exists). The Timestamp defaults to the current system time in milliseconds and can also be supplied when the data is written.

  • Every cell value is uniquely indexed by four keys, tableName + RowKey + ColumnKey + Timestamp ---> value; for example, {tableName='blog', RowKey='1', ColumnName='author:nickname', Timestamp='1317180718830'} indexes the unique value "yedu".

  • Storage types

    • TableName is a string
    • RowKey and ColumnName are binary values (Java type byte[])
    • Timestamp is a 64-bit integer (Java type long)
    • value is a byte array (Java type byte[])

References

  1. Original article: HBase and MapR-DB: Designed for Distribution, Scale, and Speed
  2. HBase Architecture Analysis, Part 1 (HBase体系结构剖析(上))

High reliability =>

HBase MemStore

The previous step said the data is written to the MemStore, which is a write buffer. What does that mean? Writes put the data into the buffer first; each entry in the buffer is a KeyValue structure, and entries are kept sorted by key.

The MemStore stores updates in memory as sorted KeyValues, the same as it would be stored in an HFile. There is one MemStore per column family. The updates are sorted per column family.

[Figure 27]

High performance => high write and high read throughput.

HBase Region Flush

The buffer has a fixed size; when it fills up, its data is flushed to a new HFile for permanent storage. In HBase each column family can have multiple HFiles, whose data has the same format as the buffer's: one key paired with one value.

Note that when one MemStore fills, all the MemStores flush their data to HFiles. This is one reason the official guidance is not to have too many column families per table: each column family has its own MemStore, and too many of them lead to frequent buffer flushes and other performance problems.

When the buffer is flushed into an HFile, a sequence number is saved along with it. What is it for? It lets the system know what data the HFiles hold so far. The sequence number is stored in the HFile as a meta field, so each flush effectively stamps the HFile with a marker.

When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. HBase uses multiple HFiles per column family, which contain the actual cells, or KeyValue instances. These files are created over time as KeyValue edits sorted in the MemStores are flushed as files to disk.

Note that this is one reason why there is a limit to the number of column families in HBase. There is one MemStore per CF; when one is full, they all flush. It also saves the last written sequence number so the system knows what was persisted so far.

The highest sequence number is stored as a meta field in each HFile, to reflect where persisting has ended and where to continue. On region startup, the sequence number is read, and the highest is used as the sequence number for new edits.
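
The flush threshold is configurable. Cluster-wide it comes from hbase.hregion.memstore.flush.size (128 MB by default); the sketch below overrides it per table at creation time, the table name t being hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    public class FlushSize {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                TableDescriptor desc = TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("t"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"))
                        // Flush this table's MemStores at 64 MB instead of the
                        // cluster default.
                        .setMemStoreFlushSize(64L * 1024 * 1024)
                        .build();
                admin.createTable(desc);
            }
        }
    }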

[Figure 28]

Near-real-time queries => querying hundreds of millions of rows within hundreds of milliseconds.

HBase HFile

The buffer's data is sorted by key, so flushing to an HFile just writes the records out one after another in order. Such sequential writing is very fast because it avoids moving the disk drive head.

Data is stored in an HFile which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head.

[Figure 29]

Application scenarios:

HBase HFile Structure

The structure of an HFile is relatively complex, because query performance has to be considered: the worst case would be scanning an entire file only to find the data isn't in this HFile, so the file's organization deserves some thought. How can you tell whether the data is in a file without scanning all of it? The answer that comes to mind is an index, and that is exactly HFile's approach: it uses a multi-level index, shaped much like a b-tree. One can't help but marvel at just how heavily databases rely on the b-tree!

An HFile contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The multi-level index is like a b+tree:

  • Key value pairs are stored in increasing order
  • Indexes point by row key to the key value data in 64KB “blocks”
  • Each block has its own leaf-index
  • The last key of each block is put in the intermediate index
  • The root index points to the intermediate index

The trailer, located at the end of the file, points to the meta blocks and is written once the data has been persisted to the file. The trailer also holds information like bloom filters and time range info. Bloom filters help to skip files that do not contain a certain row key. The time range info is useful for skipping the file if it is not in the time range the read is looking for.
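
Bloom filters are configured per column family. A sketch that selects row+column bloom filters at table creation; row-level bloom filters (BloomType.ROW) are the default in recent HBase versions, and the table t is hypothetical:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.regionserver.BloomType;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BloomConfig {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                admin.createTable(TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("t"))
                        .setColumnFamily(ColumnFamilyDescriptorBuilder
                                .newBuilder(Bytes.toBytes("cf"))
                                // Filter on row+column instead of row only.
                                .setBloomFilterType(BloomType.ROWCOL)
                                .build())
                        .build());
            }
        }
    }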

[Figure 30]

Transportation, finance, e-commerce, mobile...

HFile Index

When an HFile is opened, its index is loaded into the BlockCache; remember what we said the BlockCache was? The read cache.

The index, which we just discussed, is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek.

[Figure 31]

HBase Read Merge

Having covered HBase's storage structures, reads raise an obvious question: where might one row's data be? It may already be persisted in HFiles; it may still be in the MemStore write buffer, not yet flushed; or it may sit in the BlockCache read cache, where frequently read data lives. So how does a read find where the data is stored? In these steps:

First: why check the read cache first? It exists precisely for fast reads, so the BlockCache is absolutely the first priority; don't forget it evicts by LRU.

Second: the HFiles or the write cache? There can be many HFiles, so they are clearly the least efficient, while each column family has just one MemStore, necessarily much faster than the HFiles; it is therefore the second priority. If the read cache misses, look in the MemStore next.

Finally, if the two steps above both fail to find the data, there is no choice left but to go to the HFiles.

We have seen that the KeyValue cells corresponding to one row can be in multiple places, row cells already persisted are in Hfiles, recently updated cells are in the MemStore, and recently read cells are in the Block cache. So when you read a row, how does the system get the corresponding cells to return? A Read merges Key Values from the block cache, MemStore, and HFiles in the following steps:

  1. First, the scanner looks for the Row cells in the Block cache - the read cache. Recently Read Key Values are cached here, and Least Recently Used are evicted when memory is needed.
  2. Next, the scanner looks in the MemStore, the write cache in memory containing the most recent writes.
  3. If the scanner does not find all of the row cells in the MemStore and Block Cache, then HBase will use the Block Cache indexes and bloom filters to load HFiles into memory, which may contain the target row cells.

[Figure 32]

As discussed earlier, there may be many HFiles per MemStore, which means for a read, multiple files may have to be examined, which can affect the performance. This is called read amplification.

[Figure 33]

Concepts and positioning

HBase Minor Compaction

We said earlier that writes go to the buffer first, and when the buffer fills, its contents are flushed to HFiles for permanent storage. Imagine this write process going on and on: the HFiles pile up and become inconvenient to manage, hence the compaction operation, which packs the files down tight.

HBase has two kinds of compaction: minor compaction and major compaction.

A minor compaction merges several small HFiles into larger files; it clearly reduces the number of HFiles, and it does not process cells that are already deleted or expired. The result of a minor compaction is fewer, larger HFiles.

HBase will automatically pick some smaller HFiles and rewrite them into fewer bigger Hfiles. This process is called minor compaction. Minor compaction reduces the number of storage files by rewriting smaller files into fewer but larger ones, performing a merge sort.
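Compactions run automatically, but they can also be requested through the Admin API. A sketch, again with the hypothetical table t:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;

    public class RequestCompaction {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Queue a minor compaction; the server decides which HFiles to merge.
                admin.compact(TableName.valueOf("t"));
                // A major compaction would rewrite all HFiles of each store into one:
                // admin.majorCompact(TableName.valueOf("t"));
            }
        }
    }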

[Figure 34]