leveldb注释3–基础类

petermao — Sun, 07 Jul 2013 05:29:40 +0000

这一节主要分析下util目录下的代码。这部分代码相对独立，功能明确，很容易看明白。可先尝试理解这部分代码，以为进一步理解leveldb代码打下基础。

1 内存管理(arena.cc/.h)
leveldb的大部分内存管理依赖于C++语言的默认实现，也就是不对内存进行管理。只是在memtable的实现中用到了一个简单的内存管理器(arena)。因为memtable的内部实现skip list写入时，需要分配新节点，大量节点的直接分配可能会带来较多的碎片，影响运行效率。因此，leveldb在每个memtable中都会绑定一个arena，在memtable进行minor compact后，memtable销毁时进行统一释放。
下图是一个arena某个运行时刻的截图。从图中可以看出，arena内部使用的基本块大小为4K，已分配的块的指保存在一个vector中。具体分配策略如下。若分配的内存可从剩余的块中分配，则直接从剩余块中分配并调整剩余块指针的位置。否则检查要分配的size是否大于1/4的block(也就是1K)，若大于则直接new所需大小的内存，将指针保存在vector中并返回内存指针。从这里可以看出，vector保存的内存块的指针大小可能>4K，也就是对大内存块直接分配。前面两个条件若都不满足，则需要new一个基本块(=4K)，将new出来的新块作为当前块，从这块当中分配内存并调整当前内存指针位置。原来当前块的剩余内存就被浪费掉了。
实际中可考虑用tcmalloc/jemalloc以提升内存管理效率。

leveldb arena

  // alloc_ptr_指向基本块空闲内存位置，alloc_bytes_remaining_为剩余内存大小
  // blocks_保存了所有的块，包括了基本块与直接分配的大块
  // Allocation state
  char* alloc_ptr_;
  size_t alloc_bytes_remaining_;

  // Array of new[] allocated memory blocks
  std::vector blocks_;

  // Bytes of memory in blocks allocated so far
  size_t blocks_memory_;

2 过滤器bloom filter的实现(bloom.cc)
leveldb通过接口NewBloomFilterPolicy提供了一个filter的实现，使用时通过传入该filter可提升对不存在的数查找时的效率。open db时的options.filter_policy参数指定该filter。如果指定了filter，leveldb会在生成新sst文件时对该文件的所有key构造一个合适的bloom filter字符串并保存在sst文件中。现在我们来看看这个bloom filter的实现机制。bloom filter需要指定bits_per_key参数，表示每个key需要检测多少个bit位/保存多少个bit位。创建bloom filter的核心代码如下：

  virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const {
    // Compute bloom filter size (in both bits and bytes)
    size_t bits = n * bits_per_key_;

    // For small n, we can see a very high false positive rate.  Fix it
    // by enforcing a minimum bloom filter length.
    if (bits < 64) bits = 64;

    size_t bytes = (bits + 7) / 8;
    bits = bytes * 8;

    const size_t init_size = dst->size();
    dst->resize(init_size + bytes, 0);
	// 将探测需要的尾数放至最后
    dst->push_back(static_cast(k_));  // Remember # of probes in filter
    char* array = &(*dst)[init_size];
    for (size_t i = 0; i < n; i++) {
      // Use double-hashing to generate a sequence of hash values.
      // See analysis in [Kirsch,Mitzenmacher 2006].
      uint32_t h = BloomHash(keys[i]);
      const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
      for (size_t j = 0; j < k_; j++) {
        const uint32_t bitpos = h % bits;
        array[bitpos/8] |= (1 << (bitpos % 8));
        h += delta;
      }
    }

创建时，先预留bloom filter所需空间(n * bits_per_key_/8对齐后)，并将需要检测的位数保存至字符串末尾，以方便使用时解出。然后遍历key，计算其hash值，将hash值所对应的bit位置1(需要设置bits_per_key_次，并且后续的bit位从原始的hash值右移17位得到)。过滤函数KeyMayMatch基本就是上述过程的逆过程。bloom filter字符串的内部结构可如下图所示。

leveldb bloomfilter结构

3 默认的cache机制—LRU实现(cache.cc/.h)
leveldb的cache主要是为了加快get操作的性能。cache分为block cache与table cache。table cache针对sst文件，key是file_number，cache的大小为options.max_open_files – 10，在db初始化时创建，且cache机制使用默认的LRU算法，外部没法更改。

  // Reserve ten files or so for other uses and give the rest to TableCache.
  const int table_cache_size = options.max_open_files - 10;
  table_cache_ = new TableCache(dbname_, &options_, table_cache_size);

而block_cache则可以在open时通过options.block_cache参数指定cache实例。外部可自定义合适的cache机制替换系统自带的LRU cache。如果外部没有提供block_cache参数，则leveldb会在初始化时构造8MB大小的cache，该cache同样是LRU cache的实例。该cache的key是block所属table文件的一个cache_id + block在文件中的偏移。

  // 没有提供block_cache，则内部使用8MB的LRUCache
  if (result.block_cache == NULL) {
    result.block_cache = NewLRUCache(8 << 20);
  }

      char cache_key_buffer[16];
      EncodeFixed64(cache_key_buffer, table->rep_->cache_id);
      EncodeFixed64(cache_key_buffer+8, handle.offset());
      Slice key(cache_key_buffer, sizeof(cache_key_buffer));
      cache_handle = block_cache->Lookup(key);

如果用户想自定义block_cache算法，则可参考cache.h中Cache类所定义的抽象接口，主要接口代码如下所示。可以看出，也就是key的增删查改以及句柄的释放。

  // When the inserted entry is no longer needed, the key and
  // value will be passed to "deleter".
  virtual Handle* Insert(const Slice& key, void* value, size_t charge,
                         void (*deleter)(const Slice& key, void* value)) = 0;

  // If the cache has no mapping for "key", returns NULL.
  //
  // Else return a handle that corresponds to the mapping.  The caller
  // must call this->Release(handle) when the returned mapping is no
  // longer needed.
  virtual Handle* Lookup(const Slice& key) = 0;

  // Release a mapping returned by a previous Lookup().
  // REQUIRES: handle must not have been released yet.
  // REQUIRES: handle must have been returned by a method on *this.
  virtual void Release(Handle* handle) = 0;

  // Return the value encapsulated in a handle returned by a
  // successful Lookup().
  // REQUIRES: handle must not have been released yet.
  // REQUIRES: handle must have been returned by a method on *this.
  virtual void* Value(Handle* handle) = 0;

  // If the cache contains entry for key, erase it.  Note that the
  // underlying entry will be kept around until all existing handles
  // to it have been released.
  virtual void Erase(const Slice& key) = 0;

  // Return a new numeric id.  May be used by multiple clients who are
  // sharing the same cache to partition the key space.  Typically the
  // client will allocate a new id at startup and prepend the id to
  // its cache keys.
  virtual uint64_t NewId() = 0;

NewLRUCache函数可调用自带的LRU实现。现在我们来看看LRU的具体实现。下图是LRUCache的内部实现图与相关代码。

单个LRUCache

struct LRUHandle {
  void* value;
  void (*deleter)(const Slice&, void* value);
  LRUHandle* next_hash; // hash链表指针
  LRUHandle* next;       // 循环链表指针
  LRUHandle* prev;
  size_t charge;      // TODO(opt): Only allow uint32_t?
  size_t key_length;
  uint32_t refs;
  uint32_t hash;      // Hash of key(); used for fast sharding and comparisons
  char key_data[1];   // Beginning of key
--
};
class HandleTable {
  ---
  uint32_t length_;
  uint32_t elems_;
  LRUHandle** list_;
  ---
};

class LRUCache {
  ---
  // Initialized before use.
  size_t capacity_;
  ---
  // Dummy head of LRU list.
  // lru.prev is newest entry, lru.next is oldest entry.
  LRUHandle lru_;
  HandleTable table_;
};

static const int kNumShardBits = 4;
static const int kNumShards = 1 << kNumShardBits;
class ShardedLRUCache : public Cache {
  ---
  LRUCache shard_[kNumShards];
  --
};

每个节点有3个指针，一个指针用于hash table的单链接指向。另外两个形成双向循环链表，以实现lru功能，方便控制cache的大小。插入时除了在hash table中查找合适的位置并插入，还需要插入到循环链表的开头。如果数据过多，就通过head来删除最老的节点。对于table，插入的元素个数大于桶的数目时，需要重新调整大小，新的size将是2的幂，但比元素个数多。
leveldb为了加快查找速度，并不只是使用了1个如上所示的LRUCache，而是16个(也就是说会按key再分一次桶)，最终调用的是ShardedLRUCache接口。

4 内部编码/解析器(coding.cc/.h)
leveldb为了节省空间，可对int进行可变长度编码，也就是对小于2^7使用1个字节存储，小于2^14使用两个字节。。。。编码时将输入每隔7位分组，将字节的最高位始终置1，一直到最后1个字节的最高位置0表示结束。比如对int型的二进制数00000000,00000001,11011001,01000110，每隔7位分组则为111,0110010,1000110，再将每个字节的高位置1，则实际存储的为00000111,10110010,11000110。这样比起始终用4个字节存储，省了1个字节的存储空间。实际的编码代码如下：

char* EncodeVarint32(char* dst, uint32_t v) {
  // Operate on characters as unsigneds
  unsigned char* ptr = reinterpret_cast(dst);
  static const int B = 128;
  if (v < (1<<7)) {
    *(ptr++) = v;
  } else if (v < (1<<14)) {
    *(ptr++) = v | B;
    *(ptr++) = v>>7;
  } else if (v < (1<<21)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = v>>14;
  } else if (v < (1<<28)) {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = v>>21;
  } else {
    *(ptr++) = v | B;
    *(ptr++) = (v>>7) | B;
    *(ptr++) = (v>>14) | B;
    *(ptr++) = (v>>21) | B;
    *(ptr++) = v>>28;
  }
  return reinterpret_cast(ptr);

5 默认比较器的实现—字节级比较器
由于leveldb内部是有序的，因此其比较器是至关重要的。头文件中的Comparator类提供了比较器的抽象接口。Name定义了该比较器的名字，会在open时进行检查。至于FindShortestSeparator、FindShortSuccessor表示什么意思，可参看默认的字节级比较器BytewiseComparator的实现，相当简单。

  // Three-way comparison.  Returns value:
  //   < 0 iff "a" < "b",
  //   == 0 iff "a" == "b",
  //   > 0 iff "a" > "b"
  virtual int Compare(const Slice& a, const Slice& b) const = 0;

  // The name of the comparator.  Used to check for comparator
  // mismatches (i.e., a DB created with one comparator is
  // accessed using a different comparator.
  //
  // The client of this package should switch to a new name whenever
  // the comparator implementation changes in a way that will cause
  // the relative ordering of any two keys to change.
  //
  // Names starting with "leveldb." are reserved and should not be used
  // by any clients of this package.
  virtual const char* Name() const = 0;

  // Advanced functions: these are used to reduce the space requirements
  // for internal data structures like index blocks.

  // If *start < limit, changes *start to a short string in [start,limit).
  // Simple comparator implementations may return with *start unchanged,
  // i.e., an implementation of this method that does nothing is correct.
  virtual void FindShortestSeparator(
      std::string* start,
      const Slice& limit) const = 0;

  // Changes *key to a short string >= *key.
  // Simple comparator implementations may return with *key unchanged,
  // i.e., an implementation of this method that does nothing is correct.
  virtual void FindShortSuccessor(std::string* key) const = 0;

未完待续

6 crc32

7 hash

8 histogram

9 logging.cc

10 random.h

petermao的技术blog » leveldb comparator

leveldb注释3–基础类