RocksDB Advantages over LevelDB

Problem: Low write rate – LevelDB was using only one CPU.
Solution: Multi-threaded compaction.
Result: 10x write rate.

Problem: Stalls – P99 latencies of tens of seconds.
Solution: Thread-aware compaction:

o    Dedicated threads to flush memtables

o    Pipelined memtables

Result: P99 reduced to < 1s; fewer stalls.

Problem: High write amplification, due to leveled compaction.
Solution: Introduced universal compaction.
Result: 7x lower write amplification.

Problem: High read amplification – LevelDB doesn’t use bloom filters for scans.
Solution: Implemented prefix scans:

o    Range scans within the same key prefix

o    Blooms created for the prefix

Result: Bloom filters usable for range scans.

Problem: Read-modify-write required a Get followed by a Put.
Solution: Introduced the merge operator (see the sketch after this table).
Result: Ability to do read-modify-write in a single write.

Problem: Rigid design.
Solution: Made key components pluggable:

o    Compaction filter

o    Memtable/sstable format for RAM/flash (e.g. the PlainTable format)

o    Compaction algorithm
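The merge operator replaces the Get-then-Put cycle with a single Merge write. A minimal sketch using the public rocksdb::AssociativeMergeOperator API (the counter operator below is illustrative; RocksDB ships a similar uint64-add operator among its utilities):

#include <cstdint>
#include <cstring>
#include <string>

#include <rocksdb/db.h>
#include <rocksdb/merge_operator.h>

// Folds uint64 deltas into a uint64 base value, so counters can be
// incremented with db->Merge() instead of Get() + Put().
class UInt64AddOperator : public rocksdb::AssociativeMergeOperator {
 public:
  bool Merge(const rocksdb::Slice& /*key*/,
             const rocksdb::Slice* existing_value,
             const rocksdb::Slice& value, std::string* new_value,
             rocksdb::Logger* /*logger*/) const override {
    uint64_t base = 0;
    if (existing_value)
      std::memcpy(&base, existing_value->data(), sizeof(base));
    uint64_t delta = 0;
    std::memcpy(&delta, value.data(), sizeof(delta));
    base += delta;
    new_value->assign(reinterpret_cast<const char*>(&base), sizeof(base));
    return true;  // merge succeeded; false would signal corruption
  }
  const char* Name() const override { return "UInt64AddOperator"; }
};

With options.merge_operator set to an instance of this class, db->Merge(write_options, key, delta) just queues the delta; RocksDB folds queued deltas into the base value during reads and compactions.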

Optimizations for flash/RAM (see the options sketch after this list)
  • Pluggable SST format
    • No blocks, no caching, mmap the whole file
    • Build an efficient lookup index on load
  • Blooms for the memtable
    • Once the SST is optimized, memtable lookup becomes the major cost
  • Pluggable memtable format
    • Use case – distinct load phase
      • Write throughput is limited by the single writer thread
      • New memtable format that doesn’t keep keys sorted
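A minimal sketch wiring these optimizations together through the public rocksdb::Options API (the field names are real; the specific values, and whether they suit a given workload, are illustrative):

#include <rocksdb/db.h>
#include <rocksdb/memtablerep.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

rocksdb::Options FlashOptimizedOptions() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Prefix extractor: enables prefix scans and per-prefix blooms.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  // PlainTable: no blocks, lookup index built on load, file mmap-ed.
  options.table_factory.reset(rocksdb::NewPlainTableFactory());
  options.allow_mmap_reads = true;  // PlainTable requires mmap reads
  // Vector memtable: keys kept unsorted during a distinct load phase,
  // sorted only when the memtable is flushed.
  options.memtable_factory.reset(new rocksdb::VectorRepFactory());
  return options;
}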

RocksDB recovery

The WAL from which to start recovery is determined per column family (CF).

The reason is as follows:

Different CFs have different memtables that are flushed independently, but the WAL is shared across CFs (which is how atomicity across CFs is achieved).

Thus it is possible that a log contains entries that are already flushed to an sst file for some CFs but not for others.

As soon as any CF is flushed, a new WAL is started. This ensures that an entire WAL is either applicable to a CF or not (i.e. there are no partially applicable WALs).

The applicable LogNumber is tracked per CF.

During recovery, RocksDB checks in the SeekToColumnFamily function whether the log is applicable to the given CF, and if so applies the change to the memtable being prepared for that CF:

00 RocksDbStoreUnitTest!rocksdb::`anonymous namespace'::MemTableInserter::SeekToColumnFamily
01 RocksDbStoreUnitTest!rocksdb::`anonymous namespace'::MemTableInserter::PutCF
02 RocksDbStoreUnitTest!rocksdb::WriteBatch::Iterate
03 RocksDbStoreUnitTest!rocksdb::WriteBatchInternal::InsertInto
04 RocksDbStoreUnitTest!rocksdb::DBImpl::RecoverLogFiles
05 RocksDbStoreUnitTest!rocksdb::DBImpl::Recover
06 RocksDbStoreUnitTest!rocksdb::DB::Open
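A simplified sketch of that check (illustrative pseudocode modeled on the call stack above, not the actual RocksDB source; the names mirror RocksDB's internal per-CF bookkeeping):

#include <cstdint>

// Per-CF recovery bookkeeping: each CF tracks the smallest WAL number it
// still needs; older WALs are already covered by its flushed sst files.
struct ColumnFamilyState {
  uint64_t log_number;  // smallest WAL still required by this CF
};

// During replay, apply a WAL record to a CF's memtable only if the WAL
// is not already covered by that CF's sst files.
bool ShouldApplyToColumnFamily(const ColumnFamilyState& cf,
                               uint64_t wal_number) {
  return wal_number >= cf.log_number;
}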

Memtables rebuilt for CFs during recovery are flushed to sst files (so that log replay doesn’t happen again and again), unless the DB is opened in read-only mode, in which case logs are replayed into memtables as usual but the memtables are not flushed.

Real observation:

We had one CF with a high write rate and one CF with a very low write rate. The low write-rate CF forced several logs to hang around, since it flushed much less frequently, even though the entries in those logs were already flushed for the high write-rate CF. Thus on recovery a number of WALs were applicable to the low write-rate CF but only the latest log to the high write-rate CF.

PacificA

PacificA is a distributed replication protocol.

Configuration changes are managed with Paxos, which provides continuous replication and strong consistency.

Reads and writes are managed using a primary and secondaries. Reads are all served through the primary.

Steady State Operation For Writes

  • Two phases for any operation – prepare/commit.
  • The primary sends prepare messages to the secondaries. Once the secondaries ack, the operation is committed on the primary, and the primary asks the secondaries to commit as well.
  • Invariant (modeled in the sketch after this list):
    • The primary’s commit list is a prefix of each secondary’s prepare list.
    • CommitList of Secondary < CommitList of Primary < PrepareList of Secondary (< denotes prefix)
  • Example:
    • Primary:
      • P1, P2, P3, P4, P5, P6, P7, P8
      • C1, C2, C3, C4, C5
    • Secondary:
      • P1, P2, P3, P4, P5, P6
      • C1, C2, C3
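A toy model of the invariant in C++ (the struct and function names are illustrative; contiguous sequence numbers stand in for the prepare/commit lists):

#include <cassert>
#include <cstdint>

// Each replica's prepared/committed lists are contiguous, so a single
// highest sequence number represents each list.
struct Replica {
  uint64_t prepared_up_to = 0;   // highest prepared sequence number
  uint64_t committed_up_to = 0;  // highest committed sequence number
};

// The primary commits an entry once every secondary has prepared it, so
// the commit point is the minimum prepared point across all replicas.
void AdvanceCommitPoint(Replica& primary, const Replica* secondaries, int n) {
  uint64_t min_prepared = primary.prepared_up_to;
  for (int i = 0; i < n; ++i)
    if (secondaries[i].prepared_up_to < min_prepared)
      min_prepared = secondaries[i].prepared_up_to;
  primary.committed_up_to = min_prepared;
}

// Commit(secondary) <= Commit(primary) <= Prepare(secondary).
void CheckInvariant(const Replica& primary, const Replica& secondary) {
  assert(secondary.committed_up_to <= primary.committed_up_to);
  assert(primary.committed_up_to <= secondary.prepared_up_to);
}

In the example above the primary sits at P8/C5 and the secondary at P6/C3, which satisfies both assertions.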

Configuration Management

  • Invariant
    • Single Primary
  • Configuration changes are done through config manager
  • If the primary dies, a secondary asks the config manager to elect it as primary.
  • Since multiple secondaries may try to become primary, the config manager makes the first one win and bumps up the config version. Changes proposed with the previous version are then refused.
  • It is possible that the primary didn’t die but is in a different network partition. Hence a lease mechanism is used to ensure that the old primary doesn’t continue to act as primary (timing sketched after this list).
    • The primary sends a periodic beacon to the secondaries to obtain a lease for the lease period.
    • A secondary proposes a config change only after a grace period that is longer than the lease period.
    • The lease mechanism is decentralized, as it is between the replicas only (typically 3).
    • Write messages themselves also act as beacons – explicit beacons are needed only when there is no write activity.
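A toy timing sketch of the lease rule (the constants are illustrative; the only requirement is grace period > lease period, so the old primary stops serving before a new one can be elected):

#include <chrono>

using Clock = std::chrono::steady_clock;

constexpr auto kLeasePeriod = std::chrono::seconds(3);
constexpr auto kGracePeriod = std::chrono::seconds(4);
static_assert(kGracePeriod > kLeasePeriod,
              "single-primary safety relies on grace > lease");

// Primary: keep acting as primary only while beacons are being acked
// within the lease period.
bool PrimaryLeaseValid(Clock::time_point last_beacon_ack) {
  return Clock::now() - last_beacon_ack < kLeasePeriod;
}

// Secondary: propose a config change (new primary) only after the longer
// grace period has elapsed with no beacon from the primary.
bool SecondaryMayProposeChange(Clock::time_point last_beacon_seen) {
  return Clock::now() - last_beacon_seen > kGracePeriod;
}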

Failure handling

Removal of secondaries

Primary proposes a new configuration that excludes the failed secondaries.

Removal of primary

When a secondary detects that the primary is unavailable, it tries to get itself elected as primary, as described above.

Once it becomes primary, it starts a reconciliation process.

Reconciliation

The new primary (the process is sketched in code after this list)

  • Sends prepare messages for all the messages in its prepared list, and gets them committed.
    • Since its prepared list is a superset of the committed list on any replica, no replica can have more advanced commits.
    • In the example above it would send prepares for P4, P5, P6.
  • Asks the other replicas to truncate any remaining prepares.
    • In the example above, if the previous primary rejoined, it would discard P7, P8.
  • Now the new primary can send new prepares and everyone will be in sync.
    • The new primary can reuse serial numbers 7 and 8 now (although in practice config# + serial number may form the key).
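A sketch of reconciliation continuing the toy Replica model above (RPCs are modeled as direct writes to peer structs, purely for illustration):

// The new primary re-prepares everything in its own prepared list,
// truncates anything beyond it, and then commits the whole list.
void Reconcile(Replica& new_primary, Replica* replicas, int n) {
  for (int i = 0; i < n; ++i) {
    // Re-send prepares up to our prepared point (P4..P6 in the example);
    // replicas that already have them simply ack. Replicas that are
    // ahead (the old primary's P7, P8) truncate down to our point.
    replicas[i].prepared_up_to = new_primary.prepared_up_to;
  }
  // Safe to commit: our prepared list is a superset of every commit list.
  new_primary.committed_up_to = new_primary.prepared_up_to;
}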

New secondary

It needs to be brought up-to-date. For that:

  • The primary makes it a “candidate secondary” and starts sending it prepare messages.
    • Candidacy is terminated if it fails to ack any prepare message.
  • The secondary obtains the previously prepared messages from the other replicas.
  • When it is finally caught up, the primary gets it added to the new configuration.
  • In practice there are checkpoints, so the new secondary only needs the checkpoint plus the remaining prepared messages.

Implementation

Based on the replication protocol, the requirements on the underlying system are:

1) Query processing – must be able to answer queries from the committed list only.

2) Reconciliation – must be able to truncate the prepared list (to make it identical to the new primary’s).

3) State transfer – must be able to transfer the needed portion of the prepared list to a newly joined replica. Some entries from the uncommitted prepare list may need to be removed after state transfer (as the transferred state might already contain them).

Windows -> Linux tools mapping

Editor
  Windows: SourceInsight, VS
  Linux: Eclipse, Emacs, vim, Code::Blocks (http://en.wikipedia.org/wiki/Code::Blocks); SourceInsight can also run under wine. See http://stackoverflow.com/questions/4049613/source-insight-alternative-for-linux for suggested alternatives.

Compiler
  Windows: VC
  Linux: GCC

Debugger
  Windows: KD, WinDbg
  Linux: gdb (KD eq.), http://sourceware.org/insight/ (WinDbg eq.)

Static verifier
  Windows: PREfast
  Linux: Valgrind??

Bounds checking
  Windows: App Verifier
  Linux: AddressSanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer), Valgrind

Memory leak
  Windows: RADAR, UMDH
  Linux: ??

Perf analysis
  Windows: xperf
  Linux: perf (based on similar eventing infrastructure); see posts on perf.

Windows Profiling

Disabling Paging Executive

On 64-bit Windows, stack walking requires that drivers and system code not be paged out. To disable the paging executive, set the following regkey (a reboot is required for it to take effect):

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Profiling start

Useful for a CPU-bound workload:

xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE -stackwalk profile -minbuffers 16 -maxbuffers 1024 -BufferSize 1024

Profiling end

xperf -d profile.etl

Wait analysis (context switch profiling)

Useful for an I/O-bound workload:

xperf -on Base+CSWITCH+DISPATCHER -stackwalk CSwitch+ReadyThread -BufferSize 1024 -MinBuffers 50 -MaxBuffers 1024

Columns to include on the left of the gold bar: NewProcess, NewThreadId, NewThreadStack, Max:Waits (us), ReadyThreadStack, ReadyingProcess, and ReadyingThreadId.

See detailed synopsis at http://devproconnections.com/development/use-xperfs-wait-analysis-application-performance-troubleshooting

If you are not sure whether the workload is CPU bound or I/O bound, you can combine both sets of options above (at the cost of a larger etl file).
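For example, merging the kernel flags and stack-walk flags of the two commands above (an untested combination, shown as a sketch):

xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE+CSWITCH+DISPATCHER -stackwalk Profile+CSwitch+ReadyThread -BufferSize 1024 -MinBuffers 50 -MaxBuffers 1024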

Viewing traces

set _NT_SYMBOL_PATH=srv*;<your_app_pdb_path>

wpa.exe profile.etl

1) Check Trace->Load Symbols.

2) Include the ‘Stack’ column on the left of the gold bar for inclusive times. Include the ‘Function’ column (and remove ‘Stack’ if present) for exclusive times.