- Use VSInstr.exe to instrument the binary.
- For 64-bit binary locate 64-bit VSInstr under C:\Program Files [(x86)]\Microsoft Visual Studio 14.0\Team Tools\Performance Tools\x64
- The binary to be instrumented must be built with /PROFILE option passed to linker.
- This can be added through project->properties->linker->advanced
- Or add the project file directly: add <Profile>true</Profile> under link.
- The machine where scenario is to be run would need to have profiling tools installed. Best to use standalone profiler installer.
- VsPerfCmd /start:trace /output:<outputfile>.vsp
- VsPerfCmd /shutdown
- VSPerfReport.exe /summary:all /packsymbols <vsp_file>
- Please note that this is not in x64 directory…is one level up in the regular directory.
- Load .vsp file in VisualStudio
|LevelDB had low write rate
o Was using only one CPU
|Multi-threaded compaction||10x write rate|
|Stalls – P99 latencies x10s||Thread aware compaction, P99 reduced to < 1s
o Dedicated threads to flush memtables
o Pipelined memtables
|High write amplification
o Due to leveled compaction
|Introduced universal compaction||7x lower write amplification|
|High read amplification
o LevelDB doesn’t use bloom for scans
|Implemented prefix scans
o Range scans within same key prefix
o Blooms created for prefix
|Blooms for range scans|
|Read-modify-write||Introduced merge operator||Ability to do read-modify-write|
o Compaction filter
o Memtable/sstable format for RAM/Flash
E.g. plaintable format
o Compaction algorithm
|Optimizations for flash/RAM|
- Pluggable SST format
- No blocks, no caching, mmap the whole file
- Build efficient lookup index on load
- Blooms for memTable
- Once SST is optimized, memTable lookup becomes the major cost
- Pluggable memTable format
- Use case – distinct load phase
- Write throughput is limited by single writer thread
- New memTable format that doesn’t keep keys sorted
- Use case – distinct load phase
Which log to start recovery from is per column-family.
The reason is as follows:
Different CFs have different memtables that get flushed independently, but WAL is common (which is how atomicity across CFs is achieved).
Thus is it is possible that log has entries which are already flushed to sst for some CF but not other CFs.
As soon as any CF is flushed a new WAL is started. This is to ensure that either entire WAL is applicable to a CF or not (i.e. no partial WALs).
The applicable LogNumber is tracked per CF.
During recovery RocksDB checks in SeekToColumnFamily function whether the log is applicable to the given CF and applies the change to memtable being prepared for this CF.
00 RocksDbStoreUnitTest!rocksdb::`anonymous namespace’::MemTableInserter::SeekToColumnFamily
01 RocksDbStoreUnitTest!rocksdb::`anonymous namespace’::MemTableInserter::PutCF
Newly prepared memtables for CFs are flushed to sst files during recovery (so that log replay doesn’t happen again and again) unless DB is opened in read-only mode, in which case logs are translated to memtables during recovery as usual but memtables are not flushed.
We had one CF that has high write rate and one CF with very low write rate. The low write rate CF forces several logs to hang around since it flushed much less frequently, while the entries corresponding to these logs are already flushed for high write-rate CF. Thus on recovery a number of WALs are applicable to low write-rate CF but only the latest log for high write-rate CF.
|Read/Write||File pointer/offset based||Mmap variant||Posix API used|
|Sequential File||R||File pointer||–||fread_unlocked|
PacificA is a distributed replication protocol.
Configuration changes are managed with Paxos – which has continuous replication and strong consistency.
Read/Writes are managed using primary/secondaries. Reads are all done through Primary.
Steady State Operation For Writes
- Two phases for any operation – prepare/commit.
- Primary sends prepare messages to secondaries. Once secondaries ack, it is committed on primary and primary asks secondaries to commit as well.
- Primary commit list is a prefix of secondaries prepare list.
- CommitList of secondary < CommitList of Primary < PrepareList of Secondary (< denotes subset)
- P1, P2, P3, P4, P5, P6, P7, P8
- C1, C2, C3, C4, C5
- P1, P2, P3, P4, P5, P6
- C1, C2, C3
- Single Primary
- Configuration changes are done through config manager
- If primary dies, secondary asks config manager to elect it as primary
- Since multiple secondaries would try to become primary, config manager makes the first one win and bumps up config version. Changes proposed with prev version are now refused.
- It is possible that primary didn’t die but is in a different network partition. Hence lease mechanism is used to ensure that old primary doesn’t continue to act as primary.
- Primary sends periodic beacon to secondaries to obtain a lease for lease period.
- Secondary proposes a config change only after grace period > lease period.
- Lease mechanism is decentralized as it is between replicas only (typically 3).
- Also writes messages themselves act as beacon – real beacons are needed only when there is no write activity.
Removal of secondaries
Primary proposes a new configuration that excludes the failed secondaries.
Removal of primary
When secondary detects primary unavailability, it tries to get itself elected as primary as described above.
Once it becomes primary it starts a reconciliation process.
The new primary
- Sends prepare messages for all the messages in its prepared list, and gets them committed
- Since its prepared list is a superset of committed list on any replica, no one can have more advanced commits.
- In the example above it will send prepare for P4, P5, P6
- It asks other replicas to truncate any remaining prepares
- In the example above if previous primary rejoined it would discard P7, P8
- Now the new primary can send new prepares and everyone would be in sync.
- New primary can reuse the serial numbers 7 and 8 now (although in practice config# + serial number may be the key)
It needs to be brought up-to-date. For that:
- Primary makes it a “candidate secondary” and starts sending it prepare messages.
- Candidacy is terminated if it fails to ack any prepare message.
- Secondary obtains the previous prepared messages from other replicas.
- When it is finally caught up, primary gets it added to the new configuration.
- In practice, there are checkpoints, so the new secondary needs checkpoint+remaining prepared messages.
Based on the replication protocol here are the requirements for the underlying system:
1) Query processing – must be able to return query from committed only list
2) Reconciliation – truncated prepared list (to make it identical to the new primary).
3) State transfer – transfer the needed portion of prepared list to the newly joined replica. Some entries from uncommitted prepare list may need to be removed after state transfer (as state might already contain them).
|Editor||SourceInsight, VS||Eclipse, Emacs, vim, Code::Blocks (http://en.wikipedia.org/wiki/Code::Blocks), can run SourceInsight on wine.
See http://stackoverflow.com/questions/4049613/source-insight-alternative-for-linux for suggested alternatives.
|Gdb (KD eq.),
http://sourceware.org/insight/ (WinDbg eq.)
|Bounds checking||App verifier||Address Sanitizer (https://code.google.com/p/address-sanitizer/wiki/AddressSanitizer ) , Valgrind|
|Memory leak||RADAR, UMDH||??|
|Perf analysis||Xperf||perf (based on similar eventing infrastructure) see posts on perf.|
Disabling Paging Executive
64-bit stackwalk needs disabling of paging of drivers and system code. To disable paging executive the following regkey needs to be set:
REG ADD “HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management” -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
Useful for CPU bound workload
xperf -on PROC_THREAD+LOADER+INTERRUPT+DPC+PROFILE -stackwalk profile -minbuffers 16 -maxbuffers 1024 -BufferSize 1024
xperf -d profile.etl
Wait analysis (context switch profiling)
Useful for I/O bound workload
Xperf -on Base+CSWITCH+DISPATCHER -stackwalk CSwitch+ReadyThread -BufferSize 1024 -MinBuffers 50 -MaxBuffers 1024
Columns on the left of goldbar: NewProcess, NewThreadId, NewThreadStack, Max:Waits (us), ReadyThreadStack, ReadyingProcess, and ReadyingThreadId
If not sure about CPU bound or I/O bound, you can combine both options above (at the cost of increased etl size).
1) Check Trace->LoadSymbols
2) Include ‘Stack’ column on the left of gold bar for inclusive times. Include ‘Function’ column (and remove ‘Stack’ if present) for exclusive times.
perf record -g –call-graph dwarf <program>
perf record -g –call-graph dwarf ./db_bench
perf report -g –call-graph –inverted –showcpuutilization –sort dso –symbol-filter=fsync
perf report -g flat –stdio –sort dso