Investigating Locks

Suresh Ramagiri

The most common cause of locking in a database is concurrent transactions trying to update the same row(s) of a table. As part of MVCC, locks are necessary to keep data consistent across transactions. Since locking presents to the application as latency, it's important to identify why locks are being held and to lessen or eliminate the impact of locking contention on the user experience.

To look for active locks in the system, you can use the following query:

sql> select * from system.lockman;

Example:

sql> select * from system.lockman;
+--------+---------------------+---------------------+---------+------------+--------+---------------------+-----------+
| nodeid | holder              | waiter              | wait_ms | timeout_ms | domain | key                 | exclusive |
+--------+---------------------+---------------------+---------+------------+--------+---------------------+-----------+
|     15 | 6268069961036267554 | 6268069963849242654 |   11977 |          0 |   4103 | 5979715258906695681 |         1 |
|     15 | 6268069961036267554 | 6268069968114098208 |   10980 |          0 |   4103 | 5979715258906695681 |         1 |
|     15 | 6268069961036267554 | 6268069975814778910 |    9249 |          0 |   4103 | 5979715258906695681 |         1 |
+--------+---------------------+---------------------+---------+------------+--------+---------------------+-----------+
3 rows in set (0.01 sec)

The holder is the XID of the transaction holding a lock, and the waiter is the XID of a transaction waiting for that lock to be released. This example clearly shows a transaction sitting on a lock for more than 11 seconds (in this case, an explicit write transaction that never committed).
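To see what each waiting transaction is running, you can join system.lockman to system.sessions on the waiter XID. The sketch below assumes system.sessions exposes the xid, session_id, source_ip, and last_statement columns shown in the next example:

sql> SELECT s.session_id, s.source_ip, s.last_statement, l.wait_ms
     FROM system.lockman l JOIN system.sessions s ON s.xid = l.waiter
     ORDER BY l.wait_ms DESC;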


To see which transaction is holding the most locks at any given time, you can use the following query:

SQL> SELECT * FROM system.sessions WHERE xid = (select holder FROM (select holder,count(1) cnt FROM system.lockman GROUP BY holder ORDER BY cnt desc limit 1) dummy)\G

The example below shows an explicit write transaction that has been left open without committing:

sql> SELECT * FROM system.sessions WHERE xid = (select holder FROM (select holder,count(1) cnt FROM system.lockman GROUP BY holder ORDER BY cnt desc limit 1) dummy)\G
*************************** 1. row ***************************
nodeid: 4
session_id: 3774276612
source_ip: 10.65.148.79
source_port: 33845
local_ip: 10.65.60.51
local_port: 3306
user: 5762026690960054275
database: foo
trx_state: open
statement_state: finished
xid: 6268069961036267554
cpu: 1
isolation: REPEATABLE-READ
last_statement: update SOME_TABLE set product_version = 25 where product_id = 'ABC123' and supplier_id ='foobar_id' and sku_id ='SKUABC123' and product_version = 24
time_in_state_s: 18715
created: 2016-07-11 09:27:16
heap_id: 72057594040818613
trx_age: 0
trx_mode: explicit
trx_counter_select: 0
trx_counter_insert: 0
trx_counter_update: 0
trx_counter_delete: 0
trx_is_writer: 0
1 row in set (0.01 sec)
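If the holder turns out to be an abandoned explicit transaction like the one above, ending its session releases the locks (and rolls back the open transaction), so confirm with the application owner first. Assuming the MySQL-compatible KILL command is available, you can kill it by session_id:

sql> KILL 3774276612;   -- session_id taken from the system.sessions output above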


Types of Locking


Serialized Locking Queue:


If the result of the above query keeps changing, this usually indicates a serialized locking queue. The most common cause of this type of locking is inserts or updates that affect overlapping rows. For example:

  • Query A inserts rows with unique id 1,2,3
  • Query B updates rows with unique id 2,4
  • Query C updates rows with unique id 4,1

Query B affects rows that are locked by Query A, so it must wait until Query A finishes to apply its changes. Query C, in turn, must wait for Query B to finish because it is updating the same rows. A long chain of these events can cause queries at the end of the chain to take a long time and, consequently, to have a large value for lockman_waittime_ms (see Historical Locking below).
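As a concrete sketch of that pattern, assume a hypothetical table named orders with a unique id column; the three sessions below serialize behind one another exactly as described:

-- Session A: holds locks on ids 1, 2, 3 until it commits
sql> BEGIN;
sql> INSERT INTO orders (id, status) VALUES (1, 'new'), (2, 'new'), (3, 'new');

-- Session B: blocks on id 2 until Session A commits
sql> BEGIN;
sql> UPDATE orders SET status = 'shipped' WHERE id IN (2, 4);

-- Session C: blocks on id 4 until Session B commits (and on id 1 until Session A commits)
sql> BEGIN;
sql> UPDATE orders SET status = 'shipped' WHERE id IN (4, 1);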

Ways to resolve / avoid Serialized Locking Queues:

  • Avoid concurrent updates to the same row(s).
  • Look for ways to improve the performance of the query causing locking contention (index and query tuning).


Bad Query / Missing Index:

A poorly written query or a query missing an index can often cause issues for the database. Overlapping write queries can cause cascading issues by making other queries against those rows or tables wait on locks.

To identify whether a query is missing an index or is otherwise underperforming, it is often helpful to look at the explain plan and the DDL of the tables involved to understand the cost of the plan, and then decide whether adding an index or refactoring the query would improve performance.
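For example, using the update from the system.sessions output earlier, you could inspect the plan and the table definition as follows. This sketch assumes EXPLAIN is supported for write statements; if it is not, explain an equivalent SELECT with the same WHERE clause:

sql> EXPLAIN UPDATE SOME_TABLE SET product_version = 25 WHERE product_id = 'ABC123' AND supplier_id = 'foobar_id' AND sku_id = 'SKUABC123' AND product_version = 24;

sql> SHOW CREATE TABLE SOME_TABLE\G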


Historical Locking:

To identify queries that have been held up by locking in the past, you can look at the query log, specifically `lockman_waits` and `lockman_waittime_ms`. A high lockman_waittime_ms with a large number of lockman_waits usually indicates a serialized locking queue, while a high lockman_waittime_ms with a single (or very small number of) lockman_waits usually indicates that another query took out a large table or row lock on the rows the logged query was trying to write to. You'll often be able to find the lock-holding query logged around the same time as the waiting query. You can look at the queries that had the most lockman_waittime_ms for a given day with the following:

[root@test001 ~]# /opt/clustrix/bin/clx cmd 'grep "^YYYY-MM-DD" /data/clustrix/log/query.log' > /tmp/YYYY-MM-DD_query.log

[root@test001 ~]# cat /tmp/YYYY-MM-DD_query.log | sed 's/^.*lockman_waittime_ms: \([0-9]*\).*$/\1 \0/' |sort -rn|less


Example (query log entries sorted by lockman_waittime_ms):

2016-03-22 21:40:05.449257 UTC ip-10-10-10-1 clxnode: INSTR SQLERR SID:149052431 db=test user=test@10.10.10.1 ac=N xid=56fca1e77ece381e sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:31:25', 78348504, 166562, '2016-03-22 21:31:25')" [Session killed due to closed client socket: ] time 520100.0ms; reads: 6; inserts: 15; deletes: 0; updates: 0; counts: 1; rows_read: 6; forwards: 13; broadcasts: 0; rows_output: 7; semaphore_matches: 0; fragment_executions: 15; runtime_ns: 13; cpu_waits: 1; cpu_waittime_ns: 1; bm_fixes: 39; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 2; lockman_waittime_ms: 517712; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0;attempts: 2

2016-03-22 21:41:18.389570 UTC ip-10-10-10-2 clxnode: INSTR SLOW SID:231803921 db=test user=test@10.10.10.2 ac=N xid=56fca211d2b06022 sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:33:51', 78348504, 84758, '2016-03-22 21:33:51')" [Ok: 1 rows affected] time 446739.8ms; reads: 3; inserts: 13; deletes: 0; updates: 0; counts: 1; rows_read: 3; forwards: 12; broadcasts: 0; rows_output: 2; semaphore_matches: 0; fragment_executions: 13; runtime_ns: 11; cpu_waits: 0; cpu_waittime_ns: 0; bm_fixes: 44; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 1; lockman_waittime_ms: 446699; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0; attempts: 1

2016-03-22 21:33:09.726210 UTC ip-10-10-10-1 clxnode: INSTR SLOW SID:149049359 db=test user=test@10.10.10.1 ac=N xid=56fca1791d45d81e sql="INSERT INTO `foo` (`account_id`, `created_at`, `member_id`, `sub_list_id`, `updated_at`) VALUES (7825, '2016-03-22 21:31:20', 78348504, 166562, '2016-03-22 21:31:20')" [Ok: 1 rows affected] time 109261.8ms; reads: 3; inserts: 13; deletes: 0; updates: 0; counts: 1; rows_read: 3; forwards: 12; broadcasts: 0; rows_output: 2; semaphore_matches: 0; fragment_executions: 13; runtime_ns: 11; cpu_waits: 0; cpu_waittime_ns: 0; bm_fixes: 47; bm_loads: 0; bm_waittime_ns: 0; lockman_waits: 1; lockman_waittime_ms: 109232; trxstate_waits: 0; trxstate_waittime_ms: 0; wal_perm_waittime_ms: 0; bm_perm_waittime_ms: 0; attempts: 1


In the case of the longest waiter from the example logs above, the query was waiting on two different locks for a total of 520100.0 ms (520 seconds) and was then killed by an automated transaction killer; presumably the client net_write_timeout is set to 500 seconds. The query also has a very small value for runtime_ns (13), which means it spent almost all of its time waiting on locks.
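To find the query that was holding those locks, it often helps to search the same day's log for other writes to the same table in the window leading up to the waiter's timestamp. A rough sketch using the sample log above (the time window and table name are taken from that example and are only placeholders):

[root@test001 ~]# awk '$2 >= "21:30" && $2 <= "21:41"' /tmp/2016-03-22_query.log | grep 'foo'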


Assuming the locking is caused by the queries already identified, those inserts/updates should only lock the rows they directly affect. The overlap between them is where queries stack up on locks. You can imagine a large build-up if you have a few queries each taking 30+ seconds waiting on each other to complete. In the previous example, you'd get the following result:

  • Query A - Takes the time of Query A
  • Query B - Takes the time of Query A & B
  • Query C - Takes the time of Query A & B & C

The more overlap, the longer the wait, and the worse things get. If you're able to reduce the overlap and/or the runtime of the individual queries, you should see less locking overall. Running writes in smaller batches, especially when overlap is expected, should help mitigate the issue.
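As a hedged sketch of batching, assume a hypothetical table named foo with a numeric id primary key. Updating a bounded slice of rows per transaction keeps each transaction short, so its row locks are released quickly; the application advances the id range and repeats until no rows remain:

sql> BEGIN;
sql> UPDATE foo SET updated_at = NOW() WHERE id >= 1 AND id < 1001;   -- one 1,000-row slice
sql> COMMIT;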
