How to Troubleshoot Replication Slave Lag

Henry Hwang

Replication Slave Lags

With replication, you may experience occasional slave lag, where the slave processes events some seconds behind when the master originally executed them. Often, the slave will catch up on its own, but in cases where it doesn't, or if you want to know what might have caused the lag, the following may help with troubleshooting slave lags.

Common Causes of Slave Lags

Some common causes of replication slave lags include:

  • CPU load: the cluster is too busy to process replication events in a timely manner.
  • missing indexes: no primary key on the table being updated or, with SBR, missing the correct indexes.
  • lock wait timeouts: lock wait timeouts caused by an uncommitted transaction on the slave.
  • general volume of writes: an increased volume of writes, perhaps from growth in site traffic or from data migrations with bulk deletes or updates.
  • data consistency: frequent slave stops and starts due to data consistency issues (duplicate keys or row not found).
  • long running aggregate queries (SBR): aggregate queries in SBR taking a long time to process, resulting in a backlog of queries waiting to be processed.
  • data distribution: target table not evenly distributed across all the nodes.
  • network issues: self-explanatory.

Before we jump into the different causes of slave lags, we should set some expectations and review some replication basics.

Expectations

Processing write events will be slower on the slave than on the master because replication is a single-threaded process, while on the master write queries can be processed in parallel.

For most customers, a single binlog is more than adequate for their replication needs, but in write-heavy environments, having multiple binlogs may be a viable option.

See our documentation: Configuring Replication

Types of Replication

The two types of replication are:

  • SBR (Statement-based replication): good for bulk writes, simpler to troubleshoot.
  • RBR (Row-based replication): good for small writes where the slave processes the row changes rather than processing the query, generally more efficient.

To determine what type of replication your master cluster is running, you can run:

show binlogs\G

The format value will tell you what type of replication you're running for that particular binlog:

mysql> show binlogs\G
*************************** 1. row ***************************
   log_name: test
       head: 6257570335613048834
log_all_dbs: 1
     format: row
    version: 2
   segments: 1
      bytes: 524288
1 row in set (0.08 sec)

Knowing the type of replication is useful, as there are different tables and stats associated with each replication type (covered later). It also means that if you're running RBR, the queries that generated the writes aren't logged in the query logs of the slave, so you would need to look on the master to identify problematic queries.

Query Logs

Reviewing query and clustrix log entries will be important in helping you identify why the slave is lagging. You'll need to follow the steps in the documentation How To Read the Query Logs to extract the log entries.

Note that when troubleshooting slave lags, you can usually ignore long running SELECT statements since those read queries are not written to the binlogs and replicated to the slave. But do keep in mind that the master or slave could be hit hard by poorly performing SELECT queries. Don't automatically eliminate the SELECT queries as a culprit.

Generally for RBR, you would want to look on the master for long-running write queries or for write queries that affect a large number of rows as a potential cause of slave lag.

For SBR, you should look at the query logs on the slave cluster to see what SLOW writes are coming across.

Clustrix Logs

Slave issues are logged in the clustrix logs on the node that the slave process is running on. You can find out where the slave is running with:

SELECT slave_name,
       master_host,
       nodeid
FROM system.mysql_repconfig;

Note that the slave process isn't always bound to the same node; it can move after a group change.

The log entries will note when the slave stopped and usually why it stopped along with the last successful log position and log file that it had read. The cluster will also log when the slave restarted and what log position it was requesting.

Here are some sample clustrix.log entries:

Aug 6 16:46:54.164643 UTC 2014 clxnode002 real: ALERT SLAVE_STOP WARNING Stopped mysql slave clxslave (10.0.1.10:3306) on non-transient error: Duplicate key in container: test.__idx_email__PRIMARY (support@clustrix.com). Statement: COMMIT, Database: (master_log.015688, position 89420179)

Aug 6 16:47:02.750361 UTC 2014 clxnode002 real: INFO mysql/replication/slave_driver.ct:112 mysql_slave_sig_migrated(): User stopping mysql slave "clxslave"

Aug 6 16:47:04.096524 UTC 2014 clxnode002 real: INFO mysql/replication/slave_driver.ct:105 mysql_slave_sig_migrated(): User starting mysql slave "clxslave"

Aug 6 16:47:04.128394 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:804 mysql_slave_initialized(): mysql slave "clxslave" writing to binlogs per system.mysql_slave_log_updates

Aug 6 16:47:04.128844 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:723 mysql_slave_session(): mysql slave "clxslave" connecting to 10.0.1.10:3306

Aug 6 16:47:04.128884 UTC 2014 clxnode002 real: INFO mysql/replication/slave.ct:690 mysql_slave_connected(): mysql slave "clxslave" opened relay log for 10.0.1.10:3306 (requesting "master_log.015688, position 89419912")

Aug 6 16:47:04.549432 UTC 2014 chndb002 real: INFO mysql/replication/slave_trx.c:129 check_skip_evt(): slave clxslave (10.0.1.10:3306) (master_log.015688, position 89420179) skipped transaction (remaining 0): t_mysql_query

To familiarize yourself with how to read the clustrix log entries, we'll pick out some details from the sample log entries above:

  • The slave stopped when reading the event at log position 89420179 because of a duplicate key on the primary key of the table "test" with the value "support@clustrix.com".
  • The binlog file it was reading from is master_log.015688, and the last successful log position for the replication event is 89419912.
  • The slave then automatically stops and restarts after skipping one event.

Note that the slave skipping behavior is configurable. See our documentation: Controlling Slave Behavior on Errors

You'll need the log position and binlog name as a starting point for extracting binlog details to identify what is coming across the replication stream. See our FAQ Extracting binlog data for Troubleshooting for how to extract binlog events.

Slave Status

The first thing to check is whether the slave is processing events from the replication stream. You can get the slave status with:

show slave status\G

The thing to note is whether the log position is "moving" and whether the relay log is growing. If it is, you know that the slave is processing the replication stream and that it's getting events from the master.

Also pay attention to the last error. If it says duplicate key or lock wait timeout, the cause is obvious; go to the appropriate section below. Otherwise, you will need to do some further troubleshooting.

CPU Load

If a cluster is busy either on the master or slave, replication can be affected.

A heavily loaded master may not be able to feed the slave fast enough while a heavily loaded slave (in a master-master setup) may not be able to process the replicated events in a timely manner. To see the load and what the load trend has been, you can run the following commands on the cluster:

mysql> show load;
+--------------+------+
| facility     | LOAD |
+--------------+------+
| bm_miss_rate |    0 |
| cpu          |  0.4 |
| disk_util    |  2.8 |
+--------------+------+
3 rows in set (0.19 sec)
mysql> SELECT * FROM clustrix_statd.statd_history NATURAL JOIN clustrix_statd.statd_metadata WHERE name='clustrix.cpu.load_avg' ORDER BY TIMESTAMP DESC LIMIT 10;
+-----+---------------------+----------+-----------------------+
| id  | timestamp           | value    | name                  |
+-----+---------------------+----------+-----------------------+
| 643 | 2016-03-01 21:40:00 |   0.1207 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:35:00 | 0.124875 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:30:00 | 0.117828 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:25:00 | 0.106306 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:20:00 | 0.124611 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:15:01 | 0.104327 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:10:00 | 0.113514 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:05:00 | 0.115097 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 21:00:00 | 0.107797 | clustrix.cpu.load_avg |
| 643 | 2016-03-01 20:55:00 | 0.113899 | clustrix.cpu.load_avg |
+-----+---------------------+----------+-----------------------+
10 rows in set (0.00 sec)

If the CPU load average is higher than normal, you'll need to identify what's causing the heavy load. This could be due to:

  • volume of queries
  • long running queries (perhaps analytics or queries that generate lots of writes)
  • poorly optimized queries (queries not using the correct indexes or tables that are missing indexes)

You might have fast writes that are being processed in a timely manner on the master, but due to the volume of writes, the slave starts to lag. The queries themselves may not get logged in the query log because their execution times don't exceed the slow logging threshold; however, you can query the clustrix_statd.qpc_history table to see how frequently your updates, deletes, or inserts are running with the following query on the master (amend the timestamp and statement accordingly):

SELECT TIMESTAMP,
       STATEMENT,
       sum(exec_count),
       avg(avg_exec_ms)
FROM clustrix_statd.qpc_history
WHERE rank < 10
  AND TIMESTAMP > '2014-10-10 12:40:00'
  AND STATEMENT LIKE 'update%'
GROUP BY query_key
ORDER BY TIMESTAMP ASC LIMIT 3 \G

If it shows lots of updates during the stats period, then it's a likely culprit. For example, 200,000 executions of a simple update query in 15 minutes may not get logged in the query logs, but you should be able to catch this by looking at qpc_history.

You can also use the query log entries on the master to identify bulk writes that may be causing the slave to lag. Although 130,000 rows updated doesn't sound like much, 30 such updates in a very short amount of time can slow down the slave:

Aug 4 21:22:35.449849 UTC 2014 clxmaster003 real: INSTR SLOW SID:3780625411 db=test user=user1@10.100.0.20 ac=Y xid=53e0fd9420ba4006 sql="UPDATE bigtable SET TIME_UPDATED=now(),DELETED=0 WHERE ID=371" [Ok: 138756 rows updated] time 27487.7ms; reads: 277513; inserts: 416268; deletes: 0; updates: 277512; counts: 138756; rows_read: 416268; forwards: 832543; rows_output: 2; semaphore_matches: 0; runtime: 3735902785; attempts: 1

Note the number of rows updated in the square brackets.

You can usually find potential long-running queries with the following query (depending on your version of clustrix) and see how long each query has been in that state:


SELECT *
FROM system.sessions
WHERE statement_state='executing'
AND time_in_state_s>1
ORDER BY time_in_state_s DESC LIMIT 10 \G

If the same type of query shows up multiple times and is taking a long time to process, it could indicate a missing index or a lock wait timeout (the latter would show up in the query logs).
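
To spot the same statement showing up across many sessions at once, a query along the following lines can help. This is a minimal sketch; it assumes the same system.sessions columns used in the query above, and column names can vary between versions:

SELECT last_statement,
       count(*) AS executing_sessions,
       max(time_in_state_s) AS longest_s
FROM system.sessions
WHERE statement_state='executing'
GROUP BY last_statement
ORDER BY executing_sessions DESC LIMIT 10;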

Indexing

Determining whether a query is missing an index and how to troubleshoot it is a whole other topic, but we'll briefly cover one scenario.

One RBR scenario is that the slave lag continues to grow and the relay log is changing, but the log position isn't moving. This may be caused by a bulk delete or update where there isn't a primary key on the target table, even though there might be a corresponding index. You should check the create table definition for the table being written to on the slave and determine whether it has a primary key.

For example, if this query was processed on the master and it had deleted a million rows:

delete from foo where start_date<'2014-01-01'

and the DDL for the table has:

mysql> show create table foo \G
*************************** 1. row ***************************
       Table: foo
Create Table: CREATE TABLE `foo` (
  `id` bigint(16),
  `bar` bigint(16),
  `start_date` date,
  `end_date` date,
  KEY `start_date` (`start_date`) /*$ DISTRIBUTE=1 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=1 */

The solution here would be not only to add a primary key to the table, but also to replicate the query as a statement-based query if your replication defaults to RBR. You may also want to include a limit to reduce the write load:

alter table foo add primary key (id);
set session binlog_format='statement';
delete from foo where start_date<'2014-01-01' limit 100000;
# repeat the delete until all rows are deleted

If you're running SBR, the query being processed could be poorly optimized, or the table may not have the correct indexes. You'll need to identify the offending write(s) by looking at the query logs or qpc_history on the master and see if you can find any writes that are slow to execute.

You can then look at the actual query and see whether it has the correct search arguments to use the indexes in the table definition.
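
For example, you can run EXPLAIN against the write (or an equivalent SELECT with the same search arguments) to see whether an index is actually being used. A minimal sketch using the bigtable update from the query log entry earlier; the statements are illustrative only:

EXPLAIN UPDATE bigtable SET TIME_UPDATED=now(), DELETED=0 WHERE ID=371;

# if your version doesn't support EXPLAIN on writes, check the equivalent SELECT
EXPLAIN SELECT * FROM bigtable WHERE ID=371;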

Lock Wait Timeouts

You might get the following "lock wait timeout exceeded" error for the slave process in your sprout or clustrix log:

Jul 17 18:07:38.393608 UTC 2014 clxnode002 real: INFO mysql/replication/slave_state.c:175 mysql_slave_state_reconnect(): Stopped mysql slave clxslave (10.1.0.10:3306) on transient error, scheduling retry: Lock wait timeout exceeded: . Statement: COMMIT, Database: (master_log.014723, position 5397030)

or you might see this "Lock wait timeout exceeded" in the Last_Error field when you run the show slave status command:

mysql> show slave status \G
*************************** 1. row ***************************
           Slave_Name: clxslave
         Slave_Status: Reconnecting
          Master_Host: 10.1.0.10
          Master_Port: 3306
          Master_User: replication
      Master_Log_File: master_log
        Slave_Enabled: Enabled
         Log_File_Seq: 14723
         Log_File_Pos: 5397030
           Last_Error: Lock wait timeout exceeded: . Statement: COMMIT, Database:
    Connection_Status: Disconnected
 Relay_Log_Bytes_Read: 0
Relay_Log_Current_Size: 0
Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

This occurs because the slave event is trying to acquire a lock to update a row or table that is currently locked by another process that hasn't committed. After five minutes (the default), the slave generates this error if it still can't acquire the lock.

You should be able to find the offending transaction if it's still uncommitted with:

SELECT *
FROM system.sessions
WHERE xid IN
    (SELECT holder
     FROM system.lockman
     ORDER BY holder ASC LIMIT 1)\G

Here's a sample of what the output might look like:

mysql> select * from system.sessions where xid in (select holder from system.lockman order by holder asc limit 1) \G
*************************** 1. row ***************************
            nodeid: 5
        session_id: 1977584645
        source_ip: 10.1.1.100
      source_port: 13574
         local_ip: 10.1.0.10
       local_port: 3306
             user: 5711972934625828866
         database: prod
        trx_state: open
  statement_state: executing
              xid: 6037483960283521034
        isolation: REPEATABLE-READ
   last_statement: UPDATE foo set bar=1
    time_in_state: 34
          created: 2014-07-17 10:27:48
          trx_age: 1517
         trx_mode: explicit
trx_counter_select: 3
trx_counter_insert: 0
trx_counter_update: 42
trx_counter_delete: 0
    trx_is_writer: 1
1 row in set (0.07 sec)

In this example, the trx_mode is explicit, which tells us that this transaction is not set to auto-commit. You can then notify the user to commit the transaction, or you can kill the session.
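
If you do decide to kill it, the session_id from the output above is what you need. This is a hedged example that assumes MySQL-style KILL syntax; check the documentation for the exact form supported by your version:

KILL 1977584645;  # hypothetical: terminates the blocking session identified above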

Slave Process Stuck

In rare cases, the slave process may get stuck: the slave stops due to an error, and when it restarts, the previous slave process still hasn't terminated.

This manifests on the slave cluster with the following symptoms:

  • slave lag increasing
  • bigc growing
  • undo space growing
  • relay log doesn't grow
  • stats in the system.transactions table for the slave's XID don't increase
  • slave process is the oldest transaction; you can check with the query below

select * from system.sessions where xid in (select xid from system.transactions order by xid asc limit 1) \G  # check whether user=4100, which would be the slave process

If you confirm that the slave is stuck, see if you can kill the session. If that doesn't work, contact support.


Issues on the Master

If the CPU load on the slave is low, there don't seem to be any long-running queries on the slave, and there aren't any indexing issues, you might want to have a look at the master cluster.

  • Any errors in the master logs (clustrix.log)?
  • Is the CPU load on the master high?
  • Is there a big write event being replicated?
  • Are there lots of small writes (hundreds of thousands of small updates, deletes or inserts)?


Volume of Writes

Are there more binlog files being created than normal? Perhaps there is a migration and data is being duplicated as part of it, or perhaps overall writes have grown and you need to plan for multiple binlogs. This is more readily apparent with RBR, but it could also show up in SBR environments if there is a significant increase in write statements.

You can query the stats tables to find out how much the binlogs have grown within a time period by looking at the difference in the binlog sizes and charting it:

SELECT (max(value) - min(value))/1024/1024/1024 AS Binlog_Growth_GB
FROM clustrix_statd.statd_history
NATURAL JOIN clustrix_statd.statd_metadata
WHERE TIMESTAMP BETWEEN timestampadd(DAY,-1,now()) AND now()
  AND name = 'clustrix.capacity.disks.binlog.size_bytes';
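
A variation of the same query, grouped per hour, makes the growth easier to chart. This is a sketch that assumes the same statd tables and metric name as the query above:

SELECT date_format(TIMESTAMP, '%Y-%m-%d %H:00') AS hour,
       (max(value) - min(value))/1024/1024 AS Binlog_Growth_MB
FROM clustrix_statd.statd_history
NATURAL JOIN clustrix_statd.statd_metadata
WHERE TIMESTAMP BETWEEN timestampadd(DAY,-1,now()) AND now()
  AND name = 'clustrix.capacity.disks.binlog.size_bytes'
GROUP BY hour
ORDER BY hour ASC;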

For RBR, you can count the number of binlogs created within a specific time period (like per day) and see whether the lag corresponds to the increase in binlogs:

tee binlogfiles.todaysdate;
show all binlog files;

For SBR, you could see whether there is an increase in write commands (deletes, updates, inserts) and compare that with the previous period:

mysql> select name, timestamp, max(value) - min(value) as Exec_count from clustrix_statd.statd_history natural join clustrix_statd.statd_metadata where timestamp between timestampadd(day,-1,now()) and now() and name in ('clustrix.stats.Com_delete','clustrix.stats.Com_update','clustrix.stats.Com_insert') group by name limit 10;
+---------------------------+---------------------+------------+
| name                      | timestamp           | Exec_count |
+---------------------------+---------------------+------------+
| clustrix.stats.Com_update | 2014-08-05 15:15:00 |     196959 |
| clustrix.stats.Com_insert | 2014-08-05 15:10:00 |   17493971 |
| clustrix.stats.Com_delete | 2014-08-05 14:20:01 |       5494 |
+---------------------------+---------------------+------------+
3 rows in set (0.00 sec)

Charting the results would make it easier to visualize and identify increased binlog writes.

Data Consistency

Frequent slave stops due to duplicate keys or row not found will generate additional slave lag, as it may take up to 10 seconds for the slave to restart if it has been configured to skip these events. Ideally, you shouldn't be in this situation, as it indicates a data consistency issue. This occurs mainly in a master-master setup where the application may have mistakenly written the same data to both clusters, but it could also occur in a master-slave configuration where your development team has inadvertently written to the slave.


Long Running Analytic Queries

Since replication is single-threaded, if you have complex writes that take a long time to process in SBR, the slave will lag until it has finished processing those queries and all of the others that have queued up behind them.
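
One way to mitigate this is to replicate the heavy aggregate write as row events, so the slave applies the resulting rows instead of re-executing the aggregate. The following is a hedged sketch; it assumes switching a session to binlog_format='row' works the same way as the 'statement' example earlier, and the table and column names are purely illustrative:

SET SESSION binlog_format = 'row';  # replicate this write as row events

# hypothetical aggregate write; daily_totals and orders are illustrative names
INSERT INTO daily_totals (report_day, total)
SELECT date(created), sum(amount)
FROM orders
GROUP BY date(created);

SET SESSION binlog_format = 'statement';  # switch back if SBR is your session default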

Data Distribution

If a table is poorly distributed, for example, having fewer slices than the number of nodes in the cluster, you're not taking advantage of all of the available resources. Such a table may be "hot" because the writes aren't being spread evenly across all the nodes. You can find your hot tables by looking at hotness_history (amend the value in the brackets):

SELECT *
FROM clustrix_statd.hotness_history
WHERE write_rank < 20
  AND database NOT IN ('system','clustrix_statd','_replication')
  AND total_replicas < [### 2 * number_of_nodes###]
ORDER BY TIMESTAMP DESC, write_rank ASC LIMIT 50;

Alternatively, you can use the hotness tool (https://support.clustrix.com/entries/23261163-Hotness) to see what is hot at the moment.
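
Once you've identified a hot, under-sliced table, re-slicing it so that writes spread across more nodes may help. A minimal sketch, assuming your version supports changing the slice count via ALTER TABLE and using the foo table as a placeholder:

ALTER TABLE foo SLICES = 6;  # pick a slice count appropriate for your cluster size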

Network Issues

Your network monitoring system or your NOC should notify you when there are network issues that would impact replication.  In the event that they don't, you might notice that the slave lag jumps up from 0 to a high value:

mysql> select * from clustrix_statd.statd_metadata natural join clustrix_statd.statd_history where name like '%seconds_behind%' order by timestamp desc limit 200;
+------+-----------------------------------------------------------+---------------------+-------+
| id   | name                                                      | timestamp           | value |
+------+-----------------------------------------------------------+---------------------+-------+
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:55:05 |  1486 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:51:06 |  2095 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:45:24 |  NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:40:05 |  3676 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:35:06 |  NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:32:13 |  5575 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:25:03 |  NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:20:07 |  6569 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:15:28 |  6438 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:10:07 |  6671 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:04:49 |  6864 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-21 00:00:24 |  NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:56:25 |  7155 |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:50:04 |  NULL |
| 1327 | clustrix.replication.slave.seconds_behind_master.clxslave | 2014-11-20 23:44:49 |     0 |

The sprout or clustrix log on the node where the slave is running would have logged this:

Nov 20 23:49:08.216590 UTC 2014 clxnode001 real: INFO mysql/replication/slave_state.c:175 mysql_slave_state_reconnect(): Stopped mysql slave clx-slave (10.0.0.100:3306) on transient error, scheduling retry: Could not connect to master (Connection reset by peer)


Multiple Binlogs

In some cases, there may be too many writes for a single binlog to handle. This could be from growth of your site or from cyclical demand such as holiday periods. A workaround is to have multiple binlogs, where non-dependent writes are logged to their own binlogs. For example, you may have writes to a user table and writes to a message table. If they're in different schemas and don't have any dependency on each other, you could set up a binlog per schema so replication events for one schema don't get queued up behind writes to the other schema.

Please contact support for more information on how to do this.


Extracting Binlog Events

This FAQ covers how to extract events from binlogs: How To Extract Binlog Events.

Stats

For RBR, these tables provide some insight into RBR stats (the stats in the following two tables reset every time the slave restarts):

  • system.slave_row_stats: select * from system.slave_row_stats order by event_bytes desc limit 10; # shows the top ten tables that are being written to since the slave started.
  • system.mysql_slave_stats: select * from system.mysql_slave_stats; # look for bytes_per_sec and events_per_sec and whether batches value is incrementing.

The above stats might tell you which tables are "hot" and help you in finding write queries for those tables.

For SBR, these stats on the slave may be useful in identifying the volume of statements being replicated:

  • clustrix.stats.Com_delete
  • clustrix.stats.Com_update
  • clustrix.stats.Com_insert

when used in conjunction with the stats in

  • clustrix_statd.qpc_history

and with query log entries.


Best Practices

  • If you have multiple slave processes, point them to different nodes on the master cluster rather than the VIP host.
  • RBR is generally more efficient, but you do need to set your bulk writes to be statement based (set session binlog_format='statement') to not flood the slave with lots of write events.
  • If your write volumes are too much for one binlog, separate your binlog writes into different binlogs if the writes aren't dependent.

 
