Extracting binlog data for troubleshooting replication issues

Henry Hwang

For most replication errors, the cause can be found in the output of show slave status. However, if you're troubleshooting slave lag or need more detail about what is in a replicated event, you can use mysqlbinlog (a MySQL utility) or repclient (a Clustrix utility) to extract binlog events.

Binlogs in Clustrix

Unlike MySQL, which writes binlogs to their own individual files, Clustrix stores binlogs within the database storage layer so that they have the same protection as database data.
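If you want to confirm which binlog the master cluster is currently writing to, the MySQL-compatible status command below should work from a mysql client connected to any node (an assumption; support may vary by ClustrixDB version):

show master status;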

Causes of Slave Stops

Most common slave stops are due to:

  • Row not found: the row on the slave doesn't match the row details of the replicated event.
  • Duplicate key in container: a row with the same primary key already exists on the slave.

Although these errors are fairly self-explanatory, you would still want to use mysqlbinlog or repclient to determine whether the replication event is a standalone query or part of a larger explicit transaction. If it's the latter, you wouldn't want to skip the event, as you may also skip other writes in that explicit transaction.

Please see Controlling Slave Behavior on Error for the settings that control skipping events, as well as how closely a row must match for a successful write.

Less common errors include:

  • arity errors (the table definition doesn't match the replicated event)
  • foreign key constraints

Sample Errors

Here is some sample show slave status output showing the last successful log position read, the binlog file name, the master host, the replication username, and the reason the slave stopped. You'll need some of this information when extracting data from the binlogs.
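These samples were captured on the slave cluster with:

show slave status\G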

*************************** 1. row ***************************
 Slave_Name: default
 Slave_Status: Errored
 Master_Host: 10.0.0.10
 Master_Port: 3306
 Master_User: repl
 Master_Log_File: default
 Slave_Enabled: Enabled
 Log_File_Seq: 28
 Log_File_Pos: 45532636
 Last_Error: Duplicate key in container: test_tbl2.__idx_test_tbl2__PRIMARY (314159). Statement: insert into test_tbl2 (id) values (314159), Database: test
 Connection_Status: Disconnected
 Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
 Seconds_Behind_Master: NULL
 
 *************************** 1. row ***************************
 Slave_Name: default
 Slave_Status: Errored
 Master_Host: 10.0.0.10
 Master_Port: 3306
 Master_User: repl
 Master_Log_File: default
 Slave_Enabled: Enabled
 Log_File_Seq: 2485
 Log_File_Pos: 489659967
 Last_Error: Row Not Found: for delete on test.TEST_TBL3 at default2485.489660488 where TEST_TBL3.__idx_TEST_TBL3__PRIMARY(id)=(271828). Statement: COMMIT, Database:
 Connection_Status: Disconnected
 Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
 Seconds_Behind_Master: NULL
1 row in set (0.00 sec)
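Before deciding how to handle either of these errors, it can help to check the slave's copy of the row in question. A quick sketch using the table and key values from the errors above (adjust the names to your schema):

-- Duplicate key: does the conflicting row already exist on the slave?
select * from test.test_tbl2 where id = 314159;

-- Row not found: is the row the replicated delete expects actually missing?
select * from test.TEST_TBL3 where id = 271828;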

A sample SBR slave error:

*************************** 1. row ***************************
 Slave_Name: default
 Slave_Status: Reconnecting
 Master_Host: 10.0.0.10
 Master_Port: 3306
 Master_User: repl
 Master_Log_File: default
 Slave_Enabled: Enabled
 Log_File_Seq: 8051
 Log_File_Pos: 56376112
 Last_Error: Can't find any matching row in the test_tbl4 table: year '1666' does not exist. Statement: update test_tbl4 set event="Great Fire" where year=1666, Database:
 Connection_Status: Disconnected
 Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
 Seconds_Behind_Master: NULL
1 row in set (0.00 sec)

Depending on your workload, you may need to extract binlog events just to make sure that the event is a standalone query and not part of a larger transaction. Otherwise, if you skip the event and it is part of a larger transaction, you will be skipping all of the writes in that explicit transaction.

Using the last successful log position and log file name from the show slave status output, we can extract event details from the binlogs to help troubleshoot why a slave stopped or is lagging.

Differences between mysqlbinlog and repclient

Both the mysqlbinlog and repclient utilities do the same thing, namely extracting events from binlogs. The main differences are the output format and how readable it is; in addition, mysqlbinlog doesn't understand all Clustrix replication events.

The nice thing about repclient is that you can specify how many events (-count=###) to extract from the starting log position. repclient can also be used to generate raw binlog dumps from Clustrix so that you can replay the events on another slave or use them as part of a backup plan.

The useful aspect of mysqlbinlog is that it presents the write event in an easier-to-understand format, especially with row-based replication (RBR). With RBR, only the row changes are replicated, so you won't be able to get the actual query that was processed to generate the write events.

You could also use these tools for troubleshooting SBR issues.

Parameters

The main parameters for either command are:

  • log position: the last successful event processed is shown in the show slave status output on the slave cluster.
  • binlog file name: consists of the binlog name and a six digit suffix that increments when a new binlog is created.
  • master host: IP address or hostname of one of the nodes on the master cluster.
  • username and password: to access the binlog data.

If the slave is stopped, show slave status shows the last successful event it processed, along with that event's log position and the binlog it was reading from.
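As a sketch of how those parameters map to each utility (placeholders shown; substitute your own binlog name, sequence, position, host, and credentials):

repclient -logname <binlog_name>.<seq> -logpos <log_position> -count 100 -verbose > binlog.output
mysqlbinlog -u <repl_user> -h <master_host> -p --verbose --base64-output=ALWAYS --hexdump --start-position=<log_position> -R <binlog_name>.<seq> > mysqlbinlog.output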

Comparison of output between repclient and mysqlbinlog

Here's some output from both utilities querying the same binlog. Before extracting binlog events, cd to a filesystem that has sufficient free space, as the binlog extract files can grow very quickly.

cd /clustrix # or wherever you have the most available space
repclient -logname statd.006870 -logpos 59900827 -count 100 -verbose > binlog.output

Look for the BEGIN that defines the start of the transaction:

00000000 42 45 47 49 4E |BEGIN |
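If the extract file is large, a quick way to locate transaction boundaries is to search for the decoded markers (a simple sketch against the output file created above):

grep -n -E 'BEGIN|COMMIT' binlog.output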

The next section shows the table mapping for statd.statd_history. Replication events that write to multiple tables will have an entry like this for each of the tables:

}
2014-11-14 12:47:06.599042 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType19 [pos: 59900886, log_offset: 59900943]
message mysql_table_map {
 header = message mysql_event_header {
 timestamp = 1415960473
 type = 19
 server_id = 2101898466
 size = 57
 log_offset = 59900943
 flags = 0
 }
 table_id = 1
 flags = 0
 db = "statd"
 table = "statd_history"
 columns = [
 mysql_type_long,
 mysql_type_long,
 mysql_type_datetime,
 mysql_type_double
 ]
 }

The next event is the actual INSERT; in this case, mysql_write_rows = INSERT. Note the table_id value: if the event has multiple table mappings, the table_id identifies which table the write is being applied to.

2014-11-14 12:47:06.599135 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType23 [pos: 59900943, log_offset: 59900997]
message mysql_write_rows {
 header = message mysql_event_header {
 timestamp = 1415960473
 type = 23
 server_id = 2101898466
 size = 54
 log_offset = 59900997
 flags = 0
 }
 table_id = 1
 flags = 0
 used_columns = [1111]
 rowdata =
00000000 F0 6D 00 00 00 DF 2F 00 00 F5 90 F3 77 51 12 00 |.m..../.....wQ..|
00000010 00 00 00 A0 69 8F 5C E5 41 |....i.\.A |

}

This next section is the actual data being inserted (values in parentheses). So the event is inserting the row (109, 12255, '2014-11-14 10:20:05', 2867100493):

1 Rows:
 [t_vu32(109), t_vu32(12255), t_datetime(2014-11-14 10:20:05), t_vf64(2867100493)]
2014-11-14 12:47:06.599270 UTC eukanuba002 INFO mysql/replication/repclient.c:445 repdump_event(): MsgType23 [pos: 59900997, log_offset: 59901051]
message mysql_write_rows {
 header = message mysql_event_header {
 timestamp = 1415960473
 type = 23
 server_id = 2101898466
 size = 54
 log_offset = 59901051
 flags = 0
 }
 table_id = 1
 flags = 0
 used_columns = [1111]
 rowdata =
00000000 F0 6E 00 00 00 A8 41 00 00 FD 8E F3 77 51 12 00 |.n....A.....wQ..|
00000010 00 00 00 00 00 00 00 00 00 |......... |

There were several additional mysql_write_rows (inserts) until the transaction completed with a COMMIT:

00000000 43 4F 4D 4D 49 54 |COMMIT |
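Putting the decoded pieces together, the transaction is logically equivalent to something like the following (illustrative only; RBR events carry column positions rather than names):

begin;
insert into statd.statd_history values (109, 12255, '2014-11-14 10:20:05', 2867100493);
insert into statd.statd_history values (110, 16808, '2014-11-14 10:15:01', 0);
-- ...the remaining inserts from the transaction...
commit;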

Compare the repclient output with the output from mysqlbinlog:

When using mysqlbinlog to extract events from the binlog, include the --hexdump option; it tells you when a transaction starts and commits, though it also means more data to scroll through.

mysqlbinlog --verbose --base64-output=ALWAYS --hexdump --start-position=59900827 -R statd.006870 > mysqlbinlog.output
Press ctrl-c after a second or two; with -R (--read-from-remote-server), mysqlbinlog keeps streaming events from the master.

There's some hex output along with some log positions, but again we want to look for the BEGIN:

# Start: binlog v 4, server v 5.0.45 created 700101 0:00:00
BINLOG '
AAAAAA/ibEh9ZgAAAAAAAAAAAAQANS4wLjQ1AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAUwAEGggAAAAICAgC
'/*!*/;
# at 59900866
#141114 10:21:13 server id 2101898466 end_log_pos 59900886
# Position Timestamp Type Master ID Size Master Pos Flags
# 39203c2 99 d7 65 54 02 e2 6c 48 7d 3b 00 00 00 d6 03 92 03 00 00
# 39203d5 04 bc b8 0f 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# 39203e5 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 39203f5 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|
#
BINLOG '
mddlVALibEh9OwAAANYDkgMAAAS8uA8AAAAAAAAAFQAAAAAAAAEAAAgAAAAAAAQhACEAIQAAQkVH
SU4=
'/*!*/;
# at 59900925
# at 59900982
# at 59901036
# at 59901090
# at 59901144
# at 59901198

Here's the event where the table is mapped:

#141114 10:21:13 server id 2101898466 end_log_pos 59900943
# Position Timestamp Type Master ID Size Master Pos Flags
# 39203fd 99 d7 65 54 13 e2 6c 48 7d 39 00 00 00 0f 04 92 03 00 00
# 3920410 01 00 00 00 00 00 00 00 05 73 74 61 74 64 00 0d |.........statd..|
# 3920420 73 74 61 74 64 5f 68 69 73 74 6f 72 79 00 04 03 |statd.history...|
# 3920430 03 0c 05 01 08 08 |......|
# Table_map: `statd`.`statd_history` mapped to number 1

[...some lines deleted...]

#141114 10:21:13 server id 2101898466 end_log_pos 59901219
# Position Timestamp Type Master ID Size Master Pos Flags
# 392050e 99 d7 65 54 02 e2 6c 48 7d 3c 00 00 00 23 05 92 03 00 00
# 3920521 04 bc b8 0f 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# 3920531 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 3920541 21 00 00 43 4f 4d 4d 49 54 |...COMMIT|

After this bit of the hex dump, mysqlbinlog helpfully parses out the decoded writes into something more readable:

BINLOG '
mddlVBPibEh9OQAAAA8EkgMAAAEAAAAAAAAABXN0YXRkAA1zdGF0ZF9oaXN0b3J5AAQDAwwFAQgI
mddlVBfibEh9NgAAAEUEkgMAAAEAAAAAAAAABA/wbQAAAN8vAAD1kPN3URIAAAAAoGmPXOVB
### INSERT INTO statd.statd_history
### SET
### @1=109
### @2=12255
### @3=2014-11-14 10:20:05
### @4=2867100493
mddlVBfibEh9NgAAAHsEkgMAAAEAAAAAAAAABA/wbgAAAKhBAAD9jvN3URIAAAAAAAAAAAAA
### INSERT INTO statd.statd_history
### SET
### @1=110
### @2=16808
### @3=2014-11-14 10:15:01
### @4=0
mddlVBfibEh9NgAAALEEkgMAAAEAAAAAAAAABA/wfQAAAPIAAADxkPN3URIAAAAAAMCJ9lJB
### INSERT INTO statd.statd_history
### SET
### @1=125
### @2=242
### @3=2014-11-14 10:20:01
### @4=4971047
mddlVBfibEh9NgAAAOcEkgMAAAEAAAAAAAAABA/wbQAAAOEvAAD1kPN3URIAAAAAAAAAAAAA
### INSERT INTO statd.statd_history
### SET
### @1=109
### @2=12257
### @3=2014-11-14 10:20:05
### @4=0
mddlVALibEh9PAAAACMFkgMAAAS8uA8AAAAAAAAAFQAAAAAAAAEAAAgAAAAAAAQhACEAIQAAQ09N
TUlU
'/*!*/;

Start of the next transaction as denoted by the BEGIN:

# at 59901258
#141114 10:21:13 server id 2101898466 end_log_pos 59901278
# Position Timestamp Type Master ID Size Master Pos Flags
# 392054a 99 d7 65 54 02 e2 6c 48 7d 3b 00 00 00 5e 05 92 03 00 00
# 392055d 01 44 a7 14 00 00 00 00 00 00 00 15 00 00 00 00 |.D..............|
# 392056d 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# 392057d 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|

One minor issue that we've found with mysqlbinlog is that it sometimes doesn't start at the exact starting log position you specify, but it won't be far from it.


Arity Example

Let's say you got this slave stopped error:

*************************** 1. row ***************************
 Slave_Name: default
 Slave_Status: Errored
 Master_Host: 10.0.0.10
 Master_Port: 3306
 Master_User: repl
 Master_Log_File: clx-binlog
 Slave_Enabled: Enabled
 Log_File_Seq: 5
 Log_File_Pos: 1000000
 Last_Error: Row Type Mismatch: Column arity mismatch between slave and master for the table test.test_tbl
 Connection_Status: Disconnected
 Relay_Log_Bytes_Read: 0
Relay_Log_Current_Bytes: 0
 Seconds_Behind_Master: NULL

On one of the nodes on the master cluster:

cd /clustrix # or wherever you have the most available space
mysqlbinlog -u repl -h 127.0.0.1 -p'xxxxxxxxx' --verbose --base64-output=ALWAYS --hexdump --start-position=1000000 -R clx-binlog.000005 > mysqlbinlog.output

or, alternatively, you can use:

repclient -logname clx-binlog.000005 -logpos 1000000 -count 100 -verbose > repclient.output

Here's a snippet of the mysqlbinlog extract file:

#141022 8:55:17 server id 193994141 end_log_pos 1003393
# Position Timestamp Type Master ID Size Master Pos Flags
# f4f6b f5 70 47 54 02 9d 1d 90 0b 3c 00 00 00 81 4f 0f 00 00 00
# f4f7e 04 fc 96 c2 00 00 00 00 00 00 00 15 00 00 00 00 |................|
# f4f8e 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# f4f9e 21 00 00 43 4f 4d 4d 49 54 |...COMMIT|
#
BINLOG '
[...some output that can be ignored...]
### UPDATE test.TEST_TBL
### WHERE
### @1=15871698
### @2=1
### @3=3109804
### @4='Test'
### @5='User'
### @6=0
### @7=0
### @8=3336901
### @9=3128759
### @10=0
### @11=1413951934
### SET
### @1=15871698
### @2=1
### @3=3109804
### @4='Test'
### @5='User2'
### @6=1
### @7=1
### @8=3336901
### @9=3128759
### @10=0
### @11=1413968117
[...some output that can be ignored...]
'/*!*/;
# at 1003431
#141022 8:55:17 server id 193994141 end_log_pos 1003452
# Position Timestamp Type Master ID Size Master Pos Flags
# f4fa7 f5 70 47 54 02 9d 1d 90 0b 3b 00 00 00 bc 4f 0f 00 00 00
# f4fba 09 48 dd b8 00 00 00 00 00 00 00 15 00 00 00 00 |.H..............|
# f4fca 00 00 01 00 00 08 00 00 00 00 00 04 21 00 21 00 |................|
# f4fda 21 00 00 42 45 47 49 4e 00 00 00 04 21 00 21 00 |...BEGIN|

Looking at the sample output above, we can glean the following details:
  • The event was processed at 08:55:17 local time.
  • A single update was processed, since the next event starts with a BEGIN.
  • test.TEST_TBL was being updated, with columns 5, 6, 7 and 11 modified, as they're the only differences between the WHERE and SET images.
  • The fields denoted as @1, @2, @3...@11 match the order of the columns from show create table.

If this is the event that caused the slave to stop with an arity error, it is because the number of columns on the slave table doesn't match the number of columns in the write event. We can verify this by looking at the DDL for TEST_TBL.
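For example, compare the number of columns in the event (@1 through @11 above, so 11 columns) against the slave's table definition:

show create table test.test_tbl\G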

To resolve this issue, contact Support to help determine whether there are other writes with the same number of columns and whether those statements are in an explicit transaction. The usual approach is then to stop the slave, manually apply the change(s) on the slave, and skip the event(s).

Slave Lag from Large Write Events

If your slave continues to lag and the log position isn't advancing, you can extract the current event from the binlogs.

If your binlog extract has an event with a long list of log positions, that event is a large transaction:

# 3333333
# 3333432
# 3334023
# 3336023
...
# 12312341
...
# 23423233

If you get stuck scrolling through these, you can search for " Position ", which should take you to the next section of the output.
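For example, against the extract file generated earlier:

grep -n ' Position ' mysqlbinlog.output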

Eventually, you'll find the table mapping that identifies the table the event is writing to and whether it's an UPDATE, DELETE or INSERT. You can then go back to the master to find the actual query that was used. If the query took a while to execute, it should be logged in the query.log. Otherwise, you should be able to find it in the master's clustrix_statd.qpc_history.
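A couple of hedged starting points on the master (the query.log location and the qpc_history columns vary by installation and version, so treat these as sketches rather than exact commands):

# run from the directory holding the node's query.log (path varies by install)
grep -i '<table_name>' query.log

-- inspect qpc_history first to see which columns are available, then filter
select * from clustrix_statd.qpc_history limit 10;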


Specifying a starting log position that is in the middle of an event

If you get an error like this in your binlog extract output:

Could not decode row data: Generic Error: table_id 1 not found

this is because you've specified a starting log position that isn't at the start of an event, and the utility needs the table mapping info. If your starting log position falls in the middle of a large transaction, it may take several attempts as you move to earlier starting log positions; if you use the log position from a stopped slave event, however, you shouldn't run into this.
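For example, step the starting log position back and re-run until the extract begins with the table map for the event (the values below are placeholders):

repclient -logname <binlog_name>.<seq> -logpos <earlier_log_position> -count 100 -verbose > binlog.output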


Reptool

Reptool is a new utility that adds additional functionality and can decode events into more readable output. It will be available in a future release.
