Changing Distribution and Reducing Broadcasts

Nick Lamb -

Distribution in ClustrixDB determines the way that data is stored across the cluster. For more information about data distribution on ClustrixDB, you can read the following article in our documentation: Managing Data Distribution

In v7.5 the default distribution of indexes will be 1. This differs from the current default which is number of columns. The redistribute task was designed to give the best distribution according to disk usage across the cluster but as we've found that this can be less performant in many cases we recommended that this task be disabled and redistribute actions be performed in accordance with the guidelines detailed below. Secondly, as the redistribute task may undo many of the changes recommended below, it should be disabled before proceeding.

You can disable the redistribute task with:

set global task_rebalancer_redistribute_interval_ms=0;

The reason we've found distribution to decrease performance is mainly broadcasts. Broadcasts cause increased traffic across the network and when unnecessary, cause added contention without a performance benefit. The information below describes ways to identify tables and indexes that would benefit from various redistribute actions. A redistribute action is a table re-write, so this may take a long time on many of your larger tables. If your team has any questions on particular changes, we'll be happy to review them.

Broadcasts

Broadcasting is when a node asks another node for additional information. This adds parallelism (a good thing) but can also add noise/congestion. Removing un-needed or unhelpful broadcasts via schema or query changes can be beneficial to the system as a whole.

Detecting:

QPC queries will give information on queries that have recently run. Looking at this table can help to determine recent queries that could be causing issues. Using this to look at broadcasts specifically:

SELECT statement, 
       broadcasts, 
       exec_ms, 
       counts, 
       rows_read 
FROM   system.qpc_queries 
ORDER  BY broadcasts DESC 
LIMIT  5\G

ALLNODES:

Potential ALLNODES Candidates:

  • Table Size < 10MiB
  • Write Frequency < 1K
  • Read Frequency > 1M
  • Currently not ALLNODES

Changing small tables with very few writes will help by preventing the need to broadcast to other nodes to get table data. This especially helps with small tables that are used in JOINs with larger tables. Tables that are frequently written to should not be altered to ALLNODES. Partitioned tables should also not be altered to ALLNODES.

SELECT `database`, 
       `table`, 
       reads, 
       writes, 
       bytes 
FROM   (SELECT `database`, 
               `table`, 
               Sum(reads)                                  AS reads, 
               Sum(inserts) + Sum(deletes) + Sum(replaces) AS writes, 
               Sum(bytes)                                  AS bytes, 
               Count(DISTINCT( `index` ))                  AS paths, 
               Count(DISTINCT( slice ))                    AS slices, 
               Count(DISTINCT( replica ))                  AS replicas 
        FROM   system.container_stats cs 
               JOIN system.table_replicas tr 
                 ON cs.replica = tr.replica 
        GROUP  BY 1, 
                  2) AS a 
WHERE  bytes < 10 * 1024 * 1024 
       AND writes < 1000 
       AND reads > 1000000 
       AND slices > paths 
ORDER  BY 1, 
          2; 

To alter a table to ALLNODES, you can use the following:

sql> ALTER TABLE foo.bar REPLICAS=ALLNODES;

Index Distribution:

  • Potential Index Distribution Candidates
  • Single column indexes with DISTRIBUTE=2
  • Multi-column indexes with DISTRIBUTE >= number of columns with mostly unique initial columns.


Single column indexes with DISTRIBUTE=2

Example:

CREATE TABLE `foo2` (
`i` int(11) not null,
PRIMARY KEY (`i`) /*$ DISTRIBUTE=1 */,
KEY `idx_i` (`i`) /*$ DISTRIBUTE=2 */
) CHARACTER SET utf8 /*$ REPLICAS=2 SLICES=7 */
ALTER TABLE foo2 INDEX idx_i DISTRIBUTE=1;

In v7.0+ is no longer possible to create an index with a higher distribution that it's keycount. In previous versions, or on clusters upgraded from previous versions, this change will remove unneeded broadcasts on single column indexes.

Multi-column indexes:

With Multi-column indexes there can be a benefit to changing the distribution in some cases. Cases:

  • Distribution is > the number of columns - Change distribution to at most the number of columns.
  • Indexes that have mostly unique initial columns - Consider reducing distribution if unnecessary broadcasts are being performed.
  • Check explain of queries using index and look for stream combine.
Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.
Powered by Zendesk