How Do I Improve Data Import Performance?

Nick Lamb -

The most common way to import data onto a ClustrixDB cluster is through use of the clustrix_import tool, as described in Importing Data. Clustrix Import is designed to take advantage of ClustrixDB parallelism and is the recommended method for performant data ingest.

If for some reason you are unable to use clustrix_import, here are some recommendations for optimal import speed:

  • Increase concurrency as much as possible. If using LOAD DATA INFILE, consider splitting the import file into multiple chunks, and running parallel LDI invocations. If you have an application which loads data, utilize multiple threads. A rule of thumb for concurrency is one thread per core in the cluster.
  • Disable auto_increment for the import phase. There is some performance overhead to maintaining the auto_increment value, even when the value is being provided in the INSERT. If you create your table without the AUTO_INCREMENT clause of the column, this penalty is avoided; obviously this is only workable if the data you are importing includes the value of the auto_increment column. Once import is complete, you can ALTER TABLE MODIFY COLUMN to re-add the AUTO_INCREMENT column, an operation which does not require a table rewrite, and so completes immediately. Note that clustrix_import automatically does this for you. 
Have more questions? Submit a request

0 Comments

Please sign in to leave a comment.
Powered by Zendesk