MySQL techniques for large data volumes: query optimization and data transfer

With the development of the Internet, data volumes are growing exponentially, from GB to TB to PB, and every operation on that data becomes harder. Traditional relational databases can no longer meet the demand for fast queries and inserts. The emergence of NoSQL temporarily resolved this crisis: it gains performance by giving up some data safety, transaction support, and support for complex queries. In some scenarios, however, those trade-offs are unacceptable. For example, some applications must have transactions and strong safety guarantees; NoSQL cannot satisfy them, so a relational database is still required.

Although relational databases lag behind NoSQL on massive data sets, they can still meet your needs if operated correctly. The optimization direction differs by operation: data migration, queries, and inserts each call for different approaches. When optimizing, you also have to consider the impact on related operations. For example, creating an index improves query performance, but it slows down inserts; you must decide whether that slowdown is acceptable. Database optimization therefore means weighing multiple directions and finding the best compromise.

Part 1: Query optimization

1: Create an index.

The simplest and most common optimization target is the query. Reads make up the majority of CRUD operations, so read performance largely determines the performance of the application, and the most common way to improve it is to create an index. In one test with 20 million records of about 200 bytes each, spread over two varchar columns, looking up a single record took one minute without an index; with an index, the query time was negligible. Adding an index to existing data, however, takes a very long time: after inserting the 20 million records, creating the index took a few dozen minutes.
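As a hedged sketch of that test setup (the table and column names mirror a YCSB-style benchmark table and are illustrative, not the exact schema used):

```sql
-- Illustrative schema: two varchar columns, roughly 200 bytes per record.
CREATE TABLE usertable (
    YCSB_KEY VARCHAR(100) NOT NULL,
    FIELD0   VARCHAR(100)
);

-- Without an index, this lookup scans all 20 million rows (about a minute):
SELECT * FROM usertable WHERE YCSB_KEY = 'user6284781860667';

-- Building the index afterwards takes tens of minutes on existing data,
-- but makes the lookup above effectively instant:
CREATE INDEX index_ycsb_key ON usertable (YCSB_KEY);
```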

Creating an index has drawbacks and appropriate occasions. Although an index can dramatically speed up queries, the costs are obvious. First, inserting data takes extra time to maintain the index, which lowers insert performance somewhat. Second, the data files grow larger. When you create an index on a column, each index entry is sized by the column's declared length, not by the stored value. For example, for a varchar(100) column the index length is 102 bytes, because an extra 2 bytes are added to record the actual length of each value.

In my test I created an index named index_ycsb_key on the YCSB_KEY column (declared length 100), and each index entry has a length of 102. Imagine how large the index becomes once the data is huge; its size is not to be underestimated. This also shows that the index length and the column's stored length differ: varchar is a variable-length type (see the MySQL data type analysis), so the actual storage is only as large as the value itself, but the index is sized by the declared length. If you declare a column as varchar(100), the index entry length is those 100 bytes plus 2, no matter how much you actually store.
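You can verify index lengths yourself; a minimal sketch against the hypothetical table above:

```sql
-- List the table's indexes and their definitions:
SHOW INDEX FROM usertable;

-- The key_len column of the execution plan shows the index length in use,
-- e.g. 102 for a NOT NULL varchar(100) in a single-byte character set:
EXPLAIN SELECT * FROM usertable WHERE YCSB_KEY = 'user123';
```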

Besides the time needed to create an index and the ever-growing index file, whether an index is worthwhile also depends on the characteristics of the stored data (please see the MySQL index introduction first). When a large fraction of the stored values are duplicates, creating an index brings little benefit: the query speedup it provides is negligible, yet you still pay the index maintenance cost on every insert.
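One quick, hedged way to estimate whether a column is selective enough to be worth indexing (the column name is illustrative):

```sql
-- A ratio near 1 means mostly distinct values (an index helps);
-- a ratio near 0 means heavy duplication (an index helps little):
SELECT COUNT(DISTINCT FIELD0) / COUNT(*) AS selectivity
FROM usertable;
```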

2: Cache configuration.

MySQL has a variety of caches: some cache query statements, others cache query data. Clients cannot manipulate these caches directly; they are maintained by the server and continuously updated as your queries and modifications execute. Through the server's configuration variables we can see the caches MySQL provides:
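For example, a sketch of listing the cache-related server variables (the exact set varies by MySQL version; the query cache itself was removed in MySQL 8.0):

```sql
-- Query-cache settings (MySQL 5.x):
SHOW VARIABLES LIKE '%query_cache%';

-- Other caches and buffers, such as the InnoDB buffer pool and key buffer:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'key_buffer_size';
```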

Here we mainly analyze the query cache, which caches query result data. To use it, you must set query_cache_size to a non-zero value. Once it is non-zero, the server caches the result returned by each query, and the next time the same query arrives, the server fetches the data directly from the cache instead of executing the query. How much data can be cached depends on the size you set: if it is large enough, results can be cached entirely in memory and served very quickly.
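A minimal sketch of setting it up at runtime (the sizes are illustrative; query_cache_type itself usually needs to be enabled in my.cnf at startup):

```sql
SET GLOBAL query_cache_size  = 268435456;   -- 256 MB for cached results
SET GLOBAL query_cache_limit = 2097152;     -- don't cache results over 2 MB

-- Verify the cache is working via its hit counters:
SHOW STATUS LIKE 'Qcache%';
```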

However, the query cache also has drawbacks. Whenever you perform any update operation (update/insert/delete) on a table, the server flushes the cached entries for that table to keep the cache consistent with the database, invalidating the cached data. For frequently updated tables, the query cache therefore does not improve query performance and even hurts the performance of other operations.

3: slow_query_log analysis.

In fact, the most important and fundamental tool for improving query performance is the slow query log (slow_query_log).

When you set slow_query_log to ON, the server logs every query whose execution time exceeds the threshold you set (long_query_time). When tuning performance, you can analyze the slow query log and optimize the queries it records, for example by creating indexes or by splitting tables. And why split databases as well? Because when the data is not split, the single database itself becomes the bottleneck. We introduce this briefly below.
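A sketch of enabling it at runtime (the threshold and path are illustrative; the same settings can live in my.cnf):

```sql
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;   -- seconds; slower queries get logged
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';

-- Confirm the settings took effect:
SHOW VARIABLES LIKE 'slow_query%';
```

The bundled mysqldumpslow tool can then summarize the log by query pattern.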

4: Splitting databases and tables (分库分表)

Splitting databases and tables (分库分表) should be regarded as the killer move of query optimization. Once the data volume reaches a certain level, the measures above no longer make an obvious difference, and the data itself must be divided. Dividing the data generally takes two forms: splitting across databases and splitting tables; table splitting in turn has two methods, vertical splitting and horizontal splitting. Each is briefly introduced below.

For MySQL, data is stored on disk as files. When a data file is too large, operating on it becomes cumbersome and time-consuming for the operating system, and some operating systems do not support very large files at all, so the data must be split. Moreover, the storage engine most commonly used with MySQL is InnoDB, whose underlying data structure is a B+ tree. When the data file is too large, the B+ tree gains levels and nodes; looking up a record may then traverse many levels, each of which can mean an I/O operation to load a node into memory, which certainly costs time. In addition, InnoDB has a locking mechanism on B+ tree nodes: changing the table structure locks the tree, and when the table file is large this can be considered unworkable.

This is why we split tables and databases.
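As a minimal sketch of horizontal splitting, assuming the hypothetical usertable from earlier and hash-based routing done by the application:

```sql
-- Two shards with the same schema:
CREATE TABLE usertable_0 LIKE usertable;
CREATE TABLE usertable_1 LIKE usertable;

-- One-time redistribution of existing rows, routed by key hash:
INSERT INTO usertable_0 SELECT * FROM usertable WHERE CRC32(YCSB_KEY) % 2 = 0;
INSERT INTO usertable_1 SELECT * FROM usertable WHERE CRC32(YCSB_KEY) % 2 = 1;

-- Thereafter the application computes CRC32(key) % 2 to pick the shard
-- for every read and write.
```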

5: Subquery optimization

Subqueries are common in queries, and the in or exists keywords are generally used in them. When the data volume is large enough, query execution times with in and exists differ dramatically. The best way to avoid this is to use a join instead, because in most cases the server optimizes joins far better than subqueries. In version 5.6 and later, MySQL automatically rewrites in subqueries as joins, so the subquery slowness disappears. You can also sometimes use the distinct keyword to shrink the subquery result set, but note that distinct is often converted to group by, which creates a temporary table, and copying data into that temporary table introduces a delay.
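A sketch of the rewrite (the table and column names are hypothetical):

```sql
-- Subquery with in; slow on versions before 5.6 when tables are large:
SELECT * FROM orders
WHERE user_id IN (SELECT id FROM users WHERE city = 'Beijing');

-- Equivalent join, which the optimizer generally handles much better:
SELECT o.*
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.city = 'Beijing';
```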


Part 2: Data Transfer

When the data volume reaches a certain level, migrating the database becomes very delicate and dangerous work. Ensuring data consistency before and after the migration, handling all kinds of emergencies, and dealing with data that changes during the migration are each very difficult problems.

2.1: Insert data

Data migration inevitably involves re-importing large amounts of data. You can load a file directly, or you may sometimes need to insert from code; in that case the insert statements need optimizing. One option is the INSERT DELAYED statement: when you issue such an insert, the row is not written to the database immediately but placed in a queue in memory, waiting for a good moment to be inserted.
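Two hedged sketches (the file path and table are hypothetical):

```sql
-- Load a file directly; usually the fastest way to re-import bulk data:
LOAD DATA INFILE '/tmp/usertable.csv'
INTO TABLE usertable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Deferred insert: the row is queued in memory and written when the
-- table is free. Note that INSERT DELAYED only works with engines such
-- as MyISAM and is deprecated as of MySQL 5.6 (removed in 5.7):
INSERT DELAYED INTO usertable (YCSB_KEY, FIELD0)
VALUES ('user123', 'some-payload');
```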