The attempt of the histogram of statistical information in MySQL 8.0

Histogram is a description of the distribution of data on a table after sampling according to a certain percentage and regularity. One of the most important functions is to estimate according to the query conditions. The amount of data that meets the criteria provides an important basis for the generation of the sql execution plan.

In versions prior to MySQL 8.0, MySQL had only one simple statistic but no histograms. The statistics without histograms can be said to have no meaning. One of the new features of

MySQL 8.0 is to start supporting the histogram of statistical information. This concept was put forward very early, and I took the time to try out the usage method.

照旧, directly on the example, create data, create a test environment

create table test
(
    id int auto_increment primary key,
    name varchar(100),
    create_date datetime ,
    index (create_date desc)
);


USE `db01`$$

DROP PROCEDURE IF EXISTS `insert_test_data`$$

CREATE DEFINER=`root`@`%` PROCEDURE `insert_test_data`()
BEGIN
    DECLARE v_loop INT;
    SET v_loop = 100000;
    WHILE v_loop>0 DO
        INSERT INTO test(NAME,create_date)VALUES (UUID(),DATE_ADD(NOW(),INTERVAL -RAND()*100000 MINUTE) );
        SET v_loop = v_loop - 1;
    END WHILE;
END$$

DELIMITER ;

MySQL in the creation of statistical information, different from MSSQL, MySQL statistics do not depend on the index, need to be created separately, the syntax is as follows

  • Create statistical histogram information on the field

ANALYZE TABLE test UPDATE HISTOGRAM ON create_date,name WITH 16 BUCKETS;

  • Delete the statistical histogram information

ANALYZE TABLE test DROP HISTOGRAM ON create_date

1 on the field, you can create statistics of multiple fields at one time, and the system will create statistics on the listed fields one by one. Statistics do not depend on the index, which is different from MSSQL (of course MSSQL can also create statistics independently by throwing away the index) 2, BUCKETS value is a mandatory parameter, the default value is 1000, the range is 1-1024, this is also different from MSSQL, MSSQL has a similar step value field with a maximum value of 200 3. Generally speaking, in the case of a large amount of data, for data that is not repeated or reproducible, the BUCKETS value is larger, and the statistical information described is more detailed. 4, the specific content of the statistical information in the information_schema.column_statistics, but the readability is not good, you can self-resolve according to the needs (out of a format you like)

and the statistical information in sqlserver, in theory, The accuracy is proportional to the sampling percentage (BUCKETS). Of course, the cost of generating statistical information is greater. As for the sampling percentage of BUCKETS and statistical information, and the comprehensive cost, the author has not found relevant information.

The following is a statistical histogram created by ANALYZE TABLE test UPDATE HISTOGRAM ON create_date WITH 4 BUCKETS; It can be found that the HISTOGRAM field of the histogram is a string in JSON format, which is not very readable.

I think of the histogram information of DBCC SHOW_STATISTICS in sqlserver, the following format, the data distribution in the histogram looks very clear and intuitive

, I just did a MySQL histogram format conversion, it is plain Parse the JSON content in the HISTOGRAM field in the information_schema.column_statistics table

as follows, a simple parsing of the histogram statistics json data storage process, the parameters are the library name, table name, field name

DELIMITER $$

USE `db01`$$

DROP PROCEDURE IF EXISTS `parse_column_statistics`$$

CREATE DEFINER=`root`@`%` PROCEDURE `parse_column_statistics`(
    IN `p_schema_name` VARCHAR(200),
    IN `p_table_name` VARCHAR(200),
    IN `p_column_name` VARCHAR(200)
)
BEGIN
    
    DECLARE v_histogram TEXT;
    -- get the special HISTOGRAM
    SELECT HISTOGRAM->>'$."buckets"' INTO v_HISTOGRAM 
    FROM   information_schema.column_statistics
    WHERE schema_name =  p_schema_name 
    AND table_name = p_table_name 
    AND column_name = p_column_name; 
    
    -- remove the first and last [ and ] char
    SET v_histogram = SUBSTRING(v_HISTOGRAM,2,LENGTH(v_HISTOGRAM)-2);

    DROP TABLE IF EXISTS t_buckets ;
    CREATE TEMPORARY TABLE t_buckets
    (
        id INT AUTO_INCREMENT PRIMARY KEY,
        buckets_content VARCHAR(500)
    );
    
    -- split by "]," and get single bucket content    
    WHILE (INSTR(v_histogram,'],')>0) DO
        INSERT INTO t_buckets(buckets_content)
        SELECT SUBSTRING(v_histogram,1,INSTR(v_histogram,'],'));
        SET v_HISTOGRAM = SUBSTRING(v_histogram,INSTR(v_histogram,'],')+2,LENGTH(v_histogram));    
    END WHILE;
   
    INSERT INTO t_buckets(buckets_content) 
    SELECT v_histogram;
    
    -- get the basic statistics data
    WITH cte AS
    (
        SELECT 
        HISTOGRAM->>'$."last-updated"' AS last_updated,
        HISTOGRAM->>'$."number-of-buckets-specified"' AS number_of_buckets_specified
        FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
        WHERE schema_name =  p_schema_name 
        AND table_name = p_table_name 
        AND column_name = p_column_name
    )
    SELECT 
        CASE WHEN id = 1 THEN p_schema_name ELSE '' END AS schema_name,
        CASE WHEN id = 1 THEN p_table_name ELSE '' END AS table_name,
        CASE WHEN id = 1 THEN p_column_name ELSE '' END AS column_name, 
        CASE WHEN id = 1 THEN last_updated ELSE '' END AS last_updated,
        CASE WHEN id = 1 THEN number_of_buckets_specified ELSE '' END AS 'number_of_buckets_specified' ,
        id AS buckets_specified_index,
        buckets_content
    FROM
    (
        SELECT * FROM cte,t_buckets
    )t;

END$$

DELIMITER ;

于是, The results in the first screenshot are converted to the following format. Deliberately according to the histogram generated by the 4 buckets, it should be simple enough. If you are familiar with MSSQL histogram students, you should be able to understand the meaning of this histogram at a glance (test data volume is 400,000). Take the first bucket as an example: ["2018-06-15 04:57:48.000000", "2018-07-02 15:13:04.000000", 0.25, 95311]. 是显,

1,"2018-06-15 04:57:48.000000" and "2018-07-02 15:13:04.000000" are similar to the lower limit in the histogram in sqlserver And upper limit 2, 0.25 is less than the ratio of the value of the bucket (ie less than the ratio of the limit on this interval) 3,95311 is the number of rows in which the field values ​​of this interval are not repeated.

To the last bucket, the sampling rate must be 1, that is, 100%

. It should be noted that the update time of the histogram is the standard time (UTC value), not the current time of the server. The histogram in

MySQL 8.0 is basically consistent with the histogram of sqlserver, which is based on the single-column sampling estimate, but there is no field selectivity in MySQL histogram similar to sqlserver, but the selectivity of this field itself Not big, for sqlserver for composite index, the two fields total in one piece of statistics, unless the two fields are distributed at the same time, otherwise the field selective reference of multi-field index is not significant.

This is why the composite index can't be more accurate.

有疑问题?

I wrote a little MySQL statistics before, but below MySQL5.7, there is no concept of histogram https://www.cnblogs.com/wy123/p/6561517.html The variable that triggered the statistics update is also set global innodb_stats_on_metadata = 1; but after testing, the histogram of the statistics is not updated accordingly. Innodb_stats_on_metadata affects the statistics on the MySQL index in MySQL 5.7, and here is purely a histogram of statistics (the histogram in MySQL 8.0 is not necessarily related to the index).

In addition, after repeated tests, the data volume of buckets has no obvious relationship with the efficiency of generating histograms. As shown in the screenshot below, it is not clear how the number of buckets has a relationship with the sampling percentage.

and carefully looked at the contents of the reference link, found such a paragraph:

  1. Maintaining an index has a cost. If you have an index, every INSERT/UPDATE/DELETE causes the index to be updated. This is not free, and will have an impact on your performance. A histogram on the other hand is created once and never updated unless you explicitly ask for it.It will thus not hurt your INSERT/UPDATE/DELETE-performance.

itself is to explain the relationship between the index and the histogram, mentioning that the histogram will not be automatically updated after the creation, unless Proactive update.

What I have to vomit is, if I create an index on a field, I need to create a statistic histogram by the way? And this histogram does not automatically update as the data changes, but also requires manual updates.

MySQL 8.0 will not associate statistical information with the index, or automatically create statistical information as needed. If the statistical information cannot be automatically updated, it can basically be considered as disabled statistics.

About the consumption of resources when generating histograms

The generation of histograms is a process that consumes resources. The following is the CPU of the server that zabbix monitors during the process of repeatedly testing to create histograms. Usage, of course, here is only a matter of CPU usage.

Therefore, the histogram is no better, really want to use large-scale applications, or to consider comprehensively, when to perform updates, and how to trigger its updates.

This is just a rough attempt, and there are inevitably many places where you don’t know enough.

Some interesting things

I found some interesting things in the reference link given at the end of this article.

MySQL 8.0 some interesting estimation algorithms, it seems that look, there is not much difference with sqlserver, are similar to these kinds of algorithms, there is no way to do it.

The estimate of when two predicates are combined, or the estimate without statistical coverage, can basically be considered as Yimeng, so the above also mentions the selectivity of combining multiple predicates. There is no point.

------------------------------------
AND       : P(A and B) = P(A) * P(B)
OR        : P(A or B)  = P(A) + P(B) - P(A and B)
=         : 1/10
<,>       : 1/3
BETWEEN   : 1/4
IN (list) : MIN(#items_in_list * SEL(=), 1/2)
IN subq   : [1]
NOT OP    : 1-SEL(OP)

is similar to this, the estimation algorithm in sqlserver:

参考: