mentioned matrix calculation, people who have studied "Advanced Mathematics" may have heard of it, but if it is not a researcher in this field, I am afraid it will only stop at "heard" degree. In the field of matrix computing, OpenBLAS, an open source project, has a huge impact. In addition to the use of giant companies such as IBM and Huawei, it has attracted the attention of research institutions and developers around the world.
雷锋网 The AI Research Institute has been fortunate to have invited Zhang Xianyu, founder of Qifeng Technology, founder and main maintainer of OpenBLAS, to introduce the OpenBLAS open source project and matrix multiplication optimization.
嘉宾介绍
张先轶, Ph.D., Chinese Academy of Sciences, postdoctor of MIT, founder and main maintainer of OpenBLAS open source project, founder of PerfXLab. He won the second prize of the 2016 China Computer Society Science and Technology Progress Award.
张先轶师Professor Zhang Yunquan, obtained a Ph.D. in high performance computing from the Chinese Academy of Sciences. During the reading of Bo, based on the original foundation of GotoBLAS, he created OpenBLAS, an open source matrix computing library. The leadership team continued to repair and maintain it. Currently, it has become an influential open source project in the field of matrix computing.
He started his entrepreneurial idea at MIT. After returning to China, he founded PerfXLab for "Deep Learning" to provide integrated performance optimization solutions for companies such as computer vision and speech recognition. Lei Feng network (public number: Lei Feng network) [Xinzhi made] channel has previously interviewed and reported PerfXLab peak technology.
course content
OpenBLAS project introduction
Matrix multiplication optimization algorithm
Step by step optimization
The following is the text and PPT of the open class content.
雷锋网's friends Hello everyone, I am Zhang Xianyu, today we mainly introduce our open source matrix computing library OpenBLAS and matrix multiplication optimization.
First, what is BLAS?
BLAS is an acronym for Basic Linear Algebra Subprograms, which is used primarily for basic matrix calculations or vector calculations. It is divided into three levels:

BLAS Level 1, which mainly performs dot or multiplyaccumulate operations between vectors and vectors, and calculates the corresponding elements;

BLAS Level 2, mainly for matrices and vectors, similar to PPT As shown in the blue part, the matrix A* vector x gives a vector y. In addition, there may be symmetric matrix deformation;

BLAS level 3, mainly matrix and matrix calculations, the most typical is the A matrix * B matrix, to get a C matrix. From the width and height of the matrix, a C matrix of m*n is obtained.
Why BLAS is a very important library or interface because it is one of the cores of many scientific calculations. Every year, the supercomputer's leaderboard is going to do the LINPACK test. Many parts of the test are the calculation of BLAS level 3 matrix and matrix. In addition, there are many scientific and engineering simulations that have become a matrix operation after conversion. If you optimize the matrix particularly well, it is very helpful to improve the entire application.
BLAS and Deep Learning relationship, after deep learning in the past few years, such as Alexnet, which everyone knows very well, if you do specific performance division, this picture on PPT comes from a paper, after cutting down How much time spent in each part, it was found that it spent most of the time in the Conv Layer, and spent a lot of time on the FC layer. The current common implementation of the volume base layer is to expand the matrix into a matrix and matrix multiplication, which is BLAS level 3. The fully connected layer is generally a multiplication of a matrix and a vector, and it also becomes a BLAS operation.
That is to say, based on deep learning of matrix learning, 90% or more of the time is operated by BLAS. Of course, with the advent of new algorithms, the convolutional layer has a special algorithm for the 3*3 convolution kernel, or it can be done with the FFT class algorithm, but in general, the expansion matrix is also very extensive.
Currently, BLAS is just a defined implementation interface, but there are many specific implementations. From a commercial point of view, there are many commercial versions, such as Intel, AMD, NVIDIA, IBM, etc., basically in order to match their own hardware, they have been better optimized, out of the commercial version.
For open source, there are the following types. GoToBLAS, which was wellknown before, has deep roots with OpenBLAS, but it was discontinued in 2010. There is time to analyze the reasons behind it. The developer, Goto, left the UT Austin research group and entered the company, which ended the development. ATLAS is a school in the United States. OpenBLAS is based on GotoBLAS. BLIS is a project based on GotoBLAS after Goto is gone. It is still in a relatively early stage and its maturity is still poor.
The history of OpenBLAS has been around for a few years. It started from the beginning of 2011. The original reason was that GotoBLAS gave up. We reforked a community distribution and continued to develop and maintain. Its maintenance is not a simple process of repairing BUG. If you want To get better performance, you need to keep following the hardware. For example, if you have a new hardware architecture or a wider hardware architecture, you must optimize it to get better acceleration. OpenBLAS is currently the world's best open source matrix computing library. It won the second prize of China Computer Society Science and Technology Progress Award last year, and also entered many mainstream Linux installation packages. For example, Ubuntu has our OpenBLAS Package. Almost all Linux distributions that can be thought of go in, but this is not what we are going to do. There is also a suite of OpenHPC, which is also a source of recent high performance computing.
The current progress of OpenBLAS is to support almost all mainstream CPU processors, and at the same time achieve better optimization performance. From the operating system, basically the mainstream OS is supported. Overall, the range of compatible processors and supported operating systems is the most widely implemented in open source libraries.
Therefore, there are more users of OpenBLAS. For example, there are open source project Julia language, GNU octave, etc.; deep learning, familiar mxnet, Caffe can choose OpenBLAS, as CPUside computing backend; IBM and ARM also use OpenBLAS in their products, NVIDIA OpenBLAS is listed as a benchmark when doing some comparison tests with the CPU. Others include Israeli startups that do compilers, as well as some Internet companies in the country, such as Sogou. Some time ago, I also talked with the people of Sogou Company. Our library has been running stably for more than a year, so it is still possible to calculate the quality of its project.
IBM Some time ago, because deep learning is very hot, I made a Power AI software framework. You can see that the top layer is some open source framework, and our bottom layer computing has our OpenBLAS. Of course, in order to carry his server.
Briefly look at the performance, the higher the performance of the BLAS library, the better. On Intel's Sandy Bridge platform, compared to the performance of MKL, it basically overlaps and achieves a considerable performance.
This picture shows a result made on the Godson, the measurement is relatively complete, the overall BLAS multithreaded, the performance is fully tested, and the performance is relatively high, we have doubled. This is because we have optimized for the Godson 3A, so it has achieved very good results.
has just introduced the performance and effects of OpenBLAS. We have hosted it on GitHub. Students who are interested in matrix multiplication or optimization can join in and contribute code. Our company is willing to give a sum of money to support this. The project continues to move forward. Next, we will start some technical dry goods, mainly about the parts that are of interest to optimization. I refer to the tutorials on matrix multiplication and the tutorials of the UT Austin Flame group. I basically took out his content and took it with you step by step. If we implement from the simplest matrix multiplication, to a highperformance matrix multiplication implementation, it is probably a few steps. How come? Or why optimize, how much performance gain per step. So everyone wants to provide some help for some other optimized programs.
We first look at the basic implementation.
I think as long as I have learned "Linear Algebra", this kind of matrix multiplication is a very simple problem. If I convert it into C code, it is a triple loop. I am in this picture. There is a triple loop of [ijk], where the code of the matrix multiplication is already, and its function is matrix A* matrix B, added to matrix C, C is the result matrix, the code of C, and The accumulated formulas seen on the textbook are the same. Find the i line, corresponding to the result C of this position, take each element of the i line, multiply it by the B column, the corresponding multiplication, and then add up to get the result. But this implementation, if you put the performance on the current processor, and the implementation of the optimized BLAS library, the performance will be many times different, or even 10 times worse.
Why, let's do the final performance test
This picture is also taken from the tutorial, representing a performance chart, the higher the better. The test platform here is Intel core i5, but it only measures single thread, no multithreading. This initial implementation might be 1 GFlop/s. As the scale grows, why is the performance of the matrix declining? Because in the process of implementation, the reason for the cache is not taken into consideration, when the matrix is small, the speed can be faster. When the matrix is large, it will fall down, so there is a downward process in the figure.
This is a very primitive, basic implementation.
Go one step further, how can you optimize it a bit? We need to adjust the multiplication order of the matrix. We have made a small block here, and put p separately into a function, write it in the form of dot multiplication, and make a 1*4 result each time. Come out and become a function. In this step of p, change the order of calculation slightly, put i in it, and put j outside. Why should this background be changed? Actually, we assume that the matrix is stored in column priority when stored. The values in the column items are contiguous, and there is a gap between the lines, which is more advantageous for copying. After becoming such an implementation, it does not help the overall performance, so why replace it with this form to write for the optimization, leaving a certain space.
When the matrix is relatively small, the performance is basically unchanged when it is inside the cache, but when it exceeds the cache, its performance is slightly better. From the very low value, it has reached nearly 1GFlop/s. The main reason is to make some reuse of A(0,p), which saves some cache. On the other hand, in terms of the use of the loop itself, it is equivalent to a certain cycle of expansion than this part, so the efficiency has been improved.
This block is reused only once from memory, so there is some improvement in the case of exceeding the cache.
On this basis, we need to look at what better way to optimize. Our benchmark is how to do the AddDot1*4 benchmark. The first thing we think about is that we can use register variables instead of operating memory. I can apply a bunch of register variables like C 00, 01, register double in C, and part of matrix A, also use register variables.
The rest of the operations are variables of some registers, of course B is still the previous way, and finally written back to C. The process it completes is basically the same as the previous internship, except that we introduce register variables to save more data to the registers instead of putting them in the cache cache to alleviate the pressure on the cache. Upgrade.
can see that the red line is our optimized performance, reaching 2GFlop/s, the blue line is the previous performance, the optimization here is to use the register to reduce the cache consumption, and achieved about 50% Performance improvement. After completing this step, how can we optimize again?
We just used the registers in the A and C parts, so B will save this part, and we have the possibility to do some optimization. In the previous implementation, the coordinate calculation of each part of Part B needs to be calculated. B accesses each position once, which introduces additional overhead, which can be avoided.
We use the pointer method, first point the corresponding pointer position, every time you calculate, as long as the pointer moves continuously, instead of recalculating each time you read a position, the speed will be faster. .
We look at it, after finishing this small optimization, after reducing the consumption of B index, we have achieved 4G performance from just 2G F... The improvement of this block has an effect on the small matrix, and the large matrix is all beyond the cache range, and it will not be effective. So assuming that the matrices are all in the cache, this block has also achieved a lot of performance improvement.
When you complete this part, you can think, I have replaced the A matrix with registers, and the B matrix has been improved by index. What can we do next?
is actually a more common way to do expansion.
is in the innermost loop, can it be expanded into 4 times, when doing this, we can reduce the overhead of this part of the whole loop, and let it flow better.
This section will also have some improvements in performance. This picture compares how much improvement was achieved in the middle stage compared to the beginning stage. By using register variables, pointers are used, and after a certain underlying loop is unrolled, the performance of the red line is reached, which has been significantly improved over the blue line, but this is not complete, just a basis. But at the level of 1*4, there is no oil and water to dig, so we need to find some methods at the upper level.
At 1*4, the value of A produces some reuse, but if the block is enlarged a bit, for example, it becomes 4*4, which means that each time it is calculated, it is a square. Not a column. For the entire optimization, you can reuse your memory access and be able to fully utilize your computing power.
When we become a small 4*4 block, the AddDot function should also be written in this mode. Of course, this part also uses the 1*4 method that we have just done. A is a value before, now it is 4 values, using the variables of the register, the C part is already 4*4, there are 16 All are register variables, and parts of B are all optimized with pointers.
However, if you do this, the overall performance improvement is limited. Since this is only an initial result, it can be seen that for small matrices, there is no improvement in the cache range. However, after the cache is exceeded, there is a certain performance improvement for the largescale matrix.
Based on the 4*4 results optimization, we can further optimize and develop. Why convert to 4*4 optimization is because our current CPU processor, basically want to get high performance, must use vectorized instructions, whether it is old SSE2, AVX or AVX 2.0, etc., for CPU optimization If you want to achieve high performance, you must use vector instructions with single instruction and multiple data.
After doing 4*4 chunking, in this case, it will be more beneficial to vector instructions. Here I rewrote the contents of this part of the loop with vector instructions. Of course, I don't have any inline assembly or pure assembly operation for this part of the instruction. I just wrote it in the form of Intel Intrinsic. You can see this part of the write. Is a 128bit sse, this is a matrix of a doubleprecision floatingpoint double, the data is double, load these two values from A. The last two loads enter this vector register, and the B part is loaded with this instruction of load copy each time. The rest of the block is basically operated with the vector Intrinsic variable. It looks similar, so it becomes a vectorized instruction at compile time. These two lines are the values of the former part C, and this part is the value of the latter part C.
After the use of this vector register instruction, his performance improvement is very obvious, from 4G can reach more than 6G, and this part is a whole change, the vector acceleration effect in the cache is very obvious, basically It is doubled.
The next step is to solve this cache problem. The problem is that there is no big block. After the cache size is exceeded, the performance will drop. To solve this problem, you need to do Blocking on the next level.
Convert to code, do a K split in this layer, the next layer to do a m split, as kc and mc are parameters. These parameters are adjustable, must be adjusted according to the size of the L2 cache, and then each time is a matrix multiplication of a small block c, equivalent to a subproblem, the implementation of this subproblem is basically 4*4 The implementation is the same.
The performance of this part of blocking is as shown in the figure, the blue line is not the performance of Blocking, the red line is done after. When the problem size is in the cache, its performance improvement is basically small, but when the largescale matrix, after doing this Blocking, you can see that it has achieved a very significant improvement, which is twice as fast.
For large matrices, in order to make full use of the cache, the subproblem becomes smaller, and the data locality is improved. It is also necessary to optimize other problems. The next step when we do blocking, if only the code level changes, the basics have already been done.
After further optimization, introduce some operations. When we analyze the performance bottleneck of the program, it is slow for A's memory access and B's memory access. Many accesses exist in the matrix are not continuous, so the memory access performance is much worse. On the one hand, the cache cannot be used. There is also an impact on the TLB. Of course, the C part also has some influence. The C matrix is often very large. There is no way to do packing. It can only be done for A and B. Packing means that I have a part of continuous memory space here. *k, corresponding to the previous mc and kc, in this memory space, before each calculation, I read the matrix of A needed from the original matrix and store it in a continuous memory space. The specific implementation of the Packed Matrix A function is very simple, basically it is taken from the corresponding location, and the continuous memory address ends.
Why would you do this step? The significance of this step is that after the pack, the elements read in the next AddDot4*4 are not read from the A matrix, but from the location of the buffer after the pack. One advantage is that the A matrix has been warmed up and put into the CPU's cache; the second benefit is that you can see that this storage is continuous when I store it, and it is also a continuous read when I read it. The efficiency will be very high and the cache efficiency will be very high. Plus, through the pack step, the performance improvement is very obvious.
This picture is the operation of the previous step. After packing, except for the minimal matrix size, there is no effect, or the extra cost is introduced, and the effect is still worse. Most of the performance improvement is very obvious. More than 50%, reached 9GFlop/s.
For the matrix multiplication implementation, the packing matrix is a very important optimization method. Later, everyone will think that Packing is for A, and Packing is also possible for B. The same reason is also possible, that is, copy it to a continuous space. The Packing operation of Part B is similar to that of Part A. It also reads its data from the original matrix and puts it into a contiguous space, so that its memory access is continuously fetched. Of course, this part, because B access is a streaming memory, so the improvement in this part will be slightly smaller, compared to A only about 10% improvement.
For matrix multiplication implementation, the packing matrix is a very important optimization method. Later, everyone will think that Packing is for A, and Packing is also possible for B. The same reason is also possible, that is, copy it to a continuous space. The Packing operation of Part B is similar to that of Part A. It also reads its data from the original matrix and puts it into a contiguous space, so that its memory access is continuously fetched. Of course, this part, because B access is a streaming memory, so the improvement in this part will be slightly smaller, compared to A only about 10% improvement.
When you finish this step, the performance of your matrix multiplication has improved significantly compared to the performance improvement of the initial triple loop. If you want to do better, the internal core may not only write intrinsic instructions, but also write inline assembly, rearrange the pipeline, so that hardware resources can play more, and may increase by 10%. Of course, this part is more important for the realization of BLAS, and it will be more detailed.
We will review the algorithm of matrix multiplication as a whole. I put this part of the algorithm to the end. After starting step by step, I will see more clearly in the end. The A matrix * B matrix gets the C matrix, which corresponds to the outermost loop. When each step goes down, it is actually doing the blocking. The reason for doing the blocking is just mentioned, just to match the hardware structure, because Memory, cache, etc. are all layered, it is getting smaller and smaller, doing partitioning actually improves the utilization of each layer of the cache.
Share today, thank you.
问答解
问题1: What is access optimization?
张先轶: The memory access optimization solves the performance of the processor reading data. Computationally speaking, it is relatively well optimized, but optimizing the memory access is very difficult. The data of dense matrix multiplication is relatively regular. The order of reading data is regular and easier to optimize. But we have also done a lot of sparse matrix optimization, such as the optimization of sparse matrix multiplication vector, which is more difficult for memory access, because there is no way to predict where the next visit exists, which makes optimization difficult.
问题2: What is the relationship between OpenBLAS and other matrix libraries?
张先轶: OpenBLAS and other BLAS implementations actually complete the interface, BLAS is just the definition of the interface, specifically there can be multiple implementations. We think that our OpenBLAS implementation in open source is very good. If it's a standard BLAS, there is a reference implementation, just a very simple Fortran implementation, the performance is very poor, we are much faster than them. MKL is the BLAS made by Intel itself, and we are quite similar to them. Engien we haven't completely tested it, it claims to be doing very well, but their approach may have some effect on the X86 platform, but I doubt the effect of other platforms. However, I did not specifically do the comparison test, so the right to speak is not big.
问题3: How long does it take to get started?
张先轶: If I guide, you can get started with things in a few months. welcome everybody.
问题4: How does it compare to Qualcomm's library?
张先轶: To tell the truth, Qualcomm's library has not been tested. I think it claims to be faster because on the 32bit ARM, our OpenBLAS did not do vectorization optimization, the part of Qualcomm did, so it It may be faster than us, but PerfBLAS is optimized within our company.
问题5: What is the purpose of the block?
张先轶: It is to optimize the memory access. After the block is divided, the memory is concentrated in a certain area, which can improve the locality of the data, thereby improving the cache utilization and the performance will be better.
问题6: Does the hardware not force the nerve network?
张先轶: One of the data we gave is that we are on the ARM CortexA57 platform, 4 cores 1.7GHz, running the AlexNet model, a picture is 150ms, this speed is relatively fast. On the other hand, we are still doing some other models, doing a streamlined optimization, and then matching the optimization of the underlying library. Under a small model, it takes only 50ms to run a small picture inference. Therefore, on the ARM processor, it is still possible to realtime localized neural network inference.
问题7: Is there a big difference between the internal version and the open source version?
张先轶: The internal version is doing some differentiation processing for deep learning, and the performance may be more than 1 times. This part of the optimization, part of the matrix size, and the square matrix just mentioned Similarly, the matrix of deep learning is mostly small and mediumsized, a certain dimension will be relatively small, the optimization strategy to be used, or the blocking strategy will be different, and sometimes even small, no need to block or pack, directly do more better.
化方法:
1. Use Register instead of memory access
2. Pointer ++ Replace array access
3. Loop unroll, block to improve Cache hit rate
4.packing operation, put the matrix into continuous memory
5.sse instruction, single instruction multiple data parallel acceleration