Using @sun.misc.Contended to Solve False Sharing

CPU caches store data in units of cache lines. A cache line is a power-of-two number of contiguous bytes, typically 32 to 256, and most commonly 64. When multiple threads modify mutually independent variables that happen to share a cache line, they unintentionally degrade each other's performance; this is false sharing. Write contention on cache lines is the single most limiting factor in the scalability of parallel threads on SMP systems. False sharing is sometimes described as a silent performance killer, because it is hard to tell from the code whether it will occur.
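As a rough illustration of what "sharing a cache line" means, the arithmetic can be sketched as follows. The 64-byte line size is the common value mentioned above, and the addresses are made-up numbers for illustration, not real JVM addresses.

```java
// Sketch: which cache line does a byte address fall on?
// Assumes a 64-byte line; real hardware may use a different size.
final class CacheLineMath {
    static final int LINE_SIZE = 64; // assumed line size in bytes

    // Index of the cache line that contains the given byte address.
    static long lineOf(long address) {
        return address / LINE_SIZE;
    }

    // True if both addresses fall on the same cache line.
    static boolean sameLine(long a, long b) {
        return lineOf(a) == lineOf(b);
    }

    public static void main(String[] args) {
        // Two adjacent 8-byte longs at hypothetical addresses 16 and 24.
        System.out.println(sameLine(16, 24)); // prints "true"
        // The same long and another one 64 bytes further on.
        System.out.println(sameLine(16, 80)); // prints "false"
    }
}
```

Two variables that land on the same line index are candidates for false sharing when different threads write to them.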

To scale linearly with the number of threads, you must ensure that no two threads write to the same variable or the same cache line. Two threads writing to the same variable can be spotted in the code. To determine whether independent variables share a cache line, you need to understand the memory layout or use a tool that reports it; Intel VTune is one such analysis tool. In this article I will explain the memory layout of Java objects and how to pad cache lines to avoid false sharing.

cache-line.png
Figure 1.

Figure 1 illustrates the problem of false sharing. The thread running on core 1 wants to update variable X, while the thread on core 2 wants to update variable Y. Unfortunately, the two variables are in the same cache line. Each thread has to compete for ownership of the cache line in order to update its variable. If core 1 gains ownership, the cache subsystem invalidates the corresponding cache line in core 2. When core 2 then gains ownership and performs its update, the corresponding cache line in core 1 is invalidated. The line bounces back and forth through the L3 cache, which greatly hurts performance. If the competing cores are on different sockets, the traffic has to cross the socket interconnect and the problem can be even worse.
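The back-and-forth described here can be imitated with a toy model. This is only a sketch under strong assumptions: it tracks a single "owner" per cache line and counts how often a write has to take the line away from the other core. It does not model a real coherence protocol, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Toy model: two cores write alternately, core 1 to X and core 2 to Y.
// A write by a core that does not own the line forces an ownership transfer.
final class OwnershipPingPong {
    // lineOfX and lineOfY are the (non-negative) cache line indices of X and Y.
    static int transfers(int lineOfX, int lineOfY, int rounds) {
        int[] owner = new int[Math.max(lineOfX, lineOfY) + 1];
        Arrays.fill(owner, -1); // -1 means no core owns the line yet
        int count = 0;
        for (int r = 0; r < rounds; r++) {
            count += write(owner, lineOfX, 1); // core 1 updates X
            count += write(owner, lineOfY, 2); // core 2 updates Y
        }
        return count;
    }

    // Records the write and returns 1 if the line was taken from another core.
    private static int write(int[] owner, int line, int core) {
        int previous = owner[line];
        owner[line] = core;
        return (previous != -1 && previous != core) ? 1 : 0;
    }

    public static void main(String[] args) {
        System.out.println(transfers(0, 0, 1000)); // X and Y share line 0
        System.out.println(transfers(0, 1, 1000)); // X and Y on separate lines
    }
}
```

In this model, sharing a line costs a transfer on almost every write, while separate lines cost none, which is the intuition behind padding.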

In the HotSpot JVM, every object has a two-word header. The first word is the mark word, made up of a 24-bit hash code and 8 bits of flags such as the lock state (it can also be swapped for a lock object). The second word is a reference to the object's class. Arrays have one additional word for the array length. Every object starts at an address aligned to 8 bytes for performance. To pack objects efficiently, the JVM therefore reorders fields from their declaration order into the following order, based on size in bytes:

  1. doubles (8) and longs (8)
  2. ints (4) and floats (4)
  3. shorts (2) and chars (2)
  4. booleans (1) and bytes (1)
  5. references (4/8)
  6. <subclass fields repeat the above order>
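The ordering above can be sketched as a simple stable sort over field types. This is an illustration of the list only, not HotSpot's actual layout code; the class name, the `rank` helper, and the use of type names as strings are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch: re-order declared field types according to the list above.
final class FieldOrderSketch {
    // Position of a type in the list; lower ranks are laid out first.
    static int rank(String type) {
        switch (type) {
            case "double": case "long":   return 1;
            case "int":    case "float":  return 2;
            case "short":  case "char":   return 3;
            case "boolean": case "byte":  return 4;
            default:                      return 5; // anything else is treated as a reference
        }
    }

    // Returns a re-ordered copy; the sort is stable, so ties keep declaration order.
    static List<String> layoutOrder(List<String> declared) {
        List<String> ordered = new ArrayList<>(declared);
        ordered.sort(Comparator.comparingInt(FieldOrderSketch::rank));
        return ordered;
    }

    public static void main(String[] args) {
        List<String> declared = Arrays.asList("byte", "Object", "int", "long", "char");
        System.out.println(layoutOrder(declared)); // prints "[long, int, char, byte, Object]"
    }
}
```

The practical consequence is that declaration order does not tell you which fields end up next to each other in memory.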

To demonstrate the performance impact, we start several threads, each updating its own independent counter. The counters are of type volatile long, so that other threads can see their progress.

public final class FalseSharing implements Runnable
{
    public final static int NUM_THREADS = 4; // change to vary the thread count
    public final static long ITERATIONS = 500L * 1000L * 1000L;
    private final int arrayIndex;

    // One counter object per thread.
    private static VolatileLong[] longs = new VolatileLong[NUM_THREADS];
    static
    {
        for (int i = 0; i < longs.length; i++)
        {
            longs[i] = new VolatileLong();
        }
    }

    public FalseSharing(final int arrayIndex)
    {
        this.arrayIndex = arrayIndex;
    }

    public static void main(final String[] args) throws Exception
    {
        final long start = System.nanoTime();
        runTest();
        System.out.println("duration = " + (System.nanoTime() - start));
    }

    private static void runTest() throws InterruptedException
    {
        Thread[] threads = new Thread[NUM_THREADS];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new FalseSharing(i));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    // Each thread writes only to its own counter.
    @Override
    public void run()
    {
        long i = ITERATIONS + 1;
        while (0 != --i)
        {
            longs[arrayIndex].value = i;
        }
    }

    public final static class VolatileLong
    {
        public volatile long value = 0L;
        public long p1, p2, p3, p4, p5, p6; // padding; comment out to observe false sharing
    }
}

Run the code above while increasing the number of threads, with and without the cache line padding. Figure 2 below shows the results I got, measured on my 4-core Nehalem.

duration.png
Figure 2.

The impact of false sharing shows clearly in the increasing time the test takes to complete. Without cache line contention, we achieve near-linear scaling with the number of threads.

This is not a perfect test, because we cannot be sure where the VolatileLong instances will be placed in memory. They are independent objects. Experience shows, however, that objects allocated at the same time tend to be placed close together.

Note that from a certain Java 1.7 release onward, the JVM optimizes the code above in a way that makes the manual padding ineffective. Fortunately, Java 1.8 officially provides @sun.misc.Contended to replace this practice.

The following part is excerpted from

// New in JDK 8: the Contended annotation avoids false sharing.
// It is restricted on the user classpath by default.
// Unlock with: -XX:-RestrictContended
@sun.misc.Contended
public class VolatileLong {
    volatile long v = 0L;
}

Note that you must start the JVM with -XX:-RestrictContended for the annotation to take effect on classes loaded from the user classpath.
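Because a missing flag fails silently (the annotation is simply ignored), it can help to check at startup whether the flag was passed. The following is a minimal sketch, assuming the flag is given on the command line; the class and method names are hypothetical, and the check only looks at the arguments reported by the standard RuntimeMXBean.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Sketch: detect whether the JVM was started with -XX:-RestrictContended.
final class ContendedFlagCheck {
    static final String FLAG = "-XX:-RestrictContended";

    // True if the given JVM argument list contains the unlock flag.
    static boolean unlocked(List<String> jvmArgs) {
        return jvmArgs.contains(FLAG);
    }

    public static void main(String[] args) {
        // Arguments passed to the JVM itself, not to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("RestrictContended disabled: " + unlocked(jvmArgs));
    }
}
```

Running it as `java -XX:-RestrictContended ContendedFlagCheck` should report that the restriction is disabled.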

Places in JDK 8 where sun.misc.Contended is already used:

src/share/classes/java/util/concurrent/ConcurrentHashMap.java  
2458: @sun.misc.Contended static final class CounterCell {  
  
src/share/classes/java/util/concurrent/Exchanger.java  
313: @sun.misc.Contended static final class Node {  
  
src/share/classes/java/util/concurrent/ForkJoinPool.java
 
  
src/share/classes/java/util/concurrent/atomic/Striped64.java  
119: @sun.misc.Contended static final class Cell {  
  
src/share/classes/java/lang/Thread.java  
2004: @sun.misc.Contended("tlr")  
2008: @sun.misc.Contended("tlr")  
2012: @sun.misc.Contended("tlr")
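Application code benefits from these annotated classes without using the annotation itself. As a small sketch: java.util.concurrent.atomic.LongAdder extends Striped64, so its internal cells are the Cell class listed above. The class name and thread counts below are arbitrary choices for illustration.

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch: several threads increment one LongAdder, which spreads
// contended updates over the padded cells inherited from Striped64.
final class LongAdderDemo {
    static long countWith(int threads, int incrementsPerThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < incrementsPerThread; n++) {
                    adder.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join(); // wait for every worker before reading the sum
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWith(4, 1_000_000)); // prints "4000000"
    }
}
```

No JVM flag is needed here, since the restriction applies only to classes on the user classpath, not to the JDK's own classes.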