Support Notice: SLES and CentOS/RHEL Bug Affects Vertica Process

This article explains specific issues you may experience when running Vertica on SUSE Linux Enterprise Server (SLES) and on CentOS or Red Hat Enterprise Linux (RHEL), including the affected versions, root cause, and resolution for each.

SLES

If you are running Vertica on SUSE Linux Enterprise Server (SLES), you may experience an issue when creating a new database. The following sections detail the versions of Vertica and SLES this issue impacts, root cause, and solution.

Environment

  • Vertica 9.1 and higher
  • SUSE Linux Enterprise Server 12 SP2 and higher
  • Server with an Intel CPU that supports Hardware Lock Elision (HLE), such as Haswell or Broadwell.

    To check if the CPU has the HLE functionality, run the following command:

$ cat /proc/cpuinfo | grep flags | grep hle

flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx pdpe1gb rdtscp lm ibrs flush_l1d constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq 
dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave 
avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb invpcid_single pln pts dtherm intel_pt ssbd ibpb stibp kaiser tpr_shadow vnmi flexpriority 
ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc
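
The command above prints the full flags line once per logical CPU and matches hle as a substring. The following is a minimal sketch of a quieter, whole-word check. The function name has_hle and its optional file argument are illustrative assumptions, not part of Vertica or SLES.

```shell
#!/bin/sh
# has_hle: hypothetical helper, not a Vertica or SLES tool.
# Exits 0 when the first "flags" line of a cpuinfo-style file lists
# hle as a whole word, and non-zero otherwise.
# The file defaults to /proc/cpuinfo; a different path can be passed
# to check a saved copy from another server.
has_hle() {
    cpuinfo_file="${1:-/proc/cpuinfo}"
    grep -m1 '^flags' "$cpuinfo_file" | grep -qw 'hle'
}
```

For example, has_hle && echo "HLE present" prints the message only on a server whose CPU advertises the flag.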

Issue

The Vertica cluster does not start in the SLES environment. During startup, the Vertica process crashes with the following messages in vertica.log:

nameless:1234567890ab [Init] <INFO> Catalog loaded
nameless:1234567890ab [Init] <INFO> Listening on port: 5433
nameless:1234567890ab [Init] <INFO> Initializing NodeInstanceId with random data.
nameless:1234567890ab [Init] <INFO> PID=12345
nameless:1234567890ab [Init] <INFO> Start reading DataCollector information
nameless:1234567890ac [Init] <INFO> NodeInstanceId initialized: 1234567890abcdefghijklmnopqrst.
nameless:1234567890ab [Init] <INFO> Startup [Read DataCollector] Inventory files (bytes) - 0 / 1375597933
nameless:1234567890ab [Init] <INFO> Done reading DataCollector information
Main:1234567890ab [EE] <INFO> The UDx zygote process is down, restarting it...
Main:1234567890ab [Main] <INFO> Handling signal: 11
Main:1234567890ab [Main] <ALL> Core dumped to /vertica01/DB/v_db_node0001_catalog/core.12345
Main:1234567890ab [Main] <PANIC> Received fatal signal SIGSEGV.
Main:1234567890ab [Main] <PANIC> Info: si_code: 128, si_pid: 0, si_uid: 0, si_addr: (nil)

The following backtrace appears in ErrorReport.txt:

Backtrace Generated by Error
Signal: [0x000000000000000b] PID: [0x00000000000010cd] PC: [0x00007f898d8c55e0] FP: [0x00007fff4910fe00] SIGSEGV: SI_ADDR : [0x0000000000000000]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x2c6efc5) [0x39056b5]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x2cbfd44) [0x3956434]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x2cc10b6) [0x39577a6]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x2cc11c9) [0x39578b9]
(__restore_rt+0x0) [0x7f898d8c3c10]
(__lll_unlock_elision+0x30) [0x7f898d8c55e0]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x31cae81) [0x3e61571]
(__libc_fork+0x26e) [0x7f898d1be4be]
(_ZNSt6vectorIPcSaIS0_EEaSERKS2_+0x6010e5) [0x12977d5]
(_init+0x15114d) [0x5677d5]
(_init+0x476b9) [0x45dd41]
(__libc_start_main+0xf5) [0x7f898d122725]
(_init+0x14c1e9) [0x562871]
END BACKTRACE
THREAD CONTEXT
Thread type: Main Thread
Request: Unknown request
END THREAD CONTEXT
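
The frame that ties this crash to lock elision is __lll_unlock_elision, which is the glibc routine that releases an elided lock. As a rough sketch, the check below looks for that frame together with the SIGSEGV marker in a saved ErrorReport.txt. The function name is_elision_crash is a hypothetical label chosen for this example.

```shell
#!/bin/sh
# is_elision_crash: hypothetical helper, not shipped with Vertica.
# Exits 0 when the given report file contains both a SIGSEGV marker
# and the __lll_unlock_elision frame shown in the backtrace above.
is_elision_crash() {
    report_file="$1"
    grep -q 'SIGSEGV' "$report_file" &&
        grep -q '__lll_unlock_elision' "$report_file"
}
```

A match suggests, but does not prove, that the crash is the one described in this section.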

Cause

Some Intel CPUs do not handle Hardware Lock Elision (HLE) correctly.

Solution

Disable Hardware Lock Elision (HLE) by making the dynamic linker load the noelision libraries.

To do this, add the absolute path of the directory that contains the noelision libraries to the library load path:

  1. Edit /etc/ld.so.conf as shown below, adding /lib64/noelision as the first line of the file:

    /lib64/noelision
    /usr/local/lib64
    /usr/local/lib
    include /etc/ld.so.conf.d/*.conf
  2. Run the ldconfig command.
  3. Run the ldconfig -p | grep noel command. The output should include "noelision" in the target path:

    libpthread.so.0 (libc6,x86-64, OS ABI: Linux 3.0.0) => /lib64/noelision/libpthread.so.0
  4. Run the ldd /opt/vertica/bin/vertica | grep libpthread command. The output should again include "noelision" in the target path:

    libpthread.so.0 => /lib64/noelision/libpthread.so.0 (0x00007fd4c0275000)
  5. Create the database again.
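
Step 1 can be scripted so that it is safe to repeat across nodes. The following is a minimal sketch that assumes the noelision directory is /lib64/noelision, as in the steps above. The function name prepend_noelision is hypothetical, and the file to edit is passed as an argument so that the change can be tried on a copy first.

```shell
#!/bin/sh
# prepend_noelision: hypothetical helper, not a SLES or Vertica tool.
# Adds /lib64/noelision as the first line of the given ld.so.conf-style
# file. If an identical line already exists anywhere in the file, the
# file is left unchanged (an existing entry is not moved to the top).
prepend_noelision() {
    conf_file="$1"
    entry='/lib64/noelision'
    if grep -qxF "$entry" "$conf_file"; then
        return 0    # already present, nothing to do
    fi
    tmp_file="$(mktemp)" || return 1
    { printf '%s\n' "$entry"; cat "$conf_file"; } > "$tmp_file" || return 1
    # Write back through cat so the original file keeps its owner and mode.
    cat "$tmp_file" > "$conf_file" && rm -f "$tmp_file"
}
```

Run as root, prepend_noelision /etc/ld.so.conf && ldconfig covers steps 1 and 2. Steps 3 and 4 remain the way to confirm that libpthread now resolves to the noelision copy.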

CentOS/RHEL

If you are running Vertica on CentOS 7.x or Red Hat Enterprise Linux 7.x, you may experience a Vertica server process failure due to a known issue with glibc in CentOS/RHEL. This section explains the root cause and what you should do to resolve the issue.

Root Cause of the CentOS/RHEL Issue

The problem that causes the Vertica failure stems from the fact that an important glibc bug fix has not been applied to several distributions of RHEL 7.x and downstream distributions like CentOS 7.x.

The glibc bug fix that is missing is described here: 

https://www.sourceware.org/bugzilla/show_bug.cgi?id=15073

Red Hat has released a fix, available here:

https://rhn.redhat.com/errata/RHBA-2016-1030.html

The fix is not yet available on CentOS. We will publish an update as soon as this fix is available on CentOS.

Note: This issue appears in Vertica running on RHEL and CentOS 7.x distributions only. The issue does not appear with Ubuntu and Debian distributions of Linux.

What you’ll see if this problem occurs

  1. If this problem occurs, the Vertica server process fails, and you’ll see the following error in the <CATALOG_DIRECTORY>/dbLog file:

    *** Error in `/opt/vertica/bin/vertica': invalid fastbin entry (free): 0x00007ef70f209800 ***
    ======= Backtrace: =========
    0x7f0614f0efe1(/lib64/libc.so.6):  + 0x7cfe1
    0x2a1e014(/opt/vertica/bin/vertica) CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
  2. In addition, you’ll notice that the vertica.log file appears as if it was truncated at an arbitrary place, sometimes in the middle of a line.
  3. Finally, in the core file for the failure, the following pattern appears at the top of the stack:

    raise
    abort
    __libc_message
    _int_free          <==========
    CAT::TabColPair_pairToBytes2(void const*, void*, unsigned long)
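
The first symptom is easy to check from a shell. The sketch below searches a dbLog file for the literal glibc message shown above. The function name has_fastbin_error is a hypothetical label, and the path passed to it is whatever <CATALOG_DIRECTORY>/dbLog resolves to on your node.

```shell
#!/bin/sh
# has_fastbin_error: hypothetical helper, not shipped with Vertica.
# Exits 0 when the given file contains the glibc message
# "invalid fastbin entry (free)", matched as a fixed string.
has_fastbin_error() {
    grep -qF 'invalid fastbin entry (free)' "$1"
}
```

The message alone does not identify the glibc bug, so treat a match as a prompt to check the glibc version rather than as a diagnosis.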

Solution

Upgrading glibc

Important: To resolve this issue, the glibc version must be glibc-2.17-106.el7_2.6.x86_64 or higher.
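
To see whether a node already meets this minimum, the installed version can be compared with 2.17-106.el7_2.6. The following is a minimal sketch that relies on GNU sort -V, which approximates but is not identical to RPM's own version ordering. The function name version_ge is hypothetical.

```shell
#!/bin/sh
# version_ge: hypothetical helper. Exits 0 when version $1 sorts at or
# after version $2 under GNU "sort -V". This is an approximation of
# RPM version comparison, which is adequate for builds that differ in
# the numeric release field.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example use on a RHEL or CentOS node (left commented out because it
# requires rpm and an installed glibc package):
#   installed="$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' glibc | head -n1)"
#   version_ge "$installed" '2.17-106.el7_2.6' && echo "glibc meets the minimum"
```

If the comparison fails, continue with the upgrade steps in this section.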

To upgrade glibc

1. Stop the database as dbadmin: 

admintools -t stop_db -d <database_name> 

2. Run the following command as root on all nodes: 

yum update glibc

3. Run the following command as dbadmin: 

admintools -t start_db -d <database_name>

If you are on CentOS, you should contact your operating system vendor and request a fix for this issue.

You can also choose to build the latest glibc 2.17 from source. Vertica recommends testing this process in a staging area before implementing it in production. As with any major operation on your system, Vertica recommends backing up your system before this operation.

How to determine whether you have the affected glibc

Important: If you have already upgraded to glibc-2.17-106.el7_2.6.x86_64 or higher, you must not run any of the commands in this section.

To determine whether the patch has been applied to your glibc, you can either:

  • Run the objdump utility, or
  • Examine the libc.so file manually

Run the objdump utility

  1. Find your libc.so file using the following command:

    ldd /opt/vertica/bin/vertica | grep libc.so
    libc.so.6 => /lib64/libc.so.6 (0x00007ff6dd99e000)
  2. Run the objdump utility as shown below to determine whether the fix has been applied:

    ## example of buggy libc
    objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
       7ca16: 48 85 c9                test   %rcx,%rcx
    Your libc is likely buggy.
    ## example of good libc
    objdump -r -d /lib/x86_64-linux-gnu/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 | (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
     
    Your libc looks OK.
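
One weakness of the one-line check is that it also prints "Your libc looks OK." when the pipeline produces no output at all, for example when the libc path is wrong. The sketch below applies the same heuristic (inspect the third line after the first cmpxchg) but reports "unknown" when no such line exists. The function name classify_int_free is a hypothetical label, and the function expects the 21-line excerpt on standard input.

```shell
#!/bin/sh
# classify_int_free: hypothetical helper, not a Vertica tool.
# Reads an objdump excerpt on stdin, finds the first line containing
# cmpxchg, and inspects the third line after it:
#   contains %r      -> "likely buggy"
#   no %r            -> "looks OK"
#   no such line     -> "unknown" (empty input or no cmpxchg found)
classify_int_free() {
    awk '
        /cmpxchg/ && !found { found = 1; target = NR + 3 }
        found && NR == target { line = $0 }
        END {
            if (line == "")        print "unknown"
            else if (line ~ /%r/)  print "likely buggy"
            else                   print "looks OK"
        }'
}
```

For example: objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21 | classify_int_free. As with the original command, the result is a heuristic, and the manual examination described next is the more reliable check.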

Examine the libc.so file manually

You can also choose to examine your libc in its entirety and identify whether the fix has been applied or not. The following example contains the string ‘test   %dil,%dil’. This means that the fix has been applied:

objdump -r -d /lib64/libc-2.12.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
 32cd8786cb:   40 20 f7                and    %sil,%dil
 32cd8786ce:   74 0c                   je     32cd8786dc <_int_free+0xec>
 32cd8786d0:   4c 8b 42 08             mov    0x8(%rdx),%r8
 32cd8786d4:   41 c1 e8 04             shr    $0x4,%r8d
 32cd8786d8:   41 83 e8 02             sub    $0x2,%r8d
 32cd8786dc:   48 89 53 10             mov    %rdx,0x10(%rbx)
 32cd8786e0:   48 89 d0                mov    %rdx,%rax
 32cd8786e3:   64 83 3c 25 18 00 00    cmpl   $0x0,%fs:0x18
 32cd8786ea:   00 00
 32cd8786ec:   74 01                   je     32cd8786ef <_int_free+0xff>
 32cd8786ee:   f0 48 0f b1 19          lock cmpxchg %rbx,(%rcx)
 32cd8786f3:   48 39 c2                cmp    %rax,%rdx
 32cd8786f6:   75 c0                   jne    32cd8786b8 <_int_free+0xc8>
 32cd8786f8:   40 84 ff                test   %dil,%dil             <==** likely good**==
 32cd8786fb:   74 09                   je     32cd878706 <_int_free+0x116>
 32cd8786fd:   41 39 e8                cmp    %ebp,%r8d
 32cd878700:   0f 85 05 07 00 00       jne    32cd878e0b <_int_free+0x81b>
 32cd878706:   48 83 c4 28             add    $0x28,%rsp
 32cd87870a:   5b                      pop    %rbx
 32cd87870b:   5d                      pop    %rbp
 32cd87870c:   41 5c                   pop    %r12

The following example does not contain the string ‘test   %dil,%dil’. This means that the fix has not been applied:

objdump -r -d /lib64/libc-2.17.so | grep -C 20 _int_free | grep -C 10 cmpxchg | head -21
 
 7c9ec:       48 85 c9                test   %rcx,%rcx
 7c9ef:       74 09                   je     7c9fa <_int_free+0xda>
 7c9f1:       8b 41 08                mov    0x8(%rcx),%eax
 7c9f4:       c1 e8 04                shr    $0x4,%eax
 7c9f7:       8d 70 fe                lea    -0x2(%rax),%esi
 7c9fa:       48 89 4b 10             mov    %rcx,0x10(%rbx)
 7c9fe:       48 89 c8                mov    %rcx,%rax
 7ca01:       64 83 3c 25 18 00 00    cmpl   $0x0,%fs:0x18
 7ca08:       00 00
 7ca0a:       74 01                   je     7ca0d <_int_free+0xed>
 7ca0c:       f0 48 0f b1 1a          lock cmpxchg %rbx,(%rdx)
 7ca11:       48 39 c1                cmp    %rax,%rcx
 7ca14:       75 ca                   jne    7c9e0 <_int_free+0xc0>
 7ca16:       48 85 c9                test   %rcx,%rcx               <==**likely buggy**===
 7ca19:       74 09                   je     7ca24 <_int_free+0x104>
 7ca1b:       44 39 e6                cmp    %r12d,%esi
 7ca1e:       0f 85 84 08 00 00       jne    7d2a8 <_int_free+0x988>
 7ca24:       48 83 c4 48             add    $0x48,%rsp
 7ca28:       5b                      pop    %rbx
 7ca29:       5d                      pop    %rbp
 7ca2a:       41 5c                   pop    %r12