Spread Debugging

What is Spread?

Vertica uses an open source toolkit, Spread, to provide a high-performance messaging service that is resilient to network failures. Spread daemons start automatically when your database starts for the first time. The Spread daemons run on the control nodes in your cluster, and the control nodes manage message communication.

Vertica Process and Spread Daemon Pairs

The Vertica installation procedure installs Spread daemons on your database. The Vertica process on a node communicates with the Spread daemon on that node using a domain socket. Between nodes, communication flows over two kinds of channels: the data channel (TCP) and the control channels (the UDP main channel and the UDP token channel). The following image shows a 4-node cluster with a domain socket and 2 channels.

  • The Vertica process and the Spread daemon on the same node connect to each other through a domain socket.
  • The Vertica process on one node connects to the Vertica process on another node through a TCP socket.
  • The Spread daemon on one node connects to the Spread daemon on another node through the UDP token and main channels.

How Vertica and Spread Work Together

If all the nodes are healthy, Vertica and Spread communicate seamlessly with each other.

Spread daemons communicate with one another by exchanging UDP packets. The two UDP channels (or ports) used are:

  1. The main channel, where Spread daemons send Spread-related control messages to other daemons, as well as Vertica-related control messages originating from the Vertica servers.
  2. A token channel, where a special message called a Spread token is passed from one Spread daemon to the next. The token is important for membership, fault tolerance, and for enforcing consistent control message ordering.

If Spread on one node stops communicating with the other Spread daemons in the cluster, the remaining Spread daemons remove that node from the cluster membership. Spread waits for a specific time, called the Spread Token Timeout, before removing a node from the membership.

Spread on a node communicates with the Vertica process on that node through a domain socket and uses a heartbeat to check its health. If the heartbeat is missed more than 5 times, the Spread daemon on that node evicts the Vertica process and notifies all other nodes through the UDP socket.

When Vertica and Spread Cannot Communicate

You may encounter situations when the communication between the Spread daemon and Vertica fails.

Spread Token Timeout

The Spread daemons monitor each other's health using a token mechanism. The token is a special Spread message passed from one node to another node in the current active membership using the User Datagram Protocol (UDP). The token verifies that the Spread daemons in all the nodes are active. The token is active for a specific time called Spread Token Timeout. If the user kills a Spread daemon, the Spread daemons on the other nodes in the cluster detect the killed Spread daemon through a token timeout.

The default Spread Token Timeout in Vertica 9.1 and earlier is 8 seconds. In Vertica 9.2 and later, the default Spread Token Timeout depends on the number of configured Spread segments. If the number of configured Spread segments is 1, the default timeout is 8 seconds. If it is more than 1, the default timeout is 25 seconds. If network delays or temporary pauses of a Virtual Environment last longer than the spread timeout period, you may see UP nodes leave the database. In these cases, you can increase the spread timeout to reduce or eliminate instances where UP nodes leave the database.
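The version- and segment-dependent defaults above can be summarized in a short sketch (the function name is illustrative, not a Vertica API):

```python
def default_spread_token_timeout(vertica_version, spread_segments):
    """Default Spread Token Timeout in seconds, per the rules above:
    8 seconds for Vertica 9.1 and earlier; from 9.2 on, 8 seconds
    with a single Spread segment, 25 seconds with more than one."""
    if vertica_version < (9, 2):
        return 8
    return 8 if spread_segments == 1 else 25

print(default_spread_token_timeout((9, 1), 4))   # 8
print(default_spread_token_timeout((9, 2), 1))   # 8
print(default_spread_token_timeout((10, 0), 3))  # 25
```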

In Azure, you may see running nodes leave the database due to scheduled maintenance. Azure's maintenance downtime is usually well-defined. For example, Azure's memory-preserving updates can pause a VM for up to 30 seconds while performing maintenance on the system hosting the VM. This pause does not disrupt the node; it continues normal operation once Azure resumes the VM. If you deploy your Vertica cluster using Azure Marketplace, the spread timeout defaults to 35 seconds. If you manually create your cluster in Azure, the spread timeout defaults to 8 or 25 seconds, as described earlier.

A longer token timeout means Spread needs more time to recognize a DOWN node and remove it from the cluster.

For more information about how to adjust Spread Token Timeout, see Adjusting Spread Daemon Timeouts For Virtual Environments.

For more information about the cause of the timeout, see the Troubleshooting Spread Token Timeout and Unhealthy Node section in this document.

Unhealthy Node

To decrease the impact of an unhealthy node in your Vertica database, Vertica performs health checks on a regular schedule. The time between checks is set by the DatabaseHeartBeatInterval parameter, which specifies the interval between the internal health checks performed by each node. After a successful health check, the node sends a heartbeat. If a heartbeat is not detected after 5 intervals have elapsed, the node is evicted from the database cluster.
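The eviction rule reduces to simple arithmetic; a sketch (the 120-second interval is a hypothetical example; check your own DatabaseHeartBeatInterval setting):

```python
def eviction_window_seconds(heartbeat_interval_s, missed_limit=5):
    """Worst-case time before an unhealthy node is evicted: the node
    is removed once `missed_limit` intervals elapse without a
    heartbeat being detected."""
    return heartbeat_interval_s * missed_limit

# With a hypothetical 120-second interval, an unhealthy node is
# evicted after 600 seconds (10 minutes) of missed heartbeats.
print(eviction_window_seconds(120))  # 600
```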

For more information about the cause of the timeout, see the Troubleshooting Spread Token Timeout and Unhealthy Node section in this document.

Vertica Process Crashed or Killed Abruptly Using kill -9

Normally, the Vertica process creates a PROBE message for the Spread daemon. However, if the Vertica process crashes or is terminated abruptly using kill -9, it cannot send a PROBE message to the other nodes or call the Spread API SP_disconnect.

If the Spread daemon for the killed Vertica process is still running, that Spread daemon detects the dropped socket connection of the killed Vertica process. It then notifies the other nodes in the cluster that the Vertica process has left the current membership.

Database is DOWN

If a node goes down, the Spread daemons on the other nodes detect the down Spread daemon through a token timeout. Suppose you have an 8-node cluster with a K-safety value of 1: if 4 nodes go down, or if 2 buddy nodes go down, the database goes down. If a node goes down but the database can still continue to operate, the Vertica process invokes the membership protocol so that the remaining Spread daemons create a new membership.
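The shutdown conditions in the 8-node example can be sketched as follows (a simplification that checks only majority quorum and whether a buddy pair is down, for K-safety 1):

```python
def database_is_down(total_nodes, down_nodes, buddy_pair_down):
    """True if the cluster must shut down: quorum is lost when half
    or more of the nodes are down, and a K=1 database also goes down
    when a node and its buddy (the node holding its replica) are
    both down."""
    lost_quorum = down_nodes * 2 >= total_nodes
    return lost_quorum or buddy_pair_down

# 8-node cluster with K-safety 1:
print(database_is_down(8, 4, False))  # True: quorum lost
print(database_is_down(8, 2, True))   # True: buddy pair down
print(database_is_down(8, 3, False))  # False: still operational
```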

Troubleshooting Spread Token Timeout and Unhealthy Node

Several scenarios can lead to a Spread token timeout or an unhealthy node. Run the following script as the dbadmin user to collect information from the cluster, and consider the possibility of each scenario.

collect_for_spread.sh

$ ./collect_for_spread.sh [-U username] [-w password] [-p port] [-f output_format] [-z dbhost] -s start_time -e end_time
 
Arguments description:
-U Database administrator account user name.
-w Database administrator account password, in single quotes. Can be skipped if no password is set.
-p Database server port. Can be skipped if the database is running on the default port 5433.
-f Specifies the output mode. Can be stdout or file. Default is stdout.
-s Specifies the start time, in single quotes. Format is YYYY-MM-DD HH24:MI:SS.
-e Specifies the end time, in single quotes. Format is YYYY-MM-DD HH24:MI:SS.
-z Optional. Specifies the database host to use when connecting to Vertica.
  1. Spread Ring Size

    There are timeouts when passing a token and creating a membership as the number of Spread daemons becomes large. Spread broadcasts every message (using UDP) to every other daemon in the configuration, resulting in increased network congestion. If you have more than 50 nodes, you must move to large cluster mode.

    For more information about how to configure large cluster mode, see Large Cluster in the Vertica Administrator's Guide.

  2. Network Bandwidth Starvation

    High network usage delays communication between the nodes.

    1. Check the network usage of your system by querying the DC_NETWORK_INFO system table. The script run earlier also collects the following network usage information.

      =======================================================
      b. Network Bandwidth Starvation
      =======================================================
      b-1. Network usage from DC_NETWORK_INFO
      =======================================================
       
      node_name     |     start_time      |      end_time       | tx_kbytes_per_sec | rx_kbytes_per_sec | total_kbytes_per_sec
      ------------------+---------------------+---------------------+-------------------+-------------------+----------------------
      v_trial_node0001 | 2020-08-04 04:44:59 | 2020-08-04 04:45:00 |          44035.44 |          11274.09 |             55309.53
      v_trial_node0002 | 2020-08-04 04:44:59 | 2020-08-04 04:45:00 |          13098.37 |          22710.12 |             35808.49
      v_trial_node0003 | 2020-08-04 04:44:59 | 2020-08-04 04:45:00 |           3116.41 |          24514.25 |             27630.66
      v_trial_node0004 | 2020-08-04 04:44:59 | 2020-08-04 04:45:00 |            5334.8 |          23950.17 |             29284.97
      v_trial_node0001 | 2020-08-04 04:45:00 | 2020-08-04 04:45:01 |          17377.85 |          28435.98 |             45813.83
      v_trial_node0002 | 2020-08-04 04:45:00 | 2020-08-04 04:45:01 |          17255.69 |          25164.83 |             42420.52
      v_trial_node0003 | 2020-08-04 04:45:00 | 2020-08-04 04:45:01 |          46322.48 |          19192.34 |             65514.82
      v_trial_node0004 | 2020-08-04 04:45:00 | 2020-08-04 04:45:01 |           9300.44 |             31563 |             40863.44
      .....
    2. Measure the network performance of your hosts using vnetperf, and compare the megabytes sent and received against the available bandwidth. The script run earlier also collects the result of vnetperf. In the following example, the available bandwidth is about 600 MB/s (4.6 Gbps).

      =======================================================
      b-2. Network performance from vnetperf
      =======================================================
       
      date                    | test              | rate limit (MB/s) | node    | MB/s (sent) | MB/s (rec)
      -----------------------------------------------------------------------------------------------------
      2020-08-04_05:38:35,962 | tcp-throughput    | 32                | average | 30.6099     | 30.6099
      2020-08-04_05:38:37,968 | tcp-throughput    | 64                | average | 61.1261     | 61.1261
      2020-08-04_05:38:39,972 | tcp-throughput    | 128               | average | 122.137     | 122.137
      2020-08-04_05:38:41,977 | tcp-throughput    | 256               | average | 244.212     | 244.212
      2020-08-04_05:38:43,980 | tcp-throughput    | 512               | average | 488.345     | 488.345
      2020-08-04_05:38:45,984 | tcp-throughput    | 640               | average | 588.28      | 588.204
      2020-08-04_05:38:47,988 | tcp-throughput    | 768               | average | 663.714     | 663.59
      2020-08-04_05:38:49,994 | tcp-throughput    | 1024              | average | 779.429     | 778.769
      2020-08-04_05:38:52,83  | tcp-throughput    | 2048              | average | 688.616     | 688.431
    3. Use Database Designer to tune your queries, or tune them manually. Unoptimized projections may cause high network usage due to the RESEGMENT/BROADCAST operators.

    4. If you continue to see high network usage, upgrade your network. If CPU usage is not high, you may be able to compress network data by enabling the CompressNetworkData configuration parameter. For example:

      => ALTER DATABASE DEFAULT SET CompressNetworkData = 1;
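To decide whether the network is the bottleneck, compare the per-node throughput reported by DC_NETWORK_INFO against the bandwidth ceiling measured by vnetperf. A minimal sketch (the 80% threshold is an illustrative rule of thumb, not a Vertica recommendation):

```python
def network_saturated(total_kbytes_per_sec, bandwidth_mb_per_sec, threshold=0.8):
    """True if observed throughput exceeds `threshold` of the
    measured bandwidth ceiling."""
    observed_mb = total_kbytes_per_sec / 1024
    return observed_mb > bandwidth_mb_per_sec * threshold

# Peak sample from DC_NETWORK_INFO (~65,515 KB/s) against the
# ~600 MB/s ceiling measured by vnetperf:
print(network_saturated(65514.82, 600))    # False: ~64 MB/s, well below 80% of ceiling
print(network_saturated(500 * 1024, 600))  # True: 500 MB/s exceeds 80% of ceiling
```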

  3. Disk I/O Bandwidth Starvation

    If the host has a large amount of RAM, the kernel's default settings allow a large portion of that memory to be used for write caching, which can result in a long full flush. A full flush blocks the Spread daemon for a long time, triggering token timeouts.

    If the disks cannot keep up with the incoming write requests from applications, the write cache grows beyond a threshold. The kernel then blocks all I/O requests until the cache is flushed below the defined threshold. Under high load, this sometimes leads to the node exiting the cluster. For example, set the following kernel parameters:

    vm.dirty_background_ratio = 3
    vm.dirty_ratio = 5

    The first parameter, vm.dirty_background_ratio, defines the threshold of write-cache usage. Upon reaching this threshold, the kernel performs a background flush to disk.

    The second parameter, vm.dirty_ratio, defines the threshold of total memory that can be dirty. If dirty memory exceeds this threshold, the kernel begins to block other I/O requests until a flush completes.

    For more information about how to tune these thresholds, see Tuning Linux Dirty Data Parameters for Vertica.
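The practical impact of these ratios depends on total RAM. The following sketch converts the percentage thresholds above into absolute sizes (the 256 GiB host is a hypothetical example):

```python
def dirty_thresholds_bytes(total_ram_bytes, background_ratio=3, dirty_ratio=5):
    """Convert the vm.dirty_* ratios (percent of RAM) into the
    absolute write-cache sizes at which the kernel starts a
    background flush and at which it blocks I/O."""
    background = total_ram_bytes * background_ratio // 100
    blocking = total_ram_bytes * dirty_ratio // 100
    return background, blocking

# On a 256 GiB host, ratios of 3 and 5 mean a background flush
# starts near 7.7 GiB of dirty pages and I/O blocks near 12.8 GiB.
GIB = 1024 ** 3
bg, block = dirty_thresholds_bytes(256 * GIB)
print(bg // GIB, block // GIB)  # 7 12
```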

    You can check Disk I/O usage with the following steps.

    1. Check the Disk I/O usage of your system by querying the DC_IO_INFO system table. The script run earlier also collects the following disk I/O usage information.

      =======================================================
      c. Disk I/O Bandwidth Starvation
      =======================================================
      c-1. Disk I/O usage from DC_IO_INFO
      =======================================================
      node_name     | device_name |     start_time      |      end_time       | read_mbytes_per_sec | written_mbytes_per_sec | total_mbytes_per_sec
      ------------------+-------------+---------------------+---------------------+---------------------+------------------------+----------------------
      v_trial_node0001 | sda3        | 2020-08-05 06:12:50 | 2020-08-05 06:12:51 |               78.43 |                 272.11 |               350.54
      v_trial_node0002 | sda3        | 2020-08-05 06:12:50 | 2020-08-05 06:12:51 |                  89 |                   22.5 |                111.5
      v_trial_node0003 | sda3        | 2020-08-05 06:12:50 | 2020-08-05 06:12:51 |                9.92 |                  72.07 |                81.99
      v_trial_node0004 | sda3        | 2020-08-05 06:12:50 | 2020-08-05 06:12:51 |               12.07 |                   50.4 |                62.47
      v_trial_node0001 | sda3        | 2020-08-05 06:12:51 | 2020-08-05 06:12:52 |                95.6 |                 188.33 |               283.93
      v_trial_node0002 | sda3        | 2020-08-05 06:12:51 | 2020-08-05 06:12:52 |                  76 |                 196.23 |               272.23
      v_trial_node0003 | sda3        | 2020-08-05 06:12:51 | 2020-08-05 06:12:52 |                5.98 |                 107.47 |               113.45
      v_trial_node0004 | sda3        | 2020-08-05 06:12:51 | 2020-08-05 06:12:52 |               19.91 |                    133 |               152.91
      v_trial_node0001 | sda3        | 2020-08-05 06:12:52 | 2020-08-05 06:12:53 |               88.32 |                  86.67 |               174.99
      v_trial_node0002 | sda3        | 2020-08-05 06:12:52 | 2020-08-05 06:12:53 |               78.01 |                 195.02 |               273.03
      v_trial_node0003 | sda3        | 2020-08-05 06:12:52 | 2020-08-05 06:12:53 |                   2 |                  92.27 |                94.27
      v_trial_node0004 | sda3        | 2020-08-05 06:12:52 | 2020-08-05 06:12:53 |                  19 |                   71.5 |                 90.5
      v_trial_node0001 | sda3        | 2020-08-05 06:12:53 | 2020-08-05 06:12:54 |                97.5 |                 164.29 |               261.79
      v_trial_node0002 | sda3        | 2020-08-05 06:12:53 | 2020-08-05 06:12:54 |                72.5 |                 213.17 |               285.67
      v_trial_node0003 | sda3        | 2020-08-05 06:12:53 | 2020-08-05 06:12:54 |                   0 |                  93.29 |                93.29
      v_trial_node0004 | sda3        | 2020-08-05 06:12:53 | 2020-08-05 06:12:54 |               18.85 |                  76.27 |                95.12
      .....
    2. Measure the disk I/O performance of your hosts using vioperf, and compare the megabytes read and written against the available bandwidth. The script run earlier also collects the result of vioperf. In the following example, the available bandwidth is about 340 MB/s for Write, 250 MB/s for Read, and 200 MB/s for ReWrite (read+write).

      =======================================================
      c-2. Disk I/O performance from vioperf
      =======================================================
      test      | directory                                       | counter name        | counter value | thread count | %CPU | %IO Wait
      ----------------------------------------------------------------------------------------------------------------------------------
      Write     | ...badmin/trial/v_trial_node0001_catalog/Catalog| MB/s                | 340           | 4            | 28   | 49
      ReWrite   | ...badmin/trial/v_trial_node0001_catalog/Catalog| (MB-read+MB-write)/s| 100+100       | 4            | 7    | 46
      Read      | ...badmin/trial/v_trial_node0001_catalog/Catalog| MB/s                | 250           | 4            | 17   | 45
      SkipRead  | ...badmin/trial/v_trial_node0001_catalog/Catalog| seeks/s             | 3199          | 4            | 2    | 38
    3. Tune the Linux dirty page parameters as described earlier. Use Database Designer to tune your queries, or tune them manually. Unoptimized projections may cause high disk I/O usage due to unoptimized encoding types.

  4. Network UDP Receive Error and TCP Retransmitted

    If you see a large number of UDP packet receive errors or TCP retransmitted segments, you may need to tune the kernel parameters as follows.

    1. Check the number of UDP network packet receive errors and TCP retransmitted segments by querying the DC_NETSTATS system table. The script run earlier also collects these numbers and ratios.

      =======================================================
      d. Network UDP Receive Error and TCP Retransmitted
      =======================================================
      node_name     |     start_time      |      end_time       | upd_packet_receive_errors | udp_packets_received | upd_packet_receive_error_ratio | tcp_segments_retransmited | tcp_segments_received | tcp_segments_retrans_ratio
      ------------------+---------------------+---------------------+---------------------------+----------------------+--------------------------------+---------------------------+-----------------------+----------------------------
      v_trial_node0001 | 2020-08-04 05:05:33 | 2020-08-04 05:05:34 |                         0 |                   63 |                              0 |                         0 |                 26233 |                          0
      v_trial_node0002 | 2020-08-04 05:05:33 | 2020-08-04 05:05:34 |                         0 |                   30 |                              0 |                         0 |                 12317 |                          0
      v_trial_node0003 | 2020-08-04 05:05:33 | 2020-08-04 05:05:34 |                         0 |                  260 |                              0 |                         0 |                 23149 |                          0
      v_trial_node0004 | 2020-08-04 05:05:33 | 2020-08-04 05:05:34 |                         0 |                  260 |                              0 |                         0 |                 21644 |                          0
      v_trial_node0001 | 2020-08-04 05:05:34 | 2020-08-04 05:05:35 |                         0 |                    0 |                              0 |                         0 |                 39041 |                          0
      v_trial_node0002 | 2020-08-04 05:05:34 | 2020-08-04 05:05:35 |                         0 |                  234 |                              0 |                         0 |                 35312 |                          0
      v_trial_node0003 | 2020-08-04 05:05:34 | 2020-08-04 05:05:35 |                         0 |                    0 |                              0 |                         0 |                 38657 |                          0
      v_trial_node0004 | 2020-08-04 05:05:34 | 2020-08-04 05:05:35 |                         0 |                    0 |                              0 |                         0 |                 35711 |                          0
      .....
    2. If you see a large number of UDP packet receive errors or TCP retransmitted segments, add the following parameters to the /etc/sysctl.conf file.

      # Sets the number of incoming connections
      net.core.somaxconn = 1024
      net.ipv4.tcp_max_syn_backlog = 1024
      # Sets the send socket buffer default size in bytes.
      net.core.wmem_default = 262144
      # Sets the receive socket buffer default size in bytes.
      net.core.rmem_default = 262144
      # Sets the send socket buffer maximum size in bytes.
      net.core.wmem_max = 16777216
      # Sets the receive socket buffer maximum size in bytes.
      # If you continue to see UDP packet receive errors, you can increase this value to 33554432.
      net.core.rmem_max = 16777216
      # Sets the maximum number of packets allowed to queue when a particular interface receives packets faster than the kernel can process them.
      # Increase the length of the processor input queue
      net.core.netdev_max_backlog = 100000
      # Sets the minimum, pressure, maximum number of memory pages used by TCP socket.
      net.ipv4.tcp_mem = 16777216 16777216 16777216
      # Sets the TCP socket buffer minimum, default, maximum size in bytes.
      net.ipv4.tcp_wmem = 8192 262144 8388608
      net.ipv4.tcp_rmem = 8192 262144 8388608
      # Sets the minimum, pressure, maximum number of memory pages used by UDP socket.
      net.ipv4.udp_mem = 16777216 16777216 16777216
      # Sets the UDP socket buffer minimum size in bytes.
      net.ipv4.udp_rmem_min = 16384
      net.ipv4.udp_wmem_min = 16384

      After adding parameters to the /etc/sysctl.conf file, run the following command:

      $ sysctl -p

      If you continue to see a large number of UDP packet receive errors or TCP retransmitted segments, you may need to turn off the TCP offload features on the network interface cards.

    3. Check the current configuration of the TCP offload features by using the following command. In this example, eth0 is the device name of the network interface card. Specify the device name of the network interface card used for Spread communication.

      $ /sbin/ethtool -k eth0
      Features for eth0:
      rx-checksumming: on
      tx-checksumming: on
      tx-checksum-ipv4: off
      tx-checksum-unneeded: off
      tx-checksum-ip-generic: on
      tx-checksum-ipv6: off
      tx-checksum-fcoe-crc: off [fixed]
      tx-checksum-sctp: off [fixed]
      scatter-gather: on
      tx-scatter-gather: on
      tx-scatter-gather-fraglist: off [fixed]
      tcp-segmentation-offload: on
      tx-tcp-segmentation: on
      tx-tcp-ecn-segmentation: off
      tx-tcp6-segmentation: on
      udp-fragmentation-offload: off [fixed]
      generic-segmentation-offload: on
      generic-receive-offload: on
      large-receive-offload: on
      rx-vlan-offload: on
      tx-vlan-offload: on
      ntuple-filters: off
      receive-hashing: on
      highdma: on [fixed]
      rx-vlan-filter: on [fixed]
      vlan-challenged: off [fixed]
      tx-lockless: off [fixed]
      netns-local: off [fixed]
      tx-gso-robust: off [fixed]
      tx-fcoe-segmentation: off [fixed]
      tx-gre-segmentation: off [fixed]
      tx-udp_tnl-segmentation: off [fixed]
      fcoe-mtu: off [fixed]
      loopback: off [fixed]
    4. Turn off the TCP offload features by using the following command. For example:

      $ /sbin/ethtool -K eth0 sg off tso off gso off gro off lro off rx off tx off
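The ratio columns in the DC_NETSTATS output above are simple quotients of the counters; the tuning decision can be sketched as follows (the 1% cutoff is an illustrative figure, not an official Vertica threshold):

```python
def needs_kernel_tuning(udp_errors, udp_received, tcp_retrans, tcp_received,
                        threshold=0.01):
    """Flag a host whose UDP receive-error ratio or TCP retransmit
    ratio exceeds `threshold` over the sampled interval."""
    def ratio(errors, total):
        return errors / total if total else 0.0
    return (ratio(udp_errors, udp_received) > threshold
            or ratio(tcp_retrans, tcp_received) > threshold)

# Counters from the healthy sample above:
print(needs_kernel_tuning(0, 63, 0, 26233))       # False
# A hypothetical host dropping 5% of its UDP packets:
print(needs_kernel_tuning(500, 10000, 0, 50000))  # True
```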

  5. Memory Starvation

    High memory usage makes Spread swap to disk, which causes delays in communication between the nodes.

    1. Check the memory usage of your system by querying the DC_MEMORY_INFO system table. The script run earlier also collects the following memory usage information.

      =======================================================
      e. Memory Starvation
      =======================================================
      e-1. Memory usage from DC_MEMORY_INFO
      =======================================================
      node_name     |     start_time      |      end_time       | average_memory_usage_percent
      ------------------+---------------------+---------------------+------------------------------
      v_trial_node0001 | 2020-08-04 05:08:59 | 2020-08-04 05:09:00 |                        43.53
      v_trial_node0002 | 2020-08-04 05:08:59 | 2020-08-04 05:09:00 |                        42.23
      v_trial_node0003 | 2020-08-04 05:08:59 | 2020-08-04 05:09:00 |                        42.42
      v_trial_node0004 | 2020-08-04 05:08:59 | 2020-08-04 05:09:00 |                         42.1
      v_trial_node0001 | 2020-08-04 05:09:00 | 2020-08-04 05:09:01 |                        43.59
      v_trial_node0002 | 2020-08-04 05:09:00 | 2020-08-04 05:09:01 |                        42.23
      v_trial_node0003 | 2020-08-04 05:09:00 | 2020-08-04 05:09:01 |                        42.48
      v_trial_node0004 | 2020-08-04 05:09:00 | 2020-08-04 05:09:01 |                        42.12
      .....
    2. Check the size of the catalog by querying the DC_ALLOCATION_POOL_STATISTICS system table. The script run earlier also collects the following catalog size information.

      =======================================================
      e-2. Catalog size
      =======================================================
      node_name     |         ts          |            catalog_size_in_mb
      ------------------+---------------------+------------------------------------------
      v_trial_node0001 | 2020-08-04 05:45:14 | 105.914031982421875000000000000000000000
      v_trial_node0002 | 2020-08-04 05:44:57 | 134.056930541992187500000000000000000000
      v_trial_node0003 | 2020-08-04 05:45:14 | 133.696777343750000000000000000000000000
      v_trial_node0004 | 2020-08-04 05:45:14 | 134.865783691406250000000000000000000000
    3. If the catalog size is larger than 4% of total memory size, reduce the general resource pool memory usage, and restart the cluster.

      For example, if the catalog size is 5 GB and the physical memory size is 100 GB, the default general pool MAXMEMORYSIZE of 95% amounts to 95 GB. Because the catalog information is loaded outside of the Vertica memory space, you need to reduce the MAXMEMORYSIZE of the general resource pool.

      To reduce the MAXMEMORYSIZE of the general resource pool, use the following commands:

      => SELECT name, maxmemorysize FROM v_catalog.resource_pools WHERE name = 'general';
      name   | maxmemorysize
      ---------+---------------
      general | Special: 95%
      => ALTER RESOURCE POOL general maxmemorysize '90%';
      NOTICE 2585:  Change takes effect upon restart.  Recovering nodes will use the new value
      => SELECT name, maxmemorysize FROM v_catalog.resource_pools WHERE name = 'general';
      name   | maxmemorysize
      ---------+---------------
      general | Special: 90%

      Alternatively, create a new dedicated resource pool to reserve memory for the catalog. Do not assign any query to this pool.

      To create a dedicated resource pool, use the following command:

      => CREATE RESOURCE POOL catalog_pool memorysize '5G';

    4. Another option to reduce the memory allocated by the Vertica process is to tune glibc memory allocation. Refer to the following page for the detailed tuning steps:

      https://forum.vertica.com/discussion/comment/243701
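The memory arithmetic behind step 3 can be sketched as follows (sizes taken from the 100 GB example above; the function is illustrative):

```python
def general_pool_headroom_gb(physical_gb, maxmemorysize_pct, catalog_gb):
    """Memory left outside the general pool for the catalog (and
    everything else) when MAXMEMORYSIZE is a percentage of RAM."""
    pool_gb = physical_gb * maxmemorysize_pct / 100
    return physical_gb - pool_gb - catalog_gb

# 100 GB host with a 5 GB catalog (5% of RAM, above the 4% guideline):
print(general_pool_headroom_gb(100, 95, 5))  # 0.0: no headroom at the 95% default
print(general_pool_headroom_gb(100, 90, 5))  # 5.0: headroom after lowering to 90%
```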

  6. CPU Bound

    High CPU usage causes delays in communication between the nodes.

    1. Check the CPU usage of your system by querying the DC_CPU_AGGREGATE system table. The script run earlier also collects the following CPU usage information.

      =======================================================
      f. CPU Bound
      =======================================================
      node_name     | processor_id |     start_time      |      end_time       | average_cpu_usage_percent
      ------------------+--------------+---------------------+---------------------+---------------------------
      v_trial_node0001 | aggregate    | 2020-08-04 04:30:29 | 2020-08-04 04:30:30 |    83.8400000000000000000
      v_trial_node0002 | aggregate    | 2020-08-04 04:30:29 | 2020-08-04 04:30:30 |   100.0000000000000000000
      v_trial_node0003 | aggregate    | 2020-08-04 04:30:29 | 2020-08-04 04:30:30 |   100.0000000000000000000
      v_trial_node0004 | aggregate    | 2020-08-04 04:30:29 | 2020-08-04 04:30:30 |    99.7500000000000000000
      v_trial_node0001 | aggregate    | 2020-08-04 04:30:30 | 2020-08-04 04:30:31 |    61.1700000000000000000
      v_trial_node0002 | aggregate    | 2020-08-04 04:30:30 | 2020-08-04 04:30:31 |    99.7500000000000000000
      v_trial_node0003 | aggregate    | 2020-08-04 04:30:30 | 2020-08-04 04:30:31 |   100.0000000000000000000
      v_trial_node0004 | aggregate    | 2020-08-04 04:30:30 | 2020-08-04 04:30:31 |   100.0000000000000000000
      .....
    2. If the CPU usage is very high, assign a dedicated CPU core to Spread.

      To check the number of CPU cores, use the following command:

      $ grep processor /proc/cpuinfo
      processor       : 0
      processor       : 1
      processor       : 2
      processor       : 3

      Prepare two configuration files with the following contents. For CPU_Vertica_Group, set cpuset.cpus to all CPU cores except the single core assigned to Spread.

      /etc/cgconfig.conf

      group CPU_Vertica_Group {
          cpuset {
              cpuset.mems="0";
              cpuset.cpus="0-2";
          }
      }
      group CPU_Spread_Group {
          cpuset {
              cpuset.mems="0";
              cpuset.cpus="3";
          }
      }

      /etc/cgrules.conf

      dbadmin:/opt/vertica/bin/vertica        cpuset  CPU_Vertica_Group
      dbadmin:/opt/vertica/spread/sbin/spread cpuset  CPU_Spread_Group

      Install the cgroup (control group) packages on all nodes and start the associated services.

      For example, in the case of RHEL/CentOS 6:

      $ yum install libcgroup
      $ chkconfig cgconfig on
      $ chkconfig cgred on
      $ service cgconfig start
      $ service cgred start

      In the case of RHEL/CentOS 7:

      $ yum install libcgroup libcgroup-tools
      $ systemctl start cgconfig
      $ systemctl start cgred
      $ systemctl enable cgconfig
      $ systemctl enable cgred

      In the case of RHEL/CentOS 8:

      $ dnf install gcc byacc flex pam-devel redhat-lsb-core
      
      $ wget https://github.com/libcgroup/libcgroup/releases/download/v2.0/libcgroup-2.0.tar.gz
      $ tar zxvf libcgroup-2.0.tar.gz
      $ cd libcgroup-2.0/
      $ ./configure --enable-initscript-install
      $ make
      $ make install
      
      
      $ mkdir /etc/cgconfig.d
       
      $ cp /usr/local/etc/rc.d/init.d/cgconfig /etc/rc.d/init.d/
      $ cp /usr/local/etc/rc.d/init.d/cgred /etc/rc.d/init.d/

      Note: During execution of the configure command, the following error occurs in libcgroup 2.0 and 2.0.1. If the configure command finishes successfully, you can ignore this error.

      libcgroup-2.0

      $ ./configure --enable-initscript-install
      .....
      checking whether to build static libraries... yes
      ./configure: line 12526: with_bindings: command not found
      checking for g++... g++

      libcgroup-2.0.1

      $ ./configure --enable-initscript-install
      .....
      checking whether to build static libraries... yes
      ./configure: line 12510: with_bindings: command not found
      checking for g++... g++

      If you prefer to avoid the error, add the --enable-bindings option:

      $ ./configure --enable-initscript-install --enable-bindings

      This option enables the Python bindings feature and should have no other impact.

      Additionally, in the case of RHEL/CentOS 8, modify the following files:

      [/etc/rc.d/init.d/cgred]
      # Should-Start:     --(modify to)--> # Default-Start:      2 3 4 5
      # Should-Stop:      --(modify to)--> # Default-Stop:       0 1 6
       
      [/etc/rc.d/init.d/cgconfig]
      # Should-Start:     --(modify to)--> # Default-Start:      2 3 4 5
      # Should-Stop:      --(modify to)--> # Default-Stop:       0 1 6

      Then, start up the services by running the following commands:

      $ systemctl enable cgconfig.service
      $ systemctl start cgconfig.service
      $ systemctl enable cgred.service
      $ systemctl start cgred.service

      In the case of Ubuntu 16.04 LTS:

      $ apt install cgroup-bin cgroup-lite cgroup-tools cgroupfs-mount libcgroup1
      $ cp /usr/share/doc/cgroup-tools/examples/cgred.conf /etc/
      $ cp /usr/share/doc/cgroup-tools/examples/cgsnapshot_blacklist.conf /etc/
      $ cp /usr/share/doc/cgroup-tools/examples/cgconfig /etc/init.d/
      $ cp /usr/share/doc/cgroup-tools/examples/cgred /etc/init.d/
      $ chmod 755 /etc/init.d/cgconfig
      $ chmod 755 /etc/init.d/cgred

      In addition, in the case of Ubuntu 16.04 LTS, modify the following files:

      [/etc/init.d/cgred]
      # Should-Start:     --(modify to)--> # Default-Start:      2 3 4 5
      # Should-Stop:      --(modify to)--> # Default-Stop:       0 1 6
      . /etc/rc.d/init.d/functions (remove this line)
      daemon --check $servicename --pidfile $pidfile $CGRED_BIN $OPTIONS     --(modify to)--> start_daemon -p $pidfile $CGRED_BIN $OPTIONS
      [/etc/init.d/cgconfig]
      # Should-Start:     --(modify to)--> # Default-Start:      2 3 4 5
      # Should-Stop:      --(modify to)--> # Default-Stop:       0 1 6

      Then, start up the services by the following commands:

      $ systemctl start cgroupfs-mount.service
      $ systemctl start cgconfig.service
      $ systemctl start cgred.service
      $ systemctl enable cgroupfs-mount.service
      $ systemctl enable cgconfig.service
      $ systemctl enable cgred.service

      Restart the cluster.
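
      After the restart, you can verify that each daemon landed in its intended cgroup. A minimal sketch, assuming pgrep and the /proc filesystem are available and that the process names match the binaries in the cgrules.conf entries above:

```shell
# For each daemon, find its oldest matching PID and print its cgroup
# membership from /proc. The group names should match cgrules.conf
# (CPU_Vertica_Group and CPU_Spread_Group in the example above).
for name in vertica spread; do
    pid=$(pgrep -o -x "$name")
    if [ -n "$pid" ]; then
        echo "== $name (pid $pid) =="
        cat "/proc/$pid/cgroup"
    else
        echo "== $name is not running =="
    fi
done
```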

Spread Logging

If you still hit the Spread issue, enable Spread logging and check whether any messages reveal the potential cause.

To enable Spread Logging, run the following queries:

=> SELECT set_spread_options('/opt/vertica/log/spread.log', 'MEMBERSHIP PRINT CONFIGURATION GROUPS SESSION PROTOCOL EXIT', 'ExitOnIdle = yes');
=> SELECT reload_spread(true);

The first argument of set_spread_options is the log file name. Note that this log file grows rapidly and can cause performance degradation.

The second argument is the list of debug flags. The available flags are as follows:

Flag           Function
PRINT          General information that should always be printed.
EXIT           Errors or other events that cause Spread to quit.
DEBUG          Debugging information.
DATA_LINK      Lowest level of sending and receiving datagrams.
NETWORK        Packing messages and setting who to talk with.
PROTOCOL       Ordering, token handling, and delivery algorithms.
SESSION        Per-user connection management.
CONFIGURATION  Parsing and loading the configuration file.
MEMBERSHIP     State and messages sent during membership changes.
FLOW_CONTROL   Flow-control state of the ring.
STATUS         Reporting of status information to the monitor.
EVENTS         All events (timed, fd-based) and the main loop.
GROUPS         Group state and group membership changes.
MEMORY         Memory debugging and allocation.
SKIPLIST       State of the skiplist data structure.
ALL            Enables all flags.
NONE           Disables all flags.

The third argument is the extra configuration. Do not change it.

To disable Spread Logging, run the following queries:

=> SELECT set_spread_options('/dev/null', 'PRINT EXIT', 'ExitOnIdle = yes');
=> SELECT reload_spread(true);
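
If logging must stay enabled for an extended period, rotating the file keeps its size in check. A hypothetical logrotate fragment is shown below — the size threshold and rotation count are arbitrary, and copytruncate is used on the assumption that Spread does not reopen its log file on rotation:

```
# /etc/logrotate.d/spread (hypothetical)
/opt/vertica/log/spread.log {
    size 100M
    rotate 4
    compress
    missingok
    copytruncate
}
```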

Sample Message

In this section, the following 3-node cluster is used as an example:

When node 3 is disconnected from the network, you can see the following messages in the vertica.log file on each node.

vertica.log on node#1 and node#2

Spread Service InOrder Queue:0x7fb7f05bc700 [Comms] <INFO> NETWORK change with 1 VS sets
Spread Service InOrder Queue:0x7fb7f05bc700 [Comms] <INFO> VS set #0 (mine) has 2 members (offset=24)
Spread Service InOrder Queue:0x7fb7f05bc700 [Comms] <INFO> VS set #0, member 0: #node_a#N192168001110
Spread Service InOrder Queue:0x7fb7f05bc700 [Comms] <INFO> VS set #0, member 1: #node_b#N192168001111
Spread Service InOrder Queue:0x7fb7f05bc700 [Comms] <INFO> DB Group changed

vertica.log on node#3

Poll dispatch:0x7fcc6cd23700 [Comms] <WARNING> Failed to send spmonitor request: Network is unreachable
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <WARNING> error SP_receive: Connection closed by spread
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <WARNING> error SP_receive: The network socket experienced an error. This Spread mailbox will no longer work until the connection is disconnected and then reconnected
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> stop: disconnecting #node_c#N192168001112 from spread daemon, Mbox=9
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> connected: false
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> DB Group changed
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Change in vDBgroupState clerk: was #node_a#N192168001110, now [none] (not me)
Spread Mailbox Dequeue:0x7fcc167fd700 [VMPI] <INFO> DistCall: Set current group members called with 0 members
Spread Mailbox Dequeue:0x7fcc167fd700 [VMPI] <INFO> Ending session v_trial_node0001-18304:0x3d due to loss of 45035996273704978
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeSetNotifier: node v_trial_node0001 left the cluster
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Running hooks after detecting a node loss
Spread Mailbox Dequeue:0x7fcc167fd700 [Session] <INFO> Resetting load balance cache. Internal version reset? [no]
Spread Mailbox Dequeue:0x7fcc167fd700 [Session] <INFO> Invalidating Load Balance Cache. Resetting internal version? no
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Node left cluster, reassessing k-safety...
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Setting node v_trial_node0003 to UNSAFE
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Node v_trial_node0001 erased from nodeToState map
Spread Mailbox Dequeue:0x7fcc167fd700 <LOG> @v_trial_node0003: 00000/3298: Event Posted: Event Code:6 Event Id:5 Event Severity: Informational [6] PostedTimestamp: 2020-08-31 10:41:47.188686 ExpirationTimestamp: 2088-09-18 13:55:54.188686 EventCodeDescription: Node State Change ProblemDescription: Changing node v_trial_node0003 startup state to UNSAFE DatabaseName: trial Hostname: vertica03c8.corp.vertica.local
Spread Mailbox Dequeue:0x7fcc167fd700 <LOG> @v_trial_node0003: 00000/3293: Event Cleared: Event Code:6 Event Id:6 Event Severity: Informational [6] PostedTimestamp: 2020-08-31 10:41:47.188734 ExpirationTimestamp: 2020-08-31 10:41:47.188734 EventCodeDescription: Node State Change ProblemDescription: Changing node v_trial_node0003 leaving startup state UP DatabaseName: trial Hostname: vertica03c8.corp.vertica.local
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Changing node v_trial_node0003 startup state from UP to UNSAFE
Spread Mailbox Dequeue:0x7fcc167fd700 <LOG> @v_trial_node0003: 00000/3298: Event Posted: Event Code:2 Event Id:0 Event Severity: Emergency [0] PostedTimestamp: 2020-08-31 10:41:47.188765 ExpirationTimestamp: 2020-08-31 10:51:47.188765 EventCodeDescription: Loss Of K Safety ProblemDescription: System is not K-safe: K=1 total number of nodes=3 DatabaseName: trial Hostname: vertica03c8.corp.vertica.local
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeSetNotifier: node v_trial_node0002 left the cluster
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Running hooks after detecting a node loss
Spread Mailbox Dequeue:0x7fcc167fd700 [Session] <INFO> Resetting load balance cache. Internal version reset? [no]
Spread Mailbox Dequeue:0x7fcc167fd700 [Session] <INFO> Invalidating Load Balance Cache. Resetting internal version? no
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Node left cluster, reassessing k-safety...
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Node v_trial_node0002 erased from nodeToState map
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeSetNotifier: node v_trial_node0003 left the cluster
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Running hooks after detecting a node loss
Spread Mailbox Dequeue:0x7fcc167fd700 [Recover] <INFO> Node left cluster, reassessing k-safety...
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Node v_trial_node0003 erased from nodeToState map
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> NodeHeartbeatManager: SP_stop_monitoring invoked
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <WARNING> NodeHeartbeatManager: SP_stop_monitoring failed with return code -18
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> NodeHeartbeatManager: Notifying the thread waiting on health_check message before disabling HeartbeatGenerator service
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Lost membership of the DB group
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Removing #node_a#N192168001110->v_trial_node0001 from processToNode and other maps due to departure from Vertica:all
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeToState map:
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Removing #node_b#N192168001111->v_trial_node0002 from processToNode and other maps due to departure from Vertica:all
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeToState map:
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Removing #node_c#N192168001112->v_trial_node0003 from processToNode and other maps due to departure from Vertica:all
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> nodeToState map:
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Lost membership of V:All
Spread Mailbox Dequeue:0x7fcc167fd700 [Dist] <INFO> TS: cluster (V:all) changed
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <WARNING> VSpread disconnected: lost connection to daemon /opt/vertica/spread/tmp/4803
Spread Mailbox Dequeue:0x7fcc167fd700 [Comms] <INFO> Spread dequeue thread exiting
SafetyShutdown:0x7fcbfdfff700 [Shutdown] <INFO> Shutting down this node

In the spread.log file, you can see the following messages on each node:

spread.log on node#1

Prot_token_hurry: retransmiting token 6 1
Prot_token_hurry: retransmiting token 6 2
Prot_token_hurry: retransmiting token 6 3
Prot_token_hurry: retransmiting token 6 4
Prot_token_hurry: retransmiting token 6 5
Prot_token_hurry: retransmiting token 6 6
Prot_token_hurry: retransmiting token 6 7
Prot_token_hurry: retransmiting token 6 8
Prot_token_hurry: retransmiting token 6 9
Prot_token_hurry: retransmiting token 6 10
Prot_token_hurry: retransmiting token 6 11
Prot_token_hurry: retransmiting token 6 12
Prot_token_hurry: retransmiting token 6 13
Prot_token_hurry: retransmiting token 6 14
Prot_token_hurry: retransmiting token 6 15
Prot_token_hurry: retransmiting token 6 0
Prot_token_hurry: retransmiting token 6 1
Prot_token_hurry: retransmiting token 6 2
Prot_token_hurry: retransmiting token 6 3
Prot_token_hurry: retransmiting token 6 4
Prot_token_hurry: retransmiting token 6 5
Prot_token_hurry: retransmiting token 6 6
Prot_token_hurry: retransmiting token 6 7
Sess_disable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 1
Sess_disable_heartbeats: disabling heartbeat monitoring on 'node_a' ( mailbox 9 ).
Memb_token_loss: ############### I lost my token, state was 1
Scast_alive: State is 2
Send_join: State is 4
Memb_handle_message: handling join message from 192.168.1.111 (1862379712), State is 4
Sess_read: received a heartbeat on 'node_a' ( mailbox 9 )
Pushed eviction timeout back 600.000000s
Create_form1: putting Aru = 11614 and Highest_Seq = 11614 on rg_info form1 token
Create_form1: SENT following token:
=========== Form Token ==========
FORM 1 Token, sent by 192.168.1.110 (1845602496). Seq: 14947
Configuration hash: 536829647 (local hash 536829647)
ProcID: 192.168.1.110 (1845602496)	 ARU: 0
ARU LastID: unknown (0)
FlowControl: 0	RTR Len: 45
Form Token members list -- Active (1) Pending (0)
0: 192.168.1.110 (1845602496)
Pending Members:
Form Token reps list -- Count (2) index (1)
0: 192.168.1.110 (1845602496) (T 1 SegInd 0) 	1: 192.168.1.111 (1862379712) (T 1 SegInd 1)
Form Token RING list -- Count (1)
Ring 0: MembID 192.168.1.110 (1845602496) - 1598592925	TransTime 0
ARU: 11614	HighSeq: 11614	NumHoles: 0
NumCommit: 1	NumTrans: 1
Message Holes:
Trans List:	0: 192.168.1.110
Commit List:
====================================================

spread.log on node#2

Memb_token_loss: ############### I lost my token, state was 1
Scast_alive: State is 2
Memb_handle_message: handling join message from 192.168.1.110 (1845602496), State is 2
Send_join: State is 4
Memb_handle_message: handling join message from 192.168.1.110 (1845602496), State is 4
Prot_handle_token: it is a Form Token.
Memb_handle_token: handling form1 token
Handle_form1 in GATHER
Fill_form1: RECEIVED following token:
=========== Form Token ==========
FORM 1 Token, sent by 192.168.1.110 (1845602496). Seq: 14947
Configuration hash: 536829647 (local hash 536829647)
ProcID: 192.168.1.110 (1845602496)	 ARU: 0
ARU LastID: unknown (0)
FlowControl: 0	RTR Len: 45
Form Token members list -- Active (1) Pending (0)
0: 192.168.1.110 (1845602496)
Pending Members:
Form Token reps list -- Count (2) index (1)
0: 192.168.1.110 (1845602496) (T 1 SegInd 0) 	1: 192.168.1.111 (1862379712) (T 1 SegInd 1)
Form Token RING list -- Count (1)
Ring 0: MembID 192.168.1.110 (1845602496) - 1598592925	TransTime 0
ARU: 11614	HighSeq: 11614	NumHoles: 0
NumCommit: 1	NumTrans: 1
Message Holes:
Trans List:	0: 192.168.1.110
Commit List:
====================================================

spread.log on node#3

Memb_token_loss: ############### I lost my token, state was 1
Scast_alive: State is 2
Scast_alive: State is 2
Scast_alive: State is 2
Scast_alive: State is 2
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Send_join: State is 4
Create_form1: putting Aru = 11614 and Highest_Seq = 11614 on rg_info form1 token
Create_form1: SENT following token:
=========== Form Token ==========
FORM 2 Token, sent by 192.168.1.112 (1879156928). Seq: 14947
Configuration hash: 536829647 (local hash 536829647)
ProcID: 192.168.1.112 (1879156928)	 ARU: 0
ARU LastID: unknown (0)
FlowControl: 0	RTR Len: 45
Form Token members list -- Active (0) Pending (1)
Pending Members:
0: 192.168.1.112 (1879156928)
Form Token Membership ID unknown (0) : 0
Form Token RING list -- Count (1)
Ring 0: MembID 192.168.1.110 (1845602496) - 1598592925	TransTime 0
ARU: 11614	HighSeq: 11614	NumHoles: 0
NumCommit: 1	NumTrans: 1
Message Holes:
Trans List:	0: 192.168.1.112
Commit List:
====================================================
Sess_disable_heartbeats: explict = 1, thresh = 0, heartbeats_on = 0
Sess_disable_heartbeats: disabling heartbeat monitoring on 'r9112-9' ( mailbox 9 ).
Cannot form trivial self-ring. Network is probably broken.Exit caused by Alarm!
Sess: unlinked domain socket file /opt/vertica/spread/tmp/4803; ret=0
Spread daemon exiting normally!
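
When scanning a long spread.log for an event like the one above, counting the telltale messages helps narrow down when the token was lost. A minimal sketch, demonstrated on an inline sample rather than the real /opt/vertica/log/spread.log:

```shell
# Create a small inline sample standing in for /opt/vertica/log/spread.log.
# ("retransmiting" is spelled as Spread actually logs it.)
cat <<'EOF' > /tmp/spread_sample.log
Prot_token_hurry: retransmiting token 6 1
Prot_token_hurry: retransmiting token 6 2
Memb_token_loss: ############### I lost my token, state was 1
Send_join: State is 4
Send_join: State is 4
EOF

# Count the messages that typically signal token trouble.
grep -c 'Prot_token_hurry' /tmp/spread_sample.log   # token retransmissions: 2
grep -c 'Memb_token_loss'  /tmp/spread_sample.log   # token losses: 1
grep -c 'Send_join'        /tmp/spread_sample.log   # join attempts: 2
```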

For More Information