Ntile, and why you should love it

Posted April 5, 2018 by Soniya Shah, Information Developer

white cloud in vault type room representing cloud computing
This blog post was authored by Curtis Bennett. According to a quick internet search, the hottest recorded temperature in Los Angeles was 109 degrees Fahrenheit, recorded on July 8th, 2017. If you had access to all kinds of weather data (spoiler alert: you do!) you could query the data and figure out the highs and lows for any given weather reporting station, and you could even trend the weather over time, in case that was interesting to you. The lowest temperature, incidentally, was 28 degrees, recorded on January 4th, 1949. If you were to run a min() and a max() on the weather in Los Angeles, you would see something like this: min | max -----+----- 28 | 109 (1 row) If you were going to take a trip to Los Angeles, you wouldn’t have any idea what to pack, based on those readings. According to that data, Los Angeles has extreme weather! There’s a better way – ntile. Ntile allows you to break down the data set into n groups, with each group showing the number of occurrences for that particular group. Basically, ntile explodes the result set so you can more easily find patterns in your data. For example, in Vertica, if you’re loading small data files, it’s best to use WOS. If you’re loading large files, putting the data directly into ROS with the DIRECT keyword is preferred. How well do you do in your environment following this practice? Loading large files into WOS (without the DIRECT keyword) will almost certainly result in a WOS spill, which is inefficient, and creates many ROS containers. Likewise, loading many very small files with the DIRECT keyword is slower, and can also create many ROS containers. The following query can show you the load behavior: =>select avg(processed_row_count), case when query ilike '%direct%' then 'direct' else 'WOS' end as type from query_profiles where table_name in (select table_name from query_profiles where query_type = 'LOAD') and query ilike 'copy %' group by 2 ; avg | type ——————+——– 613460.058823529 | WOS 1637375008.9375 | direct (2 rows) These results look OK, but they are inconclusive. These are just the average load sizes for all loads of each type. It’s hard to make any determination based on this data. Fortunately, there’s an easier way – ntile. =>select ntile, type, to_char(avg(processed_row_count),'999,999,999') as avg_row_count from (select processed_row_count , ntile(10) over (order by processed_row_count) as ntile , case when query ilike '%direct%' then 'direct' else 'WOS' end as type from query_profiles where table_name in (select table_name from query_profiles where query_type = 'LOAD') and query ilike 'copy %') sq group by ntile, type order by ntile; ntile | type | avg_row_count ——-+——–+—————— 1 | WOS | 1 1 | direct | 0 2 | WOS | 1 3 | WOS | 1 4 | WOS | 1 5 | WOS | 1 6 | WOS | 1 7 | WOS | 1 8 | WOS | 1 9 | WOS | 15 10 | WOS | 13,557,415 10 | direct | 2,015,230,780 (12 rows) From this output, we have a much better picture of loads in this environment. It appears as though the developer occasionally forgets to put the DIRECT keyword on large loads. (Don’t feel bad – it’s very common!), but generally does a very good job loading only very small data into WOS. If you like ntile, be sure to also check out ntile’s more pedantic evil twin, width_bucket ,which allows for many more configuration options.