NVIDIA GRID on Pascal - A Card Comparison
I was all excited to post this, as I had spent a decent amount of time on this subject leading up to ordering our new servers and hadn't seen anyone else address it yet. Then yesterday Jan Hendricks posted his thoughts on the subject, which mirrored many of my findings (https://jhmeier.com/2017/11/15/data-comparison-of-nvidia-grid-tesla-p4-and-p40/). I guess that motivated me to finally finish my thoughts and get them posted as well!
NVIDIA GRID 5.0 & Tesla Pascal Cards
With NVIDIA GRID release 5.0 came support for Pascal GPUs. While this brought many things (covered in many other places), the most interesting part for me was that ANY Tesla Pascal GPU now works with NVIDIA GRID. This opened us up to a wider array of options right away, with 3 PCIe cards already generally available and an MXM option. While NVIDIA pushed the P40 as the natural replacement for the M60, I wanted to look at all of our options.
Photo Credit: http://www.nvidia.com/object/accelerate-inference.html
In the following tables I've provided some information on previous generation cards as well for comparison. If you're starting from scratch, this isn't that important, but if you have an environment that is working, you can compare the performance you'd get if you upgraded. For us, the K2 typically kept up with the workloads we threw at it (at least until fully loaded), so using that as a point of comparison was a logical starting point.
General Card Information:
As you can see, all the Pascal generation cards have only a single GPU. That makes things a bit different than the previous models (which had 2-4 GPUs per card), especially when we talk about performance and density.
Performance Data:
There is some data missing here... I couldn't find readily available numbers for the K2 and M10. I have 4 K2s, but they are in production and I haven't had a chance to test them (I will when they get retired). The real purpose of this data is to show that the M60 is roughly 2x faster than the K2, and that even the P4 & P6 Pascal cards outpace the M60.
Cost and Density:
This is the table that really helps make decisions. Assuming you don't have special workloads needing double or half precision (we don't, but if you do, see this article for more in-depth DP, HP, and SP performance specifications: https://www.nextplatform.com/2016/09/13/nvidia-pushes-deep-learning-inference-new-pascal-gpus/) and are looking at these cards primarily for vGPU, the column that tells us which Pascal cards to consider is Cost/User. While this table calculates cost per user based on a 1 GB profile (not including license costs), the number scales linearly. The only 2 cards in the Tesla Pascal range that make sense for our vGPU needs are the P40 and the P4 (obviously the P6 would come into play in blade scenarios).
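To make that scaling concrete, here's a minimal sketch of the cost-per-user math in Python. The frame buffer sizes are the published card specs, but the prices are placeholder figures I've used for illustration; substitute your actual quotes.

```python
# Cost-per-user sizing sketch. Frame buffer sizes are published specs;
# the prices are PLACEHOLDERS for illustration -- use your own quotes.

CARDS = {
    # name: (frame buffer in GB, placeholder price in USD)
    "P40":  (24, 5700),
    "P4":   (8,  2400),
    "P6":   (16, 2000),   # MXM card for blades
    "P100": (16, 6000),
}

PROFILE_GB = 1  # the 1 GB vGPU profile assumed throughout this post

for name, (fb_gb, price) in CARDS.items():
    users = fb_gb // PROFILE_GB              # users per card at this profile
    print(f"{name}: {users} users/card, ${price / users:,.0f}/user")
```

Because users per card is just frame buffer divided by profile size, doubling the profile halves the users and doubles the cost per user, hence "scales linearly."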
Now that we've narrowed the search to the P40 and the P4, let's dive into the main differences.
Why would I choose the P40?
The P40 has a few advantages over the P4.
- Density - The P40 is a dual-slot PCIe card which packs 24 GB of frame buffer. Using the 1 GB profile assumed in our calculations (again, this scales linearly), we get 24 users in a dual-slot card. This beats the P4, a single-slot PCIe card with only 8 GB of frame buffer, which gives us 16 users across two single slots.
Typically in our workloads, CPU becomes a limiting factor before we would run out of GPU slices... Keep in mind that newer servers support 2-4 P40s and up to 6 P4s (see the NVIDIA support matrix: http://www.nvidia.com/object/grid-certified-servers.html). Assuming our 1 GB profile, that's 48 users on P4s and 48-96 users on P40s (see the sketch after this list). Even with the newest Intel Xeon Gold 61xx CPUs, I don't anticipate getting that kind of density for our users.
- Raw performance - If you have workloads that need more performance, the P40 provides more CUDA cores and more raw performance (see the performance data chart). In AEC this might come into play for GPU-based rendering. See the notes on the P4 for additional considerations here...
- Larger Profiles - The P4 is limited to 8 GB profiles, since its single GPU only has 8 GB of frame buffer. If you need frame buffers larger than 8 GB (we've made do with 512 MB buffers on the K2, so we do not fall into this category), you will need to go with the P40.
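To put numbers on the density argument, here's a quick sketch of the users-per-server math. The cards-per-server counts come from the certified server list linked above; treat them as assumptions to verify against your specific chassis.

```python
# Users-per-server density sketch at a given vGPU profile size.
# Cards-per-server counts (2-4 P40s, up to 6 P4s) are from NVIDIA's
# certified server list; verify them for your specific chassis.

def users_per_server(fb_gb: int, cards: int, profile_gb: int = 1) -> int:
    """vGPU users a host can carry: (frame buffer / profile) x cards."""
    return (fb_gb // profile_gb) * cards

print("2x P40:", users_per_server(24, 2))  # 48 users
print("4x P40:", users_per_server(24, 4))  # 96 users
print("6x P4: ", users_per_server(8, 6))   # 48 users
print("4x P4: ", users_per_server(8, 4))   # 32 users (what we ordered)
```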
Why would I choose the P4 (aka. Why did I choose the P4)?
Spoiler alert: this is the route we went with. While the P4 may lose out on density, as stated above we are CPU-bound there anyway, and the P4 has its own advantages that ended up making this an easy choice for us.
- NVENC* (see the footnote if you don't know what NVENC is and why it's so important) - This is the number 1 thing I noticed about the P4 vs. the P40. While the P40 has 2 encode engines on chip (each encoding approx. 12 H.264 1080p30 streams), those roughly 24 streams are spread across 24 users. Our users all have dual monitors, which means this card would support NVENC for only 12 of our possible 24 users. After the 24th stream (approximately, not literally) encoding falls back to the CPU, which reduces density and, more importantly, hurts the user experience.
Meanwhile, the P4 has 2 encode engines on its board supporting the same roughly 24 streams for what is now only 8 users... All users in this scenario should be able to stay on NVENC with 1 GB profiles and dual monitors.
- Scalability - Being a smaller shop with only a few servers, the P4 allows us to size at a smaller scale. I decided on 4 P4s (32 possible users); to get the same user count (and redundancy in case of card failure) I would have had to buy 2 P40s, at a higher cost, paying for 16 more users than I will likely ever get (due to CPU limits).
- Performance - Wait, didn't we give this edge to the P40? Well, yes and no. While the P40 has more raw performance, a closer look shows it may not be as clear cut as we think.
NVIDIA's vGPU sharing, in its simplest form, is a time-sliced arrangement. You can see this in some of the testing (https://www.wondernerd.net/blog/my-nvidia-grid-5-0-testing/), where different profiles run individually (no other workloads running) give nearly identical results. There are no other workloads to switch to, so the benchmark machine gets the whole power of the GPU the entire time.
So, how does the P4 win here? Well, if you don't need the full power of the P40's 3840 CUDA cores and the P4's 2560 CUDA cores get the job done for your workloads, you're actually going to see better performance on the P4. If I have 24 users on a P40, an individual user gets all 3840 CUDA cores for 1 cycle out of 24. If I max out the P4 with 8 users, each user gets all 2560 CUDA cores for 1 cycle out of 8. In the time I get 1 cycle on the P40, I get 3 cycles on the P4! With 2/3 of the performance (based on CUDA core count) and 3 times the access, we still end up 2x better off (see the sketch after this list)!
- Variable Profiles - While the ability to mix vGPU profiles on a single GPU may eventually come (an assumption on my part, with no insider knowledge), that is not the case today. A single 24 GB P40 locks you into 1 profile across the entire card. Multiple P4s let you assign different profiles to users as needed, since each card can run its own vGPU profile. Have a few users that might need 2 GB of frame buffer? Not an issue with the P4: dedicate 1 P4 to them (4 users) while the other 3 P4s in the host serve 1 GB profiles to a different 24 users.
- Power (and Heat) - One thing I don't highlight in my tables (because it wasn't an issue for me) is power. The P4 draws only 75W and does not require a 6- or 8-pin power connector, which should make it work in many more servers... This is in contrast to the 250W draw of the P40. Along with power comes heat, and more power = more heat... Heat is not our friend.
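To pull the NVENC and time-slicing arguments together, here's a back-of-the-napkin model. The ~12 streams per encode engine and the equal round-robin time slice are the simplifications used in this post, not NVIDIA-published guarantees.

```python
# Back-of-the-napkin model of the NVENC budget and time-slice math above.
# ASSUMPTIONS: ~12 H.264 1080p30 streams per encode engine, and a simple
# equal round-robin time slice per user -- simplifications, not specs.

STREAMS_PER_ENGINE = 12

def nvenc_users(engines: int, monitors_per_user: int) -> int:
    """Users who can stay on hardware encode before falling back to CPU."""
    return (engines * STREAMS_PER_ENGINE) // monitors_per_user

def cores_per_user(cuda_cores: int, users: int) -> float:
    """Effective CUDA cores per user under equal time slicing."""
    return cuda_cores / users

# P40: 24 users at 1 GB profiles, 2 encode engines, dual monitors
print("P40 NVENC users:", nvenc_users(2, 2))        # 12 of 24 users covered
print("P40 cores/user :", cores_per_user(3840, 24)) # 160.0

# P4: 8 users at 1 GB profiles, 2 encode engines, dual monitors
print("P4  NVENC users:", nvenc_users(2, 2))        # 24 streams covers all 8
print("P4  cores/user :", cores_per_user(2560, 8))  # 320.0 -- 2x the P40
```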
*What is NVENC?
NVENC is the ability of the NVIDIA card to encode the H.264 stream (available in VMware Blast Extreme and Citrix XenDesktop HDX 3D Pro) on dedicated chips, offloading that function from the CPU. Not only does this free up CPU for other operations (typically most of a vCPU), it also reduces the user's latency. NVIDIA describes this measurement of latency as "click to photon" (see more at: https://www.virtualexperience.no/2016/03/07/how-to-use-click-to-photon-to-measure-end-to-end-latency/). Encoding on the GPU can reduce latency by 30-130 ms.
*Why is NVENC so important?
The latency reduction of 30-130 ms mentioned above is half of why it's so important. That's huge for users who are used to a $4000+ box on their desk and are now working in VDI...
In our case, we have 1 major gripe about our existing K2 solution that NVENC helps solve as well. AutoCAD uses a special cursor (a cross-hair)... While a normal cursor is rendered client-side, that cursor is rendered server-side, encoded, and then sent over the wire. That encoding takes time, and the user perceives the cursor lag as slowness. While it may not affect their work (in terms of performance), it drives them crazy. There are other things we can do to help with this, but nothing gives that native experience back.
Additionally (as mentioned above) the offload of the encoding operation frees up a vCPU core that normally is burdened with this task. Less CPU load increases density while keeping UX high.
Want even more?
Seriously? I feel like you already read a book, but here are some other great resources from the community:
Anything else you want to know? Leave me a comment.