It's not clear to kernel developers what current and potential users of large pages want. In an effort to find out, we are conducting email interviews with Linux users, asking what their experiences have been and what functionality they would find useful.
Thanks in advance for answering our questions. This information will be shared with many Linux developers, so please be as detailed as you can in your answers. We don't need to know the name of your organization or application, but we would appreciate it if you could tell us those as well. If you want your response to be discussed at the 2006 OLS large pages BOF (see http://www.linuxsymposium.org/2006/view_abstract.php?content_key=289), please return your answers by July 10, 2006. Also, please consider attending the large pages BOF if you will be at OLS. Thanks!
The UK Astrophysical Fluids Facility (UKAFF)
A1 The UK Astrophysical Fluids Facility (UKAFF) is a national supercomputing facility for theoretical astrophysics research (http://www.ukaff.ac.uk).
A2 There are a variety of user-written applications, but probably the most relevant for a discussion of large pages are those which use a technique called Smoothed Particle Hydrodynamics (SPH). This is a particle-based (rather than grid-based) method of computing fluid dynamics which was originally developed for astrophysics by Joe Monaghan and his collaborators in 1977. For further information about the specifics of SPH, this article is probably a good starting point: http://ukads.nottingham.ac.uk/cgi-bin/nph-iarticle_query?1992ARA%26A..30..543M&data_type=PDF_HIGH&type=PRINTER&filetype=.pdf The Gadget-2 code used for the large "Millennium" cosmology simulation at Max Planck has some SPH elements to it. SPH is used at UKAFF for a variety of astrophysics simulations, such as this one: http://www.ukaff.ac.uk/starcluster/
A3 One problem with large SPH simulations for astrophysics problems such as those run on the UKAFF systems is the rate at which particles are mixed up. By this I mean that a particle's nearest neighbours change on a short timescale - unlike, for example, a smooth steady flow of water. Watch the movies of the star cluster simulation (URL above) or the neutron star mergers (http://www.ukaff.ac.uk/movies/nsmerger/) and you'll probably get the idea. This leads to memory accesses becoming very random - you need to calculate interactions between neighbouring particles, and these get mixed up. Sorting the particles regularly to reduce the randomness tends to carry a very high CPU time overhead. Partial sorts or less frequent sorting are often used to try to balance time spent sorting against time spent doing science, but this still leaves a significant level of randomness in the memory accesses. These memory accesses are then frequently cache misses, which introduces a high latency to each memory request. Increasing the page size from 4K to 16M significantly reduces this problem, as the number of TLB misses drops. Typically it will reduce runtimes by 25-30%, but in an extreme case I've seen an SPH code run 3x faster simply by enabling large pages.
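The arithmetic behind that speedup can be sketched roughly as follows. Note that the 1024-entry TLB size is an assumed, illustrative figure for this sketch, not a number taken from the facility's hardware:

```python
# Rough TLB-reach arithmetic: how much memory the TLB can map without
# misses. The 1024-entry TLB is an assumed, illustrative figure.

def tlb_reach(page_size_bytes, tlb_entries=1024):
    """Memory covered by a fully populated TLB, in bytes."""
    return page_size_bytes * tlb_entries

small = tlb_reach(4 * 1024)       # 4K pages
large = tlb_reach(16 * 1024**2)   # 16M pages

print(f"4K pages:  TLB covers {small // 2**20} MiB")   # 4 MiB
print(f"16M pages: TLB covers {large // 2**30} GiB")   # 16 GiB

# The same TLB maps 4096x more memory with 16M pages, so random
# accesses across a multi-gigabyte particle set rarely miss the TLB.
print(large // small)  # 4096
```

With 4K pages, a working set of tens of gigabytes means nearly every random access is a TLB miss; with 16M pages, most of it fits in the TLB's reach, which is consistent with the 25-30% runtime reductions described above.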
A4 We have no real experience, as it's completely unusable. With scientists writing their own codes rather than using a standard code, often in Fortran 77, there's no sensible way for them to implement large pages in their applications. Previously our users worked on an SGI system running IRIX, where large pages were enabled simply by setting kernel options on the system, after which each user set runtime environment variables. No code changes were required.
The current Linux implementation requires memory to be reserved explicitly for large pages, causing problems for any large-page application which doesn't fit within the reservation and for any small-page application which doesn't fit within the unreserved memory. Furthermore, as different applications need different amounts of large-page (or small-page) memory, in a production environment we would need to be able to change the amount of memory reserved for large pages efficiently. For an application that will use almost all of the 32GB on the system it's virtually impossible to dynamically reserve enough memory, and a reboot is necessary. Fine in a lab on a development system, but totally impractical for a production system.
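For readers unfamiliar with the interface being criticized, the current workflow looks roughly like the following sketch. The page count, mount point, and the binary name sph_simulation are illustrative, not taken from UKAFF's setup; the libhugetlbfs environment variables are the mechanism closest to the IRIX approach described above:

```shell
# Reserve huge pages up front (this is the explicit reservation the
# text objects to; it often fails once memory is fragmented):
echo 2048 > /proc/sys/vm/nr_hugepages

# Mount the hugetlbfs pseudo-filesystem so applications can map
# huge pages from it:
mkdir -p /mnt/hugepages
mount -t hugetlbfs none /mnt/hugepages

# With libhugetlbfs, an unmodified binary can then be pointed at
# huge pages purely via environment variables, much like IRIX:
HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./sph_simulation
```

The reservation step is the sticking point: if the pool is too small the large-page application fails, and if it is too large, ordinary small-page workloads are squeezed into whatever memory remains unreserved.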
A5 That depends on the hardware. On our pSeries systems we were told that we could use large pages on Linux, but this has turned out to be incorrect. The choice to use Linux was therefore wrong, as we cannot get good enough performance from these expensive systems.
If I were buying a standard x86-64-based server then it probably wouldn't affect my decision, as the much lower cost outweighs the loss of the extra performance. Having said that, if UKAFF's next system were x86-64-based, I'd certainly look to see whether there is usable large page support in Solaris if Linux hasn't improved by then.
This overflow must not be implemented by making the entire application fall back to small pages, as that is useless where we might have, for example, 85% of memory allocated to large pages and the application needs 90% of memory. Letting 5% use small pages is better than trying to fit the entire application into the 15% of unreserved space.
A4 Personally, not much, beyond assisting with getting libhugetlbfs into Fedora Extras and updating Red Hat's test suite to use the new version, which remedies some earlier problems we had getting it running on some arches within our test suite.
A1 Networking hardware/software for HPC: a high-performance network stack, usually exported to the application through the MPI interface. Very low latency (2 microseconds) and high bandwidth (10Gbit/s) with small CPU overhead (zero-copy communications) and overlapping of communication with computation.
A2 Large parallel applications including chemical computation, automobile crash-test simulation, fluid mechanics, ... Hundreds or thousands of nodes compute and exchange large amounts of data, with very different workloads. Message sizes vary from a couple of bytes to hundreds of megabytes.
A3 For direct transfer between application buffers and the networking device, we need to assemble the pointers to the application data and transfer them to the networking device. With huge pages this is about 1000 times faster than with normal pages, because there is 1000 times less metadata. Without huge pages, this metadata assembly and transfer management accounts for 98% of the CPU involvement in our communication system. Basically, huge pages allow us to reduce the CPU overhead from a lot to almost nothing.
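The "1000 times less metadata" claim follows directly from the page-size ratio. A rough sketch, assuming 4KB base pages, a 4MB huge page (one common size; the exact size is an assumption here), and an illustrative 256MB message:

```python
# Scatter/gather metadata arithmetic for zero-copy networking: the
# device needs one page pointer per page of the application buffer.
# Page and buffer sizes below are illustrative assumptions.

PAGE_4K = 4 * 1024
PAGE_4M = 4 * 1024**2   # one common huge-page size

def descriptors_needed(buffer_bytes, page_size):
    """Per-page pointers to hand to the network device (rounded up)."""
    return (buffer_bytes + page_size - 1) // page_size

buf = 256 * 2**20  # a 256 MB message

print(descriptors_needed(buf, PAGE_4K))  # 65536 pointers
print(descriptors_needed(buf, PAGE_4M))  # 64 pointers

# 1024x fewer descriptors to build and transfer per message - the
# "1000 times less metadata" figure, in round numbers.
```

Since building and shipping this pointer list is the dominant CPU cost of their zero-copy path, shrinking it by three orders of magnitude is what takes the CPU involvement "from a lot to almost nothing."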
A4 Very good. We want more support to let applications use huge pages for anonymous memory easily, and we want to be able to allocate huge pages dynamically without needing to reserve them first. We want our applications to allocate all their memory with huge pages by default, transparently.
A5 We already use Linux most of the time, but some customers still want other OSes, so we have to support FreeBSD, Solaris, Windows and Darwin too. Easy-to-use huge page support improving performance in Linux would be sufficient to motivate the non-Linux portion of our customers to make the switch.
A6 Huge pages allow us to communicate easily at 20Gb/s (or more when the network capacity is there) on existing platforms, whereas communication performance would otherwise drop, sometimes by 30% or more. This is not visible in most benchmarks, since the expensive part is done only during initialization, not in the actually timed execution. But in real applications the impact can be very large.