# mixbench

The purpose of this benchmark tool is to evaluate performance bounds of GPUs on mixed operational intensity kernels. The executed kernel is customized over a range of different operational intensity values. Modern GPUs are able to hide memory latency by switching execution to threads that can perform compute operations, so using this tool one can assess the practical optimum balance between the two types of operations for a GPU. CUDA, HIP, OpenCL and SYCL implementations have been developed.

Two executables will be produced for each platform:

- **mixbench-XXX-ro**: Consider this the primary implementation.
- **mixbench-XXX-alt**: Deprecated. It follows a different design approach than the former, so results typically differ slightly; which one exhibits better performance depends on the underlying architecture and compiler characteristics.

## Kernel types

Four types of experiments are executed, each combined with global memory accesses:

- Single precision Flops (multiply-additions)
- Double precision Flops (multiply-additions)
- Half precision Flops (multiply-additions)
- Integer multiply-addition operations

## Building

Building is now based on CMake files. Each implementation resides in a separate folder, so to build a particular implementation use the proper CMakeLists.txt, e.g. the one under the OpenCL implementation's folder for the OpenCL build.

## Output

For each run the benchmark reports the device specifications (e.g. "Compute throughput: 7464.96 GFlops (theoretical single precision FMAs)") and the trade-off type exercised (e.g. "compute with global memory (block strided)").
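The "practical optimum balance" mentioned above corresponds to the ridge point of the classic roofline model: the operational intensity at which a kernel stops being memory-bound and becomes compute-bound. A minimal sketch of that calculation follows, using the single-precision throughput quoted in the device specifications; the memory bandwidth figure is a purely hypothetical example value, not taken from this document:

```python
def roofline_gflops(intensity, peak_gflops, peak_gbs):
    """Attainable GFLOP/s at a given operational intensity (flops/byte),
    per the roofline model: performance is capped either by memory
    bandwidth or by peak compute throughput, whichever binds first."""
    return min(peak_gflops, intensity * peak_gbs)

PEAK_GFLOPS = 7464.96  # theoretical single-precision FMA throughput (from the specs above)
PEAK_GBS = 336.0       # memory bandwidth in GB/s -- an assumed example value

# Ridge point: the intensity beyond which the device is compute-bound.
ridge = PEAK_GFLOPS / PEAK_GBS
print(f"balance point: {ridge:.2f} flops/byte")
```

Since mixbench sweeps its kernel across a range of intensity values, plotting the measured GFLOP/s against intensity traces out this roofline empirically, and the knee of the measured curve gives the practical (rather than theoretical) balance point.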