4.1.00 (2023-06-16)
- Add <Kokkos_BitManipulation.hpp> header #4577 #5907 #5967 #6101
- Add UnorderedMapInsertOpTypes #5877 and documentation #350
- Add multiple reducers support for team-level parallel reduce #5727
- Allow NVCC 12 to compile using C++20 flag #5977
- Remove ability to disable CMake option Kokkos_ENABLE_CUDA_LAMBDA and unconditionally enable CUDA extended lambda support. #5964
- Drop unnecessary fences around the memory allocation when using CudaUVMSpace in views #6008
- Improve performance for parallel_reduce. Use different parameters for LightWeight kernels #6029 and #6160
- Only pass one wrapper object in SYCL reductions #6047
- Improve and simplify parallel_scan implementation #6064
- Remove workaround for submit_barrier not being enqueued properly #5504
- Fix guards for using scratch space with SYCL #6003
- Fix compiling SYCL with KOKKOS_IMPL_DO_NOT_USE_PRINTF_USAGE #6219
- Improve hierarchical parallelism for Intel architectures #6043
- Enable Cray compiler for the OpenMPTarget backend. #5889
- Update HPX backend to use HPX's sender/receiver functionality #5628
- Increase minimum required HPX version to 1.8.0 #6132
- Implement HPX::in_parallel #6143
- Export CMake Kokkos_{CUDA,HIP}_ARCHITECTURES variables #5919 #5925
- Add Kokkos::Profiling::ScopedRegion #5959 #5972
- Add support for View::rank[_dynamic]() #5870
- Detect incompatible relocatable device code mode to prevent ODR violations #5991
- Add (experimental) support for 32-bit Darwin and PPC #5916
- Add missing half and bhalf specialization of the infinity numeric trait #6055
- Add is_dual_view trait and align further with regular view #6120
- Allow templated functors in parallel_for, parallel_reduce and parallel_scan #5976
- Define KOKKOS_COMPILER_INTEL_LLVM and only define at most one KOKKOS_COMPILER* macro #5906
- Allow linking against build tree #6078
- Allow passing a temporary std::vector to partition_space #6167
- Kokkos can be used as an external dependency in Trilinos #6142, #6157 #6163
- Left align demangled stacktrace output #6191
- Improve OpenMP affinity warning to include MPI concerns #6185
- Drop Kokkos_ENABLE_LAUNCH_COMPILER option which had no effect #6148
- Export variables for relevant Kokkos options with cmake #6142
- Desul atomics always enabled #5801
- Drop KOKKOS_ENABLE_CUDA_ASM* and KOKKOS_ENABLE_*_ATOMICS macros #5940
- Drop KOKKOS_ENABLE_RFO_PREFETCH macro #5944
- Deprecate Kokkos_ENABLE_CUDA_LAMBDA configuration option and force it to ON #5964
- Remove TriBITS Kokkos subpackages #6104
- Cuda: Remove unused attach_texture_object #6129
- Drop Kokkos_ENABLE_PROFILING_LOAD_PRINT configuration option #6150
- Drop pointless Kokkos{Algorithms,Containers}_config.h files #6108
- Deprecate BinSort, BinOp1D, and BinOp3D default constructors #6131
- Fix SYCLTeamMember to take arguments for scratch sizes as std::size_t #5981
- Fix Kokkos_SIMD with AVX2 on 64-bit architectures #6075
- Fix an incorrectly returning size for SIMD uint64_t in AVX2 #6004
- Fix missing avx512 header file with gcc versions before 10 #6183
- Fix incorrect results of parallel_reduce of types smaller than int on CUDA and HIP #5745
- CMake: update package compatibility mode when building within Trilinos #6012
- Fix warnings generated from internal uses of ALL_t rather than Kokkos::ALL_t #6028
- Fix bug in hpcbind script: check for correct Slurm variable #6116
- KokkosTools: Don't call callbacks before backends are initialized #6114
- Fix global fence in Kokkos::resize(DynRankView) #6184
- Fix BinSort support for strided views #6081
- Fix missing is_*_view traits in containers #6195
- Fix broken OpenMP target on NVHPC #6171
- Sorting an empty view should exit early and not fail #6130
4.0.01 (2023-04-14)
- Add support for AMDGPU target NAVI31 / RX 7900 XT(X): gfx1100 #6021
- HIP: Fix warning from std::memcpy #6019
- Fix SYCLTeamMember to take arguments for scratch sizes as std::size_t #5986
- Fixup 4.0 change log #6023
- Cherry-pick TriBITS update from Trilinos #6037
- CMake: update package compatibility mode when building within Trilinos #6013
- Fix an incorrectly returning size for SIMD uint64_t in AVX2 #6011
- Desul atomics: wrong value for desul::Impl::numeric_limits_max<uint64_t> #6018
- Fix warning in some user code when using std::memcpy #6000
- Fix excessive build times using Makefile.kokkos #6068
4.0.0 (2023-02-21)
- Allow value types without default constructor in Kokkos::View with Kokkos::WithoutInitializing #5307
- parallel_scan with View as result type. #5146
- Introduced SharedSpace, an alias for a MemorySpace that is accessible by every ExecutionSpace. The memory is moved and then accessed locally. #5289
- Introduced SharedHostPinnedSpace, an alias for a MemorySpace that is accessible by every ExecutionSpace. The memory is pinned to the host and accessed via zero-copy access. #5405
- Add team- and thread-level sort, sort_by_key algorithms. #5317
- Groundwork for MDSpan integration. #4973 and #5304
- Introduced MD version of hierarchical parallelism: TeamThreadMDRange, ThreadVectorMDRange and TeamVectorMDRange. #5238
- Allow CUDA PTX forward compatibility #3612 #5536 #5527
- Add support for NVIDIA Hopper GPU architecture #5538
- Don't rely on synchronization behavior of default stream in CUDA and HIP #5391
- Improve CUDA cache config settings #5706
- Move HIP, HIPSpace, HIPHostPinnedSpace, and HIPManagedSpace out of the Experimental namespace #5383
- Don't rely on synchronization behavior of default stream in CUDA and HIP #5391
- Export AMD architecture flag when using Trilinos #5528
- Fix linking error (see OLCF issue) when using amdclang #5539
- Remove support for MI25 and add support for Navi 1030 #5522
- Fix race condition when using HSA_XNACK=1 #5755
- Add parameter to force using GlobalMemory launch mechanism. This can be used when encountering compiler bugs with ROCm 5.3 and 5.4 #5796
- Delegate choice of workgroup size for parallel_reduce with RangePolicy to the compiler. #5227
- SYCL RangePolicy: manually specify workgroup size through chunk size #4875
- Select the right device #5492
- Add partition_space #5105
- Implement OffsetView constructor taking pairs and ViewCtorProp #5303
- Promote math constants to Kokkos::numbers namespace #5434
- Add overloads of hypot math function that take 3 arguments #5341
- Add fma fused multiply-add math function #5428
- Views using MemoryTraits::Atomic don't need volatile overloads for the value type anymore. #5455
- Added is_team_handle trait #5375
- Refactor desul atomics to support compiling CUDA with NVC++ #5431 #5497 #5498
- Support finding libquadmath with native compiler support #5286
- Add architecture flags for MSVC #5673
- SIMD backend for ARM NEON #5829
- Let CMake determine OpenMP flags. #4105
- Update minimum compiler versions. #5323
- Makefile and CMake support for C++23 #5283
- Do not add -cuda to the link line with NVHPC compiler when the CUDA backend is not actually enabled #5485
- Only add -latomic in generated GNU makefiles when OpenMPTarget backend is enabled #5501 #5537 (3.7 patch release candidate)
- Kokkos_ENABLE_CUDA_LAMBDA now ON by default with NVCC #5580
- Fix enabling of relocatable device code when using CUDA as CMake language #5564
- Fix cmake configuration with CUDA 12 #5691
- Require C++17 #5277
- Turn setting Kokkos_CXX_STANDARD into an error #5293
- Remove all deprecations in Kokkos 3 #5297
- Remove KOKKOS_COMPILER_CUDA_VERSION #5430
- Drop reciprocal_overflow_threshold numeric trait #5326
- Move reduction_identity out of <Kokkos_NumericTraits.hpp> into a new <Kokkos_ReductionIdentity.hpp> header #5450
- Reduction and scan routines will report an error if the join() operator they would use takes volatile-qualified parameters #5409
- ENABLE_CUDA_UVM is dropped in favor of using SharedSpace as MemorySpace explicitly #5608
- Remove Kokkos_ENABLE_CUDA_LDG_INTRINSIC option #5623
- Don't rely on synchronization behavior of default stream in CUDA and HIP - this potentially will break unintended implicit synchronization with other libraries such as MPI #5391
- Make ExecutionSpace::concurrency() a non-static member function #5655 and related PRs
- Remove code guarded by KOKKOS_ENABLE_DEPRECATED_CODE_3
- Deprecate CudaUVMSpace::available() which always returned true #5614
- Deprecate volatile-qualified members from Kokkos::pair and Kokkos::complex #5412
- Deprecate KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_* macros #5824 (oversight in 3.6)
- Avoid allocating memory for UniqueToken #5300
- Fix pragma ivdep in Kokkos_OpenMP_Parallel.hpp #5356
- Fix configuring with Threads support when rerunning CMake #5486
- Fix View assignment between LayoutLeft and LayoutRight with static extents #5535 (3.7 patch release candidate)
- Add fence() calls to sorting routine overloads that don't take an execution space parameter #5389
- ClockTic changed to 64 bit to fix overflow on Power #5577 (incl. in 3.7.01 patch release)
- Fix incorrect offset in CUDA and HIP parallel_scan for < 4 byte types #5555 (3.7 patch release candidate)
- Fix incorrect alignment behavior of scratch allocations in some corner cases (e.g. very small allocations) #5687 (3.7 patch release candidate)
- Add missing ReductionIdentity<char> specialization #5798
- Don't install standard algorithms headers multiple times #5670
- Fix max scratch size calculation for level 0 scratch in CUDA and HIP #5718
3.7.02 (2023-05-17)
- Add Hopper support and update nvcc_wrapper to work with CUDA-12 #5693
- sprintf -> snprintf #5787
- Add error message when not using hipcc and when CMAKE_CXX_STANDARD is not set #5945
- Fix Scratch allocation alignment issues #5692
- Fix Intel Classic Compiler ICE #5710
- Don't install std algorithm headers multiple times #5711
- Fix static init order issue in InitializationSettings #5721
- Fix src/dst Properties in deep_copy(DynamicView,View) #5732
- Fix build on Fedora Rawhide #5782
- Finalize HIP lock arrays #5694
- Fix CUDA lock arrays for current Desul #5812
- Set the correct device/context in InterOp tests #5701
3.7.01 (2022-12-01)
- Add fences to all sorting routines not taking an execution space instance argument #5547
- Fix repeated team_reduce without barrier #5552
- Fix memory spaces in create_mirror_view overloads using view_alloc #5521
- Allow as_view_of_rank_n() to be overloaded for "special" scalar types #5553
- Fix warning calling a __host__ function from a __host__ __device__ from View::as_view_of_rank_n #5591
- OpenMPTarget: adding implementation to set device id. #5557
- Use Kokkos::atomic_load to Correct Race Condition Giving Rise to Seg Faulting Error in OpenMP tests #5559
- cmake: define KOKKOS_ARCH_A64FX #5561
- Only link against libatomic in gnu-make OpenMPTarget build #5565
- Fix static extents assignment for LayoutLeft/LayoutRight assignment #5566
- Do not add -cuda to the link line with NVHPC compiler when the CUDA backend is not actually enabled #5569
- Export the flags in KOKKOS_AMDGPU_OPTIONS when using Trilinos #5571
- Add support for detecting MPI local rank with MPICH and PMI #5570 #5582
- Remove listing of undefined TPL dependencies #5573
- ClockTic changed to 64 bit to fix overflow on Power #5592
- Fix incorrect offset in CUDA and HIP parallel scan for < 4 byte types #5607
- Fix initialization of Cuda lock arrays #5622
3.7.00 (2022-08-22)
- Use non-volatile join() member functions and operator+= in parallel_reduce/scan #4931 #4954 #4951
- Add SIMD sub package (requires C++17) #5016
- Add is_finalized() #5247
- Promote mathematical functions from namespace Kokkos::Experimental to namespace Kokkos #4791
- Promote min, max, clamp, minmax functions from namespace Kokkos::Experimental to namespace Kokkos #5170
- Add round, logb, nextafter, copysign, and signbit math functions #4768
- Add HIPManagedSpace, similar to CudaUVMSpace #5112
- Accept view construction allocation properties in create_mirror[_view,_view_and_copy] and resize/realloc #5125 #5095 #5035 #4805 #4844
- Allow MemorySpace::allocate() to be called with execution space #4826
- Experimental: Compile time view subscriber #4197
- Add support for Sapphire Rapids Intel architecture #5015
- Add support for ICX, SKL and ICL Intel architectures #5013 #4929
- Add arch flags for Intel GPU Ponte Vecchio #4932
- SYCL: require GPU if GPU architecture was set at configuration time (i.e. do not allow fallback to CPU device) #5264 #5222
- SYCL: Add SYCL::sycl_queue() for interoperability #5241
- SYCL: Loosen restriction for using built-in sycl::group_broadcast #4552
- SYCL: preserve address space #4396
- OpenMPTarget: Adding a workaround for team scan #5219
- OpenMPTarget: Adding logic to skip the kernel launch if league_size=0 #5067
- OpenMPTarget: Make sure Kokkos::abort() causes abnormal program termination when called on the host-side #4808
- HIP: Make HIPHostPinnedSpace coarse-grained #5152
- Refactor OpenMP parallel_for implementation to use more native OpenMP constructs #4664
- Add option to optimize for local CPU architecture Kokkos_ARCH_NATIVE #4930
- Add command line argument/environment variable to print the configuration #5233
- Improve error message in view memory access violations #4950
- Remove unnecessary fences in View initialization #4823
- Make View::shmem_size() device-callable #4936
- Update numerics support for __float128 #5081
- Add log10 overload for Kokkos::complex #5009
- Add [[nodiscard]] to ScopeGuard #5224
- Add structured binding support for Kokkos::Array #4962
- Enable accessing Kokkos::Array elements in constant expressions #4916
- Mark as_view_of_rank_n as KOKKOS_FUNCTION #5248
- Cleanup/rework fence overloads #5148
- Assert that Layout construction from extents is valid in functions taking integer extents #5209
- Add fill_random overload that takes an execution space as first argument #5181
- Avoid some unnecessary fences in parallel_reduce/scan #5154
- Include KOKKOS_ENABLE_LIBDL in options when printing configuration #5086
- DynRankView: make layout() return the same as a corresponding static View #5026
- Use _mm_malloc for icpx #5012
- Avoid forcing matching execution spaces in BinSort constructor and sort() #4919
- Check number of bins in BinSort #4890
- Improve performance in parallel STL-like algorithms #4887 #4886
- Disable memset on A64FX and launch parallel_for instead (performance) #4884
- Allow non-power-of-two team sizes for team reductions and scans #4809
- Warn when unable to detect local MPI rank and user explicitly asked for it #5263
- Refactor parsing of command line arguments and environment variables #5221
- Refactor device selection at initialization #5211
- Rename tools settings for consistency #5201
- Print help only once #5128
- Update precedence rule in initialization #5130
- Warn instead of just ignoring user settings when kokkos-tools is disabled #5088
- Drop numa args in threads backend initialization #5127
- Warn users when a flag prefixed with -[-]kokkos is not recognized and do not remove it #5256
- Give back to Core what belongs to Core (aka moving tune_internals option from Tools back to Core) #5202
- nvcc_wrapper: filter out -pedantic-errors from nvcc options #5235
- nvcc_wrapper: add known nvcc option --source-in-ptx #5052
- Link libdl as interface library #5179
- Only show GPU architectures with enabled corresponding backend #5119
- Enable optional external desul build #5021 #5132
- Export Kokkos_CXX_STANDARD variable with CMake #5068
- Suppress warnings with nvc++ #5031
- Disallow multiple host architectures in CMake #4996
- Do not include compiler warning flags in the compile option of the cmake target #4989
- AOT flags for OpenMPTarget targeting Intel GPUs #4915
- Repurpose Kokkos_ARCH_INTEL_GEN for SYCL to mean JIT to be conforming with OMPT #4894
- Replace amdgpu-target with offload-arch #4874
- Do not enable kokkos_launch_compiler when CMAKE_CXX_COMPILER_LAUNCHER is set #4870
- Move CMake version check up #4797
- Remove KOKKOS_THREAD_LOCAL #5064
- Remove KOKKOS_ENABLE_POSIX_MEMALIGN #5011
- Remove unused KOKKOS_ENABLE_TM #4995
- Remove unused cmakedefine KOKKOS_ENABLE_COMPILER_WARNINGS #4883
- Remove unused KOKKOS_ENABLE_DUALVIEW_MODIFY_CHECK #4882
- Drop Instruction Set Architecture (ISA) macros #4981
- Warn in ScopeGuard about illegal usage #5250
- Guard against non-public header inclusion #5178
- Raise deprecation warnings if non empty WorkTag class is used #5230
- Deprecate parallel_* overloads taking the label as trailing argument #5141
- Deprecate nested types in functional #5185
- Deprecate InitArguments struct and replace it with InitializationSettings #5135
- Deprecate finalize_all() #5134
- Deprecate command line arguments (other than --help) that are not prefixed with kokkos-* #5120
- Deprecate --[kokkos-]numa cmdline arg and KOKKOS_NUMA env var #5117
- Deprecate --[kokkos-]threads command line argument in favor of --[kokkos-]num-threads #5111
- Deprecate Kokkos::is_reducer_type #4957
- Deprecate OffsetView constructors taking index_list_type #4810
- Deprecate overloads of Kokkos::sort taking a parameter bool always_use_kokkos_sort #5382
- Warn about parallel_reduce cases that call join() with volatile-qualified arguments #5215
- CUDA Reductions: Fix data races reported by Nvidia compute-sanitizer #4855
- Work around Intel compiler bug #5301
- Avoid allocating memory for UniqueToken #5300
- DynamicView: Properly resize mirror instances after construction #5276
- Remove Kokkos::Rank limit of 6 ranks #5271
- Do not forget to set last element to nullptr when removing a flag in Kokkos::initialize #5272
- Fix CUDA+MSVC build issue #5261
- Fix DynamicView::resize_serial #5220
- Fix cmake default compiler flags for unknown compiler #5217
- Fix move_backward #5191
- Fixing issue 5196 - missing symbol with intel compiler #5207
- Preserve KOKKOS_INVALID_INDEX in ViewDimension and ArrayLayout construction #5188
- Finalize deep_copy_space early avoiding printing to std::cerr for Cuda #5151
- Use correct policy in Threads MDRange parallel_reduce #5123
- Fix building with NVCC as the CXX compiler while the CUDA backend is not enabled #5115
- OpenMPTarget Index range fix for MDRange. #5089
- Fix bug with CUDA's team reduction for empty ranges #5079
- Fix using ZeroMemset for Serial #5077
- Fix Kokkos::Vector::push_back for default execution space #5047
- ScatterView: Fix ScatterMin/ScatterMax to use proper atomics #5045
- Fix calling ZeroMemset in deep_copy #5040
- Make View self-assignment not produce double-free #5024
- Guard against unrecognized pragma with intel compilers #5019
- Fix racing condition in HIPParallelLaunch #5008
- KokkosP: Fix device_id in profiling #4997
- Fix for Kokkos::vector::insert into empty vector with begin and end iterators #4988
- Fix Core header files installation #4984
- Fix bounds errors with Kokkos::sort #4980
- Fixup let RangePolicy::set_chunk_size return a reference to self #4918
- Fix allocating large Views #4907
- Fix combined reductions with Kokkos::View #4896
- Fixed _CUDA_ARCH__ to __CUDA_ARCH__ for CUDA LDG #4893
- Fixup View::access() truncate parameter pack #4876
- Fix abort with HIP backend for ROCm 5.0.2 and beyond #4873
- Fix HIP version when printing the configuration #4872
- Fix scratch lock array when using scratch level 1 #4871
- Fix Makefile.kokkos to work with fujitsu compiler #4867
- cmake: Correct link THREADS link option #4854
- UniqueToken impl_acquire function should be device only #4819
- Fix example calls to non existing static print_configuration #4806
- Fix requests for large team scratch sizes #4728
3.6.01 (2022-05-23)
- Fix Threads: Fix serial resizing scratch space (3.6.01 cherry-pick) #5109
- Fix ScatterMin/ScatterMax to use proper atomics (3.6.01 cherry-pick) #5046
- Fix allocating large Views #4907
- Fix bounds errors with Kokkos::sort #4980
- Fix HIP version when printing the configuration #4872
- Fixed _CUDA_ARCH__ to __CUDA_ARCH__ for CUDA LDG #4893
- Fixed an incorrect struct initialization #5028
- Fix racing condition in HIPParallelLaunch #5008
- Avoid deprecation warnings with OpenMPExec::validate_partition #4982
- Make View self-assignment not produce double-free #5024
3.6.00 (2022-02-18)
- Add C++ standard algorithms #4315
- Implement fill_random for DynRankView #4763
- Add bhalf_t #4543 #4653
- Add mathematical constants #4519
- Allow Kokkos::{create_mirror*,resize,realloc} to be used with WithoutInitializing #4486 #4337
- Implement KOKKOS_IF_ON_{HOST,DEVICE} macros #4660
- Allow setting the CMake language for Kokkos #4323
- Desul: Add ScopeCaller #4690
- Enable Desul atomics by default when using Makefiles #4606
- Unique token improvement #4741 #4748
- Add math function long double overload on the host side #4712
- Array reductions with pointer return types #4756
- Deprecate partition_master, validate_partition #4737
- Deprecate Kokkos_ENABLE_PTHREAD in favor of Kokkos_ENABLE_THREADS #4619 ** pair with use std::threads **
- Deprecate log2(unsigned) -> int (removing in next release) #4595
- Deprecate Kokkos::Impl::is_view #4592
- Deprecate KOKKOS_ACTIVE_EXECUTION_MEMORY_SPACE_* macros and the ActiveExecutionMemorySpace alias #4668
- Update required SYCL compiler version #4749
- Cap vector size to kernel maximum for SYCL #4704
- Improve check for compatibility of vector size and subgroup size in SYCL #4579
- Provide chunk_size for SYCL #4635
- Use host-pinned memory for SYCL kernel memory #4627
- Use shuffle-based algorithm for scalar reduction #4608
- Implement pool of USM IndirectKernelMemory #4596
- Provide valid default team size for SYCL #4481
- Add checks for shmem usage in parallel_reduce #4548
- Add support for fp16 in the HIP backend #4688
- Disable multiple kernel instantiations when using HIP (configure with -DKokkos_ENABLE_HIP_MULTIPLE_KERNEL_INSTANTIATIONS=ON to use) #4644
- Fix HIP scratch use per instance #4439
- Change allocation header to 256B alignment for AMD VEGA architecture #4753
- Add generic KOKKOS_ARCH_VEGA macro #4782
- Require ROCm 4.5 #4689
- Adapt to HPX 1.7.0 which is now required #4241
- Fix thread deduction for OpenMP for thread_count==0 #4541
- Update memory space size_type to improve performance (size_t -> unsigned) #4779
- Improve NVHPC support #4599
- Add Kokkos::Experimental::{min,max,minmax,clamp} #4629 #4506
- Use device type as template argument in Containers and Algorithms #4724 #4675
- Implement Kokkos::sort with execution space #4490
- Kokkos::resize always error out for mismatch in runtime rank #4681
- Print current call stack when calling Kokkos::abort() from the host #4672 #4671
- Detect mismatch of execution spaces in functors #4655
- Improve view label access on host #4647
- Error out for const scalar return type in reduction #4645
- Don't allow calling UnorderedMap::value_at for a set #4639
- Add KOKKOS_COMPILER_NVHPC macro, disable quiet_NaN and signaling_NaN #4586
- Improve performance of local_deep_copy #4511
- Improve performance when sorting integers #4464
- Add missing numeric traits (denorm_min, reciprocal_overflow_threshold, {quiet,signaling}_NaN) and make them work on cv-qualified types #4466 #4415 #4473 #4443
- Manually compute IntelLLVM compiler version for older CMake versions #4760
- Add Xptxas without = to nvcc_wrapper #4646
- Use external GoogleTest optionally #4563
- Silence warnings about multiple optimization flags with nvcc_wrapper #4502
- Use the same flags in Makefile.kokkos for POWER7/8/9 as for CMake #4483
- Fix support for A64FX architecture #4745
- Drop KOKKOS_ARCH_HIP macro when using generated GNU makefiles #4786
- Remove gcc-toolchain auto add for clang in Makefile.kokkos #4762
- Lock constant memory in Cuda/HIP kernel launch with a mutex (thread safety) #4525
- Fix overflow for large requested scratch allocation #4551
- Fix Windows build in mingw #4564
- Fix kokkos_launch_compiler: escape $ character #4769 #4703
- Fix math functions with NVCC and GCC 5 as host compiler #4733
- Fix shared build with Intel19 #4725
- Do not install empty desul/src/ directory #4714
- Fix wrong device_id computation in identifier_from_devid (Profiling Interface) #4694
- Fix a bug in CUDA scratch memory pool (abnormally high memory consumption) #4673
- Remove eval of command args in hpcbind #4630
- SYCL fix to run when no GPU is detected #4623
- Fix layout_strides::span for rank-0 views #4605
- Fix SYCL atomics for local memory #4585
- Hotfix mdrange_large_deep_copy for SYCL #4581
- Fix bug when sorting integer using the HIP backend #4570
- Fix compilation error when using HIP with RDC #4553
- DynamicView: Fix deallocation extent #4533
- SYCL fix running parallel_reduce with TeamPolicy for large ranges #4532
- Fix bash syntax error in nvcc_wrapper #4524
- OpenMPTarget team_policy reduce fixes for init/join reductions #4521
- Avoid hangs in the Threads backend #4499
- OpenMPTarget fix reduction bug in parallel_reduce for TeamPolicy #4491
#4491 - HIP fix scratch space per instance #4439
- OpenMPTarget fix team scratch allocation #4431
3.5.00 (2021-10-19)
- Add support for quad-precision math functions/traits #4098
- Adding ExecutionSpace partitioning function #4096
- Improve Python Interop Capabilities #4065
- Add half_t Kokkos::rand specialization #3922
- Add math special functions: erf, erfcx, expint1, Bessel functions, Hankel functions #3920
- Add missing common mathematical functions #4043 #4036 #4034
- Let the numeric traits be SFINAE-friendly #4038
- Add Desul atomics - enabling memory-order and memory-scope parameters #3247
- Add detection idiom from the C++ standard library extension version 2 #3980
- Fence Profiling Support in all backends #3966 #4304 #4258 #4232
- Significant SYCL enhancements (see below)
- Deprecate CUDA_SAFE_CALL and HIP_SAFE_CALL #4249
- Deprecate Kokkos::Impl::Timer (Kokkos::Timer has been available for a long time) #4201
- Deprecate Experimental::MasterLock #4094
- Deprecate Kokkos_TaskPolicy.hpp (headers got reorganized, doesn't remove functionality) #4011
- Deprecate backward compatibility features #3978
- Update and deprecate is_space::host_memory/execution/mirror_space #3973
- Enabling constbitset constructors in kernels #4296
- Use ZeroMemset in View constructor to improve performance #4226
- Use memset in deep_copy #3944
- Add missing fence() calls in resize(View) that effectively do deep_copy(resized, orig) #4212
- Avoid allocations in resize and realloc #4207
- StaticCsrGraph: use device type instead of execution space to construct views #3991
- Consider std::sort when view is accessible from host #3929
- Fix CPP20 warnings except for volatile #4312
- Introduce SYCLHostUSMSpace #4268
- Implement SYCL TeamPolicy for vector_size > 1 #4183
- Enable 64bit ranges for SYCL #4211
- Don't print SYCL device info in execution space initialization #4168
- Improve SYCL MDRangePolicy performance #4161
- Use sub_groups in SYCL parallel_scan #4147
- Implement subgroup reduction for SYCL RangePolicy parallel_reduce #3940
- Use DPC++ broadcast extension in SYCL team_broadcast #4103
- Only fence in SYCL parallel_reduce for non-device-accessible result_ptr #4089
- Improve fencing behavior in SYCL backend #4088
- Fence all registered SYCL queues before deallocating memory #4086
- Implement SYCL::print_configuration #3992
- Reuse scratch memory in parallel_scan and TeamPolicy (decreases memory footprint) #3899 #3889
- Cuda improve heuristic for blocksize #4271
- Don't use [[deprecated]] for nvcc #4229
- Improve error message for NVHPC as host compiler #4227
- Update support for cuda reductions to work with types < 4bytes #4156
- Fix incompatible team size deduction in rare cases parallel_reduce #4142
- Remove UVM usage in DynamicView #4129
- Remove dependency between core and containers #4114
- Adding opt-in CudaMallocSync support when using CUDA version >= 11.2 #4026 #4233
- Fix a potential race condition in the CUDA backend #3999
- Implement new blocksize deduction method for HIP Backend #3953
- Add multiple LaunchMechanism #3820
- Make HIP backend thread-safe #4170
- Refactor Serial backend and fix thread-safety issue #4053
- OpenMPTarget: support array reductions in RangePolicy #4040
- OpenMPTarget: add MDRange parallel_reduce #4032
- OpenMPTarget: Fix bug for the case of a reducer. #4044
- OpenMPTarget: verify process fix #4041
- Use hipcc architecture autodetection when Kokkos_ARCH is not set #3941
- Introduce Kokkos_ENABLE_DEPRECATION_WARNINGS and remove deprecated code with Kokkos_ENABLE_DEPRECATED_CODE_3 #4106 #3855
- Add allow-unsupported-compiler flag to nvcc-wrapper #4298
- nvcc_wrapper: fix errors in argument handling #3993
- Adds support for -time= and -time in nvcc_wrapper #4015
- nvcc_wrapper: suppress duplicates of GPU architecture and RDC flags #3968
- Fix TMPDIR support in nvcc_wrapper #3792
- NVHPC: update PGI compiler arch flags #4133
- Replace PGI with NVHPC (works for both) #4196
- Make sure that KOKKOS_CXX_HOST_COMPILER_ID is defined #4235
- Add options to Makefile builds for deprecated code and warnings #4215
- Use KOKKOS_CXX_HOST_COMPILER_ID for identifying CPU arch flags #4199
- Added support for Cray Clang to Makefile.kokkos #4176
- Add XLClang as compiler #4120
- Keep quoted compiler flags when passing to Trilinos #3987
- Add support for AMD Zen3 CPU architecture #3972
- Rename IntelClang to IntelLLVM #3945
- Add cppcoreguidelines-pro-type-cstyle-cast to clang-tidy #3522
- Add sve bit size definition for A64FX #3947 #3946
- Remove KOKKOS_ENABLE_DEBUG_PRINT_KERNEL_NAMES #4150
- Retrieve original value from a point in a MultidimensionalSparseTuningProblem #3977
- Allow extension of built-in tuners with additional tuning axes #3961
- Added a categorical tuner #3955
- hpcbind: Use double quotes around $@ when invoking user command #4284
- Add file and line to error message #3985
- Fix compiler warnings when compiling with nvc++ #4198
- Add OpenMPTarget CI build on AMD GPUs #4055
- CI: icpx is now part of intel container #4002
- Remove pre CUDA 9 KOKKOS_IMPL_CUDA_* macros #4138
- UnorderedMap::clear() should zero the size() #4130
- Add memory fence for HostSharedPtr::cleanup() #4144
- SYCL: Fix race conditions in TeamPolicy::parallel_reduce #4418
- Adding missing memory fence to serial exec space fence. #4292
- Fix using external SYCL queues in tests #4291
- Fix digits10 bug #4281
- Fixes constexpr errors with frounding-math on gcc < 10. #4278
- Fix compiler flags for PGI/NVHPC #4264
- Fix Zen2/3 also implying Zen Arch with Makefiles #4260
- Kokkos_Cuda.hpp: Fix shadow warning with cuda/11.0 #4252
- Fix issue w/ static initialization of function attributes #4242
- Disable long double hypot test on Power systems #4221
- Fix false sharing in random pool #4218
- Fix a missing memory_fence for debug shared alloc code #4216
- Fix two xl issues #4179
- Makefile.kokkos: fix (standard_in) 1: syntax error #4173
- Fixes for query_device example #4172
- Fix a bug when using HIP atomic with Kokkos::Complex #4159
- Fix mistaken logic in pthread creation #4157
- Define KOKKOS_ENABLE_AGGRESSIVE_VECTORIZATION when requesting Kokkos_ENABLE_AGGRESSIVE_VECTORIZATION=ON #4107
- Fix compilation with latest MSVC version #4102
- Fix incorrect macro definitions when compiling with Intel compiler on Windows #4087
- Fixup global buffer overflow in hand rolled string manipulation #4070
- Fixup heap buffer overflow in cmd line args parsing unit tests #4069
- Only add quotes in compiler flags for Trilinos if necessary #4067
- Fixed invocation of tools init callbacks #4061
- Work around SYCL JIT compiler issues with static variables #4013
- Fix TestDetectionIdiom.cpp test inclusion for Trilinos/TriBITS #4010
- Fixup allocation headers with OpenMPTarget backend #4003
- Add missing specialization for OMPT to Kokkos Random #3967
- Disable hypot long double test on power arches #3962
- Use different EBO workaround for MSVC (rebased) #3924
- Fix SYCL Kokkos::Profiling::(de)allocateData calls #3928
3.4.01 (2021-05-19)
Bug Fixes:
- Windows: Remove atomic_compare_exchange_strong overload conflicts with Windows #4024
- OpenMPTarget: Fixup allocation headers with OpenMPTarget backend #4020
- OpenMPTarget: Add missing specialization for OMPT to Kokkos Random #4022
- AMD: Add support for AMD Zen3 CPU architecture #4021
- SYCL: Implement SYCL::print_configuration #4012
- Containers: staticcsrgraph: use device type instead of execution space to construct views #3998
- nvcc_wrapper: fix errors in argument handling, suppress duplicates of GPU architecture and RDC flags #4006
- CI: Add icpx testing to intel container #4004
- CMake/TRIBITS: Keep quoted compiler flags when passing to Trilinos #4007
- CMake: Rename IntelClang to IntelLLVM #3945
3.4.00 (2021-04-25)
Highlights:
- SYCL Backend Almost Feature Complete
- OpenMPTarget Backend Almost Feature Complete
- Performance Improvements for HIP backend
- Require CMake 3.16 or newer
- Tool Callback Interface Enhancements
- cmath wrapper functions available now in Kokkos::Experimental
Features:
- Implement parallel_scan with ThreadVectorRange and Reducer #3861
- Implement SYCL Random #3849
- OpenMPTarget: Adding Implementation for nested reducers #3845
- Implement UniqueToken for SYCL #3833
- OpenMPTarget: UniqueToken::Global implementation #3823
- DualView sync's on ExecutionSpaces #3822
- SYCL outer TeamPolicy parallel_reduce #3818
- SYCL TeamPolicy::team_scan #3815
- SYCL MDRangePolicy parallel_reduce #3801
- Enable use of execution space instances in ScatterView #3786
- SYCL TeamPolicy nested parallel_reduce #3783
- OpenMPTarget: MDRange with TagType for parallel_for #3781
- Adding OpenMPTarget parallel_scan #3655
- SYCL basic TeamPolicy #3654
- OpenMPTarget: scratch memory implementation #3611
Implemented enhancements Backends and Archs:
- SYCL choose a specific GPU #3918
- [HIP] Lock access to scratch memory when using Teams #3916
- [HIP] fix multithreaded access to get_next_driver #3908
- Forward declare HIPHostPinnedSpace and SYCLSharedUSMSpace #3902
- Let SYCL USMObjectMem use SharedAllocationRecord #3898
- Implement clock_tic for SYCL #3893
- Don't use a static variable in HIPInternal::scratch_space #3866
- Reuse memory for SYCL parallel_reduce #3873
- Update SYCL compiler in CI #3826
- Introduce HostSharedPtr to manage m_space_instance for Cuda/HIP/SYCL #3824
- [HIP] Use shuffle for range reduction #3811
- OpenMPTarget: Changes to the hierarchical parallelism #3808
- Remove ExtendedReferenceWrapper for SYCL parallel_reduce #3802
- Eliminate sycl_indirect_launch #3777
- OpenMPTarget: scratch implementation for parallel_reduce #3776
- Allow initializing SYCL execution space from sycl::queue and SYCL::impl_static_fence #3767
- SYCL TeamPolicy scratch memory alternative #3763
- Alternative implementation for SYCL TeamPolicy #3759
- Unify handling of synchronous errors in SYCL #3754
- core/Cuda: Half_t updates for cgsolve #3746
- Unify HIPParallelLaunch structures #3733
- Improve performance for SYCL parallel_reduce #3732
- Use consistent types in Kokkos_OpenMPTarget_Parallel.hpp #3703
- Implement non-blocking kernel launches for HIP backend #3697
- Change SYCLInternal::m_queue std::unique_ptr -> std::optional #3677
- Use alternative SYCL parallel_reduce implementation #3671
- Use runtime values in KokkosExp_MDRangePolicy.hpp #3626
- Clean up AnalyzePolicy #3564
- Changes for indirect launch of SYCL parallel reduce #3511
Implemented enhancements BuildSystem:
- Also require C++14 when building gtest #3912
- Fix compiling SYCL with OpenMP #3874
- Require C++17 for SYCL (at configuration time) #3869
- Add COMPILE_DEFINITIONS argument to kokkos_create_imported_tpl #3862
- Do not pass arch flags to the linker with no rdc #3846
- Try compiling C++14 check with C++14 support and print error message #3843
- Enable HIP with Cray Clang #3842
- Add an option to disable header self containment tests #3834
- CMake check for C++14 #3809
- Prefer -std=* over --std=* #3779
- Kokkos launch compiler updates #3778
- Updated comments and enabled no-op for kokkos_launch_compiler #3774
- Apple's Clang not correctly recognised #3772
- kokkos_launch_compiler + CUDA auto-detect arch #3770
- Add Spack test support for Kokkos #3753
- Split SYCL tests for aot compilation #3741
- Use consistent OpenMP flag for IntelClang #3735
- Add support for -Wno-deprecated-gpu-targets #3722
- Add configuration to target CUDA compute capability 8.6 #3713
- Added VERSION and SOVERSION to KOKKOS_INTERNAL_ADD_LIBRARY #3706
- Add fast-math to known NVCC flags #3699
- Add MI-100 arch string #3698
- Require CMake >=3.16 #3679
- KokkosCI.cmake, KokkosCTest.cmake.in, CTestConfig.cmake.in + CI updates #2844
Implemented enhancements Tools:
- Improve readability of the callback invocation in profiling #3860
- V1.1 Tools Interface: incremental, action-based #3812
- Enable launch latency simulations #3721
- Added metadata callback to tools interface #3711
- MDRange Tile Size Tuning #3688
- Added support for command-line args for kokkos-tools #3627
- Query max tile sizes for an MDRangePolicy, and set tile sizes on an existing policy #3481
Implemented enhancements Other:
- Try detecting ndevices in get_gpu #3921
- Use strcmp to compare names() #3909
- Add execution space arguments for constructor overloads that might allocate a new underlying View #3904
- Prefix labels in internal use of kokkos_malloc #3891
- Prefix labels for internal uses of SharedAllocationRecord #3890
- Add missing hypot math function #3880
- Unify algorithm unit tests to avoid code duplication #3851
- DualView.template view() better matches for Devices in UVMSpace cases #3857
- More extensive disentangling of Policy Traits #3829
- Replaced nanosleep and sched_yield with STL routines #3825
- Constructing Atomic Subviews #3810
- Metadata Declaration in Core #3729
- Allow using tagged final functor in parallel_reduce #3714
- Major duplicate code removal in SharedAllocationRecord specializations #3658
Fixed bugs:
- Provide forward declarations in Kokkos_ViewLayoutTiled.hpp for XL #3911
- Fixup absolute value of floating points in Kokkos complex #3882
- Address intel 17 ICE #3881
- Add missing pow(Kokkos::complex) overloads #3868
- Fix bug {pow, log}(Kokkos::complex) #3866
- Cleanup writing to output streams in Cuda #3859
- Fixup cache CUDA fallback execution space instance used by DualView::sync #3856
- Fix cmake warning with pthread #3854
- Fix typo FOUND_CUDA_{DRIVVER -> DRIVER} #3852
- Fix bug in SYCL team_reduce #3848
- Atrocious bug in MDRange tuning #3803
- Fix compiling SYCL with Kokkos_ENABLE_TUNING=ON #3800
- Fixed command line parsing bug #3797
- Workaround race condition in SYCL parallel_reduce #3782
- Fix Atomic{Min,Max} for Kepler30 #3780
- Fix SYCL typo #3755
- Fixed Kokkos_install_additional_files macro #3752
- Fix a typo for Kokkos_ARCH_A64FX #3751
- OpenMPTarget: fixes and workarounds to work with "Release" build type #3748
- Fix parsing bug for number of devices command line argument #3724
- Avoid more warnings with clang and C++20 #3719
- Fix gcc-10.1 C++20 warnings #3718
- Fix cuda cache config not being set correctly #3712
- Fix dualview deepcopy perftools #3701
- use drand instead of frand in drand #3696
Incompatibilities:
- Remove unimplemented member functions of SYCLDevice #3919
- Replace cl::sycl #3896
- Get rid of SYCL workaround in Kokkos_Complex.hpp #3884
- Replace most uses of if_c #3883
- Remove Impl::enable_if_type #3863
- Remove HostBarrier test #3847
- Avoid (void) interface #3836
- Remove VerifyExecutionCanAccessMemorySpace #3813
- Avoid duplicated code in ScratchMemorySpace #3793
- Remove superfluous FunctorFinal specialization #3788
- Rename cl::sycl -> sycl in Kokkos_MathematicalFunctions.hpp #3678
- Remove integer_sequence backward compatibility implementation #3533
Enabled tests:
- Fixup re-enable core performance tests #3903
- Enable more SYCL tests #3900
- Restrict MDRange Policy tests for Intel GPUs #3853
- Disable death tests for rawhide #3844
- OpenMPTarget: Block unit tests that do not pass with the nvidia compiler #3839
- Enable Bitset container test for SYCL #3830
- Enable some more SYCL tests #3744
- Enable SYCL atomic tests #3742
- Enable more SYCL perf_tests #3692
- Enable examples for SYCL #3691
3.3.01 (2021-01-06)
Bug Fixes:
- Fix severe performance bug in DualView which added memcpys for sync and modify #3693
- Fix performance bug in the CUDA backend, where the CUDA cache config was not set correctly.
3.3.00 (2020-12-16)
Features:
- Require C++14 as minimum C++ standard. C++17 and C++20 are supported too.
- HIP backend is nearly feature complete. Kokkos Dynamic Task Graphs are missing.
- Major update for OpenMPTarget: many capabilities now work. For details contact us.
- Added DPC++/SYCL backend: primary capabilities are working.
- Added Kokkos Graph API analogous to CUDA Graphs.
- Added parallel_scan support with TeamThreadRange #3536
- Added Logical Memory Spaces #3546
- Added initial half precision support #3439
- Experimental feature: control cuda occupancy #3379
Implemented enhancements Backends and Archs:
- Add a64fx and fujitsu Compiler support #3614
- Adding support for AMD gfx908 architecture #3375
- SYCL parallel_for MDRangePolicy #3583
- SYCL add parallel_scan #3577
- SYCL custom reductions #3544
- SYCL Enable container unit tests #3550
- SYCL feature level 5 #3480
- SYCL Feature level 4 (parallel_for) #3474
- SYCL feature level 3 #3451
- SYCL feature level 2 #3447
- OpenMPTarget: Hierarchical reduction for + operator on scalars #3504
- OpenMPTarget hierarchical #3411
- HIP Add Impl::atomic_[store,load] #3440
- HIP enable global lock arrays #3418
- HIP Implement multiple occupancy paths for various HIP kernel launchers #3366
Implemented enhancements Policies:
- MDRangePolicy: Let it be semiregular #3494
- MDRangePolicy: Check narrowing conversion in construction #3527
- MDRangePolicy: CombinedReducers support #3395
- Kokkos Graph: Interface and Default Implementation #3362
- Kokkos Graph: add Cuda Graph implementation #3369
- TeamPolicy: implemented autotuning of team sizes and vector lengths #3206
- RangePolicy: Initialize all data members in default constructor #3509
Implemented enhancements BuildSystem:
- Auto-generate core test files for all backends #3488
- Avoid rewriting test files when calling cmake #3548
- RULE_LAUNCH_COMPILE and RULE_LAUNCH_LINK system for nvcc_wrapper #3136
- Adding -include as a known argument to nvcc_wrapper #3434
- Install hpcbind script #3402
- cmake/kokkos_tribits.cmake: add parsing for args #3457
Implemented enhancements Tools:
- Changed namespacing of Kokkos::Tools::Impl::Impl::tune_policy #3455
- Delegate to an impl allocate/deallocate method to allow specifying a SpaceHandle for MemorySpaces #3530
- Use the Kokkos Profiling interface rather than the Impl interface #3518
- Runtime option for tuning #3459
- Dual View Tool Events #3326
Implemented enhancements Other:
- Abort on errors instead of just printing #3528
- Enable C++14 macros unconditionally #3449
- Make ViewMapping trivially copyable #3436
- Rename struct ViewMapping to class #3435
- Replace enums in Kokkos_ViewMapping.hpp (removes -Wextra) #3422
- Use bool for enums representing bools #3416
- Fence active instead of default execution space instances #3388
- Refactor parallel_reduce fence usage #3359
- Moved Space EBO helpers to Kokkos_EBO #3357
- Add remove_cvref type trait #3340
- Adding identity type traits and update definition of identity_t alias #3339
- Add is_specialization_of type trait #3338
- Make ScratchMemorySpace semi-regular #3309
- Optimize min/max atomics with early exit on no-op case #3265
- Refactor Backend Development #2941
Fixed bugs:
- Fixup MDRangePolicy construction from Kokkos arrays #3591
- Add atomic functions for unsigned long long using gcc built-in #3588
- Fixup silent pointless comparison with zero in checked_narrow_cast (compiler workaround) #3566
- Fixes for ROCm 3.9 #3565
- Fix windows build issues which crept in for the CUDA build #3532
- HIP Fix atomics of large data types and clean up lock arrays #3529
- Pthreads fix exception resulting from 0 grain size #3510
- Fixup do not require atomic operation to be default constructible #3503
- Fix race condition in HIP backend #3467
- Replace KOKKOS_DEBUG with KOKKOS_ENABLE_DEBUG #3458
- Fix multi-stream team scratch space definition for HIP #3398
- HIP fix template deduction #3393
- Fix compiling with HIP and C++17 #3390
- Fix sigFPE in HIP blocksize deduction #3378
- Type alias change: replace CS with CTS to avoid conflicts with NVSHMEM #3348
- Clang compilation of CUDA backend on Windows #3345
- Fix HBW support #3343
- Added missing fences to unique token #3260
Incompatibilities:
- Remove unused utilities (forward, move, and expand_variadic) from Kokkos::Impl #3535
- Remove unused traits #3534
- HIP: Remove old HCC code #3301
- Prepare for deprecation of ViewAllocateWithoutInitializing #3264
- Remove ROCm backend #3148
3.2.01 (2020-11-17)
Fixed bugs:
- Disallow KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE in shared library builds #3332
- Do not install libprinter-tool when testing is enabled #3313
- Fix restrict/alignment following refactor #3373
- Intel fix: workaround compiler issue with using statement #3383
- Fix zero-length reductions #3364
- Fix multi-stream scratch #3269
- Guard KOKKOS_ALL_COMPILE_OPTIONS if Cuda is not enabled #3387
- Do not include link flags for Fortran linkage #3384
- Fix NVIDIA GPU arch macro with autodetection #3473
- Fix libdl/test issues with Trilinos #3543
- Register Pthread as Tribits option to be enabled with Trilinos #3558
Implemented enhancements:
- Separate Cuda timing-based tests into their own executable #3407
3.2.00 (2020-08-19)
Implemented enhancements:
- HIP:Enable stream in HIP #3163
- HIP:Add support for shuffle reduction for the HIP backend #3154
- HIP:Add implementations of missing HIPHostPinnedSpace methods for LAMMPS #3137
- HIP:Require HIP 3.5.0 or higher #3099
- HIP:WorkGraphPolicy for HIP #3096
- OpenMPTarget: Significant update to the new experimental backend. Requires C++17, works on Intel GPUs, reference counting fixes. #3169
- Windows Cuda support #3018
- Pass -Wext-lambda-captures-this to NVCC when support for __host__ __device__ lambda is enabled from CUDA 11 #3241
- Use explicit staging buffer for constant memory kernel launches and cleanup host/device synchronization #3234
- Various fixups to policies, including making TeamPolicy default constructible and making RangePolicy and TeamPolicy assignable: #3202, #3203, #3196
- Annotations for DefaultExecutionSpace and DefaultHostExecutionSpace to use in static analysis #3189
- Add documentation on using Spack to install Kokkos and developing packages that depend on Kokkos #3187
- Add OpenMPTarget backend flags for NVC++ compiler #3185
- Move deep_copy/create_mirror_view on Experimental::OffsetView into Kokkos:: namespace #3166
- Allow for larger block size in HIP #3165
- View: Added names of Views to the different View initialize/free kernels #3159
- Cuda: Caching cudaFunctorAttributes and whether L1/Shmem prefer was set #3151
- BuildSystem: Improved performance in default configuration by defaulting to Release build #3131
- Cuda: Update CUDA occupancy calculation #3124
- Vector: Adding data() to Vector #3123
- BuildSystem: Add CUDA Ampere configuration support #3122
- General: Apply [[noreturn]] to Kokkos::abort when applicable #3106
- TeamPolicy: Validate storage level argument passed to TeamPolicy::set_scratch_size() #3098
- BuildSystem: Make kokkos_has_string() function in Makefile.kokkos case insensitive #3091
- Modify KOKKOS_FUNCTION macro for clang-tidy analysis #3087
- Move allocation profiling to allocate/deallocate calls #3084
- BuildSystem: FATAL_ERROR when attempting in-source build #3082
- Change enums in ScatterView to types #3076
- HIP: Changes for new compiler/runtime #3067
- Extract and use get_gpu #3061, #3048
- Add is_allocated to View-like containers #3059
- Combined reducers for scalar references #3052
- Add configurable capacity for UniqueToken #3051
- Add installation testing #3034
- HIP: Add UniqueToken #3020
- Autodetect number of devices #3013
Fixed bugs:
- Check error code from cudaStreamSynchronize in CUDA fences #3255
- Fix issue with C++ standard flags when using nvcc_wrapper with PGI #3254
- Add missing threadfence in lock-based atomics #3208
- Fix dedup of linker flags for shared lib on CMake <=3.12 #3176
- Fix memory leak with CUDA streams #3170
- BuildSystem: Fix OpenMP Target flags for Cray #3161
- ScatterView: fix for OpenMPTarget: remove inheritance from reducers #3162
- BuildSystem: Set OpenMP flags according to host compiler #3127
- OpenMP: Fix logic for nested omp in partition_master bug #3101
- nvcc_wrapper: send --cudart to nvcc instead of host compiler #3092
- BuildSystem: Fixes for Cuda/11 and c++17 #3085
- HIP: Fix print_configuration #3080
- Conditionally define get_gpu #3072
- Fix bounds for ranges in random number generator #3069
- Fix Cuda minor arch check #3035
- BuildSystem: Add -expt-relaxed-constexpr flag to nvcc_wrapper #3021
Incompatibilities:
- Remove ETI support #3157
- Remove KOKKOS_INTERNAL_ENABLE_NON_CUDA_BACKEND #3147
- Remove core/unit_test/config #3146
- Removed the preprocessor branch for KOKKOS_ENABLE_PROFILING #3115
- Disable profiling with MSVC #3066
Closed issues:
- Silent error (Validate storage level arg to set_scratch_size) #3097
- Remove KOKKOS_ENABLE_PROFILING Option #3095
- Cuda 11 -> allow C++17 #3083
- In source build failure not explained #3081
- Allow naming of Views for initialization kernel #3070
- DefaultInit tests failing when using CTest resource allocation feature #3040
- Add installation testing. #3037
- nvcc_wrapper needs to handle -expt-relaxed-constexpr flag #3017
- CPU core oversubscription warning on macOS with OpenMP backend #2996
- Default behavior of KOKKOS_NUM_DEVICES to use all devices available #2975
- Assert blocksize > 0 #2974
- Add ability to assign kokkos profile function from executable #2973
- ScatterView Support for the pre/post increment operator #2967
- Compiler issue: Cuda build with clang 10 has errors with the atomic unit tests #3237
- Incompatibility of flags for C++ standard with PGI v20.4 on Power9/NVIDIA V100 system #3252
- Error configuring as subproject #3140
- CMake fails with Nvidia compilers when the GPU architecture option is not supplied (Fix configure with OMPT and Cuda) #3207
- PGI compiler being passed the gcc -fopenmp flag #3125
- Cuda: Memory leak when using CUDA stream #3167
- RangePolicy has an implicitly deleted assignment operator #3192
- MemorySpace::allocate needs to have memory pool counting. #3064
- Missing write fence for lock based atomics on CUDA #3038
- CUDA compute capability version check problem #3026
- Make DynRankView fencing consistent #3014
- nvcc_wrapper can't handle -Xcompiler -o out.o #2993
- Reductions of non-trivial types of size 4 fail in CUDA shfl operations #2990
- complex_double misalignment in reduce, clang+CUDA #2989
- Span of degenerated (zero-length) subviews is not zero in some special cases #2979
- Rank 1 custom layouts don't work as expected. #2840
3.1.01 (2020-04-14)
Fixed bugs:
- Fix complex_double misalignment in reduce, clang+CUDA #2989
- Fix compilation fails when profiling disabled and CUDA enabled #3001
- Fix cuda reduction of non-trivial scalars of size 4 #2990
- Configure and install version file when building in Trilinos #2957
- Fix OpenMPTarget build missing include and namespace #3000
- fix typo in KOKKOS_SET_EXE_PROPERTY() #2959
- Fix non-zero span subviews of zero sized subviews #2979
3.1.00 (2020-04-14)
Features:
- HIP Support for AMD
- OpenMPTarget Support with clang
- Windows VS19 (Serial) Support #1533
Implemented enhancements:
- generate_makefile.bash should allow tests to be disabled #2886
- clang/7+cuda/9 build -Werror-unused parameter error in nightly test #2884
- ScatterView memory space is not user settable #2826
- clang/8+cuda/10.0 build error with c++17 #2809
- warnings.... #2805
- Kokkos version in cpp define #2787
- Remove Defunct QThreads Backend #2751
- Improve Kokkos::fence behavior with multiple execution spaces #2659
- polylithic(?) initialization of Kokkos #2658
- Unnecessary(?) check for host execution space initialization from Cuda initialization #2652
- Kokkos error reporting failures with CUDA GPUs in exclusive mode #2471
- atomicMax equivalent (and other atomics) #2401
- Fix alignment for Kokkos::complex #2255
- Warnings with Cuda 10.1 #2206
- dual view with Kokkos::ViewAllocateWithoutInitializing #2188
- Check error code from cudaOccupancyMaxActiveBlocksPerMultiprocessor #2172
- Add non-member Kokkos::resize/realloc for DualView #2170
- Construct DualView without initialization #2046
- Expose is_assignable to determine if one view can be assigned to another #1936
- profiling label #1935
- team_broadcast of bool failed on CUDA backend #1908
- View static_extent #660
- Misleading Kokkos::Cuda::initialize ERROR message when compiled for wrong GPU architecture #1944
- Cryptic Error When Malloc Fails #2164
- Drop support for intermediate standards in CMake #2336
Fixed bugs:
- DualView sync_device with length zero creates cuda errors #2946
- building with nvcc and clang (or clang based XL) as host compiler: "Kokkos::atomic_fetch_min(volatile int *, int)" has already been defined #2903
- Cuda 9.1,10.1 debug builds failing due to -Werror=unused-parameter #2880
- clang -Werror: Kokkos_FixedBufferMemoryPool.hpp:140:28: error: unused parameter 'alloc_size' #2869
- intel/16.0.1, intel/17.0.1 nightly build failures with debugging enabled #2867
- intel/16.0.1 debug build errors #2863
- xl/16.1.1 with cpp14, openmp build, nightly test failures #2856
- Intel nightly test failures: team_vector #2852
- Kokkos Views with intmax/2<N<intmax can hang during construction #2850
- workgraph_fib test seg-faults with threads backend and hwloc #2797
- cuda.view_64bit test hangs on Power8+Kepler37 system - develop and 2.9.00 branches #2771
- device_type for Kokkos_Random ? #2693
- "More than one tag given" error in Experimental::require() #2608
- Segfault on Marvell from our finalization stack #2542
3.0.00 (2020-01-27)
Implemented enhancements:
- BuildSystem: Standalone Modern CMake Support #2104
- StyleFormat: ClangFormat Style #2157
- Documentation: Document build system and CMake philosophy #2263
- BuildSystem: Add Alias with Namespace Kokkos:: to Internal Libraries #2530
- BuildSystem: Universal Kokkos find_package #2099
- BuildSystem: Dropping support for Kokkos_{DEVICES,OPTIONS,ARCH} in CMake #2329
- BuildSystem: Set Kokkos_DEVICES and Kokkos_ARCH variables in exported CMake configuration #2193
- BuildSystem: Drop support for CUDA 7 and CUDA 8 #2489
- BuildSystem: Drop CMake option SEPARATE_TESTS #2266
- BuildSystem: Support expt-relaxed-constexpr same as expt-extended-lambda #2411
- BuildSystem: Add Xnvlink to command line options allowed in nvcc_wrapper #2197
- BuildSystem: Install Kokkos config files and target files to lib/cmake/Kokkos #2162
- BuildSystem: nvcc_wrappers and c++ 14 #2035
- BuildSystem: Kokkos version major/version minor (Feature request) #1930
- BuildSystem: CMake namespaces (and other modern cmake cleanup) #1924
- BuildSystem: Remove capability to install Kokkos via GNU Makefiles #2332
- Documentation: Remove PDF ProgrammingGuide in Kokkos replace with link #2244
- View: Add Method to Resize View without Initialization #2048
- Vector: implement “insert” method for Kokkos_Vector (as a serial function on host) #2437
Fixed bugs:
- ParallelScan: Kokkos::parallel_scan fix race condition seen in inter-block fence #2681
- OffsetView: Kokkos::OffsetView missing constructor which takes pointer #2247
- OffsetView: Kokkos::OffsetView: allow offset=0 #2246
- DeepCopy: Missing DeepCopy instrumentation in Kokkos #2522
- nvcc_wrapper: --host-only fails with multiple -W* flags #2484
- nvcc_wrapper: taking first -std option is counterintuitive #2553
- Subview: Error taking subviews of views with static_extents of min rank #2448
- TeamPolicy: reducers with valuetypes without += broken on CUDA #2410
- Libs: Fix inconsistency of Kokkos library names in Kokkos and Trilinos #1902
- Complex: operator>> for complex<T> uses std::ostream, not std::istream #2313
- Macros: Restrict not honored for non-intel compilers #1922
2.9.00 (2019-06-24)
Implemented enhancements:
- Capability: CUDA Streams #1723
- Capability: CUDA Stream support for parallel_reduce #2061
- Capability: Feature Request: TeamVectorRange #713
- Capability: Adding HPX backend #2080
- Capability: TaskScheduler to have multiple queues #565
- Capability: Support for additional reductions in ScatterView #1674
- Capability: Request: deep_copy within parallel regions #689
- Capability: Feature Request: create_mirror_view_without_initializing #1765
- View: Use SFINAE to restrict possible View type conversions #2127
- Deprecation: Deprecate ExecutionSpace::fence() as static function and make it non-static #2140
- Deprecation: Deprecate LayoutTileLeft #2122
- Macros: KOKKOS_RESTRICT defined for non-Intel compilers #2038
Fixed bugs:
- Cuda: TeamThreadRange loop count on device is passed by reference to host static constexpr #1733
- Cuda: Build error with relocatable device code with CUDA 10.1 GCC 7.3 #2134
- Cuda: cudaFuncSetCacheConfig is setting CachePreferShared too often #2066
- Cuda: TeamPolicy doesn't throw when created with non-viable vector length and also doesn't backscale to viable one #2020
- Cuda: cudaMemcpy error for large league sizes on V100 #1991
- Cuda: illegal warp sync in parallel_reduce by functor on Turing 75 #1958
- TeamThreadRange: Inconsistent results from TeamThreadRange reduction #1905
- Atomics: atomic_fetch_oper & atomic_oper_fetch don't build for complex<float> #1964
- Views: Kokkos randomread Views leak memory #2155
- ScatterView: LayoutLeft overload currently non-functional #2165
- KNL: With intel 17.2.174 illegal instruction in random number test #2078
- Bitset: Enable copy constructor on device #2094
- Examples: do not compile due to template deduction error (multi_fem) #1928
2.8.00 (2019-02-05)
Implemented enhancements:
- Capability, Tests: C++14 support and testing #1914
- Capability: Add environment variables for all command line arguments #1798
- Capability: --kokkos-ndevices not working for Slurm #1920
- View: Undefined behavior when deep copying from and to an empty unmanaged view #1967
- BuildSystem: nvcc_wrapper should stop immediately if nvcc is not in PATH #1861
Fixed bugs:
- Cuda: Fix Volta Issues 1 Non-deterministic behavior on Volta, runs fine on Pascal #1949
- Cuda: Fix Volta Issues 2 CUDA Team Scan gives wrong values on Volta with -G compile flag #1942
- Cuda: illegal warp sync in parallel_reduce by functor on Turing 75 #1958
- Threads: Pthreads backend does not handle RangePolicy with offset correctly #1976
- Atomics: atomic_fetch_oper has no case for Kokkos::complex<double> or other 16-byte types #1951
- MDRangePolicy: Fix zero-length range #1948
- TeamThreadRange: TeamThreadRange MaxLoc reduce doesn't compile #1909
2.7.24 (2018-11-04)
Implemented enhancements:
- DualView: Add non-templated functions for sync, need_sync, view, modify #1858
- DualView: Avoid needlessly allocating and initializing modify_host and modify_device flag views #1831
- DualView: Incorrect deduction of "not device type" #1659
- BuildSystem: Add KOKKOS_ENABLE_CXX14 and KOKKOS_ENABLE_CXX17 #1602
- BuildSystem: Installed kokkos_generated_settings.cmake contains build directories instead of install directories #1838
- BuildSystem: KOKKOS_ARCH: add ticks to printout of improper arch setting #1649
- BuildSystem: Make core/src/Makefile for Cuda use needed nvcc_wrapper #1296
- Build: Support PGI as host compiler for NVCC #1828
- Build: Many Warnings Fixed, e.g. #1786
- Capability: OffsetView with non-zero begin index #567
- Capability: Reductions into device side view #1788
- Capability: Add max_size to Kokkos::Array #1760
- Capability: View Assignment: LayoutStride -> LayoutLeft and LayoutStride -> LayoutRight #1594
- Capability: Atomic function allow implicit conversion of update argument #1571
- Capability: Add team_size_max with tagged functors #663
- Capability: Fix alignment of views from Kokkos_ScratchSpace should use different alignment #1700
- Capability: create_mirror_view_and_copy for DynRankView #1651
- Capability: DeepCopy HBWSpace / HostSpace #548
- ROCm: support team vector scan #1645
- ROCm: Merge from rocm-hackathon2 #1636
- ROCm: Add ParallelScanWithTotal #1611
- ROCm: Implement MDRange in ROCm #1314
- ROCm: Implement Reducers for Nested Parallelism Levels #963
- ROCm: Add asynchronous deep copy #959
- Tests: Memory pool test seems to allocate 8GB #1830
- Tests: Add unit_test for team_broadcast #734
Fixed bugs:
- BuildSystem: Makefile.kokkos gets gcc-toolchain wrong if gcc is cached #1841
- BuildSystem: kokkos_generated_settings.cmake placement is inconsistent #1771
- BuildSystem: Invalid escape sequence \. in kokkos_functions.cmake #1661
- BuildSystem: Problem in Kokkos generated cmake file #1770
- BuildSystem: invalid file names on windows #1671
- Tests: reducers min/max_loc test fails randomly due to multiple min values and thus multiple valid locations #1681
- Tests: cuda.scatterview unit test causes "Bus error" when force_uvm and enable_lambda are enabled #1852
- Tests: cuda.cxx11 unit test fails when force_uvm and enable_lambda are enabled #1850
- Tests: threads.reduce_device_view_range_policy failing with Cuda/8.0.44 and RDC #1836
- Build: compile error when compiling Kokkos with hwloc 2.0.1 (on OSX 10.12.6, with g++ 7.2.0) #1506
- Build: dual_view.view broken with UVM #1834
- Build: White cuda/9.2 + gcc/7.2 warnings triggering errors #1833
- Build: warning: enum constant in boolean context #1813
- Capability: Fix overly conservative max_team_size thingy #1808
- DynRankView: Ctors taking ViewAllocateWithoutInitializing broken #1783
- Cuda: Apollo cuda.team_broadcast test fail with clang-6.0 #1762
- Cuda: Clang spurious test failure in impl_view_accessible #1753
- Cuda: Kokkos::complex<double> atomic deadlocks with Clang 6 Cuda build with -O0 #1752
- Cuda: LayoutStride Test fails for UVM as default memory space #1688
- Cuda: Scan wrong values on Volta #1676
- Cuda: Kokkos::deep_copy error with CudaUVM and Kokkos::Serial spaces #1652
- Cuda: cudaErrorInvalidConfiguration with debug build #1647
- Cuda: parallel_for with TeamPolicy::team_size_recommended with launch bounds not working -- reported by Daniel Holladay #1283
- Cuda: Using KOKKOS_CLASS_LAMBDA in a class with Kokkos::Random_XorShift64_Pool member data #1696
- Long Build Times on Darwin #1721
- Capability: Typo in Kokkos_Sort.hpp - BinOp3D - wrong comparison #1720
- Buffer overflow in SharedAllocationRecord in Kokkos_HostSpace.cpp #1673
- Serial unit test failure #1632
2.7.00 (2018-05-24)
Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.7
Implemented enhancements:
- Deprecate team_size auto adjusting to maximal value possible #1618
- DynamicView - remove restrictions to std::is_trivial types and value_type is power of two #1586
- Kokkos::StaticCrsGraph does not propagate memory traits (e.g., Unmanaged) #1581
- Adding ETI for DeepCopy / ViewFill etc. #1578
- Deprecate all the left over KOKKOS_HAVE_ Macros and Kokkos_OldMacros.hpp #1572
- Error if Kokkos_ARCH set in CMake #1555
- Deprecate ExecSpace::initialize / ExecSpace::finalize #1532
- New API for TeamPolicy property setting #1531
- clang 6.0 + cuda debug out-of-memory test failure #1521
- Cuda UniqueToken interface not consistent with other backends #1505
- Move Reducers out of Experimental namespace #1494
- Provide scope guard for initialize/finalize #1479
- Check Kokkos::is_initialized in SharedAllocationRecord dtor #1465
- Remove static list of allocations #1464
- Makefiles: Support single compile/link line use case #1402
- ThreadVectorRange with a range #1400
- Exclusive scan + last value API #1358
- Install kokkos_generated_settings.cmake #1348
- Kokkos arrays (not views!) don't do bounds checking in debug mode #1342
- Expose round-robin GPU assignment outside of initialize(int, char**) #1318
- DynamicView misses use_count and label function #1298
- View constructor should check arguments #1286
- False Positive on Oversubscription Warning #1207
- Allow (require) execution space for 1st arg of VerifyExecutionCanAccessMemorySpace #1192
- ROCm: Add ROCmHostPinnedSpace #958
- power of two functions #656
- CUDA 8 has 64bit __shfl #361
- Add TriBITS/CMake configure information about node types #243
Fixed bugs:
- CUDA atomic_fetch_sub for doubles is hitting CAS instead of intrinsic #1624
- Bug: use of ballot on Volta #1612
- Kokkos::deep_copy memory access failures #1583
- g++ -std option doubly set for cmake project #1548
- ViewFill for 1D Views of larger 32bit entries fails #1541
- CUDA Volta another warpsync bug #1520
- triple_nested_parallelism fails with KOKKOS_DEBUG and CUDA #1513
- Jenkins errors in Kokkos_SharedAlloc.cpp with debug build #1511
- Kokkos::Sort out-of-bounds with empty bins #1504
- Get rid of deprecated functions inside Kokkos #1484
- get_work_partition casts int64_t to int, causing a seg fault #1481
- NVCC bug with __device__ on defaulted function #1470
- CMake example broken with CUDA backend #1468
2.6.00 (2018-03-07)
Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.6
Implemented enhancements:
- Support NVIDIA Volta microarchitecture #1466
- Kokkos - Define empty functions when profiling disabled #1424
- Don't use __constant__ cache for lock arrays, enable once per run update instead of once per call #1385
- task dag enhancement. #1354
- Cuda task team collectives and stack size #1353
- Replace View operator acceptance of more than rank integers with 'access' function #1333
- Interoperability: Do not shut down backend execution space runtimes upon calling finalize. #1305
- shmem_size for LayoutStride #1291
- Kokkos::resize performs poorly on 1D Views #1270
- stride() is inconsistent with dimension(), extent(), etc. #1214
- Kokkos::sort defaults to std::sort on host #1208
- DynamicView with host size grow #1206
- Unmanaged View with Anonymous Memory Space #1175
- Sort subset of Kokkos::DynamicView #1160
- MDRange policy doesn't support lambda reductions #1054
- Add ability to set hook on Kokkos::finalize #714
- Atomics with Serial Backend - Default should be Disable? #549
- KOKKOS_ENABLE_DEPRECATED_CODE #1359
Fixed bugs:
- cuda_internal_maximum_warp_count returns 8, but I believe it should return 16 for P100 #1269
- Cuda: level 1 scratch memory bug (reported by Stan Moore) #1434
- MDRangePolicy Reduction requires value_type typedef in Functor #1379
- Kokkos DeepCopy between empty views fails #1369
- Several issues with new CMake build infrastructure (reported by Eric Phipps) #1365
- deep_copy between rank-1 host/device views of differing layouts without UVM no longer works (reported by Eric Phipps) #1363
- Profiling can't be disabled in CMake, and a parallel_for is missing for tasks (reported by Kyungjoo Kim) #1349
- get_work_partition int overflow (reported by berryj5) #1327
- Kokkos::deep_copy must fence even if the two views are the same #1303
- CudaUVMSpace::allocate/deallocate must fence #1302
- ViewResize on CUDA fails in Debug because of too many resources requested #1299
- Cuda 9 and intrepid2 calls from Panzer. #1183
- Slowdown due to tracking_enabled() in 2.04.00 (found by Albany app) #1016
- Bounds checking fails with zero-span Views (reported by Stan Moore) #1411
2.5.00 (2017-12-15)
Part of the Kokkos C++ Performance Portability Programming EcoSystem 2.5
Implemented enhancements:
- Provide Makefile.kokkos logic for CMake and TriBITS #878
- Add Scatter View #825
- Drop gcc 4.7 and intel 14 from supported compiler list #603
- Enable construction of unmanaged view using common_view_alloc_prop #1170
- Unused Function Warning with XL #1267
- Add memory pool parameter check #1218
- CUDA9: Fix warning for unsupported long double #1189
- CUDA9: fix warning on defaulted function marking #1188
- CUDA9: fix warnings for deprecated warp level functions #1187
- Add CUDA 9.0 nightly testing #1174
- {OMPI,MPICH}_CXX hack breaks nvcc_wrapper use case #1166
- KOKKOS_HAVE_CUDA_LAMBDA became KOKKOS_CUDA_USE_LAMBDA #1274
Fixed bugs:
- MinMax Reducer with tagged operator doesn't compile #1251
- Reducers for Tagged operators give wrong answer #1250
- Kokkos not Compatible with Big Endian Machines? #1235
- Parallel Scan hangs forever on BG/Q #1234
- Threads backend doesn't compile with Clang on OS X #1232
- $(shell date) needs quote #1264
- Unqualified parallel_for call conflicts with user-defined parallel_for #1219
- KokkosAlgorithms: CMake issue in unit tests #1212
- Intel 18 Error: "simd pragma has been deprecated" #1210
- Memory leak in Kokkos::initialize #1194
- CUDA9: compiler error with static assert template arguments #1190
- Kokkos::Serial::is_initialized returns always true #1184
- Triple nested parallelism still fails on bowman #1093
- OpenMP openmp.range on Develop Runs Forever on POWER7+ with RHEL7 and GCC4.8.5 #995
- Rendezvous performance at global scope #985
2.04.11 (2017-10-28)
Implemented enhancements:
- Add Subview pattern. #648
- Add Kokkos "global" is_initialized #1060
- Add create_mirror_view_and_copy #1161
- Add KokkosConcepts SpaceAccessibility function #1092
- Option to Disable Initialize Warnings #1142
- Mature task-DAG capability #320
- Promote Work DAG from experimental #1126
- Implement new WorkGraph push/pop #1108
- Kokkos_ENABLE_Cuda_Lambda should default ON #1101
- Add multidimensional parallel for example and improve unit test #1064
- Fix ROCm: Performance tests not building #1038
- Make KOKKOS_ALIGN_SIZE a configure-time option #1004
- Make alignment consistent #809
- Improve subview construction on Cuda backend #615
Fixed bugs:
- Kokkos::vector fixes for application #1134
- DynamicView non-power of two value_type #1177
- Memory pool bug #1154
- Cuda launch bounds performance regression bug #1140
- Significant performance regression in LAMMPS after updating Kokkos #1139
- CUDA compile error #1128
- MDRangePolicy neg idx test failure in debug mode #1113
- subview construction on Cuda backend #615
2.04.04 (2017-09-11)
Implemented enhancements:
- OpenMP partition: set number of threads on nested level #1082
- Add StaticCrsGraph row() method #1071
- Enhance Kokkos complex operator overloading #1052
- Tell Trilinos packages about host+device lambda #1019
- Function markup for defaulted class members #952
- Add deterministic random number generator #857
Fixed bugs:
- Fix reduction_identity<T>::max for floating point numbers #1048
- Fix MD iteration policy ignores lower bound on GPUs #1041
- (Experimental) HBWSpace Linking issues in KokkosKernels #1094
- (Experimental) ROCm: algorithms/unit_tests test_sort failing with segfault #1070
2.04.00 (2017-08-16)
Implemented enhancements:
- Added ROCm backend to support AMD GPUs
- Kokkos::complex<T> behaves slightly differently from std::complex<T> #1011
- Kokkos::Experimental::Crs constructor arguments were in the wrong order #992
- Work graph construction ease-of-use (one lambda for count and fill) #991
- when_all returns pointer of futures (improved interface) #990
- Allow assignment of LayoutLeft to LayoutRight or vice versa for rank-0 Views #594
- Changed the meaning of Kokkos_ENABLE_CXX11_DISPATCH_LAMBDA #1035
Fixed bugs:
- memory pool default constructor does not properly set member variables. #1007
2.03.13 (2017-07-27)
Implemented enhancements:
- Disallow enabling both OpenMP and Threads in the same executable #406
- Make Kokkos::OpenMP respect OMP environment even if hwloc is available #630
- Improve Atomics Performance on KNL/Broadwell where PREFETCHW/RFO is Available #898
- Kokkos::resize should test whether dimensions have changed before resizing #904
- Develop performance-regression/acceptance tests #737
- Make the deep_copy Profiling hook a start/end system #890
- Add deep_copy Profiling hook #843
- Append tag name to parallel construct name for Profiling #842
- Add view label to "View bounds error" message for CUDA backend #870
- Disable printing the loaded profiling library #824
- "Declared but never referenced" warnings #853
- Warnings about lock_address_cuda_space #852
- WorkGraph execution policy #771
- Simplify makefiles by guarding compilation with appropriate KOKKOS_ENABLE_### macros #716
- Cmake build: wrong include install directory #668
- Derived View type and allocation #566
- Fix Compiler warnings when compiling core unit tests for Cuda #214
Fixed bugs:
- Out-of-bounds read in Kokkos_Layout.hpp #975
- CudaClang: Fix failing test with Clang 4.0 #941
- Respawn when memory pool allocation fails (not available memory) #940
- Memory pool aborts on zero allocation request, returns NULL for < minimum #939
- Error with TaskScheduler query of underlying memory pool #917
- Profiling::*Callee static variables declared in header #863
- calling *Space::name() causes compile error #862
- bug in Profiling::deallocateData #860
- task_depend test failing, CUDA 8.0 + Pascal + RDC #829
- [develop branch] Standalone cmake issues #826
- Kokkos CUDA fails to compile with OMPI_CXX and MPICH_CXX wrappers #776
- Task Team reduction on Pascal #767
- CUDA stack overflow with TaskDAG test #758
- TeamVector test on Cuda #670
- Clang 4.0 Cuda Build broken again #560
2.03.05 (2017-05-27)
Implemented enhancements:
- Harmonize Custom Reductions over nesting levels #802
- Prevent users directly including KokkosCore_config.h #815
- DualView aborts on concurrent host/device modify (in debug mode) #814
- Abort when running on a NVIDIA CC5.0 or higher architecture with code compiled for CC < 5.0 #813
- Add "name" function to ExecSpaces #806
- Allow null Future in task spawn dependences #795
- Add Unit Tests for Kokkos::complex #785
- Add pow function for Kokkos::complex #784
- Square root of a complex #729
- Command line processing of --threads argument prevents users from having any commandline arguments starting with --threads #760
- Protected deprecated API with appropriate macro #756
- Allow task scheduler memory pool to be used by tasks #747
- View bounds checking on host-side performance: constructing a std::string #723
- Add check for AppleClang as compiler distinct from check for Clang. #705
- Uninclude source files for specific configurations to prevent link warning. #701
- Add --small option to snapshot script #697
- CMake Standalone Support #674
- CMake build unit test and install #808
- CMake: Fix having kokkos as a subdirectory in a pure cmake project #629
- Tribits macro assumes build directory is in top level source directory #654
- Use bin/nvcc_wrapper, not config/nvcc_wrapper #562
- Allow MemoryPool::allocate() to be called from multiple threads per warp. #487
- Move OpenMP 4.5 OpenMPTarget backend into Develop #456
- Testing on ARM testbed #288
Fixed bugs:
- Fix label in OpenMP parallel_reduce verify_initialized #834
- TeamScratch Level 1 on Cuda hangs #820
- [bug] memory pool. #786
- Some Reduction Tests fail on Intel 18 with aggressive vectorization on #774
- Error copying dynamic view on copy of memory pool #773
- CUDA stack overflow with TaskDAG test #758
- ThreadVectorRange Customized Reduction Bug #739
- set_scratch_size overflows #726
- Get wrong results for compiler checks in Makefile on OS X. #706
- Fix check if multiple host architectures enabled. #702
- Threads Backend Does not Pass on Cray Compilers #609
- Rare bug in memory pool where allocation can finish on superblock in empty state #452
- LDFLAGS in core/unit_test/Makefile: potential "undefined reference" to pthread lib #148
2.03.00 (2017-04-25)
Implemented enhancements:
- UnorderedMap: make it accept Devices or MemorySpaces #711
- sort to accept DynamicView and [begin,end) indices #691
- ENABLE Macros should only be used via #ifdef or #if defined #675
- Remove impl/Kokkos_Synchronic_* #666
- Turning off IVDEP for Intel 14. #638
- Using an installed Kokkos in a target application using CMake #633
- Create Kokkos Bill of Materials #632
- MDRangePolicy and tagged evaluators #547
- Add PGI support #289
Fixed bugs:
- Output from PerTeam fails #733
- Cuda: architecture flag not added to link line #688
- Getting large chunks of memory for a thread team in a universal way #664
- Kokkos RNG normal() function hangs for small seed value #655
- Kokkos Tests Errors on Shepard/HSW Builds #644
2.02.15 (2017-02-10)
Implemented enhancements:
- Containers: Adding block partitioning to StaticCrsGraph #625
- Kokkos Make System can induce Errors on Cray Volta System #610
- OpenMP: error out if KOKKOS_HAVE_OPENMP is defined but not _OPENMP #605
- CMake: fix standalone build with tests #604
- Change README (that GitHub shows when opening Kokkos project page) to tell users how to submit PRs #597
- Add correctness testing for all operators of Atomic View #420
- Allow assignment of Views with compatible memory spaces #290
- Build only one version of Kokkos library for tests #213
- Clean out old KOKKOS_HAVE_CXX11 macros clauses #156
- Harmonize Macro names #150
Fixed bugs:
- Cray and PGI: Kokkos_Parallel_Reduce #634
- Kokkos Make System can induce Errors on Cray Volta System #610
- Normal() function random number generator doesn't give the expected distribution #592
2.02.07 (2016-12-16)
Implemented enhancements:
- Add CMake option to enable Cuda Lambda support #589
- Add CMake option to enable Cuda RDC support #588
- Add Initial Intel Sky Lake Xeon-HPC Compiler Support to Kokkos Make System #584
- Building Tutorial Examples #582
- Internal way for using ThreadVectorRange without TeamHandle #574
- Testing: Add testing for uvm and rdc #571
- Profiling: Add Memory Tracing and Region Markers #557
- nvcc_wrapper not installed with Kokkos built with CUDA through CMake #543
- Improve DynRankView debug check #541
- Benchmarks: Add Gather benchmark #536
- Testing: add spot_check option to test_all_sandia #535
- Deprecate Kokkos::Impl::VerifyExecutionCanAccessMemorySpace #527
- Add AtomicAdd support for 64bit float for Pascal #522
- Add Restrict and Aligned memory trait #517
- Kokkos Tests are Not Run using Compiler Optimization #501
- Add support for clang 3.7 w/ openmp backend #393
- Provide an error throw class #79
Fixed bugs:
- Cuda UVM Allocation test broken with UVM as default space #586
- Bug (develop branch only): multiple tests are now failing when forcing uvm usage. #570
- Error in generate_makefile.sh for Kokkos when Compiler is Empty String/Fails #568
- XL 13.1.4 incorrect C++11 flag #553
- Improve DynRankView debug check #541
- Installing Library on MAC broken due to cp -u #539
- Intel Nightly Testing with Debug enabled fails #534
2.02.01 (2016-11-01)
Implemented enhancements:
- Add Changelog generation to our process. #506
Fixed bugs:
- Test scratch_request fails in Serial with Debug enabled #520
- Bug In BoundsCheck for DynRankView #516
2.02.00 (2016-10-30)
Implemented enhancements:
- Add PowerPC assembly for grabbing clock register in memory pool #511
- Add GCC 6.x support #508
- Test install and build against installed library #498
- Makefile.kokkos adds expt-extended-lambda to cuda build with clang #490
- Add top-level makefile option to just test kokkos-core unit-test #485
- Split and harmonize Object Files of Core UnitTests to increase build parallelism #484
- LayoutLeft to LayoutLeft subview for 3D and 4D views #473
- Add official Cuda 8.0 support #468
- Allow C++1Z Flag for Class Lambda capture #465
- Add Clang 4.0+ compilation of Cuda code #455
- Possible Issue with Intel 17.0.098 and GCC 6.1.0 in Develop Branch #445
- Add name of view to "View bounds error" #432
- Move Sort Binning Operators into Kokkos namespace #421
- TaskPolicy - generate error when attempt to use uninitialized #396
- Import WithoutInitializing and AllowPadding into Kokkos namespace #325
- TeamThreadRange requires begin, end to be the same type #305
- CudaUVMSpace should track # allocations, due to CUDA limit on # UVM allocations #300
- Remove old View and its infrastructure #259
Fixed bugs:
- Bug in TestCuda_Other.cpp: most likely assembly inserted into Device code #515
- Cuda Compute Capability check of GPU is outdated #509
- multi_scratch test with hwloc and pthreads seg-faults. #504
- generate_makefile.bash: "make install" is broken #503
- make clean in Out of Source Build/Tests Does Not Work Correctly #502
- Makefiles for test and examples have issues in Cuda when CXX is not explicitly specified #497
- Dispatch lambda test directly inside GTEST macro doesn't work with nvcc #491
- UnitTests with HWLOC enabled fail if run with mpirun bound to a single core #489
- Failing Reducer Test on Mac with Pthreads #479
- make test Dumps Error with Clang Not Found #471
- OpenMP TeamPolicy member broadcast not using correct volatile shared variable #424
- TaskPolicy - generate error when attempt to use uninitialized #396
- New task policy implementation is pulling in old experimental code. #372
- MemoryPool unit test hangs on Power8 with GCC 6.1.0 #298
2.01.10 (2016-09-27)
Implemented enhancements:
- Enable Profiling by default in Tribits build #438
- parallel_reduce(0), parallel_scan(0) unit tests #436
- data()==NULL after realloc with LayoutStride #351
- Fix tutorials to track new Kokkos::View #323
- Rename team policy set_scratch_size. #195
Fixed bugs:
- Possible Issue with Intel 17.0.098 and GCC 6.1.0 in Develop Branch #445
- Makefile spits syntax error #435
- Kokkos::sort fails for view with all the same values #422
- Generic Reducers: can't accept inline constructed reducer #404
- data()==NULL after realloc with LayoutStride #351
- const subview of const view with compile time dimensions on Cuda backend #310
- Kokkos (in Trilinos) Causes Internal Compiler Error on CUDA 8.0.21-EA on POWER8 #307
- Core Oversubscription Detection Broken? #159
2.01.06 (2016-09-02)
Implemented enhancements:
- Add "standard" reducers for lambda-supportable customized reduce #411
- TaskPolicy - single thread back-end execution #390
- Kokkos master clone tag #387
- Query memory requirements from task policy #378
- Output order of test_atomic.cpp is confusing #373
- Missing testing for atomics #341
- Feature request for Kokkos to provide Kokkos::atomic_fetch_max and atomic_fetch_min #336
- TaskPolicy<Cuda> performance requires teams mapped to warps #218
Fixed bugs:
- Reduce with Teams broken for custom initialize #407
- Failing Kokkos build on Debian #402
- Failing Tests on NVIDIA Pascal GPUs #398
- Algorithms: fill_random assumes dimensions fit in unsigned int #389
- Kokkos::subview with RandomAccess Memory Trait #385
- Build warning (signed / unsigned comparison) in Cuda implementation #365
- wrong results for a parallel_reduce with CUDA8 / Maxwell50 #352
- Hierarchical parallelism - 3 level unit test #344
- Can I allocate a View w/ both WithoutInitializing & AllowPadding? #324
- subview View layout determination #309
- Unit tests with Cuda - Maxwell #196
2.01.00 (2016-07-21)
Implemented enhancements:
- Edit ViewMapping so assigning Views with the same custom layout compiles when const casting #327
- DynRankView: Performance improvement for operator() #321
- Interoperability between static and dynamic rank views #295
- subview member function ? #280
- Interoperability between View and DynRankView. #245
- (Trilinos) build warning in atomic_assign, with Kokkos::complex #177
- View<>::shmem_size should runtime check for number of arguments equal to rank #176
- Custom reduction join via lambda argument #99
- DynRankView with 0 dimensions passed in at construction #293
- Inject view_alloc and friends into Kokkos namespace #292
- Less restrictive TeamPolicy reduction on Cuda #286
- deep_copy using remap with source execution space #267
- Suggestion: Enable opt-in L1 caching via nvcc-wrapper #261
- More flexible create_mirror functions #260
- Rename View::memory_span to View::required_allocation_size #256
- Use of subviews and views with compile-time dimensions #237
- Kokkos::Timer #234
- Fence CudaUVMSpace allocations #230
- View::operator() accept std::is_integral and std::is_enum #227
- Allocating zero size View #216
- Thread scalable memory pool #212
- Add a way to disable memory leak output #194
- Kokkos exec space init should init Kokkos profiling #192
- Runtime rank wrapper for View #189
- Profiling Interface #158
- Fix View assignment (of managed to unmanaged) #153
- Add unit test for assignment of managed View to unmanaged View #152
- Check for oversubscription of threads with MPI in Kokkos::initialize #149
- Dynamic resizable 1-dimensional view #143
- Develop TaskPolicy for CUDA #142
- New View : Test Compilation Downstream #138
- New View Implementation #135
- Add variant of subview that lets users add traits #134
- NVCC-WRAPPER: Add --host-only flag #121
- Address gtest issue with TriBITS Kokkos build outside of Trilinos #117
- Make tests pass with -expt-extended-lambda on CUDA #108
- Dynamic scheduling for parallel_for and parallel_reduce #106
- Runtime or compile time error when reduce functor's join is not properly specified as const member function or with volatile arguments #105
- Error out when the number of threads is modified after kokkos is initialized #104
- Porting to POWER and remove assumption of X86 default #103
- Dynamic scheduling option for RangePolicy #100
- SharedMemory Support for Lambdas #81
- Recommended TeamSize for Lambdas #80
- Add Aggressive Vectorization Compilation mode #72
- Dynamic scheduling team execution policy #53
- UVM allocations in multi-GPU systems #50
- Synchronic in Kokkos::Impl #44
- index and dimension types in for loops #28
- Subview assign of 1D Strided with stride 1 to LayoutLeft/Right #1
Fixed bugs:
- misspelled variable name in Kokkos_Atomic_Fetch + missing unit tests #340
- seg fault Kokkos::Impl::CudaInternal::print_configuration #338
- Clang compiler error with named parallel_reduce, tags, and TeamPolicy. #335
- Shared Memory Allocation Error at parallel_reduce #311
- DynRankView: Fix resize and realloc #303
- Scratch memory and dynamic scheduling #279
- MemoryPool infinite loop when out of memory #312
- Kokkos DynRankView changes break Sacado and Panzer #299
- MemoryPool fails to compile on non-cuda non-x86 #297
- Random Number Generator Fix #296
- View template parameter ordering Bug #282
- Serial task policy broken. #281
- deep_copy with LayoutStride should not memcpy #262
- DualView::need_sync should be a const method #248
- Arbitrary-sized atomics on GPUs broken; loop forever #238
- boolean reduction value_type changes answer #225
- Custom init() function for parallel_reduce with array value_type #210
- unit_test Makefile is Broken - Recursively Calls itself until Machine Apocalypse. #202
- nvcc_wrapper Does Not Support -Xcompiler <compiler option> #198
- Kokkos exec space init should init Kokkos profiling #192
- Kokkos Threads Backend impl_shared_alloc Broken on Intel 16.1 (Shepard Haswell) #186
- pthread back end hangs if used uninitialized #182
- parallel_reduce of size 0, not calling init/join #175
- Bug in Threads with OpenMP enabled #173
- KokkosExp_SharedAlloc, m_team_work_index inaccessible #166
- 128-bit CAS without Assembly Broken? #161
- fatal error: Cuda/Kokkos_Cuda_abort.hpp: No such file or directory #157
- Power8: Fix OpenMP backend #139
- Data race in Kokkos OpenMP initialization #131
- parallel_launch_local_memory and cuda 7.5 #125
- Resize can fail with Cuda due to asynchronous dispatch #119
- Qthread taskpolicy initialization bug. #92
- Windows: sys/mman.h #89
- Windows: atomic_fetch_sub() #88
- Windows: snprintf #87
- Parallel_Reduce with TeamPolicy and league size of 0 returns garbage #85
- Throw with Cuda when using (2D) team_policy parallel_reduce with less than a warp size #76
- Scalar views don't work with Kokkos::Atomic memory trait #69
- Reduce the number of threads per team for Cuda #63
- Named Kernels fail for reductions with CUDA #60
- Kokkos View dimension_() for long returning unsigned int #20
- atomic test hangs with LLVM #6
- OpenMP Test should set omp_set_num_threads to 1 #4
Closed issues:
- develop branch broken with CUDA 8 and --expt-extended-lambda #354
- --arch=KNL with Intel 2016 build failure #349
- Error building with Cuda when passing -DKOKKOS_CUDA_USE_LAMBDA to generate_makefile.bash #343
- Can I safely use int indices in a 2-D View with capacity > 2B? #318
- Kokkos::ViewAllocateWithoutInitializing is not working #317
- Intel build on Mac OS X #277
- deleted #271
- Broken Mira build #268
- 32-bit build #246
- parallel_reduce with RDC crashes linker #232
- build of Kokkos_Sparse_MV_impl_spmv_Serial.cpp.o fails if you use nvcc and have cuda disabled #209
- Kokkos Serial execution space is not tested with TeamPolicy. #207
- Unit test failure on Hansen KokkosCore_UnitTest_Cuda_MPI_1 #200
- nvcc compiler warning: calling a __host__ function from a __host__ __device__ function is not allowed #180
- Intel 15 build error with defaulted "move" operators #171
- missing libkokkos.a during Trilinos 12.4.2 build, yet other libkokkos*.a libs are there #165
- Tie atomic updates to execution space or even to thread team? (speculation) #144
- New View: Compiletime/size Test #137
- New View : Performance Test #136
- Signed/unsigned comparison warning in CUDA parallel #130
- Kokkos::complex: Need op* w/ std::complex & real #126
- Use uintptr_t for casting pointers #110
- Default thread mapping behavior between P and Q threads. #91
- Windows: Atomic_Fetch_Exchange() return type #90
- Synchronic unit test is way too long #84
- nvcc_wrapper -> $(NVCC_WRAPPER) #42
- Check compiler version and print helpful message #39
- Kokkos shared memory on Cuda uses a lot of registers #31
- Can not pass unit test cuda.space without a GT 720 #25
- Makefile.kokkos lacks bounds checking option that CMake has #24
- Kokkos can not complete unit tests with CUDA UVM enabled #23
- Simplify teams + shared memory histogram example to remove vectorization #21
- Kokkos needs to rever to ${PROJECT_NAME}_ENABLE_CXX11 not Trilinos_ENABLE_CXX11 #17
- Kokkos Base Makefile adds AVX to KNC Build #16
- MS Visual Studio 2013 Build Errors #9
- subview(X, ALL(), j) for 2-D LayoutRight View X: should it view a column? #5
End_C++98 (2015-04-15)
* This Change Log was automatically generated by github_changelog_generator