-
Notifications
You must be signed in to change notification settings - Fork 1
/
Revision
48 lines (36 loc) · 2.01 KB
/
Revision
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
May 18, 2015:
1. Kernel optimized and tested. Current block/grid configuration has one drawback. SR=+/-64 is a critically point. Dropping block sizes/grid sizes will not decrease time because the GPU is underutilized. (+/-64 is just fully utilized).
2. To solve this problem, twice amount of threads are launched. The two halves processing the same location but different pixels.
For example, with +/-64, 256 threads are launched. thread 0 and thread 128 process 0th location. thread 0 calculates half SAD and thread 128 processes the other half.
3. Code modified and tested for twice amount of threads under +/-64. Indeed, large PU consumes half of the time when number of block decrease by half.
There is about 0.22s/frame penalty involved.
May 19,2015:
1. Visual Profiler shows kernels took only 2%(2.8) of the overall time(117s). However, baseline shows 95s. If the time is not consumed in kernel then where did it? This is the task for tomorrow.
2.
Star Search + GPU consumes 115 sec.
Without GPU::Kernel::gpuGpuFullBlockSearch, program consumes 107.51sec
Without Star Searches and GPU, program consumes 95 sec
Problem was in hte Star search part. It consumes a lot of time
3.
xGpuFullBlockSearch() is toned to perform much faster.
May 20, 2015:
1. Modify TZSearch to search just half of the search window
2. Implement heterogeneous search. Dynamic adjusting search window partition based on time.
May 24, 2015
1. Scaling is applied to all sizes and a dramatic improvement has been observed.
2. but WLINC needs to be reconsidered
May 25. 2015
1. Scaling works until 8 and showing good speedup
2. MVP cost calculation optimized
3. m_puiSadCPU alllocated using cudaMallocHost to enable faster copy
4. Kernel reorganized for better readability
May 26. 2015
1. Optimize out some synchthread() if possible
2. Review any possible optimization.
3. Fix WLINC and RedFactor conflict problem
3. Run code for easy, mid and large on WS690 and IVS.
4. Done with optimization.
---->May 27. 2015
1. Optimize CPU portion
Sept 12 2015
Code commented