-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GPU Cellular Automaton #48
GPU Cellular Automaton #48
Conversation
A new Pull Request was created by @felicepantaleo (Felice Pantaleo) for CMSSW_10_2_X_Patatrack. It involves the following packages: RecoPixelVertexing/PixelTriplets @cmsbot, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a cursory look and have also some general comments
- In general the reformatting is nice, but
- there are some cases where IMHO it would be clearer to avoid line breaks
- in principle it would be nicer to have the reformatting in its own PR
- There is a lot of copy-paste (in addition to what there was already), but that is probably best to deal with later when the dust settles (i.e. after migrating to HeterogeneousEDProducer etc)
|
||
std::vector<int> theOuterLayerPairs; | ||
std::vector<int> theInnerLayerPairs; | ||
std::string name() const { return theName; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should return const std::string&
.
Eventually it would be nice to avoid strings as the layer identifiers.
edm::ParameterSet comparitorPSet = | ||
cfg.getParameter<edm::ParameterSet>("SeedComparitorPSet"); | ||
std::string comparitorName = | ||
comparitorPSet.getParameter<std::string>("ComponentName"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the line breaks be avoided?
} | ||
} | ||
|
||
void CAHitQuadrupletGenerator::fillDescriptions(edm::ParameterSetDescription& desc) { | ||
void CAHitQuadrupletGenerator::fillDescriptions( | ||
edm::ParameterSetDescription &desc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the line break be avoided?
@@ -39,15 +39,15 @@ class RecHitsSortedInPhi { | |||
typedef std::pair<HitIter,HitIter> Range; | |||
|
|||
using DoubleRange = std::array<int,4>; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All changes in this file are reformatting, so in principle could be avoided in this PR.
#include <cuda.h> | ||
#include <cuda_runtime.h> | ||
|
||
template <int maxSize, class T> struct GPUSimpleVector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is this different from
https://github.com/cms-patatrack/cmssw/blob/CMSSW_10_2_X_Patatrack/HeterogeneousCore/CUDAUtilities/interface/GPUSimpleVector.h
? (ok, I see int maxSize
template parameter)
Anyway it would be better placed (eventually) in HeterogeneousCore/CUDAUtilities
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, can we avoid having two GPUSimpleVector
classes ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed with Felice and I agree the use case is valid (one GPU allocation vs. a GPU allocation per hit). The intention of the class is to provide a vector-like interface on top of an (dynamically-allocated) array, close to what
cmssw/FWCore/Utilities/interface/VecArray.h
Lines 27 to 28 in 655e4ed
template <typename T, unsigned int N> | |
class VecArray { |
does in CPU.
I'd suggest to treat this class similarly, i.e. rename to e.g. GPUVecArray
(and reorder the template parameters as <class, int>
) and move to HeterogeneousCore/CUDAUtilities/interface
.
<flags EDM_PLUGIN="1"/> | ||
<flags CUDA_FLAGS="--expt-relaxed-constexpr"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to add this, it is included by default in the CUDA_FLAGS
|
||
std::vector<const HitDoublets *> hitDoublets; | ||
|
||
const int numberOfHitsInNtuplet = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is unused
1 similar comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I repeat my earlier general comments on the formatting:
- separating the formatting changes from the rest would make review of the rest much easier
- there are many places where the line breaks make the code more difficult to read (IMHO)
// | ||
// #include "CAHitQuadrupletGeneratorGPU.h" | ||
// using CAHitQuadrupletGPUEDProducer = CAHitNtupletEDProducerT<CAHitQuadrupletGeneratorGPU>; | ||
// DEFINE_FWK_MODULE(CAHitQuadrupletGPUEDProducer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the commented code.
CAHitNtupletHeterogeneousEDProducer::CAHitNtupletHeterogeneousEDProducer( | ||
const edm::ParameterSet &iConfig) | ||
: HeterogeneousEDProducer<heterogeneous::HeterogeneousDevices< | ||
heterogeneous::GPUCuda, heterogeneous::CPU>>(iConfig), doubletToken_(consumes<IntermediateHitDoublets>(iConfig.getParameter<edm::InputTag>("doublets"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base class constructor call can be shortened to
: HeterogeneousEDProducer(iConfig),
|
||
CAHitNtupletHeterogeneousEDProducer::~CAHitNtupletHeterogeneousEDProducer() { | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to replace the empty destructor with adding = default
to the declaration on line 34.
void CAHitNtupletHeterogeneousEDProducer::beginStreamGPUCuda( | ||
edm::StreamID streamId, cuda::stream_t<> &cudaStream) { | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not mandatory to implement beginStreamGPUCuda()
, so you could remove the method altogether. But where do you allocate the GPU memory? (answering to myself: in the constructor, but see other comment why the allocations should (eventually) be moved here)
|
||
void CAHitNtupletHeterogeneousEDProducer::produceGPUCuda( | ||
edm::HeterogeneousEvent &iEvent, const edm::EventSetup &iSetup, | ||
cuda::stream_t<> &cudaStream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ordering acquireGPUCuda()
and produceGPUCuda()
in that order would make them easier to read.
dim3 numberOfBlocks_find(8, numberOfRootLayerPairs); | ||
((GPUSimpleVector<maxNumberOfQuadruplets, Quadruplet> | ||
*)(h_foundNtuplets[regionIndex])) | ||
->reset(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If h_foundNtuplets
really has to contain void *
, could we at least use reinterpret_cast
here (and elsewhere)?
cudaMemsetAsync(device_isOuterHitOfCell, 0, | ||
maxNumberOfLayers * maxNumberOfHits * | ||
sizeof(GPUSimpleVector<maxCellsPerHit, unsigned int>), | ||
cudaStream_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment that this resets temporary memory for the next event, and is not needed for reading the output? Just to explain why it is ok to have async call here without cudaStreamSynchronize()
.
const TrackingRegion ®ion = regionLayerPairs.region(); | ||
auto foundQuads = fetchKernelResult(index); | ||
std::cout << foundQuads.size() << " found quads" << std::endl; | ||
unsigned int numberOfFoundQuadruplets = foundQuads.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation becomes inconsistent at this line. But I'd rather re-indent the lines above than those below.
const TrackingRegion ®ion = regionLayerPairs.region(); | ||
auto seedingHitSetsFiller = seedingHitSets->beginRegion(®ion); | ||
generator_.fillResults(regionDoublets, ntuplets, iSetup, seedingLayerHits, | ||
cudaStream.id()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it looks like CAHitQuadrupletGeneratorGPU::fillResults()
already loops over the regions and fills ntuplets
for all regions. Why does it have to be repeated here for each region? Should the loop perhaps be something along
generator_.fillResults(regionDoublets, ntuplets, iSetup, seedingLayerHits, cudaStream.id());
int index = 0;
for (const auto ®ionLayerPairs : regionDoublets) {
const TrackingRegion ®ion = regionLayerPairs.region();
auto seedingHitSetsFiller = seedingHitSets->beginRegion(®ion);
fillNtuplets(seedingHitSetsFiller, ntuplets[index]);
index++;
)
instead?
<flags EDM_PLUGIN="1"/> | ||
<flags CUDA_FLAGS="--expt-relaxed-constexpr"/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fwyzard commented earlier that this is not needed as the flag is included by default in CUDA_FLAGS
.
It is also noteworthy that (because of the formatting changes, IIUC) this PR conflicts with cms-sw#23363. |
question: with these changes, do we use the new producer for the GPU workflow ? |
No. Additional changes in the configuration are needed to use this new producer. |
1965f2e
to
30594f6
Compare
Validation summaryReference release CMSSW_10_2_0_pre4 at 926a81b
|
The development summary for workflow 10824.8 is very succinct:
Which is strange because the same workflow is successful in the validation of other PRs. |
1 similar comment
|
||
bool emptyRegionDoublets = false; | ||
std::unique_ptr<RegionsSeedingHitSets> seedingHitSets; | ||
std::vector<OrderedHitSeeds> ntuplets; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the member variable names to match the coding rules, i.e. add a trailing _
static constexpr int maxNumberOfHits = 1000; | ||
static constexpr int maxNumberOfRegions = 30; | ||
|
||
unsigned int numberOfRootLayerPairs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the member variable names to match the coding rules, i.e. add a trailing _
theInnerR, theOuterR); | ||
} | ||
|
||
// __host__ __device__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you delete the commented out part ?
1 similar comment
I still need to make use of the existing non-templated GPU vector and probably rebase the PR... |
To reformulate the comment: the code should not crash.
|
@fwyzard, I have implemented your suggestion and replaced the assert with a LogError and return |
Validation summaryReference release CMSSW_10_2_0_pre5 at 30c7b03
|
As of 2ebbcf3 , both 10824.8 workflows are failing (TTbar with |
Validation summaryReference release CMSSW_10_2_0_pre5 at 30c7b03
|
I have split the squash commit in two, to keep separate the work on the |
@makortel sorry for not addressing your comments earlier; can you summarise which ones are still relevant, and I'll try to make the corresponding changes ? |
@fwyzard No problem, I tried to gather them below (most are mostly aesthetic and possibly subjective)
|
I'll implement 1. and 2., look into 3. and 4., and leave 5. as is for the moment... I'd rather make |
@fwyzard Sounds good, thanks! |
OK, I've done something along the lines of 3. and 4. (and maybe 5.) as well. See #83 for the clean up. |
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48: - clean up the `BuildFile.xml` - remove unused data members and arguments from function calls; - percolate the CUDA stream instead of storing it as a data member. Also: - add `cudaCheck` calls around memory allocations and copies; - reduce the number of memory allocations used to set up the GPU state.
Preliminary version of the GPU CA.
I'll make it heterogeneousEDProducer, place copies in the correct place, remove commented code.
@rovere @fwyzard @VinInn @makortel