Skip to content

CUDA: enable cuda support v1 - EAGER with GDR COPY#20

Open
bureddy wants to merge 9 commits into
masterfrom
cuda-domain
Open

CUDA: enable cuda support v1 - EAGER with GDR COPY#20
bureddy wants to merge 9 commits into
masterfrom
cuda-domain

Conversation

@bureddy
Copy link
Copy Markdown
Owner

@bureddy bureddy commented Sep 26, 2017

    -   configuration
    -   UCT interface changes to detect address domain
    -   GDR COPY UCT
              - put-zcopy func to write data from bounce buffer to domain(CUDA) buffer 
    -   CUDA COPY UCT
              - get-zcopy func to stage data from domain buffer to bounce buffer
   -    UCP protocol changes for CUDA eager zcopy
             - domain detection
             - domain unpack
             - domain uct lanes
   -   mpool for cuda rndv staging
   -  cudaFree memhooks

Pending:
    -  rndv pipeline transfers using PUT protocol

@bureddy bureddy changed the title CUDA: enable cuda support v1 CUDA: enable cuda support v1 - EAGER with GDR COPY Sep 28, 2017
Comment thread src/ucm/cuda/cudamem.h
@@ -0,0 +1,22 @@
/**
* Copyright (C) Mellanox Technologies Ltd. 2001-2015. ALL RIGHTS RESERVED.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright year 2017?

Comment thread src/ucm/cuda/install.c
ucm_debug("cudamem test: got 0x%x out of 0x%x, total: 0x%x", out_events, events,
installed_events);

/* Return success iff we caught all wanted events */
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iff->if

Comment thread src/ucm/cuda/replace.c
#include <pthread.h>


#define MAP_FAILED ((void*)-1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it required? MAP_FAILED is defined in <sys/mman.h>, and MAP_FAILED is not used here

Comment thread src/ucp/core/ucp_ep.c
iface_attr = &worker->ifaces[rsc_index].attr;

domain_config->tag.eager.max_short = iface_attr->cap.am.max_short;
//TODO: zcopy thrshold should be based on the ep AM lane capability with domain addr(i.e can UCT do zcopy from domain)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thrshold -> threshold

Comment thread src/ucp/core/ucp_ep.h
} ucp_ep_rma_config_t;


#define UCP_IS_DEFAULT_ADDR_DOMAIN(_addr_dn_h) (_addr_dn_h == &ucp_addr_dn_dummy_handle)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(_addr_dn_h == &ucp_addr_dn_dummy_handle) -> ((_addr_dn_h) == &ucp_addr_dn_dummy_handle)

Comment thread src/ucp/dt/dt.c
return UCS_OK;
}

while(1) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space after while:
while (1) {

Comment thread src/ucp/tag/rndv.c
rreq->recv.length, &rreq->recv.state,
data + hdr_len, recv_len, 1);
status = ucp_dt_unpack(rreq, rreq->recv.datatype, rreq->recv.buffer,
rreq->recv.length, &rreq->recv.state,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allignment

Comment thread src/ucp/wireup/select.c

dn_md_map = addr_dn_h->md_map;

while(1) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding guide line: space after while: while (1) {

uct_cuda_copy_event_desc_t *cuda_event;
cudaError_t result = cudaSuccess;

ucs_queue_for_each_safe(cuda_event, iter, &iface->pending_event_q, queue)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding guideline: ucs_queue_for_each_safe() {

ucs_error("cudaEventCreateWithFlags Failed");
}
}
void uct_cuda_copy_event_desc_cleanup(ucs_mpool_t *mp, void *obj)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't call these functions directly by name from anywhere, may be declare it as 'static'?

Comment thread src/uct/api/uct.h
size_t max_alloc; /**< Maximal allocation size */
size_t max_reg; /**< Maximal registration size */
uint64_t flags; /**< UCT_MD_FLAG_xx */
uct_addr_domain_t addr_dn; /**< Supported addr domain */
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to uct_memory_domain_t

Copy link
Copy Markdown
Owner Author

@bureddy bureddy Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or memory_type , memory_kind

Comment thread src/uct/api/uct.h
typedef enum {
UCT_MD_ADDR_DOMAIN_CUDA = 0, /**< NVIDIA CUDA domain */
UCT_MD_ADDR_DOMAIN_DEFAULT, /**< Default system domain */
UCT_MD_ADDR_DOMAIN_LAST = UCT_MD_ADDR_DOMAIN_DEFAULT
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check LAST

Comment thread src/uct/api/uct.h
UCT_MD_FLAG_SOCKADDR = UCS_BIT(7), /**< MD support for client-server
connection establishment via
sockaddr */
UCT_MD_FLAG_ADDR_DN = UCS_BIT(8) /**< MD supports memory addr domain
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to _MRM_DETECT

Comment thread src/uct/api/uct.h
* Return UCS_OK if address belongs to MDs address domain
*
* @param [in] md Memory domain to register memory on.
* @param [in] address Memory address to detect.
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add to ucx_info to show memory detect support

Comment thread src/ucp/core/ucp_mm.c
/*TODO: return if no MDs with address domain detect */

for (md_index = 0; md_index < context->num_mds; ++md_index) {
if (context->tl_mds[md_index].attr.cap.flags & UCT_MD_FLAG_ADDR_DN) {
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove cap flag and use UCT_MD_ADDR_DOMAIN_DEFAULT

struct ucp_request {
ucs_status_t status; /* Operation status */
uint16_t flags; /* Request flags */
ucp_addr_dn_h addr_dn_h; /* Memory domain handle */
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

embeded fields into req struct

Comment thread src/ucp/tag/tag_send.c
ucp_ep_config(ep)->tag.eager.max_short,
ucp_ep_config(ep)->tag.eager.zcopy_thresh,
ucs_likely(UCP_IS_DEFAULT_ADDR_DOMAIN(addr_dn_h)) ?
ucp_ep_config(ep)->tag.eager.max_short :
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array of eager threshold per domain

Comment thread src/ucp/tag/tag_send.c
buffer, count, tag, ucp_ep_peer_name(ep), cb);

if (ucs_likely(UCP_DT_IS_CONTIG(datatype))) {
ucp_addr_domain_detect_mds(ep->worker->context, (void *)buffer, &addr_dn_h);
Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

global context value to check if detect is needed or not

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants