Fsdp sac #10

Open
wants to merge 3 commits into
base: main

xuanzhang816
Collaborator

  • In this example, with a smaller memory budget the solver chooses more granular FSDP units instead of applying AC to any modules. This could be because
    • the batch size is small, so AC saves only a minimal amount of memory
    • the model is quite compute-bound and all-gather/reduce-scatter latency is small, so making the FSDP units more granular does not hurt communication/computation overlap
  • The solver took a while to solve this: almost 4 minutes (232.01 sec)
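
The overlap argument in the second sub-bullet can be sketched with a simplified model: a collective is only "exposed" for the part of its latency that is not hidden behind compute running at the same time. This is an illustrative assumption, not necessarily the formulation used in `fsdp_sac_ilp.py`; the function name `exposed_comm_ms` is hypothetical, and the sample values are latencies that appear in the details dump further down.

```python
def exposed_comm_ms(comm_ms: float, overlap_compute_ms: float) -> float:
    """Part of a collective's latency that is NOT hidden behind compute.

    Simplified assumption: the collective runs concurrently with exactly
    one chunk of compute, so only the excess latency is exposed.
    """
    return max(0.0, comm_ms - overlap_compute_ms)


# A 1.41 ms all-gather running alongside 2.90 ms of forward compute
# is fully hidden under this model.
hidden = exposed_comm_ms(1.41, 2.90)

# The same all-gather alongside only 0.01 ms of compute is almost
# entirely exposed.
exposed = exposed_comm_ms(1.41, 0.01)

print(f"hidden case: {hidden:.2f} ms, exposed case: {exposed:.2f} ms")
```

Under this model, splitting into smaller FSDP units costs nothing as long as each unit's collective stays shorter than the compute it overlaps with, which is consistent with the compute-bound explanation above.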
$ python fsdp_sac_ilp.py --in_file=GPT_modules_info.json --memory_budget=2.5 --verbose
On a single GPU
  peak memory is 6.73 GiB
  compute time is 204.37 ms
------------------------------------------------------------
Solver completed in 232.01 sec
AC decisions are {}
On 8 GPUs
  FSDP units are {'GPT.transformer.h.2.mlp.c_proj', 'GPT.transformer.h.5.mlp.c_fc', 'GPT.transformer.h.0.mlp.c_proj', 'GPT.lm_head', 'GPT.transformer.h.2.attn', 'GPT.transformer.h.4.mlp.c_fc', 'GPT.transformer.h.3.attn', 'GPT.transformer.h.5.mlp.c_proj', 'GPT.transformer.h.5.attn', 'GPT.transformer.h.1.mlp.c_fc', 'GPT.transformer.h.0.attn', 'GPT.transformer.wpe', 'GPT', 'GPT.transformer.h.4.attn', 'GPT.transformer.h.1.mlp.c_proj', 'GPT.transformer.h.4.mlp.c_proj', 'GPT.transformer.h.0.mlp.c_fc', 'GPT.transformer.h.1.attn', 'GPT.transformer.h.3.mlp.c_fc', 'GPT.transformer.h.2.mlp.c_fc', 'GPT.transformer.h.3.mlp.c_proj'}
  peak memory is 2.41 GiB
  total exposed computation time + recomputation time is 2.3878 ms
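
The per-module rows in the DETAILS dump that follows share a regular `key = value unit` layout, so they can be loaded for further analysis with a small parser. The sketch below is a hypothetical helper (`FIELD_RE` and `parse_row` are not part of the PR); it treats field names such as `p_i` or `fw_ag_i` as opaque keys, since their meanings are not defined in this output.

```python
import re

# Matches "key = number [GiB|ms]"; the unit is optional (e.g. r_i has none).
FIELD_RE = re.compile(r"(\w+)\s*=\s*(-?\d+(?:\.\d+)?)(?:\s*(GiB|ms)\b)?")


def parse_row(line):
    """Parse one line of the dump into (module_name, is_fsdp_unit, fields).

    The first line of each module entry contains "name :"; continuation
    lines carry only fields, so module_name is None for them.
    """
    head, sep, rest = line.partition(":")
    if sep:
        tokens = head.split()
        name = tokens[-1]
        # Rows chosen as FSDP units are prefixed with the word "FSDP".
        is_fsdp = len(tokens) > 1 and tokens[0] == "FSDP"
    else:
        name, is_fsdp, rest = None, False, line
    fields = {
        key: (float(value), unit or None)
        for key, value, unit in FIELD_RE.findall(rest)
    }
    return name, is_fsdp, fields


first = "FSDP    GPT.transformer.wpe   :   p_i     =   0.01 GiB   r_i     =   0.00       m_i     =   0.86 GiB "
cont = "            ag_i    =   0.07 ms    fw_ag_i =   1.41 ms "

print(parse_row(first))
print(parse_row(cont))
```

Each field is kept as a `(value, unit)` pair so that GiB and ms quantities are not mixed by accident when aggregating.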


 --------- DETAILS ---------
FSDP    GPT                                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.47 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   2.13 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.07 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.07 ms 
                                                    FCP_i   =  60.36 ms    BCP_i   = 144.01 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.wte                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.19 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.85 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.07 ms    BCP_i   =   0.58 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.wpe                     :   p_i     =   0.01 GiB   g_i     =   0.01 GiB   a_i     =   0.19 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.86 GiB 
                                                    ag_i    =   0.07 ms    fw_ag_i =   1.41 ms    bw_ag_i =   0.01 ms    rs_i    =   0.07 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   0.01 ms    BCP_i   =   0.61 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   1.40 ms    bw_e_i  =   0.82 ms  
        GPT.transformer.drop                    :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.06 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.70 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.38 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.02 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.06 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.70 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.0.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.26 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.19 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   0.07 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.20 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.12 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.15 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.06 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.13 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.04 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.13 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.76 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.42 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.05 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.0.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.32 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.23 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.26 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.87 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.0.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.41 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.30 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.0.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.25 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.84 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.57 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.16 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.25 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.84 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.1.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.45 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.33 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.39 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.26 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.34 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.20 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.32 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.18 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.32 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.90 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.62 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.19 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.1.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.51 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.37 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.45 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.01 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.1.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.60 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.44 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.1.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.44 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.98 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.76 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.31 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.44 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   0.98 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.2.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.64 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.47 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.58 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.41 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.54 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.35 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.52 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.32 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.52 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.04 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.81 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.33 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.2.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.70 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.51 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.64 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.15 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.2.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.80 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.58 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.2.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.64 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.12 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.96 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.45 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.64 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.12 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.3.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.84 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.61 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.78 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.55 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.73 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.49 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.71 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.46 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.71 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.18 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.01 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.48 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.3.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.90 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.65 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.84 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.29 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.3.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   0.99 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.72 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.3.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.83 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.26 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.15 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.59 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.83 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.26 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.4.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.03 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.75 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.97 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.69 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.92 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.63 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.90 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.60 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   0.90 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.32 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.20 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.62 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.4.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.09 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.79 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.03 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.43 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.4.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.18 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.87 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.4.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.02 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.41 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5                     :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.35 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.73 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   9.72 ms    BCP_i   =  23.10 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.ln_1                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.02 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.41 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.5.attn                :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.23 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.89 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   3.72 ms    BCP_i   =   8.40 ms    rcp_i   =   0.28 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.attn.c_attn         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.16 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.83 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   2.18 ms    BCP_i   =   5.25 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.attn.c_proj         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.12 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.77 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.74 ms    BCP_i   =   1.82 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.attn.resid_dropout  :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.10 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.74 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.ln_2                :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.10 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.46 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.08 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.mlp                 :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.39 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.76 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   5.90 ms    BCP_i   =  14.36 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.5.mlp.c_fc            :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.28 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.93 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   1.41 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   1.41 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   6.97 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.mlp.gelu            :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.22 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.57 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.08 ms    BCP_i   =   0.26 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.transformer.h.5.mlp.c_proj          :   p_i     =   0.14 GiB   g_i     =   0.14 GiB   a_i     =   1.38 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   2.01 GiB 
                                                    ag_i    =   1.41 ms    fw_ag_i =   0.95 ms    bw_ag_i =   1.41 ms    rs_i    =   1.41 ms    bw_rs_i =   0.95 ms 
                                                    FCP_i   =   2.90 ms    BCP_i   =   7.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.h.5.mlp.dropout         :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.22 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.55 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.03 ms    BCP_i   =   0.06 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
        GPT.transformer.ln_f                    :   p_i     =   0.00 GiB   g_i     =   0.00 GiB   a_i     =   1.22 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.55 GiB 
                                                    ag_i    =   0.01 ms    fw_ag_i =   0.00 ms    bw_ag_i =   0.00 ms    rs_i    =   0.01 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   0.02 ms    BCP_i   =   0.07 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
FSDP    GPT.lm_head                             :   p_i     =   0.09 GiB   g_i     =   0.09 GiB   a_i     =   1.34 GiB   d_i     =   0.00 GiB   r_i     =   0.00       m_i     =   1.86 GiB 
                                                    ag_i    =   0.95 ms    fw_ag_i =   0.00 ms    bw_ag_i =   1.41 ms    rs_i    =   0.95 ms    bw_rs_i =   0.00 ms 
                                                    FCP_i   =   1.91 ms    BCP_i   =   4.61 ms    rcp_i   =   0.00 ms    rct_i   =   0.00 ms    fw_e_i  =   0.00 ms    bw_e_i  =   0.00 ms  
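
For comparing runs across memory budgets, the per-module rows above can be loaded into a dictionary. The following is only a sketch that assumes the three-row layout shown in this dump (an optional `FSDP` flag, the module name, a colon, then `key = value unit` fields). The helper `parse_details` and the `fsdp` key are hypothetical names and are not part of `fsdp_sac_ilp.py`.

```python
import re

# One "key = value [unit]" field, e.g. "a_i     =   1.15 GiB".
FIELD_RE = re.compile(r"(\w+)\s*=\s*([-+]?\d+(?:\.\d+)?)\s*(GiB|ms)?")
# First row of a record: optional "FSDP" flag, module name, colon, fields.
HEADER_RE = re.compile(r"^\s*(FSDP\s+)?([\w.]+)\s*:\s*(.*)$")


def parse_details(text):
    """Return {module_name: {"fsdp": bool, field_name: float, ...}}.

    Hypothetical helper: it assumes the layout of the verbose dump above
    and drops the units (GiB / ms) after matching them.
    """
    records = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            flag, name, rest = header.groups()
            current = records.setdefault(name, {"fsdp": flag is not None})
            line = rest  # the header row also carries fields
        if current is None:
            continue  # skip anything before the first module row
        for key, value, _unit in FIELD_RE.findall(line):
            current[key] = float(value)
    return records


# Two records copied from the dump above (column spacing shortened).
SAMPLE = """\
        GPT.transformer.h.4   :   p_i = 0.00 GiB   g_i = 0.00 GiB   a_i = 1.15 GiB   d_i = 0.00 GiB   r_i = 0.00   m_i = 1.59 GiB
                                  ag_i = 0.01 ms   fw_ag_i = 0.00 ms   bw_ag_i = 0.00 ms   rs_i = 0.01 ms   bw_rs_i = 0.00 ms
                                  FCP_i = 9.72 ms   BCP_i = 23.10 ms   rcp_i = 0.00 ms   rct_i = 0.00 ms   fw_e_i = 0.00 ms   bw_e_i = 0.00 ms
FSDP    GPT.transformer.h.4.attn   :   p_i = 0.14 GiB   g_i = 0.14 GiB   a_i = 1.03 GiB   d_i = 0.00 GiB   r_i = 0.00   m_i = 1.75 GiB
                                  ag_i = 1.41 ms   fw_ag_i = 1.41 ms   bw_ag_i = 1.41 ms   rs_i = 1.41 ms   bw_rs_i = 1.41 ms
                                  FCP_i = 3.72 ms   BCP_i = 8.40 ms   rcp_i = 0.28 ms   rct_i = 0.00 ms   fw_e_i = 0.00 ms   bw_e_i = 0.00 ms
"""

records = parse_details(SAMPLE)
fsdp_units = sorted(name for name, rec in records.items() if rec["fsdp"])
print(fsdp_units)
```

Feeding the full `--verbose` output to the same function should give one entry per module, which makes it easier to compare the chosen FSDP units and the `m_i` values between budgets.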

@xuanzhang816
Collaborator Author

@sanketpurandare let's discuss how to pick some good models and model configs to test the formulation.
