
Is this library parallelized? #165

Open
ch21d012 opened this issue Aug 28, 2024 · 6 comments

@ch21d012

Hi, I am using the FTorch library to perform tensor operations in my Fortran code to improve its speed. My code is parallelized using MPI, but when I run it on multiple processors with this library it is slow. So I want to know whether this library can be used in a parallel environment or not?

Thanks in advance

@jatkinson1000
Member

Hi @ch21d012, yes: whilst FTorch itself has no MPI components, we have run several applications that are large MPI codebases on HPC systems.

To do this we create a net and tensors on each process, then run inference on each process.

The most common cause for slowdown is if you are reading in the net at every iteration/timestep of your code, as this is expensive.
Instead you should read in the net only once per process during initialisation, keep it in memory whilst running your code, and then destroy it at the end of the simulation.
An example of this can be seen for an atmospheric modelling code here: https://github.com/Cambridge-ICCS/FTorch-benchmarks/blob/main/benchmark_mima/cg_drag_torch_mod.f90 (though note that this was written for an older version of the API).

I'm hoping to write a worked example around this when I get time.
You may also want to view the second exercise in the workshop here: https://github.com/Cambridge-ICCS/FTorch-workshop which demonstrates exactly this speedup by comparing the two approaches (albeit for a single process).

Please let us know if this helps, and ask any further questions.
If you can point us to the code, we can try to take a look to better understand what you are doing.
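
The lifecycle described above (load once per process, run inference inside the time loop, destroy at the end) can be sketched roughly as follows. This is a minimal illustration, not FTorch's official example: it assumes the older FTorch API names that appear later in this thread (torch_module_load, torch_module_forward, torch_module_delete), and the model path, array sizes, and loop bounds are placeholders.

! Sketch of the recommended model lifecycle in an MPI code.
! Assumes the older FTorch API used elsewhere in this thread;
! "model.pt", array sizes, and nsteps are placeholders.
program mpi_inference_sketch
  use ftorch
  use mpi
  implicit none

  type(torch_module) :: model
  type(torch_tensor), dimension(1) :: in_tensors
  type(torch_tensor) :: out_tensor
  integer :: ierr, step
  integer :: tensor_layout(1) = [1]
  integer, parameter :: nsteps = 1000
  real(kind=4), dimension(40) :: x   ! inputs for this process
  real(kind=4), dimension(34) :: y   ! outputs for this process

  call MPI_Init(ierr)

  ! Load the net ONCE per MPI process, during initialisation
  model = torch_module_load("model.pt")

  do step = 1, nsteps
     ! ... fill x for this process/timestep ...
     in_tensors(1) = torch_tensor_from_array(x, tensor_layout, torch_kCPU)
     out_tensor = torch_tensor_from_array(y, tensor_layout, torch_kCPU)
     call torch_module_forward(model, in_tensors, 1, out_tensor)
     call torch_tensor_delete(in_tensors(1))
     call torch_tensor_delete(out_tensor)
  end do

  ! Destroy the net ONCE, at the end of the simulation
  call torch_module_delete(model)
  call MPI_Finalize(ierr)
end program mpi_inference_sketch

The key point is that only the cheap tensor wrapping and the forward call sit inside the time loop; the expensive torch_module_load happens once per process.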

@ch21d012
Author

ch21d012 commented Aug 29, 2024

module ann
  use fcmech
  use ftorch
  use precision
  use parser
  use string
  use parallel
  use, intrinsic :: iso_fortran_env, only: gsp => real32
  implicit none

  ! Simulation type
  character(len=str_medium) :: sim

  ! Number of input nodes
  integer, parameter :: nodinput = 40

  ! Number of output nodes
  integer, parameter :: nodoutput = 39

  integer :: jj

  integer, parameter :: n_inputs = 1

  type(torch_module) :: model1
  type(torch_tensor), dimension(n_inputs) :: in_tensors
  type(torch_tensor) :: out_tensor1

  ! Input/output scaling factors
  real(WP), dimension(nodinput)  :: sc_in
  real(WP), dimension(nodoutput) :: sc_out

contains

  subroutine read_pt_file()
    implicit none

    ! Load the TorchScript model once, during initialisation
    model1 = torch_module_load("/home/goyallab/Documents/GitHub/FTorch/examples/4_2annM/3M_Nradicals_v1.pt")

  end subroutine read_pt_file

  subroutine delete_pt_file()
    implicit none

    ! Cleanup
    call torch_module_delete(model1)

  end subroutine delete_pt_file

  subroutine ann_get_src_pt(sol, deltat)
    implicit none

    real(WP), dimension(npS1) :: sol
    real(WP) :: deltat

    ! Working precision for the arrays passed to FTorch
    integer, parameter :: gwp = gsp

    real(gwp), dimension(34) :: pred_rhs1
    real(gwp), dimension(nodinput) :: X

    integer :: tensor_layout(1) = [1]

    ! Scale the input (apply transform)
    X = sol/sc_in

    ! Create Torch input/output tensors from the above arrays
    in_tensors(1) = torch_tensor_from_array(X, tensor_layout, torch_kCPU)
    out_tensor1 = torch_tensor_from_array(pred_rhs1, tensor_layout, torch_kCPU)

    ! Run inference
    call torch_module_forward(model1, in_tensors, n_inputs, out_tensor1)

    call torch_tensor_delete(in_tensors(1))
    call torch_tensor_delete(out_tensor1)

    ! Rescale the outputs
    pred_rhs1(1) = (pred_rhs1(1)/1000)*sc_out(1)
    pred_rhs1(2) = (pred_rhs1(2)/1000)*sc_out(3)
    pred_rhs1(3:7) = (pred_rhs1(3:7)/1000)*sc_out(5:9)
    pred_rhs1(8:10) = (pred_rhs1(8:10)/1000)*sc_out(11:13)
    pred_rhs1(11:34) = (pred_rhs1(11:34)/1000)*sc_out(15:38)

    ! Advance the solution
    sol(2) = sol(2) + (pred_rhs1(1)*deltat)
    sol(4) = sol(4) + (pred_rhs1(2)*deltat)
    sol(6:10) = sol(6:10) + (pred_rhs1(3:7)*deltat)
    sol(12:14) = sol(12:14) + (pred_rhs1(8:10)*deltat)
    sol(16:39) = sol(16:39) + (pred_rhs1(11:34)*deltat)

  end subroutine ann_get_src_pt

end module ann

This is my module. I load the model only once, at the start of the simulation during initialization, by calling read_pt_file, and delete it at the end of the simulation with delete_pt_file.

@ch21d012
Author

ch21d012 commented Aug 29, 2024

do k=kmin_,kmax_
   do j=jmin_,jmax_
      do i=imin_,imax_

         ! Not in walls
         if (mask(i,j).eq.1) cycle

         ! Get initial field
         sol(1:Nsp)=max(SC(i,j,k,isc_1:isc_1+Nsp-1),0.0_WP)
         sol(1:Nsp)=sol(1:Nsp)/sum(sol(1:Nsp))
         sol(NT)=SC(i,j,k,isc_T)
         solold=sol

         ! Get chemical mapping after dt
         call ann_get_src_pt(sol,dt)

         ! Get chemical source term for scalar equation
         SRCchem(i,j,k,:)=sol-solold

      end do
   end do
end do

I am calling ann_get_src_pt in the above loop, where it gives me the sol values.

This code is parallelized, and the loop iterations are distributed across the processors.

@TomMelt
Contributor

TomMelt commented Sep 9, 2024

Dear @ch21d012, in order for us to help more, we would need some timing information.

Could you please provide timings for the original code on 1 and 4 processes, and then similar timings for running the code with the PyTorch model? You can time the whole code, but it would also be handy to know how long the inference/ann_get_src_pt step takes.
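
One way to collect such per-step timings in an MPI code is to wrap the inference call with MPI_Wtime. This is only a sketch: the variable names (sol, dt, nsteps, rank) stand in for the corresponding quantities in the user's code, which are not shown in this thread.

! Sketch: timing the ann_get_src_pt step with MPI_Wtime and
! reporting the slowest process. sol, dt, nsteps, and rank are
! placeholders for quantities in the surrounding code.
real(8) :: t0, t_local, t_max
integer :: ierr, step
t_local = 0.0d0
do step = 1, nsteps
   t0 = MPI_Wtime()
   call ann_get_src_pt(sol, dt)
   t_local = t_local + (MPI_Wtime() - t0)
end do
! Maximum accumulated inference time across all ranks
call MPI_Reduce(t_local, t_max, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                0, MPI_COMM_WORLD, ierr)
if (rank == 0) print *, 'max ann_get_src_pt time (s): ', t_max

Reporting the maximum across ranks (rather than rank 0's time alone) shows whether one process is disproportionately slow.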

@ch21d012
Author

Dear @TomMelt, thanks for the reply.

Without the PyTorch model, a call to ann_get_src_pt takes about the same time on 1 processor and on 4 processors, i.e. 1e-3 s. Since the loop iterations are distributed across processors, the original code is therefore faster overall on 4 processors.

But when I run with the PyTorch model, the call takes 1e-3 s on 1 processor, and on 4 processors it increases to 5e-3 s.

These timings are for the ann_get_src_pt step only.

@TomMelt
Contributor

TomMelt commented Sep 16, 2024

Hi @ch21d012, I think it might be easiest if we meet virtually so we can discuss this problem in more detail. FTorch is maintained here at ICCS, and we run "code clinics" to help our collaborators. Would you mind completing this form? Under the section that asks for your "VESRI team", please put "other" and write "FTorch".
