Straightforward implementation of Depth First Search algorithm for large graph on GPU cluster,may lead to load imbalance between threads and un-coalesced memory accesses,giving rise to the low performance of the algorithm.In order to achieve improvement of the performance in a single GPU and multi-GPUs environment,a series of effective operations were used to reschedule before processing the data.A novel strategy for mapping between threads and data was proposed,by using the prefix sum and binary search operations to achieve perfect load ba...