The document compares the Kepler GPU and Xeon Phi architectures through microbenchmarks that test memory-bound, compute-bound, and latency-bound workloads. For memory-bound workloads, it finds that vectorizing loads and using the texture cache improve performance on Kepler, while gather operations and aligned loads help Xeon Phi. For compute-bound workloads, vectorizing and using the float4/double4 types benefit Kepler, and intrinsics aid Xeon Phi. For latency-bound workloads, loop interchange and skipping the L2 cache help Kepler, while gather operations and aligned loads assist Xeon Phi. The conclusion notes that vendor-published performance data may differ from the experimental results and that the examples are available online.
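As an illustration of the loop interchange mentioned for latency-bound workloads, the following is a minimal host-side C++ sketch, not the document's own benchmark code. The function names `sum_strided` and `sum_interchanged` and the row-major matrix layout are assumptions made for this example; the transformation simply swaps the loop order so that the inner loop walks contiguous memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: the matrix is stored row-major in a flat vector,
// so element (i, j) lives at index i * cols + j.

// Original loop order: the inner loop advances i, so consecutive
// iterations are `cols` elements apart in memory (strided access).
double sum_strided(const std::vector<double>& a,
                   std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    double total = 0.0;
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            total += a[i * cols + j];
        }
    }
    return total;
}

// Interchanged loop order: the inner loop advances j, so consecutive
// iterations touch adjacent elements (unit-stride access).
double sum_interchanged(const std::vector<double>& a,
                        std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            total += a[i * cols + j];
        }
    }
    return total;
}
```

Both functions visit the same elements and differ only in traversal order. Whether the interchanged version is faster, and by how much, depends on the hardware and the problem size, so any gain should be measured rather than assumed.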