Existing SIMD extensions in scalar CPUs (e.g., SSE and AVX) can leverage instruction-level parallelism (ILP) because of their tight integration with the CPU pipeline. However, the vectors they employ are quite short, which limits their ability to exploit data-level parallelism (DLP). On the other hand, processing-using-memory (PUM) accelerators can exploit massive amounts of DLP, as they typically perform computation on very long vectors (tens of thousands of elements) within the memory itself. Recent work demonstrates that these architectures can achieve order-of-magnitude speedups over area-equivalent multicore CPUs with SIMD extensions for a variety of workloads. Still, PUM architectures are largely decoupled from the CPU itself, which limits their ability to tap the CPU’s ILP the way SIMD extensions do. In this paper, we propose PUMICE, a tightly integrated CPU-PUM architecture that simultaneously exploits DLP and ILP for very long vector operations. As a result of this tight integration, PUMICE delivers significant performance gains: our experimental results show speedups of up to 2.2× (1.4× on average) over a state-of-the-art decoupled approach.