We present faster algorithms for the residue multiplication modulo 521-bit Mersenne prime on 32- and 64-bit platforms by using Toeplitz matrix-vector product. The total arithmetic cost of our proposed algorithms is less than that of existing algorithms, with algorithms for 64- and 32-bit residue multiplication giving the best timing results on our test machine. The transition from 64- to 32-bit implementation is full of challenges because the number of limbs doubles and the limbs' bitlengths are cut in half. Without using any intrinsics or SIMD/assembly instructions in our implementation on an Intel(R) Core i5 - 6402P CPU @ 2.80GHz, we find 136 and 550 cycles for our 64- and 32-bit residue multiplications, respectively. In addition, we implement constant-time variable- and fixed-base scalar multiplication for the standard NIST curve P-521 and Edwards curve E-521.