Is it really true that NOEL-V is able to reach up to 4.69 CoreMark/MHz?
After a series of tests, we’ve measured the best score we could with the GRLIB Release 2025.1-b4296. Ten iterations ran for almost 12 seconds at a 24 MHz clock, which gives the following CoreMark result:
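For reference, the figures above translate into a score like this (a quick sketch of the standard CoreMark arithmetic: iterations per second, normalized by the clock frequency):

```shell
# CoreMark score = iterations / total runtime; divide by the clock in MHz
# to get the per-MHz figure quoted by Gaisler.
awk 'BEGIN {
    iters = 10; secs = 12; clk_mhz = 24
    score = iters / secs                  # iterations per second
    printf "%.4f CoreMark/MHz\n", score / clk_mhz
}'
# prints 0.0347 CoreMark/MHz
```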
The benchmark was compiled using xPack GCC. Gaisler’s GCC was also tested and didn’t give any different execution times. Using any code optimisation or the suggested flags from this page only made the result worse, making the same 10 iterations run for a whole 145 seconds. Malloc wasn’t used. Printf wasn’t used. Messages were sent to the terminal using the APB UART. Execution time was measured with the APB Timer.
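For context, a minimal sketch of what such a bare-metal CoreMark build might look like with the xPack toolchain (the ISA string, linker script, and port file are assumptions here; in practice they come from your board support code):

```shell
# Hypothetical bare-metal build; core_portme.c and link.ld are placeholders
# for your own port and linker script.
riscv-none-elf-gcc -O2 -march=rv32imc_zicsr -mabi=ilp32 \
    -DITERATIONS=10 -DPERFORMANCE_RUN=1 \
    core_list_join.c core_main.c core_matrix.c core_state.c core_util.c \
    core_portme.c -T link.ld -o coremark.elf
```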
We can actually get significantly better than 4.69 CoreMark/MHz these days - should update those numbers.
Anyway, you say nothing about which NOEL-V configuration you are running. With an IP as configurable as NOEL-V, the specific configuration makes a lot of difference. I suppose it would be a good idea to publish numbers on the web for something smaller than the HPP64 as well.
RV64 vs RV32 should not really matter, but there are a lot of extensions that would help performance, as would using a newer compiler. The compiler options you mention do not even include an optimization level - I hope they are simply not shown there?
Aside from the above, your amazingly low numbers (and the fact that things get even worse when you try what we used) strongly suggest that you have small or no caches (or caches that are not working properly). CoreMark really needs to run from L1 cache, and that requires 16+16 kByte (I+D) caches, and then making sure the code actually fits - not letting the compiler unroll loops too much; O3-level optimization by itself may not be good.
I made some changes to the project. The first time, I forgot to mention that I’m using RV32 with the minimal configuration. Since then I have tweaked the `noelvcpu.vhd` file in the MIN config. Those changes include:
Increased the L1 caches from 8+8 to 64+64 kByte (I+D);
Set the `bhtentries` field to 128;
Included the following extensions: Zcb, Zba, Zbb, Zbc, Zbkb, Zbkc, Zbkx;
Made sure the code executes from internal memory, generated with `ahbram`.
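On the software side, a hedged sketch of the ISA string matching the extensions enabled above (the exact spelling, ordering, and availability depend on your GCC version; note that Zcb also requires Zca, which the C extension provides):

```shell
# Assumed ISA string for the extension set listed above; verify against
# your toolchain's documentation before relying on it.
MARCH=rv32imc_zicsr_zba_zbb_zbc_zbkb_zbkc_zbkx_zcb
echo "compile with: riscv-none-elf-gcc -O3 -march=$MARCH -mabi=ilp32 ..."
```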
Now, compiled with O3-level optimization and the extensions mentioned above, the algorithm runs at 0.526 CoreMark/MHz. It IS an improvement, but I’m still left wondering what else I can do to increase performance. Apart from the MMU, I don’t see anything else I could enable in the project, and I doubt it would be a fair trade-off between performance and LUT usage anyway.