Recently I completed development of the reference GPU miner for Dynamo. This was a ground-up development for a custom script-based hash algorithm. I didn’t have any code to fork from or even review; this was starting from zero. I also had no OpenCL or GPU compute experience, although I did have extensive OpenGL exposure, which turned out to be really useful because OpenCL and OpenGL share many of the same concepts.
The reference code for the Dynamo GPU miner can be found here: https://github.com/dynamofoundation/dyn_miner This is a Visual Studio solution that builds out of the box: simply clone and build. You will need curl and OpenCL, both of which can be installed with vcpkg. If you need help building, check out our Discord or Telegram, which can be accessed at https://dynamocoin.org
I initially thought writing a GPU miner was basically impossible, a task relegated to super coders. As it turns out, it’s relatively simple once you understand a few key concepts.
- The OpenCL compiler is peculiar. It has many small quirks that can waste a lot of time. Your only defense is printf, so use it liberally to debug what’s going on. The drawback I hit with printf in OpenCL on Windows is that it adds a newline to every output, so you have to batch up your data into a single call to see anything meaningful.
- Program in parallel if possible. Because Dynamo is purpose-built as a script-based procedural hash language, parallelizing the hash itself was not possible. Your algo may be different, in which case you want to use get_global_id() and get_local_id() to break up your work.
- Compiler errors are very funky. If you miss a curly brace, the compiler reports an error at the end of the file, not where the brace is missing. Comment out portions of the code and recompile until you track down the offending line. Also, a compile failure is reported only as an error code; you need to dump the build log (clGetProgramBuildInfo with CL_PROGRAM_BUILD_LOG) to see the actual errors and line numbers. The code in the reference miner shows how to do this.
- Don’t size your work by the number of compute units. For my test I used an RX580, which has 36 compute units. When I ran the miner with a global work size of 36 it was OK, but with a global work size of 1,000 it was much faster. The global work size parameter of clEnqueueNDRangeKernel is the total number of work-items to run, not the number of compute units, and each compute unit can run many work-items at once, so a value of 36 leaves most of the GPU idle. I tried many different values, and 1,000 to 2,000 seemed to give the best performance. Try different values to see what works for your algo.
- Local or private memory made no difference for me. I tried several rounds of optimization by copying data from __global buffers into __local or __private memory. Nothing resulted in any performance increase. This may have been due to the highly deterministic nature of my algo.
- Type conversions are silent. If you assign a char to a uint you will end up with a mess. The compiler applies C’s implicit conversions without complaint, and because char is signed the value is sign-extended, so char 0x89 turns into 0xFFFFFF89 as a uint. Cast through uchar first to avoid this. OpenCL C also has no memcpy, so you need to write your own, along with your own big-endian/little-endian conversions. Again, see the reference code for examples of how to do this.
- Porting from Windows to Linux is (almost) trivial. Once you sort out the library includes and linking, OpenCL works out of the box on Ubuntu. I ported my Windows GPU miner to Ubuntu in a few hours and had it running under HiveOS (Ubuntu 18) with basically no changes.
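To make the point about char-to-uint conversion concrete, here is a minimal sketch in plain host-side C. It is not taken from the reference miner, and every function name in it is hypothetical. It uses the stdint types; inside a kernel you would write uchar and uint instead, but the conversion rules are the same.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not the reference miner's code. */

/* The trap: widening a signed 8-bit value sign-extends it,
   so 0x89 becomes 0xFFFFFF89. */
uint32_t naive_widen(signed char c) {
    return (uint32_t)c;
}

/* The fix: go through an unsigned 8-bit type first, so the
   upper 24 bits stay zero. */
uint32_t byte_to_u32(char c) {
    return (uint32_t)(unsigned char)c;
}

/* Byte-wise copy, standing in for the memcpy that OpenCL C lacks. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}

/* Read 4 bytes as a big-endian 32-bit value (first byte is most significant). */
uint32_t load_be32(const unsigned char *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read 4 bytes as a little-endian 32-bit value (first byte is least significant). */
uint32_t load_le32(const unsigned char *p) {
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}

/* Write a 32-bit value as 4 big-endian bytes. */
void store_be32(unsigned char *p, uint32_t v) {
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

Because the load and store helpers work on unsigned bytes and shift explicitly, they give the same result regardless of the byte order of the device they run on, which is usually what you want when a hash specification fixes the endianness.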
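For the bullets on get_global_id() and global work size, one common way to split a miner's work is to give every work-item its own nonce. The sketch below simulates that index arithmetic on the CPU in plain C so it can be checked without a GPU. It is an assumption about how you might partition the work, not a description of the reference miner: nonce_for and covers_range_once are hypothetical names, gid stands in for get_global_id(0), and global_size stands in for the global work size passed to clEnqueueNDRangeKernel.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: the nonce a given work-item tests on a given pass.
   In a kernel, gid would come from get_global_id(0) and global_size
   from get_global_size(0). Arithmetic wraps modulo 2^32. */
uint32_t nonce_for(uint32_t base, size_t gid, size_t global_size, uint32_t pass) {
    return base + (uint32_t)(pass * global_size + gid);
}

/* CPU stand-in for a kernel launch: walk every work-item id and every
   pass, and record which nonce offsets were visited. Returns 1 if each
   nonce in [base, base + global_size * passes) was visited exactly once,
   0 otherwise. The caller provides seen[] with at least
   global_size * passes entries. */
int covers_range_once(uint32_t base, size_t global_size, uint32_t passes,
                      unsigned char *seen) {
    size_t total = global_size * passes;
    for (size_t i = 0; i < total; i++) {
        seen[i] = 0;
    }
    for (size_t gid = 0; gid < global_size; gid++) {
        for (uint32_t pass = 0; pass < passes; pass++) {
            uint32_t off = nonce_for(base, gid, global_size, pass) - base;
            if (off >= total) {
                return 0;  /* nonce fell outside the expected range */
            }
            seen[off]++;
        }
    }
    for (size_t i = 0; i < total; i++) {
        if (seen[i] != 1) {
            return 0;  /* a nonce was skipped or tested twice */
        }
    }
    return 1;
}
```

With a global work size of 1,000 and 4 passes per work-item, this scheme covers 4,000 consecutive nonces with no overlap. The host would then advance base by 4,000 before the next launch.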
In summary, you can write your own GPU miner. This is actually achievable by mere mortals in a reasonable timeframe (mine took about 20 hours). It’s a great educational exercise. Feel free to drop into our Discord or Telegram if you want help building your own.