CUDA-enabled flam3
Edit: Watch svn for progress (which so far hasn't been updated: between ACM ICPC regionals, starting my thesis, my new graphics card dying (there goes the testing) and regular school, I've been overcommitted. I will come back to this, though, as the flam3 project is still somewhat related to my thesis and I've put a lot of effort into the code I have already).
As part of my preparation for a thesis, I need to get familiar with nVidia's CUDA technology (this is their implementation of a stream-processing language), and I want to do so by producing open-source code. I've always loved Electric Sheep, but found the resolution to be anemic on larger monitors, so producing a GPU-accelerated version of flam3 capable of locally rendering very-high-resolution sheep in "reasonable time" (eventually reaching real-time as graphics cards continue to grow exponentially faster) would be fantastic.
Profiling indicates that a considerable amount of time is spent in rect.c, which of course makes sense. The inner loops of rect.c and apply_xform in flam3.c seem particularly ripe for the kind of multithreading required to take advantage of GPU acceleration. While I'm hesitant to go into detail until I am more familiar with the code, I believe that it's possible to reorganize those areas of code so that CUDA support would require few changes to the rest of the library and applications.
So here's the question. Assuming that I write clean code which yields a major speedup and does not require significant changes outside of rect.c (but would result in a major reorganization of render_rectangle()), and that all code requiring the CUDA libraries is ifdef'd so as not to add a dependency for users who do not wish to have the CUDA libraries available, would it be included in the official distribution?
(Also, is there anything special about the isaac rng, or can it be replaced by a more CUDA-friendly algorithm if necessary?)
Update
I'm currently heavily over-committed in real life, and unfortunately this project has taken a back-seat for a while. I do expect to finish it, bring it up to speed with trunk, and maintain it for a while, and - if the more challenging thesis idea I had turns out to be practical - eventually port it to a new, more elegant compiler architecture for increased maintainability. (I'm also hoping to rewrite a portion of the algorithm to take advantage of some features on 200-series NVIDIA GPUs which could provide an optional 50x improvement over normal GPU rendering at the cost of some deviation from the official algorithm, and possibly move towards real-time full-quality rendering in 1080p, but that's contingent on me actually picking up one of those cards (quite pricey).)
In any case, I'm going to upload the code I have, which is bitterly untested but at least is *conceptually* working, to SVN as soon as I get home. Keldor314's parallel approach seems quite interesting, and it looks like the goal of real-time HD sheep is well within the community's grasp over the next few generations of GPUs.
My First CUDA Release
While srobertson was working on a CUDA port, I've been busy stripping down my old DX10 renderer (which never really worked) and coding my own CUDA version. Presently I get a speedup of around 20x or so with my GTX 280 over my dual core cpu in apophysis. It presently will handle most of the 50 or so xforms in the official spec, though one or two of them are currently producing clearly wrong results. Right now, the feature set is limited to the raw basics - i.e. motion blur, post-transformations, final transformations, symmetry xforms, density estimation, and so forth are not supported. Also, the documentation is unclear in several places, so my implementation of palette and post processing is heavily based on guesswork.
In any case, I'm going to try to have the source and binary uploaded sometime later today, as soon as I finish tying down some (more) loose ends. Keep an eye out here:
https://sourceforge.net/projects/flam4/
It's online!
Here's the main directory for the code:
http://flam4.svn.sourceforge.net/viewvc/flam4/flam4CUDA/
The executable should be in the release directory. Make sure you also grab all the .dlls there with it.
The repository is a mess now, and I'd recommend that you not download the source until I clean it up (darn 25 MB .ncb file...). I'll have to do this later, since there's a nasty looking thunderstorm rolling in, so I should get off the computer.
Anyway, enjoy!
ATI
Great job! I see you managed to get the .cu compiler to cooperate - that'll certainly be a lot easier to maintain than the raw PTX solution.
A super-broken, not SVN-worthy, but still marginally informative version of my old code can now be found here. The meat of my work is in iter_kernel.ptx. Check out mybuild.sh for hints, and note that you'll have to regenerate the prime multipliers before anything works (see genprimes.py).
I say old code because I'm contemplating an ATI solution. My school is very much an ATI shop, so it will be easier to do a thesis on an ATI GPU, and since my graphics card blew a few caps and is now wildly unstable (typing this on a laptop, because my desktop crashed twice while writing it), I need to get a new card anyway. I'd certainly like to avoid duplicated effort, although hopefully the knowledge gained from pursuing both architectures can be shared.
You might look into a DirectX 11 solution. They just released the tech preview a few days ago, and the compute shader is very similar to something like CUDA. I'm not sure how practical it would be to go with ATI at the moment, since their API has been rather in flux since they deprecated CTM.
I'm getting a cudart.dll missing error. I've installed the CUDA driver, but not the devkit. This could be a bug - just wanted to report it.
Cudart.dll
Strange that that dll never caused me problems, whilst the other four did e.e I guess I had it in my path properly.
Anyway, you can certainly find the dll in the devkit somewhere, and probably with the driver for that matter. Easiest thing to do would be to just copy the thing into the same directory as the executable.
Right at the moment, I'm having hard drive troubles, so my dev computer is off limits until I get the replacement (tomorrow, hopefully), but I'll see to getting cudart.dll up with flam4 as soon as I can, as well as getting the download link from Sourceforge pointing to a current version. (presently, the link still points to the DirectX version from months ago!)
First benchmark
Still testing and adding variations, but unless I missed something, this should provide a pretty accurate reflection of the performance speedup.
Hardware:
Intel Core 2 Duo 6400 @ 2.36 GHz, 2MB cache, 4GB RAM
NVIDIA GeForce 8800 GS with 12 SPUs, core at 550MHz, 384 MB RAM at 800 MHz
Test flame: modified version of test.flam3 included in source (not the one currently in SVN, tho)
CPU rendering time at '33' bits: 86 seconds.
GPU rendering time: 678 milliseconds.
I'm okay with that.
SVN commit coming soon.
sounds great, can't wait to see it!
did you really bypass CUDA?
The CUDA C++ compiler, yes. But part of CUDA lives in the driver, and I am still using that bit.
Snowed under with work for a bit. Gonna try and finish up the first few variations and update svn soon. Also there's a chance that the above benchmarks - which are in many ways best-case - may be overly optimistic as compared to average-case. I'll have to investigate further (soon!).
success
Much more to do - at the moment it runs slower than CPU - but hey it works. From here it's only a matter of optimization.
congrats, that's very exciting to see!
Coming along...
I'll keep track of my progress here, in case anybody's interested.
I'm currently reorganizing the algorithms to enable GPU acceleration. Essentially I'm just repartitioning the code, because it was originally designed to best accommodate the changing precisions, but to enable acceleration it must also accept changing data structures, so code that once would logically be adjacent now belongs in separate functions.
The good news so far is that the necessary modifications to the data structures have enabled a few optimizations which should significantly reduce cache thrash on both GPU and CPU computing. An informal test with a very large flame (1920x1200, 3x supersampling, 200 sample density - that's 1.2 GB of data during computation, kids) knocked 12% off of the time, clocking in at 186 seconds/frame as opposed to 213 on CPU rendering. Still a long way away from GPU benchmarking, but it's drawing closer.
encouragement+++
Sounds great! The quality and variety of sheep in the new enlarged format is wonderful, the extra processing power of GPUs might encourage spot to enlarge the sheep size even further! Thank you for your efforts.
Very interested ...
in your overall progress. How far have you gotten so far? I was also thinking of porting Electric Sheep to CUDA as part of my thesis, but you were faster :-).
I'm not very familiar with Electric Sheep and its internals, but I think the base algorithm (IFS) should work very well on a GPU. Are you aware of any other projects that use IFS and are as visually appealing as Electric Sheep?
CUDA has many compiler issues.
And a lot of the tricks that I needed to pull to get things to run optimally are simply impossible in CUDA's variant of the C language. So I'm hand-coding the whole thing in PTX assembler. Understandably, this has slowed my progress a bit. ;) The good news is that the code is well within the operational limits it needed for highest efficiency - it uses fewer than 21 registers and 5k shared memory, meaning that three concurrent 256*2-thread blocks can run per SPU, enabling high performance without many compromises. The most concerning thing right now is whether the single-precision floating point math is up to the task. I think it is - the upper limit of image resolution set by GPU framebuffer size is well below the resolution of an f32's 24-bit mantissa - but we'll have to see.
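To put a rough number on the single-precision worry, here is a minimal sketch (not part of the renderer) that measures how finely a float32 can resolve positions at a given coordinate magnitude. The helper names `to_f32` and `f32_ulp` are made up for illustration, and the 1920-bucket width with 3x supersampling is just borrowed from figures quoted elsewhere in this thread; it says nothing about error accumulating over many iterations.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def f32_ulp(x):
    """Spacing between adjacent normal float32 values near |x| (x must be nonzero)."""
    _, exponent = math.frexp(abs(x))  # abs(x) == m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 24)     # float32 carries 24 significant bits


# Hypothetical worst case: a coordinate expressed directly in bucket units,
# at the far edge of a 1920-wide image with 3x supersampling.
far_edge = 1920 * 3
steps_per_bucket = 1.0 / f32_ulp(far_edge)

print("float32 spacing near", far_edge, "is", f32_ulp(far_edge))
print("distinct positions per bucket:", steps_per_bucket)

# Integers stop being exactly representable only past 2**24.
print("2**24 + 1 rounds to", to_f32(2.0 ** 24 + 1))
```

Under those assumptions a bucket at the far edge still gets a couple of thousand distinct sub-positions, which is consistent with the hunch above that resolution alone is not the limiting factor.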
I'm done with all the IFS code, except for some of the variations (which I just got bored of doing and put off until the simpler ones are tested working - hand-coding trig approximations is really really boring). I'm about half-way through the C host-side wrapper, which is more complicated than one might expect because in order to make everything fit in RAM I chose to convert the genomes to a stack-based, design-by-contract affair that is exceptionally hard to debug (any errors hard-lock the system. thanks, nvidia). After that has been tested and the vars finished I'll post code and performance figures.
a source comment
Verifying the integrity of the ported RNG and IFS. No hard numbers yet, still running in an emulator to perform debugging. I need to finish the accumulator and spatial filtering, then I'll post source and get to work on density estimation. My source tree has diverged enormously from SVN, so if a merge happens, it won't happen immediately.
Here's a source comment on memory allocation, thread geometry, and other tricky things, like the new "shuf_width" runtime parameter:
/* The geometry can be a little confusing. The smallest unit of execution in CUDA is a "warp";
* one warp is 32 threads. There's a very heavy penalty for branching independently, so we choose
* which transform to apply on a warp-by-warp basis. However, since the IFS algorithm is designed
* to achieve convergence on a solution, 32 random points undergoing the same transformations would
* quickly converge on a single pixel. No good.
*
* We work around this by having each warp operate on a matrix of values. The matrix is
* 32*shuf_width in size. The warp picks a random transform and runs it on each row of the matrix,
* then stores the results. After that, 32 random integers in the range [0, shuf_width) are picked,
* and each column is wrap-around shifted[1] by the respective number of places. The process then
* repeats.
*
* It is expected that two points will be in the same row once every shuf_width shuffles (another way
* of saying it is that any given row can be expected to contain 32 / shuf_width points that were in
* the same row on the previous pass). A good rule of thumb is to make 1 / shuf_width smaller than
* the smallest transformation density ("weight") in a genome; this is usually true for the default
* shuf_width of 32.
*
* Naturally, the enemy of cranking shuf_width up is memory consumption. Memory latency on CUDA is lethal
* to performance, but that can be worked around as long as enough warps are ready to compute on each
* stream processing unit while other threads wait for memory access to complete. The iterator function
* has been carefully constructed to consume no more than 42 registers, and the other functions use
* less than that. Each function is run in a block of 128 threads, and each SPU has 16384 registers
* in current models, meaning we can fit three blocks of 128 threads on a SPU (16384 / 128 / 42 ~= 3).
* Each warp needs its own matrix, and each element is 12 bytes long. On my GPU, with 12 SPUs, using
* the default shuf_width of 32, that works out to be
*
* (12 bytes/element) * (32 * 32 elements/matrix) * (1 matrix/warp) * (4 warps/block) *
* (3 blocks/SPU) * (12 SPUs) = 5308416 bytes (5.06 MB).
*
* Each thread also gets its own random context (which is persistent across the life of the application
* to avoid costly reinitializations). This currently works out to be around 96 bytes/thread.
*
* A maximum of 128 independent blocks are supported, which should be pretty compatible for now; the
* fastest CUDA GPU available can run 90 simultaneously per the above. "Independent" blocks must be
* given separate control points; since most flames usually use 50 or so temporal samples, the code
* will pair up blocks to operate on the same control point at a small loss of efficiency in order
* to ensure full coverage of the GPU's stream processors. This is all autodetected.
*
* One important thing to note is that, since the entire process runs on GPU, you must have enough
* free device memory to hold all of the above, plus the per-pass and final accumulators. (Final
* filter-down is also performed on the GPU, but we can drop the per-pass accumulator and replace
* it with the smaller destination buffer, so that's not an issue.) Each of these buffers is
*
* (4 bytes/channel) * (4 channels/bucket) * (width * height * oversampling^2 buckets)
*
* meaning that rendering an image at 1920x1200 with 2x oversampling requires 288 MB, which is quite
* reasonable, although if super-high-res flames are needed, it should be fairly easy to implement
* a feature that splits the image into strips and renders each separately. It'll do a complete
* computation, so it will scale purely linearly with buffer depth, but it should still be orders of
* magnitude faster than falling back to CPU.
*
* [1] Using addressing magic, not in-place swapping or anything else equally slow.
* */
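For anyone trying to follow the comment, here is a small host-side model of the column shuffle and the memory arithmetic. It is only a sketch in plain Python, not the PTX code: the function names are invented for illustration, and the index mapping merely stands in for the "addressing magic" of footnote [1].

```python
import random

WARP_SIZE = 32  # threads per warp, as stated in the comment


def random_shifts(shuf_width, rng=random):
    """Pick one shift in [0, shuf_width) for each of the 32 columns."""
    return [rng.randrange(shuf_width) for _ in range(WARP_SIZE)]


def shuffle_columns(matrix, shifts):
    """Wrap-around shift each column down by its own number of places.

    matrix is a list of shuf_width rows; each row holds one point per thread.
    The point at (r, c) ends up at ((r + shifts[c]) % rows, c). Nothing is
    swapped in place; the result is read through a remapped row index.
    """
    rows = len(matrix)
    cols = len(matrix[0])
    return [[matrix[(r - shifts[c]) % rows][c] for c in range(cols)]
            for r in range(rows)]


def matrix_bytes(shuf_width=32, bytes_per_element=12, warps_per_block=4,
                 blocks_per_spu=3, spus=12):
    """Multiply out the factors exactly as they are listed in the comment."""
    return (bytes_per_element * WARP_SIZE * shuf_width
            * warps_per_block * blocks_per_spu * spus)


def accumulator_bytes(width, height, oversampling,
                      channels=4, bytes_per_channel=4):
    """Size of one accumulator buffer, using the comment's formula."""
    return bytes_per_channel * channels * width * height * oversampling ** 2


# Tiny 3-row, 4-column demonstration with fixed shifts.
demo = [[(r, c) for c in range(4)] for r in range(3)]
shuffled = shuffle_columns(demo, [0, 1, 2, 0])

print(matrix_bytes())                    # matrix storage for the defaults
print(accumulator_bytes(1920, 1200, 2))  # one buffer at 1920x1200, 2x
```

Multiplying the factors as listed gives 1,769,472 bytes, a third of the 5,308,416 quoted in the comment, so the quoted total may include a factor that is not spelled out there. The accumulator formula gives 147,456,000 bytes per buffer at 1920x1200 with 2x oversampling, so the per-pass and final buffers together come to roughly the 288 MB mentioned.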
i am very interested in GPU accelerated flames, and i would be happy to assist you in this direction. i can't say i would or would not include it in the official distribution without knowing a lot more but i am certainly open to the idea. please contact me by email.
