The amazing Gregorio Naçu posted the article title graphic this week to bring attention to the venerable 6502 processor and poke fun at Apple’s M2 chip marketing slides. He’s doing probably the most ambitious single-person Commodore 64 project I know of and has a fantastic blog.
Apple claims the new M2 chip has the following specs.
We all know that these numbers are probably a little fluffy. Maybe a lot fluffy, and in practical applications, they are probably pretty far off. Benchmarking in a lab is fine, but the numbers rarely reflect real-world performance.
Tom’s Hardware does an excellent breakdown of this new chip. It does look pretty neato!
How fast is a 6502?
After Gregorio posted this image earlier this week, it sparked a fair amount of discussion on the interwebs about the memory transfer speed of a 6502 processor.
The 6502 on Commodore machines shares the clock with the video chip. Since dual-ported RAM wasn’t financially feasible at the time, they chose a memory access trick that lets both the video chip and the processor access memory during a single clock cycle. I think it’s the same on most Commodores, but on the VIC-20, the processor accesses memory on the low part of the signal and the VIC chip on the high part. Maybe that’s backward… anyhoo, you get the point.
Memory at 1MB per second
Going back to the slide, this 1MB per second memory bandwidth figure is what folks are questioning.
On every clock cycle, the 6502 accesses the bus somewhere… the stack, zero page, the instruction stream, other memory locations, etc. So at 1 MHz, typical for Commodore machines, this 1MB per second bandwidth is probably accurate in a vacuum, where marketing people hang out.
It’s important to note that Gregorio Naçu’s slide was a parody and was never intended to be accurate down to hard numbers. Please remember that because if you don’t, the rest of this discussion will ruffle your feathers.
We’ll try some memory transfers to get an idea of what actual transfer speeds might look like using standard Commodore hardware. Other 6502-based platforms might be faster or slower, so I encourage you to try some tests of your own, and please let me know what you find.
Again, remember that transferring memory takes more clock cycles than just reading or writing…
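To put a rough ceiling on things before measuring anything, here is a back-of-the-envelope sketch. It assumes a nominal 1 MHz clock and one bus access per cycle, and it ignores instruction fetches entirely, so no real copy loop can reach the second number. The function name is made up for illustration.

```python
def peak_bytes_per_second(clock_hz, bus_cycles_per_byte):
    """Upper bound on throughput if every cycle moved data (it never does)."""
    return clock_hz / bus_cycles_per_byte

CLOCK_HZ = 1_000_000  # assumption: nominal 1 MHz, as on the slide

# Reading only: one bus access per byte.
read_only = peak_bytes_per_second(CLOCK_HZ, 1)

# Copying: every byte needs at least one read and one write.
ideal_copy = peak_bytes_per_second(CLOCK_HZ, 2)

print(f"read-only ceiling: {read_only:,.0f} bytes/s")
print(f"copy ceiling:      {ideal_copy:,.0f} bytes/s")
```

So even a copy with zero instruction overhead would top out at half the headline figure.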
Let’s give this a go on the most popular 6502-based system of all time, the Commodore 64.
We’ll take a cue right from the venerable Rodnay Zaks.
Incidentally, Robin did a long video fixing this book’s implementation bug. I’ll be using the revised version as I think it’s a well-established example of doing a real-world block transfer. Sure there may be faster ways, but this is a realistic way, which is what we’re going for.
You can read this excellent chapter on how this works, and Robin’s video goes into it in great detail. Here’s what we’re going to do:
source = $0800
dest   = $4800
len    = $4000
from   = $fb
to     = $fd
tmpx   = $a6

copyr .block
    lda #
We can count jiffies on a Commodore to give us an idea of how long this copy takes. Sure there's a slight overhead in the setup, but I think it's marginal enough that we can ignore it for our purposes.
Okay, that's pretty fast. Since that's 16k transferred, it works out to about 54.6 k per second.
Let's do a bunch of them and see what it comes out as.
We can call this pretty quickly 255 times and do the same math.
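    lda #$00        ; zero the three-byte jiffy clock
    sta $a2
    sta $a1
    sta $a0
    ldx #255        ; number of copies to run
    stx tmpx
lp  jsr copyr       ; copy one 16K block
    dec tmpx
    ldx tmpx
    bne lp          ; loop until the counter hits zero
    lda $a0         ; print the jiffy clock, high byte first
    jsr printbyte
    lda $a1
    jsr printbyte
    lda $a2
    jsr printbyte
So at $1128 jiffies (4392 decimal, or about 73 seconds) and 255 transfers of 16,384 bytes, we're seeing around 57K per second.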
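For anyone who wants to check the math, here is a small sketch of the jiffy arithmetic. It assumes 60 jiffies per second and that the three bytes printed above ($a0, $a1, $a2) are high, middle, and low in that order. The function names are my own, not anything from the KERNAL.

```python
JIFFIES_PER_SECOND = 60  # assumption: the jiffy clock ticks 60 times a second

def jiffies_from_bytes(a0, a1, a2):
    """Combine the three clock bytes, $a0 being the most significant."""
    return (a0 << 16) | (a1 << 8) | a2

def bytes_per_second(total_bytes, jiffies):
    """Throughput for a transfer that took the given number of jiffies."""
    return total_bytes / (jiffies / JIFFIES_PER_SECOND)

jiffies = jiffies_from_bytes(0x00, 0x11, 0x28)  # the $1128 reading
total_bytes = 255 * 16384                        # 255 copies of 16K
rate = bytes_per_second(total_bytes, jiffies)

print(f"{jiffies} jiffies = {jiffies / JIFFIES_PER_SECOND:.1f} s")
print(f"{rate:,.0f} bytes per second")
```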
Grain of salt, yes, but real-world enough.
Yeah, there's some overhead in the setup and running of the transfer. We could probably make this loop a few percentage points faster. Maybe if we make it tight, we could get 15% better out of it. But the point was real-world uses, and this is a pretty good example of a tight but flexible loop to transfer. Let's not get TOO pedantic here.
What's important to note is that transferring memory takes several clock cycles per byte. If we count them, it's about a dozen cycles, which tracks roughly with our results.
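Here is a rough sketch of that cycle counting. The routine isn't reproduced in full above, so the inner loop below is an assumption: an indirect-indexed load and store plus a counter increment and a branch, which is the usual shape for this kind of block move. The per-instruction cycle counts are the documented 6502 timings, ignoring page-crossing penalties.

```python
# Documented 6502 cycle counts for the ASSUMED inner loop (not the exact routine).
LDA_IND_Y = 5   # lda (from),y  with no page crossing
STA_IND_Y = 6   # sta (to),y
INY = 2         # bump the index
BNE_TAKEN = 3   # branch back, same page

loop_cycles_per_byte = LDA_IND_Y + STA_IND_Y + INY + BNE_TAKEN

# What the C64 measurement implies, using the 1.023 MHz NTSC clock figure.
C64_NTSC_HZ = 1_023_000
measured_seconds = 4392 / 60
measured_bytes = 255 * 16384
measured_cycles_per_byte = C64_NTSC_HZ * measured_seconds / measured_bytes

print(f"counted:  {loop_cycles_per_byte} cycles per byte")
print(f"measured: {measured_cycles_per_byte:.1f} cycles per byte")
```

Under those assumptions the count lands a little north of a dozen, and the measured figure is a bit higher again, presumably from the outer loop, interrupts, and cycles the VIC-II steals.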
The KIM-1 is arguably the most simple and pure 6502 platform, so it will be interesting to try and do memory transfers on it.
It IS clocked a little slower than a Commodore 64, so I expect it to transfer slightly slower. But it doesn't lose any cycles to VIC-II "badlines," so maybe it'll be pretty close.
Let's find out.
I don't own a "real" KIM-1, but I do own what are considered the two best clones. Today, let's use the Corsham KIM-1 Clone. I'm going to call it a KIM-1 from here forward, mostly because I enjoy getting angry letters about this. You've been warned.
The KIM-1 doesn't have a jiffy clock like the other Commodore machines.
The "Application ports" are easily accessible, so if we set a pin high when we start and set it low again when we finish, we can easily use an oscilloscope to measure the time.
With the expansion bus hooked up on my Corsham KIM board, the Application port A direction is set to output by writing $FF to $1603.
And then, we can toggle pin PA0 by setting it high or low. We'll use $FF and $00 for that for simplicity.
Side note: this is a non-standard location for this port, your KIM-1 or clone probably has it in the $1700 range. Check your documentation.
16k in 262 milliseconds is around 62.5k per second. That's slightly faster than a Commodore 64, even though an NTSC Commodore 64 runs at a slightly higher clock speed (1.023MHz) than our KIM here.
Let's do this 255 times in a tight loop, ignoring the overhead of things like JSR, which takes a few clock cycles each loop. We're going for a ballpark here.
So our loop code then looks something like
    lda #$ff
    sta $1603
    sta $1601       ;technically setting all pins high here
                    ;could just use #$01
    ldx #255
    stx tmpx
lp  jsr copyr
    dec tmpx
    ldx tmpx
    bne lp
    lda #$00
    sta $1601
    brk
Then if we probe it with an oscilloscope, we can measure the 1+ minute square wave.
So 255 transfers of 16,384 bytes take 67 seconds. Or about 62k per second.
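The scope readings convert to throughput the same way. This sketch assumes the KIM clone runs at a nominal 1 MHz (I haven't measured the clock itself) and uses the two timings above. The helper name is made up.

```python
BLOCK_BYTES = 16384
KIM_CLOCK_HZ = 1_000_000  # assumption: nominal 1 MHz clock

def bytes_per_second(total_bytes, seconds):
    """Throughput from a pulse width measured on the scope."""
    return total_bytes / seconds

single_rate = bytes_per_second(BLOCK_BYTES, 0.262)      # one 16K copy
batch_rate = bytes_per_second(255 * BLOCK_BYTES, 67.0)  # 255 copies

# Clock cycles spent per byte over the long run.
cycles_per_byte = KIM_CLOCK_HZ / batch_rate

print(f"single copy: {single_rate:,.0f} bytes/s")
print(f"255 copies:  {batch_rate:,.0f} bytes/s")
print(f"about {cycles_per_byte:.1f} cycles per byte")
```

With nothing else competing for the bus, the long run works out to almost exactly 16 cycles per byte.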
I happen to have a Cerberus 2080 board. As far as I know, mine is the only green one in the world.
This has dual-ported RAM and can clock the brand new (yes, they still make them) WDC 65C02S processor at a blazing 8MHz. Let's see what kind of results we get from it.
Again, we have the no-jiffy-clock problem, so I'm going to skip right to the 4MB transfer, time it from a video capture, and have it show "done" on the screen when it finishes. Unlike the KIM-1, I don't have a straightforward way to time it with an I/O pin. It'll give us a good enough idea of where we are.
16,384 bytes 255 times took 6.29 seconds, so maxed out, a modern 6502 at 8MHz can do about 664.2k per second. Not too bad!
Sure, this was not a comprehensive set of tests. But in the real world, a 6502 can copy the entire contents of a Commodore 64's memory from one place to another in about a second. Pretty respectable, and it was pretty fast for the time.
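As a sanity check on the "about a second" claim, this sketch turns the three long-run measurements into the time needed to copy 64K. The timings are the ones reported above. The dictionary layout and names are just for illustration.

```python
C64_MEMORY_BYTES = 64 * 1024
TOTAL_BYTES = 255 * 16384  # every long-run test moved the same amount

# Platform name -> seconds taken for the 255-copy run.
run_seconds = {
    "Commodore 64": 4392 / 60,  # $1128 jiffies
    "KIM-1 clone": 67.0,        # oscilloscope
    "Cerberus 2080": 6.29,      # video capture
}

def seconds_to_copy_64k(seconds_for_run):
    """Scale the measured run down to a single 64K copy."""
    rate = TOTAL_BYTES / seconds_for_run
    return C64_MEMORY_BYTES / rate

copy_times = {name: seconds_to_copy_64k(s) for name, s in run_seconds.items()}

for name, t in copy_times.items():
    print(f"{name:14s} {t:.2f} s to copy 64K")
```

The two 1 MHz machines come in a touch over a second, and the 8MHz board needs roughly a tenth of that.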
You could certainly use self-modifying code and unroll this copy routine to get better performance, at the price of flexibility and, arguably, readability for the average casual 6502 assembly coder.
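To get a feel for what unrolling might buy, here is a cycle estimate for one hypothetical approach, which I have not built or timed: a run of absolute,X load and store pairs, each pair covering a different page, with a single INX and BNE per pass. Page-crossing penalties are ignored, which assumes page-aligned buffers.

```python
# Documented 6502 cycle counts; the unrolled routine itself is hypothetical.
LDA_ABS_X = 4   # lda source+n*256,x  (no page crossing)
STA_ABS_X = 5   # sta dest+n*256,x
INX = 2
BNE_TAKEN = 3

def cycles_per_byte_unrolled(pairs_per_pass):
    """Average cycles per byte when loop overhead is shared by several pairs."""
    loop_overhead = INX + BNE_TAKEN
    copy_cycles = pairs_per_pass * (LDA_ABS_X + STA_ABS_X)
    return (copy_cycles + loop_overhead) / pairs_per_pass

for pairs in (1, 8, 64):  # 64 pairs would cover a full 16K, one per page
    cpb = cycles_per_byte_unrolled(pairs)
    print(f"{pairs:2d} pairs per pass: {cpb:.2f} cycles per byte")
```

The estimate flattens out just above nine cycles per byte, and the price is a routine hard-wired to fixed addresses and a fixed length.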
Again, this was not a "how fast can we absolutely make it" but an everyday use examination.
This copy can handle from one to 2^16 bytes and every number in between. And as my favorite YouTuber is fond of saying, "I know I know, but I didn't do that. Let the angry emails begin."
If you have an REU on your Commodore, that can theoretically swap out the memory at a byte per clock cycle. A true 1MB per second. I heard that games like Sam's Journey make use of this feature quite a bit.
I'd love to hear your thoughts on how you'd approach this, pedantic, nit-picky, and otherwise. Bonus points if you demonstrate methods that show dramatically better results.
Whatever you do, be sure to have fun and don't take marketing slides too seriously.