How Fast Can I Transfer Data?
In a HERON system that uses HEART to transport data between the nodes, there are many factors that affect the speed at which data can flow.
Consider a simple transfer between two nodes using a FIFO connection:
If the Receiver can accept data faster than the Transmitter generates it, then we can expect the data to be transferred without interruption, as long as the FIFO itself can support that data rate. The Receiver should not read the FIFO when it is empty, so only valid data will be received.
With HEART (as on the HEPC9) the FIFO connection is a Virtual FIFO. HEART makes this connection using two real FIFOs (implemented in FPGAs), connected together using a time-slotted ring. The FIFOs are 32 bits wide and use a 100MHz clock, so the module can transfer data to and from the FIFO at a peak rate of 400Mbytes/sec (100Mwords/sec). Each HEART timeslot carries 66Mbytes/sec.
So the speed that the Virtual FIFO connection can support is limited by the
number of timeslots selected for it.
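For example, a Virtual FIFO connection that has been allocated two timeslots can sustain 2 x 66Mbytes/sec = 132Mbytes/sec; the HERON2 measurements later in this note show exactly that figure.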
What happens if the transmitter generates data faster than can be accepted?
The Transmitter uses the Full flag on the FIFO to detect when the FIFO cannot accept
any more data. In most cases the Transmitter can simply wait for the FIFO to be
ready, so the transfer of data is limited by the slowest element.
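As a minimal sketch of that behaviour (fifo_full() and fifo_write() are hypothetical helpers standing in for whatever flag and data access mechanism your design uses, be that an FPGA signal, a status register or a HERON-API call):

    /* Illustrative only: fifo_full() and fifo_write() are hypothetical helpers. */
    extern int  fifo_full(void);                 /* non-zero while the FIFO is full */
    extern void fifo_write(unsigned int word);   /* write one 32-bit word           */

    void send_block(const unsigned int *data, int count)
    {
        int i;
        for (i = 0; i < count; i++) {
            while (fifo_full())
                ;                    /* wait: the Full flag provides the back-pressure */
            fifo_write(data[i]);     /* only write when the FIFO has space             */
        }
    }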
However, some devices cannot wait. An A/D, for example, has to take a sample on every
sample clock, and probably doesn't have any storage associated with it. In
that case data will be lost, so it is important that your system never
reaches that situation.
So the data transfer rate that can be achieved on a particular connection is limited by the slowest part of that connection.
Peak data rates are not the same as sustained data rates, though. Any transfer has some overhead, so it will not be able to sustain the maximum rate.
FPGA-based modules
In the case of an FPGA based module, the input FIFO connection and the output FIFO connection are connected directly to the FPGA. This means the FPGA has direct control over the transfer of data, and can sustain the maximum transfer rate in and out at the same time, if your FPGA design allows that.
Of course the rates at which you can transfer data depend on the frequency of the
FIFO clock that is generated by your FPGA design. Normally this will be the
full 100MHz, but there can be cases where your FIFO design is simpler if you
choose a lower clock rate. If so, you must be aware that this
reduces the data rates accordingly.
Normally the HERON FIFOs will be accessed using the Hardware Interface Layer (HIL) VHDL
that is supplied by HUNT ENGINEERING.
When accessing a single FIFO (in or out) the HIL allows the full 400Mbytes/sec to be
sustained into and out of the FPGA (at the same time).
When several FIFOs are transferring data at the same time, the HIL needs to switch
accesses from one FIFO to another. This switching causes a single cycle where
data cannot be transferred, reducing the total data rate that is possible.
The HIL supports different models of accessing several FIFOs. The first is to
permanently assert the data requests for the FIFOs; the HIL then uses a
round-robin method to access them. This can cause data to be transferred one
word at a time, with one dead cycle between each word, which reduces the
total rate to 200Mbytes/sec.
If you want to transfer at a higher rate using multiple FIFOs, you can use
the HIL in a different manner: cycle the data request signals
yourself, forcing the same FIFO to be accessed for a series of
cycles before the dead cycle is introduced and you start to access a
different FIFO. Using this method, total rates above 350Mbytes/sec can easily
be achieved.
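A quick calculation shows where these figures come from. The burst length of 16 words used below is only an assumption for illustration; the application note gives the real switching behaviour of the HIL.

    /* Effective HIL throughput with one dead cycle per FIFO switch,
       assuming a 100MHz, 32-bit FIFO interface (400Mbytes/sec peak). */
    #include <stdio.h>

    int main(void)
    {
        const double peak = 400.0;  /* Mbytes/sec: 100MHz x 4 bytes */

        /* Round-robin: one data word, then one dead cycle */
        double word_at_a_time = peak * 1.0 / (1.0 + 1.0);

        /* Cycling the request signals yourself: e.g. 16 words per FIFO
           before the dead cycle (16 is an illustrative choice only) */
        double burst_of_16 = peak * 16.0 / (16.0 + 1.0);

        printf("round-robin: %.0f Mbytes/sec\n", word_at_a_time);  /* 200  */
        printf("burst of 16: %.0f Mbytes/sec\n", burst_of_16);     /* ~376 */
        return 0;
    }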
Peak rate | 100Mwords/sec = 400Mbytes/sec
Sustained rate, single FIFO in one direction | 400Mbytes/sec
Sustained total rate on several FIFOs, normal method | 200Mbytes/sec
Sustained total rate on several FIFOs, cycling method | >350Mbytes/sec
Rates achieved with an FPGA-based module on a HEPC9 (FIFO clock = 100MHz).
Refer to the application note for full details of using the HIL to access multiple FIFOs.
C6000-based modules
When using a C6000 processor in a real-time system you need to transfer data into and
out of the processor at the same time as processing that data. The C6000 DSPs
have DMA engines that allow the I/O to be handled separately from the processing,
but is it really separate? Of course there are interactions between processing and I/O:
the DMA and processor share internal and external busses, and the DMAs need to
be programmed by the processor.
The HERON2 uses the XBUS of the C6203 to access the
FIFOs with a 75MHz clock. Using this separate bus allows FIFO data to be read
and memory to be written in the same clock cycle, reducing the conflict for
resources.
Consider what a particular data rate means in terms of processing speed. If you transfer data into a C6000 at 100Mbytes/sec, you have only one or two processor cycles per byte for processing. If you are handling 32-bit data items this means 8 cycles per sample, but even that needs to be a near-trivial operation coded very efficiently!
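For example, assuming a 200MHz device clock purely for illustration, 200M cycles/sec divided by 100Mbytes/sec gives 2 cycles per byte, or 8 cycles for each 32-bit word.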
HERON-API and FIFO accesses
It is usual to use HERON-API to control the DMAs and interrupts in order to correctly
access the HERON FIFOs. This library is linked into your DSP program, allowing
that program to use simple read/write calls to transfer the data.
HERON-API has a complex job to do. It must share the four DMA engines of the processor
around a possible 12 different data streams. To achieve this it maintains a list
of DMA engines that do not have transfers in progress, and another list of
transfers ready to be made.
Floating DMA
The default use of these lists is that the tasks get allocated resources in the order that they became ready, a mode that we talk about as "floating DMAs". In this mode of operation, each FIFO block is queued and allocated resources separately, so during a large transfer the DMAs will be claimed and freed inside the HERON-API many times. HERON-API uses interrupts triggered by the FIFO flags and by DMA completions to manage the queues.
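The sketch below illustrates the idea of those two lists. It is not the HERON-API source, and the names are invented for the illustration; it simply shows a pool of idle DMA engines being matched against a queue of transfers that have become ready.

    /* Conceptual sketch of floating-DMA bookkeeping (not the real HERON-API). */
    #include <stddef.h>

    #define NUM_DMA 4

    typedef struct transfer {
        int fifo_id;               /* which HERON FIFO this block uses */
        struct transfer *next;     /* next transfer waiting for a DMA  */
    } transfer_t;

    extern void start_dma(int dma, int fifo_id);    /* hypothetical: programs the engine */

    static int free_dma[NUM_DMA] = { 0, 1, 2, 3 };  /* ids of idle DMA engines    */
    static int num_free = NUM_DMA;
    static transfer_t *ready_head = NULL;           /* transfers ready to be made */

    /* Called when a FIFO flag interrupt makes a transfer ready, or when a
       DMA-complete interrupt returns an engine to the pool. */
    static void schedule(void)
    {
        while (num_free > 0 && ready_head != NULL) {
            int dma = free_dma[--num_free];    /* claim an idle engine     */
            transfer_t *t = ready_head;        /* take the oldest transfer */
            ready_head = t->next;
            start_dma(dma, t->fifo_id);
        }
    }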
HERON2-C6203 to HERON2-C6203, one timeslot, to external memory | 66Mbytes/sec
HERON2-C6203 to HERON2-C6203, two timeslots, to external memory | 132Mbytes/sec
HERON2-C6203 to HERON2-C6203, more than two timeslots, to internal memory | 132Mbytes/sec
The HERON2 can achieve 132Mbytes/sec whether it is using internal or external memory.
Dedicated DMA
Because the floating DMA technique could cause a transfer to be delayed while there is
no DMA resource available, HERON-API has an option to "dedicate" a DMA
engine to a particular transfer. Of course, using this option for all transfers
limits you to a maximum of 4 transfers, or fewer if your own program wishes to
claim a DMA engine for itself. A dedicated DMA engine remains claimed by its
transfer, so it is available the moment that transfer is ready. This is not possible for a non-dedicated
transfer, as the DMA would be returned to the pool of available
resources and re-claimed when that transfer is next on the list.
HERON2-C6203 to HERON2-C6203, one timeslot, to external memory | 66Mbytes/sec
HERON2-C6203 to HERON2-C6203, two timeslots, to external memory | 132Mbytes/sec
HERON2-C6203 to HERON2-C6203, four or more timeslots, to external memory | 210Mbytes/sec
HERON2-C6203 to HERON2-C6203, four or more timeslots, to internal memory | 213Mbytes/sec
Multiple Blocking transfers
When the DSP is transferring to or from several FIFOs there will be some interaction
between the flows, because the resources of the DSP are shared between
the transfers.
Measuring the speeds with one input stream and one output stream gives:
HERON2-C6203, 1 timeslot read and write, non-dedicated DMAs, internal memory | 126Mbytes/sec total (in and out)
HERON2-C6203, 1 timeslot read and write, dedicated DMAs, internal memory | 127Mbytes/sec total (in and out)
HERON2-C6203, 2 or more timeslots read and write, non-dedicated DMAs, internal memory | 178Mbytes/sec total (in and out)
HERON2-C6203, 2 or more timeslots read and write, dedicated DMAs, internal memory | 236Mbytes/sec total (in and out)
I/O
When data is being transferred between processors, the peak rate of the DMA is
important and the overhead of starting a transfer is less critical. I/O data,
however, must never be lost by the transfer, and every part of the process
becomes critical.
There are two "system" models that can be used. One is to process
the data as a stream: blocks of data can be gathered and processed, and the
real-time requirement is to process the data fast enough; latency is not
the key issue. The other is to create a tight control loop, where a single
input must be processed and reacted to before the next sample arrives.
Here the real-time requirement is the latency, which must include the time to process that
sample.
Single Sample processing (Control loops)
If you are trying to make a control-loop type of application that must process a single
sample before the next sample arrives, it does not make sense to use DMA at all.
Instead, your application will "poll" for data, perform the processing
and output as soon as each sample arrives, and expect to be ready to poll for the next
data item before it arrives. For this the functions HeronReadWord and
HeronWriteWord can be used. These functions are compiled "in-line" into
your program for maximum efficiency and will not return until the data is
transferred.
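As a minimal sketch of such a loop (the HeronReadWord/HeronWriteWord prototypes, the handle type and process() are assumptions made for this illustration; see the HERON-API manual for the real declarations and for how the FIFO handles are obtained):

    /* Single-sample control loop using the in-line polling functions.
       The HERON-API header supplies the real HeronReadWord/HeronWriteWord
       declarations; the signatures assumed here are illustrative only. */
    extern unsigned int process(unsigned int sample);   /* your control algorithm (placeholder) */

    void control_loop(int fifo_in, int fifo_out)         /* handle type is an assumption */
    {
        for (;;) {
            unsigned int sample = HeronReadWord(fifo_in);  /* returns only when a word has arrived     */
            unsigned int result = process(sample);
            HeronWriteWord(fifo_out, result);              /* returns only when the word has been sent */
        }
    }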
The HERON2 is forced to use DMA for all transfers to and from its XBUS (i.e.
the HERON FIFOs). This means it will be inefficient, and will have some
uncertainty of response, when accessing only single words. As the number of
samples is increased, the overhead of the DMA becomes less significant.
Processor loading
Of course when the FIFO access is done using DMA engines, the processor is free to
perform processing, but setting up the DMAs and responding to their
completion interrupts both take processor time.
At high data rates the loading can be quite significant. We measured the processor
load using DSP/BIOS RTDX, simply running a program that sends and receives data using
the SEM model.
PCI HOST bandwidth
When data is being transferred to or from the host machine, similar issues apply to
the transfer as we have discussed for the C6000. That is, the PC has a
processor that must transfer the data on its memory bus, set up the transfers,
and service the interrupts that flag completion.
The Pentium doesn't have DMA controllers that you can use for this, but PCI
allows plug-in cards to support Master Mode, which essentially means that
the transfer can be made by that plug-in card directly to or from PC memory.
Like the C6000 modules, the HEPC9 has been designed so that the peak transfer rate is the
maximum allowed by the PCI bus, i.e. 132 Mbytes/sec.
One thing that can affect the sustained rates possible is the operating system
running on the host PC. The main difference is between OSes that allow the
driver to obtain the physical address of the memory buffer the user has allocated,
and those that do not. If the physical address can be obtained, then Master Mode can be used to
transfer the data directly into or out of the buffer allocated by the user program. Win
NT uses this model.
If, however, a physical address cannot be obtained, Master Mode must first
transfer the data into a driver buffer, which is then copied into the user
buffer. The copy process takes extra time, reducing the sustained data rate. Win
98/ME uses this model.
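As a conceptual sketch of the two models (the helper names here are invented for illustration and are not the HUNT ENGINEERING driver API):

    #include <string.h>
    #include <stddef.h>

    extern unsigned long physical_address_of(void *buf);                /* hypothetical */
    extern void master_mode_transfer(unsigned long phys, size_t bytes); /* card bus-masters to this address */
    extern unsigned char driver_buffer[];                               /* driver-owned staging buffer */

    /* Win NT style: the physical address of the user's buffer is known,
       so the HEPC9 bus-masters data straight into it. */
    void read_direct(void *user_buffer, size_t bytes)
    {
        master_mode_transfer(physical_address_of(user_buffer), bytes);
    }

    /* Win 98/ME style: data is bus-mastered into a driver buffer first,
       then copied; the extra memcpy() is what lowers the sustained rate. */
    void read_via_copy(void *user_buffer, size_t bytes)
    {
        master_mode_transfer(physical_address_of(driver_buffer), bytes);
        memcpy(user_buffer, driver_buffer, bytes);
    }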
We have measured the performance on several PCs, with the following results (all figures in Mbytes/sec):
PC type | OS | 1 slot read | 1 slot write | 1 slot read + 1 slot write (total) | 2 slot read | 2 slot write
AMD K6-450 | Win98 | 35 | 36 | 50 | 47 | 37
AMD Athlon 850 | Win98 | 51 | 57 | 60 | 65 | 57
PIII-800 | Win98 | 41 | 45 | 43 | 59 | 46
Dell PIII-800 | Win NT | 64 | 65 | 92 | 101 | 65
P-Pro 200 | Win98 | 31 | 23 | 28 | 40 | 23
P4 1.5G | Win98 | 56 | 35 | 50 | 62 | 35
P4 1.5G | Win NT | 64 | 36 | 56 | 70 | 37
There are obviously variations between different machines. Generally, the faster the
machine the faster the transfers, and Win NT is generally faster than Win98, as
expected. The P4 1.5GHz machine was tested using both NT and 98, for a
comparison where the OS is the only difference; the difference is about 40%. This machine also shows a big difference between read and write speeds, which
is abnormal. Looking at the PCI signals it seems that this chipset handles reads
and writes in different ways; the different chipsets used in PCs seem to have a big
effect.
In general, the speed when reading and writing at the same time is about 1.5 times the speed of one
direction.
Measuring speeds with more than 2 HEART timeslots connected does not increase the
bandwidths achieved.
See also our paper on developing real-time systems.