How Fast Can I Transfer Data?
In a HERON system that uses HEART to transport data between the nodes, there are many factors that affect the speed at which data can flow.
Consider a simple transfer between two nodes using a FIFO connection:
If the Receiver can accept data faster than the Transmitter generates it, then we can expect the data to be transferred without interruption, as long as the FIFO itself can support that data rate. The Receiver should not read the FIFO when it is empty, so only valid data will be received.
With HEART (as on the HEPC9) the FIFO connection is a Virtual FIFO. HEART makes this connection using two real FIFOs (implemented in FPGAs), connected together using a time-slotted ring. The FIFOs are 32 bits wide and use a 100MHz clock, so the module can transfer data to and from the FIFO at a peak rate of 400Mbytes/sec (100Mwords/sec). Each HEART timeslot carries 66Mbytes/sec.
So the speed that the Virtual FIFO connection can support is limited by the
number of timeslots selected for it.
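For example, a Virtual FIFO connection that has been allocated two timeslots can sustain 2 x 66Mbytes/sec = 132Mbytes/sec; the HERON2 measurements later in this note show exactly that figure.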
What happens if the transmitter generates data faster than can be accepted?
The Transmitter uses the Full flag on the FIFO to detect when the FIFO cannot accept
any more data. In most cases the Transmitter can simply wait for the FIFO to be
ready, so the transfer of data is limited by the slowest element.
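As a minimal sketch of that behaviour (fifo_full() and fifo_write() are hypothetical helpers standing in for whatever flag and data access mechanism your design uses, be that an FPGA signal, a status register or a HERON-API call):

    /* Illustrative only: fifo_full() and fifo_write() are hypothetical helpers. */
    extern int  fifo_full(void);                 /* non-zero while the FIFO is full */
    extern void fifo_write(unsigned int word);   /* write one 32-bit word           */

    void send_block(const unsigned int *data, int count)
    {
        int i;
        for (i = 0; i < count; i++) {
            while (fifo_full())
                ;                    /* wait: the Full flag provides the back-pressure */
            fifo_write(data[i]);     /* only write when the FIFO has space             */
        }
    }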
However, some devices cannot wait. An A/D, for example, has to take a sample on every
sample clock, and probably doesn't have any storage associated with it. In
that case data will be lost, so it is important that your system never
reaches that situation.
So the data transfer rate that can be achieved on a particular connection is limited by the slowest part of that connection.
Peak data rates are not the same as sustained data rates, though. Any transfer has some overhead, so it will not be able to sustain the maximum rate.
FPGA-based modules
In the case of an FPGA based module, the input FIFO connection and the output FIFO connection are connected directly to the FPGA. This means the FPGA has direct control over the transfer of data, and can sustain the maximum transfer rate in and out at the same time, if your FPGA design allows that.
Of course the rates at which you can transfer data depend on the frequency of the
FIFO clock that is generated by your FPGA design. Normally this will be the
full 100MHz, but there can be cases where your FIFO design is simpler if you
choose a lower clock rate. If so, you must be aware that this
reduces the data rates accordingly.
Normally the HERON FIFOs will be accessed using the Hardware Interface Layer (HIL) VHDL
that is supplied by HUNT ENGINEERING.
When accessing a single FIFO (in or out) the HIL allows the full 400Mbytes/sec to be
sustained into and out of the FPGA (at the same time).
When several FIFOs are transferring data at the same time, the HIL needs to switch
accesses from one FIFO to another. This switching causes a single cycle where
data cannot be transferred, reducing the total data rate that is possible.
The HIL supports different models of accessing several FIFOs. The first is to
permanently assert the data requests for the FIFOs; the HIL then uses a
round-robin method to access them. This can cause data to be transferred one
word at a time, with one dead cycle between each word, which reduces the
total rate to 200Mbytes/sec.
If you want to transfer at a higher rate using multiple FIFOs, you can use
the HIL in a different manner: cycle the data request signals
yourself, forcing the same FIFO to be accessed for a series of
cycles before the dead cycle is introduced and you start to access a
different FIFO. Using this method, total rates above 350Mbytes/sec can easily
be achieved.
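A quick calculation shows where these figures come from. The burst length of 16 words used below is only an assumption for illustration; the application note gives the real switching behaviour of the HIL.

    /* Effective HIL throughput with one dead cycle per FIFO switch,
       assuming a 100MHz, 32-bit FIFO interface (400Mbytes/sec peak). */
    #include <stdio.h>

    int main(void)
    {
        const double peak = 400.0;  /* Mbytes/sec: 100MHz x 4 bytes */

        /* Round-robin: one data word, then one dead cycle */
        double word_at_a_time = peak * 1.0 / (1.0 + 1.0);

        /* Cycling the request signals yourself: e.g. 16 words per FIFO
           before the dead cycle (16 is an illustrative choice only) */
        double burst_of_16 = peak * 16.0 / (16.0 + 1.0);

        printf("round-robin: %.0f Mbytes/sec\n", word_at_a_time);  /* 200  */
        printf("burst of 16: %.0f Mbytes/sec\n", burst_of_16);     /* ~376 */
        return 0;
    }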
Peak rate | 100Mwords/sec = 400Mbytes/sec
Sustained rate, single FIFO in one direction | 400Mbytes/sec
Sustained total rate on several FIFOs, normal method | 200Mbytes/sec
Sustained total rate on several FIFOs, cycling method | >350Mbytes/sec
Rates achieved with an FPGA-based module on a HEPC9 (FIFO clock = 100MHz).
Refer to the application note for full details of using the HIL to access multiple FIFOs.
C6000-based modules
When using a C6000 processor in a real-time system you need to transfer data into and
out of the processor at the same time as processing that data. The C6000 DSPs
have DMA engines that allow the I/O to be handled separately from the processing,
but is it really separate? Of course there are interactions between processing and I/O:
the DMA and processor share internal and external busses, and the DMAs need to
be programmed by the processor.
The HERON2 uses the XBUS of the C6203 to access the
FIFOs with a 75MHz clock. Using this separate bus allows FIFO data to be read
and memory to be written in the same clock cycle, reducing the conflict for
resources.
Consider what a particular data rate means in terms of processing speed. If you transfer data into a C6000 at 100Mbytes/sec, you have only one or two processor cycles per byte for processing. If you are handling 32-bit data items this means 8 cycles per sample, but even that needs to be a near-trivial operation coded very efficiently!
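For example, assuming a 200MHz device clock purely for illustration, 200M cycles/sec divided by 100Mbytes/sec gives 2 cycles per byte, or 8 cycles for each 32-bit word.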
HERON-API and FIFO accesses
It is usual to use HERON-API to control the DMAs and interrupts in order to correctly
access the HERON FIFOs. This library is linked into your DSP program, allowing
that program to use simple read/write calls to transfer the data.
HERON-API has a complex job to do. It must share the four DMA engines of the processor
around a possible 12 different data streams. To achieve this it maintains a list
of DMA engines that do not have transfers in progress, and another list of
transfers ready to be made.
Floating DMA
The default use of these lists is that the tasks get allocated resources in the order that they became ready, a mode that we talk about as "floating DMAs". In this mode of operation, each FIFO block is queued and allocated resources separately, so during a large transfer the DMAs will be claimed and freed inside the HERON-API many times. HERON-API uses interrupts triggered by the FIFO flags and by DMA completions to manage the queues.
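The sketch below illustrates the idea of those two lists. It is not the HERON-API source, and the names are invented for the illustration; it simply shows a pool of idle DMA engines being matched against a queue of transfers that have become ready.

    /* Conceptual sketch of floating-DMA bookkeeping (not the real HERON-API). */
    #include <stddef.h>

    #define NUM_DMA 4

    typedef struct transfer {
        int fifo_id;               /* which HERON FIFO this block uses */
        struct transfer *next;     /* next transfer waiting for a DMA  */
    } transfer_t;

    extern void start_dma(int dma, int fifo_id);    /* hypothetical: programs the engine */

    static int free_dma[NUM_DMA] = { 0, 1, 2, 3 };  /* ids of idle DMA engines    */
    static int num_free = NUM_DMA;
    static transfer_t *ready_head = NULL;           /* transfers ready to be made */

    /* Called when a FIFO flag interrupt makes a transfer ready, or when a
       DMA-complete interrupt returns an engine to the pool. */
    static void schedule(void)
    {
        while (num_free > 0 && ready_head != NULL) {
            int dma = free_dma[--num_free];    /* claim an idle engine     */
            transfer_t *t = ready_head;        /* take the oldest transfer */
            ready_head = t->next;
            start_dma(dma, t->fifo_id);
        }
    }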
HERON2-C6203 to HERON2-C6203, one timeslot, to external memory | 66Mbytes/sec
HERON2-C6203 to HERON2-C6203, two timeslots, to external memory | 132Mbytes/sec
HERON2-C6203 to HERON2-C6203, more than two timeslots, to internal memory | 132Mbytes/sec
The HERON2 can achieve 132Mbytes/sec whether it is using internal or external memory.
Dedicated DMA
Because the floating DMA technique could cause a transfer to be delayed while there is
no DMA resource available, HERON-API has an option to "dedicate" a DMA
engine to a particular transfer. Of course, using this option for all transfers
limits you to a maximum of 4 transfers, or fewer if your own program wishes to
claim a DMA engine for itself. A dedicated DMA engine remains claimed by its
transfer, so it is available the moment that transfer is ready. This is not possible for a non-dedicated
transfer, as the DMA would be returned to the pool of available
resources and re-claimed when that transfer is next on the list.
HERON2-C6203 to HERON2-C6203, one timeslot, to external memory | 66Mbytes/sec
HERON2-C6203 to HERON2-C6203, two timeslots, to external memory | 132Mbytes/sec
HERON2-C6203 to HERON2-C6203, four or more timeslots, to external memory | 210Mbytes/sec
HERON2-C6203 to HERON2-C6203, four or more timeslots, to internal memory | 213Mbytes/sec
Multiple Blocking transfers
When the DSP is transferring to or from several FIFOs there will be some interaction
between the flows, because the resources of the DSP are shared between
the transfers.
Measuring the speeds with one input stream and one output stream gives:
HERON2-C6203, 1 timeslot read and write, non-dedicated DMAs, internal memory | 126Mbytes/sec total (in and out)
HERON2-C6203, 1 timeslot read and write, dedicated DMAs, internal memory | 127Mbytes/sec total (in and out)
HERON2-C6203, 2 or more timeslots read and write, non-dedicated DMAs, internal memory | 178Mbytes/sec total (in and out)
HERON2-C6203, 2 or more timeslots read and write, dedicated DMAs, internal memory | 236Mbytes/sec total (in and out)
I/O
When data is being transferred between processors, the peak rate of the DMA is
important and the overhead of starting a transfer is less critical. I/O data,
however, must never be lost by the transfer, and every part of the process
becomes critical.
There are two "system" models that can be used. One is to process
the data as a stream: blocks of data can be gathered and processed, and the
real-time requirement is to process the data fast enough; latency is not
the key issue. The other is to create a tight control loop, where a single
input must be processed and reacted to before the next sample arrives.
Here the real-time requirement is the latency, which must include the time to process that
sample.
Single Sample processing (Control loops)
If you are trying to make a control-loop type of application that must process a single
sample before the next sample arrives, it does not make sense to use DMA at all.
Instead, your application will "poll" for data, perform the processing
and output as soon as each sample arrives, and expect to be ready to poll for the next
data item before it arrives. For this the functions HeronReadWord and
HeronWriteWord can be used. These functions are compiled "in-line" into
your program for maximum efficiency and will not return until the data is
transferred.
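As a minimal sketch of such a loop (the HeronReadWord/HeronWriteWord prototypes, the handle type and process() are assumptions made for this illustration; see the HERON-API manual for the real declarations and for how the FIFO handles are obtained):

    /* Single-sample control loop using the in-line polling functions.
       The HERON-API header supplies the real HeronReadWord/HeronWriteWord
       declarations; the signatures assumed here are illustrative only. */
    extern unsigned int process(unsigned int sample);   /* your control algorithm (placeholder) */

    void control_loop(int fifo_in, int fifo_out)         /* handle type is an assumption */
    {
        for (;;) {
            unsigned int sample = HeronReadWord(fifo_in);  /* returns only when a word has arrived     */
            unsigned int result = process(sample);
            HeronWriteWord(fifo_out, result);              /* returns only when the word has been sent */
        }
    }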
The HERON2 is forced to use DMA for all transfers to and from its XBUS (i.e.
the HERON FIFOs). This means it will be inefficient, and will have some
uncertainty of response, when accessing only single words. As the number of
samples is increased, the overhead of the DMA becomes less significant.
Processor loading
Of course when the FIFO access is done using DMA engines, the processor is free to
perform processing, but setting up the DMAs and responding to their
completion interrupts both take processor time.
At high data rates the loading can be quite significant. We measured the processor
load using DSP/BIOS RTDX, simply running a program that sends and receives data using
the SEM model.
PCI HOST bandwidth
When data is being transferred to or from the host machine, similar issues apply to
the transfer as we have discussed for the C6000. That is, the PC has a
processor that must transfer the data on its memory bus, set up the transfers,
and service the interrupts that flag completion.
The Pentium doesn't have DMA controllers that you can use for this, but PCI
allows plug-in cards to support Master Mode, which essentially means that
the transfer can be made by that plug-in card directly to or from PC memory.
Like the C6000 modules, the HEPC9 has been designed so that the peak transfer rate is the
maximum allowed by the PCI bus, i.e. 132 Mbytes/sec.
One thing that can affect the sustained rates possible is the operating system
running on the host PC. The main difference is between OSes that allow the
driver to obtain the physical address of the memory buffer the user has allocated,
and those that do not. If the physical address can be obtained, then Master Mode can be used to
transfer the data directly into or out of the buffer allocated by the user program. Win
NT uses this model.
If, however, a physical address cannot be obtained, Master Mode must first
transfer the data into a driver buffer, which is then copied into the user
buffer. The copy process takes extra time, reducing the sustained data rate. Win
98/ME uses this model.
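As a conceptual sketch of the two models (the helper names here are invented for illustration and are not the HUNT ENGINEERING driver API):

    #include <string.h>
    #include <stddef.h>

    extern unsigned long physical_address_of(void *buf);                /* hypothetical */
    extern void master_mode_transfer(unsigned long phys, size_t bytes); /* card bus-masters to this address */
    extern unsigned char driver_buffer[];                               /* driver-owned staging buffer */

    /* Win NT style: the physical address of the user's buffer is known,
       so the HEPC9 bus-masters data straight into it. */
    void read_direct(void *user_buffer, size_t bytes)
    {
        master_mode_transfer(physical_address_of(user_buffer), bytes);
    }

    /* Win 98/ME style: data is bus-mastered into a driver buffer first,
       then copied; the extra memcpy() is what lowers the sustained rate. */
    void read_via_copy(void *user_buffer, size_t bytes)
    {
        master_mode_transfer(physical_address_of(driver_buffer), bytes);
        memcpy(user_buffer, driver_buffer, bytes);
    }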
We have measured the performance on several PCs, with the following results (all figures in Mbytes/sec):
PC type | OS | 1 slot read | 1 slot write | 1 slot read + 1 slot write (total) | 2 slot read | 2 slot write
AMD K6-450 | Win98 | 35 | 36 | 50 | 47 | 37
AMD Athlon 850 | Win98 | 51 | 57 | 60 | 65 | 57
PIII-800 | Win98 | 41 | 45 | 43 | 59 | 46
Dell PIII-800 | Win NT | 64 | 65 | 92 | 101 | 65
P-Pro 200 | Win98 | 31 | 23 | 28 | 40 | 23
P4 1.5G | Win98 | 56 | 35 | 50 | 62 | 35
P4 1.5G | Win NT | 64 | 36 | 56 | 70 | 37
There are obviously variations between different machines. Generally, the faster the
machine the faster the transfers, and Win NT is generally faster than Win98, as
expected. The P4 1.5GHz machine was tested using both NT and 98, for a
comparison where the OS is the only difference; the difference is about 40%. This machine also shows a big difference between read and write speeds, which
is abnormal. Looking at the PCI signals it seems that this chipset handles reads
and writes in different ways; the different chipsets used in PCs seem to have a big
effect.
In general, the speed when reading and writing at the same time is about 1.5 times the speed of one
direction.
Measuring speeds with more than 2 HEART timeslots connected does not increase the
bandwidths achieved.
See also our paper on developing real-time systems.