## ARM Architecture

Computer Organization and Assembly Languages, Yung-Yu Chuang

with slides by Peng-Sheng Chen, Ville Pietikainen



- 1983: developed by Acorn Computers
  - To replace the 6502 in BBC computers
  - 4-man VLSI design team
  - Its simplicity comes from the inexperience of the team
  - Matches the need of generalized SoCs for reasonable power, performance and die size
  - The first commercial RISC implementation
- 1990: ARM (Advanced RISC Machines), owned by Acorn, Apple and VLSI

# ARM Ltd



#### Designs and licenses ARM cores but does not fabricate them





- One of the most licensed and thus widespread processor cores in the world
  - Used in PDAs, cell phones, multimedia players, handheld game consoles, digital TVs and cameras
  - ARM7: GBA, iPod
  - ARM9: NDS, PSP, Sony Ericsson, BenQ
  - ARM11: Apple iPhone, Nokia N93, N800
  - 90% of 32-bit embedded RISC processors (as of 2009)
- Used especially in portable devices due to its low power consumption and reasonable performance

## ARM powered products







- A simple but powerful design
- A whole family of designs sharing similar design principles and a common instruction set

# Naming ARM



- ARMxyzTDMIEJFS
  - x: series
  - y: MMU
  - z: cache
  - T: Thumb
  - D: debugger
  - M: Multiplier
  - I: EmbeddedICE (built-in debugger hardware)
  - E: Enhanced instructions (DSP)
  - J: Jazelle (JVM)
  - F: Floating-point
  - S: Synthesizable version (source code version for EDA tools)
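
The naming scheme above can be read mechanically. The following is a minimal sketch, assuming the suffix letters mean exactly what the list says; the function name `decode_core_name` and the feature labels are hypothetical, and the digits are returned as an opaque string rather than split into series/MMU/cache.

```python
import re

# Suffix letters as listed in the naming scheme; labels are illustrative.
FEATURES = {
    "T": "Thumb",
    "D": "debugger",
    "M": "multiplier",
    "I": "EmbeddedICE",
    "E": "enhanced instructions",
    "J": "Jazelle",
    "F": "floating-point",
    "S": "synthesizable",
}


def decode_core_name(name):
    """Split a name such as 'ARM7TDMI' into its digits and feature list."""
    match = re.fullmatch(r"ARM(\d+)([TDMIEJFS-]*)", name.upper())
    if match is None:
        raise ValueError(f"not an ARMxyz... style name: {name!r}")
    digits, suffix = match.groups()
    # A hyphen (as in 'EJ-S') is only a separator and carries no meaning.
    features = [FEATURES[ch] for ch in suffix if ch != "-"]
    return digits, features


print(decode_core_name("ARM7TDMI"))
print(decode_core_name("ARM926EJ-S"))
```

For `ARM7TDMI` this yields the digits `"7"` and the four features Thumb, debugger, multiplier and EmbeddedICE.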



- ARM7TDMI
  - 3 pipeline stages (fetch/decode/execute)
  - High code density/low power consumption
  - One of the most widely used ARM versions (for low-end systems)
  - All ARM cores after ARM7TDMI include TDMI even if they do not include TDMI in their labels
- ARM9TDMI
  - Compatible with ARM7
  - 5 stages (fetch/decode/execute/memory/write)
  - Separate instruction and data cache
- ARM11

# ARM family comparison



ARM family attribute comparison.

| Attribute              | ARM7        | ARM9                     | ARM10                   | ARM11                   |
|------------------------|-------------|--------------------------|-------------------------|-------------------------|
| Year                   | 1995        | 1997                     | 1999                    | 2003                    |
| Pipeline depth         | three-stage | five-stage               | six-stage               | eight-stage             |
| Typical MHz            | 80          | 150                      | 260                     | 335                     |
| mW/MHz <sup>a</sup>    | 0.06 mW/MHz | 0.19 mW/MHz<br>(+ cache) | 0.5 mW/MHz<br>(+ cache) | 0.4 mW/MHz<br>(+ cache) |
| MIPS <sup>b</sup> /MHz | 0.97        | 1.1                      | 1.3                     | 1.2                     |
| Architecture           | Von Neumann | Harvard                  | Harvard                 | Harvard                 |
| Multiplier             | 8 × 32      | 8 × 32                   | 16 × 32                 | 16 × 32                 |

<sup>a</sup> Watts/MHz on the same 0.13 micron process.

<sup>b</sup> MIPS are Dhrystone VAX MIPS.
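
The per-MHz figures in the table can be turned into rough absolute numbers by multiplying by the typical clock rate. This is only a back-of-the-envelope sketch: the dictionary layout and the name `estimate` are hypothetical, and the ARM9 to ARM11 power figures include cache while the ARM7 figure does not, so the rows are not strictly comparable.

```python
# Figures copied from the comparison table.
FAMILIES = {
    "ARM7":  {"mhz": 80,  "mw_per_mhz": 0.06, "mips_per_mhz": 0.97},
    "ARM9":  {"mhz": 150, "mw_per_mhz": 0.19, "mips_per_mhz": 1.1},
    "ARM10": {"mhz": 260, "mw_per_mhz": 0.5,  "mips_per_mhz": 1.3},
    "ARM11": {"mhz": 335, "mw_per_mhz": 0.4,  "mips_per_mhz": 1.2},
}


def estimate(family):
    """Return (Dhrystone MIPS, milliwatts, MIPS per milliwatt) at typical MHz."""
    row = FAMILIES[family]
    mips = row["mips_per_mhz"] * row["mhz"]
    milliwatts = row["mw_per_mhz"] * row["mhz"]
    return mips, milliwatts, mips / milliwatts


for name in FAMILIES:
    mips, mw, efficiency = estimate(name)
    print(f"{name}: {mips:.1f} MIPS, {mw:.1f} mW, {efficiency:.1f} MIPS/mW")
```

Under these figures throughput rises with each generation while MIPS per milliwatt falls, which is consistent with ARM7 remaining attractive for low-end, low-power systems.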



- RISC: simple but powerful instructions that execute within a single cycle at high clock speed.
- Four major design rules:
  - Instructions: reduced set/single cycle/fixed length
  - Pipeline: decode in one stage/no need for microcode
  - Registers: a large set of general-purpose registers
  - Load/store architecture: data processing instructions apply to registers only; load/store instructions transfer data between registers and memory
- Results in simple design and fast clock rate
- The RISC/CISC distinction has blurred because CISC processors now implement RISC concepts



- Small processor for lower power consumption (for embedded systems)
- High code density for limited memory and physical size restrictions
- The ability to use slow and low-cost memory
- Reduced die size to lower manufacturing cost and accommodate more peripherals



- Different from pure RISC in several ways:
  - Variable cycle execution for certain instructions: multiple-register load/store (faster/higher code density)
  - Inline barrel shifter leading to more complex instructions: improves performance and code density
  - Thumb 16-bit instruction set: 30% code density improvement
  - Conditional execution: improves performance and code density by reducing the number of branches
  - Enhanced instructions: DSP instructions
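
Two of the departures from pure RISC listed above, the inline barrel shifter and conditional execution, can be modelled in a few lines. This is a sketch, not a description of the hardware: `add_shifted` mimics what an instruction such as `ADD r0, r1, r2, LSL #2` computes, and `condition_passed` evaluates the standard ARM condition codes against the N, Z, C and V flags. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left, truncated to 32 bits."""
    return (value << amount) & MASK32


def add_shifted(rn, rm, shift):
    """Model of `ADD rd, rn, rm, LSL #shift`: shift and add in one instruction."""
    return (rn + lsl(rm, shift)) & MASK32


def condition_passed(cond, n, z, c, v):
    """Decide whether an instruction with condition suffix `cond` executes."""
    table = {
        "EQ": z,                    "NE": not z,
        "CS": c,                    "CC": not c,
        "MI": n,                    "PL": not n,
        "VS": v,                    "VC": not v,
        "HI": c and not z,          "LS": (not c) or z,
        "GE": n == v,               "LT": n != v,
        "GT": (not z) and n == v,   "LE": z or n != v,
        "AL": True,
    }
    return bool(table[cond])


# Address of element 3 in an array of 32-bit words starting at 0x1000.
print(hex(add_shifted(0x1000, 3, 2)))
# With Z set (operands compared equal), an EQ instruction runs and an NE one is skipped.
print(condition_passed("EQ", False, True, False, False),
      condition_passed("NE", False, True, False, False))
```

The array-indexing case shows why the shifter helps code density: the scale-by-four and the add fit in one instruction. Likewise, a skipped conditional instruction replaces a branch around it, so the pipeline does not need to be flushed.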







- Only 16 registers are visible in any given mode. A mode can access:
  - A particular set of r0-r12
  - r13 (sp, stack pointer)
  - r14 (lr, link register)
  - r15 (pc, program counter)
  - Current program status register (cpsr)
  - The uses of r0-r13 are orthogonal





- 6 data types (signed/unsigned)
- All ARM operations are 32-bit. Shorter data types are only supported by data transfer operations.
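
Because operations are always 32-bit, a byte or halfword loaded from memory has to be widened to a full register. The sketch below shows the two widening rules, assuming the usual behaviour of unsigned loads (such as `LDRB`, `LDRH`) and signed loads (such as `LDRSB`, `LDRSH`); those mnemonics are not named in the slides, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def zero_extend(value, bits):
    """Widen an unsigned `bits`-wide value to 32 bits (upper bits become 0)."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Widen a signed `bits`-wide value to 32 bits (upper bits copy the sign)."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # XOR-and-subtract turns the narrow two's complement value into a
    # Python int, which is then wrapped back into 32 bits.
    return ((value ^ sign_bit) - sign_bit) & MASK32


print(hex(zero_extend(0x80, 8)))   # byte 0x80 read as unsigned
print(hex(sign_extend(0x80, 8)))   # byte 0x80 read as signed (-128)
```

Once the value is in a register, every later instruction treats it as an ordinary 32-bit word, which is why only the data transfer operations need to know about the shorter types.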



- The program counter (pc) stores the address of the instruction to be executed
- All instructions are 32-bit wide and word-aligned
- Thus, the last two bits of pc are undefined.
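
Word alignment means every instruction address is a multiple of 4, so the two lowest address bits carry no information. A small sketch of that arithmetic, with hypothetical helper names:

```python
def is_word_aligned(address):
    """True when the two lowest address bits are zero."""
    return (address & 0b11) == 0


def next_instruction(address):
    """Address of the following 32-bit instruction, wrapped to 32 bits."""
    return (address + 4) & 0xFFFFFFFF


def instruction_index(address):
    """Which 32-bit instruction slot an aligned address refers to."""
    if not is_word_aligned(address):
        raise ValueError("ARM instructions must be word-aligned")
    return address >> 2


print(is_word_aligned(0x8000), is_word_aligned(0x8002))
print(hex(next_instruction(0x8000)), instruction_index(0x8000))
```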







| Processor mode | Abbreviation | Description                                            |
|----------------|--------------|--------------------------------------------------------|
| User           | usr          | Normal program execution mode                          |
| FIQ            | fiq          | Supports a high-speed data transfer or channel process |
| IRQ            | irq          | Used for general-purpose interrupt handling            |
| Supervisor     | svc          | A protected mode for the operating system              |
| Abort          | abt          | Implements virtual memory and/or memory protection     |
| Undefined      | und          | Supports software emulation of hardware coprocessors   |
| System         | sys          | Runs privileged operating system tasks                 |
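
The mode table connects to the register file through banking: some modes get private copies of certain registers. The sketch below assumes the classic ARM layout, which the slides do not spell out: the mode is held in the low five bits of cpsr, FIQ banks r8 to r14, the other exception modes bank r13 and r14, and System shares the User registers. All names in the code are hypothetical.

```python
# cpsr[4:0] encodings for the seven modes (classic ARM architecture).
MODES = {
    "usr": 0b10000, "fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
    "abt": 0b10111, "und": 0b11011, "sys": 0b11111,
}

# Registers that each mode replaces with its own private (banked) copy.
BANKED = {
    "fiq": set(range(8, 15)),   # r8..r14
    "irq": {13, 14},
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}


def mode_from_cpsr(cpsr):
    """Name of the mode selected by the low five bits of cpsr."""
    bits = cpsr & 0x1F
    for name, encoding in MODES.items():
        if encoding == bits:
            return name
    raise ValueError(f"reserved mode bits: {bits:#07b}")


def physical_register(mode, n):
    """Return a (bank, n) pair naming the physical register behind rN."""
    if n in BANKED.get(mode, ()):
        return (mode, n)
    return ("usr", n)


def count_general_registers():
    """Count distinct physical r0..r15 registers across all modes."""
    return len({physical_register(m, n) for m in MODES for n in range(16)})


print(mode_from_cpsr(0x13), physical_register("irq", 13), count_general_registers())
```

Under these assumptions there are 31 general-purpose registers in total, of which any one mode sees 16. Banking r13 and r14 gives each exception mode its own stack pointer and return address without saving the interrupted program's copies.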

# **Register organization**





### Instruction sets



#### ARM/Thumb/Jazelle

|                                    | ARM ( <i>cpsr</i> $T = 0$ )              | Thumb ( <i>cpsr</i> $T = 1$ )                             |
|------------------------------------|------------------------------------------|-----------------------------------------------------------|
| Instruction size                   | 32-bit                                   | 16-bit                                                    |
| Core instructions                  | 58                                       | 30                                                        |
| Conditional execution <sup>a</sup> | most                                     | only branch instructions                                  |
| Data processing instructions       | access to barrel shifter and ALU         | separate barrel shifter and ALU instructions              |
| Program status register            | read-write in privileged mode            | no direct access                                          |
| Register usage                     | 15 general-purpose registers + <i>pc</i> | 8 general-purpose registers + 7 high registers + <i>pc</i> |

|                   | Jazelle ( <i>cpsr</i> $T = 0, J = 1$ )                                                                         |
|-------------------|----------------------------------------------------------------------------------------------------------------|
| Instruction size  | 8-bit                                                                                                          |
| Core instructions | Over 60% of the Java bytecodes are implemented in hardware; the rest of the codes are implemented in software. |
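
The T and J flags in cpsr select which of the three instruction sets the core is decoding. A minimal sketch, assuming the usual bit positions (T is bit 5, J is bit 24), which the tables do not state; the function name is hypothetical.

```python
T_BIT = 1 << 5    # Thumb state flag in cpsr
J_BIT = 1 << 24   # Jazelle state flag in cpsr


def instruction_set(cpsr):
    """Return (state name, instruction width in bits) for a cpsr value."""
    thumb = bool(cpsr & T_BIT)
    jazelle = bool(cpsr & J_BIT)
    if not thumb and not jazelle:
        return "ARM", 32
    if thumb and not jazelle:
        return "Thumb", 16
    if jazelle and not thumb:
        return "Jazelle", 8
    raise ValueError("T = 1 with J = 1 is not one of the three states above")


print(instruction_set(0x10))              # User mode, ARM state
print(instruction_set(0x10 | T_BIT))      # User mode, Thumb state
print(instruction_set(0x10 | J_BIT))      # User mode, Jazelle state
```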





- Execution of a branch or direct modification of pc causes the ARM core to flush its pipeline
- Branch prediction was introduced with ARM10
- An instruction in the execute stage will complete even though an interrupt has been raised. Other instructions in the pipeline are abandoned.
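
The cost of a pipeline flush can be estimated with an idealized cycle count. The model below is a deliberate simplification and not a timing specification: it assumes every instruction otherwise takes one cycle, no branch prediction, and a refill penalty of two cycles per taken branch on a three-stage pipeline. The function name and the default penalty are assumptions.

```python
def pipeline_cycles(instructions, taken_branches, stages=3, flush_penalty=2):
    """Idealized cycle count for a simple in-order pipeline.

    The first instruction needs `stages` cycles to pass through; each later
    instruction completes one cycle after its predecessor; every taken
    branch discards the instructions behind it and pays `flush_penalty`.
    """
    if instructions == 0:
        return 0
    fill = stages - 1
    return fill + instructions + taken_branches * flush_penalty


straight_line = pipeline_cycles(10, taken_branches=0)
with_branches = pipeline_cycles(10, taken_branches=2)
print(straight_line, with_branches)
```

In this model ten instructions take 12 cycles without branches and 16 with two taken branches, which illustrates why reducing the number of branches matters on a core without branch prediction.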

### Interrupts







| Exception/interrupt    | Shorthand | Address    |
|------------------------|-----------|------------|
| Reset                  | RESET     | 0x00000000 |
| Undefined instruction  | UNDEF     | 0x00000004 |
| Software interrupt     | SWI       | 0x00000008 |
| Prefetch abort         | PABT      | 0x0000000c |
| Data abort             | DABT      | 0x00000010 |
| Reserved               |           | 0x00000014 |
| Interrupt request      | IRQ       | 0x00000018 |
| Fast interrupt request | FIQ       | 0x0000001c |
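
Each entry in the vector table is one word apart, so a vector address is the table base plus four times the entry index. The sketch below reproduces the addresses from the table and also records which processor mode each exception enters. That mapping follows the classic ARM architecture and is not given in the slides; the names used are hypothetical.

```python
# Entries in table order; None marks the reserved slot.
VECTORS = ["RESET", "UNDEF", "SWI", "PABT", "DABT", None, "IRQ", "FIQ"]

# Mode entered when each exception is taken (classic ARM behaviour).
ENTRY_MODE = {
    "RESET": "svc", "UNDEF": "und", "SWI": "svc",
    "PABT": "abt", "DABT": "abt", "IRQ": "irq", "FIQ": "fiq",
}


def vector_address(shorthand, base=0x00000000):
    """Address the core jumps to for the given exception."""
    return base + 4 * VECTORS.index(shorthand)


for name in VECTORS:
    if name is not None:
        print(f"{name:5s} {vector_address(name):#010x} -> {ENTRY_MODE[name]}")
```

Since each slot holds a single 32-bit instruction, an entry normally contains a branch to the real handler. FIQ sits in the last slot, so its handler can start right at that address without a branch.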

#### References





- Whirlwind Tour of ARM Assembly, chapter 23, section 23.1 "Introduction" (shown on the slide as a page screenshot)

