Table of contents

Part 0
  Summary
Part 1 - intro
  Why this at all?
  About the Cortex-M0
  About AHB
  Cortex-M0 boot & vector table
  PSoC4 vector redirection
  About the PSoC4
  How I found the SROM's home
  Dumping the first 12 bytes
  Dumping 512 bytes
  Analyzing the 512 bytes
  Towards ROP
  The ROP exploit
  Dumping the entire SROM
Part 2 - what have you got to hide?
  More PSoC4 info
  Like a 100x100 sudoku
  Why is writing flash so hard?
Part 3 - so, why do I care?
  More flash for free
  Rootkits
  IP theft & mutation
Part 4 - can we do it without a debugger?
  Arbitrary code execution in privileged mode
Part 5 - disclosure
  I tried...no luck
Part 6 - future work
  A lot left to do
Part 7 - the goods
  Downloads and license

Part 0 - TLDR

This article explains how I figured out how Cypress's jury-rigged "supervisor" mode in the PSoC4 family works, dumped the secret unreadable SROM, exploited it, and found a way to unlock extra flash in the PSoC4, as well as how you can develop scary rootkits for touchpads and touchscreens that use Cypress chips. I provide the code to do this yourself, along with as much guidance as I possibly can for now. Along the way I explain how this was all done and what steps it took. This article covers about a month of work.

Part 1, where we meet our opponent and look deeply into its eyes...

The cheapest Cortex-M microcontroller available now is the CY8C4013SXI-400 from the PSoC4000 family by Cypress. It is the cheapest by far. It is available in an 8-pin package, claims to have 8K of flash and 2K of RAM, runs at 16MHz, and sells for $0.61 apiece. What's not to like? I decided to get some to play with, and got on with reading the manuals (1 2 3). Now, when I see something like "The user has no access to read or modify the SROM code," I immediately perk up. Not readable, huh? Well, what's in there? "When the device is reset, the initial boot code for configuring the device is executed out of supervisory read-only memory". Oh? Curious. Anything else? "System calls are executed out of SROM in the privileged mode of operation". All right, I am interested! This is how my adventure with the Cypress PSoC4 began. When someone claims something to be impossible, it immediately becomes interesting. This post is my attempt to retell this story and what I learned from it. It starts out quite inaccurate (as I knew little) and gets more accurate as I learn more about the insides of this chip.

First, let me give you a short overview of the Cortex-M0 (the CPU in the PSoC4) and ARMv6-M (the architecture of the Cortex-M0). You may skip this paragraph if you're well familiar with the topic. The CPU is a very simple one. It has 16 32-bit registers and executes Thumb code (16-bit opcodes). There are a few 32-bit opcodes, but mostly they are for making long jumps and function calls. Unaligned accesses are not supported. Exception handling abilities are minimal: almost any fault causes a HardFault, and there is not always enough information saved to fix the cause and restart. The Cortex-M0 is meant for simple tasks. The processor has a concept of "Exception Number." This is a method to prioritize what can interrupt what. Normal IRQs are configurable, while a few CPU-wide exceptions are not and have hardcoded exception numbers. Any exception with a lower number can interrupt a handler running at a higher number. That is, for example, an IRQ with priority 2 can interrupt normal execution; it itself may be interrupted by an IRQ of priority 1, and that may read undefined memory and take a HardFault (priority -1). The HardFault handler itself may be interrupted by an NMI (priority -2). The CPU will ignore exceptions of a higher priority number while a handler of a lower number runs. This causes an interesting conundrum. What happens if you access undefined memory in an NMI handler? The priority number of NMI is lower than that of HardFault, so the CPU cannot take the fault. It also cannot access the memory. In this case the CPU enters the lockup state, which is a special state where it continuously attempts and fails to execute instructions at address 0xFFFFFFFE. This guarantees it does not get itself into any more trouble. Cortex-M family chips can be debugged using a debug protocol called SWD. For this project I used my own SWD debugger (CortexProg - look for it soon), but you may be able to follow along with any other SWD debugger that you can get your hands on.
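The preemption rule and the lockup corner case can be written down in a few lines of C. This is only a sketch of the model as described in the paragraph above, not of the real hardware logic; the names `can_preempt`, `fault_outcome`, and `PRIO_THREAD` are made up for illustration.

```c
#include <stdbool.h>

/* Priorities as described above: the lower number wins.
 * NMI and HardFault are fixed; configurable IRQs use 0 and up. */
enum { PRIO_NMI = -2, PRIO_HARDFAULT = -1 };

/* Hypothetical stand-in for "no handler is running". */
#define PRIO_THREAD 256

/* A pending exception preempts only if its number is strictly lower. */
bool can_preempt(int running_prio, int pending_prio)
{
    return pending_prio < running_prio;
}

typedef enum { OUTCOME_HARDFAULT, OUTCOME_LOCKUP } FaultOutcome;

/* What happens when code running at running_prio reads undefined memory:
 * if HardFault cannot preempt the current handler, the core locks up. */
FaultOutcome fault_outcome(int running_prio)
{
    return can_preempt(running_prio, PRIO_HARDFAULT) ? OUTCOME_HARDFAULT
                                                     : OUTCOME_LOCKUP;
}
```

With this model, `fault_outcome(PRIO_NMI)` comes out as lockup, which is exactly the situation the SROM syscall handler is in, since it runs as an NMI.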

Next, let's talk about how the CPU core connects to various peripherals and memory. As before, you may skip this paragraph if you're well familiar with the topic. The bus used for this in the Cortex-M0 is AHB-Lite. It is a relatively simple bus with 32-bit addresses and 32-bit data. It supports rudimentary metadata alongside data transfers: the HPROT bits. These four bits signal: (0) whether an access is for code or data, (1) whether it is privileged or not, (2) whether it is bufferable or not, and (3) whether it is cacheable or not. Most implementations of the Cortex-M0 do not really use any of this, as they are simple systems that do not need to signal such things. It is up to the implementer of the system to decide what a "privileged" access really means.
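For reference, the four HPROT bits can be decoded as below. The bit positions are the ones listed above; the struct and function names are made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* HPROT bit positions on AHB-Lite. */
#define HPROT_DATA        (1u << 0)  /* 1 = data access, 0 = opcode fetch */
#define HPROT_PRIVILEGED  (1u << 1)  /* the bit the PSoC4 gives meaning to */
#define HPROT_BUFFERABLE  (1u << 2)
#define HPROT_CACHEABLE   (1u << 3)

typedef struct {
    bool is_data;
    bool privileged;
    bool bufferable;
    bool cacheable;
} HprotInfo;

/* Split a 4-bit HPROT value into its named flags. */
HprotInfo hprot_decode(uint8_t hprot)
{
    HprotInfo info;
    info.is_data    = (hprot & HPROT_DATA) != 0;
    info.privileged = (hprot & HPROT_PRIVILEGED) != 0;
    info.bufferable = (hprot & HPROT_BUFFERABLE) != 0;
    info.cacheable  = (hprot & HPROT_CACHEABLE) != 0;
    return info;
}
```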

So how does a Cortex-M0 boot? And how does it handle interrupts? Again, skip this if you know it. The first part of flash (usually) holds a vector table. It is a series of 32-bit words representing addresses. At offset 0x00 the word is the initial value of the stack pointer register that the CPU will load at reset. At offset 0x04 the word is the address of the reset handler, at 0x08 is the address of the NMI handler, at 0x0C the HardFault handler, and so on. Later ARM microcontrollers (like the Cortex-M0+) support relocating this vector table to addresses other than the start of flash using a VTOR register. As the Cortex-M0 lacks this ability, most vendors bolt something similar (but simpler) onto it.
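The start of the vector table described above maps onto a plain C struct. This is a sketch with made-up names; only the first four entries are shown. The one detail not spelled out above is that handler entries carry the Thumb bit (bit 0 set), so the actual code address is the vector with that bit cleared.

```c
#include <stddef.h>
#include <stdint.h>

/* First four words of a Cortex-M0 vector table. */
typedef struct {
    uint32_t initial_sp;   /* offset 0x00: loaded into SP at reset */
    uint32_t reset;        /* offset 0x04: reset handler            */
    uint32_t nmi;          /* offset 0x08: NMI handler              */
    uint32_t hardfault;    /* offset 0x0C: HardFault handler        */
} VectorTableHead;

/* Handler vectors have bit 0 set to mark Thumb code; the instruction
 * itself lives at the even address below it. */
uint32_t vector_code_address(uint32_t vector)
{
    return vector & ~1u;
}
```

For example, a reset vector of 0x10000093 means the first instruction sits at 0x10000092.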

For Cypress, this jury-rigged method of redirecting the vector table takes the shape of the VECT_IN_RAM bit in the CPUSS_CONFIG register. Setting this bit makes the CPU fetch vectors from the start of RAM (0x20000000) instead of the start of flash (0x00000000). Even this is not that simple, though. Since Cypress states that the CPU boots from their SROM, and they use NMI mode to service syscalls to their SROM, they also have a way to force the CPU to fetch vectors from SROM. This is actually documented, and the bit for this is DIS_RESET_VECT_REL in the CPUSS_SYSREQ register. This bit is documented to redirect accesses to addresses 0x00000000..0x00000007 to SROM.

The first step in any reverse-engineering effort is to get a good grip on how things work normally. So, let's look. The CPU boots from SROM. We cannot attach the debugger to the chip until it exits this "privileged" mode (as signified by the PRIVILEGED bit in CPUSS_SYSREQ). Then the CPU runs as normal. What is this PRIVILEGED mode, one might ask, and how does it map onto the Cortex-M0 core? Cypress documents that this bit maps directly onto AHB's HPROT[1] bit, and that makes sense. Flash writing on the chip is done using syscalls to the SROM. Those get made by writing the proper parameters to an area of RAM, writing its address to the CPUSS_SYSARG register, and writing the syscall number ORed with 0x80000000 into CPUSS_SYSREQ. This top bit being set in a write is what causes the CPU to fetch the NMI vector from SROM and execute the syscall. Cypress was nice enough to somewhat document the process, which helped. If the CPU is not in NMI mode and the top bit of CPUSS_SYSREQ is set, the CPU will enter NMI mode, fetch the vector from SROM, and set the ROM_ACCESS_EN bit in CPUSS_SYSREQ to indicate that SROM can be read. If the request came from the CPU, the HMASTER_0 bit will be cleared; if it came from the debugger, it will be set. The SROM will then read the argument from CPUSS_SYSARG, check the key (all public APIs require a key to ascertain the call is not accidental), verify the operation is allowed and possible, execute it, and return to the normal CPU mode. OK. Good so far.
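The request sequence just described looks roughly like this in C. It is a minimal sketch: the register block is passed in as a pointer so the code can be tried on a host, the struct and function names are made up, and the contents of the parameter block (including the key) are left to the caller because they are not spelled out here. The offsets 0x04 and 0x08 match the addresses that show up later in the disassembly (CPUSS_SYSREQ at 0x40100004, CPUSS_SYSARG at 0x40100008).

```c
#include <stdint.h>

/* The two CPUSS registers involved, relative to 0x40100000. */
typedef struct {
    volatile uint32_t unused_00;  /* offset 0x00, not used in this sketch */
    volatile uint32_t sysreq;     /* offset 0x04: CPUSS_SYSREQ */
    volatile uint32_t sysarg;     /* offset 0x08: CPUSS_SYSARG */
} CpussRegs;

/* Top bit of CPUSS_SYSREQ: "please run a syscall". */
#define SYSREQ_CALL 0x80000000u

/* Ask the SROM to run syscall_no. param_block_addr is the address of the
 * parameter area the caller has already filled in RAM. */
void psoc4_request_syscall(CpussRegs *cpuss, uint32_t param_block_addr,
                           uint8_t syscall_no)
{
    cpuss->sysarg = param_block_addr;
    cpuss->sysreq = SYSREQ_CALL | syscall_no;
}
```

On the real chip one would call this with `(CpussRegs *)0x40100000`; on a host, a plain struct instance is enough to check the values written.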

So, where does this SROM live in the address space? My initial attempts to find it did not meet with much luck. Whenever I'd request a syscall from the debugger, the debugger would lose connection with the core, and I had no idea what happened (I figured it out later). I decided to be sneakier. The longest-running safe syscall was the one that checksums all of flash. I wrote a small program that would call it in a loop and uploaded it to the chip. I then decided to read the PCSR register in the core, which is normally there for profiling. It can be read while the CPU is running and exposes the PC of a "recently executed instruction." Cool. Watching PCSR as my test program ran, I saw lots of addresses in my code (near 0x00000000) and some around 0x10000000. Well then, I guess the SROM is at 0x10000000. Sweet. We're all done! Let's dump it and start the analysis.
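Spotting the SROM in a stream of PCSR samples amounts to bucketing the sampled addresses by region. A small sketch, assuming the samples have already been read out by the debugger into an array (how that is done depends on the tool); the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Count how many sampled PC values fall into each 256MB region
 * (selected by the top nibble of the address). */
void pc_region_histogram(const uint32_t *samples, size_t n,
                         unsigned counts[16])
{
    for (size_t i = 0; i < 16; i++)
        counts[i] = 0;
    for (size_t i = 0; i < n; i++)
        counts[samples[i] >> 28]++;
}
```

A program that spends its time in flash and in syscalls would fill buckets 0 and 1 and nothing else, which points at 0x10000000 as the place to look.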

> CortexProg -r srom.dump -A 0x10000000 -L 4096
   MEM RW ERR 2

Hm - it faults. I guess I cannot read it after all. What now? The CPU must be able to read it somehow, to be able to fetch vector addresses from there and to execute, so why can I not? Perhaps the debugger is locked out from reading the SROM? Or maybe I really need that ROM_ACCESS_EN bit set? Once again, I later found a better way to do this, but what I did for now is this: I decided to look for corner cases Cypress might not have thought of. I tried setting breakpoints in the SROM. The CPU reboots on hitting the breakpoint - they thought of this. I tried giving it invalid addresses and having a custom HardFault handler, hoping I'd somehow end up in a state where SROM can be read. The CPU locks up (remember that lecture above about NMI being a lower exception number than HardFault?) and the debugger cannot attach to it. Hm. Well, it was time for drastic measures. I ran my program that called syscalls in a loop again, waited for PCSR to tell me the code is in SROM, and requested another syscall by writing 0x80000000 to CPUSS_SYSREQ. The CPU locked up, but the debugger was able to attach! Sweet. Let's try that again but enable the bit in the debug unit that will halt the CPU on lockup! Bingo! After a few tries the CPU locks up and is halted. The ROM_ACCESS_EN bit is set, PC is somewhere in 0x10000000-land and we're golden, right? Let's dump the SROM...

>  CortexProg -Spsoc4_explot_test
   WAITING FOR PSCR
   EXPLOIT :)
   EXPLOT2: YES
   HALTED
   dumping...
   [0x10000000] = 0x20000800
   [0x10000004] = 0x10000093
   [0x10000008] = 0x1000012b
   [0x1000000C] = 0x?   -4
   [0x10000010] = 0x?   -4
   [0x10000014] = 0x?   -4

Uh huh. Well, so I got twelve bytes and then any read faults. I guessed this could have made sense. They could have made the SROM execute-only. But the vectors are read by the CPU as data, so that must be readable as data. I tried to make my debugger simulate reading-as-code, but this also did not help - the Cypress chip did not implement that ability in their debug unit. Making code execute-only on ARMv6-M is a huge pain. The CPU lacks any real ability to load large values into a register, forcing one to use multiple instructions to craft such values. Generally this is solved with a PC-relative load of the constant, and placing the constant after the function. Of course this will not work if the area is execute-only. This makes my theory about the SROM being execute-only a weak one. But, what gives? Why can I not read the SROM? What next?

I did have the ability to break in SROM. I could read and write registers. I just could not read the actual instructions being executed. I decided to try tracing what the instructions do and see if that would reveal anything. I took the CPU out of LOCKUP mode, set the PC to the reset vector value, loaded each register with some unlikely-to-occur-in-real-life unique values (so I'd see what happens to them), and tried single-stepping. It worked! Again? It worked! Again! Debugger disconnected. Hm... not much. Let's try that again, but with the syscall entry point instead. Hm, here we get through a few more instructions. The first instruction decrements SP by 16 bytes, and then I see my magic values for R4, R5, and R6 saved there. The last value is 0xFFFFFFF9, which is the expected LR value for an NMI exception. And... oh, what's this? Somehow the next instruction loads the value 0x40100000 into R4. There is no way to do this in one instruction without using a PC-relative load! So then it CAN read itself! Cool. The third instruction somehow gets my syscall number with 0x80000000 set into R0. Hm! The syscall number with that top bit set goes into CPUSS_SYSREQ, which is at 0x40100004. The Cortex-M0 can only load a memory location by using an address in a register, and no other registers have meaningful values yet. This unambiguously proves that the third instruction must be "LDR R0, [R4, #4]", meaning that it will load the 32-bit word pointed to by (R4 + 4) into R0.

So... Here I have ability to step through code, modify registers, but not see the code. And now I have an instruction that reads anything anywhere, including likely the code. Well, let's try setting R4 to an address in SROM and seeing what happens? It works! I can set R4 to each address in SROM in turn, step over this instruction, and read the SROM! Wonderful. Success! Yes? Turns out... no. This gets us only 512 bytes of SROM. After that we start getting faults on the read again. Seriously, Cypress?! C'mon! Well, at least we have 512 bytes to analyze. It is not much, but it is something. Let's see why stepping through the reset vector failed so quickly.
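The read trick above can be sketched as a short routine on top of three debugger primitives: write a register, read a register, single-step. Everything named here is hypothetical (`DebugOps`, `leak_word`, and the fake core used to try it on a host); a real tool exposes its own equivalents. Since the instruction is "LDR R0, [R4, #4]", the sketch sets R4 to the target address minus 4.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical debugger interface; not taken from any real tool. */
typedef struct {
    void *ctx;
    bool (*write_reg)(void *ctx, unsigned reg, uint32_t val);
    bool (*read_reg)(void *ctx, unsigned reg, uint32_t *val);
    bool (*step)(void *ctx);               /* single-step one instruction */
} DebugOps;

enum { REG_R0 = 0, REG_R4 = 4, REG_PC = 15 };

/* Leak one word by stepping over "LDR R0, [R4, #4]" located at ldr_addr. */
bool leak_word(const DebugOps *dbg, uint32_t ldr_addr,
               uint32_t target, uint32_t *out)
{
    return dbg->write_reg(dbg->ctx, REG_R4, target - 4)  /* R4 + 4 == target */
        && dbg->write_reg(dbg->ctx, REG_PC, ldr_addr)
        && dbg->step(dbg->ctx)
        && dbg->read_reg(dbg->ctx, REG_R0, out);
}

/* ---- Host-side stand-in for a halted core, to try the logic ---- */
typedef struct {
    uint32_t r[16];
    uint32_t ldr_addr;       /* where the fake LDR lives */
    uint32_t base;           /* start of the readable window */
    const uint32_t *words;   /* window contents */
    uint32_t count;          /* window size in words */
} FakeCore;

static bool fake_write_reg(void *ctx, unsigned reg, uint32_t val)
{
    FakeCore *c = ctx;
    if (reg > 15) return false;
    c->r[reg] = val;
    return true;
}

static bool fake_read_reg(void *ctx, unsigned reg, uint32_t *val)
{
    FakeCore *c = ctx;
    if (reg > 15) return false;
    *val = c->r[reg];
    return true;
}

/* Executes only the single LDR this sketch cares about. */
static bool fake_step(void *ctx)
{
    FakeCore *c = ctx;
    uint32_t addr = c->r[REG_R4] + 4;
    uint32_t off = addr - c->base;
    if (c->r[REG_PC] != c->ldr_addr) return false;
    if (addr < c->base || (off & 3) || off / 4 >= c->count)
        return false;                      /* read faults outside the window */
    c->r[REG_R0] = c->words[off / 4];
    c->r[REG_PC] += 2;
    return true;
}

DebugOps fake_debug_ops(FakeCore *core)
{
    DebugOps ops = { core, fake_write_reg, fake_read_reg, fake_step };
    return ops;
}
```

Calling `leak_word` in a loop over successive addresses gives a dump of whatever the window allows, which on the real chip turned out to be the first 512 bytes.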

  PUSH    {R3-R7,LR}
  MOV     R5, #0
  MOV     R4, R5
  BL      0x1000032E

OK, that makes sense. The CPU boots in privileged mode, with the PRIVILEGED bit in CPUSS_SYSREQ set. This probably makes the rest of the SROM readable. As I cannot set this bit from the debugger context, this jump fails to fetch the first instruction of the destination function, as it is past the 512 bytes nonprivileged code can see. It thus faults. Pretty clear there. Let's see why the syscall handler disconnected me from the debugger.

  PUSH    {R4-R6,LR}
  LDR     R4, =0x40100000
  LDR     R0, [R4,#4]  ; CPUSS_SYSREQ
  LDR     R1, [R4,#4]  ; CPUSS_SYSREQ
  UXTB    R0, R0
  MOV     R2, #3
  LSL     R2, #27      ; 0x18000000 - set priv and flash access bits
  ORR     R1, R2
  STR     R1, [R4,#4]  ; CPUSS_SYSREQ

Makes sense as well. As soon as that write to CPUSS_SYSREQ sets the PRIVILEGED bit, the debugger loses access and I can no longer see what happens. This places me in an awkward place. In order to dump the rest of the SROM, a lot of things have to all come together. First, whatever I do, I cannot allow PC to exit SROM, as then the ROM_ACCESS_EN bit gets cleared and I cannot reset it (can only be set by hardware). If that bit is cleared, the entire SROM goes bye-bye. Second, I must somehow set the PRIVILEGED bit, to allow me to access the rest of the SROM. However, this loses my ability to debug or step or even see, so then I must somehow get a desired word or byte loaded into a register, clear the PRIVILEGED bit, and trap into the debugger. All this without being able to write any code. Remember, the SROM is read-only and PC cannot be allowed to exit it.

So, we have areas of memory we can execute but not write (SROM), and we have areas we can write but not execute (RAM). If this problem sounds familiar to you, then the acronym ROP will immediately spring to mind. For everyone else: I'll explain. Usually when you call a function, it will save some registers on the stack, including the return address. So if you craft the proper structures on the stack, and then jump into the middle of a function (hopefully to an instruction that does something that you want), it will run to completion, pop some registers off the stack, and return to the address you put there. If you put there the address of the middle of another function, and the registers that are loaded are just right, that next function might do something useful too. Repeat this with a long enough chain, and you can do great things. This is called Return Oriented Programming (ROP). Each function piece you execute is called a "gadget." Collect enough gadgets of the right kind, and you can do anything. Now, if you're thinking "this sounds like a gigantic unmanageable nightmare," you're right! The problem is that functions are not written in a way that makes this convenient. This sort of attack (to the best of my knowledge) was first demonstrated on x86 - an architecture where variable-length instructions allow you to jump into the middle of an instruction and get a different kind of instruction. ARMv6-M is not like that at all. Almost all instructions are 16 bits, and the few that are not provide nothing valuable if you jump into their middles. Furthermore, we only have 512 bytes of code to play with, which is not a lot - there were not that many useful gadgets that I could find.

I needed three. I needed a gadget that would store a value at a given address (to set the PRIVILEGED bit), a gadget to load a word (or a byte) from a given address to a register (so I could dump something), and a gadget that could again store a word at an address to clear the PRIVILEGED bit (gadget reuse is not always possible, depending on what registers a given gadget uses and loads) and return to my code in RAM that would finally fault and let the debugger read out the dumped value. While I found plenty of gadgets for writing values, there were very few that would read them. Return Oriented Programming is always an exercise of building a LEGO spaceship out of a Princess Castle LEGO set: you have to make do with what you have. The best gadget I could find to read was one that was used in what looked like a delay loop.

lbl_loop:
  LDR     R0, [R1,#0x10]
  LSL     R0, R0, #0xF
  BPL     lbl_loop  ;taken if 0x10000 bit is clear
  STR     R2, [R1,#0x10]
  BX      LR

So this will load a value pointed to by [R1 + 0x10], and if a particular bit is clear, load it again (for us this meant a hang as SROM contents do not change). Else it would store R2's contents there and return. This would give us the bottom 16 bits of every word. Not much. And also what DOES happen if you try to store to SROM? Some microcontrollers allow and ignore such writes. Is PSoC4 one of them? Turns out: no - it faults. Onward I went looking for another read gadget. The next best one looked like this:
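The behavior of this first gadget on a constant word is easy to model. A sketch with a made-up function name: the shift by 15 moves bit 16 of the loaded word into the sign bit, and BPL loops while that bit is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models LDR / LSL #15 / BPL on a word that never changes.
 * Returns false if the gadget would spin forever; otherwise stores
 * what R0 would hold when the loop exits. */
bool delay_gadget_exits(uint32_t word, uint32_t *r0_after)
{
    uint32_t shifted = word << 15;          /* LSL R0, R0, #0xF */
    if ((shifted & 0x80000000u) == 0)       /* BPL taken: bit 16 was clear */
        return false;
    *r0_after = shifted;
    return true;
}
```

Only words with bit 16 set get through, and what survives in R0 is the low part of the word shifted up by 15, so at best this leaks part of some words.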

lbl_loop:
  WFI
  LDR     R0, [R4]
  CMP     R0, #0
  BGE     lbl_loop
  BLX     R1
  B       lbl_loop

So this would load a word. If the top bit was clear, it would wait for interrupts and then try again (once again - for us this is an infinite loop), else it would jump to the function pointed to by R1. Promising, BUT very few Thumb instructions have the top bit set, so my hopes of using this to partially dump the rest of the SROM and look for better gadgets there were dashed as well - I'd never find anything useful if I only looked at some instructions.

By nature of how ROP works, you're usually looking at ends of functions, since that is where the return happens, and the deeper in you go, the more complex it gets to reason about what happens, and the more side effects the function may have that you need to manage. I was out of simple options, and even out of options of medium difficulty. It was time to start systematically looking at every single load instruction in my 512-byte dump and thinking about whether any of them could be used, no matter how complex. This produced no results for a few days and I was ready to give up, until I found this gem of a function, which I had discarded earlier for some reason. Actually, it does almost everything I need, if you squint at it just right.

  PUSH    {R4,LR}
  LDR     R2,  =0x40100000
  LDR     R4,  [R2,#4]     ;load CPUSS_SYSREQ
  LSL     R3,  R2, #8      ;0x10000000 - PRIVILEGED bit
  BIC     R4,  R3          ;remove PRIVILEGED bit
interesting_address:
  STR     R4,  [R2,#4]     ;store CPUSS_SYSREQ
  B       loop_compare     ;go to loop condition check
loop_body:
  LDR     R4,  [R2,#8]     ;loop body starts here - load CPUSS_SYSARG
  LDR     R4,  [R4,#0xC]   ;load word [CPUSS_SYSARG + 0x0C] points to
  STMIA   R0!, {R4}        ;store to R0 with post-increment
loop_compare:
  CMP     R0,  R1          ;if R0 <= R1 (unsigned)
  BLS     loop_body        ;do the loop again
  LDR     R0,  [R2,#4]     ;load CPUSS_SYSREQ
  ORRS    R0,  R3          ;set PRIVILEGED bit
  STR     R0,  [R2,#4]     ;store CPUSS_SYSREQ
  POP     {R4,PC}

Well, this function seems to take two parameters: the start and the end of an area of memory. It then loads what [SYSARG + 0x0C] points to, and writes it to each word in that passed buffer. This is sort of like a memset but for 32-bit words. Curiously, this function drops privilege before doing this, and then re-acquires it. This means that it is probably somehow user-facing and was written to avoid exploit potential. Luckily for us, it does not avoid it well enough. Let's see what we can do with this. We control the stack, so we can put any two words on top of it, which the function will pop into R4 and the return address at the end. We control all registers on entry, and we can enter at any point. We also control CPUSS_SYSARG and whatever it points to. So, here's the plan of attack: We'll jump to the "interesting_address" label. We'll load R2 with 0x40100000 so that the write to CPUSS_SYSREQ happens. We'll load R4 with 0x30000000 - the ROM_ACCESS_EN and PRIVILEGED bits. We'll load R0 and R1 with the same value - a pointer to a word in RAM that we'll want our dump to write to. To CPUSS_SYSARG we'll write the address we want to dump minus 0x0C. We'll leave R3 clear, as we do not need to set any more bits in CPUSS_SYSREQ at the end. This gets us most of the way there. It enables PRIVILEGED mode, loads a word we want, saves it somewhere safe, and jumps to an address we control. After this we no longer need SROM access, so we'll make it easy on ourselves and make sure the return address is in RAM - a piece of code we'll put there that will clear CPUSS_SYSREQ and then execute a breakpoint instruction so that we can break into the debugger. The CPU ends up in a weird state at the end of this, and we have to reset it for each word read. Now THIS is a mess! Did it work? Yes it did! At the slow speed of 10 seconds per 4 bytes, the entire SROM shows up on my screen. At this point we're in week two of the PSoC4000 exploration.
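To check the plan on paper, here is a toy C model that executes the function from the "interesting_address" label to its final POP, against a fake bus holding only the cells the gadget touches. It is a sketch under stated assumptions: the bus model, the struct names, and the rule "the secret word is readable only while both ROM_ACCESS_EN and PRIVILEGED are set" are simplifications made up for illustration, not a description of the real silicon.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUSS_SYSREQ       0x40100004u
#define CPUSS_SYSARG       0x40100008u
#define SYSREQ_PRIVILEGED  0x10000000u
#define SYSREQ_ROM_ACCESS  0x20000000u

/* Toy bus: two registers, one hidden SROM word, one RAM word. */
typedef struct {
    uint32_t sysreq, sysarg;
    uint32_t secret_addr, secret;  /* SROM word past the first 512 bytes */
    uint32_t ram_addr, ram;        /* RAM word that receives the dump    */
} ToyBus;

static bool bus_read(const ToyBus *b, uint32_t addr, uint32_t *val)
{
    const uint32_t need = SYSREQ_PRIVILEGED | SYSREQ_ROM_ACCESS;
    if (addr == CPUSS_SYSREQ) { *val = b->sysreq; return true; }
    if (addr == CPUSS_SYSARG) { *val = b->sysarg; return true; }
    if (addr == b->ram_addr)  { *val = b->ram;    return true; }
    if (addr == b->secret_addr && (b->sysreq & need) == need) {
        *val = b->secret;
        return true;
    }
    return false;                               /* anything else faults */
}

static bool bus_write(ToyBus *b, uint32_t addr, uint32_t val)
{
    if (addr == CPUSS_SYSREQ) { b->sysreq = val; return true; }
    if (addr == CPUSS_SYSARG) { b->sysarg = val; return true; }
    if (addr == b->ram_addr)  { b->ram = val;    return true; }
    return false;                               /* SROM writes fault */
}

typedef struct {
    uint32_t r0, r1, r2, r3, r4, pc;
    const uint32_t *sp;                         /* crafted stack */
} Regs;

/* Runs from interesting_address to POP {R4,PC}. False means a fault. */
bool run_from_interesting_address(ToyBus *b, Regs *r)
{
    if (!bus_write(b, r->r2 + 4, r->r4)) return false;       /* STR R4,[R2,#4]   */
    while (r->r0 <= r->r1) {                                  /* CMP ; BLS        */
        if (!bus_read(b, r->r2 + 8, &r->r4)) return false;    /* LDR R4,[R2,#8]   */
        if (!bus_read(b, r->r4 + 0xC, &r->r4)) return false;  /* LDR R4,[R4,#0xC] */
        if (!bus_write(b, r->r0, r->r4)) return false;        /* STMIA R0!,{R4}   */
        r->r0 += 4;
    }
    if (!bus_read(b, r->r2 + 4, &r->r0)) return false;        /* LDR R0,[R2,#4]   */
    r->r0 |= r->r3;                                           /* ORRS R0,R3       */
    if (!bus_write(b, r->r2 + 4, r->r0)) return false;        /* STR R0,[R2,#4]   */
    r->r4 = r->sp[0];                                         /* POP {R4,PC}      */
    r->pc = r->sp[1];
    r->sp += 2;
    return true;
}
```

Feeding it the register values from the plan (R0 = R1 = the RAM word, R2 = 0x40100000, R3 = 0, R4 = 0x30000000, CPUSS_SYSARG = target minus 0x0C) copies the hidden word into RAM and "returns" to whatever address sits second on the crafted stack.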

After some more work on the SROM, I realized that if I trigger the syscall from the CPU and not from the debugger, I can single-step into the SROM (still with no ability to read it), so the dumping can be done using a small RAM routine to trigger the syscall, a single step, and then this ROP exploit. There turns out to be no need for my previous hacky random approach of crashing the syscall handler, waiting for lockup, and all that. I replicated my work on PSoC4100, PSoC4200, and PSoC4200M as well, to verify that this approach continues to work. With a minor adjustment of some offsets, it works well. Now, I am not able to publish the SROM contents here (due to copyright or some such thing), but should you mysteriously find them elsewhere in the BitTorrent-sphere, the sha1 hash of the exactly-4096-byte-sized PSoC4000 SROM (sha1sum < srom-psoc4000.bin) is c11d3caddd64047ee4e2c46cc1dac475ec6302d3, and for the 4096-byte-sized PSoC4100 SROM (sha1sum < srom-psoc4100.bin) it is 55d2d3aa5797e7f50e597827f1ac5fb0fdd49415. (Yes, I know SHA1 is broken, but there is method to my madness, I promise.)

Part 2, or "what have you got to hide?"

Now that we're committing to spending lots of time with the PSoC4, here's just a bit more info to round off the knowledge. The memory map as we know it: 0x00000000 - flash; 0x0FFFF000 - 512 bytes of SFLASH; 0x10000000 - 4096 bytes of SROM; 0x20000000 - 2KB of RAM; 0x40000000 - peripherals. The new arrival there is the SFLASH. This is an extra area of flash that is used to store various persistent settings. Just like the main flash it has 64-byte pages, but no public API exists to write to it directly. Some APIs do write to it indirectly. Flash row protection settings are stored in the 0th row of SFLASH, and the whole-device protection setting is stored in the last byte of the 1st row. The rest of the SFLASH is a mystery. Cypress docs do give names for many of the locations but with no explanation of the meaning of the bits. I had to figure those out by hand, but I'm getting ahead of myself.
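The memory map above fits in a small classifier, handy when sorting through addresses seen in a dump. This is a sketch with made-up names. It uses the advertised 8KB flash size, and it assumes the peripheral area ends at 0x5FFFFFFF (the end of the architectural peripheral region), since the text gives only its start.

```c
#include <stdint.h>

typedef enum {
    REGION_FLASH, REGION_SFLASH, REGION_SROM,
    REGION_RAM, REGION_PERIPH, REGION_UNKNOWN
} Region;

/* Classify an address using the PSoC4000 map listed above. */
Region psoc4000_region(uint32_t addr)
{
    if (addr < 0x00002000u)                        /* 8KB of flash      */
        return REGION_FLASH;
    if (addr >= 0x0FFFF000u && addr < 0x0FFFF200u) /* 512 bytes SFLASH  */
        return REGION_SFLASH;
    if (addr >= 0x10000000u && addr < 0x10001000u) /* 4096 bytes SROM   */
        return REGION_SROM;
    if (addr >= 0x20000000u && addr < 0x20000800u) /* 2KB of RAM        */
        return REGION_RAM;
    if (addr >= 0x40000000u && addr < 0x60000000u) /* assumed upper end */
        return REGION_PERIPH;
    return REGION_UNKNOWN;
}
```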

The SoC has a few protection modes. Off the production line it is in "virgin" mode - it has no calibration data in it. The SFLASH is blank. Cypress calibrates various things, writes the calibration data as well as device-unique data into the SFLASH, and flips the bit to put the device into "open" mode. The official party line is that there is no way back to "virgin" mode, and even if there were, it would be devastatingly bad, as among other things the calibration data includes flash writing settings, without which you'd likely damage the flash trying to write to it. (I later discovered that there is a way back: just call the device protection API and pass 0xFADE as a parameter.) The remaining modes are "protected" and "kill." Protected mode allows debugging only so far as to erase the chip and turn it back to open mode. Kill mode disables debugging altogether. The reality of all of these statements turns out to be more nuanced, as you'll see.

The exploration of the SROM took me another two weeks, and is in fact still ongoing. I write this now, knowing there is still plenty more to discover here. I've only been focusing on the PSoC4000 SROM, but the PSoC4100, PSoC4200, and PSoC4200M SROMs appear to be very similar, so what I write here will likely apply there as well. (The only exception is that the 4200M SROM is bigger.) It turns out that the PSoC4 chips have a lot more registers than are documented in the docs, as many SoCs do. What sets the PSoC4 apart is that it uses that AHB HPROT[1] bit to hide them from normal code. Most of the fun registers are only accessible in PRIVILEGED mode. Of course, all of them are also undocumented. I did find that the TRM for PSoC4100/PSoC4200 chips documents some registers that the PSoC4000 TRM does not, and that was of some help.

Reverse engineering a ROM of a chip is very much like playing a 100x100 game of sudoku. You learn or infer something in one place, propagate that knowledge elsewhere, and another thing becomes clearer. Maybe you notice a footnote in the docs and match that up to a piece of assembly. Over time things get clearer. Function names go from nothing to "func_0x10000486" to "some_flash_related_thing_2" to "flash_erase_related" to "flash_erase_row_maybe" to "flash_row_erase_likely //param in bits 16 to ?? in 0x40110008" and finally to "void flash_row_erase() //param in ADRCTRL.RowID". The process cannot be done locally and involves the entire SROM dump, much like you cannot solve a piece of a sudoku board without seeing the rest. Some of the registers the code uses are documented and known for the PSoC4000. For some others, I assumed PSoC4100/PSoC4200 are similar enough internally (SROMs are almost identical). And the rest I just had to guess at, slowly developing my understanding of what each did and giving them names.

Over a few weeks I was able to decode most of the SROM and name most of the undocumented registers. These weeks were spent away from the physical chip as it was not necessary - I knew too little to experiment. I did learn quite a few very interesting things. The syscalls are dispatched based on a table, and there are permission checks. Each syscall has a permissions byte associated with it. The top nibble is for CPU access, the bottom is for the debugger. Each bit in the nibble is set if access from a particular protection mode is allowed. Specifically, the bit meanings are: 0x08 - virgin, 0x04 - kill, 0x02 - protected, 0x01 - open. The API table has 22 syscalls. Documentation lists only 11. Curiouser and curiouser. Four of them (0x01, 0x02, 0x03, 0x14) always return errors (the code literally does nothing but). One almost directly calls the function I had exploited to dump the SROM. One checksums the SFLASH from 0x0FFFF080 to 0x0FFFF1FE, does some math on the checksum, and compares the result to the last halfword in SFLASH (0x0FFFF1FE). This one is called during chip reset, and if it fails, the chip will be forced to reset forever in a loop. When I got to editing the SFLASH (see later sections) this bit me twice and I lost two chips to it. The rest of the syscalls are available in virgin mode only and do some sort of flash operations. I am not yet sure of their purpose.
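The permission check described above reduces to a nibble select and a bit test. A minimal sketch using the bit values from the text; the type and function names are made up, and the permission bytes in the usage note are invented examples, not values from the real table.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per protection mode, as listed above. */
typedef enum {
    MODE_OPEN      = 0x01,
    MODE_PROTECTED = 0x02,
    MODE_KILL      = 0x04,
    MODE_VIRGIN    = 0x08
} ProtectionMode;

/* Top nibble gates calls made by the CPU, bottom nibble gates calls
 * made by the debugger. */
bool syscall_allowed(uint8_t perm, ProtectionMode mode, bool from_debugger)
{
    uint8_t nibble = from_debugger ? (uint8_t)(perm & 0x0F)
                                   : (uint8_t)(perm >> 4);
    return (nibble & (uint8_t)mode) != 0;
}
```

A byte of 0x88 would mean "virgin mode only, from either side", while 0x31 would let the CPU call in open and protected modes but the debugger only in open mode.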

Most of the SROM is dedicated to writing flash. This might seem strange, as on many other small microcontrollers, writing flash is as simple as setting a few bits in the flash controller and waiting for some "busy" bit to go low. Cypress did not go this way. Instead they drive the entire process in software. There are dozens of steps and bits in various registers to do this. I was able to reverse engineer some of the things, namely erasing and writing flash and SFLASH. There are other things one can do, which I have not yet entirely figured out and will continue to work on. As I do not know the real names of the various registers, I had to make up names. I've verified that my code works and am providing it at the end of this article.

The actual process is pretty complex and involves the use of a few registers which I called ADRCTRL, FLASH_OP_TIMER, FLASH_LL_CTRL, WRITE_LATCH_VAL, and unknown_0030. There are also a few registers that get various trim values from SFLASH put into them as part of the process. Each step of the procedure is timed, using a timer in the flash unit (SPCIF). You write the proper number of flash clock cycles for the step into FLASH_OP_TIMER, and then set the 0x200 bit in FLASH_LL_CTRL to start the process. The bit will self-clear when done. If it is not masked, it will also trigger the SPCIF interrupt. The actual operation to be performed is controlled by the bottom 7 bits in FLASH_LL_CTRL. FLASH_LL_CTRL also contains bits to enable flash access for write/erase (which disables access for read immediately). The register I called ADRCTRL controls where in the flash we write. It contains the row number of flash to operate on, the byte offset into the latches, and a few bits to assist in selecting which flash to write to (SFLASH/flash) and when to write to latches. The SROM contains a lot of code that does more flash things than basic erase and write. I am still working on decoding it. I also have full-chip-erase and per-sector (64 rows) erase decoded. Also, I did check: the calibration data does vary between chips of the same model and lot - Cypress must really be calibrating them at the factory. Most flash functions devolve to one central function which does most of the operations using calibration data in SFLASH for DAC values and timing. Given the lack of names in the binary, I named it myself. I call it "flash_rubber_to_road()". It is actually also directly exposed as an API (number 0x10). If you are interested in more details, contact me. Of course, the whole process must be driven from RAM or SROM, as flash is entirely inaccessible during flash operations.
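One timed step, as described above, could look like the sketch below. The register names are the made-up ones from the text; the struct layout, the function names, and the choice to write the operation code and the start bit in a single store are assumptions for illustration only, and the real SROM sequence may order things differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the two registers involved in timing a step.
 * The real offsets inside the SPCIF block are not given here. */
typedef struct {
    volatile uint32_t op_timer;   /* FLASH_OP_TIMER: step length in flash clocks */
    volatile uint32_t ll_ctrl;    /* FLASH_LL_CTRL: operation, start bit, etc.   */
} FlashStepRegs;

#define FLASH_LL_START    0x200u  /* set to start, self-clears when done */
#define FLASH_LL_OP_MASK  0x7Fu   /* bottom 7 bits select the operation  */

/* Program the timer, select the operation, and kick off the step.
 * Other bits already set in FLASH_LL_CTRL are preserved. */
void flash_step_start(FlashStepRegs *r, uint32_t cycles, uint8_t op)
{
    r->op_timer = cycles;
    r->ll_ctrl = (r->ll_ctrl & ~FLASH_LL_OP_MASK)
               | (op & FLASH_LL_OP_MASK)
               | FLASH_LL_START;
}

/* The step is finished once the hardware has cleared the start bit. */
bool flash_step_done(const FlashStepRegs *r)
{
    return (r->ll_ctrl & FLASH_LL_START) == 0;
}
```

On hardware the caller would poll `flash_step_done()` (or wait for the SPCIF interrupt) from code running out of RAM, since flash cannot be read while the operation is in progress.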

"So, why do I care?"

What would you say, if I told you that CY8C4013SXI-400 has 16K of flash, not 8? Probably you would not believe me, since any attempt to access any flash past 8K will fault. But it is so. What if I told you I could create an undetectable rootkit for the PSoC4 family of chips? You'd probably be skeptical, and yet I've done both of these things. I'll explain how.

PSoC4 has a few interesting features. There is a register that allows the chip to make only 1/2, 1/4, or 1/8 of flash available. Anything past that point will be inaccessible and cause a fault on access. There is another register of the same type for RAM. After looking at the values in them I realized that while my chip really did have 2KB of RAM, it actually had 16KB of flash, and was limited to 8KB by this register. These registers are write-once but are backed by RAM. That is to say that the chip loads them on each boot. Loads from where? SFLASH! Since they are write-once, we cannot write to them, but if only we could edit SFLASH, we could change the values there and get more flash or RAM. More on this later. There were a few other interesting registers. One could also limit RAM or flash in such a way that the extra area is privileged-only. See where I am going with this? Rootkits love areas of memory that are inaccessible. When limited in this way, the normal code has no way to access these areas or even know they exist. You might ask why normal code cannot just inspect these limiting registers to find out if there are hidden areas. Well, these registers themselves are privileged-only. Cool, huh?

There is yet a cooler thing. SFLASH has a word that determines where execution starts, so this is how my rootkit can get to run before user code does. Even cooler, if there is any flash reserved for privileged mode only, the SROM assumes there is an alternate syscall table there, allowing me to intercept or add syscalls. This is how the syscall can patch behind itself the hole it will use to install itself (more on this later).

Since I cannot easily set up the proper write code using ROP techniques (it is too complex), and there exist no public syscalls to write SFLASH, I am simply forced to reimplement all the flash write code. It is of course at this point that two chips died as I tried to edit the SFLASH and forgot about the checksum code. But on the third, everything worked fine and I was able to use 16K of flash. After a little bit more work I was able to prove out the rootkit concept. I will not be releasing this YET, but I might in the future.

Why is this scary? Seemingly most Cypress 32-bit products use a variation on this SROM idea and are vulnerable to a similar attack due to their very design. One might say they are designed for this to start with. So the good news is you can have more flash for free. The bad news is that there might be a rootkit in your laptop's touch pad or your phone's touchscreen controller. A rootkit that might store input and replay it later when requested via a special gesture. Scared yet?

Want more? Perhaps I wanted to destroy the flash in your chip permanently. Since the timing and voltages are entirely under software control, this is possible too.

"Scary, but this all requires a debugger connection and a lot of setup, right?"

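#include <stdbool.h>

// A function to be run in privileged mode
typedef void* (*SupervisorRunnableFunc)(void* arg);

// Runs fnc(param) in privileged (NMI) mode
bool supervisorRun(
	SupervisorRunnableFunc fnc,	// code to run; it must not fault
	void *param,			// passed to fnc as its argument
	void **retVal			// presumably receives fnc's return value
);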

Well, no! This is where it gets REALLY cool. I wanted a way to do all of this without having to have an external debugger attached. Sadly, Cypress engineers did a reasonably good job of securing their SROM. I found no out-of-bounds accesses I could exploit, the secret APIs were mostly all gated to virgin-mode-only, and so on. But I was determined to find something. Guess what? That function I used to dump the SROM keeps on giving. Look at it carefully. It is exported as an API via a rather circuitous way. We already established that it drops privileges before doing its write, but RAM is all free-for-all. So what if we knew how deep the stack would be when it ran? And what if we used this func to overwrite its own return address on stack? And what if we pointed this at a function of our choice? Well, when this one returns, it will jump to ours in privileged mode. As we're not in SROM, we'll lose SROM access, but we'll still be in privileged mode. Returning from this will be difficult, but with some work we can do it too. We cannot use SROM itself as access to it goes away the moment we jump out of it. But, had we saved all of our registers up front, we could restore them all. We could then craft a proper exception frame to get the CPU to return back to non-NMI mode whence we came. That's right! You read that right! A way to execute arbitrary code in privileged mode! This all sounds like a lot of moving parts, but I did do exactly this. In fact I automated the process of finding the stack offset and restoring this all in such a way that the same code runs on PSoC4000 and PSoC4100. The basic primitive my code provides is as simple as: run this code in privileged mode. As this does run in NMI mode, you are not allowed to fault in there. Nor can you debug, but such is life. This I am also publishing today, of course. The license is below along with the download.
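To show what using this primitive looks like from the caller's side, here is a small host-side sketch. The supervisorRun body below is a stand-in that just calls the function directly so the example compiles and runs on a PC. It performs no privilege change and is not the code from the download. The payload readWord and the meaning I give to the bool return value (true on success) are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef void* (*SupervisorRunnableFunc)(void* arg);

/* Stand-in for the real primitive: same prototype, no privilege change. */
bool supervisorRun(SupervisorRunnableFunc fnc, void *param, void **retVal)
{
    assert(fnc != NULL);
    void *ret = fnc(param);
    if (retVal != NULL)
        *retVal = ret; /* hand the payload's result back to the caller */
    return true;       /* assumed meaning: the payload ran */
}

/* Hypothetical payload: read one 32-bit word from the address in arg.
 * On the chip, this is where a privileged-only register could be read. */
void *readWord(void *arg)
{
    volatile uint32_t *src = arg;
    return (void *)(uintptr_t)*src;
}
```

A caller then does something like: declare void *out, call supervisorRun(readWord, &someWord, &out), and cast out back to an integer. Remember that on the real chip the payload runs in NMI mode, so it must not fault and cannot be debugged.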

Responsible disclosure attempts

I did my best to reach Cypress. I tried contacting people on LinkedIn who worked on the PSoC4 family, I tried the Cypress support forum, and I tried the online case system. They have no public security contact that I could find. This is sad, because these vulnerabilities are likely fixable.

Future work

Plenty to do still. I am still trying to figure out what those extra exposed flash APIs can do, and I am working on a cooler, stealthier rootkit PoC. Stay tuned and watch this space for updates.

The goods

So, if you want to unlock more flash in your chip, you'll need to write a 0x00 byte to 0x0FFFF142. This will, however, brick your chip unless you recalculate the checksum. But there is a simpler way. There is a byte you can write to disable checksum, so write 0x01 to 0x0FFFF169. That is all - double the flash on your CY8C4013SXI-400. Enjoy!
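As a sketch, here are the two byte changes applied to a RAM copy of SFLASH. The two addresses are the ones given above. SFLASH_BASE (the address the copy is assumed to start at) and the function name patch_sflash_image are assumptions of the example. Getting the patched bytes back into SFLASH needs the flash writing code from the download. This snippet only edits a buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SFLASH_BASE       0x0FFFF000u /* assumed start address of the copied image */
#define ADDR_FLASH_LIMIT  0x0FFFF142u /* write 0x00 here to unlock the extra flash */
#define ADDR_CSUM_DISABLE 0x0FFFF169u /* write 0x01 here to disable the checksum */

/* Patches a RAM copy of SFLASH that starts at SFLASH_BASE. */
void patch_sflash_image(uint8_t *image, size_t len)
{
    assert(image != NULL);
    assert(len > ADDR_CSUM_DISABLE - SFLASH_BASE);

    image[ADDR_FLASH_LIMIT - SFLASH_BASE] = 0x00;
    image[ADDR_CSUM_DISABLE - SFLASH_BASE] = 0x01;
}
```

Both bytes are patched in one pass on purpose: changing the limit byte alone, without either fixing or disabling the checksum, is exactly how you brick the chip.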

Code (including better includes with newly discovered registers and flash writing code): => [DOWNLOAD] <= The license on this code is "free for any non-commercial use as long as you give me credit in any writeup, manual, or printed material that come with your device that uses this code. No commercial use. For commercial licensing, contact me."

© 2012-2017