These anti-cloning methods and folk remedies include:

1. Grinding: use fine sandpaper to grind the part number off the chip. This works reasonably well for obscure chips, but for common ones a copycat only needs to guess the rough function and check which pins go to ground and power to identify the real chip.
2. Potting: cover the PCB and all its components with the kind of glue that sets rock-hard (the type used for bonding steel or ceramics). You can deliberately bury five or six jumper wires inside (thin enameled wire works best), twisted together, so that digging out the glue inevitably breaks the wires and leaves no clue how to reconnect them. Make sure the glue is non-corrosive and that the potted area doesn't dissipate much heat.
3. Use a dedicated security chip, such as ATMEL's AT88SC153, which costs only a few yuan. As long as the firmware cannot be disassembled, a copycat cannot clone the design even after capturing every signal with a logic analyzer.
4. Use chips that are practically uncrackable, such as EPLDs from the EPM7128 up, or ACTEL CPLDs. The cost is high (over a hundred yuan), so this is unsuitable for small products.
5. Use mask ICs. A mask IC is generally much harder to crack than a programmable chip, but this requires large volumes. Masking applies not only to MCUs but also to ROMs, FPGAs and other dedicated chips.
6. Use bare dies: copycats can't read a part number or see the wiring. Just don't make the chip's function too easy to guess, and ideally bury some extra parts in the blob of black epoxy, such as small ICs or resistors.
7. Put resistors of 60 ohms or more in series with low-current signal lines (enough to keep a multimeter's continuity beeper quiet). This makes tracing the connections with a multimeter a lot more troublesome.
8. Use plenty of small unmarked components (or ones with only house codes) in the signal path, such as small chip capacitors, TO-XX diodes and transistors, and little three- to six-pin ICs; figuring out what they really are takes some effort.
9. Swap some address and data lines around (except on RAM; the software must apply the matching permutation), so copycats tracing the connections can't extrapolate from a few measured pins.
10. Use buried and blind vias so the vias are hidden inside the PCB. This is expensive and only suits high-end products.
11. Use other custom companion parts, such as a custom LCD panel, a custom transformer, a SIM card, an encrypted disk, and so on.
12. File a patent. Given how poor the IP-protection environment is here, the method of first choice abroad can only come last on this list.

posted @ 2009-10-21 08:50 王小明


MASM ------------> GNU Assembly
----------------------------------------
[ebp-4] ------------> -4(%ebp)
[foo+eax*4] --------> foo(, %eax, 4)
[foo] ---------------> foo(, 1)
gs:f00 ------------->%gs:f00
mov ebx, eax ------> movl %eax, %ebx
mov eax, 25 -------> movl $25, %eax
mov [eax], 25 ------> movl $25, (%eax)
mov [eax+2], 25 ---> movl $25, 2(%eax)
public --------------> .globl

posted @ 2009-10-09 14:40 王小明

1. Different hardware platforms use different parameter-passing mechanisms by default. RISC CPUs such as MIPS usually pass arguments in registers, while code compiled for x86 passes arguments on the stack by default. Register passing can be forced via compiler options or calling conventions, but a CISC CPU simply has few registers to spare.
2. Parameter passing also differs between software platforms: Windows, Linux, and even Cygwin on Windows each have their own conventions.

In general, under Cygwin and Linux, if a function is declared regparm(n) the compiler will pass arguments in eax, edx, ecx (strictly in that order: first eax, then edx, the third argument in ecx). Under Visual Studio on Windows, if a function is declared __fastcall the compiler uses ecx and edx (strictly ecx first, then edx, arguments taken left to right; eax is never used for passing, since it receives the return value). So a __fastcall function with more than two arguments still pushes the rest onto the stack, and isn't so fast after all...
So be very careful when porting assembly code.

posted @ 2009-10-08 16:54 王小明

2009-01-08 22:06
gcc [option | filename]...
g++ [option | filename]...

DESCRIPTION
1. Preprocessing: produces a .i file [preprocessor: cpp]
2. Compilation: translates the preprocessed file into assembly, producing a .s file [compiler: cc1/egcs]
3. Assembly: turns the assembly into object (machine) code, producing a .o file [assembler: as]
4. Linking: links the object code into an executable [linker: ld]
An input file is processed by one or more of the four stages: preprocessing, compilation, assembly, and linking. The source file suffix identifies the source language, and for the compiler the suffix controls the default handling:

.c      C source; preprocess, compile, assemble
.C      C++ source; preprocess, compile, assemble
.cc     C++ source; preprocess, compile, assemble
.cxx    C++ source; preprocess, compile, assemble
.m      Objective-C source; preprocess, compile, assemble
.i      preprocessed C; compile, assemble
.ii     preprocessed C++; compile, assemble
.s      assembly source; assemble
.S      assembly source; preprocess, assemble
.h      preprocessor (header) file; not usually named on the command line


Files with other suffixes are passed to the linker. They commonly include:

.o     object file
.a     archive (library) file


Unless -c, -S, or -E is used (or compilation errors stop the whole process), linking is always the final step. In the linking stage, all .o files corresponding to source programs, -l libraries, and unrecognized file names (including explicitly named .o object files and .a archives) are passed to the linker in command-line order.

OPTIONS
Options must be given separately: `-dr' is quite different from `-d -r'.
Most `-f' and `-W' options have two contrary forms: -fname and -fno-name (or -Wname and -Wno-name). Only the non-default form is listed here.

[Options in detail]
-x language filename
Sets the language of the file, overriding the suffix; it applies to all
subsequent file names. By convention C sources end in .c and C++ sources
in .C or .cpp, so if you insist on naming your C code file hello.pig,
you need this option. It stays in effect for every following file name
until the next -x.
The accepted language arguments are:
`c', `objective-c', `c-header', `c++', `cpp-output',
`assembler', and `assembler-with-cpp'.
The names should be self-explanatory.
Example:
gcc -x c hello.pig

-x none filename
Turns off the previous -x, letting gcc identify the file type from its suffix again.
Example:
gcc -x c hello.pig -x none hello2.c

-c
Runs only preprocessing, compilation and assembly, i.e. it just produces an object file.
Example:
gcc -c hello.c
generates the object file hello.o.

-S
Runs only preprocessing and compilation, i.e. compiles the file down to assembly code.

-E
Runs only the preprocessor. No file is generated; redirect the output to
a file if you want to keep it.
Example:
gcc -E hello.c > pianoapan.txt
gcc -E hello.c | more
Take your time reading it: even a hello world preprocesses to about 800 lines.

-o
    Names the output file.


-pipe
Uses pipes instead of temporary files between compilation stages. This may cause problems with non-GNU assembler tools.

gcc -pipe -o hello.exe hello.c

-ansi
Turns off the GNU C features that are incompatible with ANSI C and activates the ANSI C ones (this disables the asm, inline and typeof keywords as well as predefined macros such as unix and vax).

-fno-asm
Implements part of what -ansi does: it stops asm, inline and typeof being treated as keywords.

-fno-strict-prototype
Affects only g++: with this option, g++ treats a function declared without arguments as having an unspecified argument count and types, rather than as taking no arguments.
gcc treats argument-less declarations as having unspecified types regardless of this option.

-fthis-is-variable
For compatibility with traditional C++: allows `this' to be used as an ordinary variable.

-fcond-mismatch
Allows the second and third operands of a conditional expression to have mismatched types; the expression then has type void.

-funsigned-char
-fno-signed-char
-fsigned-char
-fno-unsigned-char
These four options control the char type, making char either unsigned char (the first two) or signed char (the last two).

-include file
Includes a file: when the file being compiled needs another one, this option supplies it, acting just like an #include in the source.
Example:
gcc hello.c -include /root/pianopan.h

-imacros file
Expands the macros from file into gcc/g++'s input file; the macro definitions themselves do not appear in the input file.


-Dmacro
Equivalent to #define macro in C.

-Dmacro=defn
Equivalent to #define macro defn in C.

-Umacro
Equivalent to #undef macro in C.

-undef
Undefines all non-standard macros.

-Idir
For #include "file", gcc/g++ first searches the current directory for the header you named; if it isn't found there, it falls back to the default header directories. With -I it searches the directory you specified first, then continues in the normal order.
For #include <file>, gcc/g++ searches the -I directories, and only then the system default header directories.

-I-
Cancels the effect of the preceding -I, so it is normally used after an -Idir.

-idirafter dir
If the search in the -I directories fails, look in this directory.

-iprefix prefix
-iwithprefix dir
Usually used together: when the -I search fails, look under prefix+dir.

-nostdinc
Stops the compiler searching the default system header directories; usually combined with -I to pin down exactly where headers come from.

-nostdinc++
Do not search the g++-specific standard directories, while still searching the other paths; used when building the libg++ library.

-C
Keeps comments during preprocessing; handy together with -E when you're analyzing a program.

-M
Outputs dependency information, listing everything the object file depends on.
Try gcc -M hello.c; it's quite simple.

-MM
Like -M, but omits the dependencies on system headers pulled in via #include <...>.

-MD
Like -M, but writes the output to a .d file.

-MMD
Like -MM, but writes the output to a .d file.

    -w  Suppresses all warnings.

    -Wall  Enables (almost) all warnings.

-Wa,option
Passes option to the assembler; if option contains commas, it is split at the commas into multiple options, which are then passed to the assembler.

-Wl,option
Passes option to the linker; if option contains commas, it is split at the commas into multiple options, which are then passed to the linker.


-llibrary
Names a library to use when linking.

-Ldir
Adds a directory to the library search path. Use it for your own libraries; otherwise the compiler looks only in the standard library directories. dir is the directory name.

-O0
-O1
-O2
-O3
The compiler's four optimization levels: -O0 means no optimization and is the default; -O3 is the highest level.

-g
Tells the compiler to produce debugging information.

-gstabs
Produces debugging information in stabs format, without the gdb-specific extras.

-gstabs+
Produces debugging information in stabs format, including the extra information only gdb can use.

-ggdb
Generates as much debugging information for gdb as possible.
-static
Disables dynamic libraries, so the resulting binary is usually much larger, but runs without needing any shared libraries.
-shared
Produces a shared object; usually used when building shared libraries.
-traditional
Tries to make the compiler support traditional C features.

posted @ 2009-09-29 09:29 王小明

First download mipseltools-gcc412-glibc261.tar.bz2, a mipsel toolchain. It doesn't include gdb, so you have to build that yourself.

Unpack mipseltools-gcc412-glibc261.tar.bz2 into /opt and edit /etc/bash.bashrc to add the mipsel toolchain's bin directory to the path.

Then build gdb:

Download gdb-6.8.tar.gz from http://www.gnu.org/software/gdb/


tar -zxvf gdb-6.8.tar.gz


cd gdb-6.8/ 


sudo ./configure --host=i686-pc-linux-gnu --target=mipsel-linux

sudo make

sudo make install


Note: done on an Ubuntu platform.

posted @ 2009-09-26 09:17 王小明

In another thread sniff_381 asked me to explain what dynamic recompilation (dynarec for short) is.
Since most NG emus have a dynarec now, I thought it might be a good idea to cover the topic in a separate thread.

I have to admit that I haven't programmed a dynarec myself yet, but I have decent knowledge of the basics and some details. I'm not totally sure how much I should go into depth anyway. I guess I'll see if there is enough interest to talk about details.

First of all, the term "dynamic recompilation" is a bit odd, because "to recompile" often means to compile the source code of a program again, but the older term "binary translation" is more precise, since the binary code of a game or application is translated - not the source code - and "dynamic" only means that it is done during runtime and on demand.

So what's the difference to "traditional" or "interpretive" emulation?
An interpretive emulator always picks the instruction the program counter (PC) points to, decodes it, and executes it, just like a real processor would. So every time the emulator comes across the same instruction it has to perform all those steps again.
In his article How To Write a Computer Emulator (http://fms.komkon.org/EMUL8/HOWTO.html) Marat Fayzullin uses this pseudo C-code sequence to describe the process:
Code:

Counter=InterruptPeriod;
PC=InitialPC;

for( ;; )
{
  OpCode=Memory[PC++];
  Counter-=Cycles[OpCode];

  switch(OpCode)
  {
    case OpCode1:
    case OpCode2:
    ...
  }

  if(Counter<=0)
  {
    /* Check for interrupts and do other */
    /* cyclic tasks here                 */
    ...
    Counter+=InterruptPeriod;
    if(ExitRequired) break;
  }
}

Dynamic recompilation deviates from this procedure by working with whole blocks of code instead of single instructions, and by translating those blocks into the machine language of the processor the emulator is running on. There would be no speed advantage if the translated blocks weren't cached and simply recalled as soon as the program counter enters that block again.
Here is some sample code from my (unfortunately still unfinished) DRFAQ (http://www.dynarec.com/~mike/drfaq.html):
Code:

/* the following line defines a 'function pointer',               */
/* which can be used to call the code generated by the translator */
/* CTX is the context of the processor, ie. the register values   */

int (*dyncode)(Context *CTX);


/* the following simplified loop is often called the "dispatcher"  */

for( ;; ) {

  /* try to find the current address of the PC in the translation cache */
  address = block_translated(CTX->PC);

  /* nothing found, ie. first translate the code block starting at the PC address */
  if (address == NULL)
    /* do the translation and add it to the translation cache */                                     
    address = translate_block(CTX->PC);

  /* point the function pointer to the address of the translated code */
  dyncode = (int(*)(Context*)) address;

  /* call the translated code with the current context */
  status = (*dyncode)(CTX);

  /* handle interrupts and other events here */                                
}

That's basically how a dynarec works, only that I still haven't explained how the translation cache and of course the translation are handled.

I spoke of code blocks several times, and it might be a good idea to define the term, since not all will be into compiler theory...
In compilers the smallest block of cohesive instructions is called a basic block. Such a block has a starting point and ends with the next conditional jump or branch, ie. the block ends as soon as there is a possibility that the program counter changes other than by advancing to the next instruction. It's also important that no other code can jump into the middle of a basic block, only to its starting address, because only then can the compiler treat it as a separate unit of code that can be optimized in every possible way.
Most dynarecs probably work with basic blocks, but some end the block at the next unconditional jump or branch instead, which leads to larger blocks, often called translation units. This leads to faster code, because all conditional branches can jump within the translated code without having to go through the dispatcher loop first, but it can make interrupt handling problematic, since you have no guarantee that the code returns to the dispatcher. Of course that could be handled within the generated code, but it makes things more complicated.

I think that's enough for an introduction. If there are any questions feel free to ask, and if there is interest in extending the parts I haven't covered yet, I could write something about the translation cache, some translation problems, register allocation, the difference to threaded interpretation, etc.
__________________
The crownless again shall be King



As announced yesterday I'll provide a simple example now that shows how to call dynamically generated code. For some this might be even more intimidating, but for those who have looked up how function pointers work in C and have a slight understanding of x86 assembly, it should make clear how that calling process works.

Code:

/* In the beginning we'll have to define the function pointer.  */
/* I called the function 'dyncode' and gave it an int argument  */
/* as well as an int return value just to show what's possible. */

int (*dyncode)(int);  /* prototype for call of dynamic code */

/* The following char array is initialized with some binary code */
/* which takes the first argument from the stack, increases it,  */
/* and returns to the caller.                                    */
/* Just very simple code for testing purposes...                 */

unsigned char code[] = {0x8B,0x44,0x24,0x04,  /* mov eax, [esp+4] */
                        0x40,                 /* inc eax          */
                        0xC3                  /* ret              */
                       };


/* Include the prototypes of the functions we are using... */

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>   /* mmap, so the code buffer is executable */


int main(void)
{
  /* To show you that the code can be dynamically generated    */
  /* although I defined static data above, the code is copied  */
  /* into an allocated memory area and the starting address is */
  /* assigned to the function pointer 'dyncode'.               */
  /* On current systems malloc'd memory is not executable, so  */
  /* the area is requested with PROT_EXEC via mmap instead.    */
  /* The strange stuff in front of the mmap is just to cast    */
  /* the address to the same format the function pointer is    */
  /* defined with, otherwise you'd get a compiler warning.     */

  dyncode = (int (*)(int)) mmap(NULL, sizeof(code),
                                PROT_READ | PROT_WRITE | PROT_EXEC,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  memcpy((void *)dyncode, code, sizeof(code));

  /* To show that the code works it is called with the argument 41 */
  /* and the retval should be 42, obviously.                       */

  printf("retval = %d\n", (*dyncode)(41) );  /* call the code and print the return value */ 

  return 0;
}

This code has been written with GCC in mind, but it should work with any C compiler on any 32-bit x86 operating system that passes function arguments on the stack.

I originally wrote this example with some ARM machine code instead of x86, and all that I had to change was the definition of the code[] array.
That's the nice thing about working with a function pointer to call dynamic code, apart from the generated code everything else is totally portable to any system with a C compiler.

A warning to those working with Harvard architecture processors (ie. those with split instruction and data caches):
After copying the code and before calling it you'll have to flush the caches, otherwise the code will be in the data cache but not in the instruction cache and the processor will get into trouble.

While x86 processors nowadays have split L1 caches as well, it's not a problem there because they handle such issues in hardware, for transparent compatibility with older x86 processors that still had a unified cache.

Ok, so much for dynamically generated code...
Anyone still with me?


 When is using a dynarec reasonable?
Those who are still following this thread probably noticed that dynamic recompilation is far from being trivial, and I haven't even touched the more complex issues yet.

This leads to the ultimate question: When does it make sense to use dynamic recompilation anyway?

The advantages of dynamic recompilation are:

    * more speed
    * more speed (nothing else really)


The disadvantages of dynamic recompilation are:

    * quite complicated
    * hard to debug
    * not as exact as interpretive emulation
    * not portable to systems with other processors
    * problems with self-modifying code


This means: as long as you can pull the whole emulation off at a decent speed using traditional emulation (preferably even a portable solution), just do it and don't give dynamic recompilation a second thought.
Although I've seen people toying with dynarecs for 6502 and similar 8-bit processors it's not worth the hassle, since a nice CPU core written in C would be portable to different systems and should run at full speed on any current and even most older computers.

Even most 16-bit processors should be tried in interpretive emulation before thinking of dynamic recompilation. One of the few reasonable 16-bit candidates would be the 68000, because it is widely used and quite complex, so a dynarec for it might speed up a lot of emulators if you stick to the same API.

Where dynamic recompilation really shines is 32-bit and 64-bit processors, because it makes sense to do operations on hardware registers when the original code does so. Especially the MIPS (used in PSX and N64, eg.) and SuperH (Saturn eg.) processors with their simple instruction set should be emulated via dynamic recompilation to get a decent speed.

One thing to keep in mind is that an emulator with a dynarec needs a lot of RAM, because it not only needs at least the same amount of memory as a traditional emulator but additionally also memory for the translation cache, ie. the code blocks that have already been translated.

In the end, it's a good idea to start with interpretive emulation to see if that's fast enough, and switch to dynamic recompilation when it isn't. During the switch it's a good idea to keep both CPU emulations so you can test the dynarec against the interpreter, which should make debugging a little easier.
__________________
The crownless again shall be King


Some emulators that claim to use dynamic recompilation actually utilize a technique called threaded interpretation, eg. Generator does that (http://www.squish.net/generator/docs.html).

How does threaded interpretation differ from dynamic recompilation and what do they have in common?

Both techniques work on code blocks and "translate" these into some other representation. This means that both share the disadvantages of needing more memory than a traditional emulator and of having problems when already translated code is changed by the running program (keyword: self-modifying code).

But instead of translating to machine code, threaded interpretation fills the translation cache with the addresses of the instruction emulation routines, ie. each instruction found in the source binary is translated to an address (plus parameters) that points to the piece of code in the emulator that emulates this instruction.

The only thing you save compared to a traditional emulator so far is the repetitive decoding of all instructions in a block. But threaded interpretation can go one step further. Since a whole block of code has to be analyzed anyway, you can find out which condition flags actually need to be calculated for an instruction, ie. if a certain flag is overwritten by the side-effect of a following instruction before it can be tested or used as input by another, it doesn't have to be calculated at all. Since the calculation of condition flags can often take more than half the time needed to emulate an instruction, this approach can lead to a noticeable speed improvement.
The emulator Generator mentioned above has two different emulation functions for each single instruction, one that calculates all flags and another that doesn't calculate any flags. The address of one of these functions will be added to the translation cache as appropriate.

The advantage of threaded interpretation is that it can be portable when it is programmed in a high-level language (like C) and works with function pointers.
The disadvantage is that you cannot access hardware registers directly (unless the instruction routines are written in assembly and you are using static register allocation, but then it wouldn't be portable and you could also use dynamic recompilation), so you still need to access the register file in memory every time an instruction reads or alters a register.

I think that should be enough about threaded interpreting...
Maybe I should cover register allocation next, since I already mentioned it here.


 Intermediate Summary
Ok, maybe I should sum up some of the stuff here so that no one can complain about too much information ;-)

Dynamic recompilation also known as dynamic binary translation is the process of translating binary code blocks during runtime into the binary code of the host machine.

The translated code is collected in a translation cache.

In a dispatcher loop the dynarec decides if certain code blocks still have to be translated and eventually calls the translated code.

The generated code is ideally called using a function pointer.
Here is an example how the function pointer code above has to be changed to get the same result on an ARM processor based machine:

Code:

unsigned long code[] = {0xE2800001,  /* ADD R0, R0, #1 */
                        0xE1A0F00E   /* MOV PC, LR     */
                       };

As you can see, only the code[] array (which would be the generated code in a real dynarec) has to be changed. The switch from "unsigned char" to "unsigned long" was only made because ARM has fixed-length 32-bit instructions, but since we cast it to the function pointer later there is no difference.

Of course the code generator has to be different on each processor, but it makes sense to make everything else portable, thus you only have to write a new code generator but not a totally new emulator.

Mind you that an ARM processor with separate instruction and data caches (eg. the StrongARM) needs its caches to be flushed before the code can be called, but that's operating system specific and I won't go into that detail here.
__________________
The crownless again shall be King


 Register Allocation
One of the big advantages of dynamic recompilation is that you can actually use the registers of the host machine when an emulated instruction uses registers. This can not only reduce the amount of instructions needed to emulate another instruction but also minimize slow memory accesses for every referenced register.
The process of mapping the emulated registers to the registers of the host machine is called register allocation.

Unfortunately a lot of dynarecs in emulators still fetch all needed register values from the register file (just a memory structure where the contents of all registers of the emulated processor are stored) at the beginning of each emulated instruction and write the result back to the register file afterwards. Only a little better are those that don't store the value of the result register if it is used as input by the following instruction.
This really spoils a large part of what dynamic recompilation is about, but it is still seen quite often in "real world" examples.

There are two different methods of allocating registers:

Static register allocation: This means that in every translated block the same emulated registers are always allocated to the same host registers. When the host machine has enough registers to hold all emulated registers this is the optimal solution, but there are also some advantages even if it has fewer registers (mainly related to timing and translation block handling; I'll cover that later). In the latter case this means that only some of the emulated registers are held in host registers (ideally the most often used ones) and for the remaining ones the register file is accessed.

Dynamic register allocation: In every translation block the registers are allocated differently. Ideally you load the values of the emulated registers into host registers at the beginning of the block and store those that were modified back to the register file before the block returns control to the dispatcher loop. If there aren't enough host registers to hold all registers used in the block, you'll have to store the value held in a host register to free it for the next one.

Implementation wise static register allocation is rather easy, because you always use the same host registers to hold specific emulated registers and access the memory locations in the register file for all the others. This also means that you always load the same register values and store them to the same memory locations afterwards no matter what translated code block you are executing. So you only need one setup/clean-up code for all blocks which is occasionally called glue code because it's the thing that connects the emulator and the generated code.

The implementation of dynamic register allocation is a bit more complicated though. First of all there is more bookkeeping to do, since the emulator has to remember these facts:

    * which host register holds which emulated register: you need to know if the register is already in use and where to store the value to if you need to free the register
    * which emulated register is held in which host register: this tells you if the register is already allocated, and if it is you know which register to use in the generated code
    * has the value held in the host register been modified? This isn't really necessary, but it's a good idea not to store a value to the register file that hasn't been changed since you spare another unnecessary memory access.


How does register replacement work when you run out of registers?
There are very many methods that could be applied, and probably only one would be ideal, which basically means that you replace that register which isn't needed in that block anymore. Since you do have the entire block you could actually go through it backwards to find out if there is a register the code doesn't use anymore, but that can be a bit tedious.
That ideal solution reminded me of the best but purely theoretical solution for a page replacement algorithm in operating systems, so I came up with the idea of using another page replacement algorithm called second chance, which is similar to LRU (least recently used) but simpler to implement.
For that algorithm you set a reference flag for a host register every time it is referenced during code generation. When you need a register you go through all host registers in a circle (using a modulo operation with the maximum number of host registers), pick the next register whose reference flag isn't set, and clear the reference flags of all registers you have to skip on the way. It probably sounds a bit weird, but it's easy to implement and should lead to good results.
Of course if you come up with an easy implementation of the optimal solution that would be perfect.

I guess that's the most important things about register allocation...
If you want to know how to handle 64-bit registers on the IA-32 then take a look at the 1964 documentation (http://e64.wwemu.com/emus/1964/1964_...er_doc_101.pdf).



Translation Caching
Our next topic is the translation cache. But I won't go into detail about how the generated code blocks are actually stored and freed, since I haven't given it too much thought yet, although it is most likely a good idea to allocate a large block of memory from the operating system and then do your own arena management within that chunk for performance reasons.

The most interesting thing is how to remember which block is already translated and how to do that fast.
For 8-bit processors which normally have a 16-bit address range it would be easy to simply make an array of addresses and mark each single address when it is the start address of a recompiled block. But for those processors where dynamic recompilation is really interesting, this simple approach does not work, even if you take into account that most of the processors only permit aligned instructions. So other, less memory consuming methods have to be found.

When you come straight from computer science studies the most obvious solution would be a hash. That means a value is mapped through a special hash function (often containing a modulo operation to define the bounds) onto a much smaller array. The advantage is that in the best case you only perform the simple function and get the key to the memory location where you find the address of the recompiled code with one memory access (or a zero if it hasn't been recompiled yet). The problem is that the hash function is likely to generate the same key for different values, which happens more often when the hash array is too small and/or the hash function isn't that good. When this happens you get a so-called collision, which has to be resolved somehow. The typical solution is to keep a linked list of all values that collided, but that means you may need several lookups, since you have to search the list linearly until you find what you want.
So hashing is a possible solution, but not necessarily the best one.

Another problem that might have to be solved during the translation cache lookup is to detect self-modifying code. If the code behaves nicely you can skip that, but if self-modifying code occurs from time to time you'll have to detect it.
The typical computer science solution would likely be to run some kind of checksum (eg. CRC) over the original code, and regenerate the checksum every time the block is about to be executed. If the checksum has changed the block has to be translated again.
The problem here is that checksums can be fooled when several values (in this case instruction encodings) change but the change is not visible in the checksum. Also, recalculating the checksum before every run of the code would hurt performance hard.

Since traditional solutions aren't perfect let's see what alternatives there are...

A dirty trick that UltraHLE is said to be using works as follows:
Instead of using a data structure to memorize which blocks have been translated, the first instruction of the block is replaced by an illegal instruction that also contains the offset to the generated code - since MIPS has 32-bit instructions that's quite possible. So the emulator just takes a look at the start of the block to recognize if it has been translated already or if it has to be translated still. A side-effect is that code that modifies the first instruction of the block leads to the block being translated again automatically.
The disadvantage is that self-modifying code is only detected when the illegal instruction is replaced, and since you modify the original code you might run into problems when that block is actually just a sub-block of a larger block that might run into that illegal instruction. This could be handled of course, when the original instruction is stored somewhere, but it makes things a bit more complicated.

The elegant solution would be a paged translation map. When you don't know what to do in emulation, it often helps to take a look at how the hardware does things.
Most of the processors that are interesting candidates for dynamic recompilation organize their memory in pages, ie. the higher part of the address is the page number and the lower part is the page offset. The typical page size is 4K, which means that in a 32-bit address space you'd have 20-bit for the page number and 12-bit for the page offset. Even if you want to keep track of all possible pages (which isn't normally necessary) you'd need 1MB (= 20-bit address range) x 4 byte (size of an address on a 32-bit system) = 4MB, which might sound like much but actually keeps track of all pages in a 4GB address range. Now you still need 4KB x 4 byte = 16KB per page to have all the addresses, but you only need to keep track of pages that actually contain code, so that's far less than you might assume, and when you have a processor like MIPS where all instructions have to be aligned to 32-bit (ie. the start address of each instruction has the lower two bits cleared) it's only 4K.
When the emulation jumps to a certain address you first look at the page (by shifting the address right by 12 bits) to see if code in that page has already been translated. If there is no translated code yet you allocate a new memory area that is large enough to hold all addresses for that page, enter a pointer to that area in the page number entry, and finally enter the address of the translated code in the slot for the page offset at which the original code block starts.
The lookup needs just two memory accesses: one to find the location of the page directory via the page number, and another to look up the address of the translated code via the page offset within the page directory.
I hope this doesn't sound too complicated, because it really isn't...

The paged transmap also allows for a solution to identify self-modifying code. Every time a write access is performed, it is checked whether the page number of that access indicates that code on that page has been translated; if so, the cached code for that page is freed and the address in the page number entry is cleared to force a recompilation of the code. This might sound crude, but in paged environments data and code are normally on different pages, so it is really likely that the code has been modified.

Since I was talking about Harvard architectures (ie. split instruction and data caches) before, there is another way to detect self-modifying code in that case. Since those architectures have to flush their caches, you can try to trace that (either some system call or the processor operation) and free the appropriate code that is no longer valid. According to an article (http://devworld.apple.com/technotes/pt/pt_39.html) the official 68K emulator for PowerMacs uses that solution.


 Translation without Babelfish
Who actually waited for the next article yesterday?
Sorry about the delay, but I had to make up my mind what to cover next.

I think it's time to tackle the most difficult part of a dynarec, the translation.
You should already know that a block of the original code is translated to a block of host machine code.
To join the generated code with the emulator you also need some glue code to set up registers and write back the values after the block has been executed. For static register allocation you have something you could call a master block, since it does all the setup and cleanup for all generated blocks. With dynamic register allocation, on the other hand, you need a prologue and epilogue for each generated block, as the register allocation can differ from block to block.
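To make the glue-code idea concrete, here is a rough C sketch where a translated block is modelled as a function pointer operating on an emulated register file. In a real dynarec the block would be generated machine code and the prologue/epilogue would move values into actual host registers; all names here are invented for the sketch:

```c
#include <stdint.h>

/* The emulated register file the glue code works on. */
typedef struct {
    uint32_t r[32];
    uint32_t pc;
} CpuState;

/* A translated block behaves like a function operating on that state.
 * In a real dynarec this would point at generated machine code; here
 * an ordinary C function stands in for one. */
typedef void (*BlockFn)(CpuState *);

/* Stand-in for a generated block: "add r0, r1 -> r2" plus two
 * instructions' worth of program counter advance. */
static void example_block(CpuState *cpu)
{
    cpu->r[2] = cpu->r[0] + cpu->r[1];
    cpu->pc += 8;
}

/* The "master block": with static register allocation one shared piece
 * of glue code loads the emulated registers into host registers before
 * the block runs and writes them back afterwards. In this C model both
 * steps are implicit, since the state is passed in memory. */
static void run_block(CpuState *cpu, BlockFn block)
{
    /* prologue: load emulated registers into host registers */
    block(cpu);
    /* epilogue: write host registers back into the CpuState */
}
```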

In the following postings I'll discuss different forms of translation methods...


Using direct translation the code block is processed linearly and each instruction is translated separately. Often the code that generates the translation is placed directly after the decoding of the instruction, ie. the instruction decoder looks very much like the one from an interpretive emulator, with the exception that machine code is generated instead of the instruction being simulated.

The advantage of that method is that you can simply transform an interpretive emulator into a dynamic recompiler.

But there are many disadvantages.
First of all, optimizing the code is very hard because you can only reasonably do a one or two instruction lookahead to test if the current instruction could be combined with the following ones for a better translation.
Retargeting the dynarec to a different host processor can also be quite tedious, since you'll have to go through the whole decoder loop, which should actually be portable.

Unfortunately I see code like that much too often in real-world examples.


 The VIP Disaster
Once I got past the direct translation stage - I was heading in that direction and I didn't like it, which was one of the reasons why I stalled my dynarec and started research again - I thought about making the dynamic recompiler more portable.

Wouldn't it be cool to have some virtual intermediate processor (VIP, ie. the original code is translated to VIP "instructions", which are then translated to target instructions) and enable dynamic recompilation between a whole lot of processors without too many translators (just two per processor: one that translates to VIP and one that translates from VIP)?

I thought so, and since I didn't know of the failed UNCOL project (they tried to define some kind of universal machine language and it didn't work out, but I guess I wouldn't have cared anyway) I started analyzing maybe dozens of processor architectures, and after several months I knew much more about several architectures, but I gave up on the VIP idea.

It is true that basically all that processors do is calculate, but there are surprisingly many differences...
Just take the number of logical instructions, where some architectures have just the standard ones while others have a whole lot of combinations. Or the fact that some architectures handle the carry flag differently during subtraction (ie. they borrow) while others don't have any flags at all.
The biggest difference is probably the division, where some architectures also calculate the remainder, others have a separate instruction to calculate the remainder, and some require you to calculate the remainder via multiplication. Also some architectures produce results only in special registers, do division steps (ie. only calculate a certain number of bits of the result per instruction), or don't have a division instruction at all.
If you also add strange instructions like the "add and branch" from PA-RISC you get a whole lot of different instructions.

With the VIP there would be two extremes: either you end up having all possible instructions from all architectures you know (and there will still be many you don't know), or you make it very simple and compose translations for more complex instructions out of sequences of VIP instructions. Either way it's bad:
When you make the VIP too complex, porting will be very tedious, since you'd have to provide translations for hundreds and hundreds of VIP instructions for each new host processor and no one will want to do that, which nullifies the whole idea behind using the VIP in the first place.
If you make it simple, then it will be a charm to port, but the quality of the target code will be very bad, as it is very hard to optimize code when even very simple original instructions end up as a long sequence in VIP. For example, since many processors have different condition flags and some don't have any, those couldn't be handled in a simple VIP, and you'd have to translate them into separate VIP instructions, which could turn a very simple ADD instruction into a long VIP sequence.
Finding a compromise between these extremes will be very hard and will most likely result in narrowing the number of supported architectures, which again ruins the idea of using a VIP.

So that isn't really a good idea either...
__________________
The crownless again shall be King


 Decoder-Translator-Abstraction
Ok, since the first two approaches to translation aren't really recommended, what else could be done?
In my opinion, the translator should make it easy to optimize translations and should not be linked too closely to the instruction decoder, to make porting easier.

The best solution is probably to generate a block decode structure, ie. as the instructions of a block are decoded the decode information is added to a structure (you might even add some additional information about register and flag use), which is then handed to the translator, which is a totally separate module.

This method is relatively fast, since you don't do several translations as with the VIP, and lookaheads are much easier in a pre-decoded block structure than in direct translation, which also makes peephole optimization much easier.

You still have to write a special translator for each host system, but since the translator is a separate module that communicates with the decoder via a data structure, porting is a much cleaner process. Not to mention that you are able to do very specific optimizations, which probably would not be possible if you were using a stronger abstraction.
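As an illustration of what such a decode structure might look like, here is one possible C layout. The field choices and the fixed block size are assumptions for the sketch, not a canonical format:

```c
#include <stdint.h>

/* One pre-decoded instruction of the original block. The decoder fills
 * these in; the translator only ever sees this structure, so it never
 * needs to know how the source opcodes are encoded. */
typedef struct {
    uint32_t address;     /* original address of the instruction     */
    int      opcode;      /* decoded operation (an enum in practice) */
    int      rd, rs, rt;  /* register operands, -1 if unused         */
    int32_t  immediate;   /* immediate value, if any                 */
    uint32_t flags_set;   /* extra info about flag use that makes    */
    uint32_t flags_read;  /* lookahead and peephole passes cheap     */
} DecodedInsn;

/* The structure handed from the decoder to the translator module. */
typedef struct {
    uint32_t    start;    /* address of the first instruction */
    int         count;    /* number of decoded instructions   */
    DecodedInsn insn[64]; /* fixed maximum for this sketch    */
} DecodedBlock;
```

The decoder appends one `DecodedInsn` per source instruction; the translator can then scan `insn[]` forwards and backwards freely, which is exactly what makes lookahead and peephole optimization cheap here.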
__________________
The crownless again shall be King


 Code Generation
After some days without new information I think I should talk a little about code generation, since I don't want to dive into the platform-specific translations yet.

One possibility that is used by a few dynarecs is preassembled routines, ie. the covers (the code that represents a certain operation, in our case an instruction of the original code) are written in assembly (although I know at least one example where they were hacked in hex code) and translated during the compilation of the emulator.
Those covers contain only placeholders where register references, addresses, or immediate values should be. After the whole cover has been copied to the target memory location the placeholders have to be patched with the appropriate values. Due to this the range of possible instruction formats is somewhat limited, unless you want to make the whole process very complicated.
Obviously this method isn't that flexible and also won't lead to optimized code, as you have to translate each instruction all by itself instead of being able to combine two or three instructions into a faster semantic translation of the code sequence.

Other methods typically use code emitter functions for covers. Those functions directly write executable instructions to a memory location. The difference here is how readable the code is. Often it's like this:
Code:

    emit_4byte(0xE2800000 | 1);  /* ADD R0, R0, #1 */
    emit_4byte(0xE1A0F00E);      /* MOV PC, LR     */

Some readers might recognize the ARM example code I was using for the demonstration of the dynamic function call. I only changed it to show how values like the immediate #1 in the ADD instruction tend to be added to such code.
Of course this is already more flexible than preassembled code, but it isn't very readable and even harder to edit. Mind you, in some example dynarecs I found you won't even see the comments that tell you which instruction is generated.

To improve the last method you should use either functions or macros to make the code emitters more readable and also easier to use, which might look like this in the end:
Code:

    emit_ADDI(REG0, REG0, 1);  /* immediate addition */
    emit_MOV(REG_PC, REG_LR);  /* return             */

With a few dozen code emitters like these you should be able to program and edit covers quite easily after some practice, and it is clear what they do even without the comments.
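As a sketch of what is behind such emitters, here are possible C implementations of `emit_ADDI` and `emit_MOV` for the ARM instructions used above. They write to a plain array instead of an executable buffer, and the immediate handling is simplified to 8 bits, ignoring ARM's rotation scheme:

```c
#include <stdint.h>

enum { REG0 = 0, REG_LR = 14, REG_PC = 15 };

/* Emitted words go into a plain array here; a real dynarec would write
 * into an executable code buffer instead. */
static uint32_t code_buf[256];
static int      code_len;

static void emit_4byte(uint32_t word)
{
    code_buf[code_len++] = word;
}

/* ARM "ADD Rd, Rn, #imm" - simplified to an 8-bit immediate (real ARM
 * immediates additionally allow a 4-bit rotation). */
static void emit_ADDI(int rd, int rn, uint32_t imm8)
{
    emit_4byte(0xE2800000u | ((uint32_t)rn << 16) | ((uint32_t)rd << 12)
                           | (imm8 & 0xFFu));
}

/* ARM "MOV Rd, Rm" register move. */
static void emit_MOV(int rd, int rm)
{
    emit_4byte(0xE1A00000u | ((uint32_t)rd << 12) | (uint32_t)rm);
}
```

Calling `emit_ADDI(REG0, REG0, 1)` followed by `emit_MOV(REG_PC, REG_LR)` produces exactly the two words from the first listing, 0xE2800001 and 0xE1A0F00E.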

This is probably one of the last basics I can explain without using too much assembly. So are there any things you don't understand fully yet and that should be better explained before I continue?

What I have left now are topics about timing issues, specific translation problems, and code optimization, but those are very specific in most cases and maybe not that interesting for most readers. After that I'll probably have to pass or just answer questions, because I don't really have too much knowledge of more in-depth stuff to continue. I didn't plan to fully explain a dynamic recompiler anyway.
__________________
The crownless again shall be King



Quote:
Originally posted by n64warrior
ok.than how whould i simulate a chip like the RCP which is 2 or more cpu's or gpu's or a cpu and a gpu in 1 chip with fixed instructions and formats?? thats what i realy need to know..
First of all, I meant this as a general discussion about dynamic recompilation and I want to keep it clean from references to specific systems, which is why I won't turn this into an N64 emulation discussion. If you want to know more about that you'd better open a new discussion thread and/or look for answers here:
http://www.classicgaming.com/epr/n64.htm
http://e64.wwemu.com/emus/1964/1964_...er_doc_101.pdf
http://1964emu.emulation64.com/
http://www.pj64.net/code/mainiframe_downloads.htm

Secondly, according to my information the RCP isn't a normal CPU but rather a video/audio chip combination, where dynamic recompilation doesn't make sense because there isn't any program code you could translate and cache.
You might try to cache display lists and their effect, but unlike program code blocks, display lists aren't necessarily linked to specific addresses, so recognizing an already encountered display list will be slow, and not only due to this fact I assume that you won't get any speed increase with such a technique.

If you simply mean the synchronization of the CPU and the rest of the system, then I'll cover that topic in "Timing Issues" a bit later...

Quote:

i am not asking anyone to write an emulator for me . i just want to learn as munch infor. about emulation, so i can get tooie working.and to make a some ideas on how to emulate it.
I guess you're just one of those guys who are pissed off that the only N64 emulator that is fast enough on your system to be playable is UltraHLE, but that uses lots and lots of dirty tricks to be able to run the games that fast at the expense of compatibility, which means that only a few titles work.
Now you probably think that all the other N64 emulator authors just suck, and you'd be able to do a lightning fast emulator with top notch compatibility all by yourself, while you don't have enough knowledge to do that and the main problem simply is that your PC is too slow.

Quote:

but still my main goal is get real good at programing software for cross systems.so that lunix and apple software can run on windows on a Pentium2 at good speed.and i also want to to make a xwindowsxp that whould be an open source project.but thats all later.
Good luck, but I guess you won't finish any of these projects in the next 5 or 10 years...
I had the idea of writing a VDM (Virtual DOS Machine) for BeOS to be able to run simple DOS tools under my favourite operating system. Apart from the use of the V86 mode of the processor this should be almost trivial compared to what you have on your mind, and I am still at the research stage after I thought of the VDM several months ago, and I'll probably drop the whole idea because of more important things I could do.

Quote:

for now i just want get good at n64 emulation.and at least make one good video plugin.
It's certainly not the right thread to discuss that here...
__________________
The crownless again shall be King


 Timing Issues
When it comes to handling multiple chip functions in emulation one might think that, with all the multi-tasking operating systems, the different functions could run in parallel as separate threads. But it would be very hard to synchronize the timing between the parts, which will either lead to weird effects or the emulation won't work well at all. The only thing that can probably run in a separate thread is the user interface IMO, where the system can buffer user input for later use.

So how does synchronization work in an emulator?
A traditional emulator "executes" one instruction after another, then checks the clock cycles it would have taken on the real machine. This is done for two reasons: first of all, when you know how much time an instruction takes on the real processor you can adjust the timing of the emulator so that it runs just as fast as the real system, and secondly, after a certain number of clock cycles have passed it is time to refresh the display, output some sound, etc. The number of clock cycles that have to pass before the emulator has to process these other tasks totally depends on the emulated system, so I won't go into detail here.
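The loop described above might look like this in C. All constants and helper names are invented for the sketch; `CYCLES_PER_FRAME` in particular would depend entirely on the emulated system:

```c
#include <stdint.h>

/* Cycles between display refresh / sound output etc.; the real value
 * depends entirely on the emulated system - this one is made up. */
#define CYCLES_PER_FRAME 70000

typedef struct {
    uint32_t pc;
    long     cycles;   /* cycles used up in the current frame */
    int      frames;   /* frames completed so far             */
} Emu;

/* Stand-in for "decode and execute one instruction": here every
 * instruction is 4 bytes long and costs 4 cycles on the real machine. */
static int step_instruction(Emu *e)
{
    e->pc += 4;
    return 4;
}

/* The traditional main loop: execute, count the cycles the real
 * processor would have needed, and run the other subsystems whenever
 * enough cycles have passed. */
static void run_frames(Emu *e, int frames)
{
    while (e->frames < frames) {
        e->cycles += step_instruction(e);
        if (e->cycles >= CYCLES_PER_FRAME) {
            e->cycles -= CYCLES_PER_FRAME;
            e->frames++;   /* refresh display, output sound, ... */
        }
    }
}
```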

In some rare cases, when the emulation has to be very exact, so-called single-cycle emulation is used. In that case not the whole instruction is emulated at once, but only a small part of it during each clock cycle, to get the timing of the emulator as close to the original system as possible. This is very slow and complicated of course, and the total opposite of dynamic recompilation, which is meant to be fast and not totally exact. Single-cycle emulation is used by some C64 emulators, for example.

How does synchronization work in a dynamic recompiler?
Basically it's the same way, only that you execute blocks instead of single instructions or even single cycles, but you still count the clock cycles the same block would have needed on the real machine to synchronize the timing of the CPU emulation with the remaining emulation.
Depending on how exact the emulation has to be, some blocks might spend too many clock cycles. If that happens you'll need to find out how many clock cycles can pass without running into a problem, and basically split the blocks into smaller parts during translation to stay below that cycle limit. This means that you can perform fewer optimizations because the blocks are smaller, and when you work with dynamic register allocation you have to generate prologue/epilogue code for each smaller block.
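A possible C sketch of such a block dispatcher with cycle accounting. The `Block` record and the callback are assumptions for the sketch; a real dynarec would look the translated code up in the transmap and jump to it directly:

```c
#include <stdint.h>

/* What executing one translated block reports back: where the emulated
 * program counter went, and how many cycles the ORIGINAL block would
 * have cost on the real machine. */
typedef struct {
    uint32_t next_pc;
    int      cycles;
} Block;

/* Dispatcher loop: run translated blocks until the cycle budget for
 * this time slice is used up, then hand control back to the rest of
 * the emulator (display, sound, ...). */
static uint32_t dispatch(uint32_t pc, int budget,
                         Block (*exec_block)(uint32_t))
{
    while (budget > 0) {
        Block b = exec_block(pc);   /* run translated code for pc */
        budget -= b.cycles;
        pc = b.next_pc;
    }
    return pc;
}

/* Stand-in translated block: 16 bytes of original code, 10 cycles. */
static Block fake_block(uint32_t pc)
{
    Block b = { pc + 16u, 10 };
    return b;
}
```

With a budget of 25 cycles and 10-cycle blocks, three blocks run before the budget goes negative and control returns to the emulator.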

The most extreme solution that Neil Bradley came up with, if you should need very exact timing when using a dynarec, is that you basically have one original instruction per block. To keep the overhead low you should only do it with static register allocation, or you'll end up with more prologue/epilogue code than actual translations.
But instead of returning to the dispatcher loop as would be normal with translated blocks, you decrease a cycle counter and only jump to the exit code when the counter becomes negative. It looks a bit like this in x86 code:
Code:

    ; translated instruction here
    sub ecx, cycles
    js  exitcode_1234
    ; next translated instruction

You need separate exit code for each instruction, but when you use static register allocation this won't be more than writing the current program counter address of the emulated system into an appropriate variable, so that the emulator knows where the dynarec left the block.
When you also use the paged translation map you can address each translated instruction individually, so you don't have blocks anymore but linear code that can be left at any moment when it is time to do so.
One of the advantages is that you can actually make branches inside the generated code, because even when it is an infinite loop the execution thread will leave the block after a certain number of clock cycles.
Since you'd most likely still translate block by block instead of the whole executable code (which is even impossible to find when you have indirect jumps), there are cases when you translate a block in the dispatcher that another block just branched to. In that case you can patch the previous block not to leave to the dispatcher but to jump to the newly translated block directly.
This whole idea - I hope I made it clear - should provide the best timing when using a dynamic recompiler, and it even has the advantage that it only returns to the dispatcher loop of the emulator when it has to due to clock cycles being "used up", but otherwise it runs totally on generated code.
No light without shadow, of course. The disadvantages are that you have to work with static register allocation (although some might not see that as a flaw) because the overhead of dynamic register allocation would be too big, and since you translate the instructions out of context you cannot use any optimization at all.
Use this technique if you really need instruction exact timing, otherwise it might be a bit extreme.
__________________
The crownless again shall be King

posted @ 2009-09-17 10:09 王小明 阅读(526) | 评论 (1)编辑 收藏

An explanation of the ins instruction in MIPS32 R2

ins $x,$y,pos,size
Clear size bits of x starting at bit pos, then place the low size bits of y into x starting at bit pos, merging the two together.
For example:
ins $x,$y,3,9
written in C:
x=(x&0xfffff007)|((y&0x1ff)<<3)

ins $x,$y,6,11
written in C:
x=(x&0xfffe003f)|((y&0x7ff)<<6)

ins $x,$y,31,1
written in C:
x=(x&0x7fffffff)|((y&0x1)<<31)

where

0 <= pos < 32
0 < size <= 32
0 < pos+size <= 32
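The semantics above can be captured in a single C helper, checked against the three examples (`mips_ins` is just a name chosen for this sketch):

```c
#include <stdint.h>

/* C model of MIPS32 R2 "ins": clear `size` bits of x starting at bit
 * `pos`, then merge in the low `size` bits of y at that position. */
static uint32_t mips_ins(uint32_t x, uint32_t y, int pos, int size)
{
    uint32_t mask = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1u);
    return (x & ~(mask << pos)) | ((y & mask) << pos);
}
```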

posted @ 2009-09-02 20:20 王小明 阅读(2007) | 评论 (2)编辑 收藏

Register    Mnemonic   Usage
$0          zero       Always reads as 0; anything written to it is discarded, a bit like the NULL device. It looks wasteful but is actually useful, as we'll see when we get to assembly.
$1          at         (Assembly Temporary) Reserved for use by the assembler.
$2-$3       v0,v1      Return values of subroutine calls, i.e. where the things handed back by return are kept.
$4-$7       a0-a3      (Arguments) The first few arguments of a subroutine call, like formal parameters.
$8-$15      t0-t7      Temporaries; a subroutine may use them without saving and restoring them, like tmp variables inside a subroutine.
$16-$23     s0-s7      Saved registers; a subroutine that writes them must save their values and restore them before returning, so the caller sees them unchanged.
$24-$25     t8,t9      Same as t0-t7.
$26-$27     k0,k1      Reserved for interrupt or trap handlers; their values may change under you.
$28         gp         (Global Pointer) Quite interesting: some runtime systems maintain this pointer for quick access to static and extern variables.
$29         sp         (Stack Pointer)
$30         s8/fp      The ninth saved register, equivalent to s8; used as the frame pointer if one is needed, otherwise as a saved register.
$31         ra         Subroutine return address.

posted @ 2009-09-02 17:22 王小明 阅读(1260) | 评论 (0)编辑 收藏

Under VC, the compiler options that produce preprocessed C/C++ output are /EP and /P.

posted @ 2009-07-30 13:55 王小明 阅读(153) | 评论 (0)编辑 收藏

We professionally provide ports of all kinds of game emulators - SFC, GBA, MD (Genesis), NeoGeo, CPS1, CPS2, NDS, and more - to all kinds of platforms: WinCE, Linux, uC/OS, on x86, ARM, Ingenic MIPS processors, etc. Please leave your contact information and we will get in touch with you as soon as possible.

posted @ 2009-07-21 21:23 王小明 阅读(697) | 评论 (1)编辑 收藏
