May I join you guys?
Looks like you finally managed to sort out the stack problem!
The thing I want to comment on is using 32-bit code in real mode. THIS IS POSSIBLE on 386 and later CPUs.
A little example to make things clearer:
Suppose the CPU is in real mode, where 16-bit registers are native (but the 32-bit ones are accessible!). Say you assemble
mov ax,01
db 00
db 00
for a 16-bit target. The assembler will compile this as
B8 01 00 00 00, and in real mode this means
mov ax, 01
dw 00 ;btw, this is add [bx][si],al
(B8 is the opcode for "mov ax, immediate" here)
But you can also write and compile
mov eax, 01
for the same 16-bit target
and this will yield
66 B8 01 00 00 00
(That's why I added those two zeros above.) Mind the 66 prefix: it is the operand-size override, which tells the CPU to use the non-native register size for this one instruction.
So the point is that 32-bit operations are generally quite accessible in 16-bit mode (not the protected-mode instructions, of course), but each one costs an extra prefix byte. Personally I use 32-bit instructions in DOS .com files quite a lot.
Now the second part of the story. Say you are in 32-bit mode. Here
B8 01 00 00 00
means
mov eax,01
By contrast, the sequence from above
66 B8 01 00 00 00
now means
mov ax,01
add [eax],al
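
If it helps, here is a rough Python sketch of the rule above. It is only a toy model, not a real disassembler: it understands nothing but opcode B8 with an optional 66 prefix, and the function name decode_mov_imm is my own invention for illustration. It returns the decoded instruction plus whatever bytes are left over for the CPU to decode as the next instruction.

```python
def decode_mov_imm(code, default_bits):
    """Toy decoder (hypothetical helper): handles only B8 with an optional 66 prefix.

    default_bits is the native operand size of the current mode: 16 or 32.
    Returns (instruction text, leftover bytes).
    """
    pos = 0
    operand_bits = default_bits
    if code[pos] == 0x66:
        # The operand-size prefix flips 16 <-> 32 for this one instruction.
        operand_bits = 48 - default_bits
        pos += 1
    if code[pos] != 0xB8:
        raise ValueError("this sketch only understands opcode B8")
    pos += 1
    size = operand_bits // 8  # immediate is 2 or 4 bytes
    imm = int.from_bytes(code[pos:pos + size], "little")
    pos += size
    reg = "eax" if operand_bits == 32 else "ax"
    return "mov %s, %d" % (reg, imm), code[pos:]


plain = bytes.fromhex("B801000000")
prefixed = bytes.fromhex("66B801000000")

for bits in (16, 32):
    for code in (plain, prefixed):
        text, rest = decode_mov_imm(code, bits)
        print(bits, code.hex(" "), "->", text, "| leftover:", rest.hex(" "))
```

The two leftover 00 00 bytes are exactly the ones that end up decoded as the "add" instruction in the listings above.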
End of the story. Hope I was not too boring.

Just wanted to be helpful.
regards, oleksii