Discussion:
[Xbox-linux] Datapoint on the USB stack.
Terry Cole
2003-04-13 00:24:14 UTC
Dear List,
Struggling to understand some of this, but I did catch Franz's
8.
The USB stack
I have found out that the
BootUsbInit(...
which is #ifdef DO_USB
(it was permanently switched on)
led to .xbe boot failure on the 1.1 Xbox.
I have now //#ifdef DO_USB, and all looks good now.
The current .xbe/source in CVS is running (tested) on 1.1 and 1.0
xboxes and boots from evox and CD.
Eh? I have an Xbox v1.1b (PAL). Everything has worked perfectly from
day one, including booting off .xbe... Is Franz referring to booting off
a USB device? I'm sure I could test that...
8.
The USB stack
I have found out that the
BootUsbInit(...
which is #ifdef DO_USB
(it was permanently switched on)
led to .xbe boot failure on the 1.1 Xbox.
I have now //#ifdef DO_USB, and all looks good now.
The current .xbe/source in CVS is running (tested) on 1.1 and 1.0
xboxes and boots from evox and CD.
Interesting news, since previously the USB init has worked fine on
both v1.0 and v1.1 boxes, in fact it has been turned on for many
weeks in Cromwell.
Yes. I keep hearing about other people's horror stories, wondering what
the heck might be going wrong. Since many - like Franz - do not have a
modchip, it's not traceable to hardware modification errors. Baffling.


Regards, TC
--
Terry Cole BA/BSc/BE/BA(hons) (***@maths.otago.ac.nz)
System Administrator, Dept. of Maths. & Stats., Otago Uni.
PO Box 56, Dunedin, NZ.
Andy Green
2003-04-13 10:45:54 UTC
Post by Terry Cole
i have now //#ifdef DO_USB, and all looks good now.
The current .xbe/source in CVS is running (tested) on 1.1 and
1.0 xboxes and boots from evox and CD.
Eh? I have an Xbox v1.1b (PAL). Everything has worked
perfectly from day one, including booting off .xbe... Is Franz
referring to booting off a USB device? I'm sure I could test
that...
No, we do not have the USB stack necessary to talk to USB devices yet.
Franz is telling you that on his Xbox, he found that by turning off
USB init altogether, he made his boot problem go away. So he infers
from this that the USB init is broken and by removing it "all looks
good now".

I don't think the USB init is broken at all. Instead, as I keep
describing, the boot instability we are seeing is sensitive to code
placement. Removing the USB init code affects the layout of the
remaining code. In fact, on some images I have made, adding or
removing a single NOP in BootStartup.S is enough to decide whether
the boot crashes or not. This makes the result of any test involving
code size changes suspect: did the added or removed code make the
difference, or was it just the change in layout it caused, so that
adding the same number of NOPs would have had the same effect?
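
One way to keep function and footprint apart, as a rough sketch only: build a control image in which a suspect byte range is overwritten with the same number of NOP bytes, so that the image size and the address of every later byte stay unchanged. The helper name `make_nop_control` and the idea of patching a flat image buffer are assumptions made for illustration; this is not how the Cromwell build actually works.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define I386_NOP 0x90u  /* single-byte NOP opcode on i386 */

/* Overwrite `len` bytes at `offset` with NOPs so the image keeps the same
 * size and every later byte keeps its address. Returns 0 on success, or
 * -1 if the requested range does not fit inside the image. */
int make_nop_control(uint8_t *image, size_t image_len,
                     size_t offset, size_t len)
{
    assert(image != NULL);
    if (offset > image_len || len > image_len - offset)
        return -1;
    memset(image + offset, I386_NOP, len);
    return 0;
}
```

If the control image and the image with the real code behave the same way, the layout rather than the function of the code is the more likely culprit.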

Still, let's hope Franz has fixed the instability as he tells us and
it's all moot now.

-Andy
Ivan Hawkes
2003-04-13 11:00:02 UTC
Andy Green wrote:
| On Sunday 13 April 2003 01:24, Terry Cole wrote:
|
|
|> >i have now //#ifdef DO_USB, and all looks good now.
|> >
|> >The current .xbe/source in CVS is running (tested) on 1.1 and
|> > 1.0 xboxes and boots from evox and CD.
|>
|> Eh? I have an Xbox v1.1b (PAL). Everything has worked
|>perfectly from day one, including booting off .xbe... Is Franz
|>referring to booting off a USB device? I'm sure I could test
|>that...
|
|
| No, we do not have the USB stack necessary to talk to USB devices yet.
| Franz is telling you that on his Xbox, he found that by turning off
| USB init altogether, he made his boot problem go away. So he infers
| from this that the USB init is broken and by removing it "all looks
| good now".
|
| I don't think the USB init is broken at all, instead as I keep
| describing, the boot instability which is seen is sensitive to code
| placement. Removing the USB init code affects the layout of the
| remaining code. In fact on some images I have made, adding or
| removing a single NOP in BootStartup.S is enough to cause a boot
| crash or not. This makes the result of any test involving code size
| changes suspect - did the added or removed code make the difference,
| or was it just the change in layout caused by the extra or removed
| code, so that if you had added the same amount of NOPs you would get
| the same effect?
|
| Still, lets hope Franz has fixed the instability as he tells us and
| its all moot now.
|
| -Andy


Just a quick thought here, and it's probably been discussed before...are
you making sure you're using the correct padding for the compiler? Some
platforms are fussy about byte/word/dword alignment and require padding
to be inserted to ensure all structures start on a word or dword
boundary. The runtime results are unpredictable if the padding is wrong.
Andy Green
2003-04-13 11:32:14 UTC
Post by Ivan Hawkes
Just a quick thought here, and it's probably been discussed
before...are you making sure you're using the correct padding for
the compiler. Some platforms are fussy about byte/word/dword
alignment and require padding to be inserted to ensure all
structures start on a word or dword boundary. The runtime results
are unpredicatable if the padding is wrong.
It's the kind of off-the-wall idea that we need, but for this
particular one I don't think it's the problem. i386 does not care
about member alignment - at least, it won't choke if a DWORD is on
any intra-DWORD boundary, i.e., a dword can happily start at +0
bytes, +1 bytes, +2 bytes or +3 bytes from a DWORD boundary. I think
it's *faster* if it's aligned, but it's not death if it isn't.
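
A small sketch of what "happily started at +1, +2 or +3" means in C terms. The helper names `load_u32` and `store_u32` are made up for this example, and using `memcpy` is my choice here because it keeps the access well-defined in portable C; it is not taken from the Cromwell source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from any byte offset, aligned or not. memcpy keeps
 * this well-defined in C; on i386 a compiler will typically emit a single
 * ordinary load for it. */
uint32_t load_u32(const uint8_t *buf, size_t offset)
{
    uint32_t v;
    assert(buf != NULL);
    memcpy(&v, buf + offset, sizeof v);
    return v;
}

/* Store a 32-bit value at any byte offset, aligned or not. */
void store_u32(uint8_t *buf, size_t offset, uint32_t v)
{
    assert(buf != NULL);
    memcpy(buf + offset, &v, sizeof v);
}
```

A value stored at offset 1 or 3 reads back unchanged; the only expected cost of the misalignment on this CPU family is speed.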

What you might be thinking of is where you interface to a precompiled
API, like an OS, sending structs around between the API and your code.
Then it's critical that the struct members follow the same padding
rules so they line up. That's not what is happening here, since during
the init it is ONLY our code that is running and it's all compiled
under the same compile environment.

-Andy
Ivan Hawkes
2003-04-13 22:30:59 UTC
Post by Andy Green
Post by Ivan Hawkes
Just a quick thought here, and it's probably been discussed
before...are you making sure you're using the correct padding for
the compiler. Some platforms are fussy about byte/word/dword
alignment and require padding to be inserted to ensure all
structures start on a word or dword boundary. The runtime results
are unpredicatable if the padding is wrong.
Its the kind of off the wall idea that we need, but for this
particular one I don't think its the problem. i386 does not care
about member alignment - at least, it won't choke if a DWORD is on
any intra-DWORD boundary, ie, a dword can be happily be started at +0
bytes, +1 bytes, +2 bytes or +3 bytes from a DWORD boundary. I think
its *faster* if its aligned but its not death if it isn't.
What you might be thinking of is where you interface to a precompiled
API, like an OS, sending structs around between the API and your code
then its critical the struct members follow the same padding rules so
they line up. That's not what is happening here, since during the
init it is ONLY our code that is running and its all compiled under
the same compile environment.
-Andy
That's probably what I was thinking about ;-) There's all sorts of
miscellaneous computer knowledge floating around my brain from years at the
keyboard. The old axiom of "use it or lose it" is slapping me around a bit
these days...too much management, not enough hard-core coding. I was going to
look it up in one of my old books, but they were all written when 486s were
new and were considered the ant's pants, so they're not relevant to the
Pentium architecture or its siblings.

When faced with an intractable problem like the boot issues, I usually start
to remove all the code possible to bring it down to a skeleton only. When
that is operating (if!) properly it's time to slowly trickle the rest of the
code back in, a little at a time until the source of the breakage is
identified. That may not apply in this case if you are having issues which
could possibly be related to the code layout in memory.

Instability is more likely due to buffer overflows which affect sections of
code/data (data more likely really, it should all be in a data segment) and
is layout sensitive because corrupted data in some sections is more tolerable
than in others.

Since it's at boot time that you're having issues I'm guessing that there's no
debugger help available and extremely restricted access to useful methods for
instrumenting the code. Pity the machines don't come with an old MDA adapter
(Mono Display Adapter); you could have output text to that during the boot
sequence for debugging purposes. As it stands, what sort of debugging is
available? I suspect there is no screen; is there at least a system bell
(beep) to call? I used to debug a graphical multimedia system I built using
the system bell since the debuggers at the time weren't up to scratch. It's
amazing how much info listening for beeps can provide.

How about pre-initialising all the memory being used to a known value, then
checking this value is present prior to using it? There's a chance this will
reveal a buffer overrun from a previous section of code.
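
As a minimal sketch of that check: fill a region with a known byte before anything uses it, then look for the first byte that has changed. The fill value 0xA5 and the names `fill_known` and `first_clobbered` are arbitrary choices for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILL_BYTE 0xA5u  /* arbitrary, easy-to-spot pattern (an assumption) */

/* Pre-fill a region with the known pattern before anything uses it. */
void fill_known(uint8_t *region, size_t len)
{
    assert(region != NULL);
    memset(region, FILL_BYTE, len);
}

/* Return the offset of the first byte that no longer holds the pattern,
 * or `len` if the whole region is still intact. */
size_t first_clobbered(const uint8_t *region, size_t len)
{
    size_t i;
    assert(region != NULL);
    for (i = 0; i < len; i++) {
        if (region[i] != FILL_BYTE)
            return i;
    }
    return len;
}
```

Calling `first_clobbered` on a guard region just before it is handed to the next stage gives the offset of the damage, which narrows down who wrote past their buffer.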

I'd start hacking myself but my box is in Australia still getting
modded...*sigh*...not much longer to wait.
Andy Green
2003-04-14 05:54:40 UTC
Post by Ivan Hawkes
When faced with an intractable problem like the boot issues, I
usually start to remove all the code possible to bring it down to a
skeleton only. When that is operating (if!) properly it's time to
slowly trickle the rest of the code back in, a little at a time
until the source of the breakage is identified. That may not apply
in this case if you are having issues which could possibly be
related to the code layout in memory.
A good method. But the problem is present even with a minimal startup
footprint: the symptoms appear as soon as enough code (pretty much
any code, it seems) is added.
Post by Ivan Hawkes
Instability is more likely due to buffer overflows which affect
sections of code/data (data more likely really, it should all be in
a data segment) and is layout sensitive because corrupted data in
some sections is more tolerable than in other.
It seems to me there is some interrelation between the hardware and
the init code going on. Your point is right though, the instability
could be being driven by uninitialized memory contents in that way.
Post by Ivan Hawkes
Since it's at boot time that you're having issues I'm guessing that
there's no debugger help available and extremely restricted access
to useful methods for instrumenting the code. Pity the machines
don't come with an old MDA adapter (Mono Display Adapter) you could
have output text to that during boot sequence for debugging
purposes. As is stands, what sort of debugging is available? I
suspect there is no screen, is there at least a system bell (beep)
to call? I used to debug a graphical multimedia system I built
using the system bell since the debuggers at the time weren't up to
scratch. It's amazing how much info listening for beeps can
provide.
I designed and built three Filtror devices

http://warmcat.com/milksop/filtror.html

back in July 2002 I think, and distributed two of them to other people
who were working on the early BIOS stuff. I also wrote a terminal
type application, so even when there was no IO up at all, it was
possible to have a debug terminal up. This proved very useful
indeed, especially when we were booting Linux for the first time, it
was crashing early during boot with no output to guide us. Milosch
wrote a filtror character driver and that got us over the hump by
showing dmesg stuff.
Post by Ivan Hawkes
How about pre-initialising all the memory being used to a known
value, then checking this value is present prior to using it?
There's a chance this will reveal a buffer overrun from a previous
section of code.
Franz has been trying this trick by adding memsets(). Although he is
quick to claim success, he has not explained which of the "14-15"
memsets he added did anything, nor where the supposed bug was. And
he continues to ignore my advice (after all, what do I know!) that
this instability problem is sensitive to code layout. So I assume
any changes in behaviour were due to the added code footprint and not
function, despite Franz's handwaving.

I did actually add code in BootStartup.S some months ago to zero the
first 1M (I think) of memory before the stack was set up. It made no
difference to the instability compared to the same number of NOPs
inserted at the same place. So I removed the code and came away
pretty sure that uninitialized RAM is not the problem. (This doesn't
help with unprepped stack allocation, but at that time there was
virtually no code other than the init).
Post by Ivan Hawkes
I'd start hacking myself but my box is in Australia still getting
modded...*sigh*...not much longer to wait.
You're very welcome to join in.

-Andy
Ivan Hawkes
2003-04-15 11:16:20 UTC
Andy Green wrote:
| On Sunday 13 April 2003 23:30, Ivan Hawkes wrote:
|
|
|>When faced with an intractable problem like the boot issues, I
|>usually start to remove all the code possible to bring it down to a
|>skeleton only. When that is operating (if!) properly it's time to
|>slowly trickle the rest of the code back in, a little at a time
|>until the source of the breakage is identified. That may not apply
|>in this case if you are having issues which could possibly be
|>related to the code layout in memory.
|
|
| A good method. But the problem is present with a minimal startup
| footprint, as soon as enough code - pretty much any code, it seems,
| is added to see the symptoms.

That's going to make it hard to crack.

|>Instability is more likely due to buffer overflows which affect
|>sections of code/data (data more likely really, it should all be in
|>a data segment) and is layout sensitive because corrupted data in
|>some sections is more tolerable than in other.
|
|
| It seems to me there is some interrelation between the hardware and
| the init code going on. Your point is right though, the instability
| could be being driven by uninitialized memory contents in that way.
|
|
|>Since it's at boot time that you're having issues I'm guessing that
|>there's no debugger help available and extremely restricted access
|>to useful methods for instrumenting the code. Pity the machines
|>don't come with an old MDA adapter (Mono Display Adapter) you could
|>have output text to that during boot sequence for debugging
|>purposes. As is stands, what sort of debugging is available? I
|>suspect there is no screen, is there at least a system bell (beep)
|>to call? I used to debug a graphical multimedia system I built
|>using the system bell since the debuggers at the time weren't up to
|>scratch. It's amazing how much info listening for beeps can
|>provide.
|
|
| I designed and built three Filtror devices
|
| http://warmcat.com/milksop/filtror.html
|
| back in July 2002 I think, and distributed two of them to other people
| who were working on the early BIOS stuff. I also wrote a terminal
| type application, so even when there was no IO up at all, it was
| possible to have a debug terminal up. This proved very useful
| indeed, especially when we were booting Linux for the first time, it
| was crashing early during boot with no output to guide us. Milosch
| wrote a filtror character driver and that got us over the dump by
| showing dmesg stuff.

Yikes, nice piece of engineering there. I take it you did electronics at
Uni? Was it the work you did on LPCs that drew you to the XBox project
or was that simply a happy co-incidence?

|>How about pre-initialising all the memory being used to a known
|>value, then checking this value is present prior to using it?
|>There's a chance this will reveal a buffer overrun from a previous
|>section of code.
|
|
| Franz has been trying this trick with adding memsets(), although he is
| quick to claim success, he has not explained which of the "14-15"
| memsets he added did anything, nor where the supposed bug was. And
| he continues to ignore my advice (after all, what do I know!) that
| this instability problem is sensitive to code layout. So I assume
| any changes in behaviour were due to the added code footprint and not
| function, despite Franz's handwaving.

If it's not reproducible then you haven't isolated the bug. If you can't
explain "why" a problem goes away then you haven't solved it - you've
merely moved it a little further into your future. I get decidedly edgy
if an intermittent problem just goes away...software bugs don't fix
themselves.

| I did actually add code in BootStartup.S some months ago to zero the
| first 1M (I think) of memory before the stack was set up, it made no
| difference to the instability compared to the same amount of NOPs
| inserted at the same place. So I removed the code and came away
| pretty sure that uninitialized RAM is not the problem. (This doesn't
| help with unprepped stack allocation, but at that time there was
| virtually no code other than the init).

Adding NOP into a data segment (god, I'm so old, do they even still have
the segmented memory architecture on Pentium class processors?) is more
likely to cause data weirdness than zeroing it. That's because 0 is the
terminating character for strings, so if you have a data buffer overflow
(but still in the data segment) then there is a good chance of hitting a
0 eventually...thus terminating the string data. Neither is particularly
good with numeric values. NOPs are good in code segments because each one
is a single-byte instruction, so when code goes off the rails and hits the
NOPs it can just suck 'em up like PacMan until it hits the start of the
next block of real instructions.

One idea might be to seed the code area/data area with long jumps to an
exception handler block, preferably one that rings the bell or logs to
the screen or fires off a noticeable event. Perhaps a little NOP padding
either side of these will increase the chances of it being executed as a
long jump rather than some random instruction sequence, whatever point
the code starts executing from, e.g. (note, I haven't written assembler
for 15 years...this will be wrong)

NOP
NOP
NOP
NOP
JMP EX_HANDLER
NOP
NOP
NOP
NOP
JMP EX_HANDLER
NOP
NOP
NOP
NOP
JMP EX_HANDLER


EX_HANDLER:
MOV AH, whatever ; You get the idea...
.
.
.
INT 21H ; or some debugging shit...since
; this is an old DOS interrupt.
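
At the byte level, the seeding idea might look roughly like the sketch below, written in C rather than assembler. It assumes the i386 encodings 0x90 for NOP and 0xE9 for a near JMP with a 32-bit relative offset; the function `seed_traps` and the model of the area as a flat buffer with the handler at a known offset are hypothetical, invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OP_NOP        0x90u  /* i386 NOP */
#define OP_JMP_REL32  0xE9u  /* i386 near JMP, 32-bit relative offset */
#define NOPS_PER_TRAP 4u
#define TRAP_LEN (NOPS_PER_TRAP + 5u)  /* 4 NOPs + opcode + 4 offset bytes */

/* Fill buf[0..len) with repeated "NOP x4; JMP handler" groups.
 * `handler_off` is the handler's offset from the start of buf (it may lie
 * outside the seeded area). Leftover bytes at the end become NOPs.
 * Returns the number of complete groups written. */
size_t seed_traps(uint8_t *buf, size_t len, size_t handler_off)
{
    size_t pos = 0, count = 0, i;

    assert(buf != NULL);
    while (len - pos >= TRAP_LEN) {
        for (i = 0; i < NOPS_PER_TRAP; i++)
            buf[pos++] = OP_NOP;
        /* rel32 is measured from the end of the 5-byte JMP instruction;
         * unsigned wrap-around gives the two's complement encoding. */
        uint32_t rel = (uint32_t)(handler_off - (pos + 5u));
        buf[pos++] = OP_JMP_REL32;
        buf[pos++] = (uint8_t)(rel & 0xFFu);          /* little-endian */
        buf[pos++] = (uint8_t)((rel >> 8) & 0xFFu);
        buf[pos++] = (uint8_t)((rel >> 16) & 0xFFu);
        buf[pos++] = (uint8_t)((rel >> 24) & 0xFFu);
        count++;
    }
    while (pos < len)
        buf[pos++] = OP_NOP;
    return count;
}
```

Execution that lands in the middle of a JMP's offset bytes can still decode as garbage, which is the reason for the NOP runs between the jumps.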

|>I'd start hacking myself but my box is in Australia still getting
|>modded...*sigh*...not much longer to wait.
|
|
| You're very welcome to join in.
|
| -Andy

I'd love to join in the hacking, better dust off my assembler books and
stuff though, I haven't used that part of my brain for a while and all
my stuff is still referencing the 8086/8080 architectures...sheesh.

_______________________________________________
Xbox-linux-devel mailing list
Xbox-linux-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xbox-linux-devel
Andy Green
2003-04-15 11:53:56 UTC
Post by Ivan Hawkes
| A good method. But the problem is present with a minimal startup
| footprint, as soon as enough code - pretty much any code, it
| seems, is added to see the symptoms.
That's going to make it hard to crack.
Right, it's the slippery issues that don't stay still that are the
worst.
Post by Ivan Hawkes
Yikes, nice piece of engineering there. I take it you did
electronics at Uni? Was it the work you did on LPCs that drew you
to the XBox project or was that simply a happy co-incidence?
I left school at 16 :-) I recently tried to take a degree course at a
nearby University, but I am ashamed to say I found the Xbox stuff a
lot more interesting than the course material and just stopped
turning up. I only met the LPC bus for the first time on the Xbox.
Post by Ivan Hawkes
If it's not reproducable then you haven't isolated the bug. If you
can't explain "why" a problem goes away then you haven't solved it
- you've merely moved it a little further into your future. I get
decididly edgy if a problem (intermittent) just goes
away...software bugs don't fix themselves.
Yeah, anyone with serious bugtracking experience not only knows this,
they KNOW it and have it tattooed on the inside of their eyelids.
Post by Ivan Hawkes
Adding NOP into a data segment (god, I'm so old, do they even still
have the segmented memeory architecture on Pentium class
processors?) is more likely to cause data weirdness than zeroing
it. That's because 0 is a terminating character for strings, so if
Now you are showing your age :-) 0x00 for NOP - isn't that Z80 or
something? It's 0x90 on i386.
Post by Ivan Hawkes
One idea might be to seed the code area/data area with long jumps
to an exeception handler block, preferrably one that rings the bell
The problem is I no longer believe that it's that kind of code bug. I
enabled the paging stuff in Cromwell, so that you get page faults if
you go right off the rails. I also added an exception handler which
dumps the exception details and gives a dump of the last
instructions. This showed details of a problem on v1.1 boxes: when
they overheat, you get video artefacts and then a crash. The
exception handler showed that RAM fetches were mangled. IIRC FFs were
wrongly seen, and sometimes just one bit was set when it shouldn't
be.
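
For telling those two patterns apart in a dump, a sketch along these lines could help. It compares what was fetched against what should have been there and counts all-ones reads, single-bit errors and everything else. The names `flipped_bits`, `fault_summary` and `summarise_faults` are invented for this example and are not part of the Cromwell exception handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bit positions where the observed word differs from expected. */
unsigned flipped_bits(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;
    unsigned n = 0;
    while (diff != 0) {
        diff &= diff - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

struct fault_summary {
    size_t all_ones;    /* word read back as 0xFFFFFFFF */
    size_t single_bit;  /* exactly one bit differs */
    size_t other;       /* any other mismatch */
};

/* Classify every mismatching word between the two buffers. */
void summarise_faults(const uint32_t *expected, const uint32_t *observed,
                      size_t words, struct fault_summary *out)
{
    size_t i;
    assert(expected != NULL && observed != NULL && out != NULL);
    out->all_ones = out->single_bit = out->other = 0;
    for (i = 0; i < words; i++) {
        if (observed[i] == expected[i])
            continue;
        if (observed[i] == 0xFFFFFFFFu)
            out->all_ones++;
        else if (flipped_bits(expected[i], observed[i]) == 1)
            out->single_bit++;
        else
            out->other++;
    }
}
```

A dump dominated by single-bit errors would point more towards marginal RAM timing or heat than towards code running off the rails.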

I'm hoping that when a bad box can be made to go wrong when someone is
using a filtror on it (this is currently Ed or me), the exception
dump stuff might give strong indications of where to look next, or at
a minimum new information for us to marvel at.

-Andy
Marko Friedemann
2003-04-15 12:56:35 UTC
Post by Andy Green
I'm hoping that when a bad box can be made to go wrong when someone is
using a filtror on it (this is currently Ed or me), the exception
Please, Andy, define 'bad box'. Is there a specific setup known to illuminate
them like 'HEY, BAD BOX RIGHT HERE' or is it just luck (excuse the word here ;)
) to stumble upon one? Is a bad box bad in the sense that it hangs/crashes
predictably/deterministically all the time?

The reason I'm asking is the way my box behaved with Ed's 0.3.0. I understood
there were some problems regarding that particular distro, but what my box did
was kinda strange. Crashed/hung every time I tried booting (boot.iso for booting
from unused space), at different points (seemingly dependent on the time the
xbox had already been switched on). When 'warm', it crashed right away after the
USB init of _romwell with steady red led. When 'cold' it went on for varying
times, sometimes it even came right to X, only to crash there. Actually, the
very first time I tried the installation image I was able to start the
installation only to find it just had hung in the midst of it.
0.1.0 (which I tried after) and now 0.3.1 work fine without any problem up until
now. Sorry if this has nothing to do with the instability issue, but is rather a
known fact, but I kinda must have missed that info then.

Marko
BTW, Andy, do you happen to know how to align the LE33CZ for the USB 5V->3.3V
version of the cheapLPC? Having no indication of in/out on it, I think I
reversed the thing (went VERY hot when I put the 5V on and did not output 3.3V
but rather 0.1V).
--
www.bmx-chemnitz.de -==- 20 Zoll in Chemnitz
***@bmx-chemnitz.de
Andy Green
2003-04-15 13:24:33 UTC
Post by Marko Friedemann
Post by Andy Green
I'm hoping that when a bad box can be made to go wrong when
someone is using a filtror on it (this is currently Ed or me),
the exception
Please, Andy define 'bad box'. Is there a specific setup known to
illuminate them like 'HEY, BAD BOX RIGHT HERE' or is it just luck
(excuse the word here ;) ) to stumble upon one? Is a bad box bad in
the sense it hangs/crashes predictable/deterministic all the time?
Well I use the word 'bad' only by default since I don't know how to
describe it in a better way. The word came about because some people
seem to have boxes which are far more prone to symptoms than the ones
the developers are using. For example, Ryan Shoff's NTSC box, Paul
B's NTSC box, Grez's boxes, all choke or have difficulties on
Cromwell versions which worked fine - days of uptime in my case -
here. So it seemed that some boxes are more likely to show symptoms
than others.
Post by Marko Friedemann
The reason I'm asking is the way my box behaved with Ed's 0.3.0. I
understood there were some problems regarding that particular
distro, but what my box did was kinda strange. Crashed/hung every
time I tried booting (boot.iso for booting from unused space), at
different points (seemingly dependent on the time the xbox had
already been switched on). When 'warm', it crashed right away after
the USB init of _romwell with steady red led. When 'cold' it went
on for varying times, sometimes it even came right to X, only to
crash there. Actually, the very first time I tried the installation
image I was able to start the installation only to find it just had
hung in the midst of it. 0.1.0 (which I tried after) and now 0.3.1
work fine without any problem up until now. Sorry if this has
nothing to do with the instability issue, but is rather a know
fact, but I kinda must have missed that info then.
Is this a v1.1 Xbox? With the lid off? Because I saw that kind of
flakiness here with that setup.

Otherwise, congratulations, it's a "bad" one :-)

Grez also noted that on his bad boxes, the problems were correlated
with heat, and yet, even on a hot box that was refusing to start with
Cromwell, swapping it for a native BIOS --> flawless boot. So the
problem belongs to Cromwell somehow, no doubt. It's as if something
that Cromwell does in its init makes the Xbox more sensitive to the
heat, as if, as Ivan suggests, we are looking at CPU clock, RAM
waitstates, something of that nature being set differently in
Cromwell.
Post by Marko Friedemann
Marko
BTW, Andy, do you happen to know how to align the LE33CZ for the
USB 5V->3.3V version of the cheapLPC? Having no indication of
in/out on it, I think I reversed the thing (went VERY hot when I
put the 5V on and did not output 3.3V but rather 0.1V).
Yeah, it's a good way to tell if you got it the wrong way around :-)

Google is your friend for these kinds of questions; for almost
everything nowadays a datasheet is only a Google away. I used to
have to bum databooks from distributor reps, those were the days.

http://www.premier-electric.com/files/STM/pdf/LExx.pdf

Top of page 3, note "Bottom view"

-Andy
Marko Friedemann
2003-04-15 14:02:12 UTC
Post by Andy Green
Is this a v1.1 Xbox? With the lid off? Because I saw that kind of
flakiness here with that setup.
No, it's not. This happened on v1.0 (PAL) with lid on.
Post by Andy Green
Otherwise, congratulations, its a "bad" one :-)
OK. Looks as if I have to build a filtror then?
Post by Andy Green
Grez also noted that on his bad boxes, the problems were correlated
with heat, and yet, even on a hot box that was refusing to start with
Cromwell, swapping it for a native BIOS --> flawless boot. So the
I can confirm this. I could play games just fine, even just after a
crash/hang under Xromwell (I did not flash cromwell onto the lf020 but used evox
+ xromwell instead). Having the box switched on for any given amount of time
(including game play, short DVD session, ...) resulted in a crash not far
into the booting process every time (with 0.3.0 that was; 0.3.1 with
boot_riva.iso boot image seems to run fine, had this running for 5 hours
straight yesterday).
Post by Andy Green
Yeah, its a good way to tell if you got it the wrong way around :-)
Google is your friend for these kind of questions, for almost
everything nowadays a datasheet is only a Google away. I used to
have to bum databooks from distributor reps, those were the days.
Yeah, you're right, of course. Information at your fingertips ;)
Post by Andy Green
http://www.premier-electric.com/files/STM/pdf/LExx.pdf
Top of page 3, note "Bottom view"
Thanks a lot.

Marko
--
www.bmx-chemnitz.de -==- 20 Zoll in Chemnitz
***@bmx-chemnitz.de
Andy Green
2003-04-15 15:01:31 UTC
Post by Marko Friedemann
Post by Andy Green
Otherwise, congratulations, its a "bad" one :-)
OK. Looks as if I had to build a filtror then?
Post by Andy Green
Grez also noted that on his bad boxes, the problems were
correlated with heat, and yet, even on a hot box that was
refusing to start with Cromwell, swapping it for a native BIOS
--> flawless boot. So the
I can confirm this. I could play them games just fine, even just
after a crash/hang under Xromwell (I did not flash cromwell onto
the lf020 but used evox + xromwell instead). Having the box
switched on for any given amount of time (including game play,
short DVD session, ...) had resulted in a crash not far into the
booting process every time (with 0.3.0 that was, 0.3.1 with
boot_riva.iso boot image seems to run fine, had this running for 5
hours straight yesterday).
It does increasingly seem like there is some register or something set
by _romwell which puts the thing in a more fragile state.

I mean, you're booting via Xromwell - the registers have already been
inited by the original MS BIOS code. So it shouldn't be that we
missed out some init in Cromwell, more that we are crapping over some
good inherited init with bad values somehow.

Maybe that is a good way forward.

1) Find a bad box

2) Stick EvoX or something on it

3) Keep booting Xromwell versions, each time removing more and more
init code (after all, the box is fully inited by EvoX)

4) Goto 3 until crashing stops
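
The loop in steps 3 and 4 can be sketched as below. Everything named here is hypothetical: `crash_test_fn` stands for "build and boot an Xromwell image with only these init steps and report whether it crashed", and `demo_crashes` is a stand-in used only to exercise the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hook: boots an image built with only the enabled init
 * steps and reports whether it crashed (non-zero) or survived (0). */
typedef int (*crash_test_fn)(const int *enabled, size_t steps);

/* Disable init steps one at a time, from the last to the first, and
 * retest after each removal. Returns the index of the step whose removal
 * made the crash stop, or -1 if the box never crashed in the first place
 * or still crashes with every step removed. */
int find_suspect_step(int *enabled, size_t steps, crash_test_fn crashes)
{
    size_t i;

    assert(enabled != NULL && crashes != NULL);
    if (!crashes(enabled, steps))
        return -1;  /* nothing to hunt on this box */
    for (i = steps; i-- > 0; ) {
        enabled[i] = 0;
        if (!crashes(enabled, steps))
            return (int)i;
    }
    return -1;
}

/* Stand-in for a real boot test: "crashes" while step 2 is enabled. */
int demo_crashes(const int *enabled, size_t steps)
{
    return steps > 2 && enabled[2];
}
```

Because the instability is sensitive to code layout, a step flagged this way would still need confirming with the same amount of NOP padding in its place before blaming what the step actually does.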

-Andy
Ivan Hawkes
2003-04-15 13:03:16 UTC
Andy Green wrote:

<SNIP>

|>Yikes, nice piece of engineering there. I take it you did
|>electronics at Uni? Was it the work you did on LPCs that drew you
|>to the XBox project or was that simply a happy co-incidence?
|
|
| I left school at 16 :-) I recently tried to take a degree course at a
| nearby University, but I am ashamed to say I found the Xbox stuff a
| lot more interesting than the course material and just stopped
| turning up. I only met the LPC bus for the first time on the Xbox.

Heh heh, I made it a little further, just far enough to spend two years
in my course then drop out. The tech I was using outside uni was more
interesting than playing on the VAXes and PDP-11s.
|
|>If it's not reproducable then you haven't isolated the bug. If you
|>can't explain "why" a problem goes away then you haven't solved it
|>- you've merely moved it a little further into your future. I get
|>decididly edgy if a problem (intermittent) just goes
|>away...software bugs don't fix themselves.
|
|
| Yeah, anyone with serious bugtracking experience not only knows this,
| they KNOW it and have it tattooed on the inside of their eyelids.

Agreed.

|>Adding NOP into a data segment (god, I'm so old, do they even still
|>have the segmented memeory architecture on Pentium class
|>processors?) is more likely to cause data weirdness than zeroing
|>it. That's because 0 is a terminating character for strings, so if
|
|
| Now you are showing your age :-) 0x00 for NOP - isn't that Z80 or
| something? Its 0x90 on i386.

Possibly, but what I meant was that using 0x00 would be great for data
and using NOP would be great for code.

NOTE: Although I programmed some Z80s in BASIC I never got to do it in
assembler. My assembler experience is strictly 6502/PDP-11/80x86 in that
order. I just located some handy online (PDF) processor architecture
manuals a few minutes ago to help brush up - thanks Intel!

|>One idea might be to seed the code area/data area with long jumps
|>to an exeception handler block, preferrably one that rings the bell
|
|
| The problem is I no longer believe that it's that kind of code bug. I
| enabled the paging stuff in Cromwell, so that you get page faults if
| you go right off the rails. I also added an exception handler which
| dumps the exception details and gives a dump of the last
| instructions. This showed details of a problem on v1.1 boxes when
| they overheat, you get video artefacts and then a crash. The
| exception handler showed that RAM fetches were mangled, IIRC FFs were
| wrongly seen, and sometimes just one bit was set when it shouldn't
| be.
|
| I'm hoping that when a bad box can be made to go wrong when someone is
| using a filtror on it (this is currently Ed or me), the exception
| dump stuff might give strong indications of where to look next, or at
| a minimum new information for us to marvel at.
|
| -Andy

If it's a heat related issue then that someone may well be me, since I
am about to put a 7200RPM 120GB drive into mine.

This may sound odd, but much conversation has centred around the video
glitches (signs of RAM getting corrupted - either by poorly operating
chips or code gone hog wild) and other stuff that sounds heat related.
Has anyone tried *underclocking* the CPU/GPU to see if that helps
alleviate the symptoms? I know we all would rather overclock (so tempted
to bang a high spec CPU in there with some massive cooling block ;-> )
but underclocking could produce greater stability in the CPU/GPU/Memory.
The best behaved software in the world is going to have a hard time
coping with malfunctioning hardware. Has any research been done into the
clock rates of the CPU and fiddling with the multipliers, etc.?

Also, a test harness that sits there and works the CPU/GPU/Memory
without doing anything too complicated would be an idea (e.g. walking
bit patterns over the memory space ala memchecker). Try to see if it is
code related or heat related or gamma rays from outa-space...
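
A walking-bit pass of the kind suggested here could look roughly like the following C sketch. It assumes the region under test is ordinary writable RAM addressed as 32-bit words; the names walk_result, popcount32 and walking_ones are illustrative only. Mismatches that differ from the expected value in exactly one bit are counted separately, since single stray bits were among the symptoms reported in this thread.

```c
#include <stdint.h>
#include <stddef.h>

/* Totals for one full test: words that read back wrong, and how many
 * of those differed from the expected pattern in exactly one bit. */
struct walk_result {
    size_t bad_words;
    size_t single_bit_errors;
};

/* Count the set bits in a 32-bit word. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* For each of the 32 bit positions, fill the region with a word that
 * has only that bit set, then read the region back and compare. */
static struct walk_result walking_ones(volatile uint32_t *mem, size_t words)
{
    struct walk_result r = {0, 0};
    for (unsigned bit = 0; bit < 32; bit++) {
        uint32_t pattern = (uint32_t)1 << bit;
        for (size_t i = 0; i < words; i++)
            mem[i] = pattern;
        for (size_t i = 0; i < words; i++) {
            uint32_t diff = mem[i] ^ pattern;
            if (diff) {
                r.bad_words++;
                if (popcount32(diff) == 1)
                    r.single_bit_errors++;
            }
        }
    }
    return r;
}
```

Running the same pass on a cold box and again on a hot one would give a rough indication of whether errors follow temperature rather than code path.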

- -------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
Xbox-linux-devel mailing list
Xbox-linux-***@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xbox-linux-devel
Andy Green
2003-04-15 13:31:32 UTC
Permalink
Post by Ivan Hawkes
Heh heh, I made it a little further, just far enough to spend two
years in my course then drop out. The tech I was using outside uni
was more interesting than playing on the VAXes and PDP-11s.
It's good to know I am in good company, there are a few of us failed
academics around :-)
Post by Ivan Hawkes
If it's a heat related issue then that someone may well be me,
since I am about to put a 7200RPM 120GB drive into mine.
This may sound odd, but much conversation has centred around the
video glitches (signs of RAM getting corrupted - either by poorly
operating chips or code gone hog wild) and other stuff that sounds
heat related. Has anyone tried *underclocking* the CPU/GPU to see
if that helps alleviate the symptoms? I know we all would rather
overclock (so tempted to bang a high spec CPU in there with some
massive cooling block ;-> ) but underclocking could produce greater
stability in the CPU/GPU/Memory. The best behaved software in the
world is going to have a hard time coping with malfunctioning
hardware. Has any research been done into the clock rates of the
CPU and fiddling with the multipliers, etc.?
Also, a test harness that sits there and works the CPU/GPU/Memory
without doing anything too complicated would be an idea (e.g.
walking bit patterns over the memory space ala memchecker). Try to
see if it is code related or heat related or gamma rays from
outa-space...
This is actually a smart idea.... but in fact I have zero knowledge
on how the CPU clock is generated in the Xbox, no idea how to jiggle
it.

Have to interpret any results quite carefully, since lower clock -->
less heat. So a meaningful test would have to try to maintain the
die temperature while reducing the clock.

The CPU and memory clocks may be locked together too, reducing the
amount of information that the test can give.

But it's an interesting idea, if anyone has any concept of the CPU
clock generation or how to control it, speak up.

- -Andy
Paul Bartholomew
2003-04-15 15:15:07 UTC
Permalink
Hi Andy -
Post by Marko Friedemann
Post by Andy Green
1) Find a bad box
...
As you know, I've got a 'bad' xbox.
Post by Marko Friedemann
Post by Andy Green
2) Stick EvoX or something on it
3) Keep booting Xromwell versions, each time removing more and more
init code (after all, the box is fully inited by EvoX)
4) Goto 3 until crashing stops
I can give this a try. I think I should have some free time
tonight.

I haven't looked at _romwell code in a while. If you can point out
specific places where I should start removing code, that would be a
help.

- Paulb

-----Original Message-----
From: xbox-linux-devel-***@lists.sourceforge.net
[mailto:xbox-linux-devel-***@lists.sourceforge.net]On Behalf Of Andy Green
Sent: Tuesday, April 15, 2003 11:02 AM
To: xbox-linux-***@lists.sourceforge.net
Subject: Re: [Xbox-linux] Datapoint on the USB stack.


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Post by Marko Friedemann
Post by Andy Green
Otherwise, congratulations, it's a "bad" one :-)
OK. Looks as if I had to build a filtror then?
Post by Andy Green
Grez also noted that on his bad boxes, the problems were
correlated with heat, and yet, even on a hot box that was
refusing to start with Cromwell, swapping it for a native BIOS
--> flawless boot. So the
I can confirm this. I could play them games just fine, even just
after a crash/hang under Xromwell (I did not flash cromwell onto
the lf020 but used evox + xromwell instead). Having the box
switched on for any given amount of time (including game play,
short DVD session, ...) had resulted in a crash not far into the
booting process every time (with 0.3.0 that was, 0.3.1 with
boot_riva.iso boot image seems to run fine, had this running for 5
hours straight yesterday).
It does increasingly seem like there is some register or something set
by _romwell which puts the thing in a more fragile state.

I mean, you're booting via Xromwell - the registers have already been
inited by the original MS BIOS code. So it shouldn't be that we
missed out some init in Cromwell, more that we are crapping over some
good inherited init with bad values somehow.

Maybe that is a good way forward.

1) Find a bad box

2) Stick EvoX or something on it

3) Keep booting Xromwell versions, each time removing more and more
init code (after all, the box is fully inited by EvoX)

4) Goto 3 until crashing stops

- -Andy
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.1 (GNU/Linux)

iD8DBQE+nB7LjKeDCxMJCTIRAuukAJ9Ecus8TJTIFb9GU85jnyQfVn0pZgCfUBa8
w7g4JPojBnqzI7cZXtLNWtE=
=SuBV
-----END PGP SIGNATURE-----



Andy Green
2003-04-15 16:12:28 UTC
Permalink
Post by Paul Bartholomew
Hi Andy -
Post by Andy Green
1) Find a bad box
...
As you know, I've got a 'bad' xbox.
Post by Andy Green
2) Stick EvoX or something on it
3) Keep booting Xromwell versions, each time removing more and more
init code (after all, the box is fully inited by EvoX)
4) Goto 3 until crashing stops
I can give this a try. I think I should have some free time
tonight.
I haven't looked at _romwell code in a while. If you can point out
specific places where I should start removing code, that would be a
help.
Hi Paul -

Sounds good, I will look at it after tea, (which will be before your
tonight)

I think the first moves should be to make a Xromwell that

1) Doesn't boot into Linux any more, just sits there pulsing tux

2) Check that it still fails

3) Make the minimal stripped down version of this

Also, when you come in by XBE, is the screen turned off as was
experienced by DaveX?

I guess a really minimal version can just change the LED lights on a
timing loop, it doesn't even need video.
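
The strip-down procedure discussed in this thread (keep removing init code until the crashing stops) can be sketched in C as a table of init steps with a cut-off. This is only an illustration under assumptions: the step names init_a, init_b and init_c are placeholders, not real Cromwell/Xromwell routines, and the stubs merely record that they ran.

```c
#include <stddef.h>

typedef void (*init_fn)(void);

struct init_step {
    const char *name; /* label for logging */
    init_fn     run;  /* the init routine itself */
};

/* Placeholder steps: each one just records that it was executed. */
static unsigned ran_mask;
static void init_a(void) { ran_mask |= 1u << 0; }
static void init_b(void) { ran_mask |= 1u << 1; }
static void init_c(void) { ran_mask |= 1u << 2; }

static const struct init_step steps[] = {
    { "init_a", init_a },
    { "init_b", init_b },
    { "init_c", init_c },
};

/* Run only the first `keep` steps of the table and return how many
 * ran. Each successive test build lowers `keep` by one; the first
 * build that stops crashing points at the step just removed. */
static size_t run_init_steps(const struct init_step *tbl, size_t count,
                             size_t keep)
{
    size_t ran = 0;
    for (size_t i = 0; i < count && i < keep; i++) {
        tbl[i].run();
        ran++;
    }
    return ran;
}
```

Cutting from the end of the table assumes later steps do not depend on the removed ones; where they do, a per-step enable flag would be needed instead of a single cut-off.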

- -Andy

PS Sorry about all this Franz mess, but I have really had my fill of
him :-/
