(Re)start and Troubleshooting Procedures

[last modified April, 2008. sdp]

The basic OSIRIS system consists of 2 PCs and a Linux data taking computer (soaric2) as well as the instrument itself. The user interface is accessed through a program called Prospero which runs on the OSIRIS Linux computer, soaric2. Prospero is a command line interface with a status window which displays the current system status, instrument configuration, and data taking parameters. The instrument computer (IC) controls image acquisition through a DOS program called "IC." Images are assembled on the IC and passed to soaric2. A second PC called the "IE" handles low level motor control and is onboard the instrument.

The overall system is run through the interface on soaric2 (a RedHat Linux machine). The instrument software on this machine uses a client-server architecture. The server is called ISIS. The three main clients are Prospero, Caliban, and the TCS Agent. In addition, ISIS communicates directly with the instrument computer (IC) through a serial link. See the system block diagram for an overview.

The actual instrument control software running on the IC is called ICIMACS: Instrument Control and Image Acquisition System. The instrument can be controlled entirely from the IC without soaric2/Prospero, but this is usually only done in an engineering mode. Under normal operating conditions, the user will interface with OSIRIS only through soaric2/Prospero.


Simple Restart

If one of the four main components (ISIS, SOAR TCS, Caliban, Prospero) hangs up, try restarting that component and then executing a "startup" at the Prospero prompt. If that does not work, exit each of the four components and restart them in the proper order: ISIS, SOAR TCS, Caliban, and then Prospero. Remember to restart these windows in the active VNC session.

If the simple Restart won't work, try restarting the IC.

Sometimes a simple restart won't get you up and running again. In this case, there may be some indication from Prospero which part of the system is not up. On the top bar of the Prospero status window you will see "UP" or "DOWN" flags indicating the status of each part of the system. You may also receive an error message in the command window indicating a problem with a particular computer/disk in the data taking system. A common example is that the data disks are not synched: see -ICSYNC/-DNSYNC below.

If the IC is indicated as the problem, you may try to restart the IC program. Type "exit" at the IC keyboard. When you are back at the DOS prompt, type "IC." When the IC has started again, type UARTINIT to initialize the communication ports in the HE. Look for a message indicating that the IC is talking to the instrument ("PONG Received from IE"). You should also see +SEQUENCER on the IC status display.

If the IC is hung completely and not responding to keyboard input, you may reboot using Ctrl-Alt-Del.

If a soft reboot won't work, try cycling the power on the IC: turn the computer off, then turn it back on. When the IC has rebooted, execute IC.

Always follow a restart of the IC with a "startup" on Prospero. If Prospero still complains, it may be necessary to restart ISIS, SOARTCS, Caliban, and then Prospero.
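
The restart logic described above can be summarized in a short Python sketch. This is only an illustration of the ordering, not part of the OSIRIS software: the function names are hypothetical, and exiting the components in reverse start-up order is an assumption (the text only specifies the order in which they are started).

```python
# Start-up order as given in the text: ISIS first, Prospero last.
START_ORDER = ["ISIS", "SOARTCS", "Caliban", "Prospero"]


def simple_restart_plan(hung):
    """Steps for the simple case: restart one component, then run 'startup'."""
    if hung not in START_ORDER:
        raise ValueError(f"unknown component: {hung}")
    return [f"restart {hung}", "type startup at the Prospero prompt"]


def full_restart_plan():
    """Return (exit_order, start_order) for a full restart.

    Exiting in reverse start-up order is an assumption made for this
    sketch; the start order itself comes from the text.
    """
    return list(reversed(START_ORDER)), list(START_ORDER)
```

For example, full_restart_plan() returns the exit list beginning with Prospero and the start list beginning with ISIS.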

Restarting the IE

If the instrument itself is not responding, the IE will have to have its power cycled at the instrument. Type quit in Prospero and the IC. Go up to the telescope and cycle the power on both of the large metal boxes attached to the side of the instrument. There is a switch near the power cord of each box. The IE will reboot on power-up and start its program (ieosiris.exe) automatically. It will take up to a minute or so for the IE to reboot. If the IE doesn't appear to come up and start ieosiris properly (i.e., you still can't talk to the instrument and Prospero claims it can't find the IE), a monitor and keyboard can be attached directly to the IE at the telescope so that the boot messages can be seen and ieosiris.exe can be executed manually (the IE is the gray colored box).

If the IC cannot communicate with the IE, check that the flat "phone" cable between the HE and IE is connected. You can also try swapping this cable to another port on the HE (don't use port 1 until further notice). The cable must always be connected to COM1 on the IE. There could also be a fiber transmission problem (see below).

A HOSTS command on the IC will definitively show whether communication from the IC is reaching the IE.

There are four mechanisms which have relative positions in the instrument (xpupil, ypupil, camfocus, grattilt) which must be reset or zeroed after an IE power cycle. Always follow a restart of the IE with a reset of the xpupil, ypupil, and camfocus mechanisms. See "Resetting Mechanisms" below.

SCSI Disks not Synched

OSIRIS uses a set of SCSI disks to transfer data rapidly from the IC to the soaric2 Linux computer. These disks are synched by the Caliban daemon. As long as Caliban is up and running, the CB flag on the IC and Prospero status displays should be +. If not, try a restart of Prospero. If Prospero still complains, open up the Caliban icon and type "> IC ping". Then try a startup again on Prospero. If it still won't work, it is usually best to quit Prospero, cycle the power on (or reboot) the IC, and restart Caliban on soaric2 (it may be necessary to quit Prospero, SOARTCS, Caliban, and ISIS and restart them beginning with ISIS as above).

If you fill up the Linux disk...

Go to bed, game over... If you are too stubborn to quit, go into the computer room and cycle the power on the IC (then follow the instructions above). This failure mode is insidious: Caliban will continuously try to transfer the image which filled the disk, even though it may appear from Prospero that the system is taking data (it is, but the data will never make it to soaric2). After filling up the soaric2 disk, you must cycle the power on the IC, soaric2, and the external SCSI disk. Shut down the IC, soaric2, and the external SCSI disk. Next power up the disk, then boot soaric2, and finally the IC (don't forget to start the IC program). Now restart ISIS, SOARTCS, Caliban, and Prospero.
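
Since recovering from a full disk is so painful, it may be worth checking the free space on the soaric2 data disk before a long sequence. The following Python sketch is one way to do that using only the standard library. It is not part of the OSIRIS software; the function name, the image size, and the safety margin are all hypothetical and should be replaced with values appropriate to the actual data directory and image format.

```python
import shutil


def disk_has_room(path, image_bytes, min_images=50):
    """Return True if the disk holding `path` has room for `min_images` images.

    `image_bytes` (size of one image on disk) and `min_images` (safety
    margin) are assumptions supplied by the caller, not OSIRIS constants.
    """
    free = shutil.disk_usage(path).free  # free space in bytes
    return free >= image_bytes * min_images
```

A call such as disk_has_room(data_dir, 4 * 1024 * 1024), where data_dir is the directory images are written to, returns False when fewer than 50 images of 4 MiB each would still fit.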

Talking to the TCS

If Prospero can't send commands to or receive information from the TCS (note the TCS status indicator in the Prospero status window), try restarting the SOAR TCS Agent from the pull down menu on soaric2 (left mouse button, "Data Acquisition" menu).

Resetting Mechanisms

Several of the motors inside the dewar need to be reset, or sent to a "home" position, after powering up the IE. These are the xpupil, ypupil, camfocus, and grating tilt. Reset these with the Prospero commands xpupil ?? reset, ypupil ?? reset, grating ?? reset, and camfocus ?? reset, where ?? is the last value you used for the given mechanism.

Fiber transmission problems

If the fiber transmission is low, communication between the IC and HE (and with the IE) will be interrupted. If you are sure the IE is up and running and connected properly to the HE, but you still cannot communicate with it, then there could be a fiber transmission problem. Have the electronics technicians perform a transmission test of the fibers or verify that one was done on setup. The fiber signal may be too weak even if light can be seen in the fiber.

Corrupted images can also be caused by bad or dirty fiber connections. If corrupted images occur, inspect the fiber connections at the IC and the HE. Small "dust balls" can "grow" at the connections and foul the signal transmission.

The "ERROR: Missing END card in FITS file#1, skipping..." Error

This is a voodoo error in the IC code. When writing a FITS image, the IC uses a template header with the current FITS keywords. For some reason, the code can barf while reading this file even with no apparent changes. Removing a blank or unneeded line from the file, saving it, and restarting often solves the problem. If you have this error, call for help. If no help is available, edit this file on the IC (fitsosir.tpl). Add or subtract a blank line ('#'), save and restart.
As of February 2007, this bug has been fixed in the IC and should not occur.
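
If there is any doubt about whether an image that reached soaric2 has a complete header, the END card can be looked for directly. The Python sketch below relies only on the FITS standard layout (80-byte header cards, headers padded to 2880-byte blocks, and a final card consisting of END followed by blanks). It checks a FITS file, not the fitsosir.tpl template, whose format is not described here. The function name and the max_blocks limit are hypothetical.

```python
CARD_LEN = 80      # every FITS header card is 80 bytes long
BLOCK_LEN = 2880   # headers are padded to a multiple of 2880 bytes


def has_end_card(data, max_blocks=20):
    """Return True if an END card appears in the first `max_blocks` blocks.

    `data` is the raw bytes of the file. Limiting the scan to a few
    blocks avoids walking through the pixel data when END is missing.
    """
    limit = min(len(data), max_blocks * BLOCK_LEN)
    for start in range(0, limit - CARD_LEN + 1, CARD_LEN):
        card = data[start:start + CARD_LEN]
        # The END card is the keyword END followed only by blanks.
        if card.rstrip(b" ") == b"END":
            return True
    return False
```

To use it, read the file in binary mode and pass the bytes to has_end_card.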

Scrambled or Badly Formed Images

The fiber run at SOAR may be long enough that image transfer is compromised if the fiber throughput is degraded. The fibers should be checked for transmission each time they are reconnected. The September 06 values show good transmission for test signals of 22 mV degradation or less. 22.8 mV is too low and has been shown to result in lost data and therefore bad images.

Typical bad images which might result from fiber problems will have blocks of pixels shifted into a different quadrant. This results from pixels being lost in transmission (except n pixels modulo 4) since OSIRIS uses the same processing chain to multiplex the pixel stream coming off the four quadrants. The pixels are delaced in the IC computer at the other end of the fiber run.
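
The effect of lost pixels on a multiplexed stream can be illustrated with a small Python sketch. This is a toy model, not the IC code: it assumes the four quadrants are interleaved one pixel at a time in a fixed order (quadrant 0, 1, 2, 3, 0, ...), and all function names are hypothetical.

```python
def multiplex(quadrants):
    """Interleave equal-length pixel lists into one stream, one pixel per quadrant in turn."""
    return [px for group in zip(*quadrants) for px in group]


def demultiplex(stream, n=4):
    """De-interleave: pixel i of the stream is assigned to quadrant i % n."""
    return [stream[i::n] for i in range(n)]


def drop(stream, start, count):
    """Simulate losing `count` consecutive pixels starting at index `start`."""
    return stream[:start] + stream[start + count:]


# Label each pixel with (source quadrant, index) so misplacement is visible.
quads = [[(q, i) for i in range(6)] for q in range(4)]
stream = multiplex(quads)

one_lost = demultiplex(drop(stream, 8, 1))   # lose 1 pixel
four_lost = demultiplex(drop(stream, 8, 4))  # lose 4 pixels
```

In this model, after one pixel is lost, every later pixel assigned to quadrant 0 actually came from quadrant 1, so blocks of the image appear in the wrong quadrant. When four pixels are lost, each quadrant still receives only its own pixels and is simply one pixel short.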

Scrambling of blocks of rows in the same quadrant has also been seen. This is not obviously a result of fiber transmission loss, but might instead be related to clocking the array in the analog electronics.

If scrambled images are seen on the image display, check that they are also seen on the real time display above the Prospero workstation. Also check the IC display while images are being taken to see if any special error messages are being generated.

Camfocus Problems in Prospero

When a camera is changed using the Prospero command

pr> camera n

where n=0-3, the camera turret rotates to change the camera and performs an automatic camfocus reset. Sometimes after a camera change one receives the following error message in the Prospero window:

pr> camera n
Setting camera...
pr> camfocus 500
Setting camfocus...
pr> camera m
Setting camera
Error: Command returned error
ERROR: Aborting motion: Total step limit exceeded.
Error: Could not reset camfocus to home position
Execution error in Prospero command.
Command: camera m

This signifies that during the camera change, the camfocus reset drove the mechanism past its low limit switch. The mechanism should not be moved via Prospero commands at this point; in particular, do not repeat

pr> camfocus 0 reset

commands. Doing so will only drive the camfocus mechanism further past the low limit switch, resulting in a physical collision that can only be fixed by warming and opening the instrument.

The current accepted procedure to recover from the initial error is to offset the camfocus mechanism via commands to the IE from the ISIS window as shown below.
IS% >IE mstatus camfocus
IS% IE> STATUS: MechNum=7 Address=6 EncBits=0000101100000000
Position=INVALID Current=255 StepDelay=2200
IS% >IE setadr 6
IS% IE> DONE: Address=6
IS% >IE offset 2000
IS% IE> DONE: Offset completed.
IS% >IE mstatus camfocus
IS% IE> STATUS: MechNum=7 Address=6 EncBits=0000101100000000
Position=INVALID Current=255 StepDelay=2200
IS% >IE offset 2000
IS% IE> ERROR: Aborting motion: Hi limit encountered.
IS% >IE mstatus camfocus
IS% IE> STATUS: MechNum=7 Address=6 EncBits=0010100100100000
Position=HILIMIT Current=255 StepDelay=2200

Notice that at the first mstatus command the camfocus position is "INVALID". After the second offset command, the camfocus has been moved to its high limit and the camfocus position is no longer INVALID. At this stage, it should be possible to return to the Prospero window and issue a camfocus 0 reset command to home the camfocus mechanism successfully. If in doubt about the procedure, please contact the Instrument Scientist.
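
The mstatus replies in the transcript above are simple KEY=VALUE lists, so the Position field can be picked out mechanically if the replies are being logged. The Python sketch below shows one way to do this. It is not part of ISIS or the IE software; the function names are hypothetical, and the only Position values it knows about are the two that appear in the transcript (INVALID and HILIMIT).

```python
import re


def parse_mstatus(reply):
    """Extract KEY=VALUE fields from an mstatus reply into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", reply))


def at_high_limit(reply):
    """True if the reply reports Position=HILIMIT, as at the end of the transcript."""
    return parse_mstatus(reply).get("Position") == "HILIMIT"
```

With the first reply in the transcript, parse_mstatus gives a Position of INVALID and at_high_limit is False; with the last reply, at_high_limit is True, which is the state in which the text says a camfocus 0 reset from Prospero should succeed.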