Friday, May 28, 2010

General - How to Troubleshoot Linux Kernel Panics

  • Problem Description:
Kernel panics on Linux are hard to identify and troubleshoot. Troubleshooting kernel panics often requires reproducing a situation that occurs rarely and collecting data that is difficult to gather.
  • Solution Summary:
This document outlines several techniques that will help reduce the amount of time necessary to troubleshoot a kernel panic.
  • Technical Discussion:
What is a kernel panic?
As the name implies, a kernel panic occurs when the Linux kernel gets into a situation where it doesn't know what to do next. When this happens, the kernel prints as much information about the cause of the problem as it can, depending on what caused the panic.

There are two main kinds of kernel panics:
Hard Panic : also known as Aieee!
Soft Panic : also known as Oops

What can cause a kernel panic?
Only modules that are located within kernel space can directly cause the kernel to panic. To see which modules are dynamically loaded, run lsmod; this shows all dynamically loaded modules (Dialogic drivers, LiS, SCSI driver, filesystem, etc.). In addition to these dynamically loaded modules, components that are built into the kernel (memory map, etc.) can cause a panic.
Since hard panics and soft panics are different in nature, we will discuss how to deal with each separately.

  • How to Troubleshoot a Hard Kernel Panic
Hard Panics Symptoms:
Machine is completely locked up and unusable.
Num Lock / Caps Lock / Scroll Lock keys usually blink.
If in console mode, dump is displayed on monitor (including the phrase Aieee!).
Similar to Windows Blue Screen.


Hard panics causes:
The most common cause of a hard kernel panic is when a driver crashes within an interrupt handler, usually because it tried to access a null pointer within the interrupt handler. When this happens, that driver cannot handle any new interrupts and eventually the system crashes. This is not exclusive to Dialogic drivers.


Hard panics information to collect:
Depending on the nature of the panic, the kernel will log all information it can prior to locking up. Since a kernel panic is a drastic failure, it is uncertain how much information will be logged. Below are key pieces of information to collect. It is important to collect as many of these as possible, but there is no guarantee that all of them will be available, especially the first time a panic is seen.
/var/log/messages: sometimes the entire kernel panic stack trace is logged there
Application/library logs (RTF, cheetah, etc.): may show what was happening before the panic
Other information about what happened just prior to the panic, or how to reproduce it
Screen dump from console. Since the OS is locked, you cannot cut and paste from the screen. There are two common ways to get this info:


Digital picture of the screen (preferred, since it's quicker and easier)
Copying screen with pen and paper or typing to another computer


If the dump is not available either in /var/log/messages or on the screen, follow these tips to get a dump:
If in GUI mode, switch to full console mode; no dump info is passed to the GUI (not even to a GUI shell).
Make sure the screen stays on during the full test run; if a screen saver kicks in, the screen won't return after a kernel panic. Use these settings to ensure the screen stays on:


setterm -blank 0
setterm -powerdown 0
setvesablank off


From console, copy dump from screen (see above).
Hard panics Troubleshooting when a full trace is available
The stack trace is the most important piece of information to use in troubleshooting a kernel panic. It is often crucial to have a full stack trace, something that may not be available if only a screen dump is provided: the top of the stack may scroll off the screen, leaving only a partial stack trace. If a full trace is available, it is usually sufficient to isolate the root cause. To identify whether or not you have a large enough stack trace, look for a line with EIP, which shows what function call and module caused the panic. In the example below, this is shown in the following line:
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
If the culprit is a Dialogic driver, you will see a module name like:
streams-xxxxDriver (xxxx = dlgn, dvbm, mercd, etc.)


Hard panic full trace example:


Unable to handle kernel NULL pointer dereference at virtual address 0000000c
printing eip:
f89e568a
*pde = 32859001
*pte = 00000000
Oops: 0000
Kernel 2.4.9-31enterprise
CPU: 1
EIP: 0010:[] Tainted: PF
EFLAGS: 00010096
EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe
eax: 00000000 ebx: f65f5410 ecx: f5e16710 edx: f65f5410
esi: 00001ea0 edi: f5e23c30 ebp: f65f5410 esp: f1cf7e78
ds: 0018 es: 0018 ss: 0018
Process pwcallmgr (pid: 10334, stackpage=f1cf7000)
Stack: 00000000 c01067fa 00000086 f1cf7ec0 00001ea0 f5e23c30 f65f5410 f89e53ec
f89fcd60 f5e16710 f65f5410 f65f5410 f8a54420 f1cf7ec0 f8a4d73a 0000139e
f5e16710 f89fcd60 00000086 f5e16710 f5e16754 f65f5410 0000034a f894e648
Call Trace: [setup_sigcontext+218/288] setup_sigcontext [kernel] 0xda
Call Trace: [] setup_sigcontext [kernel] 0xda
dlgnwput [streams-dlgnDriver] 0xe8
Sm_Handle [streams-dlgnDriver] 0x1ea0
intdrv_lock [streams-dlgnDriver] 0x0
Gn_Maxpm [streams-dlgnDriver] 0x8ba
Sm_Handle [streams-dlgnDriver] 0x1ea0
lis_safe_putnext [streams] 0x168
__insmod_streams-dvbmDriver_S.bss_L117376 [streams-dvbmDriver] 0xab8
dvbmwput [streams-dvbmDriver] 0x6f5
dvwinit [streams-dvbmDriver] 0x2c0
lis_safe_putnext [streams] 0x168
lis_strputpmsg [streams] 0x54c
__insmod_streams_S.rodata_L35552 [streams] 0x182e
sys_putpmsg [streams] 0x6f
[system_call+51/56] system_call [kernel] 0x33
system_call [kernel] 0x33
Nov 28 12:17:58 talus kernel:
Nov 28 12:17:58 talus kernel:
Code: 8b 70 0c 8b 06 83 f8 20 8b 54 24 20 8b 6c 24 24 76 1c 89 5c


Hard panics Troubleshooting when a full trace is not available
If only a partial stack trace is available, it can be tricky to isolate the root cause, since there is no explicit information about what module or function call caused the panic. Instead, only the calls leading up to the final call will be seen in a partial stack trace. In this case, it is very important to collect as much information as possible about what happened leading up to the kernel panic (application logs, library traces, steps to reproduce, etc.).
Hard panic partial trace example (note there is no line with EIP information):
ip_rcv [kernel] 0x357
sramintr [streams_dlgnDriver] 0x32d
lis_spin_lock_irqsave_fcn [streams] 0x7d
inthw_lock [streams_dlgnDriver] 0x1c
pwswtbl [streams_dlgnDriver] 0x0
dlgnintr [streams_dlgnDriver] 0x4b
Gn_Maxpm [streams_dlgnDriver] 0x7ae
__run_timers [kernel] 0xd1
handle_IRQ_event [kernel] 0x5e
do_IRQ [kernel] 0xa4
default_idle [kernel] 0x0
default_idle [kernel] 0x0
call_do_IRQ [kernel] 0x5
default_idle [kernel] 0x0
default_idle [kernel] 0x0
default_idle [kernel] 0x2d
cpu_idle [kernel] 0x2d
__call_console_drivers [kernel] 0x4b
call_console_drivers [kernel] 0xeb
Code: 8b 50 0c 85 d2 74 31 f6 42 0a 02 74 04 89 44 24 08 31 f6 0f
<0> Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing


Hard panics using kernel debugger (KDB)
If only a partial trace is available and the supporting information is not sufficient to isolate root cause, it may be useful to use KDB. KDB is a tool that is compiled into the kernel that causes the kernel to break into a shell rather than lock up when a panic occurs. This enables you to collect additional information about the panic, which is often useful in determining root cause.


Some important things to note about using KDB:
If this is a potential Cisco issue, technical support should be contacted prior to the use of KDB.
You must use a base kernel, i.e. 2.4.18 instead of 2.4.18-5 from Red Hat. This is because KDB is only available for the base kernels, and not the builds created by Red Hat. While this does create a slight deviation from the original configuration, it usually does not interfere with root cause analysis.
You will need different Cisco VOS drivers compiled for the specific kernel.


  • How to Troubleshoot a Soft Kernel Panic
Soft panics symptoms:
Much less severe than hard panic.
Usually results in a segmentation fault.
You can see an oops message: search /var/log/messages for the string "Oops".
Machine still somewhat usable (but should be rebooted after information is collected).


Soft panics causes:


Almost anything that causes a module to crash when it is not within an interrupt handler can cause a soft panic. In this case, the driver itself will crash, but will not cause catastrophic system failure since it was not locked in the interrupt handler. The same possible causes exist for soft panics as do for hard panics (i.e. accessing a null pointer during runtime).
Soft panics information to collect:


When a soft panic occurs, the kernel generates a dump that contains kernel symbols; this information is logged in /var/log/messages. To begin troubleshooting, use the ksymoops utility to turn the kernel symbols into meaningful data.
To generate a ksymoops file:


Create a new file from the text of the stack trace found in /var/log/messages. Make sure to strip off the timestamps, otherwise ksymoops will fail.


Run ksymoops on new stack trace file:


Generic: ksymoops -o [location of Application drivers] filename


Example: ksymoops -o /lib/modules/2.4.18-5/misc ksymoops.log


All other defaults should work fine


For a man page on ksymoops, see the following webpage:
http://gd.tuwien.ac.at/linuxcommand.org/man_pages/ksymoops8.html
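The timestamp-stripping step above can be scripted. A minimal sketch, assuming standard syslog prefixes ending in "kernel: "; the stand-in input file and output paths are our own so the example runs anywhere (on a real system the input would be /var/log/messages):

```shell
# Remove everything up to and including "kernel: " on each line, leaving the
# bare oops text that ksymoops can parse. The demo input mimics one line of
# /var/log/messages.
printf 'Nov 28 12:17:58 talus kernel: EIP is at _dlgn_setevmask [streams-dlgnDriver] 0xe\n' > /tmp/messages.demo
sed 's/^.*kernel: //' /tmp/messages.demo > /tmp/oops.txt
cat /tmp/oops.txt
```

The resulting file is what you feed to ksymoops with the -o option shown above.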

So you are trying to start Linux for the first time and wham! You get messages like:


Unable to mount root device.
Kernel panic - not syncing.


What do I do now? Oh, how I love Windows!!
Here's the scoop:
(1) The first part of the system that starts running is the boot loader, usually grub. This is the program that loads Linux, and/or Windows if you so desire. (The master boot record, or MBR, enables the computer to load grub.)
(2) The first thing that Grub needs to know is: where is the kernel? It gets this from the /boot/grub/grub.conf file. The way that you specify the correct drive and partition in Grub, like (hd0,0), is a little different from what you use in ordinary Linux. The kernel will be in some file named vmlinuz.
(3) Once Grub has loaded the kernel into memory, the first thing that the kernel needs to know is, where is the root filesystem? The root= parameter is passed to the kernel to provide this information. Notice that now you are talking to Linux, and you identify devices in Linux's terms, like /dev/hda23.


(4) Given this information, Linux is going to try to mount the root filesystem, i.e. prepare it for use. The most common mistake at this point is that you've specified the wrong device in step #3. Unfortunately, the message that results is rather nasty looking.
When Linux doesn't know how to proceed, as in this case, it says "kernel panic" and it stops. But even then, it tries to go down gracefully. It tries to write anything to disk that hasn't been written out (an operation called "syncing," for some darn-fool reason), and if it succeeds in doing so it will say "not syncing." What's totally misleading about this message combination is that it implies, incorrectly, that the reason for the panic is "not syncing," when actually the reason for the panic will be found in the preceding few lines.
You might see the message "tried to kill 'init'". That really means that a program called init died, which it is not allowed to ever do. init is a very special program in Linux: the first program created when the machine starts.
So, basically, when you get these messages on startup the situation looks a lot more dreadful than it actually is. You have probably just made a typo when entering the information in grub.conf.
(Another common place to make a typo is in /etc/fstab, which tells Linux where all the other drives are.)
So what do you do? If you're doing a first-time install you can just start over. Otherwise, you need to boot a separate CD-ROM, which will give you a stand-alone Linux installation from which you can edit the offending files.
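Putting steps (1) through (4) together, the grub.conf entry involved can look roughly like this. This is only a sketch; the device names, partition and kernel version are made-up examples, not values from any real system:

```
# /boot/grub/grub.conf -- minimal sketch (GRUB legacy syntax)
default=0
timeout=5
title Linux
# GRUB-speak: (hd0,0) means the first partition of the first disk
    root (hd0,0)
# root= is Linux-speak telling the kernel where the root filesystem lives
    kernel /vmlinuz-2.4.18 ro root=/dev/hda2
```

If the root= device here doesn't match reality, you get exactly the "Unable to mount root device" panic described above.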
Explained: kernel panic - not syncing - attempted to kill init
--------------------------------------------------------------------------------


When the kernel gets into a situation where it does not know how to proceed (most often during booting, but at other times), it issues a kernel panic by calling the panic(msg) routine defined in kernel/panic.c. (Good name, huh?) This is a call from which No One Ever Returns.
The panic() routine adds text to the front of the message, telling you more about what the system was actually doing when the panic occurred, and basically how big and bad the trail of debris in the filesystem is likely to be. This is where the "not syncing" part comes from, and when you see that, it's good. (panic() does try to issue a sync() system call to push all buffered data out to the hard disks before it goes down.)
The second part of the message is what was provided by the original call to panic(). For example, we find panic("Attempted to kill init!") in kernel/exit.c.
So, what does this actually mean? Well, in this case it really doesn't mean that someone tried to kill the magical init process (process #1), but simply that it tried to die. This process is not allowed to die or to be killed.


When you see this message, it's almost always at boot time, and the real messages (the cause of the actual failure) will be found in the startup messages immediately preceding this one. This is often the case with kernel panics: init encountered something really bad, and it didn't know what to do, so it died, so the kernel died too.
BTW, the kernel panic code is rather cute. It can blink lights and beep the system speaker in Morse code, and it can reboot the system automagically. Obviously the people who wrote this stuff encountered it a lot.


In diagnosing, or at least understanding, kernel panics, I find it extremely helpful to have on hand a copy of the Linux source code, which is usually stored someplace like /usr/src/linux-2.x. You can use the grep utility to locate the actual code which caused the panic to occur.
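That grep hunt can look like the sketch below. The stand-in source tree keeps the example self-contained; on a real system you would point grep at /usr/src/linux-2.x instead, and the panic string is the one discussed above:

```shell
# Search a kernel source tree for the text of a panic message to find the
# code that issued it. A tiny stand-in tree is created here so the demo runs
# without a real kernel source install.
mkdir -p /tmp/linux-demo/kernel
printf 'panic("Attempted to kill init!");\n' > /tmp/linux-demo/kernel/exit.c
grep -rn 'kill init' /tmp/linux-demo
```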

CTS System - Secure registration analysis



Get the CTS logs from this location:
http://docs.google.com/leaf?id=0B47-vpuz_NefN2VmODNiYzUtMzhlYS00OGY1LTkzYzMtN2E5ZDIxNDM3ZWFl&hl=en
  
You will find 2 files, Registration1 and Registration2.
Registration1 contains 4 different types of system bootup:
No security, MIC, LSC and AUTHString, including CTS logs and packet captures.

Registration2 contains the CTS log analysis for a secure registration using a MIC certificate.

Look for ##DEBUG## in the log files.
 
CTS registration
Logs needed:

CCM SDI/SDL traces
Sniffer capture
CTS logs:
CTS sysop logs
cdp
cca
ccafg
cmr
cma
sip
Srtp*
Secd*
tsps
rc.log
sysm


What to check?

Start with CTS SYSOP logs
Verify CCAFG logs to confirm CUCM file is retrieved
Check sysm files for system startup
If multi-screen, verify the secondaries and look for the same logs
Check CCA logs (SIP messages are included in CCA in human readable timestamps)
Check SIP logs (Epoch time)
If Security is configured check:
- SECD
Check keepalive timers and CUCM configuration
Verify CUCM Registration
Check CUCM SDI/SDL Traces
Check system has initialized successfully from Phone UI (Take ScreenShot)
Check TSPS for Midlet/Phone UI XML
Use tcpdump or utils network capture command
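Since the SIP logs use epoch time while the CCA logs use human-readable timestamps, it helps to convert between the two when lining them up. A sketch using GNU date (the epoch value is illustrative):

```shell
# Convert an epoch timestamp (seconds) from a CTS SIP log into a readable
# UTC time that can be matched against CCA/sysop log entries (GNU date)
date -u -d @1274916889 +'%Y-%m-%d %H:%M:%S'
```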

CTS System - Basic MultiPoint call analysis

The following information will help you perform a basic call control analysis for a TelePresence call between 3 CTS endpoints and a CTMS. In this call we have 1 CTS 3200, 1 CTS 3000 and 1 CTS-1300 system.

Versions:
CUCM 7.1.3.32900-4 Pub and Sub
CTS 1.6.5
CTS 3200 and CTS 3000 use ts4.local (Presentation codec).

Call from 3001, 3002 and 3003 to 4006 around 15:45 PST
Get the files from this location and look for the word: ##DEBUG##


Logs needed:

CCM SDI/SDL traces (Detailed level)
CTMS logs set to INFO level
CTMS sysop, sip, ccs, switching, alarm, rtp, keyexchange*
CTS logs:
CTS sysop logs
rtp
cca
cmr
cma
sip
Srtp*
Secd*
Keyexchange*
rc.log
sysm

Sniffer capture (Optional)
  

+++++DEBUG+++++

You will find details in CUCM SDI, SDL, CTMS and CTS traces (CCA, CMR, OSD and TSPS) for each call


What to check?

Use CTS SYSOP logs
Check CCA logs
Check SIP logs
Check CUCM traces
Check CMA
Check CMR
If Security is configured check:
  - SECD
  - KEYEXCHANGE
  - SRTP
Check CUCM configuration/infrastructure (i.e. BEA, SBCs, etc.)
Check CUCM Traces SDI/SDL traces
Use CTMS_SYSOP
Check SIP.LOGx
Check CCS
Check SWITCHING
Check RTP
Check ALARM
If Security is configured check:
  - KEYEXCHANGE
 

Troubleshooting

Verify TelePresence SW Compatibility matrix
From CTS CLI, verify Calling Services are up
Verify System status in CTS GUI
Verify system is registered
Verify Region Bandwidth in CUCM (Check Video and Audio BW)
Verify Location
Verify CSS (Calling Search Space); use the Dial Number Analyzer
Check Route Patterns
Verify SIP Trunk Security Profile matches what is configured in CTMS (Protocol, port)
Check Region between SIP trunk for CTMS and Endpoints
Verify Static/Scheduled or Ad-hoc meeting is configured in CTMS
Check packet flow
Use sniffer capture (Span port is recommended)

CTS - Certification - TelePresence 642-185 Exam

This is the material I used to clear the 642-185 exam:

  • TelePresence book: available to read for free on Google Books
  • SRND Guide 

  • Certification training
  • Room design:
Design Guides and Whitepapers

Room Requirements Guide
http://ciscoet.com/resourcelib.asp?id=259

 Six NEW Technical White Papers on Room Design
http://ciscoet.com/resourcelib.asp?loc=72

 1. RRA Assessment Process
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21821

 2. Acoustics and Lighting Basics
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21822

 3. RRA Tools and Proper Usage
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21823

 4. Environment, Power, HVAC
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21824

 5. Acoustics
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21825

 6. Lighting
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21826

 7. Continuity
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21827

 8. Site Survey
http://cisco.partnerelearning.com/peclms/lang-en/management/LMS_TrainLedInfo.asp?UserMode=0&LEDefId=21828

  • Hardware Guides for each CTSXXX System

Thursday, May 27, 2010

CTS System - TelePresence SIP H264 profile-level-id

In a SIP SDP, an H.264 video capability can sometimes appear like this:

CTMS:

a=rtpmap:96 mpeg4-generic/48000
a=fmtp:96 profile-level-id=16;streamtype=5;mode=AAC-hbr;config=11B0;sizeLength=13;indexLength=3;indexDeltaLength=3;constantDuration=480
a=rtpmap:0 PCMU/8000
a=rtpmap:99 L16/48000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
m=video 16390 RTP/AVP 112
b=TIAS:4000000
a=rtpmap:112 H264/90000
a=fmtp:112 profile-level-id=ABCDEF;packetization-mode=1


CTS:

a=rtpmap:96 mpeg4-generic/48000
a=fmtp:96 profile-level-id=16;streamtype=5;mode=AAC-hbr;config=B98C00;sizeLength=13;indexLength=3;indexDeltaLength=3;constantDuration=480
a=rtpmap:0 PCMU/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-15
m=video 28242 RTP/AVP 112
b=TIAS:4000000
a=rtpmap:112 H264/90000
a=fmtp:112 profile-level-id=4d0028;sprop-parameter-sets=R00AKAmWUgDwBDyA,SGE7iJyA;packetization-mode=1


RFC 3984 defines this (page 38).

"If the profile-level-id parameter is used for capability exchange or session setup procedure,
it indicates the profile that the codec supports and the highest level supported for the signaled profile. The profile-iop byte indicates whether the codec has additional limitations whereby only the common subset of the algorithmic features and limitations of the profiles signaled with the profile-iop byte and of the profile indicated by profile_idc is supported by the codec. For example, if a codec supports only the common subset of the coding tools of the Baseline profile and the Main profile at level 2.1 and
below, the profile-level-id becomes 42E015, in which 42 stands for the Baseline profile, E0 indicates that only the common subset for all profiles is supported, and 15 indicates level 2.1."


The question is how does the 15 convert to level 2.1?

Page 298 of the ITU spec for H.264 documents how to convert this hex value into a level.

A level to which the bitstream conforms shall be indicated by the syntax elements level_idc and constraint_set3_flag as follows.
If level_idc is equal to 11 and constraint_set3_flag is equal to 1, the indicated level is level 1b.
Otherwise (level_idc is not equal to 11 or constraint_set3_flag is not equal to 1), level_idc shall be set equal to a value of ten times the level number specified in Table A-1 and constraint_set3_flag shall be set equal to 0.

This means that to convert the last byte, you convert the hex to decimal and divide it by 10: 0x15 = decimal 21 = level 2.1.
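That arithmetic can be sketched in a couple of lines of shell (bash; the hex byte here is the 15 from the RFC example above):

```shell
# level_idc is the last byte of profile-level-id. Hex -> decimal, then the
# tens digit is the major level and the remainder the minor: 0x15 -> 21 -> 2.1
level_idc=$(printf '%d' 0x15)
echo "level $((level_idc / 10)).$((level_idc % 10))"
```

The same steps turn the 28 in the CTS offer above into decimal 40, i.e. level 4.0.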

Level limits for each are documented at http://rob.opendot.cl/index.php/useful-stuff/h264-profiles-and-levels/.  You can also look at table A-1 of the H.264 ITU spec linked below.

Internally CUCM converts these to H.241 levels.   This mapping is in the ITU-T H.241 table 8-4.
http://www.itu.int/rec/T-REC-H.241/en

Thanks to Ryan Ratliff for compiling the information above.

Wednesday, May 12, 2010

CTS System - Basic MUX analysis for P2P Call

The Cisco TelePresence displays and cameras natively support 1080p resolution and utilize digital media interfaces to connect to the Cisco TelePresence codecs. This ensures the integrity of the video signal from end to end by eliminating the need for any digital/analog conversion.

Inside the Cisco TelePresence codecs an onboard array of Digital Signal Processors (DSPs) encode the digital video signal from the cameras into Real-Time Transport Protocol (RTP) packets using the H.264 encoding and compression standard. The Cisco TelePresence codecs can encode the video from the cameras at 1080p or 720p.

The quality of the video enjoyed by the meeting participants is a function of three variables:

•Resolution (i.e., number of pixels within the image)
•Frame rate (how often those pixels are re-drawn on the display)
•Degree of compression applied to the original video signal

CTS-1000 and CTS-500 systems send only one audio and one video stream (excluding auxiliary audio and video inputs for the moment). On the other hand, CTS-3000 and CTS-3200 primary codecs process three separate audio and three separate video streams. However, these codecs do not send three separate audio streams and three separate video streams over the network. Rather, CTS-3000 and CTS-3200 primary codecs multiplex the three audio streams into one RTP stream and the three video streams into one RTP stream, and hence send only a single audio and a single video stream over the network. These streams, in turn, are de-multiplexed by the receiving codec. The multiplexing of audio and video streams performed by the CTS-3000 primary codecs is illustrated below.

Auxiliary audio and video inputs are also multiplexed into the same audio and video streams. Therefore, in the case of the CTS-1000 or CTS-500, the primary video and auxiliary video are multiplexed into one outgoing video stream; likewise the primary audio and auxiliary audio are multiplexed into one outgoing audio stream. In the case of the CTS-3000 or CTS-3200, the auxiliary video is treated as the 4th video channel and multiplexed in with the rest of the video; likewise the auxiliary audio is treated as the 4th audio channel and multiplexed in with the rest of the audio.



Cisco TelePresence systems use the RTCP extension mechanism for media multiplexing, specifically the APP packet, which allows private uses.


For more information on the MUX protocol you need to contact Cisco Systems, Inc. directly, as that information is not yet public.


This document helps you understand the architecture inside a Cisco TelePresence point-to-point call, so you can identify the network components that delay or drop RTP or MUX RTP Control Protocol (RTCP) packets.

An RTCP APP packet informs the remote peer of what we are capable of doing with respect to media multiplexing. This includes the number of media streams that an endpoint is willing to transmit and willing to receive for a given media channel of a particular type (audio or video). It also allows the receiver to specify which media positions it is willing to transmit and receive.
 
Most of the RTCP APP extensions described in this document benefit from having an explicit acknowledgment from the receiver. In this section, a generic ACK packet is defined for use with these extensions and any future extensions that may require it.  The most significant bit of the RTCP APP subtype is defined to represent an ACK bit. The least significant nibble (4 bits) of the subtype is used to carry the extension identifier of the packet being acknowledged.
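That bit layout can be checked with a little shell arithmetic. The subtype value below comes from the table later in this post (17 = MUXCTRL ACK); the variable names are ours:

```shell
# The RTCP APP subtype's most significant bit (of the 5-bit field) is the
# ACK flag and its least significant nibble carries the extension id:
# subtype 17 = 0b10001 -> ACK of extension 1 (MUXCTRL)
subtype=17
ack=$(( (subtype >> 4) & 1 ))
ext=$(( subtype & 0x0F ))
echo "ack=$ack ext=$ext"
```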

The captures you will find in this document show 1 CTS-1000 calling a CTS-500 number (3005 to 3006).
In order to decode the RTCP, decode the UDP streams on odd port numbers (RTP port + 1) as RTCP.


Use rtcp.app.subtype == (subtype number in the table below) to filter in Wireshark, for example:
sip or ((rtcp.app.subtype == 1) and (udp.port == rtpport + 1)) (look for the SDP RTP video or audio port and add 1; check the RTCP "Stream setup by SDP" packet to find the correct port)
sip or ((rtcp.app.subtype == 1) and (udp.port == 32543))
Subtypes within the 'xcts' namespace

Contact Cisco for more information about TIP (MUX protocol)

Subtype Extension
0 Reserved
1 MUXCTRL
2 Deprecated (Version 2 REFRESH)
3 Deprecated (Version 2 CONFID)
4 ECHO
5 TXFLOWCTRL
6 RXFLOWCTRL
7 MEDIAOPTS
8 REFRESH
9 Deprecated (Version 3 TUNNEL)
10 PAUSE
11 SKIP
12 SPIMAP
13 SRVLOC
14 RETRANSMIT
15 Unused
16 Reserved
17 MUXCTRL ACK
18 Reserved
19 Deprecated (Version 2 CONFID ACK)
20 Reserved for ECHO ACK
21 TXFLOWCTRL ACK
22 RXFLOWCTRL ACK
23 MEDIAOPTS ACK
24 REFRESH ACK
25 Unused
26 PAUSE ACK
27 SKIP ACK
28 SPIMAP ACK
29 SRVLOC ACK
30 Reserved for RETRANSMIT ACK
31 Reserved for ACK of subtypes 15

Subtypes within the 'xtp1' namespace

Please obtain the CTS logs and packet capture (a capture is included in each log set here).
Packet captures exist under the /nv/log directory. (The names are inverted: 3005.pcap is actually 3006.pcap and vice versa.)



Information:
3005 - 172.16.181.142
3006 - 172.16.181.140

From 3005 log files:
sysop00001.log
2010-05-26 23:34:47: INFO Local user dialing 3006
2010-05-26 23:34:49: INFO Connected to 3006

This is a screenshot of the packet capture, which contains the first RTCP messages that are negotiated at the beginning of the call.



Note that the sequence of MUXCTRL and MEDIAOPTS exchanges is just an example; several other possibilities exist, as each peer's exchange is independent. A successful MUXCTRL exchange is required before any other CTS MUX extension, including MEDIAOPTS, is sent.

From CMR logs:
cmr00002.log

You will be able to find MUX control messages logged here:

2010-05-26 23:34:49.708: SendRTCPMUXCTRL streamID 57 Xmit RTCP APP MUXCTRL ver 6 profile 0 ntp 14963260998691229330

2010-05-26 23:34:49.708: SendRTCPMUXCTRL streamID 59 Xmit RTCP APP MUXCTRL ver 6 profile 2 ntp 14963260998707030515

2010-05-26 23:34:49.959: ProcessInboundPacket streamID 57, rcv RTCP APP MUXCTRL

2010-05-26 23:34:49.960: ProcessInboundPacket streamID 59, rcv RTCP APP MUXCTRL

In the next chapter you will be able to understand each of those messages in more detail.

Sunday, May 9, 2010

CTS System - Basic P2P call between CTS 1000 and CTS 500


The following information will help you perform a basic call control analysis for a TelePresence call between 2 CTS endpoints (1 single screen each). In this call we have 1 CTS 500 and 1 CTS-1000 system and no ts4.local (presentation codec).

Versions:
CUCM 7.1.3.32900-4 Single server
CTS 1.6.4

  • Know where your devices are registered.
Each Cisco CallManager logs trace files locally. If a phone or gateway is registered to a particular CallManager, the call is processed on that CallManager when the call is initiated there. You must capture traces on that Cisco CallManager in order to debug a problem. A common mistake is to have devices registered on a Subscriber, but capture traces on the Publisher. Those trace files are nearly empty (and most definitely do not have the call in them). Another common problem is to have Device 1 registered to Cisco CallManager 1 and Device 2 registered to Cisco CallManager 2. If Device 1 calls Device 2, the call trace is on Cisco CallManager 1; if Device 2 calls Device 1, the trace is on Cisco CallManager 2. If you troubleshoot a two-way calling issue, you need the traces from both Cisco CallManagers in order to get the full picture.
  • Include DNs (phone numbers) or IP addresses (if gateways) for all devices involved in the problem. 
This enables the Technical Support engineer to quickly locate the phones and other devices involved in the problem calls. 
  • Know the approximate time of the problem in the traces. 
Multiple calls are probably made, so it helps to know the approximate time of the call that Cisco Technical Support needs to focus on.

Once you have downloaded both CUCM SDI/SDL detailed traces and both CTS traces, it's time to start the analysis.

Call from 3004 to 3006 around 15:45 PST
Get the files from this location and look for the word:

+++++DEBUG+++++

You will find details in CUCM SDI and CTS traces (CCA, CMR, OSD and TSPS) for each call


What to check?
CTS point to point
Use CTS SYSOP logs
Check CCA logs (SIP messages are included in CCA in human readable timestamps)
Check SIP logs (Epoch time)
Check CMA
Check CMR
If Security is configured check:
  - SECD
  - KeyExchange
  - SRTP
Check CUCM configuration/infrastructure (i.e. BEA, SBCs, etc.)
Check CUCM Traces SDI/SDL traces
 

Troubleshooting

Verify TelePresence SW Compatibility matrix
From CTS CLI, verify Calling Services are up
Verify System status in CTS GUI
Verify system is registered
Verify Region Bandwidth in CUCM (Check Video and Audio BW)
Verify Location
Verify CSS (Calling Search Space); use the Dial Number Analyzer
Check Route Patterns
Check packet flow
Use sniffer capture (Span port is recommended)
 

Tuesday, May 4, 2010

General - Set Up Cisco Unified Communications Manager Traces for TelePresence support

This video explains how to configure and capture traces from the Cisco CallManager server for Cisco Technical Support. When you call Cisco Technical Support, you may be asked by the TelePresence support engineer to capture these traces in order to troubleshoot the problem.
The previous document covers the general procedure, but for TelePresence you may want to add some specific options.

http://www.cisco.com/en/US/products/sw/voicesw/ps556/products_tech_note09186a0080094e89.shtml

Check Video here:

Sunday, May 2, 2010

CTS System - Basic IP SLA configuration

The following commands provide a basic configuration to enable IP SLA on a CTS system.

  • CTS System (172.16.181.57)
A couple of gotchas seen with Cisco TAC customers:
a) Starting with 1.5, IP SLA responder support is included; the responder must be configured on port 32770 or above.
b) Configuring the correct group schedule.

admin:utils ipsla responder enable
Starting IPSLA responder using port range 32770 - 33000...

utils ipsla responder initiators add net 172.16.154.0/24

  • IP SLA Router (172.16.154.1)

ip sla monitor 1
 type jitter dest-ipaddr 172.16.181.57 dest-port 32800 num-packets 1500 interval 30
 request-data-size 256
 tos 136
 owner gonzalo
 tag TelePresence_small


ip sla monitor 2
 type jitter dest-ipaddr 172.16.181.57 dest-port 32801 num-packets 3000 interval 15
 request-data-size 960
 tos 136
 owner gonzalo
 tag TelePresence_medium


ip sla monitor 3
 type jitter dest-ipaddr 172.16.181.57 dest-port 32802 num-packets 5625 interval 8
 request-data-size 1400
 tos 136
 owner gonzalo
 tag TelePresence_large


ip sla monitor 4
 type jitter dest-ipaddr 172.16.181.57 dest-port 32803 num-packets 1000
 request-data-size 200
 tos 136
 owner gonzalo
 tag TelePresence_audio


ip sla monitor  10
 type jitter dest-ipaddr 172.16.181.57 dest-port 32804 num-packets 1000
 request-data-size 200
 tos 184
 owner gonzalo
 tag voice-only-calls

ip sla group schedule 1 10,1-4 schedule-period 1 frequency 60 start-time now age forever
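Once the group schedule is running, the collected jitter and latency statistics can be inspected on the router. With the pre-12.4(4)T "ip sla monitor" syntax used above, the verification commands are:

```
show ip sla monitor configuration
show ip sla monitor statistics
```

On later IOS releases the "monitor" keyword is dropped (show ip sla statistics).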

Saturday, May 1, 2010

CTS System - Basic Linux Tools - Part I

  • Basic checks:
When troubleshooting CTS performance, the embedded Linux OS provides a powerful set of tools to help you verify system resource utilization.
This guide provides a basic list of tools to help you troubleshoot your CTS system.
Some of the commands described in our Administration guide provide the same functionality, as they simply invoke the same Linux commands in the backend:

http://www.cisco.com/en/US/partner/docs/telepresence/cts_admin/1_6/CLI/cts1_6cli.html

  • System status

date
uptime
CTS:>cat /proc/version
Linux version 2.6.11 (root@TSBU-ENG2.CISCO.COM) (gcc version 3.4.3) #1 Tue Feb 24 03:59:58 UTC 2009

http://www.unixguide.net/linux/linuxshortcuts.shtml


In Linux, a variety of log files are maintained, and understanding them is often vital for troubleshooting system problems. Centralized logging is provided by two daemons:

    * syslogd
    * klogd

For example, running the 'ps' command on my system gives the following output ...

CTS:>ps aux|grep [sk]*logd
  400 root        748 S   /usr/sbin/syslogd
  405 root        524 S   /usr/sbin/klogd
 2271 root        468 S   /usr/local/bin/clogd

The log files generated by these daemons, as well as the log files generated by applications like apache, squid, etc., are stored under the /var/log directory.
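A quick way to see which logs the daemons are actively writing is to list /var/log by modification time and tail the main system log. A generic sketch; the exact file names present on a CTS may differ:

```shell
# Show the most recently updated files under /var/log
ls -lt /var/log | head -5
# Tail the main system log; the file name varies by distribution
tail -n 20 /var/log/messages 2>/dev/null || tail -n 20 /var/log/syslog 2>/dev/null
```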

  • Memory:
http://mail.nl.linux.org/linux-mm/2003-03/msg00077.html

Running Process:
    ps

    CTS:>ps --help
    BusyBox v1.00-pre8 (2009.02.24-04:32+0000) multi-call binary

    Usage: ps
    Report process status

    This version of ps in CTS accepts no options.
    We can use grep to track specific processes:

    CTS:>ps | grep snmpd
     3499 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3517 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3518 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3527 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3528 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3529 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     3530 root       6824 S   snmpd -LS e d -LF n /var/log/snmpdlogPipe -p /var/run
     2851 root        576 S < grep snmpd

    cat /proc/$pid/status
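The /proc/<pid>/status file exposes per-process state and memory counters; a short sketch using the current shell's own PID as the example target:

```shell
# Print name, state, and memory usage for a process (here: this shell)
pid=$$
grep -E '^(Name|State|VmSize|VmRSS):' /proc/$pid/status
```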

    Check free Memory:

    CTS:>free
                  total         used         free       shared      buffers
      Mem:       516180       494196        21984            0       392900
     Swap:            0            0            0
    Total:       516180       494196        21984
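Note that the "free" column alone understates what is really available, since buffers and cache are reclaimable by the kernel. A rough estimate reading /proc/meminfo directly (which is where free gets its numbers):

```shell
# MemFree plus Buffers and Cached approximates reclaimable memory (values in kB)
awk '/^(MemFree|Buffers|Cached):/ {sum += $2} END {print "roughly available: " sum " kB"}' /proc/meminfo
```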

    CTS:>top
    Mem: 494260K used, 21920K free, 0K shrd, 392900K buff, 38816K cached
    Load average: 0.02, 0.16, 0.20    (State: S=sleeping R=running, W=waiting)

      PID USER     STATUS   RSS  PPID %CPU %MEM COMMAND
     4312 root     R <      736 14281  1.9  0.1 top
      678 root     S        28M   208  0.0  5.7 tsLogger
      550 root     S        28M   208  0.0  5.7 tsLogger
      205 root     S        28M     1  0.0  5.7 tsLogger
      298 root     S        28M   208  0.0  5.7 tsLogger
      555 root     S        28M   208  0.0  5.7 tsLogger
      578 root     S        28M   208  0.0  5.7 tsLogger
      212 root     S        28M   208  0.0  5.7 tsLogger
      210 root     S        28M   208  0.0  5.7 tsLogger
      299 root     S        28M   208  0.0  5.7 tsLogger
     1102 root     S        28M   208  0.0  5.7 tsLogger
      446 root     S        28M   208  0.0  5.7 tsLogger
      532 root     S        28M   208  0.0  5.7 tsLogger
      449 root     S        28M   208  0.0  5.7 tsLogger
      209 root     S        28M   208  0.0  5.7 tsLogger
      602 root     S        28M   208  0.0  5.7 tsLogger
      208 root     S        28M   205  0.0  5.7 tsLogger
      447 root     S        28M   208  0.0  5.7 tsLogger
      211 root     S        28M   208  0.0  5.7 tsLogger

    pidof
    ps | awk '{print $4"\t"$11}' | sort | uniq -c | awk '{print $2" "$1" "$3}' | sort -nr
     

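The awk pipeline above groups processes by field position, which varies between BusyBox builds. A simpler variant of the same idea, using $NF (the last field) as an assumption for where the command name lands:

```shell
# Count running processes grouped by command name, most frequent first
ps | awk 'NR > 1 {print $NF}' | sort | uniq -c | sort -nr | head -10
```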

    • Storage 
    CTS:>fdisk -l
      Disk /dev/hda: 256 MB, 256901120 bytes
      16 heads, 32 sectors/track, 980 cylinders
      Units = cylinders of 512 * 512 = 262144 bytes

         Device Boot    Start       End    Blocks   Id  System
      /dev/hda1               1           9        2288   83  Linux
      /dev/hda2              10         193       47104   83  Linux
      /dev/hda3             194         980      201472    5  Extended
      /dev/hda5             194         377       47088   83  Linux
      /dev/hda6             378         561       47088   83  Linux
      /dev/hda7             562         980      107248   83  Linux

      CTS:>df -h
      Filesystem            Size  Used Avail Use% Mounted on
      rootfs                248M  148M  100M  60% /
      /dev/root             248M  148M  100M  60% /
      /dev/hda7             101M   28M   68M  29% /nv

      CTS:>pwd

      CTS:>du --max-depth=1 -h
      du: ./sys/devices/pci0001:01/0001:01:00.0/resource1: Value too large for defined data type
      0       ./sys
      5.8M    ./bin
      1.0k    ./boot
      17k     ./dev
      259k    ./etc
      1.0k    ./home
      36M     ./lib
      9.0k    ./mnt
      du: cannot change to directory ./proc/1118/task: No such file or directory
      du: cannot change to directory ./proc/1118/fd: No such file or directory
      du: ./proc/1119/task/1119/fd/4: No such file or directory
      du: ./proc/1119/fd/4: No such file or directory
      9.7M    ./proc
      11k     ./root
      3.6M    ./sbin
      439k    ./tmp
      79M     ./usr
      20M     ./var
      0       ./debug
      28M     ./nv
      5.0k    ./nv-cache
      3.0k    ./snmp
      184M    .
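When the flash partition is filling up, sorting du output quickly identifies the biggest consumers. A generic sketch scoped to /var (on a CTS you would typically run it from / or /nv):

```shell
# List the five largest entries under /var, largest first (sizes in kB)
du -sk /var/* 2>/dev/null | sort -nr | head -5
```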
      • Processor | CPU
      cat /proc/cpuinfo
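Two quick one-liners to summarize the processor instead of reading the full /proc/cpuinfo dump:

```shell
# CPU model (first occurrence) and number of logical processors
grep -m1 'model name' /proc/cpuinfo
grep -c '^processor' /proc/cpuinfo
```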
      • Network
      This command displays active network connections and the executable file associated with each:

      netstat -anp
      • CTS specific tools:
      runtbdiag -n (--- DO NOT RUN when there is AN ACTIVE CALL ---)
      tbdiag

      When CTS logs cannot be retrieved and we need to find specific patterns before we can afford a reboot, grep becomes a useful tool.
      For example, to find exceptions in CLI commands executed from a regular CLI session:
      CTS:>pwd
      /nv/log/cli

      grep -i -n 'cmdGeneralIOExec: myCmd.destroy exception' /nv/log/cli/*
      /nv/log/cli/cli00000.log:34:2010-02-10 23:36:58,579 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:42:2010-02-12 11:29:27,116 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:58:2010-02-12 11:29:35,711 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:75:2010-02-12 11:29:41,111 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:93:2010-02-12 11:29:45,653 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:109:2010-02-12 11:30:44,192 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:125:2010-02-12 11:30:49,098 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:146:2010-02-12 11:31:02,991 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:163:2010-02-12 11:31:11,710 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:184:2010-02-12 11:36:21,453 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:205:2010-02-12 11:36:33,361 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:225:2010-02-12 11:37:04,425 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:245:2010-02-12 11:37:26,638 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:264:2010-02-12 11:38:36,594 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:280:2010-02-12 11:38:58,656 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:298:2010-02-12 11:39:22,262 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
      /nv/log/cli/cli00001.log:318:2010-02-12 11:39:54,417 DEBUG [main] - cmdGeneralIOExec: myCmd.destroy exception
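When a pattern appears many times, counting matches per file (-c) is easier to read than listing every hit. A small sketch of the same search (the /nv/log/cli path exists only on the CTS itself):

```shell
# Count matching lines per file instead of printing each one
grep -ic 'cmdGeneralIOExec: myCmd.destroy exception' /nv/log/cli/*
```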

      Please check

      http://mytelepresence.blogspot.com/2010/04/telepresence-system-collect-logs-when.html

      for more information on how to collect logs when the GUI is not available.