Creation Zone


Friday, 30 December 2005

Solaris: Improve 64-bit link time w/ LD_NOEXEC_64

Posted on 02:28 by Unknown
Linking 64-bit applications takes considerably more time than linking 32-bit applications. It has been reported that some real-world ISV {64-bit} applications link nearly 4 to 6 times slower than their 32-bit counterparts.

Remedy:
64-bit link time can be improved a bit by setting the environment variable LD_NOEXEC_64 to any non-zero value. If the link process doesn't need more than the 4 GB virtual address limit of 32-bit mode, the LD_NOEXEC_64 environment variable suppresses the automatic execution of the 64-bit link-editor (ld). Note that in 64-bit mode the linker has to deal with pointers that take twice as much room as their 32-bit counterparts; hence both the memory consumption and the link time are considerably higher {compared to 32-bit linking}.

From the man page of ld(1):
...
LD_NOEXEC_64

Suppresses the automatic execution of the 64-bit link-editor. By default, the link-editor executes the 64-bit version when the ELF class of the first input relocatable file identifies a 64-bit object. The 64-bit image that a 32-bit link-editor can create, has some limitations. However, some link-edits might find the use of the 32-bit link-editor faster.
...

To suppress the 64-bit link-editor:
        % setenv LD_NOEXEC_64 1
        % make or any custom script for linking
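
The equivalent for Bourne-compatible shells (sh/ksh) would be along these lines:
        $ LD_NOEXEC_64=1; export LD_NOEXEC_64
        $ make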
________________
Technorati tag: Solaris | OpenSolaris | Linker

Wednesday, 21 December 2005

Solaris: Estimating process memory footprint

Posted on 21:34 by Unknown
On Solaris 9 and later versions, the pmap tool can be used for capacity planning. pmap prints information about the virtual address space of a process; in other words, the pmap output represents a snapshot of a running process.

To calculate the per-user memory footprint of a process, the following simple formula can be used:

Total private memory + (Total shared memory/#instances)
-------------------------------------------------------
(Total load)

Where:
#instances is the actual number of instances of the program/application running concurrently. It is not uncommon for large applications like Oracle or Siebel to fork multiple instances of a server to improve concurrency.

Total load is the number of concurrent users connected to the application. In multi-threaded server applications, this should be the sum of all users connected to the forked processes of the server.

Total private memory is the total memory used exclusively by all the instances of the process.

Private memory is the memory used exclusively by a single process -- it is reported under the Anon column of pmap -x <pid> output.

Total shared memory is the total memory being shared by more than one process. More than one HAT mapping {in the pmap output} indicates that more than one process is actively sharing the mapping -- shared memory can be calculated as (RSS - Anon) from the pmap output.

Note:
  1. If one of the processes sharing the page tries to alter the shared code, a copy-on-write (COW) page fault occurs; the kernel then makes a private copy of the page containing those instructions for the modifying process, allowing everyone else to continue sharing the unchanged page. The newly created page becomes part of the private memory segment.

  2. The RSS (Resident Set Size) column of pmap -x <pid> shows the amount of virtual memory touched by either a read or a write operation on the process's virtual memory, and in turn the number of physical pages brought into memory as a result of those memory touches.

Example:
% cat alloc.c
#include <stdlib.h>
#include <unistd.h>

int main()
{
        int *bigArray;

        /* reserve ~400 KB of heap; the pages are never touched, so they
           remain virtual and do not show up as resident memory (RSS) */
        bigArray = (int *) malloc (sizeof(int) * 100000);

        /* keep the process alive long enough to run pmap against it */
        sleep (30);

        return (0);
}

% ./alloc &
[1] 1941

% pmap -x 1941
1941: ./alloc
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 8 8 - - r-x-- alloc
00020000 8 8 8 - rwx-- alloc
00022000 392 8 8 - rwx-- [ heap ]
FF280000 688 688 - - r-x-- libc.so.1
FF33C000 32 32 32 - rwx-- libc.so.1
FF390000 8 8 8 - rwx-- [ anon ]
FF3A0000 8 8 - - r-x-- libc_psr.so.1
FF3B0000 184 184 - - r-x-- ld.so.1
FF3EE000 8 8 8 - rwx-- ld.so.1
FF3F0000 8 8 8 - rwx-- ld.so.1
FF3FA000 8 8 8 - rwx-- libdl.so.1
FFBF6000 40 40 40 - rwx-- [ stack ]
-------- ------- ------- ------- -------
total Kb 1392 1008 120 -

This process is consuming 1,392 KB of virtual memory, with 1,008 KB being resident in {physical} memory. Out of the 1,008 KB of resident memory, 120 KB is private to this process; and 880 KB (= 1008 - 120) is shared with other processes. So, the memory footprint of this process is (total shared memory + total private memory) = 1,008 KB. Note that in this example, #instances = 1 and total load = 1 user (since only one copy of the executable is running).

Note:
Since this is a simple example, we can directly consider the value of RSS as the total memory footprint of this process.
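
A quick way to pull just the RSS total out of pmap, assuming the column layout shown above (the RSS total is the fourth field of the "total" line):
% pmap -x 1941 | grep total | awk '{print $4 " KB resident"}'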

The following example shows how to calculate the per-user memory footprint when sleep is invoked by two users:
% sleep 30 &
[1] 2272

% pmap -x 2272
2272: sleep 30
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 8 8 - - r-x-- sleep
00020000 8 8 8 - rwx-- sleep
00022000 8 8 8 - rwx-- [ heap ]
FF280000 688 688 - - r-x-- libc.so.1
FF33C000 32 32 32 - rwx-- libc.so.1
FF390000 8 8 8 - rwx-- [ anon ]
FF3A0000 8 8 - - r-x-- libc_psr.so.1
FF3B0000 184 184 - - r-x-- ld.so.1
FF3EE000 8 8 8 - rwx-- ld.so.1
FF3F0000 8 8 8 - rwx-- ld.so.1
FF3FA000 8 8 8 - rwx-- libdl.so.1
FFBFE000 8 8 8 - rw--- [ stack ]
-------- ------- ------- ------- -------
total Kb 976 976 88 -

% sleep 30 &
[1] 2274

% pmap -x 2274
2274: sleep 30
Address Kbytes RSS Anon Locked Mode Mapped File
00010000 8 8 - - r-x-- sleep
00020000 8 8 8 - rwx-- sleep
00022000 8 8 8 - rwx-- [ heap ]
FF280000 688 688 - - r-x-- libc.so.1
FF33C000 32 32 32 - rwx-- libc.so.1
FF390000 8 8 8 - rwx-- [ anon ]
FF3A0000 8 8 - - r-x-- libc_psr.so.1
FF3B0000 184 184 - - r-x-- ld.so.1
FF3EE000 8 8 8 - rwx-- ld.so.1
FF3F0000 8 8 8 - rwx-- ld.so.1
FF3FA000 8 8 8 - rwx-- libdl.so.1
FFBFE000 8 8 8 - rw--- [ stack ]
-------- ------- ------- ------- -------
total Kb 976 976 88 -

From the above outputs:
Total private memory = (88 + 88) = 176 KB
Total RSS = (976 + 976) = 1,952 KB
Total shared memory = (1,952 - 176) = 1,776 KB

Memory consumption/user (by running sleep) = (176 + (1,776/2))/2 = (176 + 888)/2 = 532 KB.

Shell script to calculate the per user memory footprint

The following simple script automates the above calculation (script credit: Khader Mohiuddin/Kesari Mandyam):
_________
% cat memfootprint
#!/bin/bash

if [ $# -ne 2 ]; then
        echo "Usage: memfootprint <process_name> <total_load>"
        exit
fi

count=0

# PIDs of all processes whose name matches the first argument
PIDS=`/usr/bin/ps -ef | grep $1 | grep -v "grep $1" | grep -v memfootprint | awk '{ print $2 }'`

# count the number of matching processes (#instances)
for pid in $PIDS
do
        echo 'pmap process :' $pid
        count=`expr $count + 1`
done

pmap -x $PIDS | grep total | awk 'BEGIN { FS = " " } {print $1,$2,$3,$4,$5} {rss+=$4} {private+=$5} END {print "Total Private mem: "private/1024" M Total RSS mem: "rss/1024" M Total Shared mem: " (rss-private)/1024 "M **** For '$2' user load: Memory Footprint is: "((private/1024)+(((rss-private)/1024)/'$count'))/'$2'" MB/user"}'

_________

eg., #1
% sleep 60&
[9] 3757

% sleep 60&
[4] 3758

% sh +x memfootprint sleep 2
pmap process : 3757
pmap process : 3758
total Kb 976 976 88
total Kb 976 976 88
Total Private mem: 0.171875 M Total RSS mem: 1.90625 M Total Shared mem: 1.73438M
**** For 2 user load: Memory Footprint is: 0.519531 MB/user

eg., #2
% sleep 60&
[1] 3783

% sleep 60&
[2] 3784

% sleep 60&
[3] 3785

% sleep 60&
[4] 3786

% sleep 60&
[5] 3787

% sleep 60&
[6] 3788

% sh +x memfootprint sleep 6
pmap process : 3787
pmap process : 3785
pmap process : 3783
pmap process : 3788
pmap process : 3784
pmap process : 3786
total Kb 976 976 88
total Kb 976 976 88
total Kb 976 976 88
total Kb 976 976 88
total Kb 976 976 88
total Kb 976 976 88
Total Private mem: 0.515625 M Total RSS mem: 5.71875 M Total Shared mem: 5.20312M
**** For 6 user load: Memory Footprint is: 0.230469 MB/user


[Updated 03/27/2008]

The shell script in this blog post was slightly modified to emit the output cleanly. Here is the modified script:

% cat memfootprint
#!/bin/sh

if [ $# -lt 2 ]; then
        echo "Usage: memfootprint <Pattern> <NumOfUsers>"
        exit
fi

WHOAMI=`/usr/ucb/whoami`

# PIDs of this user's processes whose name matches the pattern
PIDS=`/usr/bin/ps -ef | grep $WHOAMI" " | grep $1 | grep -v "grep $1" | grep -v memfootprint | grep -v dog | awk '{ print $2 }'`

for pid in $PIDS
do
        echo 'PID:' $pid
done

printf "------------------------------------------------------------------\n"

umask 0

printf "%-10s %-15s %-15s %-15s %-15s\n" PID Kbytes Resident Private Shared
printf "%-10s %-15s %-15s %-15s %-15s\n" "" "" " Kbytes" " Kbytes" Kbytes
printf "%-10s %-15s %-15s %-15s %-15s\n" --- ------ -------- ------- ------

pmap -x $PIDS | grep total | nawk -v Arg1=$2 'BEGIN { FS = " " }
{ printf "%-10s %-15s %-15s %-15s %-15s\n", "_NA_", $3, $4, $5, ""} {rss+=$4} {private+=$5} END {
        printf "%-10s %-15s %-15s %-15s %-15s\n", "---", "------", "--------", "-------", "------"
        printf "%-10s %-15s %-15s %-15s %-15s\n", "_NA_", "_NA", rss, private, (rss-private)
        printf "------------------------------------------------------------------\n"
        printf "Number of Users: %-10s\n", Arg1
        printf "Per User Memory Footprint: %13.8f Mega Bytes\n", ((private/1024)+(((rss-private)/1024)/NR))/Arg1
}'

Sample output from the enhanced script:
% sleep 60 &
[1] 4478

% sleep 60 &
[2] 4479

% sleep 60 &
[3] 4480

% sleep 60 &
[4] 4481

% sleep 60 &
[5] 4482

% sleep 60 &
[6] 4483

% ./memfootprint sleep 6
PID: 4480
PID: 4481
PID: 4483
PID: 4479
PID: 4478
PID: 4482
------------------------------------------------------------------
PID Kbytes Resident Private Shared
Kbytes Kbytes Kbytes
--- ------ -------- ------- ------
_NA_ 1352 1304 168
_NA_ 1352 1304 168
_NA_ 1352 1304 168
_NA_ 1352 1304 168
_NA_ 1352 1304 168
_NA_ 1352 1304 168
--- ------ -------- ------- ------
_NA_ _NA 7824 1008 6816
------------------------------------------------------------------
Number of Users: 6
Per User Memory Footprint: 0.34895833 Mega Bytes


To do:
Fix the PIDs in the table. Currently they are represented with "_NA_".

Suggested reading:
Process Memory Requests: Process Virtual Address Space, Memory, and Swap by Hae Hirdler
________________
Technorati tag: Solaris | OpenSolaris

Sunday, 11 December 2005

Sun Studio: debugging a multi-threaded application w/ dbx

Posted on 18:34 by Unknown

Multi-threading lets different tasks run concurrently in a single process, so multi-threaded programs can run faster on machines with multiple processors and on CPUs with multiple cores. On an SMP system (Symmetric Multi-Processing, where multiple processors share a single memory system) with no CMT (Chip Multi-Threading), software threads are executed on different processors; on an SMP system with CMT, the threads are executed on the cores and logical processors of CMP (Chip Multi-Processing) processors. As these revolutionary chip designs evolve, many important commercial applications like Oracle, SAP, Siebel, and PeopleSoft are designed to be multi-threaded.

Debugging a multi-threaded (MT, in short) application is harder than debugging a single-threaded program, where only one task runs per process at any given time, because of the number of software threads running in parallel. Thread synchronization plays an important role when concurrently running threads have to share global resources; improperly synchronized threads may starve, and can lead to unnecessary deadlocks and race conditions. So, it is good to have an MT-aware debugger handy during the development and support phases of the software life cycle, to debug threading issues.

Fortunately, on Solaris, Sun Studio's debugger, dbx, supports MT applications that use Solaris threads and/or POSIX threads. With dbx, it is possible to get information like thread state, stack trace, and locks from all threads, navigate between threads, suspend/resume threads, put break points in a thread, and perform step-by-step execution of a function in a designated thread. Note that the Solaris Modular Debugger (mdb) also supports MT programs; but this blog post concentrates on Studio's dbx.

Siebel processes were used to demonstrate the various dbx commands in the following examples. Siebel is a multi-threaded application, written in C/C++.

Core dump analysis

The following example shows some useful commands to get the stack trace of the thread in which the process crashed. For more information about dbx commands, type help or help <command> in the dbx environment, i.e., at the dbx prompt.

% ls -lh core
-rw------- 1 giri other 273M Dec 9 16:56 core

% file core
core: ELF 32-bit MSB core file SPARC Version 1, from 'siebprocmw'

% /opt/SS11/SUNWspro/prod/bin/dbx siebprocmw core
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading siebprocmw
core file header read successfully
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsbcfui.so
Reading libsbcfuiapps.so
t@1 (l@1) terminated by signal KILL (Killed)
0xfd2bc7e0: ___nanosleep+0x0008: blu _cerror ! 0xfd2206a0

Since we don't know which thread crashed the process, let's list all known threads with the threads command. (threads -all lists all threads, including zombies.)
(dbx) threads
> t@1 a l@1 ?() LWP suspended in ___nanosleep()
t@2 b l@2 MwTimerThread() LWP suspended in __pollsys()
t@3 b l@3 MwAsyncSignalThread() sleep on 0xfd874078 in __lwp_park()
t@4 b l@4 MwThread() LWP suspended in __pollsys()
t@5 b l@5 MwThread() LWP suspended in __pollsys()
o t@6 b l@6 MwThread() signal SIGABRT in __lwp_kill()
t@7 b l@7 MwThread() LWP suspended in __pollsys()
t@9 b l@9 MwThread() LWP suspended in ___nanosleep()

In the above list, t@1 is the current thread, as indicated by ">", and its start function is not known (indicated by "?()").

(dbx) thread
current thread ($thread) is t@1

(dbx) where
current thread: t@1
=>[1] ___nanosleep(0x4, 0xffbfd9a8, 0x0, 0xff000000, 0x0, 0x0), at 0xfd2bc7e0
[2] _sleep(0x64, 0x0, 0xfd2e8bc0, 0xfd0e2000, 0xfd0e2000, 0x0), at 0xfd2afaa0
[3] thr_t::do_thr_action(0xfd86ba10, 0xc, 0x1608, 0xfd86ba20, 0x1, 0x2), at 0xfd770e14
[4] thr_t::t_sleep(0xfb80f5c0, 0x0, 0xffbfdb0e, 0xffbfdb08, 0xfd8546cc, 0xffffffff), at 0xfd770c58
[5] MwWaitForMultipleObjects(0xfb80f5c0, 0x2, 0xfb80f5c8, 0x2, 0xffffffff, 0x9cd48), at 0xfd774dd4
[6] WaitForMultipleObjectsEx(0x2, 0xffbfde3c, 0x0, 0x100000, 0x0, 0x9cd48), at 0xfd77fe9c
[7] OSDNTWait::WaitForThread(0xc, 0xffffffff, 0xffbfdecc, 0xd0108, 0x1004f, 0xff8a1d64), at 0xffa7b050
[8] OSDWaitTid(0xc, 0xffffffff, 0xffbfe7c4, 0x0, 0xc, 0xc), at 0xff05f1c4
[9] scfEventFacility::scfEventFac::ShutdownCmd(0xe14450, 0x1, 0x7, 0xfe4de0f4, 0xffbfe7c8, 0xff48f8d4), at 0xff819884
[10] scfEventFacility::scfEventFac::Shutdown(0xffbfe96c, 0xff877530, 0x0, 0x5e000, 0xff874e8c, 0x5e114), at 0xff819390
[11] ScfSisDetach(0x0, 0x0, 0x0, 0xffffffff, 0xffbfe96c, 0xfc81c), at 0xff781ed4
[12] _shutdown(0x6479c, 0x0, 0x651a8, 0x651a8, 0x7, 0x0), at 0x49c7c
[13] wmain(0x12a, 0x6479c, 0x0, 0x0, 0xffbfedac, 0x6479c), at 0x4995c
[14] main(0xfd85f310, 0xc94, 0xffbfef90, 0x54, 0xfd85f310, 0xc00), at 0x4d3cc

This is not exactly what we are looking for. The above call stack shows where the current thread (t@1) is waiting. Since our interest is in finding the thread responsible for the process crash, we need to look for an o before the thread id. t@6 is the ill-fated thread in the list of all known threads; the process was killed by a SIGABRT raised in the lwp_kill routine. Note that the OS provides the necessary abstractions for creating and destroying threads, and it also has the freedom to kill malfunctioning threads when things go haywire. In this example, __lwp_kill() was called due to some event which we are going to investigate.

The thread -info <tid> command provides more information, like what exactly happened in the application code that triggered the forcible shutdown.

(dbx) thread -info t@6
Thread t@6 (0xfcb80c00) at priority 0
state: bound to l@6
base function: 0xfd770ff4: MwThread() stack: 0xfa380000[524288]
flags: BOUND|DETACHED|SUSPENDED
masked signals: SEGV
Currently active in __lwp_kill

Observe that the kernel trapped an illegal memory access and raised a SEGV signal. The default behavior for a SEGV is to shut down the process, possibly generating a core file (aka core dump). Let's switch to thread t@6 with the thread <tid> command, and get to the instruction which raised the segmentation fault.

(dbx) thread t@6
t@6 (l@6) stopped in __lwp_kill at 0xfd2bd5ec
0xfd2bd5ec: __lwp_kill+0x0008: bcc,a,pt %icc,__lwp_kill+0x18 ! 0xfd2bd5fc

(dbx) thread
current thread ($thread) is t@6

(dbx) where
current thread: t@6
=>[1] __lwp_kill(0x0, 0x6, 0x0, 0x6, 0xffff0000, 0x0), at 0xfd2bd5ec
[2] raise(0x6, 0x0, 0xfd2a1af4, 0x42770, 0xfd2e4278, 0x6), at 0xfd25d884
[3] abort(0xe15220, 0x1, 0x0, 0xa6544, 0xfd2e7298, 0x0), at 0xfd23de38
[4] SehScanInvokeTryList(0x44bd308, 0x108000, 0xfd8571c4, 0x0, 0x2, 0x0), at 0xfd74c9d4
[5] Signal_Handler::raise(0xc0000005, 0xfa37cde8, 0x0, 0x2, 0xfa37cc80, 0x1800), at 0xfd74d778
[6] Raise_Exception::operator()(0x67670, 0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd86a07c, 0x2c), at 0xfd74d8dc
[7] __sighndlr(0xb, 0xfa37d0a0, 0xfa37cde8, 0xfd74d7c8, 0x0, 0x1), at 0xfd2bc52c
---- called from signal handler with signal 11 (SIGSEGV) ------
[8] CSSSqlObj::GetTrxDbConn(0x458a7d8, 0x0, 0x1394478, 0x64c00, 0x0, 0x4611290), at 0xf91de72c

[9] CSSSqlObj::Execute(0x4611290, 0x0, 0x0, 0x0, 0x0, 0xfe4dd294), at 0xf91c7b98
[10] CSSBusComp::SqlExecute(0x4606640, 0x0, 0x0, 0x0, 0x1, 0x4b22e84), at 0xf9a9c160
[11] CSSBCBase::SqlExecute(0x4606640, 0x0, 0xfa37d6fc, 0x0, 0x1, 0xf57be3e8), at 0xf56c2294
[12] CSSBusComp::Execute(0x0, 0x0, 0x0, 0x0, 0x4606640, 0xfa37d7cc), at 0xf9a6b118
[13] CSSMsgBoardMaintSvc::UpdTaskHistory(0x44b5ae0, 0xfa37df90, 0x0, 0x4567d14, 0xf8611198, 0x489cd94), at 0xf85f2d48
[14] CSSMsgBoardMaintSvc::HandleEventDataList(0x44b5ae0, 0x43a0018, 0xff486b38, 0x0, 0xfa37e0ac, 0xf8611198), at 0xf85f5afc
[15] CSSMsgBoardMaintSvc::ReadTaskHistory(0x44b5ae0, 0x43a0018, 0xf85f4e60, 0x44b5ae0, 0x43a0018, 0x1), at 0xf85f53c0
[16] scfEventFacility::scfEventFac::CallRegSub(0x2a59448, 0x4109bd8, 0x0, 0x0, 0x8, 0x2), at 0xff81ad20
[17] scfEventFacility::scfEventFac::HandleCurrProcEvents(0xe14450, 0x7530, 0xe14450, 0xff432ef0, 0xff874e8c, 0x1),
at 0xff81b19c
[18] scfEventFacility::scfEventFac::scfEventThreadMain(0x0, 0x0, 0x0, 0x7400, 0xfa37fc90, 0xd0001), at 0xff81a7dc
[19] OSDWslThreadStart(0x101d58, 0xff81a580, 0x101d58, 0x6, 0x0, 0x101d70), at 0xff05bec8
[20] _AfxThreadEntry(0xffbfde34, 0xe9568, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
[21] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd86bed0, 0xe15220), at 0xfd771230

From the above stack trace it is clear that the binary doesn't contain the debug information necessary to show high-level (source) statements; so, let's try to get the disassembly with the dis command.

(dbx) dis GetTrxDbConn / 50
More than one identifier 'GetTrxDbConn'.
Select one of the following:
0) Cancel
1) `libsscfdm.so`#__1cPCSSModelPhysDefMGetTrxDbConn6MpkH_pnJCSSDbConn__
[non -g, demangles to: CSSModelPhysDef::GetTrxDbConn(const unsigned short*)]
2) `libsscfdm.so`#__1cJCSSSqlObjMGetTrxDbConn6kM_pnJCSSDbConn__
[non -g, demangles to: CSSSqlObj::GetTrxDbConn()const]
> 2
0xf91de6c0: GetTrxDbConn : save %sp, -96, %sp
0xf91de6c4: GetTrxDbConn+0x0004: mov %i0, %i5
0xf91de6c8: GetTrxDbConn+0x0008: ld [%i0 + 388], %i0
0xf91de6cc: GetTrxDbConn+0x000c: cmp %i0, 0
0xf91de6d0: GetTrxDbConn+0x0010: be,pn %icc,GetTrxDbConn+0x60 ! 0xf91de720
0xf91de6d4: GetTrxDbConn+0x0014: sethi %hi(0x5b400), %l6
0xf91de6d8: GetTrxDbConn+0x0018: call GetTrxDbConn+0x20 ! 0xf91de6e0
0xf91de6dc: GetTrxDbConn+0x001c: mov %o7, %o7
0xf91de6e0: GetTrxDbConn+0x0020: sethi %hi(0x2d1400), %o5
0xf91de6e4: GetTrxDbConn+0x0024: xor %l6, 88, %l4
0xf91de6e8: GetTrxDbConn+0x0028: inc 420, %o5
0xf91de6ec: GetTrxDbConn+0x002c: sethi %hi(0x1000), %l5
0xf91de6f0: GetTrxDbConn+0x0030: add %o5, %o7, %l3
0xf91de6f4: GetTrxDbConn+0x0034: add %l5, 868, %l1
0xf91de6f8: GetTrxDbConn+0x0038: add %l3, %l4, %l2
0xf91de6fc: GetTrxDbConn+0x003c: ld [%l2], %l0
0xf91de700: GetTrxDbConn+0x0040: ld [%l0 + %l1], %o4
0xf91de704: GetTrxDbConn+0x0044: cmp %o4, 0
0xf91de708: GetTrxDbConn+0x0048: be,a,pn %icc,GetTrxDbConn+0x68 ! 0xf91de728
0xf91de70c: GetTrxDbConn+0x004c: ld [%i5 + 128], %i2
0xf91de710: GetTrxDbConn+0x0050: ld [%o4 + 88], %l7
0xf91de714: GetTrxDbConn+0x0054: cmp %i5, %l7
0xf91de718: GetTrxDbConn+0x0058: bne,a,pn %icc,GetTrxDbConn+0x68 ! 0xf91de728
0xf91de71c: GetTrxDbConn+0x005c: ld [%i5 + 128], %i2
0xf91de720: GetTrxDbConn+0x0060: ret
0xf91de724: GetTrxDbConn+0x0064: restore %g0, 0, %o0
0xf91de728: GetTrxDbConn+0x0068: ld [%i2 + 188], %i1
0xf91de72c: GetTrxDbConn+0x006c: ld [%i1 - 16], %i3
0xf91de730: GetTrxDbConn+0x0070: cmp %i3, 0
0xf91de734: GetTrxDbConn+0x0074: bge,pn %icc,GetTrxDbConn+0x90 ! 0xf91de750
0xf91de738: GetTrxDbConn+0x0078: add %i2, 188, %i4
0xf91de73c: GetTrxDbConn+0x007c: clr %o0
0xf91de740: GetTrxDbConn+0x0080: call RequiredConditionIsFalse [PLT] ! 0xf94b0684
0xf91de744: GetTrxDbConn+0x0084: mov 84, %o1
0xf91de748: GetTrxDbConn+0x0088: ld [%i4], %i1
0xf91de74c: GetTrxDbConn+0x008c: ld [%i5 + 388], %i0
0xf91de750: GetTrxDbConn+0x0090: call GetTrxDbConn ! 0xf90e0e00
0xf91de754: GetTrxDbConn+0x0094: restore %g0, 0, %g0
0xf91de758: GetTrxDbConn+0x0098: unimp 0x0
...
...

To see the actual C++ statement which seg faulted, compile the binary with the -g (debug) option and reproduce the crash. If the source code is readable from the location where you run the dbx session, you will see the actual high-level statements.
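
For instance, a debug build of the affected module might look something like this (the file name and flags are purely illustrative, not the actual Siebel build line):
% CC -g -xO2 -c CSSSqlObj.cpp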


Some fun with an active process

The objective of this section is to show how to use some of the dbx commands to get useful information from a running MT process.

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
2754 giri 399M 302M sleep 59 0 0:00:34 2.0% siebmtshmw/21

% /opt/SS11/SUNWspro/prod/bin/dbx - 2754
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.5' in your .dbxrc
Reading -
Reading ld.so.1
Reading libsslcwsl.so
Reading libssscsci.so
Reading libssscscf.so
...
...
Reading libsscasvbc.so
Reading libswcasvfr.so
Attached to process 2754 with 21 LWPs
t@1 (l@1) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004: ta 8

(dbx) threads
> t@1 a l@1 ?() running in __pollsys() <- t@1 is always the default current thread under dbx
t@2 b l@2 MwTimerThread() sleep on 0xfb80f4c0 in __lwp_park()
t@3 b l@3 MwAsyncSignalThread() sleep on 0xfd774078 in __lwp_park()
t@4 b l@4 MwThread() running in __pollsys()
t@5 b l@5 MwThread() running in __pollsys()
t@6 b l@6 MwThread() sleep on 0xf9b7eb80 in __lwp_park()
t@7 b l@7 MwThread() running in __pollsys()
t@8 b l@8 MwThread() running in _so_recv()
t@9 b l@9 MwThread() sleep on 0xf927fb68 in __lwp_park()
t@10 b l@10 MwThread() sleep on 0xf877f500 in __lwp_park()
t@11 b l@11 MwThread() sleep on 0xf867fa40 in __lwp_park()
t@12 b l@12 MwThread() sleep on 0xf857fa50 in __lwp_park()
t@13 b l@13 MwThread() sleep on 0xf847fa38 in __lwp_park()
t@14 b l@14 MwThread() running in __pollsys()
t@15 b l@15 MwThread() sleep on 0xf827f490 in __lwp_park()
t@16 b l@16 MwThread() running in __pollsys()
t@17 b l@17 MwThread() sleep on 0xf807f490 in __lwp_park()
t@18 b l@18 MwThread() running in __pollsys()
t@19 b l@19 MwThread() sleep on 0xf4c7f490 in __lwp_park()
t@20 b l@20 MwThread() running in __pollsys()
t@21 b l@21 MwThread() sleep on 0xf4a7f490 in __lwp_park()

Put a break point in thread 21 (t@21) for all calls to memcpy():

(dbx) stop in memcpy -thread t@21
More than one identifier 'memcpy'.
Select one of the following:
0) Cancel
1) `libc.so.1`memcpy
2) `libc_psr.so.1`memcpy
a) All
> a
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
dbx: warning: 'memcpy' has no debugger info -- will trigger on first instruction
Will create handlers for all 2 hits
(2) stop in _private_memcpy -thread t@21 <- implicit break point set by dbx
(3) stop in _memcpy -thread t@21 <- implicit break point

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy : nop

Note that dbx is synchronous -- when any thread or lightweight process (LWP) stops, all other threads and LWPs stop as well.

(dbx) thread
current thread ($thread) is t@21

(dbx) where
current thread: t@21
=>[1] _memcpy(0x5080e14, 0xff406b38, 0x2, 0x36, 0x1, 0x6c), at 0xfe1f04c0
[2] SSstring::GetWriteBuffer(0xf4a7e6ac, 0xff406b28, 0xff874e8c, 0x32, 0x0, 0xff3b2ef0), at 0xff31ffcc
[3] sciProcState::sciBlock::FormatLatchName(0xf4a7e6ac, 0x1, 0x7, 0x853c, 0xffa30bd8, 0x8400), at 0xffa02744
[4] sciProcState::sciProcState(0x5ad31f8, 0xf9fc0000, 0xf4a7e644, 0xff406b3c, 0x0, 0x0), at 0xffa012c4
[5] sciProcState::GetSciProcState(0xf4a7e7f8, 0x26fcb8, 0x5ad31f8, 0xff88db30, 0x5f5e4, 0x61e6c90), at 0xffa014f0
[6] SciCheckShutdown(0xf4a7e8cc, 0x34151f8, 0x74, 0x26fcb8, 0x0, 0x2ef798), at 0xff9fe0e4
[7] SciGetInterrupt(0x0, 0x6a20950, 0x0, 0xf4a7e864, 0x25cd94, 0x1da84), at 0xff9fde40
[8] _smiMessageQ::ProcessMessage(0x15f85c0, 0x6a20950, 0x0, 0x0, 0x24a360, 0x32e18f0), at 0x2158e4
[9] _smiMessageQ::ProcessRequest(0x3380c48, 0x6a20950, 0x191, 0x2, 0x5ae22f0, 0x15f85c0), at 0x21461c
[10] _smiWorkQueue::ProcessWorkItem(0x15f98b8, 0x3380c48, 0x6a20950, 0x5ae2390, 0x0, 0x101f180), at 0x208d08
[11] _smiWorkQueue::WorkerTask(0x15f98b8, 0x5b7f6b8, 0x3326338, 0x1500e0, 0x0, 0x0), at 0x208764
[12] SmiThrdEntryFunc(0x32f72d8, 0x70000f, 0x700010, 0x0, 0x0, 0x0), at 0x1f7a0c
[13] OSDWslThreadStart(0x3380568, 0x1f75a0, 0x3380568, 0x15, 0x0, 0x3380760), at 0xfefdbec8
[14] _AfxThreadEntry(0xf4b7de5c, 0x3386210, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
[15] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x33cdc40), at 0xfd671230

Let's step into memcpy() with stepi, and observe how the thread state changes.

(dbx) stepi
t@21 (l@21) stopped in _memcpy at 0xfe1f04c4
0xfe1f04c4: _memcpy+0x0004: nop

(dbx) threads
t@1 a l@1 ?() running in __pollsys()
t@2 b l@2 MwTimerThread() sleep on 0xfb80f4c0 in __lwp_park()
t@3 b l@3 MwAsyncSignalThread() sleep on 0xfd774078 in __lwp_park()
t@4 b l@4 MwThread() running in __pollsys()
t@5 b l@5 MwThread() running in __pollsys()
t@6 b l@6 MwThread() sleep on 0xf9b7eb80 in __lwp_park()
t@7 b l@7 MwThread() running in __pollsys()
t@8 b l@8 MwThread() running in _so_recv()
t@9 b l@9 MwThread() sleep on 0xf927fb68 in __lwp_park()
t@10 b l@10 MwThread() sleep on 0xf877f500 in __lwp_park()
t@11 b l@11 MwThread() sleep on 0xf867fa40 in __lwp_park()
t@12 b l@12 MwThread() sleep on 0xf857fa50 in __lwp_park()
t@13 b l@13 MwThread() sleep on 0xf847fa38 in __lwp_park()
t@14 b l@14 MwThread() running in __pollsys()
t@15 b l@15 MwThread() sleep on 0xf827f490 in __lwp_park()
t@16 b l@16 MwThread() running in __pollsys()
o t@17 b l@17 MwThread() breakpoint in _memcpy()
o t@18 b l@18 MwThread() breakpoint in _memcpy()
o t@19 b l@19 MwThread() breakpoint in _memcpy()
t@20 b l@20 MwThread() running in __pollsys()
*> t@21 b l@21 MwThread() single stepped in _memcpy()

In the above example, t@17, t@18 and t@19 are stopped at calls to memcpy(); and t@21 stepped into memcpy(). Get out of memcpy() with the step up command.

(dbx) step up
_memcpy returns 84413972
t@21 (l@21) stopped in SSstring::GetWriteBuffer at 0xff31ffd4
0xff31ffd4: GetWriteBuffer+0x0114: ld [%i1 + 4], %i2

Clear the break point (in the current thread) with the clear command.

(dbx) cont
t@21 (l@21) stopped in _memcpy at 0xfe1f04c0
0xfe1f04c0: _memcpy : nop

(dbx) clear
cleared (3) stop in _memcpy -thread t@21
Locks

thread -blocks [<tid>] lists all locks held by the given thread that are blocking other threads. If tid is not specified, dbx lists the locks held by the current thread. In the following example, t@21 (the current thread) is not holding any locks.

(dbx) thread -blocks
Locks held by t@21:

thread -blockedby [<tid>] shows the synchronization object (monitor) on which the given thread is blocked. If tid is not specified, dbx shows this information for the current thread. Note that only sleeping threads are in a blocked state.

(dbx) thread -blockedby t@10
Thread t@10 is blocked by:
0xf877f500 (0xf877f500): thread condition variable

(dbx) thread -blockedby t@12
Thread t@12 is blocked by:
0xf857fa50 (0xf857fa50): thread condition variable

(dbx) thread -blockedby t@17
Thread t@17 is not asleep

The syncs command lists all synchronization objects, i.e., locks/monitors.

(dbx) syncs
All locks currently known to libthread:
0x01020320 (0x01020320): thread mutex(unlocked)
0x010203f8 (0x010203f8): thread mutex(unlocked)
0xf827f490 (0xf827f490): thread condition variable
0xf827f4a0 (0xf827f4a0): thread mutex(unlocked)
0xf877f500 (0xf877f500): thread condition variable
0xf877f510 (0xf877f510): thread mutex(unlocked)
0xf927fb68 (0xf927fb68): thread condition variable
0xf927fb78 (0xf927fb78): thread mutex(unlocked)
0xf867fa40 (0xf867fa40): thread condition variable
0xf867fa50 (0xf867fa50): thread mutex(unlocked)
0xf9b7eb80 (0xf9b7eb80): thread condition variable
0xf9b7eb90 (0xf9b7eb90): thread mutex(unlocked)
0x015c2ed8 (0x015c2ed8): thread mutex(unlocked)
0x015c2f38 (0x015c2f38): thread mutex(unlocked)
0x015c2f18 (0x015c2f18): thread mutex(unlocked)
0x015c2dd8 (0x015c2dd8): thread mutex(unlocked)
0x015c34d8 (0x015c34d8): thread mutex(unlocked)
0x03325fb8 (0x03325fb8): thread mutex(unlocked)
0x033264b8 (0x033264b8): thread mutex(unlocked)
0x033261b8 (0x033261b8): thread mutex(unlocked)
0x017a6ce8 (0x017a6ce8): thread mutex(locked)
0xfa4f4314 (0xfa4f4314): process mutex(locked)
0x0332c438 (0x0332c438): thread mutex(unlocked)
0x0332c348 (0x0332c348): thread mutex(unlocked)
0x02fcd7e8 (0x02fcd7e8): thread mutex(unlocked)
0x0028f860 (0x0028f860): thread mutex(unlocked)
__1cUCSSSISLocalTransSrvrKs_instLock_+0x8 (0xff1ee220): thread mutex(unlocked)
0x034150e8 (0x034150e8): thread mutex(unlocked)
0x034151d8 (0x034151d8): thread mutex(unlocked)
__uberdata+0x80 (0xfd168c40): thread mutex(unlocked)
0x01878b98 (0x01878b98): thread mutex(unlocked)
0x01878aa8 (0x01878aa8): thread mutex(unlocked)
0xfa4c7e9c (0xfa4c7e9c): process mutex(unlocked)
libc_malloc_lock (0xfd1676f8): thread mutex(unlocked)
0x0179cb30 (0x0179cb30): thread mutex(unlocked)
0x0179c830 (0x0179c830): thread mutex(unlocked)
0xfa5c2664 (0xfa5c2664): process mutex(unlocked)
0xfa5c2c94 (0xfa5c2c94): process mutex(unlocked)
0x0161dd90 (0x0161dd90): thread mutex(unlocked)
0x0101f6e0 (0x0101f6e0): thread mutex(unlocked)
0x0101f718 (0x0101f718): thread mutex(unlocked)
0x0101f770 (0x0101f770): thread mutex(unlocked)
0x0101f508 (0x0101f508): thread mutex(locked)
0x0101f5a8 (0x0101f5a8): thread mutex(unlocked)
0x015bfe90 (0x015bfe90): thread mutex(unlocked)
0x015bfe20 (0x015bfe20): thread mutex(unlocked)
0x015bfe58 (0x015bfe58): thread mutex(unlocked)

To get information about a synchronization object at a given address, use sync -info <address>

(dbx) sync -info 0x0028f860
0x0028f860 (0x28f860): thread mutex(unlocked)
Lock is unowned
No threads are blocked by this lock

(dbx) sync -info 0xf877f500
0xf877f500 (0xf877f500): thread condition variable

(dbx) sync -info 0xfd1676f8
libc_malloc_lock (0xfd1676f8): thread mutex(unlocked)
Lock is unowned
No threads are blocked by this lock
Tracing

The trace command can be used to trace executed source lines, function calls, or variable changes. The following example traces thread creation, and prints a message whenever a thread gets created.

(dbx) trace thr_create
(4) trace thr_create

(dbx) cont
trace: thread created t@22 on l@22
trace: thread created t@23 on l@23

Reading libsrlcver.so
Reading libsscafsbc.so
...

(dbx) threads
*> t@1 a l@1 ?() signal SIGINT in __pollsys()
t@2 b l@2 MwTimerThread() sleep on 0xfb80f4c0 in __lwp_park()
t@3 b l@3 MwAsyncSignalThread() sleep on 0xfd774078 in __lwp_park()
...
...
t@20 b l@20 MwThread() running in __pollsys()
t@21 b l@21 MwThread() sleep on 0xf4a7f490 in __lwp_park()
t@22 b l@22 MwThread() running in __pollsys() <- new thread
t@23 b l@23 MwThread() sleep on 0xea6ff490 in __lwp_park() <- new thread

In the above example, there is no information about which thread created threads t@22 & t@23. To get that information, use the when command as shown below:

(dbx) when thr_create { echo "New thread $newthread was created by thread $thread"; }
(6) when thr_create { kprint "New thread ${newthread} was created by thread ${thread}"; }
(dbx) cont
New thread t@24 was created by thread t@10
New thread t@25 was created by thread t@24

$newthread and $thread are pre-defined dbx variables, which hold the thread ID of a newly created thread and the thread ID of the current thread, respectively.

Similarly, thread exits can be traced as follows:

(dbx) trace thr_exit
(5) trace thr_exit

(dbx) cont
New thread t@26 was created by thread t@10
New thread t@27 was created by thread t@26
trace: thr_exit t@27
Suspending/Resuming threads

To suspend the execution of a thread, run the command thread -suspend <tid>; to resume the suspended thread, run thread -resume <tid>.

(dbx) thread -suspend t@26
Thread t@26 suspended

(dbx) thread -resume t@26
Thread t@26 unsuspended
Break point with stop command

The following example shows how to set a break point that stops execution when a new thread with id t@34 gets created.

(dbx) stop thr_create t@34
(9) stop thr_create t@34

(dbx) cont
t@10 (l@10) stopped in tdb_event_create at 0xfd1377e8
0xfd1377e8: tdb_event_create : retl
trace: thread created t@34 on l@34

(dbx) where <- who initiated the new thread creation? entire call stack
current thread: t@10
=>[1] tdb_event_create(0x2, 0x1084, 0x3ff, 0x0, 0xfc8e1c00, 0x1000), at 0xfd1377e8
[2] _thrp_create(0x180, 0x10f8, 0xfd1377e8, 0x1e, 0xc1, 0xfde32000), at 0xfd138c04
[3] _pthread_create(0xf877f310, 0x0, 0xfd670ff4, 0xf877f318, 0x0, 0xfd168bc0), at 0xfd12d104
[4] MwCreateThread(0x0, 0xfeb95630, 0xf877f414, 0x4, 0x0, 0x9383cb0), at 0xfd671460
[5] CreateThread(0x0, 0x0, 0xfeb95630, 0xf877f414, 0x4, 0x9383cb0), at 0xfd67d124
[6] CWinThread::CreateThread(0x9383c80, 0x4, 0x0, 0x0, 0xfd164278, 0x88cabc9), at 0xfeb95f1c
[7] AfxBeginThread(0xffa7a420, 0x88cabc0, 0x0, 0x0, 0x4, 0x0), at 0xfeb958a4
[8] WslCreateThread(0xfefdbe00, 0x5c135c0, 0x0, 0x88cabc0, 0xf877f584, 0x16b8c), at 0xffa7a4cc
[9] OSDCreateThread(0x211200, 0x5b40660, 0x0, 0x0, 0x5ab1590, 0x5c135c0), at 0xfefdc16c
[10] SmiDispatchThrdMain(0x101f180, 0x5ab1588, 0x5ab1590, 0xf877fd64, 0xf877fcec, 0xff40f8d4), at 0x1f53f4
[11] OSDWslThreadStart(0x10b8ad0, 0x1f5240, 0x10b8ad0, 0xa, 0x0, 0x15d07e8), at 0xfefdbec8
[12] _AfxThreadEntry(0xffbfeaac, 0x2f4948, 0x0, 0x1, 0x0, 0x17289c), at 0xfeb95730
[13] MwThread(0x1, 0x0, 0x1, 0x0, 0xfd76bed0, 0x15cd558), at 0xfd671230
Light Weight Processes (LWPs)

Application (user) threads are not visible to the kernel. The kernel treats light weight processes (LWPs) as the only schedulable entities within a process; LWPs bridge the user-level and kernel-level threads. Each process contains one or more LWPs, and each LWP is associated with a kernel thread. Prior to Solaris 9, each LWP could run one or more user-level threads (i.e., 1xN). From Solaris 9 onwards, there is one LWP for every user-level thread (i.e., 1x1).
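
As an aside, the one-LWP-per-thread relationship (on Solaris 9 and later) can also be observed from the shell: the NLWP column of ps shows the LWP count of a live process. A minimal sketch, using the siebmtshmw process listing shown earlier (PID 2754, 21 LWPs):
% ps -o pid,nlwp,args -p 2754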

Use the lwps command to list all LWPs in the process.

(dbx) lwps
l@1 running in _private_mprotect()
l@2 running in __lwp_park()
l@3 running in __lwp_park()
l@4 running in __pollsys()
l@5 running in __pollsys()
l@6 running in __lwp_park()
l@7 running in __pollsys()
l@8 running in _so_recv()
l@9 running in __lwp_park()
l@10 running in __lwp_park()
l@11 running in __lwp_park()
l@12 running in __lwp_park()
l@13 running in __time()
l@14 running in __pollsys()
l@15 running in __lwp_park()
l@16 running in __pollsys()
o l@17 breakpoint in SSstring::GetWriteBuffer()
l@18 running in __lwp_unpark()
o l@19 breakpoint in SSstring::GetWriteBuffer()
l@20 running in __pollsys()
*>l@21 breakpoint in SSstring::GetWriteBuffer()

The lwp command displays the current LWP. To switch to a different LWP, use lwp <lwpid>. lwp -info [<lwpid>] shows some useful information for a given LWP.

(dbx) lwp
current LWP ($lwp) is l@21

(dbx) lwp -info
l@21 breakpoint in SSstring::GetWriteBuffer()
masked signals are:

(dbx) lwp -info l@12
l@12 running in __lwp_park()
masked signals are:

(dbx) lwp l@18
t@18 (l@18) stopped in __pollsys at 0xfd13d1c4
0xfd13d1c4: __pollsys+0x0004: ta 8

Scalability issues

In general, MT applications that make heavy use of the standard {Solaris operating system's} memory allocator may exhibit poor scalability. This problem occurs when multiple threads are in malloc() or free(), waiting to obtain the malloc lock (libc_malloc_lock).

If the application suffers from this scalability issue, the tops of the thread stacks (which can be obtained with either dbx or the pstack command) will look like this:

lwp_park
mutex_lock_queue
slow_lock
free
or
lwp_park
mutex_lock_queue
slow_lock
malloc

One such problem was described in the Solaris forums thread slow_lock making application hang.
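
A quick way to gauge how many threads are piled up on the allocator lock at a given instant (assuming the stack frames look like the ones above; my_program is a placeholder name):
% pstack `pgrep my_program` | grep -c slow_lock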

MT aware memory allocators

The mtmalloc and umem libraries shipped with Solaris resolve this kind of scalability problem. libmtmalloc was introduced in Solaris 7, and libumem in Solaris 9 Update 3. These userland memory allocators are packaged as drop-in replacements for the standard malloc() and free() library calls; so, to take advantage of them, link the MT application with either of these allocators.

The mtmalloc and umem allocators are a redesign of the standard library with finer-grained locking, and they will significantly outperform the standard library when multiple concurrent requests are made to the memory allocator. For a single-threaded application, however, the standard memory allocator provides better performance, and also a smaller memory footprint. Note that the trade-off with the mtmalloc and umem allocators is a much bigger memory footprint, due to the way memory gets allocated. For these reasons the standard memory allocator may be preferred in cases where the advantages of mtmalloc and umem do not apply. Make sure to experiment with these memory allocators to see which one fits your application best.

Linking with mtmalloc or umem

At compile time, the application can be linked against the mtmalloc or umem library. Adding the -lmtmalloc or -lumem option to the link line results in the application being linked appropriately.

eg.,
% cc -mt -o my_program my_program.c -lmtmalloc
        or
% cc -mt -o my_program my_program.c -lumem

You can check the library dependency with ldd my_program.
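
For instance, if the binary was linked with -lmtmalloc, the dependency list should include an entry along these lines (the exact path may vary by release):
% ldd my_program | grep malloc
        libmtmalloc.so.1 =>      /usr/lib/libmtmalloc.so.1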

Quick workaround -- library interposition

If re-building the application to link with mtmalloc or umem is not feasible, either of these libraries can be preloaded via the LD_PRELOAD environment variable when the program is executed.

eg.,
% setenv LD_PRELOAD libmtmalloc.so
% ./my_program


or

% setenv LD_PRELOAD libumem.so
% ./my_program

You can verify whether the library is preloaded with pldd `pgrep my_program`.

Resources:
  1. Debugging a Program With dbx
  2. Multithreaded Programming Guide
  3. malloc vs mtmalloc
Suggested Reading:
  1. Welcome to the CMT Era!
  2. Improving Application Efficiency Through Chip Multi-Threading
___________________
Technorati tags: Sun Studio | dbx | CMT

Sunday, 4 December 2005

Sun Studio C/C++: Improve performance with -xtarget, -xarch

Posted on 02:28 by Unknown
Even though many software vendors no longer support the SPARC V8 architecture (i.e., the pre-UltraSPARC era), for some reason they hesitate to use the -xtarget option with any value other than generic (the default) when building their software. Perhaps they are not aware of the benefits of specifying the target platform, or have not spent enough time experimenting with different values to compare the performance.

In general, it is always recommended to specify the target platform with the -xtarget option, and the target instruction set architecture with the -xarch option, for better performance. I believe one of the major concerns {for software vendors} in specifying the target platform is the suspicion that the application may not run on a wide range of platforms. While this is true to some extent, there is still a chance to specify some value for the target platform, as long as we know that the instruction set we specify with the -xarch option is compatible with all of the architectures we support.

32-bit SPARC applications and -xtarget=ultra3 -xarch=v8plusa

For example, for a 32-bit application, if we know for sure that only UltraSPARC chip architectures will be supported, it is strongly recommended to build the application with the -xtarget=ultra3 -xarch=v8plusa options. -xarch=v8plusa selects an instruction set that works on all the members of the UltraSPARC family (US-I, II, III, III+, IV, IV+, T1 (code named Niagara)). -xchip=ultra3 (implied by -xtarget=ultra3) tells the optimizer to optimize for best execution on US-III and later systems. The code will still run well on US-I & II boxes, though possibly a little slower than if it were optimized for them.
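
A representative compile line (the file name and optimization level are illustrative):
% cc -xO3 -xtarget=ultra3 -xarch=v8plusa -c hotcode.c
Since compiler flags accumulate from left to right (see the excerpt from Darryl Gove's article below), -xarch=v8plusa is placed after -xtarget=ultra3 so that it overrides whatever -xarch value the -xtarget setting implies.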

Performance improvement from a real world application

One of our partners (an ISV, in short) has been shipping their product built with -xtarget=generic -xarch=v8plusa for the past few years. Their application supports only the UltraSPARC platform. So, recently I experimented with their application by building it with -xtarget=ultra3 -xarch=v8plusa on a US-IV machine. When the application was run on a US-III box with a moderate workload, (not so surprisingly) the run-time performance of the application improved by ~2.5% compared to the numbers from the -xtarget=generic -xarch=v8plusa build. Of course, there was no performance regression on a US-II box, where the performance is comparable to the vanilla build (i.e., built with -xtarget=generic -xarch=v8plusa); and the performance gains on a US-IV box were comparable to the gains on the US-III box.

These experiments gave the ISV enough confidence to go with the -xtarget=ultra3 -xarch=v8plusa combination; the next version of their application is being built with those options.

Note:
Do not use -xtarget=ultra3 if there is heavy use of the Sun performance library. In that case you really need separate builds for each of the target platforms, because no single optimized perflib is available that is suitable for all architectures.


Excerpts from Darryl Gove's Selecting the Best Compiler Options article

Darryl Gove, a senior performance engineer at Sun Microsystems, recently posted an article about selecting the best compiler options to improve the run-time performance of applications. Since it has a ton of information about 32/64-bit applications on the UltraSPARC and x64/x86 platforms, I thought of copying and pasting the relevant information here {for completeness}, instead of just pointing to the article.

Specify the Target Platform and Architecture as Explicitly as Possible

The target platform specifies the processor that the application is expected to run on, the minimum processor that is required, and whether the application is 32-bit or 64-bit. Compiler versions prior to the Sun Studio 9 release targeted a generic processor; Sun Studio 9 compilers target an UltraSPARC processor for the SPARC architecture, and a generic x86 based processor for the x86 architecture. In all cases it is best to explicitly specify the target processor, since it is possible in some cases for the target processor to depend on the hardware upon which the application is built.

There are a number of compiler flags that specify the target. The flag -xtarget sets all the other flags to appropriate default values for the given target processor: -xarch, -xchip, and -xcache. The flag -xarch sets the instruction set that the processor supports, and the flag -xchip specifies how the compiler should use these instructions. Finally, the flag -xcache specifies the structure of the caches for this target (however, this flag may not have any impact for many codes). As with all compiler flags, the order is important; flags accumulate from left to right, and in the event of conflicting settings the flag on the right overrides the values of flags specified earlier on the command line.

A point to be cautious of is that specifying a more recent hardware target may mean that older hardware is no longer able to run the application. In particular, specifying an UltraSPARC target means that the application will no longer run on pre-UltraSPARC processors (however, UltraSPARC processors have been shipping for over 10 years). Similarly, specifying an Opteron target means that the code will no longer run on x86-compatible processors that do not have the SSE2 instruction set extensions.

Specifying the target platform for the UltraSPARC processor family

For UltraSPARC processors, a generally good option pair to use is -xtarget=ultra3 with -xarch=v8plusa. These options allow the compiler to generate 32-bit code that can run on all the members of the UltraSPARC family and their follow-ons (UltraSPARC I, UltraSPARC II, UltraSPARC III, UltraSPARC IV). The compiler will also schedule the code especially for the UltraSPARC III. These options represent a good compromise, since code scheduled for the UltraSPARC III is better at taking advantage of the new features of the UltraSPARC III architecture, while still providing good performance on previous generations of processors.

If the application requires the capability to address 64-bit memory addresses, then the appropriate flags are -xtarget=ultra3 -xarch=v9a, which adds 64-bit addressing whilst still targeting all the members of the UltraSPARC family of processors.


Recommended compiler flags for the UltraSPARC platform
32-bit code:    -xtarget=ultra3 -xarch=v8plusa
64-bit code:    -xtarget=ultra3 -xarch=v9a

Specifying the target processor for the x64 processor family

By default the compiler targets a 32-bit generic x86 based processor, so the code will run on any x86 processor from a Pentium Pro up to an AMD Opteron. Whilst this produces code that can run on the widest range of processors, it does not take advantage of the extensions offered by the Opteron family of processors. Consequently it is recommended that for 32-bit code the Opteron processor is targeted; this generates code that will run on processors (such as the Pentium 4 and Opteron) which support the SSE2 instruction set extensions.

To take advantage of the x64 processor family and the advantages of 64-bit code, the appropriate compiler flags are -xtarget=opteron -xarch=amd64.

Recommended compiler flags for the x64 platform
32-bit code:    -xtarget=opteron
64-bit code:    -xtarget=opteron -xarch=amd64

Using -xtarget=generic

The compiler also supports the options -xtarget=generic and -xtarget=generic64. These options tell the compiler to produce code which runs well on as wide a range of machines as possible. One feature of these flags is that they are interpreted appropriately on both the SPARC and x64 platforms -- so using them may mean fewer changes to makefile flags. The following table shows how the compiler interprets the -xtarget=generic flags on both the SPARC and x64 platforms.

Flag                    SPARC                   x64
-xtarget=generic        V8plus architecture     386 architecture
-xtarget=generic64      V9 architecture         AMD64 architecture


Credit:
Darryl Gove, Sun Product Technical Support JSE EMEA
___________________
Technorati tags: Sun Studio

Thursday, 1 December 2005

Sun Studio 11: Asynchronous Profile Feedback Data Collection

Posted on 22:17 by Unknown
Sun released the Studio 11 compiler collection a couple of weeks back, and is giving it away for free, to everyone (Studio 10 is also freely downloadable from Sun's downloads web site; the only requirement is that the user must register at the OpenSolaris web site).

About asynchronous profile collection

Asynchronous profile feedback data collection is one of the new features in this release. However, the data collection part is totally transparent to the end user, and hence this feature was neither documented nor highlighted anywhere. In simple words, this new feature doesn't require any changes to the way feedback data is collected; but it increases the probability of getting a good profile from multi-threaded applications.

Prior to this release, the profiler thread had to wait until shared library finalization and until the process called exit(), before writing all the feedback data to the feedbin file. In a way, this mandates that the process exit in order to get the feedback data. Also, there is no guarantee that all applications (esp. multi-threaded apps) are, or will be, designed to terminate gracefully; some processes may never call exit() at all. In those cases, getting usable feedback data is very unlikely.

To alleviate the problems described above, we need some mechanism to collect the feedback data from a running process without requiring it to terminate gracefully. Fortunately Studio 11 has some desirable enhancements; thanks to these, the chances of getting a good profile from many single/multi-threaded applications are high, irrespective of how they exit.

Undocumented environment variables

When binaries built for profile collection (-xprofile=collect) are produced with Studio 11, the profiler thread periodically (I don't have the exact default interval in seconds) writes the feedback data to the feedbin file {on disk}. The time interval between periodic profile snapshots can be controlled by the undocumented environment variable SUN_PROFDATA_ASYNC_INTERVAL, whose value is interpreted as a duration in seconds. If SUN_PROFDATA_ASYNC_INTERVAL has been set to a positive integer value n at startup of an application, the profiler thread collects periodic profile data every n seconds, and subsequently updates the corresponding feedbin.

When data for a snapshot is collected, the profiler updates a single profile directory whose name is of the form:
<procname>.<hostname>.<pid>[.profile]

where:
<procname> is the name of the process being profiled
<hostname> is the host name of the machine executing the profiled process
<pid> is the process id of the profiled process

.profile is appended to the name of the profile directory unless a name is specified using the value of the {documented} environment variable SUN_PROFDATA.

The collected profile data can be used in the use phase of PFO by specifying the compiler option:
-xprofile=use:<procname>.<hostname>.<pid>

Note:
  1. The directory can be renamed at will before specifying it with the -xprofile=use option

  2. The profiler thread collects profile snapshots only for the process in which it was initiated. So, forked processes will not inherit the profiler thread (or simply profiler)

  3. When the application is built with -xprofile=collect, the object prof_lib.o is linked into profiled shared libraries, and prof_tsd.o is linked into profiled executables. These objects provide the run-time support for profile feedback data collection. The linking of these object files is transparent to us. To check this, specify -# flag of C compiler, or -v option of C++ compiler, on compile line.

Multiple profile snapshots per process

Studio 11 also enables the collection of profile data more than once per process. If the environment variable SUN_PROFDATA_ASYNC_SEQUENCE is defined and set to an integer value num_snapshots >= 1, the profiler generates a sequence of distinct profile snapshots whose names are of the form:
<procname>.<hostname>.<pid>.<n>[.profile]

where:
<n> is a positive integer in the range [1..num_snapshots].

Subsequent profile snapshots are applied to update the <procname>.<hostname>.<pid>[.profile] directory for the remaining life time of the process.

The time sequence of profile snapshots generated by setting SUN_PROFDATA_ASYNC_SEQUENCE may be used to determine how long profile data should be collected from a given application in order to obtain good performance with -xprofile=use.

Example

Let's assume that the program mymtserver is compiled with -xprofile=collect. The profile data collection can be done as follows:
% uname -n
Govinda
% setenv SUN_PROFDATA_ASYNC_INTERVAL 30
% setenv SUN_PROFDATA_ASYNC_SEQUENCE 3
% setenv SUN_PROFDATA_VERBOSE
% setenv SUN_PROFDATA_DIR /tmp/mymtserver
% ./mymtserver &
[1] 1234
This will collect a snapshot of profile data from process 1234 every 30 seconds for as long as it continues executing. The first 3 snapshots will be saved in their own feedback directories:
/tmp/mymtserver/mymtserver.Govinda.1234.1.profile, /tmp/mymtserver/mymtserver.Govinda.1234.2.profile and /tmp/mymtserver/mymtserver.Govinda.1234.3.profile.

Then the subsequent snapshots will update the feedback directory: /tmp/mymtserver/mymtserver.Govinda.1234.profile.
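
To consume one of these profiles in the use phase, the rebuild would look something like the following sketch (the source file name and optimization level are illustrative; in practice -xprofile=use should be paired with the same compiler options that were used for the collect build):
% cc -xO4 -xprofile=use:/tmp/mymtserver/mymtserver.Govinda.1234 -o mymtserver mymtserver.c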

Note:
To get warning messages during profile data collection, set the environment variable SUN_PROFDATA_VERBOSE.

Often the default values are good enough to get the feedback data, and we may not need any of the environment variables mentioned here (in the example). Perhaps that's the main reason for these to remain undocumented. Nevertheless, they provide more control over the data collection where we need it.

Async. profile collection with Studio 9 & 10

Even though this feature was first integrated into Studio 11, it was backported to Studio 9 & 10, and released as common C/C++ patches. So, to have this feature Studio 9 & 10 must have the following patches installed:
Studio  9: 115983-06 or later
Studio 10: 117832-06 (most likely -- not released yet) or later
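
Whether a given patch is already installed can be checked with showrev; for example:
% showrev -p | grep 115983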

Related posts:
  1. Sun Studio C/C++: Profile Feedback Optimization
  2. Sun Studio C/C++: Profile Feedback Optimization II

Credit/Acknowledgements:
Chris Aoki, Sun Microsystems
___________________
Technorati tags: Sun Studio

Monday, 14 November 2005

Sun Studio C/C++: Tuning iropt for inline control

Posted on 22:47 by Unknown
It is desirable to inline as many hot routines as possible to reduce the run-time overhead of CPU intensive applications. In general, compilers go by their own rules as to when to inline a routine and when not to. This blog post introduces some of the not widely known (or used) compiler internal flags that can be used to tweak the compiler's pre-defined rules.

Consider the following trivial C code:
% cat inline.c
#include <stdio.h>
#include <stdlib.h>

inline void freememory(int *ptr)
{
        free(ptr);
}

extern inline void swapdata(int *ptr1, int *ptr2)
{
        int *temp;

        temp = (int *) malloc (sizeof (int));
        printf("\nswapdata(): before swap ->");

        *temp = *ptr1;
        *ptr1 = *ptr2;
        *ptr2 = *temp;

        printf("\nswapdata(): after swap ->");

        free (temp);
}

inline void printdata(int *ptr)
{
        printf("\nAddress = %x\tStored Data = %d", ptr, *ptr);
}

inline void storedata(int *ptr, int data)
{
        *ptr = data;
}

inline int *getintptr()
{
        int *ptr;
        ptr = (int *) malloc (sizeof(int));
        return (ptr);
}

inline void AllocLoadAndSwap(int val1, int val2)
{
        int *intptr1, *intptr2;

        intptr1 = getintptr();
        intptr2 = getintptr();
        storedata(intptr1, val1);
        storedata(intptr2, val2);
        printf("\nBefore swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        swapdata(intptr1, intptr2);
        printf("\nAfter swapping .. ->");
        printdata(intptr1);
        printdata(intptr2);
        freememory(intptr1);
        freememory(intptr2);
}

inline void InitAllocLoadAndSwap()
{
        printf("\nSnapshot 1\n___________");
        AllocLoadAndSwap(100, 200);
        printf("\n\nSnapshot 2\n___________");
        AllocLoadAndSwap(435, 135);
}

int main() {
        InitAllocLoadAndSwap();
        return (0);
}

By default, auto inlining is turned off with Sun compilers; to turn it on, one has to compile the code at -O4 or higher optimization. This example tries to suggest that the compiler inline all of the routines, using the inline keyword. Note that the inline keyword is a suggestion/request to the compiler to inline the function; there is no guarantee that the compiler honors the request. Just like any other useful system in the world, the compiler has a pre-defined set of rules, and based on those rules it tries to do its best, as long as those rules are not violated. If the compiler chooses to inline a routine, the function body is expanded at every call site (much like a macro expansion).

When this code is compiled with the Sun Studio C compiler, it doesn't print any diagnostic information on stdout/stderr; so, using the nm or elfdump tools is one way to find out which routines were inlined and which were not.
% cc -xO3 -c inline.c
% nm inline.o

inline.o:

[Index] Value Size Type Bind Other Shndx Name

[4] | 0| 0|NOTY |LOCL |0 |3 |Bbss.bss
[6] | 0| 0|NOTY |LOCL |0 |4 |Ddata.data
[8] | 0| 0|NOTY |LOCL |0 |5 |Drodata.rodata
[16] | 0| 0|NOTY |GLOB |0 |ABS |__fsr_init_value
[14] | 0| 0|FUNC |GLOB |0 |UNDEF |InitAllocLoadAndSwap
[1] | 0| 0|FILE |LOCL |0 |ABS |inline.c
[15] | 0| 20|FUNC |GLOB |0 |2 |main
From this output, we can see that InitAllocLoadAndSwap() was not inlined; but we still have no information as to why.
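
elfdump gives an equivalent view of the symbol table; a quick sketch (assuming elfdump is in the PATH -- the exact column layout may differ between Solaris releases):
% elfdump -s inline.o | grep FUNC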

Compiler commentary with er_src tool

To get some useful diagnostic information, the Sun Studio compiler collection offers a tool called er_src. When the source code is compiled with a debug flag (-g or -g0), er_src can print the compiler commentary. However, since the compiler does auto inlining only at -xO4 or higher optimization levels, compiler commentary for inlining is unfortunately not available at the -xO3 optimization level.
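
For reference, invoking er_src looks roughly like the following sketch (assuming the object is rebuilt with debug information at -xO4, so that inlining commentary is actually generated):
% cc -g -xO4 -c inline.c
% er_src inline.o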

iropt's inlining report

iropt is the global optimizer component in the Sun Studio compiler collection, and inlining is handled by iropt. It performs inlining for callees in the same file, unless compiler options for cross-file optimizations, such as -xipo or -xcrossfile, are specified on the compile line.
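
For example, a cross-file build would look something like this (a sketch; -xipo performs interprocedural optimization across all files on the command line, and the file names are hypothetical):
% cc -xO4 -xipo file1.c file2.c -o app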

Fortunately, iropt has some internal flags we can use to control its inlining heuristics. Note that these flags have no dependency on the optimization level.

Getting the list of iropt phases, and the corresponding flags

Sun C/C++ compilers on the SPARC platform support a variety of options for inlining control. iropt -help displays the list of supported flags.
% /opt/SS9/SUNWspro/prod/bin/iropt -help

****** General Usage Information about IROPT ******

To get general help information about IROPT, use -help
To list all the optimization phases in IROPT, use -phases
To get help on a particular phase, use -help=phase
To turn on phases, use -A<phase_name>+<phase_name>+...+<phase_name>
To turn off phases, use -R<phase_name>+<phase_name>+...+<phase_name>
To use phase-specific flags, use -A<phase_name>:<flags list>

% /opt/SS9/SUNWspro/prod/bin/iropt -phases

****** List of Optimization Phases in IROPT ******

Phase Name Description
-------------------------------------------------------------
loop Loop Invariant Code Motion
copy Copy ProPaGation
const Const ProPaGation and folding
reg Virtual Register Allocation
reassoc Reconstruction of associative and/or distributive expressions
rename Scalar Rename
mvl Two-version loops for parallelization
loop_dist Loop Distribution
ddint Loop Interchange
fusion Loop Fusion
eliminate Scalar Replacement on def-def and def-use
private Private Array Analysis
scalarrep Scalar Replacement for use-use
tile Cache Blocking
ujam Register Blocking
ddrefs Loop Invariant Array References Moving
invcc Invariant Conditional Code Motion
restrict_g Assume global pointers as restrict
dead Dead code elimination
pde Partial dead code elimination
ansi_alias Apply ANSI Aliase Rules to Pointer References
yas Scalar Replacement for reduction arrays
cond_elim Conditional Code Elimination
vector Vectorizing Some Intrinsics Functions Calls in Loops
whole Whole Program Mode
bopt Branches Reordering based on Profile Data
invccexp Invariant Conditional Code Expansion
bcopy Memcpy and Memset Transformations
ccse Cross Iteration CSE
data_access Array Access Regions Analysis
ipa Interprocedual Analysis
contract Array Contraction Analysis
symbol Symbolic Analysis
ppg2 optimistic strategy of constant propagation
parallel Parallelization
pcg Parallel Code Generator
lazy Lazy Code Motion
region Region-based Optimization
loop_peeling Loop Peeling
loop_shifting Loop Shifting
loop_collapsing Loop Collapsing
memopt Merge memory allocations
sr Strength reduction (new)
ivsub3 Induction Variable Substitution
crit Critical path optimisations
scalar_repl
loop_bound
loop_condition
measurement
memopt_pattern

% /opt/SS9/SUNWspro/prod/bin/iropt -help=inline

NAME
inline - Qoption for IPA-based inlining phase.

SYNOPSIS
-Ainline[:<op1>][:<op2>]:...[:<opn>] - turn on inline.
-Rinline - turn off inline

DESCRIPTION
inline is on by default now. -Ainline turns it on.
-Rinline turns it off.

NOTE: the following is a brief description of the old inliner qoptions
1. Old inliner qoptions that do not have equivalent
options in the new inliner--avoid to use them later:
-Ml -Mi -Mm -Ma -Mc -Me -Mg -Mw -Mx -Mx -MC -MS

2. Old inliner qoptions that have equivalent option
in the new inliner--use the new options later:
Old options new options
-Msn recursion=n
-Mrn irs=n
-Mtn cs=n
-Mpn cp=n
-MA chk_alias
-MR chk_reshape
-MI chk_reshape=no
-MF mi

The acceptable sub-options are:

report[=n] - dump inlining report.
n=chain:
show to-be-inlined call chains.
n=0: show inlined calls only.
n=1: (default): show both inlined and
non-inlined calls and reasons for
inlining/non-inlining.
n=2: n=1 plus call id and node id
n=3: show inlining summary only
n=4: n=2 and iropt aborts after the
inlining report is dumped out.
cgraph - dump cgraph.
call_in_pragma[=no|yes]:
- call_in_pragma or call_in_pragma=yes:
Inline a call that in the Parallel region
into the original routine
- call_in_pragma=no: (default)
Don't inline a call that in the Parallel region
into the original routine
inline_into_mfunction[=no|yes]:(only for Fortran)
- inline_into_mfunction or inline_into_mfunction=yes:(default)
Inline a call into the mfunction if it is in the
Parallel Region
- inline_into_mfunction=no:
Don't inline a call into the mfunction if it
in the Parallel Region
NOTE: for other languages, if you specify inline_into_mfunction=yes
The compiler will silently ignore this qoption. As a result,
Calls in parallel region will still be inlined into pragma constructs
rs=n - max number of triples in inlinable routines.
iropt defines a routine as inlinable or not
based on this number. So no routines over
this limit will be considered for inlining.
irs=n - max number of triples in a inlining routine,
including size increase by inlining its calls
cs=n - max number of triples in a callee.
In general, iropt only inline calls whose
callee is inlinable (defined by rs) AND
whose callee size is not greater than n.
But some calls to inlinable routines are
actually inlined because of other factors
such as constant actuals, etc.
recursion=n
- max level of resursive call that is
considered for inlining.
cp=n - minimum profile feedback counter of a call.
No call with counter less than this limit
would be inlined.
inc=n - percentage of the total number of triples
increased after inlining. No inlining over
this percentage. For instance, 'inc=30'
means inlining is allowed to increase the
total program size by 30%.
create_iconf=<filename>:
use_iconf=<filename>:
This creates/uses an inlining configuration.
The file lists calls and routines that are
inlined and routines that inline their calls.
Its format is:
air /* actual inlining routines */
n11 n12 n13 ...
n21 n22 n23 ...
.....
ari /* actual routines inlined */
n11 n12 n13 ...
n21 n22 n23 ...
.....
aci /* actual calls inlined */
n11 n12 n13 ...
n21 n22 n23 ...
.....
The numbers are call ids and node ids
printed out when report=2. It is used for
debugging. The usual usage is to use
create_iconf= to create a config file.
then, comment (by preceding numbers line
with #) to disallow inlining for those
calls or routines. For instance,
aci
2 5 6 90
10 234 45 6
# 21 34 46
with the above config file, calls whose
call ids are 21, 34, or 46 will not be
inlined.
do_inline=<routine_name>:
- guide inliner to do inlining for a given
routine only.
mi:
- Do maximum inlining for given routines if do_inline
is used; otherwise, do maximum inlining for main routine.
(The inliner will not check inlining parameters.
remove_ip[=no|yes]:
- remove_ip or remove_ip=yes:
removing inliningPlan after inlining.
- remove_ip=no [default]:
keep inliningPlan after inlining.
chk_alias[=no|yes]:
- chk_alias or chk_alias=yes [default]:
Don't inline a call if inlining it causes
aliases among callee's formal arrays.
- chk_alias=no:
Ignore such checking.
chk_reshape[=no|yes]:
- chk_reshape or chk_reshape=yes [default]:
Don't inline a call if its array argument
is reshaped between caller and callee.
- chk_reshape=no:
Ignore such checking.
chk_mismatch[=no|yes]:
- chk_mismatch or chk_mismatch=yes [default]:
Don't inline a call if any real argument
mismatches with its formal in type.
- chk_mismatch=no:
Ignore such checking.
do_chain[=no|yes]:
- do_chain or do_chain=yes [default]:
Enable inlining for call chains.
- do_chain=no:
Disable inlining for call chains.
callonce[=no|yes]:
- callonce=no [default]:
Disable inlining a routine that is
called only once.
- callonce or callonce=yes:
Enable inlining a routine that is
called only once.

All of a sudden we have a wealth of heuristic information available at compile time. Looking carefully at the options listed above, there is a sub-option (report) to -Ainline that dumps an inlining report. To pass these flags to iropt from the C compiler, we need to specify -W2,<flag> on the compile line.

Here's how to:
% cc -xO3 -c -W2,-Ainline:report=2 inline.c

INLINING SUMMARY

inc=400: percentage of program size increase.
irs=4096: max number of triples allowed per routine after inlining.
rs=450: max routine size for an inlinable routine.
cs=400: call size for inlinable call.
recursion=1: max level for inlining recursive calls.
Auto inlining: OFF

Total inlinable calls: 14
Total inlined calls: 36
Total inlined routines: 7
Total inlinable routines: 7
Total inlining routines: 3
Program size: 199
Program size increase: 744
Total number of call graph nodes: 11

Notes for selecting inlining parameters

1. "Not inlined, compiler decision":
If a call is not inlined by this reason, try to
increase inc in order to inline it by
-Qoption iropt -Ainline:inc= for FORTRAN, C++
-W2,-Ainline:inc= for C

2. "Not inlined, routine too big after inlining":
If a call is not inlined by this reason, try to
increase irs in order to inline it by
-Qoption iropt -Ainline:irs= for FORTRAN, C++
-W2,-Ainline:irs= for C

3. "Not inlined, callee's size too big":
If a call is not inlined by this reason, try to
increase cs in order to inline it by
-Qoption iropt -Ainline:cs= for FORTRAN, C++
-W2,-Ainline:cs= for C

4. "Not inlined, recursive call":
If a call is not inlined by this reason, try to
increase recursion level in order to inline it by
-Qoption iropt -Ainline:recrusion= for FORTRAN, C++
-W2,-Ainline:recrusion= for C

5. "Routine not inlined, too many operations":
If a routine is not inlinable by this reason, try to
increase rs in order to make it inlinable by
-Qoption iropt -Ainline:rs= for FORTRAN, C++
-W2,-Ainline:rs= for C


ROUTINES NOT INLINABLE:

main [id=7] (inline.c)
Routine not inlined, user requested

CALL INLINING REPORT:

Routine: freememory [id=0] (inline.c)
Nothing inlined.

Routine: swapdata [id=1] (inline.c)
Nothing inlined.

Routine: printdata [id=2] (inline.c)
Nothing inlined.

Routine: storedata [id=3] (inline.c)
Nothing inlined.

Routine: getintptr [id=4] (inline.c)
Nothing inlined.

Routine: AllocLoadAndSwap [id=5] (inline.c)
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined

Routine: InitAllocLoadAndSwap [id=6] (inline.c)
AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
(inc limit reached. See INLININING SUMMARY)
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined

Routine: main [id=7] (inline.c)
InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
AllocLoadAndSwap [call_id=22], line 64: Not inlined, compiler decision
(inc limit reached. See INLININING SUMMARY)
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
Finally, some very useful information. The above report shows the threshold values used while making the decisions, lists all the routines, and tells whether each call was inlined or not; if not, it gives the reason for not inlining it and some suggestions on how to make it succeed. This is very cool!

From the report: the compiler tries to inline all the routines, as long as the program size doesn't increase by more than 400% of the original size (i.e., without inlining). Unfortunately, AllocLoadAndSwap() falls beyond that limit, and hence the compiler decides not to inline it. Fair enough. If we don't care about the size of the binary, and if we really need this routine inlined, we can increase the value of inc so that inlining AllocLoadAndSwap() fits within the new limit.
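
A back-of-the-envelope check against the summary above (assuming both "Program size" and "Program size increase" are measured in triples): with inc=400, the program is allowed to grow by at most 400% of 199, i.e., roughly 796 triples. The inlining already performed accounts for an increase of 744 (about 374%), so expanding the remaining AllocLoadAndSwap() call (call_id=22) would push the total past that budget -- hence the "inc limit reached" message.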

eg.,
% cc -xO3 -c -W2,-Ainline:report=2,-Ainline:inc=650 inline.c
INLINING SUMMARY

inc=650: percentage of program size increase.
irs=4096: max number of triples allowed per routine after inlining.
rs=450: max routine size for an inlinable routine.
cs=400: call size for inlinable call.
recursion=1: max level for inlining recursive calls.
Auto inlining: OFF

Total inlinable calls: 14
Total inlined calls: 60
Total inlined routines: 7
Total inlinable routines: 7
Total inlining routines: 3
Program size: 199
Program size increase: 1260
Total number of call graph nodes: 11

Notes for selecting inlining parameters

... skip ... (see prev reports for the text that goes here)

ROUTINES NOT INLINABLE:

main [id=7] (inline.c)
Routine not inlined, user requested


CALL INLINING REPORT:

Routine: freememory [id=0] (inline.c)
Nothing inlined.

Routine: swapdata [id=1] (inline.c)
Nothing inlined.

Routine: printdata [id=2] (inline.c)
Nothing inlined.

Routine: storedata [id=3] (inline.c)
Nothing inlined.

Routine: getintptr [id=4] (inline.c)
Nothing inlined.

Routine: AllocLoadAndSwap [id=5] (inline.c)
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined

Routine: InitAllocLoadAndSwap [id=6] (inline.c)
AllocLoadAndSwap [call_id=22], line 64: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined

Routine: main [id=7] (inline.c)
InitAllocLoadAndSwap [call_id=25], line 70: Auto inlined
AllocLoadAndSwap [call_id=22], line 64: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
AllocLoadAndSwap [call_id=24], line 66: Auto inlined
swapdata [call_id=15], line 53: Auto inlined
getintptr [call_id=8], line 46: Auto inlined
getintptr [call_id=9], line 47: Auto inlined
printdata [call_id=13], line 51: Auto inlined
printdata [call_id=14], line 52: Auto inlined
printdata [call_id=17], line 55: Auto inlined
printdata [call_id=18], line 56: Auto inlined
freememory [call_id=19], line 57: Auto inlined
freememory [call_id=20], line 58: Auto inlined
storedata [call_id=10], line 48: Auto inlined
storedata [call_id=11], line 49: Auto inlined
From the above output, AllocLoadAndSwap() was inlined by the compiler once we allowed the program size to increase by up to 650%.
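
The numbers line up with the new budget (again assuming triples): 650% of 199 is roughly 1293, and the reported increase of 1260 (about 633%) now fits within it, so the compiler is free to inline AllocLoadAndSwap() at both call sites.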

Notes:
  1. Multiple iropt options separated by a comma (,) can be specified after -W2
    eg., -W2,-Ainline:report=2,-Ainline:inc=650

  2. For C++ programs, -Qoption can be used to pass internal flags to iropt.
    eg., -Qoption iropt -Ainline:report=2
    -Qoption iropt -Ainline:report=2,-Ainline:inc=650

  3. Inlining those functions whose function call overhead is large relative to the function's code improves performance. The obvious reason for the performance improvement is the elimination of the function call, stack frame manipulation, and the function return.

  4. Even though inlining may improve the run-time performance of an application, do not try to inline too many functions. Inline only those functions that profiling data shows would benefit from inlining.

  5. In general, the compiler's threshold values are good enough for inlining decisions. Use iropt's flags only if some very hot routines fail to get inlined for some reason (see the sketch after this list). Turn on auto inlining with the -xO4 option.

  6. Inlining functions increases build times and program sizes. It is also possible that very large routines, when inlined, no longer fit in the processor's cache, leading to poor performance due to an increased cache miss rate.
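
As an example of item 5, the do_inline sub-option documented in the iropt help output above can restrict inlining to one specific hot routine. A sketch (untested command line, shown only to illustrate the idea):
% cc -xO3 -c -W2,-Ainline:report=2,-Ainline:do_inline=AllocLoadAndSwap inline.c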

Relevant information:
  1. Sun C/C++ compilers: Inlining routines
  2. Sun Studio: Advanced Compiler Options for Performance

___________________
Technorati tags: Sun Studio | iropt
Read More
Posted in | No comments

Sunday, 6 November 2005

Sun Studio: Gathering memory allocations/leaks data, from a running process

Posted on 20:11 by Unknown
One simple way of collecting this information is with the runtime checking (RTC) feature of dbx, as described in Investigating memory leaks with dbx. Yet another way is to use the collector to get this data, if the process is already running under dbx.

Note that running the application under the collect tool with heap tracing on (the -H on option) produces overwhelming data, which includes all the leaks that occurred over the life span of the process. So, if we need to collect the data only for a specific component of the application, running the whole application under collect is clearly not a good choice.
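
For completeness, whole-process heap tracing with collect would look something like the following sketch; the resulting experiment (by default named test.<n>.er) can then be browsed with er_print just like the dbx-collected experiment shown later in this post:
% collect -H on ./memleaks
% er_print test.1.er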

Steps for collecting data from a running (32-bit) process, with dbx and collector:

In one window:
  1. Preload libcollector.so and set the path to the library, using LD_PRELOAD and LD_LIBRARY_PATH respectively

  2. Run the program/application

In another terminal window:
  1. Get the process ID (aka PID)

  2. If dbx is already running, attach the process to dbx; else start dbx with the PID of the application

  3. Enable data collection with collector enable command

  4. Create/open a new experiment with collector store filename <experiment_name>.er command

  5. Turn the heap tracing on, with collector heaptrace on command. By default, heap tracing is off

  6. Start the data collection by resuming the process with cont command of dbx

  7. Run some relevant test scenarios to capture the memory allocations and leaks, for a specific component of interest

  8. Either detach the process or stop it, when the data collection is done

  9. Finally analyze the data with er_print command line tool, or analyzer graphical tool

The following simple example shows the execution steps outlined above

eg.,
% cat memleaks.c     <- source code
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

void allocate() {
    int *x;
    char *y;

    x = (int *) malloc(sizeof(int) * 100);
    y = (char *) malloc (sizeof(char) * 200);

    printf("\nAddress of x = %u, y = %u", x, y);

    x = (int *) malloc (sizeof(int) * 25);
    y = (char *) malloc (sizeof(char) * 25);

    printf("\nNew address of x = %u, y = %u\n", x, y);
    free (y);
}

void main() {
    while (1) {
        allocate();
        sleep(3);
    }
}

% /opt/SS10/SUNWspro/prod/bin/cc -g -o memleaks memleaks.c

% setenv LD_LIBRARY_PATH /opt/SS10/SUNWspro/prod/lib/dbxruntime:$LD_LIBRARY_PATH
% setenv LD_PRELOAD /opt/SS10/SUNWspro/prod/lib/dbxruntime/libcollector.so

% ./memleaks

Address of x = 134613536, y = 134613944
New address of x = 134614152, y = 134614272

Address of x = 134616832, y = 134617240
New address of x = 134617448, y = 134614272

...
...

Address of x = 134621928, y = 134622336
New address of x = 134622544, y = 134614272

Address of x = 134622656, y = 134623064
New address of x = 134623272, y = 134614272
<- start the data collection
Address of x = 134623384, y = 134623792
New address of x = 134624000, y = 134614272

Address of x = 134624112, y = 134624520
New address of x = 134624728, y = 134614272

Address of x = 134624840, y = 134625248
New address of x = 134625456, y = 134614272

Address of x = 134625568, y = 134625976
New address of x = 134626184, y = 134614272

Address of x = 134626296, y = 134626704
New address of x = 134626912, y = 134614272

<- stop the data collection

In another window:
% ps -eaf | grep memleaks
techno 11174 10744 0 19:39:09 syscon 0:00 ./memleaks

% dbx - 11174
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.4' in your .dbxrc
Reading -
Reading ld.so.1
Reading libcollector.so
Reading libc.so.1
Reading libdl.so.1
Attached to process 11174
stopped in ___nanosleep at 0xd2642a75
0xd2642a75: ___nanosleep+0x0015: jae ___nanosleep+0x23 [ 0xd2642a83, .+0xe ]
Current function is main
24 sleep(3);

(dbx) collector enable
(dbx) collector heaptrace on

(dbx) cont
Creating experiment database test.2.er ... <- If no experiment name is specified with
collector store filename <exptname>.er command, a default experiment will be created

^C <- stopped it after 15 sec
execution completed, exit code is 138902584
(dbx) quit

% er_print test.2.er
test.2.er: Experiment has warnings, see header for details

(er_print) func
Functions sorted by metric: Inclusive Bytes Leaked

Incl. Incl. Excl. Incl. Name
Bytes Leaks User CPU User CPU
Leaked sec. sec.
3500 15 0. 0. <Total>
3500 15 0. 0. _start
3500 15 0. 0. allocate
3500 15 0. 0. main
3500 15 0. 0. malloc
0 0 0. 0. ___nanosleep

(er_print) source allocate
Source file: ./memleaks.c
Object file: ./memleaks
Load Object: ./memleaks

Incl. Incl. Excl. Incl.
Bytes Leaks User CPU User CPU
Leaked sec. sec.
1. #include <stdlib.h>
2. #include <unistd.h>
3. #include <stdio.h>
4.
0 0 0. 0. 5. void allocate() {

6. int *x;
7. char *y;
8.
2000 5 0. 0. 9. x = (int *) malloc(sizeof(int) * 100);
1000 5 0. 0. 10. y = (char *) malloc (sizeof(char) * 200);

11.
0 0 0. 0. 12. printf("\nAddress of x = %u, y = %u", x, y);
13.
500 5 0. 0. 14. x = (int *) malloc (sizeof(int) * 25);
0 0 0. 0. 15. y = (char *) malloc (sizeof(char) * 25);
16.
0 0 0. 0. 17. printf("\nNew address of x = %u, y = %u\n", x, y);
0 0 0. 0. 18. free (y);
0 0 0. 0. 19. }
20.
0 0 0. 0. 21. void main() {

22. while (1) {
## 3500 15 0. 0. 23. allocate(); <- malloc() was called 15 times,
requesting a total of 3500 bytes, with no corresponding free()

0 0 0. 0. 24. sleep(3);
0 0 0. 0. 25. }
0 0 0. 0. 26. }


(er_print) leaks <- show potential memory leaks
Summary Results: Distinct Leaks = 3, Total Instances = 15, Total Bytes Leaked = 3500

Leak #1, Instances = 5, Bytes Leaked = 2000
malloc + 0x000000BA
allocate + 0x00000018, line 9 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

Leak #2, Instances = 5, Bytes Leaked = 1000
malloc + 0x000000BA
allocate + 0x00000028, line 10 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

Leak #3, Instances = 5, Bytes Leaked = 500
malloc + 0x000000BA
allocate + 0x0000004A, line 14 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

(er_print) allocs <- show memory allocations
Summary Results: Distinct Allocations = 4, Total Instances = 20, Total Bytes Allocated = 3625

Allocation #1, Instances = 5, Bytes Allocated = 2000
malloc + 0x000000BA
allocate + 0x00000018, line 9 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

Allocation #2, Instances = 5, Bytes Allocated = 1000
malloc + 0x000000BA
allocate + 0x00000028, line 10 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

Allocation #3, Instances = 5, Bytes Allocated = 500
malloc + 0x000000BA
allocate + 0x0000004A, line 14 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

Allocation #4, Instances = 5, Bytes Allocated = 125
malloc + 0x000000BA
allocate + 0x00000057, line 15 in "memleaks.c"
main + 0x00000013, line 23 in "memleaks.c"
_start + 0x00000079

(er_print) quit
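
Plugging the reported leaks is straightforward: free each block before its pointer is overwritten, and free x before returning. A sketch of a corrected allocate() for the same program (the rest of memleaks.c stays as-is):

#include <stdlib.h>
#include <stdio.h>

void allocate() {
    int *x;
    char *y;

    x = (int *) malloc(sizeof(int) * 100);
    y = (char *) malloc (sizeof(char) * 200);

    printf("\nAddress of x = %u, y = %u", x, y);

    free(x);    /* plug leak #1 (line 9): free before the pointer is overwritten */
    free(y);    /* plug leak #2 (line 10) */

    x = (int *) malloc (sizeof(int) * 25);
    y = (char *) malloc (sizeof(char) * 25);

    printf("\nNew address of x = %u, y = %u\n", x, y);
    free(x);    /* plug leak #3 (line 14): x was never freed in the original */
    free(y);
}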

Detailed information is available in the Sun Studio Performance Analyzer document, hosted on docs.sun.com.

Related blog posts:
  1. Sun Studio: Investigating memory leaks with dbx
  2. Sun Studio: Investigating memory leaks with Collector/Analyzer
___________________

Technorati tags: Sun Studio | dbx
Read More
Posted in | No comments