__FILE__ & __LINE__ macros specified in C & C++ standards are useful to provide some debugging information, during run-time. During pre-processing stage of compilation, __FILE__
macro expands to the name of the file being compiled, in the form of a C string constant; and __LINE__
expands to the line number of the source file that is being compiled, in the form of an integer constant.
Run-time anomalies are not uncommon in production environment. Some developers check for pre-, post-conditions for a block of code with the help of assertions. Depending on the severity of the failure/violation/deviation, the program (application) may either crash or continue to run (of course, at the will of the developer who wrote that piece of code). However to minimize errors/failures during run-time, it is often necessary to log the run-time error messages for a later analysis. So, the developer can introduce some statements with __FILE__ & __LINE__
macros, at appropriate places in the source code to log the exact location of the failure, if an inconsistency was detected during run-time.
Here's an example of poorly written C code, that fails to do proper exception handling. However it logs enough information, to locate the failed statement easily.% cat -n file-line-directives.c
1 #include <stdio.h>
2 #include <stdlib.h>
3
4 int main(int argc, char **argv)
5 {
6 FILE *fd;
7 char *errinfo = NULL;
8
9 errinfo = (char *) malloc (sizeof (char) * 50);
10
11 fd = fopen (argv[1], "r");
12 if (fd == NULL)
13 {
14 sprintf (errinfo, "Error at line %d of %s", (__LINE__ - 3), __FILE__);
15 perror (errinfo);
16 free (errinfo);
17 }
18 else
19 {
20 printf ("file descriptor = %d", fileno(fd));
21 fclose (fd);
22 }
23
24 return (0);
25 }
Compiling this code with Sun Studio C (or C++) compiler produces the following result, at run-time:% cc -o directives file-line-directives.c
% ./directives <- needs an argument ie., file name
Error at line 11 of file-line-directives.c: Bad address
% ./directives /tmp/nonexistingfile <- need an existing file name, to open it up for reading
Error at line 11 of file-line-directives.c: No such file or directory
% ls /tmp/*.log
/tmp/activity.log
%./directives /tmp/activity.log
file descriptor = 3
Note:
Since these pre-defined macros are part of the standard(s), they are available with all C and C++ standard compliant compilers.
___________________
Technorati tags: C | C++ | Programming
Wednesday, 26 October 2005
C/C++: About __FILE__ & __LINE__ Macros
Posted on 20:45 by Unknown
Monday, 24 October 2005
My Favorite Music V
Posted on 22:52 by Unknown
Here comes another long list of my favorite singles. Previous posts of this series are at: I | II | III | IV
________________
Technorati tag: Music
3 Doors Down - Let Me Go | Kinks - You Really Got Me |
Technorati tag: Music
Friday, 21 October 2005
Solaris: pthread_attr_getstack() broken?
Posted on 18:02 by Unknown
Apparently there seems to be some confusion1 2 around when the structure pthread_attr_t gets updated. In one of the above threads, it appears that the initiator believes that the attribute structure must be updated at least when a thread was created by calling
Let's start with an example:(Taken from: pthread_attr_getstack(), pthread_attr_getstackaddr() thread -- for simplicity, I just removed the error checking code)
What is an Attribute object?
Attributes are a way to specify behavior that is different from the default. Any variable of type
Now it all makes sense for the OS to return NULL and 0 for base address and size of a stack. Since the POSIX specification didn't specify the default values to be returned, the output for the same program might be different on different platforms (see previous paragraphs for the default values chosen by Sun).
Let's modify the C code a little to place a 3M stack, at a different location than the system defined address. (The newly introduced code is in green color)
________________
Technorati tag: Programming | POSIX | Solaris
pthread_create()
.Let's start with an example:(Taken from: pthread_attr_getstack(), pthread_attr_getstackaddr() thread -- for simplicity, I just removed the error checking code)
#include <pthread.h>On Solaris, this piece of code produces a result of:
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
void* Proc(void* param)
{
sleep(3);
return 0;
}
int main(int argc, char *argv[])
{
pthread_attr_t Attr;
pthread_t Id;
void *Stk;
size_t Siz;
int err;
pthread_attr_init(&Attr);
pthread_attr_getstacksize(&Attr, &Siz);
pthread_attr_getstackaddr(&Attr, &Stk);
printf("Default: Addr=%08x Size=%d\n", Stk, Siz);
pthread_create(&Id, &Attr, Proc, NULL);
pthread_attr_getstackaddr(&Attr, &Stk);
pthread_attr_getstacksize(&Attr, &Siz);
printf("Stack : Addr=%08x Size=%d\n", Stk, Siz);
return (0);
}
Default: Addr=00000000 Size=0This result may surprise (and confuse) some of the users, who expects a valid base address and a valid size for the stack. However on Solaris, a value of NULL for stack address, and 0 for stack size indicates that the new thread will have system defined stack address and system allocated stack size (1M for 32-bit and 2M for 64-bit processes) respectively. Other default values for attribute (
Stack : Addr=00000000 Size=0
Attr
, in the example) is mentioned in Solaris 10 Multithreaded Programming Guide. To understand when the interfaces pthread_attr_getstackaddr()
and pthread_attr_getstacksize()
, return valid values, first we need to know a little about attribute object.What is an Attribute object?
Attributes are a way to specify behavior that is different from the default. Any variable of type
pthread_attr_t
can be used specify the new behavior. Attribute object can be specified, when a thread is created with pthread_create()
or when a synchronization variable is initialized. Note that these attributes cannot be altered while the thread is in use.Now it all makes sense for the OS to return NULL and 0 for base address and size of a stack. Since the POSIX specification didn't specify the default values to be returned, the output for the same program might be different on different platforms (see previous paragraphs for the default values chosen by Sun).
Let's modify the C code a little to place a 3M stack, at a different location than the system defined address. (The newly introduced code is in green color)
#include <pthread.h>Here's the output from the above program:
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
void* Proc(void* param)
{
sleep(3);
return 0;
}
int main(int argc, char *argv[])
{
pthread_attr_t Attr;
pthread_t Id;
void *Stk;
size_t Siz;
int err;
size_t stksize = 0x300000;
void *stack;
pthread_attr_init(&Attr);
pthread_attr_getstacksize(&Attr, &Siz);
pthread_attr_getstackaddr(&Attr, &Stk);
printf("Default: Addr=%08x Size=%d\n", Stk, Siz);
stack = mmap(NULL, stksize, PROT_WRITE|PROT_READ, MAP_ANON|MAP_SHARED, -1, 0);
pthread_attr_setstack(&Attr, stack, stksize);
pthread_create(&Id, &Attr, Proc, NULL);
pthread_attr_getstackaddr(&Attr, &Stk);
pthread_attr_getstacksize(&Attr, &Siz);
printf("Stack : Addr=%08x Size=%d\n", Stk, Siz);
return (0);
}
Default: Addr=00000000 Size=0Moral of the story:
Stack : Addr=fee80000 Size=3145728
pthread_attr_getstack()
is not broken; and the structure pthread_attr_t (ie., the attribute object) gets updated only if we update it, with the help of pthread_attr_setxxxx() interfaces defined in pthread.h ie.,% grep "pthread_attr_set*" /usr/include/pthread.hOtherwise system is going to return the default values for all pthread_attr_getxxxx() calls.
int pthread_attr_setstack(pthread_attr_t *, void *, size_t);
int pthread_attr_setstacksize(pthread_attr_t *, size_t);
int pthread_attr_setstackaddr(pthread_attr_t *, void *);
int pthread_attr_setdetachstate(pthread_attr_t *, int);
int pthread_attr_setscope(pthread_attr_t *, int);
int pthread_attr_setinheritsched(pthread_attr_t *, int);
int pthread_attr_setschedpolicy(pthread_attr_t *, int);
int pthread_attr_setschedparam(pthread_attr_t *_RESTRICT_KYWD,
int pthread_attr_setguardsize(pthread_attr_t *, size_t);
________________
Technorati tag: Programming | POSIX | Solaris
Friday, 14 October 2005
Handling SIGFPE
Posted on 00:53 by Unknown
Couple of days back I was asked to look into simple C code that tries to get into division by zero problem deliberately. Signal handler was installed, and there's some code to catch signal SIGFPE (Floating Point Exception), and to print simple message when the relevant code in the signal handler is called. Here's the code (courtesy: Dheeraj):
Now the developer has the following choices:
uc points to the user context, defined by the structure
Now it is obvious that
----
This code works "as is" on SPARC, since REG_nPC is available on SPARC. To make it work with other processors, the code needs to be changed a little bit.
________________
Technorati tag: Programming | Solaris | SPARC
% cat fpe.cDuring run-time, the system (OS) throws SIGFPE once the statement
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
void signal_handler (int signo, siginfo_t *si, void *data) {
switch (signo) {
case SIGFPE:
fprintf(stdout, "Caught FPE\n");
break;
default:
fprintf(stdout, "default handler\n");
}
}
int main (void) {
struct sigaction sa, osa;
unsigned int b = ULONG_MAX;
sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
sa.sa_sigaction = signal_handler;
sigaction(SIGFPE, &sa, &osa);
b /= 0x0;
return b;
}
b /= 0x0;
gets executed. Since the handler is available for this signal, it should print Caught FPE once, on console and then return from main()
. Strangely enough, the floating point exception was caught multiple times as though it was in an infinite loop, and the process didn't exit.% cc -o fpe fpe.cIt turns out to be the expected behavior; and it appears that when a floating point instruction traps due to the occurrence of an unmasked floating point exception, the hardware leaves the instruction pointer pointing to the beginning of the same instruction. This explains the reason for the multiple SIGFPE's from the same process (and from the same instruction).
"fpe.c", line 25: warning: division by 0
% ./fpe
Caught FPE
Caught FPE
Caught FPE
Caught FPE
Caught FPE
Caught FPE
Caught FPE
^C
Now the developer has the following choices:
- Abort the program
- Modify the operands of the instruction, so the exception will not occur; then continue by re-executing that instruction. Doing so, supplies a result for the trapping instruction
--And/Or-- - Update the instruction pointer (PC), so the execution continues at the next instruction (nPC)
% cat fpe.c
#include <sys/types.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <ucontext.h>
void signal_handler (int signo, siginfo_t *si, void *data) {
ucontext_t *uc;
uc = (ucontext_t *) data;
switch (signo) {
case SIGFPE:
fprintf(stdout, "Caught FPE\n");
uc->uc_mcontext.gregs[REG_PC] = uc->uc_mcontext.gregs[REG_nPC];
break;
default:
fprintf(stdout, "default handler\n");
}
}
int main (void) {
struct sigaction sa, osa;
unsigned int b = ULONG_MAX;
sa.sa_flags = SA_ONSTACK | SA_RESTART | SA_SIGINFO;
sa.sa_sigaction = signal_handler;
sigaction(SIGFPE, &sa, &osa);
b /= 0x0;
return b;
}
% cc -o fpe fpe.c
"fpe.c", line 30: warning: division by 0
% ./fpe
Caught FPE
uc points to the user context, defined by the structure
ucontext_t
. The user context includes the contents of the calling processes' machine registers, the signal mask, and the current execution stack. uc_mcontext
is a member of the structure ucontext_t
, of type mcontext_t
. gregs
, general register set is a member of structure mcontext_t
. gregs[REG_PC] holds the PC of the current instruction, and gregs[REG_nPC] holds the PC of the next instruction.Now it is obvious that
uc->uc_mcontext.gregs[REG_PC] = uc->uc_mcontext.gregs[REG_nPC];
statement increments the program counter . Since the user context got manipulated a bit, the process will be able to continue with the next instruction.----
This code works "as is" on SPARC, since REG_nPC is available on SPARC. To make it work with other processors, the code needs to be changed a little bit.
________________
Technorati tag: Programming | Solaris | SPARC
Thursday, 6 October 2005
Sun Studio: Investigating memory leaks with dbx
Posted on 16:32 by Unknown
Some time back I have had a blog entry with the title Sun Studio: Investigating memory leaks with Collector/Analyzer. It is possible to check memory leaks {along with memory use and memory access checking} with the Sun Studio debugger tool,
Runtime checking with
For the runtime checking feature to work properly, the process must have
To preload rtcaudit.so:
Turn off mutex tracking with
Unset
Here's an example, with annotated commentary:
In short, memory leak report can be generated as follows with the help of RTC feature of dbx:
Technorati tags: Sun Studio | dbx
dbx
, as well. This blog post attempts to show the steps involved in generating the memory leak report, with a simple example (taken from Collector/Analyzer example).Runtime checking with
dbx
dbx
uses the word runtime checking (RTC) for detecting runtime errors, like memory leaks and memory access errors. Extensive material about RTC is available in the Debugging a Program With dbx document, under chapter Using Runtime Checking.For the runtime checking feature to work properly, the process must have
rtcaudit.so
preloaded when it starts.To preload rtcaudit.so:
% setenv LD_AUDIT <path-to-rtcaudit-lib>/rtcaudit.so
Turn off mutex tracking with
dbxenv mt_sync_tracking off
, to avoid running into the bug thread_db synchronization tracking causes cond_wait failure and hangs. Even if you don't, tt forces you to turn it off, anyway.Unset
LD_AUDIT
once the data collection is done. Here's an example, with annotated commentary:
% more memleaks.c <= source file
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
void allocate() {
int *x;
char *y;
x = (int *) malloc(sizeof(int) * 100);
y = (char *) malloc (sizeof(char) * 200);
printf("\nAddress of x = %u, y = %u", x, y);
x = (int *) malloc (sizeof(int) * 25);
y = (char *) malloc (sizeof(char) * 25);
printf("\nNew address of x = %u, y = %u\n", x, y);
free (y);
}
void main() {
while (1) {
allocate();
sleep(1);
}
}
% which cc <= Sun Studio 10 C compiler
/software/SS10/SUNWspro/prod/bin/cc
% cc -g -o memleaks memleaks.c <= Compile the code with debug (-g
) flag
% setenv LD_AUDIT /software/SS9/SUNWspro/prod/lib/dbxruntime/rtcaudit.so <= Pre-load rtcaudit library
% ./memleaks
Address of x = 134621928, y = 134622336
New address of x = 134622544, y = 134614272
Address of x = 134622656, y = 134623064
New address of x = 134623272, y = 134614272
...
...
In another window:
% ps -eaf | grep memleaks
techno 11174 10744 0 19:39:09 syscon 0:00 ./memleaks
% dbx - 11174 <= Attach the process to the debugger,dbx
For information about new features see `help changes'
To remove this message, put `dbxenv suppress_startup_message 7.4' in your .dbxrc
Reading -
Reading ld.so.1
Reading libc.so.1
Reading rtcaudit.so
Reading libmapmalloc.so.1
Reading libgen.so.1
Reading libdl.so.1
Reading libm.so.2
Reading libc_psr.so.1
Reading rtcboot.so
Reading librtc.so
(dbx) dbxenv mt_sync_tracking off <= Turn off mutex tracking
(dbx) check -leaks <= Turn on memory leak checking
leaks checking - ON
RTC: Enabling Error Checking...
RTC: Running program...
(dbx) cont <= Resume program execution
Address of x = 134623384, y = 134623792
New address of x = 134624000, y = 134614272
Address of x = 134624112, y = 134624520
New address of x = 134624728, y = 134614272
Address of x = 134624840, y = 134625248
New address of x = 134625456, y = 134614272
Address of x = 134625568, y = 134625976
New address of x = 134626184, y = 134614272
Address of x = 134626296, y = 134626704
New address of x = 134626912, y = 134614272
^C <= Interrupt the execution with Ctrl-C, to get intermediate leak report
dbx: warning: Interrupt ignored but forwarded to child.
signal INT (Interrupt) in ___nanosleep at 0xed1bc7dc
0xed1bc7dc: ___nanosleep+0x0004: ta 8
Current function is main
24 sleep(1);
(dbx) showleaks <=showleaks
reports new memory leaks since the lastshowleaks
command
Checking for memory leaks...
Actual leaks report (actual leaks: 15 total size: 3500 bytes)
Total Num of Leaked Allocation call stack
Size Blocks Block
Address
========== ====== =========== =======================================
2000 5 - allocate < main
1000 5 - allocate < main
500 5 - allocate < main
Possible leaks report (possible leaks: 0 total size: 0 bytes)
(dbx) cont <= Continue the execution of the program
Address of x = 139208, y = 139632
New address of x = 139856, y = 139984
Address of x = 140040, y = 140464
New address of x = 140688, y = 135824
Address of x = 140816, y = 141240
New address of x = 141464, y = 136656
Address of x = 141592, y = 142016
New address of x = 142240, y = 137488
^C <= Interrupt the execution with Ctrl-C, to get intermediate leak report
dbx: warning: Interrupt ignored but forwarded to child.
signal INT (Interrupt) in ___nanosleep at 0xed1bc7dc
0xed1bc7dc: ___nanosleep+0x0004: ta 8
Current function is main
24 sleep(1);
(dbx) showleaks <=showleaks
reports new memory leaks since the lastshowleaks
command
Checking for memory leaks...
Actual leaks report (actual leaks: 12 total size: 2800 bytes)
Total Num of Leaked Allocation call stack
Size Blocks Block
Address
========== ====== =========== =======================================
1600 4 - allocate < main
800 4 - allocate < main
400 4 - allocate < main
Possible leaks report (possible leaks: 0 total size: 0 bytes)
(dbx) showleaks -a <=showleaks -a
show all the leaks generated so far,
not just the leaks since the lastshowleaks
command
Checking for memory leaks...
Actual leaks report (actual leaks: 27 total size: 6300 bytes)
Total Num of Leaked Allocation call stack
Size Blocks Block
Address
========== ====== =========== =======================================
3600 9 - allocate < main
1800 9 - allocate < main
900 9 - allocate < main
Possible leaks report (possible leaks: 0 total size: 0 bytes)
(dbx) showleaks -v <= Verbose report, since the lastshowleaks
command. Default: Non verbose report
Checking for memory leaks...
Actual leaks report (actual leaks: 12 total size: 2800 bytes)
Memory Leak (mel):
Found 4 leaked blocks with total size 1600 bytes
At time of each allocation, the call stack was:
[1] allocate() at line 9 in "memleaks.c"
[2] main() at line 23 in "memleaks.c"
Memory Leak (mel):
Found 4 leaked blocks with total size 800 bytes
At time of each allocation, the call stack was:
[1] allocate() at line 10 in "memleaks.c"
[2] main() at line 23 in "memleaks.c"
Memory Leak (mel):
Found 4 leaked blocks with total size 400 bytes
At time of each allocation, the call stack was:
[1] allocate() at line 14 in "memleaks.c"
[2] main() at line 23 in "memleaks.c"
Possible leaks report (possible leaks: 0 total size: 0 bytes)
(dbx) func allocate <= Change the current function toallocate
(dbx) list 9 <= List line number 9 of functionallocate
9 x = (int *) malloc(sizeof(int) * 100);
(dbx) list 10
10 y = (char *) malloc (sizeof(char) * 200);
(dbx) list 14
14 x = (int *) malloc (sizeof(int) * 25);
(dbx) func main
(dbx) list 23
23 allocate();
(dbx)
In short, memory leak report can be generated as follows with the help of RTC feature of dbx:
- Compile your program with
-g
(debug) flag, to get source line numbers in the runtime checking error messages. However it is not mandatory to compile with-g
to do runtime checking (RTC) - Pre-load RTC library, rtcaudit:
setenv LD_AUDIT <path-to-rtcaudit-lib>/rtcaudit.so
- Load the program with dbx or attach the running process to dbx with
dbx - <pid>
- Turn off mutex tracking with
dbxenv mt_sync_tracking off
-- I have no clue why dbx didn't like mutex tracking on; and there's no relevant documentation anywhere in Sun documentation site. Perhaps Chris Quenelle can explain this - Turn on the runtime checking for memory leaks with
check -leaks
command - Run the program with
run
command, if it is not running already - Occasionally interrupt the execution with Ctrl-C, and get the leak report with any of
showleaks, showleaks -a, showleaks -v
commands - Detach the process, once the data has been collected
- Unset the LD_AUDIT variable.
unsetenv LD_AUDIT
Technorati tags: Sun Studio | dbx
Tuesday, 4 October 2005
Dynamic TSB support in Solaris 10
Posted on 00:54 by Unknown
Recently at our partner (ISV, short for Independent Software Vendor) site, I was asked to look into some unexpected {huge} performance improvement {compared to Solaris 9} in running their application on a UltraSPARC based server running Solaris 10. The application was compiled/built on Solaris 8. Although Solaris 10 has many improvements, the mileage of the customer applications vary depending on the nature of the application being run. The application under discussion is a userland, CPU intensive financial application, written in C++. Apparently I didn't have much information about the way the application was compiled, and on which platform (hardware & software). So I made one of the Sun Fire v480's, a dual boot server with Solaris 9 and 10 installed on two partitions. I've installed the ISV's application on the third partition, and did a quick test by loading up the machine with virtual users until the average CPU consumption was about 85%. Interestingly I haven't seen phenomenal improvement as the ISV observed, but decent enough (~2.88%) gain that lead me to investigate further.
Here's the trapstat output from Solaris 9 env:
Average time spent in virtual to physical memory translatations: 20.74%
trapstat output from Solaris 10 env:
Average time spent in virtual to physical memory translatations: 18.21%
In Solaris 9 env, the OS has spent 2.53% CPU cycles more {compared to Solaris 10}, in serving TLB/TSB misses. A closer look at the trapstat outputs reveals that Solaris 10 has less burden of serving TSB misses {for data}. There's about 3.64% difference between Solaris 9 & 10's dTSB miss%; but Solaris 10 spent ~0.75% more cycles in serving dTLB misses, compared to Solaris 9, which leaves us 2.89% (ie., 3.64% - 0.75%). Surprisingly this number (2.89%) matched with the gain (2.88%) I've observed by running the same application on both Solaris 9 & 10.
Solaris 10's dynamic TSB support
Since I ran the same application on same hardware with different versions of Solaris, I can directly attribute the improvement in performance to Solaris 10. It is no brainer to quickly realize that this is the result of the algorithmic changes to dynamic TSB support in Solaris 10.
On Solaris 9 and prior versions, depending on the physical memory installed on the machine, the system allocates a fixed number of TSBs with size 128KB or 512KB, at boot time; and since the number is fixed, all processes have to share those TSBs. Due to the limited (only 2) number of supported TSB sizes, any process that needs a TSB of size somewhere between 128 & 512, say 256KB, may either experience a miss (for eg., if the translation was done in a 128KB TSB) or wastes some memory (for eg, if the translation was done in 512KB).
Prior to version 10, Solaris is lacking the flexibility of using the right TSB size, for the right process. Recent versions of UltraSPARC chips can support TSBs of eight different sizes (8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K or 1M). By sticking to only 128K and 512K TSB's, Solaris 9 and prior versions couldn't take the advantage of the hardware capability quite efficiently.
Solaris 10 overcomes those drawbacks mentioned above by creating a TSB on the fly, per the needs of the process. Here's the corresponding RFE to fix the issues which were seen until Solaris 9:Integrate support for Dynamic TSBs.
Now it makes more sense for me to mention about the 3% reduction in memory footprint per user, in my test runs.
To complete the original story of huge performance difference between Solaris 9 & 10, I gave them a check list to make sure they are doing apples-to-apples comparison; but I never heard from 'em back. Anyway here's the check list that I sent:
Technorati tag: Solaris
Here's the trapstat output from Solaris 9 env:
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 1339705 7.7 12031 0.5 | 869027 6.3 86899 4.7 |19.2
ttl | 1371385 7.9 12165 0.5 | 931897 6.8 93874 5.1 |20.3
ttl | 1261136 7.2 11227 0.5 | 862982 6.3 86420 4.7 |18.7
ttl | 1334286 7.7 12201 0.5 | 871144 6.4 90464 4.9 |19.4
ttl | 1423610 8.2 14101 0.6 | 957773 7.1 105544 5.7 |21.6
ttl | 1399334 8.1 14120 0.6 | 973754 7.2 110116 6.0 |21.9
ttl | 1478324 8.5 13310 0.6 | 975822 7.2 104689 5.7 |21.9
ttl | 1416840 8.1 12698 0.5 | 962725 7.1 98593 5.3 |21.0
ttl | 1464161 8.4 13149 0.6 | 974467 7.2 105842 5.8 |21.9
ttl | 1412006 8.2 13685 0.6 | 915772 6.9 107461 5.9 |21.5
Average time spent in virtual to physical memory translatations: 20.74%
trapstat output from Solaris 10 env:
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
ttl | 1449113 8.7 5045 0.2 | 1015584 7.5 35621 1.7 |18.1
ttl | 1504522 9.0 5771 0.3 | 1056137 7.9 39809 1.9 |19.0
ttl | 1372013 8.2 4824 0.2 | 965968 7.2 33577 1.6 |17.2
ttl | 1366566 8.2 5194 0.2 | 988130 7.3 34719 1.6 |17.3
ttl | 1433062 8.6 5170 0.2 | 1006544 7.4 34607 1.6 |17.9
ttl | 1463364 8.8 5403 0.2 | 1023112 7.6 37313 1.7 |18.3
ttl | 1356094 8.1 4904 0.2 | 979501 7.3 34212 1.6 |17.2
ttl | 1497592 9.0 5816 0.3 | 1060080 7.9 39844 1.9 |19.0
ttl | 1468445 8.8 6166 0.3 | 1079617 8.1 42968 2.0 |19.2
ttl | 1505277 9.0 5737 0.3 | 1062101 7.8 39025 1.8 |18.9
Average time spent in virtual to physical memory translatations: 18.21%
In Solaris 9 env, the OS has spent 2.53% CPU cycles more {compared to Solaris 10}, in serving TLB/TSB misses. A closer look at the trapstat outputs reveals that Solaris 10 has less burden of serving TSB misses {for data}. There's about 3.64% difference between Solaris 9 & 10's dTSB miss%; but Solaris 10 spent ~0.75% more cycles in serving dTLB misses, compared to Solaris 9, which leaves us 2.89% (ie., 3.64% - 0.75%). Surprisingly this number (2.89%) matched with the gain (2.88%) I've observed by running the same application on both Solaris 9 & 10.
Solaris 10's dynamic TSB support
Since I ran the same application on same hardware with different versions of Solaris, I can directly attribute the improvement in performance to Solaris 10. It is no brainer to quickly realize that this is the result of the algorithmic changes to dynamic TSB support in Solaris 10.
On Solaris 9 and prior versions, depending on the physical memory installed on the machine, the system allocates a fixed number of TSBs with size 128KB or 512KB, at boot time; and since the number is fixed, all processes have to share those TSBs. Due to the limited (only 2) number of supported TSB sizes, any process that needs a TSB of size somewhere between 128 & 512, say 256KB, may either experience a miss (for eg., if the translation was done in a 128KB TSB) or wastes some memory (for eg, if the translation was done in 512KB).
Prior to version 10, Solaris is lacking the flexibility of using the right TSB size, for the right process. Recent versions of UltraSPARC chips can support TSBs of eight different sizes (8K, 16K, 32K, 64K, 128K, 256K, 512K, 1024K or 1M). By sticking to only 128K and 512K TSB's, Solaris 9 and prior versions couldn't take the advantage of the hardware capability quite efficiently.
Solaris 10 overcomes those drawbacks mentioned above by creating a TSB on the fly, per the needs of the process. Here's the corresponding RFE to fix the issues which were seen until Solaris 9:Integrate support for Dynamic TSBs.
Now it makes more sense for me to mention about the 3% reduction in memory footprint per user, in my test runs.
To complete the original story of huge performance difference between Solaris 9 & 10, I gave them a check list to make sure they are doing apples-to-apples comparison; but I never heard from 'em back. Anyway here's the check list that I sent:
- On which hardware (US II/III/IV/..), the application was built?
This is extremely important to know, because building on US II, and running the binary on later processors (US III, III+, ..) will have significant impact on the overall performance of the application. For eg., I have seen nearly 4% performance difference in CPU utilization with some application that was built on US II, and ran on US III+ machine, with similar work loads. - Is it the same binary that was run on both Solaris 9 & 10?
- Check the difference(s) in the run-time environments of both experiments. Have the tests been conducted on the same kind of hardware ? With same #processors? With same load? etc.,
- Make sure to use the same {application & OS} tunables, in both experiments
- Which version of Solaris 10 is in use to test the application?
The later versions of Solaris 10 (also called Solaris Express builds), enable large pages for data and instructions by default, if the OS thinks doing so is beneficial. Large pages for data (MPSS) is already introduced in Solaris 9; now Solaris 11 extends it to instructions. So, if Solaris Express bits are being used (less likely though; but there's a possibility), there is almost 13 to 15% improvement (based on the trapstat data shown above) in CPU utilization with no effort from the users. - Make sure large pages are either enabled or not enabled on both (S9 & S10) platforms
- If the application binary is not the same, check the changes in the application, that could improve performance significantly. Also check the changes in compiler flags
Technorati tag: Solaris
Subscribe to:
Posts (Atom)