Wednesday, January 15, 2020

17 - All About Optimization

 

Toucan Linux Project - 17


A Roll Your Own Distribution

All About Optimization


Goal 3 – Optimization Settings

There is one more optimization we can use that is relatively new to GCC. Though it has supported it for some time, only with version 8 was it useful, and in version 9, far better. It is called link time optimization, or LTO, and it occurs when the program is linked not when it is compiled. This is different from optimizing code since the optimizer can only look at the code of a single file. Before we get to far into LTO, let's first examine what the compiler does.

How GCC Works

GCC has several passes it makes using various different programs including the pre-compiler which does work with macros, include files, and pragma. It builds a single file of source code that includes everything necessary to compile that single program. The next pass is compilation (which is a actually few passes) that converts the source code into an assembly file for the native processor (or chosen processor for a cross compile). It is important to note that GCC and other compilers don't optimize assembly directly but use several abstraction layers to do so, though for purposes of discussion we can think of it as optimizing the assembly. The assembly source code is then passed through an assembler which generates the executable object for the file. The optimization can occur by rewriting the assembly source or by examining the object file. Either way a new object file is created that contains the executable code, once again per source file. Finally the objects are linked together and all external references to libraries are found and added, startup code and compiler run time is added, and a file is written in the operating systems binary file structure (ELF for Linux) that if finally an executable file.

Let's take an example program we'll write ourselves for demonstration purposes. It's called Vesuvius. It is contained in two files we'll create in the home directory. For this step use your regular (non-root account.)

Create a directory in your home area to work and create a file called !main.c in it.

cd
mkdir -p code/vesuvius
cd vesuvius
cat > main.c << "EOF"
// mainc Michael R Stute
// for The Toucan Linux Project
#include <stdlib.h>
#include <ncurses.h>
#include "particle.h"

#define NUM_COLORS 12
#define COLOR_VOLCANO 1

int scrX,scrY;
int colors[NUM_COLORS];

void error(char *msg)
{
   
      endwin();
      printf("%s",msg);
      exit(1);
}


// Create colors
// Volcano color is grey
void
create_colors(int can_chg)
{
   if(can_chg) {

      init_color(0,00,00,00);   // Background

      init_color(1,300,300,350); 
      init_color(2,950,00,00); // White
      init_color(3,700,00,00); // Dull white
      init_color(4,950,0,00); // Blue White
      init_color(5,850,0,00); // blue
      init_color(6,750,0,00); // Yellow white
      init_color(7,650,0,00); // Yellow
      init_color(8,500,0,00); // Yellow red
      init_color(9,450,00,00);   // Bright red
      init_color(10,350,00,0);   // red
      init_color(11,200,00,0);    // dull red
      
      init_pair(COLOR_VOLCANO,1,COLOR_BLACK);
      init_pair(2,2,COLOR_BLACK);
      init_pair(3,3,COLOR_BLACK);
      init_pair(4,4,COLOR_BLACK);
      init_pair(5,5,COLOR_BLACK);
      init_pair(6,6,COLOR_BLACK);
      init_pair(7,7,COLOR_BLACK);
      init_pair(8,8,COLOR_BLACK);
      init_pair(9,9,COLOR_BLACK);
      init_pair(10,10,COLOR_BLACK);      
      init_pair(11,11,COLOR_BLACK);            
   }
   else {
      // Set volcano color
      init_pair(COLOR_VOLCANO,COLOR_YELLOW,COLOR_BLACK);
      
      // Create temperature colors
      init_pair(1,COLOR_WHITE,COLOR_BLACK);
      init_pair(2,COLOR_WHITE,COLOR_BLACK);
      init_pair(3,COLOR_WHITE,COLOR_BLACK);      
      init_pair(4,COLOR_BLUE,COLOR_BLACK);
      init_pair(5,COLOR_BLUE,COLOR_BLACK);      
      init_pair(6,COLOR_YELLOW,COLOR_BLACK);
      init_pair(7,COLOR_YELLOW,COLOR_BLACK);
      init_pair(8,COLOR_RED,COLOR_BLACK);
      init_pair(9,COLOR_RED,COLOR_BLACK);
      init_pair(10,COLOR_RED,COLOR_BLACK);
   }
   colors[0]=COLOR_PAIR(0);
   colors[1]=COLOR_PAIR(1);
   colors[2]=COLOR_PAIR(2);
   colors[3]=COLOR_PAIR(3);
   colors[4]=COLOR_PAIR(4);
   colors[5]=COLOR_PAIR(5);
   colors[6]=COLOR_PAIR(6);
   colors[7]=COLOR_PAIR(7);
   colors[8]=COLOR_PAIR(8);
   colors[9]=COLOR_PAIR(9);
   colors[10]=COLOR_PAIR(10);
   colors[11]=COLOR_PAIR(11);   
}

//       ***
//      *****
//     *******
//    **********
//   *************
// ****************
void
draw_volcano(int lastline)
{
   int x,y;
   
   x=scrX / 2;
   y=scrY-6;
   attr_set(A_NORMAL,COLOR_VOLCANO,NULL);
   mvprintw(y++,x-1,"***");
   mvprintw(y++,x-2,"*****");
   mvprintw(y++,x-3,"*******");
   mvprintw(y++,x-4,"*********");
   mvprintw(y++,x-5,"***********");
   if(lastline)
     mvprintw(y,x-7,"**************");   
}

void
erupt(int rocks) 
{
   int counter=0,next_p=0,next_time,todo;
   PARTICLE *particles[100]={NULL};
   
   next_time=counter+1;
   
   for(;;counter++) {
      todo=1+ lrand48() % rocks;
      for(int g=0;g<todo;g++) {    
if(counter==next_time) {
    // Start another rock unless there is no space
    if(next_p>=0)
      particles[next_p]=create_particle(scrX/2,scrY-7,1);
    // Adjust next_p and next_time
    next_time+=1;
    int i;
    for(i=0;i<100;i++) {
       if(!particles[i]) {  
  next_p=i;
  break;
       }     
    }
    if(i==100) // This means array is full
      next_p=-1;  
}
      }
      
      
      for(int i=0;i<100;i++) {
if(particles[i])
   if(move_particle(particles[i])==TRUE)  {
     destroy_particle(particles[i]);
     particles[i]=NULL;
   }  
   else
     draw_particle(particles[i]);
      }
      draw_volcano(0);
      refresh();
      if(getch()=='q')
break;
   }
}


int
main(int argc, char *argv[])
{
   // Setup curses
   int chgClr=0;
   initscr();
   /* first test for color ability of the terminal */
   if(!has_colors())
     error("Terminal doesn't support colors\n");
   /* next attempt to initialize curses colors */
   if(start_color() != OK)
     error("Starintg the color system failed.\n");
   /* colors are okay; continue */
   if(can_change_color()) 
     chgClr=1;
   
   curs_set(0);
   halfdelay(2);
   scrX=getmaxx(stdscr);
   scrY=getmaxy(stdscr);

   create_colors(chgClr);
   draw_volcano(1);
   erupt(5);
   refresh();   
   getch();
   
   endwin();
   
   
}
EOF

Create the particle file

cat > particle.c << "EOF"
// particle.c Michael R Stute
// for The Toucan Linux Project
#include <ncurses.h>
#include <stdlib.h>
#include "particle.h"

extern int colors[11];
extern int scrX, scrY;

PARTICLE *
create_particle(int sX, int sY, int temp)
{
   PARTICLE *p;
   
   p=(PARTICLE *)malloc(sizeof(PARTICLE));
   // Parameters for parabola
   // Formula is y = -(x^2 + hx) * ratio
   // where h is the height and ratio is the opening of the parabola
   // This guarantees it goes through the origin
   p->rx=p->ry=0;
   p->h=1+lrand48() % (scrY);
   p->invert=lrand48() % 2;
   p->ratio=1.0;
   p->ratio=4.0 - (8.0*drand48());
   p->color=(lrand48() % 10) + 1;
   return(p);   
}

int
move_particle(PARTICLE *p)
{
   if(!p)
     return(TRUE);
   
   p->rx++;
   p->ry=(int)((float)((p->rx*p->rx)) + (float)((-p->h*p->rx)) * p->ratio);
   // Save last position for erasing
   p->ox=p->x;
   p->oy=p->y;
   // Translate to screen
   p->y=p->ry+scrY-7;
   if(p->invert)
     p->x=p->rx+(scrX/2);
   else
     p->x=(scrX/2) - p->rx;
   if(p->y>scrX) {  // Particle has landed
      mvaddch(scrY-1,p->x,'*'|colors[p->color]);
     return(TRUE);
   }
   
   
   return(FALSE);   
}

void
destroy_particle(PARTICLE *p)
{
   if(p)
     free(p);
}

void
draw_particle(PARTICLE *p)
{
   if(!p)
     return;
   // Erase old position
   if(p->ox>=0 && p->ox<=scrX && p->oy>=0 && p->oy<=scrY)
      mvaddch(p->oy,p->ox,' ');
   
   // Draw new position
   if(p->x>=0 && p->x<=scrX && p->y>=0 && p->y<=scrY)
     mvaddch(p->y,p->x,'*'|colors[p->color]);
   
}
EOF

Create a header file to define the particle functions and structures.


cat > particle.h < "EOF"
// particle.h Michael R Stute
// for The Toucan Linux Project
typedef struct _particle 
{
   int x,y;       // screen coords
   int ox,oy;     // Previous position
   int h;         // height
   float ratio;   // width of opening
   int rx,ry;     // Real x,y
   int invert;    // Invert the parabola on x axis
   int color;   
} PARTICLE;

PARTICLE *create_particle(int sX, int sY, int temp);
int move_particle(PARTICLE *p);
void destroy_particle(PARTICLE *p);
void draw_particle(PARTICLE *p);
EOF

Now we have two files that contain the program. Test them first by compiling them

gcc particle.c main.c -o vesuvius  `pkg-config ncurses --libs --cflags`

You can run it to see a curses animated volcano spewing red hot rocks, press 'q' when you get tired of it. Now, lets use it to check out GCC's various stages.

The first step is the pre-compiler. By using the -E option GCC will stop after the pre-compiler

gcc -E main.c > /tmp/main_pc.c

You can examine the file using less or the editor. It will be much larger as the pre-compiler has retrieved and added all include files (including the includes they include) as well applying the macros in the program. That is the code that is actually passed to the compiler. Much larger? Yes.

Next step is to check out the assembly code the compiler generates

gcc -S main.c

It will output a file called main.s which contains the assembly code the compiler has created. This will be optimized using whatever options are passed on the command line, which is none (default -O0) in this case. You can use different optimization options and save the output to different file names, then use diff to find out the different optimizations.

Lastly, GCC will create only the assembled object files (but not the program with include startup code, etc.) using the -c option.

gcc -c main.c

which will leave a file called main.o behind.

Examining the Object File

Object files are the ELF format that Linux has used since around 1995. All binary object files (code), shared libraries, and executable files are stored in the ELF file format as adopted by most Unix-like operating systems support dynamic linking (and now even non-Unix OSes.) Though it is rare that a user might need to know anything about the ELF specification, we will use it to examine the output of the compiler.

There are two binaries that come as part of the util-linux package: readelf and objdump. The readelf utility will display everything it can if passed the -a option (or --all). Here is the output as when it examines the !main.o file from the compiler

$ readelf -a main.o
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              REL (Relocatable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          0 (bytes into file)
  Start of section headers:          6416 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           0 (bytes)
  Number of program headers:         0
  Size of section headers:           64 (bytes)
  Number of section headers:         13
  Section header string table index: 12

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000803  0000000000000000  AX       0     0     1
  [ 2] .rela.text        RELA             0000000000000000  00000e88
       00000000000009a8  0000000000000018   I      10     1     8
  [ 3] .data             PROGBITS         0000000000000000  00000843
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .bss              NOBITS           0000000000000000  00000843
       0000000000000000  0000000000000000  WA       0     0     1
  [ 5] .rodata           PROGBITS         0000000000000000  00000848
       000000000000008b  0000000000000000   A       0     0     8
  [ 6] .comment          PROGBITS         0000000000000000  000008d3
       000000000000001b  0000000000000001  MS       0     0     1
  [ 7] .note.GNU-stack   PROGBITS         0000000000000000  000008ee
       0000000000000000  0000000000000000           0     0     1
  [ 8] .eh_frame         PROGBITS         0000000000000000  000008f0
       00000000000000b8  0000000000000000   A       0     0     8
  [ 9] .rela.eh_frame    RELA             0000000000000000  00001830
       0000000000000078  0000000000000018   I      10     8     8
  [10] .symtab           SYMTAB           0000000000000000  000009a8
       00000000000003a8  0000000000000018          11     9     8
  [11] .strtab           STRTAB           0000000000000000  00000d50
       0000000000000135  0000000000000000           0     0     1
  [12] .shstrtab         STRTAB           0000000000000000  000018a8
       0000000000000061  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

There are no section groups in this file.

There are no program headers in this file.

There is no dynamic section in this file.

Relocation section '.rela.text' at offset 0xe88 contains 103 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
00000000000d  000e00000004 R_X86_64_PLT32    0000000000000000 endwin - 4
00000000001b  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 4
000000000025  000f00000004 R_X86_64_PLT32    0000000000000000 printf - 4
00000000002f  001000000004 R_X86_64_PLT32    0000000000000000 exit - 4
00000000005d  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
000000000076  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
00000000008f  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
0000000000a8  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
0000000000c1  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
0000000000da  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
0000000000f3  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
00000000010c  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
000000000125  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
00000000013e  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
000000000157  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
000000000170  001200000004 R_X86_64_PLT32    0000000000000000 init_color - 4
000000000184  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000198  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000001ac  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000001c0  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000001d4  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000001e8  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000001fc  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000210  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000224  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000238  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
00000000024c  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000265  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000279  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
00000000028d  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000002a1  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000002b5  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000002c9  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000002dd  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
0000000002f1  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000305  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000319  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
00000000032d  001300000004 R_X86_64_PLT32    0000000000000000 init_pair - 4
000000000333  000b00000002 R_X86_64_PC32     0000000000000020 colors - 8
00000000033d  000b00000002 R_X86_64_PC32     0000000000000020 colors - 4
000000000347  000b00000002 R_X86_64_PC32     0000000000000020 colors + 0
000000000351  000b00000002 R_X86_64_PC32     0000000000000020 colors + 4
00000000035b  000b00000002 R_X86_64_PC32     0000000000000020 colors + 8
000000000365  000b00000002 R_X86_64_PC32     0000000000000020 colors + c
00000000036f  000b00000002 R_X86_64_PC32     0000000000000020 colors + 10
000000000379  000b00000002 R_X86_64_PC32     0000000000000020 colors + 14
000000000383  000b00000002 R_X86_64_PC32     0000000000000020 colors + 18
00000000038d  000b00000002 R_X86_64_PC32     0000000000000020 colors + 1c
000000000397  000b00000002 R_X86_64_PC32     0000000000000020 colors + 20
0000000003a1  000b00000002 R_X86_64_PC32     0000000000000020 colors + 24
0000000003b9  000900000002 R_X86_64_PC32     0000000000000004 scrX - 4
0000000003cb  000a00000002 R_X86_64_PC32     0000000000000004 scrY - 4
0000000003d8  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000003e4  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
000000000401  000500000002 R_X86_64_PC32     0000000000000000 .rodata - 1
00000000040f  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
000000000425  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 3
000000000433  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
000000000449  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 9
000000000457  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
00000000046d  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 11
00000000047b  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
000000000491  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 1b
00000000049f  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
0000000004b5  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 27
0000000004c3  001600000004 R_X86_64_PLT32    0000000000000000 mvprintw - 4
000000000525  001800000004 R_X86_64_PLT32    0000000000000000 lrand48 - 4
00000000056f  000a00000002 R_X86_64_PC32     0000000000000004 scrY - 4
000000000578  000900000002 R_X86_64_PC32     0000000000000004 scrX - 4
00000000058f  001900000004 R_X86_64_PLT32    0000000000000000 create_particle - 4
00000000064b  001a00000004 R_X86_64_PLT32    0000000000000000 move_particle - 4
000000000668  001b00000004 R_X86_64_PLT32    0000000000000000 destroy_particle - 4
000000000696  001c00000004 R_X86_64_PLT32    0000000000000000 draw_particle - 4
0000000006b4  001400000004 R_X86_64_PLT32    00000000000003ac draw_volcano - 4
0000000006bb  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000006c3  001d00000004 R_X86_64_PLT32    0000000000000000 wrefresh - 4
0000000006ca  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000006d2  001e00000004 R_X86_64_PLT32    0000000000000000 wgetch - 4
0000000006f9  001f00000004 R_X86_64_PLT32    0000000000000000 __stack_chk_fail - 4
000000000716  002100000004 R_X86_64_PLT32    0000000000000000 initscr - 4
00000000071b  002200000004 R_X86_64_PLT32    0000000000000000 has_colors - 4
000000000729  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 3c
00000000072e  000c00000004 R_X86_64_PLT32    0000000000000000 error - 4
000000000733  002300000004 R_X86_64_PLT32    0000000000000000 start_color - 4
00000000073e  000500000002 R_X86_64_PC32     0000000000000000 .rodata + 64
000000000743  000c00000004 R_X86_64_PLT32    0000000000000000 error - 4
000000000748  002400000004 R_X86_64_PLT32    0000000000000000 can_change_color - 4
00000000075d  002500000004 R_X86_64_PLT32    0000000000000000 curs_set - 4
000000000767  002600000004 R_X86_64_PLT32    0000000000000000 halfdelay - 4
00000000076e  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
00000000077a  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
00000000078f  000900000002 R_X86_64_PC32     0000000000000004 scrX - 4
000000000796  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000007a2  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000007b7  000a00000002 R_X86_64_PC32     0000000000000004 scrY - 4
0000000007c1  001100000004 R_X86_64_PLT32    0000000000000033 create_colors - 4
0000000007cb  001400000004 R_X86_64_PLT32    00000000000003ac draw_volcano - 4
0000000007d5  001700000004 R_X86_64_PLT32    00000000000004ca erupt - 4
0000000007dc  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000007e4  001d00000004 R_X86_64_PLT32    0000000000000000 wrefresh - 4
0000000007eb  001500000002 R_X86_64_PC32     0000000000000000 stdscr - 4
0000000007f3  001e00000004 R_X86_64_PLT32    0000000000000000 wgetch - 4
0000000007f8  000e00000004 R_X86_64_PLT32    0000000000000000 endwin - 4

Relocation section '.rela.eh_frame' at offset 0x1830 contains 5 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000000020  000200000002 R_X86_64_PC32     0000000000000000 .text + 0
00000000003c  000200000002 R_X86_64_PC32     0000000000000000 .text + 33
00000000005c  000200000002 R_X86_64_PC32     0000000000000000 .text + 3ac
00000000007c  000200000002 R_X86_64_PC32     0000000000000000 .text + 4ca
00000000009c  000200000002 R_X86_64_PC32     0000000000000000 .text + 6ff

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Symbol table '.symtab' contains 39 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS main.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4
     5: 0000000000000000     0 SECTION LOCAL  DEFAULT    5
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    7
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    8
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    6
     9: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM scrX
    10: 0000000000000004     4 OBJECT  GLOBAL DEFAULT  COM scrY
    11: 0000000000000020    48 OBJECT  GLOBAL DEFAULT  COM colors
    12: 0000000000000000    51 FUNC    GLOBAL DEFAULT    1 error
    13: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND _GLOBAL_OFFSET_TABLE_
    14: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND endwin
    15: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND printf
    16: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND exit
    17: 0000000000000033   889 FUNC    GLOBAL DEFAULT    1 create_colors
    18: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND init_color
    19: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND init_pair
    20: 00000000000003ac   286 FUNC    GLOBAL DEFAULT    1 draw_volcano
    21: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND stdscr
    22: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND mvprintw
    23: 00000000000004ca   565 FUNC    GLOBAL DEFAULT    1 erupt
    24: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND lrand48
    25: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND create_particle
    26: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND move_particle
    27: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND destroy_particle
    28: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND draw_particle
    29: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND wrefresh
    30: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND wgetch
    31: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND __stack_chk_fail
    32: 00000000000006ff   260 FUNC    GLOBAL DEFAULT    1 main
    33: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND initscr
    34: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND has_colors
    35: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND start_color
    36: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND can_change_color
    37: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND curs_set
    38: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND halfdelay

No version information found in this file.

This would be even longer if the -g option were passed to the compiler to create debug. An ELF file consists of different sections and the two we are most interested in for now are the "Section Headers" and "Relocation" sectionz. We'll start with the latter first.

The section headers shows which sections are in the file. It can be viewed  by itself with the -S option.

readelf -S main.o
There are 13 section headers, starting at offset 0x1910:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .text             PROGBITS         0000000000000000  00000040
       0000000000000803  0000000000000000  AX       0     0     1
  [ 2] .rela.text        RELA             0000000000000000  00000e88
       00000000000009a8  0000000000000018   I      10     1     8
  [ 3] .data             PROGBITS         0000000000000000  00000843
       0000000000000000  0000000000000000  WA       0     0     1
  [ 4] .bss              NOBITS           0000000000000000  00000843
       0000000000000000  0000000000000000  WA       0     0     1
  [ 5] .rodata           PROGBITS         0000000000000000  00000848
       000000000000008b  0000000000000000   A       0     0     8
  [ 6] .comment          PROGBITS         0000000000000000  000008d3
       000000000000001b  0000000000000001  MS       0     0     1
  [ 7] .note.GNU-stack   PROGBITS         0000000000000000  000008ee
       0000000000000000  0000000000000000           0     0     1
  [ 8] .eh_frame         PROGBITS         0000000000000000  000008f0
       00000000000000b8  0000000000000000   A       0     0     8
  [ 9] .rela.eh_frame    RELA             0000000000000000  00001830
       0000000000000078  0000000000000018   I      10     8     8
  [10] .symtab           SYMTAB           0000000000000000  000009a8
       00000000000003a8  0000000000000018          11     9     8
  [11] .strtab           STRTAB           0000000000000000  00000d50
       0000000000000135  0000000000000000           0     0     1
  [12] .shstrtab         STRTAB           0000000000000000  000018a8
       0000000000000061  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

This shows all the sections in the file and where to find them. You might think of this as an index to the file. Each section has several indicators of what they are and what must be done as indicated by the info, section title, and the flags. The code is contained in a section called ".text" and another section called "Relocation" (rela.text) which is a list of code relocations that must occur for the program to execute. They are architecture dependent and there are usually quite a few of them. You might think of them as a step-by-step list of instructions necessary to finding all the bits and pieces of code and data for the program to run, including allocations. The most common are X86_64_COPY which means "copy the code from the library to this address" and X86_64_JUMP_SLO (JUMP_SLOT) which is a standard way to link to a function that is shared by multiple programs (in the Global Offset Table -- GOT) so that only one copy needs to be resident in memory for all programs. The linker will use the Procedure Linkage Table to resolve and write in the addresses from the GOT (referred to as GOT/PLT) linking. This is probably more than we need to know, but we are here to learn. You won't see these in the object files (.o) because they are used for linking but are not linked themselves. You can see the GOT/PLT on the executable itself using the second command objdump.

objdump -d vesuvius

Here's a sample of the output

Disassembly of section .plt:

0000000000000ab0 <__cxa_finalize@plt-0x10>:
 ab0:   ff 35 3a 25 00 00       pushq  0x253a(%rip)        # 2ff0 <_GLOBAL_OFFSET_TABLE_+0x8>
 ab6:   ff 25 3c 25 00 00       jmpq   *0x253c(%rip)        # 2ff8 <_GLOBAL_OFFSET_TABLE_+0x10>
 abc:   90                      nop
 abd:   90                      nop
 abe:   90                      nop
 abf:   90                      nop

0000000000000ac0 <__cxa_finalize@plt>:
 ac0:   ff 25 3a 25 00 00       jmpq   *0x253a(%rip)        # 3000 <__cxa_finalize@GLIBC_2.2.5>
 ac6:   68 00 00 00 00          pushq  $0x0
 acb:   e9 e0 ff ff ff          jmpq   ab0 <_init+0x20>

0000000000000ad0 <endwin@plt>:
 ad0:   ff 25 32 25 00 00       jmpq   *0x2532(%rip)        # 3008 <endwin>
 ad6:   68 01 00 00 00          pushq  $0x1
 adb:   e9 d0 ff ff ff          jmpq   ab0 <_init+0x20>

0000000000000ae0 <printf@plt>:
 ae0:   ff 25 2a 25 00 00       jmpq   *0x252a(%rip)        # 3010 <printf@GLIBC_2.2.5>

Scanning the far right hand column we see reference to the Global Offset Table, to external libraries (printf@GLIBC_2.2.5) and entries in the procedure linkage table.

What you also see is that objdump has disassembled the code which you can see both in the executable and the object files.

Now when we compile with optimization each file is examined for ways to speed up as mentioned earlier. But in this program, as with many others, the code is contained in more than one file. Here it is very simple to make it easy to follow, but image a program like ffmpeg that is made of over 300 object files. You might wonder is it possible to optimize all the code in all the files. The answer is yes.

Link Time Optimization

Newer versions of GCC support a type of optimization that examines all the object files of a project, and does further optimization of the complete program as a whole. While this has been supported from some time, it started becoming useful in GCC 7, was greatly improved in GCC 8, and finally became something fairly powerful in GCC 9.

GCC breaks the source code in a token tree after parsing the code and then converts this into several other intermediate forms as it compiles the code. After the parsing tree is built it creates a representation called "generic" that represent entire functions in generic trees. In this way source code can be a mixture of the languages GCC supports (FORTRAN, C, C++, Objective C/++, D, Go, BRIG, and Ada.) Once in a generic tree format the code can be treated operationally the same, meaning the source no longer matters, making it generic. After Generic it creates tuples from the generic tree. These tuples are highly uniform in order to allow optimization. This is called GIMPLE after the original idea as proposed and implemented in the McCat compiler from McGill University which was called SIMPLE. GIMPLE is a representation of the code in tuples that allows easy examination of the code as simple steps that can be optimized.

There are several advantages to this, including being to optimize different source languages using the same optimizer since it originates from the generic pass. Now, as long as the source is first converted to generic then GIMPLE tuples it can optimized and any new technique that is added to the optimizer automatically applies to all source languages. The other major advantage is that multiple files of GIMPLE can be treated as a single entity and further optimization can occur allowing optimizations to occur at the program level instead of just the file level. To do this, the compiler needs to save the GIMPLE representation of each file as it compiles them so it can later look at them all together. That is called Link Time Optimization, or LTO for short which in enabled at compile time with -flto. With this switch, GCC will write the GIMPLE for each function (along with a few other necessities) to the object  file. Of course, there is no advantage unless the program is linked from multiple object files that is why the vesuvius example program is awkwardly written over two files.

Compile it to object files with the LTO option:

gcc main.c -c -flto  `pkg-config ncurses --libs --cflags

Then examine with *objdump* program

objdump -xd main.o

main.o:     file format elf64-x86-64
main.o
architecture: i386:x86-64, flags 0x00000010:
HAS_SYMS
start address 0x0000000000000000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 .data         00000000  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  2 .bss          00000000  0000000000000000  0000000000000000  00000040  2**0
                  ALLOC
  3 .gnu.lto_.inline.5caf6f8700863e25 000000c5  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, READONLY, EXCLUDE
  4 .gnu.lto_.jmpfuncs.5caf6f8700863e25 0000001e  0000000000000000  0000000000000000  00000105  2**0
                  CONTENTS, READONLY, EXCLUDE
  5 .gnu.lto_error.5caf6f8700863e25 0000015d  0000000000000000  0000000000000000  00000123  2**0
                  CONTENTS, READONLY, EXCLUDE
  6 .gnu.lto_create_colors.5caf6f8700863e25 0000050d  0000000000000000  0000000000000000  00000280  2**0
                  CONTENTS, READONLY, EXCLUDE
  7 .gnu.lto_draw_volcano.5caf6f8700863e25 000004b0  0000000000000000  0000000000000000  0000078d  2**0
                  CONTENTS, READONLY, EXCLUDE
  8 .gnu.lto_erupt.5caf6f8700863e25 0000076f  0000000000000000  0000000000000000  00000c3d  2**0
                  CONTENTS, READONLY, EXCLUDE
  9 .gnu.lto_main.5caf6f8700863e25 000005cf  0000000000000000  0000000000000000  000013ac  2**0
                  CONTENTS, READONLY, EXCLUDE
 10 .gnu.lto_.symbol_nodes.5caf6f8700863e25 0000017f  0000000000000000  0000000000000000  0000197b  2**0
                  CONTENTS, READONLY, EXCLUDE
 11 .gnu.lto_.refs.5caf6f8700863e25 00000059  0000000000000000  0000000000000000  00001afa  2**0
                  CONTENTS, READONLY, EXCLUDE
 12 .gnu.lto_.decls.5caf6f8700863e25 000016fd  0000000000000000  0000000000000000  00001b53  2**0
                  CONTENTS, READONLY, EXCLUDE
 13 .gnu.lto_.symtab.5caf6f8700863e25 00000287  0000000000000000  0000000000000000  00003250  2**0
                  CONTENTS, READONLY, EXCLUDE
 14 .gnu.lto_.opts 000003e5  0000000000000000  0000000000000000  000034d7  2**0
                  CONTENTS, READONLY, EXCLUDE
 15 .comment      0000001b  0000000000000000  0000000000000000  000038bc  2**0
                  CONTENTS, READONLY
 16 .note.GNU-stack 00000000  0000000000000000  0000000000000000  000038d7  2**0
                  CONTENTS, READONLY
SYMBOL TABLE:
0000000000000000 l    df *ABS*  0000000000000000 main.c
0000000000000000 l    d  .text  0000000000000000 .text
0000000000000000 l    d  .data  0000000000000000 .data
0000000000000000 l    d  .bss   0000000000000000 .bss
0000000000000000 l    d  .gnu.lto_.inline.5caf6f8700863e25      0000000000000000 .gnu.lto_.inline.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.jmpfuncs.5caf6f8700863e25    0000000000000000 .gnu.lto_.jmpfuncs.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_error.5caf6f8700863e25        0000000000000000 .gnu.lto_error.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_create_colors.5caf6f8700863e25        0000000000000000 .gnu.lto_create_colors.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_draw_volcano.5caf6f8700863e25 0000000000000000 .gnu.lto_draw_volcano.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_erupt.5caf6f8700863e25        0000000000000000 .gnu.lto_erupt.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_main.5caf6f8700863e25 0000000000000000 .gnu.lto_main.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.symbol_nodes.5caf6f8700863e25        0000000000000000 .gnu.lto_.symbol_nodes.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.refs.5caf6f8700863e25        0000000000000000 .gnu.lto_.refs.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.decls.5caf6f8700863e25       0000000000000000 .gnu.lto_.decls.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.symtab.5caf6f8700863e25      0000000000000000 .gnu.lto_.symtab.5caf6f8700863e25
0000000000000000 l    d  .gnu.lto_.opts 0000000000000000 .gnu.lto_.opts
0000000000000000 l    d  .note.GNU-stack        0000000000000000 .note.GNU-stack
0000000000000000 l    d  .comment       0000000000000000 .comment
0000000000000001       O *COM*  0000000000000001 __gnu_lto_v1
0000000000000001       O *COM*  0000000000000001 __gnu_lto_slim

Notice the -d option was used with objdump which will dissemble all code yet there is no assembly. This is because there is no assembly yet. Instead, there are now many new entries it in symbol table starting with gnu.lto_* which contain the functions compiled into GIMPLE. It also maintains an internal symbol table, list of jump function utilities, inline functions the compiler created from normal functions, and errors. This proves the code is not yet compiled even to assembly which will now take place at link time instead of compile time.

Because the code must first be optimized the compiler will use a linker helper program. In order to trigger the link phase of the compile to examine the GIMPLE of each file, use the -fuse-linker-plugin for the compiler. (though it is a linker option it should also be used on the compile command line if objects are compiled separately since the generic plugin interface can be called from either pass.) This will cause GCC to first load all the GIMPLE entries from all the object files, create one large function table for all functions in the program, and further optimize them before creating the assembly, and finally passing assembly code to the assembler to create the binary objects.

gcc particle.c main.c -o vesuvius -flto -fuse-linker-plugin `pkg-config ncurses --libs --cflags

This does mean that compile time will increase significantly since this extra pass is required. It also means the intermediate object files will generally be bigger (though not by a lot.) The biggest difference will be the time it takes to link a large program. The linker (with the help of the plugin) now must make an extra pass to load examine all the GIMPLE from the files then decide how to further optimize the code (mostly in inlining external functions from other files) and then finally write the assembly, call the assembler, and at last, link and create the executable or library. As you can imagine this can take quite some time for large programs.

Developer Alert

It is possible to specify the optimization level for a given section of code with GCC. This means as a developer you can profile your program to determine where the program spends most of the time, test that code in the highest optimization level, and then specify in the code what optimization level to use for a given set of functions. In this manner it is possible to use -Ofast or -O3 for core code where speed is more important than size and while using a different code level of optimization for the remainder of the code. You can, of course, specify different levels of optimization.  Since the command line option is considered the default, a programmer can override the default with a higher (or lower) level of optimization.

If you want to set optimization for a single function use the attribute after the function type declaration but before the function declaration

void __attribute__((optimize("O3"))) make_this_fast(int var1, int var2) {
 ...
}

If you want to optimize larger chunks of code, put them in a separate file and use the optimize pragma

# This file is optimized for level 3 to make these functions fast
#pragma GCC push_options
#pragma GCC optimize ("O3")

/* Put your code here */
int function1(void) {
...
}

int fast_function2(int a, int b) {
...
}

For those that develop for lower memory systems or need to fit binaries into ROMs optimization is usually set for size. But using the above compiler tricks you can optimize the program for size and still choose higher levels of optimization for execution speed where it makes sense.

Gold Linker

Since 2008, there is a alternate linker to the GNU linker that does not use the binary file description library (BFD.) The BFD system adds a level of abstraction to the object file as a way to support different object file formats for different architectures in order to handle byte-ordering differences, word size differences for data, and, of course, the object file specification itself. This is an essential tool for embedded developers and makes cross-compiling much easier (which we used to kick start the original tool set used to later build the base system.) But any layer of abstraction adds complexity and often degrades performance. The BDF library now supports more than 50 file formats and 25 different instruction set architectures which is a lot of overhead if all the system ever needs is ELF on X86 or X86_64.

To produce a much cleaner implementation, the gold linker was created. It doesn't use the BFD library and can only handle ELF object files. But it is much faster then the GNU BFD linker. It is not a version of the GNU linker but a complete rewrite from scratch with the intention of completely removing all traces of the BFD library.

To use the gold linker, set the LD environment variable to gold or specify which linker to use as compile time using the -fuse-ld parameter such as -fuse-ld=gold.

To compile out test program using LTO and the gold linker do the following

rm *.o vesuvius ; gcc particle.c main.c -o vesuvius -fuse-ld=gold -flto -fuse-linker-plugin `pkg-config ncurses --libs --cflags`

Be sure the -fuse-ld=gold comes before  any linker flags including the libraries to link and search paths for object files.

Now we are set to highly optimize our system not only for the specific CPU we are using, but also for higher optimization levels, and faster linking. Exactly how we will do this is the topic of discussion next time.

No comments:

Post a Comment