Tuesday, November 12, 2019

16 - Optimization Levels










Toucan Linux Project - 16


A Roll Your Own Distribution


Goal  3 – Rebuild Base Realm with Optimization


Now that we have an easy way to build packages and can build a whole realm, it's time to decide exactly which optimization settings we want to use globally. These can be adjusted per package as necessary, but since this is an experimental distro we want the global settings to be as aggressive as possible. This chapter will focus on optimization settings, how to change them, and how to set them per package.

GCC Optimization Options

GCC has a range of optimization settings, including support for the various CPU architectures as we discussed in the last chapter. In addition, there are options for optimizing code by examining the intermediate representation and transforming it to increase performance. These are all accessed in GCC using the -O option. Like the -march option, -O is really a shortcut that sets many other options, each of which has its own command line flag as well. While these can be controlled individually, there are very specific reasons for doing so, though it might be useful to add some optimization choices for certain architectures that are not available (at least in a specific combination) at any -O level.
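
If you want to see exactly which individual flags a given level enables, GCC can report them. This is just an inspection aid, not part of any build:

gcc -Q --help=optimizers -O2 | less

Each flag is listed as [enabled] or [disabled] for the level given, so saving the output for two levels and comparing them with diff shows exactly what changes between the levels.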

Level 0 - Debug and Compile Speed

Many believe this is "no optimization" but that isn't quite the case. The compiler still makes choices, but ones that keep the code easy to debug and the compile time short. This is really a developer option since it prioritizes compiler speed, and developers are the ones who constantly compile code. For them, compiler speed is important because it allows more development iterations in a given period of time. For production code the compile time is unimportant. This level is the default and is specified explicitly with -O0.

Level g - Optimize for Debugging Ability 

This level corresponds with the -g flag, which tells the compiler to create code that can be debugged using a source-level debugger. GCC allows optimization together with the debug (-g) option, but then the code will make unexpected jumps within the debugger. The debug optimization level optimizes the debugging experience, performs some minor optimizations, and keeps the compile time fast. This is a better choice for debugging than -O0, which turns off some compiler passes that collect debugging information. This level is chosen with -Og.

Level 1 - Optimize But Keep Compile Time Fast

This level is intended as another level a developer might use. Suppose the code base is now mostly complete, or your job on the team is to run test cases against the compiled code using different choices of compile-time options. In this case very slow code that takes a long time to run means slow testing, but you still want a fast compile to get through coding iterations quickly. This level is perfect as it optimizes for execution speed but only chooses options that don't significantly increase compile time. The execution runs are faster while the compile times stay fast. This is implemented with -O1.

Level 2 - Standard Optimization

This is the target level for production code. Here the compiler favors execution speed over compile time and, to a lesser degree, code size. This level turns on all optimizations that increase execution speed except those that generate large amounts of code, thereby increasing the code size. Even so, the code size will generally increase since some options, such as -fpartial-inlining, are turned on. This is also considered "safe" optimization because it is highly tested; optimization options that are newer, not well tested, or known to cause issues in certain circumstances are not enabled. Most production code is set and tested at this level since it greatly increases performance without significantly increasing code size or the risk of optimization errors. The -O2 option turns on this level. Unfortunately, many development teams hard code this option in their configurations or even sanitize flags to remove other options. There are ways around such choices, but many times it means the code is known to fail under more aggressive options.

Level 3 - High Optimization

Like level 2 the goal is to optimize for speed of execution. The difference is that all optimizations are included regardless of executable size or compile speed. Though they are tested, some of them, in combination with other optimization options, might cause unexpected behavior in code that is not completely well written. For this reason, the GCC development team doesn't recommend this as a standard option unless the resulting code is very well tested. This restricts the option to code that is really performance dependent, since extensive testing means a longer development iteration. However, for simple programs that can easily be tested thoroughly it might be a good choice. This is selected with the -O3 option.

Level fast - All Possible Optimization

At this level all possible optimization is performed regardless of code size, compile time, and even strict standards compliance. If the architecture supports it, the transformation is performed. This is not a safe option for most code. In practice I have found little difference between level 3 and fast, at least on the x86_64 architecture. Use -Ofast on the command line to turn on this option.

Level size - Optimize for Size

This level performs optimization routines only if they don't increase program size. The goal is to keep the program as small as possible. On modern processors with multiple levels of high-speed cache, code that fits entirely in the L1 cache might run faster than code that is highly optimized but large. That is one reason to keep the code small (the kernel, in fact, offers optimizing for size as an option to take advantage of this property). Another reason is to create a smaller disk image, for fitting a full distro on a smaller USB stick or SD card. People who do embedded work often need to fit code into a small memory space or storage area, making this option very important. The command line flag for this option is -Os.
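
To see the size side of these trade-offs for yourself, a quick sketch (assuming any small C source file, here called hello.c) is to build it at every level and compare the binary sizes:

for level in 0 g 1 2 3 fast s; do
    gcc -O$level -o hello_O$level hello.c   # one binary per level
done
ls -l hello_O*

The benchmarks below measure the other half of the trade-off, execution speed.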


Step 1 - Create BC Test Suite 

We will do a few experiments to show the difference between the optimization levels. We'll start with the programmable command line calculator bc. The developers of bc have tried to use as many speedups as possible for the various architectures, including the dedicated math operations of the CPU. Few people use bc much anymore because complete languages such as R or MATLAB have replaced it, but bc is fully programmable through scripting.


We'll use Phodd's prime library for bc to calculate the first 50,000 primes, then calculate PI to 4,000 digits, followed by calculating the Mandelbrot set 1,000 times.

Step 2 - Download or Create the Primes Library

We'll need primes.bc from Phodd at http://phodd.net/gnu-bc/code/primes.bc, or we can create it ourselves:

cat > primes.bc << "EOF"
#!/usr/bin/bc -l

### Primes.BC - Primes and factorisation (rudimentary)

 ## All factor finding is done by trial division meaning that many
 ## functions will eat CPU for long periods when encountering
 ## certain numbers. Primality testing uses better techniques and
 ## is much faster if no factors are required.
 ## e.g. 2^503-1 is identified as non-prime by the primality testers
 ## but no factors will be found in any sensible amount of time
 ## through trial division.
 ##
 ## Steps have been taken to make the trial division as fast as
 ## possible, meaning much code re-use.


max_array_ = 4^8-1

# Greatest common divisor of x and y - stolen from funcs.bc
define int_gcd(x,y) {
  auto r,os;
  os=scale;scale=0
  x/=1;y/=1
  while(y>0){r=x%y;x=y;y=r}
  scale=os
  return(x)
}

### Primality testing ###

# workhorse function for int_modpow and others
define int_modpow_(x,y,m) {
  auto r, y2;
  if(y==0)return(1)
  if(y==1)return(x%m)
  y2=y/2
  r=int_modpow_(x,y2,m); if(r>=m)r%=m
  r*=r                 ; if(r>=m)r%=m
  if(y%2){r*=x         ; if(r>=m)r%=m}
  return( r )
}

# Raise x to the y-th power, modulo m
define int_modpow(x,y,m) {
  auto os;
  os=scale;scale=0
  x/=1;y/=1;m/=1
  if(x< 0){print "int_modpow error: base is negative\n";    x=-x}
  if(y< 0){print "int_modpow error: exponent is negative\n";y=-y}
  if(m< 0){print "int_modpow error: modulus is negative\n"; m=-m}
  if(m==0){print "int_modpow error: modulus is zero\n";  return 0}
  x=int_modpow_(x,y,m)
  scale=os
  return( x )
}

## Pseudoprime tests

# Global variable to limit the number of Rabin-Miller iterations to try
# 0 => Run until sure number is prime
rabin_miller_maxtests_=0

# Uses the Rabin-Miller test for primality
#   uses a shortcut for numbers < 300 decimal digits
define is_rabin_miller_pseudoprime(p) {
  auto os,a,inc,top,next_a,q,r,s,d,x,c4;
  os=scale;scale=0
  if(p!=p/1){scale=os;return 0}
  if(p<=(q=F+2)){scale=os;return(p==2||p==3||p==5||p==7||p==B||p==D||p==q)}
  s=0;d=q=p-1;x=d/2;while(0==d-x-x){.=s++;d=x;x=d/2}
  if(p<A^(1+3*A*A)){
    # Takes a few liberties for the sake of speed compared to a
    # . full RM; Assumes small primes and perrin tests have been run
    inc=(p-4)/(C+length(p));if(inc<1)inc=1
    top=q
  } else {
    # This bizarre construct sets inc to an approximation of
    # . log to base 4 of p, meaning the loop below should
    # . run enough times to guarantee that p is prime
    inc=((B*B+F+F)*B*(length(p)+1))/A^3;if(inc<1)inc=1
    if(!inc%2).=inc++ # inc needs to be odd to ensure good mix of candidates
    top=inc*(inc+1)
  }
  if(rabin_miller_maxtests_){
    rabin_miller_maxtests_/=1
    if(rabin_miller_maxtests_<=0){
      rabin_miller_maxtests_=0
      print "Warning: rabin_miller_maxtests_ set to invalid value. Now = 0\n"
      print "         This calculation may take longer to run than expected\n"
    }else{
      inc=top/rabin_miller_maxtests_+1
    }
  } 
  for(a=2;a<top;a+=inc){
    next_a=0
    x=int_modpow_(a,d,p)
    if(x!=1&&x!=q){
      for(r=1;r<s;r++){
       x*=x;x%=p
       if(x==1){scale=os;return 0}#composite
       if(x==q){next_a=1;break}
      }#end for
      if(!next_a){scale=os;return 0}
    }#end if
  }#end for
  scale=os;return 1
}

# Determine whether p is a Perrin pseudoprime
#   returns 0 if definitely composite
#   returns 1 if possibly prime (but not definitely)
define is_perrin_pseudoprime(p) {
  auto os,i,h,rp,m[],r[],t[];#,m[];
  os=scale;scale=0
  if(p!=p/1){scale=os;return 0}
  if(p==2){scale=os;return 1}
  if(p<2||p%2==0){scale=os;return 0}
  #set rp to reverse of p
  # would love to use int_modpow to calculate powers but it doesn't work
  #  on matrices!
  # could use an array to store bits of p, but arrays are limited to
  #  65536 elements; integers have no such limits
  rp=0;i=p;while(i){h=i/2;rp+=rp+i-h-h;i=h}
  # Quick Mersenne test; if rp == p then p could be 2^i - 1 for some i.
  # if i is not prime then neither is p, and it's faster to check i first.
  # this test is unnecessary, but since half the work is already done
  # we might as well check
  if(rp==p){
    for(i=0;rp;i++)rp/=2
    if(2^i-1==p&&!is_perrin_pseudoprime(i)){scale=os;return 0}
    rp=p # restore rp; if we're here then the test failed
  }
  # m[]={0,1,1;1,0,0;0,1,0}
  #m[0]=0; m[1]=1; m[2]=1;
  #m[3]=1; m[4]=0; m[5]=0; # Never actually used
  #m[6]=0; m[7]=1; m[8]=0;
  # r[]={1,0,0;0,1,0;0,0,1}=Unit
  r[0]=1; r[1]=0; r[2]=0;
  r[3]=0; r[4]=1; r[5]=0;
  r[6]=0; r[7]=0; r[8]=1;
  while(rp) {
    # square r
    t[0]=r[1]*r[3]; t[1]=r[2]*r[6]; t[2]=r[0]+r[4]
    t[3]=r[2]*r[7]; t[4]=r[0]+r[8]
    t[5]=r[1]*r[5]; t[6]=r[5]*r[6]; t[7]=r[5]*r[7]; t[8]=r[4]+r[8]
    t[9]=r[2]*r[3]; t[A]=r[3]*r[7]; t[B]=r[1]*r[6]
    r[0]*=r[0];r[0]+=t[0]+t[1]; r[1]*=t[2];r[1]+=t[3]; r[2]*=t[4];r[2]+=t[5]
    r[3]*=t[2];r[3]+=t[6]; r[4]*=r[4];r[4]+=t[0]+t[7]; r[5]*=t[8];r[5]+=t[9]
    r[6]*=t[4];r[6]+=t[A]; r[7]*=t[8];r[7]+=t[B]; r[8]*=r[8];r[8]+=t[1]+t[7]
    h=rp/2
    if(rp-h-h){# odd
      # multiply r by m
      #  this is a hack that assumes m is {0,1,1;1,0,0;0,1,0}
      #  without actually using m.
      #  if m is changed, this will need to be rewritten
      t[0]=r[0]+r[2]; t[1]=r[3]+r[5]; t[2]=r[6]+r[8]
      r[2]=r[0]; r[0]=r[1]; r[1]=t[0]
      r[5]=r[3]; r[3]=r[4]; r[4]=t[1]
      r[8]=r[6]; r[6]=r[7]; r[7]=t[2]
    }
    for(i=0;i<9;i++)r[i]%=p
    rp=h
  }
  r[0]=(2*r[6]+3*r[8])%p
  scale=os;
  return (r[0]==0)
}

# Determine whether x is possibly prime through division by small numbers
define is_small_division_pseudoprime(x) {
 auto os,j[],ji,sx,p,n;#oldscale,jump,jump-index,sqrtx,prime,nth
 if(x<2)return 0;
 os=scale;scale=0
 if(x!=x/1){scale=os;return 0}
 j[0]=4;j[1]=2;j[2]=4;j[3]=2;j[4]=4;j[5]=6;j[6]=2;j[7]=6
 for(p=2;p<7;p+=p-1){if(x==p){scale=os;return 1};if(x%p==0){scale=os;return 0}}
 sx=sqrt(x);if(sx>(n=A^5))sx=n;# 100000(decimal) upper limit
 if(prime[A^4]){ #primes-db is present
   for(n=4;(p=prime[n])<=sx;n++)if(x%p==0){scale=os;return 0}
 } else {
   ji=7;p=7;n=2*A-1
   while(p<=sx){
    if(x%p==0){scale=os;return 0}
    if(ji++==8)ji=0;p+=j[ji];
    if(p>n)while(p%7==0||p%B==0||p%D==0||p%n==0){if(ji++==8)ji=0;p+=j[ji]}
   }
 }
 scale=os;return(1)
}

## Primality / Power-freedom tests

define is_prime(x) {
  # It is estimated that all numbers will not be misidentified
  # using the tests below, but it may take time
  if(!is_small_division_pseudoprime(x))return 0
  if(x<A^A)return 1 # pairs with the A^5 in is_s.d.pp()
  if(x<A^(1+3*A*A)||rabin_miller_maxtests_){
    # 300 digits or less; faster doing RM, then PP
    # and shortcut RM may miss something (hard to prove)
    if(!is_rabin_miller_pseudoprime(x))return 0
    if(!is_perrin_pseudoprime(x))return 0
  } else {
    # Tests reversed because full RM is slower than PP
    if(!is_perrin_pseudoprime(x))return 0
    if(!is_rabin_miller_pseudoprime(x))return 0
  }
  return 1
}

# Determine whether x is prime through trial division only
define is_prime_td(x) {
 auto os,j[],ji,sx,p,n;#oldscale,jump,jump-index,sqrtx,prime,nth
 if(x<2)return 0;
 os=scale;scale=0
 if(x!=x/1){scale=os;return 0}
 j[0]=4;j[1]=2;j[2]=4;j[3]=2;j[4]=4;j[5]=6;j[6]=2;j[7]=6
 for(p=2;p<7;p+=p-1){if(x==p){scale=os;return 1};if(x%p==0){scale=os;return 0}}
 if(!is_perrin_pseudoprime(x)){scale=os;return 0}#cheat a bit
 sx=sqrt(x);p=7;ji=7
 if(prime[max_array_])for(n=4;n<=max_array_&&(p=prime[n])<=sx;n++)if(x%p==0){scale=os;return 0}
 if(p>7)ji=4#assume p is now prime[max_array_]
 n=2*A-1
 while(p<=sx){
  if(x%p==0){scale=os;return 0}
  if(ji++==8)ji=0;p+=j[ji];
  if(p>n)while(p%7==0||p%B==0||p%D==0||p%n==0){if(ji++==8)ji=0;p+=j[ji]}
 }
 scale=os;return(1)
}

### Storage and output of prime factorisations ###

# Output the given array interpreted as prime factors and powers thereof
# . this function plus fac_store() make for a "delayed" equivalent
# . to the fac_print() function
define printfactorpow(fp[]) {
 auto i,c;
 for(i=0;fp[i];i+=2){
  print fp[i]
  if((c=fp[i+1])>1)print "^",c
  if(fp[i+2])print " * "
 }
 print"\n"
 return (fp[1]==1&&fp[2]==0) # fp[] is prime?
}

# Workhorse function for the below
# . for retaining a copy of the last calculated factorisation
# . in factorpow global array to save time if further functions
# . are to be called on same number
define factorpow_set_(fp[]) {
  auto i;
  for(i=0;fp[i];i++)factorpow[i]=fp[i]
  return factorpow[i]=factorpow[i+1]=0;
}

# Workhorse function for the below
# . appends newly found factor and power thereof to the provided array
# . outputs that information if the print_ flag is set
define fac_store_(*fp[],m,p,c,print_) {
  auto z;
  if(!m%2).=m++ # m should be position of last element and thus odd
  # even elements are prime factor, odd elements are how many.
  # 9 -> {3,2} -> 3^2 , 60 -> {2,2,3,1,5,1} -> 2^2*3^1*5^1
  # negative c means we know this is the end and we can write two zeroes
  z=0;if(c<0){z=1;c=-c}
  fp[++m]=p;fp[++m]=c
  if(print_){
    print p;if(c>1)print "^",c
    if(!z){print " * "}else{print "\n"};
  }
  if(z){fp[++m]=0;fp[++m]=0}
  return m
}

# Workhorse function for the below
# . performs action that otherwise occurs three times
# . relies on inherited scope (POSIX style)
# . returns 0 if parent should also return
define fac_sp_innerloop_() {
  for(c=0;x%p==0;c++)x/=p
  if(c){
    if(x==1)c=-c
    m=fac_store_(fp[],m,p,c,print_);
    if(x==1)return factorpow_set_(fp[]); # = 0
    if(is_prime(x)){
      m=fac_store_(fp[],m,x,-1,print_);
      return factorpow_set_(fp[]); # = 0
    }
  }
  return 1;
}

# Workhorse function for the below
# . factorises x through trial division, using the above functions
# . for output, storage, retention, etc.
define fac_sp_(*fp[],x,print_) {
 auto os,j[],ji,sx,p,c,n,m,f;#oldscale,jump,jump-index,sqrtx,prime,count,nth,mth
 os=scale;scale=0;x/=1
 # Check to see if last calculation was the same as this one - save work
 f=1;for(m=0;p=factorpow[m]&&f<=x;m+=2)f*=(fp[m]=p)^(fp[m+1]=factorpow[m+1]);
 if(f==x){
   if(print_).=printfactorpow(fp[]);
   scale=os;return fp[m]=fp[m+1]=0;
 }
 # Main algorithm
 m=-1
 if(x<0){m=fac_store_(fp[],m,-1,1,print_);x=-x}
 if(x<=1||is_prime(x)){m=fac_store_(fp[],m,x,-1,print_);scale=os;return (x>1)}
 j[0]=4;j[1]=2;j[2]=4;j[3]=2;j[4]=4;j[5]=6;j[6]=2;j[7]=6
 for(p=2;p<7;p+=p-1)if(!fac_sp_innerloop_()){scale=os;return 0} #1
 sx=sqrt(x);p=7;ji=7
 if(prime[max_array_])for(n=4;n<=max_array_&&(p=prime[n])<=sx;n++){
   if(!fac_sp_innerloop_()){scale=os;return 0} #2
 }
 if(p>7)ji=4#assume p is now prime[max_array_]
 n=2*A-1;sx=sqrt(x)
 while(p<=sx){
   if(!fac_sp_innerloop_()){scale=os;return 0} #3
   if(c)sx=sqrt(x)
   if(ji++==8)ji=0;p+=j[ji];
   if(p>n)while(p%7==0||p%B==0||p%D==0||p%n==0){if(ji++==8)ji=0;p+=j[ji]}
 }
 if(x>1)m=fac_store_(fp[],m,x,-1,print_)
 scale=os;return factorpow_set_(fp[]);
}

# Output the prime factors of x with powers thereof
# . displays factors as they are found which allows more chance
# . of some factors being output before becoming bogged down
# . finding larger factors by trial division
define fac_print(      x) { auto a[];x=fac_sp_( a[],x,1);return ( a[1]==1&& a[2]==0); }

# Store the prime factors of x into the given array
define fac_store(*fp[],x) {          x=fac_sp_(fp[],x,0);return (fp[1]==1&&fp[2]==0); }

# Can be slow in the case of a large spf
define smallest_prime_factor(x) {
 auto os,j[],ji,sx,p,n;#oldscale,jump,jump-index,sqrtx,prime,nth
 os=scale;scale=0;x/=1
 if(x<0){scale=os;return -1}
 if(x<=1||is_prime(x)){scale=os;return x}
 j[0]=4;j[1]=2;j[2]=4;j[3]=2;j[4]=4;j[5]=6;j[6]=2;j[7]=6
 for(p=2;p<7;p+=p-1)if(x%p==0){scale=os;return p}
 sx=sqrt(x);p=7;ji=7
 if(prime[max_array_])for(n=4;n<=max_array_&&(p=prime[n])<=sx;n++)if(x%p==0){scale=os;return p}
 if(p>7)ji=4#assume p is now prime[max_array_]
 n=2*A-1;sx=sqrt(x)
 while(p<=sx){
  if(x%p==0){scale=os;return p}
  if(ji++==8)ji=0;p+=j[ji];
  if(p>n)while(p%7==0||p%B==0||p%D==0||p%n==0){if(ji++==8)ji=0;p+=j[ji]}
 }
 scale=os;return(-sx) #if we ever get here, something is wrong!
}

# Non trivial = slow
define largest_prime_factor(x) {
 auto i,fp[];
 .=fac_store(fp[],x);
 for(i=0;fp[i];i+=2){}
 return fp[i-2];
}

# Largest prime power within x
# e.g. 992 = 2^5*31 and 2^5>31 so 992 -> 2^5 = 32
define largest_prime_power(x) {
 auto i,fp[],p,max;
 .=fac_store(fp[],x);
 for(i=0;(p=fp[i]^fp[i+1])!=1;i+=2)if(max<p)max=p
 return max;
}

# Return powerfree kernel of x (largest powerfree number which divides x)
define rad(x) {
 auto i,r,f,fp[];
 .=fac_store(fp[],x)
 r=1;for(i=0;f=fp[i];i+=2)r*=f
 return r;
}

# Return square part of x
define squarepart(x) {
 auto os,i,r,f,p,fp[];
 .=fac_store(fp[],x)
 os=scale;scale=0
  r=1;for(i=0;f=fp[i];i+=2){p=fp[i+1];p-=p%2;r*=f^p}
 scale=os;return r
}

# Return squarefree core of x
define core(x) {
 auto os;
 os=scale;scale=0
 x/=squarepart(x)
 scale=os;return x
}

# Count the number of (non-unique) prime factors of x
# e.g. 60 = 2^2*3^1*5^1 -> 2 + 1 + 1 = 4
define count_factors(x) {
  auto i,c,fp[];
  if(x<0)return count_factors(-x)+1;
  if(x==0||x==1)return 0;
  .=fac_store(fp[],x)
  for(i=0;fp[i];i+=2)c+=fp[i+1]
  return c;
}

# Count the number of unique prime factors of x
# e.g. 84 = [2]^2*[3]^1*[7]^1 -> #{2,3,7} = 3
define count_unique_factors(x) {
  auto i,d,fp[];
  if(x<0)return count_unique_factors(-x)+1;
  if(x==0||x==1)return 0;
  .=fac_store(fp[],x);
  for(i=0;fp[i];i+=2).=d++
  return d;
}

# Determine whether x is y-th power-free
#  e.g. has_freedom(51, 2) will return whether 51 is square-free
#  + sign of result indicates (-1)^[number of prime factors]
#    making has_freedom(x,2) equivalent to the mobius function
#  Special case: has_freedom(x, 1) returns whether x is prime
#  Pseudo-boolean, since always returns 0 for false, but not always 1 for true
define has_freedom(x,y) {
 auto os,i,p,m,fp[];
 os=scale;scale=0;if(x<0)x=-x
 if(x!=x/1){scale=os;return 0}
 if(x==1){scale=os;return 1}
 if(y==1){scale=os;return is_prime(x)}
 if(x>A^A)if(is_prime(x)){scale=os;return -1}
 if(x<0||y<1){scale=os;return 0}
 .=fac_store(fp[],x);
 m=1
 for(i=1;p=fp[i];i+=2){
  if(p>=y){scale=os;return 0}
  m*=p*(-1)^p
 }
 return m
}

# Returns 0 if non-squarefree,
# 1 for 1 and all products of an even number of unique primes
# -1 otherwise
define mobius(x) {
  return has_freedom(x,2);
}

# Determine the so-called arithmetic derivative of x
define arithmetic_derivative(x) {
  auto os,i,f,n,d,fp[];
  if(x<0)return -arithmetic_derivative(-x)
  os=scale;scale=0;x/=1
  if(x==0||x==1){scale=os;return 0}
  .=fac_store(fp[],x);n=0;d=1
  for(i=0;f=fp[i];i+=2){n=n*f+d*fp[i+1];d*=f}
  n=(n*x)/d
  scale=os;return n
}

### Storage and output of divisors of a number ###

# Count the number of divisors of x (prime or otherwise)
define count_divisors(x) {
  auto i,c,p,fp[];
  i=scale;x/=1;scale=i
  if(x<0)return 2*count_divisors(-x);
  if(x==0||x==1)return x
  .=fac_store(fp[],x);
  c=1;for(i=1;p=fp[i];i+=2)c*=1+p
  return c
}

# Workhorse function for the below
define divisors_sp_(*divs[],x,print_) {
  # opts: 1 -> print; 0 -> store
  auto os,s,sx,c,ch,p,m,i,j,k,f,fp[]
  os=scale;scale=0;x/=1
  if(x==0||x==1){
    if(!print_){divs[0]=x;divs[1]=0}
    scale=os;return x;
  }
  .=fac_store(fp[],x)
  c=1;for(i=1;(p=++fp[i++])>1;i++)c*=p
  .=c--
  s=1;if(x<0){s=-1;x=-x}
  if(print_){
    print s,", "
  } else {
    divs[0]=s
    sx=0
    ch=(c+1)/2
    if(c>max_array_){
      sx=sqrt(x)
      print "too many divisors to store. storing as many as possible\n"
    }
  }
  for(i=1;i<c;i++){
    j=i;k=0;f=1
    while(j){if(m=j%(p=fp[k+1]))f*=fp[k]^m;j/=p;k+=2} # generate a divisor
    if(print_){print s*f,", "}else{
      if(sx){
        # Only applies if all divisors won't fit in the array
        # Divisors <= sqrt(x) can be used to reconstruct missing divisors
        #   These can be overlooked due to the way the generator works
        k=f;if(k<0)k=-k
        if(k>sx){
          # Store divisors > sqrt(x) in any space that would otherwise have
          #   been available at the high end of the array or else skip them
          if(ch<max_array_)divs[ch++]=s*f
          continue;
        }
      }
      if(i<=max_array_){divs[i]=s*f}else{break}
    }
  }
  if(print_){x}else{if(c>max_array_-1)c=max_array_-1;divs[c]=s*x;divs[c+1]=0}
  scale=os;return s*x
}

# Displays all divisors of x in a logical but unsorted order
define divisors_print(     x) { auto d[]; return divisors_sp_(d[],x,1); }

# Store calculated divisors in given array in same logical but unsorted order
# . see array.bc for sorting and rudimentary printing of arrays
define divisors_store(*d[],x) {           return divisors_sp_(d[],x,0); }

# Returns the sum of the divisors of x where each divisor is raised
# to the power n; e.g. sigma(2,6) -> 1^2 + 2^2 + 3^2 + 6^2 = 50
# . only supports integer n at present
define sigma(n,x) {
  auto os,i,u,d,f,fp[];
  os=scale;scale=0;x/=1;n/=1
  if(n==0){scale=os;return count_divisors(x)}
  if(n<0){scale=os;n=-n;return sigma(n,x)/x^n}
  .=fac_store(fp[],x);u=d=1
  if(x<0)x=0
  if(x==0||x==1)return x;
  for(i=0;fp[i];i+=2){u*=(f=fp[i]^n)^(fp[i+1]+1)-1;d*=f-1}
  u/=d;scale=os;return u;
}

# Returns the sum of the divisors of x, inclusive of x
# e.g. primes -> prime + 1, 2^x -> 2^(x+1)-1, 6 -> 12
define sum_of_divisors(x) {
  return sigma(1,x);
}

# Computes the Euler totient function for x
define totient(x) {
  auto i,t,f,fp[];
  .=fac_store(fp[],x);t=1
  if(x==0||x==1)return x;
  for(i=0;fp[i];i+=2)t*=((f=fp[i])-1)*f^(fp[i+1]-1)
  return t
}

# Number of iterations of the totient function to reach 1
define totient_itercount(x) {auto t;while(x>1){x=totient(x);.=t++};return t}

# Sum of iterating the totient function down to 1
define totient_itersum(x) {auto t;while(x>1){x=totient(x);t+=x};return t}

# Returns if x is a perfect totient number
define is_perfect_totient(x) { return totient_itersum(x)==x }

# Computes Ramanujan's c_q function for n (given q)
define ramanujan_c(q,n) {
  auto i,c,d,t,f,p,fp[];
  if(q<0||n<0)return 0;
  if(q==1)return 1;
  if(n==1)return mobius(q);
  if(n==q)return totient(q);
  n=q/int_gcd(q,n)
  if(n==1)return totient(q);
  .=fac_store(fp[],n);t=1;c=d=0;
  for(i=0;fp[i];i+=2){
    t*=((f=fp[i])-1)*f^((p=fp[i+1])-1)
    c+=p;.=d++
  }
  if(c!=d){c=0}else{c=(-1)^d}
  return c*totient(q)/t # mobius(n')*totient(q)/totient(n')
}

# Determines whether a number is a practical number
define is_practical(x) {
  auto os,i,ni,s,p,f,nf,fp[];
  if(x<0)return 0;
  if(x==1)return 1;
  os=scale;scale=0;i=x%2;scale=os
  if(i)return 0
  .=fac_store(fp[],x)
  if(!fp[2])return 1;# x is power of 2
  scale=0
  #fp[0] has to be 2 here, so replace with 2
  s=2^(fp[1]+1)-1
  f=fp[i=2]
  while(1){
    if(f>1+s){scale=os;return 0}
    if((nf=fp[ni=i+2])==0){scale=os;return 1}
    s*=(f^(fp[i+1]+1)-1)/(f-1)
    f=nf;i=ni
  }
  return -1;# should never get here
}

### Exploring prime neighbourhood ###

# Finds and returns the nearest prime > x
define nextprime(x) {
  auto os,ox;
  if(x<0)return -prevprime(-x)
  if(x<2)return 2
  if(x<3)return 3
  os=scale;scale=0
   ox=x
   if(x/1<x).=x++ #ceiling
   x/=1            #truncate
   x+=1-x%2        #make odd
   if(x==ox)x+=2   #same as we started with
   #while(!is_prime(x))x+=2 # could use jumps here, but is not much faster
   while(1){
     while(!is_small_division_pseudoprime(x))x+=2
     if(x<A^A)break; # pairs with the A^5 in is_s.d.pp()
     if(is_rabin_miller_pseudoprime(x)){
       if(rabin_miller_maxtests_){
         if(is_perrin_pseudoprime(x))break;
       } else {
         break;
       }
     }
     x+=2
   }
  scale=os;return x
}

# nearest prime >= x
define nextprime_ifnotprime(x) {
  if(is_prime(x))return x;
  return nextprime(x)
}

# Finds and returns the nearest prime < x
define prevprime(x) {
  auto os,ox;
  if(x<0)return -nextprime(-x)
  if(x<=2)return -2
  if(x<=3)return 2
  if(x<=5)return 3
  os=scale;scale=0
   ox=x;x/=1
   if(x%2){if(x==ox)x-=2}else{.=x--}
   #while(!is_prime(x)&&x>0)x-=2
   while(x>0){
     while(!is_small_division_pseudoprime(x))x-=2
     if(x<A^A)break; # pairs with the A^5 in is_s.d.pp()
     if(is_rabin_miller_pseudoprime(x)){
       if(rabin_miller_maxtests_){
         if(is_perrin_pseudoprime(x))break;
       } else {
         break;
       }
     }
     x-=2
   }
   if(x<2)return x=-2
  scale=os;return x
}

# nearest prime <= x
define prevprime_ifnotprime(x) {
  if(is_prime(x))return x;
  return prevprime(x)
}

# Find the nearest prime to x (inclusive)
define nearestprime(x) {
  auto np,pp;
  if(is_prime(x))return x;
  np=nextprime(x)
  pp=prevprime(x)
  if(np-x>x-pp)return pp
  return np
}

### Relatives of the Primorial / Factorial

# Primorial  
define primorial(n) {
  auto i,pm,p;
  pm=1;p=2
  if(prime[max_array_])for(i=1;i<=max_array_&&p=prime[i]<=n;i++)pm*=p
  for(p=p;p<=n;p=nextprime(p))pm*=p
  return pm
}

# Subprimorial
define subprimorial(n) {
  auto i,pm,p;
  pm=1;p=2
  if(prime[max_array_])for(i=2;i<=max_array_&&(p=prime[i])<=n;i++)pm*=p-1
  for(p=p;p<=n;p=nextprime(p))pm*=p-1
  return pm
}
EOF
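
Before wiring the library into a benchmark, a quick sanity check (bc reads the listed file before standard input):

echo "is_prime(101)" | bc -l primes.bc

This should print 1.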

Step 3 - Create A Driver

We'll call the program prime_test.bc

cat > prime_test.bc << "EOF"
p=3
for(i=0;i<50000;i++) {
   p=nextprime(p)
   #print p; print "\n"
}
EOF


Note I have elected not to print the primes because I want to test bc's calculation speed, not the shell's ability to render text on the console. If you want to see the output, remove the pound sign (#) from the commented print line.

Step 4 - Create A Mandelbrot Calculator

Create a bc program to calculate the Mandelbrot set

cat > mandelbrot.bc << "EOF"
scale=10
i=-1
for(y=0;y<25;y++) {
   r=1
   l=0
   for(x=0;x<68;x++) {
      q=r
      w=i
      for(n=1;n<21;n++) {
         a=q*q
         b=w*w
         if(a+b>4) {
            break
         }
         w=2*q*w+i
         q=a-b+r
      }
      scale=0
      n%=10
      scale=10
      l+=n*10^x
      r-=3/68
      }
   # print l
   i+=2/24
}


#print "\n"
EOF



Again I have removed the print statements by putting a pound sign in front of them to make them into comments. To see the output remove these, but note that output speed is not part of the benchmark test.

Step 5 - Create Launcher Script

Make a script to launch all steps of the benchmark

cat > bc_test_suite.sh << "EOF"
cat primes.bc prime_test.bc | bc
echo "scale=4000; 4*a(1)" | bc -l
for i in {1..1000}; do cat mandelbrot.bc|bc ; done
EOF



Step 6 - Build bc With Level 0 Optimization

We will modify the config file in the bc package directory to compile with the various optimization levels. First is level 0.

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -O0
EOF


Now rebuild with paccom

paccom -R base -p bc -iv

Then run the benchmark

time bash bc_test_suite.sh

which will print out the execution time (after printing pi), such as

real 2m07.970s
user 2m05.616s
sys 0m1.089s


The top number is the one we will use for the benchmark since it is the real (wall-clock) time of the program execution. I will run this three times to get three scores that are averaged into a composite.
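
A minimal sketch for collecting the three runs, assuming /usr/bin/time is GNU time (which supports the -f, -o, and -a options):

rm -f runs.txt
for run in 1 2 3; do
    /usr/bin/time -f "%e" -a -o runs.txt bash bc_test_suite.sh > /dev/null
done
awk '{ sum += $1 } END { print sum/NR, "seconds average" }' runs.txt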

This verifies the benchmark process.


Step 7 - Benchmark BC

Now we do the same thing for all optimization levels


For level 1

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -O1
EOF
paccom -R base -p bc -iv
time bash bc_test_suite.sh



For level 2

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -O2
EOF
paccom -R base -p bc -iv
time bash bc_test_suite.sh



For level 3

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -O3
EOF
paccom -R base -p bc -iv
time bash bc_test_suite.sh



For level fast

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -Ofast
EOF
paccom -R base -p bc -iv
time bash bc_test_suite.sh



For optimized for space

cat > /usr/src/paccom/realms/base/bc/config << "EOF"
PREFIX=/usr CC=gcc CFLAGS="$CFLAGS -std=c99 " ./configure.sh -G -Os
EOF
paccom -R base -p bc -iv
time bash bc_test_suite.sh



On my system I generated the following results. The columns are optimization level, execution time, and code size.

-O0    167 seconds 158,216 Bytes
-O1     61 seconds 121,238 Bytes
-O2     53 seconds 133,616 Bytes
-O3     49 seconds 162,288 Bytes
-Ofast  51 seconds 162,288 Bytes
-Os    140 seconds 109,008 Bytes


This clearly shows that level 3 optimization is fastest, even though smaller code might make better use of the various levels of CPU cache. The code is much larger than at level 2 because level 3 performs more function inlining: small functions that are called often are converted from function calls into code placed directly at the call site. This speeds execution but increases code size. Despite the code being larger and less likely to fit fully in the L1 cache, the reality is that CPU caches are now much larger, and the pieces of code a program actually exercises stay in the caches unless the system is running many different programs simultaneously. Another factor is that programs use operating system functions which are shared by all programs, and those core routines tend to stay in the cache. So even though smaller is better on some architectures, on systems with large CPU caches larger code is still faster.

But we have several choices we can use to both reduce code size and further increase execution speed.


Step 8 - Prepare to Build yasm

We will perform some more tests with ffmpeg, a command line video editing and rendering program. It can link to a lot of external libraries, which we can install later, but it also has a good core of libraries for audio and video codecs, as well as stream containers. We'll go through the same process we did with bc.

I chose ffmpeg as a second optimization test because it is known to be very fast, and that speed comes from hand-tuned assembly. The development team outputs the assembly for the pieces of the code where speed matters most and hand tunes it by examining the code and making changes to speed it up. Even with today's best optimizers it is still possible to beat the compiler using the old fashioned human brain (at least for a few more years). The code is a combination of C code and assembly code. The team chose the yasm assembler over the nasm assembler because they wanted support for Microsoft Visual Studio; most projects use nasm as the external (non-GCC) assembler because they use a different development environment. Second, as we'll see later, mixing assembly with C can have a few drawbacks.

We will put yasm in a realm called dev-tools.

Make the new realm directories

mkdir -p /usr/src/paccom/realms/dev-tools/{sources,yasm}

Make the manifest

echo 'yasm~yasm-1.3.0.tar.gz' > /usr/src/paccom/realms/dev-tools/manifest

Get yasm source code

wget http://www.tortall.net/projects/yasm/releases/yasm-1.3.0.tar.gz -P /usr/src/paccom/realms/dev-tools/sources/

No other files are necessary since yasm uses all the default build steps.

Step 9 - Change Directory Structure

Because this is considered an experiment, and experiments are somewhat special to The Toucan Linux Project, we will set up the directory structure for /usr/local a bit differently. Recall we have a primary and a secondary local partition. Because we may not want to keep our test code, we want to switch to the secondary partition. For now, we'll use a simplified version of our two partitions and do the work to set up the proper way to run TTLP experiments later.

Create two mount points

mkdir -p /mnt/{local1,local2}

Local 1 is considered the primary local, which is non-experimental. Local 2 is used for our experiments. We need to set up fstab to mount these properly. Currently the primary partition is mounted at /usr/local, but we want it mounted at /mnt/local1 and the secondary mounted at /mnt/local2. We need to make fstab reflect these changes. I suggest you use a text editor (either vi or joe/jmacs/jstar). Change the mount point of the /usr/local partition (the primary) to /mnt/local1 and add an entry to mount the secondary at /mnt/local2.

/dev/sda5 /mnt/local1 xfs noatime 0 2
/dev/sda6 /mnt/local2 xfs noatime 0 2


/dev/sdX5 is the primary in this case, and /dev/sdX6 is the secondary.

Now umount the current /usr/local/ and mount the primary and secondary.

cd / && umount /usr/local && mount -a

Verify they are mounted

# mount
/dev/sda5 on /mnt/local1 type xfs (rw,relatime,attr2,inode64,noquota)
/dev/sda6 on /mnt/local2 type xfs (rw,relatime,attr2,inode64,noquota)


Then remove the old mount point and make a symbolic link instead. This time we will use the secondary for the link, not the primary.

cd usr
rmdir local
ln -s ../mnt/local2 local



Notice that a relative path is used for the link instead of an absolute path (starting with the root /). This ensures the link will be correct even if you mount the filesystem from a host as we did before. This is the configuration we will use in the future when we fully support experiments, though we will do so in a much better way.
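
You can verify the link with

ls -ld /usr/local

which should end with local -> ../mnt/local2.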

Now when we build and install yasm and ffmpeg they will be installed in the secondary filesystem, so they won't interfere with our "production" /usr/local area. This will allow us to perform various trials with a program to determine the best way to build it for the system without worrying that it will clobber something we want.

Step 10 - Build yasm

Since we have paccom handy and configured we should use it.

paccom -R dev-tools -p yasm -iv

It should build properly and install in the secondary local area, as it uses /usr/local as the prefix by default. But perhaps not all packages will do this, so let's create a realm config to set the prefix explicitly.

echo "../configure --prefix=/usr/local" > /usr/src/paccom/realms/dev-tools/config

Step 11 - Build ffmpeg

Because the developers want ffmpeg to be as fast as possible, they have already set the optimization level to 3, so the default configure tool will always use it. There is a way to specify additional compiler flags (--extra-cflags), but they are placed before the built-in optimization setting, and GCC uses the last optimization setting it sees on the command line. So first we need to remove all references to "-O3" in the configure program. We can do this in the config file in the package directory, but we only want to make the change if we are doing an experimental build. We can check for the presence of an environment variable named EXPERIMENTAL.
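
If you want to convince yourself that GCC really honors only the last -O flag, compare its report of enabled optimizers for the two orderings:

gcc -Q --help=optimizers -O0 -O3 > last_O3.txt
gcc -Q --help=optimizers -O3 -O0 > last_O0.txt
diff last_O3.txt last_O0.txt

The flags that differ are exactly those the final -O on each line turned on or off.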

We will put video tools in a realm called video. First make the realm and package directories.

mkdir -p /usr/src/paccom/realms/video/{sources,ffmpeg}

Now make the manifest file

echo "ffmpeg~ffmpeg-4.2.1.tar.xz" > /usr/src/paccom/realms/video/manifest

Fetch the ffmpeg source

wget http://ffmpeg.org/releases/ffmpeg-4.2.1.tar.xz -P /usr/src/paccom/realms/video/sources

Create the config file

cat > /usr/src/paccom/realms/video/ffmpeg/config << "EOF"
if [ -n "$EXPERIMENTAL" ]; then
sed -i "s/-O3//g" configure
fi


./configure --extra-cflags="-O0" \
--disable-static --enable-shared
EOF


This will remove the -O3 option from the configure file only when the EXPERIMENTAL variable is set (to any value). For the first run we will use no optimization so we can begin benchmarks. Building ffmpeg requires a lot of space, so mark it as needing the large build area:

touch /usr/src/paccom/realms/video/ffmpeg/large

Now build it with paccom

rm /var/paccom/logs/*
EXPERIMENTAL=1 paccom -R video -p ffmpeg -iv

Check the logs to make sure everything worked. Building ffmpeg can take some time and, unfortunately, angband will not be available since we are in experimental mode. In the future we will address a better way to go experimental without losing angband and other production code.
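
A quick way to scan the logs for failures (assuming they are plain text):

grep -il error /var/paccom/logs/*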

Step 12 - Benchmark ffmpeg

Download any video in MKV format. You can grab one with

cd
wget https://sample-videos.com/video123/mkv/720/big_buck_bunny_720p_30mb.mkv

This is not the video I used but it will suffice.

To test that ffmpeg works, we can change the container from MKV to MP4 using

ffmpeg -i video.mkv video.mp4

If that completes, we are ready to benchmark the -O0 optimization level. We will use an MP4 container with the mpeg4 video codec and the mp2 audio codec. This will force ffmpeg to recode both video and audio for the new file.

rm video.mp4 ; time ffmpeg -i video.mkv -c:v mpeg4 -c:a mp2 video.mp4

Again, the time program will print out the times involved. I will run each test three times and take the average. This will be performed for all optimization levels from 0 to fast. After testing one level, modify the /usr/src/paccom/realms/video/ffmpeg/config file and change -O0 to -O1, -O2, -O3, -Ofast, or -Os, rebuild with paccom, then run the test again.
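
One way to automate the sweep is sketched below. It assumes the config keeps the exact --extra-cflags="-O0" form shown above, so the sed expression rewrites only that flag and leaves the sed line inside the config alone:

for opt in O1 O2 O3 Ofast Os; do
    # rewrite only the --extra-cflags value in the package config
    sed -i "s/--extra-cflags=\"-O[^\"]*\"/--extra-cflags=\"-${opt}\"/" \
        /usr/src/paccom/realms/video/ffmpeg/config
    EXPERIMENTAL=1 paccom -R video -p ffmpeg -iv   # rebuild and reinstall
    rm -f video.mp4
    time ffmpeg -i video.mkv -c:v mpeg4 -c:a mp2 video.mp4
done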

For my system I received the following results. The columns are optimization level, execution time, and file size.

-O0    51s 28,249,616 Bytes
-O1    24s 16,848,064 Bytes
-O2    23s 16,676,032 Bytes
-O3    22s 19,039,424 Bytes
-Ofast 21s 19,027,072 Bytes
-Os    49s 14,013,664 Bytes

In this case -Ofast yields both a smaller file and faster execution than -O3, but whether it is completely safe will take some testing.

Next chapter we will continue to look at compiler optimization, including reducing file size, optimizing at link time, and, for developers, how to set optimization levels for individual functions within code. Once we have our settings we will configure paccom to use them to recompile all programs in the base realm. We will also add ffmpeg and yasm to the primary local using the complete settings.

Copyright (C) 2019 Michael R Stute