Basics of GCC compilation process

By | 02/09/2012

In one of the earlier articles about the Hello World C program on Linux, we learned how to write, compile and execute a basic C program on Linux in 3 easy steps. The second of those steps was to compile the source code to produce an executable file. In this article, we will go deeper into that step and explore what happens behind the scenes when GCC turns source code into an executable.

The source code

Let's consider the following C code:

#include<stdio.h>

#define STRING "HelloWorld"

int main(void)
{
    char *ptr = NULL;
    ptr = STRING;

    // Print the HelloWorld string
    printf("\n [%s] \n",ptr);
    return 0;
}

In the above code, we have defined a macro STRING for the “HelloWorld” string. Inside the main() function, the pointer ‘ptr’ is made to point to this string, and finally printf() prints it. Let's name the above file helloworld.c

The compilation process

Put the helloworld.c file in an empty directory. Now compile it using gcc, but with the extra flag -save-temps.

$ ls
helloworld.c

$ gcc -Wall -save-temps helloworld.c -o helloworld

$ ls
helloworld  helloworld.c  helloworld.i helloworld.o helloworld.s

$

First we checked that the directory contained only the helloworld.c file. Next we ran the gcc command with the extra flag -save-temps. When we checked the contents of the directory again, we found that the executable ‘helloworld’ had been produced, along with three extra files with the .i, .s and .o extensions. Let's first understand what the flag -save-temps does.

From the man page of gcc :

-save-temps Store the usual “temporary” intermediate files permanently; place them in the current directory and name them based on the source file. Thus, compiling foo.c with -save-temps would produce files foo.i and foo.s, as well as foo.o. This creates a preprocessed foo.i output file even though the compiler now normally uses an integrated preprocessor.

So the above explanation makes it clear that the -save-temps flag makes gcc keep the temporary intermediate files that usually get deleted at the end of the compilation process. These temporary files matter because the gcc compilation process can be broken down into four steps. Each of the first three steps produces one of the temporary files, and the executable is generated by the last step. These steps are :

  1. Preprocessing
  2. Compilation
  3. Assembly
  4. Linking

Let's understand the basics of these steps one by one.

1. The preprocessing step

In this stage, the contents of all the header files that you have included are copied into the source code of your program. In addition, every macro is replaced by its value throughout the code, and all the comments are stripped off. The intermediate file generated by this stage is the .i file. Let's take a look at the helloworld.i file :

...
...
...
extern int fileno_unlocked (FILE *__stream) __attribute__ ((__nothrow__)) ;
# 846 "/usr/include/stdio.h" 3 4
extern FILE *popen (__const char *__command, __const char *__modes) ;

extern int pclose (FILE *__stream);

extern char *ctermid (char *__s) __attribute__ ((__nothrow__));
# 886 "/usr/include/stdio.h" 3 4
extern void flockfile (FILE *__stream) __attribute__ ((__nothrow__));

extern int ftrylockfile (FILE *__stream) __attribute__ ((__nothrow__)) ;

extern void funlockfile (FILE *__stream) __attribute__ ((__nothrow__));
# 916 "/usr/include/stdio.h" 3 4

# 2 "helloworld.c" 2

int main(void)
{
    char *ptr = ((void *)0);
    ptr = "HelloWorld";

    printf("\n [%s] \n",ptr);
    return 0;
}

Notice the following in the above output :

  • There is a lot of code before the main() function now. This is because the stdio.h file has been expanded and included here.
  • There is no trace of the macro STRING. This is because it has been removed and all its instances have been replaced by the actual string “HelloWorld”.
  • Also, the comment before the printf() function is stripped off.

2. The compilation step

In this step, the compiler translates the preprocessed source code into assembly code. Assembly code is a human-readable list of the instructions that your program will carry out. Every processor family has its own specific set of instructions that is used to make the computer perform all the actions it does. In the early days, code was written mostly in assembly language, but higher-level languages like C and COBOL were then developed because programs can be written much faster in them and are easier to understand. So whenever source code is written in a higher-level language, the compiler converts it into assembly language. The intermediate file produced at this stage is the .s file. Let's take a look at the helloworld.s file :

        .file   "helloworld.c"
        .section        .rodata
.LC0:
        .string "HelloWorld"
.LC1:
        .string "\n [%s] \n"
        .text
.globl main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    $0, -8(%rbp)
        movq    $.LC0, -8(%rbp)
        movl    $.LC1, %eax
        movq    -8(%rbp), %rdx
        movq    %rdx, %rsi
        movq    %rax, %rdi
        movl    $0, %eax
        call    printf
        movl    $0, %eax
        leave
        ret
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3"
        .section        .note.GNU-stack,"",@progbits

So we see that the helloworld.s file contains all the assembly instructions.

3. The assembly step

In this step, the assembler reads the assembly instructions and converts each of them into the corresponding machine-level code, a bit stream of 0s and 1s. This machine-level code is known as object code. It is in the form the processor executes, but it cannot be run yet because references to external symbols such as printf() are still unresolved. The object file is organised into sections (for example .text, .data and .rodata) that hold the code and data of the program. The intermediate file produced by this step is the .o file. Let's have a look at the helloworld.o file :

^?ELF^B^A^A^@^@^@^@^@^@^@^@^@^A^@>^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@X^A^@^@^@^@^@^@^@^@^@^@@^@^@^@^@^@@^@^M^@
^@UHåHì^PHÇEø^@^@^@^@HÇEø^@^@^@^@¸^@^@^@^@HUøHÖHǸ^@^@^@^@è^@^@^@^@¸^@^@^@^@ÉÃHelloWorld^@
 [%s]
^@^@GCC: (Ubuntu 4.4.3-4ubuntu5.1) 4.4.3^@^@^@^@^@^@^@^T^@^@^@^@^@^@^@^AzR^@^Ax^P^A^[^L^G^H^A^@^@^\^@^@^@^\^@^@^@^@^@^@^@8^@^@^@^@A^N^PC^B^M^F^@^@^@^@^@^@^@^@.symtab^@.strtab^@.shstrtab^@.rela.text^@.data^@.bss^@.rodata^@.comment^@.note.GNU-stack^@.rela.eh_frame^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ ^@^@^@^A^@^@^@^F^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@@^@^@^@^@^@^@^@8^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^[^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@À^E^@^@^@^@^@^@H^@^@^@^@^@^@^@^K^@^@^@^A^@^@^@^H^@^@^@^@^@^@^@^X^@^@^@^@^@^@^@&^@^@^@^A^@^@^@^C^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@x^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@,^@^@^@^H^@^@^@^C^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@x^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@1^@^@^@^A^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@x^@^@^@^@^@^@^@^T^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@9^@^@^@^A^@^@^@0^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@&^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@B^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@²^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@W^@^@^@^A^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@¸^@^@^@^@^@^@^@8^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^H^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@R^@^@^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^H^F^@^@^@^@^@^@^X^@^@^@^@^@^@^@^K^@^@^@^H^@^@^@^H^@^@^@^@^@^@^@^X^@^@^@^@^@^@^@^Q^@^@^@^C^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ð^@^@^@^@^@^@^@a^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^D^@^@^@^@^@^@^H^A^@^@^@^@^@^@^L^@^@^@    ^@^@^@^H^@^@^@^@^@^@^@^X^@^@^@^@^@^@^@  ^@^@^@^C^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ 
^E^@^@^@^@^@^@^Z^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^A^@^@^@^D^@ñÿ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^A^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^C^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^D^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^E^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^G^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^H^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^C^@^F^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^N^@^@^@^R^@^A^@^@^@^@^@^@^@^@^@8^@^@^@^@^@^@^@^S^@^@^@^P^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@helloworld.c^@main^@printf^@^@^@^@^@^@^@^T^@^@^@^@^@^@^@^K^@^@^@^E^@^@^@^@^@^@^@^@^@^@^@^Y^@^@^@^@^@^@^@
^@^@^@^E^@^@^@^K^@^@^@^@^@^@^@-^@^@^@^@^@^@^@^B^@^@^@
^@^@^@üÿÿÿÿÿÿÿ ^@^@^@^@^@^@^@^B^@^@^@^B^@^@^@^@^@^@^@^@^@^@^@

As you can see, not much can be understood from the above output because it is machine-level code. A text editor cannot display it meaningfully, so only a few strings such as ‘printf’, ‘main’ and ‘HelloWorld’ are readable.

4. The linking step

Once the object file containing the machine code has been produced in the step above, the linking step makes sure that all the undefined symbols in the code are resolved. An undefined symbol is one for which no definition is available in the object file. For example, our code contains no definition of the printf() function: the header stdio.h only declares it, while its definition lives in the C library. So in order for our program to execute correctly, the definition of this function needs to be included in, or at least linked to, our code. This is what happens in the linking stage. As another example, suppose your source code consists of more than one source file. The assembly stage then produces a separate .o file for each individual .c file, and it is the linking stage that links all these object files together. The output of this step is the final executable.

In our case it's ‘helloworld’.

So when we say compilation, we mean just the step that converts source code into assembly code, but when we say compilation process, we mean all four stages explained above.

Note : The temporary files that the -save-temps flag produces in one go can also be produced one by one, by using the gcc flags -E, -S and -c to stop after the preprocessing, compilation and assembly steps respectively.

9 thoughts on “Basics of GCC compilation process”

  1. Gene Ricky Shaw

    This was an amazing article. You explained to me in 10 minutes what I struggled with in a semester of intro to programming. Thank you.

  2. Miten Mehta

    nice short article to explain the compilation process.

  3. slekcher

    At the linking stage, you said the printf() definition will be included in our executable file, but is the definition in machine code?

      1. Roger

        Isn’t the printf definition included in stdio.h file and as you said the stdio.h file will be expanded at preprocessing so the printf definition should be available at compiling stage.
        Please correct me if I am wrong
