Compiler Core `cc1`

Going back to the clang_start

Ok, so let’s resume. We:

created the driver settings
with that, created the compilation settings
created actions for each part of the compilation
created a list of jobs from the list of actions

Now it’s time to execute the job!!!

At the beginning, we talked about an if: the first argument of the function is -cc1. That’s it—we are building, guys!!!

Main class explanation

Before we start, I think it’s best to explain all the main classes involved in the compilation: CompilerInstance, FileManager, SourceManager, PreProcessor, Lexer. This is more of a reference section to better understand the rest.

`CompilerInstance`

This is the class responsible for handling the cc1 part of Clang. It contains all the classes needed for compilation.

Main classes inside the CompilerInstance:

FileManager
SourceManager
TargetInfo
PreProcessor
ASTContext, ASTConsumer, ASTReader

Function:

bool CompilerInstance::ExecuteAction(FrontendAction &Act);

`FileManager`

The FileManager class finds/reads files on disk or VFS and caches metadata and open memory buffers. This returns a FileEntryRef.

VFS: is an LLVM abstraction that makes all file sources (disk files, in-memory buffers, archives) look like a single, uniform filesystem to the compiler.

Main members inside the FileManager:

FileSystem
FileSystemOptions
DenseMap of real files by ID
DenseMap of real directories by ID
SmallVector of virtual file entries
SmallVector of virtual directory entries
StringMap for cached file lookups
StringMap for cached directory lookups
OptionalFileEntryRef for stdin

Function:

llvm::Expected<FileEntryRef> FileManager::getFileRef(StringRef Filename,
                                                    bool OpenFile = false,
                                                    bool CacheFailure = true,
                                                    bool IsText = true);

`SourceManager`

This class takes raw buffers from FileManager (FileEntryRef), assigns them simple integer FileIDs, and builds the data structures needed for the rest of the compiler to ask, “What line/column is offset N in FileID X?”

Main members inside the SourceManager:

SourceMgr – contains a list of MemoryBuffer objects (one per FileID)
FileIDMap – mapping from buffer index to FileEntryRef
Line/column mapping tables for each buffer (e.g., LineOffsets)
Macro expansion and include-location stacks
FileID counter to assign unique IDs for each buffer

Functions:

/// Create a new FileID for the given FileEntryRef and return it.
FileID SourceManager::createFileID(FileEntryRef FE,
                                   SourceLocation Loc,
                                   SrcMgr::CharacteristicKind K);
 
/// Retrieve character data for a given FileID and offset.
const char *SourceManager::getCharacterData(FileID FID, unsigned &Offset);

`Preprocessor`

This class handles all pre-parsing work—macro definitions, #include directives, conditional compilation—and produces a stream of tokens for the parser.

Main members inside the Preprocessor:

HeaderSearch – manages search paths and maps headers to FileEntryRef
PreprocessorOptions – stores flags like -D, -I, macro expansion settings
IdentifierTable – uniquing table for identifiers and keywords
Builtin::Context – definitions for built-in macros and keywords
MacroInfoMap – maps macro names to their definitions (MacroInfo)

Functions:

/// Get the next token, expanding macros and handling directives.
Token Preprocessor::Lex(Token &Tok);
 
/// Push a new file onto the include stack, creating a fresh Lexer.
void Preprocessor::EnterSourceFile(FileID FID, bool IsMacroFile,
                                   llvm::MemoryBuffer *Buffer);

`Lexer`

This class reads characters from a source buffer (via SourceManager) and produces Token objects for the parser.

Main members inside the Lexer:

FileID – identifies the buffer being lexed
SourceManager &SM – provides access to buffer contents and locations
LangOptions &LangOpts – controls language-specific lexing behavior
const char *BufferStart/Ptr/End – pointers to track current position in the memory buffer
Token CurToken – storage for the current token
unsigned CurPPEnd – offset for end-of-macro/file switching

Functions:

/// Lex the next token from the buffer (skips whitespace/comments).
Token Lexer::Lex(Token &Result);
 
/// Initialize the lexer for a new buffer.
void Lexer::Initialize(FileID FID,
                       const LangOptions &LangOpts,
                       SourceManager &SM,
                       bool IsAtStartOfFile);

back to code

cc1_main
 ├─ split & claim cc1-only flags (plugins, -mllvm, etc.)
 ├─ initialize cc1 context
 │    ├─ create DiagnosticIDs & DiagnosticsEngine + consumers
 │    ├─ register PCH formats & built-in targets
 │    └─ parse frontend/backend flags (–disable-free, –disable-llvm-verifier, –main-file-name, etc.)
 ├─ parse remaining argv into CompilerInvocation
 ├─ instantiate CompilerInstance CI
 │    ├─ attach DiagnosticsEngine
 │    ├─ set up FileManager & SourceManager
 │    ├─ set up TargetInfo/TargetMachine
 │    ├─ initialize Preprocessor & HeaderSearch
 │    └─ create ASTContext & Sema
 ├─ configure timers, timeTraceProfile & stats if requested
 ├─ status = ExecuteCompilerInvocation(CI)
 │    ├─ handle –help / –version
 │    ├─ load plugins & handle –mllvm
 │    ├─ configure sanitizers & static-analysis hooks
 │    ├─ select & construct FrontendAction Act
 │    ├─ CI.BeginSourceFile(Act, inputs)
 │    ├─ CI.ExecuteAction(Act) ← Parser→Sema→CodeGen or AST walk
 │    ├─ CI.EndSourceFile()
 │    └─ return Act success/failure
 └─ return status

ExecuteCC1Tool

tokenize the cmd line
redirect on the right cc1

cc1_main

init a CompilerInstance() and a DiagnosticIDs() instance
init PCH format (precompiled header)
init all target base functions
diag setup
fill the CompilerInvocation of the CompilerInstance that contains all compiler settings and does some checks on the args list
timeTraceProfile init if needed
print clang cpu/stats
init the diag for the CompilerInstance
install the llvm backend diag in the instance
execute ExecuteCompilerInvocation
handle errors and timers

ExecuteCompilerInvocation

handle basic -help / -version
load clang user plugin
handle -mllvm
optional static analyzer stuff
error checkup
creation of the FrontendAction (binds the right ExecuteAction function to the FrontendAction)
execute the FrontendAction

ExecuteAction

Preconditions: verify diagnostics are initialized and help/version have been handled
Diagnostics cleanup guard: ensure getDiagnosticClient().finish() runs on exit
Verbose stream: obtain the verbose output stream (OS)
Prepare action: invoke Act.PrepareToExecute(*this), which sets up any internal state required by the action; if it returns false, the action cannot run and ExecuteAction immediately returns false to signal failure (rather than continuing into an invalid state)
Create target: initialize the compilation target (architecture, vendor, OS, and ABI settings—e.g., x86_64-unknown-linux-gnu), configuring TargetInfo/TargetMachine; abort on failure
ObjC rewrite patch: for Objective-C (ObjC) rewriting actions (e.g., RewriteObjC), adjust the built-in ObjCBool type (disable signed char) to match the ObjC ABI
Verbose/stats flags: print version info if verbose, enable stats if requested
Sort codegen tables: sort TOC (Table of Contents) and NoTOC variable lists (used on targets like PowerPC64 for PIC data) so lookups use binary search for efficient codegen decisions
Process inputs: for each input file, clear IDs, run BeginSourceFile, Execute, and EndSourceFile
Print diagnostic stats: emit any collected diagnostic statistics
Dump stats to file: if StatsFile is set, open (or append) and write JSON stats, warning on error
Return success: return true if no errors were reported, false otherwise

Execute

Get CompilerInstance
ExecuteAction: invoke the action-specific frontend logic (ExecuteAction())
Rebuild Global Module Index: if CI.shouldBuildGlobalModuleIndex() and file manager/preprocessor are present, fetch Cache = CI.getPreprocessor().getHeaderSearchInfo().getModuleCachePath() and, if non-empty, call GlobalModuleIndex::writeIndex. On error, consume it silently
Return success: always return llvm::Error::success() (no error propagation)

init AST creation

FrontendAction Hierarchy

Clang’s frontend actions all inherit from the abstract base FrontendAction, providing a uniform ExecuteAction entry point. Key subclasses include:

ASTFrontendAction: Runs after parsing to operate on the AST (analysis, transformations).
PreprocessorFrontendAction: Hooks into the preprocessor stage (token processing before parsing).
CodeGenAction: Coordinates code generation backends (e.g., IR emission, object code output).
Other FrontendActions: Miscellaneous actions (e.g., MergeModuleAction, PluginAction).

Below is the EmitLLVMAction, which inherits from CodeGenAction and emits LLVM IR:

class EmitLLVMAction : public CodeGenAction {
  virtual void anchor();              // Ensure vtable emission
public:
  EmitLLVMAction(llvm::LLVMContext *_VMContext = nullptr);
};
 
void EmitLLVMAction::anchor() { }
EmitLLVMAction::EmitLLVMAction(llvm::LLVMContext *_VMContext)
  : CodeGenAction(Backend_EmitLL, _VMContext) {}

What this does:

anchor(): Defines an out-of-line virtual method so that the compiler emits EmitLLVMAction’s vtable in this translation unit.
Constructor: Calls the CodeGenAction base with Backend_EmitLL, registering the action to generate LLVM IR into the given (or newly created) LLVMContext.

note: all the CodeGenAction are the same, with the llvm::LLVMContext (Backend_EmitLL) being the only difference

CodeGenAction::ExecuteAction

Wrapper over AST path: Delegates all non-LLVM-IR inputs straight to ASTFrontendAction::ExecuteAction().
LLVM IR handling: Intercepts LLVM IR inputs and runs the IR-specific emission pipeline (bypassing the AST frontend).

We’re not detailing the LLVM IR setup and configuration steps, as those belong to the backend-specific flow.

ASTFrontendAction::ExecuteAction

Preprocessor check: Returns early if the preprocessor isn’t initialized.
Stack setup: Marks the bottom of the stack to guard against deep AST recursion.
Code completion: If requested, installs a CodeCompleteConsumer for IDE-style suggestions.
Semantic analysis init: Creates/configures the Sema object to drive name lookup and type checking.
Parse AST: Invokes ParseAST, which lexes, parses, and executes the specific frontend action logic on the AST.

Parsing Class

`Parser`

The Parser class is responsible for syntactic parsing. It takes references to the PreProcessor and Sema, and drives the parsing of tokens into declarations and statements.

`Sema`

The Sema (semantic analyzer) performs semantic checks (e.g., type checking, declaration validation). It owns references to essential components used during semantic analysis:

Main members inside the Sema:

ASTContext – manages all AST nodes and semantic info
PreProcessor – token stream handler
LangOptions – holds active language dialect flags
DiagnosticsEngine – emits warnings and errors
SourceManager – tracks source locations

`Scope`

Tracks the current scope during parsing and semantic analysis. It helps Sema resolve names (variables, functions, types) correctly depending on the nesting level (e.g., inside functions, loops, conditionals, or namespaces).

For example:

int foo() {
    while (1) {
        if (bar) {
            foobar();
        }
    }
}

foobar is in the if scope, which is in the while scope, which is in the function scope.

`Decl`

Decl is the base class for all AST nodes representing C/C++ declarations (e.g., functions, variables, classes). All declaration nodes (like FunctionDecl, VarDecl, RecordDecl) derive from Decl.

DeclGroupRef is a container used when multiple declarations are parsed together (e.g., int a, b;).

`ASTContext`

ASTContext manages the lifetime and storage of all AST nodes and semantic information. It contains information about:

Main members inside the ASTContext:

Decl nodes (all AST nodes)
LangOpts
TargetInfo

`ASTConsumer`

A callback interface class to observe and process AST nodes as they’re built.

`Parser::ModuleImportState`

An enum only for C++20 module/import that can only be placed at the start of a file. After the first declaration, the keywords module/import become normal identifiers that you can use.

Example:

module;
import foo; // valid here
int x;
import bar; // now invalid
int import = 1; // valid

parsing logic overview

clang::ParseAST

Sets up the statistics system (used for diagnostics and performance tracking)
Initializes Sema
Creates the ASTConsumer from Sema, which will receive the AST nodes
Constructs the Parser, connecting the Preprocessor and Sema
Registers crash recovery cleanup routines
Ensures the lexer is available
Initializes the parsing logic — this is where the first token is read and the first scope is set (DeclScope)
Main parsing logic happens in HandleTopLevelDecl, which in your case emits LLVM IR
Processes things like #pragma weak
create the Target (ex .o)
Prints stats

parsing logic

This is the main parsing logic of the Clang Parser.

    Parser::DeclGroupPtrTy ADecl; // temp for the current Decl
    Sema::ModuleImportState ImportState; // import state for C++20
 
    for (bool AtEOF = P.ParseFirstTopLevelDecl(ADecl, ImportState); !AtEOF;
         AtEOF = P.ParseTopLevelDecl(ADecl, ImportState)) {
 
      if (ADecl && !Consumer->HandleTopLevelDecl(ADecl.get()))
        return;
    }

ParseFirstTopLevelDecl is a wrapper around ParseTopLevelDecl that initializes the ImportState for C++20. So this loop returns a Decl and while it’s not AtEOF, the Decl is passed to Consumer->HandleTopLevelDecl, which is the ASTConsumer that transforms it to the target — in your case, IR.

Parser::DeclGroupPtrTy is a pointer to a DeclGroupRef, which is an interface for these classes.

Parser::ParseTopLevelDecl

Sets up a destructor (RAII) for the parser data
Giant switch-case for special cases (this is where tok::eof returns true)
Initializes ParsedAttributes for GNU-style attributes (__attribute__((foo))) and C++11-style ([[foo]])
Parses the trailing attributes of both types
Calls ParseExternalDeclaration
Handles the ImportState for C++20

Parser::ParseExternalDeclaration

This function decides whether to redirect to ParseDeclarationOrFunctionDefinition or ParseDeclaration, and handles special cases like asm or import/export. It can be simplified like this:

    if (Tok.isEditorPlaceholder()) {
      ConsumeToken();
      return nullptr;
    }
    if (getLangOpts().IncrementalExtensions &&
        !isDeclarationStatement(/*DisambiguatingWithExpression=*/true))
      return ParseTopLevelStmtDecl();
 
    if (!SingleDecl)
      return ParseDeclarationOrFunctionDefinition(Attrs, DeclSpecAttrs, DS);
 
    return Actions.ConvertDeclToDeclGroup(SingleDecl);

ParseTopLevelStmtDecl is for clang-repl, so you can use C like you would with the Python REPL:

 ➜  ~ clang-repl
clang-repl> #include <stdio.h>
clang-repl> int foo = 5;
clang-repl> printf("foo = %d\n", foo);
foo = 5
clang-repl> foo = 2;
clang-repl> printf("foo = %d\n", foo);
foo = 2

ParseDeclaration: Parses a declaration only (variables, typedefs, namespaces, inline namespaces, etc.). Used in the big switch-case
ParseDeclarationOrFunctionDefinition: Parses either a declaration or a function body, depending on what follows the declarator

Parser::ParseDeclarationOrFunctionDefinition

The ParsingDeclSpec (DS) passed by ParseExternalDeclaration is set to nullptr, so we’re not using the if (DS) part.

But first, what is ParsingDeclSpec?

ParsingDeclSpec

As stated in the class definition, ParsingDeclSpec is for parsing a DeclSpec:

/// A class for parsing a DeclSpec.
class ParsingDeclSpec : public DeclSpec {

The constructor of ParsingDeclSpec calls the parser’s getAttrFactory(), providing what DeclSpec needs for initialization. The key advantage of ParsingDeclSpec is that it creates a ParsingDeclRAIIObject to manage diagnostics — accumulating warnings or errors during parsing and ensuring they’re either committed or discarded when the object goes out of scope.

DeclSpec

But what is DeclSpec?

DeclSpec captures all the information about declaration specifiers:

Storage specifiers (SCS): typedef, extern, static, auto, register, etc.
Thread storage specifiers (TSCS): __thread, thread_local, _Thread_local
Type qualifiers (TQ): const, volatile, restrict, atomic, etc.
Attribute

Each category is stored in a compact bitfield along with source-location metadata, allowing Clang to validate and diagnose specifier usage precisely.

Back to ParseDeclarationOrFunctionDefinition:

Initializes a ObjCDeclContextSwitch — we don’t care for this since it’s Objective-C
Enters ParseDeclOrFunctionDefInternal

Parser::ParseDeclOrFunctionDefInternal

Adds DeclSpecAttrs (__attribute__) list to the ParsingDeclSpec
Prepares for MS-specific parsing
Parses freestanding declaration specifiers (typedef, extern, static, auto, register, class {}, struct {}, enum {}) and fills DeclSpec
Checks for a missing ; and parses trailing args like attribute
If a ; is found:
- Returns the keyword size for diagnostics in LengthOfTSTToken
- Suggests corrections (like fixing [[attrib]] struct → struct [[attrib]])
- Adjusts attribute position if possible
- Consumes the ;
- Creates the AST (Sema::ParsedFreeStandingDeclSpec)
- If it’s a RecordDecl (struct/union/class), validates it
- Creates a DeclGroupPtrTy
If a struct has been parsed but is not freestanding (struct S { … } x;), notify Sema before parsing x
Handles attributes before Objective-C keywords like @interface
DS.abort() — RAII cleanup?
Adds Attrs ([[attribute]]) to DeclSpec
Detects extern "C"
Adjusts [[attribute]] position if possible
Calls ParseDeclGroup to parse the function and return the result

Functions to detail later:

ParseDeclarationSpecifiers
DiagnoseMissingSemiAfterTagDefinition
ParsedFreeStandingDeclSpec
ActOnDefinedDeclarationSpecifier

`Parser::ParseDeclGroup`

example of the current state of the compiler

int [[hidden]] x = 5;

already consumed int [[hidden]] current Token x

static inline void foo() {
    return;
}

already consumed static inline void current Token foo

function explanation:

init the ParsedAttribute and ParsingDeclarator with current data
init SuppressAccessChecks, a RAII class to delay errors for templates of private class members template <> foo<bar::foobar>
ParseDeclarator (ex: int x = 5; current ;, int main(int ac, char av) {} current: {)
pop out SuppressAccessChecks
if no name parsed, skip until a good place to continue the parsing and return nullptr

int /*missing name*/;

shader-specific parsing
parse end of requires (C++20)

template<typename T>
void f(T) requires std::is_integral_v<T>;  
// The 'requires std::is_integral_v<T>' is parsed here

if we are parsing a function
- parse trailing GNU __attribute__
- if the current token is _Noreturn, create a diagnostic: this can’t happen
- if tok == = and next token is the code completion of the LSP
```
struct S {
  S() = /*<cursor here>*/a
};
```
- handle virtual/override (C++11)
```
struct C { void f(); };
// Out-of-line definition with invalid 'override'
void C::f() override { /*...*/ }
```
- determine if this is a function body or prototype; if it’s a function def, enter the if
  - file-scope only
  - handle explicit instantiation vs specialization
  - call ParseFunctionDefinition(...)
  - return converted DeclGroup
```
int f(int x) { return x*2; } // function def
 
int f(int x);  // prototype
```
consume any attribute after declarator and attach them to DeclaratorDecl
handle C++ range-based for loops (and Obj-C for-in)

for (auto &x : vec) {
  // ...
}

declsIncGroup to explain
while it’s not a comma, continue
- if this is a new line and in this context this can’t be a declarator: diag missing semi
- error on multiple template declarators
- D.clear(); D.setCommaLoc()
- parse __attribute__ again…
- if (MicrosoftExt) skip MS‐style attrs
- shader parsing

if that’s a valid type

parse requires

template<typename T>
void f(),                 // first declarator: no requires
g() requires C<T>;   // second declarator: has a trailing requires-clause

get token location into DeclEnd
if semi and expectedSemi
- if any specifier, skip it
Actions.FinalizeDeclaratorGroup

Parser::ParseFunctionDefinition

Microsoft-specific SEH (__try, __except, __finally) is flagged as illegal in the function body after parsing
If compiling C and not in C89, warn about a missing return identifier (this was allowed in older C)
Handles K&R-style identifiers:

int sum(a, b)     // ← parameter names only
int a;            // ← types declared separately
int b;
{
    return a + b;
}

Checks if the token is valid ({, or in C++ only: try, :, =); otherwise, emit a diagnostic and skip until {; if not found, return nullptr
If =, check for invalid attribute
If delayed template parsing is enabled and it’s a template definition (not = default or = delete):
- Only parse the signature, cache the body tokens, and delay full parsing until later
Else if Objective-C-specific
Set parsescope to:

int foo() {}        // FnScope
 
struct mytype { };  // DeclScope
 
void foo() {
  {                 // CompoundStmtScope
    int x = 3;  
  }
}

Parse C++ = delete or = default as ActOnStartOfFunctionDef
Take a snapshot of the FPF (Float Precision Feature) to restore it later:

void f() {
  #pragma float_control(precise, on)
  float x = a + b;  // precise mode
  
  {
    #pragma float_control(precise, off)
    float y = c + d;  // fast-math mode
  }
 
  float z = e + f;  // still in precise mode — outer setting restored
}

Sema checks if the base of the parsing is correct
Skip what should be skipped for delayed template parsing
Clear the ParsingDeclarator RAII before parsing the body
Finish parsing for special states (e.g., = default)
If using auto, transform it into a template by adding depth to the depth traker (this is normaly done by template <> and need to be done manulay for auto f(auto foo) in cpp 20)
If skipping the function body, finish parsing and return it
Parse try
If :, parse constructor
Parse late __attribute__
Return ParseFunctionStatementBody

Parser::ParseFunctionStatementBody

Save the { location for diagnostics
Setup RAII for C++ method pragma internal state (mainly for Windows)
Main parsing function: ParseCompoundStatementBody
If the function body is invalid, create a bogus CompoundStmt
Exit the body scope
Return the finished parsed body

Parser::ParseCompoundStatementBody

Create the feature RAII
Start with pragmas
Parse __label__ (can only be at the start of a scope)

{
  __label__ retry1;
  __label__ retry2;
  __label__ done1, done2;
}

Set ParsedStmtContext to Compound | InStmtExpr
While not at the end of the function:
- Try to parse misplaced C++20 imports
- Parse redundant semicolons ;;;;;;;;;
- If token is not __extension__, call ParseStatementOrDeclaration()
- Else, consume all __extension__ and parse the statement, but silence extension warnings
- If the last statement is valid, push it to the Stmts vector
Check the last statement
Emit FP evaluation method warnings for unsupported targets
Gracefully handle the closing } or missing braces
Build and return a CompoundStmt or StmtExpr via Sema

ParseStatementOrDeclaration

To be explained

also the sema validation

Quartz 4

Explorer

clang_cc1

Compiler Core cc1

Going back to the clang_start

Main class explanation

CompilerInstance

FileManager

SourceManager

Preprocessor

Lexer

back to code

ExecuteCC1Tool

cc1_main

ExecuteCompilerInvocation

ExecuteAction

Execute

init AST creation

FrontendAction Hierarchy

CodeGenAction::ExecuteAction

ASTFrontendAction::ExecuteAction

Parsing Class

Parser

Sema

Scope

Decl

ASTContext

ASTConsumer

Parser::ModuleImportState

parsing logic overview

clang::ParseAST

parsing logic

Parser::ParseTopLevelDecl

Parser::ParseExternalDeclaration

Parser::ParseDeclarationOrFunctionDefinition

ParsingDeclSpec

DeclSpec

Parser::ParseDeclOrFunctionDefInternal

Parser::ParseDeclGroup

Parser::ParseFunctionDefinition

Parser::ParseFunctionStatementBody

Parser::ParseCompoundStatementBody

ParseStatementOrDeclaration

Graph View

Table of Contents

Backlinks

Compiler Core `cc1`

`CompilerInstance`

`FileManager`

`SourceManager`

`Preprocessor`

`Lexer`

`Parser`

`Sema`

`Scope`

`Decl`

`ASTContext`

`ASTConsumer`

`Parser::ModuleImportState`

`Parser::ParseDeclGroup`