Compiler Core cc1
Going back to the clang_start
Ok, so let’s recap. We:
- created the driver settings
- with that, created the compilation settings
- created actions for each part of the compilation
- created a list of jobs from the list of actions
Now it’s time to execute the job!!!
At the beginning, we talked about an if that checks whether the first argument is -cc1.
That is the branch we take now: we are building, guys!!!
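To make that check concrete, here is a minimal sketch of the dispatch. It is not Clang's code: the `Tool` enum and `selectTool` are hypothetical names, and the tool strings other than `-cc1` are assumptions modelled on what the driver accepts.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical tool kinds for the sketch.
enum class Tool { Driver, CC1, CC1As, CC1GenReproducer, Unknown };

// Decide which entry point should handle the command line.
// Args[0] is the program name, Args[1] is the first real argument.
Tool selectTool(const std::vector<std::string> &Args) {
  if (Args.size() < 2 || Args[1].rfind("-cc1", 0) != 0)
    return Tool::Driver;            // regular driver invocation
  if (Args[1] == "-cc1")
    return Tool::CC1;               // compiler core
  if (Args[1] == "-cc1as")
    return Tool::CC1As;             // integrated assembler (assumed name)
  if (Args[1] == "-cc1gen-reproducer")
    return Tool::CC1GenReproducer;  // assumed name
  return Tool::Unknown;             // starts with -cc1 but is not a known tool
}
```

Calling `selectTool({"clang", "-cc1", "foo.c"})` takes the cc1 branch, while a plain `clang foo.c` stays in the driver.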
Main class explanation
Before we start, I think it’s best to explain all the main classes involved in the compilation:
CompilerInstance, FileManager, SourceManager, Preprocessor, Lexer.
This is more of a reference section to better understand the rest.
CompilerInstance
This is the class responsible for handling the cc1 part of Clang.
It contains all the classes needed for compilation.
Main classes inside the CompilerInstance:
- FileManager
- SourceManager
- TargetInfo
- Preprocessor
- ASTContext
- ASTConsumer
- ASTReader
Function:
bool CompilerInstance::ExecuteAction(FrontendAction &Act);
FileManager
The FileManager class finds/reads files on disk or in the VFS and caches metadata and open memory buffers.
A successful lookup returns a FileEntryRef.
VFS (virtual file system): an LLVM abstraction that makes all file sources (disk files, in-memory buffers, archives) look like a single, uniform filesystem to the compiler.
Main members inside the FileManager:
- FileSystem
- FileSystemOptions
- DenseMap of real files by ID
- DenseMap of real directories by ID
- SmallVector of virtual file entries
- SmallVector of virtual directory entries
- StringMap for cached file lookups
- StringMap for cached directory lookups
- OptionalFileEntryRef for stdin
Function:
llvm::Expected<FileEntryRef> FileManager::getFileRef(StringRef Filename,
bool OpenFile = false,
bool CacheFailure = true,
bool IsText = true);
SourceManager
This class takes raw buffers from FileManager (FileEntryRef), assigns them simple integer FileIDs, and builds the data structures needed for the rest of the compiler to ask, “What line/column is offset N in FileID X?”
Main members inside the SourceManager:
- SourceMgr – contains a list of MemoryBuffer objects (one per FileID)
- FileIDMap – mapping from buffer index to FileEntryRef
- Line/column mapping tables for each buffer (e.g., LineOffsets)
- Macro expansion and include-location stacks
- FileID counter to assign unique IDs for each buffer
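The "what line/column is offset N?" question can be answered with a table of line start offsets and a binary search. The sketch below only illustrates that idea and assumes a single buffer with `\n` line endings; `computeLineOffsets` and `getLineAndColumn` are hypothetical helpers, not SourceManager methods.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Byte offset of the first character of every line in the buffer.
std::vector<std::size_t> computeLineOffsets(const std::string &Buffer) {
  std::vector<std::size_t> Offsets{0};   // line 1 starts at offset 0
  for (std::size_t I = 0; I < Buffer.size(); ++I)
    if (Buffer[I] == '\n')
      Offsets.push_back(I + 1);          // the next line starts after '\n'
  return Offsets;
}

// Map a byte offset to a 1-based (line, column) pair with a binary search.
std::pair<unsigned, unsigned>
getLineAndColumn(const std::vector<std::size_t> &Offsets, std::size_t Offset) {
  assert(!Offsets.empty() && "table always holds the start of line 1");
  // First line start strictly greater than Offset; our line is the one before.
  auto It = std::upper_bound(Offsets.begin(), Offsets.end(), Offset);
  std::size_t LineIdx = static_cast<std::size_t>(It - Offsets.begin()) - 1;
  return {static_cast<unsigned>(LineIdx + 1),
          static_cast<unsigned>(Offset - Offsets[LineIdx] + 1)};
}
```

Building the table costs one pass over the buffer, and every later query is a logarithmic search, which matters because diagnostics ask this question a lot.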
Functions:
/// Create a new FileID for the given FileEntryRef and return it.
FileID SourceManager::createFileID(FileEntryRef FE,
SourceLocation Loc,
SrcMgr::CharacteristicKind K);
/// Return a pointer to the character data at the given source location.
const char *SourceManager::getCharacterData(SourceLocation SL,
bool *Invalid = nullptr) const;
Preprocessor
This class handles all pre-parsing work—macro definitions, #include directives, conditional compilation—and produces a stream of tokens for the parser.
Main members inside the Preprocessor:
- HeaderSearch – manages search paths and maps headers to FileEntryRef
- PreprocessorOptions – stores flags like -D, -I, macro expansion settings
- IdentifierTable – uniquing table for identifiers and keywords
- Builtin::Context – definitions for built-in macros and keywords
- MacroInfoMap – maps macro names to their definitions (MacroInfo)
Functions:
/// Lex the next token into Result, expanding macros and handling directives.
void Preprocessor::Lex(Token &Result);
/// Push a new file onto the include stack, creating a fresh Lexer.
void Preprocessor::EnterSourceFile(FileID FID, bool IsMacroFile,
llvm::MemoryBuffer *Buffer);
Lexer
This class reads characters from a source buffer (via SourceManager) and produces Token objects for the parser.
Main members inside the Lexer:
- FileID – identifies the buffer being lexed
- SourceManager &SM – provides access to buffer contents and locations
- LangOptions &LangOpts – controls language-specific lexing behavior
- const char *BufferStart/Ptr/End – pointers to track current position in the memory buffer
- Token CurToken – storage for the current token
- unsigned CurPPEnd – offset for end-of-macro/file switching
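The three buffer pointers are the heart of any hand-written lexer. As a rough sketch (not Clang's Lexer: `ToyLexer`, `Tok`, `Kind` and `lexAll` are hypothetical, and comments, literals and multi-character operators are ignored), the pointer walk looks like this:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for the sketch.
enum class Kind { Identifier, Number, Punct, Eof };

struct Tok {
  Kind K;
  std::string Text;
};

// Minimal lexer over one buffer, tracking BufferStart / Ptr / End.
// The buffer must outlive the lexer, since only pointers are stored.
class ToyLexer {
  const char *BufferStart, *Ptr, *End;

  static bool isIdentChar(char C) {
    return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
  }

public:
  explicit ToyLexer(const std::string &Buf)
      : BufferStart(Buf.data()), Ptr(Buf.data()),
        End(Buf.data() + Buf.size()) {}

  // Offset of the next unread character inside the buffer.
  unsigned offset() const { return static_cast<unsigned>(Ptr - BufferStart); }

  Tok lex() {
    while (Ptr != End && std::isspace(static_cast<unsigned char>(*Ptr)))
      ++Ptr;                                   // skip whitespace
    if (Ptr == End)
      return {Kind::Eof, ""};
    const char *TokStart = Ptr;
    if (std::isdigit(static_cast<unsigned char>(*Ptr))) {
      while (Ptr != End && std::isdigit(static_cast<unsigned char>(*Ptr)))
        ++Ptr;
      return {Kind::Number, std::string(TokStart, Ptr)};
    }
    if (isIdentChar(*Ptr)) {
      while (Ptr != End && isIdentChar(*Ptr))
        ++Ptr;
      return {Kind::Identifier, std::string(TokStart, Ptr)};
    }
    ++Ptr;                                     // any other single character
    return {Kind::Punct, std::string(TokStart, Ptr)};
  }
};

// Convenience helper: lex a whole buffer and collect the token spellings.
std::vector<std::string> lexAll(const std::string &Buf) {
  ToyLexer L(Buf);
  std::vector<std::string> Out;
  for (Tok T = L.lex(); T.K != Kind::Eof; T = L.lex())
    Out.push_back(T.Text);
  return Out;
}
```

For the input `int x = 5;` this produces the spellings `int`, `x`, `=`, `5`, `;`.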
Functions:
/// Lex the next token from the buffer into Result (skips whitespace/comments).
bool Lexer::Lex(Token &Result);
/// Initialize the lexer for a new buffer.
void Lexer::Initialize(FileID FID,
const LangOptions &LangOpts,
SourceManager &SM,
bool IsAtStartOfFile);
Back to the code
cc1_main
├─ split & claim cc1-only flags (plugins, -mllvm, etc.)
├─ initialize cc1 context
│ ├─ create DiagnosticIDs & DiagnosticsEngine + consumers
│ ├─ register PCH formats & built-in targets
│ └─ parse frontend/backend flags (-disable-free, -disable-llvm-verifier, -main-file-name, etc.)
├─ parse remaining argv into CompilerInvocation
├─ instantiate CompilerInstance CI
│ ├─ attach DiagnosticsEngine
│ ├─ set up FileManager & SourceManager
│ ├─ set up TargetInfo/TargetMachine
│ ├─ initialize Preprocessor & HeaderSearch
│ └─ create ASTContext & Sema
├─ configure timers, timeTraceProfile & stats if requested
├─ status = ExecuteCompilerInvocation(CI)
│ ├─ handle -help / -version
│ ├─ load plugins & handle -mllvm
│ ├─ configure sanitizers & static-analysis hooks
│ ├─ select & construct FrontendAction Act
│ ├─ CI.BeginSourceFile(Act, inputs)
│ ├─ CI.ExecuteAction(Act) ← Parser→Sema→CodeGen or AST walk
│ ├─ CI.EndSourceFile()
│ └─ return Act success/failure
└─ return status
ExecuteCC1Tool
- tokenize the command line
- dispatch to the matching cc1 entry point
cc1_main
- init a CompilerInstance() and a DiagnosticIDs() instance
- init PCH format (precompiled header)
- init all target base functions
- diag setup
- fill the CompilerInvocation of the CompilerInstance, which contains all compiler settings, and do some checks on the args list
- timeTraceProfile init if needed
- print clang cpu/stats
- init the diag for the CompilerInstance
- install the llvm backend diag in the instance
- execute ExecuteCompilerInvocation
- handle errors and timers
ExecuteCompilerInvocation
- handle basic -help/-version
- load clang user plugin
- handle -mllvm
- optional static analyzer stuff
- error checkup
- creation of the FrontendAction (binds the right ExecuteAction function to the FrontendAction)
- execute the FrontendAction
ExecuteAction
- Preconditions: verify diagnostics are initialized and help/version have been handled
- Diagnostics cleanup guard: ensure getDiagnosticClient().finish() runs on exit
- Verbose stream: obtain the verbose output stream (OS)
- Prepare action: invoke Act.PrepareToExecute(*this), which sets up any internal state required by the action; if it returns false, the action cannot run and ExecuteAction immediately returns false to signal failure (rather than continuing into an invalid state)
- Create target: initialize the compilation target (architecture, vendor, OS, and ABI settings, e.g. x86_64-unknown-linux-gnu), configuring TargetInfo/TargetMachine; abort on failure
- ObjC rewrite patch: for Objective-C (ObjC) rewriting actions (e.g., RewriteObjC), adjust the built-in ObjCBool type (disable signed char) to match the ObjC ABI
- Verbose/stats flags: print version info if verbose, enable stats if requested
- Sort codegen tables: sort TOC (Table of Contents) and NoTOC variable lists (used on targets like PowerPC64 for PIC data) so lookups use binary search for efficient codegen decisions
- Process inputs: for each input file, clear IDs, run BeginSourceFile, Execute, and EndSourceFile
- Print diagnostic stats: emit any collected diagnostic statistics
- Dump stats to file: if StatsFile is set, open (or append) and write JSON stats, warning on error
- Return success: return true if no errors were reported, false otherwise
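Two of these steps are easier to see in code: the cleanup guard that fires on every exit path, and the per-input Begin/Execute/End loop. This is a simplified sketch with hypothetical stand-ins (`Trace`, `FinishGuard`, `ToyAction`, `runTrace`), not the real CompilerInstance.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of calls so the control flow is visible.
struct Trace {
  std::vector<std::string> Events;
};

// RAII guard: runs the "finish" step however the function exits.
class FinishGuard {
  Trace &T;

public:
  explicit FinishGuard(Trace &T) : T(T) {}
  ~FinishGuard() { T.Events.push_back("finish"); }
};

struct ToyAction {
  bool PrepareOk = true;
  bool prepareToExecute() const { return PrepareOk; }
  bool beginSourceFile(Trace &T, const std::string &In) {
    T.Events.push_back("begin " + In);
    return true;
  }
  void execute(Trace &T) { T.Events.push_back("execute"); }
  void endSourceFile(Trace &T) { T.Events.push_back("end"); }
};

bool executeAction(ToyAction &Act, const std::vector<std::string> &Inputs,
                   Trace &T) {
  FinishGuard Guard(T);        // diagnostics cleanup on every exit path
  if (!Act.prepareToExecute())
    return false;              // early exit, the guard still fires
  for (const std::string &In : Inputs) {
    if (Act.beginSourceFile(T, In)) {
      Act.execute(T);
      Act.endSourceFile(T);
    }
  }
  return true;
}

// Helper: run the toy action and return the recorded call order.
std::vector<std::string> runTrace(bool PrepareOk,
                                  const std::vector<std::string> &Inputs) {
  Trace T;
  ToyAction Act;
  Act.PrepareOk = PrepareOk;
  executeAction(Act, Inputs, T);
  return T.Events;
}
```

With one input the recorded order is begin, execute, end, finish; when preparation fails, only finish is recorded.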
Execute
- Get CompilerInstance
- ExecuteAction: invoke the action-specific frontend logic (ExecuteAction())
- Rebuild Global Module Index: if CI.shouldBuildGlobalModuleIndex() and file manager/preprocessor are present, fetch Cache = CI.getPreprocessor().getHeaderSearchInfo().getModuleCachePath() and, if non-empty, call GlobalModuleIndex::writeIndex. On error, consume it silently
- Return success: always return llvm::Error::success() (no error propagation)
init AST creation
FrontendAction Hierarchy
Clang’s frontend actions all inherit from the abstract base FrontendAction, providing a uniform ExecuteAction entry point. Key subclasses include:
- ASTFrontendAction: Runs after parsing to operate on the AST (analysis, transformations).
- PreprocessorFrontendAction: Hooks into the preprocessor stage (token processing before parsing).
- CodeGenAction: Coordinates code generation backends (e.g., IR emission, object code output).
- Other FrontendActions: Miscellaneous actions (e.g., MergeModuleAction, PluginAction).
Below is the EmitLLVMAction, which inherits from CodeGenAction and emits LLVM IR:
class EmitLLVMAction : public CodeGenAction {
virtual void anchor(); // Ensure vtable emission
public:
EmitLLVMAction(llvm::LLVMContext *_VMContext = nullptr);
};
void EmitLLVMAction::anchor() { }
EmitLLVMAction::EmitLLVMAction(llvm::LLVMContext *_VMContext)
: CodeGenAction(Backend_EmitLL, _VMContext) {}
What this does:
- anchor(): Defines an out-of-line virtual method so that the compiler emits EmitLLVMAction’s vtable in this translation unit.
- Constructor: Calls the CodeGenAction base with Backend_EmitLL, registering the action to generate LLVM IR into the given (or newly created) LLVMContext.
note: all the CodeGenAction subclasses look the same; the backend kind passed to the base constructor (here Backend_EmitLL) is the only difference
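The "same class, different enum value" pattern can be sketched without any Clang headers. Everything below is hypothetical (`ToyBackend`, `ToyCodeGenAction`, the extension mapping); it only shows how a base class can keep all the logic while each subclass is a one-line constructor.

```cpp
#include <cassert>
#include <string>

// Illustrative subset of backend kinds, modelled on Backend_EmitLL.
enum class ToyBackend { EmitAssembly, EmitBC, EmitLL, EmitObj };

class ToyCodeGenAction {
  ToyBackend Act;

protected:
  explicit ToyCodeGenAction(ToyBackend A) : Act(A) {}

public:
  virtual ~ToyCodeGenAction() = default;

  // Shared logic lives in the base; only the stored kind changes the outcome.
  std::string outputExtension() const {
    switch (Act) {
    case ToyBackend::EmitAssembly: return "s";
    case ToyBackend::EmitBC:       return "bc";
    case ToyBackend::EmitLL:       return "ll";
    case ToyBackend::EmitObj:      return "o";
    }
    return "";
  }
};

// Each subclass only picks a kind.
struct ToyEmitLLVMAction : ToyCodeGenAction {
  ToyEmitLLVMAction() : ToyCodeGenAction(ToyBackend::EmitLL) {}
};
struct ToyEmitObjAction : ToyCodeGenAction {
  ToyEmitObjAction() : ToyCodeGenAction(ToyBackend::EmitObj) {}
};
```

Adding a new output kind then means one enumerator, one switch case, and one tiny subclass.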
CodeGenAction::ExecuteAction
- Wrapper over AST path: Delegates all non-LLVM-IR inputs straight to ASTFrontendAction::ExecuteAction().
- LLVM IR handling: Intercepts LLVM IR inputs and runs the IR-specific emission pipeline (bypassing the AST frontend).
We’re not detailing the LLVM IR setup and configuration steps, as those belong to the backend-specific flow.
ASTFrontendAction::ExecuteAction
- Preprocessor check: Returns early if the preprocessor isn’t initialized.
- Stack setup: Marks the bottom of the stack to guard against deep AST recursion.
- Code completion: If requested, installs a CodeCompleteConsumer for IDE-style suggestions.
- Semantic analysis init: Creates/configures the Sema object to drive name lookup and type checking.
- Parse AST: Invokes ParseAST, which lexes, parses, and executes the specific frontend action logic on the AST.
Parsing Class
Parser
The Parser class is responsible for syntactic parsing. It takes references to the Preprocessor and Sema, and drives the parsing of tokens into declarations and statements.
Sema
The Sema (semantic analyzer) performs semantic checks (e.g., type checking, declaration validation). It owns references to essential components used during semantic analysis:
Main members inside the Sema:
- ASTContext – manages all AST nodes and semantic info
- Preprocessor – token stream handler
- LangOptions – holds active language dialect flags
- DiagnosticsEngine – emits warnings and errors
- SourceManager – tracks source locations
Scope
Tracks the current scope during parsing and semantic analysis. It helps Sema resolve names (variables, functions, types) correctly depending on the nesting level (e.g., inside functions, loops, conditionals, or namespaces).
For example:
int foo() {
    while (1) {
        if (bar) {
            foobar();
        }
    }
}
foobar is in the if scope, which is in the while scope, which is in the function scope.
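Name lookup through nested scopes boils down to walking a parent chain from the innermost scope outward. Here is a minimal sketch of that walk; `ToyScope` and the declared names (`arg`, `i`, `tmp`) are hypothetical, and Clang's Scope tracks far more state than a set of names.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy scope: a set of declared names plus a link to the enclosing scope.
struct ToyScope {
  const ToyScope *Parent;
  std::set<std::string> Names;

  explicit ToyScope(const ToyScope *P = nullptr) : Parent(P) {}

  void declare(const std::string &N) { Names.insert(N); }

  // Walk outward from this scope. Returns how many scopes were crossed
  // before the name was found, or -1 if the name is not visible.
  int lookupDepth(const std::string &N) const {
    int Crossed = 0;
    for (const ToyScope *S = this; S; S = S->Parent, ++Crossed)
      if (S->Names.count(N))
        return Crossed;
    return -1;
  }
};

// Rebuild the nesting of the example: function > while > if.
int crossedScopes(const std::string &Name) {
  ToyScope Fn;                  // function scope
  Fn.declare("arg");
  ToyScope While(&Fn);          // while body
  While.declare("i");
  ToyScope If(&While);          // if body
  If.declare("tmp");
  return If.lookupDepth(Name);  // lookup starts from the innermost scope
}
```

A name declared in the if body is found immediately, one declared at function level is found after crossing two scopes, and an unknown name falls off the end of the chain.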
Decl
Decl is the base class for all AST nodes representing C/C++ declarations (e.g., functions, variables, classes).
All declaration nodes (like FunctionDecl, VarDecl, RecordDecl) derive from Decl.
DeclGroupRef is a container used when multiple declarations are parsed together (e.g., int a, b;).
ASTContext
ASTContext manages the lifetime and storage of all AST nodes and semantic information. It contains information about:
Main members inside the ASTContext:
- Decl nodes (all AST nodes)
- LangOpts
- TargetInfo
ASTConsumer
A callback interface class to observe and process AST nodes as they’re built.
Sema::ModuleImportState
An enum that tracks where C++20 module/import declarations are still allowed, since they can only be placed at the start of a file. After the first declaration, the keywords module/import become normal identifiers that you can use.
Example:
module;
import foo; // valid here
int x;
import bar; // now invalid
int import = 1; // valid
Parsing logic overview
clang::ParseAST
- Sets up the statistics system (used for diagnostics and performance tracking)
- Initializes Sema
- Gets the ASTConsumer from Sema, which will receive the AST nodes
- Constructs the Parser, connecting the Preprocessor and Sema
- Registers crash recovery cleanup routines
- Ensures the lexer is available
- Initializes the parsing logic: this is where the first token is read and the first scope is set (DeclScope)
- Main parsing logic happens in HandleTopLevelDecl, which in our case emits LLVM IR
- Processes things like #pragma weak
- Creates the target output (e.g. .o)
- Prints stats
parsing logic
This is the main parsing logic of the Clang Parser.
Parser::DeclGroupPtrTy ADecl; // temp for the current Decl
Sema::ModuleImportState ImportState; // import state for C++20
for (bool AtEOF = P.ParseFirstTopLevelDecl(ADecl, ImportState); !AtEOF;
AtEOF = P.ParseTopLevelDecl(ADecl, ImportState)) {
if (ADecl && !Consumer->HandleTopLevelDecl(ADecl.get()))
return;
}
ParseFirstTopLevelDecl is a wrapper around ParseTopLevelDecl that initializes the ImportState for C++20. Each iteration of the loop parses one top-level declaration; as long as we are not AtEOF, the Decl is passed to Consumer->HandleTopLevelDecl, the ASTConsumer that transforms it to the target (in our case, IR).
Parser::DeclGroupPtrTy is an opaque pointer wrapper (OpaquePtr<DeclGroupRef>) around a DeclGroupRef, the container for the declarations that were parsed together.
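To see the shape of that loop in isolation, here is a toy version that runs without Clang. It assumes a "declaration" is just the text up to the next `;`; `ToyParser`, `ToyConsumer`, `Collector` and `collectDecls` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical consumer interface, loosely modelled on ASTConsumer.
struct ToyConsumer {
  virtual ~ToyConsumer() = default;
  // Return false to stop parsing early.
  virtual bool handleTopLevelDecl(const std::string &Decl) = 0;
};

// Toy parser: each "declaration" is the text up to the next ';'.
class ToyParser {
  std::string Src;
  std::size_t Pos = 0;

public:
  explicit ToyParser(std::string S) : Src(std::move(S)) {}

  // Fills Decl with the next declaration; returns true at end of input.
  bool parseTopLevelDecl(std::string &Decl) {
    Decl.clear();
    while (Pos < Src.size() && (Src[Pos] == ' ' || Src[Pos] == '\n'))
      ++Pos;
    if (Pos >= Src.size())
      return true;                               // AtEOF
    std::size_t Semi = Src.find(';', Pos);
    std::size_t EndPos = Semi == std::string::npos ? Src.size() : Semi + 1;
    Decl = Src.substr(Pos, EndPos - Pos);
    Pos = EndPos;
    return false;
  }
};

// Same loop shape as in ParseAST: parse, hand over, stop on EOF or refusal.
void parseAll(ToyParser &P, ToyConsumer &C) {
  std::string Decl;
  for (bool AtEOF = P.parseTopLevelDecl(Decl); !AtEOF;
       AtEOF = P.parseTopLevelDecl(Decl))
    if (!Decl.empty() && !C.handleTopLevelDecl(Decl))
      return;
}

// Consumer that records declarations and refuses after StopAfter of them.
struct Collector : ToyConsumer {
  std::vector<std::string> Seen;
  std::size_t StopAfter;
  explicit Collector(std::size_t N) : StopAfter(N) {}
  bool handleTopLevelDecl(const std::string &Decl) override {
    Seen.push_back(Decl);
    return Seen.size() < StopAfter;
  }
};

std::vector<std::string> collectDecls(const std::string &Src,
                                      std::size_t StopAfter) {
  ToyParser P(Src);
  Collector C(StopAfter);
  parseAll(P, C);
  return C.Seen;
}
```

The consumer decides when to stop: returning false from the handler ends the loop even though input remains.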
Parser::ParseTopLevelDecl
- Sets up a destructor (RAII) for the parser data
- Giant switch-case for special cases (this is where tok::eof returns true)
- Initializes ParsedAttributes for GNU-style attributes (__attribute__((foo))) and C++11-style ([[foo]])
- Parses the trailing attributes of both types
- Calls ParseExternalDeclaration
- Handles the ImportState for C++20
Parser::ParseExternalDeclaration
This function decides whether to redirect to ParseDeclarationOrFunctionDefinition or ParseDeclaration, and handles special cases like asm or import/export. It can be simplified like this:
if (Tok.isEditorPlaceholder()) {
ConsumeToken();
return nullptr;
}
if (getLangOpts().IncrementalExtensions &&
!isDeclarationStatement(/*DisambiguatingWithExpression=*/true))
return ParseTopLevelStmtDecl();
if (!SingleDecl)
return ParseDeclarationOrFunctionDefinition(Attrs, DeclSpecAttrs, DS);
return Actions.ConvertDeclToDeclGroup(SingleDecl);
ParseTopLevelStmtDecl is for clang-repl, so you can use C like you would with the Python REPL:
➜ ~ clang-repl
clang-repl> #include <stdio.h>
clang-repl> int foo = 5;
clang-repl> printf("foo = %d\n", foo);
foo = 5
clang-repl> foo = 2;
clang-repl> printf("foo = %d\n", foo);
foo = 2
- ParseDeclaration: Parses a declaration only (variables, typedefs, namespaces, inline namespaces, etc.). Used in the big switch-case
- ParseDeclarationOrFunctionDefinition: Parses either a declaration or a function body, depending on what follows the declarator
Parser::ParseDeclarationOrFunctionDefinition
The ParsingDeclSpec (DS) passed by ParseExternalDeclaration is set to nullptr, so we’re not using the if (DS) part.
But first, what is ParsingDeclSpec?
ParsingDeclSpec
As stated in the class definition, ParsingDeclSpec is for parsing a DeclSpec:
/// A class for parsing a DeclSpec.
class ParsingDeclSpec : public DeclSpec {
The constructor of ParsingDeclSpec calls the parser’s getAttrFactory(), providing what DeclSpec needs for initialization. The key advantage of ParsingDeclSpec is that it creates a ParsingDeclRAIIObject to manage diagnostics: warnings or errors accumulate during parsing and are either committed or discarded when the object goes out of scope.
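The commit-or-discard behaviour can be sketched as a small RAII class. This is only an illustration of the pattern under the assumption that diagnostics are plain strings; `ToyParsingDecl` and `demoDelayed` are hypothetical, and the message text is made up.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy version of "delay diagnostics until we know the declaration".
class ToyParsingDecl {
  std::vector<std::string> &Sink;    // where committed diagnostics go
  std::vector<std::string> Pending;  // accumulated while parsing
  bool Done = false;

public:
  explicit ToyParsingDecl(std::vector<std::string> &Sink) : Sink(Sink) {}
  // Leaving the scope without complete() drops whatever is pending.
  ~ToyParsingDecl() { abort(); }

  void delay(const std::string &Msg) { Pending.push_back(Msg); }

  // Commit: forward every pending diagnostic to the sink.
  void complete() {
    if (Done)
      return;
    Sink.insert(Sink.end(), Pending.begin(), Pending.end());
    Pending.clear();
    Done = true;
  }

  // Discard: forget the pending diagnostics.
  void abort() {
    Pending.clear();
    Done = true;
  }
};

// Helper: parse "something" that raises one diagnostic, then commit or not.
std::vector<std::string> demoDelayed(bool Commit) {
  std::vector<std::string> Emitted;
  {
    ToyParsingDecl PD(Emitted);
    PD.delay("'f' is deprecated");   // made-up message
    if (Commit)
      PD.complete();
  }                                  // destructor discards if not committed
  return Emitted;
}
```

The point of the design is that a parser can speculate: if the declaration turns out not to exist, the diagnostics raised along the way never reach the user.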
DeclSpec
But what is DeclSpec?
DeclSpec captures all the information about declaration specifiers:
- Storage specifiers (SCS): typedef, extern, static, auto, register, etc.
- Thread storage specifiers (TSCS): __thread, thread_local, _Thread_local
- Type qualifiers (TQ): const, volatile, restrict, atomic, etc.
- Attributes
Each category is stored in a compact bitfield along with source-location metadata, allowing Clang to validate and diagnose specifier usage precisely.
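As a rough picture of the bitfield layout, here is a toy DeclSpec that stores one storage class and a mask of qualifiers, and reports duplicates or conflicts. The field widths, the enum subsets and the `countRejected` helper are assumptions for the sketch, and source locations are left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy DeclSpec: enum subsets and field widths are illustrative only.
struct ToyDeclSpec {
  enum SCS { SCS_unspecified, SCS_typedef, SCS_extern, SCS_static };
  enum TQ { TQ_unspecified = 0, TQ_const = 1, TQ_restrict = 2, TQ_volatile = 4 };

  unsigned StorageClassSpec : 3;  // one SCS value
  unsigned TypeQualifiers : 3;    // bitwise OR of TQ flags

  ToyDeclSpec()
      : StorageClassSpec(SCS_unspecified), TypeQualifiers(TQ_unspecified) {}

  // Both setters return true when there is a problem to report.
  bool setStorageClassSpec(SCS S) {
    if (StorageClassSpec != SCS_unspecified)
      return true;                // e.g. "static extern"
    StorageClassSpec = S;
    return false;
  }
  bool setTypeQual(TQ Q) {
    if (TypeQualifiers & Q)
      return true;                // e.g. "const const"
    TypeQualifiers |= Q;
    return false;
  }
};

// Feed specifier keywords to a ToyDeclSpec and count the rejected ones.
int countRejected(const std::vector<std::string> &Keywords) {
  ToyDeclSpec DS;
  int Rejected = 0;
  for (const std::string &K : Keywords) {
    bool Err = false;
    if (K == "static")
      Err = DS.setStorageClassSpec(ToyDeclSpec::SCS_static);
    else if (K == "extern")
      Err = DS.setStorageClassSpec(ToyDeclSpec::SCS_extern);
    else if (K == "const")
      Err = DS.setTypeQual(ToyDeclSpec::TQ_const);
    else if (K == "volatile")
      Err = DS.setTypeQual(ToyDeclSpec::TQ_volatile);
    Rejected += Err ? 1 : 0;
  }
  return Rejected;
}
```

Because qualifiers are independent bits, `const volatile` is fine, while a second storage class is rejected as soon as the first one is set.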
Back to ParseDeclarationOrFunctionDefinition:
- Initializes an ObjCDeclContextSwitch; we don’t care about this since it’s Objective-C
- Enters ParseDeclOrFunctionDefInternal
Parser::ParseDeclOrFunctionDefInternal
- Adds DeclSpecAttrs (__attribute__) list to the ParsingDeclSpec
- Prepares for MS-specific parsing
- Parses freestanding declaration specifiers (typedef, extern, static, auto, register, class {}, struct {}, enum {}) and fills DeclSpec
- Checks for a missing ; and parses trailing args like attribute
- If a ; is found:
  - Returns the keyword size for diagnostics in LengthOfTSTToken
  - Suggests corrections (like fixing [[attrib]] struct → struct [[attrib]])
  - Adjusts attribute position if possible
  - Consumes the ;
  - Creates the AST (Sema::ParsedFreeStandingDeclSpec)
  - If it’s a RecordDecl (struct/union/class), validates it
  - Creates a DeclGroupPtrTy
- If a struct has been parsed but is not freestanding (struct S { … } x;), notify Sema before parsing x
- Handles attributes before Objective-C keywords like @interface
- DS.abort() (RAII cleanup?)
- Adds Attrs ([[attribute]]) to DeclSpec
- Detects extern "C"
- Adjusts [[attribute]] position if possible
- Calls ParseDeclGroup to parse the function and return the result
Functions to detail later:
- ParseDeclarationSpecifiers
- DiagnoseMissingSemiAfterTagDefinition
- ParsedFreeStandingDeclSpec
- ActOnDefinedDeclarationSpecifier
Parser::ParseDeclGroup
Example of the current state of the compiler:
int [[hidden]] x = 5;
already consumed: int [[hidden]]
current token: x
static inline void foo() {
    return;
}
already consumed: static inline void
current token: foo
Function explanation:
- init the ParsedAttribute and ParsingDeclarator with current data
- init SuppressAccessChecks, a RAII class to delay errors for templates of private class members
  template <> foo<bar::foobar>
- ParseDeclarator (ex: int x = 5; current: ;, int main(int ac, char av) {} current: {)
- pop out SuppressAccessChecks
- if no name parsed, skip until a good place to continue the parsing and return nullptr
  int /*missing name*/;
- shader-specific parsing
- parse end of requires (C++20)
  template<typename T>
  void f(T) requires std::is_integral_v<T>;
  // The 'requires std::is_integral_v<T>' is parsed here
- if we are parsing a function
  - parse trailing GNU __attribute__
  - if the current token is _Noreturn, create a diagnostic: this can’t happen
  - if tok == = and next token is the code completion of the LSP
    struct S { S() = /*<cursor here>*/a };
  - handle virtual/override (C++11)
    struct C { void f(); };
    // Out-of-line definition with invalid 'override'
    void C::f() override { /*...*/ }
  - determine if this is a function body or prototype; if it’s a function def, enter the if
    - file-scope only
    - handle explicit instantiation vs specialization
    - call ParseFunctionDefinition(...)
    - return converted DeclGroup
    int f(int x) { return x*2; } // function def
    int f(int x); // prototype
- consume any attribute after declarator and attach them to DeclaratorDecl
- handle C++ range-based for loops (and Obj-C for-in)
  for (auto &x : vec) {
      // ...
  }
- DeclsInGroup to explain
- while the next token is a comma, keep parsing declarators
  - if this is a new line and in this context this can’t be a declarator: diag missing semi
  - error on multiple template declarators
  - D.clear(); D.setCommaLoc()
  - parse __attribute__ again…
  - if (MicrosoftExt) skip MS-style attrs
  - shader parsing
  - if that’s a valid type
    - parse requires
    template<typename T>
    void f(), // first declarator: no requires
    g() requires C<T>; // second declarator: has a trailing requires-clause
- get token location into DeclEnd
- if semi and expectedSemi
  - if any specifier, skip it
- Actions.FinalizeDeclaratorGroup
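Stripped of attributes, templates and error recovery, the core of the function is the comma loop that collects declarators into one group. A minimal sketch, assuming the input is already split into tokens and the specifiers were consumed earlier; `parseDeclGroup` here is a hypothetical free function that keeps only the declarator names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Parse "a, b = 2, c;" starting at token index I (just after the specifiers).
// Returns the declarator names, or an empty group when a name is missing.
std::vector<std::string> parseDeclGroup(const std::vector<std::string> &Toks,
                                        std::size_t I) {
  std::vector<std::string> DeclsInGroup;
  while (I < Toks.size()) {
    if (Toks[I] == "," || Toks[I] == ";")
      return {};                          // missing declarator name
    DeclsInGroup.push_back(Toks[I++]);    // declarator name
    while (I < Toks.size() && Toks[I] != "," && Toks[I] != ";")
      ++I;                                // skip the initializer, if any
    if (I == Toks.size() || Toks[I] == ";")
      break;                              // end of the group
    ++I;                                  // consume ',' and go again
  }
  return DeclsInGroup;
}
```

For the tokens of `int a, b = 2, c;` with `I` pointing at `a`, the group is `a`, `b`, `c`; for `int ;` the group is empty, which mirrors the "no name parsed" early return.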
Parser::ParseFunctionDefinition
- Microsoft-specific SEH (__try, __except, __finally) is flagged as illegal in the function body after parsing
- If compiling C and not in C89, warn about a missing return type (this was allowed in older C)
- Handles K&R-style identifiers:
  int sum(a, b) // ← parameter names only
  int a; // ← types declared separately
  int b;
  {
      return a + b;
  }
- Checks if the token is valid ({, or in C++ only: try, :, =); otherwise, emit a diagnostic and skip until {; if not found, return nullptr
- If =, check for invalid attribute
- If delayed template parsing is enabled and it’s a template definition (not = default or = delete):
  - Only parse the signature, cache the body tokens, and delay full parsing until later
- Else if Objective-C-specific
- Set the ParseScope to:
  int foo() {} // FnScope
  struct mytype { }; // DeclScope
  void foo() {
      { // CompoundStmtScope
          int x = 3;
      }
  }
- Parse C++ = delete or = default as ActOnStartOfFunctionDef
- Take a snapshot of the FP features (floating-point options) to restore them later:
  void f() {
      #pragma float_control(precise, on)
      float x = a + b; // precise mode
      {
          #pragma float_control(precise, off)
          float y = c + d; // fast-math mode
      }
      float z = e + f; // still in precise mode, outer setting restored
  }
- Sema checks if the base of the parsing is correct
- Skip what should be skipped for delayed template parsing
- Clear the ParsingDeclarator RAII before parsing the body
- Finish parsing for special states (e.g., = default)
- If using auto, transform it into a template by adding depth to the depth tracker (this is normally done by template <> and needs to be done manually for auto f(auto foo) in C++20)
- If skipping the function body, finish parsing and return it
- Parse try
- If :, parse the constructor initializer
- Parse late __attribute__
- Return ParseFunctionStatementBody
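The snapshot-and-restore step for floating-point settings is a classic RAII save/restore. Below is a minimal sketch that replays the float_control example from the list; `ToyFPState`, `ToyFPFeaturesRAII` and `demoFP` are hypothetical, and a single flag stands in for the whole set of FP options.

```cpp
#include <cassert>
#include <string>

// Toy floating-point state: one flag instead of the full option set.
struct ToyFPState {
  bool Precise = false;
};

// Saves the current state on construction and restores it on destruction.
class ToyFPFeaturesRAII {
  ToyFPState &State;
  ToyFPState Saved;

public:
  explicit ToyFPFeaturesRAII(ToyFPState &S) : State(S), Saved(S) {}
  ~ToyFPFeaturesRAII() { State = Saved; }
};

// Replays the example: returns the mode seen at x, y and z as '1' or '0'.
std::string demoFP() {
  ToyFPState FP;
  std::string Seen;
  FP.Precise = true;                 // #pragma float_control(precise, on)
  Seen += FP.Precise ? '1' : '0';    // x
  {
    ToyFPFeaturesRAII Guard(FP);     // entering the inner compound statement
    FP.Precise = false;              // #pragma float_control(precise, off)
    Seen += FP.Precise ? '1' : '0';  // y
  }                                  // guard restores the outer setting
  Seen += FP.Precise ? '1' : '0';    // z
  return Seen;
}
```

The guard makes the restore automatic, so an early return or an error path inside the block cannot leak the inner pragma into the outer scope.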
Parser::ParseFunctionStatementBody
- Save the { location for diagnostics
- Setup RAII for C++ method pragma internal state (mainly for Windows)
- Main parsing function: ParseCompoundStatementBody
- If the function body is invalid, create a bogus CompoundStmt
- Exit the body scope
- Return the finished parsed body
Parser::ParseCompoundStatementBody
- Create the feature RAII
- Start with pragmas
- Parse __label__ (can only be at the start of a scope)
  {
      __label__ retry1;
      __label__ retry2;
      __label__ done1, done2;
  }
- Set ParsedStmtContext to Compound|InStmtExpr
- While not at the end of the function:
  - Try to parse misplaced C++20 imports
  - Parse redundant semicolons ;;;;;;;;;
  - If token is not __extension__, call ParseStatementOrDeclaration()
  - Else, consume all __extension__ and parse the statement, but silence extension warnings
  - If the last statement is valid, push it to the Stmts vector
- Check the last statement
- Emit FP evaluation method warnings for unsupported targets
- Gracefully handle the closing } or missing braces
- Build and return a CompoundStmt or StmtExpr via Sema
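The statement loop in the middle of that list can be sketched over pre-split tokens. This is a toy under strong assumptions: statements are token runs ending at `;`, nested braces are not handled, and `parseCompoundBody` is a hypothetical free function that returns statement text instead of AST nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy compound-statement parser: collects the statements between '{' and '}'.
std::vector<std::string>
parseCompoundBody(const std::vector<std::string> &Toks) {
  std::vector<std::string> Stmts;
  std::size_t I = 0;
  if (I < Toks.size() && Toks[I] == "{")
    ++I;                                    // consume '{'
  while (I < Toks.size() && Toks[I] != "}") {
    if (Toks[I] == ";") {                   // redundant semicolon
      ++I;
      continue;
    }
    while (I < Toks.size() && Toks[I] == "__extension__")
      ++I;                                  // consume every __extension__
    std::string Stmt;
    while (I < Toks.size() && Toks[I] != ";" && Toks[I] != "}") {
      if (!Stmt.empty())
        Stmt += ' ';
      Stmt += Toks[I++];
    }
    if (I < Toks.size() && Toks[I] == ";")
      ++I;                                  // consume the terminating ';'
    if (!Stmt.empty())
      Stmts.push_back(Stmt);                // keep only non-empty statements
  }
  return Stmts;
}
```

Given `{ ; ; x = 1 ; __extension__ y = 2 ; }` the result is the two statements `x = 1` and `y = 2`: the stray semicolons and the `__extension__` marker are consumed without producing statements of their own.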
ParseStatementOrDeclaration
To be explained
also the sema validation