building a lexer in python – a tutorial

knowing how to build a lexer allows you to extend your applications and opens up a whole new world. You can write your own DSLs or your own language or just better separate symbols: in other words, it allows you to have more control over a string

what is a lexer?

from user input to code execution there are three main steps either for interpreters or compilers:

  1. lexical analysis (1)
  2. parsing (2)
  3. code generation (example: to machine code or bytecode)
  4. execution (3)

a lexer is a tool that performs lexical analysis

difference between lexical analysis and scanning

in the beginning, scanning and lexical analysis were two different steps but due to the increased speed of processors, they now refer to one and the same process and are used interchangeably.

what is scanning?

scanning means to pass over / scan a string character by character (char by char).

what is lexical analysis?

it is the process whereby the scanned characters are turned into lexemes

what is a lexeme?

a lexeme is a recognised piece of string

let us take:


we might build a parser that outputs the following lexemes :

  • lexeme 1: the
  • lexeme 2: quick
  • lexeme 3: brown
  • lexeme 4: fox

now let us take a simili code:

for(i; i<arr.length; i++){ }

  • lexeme 1: for
  • lexeme 2: (
  • lexeme 3: i
  • lexeme 4: ;
  • lexeme 5: i
  • lexeme 6: <
  • lexeme 7: arr
  • lexeme 8: .
  • lexeme 9: length
  • lexeme 10: ;
  • lexeme 11: i
  • lexeme 12: ++
  • lexeme 13: )
  • lexeme 14: {
  • lexeme 15: }

so our task for this post will  be to build a program that can separate those pieces

what does a lexer needs as input?

a lexer needs two things : the source code and keywords

what is a keyword?

a keyword is a lexeme that has a special meaning to the lexer

normally words like print, from, to are known as keywords, however, symbols such as ( , { can also be considered as keywords

types of keywords

there are single character keywords e.g. whitespace

there are multi-char  keywords such as print

ignoring keywords

there are cases where we might want to ignore keywords as in the case where they appear between ” ” or in comments

what are ids?

ids are user-defined names, like the names of namespaces, variables, classes and functions. functions in the standard library are only predefined ids. normally keyword names are banned from being used as ids

some observations

let us compare :

the quick brown fox


for(i; i<arr.length; i++){ }

for both of the strings, lexing could be carried out. the first case is simple

but for the next case we observe a general rule. we break if the next character is a keyword be it a special word or symbol

in reality we also add the condition of if the element itself is a keyword

special cases

now let us take the case of ++, + is a keyword and ++ too but how do we differenciate between them or ==?

that’s when we use flags to determine which context are we in

flags are just variables that act like switches e.g.

x = 0

if … : x = 1

if x == 1 : …

or use some checks !

writing the scanner

basically we just need a program that outputs a char at one time. in python it is easy :

which outputs:

just some fancy modifications :


writing the basic lexer

now that we can read a string char by char, we can now check if next char is a keyword

let us do it for the phrase:

the beautiful white moon

where white space is a keyword, referring to above :

we implement something similar to it i.e checking if next char is a white space:

which outputs :

moon is missing as there is no white space after it, we fix this edge case by printing the lexeme after the loop :

output :

fixing whitespace addition

now white space got added to our lexeme, to fix we just add a character if the current char is not a whitespace :

output :

running on a real piece of code

let us take this piece of java code taken from tutorialspoint:

our first task is to identify single-char and multi-char keywords


public, class, static, void, main, string, int, for


{ } ( ) ; [ ] : . \ ” *

we did not consider System as a keyword as in the language, System is the name of an id (user-defined name)

now we can feed those keywords to our lexer

preassumption on whitespace

since whitespace still separates between like public and class:

we’ll keep the if next == whitespace rule

source codes are just pieces of strings

we could have ignored newline was it not for single line comments

a first attempt:

the added conditions test if the next char is a single-char keyword or if what we already have at hand after adding a char is a keyword, if so, we then check if our variable is not empty else we’ll print an empty lexeme. we also replaced newline by <newline> to better distinguish it

the above code outputs in:

see how it magically separates 10 and , System and . , String and name … all perfect but : we have  a glitch

a glitch and the introduction the next step!

see at /* and */, it considered them as / then * and not as a single entity. that is because we defined * as a keyword, though we did not use it here but we use it in multiplication like : 12 * 12

so, our basic lexer has a confusion at this junction, we can for the time being tackle it like :

we modify only the loop :

we just added some checks to make things clear with the result that now the comments are recognised as such :


we built a lexer by voluntarily leaving out regex given that some lookaheads is a breeze in py. we’ll move on next time to the tokeniser.

in this post, brevity was voluntarily left out, favouring a more complete approach, more programming than jargon usage!


Lives in Mauritius, cruising python waters for now.

5 thoughts on “building a lexer in python – a tutorial”

Comments are closed.