building an indentation lexer in python – a tutorial

scoping by indentation is an alternative to curly braces, used by languages like python and pug. it is arguably more human-friendly, and newer languages tend to move towards friendlier syntaxes (livecode is one example). in this post we’ll build an indentation analyser that tells us when indentation levels are mixed, just as python does. in fact, we’ll analyse python code snippets.

this article builds on the earlier one on writing a lexer

bonus : exercises included !

analysing indented code

let us take a python code snippet from sqlite3’s dbapi2.py (python version 3.4)

the first step when using a lexer is defining keywords. do we take whitespace into consideration or not? here we must.

whitespace sensitivity

as indentation in python is made of spaces or tabs, we must include them as keywords. in this article, we’ll ignore tabs to keep things simple

a basic run of a lexer with keywords :

SINGLE

( ) : . ” ” * , <whitespace> <newline>

MULTI

def return
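as a rough illustration of such a run, here is a minimal sketch of a lexer driven by the two keyword lists. the names tokenize and show are made up for this example and are not from the lexer article, and the quote entries from the list are left out for brevity:

```python
SINGLE = ['(', ')', ':', '.', '*', ',', ' ', '\n']   # one-character keywords
MULTI = ['def', 'return']                            # multi-character keywords

def tokenize(source):
    """Split source into lexemes; every SINGLE keyword ends the current word."""
    lexemes = []
    word = ''
    for char in source:
        if char in SINGLE:
            if word:
                lexemes.append(word)
                word = ''
            lexemes.append(char)
        else:
            word += char
    if word:
        lexemes.append(word)
    return lexemes

def show(lexeme):
    """Render whitespace and newlines visibly and tag MULTI keywords."""
    names = {' ': '<whitespace>', '\n': '<newline>'}
    if lexeme in MULTI:
        return '<{}>'.format(lexeme)
    return names.get(lexeme, lexeme)

print(' '.join(show(lexeme) for lexeme in tokenize('def f(x):\n    return x')))
```

whitespace and newlines come out as lexemes of their own, which is what the rest of the article relies on.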

running the analysis

our code :

outputs (we replaced whitespace with <whitespace> and newlines with <newline>) :

but let us look at a single level :

notice that after the colon there is a newline, then 4 whitespaces, then a non-whitespace character

<:> <\n> <\s> <\s> <\s> <\s> <def>

the basic rule

so if a newline follows <:>, the whitespace after it, up to the first non-whitespace char, gives us an indentation level
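the basic rule can be sketched over a list of lexemes as below. this is only an illustration, and the function name indents_after_colon is an assumption of this example:

```python
def indents_after_colon(lexemes):
    """Return the number of spaces found after every colon-newline pair."""
    counts = []
    i = 0
    while i < len(lexemes) - 1:
        if lexemes[i] == ':' and lexemes[i + 1] == '\n':
            j = i + 2
            # walk over the spaces until the first non-whitespace lexeme
            while j < len(lexemes) and lexemes[j] == ' ':
                j += 1
            counts.append(j - (i + 2))
            i = j
        else:
            i += 1
    return counts

print(indents_after_colon([':', '\n', ' ', ' ', ' ', ' ', 'def']))
```

note that this strict version expects the newline to come right after the colon.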

but there is a problem. suppose we get

<:> <\s> <\s> <\s> <\n> <\s> <\s> <\s> <\s> <def>

where a user puts lots of spaces after the colon. so we must stay whitespace-sensitive but not record whitespace as the last lexeme. then, when passing over <\n>, we just check whether the last lexeme was <:>

we’ll have a last lexeme variable. reminder: we keep whitespace as a keyword since it separates def from function_name, etc.
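a small sketch of the last lexeme idea, assuming the lexemes are already in a list. the name block_openers is made up for this example:

```python
def block_openers(lexemes):
    """Return the indexes of newlines whose last non-space lexeme is ':'."""
    last_lexeme = None
    found = []
    for index, lexeme in enumerate(lexemes):
        if lexeme == ' ':
            continue            # spaces are seen but never stored as last_lexeme
        if lexeme == '\n' and last_lexeme == ':':
            found.append(index)
        last_lexeme = lexeme
    return found

print(block_openers([':', ' ', ' ', ' ', '\n', ' ', ' ', ' ', ' ', 'def']))
```

the trailing spaces after the colon no longer hide the fact that the newline closes a line ending in <:>.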

defining indentation level

indentation level is just a whitespace count. we’ll have a whitespace counter and an indentation level variable

getting indentation level

pseudocode :

explanations

with indent_on we check whether or not we should count whitespace towards the indentation level
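one possible way to put indent_on, the whitespace counter and the last lexeme together is sketched below, working one character at a time. the function name indent_counts and the sample snippet are assumptions of this example, not the code from the article:

```python
def indent_counts(source):
    """Count the spaces that follow each colon-newline pair, char by char."""
    counts = []
    last_lexeme = None      # last non-space character seen
    indent_on = False       # True while we are counting indentation spaces
    whitespace_count = 0
    for char in source:
        if indent_on:
            if char == ' ':
                whitespace_count += 1
                continue
            # first non-space character: the indent is complete
            counts.append(whitespace_count)
            indent_on = False
        if char == '\n' and last_lexeme == ':':
            indent_on = True
            whitespace_count = 0
        if char != ' ':
            last_lexeme = char
    return counts

code = "def f(x):\n    if x:   \n        return x\n"
print(indent_counts(code))
```

the second line of the sample has trailing spaces after the colon on purpose, to show they do not disturb the count.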

our code :

output:

see, the numbers represent our indentation level, but there is a glitch: once detected, our indent level should remain constant, so that we can compare later indents against it and say when they don’t match! we’ll do that by having a first pass flag

resetting count and a general case

we get indentation not only after <:> <\n> but also after any <\n> whose next char is whitespace

if char == \n and next_char == whitespace then

we start checking for indentations and reset counts
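a hedged sketch of this general case, where the count is reset at every newline that is followed by a space. indent_trace is a hypothetical name used only here:

```python
def indent_trace(source):
    """Return the space count of every line that starts with a space."""
    trace = []
    indent_on = False
    whitespace_count = 0
    for index, char in enumerate(source):
        next_char = source[index + 1] if index + 1 < len(source) else ''
        if indent_on:
            if char == ' ':
                whitespace_count += 1
                continue
            trace.append(whitespace_count)
            indent_on = False
        if char == '\n' and next_char == ' ':
            # start checking for indentation and reset the count
            indent_on = True
            whitespace_count = 0
    return trace

code = "def f(x):\n    if x:\n        return x\n    return 0\n"
print(indent_trace(code))
```

lines that do not start with a space are simply not reported by this sketch.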

code :

output:

see how the count goes from 4 to 8 then back

some tests

testing

gives out (lexemes omitted)

meaning it successfully detected our indentation level changes

detecting mixed indentation levels

the check for a correct indentation level is

indent_count % indent_level == 0

in other words, if on reaching the end of an indent the whitespace count is exactly divisible by the level first detected, our check passes
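the check can be tried on a plain list of whitespace counts. this is a minimal sketch, assuming the counts were already collected; check_levels is a made-up name:

```python
def check_levels(indent_counts):
    """Return the counts that are not a multiple of the first indent seen."""
    indent_level = None          # set on the first pass only
    bad = []
    for indent_count in indent_counts:
        if indent_count == 0:
            continue             # nothing to check on unindented lines
        if indent_level is None:
            indent_level = indent_count
        if indent_count % indent_level != 0:
            bad.append(indent_count)
    return bad

print(check_levels([4, 8, 4, 6, 12, 3]))
```

with a first level of 4, the counts 6 and 3 fail the check while 8 and 12 pass.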

implementation

if we try to run it on this piece of text :

on the bad levels it outputs :

knowing on which line the error occurred

when an error occurs, we’ll break and give a message to the user :

for that we’ll implement a line counter :
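a sketch of how a line counter could be combined with the level check. the function name find_indent_error and the wording of the message are assumptions of this example; returning at the first error plays the role of the break:

```python
def find_indent_error(source):
    """Return a message for the first badly indented line, or None."""
    line_number = 1
    indent_level = None     # fixed on the first indented line
    indent_on = False
    whitespace_count = 0
    for index, char in enumerate(source):
        next_char = source[index + 1] if index + 1 < len(source) else ''
        if indent_on:
            if char == ' ':
                whitespace_count += 1
                continue
            indent_on = False
            if indent_level is None:
                indent_level = whitespace_count
            if whitespace_count % indent_level != 0:
                return 'indentation error on line {}'.format(line_number)
        if char == '\n':
            line_number += 1    # every newline starts a new line
            if next_char == ' ':
                indent_on = True
                whitespace_count = 0
    return None

print(find_indent_error("def f(x):\n    if x:\n      return x\n"))
```

the sample has 6 spaces on its third line against a first level of 4, so that line is the one reported.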

complete code :

how to detect mixed tabs and spaces?

we’ll have an indentation character variable, detected on the first pass. then, when checking for indentation (when indent_on is true), if the char is not equal to that variable we raise an error

hope you enjoyed the article! the debug print logs were added intentionally.
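a rough sketch of that idea, returning the offending line number instead of raising. find_mixed_indent is a hypothetical name, and unlike the rest of the article this sketch does look at tabs:

```python
def find_mixed_indent(source):
    """Return the number of the first line mixing tabs and spaces, else None."""
    indent_char = None      # detected on the first indented line
    line_number = 1
    indent_on = False
    for index, char in enumerate(source):
        next_char = source[index + 1] if index + 1 < len(source) else ''
        if indent_on:
            if char in (' ', '\t'):
                if indent_char is None:
                    indent_char = char
                elif char != indent_char:
                    return line_number
                continue
            indent_on = False
        if char == '\n':
            line_number += 1
            indent_on = next_char in (' ', '\t')
    return None

print(find_mixed_indent("def f():\n    a = 1\n\tb = 2\n"))
```

the first indented line fixes the indentation character, and any later indent using the other one is flagged.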

Exercises

1] implement the above using classes (implement a lexer class)

2] implement tabs and whitespace detection

3] how would you know when a function starts and ends? implement it!


Lives in Mauritius, cruising python waters for now.