Building a JSON Parser from Scratch
A Deep Dive into Parser Development

Have you ever used JSON.parse() and wondered about the magic happening behind the scenes? I recently embarked on a journey to build a JSON parser from scratch, and what I discovered was both challenging and fascinating. In this post, I'll walk you through the entire process of building a parser, from tokenization to parsing, and share the valuable lessons learned along the way.
Why Build Your Own Parser?
"Why reinvent the wheel?" you might ask. While it's true that we have excellent JSON parsers available, building one from scratch offers invaluable insights into:
How parsers and compilers work
The intricacies of handling text processing
Error handling in complex systems
The JSON specification itself
Plus, it's just plain fun to build something from the ground up!
Understanding the Architecture
Before diving into the code, let's understand what we're building. Our JSON parser consists of three main components:
Tokenizer: Breaks raw JSON text into tokens
Parser: Converts tokens into data structures
Error Handler: Provides meaningful error messages
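The pipeline runs in one direction: raw JSON text goes into the tokenizer, the resulting tokens go into the parser, and the parser produces Python objects. Either stage can raise an error, which the error handler turns into a readable message.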
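In code, the data passed between these components might look roughly like the sketch below. Only some of the names come from the snippets in this post (TokenType, TokenizerError, ParserError and a handful of token kinds); the Token fields, the remaining token kinds, and the shared JSONError base class are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any


class TokenType(Enum):
    LEFT_BRACE = auto()
    RIGHT_BRACE = auto()
    LEFT_BRACKET = auto()
    RIGHT_BRACKET = auto()
    COLON = auto()
    COMMA = auto()
    STRING = auto()
    NUMBER = auto()
    TRUE = auto()
    FALSE = auto()
    NULL = auto()
    EOF = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    value: Any = None  # payload for STRING / NUMBER tokens
    line: int = 1      # position info is what the error handler reports
    column: int = 0


class JSONError(Exception):
    """Hypothetical base class that optionally carries a source position."""

    def __init__(self, message, line=None, column=None):
        if line is not None and column is not None:
            message = f"{message} at line {line}, column {column}"
        super().__init__(message)
        self.line = line
        self.column = column


class TokenizerError(JSONError):
    pass


class ParserError(JSONError):
    pass
```

Keeping the position on every token means the parser can report where a problem is without going back to the raw text.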

The Tokenizer: The First Step
The tokenizer is like a skilled reader who breaks down text into meaningful chunks. Let's see how it handles a simple JSON string:
{
"name": "John",
"age": 25,
"hobbies": ["coding", "reading"]
}
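To make "meaningful chunks" concrete, here is a small stand-in that prints the token stream for that input. It is a regex-based sketch, not the character-by-character tokenizer described in this post, and the function name token_stream and the (kind, lexeme) tuple shape are assumptions. The number pattern is deliberately loose (it accepts leading zeros).

```python
import re

# One alternative per token kind; whitespace is matched so it can be skipped.
TOKEN_RE = re.compile(r'''
    (?P<STRING>"(?:[^"\\]|\\.)*")
  | (?P<NUMBER>-?[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?)
  | (?P<PUNCT>[{}\[\]:,])
  | (?P<KEYWORD>true|false|null)
  | (?P<WS>\s+)
''', re.VERBOSE)

PUNCT_NAMES = {
    '{': 'LEFT_BRACE', '}': 'RIGHT_BRACE',
    '[': 'LEFT_BRACKET', ']': 'RIGHT_BRACKET',
    ':': 'COLON', ',': 'COMMA',
}


def token_stream(text):
    """Return a list of (kind, lexeme) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"Unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == 'WS':
            continue
        lexeme = m.group()
        if kind == 'PUNCT':
            kind = PUNCT_NAMES[lexeme]
        elif kind == 'KEYWORD':
            kind = lexeme.upper()
        tokens.append((kind, lexeme))
    return tokens


sample = '{"name": "John", "age": 25, "hobbies": ["coding", "reading"]}'
for kind, lexeme in token_stream(sample):
    print(kind, lexeme)
# The stream starts with: LEFT_BRACE {, STRING "name", COLON :
```

The sample produces 17 tokens in total; the parser never sees whitespace.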
How Tokenization Works
The tokenizer reads this character by character, producing a stream of tokens. Here's the process:
- Character Recognition:
def scan_token(self):
    char = self.advance()
    if char == '{':
        self.add_token(TokenType.LEFT_BRACE)
    elif char == '"':
        self.string()  # Handle string separately
    elif char.isdigit() or char == '-':
        self.number()  # Handle number separately
    # ... handle other characters
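The snippet above leans on a few cursor helpers (advance, peek, add_token) that are not shown. A minimal sketch of how they could work follows; the class name Scanner, the '\0' end-of-input sentinel, the tuple tokens, and the scan_punctuation demo method are all assumptions, not the project's actual code.

```python
class Scanner:
    """Minimal cursor over the source text, tracking line and column."""

    PUNCTUATION = {
        '{': 'LEFT_BRACE', '}': 'RIGHT_BRACE',
        '[': 'LEFT_BRACKET', ']': 'RIGHT_BRACKET',
        ':': 'COLON', ',': 'COMMA',
    }

    def __init__(self, source):
        self.source = source
        self.current = 0  # index of the next unread character
        self.line = 1
        self.column = 0   # characters consumed on the current line
        self.tokens = []

    def is_at_end(self):
        return self.current >= len(self.source)

    def peek(self):
        # A sentinel at end of input keeps calls like peek().isdigit() safe.
        return '\0' if self.is_at_end() else self.source[self.current]

    def advance(self):
        char = self.peek()
        if not self.is_at_end():
            self.current += 1
            if char == '\n':
                self.line += 1
                self.column = 0
            else:
                self.column += 1
        return char

    def add_token(self, kind, value=None):
        self.tokens.append((kind, value))

    def scan_punctuation(self):
        """Demo loop: scan structural characters only, skipping whitespace."""
        while not self.is_at_end():
            char = self.advance()
            if char in self.PUNCTUATION:
                self.add_token(self.PUNCTUATION[char])
            elif char not in ' \t\r\n':
                raise ValueError(f"Unexpected character {char!r} at line {self.line}")
        return self.tokens
```

Separating peek (look without consuming) from advance (consume) is what lets the string and number handlers decide when to stop.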
- String Handling:
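def string(self):
    string_content = []
    while True:
        char = self.peek()
        if char == '"':  # End of string
            self.advance()  # Consume the closing quote
            break
        if char in ('', '\0'):  # End of input; the exact sentinel depends on peek()
            raise TokenizerError("Unterminated string")
        if char == '\\':  # Handle escape sequences
            self.advance()
            escape_char = self.advance()
            # JSON also allows \/ and \uXXXX; the four hex digits after \u
            # are read as ordinary characters here
            if escape_char and escape_char in '"\\/bfnrtu':
                string_content.append('\\' + escape_char)
            else:
                raise TokenizerError("Invalid escape sequence")
            continue  # Do not fall through and append the backslash again
        string_content.append(char)
        self.advance()
    # ... emit a STRING token built from ''.join(string_content)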
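Because the tokenizer keeps each escape in its raw two-character form, something still has to turn \n into a real newline and \u0041 into "A". The helper below is one way that decoding step could look. The name decode_string_body is hypothetical, it raises ValueError rather than the project's TokenizerError, and it leaves UTF-16 surrogate pairs uncombined to keep the sketch short.

```python
SIMPLE_ESCAPES = {
    '"': '"', '\\': '\\', '/': '/',
    'b': '\b', 'f': '\f', 'n': '\n', 'r': '\r', 't': '\t',
}
HEX_DIGITS = '0123456789abcdefABCDEF'


def decode_string_body(raw):
    """Decode the text between the quotes of a JSON string literal."""
    out = []
    i = 0
    while i < len(raw):
        char = raw[i]
        if char != '\\':
            out.append(char)
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("Dangling backslash at end of string")
        esc = raw[i + 1]
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
            i += 2
        elif esc == 'u':
            digits = raw[i + 2:i + 6]
            if len(digits) != 4 or any(c not in HEX_DIGITS for c in digits):
                raise ValueError("Invalid \\u escape")
            out.append(chr(int(digits, 16)))
            i += 6
        else:
            raise ValueError(f"Invalid escape sequence \\{esc}")
    return ''.join(out)
```

For example, decode_string_body('\\u0041') returns 'A', and an unknown escape such as \x is rejected.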
- Number Processing:
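def number(self):
    # The leading digit or '-' was already consumed by scan_token()
    # Handle integers
    while self.peek().isdigit():
        self.advance()
    # Handle decimals: at least one digit must follow the '.'
    if self.peek() == '.':
        self.advance()
        if not self.peek().isdigit():
            raise TokenizerError("Invalid number format")
        while self.peek().isdigit():
            self.advance()
    # Handle scientific notation: at least one digit must follow the exponent
    if self.peek() in ['e', 'E']:
        self.advance()
        if self.peek() in ['+', '-']:
            self.advance()
        if not self.peek().isdigit():
            raise TokenizerError("Invalid number format")
        while self.peek().isdigit():
            self.advance()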
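Once the tokenizer has consumed a number's characters, the lexeme still needs to be validated and converted. One compact way to do both is a regular expression that follows the JSON number grammar (no leading zeros, digits required after '.' and after the exponent). This is a sketch of that idea rather than the code used in this project, and parse_number is a hypothetical name.

```python
import re

# Mirrors the JSON grammar: optional minus, integer part without leading
# zeros, optional fraction, optional exponent.
NUMBER_RE = re.compile(r'-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?')


def parse_number(lexeme):
    """Convert a JSON number lexeme to int or float, or raise ValueError."""
    if NUMBER_RE.fullmatch(lexeme) is None:
        raise ValueError(f"Invalid number format: {lexeme!r}")
    if any(c in lexeme for c in '.eE'):
        return float(lexeme)
    return int(lexeme)
```

Note that '-0' comes back as the integer 0 with this approach, since Python integers have no negative zero, while '-0.0' keeps its sign as a float.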
The Parser: Building the Structure
The parser takes our tokens and builds actual Python objects. It uses a technique called recursive descent parsing, in which each grammar rule (value, object, array) becomes a function that may call the others. That maps naturally onto JSON's nested structure.
Parsing in Action
Let's look at how the parser handles different JSON elements:
- Objects:
def parse_object(self):
    obj = {}
    if not self.check(TokenType.RIGHT_BRACE):
        while True:
            # Get key (must be string)
            if not self.check(TokenType.STRING):
                raise ParserError("Expected string key")
            key = self.advance().value
            self.consume(TokenType.COLON, "Expected ':'")
            value = self.parse_value()
            obj[key] = value
            if not self.match(TokenType.COMMA):
                break
    self.consume(TokenType.RIGHT_BRACE, "Expected '}'")
    return obj
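parse_object relies on three small helpers: check looks at the next token without consuming it, match consumes it only if it has the expected type, and consume insists on the expected type and raises otherwise. A sketch of how they might be written is below. The TokenCursor class, the (kind, value) tuples, and the use of ValueError are assumptions; the project's parser works with token objects and ParserError.

```python
class TokenCursor:
    """Cursor over a token list that ends with an ('EOF', None) token."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def advance(self):
        token = self.peek()
        if token[0] != 'EOF':  # never move past the end marker
            self.pos += 1
        return token

    def check(self, kind):
        """Look at the next token without consuming it."""
        return self.peek()[0] == kind

    def match(self, kind):
        """Consume the next token only if it has the given kind."""
        if self.check(kind):
            self.advance()
            return True
        return False

    def consume(self, kind, message):
        """Require the given kind; raise with `message` otherwise."""
        if not self.check(kind):
            raise ValueError(f"{message}, found {self.peek()[0]}")
        return self.advance()
```

Using match for the optional comma and consume for the mandatory colon and closing brace keeps the grammar readable in the parsing functions.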
- Arrays:
def parse_array(self):
    array = []
    if not self.check(TokenType.RIGHT_BRACKET):
        while True:
            array.append(self.parse_value())
            # A comma must be followed by another value, so "[1,]" is rejected
            if not self.match(TokenType.COMMA):
                break
    self.consume(TokenType.RIGHT_BRACKET, "Expected ']'")
    return array
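Both functions call parse_value, which is where the recursion comes from: it inspects one token and dispatches to the object or array routine, or returns a scalar. The self-contained sketch below shows the whole loop on a plain list of (kind, value) tuples. The function name parse_tokens, the tuple format, and the use of ValueError are assumptions chosen so the example runs on its own.

```python
LITERALS = {'TRUE': True, 'FALSE': False, 'NULL': None}


def parse_tokens(tokens):
    """Parse (kind, value) pairs into Python objects by recursive descent."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else ('EOF', None)

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expect(kind):
        token = advance()
        if token[0] != kind:
            raise ValueError(f"Expected {kind}, found {token[0]}")
        return token

    def parse_value():
        kind, value = advance()
        if kind == 'LEFT_BRACE':
            return parse_object()
        if kind == 'LEFT_BRACKET':
            return parse_array()
        if kind in ('STRING', 'NUMBER'):
            return value
        if kind in LITERALS:
            return LITERALS[kind]
        raise ValueError(f"Unexpected token {kind}")

    def parse_object():
        obj = {}
        if peek()[0] == 'RIGHT_BRACE':
            advance()
            return obj
        while True:
            key = expect('STRING')[1]
            expect('COLON')
            obj[key] = parse_value()
            if peek()[0] != 'COMMA':
                break
            advance()
        expect('RIGHT_BRACE')
        return obj

    def parse_array():
        items = []
        if peek()[0] == 'RIGHT_BRACKET':
            advance()
            return items
        while True:
            items.append(parse_value())
            if peek()[0] != 'COMMA':
                break
            advance()
        expect('RIGHT_BRACKET')
        return items

    result = parse_value()
    expect('EOF')  # reject trailing tokens after the top-level value
    return result
```

Nesting depth in the input becomes call depth in the parser, so extremely deep documents can hit Python's recursion limit.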
Error Handling: The Secret Sauce
Good error handling can make or break a parser. We implemented a robust error system that provides precise error locations and helpful messages.
Example Error Scenarios
- Missing Quotes:
# Input
{"name": John}
# Error Output
Error: Expected string at line 1, column 9
{"name": John}
         ^
- Invalid Number:
# Input
{"age": 12.34.56}
# Error Output
Error: Invalid number format at line 1, column 13
{"age": 12.34.56}
             ^
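Messages in this shape can be produced by a small formatting helper. The sketch below assumes 1-based line numbers and 0-based columns, which is what the two sample outputs are consistent with (column 9 is the "J" of John and column 13 is the second "." when counting from zero). The function names format_error and position_of are hypothetical.

```python
def position_of(source, offset):
    """Translate a character offset into (line, column): 1-based line, 0-based column."""
    line = source.count('\n', 0, offset) + 1
    last_newline = source.rfind('\n', 0, offset)  # -1 when on the first line
    column = offset - (last_newline + 1)
    return line, column


def format_error(message, source, line, column):
    """Render the message, the offending source line, and a caret under the column."""
    lines = source.splitlines()
    snippet = lines[line - 1] if 0 < line <= len(lines) else ''
    caret = ' ' * column + '^'
    return f"Error: {message} at line {line}, column {column}\n{snippet}\n{caret}"


print(format_error("Invalid number format", '{"age": 12.34.56}', 1, 13))
```

Storing only an offset while scanning and converting it to line and column when an error occurs keeps the hot path of the tokenizer simple.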
Real-World Testing and Edge Cases
We tested our parser with various challenging inputs:
- Nested Structures:
# Deep nesting
parse_json('''
{
    "user": {
        "profile": {
            "address": {
                "city": "New York",
                "coordinates": [40.7128, -74.0060]
            }
        }
    }
}
''')
- Special Numbers:
# Scientific notation
parse_json('{"small": 1.23e-10, "large": 1.23E+10}')
# Negative zero
parse_json('{"zero": -0}')
- Unicode and Escapes:
# Unicode characters
parse_json('{"greeting": "t", "emoji": "👋"}')
# Escape sequences
parse_json('{"text": "Line 1\\nLine 2\\t(tabbed)"}')
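A cheap way to check inputs like these is differential testing: run the same text through the new parser and through Python's built-in json module and compare. The harness below is a sketch of that idea, not this project's test suite. It takes the parser as a parameter, so the names compare_with_stdlib and accepted_invalid and the case lists are assumptions.

```python
import json

EDGE_CASES = [
    '{"small": 1.23e-10, "large": 1.23E+10}',
    '{"zero": -0}',
    '{"text": "Line 1\\nLine 2\\t(tabbed)"}',
    '{"unicode": "\\u0041"}',
    '[]',
    '{}',
]

INVALID_CASES = [
    '{"name": John}',
    '{"age": 12.34.56}',
    '[1,]',
    '{"a": 1,}',
]


def compare_with_stdlib(parse, cases=EDGE_CASES):
    """Return the inputs on which `parse` disagrees with json.loads."""
    return [text for text in cases if parse(text) != json.loads(text)]


def accepted_invalid(parse, cases=INVALID_CASES, errors=(Exception,)):
    """Return the invalid inputs that `parse` wrongly accepts."""
    accepted = []
    for text in cases:
        try:
            parse(text)
        except errors:
            continue
        accepted.append(text)
    return accepted
```

With the parser from this post, the calls would be compare_with_stdlib(parse_json) and accepted_invalid(parse_json); both should return an empty list.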
Performance Considerations
While building the parser, we made several performance-related decisions:
- Memory Usage:
Store tokens in a list instead of generating them one by one (this uses more memory than lazy generation, but it keeps look-ahead simple)
Keep track of string positions instead of copying substrings
- Speed Optimizations:
Use string concatenation for small strings
Implement look-ahead for more efficient parsing
Cache repeated token checks
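To illustrate two of the ideas above, tracking positions instead of copying substrings and the alternative of producing tokens lazily, here is a small generator that yields only (kind, start, end) offsets. It is a simplified sketch, not the project's tokenizer: the name iter_spans is hypothetical, and numbers and literals are lumped together as SCALAR.

```python
WHITESPACE = ' \t\r\n'
PUNCTUATION = '{}[]:,'


def iter_spans(text):
    """Lazily yield (kind, start, end) offsets; no substrings are copied."""
    i = 0
    n = len(text)
    while i < n:
        char = text[i]
        if char in WHITESPACE:
            i += 1
        elif char in PUNCTUATION:
            yield ('PUNCT', i, i + 1)
            i += 1
        elif char == '"':
            start = i
            i += 1
            while i < n and text[i] != '"':
                i += 2 if text[i] == '\\' else 1  # skip the escaped character
            if i >= n:
                raise ValueError(f"Unterminated string starting at offset {start}")
            i += 1  # consume the closing quote
            yield ('STRING', start, i)
        else:
            start = i
            while i < n and text[i] not in WHITESPACE + PUNCTUATION + '"':
                i += 1
            yield ('SCALAR', start, i)


text = '{"a": [1, true]}'
for kind, start, end in iter_spans(text):
    print(kind, text[start:end])  # the slice is taken only when needed
```

A list of tokens can still be built with list(iter_spans(text)) when the parser needs random access; the generator form just postpones that decision.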
The Web Interface
We also built a simple web interface using Flask to make the parser more accessible:
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/parse', methods=['POST'])
def parse_json():
    try:
        # silent=True yields None instead of an error for a missing or invalid body
        data = request.get_json(silent=True) or {}
        json_str = data.get('json', '')
        # Create tokenizer and get tokens
        tokenizer = Tokenizer(json_str)
        tokens = tokenizer.tokenize()
        # Create parser and parse tokens
        parser = Parser(tokens)
        parsed_result = parser.parse()
        return jsonify({
            'tokens': [str(token) for token in tokens],
            'parsed': parsed_result
        })
    except (TokenizerError, ParserError) as e:
        # Report parse failures as a client error rather than 200 OK
        return jsonify({'error': str(e)}), 400
Lessons Learned
Building this parser taught me several valuable lessons:
Start Small: Begin with the basics (simple objects and arrays) and gradually add support for more complex features.
Test Early, Test Often: Write tests for each new feature and edge case you discover. Our test suite caught many subtle bugs:
def test_edge_cases():
    assert parse_json('{"empty": ""}') == {"empty": ""}
    assert parse_json('{"space": " "}') == {"space": " "}
    assert parse_json('{"unicode": "\\u0041"}') == {"unicode": "A"}
Error Messages Matter: Clear error messages can save hours of debugging time. Always strive to make them as helpful as possible.
Documentation is Crucial: Good documentation helps others understand and use your code effectively.
Future Improvements
While our parser is functional, there's always room for improvement:
- Performance Optimizations:
Implement streaming for large files
Add parallel processing for large arrays
Optimize string handling
- Feature Additions:
Support for comments (non-standard JSON)
Pretty printing with customizable indentation
JSON Schema validation
- Developer Experience:
Better IDE integration
More detailed error messages
Interactive debugging tools
Conclusion
Building a JSON parser from scratch was an enlightening experience that improved my understanding of:
Text processing and parsing
Error handling and user experience
The importance of thorough testing
Performance optimization techniques
The next time you use JSON.parse(), you'll have a deeper appreciation for what's happening under the hood!
Want to explore the code or contribute? Check out the full project on GitHub. Follow me on Twitter.
Have questions or suggestions? Feel free to open an issue or ask me on Twitter!

