Open Source Blog
News about Google's open source student programs and software releases
Gumbo: A C library for parsing HTML
Tuesday, August 13, 2013
We're pleased to announce the open source release of the
Gumbo HTML parser
, a C implementation of the
HTML5 parsing algorithm
.
One of the big accomplishments of the HTML5 standard was to standardize the HTML parsing algorithm, so that all browsers see the same HTML document in the same way. So far, most implementations of this algorithm have either been tied to specific browsers or rendering engines, or they've been written in specific scripting languages. This makes it hard to write quick one-off tools to manipulate and cleanup HTML if you don't happen to be working in a language that already has an HTML5-compatible parsing library.
Gumbo seeks to provide a simple library that can serve as a basic building block for linters, refactoring tools, templating languages, page analysis, and other small programs that need to manipulate HTML. It's written in pure C for ease of interfacing with other languages, and has no outside dependencies. Gumbo was built from the start to support source locations and correlating nodes in the parse tree with positions in the original text.
For more information including download, installation, and usage instructions, please visit the
Gumbo project page
.
By Jonathan Tang, Search Features team
Labels
gsoc
423
releases
182
conference
90
gci
80
ghop
55
meetups
49
Linux
28
GSoC Meetups
25
Python
25
students
25
project hosting
24
open source release
23
hackathon
21
App Engine
16
C++
16
Git
14
OSCON
14
JavaScript
13
library
13
Eclipse
12
KDE
12
games
12
GNOME
11
student programs
11
testing
11
Android
10
R
10
BSD
9
Go
9
Java
9
accessibility
9
education
9
security
9
Chrome
8
HTML5
8
Subversion
8
awards
8
maps
8
Chromium
7
GSoC 10 Things
7
Google Cloud Platform
7
Google Earth
7
Selenium
7
database
7
fonts
7
licensing
7
usability
7
Django
6
Google I/O
6
Samba
6
contest
6
documentation
6
events
6
machine learning
6
science
6
Dart
5
Free Software Foundation
5
GCC
5
Gerrit
5
Haskell
5
government
5
standards
5
statistics
5
Creative Commons
4
GNU
4
GitHub
4
Perl
4
mentors
4
mobile
4
protocol buffers
4
season of usability
4
webdriver
4
BioJS
3
C
3
CSS
3
Google Compute Engine
3
JSON
3
Mercurial
3
PHP
3
Unicode
3
fun propulsion lab
3
internationalization
3
open data
3
patents
3
profiles
3
translation
3
Haiku
2
Kubernetes
2
Objective-C
2
deep learning
2
hardware
2
ios
2
k8s
2
metabrainz
2
research
2
time zones
2
wrap-up
2
.NET
1
ADC
1
BigQuery
1
C#
1
FOSSASIA
1
Firebase
1
GIS
1
Google App Engine
1
HUES
1
LIDAR
1
NRNB
1
Neural Networks
1
OpenMRS
1
PowerShell
1
Ruby
1
SCoRe
1
SLAM
1
Science Journal
1
ZuriHac
1
algorithms
1
angular
1
artificial intelligence
1
audio
1
australia
1
bazel
1
big data
1
cardboard
1
clojure
1
compression
1
computer vision
1
dataset
1
debugging
1
energy
1
functional programming
1
gcloud
1
gmail
1
images
1
language
1
lisp
1
logo
1
making
1
melange
1
musicbrainz
1
natural language
1
nmap
1
peer bonus program
1
performance
1
report card
1
robotics
1
sugar labs
1
ui automation
1
virtual reality
1
webvr
1
zopfli
1
Archive
2016
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2015
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2014
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2013
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2012
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2011
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2010
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2009
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Jan
2008
Dec
Nov
Oct
Sep
Aug
Jul
Jun
May
Apr
Mar
Feb
Subscribe
Google Summer of Code
on
Google Code-in
on
Follow @gsoc
Visit
Google Open Source Programs Office
for more information.