html tokenizer
Hypertext Markup Language (HTML) is the predominant markup language for web pages.
This article discusses the design logic and implementation of a simple HTML tokenizer.
outline:
- Introduction to tokenizers
- Why an HTML tokenizer?
- HTML tokens
- Example of tokenizing
- Prototype using Java
introduction to Tokenizers:
A tokenizer is a tool that divides a string into smaller pieces. This process is called tokenizing, and the pieces are called tokens.
Why an HTML tokenizer?:
An HTML tokenizer is used to extract information from web pages. A good example is converting a table in a web page into an Excel sheet.
HTML tokens:
HTML code can be divided into two types of text. The first is plain text. The second is tag text.
Tags start with (<) and end with (>). This means that plain text is whatever comes before a (<) or after a (>).
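The rule above can be written as a small check. This is only a minimal sketch: the class name TokenKind and the method isTag are made up for illustration and are not taken from the article's downloadable code.

```java
class TokenKind {
    // A token counts as tag text when it starts with '<' and ends with '>'.
    // Everything else is treated as plain text.
    static boolean isTag(String token) {
        return token.length() >= 2
                && token.startsWith("<")
                && token.endsWith(">");
    }

    public static void main(String[] args) {
        System.out.println(isTag("<body>"));              // tag text
        System.out.println(isTag("This is a link for ")); // plain text
    }
}
```

The check looks only at the first and last character, so it does not validate what is inside the tag.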
Example of tokenizing:
Assume the following HTML code:
<html> <head> <title > MY EXAMPLE PAGE</title> </head><body>
This is a link for <a href="example.com"> example.com</a >
</body>
</html>
—example #1:
Now let us run a generic tokenizer on it with the delimiters (" \n\t").
Here \t is a tab, \n is a new line, and the first character of the delimiter string is a space.
Results of the tokenizing process should be:
- <html>
- <head>
- <title
- >
- MY
- EXAMPLE
- PAGE</title>
- </head><body>
- This
- is
- a
- link
- for
- <a
- href="example.com">
- example.com</a
- >
- </body>
- </html>
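Example #1 can be reproduced with Java's standard java.util.StringTokenizer. This is a sketch under assumptions: the article does not say which generic tokenizer it used, the class name WhitespaceDelimiterExample is hypothetical, and the sample page is typed with straight quotes around the href value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class WhitespaceDelimiterExample {
    // The sample page from the article (straight quotes assumed).
    static final String HTML =
            "<html> <head> <title > MY EXAMPLE PAGE</title> </head><body>\n"
            + "This is a link for <a href=\"example.com\"> example.com</a >\n"
            + "</body>\n"
            + "</html>";

    // Splits text on every character contained in delims.
    static List<String> tokenize(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Space, new line and tab are the delimiters of example #1.
        for (String token : tokenize(HTML, " \n\t")) {
            System.out.println("- " + token);
        }
    }
}
```

Running main prints one token per line, in the same order as the list above.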
—example #2:
Now let us run a generic tokenizer on it with the delimiters ("<>").
Results of the tokenizing process should be:
- html
- head
- title
- MY EXAMPLE PAGE
- /title
- /head
- body
- This is a link for
- a href="example.com"
- example.com
- /a
- /body
- /html
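The same experiment with the ("<>") delimiters can be sketched with java.util.StringTokenizer as well. The assumptions are the same as before: the choice of StringTokenizer, the class name AngleBracketDelimiterExample and the helper countBlank are illustrative, not taken from the article's code. The sketch prints each token between square brackets so that tokens made only of spaces or new lines stay visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class AngleBracketDelimiterExample {
    // The sample page from the article (straight quotes assumed).
    static final String HTML =
            "<html> <head> <title > MY EXAMPLE PAGE</title> </head><body>\n"
            + "This is a link for <a href=\"example.com\"> example.com</a >\n"
            + "</body>\n"
            + "</html>";

    // Splits text on every character contained in delims.
    static List<String> tokenize(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Counts the tokens that contain nothing but whitespace.
    static long countBlank(List<String> tokens) {
        return tokens.stream().filter(t -> t.trim().isEmpty()).count();
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize(HTML, "<>");
        for (String token : tokens) {
            // Show new lines as \n so every token stays on one output line.
            System.out.println("[" + token.replace("\n", "\\n") + "]");
        }
        System.out.println("whitespace-only tokens: " + countBlank(tokens));
    }
}
```

With this input the sketch yields 18 tokens, 5 of which contain only whitespace. Those are the gaps between neighbouring tags, and the remaining 13 tokens correspond to the entries in the list above.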
We can conclude that we need to build a customized tokenizer in order to get better tokens.
prototype:
Here we will try to construct a prototype of the HTML tokenizer. This step requires us to define an algorithm that gives us the expected results.
First we need to choose the delimiters. Examples #1 and #2 show the advantages and disadvantages of their delimiters. A quick review of the results points out these differences:
- In example #1, both plain text and tag text are divided into smaller pieces. This adds analysis steps, since we would need to reconstruct the text from those pieces.
- In example #2, plain text is left untouched, which is good. However, the tag text is missing its (<) and (>). There are also some unwanted empty tokens (spaces only).
There are more differences than the points above, but they are beyond the scope of this article.
If we look at long HTML tags, we can conclude that the problem in example #1 is more complex than the one in example #2, because many HTML tags carry a large number of attributes. Here is an example:
<input class="exampleClass" name="example" type="checkbox" id="exId" value="AAA" checked="checked" />
So we will use the delimiters from example #2.
Draft of pseudocode (not complete):
1. read the html file into a string
2. make a list to store the final tokens
3. make a buffer to collect characters
4. for i=0 to i<string.length
   a. if it is the first character, then append it to the buffer
   b. else if it is '<', then
      i. make a token from the existing buffer (skip this and the next step if the buffer is empty)
      ii. add the new token to the list
      iii. buffer = new buffer
      iv. append '<' to the buffer
   c. else if it is '>', then
      i. append it to the buffer
      ii. make a token from the buffer
      iii. add the token to the list
      iv. buffer = new buffer
   d. else (it is neither '<' nor '>')
      i. append it to the buffer
5. if the buffer is not empty after the loop, make a token from it and add it to the list
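The steps above can be sketched in Java as follows. This is not the downloadable HtmlTokenizer code, which may differ; the class name HtmlTokenizerSketch and the helper flush are made up for illustration. Two choices are assumptions of this sketch: buffers that are empty or contain only whitespace are dropped instead of becoming tokens, and whatever is left in the buffer after the loop is emitted as a last token. With empty buffers skipped, the first character needs no special case.

```java
import java.util.ArrayList;
import java.util.List;

class HtmlTokenizerSketch {
    // Splits html into tag tokens (with their '<' and '>') and plain text tokens.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                // A tag starts: what was collected so far is plain text.
                flush(buffer, tokens);
                buffer.append(c);
            } else if (c == '>') {
                // A tag ends: close the token, keeping the '>'.
                buffer.append(c);
                flush(buffer, tokens);
            } else {
                buffer.append(c);
            }
        }
        // Assumption: trailing text after the last tag is still a token.
        flush(buffer, tokens);
        return tokens;
    }

    // Adds the buffer content as a token, unless it is empty or
    // whitespace only (assumption), and then clears the buffer.
    private static void flush(StringBuilder buffer, List<String> tokens) {
        String token = buffer.toString();
        if (!token.trim().isEmpty()) {
            tokens.add(token);
        }
        buffer.setLength(0);
    }

    public static void main(String[] args) {
        String html = "<html> <head> <title > MY EXAMPLE PAGE</title> </head><body>";
        for (String token : tokenize(html)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

Plain text tokens are kept untrimmed here, so " MY EXAMPLE PAGE" keeps its leading space. Trimming them would be a one-line change in flush.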
Download the Java code of the HTML tokenizer from here: HtmlTokenizer
Finish