HTML basics

Posted by Dave K on 25 Jun 2009

 

 

This is not a finished product.

 

covering HTML 4, HTML 4.01, and XHTML

Table of Contents

Preface

HTML overview

SGML roots

color schemes

comments

XML and XHTML

page basics

different DTDs

head section

text manipulation tags

new kind of tags

Block tags

lists

unordered list

ordered list

nested lists

definition list

tables

spanning columns and rows

links and anchors

multimedia

user input

deprecated or rarely used tags

 

 

Preface

 

This paper is about how to write HTML web pages from scratch. I don't describe how to use Dreamweaver or FrontPage, but the topics covered here are essential to knowing how to author a web site.

The prerequisites are a text editor like Notepad, a browser like Firefox or IE, and some experience browsing the Internet. It would be helpful to have a general understanding of the ASCII character set and the hexadecimal number system.

The goal of this paper is to give the student enough knowledge of HTML to write a static web site on their own, to work with future international standards, and to learn more.

 

 

HTML overview

It is important to remember that HTML, the main programming language on the Internet, is different from many modern programming languages. Instead of describing the process used to get from point A to point B, it describes the way a document looks at point B. This takes a slightly different thinking process. Instead of figuring out the most efficient way to get someplace, you worry about what you'll do when you get there. Believe it or not, that is a difficult shift for some programmers.

Because it is different, some people dismiss HTML as not a “real programming language”. They forget that HTML is an example of a 4th generation language and what they call a “real programming language”, such as C++ or Java, is a 3rd generation. So in both cases we are “luzers” and rely on some kind of abstraction to make it. If it is so important to define the logical steps from one point to another, why aren't they working in assembly or, better yet, machine language? If solving a problem is important, then sometimes you want an abstraction like HTML.

The way an HTML statement is written is simple enough, but as of mid 2009, when this was written, the standards are in the middle of changing, and the rules will change again with XML and XHTML, which I will cover later. For right now I will cover HTML statements as they have been written since the late 1990s. In my opinion, some of the changes are for the good and some are for the bad. But we need to know both the old and the new. While many people want to change ASAP, there are probably millions of HTML pages written with the old standard.

All HTML is is a bunch of text with special marks to show how text is to be treated (“bold these words”) or how data is arranged (“put a table column here”). This is why HTML is referred to as a “markup” language. One of the new ideas is to separate the data from the layout, so the two statements above would become “make this text strong” and “mark this as a group of text”. Then in another file they define that strong text is bolded and that data in that particular group of text will be displayed as a table column. That is the idea behind Cascading Style Sheets, which I'll cover later.

opening and closing tag syntax

Text is marked with a pair of “tags”, which got HTML the nickname “tagsoup” or JaBoT (Just a Bunch of Tags). It makes me wonder: why aren't C-like languages called a bunch of semicolons or “curlysoup”? One way or another, those tags are simply text inside of angle brackets (< ... >). The first in the pair is called the “opening tag” and is just the tag name inside the brackets. The end tag, aka the closing tag, has a forward slash right after the opening bracket (</ ... >). Between the opening and the closing tags is the text that is affected by that tag. (<tag> ... </tag>)

As an example; the paragraph tag is something like this:

<p>The actual text of the paragraph.</p>

But what about tags that aren't a pair? There are a couple. In XHTML these close themselves: the slash goes at the end of the tag, so the tag ends with a space, a slash, and a right angle bracket ( />).

A line break (<br>) is one of those single tags. It does the same thing as your 'enter' key: it starts a new line in the display.

The first line of text that needs to be broken into two lines <br />The second line of text that needs to be broken into two lines

attributes

Sometimes you need to define some additional data a tag needs to work. For example, you need to tell the image tag (<img>) what file to display. These pieces of data, called attributes, range from information the search engines need to the background color used. There is now a large effort to change the rules so that many (but not all) of these attributes are invalid. I will cover that later in the section on XML.

Some examples of the most common attributes are listed below:

<p id="name">A paragraph</p>

<a href="http://example.com/index.html">click me!</a>

<table width="100%"><tr><td>A spreadsheet</td></tr></table>

 

entities

Sometimes you need to output a character that isn't available in simple text or will confuse the program displaying the HTML. For those cases HTML has something called 'HTML entities'. The most common of those entities are given a name like &nbsp; or &gt;. The names are case-sensitive, so type them exactly as shown. You can also give the number that represents the character, in decimal (&#62;) or hexadecimal (&#x3E;); for the basic characters this is the ASCII number (see the ASCII chart appendix). To understand how this is done you need to understand ASCII and hexadecimal.

One of those entities, the “no break space”, is different and hard to describe. While all it displays is a blank character, where it does or does not show up is what makes it different. All browsers try to format the text and end lines in the appropriate places. But sometimes you want two or more words to stay together. Maybe you have a title that you don't want broken up, so that if the title lands at the end of a line, the whole thing shows up on the next line. In that case, use the “no break space” (&nbsp;) instead of spaces.
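As a small illustration (the book title here is made up for the example), the fragment below joins the three words of a title with no-break spaces so the browser treats them as one unbreakable unit:

```html
<!-- The three words of the title are joined by &nbsp; so they wrap together -->
<p>My favorite book this year was The&nbsp;Old&nbsp;Lighthouse, which I read twice.</p>
```

If the title does not fit at the end of a line, the browser should move all three words to the next line rather than splitting them.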

Below I have a list of the most common HTML entities and in the appendix I list all of the entities available.

entity    displays
&nbsp;    no break space (see above)
&gt;      greater than ('>')
&lt;      less than ('<')
&amp;     ampersand ('&')
&#61;     equal sign ('='); HTML 4 gives it no name, so the number is used
&quot;    quote ('"')
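Here is a rough sketch of entities at work. The sentences are invented for the example; the point is that the characters HTML itself uses (<, >, & and the quote) are typed as entities when you want them displayed:

```html
<!-- Displays: To start a paragraph, type <p> and to end it type </p>. -->
<p>To start a paragraph, type &lt;p&gt; and to end it type &lt;/p&gt;.</p>

<!-- Displays: Tom & Jerry said "5 > 3" -->
<p>Tom &amp; Jerry said &quot;5 &gt; 3&quot;</p>

<!-- The same greater-than sign given by number: decimal 62, hexadecimal 3E -->
<p>5 &#62; 3 and 5 &#x3E; 3</p>
```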

 

 

SGML roots

A little background is appropriate right now.

HTML was created as a simplified application of a very complex markup language known as SGML (as defined in [ISO8879]). As I stated earlier, those rules are changing and XML is replacing SGML as the language that HTML gets its rules from.

In SGML there are some very strict rules, but if you follow those rules you can get a very flexible layout. To begin with, you declare what kind of document the markup file is in the first tag, DOCTYPE. The rules for that kind of document live in a file called a DTD, or document type definition.

If the DOCTYPE is HTML (or one of its derivatives) then the DTD it uses is one of the standards at the authoritative organization for Web standards, the World Wide Web Consortium or W3C. There are three different DOCTYPES in HTML. I will detail those soon enough.

But now it is time to bust a myth about HTML: I've heard it called “lazy” as if it didn't comply with any standards. Well, it did comply with the standards it was written for. Just because people have decided that it should comply with another set of standards doesn't mean it was lazy when complying with the first.

color schemes

There are three different ways of naming a color. The first and most obvious is to use a name like “red”, “green”, or “blue”. Then you can describe the color in a more technical manner by saying what the color is composed of (the amount of red, green, and blue). This can be done with either a decimal number which we humans use or hexadecimal which is what a computer uses. Well, technically that isn't true, but it is good enough for now unless you are either a Computer Scientist or a mathematician and want to learn about binary. I'll assume not.

To understand the technical ways of naming a color you have to understand the underlying color scheme that HTML uses-- the RGB or Red Green Blue model. The idea behind the RGB scheme is that all colors visible to the human eye are composed of a mix of three colors which are sometimes known as primary colors. There are other schemes; some that subtract a color instead of adding it and some that describe other color traits like luminosity. But we use the RGB model here so we have to learn it and the two ways to define it in HTML.

One way is to use the rgb() function directly (this notation belongs to style sheets rather than to plain HTML attributes). In it, define the degree of the primary colors with numbers from 0 to 255. For example, rgb(255,0,0) means as much red as possible and no green or blue.

Another way is to use a hexadecimal number. Assuming you know the basics of hexadecimal, just note that it is a 6 digit number; the first two being red, next two being green, and last two being blue, and it starts with a hash mark (#). The degree of color is 00-ff instead of 0-255. This way a #ff0000 is complete red with no green or blue. You can use the hexadecimal approach even if you don't know what hexadecimal is. Just look-up the color you want in a color chart that gives you the equivalent decimal values or hexadecimal values and plug that one in.

Here are the five most common colors. The complete list is at the end of the book.

 

 

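name     decimal             hexadecimal
red      rgb(255,0,0)        #ff0000
lime     rgb(0,255,0)        #00ff00
blue     rgb(0,0,255)        #0000ff
black    rgb(0,0,0)          #000000
white    rgb(255,255,255)    #ffffff

Note that pure green goes by the name “lime”; the name “green” gives the darker #008000.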
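To make the three notations concrete, here is a minimal sketch. It assumes the style attribute, which belongs to the CSS material covered later, and the paragraph text is invented for the example:

```html
<!-- Three ways to ask for the same pure red text -->
<p style="color: red">color by name</p>
<p style="color: rgb(255,0,0)">color by decimal amounts of red, green, and blue</p>
<p style="color: #ff0000">color by hexadecimal number</p>

<!-- The older font tag (transitional DTDs only) takes a name or a hexadecimal number -->
<p><font color="#0000ff">blue text the pre-CSS way</font></p>
```

All three paragraphs in the first group should come out the same shade of red.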

 

comments

When you want the browser to not display a certain piece of text, you wrap it as an HTML comment. In other words, put a “<!-- ” in front of the text and “ -->” behind it.

As with block comments in other languages like C++, you can't nest comments, that is, comment out lines that are already comments. When the browser comes to the first “-->” it will end the comment, even if it has passed two “<!--” on the way. In those cases your best bet is to put a new “<!--” on the next line after the “-->”.
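A short sketch of both cases follows; the paragraph text is made up. The last part shows what browsers generally do when one comment is started inside another:

```html
<!-- This line is hidden from the page but visible in the source. -->
<p>This paragraph is displayed.</p>

<!--
<p>This paragraph is commented out and is not displayed.</p>
-->

<!-- outer comment starts here
<!-- inner comment -->
<p>This paragraph IS displayed: the first closing mark above ended the comment.</p>
```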

Many people use these comments to put information about the web site in the HTML file. Unfortunately, the comments can be read in the browser by looking at the source code (Ctrl+U in Firefox). So any time you have a piece of data about your web site that would make a hacker's job easier, you don't want to put it in a comment. Remember that the first thing a hacker does is try to figure out what they are working with.

XML and XHTML

As I mentioned earlier, the rules are changing. HTML is trying to transition into a more complex type of document based on a simplified version of SGML called “XML”. The W3C recognized that since HTML4 has been around since '99, a few people are used to it and there are one or two pages out there that are written in it. So they came up with something in between to help get people off HTML4 and help make the transition a little easier.

First of all, there is already a document type that is lenient on international standards. It used to be called “loose” but now it is called “transitional”. This covers the period between when new standards are introduced and when they become common.

The in-between step is XHTML, which is a hybrid of XML and HTML. The path runs from HTML 4.01 to XHTML, and from there, eventually, to XML.

There are a couple of differences between HTML and XHTML:

     

  • tag and attribute names must be lower case
    old: <TABLE WIDTH="100%">
    new: <table width="100%">

  • attributes must have a value (name="value")
    old: <input type="checkbox" checked>
    new: <input type="checkbox" checked="checked" />

  • attribute values must be in quotes (name="value")
    old: <table width=100%>
    new: <table width="100%">

  • all tags must be closed or self-closing
    old: <p> text <img>
    new: <p>text</p> <img />

  • attribute names must be complete
    While it wasn't used very often, in HTML4 you only had to enter enough characters of an attribute name to distinguish it from the other attributes for that tag.

  • the doctype declaration is required; it begins with <!DOCTYPE, names the root element in lower case, and is not self-closing
    old: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    new: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

 

When a document complies with all of the rules it is called “well-formed”. So a document that complies with the XML rules listed above is known as “well-formed XML”.
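To see several of these rules at once, here is a small before-and-after sketch. The class name intro and the text are placeholders chosen for the example:

```html
<!-- HTML 4 style: upper case, unquoted value, minimized attribute, unclosed tags -->
<P CLASS=intro>First line<BR>Second line
<INPUT TYPE=checkbox CHECKED>

<!-- The same markup written to the XHTML rules -->
<p class="intro">First line<br />Second line</p>
<p><input type="checkbox" checked="checked" /></p>
```

A browser of the day would display both versions the same way; only the second is well-formed.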

page basics

 

As I said earlier, the document type (DTD) is declared by the first tag in the HTML file. Technically it is not an HTML tag since the HTML hasn't started yet. But for our purposes, it looks like a tag and quacks like a tag so guess what it is in our books.

As of this writing there were seven DTDs on the World Wide Web Consortium's site.

Roughly, the DTDs can be categorized into HTML 4.01, XHTML 1.0, and XHTML 1.1. HTML 4.01 and XHTML 1.0 are similar to each other except that XHTML is more strict on “well-formedness”, which is a habit you should have gotten into with HTML 4.01 anyway. So even though you can get away with not doing so, learn to always close your tags and always put elements in lower case.

Then those categories are divided into three basic doctypes: strict, transitional, and frameset. The strict doctypes are just that: strict. All elements (tags, entities, and attributes) must be according to the rules.

Transitional, which used to be called “loose”, is for the period when a standard isn't widely adopted yet. Imagine if they were to suddenly not allow web pages to have a tag that many are reliant on (font for example). The transitional DTD allows for a small amount of deviation, especially with old standards.

 

different DTDs

 

HTML 4.01 Strict

This contains all HTML elements and attributes except deprecated elements (like font) and framesets, neither of which is allowed under this DTD. However, it is a little more relaxed than XHTML as far as being well-formed according to XML rules.

HTML 4.01 Transitional

Like 4.01 strict except this DTD includes deprecated elements so browsers can handle older tags. But once again, framesets are not allowed.

HTML 4.01 Frameset

This DTD is basically the same as HTML 4.01 Transitional, but it allows framesets.

XHTML 1.0 Strict

The XHTML 1.0 strict DTD is like the HTML 4.01 strict DTD except that the markup must also be written as well-formed XML.

XHTML 1.0 Transitional

The XHTML 1.0 transitional DTD is like the HTML 4.01 transitional DTD except that the markup must also be written as well-formed XML.

XHTML 1.0 Frameset

This DTD is equal to XHTML 1.0 Transitional, but allows the use of frameset content.

XHTML 1.1

This DTD is equal to XHTML 1.0 Strict, but allows you to add modules (for example to provide ruby support for East-Asian languages).
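To show where the declaration goes, here is a minimal sketch of a complete page that declares HTML 4.01 Strict. The title and paragraph text are placeholders:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>A minimal HTML 4.01 Strict page</title>
</head>
<body>
<p>The first line of this file tells the browser which DTD applies.</p>
</body>
</html>
```

For XHTML 1.0 Strict the declaration would instead name "-//W3C//DTD XHTML 1.0 Strict//EN" and "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", and the html tag would also carry xmlns="http://www.w3.org/1999/xhtml".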

head section

In HTML 4 the head tags can be left out, but in XHTML they are required.

The head section is where you'll find information that is needed to show an HTML document, but doesn't directly affect the text within.

In it you'll find information like the page's title, information search engines will require, the URL the site is based at, and information about the style of the tags within.

The title tag holds the text that is shown in search engines and in the top bar of the browser.

There are several meta tags, including the site's description and keywords, both of which search engines use. We will only cover the description and the keywords in this document.

The style tag holds an internal set of style rules, such as a class named “pull” for paragraphs intended to be used as pull-quotes like those in magazines and newspapers.
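Putting those pieces together, a head section might look something like the sketch below. Every value in it (the title, description, keywords, base URL, and the styling chosen for the pull class) is an invented placeholder:

```html
<head>
<title>HTML basics</title>
<meta name="description" content="An introduction to writing HTML pages by hand.">
<meta name="keywords" content="HTML, XHTML, tutorial">
<base href="http://example.com/">
<style type="text/css">
/* "pull" is a class for paragraphs used as pull-quotes */
p.pull { font-size: 150%; font-style: italic; margin: 1em 3em; }
</style>
</head>
```

A paragraph written as <p class="pull"> in the body would then pick up the pull-quote styling.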

 

 

text manipulation tags

 

The text manipulation tags do just that-- they manipulate text. They have no size like the block tags this paper will cover in a minute and you cannot position them in a certain place. All they do is affect the text within them.

These kinds of changes are called “inline”, meaning they only change the text within that line and don't change other text or its position, beyond the few changes that follow from the change in size of this text. When you bold a phrase, the characters to the right will be moved to the right.

new kind of tags

Something you need to be aware of is that there are sometimes two tags that appear to do the same thing. They both have the same default appearance, which can be altered by CSS (which I will cover later). The only real difference is that one is thought to be more correct by today's standards because it is worded in a general way to say how that text is to be treated, while the other is specific to how that piece of text is to be displayed. Take for example the <b> tag, or “bold” tag, versus the <strong> tag. Both will act the same way and both can be altered to act differently. What if you want a different typeface instead of bold for text that is considered strong? Both will do it. However, <strong> is the new standard, so new pages should be written with it.

In the same manner as bolding text, there is also <i>, for italicizing text, and <em> for text that is to be emphasized. Both will italicize text by default and both can be customized as needed.

There used to be a tag to indicate any other changes in character typography called <font>. Now there is <sup> to superscript the text, <sub> to subscript it, <big> to make it bigger, and <small> to make it smaller.

Some of the old-time tags have been “deprecated”, the techy word for being phased out, and replaced with tags that conform to the new standards better. Two of those are <u> for underline and <strike> for strike through. Now there are <del>, for deleted text, which is shown struck through by default, and <ins>, for inserted text, which is shown underlined by default.

Some examples of character manipulation are:

This text is bolded.

This text is italicized.

This text is an “inline quote”.

This text is worng right.

Since the old-time tags will eventually be going away, it makes sense to do any work today with the newer standards, even if the old ones will be tolerated.
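As a best guess at the markup behind examples like these, using the newer tags, consider the sketch below. The <q> tag for the inline quote is my assumption, since the text does not name it, and the last two lines are added to show <sub>, <sup>, <big>, and <small>:

```html
<p>This text is <strong>bolded</strong>.</p>
<p>This text is <em>italicized</em>.</p>
<p>This text is an <q>inline quote</q>.</p>
<p>This text is <del>worng</del> <ins>right</ins>.</p>
<p>Water is H<sub>2</sub>O and the area is 10 m<sup>2</sup>.</p>
<p>This is <big>bigger</big> and this is <small>smaller</small>.</p>
```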

Block tags

A block tag signals a block of text that can be sized as necessary, can be given extra margins, and can be positioned in a specific place on the screen instead of just going where it happens to fall. It also has characteristics that are specific to that piece of text, such as a background graphic or background color, and a border around it. Something that confuses many is that there is a line break after a block. This allows you, with CSS, to specify how much extra space there is between paragraphs. Remembering this little fact will help you avoid early dementia when you start coding CSS.

A typical page is laid out in blocks like these:

header - this area is usually used for the site's title

left side bar - this area is usually used for menu items

center column - this area is usually used to display the content of an HTML site

right side bar - this area is usually used for more menu items or extra blocks of information

footer - this area is usually used for trivial items that you need to display

 

<div>
<h1>the header is usually used for the site's title</h1>
</div>

<div>
<h3>left side bar</h3>
<p>This area is used for menu items.</p>
</div>

<div>
<h3>center column</h3>
<p>This area is used for the content of an HTML site.</p>
</div>

<div>
<h3>right side bar</h3>
<p>This area is used for extra blocks of information.</p>
</div>
<div>
<h3>A footer is used for trivial items</h3>
</div>

lists

There are three different kinds of lists in HTML.

Two of the lists use very similar syntax so it is easy to use one or the other. The only difference in the display is whether each item is marked with a bullet (•) or a number (1). The only difference in the HTML is whether the tags that start and end the list are <ul>...</ul> or <ol>...</ol>. The <ul> stands for unordered list and <ol> for ordered list.

unordered list

An unordered list, which most people call a bulleted list, looks like this:

  • item 1

  • item 2

  • item 3

The HTML looks like this:

<ul>

<li>item 1</li>

<li>item 2</li>

<li>item 3</li>

</ul>

 

ordered list

An ordered list can be shown in many ways with CSS. Here are a few:

1. first item      A. first item      a. first item

2. second item     B. second item     b. second item

3. third item      C. third item      c. third item

 

 

 

 

In all three cases the HTML looks like this:

<ol>

<li>first item</li>

<li>second item</li>

<li>third item</li>

</ol>

How the numbers are shown requires some CSS so we'll cover that in a minute.
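As a preview, one possible way to get the three numbering styles is the CSS list-style-type property, set here through the style attribute. This is a sketch, not the only way to do it:

```html
<!-- 1. 2. 3. (the default) -->
<ol style="list-style-type: decimal">
<li>first item</li>
<li>second item</li>
<li>third item</li>
</ol>

<!-- A. B. C. -->
<ol style="list-style-type: upper-alpha">
<li>first item</li>
<li>second item</li>
<li>third item</li>
</ol>

<!-- a. b. c. -->
<ol style="list-style-type: lower-alpha">
<li>first item</li>
<li>second item</li>
<li>third item</li>
</ol>
```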

nested lists

You can also put lists inside lists to get a hierarchical list that will look like a traditional outline, as these two examples show.

1. first item                  • first item
   a. first sub-item              ◦ first sub-item
   b. second sub-item             ◦ second sub-item
   c. third sub-item              ◦ third sub-item
2. second item                 • second item
3. third item                  • third item

The HTML for the nested ordered list looks like this. Note that the inner list sits inside the <li> of the item it belongs to:

<ol>

<li>first item

<ol>

<li>first sub-item</li>

<li>second sub-item</li>

<li>third sub-item</li>

</ol>

</li>

<li>second item</li>

<li>third item</li>

</ol>

 

The HTML for the nested unordered list looks like this, again with the inner list inside the <li> of its parent item:

<ul>

<li>item 1

<ul>

<li>first sub-item</li>

<li>second sub-item</li>

<li>third sub-item</li>

</ul>

</li>

<li>item 2</li>

<li>item 3</li>

</ul>

 

definition list

A definition list looks almost like a dictionary entry; a term being defined and some text for the definition which is usually indented. There is nothing preceding the list like a bullet and there are two things you must define; the term and the definition.

 

So it ends up looking like this:

first term

This is the definition for the first term in this list. It is a whole paragraph that is usually indented. This is the definition for the first term in this list. It is a whole paragraph that is usually indented. This is the definition for the first term in this list. It is a whole paragraph that is usually indented.

second term

This is the definition for the second term in this list. It is a whole paragraph that is usually indented. This is the definition for the second term in this list. It is a whole paragraph that is usually indented. This is the definition for the second term in this list. It is a whole paragraph that is usually indented.

 

Some parts of the list are customizable with the first two versions of CSS but using CSS3 gives you even more power over how a list looks.

The HTML for a definition list looks like this:

<dl>

<dt>first term</dt>

<dd>This is the definition for the first term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the first term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the first term in this list.
It is a whole paragraph that is usually indented. </dd>

<dt>second term</dt>

<dd>This is the definition for the second term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the second term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the second term in this list.
It is a whole paragraph that is usually indented. </dd>

<dt>third term</dt>

<dd>This is the definition for the third term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the third term in this list.
It is a whole paragraph that is usually indented.
This is the definition for the third term in this list.
It is a whole paragraph that is usually indented. </dd>

</dl>

 

 

tables

An HTML table is something like a spreadsheet. Data is laid out in rows and columns. However, it is not a spreadsheet: you have to put all the correct text in yourself. It does not calculate one column to be the sum of two other columns.

Like a list, you need to define the beginning and end of the table, but this time with the <table> and </table> tags. Then where the rows start and stop is defined with <tr> and </tr>. Each cell that contains a piece of data is defined with a <td> and </td>.
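A bare-bones table of two rows and two columns might be sketched like this. The cell text is a placeholder, and the border attribute is added only so the cells are visible:

```html
<table border="1">
<tr>
<td>row 1, cell 1</td>
<td>row 1, cell 2</td>
</tr>
<tr>
<td>row 2, cell 1</td>
<td>row 2, cell 2</td>
</tr>
</table>
```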

spanning columns and rows

Sometimes you'll want a column to be a hierarchical parent of other columns.

Let's say you are working on a financial statement and want it broken into months that are grouped by quarters.

 

             | quarter 1                | quarter 2                | quarter 3                | quarter 4
description  | Jan    | Feb    | Mar    | Apr    | May    | June   | July   | Aug    | Sept   | Oct    | Nov    | Dec
Gross profit | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
Expenses     |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00 |  90.00
Net profit   |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00 |  10.00

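The HTML would look like this, using the colspan attribute to make each quarter heading span three month columns:

<table>

<tr>
<th></th>
<th colspan="3">Quarter 1</th>
<th colspan="3">Quarter 2</th>
<th colspan="3">Quarter 3</th>
<th colspan="3">Quarter 4</th>
</tr>

<tr>
<th>description</th>
<th>Jan</th> <th>Feb</th> <th>Mar</th>
<th>Apr</th> <th>May</th> <th>June</th>
<th>July</th> <th>Aug</th> <th>Sept</th>
<th>Oct</th> <th>Nov</th> <th>Dec</th>
</tr>

<tr>
<td>gross profit</td>
<td>100.00</td> <td>100.00</td> <td>100.00</td>
<td>100.00</td> <td>100.00</td> <td>100.00</td>
<td>100.00</td> <td>100.00</td> <td>100.00</td>
<td>100.00</td> <td>100.00</td> <td>100.00</td>
</tr>

<tr>
<td>expenses</td>
<td>90.00</td> <td>90.00</td> <td>90.00</td>
<td>90.00</td> <td>90.00</td> <td>90.00</td>
<td>90.00</td> <td>90.00</td> <td>90.00</td>
<td>90.00</td> <td>90.00</td> <td>90.00</td>
</tr>

<tr>
<td>net profit</td>
<td>10.00</td> <td>10.00</td> <td>10.00</td>
<td>10.00</td> <td>10.00</td> <td>10.00</td>
<td>10.00</td> <td>10.00</td> <td>10.00</td>
<td>10.00</td> <td>10.00</td> <td>10.00</td>
</tr>

</table>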

     

links and anchors

 

One of the benefits of HTML's approach is that text isn't displayed as one continuous piece but as smaller pieces that let you jump around. Of course the same thing can be done with a well written book or report, which should be segmented into chapters and sections. The big differences are that to go from one web page to another you click a link instead of turning a page, the text can be cut and pasted into another paper, and a web page can be on the Internet.

The part of a page that you click on is called a link or hyperlink and the place it goes is called a URL or an anchor. A link can be displayed as a piece of text, an image, or a shape.

In the case of text, when you want it to go to another web page, you use the “a” tag and specify the URL in the “href” attribute.

<a href="http://example.com">Click here</a>

will go to the URL example.com on the Internet.

If you want browsers to jump to the middle of a page you need to specify where that point, or anchor, will be. Again you would use the “a” tag but this time you would specify the name of the anchor.

<a name="sectionI"></a>

    Specifies where “section I” begins.

When you want to jump to an anchor like “sectionI” you would specify the name at the end of the URL with a hash mark (#) in between.

<a href="http://example.com#sectionI">Click here</a>

    When a link is text, the characters are usually underlined and displayed in blue. In all cases the mouse pointer changes to a hand when the mouse is over it.

    When the link is a shape within an image, two more tags are used:

  • map - map of an image (makes images clickable)

  • area - an area in a map
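For links that are shapes within an image, the map and area tags do the work. A rough sketch of a clickable image follows; the file names, sizes, and room names are all invented for the example:

```html
<p><img src="floorplan.gif" alt="Floor plan" width="200" height="100" usemap="#rooms"></p>

<map name="rooms" id="rooms">
<!-- left half of the image: a rectangle from (0,0) to (100,100) -->
<area shape="rect" coords="0,0,100,100" href="kitchen.html" alt="Kitchen">
<!-- a circle centered at (150,50) with a radius of 40 -->
<area shape="circle" coords="150,50,40" href="den.html" alt="Den">
</map>
```

The usemap attribute on the image points at the map by name, and each area turns one region of the picture into a link.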

     

 

 

 

 

multimedia

  • img - add an image

  • object - embed a complicated item like Flash or Java

  • param - a parameter to an applet or object

  • iframe - an inline frame
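A quick sketch of these tags in use is below. The file names and sizes are placeholders, the movie parameter is the one Flash embeds commonly used, and iframe is only available under the transitional and frameset DTDs:

```html
<!-- an image; alt is the text used if the picture cannot be displayed -->
<p><img src="logo.png" alt="Company logo" width="120" height="60"></p>

<!-- an embedded Flash movie, with a parameter and fallback text -->
<object type="application/x-shockwave-flash" data="intro.swf" width="400" height="300">
<param name="movie" value="intro.swf">
<p>Your browser cannot show this movie.</p>
</object>

<!-- an inline frame showing another page inside this one -->
<iframe src="news.html" width="400" height="200">
<p>Your browser does not support inline frames.</p>
</iframe>
```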

     

user input

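  • form - define the "form" and what to do when submit is hit

  • input - a single input field; its type attribute picks a text box, checkbox, radio button, and so on

  • select and option - a "drop down" list and the items in it

  • textarea - multiline text

  • button - a clickable button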
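Here is one way these tags might fit together in a small sign-up form. The action URL and all of the field names and values are hypothetical:

```html
<form action="/signup" method="post">
<p>Name: <input type="text" name="fullname"></p>
<p>Plan:
<select name="plan">
<option value="free">Free</option>
<option value="paid">Paid</option>
</select></p>
<p>Comments:<br>
<textarea name="comments" rows="4" cols="40"></textarea></p>
<p><input type="checkbox" name="news" value="yes"> Send me the newsletter</p>
<p><button type="submit">Sign up</button></p>
</form>
```

When the button is pressed, the browser sends each field's name and value to the address given in the action attribute.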

     

deprecated or rarely used tags

  • <center> centering text is now controlled by a stylesheet

  • <frameset> <frame> using a frame is not considered a good design decision

  • <col> columns

  • <colgroup> column groups

  • <bdo> bidirectional override

  • <dfn> Definition term

  • <code> Computer code text

  • <samp> Sample computer code text

  • <kbd> Keyboard text

  • <var> Variable

  • <cite> Citation

  • <applet> Use the object tag to embed a Java Applet

  • <font> Controlling font face, size, etc is now controlled by a stylesheet.

  • <u> Underlining text is now controlled by a stylesheet.

 

 

 

 
