There are several steps involved in developing an effective database design. As with all types of applications, the process begins with requirements analysis. In terms of relational database design, this phase answers questions regarding what data
elements must be stored, who will access them, and how.
The second step is to define the logical database. This phase does not deal with how the data will be stored physically, but with how information is grouped logically. The requirements are translated into a model that provides a level of abstraction
from the physical database, representing data in terms of business entities and relationships, rather than in terms of tables and columns.
Physical design is the final phase, in which individual data elements are given attributes and defined as columns in tables. This phase also deals with performance considerations relating to the creation of indexes, rollback segments, temporary
segments, and the physical layout of data files on disk. DDL (Data Definition Language) scripts are written to create the database objects, and these scripts are also used for capacity planning.
A simple contact manager will be used as a sample application throughout this chapter, illustrating the concepts and techniques presented for each phase of the design process.
System requirements are typically gathered through a series of interviews with the end users. This is an iterative process in which systems designers provide structure to the ongoing dialog and document findings, and solicit feedback from the users.
Although requirements definition is not normally considered part of the design process, the design is driven by the requirements, and the two processes often overlap. For example, the logical model may bring out new requirements that were not recognized in
the earlier phases of analysis. It is important, however, to identify all requirements before developing a physical design, because capacity planning and hardware purchasing decisions are ineffective without a full understanding of system requirements.
A common technique used to define and document database requirements is to develop a data dictionary. As the name implies, a data dictionary simply enumerates and defines the individual data elements that must be stored. An initial draft of the data
dictionary for a simple contact manager might look like Table 17.1.
Item | Description
Last Name | The individual contact's last name
First Name | The individual's first name
Middle Initial | The individual's middle initial
Contact Type | Standardized description indicating whether this individual represents a client, a prospect, a vendor, or some other type of contact
Individual Notes | Additional information related to the individual
Company | The name of the company that the individual represents
Company Notes | Additional information related to the individual's company
Address Line 1 | Line 1 of the individual's street address
Address Line 2 | Line 2 of the individual's street address
Address Line 3 | Line 3 of the individual's street address
City | City name of the individual's mailing address
State | State name for the individual's mailing address
Zip Code | Zip code for the individual's mailing address
Address Type | Standardized description indicating whether this is a work, home, or some other type of address
Phone Number | The individual's area code and phone number
Phone Type | Standardized description indicating whether this is a home, office, or other type of phone number
Date Contacted | The date that this individual was contacted
Contacted By | The name of the salesperson or employee who contacted this individual
Contact Method | Standardized description indicating whether the individual was contacted by phone, mail, fax, or some other method
Contact Reason | Standardized description of the reason that the individual was contacted
Contact Notes | Additional information related to this specific contact
Although this is a good way to start defining database requirements, there are obvious shortcomings. The data dictionary does not describe how these individual items are related. It also lacks information regarding how the data is created, updated, and
retrieved, among other things.
A functional specification documents the system requirements in plain English and should fill in details concerning who will be using the system, when, and how. Information concerning the number of concurrent users accessing the system, how frequently
records are inserted and updated, and how information will be retrieved are particularly important topics to be covered in the functional specification. These factors will help determine hardware and software licensing requirements, and have a significant
impact on issues relating to performance, security, and database integrity.
The functional description for the sample contact manager might include a summary similar to the text that follows:
The system will be available to 40 sales representatives, 5 sales managers, 26 sales assistants, 6 purchasing agents, 1 purchasing department manager, and 2 purchasing assistants, for a total of 80 users. Of these 80 possible users, it is expected that
a maximum of 20 would be actively using the system at any given time. Purchasing department personnel should have access only to purchasing department contacts, and sales department personnel should have access only to sales contacts.
All users may add information regarding a specific contact at any time, but whereas sales representatives and purchasing agents can add new prospects, only assistants can add new vendors and clients (after obtaining proper approval from a manager).
Sales representatives and purchasing agents should have access only to their accounts and prospects, whereas managers should have full access to the entire database for their specific departments.
One assistant from each department will be designated as a system administrator. Only the system administrators will be able to add and modify address types, phone types, contact types, contact methods, and contact reasons. With the approval of a manager, a system
administrator will be able to reassign a vendor or client to a new purchasing agent or sales representative.
For audit purposes, every time information is added or modified, the individual who made the modification, and the date and time that the information was modified, should be recorded.
In the preceding example, the functional specification added several new data elements to the requirements, in addition to pertinent information regarding access and security. The functional specification and data dictionary are often developed
simultaneously, as one document may provide relevant information that should be reflected in the other.
An important part of requirements analysis is to anticipate the needs of the users, because they will not always be able to fully explain the system requirements on their own. Based on the information in the previous examples, the system designers will almost certainly have follow-up questions for the users, and the answers will often raise further questions of their own. In practice, the functional description should describe the system to the fullest extent and detail possible. The importance of thorough requirements analysis and documentation is often underestimated. Put simply, poor requirements definition will most likely result in a poor or inadequate design, because these requirements provide the foundation for the later phases of design, including the logical and physical data models.
A common way to represent the logical model is through an Entity-Relationship (E-R) Diagram. For the purposes of this type of model, an entity is defined as a discrete object for which items of data are being stored, and a relationship refers to
an association between two entities.
In the contact manager example, there are five main entities for which data is being stored: Employees, Individuals, Contacts, Addresses, and Phone Numbers.
The relationships between these entities can be summarized in plain terms: employees contact individuals, with each contact made for a particular reason and through a particular method; each individual can have any number of addresses and phone numbers; and each individual may represent a company.
These entities and their relationships can be represented graphically by an E-R diagram, as shown in Figure 17.1.
Figure 17.1. E-R diagram for the contact manager sample application.
This may seem to be a simplistic approach to modeling the application's data, but it is often a good first step. In larger applications with hundreds of entities, these models can become extremely complex.
This model can be taken a step further by defining attributes for each entity. An entity's attributes are the individual items of data to be stored that relate specifically to the object. The attributes for each entity in the example are listed in Table
17.2.
Employees | Individuals | Contacts
Employee Number | Last Name | Contact Date
User ID | First Name | Contacted By
| Middle Initial | Contact Reason
| Company | Contact Method
| Individual Notes | Contact Type
| Company Notes | Contact Notes

Addresses | Phone Numbers
Address Line 1 | Phone Number
Address Line 2 | Phone Type
Address Line 3 |
City |
State |
Zip Code |
Address Type |
Note that several items of information are missing. The audit information mentioned in the functional specification is omitted. This can be handled by adding Last Update User ID and Last Update Date/Time attributes to each entity. More importantly, there are attributes missing that are required to relate entities to each other. These data items will be handled somewhat differently because they are not "natural" attributes belonging to a specific entity.
This is a highly abstract view of the data, concerned only with broad categories of data (entities) and the logical relationships between them. The E-R model, although good at representing basic data concepts, is not of much use when it comes to
physical implementation. The relational model helps bridge this gap.
The relational model is characterized by its use of keys and relations, among other things. The term relation in the context of relational database theory should not be confused with a relationship. A relation can be viewed as an unordered,
two-dimensional table, where each row is distinct. Relationships are built between relations (tables) through common attributes. These common attributes are called keys.
There are several types of keys, and they sometimes differ only in terms of their relationships to other attributes and relations. A primary key uniquely identifies a row in a relation, and each relation can have only one primary key, even if more than one attribute is unique. In some cases, it takes more than one attribute to uniquely identify each row in a relation. The aggregate of these attributes is called a concatenated key, or a composite key. In other cases, a primary key must be generated. The entity Individuals in the preceding example illustrates this point. Although it may be likely that the combination of the entity's attributes is unique, there is no guarantee. A new attribute based on generated values should be created so that each row in Individuals is guaranteed to be distinct, making it a true relation. This can be accomplished in Oracle through the use of a SEQUENCE.
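A minimal sketch of this approach follows. The sequence name and sample values are assumptions for illustration; the column names follow the Individuals table as it is eventually created later in the chapter.

CREATE SEQUENCE individual_id_seq
    START WITH 1
    INCREMENT BY 1;

-- The generated value becomes the primary key for the new row
INSERT INTO Individuals
    (Individual_ID, Last_Name, First_Name, Last_Updt_User, Last_Updt_Timestamp)
VALUES
    (individual_id_seq.NEXTVAL, 'Smith', 'Jane', 'ADMIN1', SYSDATE);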
Another type of key, called a foreign key, exists only in terms of the relationship between two relations. A foreign key in a relation is a nonkey attribute that is a primary key (or part of the primary key) in another relation. This is the
shared attribute that forms a relationship between two relations (tables). Primary and foreign key relationships are illustrated in Table 17.3.
Referring back to the example, the entities' attributes can be extended to fulfill the audit requirements and make the model relational. Note that the ID attribute is a generated primary key in each relation in which it appears. The reasons for this
will be explained in further detail in the following section on performance considerations.
Employees | Individuals
Employee Number (PK) | ID (PK)
User ID | Last Name
Last Update User ID | First Name
Last Update Date/Time | Middle Initial
| Company
| Contact Type
| Assigned Employee (FK)
| Individual Notes
| Company Notes
| Last Update User ID
| Last Update Date/Time

Addresses | Phone Numbers
ID (PK) | ID (PK)
Individual ID (FK) | Individual ID (FK)
Address Line 1 | Phone Number
Address Line 2 | Phone Type
Address Line 3 | Last Update User ID
City | Last Update Date/Time
State |
Zip Code |
Address Type |
Last Update User ID |
Last Update Date/Time |

Contacts
ID (PK)
Individual ID (FK)
Contacted By (FK)
Contact Date
Contact Reason
Contact Method
Contact Notes
Last Update User ID
Last Update Date/Time
There are numerous limitations and redundancies in this model. For example, if one individual works for more than one company, he or she must be stored as two separate individuals to be associated with both companies. Redundancies are also introduced
when multiple contacts share the same address and phone number.
Normalization is a technique used to group attributes in ways that eliminate these types of problems. More specifically, the goals of normalization are to minimize redundancy and functional dependency. Functional dependencies occur
when the value of one attribute can be determined from the value of another attribute. The attribute that can be determined is said to be functionally dependent on the attribute that is the determinant. By definition, then, all nonkey attributes will be
functionally dependent on the primary key in every relation (because the primary key uniquely defines each row). When one attribute of a relation does not uniquely define another attribute, but limits it to a set of predefined values, this is called a
multivalued dependency. A partial dependency exists when an attribute of a relation is functionally dependent on only one attribute of a concatenated key. Transitive dependencies occur when a nonkey attribute is functionally dependent on one or more
other nonkey attributes in the relation.
Normalization is measured in terms of normal forms, and the process of normalization consists of taking appropriate steps to reach the next normal form. A relation is in first normal form when every attribute contains only single, atomic values, with no repeating groups. It is in second normal form when it is in first normal form and contains no partial dependencies. It is in third normal form when it is in second normal form and contains no transitive dependencies. It is in fourth normal form when it is in third normal form and contains no multivalued dependencies. (There are other normal forms, but they are beyond the scope of this discussion.)
Third normal form can be reached in the example by removing the transitive dependencies that exist in the Address relation. This can be achieved by removing city and state from the Address relation and creating two new relations, as you will note in
Table 17.4.
Cities | States
Zip Code (PK) | Zip Code (PK)
City (PK) | State
This makes both the zip code and city attributes in the Address relation foreign keys. (Note that City is made part of the key because, in rural areas, one zip code may have more than one city.) The model can be further normalized by eliminating several
of the multivalued dependencies, as you will note in Table 17.5.
Address Type | Phone Type | Contact Type
ID (PK) | ID (PK) | ID (PK)
Type | Type | Type

Contact Method | Contact Reason
ID (PK) | ID (PK)
Method | Reason
Where these attributes exist in the previously defined relations, they will remain as foreign keys. As these examples illustrate, the dependencies are not eliminated, but their nature is changed so that dependencies exist on key attributes rather than
nonkey attributes. In this example, however, fourth normal form is still not achieved. In the Individual relation, Company has a multivalued dependency (at best) on the Last Name, First Name, and Middle Initial.
Separating Company from Individuals brings the model to fourth normal form. Additional relations allow addresses and phones to be associated with either companies or individuals, and allow individuals to be related to more than one company, as you will
note in Table 17.6.
Individuals | Individual-Company Relation
ID (PK) | Individual ID (FK)
Last Name | Company ID (FK)
First Name |
Middle Initial |
Contact Type (FK) |
Assigned Employee (FK) |
Individual Notes |
Last Update User ID |
Last Update Date/Time |

Companies | Entity Type
ID (PK) | ID (PK)
Company | Type
Company Notes |

Addresses | Phone Numbers
ID (PK) | ID (PK)
Entity Type (FK) | Entity Type (FK)
Entity ID (FK) | Entity ID (FK)
Address Line 1 | Phone Number
Address Line 2 | Phone Type (FK)
Address Line 3 | Last Update User ID
City (FK) | Last Update Date/Time
Zip Code (FK) |
Address Type (FK) |
Last Update User ID |
Last Update Date/Time |
An additional aspect of the logical model is the design of tablespaces. A tablespace consists of one or more data files and, as the name implies, a tablespace houses one or more database objects. Before proceeding to the physical design,
designers should consider how they may want to use tablespaces to group database objects along logical boundaries.
One tablespace is always created when Oracle is installed. This tablespace, called SYSTEM, houses all system tables and other system objects used by Oracle itself. Although this tablespace can be used to house additional tables, it is preferable to
create separate tablespaces for application-specific objects. In many cases, it is desirable to create several tablespaces to house different types of database objects. A common logical division is to create separate tablespaces for rollback segments,
indexes, tables, and temporary segments. This topic will be discussed in greater detail in the following section, but these logical design issues are worthy of some consideration before proceeding to the physical design.
The physical database consists of data files, tablespaces, rollback segments, tables, columns and indexes. There are dependencies between these elements that impose an order on the design process. The process often starts with designing the smallest
units of physical storage (the column) and proceeds, in order, to each successively larger unit of storage. Overall capacity and performance considerations provide constraints to the design, and should be considered at every step. As with logical modeling,
developing a physical design can be a somewhat iterative process.
Designing the physical database begins with assigning column attributes. The attributes of a column determine how it will be physically stored in the database by defining its data type and maximum length. The data type and length of a column should be
carefully chosen at design time, because it is sometimes difficult to change these attributes after data has been loaded. Consider the following summarizations of each of the Oracle data types:
CHAR(SIZE) | Used to store fixed-length alphanumeric data. (SIZE) determines the number of characters that will be stored for each value in the column. If a value inserted into the column is shorter than (SIZE), it is padded with spaces on the right until it reaches (SIZE) characters in length. The maximum length of this data type is 255 bytes. If (SIZE) is omitted, the default is 1 byte.
VARCHAR2(SIZE) | Used to store variable-length alphanumeric data. This data type differs from CHAR in that inserted values are not padded with spaces, and the maximum (SIZE) for this type is 2000 bytes.
NUMBER(P, S) | Used to store positive or negative, fixed or floating-point numeric values. The precision, (P), determines the maximum length of the data, whereas the scale, (S), determines the number of places to the right of the decimal. If scale is omitted, the default is 0. If precision is omitted, values are stored with their original precision, up to the maximum of 38 digits.
DATE | Used to store dates and times. Oracle uses its own internal 7-byte format, storing one byte each for the century, year, month, day, hour, minute, and second. This is important because the full 7 bytes are stored for every value, even if only a portion of the information (such as the date alone) is used. The default representation for dates in Oracle is DD-MON-YY. For example, '01-JAN-95' is used to represent January 1, 1995.
LONG | Used to store up to 2 gigabytes of alphanumeric data. As with VARCHAR2, values are stored at their actual lengths. LONG values cannot be indexed, and the normal character functions, such as SUBSTR, cannot be applied to LONG values.
RAW | Used to store binary data with no character set conversion. RAW data cannot exceed 255 bytes. RAW data can be indexed, but no functions can be performed on RAW values.
LONG RAW | Used to store large binary objects, such as whole documents, video, and graphics, or entire compiled programs. LONG RAW can store up to 2 gigabytes of information but cannot be indexed.
ROWID | Used to represent a row's physical address. Every table has a ROWID pseudo-column, which is not evident when describing the table or issuing SELECT * FROM table_name. This address remains unchanged unless the table is exported and imported, or otherwise physically moved on disk. In practice, this value is rarely used.
You should consider additional factors besides the nature of the data and its length when selecting a data type for a column. For example, one might assume that it is always better to use VARCHAR2 instead of CHAR, so that only the actual number of bytes
present in the data will be stored. There are differences, however, in how comparisons are handled for these data types. Two VARCHAR2 values must be the same length to be considered equal, whereas two CHAR values are compared without regard to trailing spaces. As a result, if the values 'WORD' and 'WORD ' are compared, they will be considered equal if they are CHAR values, but not equal if they are VARCHAR2 values, because of the trailing space in the second value.
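A quick experiment illustrates the difference. The table and column names here are illustrative only:

CREATE TABLE type_compare (
    c_val CHAR(10),        -- blank-padded to 10 characters
    v_val VARCHAR2(10)     -- stored at its actual length
);

INSERT INTO type_compare VALUES ('WORD', 'WORD');

-- Returns 1: CHAR comparisons ignore trailing blanks
SELECT COUNT(*) FROM type_compare WHERE c_val = 'WORD ';

-- Returns 0: VARCHAR2 comparisons treat the trailing space as significant
SELECT COUNT(*) FROM type_compare WHERE v_val = 'WORD ';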
When using the NUMBER data type, the declared precision and scale greatly affect how the data is stored. If these values are not fully understood, the column may not behave as intended. For example, assume that a column has been declared NUMBER(10,4). One might conclude that up to 10 digits can be stored in this column, as many as four of which may be to the right of the decimal. This is not necessarily the case, however. An attempt to insert a value of 1234567 into this column will fail with the error ORA-01438 (value larger than specified precision allows for this column). The declaration of NUMBER(10,4) does allow up to 10 digits of precision, but only 6 of them may be to the left of the decimal.
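A brief sketch of this behavior, using an illustrative table name:

CREATE TABLE number_test (
    amount NUMBER(10, 4)   -- 10 significant digits, 4 to the right of the decimal
);

-- Succeeds: 6 digits to the left of the decimal, 4 to the right
INSERT INTO number_test VALUES (123456.7891);

-- Fails with ORA-01438: 7 digits to the left of the decimal
INSERT INTO number_test VALUES (1234567);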
The default DATE format can be changed for the database instance, or for a particular session. If you wish to store times in a column declared as DATE, you must alter the default date format to include the time, or all values inserted will have the
default time of 12:00 A.M. (with the exception of the SYSDATE system variable). To insert date values from an application using the default date format, a statement such as:
INSERT INTO table_name (column_name) VALUES ('01-JAN-95')
must be issued. Because there is no way to specify the time using the default date format, Oracle will store the default value of 12:00 A.M. If this statement is changed to:
INSERT INTO table_name (column_name) VALUES(SYSDATE)
the time is inserted accurately because SYSDATE shares the same internal representation. The date format can be altered for a session by issuing a statement such as:
ALTER SESSION SET NLS_DATE_FORMAT = 'MM-DD-YYYY HH:MI A.M.'
After this statement has been issued, times can be inserted accurately using the new format:
INSERT INTO table_name (column_name) VALUES('12-31-1994 11:59 P.M.')
Note that this format is valid only for the session. In any subsequent sessions, the statement:
SELECT column_name FROM table_name
will return 31-DEC-94 for the previously inserted value until the session is altered again.
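If the session format cannot be altered, the stored time can still be retrieved by supplying an explicit format mask, as in this illustrative query:

SELECT TO_CHAR(column_name, 'MM-DD-YYYY HH:MI A.M.') FROM table_name;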
Using the contact manager application example, the column attributes for the Addresses table might be defined as in Table 17.7.
Column | Attribute
ID | NUMBER(10)
Entity Type | NUMBER(10)
Entity ID | NUMBER(10)
Address Line 1 | VARCHAR2(40)
Address Line 2 | VARCHAR2(40)
Address Line 3 | VARCHAR2(40)
City | VARCHAR2(30)
Zip Code | NUMBER(5)
Address Type | NUMBER(10)
Last Update User ID | VARCHAR2(20)
Last Update Date/Time | DATE
Defining the column attributes is an important step in capacity planning. From this information, the maximum record size for each table can be determined. This, combined with an estimate of the total number of rows, helps determine the amount of storage required to house the data.
The next step is to begin creating Data Definition Language (DDL) scripts that will be used to create the tables. This may seem like a step toward implementation, but DDL can be used as a tool for capacity planning and the design of tablespaces and data
file layout. The DDL for creating tables consists of defining column attributes and constraints, storage specification, table constraints, and rules. Constraints and rules are discussed in detail in the chapter on enforcing integrity; therefore, for now,
the description of DDL focuses on column attributes and storage specification.
Referring back to the sample application, assume that it is estimated that 5,000 address records will be stored initially, and that the number of records is expected to double in the next several years. From the definition of the column attributes for
the address table, it is apparent that the maximum size of any single record is approximately 222 bytes. (Remember that DATE columns always require 7 bytes.) Assume further that the primary and foreign key IDs will be sequenced starting with 1, and that Address Line 2
and Address Line 3 are rarely populated. Based on this additional information, a conservative estimate would be that the average record size will be 200 bytes. The total size of the table can then be estimated at 1 MB initially. The script in Listing 17.1
can then be written with an appropriate storage specification.
CREATE TABLE Addresses (
     Address_ID            NUMBER(10) PRIMARY KEY
    ,Address_Type_ID       NUMBER(10) NOT NULL
    ,Entity_ID             NUMBER(10) NOT NULL
    ,Entity_Type_ID        NUMBER(10) NOT NULL
    ,Address_Line1         VARCHAR2(40) NOT NULL
    ,Address_Line2         VARCHAR2(40)
    ,Address_Line3         VARCHAR2(40)
    ,City                  VARCHAR2(30) NOT NULL
    ,Zip_Code              NUMBER(5) NOT NULL
    ,Last_Updt_User        VARCHAR2(20) NOT NULL
    ,Last_Updt_Timestamp   DATE NOT NULL
)
TABLESPACE Contact_Main
STORAGE (
    INITIAL     1M
    NEXT        100K
    MINEXTENTS  1
    MAXEXTENTS  100
    PCTINCREASE 10
);
Note that, although the columns have been rearranged and renamed, the physical table still reflects the logical relation, with a few exceptions. The foreign key constraints have been omitted. There are a number of ways to enforce the foreign key
constraints, through column constraints, table constraints, or other means. The issue of enforcing integrity through table and column constraints is revisited briefly later in this chapter, and all options will be discussed in full detail in the chapter on
enforcing database integrity.
The primary key constraint is an important consideration for a number of reasons. At this point, it is important to recognize that Oracle will automatically generate a unique index for this column. This should not be overlooked in capacity planning.
The tablespace Contact_Main does not exist yet, but scripts referencing the tablespace help determine its size. The STORAGE specification indicates that 1 megabyte will be allocated initially, that additional extents will start at 100 kilobytes, that there will be a minimum of 1 extent and a maximum of 100 extents, and that each extent will be 10 percent larger than the previous extent. Obviously, this specification will allow the table to grow to well over 2 megabytes, which is the planned maximum. Although storage specifications can be modified later using ALTER TABLE, in the initial design phase and for capacity planning purposes, it is usually best to allow for the maximum estimated size or more. This is based on the assumption that it is better to overestimate storage requirements than to underestimate them.
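If later monitoring shows that the estimate was too low, the growth parameters (though not INITIAL) can be adjusted. A minimal sketch, with values chosen only for illustration:

ALTER TABLE Addresses
    STORAGE (NEXT 200K MAXEXTENTS 200);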
After DDL scripts have been created for each of the tables, scripts can be written for the tablespaces that will house them. Assume that the sample application will store all application tables in the tablespace named Contact_Main. The data file(s)
created with the tablespace should be sufficiently large to contain all tables at their full size (this can be calculated based on the INITIAL, NEXT, and PCTINCREASE parameters in the STORAGE clause of each table). The script to create Contact_Main might
look like Listing 17.2.
CREATE TABLESPACE Contact_Main
    DATAFILE '/oradata/tables/main1.dat' SIZE 10M REUSE
    DEFAULT STORAGE (
        INITIAL     2K
        NEXT        2K
        MAXEXTENTS  100
        PCTINCREASE 0
    );
The DEFAULT STORAGE clause of CREATE TABLESPACE determines how space for tables will be allocated when tables are created without a STORAGE specification. Lookup tables are typically numerous and very small, so it may be desirable to create a default
storage that will be sufficient to handle all lookup tables. The STORAGE clause can then be omitted from those CREATE TABLE scripts.
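For example, a small lookup table might be created without its own STORAGE clause and simply inherit the tablespace default. The table and column names here are only a sketch:

CREATE TABLE Address_Types (
     Address_Type_ID   NUMBER(10) PRIMARY KEY
    ,Type              VARCHAR2(20) NOT NULL
)
TABLESPACE Contact_Main;
-- No STORAGE clause: extents are allocated using Contact_Main's DEFAULT STORAGE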
When designing the physical database, performance is an important consideration. There are numerous factors related to the design that will affect the overall performance of the database. These factors include the data model itself, indexing, rollback
and temporary segments, and the physical location of the data on the disks.
A factor that can heavily impact overall performance stems from the logical model. The degree of normalization in the model often comes down to a trade-off between flexibility and performance. In the example of normalization presented in the section on
the logical model, several relations were created that improved the flexibility of the model, as shown in Table 17.8.
Individual-Company Relation | Companies
Individual ID (FK) | ID (PK)
Company ID (FK) | Company
| Company Notes

Entity Type
ID (PK)
Type
Separating company information from the Individuals relation added a considerable amount of flexibility. This allowed individuals to be related to more than one company, and it allowed addresses and phones to be related to either an individual or a
company. Another nice feature of the normalized model is that it allows any number of phones and addresses to be related to an individual or a company as in Table 17.9.
Addresses | Phone Numbers
ID (PK) | ID (PK)
Entity Type (FK) | Entity Type (FK)
Entity ID (FK) | Entity ID (FK)
Address Line 1 | Phone Number
Address Line 2 | Phone Type (FK)
Address Line 3 | Last Update User ID
City (FK) | Last Update Date/Time
Zip Code (FK) |
Address Type (FK) |
Last Update User ID |
Last Update Date/Time |

Address Type | Phone Type
ID (PK) | ID (PK)
Type | Type
A less useful feature of the normalized model is the separation of city and state as in Table 17.10.
Cities | States
Zip Code (PK) | Zip Code (PK)
City (PK) | State
The end result is a very flexible (but possibly overly complex) data model. Assume, for example, that one of the primary uses of the database is to generate a listing of the names of contacts and their companies, addresses, and phone numbers. This is a
fairly complex query and illustrates a potential flaw in the model: Although addresses and phones can be related to either companies or individuals, there is nothing in the model that allows phones and addresses to be related to an individual at a company.
Assume that, as a workaround, a third entity type is created for an individual at a company, and a rule is imposed that the Individual ID is used for that entity where an entity's ID is required as a foreign key. The phone list can be generated under this
scenario, but it requires joining nine tables. Individuals must be joined to Addresses, Phone Numbers, and Individual-Company Relation, which must in turn be joined to Companies to get the company name, and Addresses must be joined to Cities and States. In addition,
Phone Numbers and Addresses must be joined to Phone Types and Address Types to get their respective standardized type descriptions. Although joining nine tables is not a particularly difficult task, if the database contains millions of Individuals, the
number of joins can have a very significant impact on performance. If this report is generated up-to-the-minute online, and the database has a high volume of transactions, the impact is further magnified.
Denormalization, the opposite of normalization, can be used to improve performance under these circumstances. By combining some of the relations, the number of joins can be reduced without sacrificing flexibility. In Table 17.11, the listed
denormalizations may be appropriate.
Addresses | Phone Numbers
ID (PK) | ID (PK)
Individual ID | Individual ID
Company ID | Company ID
Address Line 1 | Phone Number
Address Line 2 | Phone Type
Address Line 3 | Last Update User ID
City | Last Update Date/Time
State |
Zip Code |
Address Type |
Last Update User ID |
Last Update Date/Time |

Individual-Company Relation | Companies
Individual ID (FK) | ID (PK)
Company ID (FK) | Company
With this model, only seven tables must be joined to generate the list, and no flexibility is lost. Note that the foreign key constraints must be removed from Individual ID and Company ID on the Addresses and Phone Numbers tables because one or the other
might be NULL. Listing 17.3 demonstrates the SQL used to create the report after this denormalization.
SELECT First_Name, Middle_Init, Last_Name, Company_Name,
       F.Type, Address_Line1, Address_Line2, Address_Line3,
       City, State, Zip, G.Type, Phone_Number
  FROM Individuals A,
       Individual_Company_Relation B,
       Companies C,
       Addresses D,
       Phone_Numbers E,
       Address_Types F,
       Phone_Types G
 WHERE A.ID = B.Individual_ID
   AND B.Company_ID = C.ID
   AND B.Individual_ID = D.Individual_ID
   AND B.Company_ID = D.Company_ID
   AND B.Individual_ID = E.Individual_ID
   AND B.Company_ID = E.Company_ID
   AND D.Address_Type = F.ID
   AND E.Phone_Type = G.ID
Additional denormalization could improve performance further, but probably at the cost of flexibility.
The previous example can also be used to illustrate the importance of indexes. Indexes can be created on single or multiple columns, and may or may not be unique. When creating an index on multiple columns, the order in which the columns are declared is particularly important, because Oracle treats the values of such an index as an aggregate. The column that will be used most often should be declared first in a multicolumn index. In the previous example, the Individual_Company_Relation table is a prime candidate for an index. If both columns are indexed as an aggregate, the table itself should never need to be read. Individual_ID should be declared as the first column in the index because it is used in one more join than Company_ID. The DDL to create this table and its index might look like the script in Listing 17.4.
CREATE TABLE Individual_Company_Relation (
     Individual_ID   NUMBER(10) NOT NULL
    ,Company_ID      NUMBER(10) NOT NULL
)
TABLESPACE Contact_Main
STORAGE (
    INITIAL    50K
    NEXT       10K
    MAXEXTENTS 10
);

CREATE INDEX Indiv_Co
    ON Individual_Company_Relation (Individual_ID, Company_ID)
    TABLESPACE Contact_Index
    STORAGE (
        INITIAL    50K
        NEXT       10K
        MAXEXTENTS 10
    );
Note that the MINEXTENTS parameter to the STORAGE clause was not used. The default value of 1 is acceptable in most circumstances.
The creation of indexes should be planned very carefully, because improper use of indexes can have a damaging effect on performance. Even where indexes improve the performance of SELECT statements, they have a negative impact on INSERTs and UPDATEs,
because the indexes must be modified in addition to the tables.
The column attributes themselves play a role in performance as well. Wherever possible, integers should be used as keys because they can be compared faster than any other data type. Column and table constraints should be avoided because they must be
checked whenever a value is inserted or updated. Although these constraints are often necessary, integrity should be enforced by other means when it is possible to do so safely.
Rollback segments also play an important role in the overall performance of the database. As the name would imply, Oracle uses rollback segments as temporary storage for data needed to reverse a transaction. This data must be stored until the
transaction is committed. Rollback segments must be sufficiently large to store this data for all transactions occurring at a given time. If rollback segments are not large enough, transactions will fail.
To properly estimate the size of the rollback segments needed, the designer must know how many users will be submitting transactions, and the maximum size of the rows affected by a single transaction. In many large databases, transactions are initiated
by batch processes used to load and update data from external sources, or to create summary tables. These batch processes often generate much larger transactions than the user community, and should be considered when planning rollback segments. A rollback
segment, like other database objects, can be created with a script, as demonstrated in Listing 17.5.
CREATE PUBLIC ROLLBACK SEGMENT contact_rbs1
    TABLESPACE contact_rb_segs
    STORAGE (
        INITIAL    100K
        NEXT       100K
        OPTIMAL    500K
        MAXEXTENTS 100
    );

ALTER ROLLBACK SEGMENT contact_rbs1 ONLINE;
The OPTIMAL parameter of the STORAGE clause indicates that, as extents are deallocated when they are no longer needed, the rollback segment will not shrink below this value. In effect, this sets the minimum size of the rollback segment once that threshold has been reached.
Rollback segments are typically created in a separate tablespace. The size of the data files in this tablespace should be sufficient to hold the rollback segments at their maximum extents.
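The tablespace that houses these segments might be created first with a script along the following lines; the file name and sizes are illustrative assumptions:

CREATE TABLESPACE contact_rb_segs
    DATAFILE '/oradata/rbs/contact_rbs.dat' SIZE 20M REUSE
    DEFAULT STORAGE (
        INITIAL     100K
        NEXT        100K
        MAXEXTENTS  100
        PCTINCREASE 0
    );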
Another performance consideration relates to the creation of temporary segments. Temporary segments are similar to rollback segments, except that they are used to store result sets rather than transaction information. When a SELECT statement produces a result set that is too large to be stored in memory, a temporary segment is created to hold the results until the cursor is closed. Temporary segments may also be created by Oracle to store intermediate result sets for complex joins, sorts, or unions. As with rollback segments, these temporary segments must be sufficiently large to store this data, or SELECT statements may fail.
A tablespace for temporary segments must be assigned to each user explicitly. If none is assigned, the SYSTEM tablespace is used by default. It is preferable to create a separate tablespace for these temporary segments and assign it to users through the TEMPORARY TABLESPACE clause of the CREATE USER command. When designing temporary tablespaces, keep in mind any batch processes that may create large cursors. These, too, will require the use of temporary segments.
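A sketch of this assignment follows; the user name, password, and file name are assumptions for illustration:

CREATE TABLESPACE contact_temp
    DATAFILE '/oradata/temp/contact_temp.dat' SIZE 50M REUSE;

CREATE USER sales_rep01 IDENTIFIED BY initial_password
    DEFAULT TABLESPACE   Contact_Main
    TEMPORARY TABLESPACE contact_temp;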
It may be preferable to create separate segments (both temporary and rollback) for different groups of users, based on the transactions and result sets generated by different groups of users.
Other performance considerations relate to the physical layout of files on disk. Proper use of multiple disks and controllers, clustering, and striping can improve performance greatly in certain situations.
In the example on the creation of indexes (see Listing 17.4), notice that the table and the index were created in separate tablespaces. The example assumes that the tablespaces were created on separate disks, using separate controllers. Keeping indexes
on separate physical devices with separate controllers allows the index and the tables to be read almost simultaneously, and minimizes the movement of the read-write heads. In the sample SQL statement, this would allow the read-write head of one drive to
continue reading the index while a separate controller reads the Addresses and Phones tables to find the corresponding values. If the index were on the same disk, either the whole index would have to be read into memory before the table, or the heads would
have to move back and forth, reading part of the index and part of the table.
The use of separate controllers and disks also applies to rollback and temporary segments. In an ideal configuration, tables, indexes, rollback segments, and temporary segments would all be on separate disks using separate controllers. This
configuration would greatly improve overall performance, particularly for batch processes such as the creation of summary tables. In practice, however, this configuration is rarely possible. Regardless of the actual hardware configuration, the designer
should carefully consider how these data files will be accessed. The design should attempt to minimize the movement of read/write heads for the most common or mission-critical database operations.
Clustering is another way to improve performance through physical layout on the disk. It is sometimes advantageous to create an indexed cluster for a group of tables that are frequently joined. The index used to join the tables in the cluster must be
declared as part of the cluster. Tables created in the cluster must specify this index in the CLUSTER clause of the CREATE TABLE statement. When an indexed cluster is created, each value for the cluster index (or cluster key) is stored only once. The rows
of the tables that contain the clustered key value are physically stored together, as if already joined. This method minimizes the movement of read-write heads when accessing these tables and conserves disk space by storing the key value only once.
In the contact manager example, assume that the users will typically be contacting individuals by telephone, and that they will be retrieving this information much more frequently than they will be updating it. Clustering the Individuals and Phone
Numbers relations should be considered in this case. The DDL script in Listing 17.6 illustrates how this cluster might be created.
CREATE CLUSTER Individual_Phone_Numbers (Individual_ID NUMBER(10))
    SIZE 256
    STORAGE (
        INITIAL     1M
        NEXT        1M
        MAXEXTENTS  100
        PCTINCREASE 10
    );

CREATE TABLE Individuals (
     Individual_ID         NUMBER(10) PRIMARY KEY
    ,Last_Name             VARCHAR2(30) NOT NULL
    ,First_Name            VARCHAR2(20) NOT NULL
    ,Middle_Initial        CHAR(1)
    ,Last_Updt_User        VARCHAR2(20) NOT NULL
    ,Last_Updt_Timestamp   DATE NOT NULL
)
CLUSTER Individual_Phone_Numbers (Individual_ID);

CREATE TABLE Phone_Numbers (
     Phone_ID              NUMBER(10) PRIMARY KEY
    ,Individual_ID         NUMBER(10) NOT NULL
    ,Company_ID            NUMBER(10) NOT NULL
    ,Phone_Number          NUMBER(10) NOT NULL
    ,Phone_Type_ID         NUMBER(10) NOT NULL
    ,Last_Updt_User        VARCHAR2(20) NOT NULL
    ,Last_Updt_Timestamp   DATE NOT NULL
)
CLUSTER Individual_Phone_Numbers (Individual_ID);

CREATE INDEX Indiv_Phone ON CLUSTER Individual_Phone_Numbers;
In Listing 17.6, the SIZE keyword indicates the number of bytes expected to be needed to store the rows corresponding to one cluster key value; Oracle uses this value to determine how many cluster keys can be stored in each data block. In the example, because the cluster key is stored only once, it should take only about 245 bytes to store the rows for each cluster key, and the value has been rounded up to 256.
A second type of cluster, known as a hash key cluster, can also be utilized to improve performance. Rather than store rows based on a common indexed key, rows are stored together based on a common hash value, which is not physically stored in the
database. The hash value is calculated at run time, using either a user-defined function or Oracle's internal hashing function. Although this reduces physical disk accesses, hash key clustering should be used only on small tables, or tables that will have
a maximum size that is known at design time. This is because the number of hash keys is fixed at design time, and resolving collisions requires additional reads.
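A sketch of a hash cluster for a small lookup table follows; the names and sizes are assumptions chosen for illustration:

CREATE CLUSTER Contact_Type_Cluster (Contact_Type_ID NUMBER(10))
    SIZE 512
    HASHKEYS 100
    STORAGE (INITIAL 100K NEXT 50K MAXEXTENTS 10);

CREATE TABLE Contact_Types (
     Contact_Type_ID   NUMBER(10) PRIMARY KEY
    ,Type              VARCHAR2(30) NOT NULL
)
CLUSTER Contact_Type_Cluster (Contact_Type_ID);

-- Rows with the same hash value are stored together; no separate cluster index is created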
Clustering should not be overused because there can be a negative impact on performance when clustered tables are accessed separately. Insert, update, and delete operations on clustered tables will typically be slower as well. For these reasons, the
designer should carefully consider how tables will be used before creating a cluster.
Striping is a technique that consists of spreading a large database object across multiple disks. Performance can be greatly improved by striping, particularly when large tables are accessed by full table scans. The striping of a particular table can be
forced by creating a tablespace with multiple data files on separate disks, each of which is smaller than the table itself. To provide maximum control over how the table is striped, it should be the only object in the tablespace. Listing 17.7 provides one
example of how striping can be accomplished in Oracle.
CREATE TABLESPACE Individual_Stripes
    DATAFILE 'disk1/oradata/stripe1.dat' SIZE 100K REUSE,
             'disk2/oradata/stripe2.dat' SIZE 100K REUSE
    DEFAULT STORAGE (
        INITIAL     200K
        NEXT        200K
        MAXEXTENTS  100
        PCTINCREASE 0
    );

CREATE TABLE Individuals (
     Individual_ID         NUMBER(10) PRIMARY KEY
    ,Last_Name             VARCHAR2(30) NOT NULL
    ,First_Name            VARCHAR2(20) NOT NULL
    ,Middle_Initial        CHAR(1)
    ,Last_Updt_User        VARCHAR2(20) NOT NULL
    ,Last_Updt_Timestamp   DATE NOT NULL
)
TABLESPACE Individual_Stripes
STORAGE (
    INITIAL     90K
    NEXT        90K
    MINEXTENTS  2
    MAXEXTENTS  100
    PCTINCREASE 0
);
Obviously, a third extent cannot be allocated for this table, making this approach to striping a high-maintenance proposition. The size of the table must be continually monitored, and new data files must be added to the
tablespace when needed, using the ALTER TABLESPACE command with ADD DATAFILE. Although this approach requires additional maintenance, performance gains can be very significant, particularly if the disks have separate controllers. The
designer should consider the trade-offs carefully before recommending that a table be striped.
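Adding a stripe later might look like the following; the path is an assumption in the same style as Listing 17.7:

ALTER TABLESPACE Individual_Stripes
    ADD DATAFILE 'disk3/oradata/stripe3.dat' SIZE 100K REUSE;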
Capacity planning is important in ensuring that adequate storage is available for future growth. The DDL scripts for each database object are invaluable in determining the overall storage required by the database. In fact, the process of capacity
planning actually begins with the DDL scripts.
It starts with defining the column attributes. These column attributes, in turn, determine the size of each row in the table. The column attributes also determine the size of each row in indexes created on the columns. These attributes, combined with
the estimated total number of rows (including provisions for future growth), are used in defining the storage clause for tables and indexes. For purposes of capacity planning, it should be assumed that all objects will reach their maximum extents.
The next step is creating DDL for the tablespaces. The data file(s) created by these scripts should be large enough to hold all of the objects that the tablespace will house. When determining the size of the data files, it should be assumed that all objects
within the tablespace will reach their maximum size, as defined by the STORAGE clause in the DDL script for each object.
The total size of the database can then be determined by simply adding the sizes of the data files. Never assume, however, that this estimate is accurate and complete. Also, there are a number of additional considerations to be made.
In capacity planning, the designer must accommodate for unexpected growth. As a general rule of thumb, at least 25 percent (preferably 50 percent) of each disk should be free after the initial installation. This will allow additional data files to be
created wherever necessary if tables grow larger than expected. It is common to experience unanticipated growth, and it is even more common to underestimate the requirements for temporary and rollback segments, or to identify the need for additional
indexes after the initial design is complete.
The initial hardware configuration and data file layout should accommodate these possibilities. Capacity limitations can be crippling in the event that additional rollback segments or temporary segments cannot be created when needed.
For this reason, the importance of capacity planning should not be underestimated.
Designing a database is a methodical process. It begins with defining the requirements of the application. Entities and attributes are grouped into a logical data model that meets the business needs; this logical model can be represented graphically as an Entity-Relationship diagram. The entities and attributes are then translated into a physical model, in which data types, column attributes, and constraints are defined. Normalization is used to eliminate redundancy in the data, whereas denormalization selectively breaks the normalization rules to gain performance. A well-designed database is important for future expansion, and it also makes application programming considerably easier.