CSE-4/562 Spring 2019 - SQL & Physical Layout

SQL & Physical Layout

CSE-4/562 Spring 2019

January 30, 2019

Textbook: Ch. 2.3, 6.1-6.4, and 13.5-13.7

SQL

  • Developed by IBM (for System R) in the 1970s.
  • Standard used by many vendors.
    • SQL-86 (original standard)
    • SQL-89 (minor revisions; integrity constraints)
    • SQL-92 (major revision; basis for modern SQL)
    • SQL-99 (XML, window queries, generated default values)
    • SQL 2003 (major revisions to XML support)
    • SQL 2008 (minor extensions)
    • SQL 2011 (minor extensions; temporal databases)

A Basic SQL Query


            SELECT  [DISTINCT] targetlist
            FROM    relationlist
            WHERE   condition
    
  1. Compute the $2^n$ combinations of tuples in all relations appearing in relationlist
  2. Discard tuples that fail the condition
  3. Delete attributes not in targetlist
  4. If DISTINCT is specified, eliminate duplicate rows

This is the least efficient strategy to compute a query! A good optimizer will find more efficient strategies to compute the same answer.

Example Data

SELECT * FROM Trees;

Wildcards (*, tablename.*) are special targets that select all attributes.

CREATED_ATTREE_IDBLOCK_IDTHE_GEOMTREE_DBHSTUMP_DIAMCURB_LOCSTATUSHEALTHSPC_LATINSPC_COMMONSTEWARDGUARDSSIDEWALKUSER_TYPEPROBLEMSROOT_STONEROOT_GRATEROOT_OTHERTRNK_WIRETRNK_LIGHTTRNK_OTHERBRNCH_LIGHBRNCH_SHOEBRNCH_OTHEADDRESSZIPCODEZIP_CITYCB_NUMBOROCODEBORONAMECNCLDISTST_ASSEMST_SENATENTANTA_NAMEBORO_CTSTATELATITUDELONGITUDEX_SPY_SP
'08/27/2015'180683348711'POINT (-73.84421521958048 40.723091773924274)'30'OnCurb''Alive''Fair''Acer rubrum''red maple''None''None''NoDamage''TreesCount Staff''None''No''No''No''No''No''No''No''No''No''108-005 70 AVENUE''11375''Forest Hills'4064'Queens'292816'QN17''Forest Hills'4073900'New York'40.72309177-73.844215221027431.14821202756.768749
'09/03/2015'200540315986'POINT (-73.81867945834878 40.79411066708779)'210'OnCurb''Alive''Fair''Quercus palustris''pin oak''None''None''Damage''TreesCount Staff''Stones''Yes''No''No''No''No''No''No''No''No''147-074 7 AVENUE''11357''Whitestone'4074'Queens'192711'QN49''Whitestone'4097300'New York'40.79411067-73.818679461034455.70109228644.837379
'09/05/2015'204026218365'POINT (-73.93660770459083 40.717580740099116)'30'OnCurb''Alive''Good''Gleditsia triacanthos var. inermis''honeylocust''1or2''None''Damage''Volunteer''None''No''No''No''No''No''No''No''No''No''390 MORGAN AVENUE''11211''Brooklyn'3013'Brooklyn'345018'BK90''East Williamsburg'3044900'New York'40.71758074-73.93660771001822.83131200716.891267
'09/05/2015'204337217969'POINT (-73.93445615919741 40.713537494833226)'100'OnCurb''Alive''Good''Gleditsia triacanthos var. inermis''honeylocust''None''None''Damage''Volunteer''Stones''Yes''No''No''No''No''No''No''No''No''1027 GRAND STREET''11211''Brooklyn'3013'Brooklyn'345318'BK90''East Williamsburg'3044900'New York'40.71353749-73.934456161002420.35833199244.253136
'08/30/2015'189565223043'POINT (-73.97597938483258 40.66677775537875)'210'OnCurb''Alive''Good''Tilia americana''American linden''None''None''Damage''Volunteer''Stones''Yes''No''No''No''No''No''No''No''No''603 6 STREET''11215''Brooklyn'3063'Brooklyn'394421'BK37''Park Slope-Gowanus'3016500'New York'40.66677776-73.97597938990913.775046182202.425999
... and 683783 more

            SELECT tree_id, spc_common, boroname
            FROM Trees
            WHERE boroname = 'Brooklyn'
    

In English, what does this query compute?

What is the ID, Commmon Name and Borough of Trees in Brooklyn?

TREE_IDSPC_COMMONBORONAME
204026'honeylocust''Brooklyn'
204337'honeylocust''Brooklyn'
189565'American linden''Brooklyn'
192755'London planetree''Brooklyn'
189465'London planetree''Brooklyn'
... and 177287 more

      SELECT latitude, longitude 
      FROM Trees, SpeciesInfo
      WHERE Trees.spc_common = SpeciesInfo.name
        AND SpeciesInfo.has_unpleasant_smell = 'Yes';
    

In English, what does this query compute?

What are the coordinates of Trees with bad smells?

LATITUDELONGITUDE
40.59378755-73.9915968
40.69149917-73.97258754
40.74829709-73.98065645
40.68767857-73.96764605
40.739991-73.86526993
... and more

      SELECT Trees.latitude, Trees.longitude 
      FROM Trees, SpeciesInfo
      WHERE Trees.spc_common = SpeciesInfo.name
        AND SpeciesInfo.has_unpleasant_smell = 'Yes';
    

... is the same as ...


      SELECT T.latitude, T.longitude 
      FROM Trees T, SpeciesInfo S
      WHERE T.spc_common = S.name
        AND S.has_unpleasant_smell = 'Yes';
    

... is (usually) the same as ...


      SELECT latitude, longitude 
      FROM Trees, SpeciesInfo
      WHERE spc_common = name
        AND has_unpleasant_smell = 'Yes';
    

Expressions


            SELECT tree_id, 
                   stump_diam / 2 AS stump_radius,
                   stump_area = 3.14 * stump_diam * stump_diam / 4
            FROM Trees;
    

Arithmetic expressions can appear in targets or conditions. Use ‘=’ or ‘AS’ to assign names to these attributes. (The behavior of unnamed attributes is unspecified)

Expressions


  SELECT tree_id, spc_common FROM Trees WHERE spc_common LIKE '%maple'
    
TREE_IDSPC_COMMON
180683'red maple'
204325'sycamore maple'
205044'Amur maple'
184031'red maple'
208974'red maple'

SQL uses single quotes for ‘string literals’

LIKE is used for String Matches

%’ matches 0 or more characters

Union


    SELECT tree_id FROM Trees WHERE spc_common = 'red maple'
    UNION [ALL]
    SELECT tree_id FROM Trees WHERE spc_common = 'sycamore maple'
    

Computes the set-union of any two union-compatible sets of tuples

Adding ALL preserves duplicates across the inputs (bag-union).

Aggregate Queries


    SELECT [DISTINCT] targetlist
    FROM relationlist
    WHERE condition
    GROUP BY groupinglist
    HAVING groupcondition
    

The targetlist now contains (a) Grouped attributes, and (b)Aggregate expressions.

Targets of type (a) must be a subset of the grouping-list

(intuitively each answer tuple corresponds to a single group, and each group must have a single value for each attribute)

groupcondition is applied after aggregation and may contain aggregate expressions.

Aggregate Queries


    SELECT spc_common, count(*) FROM Trees GROUP BY spc_common
    
SPC_COMMON COUNT
''Schubert' chokecherry' 4888
'American beech' 273
'American elm' 7975
'American hophornbeam' 1081
'American hornbeam' 1517
... and more

Physical Layout


  from re import split;

  with open('Trees.csv', 'r') as f:
    for line in f:
      fields = split(",", line);
      if(fields[30] == 'Brooklyn'):
        print(fields[0]);
    

Record Layouts

Record Layout 1: Fixed

Record Layout 2: Delimiters

Record Layout 2: Headers

Record Formats

Fixed
Constant-size fields. Field $i$ at byte $\sum_{j < i} |Field_j|$
Delimited
Special character or string (e.g., ,) between fields
Header
Fixed-size header points to start of each field
 
 

File Formats

Fixed
Constant-size records. Record $i$ at byte $|Record| \times i$
Delimited
Special character or string (e.g., \r\n) at record end
Header
Index in file points to start of each record
Paged
Align records to paging boundaries
openclipart.org
File
A collection of pages (or records)
Page
A fixed-size collection of records
Page size is usually dictated by hardware.
Mem Page $\approx$ 4KB   Cache Line $\approx$ 64B
Record
One or more fields (for now)
Field
A primitive value (for now)

  with db_open('Trees') as data:
    for record in data:
      if(record['BORONAME'] == 'Brooklyn'):
        print(record['TREE_ID']);