cmput391 Winter 2021 Homework 2 -- OSM and Embedded SQL
Instructions:
- This programming assignment is meant to be completed individually or in pairs, under the Consultation model of collaboration as per the Computing Science Department Course Policies.
- You must not upload binary files of any kind, or large text files to your GitHub repository, especially XML files with map data or SQLite databases. This will slow down the uploading of your solution, which caused problems in the past, especially close to the deadline.
- Your solution will be graded and must run on any of the machines in the instructional labs, without the installation of libraries or tools not already there.
- Your answers must be on the GitHub repository created by following the instructions on eClass.
- That repository already has the folder structure for the assignment and the configuration file for automated testing. DO NOT MODIFY the directory structure nor the
.github/workflows/main.yml
file. - Remember to submit the URL of your repository through eClass before the deadline.
Learning Objectives
- Understanding the OpenStreetMap (OSM) data model.
- Extracting OSM data from a binary map file into an XML representation.
- Writing one or more programs to parse the XML representation and extract data of interest.
- Further practicing your knowledge of SQL.
- Becoming acquainted with embedded SQL programming with SQLite3 and its C programming API.
Part I – Loading OSM data into SQLite
The OSM data model has four elements:
- A node represents a unique point on the Earth’s surface; it has an
id
and a pair oflat/lon
coordinates. A node can be used to represent a park bench, a water fountain, a phone booth, etc. - A way is an ordered list of nodes, defining a path (called polyline in the OSM documentation) on the map.
- A path that starts and ends with the same node is said to be closed and represents an area on the map (e.g., a lake, a farm or a building).
- A relation is a multi-purpose data structure that groups together several elements (e.g., multiple nodes and ways).
- A tag is a key-value pair, used to describe an element. OSM has a predefined list of accepted tags, but users can add their own.
The minimum bounding rectangle (MBR) of a set S of points (or OSM node
elements) is the smallest rectangle such that no element of S falls outside of that rectangle. A node that lies on the boundary of the rectangle is considered to be inside the rectangle.
Q1 (5 marks)
In a sentence, your task is to write and document scripts and/or programs, in any language (as long as they run on the lab machines) to create a proper SQLite database with all nodes, paths and areas (closed paths) within the City of Edmonton, together with any tags associated with them, from an OSM map of Alberta.
You need to use the Osmosis command-line Java tool to process the binary OSM data file and produce XML files which are both human and machine readable. You will use map data encompassing the city of Edmonton. For this, you can download a recent map of Alberta from the Canadian section of the Geofabrik.de download site. You must extract all node
and all way
elements within the MBR containing all of the City of Edmonton. To find the coordinates of such MBR, use the official boundaries of the City of Edmonton described by approximately 8 thousand points (lat/lon coordinates).
Extracting Map Data With Osmosis
Download and install Osmosis in your laptop or your home directory in a lab machine or in your own computer. Run Osmosis to extract an XML dataset for the city of Edmonton, providing the MBR of the city as a command line parameter to Osmosis. Call your output file edmonton.osm
(note that an XML file need not have xml
as extension).
Read this to familiarize yourself with the command line parameters. You want to use the following ones (all in one line and in the order below):
--read-pbf <path to Alberta map>
--bounding-box bottom=... left=... top=... right=...
--write-xml edmonton.osm
Nodes
The XML file you created should have several thousand <node ...>
elements like this one:
<node id="651244271" version="5" timestamp="2015-04-06T21:43:41Z" uid="156121" user="Ranek" changeset="30027314" lat="53.5269944" lon="-113.5267783"/>
Such nodes are to be stored in a table
node (id integer, lat float, lon float)
Some of the node
elements in the XML file span multiple lines and have tag
elements nested within them (which you need to capture as well–see below):
<node id="869072261" version="1" timestamp="2010-08-19T17:54:42Z" uid="86952" user="xixi" changeset="5537291" lat="53.5919642" lon="-113.4798392">
<tag k="source" v="CanVec 6.0 - NRCan"/>
<tag k="addr:city" v="Edmonton"/>
<tag k="addr:street" v="90 Street North-west"/>
<tag k="addr:housenumber" v="13128"/>
</node>
NOTE about XML syntax: in XML, single-line node
elements are empty and hence can be closed succinctly in the same line (with />
). The multi-line elements, instead, are not empty and need to be closed explicitly withe a </node>
tag.
Paths and Areas
Paths and areas are encoded as way
elements. For example, Athabasca Hall looks like this:
<way id="336758065" version="3" timestamp="2017-09-06T18:01:32Z" uid="6511217" user="Arctic gnome" changeset="51789291">
<nd ref="651244271"/>
<nd ref="651244272"/>
<nd ref="3441794532"/>
<nd ref="651244273"/>
<nd ref="651244274"/>
<nd ref="3293394327"/>
<nd ref="3117153980"/>
<nd ref="651244254"/>
<nd ref="3441794539"/>
<nd ref="651244255"/>
<nd ref="651244258"/>
<nd ref="3117153985"/>
<nd ref="651244270"/>
<nd ref="651244271"/>
<tag k="building" v="university"/>
<tag k="name" v="Athabasca Hall"/>
<tag k="short_name" v="ATH"/>
</way>
Each <nd ...>
element refers to a node in the XML file identified by the ref
attribute. Also, these elements are logically ordered in the same way they appear in the file. Paths (open or closed) must be represented in the database inside two tables:
way (id integer, closed boolean)
waypoint (wayid integer, ordinal integer, nodeid integer)
Tags
Finally, tag
elements should be stored in two separate tables:
nodetag (id integer, k text, v text)
waytag (id integer, k text, v text)
Constraints
The following constraints must be enforced at all times. Your solution will be tested with insertions, deletions and tuple updates meant to violate them.
- Primary keys:
node(id)
andway(id)
- Foreign Keys:
waypoint(wayid)
is a foreign key referencingway(id)
waypoint(nodeid)
is a foreign key referencingnode(id)
nodetag(id)
is a foreign key referencingnode(id)
waytag(id)
is a foreign key referencingway(id)
. + Other constraintsclosed
inway
should be true if and only if the path is closed.- The
ordinal
values inwaypoint
form a dense ordering; that is: (1) each node is assigned unique ordinal between 1 and the number of nodes elements in the path. The ordinals should initially correspond to the order the nodes appear in the XML file.
Your solution must allow insertion, deletion and modification of tuples. Triggers that block all updates will be considered incorrect. Instead, your solution must recompute the ordering of the nodes and the value of the closed
attribute after each valid update, as appropriate, so that the constraints are not violated.
Notes:
- There may be loops in a way, even if that way is not closed (because its first and last nodes are not the same). You can leave those ways in the database.
- There may be ways with references to nodes outside of the MBR given to Osmosis. Apparently, this happens with ways that are partially within the MBR (e.g., imagine a road that starts within the Edmonton MBR but ends outside of that MBR): Osmosis leaves all such ways intact, even if some of the nodes referenced in the way are outside of the MBR and therefore do not have a corresponding
<node>
element in Osmosis’ output. You must remove all such references, to prevent foreign key violations.
Loading the XML Data into a SQLite Database
The final product of this part of the assignment is a database with the schema described above. Below we give a suggestion of how to do that. You are free to choose a different approach, and of course, you can discuss it with your TA beforehand. Regardless of the design you choose, you are free to either write programs that produce CSV or TSV files as output and load them by hand into SQLite or you can write programs that connect to SQLite and populate the tables with Embedded SQL.
NOTE: due to memory restrictions, you are NOT allowed to load the entire XML file in memory using DOM. Instead, we suggest you scan the file element by element.
The proper way to do this is using the Simple API for XML (SAX) for this purpose. However, the XML file you are dealing with here is so simple that a less elegant alternative can work, which is to parse the file line by and decide what to do after each line. You should never do this for arbitrary XML files where you do not know a priori how many levels of nesting are used.
Naming convention
The test scripts for the other questions assume that there will be a SQLite database on a file called edmonton.db
inside the q1
folder. If you do not follow this convention, you need to change the test scripts in every question.
What will be graded?
Your TA will grade the README.md
file inside the q1
folder, and the .gitignore
file in your repository. README.md
should have clear instructions to build the database, while .gitignore
must have the path to all XML or SQLite files mentioned in your solution.
The TAs will test your program with insertions, deletions and tuple updates meant to violate the constraints above.
Grading–Q1
Marks | Correctness (60%) | Clarity (20%) | Documentation (20%) |
---|---|---|---|
100% | The instructions are correct and complete (the TA can create the database by following exactly those instructions and nothing else) AND all constraints are implemented correctly and handle insertions, deletions and modifications of existing tuples. | The instructions are properly formatted and clearly written, using proper terminology AND the DDL statements are properly formatted and contain comments as needed. | The README file explains what each step in the instructions does; the video explains what each step of the instructions does. |
70% | The instructions are correct but some small steps, even if obvious, are missing but the TA can build a database by guessing what is missing without the help of the student OR all primary key and foreign keys are implemented but at least one trigger-based constraint is either missing or incorrect. | The instructions are not clearly formatted on the README.md file OR the DDL is not properly formatted and documented. | Comments and explanations (README and/or video) cover some but not all steps in the instructions. |
40% | The instructions are incomplete and many steps are missing but the TA can build the database without asking for clarifications from the student OR at least one step in the instructions is incorrect but the TA can fix it on their own OR all primary key and foreign key constraints are implemented but two trigger-based constraint are either missing or incorrect. | Instructions are confusing or contradictory; necessary parameters are missing OR DDL is not properly formatted and/or comments are missing. | Explanations (README and/or video) cover just a few of the steps. |
0% | The TA cannot build the database without contacting the student for help OR no trigger-based constraint is properly implemented. | Missing. | Missing. |
Part II – Embedded SQL
Required reading:
Before you write any code:
- Read all of An Introduction To The SQLite C/C++ Interface very carefully.
- Read the following documentation:
- Database Connection Handle
- Prepared Statement Object, paying especial attention to the part explaining life-cycle of a prepared statement object
- Result Values From a Query
- Binding Values To Prepared Statements
- Create or Redefine SQL Functions
Specifications
For each question, you are asked to write a C program (no exceptions or substitutions allowed) that follows the life-cycle described here to accomplish the following tasks. Your programs should work on any SQLite database with the schema as inPart I. Your TA will test your code on multiple databases conforming to that schema and also on invalid inputs.
Input/Output
All input to the programs must be done via command line arguments in the order specified in the question, or in text files. A tag given as command line argument will always be a single string in the form key=value
(i.e., a single entry in the argument list). For example, a tag could be wheelchair_accessible=yes
.
The output of the program must be as specified and match the unit test code provided.
Q2 (3 marks)
Write a C program, in a file called q2/src/solution.c that takes as input: the database file and two node
identifiers, in this order, and prints to STDOUT
their geographical distance in meters, computed by a suitable function from here and links therein.
The distance must be computed by a new user-defined SQL function, and your program must use only a single query which must use that function. Add an explanation to the README.md
file for this question justifying your choice of distance function. Your explanation will count for the documentation marks towards your grade in this question.
Output: your program must print to STDOUT
the geographical distance (i.e., a single number) or the word error
(in a single line) if the parameters are incorrect (e.g., the database file is missing or the number of node ids is incorrect, or some id is missing from the database).
Q3 (1.5 mark)
Write a C program, in a file called q3/src/solution.c that takes as input the database file and a list of strings of the form key=value
; finds every node
in the database having at least one tag matching a key/value combination from the input list; and prints to STDOUT
: the number of such node elements, as well as the largest pairwise distance among those nodes.
The maximum distance must be computed by a single SQL query that uses the function you created for Q2. (HINT: use CTEs.)
Output: your program must print to STDOUT
two numbers separated by a space or a tab, or the word error
(in a single line) if the parameters are incorrect.
Q4 (1.5 marks)
Write a C program, in a file called q4/src/solution.c that takes as input the database file and a list of strings of the form key=value
, and finds every way
in the database having at least one tag matching a key/value combination from the input list; and prints to STDOUT
: the number of such paths, and the length of the longest such path, computed by a single SQL query as in Q3.
All lengths must be computed in SQL, as in Q2/Q3.
Output: your program must print to STDOUT
two numbers separated by a space or a tab or the word error
(in a single line) if the parameters are incorrect.
Q5 (2 mark)
Write a C program, in a file called q5/src/solution.c that takes as input a database file and a tsv
file containing zero or more lines, each describing a node
to be inserted into the database.
In that file, each node must be described by at least three columns: (0) the node id, (1) the latitude, (2) the longitude. Subsequent columns in the file correspond to strings of the form key=value
providing tags for the node, and must be inserted accordingly.
Output: your program must print to STDOUT
the word success
on a single line if the execution was successful or error
on a single line followed by the SQLite error message on subsequent lines, in case the input is invalid or some constraint is violated.
All or nothing: your program must leave the database unchanged if there is an error on any of the nodes described in the input file.
Q6 (2 marks)
Write a C program q6/src/solution.c that takes as input a database file and a tsv
file containing zero or more lines describing zero or more way
elements, which are to be inserted into the database. The format of the input TSV is as follows:
- Each
way
element is described by two consecutive non-blank lines (i.e., blank lines are used to separateway
s) - The first such line has the id of the
way
in column 0, followed by zero or more strings of the formkey=value
in subsequent columns, with tags for the way. - The second line has all the nodes (identifiers) in the way, with the column number corresponding to the order of the node.
Output: your program must print to STDOUT
the word success
on a single line if the execution was successful or error
on a single line followed by the SQLite error message on subsequent lines, in case the input is invalid or some constraint is violated.
All or nothing: your program must leave the database unchanged if there is an error on any of the ways or any of the nodes described in the input file.
Grading
Each program will be scored on a four-point scale on three criteria: correctness, clarity, and documentation as follows:
Marks | Correctness (60%) | Clarity (20%) | Documentation (20%) |
---|---|---|---|
100% | The program computes exactly the expected answer following all restrictions specified for the question: its output is formatted as expected, no errors or warnings are raised during the tests on Github and all values are computed as required. No unnecessary steps are done by the program or SQL statements in it, and all data manipulation expressions and operations are appropriate. The program is generic and would compute the correct answer on any legal instance of the schema. | The program and SQL statements it contains are formatted and indented properly. The program is broken down into functions and modules as needed; all necessary sub-queries are declared in CTEs; all tuple variables have readable and sensible names; all tests in WHERE clauses are needed and sensible. | The program and SQL statements in it have appropriate comments; the README file explains the intuition behind each step in the query; the video explains each step of the program/query instead of essentially reading them. |
70% | The program computes the expected answer and does not raise errors or exceptions during the tests on Github, but it contains unnecessary steps OR some of the data manipulations are inappropriate. The program is generic and would compute the correct answer on any legal instance of the schema. | The program is not modularized or the SQL statements do not use CTEs when possible OR tuple variables are not properly named OR the code or SQL statements are not properly formatted. | Comments and explanations (README and/or video) cover some but not all steps in the program. |
40% | The program correctly computes at least one of the steps necessary for the correct solution OR the program computes correctly only some of the tuples in the expected answer OR there are unnecessary steps or inappropriate operations in the program or its SQL statements, even if otherwise correct. | CTEs are not used OR tuple variables are confusing OR query uses unnecessary complicated steps that could be replaced by simpler and cleared ones. | Missing comments OR explanations (README and/or video) cover just a few of the steps in the query. |
0% | Program does not correctly compute any logical steps needed for the correct solution. | Missing. | Missing. |