Instructions:

  • This assignment is meant to be completed individually or in pairs, under the Consultation model of collaboration as per the Computing Science Department Course Policies.
  • You must not upload binary files or large text files of any kind to your GitHub repository.
  • Your answers must be on the GitHub repository created by following the instructions on eClass.
  • That repository already has the folder structure for the assignment. DO NOT MODIFY that directory structure.
  • Remember to submit the URL of your repository through eClass before the deadline.

Learning Objectives

XML is a versatile markup language suitable for encoding almost any kind of data there is, from structured data (e.g., relations) to unstructured data (e.g., emails, books, etc.). XML is widely supported by commercial DBMSs out there and has found many applications outside the realm of database systems.

This assignment is intended to provide you with an opportunity to:

  1. Get an introduction to XML and two of its main query languages, XPath and XQuery.

Required Reading:

The following reading is required for you to complete the assignment:

  1. XQuery: A guided Tour. Chapter 1 of XQuery from the Experts: A Guide to the W3C XML Query Language. Follow the instructions on eClass to access this resource.

The following should be used as a guide and reference:

XQuery processor

You will use Zorba, a modern and highly optimized XQuery processor. It has been installed in the lab machines under the executable zorba. However, you are encouraged to download and install it on your laptop.

The TAs will grade your queries using Zorba.

Dataset

For this gentle introduction to XML and XQuery you are given a small dataset with data for 81 movies to practice your queries. Download it from eClass. Your submission will be graded on a much larger XML file (also available on eClass) with the same structure.

Tasks

Write XQuery scripts that answer the following 5 questions using the file large.xml. Commit your queries on GitHub with the appropriate filename. Each answer should go in a separate folder. Document your query in the README.md file inside that folder and add a video explaining your solution if you worked in a pair.

1 mark each:

q1/answer.xq – How many comedies with average ratings higher than 7 are there in the file?

q2/answer.xq – What is the average rating of all movies with Bill Murray?

q3/answer.xq – Who is in the cast of “Lost in Translation” and which characters did they play in the movie?

2 marks each:

q4/answer.xq – Who directed the most movies with the word “love” in the title? Your query must ignore the casing of the letters in the words.

q5/answer.xq – What is **P(R in USA PG in United Kingdom)? In other words, what are the odds that a movie is age-certified as **R in the USA given that it is age-certified as PG in the United Kingdom.

We are only interested in age certifications in countries, so all events of interest will mention a certification and a country. For example P(7 in Spain) denotes the probability that a movie has age-certification 7 in Spain.

Recall that **P(A B)** = P(A & B)/P(B) (see Conditional Probability).

For the purposes of this problem, define P(A) as the number of movies satisfying condition A divided by the total number of movies in an XML file. For example, if we use the small.xml, we will estimate that P(7 in Spain) = 1/81, since there is only one movie (Matilda of 1996) rated 7 in Spain, and there are 81 unique movies in that file. Similarly, P(A & B) will be the fraction of movies that satisfy A and B (over the total number of movies).

3 marks

Write an XQuery script that uses a user-defined function to compute the conditional probability as specified above, and uses that function to estimate the most likely certification in the United Kingdom for each of the certifications in Canada, using all movies in the large.xml file. The output of your program should look like this:

<country name="Canada">
  <certification name="PG">
   <country name="United Kingdom" certification="PG" probability="0.34"/>
  </certification>
  <certification name="R">
    <country name="United Kingdom" certification="18" probability="0.63"/>
  </certification>
  ...
</country>

Note that, for each certification in Canada, your script should include the (single) certificate in the United Kingdom that is most likely to correspond to the Canadian certification. Your script must also compute the actual probability.

The function must take five inputs: four strings country1, certification1, country2, and certification2, and a sequence of movie elements from which to compute the probabilities. This last parameter helps save time as, without it, the function would have to read the entire XML file each time it is called. Your function must return **P( country1 & certification1 country2 & certification2 )**.

For example, if used on the large.xml file, your function should compute (approximately):

  • **P(PG & United Kingdom PG & Canada)** = 0.34
  • **P(18 & United Kingdom R & Canada)** = 0.64

Write your answer in a file called q6/answer.xq and explain your solution in the README.md file. Solutions without explanations will not be accepted.

Other requirements

To get full marks in this problem your solution must meet the following requirements.

One: Each country name can only appear once in your script. This is because the TAs will test your code with other pairs of countries besides Canada and the UK. You can accomplish this using variables instead of country names and initializing them to constants:

let $country1 := 'United Kingdom'
let $country2 := 'Canada'

Two: Zero probabilities should be ignored. This would happen if there are no movies that satisfy both certification1 in country1 and certification2 in country2 at the same time.

Three: Ties should be reported. For example, based on the small.xml file, UK certifications 15 and 18 are tied (with 1/3 probability) for the most likely equivalent to the R certification in Canada.

Grading

The TA will download your submissions and test them with zorba on a lab machine, using the following command:

% zorba -i -f -n -q <qN/answer.xq>

Rubric

Grading

NOTE: students working individually do not need to submit videos. In those cases, the documentation portion of the grade will be based on the README.md files only.

Each query will be scored on a four-point scale on three criteria: correctness, clarity, and documentation as follows:

Scale Correctness Clarity Documentation
100 The query is named as specified, and, when run with zorba as indicated above, computes the correct output on any XML document called large.xml that is in the directory from which zorba is called, and all values are computed as required. The query does not perform any unnecessary steps and all data manipulation expressions and operations are appropriate. The query is generic and would compute the correct answer on any XML document with movies organized as in the examples provided. The query is formatted and indented properly; all necessary sub-queries are declared in let statements; all variables have readable and sensible names; all tests in where clauses are needed and sensible. The query has appropriate comments; the README.md file explains the intuition behind each step in the query; the video explains each step of the query instead of essentially reading the query.
70 The query is named as specified, and, when run with zorba as indicated above, partially computes the correct answer on any XML document called large.xml that is in the directory from which zorba is called; OR the query contains unnecessary steps and/or output OR some of the data manipulations are inappropriate. The query is generic and would compute the correct answer on any XML document with movies organized as in the examples provided. The query does not use let clauses when possible OR variables are not properly named OR query is not properly formatted. Comments and explanations (README.md and/or video) cover some but not all steps in the query.
40 The query is named as specified, and, when run with zorba as indicated above, correctly computes at least one of the steps necessary for the correct solution OR the query computes correctly only some of the elements expected answer OR the query has several unnecessary steps OR the query uses inappropriate data manipulations, even if otherwise correct. let clause is not used OR variables have confusing names OR query uses unnecessary complicated steps that could be replaced by simpler and clearer ones. Missing comments OR explanations (README and/or video) cover just a few of the steps in the query.
0 Query does not correctly compute any logical steps needed for the correct solution OR the query is not correctly named or the query does not refer to large.xml, even if otherwise correct. Missing. Missing.