The data we will use for this assignment consist of two CSV files: Friends and Ratings. The former
contains friendship relationships as tuples of the form UserID1 UserID2, denoting two users who are
also friends. A user can have one or more friends and the friendship relationship is symmetric: if A is a
friend of B, then B is also a friend of A and both tuples (A B and B A) are present in the file. Ratings
contains user ratings as tuples of the form UserID MovieID Rating. For example, tuple 12 3 4
means that “the user with ID 12 gave 4 stars to the movie with ID 3”.
Hint #1: You can use Python’s CSV reader to parse input files.
Hint #2: Consider encapsulating your Python tuples in ATuple objects (see code skeleton).
- TASK I: Implement backward tracing (credits: 40/100)
The first task is to extend the operators you built in Assignment 1 with support for backward tracing. For
each operator, you will have to implement a new method (in Python 3 syntax):
lineage(tuples: List[ATuple]) -> List[List[ATuple]]
that returns the lineage of the given list of tuples.
As discussed in Lecture 2, the lineage of an output tuple, let t, with respect to a query q(D) is the
collection of input tuples that contributed to having the tuple t in the output of the query. Let
recommendation be the output (movie id) of the second query from Assignment 1:
SELECT R.MID
FROM ( SELECT R.MID, AVG(R.Rating) as score
FROM Friends as F, Ratings as R
WHERE F.UID2 = R.UID
AND F.UID1 = ‘A’
GROUPBY R.MID
ORDERBY score DESC
LIMIT 1 )
To successfully complete this task, you must implement a new method for ATuple:
lineage() -> List[ATuple]
so that you can retrieve the lineage of any recommendation as follows:
lineage = recommendation.lineage()
3
Calling recommendation.lineage() should internally call:
operator.lineage(tuples=[recommendation])
where operator is a handle to the operator that produced the tuple recommendation (i.e. the root
operator of the query tree).
Example: Let’s assume that the recommendation query returns a recommendation r=ATuple(10)
(where 10 is the movie ID) to user 0 and this recommendation was based on the following input tuples:
(in Friends.csv) (in Ratings.csv)
0 1 1 10 5
0 4 4 10 8
0 18 18 10 2
In this case, calling r.lineage()should return the following list of tuples as the lineage of r:
[(0, 1), (0, 4), (0, 18), (1, 10, 5), (4, 10, 8), (18, 10, 2))]
Hint #1: This task requires modifying each operator so that it maintains some intermediate data in
memory (input-to-output mappings). Try minimizing the amount of intermediate data by leveraging
knowledge about the operator logic.
Hint #2: Use the track_prov flag to enable and disable materialization of intermediate data.
Hint #3: Make sure the input to the operator.lineage method is a list of tuples all of which
exist in the output of the respective operator.
Hint #4: Calling lineage on an operator with a list of tuples (recommendations) as input must
return the lineage of these tuples as a list of lists:
[ [lineage_of_tuple_1], [lineage_of_tuple_2], … ]
Sample Solution