Course Project Information


Project Overview


As a team of five (or six) programmers, you will assist in the implementation of a data processing system that includes its own search engine. The teams will be chosen in the third week of class.


Each member of the team is to code one or more functional areas of the project. When complete, the individual components will be integrated into one program, which will be demonstrated at the end of the quarter. While the final result is one integrated program, each team member’s work is to be maintained as separate file(s). If this is not possible, clearly document each function including the name of the developer.



Data and Topic Requirements


Your team must select one of the below topics as the focus of your project. Under special circumstances, and with the permission of your instructor, you may select a different topic:

  • Songs
  • Poetry
  • Blog posts
  • Recipes

Within your topic, you will need to select a focus or genre. Some examples of genres are as follows:

  • Songs: Rap, or 1940s hits, or Pink Floyd songs, etc
  • Poetry: Haiku, or Love Poems, or Elizabethan poetry, etc...
  • Blog posts: Travel blogs, or tech blogs, or cooking blogs, etc.
  • Recipes: French cooking, or vegan cooking, or desserts, etc...

Note that no other team should have your topic AND genre.

Once you have selected a topic and genre, your team will be responsible for finding a collection of data that falls within the specifications of your topic. The data collection is of your own choosing, with the requirement that each record in the system must contain a unique key (must be a string), and at least three non-key fields, one of which must be the full text of the entry for that particular record. For instance, your team might select a collection of poems with the following attributes:


Title + Author– unique key

Year

Country of origin

Full Text of poem


Your collection must have a minimum of 15 records with which to seed your project. Users of your system must be able to upload additional records contained in additional files.


Once you have gathered your collection of data, you will be required to create a class whose fields are the attributes of this data set. For example:


class Poem {

    string title;

    string author;

    int year;

    string country;

    string text;

}


Each record within your data set should be an instance of this class.



Menu Requirements


The goal of this project is to create an interactive data processing system, with a menu of options provided to the user. Your project must display the following options for the user:


- Add new data

- Delete data

- Search (has a sub-menu)

- Find and display one record using the primary key

- Find and display records using keywords (your search engine)

- List data sorted by primary key

- Write data to a file

- Statistics (When this option is selected, your program must display at least 3 different statistics of your choice about the data)

- Quit


At the end of the program, the file is to be automatically written to a disk file. This feature is in addition to the menu's write file option.



image shows a touch screen of menu options

image source



Required Data Structures and Additional Project Specifications


The system’s data structure must contain a Binary Search Tree, sorted according to the primary key of your choice.


The system must also contained two hash tables, which will be used for the purposes of the search engine. The first hash table will be used to store the inverted index (see discussion below) and must resolve collisions using separate chaining, where the chains are binary search trees sorted by the title. See the below section on the search engine for more information. The second hash table will be used to create a one-to-one mapping of words to numerical ids (see discussion below).


Part of the project will require you to draw a UML class diagrams for your data structures and other classes.


image shows the names of some popular search engines

image source



Search Engine Requirements


The heart of this project is to create a search engine. The purpose of your search engine is to allow users to search via keyword for entries in your system. For example, if the user types in the word "love" to the search engine, the title of each entry where the word "love" appears in the text should be displayed to the user. The user should then be able to select the entry of his or her choice to view additional information.


For example, if the given topic is Pink Floyd Songs and the user types in the word "time" to your search engine, your system should display an alphabetically sorted list of song titles. The screen might display an output like the following:


The following songs contain the word "time":


1. Breathe

2. Money

3. The Great Gig in the Sky

4. Time


To view more information about any of these songs, enter the number next to the song title:


The search engine will build what is known as an inverted index - a well-known technique from a subfield of computer science known as information retrieval. You will doubtless hear more about concept of the inverted index as you continue your study of computer science. In fact, most modern search engines, including Google, build some form of inverted index in order to display results to a user.


What is an index? If you flip to the back of your text book, you will see one example of an index, where the keywords in the textbook are mapped to page numbers where you can find those words.


What is an inverted index? According to Wikipedia: "In computer science, an inverted index ... is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents (named in contrast to a Forward Index, which maps from documents to content)."


In the example above, your textbook provides a forward index in the back of the book, where topics are mapped to page numbers. In contrast, we will be creating an inverted index for our search engine by mapping words in to titles (of songs, poems, blog posts, or recipes depending on your topic). In the example above using Pink Floyd, the word "time" was mapped to four song titles.


Building an inverted index like this one allows for greater speed when searching for an entry as, once the index has been built, the search engine does not have to scan through all of the text entries looking for matches each time a user makes a query.


image source



Your search engine is required to follow these steps:


Step 1: Continuing with our example, the lyrics of the songs Breathe, Time, and Money, as well as the other Pink Floyd songs in the collection, should each be placed in a separate file. Each file should follow the same format. For example, perhaps the first line will always have the name of the song, the second line will always be the artist, the third line will contain the year and the rest of the lines will contain the lyrics.


Step 2: Remove any words that would not be considered useful, such as as any articles (a, the), prepositions (of, to), conjunctions (and, but, nor) and common adjectives such as some, any, very. It is up to you which words you remove from your collection. I will be testing your search engine by using some common words such in the examples above to see if you have removed them. You should remove a minimum of 20 non-useful words from these files.


Step 3: Create a hash table (you will need to estimate the size), where the index of each bucket should be a unique word in your files. The basic idea is as follows.


For ALL the unique words 1 through n contained in your data files:


HashTable[word1]

HashTable[word2]

HashTable[word3]

...

HashTable["love"]

....

HashTable["time"]

...

HashTable[wordn]

...


Thus, each unique word contained in one or more of your files, should be the index of one bucket in the hash table. For more information, see the below discussion about using strings as array indices.



Step 4: Go through your files word-by-word, and, for each word, insert the title (or - even better- insert the song, poem, recipe, or post object) into the chain at that word's index in the hash table. Each chain should be a binary search tree.



HashTable[word1]-> bst containing SongTitle4 and SongTitle6

HashTable[word2]-> bst containing SongTitle2, SongTitle4, and SongTitle5

HashTable[word3]-> bst containing SongTitle1

...

HashTable["love"]-> bst containing "Love Scene (Version 6)"

....

HashTable["time"] -> bst "Breathe", "Money", "The Great Gig in the Sky", and "Time"

...

HashTable[word1000]-> bst containing SongTitle1, SongTitle6, SongTitle8, SongTitle24, etc.

...


Note that you should not store duplicate titles within one bucket even if the same word appears more than once inside of the same text.



More Information About How to Create an Index from a Word


Part of the complexity of this project will be to decide how to use a word as an index of an array, which is not possible to do directly in C++, as strings cannot be used as the indices of arrays in this language. My suggestion is to assign each word an integer id, and then use the id as the index of the bucket in your hash table.


Using this approach, you will also need a way to store the information about which word has been assigned which integer id so that you do not accidentally assign the same word two different ids. It is important to keep in mind that each time you retrieve a word from one of your files, you will need to determine first whether it has been assigned an id yet, and then proceed accordingly. Thus, you will need to use an additional hash table to store the information about which id has been assigned to which word.


I strongly encourage you to make use of your knowledge of classes and objects here. You should consider creating a class like the following to bundle words and their ids together:


class wordID {

    string word;

    int id;

}


Then, each time you assign an id to a word, you can create a new instance of the wordID class and store this object in your hash table that contains the word-id information.



This project is intended to be challenging. If the ideas contained in this section seem confusing, I recommend that you do more research into the concepts behind an "inverted index"and then come to your professor's office hours with any questions.



Additional Resources


  • A good definition of inverted index can be found here
  • A conceptual description of an inverted index, with examples can be found here
  • A video description of inverted index can be found here
  • You can find out more about indexing by watching the video series starting here.



Expectations for Team Members


Your grade for the project will be a factor of the team score weighted by a peer evaluation.


In addition to writing and debugging code, each member of the team will:

  1. Give suggestions for the application data: think about original/interesting/educational data that matches the program requirements. During the first team meeting, every student will bring his proposal, then the team will assign a preference number to each proposal (1 – the first choice, etc.); the final decision will be made by the instructor (no two teams should process the same data). 
  2. Participate in designing a solution to the problem: Data Diagrams, Structure Charts, UML Diagram
  3. Participate in writing the project reports. 
  4. Provide relevant test cases for the final presentation. 
  5. Individually test his/her part of the program (as much as possible). This implies writing test drivers (code that will not be included in the final project, but it will be used to test parts of the project).
  6. Participate in all of the team meetings. Come prepared to the meetings. If a team member must be absent from a meeting, he/she should inform the other team members in advance; also he/she could keep in touch by phone or email. 
  7. Turn in his/her part of the project on the scheduled date. 
  8. Work with the team coordinator and other members in the team to integrate his/her part of the code. 
  9. Prepare for and participate during his/her team presentation. 
  10. Actively attend all of the presentation sessions. (Ask questions/provide answers). 
  11. Write the peer evaluations.


image spelling out the word team
image source

Suggested Team Assignments


Each member of the team must write at least one unit of code. If the team is small, team members may have to write multiple units.


Team Member 1: Pro Team Coordinator: Data structure design coordination, main(), Integration, Testing, Project presentation coordination, making sure reports submitted on time, setting up meetings, keeping everyone on track, interfacing with the instructor about any issues that arise.


Team Member 2: BST Specialist: Implement BST Insert, Delete, List, Search by Primary Key.


Team Member 3: Search Engine Whiz 1: Create the Hash Tables and BST for the Search Engine.


Team Member 4: Search Engine Whiz 2: Create the Hash Tables and BST for the Search Engine.


Team Member 5: Screen Output Expert: Create user interface. Connect menu of options to the data structures. Work with Team Coordinator to integrate individual data structures into a unified system.



Project Reports


The following reports are required. The due dates will be announced in class, as well as written on the course schedule. These are as follows:




Peer Evaluation


Each member of the team is to rate the contribution of the other team members using the Peer Evaluation Form. These evaluation forms are to be turned in individually. There will be two (2) peer evaluations in total. 


The peer evaluations are an opportunity for any team member who is not pulling his or her weight to make improvements. Therefore, the score for the first peer evaluation will count towards one quarter (1/4) of each team member's score, while the second peer evaluation will count towards three quarters (3/4) of each team member's score.


Any member who does not complete a peer evaluation will receive a five (5) points deduction from his/her score for the project, two and a half (2.5) for each missing evaluation. 


Note that you are not to rate yourself. 


These evaluations will be used to determine a tentative score for each individual on the project. You will receive an average of the scores assigned by your team mates. The instructor, however, is responsible for marking the final determination of each person’s grade on the project and will use the teams’ score only as one input in the grading process. See the Project Score section below for more information.


Here is a link to the peer evaluation form that will be used for both peer evaluations:



What to Do If Someone on Your Team is Not Doing His or Her Part


If someone on the team is not doing his or her part, it is the responsibility of the team leader to talk to that person and encourage him/her to take the project more seriously. If this teammate is unresponsive, the next step is to meet with the instructor to explain the situation. The instructor will intervene and talk to the student. Ultimately, however, if your teammate refuses to work, his/her portion of the project should be reassigned to another member(s) so that the team doesn't fall behind or fail. In this case, the other members of the team should give a score of 0 to this teammate on the peer evaluation.


Do not be afraid to give a low peer evaluation score to a teammate who is not meeting his or her responsibilities!



Project Score


A team score will be calculated as shown below.

  • Completeness and Accuracy (80%).

-The project runs without errors

-All portions of the project are complete and functional

-The BST sorted by primary key and the Interface are each worth 25% of the project. The search engine is worth 50% of the project. If any of these components is missing, a deduction of 25%-50% will be made.

  • Reports (10%). All Reports are complete, thorough and well-written.
  • Presentation (10%). See presentation section below for grading criteria.


An individual grade will be determined: by multiplying the project score by a score factor as shown below, which is determined using the average peer evaluation score (25% for the first peer evaluation and 75% for the second peer evaluation):


Average Peer Rating

Score Factor

16 – 20

100%

12 – 15

95%

10 – 12

90%

8–9

80%

5–7

70%

3–4

50%

<3

0



Personal Score = Project Score x Score Factor

Schedule of Due Dates

1. Report 1: Project Proposal and Team Member Document
    Hard copy due on Tuesday, October 17
    Note: One document per team required.

2. Report 2: Requirements Analysis and High-level Design
    Hard copy due in class on Thursday, November 2
    Note: One Requirements Analysis Document per team required.

3. Report 3: Low-Level Design + Peer Evaluation 1
    Hard copy of Low-Level Design Document due in class on Thursday, November 16
    Hard Copy of Peer Evaluation 1 due in before class on Thursday, November 16
    Note: One Low-Level Design Document per team required; one peer evaluation per team member required.

4. Report 4: Test Plan
    Hard copy due in class on Tuesday, December 5
    Note: One Test Plan Document per team required 

5. Report 5: Walk-through and Test Results + Peer Evaluation 2
    Hard copy due in class on Tuesday, December 12
    Hard copy of Peer Evaluation 2 due in class on Tuesday, December 12
    Note: One Walk-through and Test Results Document per team required; one peer evaluation per team member required.

6. Project Demonstration/Presentation and Project Report 6
    Demonstration in class on Tuesday, December 12
    Instructor will test your code to note what is working and what is not on this date. No extensions.
    Make sure that you at least have
a running version, even if it is incomplete or you will receive a 0 on the course project.
    Also, you will test other team projects and submit Project Report 6 in class


Presentation and Demo

  • At the end of the quarter, you will need to present the final version of your project to the instructor who will also test out your code.
  • This demonstration will occur on the day scheduled for the final exam.
  • Each team member will need to be present to talk about their portion of the project, and answer questions the instructor may have.
  • Your project will be evaluated by the instructor on this day.
  • As no extensions are allowed, make sure that you at least have a running version of your project (even if it is incomplete) or you will receive a 0 on your project (no exceptions!).
  • The presentation portion of the demonstration should be no longer than 5 minutes followed by a live demonstration of your project that it at least 10 minutes long.
  • Your presentation will be graded on a 20 point scale as follows:
 Requirement Description Point Value
 
Slide 1

Your first slide must display your topic and the full names of all presenters. At this point in the presentation, each team member should introduce him or herself and one person should state the topic of the presentation.

 
1
 
Slide 2
 
Your second slide must provide an outline of what you will cover in your presentation. This outline must be a list in either bullet point or numerical form. One member of the team should briefly go over what will be discussed during the presentation.

 
1
 
Slide 3
 
Your third slide must give an overview or brief summary of what your chosen topic is. In other words, define your topic.

 
1
 
Slide 4
 
In the fourth slide, you will need to give an overview of your data set, where you found it and what are the fields contained in the data set.

 
1
 
Slide 5-6 or 7
 
In the next 2-3 slides, give an overview of the data structures used in your presentation. You must present these by using the UML diagrams that you created in your project reports. You may update the UML diagrams if your data structures have changed since submitting the reports.

 
2
 
Slide 7 or 8
 
In the next slide, list the results from your walk-throughs. What errors did you notice and what feedback did you get from your participants? Did you make any changes since completing the walk-throughs?


1
 
Live Demo of Project
 
During the live demo, you should display the main functionality of your system, including ALL required menu options, as well as any special features you would like to show off.

 
10
 
Final Slide
 
In the final slide, thank the audience and ask for questions.

 
1

Presentation Aesthetics
 
Is your presentation laid out nicely with lots of images and clear, readable text?

 
1
 
Professionalism
 
Is the presentation delivered in a practiced and professional manner?

 
1