Advanced Database Systems (CS6093)

Spring 2011, Polytechnic Institute of NYU


Current News:

2011/04/23: Project presentation and demo: email us your presentation slides by 4pm ET on May 9th (the day of the final class).
2011/03/07: We apologize for the mistakes made in lecture 5 about the midterm report deadline. They have now been corrected. Again, the deadline for the midterm report is 5pm ET on March 21st.
2011/03/01: The office hour on 03/28 is canceled because Cong will be out of town during the day. If needed, please email Cong to arrange another meeting time that week.
2011/03/01: Project assignments are finalized.
2011/01/17: The first class will be on 01/24 at 6pm in room RH602. We will go over the logistics and course outline.

Brief Description:

This course covers a variety of advanced database topics. This semester, we will focus on topics closely related to information retrieval and management on the Web. Specifically, we will study the following main topics: information retrieval ranking and evaluation (IRR and IRE), personalized information retrieval (IRP), structured search and information extraction (SS and IE), and web data mining (DM). We will also briefly touch on the recent topic of human computation (HC).

Instructors and Logistics:

Dr. Fernando Diaz (Yahoo! Research), last_name_and_first_initial [AT] yahoo-inc dot com
Dr. Cong Yu (Google Research), full_name [AT] google dot com

Time and Location: M 6:00p - 8:25p, RH602, 6 MetroTech
Mailing List: spring1112.cs6093.2256 [_a_t_] utopia [o] poly [o] edu list
Office Hours: M 4:50p - 5:50p, LC246

Course Schedule

Notations: IRR = Information Retrieval Ranking; IRE = Information Retrieval Evaluation; IRP = Personalized Information Retrieval; IE = Information Extraction; DM = Data Mining; SS = Structured Search; HC = Human Computation; FD = Fernando Diaz; CY = Cong Yu.
Reading materials will be provided on the web site approximately one week before the lecture date.

Date   Topic (Instructor)   Reading Material
Lec 01 (01/24) Course Overview (FD + CY) Handout
Lec 02 (01/31) SS (CY) 1. Decker et al, The Semantic Web: the roles of XML and RDF, IEEE Internet Computing, 4(5):63, 2000.
2. Berners-Lee, Hendler, Lassila, The Semantic Web, Scientific American, May 2001.
3. Amer-Yahia, Lakshmanan, Pandit, FlexPath: Flexible Structure and Full-text Querying for XML, SIGMOD 2004
4. Jagadish et al, TIMBER: A Native XML Database, The VLDB Journal, 11:274, 2002.
Lec 03 (02/07) IRR (FD) Project selection and group formation due.
1. Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Commun. ACM 18(11):613-620, 1975.
2. Brin, S., and Page, L. The anatomy of a large-scale hypertextual web search engine. WWW7, 1998.
3. Najork, M. A., Zaragoza, H., and Taylor, M. J. HITS on the web: how does it compare? SIGIR 2007.
Lec 04 (02/14) IRR (FD) 1. Nallapati, R. Discriminative models for information retrieval. In Proceedings of the 27th annual international conference on Research and development in information retrieval (2004).
2. Arguello, J., Diaz, F., Callan, J., and Crespo, J.-F. Sources of evidence for vertical selection. In SIGIR 2009 (2009).
3. Dong, A., Zhang, R., Kolari, P., Bai, J., Diaz, F., Chang, Y., Zheng, Z., and Zha, H. Time is of the essence: improving recency ranking using Twitter data. In WWW 2010 (New York, NY, USA, 2010).
02/21 President's Day no class
Lec 05 (02/28) IE (CY) 1. Wrapper induction for information extraction. Nicholas Kushmerick, Daniel Weld, and Robert Doorenbos. IJCAI 1997.
2. Snowball: Extracting relations from large plain-text collections. Eugene Agichtein and Luis Gravano. Digital Library 2000.
3. To search or to crawl? Towards a query optimizer for text-centric tasks. Ipeirotis et al. SIGMOD 2006.
Lec 06 (03/07) IE (CY) 1. Building structured web community portals: A top-down, compositional, and incremental approach. DeRose et al. VLDB 2007.
2. Methods for domain-independent information extraction from the Web: An experimental comparison. Etzioni et al. AAAI 2004.
3. Open information extraction from the Web. Banko et al. IJCAI 2007.
4. Prioritization of Domain-Specific Web Information Extraction. Jian Huang and Cong Yu. AAAI 2010.
03/14 Spring Break no class
03/21 - Midterm report due at 5pm ET
Lec 07 (03/21) DM (CY) 1. Fast Algorithms for Mining Association Rules. Rakesh Agrawal and Ramakrishnan Srikant. VLDB 1994.
2. Efficiently Mining Long Patterns from Databases. Roberto Bayardo. SIGMOD 1998.
3. Mining the Most Interesting Rules. Bayardo and Agrawal. SIGKDD 1999.
4. Bottom-Up Computation of Sparse and Iceberg CUBEs. Beyer and Ramakrishnan. SIGMOD 1999.
Lec 08 (03/28) DM (CY) 1. Mining Sequential Patterns. Agrawal and Srikant. ICDE 1995.
2. CloseGraph: Mining Closed Frequent Graph Patterns. Yan and Han. SIGKDD 2003.
3. Mining Graph Evolution Rules. Berlingerio, Bonchi, Bringmann, and Gionis. ECML/PKDD 2009.
4. If time permits, we will also talk about clustering.
Lec 09 (04/04) IRE (FD) 1. van Rijsbergen, C.J. Information Retrieval. Butterworths, 1979. (Chapter 7 only)
2. Jarvelin, K., and Kekalainen, J. Cumulated gain-based evaluation of IR techniques. TOIS 20, 4 (2002).
3. Hassan, A., Jones, R., and Klinkner, K.L. Beyond DCG: user behavior as a predictor of a successful search. In WSDM 2010: Proceedings of the third ACM international conference on Web search and data mining (New York, NY, USA, 2010).
Lec 10 (04/11) IRE (FD) 1. Carterette, B. Robust test collections for retrieval evaluation. In SIGIR 2007: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2007).
2. Chapelle, O., Metzler, D., Zhang, Y., and Grinspan, P. Expected reciprocal rank for graded relevance. In CIKM 2009: Proceeding of the 18th ACM conference on Information and knowledge management (2009).
3. White, R. W., and Dumais, S. T. Characterizing and predicting search engine switching behavior. In CIKM 2009: Proceeding of the 18th ACM conference on Information and knowledge management (New York, NY, USA, 2009).
Lec 11 (04/18) IRP (FD) 1. Haveliwala, T. H. Topic-sensitive PageRank. In WWW 2002: Proceedings of the 11th international conference on World Wide Web (New York, NY, USA, 2002).
2. Das, A.S., Datar, M., Garg, A., and Rajaram, S. Google News personalization: scalable online collaborative filtering. In WWW 2007: Proceedings of the 16th international conference on World Wide Web (New York, NY, USA, 2007).
3. Teevan, J., Dumais, S.T., and Liebling, D.J. To personalize or not to personalize: modeling queries with variation in user intent. In SIGIR 2008: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (New York, NY, USA, 2008).
Lec 12 (04/25) IRP (FD) 1. Dou, Z., Song, R., and Wen, J.-R. A large-scale evaluation and analysis of personalized search strategies. In Proceedings of the 16th international conference on World Wide Web (New York, NY, USA, 2007), WWW 2007.
2. Shen, X., Tan, B., and Zhai, C. Implicit user modeling for personalized search. In Proceedings of the 14th ACM international conference on Information and knowledge management (New York, NY, USA, 2005), CIKM 2005.
3. Teevan, J., Morris, M.R., and Bush, S. Discovering and using groups to improve personalized search. In Proceedings of the Second ACM International Conference on Web Search and Data Mining (New York, NY, USA, 2009), WSDM 2009.
Lec 13 (05/02) HC (CY) 1. Luis von Ahn, Manuel Blum, Nicholas Hopper, and John Langford. CAPTCHA: Using hard AI problems for security. EUROCRYPT 2003.
2. Luis von Ahn, Ruoran Liu, and Manuel Blum. Peekaboom: a game for locating objects in images. CHI 2006.
3. Shawn Jeffery, Michael Franklin, Alon Halevy. Pay-as-you-go user feedback for dataspace systems. SIGMOD 2008.
4. Michael Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, and Reynold Xin. CrowdDB: Answering Queries with Crowdsourcing. SIGMOD 2011.
Lec 14 (05/09) Project Demo (FD + CY) Email slides to us before class, by 4pm ET.
05/13 - Project final report due at noon ET.


Course Projects

Project group information is posted. Contact us if you find any discrepancy.

Please email the instructors with the project ID and the email addresses of the group members to reserve a project. Each project has a default instructor to contact, listed in parentheses after the project ID.

Project ID   Topic   Group Members
P03 (FD) Detecting Trends in Facebook/Twitter Feeds Maggie, Nitin, Quentin
P10 (CY) Mining Patterns from Status Updates Devansh, Prayag, Ting
P11 (CY) Recommendations in Social Networks Konstantinos, Ratan, Yigit
P12 (FD) Learning to Rank with Label Noise Josh