COMPSCI 514: Algorithms for Data Science
1 COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018
2 Lecture 9 Similarity Queries
3 Few words about the exam. The exam is Thursday (Oct 4), in two days, in this room at this time. Duration: 1 hour. Syllabus: up to the end of the last class. Focus: clustering. An extensive list of chapters was provided on Piazza; you should concentrate on things that were discussed in detail in class.
4 From the two books. From the Blum et al. book: Ch 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9.1, 3.9.2 (all exercises in Ch 3 are relevant); Ch 4.1, 4.8; Ex , 4.11, 4.12, 4.25, 4,; Ch 7.1, 7.2, 7.3, 7.4, 7.5; Ex ,. From MMDS (exercises are at the end of each section): Ch 5.1, 5.5; Ch 7.1, 7.3, 7.4; Ch 10.4; Ch 11.1, 11.2, 11.3. Only things that were covered in class are in the syllabus. Concentrate on topics that were not covered by the first homework. It is fine if you learn something by mistake.
5 Finding similar items [Hays and Efros, SIGGRAPH 2007] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 3
6 Finding near-neighbors in a high dimensional space. Pages with similar words: for data de-duplication and classification by topics. Customers who purchased similar products: for product classification. Images with similar features. The space is high dimensional. The database can be huge.
7 Finding near-neighbors in a high dimensional space. Near neighbors: points that are only a small distance away. We have to define a distance or similarity, for example the Jaccard distance/similarity. For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$. For any two sets A and B, the Jaccard distance is $1 - |A \cap B| / |A \cup B|$.
8 Jaccard metric. For any two sets A and B, the Jaccard similarity is $|A \cap B| / |A \cup B|$ and the Jaccard distance is $1 - |A \cap B| / |A \cup B|$. Jaccard Similarity of Sets: The Jaccard similarity of sets S and T is $|S \cap T| / |S \cup T|$, that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by SIM(S, T). Example 3.1: In Fig. 3.1 we see two sets S and T. There are three elements in their intersection and a total of eight elements that appear in S or T or both. Thus, SIM(S, T) = 3/8; similarity = 3/8, distance = 5/8. [Figure 3.1: Two sets with Jaccard similarity 3/8]
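The two definitions can be checked on a small example. The following is a minimal Python sketch; the function names and the concrete sets `S` and `T` are illustrative assumptions (the slides do not list the elements of Fig. 3.1), chosen only so that the intersection has three elements and the union has eight.

```python
def jaccard_similarity(a, b):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """Return 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 3 shared elements, 8 elements in total.
S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}

print(jaccard_similarity(S, T))  # 0.375, i.e. 3/8
print(jaccard_distance(S, T))    # 0.625, i.e. 5/8
```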
9 Running example: Similarity of documents. Jaccard distance for textual similarity. To find plagiarism: there is no simple automated process, since the documents are not exactly the same. To find the same news from different sources. To match mirror pages.
10 Documents to sets: Shingling. Jaccard distance for textual similarity. Document: a string of characters. k-shingle: a substring of length k within the document. Example: abcdabd. Set of 2-shingles: {ab, bc, cd, da, bd}. Bag of 2-shingles: {ab, bc, cd, da, ab, bd}. We might remove whitespace altogether, or we might replace each sequence of whitespace characters by a single whitespace.
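The shingling step for the string abcdabd can be sketched in a few lines of Python. The function names below are illustrative assumptions, not part of the lecture; `normalize_whitespace` implements the second whitespace option (collapsing runs of whitespace into one space).

```python
from collections import Counter


def shingle_bag(text, k):
    """Bag (multiset) of k-shingles: every length-k substring, with counts."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def shingle_set(text, k):
    """Set of k-shingles: the distinct length-k substrings."""
    return set(shingle_bag(text, k))


def normalize_whitespace(text):
    """Replace every run of whitespace by a single space."""
    return " ".join(text.split())


print(sorted(shingle_set("abcdabd", 2)))   # ['ab', 'bc', 'bd', 'cd', 'da']
print(shingle_bag("abcdabd", 2)["ab"])     # 2, since 'ab' occurs twice
```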
11 Shingling: choosing the shingle size. k = 1: bad. k should be picked large enough that the probability of any given shingle appearing in any given document is low. k very large: not enough statistics, so also bad. For large documents, k = 9 is considered safe. Hashing shingles: give each shingle a number and then, instead of the set of shingles, keep the set of numbers.
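A minimal sketch of hashing shingles to numbers follows. The lecture does not name a hash function; using `zlib.crc32` (a 32-bit checksum from the Python standard library) is an assumption made here because it gives the same value on every run, unlike Python's built-in `hash` for strings. The function name `hashed_shingles` is hypothetical.

```python
import zlib


def hashed_shingles(text, k=9):
    """Map each k-shingle to a 32-bit integer and keep the set of integers.

    zlib.crc32 is an assumed stand-in for "give each shingle a number";
    any deterministic hash into a fixed range would serve the same purpose.
    """
    return {
        zlib.crc32(text[i:i + k].encode("utf-8"))
        for i in range(len(text) - k + 1)
    }


doc = "the quick brown fox"
numbers = hashed_shingles(doc, k=9)
print(len(numbers))  # at most 11, the number of 9-shingles in the string
```

Storing 4-byte integers instead of 9-character strings shrinks the representation, at the cost of occasional collisions between different shingles.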
12 Shingles built from words. First rule: avoid stop words (and, such, to, etc.) as shingles on their own. Instead, use a stop word followed by the next two or three words as a shingle, e.g. "and other world leaders".
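A rough sketch of word-based shingles anchored at stop words is shown below. The stop-word list and the function name are hypothetical; a real application would use a much larger list.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"and", "such", "to", "the", "a", "of", "for", "that", "it", "is"}


def stopword_shingles(text, following=2):
    """Shingles made of a stop word plus the next `following` words."""
    words = text.lower().split()
    return {
        " ".join(words[i:i + 1 + following])
        for i, word in enumerate(words)
        if word in STOP_WORDS and i + following < len(words)
    }


sentence = "presidents and other world leaders met"
print(stopword_shingles(sentence, following=2))  # {'and other world'}
print(stopword_shingles(sentence, following=3))  # {'and other world leaders'}
```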
13 Summarizing the shingles: how to go from sets to signatures. The characteristic matrix of sets. 3.3.1 Matrix Representation of Sets: Before explaining how it is possible to construct small signatures from large sets, it is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which elements of the sets are drawn. There is a 1 in row r and column c if the element for row r is a member of the set for column c. Otherwise the value in position (r, c) is 0.

Element  S1  S2  S3  S4
a        1   0   0   1
b        0   0   1   0
c        0   1   0   1
d        1   0   1   1
e        0   0   1   0

Figure 3.2: A matrix representing four sets. U = {a, b, c, d, e}; S1 = {a, d}; S2 = {c}; etc.

Example 3.6: In Fig. 3.2 is an example of a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here, S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}. The top row and leftmost column are not part of the matrix, but are present only to remind us what the rows and columns represent.
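As an illustration, the characteristic matrix of Example 3.6 can be built directly from the four sets. This is a small Python sketch; the function name `characteristic_matrix` is an assumption made for the example.

```python
def characteristic_matrix(universe, sets):
    """Rows follow `universe`, columns follow `sets`; entry is 1 on membership."""
    return [[1 if element in s else 0 for s in sets] for element in universe]


universe = ["a", "b", "c", "d", "e"]
sets = [{"a", "d"}, {"c"}, {"b", "d", "e"}, {"a", "c", "d"}]  # S1..S4

matrix = characteristic_matrix(universe, sets)
for element, row in zip(universe, matrix):
    print(element, row)
# a [1, 0, 0, 1]
# b [0, 0, 1, 0]
# c [0, 1, 0, 1]
# d [1, 0, 1, 1]
# e [0, 0, 1, 0]
```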
14 Sets to signatures: Minhashing. Hash the columns of the characteristic matrix. How? Pick a permutation of the rows. The hash value of a column is the element in the first row, in the permuted order, where the column has a 1.
15 Minhashing. In this section, we shall learn how a minhash is computed in principle, and in later sections we shall see how a good approximation to minhash is computed in practice. To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Example 3.7: Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2. This permutation defines a minhash function h that maps sets to rows. Let us compute the minhash value of set S1 according to h. The first column, which is the column for set S1, has 0 in row b, so we proceed to row e, the second in the permuted order. There is again a 0 in the column for S1, so we proceed to row a, where we find a 1. Thus, h(S1) = a.

Permutation of the rows: b, e, a, d, c

Element  S1  S2  S3  S4
b        0   0   1   0
e        0   0   1   0
a        1   0   0   1
d        1   0   1   1
c        0   1   0   1

Figure 3.3: A permutation of the rows of Fig. 3.2

h(S1) = a; h(S2) = c; h(S3) = b; h(S4) = a

It is important to remember that the characteristic matrix is unlikely to be the way the data is stored, but it is useful as a way to visualize the data. For one reason not to store data as a matrix, these matrices are almost always sparse (they have many more 0's than 1's) in practice. It saves space to represent a sparse matrix of 0's and 1's by the positions in which the 1's appear.
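The minhash values for the row order b, e, a, d, c can be reproduced with a short Python sketch. The function name `minhash` and the choice to return `None` for an empty set are assumptions made for this illustration.

```python
def minhash(s, row_order):
    """Return the first element of `row_order` that belongs to the set `s`."""
    for element in row_order:
        if element in s:
            return element
    return None  # assumed convention: an empty set has no row with a 1


row_order = ["b", "e", "a", "d", "c"]  # the permutation "beadc"
sets = {
    "S1": {"a", "d"},
    "S2": {"c"},
    "S3": {"b", "d", "e"},
    "S4": {"a", "c", "d"},
}

for name, s in sets.items():
    print(name, minhash(s, row_order))
# S1 a
# S2 c
# S3 b
# S4 a
```

Working from the sets themselves rather than from a stored 0/1 matrix reflects the sparse representation: only the positions of the 1's are kept.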
16 Why Minhashing? For any two sets S, T, the probability that their minhashes (with a uniformly random permutation) are equal is $P(h(S) = h(T)) = \mathrm{Sim}(S, T)$, the Jaccard similarity of the two sets. Why? Consider only the rows in which at least one of the two columns has a 1, that is, the elements of $S \cup T$. Under a uniformly random permutation, each of these rows is equally likely to come first. The two minhashes agree exactly when that first row has a 1 in both columns, that is, when it belongs to $S \cap T$, which happens with probability $|S \cap T| / |S \cup T|$.
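For a universe as small as {a, b, c, d, e}, this claim can be verified exactly by enumerating all 5! = 120 row permutations. The sketch below does that in Python; the function names are illustrative assumptions, and exhaustive enumeration is only feasible for tiny universes.

```python
from fractions import Fraction
from itertools import permutations


def minhash(s, row_order):
    """First element of `row_order` that belongs to `s` (None if `s` is empty)."""
    return next((e for e in row_order if e in s), None)


def collision_probability(s, t, universe):
    """Exact fraction of row permutations under which h(s) == h(t)."""
    perms = list(permutations(universe))
    agree = sum(1 for p in perms if minhash(s, p) == minhash(t, p))
    return Fraction(agree, len(perms))


def jaccard(s, t):
    return Fraction(len(s & t), len(s | t))


universe = ["a", "b", "c", "d", "e"]
S1, S3, S4 = {"a", "d"}, {"b", "d", "e"}, {"a", "c", "d"}

print(collision_probability(S1, S4, universe), jaccard(S1, S4))  # 2/3 2/3
print(collision_probability(S1, S3, universe), jaccard(S1, S3))  # 1/4 1/4
```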
17 Minhash signature. Take n (say 100) random permutations. Each defines a minhash function: $h_1, h_2, \ldots, h_n$. The minhash signature of S is the vector $[h_1(S), h_2(S), \ldots, h_n(S)]$. n is much smaller than the number of rows of the characteristic matrix.
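A minimal sketch of a minhash signature with explicit random permutations follows. The function names, the seeding scheme, and the use of `random.Random.shuffle` are assumptions made for illustration; the key requirement is that every set is hashed with the same n permutations, which the shared seed guarantees here. Explicit permutations are only practical for small universes.

```python
import random


def minhash_signature(s, universe, n=100, seed=0):
    """Signature [h_1(s), ..., h_n(s)] from n seeded random row permutations.

    `universe` must be a list so that every call starts from the same order;
    with the same seed, all sets then see the same n permutations.
    """
    rng = random.Random(seed)
    rows = list(universe)
    signature = []
    for _ in range(n):
        rng.shuffle(rows)  # a fresh uniformly random permutation of the rows
        signature.append(next((e for e in rows if e in s), None))
    return signature


def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


universe = ["a", "b", "c", "d", "e"]
S1, S2, S4 = {"a", "d"}, {"c"}, {"a", "c", "d"}

sig1 = minhash_signature(S1, universe)
sig2 = minhash_signature(S2, universe)
sig4 = minhash_signature(S4, universe)

# The estimate for S1 and S4 should be near the true Jaccard similarity 2/3;
# the exact value depends on the random permutations drawn.
print(estimate_similarity(sig1, sig4))
print(estimate_similarity(sig1, sig2))  # 0.0, because S1 and S2 are disjoint
```

The fraction of agreeing positions is an estimate of the Jaccard similarity, since each position agrees with probability Sim(S, T).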
More informationPerform the same three operations as above on the values in the matrix, where some notation is given as a shorthand way to describe each operation:
SECTION 2.1: SOLVING SYSTEMS OF EQUATIONS WITH A UNIQUE SOLUTION In Chapter 1 we took a look at finding the intersection point of two lines on a graph. Chapter 2 begins with a look at a more formal approach
More informationCh. 7 Statistical Intervals Based on a Single Sample
Ch. 7 Statistical Intervals Based on a Single Sample Before discussing the topics in Ch. 7, we need to cover one important concept from Ch. 6. Standard error The standard error is the standard deviation
More informationCS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization
CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization Tim Roughgarden February 28, 2017 1 Preamble This lecture fulfills a promise made back in Lecture #1,
More informationMATH 433 Applied Algebra Lecture 22: Review for Exam 2.
MATH 433 Applied Algebra Lecture 22: Review for Exam 2. Topics for Exam 2 Permutations Cycles, transpositions Cycle decomposition of a permutation Order of a permutation Sign of a permutation Symmetric
More informationDecision T ree Tree Algorithm Week 4 1
Decision Tree Algorithm Week 4 1 Team Homework Assignment #5 Read pp. 105 117 of the text book. Do Examples 3.1, 3.2, 3.3 and Exercise 3.4 (a). Prepare for the results of the homework assignment. Due date
More informationLast Time. x + 3y = 6 x + 2y = 1. x + 3y = 6 y = 1. 2x + 4y = 8 x 2y = 1. x + 3y = 6 2x y = 7. Lecture 2
January 9 Last Time 1. Last time we ended with saying that the following four systems are equivalent in the sense that we can move from one system to the other by a special move we discussed. (a) (b) (c)
More informationSection F Ratio and proportion
Section F Ratio and proportion Ratio is a way of comparing two or more groups. For example, if something is split in a ratio 3 : 5 there are three parts of the first thing to every five parts of the second
More informationCSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides
CSE 494/598 Lecture-6: Latent Semantic Indexing LYDIA MANIKONDA HT TP://WWW.PUBLIC.ASU.EDU/~LMANIKON / **Content adapted from last year s slides Announcements Homework-1 and Quiz-1 Project part-2 released
More information