CS5112: Algorithms and Data Structures for Applications


CS5112: Algorithms and Data Structures for Applications
Lecture 14: Exponential decay; convolution
Ramin Zabih
Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Administrivia
Q7 delayed due to the Columbus Day holiday
HW3 is out but short; HW can be done in groups of 3 from now on
Class on 10/23 will be prelim review for the prelim on 10/25
Greg lectures on dynamic programming next week
Anonymous survey is out, please respond!
Automatic grading apology

Survey feedback

Automatic grading apology
In general, students should not have a grade revised downward; the main exception is regrade requests
Automatic grading meant that this sometimes happened
In the end, we decided the priority was to assign HW grades based on how correct the code was
Going forward, please treat HW grades as tentative grades
We will announce when those grades are finalized; after that, they will only be changed under exceptional circumstances

Today Two streaming algorithms Convolution

Some natural histogram queries (illustrated on a histogram over elements 1-20):
Top-k most frequent elements
Find all elements with frequency > 0.1%
What is the frequency of element 3?
What is the total frequency of elements between 8 and 14?
How many elements have non-zero frequency?

Streaming algorithms (recap)
1. Boyer-Moore majority algorithm
2. Misra-Gries frequent items
3. Find popular recent items (box filter)
4. Find popular recent items (exponential window)
5. Flajolet-Martin number of items

Weighted average in a sliding window
Computing the average of the last k inputs can be viewed as a dot product with a constant vector v = [1/k, 1/k, ..., 1/k]
Sometimes called a box filter
Easy to visualize
This is also a natural way to smooth, e.g., a histogram
To average together adjacent bins, v = [1/3, 1/3, 1/3]
This kind of weighted average has a famous name: convolution

Decaying windows
Let our input at time t be a_1, a_2, ..., a_t
With a box filter over all of these elements we computed (1/t) * sum_{i=0}^{t-1} a_{t-i}
Instead let us pick a small constant c and compute â_t = sum_{i=0}^{t-1} a_{t-i} (1-c)^i

Easy to update this
The update rule is simple; let the current weighted sum be â_t, then
â_{t+1} = (1-c) â_t + a_{t+1}
This downscales the previous elements correctly, and the new element is scaled by (1-c)^0 = 1
This avoids falling off the edge
Gives us an easy way to find popular items
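The one-line update rule above can be sketched as follows (`decayed_update` is a hypothetical name; the decay constant c = 0.1 is just for illustration):

```python
def decayed_update(a_hat, new_item, c=0.01):
    """One step of the exponentially decaying window:
    scale the old weighted sum by (1 - c), then add the new
    element, whose weight is (1 - c)**0 == 1."""
    return (1 - c) * a_hat + new_item

a_hat = 0.0
for x in [5, 3, 8]:
    a_hat = decayed_update(a_hat, x, c=0.1)
# a_hat is now approximately 14.75 = ((5*0.9) + 3)*0.9 + 8
```

Note that the whole history is summarized by a single number, which is what makes this attractive for streams.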

4. Popular items with decaying windows
We keep a small number of weighted sum counters
When a new item arrives, we first decay all counters by (1-c); if the item already has a counter we add 1 to it, otherwise we create a counter for it with value 1
How do we avoid getting an unbounded number of counters?
We set a threshold, say 1/2, and if any counter goes below that value we throw it away
The number of counters is bounded by 2/c
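The counter maintenance described above might look like the following minimal sketch (`update_counters` is a hypothetical name; the 1/2 threshold follows the slide, and dropping small counters is what bounds their number by roughly 2/c):

```python
def update_counters(counters, item, c=0.001, threshold=0.5):
    """Decaying-window popular-item counters (sketch).
    Every counter decays by (1 - c); counters that drop below
    the threshold are discarded; the arriving item's counter
    gets +1 (created at 1 if it did not exist)."""
    for key in list(counters):
        counters[key] *= (1 - c)
        if counters[key] < threshold:
            del counters[key]
    counters[item] = counters.get(item, 0.0) + 1.0
    return counters

counters = {}
for x in "aabab":
    update_counters(counters, x, c=0.1)
# 'a' ends up with a larger counter than 'b'
```

With a small c the decay is slow, so genuinely popular items keep their counters above the threshold while rare items fall away.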

How many distinct items are there?
This tells you the size of the histogram, among other things
Solving this problem exactly requires space linear in the size of the input stream
Impractical for many applications
Instead we will compute an efficient estimate via hashing

5. Flajolet-Martin algorithm Basic idea: the more different elements we see, the more different hash values we will see We will pick a hash function that spreads out the input elements Typically uses universal hashing

Flajolet-Martin algorithm
Pick a hash function h that maps each of the n elements to at least log_2 n bits
For input a, let r(a) be the number of trailing 0s in h(a)
r(a) = position of the first 1, counting from the right
E.g., if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2
Record R = the maximum r(a) seen
Estimated number of distinct elements = 2^R
Anyone see the problem here?

Why it works: intuition
Very rough intuition for why Flajolet-Martin works:
h(a) hashes a with equal probability to any of n values, i.e. a sequence of log_2 n bits
A 2^(-r) fraction of the a's have a tail of r zeros:
About 50% hash to ***0
About 25% hash to **00
So, if the longest tail we saw was r = 2 (i.e., an item hash ending *100), then we have probably seen about 4 distinct items so far
It takes about 2^r items before we see one with a zero-suffix of length r
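A rough sketch of the algorithm (`fm_estimate` and `trailing_zeros` are hypothetical names; md5 merely stands in for the universal hash function the lecture assumes):

```python
import hashlib

def trailing_zeros(n):
    """r(a): number of trailing 0 bits; e.g. 12 = 1100 binary -> 2.
    Returns 0 for n == 0 as a simplification of that corner case."""
    if n == 0:
        return 0
    r = 0
    while n & 1 == 0:
        n >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Flajolet-Martin: hash each element, track the longest
    zero-tail R seen, estimate the distinct count as 2**R."""
    R = 0
    for a in stream:
        h = int.from_bytes(hashlib.md5(str(a).encode()).digest()[:8], "big")
        R = max(R, trailing_zeros(h))
    return 2 ** R

print(fm_estimate(["x", "y", "x", "z"]))  # a power of 2 near the true count 3
```

Because the estimate depends only on the maximum r(a), duplicates have no effect, which is exactly what a distinct-count sketch needs.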

Convolution
Weighted average with a mask/stencil/template
Dot product of vectors
Many important properties and applications:
Symmetric in the inputs
Equivalent to linear shift-invariant systems
Well behaved, in a certain precise sense
Primary uses are smoothing and matching
This is the C in CNN
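As a concrete illustration of convolution as a sliding dot product, here is a minimal 1-D sketch (`convolve` is a hypothetical helper; it computes only the "valid" outputs where the kernel fits inside the signal):

```python
def convolve(signal, kernel):
    """Discrete 1-D convolution, 'valid' part only: slide the
    flipped kernel over the signal and take dot products."""
    k = list(reversed(kernel))  # convolution flips the kernel
    n, m = len(signal), len(k)
    return [sum(k[j] * signal[i + j] for j in range(m))
            for i in range(n - m + 1)]

# Convolving an impulse with a kernel reproduces the kernel:
print(convolve([0, 0, 1, 0, 0], [1, 2, 3]))  # -> [1, 2, 3]
```

With kernel [1/3, 1/3, 1/3] this reduces to the box filter from earlier, which is why the slide calls the box filter a convolution.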

Local averaging in action

Smoothing parameter effects

Matched filters
Convolution can be used to find pulses
This is actually closely related to smoothing
How do we find a known pulse in a signal? Convolve the signal with our template!
E.g., to find something in the signal that looks like [1, 6, -10] we convolve with [1, 6, -10]
Question: what sense does this make?
Anecdotally it worked for finding boxes

Box finding example

Pulse finding example

Why does this work?
Convolution has some nice optimality properties, but the way I described it, the algorithm fails
Idea: the [1, 6, -10] template gives the biggest response when the signal is [1, 6, -10]
The value there is 1*1 + 6*6 + (-10)*(-10) = 137
But is this actually correct?
You actually need both the template and the input window to have zero mean and unit energy (sum of squares)
Easily accomplished: subtract the mean (which is -1), giving [2, 7, -9], then divide by the square root of its energy, sqrt(134)
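The zero-mean, unit-energy normalization described above can be sketched as follows (`normalize` and `matched_filter` are hypothetical names; once both sides are normalized, a perfect match scores exactly 1 and nothing can score higher, by Cauchy-Schwarz):

```python
import math

def normalize(v):
    """Zero mean, unit energy: subtract the mean, then divide
    by the square root of the sum of squares."""
    mu = sum(v) / len(v)
    w = [x - mu for x in v]
    energy = math.sqrt(sum(x * x for x in w))
    if energy == 0:           # constant window: no pulse shape to match
        return [0.0] * len(w)
    return [x / energy for x in w]

def matched_filter(signal, template):
    """Correlate the normalized template against each normalized
    window of the signal; the response peaks where they match."""
    t = normalize(template)
    m = len(t)
    return [sum(a * b for a, b in zip(t, normalize(signal[i:i + m])))
            for i in range(len(signal) - m + 1)]

scores = matched_filter([0, 0, 1, 6, -10, 0, 0], [1, 6, -10])
best = max(range(len(scores)), key=lambda i: scores[i])
# best == 2: the window starting at index 2 is exactly the pulse
```

The unnormalized version fails because a window with large values can out-score the true match; normalizing makes the response a pure measure of shape.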

Geometric intuition
Convolution takes the dot product of two vectors
Recall [a, b, c] · [e, f, g] = ae + bf + cg
This is basically the projection of one vector onto another
The normalized vector with the biggest projection onto x is, of course, x itself!