GEMINI
GEneric Multimedia INdexIng

Last lecture: LSH
http://www.mit.edu/~andoni/lsh/
Is there another possible solution?
Do we need to perform ANN (approximate nearest neighbor search)?

GEneric Multimedia INdexIng
  distance measure
  sub-pattern match
  quick and dirty test
  lower bounding lemma
1-D time sequences
Color histograms
Color auto-correlogram
Shapes

GEneric Multimedia INdexIng
Given a database of multimedia objects, design fast search algorithms that locate objects that match a query object, exactly or approximately
Objects:
  1-d time sequences
  digitized voice or music
  2-d color images
  2-d or 3-d gray-scale medical images
  video clips
E.g.: find companies whose stock prices move similarly

Applications
time series: financial, marketing (click-streams!), ECGs, sound
images: medicine, digital libraries, education, art
higher-d signals: scientific databases (e.g., astrophysics), medicine (MRI scans), entertainment (video)

Sample queries
Find medical cases similar to Smith's
Find pairs of stocks that move in sync
Find pairs of documents that are similar (plagiarism?)
Find faces similar to Tiger Woods

[Figure: three stock-price series, $price versus day (1 to 365)]
distance function: by expert (e.g., Euclidean distance)

Generic Multimedia Indexing
1st step: provide a measure for the distance between two objects
Distance function d():
  Given two objects O_1, O_2, the distance (= dissimilarity) of the two objects is denoted by d(O_1, O_2)
  E.g., Euclidean distance (square root of the sum of squared differences) of two equal-length time series

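As a small illustration of the first step, the Euclidean distance between two equal-length time series could be sketched as follows. The function name `euclidean` and the price values are hypothetical, chosen only for this example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical daily prices for two stocks (illustrative values only).
stock_a = [10.0, 11.0, 12.0, 11.5]
stock_b = [10.0, 10.0, 12.0, 13.5]
print(euclidean(stock_a, stock_b))
```

The standard library function `math.dist` computes the same quantity and is used in place of a hand-written loop wherever only the value is needed.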
ε-similarity query
Given a query object Q, find all objects O_i from the database that are ε-similar (identical for ε = 0) to Q:
  {O_i ∈ DB | d(Q, O_i) ≤ ε}

Types of Similarity Queries
Whole match queries: given a collection of S objects O_1, ..., O_S and a query object Q, find the data objects that are within distance ε from Q

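A whole-match ε-similarity query can be answered by a plain sequential scan, which is the baseline that the rest of the lecture tries to beat. The sketch below assumes objects are equal-length numeric sequences stored in a dictionary; the name `range_query` and the sample data are hypothetical. It uses "≤ ε" as an assumption, so that ε = 0 returns exactly the identical objects.

```python
import math

def range_query(db, query, eps):
    """Return the ids of all objects within distance eps of the query.

    Sequential scan: one distance computation per stored object.
    """
    return [oid for oid, obj in db.items() if math.dist(obj, query) <= eps]

# Hypothetical database of three short sequences.
db = {"s1": [1.0, 2.0, 3.0], "s2": [1.0, 2.0, 4.0], "s3": [9.0, 9.0, 9.0]}
hits = range_query(db, [1.0, 2.0, 3.0], eps=1.0)
print(hits)
```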
Types of Similarity Queries
Sub-pattern match: given a collection of S objects O_1, ..., O_S, a query (sub-)object Q, and a tolerance ε, identify the parts of the data objects that match the query Q

Ideal method requirements
Fast: sequential scanning and distance calculation with each and every object is too slow for large databases
Dynamic: easy to insert, delete, and update objects

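For 1-d sequences, the sub-pattern match defined above can be sketched naively by sliding a window of the query's length over every stored sequence. This is an illustrative brute-force version, not the indexed method; the name `subpattern_matches` and the data are hypothetical.

```python
import math

def subpattern_matches(db, query, eps):
    """Yield (object id, offset) for every window within distance eps of the query."""
    w = len(query)
    for oid, seq in db.items():
        # Sequences shorter than the query produce an empty range and are skipped.
        for off in range(len(seq) - w + 1):
            if math.dist(seq[off:off + w], query) <= eps:
                yield oid, off

# Hypothetical database; the pattern (1, 2) occurs once in each sequence.
db = {"s1": [0.0, 1.0, 2.0, 1.0, 0.0], "s2": [5.0, 5.0, 1.0, 2.0, 9.0]}
matches = list(subpattern_matches(db, [1.0, 2.0], eps=0.1))
print(matches)
```

The cost is one distance computation per window of every object, which is exactly the "too slow for large databases" case named in the requirements.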
Basic idea
Focus on whole match queries
Given a collection of S objects O_1, ..., O_S, a distance/dissimilarity function d(O_i, O_j), and a query object Q, find the data objects that are within distance ε from Q
Sequential scanning? May be too slow, for the following reasons:
  Distance computation is expensive (e.g., edit distance in DNA strings)
  The database size S may be huge
Faster alternative?

GEneric Multimedia INdexIng
Christos Faloutsos, QBIC 1994
A feature extraction function maps the high-dimensional objects into a low-dimensional space
Objects that are very dissimilar in the feature space are also very dissimilar in the original space

Basic idea
Faster alternative:
Step 1: a quick and dirty test to discard quickly the vast majority of non-qualifying objects
Step 2: use of SAMs (spatial access methods: R-trees, Hilbert curve, ...) to achieve faster than sequential searching
Example:
  Database of yearly stock price movements
  Euclidean distance function
  Characterize each with a single number ("feature")
  Or use two or more features

Basic idea - illustration
[Figure: each series S1, ..., Sn ($price versus day, 1 to 365) is mapped by F() to a point F(S1), ..., F(Sn) in the (Feature1, Feature2) plane]
A query with tolerance ε becomes a sphere with radius ε

Basic idea - caution!
The mapping F() from objects to k-dim. points should not distort the distances
  d(): distance of two objects
  d_feature(): distance of their corresponding feature vectors
Ideally, perfect preservation of distances
In practice, a guarantee of no false dismissals
How?
Objects represented by vectors that are very dissimilar in the feature space are expected to be very dissimilar in the original space
If the distances in the feature space are always smaller than or equal to the distances in the original space, a bound which is valid in both spaces can be determined

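The distance of similar objects is smaller than or equal to ε in the original space and, consequently, it is smaller than or equal to ε in the feature space as well.

Lower bounding lemma
If the feature distance never exceeds the original distance, d_feature(F(O_1), F(O_2)) ≤ d(O_1, O_2), then whenever the distance of two objects is smaller than or equal to ε in the original space, it is also smaller than or equal to ε in the feature space
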
d_feature(F(O_1), F(O_2)) ≤ d(O_1, O_2) ≤ ε    o.k.
d(O_1, O_2) ≤ ε ≤ d_feature(F(O_1), F(O_2))    WRONG!
d_feature(F(O_1), F(O_2)) ≤ ε ≤ d(O_1, O_2)    ?

No qualifying object will be missed in the feature space (no false dismissals)
There will be some selected objects that are not similar in the original space (false hits/alarms)
That means that we are guaranteed to have selected all the objects we wanted, plus some additional false hits in the feature space
In the second step, false hits have to be filtered from the set of the selected objects through comparison in the original space

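The two-step search can be sketched as a filter followed by a refinement. The example below is a minimal sketch under stated assumptions: the single feature is the average scaled by sqrt(n), a scaling chosen here (not taken from the slides) so that the feature distance provably lower-bounds the Euclidean distance by the Cauchy-Schwarz inequality. The names `feature` and `gemini_range_query` and the data are hypothetical, and the filter is a linear pass where a real system would use a spatial index.

```python
import math
from statistics import fmean

def feature(x):
    # One-number feature: the average, scaled by sqrt(n) so that
    # |feature(x) - feature(y)| <= euclidean(x, y) (Cauchy-Schwarz).
    return math.sqrt(len(x)) * fmean(x)

def gemini_range_query(db, query, eps):
    fq = feature(query)
    # Step 1: quick-and-dirty filter in feature space (may admit false hits).
    candidates = [oid for oid, x in db.items() if abs(feature(x) - fq) <= eps]
    # Step 2: remove false hits by checking the distance in the original space.
    answers = [oid for oid in candidates if math.dist(db[oid], query) <= eps]
    return candidates, answers

db = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4.0, 3.0, 2.0, 1.0],  # same average as "a" but far away: a false hit
    "c": [1.0, 2.0, 3.0, 4.5],
    "d": [9.0, 9.0, 9.0, 9.0],
}
candidates, answers = gemini_range_query(db, [1.0, 2.0, 3.0, 4.0], eps=1.0)
print(candidates, answers)
```

Object "b" passes the filter and is removed in the second step, while "d" is discarded without ever computing its full distance. Because the feature distance is a lower bound, every true answer survives the filter.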
Time sequences
[Figure: white noise and brown noise, with their Fourier spectra in log-log scale]

Time sequences
Conclusion: colored noises are well approximated by their first few Fourier coefficients
Colored noises appear in nature

Time sequences
E.g.: GEMINI
Important: Q: how to extract features?
A: "if I have only one number to describe my object, what should this be?"

1-D Time Sequences
Distance function: Euclidean distance
Find features that:
  preserve/lower-bound the distance
  carry as much information as possible (reduce false alarms)
If we are allowed to use only one feature, what would this be?
  The average
Extending it:
  The average of the 1st half, of the 2nd half, of the 1st quarter, etc.
  Coefficients of the Fourier transform (DFT), wavelet transform, etc.

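Segment averages can be checked against the lower-bounding requirement directly. The sketch below assumes k equal-length segments and scales the feature distance by sqrt(m), where m is the segment length; that scaling is an assumption made here so that the bound holds, and the names `segment_averages` and `feature_distance` are hypothetical.

```python
import math
from statistics import fmean

def segment_averages(x, k):
    """Averages of k equal-length segments of x."""
    m = len(x) // k
    if m * k != len(x):
        raise ValueError("len(x) must be divisible by k")
    return [fmean(x[i * m:(i + 1) * m]) for i in range(k)]

def feature_distance(x, y, k):
    # Scaling by sqrt(segment length) keeps the result <= euclidean(x, y).
    m = len(x) // k
    return math.sqrt(m) * math.dist(segment_averages(x, k), segment_averages(y, k))

x = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 2.0]
y = [2.0, 2.0, 4.0, 4.0, 1.0, 7.0, 3.0, 3.0]
for k in (1, 2, 4, 8):
    print(k, feature_distance(x, y, k), math.dist(x, y))
```

With k = 1 this is the single-average feature; as k grows the bound tightens, and at k = len(x) it equals the true distance, which is the trade-off between feature count and false alarms.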
Feature extracting function
1. Define a distance function
2. Find a feature extraction function F() that satisfies the bounding lemma
Example:
  The Discrete Fourier Transform (DFT) preserves Euclidean distances between signals (Parseval's theorem)
  F() = DFT which keeps the first coefficients of the transform

1-D Time Sequences
Show that the distance in feature space lower-bounds the actual distance
DFT?
Parseval's theorem: the DFT preserves the energy of the signal as well as the distances between two signals
  d(x, y) = d(X, Y)
  where X and Y are the Fourier transforms of x and y (with the DFT normalized by $1/\sqrt{n}$)
If we keep the first k ≤ n coefficients of the DFT, we lower-bound the actual distance:
$$ d_{feature}^2(F(x), F(y)) = \sum_{f=0}^{k-1} |X_f - Y_f|^2 \le \sum_{f=0}^{n-1} |X_f - Y_f|^2 = \sum_{i=0}^{n-1} |x_i - y_i|^2 = d^2(x, y) $$

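The lower-bounding property of the truncated DFT can be verified numerically. The sketch below uses a direct O(n^2) DFT with 1/sqrt(n) normalization, written with the standard library only; the names `dft` and `dft_feature_distance` and the sample signals are hypothetical, and a production system would use an FFT routine instead.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT (1/sqrt(n) scaling), for which Parseval's theorem holds."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]

def dft_feature_distance(x, y, k):
    """Euclidean distance between the first k DFT coefficients of x and y."""
    X, Y = dft(x), dft(y)
    return math.sqrt(sum(abs(X[f] - Y[f]) ** 2 for f in range(k)))

x = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 2.0]
y = [2.0, 2.0, 4.0, 4.0, 1.0, 7.0, 3.0, 3.0]
for k in range(1, len(x) + 1):
    print(k, dft_feature_distance(x, y, k), math.dist(x, y))
```

Every printed feature distance stays at or below the true distance, and keeping all n coefficients recovers it up to rounding.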
Time sequences - results
keep the first 2-3 Fourier coefficients
faster than seq. scan
no false dismissals
[Figure: time versus # of coefficients kept; curves for total time, cleanup time, and R-tree time]

Time sequences - improvements:
could use wavelets, or DCT
could use segment averages

Images - color
What is an image?
A: 2-d array

2-D color images - Color histograms
Each color image is a 2-d array of pixels
Each pixel has 3 color components (R, G, B)
h colors, each color denoting a point in 3-d color space (as high as 2^24 colors)
For each image compute the h-element color histogram: each component is the percentage of pixels that are most similar to that color
The histogram of image I is defined as:
  For a color C_i, H_{C_i}(I) represents the number of pixels of color C_i in image I
  OR: for any pixel in image I, H_{C_i}(I) represents the probability of that pixel having color C_i

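A normalized color histogram could be computed as sketched below. The uniform quantization of each channel into a fixed number of levels is an assumption made for illustration (the slides do not specify how the h colors are chosen), and the name `color_histogram`, the `levels` parameter, and the tiny image are hypothetical.

```python
def color_histogram(image, levels=2):
    """Normalized histogram over levels**3 color bins.

    image: rows of (r, g, b) tuples with components in 0..255.
    Each component of the result is the fraction of pixels in that bin.
    """
    hist = [0.0] * levels ** 3
    n_pixels = 0
    for row in image:
        for r, g, b in row:
            # Map each channel value 0..255 to a level 0..levels-1.
            ri, gi, bi = r * levels // 256, g * levels // 256, b * levels // 256
            hist[(ri * levels + gi) * levels + bi] += 1
            n_pixels += 1
    if n_pixels == 0:
        raise ValueError("image has no pixels")
    return [count / n_pixels for count in hist]

# Hypothetical 2x2 image: two reddish pixels, one blue, one black.
image = [[(255, 0, 0), (250, 10, 5)], [(0, 0, 255), (0, 0, 0)]]
hist = color_histogram(image)
print(hist)
```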
2-D color images - Color histograms
Usually cluster similar colors together and choose one representative color for each color bin
Most commercial CBIR systems include the color histogram as one of the features (e.g., QBIC of IBM)
No spatial information

Color histograms - distance
One method to measure the distance between two histograms x and y is:
$$ d_h^2(x, y) = (x - y)^t A (x - y) = \sum_{i}^{h} \sum_{j}^{h} a_{ij} (x_i - y_i)(x_j - y_j) $$
where the color-to-color similarity matrix A has entries a_{ij} that describe the similarity between color i and color j

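The quadratic-form distance can be sketched in a few lines. The three-bin similarity matrix below (red, orange, blue, with red and orange half-similar) is made up for illustration; the name `quadratic_form_distance_sq` is hypothetical.

```python
def quadratic_form_distance_sq(x, y, A):
    """Squared histogram distance (x - y)^t A (x - y)."""
    d = [a - b for a, b in zip(x, y)]
    h = len(d)
    return sum(A[i][j] * d[i] * d[j] for i in range(h) for j in range(h))

# Hypothetical bins: 0 = red, 1 = orange, 2 = blue.
A = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
all_red, all_orange, all_blue = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
print(quadratic_form_distance_sq(all_red, all_orange, A))
print(quadratic_form_distance_sq(all_red, all_blue, A))
```

With the cross-talk entries in A, an all-red image ends up closer to an all-orange image than to an all-blue one; with the identity matrix the measure reduces to the squared Euclidean distance and both pairs are equally far apart.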
Images - color
Mathematically, the distance function is: [equation missing in source]

Color histograms - lower bounding
1st step: define the distance function between two color images: d() = d_h()
2nd step: find numerical features (one or more) whose Euclidean distance lower-bounds d_h()
If we are allowed to use one numerical feature to describe the color image, what should it be?
  The average amount of each color component (R, G, B)
$$ \bar{x} = (R_{avg}, G_{avg}, B_{avg})^t $$
where
$$ R_{avg} = \frac{1}{P} \sum_{p=1}^{P} R(p) $$
and similarly for G_avg and B_avg, where P is the number of pixels in the image and R(p) is the red component (intensity) of the p-th pixel

Color histograms - lower bounding
Given the average color vectors $\bar{x}$ and $\bar{y}$ of two images, we define d_avg() as the Euclidean distance between the 3-d average color vectors:
$$ d_{avg}^2(\bar{x}, \bar{y}) = (\bar{x} - \bar{y})^t (\bar{x} - \bar{y}) = \sum_{i=1}^{3} (\bar{x}_i - \bar{y}_i)^2 $$
3rd step: prove that the feature distance d_avg() lower-bounds the actual distance d_h()
By the "Quadratic Distance Bounding" theorem, it is guaranteed that the distance between the vectors representing the histograms is greater than or equal to the distance between the average color vectors of the images
The proof of the "Quadratic Distance Bounding" theorem is based upon an unconstrained minimization problem using Lagrange multipliers
Main idea of the approach: first a filtering using the average (R, G, B) color, then a more accurate matching using the full h-element histogram

Images - color
Performance:
[Figure: time versus selectivity; sequential scan compared with filtering w/ avg RGB]

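The average-color feature and its distance d_avg() could be computed as in the sketch below. The names `average_color` and `d_avg` and the sample images are hypothetical. The code only computes the filter distance; whether it lower-bounds a particular d_h() depends on the similarity matrix A and is not checked here.

```python
import math

def average_color(image):
    """(R_avg, G_avg, B_avg) of an image given as rows of (r, g, b) tuples."""
    pixels = [p for row in image for p in row]
    if not pixels:
        raise ValueError("image has no pixels")
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def d_avg(img1, img2):
    """Euclidean distance between the 3-d average color vectors."""
    return math.dist(average_color(img1), average_color(img2))

# Hypothetical one-row images.
img_a = [[(200, 0, 0), (100, 0, 0)]]
img_b = [[(150, 30, 40), (150, 30, 40)]]
print(average_color(img_a), average_color(img_b), d_avg(img_a, img_b))
```

Since each image is reduced to three numbers, these vectors can be stored in a spatial index and used as the quick-and-dirty filter before the full histogram comparison.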
Color auto-correlogram
Pick any pixel p1 of color C_i in the image I
At distance k away from p1, pick another pixel p2
What is the probability that p2 is also of color C_i?
[Figure: image I with pixels P1 and P2 at distance k; "Red?"]

Color auto-correlogram
The auto-correlogram of image I for color C_i, distance k:
$$ \gamma^{(k)}_{C_i}(I) \equiv \Pr\big[\, |p_1 - p_2| = k,\ p_2 \in I_{C_i} \mid p_1 \in I_{C_i} \,\big] $$
Integrates both color information and spatial information

Color auto-correlogram - Implementations
Pixel distance measures
Use the D8 distance (also called chessboard distance):
$$ d_{max}(p, q) = \max(|p_x - q_x|, |p_y - q_y|) $$
Choose distance k = 1, 3, 5, 7
Computation complexity:
  Histogram: O(n^2)
  Correlogram: O(134 * n^2)

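A brute-force auto-correlogram using the D8 distance can be sketched as follows. It assumes the image is already quantized to small integer color indices and that pairs falling outside the image are simply ignored; both choices, like the name `auto_correlogram`, are assumptions for illustration rather than the implementation behind the slides.

```python
def auto_correlogram(img, n_colors, distances):
    """Estimate Pr[p2 has color c | p1 has color c, D8(p1, p2) = k].

    img: rows of color indices in 0..n_colors-1.
    Returns a dict keyed by (color, k).
    """
    rows, cols = len(img), len(img[0])
    result = {}
    for k in distances:
        same = [0] * n_colors
        total = [0] * n_colors
        # Offsets on the square ring at chessboard distance exactly k.
        ring = [(dy, dx) for dy in range(-k, k + 1) for dx in range(-k, k + 1)
                if max(abs(dy), abs(dx)) == k]
        for y in range(rows):
            for x in range(cols):
                c = img[y][x]
                for dy, dx in ring:
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        total[c] += 1
                        if img[yy][xx] == c:
                            same[c] += 1
        for c in range(n_colors):
            result[c, k] = same[c] / total[c] if total[c] else 0.0
    return result

# Two 4x4 images with identical histograms (half color 0, half color 1).
checkerboard = [[(x + y) % 2 for x in range(4)] for y in range(4)]
halves = [[0] * 4, [0] * 4, [1] * 4, [1] * 4]
print(auto_correlogram(checkerboard, 2, [1])[0, 1])
print(auto_correlogram(halves, 2, [1])[0, 1])
```

The two images are indistinguishable by their histograms, yet their auto-correlograms differ at k = 1 because the same colors are arranged differently in space.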
Implementations
Features distance measures: D(f(I_1) - f(I_2)) is small means I_1 and I_2 are similar
(m = R, G, B; k = distance)
For histogram:
$$ |I - I'|_h = \sum_{i \in [m]} \frac{|h_{C_i}(I) - h_{C_i}(I')|}{1 + h_{C_i}(I) + h_{C_i}(I')} $$
For correlogram:
$$ |I - I'|_\gamma = \sum_{i \in [m],\, k \in [d]} \frac{|\gamma^{(k)}_{C_i}(I) - \gamma^{(k)}_{C_i}(I')|}{1 + \gamma^{(k)}_{C_i}(I) + \gamma^{(k)}_{C_i}(I')} $$

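Both measures share one form: a sum of absolute differences, each divided by one plus the sum of the two values. A minimal sketch over flat feature vectors follows; the name `relative_l1` and the sample histograms are hypothetical. For a correlogram, the vector would hold one entry per (color, distance) pair.

```python
def relative_l1(f1, f2):
    """Sum of |a - b| / (1 + a + b) over corresponding feature entries."""
    if len(f1) != len(f2):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(a - b) / (1 + a + b) for a, b in zip(f1, f2))

# Hypothetical 3-bin histograms of two images.
h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.5, 0.25]
print(relative_l1(h1, h2))
```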
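Color Histogram vs Correlogram
[Figure: example query; rank of the target image under each method]
Correlogram method: 1st
Histogram method: 48th

Color Histogram vs Correlogram
[Figure: example query; rank of the target image under each method]
Correlogram method: 1st
Histogram method: 31st
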
Color Histogram vs Correlogram
[Figure: four example queries; C = correlogram rank, H = histogram rank]
C: 178th, H: 230th
C: 1st, H: 1st
C: 1st, H: 3rd
C: 5th, H: 18th

Color Histogram vs Correlogram
The color correlogram describes the global distribution of local spatial correlations of colors
It is easy to compute
It is more stable than the color histogram method

Images - shapes
Distance function: Euclidean, on the area
Q: how to do dim. reduction?
A: Karhunen-Loeve (PCA)

Images - shapes
Performance: ~10x faster
[Figure: log(# of I/Os) versus # of features kept, compared with all features kept]

Multimedia Indexing - Conclusions
GEMINI is a popular method
Whole matching problem
Should pay attention to:
  Distance functions
  Feature extraction functions
  Lower bounding
  Particular application

Conclusions
GEMINI works for any setting (time sequences, images, etc.)
uses a quick and dirty filter
faster than seq. scan