Welcome to Pittsburgh!
1 Welcome to Pittsburgh!
2 Overview of the IWSLT 2005 Evaluation Campaign. Matthias Eck and Chiori Hori, InterACT, Carnegie Mellon University
3 IWSLT 2005 Evaluation Campaign. Working on the same test bed. Training corpus release: May 20, 2005. Test corpus release: Aug 16, 2005. Result submission due: Aug 18, 2005. Camera-ready paper: Sep 25, 2005. Technical paper submission: July 25, 2005.
4 Translation target. Manual transcription (plain sentences in BTEC); ASR output of spoken BTEC sentences. Examples: Where would you like to go? Is there a discount for children? Did you have fun today? Sure. Can I have a receipt? I'd like to try some local wine. No discourse.
5 Scientific Questions. How well can ASR output be translated in the face of recognition errors? How much can MT performance be enhanced by considering multiple hypotheses? Which hypotheses can contribute to MT performance?
6 Translation target. ASR output of spoken BTEC sentences. Real evaluation conditions: read-aloud speech of the BTEC, no spontaneity. The difference between translating text and ASR output -> handling recognition errors. Multiple hypotheses provided: N-best lists, lattices (HTK format).
7 Directions and source input. Translation directions, from manual transcription and ASR output: Chinese-English, Japanese-English, Arabic-English, Korean-English, English-Chinese. (ASR by Dr. Chen, NLPR; Dr. Yamamoto, ATR; Mr. Paulik, UKA)
8 Provided Data. All data from the BTEC corpus. 2 development sets: C-STAR 2003 test set (506 sentences), IWSLT 2004 test set (500 sentences). Training data: 20K sentences. Test data: 506 sentences.
9 Data and tool restrictions
Track: Supplied | Supplied & Tools | Unrestricted | C-STAR
IWSLT05 corpus: yes | yes | yes | yes
Tagger/Chunker/Parser: - | yes | yes | yes
Public data: - | - | yes | yes
Proprietary data: - | - | - | yes
10 Details of the data and tool restrictions. Supplied Data Track: the supplied corpus only. Supplied Data + Tools Track: training data limited to the supplied corpus, but parser/chunker and tagger tools may be used. Unrestricted Data Track: all publicly available data, including data crawled from the web. C-STAR Track: no limitations on linguistic resources; full BTEC corpus and proprietary data.
11 Participants - 17 institutions / 16 groups. Institution (systems): RWTH Aachen University (RWTH); ITC-irst - Center for Scientific and Technological Research (ITC-irst); University of Edinburgh (EDINBURGH); Nagaoka University of Technology (NGKT); University of Southern California Information Sciences Institute (USC-ISI); University of Tokyo (TOKYO); ATR Spoken Language Communication Research Labs (ATR-ALEPH, ATR-SL)
12 Participants - 19 translation systems. Institution (systems): MIT Lincoln Laboratory / Air Force Research Laboratory (MIT-LL/AFRL); National Laboratory of Pattern Recognition (NLPR); NTT Cyber Space Laboratories (NTT); TALP Research Center (TALP-ngram, TALP-phrase); Microsoft Research (MICROSOFT); Carnegie Mellon University; Oki Electric Industry Co., Ltd. (OKI); Sehda Inc. (SEHDA)
13 Participants - Techniques. 19 translation systems, grouped into SMT, SMT+Syntax, EBMT, and ET: TALP-ngram, SEHDA, TOKYO, EDINBURGH, TALP-phrase, MICROSOFT, ATR-ALEPH, NGKT, OKI, ATR-SL, ITC-irst, MIT-LL/AFRL, NLPR
14 Translation systems. Techniques: SMT 12, SMT+Syntax 3, EBMT 3, ET 1. Country of origin: Japan 7 (5 groups), USA 6, Spain 2 (1 group), Italy 1, China 1, UK 1, Germany 1.
15 System participation - manual transcription. Tracks: Supplied (12), Supplied & Tools, Unrestricted, C-STAR. Directions: Chinese-English, Japanese-English, Arabic-English, Korean-English, English-Chinese.
16 System participation - ASR output. Tracks: Supplied, Supplied & Tools, Unrestricted, C-STAR. Directions: Chinese-English, Japanese-English, English-Chinese.
17 Results - Manual Transcription (Matthias)
18 BLEU. Geometric mean of the n-gram precisions of the hypothesis compared to the reference translations, with a length penalty for short translations. Scores: 0-1. Benefits: missing references can be covered by combining other references; correlates well with Fluency. Problems: re-combination of references can cause errors; all words are weighted equally; weak correlation with Adequacy.
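The BLEU computation described above can be sketched in a few lines. This is a sentence-level, single-reference illustration only; real BLEU is corpus-level and pools n-gram counts over multiple references:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Toy sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_n, ref_n = ngrams(hyp, n), ngrams(ref, n)
        # clipped n-gram matches: a hypothesis n-gram only counts as
        # often as it appears in the reference
        overlap = sum(min(c, ref_n[g]) for g, c in hyp_n.items())
        total = max(sum(hyp_n.values()), 1)
        log_prec += math.log(max(overlap, 1e-9) / total)
    # brevity penalty: the "length penalty for short translations"
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)
```

A perfect match scores 1.0; any missing n-gram order or a short hypothesis pulls the score toward 0.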
19 NIST. Variant of BLEU using the arithmetic mean of weighted n-gram precision values. Benefits: considers information gain; uses up to 9-grams (usually 5-grams); good correlation with Adequacy. Problems: re-combination of references can cause errors; weak correlation with Fluency (human judgement).
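NIST replaces BLEU's geometric mean with an arithmetic mean and weights each matched n-gram by an information gain. A toy sketch (the `info` weighting is left as a parameter; the real metric derives it from reference-corpus n-gram frequencies and also applies a brevity factor omitted here):

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def nist_toy(hypothesis, reference, max_n=5, info=None):
    """Arithmetic mean-style NIST sketch. With info=None every n-gram
    gets weight 1.0, so this reduces to summed plain n-gram precisions."""
    hyp, ref = hypothesis.split(), reference.split()
    weight = info if info is not None else (lambda g: 1.0)
    score = 0.0
    for n in range(1, max_n + 1):
        hyp_n, ref_n = ngram_counts(hyp, n), ngram_counts(ref, n)
        # each matched n-gram contributes its information-gain weight
        matched = sum(min(c, ref_n[g]) * weight(g) for g, c in hyp_n.items())
        score += matched / max(sum(hyp_n.values()), 1)
    return score
```

Because the per-order precisions are summed rather than multiplied, the score has no fixed upper bound of 1, matching the unbounded NIST scale.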
20 mWER, mPER. mWER: word error rate over multiple references; edit distance between the hypothesis and the closest reference. Scores: 0-1. mPER: mWER without considering word order. Benefits: correlates well with human judgement... Problems: ...if enough references are available.
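A minimal sketch of both metrics, scoring against the closest of multiple references (the exact normalization used in the campaign may differ slightly):

```python
from collections import Counter

def edit_distance(a, b):
    """Word-level Levenshtein distance (single rolling row)."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[len(b)]

def mwer(hypothesis, references):
    """mWER: edit distance to the closest reference, normalized by
    that reference's length."""
    hyp = hypothesis.split()
    return min(edit_distance(hyp, r.split()) / len(r.split()) for r in references)

def mper(hypothesis, references):
    """mPER: position-independent variant; only bag-of-words mismatch
    counts, so reordering is free."""
    hyp = Counter(hypothesis.split())
    best = float("inf")
    for r in references:
        ref = Counter(r.split())
        errors = max(sum((hyp - ref).values()), sum((ref - hyp).values()))
        best = min(best, errors / sum(ref.values()))
    return best
```

A fully reordered but lexically correct hypothesis gets mPER 0 while mWER penalizes every misplaced word, which is exactly the difference the slide describes.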
21 GTM, METEOR. GTM: similarity between texts using a unigram-based F-measure. METEOR: considers exact matches, stem matches and synonym matches (using WordNet); case insensitive; cannot be used on Chinese output (yet). Scores: 0-1. Example: houses (exact match), house (stem match), home (synonym match).
22 Automatic Evaluation - specification for English outputs. Focus on speech-to-speech translation: punctuation marks and mixed casing are less relevant. Standard evaluation: case insensitive (all lowercase); punctuation marks . ? ! , : ; removed; '-' removed to split compounds. Optional evaluation: case sensitive (mixed case), separated punctuation marks; only done if the submitted data contained mixed-case characters. No numbers reported here; please refer to the overview paper.
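The standard-track normalization can be sketched as follows; the exact punctuation set is an assumption based on the characters listed on the slide:

```python
import re

def normalize(line, case_sensitive=False):
    """Standard-evaluation normalization sketch: lowercase, split
    hyphenated compounds, drop punctuation, collapse whitespace."""
    if not case_sensitive:
        line = line.lower()
    line = line.replace("-", " ")          # '-' removed to split compounds
    line = re.sub(r"[.?!,:;]", "", line)   # punctuation marks removed
    return " ".join(line.split())
```

Applying the same normalization to hypotheses and references before scoring keeps casing and punctuation from influencing any of the automatic metrics.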
23 Automatic Evaluation - specification for Chinese outputs. Evaluation 1: using the given (ASR) segmentation; punctuation marks removed. Evaluation 2: character segmented, which eliminates the influence of word segmentation; punctuation marks removed.
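Evaluation 2's character segmentation is straightforward to reproduce: every non-space character becomes its own token, so word-segmentation differences between systems disappear:

```python
def char_segment(line):
    """Character-level segmentation for Chinese evaluation: each
    non-space character becomes one token."""
    return " ".join(ch for ch in line if not ch.isspace())
```

The character-segmented strings can then be fed to any word-based metric (BLEU, NIST, mWER, ...) unchanged.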
24 Online Evaluation Server
25 Online Evaluation Server. Form fields: Language Pair, Data Track, File, Further Comments.
26 Evaluation Server Output
27 Evaluation Server Output. Mixed case!! Automatically detected. Number of lines?
28 Evaluation Server Output (2)
29 Subjective Evaluation - Fluency/Adequacy
Fluency: 4 Flawless English; 3 Good English; 2 Non-native English; 1 Disfluent English; 0 Incomprehensible
Adequacy: 4 All information; 3 Most information; 2 Much information; 1 Little information; 0 None
Typically used metrics: Fluency/Adequacy (e.g. IWSLT 2004). Here: 0-4 instead of 1-5.
30 Subjective Evaluation - Meaning Maintenance
Meaning Maintenance: 4 Exactly the same meaning; 3 Almost the same meaning; 2 Partially the same meaning and no new information; 1 Partially the same meaning but misleading information is introduced; 0 Totally different meaning
(Adequacy, for comparison: 4 All information; 3 Most information; 2 Much information; 1 Little information; 0 None)
31 Why Meaning Maintenance? Focus on comparing the meaning of the translation with the source: to what degree is misleading information introduced? Two types of errors: (1) Obvious error, no meaning change: the translation is still useful, and Adequacy and Meaning Maintenance scores are similar. (2) The error changes the meaning (e.g. negation): the translation is not useful. An Adequacy grader might ignore the change and judge only the correct parts; the explicit focus on meaning prevents this.
32 Subjective Evaluation procedure. All translations are shown at the same time, randomly ordered, allowing comparison among translations of the same sentence. No explicit reference is shown: the reference is included among the translations, so there is no bias from a displayed reference, and it yields an oracle score. The source is shown for Adequacy and Meaning Maintenance scoring. 5 bilingual graders (scores shown are for 3 graders). First all Fluency scores, then Adequacy, finally Meaning Maintenance.
33 Subjective Evaluation Tool - Fluency Part 1: Fluency
34 Subjective Evaluation Tool - Adequacy Part 2: Adequacy
35 Subjective Evaluation Tool - Meaning Maintenance. Part 3: Meaning Maintenance
36 Evaluation Results. Human evaluation was only done for the most popular track: Chinese-English translation of manual transcription (MT), Supplied Data Track. 11 submissions for this track were evaluated; an additional 10% of the translations were graded a second time by the same grader to measure inconsistencies.
37 Human Evaluation Results (top systems per metric)
Adequacy: MIT-LL/AFRL 2.71, ITC-irst, TALP-phrase, TALP-ngram 2.44, EDINBURGH 2.33
Fluency: ITC-irst 3.15, TALP-phrase, TALP-ngram, EDINBURGH 2.81, MIT-LL/AFRL 2.79
Meaning Maint.: MIT-LL/AFRL 2.63, ITC-irst 2.60, TALP-ngram 2.40, EDINBURGH, TALP-phrase
38 Human Evaluation Results - Adequacy. Upper bound: reference performance. [bar chart: Adequacy per system]
39 Adequacy - significance? [bar chart: Adequacy per system with significance ranges]
40 ...and Fluency [bar charts: Adequacy and Fluency per system]
41 ...and Meaning Maintenance [bar charts: Adequacy, Fluency and Meaning Maint. per system]
42 Analysis: Adequacy - Fluency [scatter plot: Fluency vs. Adequacy]
43 Analysis: Adequacy - Fluency [scatter plot with regions Fluency >> Adequacy, Fluency ~ Adequacy, Fluency << Adequacy]
44 Analysis: Adequacy - Fluency. Fluency 3 (Good English) is rare; scores cluster at 4 (Flawless) and 2 (Non-native). [scatter plot]
45 Consistency? (Inter-Grader). How consistent are the scores assigned by the 3 graders? Average differences between grades were computed for each grader pair (G1-G2, G1-G3, G2-G3) and averaged, for Adequacy, Fluency and Meaning Maint. All 3 graders agreed for about 40% of the sentences; 2 graders agreed for about 60%.
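The inter-grader numbers above are average absolute score differences per grader pair; a sketch of that computation (the grader names and scores below are hypothetical):

```python
from itertools import combinations

def avg_pairwise_difference(grades):
    """grades: dict mapping grader name -> list of scores, with all
    lists in the same sentence order. Returns the mean absolute score
    difference for every grader pair."""
    out = {}
    for g1, g2 in combinations(sorted(grades), 2):
        diffs = [abs(a - b) for a, b in zip(grades[g1], grades[g2])]
        out[(g1, g2)] = sum(diffs) / len(diffs)
    return out
```

Running it once per metric (Adequacy, Fluency, Meaning Maint.) reproduces the per-pair rows of the slide's table.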
46 Consistency? (Intra-Grader). How consistent are the scores assigned by each grader? (Based on the 10% of sentences graded twice.) Average differences between the first and second grade were computed for each grader (Grader 1, 2, 3) and averaged, for Adequacy, Fluency and Meaning Maint.
47 Do we need Meaning Maintenance? The difference to Adequacy is less than 2 in 91% of the cases, with a high overall correlation (Pearson: 0.82). But the correlation is low for low scores (Meaning Maint. 0, 1): avg. difference 0.75, Pearson 0.20. For high scores (3, 4): avg. difference 0.25, Pearson 0.65. Graders tend to use similar scores on good translations and diverge on bad ones. Grader inconsistency is lower for Meaning Maintenance. Conclusion: no additional score is necessary; instead, make graders aware of meaning in Adequacy scoring.
48 BLEU: Chinese-English MT, Supplied Data [bar chart: BLEU per system]
49 BLEU: Chinese-English MT, Supplied Data [same chart]
50 NIST: Chinese-English MT, Supplied Data [bar chart: NIST per system]
51 NIST: Chinese-English MT, Supplied Data [same chart]
52 mPER: Chinese-English MT, Supplied Data [bar chart: mPER per system]
53 mWER, mPER [bar chart: mWER and mPER per system]
54 Additional metric for Chinese-English, MT, Supplied: TER (Translation Error Rate). Newly introduced metric: measures error as the minimum number of edits needed to change the hypothesis so that it exactly matches one of the references. TER = <# of edits> / <avg. # of reference words>, calculated against the best (closest) reference. Edits include insertions, deletions, substitutions and shifts; every edit counts as 1 error (edit distance). A shift moves a sequence of words within the hypothesis; shifting any sequence of words over any distance is only 1 error. Scores: 0-1.
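The TER formula can be sketched without the shift operation (shifts require a search over word-sequence moves and are omitted here, so this version can only over-count edits relative to real TER):

```python
def ter_no_shifts(hypothesis, references):
    """TER sketch: edits against the closest reference, divided by the
    average reference length. No shift edits, so this upper-bounds TER."""
    def edit_distance(a, b):
        # word-level Levenshtein: insertions, deletions, substitutions
        d = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, wb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
        return d[len(b)]
    hyp = hypothesis.split()
    edits = min(edit_distance(hyp, r.split()) for r in references)
    avg_len = sum(len(r.split()) for r in references) / len(references)
    return edits / avg_len
```

Real TER's shift operation can turn a long reordering (many substitutions here) into a single edit, which is why it was introduced alongside plain WER.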
55 mWER, mPER [bar chart: mWER and mPER per system]
56 mWER, mPER and TER [bar chart: mWER, mPER and TER per system]
57 Chinese-English Supplied Data - Rankings: BLEU, NIST, mWER, mPER, GTM, METEOR, TER vs. Adequacy, Fluency, Meaning Maint. [table: the top two ranks alternate between ITC-irst and MIT-LL/AFRL depending on the metric]
69 Correlation Human - Automatic Scores. Pearson correlation between the human scores (Adequacy, Fluency, Meaning Maint.) and the automatic scores (BLEU, NIST, mWER, mPER, GTM, METEOR, TER). [correlation table]
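The Pearson coefficients reported here are plain correlations between two lists of scores (e.g. per-system BLEU vs. per-system Adequacy):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 means the automatic metric ranks systems like the human judges do; near 0 means no linear relationship.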
70 Chinese-English BLEU Scores: Supplied [bar chart per system]
71 Chinese-English BLEU Scores: Supplied, Supplied + Tools [bar chart]
72 Chinese-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
73 Chinese-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
74 Chinese-English NIST Scores: Supplied [bar chart per system]
75 Chinese-English NIST Scores: Supplied, Supplied + Tools [bar chart]
76 Chinese-English NIST Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
77 Chinese-English NIST Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
78 Japanese-English BLEU Scores: Supplied [bar chart per system]
79 Japanese-English BLEU Scores: Supplied, Supplied + Tools [bar chart]
80 Japanese-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
81 Japanese-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
82 Japanese-English NIST Scores: Supplied [bar chart per system]
83 Japanese-English NIST Scores: Supplied, Supplied + Tools [bar chart]
84 Japanese-English NIST Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
85 Japanese-English NIST Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
86 Japanese-English Supplied Data - Rankings: BLEU, NIST, mWER, mPER, GTM, METEOR [table: the top ranks alternate between ITC-irst and EDINBURGH]. Different metrics rank differently!
94 Arabic-English BLEU Scores: Supplied [bar chart per system]
95 Arabic-English BLEU Scores: Supplied, Supplied + Tools [bar chart]
96 Arabic-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
97 Arabic-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
98 Arabic-English NIST Scores: Supplied [bar chart per system]
99 Arabic-English NIST Scores: Supplied, Supplied + Tools [bar chart]
100 Arabic-English NIST Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
101 Arabic-English NIST Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
102 Korean-English BLEU Scores: Supplied [bar chart per system]
103 Korean-English BLEU Scores: Supplied, Supplied + Tools [bar chart]
104 Korean-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
105 Korean-English BLEU Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
106 Korean-English NIST Scores: Supplied [bar chart per system]
107 Korean-English NIST Scores: Supplied, Supplied + Tools [bar chart]
108 Korean-English NIST Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
109 Korean-English NIST Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
110 English-Chinese BLEU Scores: Supplied [bar chart: EDINBURGH, MICROSOFT, ATR-ALEPH]
111 English-Chinese BLEU Scores: Supplied, Supplied + Tools [bar chart]
112 English-Chinese BLEU Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
113 English-Chinese BLEU Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
114 English-Chinese NIST Scores: Supplied [bar chart: EDINBURGH, MICROSOFT, ATR-ALEPH]
115 English-Chinese NIST Scores: Supplied, Supplied + Tools [bar chart]
116 English-Chinese NIST Scores: Supplied, Supplied + Tools, Unrestricted [bar chart]
117 English-Chinese NIST Scores: Supplied, Supplied + Tools, Unrestricted, C-STAR [bar chart]
118 ASR Results (Chiori)
119 Directions and source input. Translation directions, from manual transcription and ASR output: Chinese-English, Japanese-English, Arabic-English, Korean-English, English-Chinese. (ASR by Dr. Chen, NLPR; Dr. Yamamoto, ATR; Mr. Paulik, UKA)
120 Japanese ASR performance. 506 utterances were recognized. [chart: word error rate (%) for 1-best, N-best and lattice on DEVSET1, DEVSET2, TESTSET]
123 Chinese ASR performance. 506 utterances were recognized. [chart: word error rate (%) for 1-best, 20-best and lattice on DEVSET1, DEVSET2, TESTSET]
126 How much is MT performance degraded by ASR recognition errors? Chinese-English track. [histogram: number of sentences per word error rate bucket (0-20%, 20-40%, 40-60%, 60-80%, 80-100%)]
127 BLEU score vs. word error rate (%) [scatter plot]
129 NIST score vs. word error rate (%) [scatter plot]
131 Multiple ASR hypotheses translation [plots: number of sentences vs. word error rate; CASIA BLEU and NIST scores vs. word error rate]
132 Acknowledgments. All participants; NLPR and ATR for ASR; BBN for TER; UKA. Thanks a lot!
More informationLanguage Models. Philipp Koehn. 11 September 2018
Language Models Philipp Koehn 11 September 2018 Language models 1 Language models answer the question: How likely is a string of English words good English? Help with reordering p LM (the house is small)
More informationChapter 3: Basics of Language Modelling
Chapter 3: Basics of Language Modelling Motivation Language Models are used in Speech Recognition Machine Translation Natural Language Generation Query completion For research and development: need a simple
More informationRecap: Language models. Foundations of Natural Language Processing Lecture 4 Language Models: Evaluation and Smoothing. Two types of evaluation in NLP
Recap: Language models Foundations of atural Language Processing Lecture 4 Language Models: Evaluation and Smoothing Alex Lascarides (Slides based on those from Alex Lascarides, Sharon Goldwater and Philipp
More information{ Jurafsky & Martin Ch. 6:! 6.6 incl.
N-grams Now Simple (Unsmoothed) N-grams Smoothing { Add-one Smoothing { Backo { Deleted Interpolation Reading: { Jurafsky & Martin Ch. 6:! 6.6 incl. 1 Word-prediction Applications Augmentative Communication
More informationN-gram Language Modeling Tutorial
N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures
More informationSpeech and Language Processing. Chapter 9 of SLP Automatic Speech Recognition (II)
Speech and Language Processing Chapter 9 of SLP Automatic Speech Recognition (II) Outline for ASR ASR Architecture The Noisy Channel Model Five easy pieces of an ASR system 1) Language Model 2) Lexicon/Pronunciation
More informationNatural Language Processing SoSe Language Modelling. (based on the slides of Dr. Saeedeh Momtazi)
Natural Language Processing SoSe 2015 Language Modelling Dr. Mariana Neves April 20th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Outline 2 Motivation Estimation Evaluation Smoothing Outline 3 Motivation
More informationMark Scheme (Results) Summer Pearson Edexcel GCE in Statistics 3R (6691/01R)
Mark Scheme (Results) Summer 2014 Pearson Edexcel GCE in Statistics 3R (6691/01R) Edexcel and BTEC Qualifications Edexcel and BTEC qualifications come from Pearson, the world s leading learning company.
More informationStatistical Phrase-Based Speech Translation
Statistical Phrase-Based Speech Translation Lambert Mathias 1 William Byrne 2 1 Center for Language and Speech Processing Department of Electrical and Computer Engineering Johns Hopkins University 2 Machine
More informationAutomated Summarisation for Evidence Based Medicine
Automated Summarisation for Evidence Based Medicine Diego Mollá Centre for Language Technology, Macquarie University HAIL, 22 March 2012 Contents Evidence Based Medicine Our Corpus for Summarisation Structure
More informationMark Scheme (Results) Summer International GCSE Mathematics (4MA0) Paper 4HR
Mark Scheme (Results) Summer 0 International GCSE Mathematics (4MA0) Paper 4HR Edexcel and BTEC Qualifications Edexcel and BTEC qualifications come from Pearson, the world s leading learning company. We
More informationMass Asset Additions. Overview. Effective mm/dd/yy Page 1 of 47 Rev 1. Copyright Oracle, All rights reserved.
Overview Effective mm/dd/yy Page 1 of 47 Rev 1 System References None Distribution Oracle Assets Job Title * Ownership The Job Title [list@yourcompany.com?subject=eduxxxxx] is responsible for ensuring
More informationHomework 4, Part B: Structured perceptron
Homework 4, Part B: Structured perceptron CS 585, UMass Amherst, Fall 2016 Overview Due Friday, Oct 28. Get starter code/data from the course website s schedule page. You should submit a zipped directory
More informationImproved Decipherment of Homophonic Ciphers
Improved Decipherment of Homophonic Ciphers Malte Nuhn and Julian Schamper and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department, RWTH Aachen University, Aachen,
More informationStatistical Machine Translation and Automatic Speech Recognition under Uncertainty
Statistical Machine Translation and Automatic Speech Recognition under Uncertainty Lambert Mathias A dissertation submitted to the Johns Hopkins University in conformity with the requirements for the degree
More informationElastic and Inelastic Collisions
Elastic and Inelastic Collisions - TA Version Physics Topics If necessary, review the following topics and relevant textbook sections from Serway / Jewett Physics for Scientists and Engineers, 9th Ed.
More informationPMT. Mark Scheme (Results) Summer Pearson Edexcel GCSE In Mathematics A (1MA0) Higher (Calculator) Paper 2H
Mark Scheme (Results) Summer 2014 Pearson Edexcel GCSE In Mathematics A (1MA0) Higher (Calculator) Paper 2H Edexcel and BTEC Qualifications Edexcel and BTEC qualifications are awarded by Pearson, the UK
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More information1. Markov models. 1.1 Markov-chain
1. Markov models 1.1 Markov-chain Let X be a random variable X = (X 1,..., X t ) taking values in some set S = {s 1,..., s N }. The sequence is Markov chain if it has the following properties: 1. Limited
More informationMath.3336: Discrete Mathematics. Applications of Propositional Logic
Math.3336: Discrete Mathematics Applications of Propositional Logic Instructor: Dr. Blerina Xhabli Department of Mathematics, University of Houston https://www.math.uh.edu/ blerina Email: blerina@math.uh.edu
More informationLanguage Processing with Perl and Prolog
Language Processing with Perl and Prolog Chapter 5: Counting Words Pierre Nugues Lund University Pierre.Nugues@cs.lth.se http://cs.lth.se/pierre_nugues/ Pierre Nugues Language Processing with Perl and
More informationMachine Learning for natural language processing
Machine Learning for natural language processing N-grams and language models Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 25 Introduction Goals: Estimate the probability that a
More informationStatistical Machine Translation
Statistical Machine Translation Marcello Federico FBK-irst Trento, Italy Galileo Galilei PhD School University of Pisa Pisa, 7-19 May 2008 Part V: Language Modeling 1 Comparing ASR and statistical MT N-gram
More informationUnsupervised Vocabulary Induction
Infant Language Acquisition Unsupervised Vocabulary Induction MIT (Saffran et al., 1997) 8 month-old babies exposed to stream of syllables Stream composed of synthetic words (pabikumalikiwabufa) After
More informationTriplet Lexicon Models for Statistical Machine Translation
Triplet Lexicon Models for Statistical Machine Translation Saša Hasan, Juri Ganitkevitch, Hermann Ney and Jesús Andrés Ferrer lastname@cs.rwth-aachen.de CLSP Student Seminar February 6, 2009 Human Language
More informationStatistical Substring Reduction in Linear Time
Statistical Substring Reduction in Linear Time Xueqiang Lü Institute of Computational Linguistics Peking University, Beijing lxq@pku.edu.cn Le Zhang Institute of Computer Software & Theory Northeastern
More informationDiscriminative Learning in Speech Recognition
Discriminative Learning in Speech Recognition Yueng-Tien, Lo g96470198@csie.ntnu.edu.tw Speech Lab, CSIE Reference Xiaodong He and Li Deng. "Discriminative Learning in Speech Recognition, Technical Report
More informationPenn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark
Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument
More informationA L A BA M A L A W R E V IE W
A L A BA M A L A W R E V IE W Volume 52 Fall 2000 Number 1 B E F O R E D I S A B I L I T Y C I V I L R I G HT S : C I V I L W A R P E N S I O N S A N D TH E P O L I T I C S O F D I S A B I L I T Y I N
More informationKneser-Ney smoothing explained
foldl home blog contact feed Kneser-Ney smoothing explained 18 January 2014 Language models are an essential element of natural language processing, central to tasks ranging from spellchecking to machine
More informationLanguage Modeling. Michael Collins, Columbia University
Language Modeling Michael Collins, Columbia University Overview The language modeling problem Trigram models Evaluating language models: perplexity Estimation techniques: Linear interpolation Discounting
More informationOut of GIZA Efficient Word Alignment Models for SMT
Out of GIZA Efficient Word Alignment Models for SMT Yanjun Ma National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Series March 4, 2009 Y. Ma (DCU) Out of Giza
More informationA Convolutional Neural Network-based
A Convolutional Neural Network-based Model for Knowledge Base Completion Dat Quoc Nguyen Joint work with: Dai Quoc Nguyen, Tu Dinh Nguyen and Dinh Phung April 16, 2018 Introduction Word vectors learned
More informationAn Algorithm for Fast Calculation of Back-off N-gram Probabilities with Unigram Rescaling
An Algorithm for Fast Calculation of Back-off N-gram Probabilities with Unigram Rescaling Masaharu Kato, Tetsuo Kosaka, Akinori Ito and Shozo Makino Abstract Topic-based stochastic models such as the probabilistic
More informationAccelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance
Accelerated Natural Language Processing Lecture 3 Morphology and Finite State Machines; Edit Distance Sharon Goldwater (based on slides by Philipp Koehn) 20 September 2018 Sharon Goldwater ANLP Lecture
More informationSpatial Role Labeling CS365 Course Project
Spatial Role Labeling CS365 Course Project Amit Kumar, akkumar@iitk.ac.in Chandra Sekhar, gchandra@iitk.ac.in Supervisor : Dr.Amitabha Mukerjee ABSTRACT In natural language processing one of the important
More informationACS Introduction to NLP Lecture 3: Language Modelling and Smoothing
ACS Introduction to NLP Lecture 3: Language Modelling and Smoothing Stephen Clark Natural Language and Information Processing (NLIP) Group sc609@cam.ac.uk Language Modelling 2 A language model is a probability
More informationElastic and Inelastic Collisions
Physics Topics Elastic and Inelastic Collisions If necessary, review the following topics and relevant textbook sections from Serway / Jewett Physics for Scientists and Engineers, 9th Ed. Kinetic Energy
More informationDecoding Revisited: Easy-Part-First & MERT. February 26, 2015
Decoding Revisited: Easy-Part-First & MERT February 26, 2015 Translating the Easy Part First? the tourism initiative addresses this for the first time the die tm:-0.19,lm:-0.4, d:0, all:-0.65 tourism touristische
More informationStatistical Machine Translation
Statistical Machine Translation -tree-based models (cont.)- Artem Sokolov Computerlinguistik Universität Heidelberg Sommersemester 2015 material from P. Koehn, S. Riezler, D. Altshuler Bottom-Up Decoding
More informationConditional Language Modeling. Chris Dyer
Conditional Language Modeling Chris Dyer Unconditional LMs A language model assigns probabilities to sequences of words,. w =(w 1,w 2,...,w`) It is convenient to decompose this probability using the chain
More informationStatistical methods for NLP Estimation
Statistical methods for NLP Estimation UNIVERSITY OF Richard Johansson January 29, 2015 why does the teacher care so much about the coin-tossing experiment? because it can model many situations: I pick
More informationCS446: Machine Learning Spring Problem Set 4
CS446: Machine Learning Spring 2017 Problem Set 4 Handed Out: February 27 th, 2017 Due: March 11 th, 2017 Feel free to talk to other members of the class in doing the homework. I am more concerned that
More informationDecoding in Statistical Machine Translation. Mid-course Evaluation. Decoding. Christian Hardmeier
Decoding in Statistical Machine Translation Christian Hardmeier 2016-05-04 Mid-course Evaluation http://stp.lingfil.uu.se/~sara/kurser/mt16/ mid-course-eval.html Decoding The decoder is the part of the
More informationExperiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition
Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden
More informationSemantics with Dense Vectors. Reference: D. Jurafsky and J. Martin, Speech and Language Processing
Semantics with Dense Vectors Reference: D. Jurafsky and J. Martin, Speech and Language Processing 1 Semantics with Dense Vectors We saw how to represent a word as a sparse vector with dimensions corresponding
More informationAcoustic Modeling for Speech Recognition
Acoustic Modeling for Speech Recognition Berlin Chen 2004 References:. X. Huang et. al. Spoken Language Processing. Chapter 8 2. S. Young. The HTK Book (HTK Version 3.2) Introduction For the given acoustic
More informationThe statement calculus and logic
Chapter 2 Contrariwise, continued Tweedledee, if it was so, it might be; and if it were so, it would be; but as it isn t, it ain t. That s logic. Lewis Carroll You will have encountered several languages
More informationFROM QUERIES TO TOP-K RESULTS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS
FROM QUERIES TO TOP-K RESULTS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link
More information] Automatic Speech Recognition (CS753)
] Automatic Speech Recognition (CS753) Lecture 17: Discriminative Training for HMMs Instructor: Preethi Jyothi Sep 28, 2017 Discriminative Training Recall: MLE for HMMs Maximum likelihood estimation (MLE)
More informationMark Scheme (Results) November Pearson Edexcel GCSE in Mathematics Linear (1MA0) Higher (Non-Calculator) Paper 1H
Mark Scheme (Results) November 2013 Pearson Edexcel GCSE in Mathematics Linear (1MA0) Higher (Non-Calculator) Paper 1H Edexcel and BTEC Qualifications Edexcel and BTEC qualifications are awarded by Pearson,
More informationCollaborative NLP-aided ontology modelling
Collaborative NLP-aided ontology modelling Chiara Ghidini ghidini@fbk.eu Marco Rospocher rospocher@fbk.eu International Winter School on Language and Data/Knowledge Technologies TrentoRISE Trento, 24 th
More informationLanguage Modeling. Introduction to N-grams. Many Slides are adapted from slides by Dan Jurafsky
Language Modeling Introduction to N-grams Many Slides are adapted from slides by Dan Jurafsky Probabilistic Language Models Today s goal: assign a probability to a sentence Why? Machine Translation: P(high
More informationDARPA ATIS Test Results June 1990
DARPA ATIS Test Results June 1990 D. S. Pallett, W. M. Fisher, J. G. Fiscus, and J. S. Garofolo Room A 216 Technology Building National Institute of Standards and Technology (NIST) Gaithersburg, MD 20899
More informationThe distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus
Uppsala University Department of Linguistics The distribution of characters, bi- and trigrams in the Uppsala 70 million words Swedish newspaper corpus Bengt Dahlqvist Abstract The paper describes some
More informationModeling Norms of Turn-Taking in Multi-Party Conversation
Modeling Norms of Turn-Taking in Multi-Party Conversation Kornel Laskowski Carnegie Mellon University Pittsburgh PA, USA 13 July, 2010 Laskowski ACL 1010, Uppsala, Sweden 1/29 Comparing Written Documents
More informationMachine Translation: Examples. Statistical NLP Spring Levels of Transfer. Corpus-Based MT. World-Level MT: Examples
Statistical NLP Spring 2009 Machine Translation: Examples Lecture 17: Word Alignment Dan Klein UC Berkeley Corpus-Based MT Levels of Transfer Modeling correspondences between languages Sentence-aligned
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationAutomatic Speech Recognition and Statistical Machine Translation under Uncertainty
Outlines Automatic Speech Recognition and Statistical Machine Translation under Uncertainty Lambert Mathias Advisor: Prof. William Byrne Thesis Committee: Prof. Gerard Meyer, Prof. Trac Tran and Prof.
More informationAn Introduction to Bioinformatics Algorithms Hidden Markov Models
Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training
More information