Data Mining Decision-Trees for Comparative Models and Possibilities for Uniting Texts and Coded Data


Data Mining Decision-Trees for Comparative Models and Possibilities for Uniting Texts and Coded Data Michael D. Fischer, (U Kent) m.d.fischer@kent.ac.uk

Comparative Research There are two basic approaches to comparative ethnographic research; doubtless what Doug might refer to as the right way and the wrong way. One approach is that which is featured in these sessions: using coded databases representing cultures chosen for the Standard Cross-Cultural Sample (SCCS) and examining the relationships between variables, now much improved by Eff's contribution to resolving Galton's Problem relating to autocorrelation. But ultimately these datasets depend on the quality of the researchers' original coding of ethnographic works, on which works are selected, and on how consistently the researcher has balanced judgements in arriving at a value.

Comparative Research The other approach is to use the ethnographic texts themselves directly, reading them and relying either on on-the-fly coding or on general scholarly skills, which are increasingly in short supply. One attempt to be more systematic in this approach is that of the Human Relations Area Files (HRAF), in which the Outline of Cultural Materials is used to code each paragraph in an ethnography with one or more categories. There are a number of problems with this approach. Some reasonable level of consistency is present, as a small core of professionals do this full time. However, there are few tools presently provided to make it easy to do research that can be generalised safely.

Comparative Research Even if HRAF's set of cultures included all the SCCS cultures, making the two interoperable as analytic tools would be challenging. This seems to be the project I have landed myself with, at least to the extent of considering how this can be done, having some sense of obligation and opportunity as a minor partner in the CoSSci project and as the Vice-president of HRAF. I have been experimenting with data mining decision trees for classifying model outcomes, then normalising the decision trees into production rules to extract a logic underlying the classifications, with the goal of creating a bi-directional bridge between the codes of SCCS and the texts of HRAF, which hopefully will greatly expand the capability of comparative research. A long way to go yet!

The Enterprise...

What is Culture? From Behaviour to Thought
... that complex whole which includes all the habits acquired by man as a member of society. (Ruth Benedict, 1929)
... the integral whole consisting of implements and consumer goods, of constitutional charters for the various social groupings, of human ideas and crafts, beliefs and customs. (Bronislaw Malinowski, 1944)
A society's culture consists of whatever it is that one has to know or believe in order to operate in a manner acceptable to its members. (Ward Goodenough, 1962)
... a system of symbols and meanings. (David Schneider, 1976)

Crashing the Party In 1971 G. P. Murdock presented an enigmatic lecture, Anthropology's Mythology, to the Royal Anthropological Institute. He made a pair of dramatic claims: neither culture nor social structure can be reified to serve as an explanation. They were, to the extent they existed at all, our characterisation of patterns of interactions between individuals, not the source of these interactions. Anthropologists must abandon subjects of a superorganic nature and deal with individuals and their productions to explain what we describe as social and cultural phenomena.

Adaptive Agency
Adaptation: optimising around minima or maxima of a resource or challenge
Adaptive Agency: changing the rules
New Ideas - Reconceptualisation
Changing the context
Fabrication
Modifications that must be reproduced
Increase choices

Lots of shared stuff: American KT

American KT Algebra

The SCCS - aggregated agency?

So what about Murdock's warning?

Adaptive Agency & 'Big Data'
Many Choices/Degrees of Freedom
Order emerges because of shared stories and the
need to apply knowledge
need to produce context
need to leverage outcome
need to reproduce story

How Ethnography (and Big Data) works
Common 'stories', common range of approaches by individuals.
Different sub-groups have different ongoing narratives that support specific interpretations of these stories.
These interpretations inform actions that set up repeated patterns in and between sub-groups.

How Ethnography (and Big Data) works
Ethnography identifies different groups and their group narratives.
Ethnography includes case studies of how people respond to different contexts.
Intensive interaction with a relatively small collection of people (<15) leads to surprisingly robust results.

How Ethnography (and Big Data) works
Robust results from a small number of people are consistent with the power of stories and larger narratives to influence behaviour.
These stories and narratives effectively represent a 'logic' that informs reasoning in different domains.
To what extent can we recover this logic from patterns of behaviour in conjunction with the stories?

PolySocial Reality (PoSR) image infvark.com + sapplin

PoSR Layers (exploded)

Weighted results - tick those that are of interest

Now with ability to record decision-making notes

PATTERN IDENTIFICATION: DEVELOPING RULES
Generally need a body of examples to train from
Usually these are hand selected
Testing on wild data seeded with similar examples
Improved example set

So, following a search, the scraped data is extracted and parsed.
tf-idf weight (term frequency-inverse document frequency): evaluates how important a word is to a document in a collection or corpus.
Importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
Goldilocks words: not too common or too rare.
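The tf-idf weighting described above can be sketched in a few lines of Python. This is a minimal illustration only: the three listings are made up, and the function name `tf_idf` and the plain `log(N / df)` form of the inverse document frequency are assumptions, not the scheme used in the project.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (docs is a list of token lists)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # term frequency in this document, offset by corpus-wide frequency
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical scraped listings, already tokenised.
docs = [
    "rare orchid plant for sale".split(),
    "orchid pot for sale".split(),
    "garden chair for sale".split(),
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Words that occur in every listing ("for", "sale") get a weight of zero. Note that this plain form gives the highest weight to the rarest words, so selecting "Goldilocks" words would additionally need something like a minimum document-frequency cutoff.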

Currently
Accuracy: 90% (based on the orchid training dataset)
Reports every 10 mins
Future
Aim to combine assessment of the results into the training dataset
Expanding to make it more general for other sites, e.g. alibaba.com & gumtree.co.uk

So can we amplify SCCS with ethnography?

Sample of HRAF Text - Expert Judgements

Weka - Environment for Knowledge Analysis

Weka Explorer

Weka J48 Output
J48 pruned tree

Plantain (add crop) = p
|   Macabo (add crop) = p: Plant Cocoa (79.0)
|   Macabo (add crop) = na
|   |   Climate driver = na: Plant Cocoa (0.0)
|   |   Climate driver = Longer rainy season and more wind every year: Plant Cocoa (0.0)
|   |   Climate driver = Increasing dry spells in small wet season and more wind: Plant Cocoa (4.0)
|   |   Climate driver = Dry spells getting longer and more wind every year: Plant Cassava (2.0)
Plantain (add crop) = na: Plant Plantain (10.0)

Number of Leaves : 6
Size of the tree : 9

Weka J48 Output
=== Stratified cross-validation ===

Correctly Classified Instances        93           97.8947 %
Incorrectly Classified Instances       2            2.1053 %
Kappa statistic                        0.8984
K&B Relative Info Score             4548.0423 %
K&B Information Score                 62.8488 bits    0.6616 bits/instance
Class complexity | order 0            74.3763 bits    0.7829 bits/instance
Class complexity | scheme             12.8016 bits    0.1348 bits/instance
Complexity improvement (Sf)           61.5747 bits    0.6482 bits/instance
Mean absolute error                    0.0038
Root mean squared error                0.0585
Relative absolute error                7.404  %
Root relative squared error           41.6883 %
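As a reading aid, the headline figures in this output can be checked by hand. The sketch below assumes the reported kappa is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the agreement expected by chance; the variable names are illustrative, and the implied p_e is derived from the reported numbers rather than taken from the output.

```python
# Figures copied from the cross-validation summary.
correct = 93
incorrect = 2
reported_kappa = 0.8984

total = correct + incorrect          # 95 instances
accuracy = correct / total           # observed agreement p_o


def cohen_kappa(p_o, p_e):
    """Cohen's kappa: agreement beyond chance, scaled by the maximum possible."""
    return (p_o - p_e) / (1 - p_e)


# Rearranging kappa = (p_o - p_e) / (1 - p_e) gives the chance agreement
# implied by the reported kappa (an inference, not part of the Weka output).
implied_chance = (accuracy - reported_kappa) / (1 - reported_kappa)

print(f"accuracy       = {accuracy:.4%}")
print(f"implied chance = {implied_chance:.3f}")
```

On this reading, roughly four fifths of the agreement could be reached by chance alone, which fits a dataset dominated by one outcome (Plant Cocoa) and explains why kappa is well below the raw accuracy.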

KNeTs Expert Rules

IF Plantain (add crop)
IF Macabo (add crop)
THENHYP Plant Cocoa (79.0)

IF Plantain (add crop)
IFNOT Macabo (add crop)
IF Climate driver is Increasing dry spells in small wet season and more wind
THENHYP Plant Cocoa (4.0)

IF Plantain (add crop)
IFNOT Macabo (add crop)
IF Climate driver is Dry spells getting longer and more wind every year
THENHYP Plant Cassava (2.0)

IFNOT Plantain (add crop)
THENHYP Plant Plantain (10.0)
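The step from the J48 tree to these production rules amounts to enumerating root-to-leaf paths. The following is a minimal sketch of that idea, not the KNeTs implementation: the nested-tuple encoding of the tree, the function names, and the reading of `p`/`na` as present/absent are assumptions. Leaves covering no instances are dropped, as they are in the rules shown above.

```python
# A node is (attribute, {value: child}); a leaf is (outcome, instance_count).
TREE = ("Plantain (add crop)", {
    "p": ("Macabo (add crop)", {
        "p": ("Plant Cocoa", 79.0),
        "na": ("Climate driver", {
            "na": ("Plant Cocoa", 0.0),
            "Longer rainy season and more wind every year": ("Plant Cocoa", 0.0),
            "Increasing dry spells in small wet season and more wind": ("Plant Cocoa", 4.0),
            "Dry spells getting longer and more wind every year": ("Plant Cassava", 2.0),
        }),
    }),
    "na": ("Plant Plantain", 10.0),
})


def condition(attribute, value):
    """Render one test on a path; 'p'/'na' are read as present/absent (assumption)."""
    if value == "p":
        return f"IF {attribute}"
    if value == "na":
        return f"IFNOT {attribute}"
    return f"IF {attribute} is {value}"


def tree_to_rules(node, path=()):
    """Yield (conditions, outcome, count) for each leaf that covers instances."""
    label, rest = node
    if not isinstance(rest, dict):      # leaf: rest is the instance count
        if rest > 0:
            yield list(path), label, rest
        return
    for value, child in rest.items():   # internal node: extend the path
        yield from tree_to_rules(child, path + (condition(label, value),))


rules = list(tree_to_rules(TREE))
for conditions, outcome, count in rules:
    print(" ".join(conditions), f"THENHYP {outcome} ({count})")
```

Each printed line corresponds to one of the four rules on the slide, written on a single line.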

KNeTs XML Output

<rules>
  <rule>
    <if>Plantain (add crop)</if>
    <if>Macabo (add crop)</if>
    <then val="(79.0)">Plant Cocoa</then>
  </rule>
  <rule>
    <if>Plantain (add crop)</if>
    <if><not>Macabo (add crop)</not></if>
    <if>Climate driver is Increasing dry spells in small wet season and more wind</if>
    <then val="(4.0)">Plant Cocoa</then>
  </rule>
  <rule>
    <if>Plantain (add crop)</if>
    <if><not>Macabo (add crop)</not></if>
    <if>Climate driver is Dry spells getting longer and more wind every year</if>
    <then val="(2.0)">Plant Cassava</then>
  </rule>
  <rule>
    <if><not>Plantain (add crop)</not></if>
    <then val="(10.0)">Plant Plantain</then>
  </rule>
</rules>
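A rule list of this kind can be serialised to the same element layout with Python's standard `xml.etree.ElementTree` module. This is a sketch under assumptions: the in-memory rule representation (`RULES`, with a negation flag per condition) and the function name are hypothetical, and only two of the four rules are included to keep it short.

```python
import xml.etree.ElementTree as ET

# Each rule: ([(condition_text, negated), ...], outcome, instance_count).
RULES = [
    ([("Plantain (add crop)", False), ("Macabo (add crop)", False)],
     "Plant Cocoa", 79.0),
    ([("Plantain (add crop)", True)], "Plant Plantain", 10.0),
]


def rules_to_xml(rules):
    """Build a <rules> element using the <rule>/<if>/<not>/<then> layout above."""
    root = ET.Element("rules")
    for conditions, outcome, count in rules:
        rule = ET.SubElement(root, "rule")
        for text, negated in conditions:
            cond = ET.SubElement(rule, "if")
            # Negated conditions are wrapped in <not> inside the <if>.
            target = ET.SubElement(cond, "not") if negated else cond
            target.text = text
        then = ET.SubElement(rule, "then", val=f"({count})")
        then.text = outcome
    return root


xml_text = ET.tostring(rules_to_xml(RULES), encoding="unicode")
print(xml_text)
```

Building the elements through the library rather than by string concatenation keeps the output well-formed even if a condition contains characters such as `&` or `<`.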
