Data Mining Decision-Trees for Comparative Models and Possibilities for Uniting Texts and Coded Data Michael D. Fischer, (U Kent) m.d.fischer@kent.ac.uk
Comparative Research There are two basic approaches to comparative ethnographic research, doubtless what Doug might refer to as the right way and the wrong way. One approach, featured in these sessions, uses coded databases representing the cultures chosen for the Standard Cross-Cultural Sample (SCCS) and examines the relationships between variables, now much improved by Eff's contribution to resolving Galton's Problem of autocorrelation. But ultimately these datasets depend on the quality of the researchers' original coding of ethnographic works, on which works are selected, and on how consistently each researcher has balanced judgements in arriving at a value.
Comparative Research The other approach is to use the ethnographic texts themselves directly, reading them and depending either on on-the-fly coding or on general scholarly skills, which are increasingly in short supply. One attempt to be more systematic in this approach is that of the Human Relations Area Files (HRAF), in which the Outline of Cultural Materials is used to code each paragraph in an ethnography with one or more categories. There are a number of problems with this approach. Some reasonable level of consistency is present, as a small core of professionals do this full time. However, few tools are presently provided to make it easy to do research that can be generalised safely.
Comparative Research Even if HRAF's set of cultures included all the SCCS cultures, making the two interoperable as analytic tools would be challenging. This seems to be the project I have landed myself with, at least to the extent of considering how it can be done, having some sense of obligation and opportunity as a minor partner in the CoSSci project and as Vice-President of HRAF. I have been experimenting with data mining decision trees for classifying model outcomes, then normalising the decision trees into production rules to extract the logic underlying the classifications, with the goal of creating a bi-directional bridge between the codes of SCCS and the texts of HRAF, which will hopefully greatly expand the capability of comparative research. A long way to go yet!
The Enterprise...
What is Culture? From Behaviour to Thought
... that complex whole which includes all the habits acquired by man as a member of society. Ruth Benedict, 1929
... the integral whole consisting of implements and consumer goods, of constitutional charters for the various social groupings, of human ideas and crafts, beliefs and customs. Bronislaw Malinowski, 1944
A society's culture consists of whatever it is that one has to know or believe in order to operate in a manner acceptable to its members. Ward Goodenough, 1962
... a system of symbols and meanings. David Schneider, 1976
Crashing the Party In 1971 G. P. Murdock presented an enigmatic lecture, Anthropology's Mythology, to the Royal Anthropological Institute. He made a pair of dramatic claims: neither culture nor social structure can be reified to serve as an explanation. They were, to the extent they existed at all, our characterisation of patterns of interactions between individuals, not the source of these interactions. Anthropologists must abandon subjects of a superorganic nature and deal with individuals and their productions to explain what we describe as social and cultural phenomena.
Adaptive Agency Adaption: optimising around minima or maxima of a resource or challenge. Adaptive Agency: changing the rules. New Ideas - Reconceptualisation; Changing the context; Fabrication; Modifications that must be reproduced; Increase choices.
Lots of shared stuff: American KT
American KT Algebra
The SCCS - aggregated agency?
So what about Murdock's warning?
Adaptive Agency & 'Big Data' Many choices/degrees of freedom. Order emerges because of shared stories and the need to apply knowledge, the need to produce context, the need to leverage outcome, and the need to reproduce story.
How Ethnography (and Big Data) works Common 'stories'. Common range of approaches by individuals. Different sub-groups have different ongoing narratives that support specific interpretations of these stories. These interpretations inform actions that set up repeated patterns in and between sub-groups.
How Ethnography (and Big Data) works Ethnography identifies different groups and their group narratives. Ethnography includes case studies of how people respond to different contexts. Intensive interaction with a relatively small collection of people (<15) leads to surprisingly robust results.
How Ethnography (and Big Data) works Robust results from a small number of people are consistent with the power of stories and larger narratives to influence behaviour. These stories and narratives effectively represent a 'logic' that informs reasoning in different domains. To what extent can we recover this logic from patterns of behaviour in conjunction with the stories?
PolySocial Reality (PoSR) image infvark.com + sapplin
PoSR Layers (exploded)
Weighted results - tick those that are of interest
Now with ability to record decision-making notes
PATTERN IDENTIFICATION: DEVELOPING RULES Generally need a body of examples to train from. Usually these are hand selected. Testing on wild data seeded with similar examples. Improved example set.
So, following a search, extract and parse out the scraped data. The tf-idf weight (term frequency-inverse document frequency) evaluates how important a word is to a document in a collection or corpus. Importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Goldilocks words: not too common or too rare.
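The tf-idf weighting described above can be sketched in a few lines. This is a minimal illustration, not the code used in the project: it assumes whitespace tokenisation, the plain variant tf = count / document length and idf = log(N / document frequency), and a set of hypothetical listings invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = occurrences of the term in the document / terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical scraped listings, invented for illustration only.
listings = [
    "rare orchid plant for sale",
    "orchid plant pot for sale",
    "garden plant pot for sale",
]
weights = tf_idf(listings)
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.4f}")
```

Words that occur in every listing ("plant", "for", "sale") get weight zero, while a word confined to one listing ("rare") scores highest. In practice the Goldilocks band would be chosen by thresholding these weights.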
Currently: accuracy 90% (based on the orchid training dataset); reports every 10 mins. Future: aim to combine assessment of the results into the training dataset; expanding to make it more general for other sites, e.g. alibaba.com & gumtree.co.uk.
So can we amplify SCCS with ethnography?
Sample of HRAF Text - Expert Judgements
Weka - Waikato Environment for Knowledge Analysis
Weka Explorer
Weka J48 Output
J48 pruned tree

Plantain (add crop) = p
|   Macabo (add crop) = p: Plant Cocoa (79.0)
|   Macabo (add crop) = na
|   |   Climate driver = na: Plant Cocoa (0.0)
|   |   Climate driver = Longer rainy season and more wind every year: Plant Cocoa (0.0)
|   |   Climate driver = Increasing dry spells in small wet season and more wind: Plant Cocoa (4.0)
|   |   Climate driver = Dry spells getting longer and more wind every year: Plant Cassava (2.0)
Plantain (add crop) = na: Plant Plantain (10.0)

Number of Leaves : 6
Size of the tree : 9
Weka J48 Output
=== Stratified cross-validation ===
Correctly Classified Instances        93        97.8947 %
Incorrectly Classified Instances       2         2.1053 %
Kappa statistic                        0.8984
K&B Relative Info Score             4548.0423 %
K&B Information Score                 62.8488 bits    0.6616 bits/instance
Class complexity order 0              74.3763 bits    0.7829 bits/instance
Class complexity scheme               12.8016 bits    0.1348 bits/instance
Complexity improvement (Sf)           61.5747 bits    0.6482 bits/instance
Mean absolute error                    0.0038
Root mean squared error                0.0585
Relative absolute error                7.404 %
Root relative squared error           41.6883 %
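Several of the figures in this summary can be checked against one another with simple arithmetic. The sketch below assumes the usual reading of Weka's output: "order 0" is the cost in bits of coding the classes from the prior class distribution alone, "scheme" is the cost under the learned model, and the improvement is the difference between the two. Variable names are mine, not Weka's.

```python
# Figures copied from the cross-validation summary above.
correct, incorrect = 93, 2
order0_bits = 74.3763   # class complexity, order 0
scheme_bits = 12.8016   # class complexity, scheme

n_instances = correct + incorrect          # 95 cases in total
accuracy = 100 * correct / n_instances     # percent correctly classified

# Improvement = bits saved by using the tree rather than the prior alone.
improvement_bits = order0_bits - scheme_bits
improvement_per_instance = improvement_bits / n_instances

print(f"instances: {n_instances}")
print(f"accuracy: {accuracy:.4f} %")
print(f"improvement: {improvement_bits:.4f} bits, "
      f"{improvement_per_instance:.4f} bits/instance")
```

The total of 95 instances also agrees with the leaf counts in the pruned tree (79 + 4 + 2 + 10).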
KNeTs Expert Rules
IF Plantain (add crop)
IF Macabo (add crop)
THENHYP Plant Cocoa (79.0)

IF Plantain (add crop)
IFNOT Macabo (add crop)
IF Climate driver is Increasing dry spells in small wet season and more wind
THENHYP Plant Cocoa (4.0)

IF Plantain (add crop)
IFNOT Macabo (add crop)
IF Climate driver is Dry spells getting longer and more wind every year
THENHYP Plant Cassava (2.0)

IFNOT Plantain (add crop)
THENHYP Plant Plantain (10.0)
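Normalising the tree into production rules amounts to walking each root-to-leaf path and emitting its tests as conditions. The following is a rough sketch of that step, not the KNeTs implementation. It assumes the tree is available in Weka's usual indented text layout, that the values "p" and "na" mean present and absent, and that leaves covering no cases are dropped, as in the rules on this slide. The function names are hypothetical.

```python
import re

TREE = """\
Plantain (add crop) = p
|   Macabo (add crop) = p: Plant Cocoa (79.0)
|   Macabo (add crop) = na
|   |   Climate driver = na: Plant Cocoa (0.0)
|   |   Climate driver = Longer rainy season and more wind every year: Plant Cocoa (0.0)
|   |   Climate driver = Increasing dry spells in small wet season and more wind: Plant Cocoa (4.0)
|   |   Climate driver = Dry spells getting longer and more wind every year: Plant Cassava (2.0)
Plantain (add crop) = na: Plant Plantain (10.0)
"""

# A leaf line looks like "<attribute> = <value>: <outcome> (<count>)".
LEAF = re.compile(r"^(?P<test>.*?): (?P<outcome>.*) \((?P<count>[\d.]+)")

def condition(test):
    """Turn 'attribute = value' into an IF / IFNOT clause (assumed p/na coding)."""
    attribute, value = test.split(" = ", 1)
    if value == "p":
        return f"IF {attribute}"
    if value == "na":
        return f"IFNOT {attribute}"
    return f"IF {attribute} is {value}"

def tree_to_rules(text):
    rules, path = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = line.count("|   ")        # nesting level of this test
        body = line.replace("|   ", "")
        leaf = LEAF.match(body)
        test = leaf.group("test") if leaf else body
        del path[depth:]                  # discard tests from finished branches
        path.append(condition(test))
        # Keep only leaves that actually cover some cases.
        if leaf and float(leaf.group("count")) > 0:
            rules.append((list(path), leaf.group("outcome"), leaf.group("count")))
    return rules

rules = tree_to_rules(TREE)
for conditions, outcome, count in rules:
    print(" ".join(conditions), "THENHYP", outcome, f"({count})")
```

Each printed line is one rule, with its conditions in path order followed by the hypothesised outcome and the number of cases it covers.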
KNeTs XML Output
<rules>
  <rule>
    <if> Plantain (add crop)</if>
    <if> Macabo (add crop)</if>
    <then val="(79.0)"> Plant Cocoa </then>
  </rule>
  <rule>
    <if> Plantain (add crop)</if>
    <if><not> Macabo (add crop)</not></if>
    <if> Climate driver is Increasing dry spells in small wet season and more wind</if>
    <then val="(4.0)"> Plant Cocoa </then>
  </rule>
  <rule>
    <if> Plantain (add crop)</if>
    <if><not> Macabo (add crop)</not></if>
    <if> Climate driver is Dry spells getting longer and more wind every year</if>
    <then val="(2.0)"> Plant Cassava </then>
  </rule>
  <rule>
    <if><not> Plantain (add crop)</not></if>
    <then val="(10.0)"> Plant Plantain </then>
  </rule>
</rules>
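Serialising rules into this XML shape is straightforward with a standard XML library. The sketch below is an illustration only, not the KNeTs exporter: it assumes each rule is held as a list of (text, negated) conditions plus an outcome and a case count, which is my own representation, and it omits the padding spaces seen inside the elements above.

```python
import xml.etree.ElementTree as ET

# Two of the four rules, in an assumed (conditions, outcome, count) form.
# Each condition is (text, negated).
RULES = [
    ([("Plantain (add crop)", False), ("Macabo (add crop)", False)],
     "Plant Cocoa", "79.0"),
    ([("Plantain (add crop)", True)],
     "Plant Plantain", "10.0"),
]

def rules_to_xml(rules):
    """Build a <rules> element tree matching the layout shown on the slide."""
    root = ET.Element("rules")
    for conditions, outcome, count in rules:
        rule = ET.SubElement(root, "rule")
        for text, negated in conditions:
            node = ET.SubElement(rule, "if")
            if negated:
                # A negated condition is wrapped as <if><not>...</not></if>.
                node = ET.SubElement(node, "not")
            node.text = text
        then = ET.SubElement(rule, "then", val=f"({count})")
        then.text = outcome
    return root

root = rules_to_xml(RULES)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the output is ordinary XML, the same rules can be read back with `ET.fromstring` and matched against coded variables or text categories.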