Accessing the LEI using ELCat data

Background

The Catalogue of Endangered Languages (ELCat) implements a metric of language vitality known as the Language Endangerment Index, or LEI, described in Lee & Van Way (2016). The LEI draws on four factors to create a language endangerment score in the form of a percentage. Each factor is rated on a six-point ordinal scale, where 0 = “safe” and 5 = “critically endangered”. The four factors are:

speaker numbers (N)
speaker number trends (S)
intergenerational transmission (T)
domains of use (D)

If factors are unavailable or unknown, they are omitted from the calculation without affecting the score. Instead, the number of missing factors is assessed separately through a certainty score.

The implementation of the LEI on the Endangered Languages Project (ELP) site differs slightly from that described in Lee & Van Way 2016. Most significantly, ELP includes three additional levels: “at risk”, “dormant” and “awakening”.

ELCat data dump

The first source of ELCat data is a CSV file available via the Download link on the ELP website.

This file lacks a header row, so we need to supply the column names.

rnames <- c("id","code","name","alt_name","status","speakers", "classification", "varieties", "classification_comment","comment", "country","region","coordinates")
elcat_csv_raw <- read.csv("http://endangeredlanguages.com/userquery/download/", 
                      stringsAsFactors = F,
                      sep = ",",
                      header = F,
                      col.names = rnames)
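
As a small offline illustration of the header = F / col.names pattern used above, the sketch below reads a two-row CSV from an inline string instead of the ELP download. The rows and the three column names are invented for the example and are not taken from the real file.

```r
# Hypothetical two-row CSV with no header row (values invented for illustration)
csv_text <- "1,aaa,Alpha\n2,bbb,Beta"

toy <- read.csv(text = csv_text,
                header = FALSE,                       # first line is data, not names
                col.names = c("id", "code", "name"),  # names supplied by hand
                stringsAsFactors = FALSE)

names(toy)   # "id" "code" "name"
nrow(toy)    # 2
```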

The fifth column combines status and certainty, so we need to tease them apart.

elcat_csv <- elcat_csv_raw %>%
  filter(! str_detect(status, "Dormant")) %>%     # ignore dormant
  filter(! str_detect(status, "Awakening")) %>%   # ignore awakening
  mutate(certainty = str_extract(status, "[0-9]+") ) %>%
  mutate(status = str_extract(status, "^([^\\(]+?) \\(" ), 
         status = str_sub(status, 1, -3) )
#         status = as.factor(status, levels= risk_levels) )

ELCat uses a kludge to include a category “At Risk” (Lee & Van Way 2016:290). “At Risk” is the same as “Safe,” except with less than 100% certainty. We’ll just equate these both to “Safe.”

We’ll also go ahead and ignore the rows with NAs in the status column.

Then we can set the status to a factor variable.

elcat_csv <- mutate(elcat_csv, status = str_replace(status, "At risk", "Safe"))

elcat_csv <- filter(elcat_csv, !is.na(status))

risk_levels <- c("Safe", "Vulnerable", "Threatened", "Endangered", "Severely Endangered", "Critically Endangered")
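# ordered = TRUE belongs inside factor(); outside it, mutate() would create a column named "ordered"
elcat_csv <- mutate(elcat_csv, status = factor(status, levels = risk_levels, ordered = TRUE))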
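
To show what an ordered factor buys us, here is a minimal sketch using three made-up status values. It assumes nothing about the real data beyond the six risk_levels defined above.

```r
risk_levels <- c("Safe", "Vulnerable", "Threatened", "Endangered",
                 "Severely Endangered", "Critically Endangered")

# Made-up status values, only to show the behaviour of an ordered factor
status <- factor(c("Threatened", "Safe", "Critically Endangered"),
                 levels = risk_levels, ordered = TRUE)

is.ordered(status)          # TRUE
status > "Vulnerable"       # TRUE FALSE TRUE
as.character(max(status))   # "Critically Endangered"
```

With an unordered factor the comparison and max() would not be meaningful, which is why the ordered flag has to reach factor() itself.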

ELCat database dump

The csv file available on the ELP site is not very useful because:

it doesn’t show the value of each of the four factors
it only shows the values for the preferred sources

Better to use the nightly sql database dump, which is available at:

http://hdl.handle.net/10125/61860

The relevant tables for the LEI factors are:

language_vitality
language_speakers

language_vitality <- read.csv("language_vitality.csv", stringsAsFactors = F) #, quote = "\"", sep="," , allowEscapes = T)
language_speakers <- read.csv("language_speakers.csv", stringsAsFactors = F, na.strings = c("",NA))

language_vitality contains the values for three of the LEI factors:

intergenerational transmission (in column transmission_id)
speaker number trends (in column speaker_number_trends_id)
domains of use (in column domains_of_use_id)

However, to interpret the values in this table we must use the relevant lookup tables:

language_transmission (T)
language_speakernumbertrends (S)
language_domainsofuse (D)

language_transmision <- read.csv("language_transmission.csv", stringsAsFactors = F)
language_speakernumbertrends <- read.csv("language_speakernumbertrends.csv", stringsAsFactors = F)
language_domainsofuse <- read.csv("language_domainsofuse.csv", stringsAsFactors = F)

These lookup tables give the values and a description for each of the six levels of endangerment for each of these three factors. However, in order to clearly distinguish the “safe” level from a null value (no data), “safe” is not coded as 0, as in LEI, but rather as -1. Thus, the ordinal values in these lookup tables are ranked -1, 1, 2, 3, 4, 5. For example:

kable(language_transmision, row.names = F)

id	transmission_level	description
10	-1	All members of the community, including children, speak the language.
11	1	Most adults in the community, and some children, are speakers.
12	2	Most adults in the community are speakers, but children generally are not.
13	3	Some adults in the community are speakers, but the language is not spoken by children.
14	4	Many of the grandparent generation speak the language, but younger people generally do not.
15	5	There are only a few elderly speakers.
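
The practical consequence of the -1 coding can be sketched with a toy vector: recoding -1 to 0 recovers the LEI scale while a genuinely missing value stays missing. The vector below is illustrative only; it simply lists the six levels from the table plus one NA.

```r
# The six transmission levels as coded in the lookup table, plus one missing value
transmission_level <- c(-1, 1, 2, 3, 4, 5, NA)

# Recode "safe" (-1) to the LEI value 0; NA (no data) is left untouched
lei_transmission <- ifelse(transmission_level == -1, 0, transmission_level)

lei_transmission   # 0 1 2 3 4 5 NA
```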

The speaker number data is treated differently. There is no separate lookup table for speakers. Instead, the various levels for the LEI speaker number factor are coded as text strings in the speaker_number field. The first 6 of these values clearly correspond to the LEI risk levels from “critically endangered” to “safe”. But ELCat includes two additional levels not included in the LEI calculation in Lee & Van Way, ostensibly designed to handle languages with no speakers: None and Awakening. The speaker_number values should probably be moved into a lookup table as is done with the other three LEI factors.

speaker_number_values <- language_speakers %>%
  # filter(!is.na(speaker_number)) %>%
  group_by(speaker_number) %>%
  summarise(count = n()) %>%
  # LEI labels are matched by position to the sorted speaker_number values
  cbind("LEI" = c(rev(risk_levels), "Awakening", "Dormant", "(no value)"))

## `summarise()` ungrouping output (override with `.groups` argument)

kable(speaker_number_values,  
      caption="Distinct values of speaker_number in the speakers table")

Distinct values of speaker_number in the speakers table
speaker_number	count	LEI
1-9	1007	Critically Endangered
10-99	1615	Severely Endangered
100-999	3005	Endangered
1000-9999	3146	Threatened
10000-99999	1609	Vulnerable
100000	411	Safe
Awakening	107	Awakening
None	519	Dormant
NA	3287	(no value)

Some insight into how these two additional levels are handled in ELCat can be gained by examining the value of the endangerment_level column in the language_vitality table. This column stores the overall LEI evaluation and certainty level for a given language and source, reported as a text string in the form:

Threatened (20 percent certain, based on the evidence available)

Note that ELCat does not report the actual LEI score but rather the levels associated with the scores. For example, LEI scores in the range 61-80% are converted to the level “severely endangered” in ELCat, but the actual score within this range is not directly shown on the website.

It turns out that ELCat actually reports 9 distinct levels. In addition to the 6 levels in the LEI, ELCat also includes “At risk”, “Awakening”, and “Dormant”. The latter two levels clearly correspond to the “Awakening” and “None” levels in the speaker_number column of the language_speakers table. The “At risk” category is calculated in a different way. Languages are counted as “At risk” if they have an LEI = 0 (otherwise “Safe”) but also a level of certainty less than 100%.
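
The conversion from score to level, including the “At risk” rule, can be sketched as follows. This is not ELCat's own code: elcat_label() is a hypothetical helper, only the 61-80% band is stated in this document, and the remaining cut-points are assumed to follow the same 20-point spacing as in Lee & Van Way (2016). Scores and certainties are taken as fractions between 0 and 1.

```r
risk_levels <- c("Safe", "Vulnerable", "Threatened", "Endangered",
                 "Severely Endangered", "Critically Endangered")

# Hypothetical helper: lei and certainty are fractions in [0, 1]
elcat_label <- function(lei, certainty) {
  # round to whole percent first so 0.2 * 100 does not drift above 20
  pct <- round(lei * 100)
  band <- cut(pct,
              breaks = c(-Inf, 0, 20, 40, 60, 80, 100),  # assumed 20-point bands
              labels = risk_levels)
  label <- as.character(band)
  # "At risk": an LEI of 0 combined with less than full certainty
  ifelse(!is.na(lei) & lei == 0 & certainty < 1, "At risk", label)
}

elcat_label(lei       = c(0, 0,   0.15, 0.65, 0.9),
            certainty = c(1, 0.4, 0.2,  0.8,  1))
# "Safe" "At risk" "Vulnerable" "Severely Endangered" "Critically Endangered"
```

“Dormant” and “Awakening” are deliberately left out, since they come from the speaker_number field rather than from the score.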

elcat_risk_levels <- c("Safe","At risk", "Vulnerable","Threatened","Endangered", "Severely Endangered","Critically Endangered", "Awakening","Dormant")

elcat_vitality <- language_vitality %>%
        mutate(overall_level = str_replace(endangerment_level," \\(.*\\)",""))  %>%
         mutate(certainty = str_replace(endangerment_level,"^.*\\(","") ) %>%
         mutate(certainty = str_replace(certainty, " percent certain, based on the evidence available\\)" ,"" )) %>%
         mutate(certainty = str_replace(certainty, "\\)" ,"" )) %>%
         mutate(certainty = str_replace(certainty, "Dormant" ,"100" )) %>%  # arbitrarily assign 100% certainty to Dormant and Awakening
        mutate(certainty = ifelse(overall_level == "Awakening" | overall_level == "Dormant", 100, certainty)) %>%
         mutate(certainty = ifelse(certainty =="", 0 ,certainty )) %>%
         mutate(certainty = as.integer(certainty) ) 

# factor so that graph displays endangerment levels in order
elcat_vitality$overall_level = factor(elcat_vitality$overall_level, levels = elcat_risk_levels, ordered=T)
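
To make the string handling concrete, the sketch below splits a few endangerment_level strings into level and certainty. It uses base R only so that it runs without the tidyverse loaded; the first string is the format quoted earlier, and the other two are assumed variants (a bare level name such as “Dormant” carries no certainty text).

```r
x <- c("Threatened (20 percent certain, based on the evidence available)",
       "Safe (100 percent certain, based on the evidence available)",
       "Dormant")

# Level: everything before the parenthesised certainty note
level <- sub(" \\(.*\\)$", "", x)

# Certainty: the number before "percent certain", NA where no note is present
has_pct <- grepl("[0-9]+ percent certain", x)
pct <- rep(NA_integer_, length(x))
pct[has_pct] <- as.integer(sub("^.*\\(([0-9]+) percent certain.*$", "\\1", x[has_pct]))

level   # "Threatened" "Safe" "Dormant"
pct     # 20 100 NA
```

Leaving the certainty as NA here, rather than assigning 100, keeps the arbitrary choice for Dormant and Awakening visible as a separate step.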

elcat_vitality %>%
  filter(overall_level != "") %>%
  mutate(certainty = factor(certainty, levels = c("100","80","60","40","20","0"))) %>%
  group_by(overall_level, certainty) %>%
  summarise(count = n()) %>%
  ggplot(aes(overall_level, count, fill = certainty)) + geom_col() + theme_bw() +
  # the stray empty argument (", ,") in theme() has been removed
  theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5)) +
  ylab("Number of sources") + xlab("Endangerment level") +
  ggtitle("Distribution of sources by endangerment level and certainty \n(all sources)")

## `summarise()` regrouping output by 'overall_level' (override with `.groups` argument)

Note that the above plot is for all sources, whose number is much greater than the number of languages, since a single language can have multiple sources of vitality data. In order to look just at the so-called “preferred” sources we need to join to the speakers table, which contains a column preferred, then filter() for preferred sources only.

elcat_vitality_preferred <-      
  right_join(elcat_vitality, filter(language_speakers, preferred==1), by=c("code_id" = "code_id"))

We can then plot the same distribution for preferred sources only:

elcat_vitality_preferred %>%  
    filter(overall_level != "") %>%
   mutate(certainty = factor(certainty, levels=c("100","80","60","40","20","0"))) %>%      
         group_by(overall_level, certainty) %>%
         summarise( count = n()) %>%
     ggplot( aes(overall_level,count, fill=certainty)) + geom_col() + theme_bw() +
          theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5) )  +
          ylab("Number of languages") + xlab("Endangerment level") +
          ggtitle("Distribution of languages by endangerment level and certainty \n(preferred sources only)")

## `summarise()` regrouping output by 'overall_level' (override with `.groups` argument)

Note that in the graph above the certainty level for Dormant and Awakening languages is reported as 100%. In fact, the “endangerment_level” field in language_vitality.csv does not give a certainty level for Dormant and Awakening languages. However, as we will see below, it is possible to calculate this from the underlying data, since some Dormant and Awakening languages do have LEI factors reported. For example, see the Australian language Kaurna.

Note also that even when we restrict to preferred sources only, most of these sources report only one of the LEI factors. This is likely to be the speaker number (N). We will revisit this below.

Individual scores for the factors

Okay, let’s look at the individual factors in the LEI. Start with speaker numbers (N), which are found in the language_speakers table. As seen above, these are in a text string format, so we need to convert these values to the corresponding LEI factor values. Let’s just create a data frame which we can join to the language_speakers table. We won’t bother with “Awakening” or “None” since we will be ignoring those anyway.

elcat_lei_speakers <- data.frame(
  "speaker_number" = c("1-9","10-99","100-999","1000-9999", "10000-99999", "100000"),
  "lei_speakers" = c(5,4,3,2,1,0)
)
language_speakers <- left_join(language_speakers, elcat_lei_speakers)

## Joining, by = "speaker_number"

distinct(language_speakers,lei_speakers)

##   lei_speakers
## 1            2
## 2           NA
## 3            1
## 4            3
## 5            4
## 6            5
## 7            0
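
The same mapping can also be written as a named lookup vector, which makes it easy to see what happens to values outside the six LEI bands. This is an alternative sketch, not what the join above does internally; the input strings are examples of values seen in the speaker_number column.

```r
# Named vector: speaker_number string -> LEI speaker factor
lei_lookup <- c("1-9" = 5, "10-99" = 4, "100-999" = 3,
                "1000-9999" = 2, "10000-99999" = 1, "100000" = 0)

# "None", "Awakening" and NA have no entry, so they come back as NA
speaker_number <- c("100-999", "None", NA, "100000", "Awakening")
unname(lei_lookup[speaker_number])   # 3 NA NA 0 NA
```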

Okay, now let’s work on the other three factors. First join the lookup tables into the language_vitality table so that we can access the scores directly. Clean up the column names.

language_vitality <- language_vitality %>%
  left_join( language_transmision, by=c("transmission_id" = "id")) %>%
  left_join(language_speakernumbertrends, by=c("speaker_number_trends_id" = "id")) %>%
  left_join(language_domainsofuse, by=c("domains_of_use_id" = "id")) %>%
  mutate(lei_domains = domains_of_use_level) %>%
  mutate(lei_trends = speaker_number_trend) %>%
  mutate(lei_transmission = transmission_level) %>%
  select(id, code_id ,endangerment_level, lei_transmission,lei_trends,lei_domains, preferred)

Now let’s build a dataframe for LEI scores.

We’ll start with the main language_vitality table;
then grab the speaker numbers from language_speakers;
then grab language info from language_codes

We also still need to replace the -1 values in domains, trends, transmission.

#language_language <- read.csv("language_language.csv", stringsAsFactors = F)
#language_codes <- left_join(language_codes,language_language, by  = c("id" = "code_id"))

# build main table, starting from the three LEI factors in the language_vitality table
#elcat <- select(language_vitality, 
#    id, code_id, lei_transmission,lei_trends, lei_domains, preferred, overall) 

# get speaker numbers
elcat <- left_join(language_vitality, language_speakers, by=c("id" = "id") )

# add in language data
language_codes <- read.csv("language_codes.csv", stringsAsFactors = F)
elcat <- left_join(elcat, language_codes, by=c("code_id.x" = "id") )

# remove extraneous fields and clean up
elcat <- mutate(elcat, 
                code_id = code_id.x , 
                preferred = preferred.x)

# replace -1 values
elcat <- mutate(elcat,
    lei_transmission = ifelse(lei_transmission==-1, 0, lei_transmission),  
    lei_domains = ifelse(lei_domains==-1, 0, lei_domains), 
    lei_trends= ifelse(lei_trends==-1, 0, lei_trends))

# clean up the table
elcat <- select(elcat,
  id, code_id, code_val, primary_name, 
  lei_speakers, lei_transmission, lei_trends, lei_domains,
  endangerment_level,
  code_authorities, preferred
)

We need to exclude so-called “unapproved” records. To do this we’ll need the “approved” field from the original language_language table, as that tells us which languages are actually published on ELCat. We’ll also grab the “coordinates” field here, although this is a derived field from the language_locations table.

approved_langs <- read.csv("language_language.csv",stringsAsFactors = F)
approved_langs <- select(approved_langs, code_id, coordinates, approved)
elcat <- left_join(elcat, approved_langs, by=c("code_id" = "code_id"))
elcat <- filter(elcat, approved==1)

Calculating LEI from the four endangerment factors

Functions for calculating LEI

Before proceeding let’s set up some functions to calculate LEI. For details on how this works see Lee & Van Way 2016.

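lei_score <- function(n=NA,s=NA,d=NA,t=NA){
  # n = speakers
  # s = speaker number trends
  # d = domains of use
  # t = intergenerational transmission (counted twice)

  # sum of the available factor scores; missing factors contribute nothing
  num <- ifelse(is.na(n),0,n) + ifelse(is.na(s), 0, s) + ifelse(is.na(d),0,d) + ifelse(is.na(t),0, 2*t)
  # highest possible score given the factors that are present
  den <- ifelse(is.na(n),0,5) + ifelse(is.na(s), 0, 5) + ifelse(is.na(d),0,5) + ifelse(is.na(t),0, 10)

  # Fully vectorised so that it works on whole columns inside mutate().
  # (The scalar test with && only inspects the first element of a column.)
  # Rows where all four factors are NA have den == 0 and return NA.
  lei <- ifelse(den == 0, NA_real_, num / den)
  return(lei)
}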

lei_certainty <- function(n=NA,s=NA,d=NA,t=NA){
  certainty <- (ifelse(is.na(n),0,1) + ifelse(is.na(s), 0, 1) + ifelse(is.na(d),0,1) + ifelse(is.na(t),0, 2)) / 5
  return(certainty)
}
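
A worked example may help here. The sketch below redoes the arithmetic by hand for one invented language with N = 3, S unknown, D = 2 and T = 4; the variable names are chosen for the example, and the double weight on transmission follows the functions above.

```r
# One invented language: speaker trends (S) is unknown
n_score <- 3; s_score <- NA; d_score <- 2; t_score <- 4

scores  <- c(n_score, s_score, d_score, 2 * t_score)  # transmission counts twice
maxima  <- c(5, 5, 5, 10)                             # highest possible per factor
weights <- c(1, 1, 1, 2)                              # contribution to certainty
present <- !is.na(scores)

lei       <- sum(scores[present]) / sum(maxima[present])   # 13 / 20 = 0.65
certainty <- sum(weights[present]) / 5                     # 4 / 5  = 0.8
```

Dropping S shrinks both numerator and denominator, so the score stays on the same 0 to 1 scale while the certainty falls from 1 to 0.8.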

LEI for preferred sources

To start with let’s work only with preferred sources. These are the ones which are displayed first on each language page in ELCat, and they are the ones used to determine endangerment status overall (and ostensibly inclusion in the Catalogue).

It will be handy to just filter the elcat df and store as a new variable.

elcat_preferred <- elcat %>%
  filter(preferred == 1) %>%
  select(-preferred)

However, there are some errors in the ELCat database, in that 21 languages contain more than one preferred source (in fact, exactly two sources each). This isn’t supposed to happen, and probably causes havoc on the ELP website. We should probably figure out a way to deal with it. But for now we’re going to ignore this. Just bear in mind that our analyses below will end up treating each of these “duplicated” languages as separate languages.

We can generate a list of duplicated languages in the elcat_preferred table by using group_by() on “code_id” and then filter() to show those with more than one row. I put in a link in case you want to check out these languages on the site and see if you can determine which should be the preferred source.

elcat_preferred %>% 
  mutate(language = paste("<a href='http://endangeredlanguages.com/lang/", as.character(code_id), "'>", primary_name, "</a>", sep="")) %>%
  group_by(code_id,language) %>%
 filter(n()>1) %>% 
  summarise(n=n()) %>%
  datatable(caption = "Languages with more than one preferred source", rownames = F, options = list(pageLength = 5, autoWidth=F), escape = F)

## `summarise()` regrouping output by 'code_id' (override with `.groups` argument)
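
The duplicate check can be illustrated on a toy table. The sketch below uses base R only, so it runs without the tidyverse loaded, and the code_id and source_id values are invented; it is the same idea as the group_by() and filter(n()>1) steps above.

```r
# Invented preferred-source table: languages 102 and 104 each appear twice
pref <- data.frame(code_id   = c(101, 102, 102, 103, 104, 104),
                   source_id = 1:6)

# Count rows per language and keep the ids that occur more than once
n_per_lang <- table(pref$code_id)
dup_ids <- as.integer(names(n_per_lang)[n_per_lang > 1])

dup_ids                               # 102 104
pref[pref$code_id %in% dup_ids, ]     # the four offending rows
```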

Now let’s try calculating the LEI from scratch.

We use mutate() to add columns for “lei” and “certainty”. The lei_score() and lei_certainty() functions defined above are used to calculate the values for these new columns.

elcat_preferred <- elcat_preferred %>%
  mutate(lei = lei_score(lei_speakers,lei_trends,lei_domains,lei_transmission)) %>%
  mutate(certainty = lei_certainty(lei_speakers,lei_trends,lei_domains,lei_transmission))
datatable(select(elcat_preferred, id, primary_name,endangerment_level,lei,certainty))

Now plot it. Using factor() on the certainty levels ensures that the legend shows up ordered.

elcat_preferred %>% mutate(certainty = factor(as.character(as.integer(certainty*100)), 
  levels=c("100","80","60","40","20","0"))) %>%
  group_by(lei, certainty) %>%
  summarise( count= n()) %>%
  ggplot( aes(x = lei) ) + 
  geom_col(aes(y = count, fill = certainty), width=0.025) +
  theme_bw() +  ylab("Number of languages") + xlab("LEI") +
  theme( plot.title = element_text(hjust = 0.5) ) +
  # ggtitle() must be chained with "+", otherwise the title is never added to the plot
  ggtitle("Distribution of languages by LEI score and certainty\n(preferred sources only)")

How many LEI factors?

While the LEI is based on four factors, it may subjectively seem that a lot of data is missing from ELCat. However, it turns out that most languages have information for at least one of the four LEI factors. We can calculate the number of factors (num_factors) available and plot:

elcat_preferred %>%
  mutate(certainty = factor(as.character(as.integer(certainty*100)), levels=c("100","80","60","40","20","0"))) %>%
  mutate(num_factors = 4 -( is.na(lei_speakers) + is.na(lei_transmission) + is.na(lei_trends) + is.na(lei_domains) ) )%>%
  group_by(num_factors) %>%
    summarise( n  = n() ) %>%
  ggplot( aes(num_factors,n)) + geom_col() + theme_bw() +
       theme( plot.title = element_text(hjust=0.5)) +
           xlab("Number of LEI factors present") +
            ylab("Number of languages") +
          ggtitle("Number of languages by number of LEI factors\n(preferred sources only)")

## `summarise()` ungrouping output (override with `.groups` argument)

Another way of approaching this is to just plot the number of languages for each certainty level.

elcat_preferred %>%
  group_by(certainty) %>%
  summarise(n = n()) %>%
  ggplot( aes(certainty,n)) + geom_col() + theme_bw() + 
    theme( plot.title = element_text(hjust=0.5)) +
    xlab("Certainty") + ylab("Number of languages") +
    ggtitle("Number of languages by certainty level\n(preferred sources only)")

## `summarise()` ungrouping output (override with `.groups` argument)

So which factors are present? As we might expect, the number of speakers (N) is by far the most frequent LEI factor present.

elcat_preferred %>%
  mutate(certainty = factor(as.character(as.integer(certainty*100)), levels=c("100","80","60","40","20","0"))) %>%
  mutate(num_factors = 4 -( is.na(lei_speakers) + is.na(lei_transmission) + is.na(lei_trends) + is.na(lei_domains) ) )%>%
  mutate(what_factors = paste( if_else(is.na(lei_speakers),"","N"), 
         if_else(is.na(lei_transmission), "", "T"), 
         if_else(is.na(lei_trends),"","S"),
         if_else(is.na(lei_domains), "", "D")  , sep=""  )) %>%
  mutate(what_factors = if_else(what_factors == "","none",what_factors)) %>%
  mutate(what_factors = if_else(what_factors == "NTSD","all",what_factors)) %>%
  group_by(num_factors,what_factors) %>%
  summarise(n = n()) %>%
  ggplot( aes(num_factors,n, fill=what_factors)) + geom_col() + theme_bw() + 
    theme( plot.title = element_text(hjust=0.5)) +
    xlab("Number of LEI factors present") + ylab("Number of languages") +
    ggtitle("Number of languages by LEI factors present\n(preferred sources only)")

## `summarise()` regrouping output by 'num_factors' (override with `.groups` argument)
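
To see what num_factors and what_factors look like for individual rows, here is a small sketch on four invented languages. It uses base R (rowSums() and paste0()) rather than the mutate() calls above, but the labels follow the same N, T, S, D order.

```r
# Four invented rows with different patterns of missing factors
toy <- data.frame(lei_speakers     = c(2, NA, 5, NA),
                  lei_transmission = c(NA, NA, 4, 3),
                  lei_trends       = c(NA, NA, 3, 1),
                  lei_domains      = c(NA, NA, 2, NA))

# Number of factors present in each row
num_factors <- unname(rowSums(!is.na(toy)))

# One letter per factor present, in the order N, T, S, D
what_factors <- paste0(ifelse(is.na(toy$lei_speakers), "", "N"),
                       ifelse(is.na(toy$lei_transmission), "", "T"),
                       ifelse(is.na(toy$lei_trends), "", "S"),
                       ifelse(is.na(toy$lei_domains), "", "D"))

num_factors    # 1 0 4 2
what_factors   # "N" "" "NTSD" "TS"
```

The empty string and "NTSD" are the two values that the code above relabels as "none" and "all".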

If we look just at whether the speaker number (N) LEI factor is present, the dominance of the speaker number factor among the four possible LEI factors is even more striking.

elcat_preferred %>%
#  mutate(certainty = factor(as.character(as.integer(certainty*100)), levels=c("100","80","60","40","20","0"))) %>%
  mutate(num_factors = 4 -( is.na(lei_speakers) + is.na(lei_transmission) + is.na(lei_trends) + is.na(lei_domains) ) )%>%
  mutate(contains_N =  if_else(is.na(lei_speakers) , "no N", "has N")) %>%
#  filter(num_factors == 1) %>%
  group_by(num_factors,contains_N) %>%
  summarise(n = n()) %>%
  ggplot( aes(num_factors,n, fill=contains_N)) + geom_col() + theme_bw() + 
    theme( plot.title = element_text(hjust=0.5)) +
    xlab("Number of LEI factors present") + ylab("Number of languages") +
    ggtitle("Number of languages by LEI factors present\n(preferred sources only)")

## `summarise()` regrouping output by 'num_factors' (override with `.groups` argument)

Of the data plotted above the interesting cases are those with 1, 2 or 3 LEI factors present but with no value for speaker number. There are 159 languages like this in the database:

elcat_preferred %>%
  filter(is.na(lei_speakers)) %>%
  filter( certainty>0 & certainty <1) %>%
  select(code_id,primary_name,lei_transmission,lei_trends,lei_domains,lei,certainty) %>%
  datatable(caption = "Languages lacking the speaker number (N) factor")

It is difficult to understand how we can know about Transmission (T), Speaker number trends (S), and Domains of use (D) without knowing anything about speaker numbers. This is particularly true when all three other factors are known. Some of these are likely to be mistakes, or due to the difficulty of assessing small numbers of speakers. For example, the ELCat entry for Zazao includes the following text:

Current situation unknown. Appears to have formerly been the language of Kilokaka village. However, Kilokaka is now a Blablanga speaking community due to encroachment of that neighboring language. A wordlist collected at the beginning of the 20th century (Napu 1953) seems to be a different language to Blablanga, and a list collected in the 1960s or 1970s (Tryon & Hackman) also appears different to north coast Blablanga, but less so. Ethnologue gives a figure of 10 speakers in 1999 which is plausible. Kilokaka is now Blablanga speaking, but it is possible a very small number of speakers may remain.

In this case ELCat seems to be hesitating to assign a speaker number value of “1-9” (i.e., level 5, Critically endangered), because there may not actually be any speakers remaining. Here the LEI metric breaks down, because LEI explicitly does not assess values of endangerment for languages without speakers. ELCat circumvents this problem in some cases by applying the label “Dormant”, but this value cannot by definition be calculated by the LEI.

There are 159 languages like this in ELCat, as shown in the table below. Just for fun I put in links to the ELCat pages so that you can check out the details.

elcat_preferred %>%
  filter(is.na(lei_speakers)) %>%
  filter(!is.na(lei_transmission) & !is.na(lei_trends) & !is.na(lei_domains)) %>%
  filter( certainty>0 & certainty <1) %>%
  mutate(language = paste("<a href='http://endangeredlanguages.com/lang/", as.character(code_id), "'>", primary_name, "</a>", sep="")) %>%
  select(code_id,language,lei_transmission,lei_trends,lei_domains,lei,certainty) %>%
  datatable(caption = "Languages lacking the speaker number (N) factor but having all other factors", escape = F)

Variation in endangerment levels

How much variation in endangerment levels is there across different sources for a given language? Let’s go back to the original elcat df (not just preferred sources) and calculate lei and certainty.

elcat <- elcat %>%
  mutate(lei = lei_score(lei_speakers,lei_trends,lei_domains,lei_transmission)) %>%
  mutate(certainty = lei_certainty(lei_speakers,lei_trends,lei_domains,lei_transmission))
# show the full elcat table (all sources), which is the one just updated
datatable(select(elcat, id, primary_name,endangerment_level,lei,certainty))

Now plot it.

elcat %>% mutate(certainty = factor(as.character(as.integer(certainty*100)), 
  levels=c("100","80","60","40","20","0"))) %>%
  group_by(lei, certainty) %>%
  summarise( count= n()) %>%
  ggplot( aes(x = lei) ) + 
  geom_col(aes(y = count, fill = certainty), width=0.025) +
  theme_bw() +  ylab("Number of sources") + xlab("LEI") + theme(plot.title = element_text(hjust=0.5)) +
  # ggtitle() must be chained with "+", otherwise the title is never added to the plot
  ggtitle("Distribution of sources by LEI score\n(all sources, not just preferred), with certainty indicated.")

elcat %>% mutate(num_factors = 4 - (is.na(lei_speakers) +
                   is.na(lei_transmission) + 
                    is.na(lei_trends) +
                   is.na(lei_domains) ) ) %>%
    group_by( code_id ) %>%
    summarise( mean_factors = mean(num_factors)) %>%
    ggplot(aes(x=mean_factors)) + geom_histogram(binwidth = 0.25) +
             theme_bw() + xlab("Mean number of LEI factors")  + ylab("Number of languages") +
              ggtitle("Mean number of LEI factors present across all sources")

## `summarise()` ungrouping output (override with `.groups` argument)

Comparing the sizes of the elcat_preferred (3424) and elcat (14340) tables, we can see that there are in general multiple sources for each language. This distribution is given in the following histogram.

elcat %>% 
    group_by(primary_name,code_id ) %>%
    summarise( num_sources= n() ) %>%
    ggplot(aes(num_sources)) + geom_histogram(binwidth = 1) +
    theme_bw() + xlab("Number of sources")  + ylab("Number of languages") + 
    ggtitle("Histogram of number of sources for a given language")

## `summarise()` regrouping output by 'primary_name' (override with `.groups` argument)

Calculate the variation across sources for each language. We can do this by first using group_by() on the code_id and then using summarise(). However, we then need to filter() to look only at languages with more than one source. (There are 3151 of these.)

elcat %>%  
  filter(!is.na(lei)) %>%
  group_by( code_id, primary_name ) %>%
   summarise( n=n(), sd=sd(lei) ) %>% 
  filter(n>1)  %>%
    ggplot(aes(x=sd)) + geom_histogram(binwidth = 0.01) +
      theme_bw() + xlab("Standard deviation of LEI score across sources")  + ylab("Number of languages") + 
    ggtitle("Variation in LEI score across sources for a given language\n(only including languages with more than one source)")

## `summarise()` regrouping output by 'code_id' (override with `.groups` argument)
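
The reason for dropping single-source languages is that the standard deviation of one value is undefined. A toy sketch with invented scores, using base R tapply() in place of group_by() and summarise():

```r
# Invented LEI scores: language 1 has three sources, 2 has one, 3 has two
src <- data.frame(code_id = c(1, 1, 1, 2, 3, 3),
                  lei     = c(0.2, 0.4, 0.6, 0.8, 0.5, 0.5))

n_src  <- tapply(src$lei, src$code_id, length)   # sources per language
sd_src <- tapply(src$lei, src$code_id, sd)       # sd() of a single value is NA

# Keep only languages with more than one source
sd_src[n_src > 1]   # language 1: 0.2, language 3: 0
```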

Mapping

We’ve already added the coordinates field (derived from the language_locations table) into elcat_preferred, but we need to extract the lat/long values. This field allows for multiple point coordinates, separated by semi-colons. We’ll just use the first one.

elcat_preferred_test <- elcat_preferred %>% mutate(first_coord = str_split(coordinates, ";", simplify=T)[,1] ) %>%
      mutate(lat = as.numeric(str_split(first_coord,",", simplify = TRUE)[,1]) ,  long = as.numeric(str_split(first_coord,",", simplify = TRUE)[,2]) ) %>%
      mutate( endangerment_level_only = str_match(endangerment_level, "(.*?) \\(" )[,2] )

## Warning: Problem with `mutate()` input `lat`.
## ℹ NAs introduced by coercion
## ℹ Input `lat` is `as.numeric(str_split(first_coord, ",", simplify = TRUE)[, 1])`.

## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion

## Warning: Problem with `mutate()` input `long`.
## ℹ NAs introduced by coercion
## ℹ Input `long` is `as.numeric(str_split(first_coord, ",", simplify = TRUE)[, 2])`.

## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion
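
The extraction step can be tried on a few invented coordinate strings. The sketch below uses base R only and assumes, as the code above does, that each point is written as latitude,longitude; an empty string stands in for a language with no usable coordinates.

```r
# Invented examples: two points, one point, and no coordinates at all
coords <- c("-9.05,160.15;-9.10,160.20", "64.8,-147.7", "")

# Keep the first point only, then split it into its two parts
first_coord <- sub(";.*$", "", coords)
parts <- strsplit(first_coord, ",", fixed = TRUE)

# p[1] / p[2] are NA when a part is missing, so as.numeric() yields NA
lat  <- as.numeric(vapply(parts, function(p) p[1], character(1)))
long <- as.numeric(vapply(parts, function(p) p[2], character(1)))

lat    # -9.05 64.80 NA
long   # 160.15 -147.70 NA
```

Rows that end up with NA here correspond to the ones leaflet later reports as having missing or invalid lat/lon values.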

elcat_preferred_test <- filter(elcat_preferred_test, !is.na(endangerment_level_only))

factpal <- colorFactor(brewer.pal(9, "Reds"), elcat_risk_levels, ordered = FALSE)

leaflet(elcat_preferred_test) %>%
  addTiles() %>%
  addCircles(lng= ~long, lat= ~lat, color= ~factpal(endangerment_level_only))  %>%
    addLegend(values = ~endangerment_level_only,  pal = factpal, title="Endangerment Level", labels=elcat_risk_levels, opacity=1)

## Warning in validateCoords(lng, lat, funcName): Data contains 165 rows with
## either missing or invalid lat/lon values and will be ignored

To-do

Need to have more robust error checking.

What if more than one source is listed as preferred for a given language?
It seems like one of the fundamental problems with the LEI is that it treats the variables as continuous, when they are actually ordinal.