June 16, 2015

Question 1

  1. Change the name of the "Location.1" column to "location"

Question 1

Column names are just an attribute of the data frame (recall str()) that you can change to any valid name.

Valid names are case-sensitive and may contain letters (a-z, A-Z), digits (0-9), underscores, and periods, but cannot start with a digit or an underscore (nor with a period followed by a digit).

For the data.frame class, colnames() and names() return the same attribute.

> names(mon)
[1] "name"            "zipCode"         "neighborhood"    "councilDistrict"
[5] "policeDistrict"  "Location.1"     
> names(mon)[6] = "location"
> names(mon)
[1] "name"            "zipCode"         "neighborhood"    "councilDistrict"
[5] "policeDistrict"  "location"       

These naming rules also apply when creating R objects.
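Renaming by position (names(mon)[6]) breaks if the column order ever changes. A minimal sketch of renaming by matching the old name instead, using a made-up data frame `toy` (hypothetical, not the monuments data); make.names() is included only to show how R would repair names that break the rules above:

```r
# Toy data frame standing in for `mon` (hypothetical values)
toy = data.frame(name = c("A", "B"), Location.1 = c("x", "y"),
                 stringsAsFactors = FALSE)

# Rename by matching the old name rather than relying on position
names(toy)[names(toy) == "Location.1"] = "location"
names(toy)

# make.names() shows how R would repair an invalid name
fixed = make.names(c("location", "1stColumn", "zip code"))
fixed
```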

Question 2

  1. How many monuments are in Baltimore (at least this collection…)?

Question 2

There are several ways to return the number of rows of a data frame or matrix:

> nrow(mon)
[1] 84
> dim(mon)
[1] 84  6
> length(mon$name)
[1] 84
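One pitfall worth a quick illustration: length() applied to the whole data frame counts columns, not rows. A small sketch with a made-up data frame `toy` (hypothetical values):

```r
# Toy data frame: 3 rows, 2 columns (hypothetical values)
toy = data.frame(name = c("A", "B", "C"),
                 zipCode = c(21201, 21202, 21201),
                 stringsAsFactors = FALSE)

nRows = nrow(toy)          # number of rows
nCols = length(toy)        # number of COLUMNS, not rows
nName = length(toy$name)   # number of rows, counted via one column
c(nRows, nCols, nName)
```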

Question 3

What are the:

  1. zip codes

  2. neighborhoods

  3. council districts, and

  4. police districts

that contain monuments, and how many monuments are in each?

Question 3

unique() returns the unique entries in a vector

> unique(mon$zipCode)
 [1] 21201 21202 21211 21213 21217 21218 21224 21230 21231 21214 21223
[12] 21225 21251
> unique(mon$policeDistrict)
[1] "CENTRAL"      "NORTHERN"     "NORTHEASTERN" "WESTERN"     
[5] "SOUTHEASTERN" "SOUTHERN"     "EASTERN"     
> unique(mon$councilDistrict)
 [1] 11  7 14 13  1 10  3  2  9 12

> unique(mon$neighborhood)
 [1] "Downtown"                        "Remington"                      
 [3] "Clifton Park"                    "Johns Hopkins Homewood"         
 [5] "Mid-Town Belvedere"              "Madison Park"                   
 [7] "Upton"                           "Reservoir Hill"                 
 [9] "Harlem Park"                     "Coldstream Homestead Montebello"
[11] "Guilford"                        "McElderry Park"                 
[13] "Patterson Park"                  "Canton"                         
[15] "Middle Branch/Reedbird Parks"    "Locust Point Industrial Area"   
[17] "Federal Hill"                    "Washington Hill"                
[19] "Inner Harbor"                    "Herring Run Park"               
[21] "Ednor Gardens-Lakeside"          "Fells Point"                    
[23] "Hopkins Bayview"                 "New Southwest/Mount Clare"      
[25] "Brooklyn"                        "Stadium Area"                   
[27] "Mount Vernon"                    "Druid Hill Park"                
[29] "Morgan State University"         "Dunbar-Broadway"                
[31] "Carrollton Ridge"                "Union Square"                   

> length(unique(mon$zipCode))
[1] 13
> length(unique(mon$policeDistrict))
[1] 7
> length(unique(mon$councilDistrict))
[1] 10
> length(unique(mon$neighborhood))
[1] 32

table() also works here: it tabulates a single variable (or cross-tabulates two variables), which gives the number of monuments in each group.

> table(mon$zipCode)
21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
   11    16     8     4     1     9    14     4     8     1     3     4 
21251 
    1 
> length(table(mon$zipCode))
[1] 13
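To read off "how many in each" more easily, the table can be sorted or converted to a data frame. A brief sketch on a made-up vector `zip` (hypothetical values, not the monuments data):

```r
# Toy zip codes (hypothetical values)
zip = c(21201, 21202, 21202, 21211, 21202, 21201)

tabZip = table(zip)

# Largest counts first
sorted = sort(tabZip, decreasing = TRUE)
sorted

# One row per zip code, with its count in the Freq column
counts = as.data.frame(tabZip, stringsAsFactors = FALSE)
counts
```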

Question 4

The "by hand" way is to cross-tabulate the zip codes and neighborhoods:

> tab = table(mon$zipCode, mon$neighborhood)
> # tab
> tab[,"Downtown"]
21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
    2     9     0     0     0     0     0     0     0     0     0     0 
21251 
    0 
> length(unique(tab[,"Downtown"])) # distinct COUNTS (2, 9, 0), not the number of zip codes
[1] 3

> tt = tab[,"Downtown"]
> tt
21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
    2     9     0     0     0     0     0     0     0     0     0     0 
21251 
    0 
> tt == 0 # which entries are equal to 0
21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
21251 
 TRUE 

> tab[,"Downtown"] !=0
21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
 TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
21251 
FALSE 
> sum(tab[,"Downtown"] !=0)
[1] 2
> sum(tab[,"Johns Hopkins Homewood"] !=0)
[1] 2

We could also subset the data into neighborhoods:

> dt = mon[mon$neighborhood == "Downtown",]
> head(mon$neighborhood == "Downtown",10)
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
> dim(dt)
[1] 11  6
> length(unique(dt$zipCode))
[1] 2
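The same count can be obtained for every neighborhood at once by summing the non-zero cells in each column of the cross-tabulation. A sketch on a made-up data frame `toy` (hypothetical values), assuming the same zipCode and neighborhood column names:

```r
# Toy data (hypothetical values)
toy = data.frame(
  zipCode = c(21201, 21202, 21202, 21218, 21211, 21218),
  neighborhood = c("Downtown", "Downtown", "Downtown",
                   "Johns Hopkins Homewood", "Johns Hopkins Homewood",
                   "Johns Hopkins Homewood"),
  stringsAsFactors = FALSE)

# Rows are zip codes, columns are neighborhoods
tab = table(toy$zipCode, toy$neighborhood)

# Number of zip codes with at least one entry, per neighborhood
nZip = colSums(tab != 0)
nZip
```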

Question 5

How many monuments (a) do and (b) do not have an exact location/address?

Question 5

> head(mon$location)
[1] "408 CHARLES ST\nBaltimore, MD\n"  ""                                
[3] ""                                 "100 HOLLIDAY ST\nBaltimore, MD\n"
[5] "50 MARKET PL\nBaltimore, MD\n"    "100 CALVERT ST\nBaltimore, MD\n" 
> table(mon$location != "") # FALSE=DO NOT and TRUE=DO
FALSE  TRUE 
   26    58 
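Because TRUE counts as 1 and FALSE as 0, sum() and mean() on the logical vector give the same answer as table(). A sketch on a made-up `location` vector (hypothetical values), assuming missing addresses are stored as empty strings as in the output above:

```r
# Toy locations: two with an address, two without (hypothetical values)
location = c("408 CHARLES ST\nBaltimore, MD\n", "", "",
             "100 HOLLIDAY ST\nBaltimore, MD\n")

hasLoc = location != ""

nWith    = sum(hasLoc)    # number WITH an address
nWithout = sum(!hasLoc)   # number WITHOUT an address
propWith = mean(hasLoc)   # proportion with an address
c(nWith, nWithout, propWith)
```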

Question 6

Which:

  1. zip code,
  2. neighborhood,
  3. council district, and
  4. police district

contain the most monuments?

Question 6

> tabZ = table(mon$zipCode)
> head(tabZ)
21201 21202 21211 21213 21214 21217 
   11    16     8     4     1     9 
> max(tabZ)
[1] 16
> tabZ[tabZ == max(tabZ)]
21202 
   16 

which.max() returns the index of the FIRST element that contains the maximum, and which.min() returns the index of the FIRST element that contains the minimum.

> which.max(tabZ) # this is the element number
21202 
    2 
> tabZ[which.max(tabZ)] # this is the actual maximum
21202 
   16 

> tabN = table(mon$neighborhood)
> tabN[which.max(tabN)] 
Johns Hopkins Homewood 
                    17 
> tabC = table(mon$councilDistrict)
> tabC[which.max(tabC)] 
11 
29 
> tabP = table(mon$policeDistrict)
> tabP[which.max(tabP)] 
CENTRAL 
     27 
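Since which.max() reports only the first maximum, ties are silently dropped. A sketch on a made-up `district` vector (hypothetical values) showing how comparing against max() keeps all tied groups:

```r
# Toy districts with a tie between CENTRAL and NORTHERN (hypothetical values)
district = c("CENTRAL", "NORTHERN", "CENTRAL", "NORTHERN", "EASTERN")
tabD = table(district)

# Only the first maximum is reported
firstMax = names(which.max(tabD))
firstMax

# Comparing against max() keeps every tied group
allMax = names(tabD)[tabD == max(tabD)]
allMax
```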

Question 7

> monTab = read.delim("http://www.aejaffe.com/summerR_2015/data/Monuments-tab.txt",
+                     header=TRUE, as.is=TRUE)
> identical(mon$name,monTab$name)
[1] TRUE
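The same kind of comparison can be tried without a network connection by reading small inline strings. A sketch assuming made-up comma- and tab-delimited text (hypothetical values, not the course files); all.equal() is shown as a more tolerant alternative to identical() when comparing whole data frames:

```r
# The same two records in comma- and tab-delimited form (hypothetical values)
csvText = "name,zipCode\nWashington Monument,21201\nPoe Memorial,21202\n"
tabText = "name\tzipCode\nWashington Monument\t21201\nPoe Memorial\t21202\n"

fromCsv = read.csv(text = csvText, header = TRUE, as.is = TRUE)
fromTab = read.delim(text = tabText, header = TRUE, as.is = TRUE)

# Compare one column exactly
sameNames = identical(fromCsv$name, fromTab$name)
sameNames

# Compare the whole data frames, allowing small numeric differences
sameData = isTRUE(all.equal(fromCsv, fromTab))
sameData
```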