Lab 4 Review

June 16, 2015

Question 1

Change the name of the "Location.1" column to "location"

Names are just an attribute of the data frame (recall str()) that you can change to any valid name.

Valid (syntactic) names are case-sensitive and may contain letters, digits, underscores, and periods. They cannot start with a digit or an underscore, nor with a period followed by a digit.

For the data.frame class, colnames() and names() return the same attribute.

> names(mon)

[1] "name"            "zipCode"         "neighborhood"    "councilDistrict"
[5] "policeDistrict"  "Location.1"

> names(mon)[6] = "location"
> names(mon)

[1] "name"            "zipCode"         "neighborhood"    "councilDistrict"
[5] "policeDistrict"  "location"

These naming rules also apply when naming R objects.
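Renaming by position, as in names(mon)[6], depends on the column order staying the same. A minimal sketch of renaming by matching the old name instead, using a small made-up data frame (toy is hypothetical, not the lab data). make.names() is base R and shows how R repairs names that break the rules:

```r
# Hypothetical stand-in for the lab data; only the column names matter here
toy <- data.frame(name = c("A", "B"),
                  Location.1 = c("408 CHARLES ST", ""),
                  stringsAsFactors = FALSE)

# Rename by matching the old name rather than by position
names(toy)[names(toy) == "Location.1"] <- "location"
names(toy)

# For a data frame, colnames() and names() agree
identical(colnames(toy), names(toy))

# make.names() converts arbitrary strings into syntactically valid names
fixed <- make.names(c("1abc", "my var", "ok_name"))
fixed
```

If no column matches the old name, the assignment changes nothing, so it is worth checking names() afterwards.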

Question 2

How many monuments are in Baltimore (at least in this collection…)?

There are several ways to get the number of rows of a data frame or matrix:

> nrow(mon)

[1] 84

> dim(mon)

[1] 84  6

> length(mon$name)

[1] 84
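One pitfall worth seeing next to these: length() applied to the whole data frame counts columns, not rows. A small sketch on a made-up data frame (toy and its values are hypothetical):

```r
# Hypothetical three-row data frame
toy <- data.frame(name = c("A", "B", "C"),
                  zipCode = c(21201, 21202, 21202))

n_rows <- nrow(toy)         # number of rows
n_cols <- ncol(toy)         # number of columns
n_list <- length(toy)       # a data frame is a list of columns, so this counts columns
n_name <- length(toy$name)  # length of a single column equals the number of rows
c(n_rows, n_cols, n_list, n_name)
```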

Question 3

What are the:

zip codes
neighborhoods
council districts, and
police districts

that contain monuments, and how many monuments are in each?

unique() returns the unique entries in a vector

> unique(mon$zipCode)

 [1] 21201 21202 21211 21213 21217 21218 21224 21230 21231 21214 21223
[12] 21225 21251

> unique(mon$policeDistrict)

[1] "CENTRAL"      "NORTHERN"     "NORTHEASTERN" "WESTERN"     
[5] "SOUTHEASTERN" "SOUTHERN"     "EASTERN"

> unique(mon$councilDistrict)

 [1] 11  7 14 13  1 10  3  2  9 12

> unique(mon$neighborhood)

 [1] "Downtown"                        "Remington"                      
 [3] "Clifton Park"                    "Johns Hopkins Homewood"         
 [5] "Mid-Town Belvedere"              "Madison Park"                   
 [7] "Upton"                           "Reservoir Hill"                 
 [9] "Harlem Park"                     "Coldstream Homestead Montebello"
[11] "Guilford"                        "McElderry Park"                 
[13] "Patterson Park"                  "Canton"                         
[15] "Middle Branch/Reedbird Parks"    "Locust Point Industrial Area"   
[17] "Federal Hill"                    "Washington Hill"                
[19] "Inner Harbor"                    "Herring Run Park"               
[21] "Ednor Gardens-Lakeside"          "Fells Point"                    
[23] "Hopkins Bayview"                 "New Southwest/Mount Clare"      
[25] "Brooklyn"                        "Stadium Area"                   
[27] "Mount Vernon"                    "Druid Hill Park"                
[29] "Morgan State University"         "Dunbar-Broadway"                
[31] "Carrollton Ridge"                "Union Square"

> length(unique(mon$zipCode))

[1] 13

> length(unique(mon$policeDistrict))

[1] 7

> length(unique(mon$councilDistrict))

[1] 10

> length(unique(mon$neighborhood))

[1] 32

table() also works here: it tabulates a single variable (or cross-tabulates two variables), so it gives the number of monuments in each category as well as the categories themselves.

> table(mon$zipCode)

21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
   11    16     8     4     1     9    14     4     8     1     3     4 
21251 
    1

> length(table(mon$zipCode))

[1] 13
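The two counts agree here, but they can differ when a column has missing values: unique() keeps NA while table() drops it by default. A minimal sketch on a made-up vector (zip is hypothetical), which also sorts the table so the most common value comes first:

```r
# Hypothetical zip codes, including one missing value
zip <- c(21201, 21202, 21202, 21218, NA, 21201, 21202)

counts <- table(zip)                      # NA is dropped by default
counts_na <- table(zip, useNA = "ifany")  # keep NA as its own category

n_unique <- length(unique(zip))   # unique() keeps NA as a value
n_table <- length(counts)         # categories with NA excluded
n_table_na <- length(counts_na)   # categories with NA included

# Sort so the most common value comes first
sorted <- sort(counts, decreasing = TRUE)
sorted
```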

Question 4

One "by hand" way is to cross-tabulate the zip codes and neighborhoods, then count the nonzero entries in a neighborhood's column:

> tab = table(mon$zipCode, mon$neighborhood)
> # tab
> tab[,"Downtown"]

21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
    2     9     0     0     0     0     0     0     0     0     0     0 
21251 
    0

> length(unique(tab[,"Downtown"])) # distinct counts (2, 9, 0), not the number of zip codes

[1] 3

> tt = tab[,"Downtown"]
> tt

21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
    2     9     0     0     0     0     0     0     0     0     0     0 
21251 
    0

> tt == 0 # which entries are equal to 0

21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE 
21251 
 TRUE

> tab[,"Downtown"] !=0

21201 21202 21211 21213 21214 21217 21218 21223 21224 21225 21230 21231 
 TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
21251 
FALSE

> sum(tab[,"Downtown"] !=0)

[1] 2

> sum(tab[,"Johns Hopkins Homewood"] !=0)

[1] 2

We could also subset the data into neighborhoods:

> dt = mon[mon$neighborhood == "Downtown",]
> head(mon$neighborhood == "Downtown",10)

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

> dim(dt)

[1] 11  6

> length(unique(dt$zipCode))

[1] 2
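Both approaches handle one neighborhood at a time. As a sketch of doing every neighborhood at once, colSums() can count the nonzero cells per column of the cross-tabulation, and tapply() gives the same counts without building the table. The data frame toy is made up for illustration:

```r
# Hypothetical data: six monuments in three neighborhoods
toy <- data.frame(
  zipCode = c(21201, 21202, 21202, 21211, 21211, 21218),
  neighborhood = c("Downtown", "Downtown", "Downtown",
                   "Remington", "Remington", "Guilford"),
  stringsAsFactors = FALSE
)

# Rows are zip codes, columns are neighborhoods
tab <- table(toy$zipCode, toy$neighborhood)

# Count the nonzero cells in every column at once
zips_per_nb <- colSums(tab != 0)
zips_per_nb

# Same counts without the cross-tabulation
zips_per_nb2 <- tapply(toy$zipCode, toy$neighborhood,
                       function(z) length(unique(z)))
zips_per_nb2
```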

Question 5

How many monuments (a) do and (b) do not have an exact location/address?

> head(mon$location)

[1] "408 CHARLES ST\nBaltimore, MD\n"  ""                                
[3] ""                                 "100 HOLLIDAY ST\nBaltimore, MD\n"
[5] "50 MARKET PL\nBaltimore, MD\n"    "100 CALVERT ST\nBaltimore, MD\n"

> table(mon$location != "") # FALSE = no address, TRUE = has an address

FALSE  TRUE 
   26    58
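Because TRUE counts as 1, sum() on the logical vector gives the same counts directly. The sketch below uses a made-up vector (loc is hypothetical) and additionally treats whitespace-only strings as missing, which is an assumption that goes beyond the lab's comparison with "":

```r
# Hypothetical location strings: two real addresses, two empty, one blank
loc <- c("408 CHARLES ST\nBaltimore, MD\n", "", "",
         "50 MARKET PL\nBaltimore, MD\n", "  ")

# nzchar() is TRUE for non-empty strings; trimws() strips surrounding whitespace
has_addr <- nzchar(trimws(loc))

n_with <- sum(has_addr)      # TRUE counts as 1
n_without <- sum(!has_addr)
c(with = n_with, without = n_without)
```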

Question 6

Which:

zip code,
neighborhood,
council district, and
police district

contain the most monuments?

> tabZ = table(mon$zipCode)
> head(tabZ)

21201 21202 21211 21213 21214 21217 
   11    16     8     4     1     9

> max(tabZ)

[1] 16

> tabZ[tabZ == max(tabZ)]

21202 
   16

which.max() returns the position of the FIRST element equal to the maximum, and which.min() returns the position of the FIRST element equal to the minimum. If several entries tie, only the first is reported.

> which.max(tabZ) # this is the element number

21202 
    2

> tabZ[which.max(tabZ)] # this is the actual maximum

21202 
   16

> tabN = table(mon$neighborhood)
> tabN[which.max(tabN)]

Johns Hopkins Homewood 
                    17

> tabC = table(mon$councilDistrict)
> tabC[which.max(tabC)]

11 
29

> tabP = table(mon$policeDistrict)
> tabP[which.max(tabP)]

CENTRAL 
     27
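When two categories tie for the maximum, which.max() reports only the first, while the comparison against max() keeps them all. A minimal sketch on made-up data (toy is hypothetical), which also runs the same lookup over several columns with sapply():

```r
# Hypothetical data in which two zip codes tie for the most monuments
toy <- data.frame(
  zipCode = c(21201, 21202, 21202, 21201, 21218),
  policeDistrict = c("CENTRAL", "CENTRAL", "NORTHERN", "CENTRAL", "NORTHERN"),
  stringsAsFactors = FALSE
)

tabZ <- table(toy$zipCode)   # 21201 and 21202 appear twice each

first_max <- names(which.max(tabZ))                   # first maximum only
all_max <- names(tabZ)[as.vector(tabZ) == max(tabZ)]  # every tied maximum

# Apply the same lookup to each column in one pass
top <- sapply(toy, function(col) names(which.max(table(col))))
top
```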

Question 7

Try reading in the tab-delimited Monuments-tab.txt file from: http://www.aejaffe.com/summerR_2015/data/Monuments-tab.txt

> monTab = read.delim("http://www.aejaffe.com/summerR_2015/data/Monuments-tab.txt",
+                     header=TRUE, as.is=TRUE)
> identical(mon$name,monTab$name)

[1] TRUE
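The same round trip can be tried without a network connection. This sketch writes a made-up data frame (toy is hypothetical) to a temporary tab-delimited file and reads it back. It also hints at why the check above compares a single character column: numeric columns can come back with a different storage type.

```r
# Hypothetical two-row data frame
toy <- data.frame(name = c("Monument A", "Monument B"),
                  zipCode = c(21201, 21202),
                  stringsAsFactors = FALSE)

# Write a tab-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# read.delim() is read.table() with tab as the default separator
back <- read.delim(tmp, header = TRUE, as.is = TRUE)

same_names <- identical(toy$name, back$name)
# Whole numbers may be read back as integer, so all.equal() is the more forgiving check
same_zip <- isTRUE(all.equal(toy$zipCode, back$zipCode))

unlink(tmp)  # clean up the temporary file
```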