Provide an R script that includes code and explanatory # comments for the following steps:
Load the full 2018-2020 workspace.
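A minimal setup sketch; the .RData file name below is hypothetical, so substitute whatever workspace file was distributed for the course:

    # Packages used throughout this script
    library(quanteda)
    library(quanteda.textstats)
    library(stringr)

    # Load the full 2018-2020 workspace (hypothetical file name)
    load("cr_2018-2020.RData")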
1) Choose a set of key words or phrases that are useful for your team project and use GloVe word embeddings to find additional synonyms within the corpus. List any new words as a # comment.
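One way to approach this is to fit GloVe vectors on the corpus with text2vec and rank the vocabulary by cosine similarity to each key word. A sketch, assuming the workspace supplies the corpus as cr_corpus and using "wage" as an illustrative key word (both are assumptions):

    library(text2vec)

    # Tokenize and keep reasonably frequent features
    toks  <- tokens(cr_corpus, remove_punct = TRUE) |> tokens_tolower()
    feats <- dfm(toks) |> dfm_trim(min_termfreq = 5) |> featnames()
    toks  <- tokens_select(toks, feats, padding = TRUE)

    # Feature co-occurrence matrix over a 5-token window
    fcmat <- fcm(toks, context = "window", window = 5)

    # Fit 50-dimensional GloVe vectors on the co-occurrence matrix
    glove   <- GlobalVectors$new(rank = 50, x_max = 10)
    wv_main <- glove$fit_transform(fcmat, n_iter = 10)
    wv      <- wv_main + t(glove$components)

    # Ten nearest neighbors of the key word, by cosine similarity
    sims <- sim2(wv, wv["wage", , drop = FALSE], method = "cosine")
    head(sort(sims[, 1], decreasing = TRUE), 10)
    # New synonyms found this way would be listed here as a # comment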
2) Generate a frequency table showing the appearances per document of your key words/phrases using the dfm_select() or dfm_lookup() function.
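A sketch of the dfm_select() route; the key-word vector is illustrative:

    # Document-feature matrix restricted to the key words
    dfm_cr   <- dfm(tokens(cr_corpus, remove_punct = TRUE))
    dfm_keys <- dfm_select(dfm_cr, pattern = c("wage", "wages", "salary"))

    # One row per document, one column per key word
    convert(dfm_keys, to = "data.frame")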
3) Use the kwic function to extract a text window around one of your key words or phrases and combine the pre- and post-windows.
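A sketch, again with "wage" as the illustrative key word:

    # 25-token window around each occurrence of the key word
    kw <- kwic(tokens(cr_corpus), pattern = "wage", window = 25)

    # Combine the pre- and post-windows into one text per match
    kw_text <- paste(kw$pre, kw$keyword, kw$post)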
4) Analyze the text windows using one of the following: readability, lexical diversity, or one of the sentiment analysis approaches.
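For example, readability of the combined windows via quanteda.textstats (the Flesch measure is an illustrative choice):

    # Treat each combined window as a document and score its readability
    win_corp <- corpus(kw_text)
    textstat_readability(win_corp, measure = "Flesch")

    # Lexical diversity would work the same way:
    # textstat_lexdiv(tokens(win_corp), measure = "TTR")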
5) Write a few sentences at the end about what this analysis shows you (include this as # comments at the end of your R script).
6) Use stringr functions and regex syntax to view all instances of “wage,” “wages,” “Wage,” and “Wages” in the second document.
Which member of Congress utters the word “wage” (or a variant thereof) in this document?
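A sketch with stringr, assuming cr_txt is a data frame whose text column holds one document per row (the column name is an assumption):

    # All four case/number variants of "wage" in the second document
    str_view_all(cr_txt$text[2], "\\b[Ww]ages?\\b")

    # To spot the speaker, pull up to 100 characters of preceding context
    str_extract_all(cr_txt$text[2], ".{0,100}\\b[Ww]ages?\\b")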
7) Extract all instances of “wage” (and its variants) that occur within 50 characters of the word “living” in the first 50 documents. Save these matches to an object named “living_wage”.
How many matches did you find?
Which of the first 10 documents has the highest number of matches?
What were the phrases captured by the regex?
Now run the code again, expanding the window to 100, 200, and 500 characters. Does the regex find any additional phrases? If so, what are they?
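A sketch of the extraction; the alternation lets “living” fall on either side of the match, and widening the window only means changing the {0,50} quantifiers:

    # "wage" variants within 50 characters of "living", first 50 documents
    living_wage <- str_extract_all(
      cr_txt$text[1:50],
      "living.{0,50}\\b[Ww]ages?\\b|\\b[Ww]ages?\\b.{0,50}living"
    )

    sum(lengths(living_wage))              # total matches
    which.max(lengths(living_wage)[1:10])  # busiest of the first 10 documents
    unlist(living_wage)                    # the phrases the regex captured

    # Re-run with .{0,100}, .{0,200}, and .{0,500} to widen the window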
8) Use the kwic command to extract a 100-token window around the regex you wrote for Problem 7. Save this kwic object as “lw_window” and convert it into a data frame named “df_lw_window”.
What are the dimensions of this data frame?
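A sketch; kwic matches patterns token by token, so the character-window regex from Problem 7 is approximated here by a token-level regex for the wage variants (an assumption about the intended approach):

    # 100-token window around the wage regex
    lw_window <- kwic(tokens(cr_corpus), pattern = "^[Ww]ages?$",
                      valuetype = "regex", window = 100)

    df_lw_window <- as.data.frame(lw_window)
    dim(df_lw_window)  # rows = matches, columns = the kwic fields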
9) As you may have noticed, many words are split in half by a hyphen followed by at least one space (“- ”).
This is a function of how the PDF documents were originally formatted and the difficulty of converting these documents to plain text.
Write a regex to replace all occurrences of this break in the first 10 documents in the cr_txt object and create a new data frame named “cr_txt_cleaned”.
Then check the text to make sure that you have performed this replacement properly.
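A sketch of the cleanup, again assuming cr_txt$text holds the raw text; the capture groups keep the replacement from touching legitimate dashes:

    # Rejoin words split as "exam- ple" in the first 10 documents
    cr_txt_cleaned <- cr_txt[1:10, ]
    cr_txt_cleaned$text <- str_replace_all(cr_txt_cleaned$text,
                                           "([a-z])- +([a-z])", "\\1\\2")

    # Check: the break pattern should no longer match anywhere
    sum(str_count(cr_txt_cleaned$text, "[a-z]- +[a-z]"))  # expect 0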
Sample Solution