We start by loading the required packages. mediacloudr is used to download an article based on an article ID received from the Media Cloud Topic Mapper. Furthermore, mediacloudr provides a function to extract social media metadata from HTML documents. httr turns R into an HTTP client to download and process the returned article page. We use xml2 to parse the HTML document, i.e. make it readable for R, and rvest to find elements of interest within the HTML document.
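The following lines load these four packages; this assumes they are already installed, for example via install.packages().
# load the packages used in this vignette
library(mediacloudr)
library(httr)
library(xml2)
library(rvest)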
In the first step, we request the article with the ID 1126843780. It is important to append the upper-case L to the number to turn the numeric (double) type into an integer type; otherwise the function will throw an error. The article was selected with the help of the Media Cloud Topic Mapper online tool. If you create an account, you can create and analyze your own topics.
# define media id as integer
story_id <- 1126843780L
# download article
article <- get_story(story_id = story_id)
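The exact structure of the returned object may vary between package versions; assuming it is a data frame, as the article$url[1] access below suggests, the available fields can be inspected with names().
# list the fields returned for the story
names(article)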
The USA Today news article comes with a URL which we can use to download the complete article using the httr package. We use the GET function to download the article. Afterwards, we extract the website content using the content function. It is important to provide the type argument to extract the text only; otherwise, the function tries to guess the type and automatically parses the content based on the content-type HTTP header. The author of the httr package suggests parsing the content manually. In this case, we use the read_html function provided by the xml2 package.
# download article
response <- GET(article$url[1])
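# optional check, not part of the original workflow: stop early if the request
# failed; http_error() and status_code() are helper functions from httr
if (http_error(response)) {
  stop("Request failed with HTTP status ", status_code(response))
}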
# extract article html
html_document <- content(response, type = "text", encoding = "UTF-8")
# parse website
parsed_html <- read_html(html_document)
After parsing the response into an R-readable format, we extract the actual body of the article. To do so, we use the html_nodes function to find the HTML tags matching the CSS selector defined in the css argument. A useful open-source tool for finding the corresponding tags or CSS classes is SelectorGadget; alternatively, you can use your browser's developer tools. The html_text function provides us with a character vector in which each element contains one paragraph of the article. We use the paste function to merge the paragraphs into one continuous text. We could then analyze the text using different metrics such as word frequencies or sentiment analysis (a small word-frequency sketch follows the code below).
# extract article body
article_body_nodes <- html_nodes(x = parsed_html, css = ".content-well div p")
article_body <- html_text(x = article_body_nodes)
# paste character vector to one text
article_body <- paste(article_body, collapse = " ")
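As a quick illustration of the word-frequency idea mentioned above, the following base R sketch counts the most common lower-cased words in the article body; the tokenization is deliberately naive and only meant as a starting point.
# split the text on non-word characters, lower-case it, and count word occurrences
words <- tolower(unlist(strsplit(article_body, "\\W+")))
word_frequencies <- sort(table(words), decreasing = TRUE)
# show the ten most frequent words
head(word_frequencies, 10)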
In the last step, we extract the social media metadata from the article. Social media metadata are shown when the article URL is shared on social media. The article representation usually includes a heading, a summary, and a small image/thumbnail. The extract_meta_data function expects a raw HTML document and provides Open Graph (a standard introduced by Facebook), Twitter, and native metadata.
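A minimal call could look like the following sketch; it passes the raw HTML document downloaded above, and the exact return structure (inspected here with names()) depends on the mediacloudr version.
# extract Open Graph, Twitter, and native metadata from the raw HTML document
meta_data <- extract_meta_data(html_document)
# inspect which metadata fields were found
names(meta_data)
For this article, the extracted metadata and the Media Cloud story record contain the following titles: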
Open Graph Title: “ICE drops off migrants at Phoenix bus station”
Article Title (provided by mediacloud.org): “Arizona churches working to help migrants are ‘at capacity’ or ‘tapped out on resources’”
The metadata can be compared to the original content of the article. A short analysis reveals that USA Today chose a different heading to advertise the article on Facebook. Larger analyses can use quantitative tools such as string similarity measures, for example those provided by the stringdist package.
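As a rough sketch of such a comparison, the two titles quoted above can be compared with a normalized string similarity; stringsim() from the stringdist package returns a value between 0 (completely different) and 1 (identical). The Jaro-Winkler method used here is only one possible choice.
# compare the Open Graph title with the article title using Jaro-Winkler similarity
library(stringdist)
og_title <- "ICE drops off migrants at Phoenix bus station"
article_title <- "Arizona churches working to help migrants are 'at capacity' or 'tapped out on resources'"
stringsim(og_title, article_title, method = "jw")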