Methods for Web Content Analysis and Context Detection

This project was part of Portland State University’s senior capstone program: the work of a team of seven students over the course of six months. Throughout the project we worked with a Mozilla adviser, Dietrich Ayala, to stay on track with the project’s original requirements.

Overview and Goals of the Project

This project was a research-intensive proof of concept for a Mozilla feature that would expand reader mode to content beyond articles. We set out to address the problem of how to “put the internet back in the hands of the user”: web pages are often bloated with unnecessary content that degrades the user experience.

In developing nations, where low-powered smartphones and slow internet connections are common, this bloat incurs a high computing cost on browsing and drains battery life. In our research we divided the problem into four main areas: the quality of the user’s internet connection, the user’s target device, which content is important to the user, and whether the content is accessible to users with disabilities.

For example, the graph below shows that on one of the web pages we tested, the difference in data transferred with and without reader mode was nearly 6 MB.

Figure: data usage with and without reader mode.

By understanding which parts of a web page are content and which are not, we can limit data usage by downloading only the relevant content. In addition, if we can fetch only what is necessary from a site, the user’s device gains the opportunity to optimize how that content is displayed.

This transformation of data for contextual presentation can be used to improve accessibility or to enable alternative browser models. We outline several possible efficient methods of content analysis. Ultimately, we found that currently available tools solve only a subset of the problems we identified; however, by combining several of these tools with the concepts explored in our research paper, we believe such a feature can be implemented.

What does this mean for an everyday web developer? Imagine smarter tooling for content analysis, detection, and optimization built into the browser as advanced features in the near future. Imagine developer tools that make building website accessibility and platform-specific features far easier and less costly than they are today.

Read on to learn more about our findings and the research we designed to test our ideas.

Installation & Usage

The process outlined in our paper is referred to as “Minimum Contextualization”, or contextualization for short. It is split into three main phases: content analysis, content filtration, and content transformation. Each phase consists of several steps.

Phoenix-node is a command-line application written in Node.js that we developed to analyze HTML document structure. It requires Node.js 4.0+, the npm package manager, and the jsdom npm package with its dependencies.

  1. Install Node.js 4.0+ following the instructions for your environment: https://nodejs.org/
  2. Clone the Phoenix-node repository from https://github.com/cap7pdx/phoenix-node
  3. Install jsdom into the source directory with 'npm install jsdom'. This creates a node_modules folder.
  4. Run Phoenix-node against http://google.com with 'node alt.js'. This prints the page's DOM structure to the terminal; a sketch of what such a script does appears after this list.
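
To give a feel for what that command does, here is a minimal sketch of a script that fetches a page, parses it with jsdom, and prints the element tree. This is an illustration only, not the actual alt.js from the repository, and it uses the current jsdom API (JSDOM.fromURL) rather than the jsdom.env style that was current when the project was written.

    // sketch.js: illustrative only, not the repository's actual alt.js.
    // Fetch a page, parse it with jsdom, and print the element tree.
    const { JSDOM } = require("jsdom");

    // Recursively print each element's tag name, indented by depth.
    function printTree(node, depth = 0) {
      console.log("  ".repeat(depth) + node.tagName.toLowerCase());
      for (const child of Array.from(node.children)) {
        printTree(child, depth + 1);
      }
    }

    JSDOM.fromURL("http://google.com")
      .then((dom) => printTree(dom.window.document.documentElement))
      .catch((err) => console.error("Failed to load page:", err.message));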

Figure: Phoenix-node output showing the parsed DOM structure.

Research Findings

Our research identified three major phases in the contextualization process: content analysis, content filtration, and content transformation. Our findings focus on the first phase, content analysis; the other two phases are left for future work.


Contextualization

For content analysis, we recommend two distinct steps. The first step identifies which “Structure Group” a site falls into by applying cluster analysis to document structures. In the second step, one of several methods is used to parse the site and determine which content is essential for the user to understand its meaning. For example, if a site falls into a cluster that is text-heavy with little to no other content, basic reader mode techniques such as shallow-text methods are sufficient; otherwise a more advanced method must be used, such as semantic segment detection (discussed further in our paper). A sketch of the grouping step follows.
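
To make the first step concrete, here is a minimal sketch of structure-based grouping, assuming each page is reduced to a tag-frequency vector and pages are grouped greedily by cosine similarity. The feature choice, the greedy strategy, and the 0.9 threshold are illustrative assumptions for this sketch, not the cluster analysis method described in our paper.

    // Sketch: group pages into "Structure Groups" by tag-frequency
    // similarity. Illustrative assumptions, not the paper's method.
    const { JSDOM } = require("jsdom");

    // Count how often each tag name occurs in a parsed document.
    function tagHistogram(document) {
      const counts = {};
      for (const el of document.querySelectorAll("*")) {
        const tag = el.tagName.toLowerCase();
        counts[tag] = (counts[tag] || 0) + 1;
      }
      return counts;
    }

    // Cosine similarity between two tag histograms.
    function cosine(a, b) {
      const tags = new Set([...Object.keys(a), ...Object.keys(b)]);
      let dot = 0, na = 0, nb = 0;
      for (const t of tags) {
        const x = a[t] || 0, y = b[t] || 0;
        dot += x * y;
        na += x * x;
        nb += y * y;
      }
      return na && nb ? dot / Math.sqrt(na * nb) : 0;
    }

    // Greedy grouping: join the first group whose representative page is
    // similar enough, otherwise start a new group.
    function assignToGroup(groups, histogram, threshold = 0.9) {
      for (const group of groups) {
        if (cosine(group.representative, histogram) >= threshold) {
          group.members.push(histogram);
          return group;
        }
      }
      const fresh = { representative: histogram, members: [histogram] };
      groups.push(fresh);
      return fresh;
    }

    // Example: two structurally similar pages land in the same group.
    const groups = [];
    for (const html of [
      "<article><p>text</p><p>text</p></article>",
      "<article><p>text</p><p>text</p><p>more</p></article>",
    ]) {
      const doc = new JSDOM(html).window.document;
      assignToGroup(groups, tagHistogram(doc));
    }
    console.log(groups.length + " structure group(s) found");

In a production system the representative would be something like a centroid over many crawled pages, but the sketch shows how structural similarity alone can drive the grouping.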

Through our research we learned about the limitations inherent in modern reader mode techniques and about the state of similar research. Our team’s recommended method for content analysis and context detection is cluster analysis: grouping pages with similar structures together so that the archetypal structure of each cluster can be studied.

Read the full paper here:

Methods for Web Content Analysis and Context Detection



3 comments

  1. André Jaenisch

    Hello, Jonathan,

    thanks for the effort put into writing this article.
I’m sorry to say it, but in my opinion it’s a bit sparse and reads like PR.

    I could easily wrap up the whole article into this:

    This project was part of Portland State University’s senior capstone program. It is the work of seven students over the course of six months.
    There are three phases in the contextualization process: content analysis, content filtration and content transformation, but we will cover only the first one in our paper:
    Methods for Web Content Analysis and Context Detection

    Look, I’m disappointed, since I was expecting a more in-depth explanation (I’m used to finding this kind of depth on the Mozilla Hacks blog …).
    And when I refer to PR, I have the repetition of content and certain phrases in mind, like “Imagine …”, “Read on to learn more about”, and “We set out to solve the problem”. But maybe this is just how academics speak today (I was one myself ;-)).

    Maybe the next article reads better.

    March 30th, 2016 at 13:20

  2. ZEE Center

Is it possible to do this kind of web analysis in a much simpler way? It’s a long, complicated story; why doesn’t the project make it easy and simple?

    April 2nd, 2016 at 09:18

  3. Evan

I agree, it’s too hard to do web analysis this way at the moment.

    April 12th, 2016 at 16:22
