## ‘Trend Analysis’ – an effective way of gathering information

admin | August 3, 2009

A ‘Trend Analysis’ can be an effective way of gathering information. Take Google’s search engine behaviour when indexing and ranking: it does not rely on the ‘MetaTags’ and/or TAGs set by authors, but on content-based ranking instead. A score is computed from the query and the contents of the page using metrics such as word frequency, document location and word distance, together with the number and quality of incoming links. In other words, no artificial clustering based on mere possibilities, but on inferred tendencies.
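As a rough sketch of those three content-based metrics (the function names, scoring conventions and sample data here are my own illustration, not Google’s actual algorithm):

```python
def frequency_score(words, query):
    """Word frequency: how often the query terms appear in the page."""
    return sum(words.count(q) for q in query)

def location_score(words, query):
    """Document location: earlier occurrences score better (lower is better)."""
    locations = [words.index(q) for q in query if q in words]
    return sum(locations) if locations else 1_000_000

def distance_score(words, query):
    """Word distance: the smallest gap between the first two query terms
    (lower is better; a large sentinel if a term is missing)."""
    positions = [[i for i, w in enumerate(words) if w == q] for q in query]
    if len(query) < 2 or any(not p for p in positions):
        return 1_000_000
    return min(abs(a - b) for a in positions[0] for b in positions[1])

words = "trend analysis is a way of gathering trend data".split()
print(frequency_score(words, ["trend"]))              # occurrences of 'trend'
print(location_score(words, ["trend", "analysis"]))   # sum of first positions
print(distance_score(words, ["trend", "analysis"]))   # closest gap between terms
```

A real engine would combine these per-page scores with link-based signals before producing a final ranking.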

Search engines don’t consider how an author wants to be classified (or ranked); they look at the content itself, together with external interest in that content (links to the site), crawling websites and clustering/grouping a referential dataset. Last week I’ve been dwelling on the processes search engines use for trend analysis, and on some mathematical formulae for processing the dataset:

- Euclidean distance – graphing two points, it determines how close they are
- Pearson correlation coefficient – a measure of how highly correlated two variables are
- Weighted Mean – to make numerical predictions based on similarity
- Tanimoto coefficient – measure of similarity of two sets
- Conditional probability – way of measuring how likely something is to occur
- Gini impurity – the probability that a randomly chosen item would be classified incorrectly
- Entropy – to see how mixed a set is
- Variance – how much a list varies from the mean average value
- Gaussian function – weighting function for weighted k-nearest neighbours
- Dot products – to calculate the angle between vectors when classifying items

(This is a reference list.)
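A few of the formulae above can be sketched directly in Python (a minimal illustration, not a production implementation):

```python
from math import sqrt

def euclidean_distance(v1, v2):
    """Euclidean distance: straight-line distance between two points;
    the smaller the value, the closer (more similar) they are."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def pearson(v1, v2):
    """Pearson correlation coefficient: 1.0 means perfectly correlated,
    0 means no linear correlation, -1.0 means inversely correlated."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq, sum2_sq = sum(a * a for a in v1), sum(b * b for b in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))
    num = p_sum - (sum1 * sum2 / n)
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    return num / den if den else 0

def tanimoto(items1, items2):
    """Tanimoto coefficient: |intersection| / |union| of two sets."""
    a, b = set(items1), set(items2)
    return len(a & b) / len(a | b)

print(euclidean_distance([0, 0], [3, 4]))   # the classic 3-4-5 triangle
print(pearson([1, 2, 3], [2, 4, 6]))        # perfectly correlated series
print(tanimoto([1, 2, 3], [2, 3, 4]))       # 2 shared items out of 4 total
```

Pearson is handy when two sources rate things on different scales, since it corrects for the mean; Tanimoto works on sets rather than numeric vectors.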

In practical terms the process has 3 stages:

- Creation of the dataset by crawling sites, RSS feeds, feedback forms, reports, etc., possibly complemented with some generated data in a database.
- Clustering, ranking and filtering the dataset/database.
- Presenting the ‘searched’ information in a visual/graphic format.
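The three stages above can be sketched end-to-end in a few lines (the documents, query and scoring here are toy examples I made up, and the “presentation” stage is just text output standing in for the visual layer):

```python
from collections import Counter

# Stage 1: dataset creation (a stand-in for crawled pages / RSS items).
documents = {
    "page1": "trend analysis of web data",
    "page2": "search engine ranking and trend data",
    "page3": "cooking recipes for the weekend",
}

def build_dataset(docs):
    """Turn raw text into word-frequency vectors, one per document."""
    return {name: Counter(text.split()) for name, text in docs.items()}

# Stage 2: ranking/filtering the dataset against a query.
def rank(dataset, query):
    """Score each document by how often the query terms occur, best first."""
    scores = {name: sum(counts[q] for q in query)
              for name, counts in dataset.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Stage 3: presentation (text output in place of a graphic front end).
dataset = build_dataset(documents)
for name, score in rank(dataset, ["trend", "search"]):
    print(f"{name}: {score}")
```

In the real pipeline, stage 2 would use the clustering and similarity metrics listed earlier, and stage 3 would feed a charting layer instead of printing.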

Nothing in this flow is new, but its realisation can be innovative: using established technologies and the metrics used by search engines, with ‘on the fly’ visualisation through the web (eventually iterative).

It seems that all my efforts are moving in that direction:

- the creation of the example database for the BR projects, together with the rearrangement of our Projects Directory using Python frameworks;
- the list of mathematical formulae above, which is oriented not towards analysis but towards clustering, ranking and filtering the information;
- and the presentation layer, which without a doubt will use Flash/Flex SWF files (Adobe Flex is a Flash-type application capable of handling/interacting with big sets of data on the fly – Flash for corporate environments).

For the presentation layer it is worth mentioning **Hans Rosling**. He was a keynote speaker at ALT-C 2008 (here) – (the conference video), also available on our servers (Elluminate). He developed a piece of software (Trendalyzer) that he uses to put data together in some innovative ways on his website http://www.gapminder.org/world. Google acquired the rights and partly incorporated it into the Google Visualisation API, which I’m considering using for the presentation layer.

Thoughts appreciated…