~ Journal ~




Report 1 - Report 2 - Report 3 - Report 4 - Report 5
Report 6 - Report 7 - Report 8 - Report 9 - Final Report (.doc .pdf)









Report 1 - Friday, June 10  
Next ->
"First steps"

During this week I studied a lot about graph drawing, graph layout, data mining, clustering, graph-based two-dimensional constraint solvers, cluster merging, force-directed methods, aesthetic criteria for visualization, and the k-means algorithm.
Also, I analyzed and compared different visualization systems that we could use in our project, such as Graphviz, JGraph, Grappa, TouchGraph, JUNG, and Prefuse.

Graphviz produces very aesthetic, really nice graphs and clusters. But, unfortunately, it does not support dynamic layout. Grappa, its Java front end, has serious applet performance problems when graphs have hundreds of objects.
The same problem occurs with TouchGraph.
JUNG seems useful, but its libraries don't include clustering code.
In my opinion, Prefuse is the most reasonable package for implementing our project.

A short description of the Prefuse package:
The Prefuse toolkit is a user interface toolkit for crafting interactive visualizations of structured and unstructured data. Its architecture lets you create animated graphical interfaces for visualizing, exploring, and manipulating various forms of data, and it offers force simulation and graphics transformations, including panning and zooming. The extensible Prefuse library supports writing new layouts by composing existing modules, force-directed layout, integration of color maps for assigning colors to data elements, and interaction components for common interactions, including drag controls.
It also includes an extensible library for force-based simulation (nodes exert anti-gravity, edges act as springs).

Prefuse is open-source software. It is written in the Java programming language using the Java2D graphics library and is designed to integrate with any application written with the Java Swing user interface library.

In my opinion, it is the best package for creating "fast-drawing" and well-designed force-directed layouts with several hundred nodes, plus clustering and filtering of data.

Thinking about our project, I consider that in many fields of applied science the most important thing is data analysis, so it is often necessary to split clusters into different windows, to save some clusters, or to extract smaller clusters from a given cluster. The Prefuse package offers a solution to all of these problems.

About Prof. desJardins's list of three key steps that she sent out: my still-unordered new knowledge does not allow me to make any corrections or variations to her proposal. Right now I agree with her ideas completely.

About the k-means algorithm:
The k-means algorithm is often considered impractically slow. Dan Pelleg and Andrew Moore from Carnegie Mellon University implemented and tested exact accelerated k-means algorithms, which scale very well with the number of centers, permitting clustering with tens of thousands of centers with faster execution.
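For reference, the core loop that those accelerated variants speed up is plain Lloyd's k-means: assign each point to its nearest center, then move each center to the mean of its points. A minimal 2-D sketch in Java (my own illustration, not the Pelleg-Moore code):

```java
// Plain Lloyd's k-means for 2-D points: my own minimal illustration,
// not the accelerated Pelleg-Moore implementation.
public class KMeansSketch {
    public static double[][] cluster(double[][] pts, double[][] centers, int iters) {
        int k = centers.length;
        for (int it = 0; it < iters; it++) {
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (double[] p : pts) {                 // assignment step
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = p[0] - centers[c][0], dy = p[1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestD) { bestD = d; best = c; }
                }
                sum[best][0] += p[0]; sum[best][1] += p[1]; count[best]++;
            }
            for (int c = 0; c < k; c++)              // update step: center = mean
                if (count[c] > 0) {
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        return centers;
    }
}
```

The accelerated versions keep this assign/update structure but use geometric reasoning (e.g. kd-trees) to avoid computing every point-to-center distance.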
The short period did not allow me to study this algorithm very well, but I have set that as my goal for next week.

Other goals for next week:
1. Meeting and discussion:
- Meeting on Monday with Blazej and Julia. Discussion of the possibility of using Prefuse
- Make a final decision about the package that we'll be using in our project
- Meetings during the week for discussion of parallel tasking and implementation of algorithms
- Other meetings.
2. Studying process:
- concentrating on Java and the editor
- focusing on graph manipulation and visualization
- acquiring knowledge of graph layout, constrained clustering, the k-means algorithm and its new modifications, and force-directed methods
- background reading about distance learning.
3. Practical process:
- detailed familiarization with Linux, Fedora, software applications, etc.
- creation of a simple graph layout
- implementation of dynamically changing the graph, animating the results.


<- Previous  Report 2 - Friday, June 17  Next ->
"Exploring and Ordering my new Knowledge"

The second week of my DMP program started with:
- a tutorial overview of the Linux terminal and the "bash" shell;
- getting acquainted with XML files and how to create them;
- reading articles about clustering and the k-means algorithm, about different approaches and methods to solve it, and about modifications of the k-means algorithm (like a dynamic clustering algorithm that uses the Mahalanobis distance).

After analyzing approaches to the creation of clusters, I proposed a small variation to Prof. desJardins's first key step about building the constraints.
A problem appears in clustering when there is a similarity between two points, but those points belong to different concepts. So, for each pair of user-moved points we must consider not only the distance, but the contents of those points too.
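As a small illustration of the idea (my own sketch, with an assumed mixing weight `alpha`, not code from any paper), the dissimilarity of such a pair could combine geometric distance with a content penalty:

```java
// Sketch: a combined dissimilarity for a pair of points that mixes
// geometric distance with a penalty when their concepts differ.
// The weight `alpha` is an assumption of mine, not from any paper.
public class PairDissimilarity {
    public static double score(double[] x, double[] y,
                               String conceptX, String conceptY, double alpha) {
        double d = 0;
        for (int i = 0; i < x.length; i++) d += (x[i] - y[i]) * (x[i] - y[i]);
        d = Math.sqrt(d);                       // Euclidean distance
        double contentPenalty = conceptX.equals(conceptY) ? 0.0 : 1.0;
        return (1 - alpha) * d + alpha * contentPenalty;
    }
}
```

With `alpha` near 1, two geometrically close points from different concepts still look dissimilar, which is exactly the situation described above.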

Also during this week I was fully concentrated on studying the code and the implementation of the algorithms from the prefuse package.
Running programs from prefuse required using Emacs.
Emacs is a text editor possessing an extensive set of features.
Thus, new problems appeared: getting acquainted with and adapting to the use of Emacs. Blazej downloaded a new version of this editor. Emacs is amazing - it has everything a programmer needs: fast switching from program to program, finding the algorithms you need, wizard applications, visualizations, and much more. Those were the only aspects of the application I was able to get familiar with in the time allowed. I fully recommend this editor to everyone. At first it seems a little bit strict, but it should satisfy even the most meticulous user...

Running the demo applications from the prefuse package surprised me. This package contains all the algorithms that we need for our project.

Prefuse implements a simple and fast algorithm using Euler's method for updating velocity and position data; a 4th-order Runge-Kutta method for those updates when springs are present in a force simulation; a spring-force algorithm to compute the force on a node resulting from treating incident edges as a system of springs; an algorithm for a constant gravitational force, like the pull of gravity on an object on Earth; and an efficient n-body simulation implemented with the Barnes-Hut algorithm, which can be used to create force-based layouts.
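To make the first of these concrete, here is a one-dimensional Euler-style spring update (my own simplified sketch, not prefuse's actual ForceSimulator classes): compute the spring force, update the velocity, then update the position.

```java
// A minimal Euler-style step for a unit mass on a spring: F = -k * (x - rest).
// This is my own 1-D illustration of the idea, not prefuse's API.
public class EulerSpring {
    public static double[] step(double x, double v, double k, double rest, double dt) {
        double force = -k * (x - rest); // spring pulls toward its rest length
        double vNew = v + force * dt;   // update velocity from acceleration
        double xNew = x + vNew * dt;    // update position from new velocity
        return new double[]{xNew, vNew};
    }
}
```

Runge-Kutta replaces the single force evaluation per step with four, trading speed for accuracy, which is why prefuse offers both.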

Julia, Blazej, and I discussed the possibility of using this package. On Monday, during the meeting, we must discuss our project. I think we must concentrate more on the visualization problems and on an interactive approach better adapted for users. I'll send my full proposal about the project to Prof. Marie desJardins tomorrow.

Also, during the week I concentrated on studying how to create programs in Java. I created a few applets that we can use for interactive information visualization (not very well designed, but this is not a problem right now), and created some modifications of prefuse's applications.

Unexpectedly, Mark invited everyone from the lab to a party. We received the invitation a tad late, so I was not at all ready to give up the lab at 7:30 PM, but, because I don't have a car, I was fully dependent on Blazej's offer to give me a lift to the party. That's why I hadn't finished my report. The party was great and very relaxing, and left me with a lot of positive emotions.

Aside from that, below is my project's proposal:

Since the prefuse package includes an implementation of force-directed layout and cluster creation, I propose:
1. Force-directed layout:
- adapt this program to our dataset (the program in the prefuse package is a little bit strict);
- investigate different algorithms for measuring spring force and distance, and for the creation and number of clusters;
- analyze and compare these different methods and approaches to measuring distances and forces;
- investigate the accelerated k-means algorithm (I mentioned it in the first report);
- choose the fastest and most effective method (some methods work better for a small number of nodes and cannot work properly with large data sets; other methods work properly only with huge data sets);
- Document our investigation.
2. Visualization and analyzing clusters:
The force algorithm pulls the clusters apart, and for huge numbers of nodes very dense clusters, or a big number of clusters, will be created that crawl away from the frame. The problem is stopping this process, so we could develop this algorithm further.
For a big number of nodes in a cluster, the cluster might be very concentrated, so I propose to extend our project to include tasks such as:
a) Splitting the main frame into different frames for each cluster;
b) Big clusters can include a lot of smaller clusters; we can develop an algorithm with an interactive applet to create smaller clusters and save the newly created clusters for further analysis;
c) Adapting the algorithm from prefuse to represent strong and weak connections in a cluster; when investigating the data in a cluster, we can use the algorithm to delete the weak connections, or the strong ones;
d) Adapting the algorithm from prefuse to save data from each of the clusters (if needed) in output files;
e) Comparing data from different clusters, such as number of nodes, dependencies, etc.;
f) Adapting some algorithms to enlarge nodes that are interesting for the user, etc. New ideas are expected to appear in the process of creating our project.
3. Documentation.

P.S.: Discussion of every point in my proposal is welcome.
P.P.S.: Julia and Blazej are in full agreement about the use of the prefuse package.


<- Previous  Report 3 - Friday, June 24  Next ->
"Applying my new Knowledge"

This week seemed more exciting to me than the previous one because, fortunately and unexpectedly fast, I created, I think, a very nice application for force-directed layout. I'm using some code from prefuse, combining pieces together and introducing some changes, and it gives an exciting (at least to me) result. Of course, the result could be better; prefuse contains a lot of code that I want to use during the next weeks, but I must also spare time to close my gaps in Java and Eclipse.
I finished the DMP dataset. I used the XGMML format to create the XML file.
The DMP graph contains 109 nodes and 193 edges. It took more time than I expected to create the input dmp.xml file: I found only a little information about the XGMML format. There are different versions of graph XML files, and some versions caused errors when running the program with them. Importantly, the XGMML format allows representing the weight of the edges, which matters for visualizing edge thickness.
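For illustration, the skeleton of such an XGMML file looks like this (the node labels here are made up; only the structure matters, and the `weight` attribute on the edge is what carries the edge thickness):

```xml
<?xml version="1.0"?>
<graph label="dmp" directed="0" xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="CRA_W_DMP"/>
  <node id="2" label="student"/>
  <edge source="1" target="2" weight="2.0"/>
</graph>
```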
The DMP program is ready, but I want to improve it a little bit for better visualization. Maybe I'll finish this tomorrow.
Also, I got acquainted with the articles "Clustering Relation..."[1] and "Semi-supervised..."[2]. I'll report my opinion about them before our discussion next week.

I'm not ready to present the data generation script right now. I understand the algorithm and almost know how to implement this program. But Java differs from C very much (in things such as creating matrices, etc.), so I decided to write my pseudocode in the way closest to Java. Also, I searched for and retrieved useful code in JUNG and prefuse. I must decide which package I'll use to generate a graph with randomly generated edges. My thoughts are a little bit messy right now.


<- Previous  Report 4 - Friday, July 1  Next ->
"Getting over the Severities"

1. Data generation:
I almost finished the program for the data generation process. The program works properly, exactly as outlined in Prof. desJardins's script. The XML file in GraphML format that I generate from the console successfully loaded in my first program (the force-directed layout). Right now I'm working on the last step in my new program - inserting an already created matrix with randomly distributed edges into the newly created output file "toygraph.xml" - which I finished today. Also, I must improve my program into the most professional style possible: a lot of the code was written in a top-to-bottom style, and I must add the missing documentation. All this I'll finish during my "long" weekend. At the beginning of the new week everything will be presented in the best fashion.

2. Problems that appeared with the program:
Before using any code (from packages, the Java library, or the Internet) I always check it. Big problems appeared with the random integer generator from the Java library. It works properly for a large set of data, but in our case, when the number of attributes can be very small (2-5), the integer generator did not give a reasonable distribution. For example, for the integers 1 and 2, the generator gave a distribution (over a loop of 100 iterations) like 1,1,1,1,1,1,2,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1... This is not a proper uniform distribution. I spent a lot of time improving this situation, and right now the random integers come out correctly, like 1,2,2,1,2,1,1,2,... Also, I concentrated on checking the correctness of all data, whether int, double, or string, and on catching every mistake that can appear when entering data from the console (such as an empty string, incorrect data, etc.).
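For the record, here is the kind of frequency check I mean (a sketch; one classic cause of long runs like the above is constructing a new `Random` inside the loop, since time-based seeds can repeat, so the single shared instance below matters):

```java
import java.util.Random;

// A quick frequency check for java.util.Random on a tiny range.
// Reusing one Random instance, rather than creating a new one per
// draw, gives a reasonably uniform sequence.
public class RandomCheck {
    public static int[] frequencies(Random rnd, int range, int draws) {
        int[] counts = new int[range];
        for (int i = 0; i < draws; i++) counts[rnd.nextInt(range)]++;
        return counts;
    }
}
```

Over many draws the counts per value should be close to equal; large systematic imbalance usually points to a usage bug rather than the generator itself.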

3. Studying process:
Our project is based on writing code in Java, so I decided to write the program in Java. I'm not very familiar with Java: I took a course in Java four years ago, and the teacher was a professional COBOL programmer, so my knowledge of Java is very sketchy. Overall, this week was very productive in the sense of acquiring knowledge of Java by myself. The new experience allowed me to create the data generation program, and this is amazing...

4. Next week I'll hold to Prof. desJardins's Layout Project schedule and also work on closing my gaps in Java.


<- Previous  Report 5 - Friday, July 8  Next ->

I finished the implementation of the synthetic data generation package. In my opinion it works perfectly. I must only add 4 lines of code to this package to create a bridge between nodes in case the randomly set edges leave the graph unconnected. Without Blazej's help I would not have known to ask about such an important nuance in the program. I know how to implement this code and I'll add it after finishing the report. From a Java programmer's view some of the code might seem not quite professional, but I'm fully satisfied with my programs. Some of the code was very hard to implement, especially reading data from the console, converting strings into doubles and integers, and then inserting the verified data into a newly created output file. The hardest moment during programming was finding a way to use the name of the output file as a global variable. A lot of problems appeared with the scopes of the variables when I rearranged my code into a "normal" style; in this respect Java is very different from C, the language that I used during the past two years. When I finished this package I was fatigued, but very proud of myself.
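The string-to-number validation I describe can be sketched like this (the method names are mine, for illustration, not the actual ones in our package):

```java
// Sketch of the validation step: check for an empty string, then try
// to parse, returning null instead of throwing on bad input.
public class InputValidation {
    public static Integer tryParseInt(String s) {
        if (s == null || s.trim().isEmpty()) return null;   // empty input
        try { return Integer.valueOf(s.trim()); }
        catch (NumberFormatException e) { return null; }    // not an int
    }
    public static Double tryParseDouble(String s) {
        if (s == null || s.trim().isEmpty()) return null;
        try { return Double.valueOf(s.trim()); }
        catch (NumberFormatException e) { return null; }    // not a double
    }
}
```

A null result lets the console loop re-prompt the user instead of crashing on an empty string or incorrect data.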

After finishing data generation package I started to write the function for visualization and customizing the graph layout. I'll provide my new application during our weekly meeting.

Also, I created the following functions that will be used often in our project:

- EdgeConnection - to connect a newly set node in the graph with other nodes
- ClusterGeneration - to generate a dummy invisible "cluster center"
- ReadGraph - to read all attributes and values from the graph.

I started to develop the "Hollywood Actors" dataset. Inserting images into graph nodes is a very complicated process (I repeat that I have never implemented such things before). Prefuse includes code for manipulating images, and I think that this part of the project will be solved during the next week.

I didn't start coloring the nodes depending on their values (in case a node contains four attributes with four values, the node must be painted with four different colors), because the node coloring requires reading the values from the graph. Well, right now this is not a big problem, just a little bit labor-intensive. I'll try to get this done during next week.

Learning process:

- recognizing Java's possibilities, and mine too (self-knowledge);
- familiarizing myself with new approaches to solving clustering problems from the sources that Prof. desJardins provided.


<- Previous  Report 6 - Friday, July 15  Next ->


1. Code for the force-directed graph layout
Last week I paid attention to making the visualization of the force-directed graph layout clearer. I consider that the background color in the implemented application must "play" with the coloring of the graph's nodes. Choosing the appropriate coloring in Java took a lot of time, but I think that the project's results must be presented in the best, most fashionable way. Also, I added a few usable facilities to the application to operate the clustering process. I studied prefuse's code for coloring the nodes depending on their attribute names and attribute values. The author of the prefuse package, J. Heer, wrote that the code from prefuse is not easy to build on. Sometimes, to write a few lines of code, I must explore 10-20 programs from a package that contains more than 100 programs. The node coloring required knowing the very deep structure of the code that operates on the graph, tree, map, arrays, etc. Right now I'm almost done creating the code for node coloring depending on attribute names and values. If this part of my program is successfully implemented, it will give me the possibility of easy and fast implementation of code for representing a node's attributes and the values of each attribute when the mouse hovers over that node. It is important to visualize the structure of the graph's nodes.

2. Synthetic data generation package.
I added the code that allows the creation of a bridge in the graph in case the randomly generated edges leave the graph unconnected.

3. New dataset.
I decided to create an Oscar nomination dataset instead of "Hollywood Actors". It seems like a more interesting and more informative dataset than a huge dataset about actors. To collect the data I used "The official Academy Awards Database" on the http://www.oscars.org/awardsdatabase/index.html website. The input XML file with the awarded actors' and movies' data is almost ready, but if I find a way to insert actors' pictures into the graph layout, creating the input file will require more time to collect the pictures.

4. Discussion.
I discussed with Blazej a Linux software application that takes screenshots of the computer's windows. This allowed me to insert screenshots of the applications that use my datasets and programs into my website. He also explained to me how to create the Java code for creating and reading an input file with pictures.
We discussed the clustering algorithm that I consider the most appropriate for uniformly distributed data in the n-dimensional attribute space. This algorithm requires feedback from the user, but it is not very complicated to implement. In my opinion, it is a fast algorithm for collecting data into clusters. I will send my proposal and the algorithm on Tuesday.

5. Next week.
- Reading the papers about different approaches to creating clusters and cluster analysis.
(The best way to get through these is during the time I'll spend on the bus traveling to NY and back to Maryland.)
- Finishing the coloring of the nodes.
- Creating code for representing info from a node.
- Implementing the "Hollywood Actors" dataset, maybe with actors' pictures.


<- Previous  Report 7 - July 22  Next ->

1. Code for the force-directed graph layout:
This week I implemented some very substantial functions:
Reading the information from visual nodes when hovered over by the mouse. All attributes, and the values that belong to each attribute, are printed to the console.
I consider that resizing the node's shape would create additional forces on the edges, and this process would add complexity to the program. I used a way that is simpler for the visualization, though not for the implementation of the code: if the mouse points to a node, a small icon appears near the node with the text information that represents it. This is important in the case of zooming, when a graph contains a lot of nodes and it is not possible for the user to see the text on the node.
The code described above allowed me to implement code for painting the nodes depending on their attributes. Also, I started to create the code for painting the nodes depending on the attributes' values. This is a very tedious process, but I think that I'll master this problem.
The last two days I tried to create the code for setting an image in the background of the application. For the Hollywood dataset there will be a background picture of the Oscar statuette; for the data generation set, the MAPLE logo; for the DMP set, something else. But this code is very complicated because implementing it with prefuse requires the use of Graphics and Graphics2D. My classes for setting the background work properly in separate files for setting an image to the background of the screen, but inside prefuse's built-in functions the image appears incorrectly and clipped. The Internet cannot help me resolve this problem: analogous code is either in paid packages or in new books, which the authors propose that I buy. I'll put some effort into it, but if it takes time that I have to spend on other tasks, I'll extend the frame of the application with an additional layout where the images will be placed. We could also add representative information about the running application there.
I implemented the code that creates the clusters, but the clustering algorithm is not one based on the attributes and values. This is just a "representative" clustering process where clusters are formed by the edge connections. This clustering algorithm is useful for visualization, especially for the DMP dataset, where CRA_W_DMP is a general centroid and other centroids are created for students or mentors, etc. The implemented painting code helped to choose the new cluster centroids properly. And it looks great.

2. Dataset "Hollywood"
More likely tomorrow than today, I'll finish the creation of this dataset. I'll also try to implement the code to insert an image into a node. If this code is created successfully, I'll have to collect the actors' images...

3. PCK-Means
Julia implemented the PCK-Means algorithm from Weka. For almost 3 hours, Julia and I checked the correctness of the clusters created by value. We investigated the clusters only on the example "toygraph.xml", with a small number of nodes. It seems that the algorithm works properly in collecting the nodes by values, but it does not force the nodes around their "own" cluster, and the clusters are positioned improperly. I proposed to Julia to create an additional "zero" centroid that is connected to all nodes at the beginning. When a new cluster is created, the edges between the "zero" centroid and the nodes that belong to the new cluster are removed. This process solves four problems:
a) Each new cluster will be placed near the "main" node, which will collect all dependent nodes around the new cluster;
b) The nodes disconnected from "zero" will be forced toward their "own" cluster;
c) Without the "zero" centroid-cluster, new clusters collected all nodes, even nodes that don't belong to the cluster by value. With the "zero" cluster such nodes stay around it, and this is a simple solution for filtering the data;
d) A simple resetting algorithm:
- Delete all clusters except the "zero" cluster;
- Connect all nodes to the "zero" cluster;
- A new investigation can be started.
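The bookkeeping behind the "zero" centroid idea can be sketched as follows (my own data-structure choice for illustration, not the prefuse implementation):

```java
import java.util.*;

// Sketch of the "zero" centroid bookkeeping: every node starts attached
// to a virtual zero centroid; assigning a node to a real cluster cuts
// its zero edge, and reset() reattaches everything.
public class ZeroCentroid {
    private final Set<String> attachedToZero = new HashSet<>();
    private final Map<String, String> clusterOf = new HashMap<>();

    public ZeroCentroid(Collection<String> nodes) { attachedToZero.addAll(nodes); }

    public void assign(String node, String cluster) {
        attachedToZero.remove(node);   // cut the edge to the zero centroid
        clusterOf.put(node, cluster);
    }

    public void reset() {              // delete all clusters, reconnect to zero
        attachedToZero.addAll(clusterOf.keySet());
        clusterOf.clear();
    }

    public boolean onZero(String node) { return attachedToZero.contains(node); }
}
```

Nodes still attached to zero are exactly the unclustered ones, which is what makes the filtering in point (c) and the reset in point (d) cheap.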


4. Analyzing and statistical algorithms
I got acquainted with the Adjusted Rand Index and the graph-distance measure. I have not yet started to create the code for these tasks; I'm waiting on Julia's PCK-Means code. I'll complete these tasks during next week.

5. CVS
At this moment CVS is not working properly. Blazej, who kindly decided to help me with this task, was very busy during this week; we should finally be done with this on Monday. Julia and I are still working on different tasks, so the absence of CVS does not influence our work process. We are sharing all important code.

6. Reading papers
I read additional papers from the Internet about:
- approaches to creating data clusters in n-dimensional space;
- data mining;
- statistical analysis of clusters.


7. Discussion
I discussed with Julia and Blazej my proposed k-means clustering algorithm for valued attributes in n-dimensional space, based on feedback from the user. (I promised that I would send my algorithm, but I was very, very busy with code, and presenting this algorithm requires some graphical images, like nodes settled in 1-, 2-, and 3-dimensional space, etc.) It should be a very fast algorithm (I think so) with meaningful data, and not very hard to implement, but we don't have a lot of time. Right now everything depends on how fast we finish today's tasks...


<- Previous  Report 8 - July 29  Next ->

1. Datasets and applications:
I completed the detailed descriptions of the datasets that you proposed to use for the paper:
- OSCAR nomination data set;
- Data set generator.
I consider that we can also use the DMP dataset for the paper. It is a properly created dataset with a well-designed application. I put some of the application's screenshots up on my website, and during this week I received a letter from Nancy Amato, Professor at Texas A&M University and Co-Coordinator of the CRA-W Distributed Mentor Project:

"I especially like the CRA-W DMP graph! We're looking forward to learning more about you and your research project from your website. Be sure to include some fun details too, since future DMP applicants will look to your site to see what the program is like."

This dataset is a really good set that we can use for comparing the efficiency of cluster creation by the values of the attributes. It's easy to create and analyze clusters for students, or for mentors, etc. The structure of the DMP dataset is not complicated at all: every node contains only one attribute.
The dataset generator is more complicated and can contain nodes with n attributes. All attributes in this dataset are the same for each node.

The most complicated structure is in the "OSCAR" dataset, where nodes contain different numbers of attributes with different names.
So, comparing the correctness of the different clustering algorithms will be most representative across datasets that differ in complexity.

The applications were on my website (http://members.cox.net/lozova/index.html).
I created my website on my Internet service provider's web space, but the monthly bandwidth they give me has just run out. Right now I need to relocate my website to http://maple.cs.umbc.edu/~nlozova/.

2. Code implementation and support:
- I created a demo GraphLayoutMAPLE.java in a "professional" style (almost documented).
- I finished the "very hard to implement" code for inserting pictures into the nodes.
- I collected all the pictures of the actors and actresses that received Academy Awards between 1999 and 2004. (The application with the pictures looks very impressive.)
- The class Combinatorial.java, which supports the Adjusted Rand Index algorithm, was created.
- The Adjusted Rand Index (ARI) algorithm for cluster evaluation was implemented (not documented).
- The algorithm for measuring Euclidean distance on the graph was created (not documented yet).
- Dijkstra's algorithm to find the shortest path between nodes on the graph was implemented.
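Since the ARI comes up again below, here is a compact version of the computation (a sketch of the standard Hubert-Arabie formula, not our Combinatorial.java as checked in):

```java
import java.util.*;

// Adjusted Rand Index for two clusterings given as label arrays.
// ARI = (Index - ExpectedIndex) / (MaxIndex - ExpectedIndex),
// where Index sums C(n_ij, 2) over the contingency table.
public class AdjustedRandIndex {
    static long comb2(long n) { return n * (n - 1) / 2; }   // n choose 2

    public static double ari(int[] a, int[] b) {
        int n = a.length;
        Map<String, Long> cell = new HashMap<>();
        Map<Integer, Long> rowSum = new HashMap<>(), colSum = new HashMap<>();
        for (int i = 0; i < n; i++) {
            cell.merge(a[i] + "," + b[i], 1L, Long::sum);   // contingency cells
            rowSum.merge(a[i], 1L, Long::sum);
            colSum.merge(b[i], 1L, Long::sum);
        }
        double index = 0, rows = 0, cols = 0;
        for (long v : cell.values()) index += comb2(v);
        for (long v : rowSum.values()) rows += comb2(v);
        for (long v : colSum.values()) cols += comb2(v);
        double expected = rows * cols / comb2(n);
        double max = (rows + cols) / 2.0;
        return (index - expected) / (max - expected);
    }
}
```

Identical partitions (even with relabeled cluster ids) score 1.0, and agreement no better than chance scores around 0, which is what makes the index useful for comparing clustering algorithms.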
3. CVS is working properly

4. Discussion and help
- Thanks to Blazej, who spent a lot of time joining Julia's and my code in the Eclipse workspace and helped us create an environment that allows us to share all our code.
- Qianjun, one of Prof. desJardins's PhD students, supported me with the paper about ARI. We discussed ways of using the ARI algorithm.

5. Problems solved and unsolved
- Julia couldn't use some of my code because she used the old version of prefuse and I used the new version; the node coloring was affected especially. In my opinion, without Blazej it would have taken us much longer to solve this mismatch problem. Blazej is a professional programmer, so it was much easier for him to notice our problem and then solve it in a flash. Right now we use the new version of prefuse in CVS, and Julia can use all my code.

- These last couple of days Julia was sick, so I cannot join my ARI code with the code for Euclidean distance: I don't know the structure that she used to collect information about the clusters, and I couldn't find the clustering result from the k-means algorithm.


<- Previous  Report 9 - August 5

I. Datasets and applications:
1. During this week I revised the CRA-W DMP dataset "CRA_W_DMP_dataset_N.xml", which was created by hand using data from the official site "Mentoring Undergraduate Women in Computing Research: CRA-W Distributed Mentor Project (DMP) Summer 2005 Awards" at http://www.cra.org/Activities/craw/dmp/awards/2005/2005.php. I added new information to the nodes in the dataset, such as the rank of the universities from the Academic Ranking of World Universities 2004 / Top 500 World Universities at http://ed.sjtu.edu.cn/rank/2004/top500list.htm, and information about students and mentors. This information will be useful for building clusters. I added images of the mentors, collected from mentors' and students' websites, to the nodes that contain information about mentors.
2. I revised the detailed descriptions of the datasets:
- Oscar nomination data set;
- Data set generator.
3. I added a description of the dataset "CRA_W_DMP_dataset_N.xml" to the description of the datasets.

II. Programs and packages
1. I finished the painting of the nodes depending on their information, with nodes belonging to different categories: students, mentors, universities, etc. I consider the images and coloring of the nodes the best visualization tool for verifying the clusters that will be created.
2. I created the package for the Adjusted Rand Index (ARI) algorithm, where I added a lot of simplifying code (right now it's very easy to use with different clustering algorithms), and used the result of the algorithm to analyze the distribution of nodes in the clusters (not documented yet).
3. Right now I'm developing a program that produces a data set generation script with meaningful clustering properties. The limited time doesn't allow me to create, and most importantly to verify, a clustering distribution according to a multivariate Gaussian distribution in Java. I discussed this problem (limited time) with Qianjun, and she proposed using Matlab to create the multivariate Gaussian distribution; Matlab has built-in support for this. I'm in the process of creating the Matlab program in the easiest-to-use way. The Java program will take the node information produced by Matlab, create the edge connections, and write all the information into the output files.

III. Next week
1. Finishing the Matlab program in an easy-to-use style, plus a description of the program (Monday, Tuesday).
2. Tuesday - final run of all applications.
3. Updating the description of the datasets with a description of the array of datasets with clusters created according to a multivariate Gaussian distribution.
4. M-Th - meeting with James to run the program and explain some "tricky" aspects of the programming.
5. M-F - final description of all packages and programs.
6. W - ice cream.
7. Thursday - final report.
8. Friday - preparing lunch for everybody and then saying good-bye.
9. Friday night - flight home.