Geo-tagged tweets analysis

History

I started collecting geo-tagged tweets written in Japanese at the end of 2009. As of March 2012, more than 5 million geo-tagged Japanese tweets are stored in my database.

Many researchers have focused on the huge volume of Twitter data, but the general API provides only a limited set of tweets. Specifically, it gives access to randomly sampled tweets from the past week. In other words, there is no way to search tweets posted more than one week ago.

In early 2009, I proposed and implemented “GeoIME”, a mechanism that dynamically updates the input dictionary according to the user’s location. I thought it could improve the efficiency of text input on mobile terminals, but this was hard to prove. The number of possible locations is effectively infinite, so the number of automatically generated dictionaries is effectively infinite as well.

Just as I was wondering how to prove it, Twitter started to attach a “geo-tag” to the tweet itself. This was amazing, and within two weeks I had written a script to collect geo-tagged tweets.

On this page, I introduce the research topics I have worked on since 2009.

GeoIME

GeoIME (link to another explanation in Japanese) is a text input application that works on Android, Mac and Windows (with ATOK), and the web. I developed it in 2010 as one instance of a context-aware mobile text input method [1], whose concept is shown in Fig. 1.

Figure 1: Concept of Context-aware mobile text input method

In this research, I focused mainly on how to build a dictionary dynamically and automatically at any location, not on the IME itself. Our system uses several public Web APIs to generate the dictionary, as shown in Fig. 2. To speed up the response, we tried several system architectures and confirmed that one of them achieves a sufficiently short response time.

Figure 2: System architecture of Context-aware mobile text input method

GeoIME has won several awards from academia and industry.

  • Yahoo! Geo Hack Award in Mashup Award 6
  • Oki Electronic Award in Mashup Award 6
  • Encouragement Prize in Fukuoka Ruby Award 3
  • IPSJ Yamashita SIG Research Award
  • IPSJ DICOMO2010 Best Paper Award
  • IPSJ DICOMO2010 Outstanding Presentation Award
  • IPSJ-MBL53 Best Paper Award

Tweet Collection System in 2010

When I started collecting geo-tagged tweets, only a few tweets included geographical information. I developed an efficient tweet collection system based on the following two assumptions.

Assumption 1: A user whose Twitter client has no geo-tagging function cannot add geo-tags to their tweets.
Assumption 2: A user who does not want to reveal their location to others never adds a geo-tag to their posts.

This means that a user who has added a geo-tag to a post at least once has probably done so in the past and will probably do so in the future. Therefore, I combine the Streaming API and the Search API, as shown in Fig. 3. The Streaming API is used to find new users who add geo-tags. The Search API is then used to retrieve each such user’s past tweets and to monitor their future tweets. With this combination, my system finds geo-tagged tweets efficiently.

Figure 3: Geo-tagged twitter collecting system with Streaming API and Search API
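
The combination of the two APIs can be sketched roughly as follows. This is only a minimal illustration, not the actual collection script: `stream` and `search_user` are hypothetical stand-ins for the Streaming API and the Search API, and the tweet fields `id`, `user`, and `geo` are assumed names, not Twitter’s real schema.

```python
def collect(stream, search_user, store):
    """Collect geo-tagged tweets by combining a stream with per-user search.

    stream      -- iterable of tweet dicts (stand-in for the Streaming API)
    search_user -- callable(user) -> iterable of that user's tweets
                   (stand-in for the Search API)
    store       -- dict mapping tweet id -> tweet, filled in place
    Returns the set of users found to add geo-tags.
    """
    geo_users = set()
    for tweet in stream:
        if tweet.get("geo") is None:
            continue  # only geo-tagged tweets reveal a geo-tagging user
        store[tweet["id"]] = tweet
        user = tweet["user"]
        if user not in geo_users:
            # New geo-tagging user: salvage their past geo-tagged tweets.
            geo_users.add(user)
            for past in search_user(user):
                if past.get("geo") is not None:
                    store[past["id"]] = past
    return geo_users


def refresh(geo_users, search_user, store):
    """Poll known geo-tagging users again to pick up their newer tweets."""
    for user in geo_users:
        for tweet in search_user(user):
            if tweet.get("geo") is not None:
                store[tweet["id"]] = tweet
```

Calling `refresh` periodically corresponds to observing the future tweets of users who were found once through the stream.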

Visualization and Discovery of location dependency and time dependency in 2010

In [2], I examined how words depend on time and location. Fig. 4 shows the geographical distribution of tweets that include “noodle”. Since the plots are spread all over Japan, “noodle” can be regarded as a location-independent word. In Fig. 5, on the other hand, the circle plots and plus plots show the geographical distribution of tweets that include “Shinjuku” and “Shibuya”, respectively, and each set of plots is concentrated in a particular area. The centers of the concentrated areas are the JR (Japan Railways) Shinjuku and Shibuya stations. From this result, “Shinjuku” and “Shibuya” can be regarded as location-dependent words.

Figure 4: Geographical distribution of tweets including “noodle”
Figure 5: Tweets including “Shibuya” and “Shinjuku”

Fig. 6 shows the distribution of tweets that include “Ryoma-den”, per day and per hour. “Ryoma-den” is a popular TV program currently broadcast by the Japan Broadcasting Corporation (NHK) at 20:00 every Sunday. As shown in the left figure, the number of tweets on Sundays is clearly larger than on other days of the week. Likewise, the number of tweets at 20:00 is remarkably larger than in other time slots. The word “Ryoma-den” is therefore highly time-dependent.

Figure 6: Time distribution of tweets including “Ryoma-den”
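
The per-day and per-hour counts behind such a figure can be sketched as below. This is only an illustration: it assumes the timestamps of tweets containing the keyword are available as `datetime` objects in local (Japanese) time, and the sample values are made up, not data from the study.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")


def time_histograms(timestamps):
    """Count tweets per day of the week and per hour of the day."""
    per_weekday = Counter(WEEKDAYS[ts.weekday()] for ts in timestamps)
    per_hour = Counter(ts.hour for ts in timestamps)
    return per_weekday, per_hour


# Made-up timestamps, for illustration only.
sample = [
    datetime(2010, 11, 7, 20, 5),    # a Sunday evening
    datetime(2010, 11, 14, 20, 30),  # the following Sunday
    datetime(2010, 11, 10, 12, 0),   # a Wednesday at noon
]
per_weekday, per_hour = time_histograms(sample)
```

A keyword whose counts pile up in one weekday or one hour, as in this toy sample, is a candidate time-dependent word.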

Quantification of location-dependency in 2011

Some issues arose after the visualizations above. One is how to retrieve such dependencies, for example the relationship between a keyword and a location. Another is how to define the dependency quantitatively, which is also needed to solve the first issue.

In [3], we proposed two spatial analysis methods that find location-dependent words by detecting whether a keyword depends on a certain location. Time dependency is not dealt with in that paper. The first method is a fast and simple one based on the standard deviation of latitude and longitude. We call it the “SD method”. In the SD method, we extract the tweets that include a specified keyword and calculate the standard deviations of their latitudes and longitudes. If both standard deviations are smaller than a threshold, we decide that the keyword depends on a certain place, which is taken to be the average position of the extracted tweets. Since the standard deviation measures variability, it becomes large if the keyword is tweeted in various places.

The SD method has the advantage of speed, because only two calculations, the SD of latitude and the SD of longitude, are needed to detect the dependency of each keyword. However, it cannot detect the location dependency of keywords that are often used in multiple places. For example, the names of department store chains and electronics store chains appear in tweets from all over Japan. Each may still depend on several small areas, but the SD of such keywords becomes relatively large.
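
A minimal sketch of the SD method is shown below. It assumes the population standard deviation and a single threshold in degrees for both axes; the function name, the threshold value, and the sample coordinates are illustrative assumptions, not values from [3].

```python
from statistics import mean, pstdev


def sd_method(points, threshold_deg):
    """Decide whether a keyword is location-dependent.

    points        -- non-empty list of (latitude, longitude) pairs of the
                     tweets that include the keyword
    threshold_deg -- assumed threshold, in degrees, for both axes
    Returns (dependent, center, (sd_lat, sd_lon)); center is the average
    position when the keyword is judged location-dependent, else None.
    """
    if not points:
        raise ValueError("at least one tweet position is required")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    sd_lat, sd_lon = pstdev(lats), pstdev(lons)
    dependent = sd_lat < threshold_deg and sd_lon < threshold_deg
    center = (mean(lats), mean(lons)) if dependent else None
    return dependent, center, (sd_lat, sd_lon)


# Made-up positions: a tight cluster versus points spread across Japan.
clustered = [(35.690, 139.700), (35.691, 139.701), (35.689, 139.699)]
spread = [(43.06, 141.35), (35.69, 139.70), (33.59, 130.40)]
print(sd_method(clustered, threshold_deg=0.01)[0])  # True
print(sd_method(spread, threshold_deg=0.01)[0])     # False
```

A keyword used around several separate stores would behave like the `spread` sample here, which is the limitation described above.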

Therefore, we additionally propose the TTBFS (Three-Tier Breadth First Search) method, which finds multiple 1 km square areas where the keyword is frequently used. TTBFS uses a three-tier architecture to speed up retrieval. On the 1st tier, the whole region is segmented into 100 km square areas (we call each one a “100sq”). On the 2nd tier, it is sliced into 10 km square areas (“10sq”). On the 3rd tier, it is divided into 1 km square areas (“1sq”). First, we roughly search all the 100sqs for areas where the keyword is often used. If the rate of tweets that include the keyword in an area (we call it the “tweet appearance ratio”) is larger than the average, we set that area as a target of the 2nd-tier search. Next, we check all the 10sqs inside the selected 100sqs in the same way. Finally, we check the 1sqs inside the selected 10sqs. If any 1sqs are extracted for the keyword, we decide that the keyword is location-dependent and that its locations are those 1sqs.
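
The three-tier narrowing can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from [3]: positions are already projected to planar (x, y) coordinates in kilometers, the keyword tweets are a subset of all tweets, and “the average” is read as the overall ratio of keyword tweets to all tweets.

```python
from collections import Counter

TIERS_KM = (100, 10, 1)  # 100sq, 10sq, 1sq


def cell(point, size_km):
    """Index of the square cell of the given size that contains the point."""
    x, y = point
    return (int(x // size_km), int(y // size_km))


def ttbfs(all_points, keyword_points, tiers=TIERS_KM):
    """Return the 1 km cells where the keyword is used unusually often."""
    if not all_points:
        return []
    # Assumed meaning of "the average": overall tweet appearance ratio.
    average = len(keyword_points) / len(all_points)
    cand_all, cand_kw = list(all_points), list(keyword_points)
    keep = set()
    for size in tiers:
        total = Counter(cell(p, size) for p in cand_all)
        hits = Counter(cell(p, size) for p in cand_kw)
        # Keep cells whose tweet appearance ratio exceeds the average.
        keep = {c for c, n in hits.items() if n / total[c] > average}
        # Only tweets inside the kept cells are searched on the next tier.
        cand_all = [p for p in cand_all if cell(p, size) in keep]
        cand_kw = [p for p in cand_kw if cell(p, size) in keep]
    return sorted(keep)


# Made-up data: a sparse background plus three keyword tweets in one 1sq.
background = [(x * 50 + 5.5, y * 50 + 5.5) for x in range(4) for y in range(4)]
keyword = [(12.3, 12.7), (12.6, 12.2), (12.9, 12.5)]
print(ttbfs(background + keyword, keyword))  # [(12, 12)]
```

Because each tier discards most of the region before the finer grid is examined, only a small number of 1sqs ever need to be counted.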

Figure 7: TTBFS (Three-Tier Breadth First Search)

Finally, I analyzed 80 keywords by applying the two proposed methods to half a million collected Japanese tweets, and determined the location dependency of each keyword. The results show that the SD method can quantify the location dependency of keywords, and that TTBFS can detect the dependency of keywords with a large standard deviation.

Fig. 8 and Fig. 9 show example results. Both show the locations associated with the search terms. We can find the locations of facilities, or of an electronics store chain, using tweet information alone. In these figures, yellow regions show the extracted 10sqs and red regions show the extracted 1sqs.

Figure 8: Result of the name of unique facilities

Figure 9: Geographical distribution of 1sqs where the word “bic camera” is frequently used

Conclusion

I have tackled these topics since 2009. I am still collecting geo-tagged tweets, and I have already built a new tweet collection system. In the future, I will address the following topics.

  1. How to find and define time dependency?
  2. How to pick up location-dependent events?
  3. How to eliminate inefficient information?
  4. Credibility of tagged location

If you have questions or suggestions, please contact me!

References

[1] Shinji Suematsu, Yutaka Arakawa, Shigeaki Tagashira, and Akira Fukuda, “Network-based Context-Aware Input Method Editor,” The Sixth International Conference on Networking and Services (ICNS 2010), March 7-13, pp.1–6, 2010.
[2] Yutaka Arakawa, Shigeaki Tagashira, and Akira Fukuda, “Relationship Analysis between User’s Contexts and Real Input Words through Twitter,” IEEE Globecom 2010 Workshop on Ubiquitous Computing and Networks, pp.1813–1817, December 10, 2010.
[3] Yutaka Arakawa, Shigeaki Tagashira, and Akira Fukuda, “Spatial Statistics with Three-tier Breadth First Search for Analyzing Social Geocontents,” 15th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES2011), Vol. 4, pp. 255–263, September 12, 2011.