<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Infrastructure-less Occupancy Detection and Semantic Localization in Smart Environments</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2015</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10073266</idno>
					<idno type="doi">10.4108/eai.22-7-2015.2260062</idno>
					<title level='j'>MOBIQUITOUS'15: Proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Md Abdullah Khan</author><author>H M Hossain</author><author>Nirmalya Roy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Accurate estimation of localized occupancy related information in real time enables a broad range of intelligent smart environment applications. A large number of studies using heterogeneous sensor arrays reflect the myriad requirements of various emerging pervasive, ubiquitous and participatory sensing applications. In this paper, we introduce a zero-configuration and infrastructure-less smartphone-based, location-specific occupancy estimation model. We opportunistically exploit the smartphone’s acoustic sensors in a conversing environment and its motion sensors in the absence of any conversational data. We demonstrate a novel speaker estimation algorithm based on unsupervised clustering of overlapped and non-overlapped conversational data, and a change point detection algorithm for the locomotive motion of the users, to infer the occupancy. We augment our occupancy detection model with a fingerprinting-based methodology using the smartphone’s magnetometer sensor to accurately assimilate the location information of any gathering. We postulate a novel crowdsourcing-based approach to annotate the semantic location of the occupancy. We evaluate our algorithms in different contexts (conversational, silent, and mixed) in the presence of 10 domestic users. Our experimental results on real-life data traces in natural settings show that, using this hybrid approach, we can achieve an average error count distance of approximately 0.76 for occupancy detection.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Detecting and estimating the occupancy of commercial (university, office, mall, cineplex, restaurant, etc.) and residential (apartment, home, etc.) buildings at room/zone-level granularity in real time can provide meaningful insights to many smart environment applications, such as green buildings, social gatherings, and event management. Smartphone-based participatory and citizen sensing applications hold the promise of enabling such applications by utilizing the various context sensors on board. Different sensors can be exploited individually or in tandem to build a variety of novel applications that satisfy the myriad requirements of differing smart environments. For example, a potential benefit of microphone-based applications is the assessment of social interaction and active engagement among a group of people by leveraging their conversational content <ref type="bibr">[1]</ref>, speaker identification, and characterization of social settings <ref type="bibr">[2]</ref>[3] <ref type="bibr">[4]</ref>. To enumerate the number of people in a conversational episode, such as during a social gathering, an interactive lecture session, or in a restaurant or shopping mall environment, various speaker counting paradigms have been explored <ref type="bibr">[5]</ref>[6] <ref type="bibr">[7]</ref>. Most recent studies that rely on conversational data features to extract high-level occupancy information assume that all of the users take turns speaking at some point. While this specific scenario is feasible, it is not ideal. To tackle this non-ideal situation, researchers have proposed using arrays of microphone sensors, video cameras, or motion sensors for identifying microscopic occupancy information in real time <ref type="bibr">[8][9]</ref>, all of which are obtrusive in nature. 
We aim to move one step further by considering a more natural environment where people may spontaneously participate in or abstain from any conversation. We propose to augment an acoustic sensing-based audio inference model with a smartphone-based locomotive sensing model, used in the absence of any conversational episode, to precisely capture the characteristics of a natural environment and accurately estimate the occupancy count. To further pinpoint the occupancy, we integrate a location sensing model based on the smartphone's magnetometer sensor. In pursuit of these goals, we design a model which opportunistically exploits both the audio and motion data, from the smartphone's microphone and accelerometer sensors respectively, to infer the number of people present in a gathering, and their semantic location information as supplemented by the magnetometer sensor on the smartphone. We also introduce a crowdsourcing model to reduce the effort of obtaining semantic location information at scale.</p><p>In particular, we propose a zero-hassle, ambient, and infrastructure-less mobile sensing (i.e., smartphone-based) approach that exploits only the smartphone's sensors to provide significantly greater visibility into real-time occupancy and its semantic location. The key challenge in this case is to effectively estimate the number of people in both crowded and non-crowded environments, with or without conversational data. Such a hybrid sensing approach could potentially furnish more fine-grained occupancy profiling to better serve many participatory sensing applications while saving smartphones' battery power by advocating a distributed sensing strategy. The main contributions of the paper are summarized below:</p><p>&#8226; We propose an acoustic sensing-based, linear-time, adaptive people counting algorithm based on real-life conversational data, which promotes a unified strategy of considering both overlapped and non-overlapped conversational data in a natural environment. 
We propose to opportunistically select a minimal number of microphone sensors, which can substantially reduce the energy consumption of smartphones. Unlike existing work <ref type="bibr">[6]</ref>, our proposed people counting algorithm can dynamically select the length of the audio segment.</p><p>&#8226; Although the acoustic sensing-based approach holds great promise for inferring the number of occupants, it fails in the absence of any conversational data. Therefore, we propose to augment our acoustic sensing-based people counting algorithm with a motion sensing-based counting strategy, so that the system works even when only one of the two data sources, acoustic or locomotive, is available.</p><p>&#8226; We design a magnetometer sensor-based localization technique at zone/room-level granularity to infer the location of a conversing group. We propose a novel crowdsourcing model to map the magnetic signatures of different locations and collect a large amount of annotated location information to tag the occupancy with its semantic location.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">RELATED WORK</head><p>We review the most relevant smartphone-based literature on the occupancy inference problem in the context of conversational sensing, localization, and speaker estimation.</p><p>Smartphone Speaker Sensing: A large body of prior work has used smartphones' microphones to opportunistically analyze audio for context characterization. For example, SpeakerSense <ref type="bibr">[4]</ref> performs speaker identification and SoundSense <ref type="bibr">[10]</ref> classifies sounds from macro to micro contexts. What these systems often have in common is the use of supervised speaker learning techniques. In contrast, our model's occupancy counting process is entirely unsupervised. Our proposed model anonymously estimates the number of people from the smartphones' acoustic and locomotive sensing, where we employ unsupervised learning techniques to cluster different forms of acoustic signatures. For example, <ref type="bibr">[11]</ref> built a model from the mean and covariance matrices of the Linear Predictive Cepstral Coefficients (LPCC) of voice segments in conversations and used the Mahalanobis distance to determine whether two models belong to the same or different speakers. <ref type="bibr">[12]</ref> performed speaker clustering using the distance between feature vectors extracted from different speakers and finally applied a modified C-means algorithm with the distance metric data. However, their experiments on occupant estimation used telephonic conversational data, where multiple participants were present and voices frequently overlapped and were intertwined with the noisy environment. Our proposed model performs speaker counting without any predefined environmental setup and collects data from natural conversation. 
Our proposed speaker counting algorithm is close to <ref type="bibr">[13]</ref>, <ref type="bibr">[6]</ref>, where smartphone-based speaker counting was proposed in a controlled scenario in which all the participants spoke actively. <ref type="bibr">[6]</ref> used fixed-length audio segments (3 sec), where each segment corresponds to an individual, whereas we perform this audio segmentation dynamically to increase the accuracy of occupancy inference. <ref type="bibr">[6]</ref> also classified a few segments as undetermined; our system never discards segments as undetermined, which is made possible by dynamic segmentation. Therefore, our proposed audio-based occupancy inference model tackles a richer problem, in which no speaker is discarded to ease the computational challenges. Crowd++ <ref type="bibr">[6]</ref> proposed combining pitch with MFCC to compute the number of people, with an average error distance of 1.5 speakers. In contrast, our model improves the average error distance by roughly a factor of two (0.76 speakers).</p><p>Indoor Localization: UnLoc <ref type="bibr">[14]</ref> proposed an unsupervised indoor localization approach that exploits identifiable environmental artifacts and specific signatures on single or multiple sensing dimensions using readings from the smartphone's different sensors (mainly the accelerometer, compass, gyroscope, and WiFi APs). <ref type="bibr">[15]</ref> measured the geomagnetic field, which is spatially varying but temporally stable, using an array of e-compasses to infer location. However, these approaches used several sensors or sensor arrays for location detection, whereas our model uses only the smartphone's magnetometer sensor to infer the semantic location of a gathering at zone/room-level granularity. <ref type="bibr">[16]</ref> used magnetic fingerprints with a dynamic time-warping algorithm to predict location information with 92% accuracy. 
Our model uses a standard Random Forest algorithm and achieves 98% accuracy in detecting the high-level semantic location of any gathering. The IndoorAtlas location technology <ref type="bibr">[17]</ref> utilizes anomalies of ambient magnetic fields for indoor positioning. This platform provides functionality for participatory sensing, where the crowd can contribute by war driving the magnetic signatures of an unexplored location.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">OVERALL SYSTEM ARCHITECTURE</head><p>We envision a minimally invasive, cost-free, and robust mobile system that counts the number of people present at any time in any environment and reveals their semantic location. Our model provides these capabilities by employing smartphones' magnetometer, microphone and accelerometer sensors. Our system, as shown in Fig. <ref type="figure">1</ref>, comprises two subsystems, one deployed on the smartphone and the other on the server. Using acoustic sensing alone, it is not always possible to predict the correct number of occupants present in a specific location, as some people get involved in a conversation while others remain silent. For example, in a classroom scenario, while the professor lectures, some of the students participate but the majority of the students remain silent. Sensed data are stored in a data sink (sink) for posterior analysis in the mobile part of our proposed architecture con-sort the candidates with respect to this value assuming that in an ideal conversational episode the participants remain in close proximity. We calculate E(Ci) based on Eqn. 5.</p><p>where k = 1, l = 1, 1 &#8804; a &#8804; n, and b = 1</p><p>After calculating the error measurements for each candidate, we sort CF and choose the first 10 candidates from CF. We plot the magnetic signature patterns of these candidates together with the test pattern. The crowd then has to choose the signature pattern in which they find the test pattern. In our experiments there were some cases where we observed an empty candidate set. In these cases, we selected the candidate set of the last iteration that was not empty. We also asked the crowd to choose the earliest signature pattern if they found a match with multiple candidates.</p></div>
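<div xmlns="http://www.tei-c.org/ns/1.0"><p>The candidate selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the error measurement E(Ci) of each candidate has already been computed (Eqn. 5 is not reproduced here), and the names select_candidates and candidates_for_crowd, as well as the dictionary layout, are hypothetical.</p>
<p><![CDATA[
```python
from typing import Dict, List


def select_candidates(errors: Dict[str, float], top_k: int = 10) -> List[str]:
    """Return up to top_k candidate ids, ordered by increasing error.

    `errors` maps a (hypothetical) candidate id to its precomputed error
    measurement; how that error is computed is outside this sketch.
    """
    ranked = sorted(errors, key=lambda cid: errors[cid])
    return ranked[:top_k]


def candidates_for_crowd(iterations: List[Dict[str, float]],
                         top_k: int = 10) -> List[str]:
    """Pick the candidates shown to the crowd.

    `iterations` holds the candidate set of each iteration, oldest first.
    If the latest set is empty, fall back to the most recent non-empty one.
    """
    for errors in reversed(iterations):
        if errors:
            return select_candidates(errors, top_k)
    return []
```
]]></p><p>For instance, candidates_for_crowd([first, {}]) returns the ranking of first when the second iteration produced no candidates.</p></div>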
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">SYSTEM IMPLEMENTATION AND EVALUATION RESULTS</head><p>We now discuss the detailed implementation and evaluation of our model framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Tools and Resources</head><p>We used a Google Nexus-5 with a built-in microphone and three-axis accelerometer sensor for our experiments. Our entire system comprises two parts: i) sensing, and ii) classification and clustering. The first was implemented on the Nexus-5 and the latter on the server. The application software was written in Java and uses the Android application programming interface (API) to sense microphone and accelerometer signals. The classification and clustering algorithms and our occupancy counting algorithm were implemented on the server side in Python.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Data Collection</head><p>Magnetic sensor signals are sensed through our Android application and stored temporarily on mobile storage. We first collected magnetic data for the training set, and subsequently for the testing set. We divided the room space into small regions, called cells, each with an area of 0.5 &#215; 0.5 m&#178;. Each room thus forms a grid of cells. We collected data from each cell for 5 minutes in both the clockwise and counter-clockwise directions to form the training set. We also maintained a fixed height (approximately 4 feet from the floor) when collecting our ferromagnetic fingerprint, because the fingerprint also depends on the height. A partial map of the 3rd floor is shown in Fig. <ref type="figure">10</ref>. It shows a sample data collection path for room number 305, where the green line shows how the grid is formed and the red line shows the data collection path in both directions along the grid. We used a sampling rate of 5 Hz for the magnetometer sensor data. We implemented the acoustic sensing and collected conversational data from different places at different times in natural settings. Conversational data were collected and properly anonymized during spontaneous lab conversations among the students (without making the occupants aware of it), lab meetings, and general discussions in the lobby/corridor in the presence of a variety of surrounding noise levels. The conversational data collection involved groups of 1-10 persons (5 females and 5 males) in the age group of 18-50 years. The acoustic data were collected in mono at a sampling rate of 16 kHz with 16-bit pulse-code modulation (PCM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Privacy</head><p>One of the major concerns of smartphone-based acoustic signal processing is privacy. This concern becomes more serious when the smartphone records conversation data. Our counting algorithm determines the number of speakers in this environment in an anonymized manner. We used a text file as the cover in which our recorded audio is embedded. A secret key, known to both the sender and the recipient, is used for the embedding and extraction process. A steganographic function takes the cover file as an argument and embeds the audio file and key to produce a stego file as output, which is sent to our server. A reverse steganographic function on our server side takes the stego file and key as parameters and produces the audio file as output. There are different steganographic methods (e.g., LSB coding, parity coding, phase coding), but we used the simplest one, the least significant bit (LSB) algorithm, which replaces the least significant bits of some bytes in the cover file to hide a sequence of bytes containing the hidden data. To generate the stego file, the algorithm first converts each character of the cover file into a bit stream, then converts the audio file into a bit stream, and finally replaces the LSBs of the cover file with the bits of the audio, which is the secret information. We also ensured that the size of the file was not changed by this encoding and that the method was suitable for any audio file format.</p></div>
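<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of LSB embedding and extraction, under stated assumptions, is given below. The text above does not specify how the secret key enters the process, so here the key is assumed to seed the choice of cover bytes to modify; the function names embed and extract are hypothetical, and the recipient is assumed to know the payload length. This illustrates the general LSB technique rather than the exact implementation used in our system.</p>
<p><![CDATA[
```python
import random
from typing import List


def _positions(key: int, cover_len: int, n_bits: int) -> List[int]:
    # Assumption: the shared key seeds which cover bytes carry payload bits.
    # Sender and recipient derive the same positions from the same key.
    return random.Random(key).sample(range(cover_len), n_bits)


def embed(cover: bytes, payload: bytes, key: int) -> bytes:
    """Hide `payload` in the least significant bits of selected cover bytes."""
    n_bits = 8 * len(payload)
    if n_bits > len(cover):
        raise ValueError("cover is too small for this payload")
    stego = bytearray(cover)
    for i, pos in enumerate(_positions(key, len(cover), n_bits)):
        bit = (payload[i // 8] >> (7 - i % 8)) & 1   # i-th payload bit, MSB first
        stego[pos] = (stego[pos] & 0xFE) | bit        # overwrite only the LSB
    return bytes(stego)


def extract(stego: bytes, payload_len: int, key: int) -> bytes:
    """Recover `payload_len` bytes from a stego file using the same key."""
    out = bytearray(payload_len)
    for i, pos in enumerate(_positions(key, len(stego), 8 * payload_len)):
        out[i // 8] |= (stego[pos] & 1) << (7 - i % 8)
    return bytes(out)
```
]]></p><p>Because only the lowest bit of each selected byte is overwritten, the stego output has the same size as the cover, which matches the size-preservation property noted above. Note that LSB embedding hides data but does not encrypt it.</p></div>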
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Magnetic, Acoustic and Locomotive Feature Extraction</head><p>We discuss the different features relevant to our acoustic sensing, locomotive sensing, and localization techniques in this section.</p><p>Magnetic Features: For location detection we used only the magnetometer sensor. The smartphone's magnetic sensor provides values along three axes, x, y and z. From these values we calculated the magnitude using <formula notation="TeX">m = \sqrt{x^{2} + y^{2} + z^{2}}</formula>. We considered only the resultant magnitude to mitigate variations in the per-axis readings caused by different orientations of the smartphone. We also calculated the mean, variance, and standard deviation of the readings and combined those features to generate the feature vectors.</p><p>Acoustic Features: We generated two basic features which are used in speaker identification: MFCC and pitch. Each feature is described in detail below. i) MFCC is one of the most significant features used for acoustic processing. We followed these steps to compute it. 1. Take the Fourier transform of (a windowed excerpt of) a signal, 2. Map the powers of the spectrum obtained above onto the Mel scale using triangular overlapping windows, 3. Take the logs of the powers at each of the Mel frequencies, 4. Finally, take the discrete cosine transform of the list of Mel log powers. We excluded the first MFCC coefficient and then chose 20 coefficients as feature vectors. ii) Pitch is defined as the lowest frequency of a periodic waveform. It is a discriminative feature between male and female voices. Human voice pitch falls within the range of 50 Hz to 450 Hz <ref type="bibr">[23]</ref>. We calculated the pitch of different segments using the YIN <ref type="bibr">[22]</ref> algorithm. 
We used a 32 ms Hamming window with 50% overlap for computing the pitch and MFCC features.</p><p>Locomotive Features: We considered the magnitude of the accelerometer data as our locomotive feature in order to toolkit <ref type="bibr">[27]</ref>. We implemented our mapping algorithm on the server side and then used the function active interactor of VW to interact with the users. We showed 10 magnetic signature patterns and 1 test pattern to a user and asked him/her to choose the magnetic signature pattern in which he/she finds the test pattern. 10 participants took part in the crowdsourcing task, and in Fig. <ref type="figure">20</ref> we show the overall accuracy of each participant when given 15 pattern matching tasks. The average accuracy of obtaining the correct annotation for these 15 patterns is &#8776; 81%, which is adequately high. Our results indicate that the probability of getting noisy labels is low and that the crowd-annotated data can be used as input to the classifier.</p></div>
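<div xmlns="http://www.tei-c.org/ns/1.0"><p>The magnetic feature computation of Section 5.4 (orientation-independent magnitude, followed by mean, variance, and standard deviation) can be sketched as follows. The window of samples over which the statistics are computed, the use of population rather than sample statistics, and the function names are assumptions made for illustration only.</p>
<p><![CDATA[
```python
import math
from statistics import mean, pstdev, pvariance
from typing import List, Sequence, Tuple

Reading = Tuple[float, float, float]  # (x, y, z) magnetometer sample


def magnitude(x: float, y: float, z: float) -> float:
    """Resultant magnitude m = sqrt(x^2 + y^2 + z^2) of one reading."""
    return math.sqrt(x * x + y * y + z * z)


def magnetic_features(window: Sequence[Reading]) -> List[float]:
    """Feature vector [mean, variance, standard deviation] of the magnitudes.

    `window` is a hypothetical batch of consecutive readings (for example,
    a few seconds of data at the 5 Hz sampling rate); population statistics
    are an assumption of this sketch.
    """
    mags = [magnitude(x, y, z) for x, y, z in window]
    return [mean(mags), pvariance(mags), pstdev(mags)]
```
]]></p><p>Feature vectors of this form, one per window, are the kind of input a standard classifier such as Random Forest can be trained on.</p></div>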
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">DISCUSSION AND FUTURE WORK</head><p>In the current version of our work, we have assumed that people keep their smartphones in a pocket or in the hand, which might not hold in some cases. In the future, we plan to make our architecture more robust and independent of the smartphone's position. The performance of our counting algorithm is not affected by TV or radio sounds, as TV and radio follow different modulation techniques, which makes it easier for us to remove those external noises from the resultant audio signal. We have used source separation where significant overlap between human conversation and TV occurs. In the current implementation, the location mapping process is independent of the classification process. In the future, we plan to develop and integrate a combined mapping and classification model. We also plan to investigate fine-grained floor-level localization using smartphone barometric sensing. Finally, we plan to investigate a more advanced opportunistic sensing model that considers microphone, accelerometer and magnetometer sensor participation based not only on a server-based architecture but also on an inter-smartphone, distributed, collaborative sensing approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">CONCLUSIONS</head><p>In this paper, we presented an innovative system to infer the number of people present in a specific semantic location, which opportunistically exploits the accelerometer and microphone sensors of the smartphone for people counting. We proposed an acoustic sensing-based unsupervised clustering algorithm that addresses the underlying challenges arising from naturalistic overlapped and sequential conversation to infer the occupancy in an environment. We proposed a change point detection-based locomotive sensing model to infer the number of people in the absence of any conversational episode. We implemented an opportunistic, context-aware, client-server architecture that leverages smartphones' microphone, accelerometer and magnetometer sensors and combines our acoustic sensing with locomotive and semantic location sensing models to better predict location-augmented occupancy information. We have also demonstrated a novel crowdsourcing model for reducing the effort of collecting location information at zone/room level at large scale. Our experimental results are promising in a variety of natural settings, with an average error count distance of 0.76 in the presence of 10 users. We believe this investigation helps to open up many new research directions in this opportunistic multi-modal sensing domain.</p></div>		</body>
		</text>
</TEI>
